Measuring Machine Intelligence Through Visual Question Answering

Measuring Machine Intelligence Through Visual Question Answering

C. Lawrence Zitnick
Facebook AI Research
   Aishwarya Agrawal
Virginia Tech
   Stanislaw Antol
Virginia Tech
   Margaret Mitchell
Microsoft Research
   Dhruv Batra
Virginia Tech
   Devi Parikh
Virginia Tech

As machines have become more intelligent, there has been a renewed interest in methods for measuring their intelligence. A common approach is to propose tasks for which a human excels, but one which machines find difficult. However, an ideal task should also be easy to evaluate and not be easily gameable. We begin with a case study exploring the recently popular task of image captioning and its limitations as a task for measuring machine intelligence. An alternative and more promising task is Visual Question Answering that tests a machine’s ability to reason about language and vision. We describe a dataset unprecedented in size created for the task that contains over 760,000 human generated questions about images. Using around 10 million human generated answers, machines may be easily evaluated.

1 Introduction

Humans have an amazing ability to both understand and reason about our world through a variety of senses or modalities. A sentence such as “Mary quickly ran away from the growling bear.”, conjures both vivid visual and auditory interpretations. We picture Mary running in the opposite direction of a ferocious bear with the sound of the bear being enough to frighten anyone. While interpreting a sentence such as this is effortless to a human, designing intelligent machines with the same deep understanding is anything but. How would a machine know Mary is frightened? What is likely to happen to Mary if she doesn’t run? Even simple implications of the sentence, such as “Mary is likely outside” may be nontrivial to deduce.

How can we determine if a machine has achieved the same deep understanding of our world as a human? In our example sentence above, a human’s understanding is rooted in multiple modalities. They can visualize a scene depicting Mary running, they can imagine the sound of the bear, and even how the bear’s fur might feel when touched. Conversely, if shown a picture or even an auditory recording of a woman running from a bear, a human may similarly describe the scene. Perhaps machine intelligence could be tested in a similar manner? Can a machine use natural language to describe a picture similar to a human? Similarly, could a machine generate a scene given a written description? In fact these tasks have been a goal of artificial intelligence research since its inception. Marvin Minsky famously stated in 1966 [8] to one of his students,“Connect a television camera to a computer and get the machine to describe what it sees.” At the time, and even today, the full complexities of this task are still being discovered.

2 Image Captioning

Figure 1: Example image captions written for an image sorted by caption length.

Are tasks such as image captioning [3, 22, 27, 16, 19, 15, 6, 11, 25, 21, 20, 32] promising candidates for testing artificial intelligence? These tasks have advantages, such as being easy to describe and being capable of capturing the imagination of the public [26]. Unfortunately, tasks such as image captioning have proven problematic as actual tests of intelligence. Most notably, the evaluation of image captions may be as difficult as the image captioning task itself [12, 30, 19, 22, 27]. It has been observed that captions judged as “good” by human observers may actually contain significant variance even though they describe the same image [30]. For instance see Figure 1. Many people would judge the longer more detailed captions as better. However, the details described by the captions varies significantly, e.g. “two hands”, “white t-shirt”, “black curly hair”, “label”, etc. How can we evaluate a caption if there is no consensus on what should be contained in a “good” caption? However, for shorter less detailed captions that are commonly written by humans a rough consensus is achieved “A man holding a beer bottle.” This leads to the somewhat counterintuitive conclusion that captions humans like aren’t necessarily “human-like”.

Figure 2: (left) An example image caption generated from [15]. (right) A set of semantically similar images in the MS COCO training dataset for which the same caption could apply.

The task of image captioning also suffers from another less obvious drawback. In many cases it might be too easy! Consider an example success from a recent paper on image captioning [15], Figure 2. Upon first inspection this caption appears to have been generated from a deep understanding of the image. For instance, in Figure 2 the machine must have detected a giraffe, grass and tree. It understood that the giraffe was standing, and the thing it was standing on was grass. It knows the tree and giraffe are “next to” each other, etc. Is this interpretation of the machine’s depth of understanding correct? When judging the results of an AI system, it is not only important to analyze its output, but the data used for its training. The results in Figure 2 were obtained by training on the Microsoft Common Objects in Context (MS COCO) dataset [23]. This dataset contains five independent captions written by humans for over 120,000 images [5]. If we examine the image in Figure 2 and the images in the training dataset we can make an interesting observation. For many testing images, there exists a significant number of semantically similar training images, Figure 2(right). If two images share enough semantic similarity, it is possible a single caption could describe them both.

This observation leads to a surprisingly simple algorithm for generating captions [9]. Given a test image, collect a set of captions from images that are visually similar. From this set, select the caption with highest consensus [30], i.e. the caption most similar to the other captions in the set. In many cases the consensus caption is indeed a good caption. When judged by humans, of these borrowed captions are judged to be equal to or better than those written by humans for the image specifically. Despite its simplicity, this approach is competitive with more advance approaches using recurrent neural networks [6, 11, 25, 21, 20, 32] and other language models [15] which can achieve when compared to human captions. Even methods using recurrent neural networks commonly produce captions that are identical to training captions even though they’re not explicitly trained to do so. If captions are “generated” by borrowing them from other images, these algorithms are clearly not demonstrating a deep understanding of language, semantics and their visual interpretation. The odds of two humans repeating a sentence is quite rare.

One could make the case that the fault is not with the algorithms but in the data used for training. That is, the dataset contains too many semantically similar images. However, even in randomly sampled images from the web, a photographer bias is found. Humans capture similar images to each other. Many of our tastes or preferences are universal.

3 Visual Question Answering

As we demonstrated using the task of image captioning, determining a multimodal task for measuring a machine’s intelligence is challenging. The task must be easy to evaluate, yet hard to solve. That is, it’s evaluation shouldn’t be as hard as the task itself, and it must not be solvable using “shortcuts” or “cheats”. To solve these two problems we propose the task of Visual Question Answering (VQA) [1, 18, 24, 29, 4, 17].

Figure 3: Example images and questions in the Visual Question Answering dataset (

The task of VQA requires a machine to answer a natural language question about an image as shown in Figure 3. Unlike the captioning task, evaluating answers to questions is relatively easy. The simplest approach is to pose the questions with multiple choice answers, much like standardized tests administered to students. Since computers don’t get tired of reading through long lists of answers, we can even increase the length of the answer list. Another more challenging option is to leave the answers open-ended. Since most answers are single words such as “yes”, “blue”, or “two” evaluating their correctness is straightforward.

Is the visual question answering task challenging? The task is inherently multimodal, since it requires knowledge of language and vision. Its complexity is further increased by the fact that many questions require commonsense knowledge to answer. For instance, if you ask “Does the man have “20/20” vision?”, you need the commonsense knowledge that having 20/20 vision implies you don’t wear glasses. Going one step further, one might be concerned that commonsense knowledge is all that’s needed to answer the questions. For example if the question was “What color is the sheep?”, our commonsense would tell us the answer is “white”. We may test the sufficiency of commonsense knowledge by asking subjects to answer questions without seeing the accompanying image. In this case, humans subjects did indeed perform poorly ( correct), indicating that commonsense may be necessary but not sufficient. Similarly, we may ask subjects to answer the question given only a caption describing the image. In this case the humans performed better ( correct), but still not as accurately as those able to view the image ( correct). This helps indicate the VQA task requires more detailed information about an image than is typically provided in an image caption.

How do you gather diverse and interesting questions for 100,000’s of images? Amazon’s Mechanical Turk provides a powerful platform for crowdsourcing tasks, but the design and prompts of the experiments must be careful chosen. For instance, we ran trial experiments prompting the subjects to write questions that would be difficult for a “toddler”, “alien”, or “smart robot” to answer. Upon examination, we determined that questions written for a smart robot were most interesting given their increased diversity and difficulty. In comparison, the questions stumping a toddler were a bit too easy. We also gathered three questions per image and ensured diversity by displaying the previously written questions and stating “Write a different question from those above that would stump a smart robot.” In total over 760,000 questions were gathered 111

Figure 4: Distribution of questions by their first four words. The ordering of the words starts towards the center and radiates outwards. The arc length is proportional to the number of questions containing the word. White areas indicate words with contributions too small to show.
Figure 5: Distribution of answers per question type when subjects provide answers when given the image (top) and when not given the image (bottom).

The diversity of questions supplied by the subjects on Amazon’s Mechanical Turk is impressive. In Figure 4, we show the distribution of words that begin the questions. The majority of questions begin with “What” and “Is”, but other questions include “How”, “Are”, “Does”, etc. Clearly no one type of question dominates. The answers to these questions have a varying diversity depending on the type of question. Since the answers may be ambiguous, e.g. “What is the person looking at?” we collected ten answers per question. As shown in Figure 5, many question types are simply answered “yes” or “no”. Other question types such as those that start with “What is” have a greater variety of answers. An interesting comparison is to examine the distribution of answers when subjects were asked to answer the questions with and without looking at the image. As shown in Figure 5 (bottom), there is a strong bias to many questions when subjects do not see the image. For instance “What color” questions invoke “red” as an answer, or for questions that are answered by “yes” or “no”, “yes” is highly favored.

Finally it is important to measure the difficulty of the questions. Some questions such as “What color is the ball?” or “How many people are in the room?” may seem quite simple. In contrast, other questions such as “Does this person expect company?” or “What government document is needed to partake in this activity?” may require quite advanced reasoning to answer. Unfortunately, the difficultly of a question is in many cases ambiguous. The question’s difficultly is as much dependent on the person or machine answering the question as the question itself. Each person or machine has different competencies.

In an attempt to gain insight into how challenging each question is to answer, we asked human subjects to guess how old a person would need to be to answer the question. It is unlikely most human subjects have adequate knowledge of human learning development to answer the question correctly. However, this does provide an effective proxy for question difficulty. That is, questions judged to be answerable by a 3-4 year old are easier than those judged answerable by a teenager. Note, we make no claims that questions judged answerable by a 3-4 year old will actually be answered correctly by toddlers. This would require additional experiments performed by the appropriate age groups. Since the task is ambiguous, we collected ten respondences for each question. In Figure 6 we show several questions for which a majority of subjects picked the specified age range. Surprisingly the perceived age needed to answer the questions is fairly well distributed across the different age ranges. As expected the questions that were judged answerable by an adult (18+) generally need specialized knowledge, where those answerable by a toddler (3-4) are more generic.

Figure 6: Example questions judged to be answerable by different age groups. The percentage of questions falling into each age group is shown in parentheses.

4 Abstract Scenes

The visual question answering task requires a variety of skills. The machine must be able to understand the image, interpret the question and reason about the answer. For many researchers exploring AI, they may not be interested in exploring the low-level tasks involved with perception and computer vision. Many of the questions may even be impossible to solve given the current capabilities of state-of-the-art computer vision algorithms. For instance the question “How many cellphones are in the image?” may not be answerable if the computer vision algorithms cannot accurately detect cellphones. In fact, even for state-of-the-art algorithms many objects are difficult to detect, especially small objects [23].

Figure 7: Example abstract scenes and their questions in the Visual Question Answering dataset (

To enable multiple avenues for researching VQA, we introduce abstract scenes into the dataset [2, 34, 35, 36]. Abstract scenes or cartoon images are created from sets of clip art, Figure 7. The scenes are created by human subjects using a graphical user interface that allows them to arrange a wide variety objects. For clip art depicting humans, their poses and expression may also be changed. Using the interface a wide variety of scenes can be created including ordinary scenes, scary scenes, or funny scenes. Since the type of clip art and it’s properties are exactly known, the problem of recognizing objects and their attributes is greatly simplified. This provides researchers an opportunity to more directly study the problems of question understanding and answering. Once computer vision algorithms “catch up”, perhaps some of the techniques developed for abstract scenes can be applied to real images. The abstract scenes may be useful for a variety of other tasks as well, such as learning common sense knowledge [35, 2, 7, 10, 31].

5 Discussion

While visual question answering appears to be a promising approach to measuring machine intelligence for multimodal tasks, it may prove to have unforseen shortcomings. We’ve explored several baseline algorithms that perform poorly when compared to human performance. As the dataset is explored, it is possible that solutions may be found that don’t require “true AI”. However, using proper analysis we hope to continuously update the dataset to reflect the current progress of the field. As certain question or image types become too easy to answer we can add new questions and images. Other modalities may also be explored such as audio and text-based stories [13, 14, 33, 28].

In conclusion, we believe designing a multimodal challenge is essential for accelerating and measuring the progress of AI. Visual question answering offers one approach for designing such challenges that allows for easy evaluation while maintaining the difficultly of the task. As the field progresses our tasks and challenges should be continuously reevaluated to ensure they are of appropriate difficultly given the state of research. Importantly, these tasks should be designed to push the frontiers of AI research, and help ensure their solutions lead us towards systems that are truly “AI complete”.


  • [1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
  • [2] S. Antol, C. L. Zitnick, and D. Parikh. Zero-shot learning via visual abstraction. In European Conference on Computer Vision, pages 401–416. Springer, 2014.
  • [3] K. Barnard and D. Forsyth. Learning the semantics of words and pictures. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 2, pages 408–415. IEEE, 2001.
  • [4] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, et al. Vizwiz: nearly real-time answers to visual questions. In Proceedings of the 23nd annual ACM symposium on User interface software and technology, pages 333–342. ACM, 2010.
  • [5] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • [6] X. Chen and C. Lawrence Zitnick. Mind’s eye: A recurrent visual representation for image caption generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2422–2431, 2015.
  • [7] X. Chen, A. Shrivastava, and A. Gupta. Neil: Extracting visual knowledge from web data. In Proceedings of the IEEE International Conference on Computer Vision, pages 1409–1416, 2013.
  • [8] D. Crevier. AI: The tumultuous history of the search for artificial intelligence. Basic Books, Inc., 1993.
  • [9] J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick. Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467, 2015.
  • [10] S. K. Divvala, A. Farhadi, and C. Guestrin. Learning everything about anything: Webly-supervised visual concept learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3270–3277, 2014.
  • [11] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
  • [12] D. Elliott and F. Keller. Comparing automatic evaluation measures for image description. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: Short Papers, volume 452, page 457, 2014.
  • [13] A. Fader, L. Zettlemoyer, and O. Etzioni. Open question answering over curated and extracted knowledge bases. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1156–1165. ACM, 2014.
  • [14] A. Fader, L. S. Zettlemoyer, and O. Etzioni. Paraphrase-driven learning for open question answering. In ACL (1), pages 1608–1618. Citeseer, 2013.
  • [15] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473–1482, 2015.
  • [16] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision, pages 15–29. Springer, 2010.
  • [17] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu. Are you talking to a machine? dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, pages 2296–2304, 2015.
  • [18] D. Geman, S. Geman, N. Hallonquist, and L. Younes. Visual turing test for computer vision systems. Proceedings of the National Academy of Sciences, 112(12):3618–3623, 2015.
  • [19] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899, 2013.
  • [20] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
  • [21] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. Transactions of the Association for Computational Linguistics, 2015.
  • [22] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903, 2013.
  • [23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
  • [24] M. Malinowski and M. Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems, pages 1682–1690, 2014.
  • [25] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-rnn). ICLR, 2015.
  • [26] J. Markoff. Researchers announce advance in image-recognition software. The New York Times, 2014.
  • [27] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé III. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 747–756. Association for Computational Linguistics, 2012.
  • [28] M. Richardson, C. J. Burges, and E. Renshaw. Mctest: A challenge dataset for the open-domain machine comprehension of text. In EMNLP, volume 3, page 4, 2013.
  • [29] K. Tu, M. Meng, M. W. Lee, T. E. Choe, and S.-C. Zhu. Joint video and text parsing for understanding events and answering queries. IEEE MultiMedia, 21(2):42–70, 2014.
  • [30] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.
  • [31] R. Vedantam, X. Lin, T. Batra, C. Lawrence Zitnick, and D. Parikh. Learning common sense through visual abstraction. In Proceedings of the IEEE International Conference on Computer Vision, pages 2542–2550, 2015.
  • [32] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
  • [33] J. Weston, A. Bordes, S. Chopra, A. M. Rush, B. van Merriënboer, A. Joulin, and T. Mikolov. Towards ai-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.
  • [34] C. L. Zitnick and D. Parikh. Bringing semantics into focus using visual abstraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3009–3016, 2013.
  • [35] C. L. Zitnick, D. Parikh, and L. Vanderwende. Learning the visual interpretation of sentences. In Proceedings of the IEEE International Conference on Computer Vision, pages 1681–1688, 2013.
  • [36] C. L. Zitnick, R. Vedantam, and D. Parikh. Adopting abstract images for semantic scene understanding. IEEE transactions on pattern analysis and machine intelligence, 38(4):627–638, 2016.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description