Modeling Gestalt Visual Reasoning on the Raven’s Progressive Matrices Intelligence Test Using Generative Image Inpainting Techniques
Psychologists recognize Raven’s Progressive Matrices as a very effective test of general human intelligence. While many computational models have been developed by the AI community to investigate different forms of top-down, deliberative reasoning on the test, there has been less research on bottom-up perceptual processes, like Gestalt image completion, that are also critical in human test performance. In this work, we investigate how Gestalt visual reasoning on the Raven’s test can be modeled using generative image inpainting techniques from computer vision. We demonstrate that a self-supervised inpainting model trained only on photorealistic images of objects achieves a score of 27/36 on the Colored Progressive Matrices, which corresponds to average performance for nine-year-old children. We also show that models trained on other datasets (faces, places, and textures) do not perform as well. Our results illustrate how learning visual regularities in real-world images can translate into successful reasoning about artificial test stimuli. On the flip side, our results also highlight the limitations of such transfer, which may explain why intelligence tests like the Raven’s are often sensitive to people’s individual sociocultural backgrounds.
Consider the matrix reasoning problem in Figure 1; the goal is to select the answer choice from the bottom that best fits in the blank portion on top. Such problems are found on many different human intelligence tests [roid1997leiter, wechsler2008wechsler], including on the Raven’s Progressive Matrices tests, which are considered to be the most effective single measure of general intelligence across all psychometric tests [snow1984topography].
As you may have guessed, the solution to this problem is answer choice #2. While this problem may seem quite simple, what is interesting about it is that there are multiple ways to solve it. For example, one might take a top-down, deliberative approach: first deciding that the top two elements are reflected across the horizontal axis, and then reflecting the bottom element to predict an answer. This is often called an Analytic approach [lynn2004sex, prabhakaran1997neural]. Alternatively, one might just “see” the answer emerge in the empty space in a more bottom-up, automatic fashion, often called a Gestalt or figural approach.
While many computational models explore variations of the Analytic approach, less attention has been paid to the Gestalt approach, though both are critical in human intelligence. In human cognition, Gestalt principles refer to a diverse set of capabilities for detecting and predicting perceptual regularities such as symmetry, closure, similarity, etc. [wagemans2012century]. Here, we investigate how Gestalt reasoning on the Raven’s test can be modeled with generative image inpainting techniques from computer vision:
We describe a concrete framework for solving Raven’s problems through Gestalt visual reasoning, using a generic image inpainting model as a component.
We demonstrate that our framework, using an inpainting model trained on photorealistic object images from ImageNet, achieves a score of 27/36 on the Raven’s Colored Progressive Matrices test.
We show that test performance is sensitive to the inpainting model’s training data. Models trained on faces, places, and textures get scores of 11, 17, and 18, respectively, and we offer some potential reasons for these differences.
Background: Gestalt Reasoning
In humans, Gestalt phenomena have to do with how we integrate low-level perceptual elements into coherent, higher-level wholes [wagemans2012century]. For example, the left side of Figure 2 contains only scattered line segments, but we inescapably see a circle and rectangle. The right side of Figure 2 contains one whole key and one broken key, but we see two whole keys with occlusion.
In psychology, studies of Gestalt phenomena have enumerated a list of principles (or laws, perceptual/reasoning processes, etc.) that cover the kinds of things that human perceptual systems do [wertheimer1923untersuchungen, kanizsa1979organization]. Likewise, work in image processing and computer vision has attempted to define these principles mathematically or computationally [desolneux2007gestalt].
In more recent models, Gestalt principles are seen as emergent properties that reflect, rather than determine, perceptions of structure in an agent’s visual environment. For example, early approaches to image inpainting—i.e., reconstructing a missing/degraded part of an image—used rule-like principles to determine the structure of missing content, while later, machine-learning-based approaches attempt to learn structural regularities from data and apply them to new images [schonlieb2015partial]. This seems reasonable as a model of Gestalt phenomena in human cognition; after years of experience with the world around us, we see Figure 2 (left) as partially occluded/degraded views of whole objects.
Background: Image Inpainting
Machine-learning-based inpainting techniques typically either borrow information from within the occluded image itself [bertalmio2000image, barnes2009patchmatch, ulyanov2018deep] or from a prior learned from other images [hays2008scene, yu2018generative, zheng2019pluralistic]. The first type of approach often uses patch similarities to propagate low-level features, such as the texture of grass, from known background regions to unknown patches. Of course, such approaches suffer on images with low self-similarity or when the missing part involves semantic-level cognition, e.g., a part of a face.
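To make the first, patch-borrowing idea concrete, here is a minimal sketch (a hypothetical toy heuristic, not any of the cited algorithms): a missing square patch is filled by copying the candidate patch elsewhere in the image whose top and bottom border rows best match the known borders of the hole.

```python
import numpy as np

def fill_patch(img, top, left, size):
    """Fill the size-by-size hole at (top, left) by copying the candidate
    patch elsewhere in img whose top and bottom border rows best match the
    hole's border rows (a crude patch-similarity heuristic)."""
    h, w = img.shape
    # Border rows just above and below the hole (known pixels).
    target = np.concatenate([img[top - 1, left:left + size],
                             img[top + size, left:left + size]])
    best, best_d = None, np.inf
    for i in range(1, h - size):            # candidate top rows
        for j in range(0, w - size + 1):    # candidate left columns
            if abs(i - top) < size and abs(j - left) < size:
                continue                    # skip patches overlapping the hole
            cand = np.concatenate([img[i - 1, j:j + size],
                                   img[i + size, j:j + size]])
            d = float(np.sum((cand - target) ** 2))
            if d < best_d:
                best, best_d = img[i:i + size, j:j + size].copy(), d
    out = img.copy()
    out[top:top + size, left:left + size] = best
    return out
```

On a highly self-similar texture (e.g., a periodic pattern) this recovers the hole exactly, but, as noted above, such approaches break down on images with low self-similarity or semantically meaningful missing regions.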
The second approach aims to generalize regularities in visual content and structure across different images, and several impressive results have recently been achieved with the rise of deep-learning-based generative models. For example, Li and colleagues (\citeyearli2017generative) use an encoder-decoder neural network structure, regulated by an adversarial loss function, to recover partly occluded face images. More recently, Yu and colleagues (\citeyearyu2018generative) designed an architecture that not only can synthesize missing image parts but also explicitly utilizes surrounding image features as context to make inpainting more precise. In general, most recent neural-network-based image inpainting algorithms represent some combination of variational autoencoders (VAE) and generative adversarial networks (GAN) and typically contain an encoder, a decoder, and an adversarial discriminator.
Generative Adversarial Networks (GAN)
Generative adversarial networks combine generative and discriminative models to learn very robust image priors [goodfellow2014generative]. In a typical formulation, the generator is a transposed convolutional neural network while the discriminator is a regular convolutional neural network. During training, the generator is fed random noise and outputs a generated image. The generated image is sent alongside a real image to the discriminator, which outputs a score to evaluate how real or fake the inputs are. The error between the output score and ground truth score is back-propagated to adjust the weights.
This training scheme forces the generator to produce images that will fool the discriminator into believing they are real. In the end, training converges at an equilibrium where the generator cannot make the synthesized images any more realistic, while the discriminator fails to tell whether an image is real or generated. Essentially, the training process of GANs forces the generated images to lie within the same distribution (in some latent space) as real images.
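The alternating scheme above can be sketched in miniature. The toy below is a 1-D "GAN" with a linear generator, a logistic discriminator, hand-derived gradients, and hypothetical hyperparameters; a real GAN replaces these scalars with the deep convolutional networks described above.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Generator g(z) = a*z + b tries to imitate real data ~ N(3, 0.5^2);
# discriminator d(x) = sigmoid(w*x + c) scores how "real" x looks.
a, b = 1.0, 0.0          # generator parameters
w, c = 0.1, 0.0          # discriminator parameters
lr = 0.05

for _ in range(5000):
    x_real = 3.0 + 0.5 * rng.standard_normal()
    z = rng.standard_normal()
    x_fake = a * z + b

    # Discriminator ascent step on log d(real) + log(1 - d(fake)).
    s_r = sigmoid(w * x_real + c)
    s_f = sigmoid(w * x_fake + c)
    w += lr * ((1.0 - s_r) * x_real - s_f * x_fake)
    c += lr * ((1.0 - s_r) - s_f)

    # Generator ascent step on the non-saturating objective log d(fake).
    s_f = sigmoid(w * x_fake + c)
    grad_x = (1.0 - s_f) * w     # d log d(fake) / d x_fake
    a += lr * grad_x * z
    b += lr * grad_x

# After training, generated samples drift toward the real distribution.
fakes = a * rng.standard_normal(1000) + b
```

Even in this toy setting one can see the equilibrium dynamics: once the fake samples overlap the real ones, the discriminator's updates roughly cancel and the generator stops moving.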
Variational autoencoders (VAE)
Autoencoders are deep neural networks, with a narrow bottleneck layer in the middle, that learn to reconstruct high-dimensional data from their original inputs. The bottleneck captures a compressed latent encoding that can then be used for tasks other than reconstruction. Variational autoencoders use a similar encoder-decoder structure but additionally encourage continuous sampling within the bottleneck layer, so that the decoder, once trained, functions as a generator [kingma2013auto].
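A minimal sketch of the VAE bottleneck follows (toy dimensions and random, untrained weights; the reparameterization trick and KL term are the standard ones from [kingma2013auto]):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy VAE bottleneck: the encoder outputs a mean and log-variance per
# latent dimension, and sampling uses the reparameterization trick so
# gradients can flow through the sampling step. Weights here are random
# linear maps, purely for illustration.
D_in, D_z = 12, 3
W_mu  = rng.standard_normal((D_z, D_in)) * 0.1
W_lv  = rng.standard_normal((D_z, D_in)) * 0.1
W_dec = rng.standard_normal((D_in, D_z)) * 0.1

def encode(x):
    return W_mu @ x, W_lv @ x                # mean, log-variance

def reparameterize(mu, logvar):
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps   # z = mu + sigma * eps

def decode(z):
    return W_dec @ z

def kl_divergence(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions;
    # this is the term that keeps the bottleneck continuously sampleable.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

x = rng.standard_normal(D_in)
mu, logvar = encode(x)
z = reparameterize(mu, logvar)
x_hat = decode(z)
loss = np.sum((x - x_hat) ** 2) + kl_divergence(mu, logvar)
```

Training minimizes the reconstruction error plus the KL term; once trained, feeding samples from N(0, 1) directly into `decode` turns the decoder into a generator.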
While a GAN’s generated image outputs are often sharp and clear, a major disadvantage is that the training process can be unstable and prone to problems [goodfellow2014generative, Mao2016LeastSG]. Even if training problems can be solved, e.g., [arjovsky2017wasserstein], GANs still lack encoders that map real images to latent variables. Compared with GANs, VAE-generated images are often a bit blurrier, but the model structure in general is much more mathematically elegant and more easily trainable. To get the best of both worlds, Larsen and colleagues (\citeyearlarsen2015autoencoding) proposed an architecture that attaches an adversarial loss to a variational autoencoder, as shown in Figure 3.
Our Gestalt Reasoning Framework
In this section, we present a general framework for modeling Gestalt visual reasoning on the Raven’s test or similar types of problems. Our framework is intended to be agnostic to any type of encoder-decoder-based inpainting model. For our experiments, we adopt a recent VAE-GAN inpainting model [yu2018generative]; as we use the identical architecture and training configuration, we refer readers to the original paper for more details about the inpainting model itself.
Our framework makes use of a pre-trained encoder $E_{\phi}$ and corresponding decoder $G_{\theta}$ (where $\phi$ and $\theta$ indicate the encoder's and decoder's learned parameters, respectively). The partially visible image to be inpainted, in our case, is a Raven's problem matrix with the fourth cell missing, accompanied by a mask $m$, which is passed as input into the encoder $E_{\phi}$. Then $E_{\phi}$ outputs an embedded feature representation $z$, which is sent as input to the generator $G_{\theta}$. Note that the learned feature representation $z$ could take any form (a vector, matrix, tensor, or any other encoding) as long as it represents the latent features of the input images.
The generator then outputs a generated image, and we cut out the generated part as the predicted answer. Finally, we choose the most similar candidate answer choice by computing the distance among feature representations of the various images (the prediction versus each answer choice), computed using the trained encoder again.
This process is illustrated in Figure 4. More concisely, let $x_1$, $x_2$, $x_3$ be the three given elements of the problem matrix, $m$ be the image mask, and $X = \{x_1, x_2, x_3, m\}$ be the input comprised of these four images. Then the process of solving the problem to determine the chosen answer $a^{*}$ can be written as:
\[
\hat{x} = G_{\theta}(E_{\phi}(X)), \qquad
a^{*} = \operatorname*{argmin}_{a \in A} \left\| E_{\phi}\big(\hat{x}[h/2\!:\!h,\; w/2\!:\!w]\big) - E_{\phi}(a) \right\|,
\]
where $h$ and $w$ are the height and width of the reconstructed image $\hat{x}$, the slice $\hat{x}[h/2\!:\!h,\; w/2\!:\!w]$ denotes the generated bottom-right cell, and $A$ is the answer choice space.
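A schematic, deliberately toy implementation of this pipeline might look like the following, with a random linear projection standing in for the trained encoder and its transpose standing in for the generator (the real components are the deep VAE-GAN model from [yu2018generative]):

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8                                   # full problem-matrix image size
PROJ = rng.standard_normal((16, H * W)) * 0.1

# Toy stand-ins for the trained encoder E_phi and generator G_theta.
def encode(img):
    return PROJ @ img.ravel()               # z = E_phi(X)

def generate(z):
    return (PROJ.T @ z).reshape(H, W)       # x_hat = G_theta(z)

def pad_to_matrix(cell):
    """Place a single cell image into the bottom-right quadrant of an
    otherwise blank matrix so it can go through the same encoder."""
    return np.pad(cell, ((H // 2, 0), (W // 2, 0)))

def solve(matrix_with_blank, answer_choices):
    z = encode(matrix_with_blank)           # encode the masked problem
    x_hat = generate(z)                     # inpaint the whole matrix
    pred = x_hat[H // 2:, W // 2:]          # cut out the generated cell
    pred_feat = encode(pad_to_matrix(pred))
    # Choose the answer whose features are nearest to the prediction's.
    dists = [np.linalg.norm(pred_feat - encode(pad_to_matrix(a)))
             for a in answer_choices]
    return int(np.argmin(dists))
```

Here the answer choices are (H/2)-by-(W/2) cell images, and `solve` returns the index of the chosen answer $a^{*}$ in the answer choice space.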
For our experiments, we used the same image inpainting model [yu2018generative] trained on four different datasets. The first model, which we call Model-Objects, we trained from scratch so that we could evaluate Raven’s test performance at multiple checkpoints during training. The latter three models, which we call Model-Faces, Model-Scenes, and Model-Textures, we obtained as pre-trained models [yu2018generative]. Details about each dataset are given below.
Note: The reader may wonder why we did not train an inpainting model on Raven’s-like images, i.e., black and white illustrations of 2D shapes. Our rationale follows the spirit of human intelligence testing: people are not meant to practice taking Raven’s-like problems. If they do, the test is no longer a valid measure of their intelligence [hayes2015we]. Here, our goal was to explore how “test-naive” Gestalt image completion processes would fare. (There are many more nuances to these ideas, of course, which we discuss further in Related Work.)
Model-Objects. The first model, Model-Objects, was trained on the ImageNet dataset [russakovsky2015imagenet]. We trained this model from scratch, beginning with the full ImageNet dataset of 14M images non-uniformly spanning around 20,000 categories such as “windows,” “balloons,” and “giraffes.” The model converged prior to one full training epoch on the randomized dataset; we halted training at around 300,000 iterations, with a batch size of 36 images per iteration. The best Raven’s performance was found at around 80,000 iterations, which means that the final model we used saw only about 3M images in total during training.
Model-Faces. Our second model, Model-Faces, was trained on the Large-scale CelebFaces Attributes (CelebA) dataset [liu2015deep], which contains around 200,000 images of celebrity faces, covering around 10,000 individuals.
Model-Scenes. Our third model, Model-Scenes, was trained on the Places dataset [zhou2017places], which contains around 10M images spanning 434 categories, grouped into three macro-categories: indoor, nature, and urban.
Model-Textures. Our fourth model, Model-Textures, was trained on the Describable Textures Dataset (DTD) [cimpoi2014describing], which contains 5640 images, divided into 47 categories, of textures taken from real objects, such as knitting patterns, spiderwebs, or an animal’s skin.
Results
We evaluated all four models (Model-Objects, Model-Faces, Model-Scenes, and Model-Textures) on the Raven’s test. For each model, we report the number of problems answered correctly (out of 12) on each set (A, AB, B, C, D, and E), along with totals for the Colored Progressive Matrices (sets A, AB, and B) and the Standard Progressive Matrices (sets A, B, C, D, and E). For Model-Objects, which we trained from scratch, we report results from the best-performing training checkpoint and plot CPM accuracy and training loss as a function of training iteration; for the other three models, we used publicly available pre-trained versions [yu2018generative]. We also show each model’s reconstructions on five example problems and, to illustrate differences among the learned priors, each model’s inpainting outputs on photographs of a face, a place, and an object.
Whether neural networks can learn abstract reasoning, or whether they merely rely on superficial statistics, has been a topic of recent debate. Interestingly, we found that a randomly initialized, untrained version of the inpainting architecture correctly answers 8 of 36 problems on average (up to 10 on some runs), where chance performance would be 1/6, or 6 of 36; we attribute this above-chance performance to the structural prior inherent in convolutional architectures [ulyanov2018deep]. Ultimately, the test administration most similar to the human situation would be to have an agent that has never seen such problems sit down and learn how the problems work from verbal instructions alone.
Related Work on the Raven’s Test
Over the decades, there have been many exciting efforts in AI to computationally model various aspects of problem solving for matrix reasoning and similar geometric analogy problems, beginning with Evans’ classic ANALOGY program [evans1968program]. In this section, we review some major themes that seem to have emerged across these efforts, situate our current work within this broader context, and point out important gaps that remain unfilled.
Note that we do not attempt to list the “test scores” achieved by various models for two reasons. First, these models have collectively explored so many problem variants, problem contents, pre-processing methods, model constraints, etc., that it is exceedingly difficult to make apples-to-apples comparisons among them.
Second, and more importantly, we feel that better scientific knowledge has come from the systematic, within-model experiments presented by many of these studies than from the absolute levels of performance they achieve. Raven’s is not now (and probably never will be) a task that is of practical utility for AI systems in the world to be solving well, and so treating it as a black-box benchmark is of limited value. However, the test continues to be enormously profitable as a research tool for generating insights into the organization of intelligence, both in humans and in artificial systems.
Knowledge-based versus data-driven. Early models took a knowledge-based approach, meaning that they contained explicit, structured representations of certain key elements of domain knowledge. For example, Carpenter and colleagues (\citeyearcarpenter1990one) built a system that matched relationships among problem elements according to one of five predefined rules. Knowledge-based models tend to focus on what an agent does with its knowledge during reasoning; where this knowledge might come from remains an open question.
On the flip side, a recently emerging crop of data-driven models extract domain knowledge from a training set containing example problems that are similar to the test problems the model will eventually solve, e.g., [hoshen2017iq]. Data-driven models tend to focus on interactions between training data, learning architectures, and learning outcomes; how knowledge might be represented in a task-general manner and used flexibly during reasoning and decision-making remain open questions.
One of the first algorithmic descriptions of Raven’s problem solving appeared in 1974, when Hunt described two qualitatively different solution strategies: one Gestalt and one Analytic. Many computational models of matrix reasoning have since been proposed [carpenter1990one, rasmussen2011neural, kunda2013computational, lovett2017modeling, strannegaard2013anthropomorphic], most following a deliberative, Analytic approach; to our knowledge, none of these previous models has modeled human reasoning using Gestalt perceptual principles.
1. The IQ of Neural Networks [hoshen2017iq]. This work generated a Raven’s-like dataset of geometric shapes in which six images are given to a neural network as input: two question cells and four candidate answer choices, only one of which correctly completes the progression begun by the question cells.
Reading the question cells and correct answer as a row, one can observe progressions in degrees of rotation, size of geometry, reflection, number of objects, shades of color, or additive relations across the three cells. Neural models trained on these data either predict a probability for each answer choice or generate the third cell directly. Trained and tested on this dataset, the model achieved, as the authors put it, the top 5% of human performance; scores on the actual Raven’s test are not reported.
2. Measuring abstract reasoning in neural networks [barrett2018measuring]. This work examines the ability of different neural architectures, such as CNN-MLP, ResNet, LSTM, and WReN, to generalize on a Raven’s-like dataset. Experiments show that the Wild Relation Network (WReN) has the strongest inductive bias toward relational reasoning tasks.
3. Learning to make analogies by contrasting abstract relational structure [hill2019learning]. This paper shows that a different arrangement of training examples can facilitate the learning of abstract relational structure. For example, contrasting a row of cells showing a progression in object quantity with a row showing a progression in object darkness greatly increases the model’s ability to generalize beyond trivial, specific aspects of the input images to a more general conceptual common ground between the rows.
4. Improving Generalization for Abstract Reasoning Tasks Using Disentangled Feature Representations [steenbrugge2018improving]. This paper demonstrates a two-stage training paradigm: first, a feature extractor that encodes a disentangled feature representation is learned in an unsupervised manner; then, a relational reasoning module is trained in the latent disentangled feature space, with the correct answers as the supervision signal. Compared with training the model end to end without disentanglement, this paradigm shows a reasonable improvement, though the results presented are preliminary.
5. Are Disentangled Representations Helpful for Abstract Visual Reasoning? [van2019disentangled]. This paper uses RPM-like 3-by-3 visual reasoning matrices generated from the dSprites dataset to test extensively whether disentangled representations truly facilitate downstream abstract reasoning tasks, compared with training both the encoder and a WReN relational reasoning module end to end. The paper shows that the two-stage paradigm leads to quicker learning from fewer examples.
6. Raven: A dataset for relational and analogical visual reasoning [zhang2019raven]. This work created a Raven’s-like dataset with structural annotations for augmentation purposes and human performance data for comparison. When the annotations are used for augmentation, the models tested in [barrett2018measuring] all show a boost in test accuracy.
All of these previous efforts assume that a model’s score on Raven’s or Raven’s-like tests reflects its intelligence, regardless of what the training inputs were. In humans, however, the reliability of the Raven’s test is known to depend on the test-taker’s unfamiliarity with the test format [hayes2015we]; the test is no longer valid for anyone who has practiced on millions of problems close to the Raven’s. Though these positive results do exhibit the great potential for abstract reasoning in neural networks, the fact that these models require tens of thousands of images to generalize reveals a great disparity with the human mind [bors2003effect].
In conclusion, …