Modeling Gestalt Visual Reasoning on the Raven’s Progressive Matrices Intelligence Test Using Generative Image Inpainting Techniques

Tianyu Hua (1, 2)
Maithilee Kunda (2)
(1) China University of Geosciences (Beijing)
(2) Vanderbilt University
patrickhua.ty@gmail.com
mkunda@vanderbilt.edu
Abstract

Psychologists recognize Raven’s Progressive Matrices as a very effective test of general human intelligence. While many computational models have been developed by the AI community to investigate different forms of top-down, deliberative reasoning on the test, there has been less research on bottom-up perceptual processes, like Gestalt image completion, that are also critical in human test performance. In this work, we investigate how Gestalt visual reasoning on the Raven’s test can be modeled using generative image inpainting techniques from computer vision. We demonstrate that a self-supervised inpainting model trained only on photorealistic images of objects achieves a score of 27/36 on the Colored Progressive Matrices, which corresponds to average performance for nine-year-old children. We also show that models trained on other datasets (faces, places, and textures) do not perform as well. Our results illustrate how learning visual regularities in real-world images can translate into successful reasoning about artificial test stimuli. On the flip side, our results also highlight the limitations of such transfer, which may explain why intelligence tests like the Raven’s are often sensitive to people’s individual sociocultural backgrounds.

Introduction

Consider the matrix reasoning problem in Figure 1; the goal is to select the answer choice from the bottom that best fits in the blank portion on top. Such problems are found on many different human intelligence tests [roid1997leiter, wechsler2008wechsler], including on the Raven’s Progressive Matrices tests, which are considered to be the most effective single measure of general intelligence across all psychometric tests [snow1984topography].

As you may have guessed, the solution to this problem is answer choice #2. While this problem may seem quite simple, what is interesting about it is that there are multiple ways to solve it. For example, one might take a top-down, deliberative approach by first deciding that the top two elements are reflected across the horizontal axis, and then reflecting the bottom element to predict an answer, which is often called an Analytic approach [lynn2004sex, prabhakaran1997neural]. Alternatively, one might just “see” the answer emerge in the empty space, in a more bottom-up, automatic fashion, which is often called a Gestalt or figural approach.

Figure 1: Example problem like those on the Raven’s Progressive Matrices tests [kunda2013computational].

While many computational models explore variations of the Analytic approach, less attention has been paid to the Gestalt approach, though both are critical in human intelligence. In human cognition, Gestalt principles refer to a diverse set of capabilities for detecting and predicting perceptual regularities such as symmetry, closure, similarity, etc. [wagemans2012century]. Here, we investigate how Gestalt reasoning on the Raven’s test can be modeled with generative image inpainting techniques from computer vision:

  • We describe a concrete framework for solving Raven’s problems through Gestalt visual reasoning, using a generic image inpainting model as a component.

  • We demonstrate that our framework, using an inpainting model trained on photorealistic object images from ImageNet, achieves a score of 27/36 on the Raven’s Colored Progressive Matrices test.

  • We show that test performance is sensitive to the inpainting model’s training data. Models trained on faces, places, and textures get scores of 11, 17, and 18, respectively, and we offer some potential reasons for these differences.

Background: Gestalt Reasoning

Figure 2: Images eliciting Gestalt “completion” phenomena.

In humans, Gestalt phenomena have to do with how we integrate low-level perceptual elements into coherent, higher-level wholes [wagemans2012century]. For example, the left side of Figure 2 contains only scattered line segments, but we inescapably see a circle and rectangle. The right side of Figure 2 contains one whole key and one broken key, but we see two whole keys with occlusion.

In psychology, studies of Gestalt phenomena have enumerated a list of principles (or laws, perceptual/reasoning processes, etc.) that cover the kinds of things that human perceptual systems do [wertheimer1923untersuchungen, kanizsa1979organization]. Likewise, work in image processing and computer vision has attempted to define these principles mathematically or computationally [desolneux2007gestalt].

In more recent models, Gestalt principles are seen as emergent properties that reflect, rather than determine, perceptions of structure in an agent’s visual environment. For example, early approaches to image inpainting—i.e., reconstructing a missing/degraded part of an image—used rule-like principles to determine the structure of missing content, while later, machine-learning-based approaches attempt to learn structural regularities from data and apply them to new images [schonlieb2015partial]. This seems reasonable as a model of Gestalt phenomena in human cognition; after years of experience with the world around us, we see Figure 2 (left) as partially occluded/degraded views of whole objects.

Background: Image Inpainting

Machine-learning-based inpainting techniques typically either borrow information from within the occluded image itself [bertalmio2000image, barnes2009patchmatch, ulyanov2018deep] or from a prior learned from other images [hays2008scene, yu2018generative, zheng2019pluralistic]. The first type of approach often uses patch similarities to propagate low-level features, such as the texture of grass, from known background regions to unknown patches. Of course, such approaches suffer on images with low self-similarity or when the missing part involves semantic-level cognition, e.g., a part of a face.
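To make the first family concrete, the following is a minimal sketch of filling a hole using only information from the same image. It uses simple diffusion (repeated neighbor averaging), which is a swapped-in, simpler technique than the patch-similarity methods cited above; the function name `diffuse_inpaint` and the toy gradient image are assumptions made purely for illustration.

```python
import numpy as np

def diffuse_inpaint(image, mask, iterations=200):
    """Fill masked pixels by repeatedly averaging their four neighbors.

    This is plain diffusion, not patch matching: it can only propagate
    smooth, low-level structure from the known region into the hole.
    """
    mask = np.asarray(mask, dtype=bool)
    result = np.asarray(image, dtype=float).copy()
    result[mask] = result[~mask].mean()  # crude initial guess for the hole
    for _ in range(iterations):
        padded = np.pad(result, 1, mode="edge")
        neighbors = (padded[:-2, 1:-1] + padded[2:, 1:-1]
                     + padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
        result[mask] = neighbors[mask]  # known pixels are never modified
    return result

# Toy example: a horizontal gradient with a 3x3 hole in the middle.
image = np.tile(np.linspace(0.0, 1.0, 9), (9, 1))
mask = np.zeros(image.shape, dtype=bool)
mask[3:6, 3:6] = True
filled = diffuse_inpaint(image, mask)
```

A smooth gradient is recovered well by this kind of fill, but nothing in it could restore a missing eye in a face, which is the limitation that motivates learned priors.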

The second approach aims to generalize regularities in visual content and structure across different images, and several impressive results have recently been achieved with the rise of deep-learning-based generative models. For example, Li and colleagues [li2017generative] use an encoder-decoder neural network structure, regulated by an adversarial loss function, to recover partly occluded face images. More recently, Yu and colleagues [yu2018generative] designed an architecture that not only synthesizes missing image parts but also explicitly uses surrounding image features as context to make inpainting more precise. In general, most recent neural-network-based image inpainting algorithms combine elements of variational autoencoders (VAE) and generative adversarial networks (GAN) and typically contain an encoder, a decoder, and an adversarial discriminator.

Generative Adversarial Networks (GAN)

Generative adversarial networks combine generative and discriminative models to learn very robust image priors [goodfellow2014generative]. In a typical formulation, the generator is a transposed convolutional neural network while the discriminator is a regular convolutional neural network. During training, the generator is fed random noise and outputs a generated image. The generated image is sent alongside a real image to the discriminator, which outputs a score to evaluate how real or fake the inputs are. The error between the output score and ground truth score is back-propagated to adjust the weights.

This training scheme forces the generator to produce images that will fool the discriminator into believing they are real images. In the end, training converges at an equilibrium where the generator cannot make the synthesized image more real, while the discriminator fails to tell whether an image is real or generated. Essentially, the training process of GANs forces the generated images to lie within the same distribution (in some latent space) as real images.
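For reference, the two-player game described above is commonly written as the minimax objective from [goodfellow2014generative]. The notation here follows that paper rather than this one: $G$ is the generator, $D$ is the discriminator, $p_{\text{data}}$ is the distribution of real images, and $p_z$ is the noise distribution.

```latex
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

At the equilibrium mentioned above, the generated and real distributions coincide and the optimal discriminator outputs $1/2$ for every input.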

Variational autoencoders (VAE)

Autoencoders are deep neural networks, with a narrow bottleneck layer in the middle, that can reconstruct high dimensional data from original inputs. The bottleneck will capture a compressed latent encoding that can then be used for tasks other than reconstruction. Variational autoencoders use a similar encoder-decoder structure but also encourage continuous sampling within the bottleneck layer so that the decoder, once trained, functions as a generator [kingma2013auto].
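The “continuous sampling” idea can be illustrated with the reparameterization step from [kingma2013auto]: the encoder outputs a mean and a log-variance, a latent code is drawn as the mean plus scaled Gaussian noise, and a KL term pulls the codes toward a standard normal prior. The sketch below uses NumPy with hypothetical function names and hand-picked toy values standing in for real encoder outputs.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterized sample: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, diag(exp(log_var))) and N(0, I)."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])       # toy stand-in for an encoder's mean output
log_var = np.array([0.0, -2.0])  # toy stand-in for its log-variance output
z = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

Because the noise enters only through `eps`, gradients can flow through `mu` and `log_var` during training, and after training the decoder can be driven by samples from the prior alone.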

VAE-GAN

Figure 3: Architecture of VAE-GAN

While a GAN’s generated image outputs are often sharp and clear, a major disadvantage is that the training process can be unstable and prone to problems [goodfellow2014generative, Mao2016LeastSG]. Even if training problems can be solved, e.g., [arjovsky2017wasserstein], GANs still lack encoders that map real images to latent variables. Compared with GANs, VAE-generated images are often a bit blurrier, but the model structure in general is much more mathematically elegant and more easily trainable. To get the best of both worlds, Larsen and colleagues [larsen2015autoencoding] proposed an architecture that attaches an adversarial loss to a variational autoencoder, as shown in Figure 3.

Figure 4: Reasoning framework for solving Raven’s test problems using Gestalt image completion, using any pre-trained encoder-decoder-based image inpainting model. Elements $x_1$, $x_2$, and $x_3$ from the problem matrix form the initial input, combined into a single image, along with a mask that indicates the missing portion. These are passed through the encoder $F_\theta$, and the resulting image features in latent variable space are passed into the decoder $G_\phi$. This creates a new complete matrix image $X'$; the portion corresponding to the masked location is the predicted answer to the problem. This predicted answer $x_4'$, along with all of the answer choices $a_i$, are again passed through the encoder to obtain feature representations in latent space, and the answer choice most similar to $x_4'$ is selected as the final solution.

Our Gestalt Reasoning Framework

Figure 5: Examples of inpainting produced by the same VAE-GAN model [yu2018generative] trained on four different datasets. Left to right: ImageNet (objects), CelebA (faces), Places (scenes), and DTD (textures).

In this section, we present a general framework for modeling Gestalt visual reasoning on the Raven’s test or similar types of problems. Our framework is intended to be agnostic to any type of encoder-decoder-based inpainting model. For our experiments, we adopt a recent VAE-GAN inpainting model [yu2018generative]; as we use the identical architecture and training configuration, we refer readers to the original paper for more details about the inpainting model itself.

Our framework makes use of a pre-trained encoder $F_\theta$ and corresponding decoder $G_\phi$ (where $\theta$ and $\phi$ indicate the encoder’s and decoder’s learned parameters, respectively). The partially visible image to be inpainted, in our case, is a Raven’s problem matrix with the fourth cell missing, accompanied by a mask, which is passed as input into the encoder $F_\theta$. Then $F_\theta$ outputs an embedded feature representation $f$, which is sent as input to the generator $G_\phi$. Note that the learned feature representation could be of any form (a vector, matrix, tensor, or any other encoding) as long as it represents the latent features of input images.

The generator then outputs a generated image, and we cut out the generated part as the predicted answer. Finally, we choose the most similar candidate answer choice by computing the distance between the feature representation of the prediction and that of each answer choice, with all representations computed using the trained encoder again.

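This process is illustrated in Figure 4. More concisely, let $x_1$, $x_2$, $x_3$ be the three elements of the original problem matrix, $m$ be the image mask, and $X$ be the input comprised of these four images. Then, the process of solving the problem to determine the chosen answer $y$ can be written as:

$$y = \arg\min_{k \in S} \left\lVert F_\theta\big(G_\phi(F_\theta(X))_{ij}\big) - F_\theta(k) \right\rVert, \qquad \tfrac{h}{2} < i \le h, \quad \tfrac{w}{2} < j \le w,$$

where $F_\theta$ and $G_\phi$ are the pre-trained encoder and decoder, $h$ and $w$ are the height and width of the reconstructed image, and $S$ is the answer choice space.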
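The selection procedure described in this section can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: `encode`, `decode`, and `toy_inpaint` are hypothetical stand-ins for a pre-trained inpainting network, the missing cell is assumed to occupy the bottom-right quadrant of a 2x2 matrix, and Euclidean distance is assumed as the similarity measure.

```python
import numpy as np

CELL = 8  # side length of one matrix cell in pixels (toy value, an assumption)

def encode(image):
    """Hypothetical stand-in encoder: flattens an image into a feature vector."""
    return np.asarray(image, dtype=float).ravel()

def decode(features, shape):
    """Hypothetical stand-in decoder: reshapes features back into an image."""
    return features.reshape(shape)

def toy_inpaint(matrix, mask):
    """Stand-in for decoder(encoder(input)): fills the masked cell by mirroring
    the bottom-left cell, mimicking a learned left-right symmetry prior."""
    completed = decode(encode(matrix), matrix.shape).copy()
    h, w = completed.shape
    completed[h // 2:, w // 2:] = np.fliplr(completed[h // 2:, :w // 2])
    return np.where(mask, completed, matrix)  # keep the known pixels as given

def solve(x1, x2, x3, answer_choices, inpaint=toy_inpaint):
    """Return the index of the answer choice closest to the inpainted cell."""
    blank = np.zeros_like(x1)
    matrix = np.block([[x1, x2], [x3, blank]])
    h, w = matrix.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 2:, w // 2:] = True                # the missing fourth cell
    completed = inpaint(matrix, mask)
    predicted = completed[h // 2:, w // 2:]      # cut out the generated part
    target = encode(predicted)
    distances = [np.linalg.norm(target - encode(c)) for c in answer_choices]
    return int(np.argmin(distances))

# Toy problem: the right column is the mirror image of the left column.
rng = np.random.default_rng(0)
x1 = rng.random((CELL, CELL))
x3 = rng.random((CELL, CELL))
x2 = np.fliplr(x1)
choices = [x1, np.fliplr(x3), x3, x2]
chosen = solve(x1, x2, x3, choices)
```

Swapping `toy_inpaint` for a real encoder-decoder model, and `encode` for that model’s encoder, leaves the selection logic unchanged, which is what makes the framework agnostic to the particular inpainting model.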

Inpainting Models

For our experiments, we used the same image inpainting model [yu2018generative] trained on four different datasets. The first model, which we call Model-Objects, we trained from scratch so that we could evaluate Raven’s test performance at multiple checkpoints during training. The latter three models, which we call Model-Faces, Model-Scenes, and Model-Textures, we obtained as pre-trained models [yu2018generative]. Details about each dataset are given below.

Note: The reader may wonder why we did not train an inpainting model on Raven’s-like images, i.e., black and white illustrations of 2D shapes. Our rationale follows the spirit of human intelligence testing: people are not meant to practice taking Raven’s-like problems. If they do, the test is no longer a valid measure of their intelligence [hayes2015we]. Here, our goal was to explore how “test-naive” Gestalt image completion processes would fare. (There are many more nuances to these ideas, of course, which we discuss further in Related Work.)

Model-Objects. The first model, Model-Objects, was trained from scratch on the ImageNet dataset [russakovsky2015imagenet]. We began with the full ImageNet dataset containing 14M images non-uniformly spanning 20,000 categories such as “windows,” “balloons,” and “giraffes.” The model converged prior to one full training epoch on the randomized dataset; we halted training around 300,000 iterations, with a batch size of 36 images per iteration. The best Raven’s performance was found at around 80,000 iterations, which means that the final model we used saw only about 3M images in total during training.

Model-Faces. Our second model, Model-Faces, was trained on the Large-scale CelebFaces Attributes (CelebA) dataset [liu2015deep], which contains around 200,000 images of celebrity faces, covering around 10,000 individuals.

Model-Scenes. Our third model, Model-Scenes, was trained on the Places dataset [zhou2017places], which contains around 10M images spanning 434 categories, grouped into three macro-categories: indoor, nature, and urban.

Model-Textures. Our fourth model, Model-Textures, was trained on the Describable Textures Dataset (DTD) [cimpoi2014describing], which contains 5640 images, divided into 47 categories, of textures taken from real objects, such as knitting patterns, spiderwebs, or an animal’s skin.
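As a quick check on the Model-Objects figures quoted above, the number of training images seen is simply iterations times batch size. The short sketch below redoes that arithmetic; the constant and function names are illustrative, and the dataset size is the approximate 14M figure from the text.

```python
BATCH_SIZE = 36            # images per iteration, as stated for Model-Objects
DATASET_SIZE = 14_000_000  # approximate size of the full ImageNet dataset

def images_seen(iterations, batch_size=BATCH_SIZE):
    """Total number of training images processed after `iterations` steps."""
    return iterations * batch_size

best_checkpoint = images_seen(80_000)     # checkpoint with the best Raven's score
halted = images_seen(300_000)             # point at which training was stopped
fraction_of_epoch = halted / DATASET_SIZE
print(best_checkpoint, halted, round(fraction_of_epoch, 2))
```

The best checkpoint corresponds to 2.88M images, consistent with “about 3M,” and training stopped after 10.8M images, which is less than one pass over 14M images.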

Results

Figure 6: Caption

Four networks: ImageNet, faces, places, and DTD.

FIRST THING: table of actual Raven’s results: for each network, show the score on each set (A, AB, B, C, D, E).

Rows: one per network. Columns: six columns, one for each set, plus one column for the CPM total (A, AB, B) and one column for the SPM total (A, B, C, D, E). Cells: number correct (out of 12).

ImageNet: pick the best one, and explain what we did. Others: explain that they use pre-trained versions.

SECOND THING: ImageNet training graph (CPM accuracy as a function of training iteration) and also the loss graph.

THIRD THING: show results from the four networks on all five example problems.

If there are NO differences, then perhaps come up with example problems that showcase differences.

Show results of each network running on photographs of a face, a place, and an object.

Discussion

Figure 7: Caption
Figure 8: Caption

Whether neural networks can learn abstract reasoning or whether they merely rely on superficial statistics is a topic of recent debate…

Initially, the randomly initialized architecture can correctly answer 10 out of 36 problems (8 on average, actually), whereas random guessing should succeed on only about 1/6 of them. This should be attributed to the CNN structure [ulyanov2018deep].

The ultimate way to use this test as a test of machine intelligence, and the one most similar to the human situation, would be to invite the bot to sit down and to use verbal instructions to show how the problems work, and so on.

Reference Inputs Type Approach Evaluation
[carpenter1990one]
[kunda2013computational]
[strannegaard2013anthropomorphic]
[lovett2017modeling]
3D Objects on Turntable
3D Object
Intel Egocentric
EPFL-GIMS08
RGB-D
BigBIRD
Table 1: Computational models of various aspects of problem-solving on the Raven’s Progressive Matrices test or similar.

Related Work on the Raven’s Test

Over the decades, there have been many exciting efforts in AI to computationally model various aspects of problem solving for matrix reasoning and similar geometric analogy problems, beginning with Evans’ classic ANALOGY program [evans1968program]. In this section, we review some major themes that seem to have emerged across these efforts, situate our current work within this broader context, and point out important gaps that remain unfilled.

Note that we do not attempt to list the “test scores” achieved by various models for two reasons. First, these models have collectively explored so many problem variants, problem contents, pre-processing methods, model constraints, etc., that it is exceedingly difficult to make apples-to-apples comparisons among them.

Second, and more importantly, we feel that better scientific knowledge has come from the systematic, within-model experiments presented by many of these studies than from the absolute levels of performance they achieve. Raven’s is not now (and probably never will be) a task that is of practical utility for AI systems in the world to be solving well, and so treating it as a black-box benchmark is of limited value. However, the test continues to be enormously profitable as a research tool for generating insights into the organization of intelligence, both in humans and in artificial systems.

Knowledge-based versus data-driven. Early models took a knowledge-based approach, meaning that they contained explicit, structured representations of certain key elements of domain knowledge. For example, Carpenter and colleagues (\citeyearcarpenter1990one) built a system that matched relationships among problem elements according to one of five predefined rules. Knowledge-based models tend to focus on what an agent does with its knowledge during reasoning; where this knowledge might come from remains an open question.

On the flip side, a recently emerging crop of data-driven models extract domain knowledge from a training set containing example problems that are similar to the test problems the model will eventually solve, e.g., [hoshen2017iq]. Data-driven models tend to focus on interactions between training data, learning architectures, and learning outcomes; how knowledge might be represented in a task-general manner and used flexibly during reasoning and decision-making remain open questions.

…, which followed a deliberative, Analytic approach. One of the first algorithmic descriptions addressing the Raven’s test specifically appeared in 1974, when Hunt described two …

[carpenter1990one, rasmussen2011neural, kunda2013computational, lovett2017modeling, strannegaard2013anthropomorphic]

List previous computational models, and point out that NONE of them have modeled human reasoning using Gestalt perceptual principles

1. The IQ of Neural Networks [hoshen2017iq]. The authors generated a Raven-like dataset of geometric shapes in which six images are given as input to a neural network: two question cells and four answer choices. Only one of the four candidate answers correctly completes the progression begun by the two question cells.

If we put the question cells and the answer choice in a row, we can observe a progression in degree of rotation, size, reflection, number of objects, or shade of color, or an addition relation among the three cells. A neural model trained on these data either predicts the probability of each answer choice or generates the third cell. The model, trained and tested on this dataset, achieved, as the authors put it, the top 5% of human performance. The score on the actual Raven’s matrices is not reported.

2. Measuring abstract reasoning in neural networks [barrett2018measuring]. This work examines the ability of different neural architectures, such as CNN-MLP, ResNet, LSTM, and WReN (Wild Relation Network), to generalize on a Raven-like dataset. Experiments show that WReN has the strongest inductive bias toward relational reasoning tasks.

3. Learning to make analogies by contrasting abstract relational structure [hill2019learning]. This paper points out that a different training sequence can facilitate the learning of abstract relational structure. For example, when a row of cells showing a progression in object quantity is contrasted with a row showing a progression in object darkness, a dataset arranged in this manner greatly increases the model’s ability to generalize from trivial, specific aspects of the input images to a more general conceptual common ground shared by both rows.

4. Improving Generalization for Abstract Reasoning Tasks Using Disentangled Feature Representations [steenbrugge2018improving]. This paper demonstrates a two-stage training paradigm: first, learn in an unsupervised manner a feature extractor that encodes a disentangled feature representation; then, deploy a relational reasoning module in the latent disentangled feature space, with the correct answer as the supervision signal. Compared with training the model end to end without disentanglement, the new paradigm performs reasonably better. The results reported for this two-stage paradigm are preliminary.

5. Are Disentangled Representations Helpful for Abstract Visual Reasoning? [van2019disentangled]

This paper uses RPM-like 3-by-3 visual reasoning matrices generated from the dSprites dataset to test extensively whether disentangled representations truly facilitate downstream abstract reasoning tasks, compared with training both the encoder and the relational reasoning module end to end using WReN (Wild Relation Network). The paper shows that, for modeling reasoning, the two-stage paradigm leads to quicker learning with fewer examples.

6. Raven: A dataset for relational and analogical visual reasoning [zhang2019raven]. The authors created a Raven-like dataset with structural annotations for augmentation purposes and human performance results for comparison. When the annotations are used as augmentation, the models tested in [barrett2018measuring] all see a boost in test accuracy.

All previous efforts assume that as long as a model achieves a perfect score on Raven’s or Raven-like tests, that score reflects the model’s intelligence, regardless of what the training inputs are. However, in the human case, studies have found that the reliability of the Raven’s test depends heavily on the test taker’s unfamiliarity with the test style [hayes2015we]. The test is no longer valid for anyone who has been trained on millions of questions close to Raven’s. Though these positive results do show the great potential of neural networks for abstract reasoning, the fact that these models need tens of thousands of images to generalize reveals a great disparity with the human mind [bors2003effect].

Figure 9: Caption

Conclusion

In conclusion, …

References
