Improving Generalization for Abstract Reasoning Tasks Using Disentangled Feature Representations

Xander Steenbrugge
ML6 & IDlab, Ghent University - imec
Sam Leroux, Tim Verbelen, Bart Dhoedt
IDlab, Ghent University - imec

In this work we explore the generalization characteristics of unsupervised representation learning by leveraging disentangled VAEs to learn a useful latent space on a set of relational reasoning problems derived from Raven Progressive Matrices. We show that the latent representations, learned by unsupervised training using the right objective function, significantly outperform the same architectures trained with purely supervised learning, especially when it comes to generalization.




32nd Conference on Neural Information Processing Systems (NIPS 2018), Workshop on Relational Representation Learning, Montréal, Canada.

1 Introduction

Reasoning about abstract concepts has been a long-standing challenge in machine learning. Recent work by Barrett et al. [1] introduces a concrete problem setting for testing generalization in the form of a relational reasoning problem derived from Raven Progressive Matrices, which are often used in human IQ-tests. The problem consists of a grid of 3-by-3 related images where the bottom-right one is missing, and a set of 8 possible answers, of which exactly one is correct. Two examples are shown in Figure 1. In this work we use the same dataset.

Figure 1: Two example PGM problems: a grid of 3-by-3 related images where the bottom right one is missing and a set of 8 possible answers. The correct choice panels are A and C respectively.

To create a Procedurally Generated Matrices (PGM) dataset, a set of properties is first randomly sampled from the following primitive sets:

  • Relation types (R, with elements r): progression, XOR, OR, AND, consistent union

  • Attribute types (A, with elements a): size, type, colour, position, number

  • Object types (O, with elements o): shape, line

The structure S of a PGM, then, is a set of triples S = {[r, o, a] : r ∈ R, o ∈ O, a ∈ A}. These triples determine the challenge posed by one particular matrix problem. In the dataset used, up to 4 triples can be present in a single problem: 1 ≤ |S| ≤ 4.
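The triple-sampling step can be sketched as follows. This is a simplification: the real generator enforces validity constraints between primitives (e.g. not every attribute applies to every object type), which we ignore here, and the function name is ours.

```python
import random

# Primitive sets as listed above.
RELATIONS = ["progression", "XOR", "OR", "AND", "consistent union"]
OBJECTS = ["shape", "line"]
ATTRIBUTES = ["size", "type", "colour", "position", "number"]

def sample_structure(max_triples=4, seed=None):
    """Sample a PGM structure S: a set of 1 to 4 distinct [r, o, a] triples."""
    rng = random.Random(seed)
    n = rng.randint(1, max_triples)
    triples = set()
    while len(triples) < n:
        triples.add((rng.choice(RELATIONS),
                     rng.choice(OBJECTS),
                     rng.choice(ATTRIBUTES)))
    return sorted(triples)
```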

To solve a PGM problem, Barrett et al. propose a Wild Relation Network (WReN) architecture [2] as shown in Figure 2. In this architecture, all given images (the 8 context panels and the 8 choice panels, all represented as 80x80 grayscale images) are first processed by a small convolutional neural network (CNN), resulting in 16 feature embeddings (one per panel). The 8 context embeddings are then sequentially combined with each choice embedding, yielding a total of 8 stacks of 9 embeddings. These are finally processed by the WReN network, yielding a single scalar value for each choice panel, indicating its ‘matching score’ with the given problem. The entire pipeline is then trained to produce the label of the correct missing panel as the output answer by optimizing a cross-entropy loss using stochastic gradient descent. To include spatial information in the panel embeddings, each CNN embedding is also concatenated with a one-hot label indicating the panel’s position, followed by a linear projection.
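The panel-stacking step described above can be sketched with placeholder embeddings (the CNN, position tags, and the RN itself are elided; the random arrays and the 512-dimensional embedding size stand in for the real learned features):

```python
import numpy as np

n_context, n_choices, embed_dim = 8, 8, 512

# Hypothetical panel embeddings, as would be produced by the CNN embedder
# (plus position tag and linear projection).
context = np.random.randn(n_context, embed_dim)
choices = np.random.randn(n_choices, embed_dim)

# Each choice embedding is appended to the 8 context embeddings,
# giving 8 stacks of 9 embeddings: one stack per candidate answer.
stacks = np.stack([np.concatenate([context, c[None, :]], axis=0)
                   for c in choices])
assert stacks.shape == (n_choices, n_context + 1, embed_dim)

# The RN maps each stack to one scalar matching score; a softmax over
# the 8 scores gives the predicted answer distribution (random stand-ins).
scores = np.random.randn(n_choices)
probs = np.exp(scores) / np.exp(scores).sum()
```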

Figure 2: WReN model from [1]: A CNN processes each context panel and an individual answer choice panel independently to produce 9 vector embeddings. This set of embeddings is then passed to an RN network [2], whose output is a single sigmoid unit encoding the “score” for the associated answer choice panel. 8 such passes are made through this network (here we only depict 2 for clarity), one for each answer choice, and the scores are put through a softmax function to determine the model’s predicted answer.

Although WReNs achieve reasonable performance (62.6% classification accuracy, cf. Table 1) on a randomly held-out test set of the complete training data (which includes triples built from all possible primitives in R, O and A), the generalization performance on new reasoning problems (containing primitives not seen during training) is significantly worse. This shows that while the model manages to fit the training distribution reasonably well, it fails to generalize in a meaningful way. One of the reasons for this lack of strong generalization is that there is no explicit pressure for the model to discover the generative, latent factors of the problem domain. In fact (as can be seen in Figure 8 of the appendix), the learned CNN embeddings seem to completely disregard the underlying causal structure of the problem domain. In this paper, we aim to improve the generalization performance of WReNs by first learning a disentangled latent space that encodes the PGM panels, and then learning to reason within this space using the RN architecture.

2 Unsupervised representation learning with Variational Autoencoders

Because the structure of the Raven problems depends explicitly on a set of generative factors (such as shape, size, colour, …), recovering these variables in a suitable latent space should prove beneficial for solving the relational reasoning problem. To test this hypothesis, we leverage Variational Autoencoders [3] to learn an unsupervised mapping from high-dimensional pixel space to a lower-dimensional and more structured latent space that is subsequently used by the WReN model to complete the relational reasoning task.

The behavior of these models has been widely studied [4, 5, 6, 7] and a clear trade-off between desirable latent space properties (such as disentanglement of generative factors, linearity & sparsity) and reconstruction quality is usually present. The effect of these constraints on the generalization strength of the resulting latent space, however, has not been widely studied.

As such, our setup replaces the CNN-encoder of [1] with a disentangled ‘β-VAE’ that is trained separately from the WReN model using the modified ELBO optimization objective as described in [6]:

$$\mathcal{L}(\theta, \phi; \mathbf{x}, \mathbf{z}, \beta) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\big[\log p_\theta(\mathbf{x}|\mathbf{z})\big] - \beta \, D_{KL}\big(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\big)$$

In this case, different β-values control the trade-off between reconstruction quality and latent variable disentanglement.
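A per-example version of this β-weighted objective (negated, as a loss to minimize) can be sketched as below. The squared-error reconstruction term is an assumption on our part; with a factorized Gaussian posterior q(z|x) and a unit Gaussian prior p(z), the KL term has the closed form used here.

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, logvar, beta):
    """Negative modified ELBO for one example: reconstruction + beta * KL.

    mu and logvar parameterize the factorized Gaussian posterior q(z|x);
    with a unit Gaussian prior p(z) the KL divergence has the closed form
    below. The squared-error reconstruction term is an assumption: the
    paper does not state which likelihood it uses.
    """
    recon = np.sum((x - x_recon) ** 2)
    kl = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))
    return recon + beta * kl
```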

One common problem with the β-VAE setting is that high β-factors often constrain the latent space to such an extent that encoding small visual details (which have only a marginal effect on reconstruction error) does not outweigh the KL-penalty that follows from the corresponding divergence from the imposed prior distribution, even though these details are often task-critical. To solve this problem we apply a variable-β training regime as described in Appendix D: we use the objective function above with a β that is gradually increased over the course of training. The effects of this training regime can be seen in Figure 3.

Figure 3: Effect of β on reconstruction quality. Three sets of reconstructions are shown using the same VAE trained with different training regimes. In the variable-β scenario the latent space first learns to capture small visual details and only later receives pressure to disentangle them.

We trained various models with 64 latent dimensions on the PGM dataset using different β-factors and annealing schemes. Details of our VAE encoder and decoder architecture and training scheme are discussed in Appendices A and D. After training we can visualize that the disentangled latent space indeed captures many of the underlying generative factors of the problem domain (see Figure 4). A more extended set of latent space interpolations is shown in Appendix B.

Figure 4: Latent space visualization obtained by encoding the input images (left) and interpolating between the support boundaries of the posterior distribution for two selected latent variables, while keeping all other latents constant before rendering the resulting latent vector through the decoder network. Notice how the VAE’s latent space (β=4.00) clearly disentangles multiple generative factors such as the colour/presence of the diamond-shaped raster (top rows) or the shape of an object in a single position (bottom rows).

3 Leveraging the learned latent space for relational reasoning

Here we test the generalization properties of the learned latent space by using the learned image embeddings in the PGM problem setting. To start the WReN training process, we freeze the pretrained VAE-encoder graph and use it to initialize the encoder step of the WReN architecture. In order to create a fair comparison, our VAE architecture uses the exact same encoder network as the CNN-embedder in [1]. But, since we use a latent space of 64 dimensions, the convolutional features from the encoder are passed through two FC-layers mapping onto 64-dimensional vectors representing the means and variances of a factorized Gaussian distribution. At training time we randomly sample from this posterior to get the latent representations used in further processing. At test time, we simply use the mean vector. Apart from this difference in input feature dimensionality (64 in our VAE case vs 512 in the default CNN-encoder case), the entire WReN architecture is identical to the one used in [1].

Finally, the WReN model is trained for 6 epochs using a fixed encoder and then finetuned end-to-end for another 2 epochs to get the results displayed in Table 1. Notice that while we outperform the default WReN model trained with purely supervised learning on various generalization regimes as intended, surprisingly we also do better on three of the four validation sets, indicating that the disentangled latent space does in fact make the relational reasoning problem more tractable for the RN network.

Model-type              CNN-WReN [1]                     VAE-WReN (β=4.00)
Generalization regime   Val (%)  Test (%)  Test (κ)      Val (%)  Test (%)  Test (κ)
Neutral                 63.0     62.6      0.573         64.8     64.2      0.591
H.O. Triple Pairs       63.9     41.9      0.336         64.6     43.6      0.355
H.O. Attribute Pairs    46.7     27.2      0.168         70.1     36.8      0.278
H.O. Triples            63.4     19.0      0.074         59.5     24.6      0.138
Table 1: Relational reasoning results. Each data regime consists of 1.2M training images, a held-out validation set of 20K images drawn from the same problem distribution, and a generalization test set of 200K images containing new problem sets not seen during training. Regimes are sorted according to the degree of generalization required to solve them (more info in [1]). We also list Cohen’s kappa values, which vary linearly from 0 (random guessing) to 1 (oracle).
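The listed kappa values are consistent with rescaling accuracy against the 1/8 chance rate of an 8-way choice task:

```python
def cohens_kappa(accuracy, n_choices=8):
    """Cohen's kappa against chance agreement for an n-way choice task:
    kappa = (p_o - p_e) / (1 - p_e), with expected chance accuracy
    p_e = 1 / n_choices (1/8 for the 8 PGM answer panels)."""
    p_e = 1.0 / n_choices
    return (accuracy - p_e) / (1.0 - p_e)

# Reproduces Table 1, e.g. the CNN-WReN neutral test accuracy of 62.6%:
round(cohens_kappa(0.626), 3)  # -> 0.573
```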

4 Conclusion

In this paper, we show that disentangled variational autoencoders can be leveraged to learn a mapping from high-dimensional pixel space to a low-dimensional and more structured latent space without any explicit supervision. This disentangled latent space can subsequently be leveraged for solving a non-trivial relational reasoning problem and by doing so, outperforms the same architecture trained using a fully supervised approach.

4.1 Future work

Future work will focus on further investigating the desirable characteristics of a generally useful latent space (disentanglement, linearity, sparsity, …) and explore new objective functions that can be leveraged for unsupervised representation learning, such as various representation losses [8, 9], GAN-inspired discriminator networks [10] and predictive capacity [11, 12]. Additionally, the current WReN setup does not leverage the intrinsic variational properties of the learned latent space.


  • [1] D. G. T. Barrett, F. Hill, A. Santoro, A. S. Morcos, and T. Lillicrap. Measuring abstract reasoning in neural networks. International Conference on Machine Learning, 2018.
  • [2] A. Santoro, D. Raposo, D. G. T. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. Conference on Neural Information Processing Systems (NIPS), 2017.
  • [3] D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.
  • [4] H. Kim and A. Mnih. Disentangling by Factorising. Conference on Neural Information Processing Systems (NIPS), Learning Disentangled Representations: From Perception to Control Workshop, 2017.
  • [5] D. Jimenez Rezende and F. Viola. Taming VAEs. ArXiv e-print arXiv:1810.00597, 2018.
  • [6] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in β-VAE. ArXiv e-print arXiv:1804.03599, April 2018.
  • [7] I. Higgins, L. Matthey, X. Glorot, A. Pal, B. Uria, C. Blundell, S. Mohamed, and A. Lerchner. Early Visual Concept Learning with Unsupervised Deep Learning. ArXiv e-print arXiv:1606.05579, June 2016.
  • [8] I. Higgins, A. Pal, A. A. Rusu, L. Matthey, C. P. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner. DARLA: Improving Zero-Shot Transfer in Reinforcement Learning. Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
  • [9] A. Boesen Lindbo Larsen, S. Kaae Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
  • [10] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. CVAE-GAN: Fine-Grained Image Generation through Asymmetric Training. IEEE International Conference on Computer Vision (ICCV), 2017.
  • [11] Y. Bengio. The Consciousness Prior. ArXiv e-print arXiv:1709.08568, September 2017.
  • [12] A. van den Oord, Y. Li, and O. Vinyals. Representation Learning with Contrastive Predictive Coding. ArXiv e-print arXiv:1807.03748, July 2018.


Appendix A Architecture details

Our VAE setup uses a convolutional encoder-decoder architecture and a unit Gaussian N(0, I) distribution as prior for the latent variables. The encoder has 4 convolutional layers interleaved with ReLU layers. The decoder uses the same architecture with transposed-convolution layers. Every conv-layer uses 32 kernels of size 3 and a stride of 2. We did not use any pooling layers.

The VAE bottleneck has 64 latent variables which are parameterized through their respective means and variances. We use the common reparameterization trick from [3] for backpropagating through the non-deterministic bottleneck.
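The reparameterization trick mentioned above can be sketched as follows. Writing the sample as a deterministic function of (mu, logvar) plus independent noise is what lets gradients flow through the stochastic bottleneck; at test time the mean vector is used directly (Section 3).

```python
import numpy as np

def reparameterize(mu, logvar, rng=np.random):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps, eps ~ N(0, I).

    mu and logvar are the outputs of the encoder's two FC heads; sigma
    is recovered from the log-variance as exp(0.5 * logvar).
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps
```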

The VAE was trained using the Adam optimizer with a learning rate of 0.0003 and a batch size of 32 PGM problems per batch (16 panels each, i.e. 512 panel images).

In the WReN network, we use dropout in the penultimate layer. All models were implemented in PyTorch and trained on a single Tesla K80 GPU.

Appendix B Additional latent space visualizations

To further aid the reader’s intuition, we provide a number of additional visualizations that we found helpful for understanding the latent space behavior of disentangled VAEs. See Figures 5, 6, 7.

Figure 5: Latent space interpolations. VAE with β=0.01 (avg-MSE = 6.06, avg-KL-divergence = 120). Many latent traversals change a variety of generative factors simultaneously. Note that the numerical bounds of the interpolations for each latent variable are clipped using the support of the posterior distribution (generated by passing 5000 random training images through the encoder network).
Figure 6: Latent space interpolations. VAE with β=4.00 (avg-MSE = 20.2, avg-KL-divergence = 22). Most latent traversals correspond to a single, clear generative factor. Again, the interpolation bounds are clipped using the support of the posterior distribution.
Figure 7: Visualization of the latent distribution for the two variables with the highest average KL-divergence from a disentangled VAE with β=4.0. We run 5000 randomly sampled images through the encoder network and plot the resulting distributions of the posterior means and variances as well as 5000 random samples drawn from each of the resulting Gaussian distributions. Notice that latent variables with a very high KL-divergence from the N(0, I) prior (e.g. the top row, where the posterior variance is always close to zero) can still result in a near-Gaussian distribution over sampled z-values.

Appendix C Failure modes of the WReN CNN

One of the problems with the purely supervised CNN approach is that the model receives no explicit pressure to discover the generative, latent structure of the problem domain. This can be clearly seen in Figure 8.

Figure 8: Input images (top) and reconstructions (bottom) obtained by training a convolutional-decoder on the panel features extracted by the CNN embedder from [1]. As can be seen, the generative factors defining the problem domain are not contained within the learned embedding space, lending support to the claim that the CNN simply overfits on specific visual features in the training examples instead of discovering useful latent structure.

Unfortunately, the VAE approach used here, based on a pixel-reconstruction loss, also has its own drawbacks, as can be seen in Figure 9. This lends support to the widely held view (see e.g. [8, 9]) that simple, pixel-based reconstruction metrics are not ideal optimization objectives for visual representation learning.

Figure 9: When forcing the VAE to trade off reconstruction quality for a smaller KL-divergence from the latent prior, small, grayscale objects are sacrificed first since they correspond to the smallest increase in MSE penalty in pixel space. Larger and darker objects maintain good reconstruction quality until much higher β values are imposed. (Shown here on a custom dataset we generated for testing purposes.)

Appendix D Impact of β annealing

One common problem with β-VAEs is that imposing a large disentanglement constraint often causes the latent space to collapse to supporting only the most salient modes of the visual input domain, failing to capture more fine-grained visual information that is often task-critical.

To tackle this problem we start training the VAE with a low β-factor and gradually increase the disentanglement constraint until a desirable state is reached. By primarily focusing on the visual reconstruction error in the beginning of training, the latent space learns to capture most of the visual information before the disentanglement constraint begins dominating the training objective, leading to a much better final representation model.
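A minimal sketch of such a regime is a linear warm-up of β over training steps. The linear ramp, its endpoints, and the function name are illustrative assumptions; the paper does not give its exact schedule.

```python
def beta_schedule(step, warmup_steps, beta_start=0.0, beta_end=4.0):
    """Linear beta warm-up: reconstruction dominates early training, and
    the disentanglement pressure is phased in gradually, then held fixed.
    The ramp shape and endpoints are illustrative assumptions."""
    t = min(step / warmup_steps, 1.0)
    return beta_start + t * (beta_end - beta_start)
```

In a training loop, the scheduled value would simply replace the fixed β in the modified ELBO objective at every optimization step.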
