Separating high-dimensional data like images into independent latent factors remains an open research problem. Here we develop a method that jointly learns a linear independent component analysis (ICA) model with non-linear bijective feature maps. By combining these two methods, ICA can learn interpretable latent structure for images. For non-square ICA, where we assume the number of sources is less than the dimensionality of data, we achieve better unsupervised latent factor discovery than flow-based models and linear ICA. This performance scales to large image datasets such as CelebA.


1 Introduction

In linear Independent Component Analysis (ICA), data is modelled as having been created by linearly mixing independent latent sources Cardoso (1989a, b); Jutten and Herault (1991); Comon (1994); Bell and Sejnowski (1995); Cardoso (1997); Lee et al. (2000). The canonical problem is blind source separation: the aim is to estimate the original sources of a mixed set of signals by learning an unmixing or decorrelating matrix (we use these terms interchangeably) which, when multiplied with data, recovers the values of these sources. While linear ICA is a powerful approach for unmixing signals such as sound Everson and Roberts (2001); Hyvärinen et al. (2001), it has not been developed as effectively for learning compact representations of high-dimensional data like images, where the linear assumption is limiting. Non-linear ICA methods, which assume the data has been created from a non-linear mixture of latent sources, offer better performance on such data.

In particular, flow-based models have been proposed as a non-linear approach to square ICA, where we assume the dimensionality of our latent source space is the same as that of our data Deco and Brauer (1995); Parra et al. (1995, 1996); Parra (1996); Dinh et al. (2015, 2017). These flows parameterise a bijective mapping between data and a feature space of the same dimension and can be trained under a maximum likelihood objective for a chosen prior in that space. While such flows can be extremely powerful generative models, for most image data one would want fewer latent variables than there are pixels in an image. In such situations, we wish to learn a non-square (dimensionality-reducing) ICA representation over images.

Here we propose a novel methodology for performing non-square ICA using a model with two jointly trained parts: a non-square linear ICA model operating on a feature space output by a bijective flow. The bijective flow is tasked with learning a representation for which linear ICA is a good model. It is as if we are learning the data for our ICA model. Further, to induce optimal source separation in our model, we introduce novel theory for the parameterisation of decorrelating, non-square ICA matrices close to the Stiefel Manifold Stiefel (1935), the space of orthonormal rectangular matrices. By doing so we introduce a novel non-square linear ICA model which can successfully induce dimensionality reduction in flow-based models and scales non-square non-linear ICA methods to high-dimensional image data.

We show that our hybrid model Bijecta, a flow jointly trained with our ICA model, outperforms each of its constituent components in isolation in terms of latent factor discovery. We demonstrate this on the MNIST, Fashion-MNIST, CIFAR-10, and CelebA datasets.

More broadly we demonstrate that:

  • By combining bijective mappings with non-square linear ICA we are able to learn a low dimensional ICA source representation for high-dimensional data.

  • Our method induces a concentration of information into a low-dimensional manifold in the bijective space of the flow, unlike flows trained under a standard base distribution.

2 Independent Component Analysis

ICA is a highly diverse modelling paradigm with numerous variants: learning a mapping vs learning a model, linear vs non-linear, different loss functions, different generative models, and a wide array of methods of inference Cardoso (1989a); Jutten and Herault (1991); Mackay (1996); Roweis and Ghahramani (1999); Hyvärinen and Pajunen (1999); Lee et al. (2000); Lappalainen and Honkela (2000); Karhunen (2001). Generally, the goal of ICA is to learn a set of statistically independent sources that ‘explain’ our data.

In this paper, we follow the approach of specifying a generative model and finding point-wise maximum likelihood estimates of model parameters. This variety of ICA starts from demonstrations that earlier infomax-principle Linsker (1989) approaches to ICA Bell and Sejnowski (1995) are equivalent to maximum likelihood training Mackay (1996); Cardoso (1997); Pearlmutter and Parra (1997); Roberts (1998); Everson and Roberts (1999). Mean-field variational inference for ICA, both for non-linear and linear, has been developed in Lawrence and Bishop (2000); Choudrey (2000); Valpola et al. (2003); Roberts and Choudrey (2004); Honkela and Valpola (2005); Winther and Petersen (2007).

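Concretely, we have a model with latent sources $s \in \mathcal{S} = \mathbb{R}^{d_s}$ generating data $x \in \mathcal{X} = \mathbb{R}^{d_x}$, with $d_s \le d_x$. The generative model for linear ICA factorises as

$$p(x, s) = p(x \mid s)\, p(s), \tag{1}$$

where $p(s)$ is a set of independent distributions appropriate for ICA,

$$p(s) = \prod_{i=1}^{d_s} p(s_i). \tag{2}$$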

In linear ICA, where all mappings are simple matrix multiplications, the priors over the sources cannot be Gaussian if the model is to be identifiable. Recall that we are mixing our sources to generate our data: a linear mixing of Gaussian random variables is itself Gaussian, so unmixing is impossible Lawrence and Bishop (2000). To break this symmetry and be able to unmix, we can choose any heavy-tailed or light-tailed non-Gaussian distribution as our prior $p(s)$. That gives us axis alignment and independence between sources.
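As an illustration of the generative view above, the following is a minimal sketch of sampling from a linear ICA model with Laplace sources. The names `A`, `s`, `x`, `sample_linear_ica` and `excess_kurtosis`, as well as the dimensions, are assumptions made for this example only. The excess kurtosis is used here simply as a quick check that the sources are non-Gaussian.

```python
import numpy as np

def sample_linear_ica(A, n, rng):
    """Draw n observations x = A s from independent unit-scale Laplace sources s."""
    d_s = A.shape[1]
    s = rng.laplace(loc=0.0, scale=1.0, size=(n, d_s))  # independent, heavy-tailed sources
    x = s @ A.T                                          # linear mixing, one row per sample
    return s, x

def excess_kurtosis(v):
    """Per-column excess kurtosis: 0 for a Gaussian, 3 for a Laplace."""
    c = v - v.mean(axis=0)
    return (c ** 4).mean(axis=0) / (c ** 2).mean(axis=0) ** 2 - 3.0

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))            # hypothetical mixing matrix: 2 sources, 5-dimensional data
s, x = sample_linear_ica(A, 50_000, rng)
print(excess_kurtosis(s))              # clearly positive, so the sources are non-Gaussian
```

Replacing `rng.laplace` with `rng.normal` would drive the excess kurtosis towards zero, which is the Gaussian case in which unmixing is not possible.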

2.1 Non-Linear ICA

One approach to extend ICA is to have a non-linear mapping acting on the independent sources and data Burel (1992); Deco and Brauer (1995); Yang et al. (1998); Karhunen (2001); Almeida (2003); Valpola et al. (2003). In general non-linear ICA models have been shown to be hard to train, having problems of unidentifiability Hyvärinen and Pajunen (1999); Karhunen (2001); Almeida (2003); Hyvarinen et al. (2019). This means that for a given dataset the model has numerous local minima it can reach under its training objective, with potentially very different learnt sources associated with each.

Some non-linear ICA models have been specified with additional structure, such as putting priors on variables Lappalainen and Honkela (2000) or specifying the precise non-linear functions involved Lee and Koehler (1997); Taleb (2002), reducing its space of potential solutions. Recent work Khemakhem et al. (2020) has given a proof that conditioning the source distributions on some always-observed side information, such as time index or class of data, can be sufficient to induce identifiability in non-linear ICA.

3 Flows

Flows are models that stack numerous invertible changes of variables. One can then specify a relatively simple base distribution and learn a sequence of (invertible) transforms, defined to have tractable Jacobian matrices, such that one’s data is likely after that mapping.

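Given a latent variable $z \in \mathcal{Z} = \mathbb{R}^{d_x}$, we can specify our distribution over data $x$ as

$$p(x) = p(z)\left|\det \frac{\partial f(x)}{\partial x}\right|, \qquad z = f(x), \tag{3}$$

where $f$ is a bijection between data and latent space and $p(z)$ is the base distribution over the latent (Rezende and Mohamed, 2015; Papamakarios et al., 2019). To create more powerful and flexible distributions for $x$ we can use the properties of function composition to specify $f$ as a series of transformations linking our simple prior to a more complex multi-modal distribution, e.g. for a series of $K$ mappings,

$$z = f_K \circ \cdots \circ f_1(x). \tag{4}$$

By the properties of determinants under function composition we obtain

$$p(x) = p(z_K) \prod_{i=1}^{K} \left|\det \frac{\partial f_i(z_{i-1})}{\partial z_{i-1}}\right|, \tag{5}$$

where $z_i$ denotes the variable resulting from the transformation $f_i(z_{i-1})$, $p(z_K)$ defines a density on the topmost variable $z_K$, and the bottom-most variable is our data ($z_0 = x$).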
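The composition rule can be sketched numerically. The example below is only an illustration under stated assumptions: each layer is a toy element-wise affine bijection (not a spline), the layers map from data towards the latent, and the base distribution is a factorised unit Laplace. The names `laplace_logpdf`, `flow_logpdf` and `layers` are hypothetical.

```python
import numpy as np

def laplace_logpdf(z):
    """Log-density of a factorised unit Laplace base distribution."""
    return np.sum(-np.abs(z) - np.log(2.0), axis=-1)

def flow_logpdf(x, layers):
    """log p(x) for z = f_K(...f_1(x)), each f_i(z) = scale * z + shift element-wise."""
    z = x
    log_det = np.zeros(x.shape[:-1])
    for scale, shift in layers:
        z = scale * z + shift
        # The Jacobian of an element-wise affine map is diagonal,
        # so log|det| is the sum of log|scale|.
        log_det = log_det + np.sum(np.log(np.abs(scale)))
    return laplace_logpdf(z) + log_det

# Two stacked 1-D layers; the resulting density should integrate to one.
layers = [(np.array([2.0]), np.array([0.5])), (np.array([-0.3]), np.array([1.0]))]
grid = np.linspace(-60.0, 60.0, 240_001)[:, None]
density = np.exp(flow_logpdf(grid, layers))
mass = density.sum() * (grid[1, 0] - grid[0, 0])   # Riemann sum over the grid
print(mass)
```

The log-determinants of the individual layers simply add, which is what makes deep compositions tractable whenever each layer has a cheap Jacobian determinant.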

3.1 Coupling Layers

Computing the determinant of the Jacobian in Eq. (3) can be prohibitively costly, especially when composing multiple functions as in Eq. (5). To address this, most flows use coupling layers that enforce a lower triangular Jacobian, such that the determinant of the Jacobian is simply the product of its diagonal elements. We use the coupling layers of Durkan et al. (2019) to enforce this lower triangular structure. For an outline of these coupling layers, see Appendix E.
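To make the triangular structure concrete, here is a minimal sketch of a coupling layer. It deliberately swaps in a simple affine transformation and a toy linear conditioner in place of the rational quadratic splines and neural networks used in this paper; the names `conditioner`, `coupling_forward`, `coupling_inverse`, `W_s` and `W_t` are assumptions for this example.

```python
import numpy as np

def conditioner(x_a, W_s, W_t):
    """Toy conditioner: log-scale and shift for the second half, computed from the first half."""
    return np.tanh(x_a @ W_s), x_a @ W_t

def coupling_forward(x, W_s, W_t):
    d = x.shape[-1] // 2
    x_a, x_b = x[..., :d], x[..., d:]
    log_s, t = conditioner(x_a, W_s, W_t)
    y_b = x_b * np.exp(log_s) + t            # only the second half is transformed
    # The Jacobian is lower triangular with diagonal (1, ..., 1, exp(log_s)).
    return np.concatenate([x_a, y_b], axis=-1), np.sum(log_s, axis=-1)

def coupling_inverse(y, W_s, W_t):
    d = y.shape[-1] // 2
    y_a, y_b = y[..., :d], y[..., d:]
    log_s, t = conditioner(y_a, W_s, W_t)    # first half is unchanged, so it can be reused
    x_b = (y_b - t) * np.exp(-log_s)
    return np.concatenate([y_a, x_b], axis=-1)

rng = np.random.default_rng(0)
W_s, W_t = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
x = rng.normal(size=(3, 4))
y, log_det = coupling_forward(x, W_s, W_t)
x_rec = coupling_inverse(y, W_s, W_t)
```

Because the first half of the input passes through unchanged, the inverse never needs to invert the conditioner itself, and the log-determinant is a plain sum.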

3.2 Spline Flows

Generally, $f$ must be bijective to be invertible. A powerful and flexible class of functions that satisfies this requirement is that of monotonic rational quadratic splines (RQS) Durkan et al. (2019).

We use these splines as part of the coupling layers detailed above and parameterise them using deep neural networks. These networks encode monotonically increasing knots, which are the coordinate pairs through which the function passes. These networks also encode the derivative at each of these knots. Using these parameters we can interpolate values between each of the knots using the equation for the RQS transformation detailed in Durkan et al. (2019). The resulting function is highly flexible, so RQS flows require fewer composed mappings to achieve good performance relative to other coupling layers.

4 Bijecta: Combining Flows with ICA

We combine linear ICA with a dimensionality-preserving invertible flow $f$. The flow acts between our data space $\mathcal{X}$ and the representation $\mathcal{Z}$ for the linear ICA generative model.

We want the latent representation $z$ to be structured such that it is well fit by the simple, linear ICA model. In some sense, this can be thought of as ‘learning the data’ for an ICA model, where the ‘data’ the ICA model acts on is the output of the invertible feature map defined by the flow. The inferred latent sources of the ICA model can have a lower dimensionality than the data or flow representation.

We choose our base ICA source distribution, and the ICA generative likelihood, to be multivariate Laplace distributions Everson and Roberts (2001),

$$p(s) = \mathrm{Laplace}(s \mid 0, 1), \qquad p(z \mid s) = \mathrm{Laplace}(z \mid As, b),$$

where $A \in \mathbb{R}^{d_x \times d_s}$ is our (unknown) ICA mixing matrix, which acts on the sources to produce a linear mixture, and $b$ is a learnt or fixed diagonal diversity. To be clear, the result of this mixing of sources is not our data, but an intermediate representation which interacts with a flow that maps it to data.

Our model has three sets of variables: the observed data $x$, the flow representation $z$, and the ICA latent sources $s$. It can be factorised as

$$p(x, z, s) = p(x \mid z)\, p(z \mid s)\, p(s).$$

To train the ICA part of the model by maximum likelihood we would marginalise out $s$ and evaluate the evidence in $\mathcal{Z}$:

$$p(z; A, b) = \int p(z \mid s)\, p(s)\, \mathrm{d}s. \tag{10}$$

This marginal is intractable. We propose a contemporary approach for approximate inference in linear ICA, using advances in amortised variational inference. This means we introduce an approximate amortised posterior for $s$ and perform importance sampling on Eq (10), taking gradients through our samples using the reparameterisation trick Kingma and Welling (2014); Rezende et al. (2014).

By training this model using amortised stochastic variational inference we gain numerous benefits. Training scales to large datasets using stochastic gradient descent, and we are free to choose the functional and probabilistic form of our approximate posterior. We choose a linear mapping in our posterior, with

$$q_\phi(s \mid z) = \mathrm{Laplace}(s \mid A^+ z, b_\phi),$$

where we have introduced variational parameters $\phi = \{A^+, b_\phi\}$ corresponding to an unmixing matrix and a diagonal diversity.

We use samples from this posterior to define a lower bound on the evidence in $\mathcal{Z}$,

$$\log p(z; A, b) \ge \mathcal{L}(z; \phi, A, b) = \mathbb{E}_{s \sim q_\phi(s \mid z)}\left[\log p(z \mid s)\right] - \mathrm{KL}\left(q_\phi(s \mid z)\,\|\,p(s)\right). \tag{12}$$

Using the change of variables equation, Eq (3), and this lower bound on the evidence in $\mathcal{Z}$, we can obtain a variational lower bound on the evidence for our data $x$ as the sum of the ICA model’s ELBO (acting on $z$) and the log determinant of the flow:

$$\log p(x) \ge \mathcal{L}(z; \phi, A, b) + \log\left|\det \frac{\partial f(x)}{\partial x}\right|, \qquad z = f(x). \tag{14}$$

As such our model is akin to a flow model, but with an additional latent variable $s$; the base distribution of the flow is defined through marginalizing out the linear mixing of the sources. We refer to a model with $n$ non-linear splines mapping from $\mathcal{X}$ to $\mathcal{Z}$ as an $n$-layer Bijecta model.
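The bound described above can be sketched as follows. This is a minimal single-sample estimate under several assumptions: unit Laplace prior, Laplace likelihood and posterior with diagonal scales, a closed-form Laplace-to-Laplace KL term (a choice made here for brevity), and a toy flow passed in as a function returning `z` and its log-determinant. All function and variable names are hypothetical.

```python
import numpy as np

def laplace_logpdf(v, loc, scale):
    """Log-density of independent Laplace variables, summed over the last axis."""
    return np.sum(-np.abs(v - loc) / scale - np.log(2.0 * scale), axis=-1)

def kl_laplace_to_unit(mu, b):
    """KL( Laplace(mu, b) || Laplace(0, 1) ) in closed form, summed over the last axis."""
    return np.sum(np.abs(mu) + b * np.exp(-np.abs(mu) / b) - np.log(b) - 1.0, axis=-1)

def ica_elbo(z, A, A_plus, b, b_phi, rng):
    """Single-sample ELBO on log p(z) for Laplace linear ICA with a linear posterior."""
    mu = z @ A_plus.T                              # posterior location: unmixing applied to z
    s = mu + b_phi * rng.laplace(size=mu.shape)    # reparameterised draw from q(s | z)
    recon = laplace_logpdf(z, s @ A.T, b)          # log p(z | s): remix the sampled sources
    return recon - kl_laplace_to_unit(mu, b_phi)

def bijecta_bound(x, flow, A, A_plus, b, b_phi, rng):
    """Bound on log p(x): ICA ELBO on z = f(x) plus the flow's log |det Jacobian|."""
    z, log_det = flow(x)
    return ica_elbo(z, A, A_plus, b, b_phi, rng) + log_det

rng = np.random.default_rng(0)
A, A_plus = rng.normal(size=(6, 2)), rng.normal(size=(2, 6))
b, b_phi = np.full(6, 0.5), np.full(2, 0.3)
flow = lambda x: (2.0 * x, np.full(x.shape[0], x.shape[1] * np.log(2.0)))  # toy bijection z = 2x
bound = bijecta_bound(rng.normal(size=(4, 6)), flow, A, A_plus, b, b_phi, rng)
```

In practice the flow, the mixing and unmixing matrices and the scales would be trained jointly by stochastic gradient ascent on this quantity in an automatic differentiation framework; NumPy is used here only to show the arithmetic.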

(a) Generative Model

(b) Variational Posterior

Figure 3: The generative model (a) and variational posterior (b), as defined in Eq (14).

In the case of non-square ICA, where our ICA model is not perfectly invertible, errors made when reconstructing from $\mathcal{S}$ to $\mathcal{Z}$ may be amplified when mapping back to $\mathcal{X}$. We therefore penalise the error of each point when reconstructed into $\mathcal{X}$ as an additional regularisation term in our loss. This penalisation can be weighted according to the importance of high-fidelity reconstructions for a given application.

5 Manifolds for the unmixing matrix

What are good choices for the mixing and unmixing matrices? As we will show, design choices as to the parameterisation of $A^+$ can accelerate the convergence of our flow-based ICA model, and provide guarantees on the structure of the learnt projections.

Before discussing how to pick unmixing matrices for non-square linear ICA, we first briefly cover the square case, where the number of generative factors is equal to the dimensionality of our data space, $d_s = d_x$. This assumption greatly simplifies the construction of the unmixing matrix. Generally, members of the orthogonal group $O(d_x)$ have been shown to be optimal for decorrelating ICA sources; with sufficient data the maximum likelihood unmixing matrix lies on this decorrelating manifold and will be reached by models confined to this manifold Everson and Roberts (1999).

Thus we constrain $A^+$ to belong to the Lie group of special orthogonal matrices $SO(d_x)$ with determinant 1. We want to perform unconstrained optimisation in learning our matrix, so we wish to use a differentiable transformation from a class of simpler matrices to the class of orthogonal matrices.

For a given anti-symmetric matrix $M$ (i.e., satisfying $M = -M^\top$), its Cayley transform Cayley (1846) is a member of $SO(d_x)$. As such, we propose defining our square unmixing matrix as the Cayley transform of the anti-symmetric matrix $M$,

$$A^+ = (I - M)(I + M)^{-1}.$$

This can be formulated as an unconstrained problem, easing optimization, by defining $M = (L - L^\top)/2$ and then optimizing over the square real-valued matrix $L$.
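The Cayley parameterisation can be checked numerically in a few lines. The sketch below assumes the anti-symmetric matrix is built as half the difference between an unconstrained matrix and its transpose; the names `cayley_rotation`, `L` and `M` are used for illustration only.

```python
import numpy as np

def cayley_rotation(L):
    """Map an unconstrained square matrix L to a special orthogonal matrix."""
    M = 0.5 * (L - L.T)                      # anti-symmetric part: M.T == -M
    I = np.eye(L.shape[0])
    # (I + M)^{-1} (I - M); the two factors commute, so the order does not matter.
    return np.linalg.solve(I + M, I - M)

rng = np.random.default_rng(0)
R = cayley_rotation(rng.normal(size=(6, 6)))
print(np.linalg.norm(R @ R.T - np.eye(6)), np.linalg.det(R))
```

Since an anti-symmetric matrix has purely imaginary eigenvalues, the matrix `I + M` is always invertible, so any real `L` yields a valid rotation and gradient-based optimisation over `L` needs no projection step.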

5.1 $A^+$ for non-square ICA

Intuitively, for non-square ICA our aim is to construct an unmixing matrix $A^+$ that is approximately orthogonal, such that $A^+$ has the decorrelating properties we observe when constraining it to lie on $SO(d_x)$ in the case of square ICA. Here we propose a new method for non-square ICA, combining ideas from sketching Woodruff (2014) with these manifold-learning approaches.

The set of rectangular matrices that are exactly orthogonal, i.e. that fulfil $GG^* = I$ ($G^*$ the conjugate transpose of $G$), lies on the Stiefel manifold Stiefel (1935); Edelman et al. (1998). For a given number of rows $r$ and columns $c$, such a manifold is denoted $\mathcal{V}(r, c)$. We wish to specify our rectangular unmixing matrix $A^+$ of dimensionality $d_s \times d_x$ to be approximately Stiefel, lying close to $\mathcal{V}(d_s, d_x)$. This choice is justified by the following theorem, which we prove in Appendix A:

Theorem 1

As the Frobenius norm $\|G - \tilde{G}\|_F \to 0$, where $G \in \mathbb{R}^{r \times c}$ and $\tilde{G}$ is the projection of $G$ onto $\mathcal{V}(r, c)$, $\|GXX^\top G^\top - D\|_F \to 0$, where $GXX^\top G^\top$ is the cross-correlation of the projection of data $X$ by $G$, and $D$ is some diagonal matrix.

More simply, this shows that as a matrix $G$ approaches $\mathcal{V}(r, c)$, the off-diagonal elements of the cross-correlation matrix, $GXX^\top G^\top$, become smaller in magnitude and we achieve independent projections $GX$.

In the case of non-square ICA, the dimensionality of our latent space is less than the dimensionality of our data space, $d_s < d_x$, so $A^+$ cannot in general form a bijection. We need $A^+$ to compress from $\mathcal{Z}$ to $\mathcal{S}$ and we need it to perform a rotation such that the learnt latent sources align with the axis-aligned priors we impose in $\mathcal{S}$. We decompose

$$A^+ = R\,\Phi,$$

each part doing one of these tasks. The projection matrix $\Phi \in \mathbb{R}^{d_s \times d_x}$ handles the dimensionality reduction, while $R \in \mathbb{R}^{d_s \times d_s}$ performs unmixing in $\mathcal{S}$. In essence, $R$ is our decorrelating matrix and $\Phi$ is our projection matrix. We make specific choices for the structure of both these matrices, which can ensure the resulting $A^+$ lies close to the manifold $\mathcal{V}(d_s, d_x)$, while still permitting us to perform optimisation in an unconstrained parameter space.

The Lie group $SO(d_s)$ for $R$: Because $R$ is a square matrix we can constrain it to belong to $SO(d_s)$ as in the case of square ICA. By doing so we ensure $R$ has the decorrelating properties we are seeking.

$$R = (I - M)(I + M)^{-1},$$

where $M$ is an anti-symmetric matrix as detailed in Section 5.

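$\Phi$ for an approximately Stiefel $A^+$: Our goal is to construct a rectangular matrix $A^+$ such that $A^+ = R\Phi$ and $A^+ A^{+\top} \approx I$. As stated above, we constrain $R$, one part of $A^+$, to be exactly orthogonal. For our compressive matrix, $\Phi$, our goal is to constrain it to be approximately orthogonal. The justification for this is provided by the following theorem:

Theorem 2

For $G = R\Phi$ with $R \in SO(d_s)$ and $\Phi \in \mathbb{R}^{d_s \times d_x}$, as the Frobenius norm $\|\Phi\Phi^\top - I\|_F \to 0$, we also have $\|G - \tilde{G}\|_F \to 0$, where $\tilde{G}$ is the projection of $G$ onto $\mathcal{V}(d_s, d_x)$.

The proof of this theorem is in Appendix A. If $\Phi$ satisfies $\Phi\Phi^\top \approx I$, then $A^+$ lies close to $\mathcal{V}(d_s, d_x)$ by Theorem 2, and by Theorem 1 we can deduce $A^+$ will have decorrelating properties.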
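The relationship between approximate orthogonality and closeness to the Stiefel manifold can be explored numerically. The sketch below is an illustration, not the paper's proof: it assumes real matrices, computes the projection as the polar factor via an SVD, and uses hypothetical names (`stiefel_projection`, `stiefel_distance`, `gram_error`) and a hypothetical perturbation size.

```python
import numpy as np

def stiefel_projection(G):
    """Closest matrix with orthonormal rows in Frobenius norm: the polar factor of G."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def stiefel_distance(G):
    """Frobenius distance from G to its projection onto the Stiefel manifold."""
    return np.linalg.norm(G - stiefel_projection(G))

def gram_error(G):
    """Frobenius norm of G G^T - I, i.e. how far G is from having orthonormal rows."""
    return np.linalg.norm(G @ G.T - np.eye(G.shape[0]))

rng = np.random.default_rng(0)
d_s, d_x = 8, 512
Q, _ = np.linalg.qr(rng.normal(size=(d_x, d_s)))
Phi = Q.T + 0.01 * rng.normal(size=(d_s, d_x))   # approximately orthonormal rows
R, _ = np.linalg.qr(rng.normal(size=(d_s, d_s))) # exactly orthogonal square factor
G = R @ Phi
print(gram_error(Phi), stiefel_distance(G))
```

Multiplying by an orthogonal `R` leaves the Gram error unchanged, and the distance of the product to the manifold is never larger than that error, since each singular value satisfies |σ − 1| ≤ |σ² − 1|.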

Recall that our optimisation space for $\Phi$ lies in $\mathbb{R}^{d_s \times d_x}$. For high-dimensional data such as images this can constitute a prohibitively large search space for an approximately orthogonal matrix. We can choose to fix $\Phi$ such that optimisation occurs in $\mathbb{R}^{d_s \times d_s}$ space, solely for the matrix $R$. For most non-square ICA problems we assume that $d_s \ll d_x$, and fixing $\Phi$ can greatly accelerate optimisation without limiting the space of solutions for the unmixing matrix $R$.

Such approximately orthogonal matrices can be constructed very cheaply by way of Johnson-Lindenstrauss (JL) Projections Johnson and Lindenstrauss (1984); Dasgupta and Gupta (2003). We draw once from one such projection at initialisation and leave it fixed for the remainder of training.

Johnson-Lindenstrauss Projections

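One can obtain a JL projection for $\Phi$ by sampling from a simple binary distribution Achlioptas (2003):

$$\Phi_{ij} = \begin{cases} +1/\sqrt{d_x} & \text{with probability } 1/2, \\ -1/\sqrt{d_x} & \text{with probability } 1/2. \end{cases}$$

This distribution satisfies $\mathbb{E}[\Phi\Phi^\top] = I$, and as Achlioptas (2003) show, such a draw has $\Phi\Phi^\top \approx I$; then by Theorems 1 and 2 we know $A^+$ will be a decorrelating matrix.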
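A draw of this kind is cheap to generate and check. The following sketch assumes entries of ±1/√d_x with equal probability, a scaling chosen here so that every row has unit norm; the function name `achlioptas_projection` and the dimensions are hypothetical.

```python
import numpy as np

def achlioptas_projection(d_s, d_x, rng):
    """Random sign matrix scaled so each row has unit norm and E[Phi Phi^T] = I."""
    signs = rng.choice([-1.0, 1.0], size=(d_s, d_x))
    return signs / np.sqrt(d_x)

rng = np.random.default_rng(0)
Phi = achlioptas_projection(8, 4096, rng)   # drawn once, then kept fixed during training
gram = Phi @ Phi.T
print(np.linalg.norm(gram - np.eye(8)))     # small when d_x is much larger than d_s
```

The diagonal of `Phi @ Phi.T` is exactly one, and each off-diagonal entry has variance 1/d_x, so the deviation from the identity shrinks as the data dimensionality grows relative to the number of sources.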

5.2 The mixing matrix

The mixing matrix $A$ requires fewer constraints than $A^+$, the unmixing matrix. For non-square ICA we construct $A$ as the product

$$A = W R^\top,$$

where $R^\top$ is exactly the transpose of the unmixing component $R$ of $A^+$, and the matrix $W \in \mathbb{R}^{d_x \times d_s}$ is unconstrained. We are essentially using the inverse of the unmixing matrix $R$ and projecting it from $\mathcal{S}$ space to $\mathcal{Z}$ space by way of $W$.

6 Experiments

Figure 4: Four axis-aligned latent-space traversals for a 12 layer Bijecta with trained on CelebA. Images to the left correspond to the original sample from the base distribution. As we move to the right these images correspond to linearly increasing values along a single latent dimension up to 10 standard deviations away whilst the other dimensions remain fixed. The first dimension encodes hair-color, the second gender, the third the degree of smiling and the fourth rotation. Most striking is the fact that these transforms are identity preserving: as we vary along one axis it is apparent that the features that are core to the original face are retained. In standard flow-based models, such transforms are typically not axis-aligned and most transformations in flow latent spaces are not identity preserving.

A good unsupervised ICA model is one that can: learn useful and informative latent representations; produce realistic-looking samples; and well approximate the underlying data distribution. Here we show that Bijecta outperforms both flows with fixed base distributions and linear ICA models by these criteria.

6.1 Likelihood Collapse

Our model is rewarded for learning representations that can be readily decorrelated by a linear ICA model. This presents some unique modes of failure which we observed when training these models. Under the objective we defined in Eq (14), a trivial solution for a flow model is to have a highly peaked distribution over reconstructions. For a given flow embedding , , collapses to a distribution with zero mean and a variance that is greater than the variance of the likelihood distribution, and in fact hardly depends on the input . Under this collapse, where is of low variance, the ICA model estimates to be a distribution with mean close to 0 and with greater variance than the appropriate embedding-dependent , for all datapoints. Because the values of have collapsed to 0 under the flow model, we obtain high likelihood values even if reconstructions and samples are poor.

To prevent this likelihood collapse, we add a final normalising layer to our flow $f$, which prevents the variance of $z$ from drifting. Such layers ensure that data batches have zero mean and unit variance and prevent variance collapse. Each component $z_i$ of the input to this layer is rescaled in the following fashion:

$$\hat{z}_i = \frac{z_i - \mu_i}{\sqrt{\pi/2}\,\frac{1}{B}\sum_{n=1}^{B}\left|z_i^{(n)} - \mu_i\right|}, \tag{20}$$

where $B$ is the batch size and $\mu_i$ is the batch mean over component $i$. As Hoffer et al. (2018) show, the expected value of the denominator is the data standard deviation $\sigma_i$. As such we are enforcing unit variance. The log determinant of the Jacobian of this layer is simply the negative log of the denominator in Eq (20).

To calculate batch statistics when generating data from our model, when we do not have an input , we keep a running average that is reminiscent of batch norm layers in neural networks Ioffe and Szegedy (2015). The running mean and standard deviation for the layer at step of training are:

where and are the statistics calculated over the current training batch and is a momentum term which regulates the variance at each update. These statistics are then used to calculate the inverse of the layer when generating data.
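A normalising layer of this kind might look as follows. This is a sketch under explicit assumptions: the denominator is taken to be the L1 statistic of Hoffer et al. (2018), the running statistics use an exponential moving average with a hypothetical `momentum` convention, and the log-determinant treats the batch statistics as constants. The class name `NormalisingLayer` is illustrative.

```python
import numpy as np

class NormalisingLayer:
    def __init__(self, dim, momentum=0.1):
        self.momentum = momentum
        self.running_mean = np.zeros(dim)
        self.running_scale = np.ones(dim)

    def forward(self, z):
        """Normalise a batch (rows are samples) and return the log |det Jacobian|."""
        mu = z.mean(axis=0)
        # L1 estimate of the standard deviation; equals sigma in expectation for Gaussian data.
        scale = np.sqrt(np.pi / 2.0) * np.abs(z - mu).mean(axis=0)
        m = self.momentum
        self.running_mean = (1 - m) * self.running_mean + m * mu
        self.running_scale = (1 - m) * self.running_scale + m * scale
        log_det = -np.sum(np.log(scale))   # negative log of the denominator, summed over components
        return (z - mu) / scale, log_det

    def inverse(self, y):
        """Undo the normalisation with the running statistics (used when generating data)."""
        return y * self.running_scale + self.running_mean

rng = np.random.default_rng(0)
layer = NormalisingLayer(dim=3)
z = rng.normal(loc=3.0, scale=5.0, size=(4096, 3))
y, log_det = layer.forward(z)
print(y.mean(axis=0), y.std(axis=0), log_det)
```

At generation time no input batch is available, so `inverse` relies entirely on the running statistics accumulated during training.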

6.2 Results and Discussion

Dimensionality Reduction on flow models

We first contrast our model’s ability to automatically uncover sources relative to flow models with heavy-tailed base distributions. We do so by measuring the cumulative explained variance by the dimensions in $\mathcal{Z}$ for both these models. If a small number of dimensions explains most of $\mathcal{Z}$’s variance then the model has learnt a bijection which only requires a small number of dimensions to be invertible. It has in effect learnt the generating sources underpinning the data.

In Figure 5 we demonstrate that Bijecta is able to induce better-compressed representations in $\mathcal{Z}$ than a non-compressive flow on the CIFAR-10 and Fashion-MNIST datasets. We compute the eigenvalues of the covariance matrix of the output of the flow, i.e. of $z = f(x)$, to see how much of the total variance in the learned feature space can be explained in a few dimensions. In doing so we see that our flows trained jointly with a linear ICA model with $d_s = 64$ effectively concentrate variation into a small number of intrinsic dimensions; this is in stark contrast with the RQS flows trained with only a simple Laplace base distribution. This demonstrates that our model is capable of automatically detecting relevant directions on a low dimensional manifold in $\mathcal{Z}$, and that the bijective component of our model is better able to isolate latent sources than a standard flow.
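The explained-variance measurement can be sketched in a few lines. The example below uses synthetic embeddings rather than the output of a trained flow, and the function name `cumulative_explained_variance` and all sizes are assumptions for illustration.

```python
import numpy as np

def cumulative_explained_variance(z):
    """Fraction of total variance captured by the top-k principal directions, for every k."""
    cov = np.cov(z, rowvar=False)                 # rows of z are samples
    eigvals = np.linalg.eigvalsh(cov)[::-1]       # eigenvalues in descending order
    eigvals = np.clip(eigvals, 0.0, None)         # guard against tiny negative round-off
    return np.cumsum(eigvals) / np.sum(eigvals)

rng = np.random.default_rng(0)
# Synthetic 32-dimensional embeddings whose variance lives almost entirely in 4 directions.
latent = rng.normal(size=(2000, 4)) @ rng.normal(size=(4, 32))
z = latent + 0.01 * rng.normal(size=(2000, 32))
curve = cumulative_explained_variance(z)
print(curve[:6])
```

A curve that saturates after a handful of components indicates a compressive representation, whereas a slowly rising curve indicates that variance is spread across many dimensions.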

Visualisation of Learnt Latent Factors

For a visual illustration of this source separation we show the difference in generated images resulting from smoothly varying along each dimension in $\mathcal{S}$ for Bijecta models and in $\mathcal{Z}$ for flow models. Bijecta’s ability to discover latent sources is highlighted visually in Figures 4 and 10, where our model is clearly able to learn axis-aligned transformations of CelebA faces, whereas a flow trained with an equivalent computational budget does not. This improvement in source discovery translates into our model’s ability to converge more quickly to a well-conditioned generative model under a heavy-tailed prior, which is highlighted by the samples drawn from Bijecta and RQS flows in Figure 10.

Figure 5: Explained variance plots for the embedding in $\mathcal{Z}$, as measured by the sums of the eigenvalues of the covariance matrix of the embeddings, for both our Bijecta model and for an RQS model trained with a Laplace base distribution. For both Fashion-MNIST (top) and CIFAR-10 (bottom) datasets we see that the Bijecta model has learned a compressive flow, where most of the variance can be explained by only a few linear projections. The shaded region denotes the first 64 dimensions, corresponding to the size of the target source embedding $\mathcal{S}$.

(a) 4-layer RQS flow latent traversal (b) 1-layer Bijecta (c) 1-layer RQS (d) 4-layer RQS

Figure 10: Plots comparing samples drawn from Bijecta and RQS flows. Samples from a 1-layer Bijecta (b) with are contrasted with samples from 1 (c) and 4-layer (d) RQS flows trained with a Laplace base distribution after 25000 training steps. It is apparent that our model has more quickly converged to a well-conditioned generative model, even when compared against a larger baseline. We also show latent traversals moving along 5 latent directions in a 4-layer RQS flow for an embedded training-set point (a). Images in the center correspond to the original sample from the latent space prior. As we move to the right or left these images correspond to linearly increasing values along a single latent dimension up to 10 standard deviations away whilst the other dimensions remain fixed. Note that though we have selected 5 dimensions, all dimensions in had similar latent traversals. It is apparent the flow has not discovered axis aligned transforms.

Improving Linear ICA using bijective maps

Having ascertained the benefit of using our model for source discovery relative to a flow with a heavy-tailed prior, we demonstrate the benefit of using a bijective map to improve on the performance of linear ICA. We do so by measuring metrics which assess the independence of a model’s uncovered factors in $\mathcal{S}$ and the quality of a model’s compressed representation, generally evaluated by determining a model’s ability to reconstruct an input from a compressed embedding in $\mathcal{S}$. Table 1 shows the dramatic improvement in such metrics when adding a single bijective mapping to linear ICA. This table also shows that improvements scale as we stack non-linear mappings.

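Linear ICA | $\log p(s)/d_s$ | -4.8 | -4.29 | -3.0 | -9.0
Linear ICA | reconstruction error in $\mathcal{Z}$ | 3.0 | 3.1 | 2.9 | 2.9
Linear ICA | reconstruction error in $\mathcal{X}$ | 27.0 | 32.9 | 17.3 | 57.2
1-layer Bijecta | $\log p(s)/d_s$ | -4.2 | -2.8 | -2.2 | -7.3
1-layer Bijecta | reconstruction error in $\mathcal{Z}$ | 2.0 | 1.0 | 1.6 | 1.9
1-layer Bijecta | reconstruction error in $\mathcal{X}$ | 5.6 | 4.6 | 3.8 | 13.0
4-layer Bijecta | $\log p(s)/d_s$ | -4.7 | -3.0 | -2.2 | -7.8
4-layer Bijecta | reconstruction error in $\mathcal{Z}$ | 1.9 | 0.6 | 1.2 | 1.4
4-layer Bijecta | reconstruction error in $\mathcal{X}$ | 5.1 | 3.0 | 2.4 | 10.1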
Table 1: Here we evaluate the source-separation and reconstruction quality of non-square ICA models. The best performing results are highlighted in bold. We evaluate source separation under a given model by evaluating the mean log probability of the validation set embeddings in $\mathcal{S}$ under our heavy-tailed prior, normalised by the dimensionality of $\mathcal{S}$ space ($\log p(s)/d_s$). As our base distribution is heavy-tailed, this metric evaluates the axis-alignment, i.e. the independence, of uncovered factors. We also evaluate the quality of compressed representations using the reconstruction error from the linear component of each model. In Bijecta this corresponds to the reconstruction error in $\mathcal{Z}$; in linear ICA, this corresponds to reconstruction error in $\mathcal{X}$. Generally, this highlights a model’s ability to reconstruct linearly compressed data. We also evaluate the quality of compressed representations by measuring the reconstruction error in $\mathcal{X}$. Generally, a better compression encodes more information in $\mathcal{S}$, making it easier for the model to then reconstruct in $\mathcal{Z}$ and subsequently in $\mathcal{X}$. Most striking is the improvement across all metrics when introducing a single bijective mapping. In particular, the lower linear model reconstruction error coupled with high $\log p(s)/d_s$ are clear indicators of better source separation. The low linear model error in $\mathcal{Z}$ for Bijecta models means the flow component of our models has learnt a representation under which the linear-ICA model can more easily separate sources and reconstruct inputs (also seen by the decrease in reconstruction error in data space $\mathcal{X}$). Our 4-layer model further improves the quality of the compressed representations, as seen by the lower reconstruction errors.

Such improvements can be appraised visually in Figure E.31, which highlights the better reconstructions and sample generation when introducing a single bijective layer. For other examples of such reconstructions see Appendix F. As before, we show latent traversal plots for flows and linear ICA models in Appendix F for CelebA. It is apparent that our models are better able to learn axis-aligned transforms in $\mathcal{S}$.

(a) Linear ICA (b) 1-layer Bijecta (c) Linear ICA (d) 1-layer Bijecta

(e) Original Images (f) linear-ICA (g) 1-layer Bijecta

Figure 18: Plots comparing samples and reconstructions from Bijecta and linear ICA models with . Here we show samples drawn from the generative model of linear-ICA and a 1-layer Bijecta model trained on fashion-MNIST (a), (b) and CelebA (c), (d). The superior quality and most importantly the greater diversity of the samples drawn from our model is very apparent. Subfigures (e), (f), (g) show reconstructions from a linear-ICA model and a 1-layer Bijecta model trained on fashion-MNIST. The introduction of the bijective non-linearity significantly improves the quality of reconstructions of linear-ICA.

Viewed as a whole, these results show that the bijective map induces a representation that is better decorrelated by a linear mapping and is also easier to reconstruct under this linear mapping.

6.3 Methods

All non-square linear-ICA baselines are trained under the objective detailed in Eq. (12) but with no flow, so $z = x$. We construct the mixing and unmixing matrices as detailed in Section 5; these baselines differ from our Bijecta model solely in lacking the bijective mapping that preprocesses the data. Similarly, all flow-based baselines are trained using the objective in Eq (5). We match these baselines to the bijective maps of our models in terms of size, neural network architectures, and the presence of normalising layers. Note that unless specified otherwise, all compressive models use a latent space size of 64. Training hyperparameters and network architectures are detailed in Appendix G.

7 Related Work

Modern flow-based models were originally proposed as an approach to non-linear square ICA Dinh et al. (2015), but are also motivated by desires in generative modelling for more expressive priors and posteriors Kingma et al. (2016); Papamakarios et al. (2019). There were similar early approaches, known as symplectic maps Deco and Brauer (1995); Parra et al. (1995, 1996); Parra (1996), which were also proposed for use with ICA. Overall they offer a variety of expressive dimensionality-preserving (and sometimes volume-preserving) bijective mappings Dinh et al. (2017); Kingma and Dhariwal (2018); Papamakarios et al. (2019).

Orthogonal transforms have been used in normalizing flows before, to improve the optimization properties of Sylvester flows Van Den Berg et al. (2018); Golinski et al. (2019). The latter work also uses a Cayley map to parameterise orthogonal matrices. Here our analysis is flow-agnostic and works with state of the art non-volume preserving flows. Researchers have also looked at imposing Stiefel-manifold structure on the weights of neural networks Li et al. (2020).

Disentangling, potentially synonymous with non-linear non-square ICA, occurs when there is a one-to-one correspondence between dimensions of a latent space and some human-interpretable features in data generated by the model. Intuitively this means that smoothly varying along an axis-aligned direction in a model’s latent space should result in smooth changes in a single feature of the data Higgins et al. (2017). There are Variational Autoencoder Rezende et al. (2014); Kingma and Welling (2014) derived models that obtain this, commonly by upweighting components of their training objective that are associated with statistical independence of their latent distributions Higgins et al. (2017); Burgess et al. (2017); Kim and Mnih (2018); Chen et al. (2018); Esmaeili et al. (2019). Khemakhem et al. (2020) prove that in non-linear ICA, latent variables that are not conditioned on other variables cannot produce disentangled representations, but that by conditioning the prior on additional supervised information the model can become identifiable.

8 Conclusion

Here we have developed a method for performing non-linear ICA which combines state-of-the-art flow-based models and a novel theoretically grounded non-square linear ICA method. Not only is our model able to learn a representation under which sources are separable by linear unmixing; we have also shown that the flow component concentrates information into a low dimensional manifold in its representation space $\mathcal{Z}$. We have also demonstrated the value of our method for latent factor discovery on large high-dimensional image datasets. Bijecta learns a low dimensional, explanatory set of latent representations in $\mathcal{S}$, as demonstrated by our latent traversals, and draws from the model are realistic-looking.

Appendix A Proof of optimality

Definition: We say a matrix is strictly more orthogonal than a matrix if .

Theorem 1

As the Frobenius norm , where and is the projection of onto , , where is the cross-correlation of the projection of data by , and is some diagonal matrix.

Proof Let be some projection matrix of data . We have .

The cross-correlation is expressed as . In the case where is perfectly decorrelated, we have: where is a diagonal matrix. We know that according to Everson and Roberts (1999) the Stiefel manifold (defined in Eq (LABEL:eq:stiefel)) holds the set of all linearly decorrelating matrices. As Absil and Malick (2012) show, the unique projection onto this manifold of with polar decomposition , is simply . As such given any matrix we have a polar decomposition , where , a linear decorrelating matrix, is the projection onto and denotes a positive-semidefinite Hermitian matrix. For any matrix and its projection onto , we have , where is the complex conjugate of . Consequently, given that the Frobenius norm is unitary invariant and the fact that is unitary:

The last line comes from the fact that is Hermitian and as such . As is a constant, as . As shown in Eq (LABEL:eq:limit) . Consequently:


Note, however, that though a matrix may be decorrelating it will in general not be the optimal ICA matrix for a given dataset, though the optimal ICA matrix is itself decorrelating Everson and Roberts (1999).
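The projection step used in the proof can be checked numerically. The following is a minimal NumPy sketch, assuming a real matrix with fewer rows than columns and taking the Stiefel manifold to be the set of matrices with orthonormal rows; the function name `project_to_stiefel` and the variable names are illustrative and do not come from the paper. It computes the orthonormal factor of the polar decomposition via an SVD and checks that the distance to the projection depends only on how far the singular values are from one.

```python
import numpy as np

def project_to_stiefel(W):
    """Return the orthonormal polar factor of W (rows orthonormal)."""
    # Thin SVD: W = U diag(s) Vt, with U of shape (k, k) and Vt of shape (k, n).
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Polar decomposition W = P Q with P = U diag(s) U^T and Q = U Vt.
    return U @ Vt

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))          # illustrative unmixing-shaped matrix
Q = project_to_stiefel(W)

# Rows of Q are orthonormal, so Q Q^T should be the identity.
orth_error = np.linalg.norm(Q @ Q.T - np.eye(3))

# Frobenius distance from W to its projection equals ||s - 1||_2.
singular_values = np.linalg.svd(W, compute_uv=False)
dist_proj = np.linalg.norm(W - Q)
dist_from_singular_values = np.linalg.norm(singular_values - 1.0)

# Any other matrix with orthonormal rows is at least as far from W.
other, _ = np.linalg.qr(rng.normal(size=(8, 3)))   # (8, 3), orthonormal columns
dist_other = np.linalg.norm(W - other.T)
```

Because the distance reduces to the deviation of the singular values from one, driving it to zero forces the matrix itself towards a decorrelating one, which is the intuition behind the limit in Theorem 1.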

Appendix B Sub- or Super-Gaussian Sources?

In classical non-linear ICA a simple non-linear function (such as a matrix mapping followed by an activation function) is used to map directly from data to the setting of the sources for that datapoint Bell and Sejnowski (1995). In this noiseless model, the activation function is related to the prior one is implicitly placing over the sources Roweis and Ghahramani (1999). The choice of non-linearity here is thus a claim about whether the sources are sub- or super-Gaussian. If the choice is wrong, ICA models struggle to unmix the data Mackay (1996); Roweis and Ghahramani (1999). Previous linear ICA methods enabled both the learning of mixed super- and sub-Gaussian sources Lee and Sejnowski (1997), and learning the nature of the source Everson and Roberts (1999).

Here, our model abstracts away this design choice and transforms the original data such that it can be readily modelled as having either super- or sub-Gaussian sources by the linear ICA model, regardless of the true underlying sources.
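The distinction between sub- and super-Gaussian sources is commonly made through excess kurtosis: positive for heavy-tailed (super-Gaussian) distributions such as the Laplace, negative for light-tailed (sub-Gaussian) ones such as the uniform. The sketch below is only an illustration of that convention; the helper `excess_kurtosis` and the sample size are assumptions made here, not part of the method.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis: fourth standardised moment minus 3."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(0)
n = 200_000

# Laplace: super-Gaussian, theoretical excess kurtosis 3.
k_laplace = excess_kurtosis(rng.laplace(size=n))
# Gaussian: theoretical excess kurtosis 0.
k_gauss = excess_kurtosis(rng.normal(size=n))
# Uniform: sub-Gaussian, theoretical excess kurtosis -1.2.
k_uniform = excess_kurtosis(rng.uniform(-1.0, 1.0, size=n))
```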

Appendix C Identifiability of Linear ICA

Recall that for noiseless linear ICA the learnt sources will vary between different trained models only in their ordering and scaling Choudrey (2000); Hyvärinen et al. (2001); Everson and Roberts (2001). Under this model our data matrix is


where each column of is distributed according to . Given a permutation matrix and a diagonal matrix , both , we define new source and mixing matrices such that is unchanged: and Choudrey (2000). With fixed source variance the scaling ambiguity is removed, so linear ICA is easy to render identifiable up to a permutation and sign-flip of sources. However, this is only the case when the underlying sources are non-Gaussian (Hyvärinen et al., 2001). The Bijecta model can be interpreted as learning the “data” for linear ICA, with


In this setting, a natural question to ask is whether or not, given a particular function , the linear ICA is identifiable. The Bijecta model explicitly aims to induce this linear identifiability on its feature map, as we impose a Laplace prior , with fatter-than-Gaussian tails.
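The ordering and scaling ambiguity described above can be verified directly. The following NumPy sketch assumes the noiseless model in which the data matrix is the product of a mixing matrix and a source matrix; the names `A`, `S`, `P` and `D` and all dimensions are illustrative labels chosen here. It shows that permuting and rescaling the sources, with a compensating change to the mixing matrix, leaves the data unchanged, and that fixing the source variance leaves only the permutation and sign ambiguity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_s, n = 5, 3, 1000

A = rng.normal(size=(d_x, d_s))      # mixing matrix (illustrative)
S = rng.laplace(size=(d_s, n))       # non-Gaussian sources, one column per datapoint
X = A @ S

P = np.eye(d_s)[[2, 0, 1]]           # permutation matrix
D = np.diag([2.0, -0.5, 3.0])        # diagonal scaling, including a sign flip

S_new = P @ D @ S                    # reordered and rescaled sources
A_new = A @ np.linalg.inv(D) @ P.T   # compensating mixing matrix (P^{-1} = P^T)
X_new = A_new @ S_new

def unit_variance(M):
    """Rescale each source (row) to unit variance."""
    return M / M.std(axis=1, keepdims=True)

# With the variance fixed, the two sets of sources agree up to order and sign.
same_up_to_sign = np.allclose(np.abs(unit_variance(S_new)),
                              np.abs(P @ unit_variance(S)))
```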

Appendix D Coupling Layers in Flows

Coupling layers in flows are designed to produce lower triangular Jacobians for ease of calculation of determinants. Durkan et al. (2019) achieve this in the following fashion:

  1. Given an input , split into two parts and ;

  2. Using a neural network, compute parameters for a bijective function using one half of : ; parameters are learnable parameters that do not depend on the input;

  3. The output of the layer is then for .

These coupling transforms thus act elementwise on their inputs.
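The three steps above can be sketched in a few lines. Note that this example swaps in a simple affine elementwise bijection rather than the rational quadratic splines of Durkan et al. (2019), leaves the conditioning half unchanged rather than transforming it with input-independent parameters, and replaces the neural network by a single hypothetical linear map with a tanh; all names (`conditioner`, `coupling_forward`, `coupling_inverse`, `W`, `b`) are assumptions made for illustration.

```python
import numpy as np

def conditioner(x_a, W, b):
    """Toy stand-in for the neural network: x_a -> (log_scale, shift)."""
    h = np.tanh(x_a @ W + b)
    log_scale, shift = np.split(h, 2, axis=-1)
    return log_scale, shift

def coupling_forward(x, W, b):
    d = x.shape[-1] // 2
    x_a, x_b = x[..., :d], x[..., d:]            # step 1: split the input
    log_scale, shift = conditioner(x_a, W, b)    # step 2: parameters from one half
    y_b = x_b * np.exp(log_scale) + shift        # step 3: elementwise bijection
    # The Jacobian is lower triangular, so log|det| is a sum over the diagonal.
    log_det = log_scale.sum(axis=-1)
    return np.concatenate([x_a, y_b], axis=-1), log_det

def coupling_inverse(y, W, b):
    d = y.shape[-1] // 2
    y_a, y_b = y[..., :d], y[..., d:]
    # y_a equals x_a, so the same parameters can be recomputed exactly.
    log_scale, shift = conditioner(y_a, W, b)
    x_b = (y_b - shift) * np.exp(-log_scale)
    return np.concatenate([y_a, x_b], axis=-1)

rng = np.random.default_rng(0)
dim = 6                                  # must be even in this sketch
W = rng.normal(size=(dim // 2, dim))     # maps 3 inputs to 3 log-scales + 3 shifts
b = rng.normal(size=dim)
x = rng.normal(size=(4, dim))

y, log_det = coupling_forward(x, W, b)
x_rec = coupling_inverse(y, W, b)
```

The inverse never needs to invert the conditioner itself, which is why an arbitrary network can be used to produce the parameters.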

Appendix E Reconstructions and Latent Traversals

E.1 Reconstructions

(a) F-MNIST Input Data (b) F-MNIST Linear ICA Recon (c) F-MNIST 1-Bijecta Recon (d) F-MNIST 4-Bijecta Recon

(e) CIFAR-10 Input Data (f) CIFAR-10 Linear ICA Recon (g) CIFAR-10 1-Bijecta Recon (h) CIFAR-10 4-Bijecta Recon

(i) CelebA Input Data (j) CelebA Linear ICA Recon (k) CelebA 1-Bijecta Recon (l) CelebA 4-Bijecta Recon

Figure E.31: Original data and reconstructions for Fashion-MNIST (a)-(d), CIFAR-10 (e)-(h) and CelebA (i)-(l). The first column – (a), (e), (i) – shows a random selection of 40 images from the training set for each dataset. The second column – (b), (f), (j) – shows the reconstructions obtained by linear non-square ICA using our approximately-Stiefel unmixing matrix as in Section 5, with for all datasets. Note that in Fashion-MNIST the model struggles to give uniform regions of the same intensity and instead produces bands of noise. For CIFAR-10 and CelebA much of the finer detail is lost. The third column – (c), (g), (k) – shows the reconstructions for Bijecta with an RQS flow with a single layer and . The fourth column – (d), (h), (l) – shows those for a Bijecta with a four-layer RQS flow, . Both of these models show much higher fidelity reconstructions than the linear ICA model. They give similar quality reconstructions for Fashion-MNIST and CIFAR-10. For CelebA the difference between 1-layer and 4-layer Bijectas is clearer, with the 4-layer model better preserving the identity of the person in the input image.

(a) CelebA Input Data (b) CelebA 12-Bijecta Recon

Figure E.34: Here we show reconstructions for a 12-layer Bijecta model, trained with a batch size of 64 on the CelebA dataset, with a latent space dimensionality . The quality of the reconstructions illustrates that as we stack invertible layers in our model and increase the size of the latent space, we are able to reconstruct images with a high degree of accuracy.

E.2 Latent Traversals

(a) CelebA RQS traversal (b) CelebA ICA traversal (c) CelebA 1-Bijecta traversal

Figure E.38: Here we show latent traversals moving along the first five learnt latent directions in (a) a 4-layer RQS flow, (b) a linear ICA model with 64 latent dimensions in , and (c) a 1-layer Bijecta model also with 64 latent dimensions, for an embedded training-set point of the CelebA dataset. Images in the center correspond to the original embedded point. As we move to the right or left, the images correspond to linearly increasing values along a single latent dimension, up to 10 standard deviations away, whilst the other dimensions remain fixed. Bijecta is visibly best able to learn axis-aligned transformations of the data.
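The traversal procedure in the caption above can be sketched as follows. This is an assumption-laden illustration: the helper `latent_traversal`, the number of steps, and the use of a single scalar standard deviation per dimension are choices made here, and the embedded point is replaced by a random stand-in. Each returned row would then be passed through the model's decoding path to produce one image of the traversal.

```python
import numpy as np

def latent_traversal(z, dim, std, n_steps=11, max_std=10.0):
    """Return n_steps copies of z that differ only along one latent dimension."""
    # Linearly spaced offsets from -max_std to +max_std standard deviations.
    offsets = np.linspace(-max_std, max_std, n_steps) * std
    Z = np.tile(np.asarray(z, dtype=float), (n_steps, 1))
    Z[:, dim] += offsets          # all other dimensions stay fixed
    return Z

rng = np.random.default_rng(0)
z = rng.laplace(size=64)          # stand-in for an embedded training-set point
Z = latent_traversal(z, dim=0, std=1.0)
```

With an odd number of steps the middle row is the original embedded point, matching the layout of the figure.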

Appendix F Network Architectures and Hyperparameters

Within all Rational Quadratic Spline (RQS) flows we parameterise 2 knots for each spline transform. The hyper-parameters of the knots were as in the reference implementation from Durkan et al. (2019), available at github.com/bayesiains/nsf: we set the minimum parameterised gradient of each knot to be 0.001, the minimum bin width between each encoded knot and the origin to be 0.001, and the minimum height between each knot and the origin to be 0.001.

Unlike Durkan et al. (2019), who view a single RQS ‘layer’ as composed of a sequence of numerous coupling layers, in this paper the number of layers we describe a model as having is exactly the number of coupling layers present. So for our 4-layer models there are four rational quadratic splines. Each layer in our flows is composed of: an actnorm layer, an invertible 1x1 convolution, an RQS coupling transform and a final invertible 1x1 convolution.

The parameters of the knots were themselves parameterised using ResNets, as used in RealNVP Dinh et al. (2017), for each of which we used 3 residual blocks and batch normalisation. As in Dinh et al. (2017) we factor out after each layer. All training was done using ADAM Kingma and Lei Ba (2015), with default , a learning rate of 0.0005 and a batch size of 512. We perform cosine decay on the learning rate during training, training for 25,000 steps.
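For concreteness, a cosine learning-rate schedule with the values quoted above might look like the sketch below. The text does not state the final learning rate or whether any warm-up is used, so a decay to zero with no warm-up is assumed here; the function name `cosine_decay_lr` and its `final_lr` parameter are hypothetical.

```python
import math

def cosine_decay_lr(step, base_lr=5e-4, total_steps=25_000, final_lr=0.0):
    """Cosine decay from base_lr at step 0 to final_lr at total_steps."""
    # Clamp so that steps outside the training range do not extrapolate.
    step = min(max(step, 0), total_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return final_lr + (base_lr - final_lr) * cosine

# Learning rate at the start, midpoint and end of training.
schedule = [cosine_decay_lr(s) for s in (0, 12_500, 25_000)]
```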

Data was rescaled to 5-bit integers and we used RealNVP affine pre-processing so our input data was in the range with .


  1. Projection-like retractions on matrix manifolds. SIAM Journal on Optimization 22 (1), pp. 135–158. External Links: Document, ISSN 10526234 Cited by: Appendix A, §8.
  2. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. In Journal of Computer and System Sciences, Vol. 66, pp. 671–687. External Links: Document Cited by: §5.1.
  3. MISEP – Linear and Nonlinear ICA Based on Mutual Information. Journal of Machine Learning Research 4, pp. 1297–1318. Cited by: §2.1.
  4. An information-maximisation approach to blind separation and blind deconvolution. Neural Computation 7 (6), pp. 1004–1034. Cited by: Appendix B, §1, §2.
  5. Blind separation of sources: A nonlinear neural algorithm. Neural Networks 5 (6), pp. 937–947. External Links: Document, ISSN 08936080 Cited by: §2.1.
  6. Understanding disentangling in β-VAE. In NeurIPS, Cited by: §7.
  7. Blind identification of independent components with higher-order statistics. In IEEE Workshop on Higher-Order Spectral Analysis, Cited by: §1, §2.
  8. Source separation using higher order moments. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, Vol. 4, pp. 2109–2112. External Links: Document, ISSN 07367791 Cited by: §1.
  9. Infomax and Maximum Likelihood for Blind Source Separation. IEEE Letters on Signal Processing 4, pp. 112–114. Cited by: §1, §2.
  10. Sur quelques propriétés des déterminants gauches. Journal für die reine und angewandte Mathematik 32, pp. 119–123. Cited by: §5.
  11. Isolating Sources of Disentanglement in Variational Autoencoders. In NeurIPS, Cited by: §7.
  12. Variational Methods for Bayesian Independent Component Analysis. Ph.D. Thesis, University of Oxford. Cited by: Appendix C, §2.
  13. Independent component analysis, A new concept?. Signal Processing 36 (3), pp. 287–314. External Links: Document, ISSN 01651684 Cited by: §1.
  14. An Elementary Proof of a Theorem of Johnson and Lindenstrauss. Random Structures and Algorithms 22 (1), pp. 60–65. External Links: Document, ISSN 10429832 Cited by: §5.1.
  15. Higher Order Statistical Decorrelation without Information Loss. In NeurIPS, Cited by: §1, §2.1, §7.
  16. NICE: Non-linear Independent Components Estimation. In ICLR, Cited by: §1, §7.
  17. Density estimation using Real NVP. In ICLR, Cited by: Appendix F, §1, §7.
  18. Neural Spline Flows. In NeurIPS, Cited by: Appendix D, Appendix F, Appendix F, §3.1, §3.2, §3.2.
  19. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications 20 (2), pp. 303–353. External Links: Document, ISSN 08954798 Cited by: §5.1.
  20. Structured Disentangled Representations. In AISTATS, Cited by: §7.
  21. Independent Component Analysis: A Flexible Nonlinearity and Decorrelating Manifold Approach. Neural Computation 11 (8), pp. 1957–83. Cited by: Appendix A, Appendix A, Appendix B, §2, §5.
  22. Independent Component Analysis. Cambridge University Press. External Links: ISBN 9780521792981, Document Cited by: Appendix C, §1, §4.
  23. Improving Normalizing Flows via Better Orthogonal Parameterizations. In ICML Workshop on Invertible Neural Networks and Normalizing Flows, Cited by: §7.
  24. β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In ICLR, External Links: Document, ISSN 1078-0874 Cited by: §7.
  25. Norm matters: Efficient and accurate normalization schemes in deep networks. In NeurIPS, Cited by: §6.1.
  26. Unsupervised variational Bayesian learning of nonlinear models. NeurIPS. Cited by: §2.
  27. Independent Component Analysis. John Wiley. Cited by: Appendix C, §1.
  28. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks 12 (3), pp. 429–439. External Links: Document, ISSN 08936080 Cited by: §2.1, §2.
  29. Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning. In AISTATS, Cited by: §2.1.
  30. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), External Links: ISBN 9780874216561, Document, ISSN 0717-6163 Cited by: §6.1.
  31. Extensions of Lipschitz mappings into a Hilbert space. Contemporary mathematics 26 (1), pp. 189–206. External Links: Document Cited by: §5.1.
  32. Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing 24 (1), pp. 1–10. External Links: Document, ISSN 01651684 Cited by: §1, §2.
  33. Nonlinear Independent Component Analysis. In ICA: Principles and Practice, R. Everson and S. J. Roberts (Eds.), pp. 113–134. Cited by: §2.1, §2.
  34. Variational Autoencoders and Nonlinear ICA: A Unifying Framework. In AISTATS, Cited by: §2.1, §7.
  35. Disentangling by Factorising. In NeurIPS, Cited by: §7.
  36. Adam: A Method for Stochastic Optimisation. In ICLR, External Links: Link Cited by: Appendix F, §8.
  37. Improved Variational Inference with Inverse Autoregressive Flow. In NeurIPS, Cited by: §7.
  38. Auto-encoding Variational Bayes. In ICLR, Cited by: §4, §7.
  39. Glow: Generative flow with invertible 1x1 convolutions. NeurIPS. Cited by: §7.
  40. Bayesian Non-Linear Independent Component Analysis by Multi-Layer Perceptrons. In Advances in Independent Component Analysis, M. Girolami (Ed.), pp. 93–121. External Links: Document Cited by: §2.1, §2.
  41. Variational Bayesian Independent Component Analysis. Technical report University of Cambridge. Cited by: §2, §2.
  42. Blind source separation of nonlinear mixing models. Neural Networks for Signal Processing - Proceedings of the IEEE Workshop, pp. 406–415. External Links: ISBN 0780342569, Document Cited by: §2.1.
  43. A Unifying Information-Theoretic Framework for Independent Component Analysis. Computers & Mathematics with Applications 39 (11), pp. 1–21. Cited by: §1, §2.
  44. Independent Component Analysis for Mixed Sub-Gaussian and Super-Gaussian Sources. Joint Symposium on Neural Computation, pp. 6–13. External Links: Link Cited by: Appendix B.
  45. Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform. In ICLR, Cited by: §7.
  46. An application of the principle of maximum information preservation to linear systems. NeurIPS. Cited by: §2.
  47. Maximum Likelihood and Covariant Algorithms for Independent Component Analysis. Technical report University of Cambridge. Cited by: Appendix B, §2, §2.
  48. Normalizing Flows for Probabilistic Modeling and Inference. Technical report DeepMind, London, UK. Cited by: §3, §7.
  49. Symplectic nonlinear component analysis. In NeurIPS, pp. 437–443. Cited by: §1, §7.
  50. Redundancy reduction with information-preserving nonlinear maps. Network: Computation in Neural Systems 6 (1), pp. 61–72. External Links: Document, ISSN 0954898X Cited by: §1, §7.
  51. Statistical Independence and Novelty Detection with Information Preserving Nonlinear Maps. Neural Computation 8 (2), pp. 260–269. External Links: Document, ISSN 08997667 Cited by: §1, §7.
  52. Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In NeurIPS, Cited by: §2.
  53. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In ICML, Cited by: §4, §7.
  54. Variational Inference with Normalizing Flows. In ICML, Cited by: §3.
  55. Bayesian independent component analysis with prior constraints: An application in biosignal analysis. In Deterministic and Statistical Methods in Machine Learning, First International Workshop, Sheffield, UK. External Links: ISBN 3540290737, Document, ISSN 03029743 Cited by: §2.
  56. Independent Component Analysis: Source Assessment & Separation, a Bayesian Approach. IEEE Proceedings-Vision, Image and Signal Processing 145 (3), pp. 149–154. Cited by: §2.
  57. A unifying review of linear gaussian models. Neural Computation 11 (2), pp. 305–345. External Links: Document, ISSN 08997667 Cited by: Appendix B, §2.
  58. Richtungsfelder und Fernparallelismus in n-dimensionalen Mannigfaltigkeiten.. Commentarii mathematici Helvetici 8, pp. 305–353. Cited by: §1, §5.1.
  59. A generic framework for blind source separation in structured nonlinear models. IEEE Transactions on Signal Processing 50 (8), pp. 1819–1830. External Links: Document, ISSN 1053587X Cited by: §2.1.
  60. Nonlinear blind source separation by variational Bayesian learning. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E86-A (3), pp. 532–541. External Links: ISSN 09168508 Cited by: §2.1, §2.
  61. Sylvester normalizing flows for variational inference. In UAI, Vol. 1, pp. 393–402. External Links: ISBN 9781510871601 Cited by: §7.
  62. Bayesian independent component analysis: Variational methods and non-negative decompositions. Digital Signal Processing 17 (5), pp. 858–872. External Links: Document, ISSN 10512004 Cited by: §2.
  63. Sketching as a Tool for Numerical Linear Algebra. Foundations and Trends in Theoretical Computer Science 10 (2), pp. 1–157. External Links: Document Cited by: §5.1.
  64. Information-theoretic approach to blind separation of sources in non-linear mixture. Signal Processing 64 (3), pp. 291–300. External Links: Document, ISSN 01651684 Cited by: §2.1.