Abstract
Separating high-dimensional data like images into independent latent factors remains an open research problem. Here we develop a method that jointly learns a linear independent component analysis (ICA) model with nonlinear bijective feature maps. By combining these two methods, ICA can learn interpretable latent structure for images. For non-square ICA, where we assume the number of sources is less than the dimensionality of the data, we achieve better unsupervised latent factor discovery than flow-based models and linear ICA. This performance scales to large image datasets such as CelebA.
1 Introduction
In linear Independent Component Analysis (ICA), data is modelled as having been created by linearly mixing together independent latent sources Cardoso (1989a, b); Jutten and Herault (1991); Comon (1994); Bell and Sejnowski (1995); Cardoso (1997); Lee et al. (2000). The canonical problem is blind source separation: the aim is to estimate the original sources of a mixed set of signals by learning an unmixing or decorrelating matrix (we use these terms interchangeably), which when multiplied with the data recovers the values of these sources. While linear ICA is a powerful approach to undoing the mixing of signals like sound Everson and Roberts (2001); Hyvärinen et al. (2001), it has not been as effectively developed for learning compact representations of high-dimensional data like images, where the linear assumption is limiting. Nonlinear ICA methods, where we assume the data has been created from a nonlinear mixture of latent sources, offer better performance on such data.
In particular, flow-based models have been proposed as a nonlinear approach to square ICA, where we assume the dimensionality of our latent source space is the same as that of our data Deco and Brauer (1995); Parra et al. (1995, 1996); Parra (1996); Dinh et al. (2015, 2017). These flows parameterise a bijective mapping between data and a feature space of the same dimension, and can be trained under a maximum likelihood objective for a chosen prior in that space. While such models can be extremely powerful generative models, for most image data one would want fewer latent variables than the number of pixels in an image. In such situations, we wish to learn a non-square (dimensionality-reducing) ICA representation over images.
Here we propose a novel methodology for performing non-square ICA using a model with two jointly trained parts: a non-square linear ICA model operating on a feature space output by a bijective flow. The bijective flow is tasked with learning a representation for which linear ICA is a good model; it is as if we are learning the data for our ICA model. Further, to induce optimal source separation in our model, we introduce novel theory for the parameterisation of decorrelating, non-square ICA matrices close to the Stiefel manifold Stiefel (1935), the space of orthonormal rectangular matrices. In doing so we introduce a novel non-square linear ICA model which can successfully induce dimensionality reduction in flow-based models, and which scales non-square nonlinear ICA methods to high-dimensional image data.
We show that our hybrid model Bijecta, a flow jointly trained with our ICA model, outperforms each of its constituent components in isolation in terms of latent factor discovery. We demonstrate this on the MNIST, Fashion-MNIST, CIFAR-10, and CelebA datasets.
More broadly we demonstrate that:

- By combining bijective mappings with non-square linear ICA we are able to learn a low-dimensional ICA source representation for high-dimensional data.

- We show that our method induces concentration of information into a low-dimensional manifold in the bijective space of the flow, unlike flows trained under a standard base distribution.
2 Independent Component Analysis
ICA is a highly diverse modelling paradigm with numerous variants: learning a mapping vs learning a model, linear vs nonlinear, different loss functions, different generative models, and a wide array of methods of inference Cardoso (1989a); Jutten and Herault (1991); Mackay (1996); Roweis and Ghahramani (1999); Hyvärinen and Pajunen (1999); Lee et al. (2000); Lappalainen and Honkela (2000); Karhunen (2001). Generally, the goal of ICA is to learn a set of statistically independent sources that ‘explain’ our data.
In this paper, we follow the approach of specifying a generative model and finding pointwise maximum likelihood estimates of model parameters. This variety of ICA stems from demonstrations that earlier infomax-principle Linsker (1989) approaches to ICA Bell and Sejnowski (1995) are equivalent to maximum likelihood training Mackay (1996); Cardoso (1997); Pearlmutter and Parra (1997); Roberts (1998); Everson and Roberts (1999). Mean-field variational inference for ICA, both linear and nonlinear, has been developed in Lawrence and Bishop (2000); Choudrey (2000); Valpola et al. (2003); Roberts and Choudrey (2004); Honkela and Valpola (2005); Winther and Petersen (2007).
Concretely, we have a model with latent sources $\mathbf{s} \in \mathbb{R}^{d_s}$ generating data $\mathbf{x} \in \mathbb{R}^{d_x}$, with $d_s \leq d_x$. The generative model for linear ICA factorises as

$$p(\mathbf{x}, \mathbf{s}) = p(\mathbf{x}|\mathbf{s})\,p(\mathbf{s}), \qquad (1)$$

where $p(\mathbf{s})$ is a product of independent distributions appropriate for ICA,

$$p(\mathbf{s}) = \prod_{i=1}^{d_s} p(s_i). \qquad (2)$$
In linear ICA, where all mappings are simple matrix multiplications, the priors over the sources cannot be Gaussian distributions if the model is to be identifiable. Recall that we are mixing our sources to generate our data: a linear mixing of Gaussian random variables is itself Gaussian, so unmixing is impossible Lawrence and Bishop (2000). To break this symmetry and be able to unmix, we can choose any heavy-tailed or light-tailed non-Gaussian distribution as our prior $p(s_i)$. This gives us axis alignment and independence between sources.
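The requirement for non-Gaussian priors can be illustrated numerically. The sketch below (our own illustration, not from the paper) rotates pairs of Gaussian and of Laplace sources with an orthogonal mixing matrix: the Gaussian pair is statistically unchanged by the rotation, while the excess kurtosis of the Laplace pair visibly drops toward zero under mixing — exactly the higher-order signal ICA exploits.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def excess_kurtosis(x):
    """Fourth standardised moment minus 3 (0 for a Gaussian)."""
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2)**2 - 3.0

# A 45-degree rotation: an orthogonal "mixing" matrix.
theta = np.pi / 4
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Gaussian sources: a rotation leaves the joint distribution unchanged,
# so no statistic can identify the mixing -- unmixing is impossible.
g = rng.standard_normal((2, n))
print(excess_kurtosis((A @ g)[0]))   # ~0, same as the unmixed sources

# Laplace (heavy-tailed) sources: mixing measurably changes higher moments.
s = rng.laplace(size=(2, n))
print(excess_kurtosis(s[0]))         # ~3 for a unit Laplace source
print(excess_kurtosis((A @ s)[0]))   # ~1.5: mixing pushes it toward Gaussian
```

The drop in kurtosis under mixing is what makes the Laplace case unmixable in principle, and the invariance of the Gaussian case is why the symmetry must be broken by a non-Gaussian prior.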
2.1 Non-Linear ICA
One approach to extending ICA is to have a nonlinear mapping between the independent sources and the data Burel (1992); Deco and Brauer (1995); Yang et al. (1998); Karhunen (2001); Almeida (2003); Valpola et al. (2003). In general, nonlinear ICA models have been shown to be hard to train and to suffer from unidentifiability Hyvärinen and Pajunen (1999); Karhunen (2001); Almeida (2003); Hyvarinen et al. (2019). This means that for a given dataset the model has numerous local minima it can reach under its training objective, with potentially very different learnt sources associated with each.
Some nonlinear ICA models have been specified with additional structure, such as placing priors on variables Lappalainen and Honkela (2000) or specifying the precise nonlinear functions involved Lee and Koehler (1997); Taleb (2002), reducing the space of potential solutions. Recent work Khemakhem et al. (2020) has proved that conditioning the source distributions on some always-observed side information, such as a time index or the class of the data, can be sufficient to induce identifiability in nonlinear ICA.
3 Flows
Flows are models that stack numerous invertible changes of variables. One specifies a relatively simple base distribution and learns a sequence of invertible transforms, defined to have tractable Jacobian determinants, such that one's data is likely after that mapping.
Given a latent variable $\mathbf{z} = f(\mathbf{x})$, we can specify our distribution over data as

$$p_x(\mathbf{x}) = p_z(f(\mathbf{x}))\left|\det\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\right|, \qquad (3)$$

where $f$ is a bijection and $p_z$ is the base distribution over the latent (Rezende and Mohamed, 2015; Papamakarios et al., 2019). To create more powerful and flexible distributions for $\mathbf{x}$ we can use the properties of function composition to specify $f$ as a series of transformations of our simple prior into a more complex multimodal distribution, e.g. for a series of $K$ mappings,

$$f = f_K \circ f_{K-1} \circ \cdots \circ f_1. \qquad (4)$$

By the properties of determinants under function composition we obtain

$$\log p_x(\mathbf{x}) = \log p_{z_K}(\mathbf{z}_K) + \sum_{k=1}^{K} \log\left|\det\frac{\partial \mathbf{z}_k}{\partial \mathbf{z}_{k-1}}\right|, \qquad (5)$$

where $\mathbf{z}_k$ denotes the variable resulting from the transformation $f_k$, $p_{z_K}$ defines a density on $\mathbf{z}_K$, and the bottom-most variable is our data ($\mathbf{z}_0 = \mathbf{x}$).
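The change of variables in Eq (3) and the composition rule in Eq (5) can be sketched in one dimension. This is our own illustration: two hand-picked invertible maps and a Laplace base density, with each map returning its value together with its log absolute Jacobian; the transformed density should still integrate to one.

```python
import numpy as np

def laplace_logpdf(z):
    """Log density of a unit Laplace base distribution."""
    return -np.abs(z) - np.log(2.0)

def f1(x):  # affine bijection; log|det J| = log 2
    return 2.0 * x + 1.0, np.full_like(x, np.log(2.0))

def f2(x):  # smooth monotonic bijection; f2'(x) = 1 + 0.5 cos(x) > 0
    return x + 0.5 * np.sin(x), np.log(1.0 + 0.5 * np.cos(x))

def flow_logpdf(x):
    """log p_x(x) = log p_z(f2(f1(x))) + sum of log|det| terms, as in Eq (5)."""
    z, ld1 = f1(x)
    z, ld2 = f2(z)
    return laplace_logpdf(z) + ld1 + ld2

# Numerically integrate the transformed density over a wide grid:
# a valid change of variables must preserve total probability mass.
xs = np.linspace(-40.0, 40.0, 400_001)
dx = xs[1] - xs[0]
mass = np.exp(flow_logpdf(xs)).sum() * dx
print(mass)   # ~1.0
```

The accumulated log-determinant terms are exactly what makes the composed density normalised; dropping either `ld1` or `ld2` would break the integral.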
3.1 Coupling Layers
Computing the determinant of the Jacobian, $\det(\partial f(\mathbf{x})/\partial \mathbf{x})$, in Eq. (3) can be prohibitively costly, especially when composing multiple functions as in Eq. (5). To address this, most flows use coupling layers that enforce a lower triangular Jacobian, such that the determinant of the Jacobian is simply the product of its diagonal elements. We use the coupling layers of Durkan et al. (2019) to enforce this lower triangular structure. For an outline of these coupling layers, see Appendix E.
3.2 Spline Flows
Generally, each transform $f_k$ must be bijective for the flow to be invertible. A powerful and flexible class of functions that satisfy this requirement are monotonic rational quadratic splines (RQS) Durkan et al. (2019).
We use these splines as part of the coupling layers detailed above and parameterise them using deep neural networks. These networks encode monotonically increasing knots, the coordinate pairs $\{(x_k, y_k)\}_{k=0}^{K}$ through which the function passes. These networks also encode the derivative at each of these knots. Using these parameters we can interpolate values between the knots using the equation for the RQS transformation detailed in Durkan et al. (2019). The resulting function is highly flexible, so RQS flows require fewer composed mappings to achieve good performance relative to other coupling layers.
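A minimal sketch of the forward RQS transform, following the rational-quadratic interpolation formula of Durkan et al. (2019); the knot positions and derivatives below are arbitrary illustrative values, not learnt network outputs. With positive derivatives at every knot the interpolant is monotonic, hence invertible.

```python
import numpy as np

def rqs_forward(x, xs, ys, ds):
    """Monotonic rational-quadratic spline through knots (xs, ys)
    with positive derivatives ds at each knot."""
    k = np.clip(np.searchsorted(xs, x) - 1, 0, len(xs) - 2)  # bin index
    w, h = xs[k + 1] - xs[k], ys[k + 1] - ys[k]
    s = h / w                              # slope of the bin
    xi = (x - xs[k]) / w                   # position within the bin, [0, 1]
    num = h * (s * xi**2 + ds[k] * xi * (1 - xi))
    den = s + (ds[k + 1] + ds[k] - 2 * s) * xi * (1 - xi)
    return ys[k] + num / den

xs = np.array([0.0, 0.3, 0.7, 1.0])        # knot x-positions (increasing)
ys = np.array([0.0, 0.5, 0.8, 1.0])        # knot y-positions (increasing)
ds = np.array([1.0, 2.0, 0.5, 1.0])        # positive derivatives at knots

grid = np.linspace(0.0, 1.0, 1001)
out = np.array([rqs_forward(t, xs, ys, ds) for t in grid])
print(np.all(np.diff(out) > 0))            # monotonic, hence invertible
```

The function passes exactly through every knot and is strictly increasing between them, which is the property the coupling layers rely on for exact invertibility.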
4 Bijecta: Combining Flows with ICA
We combine linear ICA with a dimensionality-preserving invertible flow $f_\theta: \mathcal{X} \to \mathcal{Z}$. The flow acts between our data space $\mathcal{X}$ and the representation space $\mathcal{Z}$ of the linear ICA generative model.
We want the latent representation $\mathbf{z} = f_\theta(\mathbf{x})$ to be structured such that it is well fit by the simple, linear ICA model. In some sense, this can be thought of as 'learning the data' for an ICA model, where the 'data' the ICA model acts on is the output of the invertible feature map defined by the flow. The inferred latent sources $\mathbf{s}$ of the ICA model can have a lower dimensionality than the data or the flow representation.
We choose our base ICA source distribution, and the ICA generative likelihood, to be multivariate Laplace distributions Everson and Roberts (2001). To be clear, the result of this mixing of sources is not our data, but an intermediate representation $\mathbf{z}$ which the flow maps to data:

$$p(\mathbf{s}) = \mathrm{Laplace}(\mathbf{s};\, \mathbf{0}, \mathbf{I}), \qquad (6)$$

$$p(\mathbf{z}|\mathbf{s}) = \mathrm{Laplace}(\mathbf{z};\, \mathbf{A}\mathbf{s}, \boldsymbol{\beta}), \qquad (7)$$

where $\mathbf{A} \in \mathbb{R}^{d_z \times d_s}$ is our (unknown) ICA mixing matrix, which acts on the sources to produce a linear mixture, and $\boldsymbol{\beta}$ is a learnt or fixed diagonal diversity.
Our model has three sets of variables: the observed data $\mathbf{x}$, the flow representation $\mathbf{z} = f_\theta(\mathbf{x})$, and the ICA latent sources $\mathbf{s}$. It can be factorised as

$$p(\mathbf{x}, \mathbf{s}) = p(\mathbf{x}|\mathbf{s})\,p(\mathbf{s}), \qquad (8)$$

$$p(\mathbf{x}|\mathbf{s}) = p(\mathbf{z} = f_\theta(\mathbf{x})\,|\,\mathbf{s})\left|\det\frac{\partial f_\theta(\mathbf{x})}{\partial \mathbf{x}}\right|. \qquad (9)$$
To train the ICA part of the model by maximum likelihood we would marginalise out $\mathbf{s}$ and evaluate the evidence in $\mathcal{Z}$:

$$p(\mathbf{z}) = \int p(\mathbf{z}|\mathbf{s})\,p(\mathbf{s})\,\mathrm{d}\mathbf{s}. \qquad (10)$$
This marginal is intractable. We propose a contemporary approach to approximate inference in linear ICA, using advances in amortised variational inference. That is, we introduce an approximate amortised posterior $q_\phi(\mathbf{s}|\mathbf{z})$ and perform importance sampling on Eq (10), taking gradients through our samples using the reparameterisation trick Kingma and Welling (2014); Rezende et al. (2014).
Training this model using amortised stochastic variational inference brings numerous benefits: training scales to large datasets using stochastic gradient descent, and we are free to choose the functional and probabilistic form of our approximate posterior. We choose a linear mapping in our posterior, with

$$q_\phi(\mathbf{s}|\mathbf{z}) = \mathrm{Laplace}(\mathbf{s};\, \mathbf{W}\mathbf{z}, \boldsymbol{\delta}), \qquad (11)$$

where we have introduced variational parameters $\phi = \{\mathbf{W}, \boldsymbol{\delta}\}$ corresponding to an unmixing matrix and a diagonal diversity.
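Sampling from this Laplace posterior with the reparameterisation trick can be sketched as below; the values of the unmixing matrix and diversity here are illustrative stand-ins for learnt parameters. A Laplace draw is an invertible transform of a uniform noise variable, so gradients can flow through the location and scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_laplace_posterior(W, delta, z, n_samples):
    """Reparameterised samples from q(s|z) = Laplace(W z, delta):
    u ~ Uniform(-1/2, 1/2),  s = mu - delta * sign(u) * log(1 - 2|u|)."""
    mu = W @ z                                      # posterior location
    u = rng.uniform(-0.5, 0.5, size=(n_samples, len(mu)))
    return mu - delta * np.sign(u) * np.log1p(-2.0 * np.abs(u))

W = np.array([[0.6, -0.2, 0.1],                     # toy unmixing matrix
              [0.3,  0.4, -0.5]])
delta = np.array([0.5, 1.5])                        # diagonal diversity
z = np.array([1.0, -1.0, 2.0])                      # a single flow embedding

s = sample_laplace_posterior(W, delta, z, 100_000)
print(s.mean(axis=0))                  # ~ W @ z  (posterior location)
print(np.abs(s - W @ z).mean(axis=0))  # ~ delta  (mean absolute deviation)
```

Because the noise `u` is independent of `W`, `delta`, and `z`, stochastic gradients of any Monte Carlo objective built from these samples pass through the variational parameters, which is what makes the amortised training in the next equations possible.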
We use samples from this posterior to define a lower bound on the evidence in $\mathcal{Z}$,

$$\log p(\mathbf{z}) \geq \mathcal{L}_{\mathrm{ICA}}(\mathbf{z}) = \mathbb{E}_{q_\phi(\mathbf{s}|\mathbf{z})}\left[\log p(\mathbf{z}|\mathbf{s})\right] - \mathrm{KL}\left(q_\phi(\mathbf{s}|\mathbf{z})\,\|\,p(\mathbf{s})\right). \qquad (12)$$
Using the change of variables equation, Eq (3), and this lower bound on the evidence in $\mathcal{Z}$, we can obtain a variational lower bound on the evidence for our data as the sum of the ICA model's ELBO (acting on $\mathbf{z} = f_\theta(\mathbf{x})$) and the log determinant of the flow:

$$\log p(\mathbf{x}) = \log p(\mathbf{z}) + \log\left|\det\frac{\partial f_\theta(\mathbf{x})}{\partial \mathbf{x}}\right| \qquad (13)$$

$$\geq \mathcal{L}_{\mathrm{ICA}}(f_\theta(\mathbf{x})) + \log\left|\det\frac{\partial f_\theta(\mathbf{x})}{\partial \mathbf{x}}\right| \equiv \mathcal{L}(\mathbf{x}). \qquad (14)$$
As such, our model is akin to a flow model, but with an additional latent variable $\mathbf{s}$; the base distribution of the flow is defined by marginalising out the linear mixing of the sources. We refer to a model with $K$ nonlinear splines mapping from $\mathcal{X}$ to $\mathcal{Z}$ as a $K$-layer Bijecta model.
In the case of non-square ICA, where our ICA model is not perfectly invertible, errors when reconstructing a mapping from $\mathcal{Z}$ to $\mathcal{S}$ may be amplified when mapping back to $\mathcal{X}$. We therefore penalise the reconstruction error of each point in $\mathcal{X}$ as an additional regularisation term in our loss. This penalty can be weighted according to the importance of high-fidelity reconstructions for a given application.
5 Manifolds for the unmixing matrix
What are good choices for the mixing and unmixing matrices? As we will show, design choices in the parameterisation of $\mathbf{W}$ can accelerate the convergence of our flow-based ICA model, and provide guarantees on the structure of the learnt projections.
Before discussing how to pick unmixing matrices for non-square linear ICA, we briefly cover the square case, where the number of generative factors is equal to the dimensionality of our data space, $d_s = d_z$. This assumption greatly simplifies the construction of the unmixing matrix. Generally, members of the orthogonal group have been shown to be optimal for decorrelating ICA sources; with sufficient data the maximum likelihood unmixing matrix lies on this decorrelating manifold and will be reached by models confined to this manifold Everson and Roberts (1999).
Thus we constrain $\mathbf{W}$ to belong to $SO(d_s)$, the Lie group of special orthogonal matrices with determinant 1. We want to perform unconstrained optimisation when learning our matrix, so we use a differentiable transformation from a class of simpler matrices to the class of orthogonal matrices.
For a given antisymmetric matrix $\mathbf{M}$ (i.e., satisfying $\mathbf{M}^T = -\mathbf{M}$), its Cayley transform Cayley (1846) is a member of $SO(d_s)$. As such, we propose defining our square unmixing matrix as the Cayley transform of the antisymmetric matrix $\mathbf{M}$,

$$\mathbf{W} = (\mathbf{I} - \mathbf{M})(\mathbf{I} + \mathbf{M})^{-1}. \qquad (15)$$

This can be formulated as an unconstrained problem, easing optimisation, by defining $\mathbf{M} = \mathbf{Y} - \mathbf{Y}^T$ and then optimising over the square real-valued matrix $\mathbf{Y}$.
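The Cayley parameterisation can be checked in a few lines; the dimension and random seed here are arbitrary. Any square matrix yields an antisymmetric one, and its Cayley transform lands exactly on the special orthogonal group.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

Y = rng.standard_normal((d, d))        # free parameters, optimised directly
M = Y - Y.T                            # antisymmetric: M^T = -M
I = np.eye(d)
W = (I - M) @ np.linalg.inv(I + M)     # Cayley transform, Eq (15)

print(np.allclose(W @ W.T, I))             # True: W is orthogonal
print(np.isclose(np.linalg.det(W), 1.0))   # True: determinant +1, so W in SO(d)
```

Note that `I + M` is always invertible because an antisymmetric matrix has purely imaginary eigenvalues, so unconstrained gradient steps on `Y` never leave the manifold.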
5.1 $\mathbf{W}$ for non-square ICA
Intuitively, for non-square ICA our aim is to construct an unmixing matrix $\mathbf{W}$ that is approximately orthogonal, such that it has the decorrelating properties we observe when constraining it to lie on $SO(d_s)$ in the case of square ICA. Here we propose a new method for non-square ICA, combining ideas from sketching Woodruff (2014) with these manifold-learning approaches.
The set of rectangular matrices that are exactly orthogonal, i.e. that fulfil $\mathbf{W}\mathbf{W}^* = \mathbf{I}$ ($\mathbf{W}^*$ the conjugate transpose of $\mathbf{W}$), lie on the Stiefel manifold Stiefel (1935); Edelman et al. (1998). For a given number of rows $d_s$ and columns $d_z$, this manifold is denoted $\mathcal{V}_{d_s}(\mathbb{R}^{d_z})$. We wish to specify our rectangular unmixing matrix of dimensionality $d_s \times d_z$ to be approximately Stiefel, lying close to $\mathcal{V}_{d_s}(\mathbb{R}^{d_z})$. This choice is justified by the following theorem, which we prove in Appendix B:
Theorem 1
As the Frobenius norm $\|\mathbf{W} - \pi(\mathbf{W})\|_F \to 0$, where $\pi(\mathbf{W})$ is the projection of $\mathbf{W}$ onto $\mathcal{V}_{d_s}(\mathbb{R}^{d_z})$, $\mathbf{C} \to \mathbf{\Lambda}$, where $\mathbf{C}$ is the cross-correlation of the projection of the data by $\mathbf{W}$, and $\mathbf{\Lambda}$ is some diagonal matrix.
More simply, this shows that as a matrix approaches $\mathcal{V}_{d_s}(\mathbb{R}^{d_z})$ the off-diagonal elements of the cross-correlation matrix $\mathbf{C}$ become smaller in magnitude, and we achieve independent projections $\mathbf{W}\mathbf{z}$.
In the case of non-square ICA, the dimensionality of our latent space is less than the dimensionality of our data space, $d_s < d_z$, so $\mathbf{W}$ cannot in general form a bijection. We need $\mathbf{W}$ to compress from $d_z$ to $d_s$ dimensions, and we need it to perform a rotation such that the learnt latent sources align with the axis-aligned priors we impose in $\mathcal{S}$. We decompose

$$\mathbf{W} = \mathbf{R}\,\boldsymbol{\Phi}, \qquad (16)$$

each part doing one of these tasks. The projection matrix $\boldsymbol{\Phi} \in \mathbb{R}^{d_s \times d_z}$ handles the dimensionality reduction, while $\mathbf{R} \in \mathbb{R}^{d_s \times d_s}$ performs unmixing in $\mathcal{S}$. In essence, $\mathbf{R}$ is our decorrelating matrix and $\boldsymbol{\Phi}$ is our projection matrix. We make specific choices for the structure of both these matrices, which ensure the resulting $\mathbf{W}$ lies close to the manifold $\mathcal{V}_{d_s}(\mathbb{R}^{d_z})$, while still permitting us to perform optimisation in an unconstrained parameter space.
The Lie group $SO(d_s)$ for $\mathbf{R}$: Because $\mathbf{R}$ is a square matrix we can constrain it to belong to $SO(d_s)$ as in the case of square ICA. By doing so we ensure $\mathbf{R}$ has the decorrelating properties we are seeking:

$$\mathbf{R} = (\mathbf{I} - \mathbf{M})(\mathbf{I} + \mathbf{M})^{-1}, \qquad (17)$$

where $\mathbf{M}$ is an antisymmetric matrix as detailed in Section 5.
$\boldsymbol{\Phi}$ for an approximately Stiefel $\mathbf{W}$: Our goal is to construct a rectangular matrix $\mathbf{W}$ such that $\mathbf{W} = \mathbf{R}\boldsymbol{\Phi}$ and $\mathbf{W}\mathbf{W}^T \approx \mathbf{I}$. As stated above, we constrain $\mathbf{R}$, one part of $\mathbf{W}$, to be exactly orthogonal. For our compressive matrix $\boldsymbol{\Phi}$, our goal is to constrain it to be approximately orthogonal. The justification for this is provided by the following theorem:
Theorem 2
For $\mathbf{W} = \mathbf{R}\boldsymbol{\Phi}$, $\mathbf{R} \in SO(d_s)$, as the Frobenius norm $\|\boldsymbol{\Phi} - \pi(\boldsymbol{\Phi})\|_F \to 0$, we also have $\|\mathbf{W} - \pi(\mathbf{W})\|_F \to 0$, where $\pi(\cdot)$ is the projection onto $\mathcal{V}_{d_s}(\mathbb{R}^{d_z})$.
The proof of this theorem is in Appendix A. If $\boldsymbol{\Phi}$ satisfies $\boldsymbol{\Phi}\boldsymbol{\Phi}^T \approx \mathbf{I}$, then $\mathbf{W}$ lies close to $\mathcal{V}_{d_s}(\mathbb{R}^{d_z})$ by Theorem 2, and by Theorem 1 we can deduce that $\mathbf{W}$ will have decorrelating properties.
Recall that our optimisation space for $\mathbf{W}$ lies in $\mathbb{R}^{d_s \times d_z}$. For high-dimensional data such as images this can constitute a prohibitively large search space for an approximately orthogonal matrix. We can instead fix $\boldsymbol{\Phi}$, such that optimisation occurs in $\mathbb{R}^{d_s \times d_s}$, solely for the matrix $\mathbf{R}$. For most non-square ICA problems we assume that $d_s \ll d_z$, and fixing $\boldsymbol{\Phi}$ can greatly accelerate optimisation without limiting the space of solutions for the unmixing matrix $\mathbf{W}$.
Such approximately orthogonal matrices can be constructed very cheaply by way of Johnson-Lindenstrauss (JL) projections Johnson and Lindenstrauss (1984); Dasgupta and Gupta (2003). We draw one such projection once at initialisation and leave it fixed for the remainder of training.
Johnson-Lindenstrauss Projections: a random matrix with entries drawn i.i.d. from a suitably scaled distribution,

$$\Phi_{ij} \sim \mathcal{N}(0, 1/d_z), \qquad (18)$$

satisfies $\boldsymbol{\Phi}\boldsymbol{\Phi}^T \approx \mathbf{I}_{d_s}$ with high probability, and so lies close to $\mathcal{V}_{d_s}(\mathbb{R}^{d_z})$.
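A JL-style projection can be constructed and checked in a few lines; the dimensions below are illustrative, and we use Rademacher (±1) entries — one standard JL construction — rather than claiming this is the paper's exact choice.

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, d_z = 16, 1024                        # sources << flow dimension

# Rademacher entries scaled by 1/sqrt(d_z): rows have unit norm exactly,
# and distinct rows are nearly orthogonal with high probability.
Phi = rng.choice([-1.0, 1.0], size=(d_s, d_z)) / np.sqrt(d_z)

G = Phi @ Phi.T                            # should be close to the identity
off_diag = G - np.diag(np.diag(G))
print(np.allclose(np.diag(G), 1.0))        # rows have exactly unit norm
print(np.abs(off_diag).max())              # small: Phi is approximately Stiefel
```

Because the matrix is drawn once and frozen, all learnable unmixing capacity sits in the small square rotation acting on its output, which is what makes the optimisation cheap.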
5.2 The mixing matrix $\mathbf{A}$
The mixing matrix $\mathbf{A}$ requires fewer constraints than the unmixing matrix $\mathbf{W}$. For non-square ICA we construct $\mathbf{A}$ as the product

$$\mathbf{A} = \mathbf{E}\,\mathbf{R}^T, \qquad (19)$$

where $\mathbf{R}^T$ is exactly the transpose of the unmixing component $\mathbf{R}$ of $\mathbf{W}$, and the matrix $\mathbf{E} \in \mathbb{R}^{d_z \times d_s}$ is unconstrained. We are essentially using the inverse of the unmixing rotation and projecting it from $\mathcal{S}$ space to $\mathcal{Z}$ space by way of $\mathbf{E}$.
6 Experiments
A good unsupervised ICA model is one that can learn useful and informative latent representations, produce realistic-looking samples, and closely approximate the underlying data distribution. Here we show that Bijecta outperforms both flows with fixed base distributions and linear ICA models by these criteria.
6.1 Likelihood Collapse
Our model is rewarded for learning representations that can be readily decorrelated by a linear ICA model. This presents some unique failure modes, which we observed when training these models. Under the objective we defined in Eq (14), a trivial solution for the flow is to produce a highly peaked distribution over $\mathbf{z}$: the flow maps all inputs to embeddings with near-zero mean and vanishingly small variance, so that $\mathbf{z}$ hardly depends on the input $\mathbf{x}$. Under this collapse, the ICA model estimates $q(\mathbf{s}|\mathbf{z})$ to have mean close to 0 for all datapoints, rather than an appropriate embedding-dependent posterior. Because the values of $\mathbf{z}$ have collapsed to 0 under the flow model, we obtain high likelihood values even if reconstructions and samples are poor.
To prevent this likelihood collapse, we add a final normalising layer to our flow, which prevents the variance of $\mathbf{z}$ from collapsing. This layer ensures that data batches have zero mean and unit variance. Each component $z_{n,i}$ of the input to this layer is rescaled in the following fashion:

$$\hat{z}_{n,i} = \frac{z_{n,i} - \mu_i}{\sqrt{\frac{1}{B-1}\sum_{n=1}^{B}\left(z_{n,i} - \mu_i\right)^2}}, \qquad (20)$$

where $B$ is the batch size and $\mu_i$ is the batch mean of component $i$. As Hoffer et al. (2018) show, the expected value of the denominator is the data standard deviation $\sigma_i$; as such we are enforcing unit variance. The log determinant of the Jacobian of this layer is simply the negative log of the denominator in Eq (20), summed over components.
To calculate batch statistics when generating data from our model, when we do not have an input $\mathbf{x}$, we keep running averages reminiscent of batch norm layers in neural networks Ioffe and Szegedy (2015). The running mean and standard deviation for the layer at step $t$ of training are:

$$\mu_t = \gamma\,\mu_{t-1} + (1-\gamma)\,\mu_B, \qquad \sigma_t = \gamma\,\sigma_{t-1} + (1-\gamma)\,\sigma_B,$$

where $\mu_B$ and $\sigma_B$ are the statistics calculated over the current training batch and $\gamma$ is a momentum term which regulates the size of each update. These statistics are then used to compute the inverse of the layer when generating data.
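A minimal sketch of this normalising layer with running statistics; the class name, momentum value, and dimensions are illustrative, not the paper's implementation. The forward pass standardises with batch statistics as in Eq (20) and updates the running averages; the inverse uses the running averages when sampling.

```python
import numpy as np

class NormLayer:
    """Batch standardisation with running statistics for the inverse pass."""
    def __init__(self, dim, momentum=0.9):
        self.gamma = momentum
        self.run_mean = np.zeros(dim)
        self.run_std = np.ones(dim)

    def forward(self, z):                  # z: (batch, dim)
        mu = z.mean(axis=0)
        sigma = z.std(axis=0, ddof=1)      # the (B-1) denominator of Eq (20)
        self.run_mean = self.gamma * self.run_mean + (1 - self.gamma) * mu
        self.run_std = self.gamma * self.run_std + (1 - self.gamma) * sigma
        # log|det J| per datapoint is the negative log of the denominator.
        return (z - mu) / sigma, -np.log(sigma).sum()

    def inverse(self, z_hat):              # used when generating samples
        return z_hat * self.run_std + self.run_mean

rng = np.random.default_rng(0)
layer = NormLayer(dim=4)
z = 3.0 + 2.0 * rng.standard_normal((512, 4))   # batch far from standardised
z_hat, logdet = layer.forward(z)
print(np.allclose(z_hat.mean(axis=0), 0.0))          # batch mean ~ 0
print(np.allclose(z_hat.std(axis=0, ddof=1), 1.0))   # unit variance enforced
```

Pinning the batch variance to one in this way removes the degenerate "shrink everything to zero" optimum described above, since the flow can no longer make the embeddings arbitrarily concentrated.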
6.2 Results and Discussion
Dimensionality Reduction on flow models
We first contrast our model's ability to automatically uncover sources with that of flow models with heavy-tailed base distributions. We do so by measuring the cumulative variance explained by the dimensions of $\mathcal{Z}$ for both models. If a small number of dimensions explains most of the variance in $\mathcal{Z}$, then the model has learnt a bijection which only requires a small number of dimensions to be invertible: it has in effect learnt the generating sources underpinning the data.
In Figure 5 we demonstrate that Bijecta induces better-compressed representations in $\mathcal{Z}$ than a non-compressive flow on the CIFAR-10 and Fashion-MNIST datasets. We compute the eigenvalues of the covariance matrix of the output of the flow, i.e. of $\mathbf{z} = f_\theta(\mathbf{x})$, to see how much of the total variance in the learned feature space can be explained in a few dimensions. In doing so we see that flows trained jointly with a linear ICA model effectively concentrate variation into a small number of intrinsic dimensions; this is in stark contrast with RQS flows trained with only a simple Laplace base distribution. This demonstrates that our model is capable of automatically detecting relevant directions on a low-dimensional manifold in $\mathcal{Z}$, and that the bijective component of our model is better able to isolate latent sources than a standard flow.
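The explained-variance diagnostic used above can be sketched as follows; the synthetic embeddings here stand in for flow outputs, and all sizes are illustrative. Embeddings generated from a 5-dimensional signal plus small noise should have nearly all of their variance in the top 5 eigendirections of the covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5000, 64, 5

# Synthetic embeddings: variation concentrated in k latent directions,
# plus a small isotropic residual -- a stand-in for a compressive flow's z.
z = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) * 3.0
z += 0.1 * rng.standard_normal((n, d))

# Eigenvalues of the covariance, sorted descending, give the variance
# explained by each principal direction of the learned feature space.
eigvals = np.linalg.eigvalsh(np.cov(z, rowvar=False))[::-1]
cum_explained = np.cumsum(eigvals) / eigvals.sum()
print(cum_explained[k - 1])   # top-5 directions explain nearly all variance
```

A non-compressive flow would instead show this cumulative curve rising slowly, with variance spread across all 64 directions.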
Visualisation of Learnt Latent Factors
For a visual illustration of this source separation, we show the difference in generated images resulting from smoothly varying along each dimension of $\mathcal{S}$ for Bijecta models and of $\mathcal{Z}$ for flow models. Bijecta's ability to discover latent sources is highlighted visually in Figures 4 and 10, where our model is clearly able to learn axis-aligned transformations of CelebA faces, whereas a flow trained with an equivalent computational budget is not. This improvement in source discovery translates into our model more quickly converging to a well-conditioned generative model under a heavy-tailed prior, as highlighted by the samples drawn from Bijecta and RQS flows in Figure 10.
Improving Linear ICA using bijective maps
Having ascertained the benefit of our model for source discovery relative to a flow with a heavy-tailed prior, we now demonstrate the benefit of using a bijective map to improve on the performance of linear ICA. We do so by measuring metrics which assess the independence of a model's uncovered factors in $\mathcal{S}$, and the quality of a model's compressed representation, evaluated by the model's ability to reconstruct an input from a compressed embedding in $\mathcal{S}$. Table 1 shows the dramatic improvement in these metrics when adding a single bijective mapping to linear ICA. The table also shows that the improvements grow as we stack nonlinear mappings.
|                                             | CIFAR-10 | MNIST | fashion-MNIST | CelebA |
|---------------------------------------------|----------|-------|---------------|--------|
| Linear ICA                                  | 4.8      | 4.29  | 3.0           | 9.0    |
| reconstruction error in $\mathcal{Z}$       | 3.0      | 3.1   | 2.9           | 2.9    |
| reconstruction error in $\mathcal{X}$       | 27.0     | 32.9  | 17.3          | 57.2   |
| 1-layer Bijecta                             | 4.2      | 2.8   | 2.2           | 7.3    |
| reconstruction error in $\mathcal{Z}$       | 2.0      | 1.0   | 1.6           | 1.9    |
| reconstruction error in $\mathcal{X}$       | 5.6      | 4.6   | 3.8           | 13.0   |
| 4-layer Bijecta                             | 4.7      | 3.0   | 2.2           | 7.8    |
| reconstruction error in $\mathcal{Z}$       | 1.9      | 0.6   | 1.2           | 1.4    |
| reconstruction error in $\mathcal{X}$       | 5.1      | 3.0   | 2.4           | 10.1   |
These improvements can be appraised visually in Figure E.31, which highlights the better reconstructions and sample generation obtained when introducing a single bijective layer. For other examples of such reconstructions see Appendix F. As before, we show latent traversal plots for flows and linear ICA models in Appendix F for CelebA. It is apparent that our models are better able to learn axis-aligned transforms in $\mathcal{S}$.
Viewed as a whole, these results show that the bijective map induces a representation that is better decorrelated by a linear mapping and is also easier to reconstruct under this linear mapping.
6.3 Methods
All non-square linear ICA baselines are trained under the objective detailed in Eq. (12) but with no flow, so $\mathbf{z} = \mathbf{x}$. We construct the mixing and unmixing matrices as detailed in Section 5; these baselines differ from our Bijecta model solely in their lack of a bijective map preprocessing the data. Similarly, all flow-based baselines are trained using the objective in Eq (5). We match these baselines to the bijective maps of our models in terms of size, neural network architectures, and the presence of normalising layers. Note that unless specified otherwise, all compressive models use a latent space of size 64. Training hyperparameters and network architectures are detailed in Appendix G.
7 Related Work
Modern flow-based models were originally proposed as an approach to nonlinear square ICA Dinh et al. (2015), but are also motivated by the desire in generative modelling for more expressive priors and posteriors Kingma et al. (2016); Papamakarios et al. (2019). There were similar early approaches, known as symplectic maps Deco and Brauer (1995); Parra et al. (1995, 1996); Parra (1996), which were also proposed for use with ICA. Overall they offer a variety of expressive dimensionality-preserving (and sometimes volume-preserving) bijective mappings Dinh et al. (2017); Kingma and Dhariwal (2018); Papamakarios et al. (2019).
Orthogonal transforms have been used in normalising flows before, to improve the optimisation properties of Sylvester flows Van Den Berg et al. (2018); Golinski et al. (2019). The latter work also uses a Cayley map to parameterise orthogonal matrices. Here our analysis is flow-agnostic and works with state-of-the-art non-volume-preserving flows. Researchers have also looked at imposing Stiefel-manifold structure on the weights of neural networks Li et al. (2020).
Disentangling, potentially synonymous with nonlinear non-square ICA, occurs when there is a one-to-one correspondence between dimensions of a latent space and some human-interpretable features in data generated by the model. Intuitively this means that smoothly varying along an axis-aligned direction in a model's latent space should result in smooth changes in a single feature of the data Higgins et al. (2017). There are Variational Autoencoder Rezende et al. (2014); Kingma and Welling (2014) derived models that obtain this, commonly by upweighting components of their training objective that are associated with statistical independence of their latent distributions Higgins et al. (2017); Burgess et al. (2017); Kim and Mnih (2018); Chen et al. (2018); Esmaeili et al. (2019). Khemakhem et al. (2020) prove that in nonlinear ICA, latent variables that are not conditioned on other variables cannot produce disentangled representations, but that by conditioning the prior on additional supervised information the model can become identifiable.
8 Conclusion
Here we have developed a method for performing nonlinear ICA which combines state-of-the-art flow-based models with a novel, theoretically grounded, non-square linear ICA method. Not only is our model able to learn a representation under which sources are separable by linear unmixing; we have also shown that the flow component concentrates information into a low-dimensional manifold in its representation space $\mathcal{Z}$. We have also demonstrated the value of our method for latent factor discovery on large high-dimensional image datasets. Bijecta learns a low-dimensional, explanatory set of latent representations in $\mathcal{S}$, as demonstrated by our latent traversals, and draws from the model are realistic-looking.
Appendix B Proof of optimality
Definition: We say a matrix $\mathbf{A}$ is strictly more orthogonal than a matrix $\mathbf{B}$ if $\|\mathbf{A} - \pi(\mathbf{A})\|_F < \|\mathbf{B} - \pi(\mathbf{B})\|_F$, where $\pi(\cdot)$ denotes projection onto the Stiefel manifold.
Theorem 1

As the Frobenius norm $\|\mathbf{W} - \pi(\mathbf{W})\|_F \to 0$, where $\pi(\mathbf{W})$ is the projection of $\mathbf{W}$ onto $\mathcal{V}_{d_s}(\mathbb{R}^{d_z})$, $\mathbf{C} \to \mathbf{\Lambda}$, where $\mathbf{C}$ is the cross-correlation of the projection of the data by $\mathbf{W}$, and $\mathbf{\Lambda}$ is some diagonal matrix.
Proof. Let $\mathbf{W}$ be some projection matrix of data $\mathbf{z}$, so that $\mathbf{s} = \mathbf{W}\mathbf{z}$.

The cross-correlation is expressed as $\mathbf{C} = \mathbb{E}[\mathbf{s}\mathbf{s}^T] = \mathbf{W}\,\mathbb{E}[\mathbf{z}\mathbf{z}^T]\,\mathbf{W}^T$. In the case where $\mathbf{s}$ is perfectly decorrelated, we have $\mathbf{C} = \mathbf{\Lambda}$, where $\mathbf{\Lambda}$ is a diagonal matrix. We know from Everson and Roberts (1999) that the Stiefel manifold $\mathcal{V}_{d_s}(\mathbb{R}^{d_z})$ holds the set of all linearly decorrelating matrices. As Absil and Malick (2012) show, the unique projection onto this manifold of $\mathbf{W}$, with polar decomposition $\mathbf{W} = \mathbf{U}\mathbf{P}$, is simply $\pi(\mathbf{W}) = \mathbf{U}$. As such, given any matrix $\mathbf{W}$ we have a polar decomposition $\mathbf{W} = \mathbf{U}\mathbf{P}$, where $\mathbf{U}$, a linear decorrelating matrix, is the projection onto $\mathcal{V}_{d_s}(\mathbb{R}^{d_z})$, and $\mathbf{P}$ denotes a positive-semidefinite Hermitian matrix. For any matrix $\mathbf{W}$ and its projection $\mathbf{U}$, we have $\|\mathbf{W} - \mathbf{U}\|_F = \|\mathbf{U}(\mathbf{P} - \mathbf{I})\|_F$. Consequently, given that the Frobenius norm is unitarily invariant and that $\mathbf{U}$ is unitary:

$$\|\mathbf{W} - \pi(\mathbf{W})\|_F = \|\mathbf{P} - \mathbf{I}\|_F.$$

The last line uses the fact that $\mathbf{P}$ is Hermitian. As $\|\mathbf{W} - \pi(\mathbf{W})\|_F \to 0$ we therefore have $\mathbf{P} \to \mathbf{I}$, and hence $\mathbf{W} \to \mathbf{U}$, a decorrelating matrix. Consequently:

$$\lim_{\|\mathbf{W} - \pi(\mathbf{W})\|_F \to 0} \mathbf{C} = \mathbf{\Lambda}. \qquad (21)$$
Note, however, that though a matrix may be decorrelating it will in general not be the optimal ICA matrix for a given dataset, though the optimal ICA matrix is itself decorrelating Everson and Roberts (1999).
Appendix C Sub- or Super-Gaussian Sources?
In classical nonlinear ICA, a simple nonlinear function (such as a matrix mapping followed by an activation function) is used to map directly from data to the setting of the sources for that datapoint Bell and Sejnowski (1995). In this noiseless model, the activation function is related to the prior one is implicitly placing over the sources Roweis and Ghahramani (1999). The choice of nonlinearity is thus a claim on whether the sources are sub- or super-Gaussian. If the choice is wrong, ICA models struggle to unmix the data Mackay (1996); Roweis and Ghahramani (1999). Previous linear ICA methods enabled mixed learning of super- and sub-Gaussian sources Lee and Sejnowski (1997), as well as learning the nature of the source Everson and Roberts (1999).
Here, our model abstracts away this design choice and transforms the original data such that it can be readily modelled as having either super or subGaussian sources by the linear ICA model, regardless of the true underlying sources.
Appendix D Identifiability of Linear ICA
Recall that for noiseless linear ICA the learnt sources will vary between different trained models only in their ordering and scaling Choudrey (2000); Hyvärinen et al. (2001); Everson and Roberts (2001). Under this model our data matrix $\mathbf{X}$ is

$$\mathbf{X} = \mathbf{A}\mathbf{S}, \qquad (22)$$

where each column of $\mathbf{S}$ is distributed according to $p(\mathbf{s})$. Given a permutation matrix $\mathbf{P}$ and a diagonal matrix $\mathbf{D}$, both $d_s \times d_s$, we can define new source and mixing matrices such that $\mathbf{X}$ is unchanged: $\mathbf{S}' = \mathbf{P}\mathbf{D}\mathbf{S}$ and $\mathbf{A}' = \mathbf{A}\mathbf{D}^{-1}\mathbf{P}^{-1}$ Choudrey (2000). With fixed source variance the scaling ambiguity is removed, so linear ICA is easy to render identifiable up to a permutation and sign-flip of sources. However, this is only the case when the underlying sources are non-Gaussian (Hyvärinen et al., 2001). The Bijecta model can be interpreted as learning the 'data' for linear ICA, with

$$\mathbf{Z} = f_\theta(\mathbf{X}) = \mathbf{A}\mathbf{S}. \qquad (23)$$
In this setting, a natural question is whether or not, given a particular function $f_\theta$, the linear ICA is identifiable. The Bijecta model explicitly aims to induce this linear identifiability on its feature map, as we impose a Laplace prior $p(\mathbf{s})$ with fatter-than-Gaussian tails.
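The permutation-and-scale ambiguity described above can be checked numerically; the matrix sizes and distributions below are illustrative. Absorbing a permutation and a diagonal rescaling into both the sources and the mixing matrix leaves the data matrix untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, n = 4, 10

A = rng.standard_normal((6, d_s))             # toy mixing matrix
S = rng.laplace(size=(d_s, n))                # toy source matrix

P = np.eye(d_s)[rng.permutation(d_s)]         # permutation matrix
D = np.diag(rng.uniform(0.5, 2.0, d_s))       # diagonal scaling

A2 = A @ np.linalg.inv(D) @ np.linalg.inv(P)  # equivalent mixing matrix A'
S2 = P @ D @ S                                # permuted, rescaled sources S'

print(np.allclose(A @ S, A2 @ S2))            # True: X = A S is unchanged
```

This is exactly why identifiability claims for ICA are always stated "up to permutation and scaling": no likelihood can distinguish `(A, S)` from `(A2, S2)`.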
Appendix E Coupling Layers in Flows
Coupling layers in flows are designed to produce lower triangular Jacobians for ease of calculation of determinants. Durkan et al. (2019) achieve this in the following fashion:

1. Given an input $\mathbf{z} \in \mathbb{R}^d$, split it into two parts, $\mathbf{z}_{1:j}$ and $\mathbf{z}_{j+1:d}$;

2. Using a neural network, compute parameters for a bijective function $g_\theta$ using one half of $\mathbf{z}$: $\theta_{j+1:d} = \mathrm{NN}(\mathbf{z}_{1:j})$; the parameters $\theta_{1:j}$ are learnable parameters that do not depend on the input;

3. The output of the layer is then $y_i = g_{\theta_i}(z_i)$ for $i = 1, \dots, d$.

These coupling transforms thus act elementwise on their inputs.
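The steps above can be sketched with an affine elementwise bijection standing in for the RQS transform (the conditioning network here is a trivial placeholder, not the paper's architecture). Because the first half passes through unchanged and parameterises the transform of the second half, the inverse is exact and the Jacobian is triangular.

```python
import numpy as np

rng = np.random.default_rng(0)

def net(h):
    """Placeholder conditioning network: returns (log_scale, shift)."""
    return np.tanh(h.sum()), h.mean()

def coupling_forward(z):
    z1, z2 = z[:2], z[2:]                      # step 1: split the input
    log_s, t = net(z1)                         # step 2: params from one half
    y2 = z2 * np.exp(log_s) + t                # step 3: elementwise bijection
    # Triangular Jacobian: log|det J| is just the sum of the log scales.
    return np.concatenate([z1, y2]), log_s * len(z2)

def coupling_inverse(y):
    y1, y2 = y[:2], y[2:]
    log_s, t = net(y1)                         # y1 == z1, so params match
    return np.concatenate([y1, (y2 - t) * np.exp(-log_s)])

z = rng.standard_normal(4)
y, logdet = coupling_forward(z)
print(np.allclose(coupling_inverse(y), z))     # True: exactly invertible
```

Swapping the affine map for the monotonic RQS transform of Appendix G's flows changes only step 3; the split/condition/recombine structure and the cheap log-determinant are identical.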
Appendix F Reconstructions and Latent Traversals

F.1 Reconstructions

F.2 Latent Traversals
Appendix G Network Architectures and Hyperparameters
Within all Rational Quadratic Spline (RQS) flows we parameterise 2 knots for each spline transform. The hyperparameters of the knots were as in the reference implementation from Durkan et al. (2019), available at github.com/bayesiains/nsf: we set the minimum parameterised gradient of each knot to be 0.001, the minimum bin width between each encoded knot and the origin to be 0.001, and the minimum height between each knot and the origin to be 0.001.
Unlike Durkan et al. (2019), who view a single RQS 'layer' as composed of a sequence of numerous coupling layers, in this paper the number of layers we describe a model as having is exactly the number of coupling layers present. So our 4-layer models contain four rational quadratic splines. Each layer in our flows is composed of: an actnorm layer, an invertible 1x1 convolution, an RQS coupling transform and a final invertible 1x1 convolution.
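The layer composition just described can be sketched as chaining sub-transforms that each return an output and a log-determinant; here an actnorm transform stands in for the full set (the 1x1 convolutions and RQS couplings would slot into the same interface, and all names are illustrative):

```python
import numpy as np

def make_actnorm(scale, shift):
    """Actnorm: learned per-channel affine transform.
    Its log|det Jacobian| is sum(log|scale|)."""
    def f(x):
        return x * scale + shift, np.sum(np.log(np.abs(scale)))
    return f

def compose(transforms, x):
    """Chain sub-transforms, accumulating the log-determinants."""
    logdet = 0.0
    for t in transforms:
        x, ld = t(x)
        logdet += ld
    return x, logdet

# One 'layer' as described above would chain:
# [actnorm, inv_1x1_conv, rqs_coupling, inv_1x1_conv]
```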
The parameters of the knots were themselves parameterised using residual networks (ResNets), as used in RealNVP Dinh et al. (2017); for each of these we used 3 residual blocks and batch normalisation. As in Dinh et al. (2017) we factor out dimensions after each layer. All training was done using ADAM Kingma and Lei Ba (2015), with default momentum hyperparameters, a learning rate of 0.0005 and a batch size of 512. We perform cosine decay on the learning rate during training, training for 25,000 steps.
Data was rescaled to 5-bit integers, and we used RealNVP-style affine preprocessing so that our input data lay in a restricted sub-interval of $(0, 1)$ before the logit transform.
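A minimal sketch of this preprocessing pipeline, assuming 8-bit input images and a RealNVP-style logit squeeze (the value of `lam` here is an assumption, not taken from the paper):

```python
import numpy as np

def preprocess(images_8bit, lam=0.05):
    """Quantise 8-bit images to 5 bits, then apply an affine logit
    transform so inputs live on the unbounded real line."""
    x = np.floor(images_8bit / 8.0)     # 8-bit [0, 256) -> 5-bit [0, 32)
    x = x / 32.0                        # map to [0, 1)
    x = lam + (1.0 - 2.0 * lam) * x     # squeeze into (lam, 1 - lam)
    return np.log(x) - np.log(1.0 - x)  # logit transform
```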
References
 Projection-like retractions on matrix manifolds. SIAM Journal on Optimization 22 (1), pp. 135–158. External Links: Document, ISSN 10526234 Cited by: Appendix A, §8.
 Database-friendly random projections: Johnson-Lindenstrauss with binary coins. In Journal of Computer and System Sciences, Vol. 66, pp. 671–687. External Links: Document Cited by: §5.1.
 MISEP – Linear and Nonlinear ICA Based on Mutual Information. Journal of Machine Learning Research 4, pp. 1297–1318. Cited by: §2.1.
 An information-maximisation approach to blind separation and blind deconvolution. Neural Computation 7 (6), pp. 1004–1034. Cited by: Appendix B, §1, §2.
 Blind separation of sources: A nonlinear neural algorithm. Neural Networks 5 (6), pp. 937–947. External Links: Document, ISSN 08936080 Cited by: §2.1.
 Understanding disentangling in β-VAE. In NeurIPS, Cited by: §7.
 Blind identification of independent components with higher-order statistics. In IEEE Workshop on Higher-Order Spectral Analysis, Cited by: §1, §2.
 Source separation using higher order moments. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing  Proceedings, Vol. 4, pp. 2109–2112. External Links: Document, ISSN 07367791 Cited by: §1.
 Infomax and Maximum Likelihood for Blind Source Separation. IEEE Letters on Signal Processing 4, pp. 112–114. Cited by: §1, §2.
 Sur quelques propriétés des déterminants gauches. Journal für die reine und angewandte Mathematik 32, pp. 119–123. Cited by: §5.
 Isolating Sources of Disentanglement in Variational Autoencoders. In NeurIPS, Cited by: §7.
 Variational Methods for Bayesian Independent Component Analysis. Ph.D. Thesis, University of Oxford. Cited by: Appendix C, §2.
 Independent component analysis, A new concept?. Signal Processing 36 (3), pp. 287–314. External Links: Document, ISSN 01651684 Cited by: §1.
 An Elementary Proof of a Theorem of Johnson and Lindenstrauss. Random Structures and Algorithms 22 (1), pp. 60–65. External Links: Document, ISSN 10429832 Cited by: §5.1.
 Higher Order Statistical Decorrelation without Information Loss. In NeurIPS, Cited by: §1, §2.1, §7.
 NICE: Nonlinear Independent Components Estimation. In ICLR, Cited by: §1, §7.
 Density estimation using Real NVP. In ICLR, Cited by: Appendix F, §1, §7.
 Neural Spline Flows. In NeurIPS, Cited by: Appendix D, Appendix F, Appendix F, §3.1, §3.2, §3.2.
 The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications 20 (2), pp. 303–353. External Links: Document, ISSN 08954798 Cited by: §5.1.
 Structured Disentangled Representations. In AISTATS, Cited by: §7.
 Independent Component Analysis: A Flexible Nonlinearity and Decorrelating Manifold Approach. Neural Computation 11 (8), pp. 1957–83. Cited by: Appendix A, Appendix A, Appendix B, §2, §5.
 Independent Component Analysis. Cambridge University Press. External Links: ISBN 9780521792981, Document Cited by: Appendix C, §1, §4.
 Improving Normalizing Flows via Better Orthogonal Parameterizations. In ICML Workshop on Invertible Neural Networks and Normalizing Flows, Cited by: §7.
 β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In ICLR, External Links: Document, ISSN 10780874 Cited by: §7.
 Norm matters: Efficient and accurate normalization schemes in deep networks. In NeurIPS, Cited by: §6.1.
 Unsupervised variational Bayesian learning of nonlinear models. NeurIPS. Cited by: §2.
 Independent Component Analysis. John Wiley. Cited by: Appendix C, §1.
 Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks 12 (3), pp. 429–439. External Links: Document, ISSN 08936080 Cited by: §2.1, §2.
 Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning. In AISTATS, Cited by: §2.1.
 Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), External Links: ISBN 9780874216561, Document, ISSN 07176163 Cited by: §6.1.
 Extensions of Lipschitz mappings into a Hilbert space. Contemporary mathematics 26 (1), pp. 189–206. External Links: Document Cited by: §5.1.
 Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing 24 (1), pp. 1–10. External Links: Document, ISSN 01651684 Cited by: §1, §2.
 Nonlinear Independent Component Analysis. In ICA: Principles and Practice, R. Everson and S. J. Roberts (Eds.), pp. 113–134. Cited by: §2.1, §2.
 Variational Autoencoders and Nonlinear ICA: A Unifying Framework. In AISTATS, Cited by: §2.1, §7.
 Disentangling by Factorising. In NeurIPS, Cited by: §7.
 Adam: A Method for Stochastic Optimisation. In ICLR, External Links: Link Cited by: Appendix F, §8.
 Improved Variational Inference with Inverse Autoregressive Flow. In NeurIPS, Cited by: §7.
 Autoencoding Variational Bayes. In ICLR, Cited by: §4, §7.
 Glow: Generative flow with invertible 1x1 convolutions. NeurIPS. Cited by: §7.
 Bayesian NonLinear Independent Component Analysis by MultiLayer Perceptrons. In Advances in Independent Component Analysis, M. Girolami (Ed.), pp. 93–121. External Links: Document Cited by: §2.1, §2.
 Variational Bayesian Independent Component Analysis. Technical report University of Cambridge. Cited by: §2, §2.
 Blind source separation of nonlinear mixing models. Neural Networks for Signal Processing  Proceedings of the IEEE Workshop, pp. 406–415. External Links: ISBN 0780342569, Document Cited by: §2.1.
 A Unifying InformationTheoretic Framework for Independent Component Analysis. Computers & Mathematics with Applications 39 (11), pp. 1–21. Cited by: §1, §2.
 Independent Component Analysis for Mixed Sub-Gaussian and Super-Gaussian Sources. Joint Symposium on Neural Computation, pp. 6–13. External Links: Link Cited by: Appendix B.
 Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform. In ICLR, Cited by: §7.
 An application of the principle of maximum information preservation to linear systems. NeurIPS. Cited by: §2.
 Maximum Likelihood and Covariant Algorithms for Independent Component Analysis. Technical report University of Cambridge. Cited by: Appendix B, §2, §2.
 Normalizing Flows for Probabilistic Modeling and Inference. Technical report DeepMind, London, UK. Cited by: §3, §7.
 Symplectic nonlinear component analysis. In NeurIPS, pp. 437–443. Cited by: §1, §7.
 Redundancy reduction with information-preserving nonlinear maps. Network: Computation in Neural Systems 6 (1), pp. 61–72. External Links: Document, ISSN 0954898X Cited by: §1, §7.
 Statistical Independence and Novelty Detection with Information Preserving Nonlinear Maps. Neural Computation 8 (2), pp. 260–269. External Links: Document, ISSN 08997667 Cited by: §1, §7.
 Maximum likelihood blind source separation: A contextsensitive generalization of ICA. In NeurIPS, Cited by: §2.
 Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In ICML, Cited by: §4, §7.
 Variational Inference with Normalizing Flows. In ICML, Cited by: §3.
 Bayesian independent component analysis with prior constraints: An application in biosignal analysis. In Deterministic and Statistical Methods in Machine Learning, First International Workshop, Sheffield, UK. External Links: ISBN 3540290737, Document, ISSN 03029743 Cited by: §2.
 Independent Component Analysis: Source Assessment & Separation, a Bayesian Approach. IEEE ProceedingsVision, Image and Signal Processing 145 (3), pp. 149–154. Cited by: §2.
 A unifying review of linear Gaussian models. Neural Computation 11 (2), pp. 305–345. External Links: Document, ISSN 08997667 Cited by: Appendix B, §2.
 Richtungsfelder und Fernparallelismus in n-dimensionalen Mannigfaltigkeiten. Commentarii mathematici Helvetici 8, pp. 305–353. Cited by: §1, §5.1.
 A generic framework for blind source separation in structured nonlinear models. IEEE Transactions on Signal Processing 50 (8), pp. 1819–1830. External Links: Document, ISSN 1053587X Cited by: §2.1.
 Nonlinear blind source separation by variational Bayesian learning. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences E86A (3), pp. 532–541. External Links: ISSN 09168508 Cited by: §2.1, §2.
 Sylvester normalizing flows for variational inference. In UAI, Vol. 1, pp. 393–402. External Links: ISBN 9781510871601 Cited by: §7.
 Bayesian independent component analysis: Variational methods and nonnegative decompositions. Digital Signal Processing 17 (5), pp. 858–872. External Links: Document, ISSN 10512004 Cited by: §2.
 Sketching as a Tool for Numerical Linear Algebra. Foundations and Trends in Theoretical Computer Science 10 (2), pp. 1–157. External Links: Document Cited by: §5.1.
 Informationtheoretic approach to blind separation of sources in nonlinear mixture. Signal Processing 64 (3), pp. 291–300. External Links: Document, ISSN 01651684 Cited by: §2.1.