Nonlinear ICA Using Auxiliary Variables
and Generalized Contrastive Learning
Abstract
Nonlinear ICA is a fundamental problem for unsupervised representation learning, emphasizing the capacity to recover the underlying latent variables generating the data (i.e., identifiability). Recently, the very first identifiability proofs for nonlinear ICA have been proposed, leveraging the temporal structure of the independent components. Here, we propose a general framework for nonlinear ICA, which, as a special case, can make use of temporal structure. It is based on augmenting the data by an auxiliary variable, such as the time index, the history of the time series, or any other available information. We propose to learn nonlinear ICA by discriminating between true augmented data and data in which the auxiliary variable has been randomized. This enables the framework to be implemented algorithmically through logistic regression, possibly in a neural network. We provide a comprehensive proof of the identifiability of the model as well as the consistency of our estimation method. The approach not only provides a general theoretical framework combining and generalizing previously proposed nonlinear ICA models and algorithms, but also brings practical advantages.
1 Introduction
Nonlinear ICA is a fundamental problem in unsupervised learning which has attracted a considerable amount of attention recently. It promises a principled approach to representation learning, for example using deep neural networks. Nonlinear ICA attempts to find nonlinear components, or features, in multidimensional data, so that they correspond to a well-defined generative model [13, 17]. The essential difference from most methods for unsupervised representation learning is that the approach starts by defining a generative model in which the original latent variables can be recovered, i.e. the model is identifiable by design.
Denote an observed $n$-dimensional random vector by $\mathbf{x} = (x_1, \ldots, x_n)$. We assume it is generated using $n$ independent latent variables called independent components, $s_i$. A straightforward definition of the nonlinear ICA problem is to assume that the observed data is an arbitrary (but smooth and invertible) transformation $\mathbf{f}$ of the latent variables $\mathbf{s} = (s_1, \ldots, s_n)$ as
(1) $\mathbf{x} = \mathbf{f}(\mathbf{s})$
The goal is then to recover the inverse function $\mathbf{f}^{-1}$ as well as the independent components $s_i$ based on observations of $\mathbf{x}$ alone.
Research in nonlinear ICA has been hampered by the fact that such simple approaches to nonlinear ICA are not identifiable, in stark contrast to the linear ICA case. In particular, if the observed data are obtained as i.i.d. samples, i.e. there is no temporal or similar structure in the data, the model is seriously unidentifiable [16], although attempts have been made to estimate it nevertheless [5, 22, 1, 6]. This is a major problem since in fact most of the utility of linear ICA rests on the fact that the model is identifiable, or, in alternative terminology, the “sources can be separated”. Proving the identifiability of linear ICA [4] was a great advance on the classical theory of factor analysis, where an orthogonal factor rotation could not be identified.
Fortunately, a solution to nonidentifiability in nonlinear ICA can be found by utilizing temporal structure in the data [11, 21, 15, 14]. In recent work, various identifiability conditions have been proposed, assuming that the independent components are actually time series and have autocorrelations [21], general non-Gaussian dependencies [14], or nonstationarities [15]. These generalize earlier identifiability conditions for linear ICA [2, 20].
Here, we propose a very general form of nonlinear ICA, based on the idea that the independent components are dependent on some additional, auxiliary variable. This unifies and generalizes the methods in [11, 21, 15, 14]. It gives a general framework where it is not necessary to specifically have a temporal (or even spatial) structure in the data. We prove exact identifiability conditions for the new framework, and show how it extends previous conditions, both from the viewpoint of theory and practice. We further provide a practical algorithm for estimating the model using the idea of contrastive learning [10, 15, 14], and prove its consistency.
2 Nonlinear ICA using auxiliary variables
2.1 Definition of generative model
Assume the general mixing model in (1) where the mixing function $\mathbf{f}$ is only assumed invertible and smooth (in the sense of having continuous second derivatives, and the same for its inverse). We emphasize the point that we do not restrict the function $\mathbf{f}$ to any particular functional form. It can be modelled by a general neural network, since even the assumption of invertibility usually (empirically) seems to hold for the $\mathbf{f}$ estimated by the methods developed here, even without enforcing it.
The key idea here is that we further assume that each $s_i$ is statistically dependent on some fully-observed $m$-dimensional random variable $\mathbf{u}$, but conditionally independent of the other $s_j$:
(2) $\log p(\mathbf{s} \mid \mathbf{u}) = \sum_{i=1}^{n} q_i(s_i, \mathbf{u})$
for some functions $q_i$.
First, to see how this generalizes previous work on nonlinear ICA using time structure, we note that the auxiliary variable $\mathbf{u}$ could be the past of the component in the time series, giving rise to temporally dependent components as in permutation-contrastive learning or PCL [14] and the earlier methods in [11, 21]. Or, $\mathbf{u}$ could be the time index itself in a time series, or the index of a time segment, leading to nonstationary components as in time-contrastive learning or TCL [15]. These connections will be considered in more detail below.
Thus, we obtain a unification of the separation principles of temporal dependencies and nonstationarity. This is remarkable since these principles are well-known in the linear ICA literature, but they have been considered as two distinct principles [19, 2, 3, 13].
Furthermore, we can define $\mathbf{u}$ in completely new ways. In the case where each observation of $\mathbf{x}$ is an image or an image patch, a rather obvious generalization of TCL would be to assume $\mathbf{u}$ is the pixel index, or any similar spatial index, thus giving rise to nonlinear representation learning by the . Related to an image colorization task [18], $\mathbf{x}$ could be the grayscale image and $\mathbf{u}$ the hue. Moreover, $\mathbf{u}$ could be a class label, giving rise to something more related to conventional representation learning by a supervised neural network (e.g. ImageNet), but now connected to the theory of nonlinear ICA and identifiability (this is also considered in detail below). In a neuroscience context, $\mathbf{x}$ could be brain imaging data, and $\mathbf{u}$ could be some quantity related to the stimuli in the experiment. The appropriate definition of $\mathbf{u}$ obviously depends on the application domain, and the list above is by no means exhaustive.
It should be noted that the conditional independence does not imply that the $s_i$ would be marginally independent. If $\mathbf{u}$ affects the distributions of the $s_i$ somehow independently (intuitively speaking), the $s_i$ are likely to be marginally independent. This would be the case, for example, if each $q_i$ is of the form $q_i(s_i, u_i)$, that is, each source has one auxiliary variable which is not shared with the other sources, and the $u_i$ are independent of each other. Thus, the formulation above is actually generalizing the ordinary independence in ICA to some extent.
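To make the generative model concrete, the following is a minimal sketch of sampling from it, not the setup used in the paper. It assumes a discrete auxiliary variable u (a segment index), Laplace-distributed sources whose scale is set by u, and a toy triangular mixing function; the names `sample_sources`, `mix`, and `unmix`, and all numeric ranges, are hypothetical choices made for illustration.

```python
import math
import random

def sample_sources(n, num_values, points_per_value, rng):
    """Draw sources that are independent given an auxiliary index u."""
    sources, aux = [], []
    for u in range(num_values):
        # Hypothetical modulation: u sets one positive scale per component.
        scales = [rng.uniform(0.5, 2.0) for _ in range(n)]
        for _ in range(points_per_value):
            # The difference of two unit exponentials is Laplace-distributed;
            # the components are drawn independently once u is fixed.
            s = [c * (rng.expovariate(1.0) - rng.expovariate(1.0)) for c in scales]
            sources.append(s)
            aux.append(u)
    return sources, aux

def mix(s):
    """Toy smooth invertible mixing: x_i = s_i + tanh(s_1 + ... + s_{i-1})."""
    x, running = [], 0.0
    for s_i in s:
        x.append(s_i + math.tanh(running))
        running += s_i
    return x

def unmix(x):
    """Exact inverse of mix, recovered one coordinate at a time."""
    s, running = [], 0.0
    for x_i in x:
        s_i = x_i - math.tanh(running)
        s.append(s_i)
        running += s_i
    return s

rng = random.Random(0)
sources, aux = sample_sources(n=3, num_values=4, points_per_value=5, rng=rng)
observations = [mix(s) for s in sources]
```

The triangular form is chosen only because its inverse is available in closed form, which makes the invertibility assumption easy to verify; a general neural network would play the same role in practice.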
2.2 Learning algorithm
To estimate our nonlinear ICA model, we propose a general form of contrastive learning, inspired by the idea of transforming unsupervised learning to supervised learning earlier explored in [10, 8, 9]. More specifically, we use the idea of discriminating between a real data set and some randomized version of it, as used in PCL [14]. Thus we define two datasets
(3) $\tilde{\mathbf{x}} = (\mathbf{x}, \mathbf{u})$
(4) $\tilde{\mathbf{x}}^* = (\mathbf{x}, \mathbf{u}^*)$
where $\mathbf{u}^*$ is a random value from the distribution of the $\mathbf{u}$, but independent of $\mathbf{x}$, created in practice by random permutation of the empirical sample of the $\mathbf{u}$. We learn a nonlinear logistic regression system (e.g. a neural network) using a regression function of the form
(5) $r(\mathbf{x}, \mathbf{u}) = \sum_{i=1}^{n} \psi_i(h_i(\mathbf{x}), \mathbf{u})$
which then gives the posterior probability of the first class as $1/(1 + \exp(-r(\mathbf{x}, \mathbf{u})))$. Here, the scalar features $h_i$ would typically be computed by hidden units in a neural network. Universal approximation capacity [12] is assumed for the models of $h_i$ and $\psi_i$. This is a variant of the “contrastive learning” approach to nonlinear ICA in [15, 14], and we will see below that it in fact unifies and generalizes those earlier results.
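The learning setup can be sketched in a few lines: build the two classes by permuting the auxiliary variable, evaluate a regression function that sums one term per feature, and score it with the logistic loss. This is an illustrative sketch only, assuming the regression function has the additive form r(x, u) = sum_i psi_i(h_i(x), u); the function names, the toy data, and the placeholder `zero_psi` are hypothetical, and no optimizer is included.

```python
import math
import random

def make_contrastive_data(xs, us, rng):
    """Label true pairs (x, u) with 1 and permuted pairs (x, u*) with 0."""
    permuted = list(us)
    rng.shuffle(permuted)  # breaks the dependence between x and u
    true_pairs = [(x, u, 1) for x, u in zip(xs, us)]
    fake_pairs = [(x, u_star, 0) for x, u_star in zip(xs, permuted)]
    return true_pairs + fake_pairs

def regression(x, u, h, psis):
    """r(x, u) = sum_i psi_i(h_i(x), u), with h returning the n features."""
    return sum(psi(h_i, u) for psi, h_i in zip(psis, h(x)))

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def logistic_loss(data, h, psis):
    """Mean negative log-likelihood with posterior 1 / (1 + exp(-r))."""
    total = 0.0
    for x, u, label in data:
        r = regression(x, u, h, psis)
        total += softplus(-r) if label == 1 else softplus(r)
    return total / len(data)

def identity_features(x):
    """Placeholder for the feature extractor h (hypothetical)."""
    return x

def zero_psi(feature, u):
    """Placeholder psi_i that ignores its inputs (hypothetical)."""
    return 0.0

xs = [[0.1, -0.4], [1.2, 0.3], [-0.7, 0.9]]
us = [0, 1, 2]
data = make_contrastive_data(xs, us, random.Random(0))
loss = logistic_loss(data, identity_features, [zero_psi, zero_psi])
```

With the regression function identically zero, the classifier outputs probability 1/2 everywhere, so the loss equals log 2; training would replace the placeholders with neural networks and minimize this loss.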
3 Convergence and identifiability theory
In this section, we give exact conditions for the convergence (consistency) of our learning algorithm, which also leads to constructive proofs of identifiability of our nonlinear ICA model with auxiliary variables. It turns out we have two cases that need to be considered separately, based on the property of conditional exponentiality.
3.1 Definition of conditional exponentiality
We start with a basic definition describing distributions which are in some sense pathological in our theory.
Definition 1
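A random variable (independent component) $s_i$ is conditionally exponential of order $k$ given random vector $\mathbf{u}$ if its conditional pdf can be given in the form
(6) $p(s_i \mid \mathbf{u}) = \frac{Q_i(s_i)}{Z_i(\mathbf{u})} \exp\left[ \sum_{j=1}^{k} \tilde{q}_{ij}(s_i) \lambda_{ij}(\mathbf{u}) \right]$
almost everywhere in the support of $\mathbf{u}$, with $\tilde{q}_{ij}$, $\lambda_{ij}$, $Q_i$, and $Z_i$ scalar-valued functions. The sufficient statistics $\tilde{q}_{ij}$ are assumed linearly independent (over $j$).
This definition is a simple variant of the conventional theory of exponential families, adding conditioning by $\mathbf{u}$ which comes through the parameters only.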
As a simple illustration, consider a (stationary) Gaussian time series as $s_i$, and define $\mathbf{u}$ as the past of the time series. The past of the time series can be compressed in a single statistic $\lambda(\mathbf{u})$ which essentially gives the conditional expectation of $s_i$. Thus, models of independent components using Gaussian autocorrelations lead to the conditionally exponential case, of order $k = 1$. As is well-known, the basic theory of linear ICA relies heavily on non-Gaussianity, the intuitive idea being that the Gaussian distribution is too “simple” to support identifiability. Here, we see a reflection of the same idea. Note also that if $s_i$ and $\mathbf{u}$ are independent, $s_i$ is conditionally exponential, since then we simply set .
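The Gaussian illustration can be worked out explicitly in a simple special case. The sketch below assumes, purely for concreteness, a first-order autoregressive source $s_i(t) = \rho\, s_i(t-1) + \varepsilon(t)$ with $\varepsilon(t) \sim \mathcal{N}(0, \sigma^2)$ and takes $u = s_i(t-1)$; the symbols $Q_i$, $Z_i$, $\tilde{q}_{i1}$ and $\lambda_{i1}$ denote the base measure, normalizer, sufficient statistic and parameter of the conditionally exponential form, and are notation assumed here.

```latex
\begin{align*}
p(s_i \mid u)
  &= \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left(-\frac{(s_i - \rho u)^2}{2\sigma^2}\right) \\
  &= \frac{\exp\!\left(-\frac{s_i^2}{2\sigma^2}\right)}
          {\sqrt{2\pi\sigma^2}\,\exp\!\left(\frac{\rho^2 u^2}{2\sigma^2}\right)}
     \exp\!\left(s_i \cdot \frac{\rho u}{\sigma^2}\right)
\end{align*}
% Reading off the factors:
%   Q_i(s_i) = \exp(-s_i^2 / (2\sigma^2)),
%   Z_i(u)   = \sqrt{2\pi\sigma^2}\,\exp(\rho^2 u^2 / (2\sigma^2)),
%   \tilde{q}_{i1}(s_i) = s_i, \qquad \lambda_{i1}(u) = \rho u / \sigma^2 .
```

Only one product of a function of $s_i$ and a function of $u$ appears in the exponent, so the order is $k = 1$, and $\lambda_{i1}(u)$ is the conditional mean $\rho u$ divided by $\sigma^2$.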
3.2 Theory for general case
First, we consider the much more general case of distributions which are not “pathological”. Our main theorem, proven in Supplementary Material, is as follows:
Theorem 1
Assume

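The conditional log-pdf $q_i$ in (2) is sufficiently smooth as a function of $s_i$, for any fixed $\mathbf{u}$.

[Assumption of Variability] For any $\mathbf{y} \in \mathbb{R}^n$, there exist $2n + 1$ values for $\mathbf{u}$, denoted by $\mathbf{u}_j$, $j = 0, \ldots, 2n$, such that the $2n$ vectors in $\mathbb{R}^{2n}$ given by
(7) $\mathbf{w}(\mathbf{y}, \mathbf{u}_1) - \mathbf{w}(\mathbf{y}, \mathbf{u}_0), \; \ldots, \; \mathbf{w}(\mathbf{y}, \mathbf{u}_{2n}) - \mathbf{w}(\mathbf{y}, \mathbf{u}_0)$ with
(8) $\mathbf{w}(\mathbf{y}, \mathbf{u}) = \left( \frac{\partial q_1(y_1, \mathbf{u})}{\partial y_1}, \ldots, \frac{\partial q_n(y_n, \mathbf{u})}{\partial y_n}, \frac{\partial^2 q_1(y_1, \mathbf{u})}{\partial y_1^2}, \ldots, \frac{\partial^2 q_n(y_n, \mathbf{u})}{\partial y_n^2} \right)$ are linearly independent.

In the regression function in Eq. (5), we constrain $\mathbf{h} = (h_1, \ldots, h_n)$ to be invertible, as well as smooth, and constrain the inverse to be smooth as well.
Then, in the limit of infinite data, $\mathbf{h}$ in the regression function provides a consistent estimator of demixing in the nonlinear ICA model: The functions (hidden units) $h_i(\mathbf{x})$ give the independent components, up to scalar (componentwise) invertible transformations.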
Essentially, the Theorem shows that under mostly weak assumptions, including invertibility of $\mathbf{h}$ and smoothness of the pdfs, and of course independence of the components, our learning system will recover the independent components given an infinite amount of data. Thus, we also obtain a constructive identifiability proof of our new, general nonlinear ICA model.
Among the assumptions above, the only one which cannot be considered weak or natural is clearly the Assumption of Variability, which is central in our developments. It is basically saying that the auxiliary variable must have a sufficiently strong and diverse effect on the distributions of the independent components. To further understand this condition, we give the following Theorem:
Theorem 2
Assume the independent components are conditionally exponential given $\mathbf{u}$, with the same order $k$ for all components. Then,

If $k = 1$, the Assumption of Variability cannot hold.

Assume $k > 1$ and for each component $s_i$, the vectors $\left( \frac{\partial \tilde{q}_{ij}(s_i)}{\partial s_i}, \frac{\partial^2 \tilde{q}_{ij}(s_i)}{\partial s_i^2} \right)$ are not all proportional to each other for different $j = 1, \ldots, k$, for $s_i$ almost everywhere. Then, the Assumption of Variability holds almost surely if the $\lambda$’s are independently randomly generated from a distribution with support of nonzero measure.
Loosely speaking, the Assumption of Variability holds if the sources, or rather their modulation by $\mathbf{u}$, is not “too simple”, which is here quantified as the order of the exponential family from which the $s_i$ are generated. Furthermore, for the second point of the Theorem to hold, the sufficient statistics cannot be linear (which would lead to zero second derivatives), thus excluding the Gaussian scale-location family as too simple as well.
Another nontrivial assumption is the invertibility of $\mathbf{h}$. It is hoped that the constraint of invertibility is only necessary to have a rigorous theory, and not necessary in any practical implementation. Our simulations below, as well as our next Theorem, seem to back up this conjecture to some extent.
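The role of the order of the exponential family can be checked numerically in a toy case. The sketch below assumes a hypothetical family $q_i(y_i, u) = -\lambda_{i1}(u)\, y_i^2 - \lambda_{i2}(u)\, y_i^4$ (normalizers do not depend on $y_i$ and drop out of the derivatives), stacks first and second derivatives into a vector $w(y, u)$, and computes the rank of the $2n$ difference vectors. Setting $\lambda_{i2} = 0$ gives an order-one family. All names and numeric values are illustrative assumptions.

```python
import random

def w(y, lam):
    """Stack d/dy and d^2/dy^2 of q_i(y_i, u) = -l1*y_i**2 - l2*y_i**4."""
    first = [-2 * l1 * yi - 4 * l2 * yi**3 for yi, (l1, l2) in zip(y, lam)]
    second = [-2 * l1 - 12 * l2 * yi**2 for yi, (l1, l2) in zip(y, lam)]
    return first + second

def matrix_rank(rows, tol=1e-9):
    """Rank by Gaussian elimination with partial pivoting."""
    m = [list(r) for r in rows]
    rank = 0
    for col in range(len(m[0])):
        if rank == len(m):
            break
        pivot = max(range(rank, len(m)), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def variability_rank(y, lams):
    """Rank of the vectors w(y, u_j) - w(y, u_0), j = 1, ..., 2n."""
    base = w(y, lams[0])
    diffs = [[a - b for a, b in zip(w(y, lam), base)] for lam in lams[1:]]
    return matrix_rank(diffs)

rng = random.Random(0)
n, y = 2, [0.7, -1.3]
# One (l1, l2) pair per component for each of the 2n + 1 values of u.
order2 = [[(rng.uniform(0.5, 2), rng.uniform(0.5, 2)) for _ in range(n)]
          for _ in range(2 * n + 1)]
order1 = [[(l1, 0.0) for l1, _ in lam] for lam in order2]
rank_order2 = variability_rank(y, order2)
rank_order1 = variability_rank(y, order1)
```

In the order-one case every difference vector is a combination of only $n$ fixed basis vectors, so the rank cannot reach $2n$; with two nonlinear sufficient statistics per component and randomly drawn parameters, full rank is the generic outcome.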
3.3 Theory for conditionally exponential case
The theory above excluded the conditionally exponential case of order one (Theorem 2). This is a bit curious since it is actually the main model considered in TCL [15]. In fact, the exponential family model of nonstationarities in that work is nothing other than a special case of our “conditionally exponential” family of distributions; we will consider the connection in detail in the next section.
There is actually a fundamental difference between Theorem 1 above and the TCL theory in [15]. In TCL, and in contrast to our current results, a linear indeterminacy remains, but the TCL theory never showed that such an indeterminacy is a property of the model and not only of the particular TCL algorithm employed in [15]. Next, we construct a theory for exponential families adapting our current framework, and indeed, we see the same kind of linear indeterminacy appear.
We have the following Theorem regarding conditionally exponential sources. We give the result for general $k$, although the case $k = 1$ is mainly of interest.
Theorem 3
Assume

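Each $s_i$ is conditionally exponential given $\mathbf{u}$ (Def. 1).

There exist $nk + 1$ points $\mathbf{u}_0, \ldots, \mathbf{u}_{nk}$ such that the matrix of size $nk \times nk$
(9) $\bar{\mathbf{L}} = \begin{pmatrix} \lambda_{11}(\mathbf{u}_1) - \lambda_{11}(\mathbf{u}_0) & \cdots & \lambda_{11}(\mathbf{u}_{nk}) - \lambda_{11}(\mathbf{u}_0) \\ \vdots & \ddots & \vdots \\ \lambda_{nk}(\mathbf{u}_1) - \lambda_{nk}(\mathbf{u}_0) & \cdots & \lambda_{nk}(\mathbf{u}_{nk}) - \lambda_{nk}(\mathbf{u}_0) \end{pmatrix}$ is invertible (here, the rows correspond to all the possible subscript pairs $(i, j)$ for $\lambda_{ij}$).
Then,

The optimal regression function can be expressed in the form
(10) $r(\mathbf{x}, \mathbf{u}) = \mathbf{h}(\mathbf{x})^T \mathbf{v}(\mathbf{u}) + a(\mathbf{x}) + b(\mathbf{u})$ for some functions $\mathbf{v}(\mathbf{u}) = (v_1(\mathbf{u}), \ldots, v_{nk}(\mathbf{u}))^T$, and two scalar-valued functions $a$ and $b$.

In the limit of infinite data, $\mathbf{h}(\mathbf{x})$ provides a consistent estimator of the nonlinear ICA model, up to a linear transformation of pointwise scalar (not necessarily invertible) functions of the independent components. The pointwise nonlinearities are given by the sufficient statistics $\tilde{q}_{ij}$. In other words,
(11) $(\tilde{q}_{11}(s_1), \tilde{q}_{12}(s_1), \ldots, \tilde{q}_{nk}(s_n))^T = \mathbf{A}\mathbf{h}(\mathbf{x}) - \mathbf{c}$ for some unknown matrix $\mathbf{A}$ and an unknown vector $\mathbf{c}$.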
The proof, found in the Supplementary Material, is quite similar to the proof of Theorem 1 in [15]. Although the statistical assumptions made here are different, the very goal of modelling exponential sources by logistic regression means the same linear indeterminacy appears, based on the linearity of the log-pdf in the exponential family. On the other hand, Theorem 3 has the advantage of not requiring smoothness of the log-pdf’s, or the invertibility of $\mathbf{h}$, in contrast to Theorem 1. The condition of invertibility above simply means that the sources are somehow “independently” modulated.
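The origin of the linear indeterminacy can be seen by writing out the difference of log-densities that logistic regression converges to. The following is a sketch under assumed notation: $p_s$ is the marginal density of the sources, $\mathbf{s} = \mathbf{f}^{-1}(\mathbf{x})$, and each conditional density has the exponential form $Q_i(s_i)\, Z_i(\mathbf{u})^{-1} \exp[\sum_j \tilde{q}_{ij}(s_i) \lambda_{ij}(\mathbf{u})]$; the Jacobian terms of the change of variables are the same in both classes and have already been cancelled.

```latex
\begin{align*}
\log p(\mathbf{s} \mid \mathbf{u}) - \log p_s(\mathbf{s})
  &= \sum_{i=1}^{n} \Big[ \log Q_i(s_i) - \log Z_i(\mathbf{u})
     + \sum_{j=1}^{k} \tilde{q}_{ij}(s_i)\, \lambda_{ij}(\mathbf{u}) \Big]
     - \log p_s(\mathbf{s}) \\
  &= \underbrace{\sum_{i,j} \tilde{q}_{ij}(s_i)\, \lambda_{ij}(\mathbf{u})}_{
        \text{bilinear term, matching } \mathbf{h}(\mathbf{x})^T \mathbf{v}(\mathbf{u})}
   + \underbrace{\sum_{i} \log Q_i(s_i) - \log p_s(\mathbf{s})}_{
        \text{depends on } \mathbf{x} \text{ only}}
   \; \underbrace{{} - \sum_{i} \log Z_i(\mathbf{u})}_{
        \text{depends on } \mathbf{u} \text{ only}}
\end{align*}
```

Because the features enter only through an inner product with a function of $\mathbf{u}$, any invertible linear transformation of the features can be absorbed into that function, which is why the sufficient statistics are recovered only up to a linear transformation.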
4 Different definitions of auxiliary variables
Next, we consider different possible definitions of the auxiliary variable, and show some exact connections and generalization of previous work.
4.1 Using time as auxiliary variable
A real practical utility of the new framework can be seen in the case of nonstationary data. Consider a time series $\mathbf{x}(t)$, where we assume a mixing model
(12) $\mathbf{x}(t) = \mathbf{f}(\mathbf{s}(t))$
Assume the independent components are nonstationary, with densities $p(s_i \mid t)$. For analysing such nonstationary data in our framework, define $\mathbf{x} = \mathbf{x}(t)$ and $\mathbf{u} = t$. We can easily consider the time index as a random variable, observed for each data point, and coming from a uniform distribution. Thus, we create two new datasets by augmenting the data by adding the time index:
(13) $\tilde{\mathbf{x}}(t) = (\mathbf{x}(t), t)$
(14) $\tilde{\mathbf{x}}^*(t) = (\mathbf{x}(t), t^*)$
We analyse the nonstationary structure of the data by learning to discriminate between $\tilde{\mathbf{x}}$ and $\tilde{\mathbf{x}}^*$ by logistic regression, where $t^*$ is a randomly drawn time index. Directly applying the general theory above, we define the regression function to have the following form:
(15) $r(\mathbf{x}, t) = \sum_{i=1}^{n} \psi_i(h_i(\mathbf{x}), t)$
where each $\psi_i$ is $\mathbb{R}^2 \to \mathbb{R}$. Intuitively, this means that the nonstationarity is separately modelled for each component, with no interactions.
Theorems 1 and 3 above give exact conditions for the consistency of such a method. This provides an alternative way of estimating the nonstationary nonlinear ICA model proposed in [15] as a target for the TCL method.
A practical advantage is that if the assumptions of Theorem 1 hold, the method actually captures the independent components directly: There is no indeterminacy of a linear transformation unlike in TCL. Nor is there any nonlinear noninvertible transformation (e.g. squaring) as in TCL, although this may come at the price of constraining $\mathbf{h}$ to be invertible. The Assumption of Variability in Theorem 1 is quite comparable to the corresponding full rank condition in the convergence theory of TCL. Another advantage of our new method is that there is no need to segment the data, although any smoothness imposed on $\psi_i$ would have a somewhat similar effect, and in our simulations below we found that segmentation is computationally very useful. From a theoretical perspective, the current theory in Theorem 1 is also much more general than the TCL theory since no assumption of an exponential family is needed; “too simple” exponential families are in fact considered separately in Theorem 3.
4.2 Using history as auxiliary variables
Next, we consider the theory in the case where $\mathbf{u}$ is the history of each variable. For the purposes of our present theory, we define $\mathbf{x} = \mathbf{x}(t)$ and $\mathbf{u} = \mathbf{x}(t-1)$ based on a time-series model in (12). So, the nonlinear ICA model in (1–2) holds. Note that here, it does not make any difference if we use the past of $\mathbf{x}$ or of $\mathbf{s}$ as $\mathbf{u}$ since they are invertible functions of each other. Each component follows a distribution
(16) $s_i(t) \sim p_i(s_i(t) \mid s_i(t-1))$
This model is the same as in [14]. In fact, the intuitive idea of discriminating between true concatenation of real data vs. randomized (permuted) concatenation was used in PCL [14], in which it was proposed to discriminate between
(17) $\tilde{\mathbf{x}}(t) = (\mathbf{x}(t), \mathbf{x}(t-1))$
(18) $\tilde{\mathbf{x}}^*(t) = (\mathbf{x}(t), \mathbf{x}(t^*))$
with a random time index $t^*$. This is in fact the same discrimination problem as the one we just formulated. Likewise, the restriction of the regression function in (5) is very similar to the form imposed in Eq. (12) of [14]. Thus, essentially, Theorem 1 above provides an alternative identifiability proof of the model in [14], with quite similar constraints. More precisely, in [14], the model was proven to be identifiable under two assumptions: First, the joint log-pdf of two consecutive time points is not “factorizable” in the conditionally exponential form of order one, and second, there is a rather strong kind of temporal dependency between the time points, which was called uniform dependency. Here, we need no such latter condition, essentially because here we constrain $\mathbf{h}$ to be invertible, which was not done in [14], but seems to have a somewhat similar effect. Our goal here is thus not to sharpen the analysis of [14], but merely to show that this model falls into our general framework with minimal modification.
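The two classes used when the history is the auxiliary variable can be built directly from a time series. The following is a minimal sketch, assuming the previous observation is used as the history and the randomized partner is drawn uniformly over all time points; the function name `history_pairs` and the toy series are hypothetical.

```python
import random

def history_pairs(series, rng):
    """Build true pairs (x(t), x(t-1)) and randomized pairs (x(t), x(t*))."""
    true_pairs, random_pairs = [], []
    for t in range(1, len(series)):
        true_pairs.append((series[t], series[t - 1], 1))
        t_star = rng.randrange(len(series))  # random time index t*
        random_pairs.append((series[t], series[t_star], 0))
    return true_pairs + random_pairs

# Toy two-dimensional series, used only to show the shapes involved.
series = [[float(t), float(-t)] for t in range(6)]
pairs = history_pairs(series, random.Random(0))
```

A classifier trained on these labelled pairs can only succeed by exploiting temporal dependencies, since the marginal distribution of each element of the pair is the same in both classes.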
4.3 Combining time and history
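Another generalization of previously published theory which could be of great interest in practice is to combine the nonstationarity-based model in TCL [15] with the temporal dependencies model in PCL [14]. Clearly, we can combine these two by defining $\mathbf{x} = \mathbf{x}(t)$, $\mathbf{u} = (\mathbf{x}(t-1), t)$, and thus discriminating between
(19) $\tilde{\mathbf{x}}(t) = (\mathbf{x}(t), \mathbf{x}(t-1), t)$
(20) $\tilde{\mathbf{x}}^*(t) = (\mathbf{x}(t), \mathbf{x}(t^*-1), t^*)$
with a random time index $t^*$, and accordingly defining the regression function as
(21) $r(\mathbf{x}(t), \mathbf{x}(t-1), t) = \sum_{i=1}^{n} \psi_i(h_i(\mathbf{x}(t)), h_i(\mathbf{x}(t-1)), t)$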
Such a method now has the potential of using both nonstationarity and temporal dependencies for nonlinear ICA. Thus, there is no need to choose which method to use, since this combined method uses both properties.
4.4 Using class label as auxiliary variable
Finally, we consider the very interesting case where the data includes class labels as in a classical supervised setting, and we use them as the auxiliary variable. Let us note that the existence of labels does not mean a nonlinear ICA model is not interesting, because our interest might not be in classifying the data using these labels, but rather in understanding the structure of the data, or possibly, finding useful features for classification using some other labels. In particular, in scientific data analysis, the main goal is usually to understand the structure of the data; if the labels correspond to different treatments, or experimental conditions, the classification problem in itself may not be of great interest. It could also be that the classes are somehow artificially created, as in TCL, and thus the whole classification problem is of secondary interest.
Formally, denote by $c \in \{1, \ldots, k\}$ the class label with $k$ different classes. As a straightforward application of the theory above, we learn to discriminate between
(22) $\tilde{\mathbf{x}} = (\mathbf{x}, c)$
(23) $\tilde{\mathbf{x}}^* = (\mathbf{x}, c^*)$
where $c$ is the class label of $\mathbf{x}$, and $c^*$ is a randomized class label (i.e. a number randomly drawn from $\{1, \ldots, k\}$). Note that we could also apply the TCL method and theory on such data, simply using the $c$ as class labels instead of the time segment indices as in [15]. In either case, we use the given class labels to estimate the independent components of the data, thus combining supervised and unsupervised learning in an interesting, new way.
5 Simulations
To test the performance of the method, we applied it on nonstationary sources similar to those used in TCL. This is the case of main interest here since for temporally correlated sources, the framework gives PCL. It is not our goal to claim that the new method performs better than TCL, but rather to confirm that our new very general framework includes something similar to TCL as well.
First, we consider the non-conditionally-exponential case in Theorem 1, where the data does not follow a conditionally exponential family, and the regression function has the general form in (5). We artificially generated nonstationary sources on a 2D grid indexed by by a scale mixture model:, , where is a standardized Laplacian variable, and the scale components were generated by creating Gaussian blobs in random locations to represent areas of higher variance. The number of dimensions was and the number of data points . The mixing function was a random three-layer feedforward neural network as in [15]. We used the spatial index pair as $\mathbf{u}$. We modelled $\mathbf{h}$ by a feedforward neural network with three layers: The number of units in the hidden layers was , except in the final layer where it was ; the nonlinearity was maxout except for the last layer where absolute values were taken; regularization was used to prevent overlearning. The function $\psi_i$ was also modelled by a neural network. In contrast to the assumptions of Theorem 1, no constraint related to the invertibility of $\mathbf{h}$ was imposed. After learning the neural network, we further applied FastICA to the estimated features (heuristically inspired by Theorem 3). Performance was evaluated by the Pearson correlation between the estimated sources and the original sources (after optimal matching and sign flipping). The results are shown in Fig. 1 a). Our method has performance similar to TCL.
Second, we considered the conditionally exponential family case as in Theorem 3. We generated nonstationary sources as above, but we generated them as time series, and divided the time series into equispaced segments. We used a simple random neural network to generate separate variances inside each segment. The mixing function was as above. Here, we used the index of the segment as $\mathbf{u}$. This means we are also testing the applicability of using a class label as the auxiliary variable as in Section 4.4. We modelled $\mathbf{h}$ as above. The $\mathbf{v}$ and $b$ in (10) were modelled by constant parameter vectors inside each segment, and $a$ by another neural network. Performance was evaluated by the Pearson correlation of the absolute values of the components, since the sign remains unresolved in this case. The results are shown in Fig. 1 b). Again our method has performance similar to TCL, confirming that source separation by nonstationarity, as well as using class labels as in Section 4.4, can be modelled in our new framework.
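The evaluation by correlation after optimal matching can be sketched as follows. This is an illustrative implementation, not the code used for the reported results: it takes the absolute Pearson correlation (which has the same effect as flipping signs) and searches all permutations by brute force, which is only practical for a small number of components. The names `pearson` and `matched_correlation` are hypothetical.

```python
import itertools
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

def matched_correlation(true_sources, estimates):
    """Mean |correlation| under the best one-to-one matching of components.

    Both arguments are lists of components, each a list of samples.
    Brute force over permutations, so only suitable for small n.
    """
    n = len(true_sources)
    corr = [[abs(pearson(s, e)) for e in estimates] for s in true_sources]
    best = max(sum(corr[i][perm[i]] for i in range(n))
               for perm in itertools.permutations(range(n)))
    return best / n

# Toy check: the estimates are the sources, swapped, rescaled and sign-flipped.
true_sources = [[1.0, 2.0, 3.0, 4.0], [1.0, -1.0, 1.0, -1.0]]
estimates = [[-1.0, 1.0, -1.0, 1.0], [2.0, 4.0, 6.0, 8.0]]
score = matched_correlation(true_sources, estimates)
```

For the second experiment, where the sign is unresolved inside the nonlinearity, the same function could be applied to the absolute values of the true and estimated components.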
6 Conclusion
We introduced a new framework for nonlinear ICA. To solve the problem of nonidentifiability central to nonlinear ICA theory, we assume there is an external, auxiliary variable, such that conditioning by the auxiliary variables changes the distributions of the independent components. In a time series, the auxiliary variable can correspond to the history, or the time index, thus unifying the previous frameworks [21, 15, 14] both in theory and practice. However, the framework is quite versatile, and the auxiliary variables can be defined in many different ways depending on the application domain, and we proposed various cases. We also provided a learning algorithm based on the idea of contrastive learning by logistic regression, and proved its consistency.
We gave exact conditions for identifiability, showing how the definition of conditional exponentiality in a loose sense divides the problem into two domains. Conditional exponentiality interestingly corresponds to the simplest case of TCL theory in [15]. In the special case of nonstationary components like in TCL, we actually relaxed the assumption of an exponential family model for the independent components, and removed the need to segment the data, which may be difficult in practice; nor was there any remaining linear mixing, unlike in TCL. This result carried over to the case where we actually have class labels available; we argued that the identifiability theory of nonlinear ICA is interesting even in such an apparently supervised learning case.
Further work is needed to ascertain the applicability of the framework on real data. We believe it is necessary to first lay the theoretical groundwork for nonlinear ICA; linear ICA has always greatly benefitted from the fact that it is a rigorously defined framework which allows for rigorous theoretical development and analysis of different methods.
References
 [1] L. B. Almeida. MISEP—linear and nonlinear ICA based on mutual information. J. of Machine Learning Research, 4:1297–1318, 2003.
 [2] A. Belouchrani, K. Abed-Meraim, J.-F. Cardoso, and E. Moulines. A blind source separation technique based on second order statistics. IEEE Trans. on Signal Processing, 45(2):434–444, 1997.
 [3] J.-F. Cardoso. The three easy routes to independent component analysis: contrasts and geometry. In Proc. Int. Workshop on Independent Component Analysis and Blind Signal Separation (ICA2001), San Diego, California, 2001.
 [4] P. Comon. Independent component analysis—a new concept? Signal Processing, 36:287–314, 1994.
 [5] G. Deco and D. Obradovic. Linear redundancy reduction learning. Neural Networks, 8(5):751–755, 1995.
 [6] L. Dinh, D. Krueger, and Y. Bengio. NICE: Nonlinear independent components estimation. In Workshop at Int. Conf. on Learning Representations (ICLR2015), 2015.
 [7] J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning. Springer: New York, 2001.
 [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [9] M. U. Gutmann, R. Dutta, S. Kaski, and J. Corander. Likelihood-free inference via classification. Statistics and Computing, 2017. doi:10.1007/s11222-017-9738-6.
 [10] M. U. Gutmann and A. Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. of Machine Learning Research, 13:307–361, 2012.
 [11] S. Harmeling, A. Ziehe, M. Kawanabe, and K.-R. Müller. Kernel-based nonlinear blind source separation. Neural Computation, 15(5):1089–1124, 2003.
 [12] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
 [13] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley Interscience, 2001.
 [14] A. Hyvärinen and H. Morioka. Nonlinear ICA of temporally dependent stationary sources. In Proc. Artificial Intelligence and Statistics (AISTATS2017), 2017.
 [15] A. Hyvärinen and H. Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems (NIPS2016), 2017.
 [16] A. Hyvärinen and P. Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999.
 [17] C. Jutten, M. Babaie-Zadeh, and J. Karhunen. Nonlinear mixtures. Handbook of Blind Source Separation, Independent Component Analysis and Applications, pages 549–592, 2010.
 [18] G. Larsson, M. Maire, and G. Shakhnarovich. Colorization as a proxy task for visual understanding. In CVPR, volume 2, page 8, 2017.
 [19] K. Matsuoka, M. Ohya, and M. Kawamoto. A neural net for blind separation of nonstationary signals. Neural Networks, 8(3):411–419, 1995.
 [20] D.-T. Pham and J.-F. Cardoso. Blind separation of instantaneous mixtures of non stationary sources. IEEE Trans. Signal Processing, 49(9):1837–1848, 2001.
 [21] H. Sprekeler, T. Zito, and L. Wiskott. An extension of slow feature analysis for nonlinear blind source separation. J. of Machine Learning Research, 15(1):921–947, 2014.
 [22] Y. Tan, J. Wang, and J.M. Zurada. Nonlinear blind source separation using a radial basis function network. IEEE Transactions on Neural Networks, 12(1):124–134, 2001.
Supplementary Material (Proofs)
Appendix A Proof of Theorem 1
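By well-known theory [7], after convergence of logistic regression, with infinite data and a function approximator with universal approximation capability, the regression function will equal the difference of the log-densities in the two classes:
(24) $\sum_{i=1}^{n} \psi_i(h_i(\mathbf{x}), \mathbf{u}) = \sum_{i=1}^{n} q_i(g_i(\mathbf{x}), \mathbf{u}) + \log p(\mathbf{u}) + \log |\det \mathbf{J}\mathbf{g}(\mathbf{x})| - \log p_s(\mathbf{g}(\mathbf{x})) - \log p(\mathbf{u}) - \log |\det \mathbf{J}\mathbf{g}(\mathbf{x})|$
where $\log p_s$ is the marginal log-density of the components when $\mathbf{u}$ is integrated out (as pointed out above, it does not need to be factorial), $p(\mathbf{u})$ is the marginal density of the auxiliary variables, $\mathbf{g} = \mathbf{f}^{-1}$, and the $\mathbf{J}\mathbf{g}$ are the Jacobians of the inverse mixing, which nicely cancel out. Also, the marginals $p(\mathbf{u})$ cancel out here.
Now, change variables to $\mathbf{y} = \mathbf{h}(\mathbf{x})$ and define $\mathbf{v}(\mathbf{y}) = \mathbf{g}(\mathbf{h}^{-1}(\mathbf{y}))$, which is possible by the assumption of invertibility of $\mathbf{h}$. We then have
(25) $\sum_{i=1}^{n} \psi_i(y_i, \mathbf{u}) = \sum_{i=1}^{n} q_i(v_i(\mathbf{y}), \mathbf{u}) - \log p_s(\mathbf{v}(\mathbf{y}))$
What we need to prove is that this can be true for all $\mathbf{y}$ and $\mathbf{u}$ only if each $v_i$ depends on only one of the $y_j$.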
Denote . Taking derivatives of both sides of (25) with respect to , denoting the derivatives by a superscript as
(26)  
(27) 
and likewise for , and , we obtain
(28) 
Taking another derivative with respect to with , the left-hand side vanishes, and we have
(29) 
where the are second-order cross-derivatives. Collect all these equations in vector form by defining as a vector collecting all entries (we omit diagonal terms, and by symmetry, take only one half of the indices). Likewise, collect all the entries in the vector , and all the entries in the vector . We can thus write the equations above as a single system of equations
(30) 
Now, collect the and into a matrix :
(31) 
Equation (30) takes the form of the following linear system
(32) 
where is defined in the Assumption of Variability, Eq. (8). This must hold for all and . Note that the size of is .
Now, fix . Consider the points given for that by the Assumption of Variability. Collect the equations (32) above for the points starting from index :
(33) 
and collect likewise the equation for index repeated times:
(34) 
Now, subtract (34) from (33) to obtain
(35) 
The matrix consisting of the here has, by the Assumption of Variability, linearly independent columns. It is square, of size , so it is invertible. This implies is zero, and thus by definition in (31), the and are all zero.
In particular, being zero implies no row of the Jacobian of can have more than one nonzero entry. This holds for any . By continuity of the Jacobian and its invertibility, the nonzero entries in the Jacobian must be in the same places for all : If they switched places, there would have to be a point where the Jacobian is singular, which would contradict the assumption of invertibility of .
This means that each is a function of only one . The invertibility of also implies that each of these scalar functions is invertible. Thus, we have proven the convergence of our method, as well as provided a new identifiability result for nonlinear ICA.
Appendix B Proof of Theorem 2
For notational simplicity, consider just the case ; the results are clearly simple to generalize to any dimensions. Furthermore, we set ; again, the proof easily generalizes. The assumption of conditional exponentiality means
(36)  
(37) 
and by definition of in (8), we get
(38) 
Now we fix like in the Assumption of Variability, and drop it from the equation. The above can be written as
(39) 
So, we see that for fixed is basically given by a linear combination of fixed “basis” vectors, with the ’s giving their coefficients.
If , it is impossible to obtain the linearly independent vectors since there are only basis vectors. On the other hand, if , the vectors for each span a 2D subspace by assumption. For different , they are clearly independent since the nonzero entries are in different places. Thus, the basis vectors span a dimensional subspace, which means we will obtain linearly independent vectors by this construction for randomly chosen . Subtraction of does not reduce the independence almost surely, since it is simply redefining the origin, and does not change the linear independence.
Appendix C Proof of Theorem 3
Denote by the marginal log-density of . As in the proof of Theorem 1, assuming infinite data, well-known theory says that the regression function will converge to
(40) 
provided that such a distribution can be approximated by the regression function. Here, we define . In fact, the approximation is clearly possible since the difference of the log-pdf’s is linear in the same sense as the regression function. In other words, a solution is possible as
(41) 
with
(42)  
(43)  
(44)  
(45) 
Thus, we can have the special form for the regression function in (10). Next, we have to prove that this is the only solution up to the indeterminacies given in the Theorem.
Collect these equations for all the given by Assumption 3 in the Theorem. Denote by a matrix of the , with the product of giving row index and column index. Denote a vector of all the sufficient statistics of all the independent components as . Collect all the into a matrix with again as the column index. Collect the terms for all the different into a vector .
Expressing (41) for all the time points in matrix form, we have
(46) 
where is a vector of ones. Now, on both sides of the equation, subtract the first row from each of the other rows. We get
(47) 
where the matrices with bars are such differences of the rows of and , and likewise for . We see that the last term in (46) disappears.
Now, the matrix is indeed the same as in Assumption 3 of the Theorem, which says that the modulations of the distributions of the are independent in the sense that is invertible. Then, we can multiply both sides by the inverse of and get
(48) 
with an unknown matrix , and a constant vector .
Thus, just like in TCL, we see that the hidden units give the sufficient statistics , up to a linear transformation , and the Theorem is proven.