The Scaling Limit of High-Dimensional Online Independent Component Analysis
Abstract
We analyze the dynamics of an online algorithm for independent component analysis in the high-dimensional scaling limit. As the ambient dimension tends to infinity, and with proper time scaling, we show that the time-varying joint empirical measure of the target feature vector and the estimates provided by the algorithm will converge weakly to a deterministic measure-valued process that can be characterized as the unique solution of a nonlinear PDE. Numerical solutions of this PDE, which involves two spatial variables and one time variable, can be efficiently obtained. These solutions provide detailed information about the performance of the ICA algorithm, as many practical performance metrics are functionals of the joint empirical measures. Numerical simulations show that our asymptotic analysis is accurate even for moderate dimensions. In addition to providing a tool for understanding the performance of the algorithm, our PDE analysis also provides useful insight. In particular, in the high-dimensional limit, the original coupled dynamics associated with the algorithm will be asymptotically "decoupled", with each coordinate independently solving a 1D effective minimization problem via stochastic gradient descent. Exploiting this insight to design new algorithms for achieving optimal trade-offs between computational and statistical efficiency may prove an interesting line of future research.
Chuang Wang and Yue M. Lu John A. Paulson School of Engineering and Applied Sciences Harvard University 33 Oxford Street, Cambridge, MA 02138, USA {chuangwang,yuelu}@seas.harvard.edu
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
1 Introduction
Online learning methods based on stochastic gradient descent are widely used in many learning and signal processing problems. Examples include the classical least mean squares (LMS) algorithm [1] in adaptive filtering, principal component analysis [2, 3], independent component analysis (ICA) [4], and the training of shallow or deep artificial neural networks [5, 6, 7]. Analyzing the convergence rate of stochastic gradient descent has been the subject of a vast literature (see, e.g., [8, 9, 10, 11]). Unlike existing work that analyzes the behavior of the algorithms in finite dimensions, we present in this paper a framework for studying the exact dynamics of stochastic gradient algorithms in the high-dimensional limit, using online ICA as a concrete example. Instead of minimizing a generic function as considered in the optimization literature, the stochastic algorithm we analyze here is solving an estimation problem. The extra assumptions on the ground truth (e.g., the feature vector) and on the generative models for the observations allow us to obtain the exact asymptotic dynamics of the algorithm.
As the main result of this work, we show that, as the ambient dimension $n$ tends to infinity and with proper time-scaling, the time-varying joint empirical measure of the true underlying independent component $\boldsymbol{\xi}$ and its estimate $\boldsymbol{x}_k$ converges weakly to the unique solution of a nonlinear partial differential equation (PDE) [see (6)]. Since many performance metrics, such as the correlation between $\boldsymbol{\xi}$ and $\boldsymbol{x}_k$ and the support recovery rate, are functionals of the joint empirical measure, knowledge about the asymptotics of the latter allows us to easily compute the asymptotic limits of various performance metrics of the algorithm.
This work is an extension of a recent analysis on the dynamics of online sparse PCA [12] to more general settings. The idea of studying the scaling limits of online learning algorithms first appeared in a series of works, mostly from the statistical physics communities [3, 5, 13, 14, 15, 16], in the 1990s. Similar to our setting, those early papers studied the dynamics of various online learning algorithms in high dimensions. In particular, they show that the mean-squared error (MSE) of the estimation, together with a few other "order parameters", can be characterized as the solution of a deterministic system of coupled ordinary differential equations (ODEs) in the large-system limit. One limitation of such ODE-level analysis is that it cannot provide information about the distributions of the estimates. The latter are often needed when one wants to understand more general performance metrics beyond the MSE. Another limitation is that the ODE analysis cannot handle cases where the algorithms have non-quadratic regularization terms (e.g., the incorporation of $\ell_1$ norms to promote sparsity). In this paper, we show that both limitations can be eliminated by using our PDE-level analysis, which tracks the asymptotic evolution of the probability distributions of the estimates given by the algorithm. In a recent paper [10], the dynamics of an ICA algorithm was studied via a diffusion approximation. As an important distinction, the analysis in [10] keeps the ambient dimension $n$ fixed and studies the scaling limit of the algorithm as the step size tends to 0. The resulting PDEs involve $n$ spatial variables. In contrast, our analysis studies the limit as the dimension $n \to \infty$, with a constant step size. The resulting PDEs involve only 2 spatial variables. This low-dimensional characterization makes our limiting results more practical to use, especially when the dimension is large.
The basic idea underlying our analysis can trace its roots to the early work of McKean [17, 18], who studied the statistical mechanics of Markovian-type mean-field interacting particles. The mathematical foundation of this line of research was further established in the 1980s (see, e.g., [19, 20]). This theoretical tool has also been used in the analysis of high-dimensional MCMC algorithms [21]. In our work, we study stochastic gradient algorithms through the lens of high-dimensional stochastic processes. Interestingly, the analysis does not explicitly depend on whether the underlying optimization problem is convex or nonconvex. This feature makes the presented analysis techniques a potentially very useful tool for understanding the effectiveness of using low-complexity iterative algorithms to solve high-dimensional nonconvex estimation problems, a line of research that has recently attracted much attention (see, e.g., [22, 23, 24, 25]).
The rest of the paper is organized as follows. We first describe in Section 2 the observation model and the online ICA algorithm studied in this work. The main convergence results are given in Section 3, where we show that the time-varying joint empirical measure of the target independent component and the estimates given by the algorithm can be characterized, in the high-dimensional limit, by the solution of a deterministic PDE. Due to space constraints, we only provide in the appendix a formal derivation leading to the PDE, and leave the rigorous proof of convergence to a follow-up paper. Finally, in Section 4 we present some insights obtained from our asymptotic analysis. In particular, in the high-dimensional limit, the original coupled dynamics associated with the algorithm become asymptotically "decoupled", with each coordinate independently solving a 1D effective minimization problem via stochastic gradient descent.
Notations and Conventions:
Throughout this paper, we use boldfaced lowercase letters, such as $\boldsymbol{\xi}$ and $\boldsymbol{x}_k$, to represent $n$-dimensional vectors. The subscript $k$ in $\boldsymbol{x}_k$ denotes the discrete-time iteration step. The $i$th components of the vectors $\boldsymbol{\xi}$ and $\boldsymbol{x}_k$ are written as $\xi_i$ and $x_{k,i}$, respectively.
2 Data model and online ICA
We consider a generative model in which a stream of sample vectors $\boldsymbol{y}_k \in \mathbb{R}^n$, for $k = 1, 2, \ldots$, are generated according to
(1) $\boldsymbol{y}_k = \frac{c_k}{\sqrt{n}}\,\boldsymbol{\xi} + \boldsymbol{a}_k,$
where $\boldsymbol{\xi} \in \mathbb{R}^n$ is a unique feature vector we want to recover. (For simplicity, we consider the case of recovering a single feature vector, but our analysis technique can be generalized to study cases involving a finite number of feature vectors.) Here $c_k$ is an i.i.d. random variable drawn from an unknown non-Gaussian distribution $P_c$ with zero mean and unit variance, and $\boldsymbol{a}_k \sim \mathcal{N}\big(\boldsymbol{0},\, \boldsymbol{I} - \tfrac{1}{n}\boldsymbol{\xi}\boldsymbol{\xi}^\top\big)$ models background noise. We use the normalization $\lVert\boldsymbol{\xi}\rVert = \sqrt{n}$, so that in the large-$n$ limit all elements of the vector $\boldsymbol{\xi}$ are $O(1)$ quantities. The observation model (1) is equivalent to the standard sphered data model $\boldsymbol{y}_k = \boldsymbol{A}\boldsymbol{s}_k$, where $\boldsymbol{A}$ is an orthonormal matrix with the first column being $\tfrac{1}{\sqrt{n}}\boldsymbol{\xi}$, the first entry of $\boldsymbol{s}_k$ is $c_k$, and the remaining entries of $\boldsymbol{s}_k$ form an i.i.d. $(n-1)$-dimensional standard Gaussian random vector.
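To make the model concrete, here is a minimal sketch of a sampler for the observation model (1), under our reading of the text: a Rademacher non-Gaussian source and Gaussian noise confined to the orthogonal complement of the feature vector, so that the data are sphered. The exact noise covariance is our assumption rather than an explicit statement above.

```python
import numpy as np

# Sketch of the observation model (1): y_k = (c_k / sqrt(n)) * xi + a_k.
# We sample a_k ~ N(0, I - xi xi^T / n), an assumption chosen so that
# E[y y^T] = I (sphered data).
def generate_samples(xi, num_samples, rng):
    n = xi.size                                    # ambient dimension; ||xi|| = sqrt(n)
    c = rng.choice([-1.0, 1.0], size=num_samples)  # symmetric non-Gaussian source c_k
    g = rng.standard_normal((num_samples, n))
    # project the Gaussian noise onto the orthogonal complement of xi
    a = g - np.outer(g @ xi / n, xi)
    return (c[:, None] / np.sqrt(n)) * xi + a, c

rng = np.random.default_rng(0)
n = 1000
xi = rng.choice([-1.0, 1.0], size=n)   # entries +-1, so ||xi||^2 = n exactly
Y, c = generate_samples(xi, 2000, rng)
```

Because the noise is projected exactly onto the complement of $\boldsymbol{\xi}$, each sample satisfies $\boldsymbol{y}_k^\top\boldsymbol{\xi} = c_k\sqrt{n}$, which gives a quick sanity check on the implementation.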
To establish the large-$n$ limit, we shall assume that the empirical measure of $\boldsymbol{\xi}$, defined by $\frac{1}{n}\sum_{i=1}^n \delta(\xi - \xi_i)$, converges weakly to a deterministic measure $\mu(\xi)$ with finite moments. Note that this assumption can be satisfied in a stochastic setting, where each element $\xi_i$ is an i.i.d. random variable drawn from $\mu$, or in a deterministic setting, where the entries of $\boldsymbol{\xi}$ take values in a fixed finite set with convergent empirical frequencies.
We use an online learning algorithm to extract the non-Gaussian component $\boldsymbol{\xi}$ from the data stream $\{\boldsymbol{y}_k\}$. Let $\boldsymbol{x}_k$ be the estimate of $\boldsymbol{\xi}$ at step $k$. Starting from an initial estimate $\boldsymbol{x}_0$, the algorithm updates $\boldsymbol{x}_k$ by
(2) $\tilde{\boldsymbol{x}}_k = \boldsymbol{x}_{k-1} + \frac{\tau}{n}\Big[\sqrt{n}\,f\!\Big(\tfrac{1}{\sqrt{n}}\,\boldsymbol{y}_k^\top\boldsymbol{x}_{k-1}\Big)\,\boldsymbol{y}_k - \phi(\boldsymbol{x}_{k-1})\Big], \qquad \boldsymbol{x}_k = \sqrt{n}\,\frac{\tilde{\boldsymbol{x}}_k}{\lVert\tilde{\boldsymbol{x}}_k\rVert},$
where $\tau > 0$ is the step size, $f$ is a given twice-differentiable function, and $\phi$ is an element-wise nonlinear mapping introduced to enforce prior information about $\boldsymbol{\xi}$, e.g., sparsity. The scaling factors involving $\sqrt{n}$ in the above equations make sure that each component of the estimate $\boldsymbol{x}_k$ is of size $O(1)$ in the large-$n$ limit.
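The update rule can be sketched as follows. The placement of the $\sqrt{n}$ factors and the projection back onto the sphere of radius $\sqrt{n}$ reflect our reading of (2) and should be checked against the original paper; the data stream below is a pure-noise placeholder rather than samples from model (1).

```python
import numpy as np

# One step of the online ICA update (2), as reconstructed in this section:
# a stochastic-gradient step followed by renormalization to ||x|| = sqrt(n).
def ica_step(x, y, tau, f, phi):
    n = x.size
    u = y @ x / np.sqrt(n)                     # low-dimensional "output"
    x_tilde = x + (tau / n) * (np.sqrt(n) * f(u) * y - phi(x))
    return np.sqrt(n) * x_tilde / np.linalg.norm(x_tilde)

f = lambda u: u ** 3          # contrast function for symmetric sources
phi = lambda x: np.sign(x)    # l1-type regularizer (sparse feature vector)

rng = np.random.default_rng(1)
n, tau = 500, 0.1
x = rng.standard_normal(n)
x *= np.sqrt(n) / np.linalg.norm(x)   # initialize on the sphere
for _ in range(200):
    y = rng.standard_normal(n)        # placeholder stream; see model (1)
    x = ica_step(x, y, tau, f, phi)
```

The renormalization makes the iteration a projected stochastic gradient method: every iterate stays on the sphere $\lVert\boldsymbol{x}_k\rVert = \sqrt{n}$.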
The above online learning scheme can be viewed as a projected stochastic gradient algorithm for solving an optimization problem
(3) $\widehat{\boldsymbol{x}} = \operatorname*{arg\,min}_{\lVert\boldsymbol{x}\rVert = \sqrt{n}}\; -\mathbb{E}\,F\!\Big(\tfrac{1}{\sqrt{n}}\,\boldsymbol{y}^\top\boldsymbol{x}\Big) + \tfrac{1}{n}\,\Phi(\boldsymbol{x}),$
where $F$ is an antiderivative of $f$ (i.e., $F' = f$) and
(4) $\Phi(\boldsymbol{x}) \triangleq \sum_{i=1}^{n} \varphi(x_i), \qquad \text{with } \varphi' = \phi,$
is a regularization function. In (2), once a new sample $\boldsymbol{y}_k$ is received, we update $\boldsymbol{x}_k$ using an instantaneous noisy estimate of the gradient, in place of the true gradient of the objective in (3).
In practice, one can use $f(x) = \pm x^3$ or $f(x) = \pm\tanh(x)$ to extract symmetric non-Gaussian signals (for which $\mathbb{E}\,c^3 = 0$ and $\mathbb{E}\,c^4 \neq 3$) and use $f(x) = \pm x^2$ to extract asymmetric non-Gaussian signals. The algorithm in (2) with $f(x) = x^3$ can also be regarded as implementing a low-rank tensor decomposition related to the empirical kurtosis tensor of the data [10, 11].
For the nonlinear mapping $\phi$, the choice $\phi(x) = \beta x$ for some $\beta > 0$ corresponds to using an $\ell_2$ norm in the regularization term $\Phi(\boldsymbol{x})$. If the feature vector $\boldsymbol{\xi}$ is known to be sparse, we can set $\phi(x) = \beta\,\mathrm{sgn}(x)$, which is equivalent to adding an $\ell_1$ regularization term.
3 Main convergence result
We provide an exact characterization of the dynamics of the online learning algorithm (2) as the ambient dimension $n$ goes to infinity. First, we define the joint empirical measure of the feature vector $\boldsymbol{\xi}$ and its estimate $\boldsymbol{x}_k$ as
(5) $\mu_t^n(\xi, x) \triangleq \frac{1}{n}\sum_{i=1}^{n} \delta\big(\xi - \xi_i,\; x - x_{k,i}\big),$
with $k$ defined by $k = \lfloor n t \rfloor$. Here we rescale (i.e., "accelerate") time by a factor of $n$.
The joint empirical measure defined above carries a lot of information about the performance of the algorithm. For example, as both $\boldsymbol{x}_k$ and $\boldsymbol{\xi}$ have the same norm $\sqrt{n}$ by definition, the normalized correlation between $\boldsymbol{x}_k$ and $\boldsymbol{\xi}$, defined by $Q_k \triangleq \frac{\boldsymbol{\xi}^\top \boldsymbol{x}_k}{\lVert\boldsymbol{\xi}\rVert\,\lVert\boldsymbol{x}_k\rVert}$,
can be computed as $Q_{\lfloor nt\rfloor} = \mathbb{E}_{\mu_t^n}[\xi\,x]$, i.e., the expectation of $\xi x$ taken with respect to the empirical measure. More generally, any separable performance metric $H_t^n \triangleq \frac{1}{n}\sum_{i=1}^n h(\xi_i, x_{\lfloor nt\rfloor, i})$ with some function $h(\xi, x)$ can be expressed as an expectation with respect to the empirical measure $\mu_t^n$, i.e., $H_t^n = \mathbb{E}_{\mu_t^n}[h(\xi, x)]$.
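The identity between the normalized correlation and an expectation under the joint empirical measure is easy to verify numerically. In the sketch below, `x` is a synthetic estimate placed on the sphere, not an actual ICA output.

```python
import numpy as np

# With ||xi|| = ||x|| = sqrt(n), the normalized correlation Q equals the
# empirical average of xi_i * x_i, i.e., the expectation of (xi * x)
# under the joint empirical measure mu_t^n.
rng = np.random.default_rng(2)
n = 10000
xi = rng.choice([-1.0, 1.0], size=n)
x = 0.6 * xi + rng.standard_normal(n)   # synthetic estimate, partly aligned
x *= np.sqrt(n) / np.linalg.norm(x)     # put the estimate on the sphere

Q_direct = xi @ x / (np.linalg.norm(xi) * np.linalg.norm(x))
Q_measure = np.mean(xi * x)             # expectation under mu_t^n
```

Both quantities agree exactly here because the normalization $\lVert\boldsymbol{\xi}\rVert = \lVert\boldsymbol{x}\rVert = \sqrt{n}$ turns the inner product divided by $n$ into a per-coordinate average.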
Directly computing $H_t^n$ via the expectation is challenging, as $\mu_t^n$ is a random probability measure. We bypass this difficulty by investigating the limiting behavior of the joint empirical measure defined in (5). Our main contribution is to show that, as $n \to \infty$, the sequence of random probability measures $\{\mu_t^n\}$ converges weakly to a deterministic measure $\mu_t$. The limiting value of $H_t^n$ can then be computed from the limiting measure via the identity $\lim_{n\to\infty} H_t^n = \mathbb{E}_{\mu_t}[h(\xi, x)]$.
Let $P_t(x, \xi)$ be the density function of the limiting measure $\mu_t$ at time $t$. We show that it is characterized as the unique solution of the following nonlinear PDE:
(6) $\frac{\partial}{\partial t} P_t(x, \xi) = -\frac{\partial}{\partial x}\Big[\Gamma_t(x, \xi)\,P_t(x, \xi)\Big] + \frac{\tau^2\,\Lambda(Q_t)}{2}\,\frac{\partial^2}{\partial x^2} P_t(x, \xi),$
with
(7) $\Gamma_t(x, \xi) \triangleq \tau\Big[G(Q_t)\,(\xi - Q_t\,x) - \phi(x) + x \int x'\,\phi(x')\,P_t(x', \xi')\,\mathrm{d}x'\,\mathrm{d}\xi'\Big] - \frac{\tau^2}{2}\,\Lambda(Q_t)\,x$
(8) $Q_t \triangleq \int \xi\,x\,P_t(x, \xi)\,\mathrm{d}x\,\mathrm{d}\xi,$
where the two functions $G(\cdot)$ and $\Lambda(\cdot)$ are defined as
(9) $G(Q) \triangleq \langle c\,f(u)\rangle - Q\,\langle f'(u)\rangle$
(10) $\Lambda(Q) \triangleq \langle f^2(u)\rangle,$
where
(11) $u \triangleq c\,Q + e\,\sqrt{1 - Q^2}.$
In the above equations, $c$ and $e$ denote two independent random variables, with $e \sim \mathcal{N}(0, 1)$ and $c \sim P_c$, the non-Gaussian distribution of $c_k$ introduced in (1); the notation $\langle\cdot\rangle$ denotes the expectation over $c$ and $e$; and $f$ and $\phi$ are the two functions used in the online learning algorithm (2).
Example 1
As a concrete example, we consider the case where $c_k$ is drawn from a symmetric non-Gaussian distribution. Due to symmetry, $\mathbb{E}\,c^3 = 0$. Write $\kappa \triangleq \mathbb{E}\,c^4 - 3$ for the excess kurtosis and $\bar{c}_6 \triangleq \mathbb{E}\,c^6$. We use $f(x) = x^3$ in (2) to detect the feature vector $\boldsymbol{\xi}$. Substituting this specific $f$ into (9)–(11), we obtain
(12) $G(Q) = \kappa\,Q^3$
(13) $\Lambda(Q) = 15\,(1 - Q^2)^3 + 45\,Q^2 (1 - Q^2)^2 + 15\,(\kappa + 3)\,Q^4 (1 - Q^2) + \bar{c}_6\,Q^6,$
which give the drift and diffusion coefficients in (7) and (6). Moreover, for the case $\phi \equiv 0$, we derive a simple ODE for $Q_t$ as
(14) $\frac{\mathrm{d}}{\mathrm{d}t} Q_t = \tau\,\kappa\,Q_t^3\,(1 - Q_t^2) - \frac{\tau^2}{2}\,\Lambda(Q_t)\,Q_t.$
Numerical verifications of the ODE results are shown in Figure 1(a). In our experiment, the results at a moderate ambient dimension are averaged over 10 independent trials, with error bars corresponding to one standard deviation. Two different initial values of $Q_0$ are used. In both cases, the asymptotic theoretical predictions match the numerical results very well.


The ODE in (14) can be solved analytically. Next we briefly discuss its stability. The right-hand side of (14) is plotted in Figure 1(b) as a function of $Q$. It is clear that the ODE (14) always admits the solution $Q = 0$, which corresponds to a trivial, non-informative solution. Moreover, this trivial solution is always a stable fixed point. When the step size $\tau$ exceeds some constant $\tau_c$, $Q = 0$ is also the unique stable fixed point. When $\tau < \tau_c$, however, two additional solutions of the ODE emerge. One is a stable fixed point, denoted by $Q_s$, and the other is an unstable fixed point, denoted by $Q_u$, with $Q_u < Q_s$. Thus, in order to reach an informative solution, one must initialize the algorithm with $Q_0 > Q_u$. This insight agrees with a previous stability analysis in [26], where the authors investigated the dynamics near $Q = 0$ via a small-$Q$ expansion.
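The bistability described above can be reproduced by forward-Euler integration, assuming the ODE takes the reconstructed form $\dot{Q}_t = \tau\kappa Q_t^3(1-Q_t^2) - \frac{\tau^2}{2}\Lambda(Q_t)\,Q_t$ with $f(x) = x^3$. The source moments `kappa` and `c6` below are hypothetical values (not taken from the paper's experiments), chosen only to satisfy the moment inequalities of a real random variable.

```python
import numpy as np

# Euler integration of the (reconstructed) ODE (14) for the normalized
# correlation Q_t, with Lambda(Q) = <f^2(u)> for f(u) = u^3.
kappa, c6 = 3.0, 40.0   # hypothetical excess kurtosis and sixth moment of c

def Lam(Q):
    s2 = 1.0 - Q ** 2
    return (15 * s2**3 + 45 * Q**2 * s2**2
            + 15 * (kappa + 3) * Q**4 * s2 + c6 * Q**6)

def run_ode(Q0, tau, dt=1e-3, T=200.0):
    Q = Q0
    for _ in range(int(T / dt)):
        Q += dt * (tau * kappa * Q**3 * (1 - Q**2) - 0.5 * tau**2 * Lam(Q) * Q)
    return Q

tau = 0.05
Q_low = run_ode(0.1, tau)   # initialized below the unstable fixed point
Q_high = run_ode(0.5, tau)  # initialized above it
```

With these parameters the trajectory started at $Q_0 = 0.1$ decays to the non-informative solution $Q = 0$, while the one started at $Q_0 = 0.5$ climbs to the informative stable branch, illustrating the basin-of-attraction picture.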
Example 2
In this experiment, we verify the accuracy of the asymptotic predictions given by the PDE (6). The settings are similar to those in Example 1. In addition, we assume that the feature vector $\boldsymbol{\xi}$ is sparse, consisting of a fixed number of nonzero elements of equal value. Figure 2 shows the asymptotic conditional densities of the estimate, given the zero and nonzero values of $\xi$, at two different times. These theoretical predictions are obtained by solving the PDE (6) numerically. Also shown in the figure are the empirical conditional densities associated with one realization of the ICA algorithm. Again, we observe that the theoretical predictions and the numerical results are in excellent agreement.
To demonstrate the usefulness of the PDE analysis in providing detailed information about the performance of the algorithm, we show in Figure 3 the performance of sparse support recovery using a simple hard-thresholding scheme on the estimates provided by the algorithm. By changing the threshold value, one can trade off the true positive rate against the false positive rate. As we can see from the figure, this precise trade-off can be accurately predicted by our PDE analysis.
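The hard-thresholding scheme can be sketched as follows. The noisy estimate here is a synthetic stand-in (signal plus Gaussian error) rather than an actual ICA output, so the resulting operating points are illustrative only.

```python
import numpy as np

# Hard-thresholding support recovery: declare coordinate i "on the support"
# when |x_i| > threshold, then sweep the threshold to trace out the
# true-positive / false-positive trade-off.
rng = np.random.default_rng(3)
n, k = 5000, 500
xi = np.zeros(n)
xi[:k] = np.sqrt(n / k)                 # sparse feature vector, ||xi||^2 = n
x = xi + 0.8 * rng.standard_normal(n)   # synthetic noisy estimate of xi

def support_rates(x, xi, threshold):
    detected = np.abs(x) > threshold
    true_support = xi != 0
    tpr = np.mean(detected[true_support])    # true positive rate
    fpr = np.mean(detected[~true_support])   # false positive rate
    return tpr, fpr

rates = [support_rates(x, xi, t) for t in np.linspace(0.0, 4.0, 9)]
```

Because raising the threshold can only shrink the detected set, both rates are monotonically non-increasing in the threshold, which is what produces the trade-off curve of Figure 3.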
4 Insights given by the PDE analysis
In this section, we present some insights that can be gained from our high-dimensional analysis. To simplify the PDE in (6), we can assume that the two coefficient functions determined by (7) and (8) are given to us in an oracle way. Under this assumption, the PDE (6) describes the limiting empirical measure of the following stochastic process
(15) $x_{k,i} = x_{k-1,i} + \frac{1}{n}\,\Gamma_{k/n}(x_{k-1,i}, \xi_i) + \frac{\tau}{\sqrt{n}}\,\sqrt{\Lambda(Q_{k/n})}\;w_{k,i},$
where $\{w_{k,i}\}$ is a sequence of independent standard Gaussian random variables. Unlike the original online learning update equation (2), where different coordinates of $\boldsymbol{x}_k$ are coupled, the above process is uncoupled: each component $x_{k,i}$, for $1 \le i \le n$, evolves independently when conditioned on $Q_t$ and $\Lambda(Q_t)$. The continuous-time limit of (15) is described by a stochastic differential equation (SDE)
$\mathrm{d}x_t = \Gamma_t(x_t, \xi)\,\mathrm{d}t + \tau\,\sqrt{\Lambda(Q_t)}\,\mathrm{d}B_t,$ where $B_t$ is the standard Brownian motion.
We next have a closer look at the equation (15). Given a scalar $\xi$ and the oracle quantities $Q_t$ and $\Lambda(Q_t)$, we can define a time-varying 1D regularized quadratic optimization problem with the effective potential
(16) $E_t(x; \xi) \triangleq \frac{A_t}{2}\Big(x - \frac{\tau\,G(Q_t)}{A_t}\,\xi\Big)^2 + \tau\,\varphi(x),$
where $A_t \triangleq \tau\,Q_t\,G(Q_t) - \tau \int x\,\phi(x)\,P_t(x, \xi)\,\mathrm{d}x\,\mathrm{d}\xi + \frac{\tau^2}{2}\,\Lambda(Q_t)$, and $\varphi$ is the regularization function defined in (4). Then, the stochastic process (15) can be viewed as a stochastic gradient descent for solving this 1D problem, with a step size equal to $1/n$. One can verify that the exact gradient of (16) is $-\Gamma_t(x, \xi)$. The third term in (15) adds stochastic noise to the true gradient. Interestingly, although the original optimization problem (3) is nonconvex, its 1D effective optimization problem is always convex for convex regularizers (e.g., $\varphi(x) = \beta\,\lvert x\rvert$). This provides an intuitive explanation for the practical success of online ICA.
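For an $\ell_1$ regularizer, the minimizer of such a quadratic-plus-convex potential has a closed form (soft thresholding), which makes the shrinkage explicit. The coefficients `A`, `m`, and `beta` below are hypothetical stand-ins for the oracle quantities built from $Q_t$, $G(Q_t)$, and $\Lambda(Q_t)$.

```python
import numpy as np

# Minimizer of the 1D effective potential  A/2 * (x - m*xi)^2 + beta*|x|.
# For this quadratic-plus-l1 form the exact minimizer is the soft-threshold
# of the unregularized target m*xi.
A, m, beta = 1.0, 0.9, 0.3    # hypothetical effective coefficients

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def minimize_effective(xi):
    # argmin_x  A/2 * (x - m*xi)^2 + beta*|x|
    return soft_threshold(m * xi, beta / A)

x_star = minimize_effective(1.0)   # 0.9 shrunk to 0.6: biased toward 0
```

The minimizer always has magnitude strictly smaller than the unregularized target $m\xi$ whenever $\xi \neq 0$, which is exactly the estimation bias of the $\ell_1$ term discussed next.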
To visualize this 1D effective optimization problem, we plot in Figure 2(b) the effective potential at two different times. From the figure, we can see that the $\ell_1$ norm always introduces a bias in the estimation for all nonzero $\xi$, as the minimum point of the effective 1D cost function is always shifted towards the origin. We expect that the insights gained from the 1D effective optimization problem can guide the design of better regularization functions, achieving smaller estimation errors without sacrificing convergence speed. This may prove an interesting line of future work.
This uncoupling phenomenon is a typical consequence of mean-field dynamics, exemplified by the Sherrington-Kirkpatrick model [27] in statistical physics. Similar phenomena have been observed or proved for other high-dimensional algorithms, especially those related to approximate message passing (AMP) [28, 29, 30]. However, for those algorithms, which use batch updating rules with the Onsager reaction term, the limiting densities of the iterates are Gaussian. The evolution of such densities can therefore be characterized by tracking a few scalar parameters in discrete time. In our case, the limiting densities are typically non-Gaussian and cannot be parametrized by finitely many scalars. Thus the PDE limit (6) is required.
Appendix: A Formal derivation of the PDE
In this appendix, we present a formal derivation of the PDE (6). We first note that $\{(\xi_i, x_{k,i})\}_{1 \le i \le n}$, with $k \ge 0$, forms an exchangeable Markov chain on $\mathbb{R}^2$ driven by the random variable $c_k$ and the Gaussian random vector $\boldsymbol{a}_k$. The drift coefficient and the diffusion coefficient in the PDE (6) are determined, respectively, by the conditional mean and variance of the increment $x_{k,i} - x_{k-1,i}$, conditioned upon the previous state vector $\boldsymbol{x}_{k-1}$ and $\boldsymbol{\xi}$.
Let the increment of the gradient-descent step in the learning rule (2) be
(17) $\delta_{k,i} \triangleq \frac{\tau}{n}\Big[\sqrt{n}\,y_{k,i}\,f(u_k) - \phi(x_{k-1,i})\Big],$
where $y_{k,i}$ is the $i$th component of $\boldsymbol{y}_k$ and $u_k \triangleq \tfrac{1}{\sqrt{n}}\,\boldsymbol{y}_k^\top\boldsymbol{x}_{k-1}$ is the output.
where is the th component of the output . Let denote the conditional expectation with respect to and given and .
where and . We use the Taylor expansion of around up to the first order and get
where includes all higher order terms. As , the random variable converges to a deterministic quantity . Moreover, and are both zeromean Gaussian with the covariance matrix . We thus have
and
where in the last line, we use the Taylor expansion again to expand around and the bracket denotes the average over two independent random variables and . Thus, we have
where the function is defined in (11).
To compute the (conditional) variance, we note that the term $\sqrt{n}\,y_{k,i}\,f(u_k)$ dominates, which gives
$\mathbb{E}_k\big[\delta_{k,i}^2\big] = \frac{\tau^2}{n}\,\langle f^2(u)\rangle + o(n^{-1}) = \frac{\tau^2}{n}\,\Lambda(Q_{k-1}) + o(n^{-1}).$
Next, we deal with the normalization step. Again, we use the Taylor expansion for the term $\sqrt{n}/\lVert\boldsymbol{x}_{k-1} + \boldsymbol{\delta}_k\rVert$ up to the first order, which yields
$x_{k,i} = x_{k-1,i} + \delta_{k,i} - \frac{x_{k-1,i}}{n}\Big(\boldsymbol{x}_{k-1}^\top\boldsymbol{\delta}_k + \tfrac{1}{2}\,\lVert\boldsymbol{\delta}_k\rVert^2\Big) + \varepsilon_{k,i},$
where $\varepsilon_{k,i}$ includes all higher-order terms. Note that $\mathbb{E}_k[\boldsymbol{x}_{k-1}^\top\boldsymbol{\delta}_k] = \tau\big[Q_{k-1}\,G(Q_{k-1}) + \langle f'(u)\rangle - \frac{1}{n}\sum_j x_{k-1,j}\,\phi(x_{k-1,j})\big] + o(1)$ and $\mathbb{E}_k\lVert\boldsymbol{\delta}_k\rVert^2 = \tau^2\,\Lambda(Q_{k-1}) + o(1)$, and we have
$\mathbb{E}_k\big[x_{k,i} - x_{k-1,i}\big] = \frac{1}{n}\,\Gamma_{k/n}(x_{k-1,i}, \xi_i) + o(n^{-1}),$
with $\Gamma$ as given in (7). Finally, the normalization step does not change the variance term to leading order, and thus
$\mathbb{E}_k\big[(x_{k,i} - x_{k-1,i})^2\big] = \frac{\tau^2}{n}\,\Lambda(Q_{k-1}) + o(n^{-1}).$
The above computations of the conditional mean and variance connect the dynamics (2) to (15). In fact, both (2) and (15) have the same limiting empirical measure, described by (6).
A rigorous proof of our asymptotic result is built on the weak convergence approach for measure-valued processes. Details will be presented in an upcoming paper. Here we only provide a sketch of the general proof strategy. First, we prove the tightness of the measure-valued stochastic process $\{\mu_t^n\}_t$ in the space of càdlàg processes taking values in the space of probability measures on $\mathbb{R}^2$. This then implies that any sequence of such measure-valued processes (indexed by $n$) must have a weakly converging subsequence. Second, we prove that any converging (sub)sequence must converge weakly to a solution of the weak form of the PDE (6). Third, we prove the uniqueness of the solution of the weak form of the PDE (6) by constructing a contraction mapping. Combining these three statements, we can then conclude that the sequence must converge to this unique solution.
Acknowledgments
This work is supported by the US Army Research Office under contract W911NF-16-1-0265 and by the US National Science Foundation under grants CCF-1319140 and CCF-1718698.
References
 (1) Simon Haykin and Bernard Widrow. Leastmeansquare adaptive filters, volume 31. John Wiley & Sons, 2003.
 (2) Erkki Oja and Juha Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. J. Math. Anal. Appl., 106(1):69–84, 1985.
 (3) Michael Biehl and E Schlösser. The dynamics of online principal component analysis. J. Phys. A. Math. Gen., 31(5):97–103, 1998.
 (4) Aapo Hyvärinen and Erkki Oja. Oneunit learning rules for independent component analysis. In Adv. Neural Inf. Process. Syst., pages 480–486, 1997.
 (5) M Biehl. An Exactly Solvable Model of Unsupervised Learning. Europhys. Lett., 25(5):391–396, 1994.
 (6) Ohad Shamir and Tong Zhang. Stochastic gradient descent for nonsmooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.
 (7) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 (8) Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to Escape Saddle Points Efficiently. arXiv:1703.00887, 2017.
 (9) Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. Memory Limited, Streaming PCA. In Adv. Neural Inf. Process. Syst., 2013.
 (10) Chris Junchi Li, Zhaoran Wang, and Han Liu. Online ICA: Understanding Global Dynamics of Nonconvex Optimization via Diffusion Processes. In Adv. Neural Inf. Process. Syst., pages 4961–4969, 2016.
 (11) Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping From Saddle Points? Online Stochastic Gradient for Tensor Decomposition. In JMLR Work. Conf. Proc., volume 40, pages 1–46, 2015.
 (12) Chuang Wang and Yue M. Lu. Online Learning for Sparse PCA in High Dimensions: Exact Dynamics and Phase Transitions. In Inf. Theory Work. (ITW), 2016 IEEE, pages 186–190, 2016.
 (13) David Saad and Sara A Solla. Exact Solution for OnLine Learning in Multilayer Neural Networks. Phys. Rev. Lett., 74(21):4337–4340, 1995.
 (14) David Saad and Magnus Rattray. Globally optimal parameters for online learning in multilayer neural networks. Phys. Rev. Lett., 79(13):2578, 1997.
 (15) Magnus Rattray and Gleb Basalyga. Scaling laws and local minima in Hebbian ICA. In Adv. Neural Inf. Process. Syst., pages 495–502, 2002.
 (16) G. Basalyga and M. Rattray. Statistical dynamics of on-line independent component analysis. J. Mach. Learn. Res., 4:1393–1410, 2004.
 (17) Henry P McKean. Propagation of chaos for a class of nonlinear parabolic equations. Stoch. Differ. Equations (Lecture Ser. Differ. Equations, Sess. 7, Cathol. Univ., 1967), pages 41–57, 1967.
 (18) Henry P McKean. A class of Markov processes associated with nonlinear parabolic equations. Proc. Natl. Acad. Sci., 56(6):1907–1911, 1966.
 (19) Sylvie Méléard and Sylvie RoellyCoppoletta. A propagation of chaos result for a system of particles with moderate interaction. Stoch. Process. their Appl., 26:317–332, 1987.
 (20) Alain-Sol Sznitman. Topics in propagation of chaos. In Paul-Louis Hennequin, editor, École d'Été de Probabilités de Saint-Flour XIX–1989, pages 165–251. Springer Berlin Heidelberg, 1991.
 (21) G. O. Roberts, A. Gelman, and W. R. Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab., 7(1):110–120, 1997.
 (22) Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating minimization. In Adv. Neural Inf. Process. Syst., pages 2796–2804, 2013.
 (23) Emmanuel J. Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Trans. Inf. Theory, 61(4):1985–2007, 2015.
 (24) Huishuai Zhang, Yuejie Chi, and Yingbin Liang. Provable nonconvex phase retrieval with outliers: Median truncated wirtinger flow. arXiv:1603.03805, 2016.
 (25) Xiaodong Li, Shuyang Ling, Thomas Strohmer, and Ke Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. arXiv:1606.04933, 2016.
 (26) Magnus Rattray. Stochastic trapping in a solvable model of online independent component analysis. Neural Comput., 14(2):17, 2002.
 (27) L. F. Cugliandolo and J. Kurchan. On the out-of-equilibrium relaxation of the Sherrington-Kirkpatrick model. J. Phys. A. Math. Gen., 27(17):5749–5772, 1994.
 (28) Jean Barbier, Mohamad Dia, Nicolas Macris, Florent Krzakala, Thibault Lesieur, and Lenka Zdeborová. Mutual information for symmetric rankone matrix estimation: A proof of the replica formula. In Adv. Neural Inf. Process. Syst., pages 424–432, 2016.
 (29) Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inf. Theory, 57(2):764–785, 2011.
 (30) David Donoho and Andrea Montanari. High dimensional robust M-estimation: asymptotic variance via approximate message passing. Probab. Theory Relat. Fields, 166(3–4):935–969, 2016.