The Scaling Limit of High-Dimensional Online Independent Component Analysis

The Scaling Limit of High-Dimensional Online Independent Component Analysis

Chuang Wang and Yue M. Lu
John A. Paulson School of Engineering and Applied Sciences
Harvard University
33 Oxford Street, Cambridge, MA 02138, USA
{chuangwang,yuelu}@seas.harvard.edu
Abstract

We analyze the dynamics of an online algorithm for independent component analysis in the high-dimensional scaling limit. As the ambient dimension tends to infinity, and with proper time scaling, we show that the time-varying joint empirical measure of the target feature vector and the estimates provided by the algorithm will converge weakly to a deterministic measured-valued process that can be characterized as the unique solution of a nonlinear PDE. Numerical solutions of this PDE, which involves two spatial variables and one time variable, can be efficiently obtained. These solutions provide detailed information about the performance of the ICA algorithm, as many practical performance metrics are functionals of the joint empirical measures. Numerical simulations show that our asymptotic analysis is accurate even for moderate dimensions. In addition to providing a tool for understanding the performance of the algorithm, our PDE analysis also provides useful insight. In particular, in the high-dimensional limit, the original coupled dynamics associated with the algorithm will be asymptotically “decoupled”, with each coordinate independently solving a 1-D effective minimization problem via stochastic gradient descent. Exploiting this insight to design new algorithms for achieving optimal trade-offs between computational and statistical efficiency may prove an interesting line of future research.

 

The Scaling Limit of High-Dimensional Online Independent Component Analysis


  Chuang Wang and Yue M. Lu John A. Paulson School of Engineering and Applied Sciences Harvard University 33 Oxford Street, Cambridge, MA 02138, USA {chuangwang,yuelu}@seas.harvard.edu

\@float

noticebox[b]31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\end@float

1 Introduction

Online learning methods based on stochastic gradient descent are widely used in many learning and signal processing problems. Examples includes the classical least mean squares (LMS) algorithm haykin2003least () in adaptive filtering, principal component analysis Oja1985 (); Biehl1998 (), independent component analysis (ICA) hyvarinen1997one (), and the training of shallow or deep artificial neural networks Biehl1994 (); shamir2013stochastic (); lecun2015deep (). Analyzing the convergence rate of stochastic gradient descent has already been the subject of a vast literature (see, e.g., jin2017escape (); Mitliagkas2013 (); Li2016b (); Ge2015 ().) Unlike existing work that analyze the behaviors of the algorithms in finite dimensions, we present in this paper a framework for studying the exact dynamics of stochastic gradient algorithms in the high-dimensional limit, using online ICA as a concrete example. Instead of minimizing a generic function as considered in the optimization literature, the stochastic algorithm we analyze here is solving an estimation problem. The extra assumptions on the ground truth (e.g., the feature vector) and the generative models for the observations allow us to obtain the exact asymptotic dynamics of the algorithms.

As the main result of this work, we show that, as the ambient dimension and with proper time-scaling, the time-varying joint empirical measure of the true underlying independent component and its estimate converges weakly to the unique solution of a nonlinear partial differential equation (PDE) [see (6).] Since many performance metrics, such as the correlation between and and the support recover rate, are functionals of the joint empirical measure, knowledge about the asymptotics of the latter allows us to easily compute the asymptotic limits of various performance metrics of the algorithm.

This work is an extension of a recent analysis on the dynamics of online sparse PCA Wang2016 () to more general settings. The idea of studying the scaling limits of online learning algorithm first appeared in a series of work that mostly came from the statistical physics communities Biehl1998 (); Biehl1994 (); Saad1995 (); Saad1997 (); Rattray2002a (); Basalyga2004 () in the 1990s. Similar to our setting, those early papers studied the dynamics of various online learning algorithms in high dimensions. In particular, they show that the mean-squared error (MSE) of the estimation, together with a few other “order parameters”, can be characterized as the solution of a deterministic system of coupled ordinary differential equations (ODEs) in the large system limit. One limitation of such ODE-level analysis is that it cannot provide information about the distributions of the estimates. The latter are often needed when one wants to understand more general performance metrics beyond the MSE. Another limitation is that the ODE analysis cannot handle cases where the algorithms have non-quadratic regularization terms (e.g., the incorporation of norms to promote sparsity.) In this paper, we show that both limitations can be eliminated by using our PDE-level analysis, which tracks the asymptotic evolution of the probability distributions of the estimates given by the algorithm. In a recent paper Li2016b (), the dynamics of an ICA algorithm was studied via a diffusion approximation. As an important distinction, the analysis in Li2016b () keeps the ambient dimension fixed and studies the scaling limit of the algorithm as the step size tends to . The resulting PDEs involve spatial variables. In contrast, our analysis studies the limit as the dimension , with a constant step size. The resulting PDEs only involve spatial variables. This low-dimensional characterization makes our limiting results more practical to use, especially when the dimension is large.

The basic idea underlying our analysis can trace its root to the early work of McKean mckean1967propagation (); mckean1966class (), who studied the statistical mechanics of Markovian-type mean-field interactive particles. The mathematical foundation of this line of research has been further established in the 1980s (see, e.g., meleard1987propagation (); Sznitman1991 ()). This theoretical tool has been used in the analysis of high-dimensional MCMC algorithms Roberts1997a (). In our work, we study algorithms through the lens of high-dimensional stochastic processes. Interestingly, the analysis does not explicitly depend on whether the underlying optimization problem is convex or nonconvex. This feature makes the presented analysis techniques a potentially very useful tool in understanding the effectiveness of using low-complexity iterative algorithms for solving high-dimensional nonconvex estimation problems, a line of research that has recently attracted much attention (see, e.g., Netrapalli:2013qv (); Candes:2015fv (); zhang2016provable (); li2016rapid ().)

The rest of the paper is organized as follows. We first describe in Section 2 the observation model and the online ICA algorithm studied in this work. The main convergence results are given in Section 3, where we show that the time-varying joint empirical measure of the target independent component and its estimates given by the algorithm can be characterized, in the high-dimensional limit, by the solution of a deterministic PDE. Due to space constraint, we only provide in the appendix a formal derivation leading to the PDE, and leave the rigorous proof of the convergence to a follow-up paper. Finally, in Section 4 we present some insight obtained from our asymptotic analysis. In particular, in the high-dimensional limit, the original coupled dynamics associated with the algorithm will be asymptotically “decoupled”, with each coordinate independently solving a 1-D effective minimization problem via stochastic gradient descent.

Notations and Conventions:

Throughout this paper, we use boldfaced lowercase letters, such as and , to represent -dimensional vectors. The subscript in denotes the discrete-time iteration step. The th component of the vectors and are written as and , respectively.

2 Data model and online ICA

We consider a generative model where a stream of sample vectors , are generated according to

(1)

where is a unique feature vector we want to recover. (For simplicity, we consider the case of recovering a single feature vector, but our analysis technique can be generalized to study cases involving a finite number of feature vectors.) Here is an i.i.d. random variable drawn from an unknown non-Gaussian distribution with zero mean and unit variance. And models background noise. We use the normalization so that in the large limit, all elements of the vector are quantities. The observation model (1) is equivalent to the standard sphered data model , where is an orthonormal matrix with the first column being and is an i.i.d. -dimensional standard Gaussian random vector.

To establish the large limit, we shall assume that the empirical measure of defined by converges weakly to a deterministic measure with finite moments. Note that this assumption can be satisfied in a stochastic setting, where each element of is an i.i.d. random variable drawn from , or in a deterministic setting [e.g., , in which case .]

We use an online learning algorithm to extract the non-Gaussian component from the data stream . Let be the estimate of at step . Starting from an initial estimate , the algorithm update by

(2)

where is a given twice differentiable function and is an element-wise nonlinear mapping introduced to enforce prior information about , e.g., sparsity. The scaling factor in the above equations makes sure that each component of the estimate is of size in the large limit.

The above online learning scheme can be viewed as a projected stochastic gradient algorithm for solving an optimization problem

(3)

where and

(4)

is a regularization function. In (2), we update using an instantaneous noisy estimation , in place of the true gradient , once a new sample is received.

In practice, one can use or to extract symmetric non-Gaussian signals (for which and ) and use to extract asymmetric non-Gaussian signals. The algorithm in (2) with can also be regarded as implementing a low-rank tensor decomposition related to the empirical kurtosis tensor of Li2016b (); Ge2015 ().

For the nonlinear mapping , the choice of for some corresponds to using an norm in the regularization term . If the feature vector is known to be sparse, we can set , which is equivalent to adding an -regularization term.

3 Main convergence result

We provide an exact characterization of the dynamics of the online learning algorithm (2) when the ambient dimension goes to infinity. First, we define the joint empirical measure of the feature vector and its estimate as

(5)

with defined by . Here we rescale (i.e., “accelerate”) the time by a factor of .

The joint empirical measure defined above carries a lot of information about the performance of the algorithm. For example, as both and have the same norm by definition, the normalized correlation between and defined by

can be computed as , i.e., the expectation of taken with respect to the empirical measure. More generally, any separable performance metric with some function can be expressed as an expectation with respect to the empirical measure , i.e., .

Directly computing via the expectation is challenging, as is a random probability measure. We bypass this difficulty by investigating the limiting behavior of the joint empirical measure defined in (5). Our main contribution is to show that, as , the sequence of random probability measures converges weakly to a deterministic measure . Note that the limiting value of can then be computed from the limiting measure via the identity .

Let be the density function of the limiting measure at time . We show that it is characterized as the unique solution of the following nonlinear PDE:

(6)

with

(7)
(8)

where the two functions and are defined as

(9)
(10)

where

(11)

In the above equations, and denote two independent random variables, with and , the non-Gaussian distribution of introduced in (2); the notation denotes the expectation over and ; and and are the two functions used in the online learning algorithm (2).

When (and therefore ), we can derive a simple ODE for from (6) and (7):

Example 1

As a concrete example, we consider the case when is drawn from a symmetric non-Gaussian distribution. Due to symmetry, . Write and . We use in (2) to detect the feature vector . Substituting this specific into (9) and (11), we obtain

(12)
(13)

and can be computed by substituting (12) and (13) into (10). Moreover, for the case , we derive a simple ODE for as

(14)

Numerical verifications of the ODE results are shown in Figure 1(a). In our experiment, the ambient dimension is set to and we plot the averaged results as well as error bars (corresponding to one standard deviation) over 10 independent trials. Two different initial values of are used. In both cases, the asymptotic theoretical predictions match the numerical results very well.

(a)
(b)
Figure 1: (a) Comparison between the analytical prediction given by the ODE in (14) with numerical simulations of the online ICA algorithm. We consider two different initial values for the algorithm. The top one, which starts from a better initial guess, converges to an informative estimation, whereas the bottom one, with a worse initial guess, converges to a non-informative solution. (b) The stability of the ODE in (14). We draw for different value of from top to bottom.
(a)
(b)
Figure 2: (a) A demonstration of the accuracy of our PDE analysis. See the discussions in Example 2 for details. (b) Effective 1-D cost functions.

The ODE in (14) can be solved analytically. Next we briefly discuss its stability. The right-hand side of (14) is plotted in Figure 1(b) as a function of . It is clear that the ODE (14) always admits a solution , which corresponding to a trivial, non-informative solution. Moreover, this trivial solution is always a stable fixed point. When the stepsize for some constant , is also the unique stable fixed point. When however, two additional solutions of the ODE emerge. One is a stable fixed point denoted by and the other is an unstable fixed point denoted by , with . Thus, in order to reach an informative solution, one must initialize the algorithm with . This insight agrees with a previous stability analysis done in Rattray2002 (), where the authors investigated the dynamics near via a small expansion.

Example 2

In this experiment, we verify the accuracy of the asymptotic predictions given by the PDE (6). The settings are similar to those in Example 1. In addition, we assume that the feature vector is sparse, consisting of nonzero elements, each of which is equal to . Figure 2 shows the asymptotic conditional density for and at two different times. These theoretical predictions are obtained by solving the PDE (6) numerically. Also shown in the figure are the empirical conditional densities associated with one realization of the ICA algorithm. Again, we observe that the theoretical predictions and numerical results have excellent agreement.

To demonstrate the usefulness of the PDE analysis in providing detailed information about the performance of the algorithm, we show in Figure 3 the performance of sparse support recovery using a simple hard-thresholding scheme on the estimates provided by the algorithm. By changing the threshold values, one can have trade-offs between the true positive and false positive rates. As we can see from the figure, this precise trade-off can be accurately predicted by our PDE analysis.

4 Insights given by the PDE analysis

In this section, we present some insights that can be gained from our high-dimensional analysis. To simplify the PDE in (6), we can assume that the two functions and in (7) and (8) are given to us in an oracle way. Under this assumption, the PDE (6) describes the limiting empirical measure of the following stochastic process

(15)

where is a sequence of independent standard Gaussian random variables. Unlike the original online learning update equation (2) where different coordinates of are coupled, the above process is uncoupled. Each component for evolves independently when conditioned on and . The continuous-time limit of (15) is described by a stochastic differential equation (SDE)

where is the standard Brownian motion.

Figure 3: Trade-offs between the true positive and false positive rates in sparse support recovery. In our experiment, , and the sparsity level is set to . The theoretical results obtained by our PDE analysis can accurately predict the actual performance at any run-time of the algorithm.

We next have a closer look at the equation (15). Given a scalar , and , we can define a time-varying 1-D regularized quadratic optimization problem with the effective potential

(16)

where , and is the regularization term defined in (4). Then, the stochastic process (15) can be viewed as a stochastic gradient descent for solving this 1-D problem with a step-size equal to . One can verify that the exact gradient of (16) is . The third term in (15) adds stochastic noise to the true gradient. Interestingly, although the original optimization problem (3) is non-convex, its 1-D effective optimization problem is always convex for convex regularizers (e.g., .) This provides an intuitive explanation for the practical success of online ICA.

To visualize this 1-D effective optimization problem, we plot in Figure 2(b) the effective potential at and , respectively. From Figure (2), we can see that the norm always introduces a bias in the estimation for all non-zero , as the minimum point in the effective 1-D cost function is always shifted towards the origin. It is hopeful that the insights gained from the 1-D effective optimization problem can guide the design of a better regularization function to achieve smaller estimation errors without sacrificing the convergence speed. This may prove an interesting line of future work.

This uncoupling phenomenon is a typical consequence of mean-field dynamics, e.g., the Sherrington-Kirkpatrick model Cugliandolo1994 () in statistical physics. Similar phenomena are observed or proved in other high dimensional algorithms especially those related to approximate message passing (AMP) Barbier2016 (); Bayati2011 (); Donoho2016 (). However, for these algorithms using batch updating rules with the Onsager reaction term, the limiting densities of iterands are Gaussian. Thus the evolution of such densities can be characterized by tracking a few scalar parameters in discrete time. For our case, the limiting densities are typically non-Gaussian and they cannot be parametrized by finitely many scalars. Thus the PDE limit (6) is required.

Appendix: A Formal derivation of the PDE

In this appendix, we present a formal derivation of the PDE (6). We first note that with forms an exchangeable Markov chain on driven by the random variable and the Gaussian random vector . The drift coefficient and the diffusion coefficient in the PDE (6) are determined, respectively, by the conditional mean and variance of the increment , conditioned upon the previous state vector and .

Let the increment of the gradient-descent step in the learning rule (2) be

(17)

where is the th component of the output . Let denote the conditional expectation with respect to and given and .

We first compute and . From (1) and (17) we have

where and . We use the Taylor expansion of around up to the first order and get

where includes all higher order terms. As , the random variable converges to a deterministic quantity . Moreover, and are both zero-mean Gaussian with the covariance matrix . We thus have

and

where in the last line, we use the Taylor expansion again to expand around and the bracket denotes the average over two independent random variables and . Thus, we have

where the function is defined in (11).

To compute the (conditional) variance, we have

Next, we deal with the normalization step. Again, we use the Taylor expansion for the term up to the first order, which yields

where includes all higher order terms. Note that , and , we have

Finally, the normalization step does not change the variance term, and thus

The above computation of and connects the dynamics (2) to (15). In fact, both (2) and (15) have the same limiting empirical measure described by (6).

A rigorous proof of our asymptotic result is built on the weak convergence approach for measure-valued processes. Details will be presented in an upcoming paper. Here we only provide a sketch of the general proof strategy: First, we prove the tightness of the measure-valued stochastic process on , where denotes the space of càdlàg processes taking values from the space of probability measures. This then implies that any sequence of the measure-valued process (indexed by ) must have a weakly converging subsequence. Second, we prove any converging (sub)sequence must converge weakly to a solution of the weak form of the PDE (6). Third, we prove the uniqueness of the solution of the weak form of the PDE (6) by constructing a contraction mapping. Combining these three statements, we can then conclude that any sequence must converge to this unique solution.

Acknowledgments

This work is supported by US Army Research Office under contract W911NF-16-1- 0265 and by the US National Science Foundation under grants CCF-1319140 and CCF-1718698.

References

  • (1) Simon Haykin and Bernard Widrow. Least-mean-square adaptive filters, volume 31. John Wiley & Sons, 2003.
  • (2) Erkki Oja and Juha Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. J. Math. Anal. Appl., 106(1):69–84, 1985.
  • (3) Michael Biehl and E Schlösser. The dynamics of on-line principal component analysis. J. Phys. A. Math. Gen., 31(5):97–103, 1998.
  • (4) Aapo Hyvärinen and Erkki Oja. One-unit learning rules for independent component analysis. In Adv. Neural Inf. Process. Syst., pages 480–486, 1997.
  • (5) M Biehl. An Exactly Solvable Model of Unsupervised Learning. Europhys. Lett., 25(5):391–396, 1994.
  • (6) Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.
  • (7) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • (8) Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to Escape Saddle Points Efficiently. arXiv:1703.00887, 2017.
  • (9) Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. Memory Limited , Streaming PCA. In Adv. Neural Inf. Process. Syst., 2013.
  • (10) Chris Junchi Li, Zhaoran Wang, and Han Liu. Online ICA: Understanding Global Dynamics of Nonconvex Optimization via Diffusion Processes. In Adv. Neural Inf. Process. Syst., pages 4961–4969, 2016.
  • (11) Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping From Saddle Points ? Online Stochastic Gradient for Tensor Decomposition. In JMLR Work. Conf. Proc., volume 40, pages 1–46, 2015.
  • (12) Chuang Wang and Yue M. Lu. Online Learning for Sparse PCA in High Dimensions: Exact Dynamics and Phase Transitions. In Inf. Theory Work. (ITW), 2016 IEEE, pages 186–190, 2016.
  • (13) David Saad and Sara A Solla. Exact Solution for On-Line Learning in Multilayer Neural Networks. Phys. Rev. Lett., 74(21):4337–4340, 1995.
  • (14) David Saad and Magnus Rattray. Globally optimal parameters for online learning in multilayer neural networks. Phys. Rev. Lett., 79(13):2578, 1997.
  • (15) Magnus Rattray and Gleb Basalyga. Scaling laws and local minima in Hebbian ICA. In Adv. Neural Inf. Process. Syst., pages 495–502, 2002.
  • (16) G. Basalyga and M. Rattray. Statistical dynamics of on-line independent component analysis. J. Mach. Learn. Res., 4(7-8):1393–1410, 2004.
  • (17) Henry P McKean. Propagation of chaos for a class of non-linear parabolic equations. Stoch. Differ. Equations (Lecture Ser. Differ. Equations, Sess. 7, Cathol. Univ., 1967), pages 41–57, 1967.
  • (18) Henry P McKean. A class of Markov processes associated with nonlinear parabolic equations. Proc. Natl. Acad. Sci., 56(6):1907–1911, 1966.
  • (19) Sylvie Méléard and Sylvie Roelly-Coppoletta. A propagation of chaos result for a system of particles with moderate interaction. Stoch. Process. their Appl., 26:317–332, 1987.
  • (20) Alain-Sol Sznitman. Topics in progagation of chaos. In Paul-Louis Hennequin, editor, Ec. d’{’e}t{’e} Probab. Saint-Flour XIX–1989, pages 165–251. Springer Berlin Heidelberg, 1991.
  • (21) G. O. Roberts, A. Gelman, and W. R. Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab., 7(1):110–120, 1997.
  • (22) Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi. Phase retrieval using alternating minimization. In Adv. Neural Inf. Process. Syst., pages 2796–2804, 2013.
  • (23) Emmanuel J. Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Trans. Inf. Theory, 61(4):1985–2007, 2015.
  • (24) Huishuai Zhang, Yuejie Chi, and Yingbin Liang. Provable non-convex phase retrieval with outliers: Median truncated wirtinger flow. arXiv:1603.03805, 2016.
  • (25) Xiaodong Li, Shuyang Ling, Thomas Strohmer, and Ke Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. arXiv:1606.04933, 2016.
  • (26) Magnus Rattray. Stochastic trapping in a solvable model of on-line independent component analysis. Neural Comput., 14(2):17, 2002.
  • (27) L. F. Cugliandolo and J. Kurchan. On the out-of-equilibrium relaxation of the Sherrington-Kirkpatrick model. J. Phys. A. Math. Gen., 27(17):5749–5772, 1994.
  • (28) Jean Barbier, Mohamad Dia, Nicolas Macris, Florent Krzakala, Thibault Lesieur, and Lenka Zdeborová. Mutual information for symmetric rank-one matrix estimation: A proof of the replica formula. In Adv. Neural Inf. Process. Syst., pages 424–432, 2016.
  • (29) Mohsen Bayati and Andrea Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inf. Theory, 57(2):764–785, 2011.
  • (30) David Donoho and Andrea Montanari. High dimensional robust M-estimation: asymptotic variance via approximate message passing. Probab. Theory Relat. Fields, 166(3-4):935–969, 2016.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
44775
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description