
Complete Dictionary Recovery over the Sphere

Ju Sun, Qing Qu, and John Wright
{js4038, qq2105, jw2966}@columbia.edu
Department of Electrical Engineering, Columbia University, New York, USA
April 25, 2015  Revised: July 6, 2019
Abstract

We consider the problem of recovering a complete (i.e., square and invertible) matrix $\mathbf A_0$, from $\mathbf Y \in \mathbb R^{n \times p}$ with $\mathbf Y = \mathbf A_0 \mathbf X_0$, provided $\mathbf X_0$ is sufficiently sparse. This recovery problem is central to the theoretical understanding of dictionary learning, which seeks a sparse representation for a collection of input signals and finds numerous applications in modern signal processing and machine learning. We give the first efficient algorithm that provably recovers $\mathbf A_0$ when $\mathbf X_0$ has $O(n)$ nonzeros per column, under a suitable probability model for $\mathbf X_0$. In contrast, prior results based on efficient algorithms provide recovery guarantees only when $\mathbf X_0$ has $O(n^{1-\delta})$ nonzeros per column for any constant $\delta \in (0, 1)$.

Our algorithmic pipeline centers around solving a certain nonconvex optimization problem with a spherical constraint, and hence is naturally phrased in the language of manifold optimization. To show that this apparently hard problem is tractable, we first provide a geometric characterization of the high-dimensional objective landscape, which shows that with high probability there are no "spurious" local minima. This particular geometric structure allows us to design a Riemannian trust region algorithm over the sphere that provably converges to a local minimizer from an arbitrary initialization, despite the presence of saddle points. The geometric approach we develop here may also shed light on other problems arising from nonconvex recovery of structured signals.

Keywords. Dictionary learning, Nonconvex optimization, Spherical constraint, Trust region method, Escaping saddle point, Manifold optimization, Function landscape, Second-order geometry, Inverse problem, Structured signal, Nonlinear approximation

Mathematics Subject Classification. 68P30, 58C05, 94A12, 94A08, 68T05, 90C26, 90C48, 90C55

Acknowledgement. We thank Dr. Boaz Barak for pointing out an inaccurate comment made on overcomplete dictionary learning using SOS. We thank Cun Mu and Henry Kuo of Columbia University for discussions related to this project. JS thanks the Wei Family Private Foundation for their generous support. This work was partially supported by grants ONR N00014-13-1-0492, NSF 1343282, and funding from the Moore and Sloan Foundations.

Note. This technical report has subsequently been divided into two papers [SQWa] and [SQWb]. All future updates will be made only to the separate papers.

Contents

1 Introduction

Given $p$ signal samples from $\mathbb R^n$, i.e., $\mathbf Y \doteq [\mathbf y_1, \dots, \mathbf y_p]$, is it possible to construct a dictionary $\mathbf A \doteq [\mathbf a_1, \dots, \mathbf a_m]$ with $m$ much smaller than $p$, such that $\mathbf Y \approx \mathbf A \mathbf X$ and the coefficient matrix $\mathbf X$ has as few nonzeros as possible? In other words, this model dictionary learning (DL) problem seeks a concise representation for a collection of input signals. Concise signal representations play a central role in compression, and also prove useful for many other important tasks, such as signal acquisition, denoising, and classification.

Traditionally, concise signal representations have relied heavily on explicit analytic bases constructed in nonlinear approximation and harmonic analysis. This constructive approach has proved highly successful; the numerous theoretical advances in these fields (see, e.g., [DeV98, Tem03, DeV09, Can02, MP10a] for summaries of relevant results) provide ever more powerful representations, ranging from the classic Fourier bases to modern multidimensional, multidirectional, multiresolution bases, including wavelets, curvelets, ridgelets, and so on. However, two challenges confront practitioners in adapting these results to new domains: which function class best describes the signals at hand, and, consequently, which representation is most appropriate. These challenges are coupled, as function classes with known "good" analytic bases are rare. [Footnote 1: As Donoho et al. [DVDD98] put it, "…in effect, uncovering the optimal codebook structure of naturally occurring data involves more challenging empirical questions than any that have ever been solved in empirical work in the mathematical sciences."]

Around 1996, neuroscientists Olshausen and Field discovered that sparse coding, the principle of encoding a signal with few atoms from a learned dictionary, reproduces important properties of the receptive fields of the simple cells that perform early visual processing [OF96, OF97]. The discovery has spurred a flurry of algorithmic developments and successful applications for DL in the past two decades, spanning classical image processing, visual recognition, compressive signal acquisition, and also recent deep architectures for signal classification (see, e.g., [Ela10, MBP14] for reviews of this development).

The learning approach is particularly relevant to modern signal processing and machine learning, which deal with data of huge volume and great variety (e.g., images, audio, graphs, text, genome sequences, time series, etc.). The proliferation of problems and data seems to preclude analytically deriving optimal representations for each new class of data in a timely manner. On the other hand, as datasets grow, learning dictionaries directly from data looks increasingly attractive and promising. When armed with sufficiently many data samples of one signal class, by solving the model DL problem, one would expect to obtain a dictionary that allows sparse representation for the whole class. This hope has been borne out in a number of successful examples [Ela10, MBP14] and theories [MP10b, VMB11, MG13, GJB13].

1.1 Theoretical and Algorithmic Challenges

In contrast to the above empirical successes, the theoretical study of dictionary learning is still developing. For applications in which dictionary learning is to be applied in a "hands-free" manner, it is desirable to have efficient algorithms which are guaranteed to perform correctly when the input data admit a sparse model. There have been several important recent results in this direction, which we will review in Section 1.5, after sketching our main results. Nevertheless, obtaining algorithms that provably succeed under broad and realistic conditions remains an important research challenge.

To understand where the difficulties arise, we can consider a model formulation, in which we attempt to obtain the dictionary $\mathbf A$ and coefficients $\mathbf X$ which best trade off sparsity and fidelity to the observed data:

\[ \min_{\mathbf A, \mathbf X} \ \frac{1}{2}\,\|\mathbf Y - \mathbf A \mathbf X\|_F^2 + \lambda\,\|\mathbf X\|_1, \quad \text{s.t. } \mathbf A \in \mathcal A. \tag{1.1} \]

Here $\|\mathbf X\|_1$ promotes sparsity of the coefficients, $\lambda$ trades off the level of coefficient sparsity and quality of approximation, and the constraint set $\mathcal A$ imposes desired structures on the dictionary.

This formulation is nonconvex: the admissible set $\mathcal A$ is typically nonconvex (e.g., the orthogonal group, or matrices with normalized columns) [Footnote 2: For example, in nonlinear approximation and harmonic analysis, orthonormal bases or (tight) frames are preferred; to fix the scale ambiguity discussed in the text, a common practice is to require $\mathbf A$ to be column-normalized. There is no obvious reason to believe that convexifying these constraint sets would leave the optima unchanged. For example, the convex hull of the orthogonal group is the operator norm ball $\{\mathbf A : \|\mathbf A\| \le 1\}$. If there are no effective symmetry-breaking constraints, any convex objective function tends to have minimizers inside the ball, which obviously will not be orthogonal matrices. Other ideas such as lifting may not play well together with the objective function, nor yield tight relaxations (see, e.g., [BKS13, BR14]).], while the most daunting nonconvexity comes from the bilinear mapping $(\mathbf A, \mathbf X) \mapsto \mathbf A \mathbf X$: $(\mathbf A, \mathbf X)$ and $(\mathbf A \mathbf\Pi \mathbf\Sigma, \mathbf\Sigma^{-1} \mathbf\Pi^* \mathbf X)$ result in the same objective value for the conceptual formulation (1.1), where $\mathbf\Pi$ is any permutation matrix, $\mathbf\Sigma$ is any diagonal matrix with diagonal entries in $\{\pm 1\}$, and $^*$ denotes matrix transpose. Thus, we should expect the problem to have combinatorially many global minima. Because there are multiple isolated global minima, the problem does not appear to be amenable to convex relaxation (see similar discussions in, e.g., [GS10] and [GW11]). [Footnote 3: Semidefinite programming (SDP) lifting may be one useful general strategy to convexify bilinear inverse problems, see, e.g., [ARR14, CM14]. However, for problems with general nonlinear constraints, it is unclear whether lifting always yields tight relaxations; consider, e.g., [BKS13, BR14] again.] This contrasts sharply with problems in sparse recovery and compressed sensing, in which simple convex relaxations are often provably effective [DT09, OH10, CLMW11, DGM13, MT14, MHWG13, CRPW12, CSV13, ALMT14, Can14]. Is there any hope to obtain global solutions to the DL problem?

1.2 An Intriguing Numerical Experiment with Real Images

Figure 1: Alternating direction method for (1.2) on uncompressed real images seems to always produce the same solution! Top: Each image is in resolution and encoded in the uncompressed pgm format (uncompressed images to prevent possible bias towards standard bases used for compression, such as DCT or wavelet bases). Each image is evenly divided into non-overlapping image patches ( in total), and these patches are all vectorized and then stacked as columns of the data matrix . Bottom: Given each , we solve (1.2) times with independent and randomized (uniform over the orthogonal group) initialization . The plots show the values of across the independent repetitions. They are virtually the same and the relative differences are less than !

We provide empirical evidence in support of a positive answer to the above question. Specifically, we learn orthogonal bases (orthobases) for real image patches. Orthobases are of interest because typical hand-designed dictionaries such as discrete cosine (DCT) and wavelet bases are orthogonal, and orthobases seem competitive in performance with overcomplete dictionaries for applications such as image denoising [BCJ13]. [Footnote 4: See Section 1.3 for more detailed discussion of this point. [LGBB05] also gave motivations and algorithms for learning (unions of) orthobases as dictionaries.]

We divide a given greyscale image into non-overlapping patches, which are converted into -dimensional vectors and stacked column-wise into a data matrix . Specializing (1.1) to this setting, we obtain the optimization problem:

\[ \min_{\mathbf A, \mathbf X} \ \frac{1}{2}\,\|\mathbf Y - \mathbf A \mathbf X\|_F^2 + \lambda\,\|\mathbf X\|_1, \quad \text{s.t. } \mathbf A \in O(n), \tag{1.2} \]
where $O(n)$ denotes the orthogonal group of order $n$.

To derive a concrete algorithm for (1.2), one can deploy the alternating direction method (ADM) [Footnote 5: This method is also called alternating minimization or (block) coordinate descent; see, e.g., [BT89, Tse01] for classic results and [ABRS10, BST14] for several interesting recent developments.], i.e., alternately minimizing the objective function with respect to (w.r.t.) one variable while fixing the other. The iteration sequence takes a very simple form: for $k = 1, 2, 3, \dots$,

\[ \mathbf X_{k+1} = \mathcal S_{\lambda}\left[\mathbf A_k^* \mathbf Y\right], \qquad \mathbf A_{k+1} = \mathbf U \mathbf V^*, \ \text{ where } \ \mathbf U \mathbf D \mathbf V^* = \mathrm{SVD}\left(\mathbf Y \mathbf X_{k+1}^*\right), \]

where $\mathcal S_\lambda[\cdot]$ denotes the well-known soft-thresholding operator acting elementwise on matrices, i.e., $\mathcal S_\lambda[x] \doteq \operatorname{sign}(x)\max\{|x| - \lambda, 0\}$ for any scalar $x$.
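For concreteness, the scheme can be sketched in a few lines (our own minimal implementation, with arbitrary parameter values): the X-step is the soft-thresholding prox and the A-step is the orthogonal Procrustes solution, matching the updates above. The demo at the bottom uses synthetic Bernoulli-Gaussian data rather than image patches.

```python
import numpy as np

def soft_threshold(Z, lam):
    """Elementwise soft-thresholding: S_lam[z] = sign(z) * max(|z| - lam, 0)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - lam, 0.0)

def adm_orthobasis(Y, lam=0.1, n_iter=200, seed=0):
    """Alternating minimization for min_{A orthogonal, X} 0.5*||Y - A X||_F^2 + lam*||X||_1.

    X-step: prox of lam*||.||_1 applied to A^T Y (valid since A is orthogonal).
    A-step: orthogonal Procrustes solution A = U V^T from the SVD of Y X^T.
    """
    n = Y.shape[0]
    rng = np.random.default_rng(seed)
    A, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal initialization
    for _ in range(n_iter):
        X = soft_threshold(A.T @ Y, lam)
        U, _, Vt = np.linalg.svd(Y @ X.T)
        A = U @ Vt
    return A, X

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p, theta = 20, 2000, 0.2
    A0, _ = np.linalg.qr(rng.standard_normal((n, n)))
    X0 = rng.standard_normal((n, p)) * (rng.random((n, p)) <= theta)
    A, X = adm_orthobasis(A0 @ X0, lam=0.1)
    # Up to sign and permutation, A.T @ A0 should be close to a signed permutation matrix.
    print(np.round(np.abs(A.T @ A0), 1))
```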

Figure 1 shows what we obtained using the simple ADM algorithm, with independent and randomized initializations:

The algorithm seems to always produce the same solution, regardless of the initialization.

This observation implies that the heuristic ADM algorithm may always converge to a global minimizer! [Footnote 6: Technically, the convergence to global solutions is surprising because even convergence of ADM to critical points is atypical; see, e.g., [ABRS10, BST14] and references therein. Section 6 includes more detailed discussion of this point.] Equally surprising is that the phenomenon has been observed on real images. [Footnote 7: The same phenomenon is also observed for simulated data when the coefficient matrix obeys the Bernoulli-Gaussian model, which is defined later. The result on real images suggests that the empirical successes claimed over the past two decades may be non-incidental.] One may imagine that only random data typically have "favorable" structures; in fact, almost all existing theories for DL pertain only to random data [SWW12, AAJ13, AGM13, AAN13, ABGM14, AGMM15].

1.3 Dictionary Recovery and Our Results

In this paper, we take a step towards explaining the surprising effectiveness of nonconvex optimization heuristics for DL. We focus on the dictionary recovery (DR) setting: given a data matrix $\mathbf Y$ generated as $\mathbf Y = \mathbf A_0 \mathbf X_0$, where $\mathbf A_0$ is complete and $\mathbf X_0$ is "reasonably sparse", try to recover $\mathbf A_0$ and $\mathbf X_0$. Here recovery means to return any pair $(\mathbf A_0 \mathbf\Pi \mathbf\Sigma, \mathbf\Sigma^{-1} \mathbf\Pi^* \mathbf X_0)$, where $\mathbf\Pi$ is a permutation matrix and $\mathbf\Sigma$ is a nonsingular diagonal matrix, i.e., recovering up to sign, scale, and permutation.

To define a reasonably simple and structured problem, we make the following assumptions:

  • The target dictionary $\mathbf A_0$ is complete, i.e., square and invertible. In particular, this class includes orthogonal dictionaries. Admittedly, overcomplete dictionaries tend to be more powerful for modeling and to allow sparser representations. Nevertheless, most classic hand-designed dictionaries in common use are orthogonal. Orthobases are competitive in performance for certain tasks such as image denoising [BCJ13], and admit faster algorithms for learning and encoding. [Footnote 8: Empirically, there is no systematic evidence supporting the claim that overcomplete dictionaries are strictly necessary for good performance in all published applications (though [OF97] argues for the necessity from a neuroscience perspective). Some of the ideas and tools developed here for complete dictionaries may also apply to certain classes of structured overcomplete dictionaries, such as tight frames. See Section 6 for relevant discussion.]

  • The coefficient matrix $\mathbf X_0$ follows the Bernoulli-Gaussian (BG) model with rate $\theta$: $[\mathbf X_0]_{ij} = \Omega_{ij} V_{ij}$, with $\Omega_{ij} \sim \mathrm{Ber}(\theta)$ and $V_{ij} \sim \mathcal N(0, 1)$, where all the different random variables are mutually independent. We write compactly $\mathbf X_0 \sim_{i.i.d.} \mathrm{BG}(\theta)$. (A sampling sketch is given below.)
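For concreteness, the data model can be sampled in a few lines (a minimal sketch of our own; any invertible $\mathbf A_0$ may be substituted):

```python
import numpy as np

def bernoulli_gaussian(n, p, theta, rng):
    """Sample X0 with i.i.d. Bernoulli(theta)-Gaussian entries: X0_ij = Omega_ij * V_ij."""
    omega = rng.random((n, p)) <= theta      # Bernoulli(theta) support pattern
    v = rng.standard_normal((n, p))          # standard Gaussian values
    return omega * v

rng = np.random.default_rng(0)
n, p, theta = 50, 5000, 0.3
A0 = rng.standard_normal((n, n))             # a complete (invertible w.h.p.) dictionary
X0 = bernoulli_gaussian(n, p, theta, rng)
Y = A0 @ X0                                  # observed data
```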

We prove the following result:

Theorem 1.1 (Informal statement of our results)

For any , given with a complete dictionary and , there is a polynomial time algorithm that recovers and with high probability (at least ) whenever for a fixed polynomial , where is the condition number of and is a parameter that can be set as for a fixed positive numerical constant .

Obviously, even if $\mathbf A_0$ is known, one needs $p$ sufficiently large to make the identification problem well posed. Under our particular probabilistic model, a simple coupon-collection argument implies that $p$ must grow at least like $\theta^{-1}\log n$ to ensure that all atoms of $\mathbf A_0$ are used, and hence can be observed, with high probability (w.h.p.). Ensuring that an efficient algorithm exists may demand more. Our result implies that when $p$ is polynomial in $n$, $1/\theta$, and the condition number of $\mathbf A_0$, recovery with an efficient algorithm is possible.
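To make the coupon-collection requirement concrete, here is a back-of-envelope calculation of our own under the BG model (not a bound taken from the analysis below): the probability that some atom of $\mathbf A_0$ is never used by any column of $\mathbf X_0$ is at most
\[ n\,(1 - \theta)^{p} \ \le\ n\, e^{-\theta p}, \]
which falls below $n^{-1}$ as soon as $p \ge 2\,\theta^{-1}\log n$. So a sample size growing like $\theta^{-1}\log n$ already suffices for every atom to be observed w.h.p.; efficient recovery demands more.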

The parameter $\theta$ controls the sparsity level of $\mathbf X_0$. Intuitively, the recovery problem is easy for small $\theta$ and becomes harder for large $\theta$. [Footnote 9: Indeed, when $\theta$ is small enough that columns of $\mathbf X_0$ are predominantly $1$-sparse, one directly observes scaled versions of the atoms (i.e., columns of $\mathbf A_0$); when $\mathbf X_0$ is fully dense, corresponding to $\theta = 1$, recovery is never possible, as one can easily find another complete $\mathbf A_0'$ and fully dense $\mathbf X_0'$ such that $\mathbf Y = \mathbf A_0' \mathbf X_0'$ with $\mathbf A_0'$ not equivalent to $\mathbf A_0$.] It is perhaps surprising that an efficient algorithm can succeed up to constant $\theta$, i.e., linear sparsity in $\mathbf X_0$. Compared to the case when $\mathbf A_0$ is known, there is only at most a constant gap in the sparsity level one can deal with.

For DL, our result gives the first efficient algorithm that provably recovers complete $\mathbf A_0$ and $\mathbf X_0$ when $\mathbf X_0$ has $O(n)$ nonzeros per column under an appropriate probability model. Section 1.5 provides a detailed comparison of our result with other recent recovery results for complete and overcomplete dictionaries.

1.4 Main Ingredients and Innovations

In this section we describe three main ingredients that we use to obtain the stated result.

1.4.1 A Nonconvex Formulation

Since $\mathbf Y = \mathbf A_0 \mathbf X_0$ and $\mathbf A_0$ is complete, $\operatorname{row}(\mathbf Y) = \operatorname{row}(\mathbf X_0)$ ($\operatorname{row}(\cdot)$ denotes the row space of a matrix), and hence rows of $\mathbf X_0$ are sparse vectors in the known (linear) subspace $\operatorname{row}(\mathbf Y)$. We can use this fact to first recover the rows of $\mathbf X_0$, and subsequently recover $\mathbf A_0$ by solving a system of linear equations. In fact, for $\mathbf X_0 \sim_{i.i.d.} \mathrm{BG}(\theta)$, rows of $\mathbf X_0$ are the $n$ sparsest vectors (directions) in $\operatorname{row}(\mathbf Y)$ w.h.p. whenever $p$ is sufficiently large [SWW12]. Thus one might try to recover rows of $\mathbf X_0$ by solving

\[ \min_{\mathbf q} \ \|\mathbf q^* \mathbf Y\|_0, \quad \text{s.t. } \mathbf q \neq \mathbf 0. \tag{1.3} \]

The objective is discontinuous, and the domain is an open set. In particular, the homogeneous constraint $\mathbf q \neq \mathbf 0$ is nonconventional and tricky to deal with. Since the recovery is up to scale, one can remove the homogeneity by fixing the scale of $\mathbf q$. Known relaxations [SWW12, DH14] fix the scale by setting $\|\mathbf q^* \mathbf Y\|_\infty = 1$, where $\|\cdot\|_\infty$ is the elementwise $\ell^\infty$ norm. The optimization problem then reduces to a sequence of convex programs, which recover $(\mathbf A_0, \mathbf X_0)$ for very sparse $\mathbf X_0$, but provably break down when columns of $\mathbf X_0$ have more than $O(\sqrt n)$ nonzeros, i.e., when $\theta \ge \Omega(1/\sqrt n)$. Inspired by our previous image experiment, we work with a nonconvex alternative [Footnote 10: A similar formulation has been proposed in [ZP01] in the context of blind source separation; see also [QSW14].]:

\[ \min_{\mathbf q \in \mathbb R^n} \ f(\mathbf q; \widehat{\mathbf Y}) \doteq \frac{1}{p} \sum_{k=1}^{p} h_\mu\left(\mathbf q^* \widehat{\mathbf y}_k\right), \quad \text{s.t. } \|\mathbf q\|_2 = 1, \tag{1.4} \]

where $\widehat{\mathbf Y}$ is a proxy for $\mathbf X_0$ (i.e., $\mathbf Y$ after appropriate processing), $k$ indexes columns of $\widehat{\mathbf Y}$, and $\|\cdot\|_2$ is the usual $\ell^2$ norm for vectors. Here $h_\mu(\cdot)$ is chosen to be a convex smooth approximation to $|\cdot|$, namely,

\[ h_\mu(z) = \mu \log\cosh\left(\frac{z}{\mu}\right), \tag{1.5} \]

which is infinitely differentiable; the parameter $\mu$ controls the smoothing level. [Footnote 11: In fact, there is nothing special about this choice, and we believe that any valid smooth (twice continuously differentiable) approximation to $|\cdot|$ would work and yield qualitatively similar results. We also have some preliminary results showing that the geometric picture below remains the same for certain nonsmooth functions, such as a modified version of the Huber function, though the analysis involves handling a different set of technical subtleties. The algorithm also needs additional modifications.] The spherical constraint is nonconvex. Hence, a priori, it is unclear whether (1.4) admits efficient algorithms that attain global optima. Surprisingly, simple descent algorithms for (1.4) exhibit very striking behavior: on many practical numerical examples [Footnote 12: … not restricted to the model we assume here for $\mathbf A_0$ and $\mathbf X_0$.], they appear to produce global solutions. Our next section will uncover interesting geometric structures underlying this phenomenon.
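A small sketch of the smoothed objective (1.4)-(1.5) and its Euclidean gradient, useful for checking the landscape numerically; the function names are ours, and the stabilized log-cosh evaluation is an implementation detail.

```python
import numpy as np

def h_mu(z, mu):
    """Smooth surrogate for |z|: h_mu(z) = mu * log(cosh(z / mu))."""
    # Stable evaluation: log(cosh(t)) = |t| + log(1 + exp(-2|t|)) - log(2).
    t = np.abs(z / mu)
    return mu * (t + np.log1p(np.exp(-2.0 * t)) - np.log(2.0))

def f(q, Yhat, mu):
    """Objective in (1.4): average of h_mu over the entries of q^T Yhat."""
    return np.mean(h_mu(q @ Yhat, mu))

def grad_f(q, Yhat, mu):
    """Euclidean gradient: (1/p) * Yhat @ tanh(Yhat^T q / mu), since h_mu'(z) = tanh(z/mu)."""
    p = Yhat.shape[1]
    return Yhat @ np.tanh((q @ Yhat) / mu) / p
```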

1.4.2 A Glimpse into High-dimensional Function Landscape

Figure 2: Why is dictionary learning over tractable? Assume the target dictionary is orthogonal. Left: Large sample objective function . The only local minima are the columns of and their negatives. Center: the same function, visualized as a height above the plane ( is the first column of ). Right: Around the optimum, the function exhibits a small region of positive curvature, a region of large gradient, and finally a region in which the direction away from is a direction of negative curvature.

For the moment, suppose $\mathbf A_0$ is orthogonal, and take $\widehat{\mathbf Y} = \mathbf Y$ in (1.4). Figure 2 (left) plots $\mathbb E_{\mathbf X_0}[f(\mathbf q; \mathbf Y)]$ over $\mathbf q \in \mathbb S^2$ ($n = 3$). Remarkably, the function has no spurious local minima. In fact, every local minimizer $\widehat{\mathbf q}$ produces a row of $\mathbf X_0$: $\widehat{\mathbf q}^* \mathbf Y = \alpha \mathbf e_i^* \mathbf X_0$ for some $\alpha \neq 0$ and some $i$.

To better illustrate the point, we take the particular case $n = 3$ and project the upper hemisphere above the equatorial plane onto that plane. The projection is bijective, and we equivalently define a reparameterization $g(\mathbf w) \doteq \mathbb E_{\mathbf X_0}[f(\mathbf q(\mathbf w); \mathbf Y)]$ of the function. Figure 2 (center) plots the graph of $g$. Obviously the only local minimizers are also global minimizers. Moreover, the apparently nonconvex landscape has interesting structure around each minimizer: when moving away from it, one sees successively a strongly convex region, a nonzero gradient region, and a region where at each point one can always find a direction of negative curvature, as shown schematically in Figure 2 (right). This geometry implies that at any nonoptimal point, there is always at least one direction of descent. Thus, any algorithm that can take advantage of such descent directions will likely converge to a global minimizer, irrespective of initialization.
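The central panel of Figure 2 can be reproduced qualitatively with a few lines of code: estimate the objective by Monte Carlo over the BG model with $\mathbf A_0 = \mathbf I$ and $n = 3$, and plot it over the projection $\mathbf w$ of the upper hemisphere onto the equatorial plane. The script below is our own illustration with arbitrary parameter choices ($\theta = 0.3$, $\mu = 0.1$).

```python
import numpy as np
import matplotlib.pyplot as plt

n, p, theta, mu = 3, 20000, 0.3, 0.1
rng = np.random.default_rng(0)
X0 = rng.standard_normal((n, p)) * (rng.random((n, p)) <= theta)   # BG(theta), A0 = I

def f(q):
    z = q @ X0
    return np.mean(mu * np.log(np.cosh(z / mu)))

# Grid over the unit disk in the equatorial plane; lift each w to the upper hemisphere.
grid = np.linspace(-0.999, 0.999, 81)
W1, W2 = np.meshgrid(grid, grid)
G = np.full_like(W1, np.nan)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        w = np.array([W1[i, j], W2[i, j]])
        if w @ w < 1.0:
            q = np.append(w, np.sqrt(1.0 - w @ w))   # q(w) on the upper hemisphere
            G[i, j] = f(q)

plt.contourf(W1, W2, G, levels=30)
plt.colorbar(label="f(q(w))")
plt.title("Empirical landscape over the equatorial plane (n = 3)")
plt.show()
```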

Two challenges stand out when implementing this idea. For geometry, one has to show that a similar structure exists for general complete $\mathbf A_0$, in high dimensions ($n \ge 3$), when the number of observations $p$ is finite (vs. the expectation in the experiment). For algorithms, we need to be able to take advantage of this structure without knowing $\mathbf A_0$ ahead of time. In Section 1.4.3, we describe a Riemannian trust region method which addresses the latter challenge.

Geometry for orthogonal $\mathbf A_0$.

In this case, we take $\widehat{\mathbf Y} = \mathbf Y$. Since $\mathbf Y = \mathbf A_0 \mathbf X_0$ with $\mathbf A_0$ orthogonal, the landscape of $f(\mathbf q; \mathbf Y)$ is simply a rotated version of that of $f(\mathbf q; \mathbf X_0)$, i.e., the landscape obtained when $\mathbf A_0 = \mathbf I$. Hence we will focus on the case $\mathbf A_0 = \mathbf I$. Among the $2n$ symmetric sections of $\mathbb S^{n-1}$ centered around the signed basis vectors $\pm\mathbf e_1, \dots, \pm\mathbf e_n$, we work with the symmetric section around $\mathbf e_n$ as an example. The same argument carries over to all sections; together this provides a complete characterization of the function over $\mathbb S^{n-1}$.

We again invoke the projection trick described above, this time onto the equatorial plane $\mathbf e_n^\perp$. This can be formally captured by the reparameterization mapping

\[ \mathbf q(\mathbf w) = \left(\mathbf w, \sqrt{1 - \|\mathbf w\|^2}\right), \tag{1.6} \]

where $\mathbf w$ is the new variable in $\mathbb B^{n-1}$, the unit ball in $\mathbb R^{n-1}$. We first study the composition $g(\mathbf w) \doteq f(\mathbf q(\mathbf w))$ over the set

(1.7)

It can be verified that the section we chose to work with is contained in this set. [Footnote 13: Indeed, if for any , , implying . The reason we have defined an open set instead of a closed (compact) one is to avoid potential trivial local minimizers located on the boundary.]

Our analysis characterizes the properties of $g$ by studying three quantities

respectively over three consecutive regions moving away from the origin, corresponding to the three regions in Figure 2 (right). In particular, through a typical expectation-concentration style argument, we show that there exists a positive constant such that

(1.8)

over the respective regions w.h.p., confirming our low-dimensional observations described above. In particular, the favorable structure we observed for $n = 3$ persists in high dimensions, w.h.p., even when $p$ is large yet finite, for the case where $\mathbf A_0$ is orthogonal. Moreover, the local minimizer of $g$ over $\Gamma$ is very close to $\mathbf 0$, within a distance controlled by the smoothing parameter $\mu$ (see (2.7)).
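For orientation, our reading of the three quantities and of (1.8) can be summarized schematically as follows, with the constant written generically as $c_\star > 0$ and the region boundaries (which depend on $\mu$) left implicit; the precise statement appears in Theorem 2.1:
\[ \nabla^2 g(\mathbf w) \ \succeq\ \frac{c_\star \theta}{\mu}\,\mathbf I \quad \text{(strongly convex region near } \mathbf 0\text{)}, \]
\[ \frac{\mathbf w^* \nabla g(\mathbf w)}{\|\mathbf w\|} \ \ge\ c_\star \theta \quad \text{(nonzero gradient region)}, \]
\[ \frac{\mathbf w^* \nabla^2 g(\mathbf w)\,\mathbf w}{\|\mathbf w\|^2} \ \le\ -\,c_\star \theta \quad \text{(negative curvature region)}. \]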

Geometry for complete .

For general complete dictionaries $\mathbf A_0$, we hope that the function $f(\mathbf q; \widehat{\mathbf Y})$ retains the nice geometric structure discussed above. We can ensure this by "preconditioning" $\mathbf Y$ such that the output looks as if it were generated from a certain orthogonal matrix, possibly plus a small perturbation. We can then argue that the perturbation does not significantly affect the properties of the graph of the objective function. Write

\[ \widehat{\mathbf Y} \doteq \left(\frac{1}{p\theta}\,\mathbf Y \mathbf Y^*\right)^{-1/2} \mathbf Y. \tag{1.9} \]

Note that for $\mathbf X_0 \sim_{i.i.d.} \mathrm{BG}(\theta)$, $\mathbb E[\mathbf X_0 \mathbf X_0^*] = \theta p\, \mathbf I$. Thus, one expects $\frac{1}{p\theta}\,\mathbf Y \mathbf Y^*$ to behave roughly like $\mathbf A_0 \mathbf A_0^*$ and hence $\widehat{\mathbf Y}$ to behave like

\[ \left(\mathbf A_0 \mathbf A_0^*\right)^{-1/2} \mathbf A_0 \mathbf X_0 = \mathbf U \mathbf V^* \mathbf X_0, \tag{1.10} \]

where we write the SVD of $\mathbf A_0$ as $\mathbf A_0 = \mathbf U \mathbf\Sigma \mathbf V^*$. It is easy to see that $\mathbf U \mathbf V^*$ is an orthogonal matrix. Hence the preconditioning scheme we have introduced is technically sound.
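A numpy sketch of the preconditioning step (1.9), under the form written above; the function name is ours.

```python
import numpy as np

def precondition(Y, theta):
    """Preconditioning as in (1.9): Yhat = ((1/(p*theta)) * Y @ Y.T)^(-1/2) @ Y."""
    p = Y.shape[1]
    S = (Y @ Y.T) / (p * theta)          # concentrates around A0 @ A0.T under the BG model
    lam, U = np.linalg.eigh(S)           # S is symmetric positive definite w.h.p.
    S_inv_sqrt = (U / np.sqrt(lam)) @ U.T
    return S_inv_sqrt @ Y
```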

Our analysis shows that $\widehat{\mathbf Y}$ can be written as

\[ \widehat{\mathbf Y} = \mathbf U \mathbf V^* \mathbf X_0 + \mathbf\Xi\, \mathbf X_0, \tag{1.11} \]

where $\mathbf\Xi$ is a matrix with small magnitude. A simple perturbation argument shows that the constant in (1.8) is shrunk by at most a constant factor over the relevant regions when $p$ is sufficiently large. Thus, the qualitative aspects of the geometry are not changed by the perturbation.

1.4.3 A Second-order Algorithm on Manifold: Riemannian Trust Region Method

We do not know $\mathbf A_0$ ahead of time, so our algorithm needs to take advantage of the structure described above without knowledge of $\mathbf A_0$. Intuitively, this seems possible, as the descent direction in the $\mathbf w$ space appears to also be a local descent direction for $f$ over the sphere. Another issue is that although the optimization problem has no spurious local minima, it does have many saddle points (Figure 2). We can use second-order information to guarantee escape from saddle points. We derive an algorithm based on the Riemannian trust region method (TRM) [ABG07, AMS09] over the sphere for this purpose.

For a function $f: \mathbb R^n \to \mathbb R$ and an unconstrained optimization problem

\[ \min_{\mathbf x \in \mathbb R^n} \ f(\mathbf x), \]

a typical (second-order) TRM proceeds by successively forming second-order approximations to $f$ at the current iterate $\mathbf x_{k-1}$,

\[ \widehat f(\boldsymbol\delta; \mathbf x_{k-1}) \doteq f(\mathbf x_{k-1}) + \nabla f(\mathbf x_{k-1})^* \boldsymbol\delta + \tfrac{1}{2}\, \boldsymbol\delta^* \mathbf Q(\mathbf x_{k-1})\, \boldsymbol\delta, \tag{1.12} \]

where $\mathbf Q(\mathbf x_{k-1})$ is a proxy for the Hessian matrix $\nabla^2 f(\mathbf x_{k-1})$, which encodes the second-order geometry. The next movement direction is determined by seeking a minimum of $\widehat f(\boldsymbol\delta; \mathbf x_{k-1})$ over a small region, normally a norm ball $\|\boldsymbol\delta\| \le \Delta$, called the trust region, inducing the well-studied trust-region subproblem:

\[ \boldsymbol\delta_k \doteq \operatorname*{arg\,min}_{\boldsymbol\delta \in \mathbb R^n,\ \|\boldsymbol\delta\| \le \Delta} \ \widehat f(\boldsymbol\delta; \mathbf x_{k-1}), \tag{1.13} \]

where $\Delta$ is called the trust-region radius, which controls how far the movement can be made. A ratio

\[ \rho_k \doteq \frac{f(\mathbf x_{k-1}) - f(\mathbf x_{k-1} + \boldsymbol\delta_k)}{\widehat f(\mathbf 0; \mathbf x_{k-1}) - \widehat f(\boldsymbol\delta_k; \mathbf x_{k-1})} \tag{1.14} \]

is defined to measure the progress, and typically the radius $\Delta$ is updated dynamically according to $\rho_k$ to adapt to the local function behavior. Detailed introductions to the classical TRM can be found in the texts [CGT00a, NW06].

Figure 3: Illustrations of the tangent space $T_{\mathbf q}\mathbb S^{n-1}$ and the exponential map $\exp_{\mathbf q}(\boldsymbol\delta)$ defined on the sphere $\mathbb S^{n-1}$.

To generalize the idea to smooth manifolds, one natural choice is to form the approximation over the tangent spaces [ABG07, AMS09]. Specific to our spherical manifold, for which the tangent space at an iterate $\mathbf q_{k-1}$ is $T_{\mathbf q_{k-1}} \mathbb S^{n-1} \doteq \{\boldsymbol\delta : \boldsymbol\delta^* \mathbf q_{k-1} = 0\}$ (see Figure 3), we work with a "quadratic" approximation defined as

\[ \widehat f(\boldsymbol\delta; \mathbf q_{k-1}) \doteq f(\mathbf q_{k-1}) + \left\langle \nabla f(\mathbf q_{k-1}), \boldsymbol\delta \right\rangle + \tfrac12\, \boldsymbol\delta^* \left( \nabla^2 f(\mathbf q_{k-1}) - \left\langle \nabla f(\mathbf q_{k-1}), \mathbf q_{k-1} \right\rangle \mathbf I \right) \boldsymbol\delta, \quad \boldsymbol\delta \in T_{\mathbf q_{k-1}} \mathbb S^{n-1}. \tag{1.15} \]

To interpret this approximation, let $\mathbf P_{T_{\mathbf q_{k-1}}} \doteq \mathbf I - \mathbf q_{k-1} \mathbf q_{k-1}^*$ be the orthoprojector onto $T_{\mathbf q_{k-1}} \mathbb S^{n-1}$ and write (1.15) in an equivalent form:

\[ \widehat f(\boldsymbol\delta; \mathbf q_{k-1}) = f(\mathbf q_{k-1}) + \left\langle \mathbf P_{T_{\mathbf q_{k-1}}} \nabla f(\mathbf q_{k-1}), \boldsymbol\delta \right\rangle + \tfrac12\, \boldsymbol\delta^* \left[ \mathbf P_{T_{\mathbf q_{k-1}}} \left( \nabla^2 f(\mathbf q_{k-1}) - \left\langle \nabla f(\mathbf q_{k-1}), \mathbf q_{k-1} \right\rangle \mathbf I \right) \mathbf P_{T_{\mathbf q_{k-1}}} \right] \boldsymbol\delta. \]

The two terms

\[ \operatorname{grad} f(\mathbf q_{k-1}) \doteq \mathbf P_{T_{\mathbf q_{k-1}}} \nabla f(\mathbf q_{k-1}), \qquad \operatorname{Hess} f(\mathbf q_{k-1}) \doteq \mathbf P_{T_{\mathbf q_{k-1}}} \left( \nabla^2 f(\mathbf q_{k-1}) - \left\langle \nabla f(\mathbf q_{k-1}), \mathbf q_{k-1} \right\rangle \mathbf I \right) \mathbf P_{T_{\mathbf q_{k-1}}} \]

are the Riemannian gradient and Riemannian Hessian of $f$ w.r.t. $\mathbb S^{n-1}$, respectively [ABG07, AMS09]; the above approximation is reminiscent of the usual quadratic approximation described in (1.12).

Then the Riemannian trust-region subproblem is

\[ \min_{\boldsymbol\delta \in T_{\mathbf q_{k-1}} \mathbb S^{n-1},\ \|\boldsymbol\delta\|_2 \le \Delta} \ \widehat f(\boldsymbol\delta; \mathbf q_{k-1}), \tag{1.16} \]

where we take the simple $\ell^2$ norm ball for the trust region. This can be transformed into a classical trust-region subproblem: indeed, taking any matrix $\mathbf U_{k-1} \in \mathbb R^{n \times (n-1)}$ whose columns form an orthonormal basis for $T_{\mathbf q_{k-1}} \mathbb S^{n-1}$, the above problem is equivalent to

\[ \min_{\boldsymbol\xi \in \mathbb R^{n-1},\ \|\boldsymbol\xi\|_2 \le \Delta} \ \widehat f(\mathbf U_{k-1} \boldsymbol\xi; \mathbf q_{k-1}), \tag{1.17} \]

where the objective is quadratic in $\boldsymbol\xi$. This is the classical trust-region problem (with an $\ell^2$ norm ball constraint), which admits very efficient numerical algorithms [MS83, HK14]. Once we obtain the minimizer $\boldsymbol\xi_\star$, we set $\boldsymbol\delta_\star \doteq \mathbf U_{k-1} \boldsymbol\xi_\star$, which solves (1.16).
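For concreteness, here is a minimal solver for a subproblem of the form (1.17), i.e., minimizing a quadratic over an $\ell^2$ ball. It uses the standard boundary characterization with a simple bisection on the multiplier and ignores the rare "hard case"; [MS83, HK14] describe the refined methods one would use in practice. The function and variable names are ours.

```python
import numpy as np

def trust_region_subproblem(g, H, Delta, tol=1e-10, max_iter=200):
    """Minimize g^T x + 0.5 * x^T H x subject to ||x|| <= Delta.

    The minimizer solves (H + lam*I) x = -g for some lam >= max(0, -lambda_min(H)),
    with ||x|| = Delta whenever lam > 0. This sketch finds lam by bisection and
    ignores the rare "hard case"; see [MS83] for a complete treatment.
    """
    lam_min = np.linalg.eigvalsh(H)[0]
    if lam_min > 0:                                  # try the interior solution first
        x = np.linalg.solve(H, -g)
        if np.linalg.norm(x) <= Delta:
            return x
    lo = max(0.0, -lam_min) + 1e-12                  # boundary solution: bisect on lam
    hi = lo + 1.0
    while np.linalg.norm(np.linalg.solve(H + hi * np.eye(len(g)), -g)) > Delta:
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        x = np.linalg.solve(H + mid * np.eye(len(g)), -g)
        if np.linalg.norm(x) > Delta:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return np.linalg.solve(H + hi * np.eye(len(g)), -g)
```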

One additional issue, as compared to the Euclidean setting, is that $\boldsymbol\delta_\star$ is a vector in the tangent space, and the additive update $\mathbf q_{k-1} + \boldsymbol\delta_\star$ leads to a point outside the sphere. We resort to the natural exponential map to pull the tangent vector back to a point on the sphere:

\[ \mathbf q_k \doteq \exp_{\mathbf q_{k-1}}(\boldsymbol\delta_\star) = \mathbf q_{k-1} \cos\|\boldsymbol\delta_\star\|_2 + \frac{\boldsymbol\delta_\star}{\|\boldsymbol\delta_\star\|_2} \sin\|\boldsymbol\delta_\star\|_2. \tag{1.18} \]

As seen from Figure 3, the movement to the next iterate is "along the direction" [Footnote 14: Technically, moving along the geodesic whose velocity at time zero is $\boldsymbol\delta_\star$.] of $\boldsymbol\delta_\star$ while staying on the sphere.
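The following sketch collects the sphere-specific ingredients used above: tangent-space projection, the Riemannian gradient and Hessian-vector product implicit in (1.15), and the exponential-map update (1.18). The simple Riemannian gradient step at the end is only a usage illustration (the algorithm analyzed here computes the step via the trust-region subproblem instead); all names are ours, and the objective is the log-cosh objective from (1.4)-(1.5).

```python
import numpy as np

def proj_tangent(q, v):
    """Project v onto the tangent space T_q S^{n-1} = {d : d^T q = 0}."""
    return v - (q @ v) * q

def riemannian_grad(q, egrad):
    """Riemannian gradient on the sphere from the Euclidean gradient."""
    return proj_tangent(q, egrad)

def riemannian_hess_vec(q, egrad, ehess_d, d):
    """Riemannian Hessian-vector product on the sphere (d assumed tangent at q):
    P_{T_q}(ehess(q)[d]) - (q^T egrad) d, matching the quadratic model (1.15)."""
    return proj_tangent(q, ehess_d) - (q @ egrad) * d

def exp_map(q, d):
    """Exponential map (1.18): move from q along the geodesic with initial velocity d."""
    t = np.linalg.norm(d)
    if t < 1e-16:
        return q
    return q * np.cos(t) + (d / t) * np.sin(t)

def riemannian_gradient_step(q, Y, mu, step):
    """Usage illustration: one Riemannian gradient step for f(q) = (1/p) sum_k h_mu(q^T y_k)."""
    p = Y.shape[1]
    egrad = Y @ np.tanh((q @ Y) / mu) / p            # Euclidean gradient of the log-cosh objective
    return exp_map(q, -step * riemannian_grad(q, egrad))
```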

Using the above geometric characterizations, we prove that w.h.p. the algorithm converges to a local minimizer when the trust-region radius $\Delta$ is sufficiently small. [Footnote 15: For simplicity of analysis, we have assumed $\Delta$ is fixed throughout the analysis. In practice, dynamic updates to $\Delta$ lead to faster convergence.] In particular, we show that (1) the trust-region step induces at least a fixed amount of decrease in the objective value in the negative curvature and nonzero gradient regions; (2) the trust-region iterate sequence eventually moves to and stays in the strongly convex region, and converges to the local minimizer contained in that region at an asymptotic quadratic rate. In short, the geometric structure implies that from any initialization, the iterate sequence converges to a close approximation to the target solution in a polynomial number of steps.

1.5 Prior Arts and Connections

It is far too ambitious to include here a comprehensive review of the exciting developments in DL algorithms and applications after the pioneering work [OF96]. We refer the reader to Chapters 12-15 of the book [Ela10] and the survey paper [MBP14] for summaries of relevant developments in image analysis and visual recognition. In the following, we focus on reviewing recent developments on the theoretical side of dictionary learning, and draw connections to problems and techniques that are relevant to the current work.

Theoretical Dictionary Learning.

The theoretical study of DL in the recovery setting started only very recently. [AEB06] was the first to provide an algorithmic procedure to correctly extract the generating dictionary. The algorithm requires exponentially many samples and has exponential running time; see also [HS11]. Subsequent work [GS10, GW11, Sch14a, Sch14b, Sch15] studied when the target dictionary is a local optimum of natural recovery criteria. These meticulous analyses show that polynomially many samples are sufficient to ensure local correctness under natural assumptions. However, these results do not imply that one can design efficient algorithms to obtain the desired local optimum and hence the dictionary.

[SWW12] initiated the ongoing research effort to provide efficient algorithms that globally solve DR. They showed that one can recover a complete dictionary $\mathbf A_0$ from $\mathbf Y$ by solving a certain sequence of linear programs, when $\mathbf X_0$ is a sparse random matrix with $O(\sqrt n)$ nonzeros per column. [AAJ13, AAN13] and [AGM13, AGMM15] give efficient algorithms that provably recover overcomplete ($m > n$) and incoherent dictionaries, based on a combination of {clustering or spectral initialization} and local refinement. These algorithms again succeed when $\mathbf X_0$ has $\widetilde O(\sqrt n)$ nonzeros per column. [Footnote 16: The $\widetilde O$ suppresses some logarithmic factors.] Recent work [BKS14] provides the first polynomial-time algorithm that provably recovers most "nice" overcomplete dictionaries when $\mathbf X_0$ has $O(n^{1-\delta})$ nonzeros per column for any constant $\delta \in (0, 1)$. However, the proposed algorithm runs in super-polynomial time when the sparsity level goes up to $O(n)$. Similarly, [ABGM14] also proposes a super-polynomial (quasipolynomial) time algorithm that guarantees recovery with (almost) $O(n)$ nonzeros per column. By comparison, we give the first polynomial-time algorithm that provably recovers a complete dictionary when $\mathbf X_0$ has $O(n)$ nonzeros per column.

Aside from efficient recovery, other theoretical work on DL includes results on identifiability [AEB06, HS11, WY15], generalization bounds [MP10b, VMB11, MG13, GJB13], and noise stability [GJB14].

Finding Sparse Vectors in a Linear Subspace.

We have followed [SWW12] and cast the core problem as finding the sparsest vectors in a given linear subspace, a problem which is also of independent interest. Under a planted sparse model [Footnote 17: … where one sparse vector is embedded in an otherwise random subspace.], [DH14] shows that solving a sequence of linear programs similar to [SWW12] can recover sparse vectors with sparsity sublinear in the vector dimension. [QSW14] improved the recovery limit to linear sparsity by solving a nonconvex spherically constrained problem similar to (1.4) [Footnote 18: The only difference is that they chose to work with the Huber function as a proxy for the $|\cdot|$ function.] via an ADM algorithm. The idea of seeking rows of $\mathbf X_0$ sequentially by solving the above core problem has precursors in [ZP01] for blind source separation, and [GN10] for matrix sparsification. [ZP01] also proposed a nonconvex optimization similar to (1.4) here and to that employed in [QSW14].

Nonconvex Optimization Problems.

For other nonconvex optimization problems involving recovery of structured signals [Footnote 19: This is a body of recent work studying nonconvex recovery up to statistical precision, including, e.g., [LW11, LW13, WLL14, BWY14, WGNL14, LW14, Loh15, SLLC15].], including low-rank matrix completion/recovery [KMO10, JNS13, Har14, HW14, NNS14, JN14, SL14, ZL15, TBSR15, CW15], phase retrieval [NJS13, CLS15, CC15, WWS15], tensor recovery [JO14, AGJ14b, AGJ14a, AJSN15], mixed regression [YCS13, LWB13], structured element pursuit [QSW14], recovery of simultaneously structured signals [LWB13], and numerical linear algebra and optimization [JJKN15, BKS15], the initialization-plus-local-refinement strategy adopted in theoretical DL [AAJ13, AAN13, AGM13, AGMM15, ABGM14] is also crucial: nearness to the target solution enables exploiting the local geometry of the target to analyze the local refinement. [Footnote 20: The powerful framework of [ABRS10, BST14] for establishing local convergence of ADM algorithms to critical points applies to DL/DR also; see, e.g., [BJQS14, BQJ14, BJS14]. However, these results do not guarantee global optima.] By comparison, we provide a complete characterization of the global geometry, which admits efficient algorithms without any special initialization. The idea of separating the geometric analysis from the algorithmic design may also prove valuable for other nonconvex problems discussed above.

Optimization over Riemannian Manifolds.

Our trust-region algorithm on the sphere builds on extensive research efforts to generalize Euclidean numerical algorithms to (Riemannian) manifold settings. We refer the reader to the monographs [Udr94, HMG94, AMS09] for surveys of developments in this field. In particular, [EAS98] developed Newton and conjugate-gradient methods for the Stiefel manifolds, of which the spherical manifold is a special case. [ABG07] generalized trust-region methods to Riemannian manifolds. We cannot, however, directly adopt the existing convergence results, which concern either global convergence (convergence to critical points) or local convergence (convergence to a local minimum within a radius). The particular geometric structure here forces us to piece together different arguments to obtain the global result.

Independent Component Analysis (ICA) and Other Matrix Factorization Problems.

Figure 4: Asymptotic function landscapes when rows of $\mathbf X_0$ are not independent. W.l.o.g., we again assume $\mathbf A_0 = \mathbf I$. Panels (a) and (d): correlated Gaussian, with columns of $\mathbf X_0$ i.i.d. Gaussian vectors whose covariance is a symmetric matrix with ones on the diagonal and i.i.d. off-diagonal entries. Panels (b) and (e): correlated uniform, with columns of $\mathbf X_0$ generated analogously from a uniform base distribution. Panels (c) and (f), for comparison: independent uniform. The objective function is still based on the $h_\mu$ function as in (1.4).

DL can also be considered in the general framework of matrix factorization problems, which encompass the classic principal component analysis (PCA), ICA, and clustering, and more recent problems such as nonnegative matrix factorization (NMF) and multi-layer neural nets (deep learning architectures). Most of these problems are NP-hard. Identifying tractable cases of practical interest and providing provably efficient algorithms are the subject of ongoing research endeavors; see, e.g., recent progress on NMF [AGKM12] and on learning deep neural nets [ABGM13, SA14, NP13, LSSS14].

ICA factors a data matrix $\mathbf Y$ as $\mathbf Y = \mathbf A \mathbf X$ such that $\mathbf A$ is square and rows of $\mathbf X$ are as independent as possible [HO00, HO01]. In theoretical studies of the recovery problem, it is often assumed that rows of $\mathbf X_0$ are (weakly) independent (see, e.g., [Com94, FJK96, AGMS12]). Our i.i.d. probability model on $\mathbf X_0$ implies that rows of $\mathbf X_0$ are independent, aligning our problem perfectly with the ICA problem. More interestingly, the objective we analyze here was proposed as a general-purpose contrast function in ICA that had not been thoroughly analyzed [Hyv99], and the algorithm and analysis with another popular contrast function, the fourth-order cumulant, indeed overlap with ours considerably [FJK96, AGMS12]. [Footnote 21: Nevertheless, the objective functions are apparently different. Moreover, we have provided a complete geometric characterization of the objective, in contrast to [FJK96, AGMS12]. We believe the geometric characterization could not only provide insight into the algorithm, but also help improve the algorithm in terms of stability and in finding all components.] While this interesting connection potentially helps port our analysis to ICA, it is a fundamental question to ask what plays the vital role for DR: sparsity or independence.

Figure 4 helps shed some light in this direction, where we again plot the asymptotic objective landscape under the natural reparameterization as in Section 1.4.2. From the left and central panels, it is evident that even without independence, $\mathbf X_0$ with sparse columns induces the familiar geometric structures we saw in Figure 2; such structures are broken when the sparsity level becomes large. We believe all our later analyses can be generalized to the correlated cases we experimented with. On the other hand, from the right panel [Footnote 22: We have not shown the results on the BG model here, as it seems the structure persists even when $\theta$ approaches $1$. We suspect the "phase transition" of the landscape occurs at different points for different distributions, and Gaussian is the outlying case where the transition occurs only at $\theta = 1$.], it seems that with independence, the function landscape undergoes a transition as the sparsity level grows: the target solutions go from minimizers of the objective to maximizers of the objective. Without adequate knowledge of the true sparsity, it is unclear whether one would like to minimize or maximize the objective. [Footnote 23: For solving the ICA problem, this suggests that the contrast function, which works well empirically [Hyv99], may not work for all distributions (rotation-invariant Gaussian excluded, of course).] This suggests that sparsity, instead of independence, makes our current algorithm for DR work.

Nonconvex Problems with Similar Geometric Structure

Besides ICA discussed above, it turns out that a handful of other practical problems arising in signal processing and machine learning induce the "no spurious minimizers, all saddles are second-order" structure under natural settings, including the eigenvalue problem, generalized phase retrieval [SQW15a], tensor decomposition [GHJY15], and learning linear neural nets [BH89]. [SQW15b] gives a review of these problems, and discusses how the methodology developed in this and the companion paper [SQWb] can be generalized to solve them.

1.6 Notations, Organization, and Reproducible Research

We use bold capital and small letters such as $\mathbf X$ and $\mathbf x$ to denote matrices and vectors, respectively. Small letters are reserved for scalars. Several specific mathematical objects we will frequently work with: $O(n)$ for the orthogonal group of order $n$, $\mathbb S^{n-1}$ for the unit sphere in $\mathbb R^n$, $\mathbb B^n$ for the unit ball in $\mathbb R^n$, and $[m] \doteq \{1, \dots, m\}$ for positive integers $m$. We use $^*$ for matrix transposition, causing no confusion as we will work entirely over the real field. We use superscripts to index rows of a matrix, such as $\mathbf x^i$ for the $i$-th row of the matrix $\mathbf X$, and subscripts to index columns, such as $\mathbf x_j$. All vectors default to column vectors, so the $i$-th row of $\mathbf X$ as a row vector will be written as $(\mathbf x^i)^*$. For norms, $\|\cdot\|$ is the usual $\ell^2$ norm for a vector and the operator norm (i.e., $\ell^2 \to \ell^2$) for a matrix; all other norms will be indexed by a subscript, for example the Frobenius norm $\|\cdot\|_F$ for matrices and the elementwise max-norm $\|\cdot\|_\infty$. We use $x \sim \mathcal L$ to mean that the random variable $x$ is distributed according to the law $\mathcal L$. Let $\mathcal N$ denote the Gaussian law; then $\mathbf x \sim \mathcal N(\mathbf 0, \mathbf I)$ means that $\mathbf x$ is a standard Gaussian vector. Similarly, we use $\mathbf X \sim_{i.i.d.} \mathcal L$ to mean that the elements of $\mathbf X$ are independently and identically distributed according to the law $\mathcal L$; so $\mathbf X \sim_{i.i.d.} \mathcal N(0, 1)$ is equivalent to saying that the entries of $\mathbf X$ are independent standard Gaussian variables. One particular distribution of interest for this paper is the Bernoulli-Gaussian with rate $\theta$: $Z = \Omega V$, with $\Omega \sim \mathrm{Ber}(\theta)$ and $V \sim \mathcal N(0, 1)$. We also write this compactly as $Z \sim \mathrm{BG}(\theta)$. We frequently use indexed $c$ and $C$ for numerical constants when stating and proving technical results. The scopes of such constants are local unless otherwise noted. We use standard notations for most other cases, with exceptions clarified locally.

The rest of the paper is organized as follows. In Section 2 we present the major technical results for a complete characterization of the geometry sketched in Section 1.4.2. Similarly, in Section 3 we present the necessary technical machinery and results for the convergence proof of the Riemannian trust-region algorithm over the sphere, corresponding to Section 1.4.3. In Section 4, we discuss the whole algorithmic pipeline for recovering complete dictionaries from $\mathbf Y$, and present the main theorems. After presenting a simple simulation to corroborate our theory in Section 5, we wrap up the main content in Section 6 by discussing possible improvements and future directions. All major proofs of the geometric and algorithmic results are deferred to Sections 7 and 8, respectively. Section 9 augments the main results. The appendices cover some recurring technical tools and auxiliary results for the proofs.

The codes to reproduce all the figures and experimental results can be found online:

https://github.com/sunju/dl_focm

2 High-dimensional Function Landscapes

To characterize the function landscape of $f(\mathbf q; \widehat{\mathbf Y})$ over $\mathbb S^{n-1}$, we mostly work with the function

\[ g(\mathbf w) \doteq f(\mathbf q(\mathbf w); \widehat{\mathbf Y}), \tag{2.1} \]

induced by the reparametrization

\[ \mathbf q(\mathbf w) = \left(\mathbf w, \sqrt{1 - \|\mathbf w\|^2}\right), \quad \mathbf w \in \mathbb B^{n-1}. \tag{2.2} \]

In particular, we focus our attention on the smaller set

\[ \Gamma \doteq \left\{ \mathbf w : \|\mathbf w\| < \sqrt{\tfrac{4n-1}{4n}} \right\}, \tag{2.3} \]

because $\mathbf q(\Gamma)$ contains all points $\mathbf q \in \mathbb S^{n-1}$ with $q_n > \frac{1}{2\sqrt n}$, and we can characterize other parts of $f$ on $\mathbb S^{n-1}$ using projection onto other equatorial planes. Note that over $\Gamma$, $q_n(\mathbf w) = \sqrt{1 - \|\mathbf w\|^2} > \frac{1}{2\sqrt n} > 0$.
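As a quick numerical sanity check of (2.1)-(2.2) (our own snippet, using the log-cosh objective from Section 1.4.1 and arbitrary parameters): $\mathbf q(\mathbf w)$ lies on the sphere, $\mathbf q(\mathbf 0) = \mathbf e_n$, and $g$ is simply $f$ composed with the lift.

```python
import numpy as np

def q_of_w(w):
    """Reparametrization (2.2): lift w in the unit ball of R^(n-1) to the upper hemisphere."""
    return np.append(w, np.sqrt(1.0 - w @ w))

def f(q, Yhat, mu=0.05):
    return np.mean(mu * np.log(np.cosh((q @ Yhat) / mu)))

def g(w, Yhat, mu=0.05):
    """The pulled-back function (2.1): g(w) = f(q(w))."""
    return f(q_of_w(w), Yhat, mu)

rng = np.random.default_rng(0)
n, p, theta = 10, 2000, 0.2
Yhat = rng.standard_normal((n, p)) * (rng.random((n, p)) <= theta)   # proxy data, A0 = I
w = rng.standard_normal(n - 1)
w = 0.5 * w / np.linalg.norm(w)                                      # a point inside Gamma
assert np.isclose(np.linalg.norm(q_of_w(w)), 1.0)                    # q(w) is on the sphere
assert np.allclose(q_of_w(np.zeros(n - 1)), np.eye(n)[-1])           # q(0) = e_n
print(g(np.zeros(n - 1), Yhat), g(w, Yhat))
```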

2.1 Main Geometric Theorems

Theorem 2.1 (High-dimensional landscape - orthogonal dictionary)

Suppose and hence . There exist positive constants and , such that for any and , whenever , the following hold simultaneously with high probability:

(2.4)
(2.5)
(2.6)

and the function has exactly one local minimizer over the open set , which satisfies

(2.7)

In particular, with this choice of , the probability the claim fails to hold is at most . Here to are all positive numerical constants.

Here $\mathbf q(\mathbf 0) = \mathbf e_n$, which exactly recovers the last row of $\mathbf X_0$. Though the unique local minimizer may not be exactly $\mathbf 0$, it is very near to it; hence the resulting $\mathbf q$ produces a close approximation to this row. Note that $\mathbf q(\Gamma)$ (strictly) contains all points $\mathbf q \in \mathbb S^{n-1}$ such that $q_n > \frac{1}{2\sqrt n}$. We can characterize the graph of the function $f$ in the vicinity of the other signed basis vectors simply by changing the plane $\mathbf e_n^\perp$ to the corresponding equatorial planes. Doing this $2n$ times (and multiplying the failure probability in Theorem 2.1 by $2n$), we obtain a characterization of $f$ over the entirety of $\mathbb S^{n-1}$. [Footnote 24: In fact, it is possible to pull the very detailed geometry captured in (2.4) through (2.6) back to the sphere (i.e., the $\mathbf q$ space) also; the analysis of the Riemannian trust-region algorithm later does part of this. We will stick to this simpler global version here.] The result is captured by the next corollary.

Corollary 2.2

Suppose $\mathbf A_0 = \mathbf I$ and hence $\mathbf Y = \mathbf X_0$. There exist positive constants such that, for any admissible $\theta$ and $\mu$, whenever $p$ is sufficiently large, with high probability the function $f(\mathbf q; \mathbf X_0)$ has exactly $2n$ local minimizers over the sphere $\mathbb S^{n-1}$. In particular, there is a bijective map between these minimizers and the signed basis vectors $\{\pm \mathbf e_i\}$, such that each local minimizer $\mathbf q_\star$ and the corresponding $\mathbf b \in \{\pm \mathbf e_i\}$ satisfy

(2.8)

Here to are numerical constants (possibly different from that in the above theorem).

Proof By Theorem 2.1, over , is the unique local minimizer. Suppose not. Then there exist with and , such that for all satisfying . Since the mapping is -Lipschitz (Lemma 7.7), for all satisfying , implying is a local minimizer different from , a contradiction. Let . Straightforward calculation shows

Repeating the argument $2n$ times in the vicinity of the other signed basis vectors gives $2n$ local minimizers of $f$. Indeed, the symmetric sections cover the sphere with certain overlaps, and a simple calculation shows that no such local minimizer lies in the overlapped regions (due to nearness to a signed basis vector). There is no extra local minimizer, as any such local minimizer would be contained in at least one of the symmetric sections, resulting in two different local minimizers in one section, contradicting the uniqueness result we obtained above.  

Though the isolated local minimizers may have different objective values, they are equally good in the sense that any of them produces a close approximation to a certain row of $\mathbf X_0$. As discussed in Section 1.4.2, for the case where $\mathbf A_0$ is an orthobasis other than $\mathbf I$, the landscape of $f(\mathbf q; \mathbf Y)$ is simply a rotated version of the one we characterized above.

Theorem 2.3 (High-dimensional landscape - complete dictionary)

Suppose is complete with its condition number . There exist positive constants and , such that for any and , when and , , the following hold simultaneously with high probability:

(2.9)
(2.10)
(2.11)

and the function has exactly one local minimizer over the open set , which satisfies

(2.12)

In particular, with this choice of , the probability the claim fails to hold is at most . Here to are all positive numerical constants.

Corollary 2.4

Suppose is complete with its condition number . There exist positive constants and , such that for any and , when