Provable Bounds for Learning Some Deep Representations
Abstract
We give algorithms with provable guarantees that learn a class of deep nets in the generative model view popularized by Hinton and others. Our generative model is an $n$ node multilayer neural net that has degree at most $n^{\gamma}$ for some small constant $\gamma$, and each edge has a random edge weight in $[-1,1]$. Our algorithm learns almost all networks in this class with polynomial running time. The sample complexity is quadratic or cubic depending upon the details of the model.
The algorithm uses layerwise learning. It is based upon a novel idea of observing correlations among features and using these to infer the underlying edge structure via a global graph recovery procedure. The analysis of the algorithm reveals interesting structure of neural networks with random edge weights. (The first 18 pages of this document serve as an extended abstract of the paper, and a long technical appendix follows.)
1 Introduction
Can we provide theoretical explanation for the practical success of deep nets? Like many other ML tasks, learning deep neural nets is NP-hard, and in fact seems “badly NP-hard” because of the many layers of hidden variables connected by nonlinear operations. Usually one imagines that NP-hardness is not a barrier to provable algorithms in ML because the inputs to the learner are drawn from some simple distribution and are not worst-case. This hope was recently borne out in the case of generative models such as HMMs, Gaussian Mixtures, LDA etc., for which learning algorithms with provable guarantees were given [HKZ12, MV10, HK13, AGM12, AFH12]. However, supervised learning of neural nets even on random inputs still seems as hard as cracking cryptographic schemes: this holds for bounded-depth neural nets [JKS02] and even ANDs of thresholds (a simple depth two network) [KS09].
However, modern deep nets are not “just” neural nets (see the survey [Ben09]). The underlying assumption is that the net (or some modification) can be run in reverse to get a generative model for a distribution that is a close fit to the empirical input distribution. Hinton promoted this viewpoint, and suggested modeling each level as a Restricted Boltzmann Machine (RBM), which is “reversible” in this sense. Vincent et al. [VLBM08] suggested using many layers of a denoising autoencoder, a generalization of the RBM that consists of a pair of encoder-decoder functions (see Definition 1). These viewpoints allow a different learning methodology than classical backpropagation: layerwise learning of the net, and in fact unsupervised learning. The bottom (observed) layer is learnt in unsupervised fashion using the provided data. This gives values for the next layer of hidden variables, which are used as the “data” to learn the next higher layer, and so on. The final net thus learnt is also a good generative model for the distribution of the bottom layer. In practice the unsupervised phase is followed by supervised training. (Recent work suggests that classical backpropagation-based learning of neural nets together with a few modern ideas like convolution and dropout training also performs very well [KSH12], though the authors suggest that unsupervised pretraining should help further.)
This viewpoint of reversible deep nets is more promising for theoretical work because it involves a generative model, and also seems to get around cryptographic hardness. But many barriers still remain. There is no known mathematical condition that describes neural nets that are or are not denoising autoencoders. Furthermore, learning even a single layer sparse denoising autoencoder seems at least as hard as learning sparsely-used overcomplete dictionaries (i.e., a single hidden layer with linear operations), for which there were no provable bounds at all until the very recent manuscript [AGM13]. (The parameter choices in that manuscript make it less interesting in the context of deep learning, since the hidden layer is required to be highly sparse—in other words, the observed vector must be highly compressible.)
The current paper presents both an interesting family of denoising autoencoders as well as new algorithms to provably learn almost all models in this family. Our ground truth generative model is a simple multilayer neural net with edge weights in $[-1,1]$ and simple threshold (i.e., $>0$) computation at the nodes. A sparse $0/1$ assignment is provided at the top hidden layer, which is computed upon by successive hidden layers in the obvious way until the “observed vector” appears at the bottommost layer. If one makes no further assumptions, then the problem of learning the network given samples from the bottom layer is still harder than breaking some cryptographic schemes. (To rephrase this in autoencoder terminology: our model comes equipped with a decoder function at each layer. But this is not enough to guarantee an efficient encoder function—this may be tantamount to breaking cryptographic schemes.)
So we make the following additional assumptions about the unknown “ground truth deep net” (see Section 2): (i) Each feature/node activates/inhibits at most $d$ features at the layer below, and is itself activated/inhibited by at most $d$ features in the layer above, where $d = n^{\gamma}$ for some small constant $\gamma$; in other words the ground truth net is not a complete graph. (ii) The graph of these edges is chosen at random and the weights on these edges are random numbers in $[-1,1]$.
Our algorithm learns almost all networks in this class very efficiently and with low sample complexity; see Theorem 1. The algorithm outputs a network whose generative behavior is statistically indistinguishable from the ground truth net. (If the weights are discrete, say in $\{\pm 1\}$, then it exactly learns the ground truth net.)
Along the way we exhibit interesting properties of such randomly-generated neural nets. (a) Each pair of adjacent layers constitutes a denoising autoencoder in the sense of Vincent et al.; see Lemma 2. Since the model definition already includes a decoder, this involves showing the existence of an encoder that completes it into an autoencoder. (b) The encoder is actually the same neural network run in reverse by appropriately changing the thresholds at the computation nodes. (c) The reverse computation is stable to dropouts and noise. (d) The distribution generated by a two-layer net cannot be represented by any single layer neural net (see Section 8), which in turn suggests that a random $t$-layer network cannot be represented by any network with substantially fewer levels. (Formally proving this for larger $t$ is difficult however, since showing limitations of even 2-layer neural nets is a major open problem in computational complexity theory. Some deep learning papers mistakenly cite an old paper for such a result, but the result that actually exists is far weaker.)
Note that properties (a) to (d) are assumed in modern deep net work: for example (b) is a heuristic trick called “weight tying”. The fact that they provably hold for our random generative model can be seen as some theoretical validation of those assumptions.
Context. Recent papers have given theoretical analyses of models with multiple levels of hidden features, including SVMs [CS09, LSSS13]. However, none of these solves the task of recovering a ground-truth neural network given its output distribution.
Though real-life neural nets are not random, our consideration of random deep networks makes some sense for theory. Sparse denoising autoencoders are reminiscent of other objects such as error-correcting codes, compressed sensing, etc., which were all first analysed in the random case. As mentioned, provable reconstruction of the hidden layer (i.e., input encoding) in a known autoencoder already seems a nonlinear generalization of compressed sensing, whereas even the usual (linear) version of compressed sensing seems possible only if the adjacency matrix has “random-like” properties (low coherence or restricted isometry or lossless expansion). In fact our result that a single layer of our generative model is a sparse denoising autoencoder can be seen as an analog of the fact that random matrices are good for compressed sensing/sparse reconstruction (see Donoho [Don06] for general matrices and Berinde et al. [BGI08] for sparse matrices). Of course, in compressed sensing the matrix of edge weights is known whereas here it has to be learnt, which is the main contribution of our work. Furthermore, we show that our algorithm for learning a single layer of weights can be extended to do layerwise learning of the entire network.
Does our algorithm yield new approaches in practice? We discuss this possibility after sketching our algorithm in the next section.
2 Definitions and Results
Our generative neural net model (“ground truth”) has $\ell$ hidden layers of vectors of binary variables $h^{(\ell)}, h^{(\ell-1)}, \ldots, h^{(1)}$ (where $h^{(\ell)}$ is the top layer) and an observed layer $y$ at the bottom. The number of vertices at layer $i$ is denoted by $n_i$, and the set of edges between layers $i$ and $i+1$ by $E_i$. In this abstract we assume all $n_i = n$; in the appendix we allow them to differ. (When the layer sizes differ, the sparsities of the successive layers are related correspondingly; nothing much else changes.) (The long technical appendix serves partially as a full version of the paper with exact parameters and complete proofs.) The weighted graph between layers $i$ and $i+1$ has degree at most $d$ and all edge weights are in $[-1,1]$. The generative model works like a neural net where the threshold at every node is $0$. (It is possible to allow these thresholds to be higher and to vary between the nodes, but the calculations are harder and the algorithm is much less efficient.) The top layer is initialized with a $0/1$ assignment where the set of nodes that are $1$ is picked uniformly among all sets of size $\rho_\ell n$. (It is possible to prove the result when the top layer is not a uniformly random sparse vector but has some bounded correlations; this makes the algorithm more complicated.) For $i = \ell$ down to $2$, each node in layer $i-1$ computes a weighted sum of its neighbors in layer $i$, and becomes $1$ iff that sum strictly exceeds $0$. We will use $\mathrm{sgn}(\cdot)$ to denote the threshold function that is $1$ if its argument is positive and $0$ else. (Applying $\mathrm{sgn}$ to a vector involves applying it componentwise.) Thus the network computes as follows: $h^{(i-1)} = \mathrm{sgn}(G_{i-1} h^{(i)})$ for all $i > 1$, and $y = G_0 h^{(1)}$ (i.e., no threshold at the observed layer). (It is possible to stay with a generative deep model in which the last layer also has $0/1$ values; then our calculations require the fraction of $1$'s in the lowermost (observed) layer to be small. This could be an OK model if one assumes that some hand-coded net (or a nonrandom layer like a convolutional net) has been used on the real data to produce a sparse encoding, which is the bottom layer of our generative model. However, if one desires a generative model in which the observed layer is not sparse, then we can do this by allowing real-valued assignments at the observed layer, and remove the threshold gates there. This is the model described here.) Here $G_i$ stands for both the weighted bipartite graph at a level and its weight matrix.
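As an illustration, the generative process just described can be simulated directly. The parameter values below (and the use of equal-sized layers) are our own assumptions for the sketch, not prescriptions from the analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s, ell = 2000, 10, 40, 2   # layer size, expected degree, top sparsity, hidden layers

def random_weighted_graph(n, d, rng, discrete=False):
    """Random bipartite graph with expected degree d and weights in [-1,1] (or ±1)."""
    mask = rng.random((n, n)) < d / n
    weights = (rng.choice([-1.0, 1.0], size=(n, n)) if discrete
               else rng.uniform(-1.0, 1.0, size=(n, n)))
    return mask * weights

def generate(graphs, s, rng):
    """One sample: a sparse 0/1 top layer pushed down through sgn gates;
    the last graph is applied without a threshold (real-valued observed layer)."""
    n = graphs[0].shape[0]
    h = np.zeros(n)
    h[rng.choice(n, size=s, replace=False)] = 1.0
    layers = [h]
    for G in graphs[:-1]:
        h = (G @ h > 0).astype(float)
        layers.append(h)
    layers.append(graphs[-1] @ h)
    return layers

graphs = [random_weighted_graph(n, d, rng, discrete=True) for _ in range(ell)]
layers = generate(graphs, s, rng)
print("density per layer:", [float(np.count_nonzero(v)) / n for v in layers])
```

Running this exhibits the density growth discussed next: each layer is roughly $d/2$ times denser than the one above it, until the real-valued bottom layer is dense.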
Random deep net assumption: We assume that in this ground truth the edges between layers are chosen randomly subject to the expected degree being $d$ (in the appendix we allow degrees to differ between layers), and each edge carries a weight that is chosen randomly in $[-1,1]$. This is our basic model. We also consider —because it leads to a simpler and more efficient learner— a model where edge weights are random in $\{\pm 1\}$ instead of $[-1,1]$; we refer to it below as the $\{\pm 1\}$ model. Recall that $\rho_\ell$ is such that the $0/1$ vector input at the top layer has $1$'s in a random subset of $\rho_\ell n$ nodes.
It can be seen that since the network is random of degree $d$, applying a $\rho_\ell n$-sparse vector at the top layer is likely to produce the following densities of $1$'s (approximately) at the successive layers: $\rho_\ell(d/2)$, $\rho_\ell(d/2)^2$, etc. We assume the density of the last layer is $\Theta(1)$. This way the density at the last-but-one layer is $O(1/d)$, and the last layer is real-valued and dense.
Now we state our main result. Note that the degree $d$ is at most a small power of $n$.
Theorem 1
When the degree is $d = n^{\gamma}$ for a small enough constant $\gamma$, and the density of the last layer is a large enough constant (so that the output is dense), the $\{\pm 1\}$ network model can be learnt with high probability in polynomial time with quadratic sample complexity. The network model with $[-1,1]$ weights can be learnt in polynomial time with cubic sample complexity in terms of $1/\eta$, where $\eta$ is the statistical distance between the true distribution and that generated by the learnt model.
Algorithmic ideas. We are unable to analyse existing algorithms. Instead, we give new learning algorithms that exploit the very same structure that makes these random networks interesting in the first place, i.e., that each layer is a denoising autoencoder. The crux of the algorithm is a new twist on the old Hebbian rule [Heb49] that “things that fire together wire together.” In the setting of layerwise learning, this is adapted as follows: “Nodes in the same layer that fire together a lot are likely to be connected (with positive weight) to the same node at the higher layer.” The algorithm consists of looking for such pairwise (or $3$-wise) correlations and putting together this information globally. The global procedure boils down to the graph-theoretic problem of reconstructing a bipartite graph given pairs of nodes that are at distance $2$ in it (see Section 6). This is a variant of the GRAPH SQUARE ROOT problem, which is NP-complete on worst-case instances but solvable for sparse random (or random-like) graphs.
Note that current algorithms (to the extent that they are Hebbian, roughly speaking) can also be seen as leveraging correlations. But putting together this information is done via the language of nonlinear optimization (i.e., an objective function with suitable penalty terms). Our ground truth network is indeed a particular local optimum in any reasonable formulation. It would be interesting to show that existing algorithms provably find the ground truth in polynomial time but currently this seems difficult.
Can our new ideas be useful in practice? We think that using a global reconstruction procedure to leverage local correlations seems promising, especially if it avoids the usual nonlinear optimization. Our proof currently needs that the hidden layers are sparse, and that the edge structure of the ground truth network is “random-like” (in the sense that two distinct features at a level tend to affect fairly disjointish sets of features at the next level). Finally, we note that random neural nets do seem useful in so-called reservoir computing, so perhaps they do provide useful representational power on real data. Such empirical study is left for future work.
Throughout, we need well-known properties of random graphs with expected degree $d$, such as the fact that they are expanders; these properties appear in the appendix. The most important one, the unique neighbor property, appears in the next section.
3 Each layer is a Denoising Autoencoder
As mentioned earlier, modern deep nets research often assumes that the net (or at least some layers in it) should approximately preserve information, and even allow easy going back/forth between representations in two adjacent layers (what we earlier called “reversibility”). Below, $y$ denotes the lower layer and $h$ the higher (hidden) layer. Popular choices of the nonlinearity include the logistic function, soft max, etc.; we use the simple threshold function in our model.
Definition 1
(Denoising autoencoder) An autoencoder consists of a decoding function $D(h) = s(Wh + b)$ and an encoding function $E(y) = s(W'y + b')$, where $W, W'$ are linear transformations, $b, b'$ are fixed vectors, and $s$ is a nonlinear function that acts identically on each coordinate. The autoencoder is denoising if $E(D(h) + \eta) = h$ with high probability, where $h$ is drawn from the distribution of the hidden layer, $\eta$ is a noise vector drawn from the noise distribution, and $D(h) + \eta$ is a shorthand for “$D(h)$ corrupted with noise $\eta$.” The autoencoder is said to use weight tying if $W' = W^{T}$.
In empirical work the denoising autoencoder property is only implicitly imposed on the deep net by minimizing the reconstruction error $\|y - D(E(y + \eta))\|$, where $\eta$ is the noise vector. Our definition is slightly different but is actually stronger since $y$ is exactly $D(h)$ according to the generative model. Our definition implies the existence of an encoder $E$ that makes the penalty term exactly zero. We show that in our ground truth net (whether with $[-1,1]$ or $\{\pm 1\}$ weights) every pair of successive levels whp satisfies this definition, and with weight tying.
We show that a single-layer random network is a denoising autoencoder if the input layer is a random sparse vector and the resulting output layer is also fairly sparse.
Lemma 2
If the density of the output layer is sufficiently small (i.e., the assignment to the observed layer is also fairly sparse), then the single-layer network above is a denoising autoencoder with high probability (over the choice of the random graph and weights), where the noise distribution is allowed to flip every output bit independently with some small constant probability. It uses weight tying.
The proof of this lemma relies heavily on a property of random graphs, called the strong unique-neighbor property.
For any node $u$ and any subset $S$ of nodes in the same layer as $u$, the unique neighbors of $u$ with respect to $S$ are the neighbors of $u$ that are adjacent to no other node in $S$.
Property 1
In a bipartite graph, a node $u$ has the unique neighbor property with respect to a set $S$ if most of the edge weight incident to $u$ (in absolute value) comes from its unique neighbors with respect to $S$.
The set $S$ has the strong unique neighbor property if every node $u$ has the unique neighbor property with respect to $S \cup \{u\}$.
For our parameters, this property does not hold simultaneously for all sets $S$ of size $\rho n$. However, for any fixed set $S$ of size $\rho n$, it holds with high probability over the randomness of the graph.
Now we sketch the proof of Lemma 2 (details are in the appendix). For convenience assume the edge weights are in $\{\pm 1\}$.
First, the decoder definition is implicit in our generative model: $D(h) = \mathrm{sgn}(Wh)$. (That is, $b = \vec{0}$ in the autoencoder definition.) Let the encoder be $E(y) = \mathrm{sgn}(W^{T}y + b')$ for an appropriate threshold vector $b'$. In other words, the same bipartite graph with different thresholds can transform an assignment on the lower level to the one at the higher level.
To prove this, consider the strong unique-neighbor property of the network. For the set of nodes that are $1$ at the higher level, a majority of their neighbors at the lower level are adjacent only to them and to no other nodes that are $1$. The unique neighbors attached via a $+1$ edge will always be $1$, because there are no other edges that can cancel the $+1$ edge (similarly, the unique neighbors attached via a $-1$ edge will always be $0$). Thus by looking at the set of nodes that are $1$ at the lower level, one can easily infer the correct assignment to the higher level by applying a simple threshold at each node in the higher layer.
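This encode/decode pair can be checked empirically. The sketch below uses $\pm 1$ weights and a threshold of $0.2d$ at the higher layer; that particular constant is our assumption for these parameters (the argument only needs some appropriate constant fraction of $d$):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, s = 2000, 10, 20          # layer size, expected degree, hidden sparsity (assumed)

# Random ±1-weighted bipartite graph with expected degree d.
W = (rng.random((n, n)) < d / n) * rng.choice([-1.0, 1.0], size=(n, n))

# Sparse 0/1 assignment at the hidden layer.
h = np.zeros(n)
h[rng.choice(n, size=s, replace=False)] = 1.0

# Decoder: identical to the generative model, D(h) = sgn(W h).
y = (W @ h > 0).astype(float)

# Encoder with weight tying: the same graph run in reverse with a shifted
# threshold (0.2*d is an assumed constant, not taken from the paper).
h_hat = (W.T @ y > 0.2 * d).astype(float)

print("mismatched bits:", int((h_hat != h).sum()))
```

With these sizes the unique neighbor property holds for most nodes, so almost every hidden bit is recovered; the few errors come from hidden nodes that happen to have unusually few $+1$ edges.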
4 Learning a single layer network
Our algorithm, outlined below (Algorithm 1), learns the network layer by layer starting from the bottom. Thus the key step is that of learning a single layer network, which we now focus on. (Learning the bottommost, real-valued layer is mildly different and is done in Section 7.) This step, as we noted, amounts to learning nonlinear dictionaries with random dictionary elements. The algorithm illustrates how we leverage the sparsity and the randomness of the support graph, and use pairwise or 3-wise correlations combined with our graph recovery procedure of Section 6. We first give a simple algorithm and then outline one that works with better parameters.
For simplicity we describe the algorithm when edge weights are $\pm 1$, and sketch the differences for real-valued weights at the end of this section.
The hidden layer and observed layer each have $n$ nodes, and the generative model assumes the assignment to the hidden layer is a random $0/1$ assignment with $\rho n$ nonzeros.
Say two nodes in the observed layer are related if they have a common neighbor in the hidden layer to which they are both attached via a $+1$ edge.
Step 1: Construct correlation graph: This step is a new twist on the classical Hebbian rule (“things that fire together wire together”).
Claim. In a random sample of the output layer, related pairs are both $1$ with probability at least $0.9\rho$, while unrelated pairs are both $1$ with probability at most $O(\rho^2 d^2)$.
(Proof Sketch): First consider a related pair $u, v$, and let $z$ be a hidden vertex with $+1$ edges to both $u$ and $v$. Let $S$ be the set of neighbors of $u$ and $v$, excluding $z$. The size of $S$ cannot be much larger than $2d$. Under the choice of parameters, we know $\rho d \ll 1$, so the event that no node of $S$ is $1$, conditioned on $z$ being $1$, has probability at least 0.9. Hence the probability of $u$ and $v$ being both $1$ is at least $0.9\rho$. Conversely, if $u, v$ are unrelated, then for both to be $1$ there must be two different causes, namely, nodes $z_1$ and $z_2$ that are $1$, and additionally, are connected to $u$ and $v$ respectively via $+1$ edges. The chance of such a pair existing in a random sparse assignment is at most $O(\rho^2 d^2)$ by the union bound.
Thus, if $\rho$ satisfies $\rho^2 d^2 \ll \rho$, i.e., $\rho d^2 = o(1)$, then using $O(\log n/\rho)$ samples we can recover all related pairs whp, finishing the step.
Step 2: Use the graph recovery procedure to find all edges that have weight $+1$. (See Section 6 for details.)
Step 3: Use the $+1$ edges to encode all the samples $y$.
Although we have only recovered the positive edges, we can use the PartialEncoder algorithm to get $h$ given $y$!
Lemma 3
If the support of $h$ satisfies the strong unique neighbor property, and $y = \mathrm{sgn}(Wh)$, then Algorithm 3 outputs $h$ given $y$ and the positive edges.
This uses the unique neighbor property: every node $z$ with $h_z = 1$ has many unique neighbors that are attached to it via $+1$ edges. All these neighbors must be $1$ in $y$, so $z$ passes a suitable threshold test. On the other hand, the unique neighbor property (applied to $\mathrm{supp}(h) \cup \{z\}$) implies that any $z$ with $h_z = 0$ can have only few positive edges to the $1$'s of $y$. Hence thresholding the number of positive neighbors that are $1$ separates the two cases.
Step 4: Recover all edges of weight $-1$.
Now consider many pairs $(h, y)$, where each $h$ is found using Step 3. Suppose in some sample $y_u = 1$ for some node $u$, and exactly one neighbor $z$ of $u$ in the $+1$ edge graph (which we know entirely) is in $\mathrm{supp}(h)$. Then we can conclude that for any other $z'$ with $h_{z'} = 1$, there cannot be a $-1$ edge $(z', u)$, as this would cancel out the unique $+1$ contribution.
Lemma 4
Given enough samples of pairs $(h, y)$, with high probability (over the random graph and the samples) Algorithm 4 outputs the correct set of $-1$ edges.
To prove this lemma, we just need to lower bound the probability of the following event for any non-edge $(z', u)$: $h_{z'} = 1$, exactly one $+1$ parent of $u$ is in $\mathrm{supp}(h)$, and no $-1$ parent of $u$ is in $\mathrm{supp}(h)$. These three events are almost independent; the first has probability $\rho$, the second has probability $\Theta(\rho d)$, and the third has probability almost 1.
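The rule-out logic of Step 4 can be sketched as follows. We assume the $+1$ edges are already known (as produced by Step 2), and we take the liberty of using the true hidden vectors in place of the output of Step 3; all sizes are our own choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, s, N = 500, 8, 10, 15000    # illustrative sizes (assumed)

# W[u, z] = weight of the edge from hidden node z to observed node u.
W = (rng.random((n, n)) < d / n) * rng.choice([-1.0, 1.0], size=(n, n))
Wpos, Wneg = W > 0, W < 0

H = np.zeros((N, n), dtype=np.float32)
for i in range(N):
    H[i, rng.choice(n, size=s, replace=False)] = 1.0
Y = H @ W.T.astype(np.float32) > 0
Ppos = H @ Wpos.T.astype(np.float32)      # active +1 parents per observed node

# candidate[z, u]: "(z -> u) could still be a -1 edge".  The +1 edges are
# known from Step 2, so they are excluded from the start.
candidate = np.ones((n, n), dtype=bool)
candidate[Wpos.T] = False

for i in range(N):
    supp = np.flatnonzero(H[i])
    # Observed nodes that are 1 with exactly one active +1 parent: no other
    # active node can have a -1 edge into them (it would cancel the +1).
    cols = np.flatnonzero(Y[i] & (Ppos[i] == 1))
    if cols.size:
        candidate[np.ix_(supp, cols)] = False

Eneg = Wneg.T                              # true -1 edges, indexed [z, u]
print("candidates left:", int(candidate.sum()), "true -1 edges:", int(Eneg.sum()))
```

A true $-1$ edge is never ruled out (its presence forces the "exactly one active $+1$ parent" condition to fail), so after enough samples the surviving candidates are essentially exactly the $-1$ edges, except into the rare observed nodes with almost no $+1$ parents.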
Leveraging $3$-wise correlations:
The above sketch used pairwise correlations to recover the $+1$ weights when $\rho d^2 \ll 1$, roughly. It turns out that using $3$-wise correlations allows us to find the correlations under a weaker requirement on $\rho$. Now call three observed nodes related if they are connected to a common node at the hidden layer via $+1$ edges. Then we can prove a claim analogous to the one above, which says that for a related triple, the probability that all three nodes are $1$ is at least $0.9\rho$, while the probability for unrelated triples is roughly at most $O(\rho^3 d^3)$. Thus as long as $\rho^3 d^3 \ll \rho$, it is possible to find related triples correctly. The graph recovery algorithm can be modified to run on the $3$-uniform hypergraph consisting of these related triples to recover the edges.
The end result is the following theorem. This is the learner used to get the bounds stated in our main theorem.
Theorem 5
Suppose our generative neural net model with $\{\pm 1\}$ weights has a single layer and the assignment of the hidden layer is a random $\rho n$-sparse vector, with $\rho$ small enough as above. Then there is an algorithm that runs in polynomial time and uses polynomially many samples to recover the ground truth with high probability over the randomness of the graph and the samples.
When weights are real numbers.
We only sketch this and leave the details to the appendix. Surprisingly, Steps 1, 2 and 3 still work. In the proofs, we have only used the signs of the edge weights – the magnitudes can be arbitrary. This is because the proofs in these steps rely on the unique neighbor property: if some node is on (has value $1$), then its unique positive neighbors at the next level will always be on, no matter how small the positive weights might be. Also notice that in PartialEncoder we only use the support of the positive-edge graph, but not the weights.
After Step 3 we have turned the problem of unsupervised learning of the hidden graph to a supervised one in which the outputs are just linear classifiers over the inputs! Thus the weights on the edges can be learnt to any desired accuracy.
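For the real-valued observed layer this supervised problem is ordinary linear regression (for thresholded layers one would instead fit a linear classifier per node). A minimal sketch, with sizes chosen by us and the hidden assignments assumed to be already recovered:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, s, N = 200, 8, 10, 2000    # illustrative sizes (assumed)

# Random graph with real weights in [-1, 1].
W = (rng.random((n, n)) < d / n) * rng.uniform(-1.0, 1.0, size=(n, n))

# Pretend Step 3 already produced the hidden assignment for each sample.
H = np.zeros((N, n))
for i in range(N):
    H[i, rng.choice(n, size=s, replace=False)] = 1.0
Y = H @ W.T          # observed layer is linear in h (bottom layer of the model)

# Each observed coordinate is a linear function of h, so least squares
# recovers the weights (here exactly, since there is no noise).
W_hat = np.linalg.lstsq(H, Y, rcond=None)[0].T
print("max weight error:", float(np.abs(W_hat - W).max()))
```

With $N \gg n$ random sparse rows, the design matrix $H$ has full column rank with high probability, so the recovery is exact up to numerical precision.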
5 Correlations in a Multilayer Network
We now consider multilayer networks, and show how they can be learnt layerwise using a slight modification of our one-layer algorithm at each layer. At a technical level, the difficulty in the analysis is the following: in single-layer learning, we assumed that the higher layer’s assignment is a random sparse binary vector. In the multilayer network, the assignments in intermediate layers (except for the top layer) do not satisfy this, but we will show that the correlations among them are low enough that we can carry forth the argument. Again for simplicity we describe the algorithm for the $\{\pm 1\}$ model, in which the edge weights are $\pm 1$. Also to keep notation simple, we describe how to bound the correlations in the bottommost hidden layer ($i = 1$). It holds almost verbatim for the higher layers. We define $\rho_i$ to be the “expected” density of $1$s in layer $i$. Because of the unique neighbor property, we expect roughly a $\rho_{i+1}(d/2)$ fraction of the nodes of layer $i$ to be $1$; so for successive layers we obtain $\rho_i \approx \rho_{i+1}(d/2)$. (We can also think of this expression as defining $\rho_i$.)
Lemma 6
Consider a network from the $\{\pm 1\}$ model. With high probability (over the random graphs between layers), for any two nodes $u, v$ in layer $1$, the probability that both are $1$ is noticeably larger when they have a common $+1$ parent than when they do not.
Proof
(outline) The first step is to show that for a vertex $u$ in level $i$, the probability that $u = 1$ is within a constant factor of $\rho_i$. This is shown by an inductive argument (details in the full version). (This is the step where we crucially use the randomness of the underlying graph.)
Now suppose $u, v$ have a common neighbor $z$ with $+1$ edges to both of them. Consider the event that $z$ is $1$ and none of the other neighbors of $u$ and $v$ with weight $-1$ edges are $1$ in layer $2$. These conditions ensure that $u = v = 1$; further, they turn out to occur together with probability at least a constant times $\rho_2$, because of the bound from the first step, along with the fact that $u$ and $v$ combined have only $O(d)$ neighbors (and $\rho_2 d \ll 1$), so there is good probability of not picking neighbors with $-1$ edges.
If $u, v$ are not related, it turns out that the probability of interest is at most $O(\rho_2^2 d^2)$ plus a term which depends on whether $u, v$ have a common ancestor in layer $3$ in the graph restricted to $+1$ edges. Intuitively, picking one of these common ancestors could result in both $u$ and $v$ being $1$. By our choice of parameters, we will have $\rho_2^2 d^2 \ll \rho_2$, and also the additional term will be small, which implies the desired conclusion.
Then as before, we can use graph recovery to find all the $+1$ edges in the graph at the bottommost layer. This can then be used (as in Step 3) in the single layer algorithm to encode $y$ and obtain values for $h^{(1)}$. Now as before, we have many pairs $(h^{(1)}, y)$, and thus using precisely the reasoning of Step 4 earlier, we can obtain the full graph at the bottom layer.
This argument can be repeated after ‘peeling off’ the bottom layer, thus allowing us to learn layer by layer.
6 Graph Recovery
Graph reconstruction consists of recovering a graph given information about its subgraphs [BH77]. A prototypical problem is the Graph Square Root problem, which calls for recovering a graph given all pairs of nodes whose distance is at most $2$. This is NP-hard.
Definition 2 (Graph Recovery)
Let $G_1$ be an unknown random bipartite graph between a set $U$ of $n$ vertices and a set $V$ of $n$ vertices, where each edge is picked with probability $d/n$ independently.
Given: Graph $\hat{G}$ on $V$, where $(u, v)$ is an edge of $\hat{G}$ iff $u$ and $v$ share a common parent in $G_1$ (i.e., there is a $z \in U$ with $(z, u) \in G_1$ and $(z, v) \in G_1$).
Goal: Find the bipartite graph $G_1$.
Some of our algorithms (using $3$-wise correlations) need to solve the analogous problem where we are given triples of nodes which are mutually at distance $2$ from each other, which we will not detail for lack of space.
For a node $z \in U$, we let $F(z)$ denote the set of neighbors of $z$ in $G_1$; similarly, for $v \in V$ we write $\hat{F}(v)$ for the set of neighbors of $v$ in $\hat{G}$. Now for the recovery algorithm to work, we need the following properties (all satisfied whp by a random graph with the right parameters):

1. For any $u, v \in V$ with a common parent, all but a small number of the common neighbors of $u$ and $v$ in $\hat{G}$ are neighbors of a common parent of $u$ and $v$ in $G_1$.

2. For any two distinct $z, z' \in U$, $|F(z) \cap F(z')|$ is small.

3. For any $z \in U$ and any $v \in V$ with $v \notin F(z)$, $v$ is adjacent in $\hat{G}$ to only a small fraction of $F(z)$.

4. For any $z \in U$, at least a constant fraction of the pairs $u, v \in F(z)$ have no common parent other than $z$.
The first property says “most correlations are generated by a common cause”: all but possibly a few of the common neighbors of $u$ and $v$ in $\hat{G}$ are in fact neighbors of a common parent of $u$ and $v$ in $G_1$.
The second property basically says that the sets $F(z)$ should be almost disjoint; this is clear because the sets are chosen at random.
The third property says that if a vertex $v$ is not connected to the cause $z$, then it cannot be correlated with many of the neighbors of $z$.
The fourth property says that every cause introduces a significant number of correlations that are unique to that cause.
In fact, Properties 2-4 are closely related to the unique neighbor property.
Lemma 7
When the graph $\hat{G}$ satisfies Properties 1-4, Algorithm 5 successfully recovers the graph $G_1$ in expected time polynomial in $n$.
Proof
We first show that when the pair $(u, v)$ has more than one common cause, the condition in the if statement must be false. This follows from Property 2: the candidate set contains the union $F(z_1) \cup F(z_2)$ of the neighborhoods of two distinct causes, and since by Property 2 these neighborhoods are almost disjoint, the candidate set is too large and the condition in the if statement is false.
Then we show that if $(u, v)$ has a unique common cause $z$, the recovered set will be equal to $F(z)$. By Property 1, the set of common neighbors of $u$ and $v$ in $\hat{G}$ is contained in $F(z)$ up to a few extra vertices.
Any vertex in $F(z)$ is connected in $\hat{G}$ to every other vertex in $F(z)$ (they share the parent $z$). Therefore it survives the pruning and must be in the recovered set.
Any vertex outside $F(z)$, by Property 3, can only be connected to a small fraction of the vertices in $F(z)$. Therefore it is pruned and is not in the recovered set.
Following these arguments, the recovered set must be equal to $F(z)$, and the algorithm successfully learns the edges incident to $z$.
The algorithm will successfully find all vertices $z \in U$ because of Property 4: for every $z$ there are enough edges in $\hat{G}$ that are caused only by $z$. When one of them is sampled, the algorithm successfully learns the vertex $z$.
Finally we bound the running time. By Property 4, the algorithm identifies a new vertex of $U$ within a bounded number of iterations in expectation, and each iteration takes polynomial time. Therefore the algorithm runs in expected polynomial time.
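The recovery loop can be sketched as follows. The acceptance threshold $1.3d$ and the clique-pruning rule below are our own choices standing in for the exact conditions of Algorithm 5:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
m, n, d = 40, 2000, 8     # causes, observed vertices, children per cause (assumed)

# Unknown bipartite graph G1: each cause gets d random children.
children = [frozenset(rng.choice(n, size=d, replace=False)) for _ in range(m)]

# The given graph Ĝ: (u, v) is an edge iff u and v share a parent in G1.
adj = [set() for _ in range(n)]
for ch in children:
    for u, v in combinations(sorted(ch), 2):
        adj[u].add(v)
        adj[v].add(u)

recovered = set()
for u in range(n):
    for v in adj[u]:
        if v <= u:
            continue
        # Common neighbourhood of the pair, plus the pair itself.
        S = (adj[u] & adj[v]) | {u, v}
        if len(S) > 1.3 * d:      # likely more than one common cause: reject
            continue
        # Children of one cause form a clique in Ĝ; prune stray vertices
        # that are not adjacent to most of S (cf. Properties 1 and 3).
        core = frozenset(w for w in S if len(adj[w] & S) >= 0.8 * (len(S) - 1))
        if len(core) >= 2:
            recovered.add(core)

exact = sum(1 for ch in children if ch in recovered)
print(f"recovered {exact} of {m} neighbourhoods exactly")
```

Each cause has many child pairs for which it is the unique common cause (Property 4), so with these sizes essentially every neighborhood $F(z)$ is recovered exactly.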
7 Learning the lowermost (realvalued) layer
Note that in our model, the lowest (observed) layer is real-valued and does not have threshold gates. Thus our earlier learning algorithm cannot be applied as is. However, we see that the same paradigm – identifying correlations and using graph recovery – can be used.
The first step is to show that for a random weighted graph $G$, the linear decoder $D(h) = Gh$ and the corresponding encoder form a denoising autoencoder with real-valued outputs, as in Bengio et al. [BCV13].
Lemma 8
If $G$ is a random weighted graph, the encoder $E(y) = \mathrm{sgn}(G^{T}y + b')$ (for an appropriate threshold vector $b'$) and the linear decoder $D(h) = Gh$ form a denoising autoencoder, for noise vectors which have independent components, each having bounded variance.
The next step is to show a bound on correlations as before. For simplicity we state it assuming the hidden layer $h^{(1)}$ has a random assignment of sparsity $\rho n$. In the full version we state it keeping in mind the higher layers, as we did in the previous sections.
Theorem 9
When $\rho d$ is appropriately small, with high probability over the choice of the weights and the choice of the graph, for any three nodes $u, v, s$ of the hidden layer, the assignment $y$ to the bottom layer satisfies:

If $u$ and $v$ have no common neighbor, then the correlation between the corresponding outputs is small.

If $u$ and $v$ have a unique common neighbor, then the correlation between the corresponding outputs is bounded away from zero.
8 Two layers cannot be represented by one layer
In this section we show that a two-layer network with $\pm 1$ weights is more expressive than a one-layer network with arbitrary weights. A two-layer network consists of random graphs $G_1$ and $G_2$ with random $\pm 1$ weights on the edges. Viewed as a generative model, its input is $h^{(2)}$ and the output is $y = \mathrm{sgn}(G_1\,\mathrm{sgn}(G_2 h^{(2)}))$. We will show that a single-layer network, even with arbitrary weights and arbitrary threshold functions, must generate a fairly different distribution.
Lemma 10
For almost all choices of the graphs $(G_1, G_2)$, the following is true. For every one-layer network with matrix $A$ and threshold vector $b$, if $h^{(2)}$ is chosen to be a random sparse vector, then with probability bounded away from zero (over the choice of $h^{(2)}$) the output of the one-layer network differs from the output of the two-layer network.
The idea is that the cancellations possible in the two-layer network simply cannot all be accommodated in a single-layer network, even using arbitrary weights. More precisely, even the bit at a single output node cannot be well-represented by a simple threshold function.
First, observe that the output at a node $u$ is determined by the values of the nodes at the top layer that are its ancestors. It is not hard to show that in the one-layer net, there should be no edge between $u$ and any node that is not its ancestor. Then consider the structure in Figure 2. Assume all other ancestors of $u$ are 0 (which happens with nonnegligible probability), and focus on the values of two particular ancestors $h_1, h_2$. When these values are $(0,0)$ and $(1,1)$, $u$ is off. When they are $(1,0)$ and $(0,1)$, $u$ is on. This is impossible for a one-layer network $u = \mathrm{sgn}(w_1 h_1 + w_2 h_2 - t)$: the first two cases ask for $t \ge 0$ and $w_1 + w_2 \le t$, while the second two ask for $w_1 > t$ and $w_2 > t$, whence $w_1 + w_2 > 2t \ge t$, a contradiction.
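The contradiction can be made concrete with a tiny XOR gadget. The exact wiring of Figure 2 is not visible here, so the gadget below is our own reconstruction of the phenomenon with $\pm 1$ weights:

```python
import numpy as np

sgn = lambda x: (x > 0).astype(float)

# Two-layer ±1 net computing XOR of two top-level bits:
#   a1 = sgn(h1 - h2), a2 = sgn(h2 - h1), y = sgn(a1 + a2)
G2 = np.array([[1.0, -1.0], [-1.0, 1.0]])   # hidden layer weights
G1 = np.array([[1.0, 1.0]])                 # output weights

def two_layer(h):
    return sgn(G1 @ sgn(G2 @ h))[0]

patterns = [(0, 0), (1, 1), (1, 0), (0, 1)]
outputs = [two_layer(np.array(p, dtype=float)) for p in patterns]
# off, off, on, on: exactly the non-monotone behaviour described in the text.

# No single threshold unit sgn(w1*h1 + w2*h2 - t) can match this pattern
# (the four constraints are contradictory), so a random search finds nothing.
rng = np.random.default_rng(6)
hits = 0
for _ in range(20000):
    w1, w2, t = rng.uniform(-5, 5, size=3)
    vals = [float(w1 * p[0] + w2 * p[1] > t) for p in patterns]
    hits += (vals == outputs)
print("single-layer candidates matching XOR:", hits)
```

The random search is only an illustration; the algebraic argument above already shows that no choice of $(w_1, w_2, t)$ can succeed, so the hit count is provably zero.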
9 Conclusions
Rigorous analysis of interesting subcases of any ML problem can be beneficial for triggering further improvements: see e.g., the role played in Bayes nets by the rigorous analysis of message-passing algorithms for trees and graphs of low treewidth. This is the spirit in which to view our consideration of a random neural net model (though note that there is some empirical work in reservoir computing using randomly wired neural nets).
The concept of a denoising autoencoder (with weight tying) suggests to us a graph with random-like properties. We would be very interested in an empirical study of the randomness properties of actual deep nets learnt in real life. (For example, in [KSH12] some of the layers use convolution, which is decidedly non-random. But other layers are learnt by backpropagation starting with a complete graph and may end up more random-like.)
Network randomness is not so crucial for single-layer learning. But for provable layerwise learning we rely on the support (i.e., nonzero edges) being random: this is crucial for controlling (i.e., upper bounding) correlations among features appearing in the same hidden layer (see Lemma 6). Provable layerwise learning under weaker assumptions would be very interesting.
Acknowledgments
We would like to thank Yann LeCun, Ankur Moitra, Sushant Sachdeva, and Linpeng Tang for numerous helpful discussions throughout various stages of this work. This work was done when the first, third and fourth authors were visiting EPFL.
References
 [AFH12] Anima Anandkumar, Dean P. Foster, Daniel Hsu, Sham M. Kakade, and Yi-Kai Liu. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 25, 2012.
 [AGM12] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models – going beyond SVD. In IEEE 53rd Annual Symposium on Foundations of Computer Science, FOCS 2012, New Brunswick NJ, USA, October 20–23, 2012, pages 1–10.
 [AGM13] Sanjeev Arora, Rong Ge, and Ankur Moitra. New algorithms for learning incoherent and overcomplete dictionaries. ArXiv, 1308.6273, 2013.
 [BCV13] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, 2013.
 [Ben09] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. Also published as a book. Now Publishers, 2009.
 [BGI08] R. Berinde, A.C. Gilbert, P. Indyk, H. Karloff, and M.J. Strauss. Combining geometry and combinatorics: a unified approach to sparse signal recovery. In 46th Annual Allerton Conference on Communication, Control, and Computing, pages 798–805, 2008.
 [BH77] J. Adrian Bondy and Robert L. Hemminger. Graph reconstruction: a survey. Journal of Graph Theory, 1(3):227–268, 1977.
 [CS09] Youngmin Cho and Lawrence Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems 22, pages 342–350. 2009.
 [Don06] David L Donoho. Compressed sensing. Information Theory, IEEE Transactions on, 52(4):1289–1306, 2006.
 [Heb49] Donald O. Hebb. The Organization of Behavior: A Neuropsychological Theory. Wiley, June 1949.
 [HK13] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical gaussians: moment methods and spectral decompositions. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science, pages 11–20, 2013.
 [HKZ12] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
 [JKS02] Jeffrey C. Jackson, Adam R. Klivans, and Rocco A. Servedio. Learnability beyond . In Proceedings of the thirty-fourth annual ACM symposium on Theory of Computing, pages 776–784. ACM, 2002.
 [KS09] Adam R Klivans and Alexander A Sherstov. Cryptographic hardness for learning intersections of halfspaces. Journal of Computer and System Sciences, 75(1):2–12, 2009.
 [KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114. 2012.
 [LSSS13] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. A provably efficient algorithm for training deep networks. ArXiv, 1304.7045, 2013.
 [MV10] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of gaussians. In 51st Annual Symposium on Foundations of Computer Science (FOCS), 2010.
 [VLBM08] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, pages 1096–1103, 2008.