In this paper, we consider regression problems with one-hidden-layer neural networks (1NNs). We distill some properties of activation functions that lead to local strong convexity in the neighborhood of the ground-truth parameters for the 1NN squared-loss objective. Most popular nonlinear activation functions satisfy the distilled properties, including rectified linear units (ReLUs), leaky ReLUs, squared ReLUs and sigmoids. For activation functions that are also smooth, we show local linear convergence guarantees of gradient descent under a resampling rule. For homogeneous activations, we show tensor methods are able to initialize the parameters to fall into the local strong convexity region. As a result, tensor initialization followed by gradient descent is guaranteed to recover the ground truth with sample complexity d · log(1/ε) · poly(k, λ) and computational complexity n · d · poly(k, λ) for smooth homogeneous activations with high probability, where d is the dimension of the input, k (k ≤ d) is the number of hidden nodes, λ is a conditioning property of the ground-truth parameter matrix between the input layer and the hidden layer, ε is the targeted precision and n is the number of samples. To the best of our knowledge, this is the first work that provides recovery guarantees for 1NNs with both sample complexity and computational complexity linear in the input dimension and logarithmic in the precision.
Contents
 1 Introduction
 2 Related Work
 3 Problem Formulation
 4 Positive Definiteness of Hessian
 5 Tensor Methods for Initialization
 6 Global Convergence
 7 Numerical Experiments
 8 Conclusion
 A Notation
 B Preliminaries
 C Properties of Activation Functions

D Local Positive Definiteness of Hessian
 D.1 Main Results for Positive Definiteness of Hessian
 D.2 Positive Definiteness of Population Hessian at the Ground Truth
 D.3 Error Bound of Hessians near the Ground Truth for Smooth Activations
 D.4 Error Bound of Hessians near the Ground Truth for Nonsmooth Activations
 D.5 Positive Definiteness for a Small Region
 E Tensor Methods
 F Acknowledgments
1 Introduction
Neural Networks (NNs) have achieved great practical success recently. Many theoretical contributions have been made very recently to understand the extraordinary performance of NNs. The remarkable results of NNs on complex tasks in computer vision and natural language processing inspired works on the expressive power of NNs [CSS16, CS16, RPK16, DFS16, PLR16, MPCB14, Tel16]. Indeed, several works found NNs to be very powerful, and that deeper networks are more powerful. However, due to the high nonconvexity of NNs, knowing the expressivity of NNs doesn't guarantee that the targeted functions will be learned. Therefore, several other works focused on the achievability of global optima. Many of them considered the overparameterized setting, where the global optima, or local minima close to the global optima, can be achieved when the number of parameters is large enough, including [FB16, HV15, LSSS14, DPG14, SS16, HM17]. This, however, easily leads to overfitting and can't provide any generalization guarantees, which is actually the essential goal in most tasks.
A few works have considered generalization performance. For example, [XLS17] provide a generalization bound under the Rademacher generalization analysis framework. Recently, [ZBH17] describe some experiments showing that NNs are complex enough that they actually memorize the training data but still generalize well. As they claim, this cannot be explained by applying generalization analysis techniques, like VC dimension and Rademacher complexity, to classification loss (although it does not rule out a margins analysis; see, for example, [Bar98]; their experiments involve the unbounded cross-entropy loss).
In this paper, we don't develop a new generalization analysis. Instead we focus on the parameter-recovery setting, where we assume there are underlying ground-truth parameters, and we provide recovery guarantees for the ground-truth parameters up to equivalent permutations. Since the parameters are exactly recovered, the generalization performance is also guaranteed.
Several other techniques have also been proposed to recover the parameters or to guarantee generalization performance, such as tensor methods [JSA15] and kernel methods [AGMR17]. These methods require sample complexity or computational complexity superlinear in the dimension (cubic or worse), which can be intractable in practice. We propose an algorithm that has recovery guarantees for 1NNs with sample complexity linear in the dimension and computational complexity linear in both the number of samples and the dimension, under some mild assumptions.
Recently, [Sha16] show that neither specific assumptions on the niceness of the input distribution nor niceness of the target function alone is sufficient to guarantee learnability using gradient-based methods. In this paper, we assume data points are sampled from a Gaussian distribution and the parameters of hidden neurons are linearly independent.
Our main contributions are as follows,

We distill some properties for activation functions, which are satisfied by a wide range of activations, including ReLU, squared ReLU, sigmoid and tanh. With these properties we show positive definiteness (PD) of the Hessian in the neighborhood of the ground-truth parameters given enough samples (Theorem 4.2). Further, for activations that are also smooth, we show local linear convergence is guaranteed using gradient descent.

We propose a tensor method to initialize the parameters such that the initialized parameters fall into the local positive definiteness region. Our contribution is that we reduce the sample/computational complexity from a cubic dependency on the dimension to a linear dependency (Theorem 5.6).
2 Related Work
The recent empirical success of NNs has boosted their theoretical analyses [FZK16, Bal16, BMBY16, SBL16, APVZ14, AGMR17, GKKT17]. In this paper, we classify them into three main directions.
2.1 Expressive Power
Expressive power is studied to understand the remarkable performance of neural networks on complex tasks. Although one-hidden-layer neural networks with sufficiently many hidden nodes can approximate any continuous function [Hor91], shallow networks can't achieve the same practical performance as deep networks. Theoretically, several recent works show the depth of NNs plays an essential role in their expressive power [DFS16]. As shown in [CSS16, CS16, Tel16], functions that can be implemented by a deep network of polynomial size require exponential size to be implemented by a shallow network. [RPK16, PLR16, MPCB14, AGMR17] design measures of expressivity that display an exponential dependence on the depth of the network. However, increasing the expressivity of NNs, or their depth, also increases the difficulty of learning a good enough model. In this paper, we focus on 1NNs and provide recovery guarantees using a finite number of samples.
2.2 Achievability of Global Optima
The global convergence is in general not guaranteed for NNs due to their nonconvexity. It is widely believed that training deep models using gradientbased methods works so well because the error surface either has no local minima, or if they exist they need to be close in value to the global minima. [SCP16] present examples showing that for this to be true additional assumptions on the data, initialization schemes and/or the model classes have to be made. Indeed the achievability of global optima has been shown under many different types of assumptions.
In particular, [CHM15] analyze the loss surface of a special random neural network through spin-glass theory and show that it has exponentially many local optima, whose loss is small and close to that of a global optimum. Later on, [Kaw16] eliminate some assumptions made by [CHM15] but still require the independence of activations as in [CHM15], which is unrealistic. [SS16] study the geometric structure of the neural network objective function. They show that, with high probability, random initialization will fall into a basin with a small objective value when the network is overparameterized. [LSSS14] consider polynomial networks where the activations are square functions, which are typically not used in practice. [HV15] show that when a local minimum has zero parameters related to a hidden node, a global optimum is achieved. [FB16] study the landscape of 1NNs in terms of topology and geometry, and show that the level set becomes connected as the network is increasingly overparameterized. [HM17] show that products of matrices don't have spurious local minima and that deep residual networks can represent any function on a sample, as long as the number of parameters is larger than the sample size. [SC16] consider overspecified NNs, where the number of samples is smaller than the number of weights. [DPG14] propose a new approach to second-order optimization that identifies and attacks the saddle point problem in high-dimensional nonconvex optimization. They apply the approach to recurrent neural networks and show practical performance. [AGMR17] use results from tropical geometry to show global optimality of an algorithm, but the required computational complexity is prohibitive.
Almost all of these results require the number of parameters to be larger than the number of data points, which easily leads to overfitting, so no generalization performance is guaranteed. In this paper, we propose an efficient and provable algorithm for 1NNs that can recover the underlying ground-truth parameters.
2.3 Generalization Bound / Recovery Guarantees
The achievability of global optima of the objective from the training data doesn’t guarantee the learned model to be able to generalize well on unseen testing data. In the literature, we find three main approaches to generalization guarantees.
1) Use generalization analysis frameworks, including VC dimension/Rademacher complexity, to bound the generalization error. A few works have studied the generalization performance for NNs. [XLS17] follow [SC16] but additionally provide generalization bounds using Rademacher complexity. They assume the obtained parameters are in a regularization set so that the generalization performance is guaranteed, but this assumption can’t be justified theoretically. [HRS16] apply stability analysis to the generalization analysis of SGD for convex and nonconvex problems, arguing early stopping is important for generalization performance.
2) Assume an underlying model and try to recover this model. This direction is popular for many nonconvex problems including matrix completion/sensing [JNS13, Har14, SL15, BLWZ17], mixed linear regression [ZJD16], subspace recovery [EV09] and other latent models [AGH14].
Without making any assumptions, those nonconvex problems are intractable [AGKM12, GV15, SWZ17a, GG11, RSW16, SR11, HM13, AGM12, YCS14]. Recovery guarantees for NNs also need assumptions. Several approaches, each under its own assumptions, provide recovery guarantees in different NN settings.
Tensor methods [AGH14, WTSA15, WA16, SWZ16] are a general tool for recovering models with latent factors by assuming the data distribution is known. Some existing recovery guarantees for NNs are provided by tensor methods [SA15, JSA15]. However, [SA15] only provide guarantees for recovering the subspace spanned by the weight matrix, and no sample complexity is given, while [JSA15] require a sample complexity cubic in the dimension. In this paper, we use tensor methods only as an initialization step, so we don't need a very accurate estimation of the moments, which enables us to reduce the total sample complexity from cubic in the dimension to linear.
[ABGM14] provide polynomial sample complexity and computational complexity bounds for learning deep representations in the unsupervised setting, and they need to assume the weights are sparse and randomly distributed.
[Tia17] analyzes 1NNs by assuming Gaussian inputs in a supervised setting, in particular, regression and classification with a teacher. This paper also considers this setting. However, there are some key differences. a) [Tia17] requires the second-layer parameters to be all ones, while we can learn these parameters. b) In [Tia17], the ground-truth first-layer weight vectors are required to be orthogonal, while we only require linear independence. c) [Tia17] requires a good initialization but doesn't provide an initialization method, while we show the parameters can be efficiently initialized by tensor methods. d) In [Tia17], only the population case (infinite sample size) is considered, so there is no sample complexity analysis, while we show finite sample complexity.
Recovery guarantees for convolutional neural networks with Gaussian inputs are provided in [BG17], which shows a global convergence guarantee of gradient descent on a one-hidden-layer no-overlap convolutional neural network. However, it considers the population case, so no sample complexity is provided. Also, the analysis is tailored to the specific activation, and the no-overlap case is very unlikely to be used in practice. In this paper, we consider a large range of activation functions, but for one-hidden-layer fully-connected NNs.
3) Improper Learning. In the improper learning setting for NNs, the learning algorithm is not restricted to output a NN, but only needs to output a prediction function whose error is not much larger than the error of the best NN among all the NNs considered. [ZLJ16, ZLW16] propose kernel methods to learn a prediction function which is guaranteed to have generalization performance close to that of the NN. However, the sample complexity and computational complexity are exponential. [AZS14] transform NNs into convex semidefinite programming. The works by [Bac14] and [BRV05] are also in this direction. However, these methods do not actually learn the original NNs. Another work, [ZLWJ17], uses random initializations to achieve arbitrarily small excess risk. However, their algorithm has exponential running time.
Roadmap.
The paper is organized as follows. In Section 3, we present our problem setting and show three key properties of activations required for our guarantees. In Section 4, we introduce the formal theorem of local strong convexity and show local linear convergence for smooth activations. Section 5 presents a tensor method to initialize the parameters so that they fall into the basin of the local strong convexity region. Section 6 combines the two into a global convergence guarantee, and Section 7 verifies our results numerically.
3 Problem Formulation
We consider the following regression problem. Given a set of n samples

S = {(x_1, y_1), …, (x_n, y_n)} ⊂ ℝ^d × ℝ,

let 𝒟 denote an underlying distribution over ℝ^d × ℝ with parameters W* = [w_1*, …, w_k*] ∈ ℝ^{d×k} and v* ∈ ℝ^k, such that each sample (x, y) ∈ S is sampled i.i.d. from this distribution, with

y = Σ_{i=1}^k v_i* · φ(w_i*ᵀ x),    (1)

where φ(·) is the activation function and k is the number of nodes in the hidden layer. The main question we want to answer is: How many samples are sufficient to recover the underlying parameters?
It is well-known that training a one-hidden-layer neural network is NP-complete [BR88]. Thus, without making any assumptions, learning deep neural networks is intractable. Throughout the paper, we assume x follows a standard normal distribution; the data is noiseless; the dimension of the input data is at least the number of hidden nodes; and the activation function satisfies some reasonable properties.
Actually, our results can easily be extended to a multivariate Gaussian distribution with positive definite covariance and zero mean, since we can estimate the covariance first and then transform the input to a standard normal distribution, at some loss of accuracy. Although this paper focuses on the regression problem, classification problems can be transformed to regression problems if a good teacher is provided, as described in [Tia17]. Our analysis requires k to be no greater than d, since the first-layer parameters would be linearly dependent otherwise.
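To make the setting concrete, here is a minimal sketch (our own NumPy code; all names are ours, not the paper's) of drawing samples from the noiseless model in Eq. (1) under the Gaussian-input assumption:

```python
import numpy as np

def sample_1nn(n, d, k, phi, rng):
    """Draw n i.i.d. samples from the noiseless model of Eq. (1):
    y = sum_i v_i* phi(w_i*^T x) with x ~ N(0, I_d)."""
    W_star = rng.standard_normal((d, k))      # columns are a.s. linearly independent (k <= d)
    v_star = rng.choice([-1.0, 1.0], size=k)  # discrete second-layer signs
    X = rng.standard_normal((n, d))           # Gaussian inputs: the key distributional assumption
    Y = phi(X @ W_star) @ v_star              # noiseless labels
    return X, Y, W_star, v_star

# squared ReLU: a smooth homogeneous activation, also used in the experiments
squared_relu = lambda z: np.maximum(z, 0.0) ** 2

rng = np.random.default_rng(0)
X, Y, W_star, v_star = sample_1nn(n=1000, d=10, k=3, phi=squared_relu, rng=rng)
```

The recovery question is then: given only (X, Y) generated this way, how large must n be to recover W_star and v_star?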
For the activation function φ, we assume it is continuous, and if it is nonsmooth, we take its first derivative to be its left derivative. Furthermore, we assume it satisfies Properties 3.1, 3.2, and 3.3. These properties are critical for the later analyses. We also observe that most activation functions actually satisfy these three properties.
Property 3.1.
The first derivative φ'(z) is nonnegative and homogeneously bounded, i.e., 0 ≤ φ'(z) ≤ L₁|z|^p for some constants L₁ > 0 and p ≥ 0.
Property 3.2.
Let α_q(σ) = E_{z∼N(0,1)}[φ'(σ · z) z^q] for q ∈ {0, 1, 2}, and let β_q(σ) = E_{z∼N(0,1)}[φ'(σ · z)² z^q] for q ∈ {0, 2}. Let ρ(σ) denote min{β_0(σ) − α_0(σ)² − α_1(σ)², β_2(σ) − α_1(σ)² − α_2(σ)², α_0(σ) · α_2(σ) − α_1(σ)²}. The first derivative φ'(z) satisfies that, for all σ > 0, we have ρ(σ) > 0.
Property 3.3.
The second derivative φ''(z) is either (a) globally bounded, i.e., |φ''(z)| ≤ L₂ for some constant L₂ (i.e., φ is smooth), or (b) φ''(z) = 0 except for e points, where e is a finite constant.
Remark 3.4.
The first two properties are related to the first derivative φ'(z) and the last one is about the second derivative φ''(z). At a high level, Property 3.1 requires φ to be nondecreasing with a homogeneously bounded derivative; Property 3.2 requires φ to be highly nonlinear; Property 3.3 requires φ to be either smooth or piecewise linear.
Theorem 3.5.
ReLU φ(z) = max(z, 0), leaky ReLU φ(z) = max(z, 0.01z), squared ReLU φ(z) = max(z, 0)² and any nonlinear nondecreasing smooth function with bounded symmetric first derivative, like the sigmoid function φ(z) = 1/(1 + e^{−z}), the tanh function and the erf function φ(z) = ∫_0^z e^{−t²} dt, satisfy Properties 3.1, 3.2 and 3.3. The linear function φ(z) = z doesn't satisfy Property 3.2, and the quadratic function φ(z) = z² doesn't satisfy Properties 3.1 and 3.2.
The proof can be found in Appendix C.
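The dichotomy of Theorem 3.5 can also be checked numerically. The sketch below (our own code, under our reading of Property 3.2: α_q(σ) = E[φ'(σz)z^q] and β_q(σ) = E[φ'(σz)²z^q] for z ∼ N(0,1), with ρ(σ) the minimum of the three combinations above) evaluates ρ(σ) by Gauss-Hermite quadrature, giving ρ > 0 for ReLU and ρ = 0 for the linear function:

```python
import numpy as np

def rho(dphi, sigma, n_nodes=80):
    """Evaluate rho(sigma) of Property 3.2 (under our reading) by
    Gauss-Hermite quadrature over z ~ N(0, 1), given phi' as dphi."""
    t, w = np.polynomial.hermite.hermgauss(n_nodes)
    z = np.sqrt(2.0) * t          # rescale nodes from weight exp(-t^2) to N(0,1)
    w = w / np.sqrt(np.pi)        # rescale weights so they sum to 1
    E = lambda f: float(np.sum(w * f(z)))
    a = [E(lambda z, q=q: dphi(sigma * z) * z**q) for q in (0, 1, 2)]
    b = {q: E(lambda z, q=q: dphi(sigma * z)**2 * z**q) for q in (0, 2)}
    return min(b[0] - a[0]**2 - a[1]**2,
               b[2] - a[1]**2 - a[2]**2,
               a[0] * a[2] - a[1]**2)

relu_d = lambda z: (z > 0).astype(float)   # phi' for ReLU
lin_d  = lambda z: np.ones_like(z)         # phi' for the linear function

# ReLU: rho(1) > 0 (Property 3.2 holds); linear: rho(1) = 0 (it fails)
```

For the linear function, all three quantities in the minimum reduce to combinations of Gaussian moments that cancel, so ρ vanishes and local strong convexity cannot hold.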
4 Positive Definiteness of Hessian
In this section, we study the Hessian of the empirical risk near the ground truth. We consider the case when v* is already known. Note that for homogeneous activations, we can assume v_i* ∈ {−1, +1}, since v_i* · φ(w_i*ᵀx) = sign(v_i*) · φ(|v_i*|^{1/p} · w_i*ᵀx), where p is the degree of homogeneity. As v* only takes discrete values for homogeneous activations, in the next section we show we can exactly recover v* using tensor methods with finite samples.
For a set of samples S, we define the Empirical Risk,

f_S(W) = (1/(2|S|)) · Σ_{(x,y)∈S} ( Σ_{i=1}^k v_i* φ(w_iᵀx) − y )².    (2)

For a distribution 𝒟, we define the Expected Risk,

f_𝒟(W) = (1/2) · E_{(x,y)∼𝒟}[ ( Σ_{i=1}^k v_i* φ(w_iᵀx) − y )² ].    (3)
Let's calculate the gradient and the Hessian of f_𝒟 and f_S. For each i ∈ [k], the partial gradient of f_𝒟(W) with respect to w_i can be represented as

∂f_𝒟(W)/∂w_i = E_{(x,y)∼𝒟}[ ( Σ_{l=1}^k v_l* φ(w_lᵀx) − y ) · v_i* φ'(w_iᵀx) · x ].

For each i ∈ [k] and l ∈ [k] with l ≠ i, the second partial derivative of f_𝒟(W) for the (i, l)-th off-diagonal block is

∂²f_𝒟(W)/(∂w_i ∂w_l) = E_{(x,y)∼𝒟}[ v_i* v_l* φ'(w_iᵀx) φ'(w_lᵀx) · x xᵀ ],

and for each i ∈ [k], the second partial derivative of f_𝒟(W) for the i-th diagonal block is

∂²f_𝒟(W)/∂w_i² = E_{(x,y)∼𝒟}[ ( Σ_{l=1}^k v_l* φ(w_lᵀx) − y ) · v_i* φ''(w_iᵀx) · x xᵀ + v_i*² φ'(w_iᵀx)² · x xᵀ ].

If φ(z) is nonsmooth, we use the Dirac function and its derivatives to represent φ''(z). Replacing the expectation over 𝒟 by the average over the samples S, we obtain the Hessian of the empirical risk.

Considering the case when W = W*, the residual Σ_{l} v_l* φ(w_l*ᵀx) − y vanishes, so for all i ∈ [k] we have

∂²f_𝒟(W*)/∂w_i² = E_{(x,y)∼𝒟}[ v_i*² φ'(w_i*ᵀx)² · x xᵀ ].

If Property 3.3(b) is satisfied, φ''(w_iᵀx) = 0 almost surely. So in this case the diagonal blocks of the empirical Hessian can be written as

∂²f_S(W)/∂w_i² = (1/|S|) · Σ_{(x,y)∈S} v_i*² φ'(w_iᵀx)² · x xᵀ.
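As a concrete reference (our own NumPy sketch, not the paper's code), the empirical risk of Eq. (2) and its gradient with respect to W can be computed as follows:

```python
import numpy as np

def empirical_risk_and_grad(W, v, X, Y, phi, dphi):
    """Empirical squared-loss risk (Eq. (2)) and its gradient w.r.t. W.
    W: (d, k) first-layer weights; v: (k,) fixed second-layer weights;
    phi / dphi: activation and its (left) first derivative, entrywise."""
    n = X.shape[0]
    Z = X @ W                      # (n, k) pre-activations w_i^T x
    r = phi(Z) @ v - Y             # (n,) residuals
    risk = 0.5 * np.mean(r ** 2)
    # d risk / d w_i = (1/n) sum_x  r(x) * v_i * phi'(w_i^T x) * x
    grad = X.T @ (r[:, None] * dphi(Z) * v[None, :]) / n   # (d, k)
    return risk, grad
```

At W = W* the residuals vanish (noiseless data), so both the risk and the gradient are zero, consistent with W* being a global minimizer.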
Now we show the Hessian of the objective near the global optimum is positive definite.
Definition 4.1.
Given the ground-truth matrix W* ∈ ℝ^{d×k}, let σ_i(W*) denote the i-th singular value of W*, often abbreviated as σ_i. Let κ = σ_1/σ_k and λ = (∏_{i=1}^k σ_i)/σ_k^k. Let v_max denote max_{i∈[k]} |v_i*| and v_min denote min_{i∈[k]} |v_i*|, and let ν = v_max/v_min. Let ρ denote ρ(σ_k), with ρ(σ) as defined in Property 3.2.
Theorem 4.2 (Informal version of Theorem D.1).
Given a number of samples linear in d (up to polylogarithmic factors in d and polynomial factors in the quantities of Definition 4.1), for any W sufficiently close to W*, the empirical Hessian ∇²f_S(W) is positive definite with high probability, with eigenvalues bounded between positive quantities depending on those in Definition 4.1.
Remark 4.3.
As we can see from Theorem 4.2, ρ(σ) from Property 3.2 plays an important role for the positive definiteness (PD) property. Interestingly, many popular activations, like ReLU, sigmoid and tanh, have ρ(σ) > 0, while some simple functions like the linear (φ(z) = z) and square (φ(z) = z²) functions have ρ(σ) = 0 and their Hessians are rank-deficient. Other important numbers are κ and λ, two different condition numbers of the weight matrix, which directly influence the positive definiteness. If W* is rank-deficient, both condition numbers are infinite and we don't have the PD property. In the best case, when W* is orthogonal, κ = λ = 1. In the worst case, λ can be exponential in k. Also, W should be close enough to W*. In the next section, we provide tensor methods to initialize W and v such that they satisfy the conditions in Theorem 4.2.
For the PD property to hold, we need the samples to be independent of the current parameters. Therefore, we need to do resampling at each iteration to guarantee the convergence in iterative algorithms like gradient descent. The following theorem provides the linear convergence guarantee of gradient descent for smooth activations.
Theorem 4.4 (Linear convergence of gradient descent, informal version of Theorem D.2).
Let W^c be the current iterate, sufficiently close to W*. Let S denote a set of i.i.d. samples from distribution 𝒟 (defined in (1)), with |S| linear in d up to polylogarithmic factors in d and polynomial factors in the quantities of Definition 4.1, and let the activation function satisfy Properties 3.1, 3.2 and 3.3(a). Let m₀ and M₀ denote the lower and upper bounds on the eigenvalues of the empirical Hessian near W* from Theorem 4.2. If we perform gradient descent with step size 1/M₀ on f_S(W^c) and obtain the next iterate,

W^{c+1} = W^c − (1/M₀) · ∇f_S(W^c),

then with high probability,

‖W^{c+1} − W*‖_F² ≤ (1 − m₀/M₀) · ‖W^c − W*‖_F².
We provide the proofs in Appendix D.1.
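The update of Theorem 4.4 can be sketched as follows (our own illustration). Note the fresh batch drawn at every iteration, implementing the resampling rule so that the samples are independent of the current iterate; the theoretically prescribed step size is replaced here by a small hand-picked constant:

```python
import numpy as np

def gd_with_resampling(W0, v, draw_batch, phi, dphi, step=0.005, iters=300):
    """Gradient descent on the empirical risk, drawing a fresh sample set
    at each iteration (resampling) so samples are independent of the iterate.
    draw_batch() -> (X, Y); v (second-layer weights) is held fixed."""
    W = W0.copy()
    for _ in range(iters):
        X, Y = draw_batch()                       # resample at every step
        Z = X @ W
        r = phi(Z) @ v - Y                        # residuals
        grad = X.T @ (r[:, None] * dphi(Z) * v[None, :]) / X.shape[0]
        W -= step * grad
    return W
```

Started close enough to W*, the iterate error contracts geometrically, matching the local linear convergence claim.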
5 Tensor Methods for Initialization
In this section, we show that tensor methods can recover the parameters W* to some precision, and exactly recover v* for homogeneous activations.
It is known that most tensor problems are NP-hard [Hås90, HL13] or even hard to approximate [SWZ17b]. However, by making some assumptions, tensor decomposition becomes efficient [AGH14, WTSA15, WA16, SWZ16]. Here we utilize the noiseless assumption and the Gaussian-input assumption to obtain a provable and efficient tensor method.
5.1 Preliminary
Let's define a special outer product, denoted ⊗̃, to simplify the notation. If v ∈ ℝ^d is a vector and I is the identity matrix, then

v ⊗̃ I = Σ_{j=1}^d [ v ⊗ e_j ⊗ e_j + e_j ⊗ v ⊗ e_j + e_j ⊗ e_j ⊗ v ].

If M is a symmetric rank-r matrix factorized as M = Σ_{i=1}^r s_i v_i v_iᵀ and I is the identity matrix, then

M ⊗̃ I = Σ_{i=1}^r s_i Σ_{j=1}^d [ v_i ⊗ v_i ⊗ e_j ⊗ e_j + v_i ⊗ e_j ⊗ v_i ⊗ e_j + v_i ⊗ e_j ⊗ e_j ⊗ v_i + e_j ⊗ v_i ⊗ v_i ⊗ e_j + e_j ⊗ v_i ⊗ e_j ⊗ v_i + e_j ⊗ e_j ⊗ v_i ⊗ v_i ],

where e_j is the j-th standard basis vector, v_i ∈ ℝ^d, s_i ∈ ℝ, and ⊗ is the usual outer product.
Denote γ_j(σ) = E_{z∼N(0,1)}[φ(σ · z) z^j] for j = 0, 1, 2, 3, 4. Now let's calculate some moments.
Definition 5.1.
We define M_1, M_2, M_3, M_4 and m_{1,i}, m_{2,i}, m_{3,i}, m_{4,i} as follows:
M_1 = E_{(x,y)∼𝒟}[y · x].
M_2 = E_{(x,y)∼𝒟}[y · (x ⊗ x − I)].
M_3 = E_{(x,y)∼𝒟}[y · (x ⊗ x ⊗ x − x ⊗̃ I)].
M_4 = E_{(x,y)∼𝒟}[y · p_4(x)], where p_4(x) is the fourth-order Hermite tensor, built from x ⊗ x ⊗ x ⊗ x, (x ⊗ x) ⊗̃ I and I ⊗̃ I analogously.
m_{1,i} = γ_1(‖w_i*‖).
m_{2,i} = γ_2(‖w_i*‖) − γ_0(‖w_i*‖).
m_{3,i} = γ_3(‖w_i*‖) − 3γ_1(‖w_i*‖).
m_{4,i} = γ_4(‖w_i*‖) − 6γ_2(‖w_i*‖) + 3γ_0(‖w_i*‖).
According to Definition 5.1, we have the following results,
Claim 5.2.
For each j ∈ {1, 2, 3, 4}, M_j = Σ_{i=1}^k v_i* · m_{j,i} · (w̄_i*)^{⊗j}, where w̄_i* = w_i*/‖w_i*‖.
Note that some of the m_{j,i}'s will be zero for specific activations. For example, for activations with symmetric first derivatives, i.e., φ'(z) = φ'(−z), like the sigmoid and erf, we have m_{2,i} = m_{4,i} = 0, since γ_2(σ) = γ_0(σ) and γ_4(σ) = 3γ_0(σ). Another example is the ReLU: piecewise-linear activations like the ReLU have vanishing m_{3,i}, i.e., m_{3,i} = 0, since γ_3(σ) = 3γ_1(σ). To make tensor methods work, we make the following assumption.
Assumption 5.3.
Assume the activation function satisfies the following conditions:
1. If M_j ≠ 0, then m_{j,i} ≠ 0 for all i ∈ [k].
2. At least one of M_3 and M_4 is nonzero.
3. If M_1 = M_3 = 0, then φ(z) is an even function, i.e., φ(z) = φ(−z).
4. If M_2 = M_4 = 0, then φ(z) is an odd function, i.e., φ(z) = −φ(−z).
If φ(z) is an odd function, then v_i* φ(w_i*ᵀx) = (−v_i*) φ(−w_i*ᵀx). Hence we can always assume v_i* > 0 in this case. If φ(z) is an even function, then φ(w_i*ᵀx) = φ(−w_i*ᵀx). So if w_i recovers w_i*, then −w_i also recovers w_i*. Note that ReLU, leaky ReLU and squared ReLU satisfy Assumption 5.3. We further define the following nonzero moments.
Definition 5.4.
Let α ∈ ℝ^d denote a randomly picked vector. We define P_2 and P_3 as follows: P_2 = M_{j_2}(I, I, α, …, α), where j_2 = min{j ≥ 2 : M_j ≠ 0}, and P_3 = M_{j_3}(I, I, I, α, …, α), where j_3 = min{j ≥ 3 : M_j ≠ 0}.
Claim 5.5.
P_2 = Σ_{i=1}^k v_i* · m_{j_2,i} · (αᵀw̄_i*)^{j_2−2} · (w̄_i*)^{⊗2} and P_3 = Σ_{i=1}^k v_i* · m_{j_3,i} · (αᵀw̄_i*)^{j_3−3} · (w̄_i*)^{⊗3}.
In other words, for the above definition, P_2 is equal to the first nonzero matrix in the ordered sequence {M_2, M_3(I, I, α), M_4(I, I, α, α)}, and P_3 is equal to the first nonzero tensor in the ordered sequence {M_3, M_4(I, I, I, α)}. Since α is randomly picked, αᵀw̄_i* ≠ 0 for all i with probability 1, and we view this quantity as a constant throughout this paper. So by construction and Assumption 5.3, both P_2 and P_3 are rank-k. Also, let P̂_2 and P̂_3 denote the corresponding empirical moments of P_2 and P_3 respectively.
5.2 Algorithm
Now we briefly introduce how to use a set of samples with size linear in the dimension to recover the ground-truth parameters to some precision. As shown in the previous section, we have a rank-k third-order moment P_3 that has a tensor decomposition formed by {w̄_i*}. Therefore, we can use the non-orthogonal decomposition method [KCL15] to decompose the corresponding estimated tensor and obtain an approximation of the parameters. The precision of the obtained parameters depends on the estimation error of P_3, whose direct estimation requires a number of samples cubic in d to achieve a given error. Also, the time complexity of tensor decomposition on a d × d × d tensor is at least cubic in d.
In this paper, we reduce the cubic dependency of the sample/computational complexity on the dimension [JSA15] to a linear dependency. Our idea follows the techniques used in [ZJD16], where a second-order moment is first used to approximate the subspace spanned by {w̄_1*, …, w̄_k*}, denoted V; V is then used to reduce the higher-dimensional third-order tensor P_3 ∈ ℝ^{d×d×d} to a lower-dimensional tensor R_3 = P_3(V, V, V) ∈ ℝ^{k×k×k}. Since the tensor decomposition and the tensor estimation are conducted on the lower-dimensional space, the sample complexity and computational complexity are reduced.
The detailed algorithm is shown in Algorithm 1. First, we randomly partition the dataset into three disjoint subsets of equal size. Then we apply the power method on P̂_2, the estimate of P_2, to estimate V. After that, the non-orthogonal tensor decomposition (KCL) [KCL15] on R̂_3 = P̂_3(V̂, V̂, V̂) outputs û_i, which estimates s_i · V̂ᵀw̄_i* for some unknown sign s_i ∈ {−1, +1}. Hence w̄_i* can be estimated by s_i · V̂ û_i. Finally, we estimate the magnitudes of {w_i*} and the signs {s_i} in the RecMagSign function for homogeneous activations. We discuss the details of each procedure and provide the PowerMethod and RecMagSign algorithms in Appendix E.
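The dimension-reduction step can be sketched as follows (our own illustration, not the pseudocode of Algorithm 1): orthogonal/power iteration on a symmetric second-moment estimate recovers the span of the ground-truth directions, and the third-order tensor is then contracted to k × k × k before any decomposition is attempted:

```python
import numpy as np

def power_method_subspace(M2_hat, k, iters=50, seed=0):
    """Orthogonal (power) iteration: estimate the top-k invariant subspace
    of a symmetric moment estimate M2_hat (a stand-in for P2_hat)."""
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((M2_hat.shape[0], k)))
    for _ in range(iters):
        V, _ = np.linalg.qr(M2_hat @ V)   # multiply, then re-orthonormalize
    return V                              # (d, k), orthonormal columns

def reduce_third_moment(T_hat, V):
    """Contract a d x d x d tensor estimate with V on every mode, giving the
    k x k x k tensor R3 = T(V, V, V) that is actually decomposed."""
    return np.einsum('abc,ai,bj,ck->ijk', T_hat, V, V, V)
```

The decomposition then runs on k³ entries instead of d³, which is where the cubic-to-linear saving in d comes from.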
5.3 Theoretical Analysis
Theorem 5.6 (Informal).
Given a number of samples linear in d (up to polynomial factors in k, λ and log d), Algorithm 1 runs in time linear in the number of samples and the dimension and, with high probability, exactly recovers v* for homogeneous activations and outputs an initialization W(0) close enough to W* to satisfy the conditions of Theorem 4.2.
Figure 1: (a) Sample complexity for recovery. (b) Tensor initialization error. (c) Objective v.s. iterations.
6 Global Convergence
Combining the positive definiteness of the Hessian near the global optimum in Section 4 and the tensor initialization methods in Section 5, we arrive at the overall globally convergent algorithm, Algorithm 2, and its guarantee, Theorem 6.1.
Theorem 6.1 (Global convergence guarantees).
Let S denote a set of i.i.d. samples from distribution 𝒟 (defined in (1)) and let the activation function be homogeneous, satisfying Properties 3.1, 3.2, 3.3(a) and Assumption 5.3. Then for any precision ε > 0, if |S| ≥ d · log(1/ε) · poly(log d, k, λ), then there is an algorithm (procedure Learning1NN in Algorithm 2) taking |S| · d · poly(log d, k, λ) time and outputting a matrix W ∈ ℝ^{d×k} and a vector v ∈ {−1, +1}^k satisfying

‖W − W*‖_F ≤ ε · ‖W*‖_F and v = v* (up to a permutation of the hidden nodes),

with high probability.
7 Numerical Experiments
In this section, we use synthetic data to verify our theoretical results. We generate data points from Distribution 𝒟 (defined in Eq. (1)). We set W* = UΣVᵀ, where U and V are orthogonal matrices generated from QR decompositions of Gaussian matrices, and Σ is a diagonal matrix of prescribed singular values. In this experiment, the number of hidden nodes and the condition number are fixed. We set v* to be randomly picked from {−1, +1} with equal chance. We use the squared ReLU φ(z) = max(z, 0)², which is a smooth homogeneous function. For non-orthogonal tensor methods, we directly use the code provided by [KCL15] with the number of random projections fixed. We pick a constant step size for gradient descent. In the experiments, we don't do the resampling, since the algorithm still works well without resampling.
First we show the number of samples required to recover the parameters for different dimensions. We fix the number of hidden nodes k, and vary the dimension d and the number of samples n. For each pair (d, n), we run 10 trials. We say a trial successfully recovers the parameters if there exists a permutation π: [k] → [k] such that the returned parameters W and v satisfy v_{π(i)} = v_i* and a small relative error between w_{π(i)} and w_i*, for all i ∈ [k].
We record the recovery rates and represent them as grey scale in Fig. 1(a). As we can see from Fig. 1(a), the least number of samples required to have a 100% recovery rate is roughly proportional to the dimension.
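For small k, the permutation-invariant success criterion can be evaluated by brute force (our own helper; the tolerance is a parameter since the exact threshold is a choice):

```python
import itertools
import numpy as np

def recovers(W, v, W_star, v_star, tol=1e-2):
    """Check recovery up to a permutation pi of the hidden nodes:
    require v[pi(i)] == v_star[i] and a small relative error between
    the matched weight columns, for every i."""
    k = W_star.shape[1]
    for pi in itertools.permutations(range(k)):
        signs_ok = all(v[pi[i]] == v_star[i] for i in range(k))
        cols_ok = all(
            np.linalg.norm(W[:, pi[i]] - W_star[:, i])
            <= tol * np.linalg.norm(W_star[:, i])
            for i in range(k))
        if signs_ok and cols_ok:
            return True
    return False
```

Brute force over all k! permutations is fine here since k is a small constant in the experiments.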
Next we test the tensor initialization. We show the error between the output of the tensor method and the ground-truth parameters against the number of samples under different dimensions in Fig. 1(b). The pure dark blocks indicate that, in at least one of the 10 trials, v ≠ v*, which means v* is not correctly initialized. Let Π denote the set of all possible permutations π: [k] → [k]. The grey scale represents the averaged error,

min_{π∈Π} max_{i∈[k]} ‖w_{π(i)} − w_i*‖/‖w_i*‖,

over 10 trials. As we can see, with a fixed dimension, the more samples we have, the better the initialization we obtain. We can also see that, to achieve the same initialization error, the sample complexity required is roughly proportional to the dimension.
We also compare different initialization methods for gradient descent in Fig. 1(c). We fix the dimension and the number of samples, and compare three different initialization approaches: (I) initialize both W and v from tensor methods, then do gradient descent on W while v is fixed; (II) initialize both W and v from random Gaussians, then do gradient descent on both W and v; (III) initialize W and v from random Gaussians, then do gradient descent on W while v is fixed. As we can see from Fig. 1(c), Approach (I) is the fastest, and Approach (II) doesn't converge even if more iterations are allowed. Both Approach (I) and Approach (III) have a linear convergence rate when the objective value is small enough, which verifies our local linear convergence claim.
8 Conclusion
As shown in Theorem 6.1, tensor initialization followed by gradient descent provides a globally convergent algorithm with time/sample complexity linear in the dimension, logarithmic in the precision and polynomial in the other factors for smooth homogeneous activation functions. Our distilled properties for activation functions cover a wide range of nonlinear functions and hopefully provide intuition for understanding the role nonlinear activations play in optimization. Deeper neural networks and convergence guarantees for SGD will be considered in future work.
References
 [ABGM14] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 584–592. https://arxiv.org/pdf/1310.6343.pdf, 2014.
 [AGH14] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. JMLR, 15:2773–2832, 2014.
 [AGKM12] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization – provably. In Proceedings of the forty-fourth annual ACM Symposium on Theory of Computing (STOC), pages 145–162. ACM, 2012.
 [AGM12] Sanjeev Arora, Rong Ge, and Ankur Moitra. Learning topic models – going beyond SVD. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 1–10. IEEE, 2012.
 [AGMR17] Sanjeev Arora, Rong Ge, Tengyu Ma, and Andrej Risteski. Provable learning of noisyor networks. In Proceedings of the 49th Annual Symposium on the Theory of Computing (STOC). https://arxiv.org/pdf/1612.08795.pdf, 2017.
 [AKZ12] Peter Arbenz, Daniel Kressner, and DMATH ETH Zürich. Lecture notes on solving large scale eigenvalue problems. http://people.inf.ethz.ch/arbenz/ewp/Lnotes/lsevp2010.pdf, 2012.
 [APVZ14] Alexandr Andoni, Rina Panigrahy, Gregory Valiant, and Li Zhang. Learning polynomials with neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 1908–1916, 2014.
 [AZS14] Özlem Aslan, Xinhua Zhang, and Dale Schuurmans. Convex deep learning via normalized kernels. In Advances in Neural Information Processing Systems (NIPS), pages 3275–3283, 2014.
 [Bac14] Francis Bach. Breaking the curse of dimensionality with convex neural networks. arXiv preprint arXiv:1412.8690, 2014.
 [Bal16] David Balduzzi. Deep online convex optimization with gated games. arXiv preprint arXiv:1604.01952, 2016.
 [Bar98] Peter L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.
 [BG17] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.
 [BLWZ17] Maria-Florina Balcan, Yingyu Liang, David P. Woodruff, and Hongyang Zhang. Optimal sample complexity for matrix completion and related problems via regularization. arXiv preprint arXiv:1704.08683, 2017.
 [BMBY16] David Balduzzi, Brian McWilliams, and Tony ButlerYeoman. Neural taylor approximations: Convergence and exploration in rectifier networks. arXiv preprint arXiv:1611.02345, 2016.
 [BR88] Avrim Blum and Ronald L Rivest. Training a 3-node neural network is NP-complete. In Proceedings of the 1st International Conference on Neural Information Processing Systems (NIPS), pages 494–501. MIT Press, 1988.
 [BRV05] Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte. Convex neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 123–130, 2005.
 [CHM15] Anna Choromanska, MIkael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 192–204, 2015.
 [CS16] Nadav Cohen and Amnon Shashua. Convolutional rectifier networks as generalized tensor decompositions. In International Conference on Machine Learning (ICML), 2016.
 [CSS16] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In 29th Annual Conference on Learning Theory (COLT), pages 698–728, 2016.
 [DFS16] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in neural information processing systems (NIPS), pages 2253–2261, 2016.
 [DPG14] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in highdimensional nonconvex optimization. In Advances in neural information processing systems (NIPS), pages 2933–2941, 2014.
 [EV09] Ehsan Elhamifar and René Vidal. Sparse subspace clustering. In CVPR, pages 2790–2797, 2009.
 [FB16] C Daniel Freeman and Joan Bruna. Topology and geometry of halfrectified network optimization. In arXiv preprint. https://arxiv.org/pdf/1611.01540.pdf, 2016.
 [FZK16] Jiashi Feng, Tom Zahavy, Bingyi Kang, Huan Xu, and Shie Mannor. Ensemble robustness of deep learning algorithms. arXiv preprint arXiv:1602.02389, 2016.
 [GG11] Nicolas Gillis and François Glineur. Lowrank matrix approximation with weights or missing data is nphard. SIAM Journal on Matrix Analysis and Applications, 32(4):1149–1165, 2011.
 [GKKT17] Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably learning the relu in polynomial time. In 30th Annual Conference on Learning Theory (COLT). https://arxiv.org/pdf/1611.10258.pdf, 2017.
 [GV15] Nicolas Gillis and Stephen A Vavasis. On the complexity of robust pca and norm lowrank matrix approximation. arXiv preprint arXiv:1509.09236, 2015.
 [Har14] Moritz Hardt. Understanding alternating minimization for matrix completion. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 651–660. IEEE, 2014.
 [Hås90] Johan Håstad. Tensor rank is npcomplete. Journal of Algorithms, 11(4):644–654, 1990.
 [HK13] Daniel Hsu and Sham M Kakade. Learning mixtures of spherical gaussians: moment methods and spectral decompositions. In ITCS, pages 11–20. ACM, 2013.
 [HKZ12] Daniel Hsu, Sham M Kakade, and Tong Zhang. A tail inequality for quadratic forms of subgaussian random vectors. Electron. Commun. Probab, 17(52):1–6, 2012.
 [HL13] Christopher J Hillar and LekHeng Lim. Most tensor problems are nphard. In Journal of the ACM (JACM), volume 60(6), page 45. https://arxiv.org/pdf/0911.1393.pdf, 2013.
 [HM13] Moritz Hardt and Ankur Moitra. Algorithms and hardness for robust subspace recovery. In COLT, volume 30, pages 354–375, 2013.
 [HM17] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. ICLR, 2017.
 [Hor91] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.
 [HRS16] Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. In ICML, pages 1225–1234, 2016.
 [HV15] Benjamin D Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.
 [JNS13] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Lowrank matrix completion using alternating minimization. In Proceedings of the fortyfifth annual ACM symposium on Theory of computing (STOC), 2013.
 [JSA15] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of nonconvexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
 [Kaw16] Kenji Kawaguchi. Deep learning without poor local minima. arXiv preprint arXiv:1605.07110, 2016.
 [KCL15] Volodymyr Kuleshov, Arun Chaganty, and Percy Liang. Tensor factorization via matrix factorization. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 507–516, 2015.
 [LSSS14] Roi Livni, Shai ShalevShwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in neural information processing systems (NIPS), pages 855–863, 2014.
 [MPCB14] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems (NIPS), pages 2924–2932, 2014.
 [PLR16] Ben Poole, Subhaneil Lahiri, Maithreyi Raghu, Jascha SohlDickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances In Neural Information Processing Systems (NIPS), pages 3360–3368, 2016.
 [RPK16] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha SohlDickstein. On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336, 2016.
 [RSW16] Ilya P Razenshteyn, Zhao Song, and David P. Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the 48th Annual Symposium on the Theory of Computing (STOC), pages 250–263, 2016.
 [SA15] Hanie Sedghi and Anima Anandkumar. Provable methods for training neural networks with sparse connectivity. In International Conference on Learning Representation (ICLR), 2015.
 [SBL16] Levent Sagun, Léon Bottou, and Yann LeCun. Singularity of the Hessian in deep learning. arXiv preprint arXiv:1611.07476, 2016.
 [SC16] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.
 [SCP16] Grzegorz Swirszcz, Wojciech Marian Czarnecki, and Razvan Pascanu. Local minima in training of deep networks. arXiv preprint arXiv:1611.06310, 2016.
 [Sha16] Ohad Shamir. Distributionspecific hardness of learning neural networks. arXiv preprint arXiv:1609.01037, 2016.
 [SL15] Ruoyu Sun and ZhiQuan Luo. Guaranteed matrix completion via nonconvex factorization. In IEEE Symposium on Foundations of Computer Science (FOCS), pages 270–289. IEEE, 2015.
 [SR11] David Sontag and Dan Roy. Complexity of inference in latent dirichlet allocation. In Advances in neural information processing systems, pages 1008–1016, 2011.
 [SS16] Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks. In International Conference on Machine Learning (ICML), 2016.
 [SWZ16] Zhao Song, David P. Woodruff, and Huan Zhang. Sublinear time orthogonal tensor decomposition. In Advances in Neural Information Processing Systems(NIPS), pages 793–801, 2016.
 [SWZ17a] Zhao Song, David P. Woodruff, and Peilin Zhong. Low rank approximation with entrywise norm error. In Proceedings of the 49th Annual Symposium on the Theory of Computing (STOC). ACM, https://arxiv.org/pdf/1611.00898.pdf, 2017.
 [SWZ17b] Zhao Song, David P. Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In arXiv preprint. https://arxiv.org/pdf/1704.08246.pdf, 2017.
 [Tel16] Matus Telgarsky. Benefits of depth in neural networks. In 29th Annual Conference on Learning Theory (COLT), pages 1517–1539, 2016.
 [Tia17] Yuandong Tian. Symmetrybreaking convergence analysis of certain twolayered neural networks with ReLU nonlinearity. In Workshop at International Conference on Learning Representation, 2017.
 [Tro12] Joel A. Tropp. Userfriendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
 [WA16] Yining Wang and Anima Anandkumar. Online and differentiallyprivate tensor decomposition. In Advances in Neural Information Processing Systems (NIPS), pages 3531–3539, 2016.
 [WTSA15] Yining Wang, HsiaoYu Tung, Alexander J Smola, and Anima Anandkumar. Fast and guaranteed tensor decomposition via sketching. In Advances in Neural Information Processing Systems (NIPS), pages 991–999, 2015.
 [XLS17] Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
 [YCS14] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi. Alternating minimization for mixed linear regression. In ICML, pages 613–621, 2014.
 [ZBH17] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
 [ZJD16] Kai Zhong, Prateek Jain, and Inderjit S Dhillon. Mixed linear regression with multiple components. In Advances in neural information processing systems (NIPS), pages 2190–2198, 2016.
 [ZLJ16] Yuchen Zhang, Jason D Lee, and Michael I Jordan. L1regularized neural networks are improperly learnable in polynomial time. In Proceedings of The 33rd International Conference on Machine Learning (ICML), pages 993–1001, 2016.
 [ZLW16] Yuchen Zhang, Percy Liang, and Martin J Wainwright. Convexified convolutional neural networks. arXiv preprint arXiv:1609.01000, 2016.
 [ZLWJ17] Yuchen Zhang, Jason D. Lee, Martin J. Wainwright, and Michael I. Jordan. On the learnability of fullyconnected neural networks. In International Conference on Artificial Intelligence and Statistics, 2017.
Appendix A Notation
For any positive integer $n$, we use $[n]$ to denote the set $\{1, 2, \cdots, n\}$. For a random variable $X$, let $\mathbb{E}[X]$ denote the expectation of $X$ (if this quantity exists). For any vector $x \in \mathbb{R}^n$, we use $\|x\|$ to denote its $\ell_2$ norm.
We provide several definitions related to a matrix $A$. Let $\det(A)$ denote the determinant of a square matrix $A$. Let $A^\top$ denote the transpose of $A$. Let $A^\dagger$ denote the Moore-Penrose pseudoinverse of $A$. Let $A^{-1}$ denote the inverse of a full-rank square matrix $A$. Let $\|A\|_F$ denote the Frobenius norm of $A$. Let $\|A\|$ denote the spectral norm of $A$. Let $\sigma_i(A)$ denote the $i$-th largest singular value of $A$. We often use a capital letter to denote the stack of the corresponding lowercase vectors, e.g., $W = [w_1\ w_2\ \cdots\ w_k]$. For two same-size matrices $A, B \in \mathbb{R}^{m \times n}$, we use $A \circ B$ to denote the element-wise multiplication of these two matrices.
We use $\otimes$ to denote the outer product and $\cdot$ to denote the dot product. Given two column vectors $u, v \in \mathbb{R}^n$, then $u \otimes v \in \mathbb{R}^{n \times n}$ with $(u \otimes v)_{i,j} = u_i \cdot v_j$, and $u^\top v = \sum_{i=1}^n u_i v_i \in \mathbb{R}$. Given three column vectors $u, v, w \in \mathbb{R}^n$, then $u \otimes v \otimes w \in \mathbb{R}^{n \times n \times n}$ with $(u \otimes v \otimes w)_{i,j,k} = u_i v_j w_k$. We use $x^{\otimes k}$ to denote the outer product of the vector $x$ with itself $k$ times.
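The outer-product and dot-product conventions above can be sketched in a few lines of NumPy (an illustrative sketch, not part of the paper's algorithms; the variable names are ours):

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])
w = np.array([5.0, 6.0])

# Outer product u ⊗ v: (u ⊗ v)_{i,j} = u_i * v_j, an n x n matrix.
uv = np.einsum('i,j->ij', u, v)
assert uv[0, 1] == u[0] * v[1]

# Dot product u^T v = sum_i u_i * v_i, a scalar: 1*3 + 2*4 = 11.
dot = u @ v

# Third-order outer product u ⊗ v ⊗ w: (u ⊗ v ⊗ w)_{i,j,k} = u_i * v_j * w_k.
uvw = np.einsum('i,j,k->ijk', u, v, w)

# x^{⊗3}: the outer product of a vector with itself 3 times.
x3 = np.einsum('i,j,k->ijk', u, u, u)
```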
A tensor $T \in \mathbb{R}^{n \times n \times n}$ is symmetric if and only if, for any $i, j, k$, $T_{i,j,k} = T_{i,k,j} = T_{j,i,k} = T_{j,k,i} = T_{k,i,j} = T_{k,j,i}$. Given a third-order tensor $T \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ and three matrices $A \in \mathbb{R}^{n_1 \times d_1}$, $B \in \mathbb{R}^{n_2 \times d_2}$, $C \in \mathbb{R}^{n_3 \times d_3}$, we use $T(A, B, C)$ to denote the $d_1 \times d_2 \times d_3$ tensor whose $(i,j,k)$-th entry is
$$\sum_{i'=1}^{n_1} \sum_{j'=1}^{n_2} \sum_{k'=1}^{n_3} T_{i',j',k'} A_{i',i} B_{j',j} C_{k',k}.$$
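The multilinear form $T(A, B, C)$ is a single contraction, which can be checked numerically with `einsum` (a sketch under our own variable names and randomly generated data):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n3 = 4, 5, 6
d1, d2, d3 = 2, 3, 2
T = rng.standard_normal((n1, n2, n3))
A = rng.standard_normal((n1, d1))
B = rng.standard_normal((n2, d2))
C = rng.standard_normal((n3, d3))

# T(A, B, C)_{i,j,k} = sum_{i',j',k'} T_{i',j',k'} A_{i',i} B_{j',j} C_{k',k}
out = np.einsum('abc,ai,bj,ck->ijk', T, A, B, C)
assert out.shape == (d1, d2, d3)

# Sanity-check one entry against the explicit triple sum.
manual = sum(T[a, b, c] * A[a, 0] * B[b, 1] * C[c, 0]
             for a in range(n1) for b in range(n2) for c in range(n3))
```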
We use $\|T\|$ to denote the operator norm of the tensor $T$, i.e.,
$$\|T\| = \max_{\|a\| = 1} |T(a, a, a)|.$$