The goal of a recommendation system is to predict the interest of a user in a given item by exploiting the existing set of ratings as well as certain user/item features. A standard approach to modeling this problem is Inductive Matrix Completion (IMC), where the predicted rating is modeled as an inner product of the user and the item features projected onto a latent space. In order to learn the parameters effectively from a small number of observed ratings, the latent space is constrained to be low-dimensional, which implies that the parameter matrix is constrained to be low-rank. However, such bilinear modeling of the ratings can be limiting in practice, and nonlinear prediction functions can lead to significant improvements. A natural approach to introducing nonlinearity in the prediction function is to apply a nonlinear activation function on top of the projected user/item features. Imposing nonlinearities further complicates an already challenging problem that has two sources of nonconvexity: a) the low-rank structure of the parameter matrix, and b) the nonlinear activation function. We show that one can still solve the nonlinear Inductive Matrix Completion problem using gradient descent type methods as long as the solution is initialized well. That is, close to the optima, the optimization function is strongly convex and hence admits standard optimization techniques, at least for certain activation functions such as sigmoid and tanh. We also highlight the importance of the activation function and show how ReLU can behave significantly differently than, say, a sigmoid function. Finally, we apply our proposed technique to recommendation systems and semi-supervised clustering, and show that our method can lead to much better performance than standard linear Inductive Matrix Completion methods.
1 Introduction
Matrix Completion (MC) or Collaborative Filtering [CR09, GUH16] is by now a standard technique to model recommendation systems problems where a few user-item ratings are available and the goal is to predict ratings for any user-item pair. However, standard collaborative filtering suffers from two drawbacks: 1) Cold-start problem: MC cannot make predictions for new users or items; 2) Missing side-information: MC cannot leverage side-information that is typically present in recommendation systems, such as features for users/items. Consequently, several methods [ABEV06, Ren10, XJZ13, JD13] have been proposed to leverage the side-information together with the ratings. Inductive matrix completion (IMC) [ABEV06, JD13] is one of the most popular methods in this class.
IMC models the ratings as the inner product between certain linear mappings of the user/item features, i.e., $r_{ij} \approx x_i^\top U V^\top y_j$, where $r_{ij}$ is the predicted rating of user $i$ for item $j$, and $x_i \in \mathbb{R}^{d_1}$, $y_j \in \mathbb{R}^{d_2}$ are the feature vectors. The parameters $(U, V)$ can typically be learned using a small number of observed ratings [JD13].
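To make the bilinear IMC prediction concrete, here is a minimal numpy sketch (the dimensions and parameter values are hypothetical, not from the paper); it also checks the equivalent rank-$k$ form $x^\top (UV^\top) y$:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, k = 8, 6, 3          # user-feature dim, item-feature dim, latent dim (illustrative)

# Hypothetical IMC parameters: the rating is modeled as x^T U V^T y.
U = rng.standard_normal((d1, k))
V = rng.standard_normal((d2, k))

def imc_predict(x, y, U, V):
    """Bilinear IMC prediction: inner product of the linearly projected features."""
    return (U.T @ x) @ (V.T @ y)

x, y = rng.standard_normal(d1), rng.standard_normal(d2)
r_hat = imc_predict(x, y, U, V)

# Equivalent form: x^T (U V^T) y, i.e., a rank-k parameter matrix U V^T.
assert np.isclose(r_hat, x @ (U @ V.T) @ y)
```

The low-rank constraint of IMC corresponds exactly to the factored form `U @ V.T` above.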
However, the bilinear structure of IMC is fairly simplistic and limiting in practice and might lead to fairly poor accuracy on real-world recommendation problems. For example, consider the YouTube recommendation system [CAS16] that requires predictions over videos. Naturally, a linear function over the pixels of videos will lead to fairly inaccurate predictions, and hence one needs to model the videos using nonlinear networks. The survey paper by [ZYS17] presents many more such examples where we need to design a nonlinear ratings prediction function for the input features, including [LLL16] for image recommendation, [WW14] for music recommendation, and [ZYL16] for recommendation systems with multiple types of inputs.
We can introduce nonlinearity in the prediction function using several standard techniques; however, if our parameterization admits too many free parameters, then learning them might be challenging as the number of available user-item ratings tends to be fairly small. Instead, we use a simple nonlinear extension of IMC that can control the number of parameters to be estimated. Note that the IMC-based prediction function can be viewed as an inner product between certain latent user-item features, where the latent features are a linear map of the raw user-item features. To introduce nonlinearity, we can use a nonlinear mapping of the raw user-item features rather than the linear mapping used by IMC. This leads to the following general framework that we call nonlinear inductive matrix completion (NIMC),
$$ r_{ij} \approx \langle \mathcal{U}(x_i), \mathcal{V}(y_j) \rangle, \qquad (1) $$
where $x_i, y_j$ are the feature vectors, $r_{ij}$ is their rating, and $\mathcal{U}(\cdot), \mathcal{V}(\cdot)$ are nonlinear mappings from the raw feature space to the latent space.
The above general framework reduces to standard inductive matrix completion when $\mathcal{U}, \mathcal{V}$ are linear mappings, and further reduces to matrix completion when the feature vectors are the standard basis vectors indexing the item and the user, respectively. When the concatenation $[x_i; y_j]$ is used as the feature vector and the parameter matrix is restricted to be a two-block (one for $x_i$ and the other for $y_j$) diagonal matrix, the above framework reduces to the dirty-IMC model [CHD15]. Similarly, $\mathcal{U}$ and $\mathcal{V}$ can also be neural networks (NNs), such as feedforward NNs [SCH16, CAS16], convolutional NNs for images, and recurrent NNs for speech/text.
In this paper, we focus on a simple nonlinear activation based mapping for the user/item features. That is, we set $\mathcal{U}(x_i) = \phi(U^{*\top} x_i)$ and $\mathcal{V}(y_j) = \phi(V^{*\top} y_j)$, where $\phi$ is a nonlinear activation function applied entrywise. Note that if $\phi$ is ReLU, then the latent space is guaranteed to lie in the non-negative orthant, which in itself can be a desirable property for certain recommendation problems.
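The NIMC prediction with this activation-based mapping can be sketched as follows (a minimal illustration with hypothetical dimensions; `sigmoid` and `relu` are the two activations discussed in the paper):

```python
import numpy as np

def nimc_predict(x, y, U, V, act):
    """NIMC prediction: inner product of nonlinearly mapped latent features."""
    return act(U.T @ x) @ act(V.T @ y)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu = lambda z: np.maximum(z, 0.0)

rng = np.random.default_rng(1)
d, k = 10, 4                 # illustrative feature and latent dimensions
U, V = rng.standard_normal((d, k)), rng.standard_normal((d, k))
x, y = rng.standard_normal(d), rng.standard_normal(d)

# With ReLU, both latent vectors lie in the non-negative orthant,
# so the predicted rating is always non-negative.
assert nimc_predict(x, y, U, V, relu) >= 0.0
```

Setting `act` to the identity recovers the bilinear IMC prediction.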
Note that parameter estimation in both IMC and NIMC models is hard due to the nonconvexity of the corresponding optimization problem. However, for "nice" data, several strong results are known for the linear models, such as [CR09, JNS13, GJZ17] for MC and [JD13, XJZ13, CHD15] for IMC. In contrast, the nonlinearity in NIMC models adds to the complexity of an already challenging problem and has not been studied extensively, despite its popularity in practice.
In this paper, we study the simple one-layer neural network style NIMC model mentioned above. In particular, we formulate a squared-loss based optimization problem for estimating the parameters $U^*$ and $V^*$. We show that under a realizable model and Gaussian input assumption, the objective function is locally strongly convex within a "reasonably large" neighborhood of the ground truth. Moreover, we show that the above strong convexity claim holds even if the number of observed ratings is nearly linear in the dimension and polynomial in the conditioning of the weight matrices. In particular, for well-conditioned matrices, we can recover the underlying parameters using only $\widetilde{O}(d)$ user-item ratings, which is critical for practical recommendation systems as they tend to have very few ratings available per user. Our analysis covers popular activation functions, e.g., sigmoid and ReLU, and discusses various subtleties that arise due to the activation function. Finally, we discuss how we can leverage standard tensor decomposition techniques to initialize our parameters well. We would like to stress that practitioners typically use random initialization itself, and hence results studying random initialization for the NIMC model would be of significant interest.
As mentioned above, due to the nonlinearity of the activation function along with the nonconvexity of the parameter space, existing proof techniques do not apply directly to the problem. Moreover, we have to carefully argue about both the optimization landscape as well as the sample complexity of the algorithm, which is not carefully studied for neural networks. Our proof establishes some new techniques that might be of independent interest, e.g., how to handle the redundancy in the parameters for the ReLU activation. To the best of our knowledge, this is one of the first theoretically rigorous studies of neural-network based recommendation systems and will hopefully be a stepping stone for similar analyses of "deeper" neural-network based recommendation systems. We would also like to highlight that our model can be viewed as a strict generalization of a one-hidden-layer neural network; hence our result represents one of the few rigorous guarantees for models that are more powerful than one-hidden-layer neural networks [LY17, BGMSS18, ZSJ17].
Finally, we apply our model on synthetic datasets and verify our theoretical analysis. Further, we compare our NIMC model with standard linear IMC on several real-world recommendation-type problems, including user-movie rating prediction, gene-disease association prediction, and semi-supervised clustering. NIMC demonstrates significantly superior performance over IMC.
1.1 Related work
Collaborative filtering: Our model is a nonlinear version of the standard inductive matrix completion model [JD13]. Practically, IMC has been applied to gene-disease prediction [ND14], matrix sensing [ZJD15], multi-label classification [YJKD14], blog recommender systems [SCLD15], link prediction [CHD15], and semi-supervised clustering [CHD15, SCH16]. However, IMC restricts the latent space of users/items to be a linear transformation of the user/item feature space. [SCH16] extended the model to a three-layer neural network and showed significantly better empirical performance for multi-label/multi-class classification and semi-supervised problems.
Although standard IMC has linear mappings, it is still a nonconvex problem due to the bilinearity $x^\top U V^\top y$. To deal with this nonconvex problem, [JD13, Har14] provided recovery guarantees using alternating minimization with sample complexity linear in the dimension. [XJZ13] relaxed this problem to a nuclear-norm problem and also provided recovery guarantees. More general norms have been studied [RSW16, SWZ17a, SWZ17b, SWZ18], e.g., the weighted Frobenius norm and the entrywise $\ell_1$ norm. More recently, [ZDG18] used gradient-based nonconvex optimization and proved a better sample complexity. [CHD15] studied dirty-IMC models and showed that the sample complexity can be improved if the features are informative when compared to matrix completion. Several low-rank matrix sensing problems [ZJD15, GJZ17] are also closely related to IMC models, where the observations are sampled only from the diagonal elements of the rating matrix. [Ren10, LY16] introduced and studied an alternate framework for ratings prediction with side-information, but the prediction function is linear in their case as well.
Neural networks: Nonlinear activation functions play an important role in neural networks. Recently, several powerful results have been discovered for learning one-hidden-layer feedforward neural networks [Tia17, ZSJ17, JSA15, LY17, BGMSS18, GKKT17] and convolutional neural networks [BG17, ZSD17, DLT18a, DLT18b, GKM18]. However, our problem is a strict generalization of the one-hidden-layer neural network and is not covered by the above-mentioned results.
Notations. For any function $f$, we define $\widetilde{O}(f)$ to be $f \cdot \log^{O(1)}(f)$. For two functions $f, g$, we use the shorthand $f \lesssim g$ (resp. $f \gtrsim g$) to indicate that $f \leq C g$ (resp. $f \geq C g$) for an absolute constant $C$. We use $f \eqsim g$ to mean $c f \leq g \leq C f$ for constants $c, C$. We use $[n]$ to denote $\{1, 2, \ldots, n\}$.
2 Problem Formulation
Consider a user-item recommender system, where we have $n$ users with feature vectors $\{x_i \in \mathbb{R}^d\}_{i \in [n]}$, $m$ items with feature vectors $\{y_j \in \mathbb{R}^d\}_{j \in [m]}$, and a collection of partially-observed user-item ratings, $\{r_{ij}\}_{(i,j) \in \Omega}$. That is, $r_{ij}$ is the rating that user $i$ gave for item $j$. For simplicity, we assume the $x_i$'s and $y_j$'s are sampled i.i.d. from distributions $\mathcal{X}$ and $\mathcal{Y}$, respectively. Each element of the index set $\Omega$ is also sampled independently and uniformly with replacement from $[n] \times [m]$.
In this paper, our goal is to predict the rating for any user-item pair with feature vectors $x$ and $y$, respectively. We model the user-item ratings as:
$$ r_{ij} = \phi(U^{*\top} x_i)^\top \phi(V^{*\top} y_j), \qquad (2) $$
where $U^* \in \mathbb{R}^{d \times k}$, $V^* \in \mathbb{R}^{d \times k}$, and $\phi$ is a nonlinear activation function. Under this realizable model, our goal is to recover $U^*, V^*$ from a collection of observed entries, $\{r_{ij}\}_{(i,j) \in \Omega}$. Without loss of generality, we set $d_1 = d_2 = d$. Also, we treat $k$ as a constant throughout the paper. Our analysis requires $U^*, V^*$ to be full column rank, so we require $d \geq k$. And w.l.o.g., we assume $\sigma_k(U^*) = \sigma_k(V^*) = 1$, i.e., the smallest singular value of both $U^*$ and $V^*$ is $1$.
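The realizable model of Eq. (2) with Gaussian features can be simulated as follows (a minimal sketch with illustrative sizes; the rescaling so that the smallest singular value equals 1 mirrors the normalization assumed above):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, d, k = 100, 80, 10, 3   # illustrative numbers of users, items, dims
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def normalize(W):
    """Rescale W so its smallest singular value is exactly 1."""
    return W / np.linalg.svd(W, compute_uv=False).min()

U_star = normalize(rng.standard_normal((d, k)))
V_star = normalize(rng.standard_normal((d, k)))

X = rng.standard_normal((n, d))   # user features, i.i.d. Gaussian
Y = rng.standard_normal((m, d))   # item features, i.i.d. Gaussian

# Full rating matrix under the realizable model, then sample Omega
# uniformly with replacement from [n] x [m].
R = sigmoid(X @ U_star) @ sigmoid(Y @ V_star).T
num_obs = 500
Omega = [(rng.integers(n), rng.integers(m)) for _ in range(num_obs)]

assert np.isclose(np.linalg.svd(U_star, compute_uv=False).min(), 1.0)
```

With the sigmoid activation, every latent coordinate lies in $(0,1)$, so each rating lies strictly between $0$ and $k$.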
Note that this model is similar to the one-hidden-layer feedforward networks popular in standard classification/regression tasks. However, as there is an inner product between the outputs of two nonlinear layers, $\phi(U^{*\top} x_i)$ and $\phi(V^{*\top} y_j)$, it cannot be modeled by a single-hidden-layer neural network (with the same number of nodes). Also, for a linear activation function, the problem reduces to inductive matrix completion [ABEV06, JD13].
Now, to solve for $U^*$, $V^*$, we optimize a simple squared-loss based optimization problem, i.e.,
$$ \min_{U, V \in \mathbb{R}^{d \times k}} f(U, V) = \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \left( \phi(U^\top x_i)^\top \phi(V^\top y_j) - r_{ij} \right)^2. \qquad (3) $$
Naturally, the above problem is a challenging nonconvex optimization problem that is strictly harder than two nonconvex optimization problems which are challenging in their own right: a) linear inductive matrix completion, where the nonconvexity arises due to the bilinearity $x^\top U V^\top y$, and b) the standard one-hidden-layer neural network (NN). In fact, a lot of recent research has focused on understanding various properties of both the linear inductive matrix completion problem [GJZ17, JD13] as well as one-hidden-layer NNs [GLM18, ZSJ17].
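For reference, the empirical objective of Problem (3) and its gradients with sigmoid activation can be written out explicitly; the sketch below (hypothetical sizes and data, not the paper's code) validates the analytic gradient against a central finite difference:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

def loss_and_grads(U, V, X, Y, R, Omega):
    """Squared loss of Problem (3) and its gradients w.r.t. U and V."""
    gU, gV, loss = np.zeros_like(U), np.zeros_like(V), 0.0
    for i, j in Omega:
        a, b = sigmoid(U.T @ X[i]), sigmoid(V.T @ Y[j])
        e = a @ b - R[i, j]                    # prediction error
        loss += e ** 2
        # d/dU of (a.b - r)^2: chain rule through phi(U^T x_i)
        gU += 2 * e * np.outer(X[i], dsigmoid(U.T @ X[i]) * b)
        gV += 2 * e * np.outer(Y[j], dsigmoid(V.T @ Y[j]) * a)
    s = len(Omega)
    return loss / s, gU / s, gV / s

# Finite-difference check of one gradient entry.
rng = np.random.default_rng(3)
d, k, n, m = 5, 2, 20, 20
X, Y = rng.standard_normal((n, d)), rng.standard_normal((m, d))
U, V = rng.standard_normal((d, k)), rng.standard_normal((d, k))
R = rng.standard_normal((n, m))
Omega = [(rng.integers(n), rng.integers(m)) for _ in range(50)]

eps = 1e-5
_, gU, _ = loss_and_grads(U, V, X, Y, R, Omega)
Up = U.copy(); Up[0, 0] += eps
fp, _, _ = loss_and_grads(Up, V, X, Y, R, Omega)
Um = U.copy(); Um[0, 0] -= eps
fm, _, _ = loss_and_grads(Um, V, X, Y, R, Omega)
assert abs((fp - fm) / (2 * eps) - gU[0, 0]) < 1e-4
```

Gradient descent on this objective is exactly the local method analyzed in Section 3.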
In this paper, we show that despite the nonconvexity of Problem (3), it behaves as a convex optimization problem close to the optima if the data is sampled stochastically from a Gaussian distribution. This result, combined with standard tensor decomposition based initialization [ZSJ17, KCL15, JSA15], leads to a polynomial-time algorithm for solving (3) optimally if the data satisfies certain sampling assumptions given in Theorem 2.1. Moreover, we also discuss the effect of various activation functions, especially the difference between a sigmoid activation function vs. ReLU activation (see Theorem 3.2 and Theorem 3.4).
Informally, our recovery guarantee can be stated as follows,
Theorem 2.1 (Informal Recovery Guarantee).
Consider a recommender system with the realizable model of Eq. (2) with sigmoid activation. Assume the features $x_i$ and $y_j$ are sampled i.i.d. from the normal distribution, and the observed pairs $\Omega$ are sampled i.i.d. uniformly at random from $[n] \times [m]$. Then there exists an algorithm such that $U^*, V^*$ can be recovered to any precision $\epsilon$, with time complexity and sample complexity (the number of users/items and the number of observed ratings $|\Omega|$) polynomial in the dimension and the condition number of $U^*, V^*$, and logarithmic in $1/\epsilon$.
3 Main Results
Our main result shows that when initialized properly, gradient-based algorithms are guaranteed to converge to the ground truth. We first study the Hessian of the empirical risk for different activation functions; then, based on the positive-definiteness of the Hessian for smooth activations, we show local linear convergence of gradient descent. The proof sketch is provided in Appendix C.
The positive definiteness of the Hessian does not hold for several activation functions. Here we provide some examples. Counter Example 1) The Hessian at the ground truth for the linear activation is not positive definite, because for any full-rank matrix $R \in \mathbb{R}^{k \times k}$, $(U^* R, V^* R^{-\top})$ is also a global optimum. Counter Example 2) The Hessian at the ground truth for the ReLU activation is not positive definite, because for any diagonal matrix $D$ with positive diagonal elements, $(U^* D, V^* D^{-1})$ is also a global optimum. These counter examples share a common property: there is redundancy in the parameters. Surprisingly, for sigmoid and tanh, the Hessian around the ground truth is positive definite. More surprisingly, we will later show that for ReLU, if the parameter space is constrained properly, the Hessian at a given point near the ground truth can also be proved to be positive definite with high probability.
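The ReLU redundancy in Counter Example 2 is easy to verify numerically: since $\phi(cz) = c\,\phi(z)$ for $c > 0$, rescaling the $l$-th column of $U$ by $c_l$ and the $l$-th column of $V$ by $1/c_l$ leaves every prediction unchanged (a minimal sketch with hypothetical dimensions):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

rng = np.random.default_rng(4)
d, k = 6, 3
U, V = rng.standard_normal((d, k)), rng.standard_normal((d, k))
x, y = rng.standard_normal(d), rng.standard_normal(d)

def predict(U, V):
    return relu(U.T @ x) @ relu(V.T @ y)

# Positive diagonal rescaling: relu(c z) = c relu(z) for c > 0, so scaling
# column l of U by c_l and column l of V by 1/c_l cancels in the inner product.
c = np.array([0.5, 2.0, 3.0])
assert np.isclose(predict(U * c, V / c), predict(U, V))
```

This continuum of equivalent optima is exactly why the unconstrained Hessian cannot be positive definite at the ground truth for ReLU.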
3.1 Local Geometry and Local Linear Convergence for Sigmoid and Tanh
We define two natural condition numbers for the problem that capture the "hardness" of the problem:
Definition 3.1.
Define $\kappa := \max\{\sigma_1(U^*)/\sigma_k(U^*),\ \sigma_1(V^*)/\sigma_k(V^*)\}$ and $\lambda := \max\big\{\prod_{i=1}^k \sigma_i(U^*)/\sigma_k^k(U^*),\ \prod_{i=1}^k \sigma_i(V^*)/\sigma_k^k(V^*)\big\}$, where $\sigma_i(\cdot)$ denotes the $i$-th singular value, with the ordering $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_k$.
First we show the result for sigmoid and tanh activations.
Theorem 3.2 (Positive Definiteness of Hessian for Sigmoid and Tanh).
Remark. Theorem 3.2 shows that, given a sufficiently large number of user-item ratings and a sufficiently large number of users/items themselves, the Hessian at a point close enough to the true parameters $U^*, V^*$ is positive definite with high probability. The sample complexities, including the number of users/items and the number of ratings $|\Omega|$, have a near-linear dependency on the dimension, which matches the linear IMC analysis [JD13]. The strong convexity parameter as well as the sample complexity depend on the condition numbers of $U^*, V^*$ as defined in Definition 3.1. Although we do not explicitly show this dependence, both the sample complexity and the minimal eigenvalue scale polynomially with the condition numbers. The proofs can be found in Appendix C.
The above theorem shows that the Hessian is positive definite w.h.p. for a given $(U, V)$ that is close to the optima. This result, along with the smoothness of the activation function, implies linear convergence of a gradient descent method that samples a fresh batch of samples in each iteration, as shown in the following theorem, whose proof is postponed to Appendix E.1.
Theorem 3.3.
Let $(U^c, V^c)$ be the parameters in the $c$-th iteration, and assume $(U^c, V^c)$ is sufficiently close to $(U^*, V^*)$. Then, given a fresh sample set $\Omega$ that is independent of $(U^c, V^c)$ and satisfies the conditions in Theorem 3.2, the next iterate of one step of gradient descent, i.e., $(U^{c+1}, V^{c+1}) = (U^c, V^c) - \eta \nabla f(U^c, V^c)$, satisfies
$$ \|U^{c+1} - U^*\|_F^2 + \|V^{c+1} - V^*\|_F^2 \leq \Big(1 - \frac{m}{M}\Big) \big( \|U^c - U^*\|_F^2 + \|V^c - V^*\|_F^2 \big) $$
with high probability, where $\eta = 1/M$ is the step size, $m$ is the lower bound on the eigenvalues of the Hessian, and $M$ is the upper bound on the eigenvalues of the Hessian.
Remark. The linear convergence requires a fresh set of samples in each iteration. However, since the iterates converge linearly to the ground truth, we only need $O(\log(1/\epsilon))$ iterations; therefore, the sample complexity is only logarithmic in $1/\epsilon$. This dependency is better than directly using the tensor decomposition method [JSA15], which requires a number of samples polynomial in $1/\epsilon$. Note that we only use tensor decomposition to initialize the parameters; therefore, the sample complexity required by our tensor initialization does not depend on $\epsilon$.
3.2 Empirical Hessian around the Ground Truth for ReLU
We now present our result for the ReLU activation. As we saw in Counter Example 2, without any further modification, the Hessian for ReLU is not locally strongly convex due to the redundancy in the parameters. Therefore, we reduce the parameter space by fixing one parameter for each column pair $(u_l, v_l)$. In particular, we fix $u_{1,l} = u^*_{1,l}$ when minimizing the objective function, Eq. (3), where $u_{1,l}$ is the $l$-th element in the first row of $U$. Note that as long as $u^*_{1,l} \neq 0$, $u_{1,l}$ can be fixed to any other nonzero value; we set $u_{1,l} = u^*_{1,l}$ just for simplicity of the proof. The new objective function can be represented as
$$ \min_{W, V} f_{\mathrm{ReLU}}(W, V) = \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \Big( \phi\big([u_1^*; W]^\top x_i\big)^\top \phi(V^\top y_j) - r_{ij} \Big)^2, \qquad (4) $$
where $u_1^*$ is the first row of $U^*$ and $W \in \mathbb{R}^{(d-1) \times k}$ collects the remaining rows of $U$.
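The reduced parameterization can be sketched as follows (an illustration with hypothetical dimensions; in the reduced problem only `W` and `V` receive gradient updates, while the first row stays fixed):

```python
import numpy as np

def build_U(u1, W):
    """Stack the fixed first row u1 on top of the free parameters W."""
    return np.vstack([u1, W])

rng = np.random.default_rng(5)
d, k = 6, 3
u1 = np.ones(k)                       # fixed row; any nonzero values would do
W = rng.standard_normal((d - 1, k))   # free parameters, optimized over
U = build_U(u1, W)

assert U.shape == (d, k)
assert np.allclose(U[0], u1)
```

Fixing one entry per column removes exactly the positive-diagonal-rescaling degree of freedom exhibited in Counter Example 2.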
Surprisingly, after fixing one parameter for each column pair, the Hessian for ReLU is also positive definite w.h.p. for a given $(W, V)$ around the ground truth.
Theorem 3.4 (Positive Definiteness of Hessian for ReLU).
For any given $(W, V)$ sufficiently close to the ground truth, if the number of users/items and the number of observed ratings are large enough, then with high probability the minimal eigenvalue of the Hessian of the objective for the ReLU activation function, Eq. (4), is lower bounded by a positive quantity.
Remark. Similar to the sigmoid/tanh case, the sample complexity for the ReLU case also has a linear dependency on the dimension. However, here we have a worse dependency on the condition number of the weight matrices. The scale of the fixed parameters can also be important, and in practice one needs to set it carefully. Note that although the activation function is not smooth, the Hessian at a given point can still exist with probability 1, since ReLU is smooth almost everywhere and there are only a finite number of samples. However, owing to the non-smoothness, a proof of convergence of gradient descent methods for ReLU is still an open problem.
3.3 Initialization
To reach the ground truth, our algorithm needs an initialization method that lands the parameters in the neighborhood of the ground truth. Here we show that this is possible by using the tensor method under the Gaussian assumption.
In the following, we consider estimating $U^*$. Estimating $V^*$ is similar.
Define , where . Define . Then where and . When , we can approximately recover and from the empirical version of using nonorthogonal tensor decomposition [KCL15]. When is sigmoid, . Given , we can estimate , since is a monotonic function w.r.t. . Applying Lemma B.7 in [ZSJ17], we can bound the approximation error of empirical and population using polynomial number of samples. By [KCL15], we can bound the estimation error of and . Finally combining Theorem 3.2, we are able to show the recovery guarantee for sigmoid activation, i.e., Theorem 2.1.
Although tensor initialization has nice theoretical guarantees and sample complexity, it heavily depends on the Gaussian assumption and the realizable model assumption. In contrast, practitioners typically use random initialization.
4 Experiments
In this section, we show experimental results on both synthetic data and real-world data. Our experiments on synthetic data are intended to verify our theoretical analysis, while the real-world data shows the superior performance of NIMC over IMC. We apply gradient descent with random initialization to both NIMC and IMC.
4.1 Synthetic Data
We first generate synthetic datasets to verify the sample complexity and the convergence of gradient descent using random initialization. We fix the dimension $d$ and the latent dimension $k$, and vary the number of samples $n = m$ and the number of observations $|\Omega|$; the sampling rule follows our previous assumptions. For each $(n, |\Omega|)$ pair, we make 5 trials and report the fraction of successful recoveries. We say a solution $(U, V)$ successfully recovers the ground-truth parameters when it achieves a 0.001 relative testing error, i.e., when the relative error on a newly sampled testing dataset is below 0.001. For both ReLU and sigmoid, we minimize the original objective function (3). We illustrate the recovery rate in the left figures of Figure 1. As we can see, ReLU requires more samples/observations than sigmoid for exact recovery (note the scales of $n$ and $|\Omega|$ are different in the two figures). This is consistent with our theoretical results: comparing Theorem 3.2 and Theorem 3.4, the sample complexity for ReLU has a worse dependency on the conditioning of $U^*, V^*$ than sigmoid. We can also see that when $n$ is sufficiently large, the number of observed ratings required remains the same for both methods. This is also consistent with the theorems, where $|\Omega|$ is near-linear in the dimension and independent of $n$.
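The success criterion above can be sketched as follows (a minimal illustration; the exact normalization of the relative error in the paper's experiments is an assumption here — we use the Frobenius-norm ratio):

```python
import numpy as np

def relative_test_error(R_hat, R_test):
    """Relative error ||R_hat - R_test||_F / ||R_test||_F on held-out ratings."""
    return np.linalg.norm(R_hat - R_test) / np.linalg.norm(R_test)

def recovered(R_hat, R_test, tol=1e-3):
    """Declare a successful recovery when the relative testing error < 0.001."""
    return relative_test_error(R_hat, R_test) < tol

R = np.ones((4, 4))
assert recovered(R * (1 + 1e-4), R)       # tiny perturbation: success
assert not recovered(R * 1.1, R)          # 10% error: failure
```

The recovery rate reported in Figure 1 is then the fraction of trials for which `recovered` returns `True`.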
4.2 Semisupervised Clustering
We apply NIMC to semi-supervised clustering and follow the experimental setting of GIMC [SCH16]. In this problem, we are given a set of items with their features, $X \in \mathbb{R}^{n \times d}$, where $n$ is the number of items and $d$ is the feature dimension, and an incomplete similarity matrix $S$, where $S_{ij} = 1$ if the $i$-th item and the $j$-th item are similar and $S_{ij} = 0$ if the $i$-th item and the $j$-th item are dissimilar. The goal is to cluster using both the existing features and the partially observed similarity matrix. We build the dataset from a classification dataset, where the label of each item is known and used as the ground-truth cluster. We first compute the similarity matrix from the labels and sample entries uniformly as the observed entries. Since there is only one feature matrix, we set $x_i = y_i$ in the objective function Eq. (3).
We initialize $U$ and $V$ to be the same Gaussian random matrix and then apply gradient descent; this guarantees that $U$ and $V$ remain identical during the optimization process. Once converged, we take the top left singular vectors of the learned model to do k-means clustering. Following [SCH16], we define the clustering error as follows,
$$ \mathrm{err} = \binom{n}{2}^{-1} \sum_{i < j} \Big[ \mathbf{1}\{\pi(i) = \pi(j), \hat{\pi}(i) \neq \hat{\pi}(j)\} + \mathbf{1}\{\pi(i) \neq \pi(j), \hat{\pi}(i) = \hat{\pi}(j)\} \Big], $$
where $\pi$ is the ground-truth clustering and $\hat{\pi}$ is the predicted clustering. We compare NIMC with a ReLU activation function against IMC on six datasets, using raw features and random Fourier features (RFF). The random Fourier feature map is $x \mapsto \cos(W^\top x + b)$, where each entry of $W$ is i.i.d. sampled from a Gaussian distribution and $b$ is sampled uniformly. We use random Fourier features in order to see how increasing the depth of the neural network changes the performance. However, our analysis only covers one-layer neural networks; therefore, we use random Fourier features, which can be viewed as a two-layer neural network with the first-layer parameters fixed.
The number of random features is chosen such that a linear classifier using these random features achieves the best classification accuracy. The latent dimension $k$ is set to the same value for all datasets. The datasets mushroom, segment, letter, usps, and covtype are downloaded from the LIBSVM website. We subsample the covtype dataset to balance the samples from different classes, and preprocess the yalefaces dataset as described in [KTWA14]. As shown in the right table of Figure 1, when using raw features, NIMC achieves better clustering results than IMC in all cases. This is also true for most cases when using random Fourier features.
4.3 Recommendation Systems
Recommender systems are used in many real-world situations. Here we consider two tasks.
Movie recommendation for users. We use the MovieLens [Res97] datasets, which contain not only the ratings users give to movies but also the users' demographic information and the movies' genre information. Our goal is to predict the ratings that new users will give to the existing movies. We randomly split the users into existing users (training data) and new users (testing data) with ratio 4:1. The user features include 21 types of occupations, 7 different age ranges, and one gender feature; the movie features include 18 or 19 genre features (18 for ml-1m and 19 for ml-100k) and 20 features from the top 20 right singular vectors of the training rating matrix (which has size #training users by #movies). In our experiments, we set $k$ to be 50. We report results on the datasets ml-1m and ml-100k; for NIMC, we use the ReLU activation. As shown in Table 1, NIMC achieves much smaller RMSE than IMC on both the ml-100k and ml-1m datasets.
Table 1: Dataset statistics and test RMSE (the last two columns report NIMC and IMC, respectively).

Dataset  #movies  #users  #ratings  #movie feat.  #user feat.  RMSE NIMC  RMSE IMC
ml-100k  1682  943  100,000  39  29  1.034  1.321
ml-1m  3883  6040  1,000,000  38  29  1.021  1.320
Gene-disease association prediction. We use the dataset collected by [ND14], which has 300 gene features and 200 disease features. Our goal is to predict associated genes for a new disease given its features. Since the dataset contains only positive labels, this is a positive-unlabeled (PU) learning problem [HND15], also known as one-class matrix factorization [YHDL17]. We adapt our objective function to the following,
$$ \min_{U, V} \sum_{(i,j) \in \Omega_1} \left( \phi(U^\top x_i)^\top \phi(V^\top y_j) - A_{ij} \right)^2 + \alpha \sum_{(i,j) \in \Omega_1^c} \left( \phi(U^\top x_i)^\top \phi(V^\top y_j) \right)^2, \qquad (5) $$
where $A$ is the association matrix, $\Omega_1$ is the set of indices of observed associations, $\Omega_1^c$ is the complement of $\Omega_1$, and $\alpha$ is the penalty weight for unobserved associations. There are in total 12,331 genes and 3,209 diseases in the dataset. We randomly split the diseases into training diseases and testing diseases with ratio 4:1. The results are presented in Fig. 2. Following [ND14], we use the cumulative distribution of the ranks as a measure for comparing the performance of different methods, i.e., the probability that any ground-truth associated gene of a disease appears among the top retrieved genes for that disease.
In Fig. 2(a), we show how $k$ changes the performance of NIMC and IMC. In general, the larger $k$, the better the performance. The performance of IMC becomes stable when $k$ is larger than 100, while the performance of NIMC keeps increasing. Although IMC performs better than NIMC when $k$ is small, the performance of NIMC improves much faster than that of IMC as $k$ increases; the penalty weight $\alpha$ is fixed in this experiment. In Fig. 2(b), we show how $\alpha$ in Eq. (5) affects the performance: we tried a range of values of $\alpha$, and the choice of $\alpha$ noticeably affects the results. Fig. 2(c) shows the probability that any ground-truth associated gene of a disease appears among the retrieved top genes for different cutoffs, with the other parameters fixed. Fig. 2(d) shows the precision-recall curves for the different methods under the same setting.
5 Conclusion
In this paper, we studied a nonlinear IMC model that represents one of the simplest inductive models for neural-network-based recommender systems. We studied the local geometry of the empirical risk function and showed that, close to the optima, the function is strongly convex for both ReLU and sigmoid activations. Therefore, using a smooth activation function like sigmoid along with standard tensor initialization, gradient descent recovers the underlying model with polynomial sample complexity and time complexity. Thus we provide the first theoretically rigorous result for the nonlinear recommendation system problem, which we hope will spur further progress in the area of deep-learning based recommendation systems. Our experimental results on synthetic data match our analysis, and the results on real-world benchmarks for semi-supervised clustering and recommendation systems show superior performance over linear IMC.
References
 [ABEV06] Jacob Abernethy, Francis Bach, Theodoros Evgeniou, and JeanPhilippe Vert. Lowrank matrix factorization with attributes. arXiv preprint cs/0611124, 2006.
 [BG17] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with Gaussian inputs. In ICML. https://arxiv.org/pdf/1702.07966, 2017.
 [BGMSS18] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai ShalevShwartz. SGD learns overparameterized networks that provably generalize on linearly separable data. In ICLR. https://arxiv.org/pdf/1710.10174, 2018.
 [CAS16] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198. ACM, 2016.
 [CHD15] KaiYang Chiang, ChoJui Hsieh, and Inderjit S Dhillon. Matrix completion with noisy side information. In Advances in Neural Information Processing Systems, pages 3447–3455, 2015.
 [CR09] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, December 2009.
 [DLT18a] Simon S Du, Jason D Lee, and Yuandong Tian. When is a convolutional filter easy to learn? In ICLR. https://arxiv.org/pdf/1709.06129, 2018.
 [DLT18b] Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient descent learns onehiddenlayer CNN: Don’t be afraid of spurious local minima. In ICML. https://arxiv.org/pdf/1712.00779, 2018.
 [GJZ17] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In International Conference on Machine Learning, pages 1233–1242. https://arxiv.org/pdf/1704.00708, 2017.
 [GKKT17] Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably learning the ReLU in polynomial time. In 30th Annual Conference on Learning Theory (COLT). https://arxiv.org/pdf/1611.10258, 2017.
 [GKM18] Surbhi Goel, Adam Klivans, and Reghu Meka. Learning one convolutional layer with overlapping patches. In ICML. https://arxiv.org/pdf/1802.02547, 2018.
 [GLM18] Rong Ge, Jason D Lee, and Tengyu Ma. Learning onehiddenlayer neural networks with landscape design. In ICLR. https://arxiv.org/pdf/1711.00501, 2018.
 [GUH16] Carlos A GomezUribe and Neil Hunt. The Netflix recommender system: Algorithms, business value, and innovation. ACM Transactions on Management Information Systems (TMIS), 6(4):13, 2016.
 [Har14] Moritz Hardt. Understanding alternating minimization for matrix completion. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 651–660. IEEE, 2014.
 [HKZ12] Daniel Hsu, Sham M Kakade, and Tong Zhang. A tail inequality for quadratic forms of subGaussian random vectors. Electronic Communications in Probability, 17(52):1–6, 2012.
 [HND15] Cho-Jui Hsieh, Nagarajan Natarajan, and Inderjit Dhillon. PU learning for matrix completion. In International Conference on Machine Learning, pages 2445–2453, 2015.
 [JD13] Prateek Jain and Inderjit S Dhillon. Provable inductive matrix completion. arXiv preprint arXiv:1306.0626, 2013.
 [JNS13] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Lowrank matrix completion using alternating minimization. In STOC, 2013.
 [JSA15] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
 [KCL15] Volodymyr Kuleshov, Arun Chaganty, and Percy Liang. Tensor factorization via matrix factorization. In AISTATS, pages 507–516, 2015.
 [KTWA14] Matt Kusner, Stephen Tyree, Kilian Weinberger, and Kunal Agrawal. Stochastic neighbor compression. In International Conference on Machine Learning, pages 622–630, 2014.
 [LLL16] Chenyi Lei, Dong Liu, Weiping Li, Zheng-Jun Zha, and Houqiang Li. Comparative deep learning of hybrid representations for image recommendations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2545–2553, 2016.
 [LY16] Ming Lin and Jieping Ye. A non-convex one-pass framework for generalized factorization machine and rank-one matrix sensing. In Advances in Neural Information Processing Systems, pages 1633–1641, 2016.
 [LY17] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems, pages 597–607. https://arxiv.org/pdf/1705.09886, 2017.
 [ND14] Nagarajan Natarajan and Inderjit S Dhillon. Inductive matrix completion for predicting gene–disease associations. Bioinformatics, 30(12):i60–i68, 2014.
 [Ren10] Steffen Rendle. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 995–1000. IEEE, 2010.
 [Res97] GroupLens Research. MovieLens dataset. In University of Minnesota. http://www.grouplens.org/taxonomy/term/14, 1997.
 [RSW16] Ilya Razenshteyn, Zhao Song, and David P Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing (STOC), pages 250–263. ACM, 2016.
 [SCH16] Si Si, Kai-Yang Chiang, Cho-Jui Hsieh, Nikhil Rao, and Inderjit S Dhillon. Goal-directed inductive matrix completion. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1165–1174. ACM, 2016.
 [SCLD15] Donghyuk Shin, Suleyman Cetintas, Kuang-Chih Lee, and Inderjit S Dhillon. Tumblr blog recommendation with boosted inductive matrix completion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 203–212. ACM, 2015.
 [SWZ17a] Zhao Song, David P Woodruff, and Peilin Zhong. Low rank approximation with entrywise ℓ1-norm error. In Proceedings of the 49th Annual Symposium on the Theory of Computing (STOC). ACM, https://arxiv.org/pdf/1611.00898, 2017.
 [SWZ17b] Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. arXiv preprint arXiv:1704.08246, 2017.
 [SWZ18] Zhao Song, David P Woodruff, and Peilin Zhong. Towards a zero-one law for entrywise low rank approximation. 2018.
 [Tia17] Yuandong Tian. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. In ICML. https://arxiv.org/pdf/1703.00560, 2017.
 [Tro12] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
 [WW14] Xinxi Wang and Ye Wang. Improving content-based and hybrid music recommendation using deep learning. In Proceedings of the 22nd ACM international conference on Multimedia, pages 627–636. ACM, 2014.
 [XJZ13] Miao Xu, Rong Jin, and Zhi-Hua Zhou. Speedup matrix completion with side information: Application to multi-label learning. In NIPS, pages 2301–2309, 2013.
 [YHDL17] Hsiang-Fu Yu, Hsin-Yuan Huang, Inderjit S Dhillon, and Chih-Jen Lin. A unified algorithm for one-class structured matrix factorization with side information. In AAAI, 2017.
 [YJKD14] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit Dhillon. Large-scale multi-label learning with missing labels. In ICML, pages 593–601, 2014.
 [ZDG18] Xiao Zhang, Simon S Du, and Quanquan Gu. Fast and sample efficient inductive matrix completion via multi-phase Procrustes flow. In ICML. https://arxiv.org/pdf/1803.01233, 2018.
 [ZJD15] Kai Zhong, Prateek Jain, and Inderjit S. Dhillon. Efficient matrix sensing using rank-1 Gaussian measurements. In International Conference on Algorithmic Learning Theory, pages 3–18. Springer, 2015.
 [ZSD17] Kai Zhong, Zhao Song, and Inderjit S Dhillon. Learning non-overlapping convolutional neural networks with multiple kernels. arXiv preprint arXiv:1711.03440, 2017.
 [ZSJ17] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. In ICML. https://arxiv.org/pdf/1706.03175, 2017.
 [ZYL16] Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. Collaborative knowledge base embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 353–362. ACM, 2016.
 [ZYS17] Shuai Zhang, Lina Yao, and Aixin Sun. Deep learning based recommender system: A survey and new perspectives. arXiv preprint arXiv:1707.07435, 2017.
Appendix
Appendix A Notation
For any positive integer $n$, we use $[n]$ to denote the set $\{1, 2, \cdots, n\}$. For a random variable $X$, let $\mathbb{E}[X]$ denote the expectation of $X$ (if this quantity exists).
For any vector $x$, we use $\|x\|$ to denote its $\ell_2$ norm.
We provide several definitions related to a matrix $A$. Let $\det(A)$ denote the determinant of a square matrix $A$. Let $A^\top$ denote the transpose of $A$. Let $A^\dagger$ denote the Moore-Penrose pseudo-inverse of $A$. Let $A^{-1}$ denote the inverse of a full-rank square matrix $A$. Let $\|A\|_F$ denote the Frobenius norm of $A$. Let $\|A\|$ denote the spectral norm of $A$. Let $\sigma_i(A)$ denote the $i$-th largest singular value of $A$.
We use $\mathbf{1}_f$ to denote the indicator function, which is $1$ if $f$ holds and $0$ otherwise. Let $I_d$ denote the $d \times d$ identity matrix. We use $\phi$ to denote an activation function. We use $\mathcal{N}(\mu, \sigma^2)$ to denote the Gaussian distribution with mean $\mu$ and variance $\sigma^2$. For an integer $k$, we use $\mathcal{D}_k$ to denote the distribution $\mathcal{N}(0, I_k)$.
For any function $f$, we define $\widetilde{O}(f)$ to be $f \cdot \log^{O(1)}(f)$. In addition to $O(\cdot)$ notation, for two functions $f, g$, we use the shorthand $f \lesssim g$ (resp. $\gtrsim$) to indicate that $f \leq C g$ (resp. $\geq$) for an absolute constant $C$. We use $f \eqsim g$ to mean $c f \leq g \leq C f$ for constants $c, C > 0$.
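The matrix quantities defined above have direct numerical counterparts; as a quick illustration (the matrix `A` here is an arbitrary example, not from the paper), the determinant, pseudo-inverse, Frobenius/spectral norms, and singular values can all be computed with NumPy:

```python
import numpy as np

A = np.array([[2.0, 0.0], [1.0, 3.0]])

det_A = np.linalg.det(A)                      # determinant of a square matrix
A_T = A.T                                     # transpose
A_pinv = np.linalg.pinv(A)                    # Moore-Penrose pseudo-inverse
A_inv = np.linalg.inv(A)                      # inverse (A is full rank here)
fro = np.linalg.norm(A, 'fro')                # Frobenius norm
spec = np.linalg.norm(A, 2)                   # spectral norm
sigmas = np.linalg.svd(A, compute_uv=False)   # singular values, descending

# For a full-rank square matrix the pseudo-inverse equals the inverse.
assert np.allclose(A_pinv, A_inv)
# The spectral norm is the largest singular value.
assert np.isclose(spec, sigmas[0])
# Sanity checks on this particular A.
assert np.isclose(det_A, 6.0)
assert np.isclose(fro, np.sqrt(14.0))
```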
Appendix B Preliminaries
We state some useful facts in this section.
Fact B.1.
Let . Let  denote the vector whose th entry is , and let  denote the vector whose th entry is . We have the following properties:
Proof.
Using the definitions, it is easy to see that (I), (II) and (III) hold.
Proof of (IV). We have
where the last step follows by (II) and (III). ∎
Fact B.2.
Let . Let  denote the vector whose th entry is , and let  denote the vector whose th entry is . We have the following properties:
Proof.
Proof of (I). We have
Proof of (II). It is similar to (I).
Proof of (III). We have
Proof of (IV). We have
where the second step follows by , and the third step follows by . ∎
Appendix C Proof Sketch
At a high level, the proofs of Theorem 3.2 and Theorem 3.4 consist of the following steps: 1) show that the population Hessian at the ground truth is positive definite; 2) show that population Hessians near the ground truth are also positive definite; 3) employ the matrix Bernstein inequality to bound the difference between the population Hessian and the empirical Hessian.
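Since the objective in Eq. (3) is not reproduced in this chunk, the sketch below illustrates Step 1 numerically on a hypothetical small realizable instance of the model, with prediction $\phi(U^\top x)^\top \phi(V^\top y)$ and sigmoid activation; all names (`U_star`, `loss`, dimensions) are illustrative, not the paper's. At the ground truth the residuals vanish, so the empirical Hessian reduces to a Gauss-Newton term and is positive semidefinite, which a finite-difference Hessian confirms:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, k, n = 3, 3, 2, 50
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# hypothetical ground-truth parameters and Gaussian features
U_star = rng.standard_normal((d1, k))
V_star = rng.standard_normal((d2, k))
X = rng.standard_normal((n, d1))
Y = rng.standard_normal((n, d2))
# realizable ratings: r_i = phi(U*^T x_i)^T phi(V*^T y_i)
R = np.sum(sigmoid(X @ U_star) * sigmoid(Y @ V_star), axis=1)

def loss(theta):
    """Squared loss over the flattened parameters (U, V)."""
    U = theta[:d1 * k].reshape(d1, k)
    V = theta[d1 * k:].reshape(d2, k)
    pred = np.sum(sigmoid(X @ U) * sigmoid(Y @ V), axis=1)
    return 0.5 * np.mean((pred - R) ** 2)

def num_hessian(f, theta, h=1e-4):
    """Central-difference Hessian of f at theta."""
    p = theta.size
    H = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            ei = np.zeros(p); ei[i] = h
            ej = np.zeros(p); ej[j] = h
            H[i, j] = (f(theta + ei + ej) - f(theta + ei - ej)
                       - f(theta - ei + ej) + f(theta - ei - ej)) / (4 * h * h)
    return H

theta_star = np.concatenate([U_star.ravel(), V_star.ravel()])
H = num_hessian(loss, theta_star)
lam_min = np.linalg.eigvalsh((H + H.T) / 2).min()
# the loss is zero at the ground truth, so the Hessian is PSD (up to FD error)
assert loss(theta_star) == 0.0
assert lam_min > -1e-3
```

Steps 2 and 3 (positive definiteness near the ground truth and the population-to-empirical concentration) are purely analytical and not checked here.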
We now formulate the Hessian. The Hessian of Eq. (3), , can be decomposed into two types of blocks, (),
where  (resp. ) is the th column of  (resp. the th column of ). Note that each of the above second-order derivatives is a matrix.
The first type of block is given by:
where , , and
For the sigmoid/tanh activation functions, the second type of block is given by:
(6) 
For the ReLU/leaky ReLU activation functions, the second type of block is given by:
Note that the second term of Eq. (6) is missing here: the samples are fixed and finite in number, and $\phi''(z) = 0$ almost everywhere for ReLU/leaky ReLU.
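The vanishing of $\phi''$ almost everywhere is easy to verify numerically: away from the kink at zero the ReLU is locally linear, while the sigmoid has genuine curvature everywhere. A small finite-difference check (sample points chosen arbitrarily):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def second_derivative(phi, z, h=1e-4):
    """Central second difference approximating phi''(z)."""
    return (phi(z + h) - 2.0 * phi(z) + phi(z - h)) / (h * h)

# at generic (nonzero) points the ReLU is locally linear, so phi'' = 0
zs = np.array([-1.3, -0.2, 0.7, 2.1])
assert np.allclose(second_derivative(relu, zs), 0.0, atol=1e-6)
# the sigmoid, in contrast, has nonvanishing curvature at these points
assert not np.allclose(second_derivative(sigmoid, zs), 0.0, atol=1e-6)
```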
In this section, we discuss the key lemmas/theorems for Step 1 in Appendix C.1 and for Steps 2 and 3 in Appendix C.3.
C.1 Positive definiteness of the population Hessian
The corresponding population risk for Eq. (3) is given by:
(7) 
where . For simplicity, we also assume that the feature vectors follow normal distributions.
Now we study the Hessian of the population risk at the ground truth. Let the Hessian of  at the ground truth be , which can be decomposed into the following two types of blocks (),
To study the positive definiteness of , we characterize the minimal eigenvalue of by a constrained optimization problem,
(8) 
where  denotes that . Obviously, the minimal eigenvalue is non-negative due to the squared loss and the realizable assumption. However, this is not sufficient for local convexity around the ground truth, which requires positive (semi-)definiteness in a neighborhood of the ground truth. In other words, we need to show that the minimal eigenvalue is strictly greater than zero, so that we can characterize an area in which the Hessian preserves positive definiteness (PD) despite the deviation from the ground truth.
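The characterization in Eq. (8) uses the standard fact that the minimal eigenvalue of a symmetric matrix is the minimum of the quadratic form $a^\top H a$ over unit vectors $a$. A small NumPy illustration on a hypothetical PSD matrix (constructed as $B^\top B$, mimicking a Hessian at a global minimum of a squared loss):

```python
import numpy as np

rng = np.random.default_rng(1)
# a hypothetical Hessian: PSD by construction, as at the ground truth
B = rng.standard_normal((6, 6))
H = B.T @ B

lam_min = np.linalg.eigvalsh(H).min()

# every unit vector's quadratic form is bounded below by the minimal eigenvalue
samples = rng.standard_normal((1000, 6))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
quad = np.einsum('ij,jk,ik->i', samples, H, samples)   # a^T H a per sample
assert quad.min() >= lam_min - 1e-10
# PSD since H = B^T B (allow tiny numerical slack)
assert lam_min >= -1e-8
```

Knowing only $\lambda_{\min} \geq 0$ is the situation described above; the proofs must establish a quantitative strictly positive lower bound.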
Challenges. As mentioned previously, some activation functions lead to redundancy in the parameters; hence one challenge is to distill the properties of the activation function that preserve the PD. Another challenge is the correlation introduced when the ground-truth parameter matrices are non-orthogonal. We therefore first study the minimal eigenvalue for orthogonal parameter matrices, and then link the non-orthogonal case to the orthogonal case.
C.2 Warm up: orthogonal case
In this section, we consider the case when the ground-truth parameter matrices are unitary, i.e., their columns are orthonormal. This case is easier to analyze because the dependency between different elements of the parameters can be disentangled, and we are able to provide a lower bound for the Hessian. Before introducing the lower bound, let us first define the following quantities for an activation function .
(9)  
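The exact quantities in Eq. (9) are not recoverable from this extraction, but lower bounds of this kind typically depend on Gaussian moments of the activation's derivative, such as $\mathbb{E}[\phi'(z)]$, $\mathbb{E}[\phi'(z)^2]$, and $\mathbb{E}[z\,\phi'(z)]$ for $z \sim \mathcal{N}(0,1)$. These are easy to estimate by Monte Carlo; the sketch below does so for the sigmoid (the specific moments chosen are illustrative, not necessarily those in Eq. (9)):

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(200_000)   # samples from N(0, 1)

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
phi_prime = sigmoid(z) * (1.0 - sigmoid(z))   # derivative of the sigmoid

m1 = phi_prime.mean()          # ~ E[phi'(z)]
m2 = (phi_prime ** 2).mean()   # ~ E[phi'(z)^2]
m3 = (z * phi_prime).mean()    # ~ E[z phi'(z)], zero since phi' is even

# sigma'(t) lies in (0, 1/4], so its mean does too
assert 0.0 < m1 <= 0.25
# phi' <= 1/4 < 1 implies phi'^2 <= phi' pointwise
assert m2 <= m1
# odd integrand against an even density: expectation is 0 up to MC error
assert abs(m3) < 0.01
```

Such moments being strictly positive (for sigmoid/tanh) is what ultimately makes a strictly positive eigenvalue lower bound possible, whereas ReLU-type activations require a separate treatment.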