Exploration of Large Networks with Covariates via Fast and Universal Latent Space Model Fitting

# Exploration of Large Networks with Covariates via Fast and Universal Latent Space Model Fitting

Zhuang Ma  and  Zongming Ma111 Email: zongming@wharton.upenn.edu.

University of Pennsylvania

###### Abstract

Latent space models are effective tools for statistical modeling and exploration of network data. These models can effectively model real world network characteristics such as degree heterogeneity, transitivity, homophily, etc. Due to their close connection to generalized linear models, it is also natural to incorporate covariate information in them. The current paper presents two universal fitting algorithms for networks with edge covariates: one based on nuclear norm penalization and the other based on projected gradient descent. Both algorithms are motivated by maximizing likelihood for a special class of inner-product models while working simultaneously for a wide range of different latent space models, such as distance models, which allow latent vectors to affect edge formation in flexible ways. These fitting methods, especially the one based on projected gradient descent, are fast and scalable to large networks. We obtain their rates of convergence for both inner-product models and beyond. The effectiveness of the modeling approach and fitting algorithms is demonstrated on five real world network datasets for different statistical tasks, including community detection with and without edge covariates, and network assisted learning.

Keywords: community detection, network with covariates, non-convex optimization, projected gradient descent.

\tcbuselibrary

skins, listings, theorems, breakable, hooks

## 1 Introduction

Network is a prevalent form of data for quantitative and qualitative analysis in a number of fields, including but not limited to sociology, computer science, neuroscience, etc. Moreover, due to advances in science and technology, the sizes of the networks we encounter are ever increasing. Therefore, to explore, to visualize and to utilize the information in large networks poses significant challenges to Statistics. Unlike traditional datasets in which a number of features are recorded for each subject, network datasets provide information on the relation among all subjects under study, sometimes together with additional features. In this paper, we focus on the modeling, visualization and exploration of networks in which additional features might be observed for each node pair.

On real world networks, people oftentimes observe the following characteristics. First, the degree distributions of nodes are often right-skewed and so networks exhibit degree heterogeneity. In addition, connections in networks often demonstrate transitivity, that is nodes with common neighbors are more likely to be connected. Moreover, nodes that are similar in certain ways (students in the same grade, brain regions that are close physically, etc.) are more likely to form bonds. Such a phenomenon is usually called homophily in network studies. Furthermore, nodes in some networks exhibit clustering effect and in such cases it is desirable to partition the nodes into different communities.

An efficient way to explore network data and to extract key information is to fit appropriate statistical models on them. To date, there have been a collection of network models proposed by researchers in various fields. These models aim to catch different subsets of the foregoing characteristics, and Goldenberg et al. [26] provides a comprehensive overview. An important class of network models are latent space models [31]. Suppose there are nodes in the observed network. The key idea underlying latent space modeling is that each node can be represented by a vector in some low dimensional Euclidean space (or some other metric space of choice, see e.g. [37, 5] for latent spaces with negative curvature) that is sometimes called the social space, and nodes that are “close” in the social space are more likely to be connected. Hoff et al. [31] considered two types of latent space models: distance models and projection models. In both cases, the latent vectors were treated as fixed effects. Later, a series of papers [28, 27, 30, 39] generalized the original proposal in [31] for better modeling of other characteristics of social networks, such as clustering, degree heterogeneity, etc. In these generalizations, the ’s were treated as random effects generated from certain multivariate Gaussian mixtures. Moreover, model fitting and inference in these models has been carried out via Markov Chain Monte Carlo, and it is difficult to scale these methodologies to handle large networks [26]. In addition, one needs to use different likelihood function based on choice of model and there is little understanding of the quality of fitting when the model is mis-specified. Albeit these disadvantages, latent space models are attractive due to their friendliness to interpretation and visualization.

For concreteness, assume that we observe an undirected network represented by a symmetric adjacency matrix on nodes with if nodes and are connected and zero otherwise. In addition, we may also observe a symmetric pairwise covariate matrix which measures certain characteristics of node pairs. We do not allow self-loop and so we set the diagonal elements of the matrices and to be zeros. The covariate can be binary, such as an indicator of whether nodes and share some common attribute (e.g. gender, location, etc) or it can be continuous, such as a distance/similarity measure (e.g. difference in age, income, etc). It is straightforward to generalize the methods and theory in this paper to a finite number of covariates.

In this paper, we aim to tackle the following two key issues in latent space modeling of network data:

• First, we seek a class of latent space models that is special enough so that we can design fast fitting algorithms for them and hence be able to handle networks of very large sizes;

• In addition, we would like to be able to fit a class of models that are flexible enough to well approximate a wide range of latent space models of interest so that fitting methods for this flexible class continue to work even when the model is mis-specified.

From a practical viewpoint, if one is able to find such a class of models and design fast algorithms for fitting them, then one would be able to use this class as working models and to use the associated fast algorithms to effectively explore large networks.

### 1.1 Main contributions

We make progress on tackling the foregoing two issues simultaneously in the present paper, which we summarize as the following main contributions:

1. We first consider a special class of latent space models, called inner-product models, and design two fast fitting algorithms for this class. Let the observed -by- adjacency matrix and covariate matrix be and , respectively. The inner-product model assumes that for any ,

 Aij=Aji\lx@stackrelind.∼Bernoulli(Pij),withlogit(Pij)=Θij=αi+αj+βXij+z⊤izj, (1)

where for any , . Here, , , are parameters modeling degree heterogeneity. The parameter is the coefficient for the observed covariate, and is the inner-product between the latent vectors. As we will show later in \prettyrefsec:model, this class of models can incorporate degree heterogeneity, transitivity and homophily explicitly. From a matrix estimation viewpoint, the matrix is of rank at most that can be much smaller than . Motivated by recent advances in low rank matrix estimation, we design two fast algorithms for fitting (1). One algorithm is based on lifting and nuclear norm penalization of the negative log-likelihood function. The other is based on directly optimizing the negative log-likelihood function via projected gradient descent. For both algorithms, we establish high probability error bounds for inner-product models. The connection between model (1) and the associated algorithms and other related work in the literature will be discussed immdediately in next subsection.

2. We further show that these two fitting algorithms are “universal” in the sense that they can work simultaneously for a wide range of latent space models beyond the inner-product model class. For example, they work for the distance model and the Gaussian kernel model in which the inner-product term in (1) is replaced with and , respectively. Thus, the class of inner-product models is flexible and can be used to approximate many other latent space models of interest. In addition, the associated algorithms can be applied to networks generated from a wide range of mis-specified models and still yield reasonable results. The key mathematical insight that enables such flexibility is introduced in \prettyrefsec:model as the Schoenberg Condition (7).

3. We demonstrate the effectiveness of the model and algorithms on real data examples. In particular, we fit inner-product models by the proposed algorithms on five different real network datasets for several different tasks, including visualization, clustering and network-assisted classification. On three popular benchmark datasets for testing community detection on networks, a simple -means clustering on the estimated latent vectors obtained by our algorithm yields as good result on one dataset and better results on the other two when compared with four state-of-the-art methods. The same “model fitting followed by -means clustering” approach also yields nice clustering of nodes on a social network with edge covariates. Due to the nature of latent space models, for all datasets on which we fit the model, we obtain natural visualizations of the networks by plotting latent vectors. Furthermore, we illustrate how network information can be incorporated in traditional learning problems using a document classification example.

A Matlab implementation of the methodology in the present paper is available upon request.

### 1.2 Other related works

The current form of the inner-product model (1) has previously appeared in Hoff [28] and Hoff [29], though the parameters were modeled as random effects rather than fixed values, and Bayesian approaches were proposed for estimating variance parameters of the random effects. Hoff [30] proposed a latent eigenmodel which has a probit link function as opposed to the logistic link function in the present paper. As in [28] and [29], parameters were modeled as random effects and model fitting was through Bayesian methods. It was shown that the eigenmodel weakly generalizes the distance model in the sense that the order of the entries in the latent component can be preserved. This is complementary to our results which aim to approximate the latent component directly in some matrix norm. An advantage of the eigenmodel is its ability to generalize the latent class model, whereas the inner-product model (1) and the more general model we shall consider generalize a subset of latent class models due to the constraint that the latent component (after centering) is positive semi-definite. We shall return to this point later in \prettyrefsec:discuss. Young and Scheinerman [57] proposed a random dot product model, which can be viewed as an inner-product model with identity link function. The authors studied a number of properties of the model, such as diameter, clustering and degree distribution. Tang et al. [51] studied properties of the leading eigenvectors of the adjacency matrices of latent positions graphs (together with its implication on classification in such models) where the connection probability of two nodes is the value that some universal kernel [46] takes on the associated latent positions and hence generalizes the random dot product model. This work is close in spirit to the present paper. However, there are several important differences. First, the focus here is model fitting/parameter estimation as opposed to classification in [51]. In addition, any universal kernel considered in [51] satisfies the Schoenberg condition (7) and thus is covered by the methods and theory of the present paper, and so we cover a broader range of models that inner-product models can approximate. This is also partially due to the different inference goals. Furthermore, we allow the presence of observed covariates while [51] did not.

When fitting a network model, we are essentially modeling and estimating the edge probability matrix. From this viewpoint, the present paper is related to the literautre on graphon estimation and edge probability matrix estimation for block models. See, for instance, [6, 4, 55, 22, 35, 24] and the references therein. However, the block models have stronger structural assumptions than the latent space models we are going to investigate.

The algorithmic and theoretical aspects of the paper is also closely connected to the line of research on low rank matrix estimation, which plays an important role in many applications such as phase retrieval [12, 13] and matrix completion [11, 33, 34, 10, 36]. Indeed, the idea of nuclear norm penalization has originated from matrix completion for both general entries [11] and binary entries [19]. In particular, our convex approach can be viewed as a Lagrangian form of the proposal in [19] when there is no covariate and the matrix is fully observed. We have nonetheless decided to spell out details on both method and theory for the convex approach because the matrix completion literature typically does not take into account the potential presence of observed covariates. On the other hand, the idea of directly optimizing a non-convex objective function involving a low rank matrix has been studied recently in a series of papers. See, for instance, [42, 9, 50, 54, 15, 60, 25] and the references therein. Among these papers, the one that is the most related to the projected gradient descent algorithm we are to propose and analyze is [15] which focused on estimating a positive semi-definite matrix of exact low rank in a collection of interesting problems. Another recent and related work [56] has appeared after the initial posting of the present manuscript. However, we will obtain tighter error bounds for latent space models and we will go beyond the exact low rank scenario.

### 1.3 Organization

After a brief introduction of standard notation used throughout the paper, the rest of the paper is organized as follows. \prettyrefsec:model introduces both inner-product models and a broader class of latent space models on which our fitting methods work. The two fitting methods are described in detail in \prettyrefsec:fitting, followed by their theoretical guarantees in \prettyrefsec:theory under both inner-product models and the general class. The theoretical results are further corroborated by simulated examples in \prettyrefsec:simu. \prettyrefsec:data demonstrates the competitive performance of the modeling approach and fitting methods on five different real network datasets. We discuss interesting related problems in \prettyrefsec:discuss and present proofs of the main results in \prettyrefsec:proof. Technical details justifying the initialization methods for the project gradient descent approach are deferred to the appendix.

##### Notation

For , stands for its trace. For , defines an inner product between them. If , for any matrix with singular value decomposition , , and stand for the nuclear norm, the Frobenius norm and the operator norm of the matrix, respectively. Moreover, and denote the -th row and -th column of , and for any function , is the shorthand for applying element-wisely to , that is and . Let be the set of all positive semidefinite matrices and be the set of all orthonormal matrices. For any convex set , is the projection onto the .

## 2 Latent space models

In this section, we first give a detailed introduction of the inner-product model (1) and conditions for its identifiability. In addition, we introduce a more general class of latent space models that includes the inner-product model as a special case. The methods we propose later will be motivated by the inner-product model and can also be applied to the more general class.

### 2.1 Inner-product models

Recall the inner-product model defined in (1), i.e., for any observed and and any ,

 Aij=Aji\lx@stackrelind.∼Bernoulli(Pij),withlogit(Pij)=Θij=αi+αj+βXij+z⊤izj.

Fixing all other parameters, if we increase , then node has higher chances of connecting with other nodes. Therefore, the ’s model degree heterogeneity of nodes and we call them degree heterogeneity parameters. Next, the regression coefficient moderates the contribution of covariate to edge formation. For instance, if indicates whether nodes and share some common attribute such as gender, then a positive value implies that nodes that share common attribute are more likely to connect. Such a phenomenon is called homophily in the social network literaute. Last but not least, the latent variables enter the model through their inner-product , and hence is the name of the model. We impose no additional structural/distributional assumptions on the latent variables for the sake of modeling flexibility.

We note that model (1) also allows the latent variables to enter the second equation in the form of . To see this, note that , and we may re-parameterize by setting for all . Then we have

 Θij=αi+αj+βXij−12∥zi−zj∥2=~αi+~αj+βXij+z⊤izj.

An important implication of this observation is that the function directly models transitivity, i.e., nodes with common neighbors are more likely to connect since their latent variables are more likely to be close to each other in the latent space. In view of the foregoing discussion, the inner-product model (1) also enjoys this nice modeling capacity.

In matrix form, we have

 Θ=α1n⊤+1nα⊤+βX+G (2)

where is the all one vector in and with . Since there is no self-edge and is symmetric, only the upper diagonal elements of are well defined, which we denote by . Nonetheless we define the diagonal element of as in (2) since it is inconsequential. To ensure identifiability of model parameters in (1), we assume the latent variables are centered, that is

 JZ=ZwhereJ=In−1n1n1n⊤. (3)

Note that this condition uniquely identifies up to a common orthogonal transformation of its rows while is now directly identifiable.

### 2.2 A more general class and the Schoenberg condition

Model (1) is a special case of a more general class of latent space models, which can be defined by

 Aij=Aji\lx@stackrelind.∼Bernoulli(Pij),withlogit(Pij)=Θij=~αi+~αj+βXij+ℓ(zi,zj) (4)

where is a smooth symmetric function on . We shall impose an additional constraint on following the discussion below. In matrix form, for and , we can write

 Θ=~α1n⊤+1n~α⊤+βX+L.

To better connect with (2), let

 G=JLJ,andα1n⊤+1nα⊤=~α1n⊤+1n~α⊤+L−JLJ. (5)

Note that the second equality in the last display holds since the expression on its righthand side is symmetric and of rank at most two. Then we can rewrite the second last display as

 Θ=α1n⊤+1nα⊤+βX+G (6)

which reduces to (2) and satisfies . Our additional constraint on is the following Schoenberg condition:

 For any positive integer n≥2 and any z1,…,zn∈Rk, (7) G=JLJ is positive semi-definite for L=(ℓ(zi,zj)) and J=In−1n1n1n⊤.

Condition (7) may seem abstract, while the following lemma elucidates two important classes of symmetric functions for which it is satisfied.

###### Lemma 2.1.

Condition (7) is satisfied in the following cases:

1. is a positive semi-definite kernel function on ;

2. for some and where is the -norm (or -seminorm when ) on .

The first claim of \prettyreflem:h-constraint is a direct consequence of the definition of positive semi-definite kernel function which ensures that the matrix itself is positive semi-definite and so is since is also positive semi-definite. The second claim is a direct consequence of the Hilbert space embedding result by Schoenberg [48, 49]. See, for instance, Theorems 1 and 2 of [48].

## 3 Two model fitting methods

This section presents two methods for fitting models (1) and (4)–(7). Both methods are motivated by minimizing the negative log-likelihood function of the inner-product model, and can be regarded as pseudo-likelihood approaches for more general models. From a methodological viewpoint, a key advantage of these methods, in particular the projected gradient descent method, is the scalability to networks of large sizes.

### 3.1 A convex approach via penalized MLE

In either the inner-product model or the general model we suppose the parameter matrix in (2) or (6) satisfies

 −M1≤Θij≤−M2  for  1≤i≠j≤n,and|Θii|≤M1  for  1≤i≤n. (8)

where are non-negative. Then for any satisfying (8), the corresponding edge probabilities satisfy

 12e−M1≤11+eM1≤Pij≤11+eM2≤e−M2, 1≤i≠j≤n. (9)

Thus controls the conditioning of the problem and controls the sparsity of the network.

Let be the sigmoid function, i.e. the inverse of logit function, then for any , and the log-likelihood function of the inner-product model (1) can be written as

 ∑i

Recall that in inner-product models. The MLE of is the solution to the following rank constrained optimization problem:

 minΘu,α,β,G −∑i

This optimization problem is non-convex and generally intractable. To overcome this difficulty, we consider a convex relaxation that replaces the rank constraint on in (10) with a penalty term on its nuclear norm. Since is positive semi-definite, its nuclear norm equals its trace. Thus, our first model fitting scheme solves the following convex program:

 minα,β,G −∑i,j{AijΘij+log(1−σ(Θij))}+λnTr(G) (11) subject to Θ=α1n⊤+1nα⊤+βX+G,GJ=G,  G∈Sn+,  −M1≤Θij≤−M2.

The convex model fitting method (11) is motivated by the nuclear norm penalization idea originated from the matrix completion literature. See, for instance, [11], [10], [36], [19] and the references therein. In particular, it can be viewed as a Lagrangian form of the proposal in [19] when there is no covariate and the matrix is fully observed. However, we have decided to make this proposal and study the theoretical properties as the existing literature, such as [19], does not take in consideration the potential presence of observed covariates. Furthermore, one can still solve (11) when the true underlying model is one of the general models introduced in \prettyrefsec:generalmodel. The appropriate choice of will be discussed in \prettyrefsec:theory.

###### Remark 3.1.

In addition to the introduction of the trace penalty, the first term in the objective function in (11) now sums over all pairs. Due to symmetry, after scaling, the difference from the sum in (10) lies in the inclusion of all diagonal terms in . This slight modification leads to neither theoretical consequence nor noticeable difference in practice. However, it allows easier implementation and simplifies the theoretical investigation. We would note that the constraint is included partially for obtaining theoretical guarantees. In simulated examples reported in \prettyrefsec:simu, we found that the convex program worked equally well without this constraint.

### 3.2 A non-convex approach via projected gradient descent

Although the foregoing convex relaxation method is conceptually neat, state-of-the-art algorithms to solve the nuclear (trace) norm minimization problem (11), such as iterative singular value thresholding, usually require computing a full singular value decomposition at every iteration, which can still be time consuming when fitting very large networks.

To further improve scalability of model fitting, we propose an efficient first order algorithm that directly tackles the following non-convex objective function:

 minZ,α,β g(Z,α,β)=−∑i,j{AijΘij+log(1−σ(Θij))}  where Θ=α1n⊤+1nα⊤+βX+ZZ⊤. (12)

The detailed description of the method is presented in Algorithm 1.

After initialization, Algorithm 1 iteratively updates the estimates for the three parameters, namely , and . In each iteration, for each parameter, the algorithm first descends along the gradient direction by a pre-specified step size. The descent step is then followed by an additional projection step which projects the updated estimates to pre-specified constraint sets. We propose to set the step sizes as

 ηZ=η/∥∥Z0∥∥2op,  ηα=η/(2n),andηβ=η/(2∥X∥2F) (13)

for some small numeric constant . To establish the desired theoretical guarantees, we make a specific choice of the constraint sets later in the statement of \prettyrefthm:nonconvex and \prettyrefthm:kernel-nonconvex-upper. In practice, one may simply set

 Zt+1=J˜Zt+1,αt+1=˜αt+1,andβt+1=˜βt+1. (14)

Here and after, when there is no covariate, i.e. , we skip the update of in each iteration.

For each iteration, the update on the latent part is performed in the space of (that is ) rather than the space of all Gram matrices as was required in the convex approach. In this way, it reduces the computational cost per iteration from to . Since we are most interested in cases where , such a reduction leads to improved scalability of the non-convex approach to large networks. To implement this non-convex algorithm, we need to specify the latent space dimension , which was not needed for the convex program (11). We defer the discussion on the data-driven choice of to \prettyrefsec:discuss.

We note that Algorithm 1 is not guaranteed to find any global minimizer, or even any local minimizer, of the objective function (12). However, as we shall show later in \prettyrefsec:theory, under appropriate conditions, the iterates generated by the algorithm will quickly enter a neighborhood of the true parameters () and any element in this neighborhood is statistically at least as good as the estimator obtained from the convex method (11). This approach has close connection to the investigation of various non-convex methods for other statistical and signal processing applications. See for instance [13], [15] and the references therein. Our theoretical analysis of the algorithm is going to provide some additional insight as we shall establish its high probability error bounds for both the exact and the approximate low rank scenarios. In the rest of this section, we discuss initialization of Algorithm 1.

#### 3.2.1 Initialization

Appropriate initialization is the key to success for Algorithm 1. We now present two ways to initialize it which are theoretically justifiable under different circumstances.

##### Initialization by projected gradient descent in the lifted space

The first initialization method is summarized in Algorithm 2, which is essentially running the projected gradient descent algorithm on the following regularized objective function for a small number of steps:

 f(G,α,β)=−∑i,j{AijΘij+log(1−σ(Θij))}+λnTr(G)+γn2(∥G∥2F+2∥∥α1n⊤∥∥2F+∥Xβ∥2F). (15)

Except for the third term, this is the same as the objective function in (11). However, the inclusion of the additional proximal term ensures that one obtains the desired initializers after a minimal number of steps.

The appropriate choices of and will be spelled out in Theorem 4.5 and Corollary 4.1. The step size in Algorithm 2 can be set at a small positive numeric constant, e.g. . The projection sets that lead to theoretical justification will be specified later in Theorem 4.5 while in practice, one may simply set , , and .

##### Initialization by universal singular value thresholding

Another way to construct the initialization is to first estimate the probability matrix by universal singular value thresholding (USVT) proposed by [14] and then compute the initial estimates of heuristically by inverting the logit transform. The procedure is summarized as Algorithm 3.

Intuitively speaking, the estimate of by USVT is consistent when is “small”. Following the arguments in Theorems 2.6 and 2.7 of [14], such a condition is satisfied when the covariate matrix or when has “simple” structure. Such “simple” structure could be where are feature vectors associated with the nodes and characterizes the distance/similarity between node and node . For instance, one could have where is a categorical variable such as gender, race, nationality, etc; or where is a continuous monotone link function and is a continuous node covariate such as age, income, years of education, etc.

###### Remark 3.2.

The least squares problem in step 2 of Algorithm 3 has closed form solution and can be computed in operations. The computational cost of Algorithm 3 is dominated by matrix decompositions in step 1 and step 3.

## 4 Theoretical results

In this section, we first present error bounds for both fitting methods under inner-product models, followed by their generalizations to the more general models satisfying the Schoenberg condition (7). In addition, we give theoretical justifications of the two initialization methods for Algorithm 1.

### 4.1 Error bounds for inner-product models

We shall establish uniform high probability error bounds for inner-product models belonging to the following parameter space:

 F(n,k,M1,M2,X)={Θ| Θ=α1n⊤+1nα⊤+βX+ZZ⊤, JZ=Z, (16) max1≤i≤n∥Zi∗∥2, ∥α∥∞, |β|max1≤i

When , we replace the first inequality in (16) with . For the results below, , , and are all allowed to change with .

##### Results for the convex approach

We first present theoretical guarantees for the optimizer of (11). When in nonzero, we make the following assumption for the identifiability of .

###### Assumption 4.1.

The stable rank of the covariate matrix satisfies for some large enough constant .

The linear dependence on of is in some sense necessary for to be identifiable as otherwise the effect of the covariates could be absorbed into the latent component .

Let be the solution to (11) and be the true parameter that governs the data generating process. Let and be defined as in (2) but with the estimates and the true parameter values for the components respectively. Define the error terms , , and . The following theorem gives both deterministic and high probability error bounds for estimating both the latent vectors and logit-transformed probability matrix.

###### Theorem 4.1.

Under Assumption 4.1, for any satisfying , there exists a constant such that

 ∥∥ΔˆG∥∥2F,∥∥ΔˆΘ∥∥2F≤Ce2M1λ2nk.

Specifically, setting for a large enough positive constant , there exist positive constants such that uniformly over , with probability at least ,

 ∥∥ΔˆG∥∥2F,∥∥ΔˆΘ∥∥2F≤Cψ2n

where

 ψ2n=e2M1nk×max{e−M2,lognn}. (17)

If we turn the error metrics in \prettyrefthm:convex to mean squared errors, namely and , then we obtain the familiar rate in low rank matrix estimation problems [36, 2, 19, 43]. In particular, the theorem can viewed as a complementary result to the result in [19] in the case where there are observed covariate and the bit matrix is fully observed. When , the sparsity of the network affects the rate through the multiplier . As the network gets sparser, the multiplier will be no smaller than .

###### Remark 4.1.

Note that the choice of the penalty parameter depends on which by (9) controls the sparsity of the observed network. In practice, we do not know this quantity and we propose to estimate with .

##### Results for the non-convex approach

A key step toward establishing the statistical properties of the outputs of Algorithm 1 is to characterize the evolution of its iterates. To start with, we introduce an error metric that is equivalent to while at the same time is more convenient for establishing an inequality satisfied by all iterates. Note that the latent vectors are only identifiable up to an orthogonal transformation of , for any , we define the distance measure

where collects all orthogonal matrices. Let and , and further let and . Then the error metric we use for characterizing the evolution of iterates is

 et=∥Z⋆∥2op∥∥ΔZt∥∥2F+2∥∥Δαt1n⊤∥∥2F+∥∥ΔβtX∥∥2F. (18)

Let be the condition number of (i.e., the ratio of the largest to the smallest singular values). The following lemma shows that the two error metrics and are equivalent up to a multiplicative factor of order .

###### Lemma 4.1.

Under Assumption 4.1, there exists constant such that

 et ≤κ2Z⋆2(√2−1)∥∥ΔGt∥∥2F+2∥∥Δαt1n⊤∥∥2F+∥∥ΔβtX∥∥2F≤κ2Z⋆2(√2−1)(1−c0)∥∥ΔΘt∥∥2F.

Moreover, if ,

 et ≥1(c+2)2∥∥ΔGt∥∥2F+2∥∥Δαt1n⊤∥∥2F+∥∥ΔβtX∥∥2F≥1(c+2)2(1+c0)∥∥ΔΘt∥∥2F.

In addition, our error bounds depend on the following assumption on the initializers.

###### Assumption 4.2.

The initializers in Algorithm 1 satisfy for a sufficiently small positive constant .

Note that the foregoing assumption is not very restrictive. If , is a constant and the entries of are i.i.d. random variables with mean zero and bounded variance, then and . In view of Lemma 4.1, this only requires to be upper bounded by some constant. We defer verification of this assumption for initial estimates constructed by Algorithm 2 and Algorithm 3 to \prettyrefsec:init-theory.

The following theorem states that under such an initialization, errors of the iterates converge linearly till they reach the same statistical precision as in \prettyrefthm:convex modulo a multiplicative factor that depends only on the condition number of .

###### Theorem 4.2.

Let Assumptions 4.1 and 4.2 be satisfied. Set the constraint sets as222When , and we replace in and with in correspondence with the discussion following (16).

 CZ ={Z∈Rn×k,JZ=Z,max1≤i≤n∥Zi∗∥≤M1/3}, Cα ={α∈Rn,∥α∥∞≤M1/3}, Cβ={β∈R,β∥X∥∞≤M1/3}.

Set the step sizes as in (13) for any where is a universal constant. Let . Then we have

1. Deterministic errors of iterates: If for a sufficiently large constant , there exist positive constants and such that

2. Probabilistic errors of iterates: If for a sufficiently large constant , there exist positive constants and such that uniformly over with probability at least ,

For any ,

 ∥ΔGT∥2F, ∥ΔΘT∥2F≤C′κ2Z⋆ψ2n.

for some constant .

###### Remark 4.2.

In view of Lemma 4.1, the rate obtained by the non-convex approach in terms of matches the upper bound achieved by the convex method, up to a multiplier of . As suggested by Lemma 4.1, the extra factor comes partly from the fact that is a slightly stronger loss function than and in the worst case can be times larger than .

###### Remark 4.3.

Under the setting of Theorem 4.2, the projection steps for in Algorithm 1 are straightforward and have the following closed form expressions: , . The projection step for is slightly more involved. Notice that where

 C1Z={Z∈Rn×k,JZ=Z}, C2Z={Z∈Rn×k,max1≤i≤n∥Zi∗∥2≤M1/3}.

Projecting to either of them has closed form solution, that is , . Then Dykstra’s projection algorithm [20] (or alternating projection algorithm) can be applied to obtain . We note that projections induced by the boundedness constraints for are needed for establishing the error bounds theoretically. However, when implementing the algorithm, users are at liberty to drop these projections and to only center the columns of the iterates as in (14). We did not see any noticeable difference thus caused on simulated examples reported in \prettyrefsec:simu.

###### Remark 4.4.

When both and are constants and the covariate matrix is absent, the result in Section 4.5 of [15], in particular Corollary 5, implies the error rate of in \prettyrefthm:nonconvex. However, when and remains bounded as , the error rate in [15] becomes333One can verify that in this case we can identify the quantities in Corollary 5 of [15] as