1 Introduction
###### Abstract

We consider the problem of tensor-response regression given covariates on multiple modes. Such data problems arise frequently in applications such as neuroimaging, network analysis, and spatial-temporal modeling. We propose a new family of tensor response regression models that incorporate covariates, and establish the theoretical accuracy guarantees. Unlike earlier methods, our estimation allows high-dimensionality in both the tensor response and the covariate matrices on multiple modes. An efficient alternating updating algorithm is further developed. Our proposal handles a broad range of data types, including continuous, count, and binary observations. Through simulation and applications to two real datasets, we demonstrate the outperformance of our approach over the state-of-art.

Generalized Tensor Regression with Covariates on Multiple Modes

Zhuoyan Xu, Jiaxin Hu, and Miaoyan Wang111Equal contribution. Miaoyan Wang is Assistant Professor, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, E-mail: miaoyan.wang@wisc.edu. Zhuoyan Xu is a BS/MS student in Statistics, E-mail: zxu444@wisc.edu. Jiaxin Hu is a BS/MS student in Statistics, E-mail: jhu267@wisc.edu.

Department of Statistics, University of Wisconsin-Madison

Keywords: Tensor-response models, Multiple covariates, Dimension reduction, Reduced-rank regression, Generalized linear models

## 1 Introduction

Many contemporary scientific and engineering studies collect multi-way array data, a.k.a. tensors, accompanied by additional covariates. One example is neuroimaging analysis [1, 2], in which the brain connectivity networks are collected from a sample of individuals. Researchers are often interested in identifying connection edges that are affected by individual characteristics such as age, gender, and disease status (see Figure 1a). Another example is in the field of network analysis [3, 4, 5]. A typical social network consists of nodes that represent people and edges that represent friendships. In addition, features on nodes and edges are often available, such as people’s personality and demographic location. It is of keen scientific interest to identify the variation in the connection patterns (e.g., transitivity, community) that can be attributable to the node features.

This paper presents a general treatment to these seemingly different problems. We formulate the learning task as a regression problem, with tensor observation serving as a response, and the node features and/or their interactions forming the predictor. Figure 1b illustrates the general set-up we consider. The regression approach allows the identification of variation in the data tensor that is explained by the covariates. In contrast to earlier work [6, 3], our method allows the contribution of covariates from multiple modes, whenever available. We utilize a low-rank constraint in the regression coefficient to encourage the sharing among tensor entries. The statistical convergence of our estimator is established, and we quantify the gain in accuracy compared to classical multivariate regression approach.

Related work. Our work is closely related to but also clearly distinctive from several lines of previous work. The first is a class of unsupervised tensor decomposition [7, 8, 9] that aims to find a low-rank representation of a data tensor. In contrast, our model can be viewed a supervised tensor learning, which aims to identify the association between a data tensor and covariates. The second related line [2, 10] tackles tensor regression where the response is a scalar and the predictor is a tensor. Our proposal is orthogonal to theirs because we treat the tensor as a response. The tensor-response model is appealing for high-dimensional analysis when both the response and the covariate dimensions grow. The last line of work studies the network-response model [11, 12]. The earlier development of this model focuses mostly on binary data in the presence of dyadic covariates [5, 3]. We will demonstrate the enhanced accuracy as the order of data grows, and establish the general theory for exponential family which is arguably better suited to various data types.

## 2 Preliminaries

We begin by reviewing the basic properties about tensors [8]. We use to denote an order- -dimensional tensor. The multilinear multiplication of a tensor by matrices is defined as

 Y×1X1…×KXK=⟦∑i1,…,iKyi1,…,iKx(1)j1,i1…x(K)jK,iK⟧, (1)

which results in an order- -dimensional tensor. For ease of presentation, we use shorthand notion to denote the tensor-by-matrix product. For any two tensors , of identical order and dimensions, their inner product is defined as . The Frobenius norm of tensor is defined as . A higher-order tensor can be reshaped into a lower-order object [13]. We use to denote the operation that reshapes the tensor into a vector, and the operation that reshapes the tensor along mode- into a matrix of size -by-. The Tucker rank of an order- tensor is defined as a length- vector , where is the rank of matrix , . We use lower-case letters (e.g., ) for scalars/vectors, upper-case boldface letters (e.g., ) for matrices, and calligraphy letters (e.g., ) for tensors of order three or greater. We let denote the identity matrix, denote the -set , and allow an function to be applied to tensors in an element-wise manner.

## 3 Motivation and model

Let denote an order- data tensor. Suppose we observe covariates on some of the modes. Let denote the available covariates on the mode , where . We propose a multilinear structure on the conditional expectation of the tensor. Specifically,

 E(Y|X1,…,XK)=f(Θ), with (2) Θ=B×{X1,…,XK}, (3)

where is a known link function, is the linear predictor, is the parameter tensor of interest, and denotes the tensor Tucker product. The choice of link function depends on the distribution of the response data. Some common choices are identity link for Gaussian tensor, logistic link for binary tensor, and link for Poisson tensor (see Table 1).

We give three concrete examples of tensor regression that arise in practice.

###### Example 1 (Spatio-temporal growth model).

Let denote the pH measurements of lakes at levels of depth and for time points. Suppose the sampled lakes belong to types, with lakes in each type. Let denote the sampled depth levels and the time points. Assume that the expected pH trend in depth is a polynomial of order and that the expected trend in time is a polynomial of order . Then, the spatio-temporal growth model can be represented as

 E(Y|X1,X2,X3)=B×{X1,X2,X3}, (4)

where is the coefficient tensor of interest, is the design matrix for lake types,

 X2=⎛⎜ ⎜ ⎜ ⎜ ⎜⎝1ℓ1⋯ℓr11ℓ2⋯ℓr2⋮⋮⋱⋮1ℓm⋯ℓrm⎞⎟ ⎟ ⎟ ⎟ ⎟⎠, X3=⎛⎜ ⎜ ⎜ ⎜ ⎜⎝1t1⋯ts11t2⋯ts2⋮⋮⋱⋮1tn⋯tsn⎞⎟ ⎟ ⎟ ⎟ ⎟⎠

are the design matrices for spatial and temporal effects, respectively. The model (4) is a higher-order extension of the “growth curve” model originally proposed for matrix data [14, 15, 16]. Clearly, the spatial-temporal model is a special case of our tensor regression model, with covariates available on each of the three modes.

###### Example 2 (Network population model).

Network response model is recently developed in the context of neuroimanig analysis. The goal is to study the relationship between network-valued response and the individual covariates. Suppose we observe i.i.d. observations , where is the brain connectivity network on the -th individual, and is the individual covariate such as age, gender, cognition, etc. The network-response model [11, 6] has the form

 logit(E(Yi|xi))=B×3xi,for i=1,…,n (5)

where is the coefficient tensor of interest.

The model (5) is a special case of our tensor-response model, with covariates on the last mode of the tensor. Specifically, stacking together yields an order-3 response tensor , along with covariate matrix . Then, the model (5) can be written as

 logit(E(Y|X))=B×3X=B×{Id,Id,X}.
###### Example 3 (Dyadic data with node attributes).

Dyadic dataset consists of measurements on pairs of objects or under a pair of conditions. Common examples include networks and graphs. Let denote a network, where is the node set of the graph, and is the edge set. Suppose that we also observe covariate associated to each . A probabilistic model on the graph can be described by the following matrix regression. The edge connects the two vertices and independently of other pairs, and the probability of connection is modeled as

 logit(P((i,j)∈E)=xTiBxj=⟨B,xTixj⟩. (6)

The above model has demonstrated its success in modeling transitivity, balance, and communities in the networks [5]. We show that our tensor regression model (2) also incorporates the graph model as a special case. Let be a binary matrix where . Define . Then, the graph model (6) can be expressed as

 logit(E(Y|X))=B×{X,X}.

In the above three examples and many other studies, researchers are interested in uncovering the variation in the data tensor that can be explained by the covariates. The regression coefficient in our model model (2) serves this goal by collecting the effects of covariates and the interaction thereof. To encourage the sharing among effects, we assume that the coefficient tensor lies in a low-dimensional parameter space:

 Pr1,…,rK={B∈Rp1×⋯×pK:rk(B)≤rk for all k∈[K]}, (7)

where is the Tucker rank at mode of the tensor. The low-rank assumption is plausible in many scientific applications. In brain imaging analysis, for instance, it is often believed that the brain nodes can be grouped into fewer communities, and the numbers of communities are much smaller than the number of nodes. The low-rank structure encourages the shared information across tensor entries, thereby greatly improving the estimation stability. When no confusion arises, we drop the subscript and write for simplicity.

Our tensor regression model is able to incorporate covariates on any subset of modes, whenever available. Without loss of generality, we denote by the covariates in all modes and treat if the mode- has no (informative) covariate. Then, the final form of our tensor regression model can be written as:

 E(Y|X)=f(Θ),Θ=B×{X1,…,XK}, where rank(B)≤(r1,…,rK), (8)

where the entries of are independent r.v.’s conditional on , and is the low-rank coefficient tensor of interest. We comment that other forms of tensor low-rankness are also possible, and here we choose Tucker rank just for parsimony. Similar models can be derived using various notions of low-rankness based on CP decomposition [17] and train decomposition [18].

## 4 Rank-constrained likelihood-based estimation

We develop a likelihood-based procedure to estimate the coefficient tensor in (3). We adopt the exponential family as a flexible framework for different data types. In a classical generalized linear model (GLM) with a scalar response and covariate , the density is expressed as:

 p(y|x,β)=c(y,ϕ)exp(yθ−b(θ)ϕ) with θ=βTx,

where is a known function, is the linear predictor, is the dispersion parameter, and is a known normalizing function. The choice of link functions depends on the data types and on the observation domain of , denoted . For example, the observation domain is for continuous data, for count data, and for binary data. Note that the canonical link function is chosen to be . Table 1 summarizes the canonical link functions for common types of distributions.

In our context, we model the the entries in the response tensor conditional on as independent draws from an exponential family. The quasi log-likelihood of (3) is equal (ignoring constant) to Bregman distance between and :

 LY(B) =⟨Y,Θ⟩−∑i1,…,iKb(θi1,…,iK), (9) where Θ =B×{X1,…,XK}. (10)

We assume that we have an additional information on an upper bound such that . This is the case for many applications we have in mind such as brain network analysis where fiber connections are bounded. We propose a constrained maximum likelihood estimator (MLE) for the coefficient tensor:

 ^B=argmaxrank(B)=r,∥Θ(B)∥max≤αLY(B). (11)

In the following theoretical analysis, we assume the rank is known and fixed. The adaptation of unknown will be addressed in Section 5.2.

### 4.1 Statistical properties

We assess the estimation accuracy using the deviation in the Frobenius norm. For the true coefficient tensor and its estimator , define

 Loss(Btrue, ^B)=∥Btrue−^B∥2F.

In modern applications, the response tensor and covariates are often large-scale. We are particularly interested in the high-dimensional region in which both and diverge; i.e.  and , while . As the size of problem grows, and so does the number of unknown parameters. As such, the classical MLE theory does not directly apply. We leverage the recent development in random tensor theory and high-dimensional statistics to establish the error bounds of the estimation.

###### Assumption 1.

We make the following assumptions:

1. There exist two positive constants such that for all . Here and denotes the smallest and largest singular values, respectively.

2. There exist two positive constants such that for all .

3. Equivalently, there exists two positive constants such that for all , where is the upper bound of the linear predictor.

The assumptions are fairly mild. Assumption A1 guarantees the non-singularity of the covariates, and Assumption A2 ensures the log-likelihood is strictly concave in the linear predictor . Assumption A2 and A2’ are equivalent, because when belongs to an exponential family [19].

###### Theorem 4.1 (Statistical convergence).

Consider a generalized tensor regression model with covariates on multiple modes . Suppose the entries in are independent realizations of an exponential family distribution, and follows the low-rank tensor regression model (3). Under Assumption 1, there exist two constants , such that, with probability at least ,

 Loss(Btrue, ^B)≤C2∑kpk. (12)

Here, is a constant that does not depend on the dimensions and .

To gain further insight on the bound (12), we consider a special case when tensor dimensions are equal at each of the modes, i.e., , , for all , and the covariates are Gaussian design matrices with i.i.d.  entries. To put the context in the framework of Theorem 4.1, we rescale the covariates into so that the singular values of are bounded by . The result in (12) implies that the estimated coefficient has a convergence rate in the scale of the original covariates . Therefore, our estimation is consistent as the dimension grows, and the convergence becomes especially favorably as the order of tensor data increases.

As immediate applications, we obtain the convergence rate for the three examples mentioned in Section 3. Without loss of generality, we assume that the singular values of the -by- covariate matrix are bounded by .

###### Corollary 1 (Spatio-temporal growth model).

The estimated type-by-time-by-space coefficient tensor converges at the rate where , and . The estimation achieves consistency as long as the dimension grows in either of the three modes.

###### Corollary 2 (Network population model).

The estimated node-by-node-by-covariate tensor converges at the rate where . The estimation achieves consistency as the number of individuals or the number of nodes grows.

###### Corollary 3 (Dyadic data with node attributes).

The estimated covariate-by-covariate matrix converges at the rate where . Again, our estimation achieves consistency as the number of nodes grows.

We conclude this section by providing the prediction accuracy, measured in KL divergence, for the response distribution.

###### Theorem 4.2 (Prediction error).

Assume the same set-up as in Theorem 4.1. Let and denote the distributions of given the true parameter and estimated parameter , respectively. Then, we have, with probability at least ,

 KL(PYtrue, P^Y)≤C4∑kpk,

where is a constant that do not depend on the dimensions and .

## 5 Numerical implementation

### 5.1 Alternating optimization

In this section, we introduce an efficient algorithm to solve (11). The objective function is concave in when the link is the canonical link function. However, the feasible set is non-convex, and thus the optimization (11) is a non-convex problem. We utilize a Tucker factor representation of the coefficient tensor and turn the optimization into a block-wise convex problem.

Specifically, write the rank- decomposition of coefficient tensor as

 B=C×{M1,…,MK}, (13)

where is a full-rank core tensor, are factor matrices with orthogonal columns. Estimating amounts to finding both the core tensor and the factor matrices ’s. The optimization (11) can be written as , where

 LY(C,M1,…,MK) =⟨Y,Θ⟩−∑i1,…,iKb(θi1,…,iK), (14) with Θ =C×{M1X1,…,MKXK}. (15)

The decision variables in the above objective function consist of blocks of variables, one for the core tensor and for the factor matrices ’s. We notice that, if any out of the blocks of variables are known, then the optimization with respect to the last block of variables reduced to a simple GLM. This observation suggests that we can iteratively update one block at a time while keeping others fixed. After each iteration, we rescale the core tensor subject to the maximum norm constraint. This post-processing in principle may not guarantee the monotonic increase of the objective, but we found that in our experiment this simple step appears robust for obtaining a desirable solution. The full algorithm is described in Algorithm 1.

### 5.2 Rank selection

Algorithm 1 takes the rank as an input. Estimating an appropriate rank given the data is of practical importance. We propose to use Bayesian information criterion (BIC) and choose the rank that minimizes BIC; i.e.

 ^r =argminr=(r1,…,rK)BIC(r) (16) =argminr=(r1,…,rK)[−2LY(^B)+pe(r)log(∏kdk)], (17)

where is the effective number of parameters in the model. We choose that minimizes via grid search. Our choice of BIC aims to balance between the goodness-of-fit for the data and the degree of freedom in the population model. We test its empirical performance in Section 6.

## 6 Simulation

We evaluate the empirical performance of our generalized tensor regression through simulations. We consider order-3 tensors with a range of distribution types. The coefficient tensor is generated using the factorization form (13) where both the core and factor matrices are drawn i.i.d. from Uniform[-1,1]. The linear predictor is then simulated from , where is either an identity matrix (i.e. no covariate available) or Gaussian random matrix with i.i.d. entries from . We set to ensure the singular values of are bounded as increases. The is scaled such that . Conditional on the linear predictor , the entries in the tensor are drawn independently according to one of the following three probabilistic models:

1. [itemsep=0pt,topsep=0pt]

2. (Gaussian). Continuous data .

3. (Poisson). Count data .

4. (Bernoulli). Binary data .

Here is a scalar controlling the magnitude of the effect size. In each simulation study, we report the mean squared error (MSE) for the coefficient tensor averaged across replications.

The first experiment assesses the selection accuracy of our BIC criterion (16). We consider the balanced situation where , for . We set and consider various combinations of dimension and rank . For each combination, we simulate tensor data following Gaussian, Bernoulli, and Poisson models. We then minimize BIC using a grid search over three dimensions. The hyper-parameter is set to infinity in the fitting, which essentially imposes no prior on the coefficient magnitude. Table 2 reports the selected rank averaged over replicates for Gaussian and Poisson models. We found that when , the selected rank is slightly smaller than the true rank, and the accuracy improves immediately when the dimension increases to . This agrees with our expectation, as in tensor regression, the sample size is related to the number of entries. A larger implies a larger sample size, so the BIC selection becomes more accurate.

The second experiment evaluates the accuracy when covariates are available on all modes. We set and increase from 25 to 50. Our theoretical analysis suggests that has a convergence rate in this setting. Figure 1 plots the estimation error versus the “effective sample size”, , under three different distribution models. We found that the empirical MSE decreases roughly at the rate of , which is consistent with our theoretical ascertainment. We also observed that, tensors with higher ranks tend to yield higher estimation errors, as reflected by the upward shift of the curves as increases. Indeed, a larger implies a higher model complexity and thus greater difficulty in the estimation. Similar behaviors can be observed in the non-Gaussian data in Figure 2b-c.

The third experiment investigates our model’s ability in handling correlation among coefficients. We mimic the scenario of brain imaging analysis. A sample of networks are simulated, one for each individual. Each network measures the connections between brain nodes. We simulate covariates for the each of the 50 individuals. These covariates may represent, for example, age, gender, cognitive score, etc. Recent study [20] has suggested that brain connectivity networks often exhibit community structure represented as a collection of subnetworks, and each subnetwork is comprised of a set of spatially distributed brain nodes. To accommodate this structure, we utilize the stochastic block model [21] to generate the effect size. Specifically, we partition the nodes into blocks by assigning each node to a block with uniform probability. Edges within a same block are assumed to share the same covariate effects, where the effects are drawn i.i.d. from . We then apply our tensor regression model to the network data using the BIC-selected rank. Note that in this case, the true model rank is unknown; the rank of a -block matrix is not necessarily equal to  [22].

Figure 3 compares the MSE of our method with a classical GLM approach. A classical GLM is to regress the dyadic edges, one at a time, on the covariates, and this model is repeatedly fitted for each edge. This repeated approach, however, does not account for the correlation among the edges, and may suffer from overfitting. As we can see in Figure 3, out tensor regression method achieves significant error reduction in all three models considered. The outer-performance is significant in the presence of large communities, and even in the less structured case ( nodes per block), our method still outer-performs GLM. This is because the low-rankness in our modeling automatically identifies the shared information across entries. By selecting the rank in a data-driven way, our method is able to achieve accurate estimation with improved interpretability.

## 7 Data analysis

We apply our tensor regression model to two real datasets. The first application concerns the brain network modeling in response to individual attributes (i.e. covariate on one mode), and the second application focuses on multi-relational network analysis with dyadic attributes (i.e. covariates on two modes).

### 7.1 Human Connectome Project (HCP)

The Human connectome project (HCP [23]) aims to build a “network map” that characterizes the anatomical and functional connectivity within healthy human brains. We take a subset of HCP data that consists of 136 brain structural networks, one for each individual. Each brain network is represented as a 68-by-68 binary matrix, where the entries encode the presence or absence of fiber connections between the 68 brain regions. We consider four individual-covariates: gender (65 females vs. 71 males), age 22-25 (), age 26-30 (), and age 31+ (). The goal is to identify the connection edges that are affected by the individual covariates. A key challenge in brain network is that the edges are correlated; for example, two edges may stem out from a same brain region, and it is of importance to take into account the within-dyad dependence.

We fit the tensor regression model to the HCP data. The response is a binary tensor and the covariates are of dimension 4 along the 3 mode. The BIC selection suggests a rank with log-likelihood . Figure 4 shows the top edges with high effect size, overlaid on the Desikan atlas brain template [24, 25]. We utilize the sum-to-zero contrasts in the effects coding and depicted only the top 3% edges whose connections are non-constant across the sample. It is observed that the global connection exhibits clear spatial separation, and that the nodes within each hemisphere are more densely connected with each other (Figure 4a). In particular, the superior-temproal (SupT), middle-temporal (MT) and Insula are the top three popular nodes in the network. Interestingly, female brains display higher inter-hemispheric connectivity, especially in the frontal, parental and temporal lobes (Figure 4b). This is in agreement with a recent study showing that female brains are optimized for inter-hemispheric communication [26]. We also found several edges with declined connection in the group Age 31+. Notably, those edges involve Frontal-pole (Fploe), superior-frontal (SupF) and Cuneus nodes. The Frontal-pole region has long been known for its importance in memory and cognition, and the detected decline with age further highlights its biological importance.

### 7.2 Nations data

The second application concerns the multi-relational network analysis with node-level attributes. We consider Nations dataset [27] which records 56 relations among 14 countries between 1950 and 1965. The multi-relational networks can be organized into a binary tensor, with each entry indicating the presence or absence of a connection, such as “sending tourist to”, “export”, “import”, between countries. The 56 relations span the fields of politics, economics, military, religion, and so on. In addition, country-level attributes are also available, and we focus on the following six covariates: constitutional, catholics, lawngos, politicalleadership, geographyx, and medicinengo. The goal is to identify the variation in connections due to country-level attributes and interactions thereof. One of the key features is that the 56 relations are correlated, and we would like to take that into account in assessing the covariate effects.

We applied our tensor regression model to the Nations data. The multi-relational network was treated as the response tensor, and the country attributes were treated as covariates on both the 1st and 2nd modes. The BIC criterion suggests a rank for the coefficient tensor . Table Section 6 shows the -mean clustering of the 56 relations based on the 3 mode factor . We found that the relations reflecting the similar aspects of international affairs are grouped together. In particular, Cluster I consists of political relations such as officialvisits, intergovorgs, and militaryactions; Clusters II and III capture the economical relations such as economicaid, booktranslations, tourism; and Cluster IV represents the Cold War alliance blocs. The similarity among entities in each cluster suggests the plausibility of our dimension reduction.

To investigate the effects of dyadic attributes towards connections, we depicted the estimated coefficients for several relation types (Figure 5). Note that entries can be interpreted as the contribution, at the logit scale, of covariate pair (th covariate for the “sender” country and th covariate for the “receiver” country) towards the connection of relation . Several interesting findings emerge from the observation. We found that relations belonging to a same cluster tend to have similar covariate effects. For example, the relations warnings and ecnomicaid are classified into Cluster II, and both exhibit similar covariate pattern (Figure 5a-b). Moreover, the majority of the diagonal entries positively contribute to the connection. This suggests that countries with coherent attributes tend to interact more often than others. We also found that the constitutional attribute is an important predictor for the commonbloc relation, whereas the effect is weaker for other relations (Figure 5d). This is not surprising, as the block partition during Cold War is associated with the constitutional attribute.

## 8 Conclusion

We have developed a generalized tensor regression with covariates on multiple modes. A fundamental feature of tensor-valued data is the statistical interdependence among entries. Our proposed rank-constrained estimation achieves high accuracy with sound theoretical guarantees. The estimation accuracy is quantified via deviation in the Frobenius norm and K-L divergence. Other measures of accuracy may also be desirable, such as the spectral norm or the maximum norm of the deviation. Exploiting the properties and benefits of different error quantification warrants future research.

## Appendix A Proofs

###### Proof of Theorem 4.1.

Define , where the expectation is taken with respect to under the model with true parameter . We first prove the following two conclusions:

1. There exists two positive constants , , such that, with probability at least , the stochastic deviation, , satisfies

 |LY(B)−ℓ(B)|=|⟨E, B×1X1×2⋯×KXK⟩|≤C2∥B∥F ⎷∏krkmaxkrk∑kpk.
2. The inequality holds, where is the lower bound for .

To prove C1, we note that the stochastic deviation can be written as:

 LY(B)−ℓ(B) =⟨Y−E(Y|X), Θ(B)⟩ =⟨Y−b′(Θtrue), Θ⟩ =⟨E×1XT1×2⋯×KXTK, B⟩, (18)

where , and the second line uses the property of exponential family that . Based on Proposition 2, the boundedness of implies that is a sub-Gaussian- tensor. Let . By Proposition 1, is a -dimensional sub-Gaussian tensor with parameter bounded by . Here is the upper bound of . Applying Cauchy-Schwarz inequality to (A) yields

 |LY(B)−ℓ(B)|≤∥∥ˇE∥∥2∥B∥∗, (19)

where denotes the tensor spectral norm and denotes the tensor nuclear norm. The nuclear norm is bounded by (c.f. [28, 13]). The spectral norm is bounded by with probability at least (c.f. [28, 29]). Combining these two bounds with (19), we have, with probability at least ,

 |LY(B)−ℓ(B)|≤C2∥B∥F ⎷∏krkmaxkrk∑kpk,

where is a constant absorbing all factors that do not depend on and .

Next we prove C2. Applying Taylor expansion to around yields

 LY(B)=LY(Btrue)+⟨∂LY(B)B∣∣B=Btrue,B−Btrue⟩+12vec(B−Btrue)TH(ˇB)vec(B−Btrue), (20)

where is the (non-random) Hession of evaluated at for some . Note that we have . We take expectation with respect to on both sides of (21) and obtain

 ℓ(B)=ℓ(Btrue)+12vec(B−Btrue)TH(ˇB)vec(B−Btrue). (21)

By the fact and chain rule over , the equation (21) implies that

 ℓ(B)−ℓ(Btrue)=−12∑i1,…,iKb′′(ˇθi1,…,iK)(θi1,…,iK−θtrue,i1,…,iK)2≤−L2∥Θ−Θtrue∥2F, (22)

holds for all , provided that . In particular, the inequality (21) also applies to the constrained MLE . So we have

 ℓ(^B)−ℓ(Btrue)≤−L2∥^Θ−Θtrue∥2F. (23)

Now we have proved both C1 and C2. Note that by the definition of , This implies that

 0 ≤LY(^B)−LY(Btrue) (24) ≤(LY(^B)−ℓ(^B))−(LY(Btrue)−ℓ(Btrue))+(ℓ(^B)−ℓ(Btrue)) (25) ≤⟨E, Θ−Θtrue⟩−L2∥^Θ−Θtrue∥2F, (26)

where the second line follows from (23). Therefore,

 ∥^Θ−Θtrue∥F ≤2L⟨E, ^Θ−Θtrue∥^Θ−Θtrue∥F⟩ ≤2LsupΘ:∥Θ∥F=1,Θ=B×1X1×2⋯×KXK⟨E, Θ⟩ ≤2LsupB∈P:∥B∥F≤∏kσ−1min(Xk)⟨E,  B×1X1×2⋯×KXK⟩. (27)

Combining (A) with C1 yields

 ∥^Θ−Θtrue∥F≤2C2L∏kσ−1min(Xk) ⎷∏krkmaxrk∑kpk.

Therefore, the final conclusion follows by noting that

 ∥^B−Btrue∥F≤∥^Θ−Θtrue∥F∏kσ−1min(Xk)≤C√∑kpk,

where is a constant that does not depend on the dimensions and . ∎

###### Proposition 1 (sub-Gaussian tensors).

Let be a sub-Gaussian- tensor of dimension , and be non-random matrices for all . Then is a sub-Gaussian- tensor of dimension , where . Here denotes the largest singular value of the matrix.

###### Proof.

To show is a sub-Guassian tensor, it suffices to show that the is a sub-Gaussian scalar with parameter , for any unit-1 vector , .

Note that,

 E×1uT1×2⋯×KuTK =S×1(uT1X1)×2⋯×K(uTKXK) (28) =(∏k∥uTkXk∥2)[S×1(u