Abstract
We consider the problem of tensor-response regression given covariates on multiple modes. Such data problems arise frequently in applications such as neuroimaging, network analysis, and spatial-temporal modeling. We propose a new family of tensor-response regression models that incorporate covariates, and we establish theoretical accuracy guarantees. Unlike earlier methods, our estimation allows high dimensionality in both the tensor response and the covariate matrices on multiple modes. An efficient alternating updating algorithm is further developed. Our proposal handles a broad range of data types, including continuous, count, and binary observations. Through simulations and applications to two real datasets, we demonstrate the advantage of our approach over the state of the art.
Generalized Tensor Regression with Covariates on Multiple Modes
Zhuoyan Xu, Jiaxin Hu, and Miaoyan Wang (equal contribution). Miaoyan Wang is Assistant Professor, Department of Statistics, University of Wisconsin-Madison, Madison, WI 53706, Email: miaoyan.wang@wisc.edu. Zhuoyan Xu is a BS/MS student in Statistics, Email: zxu444@wisc.edu. Jiaxin Hu is a BS/MS student in Statistics, Email: jhu267@wisc.edu.
Department of Statistics, University of Wisconsin-Madison
Keywords: Tensor-response models, Multiple covariates, Dimension reduction, Reduced-rank regression, Generalized linear models
1 Introduction
Many contemporary scientific and engineering studies collect multiway array data, a.k.a. tensors, accompanied by additional covariates. One example is neuroimaging analysis [1, 2], in which brain connectivity networks are collected from a sample of individuals. Researchers are often interested in identifying connection edges that are affected by individual characteristics such as age, gender, and disease status (see Figure 1a). Another example is in the field of network analysis [3, 4, 5]. A typical social network consists of nodes that represent people and edges that represent friendships. In addition, features on nodes and edges are often available, such as people's personalities and geographic locations. It is of keen scientific interest to identify the variation in connection patterns (e.g., transitivity, community) that is attributable to the node features.
This paper presents a general treatment of these seemingly different problems. We formulate the learning task as a regression problem, with the tensor observation serving as the response, and the node features and/or their interactions forming the predictor. Figure 1b illustrates the general setup we consider. The regression approach allows the identification of variation in the data tensor that is explained by the covariates. In contrast to earlier work [6, 3], our method allows contributions of covariates from multiple modes, whenever available. We impose a low-rank constraint on the regression coefficient to encourage sharing among tensor entries. The statistical convergence of our estimator is established, and we quantify the gain in accuracy compared to the classical multivariate regression approach.
Related work. Our work is closely related to, but clearly distinct from, several lines of previous work. The first is the class of unsupervised tensor decompositions [7, 8, 9] that aim to find a low-rank representation of a data tensor. In contrast, our model can be viewed as supervised tensor learning, which aims to identify the association between a data tensor and covariates. The second related line [2, 10] tackles tensor regression where the response is a scalar and the predictor is a tensor. Our proposal is orthogonal to theirs because we treat the tensor as the response. The tensor-response model is appealing for high-dimensional analysis when both the response and covariate dimensions grow. The last line of work studies the network-response model [11, 12]. Earlier developments of this model focus mostly on binary data in the presence of dyadic covariates [5, 3]. We will demonstrate the enhanced accuracy as the order of the data grows, and we establish a general theory for the exponential family, which is arguably better suited to diverse data types.
2 Preliminaries
We begin by reviewing basic properties of tensors [8]. We use $\mathcal{Y} = (y_{i_1\cdots i_K}) \in \mathbb{R}^{d_1\times\cdots\times d_K}$ to denote an order-$K$, $(d_1,\ldots,d_K)$-dimensional tensor. The multilinear multiplication of a tensor $\mathcal{B} = (b_{i_1\cdots i_K})\in\mathbb{R}^{p_1\times\cdots\times p_K}$ by matrices $M_k = (m_{j_k i_k})\in\mathbb{R}^{d_k\times p_k}$ is defined as

$$\mathcal{B}\times_1 M_1\times_2\cdots\times_K M_K \;=\; \Big(\textstyle\sum_{i_1,\ldots,i_K} b_{i_1\cdots i_K}\, m_{j_1 i_1}\cdots m_{j_K i_K}\Big), \qquad (1)$$

which results in an order-$K$, $(d_1,\ldots,d_K)$-dimensional tensor. For ease of presentation, we use the shorthand notation $\mathcal{B}\times\{M_1,\ldots,M_K\}$ to denote the tensor-by-matrix product. For any two tensors $\mathcal{A} = (a_{i_1\cdots i_K})$, $\mathcal{B} = (b_{i_1\cdots i_K})$ of identical order and dimensions, their inner product is defined as $\langle\mathcal{A},\mathcal{B}\rangle = \sum_{i_1,\ldots,i_K} a_{i_1\cdots i_K}\, b_{i_1\cdots i_K}$. The Frobenius norm of a tensor $\mathcal{B}$ is defined as $\|\mathcal{B}\|_F = \sqrt{\langle\mathcal{B},\mathcal{B}\rangle}$. A higher-order tensor can be reshaped into a lower-order object [13]. We use $\operatorname{vec}(\mathcal{B})$ to denote the operation that reshapes the tensor into a vector, and $\operatorname{Unfold}_k(\mathcal{B})$ the operation that reshapes the tensor along mode $k$ into a matrix of size $d_k$ by $\prod_{j\neq k} d_j$. The Tucker rank of an order-$K$ tensor $\mathcal{B}$ is defined as a length-$K$ vector $\operatorname{rank}(\mathcal{B}) = (r_1,\ldots,r_K)$, where $r_k$ is the rank of the matrix $\operatorname{Unfold}_k(\mathcal{B})$, for $k\in[K]$. We use lowercase letters (e.g., $x$, $y$) for scalars/vectors, uppercase boldface letters (e.g., $X$, $Y$) for matrices, and calligraphic letters (e.g., $\mathcal{B}$, $\mathcal{Y}$) for tensors of order three or greater. We let $I_d$ denote the $d$-by-$d$ identity matrix, $[n]$ denote the set $\{1,\ldots,n\}$, and allow a function $f\colon\mathbb{R}\to\mathbb{R}$ to be applied to tensors in an element-wise manner.
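Since the rest of the paper relies on the mode-$k$ unfolding and the multilinear product, the following minimal NumPy sketch makes the two operations concrete. The helper names `unfold` and `multilinear` are our own illustrative choices, not part of any established library API:

```python
import numpy as np

def unfold(T, k):
    """Mode-k unfolding: rearrange tensor T into a d_k x prod_{j != k} d_j matrix."""
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

def multilinear(B, mats):
    """Multilinear product B x_1 M_1 x_2 ... x_K M_K via successive mode products."""
    out = B
    for k, M in enumerate(mats):
        # contract M's columns against mode k of out, then move the new axis back to k
        out = np.moveaxis(np.tensordot(M, out, axes=(1, k)), 0, k)
    return out

# Example: a (2,3,4) coefficient tensor multiplied into (5,2), (6,3), (7,4) designs
B = np.arange(24.0).reshape(2, 3, 4)
Ms = [np.ones((5, 2)), np.ones((6, 3)), np.ones((7, 4))]
assert multilinear(B, Ms).shape == (5, 6, 7)
assert unfold(B, 1).shape == (3, 8)
```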
3 Motivation and model
Let $\mathcal{Y}\in\mathbb{R}^{d_1\times\cdots\times d_K}$ denote an order-$K$ data tensor. Suppose we observe covariates on some of the modes. Let $X_k\in\mathbb{R}^{d_k\times p_k}$ denote the available covariates on mode $k$, where $p_k\le d_k$ for $k\in[K]$. We propose a multilinear structure on the conditional expectation of the tensor. Specifically,

$$\mathbb{E}\big(\mathcal{Y}\mid X_1,\ldots,X_K\big) = f(\Theta), \qquad (2)$$
$$\Theta = \mathcal{B}\times_1 X_1\times_2\cdots\times_K X_K, \qquad (3)$$

where $f$ is a known link function applied entry-wise, $\Theta\in\mathbb{R}^{d_1\times\cdots\times d_K}$ is the linear predictor, $\mathcal{B}\in\mathbb{R}^{p_1\times\cdots\times p_K}$ is the parameter tensor of interest, and $\times_k$ denotes the tensor Tucker product. The choice of link function depends on the distribution of the response data. Some common choices are the identity link for a Gaussian tensor, the logistic link for a binary tensor, and the log link for a Poisson tensor (see Table 1).
Table 1: Common data types, their domains, and canonical link functions.

  Data type      Gaussian         Poisson                  Bernoulli
  Domain of y    $\mathbb{R}$     $\mathbb{N}=\{0,1,2,\ldots\}$   $\{0,1\}$
  Link $f(\theta)$   $\theta$ (identity)   $e^{\theta}$ (log)   $(1+e^{-\theta})^{-1}$ (logistic)
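As a sketch of how model (2)-(3) evaluates in code, the snippet below composes the canonical links of Table 1 with the multilinear product; `multilinear` is the illustrative helper defined in Section 2:

```python
import numpy as np

# Canonical links f from Table 1, applied entry-wise to the linear predictor.
LINKS = {
    "gaussian":  lambda theta: theta,                          # identity
    "poisson":   lambda theta: np.exp(theta),                  # log link
    "bernoulli": lambda theta: 1.0 / (1.0 + np.exp(-theta)),   # logistic
}

def conditional_mean(B, X, family):
    """E(Y | X_1, ..., X_K) = f(B x_1 X_1 ... x_K X_K) under model (2)-(3)."""
    return LINKS[family](multilinear(B, X))
```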
We give three concrete examples of tensor regression that arise in practice.
Example 1 (Spatiotemporal growth model).
Let $\mathcal{Y}\in\mathbb{R}^{n\times q\times t}$ denote the pH measurements of $n$ lakes at $q$ levels of depth and for $t$ time points. Suppose the sampled lakes belong to $m$ types, with $n/m$ lakes in each type. Let $s_1,\ldots,s_q$ denote the sampled depth levels and $u_1,\ldots,u_t$ the time points. Assume that the expected pH trend in depth is a polynomial of order $\ell_1$ and that the expected trend in time is a polynomial of order $\ell_2$. Then, the spatiotemporal growth model can be represented as

$$\mathbb{E}\,\mathcal{Y} = \mathcal{B}\times_1 X_1\times_2 X_2\times_3 X_3, \qquad (4)$$

where $\mathcal{B}\in\mathbb{R}^{m\times(\ell_1+1)\times(\ell_2+1)}$ is the coefficient tensor of interest, $X_1\in\{0,1\}^{n\times m}$ is the design matrix for lake types, and

$$X_2 = \begin{pmatrix} 1 & s_1 & \cdots & s_1^{\ell_1}\\ \vdots & \vdots & & \vdots\\ 1 & s_q & \cdots & s_q^{\ell_1}\end{pmatrix}, \qquad X_3 = \begin{pmatrix} 1 & u_1 & \cdots & u_1^{\ell_2}\\ \vdots & \vdots & & \vdots\\ 1 & u_t & \cdots & u_t^{\ell_2}\end{pmatrix}$$

are the design matrices for the spatial and temporal effects, respectively. The model (4) is a higher-order extension of the "growth curve" model originally proposed for matrix data [14, 15, 16]. Clearly, the spatiotemporal model is a special case of our tensor regression model, with covariates available on each of the three modes.
Example 2 (Network population model).
The network-response model has recently been developed in the context of neuroimaging analysis. The goal is to study the relationship between a network-valued response and individual covariates. Suppose we observe $n$ i.i.d. observations $\{(\mathcal{Y}_i, \boldsymbol{x}_i)\colon i=1,\ldots,n\}$, where $\mathcal{Y}_i\in\mathbb{R}^{d\times d}$ is the brain connectivity network of the $i$th individual, and $\boldsymbol{x}_i\in\mathbb{R}^{p}$ collects individual covariates such as age, gender, cognition, etc. Stack the networks into an order-3 tensor $\mathcal{Y}\in\mathbb{R}^{d\times d\times n}$ and write $X = (\boldsymbol{x}_1,\ldots,\boldsymbol{x}_n)^T\in\mathbb{R}^{n\times p}$. The network-response model [11, 6] has the form

$$\mathbb{E}(\mathcal{Y}\mid X) = f\big(\mathcal{B}\times_1 I_d\times_2 I_d\times_3 X\big), \qquad (5)$$

where $\mathcal{B}\in\mathbb{R}^{d\times d\times p}$ is the coefficient tensor of interest.
Example 3 (Dyadic data with node attributes).
A dyadic dataset consists of measurements on pairs of objects or under pairs of conditions. Common examples include networks and graphs. Let $G = (V, E)$ denote a network, where $V = [d]$ is the node set of the graph and $E$ is the edge set. Suppose that we also observe a covariate vector $\boldsymbol{x}_v\in\mathbb{R}^p$ associated to each node $v\in V$. A probabilistic model on the graph can be described by the following matrix regression. The edge connecting two vertices $u$ and $v$ forms independently of other pairs, and the probability of connection is modeled as

$$\mathbb{P}\big((u,v)\in E\big) = f\big(\boldsymbol{x}_u^T B\, \boldsymbol{x}_v\big), \qquad (6)$$

where $B\in\mathbb{R}^{p\times p}$ is a coefficient matrix. The above model has demonstrated its success in modeling transitivity, balance, and communities in networks [5]. We show that our tensor regression model (2) also incorporates this graph model as a special case. Let $Y = (y_{uv})\in\{0,1\}^{d\times d}$ be a binary matrix where $y_{uv} = \mathbb{1}\{(u,v)\in E\}$. Define $X = (\boldsymbol{x}_1,\ldots,\boldsymbol{x}_d)^T\in\mathbb{R}^{d\times p}$. Then, the graph model (6) can be expressed as $\mathbb{E}(Y\mid X) = f(B\times_1 X\times_2 X) = f(XBX^T)$.
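For instance, with a logistic link, the edge-probability matrix of Example 3 can be evaluated in one line (a sketch; `X` holds the node attributes row-wise):

```python
import numpy as np

def edge_probs(X, B):
    """Edge probabilities f(X B X^T) of model (6) under the logistic link."""
    return 1.0 / (1.0 + np.exp(-(X @ B @ X.T)))
```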
In the above three examples, and in many other studies, researchers are interested in uncovering the variation in the data tensor that can be explained by the covariates. The regression coefficient $\mathcal{B}$ in our model (2) serves this goal by collecting the effects of the covariates and the interactions thereof. To encourage sharing among the effects, we assume that the coefficient tensor lies in a low-dimensional parameter space:

$$\mathcal{P}_{(r_1,\ldots,r_K)} = \big\{\mathcal{B}\in\mathbb{R}^{p_1\times\cdots\times p_K}\colon \operatorname{rank}(\mathcal{B}) = (r_1,\ldots,r_K)\big\}, \qquad (7)$$

where $r_k$ is the Tucker rank at mode $k$ of the tensor. The low-rank assumption is plausible in many scientific applications. In brain imaging analysis, for instance, it is often believed that the brain nodes can be grouped into communities, and the number of communities is much smaller than the number of nodes. The low-rank structure encourages the sharing of information across tensor entries, thereby greatly improving the estimation stability. When no confusion arises, we drop the subscript and write $\mathcal{P}$ for simplicity.
Our tensor regression model is able to incorporate covariates on any subset of modes, whenever available. Without loss of generality, we denote by $X_1,\ldots,X_K$ the covariates on all modes and treat $X_k = I_{d_k}$ if mode $k$ has no (informative) covariate. Then, the final form of our tensor regression model can be written as

$$\mathcal{Y}\mid X_1,\ldots,X_K \sim p(\mathcal{Y}\mid\Theta), \qquad \Theta = \mathcal{B}\times_1 X_1\times_2\cdots\times_K X_K, \qquad (8)$$

where the entries of $\mathcal{Y}$ are independent random variables conditional on $\Theta$, and $\mathcal{B}\in\mathcal{P}$ is the low-rank coefficient tensor of interest. We comment that other forms of tensor low-rankness are also possible; here we choose the Tucker rank for parsimony. Similar models can be derived using other notions of low-rankness, such as the CP decomposition [17] and the tensor-train decomposition [18].
4 Rank-constrained likelihood-based estimation
We develop a likelihood-based procedure to estimate the coefficient tensor $\mathcal{B}$ in (3). We adopt the exponential family as a flexible framework for different data types. In a classical generalized linear model (GLM) with a scalar response $y$ and covariate $\boldsymbol{x}$, the density is expressed as

$$p(y\mid\boldsymbol{x}) = c(y,\phi)\exp\left(\frac{y\theta - b(\theta)}{\phi}\right),$$

where $b(\cdot)$ is a known function, $\theta = \boldsymbol{x}^T\boldsymbol{\beta}$ is the linear predictor, $\phi>0$ is the dispersion parameter, and $c(\cdot)$ is a known normalizing function. The choice of link function depends on the data type and on the observation domain of $y$, denoted $\mathbb{Y}$. For example, the observation domain is $\mathbb{Y}=\mathbb{R}$ for continuous data, $\mathbb{Y}=\mathbb{N}$ for count data, and $\mathbb{Y}=\{0,1\}$ for binary data. Note that the canonical link function is chosen to be $f = b'$. Table 1 summarizes the canonical link functions for common types of distributions.
In our context, we model the entries of the response tensor $\mathcal{Y}$, conditional on the covariates, as independent draws from an exponential family. The quasi log-likelihood of (3) is equal, up to constants, to the negative Bregman divergence between $\mathcal{Y}$ and $f(\Theta)$:

$$\mathcal{L}_{\mathcal{Y}}(\mathcal{B}) = \big\langle\mathcal{Y},\,\Theta\big\rangle - \sum_{i_1,\ldots,i_K} b(\theta_{i_1\cdots i_K}), \qquad \text{where } \Theta = (\theta_{i_1\cdots i_K}) = \mathcal{B}\times_1 X_1\times_2\cdots\times_K X_K. \qquad (9)$$
We assume additional information on an upper bound $\alpha$ such that $\|\Theta_{\text{true}}\|_\infty\le\alpha$. This is the case for many applications we have in mind, such as brain network analysis, where fiber connections are bounded. We propose a constrained maximum likelihood estimator (MLE) for the coefficient tensor:

$$\hat{\mathcal{B}} = \underset{\mathcal{B}\in\mathcal{P},\ \|\mathcal{B}\times_1 X_1\cdots\times_K X_K\|_\infty\le\alpha}{\operatorname{arg\,max}}\ \mathcal{L}_{\mathcal{Y}}(\mathcal{B}). \qquad (11)$$
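The objective in (11) is straightforward to evaluate. The sketch below writes out the quasi log-likelihood (9) for the three families of Table 1, taking dispersion $\phi=1$ and dropping constants in $\mathcal{Y}$; `multilinear` is the illustrative helper from Section 2:

```python
import numpy as np

def loglik(Y, B, X, family):
    """Quasi log-likelihood (9): <Y, Theta> - sum_i b(theta_i), up to constants."""
    Theta = multilinear(B, X)
    if family == "gaussian":       # b(theta) = theta^2 / 2
        return np.sum(Y * Theta) - 0.5 * np.sum(Theta ** 2)
    if family == "poisson":        # b(theta) = exp(theta)
        return np.sum(Y * Theta) - np.sum(np.exp(Theta))
    # bernoulli: b(theta) = log(1 + exp(theta)), computed stably
    return np.sum(Y * Theta) - np.sum(np.logaddexp(0.0, Theta))
```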
In the following theoretical analysis, we assume the rank $(r_1,\ldots,r_K)$ is known and fixed. The adaptation to an unknown rank is addressed in Section 5.2.
4.1 Statistical properties
We assess the estimation accuracy using the deviation in the Frobenius norm. For the true coefficient tensor $\mathcal{B}_{\text{true}}$ and its estimator $\hat{\mathcal{B}}$, define

$$\operatorname{Loss}\big(\hat{\mathcal{B}},\,\mathcal{B}_{\text{true}}\big) = \frac{1}{\sqrt{\prod_k p_k}}\,\big\|\hat{\mathcal{B}} - \mathcal{B}_{\text{true}}\big\|_F,$$

i.e., the root mean squared deviation per coefficient entry. In modern applications, the response tensor and the covariates are often large-scale. We are particularly interested in the high-dimensional regime in which both the tensor dimensions and the covariate dimensions diverge; i.e., $d_k\to\infty$ and $p_k\to\infty$, while $p_k\le d_k$ for all $k\in[K]$. As the size of the problem grows, so does the number of unknown parameters. As such, the classical MLE theory does not directly apply. We leverage recent developments in random tensor theory and high-dimensional statistics to establish the error bounds for the estimation.
Assumption 1.
We make the following assumptions:

A1. There exist two positive constants $c_{\min}, c_{\max}$ such that $c_{\min}\le\sigma_{\min}(X_k)\le\sigma_{\max}(X_k)\le c_{\max}$ for all $k\in[K]$. Here $\sigma_{\min}(\cdot)$ and $\sigma_{\max}(\cdot)$ denote the smallest and largest singular values, respectively.

A2. There exist two positive constants $\gamma_\alpha, U_\alpha$ such that $\gamma_\alpha\le b''(\theta)\le U_\alpha$ for all $|\theta|\le\alpha$.

A2'. Equivalently, there exist two positive constants such that $\gamma_\alpha\phi\le\operatorname{Var}(y\mid\theta)\le U_\alpha\phi$ for all $|\theta|\le\alpha$, where $\alpha$ is the upper bound of the linear predictor.
The assumptions are fairly mild. Assumption A1 guarantees the non-singularity of the covariates, and Assumption A2 ensures that the log-likelihood is strictly concave in the linear predictor $\Theta$. Assumptions A2 and A2' are equivalent, because $\operatorname{Var}(y\mid\theta) = \phi\, b''(\theta)$ when $y$ belongs to an exponential family [19].
Theorem 4.1 (Statistical convergence).
Consider the generalized tensor regression model with covariates on multiple modes. Suppose the entries of $\mathcal{Y}$ are independent realizations from an exponential family distribution, and the linear predictor follows the low-rank tensor regression model (3). Under Assumption 1, there exist two constants $C_1, C_2>0$ such that, with probability at least $1-\exp\big(-C_2\sum_k p_k\big)$,

$$\operatorname{Loss}\big(\hat{\mathcal{B}},\,\mathcal{B}_{\text{true}}\big)\ \le\ C_1\sqrt{\frac{r_{\max}^{K-1}\sum_k p_k}{\prod_k p_k}}, \qquad (12)$$

where $r_{\max}=\max_k r_k$. Here, $C_1$ is a constant that does not depend on the dimensions $\{d_k\}$ and $\{p_k\}$.
To gain further insight into the bound (12), we consider the special case when the tensor dimensions are equal at each of the modes, i.e., $d_k = d$, $p_k = p$, $r_k = r$ for all $k\in[K]$, and the covariates are Gaussian design matrices with i.i.d. standard normal entries. To put the context in the framework of Theorem 4.1, we rescale the covariates into $X_k/\sqrt{d}$ so that the singular values of $X_k/\sqrt{d}$ are bounded by constants. The result in (12) then implies that the estimated coefficient has a convergence rate on the order of $d^{-K/2}$ (for fixed $p$ and $r$) in the scale of the original covariates $X_k$. Therefore, our estimation is consistent as the dimension $d$ grows, and the convergence becomes especially favorable as the order of the tensor data increases.
As immediate applications, we obtain the convergence rates for the three examples mentioned in Section 3. Without loss of generality, we assume that the singular values of each $d_k$-by-$p_k$ covariate matrix grow on the order of $\sqrt{d_k}$; identity covariates are kept as they are.
Corollary 1 (Spatiotemporal growth model).
The estimated type-by-depth-by-time coefficient tensor converges at the rate $\sqrt{(m+\ell_1+\ell_2)/(nqt)}$, up to a rank-dependent factor, where $n, q, t$ are the numbers of lakes, depth levels, and time points, and $m, \ell_1, \ell_2$ are the corresponding covariate dimensions. The estimation achieves consistency as long as the dimension grows in any of the three modes.
Corollary 2 (Network population model).
The estimated node-by-node-by-covariate coefficient tensor converges at the rate $\sqrt{(2d+p)/(d^2 n)}$, up to rank- and $p$-dependent factors, where $d$ is the number of nodes, $n$ the number of individuals, and $p$ the number of individual covariates. The estimation achieves consistency as the number of individuals or the number of nodes grows.
Corollary 3 (Dyadic data with node attributes).
The estimated covariate-by-covariate coefficient matrix converges at the rate $\sqrt{p}/d$, up to a rank-dependent factor, where $d$ is the number of nodes and $p$ the number of node attributes. Again, our estimation achieves consistency as the number of nodes grows.
We conclude this section by providing the prediction accuracy, measured in KL divergence, for the response distribution.
Theorem 4.2 (Prediction error).
Assume the same setup as in Theorem 4.1. Let $\mathbb{P}_{\mathcal{B}_{\text{true}}}$ and $\mathbb{P}_{\hat{\mathcal{B}}}$ denote the distributions of $\mathcal{Y}$ given the true parameter $\mathcal{B}_{\text{true}}$ and the estimated parameter $\hat{\mathcal{B}}$, respectively. Then we have, with probability at least $1-\exp\big(-C_2\sum_k p_k\big)$,

$$\frac{1}{\prod_k d_k}\,\operatorname{KL}\Big(\mathbb{P}_{\mathcal{B}_{\text{true}}}\,\big\|\,\mathbb{P}_{\hat{\mathcal{B}}}\Big)\ \le\ C\,\frac{r_{\max}^{K-1}\sum_k p_k}{\prod_k d_k},$$

where $C$ is a constant that does not depend on the dimensions $\{d_k\}$ and $\{p_k\}$.
5 Numerical implementation
5.1 Alternating optimization
In this section, we introduce an efficient algorithm to solve (11). The objective function is concave in $\mathcal{B}$ when $f$ is the canonical link function. However, the feasible set $\mathcal{P}$ is non-convex, and thus the optimization (11) is a non-convex problem. We utilize a Tucker factor representation of the coefficient tensor and turn the optimization into a blockwise convex problem.
Specifically, write the rank decomposition of the coefficient tensor as

$$\mathcal{B} = \mathcal{C}\times_1 P_1\times_2\cdots\times_K P_K, \qquad (13)$$

where $\mathcal{C}\in\mathbb{R}^{r_1\times\cdots\times r_K}$ is a full-rank core tensor and $P_k\in\mathbb{R}^{p_k\times r_k}$ are factor matrices with orthonormal columns. Estimating $\mathcal{B}$ amounts to finding both the core tensor and the factor matrices $P_k$'s. The optimization (11) can be written as $\max\,\mathcal{L}_{\mathcal{Y}}(\mathcal{C}, P_1,\ldots,P_K)$, where

$$\mathcal{L}_{\mathcal{Y}}(\mathcal{C}, P_1,\ldots,P_K) = \big\langle\mathcal{Y},\,\Theta\big\rangle - \sum_{i_1,\ldots,i_K} b(\theta_{i_1\cdots i_K}), \qquad (14)$$
$$\Theta = \mathcal{C}\times_1 (X_1P_1)\times_2\cdots\times_K (X_KP_K). \qquad (15)$$
The decision variables in the above objective function consist of $K+1$ blocks, one for the core tensor and $K$ for the factor matrices $P_k$'s. We notice that, if any $K$ out of the $K+1$ blocks of variables are held fixed, then the optimization with respect to the remaining block reduces to a simple GLM. This observation suggests that we can iteratively update one block at a time while keeping the others fixed. After each iteration, we rescale the core tensor subject to the maximum-norm constraint. In principle, this post-processing may not guarantee a monotonic increase of the objective, but in our experiments this simple step appears robust for obtaining a desirable solution. The full algorithm is described in Algorithm 1.
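To make the alternating scheme concrete, the sketch below implements the Gaussian (identity-link) case with one gradient step per block. It is a minimal illustration rather than the full Algorithm 1: the exact per-block GLM solves, the re-orthonormalization of the factors, and the core-rescaling step are omitted, and `unfold` and `multilinear` are the helpers from Section 2:

```python
import numpy as np

def fit_gaussian(Y, X, ranks, n_iter=200, lr=1e-3, seed=0):
    """Alternating block updates for model (8) with the identity link."""
    rng = np.random.default_rng(seed)
    K = Y.ndim
    C = rng.standard_normal(ranks)
    # Factor matrices with orthonormal columns, as in the Tucker form (13)
    P = [np.linalg.qr(rng.standard_normal((X[k].shape[1], ranks[k])))[0]
         for k in range(K)]
    for _ in range(n_iter):
        M = [X[k] @ P[k] for k in range(K)]
        for k in range(K):
            R = multilinear(C, M) - Y        # gradient of -loglik w.r.t. Theta
            # Theta = T_k x_k (X_k P_k), with an identity in place of mode k
            T_k = multilinear(C, [np.eye(ranks[j]) if j == k else M[j]
                                  for j in range(K)])
            grad = X[k].T @ unfold(R, k) @ unfold(T_k, k).T
            P[k] = P[k] - lr * grad          # gradient step on factor block k
            M[k] = X[k] @ P[k]
        R = multilinear(C, M) - Y
        C = C - lr * multilinear(R, [m.T for m in M])  # step on the core block
    return C, P
```

A production implementation would replace each gradient step with the closed-form least-squares (GLM) solution for that block, which is what makes the subproblems convex.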
5.2 Rank selection
Algorithm 1 takes the rank $(r_1,\ldots,r_K)$ as an input. Estimating an appropriate rank from the data is of practical importance. We propose to use the Bayesian information criterion (BIC) and choose the rank that minimizes it; i.e.,

$$\hat{r} = \operatorname*{arg\,min}_{r}\ \text{BIC}(r), \qquad (16)$$
$$\text{BIC}(r) = -2\,\mathcal{L}_{\mathcal{Y}}\big(\hat{\mathcal{B}}_r\big) + p_e(r)\,\log\Big(\prod\nolimits_k d_k\Big), \qquad (17)$$

where $\hat{\mathcal{B}}_r$ is the constrained MLE fitted with rank $r=(r_1,\ldots,r_K)$ and $p_e(r) = \prod_k r_k + \sum_k r_k(p_k - r_k)$ is the effective number of parameters in the model. We choose the $\hat{r}$ that minimizes $\text{BIC}(r)$ via grid search. Our choice of BIC aims to balance the goodness-of-fit to the data against the degrees of freedom in the population model. We test its empirical performance in Section 6.
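A minimal grid-search sketch follows; it reuses `fit_gaussian` and `multilinear` from the earlier sketches and assumes the Gaussian model with unit dispersion (the parameter count `p_e` follows the Tucker form above):

```python
from itertools import product
import numpy as np

def bic(Y, X, C, P):
    """BIC (17): -2 * quasi log-likelihood + p_e * log(number of entries)."""
    K = Y.ndim
    Theta = multilinear(C, [X[k] @ P[k] for k in range(K)])
    ll = np.sum(Y * Theta) - 0.5 * np.sum(Theta ** 2)   # Gaussian, constants dropped
    ranks = C.shape
    p_e = np.prod(ranks) + sum(r * (X[k].shape[1] - r) for k, r in enumerate(ranks))
    return -2.0 * ll + p_e * np.log(Y.size)

def select_rank(Y, X, max_rank=5):
    """Grid search over rank combinations, returning the BIC minimizer."""
    best_score, best_ranks = np.inf, None
    for ranks in product(range(1, max_rank + 1), repeat=Y.ndim):
        C, P = fit_gaussian(Y, X, ranks)
        score = bic(Y, X, C, P)
        if score < best_score:
            best_score, best_ranks = score, ranks
    return best_ranks
```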
6 Simulation
We evaluate the empirical performance of our generalized tensor regression through simulations. We consider order-3 tensors with a range of distribution types. The coefficient tensor $\mathcal{B}$ is generated using the factorization form (13), where the entries of both the core tensor and the factor matrices are drawn i.i.d. from Uniform[-1, 1]. The linear predictor is then simulated from $\Theta = \mathcal{B}\times_1 X_1\times_2 X_2\times_3 X_3$, where each $X_k$ is either an identity matrix (i.e., no covariate available) or a Gaussian random matrix with i.i.d. entries from $N(0,1)$. We rescale each Gaussian design by $1/\sqrt{d_k}$ to ensure that the singular values of $X_k$ remain bounded as $d_k$ increases. The linear predictor $\Theta$ is scaled such that $\|\Theta\|_\infty = \alpha$. Conditional on the linear predictor $\Theta$, the entries in the tensor $\mathcal{Y}$ are drawn independently according to one of the following three probabilistic models:

- (Gaussian). Continuous data: $y_\omega\mid\theta_\omega \sim N(\theta_\omega,\,1)$ for each entry index $\omega$.

- (Poisson). Count data: $y_\omega\mid\theta_\omega \sim \text{Poisson}\big(e^{\theta_\omega}\big)$.

- (Bernoulli). Binary data: $y_\omega\mid\theta_\omega \sim \text{Bernoulli}\big((1+e^{-\theta_\omega})^{-1}\big)$.
Here $\alpha>0$ is a scalar controlling the magnitude of the effect size. In each simulation study, we report the mean squared error (MSE) of the coefficient tensor, averaged across replications.
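The data-generation step can be sketched as follows, reusing `multilinear` from Section 2; the max-norm scaling convention matches the description above:

```python
import numpy as np

def simulate(X, ranks, family="gaussian", alpha=1.0, seed=0):
    """Draw a response tensor from model (8); a sketch of Section 6's setup."""
    rng = np.random.default_rng(seed)
    K = len(X)
    C = rng.uniform(-1, 1, size=ranks)
    P = [rng.uniform(-1, 1, size=(X[k].shape[1], ranks[k])) for k in range(K)]
    Theta = multilinear(C, [X[k] @ P[k] for k in range(K)])
    Theta *= alpha / np.max(np.abs(Theta))        # enforce max-norm alpha
    if family == "gaussian":
        return Theta + rng.standard_normal(Theta.shape)
    if family == "poisson":
        return rng.poisson(np.exp(Theta))
    return rng.binomial(1, 1.0 / (1.0 + np.exp(-Theta)))   # bernoulli
```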
The first experiment assesses the selection accuracy of our BIC criterion (16). We consider the balanced situation where $d_k = d$ and $r_k = r$ for $k\in[3]$. We fix the effect size $\alpha$ and consider various combinations of the dimension $d$ and the rank $r$. For each combination, we simulate tensor data following the Gaussian, Bernoulli, and Poisson models. We then minimize the BIC using a grid search over the three rank dimensions. The hyperparameter $\alpha$ is set to infinity in the fitting, which essentially imposes no prior constraint on the coefficient magnitude. Table 2 reports the selected rank averaged over replicates for the Gaussian and Poisson models. We found that at the smallest dimension considered, the selected rank is slightly smaller than the true rank, and the accuracy improves quickly as the dimension increases. This agrees with our expectation: in tensor regression, the sample size is related to the number of tensor entries, so a larger $d$ implies a larger sample size and more accurate BIC selection.
Table 2: Rank selection via BIC. Selected ranks, averaged over replicates, under the Gaussian and Poisson models for various true ranks and dimensions.

The second experiment evaluates the accuracy when covariates are available on all modes. We set $r_k = r$ and increase $d$ from 25 to 50. Our theoretical analysis suggests that the estimation error decreases with the effective sample size in this setting. Figure 2a plots the estimation error versus the effective sample size under three different distribution models. We found that the empirical MSE decreases roughly at the theoretically predicted rate, which is consistent with our analysis. We also observed that tensors with higher ranks tend to yield higher estimation errors, as reflected by the upward shift of the curves as $r$ increases. Indeed, a larger $r$ implies a higher model complexity and thus greater difficulty in the estimation. Similar behaviors are observed for the non-Gaussian data in Figure 2b-c.
The third experiment investigates our model's ability to handle correlation among the coefficients. We mimic the scenario of brain imaging analysis. A sample of 50 networks is simulated, one for each individual, with each network measuring the connections among brain nodes. We simulate covariates for each of the 50 individuals; these covariates may represent, for example, age, gender, and cognitive score. A recent study [20] has suggested that brain connectivity networks often exhibit community structure represented as a collection of subnetworks, with each subnetwork comprised of a set of spatially distributed brain nodes. To accommodate this structure, we utilize the stochastic block model [21] to generate the effect sizes, as sketched below. Specifically, we partition the nodes into blocks by assigning each node to a block with uniform probability. Edges within the same block are assumed to share the same covariate effects, and the effects are drawn i.i.d. across blocks. We then apply our tensor regression model to the network data using the BIC-selected rank. Note that in this case the true model rank is unknown; the rank of a block matrix is not necessarily equal to the number of blocks [22].
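The block-structured effect tensor can be generated as follows (a sketch; the within-block effect distribution is taken to be standard normal here as a stand-in for the stripped specification, and for undirected networks one would symmetrize the block effects first):

```python
import numpy as np

def block_effects(d, p, n_blocks, seed=0):
    """d x d x p coefficient tensor with stochastic-block structure:
    every edge whose endpoints fall in the same pair of blocks shares
    one p-dimensional covariate effect."""
    rng = np.random.default_rng(seed)
    z = rng.integers(n_blocks, size=d)                # uniform block assignment
    effects = rng.standard_normal((n_blocks, n_blocks, p))
    return effects[z[:, None], z[None, :], :]         # broadcast to d x d x p
```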
Figure 3 compares the MSE of our method with that of a classical GLM approach, which regresses the dyadic edges, one at a time, on the covariates, fitting the model repeatedly for each edge. This edge-wise approach does not account for the correlation among the edges and may suffer from overfitting. As we can see in Figure 3, our tensor regression method achieves a significant error reduction in all three models considered. The outperformance is pronounced in the presence of large communities, and even in the less structured case with only a few nodes per block, our method still outperforms the GLM. This is because the low-rankness in our modeling automatically identifies the shared information across entries. By selecting the rank in a data-driven way, our method achieves accurate estimation with improved interpretability.
7 Data analysis
We apply our tensor regression model to two real datasets. The first application concerns brain network modeling in response to individual attributes (i.e., covariates on one mode), and the second application focuses on multi-relational network analysis with dyadic attributes (i.e., covariates on two modes).
7.1 Human Connectome Project (HCP)
The Human Connectome Project (HCP [23]) aims to build a "network map" that characterizes the anatomical and functional connectivity within healthy human brains. We take a subset of the HCP data that consists of 136 brain structural networks, one for each individual. Each brain network is represented as a 68-by-68 binary matrix, where the entries encode the presence or absence of fiber connections between 68 brain regions. We consider four individual covariates: gender (65 females vs. 71 males), age 22-25, age 26-30, and age 31+. The goal is to identify the connection edges that are affected by the individual covariates. A key challenge in brain network analysis is that the edges are correlated; for example, two edges may stem from the same brain region, and it is important to take into account this within-dyad dependence.
We fit the tensor regression model to the HCP data. The response is a binary tensor of dimension 68-by-68-by-136, and the covariates are of dimension 4 along the 3rd mode. The BIC selects a low rank for the coefficient tensor. Figure 4 shows the top edges with high effect sizes, overlaid on the Desikan atlas brain template [24, 25]. We utilize sum-to-zero contrasts in the effect coding and depict only the top 3% of edges whose connections are non-constant across the sample. We observe that the global connection pattern exhibits a clear spatial separation, with the nodes within each hemisphere more densely connected with each other (Figure 4a). In particular, the superior-temporal (SupT), middle-temporal (MT), and Insula are the three most connected nodes in the network. Interestingly, female brains display higher interhemispheric connectivity, especially in the frontal, parietal, and temporal lobes (Figure 4b). This is in agreement with a recent study showing that female brains are optimized for interhemispheric communication [26]. We also found several edges with declining connection in the group Age 31+. Notably, those edges involve the Frontal-pole (Fpole), superior-frontal (SupF), and Cuneus nodes. The Frontal-pole region has long been known for its importance in memory and cognition, and the detected decline with age further highlights its biological importance.
7.2 Nations data
The second application concerns multi-relational network analysis with node-level attributes. We consider the Nations dataset [27], which records 56 relations among 14 countries between 1950 and 1965. The multi-relational networks can be organized into a binary tensor of dimension 14-by-14-by-56, with each entry indicating the presence or absence of a connection, such as "sending tourists to", "export", or "import", between two countries. The 56 relations span the fields of politics, economics, military, religion, and so on. In addition, country-level attributes are also available, and we focus on the following six covariates: constitutional, catholics, lawngos, politicalleadership, geographyx, and medicinengo. The goal is to identify the variation in connections due to country-level attributes and interactions thereof. One key feature is that the 56 relations are correlated, and we would like to take this into account when assessing the covariate effects.
We applied our tensor regression model to the Nations data. The multi-relational network was treated as the response tensor, and the country attributes were treated as covariates on both the 1st and 2nd modes. The BIC criterion selects a low rank for the coefficient tensor $\hat{\mathcal{B}}$. Table 3 shows the clustering of the 56 relations based on the mode-3 factor of $\hat{\mathcal{B}}$. We found that relations reflecting similar aspects of international affairs are grouped together. In particular, Cluster I consists of political relations such as officialvisits, intergovorgs, and militaryactions; Clusters II and III capture economic relations such as economicaid, booktranslations, and tourism; and Cluster IV represents the Cold War alliance blocs. The similarity among the entities in each cluster suggests the plausibility of our dimension reduction.
To investigate the effects of dyadic attributes on the connections, we depict the estimated coefficients for several relation types (Figure 5). Note that the entries of $\hat{\mathcal{B}}$ can be interpreted as the contribution, on the logit scale, of a covariate pair (the $i$th covariate for the "sender" country and the $j$th covariate for the "receiver" country) towards the connection under relation $\ell$. Several interesting findings emerge. We found that relations belonging to the same cluster tend to have similar covariate effects. For example, the relations warnings and economicaid are classified into Cluster II, and both exhibit similar covariate patterns (Figure 5a-b). Moreover, the majority of the diagonal entries contribute positively to the connection, suggesting that countries with coherent attributes tend to interact more often than others. We also found that the constitutional attribute is an important predictor for the commonbloc relation, whereas its effect is weaker for other relations (Figure 5d). This is not surprising, as the bloc partition during the Cold War is associated with the constitutional attribute.
8 Conclusion
We have developed a generalized tensor regression model with covariates on multiple modes. A fundamental feature of tensor-valued data is the statistical interdependence among the entries. Our proposed rank-constrained estimation achieves high accuracy with sound theoretical guarantees. The estimation accuracy is quantified via the deviation in the Frobenius norm and the KL divergence. Other measures of accuracy may also be desirable, such as the spectral norm or the maximum norm of the deviation. Exploring the properties and benefits of different error quantifications warrants future research.
Appendix A Proofs
Proof of Theorem 4.1.
Define $\bar{\mathcal{L}}(\mathcal{B}) = \mathbb{E}\,\mathcal{L}_{\mathcal{Y}}(\mathcal{B})$, where the expectation is taken with respect to $\mathcal{Y}$ under the model with true parameter $\mathcal{B}_{\text{true}}$. We first prove the following two conclusions:

C1. There exist two positive constants $C_1, C_2$ such that, with probability at least $1-\exp\big(-C_2\sum_k p_k\big)$, the stochastic deviation, $\mathcal{L}_{\mathcal{Y}}(\mathcal{B}) - \bar{\mathcal{L}}(\mathcal{B}) = \langle\mathcal{S},\mathcal{B}\rangle$ with $\mathcal{S} = (\mathcal{Y}-\mathbb{E}\mathcal{Y})\times_1 X_1^T\times_2\cdots\times_K X_K^T$, satisfies

$$\sup_{\mathcal{B}\in\mathcal{P}}\ \frac{|\langle\mathcal{S},\mathcal{B}\rangle|}{\|\mathcal{B}\|_F}\ \le\ C_1\sqrt{r_{\max}^{K-1}\sum\nolimits_k p_k}.$$

C2. The inequality $\bar{\mathcal{L}}(\mathcal{B}) - \bar{\mathcal{L}}(\mathcal{B}_{\text{true}})\le -\frac{\gamma_\alpha}{2}\|\Theta-\Theta_{\text{true}}\|_F^2$ holds, where $\gamma_\alpha$ is the lower bound for $b''$.
To prove C1, we note that the stochastic deviation can be written as

$$\mathcal{L}_{\mathcal{Y}}(\mathcal{B}) - \bar{\mathcal{L}}(\mathcal{B}) = \big\langle\mathcal{Y}-\mathbb{E}\mathcal{Y},\ \mathcal{B}\times_1 X_1\cdots\times_K X_K\big\rangle = \big\langle\mathcal{E}\times_1 X_1^T\cdots\times_K X_K^T,\ \mathcal{B}\big\rangle, \qquad (18)$$

where $\mathcal{E} = \mathcal{Y} - f(\Theta_{\text{true}})$, and the second equality uses the property of the exponential family that $\mathbb{E}\,\mathcal{Y} = b'(\Theta_{\text{true}}) = f(\Theta_{\text{true}})$. Based on Proposition 2, the boundedness of $b''$ implies that $\mathcal{E}$ is a sub-Gaussian tensor. Let $\mathcal{S} = \mathcal{E}\times_1 X_1^T\times_2\cdots\times_K X_K^T$. By Proposition 1, $\mathcal{S}$ is a $(p_1,\ldots,p_K)$-dimensional sub-Gaussian tensor with parameter bounded by $\sqrt{U_\alpha}\prod_k\sigma_{\max}(X_k)$, where $U_\alpha$ is the upper bound of $b''$. Applying the spectral-nuclear norm duality (a Cauchy-Schwarz-type inequality) to (18) yields

$$|\langle\mathcal{S},\mathcal{B}\rangle|\ \le\ \|\mathcal{S}\|_\sigma\,\|\mathcal{B}\|_*, \qquad (19)$$

where $\|\cdot\|_\sigma$ denotes the tensor spectral norm and $\|\cdot\|_*$ denotes the tensor nuclear norm. The nuclear norm is bounded by $\|\mathcal{B}\|_*\le\sqrt{r_{\max}^{K-1}}\,\|\mathcal{B}\|_F$ for $\mathcal{B}\in\mathcal{P}$ (c.f. [28, 13]). The spectral norm is bounded by $\|\mathcal{S}\|_\sigma\le C\sqrt{\sum_k p_k}$ with probability at least $1-\exp\big(-C_2\sum_k p_k\big)$ (c.f. [28, 29]). Combining these two bounds with (19), we have, with probability at least $1-\exp\big(-C_2\sum_k p_k\big)$,

$$|\langle\mathcal{S},\mathcal{B}\rangle|\ \le\ C_1\sqrt{r_{\max}^{K-1}\sum\nolimits_k p_k}\ \|\mathcal{B}\|_F,$$

where $C_1$ is a constant absorbing all factors that do not depend on $\{d_k\}$ and $\{p_k\}$.
Next we prove C2. Applying a Taylor expansion of $\mathcal{L}_{\mathcal{Y}}$ around $\Theta_{\text{true}}$ yields

$$\mathcal{L}_{\mathcal{Y}}(\mathcal{B}) = \mathcal{L}_{\mathcal{Y}}(\mathcal{B}_{\text{true}}) + \big\langle\nabla\mathcal{L}_{\mathcal{Y}}(\Theta_{\text{true}}),\ \Theta-\Theta_{\text{true}}\big\rangle + \frac{1}{2}\operatorname{vec}(\Theta-\Theta_{\text{true}})^T H\,\operatorname{vec}(\Theta-\Theta_{\text{true}}), \qquad (20)$$

where $H$ is the (non-random) Hessian of $-\sum b(\cdot)$ evaluated at $\tilde{\Theta} = \Theta_{\text{true}} + s(\Theta-\Theta_{\text{true}})$ for some $s\in[0,1]$. Note that we have $\mathbb{E}\,\nabla\mathcal{L}_{\mathcal{Y}}(\Theta_{\text{true}}) = 0$. We take expectations with respect to $\mathcal{Y}$ on both sides of (20) and obtain

$$\bar{\mathcal{L}}(\mathcal{B}) = \bar{\mathcal{L}}(\mathcal{B}_{\text{true}}) + \frac{1}{2}\operatorname{vec}(\Theta-\Theta_{\text{true}})^T H\,\operatorname{vec}(\Theta-\Theta_{\text{true}}). \qquad (21)$$

By the fact $b''(\theta)\ge\gamma_\alpha$ for all $|\theta|\le\alpha$ and the chain rule applied to $\Theta\mapsto\sum b(\theta)$, equation (21) implies that

$$\bar{\mathcal{L}}(\mathcal{B}) - \bar{\mathcal{L}}(\mathcal{B}_{\text{true}})\ \le\ -\frac{\gamma_\alpha}{2}\,\|\Theta-\Theta_{\text{true}}\|_F^2 \qquad (22)$$

holds for all $\mathcal{B}\in\mathcal{P}$, provided that $\|\Theta\|_\infty\le\alpha$. In particular, the inequality (22) also applies to the constrained MLE $\hat{\mathcal{B}}$. So we have

$$\bar{\mathcal{L}}(\hat{\mathcal{B}}) - \bar{\mathcal{L}}(\mathcal{B}_{\text{true}})\ \le\ -\frac{\gamma_\alpha}{2}\,\|\hat{\Theta}-\Theta_{\text{true}}\|_F^2. \qquad (23)$$
Now we have proved both C1 and C2. Note that, by the definition of $\hat{\mathcal{B}}$, $\mathcal{L}_{\mathcal{Y}}(\hat{\mathcal{B}})\ge\mathcal{L}_{\mathcal{Y}}(\mathcal{B}_{\text{true}})$. This implies that

$$0\ \le\ \mathcal{L}_{\mathcal{Y}}(\hat{\mathcal{B}}) - \mathcal{L}_{\mathcal{Y}}(\mathcal{B}_{\text{true}}) \qquad (24)$$
$$=\ \big[\bar{\mathcal{L}}(\hat{\mathcal{B}}) - \bar{\mathcal{L}}(\mathcal{B}_{\text{true}})\big] + \big\langle\mathcal{S},\ \hat{\mathcal{B}}-\mathcal{B}_{\text{true}}\big\rangle \qquad (25)$$
$$\le\ -\frac{\gamma_\alpha}{2}\,\|\hat{\Theta}-\Theta_{\text{true}}\|_F^2 + \big\langle\mathcal{S},\ \hat{\mathcal{B}}-\mathcal{B}_{\text{true}}\big\rangle, \qquad (26)$$

where the second line follows from (18) and the third from (23). Therefore,

$$\frac{\gamma_\alpha}{2}\,\|\hat{\Theta}-\Theta_{\text{true}}\|_F^2\ \le\ \big\langle\mathcal{S},\ \hat{\mathcal{B}}-\mathcal{B}_{\text{true}}\big\rangle. \qquad (27)$$

Combining (27) with C1 yields

$$\frac{\gamma_\alpha}{2}\,\|\hat{\Theta}-\Theta_{\text{true}}\|_F^2\ \le\ C_1\sqrt{r_{\max}^{K-1}\sum\nolimits_k p_k}\ \|\hat{\mathcal{B}}-\mathcal{B}_{\text{true}}\|_F.$$

Therefore, the final conclusion follows by noting that

$$\|\hat{\Theta}-\Theta_{\text{true}}\|_F^2\ \ge\ \prod_k\sigma_{\min}^2(X_k)\,\|\hat{\mathcal{B}}-\mathcal{B}_{\text{true}}\|_F^2\ \ge\ c_{\min}^{2K}\,\|\hat{\mathcal{B}}-\mathcal{B}_{\text{true}}\|_F^2,$$

where $c_{\min}$ is the constant from Assumption A1, and the final constant does not depend on the dimensions $\{d_k\}$ and $\{p_k\}$. ∎
Proposition 1 (sub-Gaussian tensors).
Let $\mathcal{E}$ be a sub-Gaussian tensor of dimension $(d_1,\ldots,d_K)$ with parameter $\sigma$, and let $M_k\in\mathbb{R}^{p_k\times d_k}$ be non-random matrices for all $k\in[K]$. Then $\mathcal{E}\times_1 M_1\times_2\cdots\times_K M_K$ is a sub-Gaussian tensor of dimension $(p_1,\ldots,p_K)$ with parameter bounded by $\sigma\prod_k\sigma_{\max}(M_k)$. Here $\sigma_{\max}(\cdot)$ denotes the largest singular value of the matrix.
Proof.
To show that $\mathcal{E}\times_1 M_1\times_2\cdots\times_K M_K$ is a sub-Gaussian tensor, it suffices to show that $\langle\mathcal{E}\times_1 M_1\times_2\cdots\times_K M_K,\ \mathcal{U}\rangle$ is a sub-Gaussian scalar with parameter $\sigma\prod_k\sigma_{\max}(M_k)$, for any tensor $\mathcal{U}$ with $\|\mathcal{U}\|_F = 1$.
Note that

$$\big\langle\mathcal{E}\times_1 M_1\times_2\cdots\times_K M_K,\ \mathcal{U}\big\rangle = \big\langle\mathcal{E},\ \mathcal{U}\times_1 M_1^T\times_2\cdots\times_K M_K^T\big\rangle, \qquad (28)$$