# Sparse Tensor Additive Regression

###### Abstract

Tensors are becoming prevalent in modern applications such as medical imaging and digital marketing. In this paper, we propose a sparse tensor additive regression (STAR) that models a scalar response as a flexible nonparametric function of tensor covariates. The proposed model effectively exploits the sparse and low-rank structures in the tensor additive regression. We formulate the parameter estimation as a non-convex optimization problem, and propose an efficient penalized alternating minimization algorithm. We establish a non-asymptotic error bound for the estimator obtained from each iteration of the proposed algorithm, which reveals an interplay between the optimization error and the statistical rate of convergence. We demonstrate the efficacy of STAR through extensive comparative simulation studies, and an application to the click-through-rate prediction in online advertising.

KEY WORDS: Additive models; low-rank tensor; non-asymptotic analysis; non-convex optimization; tensor regression.

## 1 Introduction

Tensor data have recently become popular in a wide range of applications such as medical imaging (zhou2013; li2017parsimonious; sun2017store), digital marketing (zhe2016distributed; sun2017provable), video processing (guo2012tensor), and social network analysis (park2009pairwise; hoff2015multilinear), among many others. In such applications, a fundamental statistical tool is tensor regression, a modern high-dimensional regression method that relates a scalar response to tensor covariates. For example, in neuroimaging analysis, an important objective is to predict clinical outcomes using subjects’ brain imaging data. This can be formulated as a tensor regression problem by treating the clinical outcomes as the response and the brain images as the tensor covariates. Another example is the study of how advertisement placement affects users’ clicking behavior in online advertising. This again can be formulated as a tensor regression problem by treating the daily overall click-through rate (CTR) as the response and the tensor that summarizes the impressions (i.e., view counts) of different advertisements on different devices (e.g., phone, computer, etc.) as the covariate. In Section 6, we consider such an online advertising application.

Denote as a scalar response and as an -way tensor covariate, . A general tensor regression model can be formulated as

where is an unknown regression function, are scalar observation noises. Many existing methods assumed a linear relationship between the response and the tensor covariates by considering for some low-rank tensor coefficient (zhou2013; rabusseau2016; yu2016; guhaniyogi2017bayesian; raskutti2019convex). In spite of its simplicity, the linear assumption could be restrictive and difficult to satisfy in real applications. Consider the online advertising data in Section 6 as an example. Figure 1 shows the marginal relationship between the overall CTR and the impressions of an advertisement delivered on phone, tablet, and PC, respectively. It is clear that the relationship between the response (i.e., the overall CTR) and the covariate (i.e., impressions across three devices) departs notably from the linearity assumption.

A few works have considered more flexible tensor regressions by treating as a nonparametric function (suzuki2016minimax; kanagawa2016gaussian). In particular, suzuki2016minimax proposed a general nonlinear model where the true function consists of components from a reproducing kernel Hilbert space, and used an alternating minimization estimation procedure; kanagawa2016gaussian considered a Bayesian approach that employed a Gaussian process prior to learn the nonparametric function on the reproducing kernel Hilbert space. One serious limitation of both works is the assumption that the tensor covariates are *exactly low-rank*. This assumption is difficult to satisfy in practice, as most tensor covariates are not exactly low-rank, and when it fails the performance of these two methods deteriorates dramatically; see Section 5.2 for more details. In addition, the Gaussian process approach is computationally very expensive, which severely limits its application in problems with high-dimensional tensor covariates.

In this paper, we develop a flexible and computationally feasible tensor regression framework, which accommodates the nonlinear relationship between the response and the tensor covariate, and is highly interpretable. Specifically, for an -way tensor covariate , we consider a sparse tensor additive regression (STAR) model with

(1)

where denotes the -th element of , and is a nonparametric additive component belonging to some smooth function class. Approximating the additive components using spline series expansions, can be simplified to a compact tensor representation of spline coefficients. To reduce the number of parameters and increase computational efficiency, we assume that the corresponding high-dimensional coefficient tensors have low-rank and group-sparse structures. Both low-rankness and sparsity are commonly used dimension reduction tools in recent tensor models (li2017parsimonious; sun2017provable; sun2017store; hao2018sparse; zhang2019cross; zhang2019optimal). Besides effectively reducing computational cost, the group sparsity structure also significantly improves model interpretability. For instance, in the online advertising example, when the daily overall CTR is regressed on the impressions of different advertisements on different devices, the group sparsity enables our STAR model to select effective advertisement-device combinations. This type of advertisement selection is important for managerial decision making and has been an active research area (choi2010using; xu2016lift). To efficiently estimate the model, we formulate the parameter estimation as a non-convex optimization and propose a penalized alternating minimization algorithm. By fully exploiting the low-rankness and group sparsity structures, our algorithm can run considerably faster than tensor linear regression. For example, in the online advertising application, our STAR model reduces the CTR prediction error by 50% while using only 10% of the computational time of the linear or nonlinear tensor regression benchmark models. See Section 6 for more details.

Besides methodological contributions, we also obtain strong theoretical results for our proposed method. In particular, we first establish a general theory for penalized alternating minimization in the context of the tensor additive model. To the best of our knowledge, this is the first statistical-versus-optimization guarantee for penalized alternating minimization. Previous works mostly focus on either the EM-type update (wang2014optimal; BWY15; hao2017simultaneous) or the truncation-based update (sun2017provable). Those techniques are not directly applicable to our scenario; see Section 4.1 for detailed explanations. Next, we derive a non-asymptotic error bound for the estimator from each iteration, which demonstrates the improvement of the estimation error in each update. Finally, we apply this general theory to our STAR estimator with the B-spline basis and the group-lasso penalty, and show that the estimation error in the -th iteration satisfies

where is a contraction parameter, is the smoothness parameter of the function class, , and is the number of spline series. The above error bound reveals an interesting interplay between the optimization error and the statistical error: the optimization error decays geometrically with the iteration number , while the statistical error remains the same as grows. When the tensor covariate is of order one (i.e., a vector covariate), our problem reduces to the vector nonparametric additive model. In that case, our statistical error matches that of the vector nonparametric additive model in huang2010variable.

### 1.1 Other related work

The problem we consider in our work is fundamentally different from those in tensor decomposition and tensor response regression. As a result, the technical tools involved and the theoretical results are quite different.

Tensor decomposition (chi2012on; anandkumar2014tensor; yuan2016on; sun2017provable) is an unsupervised learning method that aims to find the best low-rank approximation of a single tensor. In comparison, our STAR model is a supervised learning method that seeks to capture the nonlinear relationship between the response and the tensor covariate. Although the low-rank structure of the tensor coefficient is also employed in our estimation, our objective and the technical tools involved are entirely different from the typical tensor decomposition problem. Additionally, one fundamental difference is that our model works with multiple tensor samples, while tensor decomposition works only with a single tensor. As a result, our error bound is a function of the sample size, which is different from that in tensor decomposition.

Another line of related work considers tensor response regression, where the response is a tensor and the covariates are scalars (ZhuHT2009intrinsic; li2017parsimonious; sun2017store).
These works also utilize the low-rank and/or sparse structures of the coefficient tensors for dimension reduction.
However, tensors are treated as the *response* in tensor response regression, whereas they are treated as *covariates* in our approach. These are two very different types of models, motivated by different applications.
The tensor response regression aims to study the change of the tensor (e.g., the brain image) as the covariate (e.g., disease status) varies.
In contrast, our tensor regression model focuses on understanding how a scalar outcome (e.g., the overall CTR) changes with the tensor covariates. As a result, the technical tools used for the theoretical analysis are also largely different.

### 1.2 Notations and structure

Throughout this article, we denote scalars by lower case characters such as , vectors by lower-case bold characters such as , matrices by upper-case bold characters such as and tensors by upper-case script characters such as . Given a vector and a set of indices , we define such that if and , otherwise. For a square matrix , we denote and as its minimum and maximum eigenvalues, respectively. For any function on , we define its norm by . Suppose are -way tensors. We define tensor inner product . The tensor Frobenius norm is defined as . The notation implies for some constant . For any two sequences , we write if there exists some positive constant and sufficiently large such that . We also write if there exist constants such that for all .

The rest of the article is organized as follows. Section 2 introduces our sparse tensor additive regression model. Section 3 develops an efficient penalized alternating minimization algorithm for model estimation. Section 4 investigates its theoretical properties, followed by simulation studies in Section 5 and a real online advertising application in Section 6. The appendix collects all technical proofs.

## 2 Sparse Tensor Additive Model

Given i.i.d. samples , our sparse tensor additive model assumes

(2)

where is the nonparametric additive function belonging to some smooth function class , and are i.i.d. observation noises.

Our STAR model utilizes spline series expansion (huang2010variable; fan2011nonparametric) to approximate each individual nonparametric additive component. Let be the space of polynomial splines and be a normalized basis for , where is the number of spline series and . It is known that for any , there always exist coefficients such that . In addition, under suitable smoothness assumptions (see Lemma 1), each nonparametric additive component can be well approximated by functions in . Applying the above approximation to each individual component, the regression function in (2) can be approximated by

(3)
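As a quick illustration of how a single smooth component can be approximated by a spline series, here is a minimal numpy sketch. It uses a cubic truncated-power basis as a simple stand-in for the normalized B-spline basis in the paper (the two bases span the same function space), and the target function sin(2πx) and knot placement are purely illustrative:

```python
import numpy as np

def spline_basis(x, knots):
    """Cubic truncated-power spline basis evaluated at points x.

    Columns: 1, x, x^2, x^3, and (x - t)_+^3 for each interior knot t.
    A simple stand-in for the normalized B-spline basis used in the paper.
    """
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - t, 0.0, None) ** 3 for t in knots]
    return np.column_stack(cols)

# Approximate a smooth component f(x) = sin(2*pi*x) on [0, 1].
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=500)
f = np.sin(2 * np.pi * x)

knots = np.linspace(0.1, 0.9, 9)           # the number of basis functions d_n grows slowly with n
Psi = spline_basis(x, knots)               # n x (4 + #knots) design matrix
beta, *_ = np.linalg.lstsq(Psi, f, rcond=None)

approx_err = np.max(np.abs(Psi @ beta - f))
print(approx_err)                          # small uniform error on the sample
```

With only 13 basis coefficients, the smooth component is recovered to high accuracy, which is the dimension reduction the spline expansion buys before any tensor structure is imposed.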

The expression in (3) has a compact tensor representation. Define such that , and such that for , where denotes for an integer . Consequently, we can write

(4)

Therefore, the parameter estimation of the nonparametric additive model (2) reduces to the estimation of the unknown tensor coefficients . The coefficients include a total number of free parameters, which could be much larger than the sample size . In such an ultrahigh-dimensional scenario, it is important to employ dimension reduction tools. A common tensor dimension reduction tool is the low-rank assumption (chi2012on; anandkumar2014tensor; yuan2016on; sun2017provable). Similarly, we assume each coefficient tensor satisfies the CP low-rank decomposition (kolda2009):

(5)

where is the vector outer product, , and is the CP-rank. This formulation reduces the effective number of parameters from to , and hence greatly improves computational efficiency. Under this formulation, our model can be written as

(6)

Our model in (6) can be viewed as a generalization of several existing works. When in (4) is an identity basis function () with only one basis (), (6) reduces to the bilinear form (li2010dimension; hung2012matrix) for a matrix covariate (), and to the multilinear form of linear tensor regression (zhou2013; hoff2015multilinear; yu2016; rabusseau2016; sun2017store; guhaniyogi2017bayesian; raskutti2019convex) for a tensor covariate ().
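To make the parameter saving of the CP decomposition in (5) concrete, the following numpy sketch builds a rank-R three-way coefficient tensor from its factor vectors; all dimensions here are hypothetical:

```python
import numpy as np

# Build a rank-R CP tensor B = sum_r b_{1r} (outer) b_{2r} (outer) b_{3r}.
rng = np.random.default_rng(1)
p1, p2, p3, R = 8, 6, 5, 2
B1 = rng.standard_normal((p1, R))   # columns are the mode-1 factor vectors
B2 = rng.standard_normal((p2, R))
B3 = rng.standard_normal((p3, R))

# Sum of R rank-one outer products, computed in one einsum call.
B = np.einsum('ir,jr,kr->ijk', B1, B2, B3)

# Free parameters: R*(p1+p2+p3) instead of p1*p2*p3.
n_cp, n_full = R * (p1 + p2 + p3), p1 * p2 * p3
print(B.shape, n_cp, n_full)        # (8, 6, 5) 38 240
```

Even at these toy sizes, the CP factorization cuts 240 free parameters down to 38; for realistic tensor dimensions the reduction is far more dramatic.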

In addition to the CP low-rank structure on the tensor coefficients, we further impose a group-type sparsity constraint on the components . This group sparsity structure not only further reduces the effective parameter size, but also improves the model interpretability, as it enables the variable selection of components in the tensor covariate. Recall that in (6) we have for . We define our group sparsity constraint as

(7)

where refers to the cardinality of the set . Figure 2 provides an illustration of the low-rank (5) and group-sparse (7) coefficients when the order of the tensor is . When , our model with the group sparsity constraint reduces to the vector sparse additive model (Ravikumar2009; meier2009high; huang2010variable).

## 3 Estimation

In this section, we describe our approach to estimating the parameters in our STAR model via a penalized empirical risk minimization, which simultaneously enforces the low-rankness and encourages the sparsity of the decomposed components. In particular, we consider

(8)

where is the empirical risk function, in which the low-rankness is guaranteed due to the CP decomposition, and is a penalty term that encourages sparsity. To enforce the sparsity as defined in (7), we consider the group lasso penalty (yuan2006model), i.e.,

(9)

where are tuning parameters. It is worth mentioning that our algorithm and theoretical analysis can accommodate a general class of decomposable penalties (see Condition 16 for details), which includes lasso, ridge, fused lasso, and group lasso as special cases.
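For intuition, the group-lasso penalty in (9) and its proximal operator (block soft-thresholding, which is what zeroes out entire groups of spline coefficients at once) can be sketched as follows; the vector, grouping, and tuning value are purely illustrative:

```python
import numpy as np

def group_lasso_penalty(beta, groups, lam):
    """P(beta) = lam * sum over groups g of the l2 norm of beta restricted to g."""
    return lam * sum(np.linalg.norm(beta[g]) for g in groups)

def group_soft_threshold(beta, groups, lam):
    """Proximal operator of the group-lasso penalty: block soft-thresholding.

    Each group is shrunk toward zero and set exactly to zero when its
    l2 norm falls below lam -- this is how whole groups (hence whole
    additive components) are eliminated at once.
    """
    out = beta.copy()
    for g in groups:
        nrm = np.linalg.norm(beta[g])
        out[g] = 0.0 if nrm <= lam else (1.0 - lam / nrm) * beta[g]
    return out

beta = np.array([0.1, -0.05, 3.0, 2.0, 0.02, 0.01])
groups = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]
shrunk = group_soft_threshold(beta, groups, lam=0.5)
print(shrunk)   # the first and last groups vanish; the middle group shrinks
```

Note that the surviving group has its l2 norm reduced by exactly lam, which is the usual shrinkage bias of the group lasso.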

For a general tensor covariate (), the optimization problem in (8) is non-convex. This is fundamentally different from the vector sparse additive model (Ravikumar2009; huang2010variable), whose optimization is convex. The non-convexity in (8) brings significant challenges in both model estimation and theoretical development. The key idea of our estimation procedure is to exploit the bi-convex structure of the empirical risk function, which is convex in one argument when all the other parameters are fixed. This motivates us to rewrite the empirical risk function in a bi-convex representation, which in turn facilitates the introduction of an efficient alternating minimization algorithm.

Denote for , , and . We also define the operator . Recall that with , see . We use to refer to the way tensor when we fix the index along the -th way of as , e.g., . Define

(10)

and denote . In addition, we denote , , and . Thus, when other parameters are fixed, minimizing the empirical risk function (8) with respect to is equivalent to minimizing

(11)

Note that the expression of (11) holds for any with proper definitions on and .

Based on this reformulation, we are ready to introduce the alternating minimization algorithm that solves (8) by alternately updating . A desirable property of our algorithm is that updating given the others can be solved efficiently via the back-fitting algorithm (Ravikumar2009). The detailed algorithm is summarized in Algorithm 1. With a slight abuse of notation, we redefine the penalty term .

In our implementation, we use ridge regression to initialize Algorithm 1, and set the tuning parameters to a common value for simplicity.
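The flavor of a single block update in Algorithm 1 can be sketched as follows: with the other factors held fixed, the problem is a least-squares fit in one factor plus a group-lasso penalty, which we solve here by proximal gradient descent on synthetic data. This is a simplified stand-in rather than the paper's exact update (which uses back-fitting), and all dimensions, the design matrix, and the tuning values are hypothetical:

```python
import numpy as np

# One block update, sketched: with the other CP factors fixed, the model is
# linear in the remaining factor, so we minimize a least-squares loss plus a
# group-lasso penalty via proximal gradient (block soft-thresholding prox).
rng = np.random.default_rng(2)
n, p, d = 200, 6, 4                  # p covariate groups, d spline coefs each
A = rng.standard_normal((n, p * d))  # partial design absorbing the fixed factors
groups = [np.arange(j * d, (j + 1) * d) for j in range(p)]

beta_true = np.zeros(p * d)
beta_true[groups[0]] = 1.0           # only the first group is active
y = A @ beta_true + 0.1 * rng.standard_normal(n)

lam, step = 0.02, 1.0 / np.linalg.norm(A, 2) ** 2
beta = np.zeros(p * d)
for _ in range(300):
    grad = A.T @ (A @ beta - y) / n
    z = beta - step * n * grad       # gradient step (step matched to the Lipschitz constant)
    for g in groups:                 # prox step: group soft-thresholding
        nrm = np.linalg.norm(z[g])
        z[g] = 0.0 if nrm <= lam else (1.0 - lam / nrm) * z[g]
    beta = z

active = [j for j, g in enumerate(groups) if np.linalg.norm(beta[g]) > 1e-8]
print(active)                         # the single truly active group is recovered
```

In the full algorithm this inner solve is repeated for each of the three factor blocks in turn, which is what makes the overall procedure an alternating minimization.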

## 4 Theory

In this section, we first establish a general theory for penalized alternating minimization in the context of the tensor additive model. Several sufficient conditions are proposed to control the optimization error and the statistical error. We then apply this theory to the STAR estimator with B-spline basis functions and the group-lasso penalty. To ease the presentation, we consider a three-way tensor covariate (i.e., ) in our theoretical development; the generalization to an -way tensor is straightforward.

### 4.1 A general contraction property

To bound the optimization error and statistical error of the proposed estimator, we introduce three sufficient conditions: a Lipschitz-gradient condition, a sparse strongly convex condition, and a generic statistical error condition. For the sake of brevity, we only present conditions for the update of in the main paper, and defer similar conditions for to Section 2 in the appendix.

For each vector , we divide it into equal-length segments as in Figure 3. A segment is colored if it contains at least one non-zero element, and a segment is uncolored if all of its elements are zero. We let be the indices of colored segments in and be the set of all -dimensional vectors with fewer than colored segments, for some constant . Mathematically, for a vector , denote and .

Define a sparse ball for a given constant radius . Moreover, the noisy gradient function and noiseless gradient function of empirical risk function defined in (8) of order-3 with respect to can be written as

(13)

(14)

where

**Condition (Lipschitz-Gradient).** For , the noiseless gradient function satisfies the -Lipschitz-gradient condition, and satisfies the -Lipschitz-gradient condition with high probability. That is,

with probability at least for any . Here, may depend on .
**Remark.** Condition 3 defines a variant of Lipschitz continuity for and . Note that the gradient is always taken with respect to the first argument of , and the Lipschitz continuity is with respect to the second or the third argument. Analogous Lipschitz-gradient conditions were also considered in BWY15 and hao2017simultaneous for the *population-level -function* in the EM-type update.

The next condition characterizes the curvature of the noisy gradient function in a sparse ball. It states that when the second and third arguments are fixed, is strongly convex with parameter with high probability. As shown later in Section 4.2, this condition holds for a broad family of basis functions.

**Condition (Sparse-Strong-Convexity).** For any , the loss function is sparse strongly convex in its first argument, namely

with probability at least for any . Here, is the strongly convex parameter and may depend on . Next, we present the definition of the dual norm, which is a key measure for the statistical error condition. More details on the dual norm can be found in negahban2012.

**Definition (Dual norm).** For a given inner product , the dual norm of is given by

As a concrete example, the dual of the -norm is the -norm, while the dual of the -norm is itself. Suppose is a -dimensional vector and the index set is partitioned into disjoint groups, namely . The group norm for is defined as . According to Definition 4.1, the dual of is defined as . For simplicity, we write .
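As a small numerical check of the dual pairing for the group norm (the dual of the sum of groupwise l2 norms is the maximum groupwise l2 norm), consider the following sketch with an arbitrary illustrative partition:

```python
import numpy as np

def group_norm(v, groups):
    """Group norm: sum over groups of the l2 norm of each block."""
    return sum(np.linalg.norm(v[g]) for g in groups)

def group_dual_norm(u, groups):
    """Dual of the group norm: maximum over groups of the l2 block norm."""
    return max(np.linalg.norm(u[g]) for g in groups)

rng = np.random.default_rng(3)
groups = [np.arange(0, 3), np.arange(3, 5), np.arange(5, 9)]
u, v = rng.standard_normal(9), rng.standard_normal(9)

# Hoelder-type inequality defining the dual pairing: |<u, v>| <= ||u||_dual * ||v||_group.
lhs = abs(u @ v)
rhs = group_dual_norm(u, groups) * group_norm(v, groups)
print(lhs <= rhs + 1e-12)   # True
```

The inequality follows from Cauchy-Schwarz within each block plus Hoelder across blocks, which is exactly why this dual norm controls the statistical error term.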

The generic statistical error (SE) condition guarantees that the distance between the noisy gradient and the noiseless gradient under the -norm is bounded.

**Condition (Statistical-Error).** For any , , we have, with probability at least ,

Here, is only a generic quantity and its explicit form will be derived for a specific loss function in Section 4.2.

Next we introduce two conditions, one on the penalization parameter (Condition 4.1) and one on the penalty (Condition 16). To illustrate Condition 4.1, we first introduce a quantity called the support space compatibility constant, which measures the intrinsic dimensionality of defined in (7) with respect to the penalty . Specifically, it is defined as

(15)

which is a variant of the subspace compatibility constant originally proposed by negahban2012 and Wain2014. If is chosen as a group lasso penalty, we have , where is the index set of active groups. Similar definitions of can be made accordingly.

**Condition (Penalization Parameter).** We consider an iterative tuning procedure in which the tuning parameters in are allowed to change with the iteration. In particular, we assume the tuning parameters satisfy

(16)

where are the Lipschitz-gradient parameters defined in Condition 3 and Conditions 2-2.

**Remark.** Condition 4.1 considers an iterative sequence of regularization parameters. Given reasonable initializations for , their estimation errors gradually decay as the iteration count increases, which implies that is a decreasing sequence. After sufficiently many iterations, the rate of will be bounded by the statistical error , for . This agrees with the theory of high-dimensional regularized M-estimators, in which a suitable tuning parameter should be proportional to the target estimation error (Wain2014). Such an iterative tuning procedure plays a critical role in controlling the statistical and optimization errors, and has been commonly used in other high-dimensional non-convex optimization problems (wang2014optimal; yi2015regularized).

Finally, we present a general condition on the penalty term.

**Condition (Decomposable Penalty).** Given a space , a norm-based penalty is assumed to be decomposable with respect to such that it satisfies for any and , where is the complement of .

As shown in negahban2011, a broad class of penalties satisfies the decomposable property, including the lasso, ridge, fused lasso, and group lasso penalties. The next theorem quantifies the error of the one-step update for the estimator obtained from Algorithm 1.

**Theorem (Contraction Property).** Suppose Conditions 3, 3, 3, 4.1, and 2-2 hold. Assume the updates at the -th iteration of Algorithm 1 fall into sparse balls respectively, where are some constants. Define . There exists an absolute constant such that the estimation error of the update at the -th iteration satisfies

(17)

with probability at least . Here, is the contraction parameter defined as

(18)

where is some constant.

Theorem 16 demonstrates the mechanism by which the estimation error improves in the one-step update. When the contraction parameter is strictly less than 1 (we will prove that this holds for a certain class of basis functions and penalties in the next section), the first term on the right-hand side of (17) gradually goes to zero while the second term remains stable. The contraction parameter is roughly the ratio of the Lipschitz-gradient parameter to the strongly convex parameter. Similar forms of the contraction parameter appear frequently in the literature on statistical guarantees for low/high-dimensional non-convex optimization (BWY15; yi2015regularized; hao2017simultaneous).
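The interplay in (17) can be visualized with a toy error recursion of the same shape, e_{t+1} = rho * e_t + stat, where the contraction parameter rho, statistical term stat, and initial error are hypothetical values: the optimization error decays geometrically while the iterates settle at the statistical floor stat / (1 - rho):

```python
import numpy as np

# Toy illustration of the contraction property: iterate the one-step
# error bound e_{t+1} = rho * e_t + stat with hypothetical constants.
rho, stat, e0 = 0.5, 0.05, 10.0
errors = [e0]
for _ in range(30):
    errors.append(rho * errors[-1] + stat)

# Closed form: e_t = rho**t * e0 + stat * (1 - rho**t) / (1 - rho),
# so e_t converges to the statistical floor stat / (1 - rho).
floor = stat / (1 - rho)
print(errors[-1], floor)   # the iterates settle at the floor
```

This mirrors the theorem's message: running more iterations removes the optimization error entirely, but no amount of iteration can push the error below the statistical rate.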

### 4.2 Application to STAR estimator

In this section, we apply the general contraction property in Theorem 16 to the STAR estimator with B-spline basis functions. The formal definition of the B-spline basis functions is given in Section 1 of the appendix. To ensure Conditions 3-3 and 3 are satisfied, our STAR estimator requires conditions on the nonparametric components, the distribution of the tensor covariate, and the noise distribution.

**Condition (Function Class).** Each nonparametric component in (1) is assumed to belong to the function class defined as follows:

(19)

Let be the smoothness parameter of the function class . For , there is a constant such that and . Each component of the covariate tensor has an absolutely continuous distribution, and its density function is bounded away from zero and infinity on . Condition 4.2 is classical for nonparametric additive models (stone1985additive; huang2010variable; fan2011nonparametric). Such a condition is required to optimally estimate each individual additive component in the -norm.

**Condition (Sub-Gaussian Noise).** The noise terms are i.i.d. with mean 0 and bounded variance . Moreover, is sub-Gaussian, i.e., there exists a constant such that , and is independent of the tensor covariates .

**Condition (Parameter Space).** We assume the absolute value of the maximum entry of is upper bounded by some positive constant , and the absolute value of the minimum non-zero entry of is lower bounded by some positive constant . Here, do not depend on . Moreover, we assume the CP-rank and the sparsity parameters are bounded by some constants.

**Remark.** The condition of bounded elements of the tensor coefficient appears widely in the tensor literature (anandkumar2014tensor; sun2017provable). The bounded tensor rank condition is imposed purely to simplify the proofs, and it can potentially be relaxed to allow a slowly increasing tensor rank (sun2018dynamic). The fixed sparsity assumption is also required in the vector nonparametric additive regression (huang2010variable). To relax it, meier2009high considered a diverging sparsity scenario but required a compatibility condition that is hard to verify in practice. Thus, in this paper we consider the fixed sparsity case and leave the extension to diverging sparsity as future work.

Since the penalized empirical risk minimization (8) is a highly non-convex optimization problem, we require some conditions on the initial update in Algorithm 1.

**Condition (Initialization).** The initialization of is assumed to fall into a sparse constant ball centered at , say , where are some constants that do not diverge with .

Similar initialization conditions have been widely used in tensor decomposition (sun2017provable; sun2018dynamic), tensor regression (suzuki2016minimax; sun2017store), and other non-convex optimization problems (wang2014optimal; yi2015regularized). Once the initial values fall into the sparse ball, the contraction property and the group lasso penalty ensure that the successive updates also fall into a sparse ball. Another line of work designs spectral methods to initialize certain simple non-convex optimization problems, such as matrix completion (ma2017implicit) and tensor sketching (hao2018sparse). The success of spectral methods relies heavily on a simple non-convex geometry and an explicit form of the high-order moment calculation, which is easy to achieve in previous work (ma2017implicit; hao2018sparse) under a Gaussian error assumption. However, designing a spectral method for our tensor additive regression is substantially harder, since the high-order moment calculation has no explicit form in our context. We leave the design of spectral-based initialization as future work.

Finally, we state the main theorem on the estimation error of our STAR estimator with B-spline basis functions and a group lasso penalty.

**Theorem.** Suppose Conditions 4.1, 4.2-4.2, and 4.2 hold, and consider the class of normalized B-spline basis functions defined in (A1) and the group-lasso penalty defined in (9). If one chooses the number of spline series , then with probability at least , we have

(20)

where is a contraction parameter, and is the smoothness parameter of function class in (19). Consequently, when the total number of iterations is no smaller than

and the sample size for sufficiently large , we have

with probability at least .

The non-asymptotic estimation error bound (20) consists of two parts: an optimization error incurred by the non-convexity, and a statistical error incurred by the observation noise and the spline approximation. The optimization error decays geometrically with the iteration number , while the statistical error remains the same as grows. When the tensor covariate is of order one, i.e., a vector covariate, the overall problem reduces to the classical vector nonparametric additive model. In that case, the optimization error vanishes since the optimization is convex, and the statistical error term matches the state-of-the-art rate in huang2010variable.

Lastly, let us define