Low-Rank Approximation and Completion of Positive TensorsThis work was supported in part by NSF grant CMMI-1450963.

# Low-Rank Approximation and Completion of Positive Tensors††thanks: This work was supported in part by NSF grant CMMI-1450963.

Anil Aswani Industrial Engineering and Operations Research, University of California, Berkeley, CA 94720 (aaswani@berkeley.edu).
###### Abstract

Unlike the matrix case, computing low-rank approximations of tensors is NP-hard and numerically ill-posed in general. Even the best rank-1 approximation of a tensor is NP-hard. In this paper, we use convex optimization to develop polynomial-time algorithms for low-rank approximation and completion of positive tensors. Our approach is to use algebraic topology to define a new (numerically well-posed) decomposition for positive tensors, which we show is equivalent to the standard tensor decomposition in important cases. Though computing this decomposition is a nonconvex optimization problem, we prove it can be exactly reformulated as a convex optimization problem. This allows us to construct polynomial-time randomized algorithms for computing this decomposition and for solving low-rank tensor approximation problems. Among the consequences is that best rank-1 approximations of positive tensors can be computed in polynomial time. Our framework is next extended to the tensor completion problem, where noisy entries of a tensor are observed and then used to estimate missing entries. We provide a polynomial-time algorithm that for specific cases requires a polynomial (in tensor order) number of measurements, in contrast to existing approaches that require an exponential number of measurements. These algorithms are extended to exploit sparsity in the tensor to reduce the number of measurements needed. We conclude by providing a novel interpretation of statistical regression problems with categorical variables as tensor completion problems, and numerical examples with synthetic data and data from a bioengineered metabolic network show the improved performance of our approach on this problem.

t
\slugger

sioptxxxxxxxx–x

ensor completion, tensor approximation, categorical regression

{AMS}

90C25, 62F12, 05E45, 60B20

## 1 Introduction

Tensors generalize matrices by describing a multidimensional array of numbers. More formally, a tensor of order is given by , where is the dimension of the tensor in the -th index, for . When we would like to refer to a specific entry in the tensor, we use the notation , where , denotes the value of the -th index, and . Also let . The reasons for choosing this notation will become more clear when discussing our novel interpretation of statistical regression with categorical variables as tensor completion.

The similarity between tensors and matrices is misleading because many problems that are routine and polynomial-time computable for matrices are NP-hard for tensors. For instance, it is NP-hard to compute the rank of a tensor [27], which is defined as the minimal number of rank-1 components needed to represent the tensor: , where is the tensor product [33, 27]. Tensor analogs of the matrix singular value decomposition (e.g., candecomp/parafac or cp) are also NP-hard to compute [27]. Furthermore, determining the best low-rank approximations for tensors is an ill-posed problem in general [16], and computing the best rank-1 approximation is NP-hard in general [27].

In this paper, we attack the computational challenges posed by tensor problems by showing that positive tensors are amenable to polynomial-time algorithms with strong guarantees. A new tensor decomposition called a hierarchical decomposition is defined in §2 using a structure from algebraic topology. This decomposition is shown to exist, be numerically well-posed, and coincide with the usual tensor cp decomposition [31, 27] in specific cases. Section 3 develops a randomized algorithm to compute the hierarchical decomposition in polynomial time that depends on the degrees-of-freedom of the tensor rather than on the total number of tensor entries, which can be exponentially larger than the degrees-of-freedom. This algorithm can compute a best rank-1 approximation of positive tensors in polynomial time.

Our approach differs from existing methods in a number of ways. An iterative descent algorithm was proposed for decomposition of positive tensors in [58], but no convergence guarantees were provided. Matrix nuclear norm approaches are popular for tensor completion [56, 55, 22, 37, 42, 62], though these approaches have exponentially slow statistical convergence. Using the tensor nuclear norm (which is NP-hard to compute [21]) also gives exponentially slow statistical convergence [60, 61]. For orthogonal third-order tensors (i.e., ), alternating minimization provides polynomial statistical convergence [28]; unfortunately, this guarantee requires using a large number of randomized initializations, and so the results cannot be naturally generalized to higher order tensors without losing polynomial-time computability. Recently, Lassere hierarchy approaches have been proposed for best rank-1 approximations [46] and tensor completion [50]. These have found success on specific numerical examples, but conditions to guarantee global optimality are currently unavailable.

After presenting our algorithm for computing a hierarchical decomposition, §4 extends this framework to the problem of tensor completion [56, 55, 22, 37, 42, 62, 40], in which a subset of tensors entries are observed and then used to estimate missing entries. We provide an algorithm that for specific cases requires a polynomial (in tensor order) number of measurements, which is much lower than the exponential number of measurements required by tensor completion methods using the matrix nuclear norm [56, 55, 22, 37, 42, 62]. In the case of a rank-1 tensor, the number of needed measurements of our approach , for any , is essentially a quadratic factor away from the information-theoretic lower bound of . Section 5 shows how the algorithms can be improved to exploit sparsity in the tensor. Numerical examples with synthetic data in §6 show our approach outperforms other tensor completion algorithms. We conclude by providing in §7 a novel interpretation of statistical regression problems with categorical variables as tensor completion problems. Data from a bioengineered metabolic network is used to show the improved performance of our approach for categorical regression.

## 2 Hierarchical Decomposition of Positive Tensors

A structure from algebraic topology [17, 18] is used to parametrize our new decomposition. Following the definition of [18]: A simplicial complex is a set such that and implies that . The elements of are called faces of and the inclusion-maximal faces are the facets of . We will assume the facets have been arbitrarily assigned an order, so that we can represent the simplicial complex as , where is the number of facets. We will drop the argument notation when clear from the context. Roughly speaking, a simplicial complex is a graph with higher-order connections between vertices. Whereas edges in a graph can only connect two vertices, facets in a simplicial complex can simultaneously connect an arbitrary number of vertices. An example of a simplicial complex with zero-, one-, two-, and three-dimensional facets is shown in Figure 1.

This section begins with the definition of a hierarchical decomposition for positive tensors, and this decomposition is parametrized by a simplicial complex. Hierarchical decompositions are shown to always exist and be numerically well-posed for positive tensors. Next, we define an important special case of the hierarchical decomposition, which we call a partition decomposition. This is used to provide instances in which these decompositions exactly coincide with the typical tensor cp decomposition. The following notation (adapted from [18]) will be needed to write subindices. Recall the set notation , and define . If and , then , and this vector has the state space . We use the notation and to reduce the number of indices in our equations.

### 2.1 Definition

Motivated by hierarchical log-linear models used in statistics to construct hypothesis tests for contingency tables [18], we define a hierarchical decomposition of a positive tensor to be

 (1) ψx=m∏k=1θ(k)Xk

where is a simplicial complex with , and are constants indexed by the different values of . When is such that (1) is satisfied, we say is correct; on the other hand, if is such that (1) does not hold, then we say is incorrect. To simplify notation, we drop the superscript in and write this as when clear from the context. Also, refers to the set of all parameters.

### 2.2 Existence and Representational Complexity

Existence (and well-posedness) of the hierarchical decomposition of a positive tensor can be shown under a mild boundedness assumption:

A1. The tensor is bounded by some constant .

Our results generalize to the case , where ; we keep the above assumption to simplify stating the results. Relaxing the lower bound to zero is more delicate: In practice, we can choose sufficiently large such that the lower bound is arbitrarily close to zero. In theory, relaxing the lower bound to exactly zero requires additional analysis because the loss function we will use, though continuously differentiable, does not have a bounded derivative at zero.

{proposition}

If satisfies , then a hierarchical decomposition of with a correct exists.

{proof}

The result follows by choosing a simplicial complex: , where , and then setting .

Note there is a lack of uniqueness of the parametrizing because we can always choose a simplicial complex with a single facet, as in the above proof, to specify a valid hierarchical decomposition. Because of this nonuniqueness, it is useful to define a notation of complexity. We define the effective dimension of a hierarchical decomposition for a specific choice of to be . The effective dimension is the number of coefficients used in the hierarchical decomposition of the tensor. In many cases, a tensor of low rank can be represented by a hierarchical decomposition with low effective dimension. Specific examples are given in the next subsection. Moreover, a counting argument implies that the tensor rank must be upper bounded by the effective dimension: . It is for these reasons we use low effective dimension as a surrogate for low tensor rank when we study the problems of tensor approximation and completion.

### 2.3 Numerical Well-Posedness

Beyond existence, hierarchical decompositions are also well-posed. One of the reasons that computing the best low-rank approximation of a tensor is an ill-posed problem in general [16] is that though the entries of the tensor might be bounded, the coefficients of the tensor decomposition can be unbounded. (This can occur because the unbounded nature of the coefficients cancel each other out.) This leads to unique phenomenon such as having a sequence of tensors of rank two that converge to a tensor of rank three [16, 33]. Fortunately, the situation for nonnegative tensors is better because the approximation problem is well-posed [36, 49]. As we show with the next proposition, the hierarchical decomposition is also well-posed in a particular way that will be important for formulating optimization problems.

{proposition}

If satisfies and is correct, then there exists such that , for all and .

{proof}

We successively construct a set of parameters and show these satisfy the proposition. One set of parameters can be defined by performing the following steps:

1. Set ;

2. For :

1. Set ;

2. While :

1. Select an arbitrary element ;

2. Set for all ;

3. Set , for all such that ;

4. Set ;

5. Set .

Observe lists the subset of indices of for which the decomposition is undefined, and lists the subset of indices of for which the decomposition is undefined. The intuition behind this algorithm is we successively specify the parameters of the decomposition until there are no indices for which the decomposition is undefined. The inner loop ensures becomes empty, and the set becomes empty at the end of the algorithm because when .

Next, note that the parameters trivially satisfy since , and so we only need to show that the remaining parameters satisfy the bounds of the proposition. For any , suppose that . If this condition holds, then two consequences follow from step 2.b.iii: (i) , and (ii) . In fact, for we have that , since . This inductively shows that the bounds of the proposition hold for all the remaining parameters.

This result implies that the parameters of the decomposition are bounded by an amount that is independent of and as long as the individual entries of the tensor are bounded as in A1. This will allow us to define constraints in our optimization problems that ensure the numerical scaling of different parameters is controlled. For numerical reasons, we would like to avoid scalings in which some parameters are very large and other parameters are very small. This proposition allows us to define constraints that control the scaling.

### 2.4 Partition Decomposition

An important special case of a hierarchical decomposition is when the facets of the simplicial complex are a partition of the set . We refer to this instance as a partition decomposition. The partition decomposition can be written as , where the middle equation is the partition decomposition, is the tensor product, and is an appropriately-defined permutation of the indices. The partition decomposition is of note because it can be written as the product of tensors with smaller order than , and because it exactly coincides in specific cases with a low-rank cp decomposition of tensors. The cp decomposition is defined as , where and is the tensor rank [31], and it is a typical tensor decomposition and an analog of the matrix singular value decomposition [31, 27].

The simplest case in which the partition decomposition coincides with the cp decomposition is when the partition is given by . In this case, both the partition and cp decompositions represent a rank-1 tensor: , where the are vectors, the middle equation is the partition decomposition, and the right equation is the cp decomposition. The decompositions coincide in this case because they are equivalent.

Another instance where the partition and cp decompositions coincide is when the are either vectors or matrices of full rank. Assume the partitions are arranged so are matrices and are vectors. Also, let a matrix decomposition of be given by , where is the matrix rank of , and are vectors of appropriate dimensions. Then we have , where the middle and right equations are the partition and cp decompositions, respectively. The decompositions coincide in this case because the partition decomposition can be used to compute the cp decomposition by computing the matrix singular value decomposition of ; similarly, the cp decomposition can be used to compute the partition decomposition by computing .

## 3 Randomized Algorithm for Decompositions and Approximations

The algorithm in Proposition 2.3 implies a hierarchical decomposition can be computed in steps that are polynomial in the number of tensor entries. However, this computational complexity can be improved with a randomized algorithm that will only need a polynomial in effective dimension number of arithmetic calculations. This can be a significant improvement because the effective dimension can be much smaller than the number of tensor entries: For instance, a rank-1 tensor has effective dimension while it has entries.

Our approach to developing a randomized algorithm for computing a hierarchical decomposition is to randomly sample entries of the tensor. With enough samples, the decomposition will have low error with high probability. In anticipation of generalizing to the tensor completion problem, we allow the sampled entries to be measured with noise. This noise could be deterministically interpreted as the approximation error of a hierarchical decomposition, meaning the hierarchical decomposition for a specified that is closest (as measured by some loss function) to the tensor . As a result, the statistical consequences have deterministic interpretations.

This section begins by describing the noise and measurement model for sampling entries of the tensor, and then attention turns towards choosing the loss function that will be used to measure the discrepancy between the computed decomposition and the sampled entries. Specific computational and statistical challenges with choosing the loss function are discussed, and this precludes the use of a squared loss function or of taking the logarithm of the data. We propose an alternative loss function: This loss has the same minimizer in specific cases as that of the squared loss function, and we show it is majorized and minorized by the squared loss function. Furthermore, we show this loss function can be minimized in polynomial time by exactly reformulating the optimization problem as a convex program.

Next, we use this reformulation to show an equivalence result between our loss function and the decomposition error as measured by the squared loss. This equivalence result allows us to study approximation properties using our loss function and then apply the approximation properties to the squared loss. We use the stochastic processes theory of Rademacher complexity [5, 29, 32, 7] to bound the approximation error induced by computing a decomposition using a sample of tensor entries (rather than using all the tensor entries). And the section concludes by presenting a randomized algorithm, which uses the alternative loss function, and proving it has polynomial-time complexity in terms of effective dimension .

### 3.1 Noise and Measurement Model

Note we use the indexing notation to denote the -th measurement. For a randomly chosen set of indices , suppose we make a noisy measurement of the corresponding tensor entry , where is noise. A multiplicative noise model, as opposed to an additive noise model, is used here because this allows us to define a statistical model where measurements are positive-valued while the noise is independent of . However, our results also apply to the case of additive zero-mean noise with the only changes being in the constants of the resulting bounds. Rather than complicating the presentation, we focus on the multiplicative noise model. We make the following assumption about the noise:

A2. The noise are iid random variables with a mean of zero , and they are bounded by some constant .

The bounds on noise could be relaxed to be unbounded in both directions (i.e., positive and negative). This is appealing because many interesting noise distributions satisfying the property are sub-gamma distributions [7]. We do not consider these cases because their consideration does not provide additional theoretical insights; the main difference is slower rates of convergence for heavier-tailed distributions. And so for simplicity, we assume the above boundedness condition; however, we will use the gamma distribution (which is unbounded) to generate noise for the synthetic data in our numerical examples.

Another note is the reason for choosing a model with is so holds. This is a mild assumption because we are interested in computing a decomposition that best approximates ; the themselves do not have any particular meaning in our decomposition because they are nonunique up to a scaling factor.

We also make an assumption about the measurements available. For now, we will not impose any conditions on the distribution, except for requiring iid measurements.

A3. The data are iid measurements , for , where is the number of measurements.

### 3.2 Challenges with Choosing Loss Function

The usual loss function is the squared loss , and when is correct the minimizer is given by such that [4]. But numerically minimizing this loss is difficult because of nonconvexity of the squared loss in the parameters . One common approach is to use a heuristic such as alternating least squares (ALS), but this only converges to local optimum [31].

Given the structure of the hierarchical decomposition, it is tempting to compute the decomposition by minimizing , because this converts the optimization into a linear least squares problem. However, this a problematic choice because the approach in [4] can be used to show that the minimizer of the above loss function is . This is nonideal because the solution will be incorrect by the amount . Jensen’s inequality for concave functions implies ; so the general case is the nuisance parameter will be nonpositive. Taking the exponent does not resolve the problem because we still have a multiplicative error of .

### 3.3 Alternative Loss Function

So if we do not a priori know the value of the nuisance parameter , then we could devise a two step procedure that consistently estimates this nuisance parameter and then removes it from the least squares solution, in order to compute a best hierarchical decomposition of . We can eliminate the need for considering this nuisance parameter by defining an alternative loss function. This choice will be subsequently justified by showing that it displays faithful error properties and is amenable to polynomial-time computation.

We use the following loss function

 (2)

and the best approximate hierarchical decomposition

 (3) ^Θ=argmin{^R(Θ) ∣∣ Θ∈Ω}.

is defined to be the minimizer of the empirical loss function

 (4) ^R(Θ)=1n∑ni=1(−y⟨i⟩⋅∑mk=1logθXk⟨i⟩+∏mk=1θXk⟨i⟩)

subject to the constraint set

 (5) Ω={Θ:M−1≤∏mk=1θXk≤M, M−2≤θXk≤M2, ∀x∈R}.

We justify this choice by first showing an equivalence to the usual squared loss function. Our second justification is it enables polynomial-time computation for specific approximation problems for positive tensors that are NP-hard in the case of a general (i.e., not necessarily positive) tensor, and this is shown by rewriting (3) as a convex optimization problem with a polynomial in and number of constraints.

### 3.4 Error Properties of Loss Function

We show the loss function (2) has favorable error properties. This loss function resembles the negative log-likelihood for a Poisson distribution: , where is the rate parameter of the distribution, and this is not surprising because this likelihood can be used to fit hierarchical log-linear models to contingency tables [18]. Furthermore, maximum likelihood decomposition of nonnegative tensors of count data using the Poisson distribution has been previously considered [13]. However, this is the wrong interpretation for our case because can take continuous (non-integer) values and should not be interpreted as counts in general.

A better interpretation for the loss function (2) is as a Bregman divergence [4], or more specifically a generalized I-divergence (which is a generalization of the Kullback-Leibler divergence) [4, 39]. This is a more natural interpretation because of the following proposition that shows minimizing either our loss or the squared loss recovers the same solution when is correct.

{proposition}

[Banerjee, et al., 2005 [4]] If , hold and is correct, then . Moreover, the solution has the property .

A further justification for using the loss (2) is that it is equivalent to the squared loss in the sense that it both majorizes and minorizes the squared loss.

{proposition}

Under , and for any , the loss function majorizes and minorizes the squared loss function , meaning , where constants and depend on .

{proof}

Define , and consider the function over the domain . This function is strongly convex for , and so we have . Choosing gives . The lower bound follows by setting and taking the expectation of both sides.

The upper bound is shown using the mean-value form of Taylor’s theorem, which states that for any : , for some between and . As a result, we have . Choosing gives . The result follows by setting and taking the expectation of both sides.

### 3.5 Computational Properties

An equivalent reformulation of (3) can be defined using the following reparametrization of the loss function

 (6) R(U)=E(−y⋅∑mk=1uXk+exp(∑mk=1uXk)),

and the relationship between parametrizations is that . The loss function is convex in , unlike the original parametrization (3) which is nonconvex in . Moreover, the number of constraints in can be reduced to a polynomial in number of constraints by using a linear program (LP) lift [59]. Consider the set

 (7) Φ={U:∃ηk,νk s.t. ηk≤uXk≤νk, −2logM≤ηk, νk≤2logM,−logM≤∑mk=1ηk, ∑mk=1νk≤logM, ∀x∈R}.

We use this to define our reparametrized best approximate hierarchical decomposition as the minimizer to the following convex optimization problem

 (8) ^U=argmin{^R(U) ∣∣ U∈Φ},

where the reparametrized empirical loss function is

 (9)

The following proposition shows that (8) is equivalent to (3).

{proposition}

Under and for any , the solution to is equivalent to the solution of with the invertible (under ) mapping .

{proof}

We have already argued above that and are identical under , and so to prove the first part we have to show is equivalent to when using the same mapping. Observe that for points belonging to , we must have and . Combining this with the other inequalities defining leads to and , which is the same (under the equivalence) as from . A similar argument gives that and from is the same as from , under the equivalence. Because the objective and constraints of (8) and (3) are the same when equating , we have that the solution to (8) is the same as the solution to (3).

We also have the following result about the polynomial-time solvability of (3). It is proved by considering the convex reparametrization (8) and explicitly defining a barrier function for a path-following interior-point method, and then using the methods of [44, 43] to conduct a complexity analysis for an interior-point method with this barrier function. The result is stated in terms of the complexity of computing an -solution, which is a solution of an optimization problem , such that (i) , (ii) for all , and (iii) .

{proposition}

Under and for any , an -solution to the optimization problem can calculated with arithmetic steps, which is polynomial time in .

{proof}

We suitably modify the proof in [44] for the polynomial-time solvability of geometric programs: The first step is to reformulate the convex program (8) as following the convex program

 (10) min 1nn∑i=1(−y⟨i⟩⋅∑mk=1uXk⟨i⟩+ti) s.t. exp(∑mk=1uXk⟨i⟩)−ti≤0,∀i∈[n] ηk−uXk≤0, uXk−νk≤0,∀k∈[m],Xk∈Rk −logM−∑mk=1ηk≤0, ∑mk=1νk−logM≤0 (u,η,ν,t)∈G

where is a bounded convex set. Note is the symmetry center of , and so the asymmetry coefficient (see [44, 52]) of with respect to is . From Propositions 5.1.3 and 5.4.1 of [44], it follows that

 (11)

is a -self-concordant barrier for .

The next step is to bound the objective and constraints of (10). Note imply the absolute value of the objective is upper bounded by . Similarly, imply the absolute value of the left hand side of the constraints are upper bounded by , , , , , respectively. Consequently, an upper bound on these upper bounds is .

The third step is to identify barrier functions for the epigraphs of each constraint in (10). Proposition 5.4.1 of [44] states is a 1-self-concordant barrier for the constraint . Similarly, Proposition 5.3.3 of [44] states is a 2-self-concordant barrier for the constraint . Consequently, the following

 (12) −log(3Vt/ϵ−V−1n∑ni=1(−y⟨i⟩⋅∑mk=1uXk⟨i⟩+ti))+−∑ni=1{log(log(ti+t)−∑mk=1uXk⟨i⟩)+log(ti+t)}+−∑mk=1∑Xk∈Rk{log(−ηk+uXk)+log(−uXk+νk)}+−log(logM+∑mk=1ηk)−log(logM−∑mk=1νk)+F

is a -self-concordant barrier function by Proposition 5.1.3 of [44]. Note we have the bound from the definition of . From the results of §6.1 of [44], an -solution to (8) can be found in steps of the path-following algorithm. The Newton system for one step is assembled in arithmetic steps and can be solved in arithmetic steps. Consequently, an -solution to (8) can be found in arithmetic steps. The proof concludes by noting Proposition 3.5 implies an -solution to (3) can be calculated by applying the transformation to the -solution to (8).

This result immediately implies that the best rank-1 approximation of a positive tensor can be computed in polynomial time, which is in contrast to the general case where computing the best rank-1 approximation is NP-hard [27]. The approximation problem becomes easier when we restrict our focus to positive tensors.

{corollary}

The best rank-1 approximation, under the loss function and satisfying , of a tensor can be computed in polynomial time with a number of arithmetic steps that is polynomial in .

{proof}

The best rank-1 approximation corresponds to a partition decomposition with , and so the result follows from Proposition 3.5.

### 3.6 Bound on Squared Error in Terms of Loss Function

Define the oracle parameters to be any . Below, we provide a relationship between a squared error function involving and the loss function (3). This relationship will serve as useful machinery for proving subsequent results.

{proposition}

Under , and for any , we have for any that .

{proof}

We will use the equivalent (by Proposition 3.5) convex reparameterization in to show the necessary bound. The first-order optimality condition [53] for the reparametrized optimization problem (8) is

 (13) D(U∗,U)=∇R(U∗)⋅(U−U∗)≥0,

for all . Since the probability space of is finite, we can interchange the order of differentiation and integration as shown below

 (14) ∂XjR(U) =∑x∈Rfx⋅(∂Xj(−ψx⋅∑mk=1uXk+exp(∑mk=1uXk))) (15) =∑x∈Rfx⋅(−ψx+exp(∑mk=1uXk))⋅\mathbbm1Xj,

where . Combining (13) and (15) leads to

 (16) D(U∗,U) =∑Xj∑x∈Rfx⋅(−ψx+exp(∑mk=1u∗Xk))⋅\mathbbm1Xj⋅(uXj−u∗Xj) (17) =E((−ψx+exp(∑mk=1u∗Xk))⋅(∑mk=1uXk−∑mk=1u∗Xk)).

Next, consider . Since for all , this function is strongly convex [8] and satisfies , for all . Applying this inequality to gives that for any , , where we have used (17) to simplify the expression. Since from (13), we have that for any , . Because is Lipschitz on bounded domains (i.e., , for all ), we have that for any , . Inverting the mapping , which is possible because of A1, gives .

### 3.7 Risk Consistency via Rademacher Complexity

Having shown that the loss function (2) has promising properties, we next identify sufficient conditions for risk consistency [5, 26, 29, 32]. Our approach is to interpret the problem as a high-dimensional (though lower-dimensional than if we had not taken the low-rank tensor structure into consideration) linear regression under a Lipschitz loss function. The linear regression will not be with respect to the indices , but will instead be defined using indicator functions. With this interpretation, we will use Rademacher averages [5, 29, 32, 7] to bound the complexity of our model (2).

{proposition}

Under and for any , we have

 (18) P(supΘ∈Ω∣∣^R(Θ)−R(Θ)∣∣

where constants depend on .

{proof}

The proof proceeds similarly to [5, 32] by bounding the deviation of the supremum from the expectation of the supremum, and it will be easier to work in the reparametrized space. First, note that satisfies the bounded deviation condition with constant [7] because of A1,A2. As a result, McDiarmid’s inequality [7] gives

 (19) P(supU∈Φ∣∣Δ(U)∣∣−E(supU∈Φ∣∣Δ(U)∣∣)>t)≤exp(−2nt2(μMlogM+M)2),

where . And so the result follows if we can bound the quantity . Because the loss function (for a fixed value of and for ) is Lipschitz with respect to with Lipschitz constant , structural results [34, 5] give that

 (20) E(supU∈Φ∣∣Δ(U)∣∣)≤4L⋅R(FW),

where is the Rademacher complexity for an appropriate linear function class. In particular, we can define our empirical loss by taking the sample average of composed with the linear model . We should interpret the terms as pseudo-predictors, and the are still the parameters. The key observation is that if we define to be the vector of pseudo-predictors, then in fact , , and . Recall that is defined so that . And so results from [29] imply that . The result follows by combining this with (19) and (20).

The above result can be used to show risk consistency of the -solution to the best approximate hierarchical decomposition problem (3):

{theorem}

Under and for any , with probability at least we have , where constant depends on .

{proof}

The proof follows that in [26] with modifications to extend the argument for -solutions. The triangle inequality implies , and so we need to bound these two terms. The first term is bounded by Proposition 3.7, and so we only need to focus on the second term . Because the quantity is an -solution to an optimization problem with objective function , we have . Similarly, because is the minimizer of , we have . The result follows from combining the above with Proposition 3.7.

### 3.8 Polynomial-Time Hierarchical Decompositions and Approximations

We are now in a position to provide a randomized algorithm that can compute a hierarchical decomposition with time that is polynomial in , as opposed to the algorithm given in the proof of Proposition 2.3 that is polynomial in . Let be a parameter that controls the approximation accuracy of the decomposition. Then given any , the algorithm is as follows:

1. Set ;

2. Sample indices and record the corresponding tensor entries , for ;

3. Compute by solving (3);

4. Compute by using the mapping .

We will use Proposition 3.6 and Theorem 3.7 to reason about errors measured using the squared loss function. Recall that controls the approximation accuracy of the decomposition, and the value controls the accuracy of the optimization solution.

{theorem}

Suppose hold, is correct for , and the indices are sampled uniformly from . Then with probability at least the above algorithm computes a hierarchical decomposition with average approximation error

 (21) (∏ri)−1∑x∈X((ψx−∏mk=1^θϵXk)2)≤2M3(√mδlog(ρ/δ)+ϵ)

and has a polynomial-time arithmetic cost , where constant depends on .

{proof}

We can replace with because of Proposition 3.4. Combining Theorem 3.7 with Proposition 3.6 gives with probability at least . When is uniformly sampled, this expectation can be written as