Tensor Completion via Tensor Networks with a Tucker Wrapper

# Tensor Completion via Tensor Networks with a Tucker Wrapper

## Abstract

In recent years, low-rank tensor completion (LRTC) has received considerable attention due to its applications in image/video inpainting, hyperspectral data recovery, etc. With different notions of tensor rank (e.g., CP, Tucker, tensor train/ring, etc.), various optimization based numerical methods are proposed to LRTC. However, tensor network based methods have not been proposed yet. In this paper, we propose to solve LRTC via tensor networks with a Tucker wrapper. Here by “Tucker wrapper” we mean that the outermost factor matrices of the tensor network are all orthonormal. We formulate LRTC as a problem of solving a system of nonlinear equations, rather than a constrained optimization problem. A two-level alternative least square method is then employed to update the unknown factors. The computation of the method is dominated by tensor matrix multiplications and can be efficiently performed. Also, under proper assumptions, it is shown that with high probability, the method converges to the exact solution at a linear rate. Numerical simulations show that the proposed algorithm is comparable with state-of-the-art methods.

## 1 Introduction

Tensors are multi-dimensional arrays, which are generalizations of vectors and matrices. Tensors are natural tools for the representation of high dimensional data. For example, EEG signal is a third tensor (time frequency electrodes); a color video is a fourth-order tensor (width height 3 time). Tensors and their decompositions nowadays become increasingly popular and become fundamental tools to deal with high dimensional data. We refer the readers to [2, 22, 9, 31, 32] for tensors, their decompositions and applications.

Due to the data acquisition process and/or outliers, values can be missing in data. People have great interests in inferring the missing values (e.g., recommender system). However, there are infinitely many ways to fill in the missing values without further assumptions. It is commonly assumed that the high dimensional data lie in a low dimensional manifold. (For example, in a recommendation system, it is commonly believed that users’ behaviors are dictated by a few common factors.) Upon such an assumption, people may learn the low dimensional manifold from the observed data, then infer the missing values. In current literature for tensor completion, the low dimensional manifold is represented by a “low rank” tensor decomposition. Tensor rank differs from matrix rank dramatically (e.g., a real-valued tensor may have different tensor ranks over and ; the best low rank approximation of a high order tensor may not exist), and it’s the cornerstone of all methods for LRTC (low-rank tensor completion).

Mathematically, the LRTC problem can be formulated as the following optimization problem:

 minXrank∗(X),subject toΠΩ(X)=ΠΩ(T), (1)

where denotes a specific type of tensor rank, stores the indices of the observed entries, and picks the entries of a tensor with entries’ indices in . With different notations of tensor rank, various methods are proposed to solve LRTC. First, tensor has a CP decomposition (CPD) [7, 13, 15], and the tensor CP rank. However, the determination of the CP rank is NP-hard [14]. Thus in practice the CP rank is usually treated as a parameter that can be tuned. Several CP rank based methods are proposed in last two decades, e.g., INDAFAC [35], CP-WOPT [1], BPTF [42], STC [24]. Second, tensors has Tucker decomposition/high-order SVD (HOSVD) [38, 39] and the Tucker/multi-linear rank. Based on such a decomposition, many numerical methods are proposed. To name a few, pTucker [8], MRTF [19], geomCG [23], Tmac [43], FaLRTC/HaLRTC [27], Square Deal [28]. Tensors also have other decompositions and related rank definitions, e.g., tensor train (TT) [29] and TT rank [16], tensor ring (TR) and TR rank [49], t-SVD and tubal-rank [20], the related methods includes [4, 11, 12, 21, 44, 45, 48], etc. Among the various numerical methods for LRTC, some of them have theoretical guarantees for exact recovery under proper assumptions, e.g., Jain and Oh [18], Liu and Moitra [26], Yuan and Zhang [46], Mu el al. [28], Xia and Yuan [41], Zhang and Aeron [47]. We refer the readers to a recent survey [34] for a comprehensive overview of LRTC.

To the best of the authors’ knowledge, existing (popular) methods for LRTC are all optimization based. And perhaps, due to the difficulty for finding an appropriate for tensor networks (TN), (which plays the role of the nuclear norm for matrix), optimization based methods have not been proposed for LRTC based on TN. In this paper, we adopt a TN model with a Tucker wrapper, or equivalently, a Tucker/HOSVD model with the core tensor having a TN structure, see Figure 1 for illustrations, where the tensor diagram notation [5] is adopted. Unfamiliar readers may refer to Figure 2 first. What’s more, we formulate the LRTC problem as a problem solving a system of nonlinear equations (SNLE), rather than a constrained optimization problem as in (1). Then we propose a two-level alternative least square method to solve the SNLE. Under a “low rank” assumption, that is, the outermost factor matrices are low rank and the core tensor can be represented by a TN with a small number of parameters, we show that with high probability, the method converges to the exact solution at a linear rate. Finally, numerical simulations show the merits of the method.

The rest of this paper is organized as follows. In Section 2, we present some preliminary results. In Section 3, we formulate the LRTC problem as a problem of solving a system of nonlinear equations and present an algorithm to solve it. The convergence analysis of the algorithm is then presented in Section 4. Numerical simulations are provided in Section 5. Concluding remarks are given in Section 6.

Notations  In this paper, we use lowercase letters to denote scalars (e.g., ), boldface lowercase letters to denote column vectors (e.g., ), boldface uppercase letters to denote matrices (e.g., ), and boldface calligraphic letters to denote tensors (e.g., ). The symbol denotes the Kronecker product. The operation denotes the vectorization of the matrix formed by stacking the columns of into a single column vector. The identity matrix of order is denoted by . For a (rectangular) matrix , its singular values are denoted by , and is usually denoted by . The rank of is denoted by . The 2-norm and Frobenius norm are denoted by and , respectively. The range space of , which is the subspace spanned by the column vectors of , is denoted by .

## 2 Preliminary

In this section, we present some notations and preliminary results for facilitating of our following discussions.

Canonical Angles  Let be two -dimensional subspaces of . Let be the orthonormal basis matrices of and , respectively, i.e.,

 R(X)=X, XTX=Ik,  and  R(Y)=Y, YTY=Ik.

Denote for the singular values of in ascending order, i.e., . The canonical angles between and are defined by

 0≤θj(X,Y):=arccosωj≤π2,for 1≤j≤k.

They are in descending order, i.e., . Set

 Θ(X,Y)=diag(θ1(X,Y),…,θk(X,Y)).

Notice that if , the canonical angle is nothing but the angle between two vectors. In what follows, we sometimes place a vector or matrix in one or both arguments of and with the meaning that it is about the subspace spanned by the vector or the column vectors of the matrix argument.

Modal Unfolding  Given a tensor , its mode- unfolding is an -by- matrix, its columns are the mode- fibers of , denoted by .

Modal Product  Given a tensor and a matrix , the mode- product of and is an -by--by--by--by--by--by- tensor, denoted by . Let . The mode- product can be defined via modal unfolding as .

The operation is usually denoted as . Furthermore, in such case, it holds the following important equality:

 S(n)=AnT(n)(AN⊗…An+1⊗An⊗⋯⊗A1)T.

High Order SVD (HOSVD)  Given an th tensor . Let for be the SVDs of the modal unfoldings of . Then the HOSVD of can be given by

 T=S×1U1⋯×NUN=\llbracketS;U1,…,UN\rrbracket,

where is the core tensor. Furthermore, if the SVDs of ’s are economic (zero singular values of are removed, the corresponding left and right singular vectors are also removed from the column vectors of and , respectively), then will be -by--by-, where . In such case, the HOSVD will be referred to as an economic HOSVD. The vector is the multi-linear rank of , denoted by . In addition, if small singular values of are also removed, then one would obtain the truncated HOSVD.

Tucker Decomposition  Given an th tensor and a vector with inequality in at least one component. The Tucker decomposition tries to find a tensor such that

 minX∥T−X∥F,s.t.rankn(X)=r. (2)

The truncated HOSVD does not solve (2), but provides a good approximation.

Tensor Network and Graph  A tensor network aims to represent a high order tensor into a set of lower order (usually 2 or 3) tensors, which are connected sparsely. In such a way, the curse of dimensionality can be greatly alleviated or even avoided. Tensor diagram notation is a simple yet effective way to represent tensor networks, in which a node represents a tensor, an edge between two nodes indicates a contraction between the two connected node tensors in the associated pair of modes, each outgoing edge represents a mode, and the weight above it indicates the size of the mode. For example, in Figure 2, the node with none/one/two/three weighted edges stands for a scalar/a length vector/an -by- matrix/an -by--by- tensor, two nodes connected by an edge with weight stands for the inner product of two length vectors, two nodes with one outgoing edge each and one common edge stand for the multiplication between an -by- matrix and an -by- matrix.

Simply speaking, a graph is a structure that consists of nodes that may or may not be connected with one another. A graph is undirected if the edge is undirected. A graph is weighted if each of its edges is assigned with a number (weight).

Now we may embed a tensor into a weighted and undirected graph G as follows: for each mode, pick a node from G and assign the node an outgoing edge with weight . Denote the resultant graph-like structure as . We can uniquely construct a TN for a tensor from since is essentially a tensor diagram. So, for any tensor with a TN decomposition, we can rewrite it as:

 T≜T(G+(w,d),B) (3)

where is a tensor diagram constructed from a graph G, is the weight vector for G (the edges of G need to be numbered), is the weight vector of all outgoing edges, i.e., the dimension of , is the collection of the node tensors in G. For example, in Figure 3, is a tensor of dimension , , , consists of 5 matrices and 2 order-3 tensors. Note here that modes 3 and 4 are assigned to the same node.

## 3 Algorithm

In this section, we first motivate the method in Section 3.1, then summarize it in Section 3.2.

### 3.1 Solving a System of Nonlinear Equations

Let have the following TN decomposition with a Tucker wrapper:

 X=\llbracketG(G+(w,d),B);A(1),…,A(N)\rrbracket, (4)

where is the core tensor with , and are similarly defined as in (3), has orthonormal columns for all . The reason why we add a Tucker wrapper is that the size of the core tensor is smaller (in practice much smaller) than that of , by instinct, the structure of TNs for should be simpler than that for , and the parameters of TNs for should be much less than that for .

Given and assume has the decomposition (4), then

 ΠΩ(X)=ΠΩ(T) (5)

is a system of nonlinear equations (SNLE) in ’s and the node tensors in . One may use the alternative least square (ALS) method to update ’s and the node tensors in in some prescribed order. That is, fix all ’s and all node tensors in except one, update the one via minimizing ; update all ’s and the node tensors in in some prescribed order until convergence.

In order to update the factor matrices ’s and the node tensors in , we need to know , i.e., the tensor diagram (including the weights of all edges). Without knowing any information of , the problem NP-hard. To the best of the authors’ knowledge, such a problem is only recently discussed for tensor network decomposition [25], where a genetic meta-algorithm was used, and only the simplest case was treated.

In this paper, to simplify the problem, we assume and to be known, and we only have a good initial guess for . Furthermore, for the sake of a convergence guarantee, we propose to solve the SNLE via a two-level ALS method. Simply put, in the first level, we update for ; in the second level, we update the node tensors in until the core tensor converges. Next, we show how to update ’s, the node tensors in and the weight vector in detail.

#### Updating the factor matrices A(n)’s

Let be the current estimation for the core tensor. Assume that we update for one by one, and we are updating with , , updated, , , to be updated. Then we denote the current estimations for , , and , , by , , and , , , respectively. In order to update , we need to solve the following optimization problem:

 ∥ΠΩ(\llbracketGt−1;A(1)t,…,A(n−1)t,X,A(n+1)t−1,…,A(N)t−1\rrbracket)−ΠΩ(T)∥F=min. (6)

Notice that (6) is nothing but a linear least square (LLS) problem, since theoretically it can be rewritten in the form .

At first glance, the LLS problem has equations and unknowns. A closer examination indicates that the LLS problem can be decoupled into independent smaller LLS problems, that is, to solve row-wise. The LLS problem with the th row of as its unknowns, has equations. For simplicity, let us assume , , and each entry of is observed independently with probability . Then we have , and . So, when solving the LLS problem with a standard direct solver (say the normal equation method), the computational cost for computing one row of is . And updating all once requires , which are very expensive when and are large. To reduce the computation cost, we may compute a sub-optimal solution instead. For example, we may randomly sample rows from the coefficient matrix of the LLS problem, then a sub-optimal solution can be obtained from the sampled LLS problem. And the computational cost for updating all ’s can be reduced from into , which scales linearly with respect to both and , and hence, suitable for large scale problems.

In practice, we don’t need to compute the exact solution to (6), an approximation is usually sufficient. And to get an approximate solution, one may use the iterative method LSQR [30] to solve all at once. In each iteration of LSQR, two matrix vector multiplications (MVPs) are needed, namely, and . For the LLS problem (6), the operation amounts to the tensor modal product: , where ; and the operation amounts to the multiplication between a sparse matrix and a matrix: . As long as these two MVPs can be efficiently computed, LSQR is preferred, especially for large scale problems. Also, note that the sub-sample idea can also be applied to reduce the computational cost.

#### Updating the node tensors in B

To update the node tensors in , we need to know . For the ease of illustration, we assume that is in the CPD form, i.e.,

 G(G+(w,d),B)=\llbracketΛ;B(1),…,B(N)\rrbracket, (7)

where , is a diagonal tensor, for , , . See the upper left plot in Figure 1 for an illustration.

Let the current estimations for , , be , , , respectively. We can update the node tensors in , that is, , by solving

 ∥ΠΩ(\llbracketG;A(1)t,…,A(N)t\rrbracket)−ΠΩ(T)∥F=min, (8)

where is given by (7). Again, the ALS method can be used. Similar to the ALS method for computing CPD, we may update ’s as follows: fix all ’s but one (say ), then (8) becomes an LLS problem. Similar to the way we find an approximate solution for ’s (’s can no loner be computed row by row), we can obtain a new estimation for , denoted by . Let the length of the th column vector of be , for . Then we set , , where . Such a normalization step makes the all columns of ’s be of unit length, which is commonly adopted in the ALS method for computing CPD. We will perform the ALS method for updating the node tensors in until the core tensor converges. 1

For general , the ALS method can also be used to update the node tensors in . And the iteration continues until converges.

#### Updating the weight vector d

Recall that is nothing but the multi-linear rank of . We recommend an over-estimated initial guess to begin with. In each iteration, after updating , we expect to observe rank deficiency in the modal unfolding matrices of . So, when small singular values occur in , we may remove them and update the multi-linear rank correspondingly. To be precise, let be the current estimation for the core tensor, the economic SVD of be , for , where is orthogonal, is orthonormal, and . Then for all , we find the smallest such that , where is a prescribed number. Next, update , , for all , and update . After that, the size of becomes smaller, the condition number of is no more than , and ’s have orthonormal columns (thus still consist of a Tucker wrapper).

###### Remark 1.

In [6], the authors used a similar idea to transform the robust matrix completion problem into a problem of solving a system of nonlinear equations. Due to the differences between high order tensors and matrices, the approach here differs from the aforementioned one in some obvious aspects, e.g., the multilinear rank vs. the matrix rank, orthonormal factor matrices vs. two, etc. Besides that, the major difference between the two approaches is that we impose a TN structure for the core tensor. As a result, a two-level ALS method is needed rather than a “one-level” ALS method for the matrix completion problem. The reason why we impose a TN structure is that: first, it is expensive to compute the core tensor as a whole since can be large though is assumed to be small; second, a TN structure of the core tensor may improve the performance of the method in practical problems, see Example 2 in Section 5.

### 3.2 Algorithm Details

Before we present the detailed algorithm, we need to explain some notations. Recall , for each mode, pick a node from G and assign the node an outgoing edge with certain weight. To be specific, assume that all nodes of G are numbered; For each mode-, we assign it to node ; and the node tensor is connected with the mode- outgoing edge through its -mode. The detailed algorithm is summarized in Algorithm 1. Some implementation details follow.

Line 10  Inspired by the initialization of the matrix completion problem, we initialize the factor matrices ’s and the core tensor via the best rank- approximation of , which can be computed by the ALS method [3]. Then the node tensors ’s can be initialized via minimizing . A large tolerance, say , is usually sufficient, for both ’s and ’s.

Line 17  Here we only need to compute an “economic” QR decomposition, in which has orthonormal columns, is an upper triangular square matrix.

Line 20  For the sake of a guaranteed convergence, we need to update the node tensors until the core tensor converges. In practice, an approximation for the core tensor is sufficient.

Line 23  Here we only need to compute an “economic” SVD, in which only the singular values and the corresponding left singular vectors are required.

Line 24  The singular values are truncated such that the condition number of is no more than .

Lines 11, 21 and 27 Assume that the contractions for the TN can be efficiently computed. Then it is possible to perform the algorithm without formulating the core tensor explicitly.

## 4 Convergence Analysis

In this section, we study the convergence of Algorithm 1. We will follow the notations in Algorithm 1 and we will also make the following assumptions.

A1 Each entry of is observed independently with probability .
A2 The tensor can be factorized as in (4).
A3 The factor matrix satisfies an incoherence condition with parameter : , for all and .
A4 There exist two positive constants , such that

 γ≤minX∥ΠΩ(\llbracketX;A(1)t,…,A(N)t\rrbracket)−ΠΩ(T)∥Fτt≤Γ,

where is the same as in Algorithm 1.
A5 , i.e., the multi-linear rank estimations are all correctly revealed.

Several remarks on the assumptions follow.

(a) A1-A2 are standard for tensor completion [34].

(b) Recall that solving the SNLE (5) as described in Algorithm 1 is equivalent to the minimization of . If we add a regularizer and still apply Algorithm 1 (of course, with small modifications), the computed becomes smaller. The larger is, the smaller is. So, we may declare that assumption A3 is only slightly stronger than the standard assumption that satisfies an incoherence condition.

(c) Assumption A4 essentially requires that together with the weight vector have a good capacity for the representation for the minimizer of the numerator in A4. Such a requirement is quite natural. In fact, for sufficiently large , can represent any tensor of size -by--by-. And in such case, .

(d) From Algorithm 1, we know that for each , . Thus, must converge. We assume A5 for the ease of the convergence analysis.

To motivate the convergence analysis, in Section 4.1, we sketch a proof for the convergence when all entries of are observed. Then the convergence results for the partial observation case are given in Section 4.2.

### 4.1 Full Observation Case

The difference between the full and partial observation cases is the LLS problems in Algorithm 1. Note that for an LLS problem , sampling the rows uniformly yields a smaller LLS problem , where a 0-1 matrix which selects the indices in . Its solution is the solution to , which is a perturbed LLS problem for . So, to motivate the convergence analysis for the partial observation case, for the full observation case, instead of assuming A2, we assume , where is a noise tensor.

First, note that under the assumption A5, lines 24 to 29 of Algorithm 1 can be skipped. Next, we consider Lines 18 and 22. Let

 Mt,n=A(N)t−1⊗⋯⊗A(n+1)t−1⊗A(n−1)t⊗⋯⊗A(1)t,M∗,n=A(N)∗⊗⋯⊗A(n+1)∗⊗A(n−1)∗⊗⋯⊗A(1)∗. (9)

Then (6) with all entries observed is equivalent to

 ∥X[Gt−1](n)MTt,n−T(n)∥=min.

Therefore, Algorithm 1 can be given by

 X =T(n)Mt,n[Gt−1]†(n) =(A(n)∗[G∗](n)MT∗,n+E(n))Mt,n[Gt−1]†(n). (10)

When is small, the right hand side of (10) almost lies in , thus, we expect to be small. In particular, when , it holds that , i.e., updating once will find the the range space . So, it is not surprising to conclude that is a better approximation of than when is sufficiently small.

Similarly, we may also show that when is sufficiently small, is a better approximation of than . In summary, one iteration of Algorithm 1 gives better approximations for ’s and . In particular, when there is no noise, one iteration of algorithm 1 will return a solution of LRTC.

### 4.2 Partial Observation Case

In this section, we present the convergence of Algorithm 1 when the entries of are partially observed.

The next two lemmas establish the bridges between the partial and full observation cases.

###### Lemma 4.1.

Denote , , , . Let

 L(X) =\llbracketGt−1;A(1)t,…,A(n−1)t,X,A(n+1)t−1,…,A(N)t−1\rrbracket, Xopt =argmin∥Lt(X)−T∥F, ˜Xopt

Assume A1-A3 and A5, with and for all and . Also assume

 ∥sinΘ(Mt,n[Gt−1]T(n),M∗,n[G∗]T(n))∥≤Csinθt−1, (11)

where , are defined in (9), is a constant. 2 Then with probability (w.p.) , it holds that

 ∥˜Xopt−Xopt∥≤6gmax(C1+C2)gmin√αpsinθt−1,

where , and .

Lemma 4.1 tells that the distance between the partial observation solution and the full observation solution is upper bounded: the larger is, the smaller the distance is; the smaller is, the smaller the distance is.

###### Lemma 4.2.

Let , , where

 Xopt =argmin∥\llbracketX;A(1)t,…,A(N)t\rrbracket−T∥F, ˜Xopt =argmin∥ΠΩ(\llbracketX;A(1)t,…,A(N)t\rrbracket−T)∥F.

Assume A1, A3, with being the same as in Lemma 4.1. Then w.p. , it holds that

 ϕt≤ψt≤3/√2 ϕt.

Lemma 4.2 tells that the partial observation solution is as good as the full observation solution , in term of the residual. With the help of the above two lemmas, we are able to prove our main theorem.

###### Theorem 4.3.

Follow the notations in Lemma 4.1. Assume A1-A5, , and

 μ= 3√2∥T∥Fgmin[(1+sinθ0)N−1]sinθ0×6gmax(C1+C2)√αpgmin−√2κψ0−6gmax(C1+C2)√αpsinθ0<γ7Γ.

Then

In other words, Algorithm 1 converges to the exact solution at a linear rate, w.h.p.

###### Remark 2.

Let G be a single node. Then we are in fact solving LRTC problem with the multi-linear rank. Let , . Then , i.e., we need at least observations. When , we need observations, which is theoretical optimal. When , we need observations. Compared with existing multi-linear rank based methods, Algorithm 1 needs less observations, see Table 1 below.

## 5 Numerical Experiment

In this section, we present several numerical examples to illustrate the performance of our method.

Example 1.  In this example, we let G have a single node. We compare our algorithm with two multi-linear rank based tensor completion methods, namely, geomCG [23] and Tmac [43]. 3 The MATLAB codes for geomCG and Tmac are obtained from Github. 4

We generate the tensor as , where , , and their entries are i.i.d. from the standard normal distribution. And each entry of is observed independently with probability . We perform the tests under the following settings:

1. , , ;

2. , , for ;

3. , for , .

We use the residual to measure the quality of the computed solution.

In Figure 4, we plot the results of our method under settings 1 and 2. From the left (setting 1) and right (setting 2) figures, we can see that in all cases, converges linearly; the larger is, the larger the convergence rate is; the smaller is, the larger the convergence rate is.

Under setting 3, we set the tolerance for all three methods to . On the output of each method, if , we take it as a success. For every pair of , we perform all three methods 20 times. The phase transitions for three methods are reported in Figure 5. We can see that to ensure a successful recovery, our method permits a smaller and a larger , compared with geomCG and Tmac.

Example 2.  In this example, we show the effectiveness of the TN with a Tucker wrapper. For the color image “onion.png” (available in MATLAB), which is a third-order tensor of dimension , we reshape it into a fifth-order tensor of dimension . Each entry of the tensor is observed with a probability of . The original and observed figures are shown in Figure 6. We perform our method with three different tensor diagrams, the recovered figures are shown in Figure 7. We can see that compared with simple Tucker, Tucker with TT and TR improves the quality of the recovered figures.

## 6 Conclusion

This paper considers the LRTC problem via tensor networks with a Tucker wrapper. The problem is formulated as a system of nonlinear equations, and a two-level ALS method is proposed to solve it. Under proper assumptions, it is shown that our method converges to the exact solution at a linear rate, w.h.p. Also, to ensure convergence, our method requires less number of observations compared with existing multi-linear rank based tensor completion methods. Numerical simulations show that the method is comparable with state-of-the-art algorithms.

The topology structure search problem remains still open: how to find the tensor diagram to achieve the best performance for real-world data? More studies are needed to that end.

Appendix

## Appendix A Preliminary Lemmas

In this section, we give preliminary lemmas that will be frequently used in subsequent proofs. Firstly, the following lemma is fundamental for canonical angle, and can be easily verified via definition.

###### Lemma A.1.

Let and be two orthogonal matrices with . Then

 ∥sinΘ(U,V)∥ =∥UTcV∥=∥U