A New Convex Relaxation for Tensor Completion

Bernardino Romera-Paredes

Department of Computer Science and UCL Interactive Centre
University College London
Gower Street, WC1E 6BT, London, UK
bernardino.paredes.09@ucl.ac.uk

Massimiliano Pontil

Department of Computer Science and
Centre for Computational Statistics and Machine Learning
University College London
Gower Street, WC1E 6BT, London, UK
m.pontil@cs.ucl.ac.uk

Abstract

We study the problem of learning a tensor from a set of linear measurements. A prominent methodology for this problem is based on a generalization of trace norm regularization, which has been used extensively for learning low rank matrices, to the tensor setting. In this paper, we highlight some limitations of this approach and propose an alternative convex relaxation on the Euclidean ball. We then describe a technique to solve the associated regularization problem, which builds upon the alternating direction method of multipliers. Experiments on one synthetic dataset and two real datasets indicate that the proposed method improves significantly over tensor trace norm regularization in terms of estimation error, while remaining computationally tractable.

1 Introduction

In recent years, there has been growing interest in the problem of learning a tensor from a set of linear measurements, such as a subset of its entries; see [9, 17, 22, 23, 25, 26, 27] and references therein. This methodology, also referred to as tensor completion, has been applied to various fields, ranging from collaborative filtering [15] to computer vision [17] and medical imaging [9], among others. In this paper, we propose a new method for tensor completion, based on a convex regularizer which encourages low rank tensors, and we develop an algorithm for solving the associated regularization problem.

Arguably the most widely used convex approach to tensor completion is based upon the extension of trace norm regularization [24] to that context. This involves computing the average of the trace norm of each matricization of the tensor [16]. A key insight behind using trace norm regularization for matrix completion is that this norm provides a tight convex relaxation of the rank of a matrix defined on the spectral unit ball [8]. Unfortunately, the extension of this methodology to the more general tensor setting presents some difficulties. In particular, we shall prove in this paper that the tensor trace norm is not a tight convex relaxation of the tensor rank.

The above negative result stems from the fact that the spectral norm, used to compute the convex relaxation for the trace norm, is not an invariant property of the matricization of a tensor. This observation leads us to take a different route and study afresh the convex relaxation of tensor rank on the Euclidean ball. We show that this relaxation is tighter than the tensor trace norm, and we describe a technique to solve the associated regularization problem. This method builds upon the alternating direction method of multipliers and a subgradient method to compute the proximity operator of the proposed regularizer. Furthermore, we present numerical experiments on one synthetic dataset and two real-life datasets, which indicate that the proposed method improves significantly over tensor trace norm regularization in terms of estimation error, while remaining computationally tractable.

The paper is organized in the following manner. In Section 2, we describe the tensor completion framework. In Section 3, we highlight some limitations of the tensor trace norm regularizer and present an alternative convex relaxation for the tensor rank. In Section 4, we describe a method to solve the associated regularization problem. In Section 5, we report on our numerical experience with the proposed method. Finally, in Section 6, we summarize the main contributions of this paper and discuss future directions of research.

2 Preliminaries

In this section, we begin by introducing some notation and then proceed to describe the learning problem. We denote by ℕ the set of natural numbers and, for every N ∈ ℕ, we define [N] := {1, …, N}. Let N ∈ ℕ and let p₁, …, p_N ∈ ℕ.¹ An N-order tensor W ∈ ℝ^{p₁×⋯×p_N} is a collection of real numbers (W_{i₁,…,i_N} : i_n ∈ [p_n], n ∈ [N]). Boldface Euler scripts, e.g. W, will be used to denote tensors of order higher than two. Vectors are 1-order tensors and will be denoted by lower case letters, e.g. x or b; matrices are 2-order tensors and will be denoted by upper case letters, e.g. W. If x ∈ ℝ^d then, for every k ∈ [d], we define x_{1:k} := (x₁, …, x_k). We also use the notation p_min := min{p₁, …, p_N} and p_max := max{p₁, …, p_N}.

¹For simplicity we assume that p_n ≥ 2 for every n ∈ [N], otherwise we simply reduce the order of the tensor without loss of information.

A mode-n fiber of a tensor W is a vector composed of the elements of W obtained by fixing all indices but one, corresponding to the n-th mode. This notion is a higher order analogue of columns (mode-1 fibers) and rows (mode-2 fibers) for matrices. The mode-n matricization (or unfolding) of W, denoted by W₍n₎, is a matrix obtained by arranging the mode-n fibers of W so that each of them is a column of W₍n₎ ∈ ℝ^{pₙ×qₙ}, where qₙ := ∏_{m≠n} p_m. Note that the order of the columns is not important as long as it is consistent.
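As a concrete illustration, the mode-n matricization can be sketched in a few lines of numpy; the helper name `unfold` and the particular column ordering are our own choices, not the paper's.

```python
import numpy as np

def unfold(W, n):
    """Mode-n matricization: the mode-n fibers of W become the columns
    of a matrix of shape (p_n, q_n), with q_n the product of the
    remaining dimensions."""
    return np.moveaxis(W, n, 0).reshape(W.shape[n], -1)

W = np.arange(24, dtype=float).reshape(2, 3, 4)
for n in range(3):
    M = unfold(W, n)
    assert M.shape == (W.shape[n], W.size // W.shape[n])
    # every matricization merely rearranges the same entries,
    # so the Frobenius norm is invariant across modes
    assert np.isclose(np.linalg.norm(M), np.linalg.norm(W))
```

Any consistent column ordering works here, since the norms and ranks used below do not depend on it.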

We are now ready to describe the learning problem. We choose a linear operator I : ℝ^{p₁×⋯×p_N} → ℝ^m, representing a set of m linear measurements obtained from a target tensor W⁰ as y = I(W⁰) + ξ, where ξ is some disturbance noise. In this paper, we mainly focus on tensor completion, in which case the operator I measures elements of the tensor. That is, we have I(W)ᵢ = W_{j₁(i),…,j_N(i)}, where, for every i ∈ [m] and n ∈ [N], the index jₙ(i) is a prescribed integer in the set [pₙ]. Our aim is to recover the tensor W⁰ from the data y. To this end, we solve the regularization problem

 min { ∥y − I(W)∥₂² + γ R(W) : W ∈ ℝ^{p₁×⋯×p_N} }   (1)

where γ is a positive parameter which may be chosen by cross validation. The role of the regularizer R is to encourage tensors which have a simple structure, in the sense that they involve a small number of “degrees of freedom”. A natural choice is to consider the average of the ranks of the tensor’s matricizations. Specifically, we consider the combinatorial regularizer

 R(W) = (1/N) ∑_{n=1}^N rank(W₍n₎).   (2)

Finding a convex relaxation of this regularizer has been the subject of recent works [9, 17, 23]. They all agree to use the trace norm for tensors as a convex proxy of R. This is defined as the average of the trace norms of the matricizations of W, that is,

 ∥W∥_tr = (1/N) ∑_{n=1}^N ∥W₍n₎∥_tr   (3)

where ∥W₍n₎∥_tr is the trace (or nuclear) norm of the matrix W₍n₎, namely the ℓ₁-norm of its vector of singular values (see, e.g. [14]). Note that in the particular case of 2-order tensors, functions (2) and (3) coincide with the usual notion of rank and trace norm of a matrix, respectively.
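To make the definition concrete, here is a minimal numpy sketch of the tensor trace norm in equation (3), computed as the average of the nuclear norms of the unfoldings; the helper names are ours.

```python
import numpy as np

def unfold(W, n):
    # mode-n matricization: mode-n fibers become columns
    return np.moveaxis(W, n, 0).reshape(W.shape[n], -1)

def tensor_trace_norm(W):
    # average of the nuclear (trace) norms of the N matricizations
    N = W.ndim
    return sum(np.linalg.norm(unfold(W, n), ord='nuc') for n in range(N)) / N

# for a 2-order tensor (a matrix) this reduces to the usual trace norm
A = np.diag([3.0, 4.0])
assert np.isclose(tensor_trace_norm(A), 7.0)
```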

A rationale behind the regularizer (3) is that the trace norm is the tightest convex lower bound to the rank of a matrix on the spectral unit ball, see [8, Thm. 1]. This lower bound is given by the convex envelope of the function

 Ψ(W) = { rank(W), if ∥W∥_∞ ≤ 1;  +∞, otherwise }   (4)

where ∥W∥_∞ is the spectral norm of W, namely its largest singular value. The convex envelope can be derived by computing the double conjugate of Ψ, which is defined as

 Ψ**(W) = sup { ⟨W, S⟩ − Ψ*(S) : S ∈ ℝ^{p₁×p₂} }   (5)

where Ψ* is the conjugate of Ψ, namely Ψ*(S) = sup { ⟨W, S⟩ − Ψ(W) : W ∈ ℝ^{p₁×p₂} }.

Note that Ψ is a spectral function, that is, Ψ(W) = ψ(σ(W)), where ψ denotes the associated symmetric gauge function. Using von Neumann’s trace theorem (see e.g. [14]) it is easily seen that Ψ* is also a spectral function, that is, Ψ*(S) = ψ*(σ(S)), where

 ψ*(σ) = sup { ⟨σ, w⟩ − ψ(w) : w ∈ ℝ₊^d },   with d := min(p₁, p₂).

We refer to [8] for a detailed discussion of these ideas. We will use this equivalence between spectral and gauge functions repeatedly in the paper.

3 Alternative Convex Relaxation

In this section, we show that the tensor trace norm is not a tight convex relaxation of the tensor rank in equation (2). We then propose an alternative convex relaxation for this function.

Note that due to the composite nature of the function R, computing its convex envelope is a challenging task and one needs to resort to approximations. In [22], the authors note that the tensor trace norm in equation (3) is a convex lower bound to R on the set

 G_∞ := {W ∈ ℝ^{p₁×⋯×p_N} : ∥W₍n₎∥_∞ ≤ 1, n ∈ [N]}.

The key insight behind this observation is summarized in Lemma 4, which we report in Appendix A. However, the authors of [22] leave open the question of whether the tensor trace norm is the convex envelope of R on the set G_∞. In the following, we prove that this question has a negative answer, by showing that there exists a convex function Ω which underestimates the function R on G_∞ and such that, for some tensor W, it holds that Ω(W) > ∥W∥_tr.

To describe our observation we introduce the set

 G2:={W∈Rp1×…×pN:∥W∥2≤1}

where ∥W∥₂ is the Euclidean (Frobenius) norm for tensors, that is,

 ∥W∥₂² := ∑_{i₁=1}^{p₁} ⋯ ∑_{i_N=1}^{p_N} (W_{i₁,…,i_N})².

We will choose

 Ω(W) = Ω_α(W) := (1/N) ∑_{n=1}^N ω_α**(σ(W₍n₎))   (6)

where ω_α** is the convex envelope of the cardinality of a vector on the ℓ₂-ball of radius α, and we will choose α = √p_min. Note, by Lemma 4 stated in Appendix A, that, for every α > 0, the function Ω_α is a convex lower bound of the function R on the set {W : ∥W∥₂ ≤ α}.

Below, for every vector s ∈ ℝ^d, we denote by s↓ the vector obtained by reordering the components of s so that they are non increasing in absolute value, that is, |s↓₁| ≥ |s↓₂| ≥ ⋯ ≥ |s↓_d|.

Lemma 1.

Let ω_α** be the convex envelope of the cardinality function on the ℓ₂-ball of radius α > 0. Then, for every x ∈ ℝ^d such that ∥x∥₂ = α, it holds that ω_α**(x) = card(x).

Proof.

First, we note that the conjugate of the cardinality function on the ℓ₂-ball of radius α is given by the formula

 ω_α*(s) = sup_{∥y∥₂≤α} { ⟨s, y⟩ − card(y) } = max_{r∈{0,…,d}} { α ∥s↓_{1:r}∥₂ − r }.   (7)

Hence, by the definition of the double conjugate, we have, for every x, s ∈ ℝ^d, that

 ω_α**(x) ≥ ⟨s, x⟩ − max_{r∈{0,…,d}} { α ∥s↓_{1:r}∥₂ − r }.

In particular, choosing s = kx for some k > 0, this inequality becomes

 ω_α**(x) ≥ k ∥x∥₂² − max_{r∈{0,…,d}} ( αk ∥x↓_{1:r}∥₂ − r ).

If k is large enough, the maximum is attained at r = card(x). Consequently, since ∥x∥₂ = α,

 ω_α**(x) ≥ kα² − kα² + card(x) = card(x).

By the definition of the convex envelope, it also holds that ω_α**(x) ≤ card(x). The result follows. ∎

The next lemma provides, together with Lemma 1, a sufficient condition for the existence of a tensor at which the proposed regularizer is strictly larger than the tensor trace norm.

Lemma 2.

If p₁, …, p_N are not all equal to each other, then there exists W ∈ ℝ^{p₁×⋯×p_N} such that: (a) R(W) = (1/N)((N−1)p_min + p_min + 1); (b) ∥W₍n₎∥_∞ ≤ 1 for every n ∈ [N]; (c) ∥W∥₂ = √p_min.

Proof.

Without loss of generality we assume that p₁ ≤ ⋯ ≤ p_N. By hypothesis p₁ < p_N. First we consider the special case

 p₁ = ⋯ = p_{N−1},  and  p_N = p₁ + 1.   (8)

We define a class of tensors by choosing a singular value decomposition for their mode-N matricization,

 W_{i₁,i₂,…,i_N} = ∑_{k=1}^{p_N} σ_k u^k_{i_N} v^k_{i₁,…,i_{N−1}}   (9)

where σ_k := √(p_min/(p_min + 1)) for every k ∈ [p_N], the vectors u^k ∈ ℝ^{p_N} are orthonormal and the vectors v^k ∈ ℝ^{p₁⋯p_{N−1}} are orthonormal as well. Moreover, we choose the v^k as

 v^k_{i₁,…,i_{N−1}} = 1 if i₁ = ⋯ = i_{N−1} = k and 0 otherwise, for k ∈ [p₁], while v^{p_N} is a unit vector, orthogonal to the previous ones, satisfying ∑_{i₂,…,i_{N−1}} v^{p_N}_{i₁,i₂,…,i_{N−1}} v^{p_N}_{j₁,i₂,…,i_{N−1}} = δ_{i₁,j₁}/p₁.   (10)

By construction the matrix W₍N₎ has rank equal to p_min + 1 and Frobenius norm equal to √p_min; moreover, the computation below shows that W₍n₎ᵀ is an isometry for every n < N, so that rank(W₍n₎) = p_min. Thus properties (a) and (c) hold true. It remains to show that W satisfies property (b). To this end, we will show, for every n ∈ [N] and every x, that

 ∥W₍n₎ᵀ x∥₂ ≤ ∥x∥₂.

The case n = N is immediate, since the σ_k are smaller than one. If n = 1 we have

 ∥W₍1₎ᵀ x∥₂² = ∑_{i₂,…,i_N} ( ∑_k σ_k ∑_{i₁} u^k_{i_N} v^k_{i₁,…,i_{N−1}} x_{i₁} )²
 = ∑_{i₂,…,i_N} ∑_{k,ℓ} ∑_{i₁,j₁} x_{i₁} x_{j₁} σ_k σ_ℓ u^k_{i_N} u^ℓ_{i_N} v^k_{i₁,i₂,…,i_{N−1}} v^ℓ_{j₁,i₂,…,i_{N−1}}
 = ∑_k σ_k² ∑_{i₁,j₁} x_{i₁} x_{j₁} ( ∑_{i₂,…,i_{N−1}} v^k_{i₁,i₂,…,i_{N−1}} v^k_{j₁,i₂,…,i_{N−1}} )
 = ∑_{k=1}^{p₁} σ_k² x_k² + (σ_{p_N}²/p₁) ∑_{k=1}^{p₁} x_k² = ∥x∥₂²

where we used the orthonormality ∑_{i_N} u^k_{i_N} u^ℓ_{i_N} = δ_{k,ℓ} in the third equality, equation (10) and a direct computation in the fourth equality, and the definition of the σ_k in the last equality.
All other cases, namely n ∈ {2, …, N−1}, are conceptually identical, so we only discuss the case n = 2. We have

 ∥W₍2₎ᵀ x∥₂² = ∑_{i₁,i₃,…,i_N} ( ∑_k σ_k ∑_{i₂} u^k_{i_N} v^k_{i₁,i₂,…,i_{N−1}} x_{i₂} )²
 = ∑_{i₁,i₃,…,i_N} ∑_{k,ℓ} ∑_{i₂,j₂} x_{i₂} x_{j₂} σ_k σ_ℓ u^k_{i_N} u^ℓ_{i_N} v^k_{i₁,i₂,…,i_{N−1}} v^ℓ_{i₁,j₂,…,i_{N−1}}
 = ∑_k σ_k² ∑_{i₂,j₂} x_{i₂} x_{j₂} ( ∑_{i₁,i₃,…,i_{N−1}} v^k_{i₁,i₂,…,i_{N−1}} v^k_{i₁,j₂,…,i_{N−1}} )
 = ∑_{k=1}^{p₁} σ_k² x_k² + (σ_{p_N}²/p₁) ∑_{k=1}^{p₁} x_k² = ∥x∥₂²

where again we used the orthonormality of the vectors u^k in the third equality, equation (10) and a direct computation in the fourth equality, and the definition of the σ_k in the last equality.
Finally, if assumption (8) does not hold, we pad with zeros: we set W_{i₁,…,i_N} = 0 whenever i_n > p_min for some n ∈ [N−1], or i_N > p_min + 1, and proceed as in the case p₁ = ⋯ = p_{N−1} = p_min and p_N = p_min + 1. ∎

We are now ready to present the main result of this section.

Proposition 3.

Let ∥·∥_tr be the tensor trace norm in equation (3) and let Ω be the function in equation (6) for α = √p_min. If p₁, …, p_N are not all equal to each other, then there are infinitely many tensors W such that Ω(W) > ∥W∥_tr. Moreover, for every W in the Euclidean unit ball G₂, it holds that Ω₁(W) ≥ ∥W∥_tr.

Proof.

By construction, ω_α** ≤ card on the ℓ₂-ball of radius α, hence Ω is a convex lower bound for the tensor rank R on the set {W : ∥W∥₂ ≤ α}. The first claim now follows by Lemmas 1 and 2. Indeed, all tensors obtained following the process described in Lemma 2 have the property that

 ∥W∥_tr = (1/N) ∑_{n=1}^N ∥σ(W₍n₎)∥₁ = (1/N) ( p_min(N−1) + √(p_min² + p_min) ) < (1/N) ( p_min(N−1) + p_min + 1 ) = Ω(W) = R(W).

Furthermore, there are infinitely many such tensors which satisfy this claim, since the left singular vectors u^k can be chosen arbitrarily in equation (9).
To prove the second claim, we note that ω₁** is the convex envelope of the cardinality on the Euclidean unit ball and that card(σ) ≥ ∥σ∥₁ whenever ∥σ∥₂ ≤ 1; hence ω₁**(σ) ≥ ∥σ∥₁ for every vector σ with ∥σ∥₂ ≤ 1. Consequently,

 Ω₁(W) = (1/N) ∑_{n=1}^N ω₁**(σ(W₍n₎)) ≥ (1/N) ∑_{n=1}^N ∥σ(W₍n₎)∥₁ = ∥W∥_tr. ∎

The above result stems from the fact that the spectral norm is not an invariant property of the matricizations of a tensor, whereas the Euclidean (Frobenius) norm is. This observation leads us to further study the function Ω.

4 Optimization Method

In this section, we explain how to solve the regularization problem associated with the proposed regularizer (6). For this purpose, we first recall the alternating direction method of multipliers (ADMM) [4], which was conveniently applied to tensor trace norm regularization in [9, 22].

4.1 Alternating Direction Method of Multipliers (ADMM)

To explain ADMM we consider a more general problem comprising both tensor trace norm regularization and the regularizer we propose,

 min_W { E(W) + γ ∑_{n=1}^N Ψ(W₍n₎) }   (11)

where E is an error term, such as E(W) = ∥y − I(W)∥₂², and Ψ is a convex spectral function. It is defined, for every matrix A, as

 Ψ(A)=ψ(σ(A))

where ψ is a symmetric gauge function, namely a norm which is invariant under permutations and sign changes of the components of its argument. In particular, if ψ is the ℓ₁ norm then problem (11) corresponds to tensor trace norm regularization, whereas if ψ = ω_α** it implements the proposed regularizer.

Problem (11) poses some difficulties because the terms under the summation are interdependent, that is, the different matricizations of W have the same elements rearranged in a different way. In order to overcome this difficulty, the authors of [9, 22] proposed to use ADMM as a natural way to decouple the regularization term appearing in problem (11). This strategy is based on the introduction of N auxiliary tensors B¹, …, Bᴺ, so that problem (11) can be reformulated as²

²The somewhat cumbersome notation Bⁿ₍n₎ denotes the mode-n matricization of the tensor Bⁿ.

 min_{W,B¹,…,Bᴺ} { (1/γ) E(W) + ∑_{n=1}^N Ψ(Bⁿ₍n₎) :  Bⁿ = W, n ∈ [N] }.   (12)

The corresponding augmented Lagrangian (see e.g. [4, 5]) is given by

 L(W, B, A) = (1/γ) E(W) + ∑_{n=1}^N ( Ψ(Bⁿ₍n₎) − ⟨Aⁿ, W − Bⁿ⟩ + (β/2) ∥W − Bⁿ∥₂² )   (13)

where ⟨·,·⟩ denotes the scalar product between tensors, β is a positive parameter, and A¹, …, Aᴺ are the Lagrange multipliers associated with the constraints in problem (12).

ADMM is based on the following iterative scheme

 W[i+1] ← argmin_W L(W, B[i], A[i])   (14)
 Bⁿ[i+1] ← argmin_{Bⁿ} L(W[i+1], B, A[i])   (15)
 Aⁿ[i+1] ← Aⁿ[i] − β (W[i+1] − Bⁿ[i+1]).   (16)

Step (16) is straightforward, whereas step (14) is described in [9]. Here we focus on step (15), since this is the only step which involves the function Ψ. We restate it with more explanatory notation as

 argmin_{Bⁿ₍n₎} { Ψ(Bⁿ₍n₎) − ⟨Aⁿ₍n₎, W₍n₎ − Bⁿ₍n₎⟩ + (β/2) ∥W₍n₎ − Bⁿ₍n₎∥₂² }.

By completing the square in the right hand side, the solution of this problem is given by

 B̂ⁿ₍n₎ = prox_{(1/β)Ψ}(X) := argmin_{Bⁿ₍n₎} { (1/β) Ψ(Bⁿ₍n₎) + (1/2) ∥Bⁿ₍n₎ − X∥₂² }

where X := W₍n₎ − (1/β) Aⁿ₍n₎. By using properties of proximity operators (see e.g. [2, Prop. 3.1]) we know that if ψ is a gauge function then

 prox_{(1/β)Ψ}(X) = U_X diag( prox_{(1/β)ψ}(σ(X)) ) V_Xᵀ

where U_X and V_X are the orthogonal matrices formed by the left and right singular vectors of X, respectively.

If we choose ψ = ∥·∥₁, the associated proximity operator is the well-known soft thresholding operator, that is, prox_{(1/β)ψ}(σ) = v, where the vector v has components

 vᵢ = sign(σᵢ) ( |σᵢ| − 1/β )₊,

where (t)₊ := max(t, 0).
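Putting the last two displays together, the proximity step for the trace norm case can be sketched as follows, assuming the SVD-based formula above; the function name is ours.

```python
import numpy as np

def prox_trace_norm(X, beta):
    """prox of (1/beta)*trace-norm at X: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # soft thresholding; the positive part (.)_+ is essential
    v = np.sign(s) * np.maximum(np.abs(s) - 1.0 / beta, 0.0)
    return U @ np.diag(v) @ Vt

X = np.diag([3.0, 0.2])
P = prox_trace_norm(X, beta=2.0)   # threshold 1/beta = 0.5
# singular value 3.0 shrinks to 2.5, while 0.2 is set to zero
assert np.allclose(np.linalg.svd(P, compute_uv=False), [2.5, 0.0])
```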

On the other hand, if we choose ψ = ω_α**, we need to compute prox_{(1/β)ω_α**}. In the next section, we describe a method to accomplish this task.

4.2 Computation of the Proximity Operator

In order to compute the proximity operator of the function (1/β) ω_α** we will use several properties of proximal calculus. First, we use the Moreau decomposition (see e.g. [7]), which expresses the proximity operator of a function in terms of the proximity operator of its conjugate. Next, we use a property of conjugate functions from [21, 13], which states that the conjugate of the double conjugate equals the conjugate, that is, (ω_α**)* = ω_α*. Finally, by the scaling property of proximity operators [7], we reduce the computation to that of the proximity operator of a multiple of ω_α*.

It remains to compute the proximity operator of a multiple of the function in equation (7). That is, for any β > 0 and y ∈ ℝ^d, we wish to compute

 prox_{βω_α*}(y) = argmin_w { h(w) : w ∈ S }

where we have defined S := {w ∈ ℝ^d : w₁ ≥ ⋯ ≥ w_d ≥ 0} and

 h(w) = (1/2) ∥w − y∥₂² + β max_{r∈{0,…,d}} { α ∥w_{1:r}∥₂ − r }.

In order to solve this problem we employ the projected subgradient method, see e.g. [6]. It applies two steps at each iteration: first, it advances along a negative subgradient of the objective at the current solution; second, it projects the resulting point onto the feasible set S. In fact, according to [6], it is sufficient to compute an approximate projection, a step which we describe in Appendix B. To compute a subgradient of h at w, we first find any integer k attaining the maximum in the definition of h, that is, k ∈ argmax_{r∈{0,…,d}} { α ∥w_{1:r}∥₂ − r }. Then, we calculate a subgradient g of h at w by the formula

 gᵢ = (1 + αβ/∥w_{1:k}∥₂) wᵢ − yᵢ, if i ≤ k;   gᵢ = wᵢ − yᵢ, otherwise.
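A hedged sketch of this subgradient computation, with our own function name and a plain-Python evaluation of the max in equation (7):

```python
import math

def subgradient(w, y, alpha, beta):
    """Subgradient of h(w) = 0.5*||w-y||^2 + beta*max_r(alpha*||w_{1:r}||_2 - r)."""
    d = len(w)
    # find k attaining max_r { alpha*||w_{1:r}||_2 - r }; r = 0 gives value 0
    best_k, best_val, sq = 0, 0.0, 0.0
    for r in range(1, d + 1):
        sq += w[r - 1] ** 2
        val = alpha * math.sqrt(sq) - r
        if val > best_val:
            best_k, best_val = r, val
    norm_k = math.sqrt(sum(x ** 2 for x in w[:best_k])) if best_k else 1.0
    # assemble g per the displayed formula
    return [((1.0 + alpha * beta / norm_k) * w[i] - y[i]) if i < best_k
            else (w[i] - y[i]) for i in range(d)]

g = subgradient([0.0, 0.0], [1.0, 1.0], alpha=1.0, beta=1.0)
assert g == [-1.0, -1.0]   # here the max in (7) is attained at r = 0
```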

Now we have all the ingredients to apply the projected subgradient method, which is summarized in Algorithm 1. In our implementation, we stop the algorithm when the best objective value found is not improved for a prescribed number of iterations.

5 Experiments

We have conducted a set of experiments to assess whether there is any advantage in using the proposed regularizer over the tensor trace norm for tensor completion. First, we designed a synthetic experiment to evaluate the performance of both approaches under controlled conditions. Then, we tried both methods on two real tensor completion problems. In all cases, we used a validation procedure to tune the hyper-parameter γ, present in both approaches, over a grid of candidate values. In our proposed approach there is one further hyper-parameter, α, to be specified. It should take the value of the Frobenius norm of any matricization of the underlying tensor. Since this is unknown, we propose to use the estimate

 ^α= ⎷∥w∥22+(mean(w)2+var(w))(N∏i=1pi−m),

where m is the number of known entries and w ∈ ℝ^m contains their values. This estimator assumes that each unobserved value of the tensor is sampled from a distribution whose first and second moments match the mean mean(w) and the variance var(w) of the elements in w.
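A small sketch of this estimator (our own helper name), which reduces to the exact Frobenius norm when all entries are observed:

```python
import math

def estimate_alpha(w, dims):
    """alpha-hat: known entries contribute ||w||_2^2 exactly; each of the
    (prod_i p_i - m) unknown entries contributes mean(w)^2 + var(w)
    in expectation."""
    m = len(w)
    total = 1
    for p in dims:
        total *= p
    mean = sum(w) / m
    var = sum((x - mean) ** 2 for x in w) / m
    return math.sqrt(sum(x ** 2 for x in w) + (mean ** 2 + var) * (total - m))

# fully observed tensor: the estimate is exactly the Frobenius norm
assert math.isclose(estimate_alpha([3.0, 4.0], (2, 1)), 5.0)
```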

5.1 Synthetic Dataset

We have generated a 3-order tensor by the following procedure. First we generated the tensor W ∈ ℝ^{40×20×10}, with multilinear ranks (12, 6, 3), using the Tucker decomposition (see e.g. [16])

 W_{i₁,i₂,i₃} = ∑_{j₁=1}^{12} ∑_{j₂=1}^{6} ∑_{j₃=1}^{3} C_{j₁,j₂,j₃} M⁽¹⁾_{i₁,j₁} M⁽²⁾_{i₂,j₂} M⁽³⁾_{i₃,j₃},   (i₁,i₂,i₃) ∈ [40]×[20]×[10]

where each entry of the Tucker components C, M⁽¹⁾, M⁽²⁾, M⁽³⁾ is sampled independently from the standard Gaussian distribution N(0, 1). We then created the ground truth tensor W⁰ by the equation

 W⁰_{i₁,i₂,i₃} = ( W_{i₁,i₂,i₃} − mean(W) ) / std(W) + ξ_{i₁,i₂,i₃}

where mean(W) and std(W) are the mean and standard deviation of the elements of W, and the ξ_{i₁,i₂,i₃} are i.i.d. zero mean Gaussian random variables. We randomly sampled a subset of the elements of the tensor to compose the training set, a further subset for the validation set, and kept the remaining elements for the test set. After repeating this process several times, we report the average results in Figure 1 (Left). Having conducted a paired t-test for each training set size, we conclude that the visible differences in performance are highly significant, the p-values being very small throughout.
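The generation procedure above can be sketched with numpy as follows; the noise level `sigma` is an assumption of ours, since the variance used in the paper is not recoverable from this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)
# Tucker core 12x6x3 and factors 40x12, 20x6, 10x3, all standard Gaussian
C = rng.standard_normal((12, 6, 3))
M1, M2, M3 = (rng.standard_normal(s) for s in ((40, 12), (20, 6), (10, 3)))

# W_{i1,i2,i3} = sum_{j1,j2,j3} C_{j1,j2,j3} M1_{i1,j1} M2_{i2,j2} M3_{i3,j3}
W = np.einsum('abc,ia,jb,kc->ijk', C, M1, M2, M3)

sigma = 0.1   # assumed noise level, not stated in this excerpt
W0 = (W - W.mean()) / W.std() + sigma * rng.standard_normal(W.shape)
assert W0.shape == (40, 20, 10)
```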

Furthermore, we have conducted an experiment to test the running time of both approaches. We have generated tensors of size p × p × p for several values of p, following the same procedure as outlined above. The results are reported in Figure 1 (Right). For low values of p, the ratio between the running time of our approach and that of trace norm regularization is quite high; however, as the volume of the tensor increases, the ratio quickly decreases. These outcomes are expected since, when p is low, the most demanding routine in our method is the one described in Algorithm 1. However, as p increases, the singular value decomposition routine, which is common to both methods, becomes the most demanding because of its higher time complexity [10]. Therefore, we can conclude that even though our approach is slower than the trace norm based method, this difference becomes much smaller as the size of the tensor increases.

5.2 School Dataset

The first real dataset we have tried is the Inner London Education Authority (ILEA) dataset. It is composed of examination marks of students, each described by a set of attributes such as school and ethnic group. Most of these attributes are categorical, so we can think of exam mark prediction as a tensor completion problem in which each mode corresponds to a categorical attribute. In particular, we have used the attributes school, gender, VR-band, ethnic group, and year, leading to a 5-order tensor.

We have randomly selected a subset of the instances to form the test set and another subset for the validation set. From the remaining instances, we have randomly chosen training sets of several sizes. This procedure has been repeated several times, and the average performance is presented in Figure 2 (Left). There is a distinguishable improvement of our approach with respect to tensor trace norm regularization. To check whether this gap is significant, we have conducted a paired t-test for each training set size, obtaining in all cases a very small p-value.

5.3 Video Completion

In the second real-data experiment we have performed a video completion test. Any video can be treated as a 4-order tensor (“width” × “height” × “RGB” × “video length”), so we can use tensor completion algorithms to rebuild a video from a few inputs, a procedure that can be useful for compression purposes. In our case, we have used the Ocean video, available at [17], treated as a 4-order tensor. We have randomly sampled tensor elements as training data, set aside a portion of them as validation data, and used the remaining elements as the test set. After repeating this procedure several times, we present the average results in Figure 2 (Right). The proposed approach is noticeably better than the tensor trace norm in this experiment. This outcome is strongly supported by the paired t-tests which we ran for each training set size, all of which produced very small p-values.

6 Conclusion

In this paper, we proposed a convex relaxation for the average of the ranks of the matricizations of a tensor. We compared this relaxation to a commonly used convex relaxation in the context of tensor completion, which is based on the trace norm. We proved that this second relaxation is not tight and argued that the proposed convex regularizer may be advantageous. Empirical comparisons indicate that our method consistently improves over tensor trace norm regularization in terms of estimation error, while being computationally comparable on the range of problems we considered. In the future, it would be interesting to study methods to speed up the computation of the proximity operator of our regularizer and to investigate its utility in tensor learning problems beyond tensor completion, such as multilinear multitask learning [20].

References

• [2] A. Argyriou, C.A. Micchelli, M. Pontil, L. Shen and Y. Xu. Efficient first order methods for linear composite regularizers. arXiv:1104.1436, 2011.
• [3] R. Bhatia. Matrix Analysis. Springer Verlag, 1997.
• [4] D.P. Bertsekas, J.N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.
• [5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
• [6] S. Boyd, L. Xiao, A. Mutapcic. Subgradient methods, Stanford University, 2003.
• [7] P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering (H. H. Bauschke et al. Eds), pages 185–212, Springer, 2011.
• [8] M. Fazel, H. Hindi, and S. Boyd. A rank minimization heuristic with application to minimum order system approximation. Proc. American Control Conference, Vol. 6, pages 4734–4739, 2001.
• [9] S. Gandy, B. Recht, I. Yamada. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27(2), 2011.
• [10] G. H. Golub, C. F. Van Loan. Matrix Computations. 3rd Edition. Johns Hopkins University Press, 1996.
• [11] Z. Harchaoui, M. Douze, M. Paulin, M. Dudik, J. Malick. Large-scale image classification with trace-norm regularization. IEEE Conference on Computer Vision & Pattern Recognition (CVPR), pages 3386–3393, 2012.
• [12] J-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms, Part I. Springer, 1996.
• [13] J-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms, Part II. Springer, 1993.
• [14] R.A. Horn and C.R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 2005.
• [15] A. Karatzoglou, X. Amatriain, L. Baltrunas, N. Oliver. Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering. Proc. 4th ACM Conference on Recommender Systems, pages 79–86, 2010.
• [16] T.G. Kolda and B.W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
• [17] J. Liu, P. Musialski, P. Wonka, J. Ye. Tensor completion for estimating missing values in visual data. Proc. 12th International Conference on Computer Vision (ICCV), pages 2114–2121, 2009.
• [18] Y. Nesterov. Gradient methods for minimizing composite objective functions. ECORE Discussion Paper, 2007/96, 2007.
• [19] B. Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413–3430, 2011.
• [20] B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze and M. Pontil. Multilinear multitask learning. Proc. 30th International Conference on Machine Learning (ICML), pages 1444–1452, 2013.
• [21] N. Z. Shor. Minimization Methods for Non-differentiable Functions. Springer, 1985.
• [22] M. Signoretto, Q. Tran Dinh, L. De Lathauwer, J.A.K. Suykens. Learning with tensors: a framework based on convex optimization and spectral regularization. Machine Learning, to appear.
• [23] M. Signoretto, R. Van de Plas, B. De Moor, J.A.K. Suykens. Tensor versus matrix completion: a comparison with application to spectral data. IEEE Signal Processing Letters, 18(7):403–406, 2011.
• [24] N. Srebro, J. Rennie and T. Jaakkola. Maximum margin matrix factorization. Advances in Neural Information Processing Systems (NIPS) 17, pages 1329–1336, 2005.
• [25] R. Tomioka, K. Hayashi, H. Kashima, J.S.T. Presto. Estimation of low-rank tensors via convex optimization. arXiv:1010.0789, 2010.
• [26] R. Tomioka and T. Suzuki. Convex tensor decomposition via structured Schatten norm regularization. arXiv:1303.6370, 2013.
• [27] R. Tomioka, T. Suzuki, K. Hayashi, H. Kashima. Statistical performance of convex tensor decomposition. Advances in Neural Information Processing Systems (NIPS) 24, pages 972–980, 2013.

Appendix

In this appendix, we describe an auxiliary result and present the main steps for the computation of an approximate projection.

Appendix A A Useful Lemma

Lemma 4.

Let C₁, …, C_N be convex subsets of a Euclidean space, let C := C₁ × ⋯ × C_N and let g : C → ℝ. Let h be the function defined, for every x ∈ C₁ ∩ ⋯ ∩ C_N, as h(x) = g(x, …, x). Then, for every such x, it holds that

 h**(x) ≥ g**(x₁, …, x_N) |_{x_n = x, ∀n ∈ [N]}.
Proof.

Since the restriction of g to the diagonal set {(x, …, x) : x ∈ C₁ ∩ ⋯ ∩ C_N} equals h, the convex envelope of g, when evaluated on this smaller set, cannot be larger than the convex envelope of h. ∎

Using this result it is immediately possible to derive a convex lower bound for the function R in equation (2). Since the convex envelope of the rank function on the unit ball of the spectral norm is the trace norm, using Lemma 4 with Cₙ = {W : ∥W₍n₎∥_∞ ≤ 1} and

 g(W¹, …, Wᴺ) = (1/N) ∑_{n=1}^N rank((Wⁿ)₍n₎),

we conclude that the convex envelope of the function R on the set G_∞ is bounded from below by the tensor trace norm ∥·∥_tr. Likewise, the convex envelope of R on the set {W : ∥W∥₂ ≤ α} is lower bounded by the function Ω_α in equation (6).

Appendix B Computation of an Approximated Projection

Here, we address the issue of computing an approximate Euclidean projection onto the set

 S={v∈Rd:v1≥⋯≥vd≥0}.

That is, for every v ∈ ℝ^d, we shall find a point P̃_S(v) ∈ S such that

 ∥P̃_S(v) − z∥₂ ≤ ∥v − z∥₂,   ∀z ∈ S.   (17)

As noted in [6], in order to build P̃_S such that this property holds true, it is useful to express the set of interest as the smallest one in a series of nested sets. In our problem, we can express S as

 S = S_d ⊆ S_{d−1} ⊆ ⋯ ⊆ S₁,

where S_i := {v ∈ ℝ^d : v₁ ≥ ⋯ ≥ v_i ≥ 0}. This property allows us to sequentially compute an approximate projection onto the set S using the formula

 P̃_S(v) = P_{S_d}( P_{S_{d−1}} ⋯ ( P_{S₁}(v) ) )   (18)

where, for every closed convex set A, we denote by P_A the associated projection operator. Indeed, following [6], we can argue by induction on i that P̃_S verifies condition (17). The base case i = 1 is obvious. Now, if for a given i it holds that

 ∥P_{S_i}(⋯ P_{S₁}(v)) − z∥₂ ≤ ∥v − z∥₂,   ∀z ∈ S_i,

then

 ∥P_{S_{i+1}}( P_{S_i}(⋯ P_{S₁}(v)) ) − z∥₂ ≤ ∥P_{S_i}(⋯ P_{S₁}(v)) − z∥₂ ≤ ∥v − z∥₂,   ∀z ∈ S_{i+1},

where the first inequality holds because projecting onto a convex set does not increase the distance to any point of that set, and the second because S_{i+1} is also contained in S_i.
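For comparison, the exact Euclidean projection onto S (which the chain of approximate projections above emulates) can be computed by the pool-adjacent-violators algorithm followed by clipping at zero; this is an alternative we add for illustration, not the paper's method, and the names are ours.

```python
def project_monotone_nonneg(v):
    """Exact projection onto S = {v : v_1 >= ... >= v_d >= 0} via
    pool-adjacent-violators for a nonincreasing fit, then clipping at 0."""
    blocks = []  # (sum, count) blocks whose means are kept nonincreasing
    for x in v:
        s, c = float(x), 1
        # merge while the new block's mean exceeds the previous one's
        while blocks and s / c > blocks[-1][0] / blocks[-1][1]:
            ps, pc = blocks.pop()
            s, c = s + ps, c + pc
        blocks.append((s, c))
    out = []
    for s, c in blocks:
        out.extend([max(s / c, 0.0)] * c)
    return out

assert project_monotone_nonneg([-1.0, 1.0]) == [0.0, 0.0]
assert project_monotone_nonneg([3.0, 1.0, 2.0]) == [3.0, 1.5, 1.5]
```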

Note that to evaluate the right hand side of equation (18) we do not require full knowledge of the projections P_{S_i}: we only need to compute each P_{S_{i+1}} at a point which already belongs to S_i. The next proposition describes a recursive formula to achieve this step.

Proposition 5.

For any , we express its first elements as , where the last is the largest integer such that . It holds that

where 1_j denotes the j-dimensional vector with all of its elements equal to 1.

Proof.

The first case is straightforward. In the following we prove the remaining two. In both cases it will be useful to recall that the projection onto any convex set A is characterized by the condition

 x = P_A(v)  if and only if  ⟨v − x, z − x⟩ ≤ 0, ∀z ∈ A.   (19)

To prove the second case, we use property (19) and apply simple algebraic transformations to obtain, for all z ∈ S_{i+1}, that

 ⟨v − P_{S_{i+1}}(v), z − P_{S_{i+1}}(v)⟩ = ((v_{i+1} − v_i)/(j+1)) ( j z_{i+1} − ∥z_{i−j+1:i}∥₁ ) ≤ 0.

Finally we prove the third case. We want to show that if then

 x=PSi+1([v1:i−j,vi−j1j,vi+1−(vi−j−vi)j,vi+2:d]).

By using property (19), the last equation is equivalent to the statement that if

 ⟨v−x,z−x⟩≤0,∀z∈Si+1  then (20)
 ⟨[v1:i−j,vi−j1j,vi+1−(vi