A Randomized Tensor Train Singular Value Decomposition

Abstract

The hierarchical SVD provides a quasi-best low rank approximation of high-dimensional data in the hierarchical Tucker framework. Similar to the SVD for matrices, it is a fundamental but expensive tool for tensor computations. In the present work we examine generalizations of randomized matrix decomposition methods to higher order tensors in the framework of hierarchical tensor representations. In particular we present and analyze a randomized algorithm for the calculation of the hierarchical SVD (HSVD) for the tensor train (TT) format.

1 Introduction

Low rank matrix decompositions, such as the singular value decomposition (SVD) and the QR decomposition, are principal tools in data analysis and scientific computing. For matrices with small rank both decompositions offer a tremendous reduction in computational complexity and can expose the underlying problem structure. In recent years generalizations of these low rank decompositions to higher order tensors have proven to be very useful and efficient techniques as well. In particular the hierarchical Tucker [hackbusch2009new] and the tensor train [oseledets2011tensor] formats have made quite an impact, as both allow one to circumvent the notorious curse of dimensionality, i.e. the exponential scaling of the ambient spaces with respect to the order of the tensors. Applications of these formats range from high-dimensional PDEs such as the Fokker-Planck equation and the many-particle Schrödinger equation to problems in neuroscience, graph analysis, signal processing, computer vision and computational finance, see the extensive survey of \citetgrasedyck2013literature. In a recent machine learning paper, \citetcohen2015expressive showed a connection between these tensor formats and deep neural networks and used it to explain the much greater expressive power of deep neural networks over shallow ones.

One of the main challenges when working with these formats is the calculation of low rank decompositions of implicitly or explicitly given tensors, i.e. the high-dimensional analog of the classical SVD calculation. For matrices there exists a wide range of methods which allow these calculations with high efficiency and precision. One particular branch consists of randomized methods, which appear often in the literature, mostly as efficient heuristics to calculate approximate decompositions. It was only recently that, thanks to new results from random matrix theory, a rigorous analysis of these procedures became possible, see [halko2011finding]. In this work we aim to extend some of these results for randomized matrix decompositions to the high dimensional tensor case. To this end we present an algorithm which allows the efficient calculation of the tensor train SVD (TT-SVD) for general higher order tensors. Especially for sparse tensors this algorithm exhibits superior complexity, scaling only linearly in the order, compared to the exponential scaling of the naive approach. Extending the results of [halko2011finding], we show that stochastic error bounds can also be obtained for these higher order methods.

This work focuses on the theoretical and algorithmic aspects of this randomized (TT-)SVD. However, one particular application we have in mind is the work in [rauhut2015tensor, rauhut2016low], where we treat the tensor completion problem. That is, in analogy to matrix completion, see e.g. [candes2009exact, recht2011simpler, cai2010singular], we want to reconstruct a tensor from measurements using a low rank assumption. We use an iterative (hard) thresholding procedure, which requires the (approximate) calculation of a low rank decomposition in each iteration of the algorithm. As the deterministic TT-SVD is already a fundamental tool, there are of course many further possible applications for our randomized variant, see for example [bachmayr2015adaptive, bachmayr2016tensor, bachmayr2016iterative].

We start with a brief recap of tensor product spaces and introduce the notation used in the remainder of this work. In section 2 we give an overview of different tensor decompositions, generalizing the singular value decomposition from matrices to higher order tensors. In the second part a detailed introduction of the tensor train format is provided. Section 3.1 summarizes results for randomized matrix decompositions which are important for this work. In section 3.2 we introduce our randomized TT-SVD scheme and prove stochastic error bounds for this procedure. An interesting relation between the proposed algorithm and the popular alternating least squares (ALS) algorithm is examined in section 4. Section 5 collects several numerical experiments showing the performance of the proposed algorithms. Section 6 closes with some concluding remarks.

1.1 Tensor product spaces

Let us begin with some preliminaries on tensors and tensor spaces. For an exhaustive introduction we refer to the monograph of \citethackbusch2012tensorBuch.

Given Hilbert spaces the tensor product space of order

is defined as the closure of the span of all elementary tensor products of vectors from , i.e.

The elements are called tensors of order . If each space is supplied with an orthonormal basis , then any can be represented as

Using this basis, with a slight abuse of notation, we can identify with its representation by a -variate function, often called a hypermatrix,

depending on discrete variables, usually called indices . Analogous to vectors and matrices we use square brackets to index the entries of this hypermatrix. Of course, the actual representation of depends on the chosen bases of . The index is said to correspond to the -th mode or equivalently the -th dimension of the tensor.
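To make the hypermatrix point of view concrete, the following is a minimal NumPy sketch (all dimensions and entries are made up for illustration): once bases are fixed, an order three tensor is simply a three-dimensional array addressed by one index per mode.

```python
import numpy as np

# Hypothetical order-3 tensor with mode dimensions 2, 3 and 4, represented as a
# NumPy array once bases of the three spaces have been fixed.
x = np.random.rand(2, 3, 4)

print(x.shape)     # (2, 3, 4) -- one dimension per mode
print(x[0, 2, 3])  # a single entry, addressed by one index per mode
```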

In the remainder of this article we confine ourselves to finite dimensional real linear spaces , however most parts extend easily to the complex case as well. For these, the tensor product space

is easily defined. If it is not stated explicitly, the are supplied with the canonical basis of the vector spaces . Then every can be represented as

(1)

We equip the finite dimensional linear space with the inner product

and the corresponding -norm . We use the fact that for a Hilbert space the dual space is isomorphic to and use the identification . For the treatment of reflexive Banach spaces we refer to [falco2012minimal, falco2015geometric].

The number of possibly non-zero entries in the representation of is , and with , the dimension of the space scales exponentially in , i.e. . This is often referred to as the curse of dimensionality and presents the main challenge when working with higher order tensors.

1.2 Tensor Contractions and Diagrammatic Notation

Important concepts for the definition of tensor decompositions are the so-called matricizations and contractions introduced in this section.

The matricization or flattening of a tensor is the reinterpretation of the given tensor as a matrix, obtained by combining a subset of its modes into a single mode and the remaining modes into a second one.

Definition 1 (Matricization or Flattening).

Let and be a tensor of order . Furthermore let be a subset of the modes of , and let be its complement. Given two bijective functions and respectively.

The -matricization or -flattening

(2)
(3)

of is defined entry-wise as

(4)

A common choice for and is . The actual choice is of no significance though, as long as it stays consistent. The matrix dimensions are given as and .

The inverse operation is the de-matricization or unflattening . In principle it is possible to define a de-matricization for any kind of matrix, an operation typically called tensorization. However this requires specifying the dimensions of the resulting tensor and the details of the mapping alongside the operator. Instead, in this work the de-matricization is only applied to matrices where at least one mode of the matrix encodes a tensor structure through a former matricization, in which case the details of the mapping are clear from the context. For all other modes the de-matricization is simply defined to be the identity.
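As an illustration, the following NumPy sketch realizes the matricization and de-matricization for the common index ordering mentioned above. The function names matricize and dematricize, and the realization of the bijections by a mode permutation followed by a row-major reshape, are our own choices for this sketch.

```python
import numpy as np

def matricize(x, row_modes):
    """Matricization: move the modes in `row_modes` (in ascending order) to the
    front and reshape into a matrix; the remaining modes form the column index."""
    col_modes = [m for m in range(x.ndim) if m not in row_modes]
    perm = sorted(row_modes) + col_modes
    rows = int(np.prod([x.shape[m] for m in row_modes]))
    return np.transpose(x, perm).reshape(rows, -1), perm

def dematricize(mat, orig_shape, perm):
    """Invert `matricize`, given the original shape and the mode permutation used."""
    permuted_shape = [orig_shape[m] for m in perm]
    inv_perm = np.argsort(perm)
    return mat.reshape(permuted_shape).transpose(inv_perm)

# Round trip on a random order-4 tensor: modes 0 and 2 form the row index.
x = np.random.rand(2, 3, 4, 5)
m, perm = matricize(x, [0, 2])
assert m.shape == (2 * 4, 3 * 5)
assert np.allclose(x, dematricize(m, x.shape, perm))
```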

The second important tool are tensor contractions, which are generalizations of the matrix-vector and matrix-matrix multiplications to higher order tensors.

Definition 2 (Tensor Contraction).

Let and be two tensors of order and respectively, with . The contraction of the -th mode of with the -th mode of

(5)

is defined entry-wise as

or via the matricizations

The resulting tensor is of order . Note that in order for this operation to be well-defined, must hold.

If no indices are specified, i.e. only , a contraction of the last mode of the left operand and the first mode of the right operand is assumed. If tuples of indices are given, e.g. , a contraction of the respective mode pairs (, ) is assumed. As writing this out for larger tensor expressions quickly becomes cumbersome, we also use a diagrammatic notation to visualize the contractions. In this notation a tensor is depicted as a dot or box with edges corresponding to each of its modes. If appropriate, the cardinality of the corresponding index set is given as well. From left to right the following shows this for an order one tensor (vector) , an order two tensor (matrix) and an order four tensor .

If a contraction is performed between modes of two tensors, the corresponding edges are joined. The following shows this exemplarily for the inner product of two vectors and a matrix-vector product with and .
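In code, such contractions correspond to np.tensordot. The following minimal sketch (with made-up dimensions) reproduces the two diagram examples and adds one general contraction of a specified mode pair.

```python
import numpy as np

A = np.random.rand(4, 6)
u = np.random.rand(6)
v = np.random.rand(6)

# Inner product: contract the only mode of u with the only mode of v.
assert np.isclose(np.tensordot(u, v, axes=([0], [0])), u @ v)

# Matrix-vector product: contract the second mode of A with the mode of u.
assert np.allclose(np.tensordot(A, u, axes=([1], [0])), A @ u)

# General case: contract mode 2 of an order-3 tensor with mode 0 of a matrix;
# the result is of order 3 + 2 - 2 = 3.
x = np.random.rand(2, 3, 6)
y = np.random.rand(6, 5)
z = np.tensordot(x, y, axes=([2], [0]))
assert z.shape == (2, 3, 5)
```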

There are two special cases concerning orthogonal and diagonal matrices. If a specific matricization of a tensor yields an orthogonal or diagonal matrix, the tensor is depicted by a half filled circle (orthogonal) or a circle with a diagonal bar (diagonal) respectively. The half filling and the diagonal bar both divide the circle into two halves. The edges joined to either half correspond to the mode sets of the matricization which yields the orthogonal or diagonal matrix. As an example, the diagrammatic notation can be used to depict the singular value decomposition of a matrix with rank , as shown in the following.

2 Low Rank Tensor Decompositions

In this section we give an introduction to the low rank tensor decomposition techniques used in the remainder of this work. As there are in fact quite different approaches to generalize the singular value decomposition, and thereby also the definition of the rank, to higher order tensors, we start with an overview of the most popular formats. For an in-depth overview including applications we refer to the survey of \citetgrasedyck2013literature. In the second part we provide a detailed introduction of the tensor train format, which is used in the remainder of this work.

Probably the best known and most classical tensor decomposition is the representation as a sum of elementary tensor products, i.e.

(6)

where and are vectors from the respective vector spaces. This format is mainly known as the canonical format, but it also appears in the literature under the names canonical polyadic (CP) format, CANDECOMP and PARAFAC. The canonical or CP rank is defined as the minimal such that a decomposition as in (6) exists. Note that in general there is no unique CP representation with minimal rank. This is somewhat expected, since even for matrices the SVD is not unique if two or more singular values coincide. Some discussion of the uniqueness can be found in the paper of \citetkolda2009tensor. For tensors with small canonical rank (6) offers a very efficient representation, requiring only storage instead of for the direct representation. Unfortunately the canonical format suffers from several difficulties and instabilities. First of all, the task of determining the canonical rank of a tensor with order is, in contrast to matrices, highly non-trivial. In fact it was shown by [haastad1990tensor] that even for order $d = 3$, the problem of deciding whether a rational tensor has CP-rank $r$ is NP-hard (and NP-complete for finite fields). Consequently the problem of calculating low rank approximations also proves to be challenging. That is, given a tensor and a CP-rank , finding the best CP-rank approximation

(7)

The norm used may differ depending on the application. In the matrix case the Eckart-Young theorem guarantees that for the Frobenius and spectral norms this best approximation can be calculated straightforwardly by a truncated SVD. In contrast, \citetkeinCP proved that the problem of the best CP-rank approximation, as formulated in (7), is ill-posed for many ranks $r \geq 2$ and all orders $d \geq 3$, regardless of the choice of the norm. Furthermore they showed that the set of tensors that do not have a best CP-rank approximation is a non-null set, i.e. there is a strictly positive probability that a randomly chosen tensor does not admit a best CP-rank approximation. Finally it was shown by \citetkeinCP that neither the set of all tensors with CP-rank exactly $r$, nor the set of all tensors with CP-rank at most $r$, is closed for $r \geq 2$. These are severe difficulties for both theoretical and practical work with the canonical format.
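As an illustration of the format itself (not of the rank computation, which, as discussed, is hard in general), the following sketch assembles a full order three tensor from made-up CP factor matrices; the function name cp_to_full and the factor-matrix convention are our own choices.

```python
import numpy as np

def cp_to_full(U1, U2, U3):
    """Assemble a full order-3 tensor from CP factor matrices of shape (n_mu, R):
    x[i, j, l] = sum_k U1[i, k] * U2[j, k] * U3[l, k]."""
    return np.einsum('ik,jk,lk->ijl', U1, U2, U3)

R = 2                                              # CP representation rank
U1, U2, U3 = (np.random.rand(n, R) for n in (4, 5, 6))
x = cp_to_full(U1, U2, U3)
print(x.shape)                                     # (4, 5, 6)
# Storage: R * (4 + 5 + 6) = 30 numbers instead of 4 * 5 * 6 = 120 dense entries.
```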

The second classical approach to generalize the SVD to higher order tensors is the subspace based Tucker decomposition. It was first introduced by \citettucker1966 in 1966 and has since been refined in many works, see e.g. [kolda2009tensor, de2000multilinear, hackbusch2012tensorBuch]. Given a tensor , the main idea is to find minimal subspaces , such that is an element of the induced tensor space

(8)

Let denote the dimension of the -th subspace and let be an orthonormal basis of . If the subspaces are chosen such that , then (1) states that there is a such that can be expressed as

(9)

Usually the basis vectors are combined into orthogonal matrices , called basis matrices. This leads to the following common form of the Tucker format

(10)

The order tensor of the prefactors is usually called the core tensor. The -tuple of the subspace dimensions is called the representation rank and is associated with the particular representation. The Tucker rank (T-rank) of is defined as the unique minimal -tuple , such that there exists a Tucker representation of with rank . Equation (10) consists of tensor contractions that can be visualized in the diagrammatic notation, as exemplarily shown in Figure 1 for . Note that even for the minimal T-rank, the Tucker decomposition is not unique, as for any orthogonal matrix , one can define a matrix and the tensor

such that the tensor can also be written as

which is a valid Tucker decomposition with the same rank.

Figure 1: Left: A tensor of order . Right: Its Tucker decomposition.

It was shown by \citetde2000multilinear that the Tucker rank as the minimal -tuple is indeed well-defined and that the entries of the Tucker rank correspond to the rank of the -th mode matricization of the tensor. That is

(11)

The proof is tightly linked to the fact that a Tucker representation of a tensor with minimal representation rank can be obtained by successive singular value decompositions. This procedure is referred to as the higher order singular value decomposition (HOSVD), see [de2000multilinear] for the details. Using truncated SVDs, an approximation of by a tensor with lower T-rank can be obtained, where lower is to be understood entry-wise, i.e. in every component of the rank tuple. In contrast to the Eckart-Young theorem for matrices, the approximation obtained in this way is not the best T-rank approximation of . However it is a quasi-best approximation by a factor $\sqrt{d}$, i.e.

(12)

For many applications this quasi-best approximation is sufficient. As for the canonical format, finding the true best approximation is at the very least NP-hard in general, as it was shown by [hillar2013most] that already finding the best rank-one approximation is NP-hard. To store a tensor in the Tucker format only the core tensor and the basis matrices have to be stored. This amounts to a storage requirement of $\mathcal{O}(r^d + dnr)$, where $r$ and $n$ denote the maximal entries of the rank tuple and of the mode dimensions respectively. Compared to the $\mathcal{O}(n^d)$ of the direct representation this is a major reduction, but it does not break the curse of dimensionality, as the exponential scaling in the order remains.
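The following NumPy sketch illustrates the HOSVD construction mentioned above: one (possibly truncated) SVD per single-mode matricization yields the basis matrices, and projecting onto them yields the core tensor. The function names and the tensordot-based mode products are our own choices, not notation from [de2000multilinear].

```python
import numpy as np

def hosvd(x, ranks):
    """Truncated higher order SVD: one truncated SVD per single-mode
    matricization gives the basis matrices; projecting x onto them gives the core."""
    basis = []
    for mu in range(x.ndim):
        mat = np.moveaxis(x, mu, 0).reshape(x.shape[mu], -1)  # mode-mu matricization
        U, _, _ = np.linalg.svd(mat, full_matrices=False)
        basis.append(U[:, :ranks[mu]])
    core = x
    for U in basis:
        # Contract the current leading mode with U; the new mode is appended at the
        # end, so after x.ndim steps the modes are back in their original order.
        core = np.tensordot(core, U, axes=([0], [0]))
    return core, basis

def tucker_to_full(core, basis):
    """Contract the core tensor with the basis matrices to recover the full tensor."""
    x = core
    for U in basis:
        x = np.tensordot(x, U, axes=([0], [1]))
    return x

x = np.random.rand(6, 7, 8)
core, basis = hosvd(x, ranks=(6, 7, 8))       # no truncation: exact reconstruction
assert np.allclose(tucker_to_full(core, basis), x)
core_t, basis_t = hosvd(x, ranks=(3, 3, 3))   # truncated: a quasi-best approximation
print(np.linalg.norm(x - tucker_to_full(core_t, basis_t)))
```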

A more recent development is the hierarchical Tucker (HT) format, introduced by \citethackbusch2009new. It inherits most of the advantages of the Tucker format, in particular a generalized higher order SVD, see [grasedyck2010hierarchical]. But in contrast to the Tucker format, the HT format allows a linear scaling with respect to the order for the storage requirements and common operations for tensors of fixed rank. The main idea of the HT format is to extend the subspace approach of the Tucker format by a multi-layer hierarchy of subspaces. For an in-depth introduction of the hierarchical Tucker format we refer to the pertinent literature, e.g. [hackbusch2009new, grasedyck2010hierarchical, hackbusch2012tensorBuch]. In this work we will instead focus on the tensor train (TT) format, as introduced by \citetoseledets2011tensor. The TT format offers mostly the same advantages as the more general HT format, while maintaining a powerful simplicity. In fact it can to some extent be seen as a special case of the HT format, see \citetgrasedyck2011introduction for details on the relation between the TT and HT formats.

2.1 Tensor Train Format

In this subsection we give a detailed introduction to the tensor train (TT) format. In the formulation used in this work the TT format was introduced by \citetoseledets2011tensor; however, an equivalent formulation has been known in quantum physics for quite some time, see e.g. [perez2006matrix] for an overview. The idea of the TT format is to separate the modes of a tensor into order two and order three component tensors. This results in a tensor network that is shown exemplarily for an order four tensor in the following diagram.

Formally it can be defined as follows.

Definition 3 (Tensor Train Format).

Let be a tensor of order . A factorization

(13)

of , into component tensors , () and , is called a tensor train (TT) representation of . Equivalently (13) can be given entry-wise as

The tuple of the dimensions of the component tensors is called the representation rank and is associated with the specific representation. In contrast the tensor train rank (TT-rank) of is defined as the minimal rank tuple such that there exists a TT representation of with representation rank equal to .

As for the Tucker format the TT-rank is well defined and linked to the rank of specific matricizations via

The proof is again closely linked to the fact that a tensor train decomposition of an arbitrary tensor can be calculated using successive singular value decompositions. This procedure is commonly referred to as the TT-SVD. For this work the TT-SVD is of particular importance as it constitutes the deterministic baseline for our randomized approach in section 3.2. In the following we therefore provide a complete step by step description of this procedure.
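Before describing the TT-SVD itself, the following small sketch shows how a tensor given in TT representation (component tensors with the shapes from Definition 3; the dimensions and ranks here are made up) is expanded back into the full tensor by contracting neighbouring rank modes.

```python
import numpy as np

def tt_to_full(cores):
    """Contract a list of TT component tensors into the full tensor.
    Shapes: first (n_1, r_1), middle (r_{i-1}, n_i, r_i), last (r_{d-1}, n_d)."""
    x = cores[0]
    for core in cores[1:]:
        x = np.tensordot(x, core, axes=([-1], [0]))  # join adjacent rank modes
    return x

# A made-up TT representation of an order-4 tensor with representation rank (2, 3, 2).
dims, ranks = (4, 5, 6, 7), (2, 3, 2)
cores = [np.random.rand(dims[0], ranks[0]),
         np.random.rand(ranks[0], dims[1], ranks[1]),
         np.random.rand(ranks[1], dims[2], ranks[2]),
         np.random.rand(ranks[2], dims[3])]
print(tt_to_full(cores).shape)  # (4, 5, 6, 7)
```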

Tensor Train Singular Value Decomposition (TT-SVD)
The intuition of the TT-SVD is that in every step a (matrix) SVD is performed to detach one open mode from the tensor. Figure 2 shows this process step by step for an order four tensor and is frequently referred to in the following description. The TT-SVD starts by calculating an SVD of the matricization of , where all modes but the first one are combined (Fig. 2 (a)-(c))

(14)

with , , . The dimension is equal to the rank of . The resulting matrices and are each dematricized, which is trivial for in the first step (Fig. 2 (d)-(e)).

(15)
(16)

Note that there holds

(17)

In the next step a matricization of the newly acquired tensor is performed. The first dimension of the matricization is formed by the first two modes of , corresponding to the new dimension introduced by the prior SVD and the second original dimension. The second dimension of the matricization is formed by all remaining modes of (Fig. 2 (f)). From this matricization another SVD is calculated (Fig. 2 (g))

(18)

with , , . As in the first step and are then dematricized (Fig. 2 (i)),

(19)
(20)

and again there holds

(21)
(22)

Figure 2: Step by step depiction of the TT-SVD by example for an order four tensor.

The obtained rank is equal to the rank of the matricization , which can be shown as follows. First note that is an orthogonal matrix, because

(23)
(24)
(25)
(26)
(27)

Then consider that

(28)
(29)
(30)

holds. This is a valid SVD of and since the matrix of singular values is unique it follows that in fact .

This procedure is continued for a total of steps and in each step the order of shrinks by one. Furthermore there holds

(31)

and in every step. The -st step (Fig. 2 (k))

(32)

with , , is special since the de-matricization of , yields an order two tensor that is named instead of (Fig. 2 (l)-(m))

(33)
(34)

Finally

(35)

is a valid TT representation of with TT-rank , whose entries are exactly the ranks of the matricizations as asserted.

The same algorithm can also be used to calculate a rank approximation of a tensor with TT-rank . To this end the exact SVDs are replaced by truncated rank SVDs, yielding a tensor of TT-rank . In contrast to the matrix case, is in general not the best rank approximation of . However, as shown by [oseledets2011tensor], it is a quasi-best approximation with

(36)

The computational complexity of the TT-SVD is dominated by the matrix singular value decompositions, with all other contributions being asymptotically negligible. With and the cost scales as , i.e. still exponentially in the order. This is somewhat expected, because in general all entries of the original tensor have to be considered. Unfortunately, the tensor being sparse or otherwise structured brings no dramatic improvement, because the structure is generally lost after the first SVD.
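To summarize the procedure, the following is a minimal NumPy sketch of the (optionally truncated) TT-SVD described above. The truncation rule used here, a simple relative singular value threshold combined with an optional maximal rank, is a stand-in for illustration rather than the error-controlled truncation used in the analysis, and the function name tt_svd is our own.

```python
import numpy as np

def tt_svd(x, max_rank=None, eps=0.0):
    """TT-SVD sketch: successive (truncated) SVDs detach one mode at a time."""
    dims = x.shape
    d = len(dims)
    cores = []
    c = x.reshape(1, -1)                              # start: 1 x (n_1 ... n_d)
    r_prev = 1
    for i in range(d - 1):
        c = c.reshape(r_prev * dims[i], -1)           # matricization for the i-th SVD
        U, s, Vt = np.linalg.svd(c, full_matrices=False)
        r = int(np.sum(s > eps * np.linalg.norm(s)))  # drop (near-)zero singular values
        if max_rank is not None:
            r = min(r, max_rank)
        cores.append(U[:, :r].reshape(r_prev, dims[i], r))
        c = s[:r, None] * Vt[:r, :]                   # carry S V^T into the next step
        r_prev = r
    cores.append(c)                                   # last component, shape (r_{d-1}, n_d)
    cores[0] = cores[0].reshape(dims[0], -1)          # first component is of order two
    return cores

# Exact decomposition and reconstruction of a random order-4 tensor.
x = np.random.rand(3, 4, 5, 6)
cores = tt_svd(x)
x_rec = cores[0]
for core in cores[1:]:
    x_rec = np.tensordot(x_rec, core, axes=([-1], [0]))
assert np.allclose(x, x_rec)

# Truncated variant: prescribe a maximal TT-rank of 2 in every step.
cores_trunc = tt_svd(x, max_rank=2)
print([core.shape for core in cores_trunc])
```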

Apart from the generalized singular value decomposition, the TT format offers several further beneficial properties. In particular it is able to break the curse of dimensionality, in the sense that the storage complexity of a tensor with TT-rank in a minimal TT representation scales as $\mathcal{O}(dnr^2)$, i.e. linearly in the order. Here