Optimization on the Hierarchical Tucker manifold - applications to tensor completion

Curt Da Silva and Felix J. Herrmann
Department of Mathematics, University of British Columbia
Department of Earth and Ocean Sciences, University of British Columbia
Abstract.

In this work, we develop an optimization framework for problems whose solutions are well-approximated by Hierarchical Tucker (HT) tensors, an efficient structured tensor format based on recursive subspace factorizations. By exploiting the smooth manifold structure of these tensors, we construct standard optimization algorithms such as Steepest Descent and Conjugate Gradient for completing tensors from missing entries. Our algorithmic framework is fast and scalable to large problem sizes as we do not require SVDs on the ambient tensor space, as required by other methods. Moreover, we exploit the structure of the Gramian matrices associated with the HT format to regularize our problem, reducing overfitting for high subsampling ratios. We also find that the organization of the tensor can have a major impact on completion from realistic seismic acquisition geometries. These samplings are far from idealized randomized samplings that are usually considered in the literature but are realizable in practical scenarios. Using these algorithms, we successfully interpolate large-scale seismic data sets and demonstrate the competitive computational scaling of our algorithms as the problem sizes grow.

1. Introduction

The matrix completion problem is concerned with interpolating a matrix from a subset of its entries. Recent successes in developing solution techniques for this problem result from assuming a low-rank model on the 2-D signal of interest and from considering subsampling schemes that increase the rank of the underlying matrix [8], [7], [9]. The original signal is recovered by promoting low-rank structures subject to data constraints.

Using a similar approach, we consider the problem of interpolating a $d$-dimensional tensor from samples of its entries. That is, we aim to solve,

(1) $\min_{\mathbf{X} \in \mathcal{H}}\ \tfrac{1}{2}\bigl\| A\mathbf{X} - b \bigr\|_2^2,$

where $A : \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d} \to \mathbb{R}^m$ is a linear operator, $b \in \mathbb{R}^m$ is our subsampled data satisfying $b = A\mathbf{X}^*$ for some “solution” tensor $\mathbf{X}^*$, and $\mathcal{H}$ is a specific class of low-rank tensors to be specified later. Under the assumption that $\mathbf{X}^*$ is well approximated by an element in $\mathcal{H}$, our goal is to recover $\mathbf{X}^*$ by solving (1). For concreteness, we concern ourselves with the case when $A$ is a restriction operator, i.e.,

$(A\mathbf{X})_i = \mathbf{X}_{\omega_i}, \qquad i = 1, \dots, m,$

and $\Omega = \{\omega_1, \dots, \omega_m\} \subset [n_1] \times [n_2] \times \cdots \times [n_d]$ is the so-called sampling set, where $|\Omega| = m$. In the above equation, we suppose that $m \ll n_1 n_2 \cdots n_d$, so that $A$ is a subsampling operator.
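
To make the sampling operator concrete, the following is a minimal Python/NumPy sketch (ours, not the authors' code; the names sampling_set, restrict, and misfit are hypothetical) that draws a random sampling set $\Omega$, applies the restriction operator $A$, and evaluates the least-squares misfit in (1) for a trial tensor.

```python
import numpy as np

def sampling_set(shape, m, rng):
    """Draw m multi-indices (rows of an m x d array) uniformly at random."""
    return np.stack([rng.integers(0, n, size=m) for n in shape], axis=1)

def restrict(X, omega):
    """Restriction operator A: keep only the entries of X indexed by omega."""
    return X[tuple(omega.T)]

def misfit(X, omega, b):
    """Least-squares data misfit 0.5 * ||A X - b||_2^2 from (1)."""
    r = restrict(X, omega) - b
    return 0.5 * np.dot(r, r)

rng = np.random.default_rng(0)
shape = (10, 10, 10, 10)
Xtrue = rng.standard_normal(shape)           # stand-in for the "solution" tensor
omega = sampling_set(shape, m=500, rng=rng)  # m << prod(shape)
b = restrict(Xtrue, omega)                   # subsampled data
print(misfit(np.zeros(shape), omega, b))     # misfit of the zero tensor
```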

Unlike the matrix case, there is no unique notion of rank for tensors, as we shall see in Section 1.1, and there are multiple tensor formats that generalize a particular notion of separability from the matrix case, i.e., there is no unique extension of the SVD to tensors. Although each tensor format can lead to compressible representations of its respective class of low-rank signals, the truncation of a general signal to one of these formats requires access to the fully sampled tensor (or at the very least query-based access to the tensor [4]) in order to achieve reasonable accuracy, owing to the use of truncated SVDs acting on various matricizations of the tensor. As in matrix completion, randomized missing entries change the behavior of the singular values and vectors of these matricizations and hence of the final approximation. Moreover, when the tensor of interest is actually a discretized continuous signal, there can be a number of constraints, physical or otherwise, that limit our ability to sample it ideally. For instance, in the seismic case, the tensor of interest is a multi-dimensional wavefield in the earth’s subsurface sampled at an array of receivers located at the surface. In real-world seismic experiments, budgetary constraints or environmental obstructions can limit both the total amount of time available for data acquisition and the number and placement of active sources and receivers. Since seismic and other methods rely on having fully sampled data for drawing accurate inferences, tensor completion is an important technique for a variety of scientific fields that acquire multidimensional data.

In this work, we consider the class of Hierarchical Tucker (abbreviated HT) tensors, introduced in [21, 18], as our low-rank tensors of interest. The set of all such tensors is a smooth, embedded submanifold of $\mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$, first studied in [42], which we equip with a Riemannian metric. Using this Riemannian structure, we can construct optimization algorithms in order to solve (1) for $d$-dimensional tensors. We will also study some of the effects of higher dimensional sampling and extend ideas from compressive sensing and matrix completion to the HT tensor case for our specific seismic examples.

1.1. Previous Work

To provide the reader with some context on tensor representations, let us briefly detail some of the available structured tensor formats, including tensor completion results, here (see [26] and [28] for a very comprehensive overview). Here we let $N$ denote the maximum individual dimension size, $d$ the dimension of the ambient space $\mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$, and, for each tensor format discussed, $K$ the maximum of all of the rank parameters associated to that format.

The so-called Candecomp/Parafac (CP) decomposition is a very straightforward application of the separation of variables technique. Very much like the SVD of a matrix, one stipulates that, for a function $f(x_1, x_2, \dots, x_d)$ living on a tensor product space, one can write

$f(x_1, x_2, \dots, x_d) = \sum_{i=1}^{r} f_1^i(x_1)\, f_2^i(x_2) \cdots f_d^i(x_d).$

Thus its discretization $\mathbf{X}$ can be written as

$\mathbf{X} = \sum_{i=1}^{r} x_1^i \otimes x_2^i \otimes \cdots \otimes x_d^i,$

where $\otimes$ is the Kronecker product and $x_j^i \in \mathbb{R}^{n_j}$ is the discretization of the one-dimensional function $f_j^i(x_j)$. In addition to its intuitive construction, the CP decomposition of rank $r$ only requires $dNr$ parameters versus the $N^d$ of the full tensor, and tensor-tensor operations can be performed efficiently on the underlying factors rather than the full tensors themselves (see [3] for a comprehensive set of MATLAB tools).
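
As an illustration, the following Python/NumPy sketch (ours; the function name cp_tensor is hypothetical) forms the full tensor from CP factor matrices by summing outer products of their columns, which is exactly the discretized expansion above.

```python
import numpy as np

def cp_tensor(factors):
    """Form sum_i x_1^i o x_2^i o ... o x_d^i from CP factors.

    factors[j] is an (n_j x r) matrix whose i-th column discretizes f_j^i."""
    r = factors[0].shape[1]
    shape = tuple(U.shape[0] for U in factors)
    X = np.zeros(shape)
    for i in range(r):
        term = factors[0][:, i]
        for U in factors[1:]:
            term = np.multiply.outer(term, U[:, i])  # outer (tensor) product
        X += term
    return X

rng = np.random.default_rng(0)
factors = [rng.standard_normal((n, 3)) for n in (4, 5, 6)]  # rank 3, d = 3
X = cp_tensor(factors)
print(X.shape)  # (4, 5, 6); storage: 3*(4+5+6) numbers vs 4*5*6 for X itself
```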

Unfortunately, despite the parsimoniousness of the CP construction, the approximation of an arbitrary (full) tensor by CP tensors has both theoretical and numerical difficulties. In particular, the set of all CP tensors of rank at most $r$ is not closed, and thus a best rank-$r$ approximation is difficult to compute in many cases [13]. Despite this shortcoming, various authors have proposed iterative and non-iterative algorithms in the CP format for approximating full tensors [28] as well as interpolating tensors with missing data, such as the Alternating Least Squares approach (a block Gauss-Seidel type method) proposed alongside the CP format in [10] and [22], with convergence analysis in [41], and a nonlinear least-squares optimization scheme in [2].

The CP format is a specific case of the more general Tucker format, which aims to write a tensor as a multilinear product

$\mathbf{X} = \mathbf{C} \times_1 U_1 \times_2 U_2 \cdots \times_d U_d,$

where $\mathbf{C} \in \mathbb{R}^{k_1 \times k_2 \times \cdots \times k_d}$ is the so-called core tensor and the matrices $U_j \in \mathbb{R}^{n_j \times k_j}$, $j = 1, \dots, d$, are the factors of the decomposition. Here we use the notation of the multilinear product, that is, $\times_j U_j$ indicates that $\mathbf{C}$ is multiplied by $U_j$ in dimension $j$, e.g., see [13, 28]. We will elaborate on this construction in Section 2.2. The CP format follows from this formulation when the core tensor is diagonal, i.e., $\mathbf{C}_{i_1 i_2 \cdots i_d} = \delta_{i_1 i_2 \cdots i_d}$, where $\delta_{i_1 i_2 \cdots i_d} = 1$ when $i_1 = i_2 = \cdots = i_d$ and $0$ otherwise.
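
For concreteness, the sketch below (ours, with zero-based mode indexing; the name tucker_tensor is hypothetical) forms a 3-dimensional Tucker tensor from a core and its factors with a single einsum, and shows how a superdiagonal core recovers the CP format.

```python
import numpy as np

def tucker_tensor(C, U1, U2, U3):
    """Multilinear product C x_1 U1 x_2 U2 x_3 U3 for a 3-D core C."""
    return np.einsum('abc,ia,jb,kc->ijk', C, U1, U2, U3)

rng = np.random.default_rng(0)
k1, k2, k3 = 2, 3, 4
n1, n2, n3 = 5, 6, 7
C = rng.standard_normal((k1, k2, k3))
U1, U2, U3 = (rng.standard_normal((n, k))
              for n, k in zip((n1, n2, n3), (k1, k2, k3)))
X = tucker_tensor(C, U1, U2, U3)
print(X.shape)  # (5, 6, 7)

# A superdiagonal core recovers the CP format (here rank 2):
r = 2
D = np.zeros((r, r, r))
D[np.arange(r), np.arange(r), np.arange(r)] = 1.0
V1, V2, V3 = (rng.standard_normal((n, r)) for n in (n1, n2, n3))
Xcp = tucker_tensor(D, V1, V2, V3)  # equals a sum of r outer products
```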

The Tucker format enjoys many benefits in terms of approximation properties over its CP counterpart. Namely, the set of all Tucker tensors of multilinear rank at most $(k_1, k_2, \dots, k_d)$ is closed, and as a result every tensor has a best Tucker approximation of multilinear rank at most $(k_1, k_2, \dots, k_d)$. A near-optimal approximation can be computed efficiently by means of the Higher Order SVD [12]. For the tensor completion problem, the authors in [17] consider the problem of recovering a Tucker tensor with missing entries using the Douglas-Rachford splitting technique, which decouples interpolation and regularization by nuclear norm penalization of different matricizations of the tensor into subproblems that are then solved via a particular proximal mapping. An application of this approach to seismic data is detailed in [29] for the interpolation problem and [30] for denoising. Depending on the size and ranks of the tensor to be recovered, there are theoretical and numerical indications that this approach is no better than penalizing the nuclear norm in a single matricization (see [36] for a theoretical justification in the Gaussian measurement case, as well as [39] for an experimental demonstration of this effect). Some preliminary results on theoretical guarantees for recovering low-rank Tucker tensors from subsampled measurements are given in [25], for pointwise measurements and a suitable, tensor-based incoherence condition, and [35], which considers a nuclear norm penalty of the matricization of the first half of the modes of $\mathbf{X}$, as opposed to a sum of nuclear norms of each of its modes, as is typically considered.

Aside from convex relaxations of the tensor rank minimization problem, the authors in [32] develop an alternative manifold-based approach to Tucker tensor optimization, similar to our considerations for the Hierarchical Tucker case, and subsequently complete such tensors with missing entries. Each evaluation of the objective and Riemannian gradient in that setting involves the $K^d$ entries of the Tucker core tensor, whereas the corresponding quantities for our method involve only on the order of $dK^3$ entries of the HT transfer tensors. As a result of using the Hierarchical Tucker format instead of the Tucker format, our method scales much better as $d$, $N$, and $K$ grow.

Previous work in completing tensors in the Tensor Train format, which is the Hierarchical Tucker format with a specific, degenerate binary dimension tree, includes [19, 23], wherein the authors use an alternating least-squares approach for the tensor completion problem. The derivations of the smooth manifold structure of the set of TT tensors can be found in [24]. This work is a precursor for the manifold structure of Hierarchical Tucker tensors studied in [42], upon which we expand in this article. For a comprehensive review of various tensor formats, we refer the reader to [27, 20].

Owing to its extremely efficient storage requirements (which are linear in the dimension $d$ as opposed to exponential in $d$), the Hierarchical Tucker format has enjoyed a recent surge in popularity for parametrizing high-dimensional problems. The hTucker toolbox [31] contains a suite of MATLAB tools for working with tensors in the HT format, including efficient vector space operations, matrix-tensor and tensor-tensor products, and truncations of full arrays to HT format. This truncation, the so-called Hierarchical SVD developed in [18], allows one to approximate a full tensor in HT format with a near-optimal approximation error. Even though the authors in [4] develop an HT truncation method that does not need access to every entry of the tensor in order to form the HT approximation, their approach requires algorithm-driven access to the entries, which does not apply for the seismic examples we consider below. An HT approach for solving dynamical systems is outlined in [33], which considers similar manifold structure as in this article applied in a different context.

1.2. Contributions and Organization

In this paper, we extend the primarily theoretical results of [42] to practical algorithms for optimization on the HT manifold. In Section 3.1, we introduce the Hierarchical Tucker format. We restate some of the results of [42] in Section 3.2 to provide context for the Riemannian metric we introduce on the quotient manifold in Section 4. Equipped with this metric, we can develop optimization methods on the HT manifold in Section 5 that are fast and SVD-free. For large-scale, high-dimensional problems, the computational costs of SVDs are prohibitive and affect the scalability of tensor completion methods such as [17]. Since we are using the HT manifold rather than the Tucker manifold, we avoid an exponential dependence on the internal rank parameters as in [32]. In Section 5.4, we exploit the structure of HT tensors to regularize different matricizations of the tensor without having to compute SVDs of these matricizations, lessening the effects of overfitting when there are very few samples available. To the best of our knowledge, our approach is the first instance of exploiting the manifold structure of HT tensors for solving the tensor completion problem. We conclude by demonstrating the effectiveness of our techniques on interpolating various seismic data volumes with missing data points in all dimensions as well as missing receivers, the latter being the more realistic scenario. Our numerical results are similar to those presented previously in [11], but are much more extensive and include our regularization and Gauss-Newton based methods. In this paper, we also compare our method to a reference implementation of [32] and achieve very reasonable results for our seismic data volumes.

We note that the algorithmic results here generalize readily to complex tensor completion and more general subsampling operators $A$.

2. Notation

In this paper, we denote vectors by lower case letters $x$, matrices by upper case, plain letters $X$, and tensors by upper case, bold letters $\mathbf{X}$.

2.1. Matricization

We let the matricization $X^{(t)}$ of a tensor $\mathbf{X} \in \mathbb{R}^{n_1 \times \cdots \times n_d}$ along the modes $t \subset \{1, \dots, d\}$ be the matrix such that the indices in $t$ are vectorized along the rows and the indices in $t^c = \{1, \dots, d\} \setminus t$ are vectorized along the columns, i.e., if we set $n_t = \prod_{i \in t} n_i$ and $n_{t^c} = \prod_{j \in t^c} n_j$, then

$X^{(t)} \in \mathbb{R}^{n_t \times n_{t^c}}, \qquad X^{(t)}_{(i_k)_{k \in t},\, (i_j)_{j \in t^c}} = \mathbf{X}_{i_1 i_2 \cdots i_d}.$

We also use the notation $\operatorname{Mat}_t^{-1}$ for the dematricization operation, i.e., $\operatorname{Mat}_t^{-1}\bigl(X^{(t)}\bigr) = \mathbf{X}$, which reshapes the matricized version of $\mathbf{X}$ along modes $t$ back to its full tensor form.
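
In Python/NumPy, matricization and dematricization reduce to a transpose followed by a reshape. The sketch below (ours, using zero-based modes; the names matricize and dematricize are hypothetical) follows the convention above, with the modes in t indexing the rows.

```python
import numpy as np

def matricize(X, t):
    """Matricization X^(t): modes in t index the rows, the rest the columns."""
    t = list(t)
    tc = [i for i in range(X.ndim) if i not in t]
    n_t = int(np.prod([X.shape[i] for i in t]))
    return np.transpose(X, t + tc).reshape(n_t, -1)

def dematricize(M, t, shape):
    """Inverse of matricize: reshape M back into a tensor with the given shape."""
    t = list(t)
    tc = [i for i in range(len(shape)) if i not in t]
    perm = t + tc
    inv = np.argsort(perm)
    return M.reshape([shape[i] for i in perm]).transpose(inv)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5, 6))
M = matricize(X, (0, 2))            # shape (3*5, 4*6)
assert np.allclose(dematricize(M, (0, 2), X.shape), X)
print(M.shape)
```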

2.2. Multilinear product

A natural operation to consider on tensors is that of the multilinear product [42, 18, 28].

Definition 1.

Given a tensor $\mathbf{X}$ of size $n_1 \times n_2 \times \cdots \times n_d$ and matrices $A_i \in \mathbb{R}^{m_i \times n_i}$, $i = 1, \dots, d$, the multilinear product of $\mathbf{X}$ with $(A_1, A_2, \dots, A_d)$, written $\mathbf{X} \times_1 A_1 \times_2 A_2 \cdots \times_d A_d$, is the tensor $\mathbf{Y} \in \mathbb{R}^{m_1 \times m_2 \times \cdots \times m_d}$ obtained by successively applying each single-mode product, which is defined in terms of the matricizations of $\mathbf{X}$ as

$\bigl(\mathbf{X} \times_i A_i\bigr)^{(i)} = A_i\, X^{(i)}, \qquad i = 1, \dots, d.$

Conceptually, we are applying each operator $A_i$ to dimension $i$ of the tensor $\mathbf{X}$, keeping all other coordinates fixed. For example, when $A$, $B$, and $X$ are matrices of appropriate sizes, the quantity $X \times_1 A \times_2 B$ can be written as $A X B^T$.
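
A generic single-mode product can be written with tensordot, as in the following sketch (ours; the name mode_product is hypothetical and uses zero-based modes). Chaining it over all modes gives the full multilinear product; the assertion checks the two-dimensional example above.

```python
import numpy as np

def mode_product(X, A, i):
    """Multilinear product X x_i A: apply A to mode i (zero-based) of X."""
    # Contract A's second index with mode i of X, then move the new axis back.
    Y = np.tensordot(A, X, axes=(1, i))     # the new mode-i axis is axis 0 of Y
    return np.moveaxis(Y, 0, i)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5))             # a matrix is a 2-D tensor
A = rng.standard_normal((7, 4))
B = rng.standard_normal((6, 5))
Y = mode_product(mode_product(X, A, 0), B, 1)
assert np.allclose(Y, A @ X @ B.T)          # X x_1 A x_2 B = A X B^T
print(Y.shape)                              # (7, 6)
```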

The standard Euclidean inner product between two $d$-dimensional tensors $\mathbf{X}$ and $\mathbf{Y}$ can be defined in terms of the standard Euclidean product for vectors, by letting

$\langle \mathbf{X}, \mathbf{Y} \rangle := \bigl\langle \operatorname{vec}(\mathbf{X}), \operatorname{vec}(\mathbf{Y}) \bigr\rangle,$

where $\operatorname{vec}(\cdot)$ is the usual vectorization operator. This inner product induces a norm on the set of all $d$-dimensional tensors in the usual way, $\|\mathbf{X}\| := \langle \mathbf{X}, \mathbf{X} \rangle^{1/2}$.

Here we state several properties of the multilinear product, which are straightforward to prove.

Proposition 1.

Let $(A_i)_{i=1}^d$, $(B_i)_{i=1}^d$ be collections of linear operators and $\mathbf{X}$, $\mathbf{Y}$ be tensors, all of appropriate sizes, so that the multilinear products below are well-defined. Then we have the following:

  • $\bigl(\mathbf{X} \times_1 A_1 \times_2 A_2 \cdots \times_d A_d\bigr) \times_1 B_1 \times_2 B_2 \cdots \times_d B_d = \mathbf{X} \times_1 (B_1 A_1) \times_2 (B_2 A_2) \cdots \times_d (B_d A_d)$,

  • $\bigl\langle \mathbf{X} \times_1 A_1 \times_2 A_2 \cdots \times_d A_d,\ \mathbf{Y} \bigr\rangle = \bigl\langle \mathbf{X},\ \mathbf{Y} \times_1 A_1^T \times_2 A_2^T \cdots \times_d A_d^T \bigr\rangle$.

2.3. Tensor-tensor contraction

Another natural operation to consider between two tensors is tensor-tensor contraction, a generalization of matrix-matrix multiplication. We define tensor-tensor contraction in terms of tensors of the same dimension for ease of presentation [16].

Definition 2.

Given a tensor $\mathbf{X}$ of size $n_1 \times n_2 \times \cdots \times n_d$ and a tensor $\mathbf{Y}$ of size $m_1 \times m_2 \times \cdots \times m_d$, select index sets $s, t \subset \{1, \dots, d\}$ such that $|s| = |t|$ and $n_{s_i} = m_{t_i}$ for $i = 1, \dots, |s|$. The tensor-tensor contraction of $\mathbf{X}$ and $\mathbf{Y}$ along modes $(s, t)$, denoted $\langle \mathbf{X}, \mathbf{Y} \rangle_{(s,t)}$, is defined as the tensor $\mathbf{Z}$ of size $(n_i)_{i \in s^c} \times (m_j)_{j \in t^c}$, satisfying, in matricized form,

$\operatorname{Mat}(\mathbf{Z}) = X^{(s^c)}\, \bigl(Y^{(t^c)}\bigr)^T,$

where the rows of $\operatorname{Mat}(\mathbf{Z})$ are indexed by the free modes of $\mathbf{X}$ and its columns by the free modes of $\mathbf{Y}$. Tensor-tensor contraction over modes $s$ and $t$ merely sums over the dimensions specified by $s$ in $\mathbf{X}$ and $t$ in $\mathbf{Y}$, respectively, leaving the dimensions $s^c$ and $t^c$ free.

The inner product $\langle \mathbf{X}, \mathbf{Y} \rangle$ is a special case of the tensor-tensor contraction when $s = t = \{1, \dots, d\}$.

We also make use of the fact that when the index sets are $s = t = \{1, \dots, d\} \setminus \{i\}$, with $|s| = |t| = d - 1$, and $A$ is appropriately sized for dimension $i$ of $\mathbf{X}$, then

(2) $\bigl\langle \mathbf{X} \times_i A,\ \mathbf{Y} \bigr\rangle_{(s,t)} = A\, \bigl\langle \mathbf{X}, \mathbf{Y} \bigr\rangle_{(s,t)},$

i.e., applying $A$ to dimension $i$ commutes with contracting tensors over every dimension except the $i$th one.
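
Tensor-tensor contraction maps directly onto NumPy's tensordot, as the sketch below illustrates (ours; variable names are hypothetical). It contracts two tensors over all but their last modes, recovers the inner product as the full contraction, and verifies the commutation property (2) numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))
Y = rng.standard_normal((3, 4, 6))
A = rng.standard_normal((7, 5))

# Contract over every mode except the last one of each tensor:
Z = np.tensordot(X, Y, axes=([0, 1], [0, 1]))   # shape (5, 6)

# Inner product as the special case of contracting over all modes:
ip = np.tensordot(X, X, axes=X.ndim)
assert np.isclose(ip, np.vdot(X, X))

# Property (2): applying A to the free mode commutes with the contraction.
XA = np.moveaxis(np.tensordot(A, X, axes=(1, 2)), 0, 2)  # X x_3 A
lhs = np.tensordot(XA, Y, axes=([0, 1], [0, 1]))
assert np.allclose(lhs, A @ Z)
print(Z.shape)
```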

3. Smooth Manifold Geometry of the Hierarchical Tucker Format

In this section, we review the definition of the Hierarchical Tucker format (Section 3.1) as well as previous results [42] in the smooth manifold geometry of this format (Section 3.2). We extend these results in the next section by introducing a Riemannian metric on the space of HT parameters and subsequently derive the associated Riemannian gradient with respect to this metric. A reader familiar with the results in [42] can glance over this section quickly for a few instances of notation and move on to Section 4.

3.1. Hierarchical Tucker Format

The standard definition of the Hierarchical Tucker format relies on the notion of a dimension tree, chosen a priori, which specifies the format [18]. Intuitively, the dimension tree specifies which groups of dimensions are “separated” from other groups of dimensions, where “separation” is used in a similar sense to the SVD in two dimensions.

Definition 3.

A dimension tree $T$ is a non-trivial binary tree such that

  • the root, $t_{\mathrm{root}}$, has the label $\{1, 2, \dots, d\}$,

  • for every $t \in T \setminus L(T)$, where $L(T)$ is the set of leaves of $T$, the labels of its left and right children, $t_l$ and $t_r$, form a partition of the label for $t$, i.e., $t = t_l \cup t_r$ and $t_l \cap t_r = \emptyset$.

An example of a dimension tree is given in Figure 1.

Figure 1. Complete dimension tree.
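
A dimension tree is conveniently represented as a small recursive data structure. The sketch below (ours; the names DimTreeNode and complete_tree are hypothetical) builds the complete (balanced) tree by recursively splitting the label set in half, which for four dimensions produces the grouping {1,2,3,4} into {1,2} and {3,4} and then into singletons.

```python
class DimTreeNode:
    """A node of a dimension tree: a label (tuple of modes) and two children."""
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = tuple(label), left, right

    @property
    def is_leaf(self):
        return self.left is None and self.right is None

def complete_tree(label):
    """Recursively split the label set in half (the 'complete' dimension tree)."""
    label = tuple(label)
    if len(label) == 1:
        return DimTreeNode(label)
    mid = (len(label) + 1) // 2
    return DimTreeNode(label, complete_tree(label[:mid]), complete_tree(label[mid:]))

root = complete_tree(range(1, 5))   # modes {1, 2, 3, 4}
print(root.label, root.left.label, root.right.label)
# (1, 2, 3, 4) (1, 2) (3, 4)
```
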
Remark 1.

For the following derivations, we take the point of view that each quantity with a subscript $t$ is associated to the node $t \in T$. By Definition 3, for each $t \in T$, there is a corresponding subset of $\{1, \dots, d\}$ associated to $t$. If our HT tensor has dimensions $n_1 \times n_2 \times \cdots \times n_d$, we let $n_t = \prod_{i \in t} n_i$ and, when $t \notin L(T)$, $n_t$ satisfies $n_t = n_{t_l} n_{t_r}$.

Definition 4.

Given a dimension tree $T$ and a vector of hierarchical ranks $\mathbf{k} = (k_t)_{t \in T}$ with $k_{t_{\mathrm{root}}} = 1$, a tensor $\mathbf{X} \in \mathbb{R}^{n_1 \times \cdots \times n_d}$ can be written in the Hierarchical Tucker format if there exist parameters $x = \bigl((U_t)_{t \in L(T)}, (\mathbf{B}_t)_{t \in T \setminus L(T)}\bigr)$ such that $\mathbf{X} = \phi(x)$, where

(3) $(U_t)_{\cdot, i} = \sum_{j=1}^{k_{t_l}} \sum_{l=1}^{k_{t_r}} (\mathbf{B}_t)_{j, l, i}\; (U_{t_l})_{\cdot, j} \otimes (U_{t_r})_{\cdot, l}, \qquad t \in T \setminus L(T), \quad i = 1, \dots, k_t,$

where $\otimes$ denotes the Kronecker product of vectors and $\phi(x) = \mathbf{X}$ is the reshaping of the single column $U_{t_{\mathrm{root}}} \in \mathbb{R}^{n_1 n_2 \cdots n_d}$ into an $n_1 \times n_2 \times \cdots \times n_d$ tensor. Here $U_t \in \mathbb{R}_{*}^{n_t \times k_t}$, the set of full-rank $n_t \times k_t$ matrices, for $t \in L(T)$, and $\mathbf{B}_t \in \mathbb{R}_{*}^{k_{t_l} \times k_{t_r} \times k_t}$, the set of $3$-tensors of full multilinear rank, i.e., $\operatorname{rank}\bigl(B_t^{(1)}\bigr) = k_{t_l}$, $\operatorname{rank}\bigl(B_t^{(2)}\bigr) = k_{t_r}$, and $\operatorname{rank}\bigl(B_t^{(3)}\bigr) = k_t$.

We say the parameters $x$ are in Orthogonal Hierarchical Tucker (OHT) format if, in addition to the above construction, we also have

(4) $U_t^T U_t = I_{k_t} \ \text{ for } t \in L(T), \qquad \bigl(B_t^{(1,2)}\bigr)^T B_t^{(1,2)} = I_{k_t} \ \text{ for } t \in T \setminus L(T),\ t \neq t_{\mathrm{root}},$

where $B_t^{(1,2)}$ denotes the matricization of $\mathbf{B}_t$ along its first two modes. These conditions imply that the recursively defined matrices $U_t$ in (3) have orthonormal columns for every $t \neq t_{\mathrm{root}}$.

We have made a slight modification of the definition of the HT format compared to [42] for ease of presentation. When $d = 2$, our construction is the same as the subspace decomposition introduced in [34] for low-rank matrices, but our approach is not limited to this case.

Owing to the recursive construction (3), the intermediate matrices $U_t$ for $t \in T \setminus L(T)$ do not need to be stored. Instead, specifying $U_t$ for $t \in L(T)$ and $\mathbf{B}_t$ for $t \in T \setminus L(T)$ determines $\mathbf{X}$ completely. Therefore, the overall number of parameters is bounded above by $dNK + (d-2)K^3 + K^2$, where $N = \max_{i} n_i$ and $K = \max_{t} k_t$. When $K \ll N$ and $d$ is large, this quantity is much less than the $N^d$ parameters typically needed to represent $\mathbf{X}$.
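
To illustrate the recursion (3) and the storage count, here is a minimal Python/NumPy sketch (ours, not the authors' implementation) that expands HT parameters into the full tensor for the balanced tree on four dimensions, assuming the column-wise Kronecker convention used in (3); the names transfer and ht_full are hypothetical, and the leaf frames are drawn at random and orthonormalized.

```python
import numpy as np

def transfer(U_left, U_right, B):
    """Apply a transfer tensor B (k_l x k_r x k_t): each column of the result
    is a sum of Kronecker products of the children's columns, as in (3)."""
    k_t = B.shape[2]
    Ut = np.einsum('aj,bl,jli->abi', U_left, U_right, B)
    return Ut.reshape(U_left.shape[0] * U_right.shape[0], k_t)

def ht_full(U, B12, B34, Broot, shape):
    """Expand HT parameters for the balanced 4-D tree into the full tensor."""
    U12 = transfer(U[0], U[1], B12)          # node {1, 2}
    U34 = transfer(U[2], U[3], B34)          # node {3, 4}
    Uroot = transfer(U12, U34, Broot)        # root node, k_root = 1
    return Uroot.reshape(shape)

rng = np.random.default_rng(0)
n, k = 10, 3
shape = (n, n, n, n)
U = [np.linalg.qr(rng.standard_normal((n, k)))[0] for _ in range(4)]  # leaves
B12 = rng.standard_normal((k, k, k))
B34 = rng.standard_normal((k, k, k))
Broot = rng.standard_normal((k, k, 1))
X = ht_full(U, B12, B34, Broot, shape)
print(X.shape)   # (10, 10, 10, 10) built from ~4*n*k + 2*k**3 + k**2 parameters
```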

Definition 5.

The hierarchical rank of a tensor $\mathbf{X}$ corresponding to a dimension tree $T$ is the vector $\mathbf{k} = (k_t)_{t \in T}$, where

$k_t = \operatorname{rank}\bigl(X^{(t)}\bigr).$

We consider the set of Hierarchical Tucker tensors of fixed rank $\mathbf{k}$, that is,

$\mathcal{H} := \bigl\{ \mathbf{X} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d} : \mathbf{X} \text{ has hierarchical rank } \mathbf{k} \bigr\}.$

We restrict ourselves to parameters that are strictly orthogonalized, as in (4). In addition to significantly simplifying the resulting notation, this restriction allows us to avoid cumbersome and unnecessary matrix inversions, in particular for the resulting subspace projections in future sections. Moreover, using only orthogonalized parameters avoids the problem of algorithms converging to points with possibly lower than prescribed HT rank, see [42, Remark 4.1]. This restriction does not reduce the expressibility of the HT format, however, since for any non-orthogonalized parameters $x$ such that $\phi(x) = \mathbf{X}$, there exist orthogonalized parameters $\tilde{x}$ with $\phi(\tilde{x}) = \mathbf{X}$ [18, Alg. 3].

We use the grouping $x = \bigl((U_t)_{t \in L(T)}, (\mathbf{B}_t)_{t \in T \setminus L(T)}\bigr)$ to denote the HT parameters, as these are our “independent variables” of interest in this case. In order to avoid cumbersome notation, we also suppress the dependence on $T$ and $\mathbf{k}$ in the following, and presume a fixed dimension tree $T$ and hierarchical ranks $\mathbf{k}$.

Figure 2. Dimension tree representation of (3). Starred quantities are computed recursively.

3.2. Quotient Manifold Geometry

The results in this section are adapted from those in [42] to the orthogonalized parameter case and we include them here in the interest of keeping this article self-contained.
Below, let $\mathbb{R}_{*,\perp}^{k_{t_l} \times k_{t_r} \times k_t}$ be the closed submanifold of $\mathbb{R}_{*}^{k_{t_l} \times k_{t_r} \times k_t}$, the set of $3$-tensors with full multilinear rank, such that $\mathbf{B}_t$ is orthonormal along modes $1$ and $2$, i.e., $\bigl(B_t^{(1,2)}\bigr)^T B_t^{(1,2)} = I_{k_t}$, and let $\mathrm{St}(n, k)$ be the Stiefel manifold of $n \times k$ matrices with orthonormal columns.
Our orthogonal parameter space is then

$\mathcal{M} := \Bigl(\prod_{t \in L(T)} \mathrm{St}(n_t, k_t)\Bigr) \times \Bigl(\prod_{t \in T \setminus L(T),\, t \neq t_{\mathrm{root}}} \mathbb{R}_{*,\perp}^{k_{t_l} \times k_{t_r} \times k_t}\Bigr) \times \mathbb{R}_{*}^{k_{(t_{\mathrm{root}})_l} \times k_{(t_{\mathrm{root}})_r}},$

with corresponding tangent space at $x \in \mathcal{M}$

$T_x\mathcal{M} = \Bigl(\prod_{t \in L(T)} T_{U_t}\mathrm{St}(n_t, k_t)\Bigr) \times \Bigl(\prod_{t \in T \setminus L(T),\, t \neq t_{\mathrm{root}}} T_{\mathbf{B}_t}\mathbb{R}_{*,\perp}^{k_{t_l} \times k_{t_r} \times k_t}\Bigr) \times \mathbb{R}^{k_{(t_{\mathrm{root}})_l} \times k_{(t_{\mathrm{root}})_r}}.$

Note that $T_x\mathcal{M}$ is a linear subspace of the ambient parameter space. We omit an explicit description of $T_{\mathbf{B}_t}\mathbb{R}_{*,\perp}^{k_{t_l} \times k_{t_r} \times k_t}$ for brevity.

Let $\phi : \mathcal{M} \to \mathcal{H}$ be the parameter-to-tensor map in (3). Then for each $\mathbf{X} \in \mathcal{H}$, there is an inherent ambiguity in its representation by parameters $x \in \mathcal{M}$, i.e., $\phi(x) = \phi(y) = \mathbf{X}$ for distinct parameters $x$ and $y$ with the following relationship between them.

Let $\mathcal{G}$ be the Lie group

$\mathcal{G} := \prod_{t \in T \setminus \{t_{\mathrm{root}}\}} O(k_t),$

where $O(k_t)$ is the orthogonal group of $k_t \times k_t$ matrices and the group operation is component-wise multiplication. Let $\theta : \mathcal{M} \times \mathcal{G} \to \mathcal{M}$ be the group action

(5) $\theta(x, A) := \Bigl( \bigl(U_t A_t\bigr)_{t \in L(T)},\ \bigl(\mathbf{B}_t \times_1 A_{t_l}^T \times_2 A_{t_r}^T \times_3 A_t^T\bigr)_{t \in T \setminus L(T)} \Bigr), \qquad \text{with the convention } A_{t_{\mathrm{root}}} := 1.$

Then $\phi(x) = \phi(y)$ if and only if there exists a unique $A \in \mathcal{G}$ such that $y = \theta(x, A)$ [42, Prop. 3.9]. Therefore these are the only types of ambiguities we must consider in this format.

It follows that the orbit of $x \in \mathcal{M}$,

$\mathcal{G}x := \{\theta(x, A) : A \in \mathcal{G}\},$

is the set of all parameters that map to the same tensor under $\phi$. This induces an equivalence relation on the set of parameters $\mathcal{M}$,

$x \sim y \iff y \in \mathcal{G}x.$

If we let $\mathcal{M}/\mathcal{G}$ be the corresponding quotient space of equivalence classes and $\pi : \mathcal{M} \to \mathcal{M}/\mathcal{G}$ denote the quotient map, then pushing $\phi$ down through $\pi$ results in an injective function

$\hat{\phi} : \mathcal{M}/\mathcal{G} \to \mathcal{H}, \qquad \hat{\phi}(\pi(x)) := \phi(x),$

whose image is all of $\mathcal{H}$, and hence $\hat{\phi}$ is an isomorphism (in fact, a diffeomorphism).

The vertical space, $\mathcal{V}_x\mathcal{M}$, is the subspace of $T_x\mathcal{M}$ that is tangent to the orbit $\mathcal{G}x$. That is, $\xi \in \mathcal{V}_x\mathcal{M}$ when it is of the form [42, Eq. 4.7]

(6) $\xi = \Bigl( \bigl(U_t D_t\bigr)_{t \in L(T)},\ \bigl(\mathbf{B}_t \times_1 D_{t_l}^T + \mathbf{B}_t \times_2 D_{t_r}^T + \mathbf{B}_t \times_3 D_t^T\bigr)_{t \in T \setminus L(T)} \Bigr),$

where $D_t \in \mathfrak{so}(k_t)$, the set of $k_t \times k_t$ skew-symmetric matrices, and $D_{t_{\mathrm{root}}} := 0$. A straightforward computation shows that $\phi(\theta(x, A)) = \phi(x)$, and therefore for every $\xi \in \mathcal{V}_x\mathcal{M}$, $D\phi(x)[\xi] = 0$. From an optimization point of view, moving from the point $x$ to $\theta(x, A)$ will not change the current tensor $\phi(x)$ and therefore for any search direction $\xi \in T_x\mathcal{M}$, we must filter out the corresponding component in $\mathcal{V}_x\mathcal{M}$ in order to compute the gradient correctly. We accomplish this by projecting on to a horizontal space $\mathcal{H}_x\mathcal{M}$, which is any complementary subspace to $\mathcal{V}_x\mathcal{M}$. One such choice is [42, Eq. 4.8],

(7) $\mathcal{H}_x\mathcal{M} := \Bigl\{ \bigl((\delta U_t)_{t \in L(T)}, (\delta \mathbf{B}_t)_{t \in T \setminus L(T)}\bigr) \in T_x\mathcal{M} \;:\; U_t^T\, \delta U_t = 0 \ \text{for } t \in L(T), \ \bigl(B_t^{(1,2)}\bigr)^T \delta B_t^{(1,2)} = 0 \ \text{for } t \in T \setminus L(T),\ t \neq t_{\mathrm{root}} \Bigr\}.$

Note that there is no restriction on $\delta \mathbf{B}_{t_{\mathrm{root}}}$, which can be identified with a $k_{(t_{\mathrm{root}})_l} \times k_{(t_{\mathrm{root}})_r}$ matrix.
This choice has the convenient property that $\mathcal{H}_x\mathcal{M}$ is invariant under the action of $\mathcal{G}$, i.e., [42, Prop. 4.9]

(8) $D\theta_A(x)\bigl[\mathcal{H}_x\mathcal{M}\bigr] = \mathcal{H}_{\theta(x,A)}\mathcal{M} \qquad \text{for every } A \in \mathcal{G},$

which we shall exploit for our upcoming discussion of a Riemannian metric. The horizontal space allows us to uniquely represent abstract tangent vectors in $T_{\pi(x)}(\mathcal{M}/\mathcal{G})$ with concrete vectors in $\mathcal{H}_x\mathcal{M}$.
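
In practice, filtering out the vertical component of a search direction amounts to simple orthogonal projections at each node. The sketch below (ours; project_leaf is a hypothetical name) enforces the leaf condition $U_t^T \delta U_t = 0$ from (7); the same projection is applied to the $(1,2)$-matricization of $\delta \mathbf{B}_t$ at the interior non-root nodes.

```python
import numpy as np

def project_leaf(U, dU):
    """Remove the vertical component at a leaf: enforce U^T dU = 0 as in (7)."""
    return dU - U @ (U.T @ dU)

rng = np.random.default_rng(0)
U = np.linalg.qr(rng.standard_normal((10, 3)))[0]   # orthonormal leaf frame
dU = rng.standard_normal((10, 3))                   # arbitrary direction
dU_h = project_leaf(U, dU)
print(np.linalg.norm(U.T @ dU_h))                   # ~ 0
```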

4. Riemannian Geometry of the HT Format

In this section, we introduce a Riemannian metric on the parameter space $\mathcal{M}$ that will allow us to use parameters $x$ as representations for their equivalence classes $\pi(x)$ in a well-defined manner while performing numerical optimization.

4.1. Riemannian metric

Since each distinct equivalence class $\pi(x)$ is uniquely identified with each distinct value of $\phi(x)$, the quotient manifold $\mathcal{M}/\mathcal{G}$ is really our manifold of interest for the purpose of computations, i.e., we would like to formulate our optimization problem over the equivalence classes $\pi(x)$. Unfortunately, $\mathcal{M}/\mathcal{G}$ is an abstract mathematical object and thus hard to implement numerically. By introducing a Riemannian metric on $\mathcal{M}$ that respects its quotient structure, we can formulate concrete optimization algorithms in terms of the HT parameters without being affected by the non-uniqueness of the format, i.e., by optimizing over parameters $x$ while implicitly performing optimization over equivalence classes $\pi(x)$. Below, we explain how to explicitly construct this Riemannian metric for the HT format.

Let $\xi = \bigl((\delta U_t)_{t \in L(T)}, (\delta \mathbf{B}_t)_{t \in T \setminus L(T)}\bigr)$ and $\eta = \bigl((\delta V_t)_{t \in L(T)}, (\delta \mathbf{C}_t)_{t \in T \setminus L(T)}\bigr)$ be tangent vectors at the point $x \in \mathcal{M}$. Then we define the inner product at $x$ as

(9) $\langle \xi, \eta \rangle_x := \sum_{t \in L(T)} \operatorname{tr}\bigl(\delta U_t^T\, \delta V_t\, G_t\bigr) \;+\; \sum_{t \in T \setminus L(T)} \bigl\langle \delta \mathbf{B}_t \times_3 G_t,\ \delta \mathbf{C}_t \bigr\rangle, \qquad G_t := U_t^T U_t \ \text{for } t \neq t_{\mathrm{root}}, \quad G_{t_{\mathrm{root}}} := 1.$

By the full-rank conditions on $U_t$ and $\mathbf{B}_t$ at each node, by definition of the HT format, each $G_t$ for $t \in T$ is symmetric positive definite and varies smoothly with $x$. As a result, $\langle \cdot, \cdot \rangle_x$ is a smooth, symmetric positive definite, bilinear form on $T_x\mathcal{M}$, i.e., a Riemannian metric. Note that when $x$ is in OHT, as it is in our case, $\langle \cdot, \cdot \rangle_x$ reduces to the standard Euclidean product on the parameter space $T_x\mathcal{M}$, making it straightforward to compute in this case.
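
Since we work with OHT parameters, the metric (9) is evaluated as a plain Euclidean inner product over the parameter components, as in the following sketch (ours; param_inner and the example shapes are hypothetical).

```python
import numpy as np

def param_inner(xi, eta):
    """Euclidean inner product of two parameter-space directions.

    xi and eta are lists of NumPy arrays (one per tree node, leaves and
    transfer tensors alike); under OHT this coincides with the metric (9)."""
    return sum(np.vdot(a, b) for a, b in zip(xi, eta))

rng = np.random.default_rng(0)
shapes = [(10, 3), (10, 3), (3, 3, 3), (3, 3, 1)]   # e.g. two leaves + two transfer tensors
xi = [rng.standard_normal(s) for s in shapes]
eta = [rng.standard_normal(s) for s in shapes]
print(param_inner(xi, eta))
```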

Proposition 2.

On the Riemannian manifold $\bigl(\mathcal{M}, \langle \cdot, \cdot \rangle\bigr)$, the group action $\theta$ defined in (5) acts isometrically on $\mathcal{M}$, i.e., for every $x \in \mathcal{M}$, $A \in \mathcal{G}$, and $\xi, \eta \in \mathcal{H}_x\mathcal{M}$,

$\bigl\langle D\theta_A(x)[\xi],\ D\theta_A(x)[\eta] \bigr\rangle_{\theta(x,A)} = \langle \xi, \eta \rangle_x,$

where $D\theta_A(x)$ is the push-forward map of $\theta_A := \theta(\cdot, A)$, $D\theta_A(x) : T_x\mathcal{M} \to T_{\theta(x,A)}\mathcal{M}$.

Proof.

Let $y = \theta(x, A)$, for $A \in \mathcal{G}$.

If we write $\bar{\xi} = D\theta_A(x)[\xi]$, for $\xi \in \mathcal{H}_x\mathcal{M}$, then, by (8), it follows that

$\bar{\xi} \in \mathcal{H}_y\mathcal{M},$

and similarly for $\bar{\eta} = D\theta_A(x)[\eta]$.

We will compare each component of the sum (9) term by term. For ease of presentation, we only consider interior nodes $t \in T \setminus L(T)$ with $t \neq t_{\mathrm{root}}$, as leaf nodes and the root node are handled in an analogous manner.

For such $t$, let $\bar{\xi}_t$ be the component of $\bar{\xi}$ at the node $t$, i.e.,

$\bar{\xi}_t := \bigl(D\theta_A(x)[\xi]\bigr)_t,$

and similarly let $\bar{\eta}_t := \bigl(D\theta_A(x)[\eta]\bigr)_t$.

A straightforward computation based on (3) and (5) yields that

$\bar{\xi}_t = \delta \mathbf{B}_t \times_1 A_{t_l}^T \times_2 A_{t_r}^T \times_3 A_t^T$

for appropriately sized orthogonal matrices $A_{t_l}$, $A_{t_r}$, and $A_t$. In particular, for the term at node $t$, we have that

$\bigl\langle \bar{\xi}_t, \bar{\eta}_t \bigr\rangle = \bigl\langle \delta \mathbf{B}_t \times_1 A_{t_l}^T \times_2 A_{t_r}^T \times_3 A_t^T,\ \delta \mathbf{C}_t \times_1 A_{t_l}^T \times_2 A_{t_r}^T \times_3 A_t^T \bigr\rangle = \bigl\langle \delta \mathbf{B}_t, \delta \mathbf{C}_t \bigr\rangle,$

by Proposition 1 and the orthogonality of $A_{t_l}$, $A_{t_r}$, and $A_t$. Adding the terms for each $t \in T$, we obtain

$\bigl\langle D\theta_A(x)[\xi],\ D\theta_A(x)[\eta] \bigr\rangle_{\theta(x,A)} = \langle \xi, \eta \rangle_x.$

Although the above computation uses the fact that $x$ is orthogonalized, an almost identical calculation yields Proposition 2 when $x$ is non-orthogonalized, as considered in [42]. As we are interested in carrying out our optimization using the HT parameters $x$ as proxies for their equivalence classes $\pi(x)$, this proposition states that if we measure inner products between two tangent vectors at the point $x$, we obtain the same result as if we had measured the inner product between the two tangent vectors transformed by $D\theta_A(x)$ at the point $\theta(x, A)$. In this sense, once we have a unique association of tangent vectors in $T_{\pi(x)}(\mathcal{M}/\mathcal{G})$ with a subspace of $T_x\mathcal{M}$, we can use the actual representatives, the parameters $x$, instead of the abstract equivalence classes $\pi(x)$, in a well-defined way during our optimization. This shows that $\mathcal{M}/\mathcal{G}$, endowed with the Riemannian metric

$\langle \xi, \eta \rangle_{\pi(x)} := \langle \bar{\xi}, \bar{\eta} \rangle_x,$

where $\bar{\xi}, \bar{\eta} \in \mathcal{H}_x\mathcal{M}$ are the horizontal lifts at $x$ of $\xi, \eta \in T_{\pi(x)}(\mathcal{M}/\mathcal{G})$, respectively, is a Riemannian quotient manifold of $\mathcal{M}$ (i.e., $\pi$ is a Riemannian submersion) [1, Sec. 3.6.2].

In summary, by using this Riemannian metric and restricting our optimization to only consider horizontal tangent vectors, we can implicitly formulate our algorithms on the abstract quotient space $\mathcal{M}/\mathcal{G}$ by working with the concrete HT parameters. Below, we will derive the Riemannian gradient in this context.

Remark 2.

It should be noted that although the horizontal space (7) is complementary to the vertical space (6), it is demonstrably not perpendicular to $\mathcal{V}_x\mathcal{M}$ under the Riemannian metric (9). Choosing a horizontal space which is perpendicular to $\mathcal{V}_x\mathcal{M}$ under the standard Euclidean product (i.e., (9) when $x$ is orthogonalized) is beyond the scope of this paper. Suffice it to say, it can be done, as a generalization of the approach outlined in [34], resulting in a series of symmetry conditions on various multi-way combinations of parameters. The resulting projection operators involve solving a number of coupled Lyapunov equations, increasing with the depth of $T$. It remains to be seen whether such equations can be solved efficiently when $d$ is large. We will not dwell on this point here, as we will not need orthogonal projections for our computations in the following.

4.2. Riemannian gradient

The problem we are interested in solving is

$\min_{x \in \mathcal{M}}\ f(\phi(x))$

for a smooth objective function $f : \mathbb{R}^{n_1 \times \cdots \times n_d} \to \mathbb{R}$. We write $\hat{f} := f \circ \phi$, where $\phi$ is the parameter-to-tensor map (3).
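
For the tensor completion objective in (1), the ambient Euclidean gradient $\nabla f(\mathbf{X})$ scatters the residual on the sampled entries back into an otherwise zero tensor. The sketch below (ours; euclidean_gradient and omega are hypothetical names, and the restriction operator is the one from Section 1) computes this quantity, which the chain rule below combines with $D\phi(x)$ to obtain the Riemannian gradient.

```python
import numpy as np

def euclidean_gradient(X, omega, b):
    """Euclidean gradient of f(X) = 0.5*||A X - b||^2: scatter the residual
    r = A X - b back onto the sampled entries (A^T r), zeros elsewhere."""
    r = X[tuple(omega.T)] - b
    G = np.zeros_like(X)
    np.add.at(G, tuple(omega.T), r)   # handles repeated indices in omega
    return G

rng = np.random.default_rng(0)
shape = (8, 8, 8, 8)
Xtrue = rng.standard_normal(shape)
omega = np.stack([rng.integers(0, n, size=300) for n in shape], axis=1)
b = Xtrue[tuple(omega.T)]
G = euclidean_gradient(np.zeros(shape), omega, b)
print(np.linalg.norm(G))
```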

We need to derive expressions for the Riemannian gradient to update the HT parameters as part of local optimization procedures. Therefore, our primary quantity of interest is the Riemannian gradient of $\hat{f}$.

Definition 6.

[1, Sec. 3.6] Given a smooth scalar function $g$ on a Riemannian manifold $\mathcal{N}$, the Riemannian gradient of $g$ at $y \in \mathcal{N}$, denoted $\nabla g(y)$, is the unique element of $T_y\mathcal{N}$ which satisfies

$\langle \nabla g(y), \xi \rangle_y = Dg(y)[\xi] \qquad \text{for all } \xi \in T_y\mathcal{N},$

with respect to the Riemannian metric $\langle \cdot, \cdot \rangle_y$.

Our manifold of interest in this case is $\mathcal{M}/\mathcal{G}$, with the corresponding horizontal space $\mathcal{H}_x\mathcal{M}$ in lieu of the abstract tangent space $T_{\pi(x)}(\mathcal{M}/\mathcal{G})$. Therefore, in the above equation, we can consider the horizontal lift $\bar{\xi} \in \mathcal{H}_x\mathcal{M}$ of the tangent vector $\xi$ and instead write

$\langle \nabla \hat{f}(x), \bar{\xi} \rangle_x = D\hat{f}(x)[\bar{\xi}] \qquad \text{for all } \bar{\xi} \in \mathcal{H}_x\mathcal{M}.$

Our derivation is similar to that of [42, Sec. 6.2.2], except our derivations are more streamlined and cheaper computationally since we reduce the operations performed at the interior nodes $t \in T \setminus L(T)$. By a slight abuse of notation in this section, we denote variational quantities associated to node $t$ as $\delta U_t$ and $\delta \mathbf{B}_t$, where $\delta \mathbf{B}_t$ is the reshaping of $\delta B_t^{(1,2)}$ into a $k_{t_l} \times k_{t_r} \times k_t$ tensor. The Riemannian gradient will be denoted $\nabla \hat{f}(x)$ and a general horizontal vector will be denoted by $\bar{\xi} = \bigl((\delta U_t)_{t \in L(T)}, (\delta \mathbf{B}_t)_{t \in T \setminus L(T)}\bigr)$.

Since $x$ is orthogonalized, we use $\langle \cdot, \cdot \rangle$ to denote the Euclidean inner product. By the chain rule, we have that, for any $\bar{\xi} \in \mathcal{H}_x\mathcal{M}$,

$D\hat{f}(x)[\bar{\xi}] = \bigl\langle \nabla f(\phi(x)),\ D\phi(x)[\bar{\xi}] \bigr\rangle.$

Then each intermediate variation $\delta U_t$, with $t \in T \setminus L(T)$, satisfies the recursion

(10) $\delta U_t = \bigl(\delta U_{t_l} \otimes U_{t_r} + U_{t_l} \otimes \delta U_{t_r}\bigr)\, B_t^{(1,2)} + \bigl(U_{t_l} \otimes U_{t_r}\bigr)\, \delta B_t^{(1,2)}$

for matrices $\delta U_{t_l}$, $\delta U_{t_r}$ and tensor $\delta \mathbf{B}_t$ satisfying [42, Lemma 2]

(11) $U_{t_l}^T\, \delta U_{t_l} = 0, \qquad U_{t_r}^T\, \delta U_{t_r} = 0, \qquad U_t^T\, \delta U_t = 0.$

The third orthogonality condition is omitted when $t = t_{\mathrm{root}}$.

Owing to this recursive structure, we compute $\nabla \hat{f}(x)$ node by node, where