On Optimizing Distributed Tucker Decomposition for Dense Tensors
Abstract
The Tucker decomposition expresses a given tensor as the product of a small core tensor and a set of factor matrices. Apart from providing data compression, the construction is useful in performing analysis such as principal component analysis (PCA) and finds applications in diverse domains such as signal processing, computer vision and text analytics. Our objective is to develop an efficient distributed implementation for the case of dense tensors. The implementation is based on the HOOI (Higher Order Orthogonal Iteration) procedure, wherein the tensor-times-matrix product forms the core routine. Prior work has proposed heuristics for reducing the computational load and communication volume incurred by the routine. We study the two metrics in a formal and systematic manner, and design strategies that are optimal under these two fundamental metrics. Our experimental evaluation on a large benchmark of tensors shows that the optimal strategies provide significant reduction in load and volume compared to prior heuristics, and provide significant speedup in the overall running time.
1 Introduction
Tensors are the higher-dimensional analogues of matrices. While matrices represent two-dimensional data, tensors are useful in representing data in three or more dimensions. Tensors have been studied extensively by generalizing concepts pertaining to matrices. The Tucker decomposition [tucker] is a prominent construction that extends the singular value decomposition (SVD) to the setting of tensors. Given an N-dimensional tensor X, the decomposition approximately expresses the tensor as the product of a small N-dimensional core tensor and a set of factor matrices, one along each dimension (or mode); see Figure 1 for an illustration. The core is much smaller than the original tensor, leading to data compression. Prior work [koldaipdps] has reported substantial compression rates on large real tensors. Apart from data compression, the decomposition is also useful in analysis such as PCA, and finds applications in diverse domains such as computer vision [s229] and signal processing [s173]. A detailed discussion on the topic can be found in the excellent survey by Kolda and Bader [survey].
Tucker decomposition has been well studied in sequential, shared-memory and distributed settings for both dense and sparse tensors (e.g., [koldaipdps, ucaricpp, koldaicdm]). Our objective is to develop an optimized implementation for dense tensors on distributed memory systems. The implementation is based on the popular STHOSVD/HOOI procedures. The STHOSVD (Sequentially Truncated Higher Order SVD) [vanv] is used to produce an initial decomposition. The HOOI (Higher Order Orthogonal Iteration) [kolda7] procedure transforms any given decomposition to a new decomposition with the same core size, but with reduced error. The procedure is applied iteratively so as to reduce the error monotonically across the iterations. We focus on the latter HOOI procedure, which is invoked multiple times. The tensor-times-matrix product (TTM) component forms the core module of the procedure. Prior work has proposed heuristic schemes for reducing the computational load and communication volume of the component. Our objective is to enhance the performance by constructing schemes which are optimal in these two fundamental metrics.
Prior Work
Heuristics for computational load The TTM component comprises a set of tensor-times-matrix multiplication operations. Based on the observation that the operations can be rearranged and reused in multiple ways, prior work has proposed heuristics for reducing the computational load. A naive scheme for implementing the component performs N(N-1) TTM operations, where N is the number of modes. Baskaran et al. [baskaran] focused on reducing the number of TTM operations and proposed an improved scheme, which was further improved to approximately N log N operations by Kaya and Uçar [ucarreport]. However, minimizing the number of TTM operations is not sufficient, and it is crucial to consider the cost of the operations, especially in the context of dense tensors. Austin et al. [koldaipdps] measure the cost in terms of the number of floating point operations (FLOP). They empirically showed that the performance of the naive scheme can be improved by permuting (ordering) the modes of the input tensor, and proposed a greedy heuristic for mode ordering. A similar heuristic is given by Vannieuwenhoven et al. [vanv].
Heuristics for communication volume Austin et al. [koldaipdps] presented the first distributed implementation of HOOI. They distribute the tensors among the processors using a Cartesian parallel distribution, which generalizes the block distribution technique used in the context of matrices. The processors are arranged in the form of an N-dimensional grid, and a tensor is partitioned into blocks by imposing the grid on the tensor; the blocks are then assigned to the processors. They showed that the communication volume is determined by the choice of the grid, and presented an empirical evaluation of the effect of the grid on the communication volume.
Our Contributions
Our objective is to enhance the performance of distributed Tucker decomposition for dense tensors by designing optimal schemes for the two metrics, and we make the following contributions.

Optimal TTM-trees: As observed in prior work [ucarreport], the different TTM schemes can be conveniently represented in the form of trees, called TTM-trees. We present an efficient algorithm for constructing the optimal TTM-tree, the one having the least computational load, measured in terms of the number of floating point operations (FLOP).

Dynamic Gridding: Prior work uses a static gridding scheme, wherein the same grid is used for distributing the tensors arising in the different TTM operations. We propose the concept of dynamic gridding that uses different grids tailored for the different operations, leading to significant reduction in communication volume, even when compared to the optimal static grids.

Optimal Dynamic Gridding: We present an efficient algorithm for finding the dynamic gridding scheme achieving the optimal communication volume.
Our distributed implementation builds on the work of Austin et al. [koldaipdps] and incorporates the optimal schemes described above. We set up a large benchmark of tensors whose metadata are derived from real-life tensors. Furthermore, we also include a set of tensors with metadata derived from simulations in combustion science. Our experimental evaluation on the above benchmark demonstrates that the combination of optimal trees and the dynamic gridding scheme offers significant reduction in computational load and communication volume, resulting in significant improvement in overall execution time, compared to prior heuristics. To the best of our knowledge, our study is the first to consider optimal algorithms for the Tucker decomposition.
We note that prior work [vanv, koldaipdps] has provided evidence that STHOSVD may be sufficient for particular application domains. They present experimental evaluations on a sample of tensors arising in image processing and combustion science showing that for these tensors, STHOSVD is sufficient and HOOI does not provide significant error reduction. Our optimizations on HOOI would be useful for other tensors/domains where HOOI provides error reduction over STHOSVD. Furthermore, the ideas developed in this paper can be recast and used for improving STHOSVD as well.
Related Work
Tucker decomposition has been studied in sequential and parallel settings for dense and sparse tensors. For dense tensors, the MATLAB Tensor Toolbox provides a sequential implementation [TTB_Software2]. Zhou et al. [zhou] proposed a randomized algorithm for the case where the tensor fits in the physical memory of a single machine. Li et al. [vuduc] proposed performance enhancements for a single TTM operation, and their techniques can be incorporated within our framework. Austin et al. [koldaipdps] described the first implementation for distributed memory systems, wherein they proposed heuristics for mode ordering and experimentally demonstrated the effect of grid selection on communication time. For sparse tensors, sequential [koldaicdm], shared memory [baskaran] and distributed implementations [ucaricpp] are known. Other tensor decompositions have also been considered (see [survey]). In particular, the CP decomposition, which generalizes the concept of rank factorization, has been well studied (e.g., [cp3]). We refer to the survey by Kolda and Bader [survey] for a detailed treatment of tensor decompositions.
2 Tucker Decomposition
In this section, we briefly discuss tensor concepts pertinent to our problem, and then, describe the Tucker decomposition and the HOOI procedure.
2.1 Preliminaries
Fibers Consider an N-dimensional tensor X of size I_1 x I_2 x ... x I_N. Let T = I_1 I_2 ... I_N denote the cardinality (number of elements) of the tensor. The elements of X can be canonically indexed by a coordinate vector of the form (i_1, i_2, ..., i_N), where each index i_n belongs to [1, I_n], for all modes n. A mode-n fiber is a vector of length I_n, containing all the elements that differ on the nth coordinate, but agree on all the other coordinates. The number of mode-n fibers is T / I_n. In the analogous case of matrices, two types of fibers can be found: row vectors and column vectors.
Tensor Unfolding The tensor is stored as a matrix, and N different matrix layouts are possible, called the unfoldings. The mode-n unfolding of X refers to the matrix whose columns are the mode-n fibers of the tensor. The columns are arranged in a lexicographic order (the details are not crucial for our discussion). This matrix is of size I_n x (T / I_n), and we denote it as X_(n).
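As a concrete illustration, the unfolding can be sketched in a few lines of numpy. Note that the column ordering produced here is numpy's C-order convention, which is one fixed lexicographic order but not necessarily the exact convention of the text.

```python
import numpy as np

def unfold(X, n):
    """Mode-n unfolding: a matrix of size I_n x (T / I_n) whose columns
    are the mode-n fibers of X. Moving axis n to the front and reshaping
    yields the fibers as columns, in numpy's C-order convention."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
```

For a tensor of size 2 x 3 x 4, the mode-1 unfolding has size 3 x 8, and each of its columns is a mode-1 fiber.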
Tensor-Times-Matrix Multiplication (TTM) For any mode n, the tensor X can be multiplied by a matrix U along mode n, provided U has size K x I_n, for some K; the operation is denoted Y = X x_n U. Conceptually, the operation applies the linear transformation U to all the mode-n fibers. It is realized via the matrix-matrix multiplication Y_(n) = U X_(n), taking the output matrix Y_(n) to be the mode-n unfolding of Y. While the length along mode n changes from I_n to K, the number of mode-n fibers and the lengths along the other modes remain the same. Thus, Y has cardinality K (T / I_n) and size I_1 x ... x I_{n-1} x K x I_{n+1} x ... x I_N.
TTM-Chain The TTM-chain operation refers to multiplying X along multiple distinct modes. For two modes m and n and matrices U and V, we first multiply X by U along mode m, and then multiply the output tensor along mode n by V. An important property of the operation is commutativity [kolda7], namely the two TTM operations can be performed in either order: (X x_m U) x_n V = (X x_n V) x_m U. In general, for a subset of distinct modes S and matrices U_m for m in S, where U_m has size K_m x I_m, the output is a tensor Y. The length of Y remains the same as X along modes not belonging to S, and changes to K_m along each mode m in S. Commutativity implies that the multiplications can be performed in any order.
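The TTM operation and its commutativity can be sketched with numpy's tensordot; the helper below is an illustrative implementation, not the distributed routine discussed later.

```python
import numpy as np

def ttm(X, U, n):
    """Multiply tensor X along mode n by matrix U of size K x I_n:
    apply U to every mode-n fiber. tensordot contracts U's columns
    with X's mode n; the new axis (length K) lands at position 0 and
    is moved back into position n."""
    Y = np.tensordot(U, X, axes=(1, n))
    return np.moveaxis(Y, 0, n)
```

The commutativity property means that multiplying along two distinct modes gives the same tensor regardless of the order of the two TTMs, which the test below checks numerically.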
2.2 Tucker Decomposition and HOOI
The Tucker decomposition of X approximates the tensor as the product of a core tensor G of size K_1 x K_2 x ... x K_N, with each K_n <= I_n, and a set of factor matrices U_1, U_2, ..., U_N: X ~ G x_1 U_1 x_2 U_2 ... x_N U_N. The factor matrix U_n has size I_n x K_n. The decomposition compresses the length of X along each mode n from I_n to K_n. We write the decomposition as [G; U_1, ..., U_N]. The error of the decomposition is measured by comparing the recovered tensor and the input tensor under the normalized root mean square metric.
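The recovered tensor and the error metric can be sketched as follows; the `ttm`-style contraction is as above, and the error is taken here in one common form of the normalized root mean square metric, the Frobenius norm of the difference divided by the Frobenius norm of the input (the paper's exact normalization may differ).

```python
import numpy as np

def reconstruct(G, U):
    """Recover X_hat = G x_1 U_1 x_2 U_2 ... x_N U_N by a chain of TTMs."""
    X = G
    for n, Un in enumerate(U):
        X = np.moveaxis(np.tensordot(Un, X, axes=(1, n)), 0, n)
    return X

def nrmse(X, X_hat):
    """Normalized error ||X - X_hat||_F / ||X||_F (one common convention)."""
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

For a two-dimensional core this reduces to the familiar matrix identity X_hat = U_1 G U_2^T.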
The HOOI procedure [kolda7] transforms a given decomposition into a new decomposition having the same core size, but with reduced error. Given an initial decomposition, the procedure can be invoked repeatedly to reduce the error monotonically, until a desired convergence is achieved. An initial decomposition can be found using methods such as STHOSVD [vanv].
The HOOI procedure (a single invocation), shown in Figure 2, takes as input the tensor X and a decomposition [G; U_1, ..., U_N] with core size K_1 x ... x K_N. It produces a new decomposition with lesser error, but having the same core and factor matrix sizes. For computing each new factor matrix U'_n, the procedure utilizes the alternating least squares paradigm and works in two steps. First, it performs a TTM-chain operation by skipping mode n and multiplying X by the transposes of all the other factor matrices U_m (with m != n), obtaining a tensor Y. The tensor Y has length compressed from I_m to K_m along all modes m != n. In the next step, it performs an SVD on Y_(n), the mode-n unfolding of Y. The new factor matrix U'_n is obtained by arranging the K_n leading left singular vectors as columns. Once all the new factor matrices are computed, the new core tensor is computed.
Figure 3 (a) depicts the process in the form of a tree. The root represents the input tensor X, each node with label n represents multiplication along mode n, and each leaf represents a new factor matrix. For dense tensors, the SVD operations tend to be inexpensive (see [koldaipdps]). Therefore, we focus on optimizing the TTM component comprising the TTM-chains, from the perspectives of computational load and communication volume.
Input: A tensor X and a decomposition [G; U_1, ..., U_N]  
Size of tensor X: I_1 x I_2 x ... x I_N  
Size of core G: K_1 x K_2 x ... x K_N, with K_n <= I_n  
Size of factor matrix U_n: I_n x K_n  
Output: New decomp. [G'; U'_1, ..., U'_N] with lesser error  
Size of core and factor matrices: Same as input.  
Procedure:  
For each mode n from 1 to N  
TTM-Chain: Perform TTM along all the modes, except n.  
Y <- X x_1 U_1^T ... x_{n-1} U_{n-1}^T x_{n+1} U_{n+1}^T ... x_N U_N^T.  
SVD: U'_n <- K_n leading left singular vectors of Y_(n)  
New core: G' <- X x_1 U'_1^T x_2 U'_2^T ... x_N U'_N^T  
Output [G'; U'_1, ..., U'_N]. 
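A single HOOI invocation can be sketched for the dense, single-node case as follows. This follows the pseudocode above using the naive chain scheme (no reuse across chains); the `ttm` helper is an illustrative contraction as in Section 2.1, not the distributed routine.

```python
import numpy as np

def ttm(X, U, n):
    """Multiply X along mode n by U (size K x I_n)."""
    return np.moveaxis(np.tensordot(U, X, axes=(1, n)), 0, n)

def hooi_step(X, U, ranks):
    """One HOOI invocation (naive chain scheme): for each mode n,
    multiply X by U_m^T for all m != n, then take the leading left
    singular vectors of the mode-n unfolding of the result."""
    N = X.ndim
    U_new = []
    for n in range(N):
        Y = X
        for m in range(N):
            if m != n:
                Y = ttm(Y, U[m].T, m)          # TTM-chain, skipping mode n
        Yn = np.moveaxis(Y, n, 0).reshape(Y.shape[n], -1)  # mode-n unfolding
        left = np.linalg.svd(Yn, full_matrices=False)[0]
        U_new.append(left[:, :ranks[n]])       # K_n leading singular vectors
    G = X
    for m in range(N):                         # new core via a full TTM-chain
        G = ttm(G, U_new[m].T, m)
    return G, U_new
```

The output core has size K_1 x ... x K_N and each new factor matrix retains its input size, matching the specification in Figure 2.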
3 Computational Load
The TTM component performs N TTM-chains, each involving N-1 TTM operations. Commutativity allows us to rearrange and reuse the operations in multiple ways, all of which can be represented in the form of TTM-trees, as observed in prior work [ucarreport]. We measure the computational load of a tree by the number of floating point operations incurred. Our objective is to design an efficient algorithm for finding the optimal TTM-trees. Below, we first formalize the above model and rephrase prior schemes, and then describe the optimal algorithm.
3.1 TTM-trees and Cost
In a TTM-tree, the root represents the input tensor X, each leaf node represents a unique new factor matrix, and each internal node (nodes other than the root and the leaves) represents TTM along a particular mode. The root-to-leaf path leading to a new factor matrix realizes the TTM-chain required for computing that matrix.
TTM-Trees
Formally, a TTM-tree is a rooted tree with a function that assigns a label to each node such that the following properties are satisfied: (i) the label of the root node is X; (ii) there are exactly N leaves, with each leaf being labeled with a unique new factor matrix U'_n; (iii) each internal node is labeled with a mode n; (iv) for each leaf with label U'_n, the path from the root to the leaf has exactly N-1 internal nodes and all the modes except n appear on the path.
Figure 3 (a)-(c) provides example TTM-trees. Although the trees differ in the order in which the modes are processed and the total number of TTMs performed, they all realize the necessary TTM-chains.
Given a tree, the HOOI procedure can be executed via a natural top-down process by associating each node v with an input tensor In(v) and an output tensor Out(v). For the root node, Out(root) = X. Each internal node v with label n takes as input the tensor output by its parent u, multiplies it along mode n by the factor matrix U_n^T, and outputs the resultant tensor, i.e., In(v) = Out(u) and Out(v) = In(v) x_n U_n^T. Each leaf node with label U'_n constructs the new factor matrix by performing an SVD on the tensor output by its parent. The correctness of the procedure follows from the commutativity property of the TTM-chain operation. In the above procedure, we reuse the tensor output by a node for processing all its children. By executing the process via an in-order traversal, we can ensure that the maximum number of intermediate tensors stored at any point is bounded by the depth of the tree.
Figure 3: (a) Chain tree; (b) Chain tree; (c) Balanced tree.
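The top-down execution with reuse can be sketched as below. The tree representation (nested tuples, with hypothetical tags 'mode' and 'leaf') and the `ttm` helper are illustrative choices, not the paper's data structures; the key point is that the tensor computed at a node is reused by all of its children.

```python
import numpy as np

def ttm(X, U, n):
    """Multiply X along mode n by U (size K x I_n)."""
    return np.moveaxis(np.tensordot(U, X, axes=(1, n)), 0, n)

def execute(node, X_in, U, ranks, out):
    """Run HOOI over a TTM-(sub)tree given as nested tuples:
    internal node ('mode', n, children), leaf ('leaf', n).
    The tensor produced at each node is shared by all its children."""
    kind, n = node[0], node[1]
    if kind == 'leaf':
        Xn = np.moveaxis(X_in, n, 0).reshape(X_in.shape[n], -1)
        left = np.linalg.svd(Xn, full_matrices=False)[0]
        out[n] = left[:, :ranks[n]]            # new factor matrix U'_n
        return
    X_out = ttm(X_in, U[n].T, n)               # multiply along mode n once...
    for child in node[2]:                      # ...and reuse for every child
        execute(child, X_out, U, ranks, out)
```

Running this over the children of the root (the root itself performs no TTM) produces all N new factor matrices while computing each shared TTM only once.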
Computational Load
We define the cost (or computational load) of a TTM-tree to be the number of floating point operations (FLOP) performed. Each internal node with label n executes the TTM In(v) x_n U_n^T. Recall that the operation involves a matrix-matrix multiplication, wherein the matrix U_n^T is multiplied by the mode-n unfolding of the input tensor. The matrix U_n^T has size K_n x I_n and the unfolded tensor has size I_n x (t / I_n), where t is the cardinality of the input tensor, and so the cost of the TTM is K_n t. The cardinality of the output tensor is (K_n / I_n) t; namely, the node compresses the tensor by a factor K_n / I_n. We can compute the cost incurred at all the nodes and the cardinality of their output tensors by performing the above calculations in a top-down manner. Then, the cost of the tree is given by the sum of costs of its internal nodes. We can see that each mode n is associated with two parameters: a cost factor c_n = K_n and a compression factor f_n = K_n / I_n. At each node, the cost incurred and the cardinality of the output tensor can be expressed in terms of these two parameters.
Figure 4 provides an illustration. The cost incurred and the cardinality of the output tensor are shown at each node. For ease of exposition, we have normalized all the quantities by T. The root node incurs no cost and the cardinality of its output is T, which is 1 after normalization. Each node with label n incurs a cost of c_n times the cardinality of the tensor output by its parent; it outputs a tensor having cardinality compressed by a factor f_n.
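The cost model above can be computed with a short recursion over the tree; the nested-tuple tree representation is an illustrative choice, and leaves (SVD nodes) are costed at zero here, in line with the observation that the SVDs are inexpensive.

```python
def tree_cost(node, t, C, F):
    """FLOP cost of a TTM-(sub)tree. t is the (normalized) cardinality
    of the tensor entering the node; C[n] and F[n] are the cost and
    compression factors of mode n (c_n = K_n, f_n = K_n / I_n, up to a
    constant). Internal node: ('mode', n, children); leaf: ('leaf', n)."""
    kind, n = node[0], node[1]
    if kind == 'leaf':
        return 0.0                   # SVD leaves are not costed here
    cost = C[n] * t                  # TTM along mode n on a tensor of size t
    t_out = F[n] * t                 # output compressed by factor f_n
    return cost + sum(tree_cost(ch, t_out, C, F) for ch in node[2])
```

For example, with c_n = 2 and f_n = 0.5 for all modes, a chain that multiplies mode 0 and then mode 1 before the SVD leaf costs 2 (at cardinality 1) plus 1 (at cardinality 0.5).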
3.2 Prior Schemes
We rephrase the prior schemes in terms of TTM-trees.
Chain trees These trees encode the naive scheme, with N independent chains, each comprising N-1 internal nodes (see Figure 3 (a) and (b)).
Balanced trees Chain trees perform N(N-1) TTMs. Kaya and Uçar [ucarreport] improved the count to approximately N log N, via a divide-and-conquer strategy. The idea is to divide the modes into two groups A and B of roughly equal size. We create a chain of |A| nodes with labels from the first group and attach it to the root. Then, we recursively construct a subtree for the second group and attach it at the bottom of the chain. We then repeat the process by reversing the roles of the two groups. Figure 3 (c) shows an example. The number of internal nodes is approximately N log N.
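A sketch in the spirit of this divide-and-conquer construction is given below (the exact grouping in [ucarreport] may differ in details); `count_internal` lets us check the approximately N log N node count.

```python
def balanced_tree(modes):
    """Divide-and-conquer TTM-tree over the given modes: split into
    halves A and B; a chain of all modes in A leads into the subtree
    for B, and vice versa. Returns a list of root children in the
    ('mode', n, children) / ('leaf', n) representation."""
    if len(modes) == 1:
        return [('leaf', modes[0])]
    h = len(modes) // 2
    A, B = modes[:h], modes[h:]

    def chain_then(group, subtree):
        node = subtree
        for n in reversed(group):    # wrap the subtree in a chain of TTM nodes
            node = [('mode', n, node)]
        return node

    return chain_then(A, balanced_tree(B)) + chain_then(B, balanced_tree(A))

def count_internal(children):
    """Number of internal (TTM) nodes in a list of subtrees."""
    total = 0
    for node in children:
        if node[0] == 'mode':
            total += 1 + count_internal(node[2])
    return total
```

For N = 4 this yields 8 internal nodes, matching N log2 N, versus N(N-1) = 12 for the chain trees.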
Mode Ordering Since the TTM-chain operation is commutative, the TTM products within a chain can be performed in any order. Based on this fact, Austin et al. [koldaipdps] propose the concept of mode ordering, wherein the modes of the input tensor are rearranged according to some permutation. For example, Figure 3 (a) and (b) are both chain trees, but have different mode orderings. They proposed two greedy heuristics for mode ordering. The first heuristic arranges the modes in increasing order of cost factor c_n, placing lower cost modes at the top of the tree where large tensors are encountered. The second heuristic arranges the modes in increasing order of compression factor f_n, aiming at higher compression at the top layers of the tree. We are not aware of any prior work on mode ordering with respect to balanced trees.
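The two greedy orderings can be sketched as a one-line sort; the function name and interface are illustrative, with I and K giving the mode lengths and core ranks from which c_n = K_n and f_n = K_n / I_n are derived.

```python
def order_modes(I, K, by="cost"):
    """Greedy mode orderings in the spirit of Austin et al.: sort modes
    ascending by cost factor c_n = K_n ("cost") or by compression factor
    f_n = K_n / I_n ("compression"), so that cheap or strongly
    compressing modes are applied near the top of the tree."""
    key = (lambda n: K[n]) if by == "cost" else (lambda n: K[n] / I[n])
    return sorted(range(len(I)), key=key)
```

The two heuristics can disagree: a mode may have a small cost factor yet compress poorly, as in the test below.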
3.3 Constructing Optimal Trees
In this section, we present our algorithm for constructing the optimal TTM-tree, the tree with the minimum cost. The algorithm is based on dynamic programming and runs in time exponential only in the number of modes N. In practice, the algorithm takes negligible time, since the number of dimensions of dense tensors is fairly small.
Towards developing the dynamic programming algorithm, we first claim that the optimal TTM-tree is binary, namely every node has at most two children. The proof is based on the observation that if a node has three children, then the children can be rearranged so that only two of them remain as its direct children. We then identify a set of subproblems and derive a recurrence relation relating them. This is followed by a description of the algorithm and an analysis of the running time.
Lemma 3.1.
There exists an optimal binary tree.
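To make the optimization concrete, here is one natural subset dynamic program over binary TTM-trees; it is an illustrative formulation under the cost model of Section 3.1 (leaves costed at zero), and the paper's exact recurrence and subproblem set may be organized differently. A state (P, M) describes a subtree that must produce the factor matrices of the modes in P, and must still multiply every mode in M on each path into it; a subtree either multiplies one mode of M next (a chain node) or splits P into two parts (a binary branch).

```python
from functools import lru_cache

def optimal_cost(C, F):
    """Minimum cost over all binary TTM-trees, with cost factors C[n]
    and compression factors F[n] (normalized cardinality of the input
    tensor is 1). States are bitmasks: P = factors still to produce,
    M = modes still to be multiplied above them."""
    N = len(C)

    @lru_cache(maxsize=None)
    def best(P, M):
        t = 1.0                          # cardinality entering this subtree:
        for n in range(N):               # modes outside P | M were already
            if not ((P | M) >> n) & 1:   # multiplied (and compressed) above
                t *= F[n]
        if bin(P).count("1") == 1 and M == 0:
            return 0.0                   # leaf: SVD, not costed here
        options = []
        for m in range(N):               # chain node: multiply mode m now
            if (M >> m) & 1:
                options.append(C[m] * t + best(P, M & ~(1 << m)))
        if bin(P).count("1") >= 2:       # branch node: split P = P1 + P2;
            sub = (P - 1) & P            # each side must still multiply the
            while sub:                   # other side's modes on its paths
                options.append(best(sub, M | (P & ~sub)) +
                               best(P & ~sub, M | sub))
                sub = (sub - 1) & P
        return min(options)

    return best((1 << N) - 1, 0)
```

With c_n = 2 and f_n = 0.5 for all modes, the optimum for N = 3 is 7 (two TTMs on the full tensor, three on half-size tensors), versus 9 for the naive chain trees.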