On Optimizing Distributed Tucker Decomposition for Dense Tensors
Abstract
The Tucker decomposition expresses a given tensor as the product of a small core tensor and a set of factor matrices. Apart from providing data compression, the construction is useful in performing analysis such as principal component analysis (PCA) and finds applications in diverse domains such as signal processing, computer vision and text analytics. Our objective is to develop an efficient distributed implementation for the case of dense tensors. The implementation is based on the HOOI (Higher Order Orthogonal Iteration) procedure, wherein the tensor-times-matrix product forms the core routine. Prior work has proposed heuristics for reducing the computational load and communication volume incurred by the routine. We study the two metrics in a formal and systematic manner, and design strategies that are optimal under the two fundamental metrics. Our experimental evaluation on a large benchmark of tensors shows that the optimal strategies provide significant reduction in load and volume compared to prior heuristics, and yield significant speedup in the overall running time.
1 Introduction
Tensors are the higher dimensional analogues of matrices. While matrices represent two-dimensional data, tensors are useful in representing data in three or higher dimensions. Tensors have been studied extensively by generalizing concepts pertaining to matrices. The Tucker decomposition [1] is a prominent construction that extends the singular value decomposition (SVD) to the setting of tensors. Given an N-dimensional tensor, the decomposition approximately expresses the tensor as the product of a small N-dimensional core tensor and a set of N factor matrices, one along each dimension (or mode); see Figure 1 for an illustration. The core is much smaller than the original tensor, leading to data compression. Prior work [2] has reported high compression rates on large real tensors. Apart from data compression, the decomposition is also useful in analysis such as PCA, and finds applications in diverse domains such as computer vision [3] and signal processing [4]. A detailed discussion on the topic can be found in the excellent survey by Kolda and Bader [5].
Tucker decomposition has been well-studied in sequential, shared memory and distributed settings for both dense and sparse tensors (e.g., [2, 6, 7]). Our objective is to develop an optimized implementation for dense tensors on distributed memory systems. The implementation is based on the popular STHOSVD/HOOI procedures. The STHOSVD (Sequentially Truncated Higher Order SVD) [8] is used to produce an initial decomposition. The HOOI (Higher Order Orthogonal Iteration) [9] procedure transforms any given decomposition to a new decomposition with the same core size, but with reduced error. The procedure is applied iteratively so as to reduce the error monotonically across the iterations. We focus on the latter HOOI procedure, which is invoked multiple times. The tensor-times-matrix product (TTM) component forms the core module of the procedure. Prior work has proposed heuristic schemes for reducing the computational load and communication volume of the component. Our objective is to enhance the performance by constructing schemes which are optimal in these two fundamental metrics.
Prior Work
Heuristics for computational load
The TTM component comprises a set of tensor-times-matrix multiplication operations. Based on the observation that the operations can be rearranged and reused in multiple ways, prior work has proposed heuristics for reducing the computational load. A naive scheme for implementing the component performs N(N−1) TTM operations. Baskaran et al. [10] focused on reducing the number of TTM operations, and proposed a scheme with (approximately) N^2/2 operations, which was further improved to approximately N log N by Kaya and Uçar [11]. However, minimizing the number of TTM operations is not sufficient and it is crucial to consider the cost of the operations, especially in the context of dense tensors. Austin et al. [2] measure the cost in terms of the number of floating point operations (FLOP). They empirically showed that the performance of the naive scheme can be improved by permuting (ordering) the modes of the input tensor and proposed a greedy heuristic for mode ordering. A similar heuristic is given by Vannieuwenhoven et al. [8].
Heuristics for communication volume
Austin et al. [2] presented the first distributed implementation of HOOI. They distribute the tensors among the processors using a Cartesian parallel distribution which generalizes the block distribution technique used in the context of matrices. The processors are arranged in the form of an N-dimensional grid and a tensor is partitioned into blocks by imposing the grid on the tensor; the blocks are then assigned to the processors. They showed that the communication volume is determined by the choice of the grid, and presented an empirical evaluation of the effect of the grid on the communication volume.
Our Contributions
Our objective is to enhance the performance of distributed Tucker decomposition for dense tensors by designing optimal schemes for the two metrics. We make the following contributions.

Optimal TTM-trees: As observed in prior work [11], the different TTM schemes can be conveniently represented in the form of trees, called TTM-trees. We present an efficient algorithm for constructing the optimal TTM-tree, the one having the least computational load, measured in terms of the number of floating point operations (FLOP).

Dynamic Gridding: Prior work uses a static gridding scheme, wherein the same grid is used for distributing the tensors arising in the different TTM operations. We propose the concept of dynamic gridding, which uses different grids tailored to the different operations, leading to significant reduction in communication volume, even when compared to the optimal static grids.

Optimal Dynamic Gridding: We present an efficient algorithm for finding the dynamic gridding scheme achieving the optimal communication volume.
Our distributed implementation builds on the work of Austin et al. [2] and incorporates the optimal schemes described above. We set up a large benchmark of tensors whose metadata are derived from real-life tensors. Furthermore, we also include a set of tensors with metadata derived from simulations in combustion science. Our experimental evaluation on the above benchmark demonstrates that the combination of optimal trees and the dynamic gridding scheme offers significant reduction in computational load and communication volume, resulting in significant improvement in overall execution time, compared to prior heuristics. To the best of our knowledge, our study is the first to consider optimal algorithms for the Tucker decomposition.
We note that prior work [8, 2] has provided evidence that STHOSVD may be sufficient for particular application domains. They present experimental evaluations on a sample of tensors arising in image processing and combustion science showing that for these tensors, STHOSVD is sufficient and HOOI does not provide significant error reduction. Our optimizations on HOOI would be useful for other tensors/domains where HOOI provides error reduction over STHOSVD. Furthermore, the ideas developed in this paper can be recast and used for improving STHOSVD as well.
Related Work
Tucker decomposition has been studied in sequential and parallel settings for dense and sparse tensors. For dense tensors, the MATLAB Tensor Toolbox provides a sequential implementation [12]. Zhou et al. [13] proposed a randomized algorithm for the case where the tensor fits in the physical memory of a single machine. Li et al. [14] proposed performance enhancements for a single TTM operation and their techniques can be incorporated within our framework. Austin et al. [2] described the first implementation for distributed memory systems, wherein they proposed heuristics for mode ordering and experimentally demonstrated the effect of grid selection on communication time. For sparse tensors, sequential [7], shared memory [10] and distributed implementations [6] are known. Other tensor decompositions have also been considered (see [5]). In particular, the CP decomposition, which generalizes the concept of rank factorization, has been well studied (e.g., [15]). We refer to the survey by Kolda and Bader [5] for a detailed treatment of tensor decompositions.
2 Tucker Decomposition
In this section, we briefly discuss tensor concepts pertinent to our problem, and then, describe the Tucker decomposition and the HOOI procedure.
2.1 Preliminaries
Fibers
Consider an N-dimensional tensor T of size I_1 × I_2 × ⋯ × I_N. Let |T| = I_1 · I_2 ⋯ I_N denote the cardinality (number of elements) of the tensor. The elements of T can be canonically indexed by a coordinate vector of the form (x_1, x_2, …, x_N), where each index x_n belongs to [1, I_n], for all modes n. A mode-n fiber is a vector of length I_n, containing all the elements that differ on the nth coordinate, but agree on all the other coordinates. The number of mode-n fibers is |T|/I_n. In the analogous case of matrices, two types of fibers can be found: row vectors and column vectors.
Tensor Unfolding
The tensor is stored as a matrix, and different matrix layouts are possible, called the unfoldings. The mode-n unfolding of T refers to the matrix whose columns are the mode-n fibers of the tensor. The columns are arranged in a lexicographic order (the details are not crucial for our discussion). This matrix is of size I_n × (|T|/I_n), and we denote it as T_(n).
TensorTimesMatrix Multiplication (TTM)
For any mode n, the tensor T can be multiplied by a matrix M along mode n, provided M has size J × I_n, for some J; the operation is denoted Y = T ×_n M. Conceptually, the operation applies the linear transformation M to all the mode-n fibers. It is realized via the matrix-matrix multiplication Y_(n) = M · T_(n), taking the output matrix Y_(n) to be the mode-n unfolding of Y. While the length along mode n changes from I_n to J, the number of mode-n fibers and the lengths along the other modes remain the same. Thus, Y has cardinality (|T|/I_n) · J and size I_1 × ⋯ × I_{n−1} × J × I_{n+1} × ⋯ × I_N.
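The unfold/multiply/fold realization of the TTM can be sketched in a few lines of NumPy. This is our own illustrative sketch (the helper names `unfold`, `fold` and `ttm` are not from the paper):

```python
import numpy as np

def unfold(T, n):
    # Mode-n unfolding: a matrix whose columns are the mode-n fibers of T.
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def fold(M, n, shape):
    # Inverse of unfold: rebuild a tensor of the given shape from its mode-n unfolding.
    perm = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape(perm), 0, n)

def ttm(T, M, n):
    # Tensor-times-matrix: multiply T along mode n by M (of size J x I_n).
    shape = list(T.shape)
    shape[n] = M.shape[0]
    return fold(M @ unfold(T, n), n, tuple(shape))
```

For a 3-dimensional tensor, `ttm(T, M, 1)` agrees with `np.einsum('rj,ijk->irk', M, T)`, which applies `M` to every mode-1 fiber.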
TTMChain
The TTM-chain operation refers to multiplying T along multiple distinct modes. For two modes m ≠ n and matrices A and B, we first multiply T by A along mode m, and then multiply the output tensor along mode n by B. An important property of the operation is commutativity [9], namely the two TTM operations can be performed in any order: (T ×_m A) ×_n B = (T ×_n B) ×_m A. In general, for a subset of distinct modes S, and matrices M_n for n ∈ S, where M_n has size J_n × I_n, the output is a tensor Y. The length of Y remains the same as that of T along modes not belonging to S, and changes to J_n, for all modes n ∈ S. Commutativity implies that the multiplications can be performed in any order.
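The commutativity property is easy to verify numerically; the following small sketch (ours, not the paper's code) multiplies a random tensor along its first two modes in both orders:

```python
import numpy as np

rng = np.random.default_rng(1)
T = rng.standard_normal((4, 5, 6))
A = rng.standard_normal((2, 4))  # multiplies the first mode
B = rng.standard_normal((3, 5))  # multiplies the second mode

# (T x_1 A) x_2 B
Y1 = np.einsum('bj,ajk->abk', B, np.einsum('ai,ijk->ajk', A, T))
# (T x_2 B) x_1 A
Y2 = np.einsum('ai,ibk->abk', A, np.einsum('bj,ijk->ibk', B, T))

assert np.allclose(Y1, Y2)  # the two orders produce the same tensor
```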
2.2 Tucker Decomposition and HOOI
The Tucker decomposition of T approximates the tensor as the product of a core tensor G of size K_1 × K_2 × ⋯ × K_N, with each K_n ≤ I_n, and a set of factor matrices U_1, U_2, …, U_N: T ≈ G ×_1 U_1 ×_2 U_2 ⋯ ×_N U_N. The factor matrix U_n has size I_n × K_n. The decomposition compresses the length of T along each mode n from I_n to K_n. We write the decomposition as [G; U_1, …, U_N]. The error of the decomposition is measured by comparing the recovered tensor G ×_1 U_1 ⋯ ×_N U_N and the input tensor under the normalized root mean square metric.
The HOOI procedure [9] transforms a given decomposition into a new decomposition having the same core size, but with reduced error. Given an initial decomposition, the procedure can be invoked repeatedly to reduce the error monotonically, until a desired convergence is achieved. An initial decomposition can be found using methods such as STHOSVD [8].
The HOOI procedure (a single invocation), shown in Figure 2, takes as input the tensor T, and a decomposition [G; U_1, …, U_N] with core size K_1 × ⋯ × K_N. It produces a new decomposition with lesser error, but having the same core and factor matrix sizes. For computing each new factor matrix U'_n, the procedure utilizes the alternating least squares paradigm and works in two steps. First, it performs a TTM-chain operation by skipping mode n and multiplying T by the transposes of all the other factor matrices U_m (with m ≠ n), obtaining a tensor Y = T ×_1 U_1^T ⋯ ×_{n−1} U_{n−1}^T ×_{n+1} U_{n+1}^T ⋯ ×_N U_N^T. The tensor Y has length compressed from I_m to K_m along all modes m ≠ n. In the next step, it performs an SVD on Y_(n), the mode-n unfolding of Y. The new factor matrix U'_n is obtained by arranging the K_n leading left singular vectors as columns. Once all the new factor matrices are computed, the new core tensor G' = T ×_1 (U'_1)^T ⋯ ×_N (U'_N)^T is computed.
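A minimal single-node sketch of one HOOI invocation may help fix the ideas. This is our own NumPy rendering of the two steps described above (no distribution, helper names are ours):

```python
import numpy as np

def unfold(T, n):
    # Mode-n unfolding: columns are the mode-n fibers.
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def ttm(T, M, n):
    # Tensor-times-matrix along mode n.
    shape = list(T.shape)
    shape[n] = M.shape[0]
    perm = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis((M @ unfold(T, n)).reshape(perm), 0, n)

def hooi_sweep(T, U):
    """One HOOI invocation: refresh each factor matrix via a TTM-chain
    that skips its own mode, then recompute the core."""
    N, U = T.ndim, [u.copy() for u in U]
    for n in range(N):
        Y = T
        for m in range(N):
            if m != n:
                Y = ttm(Y, U[m].T, m)        # multiply by the transposed factors
        W, _, _ = np.linalg.svd(unfold(Y, n), full_matrices=False)
        U[n] = W[:, :U[n].shape[1]]          # leading left singular vectors
    G = T
    for m in range(N):
        G = ttm(G, U[m].T, m)                # new core tensor
    return G, U
```

Starting from an initial decomposition (e.g., truncated SVDs of the unfoldings), repeated calls to `hooi_sweep` reduce the reconstruction error monotonically.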
Figure 3 (a) depicts the process in the form of a tree. The root represents the input tensor T, each node with label n represents multiplication along mode n, and each leaf represents a new factor matrix. For dense tensors, the SVD operations tend to be inexpensive (see [2]). Therefore, we focus on optimizing the TTM component, comprising the TTM-chains, from the perspectives of computational load and communication volume.
3 Computational Load
The TTM component performs N TTM-chains, each involving N−1 TTM operations. Commutativity allows us to rearrange and reuse the operations in multiple ways, all of which can be represented in the form of TTM-trees, as observed in prior work [11]. We measure the computational load of a tree by the number of floating point operations incurred. Our objective is to design an efficient algorithm for finding the optimal TTM-trees. Below, we first formalize the above model and rephrase prior schemes, and then describe the optimal algorithm.
3.1 TTMtrees and Cost
In a TTM-tree, the root represents the input tensor T, each leaf node represents a unique new factor matrix and each internal node (a node other than the root and the leaves) represents TTM along a particular mode. The root-to-leaf path leading to a new factor matrix U'_n realizes the TTM-chain required for computing U'_n.
TTMTrees
Formally, a TTM-tree is a rooted tree with a function that assigns a label to each node such that the following properties are satisfied: (i) the label of the root node is T; (ii) there are exactly N leaves, with each leaf being labeled with a unique new factor matrix U'_n; (iii) each internal node is labeled with a mode n ∈ [1, N]; (iv) for each leaf with label U'_n, the path from the root to the leaf has exactly N−1 internal nodes and all the modes except n appear on the path.
Figure 3 (a)–(c) provides example TTM-trees. Although the trees differ in the order in which the modes are processed and in the total number of TTMs performed, they all realize the necessary TTM-chains.
Given a tree, the HOOI procedure can be executed via a natural top-down process by associating each node v with an input tensor X_v and an output tensor Y_v. For the root node, the output is the input tensor T itself. Each internal node v with label n takes as input the tensor output by its parent u, multiplies it along mode n by the transposed factor matrix U_n^T, and outputs the resultant tensor, i.e., X_v = Y_u and Y_v = X_v ×_n U_n^T. Each leaf node with label U'_n constructs the new factor matrix by performing an SVD on the tensor output by its parent. The correctness of the procedure follows from the commutativity property of the TTM-chain operation. In the above procedure, we reuse the tensor output by a node for processing all its children. By executing the process via a depth-first traversal, we can ensure that the number of intermediate tensors stored at any point is bounded by the depth of the tree.
Figure 3: (a) Chain tree; (b) Chain tree (alternative mode ordering); (c) Balanced tree.
Computational Load
We define the cost (or computational load) of a TTM-tree to be the number of floating point operations (FLOP) performed. Each internal node v with label n executes the TTM Y_v = X_v ×_n U_n^T. Recall that the operation involves a matrix-matrix multiplication, wherein the matrix U_n^T is multiplied by the mode-n unfolding of X_v. The matrix has size K_n × I_n and the unfolded tensor has size I_n × (|X_v|/I_n), and so, the cost of the TTM is proportional to K_n · |X_v|. The cardinality of the output tensor is |X_v| · (K_n/I_n); namely, the node compresses the tensor by a factor K_n/I_n. We can compute the cost incurred at all the nodes and the cardinality of their output tensors by performing the above calculations in a top-down manner. Then, the cost of the tree is given by the sum of the costs of its internal nodes. We can see that each mode n is associated with two parameters: a cost factor χ_n = K_n and a compression factor ρ_n = K_n/I_n. At each node, the cost incurred and the cardinality of the output tensor can be expressed in terms of these two parameters.
Figure 4 provides an illustration. The cost incurred and the cardinality of the output tensor are shown at each node. For ease of exposition, we have normalized all the quantities by |T|. The root node incurs no cost and the cardinality of its output is |T|, which is 1 after normalization. Each node with label n incurs a cost of χ_n times the cardinality of the tensor output by its parent; it outputs a tensor whose cardinality is compressed by a factor ρ_n.
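The two per-mode parameters make the cost of any chain ordering easy to evaluate. The following sketch is ours: `chi[n]` plays the role of the cost factor (the compressed length K_n), `rho[n]` the compression factor (K_n/I_n), and all costs are normalized by the input cardinality:

```python
def chain_cost(order, chi, rho):
    """Normalized FLOP cost of a single TTM-chain that multiplies
    the modes in the given order."""
    card, cost = 1.0, 0.0      # normalized cardinality of the current tensor
    for n in order:
        cost += chi[n] * card  # TTM along mode n
        card *= rho[n]         # output is compressed by rho[n]
    return cost

def naive_scheme_cost(order, chi, rho):
    """Cost of the naive scheme: one independent chain per skipped mode,
    each processed in the given global mode order."""
    N = len(chi)
    return sum(chain_cost([n for n in order if n != skip], chi, rho)
               for skip in range(N))
```

With, say, `chi = [10, 2, 2]` and `rho = [0.1, 0.5, 0.5]`, processing the expensive-but-compressive mode 0 first or last changes the chain cost (10.2 versus 7 for the two-mode chain over modes 0 and 1), which is exactly the effect that mode-ordering heuristics exploit.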
3.2 Prior Schemes
We rephrase the prior schemes in terms of TTM-trees.
Chain trees
These trees encode the naive scheme, with N independent chains, each comprising N−1 internal nodes (see Figure 3 (a) and (b)).
Balanced trees
Chain trees perform N(N−1) TTMs. Kaya and Uçar [11] improved the count to approximately N log N, via a divide-and-conquer strategy. The idea is to divide the modes into two groups S_1 and S_2 of (roughly) equal size. We create a chain of nodes with labels from the first group and attach it to the root. Then, we recursively construct a subtree for the second group and attach it at the bottom of the chain. We then repeat the process by reversing the roles of the two groups. Figure 3 (c) shows an example. The number of internal nodes is approximately N log_2 N, as given by the recurrence f(N) = N + 2 f(N/2).
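The divide-and-conquer construction can be sketched directly from the description above. This is our own recursion under the cost model of the previous subsection (`chi[n]` the cost factor, `rho[n]` the compression factor, costs normalized by the input cardinality):

```python
def balanced_tree_stats(modes, chi, rho, card=1.0):
    """Number of TTMs and normalized cost of the divide-and-conquer tree
    over the given modes, starting from a tensor of the given cardinality."""
    if len(modes) == 1:
        return 0, 0.0              # leaf: only an SVD remains, not counted here
    half = len(modes) // 2
    S1, S2 = modes[:half], modes[half:]
    ttms, cost = 0, 0.0
    for chain, rest in ((S1, S2), (S2, S1)):
        c = card
        for n in chain:            # chain of TTM nodes attached at this level
            cost += chi[n] * c
            c *= rho[n]
            ttms += 1
        t2, c2 = balanced_tree_stats(rest, chi, rho, c)  # recurse below the chain
        ttms, cost = ttms + t2, cost + c2
    return ttms, cost
```

For N = 8 the recursion performs 24 TTMs, matching N log_2 N, against 8 × 7 = 56 for the chain trees.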
Mode Ordering
Since the TTM-chain operation is commutative, the TTM products within a chain can be performed in any order. Based on this fact, Austin et al. [2] propose the concept of mode ordering, wherein the modes of the input tensor are rearranged according to some permutation. For example, Figure 3 (a) and (b) are both chain trees, but have different mode orderings. They proposed two greedy heuristics for mode ordering. The first heuristic arranges the modes in increasing order of the cost factor χ_n, placing lower-cost modes at the top of the tree where large tensors are encountered. The second heuristic arranges the modes in increasing order of the compression factor ρ_n, aiming at higher compression at the top layers of the tree. We are not aware of any prior work on mode ordering with respect to balanced trees.
3.3 Constructing Optimal Trees
In this section, we present our algorithm for constructing the optimal TTM-tree, the tree with the minimum cost. The algorithm is based on dynamic programming and runs in time O(4^N). In practice, the algorithm takes negligible time, since the number of dimensions of dense tensors is fairly small (typically, N ≤ 10).
Towards developing the dynamic programming algorithm, we first claim that the optimal TTM-tree is binary, namely every node has at most two children. The proof is based on the observation that if a node u has three children, then the children can be rearranged so that only two of the nodes remain as children of u. We then identify a set of subproblems and derive a recurrence relation relating them. This is followed by a description of the algorithm and an analysis of the running time.
Lemma 3.1.
There exists an optimal binary tree.
Proof.
Let T be an optimal tree. Suppose a node u has three children v_1, v_2 and v_3. The properties of TTM-trees imply that none of the three nodes can be a leaf node. Let T_1, T_2 and T_3 be the subtrees rooted at the three nodes and let n be the label of v_1. The leaf bearing the label U'_n cannot appear in T_1, since mode n is already multiplied at v_1; without loss of generality, assume that it does not appear in the subtree T_2 either. Below, we argue that T can be transformed into a new tree T', without increasing the cost, such that v_2 is no longer a child of u. Thus, the transformation reduces the number of children of u by one. By repeating the process, we can get a binary tree having cost not more than that of T. The transformation is discussed next.
If the label of v_2 is also n, then we can merge v_1 and v_2. Otherwise, mode n must appear on all the paths from v_2 to the leaves in T_2. We perform two operations: (i) for any node in T_2 with label n, we delete the node (by making its children the children of its parent); (ii) we make v_2 a child of v_1. Let T' be the new tree created by the process. See Figure 5 for an illustration.
We can verify that T' is a valid TTM-tree. Furthermore, the cost of T' cannot be more than that of T, as argued next. The cost of the nodes outside T_2 does not change. Let w be any node in T_2 and consider three cases: (i) if w is one of the deleted nodes, then we save its cost; (ii) if w is a descendant of a deleted node, its cost does not change; (iii) if w is an ancestor of a deleted node, the cost cannot increase, since under T', the tensor input to w is further shrunk by the TTM along mode n. ∎
Subproblems
Consider any binary tree and let v be an internal node in it. With respect to v, the modes can be partitioned into three groups: (i) pre-multiplied: mode n is found along the path from the root to v, including v itself; (ii) computed under v: the leaf bearing label U'_n is found in the subtree rooted at v; (iii) reusable: n does not belong to either category. Let A, B and C denote the sets of modes belonging to the three categories. For an illustration, consider the tree in Figure 3 (c) and let v denote the right child of the root: A consists of the modes on the path from the root to v, B consists of the modes whose leaves lie under v, and C contains the remaining modes. Notice that the triple (A, B, C) forms a partitioning of the modes [1, N]. We can characterize any node in a TTM-tree via the above partition.
We next make an observation regarding the set C. Consider the stage in the HOOI execution wherein we have completed the processing of the node v. At this stage, we have already completed multiplication along all modes in A. For any mode n ∈ B, the corresponding TTM-chain involves multiplication along all modes except n. Of these modes, we are yet to perform multiplication along the modes in B \ {n} and C. The multiplications along modes in C are common to the TTM-chains corresponding to all the modes in B. Therefore, at this stage, we can potentially select any mode from C, perform multiplication along the mode and reuse the output tensor. Hence, we call the modes in C reusable. For instance, in the example discussed earlier (Figure 3), the modes in C are reusable at v.
The idea behind the dynamic programming algorithm is to consider a subproblem for each possible triple (A, B, C) as follows: construct the optimal subtree given that the modes in A have been multiplied already, the modes in B need to be computed, and C is the set of reusable modes. We formalize the concept using the notion of partial TTM-trees. These trees are similar to the usual TTM-trees, except that the root represents a partially processed tensor and we only need to compute a partial set of factor matrices.
Partial TTMtree
Consider a triple (A, B, C) with A ⊎ B ⊎ C = [1, N]. Let T_A denote the tensor obtained by multiplying T by the transposed factor matrices along all the modes found in A. A partial TTM-tree for (A, B, C) is a rooted tree with labels on its nodes such that the following properties are satisfied: (i) the root is labeled T_A; (ii) there are exactly |B| leaves, with each leaf being labeled with a unique factor matrix U'_n, for n ∈ B; (iii) each internal node is labeled with a mode from B ∪ C; (iv) for each leaf node with label U'_n, the path from the root to the leaf has exactly |B ∪ C| − 1 internal nodes and all the modes of B ∪ C except n appear on them. Figure 6 shows two example partial TTM-trees for the triple (A, B, C) discussed earlier.
The cost of a partial TTM-tree is defined analogously to the usual TTM-trees. Let opt(A, B, C) denote the optimal partial TTM-tree for the triple and let cost(A, B, C) be the cost of the optimal tree. The optimal tree for the original problem is given by opt(A, B, C) with A = ∅, B = [1, N] and C = ∅.
Recurrence Relation
We discuss the subproblem structure and derive a recurrence relation. Consider a triple (A, B, C). Since optimal trees are binary, the root of opt(A, B, C) can have either one or two children. The recurrence relation considers both possibilities, which we refer to as reuse and splitting.
Reuse: This option is available if C ≠ ∅. In this case, we select a mode r ∈ C and multiply T_A along mode r. The result is then reused for computing the new factor matrices of all the modes in B. In terms of TTM-trees, the operation corresponds to adding a single child with label r to the root of the partial TTM-tree. Once the above TTM operation is performed, we are left with solving the subproblem corresponding to the triple (A ∪ {r}, B, C \ {r}). The cost is given by the sum of the cost of the TTM operation and the cost of recursively solving the subproblem. Recall that the former cost is χ_r · |T_A|, which is χ_r · ∏_{a ∈ A} ρ_a after normalization. The latter cost is cost(A ∪ {r}, B, C \ {r}). In the above process, any mode from C can be reused and we can find the best option by considering all the choices.
Splitting: The second possibility is to split (or partition) B into sets B_1 and B_2 and independently solve the triples (A, B_1, C ∪ B_2) and (A, B_2, C ∪ B_1). The total cost is given by the sum of the costs of the optimal subtrees of the two subproblems, i.e., cost(A, B_1, C ∪ B_2) + cost(A, B_2, C ∪ B_1). Any non-trivial partition of B with B_1, B_2 ≠ ∅ can be used in the above process and the best choice can be found by an exhaustive search.
The above discussion yields the following recurrence relation for computing the optimal cost of a triple (A, B, C):

cost(A, B, C) = min { min_{r ∈ C} [ χ_r · ∏_{a ∈ A} ρ_a + cost(A ∪ {r}, B, C \ {r}) ],  min_{B = B_1 ⊎ B_2} [ cost(A, B_1, C ∪ B_2) + cost(A, B_2, C ∪ B_1) ] }.
Algorithm and Running Time Analysis
The algorithm constructs a dynamic programming table having at most 3^N entries, one for each triple (A, B, C) with A ⊎ B ⊎ C = [1, N]. The base cases for the recurrence relation are triples of the form (A, {n}, ∅) with A = [1, N] \ {n}, and the cost is zero in these cases. The other entries get computed by looking up previously computed entries as per the recurrence relation. The entries can be considered according to the following partial ordering: (A', B', C') precedes (A, B, C), if either |B'| < |B|, or |B'| = |B| and |C'| < |C|. The optimal partial trees can be constructed in a similar manner. The optimal tree for the original problem corresponds to the triple with A = ∅, B = [1, N] and C = ∅.
The running time of the algorithm can be analyzed by counting the number of dynamic programming table lookups performed. Each lookup can be specified by a configuration of the form (A, B, C, r) in the reuse scenario, and of the form (A, B_1, B_2, C) in the splitting scenario. In either case, there are at most O(4^N) possible configurations. The algorithm does not perform a lookup on the same configuration twice and hence, the total number of lookups is at most O(4^N). Thus, the algorithm runs in time O(4^N).
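The recurrence translates directly into a memoized search. The following sketch is our own compact rendering of the dynamic program, using frozensets for the triple (A, B, C); `chi[n]` and `rho[n]` are the per-mode cost and compression factors:

```python
from functools import lru_cache
from itertools import combinations

def optimal_tree_cost(chi, rho):
    """Optimal normalized TTM-tree cost via the (A, B, C) dynamic program."""
    N = len(chi)

    @lru_cache(maxsize=None)
    def cost(A, B, C):
        if len(B) == 1 and not C:          # base case: the chain is complete
            return 0.0
        cardA = 1.0
        for a in A:
            cardA *= rho[a]                # normalized cardinality of T_A
        best = float('inf')
        for r in C:                        # reuse: multiply along r, then recurse
            best = min(best, chi[r] * cardA + cost(A | {r}, B, C - {r}))
        for k in range(1, len(B)):         # split B into nonempty B1, B2
            for comb in combinations(sorted(B), k):
                B1 = frozenset(comb)
                B2 = B - B1
                best = min(best, cost(A, B1, C | B2) + cost(A, B2, C | B1))
        return best

    return cost(frozenset(), frozenset(range(N)), frozenset())
```

For N = 2 the only binary tree splits at the root, so the optimum is χ_1 + χ_2; for larger N the search weighs reuse against splitting exactly as the recurrence prescribes.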
Remarks: Regarding the recurrence relation, we may intuitively think that whenever C ≠ ∅, we should always reuse some mode from C; see the tree in Figure 6 for an illustration that does not reuse even though C ≠ ∅. However, the strategy is incorrect. We can construct examples wherein the optimal tree sacrifices the reuse option on modes having a high cost factor, so as to postpone multiplication along these modes till the tensor shrinks sufficiently.
Given that N is small, we may consider constructing the optimal TTM-trees via an exhaustive search. A naive search over all TTM-trees is prohibitively expensive: the TTM-chain corresponding to a mode involves multiplication along all the other N−1 modes, which can be performed in any of the (N−1)! orderings. Over all the N chains, the number of combinations is ((N−1)!)^N, all of which can be realized as chain trees. We can expedite the search by considering only the binary TTM-trees. We are not aware of any closed form expression for the number of binary TTM-trees. We note that our algorithm can be modified to enumerate all these trees. Instead of enumeration, the algorithm incorporates memoization and computes the optimal tree efficiently in time O(4^N).
4 Communication Volume
Our strategy is to fix a TTM-tree (either a heuristic tree or the optimal tree) and devise schemes for minimizing the volume. Our distributed implementation uses the same strategy as that of Austin et al. [2] for distributing the tensors and performing TTM in a distributed manner. We propose a dynamic gridding scheme that offers significant reduction in volume and design an efficient algorithm for finding the optimal scheme.
4.1 Distributed Setup
Tensor Distribution
Fix a TTM-tree and let P be the number of processors. We arrange the processors in an N-dimensional grid g = (g_1, g_2, …, g_N) such that g_1 · g_2 ⋯ g_N = P. To distribute a tensor, we impose the grid on the tensor, partition it into P blocks, and assign each block to a processor; see Figure 7 for an illustration. The input tensor and all the intermediate tensors get partitioned using the same grid.
Distributed TTM and Volume
Each node v with label n and parent u performs the TTM operation Y_v = X_v ×_n U_n^T. For the grid g, we denote the communication volume incurred by the operation as Vol_g(v). As observed in prior work [2], the volume is Vol_g(v) = |Y_v| · (g_n − 1)/g_n; a brief outline of the argument is given in the following paragraph. The total communication volume of the tree, denoted Vol(g), is defined to be the sum of the volumes incurred at all the internal nodes.
Recall that the TTM operation can be viewed as applying the linear transformation U_n^T to every mode-n fiber of X_v. That is, for each fiber f, we need to perform the matrix-vector product U_n^T · f. Since the factor matrices are small in size, we can afford to keep a copy of them at every processor. However, each mode-n fiber gets distributed equally among g_n processors and so, computing the product requires a reduce operation. Similarly, the output fiber must be distributed among the same g_n processors using a scatter operation. See Figure 8 for an illustration. The reduce-scatter operation is performed over the output fiber of length K_n, for which we incur K_n · (g_n − 1)/g_n units of communication. Summed up over all the fibers, the total communication volume for the TTM is |Y_v| · (g_n − 1)/g_n.
In the above distribution method, if g_n > I_n for some mode n, then some processor would receive an empty block while partitioning T. Similarly, if g_n > K_n, then the same scenario would arise on some intermediate tensor. We avoid the load imbalance by considering only grids with g_n ≤ K_n, for all modes n; we call these valid grids. In the rest of the discussion, unless explicitly mentioned, we shall only consider valid grids.
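Under this distribution, the volume of a full TTM-chain on a static grid follows directly from the per-TTM formula. A sketch in our notation (`sizes[n] = I_n`, `ranks[n] = K_n`, `grid[n] = g_n`):

```python
def ttm_volume(card_out, g_n):
    # Reduce-scatter volume of one TTM: |Y| * (g_n - 1) / g_n.
    return card_out * (g_n - 1) / g_n

def chain_volume(sizes, ranks, order, grid):
    """Total communication volume of one TTM-chain under a static grid."""
    card = 1
    for s in sizes:
        card *= s                          # cardinality of the input tensor
    vol = 0.0
    for n in order:
        card = card // sizes[n] * ranks[n]  # output cardinality after mode n
        vol += ttm_volume(card, grid[n])
    return vol
```

Assigning g_n = 1 along every multiplied mode makes the chain communication-free, which is the observation that motivates dynamic gridding below.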
Table 1: The number of grids D(P, N) for example values of P and N.

N          5       6       7       8       9       10
P = 2^5    126     252     462     792     1287    2002
P = 2^10   1001    3003    8008    19448   43758   92378
P = 2^20   10626   53130   230K    888K    3.1M    10M
4.2 Finding the Optimal Static Grid
We observe that the optimal static grid, the one achieving the minimum communication volume, can be found via an exhaustive search in negligible time. The number of grids, including the invalid ones, is the same as the number of ways in which the integer P can be expressed as an ordered product of N factors, which we denote D(P, N). If the prime factorization of P is p_1^{a_1} · p_2^{a_2} ⋯ p_k^{a_k}, then we have that D(P, N) = ∏_{j=1}^{k} C(a_j + N − 1, N − 1), where C(·, ·) denotes the binomial coefficient.
Table 1 shows the quantity D(P, N) for example values of P and N. When the quantity becomes large, the search can be parallelized in a straightforward manner. Even for the extreme case of P = 2^20 and N = 10, the number of grids to be scanned per processor is approximately 10.
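The count D(P, N) is a stars-and-bars computation per prime factor; a quick sketch (our code) that reproduces the entries of Table 1:

```python
from math import comb

def num_grids(P, N):
    """Number of ways to write P as an ordered product of N factors:
    each prime multiplicity a is distributed over N slots in
    C(a + N - 1, N - 1) ways."""
    exps, n, p = [], P, 2
    while p * p <= n:                 # trial-division prime factorization
        a = 0
        while n % p == 0:
            a += 1
            n //= p
        if a:
            exps.append(a)
        p += 1
    if n > 1:
        exps.append(1)
    d = 1
    for a in exps:
        d *= comb(a + N - 1, N - 1)
    return d
```

For instance, `num_grids(32, 5)` returns 126 and `num_grids(2**20, 10)` returns 10015005, matching the first and last rows of the table.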
4.3 Dynamic Gridding Scheme
The idea of dynamic gridding is as follows. Consider a node v, with parent u and label n. The node performs the TTM operation Y_v = X_v ×_n U_n^T. If the tensor X_v is represented in a grid g, then we incur a volume of |Y_v| · (g_n − 1)/g_n. Thus, it is beneficial to represent X_v under a grid with a small assignment g_n, and in fact, the operation can be made communication-free by assigning g_n = 1. The static gridding scheme selects a single grid by considering the cumulative effect of the above communication volume over all the nodes. The idea of dynamic gridding is to select different grids for representing the intermediate tensors, as appropriate for each node. However, we need to pay a price for dynamic gridding: if the tensor output by the parent is represented in a grid g and we have selected a different grid g' for representing it at v, then the tensor must be regridded (redistributed) among the processors. The process incurs a volume of (at most) |X_v|, since each element needs to be communicated to its new owner. Thus, a dynamic grid scheme must decide whether or not to regrid at each node, and furthermore, if it decides to regrid, the new grid must be selected in a manner beneficial for the TTM operations performed later in the subtree, so that the overall communication is minimized.
Figure 9 shows an example, carefully constructed so as to highlight the different aspects of dynamic gridding. The choice of the initial grid makes the TTM operations at the first two nodes communication-free. However, the grid is not suitable for the TTM at the next node, since a large volume would be incurred there. Instead, we switch to a new grid that makes the operation communication-free. We perform another regrid operation further down the tree, where the choice of the new grid is motivated by the following considerations. The subtree beneath the node does not involve any TTM along one particular mode and so, it is prudent to assign a high value along that mode. However, we must select a valid grid, and the constraint g_n ≤ K_n caps the assignment at the core length along the mode. We next assign a value that makes the TTM at the next node communication-free. The remaining processors are assigned to the other modes in a balanced manner.
Dynamic Grid Scheme
Formally, a dynamic grid scheme is a mapping that associates a grid with each node . The volume incurred by the scheme, denoted , is defined as follows. For each node with label and parent , we compute the volume incurred at the node as the sum of two components: (i) TTM operation volume: , where is the assignment to mode under ; (ii) regridding volume: if is the same as the parent grid , then the volume is zero, and otherwise, it is . The volume of the scheme, denoted , is defined to be the sum of the volumes incurred over all the nodes . At the root node, we represent the input tensor under the grid and do not have the regrid option. Let denote the optimal communication volume achievable among all dynamic grid schemes, and let denote an optimal scheme.
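The definition above can be made concrete with a short sketch. The tree encoding and the callbacks `ttm_volume` and `regrid_volume` below are our own illustrative stand-ins for the paper's volume formulas (which depend on the tensor, core, and grid dimensions); only the accounting structure follows the text.

```python
def scheme_volume(nodes, parent, mode, scheme, root_grid,
                  ttm_volume, regrid_volume):
    """Total communication volume of a dynamic grid scheme.

    nodes:   list of node ids (root first)
    parent:  dict node -> parent node (root maps to None)
    mode:    dict node -> mode label of the TTM performed at the node
    scheme:  dict node -> grid chosen for the node
    root_grid: grid under which the input tensor is stored (fixed)
    ttm_volume(g, m):  hypothetical volume of a TTM along mode m under grid g
    regrid_volume(src, dst): hypothetical cost of redistributing src -> dst
    """
    total = 0
    for v in nodes:
        p = parent[v]
        if p is None:
            # root: the input tensor stays in root_grid (no regrid option)
            assert scheme[v] == root_grid
        elif scheme[v] != scheme[p]:
            total += regrid_volume(scheme[p], scheme[v])  # redistribute
        total += ttm_volume(scheme[v], mode[v])  # TTM under the node's grid
    return total
```

A scheme that keeps the parent's grid at a node pays no regridding cost there, matching component (ii) of the definition.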
4.4 Optimal Dynamic Gridding Scheme
In this section, we develop an efficient dynamic programming algorithm for computing the optimal dynamic grid scheme for a given tree . For a node , let denote the subtree rooted at . A partial grid scheme for refers to a mapping that specifies a grid for each node in . For each node and each grid , we shall define a subproblem with the following connotation: assuming that the tensor output by the parent is represented under the grid , find the optimal partial grid scheme for the subtree . We solve these subproblems via a bottomup traversal of the tree, wherein the optimal solution at is computed from the optimal solutions of its children.
Subproblems
Consider a pair , where is a node and is a grid. For a partial grid scheme for the subtree , let denote the volume incurred by , given that the tensor output by the parent of is represented in the grid . Formally, it is computed as follows. For each node , define a parent grid : for the node , , and for the other nodes, , where is the parent of . For any node , associate the volume given by the sum of the following two components: (i) TTM operation volume: , where is the mode label of and is the assignment to mode under ; (ii) regridding volume: if is the same as , then the volume is zero, and otherwise, it is . The volume is then defined to be the sum of the volumes associated with all the nodes . Let denote the minimum volume possible among all partial grid schemes . We do not regrid at the root, and so we define to be the minimum volume given that is represented under .
Recurrence Relation
We derive a recurrence relation for computing . Let be the children of . In determining the optimal partial scheme, we have two options: (i) regrid: select a new grid for representing ; (ii) do not regrid: represent under the given grid . In the first case, we select to be the grid yielding the minimum volume for the child subtrees:
We can now write the recurrence for . Let be the label of and be the parent of . Let and let . Then:
The two quantities correspond to the optimal solutions for the two choices of regridding and not regridding. In both cases, we incur communication for the TTM operation and communication in the subtrees. In addition, the first case incurs a regrid volume of . Under the two choices, the tensors and get represented under the grids and , respectively; consequently, the recursive calls for the two choices are made with the corresponding grids. At the root node, we represent under and do not regrid, and so we consider only the first choice at the root. The optimal volume for the whole tree is given by the minimum of over all the choices of , and can be computed by enumerating the choices.
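The recurrence admits a short memoized sketch. As before, `ttm_volume` and `regrid_volume` are hypothetical callbacks abstracting the paper's volume formulas; here `regrid_volume` is modeled as depending only on the tensor entering the node (not on the pair of grids), which is what makes the best regrid grid independent of the parent grid, as noted in the running-time analysis below.

```python
def optimal_scheme_volume(children, mode, grids, root, root_grid,
                          ttm_volume, regrid_volume):
    """Optimal dynamic-grid communication volume for a TTM-tree.

    children:          dict node -> list of children ([] for leaves)
    mode:              dict node -> mode label of the TTM at the node
    grids:             finite set of candidate grids
    root_grid:         grid of the input tensor (no regrid at the root)
    ttm_volume(g, m):  hypothetical TTM volume along mode m under grid g
    regrid_volume(v):  hypothetical cost of redistributing the tensor at v
    """
    cache = {}  # (node, grid) -> subtree volume when entering under grid
    best = {}   # node -> (min subtree volume over all grids, best grid)

    def subtree(v, g):
        # volume of v's subtree when the tensor entering v is under grid g
        if (v, g) not in cache:
            cache[(v, g)] = (ttm_volume(g, mode[v])
                             + sum(opt(c, g) for c in children[v]))
        return cache[(v, g)]

    def opt(v, h):
        # choice (i): regrid to the best grid g, chosen independently of h
        if v not in best:
            best[v] = min((subtree(v, g), g) for g in grids)
        val, g = best[v]
        regrid_choice = val + (regrid_volume(v) if g != h else 0)
        # choice (ii): keep the parent grid h
        return min(regrid_choice, subtree(v, h))

    # the root is processed under root_grid; no regrid option there
    return subtree(root, root_grid)
```

Because the regrid choice degenerates to the keep choice whenever the best grid equals the parent grid, taking the minimum of the two branches covers all options in the recurrence.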
Algorithm and Running Time Analysis
The algorithm constructs a dynamic programming table containing an entry for each pair . Thus, the number of entries is , where is the number of nodes in the tree. The entries are computed via a bottom-up traversal of the tree. For each entry , we need to compute , which requires a search over all the grids. However, the selection of the grid is independent of the parameter , and so it is sufficient to compute it once per node. The recurrence relation involves a table lookup for each child, and an entry for a node is looked up only by its parent, so the total number of table lookups is . As in the case of static grids (Section 4.2), the exhaustive search involved in computing can be parallelized in a straightforward manner if is large. Thus, the algorithm executes in negligible time in practice.
5 Distributed Implementation
The distributed implementation consists of two modules, a planner and an engine. The planner constructs a TTM-tree, either based on the heuristics or the optimal tree, and selects grids, using either the optimal static or dynamic gridding scheme. The module requires only the metadata as input: the dimension lengths of the input tensor and the core tensor. It needs to be executed only once, and its output can be used across multiple invocations of the HOOI procedure. All the processors use the same TTM-tree, and there is synchronization at each tree node.
The second module, called the engine, maintains tensors in a distributed manner and implements the TTM and SVD routines. Tensors are distributed according to the block distribution method (Section 4). To change the grid under which a tensor is represented, the engine implements an element redistribution procedure via the MPI_Alltoallv collective. The TTM operation is implemented in a distributed manner using the algorithm proposed by Austin et al. [2]. The naive method for computing the TTM product along a specified mode first requires an unfolding of the tensor along the mode. Their algorithm cleverly avoids the unfolding operation by employing a blocking strategy that breaks down the TTM product into a series of matrix multiplication calls, which are performed using dgemm. We implement the SVD component using a distributed Gram matrix computation ( ) followed by an eigenvalue decomposition (EVD). The Gram matrix computation is performed using dsyrk calls, which exploit the symmetry in the product. The EVD is computed sequentially by invoking the dsyevx routine; this is acceptable since it operates on small square matrices of size , and in our setting.
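The numerics of the SVD step can be illustrated with a small sequential NumPy sketch; the paper's implementation distributes the Gram computation via dsyrk and uses LAPACK's dsyevx for the EVD, while here `np.linalg.eigh` stands in for both. This is an illustrative sketch of the Gram-matrix approach, not the paper's code.

```python
import numpy as np

def leading_left_singular_vectors(A, r):
    """Leading r left singular vectors of A via the Gram matrix.

    Forms G = A A^T (the distributed dsyrk step in the paper) and
    eigendecomposes it (the sequential dsyevx step). G is only as
    large as the short side of A, so the EVD is cheap even when A
    has many columns, as is the case for unfolded tensors.
    """
    G = A @ A.T                       # symmetric Gram matrix
    w, V = np.linalg.eigh(G)          # eigenvalues in ascending order
    order = np.argsort(w)[::-1][:r]   # take the r largest
    return V[:, order]
```

Eigenvectors are determined only up to sign, so any comparison against a direct SVD must be sign-insensitive.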
6 Experimental Evaluation
In this section, we present an experimental evaluation of the algorithms described in the paper.
6.1 Setup
System
The experiments were conducted on an IBM BG/Q system. Each BG/Q node has cores and GB memory. Our implementation is based on MPI and OpenMP, with gcc 4.4.6 and ESSL 5.1. Each MPI rank was mapped to a single node and spawned 16 threads, which were mapped to the cores. All the experiments use nodes.
Tensors
As discussed in the introduction, the execution time of the HOOI algorithm is crucially dependent on the metadata (dimension lengths of the input tensor and the core tensor), and independent of the elements in the tensor. We exploit this property to construct a large benchmark of tensors with metadata derived from real world tensors considered in prior work.
We also include a set of tensors with metadata derived from simulations in combustion science [2]. The metadata of these tensors is shown in Table 2. Due to memory limitations, we curtailed the lengths along certain dimensions; while the lengths along all the spatial dimensions were retained as such, we reduced the lengths along the axes of variables/timesteps and proportionately reduced the core lengths along these axes. We filled these tensors with randomly generated data.
The benchmark is constructed as follows. We constructed and dimensional tensors with dimension lengths drawn from the set . We selected the core dimension lengths by fixing the compression ratio . The value for was drawn from the set . Given the above two sets of choices, an input for the HOOI procedure is generated as follows: for each dimension N, we select from the first set of choices, select from the second set of choices, and set . We placed an upper limit of on the cardinality of . We enumerated all possible HOOI inputs in the above manner and obtained a benchmark consisting of dimensional and dimensional tensors.
Table 2: Tensor and core dimensions of the real tensors

Tensor   Dimensions                 Core Tensor Dimensions
HCCI     (672, 672, 627, 16)        (279, 279, 153, 14)
TJLR     (460, 700, 360, 16, 4)     (306, 232, 239, 16, 4)
SP       (500, 500, 500, 11, 10)    (81, 129, 127, 7, 6)
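The benchmark enumeration described above can be sketched as follows. The dimension-length and compression-ratio sets below are hypothetical placeholders, since the exact values used in the paper are not reproduced here; only the cross-product construction follows the text.

```python
from itertools import product

# Hypothetical placeholder values -- the actual sets used in the
# paper's benchmark are not reproduced here.
DIM_LENGTHS = [128, 256, 512]
RATIOS = [0.1, 0.25, 0.5]   # core length = ratio * dimension length

def enumerate_inputs(order):
    """Enumerate HOOI inputs of a given tensor order: for each mode,
    pick a dimension length and a compression ratio, and derive the
    core length along that mode from the ratio."""
    per_mode = [(n, max(1, int(r * n)))
                for n, r in product(DIM_LENGTHS, RATIOS)]
    for combo in product(per_mode, repeat=order):
        dims = tuple(n for n, _ in combo)
        core = tuple(k for _, k in combo)
        yield dims, core
```

Each yielded pair gives the input-tensor dimensions and the corresponding core dimensions for one HOOI invocation; the paper additionally caps the number of generated inputs.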
6.2 Evaluation
The experiments compare our algorithms with the prior heuristics (Sections 3.2 and 4.1). The heuristics are obtained by fixing the tree class to be chain or balanced trees, and the mode ordering to be ordering or ordering. In the case of balanced trees, we observed that ordering and ordering do not impact the execution time, and so we use the input (naive) mode ordering. For all these heuristics, we use the optimal static grids. We compare the heuristics with our algorithms: the optimal tree algorithm with static grids, and the same algorithm with dynamic grids.
The following metrics were studied: overall execution time, computational load and time, and communication volume and time. The dimensions of the tensors/matrices arising in the computations are identical across different HOOI iterations (only data elements change). Consequently, any two HOOI iterations will incur the same computational load and communication volume. Thus, the running times would be approximately the same across iterations. Hence, we executed each algorithm on all the benchmark tensors and measured these metrics for a single HOOI invocation.
Overall Execution Time
We compared the overall execution time of the opttree algorithm with dynamic gridding against the prior heuristics. For each tensor, we normalized the execution times w.r.t. the execution time of the opttree algorithm (which becomes unit). Given that the benchmark is large, we summarize the results using a percentile plot. Figures (a) and (b) show the plots for D and D tensors. In these plots, a normalized time of at percentile means that for of the tensors, the normalized execution time is less than . For example, in Figure (a), the th percentile value for (chain, K) is , meaning that the improvement factor obtained by the opttree algorithm is at most x for of the tensors and at least x for the remaining of the tensors. These plots reveal the overall performance of the heuristics across the benchmark; a lower curve means that the heuristic performs better.
The curves corresponding to the prior work lie above that of the opttree algorithm, i.e., it outperforms all the prior algorithms on every tensor in the benchmark. The performance gain depends on the metadata and varies from 1.5x to 7x. The tensors that achieved the minimum and the maximum gains are: Min, compressed to ; Max, compressed to . The median improvement is x for D and x for D tensors. A detailed study is required to characterize the gain in terms of the metadata.
We also studied the performance of the algorithms on the real tensors. Figure (c) shows the actual execution time for one HOOI invocation. For each tensor, we show four bars, corresponding to the three prior algorithms and the opttree algorithm with dynamic grids. For all the tensors, we see that the balanced tree outperforms the chain algorithms, because it reuses TTM operations. The opttree algorithm offers improvements as high as 4.6x over (chain, ), 5.8x over (chain, ) and 4.1x over (balanced). For these tensors, the superior performance of the opttree algorithm is mainly due to a drastic reduction in communication time and a partial reduction in computation time. Remarkably, the opttree algorithm becomes nearly communication-free on all three tensors.
Computation Optimization
Here, we study the performance gains from optimal computation tree construction by comparing the heuristics and the opttree algorithm on computation time and load for the TTM component. The time and load for each algorithm-tensor pair were normalized w.r.t. those of the opttree algorithm. The comparisons of time for D and D tensors are reported in Figures (a) and (b). The opttree algorithm offers a 1.5-1.7x median improvement over the prior algorithms for D tensors and a 1.4-2.0x median improvement for D tensors. The maximum gain is as high as 2.8x for D and 3.7x for D tensors.
Figures (c) and (d) show the normalized computational load for D and D tensors. We see that the opttree algorithm offers up to x (D) and x (D) reduction in load over the best prior algorithm, corroborating the improvements seen in time. The improvements are higher for D than for D, because opttree has more opportunities for careful placement and reuse of the TTMs.
Communication Optimization
In this experiment, we study the benefits of dynamic gridding. To do so, we compare the opttree algorithm under the static and the dynamic gridding schemes on the metrics of communication time and volume. For communication time, we include the time incurred in TTM multiplication as well as in regridding. The quantities are normalized with respect to the dynamic gridding scheme. The results are shown in Figures (e) and (f). In Figure (f), we can see that dynamic gridding offers up to a x factor improvement in communication volume over static gridding, whereas in Figure (e), we can see improvements of up to a x factor (median x) in communication time. The reason for the higher improvement in communication time is that regridding (based on the all-to-all collective) turns out to be faster than TTM multiplication (based on reduce-scatter over group communicators) for the same communication volume. Remarkably, the dynamic grid scheme outperforms the static grid scheme on almost all the tensors in the benchmark, with a gain of at least a factor of on % of the tensors. The gain in communication time is a result of the improvement in communication volume, a machine-independent statistic. Thus, we expect similar gains on other distributed memory systems as well.
7 Conclusions
We studied the Tucker decomposition of dense tensors in the distributed memory setting. We proposed efficient algorithms for computing optimal trees and dynamic gridding schemes. Our experimental evaluation on a large benchmark demonstrates that the proposed algorithms lead to significant reductions in computational load and communication volume, and offer up to x improvement in performance. To further improve the performance of HOOI on dense tensors, a distributed SVD solver could be used instead of the Gram product followed by sequential EVD. Investigating the applicability of the techniques developed in this paper to the case of sparse tensors is a potential avenue for future work.
Acknowledgements
We thank Woody Austin, Grey Ballard and Tamara G. Kolda for sharing their insights with us, and the reviewers for helpful comments.
References
 [1] L. R. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychometrika, vol. 31, pp. 279–311, 1966.
 [2] W. Austin, G. Ballard, and T. G. Kolda, “Parallel tensor compression for large-scale scientific data,” in IPDPS, 2016.
 [3] M. A. O. Vasilescu and D. Terzopoulos, “Multilinear analysis of image ensembles: TensorFaces,” in ECCV, 2002.
 [4] D. Muti and S. Bourennane, “Multidimensional filtering based on a tensor approach,” Signal Processing, vol. 85, pp. 2338–2353, 2005.
 [5] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Review, vol. 51, pp. 455–500, 2009.
 [6] O. Kaya and B. Uçar, “High performance parallel algorithms for the Tucker decomposition of sparse tensors,” in ICPP, 2016.
 [7] T. G. Kolda and J. Sun, “Scalable tensor decompositions for multi-aspect data mining,” in ICDM, 2008.
 [8] N. Vannieuwenhoven, R. Vandebril, and K. Meerbergen, “A new truncation strategy for the higher-order singular value decomposition,” SIAM J. on Scientific Computing, vol. 34, no. 2, pp. 1027–1052, 2012.
 [9] L. D. Lathauwer, B. D. Moor, and J. Vandewalle, “On the best rank-1 and rank-() approximation of higher-order tensors,” SIAM J. Matrix Analysis and Applications, vol. 21, pp. 1324–1342, 2000.
 [10] M. Baskaran, B. Meister, N. Vasilache, and R. Lethin, “Efficient and scalable computations with sparse tensors,” in HPEC, 2012.
 [11] O. Kaya and B. Uçar, “High-performance parallel algorithms for the Tucker decomposition of higher-order sparse tensors,” Inria, Tech. Rep. RR8801, HAL01219316, 2015.
 [12] B. W. Bader and T. G. Kolda, “Efficient MATLAB computations with sparse and factored tensors,” SIAM J. on Scientific Comp., vol. 30, no. 1, pp. 205–231, 2007.
 [13] G. Zhou, A. Cichocki, and S. Xie, “Decomposition of big tensors with low multilinear rank,” CoRR, arXiv:1412.1885, 2015.
 [14] J. Li, C. Battaglino, I. Perros, J. Sun, and R. Vuduc, “An input-adaptive and in-place approach to dense tensor-times-matrix multiply,” in SC, 2015.
 [15] U. Kang, E. Papalexakis, A. Harpale, and C. Faloutsos, “GigaTensor: Scaling tensor analysis up by 100 times – algorithms and discoveries,” in KDD, 2012.