Scalable Tucker Factorization for Sparse Tensors: Algorithms and Discoveries
Abstract
Given sparse multidimensional data (e.g., (user, movie, time; rating) tuples for movie recommendations), how can we discover latent concepts/relations and predict missing values? Tucker factorization has been widely used to solve such problems with multidimensional data, which are modeled as tensors. However, most Tucker factorization algorithms regard and estimate missing entries as zeros, which leads to a highly inaccurate decomposition. Moreover, the few methods that focus on accuracy exhibit limited scalability, since they require huge memory and heavy computational costs while updating factor matrices.
In this paper, we propose P-Tucker, a scalable Tucker factorization method for sparse tensors. P-Tucker performs alternating least squares with a gradient-based update rule in a fully parallel way, which significantly reduces the memory required for updating factor matrices. Furthermore, we offer two variants of P-Tucker: a caching algorithm P-Tucker-Cache and an approximation algorithm P-Tucker-Approx, both of which accelerate the update process. Experimental results show that P-Tucker achieves 1.7–14.1× speedup and 1.4–4.8× lower error compared to the state-of-the-art. In addition, P-Tucker scales near-linearly with the number of nonzeros in a tensor and with the number of threads. Thanks to P-Tucker, we successfully discover hidden concepts and relations in a large-scale real-world tensor, while existing methods fail to reveal latent features due to their limited scalability or low accuracy.
I Introduction
Given a large-scale sparse tensor, how can we discover latent concepts/relations and predict missing entries? How can we design a time- and memory-efficient algorithm for analyzing a given tensor? Various real-world data can be modeled as tensors, or multidimensional arrays (e.g., (user, movie, time; rating) for movie recommendations). Many real-world tensors are sparse and partially observable, i.e., composed of a vast number of missing entries and a relatively small number of observable entries. Examples of such data include item ratings [1], social networks [2], and web search logs [3], where most entries are missing. Tensor factorization has been used effectively for analyzing tensors [4, 5, 6, 7, 8, 9, 10]. Among tensor factorization methods [11], Tucker factorization has received much interest since it is a generalized form of other factorization methods such as CANDECOMP/PARAFAC (CP) decomposition, and it allows us to examine not only latent factors but also relations hidden in tensors.
While many algorithms have been developed for Tucker factorization [12, 13, 14, 15], most methods produce highly inaccurate factorizations since they assume and predict missing entries as zeros, even though the values of those missing entries are unknown. Moreover, existing methods that focus only on observed entries exhibit limited scalability, since they rely on tensor operations and singular value decomposition (SVD), leading to heavy memory and computational requirements. In particular, tensor operations generate huge intermediate data for large-scale tensors, a problem called intermediate data explosion [16]. A few Tucker algorithms [17, 18, 19, 20] have been developed to address the above problems, but they fail to solve the scalability and accuracy issues at the same time. In summary, the major challenges for decomposing sparse tensors are 1) how to handle missing entries for an accurate and scalable factorization, and 2) how to avoid the intermediate data explosion and high computational costs caused by tensor operations and SVD.
Method           | Scale | Speed | Memory | Accuracy
Tucker-wOpt [18] |       |       |        |    ✓
Tucker-CSF [20]  |   ✓   |   ✓   |        |
[17]             |   ✓   |   ✓   |   ✓    |
P-Tucker         |   ✓   |   ✓   |   ✓    |    ✓
In this paper, we propose P-Tucker, a scalable Tucker factorization method for sparse tensors. P-Tucker performs alternating least squares (ALS) with a gradient-based update rule, which focuses only on the observed entries of a tensor. The gradient-based approach considerably reduces the amount of memory required for updating factor matrices, enabling P-Tucker to avoid the intermediate data explosion problem. In addition, to speed up the update procedure, we provide two time-optimized variants: a caching method P-Tucker-Cache and an approximation method P-Tucker-Approx. P-Tucker fully employs multi-core parallelism by carefully allocating rows of a factor matrix to each thread, considering both independence and fairness. Table I summarizes a comparison of P-Tucker and competitors with regard to various aspects.
Our main contributions are the following:

Algorithm. We propose P-Tucker, a scalable Tucker factorization method for sparse tensors. P-Tucker not only enhances the accuracy of factorization by focusing on observed values, but also achieves higher scalability by utilizing a gradient-based ALS rather than tensor operations and SVD for updating factor matrices.

Theory. We suggest a row-wise update rule for factor matrices and prove its correctness and convergence. Moreover, we analyze the time and memory complexities of P-Tucker and other methods, as summarized in Table III.

Performance. P-Tucker provides the best performance across all aspects: tensor scale, factorization speed, memory requirement, and accuracy of decomposition. Experimental results demonstrate that P-Tucker achieves 1.7–14.1× speedup with 1.4–4.8× lower error for large-scale tensors, as summarized in Figures 6, 7, and 11.
The source code of P-Tucker and the datasets used in this paper are publicly available at https://datalab.snu.ac.kr/ptucker/ for reproducibility. The rest of this paper is organized as follows. Section II explains preliminaries on tensors, their operations, and factorization methods. Section III describes our proposed method P-Tucker. Section IV presents experimental results of P-Tucker and other methods. Section V describes our discoveries on the MovieLens dataset. After reviewing related work in Section VI, we conclude in Section VII.
II Preliminaries
In this section, we describe preliminaries on tensors in Section II-A, tensor operations in Section II-B, and tensor factorization methods in Section II-C. Notations and definitions are summarized in Table II.
II-A Tensor
Tensors, or multidimensional arrays, are a generalization of vectors (order-1 tensors) and matrices (order-2 tensors) to higher orders. As a matrix has rows and columns, an order-N tensor has N modes; their lengths (also called dimensionalities) are denoted by I_1 through I_N, respectively. We denote tensors by boldface Euler script letters (e.g., 𝒳), matrices by boldface capitals (e.g., A), and vectors by boldface lowercases (e.g., a). An entry of a tensor is denoted by the symbolic name of the tensor with its indices in subscript. For example, a_i indicates the i-th entry of a, and A_ij denotes the (i, j)-th entry of A. The i-th row of A is denoted by A_{i,:}, and the j-th column of A is denoted by A_{:,j}.
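As a concrete illustration, a sparse, partially observable tensor is typically stored in coordinate (COO) form, keeping only the observed entries. A minimal sketch (the variable names are our own, not from the P-Tucker source):

```python
import numpy as np

# Coordinate (COO) storage for a sparse, partially observable order-3 tensor:
# only the observed entries are kept; missing entries are absent, not zeros.
indices = np.array([[0, 1, 2],      # one (i_1, i_2, i_3) index triple per row
                    [3, 0, 1],
                    [2, 2, 0]])
values = np.array([4.0, 2.5, 3.0])  # e.g., (user, movie, time; rating)

order = indices.shape[1]            # N, the number of modes
dims = indices.max(axis=0) + 1      # dimensionality I_n of each mode
nnz = len(values)                   # |Omega|, number of observed entries
```

All later per-entry computations touch only these `nnz` rows, which is what makes the observed-entry approach scale with the sparsity of the data.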
Symbol | Definition
𝒳 (∈ ℝ^{I_1×⋯×I_N}) | input tensor
𝒢 (∈ ℝ^{J_1×⋯×J_N}) | core tensor
N | order (number of modes) of 𝒳
I_n, J_n | dimensionality of the n-th mode of 𝒳 and 𝒢
A^(n) (∈ ℝ^{I_n×J_n}) | n-th factor matrix
x_{i_1⋯i_N} | (i_1, …, i_N)-th entry of 𝒳
Ω | set of observable entries of 𝒳
Ω^(n)_{i_n} | set of observable entries whose n-th mode's index is i_n
|Ω|, |𝒢| | number of observable entries of 𝒳 and nonzeros of 𝒢
λ | regularization parameter for factor matrices
‖𝒳‖ | Frobenius norm of tensor 𝒳
T | number of threads
α | an entry of input tensor 𝒳
β | an entry of core tensor 𝒢
C | cache table
p | truncation rate
II-B Tensor Operations
We review some tensor operations used for Tucker factorization. More tensor operations are summarized in [11].
Definition 1 (Frobenius Norm)
Given an N-order tensor 𝒳 (∈ ℝ^{I_1×⋯×I_N}), the Frobenius norm of 𝒳 is denoted by ‖𝒳‖ and defined as follows:

\| \mathcal{X} \| = \sqrt{ \sum_{i_1=1}^{I_1} \cdots \sum_{i_N=1}^{I_N} \mathcal{X}_{i_1 \cdots i_N}^{2} }    (1)
Definition 2 (Matricization/Unfolding)
Matricization transforms a tensor into a matrix. The mode-n matricization of a tensor 𝒳 ∈ ℝ^{I_1×⋯×I_N} is denoted by X_(n) (∈ ℝ^{I_n × (Π_{m≠n} I_m)}). The mapping from an element (i_1, …, i_N) of 𝒳 to an element (i_n, j) of X_(n) is given as follows:

j = 1 + \sum_{k=1, k \neq n}^{N} \left[ (i_k - 1) \prod_{m=1, m \neq n}^{k-1} I_m \right]    (2)
Note that all indices of a tensor and a matrix begin from 1.
Definition 3 (n-Mode Product)
The n-mode product enables multiplication between a tensor and a matrix. The n-mode product of a tensor 𝒳 ∈ ℝ^{I_1×⋯×I_N} with a matrix U ∈ ℝ^{J×I_n} is denoted by 𝒳 ×_n U (∈ ℝ^{I_1×⋯×I_{n-1}×J×I_{n+1}×⋯×I_N}). Element-wise, we have

(\mathcal{X} \times_n \mathbf{U})_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_N} = \sum_{i_n=1}^{I_n} \mathcal{X}_{i_1 i_2 \cdots i_N} \mathbf{U}_{j i_n}    (3)
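The two operations above can be sketched in a few lines of NumPy. The helper names (`unfold`, `fold`, `mode_n_product`) are our own, and 0-based indexing is used, whereas Eq. (2) in the text is 1-based:

```python
import numpy as np

def unfold(X, n):
    """Mode-n matricization: mode n becomes the rows (Definition 2)."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def fold(M, n, shape):
    """Inverse of unfold for a target tensor shape."""
    full = [shape[n]] + [s for m, s in enumerate(shape) if m != n]
    return np.moveaxis(M.reshape(full), 0, n)

def mode_n_product(X, U, n):
    """n-mode product X x_n U (Definition 3), with U of shape (J, I_n)."""
    shape = list(X.shape)
    shape[n] = U.shape[0]
    return fold(U @ unfold(X, n), n, shape)

X = np.arange(24, dtype=float).reshape(2, 3, 4)
U = np.ones((5, 3))                 # contracts mode 1 (length 3) to length 5
Y = mode_n_product(X, U, 1)
```

Here `Y` has shape (2, 5, 4): mode 1 of `X` was contracted against the columns of `U`, exactly as Eq. (3) prescribes.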
II-C Tensor Factorization Methods
Our proposed method P-Tucker is based on Tucker factorization, one of the most popular decomposition methods. More details about other factorization algorithms are summarized in Section VI and [11].
Definition 4 (Tucker Factorization)
Given an N-order tensor 𝒳 (∈ ℝ^{I_1×⋯×I_N}), Tucker factorization approximates 𝒳 by a core tensor 𝒢 (∈ ℝ^{J_1×⋯×J_N}) and factor matrices A^(1), …, A^(N) (A^(n) ∈ ℝ^{I_n×J_n}). Figure 1 illustrates a Tucker factorization result for a 3-way tensor. The core tensor 𝒢 is assumed to be smaller and denser than the input tensor 𝒳, and the factor matrices are normally assumed to be orthogonal. Regarding interpretations of factorization results, each factor matrix A^(n) represents the latent features of the objects related to the n-th mode of 𝒳, and each element of the core tensor 𝒢 indicates the weight of the relation composed of columns of the factor matrices. Tucker factorization with tensor operations is presented as follows:

\min_{\mathcal{G}, \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}} \left\| \mathcal{X} - \mathcal{G} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \cdots \times_N \mathbf{A}^{(N)} \right\|    (4)
Note that the loss function (4) is calculated over all entries of 𝒳, and all missing values of 𝒳 are regarded as zeros. Equivalently, an element-wise expression is given as follows:

\mathcal{X}_{i_1 \cdots i_N} \approx \hat{\mathcal{X}}_{i_1 \cdots i_N} = \sum_{j_1=1}^{J_1} \cdots \sum_{j_N=1}^{J_N} \mathcal{G}_{j_1 \cdots j_N} \prod_{n=1}^{N} \mathbf{A}^{(n)}_{i_n j_n}    (5)
Equation (5) is used to predict the values of missing entries after 𝒢 and A^(1), …, A^(N) are found. We define the reconstruction error of a Tucker factorization of 𝒳 by the following rule, where Ω is the set of observable entries of 𝒳, α denotes an observed index (i_1, …, i_N), and 𝒳̂_α is the approximation of 𝒳_α given by (5):

RE(\mathcal{X}) = \sqrt{ \sum_{\forall \alpha \in \Omega} \left( \mathcal{X}_{\alpha} - \hat{\mathcal{X}}_{\alpha} \right)^{2} \Big/ \sum_{\forall \alpha \in \Omega} \mathcal{X}_{\alpha}^{2} }    (6)
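A small NumPy sketch of the element-wise reconstruction (5) and the observed-entry error (6); the helper names are ours, the error is computed in relative form over Ω only, and a dense core with COO-format observations is assumed:

```python
import numpy as np

def predict_entry(index, G, factors):
    """Eq. (5): x-hat = sum over all core entries of G[j_1..j_N] times
    prod_n A^(n)[i_n, j_n], computed by contracting one mode at a time."""
    acc = G
    for n, i_n in enumerate(index):
        acc = np.tensordot(factors[n][i_n, :], acc, axes=([0], [0]))
    return float(acc)

def reconstruction_error(indices, values, G, factors):
    """Eq. (6): relative error over the observed entries Omega only."""
    preds = np.array([predict_entry(idx, G, factors) for idx in indices])
    return np.sqrt(np.sum((values - preds) ** 2) / np.sum(values ** 2))

rng = np.random.default_rng(0)
G = rng.random((2, 2, 2))
factors = [rng.random((4, 2)), rng.random((3, 2)), rng.random((3, 2))]
indices = [(0, 1, 2), (3, 0, 1)]
values = np.array([predict_entry(idx, G, factors) for idx in indices])
err = reconstruction_error(indices, values, G, factors)  # exact fit here
```

Because `values` are generated from the model itself, the error is zero; with real observations it measures how well the factorization fits the observed entries only.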
Definition 5 (Sparse Tucker Factorization)
Given a tensor 𝒳 with observable entries Ω, the goal of sparse Tucker factorization of 𝒳 is to find factor matrices A^(1), …, A^(N) and a core tensor 𝒢 which minimize (7):

L(\mathcal{G}, \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}) = \sum_{\forall \alpha \in \Omega} \left( \mathcal{X}_{\alpha} - \hat{\mathcal{X}}_{\alpha} \right)^{2} + \lambda \sum_{n=1}^{N} \left\| \mathbf{A}^{(n)} \right\|^{2}    (7)
Note that the loss function (7) depends only on the observable entries of 𝒳, and L2 regularization is used in (7) to prevent overfitting, a technique widely utilized in machine learning problems [21, 22, 23].
Definition 6 (Alternating Least Squares)
To minimize the loss functions (4) and (7), an alternating least squares (ALS) technique is widely used [11, 14], which updates one factor matrix or the core tensor while keeping all others fixed.
Algorithm 1 describes a conventional Tucker factorization based on ALS, called the higher-order orthogonal iteration (HOOI) (see [11] for details). The computational and memory bottleneck of Algorithm 1 is updating factor matrices (lines 4–5), which requires tensor operations and SVD. Specifically, Algorithm 1 requires storing the full-dense matricization Y_(n) of the intermediate tensor 𝒴 = 𝒳 ×_1 A^(1)T ⋯ ×_{n-1} A^(n-1)T ×_{n+1} A^(n+1)T ⋯ ×_N A^(N)T, and the amount of memory needed for storing Y_(n) is O(I_n Π_{m≠n} J_m). The required memory grows rapidly when the order, the dimensionality, or the rank of a tensor increases, and ultimately causes intermediate data explosion [16]. Moreover, Algorithm 1 computes the SVD of Y_(n), where the complexity of exact SVD for an m×n matrix is O(min(mn², m²n)). These computational costs also increase rapidly for a large-scale tensor. Notice that Algorithm 1 regards missing entries of 𝒳 as zeros during the update process (lines 4–5), while the core tensor 𝒢 (line 7) is uniquely determined by the input tensor and factor matrices and is relatively easy to compute.
In summary, applying the naive Tucker-ALS algorithm to sparse tensors causes severe accuracy and scalability issues. Therefore, Algorithm 1 needs to be revised to focus only on observed entries and, at the same time, scale to large tensors. For that purpose, a gradient-based ALS approach is applicable to Algorithm 1, as has been done for partially observable matrices [23] and CP factorizations [24]. The gradient-based ALS approach is discussed in Section III.
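For reference, the HOOI iteration described above can be sketched densely in NumPy. This is an illustrative version with our own names (`unfold`, `ttm`, `hooi`): it materializes the intermediate tensor Y at every factor update and implicitly treats missing entries as zeros, which are exactly the memory and accuracy problems discussed in the text.

```python
import numpy as np

def unfold(X, n):
    """Mode-n matricization."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def ttm(X, U, n):
    """n-mode product: multiply tensor X by matrix U along mode n."""
    shape = list(X.shape)
    shape[n] = U.shape[0]
    full = [shape[n]] + [s for m, s in enumerate(shape) if m != n]
    return np.moveaxis((U @ unfold(X, n)).reshape(full), 0, n)

def hooi(X, ranks, iters=5):
    """Minimal dense HOOI: each factor update builds Y = X x_{m != n} A^(m)T
    (the source of intermediate data explosion) and keeps the leading left
    singular vectors of its mode-n unfolding."""
    N = X.ndim
    factors = [np.linalg.svd(unfold(X, n))[0][:, :r] for n, r in enumerate(ranks)]
    for _ in range(iters):
        for n in range(N):
            Y = X
            for m in range(N):
                if m != n:
                    Y = ttm(Y, factors[m].T, m)
            factors[n] = np.linalg.svd(unfold(Y, n))[0][:, :ranks[n]]
    G = X
    for m in range(N):
        G = ttm(G, factors[m].T, m)   # core is determined by X and the factors
    return G, factors

X = np.random.default_rng(1).random((6, 5, 4))
G, factors = hooi(X, (2, 2, 2))
```

Even in this toy setting, `Y` is dense; for a large sparse tensor its size O(I_n Π_{m≠n} J_m) is what the gradient-based approach of Section III avoids.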
Definition 7 (Intermediate Data)
We define intermediate data as the memory required for updating factor matrices (lines 4–5 in Algorithm 1), excluding the space for storing 𝒳, 𝒢, and A^(1), …, A^(N). The size of intermediate data plays a critical role in determining which Tucker factorization algorithms are space-efficient, as we discuss in Section III-E2.
III Proposed Method
In this section, we describe P-Tucker, our proposed Tucker factorization algorithm for sparse tensors. As described in Definition 6, the computational and memory bottleneck of the standard Tucker-ALS algorithm occurs while updating factor matrices. Therefore, it is imperative to update them efficiently in order to maximize the scalability of the algorithm. However, there are several challenges in designing an optimized algorithm for updating factor matrices.

Exploit the characteristic of sparse tensors. Sparse tensors are composed of a vast number of missing entries and a relatively small number of observable entries. How can we exploit the sparsity of given tensors to design an accurate and scalable algorithm for updating factor matrices?

Maximize scalability. The aforementioned TuckerALS algorithm suffers from intermediate data explosion and high computational costs while updating factor matrices. How can we formulate efficient algorithms for updating factor matrices in terms of time and memory?

Parallelization. It is crucial to avoid race conditions and adjust workloads between threads to thoroughly employ multicore parallelism. How can we apply data parallelism on updating factor matrices in order to scale up linearly with respect to the number of threads?
To overcome the above challenges, we suggest the following main ideas, which we describe in later subsections.

Gradient-based ALS enables P-Tucker to focus only on the observed entries of a tensor and to update factor matrices row by row, without tensor operations or SVD (Section III-B).

P-Tucker-Cache and P-Tucker-Approx accelerate the update process by caching intermediate calculations and by utilizing a truncated core tensor, respectively, while P-Tucker itself provides a memory-optimized algorithm by default (Section III-C).

Careful distribution of work ensures that each thread has independent tasks and balanced workloads when P-Tucker updates factor matrices (Section III-D).
We first give an overview of how P-Tucker factorizes sparse tensors in Section III-A. After that, we describe the details of our main ideas in Sections III-B to III-D, and we offer a theoretical analysis of P-Tucker in Section III-E.
III-A Overview
P-Tucker provides an efficient Tucker factorization algorithm for sparse tensors.
Figure 2 and Algorithm 2 describe the main process of P-Tucker. First, P-Tucker initializes the core tensor 𝒢 and all factor matrices A^(n) with random real values between 0 and 1 (step 1 and line 1). After that, P-Tucker updates factor matrices (steps 2–3 and line 3) by Algorithm 3, explained in Section III-B. When all factor matrices are updated, P-Tucker measures the reconstruction error using (6) (step 4 and line 4). In the case of P-Tucker-Approx (step 5 and lines 5–6), it removes "noisy" entries of 𝒢 by Algorithm 4, explained in Section III-C. P-Tucker stops iterating if the error converges or the maximum number of iterations is reached (line 7). Finally, P-Tucker performs QR decomposition on all A^(n) to make them orthogonal and updates 𝒢 accordingly (step 6 and lines 8–11). Specifically, QR decomposition [25] of each A^(n) is defined as follows:
\mathbf{A}^{(n)} = \mathbf{Q}^{(n)} \mathbf{R}^{(n)}    (8)
where Q^(n) (∈ ℝ^{I_n×J_n}) is column-wise orthonormal and R^(n) (∈ ℝ^{J_n×J_n}) is upper-triangular. Therefore, by substituting Q^(n) for A^(n), P-Tucker makes the factor matrices orthogonal. The core tensor 𝒢 must be updated accordingly in order to maintain the same reconstruction error. According to [26], the update rule for the core tensor is given as follows:
\mathcal{G} \leftarrow \mathcal{G} \times_1 \mathbf{R}^{(1)} \times_2 \mathbf{R}^{(2)} \cdots \times_N \mathbf{R}^{(N)}    (9)
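The orthogonalization step of Eqs. (8)–(9) can be sketched as follows. This is an illustrative NumPy version (names like `ttm` and `orthogonalize` are ours) that also checks the key property: folding each R^(n) into the core leaves the reconstruction unchanged.

```python
import numpy as np

def ttm(X, U, n):
    """n-mode product of tensor X with matrix U."""
    shape = list(X.shape)
    shape[n] = U.shape[0]
    full = [shape[n]] + [s for m, s in enumerate(shape) if m != n]
    Xn = U @ np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
    return np.moveaxis(Xn.reshape(full), 0, n)

def reconstruct(G, factors):
    Y = G
    for n, A in enumerate(factors):
        Y = ttm(Y, A, n)
    return Y

def orthogonalize(G, factors):
    """QR on each factor (Eq. (8)); fold each R into the core (Eq. (9))."""
    factors = list(factors)
    for n, A in enumerate(factors):
        Q, R = np.linalg.qr(A)     # A = Q R, Q column-wise orthonormal
        factors[n] = Q
        G = ttm(G, R, n)           # absorb R so G x_n A = (G x_n R) x_n Q
    return G, factors

rng = np.random.default_rng(2)
G = rng.random((2, 2, 2))
factors = [rng.random((5, 2)), rng.random((4, 2)), rng.random((3, 2))]
before = reconstruct(G, factors)
G2, factors2 = orthogonalize(G, factors)
after = reconstruct(G2, factors2)   # identical reconstruction
```

Since A^(n) = Q^(n)R^(n) and n-mode products along different modes commute, pushing every R^(n) into the core preserves the approximation exactly, which is why the error is unchanged.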
III-B Gradient-Based ALS for Updating Factor Matrices
P-Tucker adopts a gradient-based ALS method to update factor matrices, which concentrates only on the observed entries of a tensor. From a high-level point of view, as most ALS methods do, P-Tucker updates one factor matrix at a time while keeping all others fixed. However, when all other matrices are fixed, there are several approaches [24] for updating a single factor matrix. Among them, P-Tucker selects a row-wise update method; a key benefit of the row-wise update is that all rows of a factor matrix are independent of each other in terms of minimizing the loss function (7). This property enables multi-core parallelism in updating factor matrices. Given a row of a factor matrix, P-Tucker updates the row by a gradient-based update rule. To be more specific, the update rule is derived by computing the gradient of (7) with respect to the given row and setting it to zero, which minimizes the loss function (7). The update rule for the i_n-th row of the n-th factor matrix A^(n) (see Figure 4) is given as follows; the proof of Equation (10) is in Theorem 1.
[\mathbf{A}^{(n)}]_{i_n,:} \leftarrow \mathbf{c}_{i_n,:} \, [\mathbf{B}_{i_n} + \lambda \mathbf{I}_{J_n}]^{-1}    (10)

where B_{i_n} is a J_n × J_n matrix whose (j_1, j_2)-th entry is

[\mathbf{B}_{i_n}]_{j_1 j_2} = \sum_{\forall \alpha \in \Omega^{(n)}_{i_n}} \delta(\alpha)_{j_1} \, \delta(\alpha)_{j_2}    (11)

c_{i_n,:} is a length-J_n vector whose j-th entry is

[\mathbf{c}_{i_n,:}]_{j} = \sum_{\forall \alpha \in \Omega^{(n)}_{i_n}} \mathcal{X}_{\alpha} \, \delta(\alpha)_{j}    (12)

and δ(α) is a length-J_n vector whose j-th entry is

\delta(\alpha)_{j} = \sum_{\forall \beta : j_n = j} \mathcal{G}_{\beta} \prod_{m \neq n} \mathbf{A}^{(m)}_{i_m j_m}    (13)
Here Ω^(n)_{i_n} indicates the subset of Ω whose n-th mode's index is i_n, λ is a regularization parameter, and I_{J_n} is a J_n × J_n identity matrix. As shown in Figure 4, the update rule for the i_n-th row of A^(n) requires three intermediate data: B_{i_n}, c_{i_n,:}, and δ. These are computed from the subset of observable entries Ω^(n)_{i_n}. Thus, the computational cost of updating factor matrices is proportional to the number of observable entries, which lets P-Tucker fully exploit the sparsity of given tensors. Moreover, P-Tucker predicts missing values of a tensor using (5), not as zeros. Equation (5) is computed with the updated factor matrices and core tensor, which are learned from the observed entries of a tensor. Hence, P-Tucker not only enhances the accuracy of factorization but also reflects the latent characteristics of the observed entries. Note that the matrix B_{i_n} + λI_{J_n} is positive-definite and invertible; a proof of the update rule is summarized in Section III-E1.
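To make the row-wise update concrete, here is a small NumPy sketch of Eqs. (10)–(13), assuming a dense core tensor and COO-format observed entries; the function and variable names (`delta`, `update_row`, `obs_idx`) are our own illustration, not the P-Tucker source.

```python
import numpy as np
from itertools import product

def delta(index, n, G, factors):
    """Eq. (13): length-J_n vector; entry j accumulates G[beta] times the
    factor values of every mode except n, over core entries beta with j_n = j."""
    d = np.zeros(G.shape[n])
    for beta in product(*[range(s) for s in G.shape]):
        w = G[beta]
        for m, j_m in enumerate(beta):
            if m != n:
                w *= factors[m][index[m], j_m]
        d[beta[n]] += w
    return d

def update_row(i_n, n, indices, values, G, factors, lam=0.01):
    """Eq. (10): solve (B + lam*I) a = c, where B and c (Eqs. (11)-(12)) are
    accumulated over the observed entries whose n-th index equals i_n."""
    Jn = G.shape[n]
    B, c = np.zeros((Jn, Jn)), np.zeros(Jn)
    for idx, x in zip(indices, values):
        if idx[n] != i_n:
            continue                  # only Omega^(n)_{i_n} contributes
        d = delta(idx, n, G, factors)
        B += np.outer(d, d)           # Eq. (11)
        c += x * d                    # Eq. (12)
    return np.linalg.solve(B + lam * np.eye(Jn), c)

rng = np.random.default_rng(3)
G = rng.random((2, 2))                      # order-2 case for brevity
factors = [rng.random((4, 2)), rng.random((3, 2))]
obs_idx = [(0, 0), (0, 1), (0, 2), (1, 0)]  # observed entries (COO indices)
obs_val = np.array([factors[0][i] @ delta((i, j), 0, G, factors)
                    for i, j in obs_idx])   # exact Tucker values
new_row = update_row(0, 0, obs_idx, obs_val, G, factors, lam=1e-9)
```

Because the observations here come from an exact Tucker model, solving the regularized normal equations recovers the original row (up to the tiny λ shrinkage); note that δ never involves the row being updated, which is what makes all rows independent.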
Algorithm 3 describes how P-Tucker updates factor matrices. First, in the case of P-Tucker-Cache (lines 1–4), it computes the values of all entries in a cache table C, which caches intermediate multiplication results generated while updating factor matrices. This memoization technique makes P-Tucker-Cache a time-efficient algorithm. Next, P-Tucker chooses a row of a factor matrix to update (lines 5–6). After that, P-Tucker computes the B_{i_n} and c_{i_n,:} required for updating the row (lines 7–13). P-Tucker then inverts B_{i_n} + λI (line 14) and updates the row as the product of c_{i_n,:} and that inverse (line 15). In the case of P-Tucker-Cache, it recalculates C using the existing and updated rows (lines 16–19) whenever a row is updated. Note that α and β indicate an entry of 𝒳 and 𝒢, respectively.
III-C Variants: P-Tucker-Cache and P-Tucker-Approx
As discussed in Section III-B, P-Tucker requires three intermediate data, B_{i_n}, c_{i_n,:}, and δ, whose memory requirement is only O(J² + J) per thread. Considering the memory complexity of the naive Tucker-ALS, which is O(IJ^{N-1}) for the intermediate matrix alone, P-Tucker successfully provides a memory-optimized algorithm. We can further optimize P-Tucker in terms of time by a caching algorithm (P-Tucker-Cache) and an approximation algorithm (P-Tucker-Approx).
The crucial difference between P-Tucker and P-Tucker-Cache lies in the computation of the intermediate vector δ (lines 9–12 in Algorithm 3). In the case of P-Tucker, updating δ requires N multiplications for a given entry pair (α, β) (line 10), which takes O(N). However, if we cache the results of those multiplications for all entry pairs, the update takes only O(1) (line 12). This trade-off distinguishes P-Tucker-Cache from P-Tucker: P-Tucker-Cache accelerates intermediate calculations by memoization with the cache table C, while P-Tucker requires only the small vectors c and δ (length J_n) and a small matrix B (J_n × J_n) as intermediate data. Note that when A^(n)_{i_n j_n} is 0 (lines 12 and 19), P-Tucker-Cache performs the multiplications as P-Tucker does (line 10).
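One plausible way to realize the cache is to store, for each pair (α, β), the full product G_β Π_{m=1}^N A^(m)[i_m, j_m] once, and recover the mode-n partial product by dividing out A^(n)[i_n, j_n], falling back to the full multiplication when that value is zero (as the text notes). A hedged sketch with our own names:

```python
import numpy as np
from itertools import product

def build_cache(indices, G, factors):
    """Cache table C: for each observed entry alpha and core entry beta,
    store the full product G[beta] * prod_{m=1..N} A^(m)[i_m, j_m] once."""
    cache = []
    for idx in indices:
        row = {}
        for beta in product(*[range(s) for s in G.shape]):
            w = G[beta]
            for m, j_m in enumerate(beta):
                w *= factors[m][idx[m], j_m]
            row[beta] = w
        cache.append(row)
    return cache

def delta_cached(a, idx, n, G, factors, cache):
    """delta(alpha) via the cache: divide out the mode-n factor value
    (O(1) per pair) and fall back to the O(N) product when it is zero."""
    d = np.zeros(G.shape[n])
    for beta, full in cache[a].items():
        v = factors[n][idx[n], beta[n]]
        if v != 0.0:
            d[beta[n]] += full / v
        else:                                  # recompute as plain P-Tucker
            w = G[beta]
            for m, j_m in enumerate(beta):
                if m != n:
                    w *= factors[m][idx[m], j_m]
            d[beta[n]] += w
    return d

rng = np.random.default_rng(4)
G = rng.random((2, 2, 2))
factors = [rng.random((3, 2)) for _ in range(3)]
indices = [(0, 1, 2), (2, 0, 1)]
cache = build_cache(indices, G, factors)
d = delta_cached(0, indices[0], 1, G, factors, cache)
```

The memory cost of this trade-off is visible directly: the cache holds one value per (observed entry, core entry) pair, so it grows with |Ω| times the number of core entries.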
The main intuition behind P-Tucker-Approx is that there exist "noisy" entries in a core tensor 𝒢, and we can accelerate the update process by truncating them. How, then, can we determine whether an entry of 𝒢 is "noisy"? A naive approach is to treat an entry with a small value as "noisy", as the truncated SVD does [27]. However, small-value entries are not always negligible, since their contributions to minimizing the error (6) can be larger than those of large-value entries. Hence, we propose a more precise criterion, which regards an entry β with a high P_RE(β) value as "noisy". P_RE(β) indicates the partial reconstruction error produced by an entry β, derived from the sum of the terms related only to β in (6). Given an entry β, P_RE(β) is given as follows:
P_{RE}(\beta) = \sum_{\forall \alpha \in \Omega} \left( \mathcal{B}_{\alpha}^{2} - 2\,\mathcal{R}_{\alpha}\mathcal{B}_{\alpha} + 2\,\mathcal{B}_{\alpha}\mathcal{D}_{\alpha} \right)    (14)

where R_α = 𝒳_α, B_α = 𝒢_β ∏_{n} A^(n)_{i_n j_n} is the part of the prediction (5) contributed by β, and D_α = 𝒳̂_α − B_α is the part contributed by all other entries of 𝒢.
Note that the auxiliary symbols R_α, B_α, and D_α are used to simplify the equation. P_RE suggests a more precise guideline for "noisy" entries, since P_RE(β) is a part of (6), while the naive approach estimates the error based on the value of β alone. Figure 5 illustrates the distribution of P_RE and a cumulative function of the relative reconstruction error on the latest MovieLens dataset. As expected by our intuition, only 20% of the entries of 𝒢 generate about 80% of the total reconstruction error. Algorithm 4 describes how P-Tucker-Approx truncates "noisy" entries of 𝒢. It first computes P_RE(β) for all entries β in 𝒢 (lines 1–2), then sorts the values in descending order together with their indices (line 3). Finally, it truncates the top ⌈p·|𝒢|⌉ "noisy" entries of 𝒢 (line 4). P-Tucker-Approx performs Algorithm 4 in each iteration (lines 2–7 in Algorithm 2), which reduces the number of nonzeros in 𝒢 step by step. Therefore, the elapsed time per iteration also decreases, since the time complexity of P-Tucker-Approx depends on the number of nonzeros in 𝒢. Note that we can find an optimal approximation point where the speedup over the accuracy loss is maximized (see Figure 9).
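A hedged sketch of the truncation idea: rather than reproducing Eq. (14) exactly, the score below measures how much the observed squared error grows when a core entry is zeroed out, and truncates the entries whose removal hurts least (equivalently, the entries that generate the most error when kept). The names and this exact criterion are our own illustration.

```python
import numpy as np
from itertools import product

def core_truncate(G, factors, indices, values, p):
    """Truncate a p fraction of 'noisy' core entries: score each entry beta
    by the change in observed squared error when beta alone is zeroed, and
    zero the ceil(p*|G|) entries whose removal hurts least."""
    betas = list(product(*[range(s) for s in G.shape]))
    contrib = np.zeros((len(indices), len(betas)))   # c_alpha(beta)
    for a, idx in enumerate(indices):
        for b, beta in enumerate(betas):
            w = G[beta]
            for m, j_m in enumerate(beta):
                w *= factors[m][idx[m], j_m]
            contrib[a, b] = w
    resid = values - contrib.sum(axis=1)             # residuals with full core
    # error change if beta is removed: sum (r + c)^2 - sum r^2
    delta_err = ((resid[:, None] + contrib) ** 2 - resid[:, None] ** 2).sum(axis=0)
    Gt = G.copy()
    k = int(np.ceil(p * len(betas)))
    for b in np.argsort(delta_err)[:k]:              # least harmful removals
        Gt[betas[b]] = 0.0
    return Gt

rng = np.random.default_rng(5)
G = rng.random((2, 2, 2))
factors = [rng.random((3, 2)) for _ in range(3)]
full = np.einsum('abc,ia,jb,kc->ijk', G, *factors)
indices = [(0, 0, 0), (1, 1, 1), (2, 2, 2), (0, 1, 2)]
values = np.array([full[idx] for idx in indices])
Gt = core_truncate(G, factors, indices, values, p=0.25)
```

With p = 0.25 and a 2×2×2 core, two of the eight entries are zeroed; every subsequent δ computation then skips those entries, which is the source of P-Tucker-Approx's per-iteration speedup.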
With the above optimizations, P-Tucker and its variants become the most time- and memory-efficient methods from both theoretical and experimental perspectives (see Table III).
III-D Careful Distribution of Work
There are three sections where multi-core parallelization is applicable in Algorithms 2 and 3. The first section (lines 2–4 and 17–19 in Algorithm 3) is for P-Tucker-Cache when it computes and updates the cache table C. The second section (lines 6–15 in Algorithm 3) is for updating factor matrices, and the last section (line 4 in Algorithm 2) is for measuring the reconstruction error. For each section, P-Tucker carefully distributes tasks to threads while maintaining independence between them. Furthermore, P-Tucker utilizes a dynamic scheduling method [28] to ensure that each thread has a balanced workload. The details of how P-Tucker parallelizes each section are as follows. Note that T indicates the number of threads used for parallelization.

Section 1: Computing and Updating the Cache Table (Only for P-Tucker-Cache). All rows of C are independent of each other when they are computed or updated. Thus, P-Tucker distributes all rows equally over T threads, and each thread computes or updates its allocated rows independently using static scheduling.

Section 2: Updating Factor Matrices. All rows of A^(n) are independent of each other with regard to minimizing the loss function (7). Therefore, P-Tucker distributes all rows uniformly to the threads and updates them in parallel. Since |Ω^(n)_{i_n}| differs for each row, the workload of each thread may vary considerably; thus, P-Tucker employs dynamic scheduling in this part.

Section 3: Calculating the Reconstruction Error. All observable entries are independent of each other in measuring the reconstruction error. Thus, P-Tucker distributes them evenly over T threads, and each thread computes its partial error separately using static scheduling. At the end, P-Tucker aggregates the partial errors from all threads.
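The row distribution of Section 2 can be sketched with a thread pool: because each worker writes only its own disjoint set of rows, no locks are needed. This is an illustrative Python analogue with our own names (the actual P-Tucker uses OpenMP in C); a dynamic scheduler would hand out rows on demand instead of the fixed round-robin shown here.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def update_rows_parallel(A, row_task, T=4):
    """Update all rows of a factor matrix with T workers. Rows are independent,
    so each worker writes only its own round-robin row set; a dynamic scheduler
    would instead assign rows on demand to balance |Omega^(n)_{i_n}| skew."""
    def worker(t):
        for i in range(t, A.shape[0], T):  # rows assigned to worker t
            A[i, :] = row_task(i)
    with ThreadPoolExecutor(max_workers=T) as ex:
        list(ex.map(worker, range(T)))     # drain to surface any exceptions
    return A

# toy "row update": row i becomes [i, i, i]; in P-Tucker, row_task would
# solve the Eq. (10) system for that row
A = update_rows_parallel(np.zeros((10, 3)), lambda i: np.full(3, float(i)), T=4)
```

Disjoint write sets are what guarantee the absence of race conditions; fairness then depends only on how evenly the per-row workloads are balanced across workers.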
III-E Theoretical Analysis
III-E1 Convergence Analysis
In this section, we theoretically prove the correctness and the convergence of P-Tucker.
Theorem 1 (Correctness of P-Tucker)
The update rule (10) minimizes the loss function (7) with respect to the i_n-th row of A^(n) when the core tensor and all other rows are fixed.
Proof 1
Theorem 2 (Convergence of P-Tucker)
P-Tucker converges, since the loss function (7) is bounded below and decreases monotonically under the row-wise updates.
Proof 2
Algorithm | Time Complexity (per iteration) | Memory Complexity
P-Tucker | O(N²|Ω|J^N + NIJ³) | O(TJ²)
P-Tucker-Cache | O(N|Ω|J^N + NIJ³) | O(|Ω|J^N)
P-Tucker-Approx | O(N²|Ω||𝒢| + NIJ³) | O(TJ²)
Tucker-wOpt [18] | |
Tucker-CSF [20] | |
[17] | |
III-E2 Complexity Analysis
In this section, we analyze the time and memory complexities of P-Tucker and its variants. For simplicity, we assume I_n = I and J_n = J for all n. Table III summarizes the time and memory complexities of P-Tucker and other methods. As expected in Section III-C, P-Tucker presents the best memory complexity among all algorithms. While P-Tucker-Cache shows a better time complexity than P-Tucker, P-Tucker-Approx exhibits the best time complexity thanks to the reduced number of nonzeros in 𝒢. Note that we calculate time complexities per iteration (lines 2–7 in Algorithm 2), and we focus on the memory complexities of intermediate data, not of all variables.
Theorem 3 (Time complexity of P-Tucker)
The time complexity of P-Tucker is O(N²|Ω|J^N + NIJ³).
Proof 3
Given the i_n-th row of A^(n) (lines 5–6 in Algorithm 3), computing δ(α) for one observed entry (line 10) takes O(NJ^N). Updating B_{i_n} and c_{i_n,:} (line 13) takes O(J²), since δ(α) is already calculated. Inverting B_{i_n} + λI (line 14) takes O(J³), and updating the row (line 15) takes O(J²). Thus, the time complexity of updating the i_n-th row of A^(n) (lines 7–15) is O(|Ω^(n)_{i_n}|NJ^N + J³). Iterating over all rows of A^(n) takes O(N|Ω|J^N + IJ³), and updating all N factor matrices takes O(N²|Ω|J^N + NIJ³). According to (6), computing the reconstruction error (line 4 in Algorithm 2) takes O(N|Ω|J^N). Thus, the time complexity of P-Tucker is O(N²|Ω|J^N + NIJ³).
Theorem 4 (Memory complexity of P-Tucker)
The memory complexity of P-Tucker is O(TJ²).
Proof 4
The intermediate data of P-Tucker consist of two vectors, δ(α) and c_{i_n,:} (of length J_n), and two matrices, B_{i_n} and its inverse (of size J_n × J_n). The memory for these variables is released after the i_n-th row of A^(n) is updated; thus, it does not accumulate over iterations. Since each of the T threads keeps its own intermediate data, the total memory complexity of P-Tucker is O(TJ²).
Theorem 5 (Time complexity of P-Tucker-Cache)
The time complexity of P-Tucker-Cache is O(N|Ω|J^N + NIJ³).
Proof 5
In Algorithm 3, computing δ(α) with the cache (line 12) takes O(J^N) per observed entry instead of O(NJ^N). Precomputing and updating C (lines 2–4 and 17–19) takes O(N|Ω|J^N) per iteration. Since all other parts of P-Tucker-Cache are identical to those of P-Tucker, the time complexity of P-Tucker-Cache is O(N|Ω|J^N + NIJ³).
Theorem 6 (Memory complexity of P-Tucker-Cache)
The memory complexity of P-Tucker-Cache is O(|Ω|J^N).
Proof 6
The cache table C requires O(|Ω|J^N) memory, which is much larger than that of the other intermediate data (see Theorem 4). Thus, the memory complexity of P-Tucker-Cache is O(|Ω|J^N).
Theorem 7 (Time complexity of P-Tucker-Approx)
The time complexity of P-Tucker-Approx is O(N²|Ω||𝒢| + NIJ³), where |𝒢| is the number of remaining nonzeros in the truncated core tensor.
Proof 7
Refer to the supplementary material [29].
Theorem 8 (Memory complexity of P-Tucker-Approx)
The memory complexity of P-Tucker-Approx is O(TJ²).
Proof 8
Refer to the supplementary material [29].
IV Experiments
In this section, we present experimental results of P-Tucker and other methods. We focus on answering the following questions.

Data Scalability (Section IV-B). How well do P-Tucker and competitors scale with respect to the following aspects of a given tensor: 1) order, 2) dimensionality, 3) number of observable entries, and 4) rank?

Effectiveness of P-Tucker-Cache and P-Tucker-Approx (Section IV-C). How successfully do P-Tucker-Cache and P-Tucker-Approx offer trade-offs between time and memory, and between time and accuracy, respectively?

Parallelization Scalability (Section IV-D). How well does P-Tucker scale with respect to the number of threads used for parallelization?

Real-World Accuracy (Section IV-E). How accurately do P-Tucker and other methods factorize real-world tensors and predict their missing entries?
We describe the datasets and experimental settings in Section IV-A, and answer the questions in Sections IV-B through IV-E.
Name | Order | Dimensionality | Observable Entries | Rank
Yahoo-music | 4 | (1M, 625K, 133, 24) | 252M | 10
MovieLens | 4 | (138K, 27K, 21, 24) | 20M | 10
Video (Wave) | 4 | (112, 160, 3, 32) | 160K | 3
Image (Lena) | 3 | (256, 256, 3) | 20K | 3
Synthetic | 3–10 | 100–10M | 100M | 3–11
IV-A Experimental Settings
Datasets
We use both real-world and synthetic tensors to evaluate P-Tucker and competitors.
Table IV summarizes the tensors we used in experiments, which are available at https://datalab.snu.ac.kr/ptucker/.
For real-world tensors, we use Yahoo-music
Competitors
We compare P-Tucker and its variants with three state-of-the-art Tucker factorization (TF) methods. Descriptions of all methods are given as follows:

P-Tucker (default): the proposed method, which minimizes intermediate data by a gradient-based update rule; used by default throughout all experiments.

P-Tucker-Cache: the time-optimized variant of P-Tucker, which caches intermediate multiplications to update factor matrices efficiently.

P-Tucker-Approx: the time-optimized variant of P-Tucker, which offers a trade-off between time and accuracy by truncating "noisy" entries of a core tensor.

Tucker-wOpt [18]: the accuracy-focused TF method, utilizing a nonlinear conjugate gradient algorithm for updating factor matrices and a core tensor.

Tucker-CSF [20]: the speed-focused TF algorithm, which accelerates a tensor-times-matrix chain (TTMc) by a compressed sparse fiber (CSF) structure.
Note that other TF methods (e.g., [19, 30]) are excluded since they exhibit similar or more limited scalability than the competitors mentioned above, and some factorization models (e.g., [31, 32]) that are not directly applicable to tensors are not considered either.
Environment
P-Tucker is implemented in C, with the OpenMP and Armadillo libraries used for parallelization and linear algebra operations, and the source code of P-Tucker is publicly available at https://datalab.snu.ac.kr/ptucker/.