Scalable Tucker Factorization for Sparse Tensors - Algorithms and Discoveries
Given sparse multi-dimensional data (e.g., (user, movie, time; rating) for movie recommendations), how can we discover latent concepts/relations and predict missing values? Tucker factorization has been widely used to solve such problems with multi-dimensional data, which are modeled as tensors. However, most Tucker factorization algorithms regard and estimate missing entries as zeros, which leads to highly inaccurate decompositions. Moreover, the few methods that focus on accuracy exhibit limited scalability, since they require huge memory and incur heavy computational costs while updating factor matrices.
In this paper, we propose P-Tucker, a scalable Tucker factorization method for sparse tensors. P-Tucker performs alternating least squares with a gradient-based update rule in a fully parallel way, which significantly reduces the memory required for updating factor matrices. Furthermore, we offer two variants of P-Tucker: a caching algorithm P-Tucker-Cache and an approximation algorithm P-Tucker-Approx, both of which accelerate the update process. Experimental results show that P-Tucker exhibits 1.7-14.1x speed-up and 1.4-4.8x lower error compared to the state-of-the-art. In addition, P-Tucker scales near-linearly with the number of non-zeros in a tensor and the number of threads. Thanks to P-Tucker, we successfully discover hidden concepts and relations in a large-scale real-world tensor, while existing methods cannot reveal latent features due to their limited scalability or low accuracy.
Given a large-scale sparse tensor, how can we discover latent concepts/relations and predict missing entries? How can we design a time- and memory-efficient algorithm for analyzing a given tensor? Various real-world data can be modeled as tensors, or multi-dimensional arrays (e.g., (user, movie, time; rating) for movie recommendations). Many real-world tensors are sparse and partially observable, i.e., composed of a vast number of missing entries and a relatively small number of observable entries. Examples of such data include item ratings, social networks, and web search logs, where most entries are missing. Tensor factorization has been used effectively for analyzing tensors [4, 5, 6, 7, 8, 9, 10]. Among tensor factorization methods, Tucker factorization has received much interest since it is a generalized form of other factorization methods such as CANDECOMP/PARAFAC (CP) decomposition, and it allows us to examine not only latent factors but also relations hidden in tensors.
While many algorithms have been developed for Tucker factorization [12, 13, 14, 15], most methods produce highly inaccurate factorizations since they assume and predict missing entries as zeros, even though the true values of those missing entries are unknown. Moreover, existing methods focusing only on observed entries exhibit limited scalability since they exploit tensor operations and singular value decomposition (SVD), leading to heavy memory and computational requirements. In particular, tensor operations generate huge intermediate data for large-scale tensors, a problem called intermediate data explosion. A few Tucker algorithms [17, 18, 19, 20] have been developed to address the above problems, but they fail to solve the scalability and accuracy issues at the same time. In summary, the major challenges for decomposing sparse tensors are 1) how to handle missing entries for an accurate and scalable factorization, and 2) how to avoid intermediate data explosion and high computational costs caused by tensor operations and SVD.
In this paper, we propose P-Tucker, a scalable Tucker factorization method for sparse tensors. P-Tucker performs alternating least squares (ALS) with a gradient-based update rule, which focuses only on observed entries of a tensor. The gradient-based approach considerably reduces the amount of memory required for updating factor matrices, enabling P-Tucker to avoid the intermediate data explosion problem. In addition, to speed up the update procedure, we provide time-optimized variants: a caching method P-Tucker-Cache and an approximation method P-Tucker-Approx. P-Tucker fully employs multi-core parallelism by carefully allocating rows of a factor matrix to each thread, considering independence and fairness. Table I summarizes a comparison of P-Tucker and competitors with regard to various aspects.
Our main contributions are the following:
Algorithm. We propose P-Tucker, a scalable Tucker factorization method for sparse tensors. P-Tucker not only enhances the accuracy of factorization by focusing on observed values but also achieves higher scalability by utilizing a gradient-based ALS rather than using tensor operations and SVD for updating factor matrices.
Theory. We suggest a row-wise update rule for factor matrices, and prove its correctness and convergence. Moreover, we analyze the time and memory complexities of P-Tucker and other methods, as summarized in Table III.
Performance. P-Tucker provides the best performance across all aspects: tensor scale, factorization speed, memory requirement, and accuracy of decomposition. Experimental results demonstrate that P-Tucker achieves 1.7-14.1x speed-up with 1.4-4.8x lower error for large-scale tensors, as summarized in Figures 6, 7, and 11.
The source code of P-Tucker and datasets used in this paper are publicly available at https://datalab.snu.ac.kr/ptucker/ for reproducibility. The rest of this paper is organized as follows. Section II explains preliminaries on a tensor, its operations, and its factorization methods. Section III describes our proposed method P-Tucker. Section IV presents experimental results of P-Tucker and other methods. Section V describes our discovery results on the MovieLens dataset. After introducing related works in Section VI, we conclude in Section VII.
In this section, we describe the preliminaries of a tensor in Section II-A, its operations in Section II-B, and its factorization methods in Section II-C. Notations and definitions are summarized in Table II.
Tensors, or multi-dimensional arrays, are a generalization of vectors (1-order tensors) and matrices (2-order tensors) to higher orders. As a matrix has rows and columns, an $N$-order tensor has $N$ modes; their lengths (also called dimensionalities) are denoted by $I_1$ through $I_N$, respectively. We denote tensors by boldface Euler script letters (e.g., $\mathscr{X}$), matrices by boldface capitals (e.g., $\mathbf{A}$), and vectors by boldface lowercases (e.g., $\mathbf{a}$). An entry of a tensor is denoted by the symbolic name of the tensor with its indices in subscript. For example, $x_{i_1 \cdots i_N}$ indicates the $(i_1, \ldots, i_N)$th entry of $\mathscr{X}$, and $a_{ij}$ denotes the $(i,j)$th entry of $\mathbf{A}$. The $i$th row of $\mathbf{A}$ is denoted by $\mathbf{A}_{i:}$, and the $j$th column of $\mathbf{A}$ is denoted by $\mathbf{A}_{:j}$.
| Symbol | Definition |
| $I_n$, $J_n$ | dimensionality of the $n$th mode of $\mathscr{X}$ and $\mathscr{G}$ |
| $\mathbf{A}^{(n)}$ | $n$th factor matrix ($\in \mathbb{R}^{I_n \times J_n}$) |
| $a^{(n)}_{i_n j_n}$ | $(i_n, j_n)$th entry of $\mathbf{A}^{(n)}$ |
| $\Omega$ | set of observable entries of $\mathscr{X}$ |
| $\Omega^{(n)}_{i_n}$ | set of observable entries whose $n$th mode’s index is $i_n$ |
| $\lvert\Omega\rvert$ | number of observable entries of $\mathscr{X}$ |
| $\lambda$ | regularization parameter for factor matrices |
| $\lVert\mathscr{X}\rVert$ | Frobenius norm of tensor $\mathscr{X}$ |
| $T$ | number of threads |
| $x_{i_1 \cdots i_N}$ | an entry of input tensor $\mathscr{X}$ |
| $g_{j_1 \cdots j_N}$ | an entry of core tensor $\mathscr{G}$ |
II-B Tensor Operations
We review some tensor operations used for Tucker factorization. More tensor operations are summarized in the literature.
Definition 1 (Frobenius Norm)
Given an $N$-order tensor $\mathscr{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, the Frobenius norm of $\mathscr{X}$ is denoted by $\lVert\mathscr{X}\rVert$ and defined as follows:

$$\lVert\mathscr{X}\rVert = \sqrt{\sum_{i_1=1}^{I_1} \cdots \sum_{i_N=1}^{I_N} x_{i_1 \cdots i_N}^2}$$
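As a quick illustrative sketch (NumPy is assumed to be available; the paper's implementation is in C), the Frobenius norm follows directly from this definition:

```python
import numpy as np

# Frobenius norm of an N-order tensor: square root of the sum of
# squares of all entries, generalizing the matrix Frobenius norm.
def frobenius_norm(X):
    return np.sqrt(np.sum(np.asarray(X, dtype=float) ** 2))

X = np.ones((2, 3, 4))  # a small 3-order tensor with 24 entries
assert np.isclose(frobenius_norm(X), np.sqrt(24.0))
```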
Definition 2 (Matricization/Unfolding)
Matricization transforms a tensor into a matrix. The mode-$n$ matricization of a tensor $\mathscr{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is denoted as $\mathbf{X}_{(n)}$. The mapping from an element $(i_1, \ldots, i_N)$ of $\mathscr{X}$ to an element $(i_n, j)$ of $\mathbf{X}_{(n)}$ is given as follows:

$$j = 1 + \sum_{\substack{k=1 \\ k \neq n}}^{N} \Bigl[ (i_k - 1) \prod_{\substack{m=1 \\ m \neq n}}^{k-1} I_m \Bigr]$$
Note that all indices of a tensor and a matrix begin from 1.
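Mode-$n$ matricization can be sketched in a few lines of NumPy (an assumed illustration, not the paper's implementation). Fortran-order flattening makes earlier indices vary fastest, reproducing the mapping in Definition 2 with 0-based indices:

```python
import numpy as np

# Mode-n matricization: bring mode n to the front, then flatten the
# remaining modes in Fortran order so earlier indices vary fastest
# (0-based version of the index mapping in Definition 2).
def unfold(X, n):
    return np.moveaxis(X, n, 0).reshape((X.shape[n], -1), order='F')

X = np.arange(24).reshape(2, 3, 4)
# entry (i1, i2, i3) = (1, 2, 3) maps to row i2, column i1 + i3 * I1
assert unfold(X, 1)[2, 1 + 3 * 2] == X[1, 2, 3]
```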
Definition 3 (n-Mode Product)
$n$-mode product enables multiplication between a tensor and a matrix. The $n$-mode product of a tensor $\mathscr{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ with a matrix $\mathbf{U} \in \mathbb{R}^{J \times I_n}$ is denoted by $\mathscr{X} \times_n \mathbf{U}$ ($\in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N}$). Element-wise, we have

$$(\mathscr{X} \times_n \mathbf{U})_{i_1 \cdots i_{n-1} \, j \, i_{n+1} \cdots i_N} = \sum_{i_n=1}^{I_n} x_{i_1 i_2 \cdots i_N} \, u_{j i_n}$$
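A minimal NumPy sketch of the $n$-mode product (illustrative only; `unfold` and `fold` are helper names assumed here, not part of the paper):

```python
import numpy as np

def unfold(X, n):
    return np.moveaxis(X, n, 0).reshape((X.shape[n], -1), order='F')

def fold(M, n, shape):
    # inverse of unfold: `shape` is the original tensor shape, with
    # mode n replaced by M's row count
    full = [M.shape[0]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape(full, order='F'), 0, n)

# n-mode product X x_n U: multiply U (J x I_n) into mode n of X
def mode_n_product(X, U, n):
    return fold(U @ unfold(X, n), n, X.shape)

X = np.random.rand(2, 3, 4)
U = np.random.rand(5, 3)
Y = mode_n_product(X, U, 1)
assert Y.shape == (2, 5, 4)
# spot-check the element-wise definition
assert np.isclose(Y[1, 2, 3], sum(X[1, i, 3] * U[2, i] for i in range(3)))
```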
II-C Tensor Factorization Methods
Definition 4 (Tucker Factorization)
Given an $N$-order tensor $\mathscr{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, Tucker factorization approximates $\mathscr{X}$ by a core tensor $\mathscr{G} \in \mathbb{R}^{J_1 \times \cdots \times J_N}$ and factor matrices $\{\mathbf{A}^{(n)} \in \mathbb{R}^{I_n \times J_n} \mid n = 1, \ldots, N\}$. Figure 1 illustrates a Tucker factorization result for a 3-way tensor. The core tensor $\mathscr{G}$ is assumed to be smaller and denser than the input tensor $\mathscr{X}$, and the factor matrices are normally assumed to be orthogonal. Regarding interpretations of factorization results, each factor matrix $\mathbf{A}^{(n)}$ represents the latent features of the object related to the $n$th mode of $\mathscr{X}$, and each element of the core tensor $\mathscr{G}$ indicates the weight of the relation composed of columns of factor matrices. Tucker factorization with tensor operations is presented as follows:

$$\min_{\mathscr{G}, \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}} \bigl\lVert \mathscr{X} - \mathscr{G} \times_1 \mathbf{A}^{(1)} \times_2 \cdots \times_N \mathbf{A}^{(N)} \bigr\rVert$$
Note that the loss function (4) is computed over all entries of $\mathscr{X}$, and all missing values of $\mathscr{X}$ are regarded as zeros. Equivalently, an element-wise expression is given as follows:

$$x_{i_1 \cdots i_N} \approx \sum_{j_1=1}^{J_1} \cdots \sum_{j_N=1}^{J_N} g_{j_1 \cdots j_N} \prod_{n=1}^{N} a^{(n)}_{i_n j_n}$$
Equation (5) is used to predict the values of missing entries after $\mathscr{G}$ and $\mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}$ are found. We define the reconstruction error of Tucker factorization of $\mathscr{X}$ by the following rule; note that $\Omega$ is the set of observable entries of $\mathscr{X}$.
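The element-wise prediction of Equation (5) can be sketched as follows (an illustrative NumPy snippet; the full `einsum` reconstruction is used only to verify the single-entry formula):

```python
import numpy as np

# Predict one entry from the core tensor G and factor matrices
# (Equation (5)): sum over all core indices of g * product of factors.
def predict_entry(G, factors, idx):
    val = 0.0
    for js in np.ndindex(*G.shape):
        term = G[js]
        for n in range(len(idx)):
            term *= factors[n][idx[n], js[n]]
        val += term
    return val

G = np.random.rand(2, 2, 2)
A = [np.random.rand(4, 2), np.random.rand(5, 2), np.random.rand(6, 2)]
full = np.einsum('abc,ia,jb,kc->ijk', G, A[0], A[1], A[2])
assert np.isclose(predict_entry(G, A, (1, 2, 3)), full[1, 2, 3])
```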
Definition 5 (Sparse Tucker Factorization)
Given a tensor $\mathscr{X}$ with observable entries $\Omega$, the goal of sparse Tucker factorization of $\mathscr{X}$ is to find factor matrices $\mathbf{A}^{(n)} \in \mathbb{R}^{I_n \times J_n}$ ($n = 1, \ldots, N$) and a core tensor $\mathscr{G} \in \mathbb{R}^{J_1 \times \cdots \times J_N}$, which minimize (7).
Note that the loss function (7) depends only on the observable entries of $\mathscr{X}$, and regularization is used in (7) to prevent overfitting, a technique widely utilized in machine learning problems [21, 22, 23].
Definition 6 (Alternating Least Squares)
Algorithm 1 describes a conventional Tucker factorization based on ALS, called the higher-order orthogonal iteration (HOOI). The computational and memory bottleneck of Algorithm 1 is updating factor matrices (lines 4-5), which requires tensor operations and SVD. Specifically, Algorithm 1 requires storing a full-dense intermediate matrix $\mathbf{Y}_{(n)}$, and the amount of memory needed for storing it is $O(I_n \prod_{k \neq n} J_k)$. The required memory grows rapidly when the order, the dimensionality, or the rank of a tensor increases, and ultimately causes intermediate data explosion. Moreover, Algorithm 1 computes the SVD of $\mathbf{Y}_{(n)}$, where the complexity of the exact SVD of an $m \times n$ matrix is $O(\min(mn^2, m^2n))$. The computational costs for SVD likewise increase rapidly for a large-scale tensor. Notice that Algorithm 1 regards missing entries of $\mathscr{X}$ as zeros during the update process (lines 4-5), and the core tensor $\mathscr{G}$ (line 7) is uniquely determined by the input tensor and factor matrices and is relatively easy to compute.
In summary, applying the naive Tucker-ALS algorithm to sparse tensors causes severe accuracy and scalability issues. Therefore, Algorithm 1 needs to be revised to focus only on observed entries while scaling to large tensors at the same time. In that case, a gradient-based ALS approach is applicable to Algorithm 1, as has been utilized for partially observable matrices and CP factorizations. The gradient-based ALS approach is discussed in Section III.
Definition 7 (Intermediate Data)
We define intermediate data as the memory required for updating factor matrices (lines 4-5 in Algorithm 1), excluding the memory space for storing $\mathscr{X}$, $\mathscr{G}$, and $\mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}$. The size of intermediate data plays a critical role in determining which Tucker factorization algorithms are space-efficient, as we will discuss in Section III-E2.
III Proposed Method
In this section, we describe P-Tucker, our proposed Tucker factorization algorithm for sparse tensors. As described in Definition 6, the computational and memory bottleneck of the standard Tucker-ALS algorithm occurs while updating factor matrices. Therefore, it is imperative to update them efficiently in order to maximize scalability of the algorithm. However, there are several challenges in designing an optimized algorithm for updating factor matrices.
Exploit the characteristic of sparse tensors. Sparse tensors are composed of a vast number of missing entries and a relatively small number of observable entries. How can we exploit the sparsity of given tensors to design an accurate and scalable algorithm for updating factor matrices?
Maximize scalability. The aforementioned Tucker-ALS algorithm suffers from intermediate data explosion and high computational costs while updating factor matrices. How can we formulate efficient algorithms for updating factor matrices in terms of time and memory?
Parallelization. It is crucial to avoid race conditions and adjust workloads between threads to thoroughly employ multi-core parallelism. How can we apply data parallelism on updating factor matrices in order to scale up linearly with respect to the number of threads?
To overcome the above challenges, we suggest the following main ideas, which we describe in later subsections.
P-Tucker-Cache and P-Tucker-Approx accelerate the update process by caching intermediate calculations and utilizing a truncated core tensor, while P-Tucker itself provides a memory-optimized algorithm by default (Section III-C).
Careful distribution of work assures that each thread has independent tasks and balanced workloads when P-Tucker updates factor matrices (Section III-D).
We first present an overview of how P-Tucker factorizes sparse tensors using the Tucker method in Section III-A. After that, we describe the details of our main ideas in Sections III-B to III-D, and we offer a theoretical analysis of P-Tucker in Section III-E.
P-Tucker provides an efficient Tucker factorization algorithm for sparse tensors.
Figure 2 and Algorithm 2 describe the main process of P-Tucker. First, P-Tucker initializes all factor matrices $\mathbf{A}^{(n)}$ and the core tensor $\mathscr{G}$ with random real values between 0 and 1 (step 1 and line 1). After that, P-Tucker updates factor matrices (steps 2-3 and line 3) by Algorithm 3, explained in Section III-B. When all factor matrices are updated, P-Tucker measures the reconstruction error using (6) (step 4 and line 4). In the case of P-Tucker-Approx (step 5 and lines 5-6), “noisy” entries of $\mathscr{G}$ are removed by Algorithm 4, explained in Section III-C. P-Tucker stops iterating if the error converges or the maximum number of iterations is reached (line 7). Finally, P-Tucker performs QR decomposition on all $\mathbf{A}^{(n)}$ to make them orthogonal and updates $\mathscr{G}$ accordingly (step 6 and lines 8-11). Specifically, the QR decomposition of each $\mathbf{A}^{(n)}$ is defined as follows:

$$\mathbf{A}^{(n)} = \mathbf{Q}^{(n)} \mathbf{R}^{(n)}$$
where $\mathbf{Q}^{(n)} \in \mathbb{R}^{I_n \times J_n}$ is column-wise orthonormal and $\mathbf{R}^{(n)} \in \mathbb{R}^{J_n \times J_n}$ is upper-triangular. Therefore, by substituting $\mathbf{Q}^{(n)}$ for $\mathbf{A}^{(n)}$, P-Tucker succeeds in making the factor matrices orthogonal. The core tensor $\mathscr{G}$ must be updated accordingly in order to maintain the same reconstruction error. The update rule of the core tensor is given as follows:

$$\mathscr{G} \leftarrow \mathscr{G} \times_1 \mathbf{R}^{(1)} \times_2 \mathbf{R}^{(2)} \cdots \times_N \mathbf{R}^{(N)}$$
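A small NumPy sketch of this orthogonalization step (illustrative, shown for a 3-order tensor): QR-factorize each factor matrix, keep the orthonormal part, and fold the triangular part into the core so the reconstruction is unchanged.

```python
import numpy as np

def orthogonalize(G, factors):
    Qs = []
    for n, A in enumerate(factors):
        Q, R = np.linalg.qr(A)   # A = Q R, Q column-orthonormal
        Qs.append(Q)
        # G <- G x_n R (n-mode product with the triangular factor)
        G = np.moveaxis(np.tensordot(R, G, axes=(1, n)), 0, n)
    return G, Qs

G = np.random.rand(2, 3, 2)
A = [np.random.rand(4, 2), np.random.rand(5, 3), np.random.rand(6, 2)]
before = np.einsum('abc,ia,jb,kc->ijk', G, A[0], A[1], A[2])
G2, Qs = orthogonalize(G, A)
after = np.einsum('abc,ia,jb,kc->ijk', G2, Qs[0], Qs[1], Qs[2])
assert np.allclose(before, after)   # same reconstruction error
```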
III-B Gradient-based ALS for Updating Factor Matrices
P-Tucker adopts a gradient-based ALS method to update factor matrices, which concentrates only on the observed entries of a tensor. From a high-level point of view, as most ALS methods do, P-Tucker updates one factor matrix at a time while keeping all others fixed. However, when all other matrices are fixed, there are several possible approaches for updating a single factor matrix. Among them, P-Tucker selects a row-wise update method; a key benefit of the row-wise update is that all rows of a factor matrix are independent of each other in terms of minimizing the loss function (7). This property enables multi-core parallelism when updating factor matrices. Given a row of a factor matrix, P-Tucker updates the row by a gradient-based update rule. More specifically, the update rule is derived by computing the gradient of the loss function (7) with respect to the given row and setting it to zero. The update rule for the $i_n$th row of the $n$th factor matrix (see Figure 4) is given as follows; the proof of Equation (10) is in Theorem 1.

$$[\mathbf{A}^{(n)}]_{i_n:} \leftarrow \mathbf{c}_{i_n} \, [\mathbf{B}_{i_n} + \lambda \mathbf{I}_{J_n}]^{-1}$$
where $\mathbf{B}_{i_n}$ is a $J_n \times J_n$ matrix whose $(j_1, j_2)$th entry is

$$\sum_{\forall (i_1, \ldots, i_N) \in \Omega^{(n)}_{i_n}} \delta_{i_n}(j_1) \, \delta_{i_n}(j_2),$$

$\mathbf{c}_{i_n}$ is a length-$J_n$ vector whose $j_n$th entry is

$$\sum_{\forall (i_1, \ldots, i_N) \in \Omega^{(n)}_{i_n}} x_{i_1 \cdots i_N} \, \delta_{i_n}(j_n),$$

and $\boldsymbol{\delta}_{i_n}$ is a length-$J_n$ vector whose $j_n$th entry is the sum, over all core indices $(j_1, \ldots, j_N)$ whose $n$th index is fixed to $j_n$, of

$$g_{j_1 \cdots j_N} \prod_{k \neq n} a^{(k)}_{i_k j_k}.$$
$\Omega^{(n)}_{i_n}$ indicates the subset of $\Omega$ whose $n$th mode’s index is $i_n$, $\lambda$ is a regularization parameter, and $\mathbf{I}_{J_n}$ is a $J_n \times J_n$ identity matrix. As shown in Figure 4, the update rule for the $i_n$th row of $\mathbf{A}^{(n)}$ requires three intermediate data: $\mathbf{B}_{i_n}$, $\mathbf{c}_{i_n}$, and $\boldsymbol{\delta}_{i_n}$. Those data are computed from the subset of observable entries $\Omega^{(n)}_{i_n}$. Thus, the computational cost of updating factor matrices is proportional to the number of observable entries, which lets P-Tucker fully exploit the sparsity of given tensors. Moreover, P-Tucker predicts the missing values of a tensor using (5), rather than treating them as zeros. Equation (5) is computed from the updated factor matrices and core tensor, which are learned from the observed entries of the tensor. Hence, P-Tucker not only enhances the accuracy of factorization, but also reflects the latent characteristics of the observed entries. Note that the matrix $[\mathbf{B}_{i_n} + \lambda \mathbf{I}_{J_n}]$ is positive-definite and invertible; a proof of the update rule is summarized in Section III-E1.
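A minimal sketch of this row-wise update (Equation (10)) in NumPy. This is an assumed illustration, not the paper's C implementation: `entries` is a list of (index tuple, value) pairs from the observed set, and the function accumulates the intermediate data over the entries whose $n$th index matches the row being updated.

```python
import numpy as np

# Row-wise gradient-based update (Equation (10)) for the i_n-th row of
# the n-th factor matrix, using only observed entries in Omega^(n)_{i_n}.
def update_row(entries, G, factors, n, i_n, lam):
    Jn = G.shape[n]
    B = np.zeros((Jn, Jn))
    c = np.zeros(Jn)
    for idx, x in entries:
        if idx[n] != i_n:          # keep only entries in Omega^(n)_{i_n}
            continue
        delta = np.zeros(Jn)       # delta(j_n): core-times-factors sums
        for js in np.ndindex(*G.shape):
            term = G[js]
            for k in range(len(idx)):
                if k != n:
                    term *= factors[k][idx[k], js[k]]
            delta[js[n]] += term
        B += np.outer(delta, delta)
        c += x * delta
    # closed-form minimizer of the regularized row subproblem;
    # B is symmetric, so solve() matches c (B + lam I)^{-1}
    return np.linalg.solve(B + lam * np.eye(Jn), c)
```

Because the update solves the row's least-squares subproblem in closed form, data generated exactly by the model is recovered up to the small λ perturbation.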
Algorithm 3 describes how P-Tucker updates factor matrices. First, in the case of P-Tucker-Cache (lines 1-4), it computes the values of all entries in a cache table, which stores intermediate multiplication results generated while updating factor matrices. This memoization technique makes P-Tucker-Cache a time-efficient algorithm. Next, P-Tucker chooses a row of a factor matrix to update (lines 5-6). After that, P-Tucker computes $\mathbf{B}_{i_n}$ and $\mathbf{c}_{i_n}$, required for updating a row (lines 7-13). P-Tucker then inverts $[\mathbf{B}_{i_n} + \lambda \mathbf{I}_{J_n}]$ (line 14) and updates the row by multiplying $\mathbf{c}_{i_n}$ with the inverse (line 15). In the case of P-Tucker-Cache, the cache table is recalculated using the existing and updated rows (lines 16-19) whenever a row is updated.
III-C Variants: P-Tucker-Cache and P-Tucker-Approx
As discussed in Section III-B, P-Tucker requires three intermediate data, $\mathbf{B}_{i_n}$, $\mathbf{c}_{i_n}$, and $\boldsymbol{\delta}_{i_n}$, whose memory requirement is $O(J_{\max}^2)$ per thread. Considering the memory complexity of the naive Tucker-ALS, which is $O(I_{\max} J_{\max}^{N-1})$, P-Tucker successfully provides a memory-optimized algorithm. We can further optimize P-Tucker in terms of time by a caching algorithm (P-Tucker-Cache) and an approximation algorithm (P-Tucker-Approx).
The crucial difference between P-Tucker and P-Tucker-Cache lies in the computation of the intermediate vector $\boldsymbol{\delta}_{i_n}$ (lines 9-12 in Algorithm 3). In the case of P-Tucker, updating $\delta_{i_n}(j_n)$ requires $N-1$ multiplications per core entry for a given entry pair (line 10). However, if we cache the results of those multiplications for all entry pairs, the update requires only a single operation per core entry (line 12). This trade-off distinguishes P-Tucker-Cache from P-Tucker: P-Tucker-Cache accelerates intermediate calculations by memoization with the cache table, while P-Tucker requires only the small vectors $\mathbf{c}_{i_n}$ and $\boldsymbol{\delta}_{i_n}$ (length $J_n$) and a small matrix $\mathbf{B}_{i_n}$ ($J_n \times J_n$) as intermediate data. Note that when $a^{(n)}_{i_n j_n}$ is 0 (lines 12 and 19), P-Tucker-Cache conducts the full multiplications as P-Tucker does (line 10).
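An illustrative sketch of the cached δ computation (names assumed, not the paper's code): the cache stores, for one observed entry, the full product $g \cdot \prod_k a^{(k)}$ for every core entry; δ for mode $n$ is then recovered by one division, with a fallback to full multiplication when the divisor is zero.

```python
import numpy as np

def delta_cached(cache, G, factors, idx, n):
    Jn = G.shape[n]
    delta = np.zeros(Jn)
    for js in np.ndindex(*G.shape):
        a = factors[n][idx[n], js[n]]
        if a != 0.0:
            delta[js[n]] += cache[js] / a   # one division (line 12)
        else:
            # fall back to the plain multiplications (line 10)
            term = G[js]
            for k in range(len(idx)):
                if k != n:
                    term *= factors[k][idx[k], js[k]]
            delta[js[n]] += term
    return delta

G = np.random.rand(2, 2)
A = [np.random.rand(3, 2), np.random.rand(4, 2)]
idx = (1, 2)
# cache for this entry: g * product over ALL modes
cache = np.array([[G[p, q] * A[0][1, p] * A[1][2, q] for q in range(2)]
                  for p in range(2)])
d = delta_cached(cache, G, A, idx, 0)
expect = np.array([sum(G[p, q] * A[1][2, q] for q in range(2))
                   for p in range(2)])
assert np.allclose(d, expect)
```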
The main intuition of P-Tucker-Approx is that there exist “noisy” entries in the core tensor $\mathscr{G}$, and we can accelerate the update process by truncating them. How, then, can we determine whether an entry of $\mathscr{G}$ is “noisy” or not? A naive approach would treat an entry with a small value as “noisy”, as in the truncated SVD. In that case, however, small-value entries are not always negligible, since their contribution to minimizing the error (6) can be larger than that of large-value entries. Hence, we propose a more precise criterion, which regards an entry with a high partial reconstruction error as “noisy”. The partial reconstruction error of an entry of $\mathscr{G}$ is derived from the sum of the terms of (6) related only to that entry, and is given as follows:
Note that auxiliary symbols are used to simplify the equation. The partial reconstruction error suggests a more precise guideline for “noisy” entries, since it is a part of (6), while the naive approach estimates the error based on the entry's value alone. Figure 5 illustrates the distribution of partial reconstruction errors and the cumulative function of relative reconstruction error on the latest MovieLens dataset. As expected by our intuition, only 20% of the entries of $\mathscr{G}$ generate about 80% of the total reconstruction error. Algorithm 4 describes how P-Tucker-Approx truncates “noisy” entries of $\mathscr{G}$. It first computes the partial reconstruction errors (lines 1-2) for all entries of $\mathscr{G}$, sorts them in descending order (line 3) along with their indices, and finally truncates the top “noisy” entries of $\mathscr{G}$ (line 4). P-Tucker-Approx performs Algorithm 4 at each iteration (lines 2-7 in Algorithm 2), which reduces the number of non-zeros in $\mathscr{G}$ step by step. Therefore, the elapsed time per iteration also decreases, since the time complexity of P-Tucker-Approx depends on the number of non-zeros of $\mathscr{G}$. Note that we can find an optimal approximation point whose speed-up over accuracy loss is maximized (see Figure 9).
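The truncation step of Algorithm 4 can be sketched as follows (illustrative; the per-entry partial-error scores are assumed to be precomputed by the criterion above, and `K` denotes the number of entries to remove):

```python
import numpy as np

def truncate_noisy(G, scores, K):
    # zero out the K core entries with the largest partial errors
    G = np.ascontiguousarray(G, dtype=float).copy()
    flat = G.ravel()
    noisy = np.argsort(scores.ravel())[::-1][:K]
    flat[noisy] = 0.0
    return G

G = np.ones((2, 2))
scores = np.array([[0.1, 0.9], [0.5, 0.2]])
G2 = truncate_noisy(G, scores, 2)
# the two highest-score entries, (0, 1) and (1, 0), are removed
assert G2[0, 1] == 0.0 and G2[1, 0] == 0.0 and G2[0, 0] == 1.0
```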
With the above optimizations, P-Tucker becomes the most time and memory efficient method in theoretical and experimental perspectives (see Table III).
III-D Careful Distribution of Work
There are three sections where multi-core parallelization is applicable in Algorithms 2 and 3. The first section (lines 2-4 and 17-19 in Algorithm 3) is for P-Tucker-Cache when it computes and updates the cache table. The second section (lines 6-15 in Algorithm 3) is for updating factor matrices, and the last section (line 4 in Algorithm 2) is for measuring the reconstruction error. For each section, P-Tucker carefully distributes tasks to threads while maintaining independence between them. Furthermore, P-Tucker utilizes a dynamic scheduling method to assure that each thread has balanced workloads. The details of how P-Tucker parallelizes each section are as follows. Note that $T$ indicates the number of threads used for parallelization.
Section 1: Computing and Updating the Cache Table (Only for P-Tucker-Cache). All rows of the cache table are independent of each other when they are computed or updated. Thus, P-Tucker distributes all rows equally over $T$ threads, and each thread computes or updates its allocated rows independently using static scheduling.
Section 2: Updating Factor Matrices. All rows of $\mathbf{A}^{(n)}$ are independent of each other with regard to minimizing the loss function (7). Therefore, P-Tucker distributes all rows uniformly to each thread, and updates them in parallel. Since $|\Omega^{(n)}_{i_n}|$ differs for each row, the workload of each thread may vary considerably; thus, P-Tucker employs dynamic scheduling in this part.
Section 3: Calculating Reconstruction Error. All observable entries are independent of each other in measuring the reconstruction error. Thus, P-Tucker distributes them evenly over $T$ threads, and each thread computes its partial error separately using static scheduling. At the end, P-Tucker aggregates the partial errors from all threads.
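A toy sketch of this third section (illustrative only; the paper's implementation uses OpenMP in C, approximated here with a Python thread pool): entries are split evenly, each worker computes a partial squared error over its chunk, and the partials are summed at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sq_error(residuals, T=4):
    chunks = [residuals[t::T] for t in range(T)]    # static, even split
    def partial(chunk):
        return sum(r * r for r in chunk)            # independent work
    with ThreadPoolExecutor(max_workers=T) as ex:
        return sum(ex.map(partial, chunks))         # aggregate partials

assert parallel_sq_error([1.0, 2.0, 2.0]) == 9.0
```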
III-E Theoretical Analysis
In this section, we theoretically prove the correctness and the convergence of P-Tucker.
Theorem 1 (Correctness of P-Tucker)
Given all other variables fixed, the update rule (10) minimizes the loss function (7) with respect to the $i_n$th row of the factor matrix $\mathbf{A}^{(n)}$.
Theorem 2 (Convergence of P-Tucker)
P-Tucker converges since (7) is bounded and decreases monotonically.
In this section, we analyze the time and memory complexities of P-Tucker and its variants. For simplicity, we assume $I_1 = \cdots = I_N = I$ and $J_1 = \cdots = J_N = J$. Table III summarizes the time and memory complexities of P-Tucker and other methods. As expected from Section III-C, P-Tucker presents the best memory complexity among all algorithms. While P-Tucker-Cache shows better time complexity than P-Tucker, P-Tucker-Approx exhibits the best time complexity thanks to the reduced number of non-zeros in $\mathscr{G}$. Note that we calculate time complexities per iteration (lines 2-7 in Algorithm 2), and we focus on the memory complexities of intermediate data, not of all variables.
Theorem 3 (Time complexity of P-Tucker)
The time complexity of P-Tucker is $O(N^2 |\Omega| J^N + N I J^3)$.
Given the $i_n$th row of $\mathbf{A}^{(n)}$ (lines 5-6 in Algorithm 3), computing $\boldsymbol{\delta}_{i_n}$ (line 10) takes $O(N J^N)$ per observed entry. Updating $\mathbf{B}_{i_n}$ and $\mathbf{c}_{i_n}$ (line 13) takes $O(J^2)$ since $\boldsymbol{\delta}_{i_n}$ is already calculated. Inverting $[\mathbf{B}_{i_n} + \lambda \mathbf{I}_{J_n}]$ (line 14) takes $O(J^3)$, and updating a row (line 15) takes $O(J^2)$. Thus, the time complexity of updating the $i_n$th row of $\mathbf{A}^{(n)}$ (lines 7-15) is $O(|\Omega^{(n)}_{i_n}| N J^N + J^3)$. Iterating over all rows of $\mathbf{A}^{(n)}$ takes $O(N |\Omega| J^N + I J^3)$, and updating all $N$ factor matrices takes $O(N^2 |\Omega| J^N + N I J^3)$. According to (6), computing the reconstruction error (line 4 in Algorithm 2) takes $O(N |\Omega| J^N)$. Thus, the time complexity of P-Tucker is $O(N^2 |\Omega| J^N + N I J^3)$.
Theorem 4 (Memory complexity of P-Tucker)
The memory complexity of P-Tucker is $O(T J^2)$.
The intermediate data of P-Tucker consist of two vectors, $\mathbf{c}_{i_n}$ and $\boldsymbol{\delta}_{i_n}$ (length $J_n$), and two matrices, $\mathbf{B}_{i_n}$ and its inverse ($J_n \times J_n$). The memory space for those variables is released after updating the $i_n$th row of $\mathbf{A}^{(n)}$; thus, it is not accumulated during the iterations. Since each thread has its own intermediate data, the total memory complexity of P-Tucker is $O(T J^2)$.
Theorem 5 (Time complexity of P-Tucker-Cache)
The time complexity of P-Tucker-Cache is $O(N |\Omega| J^N + N I J^3)$.
In Algorithm 3, computing $\boldsymbol{\delta}_{i_n}$ (line 12) takes $O(J^N)$ per observed entry by the caching method. Precomputing and updating the cache table (lines 2-4 and 17-19) also takes $O(N |\Omega| J^N)$ in total. Since all other parts of P-Tucker-Cache are equal to those of P-Tucker, the time complexity of P-Tucker-Cache is $O(N |\Omega| J^N + N I J^3)$.
Theorem 6 (Memory complexity of P-Tucker-Cache)
The memory complexity of P-Tucker-Cache is $O(|\Omega| J^N)$.
The cache table requires $O(|\Omega| J^N)$ memory space, which is much larger than that of the other intermediate data (see Theorem 4). Thus, the memory complexity of P-Tucker-Cache is $O(|\Omega| J^N)$.
Theorem 7 (Time complexity of P-Tucker-Approx)
The time complexity of P-Tucker-Approx is $O(N^2 |\Omega| |\mathscr{G}| + N I J^3)$, where $|\mathscr{G}|$ denotes the number of remaining non-zero entries of the core tensor.
Refer to the supplementary material .
Theorem 8 (Memory complexity of P-Tucker-Approx)
The memory complexity of P-Tucker-Approx is $O(T J^2)$, the same as that of P-Tucker.
Refer to the supplementary material .
In this section, we present experimental results of P-Tucker and other methods. We focus on answering the following questions.
Data Scalability (Section IV-B). How well do P-Tucker and competitors scale up with respect to the following aspects of a given tensor: 1) the order, 2) the dimensionality, 3) the number of observable entries, and 4) the rank?
Effectiveness of P-Tucker-Cache and P-Tucker-Approx (Section IV-C). How successfully do P-Tucker-Cache and P-Tucker-Approx suggest the trade-offs between time-memory and time-accuracy, respectively?
Parallelization Scalability (Section IV-D). How well does P-Tucker scale up with respect to the number of threads used for parallelization?
Real-World Accuracy (Section IV-E). How accurately do P-Tucker and other methods factorize real-world tensors and predict their missing entries?
| Name | Order | Dimensionality | $|\Omega|$ | Rank |
| Yahoo-music | 4 | (1M, 625K, 133, 24) | 252M | 10 |
| MovieLens | 4 | (138K, 27K, 21, 24) | 20M | 10 |
IV-A Experimental Settings
We use both real-world and synthetic tensors to evaluate P-Tucker and competitors.
Table IV summarizes the tensors we used in experiments, which are available at https://datalab.snu.ac.kr/ptucker/.
For real-world tensors, we use the Yahoo-music and MovieLens datasets.
We compare P-Tucker and its variants with three state-of-the-art Tucker factorization (TF) methods. Descriptions of all methods are given as follows:
P-Tucker (default): the proposed method which minimizes intermediate data by a gradient-based update rule, used by default throughout all experiments.
P-Tucker-Cache: the time-optimized variant of P-Tucker, which caches intermediate multiplications to update factor matrices efficiently.
P-Tucker-Approx: the time-optimized variant of P-Tucker, which shows a trade-off between time and accuracy by truncating “noisy” entries of a core tensor.
Tucker-wOpt: the accuracy-focused TF method utilizing a nonlinear conjugate gradient algorithm for updating factor matrices and a core tensor.
Tucker-CSF: the speed-focused TF algorithm which accelerates a tensor-times-matrix chain (TTMc) by a compressed sparse fiber (CSF) structure.
Note that other TF methods (e.g., [19, 30]) are excluded since they present similar or more limited scalability than the competitors mentioned above, and some factorization models (e.g., [31, 32]) that are not directly applicable to tensors are not considered either.
P-Tucker is implemented in C, with the OpenMP and Armadillo libraries used for parallelization and linear algebra operations. The source code of P-Tucker is publicly available at https://datalab.snu.ac.kr/ptucker/.