Scalable Tucker Factorization for Sparse Tensors - Algorithms and Discoveries


Abstract

Given sparse multi-dimensional data (e.g., (user, movie, time; rating) for movie recommendations), how can we discover latent concepts/relations and predict missing values? Tucker factorization has been widely used to solve such problems with multi-dimensional data, which are modeled as tensors. However, most Tucker factorization algorithms regard and estimate missing entries as zeros, which leads to highly inaccurate decompositions. Moreover, the few methods that focus on accuracy exhibit limited scalability, since they require huge memory and heavy computational costs while updating factor matrices.

In this paper, we propose P-Tucker, a scalable Tucker factorization method for sparse tensors. P-Tucker performs alternating least squares with a gradient-based update rule in a fully parallel way, which significantly reduces the memory requirements for updating factor matrices. Furthermore, we offer two variants of P-Tucker: a caching algorithm P-Tucker-Cache and an approximation algorithm P-Tucker-Approx, both of which accelerate the update process. Experimental results show that P-Tucker exhibits 1.7-14.1x speed-up and 1.4-4.8x lower error compared to the state-of-the-art. In addition, P-Tucker scales near-linearly with the number of non-zeros in a tensor and the number of threads. Thanks to P-Tucker, we successfully discover hidden concepts and relations in a large-scale real-world tensor, while existing methods cannot reveal latent features due to their limited scalability or low accuracy.


I Introduction

Given a large-scale sparse tensor, how can we discover latent concepts/relations and predict missing entries? How can we design a time- and memory-efficient algorithm for analyzing a given tensor? Various real-world data can be modeled as tensors, or multi-dimensional arrays (e.g., (user, movie, time; rating) for movie recommendations). Many real-world tensors are sparse and partially observable, i.e., composed of a vast number of missing entries and a relatively small number of observable entries. Examples of such data include item ratings [1], social networks [2], and web search logs [3], where most entries are missing. Tensor factorization has been used effectively for analyzing tensors [4, 5, 6, 7, 8, 9, 10]. Among tensor factorization methods [11], Tucker factorization has received much interest since it is a generalized form of other factorization methods like CANDECOMP/PARAFAC (CP) decomposition, and it allows us to examine not only latent factors but also relations hidden in tensors.

While many algorithms have been developed for Tucker factorization [12, 13, 14, 15], most methods produce highly inaccurate factorizations since they assume and predict missing entries as zeros, even though the values of those missing entries are unknown. Moreover, existing methods that focus only on observed entries exhibit limited scalability, since they exploit tensor operations and singular value decomposition (SVD), leading to heavy memory and computational requirements. In particular, tensor operations generate huge intermediate data for large-scale tensors, a problem called intermediate data explosion [16]. A few Tucker algorithms [17, 18, 19, 20] have been developed to address the above problems, but they fail to solve the scalability and accuracy issues at the same time. In summary, the major challenges for decomposing sparse tensors are 1) how to handle missing entries for an accurate and scalable factorization, and 2) how to avoid the intermediate data explosion and high computational costs caused by tensor operations and SVD.

Method | Scale | Speed | Memory | Accuracy
Tucker-wOpt [18] | | | |
Tucker-CSF [20] | | | |
[17] | | | |
P-Tucker | ✓ | ✓ | ✓ | ✓
TABLE I: Scalability summary of our proposed method P-Tucker and competitors. A check mark indicates that the method is scalable with respect to a particular aspect. P-Tucker is the only method scalable in all aspects (tensor scale, factorization speed, memory requirement, and accuracy of decomposition); the competitors have limited scalability in some aspects.

In this paper, we propose P-Tucker, a scalable Tucker factorization method for sparse tensors. P-Tucker performs alternating least squares (ALS) with a gradient-based update rule, which focuses only on the observed entries of a tensor. The gradient-based approach considerably reduces the amount of memory required for updating factor matrices, enabling P-Tucker to avoid the intermediate data explosion problem. In addition, to speed up the update procedure, we provide two time-optimized versions: a caching method P-Tucker-Cache and an approximation method P-Tucker-Approx. P-Tucker fully employs multi-core parallelism by carefully allocating rows of a factor matrix to each thread, considering independence and fairness. Table I compares P-Tucker and the competitors with regard to these aspects.

Our main contributions are the following:

  • Algorithm. We propose P-Tucker, a scalable Tucker factorization method for sparse tensors. P-Tucker not only enhances the accuracy of factorization by focusing on observed values but also achieves higher scalability by utilizing a gradient-based ALS rather than using tensor operations and SVD for updating factor matrices.

  • Theory. We suggest a row-wise update rule for factor matrices and prove its correctness and convergence. Moreover, we analyze the time and memory complexities of P-Tucker and other methods, as summarized in Table III.

  • Performance. P-Tucker provides the best performance across all aspects: tensor scale, factorization speed, memory requirement, and accuracy of decomposition. Experimental results demonstrate that P-Tucker achieves 1.7-14.1x speed-up with 1.4-4.8x lower error for large-scale tensors, as summarized in Figures 6, 7, and 11.

  • Discovery. P-Tucker successfully reveals hidden concepts and relations in a large-scale real-world tensor, the MovieLens dataset, while the state-of-the-art methods cannot identify latent features due to their limited scalability or low accuracy (see Tables V and VI).

The source code of P-Tucker and the datasets used in this paper are publicly available at https://datalab.snu.ac.kr/ptucker/ for reproducibility. The rest of this paper is organized as follows. Section II explains preliminaries on tensors, their operations, and their factorization methods. Section III describes our proposed method P-Tucker. Section IV presents experimental results of P-Tucker and other methods. Section V describes our discovery results on the MovieLens dataset. After reviewing related work in Section VI, we conclude in Section VII.

II Preliminaries

In this section, we describe the preliminaries of a tensor in Section II-A, its operations in Section II-B, and its factorization methods in Section II-C. Notations and definitions are summarized in Table II.

II-A Tensor

Tensors, or multi-dimensional arrays, are a generalization of vectors (1-order tensors) and matrices (2-order tensors) to higher orders. As a matrix has rows and columns, an N-order tensor has N modes; their lengths (also called dimensionalities) are denoted by I_1 through I_N, respectively. We denote tensors by boldface Euler script letters (e.g., X), matrices by boldface capital letters (e.g., A), and vectors by boldface lowercase letters (e.g., a). An entry of a tensor is denoted by the symbolic name of the tensor with its indices in subscript. For example, x_{i_1 i_2 ... i_N} indicates the (i_1, i_2, ..., i_N)-th entry of X, and a_{ij} denotes the (i, j)-th entry of A. The i-th row of A is denoted by a_{i:}, and the j-th column of A is denoted by a_{:j}.

Symbol | Definition
X | input tensor (X ∈ R^{I_1 × ... × I_N})
G | core tensor (G ∈ R^{J_1 × ... × J_N})
N | order (number of modes) of X
I_n, J_n | dimensionality of the nth mode of X and G
A^(n) | nth factor matrix (A^(n) ∈ R^{I_n × J_n})
x_{i_1...i_N} | (i_1, ..., i_N)-th entry of X
Ω | set of observable entries of X
Ω^(n)_{i_n} | set of observable entries whose nth mode's index is i_n
|Ω| | number of observable entries of X
λ | regularization parameter for factor matrices
‖X‖ | Frobenius norm of tensor X
T | number of threads
α | an entry of the input tensor X
β | an entry of the core tensor G
M | cache table
γ | truncation rate
TABLE II: Table of symbols.

II-B Tensor Operations

We review some tensor operations used for Tucker factorization. More tensor operations are summarized in [11].

Definition 1 (Frobenius Norm)

Given an N-order tensor X ∈ R^{I_1 × ... × I_N}, the Frobenius norm of X is denoted by ‖X‖ and defined as follows:

\[ \|\mathfrak{X}\| = \sqrt{ \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_N=1}^{I_N} \mathfrak{X}_{i_1 i_2 \cdots i_N}^{2} } \qquad (1) \]
Definition 2 (Matricization/Unfolding)

Matricization transforms a tensor into a matrix. The mode-n matricization of a tensor X ∈ R^{I_1 × ... × I_N} is denoted by X_(n). The mapping from an element (i_1, ..., i_N) of X to an element (i_n, j) of X_(n) is given as follows:

\[ j = 1 + \sum_{\substack{k=1 \\ k \neq n}}^{N} \Big[ (i_k - 1) \prod_{\substack{m=1 \\ m \neq n}}^{k-1} I_m \Big] \qquad (2) \]

Note that all indices of a tensor and a matrix begin from 1.

Definition 3 (n-Mode Product)

The n-mode product enables multiplication between a tensor and a matrix. The n-mode product of a tensor X ∈ R^{I_1 × ... × I_N} with a matrix U ∈ R^{J × I_n} is denoted by X ×_n U (∈ R^{I_1 × ... × I_{n-1} × J × I_{n+1} × ... × I_N}). Element-wise, we have

\[ (\mathfrak{X} \times_n \mathbf{U})_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_N} = \sum_{i_n=1}^{I_n} \mathfrak{X}_{i_1 i_2 \cdots i_N}\, u_{j\, i_n} \qquad (3) \]

II-C Tensor Factorization Methods

Fig. 1: Tucker factorization for a 3-way tensor.

Our proposed method P-Tucker is based on Tucker factorization, one of the most popular decomposition methods. More details about other factorization algorithms are summarized in Section VI and [11].

Definition 4 (Tucker Factorization)

Given an N-order tensor X ∈ R^{I_1 × ... × I_N}, Tucker factorization approximates X by a core tensor G ∈ R^{J_1 × ... × J_N} and factor matrices A^(1), ..., A^(N) with A^(n) ∈ R^{I_n × J_n}. Figure 1 illustrates a Tucker factorization result for a 3-way tensor. The core tensor G is assumed to be smaller and denser than the input tensor X, and the factor matrices are normally assumed to be orthogonal. Regarding interpretations of factorization results, each factor matrix A^(n) represents the latent features of the objects related to the nth mode of X, and each element of the core tensor G indicates the weight of the relation composed of the corresponding columns of the factor matrices. Tucker factorization with tensor operations is expressed as follows:

\[ \min_{\mathfrak{G}, \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}} \; \big\| \mathfrak{X} - \mathfrak{G} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \cdots \times_N \mathbf{A}^{(N)} \big\| \qquad (4) \]

Note that the loss function (4) is calculated over all entries of X, and all missing values of X are regarded as zeros. Equivalently, an element-wise expression is given as follows:

\[ \hat{\mathfrak{X}}_{i_1 \cdots i_N} = \sum_{j_1=1}^{J_1} \cdots \sum_{j_N=1}^{J_N} \mathfrak{G}_{j_1 \cdots j_N}\, a^{(1)}_{i_1 j_1}\, a^{(2)}_{i_2 j_2} \cdots a^{(N)}_{i_N j_N} \qquad (5) \]

Equation (5) is used to predict the values of missing entries after G and A^(1), ..., A^(N) are found. We define the reconstruction error of a Tucker factorization of X by the following rule, where Ω is the set of observable entries of X:

\[ \text{Recon. Error} = \sqrt{ \sum_{\forall (i_1, \ldots, i_N) \in \Omega} \big( \mathfrak{X}_{i_1 \cdots i_N} - \hat{\mathfrak{X}}_{i_1 \cdots i_N} \big)^{2} } \qquad (6) \]
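For concreteness, the following is a minimal C++/Armadillo sketch (our own illustration, not the authors' released code) of the element-wise reconstruction (5) and the observed-entry error (6) for a 3-way tensor stored in coordinate (COO) format; the Entry layout and the absence of any normalization in the error are assumptions.

```cpp
// Sketch: reconstruct observed entries of a 3-way tensor from a Tucker model
// and accumulate the error over the observed set only, as in (5) and (6).
#include <armadillo>
#include <cmath>
#include <cstdio>
#include <vector>

struct Entry { arma::uword i1, i2, i3; double val; };  // one observed entry (COO)

double reconstruct(const Entry& e, const arma::cube& G,
                   const arma::mat& A1, const arma::mat& A2, const arma::mat& A3) {
    double x = 0.0;  // sum over all core entries, weighted by factor rows
    for (arma::uword j1 = 0; j1 < G.n_rows; ++j1)
        for (arma::uword j2 = 0; j2 < G.n_cols; ++j2)
            for (arma::uword j3 = 0; j3 < G.n_slices; ++j3)
                x += G(j1, j2, j3) * A1(e.i1, j1) * A2(e.i2, j2) * A3(e.i3, j3);
    return x;
}

// Reconstruction error restricted to the observed set Omega, as in (6).
double recon_error(const std::vector<Entry>& omega, const arma::cube& G,
                   const arma::mat& A1, const arma::mat& A2, const arma::mat& A3) {
    double sq = 0.0;
    for (const Entry& e : omega) {
        double d = e.val - reconstruct(e, G, A1, A2, A3);
        sq += d * d;
    }
    return std::sqrt(sq);
}

int main() {
    arma::arma_rng::set_seed(1);
    arma::cube G(3, 3, 3, arma::fill::randu);
    arma::mat A1(30, 3, arma::fill::randu), A2(20, 3, arma::fill::randu), A3(10, 3, arma::fill::randu);
    std::vector<Entry> omega = { {0, 1, 2, 0.5}, {7, 3, 9, 0.9} };  // toy observed entries
    std::printf("observed-entry error: %f\n", recon_error(omega, G, A1, A2, A3));
    return 0;
}
```

Keeping the data in COO form means the cost of both functions is proportional to |Ω|, which is exactly the property that a sparsity-aware factorization exploits.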
Definition 5 (Sparse Tucker Factorization)

Given a tensor X with a set of observable entries Ω, the goal of sparse Tucker factorization of X is to find factor matrices A^(1), ..., A^(N) and a core tensor G which minimize (7).

\[ L\big(\mathfrak{G}, \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}\big) = \sum_{\forall (i_1, \ldots, i_N) \in \Omega} \Big( \mathfrak{X}_{i_1 \cdots i_N} - \sum_{j_1=1}^{J_1} \cdots \sum_{j_N=1}^{J_N} \mathfrak{G}_{j_1 \cdots j_N} \prod_{n=1}^{N} a^{(n)}_{i_n j_n} \Big)^{2} + \lambda \sum_{n=1}^{N} \big\| \mathbf{A}^{(n)} \big\|^{2} \qquad (7) \]

Note that the loss function (7) depends only on the observable entries of X, and L2 regularization is used in (7) to prevent overfitting, as has been widely done in machine learning problems [21, 22, 23].

Definition 6 (Alternating Least Squares)

To minimize the loss functions (4) and (7), an alternating least squares (ALS) technique is widely used [11, 14], which updates one factor matrix or the core tensor at a time while keeping all the others fixed.

Input: Tensor X ∈ R^{I_1 × ... × I_N} and core tensor dimensionalities J_1, ..., J_N.
Output: Updated factor matrices A^(n) ∈ R^{I_n × J_n} (n = 1, ..., N) and updated core tensor G ∈ R^{J_1 × ... × J_N}.
1  initialize all factor matrices A^(n)
2  repeat
3      for n = 1, ..., N do
4          Y ← X ×_1 A^(1)T ... ×_{n-1} A^(n-1)T ×_{n+1} A^(n+1)T ... ×_N A^(N)T
5          A^(n) ← J_n leading left singular vectors of Y_(n)
6      G ← X ×_1 A^(1)T ×_2 A^(2)T ... ×_N A^(N)T
7  until the max. iteration or the reconstruction error converges;
Algorithm 1 Tucker-ALS

Algorithm 1 describes a conventional Tucker factorization based on ALS, which is called the higher-order orthogonal iteration (HOOI) (see [11] for details). The computational and memory bottleneck of Algorithm 1 is updating the factor matrices (lines 4-5), which requires tensor operations and SVD. Specifically, Algorithm 1 requires storing the full dense matrix Y_(n), whose size is I_n × ∏_{k≠n} J_k; thus the amount of memory needed for storing Y_(n) is O(I_n ∏_{k≠n} J_k). The required memory grows rapidly when the order, the dimensionality, or the rank of a tensor increases, and ultimately causes intermediate data explosion [16]. Moreover, Algorithm 1 computes the SVD of Y_(n), where the cost of an exact SVD of an m × n matrix is O(min(mn^2, m^2 n)); these computational costs also increase rapidly for a large-scale tensor. Notice that Algorithm 1 treats the missing entries of X as zeros during the update process (lines 4-5), and that the core tensor G (line 6) is uniquely determined and relatively easy to compute from the input tensor and the factor matrices.
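To make the scale of this intermediate data concrete, consider an assumed configuration in the range of the tensors used later in the experiments: N = 4, I_n = 10^6, and J_k = 10 for every mode. Then a single unfolded intermediate matrix occupies

\[ I_n \prod_{k \neq n} J_k \;=\; 10^{6} \times 10^{3} \;=\; 10^{9} \ \text{entries} \;\approx\; 8\ \text{GB} \]

in double precision, before any SVD workspace is counted; this is the intermediate data explosion that P-Tucker is designed to avoid.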

In summary, applying the naive Tucker-ALS algorithm to sparse tensors causes severe accuracy and scalability issues. Therefore, Algorithm 1 needs to be revised so that it focuses only on observed entries and, at the same time, scales to large tensors. For that purpose, a gradient-based ALS approach is applicable to Algorithm 1, as has been done for partially observable matrices [23] and CP factorizations [24]. The gradient-based ALS approach is discussed in Section III.

Definition 7 (Intermediate Data)

We define intermediate data as the memory required for updating the factor matrices A^(n) (lines 4-5 in Algorithm 1), excluding the memory space for storing X, G, and A^(1), ..., A^(N). The size of the intermediate data plays a critical role in determining which Tucker factorization algorithms are space-efficient, as we will discuss in Section III-E2.

III Proposed Method

In this section, we describe P-Tucker, our proposed Tucker factorization algorithm for sparse tensors. As described in Definition 6, the computational and memory bottleneck of the standard Tucker-ALS algorithm occurs while updating factor matrices. Therefore, it is imperative to update them efficiently in order to maximize scalability of the algorithm. However, there are several challenges in designing an optimized algorithm for updating factor matrices.

  1. Exploit the characteristic of sparse tensors. Sparse tensors are composed of a vast number of missing entries and a relatively small number of observable entries. How can we exploit the sparsity of given tensors to design an accurate and scalable algorithm for updating factor matrices?

  2. Maximize scalability. The aforementioned Tucker-ALS algorithm suffers from intermediate data explosion and high computational costs while updating factor matrices. How can we formulate efficient algorithms for updating factor matrices in terms of time and memory?

  3. Parallelization. It is crucial to avoid race conditions and adjust workloads between threads to thoroughly employ multi-core parallelism. How can we apply data parallelism on updating factor matrices in order to scale up linearly with respect to the number of threads?

To overcome the above challenges, we suggest the following main ideas, which we describe in later subsections.

  1. Gradient-based ALS fully exploits the sparsity of a given tensor and enhances the accuracy of a factorization (Figure 3 and Section III-B).

  2. P-Tucker-Cache and P-Tucker-Approx accelerate the update process by caching intermediate calculations and utilizing a truncated core tensor, while P-Tucker itself provides a memory-optimized algorithm by default (Section III-C).

  3. Careful distribution of work assures that each thread has independent tasks and balanced workloads when P-Tucker updates factor matrices (Section III-D).

We first give an overview of how P-Tucker factorizes sparse tensors with the Tucker method in Section III-A. After that, we describe the details of our main ideas in Sections III-B to III-D, and we offer a theoretical analysis of P-Tucker in Section III-E.

III-A Overview

P-Tucker provides an efficient Tucker factorization algorithm for sparse tensors.

Fig. 2: An overview of P-Tucker. After initialization, P-Tucker updates factor matrices in a fully-parallel way. When the reconstruction error converges, P-Tucker performs QR decomposition to make factor matrices orthogonal and updates a core tensor.
Input: Tensor X ∈ R^{I_1 × ... × I_N}, core tensor dimensionalities J_1, ..., J_N, and truncation rate γ (P-Tucker-Approx only).
Output: Updated factor matrices A^(n) (n = 1, ..., N) and updated core tensor G.
1  initialize factor matrices A^(n) and core tensor G
2  repeat
3      update factor matrices A^(n) by Algorithm 3
4      calculate the reconstruction error using (6)
5      if P-Tucker-Approx then                ▷ Truncation
6          remove "noisy" entries of G by Algorithm 4
7  until the maximum iteration or the reconstruction error converges;
8  for n = 1, ..., N do
9      [Q^(n), R^(n)] ← A^(n)                 ▷ QR decomposition
10     A^(n) ← Q^(n)                          ▷ Orthogonalize A^(n)
11     G ← G ×_n R^(n)                        ▷ Update core tensor G
Algorithm 2 P-Tucker for Sparse Tensors

Figure 2 and Algorithm 2 describe the main process of P-Tucker. First, P-Tucker initializes all factor matrices A^(n) and the core tensor G with random real values between 0 and 1 (step 1 and line 1). After that, P-Tucker updates the factor matrices (steps 2-3 and line 3) by Algorithm 3, explained in Section III-B. When all factor matrices are updated, P-Tucker measures the reconstruction error using (6) (step 4 and line 4). In the case of P-Tucker-Approx (step 5 and lines 5-6), the "noisy" entries of G are removed by Algorithm 4, explained in Section III-C. P-Tucker stops iterating when the error converges or the maximum iteration is reached (line 7). Finally, P-Tucker performs QR decomposition on all A^(n) to make them orthogonal and updates G accordingly (step 6 and lines 8-11). Specifically, the QR decomposition [25] of each A^(n) is

\[ \mathbf{A}^{(n)} = \mathbf{Q}^{(n)} \mathbf{R}^{(n)} \qquad (8) \]

where Q^(n) ∈ R^{I_n × J_n} is column-wise orthonormal and R^(n) ∈ R^{J_n × J_n} is upper-triangular. Therefore, by substituting Q^(n) for A^(n), P-Tucker makes the factor matrices orthogonal. The core tensor G must be updated accordingly in order to maintain the same reconstruction error. According to [26], the update rule of the core tensor is given as follows:

\[ \mathfrak{G} \leftarrow \mathfrak{G} \times_1 \mathbf{R}^{(1)} \times_2 \mathbf{R}^{(2)} \cdots \times_N \mathbf{R}^{(N)} \qquad (9) \]
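A minimal C++/Armadillo sketch of this post-processing step is given below (our own illustration, assuming a small dense 3-way core stored as an arma::cube); it QR-factorizes one factor matrix, keeps the orthonormal factor, and absorbs the upper-triangular factor into the core so that the reconstruction is unchanged.

```cpp
// Sketch: orthogonalize a factor matrix with QR and push R into the core.
#include <armadillo>
#include <cstdio>

// Mode-1 product of a 3-way core G (J1 x J2 x J3) with a matrix R.
// Analogous helpers for modes 2 and 3 are omitted for brevity.
arma::cube mode1_product(const arma::cube& G, const arma::mat& R) {
    arma::cube out(R.n_rows, G.n_cols, G.n_slices, arma::fill::zeros);
    for (arma::uword k = 0; k < G.n_slices; ++k)
        out.slice(k) = R * G.slice(k);   // multiply every frontal slice
    return out;
}

void orthogonalize_mode1(arma::mat& A1, arma::cube& G) {
    arma::mat Q, R;
    arma::qr_econ(Q, R, A1);   // A1 = Q * R, with Q column-orthonormal
    A1 = Q;                    // replace the factor by its orthonormal part
    G  = mode1_product(G, R);  // absorb R into the core: same reconstruction
}

int main() {
    arma::arma_rng::set_seed(1);
    arma::mat  A1(50, 5, arma::fill::randu);   // factor matrix of mode 1
    arma::cube G(5, 4, 3, arma::fill::randu);  // small dense core

    arma::cube before = mode1_product(G, A1);  // mode-1 contribution before
    orthogonalize_mode1(A1, G);
    arma::cube after  = mode1_product(G, A1);  // and after the substitution

    std::printf("max difference: %e\n", arma::abs(before - after).max());
    return 0;
}
```

The main function checks that the mode-1 contribution of the model is numerically unchanged after the substitution, up to floating-point rounding.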
Fig. 3: An overview of updating factor matrices. P-Tucker performs a gradient-based ALS method which updates each factor matrix in a row-wise manner while keeping all the others fixed. Since all rows of a factor matrix are independent of each other in terms of minimizing the loss function (7), P-Tucker fully exploits multi-core parallelism to update all rows of A^(n). First, all rows are carefully distributed to the threads to achieve a uniform workload among them. After that, all threads update their allocated rows in a fully parallel way; within a single thread, the allocated rows are updated sequentially. Finally, P-Tucker aggregates all updated rows from all threads to update A^(n). P-Tucker iterates this update procedure for all factor matrices one by one.

III-B Gradient-based ALS for Updating Factor Matrices

P-Tucker adopts a gradient-based ALS method to update factor matrices, which concentrates only on the observed entries of a tensor. From a high-level point of view, as most ALS methods do, P-Tucker updates one factor matrix at a time while keeping all the others fixed. However, when all the other matrices are fixed, there are several approaches [24] for updating a single factor matrix. Among them, P-Tucker selects a row-wise update method; a key benefit of the row-wise update is that all rows of a factor matrix are independent of each other in terms of minimizing the loss function (7). This property enables multi-core parallelism when updating factor matrices. Given a row of a factor matrix, P-Tucker updates the row by a gradient-based update rule. To be more specific, the update rule is derived by computing the gradient of the loss function (7) with respect to the given row and setting it to zero. The update rule for the i_n-th row of the nth factor matrix A^(n) (see Figure 4) is given as follows; the proof of Equation (10) is in Theorem 1.

\[ [\mathbf{A}^{(n)}]_{i_n:} \leftarrow \mathbf{c}_{i_n} \big[ \mathbf{B}_{i_n} + \lambda \mathbf{I}_{J_n} \big]^{-1} \qquad (10) \]

where B_{i_n} is a J_n × J_n matrix whose (j_1, j_2)-th entry is

\[ \mathbf{B}_{i_n}(j_1, j_2) = \sum_{\forall \alpha \in \Omega^{(n)}_{i_n}} \delta_{\alpha}(j_1)\, \delta_{\alpha}(j_2), \qquad (11) \]

c_{i_n} is a length-J_n vector whose j-th entry is

\[ \mathbf{c}_{i_n}(j) = \sum_{\forall \alpha \in \Omega^{(n)}_{i_n}} \mathfrak{X}_{\alpha}\, \delta_{\alpha}(j), \qquad (12) \]

and δ_α is a length-J_n vector whose j-th entry is

\[ \delta_{\alpha}(j) = \sum_{\forall \beta = (j_1, \ldots, j_N) \in \mathfrak{G},\; j_n = j} \mathfrak{G}_{j_1 \cdots j_N} \prod_{k \neq n} a^{(k)}_{i_k j_k}, \quad \text{for } \alpha = (i_1, \ldots, i_N). \qquad (13) \]
Fig. 4: An illustration of the update rule for a row of a factor matrix. P-Tucker requires three intermediate data B_{i_n}, c_{i_n}, and δ for updating the i_n-th row of A^(n). Note that λ is a regularization parameter, and I_{J_n} is a J_n × J_n identity matrix.

Ω^(n)_{i_n} indicates the subset of Ω whose nth mode's index is i_n, λ is a regularization parameter, and I_{J_n} is a J_n × J_n identity matrix. As shown in Figure 4, the update rule for the i_n-th row of A^(n) requires three intermediate data: B_{i_n}, c_{i_n}, and δ. Those data are computed from the subset of observable entries Ω^(n)_{i_n}. Thus, the computational cost of updating factor matrices is proportional to the number of observable entries, which lets P-Tucker fully exploit the sparsity of a given tensor. Moreover, P-Tucker predicts the missing values of a tensor using (5), instead of treating them as zeros. Equation (5) is computed from the updated factor matrices and core tensor, which are learned from the observed entries of the tensor. Hence, P-Tucker not only enhances the accuracy of factorizations but also reflects the latent characteristics of the observed entries. Note that the matrix [B_{i_n} + λI_{J_n}] is positive-definite and invertible; a proof of the update rule is summarized in Section III-E1.
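The following is a minimal C++/Armadillo sketch, in the spirit of the row-wise update (10)-(13), for a 3-way tensor; the Entry struct, the per-row list of observed entries, and the use of arma::solve in place of an explicit matrix inverse are our assumptions rather than the authors' implementation.

```cpp
// Sketch: row-wise ALS update of mode 1 for a sparse 3-way tensor in COO format.
// For a fixed row index i, accumulate the J1 x J1 matrix B and the length-J1
// vector c from the observed entries whose first index is i, then solve the
// regularized normal equations.
#include <armadillo>
#include <vector>

struct Entry { arma::uword i, j, k; double val; };   // one observed entry

arma::rowvec update_row_mode1(const std::vector<Entry>& omega_i, // entries with first index = i
                              const arma::cube& G,               // core tensor (J1 x J2 x J3)
                              const arma::mat& A2,               // factor matrix of mode 2
                              const arma::mat& A3,               // factor matrix of mode 3
                              double lambda) {
    const arma::uword J1 = G.n_rows;
    arma::mat B(J1, J1, arma::fill::zeros);
    arma::vec c(J1, arma::fill::zeros);
    for (const Entry& e : omega_i) {
        // delta(j1) = sum_{j2,j3} G(j1,j2,j3) * A2(e.j, j2) * A3(e.k, j3)
        arma::vec delta(J1, arma::fill::zeros);
        for (arma::uword j1 = 0; j1 < J1; ++j1)
            for (arma::uword j2 = 0; j2 < G.n_cols; ++j2)
                for (arma::uword j3 = 0; j3 < G.n_slices; ++j3)
                    delta(j1) += G(j1, j2, j3) * A2(e.j, j2) * A3(e.k, j3);
        B += delta * delta.t();      // accumulate outer products, cf. (11)
        c += e.val * delta;          // accumulate data-weighted sums, cf. (12)
    }
    // Solve (B + lambda*I) a^T = c instead of forming the inverse, cf. (10).
    arma::vec a = arma::solve(B + lambda * arma::eye(J1, J1), c);
    return a.t();
}

int main() {
    arma::arma_rng::set_seed(7);
    arma::cube G(3, 3, 3, arma::fill::randu);
    arma::mat  A2(40, 3, arma::fill::randu), A3(20, 3, arma::fill::randu);
    std::vector<Entry> omega_i = { {0, 5, 2, 0.8}, {0, 17, 9, 0.3} };  // entries of row i = 0
    arma::rowvec a_row = update_row_mode1(omega_i, G, A2, A3, 0.01);
    a_row.print("updated row of A1:");
    return 0;
}
```

Using arma::solve avoids forming the explicit inverse in (10); the result is identical up to numerical error.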

Input: Tensor X ∈ R^{I_1 × ... × I_N}, factor matrices A^(n) (n = 1, ..., N), core tensor G, and cache table M (P-Tucker-Cache only).
Output: Updated factor matrices A^(n) (n = 1, ..., N).
1   if P-Tucker-Cache then                         ▷ Precompute M
2       for α ∈ Ω do                               ▷ In parallel
3           for β ∈ G do
4               compute and store the cached factor product M(α, β)
5   for n = 1, ..., N do                           ▷ nth factor matrix A^(n)
6       for i_n = 1, ..., I_n do                   ▷ i_n-th row, in parallel
7           for α ∈ Ω^(n)_{i_n} do
8               for β ∈ G do                       ▷ Compute δ
9                   if P-Tucker then
10                      accumulate δ with the core entry and the factor entries of all modes except n
11                  if P-Tucker-Cache then
12                      accumulate δ using the cached value M(α, β)
13              calculate B_{i_n} and c_{i_n} using (11) and (12)
14          find the inverse matrix of [B_{i_n} + λI_{J_n}]
15          update [A^(n)]_{i_n:} using (10)
16      if P-Tucker-Cache then                     ▷ Update M
17          for α ∈ Ω do                           ▷ In parallel
18              for β ∈ G do
19                  recalculate M(α, β) with the updated A^(n)
Algorithm 3 P-Tucker for Updating Factor Matrices

Algorithm 3 describes how P-Tucker updates factor matrices. First, in the case of P-Tucker-Cache (lines 1-4), it computes the values of all entries in the cache table M, which caches intermediate multiplication results generated while updating factor matrices; this memoization technique is what makes P-Tucker-Cache time-efficient. Next, P-Tucker chooses a row of a factor matrix to update (lines 5-6). After that, P-Tucker computes B_{i_n} and c_{i_n}, which are required for updating the row (lines 7-13). P-Tucker then inverts [B_{i_n} + λI_{J_n}] (line 14) and updates the row by multiplying c_{i_n} with the inverse (line 15). In the case of P-Tucker-Cache, M is recalculated using the existing cached values and the updated A^(n) (lines 16-19) whenever A^(n) is updated. Note that α and β indicate an entry of X and an entry of G, respectively.

III-C Variants: P-Tucker-Cache and P-Tucker-Approx

As discussed in Section III-B, P-Tucker requires three intermediate data, B_{i_n}, c_{i_n}, and δ, whose memory requirement is only O(J_n^2) per thread. Considering the memory complexity of the naive Tucker-ALS, which needs O(I_n ∏_{k≠n} J_k) for the intermediate matrix Y_(n) alone, P-Tucker successfully provides a memory-optimized algorithm. We can further optimize P-Tucker in terms of time by a caching algorithm (P-Tucker-Cache) and an approximation algorithm (P-Tucker-Approx).

The crucial difference between P-Tucker and P-Tucker-Cache lies in the computation of the intermediate vector δ (lines 9-12 in Algorithm 3). In the case of P-Tucker, updating δ requires O(N) multiplications for a given pair of an observed entry α and a core entry β (line 10). However, if we cache the results of those multiplications for all such pairs, the update takes only O(1) (line 12). This trade-off distinguishes P-Tucker-Cache from P-Tucker: P-Tucker-Cache accelerates the intermediate calculations by memoization with the cache table M, while P-Tucker requires only the small vectors c_{i_n} and δ (of length J_n) and the small matrix B_{i_n} (of size J_n × J_n) as intermediate data. Note that when the corresponding value is 0 (lines 12 and 19), P-Tucker-Cache conducts the multiplications directly, as P-Tucker does (line 10).
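The sketch below illustrates one plausible realization of such a cache (our reading of lines 10-19, not the authors' code): the cached value for an (observed entry, core entry) pair is the product of the factor values over all modes, and the contribution needed for mode n is recovered by dividing that mode's factor back out, with a direct fallback when it is zero.

```cpp
// Hedged sketch of a memoized factor product for one (entry, core index) pair.
#include <armadillo>
#include <cstdio>
#include <vector>

struct Entry { std::vector<arma::uword> idx; double val; };  // N-way COO entry

// Cached product over all modes for one (entry, core index) pair.
double cached_product(const Entry& e,
                      const std::vector<arma::mat>& A,        // factor matrices
                      const std::vector<arma::uword>& j) {    // core index (j_1..j_N)
    double prod = 1.0;
    for (size_t k = 0; k < A.size(); ++k) prod *= A[k](e.idx[k], j[k]);
    return prod;
}

// Product over all modes except n, reusing the cached value when possible.
double product_except(double cached, const Entry& e, arma::uword n,
                      const std::vector<arma::mat>& A,
                      const std::vector<arma::uword>& j) {
    double a_n = A[n](e.idx[n], j[n]);
    if (a_n != 0.0) return cached / a_n;        // O(1) via the cache
    double prod = 1.0;                          // fallback: direct O(N) product
    for (size_t k = 0; k < A.size(); ++k)
        if (k != n) prod *= A[k](e.idx[k], j[k]);
    return prod;
}

int main() {
    std::vector<arma::mat> A = { arma::mat(4, 2, arma::fill::randu),
                                 arma::mat(5, 2, arma::fill::randu),
                                 arma::mat(6, 2, arma::fill::randu) };
    Entry e{ {1, 3, 2}, 0.7 };
    std::vector<arma::uword> j = {0, 1, 1};
    double cached = cached_product(e, A, j);
    double d = product_except(cached, e, 1, A, j);   // contribution for mode n = 1
    std::printf("delta contribution: %f\n", d);
    return 0;
}
```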

Input: Tensor X, factor matrices A^(n), core tensor G, and truncation rate γ.
Output: Truncated core tensor G.
1  for β ∈ G do
2      compute the partial reconstruction error p_β by (14)
3  sort the p_β values in descending order together with their indices
4  remove the entries β of G whose p_β values are ranked within the top γ among all values.
Algorithm 4 P-Tucker-Approx

The main intuition behind P-Tucker-Approx is that there exist "noisy" entries in the core tensor G, and we can accelerate the update process by truncating these "noisy" entries of G. How, then, can we determine whether an entry of G is "noisy" or not? A naive approach would be to treat an entry with a small value as "noisy", as in the truncated SVD [27]. However, small-value entries are not always negligible, since their contribution to minimizing the error (6) can be larger than that of large-value ones. Hence, we propose a more precise criterion which regards an entry with a high partial reconstruction error as "noisy". The partial reconstruction error p_β of a core entry β is the portion of (6) produced by β, derived from the sum of the terms in (6) that involve β. Given an entry β, p_β is given as follows:

(14)

Note that auxiliary symbols are used to simplify the equation. p_β suggests a more precise guideline for identifying "noisy" entries, since p_β is a part of (6), while the naive approach estimates the error based only on the magnitude of β. Figure 5 illustrates the distribution of p_β and the cumulative relative reconstruction error on the latest MovieLens dataset. As expected by our intuition, only 20% of the entries of G generate about 80% of the total reconstruction error. Algorithm 4 describes how P-Tucker-Approx truncates the "noisy" entries of G. It first computes p_β for all entries of G (lines 1-2) and sorts them in descending order together with their indices (line 3). Finally, it truncates the top-γ "noisy" entries of G (line 4). P-Tucker-Approx performs Algorithm 4 at every iteration (lines 2-7 in Algorithm 2), which reduces the number of non-zeros in G step by step. Therefore, the elapsed time per iteration also decreases, since the time complexity of P-Tucker-Approx depends on the number of non-zeros in G. Note that we can find an optimal approximation point whose speed-up over accuracy loss is maximized (see Figure 9).
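A minimal C++ sketch of the truncation step in Algorithm 4 is shown below (our own illustration); the per-entry scores are assumed to be the partial reconstruction errors p_β already computed by (14), and gamma is the truncation rate.

```cpp
// Sketch: zero out the top-gamma fraction of "noisy" core entries by score.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

void truncate_core(std::vector<double>& core_values,      // flattened core tensor G
                   const std::vector<double>& score,      // partial errors p_beta
                   double gamma) {                        // truncation rate in [0, 1]
    std::vector<std::size_t> order(core_values.size());
    std::iota(order.begin(), order.end(), 0);
    // Sort core indices by partial reconstruction error, largest ("noisiest") first.
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return score[a] > score[b]; });
    std::size_t n_remove = static_cast<std::size_t>(gamma * core_values.size());
    for (std::size_t r = 0; r < n_remove; ++r)
        core_values[order[r]] = 0.0;                       // remove the entry from G
}

int main() {
    std::vector<double> core  = {0.9, 0.1, 0.5, 0.3};
    std::vector<double> score = {0.8, 0.05, 0.02, 0.4};    // hypothetical p_beta values
    truncate_core(core, score, 0.5);                       // zero out the 2 noisiest entries
    for (double g : core) std::printf("%.2f ", g);
    std::printf("\n");
    return 0;
}
```

With gamma = 0.5, the two highest-scoring core entries are zeroed out in the usage example.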

Fig. 5: Distribution of the partial reconstruction error p_β and the cumulative relative reconstruction error produced by the entries of a core tensor G. Note that the 20% "noisy" entries of G generate 80% of the total reconstruction error.

With the above optimizations, P-Tucker becomes the most time- and memory-efficient method from both theoretical and experimental perspectives (see Table III).

III-D Careful Distribution of Work

There are three sections where multi-core parallelization is applicable in Algorithms 2 and 3. The first section (lines 2-4 and 17-19 in Algorithm 3) is where P-Tucker-Cache computes and updates the cache table M. The second section (lines 6-15 in Algorithm 3) is for updating factor matrices, and the last section (line 4 in Algorithm 2) is for measuring the reconstruction error. For each section, P-Tucker carefully distributes tasks to threads while maintaining the independence between them. Furthermore, P-Tucker utilizes a dynamic scheduling method [28] to ensure that each thread has a balanced workload. The details of how P-Tucker parallelizes each section are given below, followed by a short scheduling sketch; note that T indicates the number of threads used for parallelization.

  • Section 1: Computing and Updating the Cache Table (Only for P-Tucker-Cache). All rows of M are independent of each other when they are computed or updated. Thus, P-Tucker distributes all rows equally over the T threads, and each thread computes or updates its allocated rows independently using static scheduling.

  • Section 2: Updating Factor Matrices. All rows of A^(n) are independent of each other with regard to minimizing the loss function (7). Therefore, P-Tucker distributes all rows uniformly to the threads and updates them in parallel. Since |Ω^(n)_{i_n}| differs for each row, the workload of each thread may vary considerably; thus, P-Tucker employs dynamic scheduling in this part.

  • Section 3: Calculating the Reconstruction Error. All observable entries are independent of each other in measuring the reconstruction error. Thus, P-Tucker distributes them evenly over the T threads, and each thread computes its partial error separately using static scheduling. At the end, P-Tucker aggregates the partial errors from all threads.
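The following minimal OpenMP sketch (our own illustration, not the authors' code) contrasts the dynamic schedule used for row updates, whose per-row cost varies with |Ω^(n)_{i_n}|, with the static schedule and reduction used for the error computation.

```cpp
// Sketch: dynamic scheduling for uneven per-row work, static scheduling plus a
// reduction for the error sum.
#include <omp.h>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int I1 = 1000;                     // number of rows of one factor matrix (assumed)
    std::vector<double> row_error(I1, 0.0);

    // Section 2: update the rows of A^(1) in parallel; the work per row varies,
    // so a dynamic schedule keeps the threads balanced.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < I1; ++i) {
        // A real implementation would call the row update here (see the earlier
        // sketch); we only record a dummy per-row quantity for illustration.
        row_error[i] = std::sin(i) * std::sin(i);
    }

    // Section 3: accumulate the reconstruction error; every entry costs the same,
    // so a static schedule with a reduction suffices.
    double error = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:error)
    for (int i = 0; i < I1; ++i)
        error += row_error[i];

    std::printf("dummy error %.3f computed with up to %d threads\n",
                std::sqrt(error), omp_get_max_threads());
    return 0;
}
```

schedule(dynamic) hands out rows as threads finish, which matches the uneven per-row workloads; schedule(static) splits the evenly sized error computation up front.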

III-E Theoretical Analysis

III-E1 Convergence Analysis

In this section, we theoretically prove the correctness and the convergence of P-Tucker.

Theorem 1 (Correctness of P-Tucker)

The proposed row-wise update rule (15) minimizes the loss function (7) with respect to the updated parameters.

\[ [\mathbf{A}^{(n)}]_{i_n:} \leftarrow \mathbf{c}_{i_n} \big[ \mathbf{B}_{i_n} + \lambda \mathbf{I}_{J_n} \big]^{-1} \qquad (15) \]
Proof 1

Note that the full proof of Theorem 1 is in the supplementary material of P-Tucker [29].

Theorem 2 (Convergence of P-Tucker)

P-Tucker converges since (7) is bounded and decreases monotonically.

Proof 2

According to Theorem 1, the loss function (7) never increases, since every update in P-Tucker minimizes it with respect to the updated parameters; moreover, (7) is bounded below by 0. Thus, P-Tucker converges.

Algorithm | Time Complexity (per iteration) | Memory Complexity
P-Tucker | O(N^2 |Ω| J^N + N I J^3) | O(T J^2)
P-Tucker-Cache | O(N |Ω| J^N + N I J^3) | O(|Ω| J^N)
P-Tucker-Approx | O(N^2 |Ω| nnz(G) + N I J^3), where nnz(G) is the number of non-zeros in the truncated core | (see [29])
Tucker-wOpt [18] | |
Tucker-CSF [20] | |
[17] | |
TABLE III: Complexity analysis of P-Tucker and other methods with respect to time and memory. P-Tucker and its variants exhibit the best time and memory complexity among all methods. Note that memory complexity indicates the space requirement for intermediate data.

III-E2 Complexity Analysis

In this section, we analyze the time and memory complexities of P-Tucker and its variants. For simplicity, we assume I_1 = ... = I_N = I and J_1 = ... = J_N = J. Table III summarizes the time and memory complexities of P-Tucker and other methods. As expected from Section III-C, P-Tucker presents the best memory complexity among all algorithms. While P-Tucker-Cache shows a better time complexity than P-Tucker, P-Tucker-Approx exhibits the best time complexity thanks to the reduced number of non-zeros in G. Note that we calculate time complexities per iteration (lines 2-7 in Algorithm 2), and we focus on the memory complexities of the intermediate data, not of all variables.

Theorem 3 (Time complexity of P-Tucker)

The time complexity of P-Tucker is O(N^2 |Ω| J^N + N I J^3) per iteration.

Proof 3

Given the i_n-th row of A^(n) (lines 5-6 in Algorithm 3), computing δ (line 10) takes O(N J^N) per observed entry, since each of the J^N core entries contributes O(N) multiplications. Updating B_{i_n} and c_{i_n} (line 13) takes O(J^2) per entry since δ is already calculated. Inverting [B_{i_n} + λI_{J_n}] (line 14) takes O(J^3), and updating a row (line 15) takes O(J^2). Thus, the time complexity of updating the i_n-th row of A^(n) (lines 7-15) is O(|Ω^(n)_{i_n}| N J^N + J^3). Iterating over all rows of A^(n) takes O(N |Ω| J^N + I J^3), and updating all N factor matrices takes O(N^2 |Ω| J^N + N I J^3). According to (6), measuring the reconstruction error (line 4 in Algorithm 2) takes O(N |Ω| J^N). Thus, the time complexity of P-Tucker is O(N^2 |Ω| J^N + N I J^3).

Theorem 4 (Memory complexity of P-Tucker)

The memory complexity of P-Tucker is O(T J^2).

Proof 4

The intermediate data of P-Tucker consist of two vectors, c_{i_n} and δ (of length J), and two matrices, B_{i_n} and [B_{i_n} + λI]^{-1} (of size J × J). The memory for those variables is released after the i_n-th row of A^(n) is updated; thus, it does not accumulate over iterations. Since each thread keeps its own intermediate data, the total memory complexity of P-Tucker is O(T J^2).

Theorem 5 (Time complexity of P-Tucker-Cache)

The time complexity of P-Tucker-Cache is O(N |Ω| J^N + N I J^3) per iteration.

Proof 5

In Algorithm 3, computing δ (line 12) takes only O(J^N) per observed entry thanks to the caching method. Precomputing and updating M (lines 2-4 and 17-19) also take O(N |Ω| J^N) in total. Since all other parts of P-Tucker-Cache are equal to those of P-Tucker, the time complexity of P-Tucker-Cache is O(N |Ω| J^N + N I J^3).

Theorem 6 (Memory complexity of P-Tucker-Cache)

The memory complexity of P-Tucker-Cache is O(|Ω| J^N).

Proof 6

The cache table M requires O(|Ω| J^N) memory space, which is much larger than that of the other intermediate data (see Theorem 4). Thus, the memory complexity of P-Tucker-Cache is O(|Ω| J^N).

Theorem 7 (Time complexity of P-Tucker-Approx)

The time complexity of P-Tucker-Approx is O(N^2 |Ω| nnz(G) + N I J^3) per iteration, where nnz(G) is the number of non-zeros remaining in the truncated core tensor G.

Proof 7

Refer to the supplementary material [29].

Theorem 8 (Memory complexity of P-Tucker-Approx)

The memory complexity of P-Tucker-Approx is .

Proof 8

Refer to the supplementary material [29].

IV Experiments

In this section, we present experimental results of P-Tucker and other methods. We focus on answering the following questions.

  1. Data Scalability (Section IV-B). How well do P-Tucker and competitors scale up with respect to the following aspects of a given tensor: 1) the order, 2) the dimensionality, 3) the number of observable entries, and 4) the rank?

  2. Effectiveness of P-Tucker-Cache and P-Tucker-Approx (Section IV-C). How successfully do P-Tucker-Cache and P-Tucker-Approx suggest the trade-offs between time-memory and time-accuracy, respectively?

  3. Parallelization Scalability (Section IV-D). How well does P-Tucker scale up with respect to the number of threads used for parallelization?

  4. Real-World Accuracy (Section IV-E). How accurately do P-Tucker and other methods factorize real-world tensors and predict their missing entries?

We describe the datasets and experimental settings in Section IV-A, and answer the questions in Sections IV-B through IV-E.

Name | Order | Dimensionality | |Ω| | Rank
Yahoo-music | 4 | (1M, 625K, 133, 24) | 252M | 10
MovieLens | 4 | (138K, 27K, 21, 24) | 20M | 10
Video (Wave) | 4 | (112, 160, 3, 32) | 160K | 3
Image (Lena) | 3 | (256, 256, 3) | 20K | 3
Synthetic | 3-10 | 100-10M | 100M | 3-11
TABLE IV: Summary of real-world and synthetic tensors used for experiments. M: million, K: thousand.

IV-A Experimental Settings

Datasets

We use both real-world and synthetic tensors to evaluate P-Tucker and competitors. Table IV summarizes the tensors used in our experiments, which are available at https://datalab.snu.ac.kr/ptucker/. For real-world tensors, we use Yahoo-music, MovieLens, a sea-wave video, and the 'Lena' image. Yahoo-music is music rating data of the form (user, music, year-month, hour; rating). MovieLens is movie rating data of the form (user, movie, year, hour; rating). The sea-wave video and 'Lena' image tensors are 10%-sampled from the original data. Note that we normalize all values of the real-world tensors to numbers between 0 and 1. We use 90% of the observed entries as training data and the remaining 10% as test data for measuring the accuracy of P-Tucker and competitors. For synthetic tensors, we create random tensors, which we describe in Section IV-B.

Competitors

We compare P-Tucker and its variants with three state-of-the-art Tucker factorization (TF) methods. Descriptions of all methods are given as follows:

  • P-Tucker (default): the proposed method which minimizes intermediate data by a gradient-based update rule, used by default throughout all experiments.

  • P-Tucker-Cache: the time-optimized variant of P-Tucker, which caches intermediate multiplications to update factor matrices efficiently.

  • P-Tucker-Approx: the time-optimized variant of P-Tucker, which shows a trade-off between time and accuracy by truncating “noisy” entries of a core tensor.

  • Tucker-wOpt [18]: the accuracy-focused TF method utilizing a nonlinear conjugate gradient algorithm for updating factor matrices and a core tensor.

  • Tucker-CSF [20]: the speed-focused TF algorithm which accelerates a tensor-times-matrix chain (TTMc) by a compressed sparse fiber (CSF) structure.

  •  [17]: the TF method designed for large-scale tensors, which avoids intermediate data explosion [16] by on-the-fly computation.

Note that other TF methods (e.g., [19, 30]) are excluded since they exhibit similar or more limited scalability than the competitors mentioned above, and some factorization models (e.g., [31, 32]) that are not directly applicable to tensors are not considered either.

(a) Tensor order.
(b) Tensor dimensionality.
(c) Number of observable entries.
(d) Tensor rank.
Fig. 6: The scalability of P-Tucker and competitors for large-scale synthetic tensors. O.O.M.: out of memory. P-Tucker exhibits 7.1-14.1x speed-up compared to the state-of-the-art with respect to all aspects. Notice that Tucker-wOpt presents O.O.M. in most cases due to its limited scalability, and that P-Tucker indicates the default memory-optimized version, not P-Tucker-Cache or P-Tucker-Approx.

Environment

P-Tucker is implemented in C with OpenMP and Armadillo libraries utilized for parallelization and linear algebra operations, and the source code of P-Tucker is publicly available at https://datalab.snu.ac.