Block-Randomized Stochastic Proximal Gradient for Low-Rank Tensor Factorization


Xiao Fu, Cheng Cao, Hoi-To Wai, and Kejun Huang
 
School of Electrical Engineering and Computer Science
Oregon State University
Department of Systems Engineering and Engineering Management
The Chinese University of Hong Kong
Department of Computer and Information Science and Engineering
University of Florida

 
July 20, 2019
Abstract

This work considers the problem of computing the canonical polyadic decomposition (CPD) of large tensors. Prior works mostly leverage data sparsity to handle this problem, which is not suitable for the dense tensors that often arise in applications such as medical imaging, computer vision, and remote sensing. Stochastic optimization is known for its low memory cost and low per-iteration complexity when handling dense data. However, it is hard for existing stochastic CPD algorithms to incorporate the variety of constraints and regularizations that are of interest in signal and data analytics, and the convergence properties of many such algorithms are also unclear. In this work, we propose a stochastic optimization framework for large-scale CPD with constraints/regularizations. The framework works in a doubly randomized fashion, and can be regarded as a judicious combination of randomized block coordinate descent (BCD) and stochastic proximal gradient (SPG). The algorithm enjoys lightweight updates and a small memory footprint, and thus scales well. In addition, the framework entails considerable flexibility—many frequently used regularizers and constraints can be readily handled under the proposed scheme. The approach is also supported by convergence analysis. Numerical results on large-scale dense tensors are employed to showcase the effectiveness of the proposed approach.

1 Introduction

Canonical polyadic decomposition (CPD) [previously known as parallel factor analysis (PARAFAC)] [1, 2, 3] is arguably the most popular low-rank tensor decomposition model. CPD has successfully found many applications in various fields, such as analytical chemistry [4], social network mining [5], hyperspectral imaging [6], topic modeling [7], and time series analysis [8]; also see [9, 10, 11] for more classic applications in communications.

Computing the CPD of a tensor, however, is a quite challenging optimization problem [12]. Many algorithms have been proposed through the years [13, 14, 15, 3]. To keep pace with the ever growing volume of available data, one pressing challenge is to compute CPD at scale. The classic alternating least squares (ALS) algorithm [3] has an elegant algorithmic structure, but also suffers from a number of numerical issues [16, 17] and is hardly scalable. In recent years, many new CPD algorithms have appeared, triggered by the advances in big data analytics and first-order optimization [18, 19, 14, 13]. Many of these algorithms leverage data sparsity to scale up CPD—by cleverly using the zero elements in huge tensors, computationally costly key operations in ALS (e.g., the matricized tensor times Khatri-Rao product (MTTKRP) operation) can be significantly simplified. Consequently, the classic ALS algorithm can be modified to handle CPD of huge and sparse tensors.

However, when the tensor to be factored is dense—i.e., when most entries of the tensor are nonzero—the sparsity-enabled efficient algorithms [13, 14, 18, 19, 6] are no longer applicable. Note that large and dense tensors arise in many timely and important applications such as medical imaging [20], hyperspectral imaging [6], and computer vision [21]. In fact, since big dense tensors typically cost a lot of memory (e.g., a dense tensor with a size of occupies 57.52GB memory if saved as double-precision numbers), it is even hard to load them into the RAM of laptops, desktops, or servers. This also raises serious challenges in the era of Internet of Things (IoT)—where edge computing on small devices is usually preferable.

Stochastic approximation is a powerful tool for handling optimization problems involving dense data, as it is known for its low per-iteration memory and computational complexities [22]. A number of stochastic optimization based CPD algorithms have been proposed in the literature [23, 24, 25]. Specifically, the algorithms in [23, 24] work in an iterative manner: in each iteration, a random subset of the tensor entries is sampled, and the corresponding parts of the latent factors are updated using the sampled data. These algorithms have proven quite effective in practice, and admit distributed implementations [24]. The challenge here is that every tensor entry only contains information about one particular row of each latent factor, so updating the latent factors in their entirety may require many iterations. This may lead to slow improvement of the latent factor estimation accuracy. More importantly, this update strategy loses the opportunity to incorporate constraints/regularizations on the whole latent factors, since the sampled entries only contain partial information about them. This is undesired in practice, since prior information on the latent factors is critical for enhancing performance, especially in noisy cases.

Recently, a stochastic algorithm that updates one entire latent factor in every iteration was proposed in [25]. Instead of sampling tensor entries, the algorithm works by sampling tensor fibers that contain information about a whole latent factor. However, this algorithm requires sampling at least as many fibers as the tensor rank in every update, which in some cases gives rise to a much higher per-iteration complexity relative to the algorithms in [23, 24]. In addition, like those in [23, 24], the algorithm in [25] cannot handle constraints or regularizations on the latent factors. Furthermore, although they work well empirically, the convergence properties of many stochastic CPD algorithms, such as those in [23, 25], are unclear.

Contributions In this work, we propose a new stochastic algorithmic framework for computing the CPD of large-scale dense tensors. Specifically, our contributions include:

A Doubly Randomized Computational Framework for Large-Scale CPD. Our first contribution lies in proposing an efficient and flexible computational framework for the CPD of large dense tensors. Our method is a judicious combination of randomized block coordinate descent (BCD) [26, 27] and stochastic proximal gradient (SPG) [28, 29]. Specifically, in each iteration, our method first samples a mode from all modes of the tensor. Then, the algorithm samples some fibers of this mode and updates the corresponding latent factor via stochastic proximal operations. Such a combination exhibits an array of attractive features: it admits much smaller per-iteration memory and computational complexities relative to the existing fiber sampling based method in [25]. More importantly, it is very flexible in terms of incorporating regularizations and constraints on the latent factors.

Rigorous Convergence Analysis. Both BCD and SPG are well-studied topics in the optimization literature [30, 27, 26]. However, the convergence properties of the proposed framework are not immediately clear, due to the nonconvex nature of CPD. The existing block-randomized SGD (BR-SGD) framework [31] only considers convex optimization. A related work in [32] deals with nonconvex problems via block stochastic gradient, but its Gauss-Seidel type BCD strategy (without block randomization) makes the convergence analysis inapplicable to our case. Hence, we offer tailored convergence analyses for our proposed CPD algorithms.

Implementation-Friendly Adaptive Stepsize Scheduling. In practice, one of the most challenging aspects of stochastic optimization is selecting a proper stepsize schedule. To make the proposed algorithms friendly for practitioners to use, we propose a practical and adaptive stepsize schedule that is based on the celebrated Adagrad algorithm [33]. Adagrad is an adaptive stepsize selection method that was devised for single-block gradient descent. Nonetheless, we find through extensive simulations that it largely helps reduce the agonizing pain of stepsize tuning when implementing our multi-block algorithm for CPD. In addition, we also show that the adaptive stepsize-based algorithm converges to a stationary point almost surely under some conditions.

A quick demonstration of the effectiveness of the proposed algorithms is shown in Fig. 1, where the average mean squared error (MSE) of the estimated latent factors [cf. Eq. (21)] is plotted against the number of MTTKRPs computed (which dominates the complexity). One can see that the proposed algorithms largely outperform a couple of state-of-the-art algorithms for constrained CPD. More thorough numerical results can be seen in Sec. 6.

Part of the work was submitted to ICASSP 2019 [34]. In this new version, we have additionally included detailed convergence proofs and the new adaptive stepsize based algorithm. More extensive simulations and real-data experiments are also included.

Figure 1: The proposed algorithms (AdaCPD and BrasCPD) exhibit low complexity for achieving high accuracy of the estimated latent factors. The tensor under test has rank 10, and the latent factors are constrained to be nonnegative. The baselines are two state-of-the-art constrained CPD algorithms, AO-ADMM [13] and APG [14].

Notation. We follow the established conventions in signal processing. $x$, $\boldsymbol{x}$, $\boldsymbol{X}$, and $\underline{\boldsymbol{X}}$ denote a scalar, vector, matrix, and tensor, respectively; $\|\cdot\|$ denotes the Euclidean norm of a vector and the Frobenius norm of a matrix or tensor; $\circ$, $\odot$, and $\circledast$ denote the outer product, Khatri-Rao product, and Hadamard product, respectively; $\mathrm{vec}(\boldsymbol{X})$ denotes the vectorization operator that concatenates the columns of $\boldsymbol{X}$; $\boldsymbol{X}\geq\boldsymbol{0}$ means that all the entries of $\boldsymbol{X}$ are nonnegative; $^{\top}$ and $^{\dagger}$ denote the transpose and pseudo-inverse, respectively; $|\mathcal{S}|$ denotes the cardinality of a set $\mathcal{S}$; $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue of a matrix.

2 Background

We first introduce some notions that are heavily used in tensor algebra.

2.1 Tensors and CPD

An $N$th-order tensor $\underline{\boldsymbol{X}}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ is an array whose entries are indexed by $N$ coordinates; i.e., $\underline{\boldsymbol{X}}(i_1,\ldots,i_N)$ denotes an element of the tensor. Like matrices, tensors can be represented as a sum of rank-one components:

$$\underline{\boldsymbol{X}} = \sum_{f=1}^{F} \boldsymbol{A}^{(1)}(:,f)\circ\boldsymbol{A}^{(2)}(:,f)\circ\cdots\circ\boldsymbol{A}^{(N)}(:,f), \qquad (1)$$

where "$\circ$" denotes the outer product of vectors, and $\boldsymbol{A}^{(n)}\in\mathbb{R}^{I_n\times F}$ is an $I_n\times F$ matrix that is often referred to as the mode-$n$ latent factor. When $F$ is the minimal integer that satisfies the expression in (1), the right-hand side of (1) is called the canonical polyadic decomposition of the tensor $\underline{\boldsymbol{X}}$. At the entry level, the CPD can be expressed as

$$\underline{\boldsymbol{X}}(i_1,\ldots,i_N) = \sum_{f=1}^{F}\prod_{n=1}^{N}\boldsymbol{A}^{(n)}(i_n,f), \qquad (2)$$

for $i_n\in\{1,\ldots,I_n\}$ and $n=1,\ldots,N$. The CPD of a tensor is essentially unique under mild conditions (meaning that the latent factors $\boldsymbol{A}^{(n)}$'s that constitute the data are unique up to some trivial ambiguities like column permutations and scalings [2]). In practice, the CPD of a tensor is often obtained by solving a certain optimization criterion:

$$\min_{\{\boldsymbol{A}^{(n)}\}_{n=1}^{N}} \; f\left(\boldsymbol{A}^{(1)},\ldots,\boldsymbol{A}^{(N)}\right). \qquad (3)$$

One of the most commonly used optimization surrogates for CPD in the literature is the least squares (LS) fitting criterion [3, 13, 14], i.e., $f\left(\boldsymbol{A}^{(1)},\ldots,\boldsymbol{A}^{(N)}\right) = \left\|\underline{\boldsymbol{X}} - \sum_{f=1}^{F}\boldsymbol{A}^{(1)}(:,f)\circ\cdots\circ\boldsymbol{A}^{(N)}(:,f)\right\|_F^2$.

In the sequel, we will often use the shorthand notation $\boldsymbol{\theta}$ to denote the collection of all the latent factors, i.e., $\boldsymbol{\theta} = \left(\boldsymbol{A}^{(1)},\ldots,\boldsymbol{A}^{(N)}\right)$, and write $f(\boldsymbol{\theta})$ for the fitting cost.

Note that many other criteria have also been considered, e.g., the Kullback-Leibler (KL) divergence [35] and robust fitting [36, 37] criteria—which serve different purposes. Nevertheless, the LS fitting criterion is arguably the most popular one.
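As a concrete illustration of the model in (1)-(2), the short NumPy sketch below assembles a dense third-order tensor from its latent factors by summing rank-one terms. The function and variable names are illustrative, not part of any existing toolbox.

```python
import numpy as np

def cpd_full(factors):
    """Assemble a dense tensor from CPD factors A^(1), ..., A^(N).

    Each factors[n] has shape (I_n, F); the result has shape (I_1, ..., I_N)
    and equals sum_f A^(1)(:,f) o A^(2)(:,f) o ... o A^(N)(:,f), cf. (1).
    """
    F = factors[0].shape[1]
    shape = tuple(A.shape[0] for A in factors)
    X = np.zeros(shape)
    for f in range(F):
        rank1 = factors[0][:, f]
        for A in factors[1:]:
            rank1 = np.multiply.outer(rank1, A[:, f])  # outer product builds one rank-1 term
        X += rank1
    return X

# Example: a 30 x 40 x 50 tensor of rank F = 5
rng = np.random.default_rng(0)
factors = [rng.uniform(size=(I, 5)) for I in (30, 40, 50)]
X = cpd_full(factors)
print(X.shape)  # (30, 40, 50)
```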

2.2 Unfolding, ALS and MTTKRP

The matricization operation, or matrix unfolding of a tensor, has proven very useful in designing tensor factorization algorithms. The mode-$n$ unfolding of a tensor $\underline{\boldsymbol{X}}\in\mathbb{R}^{I_1\times\cdots\times I_N}$ is a matrix $\boldsymbol{X}^{(n)}\in\mathbb{R}^{J_n\times I_n}$ with $J_n=\prod_{k\neq n}I_k$, whose rows are the mode-$n$ fibers of the tensor; the entry $\underline{\boldsymbol{X}}(i_1,\ldots,i_N)$ is mapped to $\boldsymbol{X}^{(n)}(j,i_n)$ with $j = 1 + \sum_{k=1,\,k\neq n}^{N}(i_k-1)\prod_{m=1,\,m\neq n}^{k-1}I_m$ [1]. The CPD representation in Eq. (1) can then be expressed as

$$\boldsymbol{X}^{(n)} = \boldsymbol{H}^{(n)}\left(\boldsymbol{A}^{(n)}\right)^{\top}, \qquad (4)$$

where the matrix $\boldsymbol{H}^{(n)}\in\mathbb{R}^{J_n\times F}$ is defined as $\boldsymbol{H}^{(n)} = \boldsymbol{A}^{(N)}\odot\cdots\odot\boldsymbol{A}^{(n+1)}\odot\boldsymbol{A}^{(n-1)}\odot\cdots\odot\boldsymbol{A}^{(1)}$, with $\odot$ denoting the Khatri-Rao (columnwise Kronecker) product.
The elegant form of the unfoldings has enabled the famous alternating least squares (ALS) algorithm [3] for handling Problem (3) with the LS objective. Specifically, ALS solves the following subproblem cyclically for $n=1,\ldots,N$:

$$\boldsymbol{A}^{(n)} \leftarrow \arg\min_{\boldsymbol{A}} \; \left\|\boldsymbol{X}^{(n)} - \boldsymbol{H}^{(n)}\boldsymbol{A}^{\top}\right\|_F^2. \qquad (5)$$

Problem (5) is nothing but a least squares problem that admits the closed-form solution $\boldsymbol{A}^{(n)} \leftarrow \left(\boldsymbol{X}^{(n)}\right)^{\top}\boldsymbol{H}^{(n)}\left(\left(\boldsymbol{H}^{(n)}\right)^{\top}\boldsymbol{H}^{(n)}\right)^{-1}$ if $\left(\boldsymbol{H}^{(n)}\right)^{\top}\boldsymbol{H}^{(n)}$ is invertible. Note that $\left(\boldsymbol{H}^{(n)}\right)^{\top}\boldsymbol{H}^{(n)}$ is not difficult to compute by exploiting the Khatri-Rao structure of $\boldsymbol{H}^{(n)}$ [13, 2, 1], since it equals the Hadamard product of the small $F\times F$ Gram matrices $\left(\boldsymbol{A}^{(k)}\right)^{\top}\boldsymbol{A}^{(k)}$ for $k\neq n$. However, when the problem dimension is large (which often happens in applications such as medical imaging, remote sensing, and computer vision), solving the seemingly simple problem in (5) can be computationally prohibitive. The reason is that both $\boldsymbol{X}^{(n)}$ and $\boldsymbol{H}^{(n)}$ can be very large matrices. In particular, the so-called matricized tensor times Khatri-Rao product (MTTKRP) operation, i.e., $\left(\boldsymbol{X}^{(n)}\right)^{\top}\boldsymbol{H}^{(n)}$, that happens in every iteration of ALS costs $\mathcal{O}\left(F\prod_{k=1}^{N}I_k\right)$ flops. This is quite costly even if the $I_n$'s are only moderately large. Many works have considered fast algorithms for computing the MTTKRP, but these methods are mainly enabled by judiciously exploiting sparsity of the tensor data [38, 18].
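To make the cost structure tangible, the following NumPy sketch computes the mode-1 MTTKRP of a third-order tensor directly from its entry-wise definition and uses it for one unregularized ALS update of the mode-1 factor, cf. (5). The einsum-based MTTKRP avoids committing to a particular unfolding convention; all names are illustrative.

```python
import numpy as np

def mttkrp_mode1(X, A2, A3):
    """Mode-1 MTTKRP: M[i1, f] = sum_{i2, i3} X[i1, i2, i3] A2[i2, f] A3[i3, f].

    This is the O(F * I1 * I2 * I3) operation that dominates one ALS sweep.
    """
    return np.einsum('ijk,jf,kf->if', X, A2, A3)

def als_update_mode1(X, A2, A3, eps=1e-12):
    """One unregularized ALS update of A^(1), cf. Eq. (5): the expensive MTTKRP
    plus the cheap F x F Gram matrix (A2^T A2) * (A3^T A3) (Hadamard product)."""
    M = mttkrp_mode1(X, A2, A3)                      # I1 x F  (expensive part)
    G = (A2.T @ A2) * (A3.T @ A3)                    # F x F   (cheap part)
    return M @ np.linalg.pinv(G + eps * np.eye(G.shape[1]))

rng = np.random.default_rng(1)
I1, I2, I3, F = 40, 50, 60, 5
A1, A2, A3 = (rng.uniform(size=(I, F)) for I in (I1, I2, I3))
X = np.einsum('if,jf,kf->ijk', A1, A2, A3)           # ground-truth rank-F tensor
print(np.linalg.norm(als_update_mode1(X, A2, A3) - A1))  # ~0 when A2, A3 are exact
```

The einsum line is the bottleneck, whereas the $F\times F$ Gram matrix is cheap to form; this asymmetry is exactly what the stochastic schemes discussed next exploit.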

In a lot of applications, some prior knowledge of the latent factors is available—e.g., in image processing, the $\boldsymbol{A}^{(n)}$'s are normally assumed to be nonnegative [6]; in statistical machine learning, the columns of $\boldsymbol{A}^{(n)}$ are sometimes assumed to be constrained within the probability simplex [35, 39], i.e.,

$$\boldsymbol{1}^{\top}\boldsymbol{A}^{(n)} = \boldsymbol{1}^{\top}, \quad \boldsymbol{A}^{(n)} \geq \boldsymbol{0}. \qquad (6)$$

In those cases, the following criterion is often of interest:

$$\min_{\{\boldsymbol{A}^{(n)}\}_{n=1}^{N}} \; f\left(\boldsymbol{A}^{(1)},\ldots,\boldsymbol{A}^{(N)}\right) \quad \text{subject to} \quad \boldsymbol{A}^{(n)}\in\mathcal{A}_n, \; n=1,\ldots,N, \qquad (7)$$

where $\mathcal{A}_n$ denotes a constraint set, such as the nonnegative orthant or the simplex in (6).

Compared to the unconstrained version, Problem (7) is even harder to handle. Some recent methods combine first-order constrained optimization and ALS [14, 13] to make tensor factorization algorithms more flexible in handling constraints and regularizations—but the complexity orders of those algorithms often scale similarly to that of ALS, since these algorithms cannot avoid computing the MTTKRP, which is the bottleneck of computing the CPD.

2.3 Stochastic Optimization

When the tensor is large and dense, working with the entire dataset can be expensive both computationally and memory-wise. A popular workaround is to apply stochastic approximation (SA)—i.e., sampling parts of the data at random and using the sampled pieces to update the latent factors. Using Eq. (2), Problem (3) with the LS objective is equivalent to the following:

$$\min_{\{\boldsymbol{A}^{(n)}\}} \; \frac{1}{\prod_{n=1}^{N}I_n}\sum_{i_1=1}^{I_1}\cdots\sum_{i_N=1}^{I_N} f_{i_1\cdots i_N}\left(\boldsymbol{A}^{(1)},\ldots,\boldsymbol{A}^{(N)}\right), \qquad (8)$$

where $f_{i_1\cdots i_N}\left(\boldsymbol{A}^{(1)},\ldots,\boldsymbol{A}^{(N)}\right) = \left(\underline{\boldsymbol{X}}(i_1,\ldots,i_N) - \sum_{f=1}^{F}\prod_{n=1}^{N}\boldsymbol{A}^{(n)}(i_n,f)\right)^2$.

The objective function in (8) can be understood as an empirical risk in the SA framework [22]. Using this observation, the algorithms in [23, 24] randomly sample a subset of entries indexed by $(i_1,\ldots,i_N)$ and update the pertinent parts of the latent factors (note that the $(i_1,\ldots,i_N)$th entry of the tensor only contains information of the rows $\boldsymbol{A}^{(n)}(i_n,:)$ for $n=1,\ldots,N$) using the sampled entries of the tensor. For example, [24] uses a stochastic gradient (SG) based approach and updates the rows of the $\boldsymbol{A}^{(n)}$'s that are associated with the sampled entries. The sampling method in [23] is similar, while the update is not gradient-based but Gauss-Newton or ALS applied to the sampled set of entries (or, sub-tensors, to be precise). The upshot of this line of work is that the per-iteration complexity can be quite low.

Despite such favorable complexity savings, the approaches in [23, 24] have a couple of limitations. First, in every iteration only a small part of the $\boldsymbol{A}^{(n)}$'s (i.e., some rows) is updated—which may result in slow improvement of the estimation accuracy of the latent factors. Second—which is perhaps more critical—much useful prior information cannot be incorporated into the algorithms. The reason is that these algorithms update only some of the rows of the $\boldsymbol{A}^{(n)}$'s, while many useful priors are defined w.r.t. the columns of the latent factors, e.g., the probability simplex constraint in (6) and the total variation constraint that is heavily used in image processing. Third, the convergence properties of these methods are often unclear.

An alternative [25] to the SA based methods above is to leverage the tensor data structure by considering randomly sampled fibers of the tensor. Note that a mode-$n$ fiber of $\underline{\boldsymbol{X}}$ (cf. Fig. 2) is a row of the mode-$n$ unfolding $\boldsymbol{X}^{(n)}$ [1]. Now, assume that one samples a set of mode-$n$ fibers indexed by $\mathcal{F}_n\subseteq\{1,\ldots,J_n\}$; then $\boldsymbol{A}^{(n)}$ can be updated by solving a 'sketched version' of Problem (5):

$$\boldsymbol{A}^{(n)} \leftarrow \arg\min_{\boldsymbol{A}} \; \left\|\boldsymbol{X}^{(n)}(\mathcal{F}_n,:) - \boldsymbol{H}^{(n)}(\mathcal{F}_n,:)\,\boldsymbol{A}^{\top}\right\|_F^2. \qquad (9)$$

If $|\mathcal{F}_n|\geq F$, then $\boldsymbol{H}^{(n)}(\mathcal{F}_n,:)$ can have full column rank and the sketched system of linear equations is over-determined. Hence, one can update $\boldsymbol{A}^{(n)}$ by solving the $|\mathcal{F}_n|\times F$-dimensional linear system $\boldsymbol{H}^{(n)}(\mathcal{F}_n,:)\,\boldsymbol{A}^{\top} \approx \boldsymbol{X}^{(n)}(\mathcal{F}_n,:)$ in the LS sense. Similar to the ALS algorithm, after updating $\boldsymbol{A}^{(n)}$, the algorithm moves on to the mode-$(n+1)$ fibers and repeats the same procedure for updating $\boldsymbol{A}^{(n+1)}$. Intuitively, the method in [25] can be more efficient than SG based methods in terms of estimating the $\boldsymbol{A}^{(n)}$'s. The downside is that it needs to sample at least $F$ fibers for each update, and $F$ can be large in tensor decomposition (even larger than the $I_n$'s). In addition, the update in (9) can only handle unconstrained/unregularized tensor decomposition, while incorporating constraints/regularizations is often critical in practice. The convergence properties of this method are also unclear.
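For intuition, the sketch below mimics one fiber-sampled update of the form (9) for mode 1 of a third-order tensor: it draws index pairs $(i_2,i_3)$ at random, stacks the corresponding mode-1 fibers and Khatri-Rao rows, and solves the sketched LS problem. It is a simplified illustration (sampling with replacement, mode 1 only), not the exact procedure of [25].

```python
import numpy as np

def sample_mode1_fibers(X, A2, A3, batch, rng):
    """Sample `batch` mode-1 fibers of a 3-way tensor and the matching rows of
    the Khatri-Rao matrix H^(1).  A mode-1 fiber is X[:, i2, i3]; its H^(1) row
    is the elementwise product A2[i2, :] * A3[i3, :]."""
    I2, I3 = X.shape[1], X.shape[2]
    i2 = rng.integers(0, I2, size=batch)
    i3 = rng.integers(0, I3, size=batch)
    X_sub = X[:, i2, i3].T              # batch x I1, one sampled fiber per row
    H_sub = A2[i2, :] * A3[i3, :]       # batch x F, matching Khatri-Rao rows
    return X_sub, H_sub

def sketched_ls_update(X, A2, A3, batch, rng):
    """Sketched LS update of A^(1), cf. Eq. (9); requires batch >= F so the
    sketched system is (generically) over-determined."""
    X_sub, H_sub = sample_mode1_fibers(X, A2, A3, batch, rng)
    A1_new, *_ = np.linalg.lstsq(H_sub, X_sub, rcond=None)   # solves H_sub @ A1^T ~ X_sub
    return A1_new.T

rng = np.random.default_rng(2)
A1, A2, A3 = (rng.uniform(size=(I, 5)) for I in (40, 50, 60))
X = np.einsum('if,jf,kf->ijk', A1, A2, A3)
A1_hat = sketched_ls_update(X, A2, A3, batch=30, rng=rng)    # batch >= F = 5
print(np.linalg.norm(A1_hat - A1) / np.linalg.norm(A1))      # ~0 in the noiseless case
```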

Figure 2: From left to right: mode-1, mode-2, and mode-3 fibers of a third-order tensor, respectively.

3 Proposed Algorithm

In this work, we propose a new stochastic optimization strategy for CPD. Our method combines the insights from ALS and fiber sampling, but allows the number of sampled fibers per update to be much smaller than the rank $F$. This is instrumental in practice, since it is the key to achieving low per-iteration complexity. The proposed algorithm can easily handle a variety of constraints and regularizations that are commonly used in signal processing and data analytics—which is reminiscent of stochastic proximal gradient (SPG) [29, 40]. In addition, we provide convergence analyses to back up the proposed approach.

3.1 Basic Idea: Unconstrained Case

We first consider Problem (3). Our idea is to combine SA with the exploitation of the tensor fiber structure. Specifically, at each iteration, we sample a set of mode-$n$ fibers for a certain $n$, as the method in [25] does. However, instead of exactly solving the least squares subproblems (5) for all the modes in a Gauss-Seidel manner in each iteration, we update the latent factors using a doubly stochastic procedure. To be more precise, at iteration $r$, we first randomly sample a mode index $n\in\{1,\ldots,N\}$. Then, we randomly sample a set of mode-$n$ fibers indexed by $\mathcal{F}^{(r)}_{n}\subseteq\{1,\ldots,J_n\}$. Let $\boldsymbol{G}^{(r)} = \left\{\boldsymbol{G}^{(r)}_{1},\ldots,\boldsymbol{G}^{(r)}_{N}\right\}$ be a collection of matrices representing the stochastic gradient:

$$\boldsymbol{G}^{(r)}_{n} = \frac{1}{|\mathcal{F}^{(r)}_{n}|}\left(\boldsymbol{A}^{(n)}_{(r)}\,\boldsymbol{H}^{(n)}_{(r)}(\mathcal{F}^{(r)}_{n},:)^{\top}\boldsymbol{H}^{(n)}_{(r)}(\mathcal{F}^{(r)}_{n},:) - \boldsymbol{X}^{(n)}(\mathcal{F}^{(r)}_{n},:)^{\top}\boldsymbol{H}^{(n)}_{(r)}(\mathcal{F}^{(r)}_{n},:)\right), \qquad \boldsymbol{G}^{(r)}_{n'} = \boldsymbol{0} \;\;\text{for } n'\neq n, \qquad (10)$$

where $\boldsymbol{G}^{(r)}_{n}$ denotes the $n$th block of $\boldsymbol{G}^{(r)}$, and we used the shorthand notations $\boldsymbol{A}^{(n)}_{(r)}$ for the mode-$n$ factor at iteration $r$ and $\boldsymbol{H}^{(n)}_{(r)}$ for the Khatri-Rao matrix constructed from the current iterates $\boldsymbol{A}^{(k)}_{(r)}$, $k\neq n$.

The latent variables are updated by

$$\boldsymbol{A}^{(n')}_{(r+1)} \leftarrow \begin{cases} \boldsymbol{A}^{(n)}_{(r)} - \alpha^{(r)}\,\boldsymbol{G}^{(r)}_{n}, & n'=n, \\ \boldsymbol{A}^{(n')}_{(r)}, & n'\neq n, \end{cases} \qquad (11)$$

where $\alpha^{(r)}$ is the stepsize at iteration $r$.

One observation is that $\boldsymbol{G}^{(r)}_{n}$ is simply a stochastic approximation of the full gradient of Problem (3) w.r.t. the chosen mode-$n$ variable $\boldsymbol{A}^{(n)}$, and the update (11) is an iteration of the classical SG algorithm (with a minibatch size of $|\mathcal{F}^{(r)}_{n}|$) for solving the problem in (5).

The proposed update is very efficient, since the most resource-consuming operation in algorithms such as those in [13, 14]—the full MTTKRP—is avoided. The corresponding computation in (10) costs only a number of flops proportional to $|\mathcal{F}^{(r)}_{n}|$—and $|\mathcal{F}^{(r)}_{n}|$ is under our control. Note that the first step in this procedure is different from standard ALS-type algorithms, which update the block variables cyclically instead of updating a randomly sampled block. As we will show, this modification greatly simplifies our convergence analysis.
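A minimal sketch of one such doubly randomized iteration (unconstrained, third-order case) is given below; the hyperparameters and the toy driver loop are illustrative choices, not the settings used in the experiments.

```python
import numpy as np

def bras_step(X, factors, alpha, batch, rng):
    """One doubly randomized step for a third-order tensor (unconstrained case):
    sample a mode n, sample `batch` mode-n fibers, and take a stochastic
    gradient step on A^(n) only, in the spirit of (10)-(11)."""
    N = X.ndim                                             # here N = 3
    n = int(rng.integers(0, N))                            # 1) sample the mode uniformly
    others = [k for k in range(N) if k != n]
    # 2) sample fibers: draw one index per mode other than n
    idx = [rng.integers(0, X.shape[k], size=batch) for k in others]
    # rows of the sampled Khatri-Rao matrix H^(n)(F_n, :)
    H_sub = factors[others[0]][idx[0], :] * factors[others[1]][idx[1], :]
    # the sampled fibers stacked as a (batch x I_n) matrix X^(n)(F_n, :)
    X_sub = np.moveaxis(X, n, 0)[:, idx[0], idx[1]].T
    A_n = factors[n]
    G = (A_n @ (H_sub.T @ H_sub) - X_sub.T @ H_sub) / batch   # stochastic gradient
    factors[n] = A_n - alpha * G                              # plain SGD update of the block
    return factors

# toy run on a synthetic 40 x 50 x 60 rank-5 tensor (illustrative hyperparameters)
rng = np.random.default_rng(0)
A_true = [rng.uniform(size=(I, 5)) for I in (40, 50, 60)]
X = np.einsum('if,jf,kf->ijk', *A_true)
factors = [rng.uniform(size=A.shape) for A in A_true]
for r in range(2000):
    factors = bras_step(X, factors, alpha=0.05, batch=20, rng=rng)
```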

3.2 Constrained and Regularized Case

As mentioned, there are many cases in practice where imposing regularizations or constraints on the $\boldsymbol{A}^{(n)}$'s can benefit the associated tasks. Since our framework updates an entire $\boldsymbol{A}^{(n)}$ in each iteration, it is friendly for incorporating a large variety of commonly used constraints/regularizations—which is more flexible relative to the entry sampling based approaches in [23, 24]. Specifically, the algorithm can be easily extended to handle the constrained/regularized case:

$$\min_{\{\boldsymbol{A}^{(n)}\}} \; f\left(\boldsymbol{A}^{(1)},\ldots,\boldsymbol{A}^{(N)}\right) + \sum_{n=1}^{N} h_n\left(\boldsymbol{A}^{(n)}\right) \qquad (12)$$
$$\text{subject to} \quad \boldsymbol{A}^{(n)}\in\mathcal{A}_n, \quad n=1,\ldots,N,$$

where $f$ is the objective function of (3) and $h_n(\cdot)$ denotes a structure-promoting regularizer on $\boldsymbol{A}^{(n)}$. Note that the constraint $\boldsymbol{A}^{(n)}\in\mathcal{A}_n$ can also be written as a regularization if $h_n$ is defined as the indicator function of the set $\mathcal{A}_n$, i.e.,

$$h_n\left(\boldsymbol{A}^{(n)}\right) = \mathbb{I}_{\mathcal{A}_n}\left(\boldsymbol{A}^{(n)}\right) = \begin{cases} 0, & \boldsymbol{A}^{(n)}\in\mathcal{A}_n, \\ +\infty, & \text{otherwise}, \end{cases}$$

where $\mathbb{I}_{\mathcal{A}_n}(\cdot)$ denotes the indicator function of the set $\mathcal{A}_n$. Using the same fiber sampling strategy as in the previous subsection, we update $\boldsymbol{A}^{(n)}$ by

$$\boldsymbol{A}^{(n)}_{(r+1)} \leftarrow \arg\min_{\boldsymbol{A}} \; \frac{1}{2}\left\|\boldsymbol{A} - \left(\boldsymbol{A}^{(n)}_{(r)} - \alpha^{(r)}\boldsymbol{G}^{(r)}_{n}\right)\right\|_F^2 + \alpha^{(r)} h_n(\boldsymbol{A}), \qquad (13a)$$
$$\boldsymbol{A}^{(n')}_{(r+1)} \leftarrow \boldsymbol{A}^{(n')}_{(r)}, \quad n'\neq n. \qquad (13b)$$

Problem (13a) is also known as the proximal operator of $h_n$, which is often denoted as

$$\boldsymbol{A}^{(n)}_{(r+1)} \leftarrow \mathrm{Prox}_{\alpha^{(r)} h_n}\!\left(\boldsymbol{A}^{(n)}_{(r)} - \alpha^{(r)}\boldsymbol{G}^{(r)}_{n}\right). \qquad (14)$$

Many $h_n$'s admit simple closed-form solutions for their respective proximal operators, e.g., when $h_n$ is the indicator function of the nonnegative orthant or the $\ell_1$-norm regularizer; see Table 1 and more details in [13, 41]. The complexity of computing (14) is often similar to that of the plain update in (11), and thus the constrained/regularized update is also computationally efficient. An overview of the proposed algorithm can be found in Algorithm 1, which we name Block-Randomized SGD for CPD (BrasCPD).

input : $N$-way tensor $\underline{\boldsymbol{X}}$; rank $F$; sample size $B$; initialization $\{\boldsymbol{A}^{(n)}_{(0)}\}$; stepsize schedule $\{\alpha^{(r)}\}$
1  $r \leftarrow 0$;
2  repeat
3      uniformly sample $n$ from $\{1,\ldots,N\}$, then sample $\mathcal{F}^{(r)}_{n}$ from $\{1,\ldots,J_n\}$ with $|\mathcal{F}^{(r)}_{n}| = B$;
4      form the stochastic gradient $\boldsymbol{G}^{(r)}_{n}$ as in (10);
5      update $\boldsymbol{A}^{(n)}_{(r+1)}$ via (11) (or (13) in the constrained/regularized case); keep $\boldsymbol{A}^{(n')}_{(r+1)} \leftarrow \boldsymbol{A}^{(n')}_{(r)}$ for $n'\neq n$;
6      $r \leftarrow r+1$;
7  until some stopping criterion is reached;
output : $\{\boldsymbol{A}^{(n)}\}$
Algorithm 1 BrasCPD
Regularization/constraint $h$     prox./proj. solution                         complexity
$\ell_1$ norm                     soft-thresholding                            $\mathcal{O}(d)$
$\ell_2$ norm                     re-scale                                     $\mathcal{O}(d)$
$\ell_2/\ell_1$ (group sparsity)  block soft-thresholding                      $\mathcal{O}(d)$
$\ell_0$ (cardinality)            hard-thresholding                            $\mathcal{O}(d)$
probability simplex               randomized pivot search [42]                 $\mathcal{O}(d)$ in expectation
nonnegativity                     max$(0,\cdot)$                               $\mathcal{O}(d)$
monotonic                         monotone regression [43]                     $\mathcal{O}(d)$
unimodal                          unimodal regression [44]                     $\mathcal{O}(d)$

In the table, $d$ is the number of optimization variables.

Table 1: Proximal/projection operators of some frequently used regularizations and constraints.
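For concreteness, NumPy sketches of a few proximal/projection operators from Table 1 are given below (nonnegativity, $\ell_1$ soft-thresholding, and simplex projection). Note that the simplex routine shown uses the classical sorting-based projection for simplicity, rather than the randomized pivot search [42] referenced in the table.

```python
import numpy as np

def prox_nonneg(Y):
    """Projection onto the nonnegative orthant: max(0, .) applied elementwise."""
    return np.maximum(Y, 0.0)

def prox_l1(Y, lam):
    """Soft-thresholding: the proximal operator of lam * ||.||_1."""
    return np.sign(Y) * np.maximum(np.abs(Y) - lam, 0.0)

def project_simplex_columns(Y, radius=1.0):
    """Project every column of Y onto the (scaled) probability simplex
    {a >= 0, sum(a) = radius}, using the standard sorting-based algorithm
    (O(d log d) per column; the randomized pivot search in the table achieves
    O(d) in expectation)."""
    d, ncols = Y.shape
    U = -np.sort(-Y, axis=0)                          # sort each column in descending order
    css = np.cumsum(U, axis=0) - radius
    ks = np.arange(1, d + 1)[:, None]
    cond = U - css / ks > 0
    rho = d - 1 - np.argmax(cond[::-1, :], axis=0)    # last index where the condition holds
    theta = css[rho, np.arange(ncols)] / (rho + 1)
    return np.maximum(Y - theta, 0.0)
```

In the constrained/regularized variant of the algorithm, one of these operators is applied to the gradient step, as in (13)-(14).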

4 Convergence Properties

Note that BrasCPD does not fall into any known framework of block stochastic gradient optimization, and thus its convergence properties are not immediately clear. The two most relevant works from the optimization literature are [14] and [31]. However, the work in [14] considers Gauss-Seidel type block SGD (i.e., cyclically updating the blocks), instead of the block-randomized version that BrasCPD uses. The work in [31] considers block-randomized SGD, but only for the convex case. In addition, many assumptions made in [14, 31] for their respective generic optimization problems are not easily satisfied by our CPD problem. In this section, we offer tailored convergence analyses for BrasCPD.

4.1 Unconstrained Case

To proceed, we will use the following assumptions:

Assumption 1.

The stepsize schedule follows the Robbins-Monro rule [45]:

$$\sum_{r=0}^{\infty} \alpha^{(r)} = \infty, \qquad \sum_{r=0}^{\infty}\left(\alpha^{(r)}\right)^2 < \infty.$$

Assumption 2.

The iterates $\boldsymbol{A}^{(n)}_{(r)}$ are bounded for all $n$ and $r$.

Assumption 1 is a principle for stepsize scheduling, which is commonly used in stochastic approximation. Assumption 2 is a working assumption that we make to simplify the analysis. It is considered a relatively strong assumption, since it is hard to check or guarantee. Nevertheless, unbounded iterates rarely happen in practice, if the stepsize is well controlled.

There are also an array of problem structures that are useful for studying convergence of the algorithm.

Fact 1

The LS fitting part of the objective function in (12) (i.e., $f$), viewed as a function of the block $\boldsymbol{A}^{(n)}$, has a Lipschitz continuous gradient; that is,

$$\left\|\nabla_{\boldsymbol{A}^{(n)}} f\left(\boldsymbol{A}^{(n)};\boldsymbol{\theta}\right) - \nabla_{\boldsymbol{A}^{(n)}} f\left(\bar{\boldsymbol{A}}^{(n)};\boldsymbol{\theta}\right)\right\|_F \leq L^{(n)}\left\|\boldsymbol{A}^{(n)} - \bar{\boldsymbol{A}}^{(n)}\right\|_F, \qquad (15)$$

where $L^{(n)} = \lambda_{\max}\left(\left(\boldsymbol{H}^{(n)}\right)^{\top}\boldsymbol{H}^{(n)}\right)$, $\boldsymbol{\theta}$ is a feasible point, and $\boldsymbol{A}^{(n)}$, $\bar{\boldsymbol{A}}^{(n)}$, and $\boldsymbol{H}^{(n)}$ are extracted/constructed from $\boldsymbol{\theta}$ following the respective definitions.

Eq. (15) holds because the objective function w.r.t. $\boldsymbol{A}^{(n)}$ is a plain least squares fitting criterion, which is known to have a Lipschitz continuous gradient—and the smallest such Lipschitz constant is $\lambda_{\max}\left(\left(\boldsymbol{H}^{(n)}\right)^{\top}\boldsymbol{H}^{(n)}\right)$.

The second fact is instrumental in proving convergence of the algorithm:

Fact 2

Denote $\xi^{(r)}$ and $\zeta^{(r)}$ as the random variables responsible for selecting the mode and the fibers in iteration $r$, respectively. Also denote $\mathcal{B}^{(r)} = \left\{\xi^{(0)},\zeta^{(0)},\ldots,\xi^{(r-1)},\zeta^{(r-1)}\right\}$, i.e., the filtration up to iteration $r$. The block-wise stochastic gradient constructed in (10) is an unbiased estimate of the full gradient w.r.t. $\boldsymbol{A}^{(n)}$, i.e.,

$$\mathbb{E}\left[\boldsymbol{G}^{(r)}_{n} \,\middle|\, \mathcal{B}^{(r)},\, \xi^{(r)}=n\right] = \nabla_{\boldsymbol{A}^{(n)}} f\left(\boldsymbol{\theta}^{(r)}\right), \qquad (16)$$

if the fiber-selecting random variable $\zeta^{(r)}$ admits the following probability mass function (PMF):

$$\Pr\left(\zeta^{(r)} = \mathcal{F}_n \,\middle|\, \xi^{(r)}=n\right) = \frac{1}{\binom{J_n}{B}} \quad \text{for every } \mathcal{F}_n\subseteq\{1,\ldots,J_n\} \text{ with } |\mathcal{F}_n| = B, \qquad (17)$$

i.e., the fibers are sampled uniformly at random.

The proof of the above is straightforward and thus skipped. The fact says that even if our block stochastic gradient is not an unbiased estimate of the full gradient $\nabla f\left(\boldsymbol{\theta}^{(r)}\right)$, it is an unbiased estimate of the "block gradient" $\nabla_{\boldsymbol{A}^{(n)}} f\left(\boldsymbol{\theta}^{(r)}\right)$. This fact will prove quite handy in establishing convergence. In fact, the two-level sampling strategy (i.e., block sampling and fiber sampling, respectively) makes the gradient estimate w.r.t. the full variable $\boldsymbol{\theta}$ unbiased up to a scaling factor (see Appendix A). This connection intuitively suggests that the proposed algorithm should behave similarly to an ordinary single-block stochastic gradient descent algorithm.

We first have the following convergence property:

Proposition 1

Consider the case where $h_n\equiv 0$ for all $n$ and Assumptions 1-2 hold. Then, the solution sequence produced by BrasCPD satisfies

$$\liminf_{r\rightarrow\infty} \; \mathbb{E}\left[\left\|\nabla f\left(\boldsymbol{\theta}^{(r)}\right)\right\|^2\right] = 0.$$

The proof is relegated to Appendix B. The above proposition implies that there exists a subsequence of the solution sequence that converges to a stationary point in expectation. We should mention that the SGD/stochastic proximal gradient type update and the block sampling step are essential for establishing convergence—and using the exact solution to (9) as in [25] may not have such convergence properties.

4.2 Constrained/Regularized Case

To understand the convergence of the proximal gradient version with $h_n\not\equiv 0$, denote $\phi(\boldsymbol{\theta}) = f(\boldsymbol{\theta}) + \sum_{n=1}^{N} h_n\left(\boldsymbol{A}^{(n)}\right)$ as the objective function. Our optimality condition amounts to $\boldsymbol{P}^{(r)}_{n}\rightarrow\boldsymbol{0}$ for all $n$, where

$$\boldsymbol{P}^{(r)}_{n} = \frac{1}{\alpha^{(r)}}\left(\boldsymbol{A}^{(n)}_{(r)} - \mathrm{Prox}_{\alpha^{(r)} h_n}\!\left(\boldsymbol{A}^{(n)}_{(r)} - \alpha^{(r)}\nabla_{\boldsymbol{A}^{(n)}} f\left(\boldsymbol{\theta}^{(r)}\right)\right)\right),$$

i.e., the optimality condition is satisfied in a blockwise fashion [30, 32]. Hence, our goal in this section is to show that $\left\|\boldsymbol{P}^{(r)}_{n}\right\|$ vanishes for all $n$ as $r$ grows. We will use the following assumption:

Assumption 3.

There exists a sequence $\{\sigma^{(r)}\}_{r\geq 0}$ such that, for all $r$ and $n$,

$$\mathbb{E}\left[\left\|\boldsymbol{G}^{(r)}_{n} - \nabla_{\boldsymbol{A}^{(n)}} f\left(\boldsymbol{\theta}^{(r)}\right)\right\|^2 \,\middle|\, \mathcal{B}^{(r)},\, \xi^{(r)}=n\right] \leq \left(\sigma^{(r)}\right)^2,$$

and

$$\sum_{r=0}^{\infty} \alpha^{(r)}\sigma^{(r)} < \infty. \qquad (18)$$

We show that BrasCPD produces a convergent solution sequence in the following proposition:

Proposition 2

Assume that Assumptions 1-3 hold, and that $h_n$ is a convex function for every $n$. Then, the solution sequence produced by BrasCPD satisfies

$$\liminf_{r\rightarrow\infty} \; \mathbb{E}\left[\left\|\boldsymbol{P}^{(r)}_{n}\right\|^2\right] = 0, \quad \forall n.$$

Remark 1.

Note that the convergence result in Proposition 2 inherits a possible drawback from single-block stochastic proximal gradient algorithms for nonsmooth nonconvex optimization. To be specific, the relatively strong assumption in (18) needs to hold to ensure convergence. Assumption 3 essentially means that the variance of the gradient estimation error decreases and converges to zero, which is not entirely trivial. One way to fulfill this assumption is to increase the minibatch size $|\mathcal{F}^{(r)}_{n}|$ along the iterations [29, 32]: since the variance of a minibatch gradient estimate scales inversely with the minibatch size, growing the batch fast enough makes $\sigma^{(r)}$ decay so that (18) holds. Another popular way of achieving (18) is to use advanced variance reduction techniques such as SVRG [40]—which goes beyond the scope of this paper and thus is left out of the discussion. Also notice that, as the convergence analysis is pessimistic, a constant minibatch size works fairly well in practice—as we will see soon.

5 An Adaptive Stepsize Scheme

One may have noticed that the convergence theories in Propositions 1-2 do not specify the stepsize sequence $\{\alpha^{(r)}\}$ beyond the two constraints in Assumption 1. This oftentimes leads to an agonizing tuning experience for practitioners when implementing stochastic algorithms.

Recently, a series of algorithms were proposed in the machine learning community for adaptive stepsize scheduling when training deep neural networks [46, 47, 48]. Most of these works are variants of the Adagrad algorithm [33]. The insight of Adagrad can be understood as follows: if one optimization variable has been heavily updated before, then it is given a smaller stepsize in the current iteration (and a larger stepsize otherwise). This way, all the optimization variables can be updated in a balanced manner. Adagrad was proposed for single-block algorithms, and this simple strategy also admits many provable benefits in the context of convex optimization [33]. For our multi-block nonconvex problem, we extend the idea and propose the following updating rule: in iteration $r$, if mode $n$ is sampled, then, for all $i\in\{1,\ldots,I_n\}$ and all $f\in\{1,\ldots,F\}$, we have

$$\left[\boldsymbol{\eta}^{(r)}_{n}\right]_{i,f} = \frac{\eta}{\left(b + \sum_{t=0}^{r}\left[\boldsymbol{G}^{(t)}_{n}\right]_{i,f}^{2}\right)^{1/2+\epsilon}}, \qquad (19a)$$
$$\boldsymbol{A}^{(n)}_{(r+1)} = \boldsymbol{A}^{(n)}_{(r)} - \boldsymbol{\eta}^{(r)}_{n}\circledast\boldsymbol{G}^{(r)}_{n}, \qquad (19b)$$
$$\boldsymbol{A}^{(n')}_{(r+1)} = \boldsymbol{A}^{(n')}_{(r)}, \quad n'\neq n, \qquad (19c)$$

where $\eta, b, \epsilon > 0$ and $\circledast$ denotes the Hadamard (elementwise) product. The Adagrad version of the block-randomized CPD algorithm is very simple to implement. The algorithm is summarized in Algorithm 2, which is named AdaCPD.
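A compact sketch of the per-entry adaptive update in (19) is given below; the accumulator handling and the default constants are illustrative and should be matched to the settings reported in Sec. 6.

```python
import numpy as np

def ada_update(A_n, G_n, accum_n, eta=1.0, b=1e-6, eps=1e-6, prox=None):
    """Adagrad-style update of one latent factor, in the spirit of (19).

    accum_n accumulates the squared stochastic gradients entrywise, so entries
    that have been heavily updated before automatically receive smaller
    stepsizes.  `prox` optionally applies a proximal/projection operator.
    """
    accum_n += G_n ** 2                              # accumulate squared gradients
    step = eta / (b + accum_n) ** (0.5 + eps)        # per-entry adaptive stepsize
    A_new = A_n - step * G_n                         # scaled gradient step
    if prox is not None:
        A_new = prox(A_new)                          # constrained/regularized case
    return A_new, accum_n
```

Within Algorithm 2, one accumulator is maintained per mode, and `prox` is supplied when constraints or regularizations are in force.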

As one will soon see, such a simple stepsize strategy is very robust across a large number of scenarios under test—i.e., in most of the cases, AdaCPD performs well without tuning the stepsize schedule. In addition, the AdaCPD algorithm works well for both the constrained and unconstrained cases.

Proving convergence for nonconvex Adagrad-like algorithms is quite challenging [49, 50]. In this work, we show that the following holds:

Proposition 3

Assume for all , and that for all and . Under the Assumptions 1-2, the solution sequence produced by AdaCPD satisfies

Proposition 3 asserts that the algorithm converges almost surely. The proof is relegated to Appendix D in the supplementary materials due to page limitations. Our proof extends the idea from a recent paper [49] that focuses on using Adagrad for solving single-block nonconvex problems. As mentioned, our two-level sampling strategy makes our algorithm very similar to a single-block SGD with a scaled gradient estimate (cf. Appendix A), and thus, with careful modifications, the key proof techniques in [49] go through. Nevertheless, we detail the proof to keep the paper self-contained.

input : $N$-way tensor $\underline{\boldsymbol{X}}$; rank $F$; sample size $B$; initialization $\{\boldsymbol{A}^{(n)}_{(0)}\}$
1  $r \leftarrow 0$;
2  repeat
3      uniformly sample $n$ from $\{1,\ldots,N\}$, then sample $\mathcal{F}^{(r)}_{n}$ from $\{1,\ldots,J_n\}$ with $|\mathcal{F}^{(r)}_{n}| = B$;
4      form the stochastic gradient $\boldsymbol{G}^{(r)}_{n}$ as in (10);
5      determine the adaptive stepsize $\boldsymbol{\eta}^{(r)}_{n}$ and update $\boldsymbol{A}^{(n)}_{(r+1)}$ via (19); keep $\boldsymbol{A}^{(n')}_{(r+1)} \leftarrow \boldsymbol{A}^{(n')}_{(r)}$ for $n'\neq n$;
6      $r \leftarrow r+1$;
7  until some stopping criterion is reached;
output : $\{\boldsymbol{A}^{(n)}\}$
Algorithm 2 AdaCPD

6 Numerical Results

In this section, we use simulations and real-data experiments to showcase the effectiveness of the proposed algorithm.

6.1 Synthetic Data Simulations

6.1.1 Data Generation

Throughout this subsection, we use synthetic third-order tensors (i.e., $N=3$) whose latent factors are drawn from an i.i.d. uniform distribution—unless otherwise specified. This way, large and dense tensors can be created. For simplicity, we set $I_1 = I_2 = I_3 = I$ and test the algorithms on tensors having different $I$'s and $F$'s. In some simulations, we also consider CPD for noisy tensors, i.e., factoring data tensors that have the following signal model:

$$\underline{\boldsymbol{Y}} = \underline{\boldsymbol{X}} + \underline{\boldsymbol{N}},$$

where $\underline{\boldsymbol{X}}$ is the noiseless low-rank tensor and $\underline{\boldsymbol{N}}$ denotes the additive noise. We use zero-mean i.i.d. Gaussian noise with variance $\sigma^2$ in our simulations, and the signal-to-noise ratio (SNR) (in dB) is defined as $\mathrm{SNR} = 10\log_{10}\left(\|\underline{\boldsymbol{X}}\|_F^2 \big/ \|\underline{\boldsymbol{N}}\|_F^2\right)$.
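For reproducibility of this setup, the noise tensor can be scaled to a prescribed SNR as in the sketch below; the normalization shown (empirical signal power) is an assumption and may differ slightly from the exact convention used in our experiments.

```python
import numpy as np

def add_noise(X, snr_db, rng):
    """Add zero-mean i.i.d. Gaussian noise so that
    10*log10(signal power / noise power) equals snr_db (approximately)."""
    signal_power = np.mean(X ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    N = rng.normal(scale=np.sqrt(noise_power), size=X.shape)
    return X + N
```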

6.1.2 Baselines

A number of baseline algorithms are employed as benchmarks. Specifically, we mainly use the AO-ADMM algorithm [51] and the APG algorithm [14] as our baselines, since they are the most flexible algorithms in terms of handling many different regularizations and constraints. We also present the results output by the CPRAND algorithm [25]. Note that we are primarily interested in constrained/regularized CPD. Because CPRAND operates without constraints, the comparison is not entirely fair (e.g., CPRAND can potentially attain smaller cost values since it has a much larger feasible set). Nevertheless, we employ it as a benchmark since it uses the same fiber sampling strategy as ours. All the algorithms are initialized with the same random initialization; i.e., the entries of $\boldsymbol{A}^{(n)}_{(0)}$ follow the uniform distribution between 0 and 1.

6.1.3 Parameter Setting

For BrasCPD, we set the stepsize according to the decaying schedule in (20), where $r$ is the iteration index and the scaling constant $\alpha$ typically takes a value between 0.001 and 0.1; we try multiple choices of $\alpha$ in our simulations. The batch size $|\mathcal{F}^{(r)}_{n}|$ is set to be below 25, and will be specified later. For AdaCPD, we fix the stepsize parameters in (19) for all the simulations. For CPRAND, we follow the instructions in the original paper [25] to set the number of sampled fibers for each update.

6.1.4 Performance Metrics

To measure the performance, we employ two metrics. The first one is the value of the cost function, i.e., the LS fitting cost in (3). The second one is the estimation accuracy of the latent factors $\boldsymbol{A}^{(n)}$ for $n=1,\ldots,N$. The accuracy is measured by the mean squared error (MSE), which is defined as in [52, 53]:

$$\mathrm{MSE} = \min_{\pi\in\Pi_F} \; \frac{1}{F}\sum_{f=1}^{F}\left\|\frac{\hat{\boldsymbol{A}}^{(n)}(:,\pi(f))}{\left\|\hat{\boldsymbol{A}}^{(n)}(:,\pi(f))\right\|} - \frac{\boldsymbol{A}^{(n)}(:,f)}{\left\|\boldsymbol{A}^{(n)}(:,f)\right\|}\right\|^2, \qquad (21)$$

where $\hat{\boldsymbol{A}}^{(n)}$ denotes the estimate of $\boldsymbol{A}^{(n)}$ and $\Pi_F$ is the set of permutations of $\{1,\ldots,F\}$—the minimization over $\pi$ is used to fix the intrinsic column permutation in CPD, and the column normalization removes the scaling ambiguity.

Since the algorithms under test have very different operations and subproblem-solving strategies, it may be challenging to find an exactly unified complexity measure. In this section, we show the performance of the algorithms against the number of MTTKRP operations used, since the MTTKRP is the most costly step and dominates the complexity of all the algorithms under comparison. All the simulations are conducted in Matlab. The results are averaged over ten random trials with different tensors.
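The MSE in (21) can be evaluated per mode as sketched below. For simplicity, the column permutation is fixed by greedy matching after normalizing the columns; an optimal assignment (e.g., via the Hungarian algorithm) could be substituted and may yield slightly different values.

```python
import numpy as np

def factor_mse(A_hat, A_true):
    """MSE between estimated and true factors after removing the column scaling
    and (greedily) the column permutation, in the spirit of Eq. (21)."""
    A_hat = A_hat / np.linalg.norm(A_hat, axis=0, keepdims=True)
    A_true = A_true / np.linalg.norm(A_true, axis=0, keepdims=True)
    F = A_true.shape[1]
    remaining = list(range(F))
    err = 0.0
    for f in range(F):
        # match the true column f with the closest unused estimated column
        dists = [np.sum((A_hat[:, j] - A_true[:, f]) ** 2) for j in remaining]
        j_best = remaining[int(np.argmin(dists))]
        err += min(dists)
        remaining.remove(j_best)
    return err / F
```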

6.2 Results

Fig. 1 in Sec. 1 has shown the MSE performance of the algorithms in a relatively small-size example, where nonnegativity constraints are used in the algorithms. In that simulation, the batch size is chosen so that every 500 iterations of the proposed algorithms amount to one full MTTKRP. One can see that for this relatively easy case, all the algorithms can reach a good estimation accuracy for the latent factors. Nevertheless, the proposed methods exhibit remarkably higher efficiency.

Fig. 3 shows the average MSEs of the estimated latent factors obtained by the algorithms in a much larger-scale simulation. Here the batch size is set so that the proposed algorithms use 5,000 iterations to compute one full MTTKRP. All the algorithms use nonnegativity constraints except CPRAND. There are several observations in order. First, the stochastic algorithms (i.e., BrasCPD, AdaCPD, and CPRAND) are much more efficient than the deterministic algorithms (AO-ADMM and APG): after 30 MTTKRPs are computed, the stochastic algorithms have often reached a reasonable level of MSE. This is indeed remarkable, since 30 MTTKRPs are roughly equivalent to 10 iterations of AO-ADMM and APG. Second, the two proposed stochastic algorithms largely outperform CPRAND. In particular, BrasCPD with the best-tuned $\alpha$ gives the most promising performance; however, the performance of BrasCPD is affected quite significantly by the choice of $\alpha$, and the other tested values do not give such promising results under this setting. Third, AdaCPD yields the second-lowest MSEs, but its MSE curve starts saturating and decreases more slowly after reaching a low MSE level. This is understandable, since the adaptive stepsize in (19) shrinks after every iteration and thus could vanish after a large number of iterations. Nevertheless, the attained MSE is already very satisfactory, and AdaCPD shows surprising robustness to changing scenarios, without any change to its stepsize scheduling strategy. Fig. 4 shows the cost values against the number of full MTTKRPs computed, which is consistent with what we observed in Fig. 3.

Figure 3: MSEs of the estimated latent factors versus the number of full MTTKRPs computed.
Figure 4: Cost values versus the number of full MTTKRPs computed.

Table 2 shows the MSEs and cost values output by the algorithms when the tensor rank $F$ varies. All the algorithms are stopped after 30 full MTTKRPs are used. One can see that BrasCPD in general exhibits the lowest MSEs if a proper $\alpha$ is chosen under the employed stepsize schedule in (20). However, one can also see that when $F$ changes, there is a risk that BrasCPD runs into numerical issues and yields unbounded solutions. This suggests that BrasCPD may need extra care in tuning its stepsize. On the other hand, AdaCPD always outputs reasonably good results. The MSEs output by AdaCPD are slightly higher relative to the best-tuned BrasCPD, but much lower compared to those of the baselines. More importantly, AdaCPD runs without tuning the stepsize parameters—which shows the power of the adaptive stepsize scheduling strategy.

Algorithm            Metric    F=100    F=200    F=300    F=400
BrasCPD (0.1) MSE NaN NaN NaN
BrasCPD (0.05) MSE 0.0126 0.0494 0.0894 NaN
BrasCPD (0.01) MSE 0.2882 0.3142 0.3235 0.3239
AdaCPD MSE 0.0016 0.1247 0.1467 0.2382
AO-ADMM MSE 0.3190 0.3124 0.3093 0.3033
APG MSE 0.3574 0.3527 0.3538 0.3545
CPRAND MSE 0.0056 0.0967 0.2115 0.2404
BrasCPD (0.1) Cost NaN NaN NaN
BrasCPD (0.05) Cost 0.0046 0.0397 0.1162 NaN
BrasCPD (0.01) Cost 0.0903 0.1832 0.2687 0.3461
AdaCPD Cost 0.1555 0.1684 0.2144
AO-ADMM Cost 0.0990 0.1952 0.2800 0.3520
APG Cost 0.6649 1.3629 2.0664 2.7707
CPRAND Cost 0.0018 0.0691 0.1842 0.2481
Table 2: MSEs and cost values of the estimated latent factors output by the algorithms under different ranks $F$; all the algorithms are stopped after computing 30 MTTKRPs. "NaN" means the algorithm outputs unbounded solutions.
Algorithm            Metric    I=100    I=200    I=300    I=400
BrasCPD (0.1) MSE 0.2432 0.0474
BrasCPD (0.05) MSE 0.2724 0.2000 0.0126
BrasCPD (0.01) MSE 0.2906 0.3086 0.2882 0.2127
AdaCPD MSE 0.2214 0.0121 0.0016
AO-ADMM MSE 0.2561 0.3171 0.3190 0.3235
APG MSE 0.3107 0.3459 0.3574 0.3635
CPRAND MSE 0.1857 0.0459 0.0056 0.0025
BrasCPD (0.1) Cost 0.0795 0.0179
BrasCPD (0.05) Cost 0.0862 0.0668 0.0046
BrasCPD (0.01) Cost 0.1453 0.0981 0.0903 0.2127
AdaCPD Cost 0.0814 0.0058
AO-ADMM Cost 0.0843 0.0957 0.0990 0.1008
APG Cost 0.5936 0.6450 0.6649 0.6776
CPRAND Cost 0.0566 0.0136 0.0018 0.0011
Table 3: Performance of the algorithms under various ’s, . ; all the algorithms are stopped after computing 30 MTTKRPs. .
Algorithm            Metric    SNR=10dB    SNR=20dB    SNR=30dB    SNR=40dB
BrasCPD (0.1) MSE 0.3685 0.0225 0.0024 0.0003
BrasCPD (0.05) MSE 0.2962 0.0198 0.0066 0.0044
BrasCPD (0.01) MSE 0.3125 0.2823 0.2774 0.2758
AdaCPD MSE 0.3285 0.0192 0.0025 0.0004
AO-ADMM MSE 0.3330 0.3135 0.3118 0.3101
APG MSE 0.3524 0.3521 0.3521 0.3520
CPRAND MSE 1.6047 0.0367 0.0104 0.0100
BrasCPD (0.1) Cost 1.9627 0.2081 0.0212 0.0021
BrasCPD (0.05) Cost 0.9086 0.0918 0.0110 0.0025
BrasCPD (0.01) Cost 0.2812 0.1058 0.0885 0.0865
AdaCPD Cost 0.3137 0.0671 0.0155 0.0024
AO-ADMM Cost 0.1533 0.0999 0.0954 0.0948
APG Cost 0.6445 0.6441 0.6430 0.6435
CPRAND Cost 0.8038 0.0811 0.0100 0.0039
Table 4: Performance of the algorithms under various SNRs; all the algorithms are stopped after computing 30 MTTKRPs. , . .
Algorithm            Metric    SNR=10dB    SNR=20dB    SNR=30dB    SNR=40dB
BrasCPD (0.1) MSE 0.4697 0.4423 0.3956 0.4320
BrasCPD (0.05) MSE 0.4443 0.4267 0.4135 0.4146
BrasCPD (0.01) MSE 0.3940 0.0335 0.0033 0.0003
AdaCPD MSE 0.2983 0.0611 0.0011 0.0002
AO-ADMM MSE 0.3206 0.2996 0.2973 0.2972
APG MSE 0.2761 0.2760 0.2760 0.2760
CPRAND MSE 1.6020 0.0466 0.0045 0.0112
BrasCPD (0.1) Cost 14274.5141 12059.7192 7386.9652 10944.0721
BrasCPD (0.05) Cost 11424.7030 10159.3117 9059.4911 9152.1151
BrasCPD (0.01) Cost 229.6424 24.8565 2.5698 0.2571
AdaCPD Cost 14.8627 3.4755 0.5916 0.1318
AO-ADMM Cost 9.6097 6.1642 5.8643 5.8359
APG Cost 36.5461 36.5095 36.5059 36.5055
CPRAND Cost 51.5269 5.2663 0.5413 0.2570
Table 5: Performance of the algorithms under various SNRs after computing 30 MTTKRPs. , . , . .

Tables 4-5 show the performance of the algorithms under different SNRs. Except for adding noise, the other settings are the same as those in Fig. 3. In a noisy environment, the ability to handle constraints/regularizations is essential for a CPD algorithm, since prior information on the latent factors can help improve estimation accuracy. Table 4 and Table 5 test the cases where the $\boldsymbol{A}^{(n)}$'s are elementwise nonnegative and where the columns of $\boldsymbol{A}^{(n)}$ reside in a scaled version of the probability simplex, respectively. One can see from the two tables that both BrasCPD (with a proper $\alpha$) and AdaCPD work very well. In Table 5, one can see that BrasCPD again shows its sensitivity to the choice of $\alpha$, with $\alpha = 0.1$ and $\alpha = 0.05$ actually not working. We also note that when the SNR is low, CPRAND is not as competitive, perhaps because it cannot use constraints to incorporate prior information on the $\boldsymbol{A}^{(n)}$'s—this also shows the importance of being able to handle various constraints.

6.3 Real-Data Experiment

In this subsection, we test our algorithms on a constrained tensor decomposition problem; i.e., we apply the proposed BrasCPD and AdaCPD to factor hyperspectral images. Hyperspectral images (HSIs) are special images whose pixels are measured at a large number of wavelengths. Hence, an HSI is usually stored as a third-order tensor with two spatial coordinates and one spectral coordinate. HSIs are dense tensors and thus are suitable for testing the proposed algorithms. We use sub-images of the Indian Pines dataset and the Pavia University dataset (both datasets are available online: http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes).

Tables 6-7 show the cost values of the nonnegativity constrained optimization algorithms under different ranks, after computing 10 MTTKRPs for all three modes, which corresponds to 10 iterations of AO-ADMM and APG (we use this "all-mode MTTKRP" count in this section since the tensors are unsymmetrical and thus single-mode MTTKRPs cannot be directly translated to iterations of the batch algorithms). One can see that the proposed algorithms show the same merits as in the simulations: BrasCPD can exhibit very competitive performance when $\alpha$ is properly chosen, and AdaCPD gives consistently good performance without tuning the stepsize manually. Particularly, on the Pavia University dataset, AdaCPD gives much lower cost values compared to the other algorithms. Fig. 5 shows how the cost values change along with the iterations on the Pavia University data.

Algorithm            Metric    F=10    F=20    F=30    F=40
BrasCPD (4) Cost
BrasCPD (3) Cost
BrasCPD (2) Cost
AdaCPD Cost
AO-ADMM Cost
APG Cost
Table 6: Performance of the algorithms on the Indian Pines dataset under different ranks $F$.
Algorithm            Metric    F=100    F=200
BrasCPD (0.5) Cost
BrasCPD (0.3) Cost
BrasCPD (0.1) Cost
AdaCPD Cost
AO-ADMM Cost
APG Cost