A Flexible and Efficient Algorithmic Framework for Constrained Matrix and Tensor Factorization
Abstract
We propose a general algorithmic framework for constrained matrix and tensor factorization, which is widely used in signal processing and machine learning. The new framework is a hybrid between alternating optimization (AO) and the alternating direction method of multipliers (ADMM): each matrix factor is updated in turn, using ADMM, hence the name AO-ADMM. This combination can naturally accommodate a great variety of constraints on the factor matrices, and almost all possible loss measures for the fitting. Computation caching and warm start strategies are used to ensure that each update is evaluated efficiently, while the outer AO framework exploits recent developments in block coordinate descent (BCD)-type methods which help ensure that every limit point is a stationary point, as well as faster and more robust convergence in practice. Three special cases are studied in detail: nonnegative matrix/tensor factorization, constrained matrix/tensor completion, and dictionary learning. Extensive simulations and experiments with real data are used to showcase the effectiveness and broad applicability of the proposed framework.
Keywords: constrained matrix/tensor factorization, nonnegative matrix/tensor factorization, canonical polyadic decomposition, PARAFAC, matrix/tensor completion, dictionary learning, alternating optimization, alternating direction method of multipliers.
1 Introduction
Constrained matrix and tensor factorization techniques are widely used for latent parameter estimation and blind source separation in signal processing, dimensionality reduction and clustering in machine learning, and numerous other applications in diverse disciplines, such as chemistry and psychology. Least-squares low-rank factorization of matrices and tensors without additional constraints is relatively well-studied, as in the matrix case the basis of any solution is simply the principal components of the singular value decomposition (SVD) [19], also known as principal component analysis (PCA), and in the tensor case alternating least squares (ALS) and other algorithms usually yield satisfactory results [54]. ALS is also used for matrix factorization, especially when the size is so large that performing the exact PCA is too expensive.
Whereas unconstrained matrix and tensor factorization algorithms are relatively mature, their constrained counterparts leave much to be desired as of this writing, and a unified framework that can easily and naturally incorporate multiple constraints on the latent factors is sorely missing. Existing algorithms are usually only able to handle one or at most a few specialized constraints, and/or the algorithm needs to be carefully redesigned if new constraints are added. Commonly adopted constraints on the latent factors include nonnegativity [36], sparsity (usually via sparsity-inducing regularization [41]), and simplex constraints [27], to name just a few.
On top of the need to incorporate constraints on the latent factors, many established and emerging signal processing applications entail cost (loss) functions that differ from classical least-squares. Important examples include matrix completion [12], where missing values are ignored by the loss function, and robust PCA [11], where the ℓ1 loss is used. In the matrix case without constraints on the latent factors, these can be formulated as convex problems and solved in polynomial time. With explicit constraints imposed on the latent factors, and/or for tensor data, however, nonconvex (multilinear) formulations are unavoidable, and a unified algorithmic framework that can handle a variety of constraints and loss functions would be very welcome.
In this paper, we propose a general algorithmic framework that seamlessly and relatively effortlessly incorporates many common types of constraints and loss functions, building upon and bridging together the alternating optimization (AO) framework and the alternating direction method of multipliers (ADMM), hence the name AO-ADMM.
While combining these frameworks may seem conceptually straightforward at first sight, what is significant is that AO-ADMM outperforms all prior algorithms for constrained matrix and tensor factorization under non-parametric constraints on the latent factors. One example is nonnegative matrix factorization, where the prior art includes decades of research. This is the biggest, but not the only, advantage of AO-ADMM. Carefully developing various aspects of this combination, we show that

AO-ADMM converges to a stationary point of the original NP-hard problem;

Using computation caching, warm start, and good parameter settings, its per-iteration complexity is similar to that of ALS;

AO-ADMM can incorporate a wide range of constraints and regularization penalties on the latent factors at essentially the same complexity;

It can also accommodate a wide variety of cost/loss functions, with only a moderate increase in complexity relative to the classical least-squares loss; and

The core computations are exactly the same as in ALS for unconstrained factorization, with some additional element-wise operations to handle constraints, making it easy to incorporate smart implementations of ALS, including sparse, parallel, and high-performance computing enhancements.
1.1 Notation
We denote the (approximate) factorization of an m × n matrix Y as Y ≈ WH^T, where W is m × k and H is n × k, with k ≤ min(m, n), and in most cases much smaller. Note that adding constraints on W and H may turn the solution from easy to find (via SVD) but non-identifiable, to NP-hard to find but identifiable. It has been shown that simple constraints like nonnegativity and sparsity can make the factors essentially unique, but at the same time, computing the optimal solution becomes NP-hard—see [29] and references therein.
An N-way array of dimension n1 × n2 × ⋯ × nN, with N ≥ 3, is denoted with an underscore, e.g., Y̲. In what follows, we focus on the so-called parallel factor analysis (PARAFAC) model, also known as canonical decomposition (CANDECOMP) or canonical polyadic decomposition (CPD), which is essentially unique under mild conditions [48], but constraints certainly help enhance estimation performance, and even identifiability. The factorization is denoted as Y̲ = [[H1, …, HN]], which is a concise way of representing the model

y(i1, …, iN) = Σ(f=1…k) Π(n=1…N) Hn(in, f).

Each matrix Hn is of size nn × k, corresponding to the factor of the nth mode.
1.2 Multilinear Algebra Basics
With the increasing interest in tensor data processing, there exist many tutorials on this topic, for example, [49, 34]. Here we briefly review some basic multilinear operations that will be useful for the purposes of this paper, and refer the readers to those tutorials and the references therein for a more comprehensive introduction.
The mode-n matricization, also known as mode-n matrix unfolding, of Y̲, denoted Y(n), is a matrix of size (Π(j≠n) nj) × nn. Each row of Y(n) is a vector obtained by fixing all the indices of Y̲ except the nth one, and the matrix is formed by stacking these row vectors by traversing the remaining indices from N back to 1.
Notice that, though essentially in the same spirit, this definition of the mode-n matricization may differ from other expressions that have appeared in the literature; we adopt this one for ease of use.
The Khatri-Rao product of two matrices A and B having the same number of columns, denoted A ⊙ B, is defined as the column-wise Kronecker product of A and B. More explicitly, if A = [a1 ⋯ ak] is of size IA × k and B = [b1 ⋯ bk] is of size IB × k, then

A ⊙ B = [a1 ⊗ b1 ⋯ ak ⊗ bk],

which is of size IAIB × k. The Khatri-Rao product is associative (although not commutative). We therefore generalize the operator to accept more than two arguments in the following way:

HN ⊙ HN−1 ⊙ ⋯ ⊙ H1 = (HN ⊙ HN−1 ⊙ ⋯ ⊙ H2) ⊙ H1.
With the help of this notation, if Y̲ admits an exact PARAFAC model Y̲ = [[H1, …, HN]], then it can be expressed in matricized form as

Y(n) = (HN ⊙ ⋯ ⊙ Hn+1 ⊙ Hn−1 ⊙ ⋯ ⊙ H1) Hn^T.
Lastly, a nice property of the Khatri-Rao product is that

(A ⊙ B)^T (A ⊙ B) = (A^T A) ∗ (B^T B),

where ∗ denotes the element-wise (Hadamard) matrix product. More generally, it holds that

(HN ⊙ ⋯ ⊙ H1)^T (HN ⊙ ⋯ ⊙ H1) = (HN^T HN) ∗ ⋯ ∗ (H1^T H1).
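A quick numerical check of this Gram property, as a Python/NumPy sketch (the `khatri_rao` helper is defined here for the illustration; it is not part of the paper):

```python
import numpy as np

def khatri_rao(A, B):
    # column-wise Kronecker product: A is I x k, B is J x k, result is (I*J) x k
    I, k = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, k)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((5, 3))
KR = khatri_rao(A, B)

# Gram of the Khatri-Rao product equals the Hadamard product of the small Grams
assert np.allclose(KR.T @ KR, (A.T @ A) * (B.T @ B))
```

This is exactly what makes the Gram matrix cheap to form in the tensor case: only small k × k Grams are multiplied element-wise, and the tall Khatri-Rao product never has to be squared explicitly.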
2 Alternating Optimization Framework: Preliminaries
We start by formulating the factorization problem as an optimization problem in its most general form:

(1) minimize over H1, …, HN of l(Y̲ − [[H1, …, HN]]) + Σ(n=1…N) rn(Hn),

with a slight abuse of notation by allowing N to also take the value 2. In (1), l(·) can be any loss measure, most likely separable down to the entries of its argument, and rn(·) is a generalized regularization on Hn, which may take the value +∞ so that hard constraints can also be incorporated. For example, if we require that the elements of Hn be nonnegative, denoted Hn ≥ 0, then

rn(Hn) = 0 if Hn ≥ 0, and rn(Hn) = +∞ otherwise.
Because of the multilinear term [[H1, …, HN]], the regularized fitting problem (1) is nonconvex, and in many cases NP-hard [58, 26]. A common way to handle this is the alternating optimization (AO) technique, i.e., updating each factor in a cyclic fashion. The popular ALS algorithm is a special case of this when l is the least-squares loss and there is no regularization. In this section, we first revisit the ALS algorithm, with a focus on the per-iteration complexity analysis. Then, we briefly discuss the convergence of the AO framework, especially some recent advances on the convergence of the traditional block coordinate descent (BCD) algorithm.
2.1 Alternating LeastSquares Revisited
Consider the unconstrained matrix factorization problem

(2) minimize over W, H of ½ ‖Y − WH^T‖F²,

and momentarily ignore the fact that the optimal solution of (2) is given by the SVD. Problem (2) is nonconvex in W and H jointly, but is convex if we fix one factor and treat only the other as the variable. Supposing W is fixed, the subproblem for H becomes classical linear least squares and, if W has full column rank, the unique solution is given by

(3) H^T = (W^T W)^(−1) W^T Y.
In practice, the matrix inverse is almost never explicitly calculated. Instead, the Cholesky decomposition of the Gram matrix W^T W = LL^T is computed, and for each column of W^T Y, one forward and one backward substitution are performed to get the corresponding column of H^T. Since Y is m × n and W is m × k, forming W^T W and W^T Y takes O(mk²) and O(mnk) flops, respectively; computing the Cholesky decomposition requires O(k³) flops; and the final back-substitution step takes O(nk²) flops, similar in cost to a matrix multiplication. If k ≤ min(m, n), the overall complexity is O(mnk).
An important implication is the following. Clearly, if n = 1, the cost of a single least-squares solve is O(mk²). However, as n grows, the complexity does not simply grow proportionally to O(mnk²), but rather to O(mnk). The reason is that, although it seems we are solving n least-squares problems, they all share the same matrix W, so the Cholesky decomposition of W^T W can be reused throughout. This is a very nice property of unconstrained least squares, which can be exploited to improve the computational efficiency of the ALS algorithm.
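A small Python/NumPy sketch of this Cholesky-caching strategy (illustrative, not from the paper; `np.linalg.solve` is used for self-containment, whereas a dedicated triangular solver such as `scipy.linalg.solve_triangular` would also exploit the structure of the factor):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 200, 1000, 10
W = rng.standard_normal((m, k))
Y = rng.standard_normal((m, n))

L = np.linalg.cholesky(W.T @ W)   # factor the k x k Gram matrix once: O(k^3)
F = W.T @ Y                       # O(mnk): the dominant cost, shared by all columns

# forward then backward substitution for all n right-hand sides: O(nk^2)
H = np.linalg.solve(L.T, np.linalg.solve(L, F)).T

# H^T satisfies the normal equations (W^T W) H^T = W^T Y
assert np.allclose(W.T @ W @ H.T, F)
```

The point is that the O(k³) factorization is paid once, while each of the n columns costs only O(k²) afterwards.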
One may recall that another well-adopted method for least squares is to compute the QR decomposition of W as W = QR, so that H^T = R^(−1) Q^T Y. This can be shown to have the same computational complexity as the Cholesky version, and is actually more stable numerically. However, if W has some special structure, it is easier to exploit it via the Cholesky route. Therefore, in this paper we only consider solving least-squares problems using the Cholesky decomposition.
One important structure that we encounter is in the tensor case. In the ALS algorithm for PARAFAC, the update of Hn is the solution of the following least-squares problem:

minimize over Hn of ½ ‖Y(n) − (HN ⊙ ⋯ ⊙ Hn+1 ⊙ Hn−1 ⊙ ⋯ ⊙ H1) Hn^T‖F²,

and the solution is given by

Hn^T = ( (HN^T HN) ∗ ⋯ ∗ (Hn+1^T Hn+1) ∗ (Hn−1^T Hn−1) ∗ ⋯ ∗ (H1^T H1) )^(−1) (HN ⊙ ⋯ ⊙ Hn+1 ⊙ Hn−1 ⊙ ⋯ ⊙ H1)^T Y(n).
As we can see, the Gram matrix ∗(j≠n) Hj^T Hj is computed efficiently by exploiting the structure, and its Cholesky decomposition can be reused. The most expensive operation is actually the computation of (⊙(j≠n) Hj)^T Y(n), the so-called matricized-tensor-times-Khatri-Rao-product, but very efficient algorithms for this (that work without explicitly forming the Khatri-Rao product and the mode-n matricization) are available [4, 30, 43, 16, 50]. If we were to adopt the QR decomposition approach, however, none of these methods could be applied.
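For a 3-way tensor, the structure-exploiting ALS update can be illustrated with a short Python/NumPy sketch (the einsum-based contraction stands in for the specialized algorithms cited above; names are illustrative):

```python
import numpy as np

# ALS update of the mode-1 factor A of a 3-way tensor T = [[A, B, C]]:
# the Gram of the Khatri-Rao matrix is formed from two small k x k Grams,
# and (C kr B)^T T_(1) is computed by a tensor contraction, without ever
# materializing the matricization or the Khatri-Rao product.
rng = np.random.default_rng(2)
I, J, K, k = 6, 7, 8, 3
A, B, C = (rng.standard_normal((d, k)) for d in (I, J, K))
T = np.einsum('ir,jr,kr->ijk', A, B, C)       # exact rank-k model

G = (B.T @ B) * (C.T @ C)                     # Gram via Hadamard of Grams
M = np.einsum('ijk,jr,kr->ir', T, B, C)       # matricized-tensor-times-KR-product
A_new = np.linalg.solve(G, M.T).T             # least-squares update of A

assert np.allclose(A_new, A)                  # exact model, so A is recovered
```

Since T obeys the exact model, M = A G holds identically, so the solve returns A; with noisy data the same two lines give the least-squares update.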
In summary, least squares is a very mature technique with many favorable properties that render the ALS algorithm very efficient. On the other hand, most algorithms that deal with constraints on the factors or different loss measures do not inherit these good properties. The goal of this paper is to propose an AO-based algorithmic framework that can easily handle many types of constraints on the latent factors and many loss functions, with per-iteration complexity essentially the same as that of an ALS step.
2.2 The Convergence of AO
Consider the following (usually nonconvex) optimization problem with the variables separated into N blocks, each with its own constraint set:

(4) minimize over x1, …, xN of f(x1, …, xN)

subject to xn ∈ Xn, n = 1, …, N.
A classical AO method called block coordinate descent (BCD) cyclically updates each block xn via solving

(5) minimize over xn of f(x1^(r+1), …, x(n−1)^(r+1), xn, x(n+1)^(r), …, xN^(r))

subject to xn ∈ Xn
at the (r+1)-st iteration [6, § 2.7]. Obviously, this decreases the objective function monotonically. If some additional assumptions are satisfied, we can make stronger convergence claims [6, Proposition 2.7.1]. Simply put, if the subproblem (5) is convex and has a unique solution, then every limit point is a stationary point; furthermore, if the constraint sets X1, …, XN are all compact, which implies that the sequence generated by BCD is bounded, then BCD is guaranteed to converge to a stationary point, even if (4) is nonconvex [56].
In many cases (5) is convex, but the uniqueness of its solution is very hard to guarantee. A special case that does not require uniqueness, first noticed by Grippo and Sciandrone [25], is N = 2. In hindsight, this can be explained by the fact that for N = 2, BCD coincides with the so-called maximum block improvement (MBI) algorithm [14], which converges under very mild conditions. However, instead of updating the blocks cyclically, MBI only updates the one block that decreases the objective the most, so its per-iteration complexity is N times that of BCD; therefore MBI is not commonly used in practice when N is large.
Another way to ensure convergence, proposed by Razaviyayn et al. [44], is as follows. Instead of updating xn as the solution of (5), the update is obtained by solving a majorized version of (5); the resulting scheme is called block successive upper-bound minimization (BSUM). The convergence properties of BSUM are essentially the same, but now we can deliberately design the majorizing function to ensure that the solution is unique. One simple way to do this is to add a proximal regularization term:

(6) minimize over xn of f(x1^(r+1), …, x(n−1)^(r+1), xn, x(n+1)^(r), …, xN^(r)) + (μ/2) ‖xn − xn^(r)‖²

subject to xn ∈ Xn,

for some μ > 0, where xn^(r) is the update of xn from the previous iteration. If (5) is convex, then (6) is strongly convex, which gives a unique minimizer. Thus, the algorithm is guaranteed to converge to a stationary point, as long as the sequence generated by the algorithm is bounded. Similar results are proved in [59], where the authors used a different majorization for constrained matrix/tensor factorization; we compare with their method in the numerical experiments.
3 Solving the Subproblems Using ADMM
The AO algorithm framework is usually adopted when each of the subproblems can be solved efficiently. This is indeed the case for the ALS algorithm, since each update is in closed form. For the general factorization problem (1), we denote the subproblem as

(7) minimize over H of l(Y − WH^T) + r(H).

For the matrix case, this is simply the subproblem for the right factor, and one can easily obtain the update of the left factor by transposing everything; for the tensor case, this becomes the update of Hn by setting Y = Y(n) and W = HN ⊙ ⋯ ⊙ Hn+1 ⊙ Hn−1 ⊙ ⋯ ⊙ H1. This is for ease of notation only, as these matricizations and Khatri-Rao products need not actually be computed explicitly. Also notice that this is the subproblem for the BCD algorithm; for better convergence we may want to add a proximal regularization term to (7), which is very easy to handle and thus omitted here.
We propose to use the alternating direction method of multipliers (ADMM) to solve (7). ADMM, if used in the right way, inherits many of the good properties of each ALS update. Furthermore, the AO framework naturally provides good initializations for ADMM, which further accelerates its convergence on the subproblem. As a preview, the implicit message here is that a closed-form solution is not necessary for computational efficiency, as we will explain later. After a brief introduction to ADMM, we first apply it to (7) with least-squares loss, and then generalize it to universal loss measures.
3.1 Alternating Direction Method of Multipliers
ADMM solves convex optimization problems that can be written in the form

minimize over x, z of f(x) + g(z)

subject to Ax + Bz = c,

by iterating the following updates:

x ← argmin over x of f(x) + (ρ/2) ‖Ax + Bz − c + u‖²,
z ← argmin over z of g(z) + (ρ/2) ‖Ax + Bz − c + u‖²,
u ← u + Ax + Bz − c,

where u is a scaled version of the dual variables corresponding to the equality constraint Ax + Bz = c, and ρ > 0 is specified by the user.
A comprehensive review of the ADMM algorithm can be found in [7] and the references therein. The beauty of ADMM is that it converges under mild conditions (in the convex case), while artful splitting of the variables into the two blocks x and z can yield very efficient updates and/or distributed implementation. Furthermore, if the objective is strongly convex with a Lipschitz-continuous gradient, linear convergence of ADMM can be achieved; cf. the guidelines on the optimal step-size in [46, § 9.3], and [22] for an analysis of ADMM applied to quadratic programming.
3.2 LeastSquares Loss
We start by considering l(·) in (7) to be the least-squares loss, l(Y − WH^T) = ½ ‖Y − WH^T‖F². The problem is reformulated by introducing an auxiliary variable H̃:

(8) minimize over H, H̃ of ½ ‖Y − WH̃^T‖F² + r(H)

subject to H = H̃.

It is easy to apply the ADMM algorithm and derive the following iterates:

(9) H̃^T ← (W^T W + ρI)^(−1) (W^T Y + ρ(H + U)^T),
    H ← argmin over H of r(H) + (ρ/2) ‖H − H̃ + U‖F²,
    U ← U + H − H̃.
One important observation is that, throughout the ADMM iterations, the same matrix W^T Y and the same matrix inversion are used. Therefore, to save computation, we can cache W^T Y and the Cholesky decomposition of W^T W + ρI. The update of H̃ is then dominated by one forward substitution and one backward substitution, resulting in a complexity of O(nk²).
The update of H is the so-called proximity operator of r(·)/ρ evaluated at the point H̃ − U; in particular, if r is the indicator function of a convex set, then the update of H becomes a projection operator, a special case of the proximity operator. For many regularizations/constraints, especially those often used in matrix/tensor factorization problems, the update of H boils down to element-wise updates, i.e., costing O(nk) flops. Here we list some of the most commonly used constraints/regularizations in the matrix factorization problem, and refer the reader to [42, §6] for more. For simplicity of notation, let us define H̄ = H̃ − U.

Nonnegativity. In this case r is the indicator function of the nonnegative orthant, and the update is simply zeroing out the negative values of H̄. In fact, any element-wise bound constraints can be handled similarly, since element-wise projection is trivial.

Lasso regularization. For r(H) = λ‖H‖₁, the sparsity-inducing regularization, the update is the well-known soft-thresholding operator, H ← sign(H̄) max(|H̄| − λ/ρ, 0), applied element-wise. The element-wise thresholding can also be converted to block-wise thresholding if one wants to impose structured sparsity, leading to the group Lasso regularization.

Simplex constraint. In some probabilistic model analyses we need to constrain the columns or rows of H to be element-wise nonnegative and sum to one. As described in [18], this projection can be done with a randomized algorithm with linear-time complexity on average.

Smoothness regularization. We can encourage the columns of H to be smooth by adding the regularization r(H) = λ‖TH‖F², where T is an n × n tridiagonal matrix with 2 on the diagonal and −1 on the super- and sub-diagonals. Its proximity operator is given by H ← (I + (2λ/ρ) T^T T)^(−1) H̄. Although this involves a large matrix inversion, the matrix has a fixed bandwidth of 2, so the solve can be carried out in O(nk) time [24, §4.3].
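As a concrete illustration, the first two proximity operators above can be written in a few lines (a Python/NumPy sketch with illustrative function names; H̄ is passed in as `Hbar`):

```python
import numpy as np

def prox_nonneg(Hbar, rho):
    # indicator of H >= 0: project by zeroing out the negative entries
    # (rho plays no role for an indicator function)
    return np.maximum(Hbar, 0.0)

def prox_lasso(Hbar, rho, lam):
    # lam * ||H||_1: element-wise soft-thresholding with threshold lam/rho
    return np.sign(Hbar) * np.maximum(np.abs(Hbar) - lam / rho, 0.0)

Hbar = np.array([[2.0, -0.3], [-1.0, 0.1]])
assert np.allclose(prox_nonneg(Hbar, 1.0), [[2.0, 0.0], [0.0, 0.1]])
assert np.allclose(prox_lasso(Hbar, 1.0, 0.5), [[1.5, 0.0], [-0.5, 0.0]])
```

Both operators touch each entry once, consistent with the O(nk) cost claimed above.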
We found empirically that by setting ρ = trace(W^T W)/k, the ADMM iterates for the regularized least-squares problem (8) converge very fast. This choice of ρ can be seen as an approximation to the optimal ρ given in [46], but is much cheaper to obtain. With a good initialization, naturally provided by the AO framework, the update of H usually takes no more than a handful of ADMM iterations, and soon reduces to only one iteration. The proposed algorithm for the subproblem (8) is summarized in Alg. 1. As we can see, the pre-calculation step takes O(k³) flops to form the Cholesky decomposition (after forming the Gram matrix), and O(mnk) flops to form W^T Y. Notice that these are actually the only computations in Alg. 1 that involve W and Y, which implies that in the tensor case all the tricks to compute the Gram matrix ∗(j≠n) Hj^T Hj and the product (⊙(j≠n) Hj)^T Y(n) can be applied here, and then we need not worry about them anymore. The computational load of each ADMM iteration is dominated by the H̃ update, with complexity O(nk²).
It is interesting to compare Alg. 1 with an update of the ALS algorithm, whose complexity is essentially that of the pre-calculation step plus one iteration. For a small number of ADMM iterations, the complexity of Alg. 1 is therefore of the same order as an ALS step.
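Putting the pieces together, here is a minimal Python/NumPy sketch in the spirit of Alg. 1, specialized to the nonnegativity constraint (function name, defaults, and the synthetic check are illustrative, not from the paper):

```python
import numpy as np

def admm_nnls(W, Y, H0=None, U0=None, n_iter=10):
    """Sketch of the least-squares subproblem solver:
    min over H of 0.5*||Y - W H^T||_F^2  subject to  H >= 0.
    The Cholesky factor and W^T Y are cached; (H0, U0) allow warm starts."""
    m, k = W.shape
    n = Y.shape[1]
    G = W.T @ W
    rho = np.trace(G) / k                     # rho = trace(W^T W)/k, as in the text
    L = np.linalg.cholesky(G + rho * np.eye(k))
    F = W.T @ Y                               # cached: the only pass over Y
    H = np.zeros((n, k)) if H0 is None else H0.copy()
    U = np.zeros((n, k)) if U0 is None else U0.copy()
    for _ in range(n_iter):
        # H-tilde update: two triangular solves against the cached factor
        Ht = np.linalg.solve(L.T, np.linalg.solve(L, F + rho * (H + U).T)).T
        H = np.maximum(Ht - U, 0.0)           # proximity operator of the constraint
        U = U + H - Ht                        # scaled dual update
    return H, U

# tiny synthetic check: exact nonnegative model, cold start from zeros
rng = np.random.default_rng(3)
W = np.abs(rng.standard_normal((30, 3)))
Htrue = np.abs(rng.standard_normal((40, 3)))
Y = W @ Htrue.T
H, U = admm_nnls(W, Y, n_iter=100)
```

In the full AO loop, the `(H, U)` pair returned here would be fed back in as `(H0, U0)` the next time this factor is updated, which is the warm-start strategy discussed later.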
For declaring termination, we adopt the general termination criterion described in [7, §3.3.1]. After some calibration, we define the relative primal residual

(10) r = ‖H − H̃‖F² / ‖H‖F²

and the relative dual residual

(11) s = ‖H − H₀‖F² / ‖U‖F²,

where H₀ is the H from the previous ADMM iteration, and terminate Alg. 1 if both of them are smaller than some threshold.
Furthermore, if the BSUM framework is adopted, we need to solve a proximal-regularized version of (8); the proximal term is quadratic and can easily be absorbed into the H̃ update.
3.3 General Loss
Now let us derive an ADMM algorithm for the more general problem (7). For this case, we reformulate the problem by introducing two auxiliary variables, H̃ and Ỹ:

(12) minimize over H, H̃, Ỹ of l(Ỹ) + r(H)

subject to Ỹ = WH̃^T, H = H̃,

where, with the same slight abuse of notation, l(Ỹ) measures the fitting loss between Ỹ and the data Y.
To apply ADMM, let H̃ be the first block and (H, Ỹ) the second block, and notice that in the second block H and Ỹ can in fact be updated independently. This yields the following iterates:

(13) H̃^T ← (W^T W + ρI)^(−1) (W^T(Ỹ + Ū) + ρ(H + U)^T),
     Ỹ ← argmin over Ỹ of l(Ỹ) + ½ ‖Ỹ − WH̃^T + Ū‖F²,
     H ← argmin over H of r(H) + (ρ/2) ‖H − H̃ + U‖F²,
     U ← U + H − H̃,
     Ū ← Ū + Ỹ − WH̃^T,

where U is the scaled dual variable corresponding to the constraint H = H̃, and Ū is the scaled dual variable corresponding to the equality constraint Ỹ = WH̃^T. Notice that we set the penalty parameter corresponding to the second constraint to 1, since it works very well in practice, and also leads to very intuitive results for some loss functions. This can also be interpreted as first rescaling (preconditioning) this constraint, after which a common penalty parameter is used. Again we set ρ = trace(W^T W)/k.
As we can see, the update of H̃ is simply a linear least-squares problem, and all the previous discussion about caching the Cholesky decomposition applies. It is also easy to absorb an additional proximal regularization term into the update of H̃, if the BSUM framework is adopted. The update of Ỹ is (like the update of H) a proximity operator, and since almost all loss functions we use are element-wise, the update of Ỹ is also very easy. The updates for some of the most commonly used non-least-squares loss functions are listed below. For simplicity, similar to the previous subsection, we define Ȳ = WH̃^T − Ū.

Missing values. In the case that only a subset of the entries of Y are available, a common way to handle this is to simply fit the low-rank model only to the available entries. Let Ω denote the index set of the available values in Y; the loss function then becomes l(Ỹ) = ½ Σ over (i,j) ∈ Ω of (Ỹij − Yij)². Thus, the update of Ỹ in (13) becomes

Ỹij = (Yij + Ȳij)/2 for (i,j) ∈ Ω, and Ỹij = Ȳij otherwise.

Robust fitting. In the case that data entries are not uniformly corrupted by noise but only sparingly corrupted by outliers, or when the noise is dense but heavy-tailed (e.g., Laplacian-distributed), we can use the ℓ1 norm as the loss function for robust (resp. maximum-likelihood) fitting, i.e., l(Ỹ) = ‖Ỹ − Y‖₁. This is similar to the ℓ1 regularization, and the element-wise update is

Ỹij = Yij + sign(Ȳij − Yij) max(|Ȳij − Yij| − 1, 0).

Huber fitting. Another way to deal with possible outliers in Y is to use the Huber function to measure the loss, l(Ỹ) = Σij h(Ỹij − Yij), where

h(z) = z²/2 if |z| ≤ σ, and h(z) = σ|z| − σ²/2 if |z| > σ.

The element-wise closed-form update is, with dij = Ȳij − Yij,

Ỹij = Yij + dij/2 if |dij| ≤ 2σ, and Ỹij = Yij + dij − σ sign(dij) otherwise.

Kullback-Leibler divergence. A commonly adopted loss function for nonnegative integer data is the Kullback-Leibler (KL) divergence, defined as

l(Ỹ) = D(Y‖Ỹ) = Σij ( Yij log(Yij/Ỹij) − Yij + Ỹij ),

for which the proximity operator is

Ỹ = ½ ( (Ȳ − 1) + sqrt( (Ȳ − 1)² + 4Y ) ),

where all the operations are taken element-wise [53]. Furthermore, the KL divergence is a special case of certain families of divergence functions, such as the α-divergence and β-divergence [17], whose corresponding updates are also very easy to derive (they boil down to the proximity operator of a scalar function).
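To make the element-wise Ỹ-updates concrete, here is a minimal Python/NumPy sketch of the first two cases, with the penalty parameter of the second constraint fixed to 1 as in the text (function names are illustrative; `V` stands for Ȳ):

```python
import numpy as np

def prox_missing(V, Y, observed):
    # missing-values loss: observed entries average the data and the model
    # output, the unobserved ones simply copy V (= Ybar)
    out = V.copy()
    out[observed] = 0.5 * (Y[observed] + V[observed])
    return out

def prox_l1(V, Y):
    # l1 (robust) loss: soft-threshold V toward the data Y by 1
    D = V - Y
    return Y + np.sign(D) * np.maximum(np.abs(D) - 1.0, 0.0)

Y = np.array([[4.0, 0.0], [1.0, 2.0]])
V = np.array([[6.0, 0.5], [1.2, -3.0]])
mask = np.array([[True, False], [True, True]])
assert np.allclose(prox_missing(V, Y, mask), [[5.0, 0.5], [1.1, -0.5]])
assert np.allclose(prox_l1(V, Y), [[5.0, 0.0], [1.0, -2.0]])
```

Note how `prox_l1` leaves entries close to the data untouched (Ỹ = Y there) and caps the influence of large residuals, which is exactly the robustness mechanism described above.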
An interesting observation is that if the loss function is in fact the least-squares loss, the matrix that Ỹ is trying to fit in (13) is the data matrix Y per se. Therefore, the update rule (13) boils down to the update rule (9) of the least-squares case, with some redundant updates of Ỹ and Ū. The detailed ADMM algorithm for (12) is summarized in Alg. 2. We use the same termination criterion as in Alg. 1.
Everything seems to be in place to move seamlessly from the least-squares loss to arbitrary loss. Nevertheless, closer scrutiny reveals that some compromises must be made to take this leap. One relatively minor downside is that with a general loss function we may lose the linear convergence rate of ADMM – albeit with the good initialization naturally provided by the AO framework and our particular choice of ρ, it still converges fast in practice. The biggest drawback is that, by introducing the auxiliary variable Ỹ and its dual variable Ū, the big matrix product WH̃^T must be recomputed in each ADMM iteration, whereas in the previous case one only needs to compute W^T Y once. This is the price we must pay, but it can be moderated by controlling the maximum number of ADMM iterations.
Scalability considerations: As big data analytics become increasingly common, it is important to keep scalability in mind as we develop new analysis methodologies and algorithms. Big data are usually stored as a sparse array, i.e., a list of (index, value) pairs, with the unlisted entries regarded as zeros or missing. With the introduction of Ỹ and Ū, both of the size of Y, one hopes to be able to avoid dense operations. Fortunately, for some commonly used loss functions, this is possible. Notice that by defining Ȳ = WH̃^T − Ū, the Ū update essentially becomes

Ū ← Ỹ − Ȳ,

which means a significant portion of the entries of Ū are constants—0 if the entries are regarded as missing, or ±1 (resp. ±σ) in the robust (resp. Huber) fitting case if the entries are regarded as “corrupted”—thus Ū can be efficiently stored as a sparse array. As for Ỹ, one can simply generate it on-the-fly using the closed forms we provided earlier (notice that Ỹ has the memory-efficient “low-rank plus sparse” structure). The only occasion that Ỹ is needed is when computing W^T(Ỹ + Ū).
4 Summary of the Proposed Algorithm
We propose to use Alg. 1 or 2 as the core subroutine for alternating optimization. The proposed “universal” multilinear factorization algorithm is summarized as Alg. 3. A few remarks on implementing Alg. 3 are in order.
Since each factor is updated in a cyclic fashion, one expects that, after a certain number of outer iterations, the factor Hn (and its dual variable Un) obtained in the previous iteration will not be very far from the update sought at the current iteration. In this sense, the outer AO framework naturally provides a good initial point for the inner ADMM iterations. With this warm-start strategy, the optimality gap for the subproblem is bounded by the per-step improvement of the AO algorithm, which is small. This mode of operation is crucial for ensuring the efficiency of Alg. 3. Our experiments suggest that, soon after an initial transient stage, the subproblems can be solved to reasonable precision in just one ADMM iteration.
Similar ideas can be used for Ỹ and Ū in the matrix case if we want to deal with a non-least-squares loss, and actually only one copy of them is needed for the updates of both factors. A few different options are available in the tensor case. If memory is not an issue relative to the size of Y̲, a convenient approach commonly adopted in ALS implementations is to store all N matricizations Y(n), so they are readily available without repetitive data reshuffling at runtime. If this practice is adopted, then it makes sense to also keep matricized copies of Ỹ and Ū, in order to save computation. Depending on the size and nature of the data and how it is stored, it may be completely unrealistic to keep multiple copies of the data and the auxiliary variables, at which point our earlier discussion on scalable implementation of Alg. 2 for big but sparse data can be instrumental.
Sometimes an additional proximal regularization term is added to the subproblems. The benefit is twofold: it helps the convergence of the AO outer loop when N ≥ 3, while for the ADMM inner loop it improves the conditioning of the subproblem, which may accelerate the convergence of ADMM, especially in the general-loss case where we do not have strong convexity. The convergence of AO-ADMM is summarized in Proposition 1.
Proposition 1. Assume the sequence generated by Alg. 3 is bounded. Then every limit point of the sequence is a stationary point of problem (1).
Proof.
Note that for N = 2, using μ = 0 yields faster convergence than μ > 0. For N ≥ 3, i.e., for tensor data, we can update μ as follows
(14) 
which was proposed in [44] for unconstrained tensor factorization, and works very well in our context as well.
The convergence result in Proposition 1 has the additional assumption that the sequence generated by the algorithm is bounded. For unconstrained PARAFAC, diverging components may be encountered during AO iterations [35, 52], but adding Frobenius-norm regularization on each matrix factor (with a small weight) ensures that the iterates remain bounded.
As we can see, ADMM is an appealing subroutine for alternating optimization, leading to a simple plug-and-play generalization of the workhorse ALS algorithm. Theoretically, they share the same per-iteration complexity if the number of inner ADMM iterations is small, which is true in practice after an initial transient. Efficient implementation of the overall algorithm should include data-structure-specific algorithms for W^T Y or (⊙(j≠n) Hj)^T Y(n), which dominate the per-iteration complexity, and may include parallel/distributed computation along the lines of [38].
Finally, if a non-least-squares loss is to be used, we suggest first employing the least-squares loss to obtain preliminary estimates (using Alg. 3 calling Alg. 1), which can then be fed as initialization to Alg. 3 calling Alg. 2. The main disadvantage of Alg. 2 compared to Alg. 1 is that the big matrix (or tensor) product WH̃^T needs to be recalculated in each ADMM iteration; this strategy therefore saves a significant amount of computation in the initial stage.
5 Case Studies and Numerical Results
In this section we will study some well-known constrained matrix/tensor factorization problems, derive the corresponding updates (for H in Alg. 1, or for H and Ỹ in Alg. 2), and compare against some of the state-of-the-art algorithms for each problem. In all examples we denote our proposed algorithm as AO-ADMM. All experiments are performed in MATLAB 2015a on a Linux server with 32 Xeon 2.00GHz cores and 128GB memory.
5.1 Nonnegative Matrix and Tensor Factorization
Perhaps the most common constraint imposed on the latent factors is nonnegativity – which is often supported by physical considerations (e.g., when the latent factors represent chemical concentrations, or power spectral densities) or other prior information, or simply because nonnegativity sometimes yields interpretable factors [36]. Due to the popularity and wide range of applications of NMF, numerous algorithms have been proposed for fitting the NMF model, and most of them can be easily generalized to the tensor case. After a brief review of the existing algorithms for NMF, we compare our proposed algorithm to some of the best algorithms reported in the literature to showcase the efficiency of AO-ADMM.
Let us start by considering NMF with least-squares loss, which is the prevailing loss function in practice. By adopting the alternating optimization framework, the subproblem that emerges for each matrix factor is nonnegative (linear) least squares (NNLS). Some of the traditional methods for NNLS are reviewed in [15] (interestingly, not including ADMM), and most of them have been applied to NMF or nonnegative PARAFAC, e.g., the active-set (AS) method [8, 31] and block-principal-pivoting (BPP) [32, 33]. Recall that in the context of the overall multilinear factorization problem we actually need to solve a large number of (nonnegative) least-squares problems sharing the same mixing matrix W, and in the unconstrained case this means we only need to calculate the Cholesky factorization of the Gram matrix once. Unfortunately, this property, which enables highly efficient implementation of ALS, is not preserved by either AS or BPP. Sophisticated methods that group similar rows to reduce the number of inversions have been proposed [57], although as k grows larger this does not seem appealing in the worst case. Some other methods, like the multiplicative update (MU) [37] or hierarchical alternating least squares (HALS) [17], ensure that the per-iteration complexity is dominated by calculating W^T W and W^T Y, although more outer loops are needed for convergence. These are in fact one-step majorization-minimization or block coordinate descent applied to the NNLS problem. An accelerated version of MU and HALS is proposed in [23], which essentially performs a few more inner loops after computing the most expensive term W^T Y.
ADMM, on the other hand, may not be the fastest algorithm for a single NNLS problem, yet its overhead can be amortized when many NNLS instances share the same mixing matrix, especially if good initialization is readily available. This is in contrast to an earlier attempt to adopt ADMM for NMF [10], which did not use Cholesky caching, warm starts, or a good choice of the ADMM penalty parameter to speed up the algorithm. Furthermore, ADMM can seamlessly incorporate different regularizations as well as nonleastsquares loss functions.
We should emphasize that AO forms the backbone of our proposed algorithm; ADMM is only applied to the subproblems. There are also algorithms that directly apply ADMM to the whole problem [60, 38, 53]. The periteration complexity of those algorithms is the same as that of unconstrained alternating leastsquares. However, due to the nonconvexity of the whole problem, the loss is not guaranteed to decrease monotonically, unlike in alternating optimization. Moreover, while both ADMM and AO guarantee that every limit point is a stationary point, in practice AO almost always converges (as long as the updates stay bounded), which is not the case for ADMM applied to the whole problem.
In another recent line of work [59], a similar idea of using an improved AO framework to ensure convergence is pursued. When [59] is specialized to nonnegative matrix/tensor factorization, each update becomes a simple proximalgradient step with extrapolation. The resulting algorithm is also guaranteed to converge (likewise assuming that the iterates remain bounded), but it turns out to be slower than our algorithm, as we show in our experiments.
To apply our proposed algorithm to NMF or nonnegative PARAFAC with leastsquares loss, Alg. 1 is used to solve the subproblems, with line 8 customized to the projection onto the nonnegative orthant, i.e., zeroing out the negative values of the argument. The tolerance for the ADMM innerloop is set to .
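For concreteness, the customized ADMM subroutine can be sketched as follows. This is an illustrative NumPy sketch, not the paper's MATLAB implementation; the function name, the heuristic choice of the penalty parameter, and the stopping thresholds are our own assumptions.

```python
import numpy as np

def admm_nnls(WtW, WtY, H, U, rho=None, tol=1e-3, max_it=200):
    # Solve min_{H >= 0} ||Y - W H||_F^2 by ADMM, given the cached
    # quantities WtW = W.T @ W and WtY = W.T @ Y, and warm starts H, U
    # (primal and scaled dual variables from the previous outer iteration).
    k = WtW.shape[0]
    if rho is None:
        rho = np.trace(WtW) / k                       # heuristic penalty choice
    L = np.linalg.cholesky(WtW + rho * np.eye(k))     # cached Cholesky factor
    for _ in range(max_it):
        # least-squares step: (WtW + rho I) Htil = WtY + rho (H - U)
        Htil = np.linalg.solve(L.T, np.linalg.solve(L, WtY + rho * (H - U)))
        H_prev = H
        H = np.maximum(Htil + U, 0.0)                 # prox step: zero out negatives
        U = U + Htil - H                              # scaled dual update
        r = np.linalg.norm(Htil - H) / max(np.linalg.norm(H), 1e-12)
        s = np.linalg.norm(H - H_prev) / max(np.linalg.norm(U), 1e-12)
        if r < tol and s < tol:
            break
    return H, U
```

Because the Cholesky factor depends only on the fixed factor, it is computed once per outer iteration and reused across all inner ADMM iterations, and the warm-started H and U typically make only a handful of inner iterations necessary.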
5.1.1 Nonnegative matrix factorization
We compare AOADMM with the following algorithms:

- AOBPP: AO using blockprincipalpivoting [32] (code: http://www.cc.gatech.edu/~hpark/nmfsoftware.php);
- accHALS: accelerated HALS [23] (code: https://sites.google.com/site/nicolasgillis/code);
- APG: alternating proximal gradient [59] (code: http://www.math.ucla.edu/~wotaoyin/papers/bcu/matlab.html);
- ADMM: ADMM applied to the whole problem [60] (code: http://mcnf.blogs.rice.edu/).
AOBPP and HALS are reported in [32] to outperform other methods, accHALS is proposed in [23] to improve upon HALS, APG is reported in [59] to outperform AOBPP, and we include ADMM applied to the whole problem to compare the convergence behavior of AO and ADMM on this nonconvex factorization problem.
The aforementioned NMF algorithms are tested on two datasets. One is a dense image dataset, the Extended Yale Face Database B (http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html), in which each column is a vectorized image of a face; the dataset is a collection of face images of 29 subjects under various poses and illumination conditions. The other is the Topic Detection and Tracking 2 (TDT2) text corpus (http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html), a sparse documentterm matrix in which each entry counts the frequency of a term in one document.
The convergence of the relative error versus time in seconds for the Extended Yale B dataset is shown in Fig. 1, for two different choices of the rank (left and right); the corresponding results for the TDT2 dataset are shown in Fig. 2. The ADMM algorithm [60] is not included for TDT2 because the code provided online is geared towards imputation of matrices with missing values: it does not treat a sparse input matrix as the full data unless we fill in all the zeros.
Table 1: Averaged results on the synthetic NMF data over 100 MonteCarlo trials.

Algorithm | fitting error | run time | iterations
AOADMM    |      193.1026 |   21.7 s |       86.9
AOBPP     |      193.1516 |   40.9 s |       52.2
accHALS   |      193.1389 |   26.8 s |      187.0
APG       |      193.1431 |   25.3 s |      240.2
ADMM      |      193.6808 |   31.9 s |      125.2
We also tested these algorithms on synthetic data. The true latent factors are generated by drawing their elements from an i.i.d. exponential distribution with mean 1, and then setting 50% of the elements, chosen uniformly at random, to 0. The data matrix is the product of the two true factors plus additive noise whose elements are drawn from an i.i.d. Gaussian distribution with variance 0.01. The averaged results of 100 MonteCarlo trials are shown in Table 1. As we can see, AObased methods attain smaller fitting errors than directly applying ADMM to this nonconvex problem, while AOADMM provides the most efficient periteration complexity.
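The data generation just described can be reproduced along the following lines; the matrix sizes and rank below are placeholders of our own choosing, since the actual dimensions are not recoverable from the text.

```python
import numpy as np

# Illustrative regeneration of the synthetic NMF data described above.
rng = np.random.default_rng(0)
m, n, k = 200, 150, 10                       # placeholder sizes and rank
W_true = rng.exponential(scale=1.0, size=(m, k))   # i.i.d. exponential, mean 1
H_true = rng.exponential(scale=1.0, size=(k, n))
W_true[rng.random((m, k)) < 0.5] = 0.0       # 50% of the entries set to zero
H_true[rng.random((k, n)) < 0.5] = 0.0
noise = 0.1 * rng.standard_normal((m, n))    # i.i.d. Gaussian, variance 0.01
Y = W_true @ H_true + noise
```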
5.1.2 Nonnegative PARAFAC
Similar algorithms are compared in the nonnegative PARAFAC case:

- AOBPP: AO using blockprincipalpivoting [33] (same code as above);
- HALS: hierarchical alternating leastsquares [17] (same code as above);
- APG: alternating proximal gradient [59] (same code as above);
- ADMM: ADMM applied to the whole problem [38].
For our proposed AOADMM algorithm, a diminishing proximal regularization term of the form (6) is added to each subproblem to enhance overall convergence, with the regularization parameter updated as in (14).
Two real datasets are tested: one is a dense CT image dataset (http://www.nlm.nih.gov/research/visible/), a collection of 150 CT images of a female ankle; the other is a sparse social network dataset, Facebook Wall Posts (http://konect.unikoblenz.de/networks/facebookwosnwall), which collects the number of wall posts from one Facebook user to another over a period of 1592 days. The sparse tensor is stored in the sptensor format supported by the tensor_toolbox [5], and all the aforementioned algorithms use this toolbox to handle sparse tensor data.
Table 2: Averaged results on the synthetic nonnegative PARAFAC data over 100 MonteCarlo trials.

Algorithm | fitting error | run time  | iterations
AOADMM    |      1117.597 |   145.2 s |       25.1
AOBPP     |      1117.728 |   679.0 s |       22.6
HALS      |      1117.655 |  1838.7 s |      137.7
APG       |      1117.649 |  1077.4 s |      156.3
ADMM      |      1156.799 |   435.9 s |       77.2
Tensorlab |      1118.427 |   375.8 s |        N/A
Similar to the matrix case, the normalized root mean squared error versus time in seconds for the CT dataset is shown in Fig. 3, and that for the Facebook Wall Posts data in Fig. 4, in each case for two different choices of the rank (left and right). As we can see, AOADMM again converges the fastest, not only because of the efficient periteration update of Alg. 1, but also thanks to the additional proximal regularization, which helps the algorithm avoid swamps; these are not uncommon in alternating optimizationbased algorithms for tensor decomposition.
MonteCarlo simulations were also conducted on synthetic 3way nonnegative tensors, with the latent factors generated in the same manner as for the previous NMF synthetic data, and the tensor data generated as the lowrank model synthesized from those factors plus i.i.d. Gaussian noise. The averaged results over 100 trials are given in Table 2. For this experiment we also included Tensorlab [51], which handles nonnegative PARAFAC using “allatonce” updates based on the GaussNewton method. As we can see, AOADMM again outperforms all other algorithms in all cases considered.
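For reference, synthesizing the lowrank tensor from given nonnegative factors amounts to forming the CP/PARAFAC model; a minimal sketch (function name ours):

```python
import numpy as np

def cp_tensor(A, B, C):
    """Synthesize a 3-way tensor from CP/PARAFAC factor matrices:
    T[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)
```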
5.2 Constrained Matrix and Tensor Completion
As discussed before, realworld data are often stored as a sparse array, i.e., in the form of (index,value) pairs. Depending on the application, the unlisted entries in the array can be treated as zeros, or as not (yet) observed but possibly nonzero. A wellknown example of the latter case is the Netflix prize problem, which involves an array of movie ratings indexed by customer and movie. The data is extremely sparse, but the fact that a customer did not rate a movie does not mean that the customer’s rating of that movie would be zero – and the goal is actually to predict those unseen ratings to provide good movie recommendations.
For matrix data with no constraints on the latent factors, convex relaxation techniques involving the matrix nuclear norm have been proposed, with provable matrix reconstruction bounds [12]. Some attempts have been made to generalize the matrix nuclear norm to tensor data [21, 39], but these boil down to the Tucker model rather than the PARAFAC model considered here. A key difference is that Tucker modeling can only hope to impute (recover missing values in) the data, whereas PARAFAC can uniquely recover the latent factors, which are the important ‘dimensions’ of consumer preference in this context. Another key difference is that the aforementioned convex relaxation techniques cannot incorporate constraints on the latent factors, which can improve estimation performance. Taking the Netflix problem as an example, userbias and moviebias terms are often successfully employed in recommender systems; these can easily be subsumed in the factorization formulation by constraining, say, the first column of one factor matrix and the second column of the other to be equal to the allone vector. Moreover, interpreting each column of a factor as the appeal of a certain movie genre to the different users (or the movie ratings for a given type of user, respectively), it is natural to constrain the entries of both factors to be nonnegative.
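If nonnegativity and an allones bias column are imposed on a factor, the corresponding projection remains trivial, because the constraint set is separable across entries; a hypothetical sketch (function name and argument ours):

```python
import numpy as np

def prox_nonneg_with_bias(H, bias_col=0):
    """Projection onto {H : H >= 0, H[:, bias_col] = 1}. Entrywise
    separability means we zero out negatives and pin the designated
    bias column to all ones."""
    H = np.maximum(H, 0.0)
    H[:, bias_col] = 1.0
    return H
```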
When matrix/tensor completion is formulated as a constrained factorization problem using a loss function as in Sec. 3.3, there are traditionally two ways to handle it. One is to use alternating optimization directly, although due to the random positions of the missing values, the leastsquares problem for each row of a factor involves a different subset of the rows of the other factor, making the update inefficient even in the unconstrained case. A more widely used approach is an instance of expectationmaximization (EM): one starts by filling the missing values with zeros, and then iteratively fits a (constrained) lowrank model and imputes the originally missing values with predictions from the interim lowrank model. More recently, an ADMM approach that uses an auxiliary variable for the full data was proposed [60]; on close inspection, that auxiliary variable is exactly equal to the filledin data given by the EM method.
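The EM baseline described above can be sketched as follows for the unconstrained case, with a truncated SVD playing the role of the lowrank fitting step; names and iteration counts are illustrative.

```python
import numpy as np

def em_complete(Y, mask, rank, n_outer=200):
    """Sketch of the EM-style baseline for (unconstrained) low-rank
    matrix completion: zero-fill the missing entries, then alternately
    fit a low-rank model (truncated SVD here) and re-impute the missing
    entries with the interim model's predictions."""
    X = np.where(mask, Y, 0.0)                    # start with zero filling
    M = X
    for _ in range(n_outer):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        M = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # interim low-rank fit
        X = np.where(mask, Y, M)                  # impute missing entries
    return M
```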
In fact, the auxiliary variable that we introduce is similar to that of [60], and thus is also related to the way EM imputes the missing values: one can view our method as imputing the missing values per ADMM innerloop, the method in [60] as imputing per iteration, and EM as imputing after several iterations. Despite these similarities, our proposed AOADMM gives better results than EM. As an illustrative example, consider the Amino acids fluorescence data (http://www.models.kvl.dk/Amino_Acid_fluo), a tensor known to be generated by a rank3 nonnegative PARAFAC model. Some of the entries are known to be badly contaminated and are therefore deleted, as shown in Fig. 5. Imposing nonnegativity on the latent factors, the emission loadings of the three chemical components provided by the EM method (using the Nway toolbox [3]) and by AOADMM are shown in Fig. 6. While both results are satisfactory, AOADMM suppresses the artifacts caused by the systematically missing values in the original data, as indicated by the arrows in Fig. 6.
We now evaluate our proposed AOADMM on a movie rating dataset called MovieLens (http://grouplens.org/datasets/movielens/), which consists of 100,000 movie ratings from 943 users on 1682 movies. MovieLens includes 5 sets of 80%/20% splits of the ratings for training and testing; for each split we fit a matrix factorization model on the 80% training data and evaluate the model on the 20% testing data. The averaged performance of this 5fold cross validation is shown in Fig. 7, where we use the mean absolute error (MAE) for comparison with the classical collaborative filtering result [47] (which attains an MAE of 0.73). On the left of Fig. 7 we use the traditional leastsquares criterion to fit the available ratings, whereas on the right we use the KullbackLeibler divergence, which is a meaningful statistical model for integer data. For each fitting criterion, we compare the performance obtained by imposing Tikhonov regularization, or nonnegativity, or nonnegativity with biases (i.e., in addition constraining the first column of one factor and the second column of the other to be all ones). Some observations are as follows:

Lowrank indeed seems to be a good model for this movie rating data, and the right rank seems to be 4 or 5; higher ranks lead to overfitting, as is evident from Fig. 7;

Imposing nonnegativity reduces the overfitting at higher ranks, whereas the fitting criterion does not seem to play a very important role in terms of performance;

By adding biases, the best case prediction MAE at rank 4 is less than 0.69, an approximately 6% improvement over the best result reported in [47].
Notice that our aim here is to showcase how AOADMM can be used to explore possible extensions of the matrix completion problem formulation, rather than to come up with the best recommender system, which would require significant exploration in its own right. We believe that, with the versatility of AOADMM, researchers can easily test various models for matrix/tensor completion and quickly narrow down the one that works best for their specific application.
5.3 Dictionary Learning
Many natural signals can be represented as an (approximately) sparse linear combination over some (possibly overcomplete) basis, for example the Fourier basis for speech signals and wavelet bases for images. If the basis (or dictionary, when overcomplete) is known, one can directly perform data compression via greedy algorithms or convex relaxations that obtain the sparse representation [9], or even design the sensing procedure to reduce the number of samples required for signal recovery [13]. If the dictionary is not known, one can resort to socalled dictionary learning (DL) to try to learn a sparse representation [55], if one exists. The wellknown benchmark algorithm for DL is K-SVD [2], a geometrybased algorithm that can be viewed as a generalization of the K-means and K-planes clustering algorithms. However, as noted in the original paper, K-SVD does not scale well as the size of the dictionary increases; thus K-SVD is often used to construct a dictionary from small image patches, with a few hundred atoms.
DL can also be formulated as a matrix factorization problem. Denoting the data matrix by Y, the dictionary by D, and the sparse coefficient matrix by S, we have

  minimize_{D,S}  (1/2) ||Y - D S||_F^2 + r(S)        (15)
  subject to      ||d_k||_2 <= 1,  k = 1, ..., K,

where r(.) is a sparsityinducing regularization, e.g., the cardinality, the l1 norm, or the log penalty, and d_k denotes the kth column of D. Conceptually there is no need for a constraint on the dictionary; however, due to the scaling ambiguity inherent in the matrix factorization problem, we need to impose some norm constraint on the scaling of D to make the problem better defined. For example, we can bound the l2 norm of each atom in the dictionary, ||d_k||_2 <= 1, and we adopt this constraint here.
Although bounding the norm of the dictionary columns works well, it also complicates the dictionary update: without this constraint, each row of the dictionary is the solution of an independent leastsquares problem sharing the same mixing matrix, while the constraint couples the columns, making the problem nonseparable. Existing algorithms either solve it approximately [45] or by suboptimal methods like cyclic column updates [40]. On the other hand, this is not a problem at all for our proposed ADMM subroutine Alg. 1: the row separability of the cost function and the column separability of the constraints are handled separately by the two primal variable blocks, while our previously discussed Cholesky caching, warm starting, and good choice of the penalty parameter ensure that an exact dictionary update can be done very efficiently.
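The columnwise proximity step for the normbounded dictionary is simply a projection of each atom onto the unit l2 ball; a one-line sketch (function name ours):

```python
import numpy as np

def project_unit_columns(D):
    """Column-wise projection onto the unit l2-ball, the proximity step
    for the norm-bounded dictionary factor: columns with norm > 1 are
    rescaled to norm 1, shorter columns are left untouched."""
    norms = np.linalg.norm(D, axis=0)
    return D / np.maximum(norms, 1.0)
```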
The update of the coefficient matrix, sometimes called the sparse coding step, is a relatively wellstudied problem for which numerous algorithms have been proposed. We mainly focus on the l1regularized formulation, in which case the subproblem becomes the wellknown LASSO, and in fact a large number of LASSOs sharing the same mixing matrix. Alg. 1 can be used by replacing the proximity step with the softthresholding operator. Furthermore, if an overcomplete dictionary D is trained (more atoms than signal dimensions), the leastsquares step can also be accelerated by using the matrix inversion lemma:

  (D^T D + rho I)^{-1} = (1/rho) ( I - D^T (D D^T + rho I)^{-1} D ),

where rho is the ADMM penalty parameter. Thus, when D has more columns than rows, one can cache the Cholesky factorization of the smaller matrix D D^T + rho I instead, and replace the leastsquares step in Alg. 1 with

  S~ <- (1/rho) ( B - D^T (D D^T + rho I)^{-1} D B ),

where B denotes the righthand side of the original leastsquares step. The use of ADMM for LASSO is also discussed in [1, 61, 20] and [7], and we generally follow the variant described in [7, §6]. Again, note that compared to a plain LASSO, our LASSO subproblems within the AO framework come with good initializations, so only a very small number of ADMM iterations is required for convergence.
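Putting the pieces together, here is a hedged sketch of the sparse coding step: soft-thresholding as the proximity operator, plus the inversion-lemma solve with a cached Cholesky factor of the smaller system. Function names and the fixed iteration count are our own; the warm starts discussed above would replace the zero initialization.

```python
import numpy as np

def soft_threshold(X, t):
    # proximity operator of t*||.||_1 (elementwise shrinkage)
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def ridge_solve_lemma(D, L, rho, B):
    # Solve (D^T D + rho I) X = B via the matrix inversion lemma,
    # with L the cached Cholesky factor of the smaller D D^T + rho I.
    return (B - D.T @ np.linalg.solve(L.T, np.linalg.solve(L, D @ B))) / rho

def admm_lasso(D, Y, lam, rho=1.0, n_iter=100):
    # Many LASSOs min_S 0.5||Y - D S||_F^2 + lam ||S||_1 sharing D.
    m, K = D.shape
    L = np.linalg.cholesky(D @ D.T + rho * np.eye(m))  # cached once
    DtY = D.T @ Y
    S = np.zeros((K, Y.shape[1]))
    U = np.zeros_like(S)
    for _ in range(n_iter):
        Stil = ridge_solve_lemma(D, L, rho, DtY + rho * (S - U))
        S = soft_threshold(Stil + U, lam / rho)        # proximity step
        U = U + Stil - S                               # scaled dual update
    return S
```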
It is interesting to observe that for the particular constraints and regularization used in DL, incorporating nonnegativity maintains the simplicity of our proposed algorithm: for both the norm bound constraint and the l1 regularization, the proximity operator in Alg. 1 with an additional nonnegativity constraint simply requires zeroing out the negative values before performing the same operations. In some applications nonnegativity can greatly help the identification of the dictionary [28].
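With nonnegativity added to the l1 regularization, the proximity operator collapses to a one-sided shrinkage, since zeroing out negatives and then soft-thresholding gives max(x - t, 0); a minimal sketch (function name ours):

```python
import numpy as np

def nonneg_soft_threshold(X, t):
    """Proximity operator for l1 regularization plus nonnegativity:
    zeroing out negative values and then soft-thresholding reduces
    to the one-sided shrinkage max(X - t, 0)."""
    return np.maximum(X - t, 0.0)
```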
As an illustrative example, we trained a dictionary from the MNIST handwritten digits dataset (http://www.cs.nyu.edu/~roweis/data.html), a collection of grayscale images of handwritten digits of size 28 × 28; for each digit we randomly sampled 1000 images, forming a 784 × 10000 matrix. Nonnegativity constraints are imposed on both the dictionary and the sparse coefficients. The trained dictionary after 100 AOADMM (outer) iterations is shown in Fig. 8. On average, approximately 11 atoms are used to represent each image, the model describes approximately 60% of the energy of the original data, and the entire training takes about 40 seconds. Most of the atoms in the dictionary remain readable, which shows the good interpretability afforded by the additional nonnegativity constraint.
For comparison, we ran the same dataset with the same parameter settings through the popular and welldeveloped DL package SPAMS (http://spamsdevel.gforge.inria.fr/index.html). For a fair comparison, we used SPAMS in batch mode with batch size equal to the size of the training data, and ran it for 100 iterations (the same number of iterations as AOADMM). The quality of the SPAMS dictionary is almost the same as that of AOADMM, but SPAMS takes about 3 minutes to run through these 100 iterations, versus 40 seconds for AOADMM. The performance does not change much if we remove the nonnegativity constraint when using SPAMS, although the resulting dictionary then loses interpretability. Notice that SPAMS is fully developed in C++, whereas our implementation is written in plain MATLAB, which leaves considerable room for speed improvement using a lowerlevel language.
6 Conclusion
In this paper we proposed a novel AOADMM algorithmic framework for matrix and tensor factorization under a variety of constraints and loss functions. The main advantages of the proposed AOADMM framework are:

Efficiency. By carefully adopting AO as the optimization backbone and ADMM for the individual subproblems, a significant part of the required computation can be effectively cached, leading to a periteration complexity similar to that of the workhorse ALS algorithm for unconstrained factorization. Warm starts naturally provided by AO, together with judicious regularization and parameter choices, further reduce the number of inner ADMM and outer AO iterations.

Universality. Thanks to ADMM, which is a special case of the proximal algorithm, nonleastsquares terms can be handled efficiently with elementwise complexity using the wellstudied proximity operators. This includes almost all nonparametric constraints and regularization penalties commonly imposed on the factors, and even nonleastsquares fitting criteria.

Convergence. AO guarantees monotone decrease of the loss function, which is a nice property for the NPhard factorization problems considered. Moreover, recent advances on generalizations of the traditional BCD algorithms further guarantee convergence to a stationary point.
Case studies on nonnegative matrix/tensor factorization, constrained matrix/tensor completion, and dictionary learning, with extensive numerical experiments using real data, corroborate our main claims. We believe that AOADMM can serve as a plugandplay framework that allows easy exploration of different types of constraints and loss functions, as well as different types of matrix and tensor (co)factorization models.
References
 [1] M. V. Afonso, J. M. BioucasDias, and M. A. T. Figueiredo, “Fast image recovery using variable splitting and constrained optimization,” IEEE Trans. on Image Processing, vol. 19, no. 9, pp. 2345–2356, 2010.
 [2] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.
 [3] C. A. Andersson and R. Bro, “The N-way toolbox for MATLAB,” Chemometrics and Intelligent Laboratory Systems, vol. 52, no. 1, pp. 1–4, 2000.
 [4] B. W. Bader and T. G. Kolda, “Efficient MATLAB computations with sparse and factored tensors,” SIAM Journal on Scientific Computing, vol. 30, no. 1, pp. 205–231, 2007.
 [5] B. W. Bader, T. G. Kolda et al., “Matlab tensor toolbox version 2.6,” http://www.sandia.gov/~tgkolda/TensorToolbox/, February 2015.
 [6] D. P. Bertsekas, Nonlinear programming. Athena Scientific, 1999.
 [7] S. P. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
 [8] R. Bro and S. De Jong, “A fast nonnegativityconstrained least squares algorithm,” Journal of Chemometrics, vol. 11, no. 5, pp. 393–401, 1997.
 [9] A. M. Bruckstein, D. L. Donoho, and M. Elad, “From sparse solutions of systems of equations to sparse modeling of signals and images,” SIAM review, vol. 51, no. 1, pp. 34–81, 2009.
 [10] X. Cai, Y. Chen, and D. Han, “Nonnegative tensor factorizations using an alternating direction method,” Frontiers of Mathematics in China, vol. 8, no. 1, pp. 3–18, 2013.
 [11] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM, vol. 58, no. 3, p. 11, 2011.
 [12] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.
 [13] E. J. Candès and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 21–30, 2008.
 [14] B. Chen, S. He, Z. Li, and S. Zhang, “Maximum block improvement and polynomial optimization,” SIAM Journal on Optimization, vol. 22, no. 1, pp. 87–107, 2012.
 [15] D. Chen and R. J. Plemmons, “Nonnegativity constraints in numerical analysis,” in Symposium on the Birth of Numerical Analysis, 2009, pp. 109–140.
 [16] J. H. Choi and S. V. N. Vishwanathan, “DFacTo: Distributed factorization of tensors,” in Advances in Neural Information Processing Systems, 2014, pp. 1296–1304.
 [17] A. Cichocki and A.H. Phan, “Fast local algorithms for large scale nonnegative matrix and tensor factorizations,” IEICE Trans. on Fundamentals of Electronics, Communications and Computer Sciences, vol. 92, no. 3, pp. 708–721, 2009.
 [18] J. Duchi, S. ShalevShwartz, Y. Singer, and T. Chandra, “Efficient projections onto the l1-ball for learning in high dimensions,” in Proc. ACM ICML, 2008, pp. 272–279.
 [19] C. Eckart and G. Young, “The approximation of one matrix by another of lower rank,” Psychometrika, vol. 1, no. 3, pp. 211–218, 1936.
 [20] E. Esser, Y. Lou, and J. Xin, “A method for finding structured sparse solutions to nonnegative least squares problems with applications,” SIAM Journal on Imaging Sciences, vol. 6, no. 4, pp. 2010–2046, 2013.
 [21] S. Gandy, B. Recht, and I. Yamada, “Tensor completion and lowrank tensor recovery via convex optimization,” Inverse Problems, vol. 27, no. 2, p. 025010, 2011.
 [22] E. Ghadimi, A. Teixeira, I. Shames, and M. Johansson, “Optimal parameter selection for the alternating direction method of multipliers (ADMM): quadratic problems,” IEEE Transactions on Automatic Control, vol. 60, no. 3, pp. 644–658, March 2015.
 [23] N. Gillis and F. Glineur, “Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization,” Neural Computation, vol. 24, no. 4, pp. 1085–1105, 2012.
 [24] G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins University Press, 1996, vol. 3.
 [25] L. Grippo and M. Sciandrone, “On the convergence of the block nonlinear GaussSeidel method under convex constraints,” Operations Research Letters, vol. 26, no. 3, pp. 127–136, 2000.
 [26] C. J. Hillar and L.H. Lim, “Most tensor problems are NPhard,” Journal of the ACM, vol. 60, no. 6, p. 45, 2013.
 [27] T. Hofmann, “Probabilistic latent semantic indexing,” in Proc. ACM SIGIR Conference, 1999, pp. 50–57.
 [28] P. O. Hoyer, “Nonnegative matrix factorization with sparseness constraints,” Journal of Machine Learning Research, vol. 5, pp. 1457–1469, 2004.
 [29] K. Huang, N. D. Sidiropoulos, and A. Swami, “Nonnegative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition,” IEEE Trans. on Signal Processing, vol. 62, no. 1, pp. 211–224, Jan 2014.
 [30] U. Kang, E. E. Papalexakis, A. Harpale, and C. Faloutsos, “GigaTensor: scaling tensor analysis up by 100 times - algorithms and discoveries,” in Proc. ACM SIGKDD, 2012, pp. 316–324.
 [31] H. Kim and H. Park, “Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method,” SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 2, pp. 713–730, 2008.
 [32] J. Kim and H. Park, “Fast nonnegative matrix factorization: An activesetlike method and comparisons,” SIAM Journal on Scientific Computing, vol. 33, no. 6, pp. 3261–3281, 2011.
 [33] ——, “Fast nonnegative tensor factorization with an activesetlike method,” in HighPerformance Scientific Computing. Springer, 2012, pp. 311–326.
 [34] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009.
 [35] J. B. Kruskal, R. A. Harshman, and M. E. Lundy, “How 3MFA data can cause degenerate PARAFAC solutions, among other relationships,” in Multiway Data Analysis, 1989, pp. 115–122.
 [36] D. D. Lee and H. S. Seung, “Learning the parts of objects by nonnegative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
 [37] ——, “Algorithms for nonnegative matrix factorization,” in Advances in Neural Information Processing Systems (NIPS), 2001, pp. 556–562.
 [38] A. P. Liavas and N. D. Sidiropoulos, “Parallel algorithms for constrained tensor factorization via the alternating direction method of multipliers,” IEEE Trans. on Signal Processing, 2014, submitted.
 [39] J. Liu, P. Musialski, P. Wonka, and J. Ye, “Tensor completion for estimating missing values in visual data,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 208–220, 2013.
 [40] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010.
 [41] B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: A strategy employed by V1?” Vision Research, vol. 37, no. 23, pp. 3311–3325, 1997.
 [42] N. Parikh and S. P. Boyd, “Proximal algorithms,” Foundations and Trends® in Optimization, vol. 1, no. 3, pp. 123–231, 2014.
 [43] N. Ravindran, N. D. Sidiropoulos, S. Smith, and G. Karypis, “Memoryefficient parallel computation of tensor and matrix products for big tensor decomposition,” in Asilomar Conference on Signals, Systems, and Computers, 2014.
 [44] M. Razaviyayn, M. Hong, and Z.Q. Luo, “A unified convergence analysis of block successive minimization methods for nonsmooth optimization,” SIAM Journal on Optimization, vol. 23, no. 2, pp. 1126–1153, 2013.
 [45] M. Razaviyayn, H.W. Tseng, and Z.Q. Luo, “Dictionary learning for sparse representation: Complexity and algorithms,” in Proc. IEEE ICASSP, 2014, pp. 5247–5251.
 [46] E. Ryu and S. P. Boyd, “Primer on monotone operator methods,” Preprint, available at http://web.stanford.edu/~eryu/papers/monotone_notes.pdf.
 [47] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Itembased collaborative filtering recommendation algorithms,” in Proceedings of the 10th International Conference on World Wide Web, 2001, pp. 285–295.
 [48] N. D. Sidiropoulos and R. Bro, “On the uniqueness of multilinear decomposition of Nway arrays,” Journal of chemometrics, vol. 14, no. 3, pp. 229–239, 2000.
 [49] A. Smilde, R. Bro, and P. Geladi, Multiway analysis: applications in the chemical sciences. John Wiley & Sons, 2005.
 [50] S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis, “SPLATT: Efficient and parallel sparse tensormatrix multiplication,” in IEEE International Parallel & Distributed Processing Symposium, 2015.
 [51] L. Sorber, M. Van Barel, and L. De Lathauwer, “Tensorlab v2.0,” http://www.tensorlab.net/, Jan. 2014.
 [52] A. Stegeman, “Finding the limit of diverging components in three-way CANDECOMP/PARAFAC - a demonstration of its practical merits,” Comput. Stat. Data Anal., vol. 75, pp. 203–216, Jul. 2014. [Online]. Available: http://dx.doi.org/10.1016/j.csda.2014.02.010
 [53] D. L. Sun and C. Fevotte, “Alternating direction method of multipliers for nonnegative matrix factorization with the betadivergence,” in Proc. IEEE ICASSP, 2014, pp. 6201–6205.
 [54] G. Tomasi and R. Bro, “A comparison of algorithms for fitting the PARAFAC model,” Computational Statistics & Data Analysis, vol. 50, no. 7, pp. 1700–1734, 2006.
 [55] I. Tosic and P. Frossard, “Dictionary learning,” IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 27–38, 2011.
 [56] P. Tseng, “Convergence of a block coordinate descent method for nondifferentiable minimization,” Journal of Optimization Theory and Applications, vol. 109, no. 3, pp. 475–494, 2001.
 [57] M. H. Van Benthem and M. R. Keenan, “Fast algorithm for the solution of largescale nonnegativityconstrained least squares problems,” Journal of Chemometrics, vol. 18, no. 10, pp. 441–450, 2004.
 [58] S. A. Vavasis, “On the complexity of nonnegative matrix factorization,” SIAM Journal on Optimization, vol. 20, no. 3, pp. 1364–1377, 2009.
 [59] Y. Xu and W. Yin, “A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion,” SIAM Journal on Imaging Sciences, vol. 6, no. 3, pp. 1758–1789, 2013.
 [60] Y. Xu, W. Yin, Z. Wen, and Y. Zhang, “An alternating direction algorithm for matrix completion with nonnegative factors,” Frontiers of Mathematics in China, vol. 7, no. 2, pp. 365–384, 2012.
 [61] J. Yang and Y. Zhang, “Alternating direction algorithms for problems in compressive sensing,” SIAM Journal on Scientific Computing, vol. 33, no. 1, pp. 250–278, 2011.