Robust Low-rank Tensor Recovery: Models and Algorithms
Robust tensor recovery plays an instrumental role in robustifying tensor decompositions for multilinear data analysis against outliers, gross corruptions, and missing values, and it has a diverse array of applications. In this paper, we study the problem of robust low-rank tensor recovery in a convex optimization framework, drawing upon recent advances in robust Principal Component Analysis and tensor completion. We propose tailored optimization algorithms with global convergence guarantees for solving both the constrained and the Lagrangian formulations of the problem. These algorithms are based on the highly efficient alternating direction augmented Lagrangian and accelerated proximal gradient methods. We also propose a nonconvex model that can often improve the recovery results of the convex models. We investigate the empirical recoverability properties of the convex and nonconvex formulations and compare the computational performance of the algorithms on simulated data. We demonstrate through a number of real applications the practical effectiveness of this convex optimization framework for robust low-rank tensor recovery.
Key words. low-rank tensors, higher-order robust PCA, Tucker rank, tensor decompositions, alternating direction augmented Lagrangian, variable-splitting
AMS subject classifications. 15A69, 90C25, 90C30, 65K10
The rapid advance in modern computer technology has given rise to the wide presence of multidimensional data (tensor data). Traditional matrix-based data analysis is inherently two-dimensional, which limits its usefulness in extracting information from a multidimensional perspective. On the other hand, tensor-based multilinear data analysis has shown that tensor models are capable of taking full advantage of the multilinear structures to provide better understanding and more precision. At the core of multilinear data analysis lies tensor decomposition, which commonly takes two forms: the CANDECOMP/PARAFAC (CP) decomposition [10, 22] and the Tucker decomposition. Having originated in the fields of psychometrics and chemometrics, these decompositions are now widely used in other application areas such as computer vision, web data mining [18, 42], and signal processing.
Tensor decomposition faces three major challenges: arbitrary outliers, missing data/partial observations, and computational efficiency. Tensor decomposition resembles Principal Component Analysis (PCA) for matrices in many ways. In fact, the Tucker decomposition is also known as the higher-order SVD (HOSVD). It is well known that PCA is sensitive to outliers and gross corruptions (non-Gaussian noise). Since the CP and Tucker decompositions are also based on least-squares approximation, they are prone to these problems as well. Algorithms based on non-convex formulations have been proposed to robustify tensor decompositions against outliers [15, 36] and missing data. However, they lack global optimality guarantees.
In practice, the underlying tensor data is often low-rank, even though the actual data may not be due to outliers and arbitrary errors. In other words, the major part of the variation in the data is often governed by a relatively small number of latent factors. It is thus possible to robustify tensor decompositions by reconstructing the low-rank part of the noisy data. Besides its importance to tensor decompositions, robust low-rank tensor recovery also has many applications in its own right, e.g., shadow/occlusion removal in image processing. Motivated by the aforementioned challenges and opportunities, we study the problem of low-rank tensor recovery that is robust to gross corruptions and missing values in a convex optimization framework. Our work in this paper builds upon two major lines of previous work: Principal Component Pursuit (PCP) for Robust PCA and Tensor Completion [20, 32, 46]. The main advantages of the convex formulation that we use are that it removes outliers or corrects corruptions based on the global structure of the tensor, and that its solution can be obtained by efficient, convergent convex optimization algorithms that are easy to implement. Moreover, this solution naturally leads to a robust Tucker decomposition, and the CP decomposition can also be obtained by applying a simple heuristic.
The structure of our paper is as follows. In Sections LABEL:sec:notation and LABEL:sec:tensor_decomp, we introduce notation and review some tensor basics. We formulate the robust tensor recovery problem as a convex program in Section LABEL:sec:horpca. Two approaches for recovering a low-Tucker-rank tensor are considered. We propose alternating direction augmented Lagrangian (ADAL) methods and accelerated proximal gradient (APG) methods for solving the exact constrained version and the Lagrangian version of the robust tensor recovery problem, respectively. All of the proposed algorithms have global convergence guarantees. In Section LABEL:sec:ncx, we introduce a non-convex formulation which can nevertheless be solved approximately by an ADAL-based algorithm and can potentially take better advantage of more precise rank information. In Section LABEL:sec:exp, we test the computational performance of the algorithms and analyze the empirical recovery properties of the models on synthetic data, and we demonstrate the practical effectiveness of our algorithms and models through a series of applications in 3D MRI recovery, fluorescence EEM data analysis, face representation, and foreground filtering/background reconstruction of a web game frame sequence.
1.1 Mathematical Notation and Tensor Basics
As in , we denote tensors by boldface Euler script letters, e.g., , matrices by boldface capital letters, e.g., , vectors by boldface lowercase letters, e.g., , and scalars by lowercase letters, e.g., x. The order of a tensor is the number of dimensions (a.k.a. ways or modes). Let be an th-order tensor. A fiber of is a column vector defined by fixing every index of but one. So for a matrix, each row is a mode-2 fiber. The mode- unfolding (matricization) of the tensor is the matrix denoted by that is obtained by arranging (lexicographically in the indices other than the -th index) the mode- fibers as the columns of the matrix. The vectorization of is denoted by .
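For concreteness, the unfolding and refolding operations can be sketched in NumPy as follows. This is a minimal illustration; the exact lexicographic column ordering differs across references, and the convention below (based on C-order reshaping) is one common choice:

```python
import numpy as np

def unfold(X, n):
    """Mode-n unfolding: arrange the mode-n fibers of X as matrix columns."""
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1))

def fold(M, n, shape):
    """Inverse of unfold: rebuild the tensor of the given shape."""
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(np.reshape(M, full), 0, n)

X = np.arange(24).reshape(2, 3, 4)
M = unfold(X, 1)                                # 3 x 8: one mode-1 fiber per column
assert M.shape == (3, 8)
assert np.array_equal(fold(M, 1, X.shape), X)   # the roundtrip is exact
```

Any fixed column ordering works for the algorithms discussed later, since they only need `fold` to invert `unfold` and the singular values of the unfolding are unaffected by column permutations.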
- Inner product and norms
The inner product of two tensors is defined as , and the Frobenius norm of is denoted by . The nuclear norm (or trace norm) of a matrix is the sum of its singular values, i.e. , where the SVD of . The norm of a vector is defined as . Likewise, for a matrix and a tensor , , and .
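These norms are straightforward to compute numerically. The sketch below illustrates them on a small random tensor; the particular matrix used for the nuclear norm is just one reshaping of the tensor, chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))

fro = np.linalg.norm(X)                          # Frobenius norm: sqrt of sum of squares
l1 = np.abs(X).sum()                             # elementwise l1 norm
M = X.reshape(3, -1)                             # a matrix derived from the tensor
nuc = np.linalg.svd(M, compute_uv=False).sum()   # nuclear norm: sum of singular values

assert np.isclose(fro, np.sqrt((X ** 2).sum()))
assert nuc >= np.linalg.norm(M) - 1e-12          # nuclear norm dominates Frobenius norm
```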
- Vector outer product
The vector outer product is denoted by the symbol . The outer product of vectors, is an -th-order tensor, defined as
- Tensor-matrix multiplication
The multiplication of a tensor of size with a matrix in mode is denoted by , and is defined in terms of mode- unfolding as .
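The mode-n tensor-matrix product can thus be implemented by unfolding, multiplying, and refolding. The sketch below assumes the C-order unfolding convention and cross-checks the result against a direct einsum computation of the definition:

```python
import numpy as np

def unfold(X, n):
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1))

def fold(M, n, shape):
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(np.reshape(M, full), 0, n)

def mode_n_product(X, U, n):
    """Mode-n tensor-matrix product, defined via Y_(n) = U X_(n)."""
    shape = list(X.shape)
    shape[n] = U.shape[0]
    return fold(U @ unfold(X, n), n, tuple(shape))

X = np.random.default_rng(1).standard_normal((2, 3, 4))
U = np.random.default_rng(2).standard_normal((5, 3))
Y = mode_n_product(X, U, 1)
assert Y.shape == (2, 5, 4)
assert np.allclose(Y, np.einsum('ijk,lj->ilk', X, U))   # matches the direct definition
```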
- Linear operator and adjoint
We denote linear operators by capital letters in calligraphic font, e.g. , and denotes the result of applying the linear operator to the tensor . The adjoint of is denoted by .
- Homogeneous tensor array
For the ease of notation and visualization, we define a homogeneous tensor array (or tensor array for short) as the tensor obtained by stacking a set of component tensors of the same size along the first mode. An -component tensor array is a ‘vector’ of homogeneous tensors, which we write as TArray(). A linear operator defined on a tensor array operates at the level of the component tensors. As an example, consider the linear (summation) operator such that . Its adjoint is then the linear operator such that . We use the non-calligraphic to denote the matrix corresponding to the equivalent operation carried out by on the mode-1 unfolding of , where . In this example, therefore, .
1.2 Tensor decompositions and ranks
The CP decomposition approximates as , where is a given integer, and for . The CP decomposition is formulated as a non-convex optimization problem and is usually computed via an alternating least-squares (ALS) algorithm; see, e.g., . The rank of , denoted by rank(), is defined as the smallest value of such that the approximation holds with equality. Computing the rank of a given tensor is NP-hard in general.
The Tucker decomposition approximates as , where is called the core tensor, and the factor matrices are all column-wise orthonormal. are given integers. The -rank (or mode- rank) of , denoted by , is the column rank of . The set of -ranks of a tensor is also called the Tucker rank. If is of rank-(), then the approximation holds with equality, and for , is the matrix of the left singular vectors of . The Tucker decomposition is also posed as a non-convex optimization problem. A widely-used approach for computing the factor matrices is called the higher-order orthogonal iteration (HOOI) , which is essentially an ALS method based on computing the dominant left singular vectors of each .
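A minimal sketch of HOOI, assuming the C-order unfolding convention used earlier, a truncated-HOSVD initialization, and a fixed iteration count in place of a proper stopping criterion:

```python
import numpy as np

def unfold(X, n):
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1))

def mode_n_product(X, U, n):
    shape = list(X.shape)
    shape[n] = U.shape[0]
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(np.reshape(U @ unfold(X, n), full), 0, n)

def hooi(X, ranks, iters=10):
    """Higher-order orthogonal iteration: ALS over the Tucker factor matrices."""
    # HOSVD initialization: leading left singular vectors of each unfolding
    U = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :ranks[n]]
         for n in range(X.ndim)]
    for _ in range(iters):
        for n in range(X.ndim):
            Y = X
            for m in range(X.ndim):
                if m != n:
                    Y = mode_n_product(Y, U[m].T, m)   # project out the other modes
            U[n] = np.linalg.svd(unfold(Y, n), full_matrices=False)[0][:, :ranks[n]]
    G = X
    for n in range(X.ndim):
        G = mode_n_product(G, U[n].T, n)               # core tensor
    return G, U

# on an exactly rank-(2, 2, 2) tensor the decomposition is exact
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 2, 2))
for n in range(3):
    X = mode_n_product(X, rng.standard_normal((8, 2)), n)
G, U = hooi(X, (2, 2, 2))
Xh = G
for n in range(3):
    Xh = mode_n_product(Xh, U[n], n)
assert np.linalg.norm(Xh - X) / np.linalg.norm(X) < 1e-8
```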
1.3 Robust PCA
PCA gives the optimal low-dimensional estimate of a given matrix under additive i.i.d. Gaussian noise, but it is also known to be susceptible to gross corruptions and outliers. Robust PCA (RPCA) is a family of methods that aims to make PCA robust to large errors and outliers. Candès et al.  proposed to approach RPCA via Principal Component Pursuit (PCP), which decomposes a given observation (noisy) matrix into a low-rank component and a sparse component by solving the optimization problem . This problem is NP-hard to solve, so  replaces the rank and cardinality () functions with their convex surrogates, the nuclear norm and the norm respectively, and solves the following convex optimization problem:
It has been shown that the optimal solution to problem (LABEL:eq:rpca) exactly recovers the low-rank matrix from arbitrary corruptions as long as the errors are sufficiently sparse relative to the rank of , or more precisely, when the following bounds hold: , where and are positive constants, and is the incoherence parameter.
2 Higher-order RPCA (Robust Tensor Recovery)
For noisy tensor data subject to outliers and arbitrary corruptions, it is desirable to be able to exploit the structure in all dimensions of the data. A direct application of RPCA essentially considers the low-rank structure in only one of the unfoldings and is often insufficient. Hence, we need a model that is directly based on tensors. To generalize RPCA to tensors, we regularize the Tucker rank Trank() of a tensor , leading to the following tensor PCP optimization problem: . This problem is NP-hard to solve, so we replace Trank() by the convex surrogate CTrank, and by to make the problem tractable:
We call the model (LABEL:eq:ho-rpca) Higher-order RPCA (HoRPCA). In the subsequent sections, we consider variations of problem (LABEL:eq:ho-rpca) and develop efficient algorithms for solving these models.
2.1 Singleton Model
In the Singleton model, the tensor rank regularization term is the sum of the nuclear norms of the mode- unfoldings, of , i.e. . HoRPCA with the Singleton low-rank tensor model is thus the convex optimization problem
This form of was also considered in [32, 20, 46] for recovering low-rank tensors from partial observations of the data. To solve problem (LABEL:eq:ho-rpca-single), we develop an alternating direction augmented Lagrangian (ADAL) method [21, 19] (this class of algorithms is also known as the alternating direction method of multipliers (ADMM) [8, 14]) to take advantage of the problem structure. Specifically, by applying variable-splitting (e.g. see [7, 8]) to and introducing auxiliary variables , problem (LABEL:eq:ho-rpca-single) is reformulated as
Note that equality among the ’s is enforced implicitly by the constraints, so that an additional auxiliary variable as in  is not required.
Before we develop the ADAL algorithm for solving problem (LABEL:eq:ho-rpca-c) (see Algorithm LABEL:alg:adal-horpca below), we need to define several operations. returns the tensor such that . is the matrix singular value thresholding operator: , where is the SVD of and . We define . is the shrinkage operator on vec() and returns the result as a tensor. The vector shrinkage operator is defined as , where the operations are all element-wise.
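The two thresholding operators have short NumPy implementations. In this sketch, `shrink` is the elementwise shrinkage (soft-thresholding) operator and `svt` applies shrinkage to the singular values of a matrix:

```python
import numpy as np

def shrink(V, tau):
    """Elementwise shrinkage (soft-thresholding): prox of tau * ||.||_1."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def svt(M, tau):
    """Singular value thresholding: shrink the singular values of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

M = np.diag([3.0, 1.0, 0.2])
T = svt(M, 0.5)
assert np.allclose(np.diag(T), [2.5, 0.5, 0.0])   # each singular value shrunk by 0.5
```

Note that `svt` both shrinks the large singular values and zeroes out those below the threshold, which is what produces low-rank iterates in the algorithms below.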
Problem (LABEL:eq:ho-rpca-c) is in the generic form
which has the augmented Lagrangian .
In , Eckstein and Bertsekas proved Theorem LABEL:thm:admm_conv below, establishing the convergence of the ADAL, Algorithm LABEL:alg:adal:
Consider problem (LABEL:eq:adal_generic), where both and are proper, closed, convex functions, and has full column rank. Then, starting with an arbitrary and , the sequence generated by Algorithm LABEL:alg:adal converges to a Kuhn-Tucker pair of problem (LABEL:eq:adal_generic), if (LABEL:eq:adal_generic) has one. If (LABEL:eq:adal_generic) does not have an optimal solution, then at least one of the sequences and diverges.
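To make the generic two-block scheme concrete, the following sketch instantiates it on the matrix PCP problem from the RPCA section, where block 1 is a nuclear-norm prox (singular value thresholding) and block 2 is an l1 prox (shrinkage). The penalty parameter `mu`, the fixed iteration count, and the synthetic test data are heuristic assumptions, not choices prescribed by the text:

```python
import numpy as np

def shrink(V, tau):
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def svt(M, tau):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def pcp_adal(M, lam, mu, iters=1000):
    """Two-block ADAL for min ||L||_* + lam*||S||_1  s.t.  L + S = M."""
    L, S, Lam = (np.zeros_like(M) for _ in range(3))
    for _ in range(iters):
        L = svt(M - S + Lam / mu, 1.0 / mu)      # block 1: nuclear-norm prox
        S = shrink(M - L + Lam / mu, lam / mu)   # block 2: l1 prox
        Lam += mu * (M - L - S)                  # multiplier update
    return L, S

# synthetic low-rank + sparse data
rng = np.random.default_rng(0)
n = 30
L0 = rng.standard_normal((n, 2)) @ rng.standard_normal((2, n))   # rank 2
S0 = np.zeros((n, n))
mask = rng.random((n, n)) < 0.05
S0[mask] = 10.0 * rng.standard_normal(mask.sum())
M = L0 + S0

L, S = pcp_adal(M, lam=1.0 / np.sqrt(n), mu=n * n / (4.0 * np.abs(M).sum()))
assert np.linalg.norm(M - L - S) / np.linalg.norm(M) < 1e-3    # feasibility
assert np.linalg.norm(L - L0) / np.linalg.norm(L0) < 0.1       # near-exact recovery
```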
The augmented Lagrangian for problem (LABEL:eq:ho-rpca-c) is
Observe that given , the ’s can be solved for independently by the singular value thresholding operator. Conversely, with fixed ’s, the augmented Lagrangian subproblem with respect to has a closed-form solution as we now show. The subproblem under consideration is , which can be written concisely as
where . The following proposition shows that Problem (LABEL:eq:horpca-Esubprob) has a closed-form solution.
Problem (LABEL:eq:horpca-Esubprob) is equivalent to
which has a closed-form solution .
Proof. Since , the first-order optimality conditions for (LABEL:eq:horpca-Esubprob) are , which are the optimality conditions for (LABEL:eq:horpca-Esubprob-easy).
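The averaging-plus-shrinkage structure of this closed form can be verified numerically. The sketch below assumes the subproblem has the form min over E of lam*||E||_1 + (mu/2) * sum_i ||E - T_i||_F^2 with N hypothetical targets T_i (the symbols are elided in the text above), and checks by random perturbation that the claimed minimizer, shrinkage applied to the average of the targets, is optimal:

```python
import numpy as np

def shrink(V, tau):
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

rng = np.random.default_rng(0)
N, lam, mu = 3, 0.7, 1.5
Ts = [rng.standard_normal((4, 5)) for _ in range(N)]   # hypothetical targets

def obj(E):
    return lam * np.abs(E).sum() + 0.5 * mu * sum(((E - T) ** 2).sum() for T in Ts)

E_star = shrink(sum(Ts) / N, lam / (mu * N))           # claimed closed-form minimizer

# the objective is convex, so no perturbation should decrease it at the minimizer
for _ in range(100):
    P = 1e-3 * rng.standard_normal((4, 5))
    assert obj(E_star) <= obj(E_star + P) + 1e-12
```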
In Algorithm LABEL:alg:adal-horpca, we state our ADAL algorithm that alternates between two blocks of variables, and . By defining and , it is easy to verify that problem (LABEL:eq:ho-rpca-c) and Algorithm LABEL:alg:adal-horpca satisfy the conditions in Theorem LABEL:thm:admm_conv. Hence, the convergence of Algorithm LABEL:alg:adal-horpca follows from Theorem LABEL:thm:admm_conv:
The sequence generated by Algorithm LABEL:alg:adal-horpca converges to an optimal solution of problem (LABEL:eq:ho-rpca-c). Hence, the sequence converges to an optimal solution of HoRPCA with the Singleton model (LABEL:eq:ho-rpca-single).
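Putting the pieces together, the two-block iteration can be sketched end-to-end as follows. This is a minimal sketch, not the paper's exact algorithm: the penalty parameter `mu`, the weight `lam`, the iteration count, the synthetic data, and the name `horpca_s` are all illustrative assumptions:

```python
import numpy as np

def unfold(X, n):
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1))

def fold(M, n, shape):
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(np.reshape(M, full), 0, n)

def mode_n_product(X, U, n):
    shape = list(X.shape)
    shape[n] = U.shape[0]
    return fold(U @ unfold(X, n), n, tuple(shape))

def shrink(V, tau):
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def svt(M, tau):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def horpca_s(B, lam, mu, iters=500):
    """ADAL sketch: min sum_i ||X_i unfolded in mode i||_* + lam*||E||_1
    s.t. X_i + E = B for every mode i."""
    N = B.ndim
    Xs = [np.zeros_like(B) for _ in range(N)]
    Lams = [np.zeros_like(B) for _ in range(N)]   # multipliers
    E = np.zeros_like(B)
    for _ in range(iters):
        for i in range(N):   # block 1: one singular value thresholding per mode
            Xs[i] = fold(svt(unfold(B - E + Lams[i] / mu, i), 1.0 / mu), i, B.shape)
        # block 2: closed-form E-update, shrinkage of the averaged residual
        E = shrink(sum(B - Xs[i] + Lams[i] / mu for i in range(N)) / N, lam / (mu * N))
        for i in range(N):
            Lams[i] += mu * (B - Xs[i] - E)
    return sum(Xs) / N, E

# synthetic low-Tucker-rank tensor plus sparse corruptions
rng = np.random.default_rng(0)
X0 = rng.standard_normal((2, 2, 2))
for n in range(3):
    X0 = mode_n_product(X0, rng.standard_normal((15, 2)), n)   # rank (2,2,2)
S0 = np.zeros_like(X0)
mask = rng.random(X0.shape) < 0.05
S0[mask] = 10.0 * rng.standard_normal(mask.sum())
B = X0 + S0

X, E = horpca_s(B, lam=1.0 / np.sqrt(15), mu=0.5)
assert np.linalg.norm(X - X0) / np.linalg.norm(X0) < 0.2
```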
2.2 Mixture Model
The Mixture model for a low-rank tensor was introduced in , which only requires that the tensor be the sum of a set of component tensors, each of which is low-rank in the corresponding mode, i.e. , where is a low-rank matrix for each . This is a relaxed version of the Singleton model, which requires that the tensor be low-rank in all modes simultaneously. It is shown in  that the Mixture model is able to automatically detect the rank-deficient modes and yields better recovery performance than the Singleton model for tensor completion tasks when the original tensor is low-rank only in certain modes.
For robust tensor recovery, the Mixture model is equally applicable to represent the low-rank component of a corrupted tensor. Specifically, we solve the convex optimization problem
This is a more difficult problem to solve than (LABEL:eq:ho-rpca-single); while the subproblem with respect to still has a closed-form solution involving the shrinkage operator , the variables are coupled in the constraint, and it is hard to develop an efficient ADAL algorithm with two-block updates that satisfies the conditions in Theorem LABEL:thm:admm_conv. Motivated by the approximation technique used in , we propose an inexact ADAL algorithm with a global convergence guarantee for solving problem (LABEL:eq:horpca-mix).
Consider the augmented Lagrangian subproblem of (LABEL:eq:horpca-mix) with respect to , which can be simplified to
Let be the summation operator, , and , with . Then, problem (LABEL:eq:horpca-mix-Xsubprob) can be written as . This problem is not separable in , and an iterative method (e.g. ) has to be used to solve this problem. Instead, we consider the proximal approximation of the problem and solve, given ,
where is a user-supplied parameter which is less than (equal to in this case). It is easy to see that problem (LABEL:eq:horpca-mix-Xproximal) is separable in , and the optimal solution is given in closed form by , where . We state this inexact ADAL method below in Algorithm LABEL:alg:adal-horpca-mix. The convergence of the algorithm is guaranteed by:
For any satisfying , the sequence generated by Algorithm LABEL:alg:adal-horpca-mix converges to the optimal solution of problem (LABEL:eq:horpca-mix).
Proof. See Appendix LABEL:sec:app_conv_iadal.
If we have prior knowledge of which modes of are low-rank and which modes are not, we can also use an adaptive version of the Singleton model which allows for different regularization weights to be applied to the nuclear norms:
The downside of using this more general version of the Singleton model over the Mixture model is that there are more parameters to tune, and in general, we do not have the rank information a priori. In our experiments in Section LABEL:sec:exp, we examined the results from the non-adaptive Singleton model, which usually shed some light on the rank of the tensor modes. We then adjusted the ’s accordingly. Note that RPCA applied to the mode- unfolding of a given tensor is a special case of (LABEL:eq:horpca-single-adapt) where for all . When the tensor is low-rank in only one mode, RPCA applied to that particular mode should be able to achieve good recovery results, but it may be necessary to apply RPCA to every unfolding of the tensor to discover the best mode, resulting in a similar number of SVD operations as HoRPCA.
2.3 Unconstrained (Lagrangian) Version
In the Lagrangian version of HoRPCA, the consistency constraints are relaxed and appear as quadratic penalty terms in the objective function. For the Singleton model, we solve the optimization problem
Similarly, the Lagrangian optimization problem for the Mixture model is
Taking advantage of properties of these problems, we develop an accelerated proximal gradient (APG) algorithm by applying FISTA  with continuation to solve them. Before stating this algorithm below in Algorithm LABEL:alg:fista-horpca, we define the following notation. , and , and , for the Singleton model, and , the summation operator defined on , and , for the Mixture model. The global convergence rate is discussed in Appendix LABEL:sec:app_fista.
2.4 Partial Observations
When some of the data is missing, we can enforce consistency on the observable data through the projection operator that selects the set of observable elements () from the data tensor. Assuming that the support of the true error tensor is a subset of , which implies that , we essentially have to solve the same optimization problems as above, with a minor modification to the consistency constraint. The optimization problem corresponding to the Singleton model becomes
and that corresponding to the Mixture model is
Note that without the additional assumption that , it is impossible to recover , since some of the corrupted tensor elements are simply not observed. The -regularization of in the objective function forces any element of an optimal solution in to be zero. In the two-dimensional case, this corresponds to the “Matrix Completion with Corruption” model in [9, 28], and exact recoverability through solving a two-dimensional version of (LABEL:eq:horpca-tc) is guaranteed under some assumptions on the rank, the sparsity, and the fraction of data observed.
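The projection operator is simply a mask over the observed index set. The sketch below, with a hypothetical boolean mask `Omega`, also checks the two basic properties of such a projection, self-adjointness and idempotence:

```python
import numpy as np

rng = np.random.default_rng(0)
Omega = rng.random((4, 5, 6)) < 0.5   # hypothetical mask: True = observed entry

def P(X):
    """Projection onto the observed set: keep entries in Omega, zero out the rest."""
    return np.where(Omega, X, 0.0)

X = rng.standard_normal((4, 5, 6))
Y = rng.standard_normal((4, 5, 6))
assert np.isclose((P(X) * Y).sum(), (X * P(Y)).sum())   # self-adjoint: <P(X), Y> = <X, P(Y)>
assert np.allclose(P(P(X)), P(X))                        # idempotent: P(P(X)) = P(X)
```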
For the Mixture model, we can apply Algorithm LABEL:alg:adal-horpca-mix with only minor modifications. Specifically, we replace with and with in the definition of and hence . (Note that the symbol here denotes the pipeline of two linear operators, not the vector outer product defined in Section LABEL:sec:notation.) The subproblem with respect to becomes (after simplification)
The following proposition shows that the above problem admits the closed-form solution
The optimal solution of the optimization problem
Proof. Let , the vector of the elements of whose indices are in the observation set and be the vector of the remaining elements. We can then write problem (LABEL:eq:horpca-mix-tc-Esubprob-short) as
which is decomposable. Obviously, . The optimal solution to problem (LABEL:eq:horpca-mix-tc-Esubprob-decomp) with respect to is given by the shrinkage operator, . Hence, , and we obtain the optimal solution to problem (LABEL:eq:horpca-mix-tc-Esubprob) by setting .
The proximal approximation to the subproblem with respect to is still separable, and each can be solved for by applying the singular value thresholding operator.
For the Singleton model, we introduce an auxiliary variable and reformulate problem (LABEL:eq:horpca-mix) as
We can then develop an ADAL algorithm that employs two-block updates between and . The solutions for still admit closed-form expressions obtained by applying the singular value thresholding operator. The solution for takes a form similar to the one for the Mixture model. The augmented Lagrangian subproblem with respect to involves solving a linear system, which also has a closed-form solution.
For the two Lagrangian versions of HoRPCA, the required modifications are minimal. We simply need to redefine the linear operator in Algorithm LABEL:alg:fista-horpca to be and apply to . It can also be verified that the Lipschitz constant of is still for both models.
2.5 Related Work
Several methods have been proposed for solving the RPCA problem, including the Iterative Thresholding algorithm, the Accelerated Proximal Gradient (APG/FISTA) algorithm with continuation for the Lagrangian formulation of (LABEL:eq:rpca), a gradient algorithm applied to the dual problem of (LABEL:eq:rpca), and the Inexact Augmented Lagrangian Method (IALM) of . It is reported in  that IALM was faster than APG on simulated data sets.
For the unconstrained formulation of Tensor Completion with the Singleton model,
 and  both proposed an ADAL algorithm based on applying variable-splitting on . For the Mixture model version of (LABEL:eq:tc_unc),  also proposed an ADAL method applied to the dual problem.
There have been some attempts to tackle the HoRPCA problem (LABEL:eq:ho-rpca-single) with applications in computer vision and image processing. The RSTD algorithm proposed in  uses a vanilla Block Coordinate Descent (BCD) approach to solve the unconstrained problem
which applies variable-splitting to both and and relaxes all the constraints as quadratic penalty terms. Compared to HoRPCA-SP, RSTD has many more parameters to tune. Moreover, the BCD algorithm used in RSTD does not enjoy the iteration complexity guarantee of FISTA. (More sophisticated variants of BCD do have iteration complexity results, e.g. [35, 38, 6].)
The TR-MALM algorithm proposed in  is also an ADAL method and solves a relaxed version of (LABEL:eq:ho-rpca-c):
The final solution is given by . Compared to problem (LABEL:eq:ho-rpca-c), problem (LABEL:eq:horpca_trmalm) does not require equality among the auxiliary variables ’s and ’s. This relaxation allows the problem to be decomposed into independent RPCA instances, each of which is solvable by the IALM method mentioned above. However, it also makes the final solution hard to interpret, since consistency among the auxiliary variables is not guaranteed.
The -regularization used in HoRPCA is a special instance of the exponential family used in . That approach is, nevertheless, very different from ours. Before learning the model proposed in , the (heterogeneous) distributions of parts of the data have to be specified, making the method hard to apply. In our case, the locations of the corrupted entries are not known in advance.
Another related regularization technique for robustifying tensor factorization is the norm used in . In fact, the norm is the group Lasso  regularization with each slice (in the case of a 3D tensor) defined as a group, and the regularization is applied to the error term of the tensor factorization. Hence, it is most effective on data in which dense corruptions are concentrated in a small number of slices, i.e. outlying samples.
2.6 Relationship with the Tucker Decomposition
We can also view HoRPCA with the Singleton model as a method for Robust Tensor Decomposition, because from the auxiliary variables ’s, we can reconstruct the core tensor of by , where is the left factor matrix from the SVD of . Note that the matrices are by-products of Algorithms LABEL:alg:adal-horpca and LABEL:alg:fista-horpca. Hence, we recover the Tucker decomposition of containing sparse arbitrary corruptions without the need to specify the target Tucker rank. The CP decomposition of the low-rank tensor can also be obtained from the output of Algorithms LABEL:alg:adal-horpca and LABEL:alg:fista-horpca by applying the classical CP to the core tensor .
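This reconstruction can be sketched as follows, assuming the recovered tensor is exactly low-Tucker-rank. The name `tucker_from_recovery` is hypothetical; as in the text, the factors are taken as the leading left singular vectors of each unfolding and the core is obtained by multiplying their transposes into the tensor:

```python
import numpy as np

def unfold(X, n):
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1))

def mode_n_product(X, U, n):
    shape = list(X.shape)
    shape[n] = U.shape[0]
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(np.reshape(U @ unfold(X, n), full), 0, n)

def tucker_from_recovery(X, ranks):
    """Factors = leading left singular vectors per unfolding; core = X times U_n^T."""
    U = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :ranks[n]]
         for n in range(X.ndim)]
    G = X
    for n in range(X.ndim):
        G = mode_n_product(G, U[n].T, n)
    return G, U

# on an exactly rank-(2, 3, 2) tensor the resulting Tucker decomposition is exact
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3, 2))
for n, size in enumerate((6, 7, 8)):
    X = mode_n_product(X, rng.standard_normal((size, X.shape[n])), n)
G, U = tucker_from_recovery(X, (2, 3, 2))
Xh = G
for n in range(3):
    Xh = mode_n_product(Xh, U[n], n)
assert np.linalg.norm(Xh - X) / np.linalg.norm(X) < 1e-8
```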
3 Constrained Nonconvex Model
We consider a robust model that has explicit constraints on the Tucker rank of the tensor to be recovered. Specifically, we solve the (nonconvex) optimization problem
The global optimal solution to problem (LABEL:eq:horpca-ncx) is generally NP-hard to find. Here, we develop an efficient algorithm based on ADAL that has the same per-iteration complexity as Algorithm LABEL:alg:adal-horpca and finds a reasonably good solution.
Applying the variable-splitting technique that we used for the Singleton model, we reformulate problem (LABEL:eq:horpca-ncx) as
where ’s are auxiliary variables. Although this is a nonconvex problem, applying ADAL still generates well-defined subproblems which have closed-form global optimal solutions.
The augmented Lagrangian for problem (LABEL:eq:horpca-ncx-split), dualizing only the consistency constraints, is equivalent to (up to some constants) , where . For , the augmented Lagrangian subproblem associated with is
We can interpret this problem as finding the Euclidean projection of onto the set of matrices whose ranks are at most . The global optimal solution is given by
(Eckart-Young ) For any positive integer , where is an arbitrary matrix, the best rank- approximation of , i.e. the global optimal solution to the problem is given by , where is a diagonal matrix consisting of the largest diagonal entries of , given the SVD of .
Hence, we obtain the global minimizer of problem (LABEL:eq:horpca-ncx-Xsubprob) by the truncated SVD of , , which can be computed efficiently if is small relative to rank().
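A direct implementation of this truncated-SVD projection, together with a numerical check of the Eckart-Young error identity (the approximation error equals the energy in the discarded singular values):

```python
import numpy as np

def best_rank_r(A, r):
    """Best rank-r approximation of A in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
A2 = best_rank_r(A, 2)
s = np.linalg.svd(A, compute_uv=False)
assert np.isclose(np.linalg.norm(A - A2), np.sqrt((s[2:] ** 2).sum()))
assert np.linalg.matrix_rank(A2) == 2
```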
The augmented Lagrangian subproblem associated with has the same form as it has for the Singleton model. By Proposition LABEL:prop:horpca-Esub-sol, we obtain the solution by . Now, we summarize the main steps in Algorithm LABEL:alg:horpca-ncx (HoRPCA-C). For the extension to the partial data case, we can apply variable-splitting as in (LABEL:eq:horpca-tc-split) to derive a similar ADAL algorithm.
3.2 Convergence Analysis
As mentioned before, ADAL algorithms that alternately minimize two blocks of variables enjoy global convergence results when applied to a convex optimization problem. Since problem (LABEL:eq:horpca-ncx) is nonconvex, these existing convergence results no longer apply. Here, we give a weak convergence result that guarantees that whenever the iterates of Algorithm LABEL:alg:horpca-ncx converge, they converge to a KKT point of problem (LABEL:eq:horpca-ncx). Although we do not have a proof of convergence of the iterates, our experiments show strong indications of convergence. To derive the KKT conditions, we first rewrite problem (LABEL:eq:horpca-ncx) as
where ’s and ’s are matrices of appropriate dimensions with columns and rows respectively. Let be the Lagrange multipliers associated with the constraints. The KKT conditions for problem (LABEL:eq:horpca-ncx-factor) are