Robust Low-Rank Tensor Recovery: Models and Algorithms
Abstract
Robust tensor recovery plays an instrumental role in robustifying tensor decompositions for multilinear data analysis against outliers, gross corruptions, and missing values, and has a diverse array of applications. In this paper, we study the problem of robust low-rank tensor recovery in a convex optimization framework, drawing upon recent advances in robust Principal Component Analysis and tensor completion. We propose tailored optimization algorithms with global convergence guarantees for solving both the constrained and the Lagrangian formulations of the problem. These algorithms are based on the highly efficient alternating direction augmented Lagrangian and accelerated proximal gradient methods. We also propose a nonconvex model that can often improve the recovery results from the convex models. We investigate the empirical recoverability properties of the convex and nonconvex formulations and compare the computational performance of the algorithms on simulated data. We demonstrate through a number of real applications the practical effectiveness of this convex optimization framework for robust low-rank tensor recovery.
Key words. low-rank tensors, higher-order robust PCA, Tucker rank, tensor decompositions, alternating direction augmented Lagrangian, variable-splitting
AMS subject classifications. 15A69, 90C25, 90C30, 65K10
1 Introduction
The rapid advance in modern computer technology has given rise to the wide presence of multidimensional data (tensor data). Traditional matrix-based data analysis is inherently two-dimensional, which limits its usefulness in extracting information from a multidimensional perspective. On the other hand, tensor-based multilinear data analysis has shown that tensor models are capable of taking full advantage of the multilinear structures to provide better understanding and more precision. At the core of multilinear data analysis lies tensor decomposition, which commonly takes two forms: CANDECOMP/PARAFAC (CP) decomposition [10, 22] and Tucker decomposition [48]. Having originated in the fields of psychometrics and chemometrics, these decompositions are now widely used in other application areas such as computer vision [49], web data mining [18, 42], and signal processing [1].
Tensor decomposition faces three major challenges: arbitrary outliers, missing data/partial observations, and computational efficiency. Tensor decomposition resembles Principal Component Analysis (PCA) for matrices in many ways. In fact, Tucker decomposition is also known as higher-order SVD (HOSVD) [11]. It is well-known that PCA is sensitive to outliers and gross corruptions (non-Gaussian noise). Since the CP and Tucker decompositions are also based on least-squares approximation, they are prone to these problems as well. Algorithms based on nonconvex formulations have been proposed to robustify tensor decompositions against outliers [15, 36] and missing data [2]. However, they suffer from the lack of global optimality guarantees.
In practice, the underlying tensor data is often low-rank, even though the observed data may not be, due to outliers and arbitrary errors. In other words, the major part of the variation in the data is often governed by a relatively small number of latent factors. It is thus possible to robustify tensor decompositions by reconstructing the low-rank part of the noisy data. Besides its importance to tensor decompositions, robust low-rank tensor recovery also has many applications in its own right, e.g., shadow/occlusion removal in image processing. Motivated by the aforementioned challenges and opportunities, we study the problem of low-rank tensor recovery that is robust to gross corruptions and missing values in a convex optimization framework. Our work in this paper is built upon two major lines of previous work: Principal Component Pursuit (PCP) for Robust PCA [9] and Tensor Completion [20, 32, 46]. The main advantages of the convex formulation that we use are that it removes outliers or corrects corruptions based on the global structure of the tensor, and that its solution can be obtained by efficient convergent convex optimization algorithms that are easy to implement. Moreover, this solution naturally leads to a robust Tucker decomposition, and the CP decomposition can also be obtained by applying a simple heuristic [46].
The structure of our paper is as follows. In Sections 1.1 and 1.2, we introduce notation and review some tensor basics. We formulate the robust tensor recovery problem as a convex program in Section 2, where two approaches for recovering a low-Tucker-rank tensor are considered. We propose alternating direction augmented Lagrangian (ADAL) methods and accelerated proximal gradient (APG) methods, respectively, for solving the exact constrained version and the Lagrangian version of the robust tensor recovery problem. All of the proposed algorithms have global convergence guarantees. In Section 3, we introduce a nonconvex formulation which can nevertheless be solved approximately by an ADAL-based algorithm and can potentially take better advantage of more precise rank information. In Section 4, we test the computational performance of the algorithms and analyze the empirical recovery properties of the models on synthetic data, and we demonstrate the practical effectiveness of our algorithms and models through a series of applications in 3D MRI recovery, fluorescence EEM data analysis, face representation, and foreground filtering/background reconstruction of a web game frame sequence.
1.1 Mathematical Notation and Tensor Basics
As in [26], we denote tensors by boldface Euler script letters, e.g., $\mathcal{X}$, matrices by boldface capital letters, e.g., $\mathbf{X}$, vectors by boldface lowercase letters, e.g., $\mathbf{x}$, and scalars by lowercase letters, e.g., $x$. The order of a tensor is the number of dimensions (a.k.a. ways or modes). Let $\mathcal{X}\in\mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ be an $N$th-order tensor. A fiber of $\mathcal{X}$ is a column vector defined by fixing every index of $\mathcal{X}$ but one. So for a matrix, each row is a mode-2 fiber. The mode-$i$ unfolding (matricization) of the tensor $\mathcal{X}$ is the matrix, denoted by $\mathbf{X}_{(i)}\in\mathbb{R}^{I_i\times\prod_{j\neq i}I_j}$, that is obtained by arranging (lexicographically in the indices other than the $i$th index) the mode-$i$ fibers as the columns of the matrix. The vectorization of $\mathcal{X}$ is denoted by $\mathrm{vec}(\mathcal{X})$.
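As an illustration of the unfolding operation, here is a minimal NumPy sketch. The helper names `unfold`/`fold` are our own, modes are 0-based in the code, and the exact column ordering of the fibers varies between references:

```python
import numpy as np

def unfold(T, mode):
    """Mode-i unfolding: arrange the mode-i fibers of T as the columns
    of an I_i x (product of the remaining dimensions) matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of unfold: refold a matrix into a tensor of the given shape."""
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape([shape[mode]] + rest), 0, mode)

T = np.arange(24).reshape(2, 3, 4)
M = unfold(T, 1)   # a 3 x 8 matrix; each column is a fiber of the second mode
```

For a matrix `A`, `unfold(A, 1)` is `A.T`, whose columns are the rows of `A`, matching the remark above that each row of a matrix is a mode-2 fiber.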
 Inner product and norms

The inner product of two tensors $\mathcal{X},\mathcal{Y}\in\mathbb{R}^{I_1\times\cdots\times I_N}$ is defined as $\langle\mathcal{X},\mathcal{Y}\rangle := \sum_{i_1,\dots,i_N}x_{i_1\cdots i_N}\,y_{i_1\cdots i_N}$, and the Frobenius norm of $\mathcal{X}$ is denoted by $\|\mathcal{X}\|_F := \sqrt{\langle\mathcal{X},\mathcal{X}\rangle}$. The nuclear norm (or trace norm) of a matrix $\mathbf{X}$ is the sum of its singular values, i.e. $\|\mathbf{X}\|_* := \sum_k\sigma_k$, where $\mathbf{U}\,\mathrm{diag}(\sigma_k)\,\mathbf{V}^\top$ is the SVD of $\mathbf{X}$. The $\ell_1$ norm of a vector $\mathbf{x}$ is defined as $\|\mathbf{x}\|_1 := \sum_i|x_i|$. Likewise, for a matrix $\mathbf{X}$ and a tensor $\mathcal{X}$, $\|\mathbf{X}\|_1 := \sum_{i,j}|x_{ij}|$ and $\|\mathcal{X}\|_1 := \sum_{i_1,\dots,i_N}|x_{i_1\cdots i_N}|$.
 Vector outer product

The vector outer product is denoted by the symbol $\circ$. The outer product of $N$ vectors, $\mathbf{a}^{(1)}\circ\mathbf{a}^{(2)}\circ\cdots\circ\mathbf{a}^{(N)}$, is an $N$th-order tensor, defined as $\big(\mathbf{a}^{(1)}\circ\cdots\circ\mathbf{a}^{(N)}\big)_{i_1 i_2\cdots i_N} := a^{(1)}_{i_1}a^{(2)}_{i_2}\cdots a^{(N)}_{i_N}$.
 Tensormatrix multiplication

The multiplication of a tensor $\mathcal{X}$ of size $I_1\times\cdots\times I_N$ with a matrix $\mathbf{A}\in\mathbb{R}^{J\times I_i}$ in mode $i$ is denoted by $\mathcal{X}\times_i\mathbf{A}$, and is defined in terms of the mode-$i$ unfolding as $(\mathcal{X}\times_i\mathbf{A})_{(i)} := \mathbf{A}\mathbf{X}_{(i)}$.
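The defining identity gives a direct recipe: unfold, multiply, refold. A self-contained NumPy sketch (0-based modes; `mode_product` is our own name):

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, A, mode):
    """Mode-i tensor-matrix product: refold A @ T_(i) into a tensor."""
    rest = [s for i, s in enumerate(T.shape) if i != mode]
    Y = (A @ unfold(T, mode)).reshape([A.shape[0]] + rest)
    return np.moveaxis(Y, 0, mode)

T = np.random.rand(2, 3, 4)
A = np.random.rand(5, 3)
Y = mode_product(T, A, 1)   # result has size 2 x 5 x 4
```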
 Linear operator and adjoint

We denote linear operators by capital letters in calligraphic font, e.g. $\mathcal{A}$, and $\mathcal{A}(\mathcal{X})$ denotes the result of applying the linear operator $\mathcal{A}$ to the tensor $\mathcal{X}$. The adjoint of $\mathcal{A}$ is denoted by $\mathcal{A}^*$.
 Homogeneous tensor array

For the ease of notation and visualization, we define a homogeneous tensor array (or tensor array for short) as the object obtained by stacking a set of component tensors of the same size along the first mode. An $N$-component tensor array is a `vector' of homogeneous tensors, which we write as $\bar{\mathcal{X}} = \mathrm{TArray}(\mathcal{X}_1,\dots,\mathcal{X}_N)$. A linear operator defined on a tensor array operates at the level of the component tensors. As an example, consider the linear (summation) operator $\mathcal{A}$ such that $\mathcal{A}(\bar{\mathcal{X}}) = \sum_{i=1}^N\mathcal{X}_i$. Its adjoint is then the linear operator $\mathcal{A}^*$ such that $\mathcal{A}^*(\mathcal{Y}) = \mathrm{TArray}(\mathcal{Y},\dots,\mathcal{Y})$. We use the non-calligraphic $A$ to denote the matrix corresponding to the equivalent operation carried out by $\mathcal{A}$ on the mode-1 unfolding $\bar{\mathbf{X}}_{(1)}$ of $\bar{\mathcal{X}}$, where $\bar{\mathbf{X}}_{(1)} = [\mathbf{X}_{1,(1)}^\top\ \cdots\ \mathbf{X}_{N,(1)}^\top]^\top$. In this example, therefore, $A = [\mathbf{I}\ \cdots\ \mathbf{I}]$.
1.2 Tensor decompositions and ranks
The CP decomposition approximates $\mathcal{X}$ as $\sum_{r=1}^R\lambda_r\,\mathbf{a}_r^{(1)}\circ\cdots\circ\mathbf{a}_r^{(N)}$, where $R$ is a given integer, $\lambda_r\in\mathbb{R}$, and $\mathbf{a}_r^{(i)}\in\mathbb{R}^{I_i}$ for $i=1,\dots,N$ and $r=1,\dots,R$. The CP decomposition is formulated as a nonconvex optimization problem and is usually computed via an alternating least-squares (ALS) algorithm; see, e.g., [45]. The rank of $\mathcal{X}$, denoted by $\mathrm{rank}(\mathcal{X})$, is defined as the smallest value of $R$ such that the approximation holds with equality. Computing the rank of a specific given tensor is NP-hard in general [23].
The Tucker decomposition approximates $\mathcal{X}$ as $\mathcal{G}\times_1\mathbf{U}_1\times_2\mathbf{U}_2\cdots\times_N\mathbf{U}_N$, where $\mathcal{G}\in\mathbb{R}^{r_1\times\cdots\times r_N}$ is called the core tensor, and the factor matrices $\mathbf{U}_i\in\mathbb{R}^{I_i\times r_i}$ are all columnwise orthonormal. Here, $r_1,\dots,r_N$ are given integers. The mode-$i$ rank of $\mathcal{X}$, denoted by $\mathrm{rank}_i(\mathcal{X})$, is the column rank of $\mathbf{X}_{(i)}$. The set of $N$ mode ranks of a tensor is also called the Tucker rank. If $\mathcal{X}$ is of rank $(r_1,\dots,r_N)$, then the approximation holds with equality, and for each $i$, $\mathbf{U}_i$ is the matrix of the $r_i$ leading left singular vectors of $\mathbf{X}_{(i)}$. The Tucker decomposition is also posed as a nonconvex optimization problem. A widely-used approach for computing the factor matrices is called the higher-order orthogonal iteration (HOOI) [12], which is essentially an ALS method based on computing the dominant left singular vectors of each $\mathbf{X}_{(i)}$.
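For intuition, a (truncated) higher-order SVD, the standard non-iterative way to obtain Tucker factors and a common initialization for HOOI, can be sketched as follows. The helper names are ours and modes are 0-based; with full mode ranks the decomposition is exact:

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, A, mode):
    rest = [s for i, s in enumerate(T.shape) if i != mode]
    return np.moveaxis((A @ unfold(T, mode)).reshape([A.shape[0]] + rest), 0, mode)

def hosvd(T, ranks):
    """Truncated HOSVD: U_i = leading left singular vectors of the mode-i
    unfolding; core G = T x_1 U_1^T x_2 ... x_N U_N^T."""
    U = [np.linalg.svd(unfold(T, i), full_matrices=False)[0][:, :ranks[i]]
         for i in range(T.ndim)]
    G = T
    for i, Ui in enumerate(U):
        G = mode_product(G, Ui.T, i)
    return G, U

T = np.random.rand(4, 5, 6)
G, U = hosvd(T, T.shape)        # full mode ranks: decomposition is exact
R = G
for i, Ui in enumerate(U):
    R = mode_product(R, Ui, i)  # reconstruct T from core and factors
```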
1.3 Robust PCA
PCA gives the optimal low-dimensional estimate of a given matrix under additive i.i.d. Gaussian noise, but it is also known to be susceptible to gross corruptions and outliers. Robust PCA (RPCA) is a family of methods that aim to make PCA robust to large errors and outliers. Candès et al. [9] proposed to approach RPCA via Principal Component Pursuit (PCP), which decomposes a given (noisy) observation matrix $\mathbf{M}$ into a low-rank component $\mathbf{X}$ and a sparse component $\mathbf{E}$ by solving the optimization problem $\min_{\mathbf{X},\mathbf{E}}\ \mathrm{rank}(\mathbf{X})+\lambda\|\mathbf{E}\|_0$, s.t. $\mathbf{X}+\mathbf{E}=\mathbf{M}$. This problem is NP-hard to solve, so [9] replaces the rank and cardinality ($\|\cdot\|_0$) functions with their convex surrogates, the nuclear norm and the $\ell_1$ norm respectively, and solves the following convex optimization problem:
$$\min_{\mathbf{X},\mathbf{E}}\ \|\mathbf{X}\|_* + \lambda\|\mathbf{E}\|_1 \quad\text{s.t.}\quad \mathbf{X}+\mathbf{E}=\mathbf{M}.\tag{1.1}$$
It has been shown that the optimal solution to problem (1.1) exactly recovers the low-rank matrix $\mathbf{X}_0\in\mathbb{R}^{n_1\times n_2}$ from arbitrary corruptions as long as the errors $\mathbf{E}_0$ are sufficiently sparse relative to the rank of $\mathbf{X}_0$, or more precisely, when the following bounds hold [9]: $\mathrm{rank}(\mathbf{X}_0)\le\rho_r\,\frac{\min(n_1,n_2)}{\mu_0\,\log^2\max(n_1,n_2)}$ and $\|\mathbf{E}_0\|_0\le\rho_s\,n_1 n_2$, where $\rho_r$ and $\rho_s$ are positive constants, and $\mu_0$ is the incoherence parameter.
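For the matrix case, the PCP problem (1.1) can be solved by a compact two-block ADMM/ADAL iteration. The following NumPy sketch uses the scaled dual form; the default $\lambda = 1/\sqrt{\max(n_1,n_2)}$ follows the choice suggested in [9], while `rho` and the iteration count are illustrative assumptions, not tuned values:

```python
import numpy as np

def shrink(X, tau):
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(M, tau):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def pcp(M, lam=None, rho=1.0, iters=300):
    """Two-block ADMM for min ||X||_* + lam*||E||_1 s.t. X + E = M
    (scaled dual form; rho and iters are illustrative defaults)."""
    if lam is None:
        lam = 1.0 / np.sqrt(max(M.shape))   # choice suggested in [9]
    E = np.zeros_like(M)
    U = np.zeros_like(M)                    # scaled multiplier
    for _ in range(iters):
        X = svt(M - E - U, 1.0 / rho)       # low-rank block
        E = shrink(M - X - U, lam / rho)    # sparse block
        U = U + X + E - M                   # dual update
    return X, E
```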
2 Higherorder RPCA (Robust Tensor Recovery)
For noisy tensor data subject to outliers and arbitrary corruptions, it is desirable to be able to exploit the structure in all dimensions of the data. A direct application of RPCA essentially considers the low-rank structure in only one of the unfoldings and is often insufficient. Hence, we need a model that is directly based on tensors. To generalize RPCA to tensors, we regularize the Tucker rank $\mathrm{Trank}(\mathcal{X})$ of a tensor $\mathcal{X}$, leading to the following tensor PCP optimization problem: $\min_{\mathcal{X},\mathcal{E}}\ \mathrm{Trank}(\mathcal{X})+\lambda\|\mathcal{E}\|_0$, s.t. $\mathcal{X}+\mathcal{E}=\mathcal{T}$. This problem is NP-hard to solve, so we replace $\mathrm{Trank}(\mathcal{X})$ by a convex surrogate $\mathrm{CTrank}(\mathcal{X})$, and $\|\mathcal{E}\|_0$ by $\|\mathcal{E}\|_1$, to make the problem tractable:
$$\min_{\mathcal{X},\mathcal{E}}\ \mathrm{CTrank}(\mathcal{X}) + \lambda\|\mathcal{E}\|_1 \quad\text{s.t.}\quad \mathcal{X}+\mathcal{E}=\mathcal{T}.\tag{2.1}$$
We call the model (2.1) Higher-order RPCA (HoRPCA). In the subsequent sections, we consider variations of problem (2.1) and develop efficient algorithms for solving these models.
2.1 Singleton Model
In the Singleton model, the tensor rank regularization term is the sum of the nuclear norms of the mode-$i$ unfoldings $\mathbf{X}_{(i)}$ of $\mathcal{X}$, i.e. $\mathrm{CTrank}(\mathcal{X}) := \sum_{i=1}^N\|\mathbf{X}_{(i)}\|_*$. HoRPCA with the Singleton low-rank tensor model is thus the convex optimization problem
$$\min_{\mathcal{X},\mathcal{E}}\ \sum_{i=1}^N\|\mathbf{X}_{(i)}\|_* + \lambda\|\mathcal{E}\|_1 \quad\text{s.t.}\quad \mathcal{X}+\mathcal{E}=\mathcal{T}.\tag{2.2}$$
This form of $\mathrm{CTrank}(\mathcal{X})$ was also considered in [32, 20, 46] for recovering low-rank tensors from partial observations of the data. To solve problem (2.2), we develop an alternating direction augmented Lagrangian (ADAL) method [21, 19] (this class of algorithms is also known as the alternating direction method of multipliers (ADMM) [8, 14]) to take advantage of the problem structure. Specifically, by applying variable-splitting (see, e.g., [7, 8]) to $\mathcal{X}$ and introducing the auxiliary variables $\mathcal{X}_1,\dots,\mathcal{X}_N$, problem (2.2) is reformulated as
$$\min_{\{\mathcal{X}_i\},\mathcal{E}}\ \sum_{i=1}^N\|\mathbf{X}_{i,(i)}\|_* + \lambda\|\mathcal{E}\|_1 \quad\text{s.t.}\quad \mathcal{X}_i+\mathcal{E}=\mathcal{T},\quad i=1,\dots,N.\tag{2.3}$$
Note that equality among the $\mathcal{X}_i$'s is enforced implicitly by the constraints, so that an additional auxiliary variable as in [20] is not required.
Before we develop the ADAL algorithm for solving problem (2.3) (see Algorithm LABEL:alg:adalhorpca below), we need to define several operations. $\mathrm{fold}_i(\mathbf{M})$ returns the tensor $\mathcal{X}$ such that $\mathbf{X}_{(i)}=\mathbf{M}$. $\mathcal{D}_\tau$ is the matrix singular value thresholding operator: $\mathcal{D}_\tau(\mathbf{M}) := \mathbf{U}\,\mathrm{diag}\big(\max(\sigma_k-\tau,0)\big)\,\mathbf{V}^\top$, where $\mathbf{U}\,\mathrm{diag}(\sigma_k)\,\mathbf{V}^\top$ is the SVD of $\mathbf{M}$. We define $\mathcal{T}_{\tau,i}(\mathcal{X}) := \mathrm{fold}_i\big(\mathcal{D}_\tau(\mathbf{X}_{(i)})\big)$. Applied to a tensor, $\mathcal{S}_\tau$ is the shrinkage operator on $\mathrm{vec}(\mathcal{X})$ and returns the result as a tensor. The vector shrinkage operator is defined as $\mathcal{S}_\tau(\mathbf{x}) := \mathrm{sign}(\mathbf{x})\,\max(|\mathbf{x}|-\tau,0)$, where the operations are all elementwise.
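The singular value thresholding and shrinkage operators just defined can be written down directly in NumPy (a sketch; the names `shrink` and `svt` are ours):

```python
import numpy as np

def shrink(X, tau):
    """Shrinkage S_tau: sign(x) * max(|x| - tau, 0), elementwise."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(M, tau):
    """Singular value thresholding D_tau: shrink the singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

shrink(np.array([-2.0, 0.5, 3.0]), 1.0)   # -> array([-1., 0., 2.])
```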
Problem (2.3) is in the generic form
$$\min_{x,y}\ f(x) + g(y) \quad\text{s.t.}\quad Ax + By = c,\tag{2.4}$$
which has the augmented Lagrangian $\mathcal{L}_\mu(x,y,\gamma) := f(x)+g(y)-\langle\gamma,\,Ax+By-c\rangle+\frac{1}{2\mu}\|Ax+By-c\|^2$.
In [14], Eckstein and Bertsekas proved Theorem 2.1 below, establishing the convergence of ADAL (Algorithm LABEL:alg:adal):
Theorem 2.1
Consider problem (2.4), where both $f$ and $g$ are proper, closed, convex functions, and $A$ has full column rank. Then, starting with an arbitrary penalty parameter $\mu>0$ and initial points $y^0$ and $\gamma^0$, the sequence $\{(x^k,y^k,\gamma^k)\}$ generated by Algorithm LABEL:alg:adal converges to a Kuhn–Tucker pair of problem (2.4), if (2.4) has one. If (2.4) does not have an optimal solution, then at least one of the sequences $\{(x^k,y^k)\}$ and $\{\gamma^k\}$ diverges.
The augmented Lagrangian for problem (2.3) is $\mathcal{L}_\mu\big(\{\mathcal{X}_i\},\mathcal{E},\{\Lambda_i\}\big) := \sum_{i=1}^N\Big(\|\mathbf{X}_{i,(i)}\|_* - \langle\Lambda_i,\,\mathcal{X}_i+\mathcal{E}-\mathcal{T}\rangle + \frac{1}{2\mu}\|\mathcal{X}_i+\mathcal{E}-\mathcal{T}\|_F^2\Big) + \lambda\|\mathcal{E}\|_1.$
Observe that given $\mathcal{E}$, the $\mathcal{X}_i$'s can be solved for independently by applying the singular value thresholding operator. Conversely, with the $\mathcal{X}_i$'s fixed, the augmented Lagrangian subproblem with respect to $\mathcal{E}$ has a closed-form solution, as we now show. The subproblem under consideration is $\min_{\mathcal{E}}\ \lambda\|\mathcal{E}\|_1 + \sum_{i=1}^N\big(\frac{1}{2\mu}\|\mathcal{X}_i+\mathcal{E}-\mathcal{T}\|_F^2 - \langle\Lambda_i,\,\mathcal{X}_i+\mathcal{E}-\mathcal{T}\rangle\big)$, which can be written concisely as
$$\min_{\mathcal{E}}\ \lambda\|\mathcal{E}\|_1 + \frac{1}{2\mu}\sum_{i=1}^N\|\mathcal{E}-\mathcal{B}_i\|_F^2,\tag{2.5}$$
where $\mathcal{B}_i := \mathcal{T}-\mathcal{X}_i+\mu\Lambda_i$. The following proposition shows that problem (2.5) has a closed-form solution.
Proposition 2.1
Problem (2.5) is equivalent to
$$\min_{\mathcal{E}}\ \lambda\|\mathcal{E}\|_1 + \frac{N}{2\mu}\big\|\mathcal{E}-\bar{\mathcal{B}}\big\|_F^2, \qquad \bar{\mathcal{B}} := \frac{1}{N}\sum_{i=1}^N\mathcal{B}_i,\tag{2.6}$$
which has the closed-form solution $\mathcal{E}^* = \mathcal{S}_{\lambda\mu/N}(\bar{\mathcal{B}})$.
Proof. Since $\sum_{i=1}^N\|\mathcal{E}-\mathcal{B}_i\|_F^2 = N\|\mathcal{E}-\bar{\mathcal{B}}\|_F^2 + \mathrm{const}$, the first-order optimality conditions for (2.5) are $0 \in \lambda\,\partial\|\mathcal{E}\|_1 + \frac{N}{\mu}(\mathcal{E}-\bar{\mathcal{B}})$, which are exactly the optimality conditions for (2.6).
In Algorithm LABEL:alg:adalhorpca, we state our ADAL algorithm, which alternates between two blocks of variables, $\{\mathcal{X}_i\}_{i=1}^N$ and $\mathcal{E}$. By defining $x := (\mathcal{X}_1,\dots,\mathcal{X}_N)$ and $y := \mathcal{E}$, it is easy to verify that problem (2.3) and Algorithm LABEL:alg:adalhorpca satisfy the conditions in Theorem 2.1. Hence, the convergence of Algorithm LABEL:alg:adalhorpca follows from Theorem 2.1:
Corollary 2.1
The sequence $\{(\mathcal{X}_1^k,\dots,\mathcal{X}_N^k,\mathcal{E}^k)\}$ generated by Algorithm LABEL:alg:adalhorpca converges to an optimal solution of problem (2.3). Hence, the sequence $\{(\mathcal{X}^k,\mathcal{E}^k)\}$, where $\mathcal{X}^k := \frac{1}{N}\sum_{i=1}^N\mathcal{X}_i^k$, converges to an optimal solution of HoRPCA with the Singleton model (2.2).
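To make the overall scheme concrete, the two-block iteration for the Singleton model can be sketched in NumPy as follows. This is a scaled-dual (ADMM-style) rendering rather than the paper's exact algorithm: the penalty `rho` plays the role of $1/\mu$, and the defaults for `rho`, `lam`, and the iteration count are illustrative only:

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def fold(M, mode, shape):
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape([shape[mode]] + rest), 0, mode)

def shrink(X, tau):
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(M, tau):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def horpca_singleton(T, lam, rho=1.0, iters=300):
    """Two-block ADAL sketch for the Singleton model (scaled dual form):
    N independent SVT steps for the X_i block, one shrinkage for E."""
    N = T.ndim
    X = [np.zeros_like(T) for _ in range(N)]
    U = [np.zeros_like(T) for _ in range(N)]   # scaled multipliers
    E = np.zeros_like(T)
    for _ in range(iters):
        for i in range(N):                     # X-block: mode-wise SVT
            X[i] = fold(svt(unfold(T - E - U[i], i), 1.0 / rho), i, T.shape)
        # E-block: shrinkage of the averaged residual (cf. Proposition 2.1)
        E = shrink(np.mean([T - X[i] - U[i] for i in range(N)], axis=0),
                   lam / (N * rho))
        for i in range(N):                     # dual updates
            U[i] += X[i] + E - T
    return np.mean(X, axis=0), E
```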
2.2 Mixture Model
The Mixture model for a low-rank tensor was introduced in [46]. It only requires that the tensor be the sum of a set of component tensors, each of which is low-rank in the corresponding mode, i.e. $\mathcal{X} = \sum_{i=1}^N\mathcal{X}_i$, where $\mathbf{X}_{i,(i)}$ is a low-rank matrix for each $i$. This is a relaxed version of the Singleton model, which requires that the tensor be low-rank in all modes simultaneously. It is shown in [46] that the Mixture model is able to automatically detect the rank-deficient modes and yields better recovery performance than the Singleton model for tensor completion tasks when the original tensor is low-rank in only certain modes.
For robust tensor recovery, the Mixture model is equally applicable to represent the lowrank component of a corrupted tensor. Specifically, we solve the convex optimization problem
$$\min_{\{\mathcal{X}_i\},\mathcal{E}}\ \sum_{i=1}^N\|\mathbf{X}_{i,(i)}\|_* + \lambda\|\mathcal{E}\|_1 \quad\text{s.t.}\quad \sum_{i=1}^N\mathcal{X}_i+\mathcal{E}=\mathcal{T}.\tag{2.7}$$
This is a more difficult problem to solve than (2.2): while the subproblem with respect to $\mathcal{E}$ still has a closed-form solution involving the shrinkage operator $\mathcal{S}_\tau$, the variables $\mathcal{X}_1,\dots,\mathcal{X}_N$ are coupled in the constraint, and it is hard to develop an efficient ADAL algorithm with two-block updates that satisfies the conditions in Theorem 2.1. Motivated by the approximation technique used in [51], we propose an inexact ADAL algorithm to solve problem (2.7) with a global convergence guarantee.
Consider the augmented Lagrangian subproblem of (2.7) with respect to $\{\mathcal{X}_i\}$, which can be simplified to
$$\min_{\{\mathcal{X}_i\}}\ \sum_{i=1}^N\|\mathbf{X}_{i,(i)}\|_* + \frac{1}{2\mu}\Big\|\sum_{i=1}^N\mathcal{X}_i-\mathcal{B}\Big\|_F^2.\tag{2.8}$$
Let $\mathcal{A}$ be the summation operator, $\bar{\mathcal{X}} := \mathrm{TArray}(\mathcal{X}_1,\dots,\mathcal{X}_N)$, and $g(\bar{\mathcal{X}}) := \sum_{i=1}^N\|\mathbf{X}_{i,(i)}\|_*$, with $\mathcal{B} := \mathcal{T}-\mathcal{E}+\mu\Lambda$. Then, problem (2.8) can be written as $\min_{\bar{\mathcal{X}}}\ g(\bar{\mathcal{X}}) + \frac{1}{2\mu}\|\mathcal{A}(\bar{\mathcal{X}})-\mathcal{B}\|_F^2$. This problem is not separable in the $\mathcal{X}_i$'s, and an iterative method (e.g. [33]) would have to be used to solve it. Instead, we consider a proximal approximation of the problem and solve, given the current iterate $\bar{\mathcal{X}}^k$,
$$\min_{\bar{\mathcal{X}}}\ g(\bar{\mathcal{X}}) + \frac{1}{2\tau\mu}\big\|\bar{\mathcal{X}}-\bar{\mathcal{G}}^k\big\|_F^2, \qquad \bar{\mathcal{G}}^k := \bar{\mathcal{X}}^k - \tau\,\mathcal{A}^*\big(\mathcal{A}(\bar{\mathcal{X}}^k)-\mathcal{B}\big),\tag{2.9}$$
where $\tau$ is a user-supplied parameter which must be less than $1/\lambda_{\max}(\mathcal{A}^*\mathcal{A})$ (equal to $1/N$ in this case). It is easy to see that problem (2.9) is separable in the $\mathcal{X}_i$'s, and the optimal solution is given in closed form by $\mathcal{X}_i^{k+1} = \mathrm{fold}_i\big(\mathcal{D}_{\tau\mu}(\mathbf{G}_{i,(i)}^k)\big)$, where $\mathcal{G}_i^k$ is the $i$th component of $\bar{\mathcal{G}}^k$. We state this inexact ADAL method below in Algorithm LABEL:alg:adalhorpcamix. The convergence of the algorithm is guaranteed by:
Theorem 2.2
For any $\tau$ satisfying $0<\tau<1/\lambda_{\max}(\mathcal{A}^*\mathcal{A})=1/N$, the sequence generated by Algorithm LABEL:alg:adalhorpcamix converges to an optimal solution of problem (2.7).
Proof. See Appendix LABEL:sec:app_conv_iadal.
Remark 2.1
If we have prior knowledge of which modes of $\mathcal{X}$ are low-rank and which modes are not, we can also use an adaptive version of the Singleton model, which allows a different regularization weight $\lambda_i$ to be applied to each of the nuclear norms:
$$\min_{\mathcal{X},\mathcal{E}}\ \sum_{i=1}^N\lambda_i\|\mathbf{X}_{(i)}\|_* + \lambda\|\mathcal{E}\|_1 \quad\text{s.t.}\quad \mathcal{X}+\mathcal{E}=\mathcal{T}.\tag{2.10}$$
The downside of using this more general version of the Singleton model over the Mixture model is that there are more parameters to tune, and in general, we do not have the rank information a priori. In our experiments in Section 4, we examined the results from the non-adaptive Singleton model, which usually shed some light on the ranks of the tensor modes. We then adjusted the $\lambda_i$'s accordingly. Note that RPCA applied to the mode-$i$ unfolding of a given tensor is a special case of (2.10) in which $\lambda_j = 0$ for all $j\neq i$. When the tensor is low-rank in only one mode, RPCA applied to that particular mode should be able to achieve good recovery results, but it may be necessary to apply RPCA to every unfolding of the tensor to discover the best mode, resulting in a similar number of SVD operations as HoRPCA.
2.3 Unconstrained (Lagrangian) Version
In the Lagrangian version of HoRPCA, the consistency constraints are relaxed and appear as quadratic penalty terms in the objective function. For the Singleton model, we solve the optimization problem
$$\min_{\{\mathcal{X}_i\},\mathcal{E}}\ \sum_{i=1}^N\|\mathbf{X}_{i,(i)}\|_* + \lambda\|\mathcal{E}\|_1 + \frac{1}{2\mu}\sum_{i=1}^N\|\mathcal{X}_i+\mathcal{E}-\mathcal{T}\|_F^2.\tag{2.11}$$
Similarly, the Lagrangian optimization problem for the Mixture model is
$$\min_{\{\mathcal{X}_i\},\mathcal{E}}\ \sum_{i=1}^N\|\mathbf{X}_{i,(i)}\|_* + \lambda\|\mathcal{E}\|_1 + \frac{1}{2\mu}\Big\|\sum_{i=1}^N\mathcal{X}_i+\mathcal{E}-\mathcal{T}\Big\|_F^2.\tag{2.12}$$
Taking advantage of properties of these problems, we develop an accelerated proximal gradient (APG) algorithm by applying FISTA [5] with continuation to solve them. Before stating this algorithm below in Algorithm LABEL:alg:fistahorpca, we fix notation. The optimization variable is $z := (\bar{\mathcal{X}},\mathcal{E})$, and the nonsmooth part of the objective is $g(z) := \sum_{i=1}^N\|\mathbf{X}_{i,(i)}\|_* + \lambda\|\mathcal{E}\|_1$. The smooth part is the quadratic penalty $f(z) := \frac{1}{2\mu}\|\mathcal{A}(z)-\mathcal{T}_{\mathcal{A}}\|_F^2$, where, for the Singleton model, $\mathcal{A}(z) := \mathrm{TArray}(\mathcal{X}_1+\mathcal{E},\dots,\mathcal{X}_N+\mathcal{E})$ and $\mathcal{T}_{\mathcal{A}} := \mathrm{TArray}(\mathcal{T},\dots,\mathcal{T})$, and, for the Mixture model, $\mathcal{A}(z) := \sum_{i=1}^N\mathcal{X}_i+\mathcal{E}$ (built from the summation operator defined on $\bar{\mathcal{X}}$) and $\mathcal{T}_{\mathcal{A}} := \mathcal{T}$. The global convergence rate is discussed in Appendix LABEL:sec:app_fista.
2.4 Partial Observations
When some of the data is missing, we can enforce consistency on the observable data through the projection operator $\mathcal{A}_\Omega$, which selects the set of observable elements (indexed by $\Omega$) from the data tensor. Assuming that the support $\Pi$ of the true error tensor $\mathcal{E}_0$ is a subset of $\Omega$, which implies that $\mathcal{A}_{\bar{\Omega}}(\mathcal{E}_0)=0$, we essentially have to solve the same optimization problems as above, with a minor modification to the consistency constraint. The optimization problem corresponding to the Singleton model becomes
$$\min_{\{\mathcal{X}_i\},\mathcal{E}}\ \sum_{i=1}^N\|\mathbf{X}_{i,(i)}\|_* + \lambda\|\mathcal{E}\|_1 \quad\text{s.t.}\quad \mathcal{A}_\Omega(\mathcal{X}_i+\mathcal{E})=\mathcal{A}_\Omega(\mathcal{T}),\quad i=1,\dots,N,\tag{2.13}$$
and that corresponding to the Mixture model is
$$\min_{\{\mathcal{X}_i\},\mathcal{E}}\ \sum_{i=1}^N\|\mathbf{X}_{i,(i)}\|_* + \lambda\|\mathcal{E}\|_1 \quad\text{s.t.}\quad \mathcal{A}_\Omega\Big(\sum_{i=1}^N\mathcal{X}_i+\mathcal{E}\Big)=\mathcal{A}_\Omega(\mathcal{T}).\tag{2.14}$$
Note that without the additional assumption that $\Pi\subseteq\Omega$, it is impossible to recover $\mathcal{E}_0$, since some of the corrupted tensor elements are simply not observed. The regularization of $\mathcal{E}$ in the objective function forces every element of an optimal solution $\mathcal{E}^*$ in $\bar{\Omega}$ to be zero. In the two-dimensional case, this corresponds to the "Matrix Completion with Corruption" model in [9, 28], and exact recoverability through solving a two-dimensional version of (2.13) is guaranteed under some assumptions on the rank, the sparsity, and the fraction of data observed [28].
For the Mixture model, we can apply Algorithm LABEL:alg:adalhorpcamix with only minor modifications. Specifically, we replace $\mathcal{A}$ with $\mathcal{A}_\Omega\circ\mathcal{A}$ and $\mathcal{T}$ with $\mathcal{A}_\Omega(\mathcal{T})$ in the definition of $\mathcal{B}$ and hence of $\mathcal{G}_i^k$. (Note that the symbol $\circ$ here denotes the composition of two linear operators, not the vector outer product defined in Section 1.1.) The subproblem with respect to $\mathcal{E}$ becomes (after simplification)
$$\min_{\mathcal{E}}\ \lambda\|\mathcal{E}\|_1 + \frac{1}{2\mu}\big\|\mathcal{A}_\Omega(\mathcal{E})-\mathbf{b}\big\|^2.\tag{2.15}$$
The following proposition shows that the above problem admits the closed-form solution $\mathcal{E}^* = \mathcal{A}_\Omega^*\big(\mathcal{S}_{\lambda\mu}(\mathbf{b})\big)$.
Proposition 2.2
The optimal solution of the optimization problem
$$\min_{\mathcal{E}}\ \lambda\|\mathcal{E}\|_1 + \frac{1}{2\mu}\big\|\mathcal{A}_\Omega(\mathcal{E})-\mathbf{b}\big\|^2\tag{2.16}$$
is $\mathcal{E}^* = \mathcal{A}_\Omega^*\big(\mathcal{S}_{\lambda\mu}(\mathbf{b})\big)$.
Proof. Let $\mathbf{e}_\Omega$ be the vector of the elements of $\mathcal{E}$ whose indices are in the observation set $\Omega$, and let $\mathbf{e}_{\bar{\Omega}}$ be the vector of the remaining elements. We can then write problem (2.16) as
$$\min_{\mathbf{e}_\Omega,\,\mathbf{e}_{\bar{\Omega}}}\ \lambda\|\mathbf{e}_\Omega\|_1 + \lambda\|\mathbf{e}_{\bar{\Omega}}\|_1 + \frac{1}{2\mu}\|\mathbf{e}_\Omega-\mathbf{b}\|^2,\tag{2.17}$$
which is decomposable. Obviously, $\mathbf{e}_{\bar{\Omega}}^* = 0$. The optimal solution to problem (2.17) with respect to $\mathbf{e}_\Omega$ is given by the shrinkage operator, $\mathbf{e}_\Omega^* = \mathcal{S}_{\lambda\mu}(\mathbf{b})$. Hence, we obtain the optimal solution to problem (2.16) by setting $\mathcal{E}^* = \mathcal{A}_\Omega^*(\mathbf{e}_\Omega^*)$.
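This proposition translates into a few lines of code: shrink the observed entries and zero out the rest (a sketch; `masked_shrink` is our own name, and the single threshold `tau` absorbs the product $\lambda\mu$):

```python
import numpy as np

def shrink(X, tau):
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def masked_shrink(B, mask, tau):
    """Minimizer of tau*||E||_1 + 0.5*||E[mask] - B[mask]||^2:
    shrink the observed entries, set unobserved entries to zero."""
    E = np.zeros_like(B)
    E[mask] = shrink(B[mask], tau)
    return E

B = np.array([[3.0, -2.0], [0.5, 4.0]])
mask = np.array([[True, False], [True, True]])
E = masked_shrink(B, mask, 1.0)   # -> [[2., 0.], [0., 3.]]
```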
The proximal approximation to the subproblem with respect to $\bar{\mathcal{X}}$ is still separable, and each $\mathcal{X}_i$ can be solved for by applying the singular value thresholding operator.
For the Singleton model, we introduce an auxiliary variable $\mathcal{Z}$ and reformulate problem (2.13) as
$$\min_{\{\mathcal{X}_i\},\mathcal{E},\mathcal{Z}}\ \sum_{i=1}^N\|\mathbf{X}_{i,(i)}\|_* + \lambda\|\mathcal{E}\|_1 \quad\text{s.t.}\quad \mathcal{X}_i+\mathcal{E}=\mathcal{Z},\ i=1,\dots,N,\qquad \mathcal{A}_\Omega(\mathcal{Z})=\mathcal{A}_\Omega(\mathcal{T}).\tag{2.18}$$
We can then develop an ADAL algorithm that employs two-block updates between $(\{\mathcal{X}_i\},\mathcal{E})$ and $\mathcal{Z}$. The solutions for the $\mathcal{X}_i$'s still admit closed-form expressions obtained by applying the singular value thresholding operator. The solution for $\mathcal{E}$ has a similar form to the one for the Mixture model. The augmented Lagrangian subproblem with respect to $\mathcal{Z}$ involves solving a linear system, which also has a closed-form solution.
For the two Lagrangian versions of HoRPCA, the required modifications are minimal: we simply redefine the linear operator $\mathcal{A}$ in Algorithm LABEL:alg:fistahorpca to be $\mathcal{A}_\Omega\circ\mathcal{A}$ and apply $\mathcal{A}_\Omega$ to $\mathcal{T}$. It can also be verified that the Lipschitz constant of the gradient of the smooth part is unchanged for both models.
2.5 Related Work
Several methods have been proposed for solving the RPCA problem, including the Iterative Thresholding algorithm [50], the Accelerated Proximal Gradient (APG/FISTA) algorithm with continuation [31] for the Lagrangian formulation of (1.1), a gradient algorithm applied to the dual problem of (1.1), and the Inexact Augmented Lagrangian Method (IALM) in [30]. It is reported in [30] that IALM was faster than APG on simulated data sets.
For the unconstrained formulation of Tensor Completion with the Singleton model,
$$\min_{\mathcal{X}}\ \sum_{i=1}^N\lambda_i\|\mathbf{X}_{(i)}\|_* + \frac{1}{2\mu}\big\|\mathcal{A}_\Omega(\mathcal{X})-\mathcal{A}_\Omega(\mathcal{T})\big\|_F^2,\tag{2.19}$$
[20] and [46] both proposed an ADAL algorithm based on applying variable-splitting to $\mathcal{X}$. For the Mixture model version of (2.19), [46] also proposed an ADAL method applied to the dual problem.
There have been some attempts to tackle the HoRPCA problem (2.2) with applications in computer vision and image processing. The RSTD algorithm proposed in [29] uses a vanilla Block Coordinate Descent (BCD) approach to solve an unconstrained reformulation that applies variable-splitting to both $\mathcal{X}$ and $\mathcal{E}$ and relaxes all of the consistency constraints into quadratic penalty terms. Compared to HoRPCA-SP, RSTD has many more parameters to tune. Moreover, the BCD algorithm used in RSTD does not have the iteration complexity guarantee enjoyed by FISTA. (More sophisticated variants of BCD do have iteration complexity results, e.g. [35, 38, 6].)
The TR-MALM algorithm proposed in [43] is also an ADAL method; it solves a relaxed version of (2.3):
$$\min_{\{\mathcal{X}_i\},\{\mathcal{E}_i\}}\ \sum_{i=1}^N\big(\|\mathbf{X}_{i,(i)}\|_* + \lambda\|\mathbf{E}_{i,(i)}\|_1\big) \quad\text{s.t.}\quad \mathcal{X}_i+\mathcal{E}_i=\mathcal{T},\quad i=1,\dots,N.\tag{2.20}$$
The final solution is given by averaging the auxiliary variables: $\mathcal{X} = \frac{1}{N}\sum_{i=1}^N\mathcal{X}_i$ and $\mathcal{E} = \frac{1}{N}\sum_{i=1}^N\mathcal{E}_i$. Compared to problem (2.3), problem (2.20) does not require equality among the auxiliary variables $\mathcal{X}_i$ and $\mathcal{E}_i$. This relaxation allows the problem to be decomposed into $N$ independent RPCA instances, each of which is solvable by the IALM algorithm mentioned above. However, this relaxation also makes the final solution hard to interpret, since consistency among the auxiliary variables is not guaranteed.
The regularization used in HoRPCA is a special instance of the exponential family of regularizers used in [24]. That approach is, nevertheless, very different from ours: before the model proposed in [24] can be learned, the (heterogeneous) distributions of parts of the data have to be specified, which makes the method hard to apply when, as in our case, the locations of the corrupted entries are not known in advance.
Another related regularization technique for robustifying tensor factorization is the group-sparse norm used in [25]. In fact, that norm is the group Lasso [52] regularization with each slice (in the case of a 3D tensor) defined as a group, applied to the error term of the tensor factorization. Hence, it is most effective on data in which dense corruptions are concentrated in a small number of slices, i.e., outlying samples.
2.6 Relationship with the Tucker Decomposition
We can also view HoRPCA with the Singleton model as a method for Robust Tensor Decomposition, because from the auxiliary variables $\mathcal{X}_i$ we can reconstruct the core tensor of $\mathcal{X}$ by $\mathcal{G} = \mathcal{X}\times_1\mathbf{U}_1^\top\times_2\cdots\times_N\mathbf{U}_N^\top$, where $\mathbf{U}_i$ is the left factor matrix from the SVD of $\mathbf{X}_{i,(i)}$. Note that the matrices $\mathbf{U}_i$ are byproducts of Algorithms LABEL:alg:adalhorpca and LABEL:alg:fistahorpca. Hence, we recover the Tucker decomposition of a tensor containing sparse arbitrary corruptions without the need to specify the target Tucker rank. The CP decomposition of the low-rank tensor can also be obtained from the output of Algorithms LABEL:alg:adalhorpca and LABEL:alg:fistahorpca by applying the classical CP decomposition to the core tensor [46].
3 Constrained Nonconvex Model
We consider a robust model that has explicit constraints on the Tucker rank of the tensor to be recovered. Specifically, we solve the (nonconvex) optimization problem
$$\min_{\mathcal{X},\mathcal{E}}\ \|\mathcal{E}\|_1 \quad\text{s.t.}\quad \mathrm{rank}(\mathbf{X}_{(i)})\le r_i,\ i=1,\dots,N,\qquad \mathcal{X}+\mathcal{E}=\mathcal{T}.\tag{3.1}$$
The global optimal solution to problem (3.1) is generally NP-hard to find. Here, we develop an efficient algorithm based on ADAL that has the same per-iteration complexity as Algorithm LABEL:alg:adalhorpca and finds a reasonably good solution.
3.1 Algorithm
Applying the variable-splitting technique that we used for the Singleton model, we reformulate problem (3.1) as
$$\min_{\{\mathcal{X}_i\},\mathcal{E}}\ \|\mathcal{E}\|_1 \quad\text{s.t.}\quad \mathrm{rank}(\mathbf{X}_{i,(i)})\le r_i,\qquad \mathcal{X}_i+\mathcal{E}=\mathcal{T},\quad i=1,\dots,N,\tag{3.2}$$
where the $\mathcal{X}_i$'s are auxiliary variables. Although this is a nonconvex problem, applying ADAL still generates well-defined subproblems which have closed-form global optimal solutions.
The augmented Lagrangian for problem (3.2), dualizing only the consistency constraints, is equivalent (up to some constants) to $\|\mathcal{E}\|_1 + \frac{1}{2\mu}\sum_{i=1}^N\|\mathcal{X}_i+\mathcal{E}-\mathcal{B}_i\|_F^2$, where $\mathcal{B}_i := \mathcal{T}+\mu\Lambda_i$. For $i=1,\dots,N$, the augmented Lagrangian subproblem associated with $\mathcal{X}_i$ is
$$\min_{\mathcal{X}_i}\ \big\|\mathcal{X}_i-(\mathcal{B}_i-\mathcal{E})\big\|_F^2 \quad\text{s.t.}\quad \mathrm{rank}(\mathbf{X}_{i,(i)})\le r_i.\tag{3.3}$$
We can interpret this problem as finding the Euclidean projection of $\mathcal{B}_i-\mathcal{E}$ onto the set of tensors whose mode-$i$ unfolding has rank at most $r_i$. The global optimal solution is given by
Theorem 3.1
(Eckart–Young [13]) For any positive integer $r$, where $\mathbf{A}\in\mathbb{R}^{m\times n}$ is an arbitrary matrix with SVD $\mathbf{A}=\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, the best rank-$r$ approximation of $\mathbf{A}$, i.e. the global optimal solution to the problem $\min_{\mathbf{Z}\,:\,\mathrm{rank}(\mathbf{Z})\le r}\ \|\mathbf{A}-\mathbf{Z}\|_F$, is given by $\mathbf{Z}^*=\mathbf{U}\boldsymbol{\Sigma}_r\mathbf{V}^\top$, where $\boldsymbol{\Sigma}_r$ is the diagonal matrix consisting of the $r$ largest diagonal entries of $\boldsymbol{\Sigma}$ (with the remaining entries set to zero).
Hence, we obtain the global minimizer of problem (3.3) from the rank-$r_i$ truncated SVD of $(\mathcal{B}_i-\mathcal{E})_{(i)}$, which can be computed efficiently if $r_i$ is small relative to $\mathrm{rank}\big((\mathcal{B}_i-\mathcal{E})_{(i)}\big)$.
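The projection step given by Theorem 3.1 is a truncated SVD; a minimal sketch (the name `project_rank` is ours):

```python
import numpy as np

def project_rank(M, r):
    """Euclidean projection onto {Z : rank(Z) <= r} (Eckart-Young):
    keep only the r largest singular triplets of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

A = np.random.rand(6, 2) @ np.random.rand(2, 5)   # rank at most 2
```

In the nonconvex algorithm, this projection would be applied to the mode-$i$ unfolding of $\mathcal{B}_i-\mathcal{E}$ and the result refolded; for large unfoldings, a dedicated truncated-SVD routine would replace the full `np.linalg.svd`.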
The augmented Lagrangian subproblem associated with $\mathcal{E}$ has the same form as it has for the Singleton model. By Proposition 2.1, we obtain the solution $\mathcal{E} = \mathcal{S}_{\mu/N}\big(\frac{1}{N}\sum_{i=1}^N(\mathcal{B}_i-\mathcal{X}_i)\big)$. Now, we summarize the main steps in Algorithm LABEL:alg:horpcancx (HoRPCA-C). For the extension to the partial-data case, we can apply variable-splitting as in (2.18) to derive a similar ADAL algorithm.
3.2 Convergence Analysis
As we mentioned before, ADAL algorithms that alternately minimize two blocks of variables of a convex optimization problem enjoy global convergence results. Since problem (3.1) is nonconvex, the existing convergence results no longer apply. Here, we give a weak convergence result which guarantees that whenever the iterates of Algorithm LABEL:alg:horpcancx converge, they converge to a KKT point of problem (3.1). Although we do not have a proof of convergence for the iterates, our experiments show strong indications of convergence. To derive the KKT conditions, we first rewrite problem (3.1) as
$$\min_{\{\mathcal{X}_i\},\mathcal{E}}\ \|\mathcal{E}\|_1 \quad\text{s.t.}\quad \mathcal{X}_i+\mathcal{E}=\mathcal{T},\qquad \mathbf{X}_{i,(i)}=\mathbf{U}_i\mathbf{V}_i,\quad i=1,\dots,N,\tag{3.4}$$
where the $\mathbf{U}_i$'s and $\mathbf{V}_i$'s are matrices of appropriate dimensions with $r_i$ columns and $r_i$ rows, respectively. Let $\Lambda_i$ be the Lagrange multipliers associated with the consistency constraints. The KKT conditions for problem (3.4) are