Low-Rank Tensor Decomposition via Multiple Reshaping and Reordering Operations
Abstract
Tensor decomposition has been widely applied to find low-rank representations for real-world data and, more recently, for neural-network parameters as well. For the latter, the unfolded matrices may not always be low-rank, because the modes of the parameter tensor usually have no physical meaning that can be exploited for efficiency. This raises the following question: how can we find low-rank structures when the tensor modes have no physical meaning associated with them? To this end, we propose a new decomposition method in this paper. Our method uses reshaping and reordering operations that are strictly more general than the unfolding operation. These operations enable us to discover new low-rank structures that are beyond the reach of existing tensor methods. We prove an important theoretical result establishing conditions under which our method yields a unique solution. The experimental results confirm the correctness of our theoretical results and the effectiveness of our method for weight compression in deep neural networks.
1 Introduction
Low-rank tensor decomposition is a powerful method to represent data with very few features [1]. It has achieved state-of-the-art results in many fields, such as computer vision [2], natural language processing [3, 4], and neuroimage processing [5]. Such applications of tensor methods rely on the fact that real-world datasets have modes with some physical meaning, e.g., the time and space dimensions in a video. This fact is then exploited to find low-rank structure by using the unfolding operation. However, when the tensor lacks such physical meaning, there are no clear guidelines for obtaining an efficient representation using tensor decomposition.
Recent studies that apply tensor methods to deep learning face this issue. Many of these studies aim to find efficient representations of neural-network parameters using tensor decomposition [6, 7, 8, 9, 10]. Unfortunately, the tensor of network parameters does not have any physically meaningful modes. This makes it harder to find operations that would result in a low-rank decomposition. Existing methods either use heuristics or rely on empirical tuning. This raises the following question: how can we find low-rank structures when the tensor modes have no physical meaning associated with them?
To model this problem, we introduce a new kind of shaping operation, which reorganizes the elements of a tensor into a matrix with a given but flexible shape and order. Using such shaping operations, we propose a simple but novel decomposition method called Multi-shaping Low-rankness induced (MsLi) decomposition. The proposed method decomposes a vectorized tensor into a sum of latent components, each of which has a low-rank structure under a different shaping operation. The newly defined shaping operations overcome the limitation that tensor unfolding must be done along the modes, so MsLi naturally has the ability to handle additional low-rank structures compared to conventional methods (see Fig. 1). Meanwhile, owing to the incoherence among the different shaping operations, MsLi yields a unique solution under mild conditions, which is theoretically proved in this paper. In the experiments, we numerically confirm the correctness of the theoretical results, and then apply the proposed model to the weight compression of the fully connected layers in neural networks. The experimental results demonstrate that the weights can be significantly compressed by our method, even with random shaping operations.
2 Related Works
To exploit the low-rank structure of a tensor, several unfolding operations have been introduced together with corresponding tensor decomposition methods. In classical methods such as CP and Tucker decomposition (TD) [1], the tensor is unfolded along each single mode, while the unfolding operations in tensor train (TT) are taken along the first several modes [11, 12]. Moreover, some other unfolding operations have recently been developed for compression and completion problems [13, 14, 15]. However, all the unfolding operations defined in these methods are limited to converting a higher-order tensor into a matrix by concatenating tensor fibers, while our method overcomes this limitation by matricizing the tensor with more general reshaping and reordering operations.
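As a concrete point of comparison, the classical mode-k unfolding can be written in a few lines of NumPy; the sketch below is a generic illustration, not code from the paper:

```python
import numpy as np

def unfold(T, mode):
    """Classical mode-k unfolding: an I_k x (product of the other
    dimensions) matrix obtained by concatenating the mode-k fibers."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

T = np.arange(24).reshape(2, 3, 4)
assert unfold(T, 0).shape == (2, 12)
assert unfold(T, 1).shape == (3, 8)
assert unfold(T, 2).shape == (4, 6)
# Every column of a mode-k unfolding is a contiguous fiber of T; a
# general shaping function may instead place elements arbitrarily.
assert np.array_equal(unfold(T, 2)[:, 0], T[0, 0, :])
```

Each column of such an unfolding is forced to be a fiber of the tensor, which is precisely the restriction the shaping operations of this paper remove.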
Among existing matrix and tensor decomposition methods, the work most similar to ours is convex tensor decomposition (CTD), and especially its variant called latent tensor nuclear norm (LaTNN) minimization, which was proposed in the NIPS 2010 workshop "Tensors, Kernels and Machine Learning" [16]. Compared to LaTNN, our work differs in mainly two aspects. One is the unfolding operation. As a convex alternative to TD, LaTNN only considers unfolding the tensor along each mode, whereas our method considers more flexible shapings, as shown in Fig. 1. The other difference is that our theoretical results on uniqueness further complement the theoretical framework of LaTNN studied in [17]. Although Theorem 3 in [17] shows that the estimation error of the components tends to become smaller as the variance of the Gaussian noise decreases, the upper bound of the estimation error is not guaranteed to go to zero even when the noise variance goes to zero. This implies a theoretical gap between consistency and identification for LaTNN. However, the uniqueness (identification) result for our method can easily be extended to LaTNN to fill this gap. Furthermore, incoherence-based uniqueness conditions are theoretically derived in this paper, which are not studied in [17].
In recent years, LaTNN has been studied continuously. In 2014, Wimalawarne et al. proposed scaled LaTNN, in which a weight is imposed on the nuclear norm of each component [18]. As a regularization term, scaled LaTNN was applied to regression problems [19] in 2016. In the same year, LaTNN-based completion was formulated as a convex factorization machine for the toxicogenomics prediction problem [20]. More recently, Guo et al. proposed a LaTNN-based Frank-Wolfe algorithm for tensor completion [21], and Nimishakavi et al. developed a new algorithm based on Riemannian optimization to ease the sparsity structure induced by LaTNN minimization [22]. However, it should be noted that all of these works focus on the reconstructed data; there is no study that concerns the characteristics of the latent components.
3 Multishaping Lowrankness Induced Decomposition
3.1 Definition of the Shaping Operation
Without loss of generality, we work with vectors instead of higher-order tensors for brevity; all the theoretical results in this paper can be naturally extended to tensors. To pursue additional low-rank structures of the data, we extend the unfolding operation in tensor algebra to a more general linear mapping, by which the elements of an N-dimensional vector are relocated into a matrix with a given shape. Mathematically, it is defined as follows.
Definition 1.
Assume an N-dimensional vector x. Then the shaping operation is defined as a linear mapping Q from R^N to R^{I x J} with IJ = N, such that for every index n of x there exists a unique (non-repeated) pair (i, j) of the matrix for which x_n = [Q(x)]_{i,j} holds, where x_n and [Q(x)]_{i,j} denote the n-th element of x and the (i, j)-th entry of Q(x), respectively.
Remark 1: In tensor algebra, a mode-k tensor unfolding is basically defined as an operation that concatenates all mode-k fibers of the tensor into a matrix [1], and tensor unfolding along a combination of modes obeys more sophisticated rules [11, 12, 14, 13]. Whatever the definition of tensor unfolding in the different literatures, it can simply be seen as reorganizing the elements of a higher-order tensor into a matrix by a specific rule. Hence, Def. 1 provides more general operations, by which the elements can be arbitrarily relocated into a matrix.
Remark 2: There are basically two operations that can be described by the introduced shaping function. One is "reshaping", which shapes a vector into matrices with different numbers of rows and columns. The other is "reordering", which changes the locations of the elements while keeping the shape of the matrix; element-wise reordering is also called random permutation.
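Both operations can be captured in one small helper. The following NumPy sketch (the name `make_shaping` is ours, not from the paper) builds a shaping function as a random permutation composed with a reshape, together with its inverse:

```python
import numpy as np

def make_shaping(n, rows, cols, seed=None):
    """A shaping function Q: R^n -> R^{rows x cols} and its inverse.

    A shaping is a bijection from vector indices to matrix positions;
    here we compose a random permutation ("reordering") with a plain
    reshape ("reshaping").
    """
    assert rows * cols == n
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)

    def Q(v):
        return v[perm].reshape(rows, cols)

    def Q_inv(M):
        out = np.empty(n)
        out[perm] = M.reshape(-1)
        return out

    return Q, Q_inv

v = np.arange(12, dtype=float)
Q, Q_inv = make_shaping(12, 3, 4, seed=0)
M = Q(v)                            # a 3x4 matrix holding the same 12 elements
assert np.allclose(Q_inv(M), v)     # shaping is invertible
```

Since a shaping only relocates elements, it is always invertible, which is what the decomposition model in the next subsection relies on.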
3.2 Basic Decomposition Model
Using the new shaping operation, it is natural to introduce a concept called the shaped rank, defined as the rank of the matrix generated by a shaping function. Compared to the conventional matrix rank, the shaped rank strongly depends on the shaping function: it is generally different if we choose different shaping functions.
Based on the definitions of the shaping function and the shaped rank, Multi-shaping Low-rankness induced (MsLi) decomposition is proposed to decompose an N-dimensional vector into a sum of latent components. Specifically, assume that the observation y is a mixture of K components; then MsLi is simply formulated as
y = x_1 + x_2 + ... + x_K,    (1)
where x_1, ..., x_K denote the latent components. In contrast to existing methods, each latent component x_k is further assumed to have a low shaped-rank structure under its own shaping function Q_k, i.e., Q_k(x_k) is of low rank. Note that we regard the decomposition (1) as a problem of solving linear equations with low shaped-rank constraints. For k ≠ l, the shaping functions Q_k and Q_l determine different feasible sets, so that the solutions for different components x_k and x_l are embedded in different subspaces.
In addition, MsLi degenerates to the conventional low-rank matrix decomposition if the shaping functions are assumed to be the same for all components. Thus the proposed model can simultaneously exploit low-rank structures of the data under multiple shapes and orders, while matrix low-rank decomposition can only consider low-rankness under a single shape. Furthermore, it is well known that conventional low-rank matrix decomposition generally cannot obtain a unique solution because of rotational ambiguity. However, in the next section, we prove that MsLi has a unique solution under mild conditions.
3.3 Learning Algorithm
To calculate the latent components of MsLi, a direct way is to minimize the sum of the shaped ranks of the components, that is,
min_{x_1,...,x_K} Σ_k rank(Q_k(x_k))  s.t.  y = x_1 + ... + x_K,    (2)
where rank(·) denotes the matrix rank. But owing to the non-convexity and computational difficulty of the rank function, we introduce a convex approximation of (2), called cMsLi, by using the matrix nuclear norm (or Schatten-1 norm) as a proxy for the matrix rank, that is,
min_{x_1,...,x_K} Σ_k ||Q_k(x_k)||_*  s.t.  y = x_1 + ... + x_K,    (3)
where ||·||_* denotes the nuclear norm, which equals the sum of the singular values of a matrix and has been theoretically proved to be the convex envelope of the matrix rank [23]. The convexity of the nuclear norm and the linear constraint make model (3) a convex optimization problem. Hence many existing algorithms, such as the alternating direction method of multipliers (ADMM) [24], can be directly used to find its solution. Details of the algorithm are given in the supplementary materials.
However, note that cMsLi has difficulty handling large-scale problems in practice. One reason is the computational complexity of the conventional ADMM method; another is that the storage cost of (3) for all the components is K times larger than that of the original observation, which can lead to serious practical difficulty if the dataset is huge. To solve this problem, we further decompose every shaped component into a product of two smaller matrices based on the low shaped-rank assumption, that is (nMsLi),
min_{U_k, V_k} D( y, Σ_k Q_k^{-1}(U_k V_k) ),    (4)
where Q_k^{-1} denotes the inverse of the shaping function Q_k, and the thin factors U_k and V_k are exploited to approximate the low shaped-rank latent component. Compared to (3), the memory cost of (4) is compressed from the full shaped size to the size of the two thin factors when the rank is much smaller than the numbers of rows and columns. In addition, the function D(·,·) denotes a distance measure between the observation and its approximation. Different choices of D generally imply different assumptions on the error term. For example, we can apply the squared Frobenius norm of the fit error or the cross entropy as D for different applications.
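As a rough illustration of the storage saving, the following sketch stores each component as two thin factors and reconstructs the observation through the inverse shapings. Parameter names and sizes are our own choices, not the paper's:

```python
import numpy as np

n, I, J, r, K = 4096, 64, 64, 4, 3

# cMsLi stores K full shaped components: K * I * J numbers.
dense_cost = K * I * J
# nMsLi stores two thin factors U_k (I x r) and V_k (r x J) per component.
factored_cost = K * (I * r + r * J)
assert factored_cost < dense_cost

rng = np.random.default_rng(0)
perms = [rng.permutation(n) for _ in range(K)]   # random shapings Q_k
U = [rng.standard_normal((I, r)) for _ in range(K)]
V = [rng.standard_normal((r, J)) for _ in range(K)]

def reconstruct():
    """y_hat = sum_k Q_k^{-1}(U_k V_k): the nMsLi approximation."""
    y = np.zeros(n)
    for p, u, v in zip(perms, U, V):
        x = np.empty(n)
        x[p] = (u @ v).reshape(-1)   # inverse shaping: matrix -> vector
        y += x
    return y

y_hat = reconstruct()
```

With these (arbitrary) sizes the factored form stores 1536 numbers instead of 12288, i.e., the cost scales with the rank rather than with the full matrix size.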
4 Theoretical Guarantee for Uniqueness
In this section, we focus on model (3) to discuss the uniqueness of MsLi. The route of the proof is mainly inspired by Chandrasekaran et al.'s work on robust principal component analysis [25]. To simplify the theoretical results, we ignore the influence of noise and first assume that the latent components have low but fixed shaped ranks. Specifically, let x_1, ..., x_K be the latent components of the observation y, with shaping functions Q_1, ..., Q_K as introduced in Section 3. With these notations, we consider the uniqueness of MsLi as the exact reconstruction of the components from the mixture y. In other words, the problem is to find conditions for a unique solution of the underdetermined linear equations (1). To this end, we first define the feasible set of the problem, which is given by
M_k = { x in R^N : rank(Q_k(x)) = r_k },    (5)
Note that all the elements in the set defined by (5) have the same shaped rank, and the set satisfies the requirements of a smooth manifold [26]. Thus the tangent space to the manifold at a component is given by [25]
T_k = { Q_k^{-1}( U_k A^T + B V_k^T ) : A, B arbitrary matrices of matching size },    (6)
where U_k and V_k represent the truncated left and right singular matrices of the shaped component Q_k(x_k), respectively (note that these singular matrices are defined quite differently from the factors in (4)). Using definitions (5) and (6), a preliminary uniqueness condition is given by the following lemma:
Lemma 1.
Assume that, for all , is a given tangent space to at defined in (6), and . Then there exists a unique tuple such that and holds if
(7) 
for all , where denotes a sequential direct sum of the linear subspaces , that is
(8) 
Lemma 1 implies that the components can be perfectly reconstructed as long as there is no common "information" shared among their tangent spaces. However, for cMsLi (3), satisfying condition (7) alone cannot guarantee exact reconstruction. In order to obtain the unique solution of (3), we impose further conditions, as in the following theorem:
Theorem 1.
Assume that , and are defined as (6). Then the tuple is the unique solution of (3) if the following conditions are satisfied for all :

.

There exists a dual such that

where denotes the spectral norm which is defined as the largest singular value of a matrix, represents the normal space [26] to the manifold at , is a projection into a linear subspace , and denote the left and right truncated singular matrices of , respectively.
To build a more intuitive connection between the uniqueness of MsLi and the shaped ranks of the latent components, we introduce a new kind of incoherence measurement for the shaped rank; a similar concept was also mentioned in [17]. In MsLi, the shaped-rank incoherence measurement for the k-th component is defined as
(9) 
The measurement (9) reflects how the rank of a latent component differs under different shaping functions. Roughly speaking, a smaller value of the measurement implies that the component has a much higher rank under the other shaping functions than under its own. The following lemma revisits the uniqueness problem of Lemma 1 in terms of the incoherence measurements.
Lemma 2.
Assume that, for all , is a given tangent space to at defined in (6), and . Then there exists a unique tuple such that and holds if
(10) 
Lemma 2 implies that MsLi has a unique solution if the incoherence measurement of every component is small enough. In other words, MsLi is unique as long as the rank of each component under its own shaping is far smaller than its rank under the other shapings. Similarly, for exact reconstruction by (3), we have the following result:
Theorem 2.
Based on Theorem 2, we can obtain a more interesting uniqueness condition for cMsLi by putting more assumptions on the shaping functions. That is,
Corollary 1.
Assume that . Define as the shaping functions. In addition, the function obeys the following assumptions: (a) the rank of equals ; (b) it is full-rank for all matrices; (c) for each matrix, its nonzero singular values are equal to each other. Then is the unique solution of (3) if .
Corollary 1 implies that the matrix size has a linear relationship to the shaped rank of the components, but a quadratic relationship to the number of components, for the uniqueness of cMsLi. Although the assumptions in Corollary 1 are so strong that no shaping functions may totally satisfy them, the result still reveals the relations among the size, the rank, and the number of components in the context of uniqueness. Furthermore, we found that some shaping operations, such as random permutation of a vector, can be considered rough but suitable approximations of the functions assumed in Corollary 1.
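The incoherence induced by random permutations is easy to observe numerically: a component that is low-rank under its own shaping is generically full-rank after an independent random permutation. A small sketch with sizes of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
I = J = 32
r = 2
n = I * J

# A component that is rank-r under its "own" shaping (the identity here) ...
M = rng.standard_normal((I, r)) @ rng.standard_normal((r, J))
x = M.reshape(-1)

# ... is generically full-rank after a different random permutation.
other_perm = rng.permutation(n)
M_other = x[other_perm].reshape(I, J)

rank_own = np.linalg.matrix_rank(M)
rank_other = np.linalg.matrix_rank(M_other)
assert rank_own == r
assert rank_other > rank_own   # a large gap means strong incoherence
```

This is the rank gap that the incoherence measurement (9) quantifies.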
5 Experiments
5.1 Validation of Uniqueness
We first use synthetic data to confirm the theoretical results of the paper. The goal of the first experiment is to recover the true latent components from a mixture. In the experiment, we suppose that the observation contains several components, and all the shaped components are square matrices. Each shaped component is generated as a product of two random semi-orthonormal matrices with the same rank. The shaping functions are instantiated by different random element-wise permutations.
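The data-generation procedure can be sketched as follows; this is a simplified reproduction under our own parameter choices, not the authors' experimental code:

```python
import numpy as np

def random_low_rank(I, r, rng):
    """Rank-r square matrix: product of two random semi-orthonormal factors."""
    A = np.linalg.qr(rng.standard_normal((I, r)))[0]   # I x r, orthonormal columns
    B = np.linalg.qr(rng.standard_normal((I, r)))[0]
    return A @ B.T

rng = np.random.default_rng(0)
I, r, K = 20, 3, 3
n = I * I
# Random element-wise permutations play the role of the shapings.
perms = [rng.permutation(n) for _ in range(K)]
components = []
for p in perms:
    x = np.empty(n)
    x[p] = random_low_rank(I, r, rng).reshape(-1)  # low shaped-rank component, vector form
    components.append(x)
y = sum(components)  # the observed mixture fed to cMsLi
```

Each component is exactly rank r under its own permutation, and the mixture y is all the algorithm gets to see.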
As for the algorithm, we choose (3) as the optimization model and utilize ADMM to search for the optimal point. Meanwhile, the total signal-to-interference ratio (tSIR) is used to evaluate the reconstruction error of the method.
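The exact tSIR formula is not reproduced above; a common signal-to-interference ratio in dB, summed over components, would look like the following sketch (an assumption on the definition, not a quotation from the paper):

```python
import numpy as np

def tSIR(true_components, est_components):
    """Total signal-to-interference ratio in dB: energy of the true
    components over the energy of the estimation errors."""
    signal = sum(np.linalg.norm(x) ** 2 for x in true_components)
    error = sum(np.linalg.norm(x - xh) ** 2
                for x, xh in zip(true_components, est_components))
    return 10.0 * np.log10(signal / error)

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
noisy = x + 1e-3 * rng.standard_normal(100)
# A nearly exact reconstruction gives a very large tSIR.
assert tSIR([x], [noisy]) > 40.0
```

Under this definition, larger values mean better reconstruction, with exact recovery driving the ratio toward infinity.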
Fig. 2 shows the phase transition of cMsLi with different parameters, such as the shaped component rank, size, and number. In each panel, the white blocks indicate exact reconstruction, the black blocks indicate that cMsLi failed completely, and the gray area corresponds to boundary results. In addition, the red line in the figure represents the uniqueness bound derived from Corollary 1; the uniqueness is theoretically guaranteed in the area above the red line. As shown in Fig. 2, the following facts are numerically verified: (1) cMsLi can exactly reconstruct the latent components under certain conditions; (2) although the theoretical bound is not tight, the experimental results reflect the correctness of the theoretical analysis; (3) if we fix the number of components, as in panels (a) and (b) of Fig. 2, there is a linear relationship between the size and the rank for exact reconstruction, but the relationship becomes quadratic if we consider the number of components against their rank, as in panels (c) and (d). This agrees with the formula in Corollary 1.
To simulate a more practical setting, real grayscale images are used instead of synthetic data as the latent components. In contrast to synthetic data, the low-rank structures of real images are more complicated, because each image has many small but nonzero singular values. The details of the experiment are as follows. A total of 9 grayscale real images are used as candidates for the shaped latent components (the candidates are facade, lena, baboon, peppers, sailboat, barbara, airplane, house, and giant, which are popularly used in image processing). For each of them, the size is fixed, the range is rescaled, and the mean value is removed. As the shaping functions, we use the same manner as in the first experiment. In each run of the experiment, images are randomly chosen (with replacement) from the candidates and then shaped and mixed as the observation. Table 1 shows the tSIR performance of independent runs with different numbers of components. To compare with conventional low-rank matrix decomposition, we apply PCA, RPCA [25], and BM3D [27] to the shaped data as baselines. In each of them, the observation is first shaped into the shape of the targeted component, and the competing method tries to reconstruct the target component while treating the other components as random noise. As illustrated in Table 1, the images are reconstructed more precisely by cMsLi than by the other methods, and the reconstruction accuracy decreases as the number of latent components grows. From these results, it can also be inferred that cMsLi has better capacity than other matrix-based methods to represent data whose components have low-rank structures with different shapes. Fig. 3 shows an example of the reconstructions by the different methods; we can easily see from Fig. 3 that the proposed method gives more precise reconstructions of the images.
Table 1: tSIR performance of cMsLi, RsPCA, RsRPCA, and RsBM3D.
5.2 Weights Compression in Neural Networks
In this section, we demonstrate that not only tensorizing but also random element-wise permutation of the weights can compress the weights of neural networks very well. The experiments are run on both MNIST [28] and CIFAR-10 [29] for image classification. On MNIST, we use a two-layer neural network as a baseline, with 1024 hidden rectified linear units (ReLU). In addition, the original images are resized as in [7], so that we obtain a weight matrix in the first layer. On CIFAR-10, we apply a scheme similar to the one in [7]: the network includes two convolution layers, each consisting of convolution, pooling, and ReLU, followed by two fully-connected layers, and we take the weight matrix of the first fully-connected layer, which is denoted in the same way without ambiguity.
To compare compression performance, we apply different low-rank matrix/tensor decompositions to the weight matrix for both networks, including matrix decomposition (MD), tensor train (TT), Tucker decomposition (TD), TD plus random permutation (RsTD), and the proposed nMsLi. Note that we control the rank of MD, TT, and (Rs)TD to change the fitting capacity, but we fix the rank in nMsLi and control the number of components, so that the only difference between MD and nMsLi is the additional shaping operations, i.e., random permutations. Fig. 4 shows the test error against the number of parameters for each method and for the uncompressed baseline. As shown in Fig. 4, the performance of MD degrades drastically if its rank is small. But as soon as we impose random permutation operations, the performance is significantly improved by nMsLi. More interestingly, the performance is also improved by RsTD compared to the conventional TD method, especially when the number of parameters is small. However, it is worth noting that the TT-based method still gives the state-of-the-art results in our experiments. This is mainly because the compression ability of nMsLi is limited, so its number of parameters is much larger than TT's even if we use just one rank-1 component to represent the weight matrix. Nevertheless, the performance of nMsLi and RsTD is comparable to TT even though the shaping operations are randomly generated.
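To make the parameterization concrete, a forward pass through a fully-connected layer whose weight matrix is stored as a sum of randomly permuted low-rank factors could be sketched as follows (illustrative names and sizes of our own, not the experimental code):

```python
import numpy as np

def msli_dense(x, perms, Us, Vs, out_dim, in_dim):
    """Dense layer whose weight W is the sum of K randomly permuted
    rank-r factors (the nMsLi parameterization of the weight)."""
    n = out_dim * in_dim
    w_vec = np.zeros(n)
    for p, U, V in zip(perms, Us, Vs):
        comp = np.empty(n)
        comp[p] = (U @ V).reshape(-1)   # inverse shaping: matrix -> vector
        w_vec += comp
    W = w_vec.reshape(out_dim, in_dim)
    return x @ W.T

rng = np.random.default_rng(0)
out_dim, in_dim, r, K = 256, 256, 1, 4
perms = [rng.permutation(out_dim * in_dim) for _ in range(K)]
Us = [0.01 * rng.standard_normal((out_dim, r)) for _ in range(K)]
Vs = [0.01 * rng.standard_normal((r, in_dim)) for _ in range(K)]
x = rng.standard_normal((8, in_dim))
y = msli_dense(x, perms, Us, Vs, out_dim, in_dim)
# K rank-r components cost K*r*(out_dim + in_dim) numbers versus
# out_dim*in_dim for the dense weight.
```

In training, the factors (and, if desired, a bias) would be the learnable parameters, while the permutations stay fixed after random initialization.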
6 Discussion
By overcoming the limitation of the tensor unfolding operation, additional low-rank structures can be exploited, making it possible to find more compact low-rank representations than the conventional tensor decomposition methods. Moreover, the proposed method exhibits an unusual property, namely uniqueness. In conventional tensor decomposition, when merely considering the low-rank assumption, a unique solution can only be obtained by CP decomposition, whose algorithms generally have nerve-wracking numerical characteristics [30]. Hence the studies in this paper further reveal the potential importance of the low-rank structures contained in a tensor.
Moreover, owing to the uniqueness, we can also treat MsLi as a new blind source separation (BSS) method; BSS has been successfully used in brain and speech signal processing [31, 32]. In contrast to classical BSS methods, which depend on the non-Gaussianity or sparsity of the data, MsLi relies on the low-rank structure and the incoherence of the shaping operations. Recently it was proved that wireless communication signals also have low-rank structures under multiple specific shaping operations [33]. Hence we believe the method proposed in this paper will be promising in various applications.
References
[1] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[2] Guillaume Rabusseau and Hachem Kadri. Low-rank regression with tensor responses. In Advances in Neural Information Processing Systems, pages 1867–1875, 2016.
[3] Vatsal Sharan and Gregory Valiant. Orthogonalized ALS: A theoretically principled tensor decomposition algorithm for practical use. arXiv preprint arXiv:1703.01804, 2017.
[4] Cesar F. Caiafa, Olaf Sporns, Andrew Saykin, and Franco Pestilli. Unified representation of tractography and diffusion-weighted MRI data using sparse multidimensional arrays. In Advances in Neural Information Processing Systems, pages 4343–4354, 2017.
[5] Lifang He, Chun-Ta Lu, Guixiang Ma, Shen Wang, Linlin Shen, Philip S. Yu, and Ann B. Ragin. Kernelized support tensor machines. In International Conference on Machine Learning, pages 1442–1451, 2017.
[6] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
[7] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P. Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, pages 442–450, 2015.
[8] Edwin Stoudenmire and David J. Schwab. Supervised learning with tensor networks. In Advances in Neural Information Processing Systems, pages 4799–4807, 2016.
[9] Jean Kossaifi, Zachary C. Lipton, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. Tensor regression networks. arXiv preprint arXiv:1707.08308, 2017.
[10] Yongxin Yang and Timothy Hospedales. Deep multi-task representation learning: A tensor factorisation approach. arXiv preprint arXiv:1605.06391, 2016.
[11] Ivan V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
[12] Masaaki Imaizumi, Takanori Maehara, and Kohei Hayashi. On tensor train rank minimization: Statistical efficiency and scalable algorithm. In Advances in Neural Information Processing Systems, pages 3933–3942, 2017.
[13] Chao Li, Lili Guo, Yu Tao, Jinyu Wang, Lin Qi, and Zheng Dou. Yet another Schatten norm for tensor recovery. In International Conference on Neural Information Processing, pages 51–60. Springer, 2016.
[14] Cun Mu, Bo Huang, John Wright, and Donald Goldfarb. Square deal: Lower bounds and improved relaxations for tensor recovery. In International Conference on Machine Learning, pages 73–81, 2014.
[15] Kishan Wimalawarne, Makoto Yamada, and Hiroshi Mamitsuka. Convex coupled matrix and tensor completion. arXiv preprint arXiv:1705.05197, 2017.
[16] Ryota Tomioka, Kohei Hayashi, and Hisashi Kashima. Estimation of low-rank tensors via convex optimization. arXiv preprint arXiv:1010.0789, 2010.
[17] Ryota Tomioka and Taiji Suzuki. Convex tensor decomposition via structured Schatten norm regularization. In Advances in Neural Information Processing Systems, pages 1331–1339, 2013.
[18] Kishan Wimalawarne, Masashi Sugiyama, and Ryota Tomioka. Multitask learning meets tensor factorization: Task imputation via convex optimization. In Advances in Neural Information Processing Systems, pages 2825–2833, 2014.
[19] Kishan Wimalawarne, Ryota Tomioka, and Masashi Sugiyama. Theoretical and experimental analyses of tensor-based regression and classification. Neural Computation, 28(4):686–715, 2016.
[20] Makoto Yamada, Wenzhao Lian, Amit Goyal, Jianhui Chen, Kishan Wimalawarne, Suleiman A. Khan, Samuel Kaski, Hiroshi Mamitsuka, and Yi Chang. Convex factorization machine for toxicogenomics prediction. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1215–1224. ACM, 2017.
[21] Xiawei Guo, Quanming Yao, and James Tin-Yau Kwok. Efficient sparse low-rank tensor completion using the Frank-Wolfe algorithm. In AAAI, pages 1948–1954, 2017.
[22] Madhav Nimishakavi, Pratik Jawanpuria, and Bamdev Mishra. A dual framework for low-rank tensor completion. CoRR, abs/1712.01193, 2017.
[23] Maryam Fazel, Haitham Hindi, and Stephen P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the 2001 American Control Conference, volume 6, pages 4734–4739. IEEE, 2001.
[24] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[25] Venkat Chandrasekaran, Sujay Sanghavi, Pablo A. Parrilo, and Alan S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.
[26] S. Hosseini, D. R. Luke, and A. Uschmajew. Tangent and normal cones for low-rank matrices. 2017.
[27] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. BM3D image denoising with shape-adaptive principal component analysis. In SPARS'09 - Signal Processing with Adaptive Sparse Structured Representations, 2009.
[28] Yann LeCun, Corinna Cortes, and C. J. Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2010.
[29] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
[30] Nicholas D. Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos E. Papalexakis, and Christos Faloutsos. Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing, 65(13):3551–3582, 2017.
[31] Tzyy-Ping Jung, Scott Makeig, Colin Humphries, Te-Won Lee, Martin J. Mckeown, Vicente Iragui, and Terrence J. Sejnowski. Removing electroencephalographic artifacts by blind source separation. Psychophysiology, 37(2):163–178, 2000.
[32] Aapo Hyvärinen and Erkki Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.
[33] Yuning Zhao, Chao Li, Zheng Dou, and Xiaodong Yang. A novel framework for wireless digital communication signals via a tensor perspective. Wireless Personal Communications, 99(1):509–537, 2018.
[34] Zhouchen Lin, Minming Chen, and Yi Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055, 2010.
Appendix A Supplementary Materials
A.1 The algorithm of cMsLi
In this paper, we utilize ADMM to solve cMsLi. Recalling the objective function (3) of the paper, its corresponding augmented Lagrangian function is given by
(12) 
where denotes a positive scalar. The algorithm of cMsLi is given as Alg. 1, in which the key operator denotes soft-thresholding the singular values of a matrix with the scalar. Given the SVD of the matrix, the result keeps the singular vectors and shrinks each singular value, where each element satisfies
(13) 
The convergence of the algorithm can be theoretically guaranteed by Theorem 3 in [34] with tiny modifications.
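The singular-value soft-thresholding operator is the proximal operator of the nuclear norm; a standard NumPy implementation (a generic sketch, not the paper's exact Alg. 1) is:

```python
import numpy as np

def svt(M, tau):
    """Soft-threshold the singular values of M by tau: the proximal
    operator of tau * nuclear norm, used in each ADMM iteration."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return U @ np.diag(s_shrunk) @ Vt

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))
# Thresholding at the second-largest singular value zeros out all but
# the leading one, so the result has rank 1.
out = svt(M, tau=np.linalg.svd(M, compute_uv=False)[1])
assert np.linalg.matrix_rank(out) == 1
```

Because every singular value at or below the threshold is set to zero, the operator simultaneously denoises and reduces the rank of its argument.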
A.2 Proof of Lemma 1
Proof.
Assume that there is another tuple such that and . In addition, let , where represents the amended vector. Then the following statements apparently hold:

,

,

there are at least two nonzero vectors in .
Without loss of generality, assume that . Then it holds that by statement 2. Furthermore, it can be found that and . Thus we have . ∎
A.3 Proof of Theorem 1
Proof.
In the proof, we use instead of for simplicity. Based on subgradient optimality conditions applied at the tuple , there must exist dual which satisfies the conditions in the proposition. Let be any subgradient of , i.e. . Then we can get
(14) 
For where , the following equations and inequalities hold:
In these formulations, the first inequality holds because the first-order Taylor expansion is an underestimator of a convex function, and the third equation holds due to the definition of the dual . Since is defined as an arbitrary subgradient of , there must exist a specific such that
(16) 
Meanwhile, because of the dual relationship between the spectral norm and the nuclear norm, we have
(17) 
If we impose (16) and (17) into (A.3), we have
(18) 
In (18), term 1 is always larger than because of the definition of . In addition, term 2 is always greater than or equal to , and the equality holds if and only if . This is because holds; otherwise, if but for a given , then . Since , this leads to a contradiction. Hence, is always true, and the equality holds if and only if . ∎
A.4 Proof of Lemma 2
Proof.
In the proof, we use instead of for simplicity. First, we need to prove that the equation (7) in Lemma 1 of the paper holds if
(19) 
for the given . For the sake of contradiction, assume that there exists a nonzero vector . Then it can easily be found that . By using , we have
(22)  
This leads to the contradiction. Next we need to prove that
(23) 
This inequality is true because of the definition of and the direct sum of linear subspaces. By using this inequality, we can find an upper bound of . That is
(27)  
for all , where the first equation holds because of (23), and the third inequality holds because of the definition of the incoherence measurement in the paper. Hence the equation (7) in the paper holds if
(28) 
∎
A.5 Proof of Theorem 2
Proof.
For simplicity, we use instead of and instead of . According to Lemma 2, we construct the dual , where with any . To make satisfy the conditions in Theorem 1, we have
(29) 
Thus for all , we have
(30) 
Next, an upper bound of the spectral norm of the projection of the dual is given by
(35)  
for all . is also upper bounded by
(40)  
Use instead of in the formulation for simplicity. Then, we have
(41) 
Let , then we have
(42) 
According to the basic properties of real inequalities, we have the equivalent forms
(43)  
(44)  
(45)  
(46) 
We obtain inequality (46) by accumulating the inequalities (45) for all . Using to further simplify the formulation, we have
(47) 
By some derivation, we can get
(48) 
Then we get an upper bound of as
(49) 
if . Next, according to Theorem 1, we must make the following inequality hold:
(50) 
Since we have obtained an upper bound of , the inequality (50) holds if
(51) 
Hence
(54)  
As a result, we can get the range of by solving this inequality, that is
(56) 
∎