Extension of PCA to higher order data structures: An Introduction to Tensors, Tensor Decompositions, and Tensor PCA
Abstract.
The widespread use of multisensor technology and the emergence of big data sets have brought about the necessity to develop more versatile tools to represent higher-order data. Data in the form of multidimensional arrays, also referred to as tensors, arises in a variety of applications including chemometrics, hyperspectral imaging, high resolution videos, neuroimaging, biometrics, and social network analysis. These applications produce massive amounts of data with multiple aspects and high dimensionality. Early multiway data analysis approaches reformatted such tensor data as large vectors or matrices and then resorted to methods developed for classical two-way analysis. However, one cannot discover hidden components within multiway data using conventional matrix decomposition methods. To this end, tensor decomposition methods which are flexible in the choice of the constraints and that extract more general latent components have been proposed. In this paper, we review the major tensor decomposition methods for low-rank tensor approximation for data reduction and learning applications along with their shortcomings and challenges in implementation. We also evaluate these methods on different tensor type data in order to contrast their performance.
A. Zare (zareali@msu.edu) is with the Department of Computational Mathematics, Science and Engineering (CMSE), Michigan State University, East Lansing, MI, 48824, USA.
Mark A. Iwen (markiwen@math.msu.edu) is with the Department of Mathematics, and the Department of Computational Mathematics, Science and Engineering (CMSE), Michigan State University, East Lansing, MI, 48824, USA.
This work was supported in part by NSF CCF-1615489.
1. Introduction
Principal Component Analysis (PCA) is perhaps the most standard tool used to extract low-dimensional subspaces from high dimensional data, and has been used widely in diverse fields from neuroscience to computer graphics because of its simple, nonparametric nature [1]. Over the past decades many PCA variants have been developed to address important problems in dimensionality reduction for extremely large and high dimensional data sets [2], applications to unsupervised learning [3], and more recently low-rank matrix recovery from missing samples or grossly corrupted data [4, 5]. However, these advances have been mostly limited to vector or matrix type data despite the fact that continued advances in information and sensing technology have been making large-scale, multimodal, and multirelational datasets ever more commonplace. Indeed, such multimodal data sets are now commonly encountered in a huge variety of applications including chemometrics, hyperspectral imaging, high resolution videos, neuroimaging (EEG, fMRI), biometrics and social network analysis [6, 7, 8]. Tensors, which are multidimensional generalizations of matrices, provide a useful representation for such data, and their related multilinear-algebraic framework has been shown to provide great flexibility in the analysis of big, multimodal data. Standard vector and matrix models such as PCA, on the other hand, have been shown to be less capable of capturing the cross-couplings across the different modes in higher order data [9, 10, 11]. This motivates the need for multimodal dimensionality reduction and analysis tools that can learn from tensor data while respecting its inherent multimodal structure.
The purpose of this survey article is to introduce those who are well familiar with PCA methods for vector type data to tensors, with an eye toward discussing extensions of PCA and its variants for tensor type data. In order to accomplish this goal we review two main lines of research in tensor decompositions herein. First, we present methods for tensor decomposition aimed at low-dimensional/low-rank approximation of higher order tensors. Early multiway data analysis relied on reshaping tensor data as a matrix and resorted to classical matrix factorization methods. However, the matricization of tensor data cannot always capture the interactions and couplings across the different modes. For this reason, extensions of two-way matrix analysis techniques such as PCA, SVD and nonnegative matrix factorization were developed in order to better address the issue of dimensionality reduction in tensors. After reviewing basic tensor definitions in Section 2, we then discuss these extensions in Section 3. In particular, we review several tensor decomposition methods therein including the CANonical DECOMPosition (CANDECOMP), also known as the PARAllel FACtor (PARAFAC) model, as well as the Tucker, or multilinear singular value, decomposition [12]. Both of these methods aim to provide low-rank approximations to a given tensor based on different notions of rank. We then review tensor networks, including the Tensor-Train (TT), also known as the Matrix-Product State (MPS), decomposition, which are designed to tackle the issue of computational complexity and compression rate in very high order tensors. This framework provides a highly compressed representation of a given tensor through a set of sparsely interconnected matrices and core tensors of low order. Section 3 concludes with an empirical comparison of several tensor decomposition methods' ability to compress several example datasets.
Second, in Section 4, we summarize extensions of PCA and linear subspace learning methods in the context of tensors. These include both Multilinear Principal Component Analysis (MPCA) and the Tensor Rank-One Decomposition (TROD) methods which utilize the notions of rank introduced by the Tucker and CANDECOMP models, respectively, in order to learn a common subspace for a collection of tensors in a given training set. This common subspace is then used, e.g., to project test tensor samples into lower-dimensional spaces and classify them [13, 14]. This framework has found applications in supervised learning settings, in particular for face recognition for face images collected across different modalities and angles [15]. Next, in Section 5, we address the issue of robust low-rank tensor recovery for grossly corrupted and noisy higher order data. In particular, we review methods developed to extract low-rank tensors from corrupted data in both the PARAFAC and Tucker models. Finally, in Section 6 we provide an overview of ongoing work in the area of large tensor data factorization and computationally efficient implementation of the existing methods.
2. Notation, Tensor Basics, and Preludes to Tensor PCA
Let $[n] := \{1, 2, \dots, n\}$ for all $n \in \mathbb{N}$. A $d$-mode, or $d$th-order, tensor $\mathcal{X} \in \mathbb{C}^{n_1 \times n_2 \times \cdots \times n_d}$ is simply a $d$-dimensional array of complex valued data for given dimension sizes $n_1, n_2, \dots, n_d \in \mathbb{N}$. Given this, each entry of a tensor $\mathcal{X}$ is indexed by an index vector $\mathbf{i} = (i_1, i_2, \dots, i_d) \in [n_1] \times [n_2] \times \cdots \times [n_d]$. The entry of $\mathcal{X}$ indexed by $\mathbf{i}$ will always be denoted by $\mathcal{X}_{\mathbf{i}} = \mathcal{X}_{i_1, i_2, \dots, i_d} \in \mathbb{C}$. The $j$th entry position of the index vector for a tensor $\mathcal{X}$ will always be referred to as the $j$th mode of $\mathcal{X}$.
Example 1.
Every vector $\mathbf{x} \in \mathbb{C}^{n}$ is a 1-mode tensor indexed by $i \in [n]$. Its entries are denoted by $x_i \in \mathbb{C}$.
Example 2.
Every matrix $X \in \mathbb{C}^{n_1 \times n_2}$ is a 2-mode tensor indexed by $(i_1, i_2) \in [n_1] \times [n_2]$. Its entries are denoted by $X_{i_1, i_2} \in \mathbb{C}$. Its first and second modes are commonly referred to as its "rows" and "columns", respectively.
Following the conventions established above, throughout the remainder of this paper vectors are always bolded, matrices capitalized, tensors of order $\geq 3$ written in calligraphic script, and tensor entries/scalars written in lower case. If kept in mind these conventions should help the reader rapidly identify the types of objects under consideration in most situations.
2.1. Fibers, Slices, and Other Subtensors
When working with a higher-order tensor it is often beneficial to, e.g., look for correlations across its different modes. For this reason some of the many lower-order subtensors contained within any given higher-order tensor have been given special names and thereby elevated to special status. In this subsection we will define a few of these.
Fibers are 1-mode subtensors (i.e., subvectors) of a given $d$-mode tensor $\mathcal{X} \in \mathbb{C}^{n_1 \times \cdots \times n_d}$. More specifically, a mode-$j$ fiber of $\mathcal{X}$ is a 1-mode subtensor indexed by the $j$th mode of $\mathcal{X}$ for any given choices of $j \in [d]$ and $i_k \in [n_k]$ for all $k \neq j$. Each such mode-$j$ fiber is denoted by

(1) $\mathcal{X}_{i_1, \dots, i_{j-1}, :, i_{j+1}, \dots, i_d} \in \mathbb{C}^{n_j}.$

The $i_j$th entry of a mode-$j$ fiber is $\left( \mathcal{X}_{i_1, \dots, i_{j-1}, :, i_{j+1}, \dots, i_d} \right)_{i_j} = \mathcal{X}_{i_1, \dots, i_d}$ for each $i_j \in [n_j]$. Note that there are $\prod_{k \neq j} n_k$ mode-$j$ fibers of any given $\mathcal{X} \in \mathbb{C}^{n_1 \times \cdots \times n_d}$.
Example 3.
Consider a 2-mode tensor (i.e., matrix) $X \in \mathbb{C}^{n_1 \times n_2}$. Its mode-2 fiber $X_{i_1, :}$ for any given $i_1 \in [n_1]$ is its $i_1$th row. There are $n_1$ rows total. Its mode-1 fiber $X_{:, i_2}$ for any given $i_2 \in [n_2]$ is its $i_2$th column. There are $n_2$ columns total.
Example 4.
Consider a 3-mode tensor $\mathcal{X} \in \mathbb{C}^{n_1 \times n_2 \times n_3}$. Its mode-3 fiber for any given $(i_1, i_2) \in [n_1] \times [n_2]$ is the 1-mode subtensor $\mathcal{X}_{i_1, i_2, :} \in \mathbb{C}^{n_3}$. There are $n_1 n_2$ such mode-3 fibers of $\mathcal{X}$. Fibers of a 3-mode tensor are depicted in Figure 1(b).
In this paper slices will always be $(d-1)$-mode subtensors of a given $d$-mode tensor $\mathcal{X} \in \mathbb{C}^{n_1 \times \cdots \times n_d}$. In particular, a mode-$j$ slice of $\mathcal{X}$ is a $(d-1)$-mode subtensor indexed by all but the $j$th mode of $\mathcal{X}$ for any given choice of $j \in [d]$. Each such mode-$j$ slice is denoted by

(2) $\mathcal{X}_{:, \dots, :, i_j, :, \dots, :} \in \mathbb{C}^{n_1 \times \cdots \times n_{j-1} \times n_{j+1} \times \cdots \times n_d}$

for any choice of $i_j \in [n_j]$. The $(i_1, \dots, i_{j-1}, i_{j+1}, \dots, i_d)$ entry of a mode-$j$ slice is

(3) $\left( \mathcal{X}_{:, \dots, :, i_j, :, \dots, :} \right)_{i_1, \dots, i_{j-1}, i_{j+1}, \dots, i_d} = \mathcal{X}_{i_1, \dots, i_d}$

for each $i_k \in [n_k]$ with $k \neq j$. It is easy to see that there are always just $n_j$ mode-$j$ slices of any given $\mathcal{X} \in \mathbb{C}^{n_1 \times \cdots \times n_d}$.
Example 5.
Consider a matrix $X \in \mathbb{C}^{n_1 \times n_2}$. Its mode-1 slice $X_{i_1, :}$ for any given $i_1 \in [n_1]$ is its $i_1$th row. Note that this is equivalent to its mode-2 fiber for $i_1$ since $X$ is a 2nd-order tensor. Similarly, $X$'s mode-2 slice $X_{:, i_2}$ for any given $i_2 \in [n_2]$ is its $i_2$th column. This is equivalent to its mode-1 fiber for $i_2$ since $X$ is a matrix.
Example 6.
Consider a 3-mode tensor $\mathcal{X} \in \mathbb{C}^{n_1 \times n_2 \times n_3}$. Its mode-3 slice for any given $i_3 \in [n_3]$ is the 2-mode subtensor (i.e., matrix) $\mathcal{X}_{:, :, i_3} \in \mathbb{C}^{n_1 \times n_2}$. There are $n_3$ such mode-3 slices of $\mathcal{X}$. In Figure 1(c) the mode-3 slices of $\mathcal{X}$ can be viewed.
Of course, many other subtensors of a given $d$-mode tensor $\mathcal{X} \in \mathbb{C}^{n_1 \times \cdots \times n_d}$ can be formed using the notation of fibers and slices. For example, one might consider one of the $n_1 n_2 \cdots n_{d-2}$ different 2-mode subtensors (i.e., submatrices) of $\mathcal{X}$ indexed by $\mathcal{X}$'s last two modes, $\mathcal{X}_{i_1, \dots, i_{d-2}, :, :}$. Alternatively, one might consider one of the $n_1 n_2$ different $(d-2)$-mode subtensors of $\mathcal{X}$ formed by fixing the values of $\mathcal{X}$'s first two modes using some $i_1 \in [n_1]$, $i_2 \in [n_2]$. A general view of a 3-mode tensor showing the concept of fiber and slice together is illustrated in Figure 1(a).
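To make the indexing above concrete, fibers and slices are easy to extract with NumPy's basic slicing. The following sketch is purely illustrative (the dimension sizes are arbitrary and hypothetical):

```python
import numpy as np

# A hypothetical 3-mode tensor of size n1 x n2 x n3 = 4 x 5 x 6.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5, 6))

# A mode-3 fiber: fix i1 and i2, let the third index vary.
fiber = X[1, 2, :]                  # a vector of length n3 = 6

# A mode-3 slice: fix only i3, keeping a 2-mode subtensor (a matrix).
slice_3 = X[:, :, 2]                # an n1 x n2 = 4 x 5 matrix

# Counts: there are n1*n2 mode-3 fibers and n3 mode-3 slices.
num_mode3_fibers = X.shape[0] * X.shape[1]
num_mode3_slices = X.shape[2]
```
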
2.2. Tensor Vectorization, Flattening, and Reshaping
There are a tremendous number of ways one can reshape a $d$-mode tensor into another tensor with a different number of modes. Perhaps most important among these are the transformations of a given tensor into a vector or matrix so that methods from standard numerical linear algebra can be applied to the reshaped tensor data thereafter.
The vectorization of $\mathcal{X} \in \mathbb{C}^{n_1 \times \cdots \times n_d}$ will always reshape $\mathcal{X}$ into a vector (i.e., 1st-order tensor) denoted by $\operatorname{vec}(\mathcal{X}) \in \mathbb{C}^{n_1 n_2 \cdots n_d}$. This process can be accomplished numerically by, e.g., recursively vectorizing the last two modes of $\mathcal{X}$ (i.e., each matrix $\mathcal{X}_{i_1, \dots, i_{d-2}, :, :}$) according to their row-major order until only one mode remains. When done in this fashion the entries of the vectorization can be rapidly retrieved from $\mathcal{X}$ via the formula

(4) $\operatorname{vec}(\mathcal{X})_{h(\mathbf{i})} = \mathcal{X}_{i_1, \dots, i_d},$

where the index function $h$ is defined for all $\mathbf{i} \in [n_1] \times \cdots \times [n_d]$ by

(5) $h(\mathbf{i}) = 1 + \sum_{j=1}^{d} (i_j - 1) \prod_{k = j+1}^{d} n_k.$

Herein we will always use the convention that $\prod_{k = d+1}^{d} n_k := 1$ to handle the case where, e.g., $\{ k \,:\, j < k \leq d \}$ is the empty set above.
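The row-major vectorization above corresponds exactly to NumPy's default C-ordered reshape. The following sketch (with arbitrary, hypothetical dimensions) checks the index formula (4) numerically, using 0-based indexing:

```python
import numpy as np

rng = np.random.default_rng(1)
n = (3, 4, 5)                   # hypothetical dimension sizes
X = rng.standard_normal(n)

# Row-major (C-order) vectorization, matching the recursive scheme above.
v = X.reshape(-1)

# Check the index formula: in 0-based indexing the entry (i1, i2, i3)
# lands at position i3 + i2*n3 + i1*n2*n3 of the vectorization.
i = (2, 1, 3)
pos = i[2] + i[1] * n[2] + i[0] * n[1] * n[2]
formula_ok = (v[pos] == X[i])
```
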
The process of reshaping a $d$-mode tensor into a matrix is known as matricizing, flattening, or unfolding the tensor. There are $2^{d-1} - 1$ nontrivial ways in which one may create a matrix from a $d$-mode tensor by partitioning its modes into two different "row" and "column" subsets of modes (each of which is then implicitly vectorized separately).¹ The most oft considered of these are the mode-$j$ variants mentioned below (excluding, perhaps, the alternate matricizations utilized as part of, e.g., the tensor train [16] and hierarchical SVD [17] decomposition methods).

¹One can use Stirling numbers of the second kind to easily enumerate all possible mode partitions.
The mode-$j$ matricization, mode-$j$ flattening, or mode-$j$ unfolding of a tensor $\mathcal{X} \in \mathbb{C}^{n_1 \times \cdots \times n_d}$, denoted by $X_{(j)} \in \mathbb{C}^{n_j \times \prod_{k \neq j} n_k}$, is a matrix whose columns consist of all the mode-$j$ fibers of $\mathcal{X}$. More explicitly, $X_{(j)}$ is determined herein by defining its entries to be

(6) $\left( X_{(j)} \right)_{i_j,\, g_j(\mathbf{i})} = \mathcal{X}_{i_1, \dots, i_d},$

where, e.g., the index functions $g_j$ are defined by

(7) $g_j(\mathbf{i}) = 1 + \sum_{\substack{k = 1 \\ k \neq j}}^{d} (i_k - 1) \prod_{\substack{m = k+1 \\ m \neq j}}^{d} n_m$

for all $j \in [d]$ and $\mathbf{i} \in [n_1] \times \cdots \times [n_d]$. Note that $X_{(j)}$'s columns are ordered by varying the index of the largest non-$j$ mode (mode $d$, unless a mode-$d$ unfolding is being constructed) fastest, followed by varying the second largest non-$j$ mode second fastest, etc. In Figure 2, the mode-1 matricization of $\mathcal{X}$ is formed in this way using its mode-1 fibers.
As mentioned above, the mode-$j$ fibers of $\mathcal{X}$ make up the columns of $X_{(j)}$. Similarly, each row of $X_{(j)}$ is the vectorization of a mode-$j$ slice of $\mathcal{X}$. For the special case of 2nd-order tensors (i.e., matrices) we'd also like to point out that mode-$j$ flattening has a special relationship to matrix transposition. If $X \in \mathbb{C}^{n_1 \times n_2}$ then $X_{(1)}$ is simply $X$ again (i.e., $X_{(1)} = X$), while $X_{(2)}$ is the transpose of $X$ (i.e., $X_{(2)} = X^T$).
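A mode-$j$ unfolding with the column ordering described above can be sketched in NumPy as follows. This is an illustrative helper of our own, not taken from any particular package:

```python
import numpy as np

def unfold(X, j):
    """Mode-j unfolding (0-based j): columns are the mode-j fibers,
    ordered with the index of the largest remaining mode varying fastest."""
    return np.moveaxis(X, j, 0).reshape(X.shape[j], -1)

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 4, 5))
X1, X2, X3 = unfold(X, 0), unfold(X, 1), unfold(X, 2)

# The first column of the mode-0 unfolding is the fiber X[:, 0, 0].
first_col_ok = np.allclose(X1[:, 0], X[:, 0, 0])
```
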
Example 7.
As an example, consider a 3-mode tensor $\mathcal{X} \in \mathbb{C}^{n_1 \times n_2 \times n_3}$. Its mode-1 matricization is

(8) $X_{(1)} = \left[ \left( \mathcal{X}_{:,1,:} \right)_{(1)} \ \left( \mathcal{X}_{:,2,:} \right)_{(1)} \ \cdots \ \left( \mathcal{X}_{:,n_2,:} \right)_{(1)} \right],$

where $\left( \mathcal{X}_{:,i_2,:} \right)_{(1)}$ is the mode-1 matricization of the mode-2 slice of $\mathcal{X}$ at $i_2$ for each $i_2 \in [n_2]$. Note that we still consider each mode-2 slice $\mathcal{X}_{:,i_2,:}$ to be indexed by $i_1$ and $i_3$ above. As a result, e.g., $\left( \mathcal{X}_{:,i_2,:} \right)_{(3)}$ will be considered meaningful while $\left( \mathcal{X}_{:,i_2,:} \right)_{(2)}$ will be considered meaningless in the same way that, e.g., $\left( \mathcal{X}_{:,i_2,:} \right)_{(4)}$ is meaningless. That is, even though $\mathcal{X}_{:,i_2,:}$ only has two modes, we will still consider them to be its "first" and "third" modes (i.e., its two mode positions are still indexed by $i_1$ and $i_3$, respectively). Though potentially counterintuitive at first, this notational convention will simplify the interpretation of expressions like (8) – (10) going forward.

Given this convention, we also have that

(9) $X_{(2)} = \left[ \left( \mathcal{X}_{1,:,:} \right)_{(2)} \ \left( \mathcal{X}_{2,:,:} \right)_{(2)} \ \cdots \ \left( \mathcal{X}_{n_1,:,:} \right)_{(2)} \right]$

and

(10) $X_{(3)} = \left[ \left( \mathcal{X}_{1,:,:} \right)_{(3)} \ \left( \mathcal{X}_{2,:,:} \right)_{(3)} \ \cdots \ \left( \mathcal{X}_{n_1,:,:} \right)_{(3)} \right].$
More generally, it is not difficult to see that there are a large number of ways one can reshape a given tensor beyond the simple mode-$j$ unfoldings discussed above. A $d$th-order tensor can always be reshaped into another $d'$th-order tensor as long as the total number of entries is preserved. Herein we have focused on the special cases of $d' = 1$ or $d' = 2$, but it is worth considering the possibility that there may be good reasons, heretofore opaque, for occasionally reshaping data into even higher mode tensors.
2.3. The Standard Inner Product Space of $d$-mode Tensors
The set of all $d$th-order tensors $\mathbb{C}^{n_1 \times \cdots \times n_d}$ forms a vector space over the complex numbers when equipped with componentwise addition and scalar multiplication. This vector space is usually endowed with the Euclidean inner product. More specifically, the inner product of $\mathcal{X}, \mathcal{Y} \in \mathbb{C}^{n_1 \times \cdots \times n_d}$ will always be given by

(11) $\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1 = 1}^{n_1} \sum_{i_2 = 1}^{n_2} \cdots \sum_{i_d = 1}^{n_d} \mathcal{X}_{i_1, \dots, i_d} \overline{\mathcal{Y}_{i_1, \dots, i_d}}.$

This inner product then gives rise to the standard Euclidean norm

(12) $\left\| \mathcal{X} \right\| = \sqrt{\langle \mathcal{X}, \mathcal{X} \rangle}.$
If $\langle \mathcal{X}, \mathcal{Y} \rangle = 0$ we will say that $\mathcal{X}$ and $\mathcal{Y}$ are orthogonal. If $\mathcal{X}$ and $\mathcal{Y}$ are orthogonal and also have unit norm (i.e., have $\| \mathcal{X} \| = \| \mathcal{Y} \| = 1$) we will say that they are orthonormal.
It is worth noting that trivial inner product preserving isomorphisms exist between this standard inner product space and any of its reshaped versions (i.e., reshaping can be viewed as an isomorphism between the original $d$-mode tensor vector space and its reshaped target vector space). In particular, the process of reshaping tensors is linear. If, for example, $\alpha, \beta \in \mathbb{C}$ and $\mathcal{X}, \mathcal{Y} \in \mathbb{C}^{n_1 \times \cdots \times n_d}$, then one can see that the mode-$j$ unfolding of $\alpha \mathcal{X} + \beta \mathcal{Y}$ is $\alpha X_{(j)} + \beta Y_{(j)}$ for all $j \in [d]$. Similarly, the vectorization of $\alpha \mathcal{X} + \beta \mathcal{Y}$ is always exactly $\alpha \operatorname{vec}(\mathcal{X}) + \beta \operatorname{vec}(\mathcal{Y})$.
Furthermore, if we let $\mathcal{I} := [n_1] \times \cdots \times [n_d]$ be the set of all indexes for the entries of $\mathcal{X}, \mathcal{Y} \in \mathbb{C}^{n_1 \times \cdots \times n_d}$, then we can also see that

$\langle \mathcal{X}, \mathcal{Y} \rangle = \langle \operatorname{vec}(\mathcal{X}), \operatorname{vec}(\mathcal{Y}) \rangle = \left\langle X_{(j)}, Y_{(j)} \right\rangle$

will always hold. Here $\langle \operatorname{vec}(\mathcal{X}), \operatorname{vec}(\mathcal{Y}) \rangle$ is the Euclidean inner product between the vectorizations of $\mathcal{X}$ and $\mathcal{Y}$, and $\langle X_{(j)}, Y_{(j)} \rangle$ is the Hilbert-Schmidt inner product between the mode-$j$ flattenings of $\mathcal{X}$ and $\mathcal{Y}$ for each $j \in [d]$. The fact that these inner products are all preserved further implies the following norm preservations: we have that

(13) $\left\| \mathcal{X} \right\| = \left\| \operatorname{vec}(\mathcal{X}) \right\|_2 = \left\| X_{(j)} \right\|_F$

also hold for all $j \in [d]$, where $\| \cdot \|_2$ is the standard Euclidean vector norm, and $\| \cdot \|_F$ is the Frobenius matrix norm.
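The norm preservations in (13) are easy to confirm numerically; the following NumPy sketch (with arbitrary, hypothetical dimensions) is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 4, 5))

tensor_norm = np.sqrt((X ** 2).sum())            # Euclidean tensor norm
vec_norm = np.linalg.norm(X.reshape(-1))         # norm of vec(X)
unfold_norm = np.linalg.norm(                    # Frobenius norm of X_(2)
    np.moveaxis(X, 1, 0).reshape(4, -1), 'fro')

norms_agree = np.allclose([tensor_norm, vec_norm], unfold_norm)
```
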
2.4. Tensor Products and Mode Products
It is occasionally desirable to build one's own higher order tensor using two lower order tensors. This is particularly true when one builds them up using vectors as part of, e.g., PARAFAC/CANDECOMP decomposition techniques [18, 19, 20, 21, 22]. Toward this end we will utilize the tensor product of two tensors $\mathcal{X} \in \mathbb{C}^{n_1 \times \cdots \times n_d}$ and $\mathcal{Y} \in \mathbb{C}^{m_1 \times \cdots \times m_{d'}}$. The result of the tensor product, $\mathcal{X} \circ \mathcal{Y}$, is a $(d + d')$-mode tensor whose entries are given by

(14) $\left( \mathcal{X} \circ \mathcal{Y} \right)_{i_1, \dots, i_d, i'_1, \dots, i'_{d'}} = \mathcal{X}_{i_1, \dots, i_d} \, \mathcal{Y}_{i'_1, \dots, i'_{d'}}.$
Example 8.
Consider two vectors (i.e., 1-mode tensors) $\mathbf{a} \in \mathbb{C}^{n_1}$ and $\mathbf{b} \in \mathbb{C}^{n_2}$. Their tensor product is the 2-mode tensor (i.e., matrix) $\mathbf{a} \circ \overline{\mathbf{b}} = \mathbf{a} \mathbf{b}^* \in \mathbb{C}^{n_1 \times n_2}$, where $^*$ denotes the conjugate transpose operation. This is exactly the rank one matrix obtained by taking the outer product of $\mathbf{a}$ and $\mathbf{b}$.
A $d$th-order tensor which is built up from $d$ vectors using the tensor product is called a rank-1 tensor. For example, $\mathcal{X} = \mathbf{a}^{(1)} \circ \mathbf{a}^{(2)} \circ \mathbf{a}^{(3)} \circ \mathbf{a}^{(4)}$ is a rank-1 tensor with 4 modes which is built from $\mathbf{a}^{(1)} \in \mathbb{C}^{n_1}$, $\mathbf{a}^{(2)} \in \mathbb{C}^{n_2}$, $\mathbf{a}^{(3)} \in \mathbb{C}^{n_3}$, and $\mathbf{a}^{(4)} \in \mathbb{C}^{n_4}$. Note that this 4th-order tensor is unambiguously called "rank-1" due to the fact that $\mathcal{X}$ is both built up from rank-1 tensors, and because every mode-$j$ unfolding of $\mathcal{X}$ is also a rank-1 matrix. In Figure 3, a 3-mode rank-1 tensor is formed using the outer product of 3 vectors.
Finally, the mode-$j$ product of a $d$-mode tensor $\mathcal{X} \in \mathbb{C}^{n_1 \times \cdots \times n_d}$ with a matrix $U \in \mathbb{C}^{m \times n_j}$ is another $d$-mode tensor $\mathcal{X} \times_j U \in \mathbb{C}^{n_1 \times \cdots \times n_{j-1} \times m \times n_{j+1} \times \cdots \times n_d}$. Its entries are given by

(15) $\left( \mathcal{X} \times_j U \right)_{i_1, \dots, i_{j-1}, \ell, i_{j+1}, \dots, i_d} = \sum_{i_j = 1}^{n_j} U_{\ell, i_j} \mathcal{X}_{i_1, \dots, i_d}$

for all $\ell \in [m]$ and $i_k \in [n_k]$ with $k \neq j$. Looking at the mode-$j$ unfoldings of $\mathcal{X}$ and $\mathcal{X} \times_j U$ one can easily see that $\left( \mathcal{X} \times_j U \right)_{(j)} = U X_{(j)}$ holds for all $j \in [d]$.
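The unfolding identity just stated also suggests a simple way to compute mode products. The NumPy sketch below (our own illustrative helper functions, not a standard API) implements a mode product through the unfolding and verifies the identity:

```python
import numpy as np

def unfold(X, j):
    return np.moveaxis(X, j, 0).reshape(X.shape[j], -1)

def fold(M, j, shape):
    """Invert unfold: reshape M back into a tensor with the given shape."""
    moved = [shape[j]] + [s for k, s in enumerate(shape) if k != j]
    return np.moveaxis(M.reshape(moved), 0, j)

def mode_product(X, U, j):
    """Mode-j product computed through the unfolding identity."""
    shape = list(X.shape)
    shape[j] = U.shape[0]
    return fold(U @ unfold(X, j), j, shape)

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 4, 5))
U = rng.standard_normal((6, 4))
Y = mode_product(X, U, 1)

# Consistency with the identity: the mode-1 unfolding of Y is U @ X_(1).
identity_ok = np.allclose(unfold(Y, 1), U @ unfold(X, 1))
```
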
Example 9.
Let $X \in \mathbb{C}^{n_1 \times n_2}$ be a 2-mode tensor (i.e., a matrix). Similarly, let $U \in \mathbb{C}^{m_1 \times n_1}$ and $V \in \mathbb{C}^{m_2 \times n_2}$. Then, we have that $X \times_1 U = UX$ and $X \times_2 V = X V^T$.
Mode products play a particularly important role in many tensor PCA and tensor factorization methods [23, 12, 15, 22]. For this reason it is worth stating some of their basic properties: mainly, mode products are bilinear, commute on different modes, and combine in reverse order on the same mode. The following simple lemma formally lists these important properties.
Lemma 1.
Let $\mathcal{X}, \mathcal{Y} \in \mathbb{C}^{n_1 \times \cdots \times n_d}$, $\alpha, \beta \in \mathbb{C}$, and $U^{(j)}, V^{(j)} \in \mathbb{C}^{m_j \times n_j}$ for all $j \in [d]$. The following four properties hold:

(1) $\left( \alpha \mathcal{X} + \beta \mathcal{Y} \right) \times_j U^{(j)} = \alpha \left( \mathcal{X} \times_j U^{(j)} \right) + \beta \left( \mathcal{Y} \times_j U^{(j)} \right)$.

(2) $\mathcal{X} \times_j \left( \alpha U^{(j)} + \beta V^{(j)} \right) = \alpha \left( \mathcal{X} \times_j U^{(j)} \right) + \beta \left( \mathcal{X} \times_j V^{(j)} \right)$.

(3) If $j \neq k$ then $\mathcal{X} \times_j U^{(j)} \times_k U^{(k)} = \mathcal{X} \times_k U^{(k)} \times_j U^{(j)}$.

(4) If $W \in \mathbb{C}^{p \times m_j}$ then $\mathcal{X} \times_j U^{(j)} \times_j W = \mathcal{X} \times_j \left( W U^{(j)} \right)$.
Looking, e.g., at property (4) of Lemma 1 for a 2-mode tensor we can see that it is a simple consequence of the associativity of matrix multiplication. In particular, given $X \in \mathbb{C}^{n_1 \times n_2}$, $U \in \mathbb{C}^{m \times n_1}$, and $W \in \mathbb{C}^{p \times m}$ we have that $X \times_1 U \times_1 W = W(UX) = (WU)X = X \times_1 (WU)$.
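The commutation and combination properties of Lemma 1 can likewise be checked numerically. The sketch below is illustrative only; `mode_product` is our own hypothetical helper built on `numpy.tensordot`:

```python
import numpy as np

def mode_product(X, U, j):
    # contract U's columns against mode j, then restore the axis order
    return np.moveaxis(np.tensordot(U, X, axes=(1, j)), 0, j)

rng = np.random.default_rng(5)
X = rng.standard_normal((3, 4, 5))
U = rng.standard_normal((7, 4))    # acts on mode 1
V = rng.standard_normal((6, 5))    # acts on mode 2
W = rng.standard_normal((2, 7))    # acts on mode 1, after U

# Mode products on different modes commute.
commute_ok = np.allclose(
    mode_product(mode_product(X, U, 1), V, 2),
    mode_product(mode_product(X, V, 2), U, 1))

# Products on the same mode combine in reverse order.
combine_ok = np.allclose(
    mode_product(mode_product(X, U, 1), W, 1),
    mode_product(X, W @ U, 1))
```
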
We are now equipped with all of the tools we need in order to begin discussing tensor PCA variants.
2.5. Trivial PCA for Tensors Based on Implicit Vectorization
Some of the first engineering methodologies involving PCA for $d$th-order ($d \geq 3$) tensor data were developed in order to aid in facial recognition, computer vision, and image processing tasks (see, e.g., [24, 25, 26, 27] for several variants of such methods). In these applications preprocessed pictures of individuals were treated as individual 2-mode tensors. In order to help provide additional information each individual might further be imaged under several different conditions (e.g., from a few different angles, etc.). The collection of each individual's images across each additional condition's mode (e.g., camera angle) would then result in a $d$th-order ($d \geq 3$) tensor of image data for each individual. The objective would then be to perform PCA across the individuals' face image data (treating each individual's image data as a separate data point) in order to come up with a reduced face model that could later be used for various computer vision tasks (e.g., face recognition/classification).
Mathematically these early methods perform implicitly vectorized PCA on $m$ given $d$-mode tensors $\mathcal{X}_1, \dots, \mathcal{X}_m \in \mathbb{C}^{n_1 \times \cdots \times n_d}$, each of which represents an individual's image(s). Assuming that the image data has been centered so that $\sum_{\ell = 1}^{m} \mathcal{X}_\ell = 0$, this problem reduces to finding a set of $r$ orthonormal "eigenface" basis tensors $\mathcal{E}_1, \dots, \mathcal{E}_r \in \mathbb{C}^{n_1 \times \cdots \times n_d}$ whose span $\mathcal{S} := \operatorname{span} \{ \mathcal{E}_1, \dots, \mathcal{E}_r \}$ minimizes the error

(16) $\sum_{\ell = 1}^{m} \min_{\mathcal{Y} \in \mathcal{S}} \left\| \mathcal{X}_\ell - \mathcal{Y} \right\|^2.$

Among other things, one can then approximate each of the original individual's image data, $\mathcal{X}_\ell$, in compressed form via a sum

(17) $\mathcal{X}_\ell \approx \sum_{k = 1}^{r} \alpha_{\ell, k} \, \mathcal{E}_k$

for some optimal coefficients $\alpha_{\ell, k} = \langle \mathcal{X}_\ell, \mathcal{E}_k \rangle \in \mathbb{C}$.
It is not too difficult to see that this problem can be solved in vectorized form by (partially) computing the Singular Value Decomposition (SVD) of an $\left( \prod_{j=1}^{d} n_j \right) \times m$ matrix whose columns are the vectorized image tensors $\operatorname{vec}(\mathcal{X}_1), \dots, \operatorname{vec}(\mathcal{X}_m)$. As a result one will obtain a vectorized "eigenface" basis, each element of which can then be reshaped back into an image tensor $\mathcal{E}_k$. Though conceptually simple, this approach still encounters significant computational challenges. In particular, the total dimensionality of each tensor, $\prod_{j=1}^{d} n_j$, can be extremely large, making both the computation of the SVD above very expensive, and the storage of the basis tensors inefficient. The challenges involving computation of the SVD in such situations can be addressed using tools from numerical linear algebra (see, e.g., [28, 29]). The challenges involving efficient storage and computation with the high dimensional tensors obtained by this (or any other approach discussed below) can be addressed using the tensor factorization and compression methods discussed in the next section.
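The vectorized PCA procedure just described can be sketched in a few lines of NumPy. This is a toy illustration using small, hypothetical random "image" tensors rather than a realistic face dataset:

```python
import numpy as np

rng = np.random.default_rng(6)
m, shape = 20, (8, 8, 3)            # 20 hypothetical small "image" tensors
data = rng.standard_normal((m,) + shape)
data = data - data.mean(axis=0)     # center the collection

# Stack vectorized tensors as columns and take a truncated SVD.
A = data.reshape(m, -1).T           # a 192 x 20 matrix
r = 5
U, s, Vt = np.linalg.svd(A, full_matrices=False)
basis = U[:, :r]                    # vectorized "eigenface" basis
coeffs = basis.T @ A                # optimal expansion coefficients
approx = (basis @ coeffs).T.reshape((m,) + shape)

# The residual equals the energy in the discarded singular values.
err = np.linalg.norm(data - approx)
residual_ok = abs(err - np.sqrt((s[r:] ** 2).sum())) < 1e-8
```
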
3. Tensor Factorization and Compression Methods
Tensor decomposition methods provide efficient representations for multilinear datasets by reducing their complexity similar to the way PCA/SVD does for matrices. As a result, tensor decomposition methods allow their users to work with and store smaller numbers of parameters for tensor data. Advantages of using multiway analysis over two-way analysis in terms of uniqueness, robustness to noise, and computational complexity have been shown in many studies (see, e.g., [30, 31, 32]). In this section we review some of the most commonly used approaches for tensor representation and compression and present results on their uniqueness and convergence to the optimal representation. We then empirically evaluate the compression versus reconstruction error performance of several of these methods for three different higher order datasets before moving on to consider PCA variants for collections of tensors in the next section.
3.1. CANDECOMP/PARAFAC Decomposition (CPD)
CPD is a generalization of PCA to higher order arrays which represents a $d$-mode tensor $\mathcal{X}$ as a combination of rank-one tensors [33]:

(18) $\mathcal{X} \approx \sum_{r = 1}^{R} \lambda_r \, \mathbf{a}^{(1)}_r \circ \mathbf{a}^{(2)}_r \circ \cdots \circ \mathbf{a}^{(d)}_r,$

where $R$ is a positive integer, $\lambda_r$ is the weight of the $r$th rank-one tensor, $\mathbf{a}^{(j)}_r \in \mathbb{C}^{n_j}$ is the $r$th factor of the $j$th mode with unit norm where $r \in [R]$ and $j \in [d]$, and "$\circ$" denotes the tensor (outer) product of vectors. Alternatively, $\mathcal{X}$ can be represented as the mode product of a diagonal core tensor $\Lambda \in \mathbb{C}^{R \times R \times \cdots \times R}$ with entries $\Lambda_{r, r, \dots, r} = \lambda_r$ and factor matrices $A^{(j)} := \left[ \mathbf{a}^{(j)}_1 \ \mathbf{a}^{(j)}_2 \ \cdots \ \mathbf{a}^{(j)}_R \right]$ for $j \in [d]$:

(19) $\mathcal{X} \approx \Lambda \times_1 A^{(1)} \times_2 A^{(2)} \times_3 \cdots \times_d A^{(d)}.$
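The equivalence of the rank-one sum in (18) and the diagonal-core form in (19) is easy to illustrate numerically. The following NumPy sketch uses small, arbitrary dimensions and random factors (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(7)
R, dims = 4, (3, 4, 5)
lam = rng.standard_normal(R)                          # weights
A = [rng.standard_normal((n, R)) for n in dims]       # factor matrices
A = [M / np.linalg.norm(M, axis=0) for M in A]        # unit-norm columns

# Form (18): a sum of weighted rank-one (outer product) tensors.
X = np.zeros(dims)
for r in range(R):
    X += lam[r] * np.einsum('i,j,k->ijk',
                            A[0][:, r], A[1][:, r], A[2][:, r])

# Form (19): a diagonal core multiplied along each mode.
core = np.zeros((R, R, R))
core[np.arange(R), np.arange(R), np.arange(R)] = lam
X2 = np.einsum('abc,ia,jb,kc->ijk', core, A[0], A[1], A[2])

forms_agree = np.allclose(X, X2)
```
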
The main restriction of the PARAFAC model is that the factors across different modes only interact factor-wise. For example, for a 3-mode tensor, the $r$th factor corresponding to the first mode only interacts with the $r$th factors of the second and third modes. However, this restriction also provides the same number of factors for each mode and yields a unique solution for the PARAFAC model [10, 30, 22]. In bilinear methods there is a well-known problem of rotational freedom; this is not the case in CPD. Kruskal provided results on the uniqueness of 3-mode CPD depending on the matrix $k$-rank as:

(20) $k_{A^{(1)}} + k_{A^{(2)}} + k_{A^{(3)}} \geq 2R + 2,$

where the $k$-rank $k_A$ is the maximum value such that any $k_A$ columns of $A$ are linearly independent [33]. This result was later generalized for $d$-mode tensors in [34] as:

(21) $\sum_{j = 1}^{d} k_{A^{(j)}} \geq 2R + (d - 1).$

Under these conditions, the CPD solution is unique and the estimated model cannot be rotated without a loss of fit. The rank-$R$ approximation of a $d$th-order tensor obtained by CPD is represented using $R \left( 1 + \sum_{j=1}^{d} n_j \right)$ parameters, which is less than the number of parameters required for PCA applied to an unfolded matrix.
CPD is most commonly computed by alternating least squares (ALS), which successively fixes the factors in all but one mode and then estimates the unknown set of parameters of the remaining mode. For each mode and each iteration, the Frobenius norm of the difference between the input tensor and its CPD approximation is minimized. ALS is an attractive method since it ensures the improvement of the solution in every iteration. However, in practice, the presence of a large amount of noise or a high model order may prevent ALS from converging to the global minimum, or may require several thousand iterations [22, 10, 35]. Different methods have been proposed to improve performance and accelerate the convergence rate of CPD algorithms [36, 37]. A number of particular techniques exist, such as line search extrapolation methods [38, 39, 40] and compression [41]. Instead of alternating estimation, all-at-once algorithms such as the OPT algorithm [42], the conjugate gradient algorithm for nonnegative CP [43], the PMF3, damped Gauss-Newton (dGN) algorithms [44] and fast dGN [45] have been studied to deal with the slow convergence of ALS in some cases. Another approach is to consider the CP decomposition as a joint diagonalization problem [46, 47].
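A minimal ALS loop for a 3-mode CPD is sketched below. This is an illustrative toy implementation (no weight normalization, no stopping rule, and `khatri_rao` is our own helper), not a production algorithm:

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product of two matrices with R columns each."""
    R = B.shape[1]
    return np.einsum('ir,jr->ijr', B, C).reshape(-1, R)

def cp_als(X, R, iters=200, seed=0):
    """A toy ALS loop for a 3-mode CPD (no weights, no stopping rule)."""
    rng = np.random.default_rng(seed)
    A = [rng.standard_normal((n, R)) for n in X.shape]
    # Mode-j unfoldings with the largest remaining mode varying fastest.
    unf = [np.moveaxis(X, j, 0).reshape(X.shape[j], -1) for j in range(3)]
    for _ in range(iters):
        for j in range(3):
            k, l = [m for m in range(3) if m != j]
            KR = khatri_rao(A[k], A[l])  # matches the unfolding column order
            A[j] = unf[j] @ np.linalg.pinv(KR).T   # least-squares update
    return A

# Recover a synthetic rank-2 tensor.
rng = np.random.default_rng(8)
G = [rng.standard_normal((n, 2)) for n in (4, 5, 6)]
X = np.einsum('ir,jr,kr->ijk', *G)
A = cp_als(X, R=2)
Xhat = np.einsum('ir,jr,kr->ijk', *A)
rel_err = np.linalg.norm(X - Xhat) / np.linalg.norm(X)
```

On exact low-rank data such as this, the least-squares updates typically drive the relative error toward zero; on noisy data a stopping rule on the fit change would be needed.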
3.2. Tucker Decomposition and HoSVD
Tucker decomposition is a natural extension of the SVD to $d$-mode tensors which decomposes the tensor into a core tensor multiplied by a matrix along each mode [48, 10, 35]. The Tucker decomposition of a $d$-mode tensor $\mathcal{X}$ is written as:

(22) $\mathcal{X} \approx \mathcal{C} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 \cdots \times_d U^{(d)},$

where the matrices $U^{(j)} \in \mathbb{C}^{n_j \times n_j}$ are square factor matrices and the core tensor is obtained by $\mathcal{C} = \mathcal{X} \times_1 \left( U^{(1)} \right)^* \times_2 \cdots \times_d \left( U^{(d)} \right)^*$. It is common for the Tucker decomposition to assume the ranks of the $U^{(j)}$s to be less than $n_j$ so that $\mathcal{C}$ is a compression of $\mathcal{X}$. In contrast to PARAFAC, Tucker models allow interactions between the factors obtained across the modes, and the core tensor includes the strength of these interactions. However, the main drawback of the Tucker decomposition is that the factors are not necessarily unique [30]. For example, the effect of rotating one of the mode matrices can be eliminated by inversely rotating the core tensor.
The Higher Order SVD (HoSVD) is a special case of the Tucker decomposition obtained by adding an orthogonality constraint to the component matrices. In the HoSVD, the factor matrices $U^{(j)}$ are the left singular vectors of each flattening $X_{(j)}$. A low $n$-rank approximation of $\mathcal{X}$ can be obtained by truncating the orthogonal factor matrices of the HoSVD, resulting in the truncated HoSVD. As opposed to the SVD for matrices, the truncated HoSVD is not the best low-rank approximation of $\mathcal{X}$. The best rank-$(R_1, R_2, \dots, R_d)$ approximation is obtained by solving the following optimization problem:

(23) $\min_{\mathcal{C},\, U^{(1)}, \dots, U^{(d)}} \left\| \mathcal{X} - \mathcal{C} \times_1 U^{(1)} \times_2 \cdots \times_d U^{(d)} \right\|$ subject to $\mathcal{C} \in \mathbb{C}^{R_1 \times \cdots \times R_d}$ and each $U^{(j)} \in \mathbb{C}^{n_j \times R_j}$ having orthonormal columns.
It has been shown that this optimization problem can be solved iteratively by an ALS approach known as higher-order orthogonal iteration (HOOI) [48]. For many applications the HoSVD is considered to be sufficiently good, or it can serve as an initial value in algorithms for finding the best approximation [49]. A multilinear rank-$(R_1, \dots, R_d)$ approximation of a $d$th-order tensor is represented using $\prod_{j=1}^{d} R_j + \sum_{j=1}^{d} n_j R_j$ parameters in the Tucker model.
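The truncated HoSVD described above can be sketched in NumPy as follows. This is illustrative only; the dimensions and target ranks are arbitrary, and the helpers are our own:

```python
import numpy as np

def unfold(X, j):
    return np.moveaxis(X, j, 0).reshape(X.shape[j], -1)

def mode_product(X, U, j):
    return np.moveaxis(np.tensordot(U, X, axes=(1, j)), 0, j)

def truncated_hosvd(X, ranks):
    """Truncated HoSVD: each factor holds the leading left singular vectors
    of the corresponding unfolding; the core is X projected onto them."""
    U = [np.linalg.svd(unfold(X, j), full_matrices=False)[0][:, :r]
         for j, r in enumerate(ranks)]
    C = X
    for j, Uj in enumerate(U):
        C = mode_product(C, Uj.conj().T, j)
    return C, U

rng = np.random.default_rng(9)
X = rng.standard_normal((6, 7, 8))
C, U = truncated_hosvd(X, (3, 3, 3))

# Reconstruct: a quasi-optimal (generally not the best) rank-(3,3,3) fit.
Xhat = C
for j, Uj in enumerate(U):
    Xhat = mode_product(Xhat, Uj, j)
rel_err = np.linalg.norm(X - Xhat) / np.linalg.norm(X)
```
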
3.3. Nonnegative Tucker Decomposition (NTD)
To identify hidden nonnegative patterns in a tensor, nonnegative matrix factorization algorithms have been adapted to the Tucker model [50, 51, 52, 53, 54]. The NTD of a tensor $\mathcal{X}$ can be obtained by solving:

(24) $\min_{\mathcal{C},\, U^{(1)}, \dots, U^{(d)}} \left\| \mathcal{X} - \mathcal{C} \times_1 U^{(1)} \times_2 \cdots \times_d U^{(d)} \right\|$ subject to $\mathcal{C} \geq 0$ and $U^{(j)} \geq 0$ for all $j \in [d]$.
3.4. Hierarchical Tensor Decomposition
To reduce the memory requirements of the Tucker decomposition, the hierarchical Tucker decomposition has been proposed [56, 17, 35]. The Hierarchical Tucker (HT) decomposition recursively splits the modes based on a hierarchy and creates a binary tree $T$ containing a subset $t \subseteq \{1, \dots, d\}$ of the modes at each node [17]. Factor matrices $U_t$ are obtained from the SVD of $X_{(t)}$, which is the matricization of the tensor corresponding to the subset of the modes at each node. However, this matricization is different from the mode-$n$ matricization of the tensor: the rows of $X_{(t)}$ correspond to the modes in the set $t$ while its columns store the indices of the remaining modes. The constructed tree structure yields a hierarchy amongst the factor matrices, whose columns span the column space of $X_{(t)}$ for each $t \in T$. Let $t_l$ and $t_r$ be the successors of $t$. Then there exists a transfer matrix $B_t$ such that $U_t = \left( U_{t_l} \otimes U_{t_r} \right) B_t$. By assuming $n_j \leq n$ and all ranks bounded by $R$, an HT-rank-$R$ approximation of $\mathcal{X}$ requires storing the $U_t$s for the leaf nodes and the $B_t$s for the other nodes in $T$, with $O(dnR + dR^3)$ parameters [35].
3.5. TensorTrain Decomposition
Tensor-Train Decomposition (TT) has been proposed to compress large tensor data into smaller core tensors [16]. This model allows users to avoid the exponential growth of the Tucker model and provides more efficient storage complexity. The TT decomposition of a tensor $\mathcal{X}$ is written as:

(25) $\mathcal{X}_{i_1, i_2, \dots, i_d} = G_1(i_1) \, G_2(i_2) \cdots G_d(i_d),$

where $G_j(i_j) \in \mathbb{C}^{R_{j-1} \times R_j}$ is the $i_j$th lateral slice of the core $\mathcal{G}_j \in \mathbb{C}^{R_{j-1} \times n_j \times R_j}$ and $R_0 = R_d = 1$. The TT decomposition of $\mathcal{X}$ is obtained as follows. First, $\mathcal{G}_1$ is obtained from the SVD of the mode-1 matricization of $\mathcal{X}$ as

(26) $X_{(1)} = U \Sigma V^*,$

where $\mathcal{G}_1 = U \in \mathbb{C}^{n_1 \times R_1}$ and $R_1 = \operatorname{rank} \left( X_{(1)} \right)$. Let $Z \in \mathbb{C}^{R_1 n_2 \times n_3 n_4 \cdots n_d}$ be a reshaped version of $\Sigma V^* \in \mathbb{C}^{R_1 \times n_2 n_3 \cdots n_d}$. Then, $\mathcal{G}_2 \in \mathbb{C}^{R_1 \times n_2 \times R_2}$ is obtained by reshaping the $R_2$ left singular vectors of $Z$, where $R_2 = \operatorname{rank}(Z)$. By repeating this procedure, all of the core tensors $\mathcal{G}_j$ are obtained by a sequence of SVD decompositions of specific matricizations of $\mathcal{X}$. The storage complexity of a TT-rank-$R$ approximation of a $d$th-order tensor with $n_j \leq n$ is $O(dnR^2)$.
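The sequential-SVD construction just described (often called TT-SVD) can be sketched in NumPy as follows. This is an illustrative toy implementation with a naive rank-truncation rule, not the algorithm of [16] verbatim:

```python
import numpy as np

def tt_svd(X, eps=1e-10):
    """Sequential SVDs of reshaped remainders; ranks are chosen by
    discarding singular values below eps (a naive truncation rule)."""
    dims, d = X.shape, X.ndim
    cores, r_prev = [], 1
    Z = X.reshape(dims[0], -1)
    for j in range(d - 1):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        r = max(1, int((s > eps).sum()))
        cores.append(U[:, :r].reshape(r_prev, dims[j], r))
        Z = (s[:r, None] * Vt[:r]).reshape(r * dims[j + 1], -1)
        r_prev = r
    cores.append(Z.reshape(r_prev, dims[-1], 1))
    return cores

def tt_reconstruct(cores):
    X = cores[0]                                 # shape (1, n1, R1)
    for G in cores[1:]:
        X = np.tensordot(X, G, axes=(-1, 0))     # chain the matrix products
    return X.squeeze(axis=(0, -1))

rng = np.random.default_rng(10)
X = rng.standard_normal((3, 4, 5, 6))
cores = tt_svd(X)
exact = np.allclose(tt_reconstruct(cores), X)
```

With no truncation (all singular values kept), the reconstruction is exact up to floating-point error; compression comes from choosing a larger `eps` or capping the ranks.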
3.6. An Empirical Comparison of Several Different Tensor Decomposition Methods
In this section the CANDECOMP/PARAFAC, Tucker (HOOI), HoSVD, HT, and TT decompositions are compared in terms of data reduction rate and normalized reconstruction error. The data sets used for this purpose are:
(1) The PIE data set: This database contains images taken from one individual under several different illumination conditions and from several different angles [57]. All of the individual's images together form a single higher-order tensor.
(2) A Hyperspectral Image (HSI) data set: This database contains images taken at a number of different wavelengths [58]. Each image consists of the same number of pixels, so that together the images form a 3-mode tensor.
(3) The COIL-100 data set: This database includes 7,200 images taken from 100 objects [59]. Each object was imaged at different angles separated by 5 degrees, resulting in 72 images per object, each one consisting of 128 × 128 pixels. The original database is a 4-mode tensor which was reshaped as a lower-order tensor for the experiments.
Sample images from the above data sets can be viewed in Figure 4.
Several software packages were used to generate the results. The Tucker (HOOI) and CPD methods were evaluated using the Tensorlab package [60]. For CPD, the structure detection and exploitation option was disabled in order to avoid complications on the larger datasets (COIL-100 and