Era of Big Data Processing: A New Approach via Tensor Networks and Tensor Decompositions

Andrzej CICHOCKI
RIKEN Brain Science Institute, Japan
and Systems Research Institute of the Polish Academy of Science, Poland
a.cichocki@riken.jp
Part of this work was presented at the International Workshop on Smart Info-Media Systems in Asia (invited talk, SISA-2013), Sept. 30–Oct. 2, 2013, Nagoya, Japan
Abstract

Many problems in computational neuroscience, neuroinformatics, pattern/image recognition, signal processing and machine learning generate massive amounts of multidimensional data with multiple aspects and high dimensionality. Tensors (i.e., multi-way arrays) often provide a natural and compact representation for such massive multidimensional data via suitable low-rank approximations. Big data analytics requires novel technologies to efficiently process huge datasets within tolerable elapsed times. One such emerging technology for multidimensional big data is multiway analysis via tensor networks (TNs) and tensor decompositions (TDs), which represent tensors by sets of factor (component) matrices and lower-order (core) tensors. Dynamic tensor analysis allows us to discover meaningful hidden structures of complex data and to perform generalizations by capturing multi-linear and multi-aspect relationships. We will discuss some fundamental TN models, their mathematical and graphical descriptions and associated learning algorithms for large-scale TDs and TNs, with many potential applications including: anomaly detection, feature extraction, classification, cluster analysis, data fusion and integration, pattern recognition, predictive modeling, regression, time series analysis and multiway component analysis.

Keywords: Large-scale HOSVD, Tensor decompositions, CPD, Tucker models, Hierarchical Tucker (HT) decomposition, low-rank tensor approximations (LRA), Tensorization/Quantization, tensor train (TT/QTT) - Matrix Product States (MPS), Matrix Product Operator (MPO), DMRG, Strong Kronecker Product (SKP).

I Introduction and Motivations

Big Data consists of multidimensional, multi-modal data-sets that are so huge and complex that they cannot be easily stored or processed using standard computers. Big data are characterized not only by big Volume but also by other specific "V" features (see Fig. 1). High Volume implies the need for algorithms that are scalable; high Velocity addresses the challenges related to processing data in near real-time, or virtually real-time; high Veracity demands robust and predictive algorithms for noisy, incomplete or inconsistent data; and finally, high Variety may require integration across different kinds of data, e.g., neuroimages, time series, spiking trains, genetic and behavior data.

Figure  1: Big Data analysis for neuroscience recordings. Brain data can be recorded by electroencephalography (EEG), electrocorticography (ECoG), magnetoencephalography (MEG), fMRI, DTI, PET, and Multi-Unit Recording (MUR). This involves analysis of multiple modalities/multiple subjects: neuroimages, spectrograms, time series, genetic and behavior data. One of the challenges in computational and systems neuroscience is to fuse (assimilate) such data and to understand the multiple relationships among them in tasks such as perception, cognition and social interactions. The four "V"s of big data: Volume - scale of data, Variety - different forms of data, Veracity - uncertainty of data, and Velocity - analysis of streaming data, comprise the challenges ahead of us.

Many challenging problems for big data are related to capturing, managing, searching, visualizing, clustering, classifying, assimilating, merging, and processing the data within a tolerable elapsed time, hence demanding new innovative solutions and technologies. One such emerging technology is Tensor Decompositions (TDs) and Tensor Networks (TNs) via low-rank matrix/tensor approximations. The challenge is how to analyze large-scale, multiway data sets. The data explosion creates deep research challenges that require new scalable TD and TN algorithms.

Figure  2: A 3rd-order tensor and its sub-tensors: slices and fibers. All fibers are treated as column vectors.
Figure  3: Block matrices and their representations by (a) a 3rd-order tensor and (b) a 4th-order tensor.

Tensors are adopted in diverse branches of science such as data analysis, signal and image processing [1, 2, 3, 4], psychometrics, chemometrics, biometrics, quantum physics/information, and quantum chemistry [5, 6, 7]. Modern scientific areas such as bioinformatics or computational neuroscience generate massive amounts of data collected in various forms of large-scale, sparse tabular data, graphs or networks with multiple aspects and high dimensionality.

Tensors, which are multi-dimensional generalizations of matrices (see Fig. 2 and Fig. 3), often provide a useful representation for such data. Tensor decompositions (TDs) decompose data tensors into factor matrices, while tensor networks (TNs) represent higher-order tensors by interconnected lower-order tensors. We show that TDs and TNs provide natural extensions of blind source separation (BSS) and 2-way (matrix) Component Analysis (2-way CA) to multi-way component analysis (MWCA) methods. In addition, TD and TN algorithms are suitable for dimensionality reduction and can handle missing values and noisy data. Moreover, they are potentially useful for the analysis of linked (coupled) blocks of tensors with millions and even billions of non-zero entries, using the map-reduce paradigm as well as divide-and-conquer approaches [8, 9, 10]. All of this suggests that multidimensional data can be represented by linked multi-block tensors which can be decomposed into common (or correlated) and distinctive (uncorrelated, independent) components [3, 11, 12]. Effective analysis of coupled tensors requires the development of new models and associated algorithms that can identify the core relations that exist among the different tensor modes and, at the same time, scale to extremely large datasets. Our objective is to develop suitable models and algorithms for linked low-rank tensor approximations (TAs), and associated scalable software to make such analysis possible.

Review and tutorial papers [2, 4, 13, 14, 15] and books [6, 5, 1] dealing with TDs already exist; however, they typically focus on standard TDs and/or do not provide explicit links to big data processing topics, and they do not explore natural connections with emerging areas including multi-block coupled tensor analysis and tensor networks. This paper extends beyond the standard TD models and aims to elucidate the power and flexibility of TNs in the analysis of multi-dimensional, multi-modal, and multi-block data, together with their role as a mathematical backbone for the discovery of hidden structures in large-scale data [1, 2, 4].

Motivations - Why low-rank tensor approximations? A wealth of literature on (2-way) component analysis (CA) and BSS exists, especially on Principal Component Analysis (PCA), Independent Component Analysis (ICA), Sparse Component Analysis (SCA), Nonnegative Matrix Factorizations (NMF), and Morphological Component Analysis (MCA) [16, 1, 17]. These techniques are maturing, and are promising tools for blind source separation (BSS), dimensionality reduction, feature extraction, clustering, classification, and visualization [1, 17].

The “flattened view” provided by 2-way CA and matrix factorizations (PCA/SVD, NMF, SCA, MCA) may be inappropriate for large classes of real-world data which exhibit multiple couplings and cross-correlations. In this context, higher-order tensor networks give us the opportunity to develop more sophisticated models performing distributed computing and capturing multiple interactions and couplings, instead of standard pairwise interactions. In other words, to discover hidden components within multiway data the analysis tools should account for intrinsic multi-dimensional distributed patterns present in the data.

Figure  4: Basic symbols for tensor network diagrams.

II Basic Tensor Operations

$\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ – $N$th-order tensor of size $I_1 \times I_2 \times \cdots \times I_N$
$\underline{\mathbf{G}}$, $\underline{\mathbf{S}}$, $\underline{\mathbf{G}}^{(n)}$ – core tensors
$\underline{\mathbf{\Lambda}} \in \mathbb{R}^{R \times R \times \cdots \times R}$ – $N$th-order diagonal core tensor with nonzero entries $\lambda_r$ on the main diagonal
$\mathbf{B} = [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_R] \in \mathbb{R}^{I \times R}$ – matrix with column vectors $\mathbf{b}_r \in \mathbb{R}^{I}$ and entries $b_{ir}$
$\mathbf{B}^{(n)} = [\mathbf{b}^{(n)}_1, \ldots, \mathbf{b}^{(n)}_{R_n}] \in \mathbb{R}^{I_n \times R_n}$ – component matrices
$\mathbf{i} = (i_1, i_2, \ldots, i_N)$ – vector of indices
$\mathbf{X}_{(n)} \in \mathbb{R}^{I_n \times I_1 \cdots I_{n-1} I_{n+1} \cdots I_N}$ – mode-$n$ unfolding of $\underline{\mathbf{X}}$
$\mathbf{x}_{:, i_2, i_3, \ldots, i_N} \in \mathbb{R}^{I_1}$ – mode-1 fiber of $\underline{\mathbf{X}}$ obtained by fixing all but one index
$\mathbf{X}_{:, :, i_3, \ldots, i_N} \in \mathbb{R}^{I_1 \times I_2}$ – tensor slice of $\underline{\mathbf{X}}$ obtained by fixing all but two indices
$\underline{\mathbf{X}}_{:, :, :, i_4, \ldots, i_N}$ – subtensor of $\underline{\mathbf{X}}$, in which several indices are fixed
$\mathrm{vec}(\underline{\mathbf{X}})$ – vectorization of $\underline{\mathbf{X}}$
$\mathbf{D} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_R)$ – diagonal matrix
TABLE I: Basic tensor notation and symbols. Tensors are denoted by underlined bold capital letters, matrices by uppercase bold letters, vectors by lowercase boldface letters and scalars by lowercase letters.
Figure  5: Hierarchical block matrices and their representations as tensors: (a) a 4th-order tensor for a block matrix comprising block matrices, (b) a 5th-order tensor, and (c) a 6th-order tensor.

Figure  6: Graphical representations and symbols for higher-order block tensors (a 4th-order tensor, 5th-order tensors, and a 6th-order tensor). Each block represents a 3rd-order or a 2nd-order tensor. The external circle represents the global structure of the block tensor (e.g., a vector, a matrix, a 3rd-order tensor) and the inner circle represents the structure of each element of the block tensor.


Figure  7: Unfoldings in tensor networks: (a) Graphical representation of the basic mode-$n$ unfolding (matricization, flattening) of an $N$th-order tensor into a matrix $\mathbf{X}_{(n)}$. (b) More general unfolding of the $N$th-order tensor into a matrix. All entries of an unfolded tensor are arranged in a specific order, e.g., in lexicographical order. In the more general case, one subset of modes forms the row indices of the unfolded matrix and the complementary subset forms its column indices.
$\underline{\mathbf{C}} = \underline{\mathbf{A}} \times_n \mathbf{B}$ – mode-$n$ product of $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $\mathbf{B} \in \mathbb{R}^{J \times I_n}$ yields $\underline{\mathbf{C}} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N}$, with entries $c_{i_1, \ldots, i_{n-1}, j, i_{n+1}, \ldots, i_N} = \sum_{i_n=1}^{I_n} a_{i_1, \ldots, i_n, \ldots, i_N}\, b_{j, i_n}$, or equivalently $\mathbf{C}_{(n)} = \mathbf{B}\,\mathbf{A}_{(n)}$
$\underline{\mathbf{C}} = \underline{\mathbf{A}} \circ \underline{\mathbf{B}}$ – tensor or outer product of $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times \cdots \times J_M}$ yields an $(N+M)$th-order tensor $\underline{\mathbf{C}}$, with entries $c_{i_1, \ldots, i_N, j_1, \ldots, j_M} = a_{i_1, \ldots, i_N}\, b_{j_1, \ldots, j_M}$
$\underline{\mathbf{X}} = \mathbf{a} \circ \mathbf{b} \circ \mathbf{c}$ – tensor or outer product of vectors forms a rank-1 tensor, with entries $x_{ijk} = a_i\, b_j\, c_k$
$\mathbf{A}^{T}$, $\mathbf{A}^{-1}$, $\mathbf{A}^{\dagger}$ – transpose, inverse and Moore-Penrose pseudo-inverse of $\mathbf{A}$
$\mathbf{C} = \mathbf{A} \otimes \mathbf{B}$ – Kronecker product of $\mathbf{A} \in \mathbb{R}^{I_1 \times I_2}$ and $\mathbf{B} \in \mathbb{R}^{J_1 \times J_2}$ yields $\mathbf{C} \in \mathbb{R}^{I_1 J_1 \times I_2 J_2}$, with entries $c_{\overline{i_1 j_1}, \overline{i_2 j_2}} = a_{i_1 i_2}\, b_{j_1 j_2}$, where $\overline{i_n j_n} = j_n + (i_n - 1) J_n$
$\mathbf{C} = \mathbf{A} \odot \mathbf{B}$ – Khatri-Rao product of $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_R] \in \mathbb{R}^{I \times R}$ and $\mathbf{B} = [\mathbf{b}_1, \ldots, \mathbf{b}_R] \in \mathbb{R}^{J \times R}$ yields $\mathbf{C} \in \mathbb{R}^{IJ \times R}$, with columns $\mathbf{c}_r = \mathbf{a}_r \otimes \mathbf{b}_r$
TABLE II: Basic tensor/matrix operations.

A higher-order tensor can be interpreted as a multiway array, as illustrated graphically in Figs. 2, 3 and 4. Our adopted convention is that tensors are denoted by bold underlined capital letters, e.g., $\underline{\mathbf{X}}$, and that all data are real-valued. The order of a tensor is the number of its "modes", "ways" or "dimensions", which can include space, time, frequency, trials, classes, and dictionaries. Matrices (2nd-order tensors) are denoted by boldface capital letters, e.g., $\mathbf{B}$, and vectors (1st-order tensors) by boldface lowercase letters; for instance, the columns of the matrix $\mathbf{B} = [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_R]$ are denoted by $\mathbf{b}_r$ and elements of a matrix (scalars) are denoted by lowercase letters, e.g., $b_{ir}$ (see Table I).

The most common types of tensor multiplications are denoted by: $\otimes$ for the Kronecker product, $\odot$ for the Khatri-Rao product, $\circledast$ for the Hadamard (componentwise) product, $\circ$ for the outer product and $\times_n$ for the mode-$n$ product (see Table II).

TNs and TDs can be represented by tensor network diagrams, in which tensors are represented graphically by nodes of any shape (e.g., circles, spheres, triangles, squares, ellipses), and each outgoing edge (line) emerging from a node represents a mode (a way, dimension, index) (see Fig. 4). Tensor network diagrams are very useful not only for visualizing tensor decompositions, but also for expressing their various transformations/reshapings and for graphical illustration of mathematical (multilinear) operations.

It should also be noted that block matrices and hierarchical block matrices can be represented by tensors. For example, 3rd-order and 4th-order tensors can be represented by block matrices as illustrated in Fig. 3, and all algebraic operations can be performed on block matrices. Analogously, higher-order tensors can be represented as illustrated in Fig. 5 and Fig. 6. Subtensors are formed when a subset of indices is fixed. Of particular interest are fibers, defined by fixing every index but one, and matrix slices, which are two-dimensional sections (matrices) of a tensor obtained by fixing all the indices but two (see Fig. 2). A matrix has two modes: rows and columns, while an $N$th-order tensor has $N$ modes.

The process of unfolding (see Fig. 7) flattens a tensor into a matrix. In the simplest scenario, the mode-$n$ unfolding (matricization, flattening) of the tensor $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ yields a matrix $\mathbf{A}_{(n)} \in \mathbb{R}^{I_n \times I_1 \cdots I_{n-1} I_{n+1} \cdots I_N}$, with entries $(\mathbf{A}_{(n)})_{i_n, \overline{i_1 \cdots i_{n-1} i_{n+1} \cdots i_N}} = a_{i_1, i_2, \ldots, i_N}$, such that the remaining indices are arranged in a specific order, e.g., in lexicographical order [4]. In tensor networks we typically use a more general unfolding, as illustrated in Fig. 7 (b).
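For concreteness, the following is a minimal NumPy sketch (an illustration, not part of the original paper) of a mode-$n$ unfolding. Note that NumPy modes are 0-based, and the remaining indices are arranged here in C (row-major) order, which is one possible ordering convention (see the Remark below).

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move the chosen mode to the front, flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# Example: a random 3rd-order tensor of size 4 x 5 x 6
X = np.random.randn(4, 5, 6)
print(unfold(X, 1).shape)   # (5, 24): unfolding along the second mode
```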

By a multi-index $i = \overline{i_1 i_2 \cdots i_N}$ we denote an index which takes all possible combinations of values of $i_1, i_2, \ldots, i_N$, for $i_n = 1, 2, \ldots, I_n$, in a specific and consistent order. The entries of matrices or tensors in matricized and/or vectorized forms can be ordered in at least two different ways.

Remark: The multi-index can be defined using two different conventions:

1) The little-endian convention

$\overline{i_1 i_2 \cdots i_N} = i_1 + (i_2 - 1) I_1 + (i_3 - 1) I_1 I_2 + \cdots + (i_N - 1) I_1 I_2 \cdots I_{N-1},$  (1)

2) The big-endian convention

$\overline{i_1 i_2 \cdots i_N} = i_N + (i_{N-1} - 1) I_N + (i_{N-2} - 1) I_N I_{N-1} + \cdots + (i_1 - 1) I_2 I_3 \cdots I_N.$  (2)

The little-endian notation is consistent with the Fortran style of indexing, while the big-endian notation is similar to numbers written in the positional system and corresponds to reverse lexicographic order. The definitions of the unfolding of tensors and of the Kronecker (tensor) product should also be consistent with the chosen convention. (The standard and more popular definition in multilinear algebra assumes the big-endian convention, while for the development of efficient program code for big data the little-endian convention usually seems to be more convenient; see the papers of Dolgov and Savostyanov [18, 19] for more detail.) In this paper we will use the big-endian notation; for two indices it is enough to remember that $\overline{ij} = j + (i - 1) J$.
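As a small illustration (assuming NumPy; the indices here are 0-based, unlike the 1-based convention above), the two conventions correspond directly to NumPy's 'F' and 'C' orderings in np.ravel_multi_index:

```python
import numpy as np

shape = (2, 3, 4)          # I1 x I2 x I3
idx = (1, 0, 2)            # 0-based multi-index (i1, i2, i3)

print(np.ravel_multi_index(idx, shape, order='F'))   # 13: little-endian, i1 varies fastest
print(np.ravel_multi_index(idx, shape, order='C'))   # 14: big-endian, iN varies fastest
```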

The Kronecker product of two tensors $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$ yields $\underline{\mathbf{C}} = \underline{\mathbf{A}} \otimes \underline{\mathbf{B}} \in \mathbb{R}^{I_1 J_1 \times I_2 J_2 \times \cdots \times I_N J_N}$, with entries $c_{\overline{i_1 j_1}, \ldots, \overline{i_N j_N}} = a_{i_1, \ldots, i_N}\, b_{j_1, \ldots, j_N}$, where $\overline{i_n j_n} = j_n + (i_n - 1) J_n$.

The mode-$n$ product of a tensor $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ by a vector $\mathbf{b} \in \mathbb{R}^{I_n}$ is defined as a tensor $\underline{\mathbf{C}} = \underline{\mathbf{A}} \,\bar{\times}_n\, \mathbf{b} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_N}$, with entries $c_{i_1, \ldots, i_{n-1}, i_{n+1}, \ldots, i_N} = \sum_{i_n=1}^{I_n} a_{i_1, \ldots, i_N}\, b_{i_n}$, while the mode-$n$ product of the tensor by a matrix $\mathbf{B} \in \mathbb{R}^{J \times I_n}$ is the tensor $\underline{\mathbf{C}} = \underline{\mathbf{A}} \times_n \mathbf{B} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N}$, with entries $c_{i_1, \ldots, i_{n-1}, j, i_{n+1}, \ldots, i_N} = \sum_{i_n=1}^{I_n} a_{i_1, \ldots, i_N}\, b_{j, i_n}$. This can also be expressed in a matrix form as $\mathbf{C}_{(n)} = \mathbf{B}\,\mathbf{A}_{(n)}$ (see Fig. 8), which allows us to employ fast matrix-by-vector and matrix-by-matrix multiplications for very large scale problems.
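The NumPy sketch below (illustrative only, not taken from the paper) computes a mode-$n$ product through exactly this unfolding identity $\mathbf{C}_{(n)} = \mathbf{B}\,\mathbf{A}_{(n)}$, using 0-based modes:

```python
import numpy as np

def unfold(tensor, mode):
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix, mode, shape):
    """Inverse of unfold for a target tensor of the given shape."""
    full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape(full), 0, mode)

def mode_n_product(A, B, mode):
    """C = A x_n B, computed via the unfolding identity C_(n) = B @ A_(n)."""
    new_shape = list(A.shape)
    new_shape[mode] = B.shape[0]
    return fold(B @ unfold(A, mode), mode, new_shape)

A = np.random.randn(3, 4, 5)
B = np.random.randn(7, 4)                  # acts on the second mode (index 1)
C = mode_n_product(A, B, mode=1)
print(C.shape)                             # (3, 7, 5)
print(np.allclose(C, np.einsum('ijk,lj->ilk', A, B)))   # sanity check: True
```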


Figure  8: From a matrix format to the tensor network format. (a) The multilinear mode-$n$ product of a 3rd-order tensor and a factor (component) matrix yields a new 3rd-order tensor. This is equivalent to the simple matrix multiplication formula $\mathbf{C}_{(n)} = \mathbf{B}\,\mathbf{A}_{(n)}$. (b) The multilinear mode-$n$ product of an $N$th-order tensor and a factor matrix.

If we take the products in all the modes, then we have the full multilinear product of a tensor and a set of matrices, which is compactly written as [4] (see Fig. 9 (a)):

$\underline{\mathbf{C}} = \underline{\mathbf{G}} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \cdots \times_N \mathbf{B}^{(N)} = [\![\underline{\mathbf{G}};\, \mathbf{B}^{(1)}, \mathbf{B}^{(2)}, \ldots, \mathbf{B}^{(N)}]\!].$  (3)


Figure  9: Multilinear products via tensor network diagrams. (a) The full multilinear product of a core tensor (Tucker decomposition) and factor (component) matrices yields the Tucker tensor decomposition. (b) The multilinear product of a tensor and vectors yields a vector.

In a similar way to the mode-$n$ multilinear product, we can define the mode-$(m, n)$ product of two tensors (tensor contraction) $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_M}$, with common modes $I_n = J_m$, that yields an $(N + M - 2)$th-order tensor $\underline{\mathbf{C}} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_N \times J_1 \times \cdots \times J_{m-1} \times J_{m+1} \times \cdots \times J_M}$:

$\underline{\mathbf{C}} = \underline{\mathbf{A}} \times^{m}_{n} \underline{\mathbf{B}},$  (4)

with entries $c_{i_1, \ldots, i_{n-1}, i_{n+1}, \ldots, i_N, j_1, \ldots, j_{m-1}, j_{m+1}, \ldots, j_M} = \sum_{i_n=1}^{I_n} a_{i_1, \ldots, i_n, \ldots, i_N}\, b_{j_1, \ldots, j_{m-1}, i_n, j_{m+1}, \ldots, j_M}$ (see Fig. 10 (a)). This operation can be considered as a contraction of two tensors in a single common mode. Tensors can be contracted in several modes or even in all modes, as illustrated in Fig. 10.
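In NumPy, such single-mode and multi-mode contractions can be performed with np.tensordot; the short snippet below is only an illustration with arbitrary dimensions, not code from the paper.

```python
import numpy as np

A = np.random.randn(3, 4, 5)     # I1 x I2 x I3
B = np.random.randn(5, 6, 7)     # J1 x J2 x J3, with I3 = J1 = 5

# Contraction in one common mode (last mode of A with first mode of B):
C = np.tensordot(A, B, axes=([2], [0]))
print(C.shape)                   # (3, 4, 6, 7): a 4th-order tensor

# Contraction in all modes yields the inner product (a scalar):
D = np.random.randn(3, 4, 5)
print(np.tensordot(A, D, axes=3), np.sum(A * D))   # identical values
```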

When it is not confusing, the super- or sub-index can be omitted. For example, the multilinear product of the tensors $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_M}$, with a common mode $I_N = J_1$, can be written as

$\underline{\mathbf{C}} = \underline{\mathbf{A}} \times^{1} \underline{\mathbf{B}} \in \mathbb{R}^{I_1 \times \cdots \times I_{N-1} \times J_2 \times \cdots \times J_M},$  (5)

with entries $c_{i_1, \ldots, i_{N-1}, j_2, \ldots, j_M} = \sum_{i_N=1}^{I_N} a_{i_1, \ldots, i_{N-1}, i_N}\, b_{i_N, j_2, \ldots, j_M}$. Furthermore, note that for multiplications of matrices and vectors this notation implies that $\mathbf{A} \times^{1}_{2} \mathbf{B} = \mathbf{A}\mathbf{B}$, $\mathbf{A} \times^{2}_{2} \mathbf{B} = \mathbf{A}\mathbf{B}^{T}$, $\mathbf{A} \times^{1}_{2} \mathbf{x} = \mathbf{A}\mathbf{x}$, and $\mathbf{A} \times^{1}_{1} \mathbf{x} = \mathbf{A}^{T}\mathbf{x}$.

Remark: If contractions are performed for more than two tensors, the order in which they are carried out has to be specified (defined).

Figure  10: Examples of tensor contractions: (a) The multilinear product of two tensors, denoted by $\underline{\mathbf{A}} \times^{m}_{n} \underline{\mathbf{B}}$. (b) The inner product of two 3rd-order tensors yields a scalar $c = \langle \underline{\mathbf{A}}, \underline{\mathbf{B}} \rangle = \sum_{i_1, i_2, i_3} a_{i_1 i_2 i_3}\, b_{i_1 i_2 i_3}$. (c) Tensor contraction of two 4th-order tensors in their common modes yields a lower-order tensor whose entries are sums of products over the contracted indices. (d) Tensor contraction of two 5th-order tensors, in three common modes, yields a 4th-order tensor.

The outer or tensor product of the tensors $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times \cdots \times J_M}$ is the tensor $\underline{\mathbf{C}} = \underline{\mathbf{A}} \circ \underline{\mathbf{B}} \in \mathbb{R}^{I_1 \times \cdots \times I_N \times J_1 \times \cdots \times J_M}$, with entries $c_{i_1, \ldots, i_N, j_1, \ldots, j_M} = a_{i_1, \ldots, i_N}\, b_{j_1, \ldots, j_M}$. Specifically, the outer product of two nonzero vectors $\mathbf{a} \in \mathbb{R}^{I}$ and $\mathbf{b} \in \mathbb{R}^{J}$ produces a rank-1 matrix $\mathbf{X} = \mathbf{a} \circ \mathbf{b} = \mathbf{a}\mathbf{b}^{T}$, and the outer product of three nonzero vectors $\mathbf{a} \in \mathbb{R}^{I}$, $\mathbf{b} \in \mathbb{R}^{J}$ and $\mathbf{c} \in \mathbb{R}^{K}$ produces a 3rd-order rank-1 tensor $\underline{\mathbf{X}} = \mathbf{a} \circ \mathbf{b} \circ \mathbf{c} \in \mathbb{R}^{I \times J \times K}$, whose entries are $x_{ijk} = a_i\, b_j\, c_k$. A tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is said to be rank-1 if it can be expressed exactly as $\underline{\mathbf{X}} = \mathbf{b}^{(1)} \circ \mathbf{b}^{(2)} \circ \cdots \circ \mathbf{b}^{(N)}$, with entries $x_{i_1, i_2, \ldots, i_N} = b^{(1)}_{i_1} b^{(2)}_{i_2} \cdots b^{(N)}_{i_N}$, where $\mathbf{b}^{(n)} \in \mathbb{R}^{I_n}$ are nonzero vectors. We refer to [1, 4] for more detail regarding the basic notations and tensor operations.
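A short NumPy illustration (not from the original text) of a 3rd-order rank-1 tensor built as an outer product:

```python
import numpy as np

a, b, c = np.random.randn(3), np.random.randn(4), np.random.randn(5)

X = np.einsum('i,j,k->ijk', a, b, c)                 # outer (tensor) product
X2 = a[:, None, None] * b[None, :, None] * c[None, None, :]   # via broadcasting
print(X.shape, np.allclose(X, X2))                   # (3, 4, 5) True
```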

III Tensor Networks

A tensor network aims to represent or decompose a higher-order tensor into a set of lower-order tensors (typically, 2nd-order tensors, i.e., matrices, and 3rd-order tensors, called cores or components) which are sparsely interconnected. In other words, in contrast to TDs, TNs represent decompositions of the data tensors into a set of sparsely (weakly) interconnected lower-order tensors. Recently, the curse of dimensionality for higher-order tensors has been considerably alleviated or even completely avoided through the concept of tensor networks (TNs) [20, 21]. A TN can be represented by a set of nodes interconnected by lines. The lines (leads, branches, edges) connecting tensors with each other correspond to contracted modes, whereas lines that do not go from one tensor to another correspond to open (physical) modes in the TN (see Fig. 11).

An edge connecting two nodes indicates a contraction of the respective tensors in the associated pair of modes, as illustrated in Fig. 10. Each free (dangling) edge corresponds to a mode that is not contracted and, hence, the order of the entire tensor network is given by the number of free edges (often called physical indices). A tensor network may not contain any self-loops, i.e., edges connecting a node with itself. Some examples of tensor network diagrams are given in Fig. 11.

Figure  11: Examples of tensor networks. Illustration of the representation of a 9th-order tensor by different kinds of tensor networks (TNs): the Tensor Train (TT), which is equivalent to the Matrix Product State (MPS) with open boundary conditions (OBC); the Projected Entangled-Pair State (PEPS), also called Tensor Product States (TPS); and the Hierarchical Tucker (HT) decomposition, which is equivalent to the Tree Tensor Network State (TTNS). The objective is to decompose a high-order tensor into sparsely connected low-order and low-rank tensors, typically 3rd-order and/or 4th-order tensors, called cores.

If a tensor network is a tree, i.e., it does not contain any cycles, each of its edges splits the modes of the data tensor into two groups, which is related to a suitable matricization of the tensor. If, in such a tree tensor network, all nodes have degree 3 or less, it corresponds to a Hierarchical Tucker (HT) decomposition, shown in Fig. 12 (a). The HT decomposition was first introduced in scientific computing by Hackbusch and Kühn and further developed by Grasedyck, Kressner, Tobler and others [22, 7, 23, 24, 25, 26]. Note that for a 6th-order tensor there are two such tensor networks (see Fig. 12 (b)), and for a 10th-order tensor there are 11 possible HT decompositions [24, 25].


Figure  12: (a) The standard Tucker decomposition and its transformation into the Hierarchical Tucker (HT) model for an 8th-order tensor using interconnected 3rd-order core tensors. (b) Various exemplary structures of HT/TT models for data tensors of different orders. Green circles indicate factor matrices while red circles indicate cores.

A simple approach to reduce the size of core tensors is to apply distributed tensor networks (DTNs), which consist of two kinds of cores (nodes): internal cores (nodes), which have no free edges, and external cores, which have free edges representing the physical indices of a data tensor, as illustrated in Figs. 12 and 13.

The idea, in the case of the Tucker model, is that the core tensor is replaced by distributed, sparsely interconnected cores of lower order, resulting in a Hierarchical Tucker (HT) network in which only some cores are connected (associated) directly with factor matrices [7, 23, 26, 22].

For some very high-order data tensors it has been observed that the ranks (internal dimensions of the cores) increase rapidly with the order of the tensor and/or with the desired accuracy of approximation, for any choice of tensor network that is a tree, including TT and HT decompositions [25]. For such cases, the Projected Entangled-Pair State (PEPS) or the Multi-scale Entanglement Renormalization Ansatz (MERA) tensor networks can be used. These contain cycles, but have hierarchical structures (see Fig. 13). For the PEPS and MERA TNs the ranks can be kept considerably smaller, at the cost of employing 5th- and 4th-order cores and, consequently, a higher computational complexity w.r.t. tensor contractions. The main advantage of PEPS and MERA is that the size of each core tensor in the internal tensor network structure is usually much smaller than that of the cores in TT/HT decompositions, so the total number of parameters can be reduced. However, it should be noted that the contraction of the resulting tensor network becomes more difficult when compared to the basic tree structures represented by TT and HT models. This is due to the fact that the PEPS and MERA tensor networks contain cycles.

Figure  13: Alternative distributed representation of an 8th-order Tucker decomposition, where the core tensor is replaced by a MERA (Multi-scale Entanglement Renormalization Ansatz) tensor network which employs 3rd-order and 4th-order core tensors. For some data-sets, the advantage of such a model is the relatively small size (dimensions) of the distributed cores.

IV Basic Tensor Decompositions and their Representation via Tensor Network Diagrams

The main objective of a standard tensor decomposition is to factorize a data tensor into physically interpretable or meaningful factor matrices and a single core tensor which indicates the links between components (vectors of factor matrices) in different modes.

IV-A Constrained Matrix Factorizations and Decompositions – Two-Way Component Analysis

Two-way Component Analysis (2-way CA) exploits a priori knowledge about different characteristics, features or morphology of components (or source signals) [1, 27] to find the hidden components through constrained matrix factorizations of the form

$\mathbf{X} = \mathbf{A}\mathbf{B}^{T} + \mathbf{E} = \sum_{r=1}^{R} \mathbf{a}_r \mathbf{b}_r^{T} + \mathbf{E},$  (6)

where the constraints imposed on the factor matrices $\mathbf{A}$ and/or $\mathbf{B}$ include orthogonality, sparsity, statistical independence, nonnegativity or smoothness. The CA can be considered as a bilinear (2-way) factorization, where $\mathbf{X} \in \mathbb{R}^{I \times J}$ is a known matrix of observed data, $\mathbf{E} \in \mathbb{R}^{I \times J}$ represents residuals or noise, $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_R] \in \mathbb{R}^{I \times R}$ is the unknown (usually, full column rank $R$) mixing matrix with basis vectors $\mathbf{a}_r$, and $\mathbf{B} = [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_R] \in \mathbb{R}^{J \times R}$ is the matrix of unknown components (factors, latent variables, sources).

Two-way component analysis (CA) refers to a class of signal processing techniques that decompose or encode superimposed or mixed signals into components with certain constraints or properties. The CA methods exploit a priori knowledge about the true nature or diversities of latent variables. By diversity, we refer to different characteristics, features or morphology of sources or hidden latent variables [27].

For example, the columns of the matrix $\mathbf{B}$ that represent different data sources should be: as statistically independent as possible for ICA; as sparse as possible for SCA; or take only nonnegative values for NMF [1, 27, 16].
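As a hedged illustration of one such constrained factorization (not part of the original paper), the sketch below applies scikit-learn's NMF to a synthetic nonnegative matrix; the data, rank and parameter choices are arbitrary assumptions made only for the example.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical nonnegative data matrix X (I x J), e.g., a spectrogram.
rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((100, 40)))

model = NMF(n_components=5, init='nndsvda', max_iter=500, random_state=0)
A = model.fit_transform(X)       # I x R nonnegative mixing/basis matrix
B_T = model.components_          # R x J nonnegative component matrix (B^T)
print(np.linalg.norm(X - A @ B_T) / np.linalg.norm(X))   # relative fit error
```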

Remark: Note that matrix factorizations have an inherent symmetry: Eq. (6) could also be written as $\mathbf{X}^{T} = \mathbf{B}\mathbf{A}^{T} + \mathbf{E}^{T}$, thus interchanging the roles of the sources and the mixing process.

The singular value decomposition (SVD) of the data matrix $\mathbf{X} \in \mathbb{R}^{I \times J}$ is a special case of the factorization in Eq. (6). It is exact, provides an explicit notion of the range and null space of the matrix (key issues in low-rank approximation), and is given by

$\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^{T} = \sum_{r=1}^{R} \sigma_r\, \mathbf{u}_r \mathbf{v}_r^{T},$  (7)

where $\mathbf{U} \in \mathbb{R}^{I \times R}$ and $\mathbf{V} \in \mathbb{R}^{J \times R}$ are column-wise orthonormal matrices and $\mathbf{\Sigma} = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_R)$ is a diagonal matrix containing only nonnegative singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_R \geq 0$.
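A minimal NumPy sketch (illustrative only) of the truncated SVD, which gives the best rank-$R$ approximation in the Frobenius norm:

```python
import numpy as np

X = np.random.randn(50, 30)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

R = 5
X_R = U[:, :R] @ np.diag(s[:R]) @ Vt[:R, :]     # rank-R approximation
print(np.linalg.norm(X - X_R))                  # error of the truncation
print(np.sqrt(np.sum(s[R:] ** 2)))              # equals sqrt of discarded sigma_r^2
```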

Another virtue of component analysis comes from the representation of multiple-subject, multiple-task datasets by a set of data matrices $\mathbf{X}_k$ $(k = 1, 2, \ldots, K)$, allowing us to perform simultaneous matrix factorizations:

$\mathbf{X}_k \cong \mathbf{A}_k \mathbf{B}_k^{T}, \qquad k = 1, 2, \ldots, K,$  (8)

subject to various constraints. In the case of statistical independence constraints, the problem can be related to models of group ICA through suitable pre-processing, dimensionality reduction and post-processing procedures [28].

The field of CA is maturing and has generated efficient algorithms for 2-way component analysis (especially, for sparse/functional PCA/SVD, ICA, NMF and SCA) [16, 1, 29]. The rapidly emerging field of tensor decompositions is the next important step that naturally generalizes 2-way CA/BSS algorithms and paradigms. We proceed to show how constrained matrix factorizations and component analysis (CA) models can be naturally generalized to multilinear models using constrained tensor decompositions, such as the Canonical Polyadic Decomposition (CPD) and Tucker models, as illustrated in Figs. 14 and 15.

IV-B The Canonical Polyadic Decomposition (CPD)

The CPD (also called PARAFAC or CANDECOMP) factorizes an $N$th-order tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ into a linear combination of terms $\mathbf{b}^{(1)}_r \circ \mathbf{b}^{(2)}_r \circ \cdots \circ \mathbf{b}^{(N)}_r$, which are rank-1 tensors, and is given by [30, 31, 32]

$\underline{\mathbf{X}} \cong \sum_{r=1}^{R} \lambda_r\, \mathbf{b}^{(1)}_r \circ \mathbf{b}^{(2)}_r \circ \cdots \circ \mathbf{b}^{(N)}_r = \underline{\mathbf{\Lambda}} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \cdots \times_N \mathbf{B}^{(N)},$  (9)

where the only non-zero entries $\lambda_r$ of the diagonal core tensor $\underline{\mathbf{\Lambda}} \in \mathbb{R}^{R \times R \times \cdots \times R}$ are located on its main diagonal (see Fig. 14 for a 3rd-order and a 4th-order tensor).

Via Khatri-Rao products, the CPD can also be expressed in a matrix/vector form as:

$\mathbf{X}_{(n)} \cong \mathbf{B}^{(n)}\, \mathbf{\Lambda} \Big( \bigodot_{k \neq n} \mathbf{B}^{(k)} \Big)^{T},$  (10)

where $\mathbf{B}^{(n)} = [\mathbf{b}^{(n)}_1, \ldots, \mathbf{b}^{(n)}_R] \in \mathbb{R}^{I_n \times R}$, $\mathbf{\Lambda} = \mathrm{diag}(\lambda_1, \ldots, \lambda_R)$ is a diagonal matrix, and the Khatri-Rao product is taken over all factor matrices except $\mathbf{B}^{(n)}$, in the order determined by the chosen (here, big-endian) convention.
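The following sketch (an illustration under stated assumptions, not code from the paper) checks the unfolded CP form numerically for a 3rd-order tensor, using scipy.linalg.khatri_rao; the Khatri-Rao ordering below is the one that matches the C-ordered unfolding used in this snippet.

```python
import numpy as np
from scipy.linalg import khatri_rao

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

R = 3
B1, B2, B3 = (np.random.randn(I, R) for I in (4, 5, 6))
lam = np.random.randn(R)

# Full tensor of the CP model with weights lambda_r
X = np.einsum('r,ir,jr,kr->ijk', lam, B1, B2, B3)

# Mode-1 unfolded form: X_(1) = B1 diag(lambda) (B2 kr B3)^T
X1 = B1 @ np.diag(lam) @ khatri_rao(B2, B3).T
print(np.allclose(unfold(X, 0), X1))    # True
```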

Figure  14: Representation of the CPD. (a) The Canonical Polyadic Decomposition (CPD) of a 3rd-order tensor as a sum of $R$ rank-1 terms. (b) The CPD of a 4th-order tensor. The objective of the CPD is to estimate the factor matrices $\mathbf{B}^{(n)}$ and the rank of the tensor, that is, the number of components $R$.

The rank of a tensor $\underline{\mathbf{X}}$ is defined as the smallest $R$ for which the CPD (9) holds exactly.

Algorithms to compute the CPD. In the presence of noise, in real world applications the CPD is rarely exact and has to be estimated by minimizing a suitable cost function, typically of the Least-Squares (LS) type in the form of the Frobenius norm $\|\underline{\mathbf{X}} - \hat{\underline{\mathbf{X}}}\|_F^2$, or using Least Absolute Error (LAE) criteria [33]. The Alternating Least Squares (ALS) algorithms [13, 31, 1, 34] minimize the LS cost function by optimizing each component matrix individually, while keeping the other component matrices fixed. For instance, assume that the diagonal matrix $\mathbf{\Lambda}$ has been absorbed into one of the component matrices; then, by taking advantage of the Khatri-Rao structure, the component matrices can be updated sequentially as [4]

$\mathbf{B}^{(n)} \leftarrow \mathbf{X}_{(n)} \Big( \bigodot_{k \neq n} \mathbf{B}^{(k)} \Big) \Big( \mathop{\circledast}_{k \neq n} \mathbf{B}^{(k)T} \mathbf{B}^{(k)} \Big)^{\dagger},$  (11)

which requires only the computation of the pseudo-inverse of small $R \times R$ matrices.

The ALS approach is attractive for its simplicity, and for well-defined problems (not too many components, well separated and not collinear) and high SNR, the performance of ALS algorithms is often satisfactory. For ill-conditioned problems, more advanced algorithms exist, which typically exploit the rank-1 structure of the terms within the CPD to perform efficient computation and storage of the Jacobian and Hessian of the cost function [35, 36].
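For illustration only, here is a bare-bones CP-ALS in NumPy following the update (11); practical implementations would additionally normalize the factors, monitor the fit and use a stopping criterion.

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(mats):
    """Column-wise Kronecker (Khatri-Rao) product of a list of matrices."""
    R = mats[0].shape[1]
    out = mats[0]
    for M in mats[1:]:
        out = np.einsum('ir,jr->ijr', out, M).reshape(-1, R)
    return out

def cp_als(X, R, n_iter=100, seed=0):
    """Plain (unconstrained) ALS for the CPD of an Nth-order tensor."""
    rng = np.random.default_rng(seed)
    N = X.ndim
    B = [rng.standard_normal((X.shape[n], R)) for n in range(N)]
    for _ in range(n_iter):
        for n in range(N):
            others = [B[k] for k in range(N) if k != n]   # order matches unfold
            KR = khatri_rao(others)
            G = np.ones((R, R))
            for M in others:
                G *= M.T @ M                  # Hadamard product of Gram matrices
            B[n] = unfold(X, n) @ KR @ np.linalg.pinv(G)
    return B

# Usage: a 3rd-order tensor of (true) rank 4
B_true = [np.random.randn(I, 4) for I in (10, 12, 14)]
X = np.einsum('ir,jr,kr->ijk', *B_true)
B_est = cp_als(X, R=4)
X_hat = np.einsum('ir,jr,kr->ijk', *B_est)
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))   # typically a small relative error
```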

Constraints. The CPD is usually unique by itself, and does not require constraints to impose uniqueness [37]. However, if components in one or more modes are known to be e.g., nonnegative, orthogonal, statistically independent or sparse, these constraints should be incorporated to relax uniqueness conditions. More importantly, constraints may increase the accuracy and stability of the CPD algorithms and facilitate better physical interpretability of components [38, 39].

IV-C The Tucker Decomposition

The Tucker decomposition can be expressed as follows [40]:

$\underline{\mathbf{X}} \cong \underline{\mathbf{G}} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \cdots \times_N \mathbf{B}^{(N)} = [\![\underline{\mathbf{G}};\, \mathbf{B}^{(1)}, \mathbf{B}^{(2)}, \ldots, \mathbf{B}^{(N)}]\!],$  (12)

where $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is the given data tensor, $\underline{\mathbf{G}} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ is the core tensor, and $\mathbf{B}^{(n)} = [\mathbf{b}^{(n)}_1, \mathbf{b}^{(n)}_2, \ldots, \mathbf{b}^{(n)}_{R_n}] \in \mathbb{R}^{I_n \times R_n}$ are the mode-$n$ component matrices, $n = 1, 2, \ldots, N$ (see Fig. 15).

Figure  15: Representation of the Tucker Decomposition (TD). (a) TD of a 3rd-order tensor. The objective is to compute the factor matrices $\mathbf{B}^{(n)}$ and the core tensor $\underline{\mathbf{G}}$. In some applications, in a second stage, the core tensor is itself approximately factorized using the CPD. (b) Graphical representation of the Tucker and CP decompositions as a two-stage procedure for a 4th-order tensor.

Using Kronecker products, the decomposition in (12) can be expressed in a matrix and vector form as follows:

$\mathbf{X}_{(n)} \cong \mathbf{B}^{(n)}\, \mathbf{G}_{(n)} \Big( \bigotimes_{k \neq n} \mathbf{B}^{(k)} \Big)^{T}, \qquad \mathrm{vec}(\underline{\mathbf{X}}) \cong \Big( \bigotimes_{k} \mathbf{B}^{(k)} \Big)\, \mathrm{vec}(\underline{\mathbf{G}}),$  (13)

where the Kronecker products are taken over the respective factor matrices (excluding $\mathbf{B}^{(n)}$ in the first expression), in the order determined by the chosen (here, big-endian) convention.
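As an illustrative sketch (not from the paper), the truncated higher-order SVD (HOSVD) below computes Tucker factor matrices from the leading left singular vectors of each mode-$n$ unfolding and forms the core by multilinear products with their transposes; the dimensions and ranks are arbitrary assumptions.

```python
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_n_product(A, B, mode):
    """Mode-n product A x_n B, with B of size J x I_n."""
    moved = np.moveaxis(A, mode, 0)
    return np.moveaxis(np.tensordot(B, moved, axes=([1], [0])), 0, mode)

def truncated_hosvd(X, ranks):
    """Truncated HOSVD: per-mode SVD factors and the corresponding core."""
    factors = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :R]
               for n, R in enumerate(ranks)]
    G = X
    for n, B in enumerate(factors):
        G = mode_n_product(G, B.T, n)
    return G, factors

# Synthetic tensor with multilinear ranks (4, 4, 4)
G0 = np.random.randn(4, 4, 4)
U = [np.linalg.qr(np.random.randn(I, 4))[0] for I in (8, 9, 10)]
X = G0
for n, B in enumerate(U):
    X = mode_n_product(X, B, n)

G, Bs = truncated_hosvd(X, ranks=(4, 4, 4))
X_hat = G
for n, B in enumerate(Bs):
    X_hat = mode_n_product(X_hat, B, n)
print(G.shape, np.linalg.norm(X - X_hat) / np.linalg.norm(X))   # near-zero error
```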
Figure  16: Graphical illustration of constrained Tucker and CPD models: (a) Tucker- decomposition of a th-order tensor, with ,