Tensor Networks for Big Data Analytics and Large-Scale Optimization Problems
Tensor decompositions and tensor networks are emerging and promising tools for data analysis and data mining.
In this paper we review basic and emerging models and associated algorithms for large-scale
tensor networks, especially Tensor Train (TT) decompositions using novel mathematical and graphical representations.
We discus the concept of tensorization (i.e., creating very high-order tensors
from lower-order original
data) and super compression of data achieved via quantized tensor train (QTT) networks.
The main objective of this paper is
to show how tensor networks can be used to solve a wide class of big data optimization problems (that are far from tractable by classical numerical methods) by applying tensorization and performing all operations using relatively small size matrices and tensors and applying iteratively optimized and approximative tensor contractions.
Keywords: Tensor networks, tensor train (TT) decompositions, matrix product states (MPS), matrix product operators (MPO), basic tensor operations, optimization problems for very large-scale problems: generalized eigenvalue decomposition (GEVD), PCA/SVD, canonical correlation analysis (CCA).
I Introduction and Motivations
Big Data can have a such huge volume and high complexity that existing standard methods and algorithms become inadequate for the processing and optimization of such data. Big data is characterized not only by big Volume but also by other specific “V” features/challenges: Veracity, Variety, Velocity, Value. Fig. 1 illuminates the big data characteristics for brain research related problems. High Volume implies the need for algorithms that are scalable; high Velocity is related to the processing of stream of data in near real-time; high Veracity calls for robust and predictive algorithms for noisy, incomplete and/or inconsistent data, high Variety require integration across different types of data, e.g., binary, continuous data, images, time series, etc., and finally Value refers to extracting high quality and consistent data which could lend themselves to meaningful and interpretable results.
Multidimensional data is becoming ubiquitous across the sciences and engineering because they are increasingly being gathered by information-sensing devices and remote sensing. Big data such as multimedia data (speech, video), and medical/biological data, the analysis of which critically requires a paradigm shift in order to efficiently process massive datasets within tolerable time. Tensors – multi-dimensional generalizations of matrices, provide often a natural sparse and distributed representation for such data.
Tensors have been adopted in diverse branches of data analysis, such as in signal and image processing, Psychometric, Chemometrics, Biometric, Quantum Physics/Information, Quantum Chemistry and Brain Science [1, 2, 3, 4, 5, 6, 7, 8]. Tensors are particularly attractive for data which exhibit not only huge volumes but also very high variety, for example, they are suited for problems in bio- and neuro-informatics or computational neuroscience where data are collected in various forms of big, sparse tabular, graphs or networks with multiple aspects and high dimensionality.
Tensor decompositions (TDs) provide some extensions of blind source separation (BSS) and 2-way (matrix) Component Analysis (2-way CA) to multi-way component analysis (MWCA) methods . Furthermore, TNs/TDs are suitable for dimensionality reduction, they can handle missing values, and noisy data . They are also potentially useful for analysis of linked (coupled) block of big tensors with millions and even billions of non-zero entries, using the map-reduce paradigm, as well as out-of-core approaches [2, 10, 11, 12, 13]. Moreover, multi-block tensors which arise in numerous important applications (that require the analysis of diverse and partially related data) can be decomposed to common (or correlated) and uncorrelated or statistically independent components. The effective analysis of coupled tensors requires the development of new models and associated algorithms and software that can identify the core relations that may exist among the different tensors, and scale to extremely large datasets.
Complex interactions and operations between tensors can be visualized by tensor network diagrams in which tensors are represented graphically by nodes or any shapes (e.g., circles, spheres, triangular, squares, ellipses) and each outgoing edge (line) emerging from a node represents a mode (a way, a dimension, indices) (see Fig. 2). In contrast to classical graphs, in tensor network diagrams an edge does not need connect two nodes, but may be connected to only one node. Each such free (dangling) edge corresponds to a (physical) mode that is not contracted and, hence, the order of the entire tensor network is given by the number of free (dangling) edges (see Fig. 3). Tensor network diagrams are very helpful not only in visualizing tensor decompositions but also to express complex mathematical (multilinear) operations of contractions of tensors. Tensor networks are connected to quantum physics, quantum chemistry and quantum information, which studies the ways to possibly build a quantum computer and to program it [14, 15].
To summarize, the benefits of multiway (tensor) analysis methods for big data include:
“Super” - compression of huge multidimensional data via tensorization and decompositions of a high-order tensor into factor matrices and/or core tensors of low-rank and low-order;
By performing all mathematical operations in feasible tensor formats ;
Very flexible distributed representations of structurally rich data;
Possibility to operate with noisy and missing data by using powerful low-rank tensor/matrix approximations and by exploiting robustness and stability of tensor network decomposition algorithms;
A framework to incorporate various diversities or constraints in different modes or different factors (core tensors) and thus naturally extend the standard (2-way) CA and BSS methods to large-scale multidimensional data;
Tensor networks not only provide graphically illustrative large distributed networks but also perform complex tensor operations (i.e., tensor contractions and reshaping) in an intuitive way and without using explicitly mathematical expressions.
Review and tutorial papers [1, 4, 17, 18, 19, 20] and books [7, 8, 3, 6] dealing with TDs and TNs already exist, however, they typically focus on standard models and/or do not provide explicit links to big data processing topics and/or do not explore connections to wide class of optimization problems. This paper extends beyond the standard tensor decomposition models such as the Tucker and CPD models, and aims to demonstrate flexibilities of TNs in the optimization problems of multi-dimensional, multi-modal data, together with their role as a mathematical backbone for the discovery of hidden structures in large-scale data [3, 4].
Our objective is to both review tensor models for big data, and to systematically introduce emerging models and associated algorithms for large-scale TNs/TDs, together with illustrating the many potential applications. Apart from the optimization framework considered many other challenging problems for big data related to anomaly detection, visualization, clustering, feature extraction and classification can also be solved using tensor network decompositions and low-rank tensor approximations.
Ii Basic Tensor Operations
A higher-order tensor can be interpreted as a multiway array of numbers, as illustrated in Figs. 2 and 3. Tensors are denoted by bold underlined capital letters, e.g., (we assume we shall assume that all entries of a tensor are real-valued). The order of a tensor is the number of its “modes”, “ways” or “dimensions”, which include e.g., space, time, frequency, trials, classes, and dictionaries. Matrices (2nd-order tensors) are denoted by boldface capital letters, e.g., , and vectors (1st-order tensors) by boldface lowercase letters; for instance the columns of the matrix are denoted by and elements of a matrix (scalars) are denoted by lowercase letters, e.g., . Basic tensor and TN notations are given in Table I and illustrated in Figs. 2 – 4. It should be noted that hierarchical block matrices can be represented by tensors and vice versa. For example, 3rd- and 4th-order tensors can be represented by block matrices and all algebraic operations can be equally performed on block matrices .
The most common tensor multiplications are denoted by: for the Kronecker, for the Khatri-Rao, for the Hadamard (componentwise), for the outer and for the mode- products (see also Table I). General basic operations, e.g., , , are defined as in MATLAB. We refer to [2, 3, 4] for more detail regarding the basic notations and tensor operations.
Subtensors are formed when a subset of indices is fixed. Of particular interest are fibers (vectors), defined by fixing every index but one, and slices which are two-dimensional sections (matrices) of a tensor, obtained by fixing all the indices but two. A matrix has two modes: rows and columns, while an th-order tensor has modes.
The process of unfolding (see Fig. 5) flattens a tensor into a matrix . In the simplest scenario, mode- unfolding (matricization, flattening) of the tensor yields a matrix , with entries such that grouped indices are arranged in a specific order, (in this paper rows and columns are ordered colexicographically). In tensor networks we use, typically a generalized mode- unfolding as illustrated in Fig. 5.
By a multi-index we denote an index which takes all possible combination of values of , for , in a specific order.
Remark. The entries of tensors in matricized and/or vectorized form can be ordered in different forms. In fact, the multi–index can be defined using two different conventions :
1) Little-endian convention
The little–endian notation is consistent with the Fortran style of indexing, while the big–endian notation is similar to numbers written in the positional system and corresponds to reverse lexicographic order [22, 21]. The definition of unfolding and the Kronecker (tensor) product should be also consistent with the chosen convention111The standard and more popular definition in multilinear algebra assumes the big–endian convention, which corresponds to colexicographic order, while for the development of the efficient program code, usually, the little–endian convention seems to be more convenient (See more detail the paper of Dolgov and Savostyanov ).. In this paper we will use the big-endian notation, however to follow this work it is sufficient to remember that means that .
The Kronecker product of two tensors and yields with entries , where .
The outer or tensor product of the tensors and is the tensor with entries . Specifically, the outer product of two nonzero vectors produces a rank-1 matrix and the outer product of three nonzero vectors: and produces a 3rd-order rank-1 tensor: , whose entries are . A tensor is said to be rank-1 if it can be expressed exactly as with entries , where are nonzero vectors.
The mode- product of the tensor and vector is defined as a tensor , with entries , while a mode- product of the tensor and a matrix is the tensor with entries . This can be also expressed in a matrix form as .
A full multilinear product of a tensor and a set of matrices takes into account all the modes, and can be compactly written as (see Fig 6 (a)):
In a similar way, we can define the mode- product of two tensors and with common modes that produces a -order tensor :
with entries (see Fig. 7) (a).
When not confusing, the super-index can be neglected. For example, the mode-1 product of the tensors and , with a common first mode can be written as
with entries , when using MATLAB notation, . This operation can be considered as a tensor contraction of two modes. Tensors can be contracted in several modes or even in all modes (see Fig. 7).
Tensor contraction is a fundamental operation, which can be considered as a higher dimensional analogue of inner product, outer product and matrix multiplications, and comprises computationally dominant operations in most numerical algorithms. However, unlike the matrix by matrix multiplications for which many efficient distributed-memory parallel schemes have been developed, for a tensor contraction we have a rather limited number of available optimized algorithms [24, 25, 26]. In practice, we usually implement approximate tensors contractions with reduced ranks . A significant help in developing effective distributed tensor contraction algorithms is that the tensors used in computational models often exhibit symmetry over all or multiple modes; exploitation of the symmetry is essential, both in order to save on storage as well as to avoid unnecessary arithmetic operations [25, 26].
Iii Low-Rank Tensor Approximations via Tensor Networks
Iii-a Basic Tensor Network Models
Tensor networks can be considered as a new “language” for big data tensor decompositions in simulation of large complex systems (e.g., in condensed matter physics and quantum physics) even with using standard computers [15, 28, 29, 2]. In other words, tensor networks, can be considered as a diagrammatic language for capturing the internal structure of high-order tensor decompositions.
In contrast to the CPD or Tucker decompositions, that have only one single core tensor, TNs decompose a high-order tensor into several lower-order core tensors. The branches (leads, lines, edges) connecting core tensors between each other correspond to contracted modes (and represent a TN rank), whereas lines that do not go from one tensor to another correspond to physical modes in the TN. A tensor network is a set of weakly connected core tensors, where some or all indices are contracted according to some rules.
Some examples of basic tensor network diagrams are given in Figs. 10, 11, 12, and 13 [2, 14]. A tensor network may not contain any loops, i.e., any edges connecting a node with itself. If a tensor network is a binary tree, i.e., if it does not contain any cycles (loops), each of its edges splits the modes of the data tensor into two or more groups, which is related to the suitable matricization of the tensor [30, 31]. A tree tensor network, whose all nodes have degree 3 or 4, corresponds to a Hierarchical Tucker (HT) decomposition of the tensor illustrated in Fig. 13 (a). The HT decompositions in the numerical analysis community have been introduced by Hackbusch and Kühn  and Grasedyck  (see also [30, 34, 35, 36, 37] and references therein). The general construction of the HT decomposition requires a hierarchical splitting of the modes (with sizes ). The construction of Hierarchical Tucker format relies on the notion of a dimension tree, chosen a priori, which specifies the topology of the HT decomposition. Intuitively, the dimension tree specifies which groups of modes are “separated” from other groups of modes, where sequential HT decomposition can be performed via (truncated) SVD applied to unfolded matrices .
The Tensor Train (TT) format proposed in the numerical analysis community by Oseledets and Tyrtyshnikow  (see also [39, 40, 13, 41, 42, 43]) can be interpreted as a special case of the HT, where all nodes of the underlying tensor network are aligned and where, moreover, the leaf matrices are assumed to be identities (and thus need not be stored). An advantage of the TT format is its simpler practical implementation, as no binary tree need be involved. The Tensor Trains [38, 40, 44], called also Matrix Product States (MPS) in quantum information theory [45, 46, 15, 47, 48], is the simplest TN model222In fact, the TT was rediscovered several times under different names: MPS, valence bond states and density matrix renormalization group (DMRG). The DMRG usually means not only tensor format but also power-full computational algorithms (see  and references therein)..
For some very high-order data tensors it has been observed that the ranks of 3rd-order tensors increase rapidly with the order of the tensor, for any choice of tensor network that is a tree (including TT and HT decompositions) .
For such cases, PEPS and the Multi-scale Entanglement Renormalization Ansatz (MERA) tensor networks can be used which contain cycles, but have hierarchical structures (see Fig. 13) (c). For the PEPS and MERA TNs the ranks can be kept considerably smaller, at the cost of employing 5th and 4th-order core tensors and consequently a higher computational complexity w.r.t. their ranks [50, 51].
Some interesting connections between tensor networks and graphical models used extensively in machine learning and statistics as shown in Table II [52, 53, 54, 41, 55]. Despite clear analogy, more research is needed to find more deep and precise relationships .
|Tensor Networks||Graphical Models in ML/Statistics|
|TT/MPS||Hidden Markov Models (HMM)|
|HT/TTNS||Gaussian Mixture Model (GMM)|
|TNS/PEPS||Markov Random Field (MRF) and Conditional Random Field (CRF)|
|MERA||Deep Belief Networks (DBN)|
|DMRG and MALS Algs.||Forward-Backward Algs., Block Nonlinear Gauss-Seidel Methods|
Iii-B Changing the Structure of Tensor Networks
One advantage of a graphical representation of a tensor network is that it allows us to perform even most c complex mathematical operations in intuitive and easy to understand way. Another important advantage is the ability to modify or optimize a TN structure, that is, to change its topology, preserving physical modes unchanged. In fact, in some applications it is quite useful to modify the topology of a tensor network with or without approximation by providing simplified or more convenient graphical representation of the same higher-order data tensor [56, 57, 58]. For instance, tensor networks may consist of many cycles, those can t be reduced or completely eliminated in order to reduce computational complexity of contraction of core tensors and to provide stability of computation. Again, observe a strong link with loop elimination in control theory, in addition tensor networks having many cycles may not admit stable algorithm. By changing the topology to a tree structure (TT/HT models), we can often reduce complexity of computation and improve stability of algorithms.
Performing contraction of core tensors iteratively for tree–structured tensor networks has usually a much smaller complexity than tensor networks containing many cycles. One could transform a specific tensor network with cycles into a tree structure, perform stable computations333The TT decomposition is stable in the sense that the best approximation of a data tensor with bounded TT-ranks always exist and a quasi-optimal approximation can be computed by a sequence of truncated SVDs of suitably reshaping matrices of cores [39, 40]., with it and re-transform it back to the original structure if necessary. Furthermore, in the cases that we need to compare or analyze a set of blocks of tensor data, it is important that such tensors are represented by the same or very similar structures to analyze link or correlation between them or detect common cores or hidden components. Performing such analysis with differently structured tensor networks is in general difficult or even impossible.
A Tensor network can be relatively easily transformed from one form to another one via tensor contractions, reshaping and basic matrix factorizations, typically using SVD [39, 40]. The basic approach to modify tensor structure is to perform: sequential core contractions, unfolding contracting tensors into matrices, performing matrix factorizations (typically, SVD) and finally reshaping matrices back to new core tensors. These principles are illustrated graphically in Figs 14 (a), (b), (c).
For example, in Fig 14 (a) in the first step we perform a contraction of two core tensors and , as:
with entries . In the next step, we transform the tensor into a matrix via unfolding and low-rank matrix factorization via the SVD
In the last step, we reshape factor matrices and back to new core tensors: and .
In Fig. 14 (c) we have illustrated how to convert tensor chain (TC) into TT/MPS with OBC, by contracting sequentially two core tensors, unfolding them, applying SVD and reshaping matrices back into core tensors . More precisely, in the first step, we perform a contraction of two tensors and , as:
with entries . In the next step, we can transform this tensor into a matrix in order to perform the truncated SVD:
In the next step, we reshape orthogonal matrices and back to core tensors: and . The procedure is repeated again and again for different pair of cores as illustrated in the Fig. 14 (c).
Iii-C Distributed (Concatenated) Representation of Tensors
A simple approach to reduce the size or rank of core tensors is to apply distributed tensor networks (DTNs), which consists of two kind of cores (nodes): internal nodes which has no free edges and external nodes which have free edges representing natural (physical) indices of a data tensor as illustrated in Figs. 13 and 15. A simple idea is that each of the core tensor in an original TN is itself repeatedly replaced by another TN (see Fig. 16), resulting in another TN in which only some core tensors are associated with physical (natural) modes of the original data tensor .
The main advantage of DTNs is that the size of each of the core tensors in the internal tensor network structure is usually much smaller than the initial core tensor so consequently the total number of parameters can be reduced . However, it should be noted that the contraction of the resulting tensor network becomes more difficult when compared to the initial tree structure. This is due to the fact that the distributed tensor network contains loops.
Many algorithms applied to tensor networks scale with the size or of the core tensors of the network. In spite of the usually polynomial scaling of these algorithms, the computations quickly become intractable for increasing , so that a network containing core tensors with small dimensions are favorable in general. See as examples the distributed Tucker models shown in Fig. 15 (a) and (b).
Iv Tensorization – Blessing of Dimensionality
The procedure of creating a higher-order tensor from lower-order original data is referred to as tensorization. In other words, lower-order data tensors can be reshaped (reformatted) into high-order tensors. The purpose of a such tensorization or reshaping is to achieve a low-rank approximation with high level of compression. For example, big vectors, matrices even low-order tensors can be easily tensorized to very high-order tensors, then efficiently compressed by applying a suitable tensor network decomposition; this is the underlying principle for big data analysis [1, 2, 59, 16] (see also Figs. 17 - 20).
Iv-a Curse of Dimensionality
The term curse of dimensionality, in the context of tensors, refers to the fact that the number of elements of an th-order tensor, , grows exponentially with the tensor order . For example, for the Tucker decomposition the number of entries of an original data tensor but also a core tensor scales exponentially in the tensor order, for instance, the number of entries of an th-order core tensor is .
If all computations are performed on a CP tensor format and not on the raw data tensor itself, then instead of the original raw data entries, the number of parameters in a CP decomposition reduces to , which scales linearly in and . This effectively bypasses the curse of dimensionality, however the CP approximation may provide a poor fit to the data and may involve numerical problems, since existing CPD algorithms are not stable for high-order tensors. In this paper we exploit TT decompositions which are stable and robust with ability to control an approximation error i.e., to achieve any desired accuracy of TT approximation [40, 60]. The main idea of using low-rank tensor-structured approximations is to reduce the complexity of computation and relax or avoid the curse of dimensionality.
Iv-B Quantized Tensor Networks
The curse of dimensionality can be overcome relatively easily through quantized tensor networks, which represent a tensor of possibly very high-order as a set of sparsely interconnected of low dimensions (typically, 3rd-order) cores [60, 59]. The concept of quantized tensor networks was first proposed by Oseledets  and Khoromskij .
For example, the quantization and tensorization of a huge vector , can be achieved through reshaping to give an tensor of order , as illustrated in Fig. 17. Such a quantized tensor often admits low-rank matrix/tensor approximations, so that a good compression of a huge vector can be achieved by enforcing a maximum possible low-rank structure on the tensor , thus admitting highly compressed representation via a tensor network.
Even more generally, an th-order tensor , with , can be quantized in all modes simultaneously to yield a quantized tensor of higher-order, with small .
In practice, a fine ( ) quantization is desirable to create as many virtual modes as possible, thus allowing us to implement an efficient low-rank tensor approximations. For example, the binary encoding () reshapes an th-order tensor with elements into a tensor of order , with the same number of elements. In other words, the idea of the quantized tensor is quantization of the each -th “physical” mode (dimension) by replacing it with “virtual” modes, provided that the corresponding mode size are factorized as . This corresponds to reshaping the -th mode of size into modes of sizes .
In example shown in Fig. 18, the Tensor Train of huge 3rd-order tensor is expressed by the strong Kronecker products of block tensors with relatively small 3rd-order tensor blocks. Since large-scale tensors cannot be loaded explicitly in main memory, they usually reside in distributed storage by splitting tensors to smaller blocks. Our approach is to apply tensor networks and represent big data by high-order tensors not explicitly but in compressed TT formats.
The TT decomposition applied to quantized tensors is referred to as the QTT; it was first introduced as a compression scheme for large-scale structured matrices, which admit low-rank TT approximation , and also developed for more general settings [16, 61, 62]. The attractive property of QTT is that not only its rank is typically small (below 10) but it is almost independent or at least uniformly bounded by data size, providing a logarithmic (sub-linear) reduction of storage requirements: – so-called super-compression .
Note also that, unlike in Tucker or CPD, the TT decomposition relies on a certain ordering of the modes so that reordering modes may affect the numerical values of TT ranks significantly.
Quantization is quite important for reducing the computational complexity further, since it allows the TT decomposition to resolve and represent more structure in the data by splitting the “virtual” dimensions introduced by the quantization, as well as the “physical” ones. In practice it appears the most efficient to use as fine a quantization as possible (typically, with ) and to generate as many virtual modes as possible.
A TT decomposition of the quantized vector is referred to as QTT decomposition of the original vector; the ranks of this TT decomposition are called ranks of the QTT decomposition of the original vector.
V Mathematical and Graphical Representation of Tensor Trains
In order to perform efficiently various mathematical operations in the TT formats we need to represent TT decompositions in compact and easily understandable mathematical and graphical representations [2, 13].