Tensor Networks for Big Data Analytics and Large-Scale Optimization Problems
Abstract
Tensor decompositions and tensor networks are emerging and promising tools for data analysis and data mining.
In this paper we review basic and emerging models and associated algorithms for large-scale tensor networks, especially Tensor Train (TT) decompositions, using novel mathematical and graphical representations. We discuss the concept of tensorization (i.e., creating very high-order tensors from lower-order original data) and the super-compression of data achieved via quantized tensor train (QTT) networks. The main objective of this paper is to show how tensor networks can be used to solve a wide class of big data optimization problems (that are far from tractable by classical numerical methods) by applying tensorization, performing all operations using relatively small-size matrices and tensors, and applying iteratively optimized and approximate tensor contractions.
Keywords: Tensor networks, tensor train (TT) decompositions, matrix product states (MPS), matrix product operators (MPO), basic tensor operations, very large-scale optimization problems: generalized eigenvalue decomposition (GEVD), PCA/SVD, canonical correlation analysis (CCA).
I Introduction and Motivations
Big Data can have such a huge volume and high complexity that existing standard methods and algorithms become inadequate for the processing and optimization of such data. Big data is characterized not only by big Volume but also by other specific “V” features/challenges: Veracity, Variety, Velocity and Value. Fig. 1 illustrates these big data characteristics for problems related to brain research. High Volume implies the need for algorithms that are scalable; high Velocity is related to the processing of streams of data in near real-time; high Veracity calls for robust and predictive algorithms for noisy, incomplete and/or inconsistent data; high Variety requires integration across different types of data, e.g., binary and continuous data, images, time series, etc.; and finally, Value refers to extracting high-quality and consistent data which lend themselves to meaningful and interpretable results.
Multidimensional data are becoming ubiquitous across the sciences and engineering because they are increasingly gathered by information-sensing devices and remote sensing. Big data, such as multimedia data (speech, video) and medical/biological data, critically require a paradigm shift in their analysis in order to efficiently process massive datasets within tolerable time. Tensors, which are multidimensional generalizations of matrices, often provide a natural, sparse and distributed representation for such data.
Tensors have been adopted in diverse branches of data analysis, such as signal and image processing, psychometrics, chemometrics, biometrics, quantum physics/information, quantum chemistry and brain science [1, 2, 3, 4, 5, 6, 7, 8]. Tensors are particularly attractive for data which exhibit not only huge volume but also very high variety; for example, they are suited for problems in bio- and neuro-informatics or computational neuroscience, where data are collected in various forms of big, sparse tabular data, graphs or networks with multiple aspects and high dimensionality.
Tensor decompositions (TDs) provide extensions of blind source separation (BSS) and 2-way (matrix) Component Analysis (2-way CA) to multiway component analysis (MWCA) methods [1]. Furthermore, TNs/TDs are suitable for dimensionality reduction and can handle missing values and noisy data [9]. They are also potentially useful for the analysis of linked (coupled) blocks of big tensors with millions and even billions of nonzero entries, using the map-reduce paradigm as well as out-of-core approaches [2, 10, 11, 12, 13]. Moreover, multi-block tensors, which arise in numerous important applications (that require the analysis of diverse and partially related data), can be decomposed into common (or correlated) and uncorrelated or statistically independent components. The effective analysis of coupled tensors requires the development of new models and associated algorithms and software that can identify the core relations that may exist among the different tensors, and that scale to extremely large datasets.
Complex interactions and operations between tensors can be visualized by tensor network diagrams, in which tensors are represented graphically by nodes of any shape (e.g., circles, spheres, triangles, squares, ellipses) and each outgoing edge (line) emerging from a node represents a mode (a way, dimension or index) (see Fig. 2). In contrast to classical graphs, in tensor network diagrams an edge need not connect two nodes; it may be connected to only one node. Each such free (dangling) edge corresponds to a (physical) mode that is not contracted and, hence, the order of the entire tensor network is given by the number of free (dangling) edges (see Fig. 3). Tensor network diagrams are very helpful not only in visualizing tensor decompositions but also in expressing complex mathematical (multilinear) operations of tensor contractions. Tensor networks are connected to quantum physics, quantum chemistry and quantum information, which studies ways to possibly build a quantum computer and to program it [14, 15].
To summarize, the benefits of multiway (tensor) analysis methods for big data include:

• “Super” compression of huge multidimensional data via tensorization and decomposition of a high-order tensor into factor matrices and/or core tensors of low rank and low order;

• The ability to perform all mathematical operations in feasible tensor formats [16];

• Very flexible distributed representations of structurally rich data;

• The possibility of operating with noisy and missing data, by using powerful low-rank tensor/matrix approximations and by exploiting the robustness and stability of tensor network decomposition algorithms;

• A framework to incorporate various diversities or constraints in different modes or different factors (core tensors), thus naturally extending the standard (2-way) CA and BSS methods to large-scale multidimensional data;

• Tensor networks not only provide graphically illustrative large distributed networks, but also perform complex tensor operations (i.e., tensor contractions and reshaping) in an intuitive way, without resorting to explicit mathematical expressions.
Review and tutorial papers [1, 4, 17, 18, 19, 20] and books [7, 8, 3, 6] dealing with TDs and TNs already exist; however, they typically focus on standard models and/or do not provide explicit links to big data processing topics and/or do not explore connections to a wide class of optimization problems. This paper extends beyond the standard tensor decomposition models, such as the Tucker and CPD models, and aims to demonstrate the flexibility of TNs in optimization problems involving multidimensional, multimodal data, together with their role as a mathematical backbone for the discovery of hidden structures in large-scale data [3, 4].
Our objective is both to review tensor models for big data and to systematically introduce emerging models and associated algorithms for large-scale TNs/TDs, together with illustrating the many potential applications. Apart from the optimization framework considered here, many other challenging problems for big data, related to anomaly detection, visualization, clustering, feature extraction and classification, can also be solved using tensor network decompositions and low-rank tensor approximations.
II Basic Tensor Operations
A higher-order tensor can be interpreted as a multiway array of numbers, as illustrated in Figs. 2 and 3. Tensors are denoted by bold underlined capital letters, e.g., $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ (we shall assume that all entries of a tensor are real-valued). The order of a tensor is the number of its “modes”, “ways” or “dimensions”, which can include, e.g., space, time, frequency, trials, classes and dictionaries. Matrices (2nd-order tensors) are denoted by boldface capital letters, e.g., $\mathbf{B} \in \mathbb{R}^{I \times R}$, and vectors (1st-order tensors) by boldface lowercase letters; for instance, the columns of the matrix $\mathbf{B} = [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_R]$ are denoted by $\mathbf{b}_r$, and elements of a matrix (scalars) are denoted by lowercase letters, e.g., $b_{ir}$. Basic tensor and TN notations are given in Table I and illustrated in Figs. 2–4. It should be noted that hierarchical block matrices can be represented by tensors and vice versa. For example, 3rd- and 4th-order tensors can be represented by block matrices, and all algebraic operations can equally be performed on block matrices [2].
The most common tensor multiplications are denoted by: $\otimes$ for the Kronecker product, $\odot$ for the Khatri-Rao product, $\circledast$ for the Hadamard (componentwise) product, $\circ$ for the outer product and $\times_n$ for the mode-$n$ product (see also Table I). General basic operations, e.g., $\mathrm{vec}(\cdot)$ and $\mathrm{reshape}(\cdot)$, are defined as in MATLAB. We refer to [2, 3, 4] for more detail regarding the basic notations and tensor operations.
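As a quick illustration, the following minimal Python/numpy sketch (not part of the original text; the matrices are arbitrary random examples) computes the Kronecker, Khatri-Rao and Hadamard products of small matrices.

import numpy as np

A = np.random.rand(3, 4)
B = np.random.rand(5, 4)

kron = np.kron(A, B)                                        # Kronecker product: (3*5) x (4*4)
khatri_rao = np.einsum('ir,jr->ijr', A, B).reshape(-1, 4)   # column-wise Kronecker product: (3*5) x 4
hadamard = A * A                                            # Hadamard (element-wise) product of same-sized matrices

print(kron.shape, khatri_rao.shape, hadamard.shape)         # (15, 16) (15, 4) (3, 4)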
Subtensors are formed when a subset of indices is fixed. Of particular interest are fibers (vectors), defined by fixing every index but one, and slices, which are two-dimensional sections (matrices) of a tensor, obtained by fixing all the indices but two. A matrix has two modes: rows and columns, while an $N$th-order tensor has $N$ modes.
The process of unfolding (see Fig. 5) flattens a tensor into a matrix [4]. In the simplest scenario, mode-$n$ unfolding (matricization, flattening) of the tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ yields a matrix $\mathbf{X}_{(n)} \in \mathbb{R}^{I_n \times I_1 \cdots I_{n-1} I_{n+1} \cdots I_N}$, with entries $x_{i_n,\, \overline{i_1 \cdots i_{n-1}\, i_{n+1} \cdots i_N}}$, such that the grouped (remaining) indices are arranged in a specific order (in this paper rows and columns are ordered colexicographically). In tensor networks we typically use a generalized unfolding, in which several modes are grouped together into the rows and the remaining modes into the columns, as illustrated in Fig. 5.
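For illustration, a minimal numpy sketch (the helper names unfold/fold are ours, not from the paper) of mode-$n$ unfolding and its inverse; note that numpy's default C (row-major) grouping of the remaining indices differs from the colexicographic convention used in the text, but the principle is the same.

import numpy as np

def unfold(X, n):
    """Mode-n unfolding: move mode n to the front, then flatten the remaining modes."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def fold(Xn, n, shape):
    """Inverse of unfold: reshape back and move mode n to its original position."""
    full_shape = (shape[n],) + tuple(s for i, s in enumerate(shape) if i != n)
    return np.moveaxis(Xn.reshape(full_shape), 0, n)

X = np.random.rand(3, 4, 5)
assert np.allclose(fold(unfold(X, 1), 1, X.shape), X)
print(unfold(X, 1).shape)   # (4, 15)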


[Table I. Basic tensor and tensor network notation: core tensors, component (factor) matrices, vectorization of a tensor, and related symbols.]
By a multi-index $i = \overline{i_1 i_2 \cdots i_N}$ we denote an index which takes all possible combinations of values of the indices $i_1, i_2, \ldots, i_N$, for $i_n = 1, 2, \ldots, I_n$, in a specific order.
Remark. The entries of tensors in matricized and/or vectorized form can be ordered in different ways. In fact, the multi-index can be defined using two different conventions [21]:
1) Little-endian convention:
$\overline{i_1 i_2 \cdots i_N} = i_1 + (i_2 - 1) I_1 + (i_3 - 1) I_1 I_2 + \cdots + (i_N - 1) I_1 \cdots I_{N-1}$,   (1)
2) Big-endian convention:
$\overline{i_1 i_2 \cdots i_N} = i_N + (i_{N-1} - 1) I_N + (i_{N-2} - 1) I_N I_{N-1} + \cdots + (i_1 - 1) I_2 \cdots I_N$.   (2)
The little-endian notation is consistent with the Fortran style of indexing, while the big-endian notation is similar to numbers written in the positional system and corresponds to reverse lexicographic order [22, 21]. The definition of unfolding and of the Kronecker (tensor) product should also be consistent with the chosen convention. (The standard and more popular definition in multilinear algebra assumes the big-endian convention, which corresponds to colexicographic order, while for the development of efficient program code the little-endian convention usually seems more convenient; see the paper of Dolgov and Savostyanov [21] for more detail.) In this paper we will use the big-endian notation; however, to follow this work it is sufficient to remember that $\overline{ij}$ means that $\overline{ij} = j + (i - 1) J$.
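To make the two conventions concrete, here is a small Python sketch (the function names are ours) that evaluates a multi-index under both conventions for 1-based indices $i_n \in \{1, \ldots, I_n\}$.

def multi_index_little_endian(idx, dims):
    # i1 + (i2-1)*I1 + (i3-1)*I1*I2 + ...  (Fortran / column-major style)
    k, stride = 1, 1
    for i_n, I_n in zip(idx, dims):
        k += (i_n - 1) * stride
        stride *= I_n
    return k

def multi_index_big_endian(idx, dims):
    # iN + (i_{N-1}-1)*I_N + ...           (reverse lexicographic / row-major style)
    return multi_index_little_endian(idx[::-1], dims[::-1])

print(multi_index_little_endian((2, 1, 3), (2, 3, 4)))   # 14
print(multi_index_big_endian((2, 1, 3), (2, 3, 4)))      # 15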
The Kronecker product of two tensors $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_N}$ yields $\underline{\mathbf{C}} = \underline{\mathbf{A}} \otimes \underline{\mathbf{B}} \in \mathbb{R}^{I_1 J_1 \times \cdots \times I_N J_N}$ with entries $c_{\overline{i_1 j_1}, \ldots, \overline{i_N j_N}} = a_{i_1, \ldots, i_N}\, b_{j_1, \ldots, j_N}$ [23].
The outer or tensor product of the tensors $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times \cdots \times J_M}$ is the tensor $\underline{\mathbf{C}} = \underline{\mathbf{A}} \circ \underline{\mathbf{B}} \in \mathbb{R}^{I_1 \times \cdots \times I_N \times J_1 \times \cdots \times J_M}$ with entries $c_{i_1, \ldots, i_N, j_1, \ldots, j_M} = a_{i_1, \ldots, i_N}\, b_{j_1, \ldots, j_M}$. Specifically, the outer product of two nonzero vectors $\mathbf{a} \in \mathbb{R}^{I}$ and $\mathbf{b} \in \mathbb{R}^{J}$ produces a rank-1 matrix $\mathbf{X} = \mathbf{a} \circ \mathbf{b} = \mathbf{a}\mathbf{b}^{\mathrm{T}}$, and the outer product of three nonzero vectors $\mathbf{a} \in \mathbb{R}^{I}$, $\mathbf{b} \in \mathbb{R}^{J}$ and $\mathbf{c} \in \mathbb{R}^{K}$ produces a 3rd-order rank-1 tensor $\underline{\mathbf{X}} = \mathbf{a} \circ \mathbf{b} \circ \mathbf{c} \in \mathbb{R}^{I \times J \times K}$, whose entries are $x_{ijk} = a_i\, b_j\, c_k$. A tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is said to be rank-1 if it can be expressed exactly as $\underline{\mathbf{X}} = \mathbf{b}^{(1)} \circ \mathbf{b}^{(2)} \circ \cdots \circ \mathbf{b}^{(N)}$ with entries $x_{i_1 i_2 \cdots i_N} = b^{(1)}_{i_1} b^{(2)}_{i_2} \cdots b^{(N)}_{i_N}$, where $\mathbf{b}^{(n)} \in \mathbb{R}^{I_n}$ are nonzero vectors.
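The following short numpy sketch (illustration only; the vectors are arbitrary examples) builds a 3rd-order rank-1 tensor as the outer product of three vectors.

import numpy as np

a, b, c = np.random.rand(3), np.random.rand(4), np.random.rand(5)
X = np.einsum('i,j,k->ijk', a, b, c)                       # x_ijk = a_i * b_j * c_k
assert np.allclose(X, np.multiply.outer(np.outer(a, b), c))
print(X.shape)   # (3, 4, 5)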
The mode-$n$ product of the tensor $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and a vector $\mathbf{b} \in \mathbb{R}^{I_n}$ is defined as a tensor $\underline{\mathbf{C}} = \underline{\mathbf{A}} \,\bar{\times}_n\, \mathbf{b} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_N}$, with entries $c_{i_1, \ldots, i_{n-1}, i_{n+1}, \ldots, i_N} = \sum_{i_n=1}^{I_n} a_{i_1, \ldots, i_N}\, b_{i_n}$, while the mode-$n$ product of the tensor $\underline{\mathbf{A}}$ and a matrix $\mathbf{B} \in \mathbb{R}^{J \times I_n}$ is the tensor $\underline{\mathbf{C}} = \underline{\mathbf{A}} \times_n \mathbf{B} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N}$ with entries $c_{i_1, \ldots, i_{n-1}, j, i_{n+1}, \ldots, i_N} = \sum_{i_n=1}^{I_n} a_{i_1, \ldots, i_N}\, b_{j, i_n}$. This can also be expressed in matrix form as $\mathbf{C}_{(n)} = \mathbf{B} \mathbf{A}_{(n)}$.
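A minimal numpy sketch (the helper mode_n_product is our own, not a library routine) of the mode-$n$ product of a tensor with a matrix, computed through the unfolding identity $\mathbf{C}_{(n)} = \mathbf{B}\mathbf{A}_{(n)}$.

import numpy as np

def mode_n_product(A, B, n):
    """Multiply tensor A by matrix B along mode n via unfolding and folding back."""
    An = np.moveaxis(A, n, 0).reshape(A.shape[n], -1)        # mode-n unfolding A_(n)
    Cn = B @ An                                              # C_(n) = B A_(n)
    new_shape = (B.shape[0],) + tuple(s for i, s in enumerate(A.shape) if i != n)
    return np.moveaxis(Cn.reshape(new_shape), 0, n)          # fold back into a tensor

A = np.random.rand(3, 4, 5)
B = np.random.rand(7, 4)                                     # acts on mode 1 (size 4)
C = mode_n_product(A, B, 1)
assert np.allclose(C, np.einsum('ijk,rj->irk', A, B))
print(C.shape)   # (3, 7, 5)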
A full multilinear product of a tensor $\underline{\mathbf{G}} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ and a set of matrices $\mathbf{B}^{(n)} \in \mathbb{R}^{I_n \times R_n}$ takes into account all the modes, and can be compactly written as (see Fig. 6 (a)):
$\underline{\mathbf{C}} = \underline{\mathbf{G}} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \cdots \times_N \mathbf{B}^{(N)} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$.   (3)
In a similar way, we can define the mode-$(m, n)$ product of two tensors $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times \cdots \times J_M}$ with common modes $I_n = J_m$, which produces an $(N + M - 2)$th-order tensor $\underline{\mathbf{C}} \in \mathbb{R}^{I_1 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_N \times J_1 \times \cdots \times J_{m-1} \times J_{m+1} \times \cdots \times J_M}$:
$\underline{\mathbf{C}} = \underline{\mathbf{A}} \times_n^m \underline{\mathbf{B}}$,   (4)
with entries $c_{i_1, \ldots, i_{n-1}, i_{n+1}, \ldots, i_N, j_1, \ldots, j_{m-1}, j_{m+1}, \ldots, j_M} = \sum_{i_n=1}^{I_n} a_{i_1, \ldots, i_N}\, b_{j_1, \ldots, j_{m-1}, i_n, j_{m+1}, \ldots, j_M}$ (see Fig. 7 (a)).
When there is no risk of confusion, the super-index $m$ can be omitted. For example, the mode-1 product of the tensors $\underline{\mathbf{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and $\underline{\mathbf{B}} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_M}$, with a common first mode $I_1 = J_1$, can be written as
$\underline{\mathbf{C}} = \underline{\mathbf{A}} \times_1 \underline{\mathbf{B}} \in \mathbb{R}^{I_2 \times \cdots \times I_N \times J_2 \times \cdots \times J_M}$,   (5)
with entries $c_{i_2, \ldots, i_N, j_2, \ldots, j_M} = \sum_{i_1=1}^{I_1} a_{i_1, i_2, \ldots, i_N}\, b_{i_1, j_2, \ldots, j_M}$; in MATLAB-style notation, this corresponds to a reshaped matrix product of the mode-1 unfoldings, $\mathbf{A}_{(1)}^{\mathrm{T}} \mathbf{B}_{(1)}$. This operation can be considered as a tensor contraction of two modes. Tensors can be contracted in several modes or even in all modes (see Fig. 7).
Tensor contraction is a fundamental operation, which can be considered as a higher-dimensional analogue of the inner product, outer product and matrix multiplication, and comprises the computationally dominant operations in most numerical algorithms. However, unlike matrix-by-matrix multiplication, for which many efficient distributed-memory parallel schemes have been developed, for tensor contraction we have a rather limited number of available optimized algorithms [24, 25, 26]. In practice, we usually implement approximate tensor contractions with reduced ranks [27]. A significant help in developing effective distributed tensor contraction algorithms is that the tensors used in computational models often exhibit symmetry over all or multiple modes; exploitation of this symmetry is essential, both in order to save on storage and to avoid unnecessary arithmetic operations [25, 26].
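In practice, such contractions map directly onto library primitives; for instance, the following numpy sketch (not from the paper; the shapes are arbitrary examples) contracts two tensors over one common mode, and then over two modes at once, using np.tensordot.

import numpy as np

A = np.random.rand(3, 4, 5)          # I1 x I2 x I3
B = np.random.rand(5, 6, 7)          # J1 x J2 x J3, with J1 = I3

# Contract mode 3 of A with mode 1 of B: the result has order 3 + 3 - 2 = 4.
C = np.tensordot(A, B, axes=([2], [0]))           # shape (3, 4, 6, 7)

# Contract over two pairs of modes simultaneously.
D = np.random.rand(4, 5, 8)
E = np.tensordot(A, D, axes=([1, 2], [0, 1]))     # shape (3, 8)
print(C.shape, E.shape)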
III Low-Rank Tensor Approximations via Tensor Networks
III-A Basic Tensor Network Models
Tensor networks can be considered as a new “language” for big data tensor decompositions, used in the simulation of large complex systems (e.g., in condensed matter physics and quantum physics) even on standard computers [15, 28, 29, 2]. In other words, tensor networks can be considered as a diagrammatic language for capturing the internal structure of high-order tensor decompositions.
In contrast to the CPD or Tucker decompositions, which have only one single core tensor, TNs decompose a high-order tensor into several lower-order core tensors. The branches (leads, lines, edges) connecting core tensors to each other correspond to contracted modes (and represent TN ranks), whereas lines that do not go from one tensor to another correspond to physical modes of the TN. A tensor network is thus a set of weakly connected core tensors, where some or all indices are contracted according to some rules.
Some examples of basic tensor network diagrams are given in Figs. 10, 11, 12 and 13 [2, 14]. A tensor network may not contain any loops, i.e., edges connecting a node with itself. If a tensor network is a binary tree, i.e., if it does not contain any cycles (loops), each of its edges splits the modes of the data tensor into two or more groups, which is related to a suitable matricization of the tensor [30, 31]. A tree tensor network, all of whose nodes have degree 3 or 4, corresponds to a Hierarchical Tucker (HT) decomposition of the tensor, as illustrated in Fig. 13 (a). HT decompositions were introduced in the numerical analysis community by Hackbusch and Kühn [32] and Grasedyck [33] (see also [30, 34, 35, 36, 37] and references therein). The general construction of the HT decomposition requires a hierarchical splitting of the modes (with sizes $I_1, I_2, \ldots, I_N$). The construction of the Hierarchical Tucker format relies on the notion of a dimension tree, chosen a priori, which specifies the topology of the HT decomposition. Intuitively, the dimension tree specifies which groups of modes are “separated” from other groups of modes, and a sequential HT decomposition can then be performed via (truncated) SVDs applied to the corresponding unfolded matrices [30].
The Tensor Train (TT) format, proposed in the numerical analysis community by Oseledets and Tyrtyshnikov [38] (see also [39, 40, 13, 41, 42, 43]), can be interpreted as a special case of the HT format in which all nodes of the underlying tensor network are aligned along a chain and, moreover, the leaf matrices are assumed to be identities (and thus need not be stored). An advantage of the TT format is its simpler practical implementation, as no binary tree need be involved. Tensor Trains [38, 40, 44], also called Matrix Product States (MPS) in quantum information theory [45, 46, 15, 47, 48], are the simplest TN model. (In fact, the TT format was rediscovered several times under different names: MPS, valence bond states, and the density matrix renormalization group (DMRG); the term DMRG usually refers not only to the tensor format but also to powerful computational algorithms; see [49] and references therein.)
For some very high-order data tensors it has been observed that the ranks of the 3rd-order core tensors increase rapidly with the order of the data tensor, for any choice of tensor network that is a tree (including TT and HT decompositions) [35].
For such cases, the Projected Entangled Pair States (PEPS) and the Multiscale Entanglement Renormalization Ansatz (MERA) tensor networks can be used; these contain cycles but have hierarchical structures (see Fig. 13 (c)). For the PEPS and MERA TNs the ranks can be kept considerably smaller, at the cost of employing 5th- and 4th-order core tensors and, consequently, a higher computational complexity w.r.t. their ranks [50, 51].
Some interesting connections between tensor networks and graphical models used extensively in machine learning and statistics are shown in Table II [52, 53, 54, 41, 55]. Despite the clear analogy, more research is needed to establish deeper and more precise relationships [55].
Table II. Correspondences between tensor network models and graphical models used in machine learning and statistics.

Tensor Networks       | Graphical Models in ML/Statistics
TT/MPS                | Hidden Markov Models (HMM)
HT/TTNS               | Gaussian Mixture Model (GMM)
TNS/PEPS              | Markov Random Field (MRF) and Conditional Random Field (CRF)
MERA                  | Deep Belief Networks (DBN)
DMRG and MALS Algs.   | Forward-Backward Algs., Block Nonlinear Gauss-Seidel Methods
III-B Changing the Structure of Tensor Networks
One advantage of a graphical representation of a tensor network is that it allows us to perform even the most complex mathematical operations in an intuitive and easy-to-understand way. Another important advantage is the ability to modify or optimize a TN structure, that is, to change its topology while keeping the physical modes unchanged. In fact, in some applications it is quite useful to modify the topology of a tensor network, with or without approximation, to provide a simplified or more convenient graphical representation of the same higher-order data tensor [56, 57, 58]. For instance, a tensor network may contain many cycles, which can be reduced or completely eliminated in order to reduce the computational complexity of contracting the core tensors and to provide stability of computation. Observe a strong link with loop elimination in control theory; in addition, tensor networks with many cycles may not admit stable algorithms. By changing the topology to a tree structure (TT/HT models), we can often reduce the complexity of computations and improve the stability of algorithms.
Performing contractions of core tensors iteratively for tree-structured tensor networks usually has a much smaller complexity than for tensor networks containing many cycles. One could transform a specific tensor network with cycles into a tree structure, perform stable computations with it (the TT decomposition is stable in the sense that a best approximation of a data tensor with bounded TT ranks always exists, and a quasi-optimal approximation can be computed by a sequence of truncated SVDs of suitably reshaped matrices of cores [39, 40]), and then retransform it back to the original structure if necessary. Furthermore, when we need to compare or analyze a set of blocks of tensor data, it is important that such tensors are represented by the same or very similar structures, in order to analyze links or correlations between them or to detect common cores or hidden components. Performing such analysis with differently structured tensor networks is in general difficult or even impossible.
A tensor network can be transformed relatively easily from one form into another via tensor contractions, reshaping and basic matrix factorizations, typically the SVD [39, 40]. The basic approach to modifying a tensor network structure is to perform sequential core contractions, unfold the contracted tensors into matrices, perform matrix factorizations (typically the SVD), and finally reshape the factor matrices back into new core tensors. These principles are illustrated graphically in Figs. 14 (a), (b) and (c).
For example, in Fig. 14 (a), in the first step we perform a contraction of two core tensors, $\underline{\mathbf{G}}_1$ and $\underline{\mathbf{G}}_2$, over their common (internal) mode:
$\underline{\mathbf{G}}_{1,2} = \underline{\mathbf{G}}_1 \times_n^m \underline{\mathbf{G}}_2$,   (6)
with entries obtained by summing the products of the entries of the two cores over the contracted index. In the next step, we transform the tensor $\underline{\mathbf{G}}_{1,2}$ into a matrix via unfolding and perform a low-rank matrix factorization via the SVD:
$\mathbf{G}_{1,2} \cong \mathbf{U}\mathbf{S}\mathbf{V}^{\mathrm{T}}$.   (7)
In the last step, we reshape the factor matrices $\mathbf{U}$ and $\mathbf{S}\mathbf{V}^{\mathrm{T}}$ back into new core tensors.
The above procedure has been applied in Fig. 14 (b) to transform a Honeycomb lattice into a tensor chain (TC), along with a tensor contraction of three cores [57].
In Fig. 14 (c) we have illustrated how to convert a tensor chain (TC) into a TT/MPS with open boundary conditions (OBC), by sequentially contracting pairs of core tensors, unfolding them, applying the SVD, and reshaping the resulting matrices back into core tensors [56]. More precisely, in the first step we perform a contraction of two core tensors, $\underline{\mathbf{G}}_1$ and $\underline{\mathbf{G}}_2$, over their common mode:
$\underline{\mathbf{G}}_{1,2} = \underline{\mathbf{G}}_1 \times_n^m \underline{\mathbf{G}}_2$,   (8)
with entries obtained by summing over the contracted index. In the next step, we transform this tensor into a matrix in order to perform the truncated SVD:
$\mathbf{G}_{1,2} \cong \mathbf{U}\mathbf{S}\mathbf{V}^{\mathrm{T}}$.   (9)
Finally, we reshape the (column-)orthogonal matrices $\mathbf{U}$ and $\mathbf{V}$ back into core tensors. The procedure is repeated for different pairs of cores, as illustrated in Fig. 14 (c).
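The following Python sketch (our own simplified illustration, assuming 3rd-order TT-type cores with arbitrary example sizes, and keeping the full numerical rank) mimics the contract/unfold/factorize/reshape steps described above.

import numpy as np

G1 = np.random.rand(2, 6, 4)     # R1 x I1 x R2
G2 = np.random.rand(4, 7, 3)     # R2 x I2 x R3

# Step 1: contract the common internal mode R2.
G12 = np.tensordot(G1, G2, axes=([2], [0]))           # R1 x I1 x I2 x R3

# Step 2: unfold so that (R1, I1) index the rows and (I2, R3) the columns,
# then compute a (here full-accuracy) truncated SVD.
M = G12.reshape(2 * 6, 7 * 3)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))                     # numerical rank (could also be a prescribed rank)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]

# Step 3: reshape the factors back into two new cores connected by rank r.
G1_new = U.reshape(2, 6, r)
G2_new = (np.diag(s) @ Vt).reshape(r, 7, 3)

# Sanity check: contracting the new cores reproduces the contracted tensor.
assert np.allclose(np.tensordot(G1_new, G2_new, axes=([2], [0])), G12)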
III-C Distributed (Concatenated) Representation of Tensors
A simple approach to reducing the size or rank of core tensors is to apply distributed tensor networks (DTNs), which consist of two kinds of cores (nodes): internal nodes, which have no free edges, and external nodes, which have free edges representing the natural (physical) indices of a data tensor, as illustrated in Figs. 13 and 15. The idea is that each core tensor in the original TN is itself repeatedly replaced by another TN (see Fig. 16), resulting in a distributed tensor network in which only some core tensors are associated with the physical (natural) modes of the original data tensor [58].
The main advantage of DTNs is that the size of each core tensor in the internal tensor network structure is usually much smaller than that of the initial core tensor, so the total number of parameters can be reduced [58]. However, it should be noted that the contraction of the resulting tensor network becomes more difficult than for the initial tree structure. This is due to the fact that the distributed tensor network contains loops.
Many algorithms applied to tensor networks scale with the sizes (mode dimensions and ranks) of the core tensors of the network. In spite of the usually polynomial scaling of these algorithms, the computations quickly become intractable as these dimensions increase, so that networks containing core tensors with small dimensions are favorable in general. See, as examples, the distributed Tucker models shown in Fig. 15 (a) and (b).
IV Tensorization – Blessing of Dimensionality
The procedure of creating a higher-order tensor from lower-order original data is referred to as tensorization. In other words, lower-order data tensors can be reshaped (reformatted) into high-order tensors. The purpose of such a tensorization or reshaping is to achieve a low-rank approximation with a high level of compression. For example, big vectors, matrices and even low-order tensors can easily be tensorized into very high-order tensors and then efficiently compressed by applying a suitable tensor network decomposition; this is the underlying principle for big data analysis [1, 2, 59, 16] (see also Figs. 17–20).
IV-A Curse of Dimensionality
The term curse of dimensionality, in the context of tensors, refers to the fact that the number of elements of an $N$th-order $(I \times I \times \cdots \times I)$ tensor, $I^N$, grows exponentially with the tensor order $N$. For example, for the Tucker decomposition the number of entries of not only the original data tensor but also the core tensor scales exponentially in the tensor order; for instance, the number of entries of an $N$th-order $(R \times R \times \cdots \times R)$ core tensor is $R^N$.
If all computations are performed in a CP tensor format and not on the raw data tensor itself, then instead of the original $I^N$ raw data entries, the number of parameters in a CP decomposition reduces to $NIR$, which scales linearly in $N$ and $I$. This effectively bypasses the curse of dimensionality; however, the CP approximation may provide a poor fit to the data and may involve numerical problems, since existing CPD algorithms are not stable for high-order tensors. In this paper we exploit TT decompositions, which are stable and robust, with the ability to control the approximation error, i.e., to achieve any desired accuracy of the TT approximation [40, 60]. The main idea of using low-rank tensor-structured approximations is to reduce the complexity of computations and to relax or avoid the curse of dimensionality.
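As a back-of-the-envelope illustration (the example numbers are ours, assuming equal mode sizes and ranks), the following snippet compares the storage of the raw tensor, its CP representation, and a TT representation.

N, I, R = 20, 10, 5                       # order, mode size, rank (all modes/ranks equal)

raw = I ** N                              # full tensor: 10^20 entries
cp = N * I * R                            # CP factors: N matrices of size I x R
tt = 2 * I * R + (N - 2) * I * R * R      # TT cores: two boundary I x R cores, N-2 cores of size R x I x R

print(f"raw: {raw:.1e},  CP: {cp},  TT: {tt}")   # raw: 1.0e+20,  CP: 1000,  TT: 4600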
IV-B Quantized Tensor Networks
The curse of dimensionality can be overcome relatively easily through quantized tensor networks, which represent a tensor of possibly very high order as a set of sparsely interconnected cores of low dimension (typically, 3rd-order) [60, 59]. The concept of quantized tensor networks was first proposed by Oseledets [59] and Khoromskij [16].
For example, the quantization and tensorization of a huge vector $\mathbf{x} \in \mathbb{R}^{I}$ with $I = 2^{K}$ can be achieved through reshaping it into a $(2 \times 2 \times \cdots \times 2)$ tensor $\underline{\mathbf{X}}$ of order $K$, as illustrated in Fig. 17. Such a quantized tensor often admits low-rank matrix/tensor approximations, so that a good compression of a huge vector can be achieved by enforcing the maximum possible low-rank structure on the tensor $\underline{\mathbf{X}}$, thus admitting a highly compressed representation via a tensor network.
Even more generally, an $N$th-order tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, with $I_n = q^{K_n}$, can be quantized in all modes simultaneously to yield a $(q \times q \times \cdots \times q)$ quantized tensor of higher order $K = K_1 + K_2 + \cdots + K_N$, with small $q$.
In practice, a fine (small $q$) quantization is desirable to create as many virtual modes as possible, thus allowing us to implement efficient low-rank tensor approximations. For example, the binary encoding ($q = 2$) reshapes an $N$th-order tensor with $2^{K_1 + K_2 + \cdots + K_N}$ elements into a tensor of order $K_1 + K_2 + \cdots + K_N$ with the same number of elements. In other words, the idea of the quantized tensor is the quantization of each $n$th “physical” mode (dimension) by replacing it with $K_n$ “virtual” modes, provided that the corresponding mode size $I_n$ can be factorized as $I_n = I_{n,1} I_{n,2} \cdots I_{n,K_n}$. This corresponds to reshaping the $n$th mode of size $I_n$ into $K_n$ modes of sizes $I_{n,1}, I_{n,2}, \ldots, I_{n,K_n}$.
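The following compact Python sketch (our own simplified TT-SVD with a fixed truncation tolerance, not the algorithms of [59, 16]) illustrates quantized tensorization: a vector of length $2^K$ is reshaped into a $K$th-order $(2 \times 2 \times \cdots \times 2)$ tensor and compressed into QTT cores by sequential truncated SVDs; for the separable signal used here all QTT ranks equal 1, so the $2^{10} = 1024$ entries are represented by only $10 \times 2 = 20$ core parameters.

import numpy as np

def tt_svd(X, eps=1e-10):
    """Simplified TT-SVD: sequential truncated SVDs producing TT (here QTT) cores."""
    shape = X.shape
    cores, r = [], 1
    C = X.reshape(shape[0], -1)
    for In in shape[:-1]:
        C = C.reshape(r * In, -1)
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        rank = max(1, int(np.sum(s > eps * s[0])))
        cores.append(U[:, :rank].reshape(r, In, rank))
        C = s[:rank, None] * Vt[:rank, :]          # carry the remainder to the next mode
        r = rank
    cores.append(C.reshape(r, shape[-1], 1))        # last core absorbs the remainder
    return cores

K = 10
t = np.linspace(0, 1, 2 ** K)
x = np.exp(-t)                                      # a separable (exactly rank-1 in QTT) signal
X = x.reshape((2,) * K)                             # quantized (binary) tensorization
cores = tt_svd(X)
print([c.shape for c in cores])                     # ten cores of shape (1, 2, 1)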
In the example shown in Fig. 18, the Tensor Train of a huge 3rd-order tensor is expressed by strong Kronecker products of block tensors with relatively small 3rd-order tensor blocks. Since large-scale tensors cannot be loaded explicitly into main memory, they usually reside in distributed storage, obtained by splitting the tensors into smaller blocks. Our approach is to apply tensor networks and represent big data by high-order tensors, not explicitly but in compressed TT formats.
The TT decomposition applied to quantized tensors is referred to as the QTT; it was first introduced as a compression scheme for large-scale structured matrices which admit a low-rank TT approximation [59], and was later developed for more general settings [16, 61, 62]. The attractive property of the QTT is that not only are its ranks typically small (below 10) but they are almost independent of, or at least uniformly bounded by, the data size, providing a logarithmic (sub-linear) reduction of storage requirements, from $\mathcal{O}(I^N)$ to $\mathcal{O}(N R^2 \log_q I)$ – the so-called super-compression [16].
Note also that, unlike the Tucker or CP decompositions, the TT decomposition relies on a particular ordering of the modes, so that reordering the modes may significantly affect the numerical values of the TT ranks.
Quantization is quite important for reducing the computational complexity further, since it allows the TT decomposition to resolve and represent more structure in the data by splitting the “virtual” dimensions introduced by the quantization, as well as the “physical” ones. In practice, it appears most efficient to use as fine a quantization as possible (typically, with $q = 2$) and to generate as many virtual modes as possible.
A TT decomposition of the quantized vector is referred to as the QTT decomposition of the original vector; the ranks of this TT decomposition are called the ranks of the QTT decomposition of the original vector.
V Mathematical and Graphical Representation of Tensor Trains
In order to efficiently perform various mathematical operations in TT formats, we need to express TT decompositions in compact and easily understandable mathematical and graphical representations [2, 13].