# Infinite Tucker Decomposition: Nonparametric Bayesian Models for Multiway Data Analysis

## Abstract

Tensor decomposition is a powerful computational tool for multiway data analysis. Many popular tensor decomposition approaches—such as the Tucker decomposition and CANDECOMP/PARAFAC (CP)—amount to multi-linear factorization. They are insufficient to model (i) complex interactions between data entities, (ii) various data types (e.g.missing data and binary data), and (iii) noisy observations and outliers. To address these issues, we propose tensor-variate latent nonparametric Bayesian models, coupled with efficient inference methods, for multiway data analysis. We name these models InfTucker. Using these InfTucker, we conduct Tucker decomposition in an infinite feature space. Unlike classical tensor decomposition models, our new approaches handle both continuous and binary data in a probabilistic framework. Unlike previous Bayesian models on matrices and tensors, our models are based on latent Gaussian or processes with nonlinear covariance functions. To efficiently learn the InfTucker from data, we develop a variational inference technique on tensors. Compared with classical implementation, the new technique reduces both time and space complexities by several orders of magnitude. Our experimental results on chemometrics and social network datasets demonstrate that our new models achieved significantly higher prediction accuracy than the most state-of-art tensor decomposition approaches.

## 1 Introduction

Many real-world datasets with multiple aspects can be described by tensors (i.e., multiway arrays). For example, email
correspondences can be represented by a tensor with four modes `(sender, receiver, date, content)`

and user
customer ratings by a tenor with four modes `(user, item, rating, time)`

.
Given the tensor-valued data, traditional multiway factor models— such as the Tucker decomposition [21] and CANDECOMP/PARAFAC (CP) [6]—have been widely applied for various applications (e.g., network traffic analysis [25], computer vision [17] and social network analysis [18, 14, 19], etc). These models, however, face serious challenges for modeling complex multiway interactions. First the interactions between entities in each mode may be coupled together and highly nonlinear. The classical multi-linear models cannot capture these intricate relationships.
Second, the data are often noisy, but the classical models are not designed to deal with noisy observations.
Third, the data may contain many missing values. We need to first impute the missing values before we can apply the classical multiway factor models. Forth, the data may not be restricted to real values: they can be binary as in dynamic network data or have ordinal values for user-movie-ratings. But the classical models simply treat them as continuous data—this treatment would lead to degenerated predictive performance.

To address these challenges we propose a nonparametric Bayesian multiway analysis model, InfTucker. Based on latent Gaussian processes or processes, it conducts the Tucker decomposition in an infinite dimensional feature space. It generalize the elegant work of Chu and Ghahramani [5] by capturing nonlinear interactions between different tensor modes. Grounded in a probabilistic framework, it naturally handles noisy observations and missing data. Furthermore, it handles various data types—binary or continuous—by simply using suitable data likelihoods. Although InfTucker offers an elegant solution to multiway analysis, learning the model from data is computationally challenging. To overcome this challenge, we develop an efficient variational Bayesian approach that explores tensor structures to significantly reduce the computational cost. This efficient inference technique also enables the usage of nonlinear covariance functions for latent Gaussian and processes on datasets with reasonably large size.

Our experimental results on chemometrics and social network datasets demonstrate that the InfTucker achieves significantly higher prediction accuracy than state-of-the-art tensor decomposition approaches—including High Order Singular Value Decomposition (HOSVD) [10], Weighted CP [1] and nonnegative tensor decomposition [17].

## 2 Preliminary

Notations. Throughout this paper, we denote scalars by lower case letters (e.g. ), vectors by bold lower case letters (e.g. ), matrices by bold upper case letters (e.g. ), and tensors by calligraphic upper case letters (e.g. ). Calligraphic upper case letters are also used for probability distributions, , . We use to represent the entry of a matrix , to represent the entry of a tensor . denotes the Kronecker product of the two matrices there. We define the vectorization operation, denoted by , to stack the tensor entries into a by 1 vector. The entry of is mapped to the entry at position of ^{1}

(1) |

Tensor decomposition: There are two families of tensor decomposition, the Tucker family and the CP family. The Tucker family extends bilinear factorization models to handle tensor datasets. For an observed -mode tensor , the general form of Tucker decomposition is

(2) |

where is the core tensor, and are latent factor matrices. As in [9], We collectively denote the group of matrices as a Tucker tensor with a identity core —this allows us to compactly represent the Tucker decomposition as . The vector form of (2) is

(3) |

The CP family is a restricted form of the Tucker family. The entry-wise definition of CP is . The alternating least square (ALS) method has been used to solve both Tucker decomposition and CP [9].

## 3 Infinite Tucker decomposition

In this section we present the infinite Tucker decomposition based on latent Gaussian processes and processes. The following discussion is primarily for latent Gaussian processes. The model derivation for latent processes is similar to that of latent Gaussian processes.

We extend classical Tucker decomposition in three aspects: i) flexible noise models for both continuous and binary observations; ii) an infinite core tensor to model complex interactions; and iii) latent Gaussian process prior or latent process.

More specifically, we assume the observed tensor is sampled from a latent real-valued tensor via a probabilistic noise model .

We conduct Tucker decomposition for with a core tensor of infinite size. To do so, we use a countably infinite feature mapping for the rows of the component matrix . Let denotes the -th row of , A feature mapping maps each to the infinite feature space , where denotes the countable infinity. The inner product of the feature mapping is denoted as . Let denote the first coordinates of , denote an infinite -mode core tensor, and denote the first dimensions in every mode of . The infinite Tucker decomposition “” for the latent tensor can be formally defined as the limit of a series of finite Tucker decompositions.

(4) |

where .

As shown in the next Section, we use a latent tensor-variate Gaussian process prior on and then marginalize it out to obtain a Gaussian process over . Alternatively, we can also use a latent tensor-variate process prior on and obtain a process over .

### 3.1 Tensor-variate Gaussian processes

Before formally defining the tensor-variate process, we denote the domain of the mode by , the covariance functions by , the covariance matrices by a Tucker tensor and . The norm of the a tensor is defined as . Then we define tensor-variate processes as follows.

###### Definition 1 (Tensor-variate Gaussian Processes)

Given location sets , , let be the mean function. is a set of random tensor variables where is a random function. For any finite sets , let , where , be a random tensor and be the mean tensor.

We say follows a tensor-variate Gaussian process, if follows a tensor-variate normal distribution:

(5) |

In this paper, we set the mean function to be zero, i.e. . Let denotes a normal distribution with mean and covariance . If the latent tensor is drawn from a tensor-variate Gaussian process, then , where . We choose the prior on the truncated core tensor to be , where denotes the identity matrix. The next theorem proves that the limit defined in (4) is the corresponding tensor process.

###### Theorem 2

Let , and be a series of covariance functions. Define a multi-linear function by

where . If , then follows a tensor-variate Gaussian distribution , and it converges to in distribution as .

Finally, to encourage sparsity in estimated —for easy model interpretation—we use Laplace prior .

### 3.2 Tensor-variate processes

Because of the strong relation between -distributions and Gaussian distributions— distributions can be regarded as mixtures of Gaussian distributions weighted by Gamma distributions, we can easily define tensor-variate processes:

###### Definition 3 (Tensor-variate Processes)

Let be the Gamma function. The set follows a tensor-variate process with degree of freedom , if follows tensor distribution with the following density

We can also prove a similar convergence result for tensor-variate Gaussian distribution.

###### Theorem 4

If , then follows a tensor-variate distribution , and it converges to tensor-variate processes in distribution as .

The proof of Theorem 2 follows exactly the same path as that of the convergence result for tensor-variate Gaussian processes.

The above theorem shows that probabilistic infinite Tucker decomposition of can be realized by modeling as a draw from a tensor-variate process on the location vectors induced from the unknown component matrices . Our definition of tensor-variate processes generalizes matrix-variate process defined in [26]. Theorem 2 also suggests a constructive definition of tensor-variate processes for general covariance functions.

### 3.3 Noise models

We use a noise model to link the infinite Tucker decomposition and the tensor observation .

Probit model: In this case, each entry of the observation is binary; that is, . A probit function models the binary observation. Note that is the standard normal cumulative distribution function.

Gaussian model: We use a Gaussian likelihood to model the real-valued observation .

Missing values: We allow missing values in the observation. Let denote the indices of the observed entries in . Then we have as the likelihood.

## 4 Algorithm

Given the observed tensor , we aim to estimate the component matrices by maximizing the marginal likelihood Integrating out in the above equation is intractable however. Therefore, we resort to approximate inference; more specifically, we develop a variational expectation maximization (EM) algorithm. In the following paragraphs, we first present the inference and prediction algorithms for both of the noise models, and then describe an efficient algebraic approach to significantly reduce the computation complexity. Due to space limitation, we only describe the algorithm for tensor-variate -distribution. The algorithm for tensor-variate Gaussian distribution can be derived similarly.

### 4.1 Inference

Probit noise: We follow the data augmentation scheme by Albert and Chib [2] to decompose the probit model into . Let be the indicator function, we have

It is well known that a distribution can be factorized into a normal distribution convolved with a Gamma distribution, such that

(6) |

where denotes the tensor-variate normal distribution. The joint probability likelihood with data augmentation is

(7) |

where and is the tensor-variate normal distribution and the Gamma distribution in (4.1). is the Laplace prior.

Our variational EM algorithm consists of a variational E-step and a gradient-based M-step. In the E-step, we approximate the posterior distribution by a fully factorized distribution . Variational inference minimizes the Kullback-Leibler (KL) divergence between the approximate posterior and the true posterior.

(8) |

The variational approach optimizes one approximate distribution, e.g., , in (8) at a time, while having all the other approximate distributions fixed [3]. We loop over , and to iteratively optimize the KL divergence until convergence.

Given and , the is a truncated normal distribution

(9) | |||

(10) |

Given and , it is more convenient to write the optimized approximate distribution for in its vectorized form. Let , we have

(11) | ||||

(12) | ||||

(13) |

The optimized is also a Gamma distribution:

Based on the variational approximate distribution obtained in the E-step, we maximize the expected log likelihood over in the M-step.

(14) |

After eliminating constant terms, we need to solve the following optimization problem

(15) |

where . In the above equation (15), is considered as a function of , and is a function of . and are the statistics computed in the E-step, and they have fixed values. The gradient of w.r.t. to a scalar can be found in Appendix B. With an penalty on , we choose a projected scaled subgradient L-BFGS algorithm for optimization—due to its excellent performance [16].

Gaussian noise: The inference for the regression case follows the same format as the binary classification case. The only changes are: 1) replacing by and skipping updating . 2) The variational EM algorithm are only applied to the observed entries.

### 4.2 Prediction

Probit noise: Given a missing value index , the predictive distribution is

(16) |

The above integral is intractable, so we replace integral by the mode of its approximate posterior distribution , thus the predictive distribution is approximated by

(17) |

where

Gaussian noise: The predictive distribution for the regression case is the following integral

(18) |

### 4.3 Efficient Computation

A naïve implementation of the above algorithm requires prohibitive time complexity and space complexity for each EM iteration. The key computation bottlenecks are the operations involving defined in equation (13). To avoid this high complexity, we can make use of the Kronecker product structure. We assume to simplify the computation, it is easy to adapt our computation strategies to . Let be the singular value decomposition of the covariance matrix , is an orthogonal matrix and is a diagonal matrix. can be represented as

Let , . It is obvious that is an orthogonal matrix and is a diagonal matrix. The above relation implies that we can actually compute the singular value decomposition of from covariance matrices .

In order to efficiently compute appearing in equation (15), we use the following relations

(19) |

where with being a computed statistics in the E-step, denotes the diagonal elements of , and is a tensor of size , such that . Both time and space complexities of the last formula (19) is .

We denote and . For any tensor of the same size as , means multiplying the -th element of by the , which is the -th element of . So we have , where denotes the Hadamard product, i.e. entry-wise product. In light of this relation, we can efficiently compute (12) by

(20) |

The right-hand side of Equation (20) effectively reduce the time and space complexities of the left-hand side operations to and , respectively.

We can further reduce the complexities by approximating the covariance matrices via truncated SVD.

## 5 Related Works

The InfTucker model extends Probabilistic PCA (PPCA) [20] and Gaussian process latent variable models (GPLVMs) [11]: while PPCA and GPLVM model interactions of one mode of a matrix and ignore the joint interactions of two modes, InfTucker does. Our model is also related to previous matrix-variate GPs Yu et al. [23], Yu and Chu [22]. The main difference lies in the fact they used linear covariance functions to reduce the computational complexities and dealt with matrix-variate data for online recommendation and link prediction.

The most closely related work is the probabilistic Tucker decomposition (pTucker) model [5]; actually the GP-based InfTucker reduces to pTucker as a special case when using a linear covariance function. Our TP-based InfTucker further differs from pTucker by marginalizing out a scaling hyperparameter of the covariance function. Another related work is probabilistic high order PCA [24], which is essentially equivalent to a linear PPCA after transforming the tensor to a long vector. Hoff [7] proposed a hierarchical Bayesian extension to CANDECOMP/PARAFAC that captures the interactions of component matrices. Unlike these approaches, ours can handle non-Gaussian noise and uses nonlinear covariance functions to model complex interactions. In addition, Chu and Ghahramani [5] did not exploits the Kronecker structure of the covariance matrices, so it is difficult for pTucker to scale to large datasets and high order tensors; and Hoff [7] used a Gibbs sampler for inference—requiring high computational cost and making their approach infeasible for tensors with moderate and large sizes. By contrast, we provide a deterministic approximate inference method that exploits structures in Kronecker products in a variational Bayesian framework, making InfTucker much more efficient than competing methods.

To handle missing data, enhance model interpretability, and avoid overfitting, several extensions (e.g., using nonnegativity constraints) to tensor decomposition have been proposed, including nonnegative tensor decomposition (NTD) [17, 8, 13, 15] and Weighted tensor decomposition (WTD) [1]. Unlike ours, these models either solve the core tensors explicitly, or do not handle nonlinear multiway interactions.

Finally, note that the inference technique described in Section 4 can be adopted for Gaussian process or -process multi-task learning [4, 26]. Let be the number of tasks and be the number of data points in each task. Our inference technique can be used to reduce their time complexity from to and the space complexity from to .

## 6 Experiments

Data | amino | flow injection | bread |
---|---|---|---|

CP | 0.0530.002 | 0.0510.005 | 0.2380.001 |

TD | 0.0540.002 | 0.0510.003 | 0.2480.001 |

HOSVD | 0.0530.002 | 0.0520.004 | 0.2590.001 |

NCP | 0.0570.005 | 0.1100.023 | 0.2330.001 |

PTD | 0.0540.002 | 0.0480.002 | 0.2400.001 |

WCP | 0.0490.004 | 0.0790.011 | 0.2460.003 |

0.0470.003 | 0.0490.002 | 0.2320.001 | |

0.0470.003 | 0.0460.002 | 0.2250.001 |

We use and to denote the two new infinite Tucker decomposition models based on tensor-variate Gaussian and processes, respectively.
To evaluate them, we conducted two sets of experiments, one on continuous tensor data and the other on binary tensor data. For both experiments, we compared InfTucker with the following tensor decomposition methods: CANDECOMP/PARAFAC (CP), Tucker decomposition (TD), Nonnegative CP (NCP), High Order SVD (HOSVD), Weighted CP (WCP) and Probabilistic Tucker Decomposition (PTD). We implemented PTD as described in the paper by Chu and Ghahramani [5] and applied to a small continuous tensor data (bread as described in the 6.1.1). To handle larger and binary datasets, we used probit models and the efficient computation techniques described in Section 4.3 for PTD.
For the other methods, we used the implementation of the tensor data analysis toolbox^{2}

### 6.1 Experiment on continuous tensor data

#### Experimental setting

We used three continuous chemometrics datasets^{3}

All the above tensor data were normalized such that each element of the tensor has zero mean and unit variance (based on the vectorized representations). For each tensor, we randomly split it via 5-fold cross validation: each time four folds are used for training and one fold for testing. This procedure was repeated 10 times, each time with a different partition for the 5-fold cross validation. In , the degree of freedom in the tensor-variate process is fixed to . We chose the Gaussian/exponential covariance functions , where and is selected from by 5-fold cross validation. The regularization parameter for and is chosen from .

#### Results

We compared the the prediction accuracies of all the approaches on hold-out elements of the tensor data. For each comparison, we used the same number of latent factors, denoted as , for all the approaches. We varied from 3 to 5 and computed the averaged mean square errors (MSEs) and the standard errors of the MSEs. Based on cross-validation, we set . The MSEs on the three datasets are summarized in Table 1. Based on the prediction accuracies, PTD and WCP tie on the third best, while HOSVD is the worst ( perhaps due to the strong nonnegativity constraint on the latent factors). Clearly, achieved higher prediction accuracies than all the previous approaches on all the datasets, and further outperformed for most cases.

### 6.2 Experiment on binary tensor data

#### Experimental setting

We extracted three binary social network datasets, Enron, Digg1, and Digg2, for our experimental evaluation. Enron is a relational dataset describing the three-way relationship: sender-receiver-email. This dataset, extracted from the Enron email dataset^{4}^{5}

#### Results

We chose from the range {3,5,8,10,15,20} based on cross-validation. Since the data are binary, we evaluated all these approaches by area-under-curve (AUC) values averaged over 50 runs. The larger the averaged AUC value an approach achieves, the better it is. We reported the averaged AUC values for all algorithms in Figure 1. Again, the proposed and approaches significantly outperform all the others. Note that the nonprobabilistic approaches—such as CP and TD—- suffer severely from the least square minimization; given the sparse and binary training data, the least-square-minimization leads to too many predictions with zero values, a result of both overfitting and mis-model fitting. This experimental comparison fully demonstrates the advantages of InfTucker (stemming from the right noise models and the nonparametric Bayesian treatment).

## 7 Conclusion

To conduct multiway data analysis, we have proposed a new nonparametric Bayesian tensor decomposition framework, InfTucker, where the observed tensor is modeled as a sample from a stochastic processes on tensors. In particular, we have employed tensor-variate Gaussian and processes. This new framework can model nonlinear interactions between multi-aspects of the tensor data, handle missing data and noise, and quantify prediction confidence (based on predictive posterior distributions). We have also presented an efficient variational method to estimate InfTucker from data. Experimental results on chemometrics and social network datasets demonstrated that the superior predictive performance of InfTucker over the alternative tensor decomposition approaches.

## Appendix A Proof of Theorem 2

Proof Sketch If , then . Let be location sets (matrices) as used in (4), we have

(21) |

Thus, , where are the covariance matrix. Inverting (21) gives , which proves follows the tensor Gaussian process.

From the definition of inner product in , we have the following identity on the convergence of covariance function.

Convergence in distribution follows from this convergence result.

## Appendix B Gradient of

(22) | ||||

### Footnotes

- Unlike the usual column-wise -operation, our definition of on matrices is row-wise, which avoids the use of transpose in many equations throughout this paper.
- http://csmr.ca.sandia.gov/~tgkolda/TensorToolbox/
- Available from http://www.models.kvl.dk/datasets
- Available at http://www.cs.cmu.edu/~enron/
- Available at http://www.public.asu.edu/~ylin56/kdd09sup.html

### References

- E. Acar, D. M. Dunlavy, T. G. Kolda, and M. Mørup. Scalable tensor factorizations for incomplete data. Chemometrics and Intelligent Laboratory Systems, 106(1):41–56, 2011.
- James H Albert and Siddhartha Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88:669–679, 1993.
- Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.
- Edwin Bonilla, Kian Ming Chai, and Chris Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems 20. 2008.
- Wei Chu and Zoubin Ghahramani. Probabilistic models for incomplete multi-dimensional arrays. Journal of Machine Learning Research - Proceedings Track, 5:89–96, 2009.
- R. A. Harshman. Foundations of the PARAFAC procedure: Model and conditions for an”explanatory”multi-mode factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970.
- Peter D. Hoff. Hierarchical multilinear models for multiway data. Computational Statistics & Data Analysis, 55:530–543, 2011.
- Yong-Deok Kim and Seungjin Choi. Nonnegative Tucker decomposition. In Computer Vision and Pattern Recognition, 2007. CVPR ’07. IEEE Conference on, pages 1 –8, June 2007.
- Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, September 2009.
- Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. A multilinear singular value decomposition. SIAM J. Matrix Anal. Appl, 21:1253–1278, 2000.
- Neil Lawrence. The Gaussian process latent variable model. Technical Report CS-06-03, The University of Sheffield, 2006.
- Neil D. Lawrence and Michael I. Jordan. Semi-supervised learning via gaussian processes. In Advances in Neural Information Processing Systems 17. 2005.
- Tzu-Kuo Huang Jeff Schneider Jaime G. Carbonell Liang Xiong, Xi Chen. Temporal collaborative filtering with bayesian probabilistic tensor factorization. In Proceedings of SDM, 2010.
- Yu-Ru Lin, Jimeng Sun, Paul Castro, Ravi Konuru, Hari Sundaram, and Aisling Kelliher. Metafac: community discovery via relational hypergraph factorization. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining.
- Ian Porteous, Evgeniy Bart, and Max Welling. Multi-HDP: A non-parametric Bayesian model for tensor factorization. In Proc. AAAI Conference on Artificial Intelligence, 2008.
- Mark Schmidt. Graphical Model Structure Learning with L1-Regularization. PhD thesis, University of British Columbia, 2010.
- Amnon Shashua and Tamir Hazan. Non-negative tensor factorization with applications to statistics and computer vision. In Proceedings of the 22nd ICML, 2005.
- Jimeng Sun, Dacheng Tao, and Christos Faloutsos. Beyond streams and graphs: Dynamic tensor analysis. In KDD, 2006.
- Jimeng Sun, Spiros Papadimitriou, Ching-Yung Lin, Nan Cao, Shixia Liu, and Weihong Qian. Multivis: Content-based social network exploration through multi-way visual analysis. In SDM’09, pages 1063–1074, 2009.
- Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component analysis. Journal of The Royal Statistical Society Series B-statistical Methodology, 61:611–622, 1999.
- Ledyard Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31:279–311, 1966.
- Kai Yu and Wei Chu. Gaussian process models for link analysis and transfer learning. In Advances in Neural Information Processing Systems 20. 2008.
- Kai Yu, Wei Chu, Shipeng Yu, Volker Tresp, and Zhao Xu. Stochastic relational models for discriminative link prediction. In Advances in Neural Information Processing Systems, 2007.
- Shipeng Yu, Jinbo Bi, and Jieping Ye. Matrix-variate and higher-order probabilistic projections. Data Min. Knowl. Discov., 22:372–392, 2011.
- Yin Zhang, Matthew Roughan, Walter Willinger, and Lili Qiu. Spatio-temporal compressive sensing and internet traffic matrices. In Proceedings of the ACM SIGCOMM 2009 conference on Data communication.
- Yu Zhang and Dit-Yan Yeung. Multi-task learning using generalized t process. Journal of Machine Learning Research - Proceedings Track, 9:964–971, 2010.