Scalable Bayesian Non-Negative Tensor Factorization for Massive Count Data
Abstract
We present a Bayesian nonnegative tensor factorization model for count-valued tensor data, and develop scalable inference algorithms (both batch and online) for dealing with massive tensors. Our generative model can handle overdispersed counts as well as infer the rank of the decomposition. Moreover, leveraging a reparameterization of the Poisson distribution as a multinomial facilitates conjugacy in the model and enables simple and efficient Gibbs sampling and variational Bayes (VB) inference updates, with a computational cost that depends only on the number of nonzeros in the tensor. The model also provides nice interpretability for the factors: each factor corresponds to a "topic". We develop a set of online inference algorithms that allow further scaling up the model to massive tensors, for which batch inference methods may be infeasible. We apply our framework to diverse real-world applications, such as multiway topic modeling on a scientific publications database, analyzing a political science data set, and analyzing a massive household transactions data set.
Keywords:
Tensor factorization, Bayesian learning, latent factor models, count data, online Bayesian inference
1 Introduction
Discovering interpretable latent structures in complex multiway (tensor) data is an important problem when learning from polyadic relationships among multiple sets of objects. Tensor factorization [14, 5] offers a promising way of extracting such latent structures. The inferred factors can be used to analyze objects in each mode of the tensor (e.g., via classification or clustering using the factors), or to do tensor completion.
Of particular interest, in the context of such data, are sparsely observed count-valued tensors, which are routinely encountered in many applications. For example, in analyzing a database of scientific publications, the data may be in the form of a sparse four-way count-valued tensor (authors × words × journals × years). Another application where multiway count data is routinely encountered is the analysis of contingency tables [11], which represent the co-occurrence statistics of multiple sets of objects.
We present a scalable Bayesian model for analyzing such sparsely observed tensor data. Our framework is based on a beta-negative binomial construction, which provides a principled generative model for tensors with sparse and potentially overdispersed count data, and produces a nonnegative tensor factorization. In addition to performing nonnegative tensor factorization and tensor completion for count-valued tensors, our model has the property that each latent factor inferred for a tensor mode also represents a distribution (or "topic", as in topic models) over the objects of that tensor mode; our model naturally accomplishes this by placing a Dirichlet prior over the columns of the factor matrix of each tensor mode. In addition to providing an expressive and interpretable model for analyzing sparse count-valued tensors, the model automatically infers the rank of the decomposition, which sidesteps the crucial issue of pre-specifying the rank [14, 18, 22].
Our framework also consists of a set of batch and scalable online inference methods. Using a reparameterization of the Poisson distribution as a multinomial allows us to achieve conjugacy, which facilitates closed-form Gibbs sampling as well as variational Bayes (VB) inference. Moreover, we also develop two online inference algorithms: one based on online MCMC [7] and the other based on stochastic variational inference [9]. These inference algorithms enable scaling up the model to massive tensor data.
One of the motivations behind our work is analyzing massive multiway data for tasks such as understanding thematic structures in scholarly databases (e.g., to design better recommender systems for scholars), understanding consumer behavior from the shopping patterns of large demographics (e.g., to design better marketing and supply strategies), and understanding international relations in political science studies. In our experiments, we provide qualitative analyses for such applications on large-scale real-world data sets, and study the scalability behavior of our model.
2 Canonical PARAFAC Decomposition
Given a tensor $\boldsymbol{\mathcal{Y}}$ of size $n_1 \times n_2 \times \cdots \times n_K$, with $n_k$ denoting the size of $\boldsymbol{\mathcal{Y}}$ along the $k$th mode (or "way") of the tensor, the goal in a Canonical PARAFAC (CP) decomposition [14] is to decompose $\boldsymbol{\mathcal{Y}}$ into a set of $K$ factor matrices $\mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(K)}$, where $\mathbf{U}^{(k)} = [\mathbf{u}^{(k)}_1, \ldots, \mathbf{u}^{(k)}_R] \in \mathbb{R}^{n_k \times R}$ denotes the factor matrix associated with mode $k$. In its most general form, CP decomposition expresses the tensor via a weighted sum of $R$ rank-1 tensors as $\boldsymbol{\mathcal{Y}} \sim f\big(\sum_{r=1}^{R} \lambda_r\, \mathbf{u}^{(1)}_r \circ \cdots \circ \mathbf{u}^{(K)}_r\big)$. The form of $f$ depends on the type of data being modeled (e.g., $f$ can be Gaussian for real-valued, Bernoulli-logistic for binary-valued, or Poisson for count-valued tensors). Here $\lambda_r$ is the weight associated with the $r$th rank-1 component, the column vector $\mathbf{u}^{(k)}_r$ represents the $r$th latent factor of mode $k$, and $\circ$ denotes the vector outer product.
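The weighted sum of rank-1 tensors above can be sketched in a few lines of numpy. This is an illustrative snippet, not code from the paper; the function name and interface are ours:

```python
import numpy as np

def cp_reconstruct(lam, factors):
    """Reconstruct a K-way tensor from CP weights lam (length R) and
    factor matrices factors[k], each of shape (n_k, R)."""
    shape = tuple(U.shape[0] for U in factors)
    T = np.zeros(shape)
    for r in range(len(lam)):
        # Build the rank-1 tensor u_r^(1) o u_r^(2) o ... o u_r^(K)
        rank1 = factors[0][:, r]
        for U in factors[1:]:
            rank1 = np.multiply.outer(rank1, U[:, r])
        T += lam[r] * rank1
    return T
```

For a three-way tensor this agrees with `np.einsum('r,ir,jr,kr->ijk', lam, U1, U2, U3)`, which is a convenient vectorized alternative.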
3 BetaNegative Binomial CP Decomposition
We focus on modeling count-valued tensor data [4] and assume the following generative model for the tensor $\boldsymbol{\mathcal{Y}}$:
(1) $y_{\mathbf{i}} \sim \mathrm{Pois}\Big(\sum_{r=1}^{R} \lambda_r \prod_{k=1}^{K} u^{(k)}_{i_k r}\Big)$

(2) $\mathbf{u}^{(k)}_r \sim \mathrm{Dir}\big(a^{(k)}, \ldots, a^{(k)}\big)$

(3) $\lambda_r \sim \mathrm{Gamma}\Big(g_r, \frac{p_r}{1-p_r}\Big)$

(4) $p_r \sim \mathrm{Beta}\big(c\epsilon, c(1-\epsilon)\big)$
We use the subscript $\mathbf{i} = (i_1, \ldots, i_K)$, with $i_k \in \{1, \ldots, n_k\}$, to denote the multi-index of an entry in $\boldsymbol{\mathcal{Y}}$. Using this notation, the $\mathbf{i}$th entry of the tensor can be written as $y_{\mathbf{i}}$. We assume that we are given $N$ observations from the tensor $\boldsymbol{\mathcal{Y}}$.
Since the gamma-Poisson mixture distribution is equivalent to a negative binomial distribution [15], (1) and (3), coupled with the beta prior (Eq 4) on $p_r$, lead to what we will call the beta-negative binomial CP (BNB-CP) decomposition model. A few things worth noting about our model:

The Dirichlet prior on the factors naturally imposes nonnegativity constraints [4] on the factor matrices $\mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(K)}$. Moreover, since each column of these factor matrices sums to 1, $\mathbf{u}^{(k)}_r$ can also be thought of as a distribution (e.g., a "topic") over the entities in mode $k$.

The gamma-beta hierarchical construction of $\lambda_r$ (Eqs 3 and 4) allows inferring the rank of the tensor by setting an upper bound $R$ on the number of factors and letting the inference procedure infer the appropriate number of factors by shrinking the coefficients $\lambda_r$ close to zero for the irrelevant factors.

The resulting negative binomial model is useful for modeling overdispersed count data in cases where the Poisson likelihood may not be suitable.
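The generative process above can be sketched as a small numpy simulator for a three-way tensor. This is an illustrative sketch, not the paper's code; the hyperparameter names (`a`, `g_r`, `c`, `eps`) mirror the priors in the text, but their default values here are arbitrary:

```python
import numpy as np

def sample_bnbcp(shape, R, a=1.0, g_r=1.0, c=1.0, eps=0.5, seed=None):
    """Draw one 3-way count tensor from the BNB-CP generative model (a sketch)."""
    assert len(shape) == 3, "sketch handles 3-way tensors for simplicity"
    rng = np.random.default_rng(seed)
    # One Dirichlet column per factor, per mode: each column sums to 1.
    factors = [rng.dirichlet(np.full(n, a), size=R).T for n in shape]  # (n_k, R)
    p = rng.beta(c * eps, c * (1.0 - eps), size=R)
    # Gamma-Poisson mixture -> negative binomial counts (handles overdispersion).
    lam = rng.gamma(g_r, p / (1.0 - p))
    rate = np.einsum('r,ir,jr,kr->ijk', lam, *factors)
    return rng.poisson(rate)
```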
3.1 Reparametrizing the Poisson Distribution
The generative model described in Eqs (1)-(4) is not conjugate. We now describe two equivalent parametrizations [6, 24] of (1), which transform (1)-(4) into a fully conjugate model and facilitate easy-to-derive and scalable inference procedures. These parameterizations are based on a data augmentation scheme described below.
The first parametrization expresses the $\mathbf{i}$th count-valued entry $y_{\mathbf{i}}$ of the tensor as a sum of $R$ latent counts:

(5) $y_{\mathbf{i}} = \sum_{r=1}^{R} \tilde{y}_{\mathbf{i}r}, \qquad \tilde{y}_{\mathbf{i}r} \sim \mathrm{Pois}\Big(\lambda_r \prod_{k=1}^{K} u^{(k)}_{i_k r}\Big)$
The second parametrization assumes the vector of latent counts $(\tilde{y}_{\mathbf{i}1}, \ldots, \tilde{y}_{\mathbf{i}R})$ is drawn from a multinomial as

(6) $(\tilde{y}_{\mathbf{i}1}, \ldots, \tilde{y}_{\mathbf{i}R}) \sim \mathrm{Mult}\big(y_{\mathbf{i}};\, \zeta_{\mathbf{i}1}, \ldots, \zeta_{\mathbf{i}R}\big), \qquad \zeta_{\mathbf{i}r} = \frac{\lambda_r \prod_{k=1}^{K} u^{(k)}_{i_k r}}{\sum_{r'=1}^{R} \lambda_{r'} \prod_{k=1}^{K} u^{(k)}_{i_k r'}}$

The equivalence of the two parametrizations follows from the following standard result:
Lemma 1
Suppose that $x_1, \ldots, x_R$ are independent random variables with $x_r \sim \mathrm{Pois}(\theta_r)$ and $x = \sum_{r=1}^{R} x_r$. Set $\theta = \sum_{r=1}^{R} \theta_r$; let $(z, z_1, \ldots, z_R)$ be another set of random variables such that $z \sim \mathrm{Pois}(\theta)$, and $(z_1, \ldots, z_R) \mid z \sim \mathrm{Mult}\big(z;\, \theta_1/\theta, \ldots, \theta_R/\theta\big)$. Then the distribution of $\mathbf{x} = (x_1, \ldots, x_R)$ is the same as the distribution of $\mathbf{z} = (z_1, \ldots, z_R)$.
These parameterizations, along with the fact that the columns of each factor matrix are drawn from a Dirichlet, allow us to leverage the Dirichlet-multinomial conjugacy and derive simple Gibbs sampling and variational Bayes (VB) inference update equations, as described in Section 4.
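Lemma 1 can be checked empirically in a few lines: sampling independent Poissons should match, in distribution, sampling a single Poisson total and thinning it with a multinomial. The snippet below (ours, for illustration) compares component-wise moments of the two parameterizations:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, 1.5, 3.0])   # component rates theta_r
n = 50_000

# Parameterization 1: independent Poissons x_r ~ Pois(theta_r).
x = rng.poisson(theta, size=(n, 3))

# Parameterization 2: total z ~ Pois(sum theta), thinned by a multinomial.
z_tot = rng.poisson(theta.sum(), size=n)
z = np.stack([rng.multinomial(t, theta / theta.sum()) for t in z_tot])

# Component-wise means (and, for a Poisson, variances) should match theta.
mean_gap = np.abs(x.mean(axis=0) - z.mean(axis=0)).max()
```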
4 Inference
We first present the update equations for batch Gibbs sampling (Section 4.1) and batch VB inference (Section 4.2). We then present two online inference algorithms, based on: (i) conditional density filtering [7], which provides an efficient way to perform online MCMC sampling using conditional sufficient statistics of the model parameters; and (ii) stochastic variational inference [9], which allows scaling up VB inference by processing data in small minibatches.
We also define two quantities, $s^{(k)}_r(j) = \sum_{\mathbf{i}: i_k = j} \tilde{y}_{\mathbf{i}r}$ and $s_r = \sum_{\mathbf{i}} \tilde{y}_{\mathbf{i}r}$, which denote aggregates (sufficient statistics) computed using the latent counts $\tilde{y}_{\mathbf{i}r}$. These quantities appear at various places in the description of the inference algorithms we develop.
4.1 Gibbs Sampling

Sampling the latent counts: The latent counts $\tilde{y}_{\mathbf{i}r}$ are sampled from the multinomial in (6).

Sampling the factors: Due to the Dirichlet-multinomial conjugacy, the columns of each factor matrix have Dirichlet posteriors and are sampled as

(7) $\mathbf{u}^{(k)}_r \sim \mathrm{Dir}\big(a^{(k)} + s^{(k)}_r(1), \ldots, a^{(k)} + s^{(k)}_r(n_k)\big)$
Sampling $p_r$: Using the beta-negative binomial construction, the beta posterior is

(8) $p_r \sim \mathrm{Beta}\big(c\epsilon + s_r,\; c(1-\epsilon) + g_r\big)$

Sampling $\lambda_r$: Again using the fact that $s_r = \sum_{\mathbf{i}} \tilde{y}_{\mathbf{i}r} \sim \mathrm{Pois}(\lambda_r)$ (the columns of each factor matrix sum to 1), and due to the gamma-Poisson conjugacy, we have

(9) $\lambda_r \sim \mathrm{Gamma}\big(g_r + s_r,\; p_r\big)$
Computational Complexity: Sampling the latent counts for each nonzero observation (note that for $y_{\mathbf{i}} = 0$ the latent counts are trivially zero) requires computing $\zeta_{\mathbf{i}1}, \ldots, \zeta_{\mathbf{i}R}$, and computing each $\zeta_{\mathbf{i}r}$ requires $O(K)$ time (Eq 6). Therefore, sampling all the latent counts requires $O(NRK)$ time, where $N$ is the number of nonzero entries. Sampling the latent factors for the $K$ tensor modes requires $O(R \sum_{k=1}^{K} n_k)$ time. Sampling $\{\lambda_r\}$ and $\{p_r\}$ requires $O(R)$ time each. Of all these steps, sampling the latent counts (which are also used to compute the sufficient statistics $s^{(k)}_r(j)$ and $s_r$) is the dominant one, leading to an overall time complexity of $O(NRK)$ for the Gibbs sampling procedure.
The linear dependence on the number of nonzeros is especially appealing because most real-world count-valued tensors are extremely sparse (far fewer than even 1% of the entries are nonzero). In contrast to standard negative binomial models for count data, for which the inference complexity also depends on the zeros, whose number may be massive (so that heuristics such as subsampling the zeros are needed), the reparametrizations (Section 3.1) used by our model allow us to ignore the zeros in the multinomial sampling step (the sufficient statistics do not depend on the zero entries of the tensor), thereby significantly speeding up the inference.
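The dominant Gibbs step, sampling the latent counts over nonzero entries only, can be sketched as follows for a tensor stored in coordinate (COO) format. This is an illustrative sketch of the step described above, not the authors' implementation; the interface is ours:

```python
import numpy as np

def sample_latent_counts(inds, vals, lam, factors, seed=None):
    """Sample latent counts at the nonzero entries only (a sketch).
    inds: (nnz, K) array of multi-indices; vals: (nnz,) observed counts.
    For each y_i, draws (y_i1, ..., y_iR) ~ Mult(y_i; zeta_i), where
    zeta_ir is proportional to lam_r * prod_k U^(k)[i_k, r]. Zero entries
    contribute zero latent counts and are skipped entirely."""
    rng = np.random.default_rng(seed)
    nnz, K = inds.shape
    zeta = np.tile(lam, (nnz, 1))             # (nnz, R), start from the weights
    for k in range(K):
        zeta *= factors[k][inds[:, k], :]     # multiply in the mode-k factor rows
    zeta /= zeta.sum(axis=1, keepdims=True)   # normalize to multinomial probabilities
    return np.stack([rng.multinomial(v, z) for v, z in zip(vals, zeta)])
```

The sufficient statistics then follow by aggregation, e.g. `s_r = latent.sum(axis=0)` and, per mode, `np.add.at(s_k, inds[:, k], latent)`.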
4.2 Variational Bayes Inference
Using the mean-field assumption [12], we approximate the target posterior distribution by a fully factorized distribution $q(\tilde{\mathbf{y}})\, q(\mathbf{U}^{(1)}) \cdots q(\mathbf{U}^{(K)})\, q(\mathbf{p})\, q(\boldsymbol{\lambda})$. Our fully conjugate model enables closed-form variational Bayes (VB) inference updates, with the distributions $q(\tilde{\mathbf{y}}_{\mathbf{i}})$, $q(\mathbf{u}^{(k)}_r)$, $q(p_r)$, and $q(\lambda_r)$ being multinomial, Dirichlet, beta, and gamma, respectively. We summarize the update equations for the variational parameters of each of these distributions below:

Updating $q(\tilde{\mathbf{y}}_{\mathbf{i}})$: Using (6), the updates are given by $\langle \tilde{y}_{\mathbf{i}r} \rangle = y_{\mathbf{i}}\, \zeta_{\mathbf{i}r}$, where $\zeta_{\mathbf{i}r}$ is defined as $\zeta_{\mathbf{i}r} \propto \exp(\eta_{\mathbf{i}r})$ and $\eta_{\mathbf{i}r}$ can be computed as

(10) $\eta_{\mathbf{i}r} = \langle \ln \lambda_r \rangle + \sum_{k=1}^{K} \langle \ln u^{(k)}_{i_k r} \rangle$

where the expected logarithms are computed in terms of $\psi(\cdot)$, the digamma function, which is the first derivative of the logarithm of the gamma function.

Updating $q(\mathbf{u}^{(k)}_r)$: The mean-field posterior is Dirichlet with each of the component means given by $\langle u^{(k)}_{jr} \rangle = \frac{a^{(k)} + \langle s^{(k)}_r(j) \rangle}{\sum_{j'=1}^{n_k} \big(a^{(k)} + \langle s^{(k)}_r(j') \rangle\big)}$, where $\langle s^{(k)}_r(j) \rangle = \sum_{\mathbf{i}: i_k = j} y_{\mathbf{i}}\, \zeta_{\mathbf{i}r}$.

Updating $q(p_r)$: The mean-field posterior is beta with mean given by $\frac{\alpha_r}{\alpha_r + \beta_r}$, where $\alpha_r = c\epsilon + \langle s_r \rangle$ and $\beta_r = c(1-\epsilon) + g_r$.

Updating $q(\lambda_r)$: The mean-field posterior is gamma with mean given by $a_r b_r$, where $a_r = g_r + \langle s_r \rangle$ and $b_r = \langle p_r \rangle$.
A note on Gibbs vs VB: The per-iteration time complexity of the VB inference procedure is also $O(NRK)$. It is to be noted, however, that in practice one iteration of VB in this model is a bit more expensive than one iteration of Gibbs, due to the digamma function evaluations needed in VB when updating the $\zeta_{\mathbf{i}r}$'s. Prior work on Bayesian inference for topic models [8] also supports this observation.
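The digamma evaluations mentioned above arise when computing expected log-factors under the Dirichlet posteriors. A minimal sketch (our own helper, using `scipy.special.digamma`) for one mode's factor matrix:

```python
import numpy as np
from scipy.special import digamma

def expected_log_dirichlet(gamma_param):
    """E[ln u_jr] under column-wise Dirichlet posteriors with parameters
    gamma_param of shape (n_k, R): psi(gamma_jr) - psi(sum_j gamma_jr).
    These terms enter the multinomial updates and are the digamma
    evaluations that make a VB iteration costlier than a Gibbs iteration."""
    return digamma(gamma_param) - digamma(gamma_param.sum(axis=0, keepdims=True))
```

As a sanity check, for a symmetric Dirichlet(1, 1) each expected log-coordinate equals $\psi(1) - \psi(2) = -1$.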
4.3 Online Inference
Batch Gibbs (Section 4.1) and VB (Section 4.2) inference algorithms are simple to implement and efficient to run on moderately large problems. These algorithms can, however, be slow on massive data sets (e.g., where the number of tensor entries and/or the dimensions of the tensor are massive): the Gibbs sampler may exhibit slow mixing and batch VB may be slow to converge. To handle such massive tensor data, we develop two online inference algorithms. The first is an online MCMC method based on conditional density filtering [7], while the second is based on stochastic variational inference [9]. Both of these algorithms allow processing data in small minibatches and enable our model to analyze massive and/or streaming tensor data.
4.3.1 Conditional Density Filtering
The conditional density filtering (CDF) algorithm [7] for our model selects a minibatch of tensor entries at each iteration, samples the latent counts for these entries conditioned on the previous estimates of the model parameters, updates the sufficient statistics $s^{(k)}_r(j)$ and $s_r$ using these latent counts (as described below), and resamples the model parameters conditioned on these sufficient statistics. Denoting by $I_t$ the set of data indices in the minibatch at round $t$, the algorithm proceeds as follows:

Sampling the latent counts: For all $\mathbf{i} \in I_t$, sample the latent counts $\tilde{y}_{\mathbf{i}r}$ using (6).

Updating the conditional sufficient statistics: Using data from the current minibatch, update the conditional sufficient statistics as:

(11) $s^{(k),t}_r(j) = (1-\rho_t)\, s^{(k),t-1}_r(j) + \rho_t\, \frac{N}{|I_t|} \sum_{\mathbf{i} \in I_t : i_k = j} \tilde{y}_{\mathbf{i}r}$

(12) $s^{t}_r = (1-\rho_t)\, s^{t-1}_r + \rho_t\, \frac{N}{|I_t|} \sum_{\mathbf{i} \in I_t} \tilde{y}_{\mathbf{i}r}$

Note that the updated conditional sufficient statistics (CSS), indexed by superscript $t$, are a weighted average of the old CSS, indexed by superscript $t-1$, and of the CSS computed using only the current minibatch (of size $|I_t|$). In addition, the latter term is further weighted by $N/|I_t|$ so as to represent the average CSS over the entire data. In the above, $\rho_t$ is defined as $\rho_t = (t_0 + t)^{-\kappa}$, with $\kappa \in (0.5, 1]$ needed to guarantee convergence [3].

Updating the model parameters: Using the updated CSS, draw samples for each of the model parameters $\mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(K)}, \mathbf{p}, \boldsymbol{\lambda}$ from the following conditionals:

(13) $\mathbf{u}^{(k)}_r \sim \mathrm{Dir}\big(a^{(k)} + s^{(k),t}_r(1), \ldots, a^{(k)} + s^{(k),t}_r(n_k)\big)$

(14) $p_r \sim \mathrm{Beta}\big(c\epsilon + s^{t}_r,\; c(1-\epsilon) + g_r\big)$

(15) $\lambda_r \sim \mathrm{Gamma}\big(g_r + s^{t}_r,\; p_r\big)$

and either store the sample averages of $\mathbf{U}^{(k)}$, $\boldsymbol{\lambda}$, and $\mathbf{p}$, or their analytic means, to use for the next CDF iteration [7]. Since the analytic means of the model parameters are available in closed form in this case, we use the latter option, which obviates the need to draw samples, thereby also speeding up the inference significantly.
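The decayed-average CSS update (Eqs 11-12) shares a single scalar pattern, which can be sketched generically; the function and the default values `t0=1.0`, `kappa=0.7` are illustrative, not the paper's settings:

```python
def update_css(s_old, s_batch, t, N, B, t0=1.0, kappa=0.7):
    """Weighted-average update of a conditional sufficient statistic (a sketch).
    s_batch is the statistic computed on the current minibatch of size B;
    the N/B factor rescales it to a full-data estimate, and the result is
    blended with the previous value using rho_t = (t0 + t)^(-kappa)."""
    rho = (t0 + t) ** (-kappa)
    return (1.0 - rho) * s_old + rho * (N / B) * s_batch
```

At $t = 0$ (with $t_0 = 1$) the step size is $\rho_0 = 1$, so the CSS is simply the rescaled minibatch statistic; as $t$ grows, new minibatches perturb the running statistic less and less.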
We next describe the stochastic (online) VB inference for our model.
4.3.2 Stochastic Variational Inference
The batch VB inference (Section 4.2) uses the entire data for the parameter updates in each iteration, which can be computationally expensive and can also result in slow convergence. Stochastic variational inference (SVI), on the other hand, leverages ideas from stochastic optimization [9] and, in each iteration, uses a small, randomly chosen minibatch of the data to update the parameters. Data from the current minibatch is used to compute stochastic gradients of the variational objective w.r.t. each of the parameters, and these gradients are subsequently used in the parameter updates. For our model, the stochastic gradients depend on the sufficient statistics computed using the current minibatch $I_t$: $\hat{s}^{(k)}_r(j) = \sum_{\mathbf{i} \in I_t : i_k = j} y_{\mathbf{i}}\, \zeta_{\mathbf{i}r}$ and $\hat{s}_r = \sum_{\mathbf{i} \in I_t} y_{\mathbf{i}}\, \zeta_{\mathbf{i}r}$, where $\zeta_{\mathbf{i}r}$ is computed using Eq 10. Denoting by $B = |I_t|$ the minibatch size, we reweight these statistics by $N/B$ to compute the average sufficient statistics over the entire data [9] and update the variational parameters as follows:
(16) $\gamma^{(k),t}_{jr} = (1-\rho_t)\, \gamma^{(k),t-1}_{jr} + \rho_t \big(a^{(k)} + \tfrac{N}{B}\, \hat{s}^{(k)}_r(j)\big)$

(17) $\alpha^{t}_{r} = (1-\rho_t)\, \alpha^{t-1}_{r} + \rho_t \big(c\epsilon + \tfrac{N}{B}\, \hat{s}_r\big)$

(18) $\beta^{t}_{r} = (1-\rho_t)\, \beta^{t-1}_{r} + \rho_t \big(c(1-\epsilon) + g_r\big)$

(19) $a^{t}_{r} = (1-\rho_t)\, a^{t-1}_{r} + \rho_t \big(g_r + \tfrac{N}{B}\, \hat{s}_r\big)$

(20) $b^{t}_{r} = (1-\rho_t)\, b^{t-1}_{r} + \rho_t\, \langle p_r \rangle$

where $\gamma^{(k)}_{\cdot r}$, $(\alpha_r, \beta_r)$, and $(a_r, b_r)$ denote the variational parameters of the Dirichlet, beta, and gamma posteriors, respectively, and $\rho_t$ is defined as $\rho_t = (t_0 + t)^{-\kappa}$, with $\kappa \in (0.5, 1]$ needed to guarantee convergence [9].
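Each SVI update follows the same pattern: form a noisy full-data estimate of a conjugate variational parameter (prior term plus the $N/B$-rescaled minibatch statistic) and blend it with the previous value. A generic sketch (our own helper, with illustrative defaults `t0=1.0`, `kappa=0.7`):

```python
def svi_step(param_old, prior, s_hat, t, N, B, t0=1.0, kappa=0.7):
    """One stochastic-variational step on a conjugate variational parameter.
    s_hat: the minibatch sufficient statistic; prior: the prior term
    (e.g. a^(k), c*eps, or g_r); rho_t = (t0 + t)^(-kappa) is the step size."""
    rho = (t0 + t) ** (-kappa)
    noisy_estimate = prior + (N / B) * s_hat   # full-data estimate from the minibatch
    return (1.0 - rho) * param_old + rho * noisy_estimate
```

With $\rho_0 = 1$ the first step discards the initialization entirely; later steps take increasingly conservative, convex combinations, which is what the Robbins-Monro conditions on $\kappa$ guarantee to converge.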
4.3.3 Computational Complexity:
In contrast to the batch Gibbs and batch VB algorithms, both of which have $O(NRK)$ cost per iteration, the per-iteration cost of the online inference algorithms (CDF and SVI) is $O(|I_t| R K)$, where $|I_t|$ is the minibatch size at round $t$. We use a fixed minibatch size $B$ for each minibatch, so the per-iteration cost is $O(BRK)$.
5 Related Work
Although tensor factorization methods have received considerable attention recently, relatively little work exists on scalable analysis of massive count-valued tensor data. Most of the recently proposed methods for scalable tensor decomposition [13, 2, 10, 17] are based on minimizing the Frobenius norm of the tensor reconstruction error, which may not be suitable for count (or overdispersed count) data. The rank of the decomposition also needs to be pre-specified, or chosen via cross-validation. Moreover, these methods assume the tensor to be fully observed and thus cannot be used for tensor completion tasks. Another key difference between these methods and ours is that scaling up these methods requires parallel or distributed computing infrastructure, whereas our fully Bayesian method exhibits excellent scalability on a single machine. At the same time, the simplicity of the inference update equations would allow our model to be easily parallelized or distributed. We leave this possibility to future work.
One of the first attempts to explicitly handle count data in the context of nonnegative tensor factorization is the work of [4], which is now part of the Tensor Toolbox (http://www.sandia.gov/~tgkolda/TensorToolbox/index2.6.html). This method optimizes the Poisson likelihood, using an alternating Poisson regression subroutine, with nonnegativity constraints on the factor matrices. However, it requires the rank of the decomposition to be specified, and cannot handle missing data. Due to its inability to handle missing data, for our experiments (Section 6) we implement and use, as a baseline, a Bayesian version of this model that can handle missing data.
Among other works on tensor factorization for count data, the method in [1] can deal with missing values, though the rank still needs to be specified; moreover, the factor matrices are assumed to be real-valued, which limits the interpretability of the inferred factors.
In addition to the Poisson nonnegative tensor factorization method of [4], some other nonnegative tensor factorization methods [21, 5, 20] also provide interpretability for the factor matrices. However, these methods usually have one or more of the following limitations: (1) there is no explicit generative model for the count data, (2) the rank needs to be specified, and (3) the methods do not scale to the massive tensor data sets of scales considered in this work.
Methods that facilitate a fully Bayesian analysis of massive count-valued tensors, which are becoming increasingly prevalent nowadays, are even fewer. A recent attempt at Bayesian analysis of count data using a Poisson likelihood is [19]; however, unlike our model, their method cannot infer the rank and relies on batch VB inference, limiting its scaling behavior. Moreover, the Poisson likelihood may not be suitable for overdispersed counts.
Finally, inferring the rank of a tensor, which is NP-complete in general [14], is another problem for which relatively little work exists. Recent attempts at inferring the rank in the context of CP decomposition include [18, 22]; however, (1) these methods are not applicable to count data, and (2) the inferred factor matrices are real-valued, lacking the type of interpretability needed in many applications.
Our framework is similar in spirit to the matrix factorization setting proposed in [24] which turns out to be a special case of our framework. In addition, while [24] only developed (batch) Gibbs sampling based inference, we present both Gibbs sampling as well as variational Bayesian inference, and design efficient online Bayesian inference methods to scale up our framework for handling massive realworld tensor data.
To summarize, in contrast to the existing methods for analyzing tensors, our fully Bayesian framework, based on a proper generative model, provides a flexible method for analyzing massive countvalued tensors, sidestepping crucial issues such as rankspecification, providing good interpretability of the latent factors, while still being scalable for analyzing massive realworld tensors via online Bayesian inference.
6 Experiments
We apply the proposed model on one synthetic and three real-world data sets that range in size from moderate to massive. The real-world tensor data sets used in our experiments are from diverse application domains, such as analyzing country-country interaction data in political science, topic modeling on multiway publications data (with entities being authors, words, and publication venues), and analysis of massive household transactions data. These data sets include:

Synthetic Data: This is a tensor generated using our model by setting an upper bound on the number of factors; only 20 factors were significant (based on the values of the $\lambda_r$'s), resulting in an effective rank of 20.

Political Science Data (GDELT): This is a real-world four-way tensor of country-country interactions. The data consist of 220 countries and 20 action types, with interactions dating back to 1979 [16]. We focus on a subset of this data, resulting in a tensor of size 220 × 220 × 20 × 52. Section 6.4 provides further details.

Publications Data: This is a count-valued tensor, constructed from a database of research papers published by researchers at Duke University (obtained from https://scholars.duke.edu/); the three tensor modes correspond to authors, words, and venues. Section 6.3 provides further details.

Transactions (Food) Data: This is a count-valued tensor, constructed from a database of transactions data for food item purchases at various stores in the US (data provided by the United States Department of Agriculture (USDA) under a Third Party Agreement with Information Resources, Inc. (IRI)); the three tensor modes correspond to households, stores, and items. Section 6.5 provides further details.
We compared our model with the following baselines: (i) Bayesian Poisson Tensor Factorization (BayesPTF), a fully Bayesian version of the Poisson tensor factorization model proposed in [4], and (ii) Nonnegative Tensor Decomposition based on Low-rank Approximation (lraNTD), proposed in [23]. All experiments were done on a standard desktop computer with an Intel i7 3.4GHz processor and 24GB RAM.
6.1 Inferring the Rank
To begin with, as a sanity check for our model, we first perform an experiment on the synthetic data described above to see how well the model can recover the true rank (tensor completion results are presented separately in Section 6.2). For this experiment, we run the batch Gibbs sampler (the other inference methods yield similar results) with 1000 burn-in iterations and 1000 collection samples. We experiment with three settings, using 20%, 50%, and 80% of the data for training. The empirical distribution (estimated using the collected MCMC samples) of the effective inferred rank for each of these settings is shown in Figure 1 (left). In each collection iteration, the effective rank is computed after a simple thresholding of the $\lambda_r$'s, where components with very small $\lambda_r$ are not counted (also see Figure 1 (right)). With 80% training data, the distribution shows a distinct peak at 20, and even with smaller amounts of training data (20% and 50%), the inferred rank is fairly close to the ground truth of 20. In Figure 1 (right), we show the spectrum of all the $\lambda_r$'s, comparing the ground truth vs. the inferred values.
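The effective-rank computation described above reduces to a one-line thresholding of the inferred weights. A minimal sketch (the threshold value here is illustrative, not the one used in the experiments):

```python
import numpy as np

def effective_rank(lam, tol=1e-2):
    """Count components whose weight lambda_r exceeds a small fraction of the
    largest weight; components with very small lambda_r are not counted."""
    lam = np.asarray(lam, dtype=float)
    return int((lam > tol * lam.max()).sum())
```

Applied per collection sample, this yields the empirical distribution over effective ranks shown in Figure 1 (left).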
6.2 Tensor Completion Results
We next experiment on the task of tensor completion, where for each method 95% of the data is used for training and the remaining 5% is used as the held-out set (note that the data sets we use are extremely sparse in nature, with considerably less than 1% of the tensor entries actually observed). The results are reported in Table 1, where we show the log-likelihood and the mean-absolute error (MAE) in predicting the held-out data. A timing comparison of the various batch and online inference methods is presented separately in Section 6.6.
For this experiment, we compare our BNB-CP model (using the various inference methods) with (1) BayesPTF, a fully Bayesian variant (which we implemented ourselves) of the state-of-the-art Poisson tensor factorization model originally proposed in [4] (which cannot, however, handle missing data), and (2) lraNTD [23], an optimization-based nonnegative tensor decomposition method. As Table 1 shows, our methods achieve better log-likelihood and MAE than these baselines. Moreover, among our batch and online Bayesian inference methods, the online methods give competitive or better results than their batch counterparts. In particular, the online MCMC method based on conditional density filtering (BNB-CP-CDF) works the best across all the methods (see Section 6.6 for a timing comparison).
Table 1: Held-out log-likelihood (left four columns) and mean-absolute error (right four columns).

Method | Toy data | GDELT | Publication | Food | Toy data | GDELT | Publication | Food
BayesPTF | -107563 | -4425695 | -860808 | -2425433 | 1.012 | 55.478 | 1.636 | 1.468
lraNTD | N/A | N/A | N/A | N/A | 1.019 | 65.049 | N/A | N/A
BNB-CP-Gibbs | -97580 | -3079883 | -619258 | -2512112 | 0.989 | 45.436 | 1.565 | 1.459
BNB-CP-VB | -99381 | -2971769 | -632224 | -2533086 | 0.993 | 43.485 | 1.574 | 1.472
BNB-CP-CDF | -95472 | -2947309 | -597817 | -2403094 | 0.985 | 44.243 | 1.555 | 1.423
BNB-CP-OnlineVB | -98446 | -3169335 | -660068 | -2518996 | 0.989 | 46.188 | 1.601 | 1.461
6.3 Analyzing Publications Database
The next experiment is on a three-way tensor constructed from a scientific publications database. The data consist of abstracts from papers published by various researchers at Duke University. In addition to the paper abstract, the venue information for each paper is also available. The data collection contains 2425 authors, 9088 words (after removing stopwords), and 4068 venues, which results in a 2425 × 9088 × 4068 word-counts tensor, on which we run our model. As the output of the tensor decomposition, we get three factor matrices. Since the latent factors in our model are nonnegative and sum to one, each latent factor can also be interpreted as a distribution over authors/words/venues, and consequently represents a "topic". Therefore, the three factor matrices inferred by our model for this data correspond to authors × topics, words × topics, and venues × topics, which we use to further analyze the data.
We apply the BNB-CP-CDF model on this data and, using the inferred words × topics matrix, in Table 2 (left) we show the lists of the 10 most probable words in four factors/topics that appear to correspond to optics, genomics, machine learning & signal processing, and statistics. To show the topic representation across different departments, we examine the histogram of departmental affiliations of the 20 authors with the highest probabilities in each of these four factors. We find that, for the genomics factor, the top authors (based on their topic scores) have affiliations related to biology, which makes intuitive sense. Likewise, for the statistics factor, most of the top authors are from the statistics and biostatistics departments. The top 20 authors in the factors corresponding to optics and machine learning & signal processing, on the other hand, are mostly from the departments of electrical and computer engineering and computer science.
Table 2: Ten most probable words in four inferred topics (left four columns), and the top venues in the ML/SP topic (rightmost column).

Optics | Genomics | ML/SP | Stats | Top Venues in ML/SP
gigapixel | gene | dictionary | model | ICASSP
microcamera | chromatin | sparsity | priors | IEEE trans. sig. proc.
cameras | occupancy | model | bayesian | ICML
aperture | centromere | bayesian | lasso | Siam J. img. sci.
lens | transcription | compressed | latent | IEEE trans. img. proc.
multiscale | genome | compressive | inference | IEEE int. symp. biomed. img.
optical | sites | matrix | regression | NIPS
system | expression | denoising | sampler | IEEE trans. wireless comm.
nanoprobes | sequence | gibbs | semiparametric | IEEE workshop stat. sig. proc.
metamaterial | vegfa | noise | nonparametric | IEEE trans. inf. theory
Similarly, using the inferred venues × topics matrix, we list the most likely venues for each topic. Due to space limitations, here we only present the most likely venues in the machine learning & signal processing factor/topic; the result is shown in Table 2 (rightmost column). The result shows that venues like ICASSP, IEEE Trans. Signal Proc., ICML, and NIPS all rank at the top in the machine learning & signal processing factor, which again makes intuitive sense.
6.4 Analyzing Political Science Data
We use the model to analyze a real-world political science data set consisting of country-country interactions. Such analyses are typically done by political scientists to study, analyze, and understand complex international multilateral relations among countries. The data set is from the Global Database of Events, Location, and Tone (GDELT) [16]. GDELT records the dyadic interactions between countries in the form of "Country A did something to Country B". In our experiments, we consider 220 countries ("actors") and 20 unique high-level action types over the 52 weeks of year 2012. After preprocessing, we have a four-way (country × country × action × time) action-counts tensor of size 220 × 220 × 20 × 52. Note that both the first and second tensor modes represent countries: the first mode as the "sender" and the second mode as the "receiver" of a particular action. In this analysis, we set $R$ to be large enough (200), and the model discovered roughly 120 active components (i.e., components with a significant value of $\lambda_r$).
We apply the model (BNB-CP-CDF; the other methods yield similar results) and examine each of the time-dimension factors, specifically looking for significant components (based on the magnitude of $\lambda_r$) in which the time-dimension factor also peaks during certain time(s) of the year. We show results for two such factors in Figure 3. In Figure 3 (column 1), the time and country (actor) factors seem to suggest that this factor/topic corresponds to the event "Julian Assange". The actor subplot shows spikes at Ecuador, United Kingdom, United States, and Sweden, whereas the time factor in the bottom-left subplot shows spikes between June and August. The times and countries involved are consistent with the public knowledge of the event of Julian Assange seeking refuge in Ecuador.
Likewise, in Figure 3 (column 2), the time and country (actor) factors seem to suggest that this factor corresponds to the event "Benghazi Attack", which took place on Sept. 12 (week 37) of 2012, in which Islamic militants attacked the American diplomatic compound in Benghazi, Libya, killing a US Ambassador. As the figure shows, the top actors identified are the US, Libya, and Egypt, and spikes are found around weeks 37 and 38, consistent with the public knowledge of this event.
The results of these analyses demonstrate that the interpretability of our model can be useful for identifying events or topics in such multiway interaction data.
6.5 Analyzing Transactions Data
We next apply our model (BNB-CP-CDF; the other methods yield similar results) to analyzing transactions data for food item purchases made at stores. Our data are collected for a demographically representative sample of US consumers who reside in large urban and suburban areas and purchase food in supermarkets and grocery stores. The data were provided by the USDA under a Third Party Agreement with IRI. Each transaction is identified by a unique Universal Product Code (UPC) barcode and the store where the transaction occurred. Some items, such as fresh produce, do not have UPCs and are identified separately. The households are observed over a four-year period, during which they are provided with a technology that allows them to scan each purchase and record additional information such as the store where the purchase was made (and other economic data). Participating households are provided with incentives designed to encourage compliance. For each household-product-store combination we record the number of unique purchases over the sampling period. The database has a total of 117,054 unique households, 438 stores, and 67,095 unique items, and we construct a three-way count tensor of size 117,054 × 438 × 67,095 with about 6.2 million nonzero entries.
We apply the proposed model on this data by setting a sufficiently large upper bound on the number of factors (out of which about 60 components were inferred to have a significant value of $\lambda_r$) and look at the stores factor matrix. Since each column (which sums to 1) of the stores factor matrix can be thought of as a distribution over the stores, we look at three of the factors from the stores factor matrix and identify the stores that rank at the top in each factor. In Table 3, we show results from each of these factors. Factor 1 seems to be about the most popular stores (including Walmart, for example), Factor 2 has stores that primarily deal in wholesale (e.g., Costco, Sam's Wholesale Club), and Factor 3 contains stores that sell no or very few food items (e.g., Mobil, Petco). Note that Walmart Super Center figures prominently in both Factor 1 and Factor 2.
Table 3: Top stores in three factors from the stores factor matrix.

Factor 1 | Factor 2 | Factor 3
Walmart Sup. Center | Sam's Club | Dick's Sporting
Walmart Traders | Meijer | Mobil
Walmart Neighb. | Costco | Petco
Walmart | B J'S Wholesale | Sally Beauty
Kroger | Walmart Sup. Center | GNC All
We next look at the items factor matrix. In Figure 2, we plot the inferred distribution over items for each of the three clusters described above. For factors 1 and 2 (which correspond to the most popular stores and the wholesale stores, respectively), the distributions over the items (top and bottom panels in Figure 2) have a reasonably significant mass over a certain range of items (the items indexed towards the left side in the plots for factors 1 and 2). On the other hand, for factor 3, which corresponds to stores that sell no or very few types of food items, the distribution over the items is rather flat and diffuse, with very weak intensities (note the scale on the y-axis). From Figure 2, it is also interesting to observe that the sets of active items in factors 1 & 2 vs. factor 3 appear to be mostly disjoint.
This analysis provides a first attempt at analyzing food shopping patterns of American consumers on a large scale. As the world at large struggles with a combination of increasing obesity rates and food insecurity, this analysis shows that consumer preferences are densely clustered across both stores and items. This indicates that households tend to have fairly rigid preferences over the stores where they shop. Furthermore, they tend to consume a relatively small number of products from the universe of available products. The concentration in both stores and products is indicative of limited search behavior and substantial behavioral rigidity, which may be associated with suboptimal outcomes in terms of nutrition and health.
6.6 Scalability
We now perform an experiment comparing the proposed inference methods (batch and online) to assess their scalability (Figure 5). We first use the Transactions data for this experiment. We note that state-of-the-art methods for count-valued tensors, such as the Poisson Tensor Factorization (PTF) method from the Tensor Toolbox [4], are simply infeasible to run on this data because of a storage explosion: the method requires expensive flattening operations on the tensor. The other baseline used in our experiments, lraNTD [23], was also infeasible to run on this data. We set the same number of factors for each method (about 60 factors were found to be significant, based on the inferred weights) and use a mini-batch size of 100,000 for all the online inference methods. For conditional density filtering and stochastic variational inference, we use a decaying learning-rate schedule. Figure 5 shows that the online inference methods (conditional density filtering and stochastic variational inference) converge to a good solution much faster than the batch methods. This experiment shows that our online inference methods can be computationally viable alternatives when their batch counterparts are slow or infeasible to run on such data.
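The specific learning-rate values were elided in the text; stochastic variational inference [9] (and, similarly, conditional density filtering [7]) typically uses a Robbins-Monro step-size schedule of the standard form below. The parameter values shown are illustrative defaults, not the ones used in the experiments.

```python
def robbins_monro_rate(t, tau0=1.0, kappa=0.7):
    """Step size rho_t = (tau0 + t)^(-kappa).

    With kappa in (0.5, 1], the sequence satisfies the Robbins-Monro
    conditions (sum rho_t = inf, sum rho_t^2 < inf) required for
    convergence of stochastic approximation updates.
    """
    return (tau0 + t) ** (-kappa)

# Schematic use in a stochastic natural-gradient update on a global
# variational parameter, given a noisy mini-batch estimate:
#   rho = robbins_monro_rate(t)
#   global_param = (1 - rho) * global_param + rho * minibatch_estimate
```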
We then perform another experiment on the Scholars data, on which the PTF method of [4] was feasible to run, and compare its per-iteration running time with that of our model (using both batch and online inference). Since PTF cannot handle missing data, each method was run with all the data in this experiment. As Figure 6 shows, our methods have considerably smaller running times than PTF.
7 Conclusion
We have presented a fully Bayesian framework for analyzing massive count-valued tensors and designed a suite of scalable inference algorithms for handling such data. In addition to giving interpretable results and inferring the rank from the data, the proposed model can infer the distribution over objects in each of the tensor modes, which can be useful for understanding groups of similar objects and for other types of qualitative analyses on such data, as shown by our various experiments on real-world data sets. The simplicity of the inference procedure also makes the proposed model amenable to parallel and distributed implementations, e.g., using MapReduce or Hadoop. The model can be a useful tool for analyzing data from diverse applications, and its scalability opens the door to applying scalable Bayesian methods to massive multiway count data.
Acknowledgments
The research reported here was supported in part by ARO, DARPA, DOE, NGA and ONR. Any opinions, findings, recommendations, or conclusions are those of the authors and do not necessarily reflect the views of the Economic Research Service, U.S. Department of Agriculture. The analysis, findings, and conclusions expressed in this paper also should not be attributed to either Nielsen or Information Resources, Inc. (IRI). This research was conducted in collaboration with USDA under a Third Party Agreement with IRI.
References
 [1] Bazerque, J. A., Mateos, G., and Giannakis, G. B. Inference of Poisson count processes using low-rank tensor data. In ICASSP (2013).
 [2] Beutel, A., Kumar, A., Papalexakis, E. E., Talukdar, P. P., Faloutsos, C., and Xing, E. P. FlexiFaCT: Scalable flexible factorization of coupled tensors on Hadoop. In SDM (2014).
 [3] Cappé, O., and Moulines, E. Online expectation–maximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71, 3 (2009), 593–613.
 [4] Chi, E. C., and Kolda, T. G. On tensors, sparsity, and nonnegative factorizations. SIAM Journal on Matrix Analysis and Applications 33, 4 (2012), 1272–1299.
 [5] Cichocki, A., Zdunek, R., Phan, A. H., and Amari, S. Nonnegative matrix and tensor factorizations: applications to exploratory multiway data analysis and blind source separation. John Wiley & Sons, 2009.
 [6] Dunson, D. B., and Herring, A. H. Bayesian latent variable models for mixed discrete outcomes. Biostatistics 6, 1 (2005), 11–25.
 [7] Guhaniyogi, R., Qamar, S., and Dunson, D. B. Bayesian conditional density filtering. arXiv preprint arXiv:1401.3632 (2014).
 [8] Heinrich, G., and Goesele, M. Variational Bayes for generic topic models. In KI 2009: Advances in Artificial Intelligence. Springer, 2009, pp. 161–168.
 [9] Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. The Journal of Machine Learning Research 14, 1 (2013), 1303–1347.
 [10] Inah, J., Papalexakis, E. E., Kang, U., and Faloutsos, C. HaTen2: Billion-scale tensor decompositions. In ICDE (2015).
 [11] Johndrow, J. E., Bhattacharya, A., and Dunson, D. B. Tensor decompositions and sparse log-linear models. arXiv preprint arXiv:1404.0396 (2014).
 [12] Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine learning 37, 2 (1999), 183–233.
 [13] Kang, U., Papalexakis, E., Harpale, A., and Faloutsos, C. GigaTensor: Scaling tensor analysis up by 100 times - algorithms and discoveries. In KDD (2012).
 [14] Kolda, T. G., and Bader, B. W. Tensor decompositions and applications. SIAM review 51, 3 (2009), 455–500.
 [15] Kozubowski, T. J., and Podgórski, K. Distributional properties of the negative binomial Lévy process. Mathematical Statistics, Centre for Mathematical Sciences, Faculty of Engineering, Lund University, 2008.
 [16] Leetaru, K., and Schrodt, P. A. GDELT: Global data on events, location, and tone, 1979–2012. In ISA Annual Convention (2013), vol. 2, p. 4.
 [17] Papalexakis, E., Faloutsos, C., and Sidiropoulos, N. Parcube: Sparse parallelizable candecompparafac tensor decompositions. ACM Transactions on Knowledge Discovery from Data (2015).
 [18] Rai, P., Wang, Y., Guo, S., Chen, G., Dunson, D., and Carin, L. Scalable Bayesian low-rank decomposition of incomplete multiway tensors. In ICML (2014).
 [19] Schein, A., Paisley, J., Blei, D. M., and Wallach, H. Inferring polyadic events with Poisson tensor factorization. In NIPS Workshop (2014).
 [20] Schmidt, M., and Mohamed, S. Probabilistic non-negative tensor factorisation using Markov chain Monte Carlo. In 17th European Signal Processing Conference (2009).
 [21] Shashua, A., and Hazan, T. Non-negative tensor factorization with applications to statistics and computer vision. In ICML (2005).
 [22] Zhao, Q., Zhang, L., and Cichocki, A. Bayesian CP factorization of incomplete tensors with automatic rank determination. IEEE Transactions on Pattern Analysis and Machine Intelligence (2015).
 [23] Zhou, G., Cichocki, A., and Xie, S. Fast nonnegative matrix/tensor factorization based on low-rank approximation. IEEE Transactions on Signal Processing 60, 6 (2012), 2928–2940.
 [24] Zhou, M., Hannah, L. A., Dunson, D., and Carin, L. Beta-negative binomial process and Poisson factor analysis. In AISTATS (2012).