We study the problem of learning latent feature models (LFMs) for tensor data commonly observed in science and engineering such as hyperspectral imagery. However, the problem is challenging not only due to the non-convex formulation, the combinatorial nature of the constraints in LFMs, but also the high-order correlations in the data. In this work, we formulate a tensor latent feature learning problem by representing the data as a mixture of high-order latent features and binary codes, which are memory efficient and easy to interpret. To make the learning tractable, we propose a novel optimization procedure, Binary Matching Pursuit (BMP), that iteratively searches for binary bases via a MAXCUT-like boolean quadratic solver. Such procedure is guaranteed to achieve an -suboptimal solution in greedy steps, resulting in a trade-off between accuracy and sparsity. When evaluated on both synthetic and real datasets, our experiments show superior performance over baseline methods.
Learning Tensor Latent Features
Sung-En Chang Xun Zheng Ian E.H. Yen Pradeep Ravikumar Rose Yu
Northeastern University Carnegie Mellon University Snap Inc. Carnegie Mellon University Northeastern University
Latent feature models [ghahramani2006infinite] (LFMs) are a family of unsupervised models that describe an observation as a combination of continuous valued latent features and binary valued sparse codes:
Thus, for a given observation , its corresponding sparse code indicates the presence or absence of latent features corresponding to the rows of . Latent feature models have been widely studied in the context of blind signal separation [van1997analytical], overlapping clustering [banerjee2005model] and modeling natural image patches [melchior2017gaussian]. Compared to the classic latent factor models [koren2009matrix], latent feature models have two main benefits: (1) interpretablity: the binary codes directly reveal whether certain features exist in the data, thus provide more interpretable latent profiles [valera2017general]; (2) scalability: compared with real-valued codes, binary codes require fewer bits to store, thereby cutting down the memory footprint, making it easier to deploy into memory constrained environments such as mobile devices.
Tensor latent feature models generalize traditional matrix latent feature models to represent high-order correlation structures in the data. For example, in spatiotemporal recommender systems, the observations are user activities over different locations and time. We want to learn the latent features and codes that correspond to user, space and time simultaneously without assuming conditional independence of these three dimensions. In this case, we can first represent such data as a high-order tensor and assign binary codes encoding presence or absence of rows for individual modes of the tensor. These codes can then help us answer the “when” and “where” questions regarding the learned user preferences.
Besides the non-convex formulation of the maximum likelihood estimation (MLE) objective, learning latent feature models is further complicated by the combinatorial nature of the codes. Inference of LFM from data in generally intractable. [tung2014spectral, anandkumar2014tensor, jaffe2018learning] propose spectral methods to bypass the issue of MLE objective. Although spectral methods provide global guarantees for the estimator, they usually have high sample complexity. Most recently, [yen2017latent] proposed a solver based on convex relaxation that achieves linear sample complexity w.r.t number of atoms and dimensions, but is only limited to matrix LFMs. To learn a tensor model with integer components, [perros2018sustain] assumes all the components are integral and propose to use hierarchical alternating least square (ALS) to fit the integral codes for each mode iteratively. However, ALS is numerically unstable and susceptible to local optimum solutions [cichocki2007regularized], and furthermore lacks theoretical guarantees.
In this paper, we propose an extension to latent feature models: Tensor Latent Feature Models for high-order data, which represent the data as a combination of high-order features and a set of binary codes. We first formulate a non-convex optimization problem corresponding to a MLE estimator. We then propose a novel optimization algorithm based on greedy matching pursuit to iteratively select atoms of steepest descent, by reduction the atom search problem to a MAXCUT problem amenable to efficient boolean quadratic solver. In summary, our contributions include:
We study latent variable models for high-order data and formulate a tensor latent feature learning as a problem of convex objective with atomic sparsity constraint.
Then we design a fast optimization algorithm based on greedy matching pursuit and MAXCUT-like Boolean quadratic subproblems to solve the problem efficiently with approximation guarantee.
We theoretically analyze our algorithm and show that, to approximate an optimal solution of atoms, the algorithm achieves -suboptimal loss with number of atoms.
We experiment exhaustively on synthetic and real-world benchmark datasets and observe superior performance for denoising and recovery tasks.
Across the paper, we use calligraphy font for tensors, such as , bold uppercase letters for matrices, such as , and bold lowercase letters for vectors, such as . For an -mode tensor , a generalized tensor unfolding w.r.t. a mode subset results in a matrix where the cyclic concatenation of modes is treated as the first dimension and the cyclic concatenation of the rest are treated as the second dimension. Here, and in the sequel, we use a slight abuse of notation and use to indicate , and correspondingly for . The reverse operation is defined similarly. The indexing follows the convention in [kolda2009tensor].
2 Related Work
Latent feature models.
Our work is related to estimating latent feature models. For the non-parametric setting with Indian Buffet Process(IBP) prior, many [ghahramani2006infinite, doshi2009accelerated, broderick2013mad] have studied estimation procedures based on Markov Chain Monte Carlo (MCMC) [doshi2009accelerated] or variational inference [doshi2009variational]. For the parametric version of the model, [tung2014spectral, anandkumar2014tensor, jaffe2018learning] propose to use spectral methods to estimate the moments of the distribution at different orders, but can suffer from high sample complexity. Under certain identifiability condition, [slawski2013matrix] proposed a convex optimization algorithm by selecting a maximal affine independent subset. However, the selection process in their algorithm has an exponential computational complexity. Perhaps the work that is most related to ours is [yen2017latent] in which a convex estimator for matrix latent feature models which under certain identifiability conditions achieves a linear sample complexity.
Tensor decomposition has been the subject of extensive study; please see the review paper by [kolda2009tensor] and references therein. Most tensor factorization work focuses on extracting high-order structure with real-value components. Due to the non-convex nature of low-rank tensors, [gandy2011tensor, tomioka2013convex] propose to use n-mode nuclear norm as a convex surrogate and solve the problem using alternating direction method of multipliers (ADMM). Solving convex surrogates have global guarantees but can suffer from high computational costs. [bahadori2014fast] designed a non-convex solver based on greedy matching pursuit and demonstrate significant speedup. There has also been work on binary tensor decomposition where the input tensor has binary values [miettinen2011boolean, rai2015scalable], which is different from our problem where the learned components are binary. [perros2018sustain] have a similar formulation of the problem but propose a heuristic based on alternating least squares. To the best of our knowledge, our work is the first algorithm for learning tensor decomposition with binary components with theoretical guarantees.
3 Learning Tensor Latent Features
In this section, we first review the background of matrix LFMs and describe their extension to tensors, aka tensor latent feature models. We then formulate the tensor LFM learning problem as sparsity constrained convex minimization on an atomic set.
3.1 Matrix Latent Feature Model
In the regime of latent variable modeling, popular choices include mixture models which assume each data sample is represented by a single latent class. Latent feature models [ghahramani2006infinite] generalizes mixture models by assuming each sample to come from a combination of latent features, resulting the following data generation process:
where is a set of latent features, is a latent binary vector that indicates the presence or absence of each feature. is the unknown noise vector with mean zero. LFMs are used to model matrix data and assume linear correlations among features.
3.2 Tensor Latent Feature Model
To model tensor data with high-order non-linear correlations, we generalize LFMs to tensor latent feature models. In particular, let be a general -mode data tensor, and be a subset of the modes. Let be a set of different partitions of the modes , i.e. a collection of pairs specified by the user. This allows us to choose the modes where the high-order correlations exist.
Tensor latent feature models assume each observation to be represented by a set of latent tensor features and a binary vector of codes indicating the presence of the features. Specifically, we impose the following latent feature structure to the tensor:
where is the unknown noise tensor whose elements are drawn i.i.d from a zero mean distribution. For each mode subset , the corresponding basis tensor is a folded rank-1 matrix consists of a binary code vector and a real-valued feature vector .
The imposed structure in Eqn. 2 generalizes the matrix LFM model not only by extending from a 2-dimensional array to the multi-dimensional setting, but also by allowing all specified partitions of modes that can be explained by binary combination of features. One can also draw an analogy from the dirty statistical model [yang2013dirty], which aims to learn multiple different correlation structures of the data simultaneously, e.g. sparsity and low-rankness. In the tensor latent feature model, each partition of the modes can be seen as a clean statistical model that promotes a particular low-rank structure, and the combination of such partitions is leads to a dirty statistical model.
3.3 Reformulating via Atomic Set
The estimation (learning) problem of tensor latent feature models in Eqn. 2 is generally intractable: searching for latent features and binary codes leads to a hard non-convex optimization problem with combinatorial constraints. In addition, general mixed integer program solvers are not applicable due to the real-valued features . To make the learning tractable, we reformulate the regularized MLE estimation problem as sparsity constrained convex minimization using an atomic set.
To begin with, let us define a set of atoms given a mode subset :
where is a unit sphere in , hence we use instead of to make this change explicit. Then the atom sets for all the partitions is
Now every real-valued features is a unit vector, and we can associate each “unit” basis tensor with a real-valued coefficient :
For the tensor latent feature learning problem, given an observation generated from the model (2), we would like to find a most compact representation of in the space of . We can then write the learning problem as the following convex minimization problem with non-convex sparsity constraints over the atom set:
where is the tensor Frobenius norm, and is the desired number of atoms from the atom set .
4 Binary Matching Pursuit (BMP)
Solving problem in Eqn. (4) is still difficult as the atomic set contains an immensely large number of atoms. In this section, we introduce Binary Matching Pursuit (BMP), a binary variant of the classic matching pursuit algorithm with an efficient MAXCUT-like Boolean quadratic solver as a subroutine. The algorithms are outlined in Algorithm 1 and Algorithm 2.
4.1 Greedy Matching Pursuit
Matching pursuit [mallat1993matching, tropp2006algorithms] is a sparse approximation algorithm that aims to find the “best match” of the data onto a span of basis. Similar to matching pursuit, we first perform a linearization of the objective. In particular, at iteration , we greedily find a basis that corresponds to the steepest gradient descent direction.
Since we know the partition set , the greedy step is in fact solving a nested minimization problem:
The difficulty in solving (6) lies in the fact that the inner optimization problem over the atom set involves a combinatorially large number of atoms. A key contribution of this work is a novel procedure based on a MAXCUT like Boolean quadratic solver to tackle this problem efficiently.
For ease of illustration, assume that is already unfolded into a matrix, where and . Then each inner minimization problem in Eqn. (6) can be written as
Since the feature vector lies in the space of real-valued unit vectors, when fixing and minimizing w.r.t. , we have
Therefore, the joint minimization w.r.t. is equivalent to solving
which can be solved using a Boolean quadratic solver efficiently, as detailed in the next section.
4.2 Boolean Quadratic Solver via MAXCUT
With change of variables, it has been shown that the problem in Eqn 4.1 is a MAXCUT-like problem [yen2017latent], which has an efficient relaxation with constant approximation guarantee.
Specifically, define a vector , then . Augmented with another dummy variable , the problem can be rewritten as
which is now in a MAXCUT-like formulation. In general, even if the quadratic factor is positive definite, i.e., , the decision version of the problem is still NP-complete [garey1979computers]. However, there exist semidefinite programming (SDP) relaxations that has constant factor approximation guarantees [nesterov1997quality]:
Rounding of the solution to the above SDP problem guarantees a -approximation [nesterov1997quality, yen2017latent].
Indeed, although the polynomial time complexity of SDPs is already a big saving compared to NP-completeness, it is still impractical to employ a general SDP solver as the subroutine. Fortunately, unlike general SDPs, the SDP of the form (10) has specialized solver [wang2017mixing], whose time complexity only depends linearly on the number of non-zeros in .
4.3 Weight Adjustment
After obtaining an atom using the Boolean quadratic solver, we can add the atom to the current atom set that maintains active atoms considered so far. The next step is to adjust the atom weights to reflect changes in the atom set. Let be the number of active atoms at iteration . Fixing the atom set , we can rewrite (4) in terms of as:
which is a simple least-square problem with a closed-form solution:
where is a , each column of which is a flattened atom in the active set .
5 Convergence Analysis
In this section, we show that the greedy algorithm proposed in section 4 leads to an accuracy-sparsity trade-off: by running the algorithm for iterations, one can achieve an -suboptimal loss relative to the optimal solution of (4) with number of atoms.
A key ingredient of our analysis is the constant-approximation guarantee in the greedy step given by the SDP-based MAXCUT solver, where we have the following guarantee for the atom picked by Algorithm 2:
The definition of atomic norm [chandrasekaran2012convex] will be useful in the analysis:
By the fully-corrective weight adjustment (11), for each iteration we have
since . Let be the Lipschitz-continuous parameter of w.r.t. the atomic norm . We have
for any , and therefore
Note since for any , we also have
where the first inequality is due to , and the second inequality is from convexity. Let . Combining the above inequality with (16), we have
The recurrence then leads to the convergence result. ∎
Consider an optimal solution with support of size
Based on Theorem 5.1, we develop a bound on the optimization error directly in terms of the ratio where is the number of atoms selected by our algorithm.
Let be the strongly-convex parameter of the function . Then the loss obtained from running iterates of Algorithm 1 satisfies
We evaluate our method on two benchmark tasks for latent feature modeling in the context of tensors: tensor denoising and tensor recovery. For denoising, we assume that the observations are contaminated by element-wise Gaussian noise, and the task is to filter out the noise. For recovery, the data have randomly missing values, and the task is to recover these missing entries solely based on observed ones.
SUSTain [perros2018sustain] solves the following integer tensor factorization problem for observation :
where denotes outer product, is the set of nonnegative integers, and is the set of nonnegative integers up to . We use a publicly available implementation 111https://github.com/kperros/SUSTain/ and the default parameter settings provided in the repository. For our experiments, we set and run SUSTain to generate binary components. We note that with this setting, SUSTain constrains all the tensor components to be binary, which is a stronger assumption than ours. Therefore the results presented in this experiment is not necessarily indicative of SUSTain’s performance on general integer tensor factorization problems.
We also compare with matrix latent feature models (LFMs) in order to demonstrate the benefit of exploiting high-order structures. Specifically, we unfold the input tensor w.r.t. each mode separately, resulting a set of unfolded matrices. For each mode, we have the input matrix where for each . We use LFM mode to denote applying LFM to the unfolding matrix of mode .
We use three synthetic and three real-world datasets. For synthetic dataset, we set , , and . The ground truth tensor with atoms is generated as follows. For each atom , let be the chosen mode. We generate a random binary vector for the mode , and generate two random real-valued vectors , for the remaining modes and . The -th atom is then formed by taking outer products of , , and in appropriate order. The final ground truth tensor is simply the summation of all atoms . We test on synthetic data sets with different number of atoms: , and call them Syn0, Syn1, and Syn2 respectively.
We use the following three real-world hyperspectral image datasets: Indian pines (Ind.)222http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes, JasperRidge (Jas.)333http://www.escience.cn/people/feiyunZHU/Dataset_GT.html, and Samson1 (Sam.)444http://www.escience.cn/people/feiyunZHU/Dataset_GT.html. The Indian Pines dataset was gathered over the Indian Pines test site in northwestern Indiana and consists of pixels and 224 spectral reflectance bands [PURR1947]. Jasper Ridge has pixels, each of which is recorded at 224 channels ranging from 380 nm to 2500 nm. Samson has pixels, each recorded at 156 channels covering the wavelengths ranging from 401 nm to 889 nm. Since this hyperspectral imagery is very large, we consider a sub-image of for Jasper Ridge and pixels for Samson, similar to [fyzhu_2014_JSTSP_RRLbS, fyzhu_2014_TIP_DgS_NMF, fyzhu_2014_IJPRS_SS_NMF]. The data sets statistics are summarized in the table below.
For both synthetic and real dataset, we set the partition as , where , , and .
6.1 Tensor Denoising
For the denoising experiments, we generate noisy observations from a ground truth tensor using
The goal is to filter out noise from the observed . For evaluation, we use root-mean-square error (RMSE) between the ground truth and the estimate :
The results are shown in Figure 1. The top row shows the comparison on three synthetic data sets where the true number of atoms are respectively. One can observe that BMP achieves much better RMSE across different numbers of atoms compared to other baselines. The improvement over LFM baselines suggests the advantage of modeling high-order correlations using tensors. The comparison with fully binary-valued SUSTain confirms the advantage of real-valued factors in tensor latent feature models. We also notice that as the number of atoms increases, the advantage of our method becomes more pronounced.
The bottom row displays the denoising results on the hyperspectral imagery. Although initially on a limited number of atoms BMP has a large RMSE, as the number of atoms grows, it quickly outperforms baseline methods. Also notice that for Indian Pines BMP achieves similar performance as LFM, whereas in Jasper Ridge and Samson dataset BMP significantly outperforms LFM.
6.2 Tensor Recovery
In the recovery task, we randomly remove 10, 25, and 40 percent of the entries from the ground truth tensor with the goal of repainting the missing entries from partial observations . We use RMSE between the ground truth and the estimate as above as the evaluation metric. Results using 10% missing entries are shown in Figure 2. On synthetic data (top row), BMP outperforms other baselines starting from 2 atoms. In the bottom row, even though BMP starts out similar to other methods, it achieves much better RMSE after 10 atoms. The trend is consistent in all settings where BMP outperforms other baselines, especially on the Samson dataset.
For a visual comparison, in Figure 3 we visualize the recovery results from channel 138 of the Samson1 dataset for different settings. In the first row, where 10% of the entries are missing, BMP can obtain a visually close estimate, whereas LFM only captures a rough picture of the data. In the second and third row, as more and more entries are missing, BMP can still robustly recover the selected channel, despite it not using real-valued factorization. On the other hand, LFM performs poorly, suggesting again the advantage of tensor latent feature model over its matrix counterparts.
In this paper, we generalize the problem of learning latent feature models from matrix to tensor data. We define a flexible tensor latent feature model and formulate a tractable latent feature learning problem. We design a fast optimization algorithm to search for the binary atoms iteratively using greedy matching pursuit and a MAXCUT solver. We theoretically analyze our algorithm and prove it can achieve an -suboptimal loss relative to the optimal solution in iterations. We experiment exhaustively on synthetic and real-world datasets and observe superior performance for both tensor denoising and recovery tasks.