A Unified Joint Matrix Factorization Framework for Data Integration

A Unified Joint Matrix Factorization Framework for Data Integration


Nonnegative matrix factorization (NMF) is a powerful tool in data exploratory analysis by discovering the hidden features and part-based patterns from high-dimensional data. NMF and its variants have been successfully applied into diverse fields such as pattern recognition, signal processing, data mining, bioinformatics and so on. Recently, NMF has been extended to analyze multiple matrices simultaneously. However, a unified framework is still lacking. In this paper, we introduce a sparse multiple relationship data regularized joint matrix factorization (JMF) framework and two adapted prediction models for pattern recognition and data integration. Next, we present four update algorithms to solve this framework. The merits and demerits of these algorithms are systematically explored. Furthermore, extensive computational experiments using both synthetic data and real data demonstrate the effectiveness of JMF framework and related algorithms on pattern recognition and data mining.

non-negative matrix factorization, joint matrix factorization, data integration, network-regularized constraint, pattern recognition, bioinformatics.

1 Introduction

Nonnegative matrix factorization (NMF) is a powerful matrix factorization technique which typically decomposes a nonnegative data matrix into the product of two low-rank nonnegative matrices [1, 2]. NMF was first introduced by Paatero and Tapper (1994) and has become an active area with much progress both in theory and in practice since the work by Lee and Seung (1999). NMF and its variants have been recognized as valuable exploratory analysis tools. They have been successfully applied into many fields including signal processing, data mining, pattern recognition, bioinformatics and so on [3, 4].

NMF has been shown to be able to generate sparse and part-based representation of data [2]. In other words, the factorization allows us to easily identify meaningful sub-structures underlying the data. In the past decade, a number of variants have been proposed by incorporating various kinds of regularized terms including discriminative constraints [5], network-regularized or locality-preserving constraints [6, 7], sparsity constraints [8, 9], orthogonality constraints [10] and others [11].

However, the typical NMF and its variants in its present form can only be applied to one matrix containing just one type of variables. Large amounts of multi-view data describing the same set of objects can be available now. Thus, data integration methods are urgently needed. Recently, joint matrix factorization based data integration methods have been proposed for pattern recognition and data mining among pairwise or multi-view data matrices. For example, Greene and Cunninghan proposed an integration model based on matrix factorization (IMF) to learn the embedded underlying clustering structures across multiple views. IMF is a late integration strategy, which fuses the clustering solutions of each individual view for further analysis [12]. Zhang et al. (2012) proposed a joint nonnegative matrix factorization (jNMF) to decompose a number of data matrices which share the same row dimension into a common basis matrix and different coefficient matrices , such that by minimizing [13]. This simultaneous factorization can not only detect the underlying part-based patterns in each matrix, but also reveal the potential connections between patterns of different matrices. A further network-regularized version has also been proposed and applied in bioinformatics [14]. Liu et al. (2013) proposed a multi-view clustering method, which factorizes individual matrices simultaneous and requires the coefficient matrices learnt from various views to be approximately common [15]. Specifically, it is defined as follows,

where is a parameter to tune the relative weight among different views as well the two terms. Zitnik and Zupan (2015) proposed a data fusion approach with penalized matrix tri-factorization (DFMF) for simultaneously factorizing multiple relationship matrices in one framework [16]. They also considered to incorporate the must-link and cannot-link constraints within each data type into the DFMF model as follows,

where , , , is the number of data sources for the th object type. represents the relationship data matrix between the th and the th object type (between constraint). DFMF decomposes it into , and constrained by (within constraint), which provides relations between objects of the th object type. This method well exploits the abstracted relationship data, but ignores the sample-specific information of data. In image science, Jing et al. (2012) adopted a supervised joint matrix factorization model to learn latent basis by factorizing both the region-image matrix and the annotation-image matrix simultaneously and incorporating the label information (where indicates the label index of the th image) [17]. This supervised model for image classification and annotation (SNMFCA) is formulated as follows,

where with if and 0 otherwise. Obviously, the SNMFCA aims to determine the latent basis with known class information. However, this model does not consider the the must-link and cannot-link constraints within each data type and those between data types.

Recently, based on the jNMF [13], Stražar et al. (2016) proposed an integrative orthogonality-regularized nonnegative matrix factorization (iONMF) to predict protein-RNA interactions. iONMF was an extension of jNMF by integrating multiple types of data with orthogonality regularization on the basis matrix [18]. This model learns the coefficients matrices from the training dataset, and the basis matrix from the testing dataset, and then predicts the interaction matrix. However, both jNMF and iONMF were originally solved by a multiplicative update method, which might be limited by its slow convergence or even non-convergence issues.

In this paper, we first generalize and introduce a unified joint matrix factorization framework (JMF) based on the classical NMF and jNMF for pattern recognition and data mining by integrating multi-view data on the same objects and must-link and cannot-link constraints within and between any two data. In addition, sparsity constraints are also considered. We adopt four update algorithms including multiplicative update algorithm (MUR), projected gradient method (PG), Nesterov’s optimal gradient method (Ne), and a novel proximal alternating nonnegative least squares algorithm (PANLS) for solving JMF. Then, the JMF is extended to two types of prediction models with one based on the basis matrix and another based on the coefficients matrices (). Finally, we demonstrate the effectiveness of this framework both in revealing object-specific multiple-view hidden patterns and prediction performance through extensive computational experiments.

Compared with existing NMF techniques for pattern recognition and data integration, JMF has the following characteristics:

  1. JMF can model multi-view data as well as must-links/cannot-links simultaneously for recognizing object-specific and multi-view associated patterns.

  2. Must-links and cannot links within and between some views can be completely missing, and each within-view or between-view type can be associated with multiple constraint matrices.

  3. JMF can be solved with diverse update algorithms, among which PANLS is a representative one for solving JMF with competitive performance in terms of computational accuracy and efficiency.

The rest of the paper is organized as follows. In section 2, we describe the formulation of JMF. In section 3, we present four update methods to solve JMF. In section 4, we propose two prediction models based on JMF. In section 5, we illustrated the experimental results on both synthetic and real datasets. At last, we summarize this study in section 6.

2 Problem formulation

Given two nonnegative matrices and with size of and , the networked relationship represented by two adjacency matrices and with size of and and the between networked relationship represented by a bipartite adjacency matrix of size . In our application, our assumption is that the two matrices and are two different kinds of descriptions of the same set of objects, the networked relationship , and are described as prior knowledge about the features. The goal of this study is to find a reduced representation by incorporating all the data we have now.

To achieve the ultimate goal in one framework, we incorporate three components into the objective function. The first one considers the parts-based data representation of two matrices and . The second and third ones consider the networked relationship and of each type of features, and the between networked relationship by imposing network regularized constraints, respectively. Finally, we consider to incorporate sparsity constraints to get a sparse solution.

2.1 NMF and its variants

Non-negative matrix factorization (NMF) problem is a matrix factorization model which uses two low-rank non-negative matrices, i.e., one basis matrix and one coefficient matrix, to reconstruct the original data matrix [1, 2]. Its objective function is

where and are the basis matrix and coefficient matrix with size of and respectively, and is the Frobenius norm of a matrix. The non-negativity has been stated that parts are generally combined additively to form a whole; hence, it can be useful for learning part-based representations. Thus, the so-called NMF can be a useful technique to decipher distinct sub-structures for revealing subtle data structure in the underlying data. Several approaches for solving NMF have been discussed in [3], and more variants and applications of NMF can refer a recent review paper [4].

Here our goal is to find the linked patterns among two matrices. We assume that there is one common basis matrix between matrices and . So a joint non-negative matrix representation can be derived by the following optimization problem,

min (1)

Ideally, the low-dimensional representation (the coefficient matrices) and for the original matrices and derived based on the best approximation can lead to the linked patterns. However, it is unnecessarily accurate due to the incompleteness and noises of the data and other possible factors. In order to improve the accuracy of the patterns, we incorporate the prior networked knowledge on each data object, and bipartite networked knowledge between the data objects and .

Networks Regularized Constraints

Let , denote the low-rank representation of the original data matrices. To decipher the inherent modular structure in a network or say the closeness information of the objects, we assume that adjacent nodes should have similar membership profiles. Therefore, we enforce the must-link constraints by maximizing the following optimization function for (or similarly for ):


where . Similarly, the between relationship information between the two types of objects can also be adopted in the following objective function:


The motivation behind the proposed network regularized constraints are actually quite straightforward. Note that the solution of the problem defined in Eq. 1 is often not unique. We expect to obtain a solution for Eq. 1, which also satisfies the network-regularized constraints well. The limitations of the previous model and the noisy of the real data lead us to consider an integrative framework for jointly handing feature data and networked data simultaneously.

Networks-Regularized jNMF

Here, we incorporate all the data (represented in these five matrices) to discover linked patterns based on , and . Specifically, we combine all the above objective functions together to integrate all the five matrices in the following optimization problem:


where the parameters and weigh the link constraints in , and respectively. And the first term is to describe the linked patterns between two data matrices by a shared basis or component matrix , the second term defines the summation of the within-variable constraints that decipher the modular structure in network , , and the third term defines the summation of the between-variable constraints which decipher the modular structure in the bipartite network. Here, we can consider the integration of these known networks as graph regularization of the first objective [6] or as a semi-supervised learning problem which aims to enforce the must-link constraints into the framework of pattern recognition, where variables with the ‘must-link’ constraint shall be forced into the same pattern. This can facilitate pattern search by significantly narrowing down the large search space and improve the reliability of the identified patterns.

2.2 A Unified Joint NMF Model (JMF)

One of the important characteristics of the NMF is that it usually generates sparse representation that allows us to discover parts-based patterns [2]. However, several studies have showed that the generation of a parts-based representation by NMF depends on the data and the algorithm [8]. Several approaches have been proposed to explicitly control the degree of sparseness in the and/or factors of the NMF [8, 9]. The idea of imposing -norm based constraints for achieving sparse solution has been successfully and comprehensively utilized in various problems [19]. We adopt the strategy suggested by [9], to make the coefficient matrices and sparse. Thus, the sparse network-regularized jNMF can be formulated as follows:

where and are the th and th column of and respectively. The first term favors modules with the data profiles, and the second term as well as the third term summarize all the must-link constraints in the first and second profiles, and between the two profiles. The term is used to control the scale of matrix , and encourages the sparsity. The parameter suppress the growth of and controls the desired sparsity.

Naturally, with the emergence of various kinds of muti-view, within-view and between-view type data, a unified framework is urgently needed. Therefore, we present a generalized form of JMF framework (Figure 1) as follows,


where is the th column of , is the th constraint matrix on the th object, and is the relationship matrix between the th and th objects.

Fig. 1: An illustration of JMF. There are type data matrices , and many constraints matrices. For example, data matrix is the first constraint data matrix on data type , and relates object types and .

3 Algorithms for JMF

Similar to the classical NMF problem, the proposed objective function in Eq.5 is not convex for all variables , (). Therefore, it is unrealistic to expect an algorithm to find the global minimum of the proposed optimization problem. For the classical NMF problem, it is convex for one matrix factor when another is fixed. Therefore, we adopt an alternative update strategy for solving JMF. Specifically, fix (), we can obtain by solving:


Similarly, fix , we can update by solving:


We can further update one by one. For any , given and , the objective function for optimizing is


Various types of methods have been proposed to solve each subproblem of classical NMF [20, 21, 22, 23, 24, 25]. The most widely used approach is the multiplicative update (MUR) algorithm [20]. This algorithm is easy to implement but converges slowly. And it cannot guarantee the convergence of a local minimum solution. As the resulted matrix factors are nonnegative, Lin treated each subproblem as a bounded constraint optimization problem and used a projected gradient (PG) method to solve it [21]. However, PG is inefficient because the Armijo rule is used for searching step size, which is very time-consuming. As the low-rank matrices of the classical NMF are desirable to be sparse, the active set strategy may be a promising method. Kim and Park adopted an active set (AS) method to solve such types of subproblems, which divides variables into an active set and a passive set. In each iteration, AS exchanges only one variable between these two sets [22]. They further used the block pivoting strategy to accelerate the AS method (BP) [23]. Both AS and BP methods assume that each subproblem is strictly convex, which might bring about numerical instability. As each subprobelm is a convex function and its gradient is Lipchitz continuous, Guan et al. solved each subproblem by Nesterov’s optimal gradient (Ne) method (NeNMF) [24]. NeNMF converges faster than previous methods as it has neither time-consuming line search step, nor numerical instability problem. Moreover, NeNMF can be extended to sparse and network regularization even it is not convex. Recently, Zhang et al. proposed a new proximal alternating nonnegative least squares (PANLS) to solve each subproblem, which switches between the constrained PG step and unconstrained active set step [25]. Luckily, MUR, PG, Ne and PANLS are all suitable for solving JMF, while both AS and BP are not directly applicable to the network-regularized NMF. As noted that the current code of BP needs to be modified and it may not be efficient if the BP update method is used [25]. In the following subsections, we develop four update methods (MUR, PG, Ne and PANLS) for optimizing JMF in spirit of the above exploration, and present their corresponding algorithms in Appendix Algorithms 1-4, respectively.

3.1 Multiplicative update algorithm

Firstly, we solve JMF with the MUR algorithm which searches along a rescaled gradient direction with a fixed form of learning rate to guarantee the nonnegativity of the low-rank matrices. The details of MUR are shown as follows. The Lagrange function is , and . The partial derivative of with respect to and are respectively as follows:


Based on the KKT conditions , we get the following equations for , respectively,

Then we can get the following update rules:


Note that the usual stopping criterion for MUR is


where is a predefined tolerance. While the usual stopping criterion used in the other three update methods is


If is denoted by , then Eq. 12 can be represented by . Note that may vary slowly when the variables close to a stationary point. Thus, we terminate PG, Ne and PANLS, when Eq. 12 or the following Eq. 13 is satisfied,


Generally, MUR is simple and easy to implement, and it quickly decreases the objective value at the beginning. But it does not guarantee the convergence to any local minimum because its solution is unnecessarily a stationary point. Even though it has a stationary point, it converges slowly. If some rows or columns of are close to zero, the result may have numerical problems.

3.2 Projected gradient algorithm

We adopt PG to solve each subproblem, which uses the Armijo rule to search the step size along the projection arc. We take the subproblem Eq. 6 as an example. The step size satisfies:


where (=0.01 is used), and is the second moment matrix of . The gradient function of is

The Hessian matrix for is

where is Kronecker product. The PG is very easy to implement but it is time-consuming and may suffer from the zigzag phenomenon when approaching the local minimizer if the condition number is bad.

3.3 Nesterov’s optimal gradient algorithm

NetNMF updates two sequences recursively to optimize each low-rank matrix. One sequence stores the approximate solutions which are obtained by PG method on the search points with step size determined by the Lipchitz constant. Another sequence stores the search points which are the combination of the latest two approximation solution. In this way, the objective function is convex for the variable and the gradient of the objective function is Lipchitz continuous that are the two prerequisites when applying Nesterov’s method [26, 27, 28]. NeNMF can be conveniently extended for optimizing -norm, -norm and network-regularized NMF and can also been extended for JMF. Given (), the objective function for optimizing in Eq. 6 is a convex function, and the gradient function for satisfies Lipschitz continuity as follows,


where is the gradient of . Though the objective function in Eq. 8 for optimizing is nonconvex, the gradient function for satisfies Lipschitz continuity as follows,


where is the gradient of . Ne indeed decrease the objective function but cannot guarantee the convergence to any stationary point as the objection function is nonconvex.

3.4 Proximal alternating nonegative least squares algorithm

Inspired by the PANLS for solving the typical NMF problem [25], we adopt the kernel ideal for solving JMF. The subproblems can be transformed as follows,


The Hessian matrices of and are:


Thus, and are strictly convex function of variables and with proper and . And each subproblem has a unique minimizer according to Frank Wolfe theorem. Therefore, PANLS has a nice convergence property.

4 Prediction models based on JMF

NMF and its variants can be used for prediction tasks [18]. JMF can also be extended to the prediction form. Both the basis matrix and coefficients matrices can be used for prediction. The prediction based on the basis matrix is denoted by JMF/L, while the prediction based on the coefficient matrices is denoted by JMF/R. Let () be the training datasets and () be the testing datasets. We can obtain low-rank matrices and by JMF on the training datasets.

For JMF/L, fix the learned coefficients matrices (), the predicted factor can be obtained by solving Eq. 6 on the testing datasets. There were two prediction scenarios based on the learned basis matrix on the training data. In the scenario I for class prediction of new samples, the basis matrix is used as the prediction factor, which can be obtained based on the testing data and learned coefficient matrices . The prediction class of each sample can be obtained based on the maximum value in each row of . In the scenario II for one view data prediction (e.g., ) from other view data (, ), the new basis matrix is computed with the learned coefficient matrices and the testing data , …, . Then the multiplication of and the learned coefficient matrix is used to predict . In this paper, we illustrated the second scenario of JMF/L with one real application.

Similarly, for JMF/R, fix the learned basis matrix , the predicted factors can be obtained by solving Eq. 8, which can be used as prediction factors as the scenario I in JMF/L.

Fig. 2: The influence of each constraint term on the performance of JMF.

5 Experiments

To demonstrate the performance of JMF, we applied it to four synthetic and three real datasets. Firstly, we evaluated how the parameters influence the performance of JMF in terms of the Area Under the Curve (AUC) on four synthetic datasets. Then we compared the average objective values with respect to iteration numbers or running time of the four update methods on the synthetic datasets. Finally, we applied JMF to three real data from diverse fields. We run the experiments of synthetic datasets on a machine with Intel Core i7-4770 CPU @ 3.40GHz ¡Á4 with 16 GB RAM and used MATLAB (R2016a) 64-bit for the general implementation. The real datasets were run on a windows server with Intel (R) Xeon (R) E5-2643 v3 CPU @ 3.40GHz ¡Á2 with 768 GB RAM and implemented on MATLAB (R2013a) 64-bit. For the purpose of reproducibility, the data and code are available at: http://page.amss.ac.cn/shihua.zhang/software.html

5.1 Synthetic dataset 1

We adopt a similar simulation strategy as used in Experiment in [29] to demonstrate the effectiveness of the algorithms for JMF. The true low-rank and the ground truth basis matrix represented by was constructed with coph = 0 as follows,


Meanwhile, three coefficient matrices () were constructed with coph =0,


where .

We set the data matrices by (), where was Gaussian noise and . The within constraint on each data matrix was simulated as follows,


We obtained the within constraint matrix on the th source by averaging the value of and , where . The between constraint on the th and th data matrices was simulated as follows,


The between constraint matrix .

5.2 Synthetic dataset 2

We simulated a relative large-scale dataset with the true low rank . Different from dataset 1, the entries of the true basis matrix were deemed as independent and identical Bernoulli variables with probability equals to . And we constructed the ground truth coefficient matrices in the following manner with coph=, , , respectively.

We set data matrices by , (), where was Gaussian noise and . Similarly, and were generated as mentioned above.


TABLE I: Performance comparison of update algorithms on synthetic dataset 1.
Time (seconds) #iteration Reconstruction error AUC
MUR 2.96 5.68 10.03 341 658 1156 31042.90 31026.51 31022.75 0.76 0.77 0.77
PG 20.78 25.18 30.53 76 93 114 31014.42 31013.91 31013.84 0.78 0.78 0.78
Ne 9.72 12.15 15.01 68 85 106 31014.43 31013.91 31013.84 0.78 0.78 0.78
PANLS 3.28 4.04 5.03 78 97 125 31015.16 31013.95 31013.85 0.78 0.78 0.78
MUR 2.31 5.08 6.04 339 729 873 31416.28 31398.34 31394.33 0.78 0.79 0.79
PG (2) 9.01 (2) 13.27 (2) 21.37 34 51 82 31380.58 31359.02 31348.51 0.79 0.80 0.81
Ne 7.85 11.59 16.33 56 82 118 31331.53 31318.13 31311.99 0.79 0.79 0.80
PANLS (3) 3.68 (3) 4.60 (3) 5.56 54 68 84 31509.66 31507.87 31508.37 0.81 0.81 0.81
PG 39.22 39.22 39.22 146 146 146 31013.84 31013.84 31013.84 0.78 0.78 0.78
Ne 19.36 19.36 19.36 137 137 137 31013.84 31013.84 31013.84 0.78 0.78 0.78
PANLS 0.63 2.05 4.19 13 51 105 31262.01 31027.84 31014.02 0.71 0.78 0.78
PG (1) 32.86 (1) 32.86 (1) 32.86 125 125 125 31346.22 31346.22 31346.22 0.81 0.81 0.81
Ne 21.73 21.73 21.73 153 153 153 31307.93 31307.93 31307.93 0.80 0.80 0.80
PANLS (1) 8.02 (1) 8.02 (1) 8.02 120 120 120 31500.84 31500.84 31500.84 0.81 0.81 0.81
  • The number in the bracket represents the frequency of the algorithm’s iteration exceeds under the initial values. The Stop1 and Stop2 represent the algorithms terminate with the first and second stop criteria respectively and the regularization being empty, while Stop1+ and Stop2+ represent the algorithms terminate with the first and second stop criteria respectively and the regularization being nonempty.


TABLE II: Performance comparison of update algorithms on synthetic dataset 2.
Time (seconds) #iteration Reconstruction error AUC
MUR (2) 9.68 (2) 26.89 (2) 84.20 199 558 1724 1425015.76 1421570.53 1420184.96 0.89 0.93 0.94
PG 50.75 80.81 97.36 48 74 89 1420060.68 1419679.13 1419662.75 0.94 0.95 0.95
Ne 37.32 58.09 70.04 47 73 88 1420070.63 1419679.79 1419662.69 0.94 0.95 0.95
PANLS 13.90 21.74 26.48 61 100 128 1420389.53 1419711.43 1419664.65 0.94 0.95 0.95
MUR (5) 8.65 (5) 26.5 (5) 62.73 197 609 1443 1425298.99 1421669.47 1420916.36 0.90 0.95 0.95
PG 47.78 66.23 73.03 60 82 92 1424383.52 1424099.74 1424098.05 0.96 0.97 0.97
Ne (3) 38.19 (3) 49.81 (3) 79.48 45 59 101 1420908.49 1420554.45 1420491.00 0.95 0.96 0.96
PANLS 11.48 21.35 25.93 46 81 101 1421211.47 1419969.28 1419935.96 0.93 0.95 0.95
PG 9.01 119.10 119.10 11 108 108 1429843.58 1419660.98 1419660.98 0.81 0.95 0.95
Ne 7.54 86.41 86.41 11 107 107 1427159.80 1419660.97 1419660.97 0.85 0.95 0.95
PANLS 2.73 2.78 10.48 11 11 48 1429249.16 1428993.74 1421766.35 0.82 0.82 0.92
PG 84.14 84.14 84.14 103 103 103 1424098.23 1424098.23 1424098.23 0.97 0.97 0.97
Ne 7.46 80.94 80.94 11 110 110 1427301.13 1420477.00 1420477.00 0.86 0.96 0.96
PANLS 3.01 3.06 33.48 11 11 120 1429044.28 1428805.69 1419755.46 0.83 0.83 0.95

5.3 Synthetic dataset 3

We simulated a dataset with overlap information on coefficient matrices as well as large noise in prior networks. We set the true low-rank and constructed the ground truth basis matrix with coph =,

The entries of the true coefficient matrices () were regarded as independent and identical Bernoulli variables with probability equal to . Then we set the data matrices by , (), where was Gaussian noise and . and were generated as mentioned in section 5.1.


TABLE III: Performance comparison of update algorithms on synthetic dataset 3.
Time (seconds) #iteration Reconstruction error AUC
MUR 6.01 19.67 71.31 82 268 934 1761985.02 1748184.66 1743369.37 0.60 0.63 0.64
PG 42.87 129.66 428.45 25 68 213 1744257.88 1741166.87 1740083.05 0.63 0.65 0.66
Ne 25.95 84.73 279.05 23 67 213 1744342.91 1741156.57 1740122.46 0.63 0.65 0.66
PANLS 7.12 7.12 7.12 16 16 16 1750415.28 1750415.28 1750415.28 0.62 0.62 0.62
MUR 5.50 18.31 67.64 79 260 936 1765436.16 1752533.34 1747422.15 0.61 0.63 0.65
PG 45.42 152.41 500.40 26 77 247 1748471.26 1744875.66 1743766.27 0.64 0.65 0.67
Ne 27.04 99.86 290.43 24 75 206 1748100.58 1744994.86 1744082.25 0.64 0.66 0.67
PANLS 7.26 7.26 7.26 16 16 16 1750520.76 1750520.76 1750520.76 0.62 0.62 0.62