A Unified Joint Matrix Factorization Framework for Data Integration
Abstract
Nonnegative matrix factorization (NMF) is a powerful tool for exploratory data analysis that discovers hidden features and part-based patterns in high-dimensional data. NMF and its variants have been successfully applied in diverse fields such as pattern recognition, signal processing, data mining and bioinformatics. Recently, NMF has been extended to analyze multiple matrices simultaneously. However, a unified framework is still lacking. In this paper, we introduce a sparse multiple-relationship-data regularized joint matrix factorization (JMF) framework and two adapted prediction models for pattern recognition and data integration. Next, we present four update algorithms to solve this framework and systematically explore their merits and demerits. Furthermore, extensive computational experiments using both synthetic and real data demonstrate the effectiveness of the JMF framework and the related algorithms for pattern recognition and data mining.
1 Introduction
Nonnegative matrix factorization (NMF) is a powerful matrix factorization technique which typically decomposes a nonnegative data matrix into the product of two low-rank nonnegative matrices [1, 2]. NMF was first introduced by Paatero and Tapper (1994) and has become an active research area with much progress both in theory and in practice since the work by Lee and Seung (1999). NMF and its variants have been recognized as valuable exploratory analysis tools. They have been successfully applied in many fields including signal processing, data mining, pattern recognition and bioinformatics [3, 4].
NMF has been shown to be able to generate sparse and part-based representations of data [2]. In other words, the factorization allows us to easily identify meaningful substructures underlying the data. In the past decade, a number of variants have been proposed by incorporating various kinds of regularization terms, including discriminative constraints [5], network-regularized or locality-preserving constraints [6, 7], sparsity constraints [8, 9], orthogonality constraints [10] and others [11].
However, the typical NMF and its variants in their present form can only be applied to one matrix containing just one type of variables. Large amounts of multi-view data describing the same set of objects are now available. Thus, data integration methods are urgently needed. Recently, joint matrix factorization based data integration methods have been proposed for pattern recognition and data mining among pairwise or multi-view data matrices. For example, Greene and Cunningham proposed an integration model based on matrix factorization (IMF) to learn the embedded underlying clustering structures across multiple views. IMF is a late integration strategy, which fuses the clustering solutions of each individual view for further analysis [12]. Zhang et al. (2012) proposed a joint nonnegative matrix factorization (jNMF) to decompose a number of data matrices X_I (I = 1, ..., N) which share the same row dimension into a common basis matrix W and different coefficient matrices H_I, such that X_I ≈ W H_I, by minimizing Σ_I ||X_I − W H_I||_F^2 [13]. This simultaneous factorization can not only detect the underlying part-based patterns in each matrix, but also reveal the potential connections between patterns of different matrices. A further network-regularized version has also been proposed and applied in bioinformatics [14]. Liu et al. (2013) proposed a multi-view clustering method, which factorizes individual matrices simultaneously and requires the coefficient matrices learnt from various views to be approximately common [15]. Specifically, it is defined as follows,

min_{W_v, H_v, H* ≥ 0}  Σ_v ( ||X_v − W_v H_v||_F^2 + λ_v ||H_v − H*||_F^2 ),
where λ_v is a parameter to tune the relative weight among different views as well as between the two terms. Zitnik and Zupan (2015) proposed a data fusion approach with penalized matrix tri-factorization (DFMF) for simultaneously factorizing multiple relationship matrices in one framework [16]. They also considered incorporating the must-link and cannot-link constraints within each data type into the DFMF model as follows,

min_{G_i ≥ 0}  Σ_{i,j} ||R_ij − G_i S_ij G_j^T||_F^2 + Σ_i Σ_{t=1}^{n_i} tr(G_i^T Θ_i^{(t)} G_i),
where t = 1, ..., n_i and n_i is the number of data sources for the i-th object type. R_ij represents the relationship data matrix between the i-th and the j-th object types (between constraint). DFMF decomposes it into G_i S_ij G_j^T, constrained by Θ_i^{(t)} (within constraint), which provides relations between objects of the i-th object type. This method well exploits the abstracted relationship data, but ignores the sample-specific information of the data. In image science, Jing et al. (2012) adopted a supervised joint matrix factorization model to learn latent bases by factorizing both the region-image matrix and the annotation-image matrix simultaneously while incorporating the label information (where y_i indicates the label index of the i-th image) [17]. This supervised model for image classification and annotation (SNMFCA) is formulated as follows,
where the label indicator matrix has entries equal to 1 if the corresponding image belongs to the class and 0 otherwise. Obviously, SNMFCA aims to determine the latent basis with known class information. However, this model considers neither the must-link and cannot-link constraints within each data type nor those between data types.
Recently, based on jNMF [13], Stražar et al. (2016) proposed an integrative orthogonality-regularized nonnegative matrix factorization (iONMF) to predict protein-RNA interactions. iONMF extends jNMF by integrating multiple types of data with orthogonality regularization on the basis matrix [18]. This model learns the coefficient matrices from the training dataset and the basis matrix from the testing dataset, and then predicts the interaction matrix. However, both jNMF and iONMF were originally solved by a multiplicative update method, which might be limited by its slow convergence or even non-convergence.
In this paper, we first generalize and introduce a unified joint matrix factorization framework (JMF), based on the classical NMF and jNMF, for pattern recognition and data mining, integrating multi-view data on the same objects together with must-link and cannot-link constraints within and between any two data types. In addition, sparsity constraints are also considered. We adopt four update algorithms for solving JMF: the multiplicative update algorithm (MUR), the projected gradient method (PG), Nesterov's optimal gradient method (Ne), and a novel proximal alternating nonnegative least squares algorithm (PANLS). Then, JMF is extended to two types of prediction models, one based on the basis matrix W and the other based on the coefficient matrices H_I. Finally, we demonstrate the effectiveness of this framework both in revealing object-specific multi-view hidden patterns and in prediction performance through extensive computational experiments.
Compared with existing NMF techniques for pattern recognition and data integration, JMF has the following characteristics:

JMF can model multi-view data as well as must-links/cannot-links simultaneously for recognizing object-specific and multi-view associated patterns.

Must-links and cannot-links within and between some views can be completely missing, and each within-view or between-view type can be associated with multiple constraint matrices.

JMF can be solved with diverse update algorithms, among which PANLS is a representative one with competitive performance in terms of computational accuracy and efficiency.
The rest of the paper is organized as follows. In Section 2, we describe the formulation of JMF. In Section 3, we present four update methods to solve JMF. In Section 4, we propose two prediction models based on JMF. In Section 5, we illustrate the experimental results on both synthetic and real datasets. Finally, we summarize this study in Section 6.
2 Problem formulation
Given two nonnegative matrices X_1 and X_2 of size n × m_1 and n × m_2, the networked relationships among features are represented by two adjacency matrices A_1 and A_2 of size m_1 × m_1 and m_2 × m_2, and the between-network relationship is represented by a bipartite adjacency matrix A_12 of size m_1 × m_2. In our application, we assume that the two matrices X_1 and X_2 are two different kinds of descriptions of the same set of objects, and the networked relationships A_1, A_2 and A_12 are given as prior knowledge about the features. The goal of this study is to find a reduced representation by incorporating all the available data.
To achieve this goal in one framework, we incorporate three components into the objective function. The first considers the parts-based representation of the two data matrices X_1 and X_2. The second and third consider the networked relationships A_1 and A_2 within each type of features and the between-network relationship A_12, respectively, by imposing network-regularized constraints. Finally, we incorporate sparsity constraints to obtain a sparse solution.
2.1 NMF and its variants
Nonnegative matrix factorization (NMF) is a matrix factorization model which uses two low-rank nonnegative matrices, i.e., one basis matrix and one coefficient matrix, to reconstruct the original data matrix [1, 2]. Its objective function is

min_{W ≥ 0, H ≥ 0}  ||X − W H||_F^2,

where W and H are the basis matrix and coefficient matrix of size n × K and K × m respectively, and ||·||_F is the Frobenius norm of a matrix. The nonnegativity reflects the intuition that parts are generally combined additively to form a whole; hence, it can be useful for learning part-based representations. Thus, NMF can be a useful technique to decipher distinct substructures in the underlying data. Several approaches for solving NMF have been discussed in [3], and more variants and applications of NMF can be found in a recent review paper [4].
Here our goal is to find the linked patterns between two matrices. We assume that there is one common basis matrix W shared by matrices X_1 and X_2. So a joint nonnegative matrix representation can be derived from the following optimization problem,

min_{W, H_1, H_2}  ||X_1 − W H_1||_F^2 + ||X_2 − W H_2||_F^2   (1)
s.t.  W ≥ 0, H_1 ≥ 0, H_2 ≥ 0.
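As a quick illustration of the joint objective in Eq. 1, the following sketch (in Python/NumPy rather than the MATLAB used for our experiments; matrix sizes and the helper name are illustrative) evaluates the reconstruction error under a shared basis W:

```python
import numpy as np

def jnmf_objective(X1, X2, W, H1, H2):
    # Reconstruction error of Eq. (1): one shared basis W, view-specific H1, H2.
    return (np.linalg.norm(X1 - W @ H1, "fro") ** 2
            + np.linalg.norm(X2 - W @ H2, "fro") ** 2)

rng = np.random.default_rng(0)
n, m1, m2, k = 20, 15, 12, 4
W = rng.random((n, k))
H1, H2 = rng.random((k, m1)), rng.random((k, m2))
X1, X2 = W @ H1, W @ H2  # noiseless data generated from the true factors
err = jnmf_objective(X1, X2, W, H1, H2)  # zero up to floating-point error
```

With noiseless data generated from the true factors, the objective vanishes; with noisy data it measures how well a rank-K joint factorization explains both views.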
Ideally, the low-dimensional representations (the coefficient matrices) H_1 and H_2 derived for the original matrices X_1 and X_2 from the best approximation can reveal the linked patterns. However, they are not necessarily accurate due to incompleteness and noise in the data and other possible factors. In order to improve the accuracy of the patterns, we incorporate the prior network knowledge A_1 and A_2 on each type of data objects, and the bipartite network knowledge A_12 between the two types of data objects.
Network-Regularized Constraints
Let H_1, H_2 denote the low-rank representations of the original data matrices. To decipher the inherent modular structure in a network, or in other words the closeness information of the objects, we assume that adjacent nodes should have similar membership profiles. Therefore, we enforce the must-link constraints by maximizing the following objective function for H_1 (or similarly for H_2):

max  tr(H_1 A_1 H_1^T),   (2)
where tr(·) denotes the trace of a matrix. Similarly, the between-relationship information between the two types of objects can also be adopted in the following objective function:

max  tr(H_1 A_12 H_2^T).   (3)
The motivation behind the proposed network-regularized constraints is actually quite straightforward. Note that the solution of the problem defined in Eq. 1 is often not unique. We expect to obtain a solution of Eq. 1 which also satisfies the network-regularized constraints well. The limitations of the previous model and the noisiness of real data lead us to consider an integrative framework for jointly handling feature data and networked data simultaneously.
Network-Regularized jNMF
Here, we incorporate all the data (represented in these five matrices) to discover linked patterns based on X_1, X_2, A_1, A_2 and A_12. Specifically, we combine all the above objective functions to integrate the five matrices in the following optimization problem:

min_{W ≥ 0, H_1 ≥ 0, H_2 ≥ 0}  Σ_{i=1}^{2} ||X_i − W H_i||_F^2 − λ_1 Σ_{i=1}^{2} tr(H_i A_i H_i^T) − λ_2 tr(H_1 A_12 H_2^T),   (4)
where the parameters λ_1 and λ_2 weigh the link constraints in A_1, A_2 and A_12 respectively. The first term describes the linked patterns between the two data matrices through a shared basis (component) matrix W, the second term is the summation of the within-variable constraints that decipher the modular structure of the networks A_1 and A_2, and the third term is the between-variable constraint which deciphers the modular structure of the bipartite network. Here, we can view the integration of these known networks as graph regularization of the first objective [6], or as a semi-supervised learning problem which aims to enforce the must-link constraints in the framework of pattern recognition, where variables with a 'must-link' constraint are forced into the same pattern. This can facilitate pattern search by significantly narrowing down the large search space and improve the reliability of the identified patterns.
2.2 A Unified Joint NMF Model (JMF)
One of the important characteristics of NMF is that it usually generates sparse representations that allow us to discover parts-based patterns [2]. However, several studies have shown that the generation of a parts-based representation by NMF depends on the data and the algorithm [8]. Several approaches have been proposed to explicitly control the degree of sparseness in the W and/or H factors of NMF [8, 9]. The idea of imposing L1-norm based constraints to achieve sparse solutions has been successfully and comprehensively utilized in various problems [19]. We adopt the strategy suggested by [9] to make the coefficient matrices H_1 and H_2 sparse. Thus, the sparse network-regularized jNMF can be formulated as follows:

min_{W ≥ 0, H_1 ≥ 0, H_2 ≥ 0}  Σ_{i=1}^{2} ||X_i − W H_i||_F^2 − λ_1 Σ_{i=1}^{2} tr(H_i A_i H_i^T) − λ_2 tr(H_1 A_12 H_2^T) + γ||W||_F^2 + η Σ_{i=1}^{2} Σ_{j=1}^{m_i} ||h_j^{(i)}||_1^2,

where h_j^{(1)} and h_j^{(2)} are the j-th columns of H_1 and H_2 respectively. The first term favors modules consistent with the data profiles, and the second and third terms summarize all the must-link constraints in the first and second profiles and between the two profiles. The term γ||W||_F^2 is used to control the scale of matrix W, and η Σ_i Σ_j ||h_j^{(i)}||_1^2 encourages sparsity. The parameter γ suppresses the growth of W, and η controls the desired sparsity.
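An objective of this type can be evaluated directly. The sketch below (Python/NumPy; the function name, argument layout and default penalty weights are our own choices, and the exact weighting of the terms follows our reading of the model) exploits the fact that for a nonnegative H the L1 norm of a column is simply its column sum:

```python
import numpy as np

def sparse_jmf_objective(Xs, W, Hs, As, Theta12,
                         lam1=0.1, lam2=0.1, gamma=0.01, eta=0.01):
    """Sketch of a sparse network-regularized joint objective (assumed form):
    reconstruction - must-link rewards + scale and sparsity penalties."""
    recon = sum(np.linalg.norm(X - W @ H, "fro") ** 2 for X, H in zip(Xs, Hs))
    within = sum(np.trace(H @ A @ H.T) for H, A in zip(Hs, As))
    between = np.trace(Hs[0] @ Theta12 @ Hs[1].T)
    # sum_j ||h_j||_1^2: for nonnegative H the column L1 norm is the column sum
    sparsity = sum(((np.sum(H, axis=0)) ** 2).sum() for H in Hs)
    return (recon - lam1 * within - lam2 * between
            + gamma * np.linalg.norm(W, "fro") ** 2 + eta * sparsity)
```

Setting lam1 = lam2 = gamma = eta = 0 recovers the plain joint reconstruction error, which gives a quick sanity check for an implementation.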
Naturally, with the emergence of various kinds of multi-view, within-view and between-view data, a unified framework is urgently needed. Therefore, we present a generalized form of the JMF framework (Figure 1) as follows,
min_{W, H_I}  Σ_{I=1}^{N} ||X_I − W H_I||_F^2 − λ_1 Σ_{I=1}^{N} Σ_t tr(H_I A_I^{(t)} H_I^T) − λ_2 Σ_{I<J} tr(H_I Θ_IJ H_J^T) + γ||W||_F^2 + η Σ_{I=1}^{N} Σ_j ||h_j^{(I)}||_1^2
s.t.  W ≥ 0, H_I ≥ 0, I = 1, ..., N,   (5)

where h_j^{(I)} is the j-th column of H_I, A_I^{(t)} is the t-th constraint matrix on the I-th object type, and Θ_IJ is the relationship matrix between the I-th and J-th object types.
3 Algorithms for JMF
Similar to the classical NMF problem, the proposed objective function in Eq. 5 is not convex in all the variables W, H_I (I = 1, ..., N) jointly. Therefore, it is unrealistic to expect an algorithm to find the global minimum of the proposed optimization problem. As in the classical NMF problem, the objective is convex in one matrix factor when the others are fixed. Therefore, we adopt an alternating update strategy for solving JMF. Specifically, fixing H_I (I = 1, ..., N), we can obtain W by solving:

min_{W ≥ 0}  Σ_{I=1}^{N} ||X_I − W H_I||_F^2 + γ||W||_F^2.   (6)
Similarly, fixing W, we can update H_1, ..., H_N by solving:

min_{H_I ≥ 0, I=1,...,N}  Σ_{I=1}^{N} ||X_I − W H_I||_F^2 − λ_1 Σ_{I=1}^{N} Σ_t tr(H_I A_I^{(t)} H_I^T) − λ_2 Σ_{I<J} tr(H_I Θ_IJ H_J^T) + η Σ_{I=1}^{N} Σ_j ||h_j^{(I)}||_1^2.   (7)
We can further update H_1, ..., H_N one by one. For any I, given W and H_J (J ≠ I), the objective function for optimizing H_I is

min_{H_I ≥ 0}  ||X_I − W H_I||_F^2 − λ_1 Σ_t tr(H_I A_I^{(t)} H_I^T) − λ_2 Σ_{J ≠ I} tr(H_I Θ_IJ H_J^T) + η Σ_j ||h_j^{(I)}||_1^2.   (8)
Various types of methods have been proposed to solve each subproblem of classical NMF [20, 21, 22, 23, 24, 25]. The most widely used approach is the multiplicative update (MUR) algorithm [20]. This algorithm is easy to implement but converges slowly, and it cannot guarantee convergence to a local minimum. As the resulting matrix factors are nonnegative, Lin treated each subproblem as a bound-constrained optimization problem and used a projected gradient (PG) method to solve it [21]. However, PG is inefficient because the Armijo rule is used for searching the step size, which is very time-consuming. As the low-rank matrices of classical NMF are desired to be sparse, the active set strategy may be a promising method. Kim and Park adopted an active set (AS) method to solve such subproblems, which divides the variables into an active set and a passive set. In each iteration, AS exchanges only one variable between these two sets [22]. They further used the block pivoting strategy to accelerate the AS method (BP) [23]. Both AS and BP assume that each subproblem is strictly convex, which might bring about numerical instability. As each subproblem is a convex function whose gradient is Lipschitz continuous, Guan et al. solved each subproblem by Nesterov's optimal gradient (Ne) method (NeNMF) [24]. NeNMF converges faster than previous methods as it has neither a time-consuming line search step nor numerical instability problems. Moreover, NeNMF can be extended to sparse and network regularization even when the objective is not convex. Recently, Zhang et al. proposed a proximal alternating nonnegative least squares (PANLS) method to solve each subproblem, which switches between constrained PG steps and unconstrained active set steps [25]. Fortunately, MUR, PG, Ne and PANLS are all suitable for solving JMF, while AS and BP are not directly applicable to the network-regularized NMF.
As noted in [25], the current code of BP would need to be modified and might not be efficient if the BP update method were used. In the following subsections, we develop four update methods (MUR, PG, Ne and PANLS) for optimizing JMF in the spirit of the above exploration, and present their corresponding algorithms in Appendix Algorithms 1-4, respectively.
3.1 Multiplicative update algorithm
Firstly, we solve JMF with the MUR algorithm, which searches along a rescaled gradient direction with a fixed form of learning rate to guarantee the nonnegativity of the low-rank matrices. The details of MUR are as follows. Let Ψ and Φ_I be the Lagrange multiplier matrices for the constraints W ≥ 0 and H_I ≥ 0, so the Lagrange function L adds tr(ΨW^T) and Σ_I tr(Φ_I H_I^T) to the objective in Eq. 5. The partial derivatives of L with respect to W and H_I are respectively:

∂L/∂W = −2 Σ_I X_I H_I^T + 2W(Σ_I H_I H_I^T + γI) + Ψ,
∂L/∂H_I = −2W^T X_I + 2W^T W H_I − λ_1 Σ_t H_I (A_I^{(t)} + A_I^{(t)T}) − λ_2 Σ_{J ≠ I} H_J Θ_IJ^T + 2η E H_I + Φ_I,   (9)

where E is the K × K all-ones matrix. Based on the KKT conditions Ψ ∘ W = 0 and Φ_I ∘ H_I = 0 (with ∘ the elementwise product), we get stationarity equations for W and H_I respectively. Then we can get the following update rules:

W ← W ∘ [Σ_I X_I H_I^T] / [W(Σ_I H_I H_I^T + γI)],
H_I ← H_I ∘ [2W^T X_I + λ_1 Σ_t H_I (A_I^{(t)} + A_I^{(t)T}) + λ_2 Σ_{J ≠ I} H_J Θ_IJ^T] / [2W^T W H_I + 2η E H_I],   (10)

where the division is elementwise.
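A minimal sketch of one MUR sweep for the unregularized special case (λ_1 = λ_2 = γ = η = 0, i.e. plain jNMF) is shown below in Python/NumPy; the small eps added to the denominators to avoid division by zero is a standard implementation device rather than part of the update rules:

```python
import numpy as np

def mur_sweep(Xs, W, Hs, eps=1e-12):
    """One multiplicative-update sweep for min sum_I ||X_I - W H_I||_F^2,
    the unregularized special case of the JMF update rules."""
    # Update each coefficient matrix with W fixed (standard Lee-Seung rule).
    Hs = [H * (W.T @ X) / (W.T @ W @ H + eps) for X, H in zip(Xs, Hs)]
    # Update the shared basis with all H_I fixed.
    num = sum(X @ H.T for X, H in zip(Xs, Hs))
    den = W @ sum(H @ H.T for H in Hs) + eps
    return W * num / den, Hs
```

Each sweep is non-increasing in the joint objective, which matches the monotonicity property of classical multiplicative updates.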
Note that the usual stopping criterion for MUR is

|F_{k+1} − F_k| / max(1, |F_k|) ≤ ε,   (11)

where F_k denotes the objective value at the k-th iteration and ε is a predefined tolerance. The usual stopping criterion used in the other three update methods is

||∇^P F(W^k, H_1^k, ..., H_N^k)||_F ≤ ε ||∇F(W^1, H_1^1, ..., H_N^1)||_F,   (12)

where ∇^P F denotes the projected gradient. If ||∇^P F(W^k, H_1^k, ..., H_N^k)||_F is denoted by Δ_k, then Eq. 12 can be represented by Δ_k ≤ ε Δ_1. Note that Δ_k may vary slowly when the variables are close to a stationary point. Thus, we terminate PG, Ne and PANLS when Eq. 12 or the following Eq. 13 is satisfied,

max( ||W^{k+1} − W^k||_F, max_I ||H_I^{k+1} − H_I^k||_F ) ≤ ε.   (13)
Generally, MUR is simple and easy to implement, and it quickly decreases the objective value at the beginning. But it does not guarantee convergence to a local minimum because its solution is not necessarily a stationary point. Even when it converges to a stationary point, it converges slowly. Moreover, if some rows or columns of the factor matrices are close to zero, the updates may suffer from numerical problems.
3.2 Projected gradient algorithm
We adopt PG to solve each subproblem, using the Armijo rule to search the step size along the projection arc. Take the subproblem in Eq. 6 as an example. With W^{k+1} = P[W^k − α_k ∇f(W^k)], where P[·] denotes projection onto the nonnegative orthant, the step size α_k satisfies:

f(W^{k+1}) − f(W^k) ≤ σ ⟨∇f(W^k), W^{k+1} − W^k⟩,   (14)

where 0 < σ < 1 (σ = 0.01 is used). The gradient of f in Eq. 6 is

∇f(W) = 2W(Σ_I H_I H_I^T + γI) − 2 Σ_I X_I H_I^T.

The Hessian matrix with respect to vec(W) is

∇²f = 2(Σ_I H_I H_I^T + γI) ⊗ I_n,

where ⊗ is the Kronecker product. PG is very easy to implement, but it is time-consuming and may suffer from the zigzag phenomenon when approaching a local minimizer if the problem is ill-conditioned.
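The Armijo search along the projection arc can be sketched as follows (Python/NumPy; σ = 0.01 follows the text, while the initial step, shrink factor and iteration cap are illustrative defaults):

```python
import numpy as np

def pg_armijo_step(f, grad, H, alpha=1.0, beta=0.1, sigma=0.01, max_tries=20):
    """One projected-gradient step with Armijo search along the projection arc.
    f and grad are the subproblem objective and its gradient."""
    fH, g = f(H), grad(H)
    for _ in range(max_tries):
        Hn = np.maximum(H - alpha * g, 0.0)  # project onto the nonnegative orthant
        # Sufficient-decrease test along the projection arc (sigma = 0.01).
        if f(Hn) - fH <= sigma * np.sum(g * (Hn - H)):
            return Hn
        alpha *= beta  # shrink the step size and try again
    return H
```

Because an accepted step satisfies the sufficient-decrease test, the objective never increases; the repeated objective evaluations inside the search are exactly what makes PG time-consuming.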
3.3 Nesterov’s optimal gradient algorithm
NeNMF updates two sequences recursively to optimize each low-rank matrix. One sequence stores the approximate solutions, which are obtained by a PG step at the search points with step size determined by the Lipschitz constant. The other sequence stores the search points, which are combinations of the latest two approximate solutions. In this setting, the objective function is convex in the variable being updated and its gradient is Lipschitz continuous, which are the two prerequisites for applying Nesterov's method [26, 27, 28]. NeNMF can be conveniently extended to L1-norm, L2-norm and network-regularized NMF, and can also be extended to JMF. Given H_I (I = 1, ..., N), the objective function in Eq. 6 for optimizing W is convex, and its gradient satisfies Lipschitz continuity as follows,

||∇_W f(W_1) − ∇_W f(W_2)||_F ≤ L_W ||W_1 − W_2||_F,  with L_W = 2||Σ_I H_I H_I^T + γI||_2,   (15)

where ∇_W f is the gradient of f with respect to W. Though the objective function in Eq. 8 for optimizing H_I is nonconvex, its gradient also satisfies Lipschitz continuity:

||∇_{H_I} f(H) − ∇_{H_I} f(H')||_F ≤ L_{H_I} ||H − H'||_F,   (16)

where ∇_{H_I} f is the gradient of f with respect to H_I. Ne indeed decreases the objective function, but it cannot guarantee convergence to a stationary point since the objective function is nonconvex.
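For the plain NNLS special case of a subproblem (regularizers omitted for brevity), the Ne inner loop can be sketched as follows in Python/NumPy; the helper name and iteration budget are illustrative:

```python
import numpy as np

def nenmf_subproblem(X, W, H0, iters=300):
    """Nesterov's optimal gradient for min_{H >= 0} ||X - W H||_F^2:
    a PG step at an extrapolated search point, with step 1/L."""
    WtW, WtX = W.T @ W, W.T @ X
    L = np.linalg.norm(WtW, 2)  # Lipschitz constant of the gradient
    H, Y, t = H0.copy(), H0.copy(), 1.0
    for _ in range(iters):
        Hn = np.maximum(Y - (WtW @ Y - WtX) / L, 0.0)  # PG step at search point
        tn = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Y = Hn + ((t - 1.0) / tn) * (Hn - H)  # combine the last two solutions
        H, t = Hn, tn
    return H
```

The two recursively updated sequences are visible here: H stores the approximate solutions and Y the extrapolated search points, with no line search and hence no repeated objective evaluations.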
3.4 Proximal alternating nonnegative least squares algorithm
Inspired by PANLS for solving the typical NMF problem [25], we adopt its key idea for solving JMF. The subproblems are augmented with proximal terms as follows,

min_{W ≥ 0}  f_W(W) := Σ_I ||X_I − W H_I||_F^2 + γ||W||_F^2 + (τ/2)||W − W^k||_F^2,   (17)
min_{H_I ≥ 0}  f_{H_I}(H_I) := ||X_I − W H_I||_F^2 − λ_1 Σ_t tr(H_I A_I^{(t)} H_I^T) − λ_2 Σ_{J ≠ I} tr(H_I Θ_IJ H_J^T) + η Σ_j ||h_j^{(I)}||_1^2 + (τ_I/2)||H_I − H_I^k||_F^2,   (18)

where W^k and H_I^k are the current iterates and τ, τ_I > 0 are proximal parameters. The Hessian matrices of f_W and f_{H_I} (with respect to vec(W) and vec(H_I)) are:

∇²f_W = 2(Σ_I H_I H_I^T + γI_K) ⊗ I_n + τI,   (19)
∇²f_{H_I} = 2 I_{m_I} ⊗ (W^T W + ηE) − λ_1 Σ_t (A_I^{(t)} + A_I^{(t)T}) ⊗ I_K + τ_I I,   (20)

where E is the K × K all-ones matrix. Thus, f_W and f_{H_I} are strictly convex functions of W and H_I for proper τ and τ_I, and by the Frank-Wolfe theorem each subproblem has a unique minimizer. Therefore, PANLS has a nice convergence property.
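As an illustration of the proximal W-subproblem of this form, the sketch below solves it with a constant-step projected gradient for simplicity (PANLS itself switches between PG and active-set steps); the proximal weight τ and the solver choice are our simplifications:

```python
import numpy as np

def proximal_w_subproblem(Xs, Hs, Wk, gamma=0.1, tau=1.0, iters=200):
    """Sketch of the proximal W-subproblem (assumed form of Eq. 17):
    min_{W>=0} sum_I ||X_I - W H_I||^2 + gamma ||W||^2 + (tau/2)||W - Wk||^2,
    solved by projected gradient with the constant step 1/L."""
    S = sum(H @ H.T for H in Hs)             # sum_I H_I H_I^T
    B = sum(X @ H.T for X, H in zip(Xs, Hs))  # sum_I X_I H_I^T
    L = 2 * np.linalg.norm(S, 2) + 2 * gamma + tau  # Lipschitz bound (Eq. 19)
    W = Wk.copy()
    for _ in range(iters):
        grad = 2 * (W @ S - B) + 2 * gamma * W + tau * (W - Wk)
        W = np.maximum(W - grad / L, 0.0)    # project onto the nonnegative orthant
    return W
```

The proximal term keeps the subproblem strictly convex even when the data part is rank-deficient, which is exactly what guarantees a unique minimizer.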
4 Prediction models based on JMF
NMF and its variants can be used for prediction tasks [18]. JMF can also be extended to a prediction form. Both the basis matrix and the coefficient matrices can be used for prediction. The prediction based on the basis matrix is denoted by JMF/L, while the prediction based on the coefficient matrices is denoted by JMF/R. Let X_I (I = 1, ..., N) be the training datasets and X_I* (I = 1, ..., N) be the testing datasets. We can obtain the low-rank matrices W and H_I by applying JMF to the training datasets.
For JMF/L, fixing the learned coefficient matrices H_I (I = 1, ..., N), the predicted basis factor can be obtained by solving Eq. 6 on the testing datasets. There are two prediction scenarios based on the coefficient matrices learned from the training data. In scenario I, for class prediction of new samples, the basis matrix is used as the prediction factor, obtained from the testing data and the learned coefficient matrices H_I; the predicted class of each sample is given by the maximum value in the corresponding row of the new basis matrix. In scenario II, for predicting one view of the data (e.g., X_N*) from the other views (X_1*, ..., X_{N-1}*), the new basis matrix W* is computed from the learned coefficient matrices and the testing data X_1*, ..., X_{N-1}*. Then the product of W* and the learned coefficient matrix H_N is used to predict X_N*. In this paper, we illustrate the second scenario of JMF/L with one real application.
Similarly, for JMF/R, fixing the learned basis matrix W, the predicted factors H_I* can be obtained by solving Eq. 8 on the testing data; they can then be used as prediction factors as in scenario I of JMF/L.
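Scenario II of JMF/L can be sketched as follows with two views (Python/NumPy; the function name and the inner solver, a plain multiplicative update on W with the network and sparsity terms omitted, are illustrative):

```python
import numpy as np

def predict_missing_view(X1_test, H1, H2, iters=2000, eps=1e-12, seed=1):
    """JMF/L scenario II sketch: fit a new basis W* to the observed test view
    X1_test under the learned H1, then predict the held-out view as W* @ H2."""
    rng = np.random.default_rng(seed)
    W = rng.random((X1_test.shape[0], H1.shape[0]))  # random nonnegative init
    HHt, XHt = H1 @ H1.T, X1_test @ H1.T             # fixed inner products
    for _ in range(iters):
        W = W * XHt / (W @ HHt + eps)  # multiplicative update for fixed H1
    return W @ H2
```

The coefficient matrices carry all the training information here; only the object-specific basis is re-estimated on the test objects.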
5 Experiments
To demonstrate the performance of JMF, we applied it to four synthetic and three real datasets. Firstly, we evaluated how the parameters influence the performance of JMF in terms of the Area Under the Curve (AUC) on the four synthetic datasets. Then we compared the average objective values with respect to iteration number and running time for the four update methods on the synthetic datasets. Finally, we applied JMF to three real datasets from diverse fields. We ran the experiments on the synthetic datasets on a machine with an Intel Core i7-4770 CPU @ 3.40 GHz ×4 with 16 GB RAM, using MATLAB (R2016a) 64-bit for the general implementation. The real datasets were run on a Windows server with an Intel(R) Xeon(R) E5-2643 v3 CPU @ 3.40 GHz ×2 with 768 GB RAM, using MATLAB (R2013a) 64-bit. For the purpose of reproducibility, the data and code are available at: http://page.amss.ac.cn/shihua.zhang/software.html
5.1 Synthetic dataset 1
We adopt a simulation strategy similar to that used in the experiments of [29] to demonstrate the effectiveness of the algorithms for JMF. The true low rank and the ground-truth basis matrix W were constructed with coph = 0 as follows,
(21) 
Meanwhile, three coefficient matrices H_I (I = 1, 2, 3) were constructed with coph = 0,
(22) 
(23) 
(24) 
where .
We set the data matrices by X_I = W H_I + Γ_I (I = 1, 2, 3), where Γ_I was Gaussian noise. The within constraint on each data matrix was simulated as follows,
(25) 
We obtained the within-constraint matrix on the I-th source by averaging the simulated matrix and its transpose. The between constraint on the I-th and J-th data matrices was simulated as follows,
(26) 
This yields the between-constraint matrix.
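A simulation scheme of this kind can be sketched as follows (Python/NumPy; the binary basis, the uniform coefficients and the clipping of negative noisy entries are our assumptions, since the exact constructions are given in Eqs. 21-26):

```python
import numpy as np

def simulate_views(n=50, ms=(40, 30, 20), k=4, sigma=0.1, seed=0):
    """Sketch of the simulation scheme: a binary ground-truth basis W,
    nonnegative coefficient matrices H_I, and noisy data X_I = W H_I + noise."""
    rng = np.random.default_rng(seed)
    W = (rng.random((n, k)) < 0.5).astype(float)  # binary basis pattern
    Hs = [rng.random((k, m)) for m in ms]
    Xs = [np.maximum(W @ H + sigma * rng.standard_normal((n, H.shape[1])), 0.0)
          for H in Hs]  # clip negative noisy entries to keep X_I nonnegative
    return W, Hs, Xs
```

Since the ground truth W and H_I are known, the AUC of the recovered patterns can be scored directly against them.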
5.2 Synthetic dataset 2
We simulated a relatively large-scale dataset with the given true low rank. Different from dataset 1, the entries of the true basis matrix were regarded as independent and identically distributed Bernoulli variables. And we constructed the ground-truth coefficient matrices in the following manner with coph =, , , respectively.
We set the data matrices by X_I = W H_I + Γ_I, where Γ_I was Gaussian noise. Similarly, the within- and between-constraint matrices were generated as mentioned above.
Time (seconds)  #iteration  Reconstruction error  AUC  
Stop1  
MUR  2.96  5.68  10.03  341  658  1156  31042.90  31026.51  31022.75  0.76  0.77  0.77 
PG  20.78  25.18  30.53  76  93  114  31014.42  31013.91  31013.84  0.78  0.78  0.78 
Ne  9.72  12.15  15.01  68  85  106  31014.43  31013.91  31013.84  0.78  0.78  0.78 
PANLS  3.28  4.04  5.03  78  97  125  31015.16  31013.95  31013.85  0.78  0.78  0.78 
Stop1+  
MUR  2.31  5.08  6.04  339  729  873  31416.28  31398.34  31394.33  0.78  0.79  0.79 
PG  (2) 9.01  (2) 13.27  (2) 21.37  34  51  82  31380.58  31359.02  31348.51  0.79  0.80  0.81 
Ne  7.85  11.59  16.33  56  82  118  31331.53  31318.13  31311.99  0.79  0.79  0.80 
PANLS  (3) 3.68  (3) 4.60  (3) 5.56  54  68  84  31509.66  31507.87  31508.37  0.81  0.81  0.81 
Stop2  
PG  39.22  39.22  39.22  146  146  146  31013.84  31013.84  31013.84  0.78  0.78  0.78 
Ne  19.36  19.36  19.36  137  137  137  31013.84  31013.84  31013.84  0.78  0.78  0.78 
PANLS  0.63  2.05  4.19  13  51  105  31262.01  31027.84  31014.02  0.71  0.78  0.78 
Stop2+  
PG  (1) 32.86  (1) 32.86  (1) 32.86  125  125  125  31346.22  31346.22  31346.22  0.81  0.81  0.81 
Ne  21.73  21.73  21.73  153  153  153  31307.93  31307.93  31307.93  0.80  0.80  0.80 
PANLS  (1) 8.02  (1) 8.02  (1) 8.02  120  120  120  31500.84  31500.84  31500.84  0.81  0.81  0.81 

The number in brackets represents the number of initializations for which the algorithm's iteration count exceeded the limit. Stop1 and Stop2 indicate that the algorithms terminated with the first and second stopping criteria respectively with the regularization terms empty, while Stop1+ and Stop2+ indicate that the algorithms terminated with the first and second stopping criteria respectively with the regularization terms nonempty.
Time (seconds)  #iteration  Reconstruction error  AUC  
Stop1  
MUR  (2) 9.68  (2) 26.89  (2) 84.20  199  558  1724  1425015.76  1421570.53  1420184.96  0.89  0.93  0.94 
PG  50.75  80.81  97.36  48  74  89  1420060.68  1419679.13  1419662.75  0.94  0.95  0.95 
Ne  37.32  58.09  70.04  47  73  88  1420070.63  1419679.79  1419662.69  0.94  0.95  0.95 
PANLS  13.90  21.74  26.48  61  100  128  1420389.53  1419711.43  1419664.65  0.94  0.95  0.95 
Stop1+  
MUR  (5) 8.65  (5) 26.5  (5) 62.73  197  609  1443  1425298.99  1421669.47  1420916.36  0.90  0.95  0.95 
PG  47.78  66.23  73.03  60  82  92  1424383.52  1424099.74  1424098.05  0.96  0.97  0.97 
Ne  (3) 38.19  (3) 49.81  (3) 79.48  45  59  101  1420908.49  1420554.45  1420491.00  0.95  0.96  0.96 
PANLS  11.48  21.35  25.93  46  81  101  1421211.47  1419969.28  1419935.96  0.93  0.95  0.95 
Stop2  
PG  9.01  119.10  119.10  11  108  108  1429843.58  1419660.98  1419660.98  0.81  0.95  0.95 
Ne  7.54  86.41  86.41  11  107  107  1427159.80  1419660.97  1419660.97  0.85  0.95  0.95 
PANLS  2.73  2.78  10.48  11  11  48  1429249.16  1428993.74  1421766.35  0.82  0.82  0.92 
Stop2+  
PG  84.14  84.14  84.14  103  103  103  1424098.23  1424098.23  1424098.23  0.97  0.97  0.97 
Ne  7.46  80.94  80.94  11  110  110  1427301.13  1420477.00  1420477.00  0.86  0.96  0.96 
PANLS  3.01  3.06  33.48  11  11  120  1429044.28  1428805.69  1419755.46  0.83  0.83  0.95 
5.3 Synthetic dataset 3
We simulated a dataset with overlap information in the coefficient matrices as well as large noise in the prior networks. We set the true low rank and constructed the ground-truth basis matrix W with coph =,
The entries of the true coefficient matrices H_I were regarded as independent and identically distributed Bernoulli variables. Then we set the data matrices by X_I = W H_I + Γ_I, where Γ_I was Gaussian noise. The within- and between-constraint matrices were generated as described in Section 5.1.
Time (seconds)  #iteration  Reconstruction error  AUC  
Stop1  
MUR  6.01  19.67  71.31  82  268  934  1761985.02  1748184.66  1743369.37  0.60  0.63  0.64 
PG  42.87  129.66  428.45  25  68  213  1744257.88  1741166.87  1740083.05  0.63  0.65  0.66 
Ne  25.95  84.73  279.05  23  67  213  1744342.91  1741156.57  1740122.46  0.63  0.65  0.66 
PANLS  7.12  7.12  7.12  16  16  16  1750415.28  1750415.28  1750415.28  0.62  0.62  0.62 
Stop1+  
MUR  5.50  18.31  67.64  79  260  936  1765436.16  1752533.34  1747422.15  0.61  0.63  0.65 
PG  45.42  152.41  500.40  26  77  247  1748471.26  1744875.66  1743766.27  0.64  0.65  0.67 
Ne  27.04  99.86  290.43  24  75  206  1748100.58  1744994.86  1744082.25  0.64  0.66  0.67 
PANLS  7.26  7.26  7.26  16  16  16  1750520.76  1750520.76  1750520.76  0.62  0.62  0.62 
Stop2 