A Convergent Algorithm for Bi-orthogonal Nonnegative Matrix Tri-Factorization
Abstract
We extend our previous work on a convergent algorithm for uni-orthogonal nonnegative matrix factorization (UNMF) to the case where the data matrix is decomposed into three factors, two of which are constrained to be orthogonal while the third absorbs the approximation error. Because of the way the factorization is performed, we call it bi-orthogonal nonnegative matrix tri-factorization (BNMtF). This factorization was first introduced by Ding et al. [9] with the intent of further improving the clustering capability of their version of UNMF. However, as shown in this paper, their BNMtF algorithm not only lacks the convergence property but also fails to minimize the objective function it is intended to minimize. We tackle this problem by utilizing a technique presented in our previous work and prove that our algorithm converges to a stationary point inside the solution space. As a practical demonstration, the proposed algorithm is used to cluster a text corpus; however, contrary to the claim in the original work, both BNMtF algorithms (the original one by Ding et al. [9] and our proposed algorithm) perform poorly compared to the standard NMF algorithm of Lee & Seung [24] and to our UNMF algorithm based on multiplicative update rules. This suggests that the additional complexity introduced by BNMtF, originally intended to improve the clustering capability of UNMF, is probably unnecessary.
Key words. nonnegative matrix factorization, orthogonality constraint, convergent algorithm, clustering methods.
AMS subject classifications. 65F30, 15A23.
1 Introduction
The nonnegative matrix factorization (NMF) was popularized by the work of Lee & Seung [23], in which they showed that this technique can be used to learn parts of faces and semantic features of text. In its basic form, the NMF seeks to decompose a nonnegative matrix into a pair of other nonnegative matrices with lower ranks:
\[ A \approx BC, \]
where $A \in \mathbb{R}_+^{M \times N}$ denotes the data matrix, $B \in \mathbb{R}_+^{M \times R}$ denotes the basis matrix, $C \in \mathbb{R}_+^{R \times N}$ denotes the coefficient matrix, $R$ denotes the number of factors which usually is chosen so that $R \ll \min(M, N)$, and $\mathbb{R}_+^{M \times N}$ denotes the set of $M$-by-$N$ nonnegative real matrices. The Frobenius norm is usually used as a distance measure to compute $B$ and $C$ in the following minimization problem:
(1.1) \[ \min_{B, C} J(B, C) = \frac{1}{2} \| A - BC \|_F^2 \quad \text{s.t.} \quad B \geq 0,\ C \geq 0, \]
where the symbol $\geq$ denotes entrywise comparison.
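As a concrete illustration of problem (1.1), the classical multiplicative update rules of Lee & Seung [24] can be sketched as follows; the matrix names, the test sizes, and the small constant `eps` guarding the denominators are illustrative, not taken from the text:

```python
import numpy as np

def nmf_mur(A, R, n_iter=100, eps=1e-9, seed=0):
    """Multiplicative update rules of Lee & Seung [24] for
    min 0.5*||A - B C||_F^2 subject to B >= 0, C >= 0."""
    rng = np.random.default_rng(seed)
    M, N = A.shape
    B, C = rng.random((M, R)), rng.random((R, N))
    errs = []
    for _ in range(n_iter):
        B *= (A @ C.T) / (B @ C @ C.T + eps)   # update basis matrix
        C *= (B.T @ A) / (B.T @ B @ C + eps)   # update coefficient matrix
        errs.append(0.5 * np.linalg.norm(A - B @ C) ** 2)
    return B, C, errs

A = np.random.default_rng(1).random((20, 30))
B, C, errs = nmf_mur(A, R=5)
```

The recorded objective values are nonincreasing, which is exactly the property these rules guarantee [24].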
The standard NMF objective function in (1.1) can be extended to accommodate other auxiliary constraints that direct the factors to have certain characteristics such as sparseness [14, 26, 11, 19], smoothness [33, 32, 16, 2], and orthogonality [9, 39, 40, 6, 25]. These extensions are sometimes useful for improving the performance of the NMF in applications such as clustering [33, 38, 4, 11, 19, 8, 20, 5, 15, 10, 42, 36, 37, 41], image analysis [14, 26, 35, 31, 12], spectral analysis, and blind source separation [7, 43, 2, 3, 34].
In order to improve the clustering capability of NMF, Ding et al. [9] introduced two types of orthogonal NMFs: uni-orthogonal NMF (UNMF) and bi-orthogonal NMtF (BNMtF), where the former imposes an orthogonality constraint on either the columns of $B$ or the rows of $C$, and the latter imposes orthogonality constraints on both the columns of $B$ and the rows of $C$ simultaneously. Due to the tight constraints in the latter, they introduced a third factor to absorb the approximation error. They then proposed an algorithm for each orthogonal NMF and claimed that their algorithms are convergent. However, as their algorithms are based on the multiplicative update rules, which only have the nonincreasing property [24], we were challenged to design convergent algorithms for the orthogonal NMFs. A convergent algorithm for the former was presented in our previous work [30], and in this paper we develop a convergent algorithm for the latter. As the orthogonality constraints cannot be recast into the alternating nonnegativity-constrained least squares (ANLS) framework (see [19, 20] for a discussion of ANLS), convergent algorithms for the standard NMF, e.g., [20, 29, 18, 17, 22, 28], cannot be extended to this problem.
2 Bi-orthogonal NMtF
BNMtF puts orthogonality constraints on both the columns of $B$ and the rows of $C$, and it is expected that this technique can be used to simultaneously cluster the columns and rows of $A$ (biclustering). The following is the original BNMtF objective function proposed by Ding et al. [9]:
(2.2) \[ \min_{B, S, C} J(B, S, C) = \frac{1}{2} \| A - BSC \|_F^2 \quad \text{s.t.} \quad B \geq 0,\ S \geq 0,\ C \geq 0,\ B^{T}B = I,\ CC^{T} = I, \]
where $B \in \mathbb{R}_+^{M \times P}$, $S \in \mathbb{R}_+^{P \times Q}$, $C \in \mathbb{R}_+^{Q \times N}$, and $S$ is introduced to absorb the different scales of $A$, $B$, and $C$ due to the strict orthogonality constraints on $B$ and $C$. We will set $P = Q = R$ for the rest of this paper (for a biclustering task, it is natural to have the same number of clusters for both the columns and rows of the data matrix).
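The orthogonality constraints are what make BNMtF a biclustering method: a nonnegative matrix with orthonormal columns can have at most one nonzero entry per row, so each row of the data matrix is effectively assigned to exactly one cluster. A minimal numeric illustration (the 5-by-2 indicator matrix is invented for the example):

```python
import numpy as np

# B assigns each of 5 rows to one of 2 clusters; nonnegativity together with
# column orthogonality forces at most one nonzero entry per row of B.
B = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0],
              [0.0, 1.0]])
B /= np.linalg.norm(B, axis=0)                    # normalize columns
print(np.allclose(B.T @ B, np.eye(2)))            # True: B^T B = I
print((np.count_nonzero(B, axis=1) == 1).all())   # True: one cluster per row
```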
The Karush-Kuhn-Tucker (KKT) function of the objective can be defined as:
\[ L(B, S, C) = J(B, S, C) - \mathrm{tr}(\Gamma_B B^{T}) - \mathrm{tr}(\Gamma_S S^{T}) - \mathrm{tr}(\Gamma_C C^{T}) + \mathrm{tr}\big(\Lambda_B (B^{T}B - I)\big) + \mathrm{tr}\big(\Lambda_C (CC^{T} - I)\big), \]
where $\Gamma_B$, $\Gamma_S$, $\Gamma_C$, $\Lambda_B$, and $\Lambda_C$ are the KKT multipliers.
An equivalent objective function to (2.2) was proposed by Ding et al. [9] to absorb the orthogonality constraints into the objective:
(2.3) \[ \min_{B, S, C} J(B, S, C) = \frac{1}{2} \| A - BSC \|_F^2 + \mathrm{tr}\big(\Lambda_B (B^{T}B - I)\big) + \mathrm{tr}\big(\Lambda_C (CC^{T} - I)\big) \quad \text{s.t.} \quad B \geq 0,\ S \geq 0,\ C \geq 0. \]
It is easy to see that both objective functions (2.2) and (2.3) have the same KKT function; however, they are not exactly the same, as the orthogonality constraints are absorbed into the minimization problem.
The KKT conditions for the objective in (2.3) are:
\[ B \geq 0, \quad S \geq 0, \quad C \geq 0, \]
\[ \nabla_B J \geq 0, \quad \nabla_S J \geq 0, \quad \nabla_C J \geq 0, \]
\[ \nabla_B J \odot B = 0, \quad \nabla_S J \odot S = 0, \quad \nabla_C J \odot C = 0, \]
where $\odot$ denotes the entrywise multiplication operation, and
\[ \nabla_B J = BSCC^{T}S^{T} - AC^{T}S^{T} + 2B\Lambda_B, \]
\[ \nabla_S J = B^{T}BSCC^{T} - B^{T}AC^{T}, \]
\[ \nabla_C J = S^{T}B^{T}BSC - S^{T}B^{T}A + 2\Lambda_C C. \]
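In practice, stationarity with respect to conditions of this kind can be checked numerically by measuring how far a point is from satisfying nonnegativity, gradient nonnegativity on the active set, and complementary slackness. The helper below is a generic sketch for a problem min J(X) s.t. X >= 0; the function name and tolerance are illustrative:

```python
import numpy as np

def kkt_residual(X, grad, tol=1e-8):
    """Largest violation of the KKT conditions for min J(X) s.t. X >= 0:
    X >= 0, grad >= 0 wherever X is (numerically) zero, and grad*X = 0."""
    r_nonneg = max(0.0, -float(X.min()))
    active = X <= tol
    r_active = max(0.0, -float(grad[active].min())) if active.any() else 0.0
    r_slack = float(np.abs(grad * X).max())
    return max(r_nonneg, r_active, r_slack)

# sanity check on a problem with a known solution: J(x) = 0.5*||x - t||^2
t = np.array([1.0, -2.0, 3.0])
x_star = np.maximum(t, 0.0)               # projection of t onto x >= 0
print(kkt_residual(x_star, x_star - t))   # 0.0
```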
Then, by using the same strategy as discussed in [24], Ding et al. [9] derived their BNMtF algorithm as follows:
(2.4) \[ B \leftarrow B \odot \sqrt{\frac{AC^{T}S^{T}}{BB^{T}AC^{T}S^{T}}}, \]
(2.5) \[ C \leftarrow C \odot \sqrt{\frac{S^{T}B^{T}A}{S^{T}B^{T}AC^{T}C}}, \]
(2.6) \[ S \leftarrow S \odot \sqrt{\frac{B^{T}AC^{T}}{B^{T}BSCC^{T}}}, \]
where the divisions and square roots are entrywise, and the multipliers $\Lambda_B$ and $\Lambda_C$ are derived exactly for the diagonal entries and only approximately for the off-diagonal entries by relaxing the nonnegativity constraints.
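For reference, the square-root multiplicative updates of Ding et al. [9] can be sketched in the tri-factor notation A ~ BSC used here (their paper writes the factorization as X ~ FSG^T; the mapping, the update order, and the `eps` guard are ours, so treat this as an illustrative transcription rather than a verified implementation):

```python
import numpy as np

def ding_bnmtf_step(A, B, S, C, eps=1e-9):
    """One sweep of the sqrt-type multiplicative updates of Ding et al. [9],
    written for A ~= B S C (divisions and sqrt are entrywise)."""
    B = B * np.sqrt((A @ C.T @ S.T) / (B @ B.T @ A @ C.T @ S.T + eps))
    C = C * np.sqrt((S.T @ B.T @ A) / (S.T @ B.T @ A @ C.T @ C + eps))
    S = S * np.sqrt((B.T @ A @ C.T) / (B.T @ B @ S @ C @ C.T + eps))
    return B, S, C

rng = np.random.default_rng(0)
A = rng.random((10, 8))
B, S, C = rng.random((10, 2)), rng.random((2, 2)), rng.random((2, 8))
for _ in range(5):
    B, S, C = ding_bnmtf_step(A, B, S, C)
```

Note that, consistent with the discussion in this section, nothing in this sketch guarantees that the objective value decreases.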
The complete BNMtF algorithm proposed in [9] is shown in Algorithm 3, where $\sigma$ denotes some small positive number (note that the normalization step is not recommended as it will change the objective value). As there are approximations in deriving $\Lambda_B$ and $\Lambda_C$, Algorithm 3 may or may not minimize the objective in (2.3). Further, the auxiliary function used by the authors to prove the nonincreasing property is for the update rules in (2.4)–(2.6), not for Algorithm 3, so there is no guarantee that Algorithm 3 has the nonincreasing property. Figure 2 shows the error per iteration of Algorithm 3, where the error is calculated using the objective value in (2.3). As shown, the assumptions used to obtain $\Lambda_B$ and $\Lambda_C$ are not acceptable: Algorithm 3 not only lacks the nonincreasing property but also fails to minimize the objective function.
3 A Convergent Algorithm for BNMtF
We define the BNMtF problem with the following equation:
(3.7) \[ \min_{B, S, C} J(B, S, C) = \frac{1}{2} \| A - BSC \|_F^2 + \frac{\alpha}{2} \| B^{T}B - I \|_F^2 + \frac{\beta}{2} \| CC^{T} - I \|_F^2 \quad \text{s.t.} \quad B \geq 0,\ S \geq 0,\ C \geq 0, \]
where $\alpha$ and $\beta$ denote some positive constants that adjust the degree of orthogonality of $B$ and $C$ respectively. The KKT function of the objective can be written as:
\[ L(B, S, C) = J(B, S, C) - \mathrm{tr}(\Gamma_B B^{T}) - \mathrm{tr}(\Gamma_S S^{T}) - \mathrm{tr}(\Gamma_C C^{T}), \]
where $\Gamma_B$, $\Gamma_S$, and $\Gamma_C$ are the KKT multipliers. And the KKT conditions are:
(3.8) \[ B \geq 0,\ S \geq 0,\ C \geq 0; \qquad \nabla_B J \geq 0,\ \nabla_S J \geq 0,\ \nabla_C J \geq 0; \qquad \nabla_B J \odot B = 0,\ \nabla_S J \odot S = 0,\ \nabla_C J \odot C = 0, \]
where
\[ \nabla_B J = BSCC^{T}S^{T} - AC^{T}S^{T} + 2\alpha B (B^{T}B - I), \]
\[ \nabla_S J = B^{T}BSCC^{T} - B^{T}AC^{T}, \]
\[ \nabla_C J = S^{T}B^{T}BSC - S^{T}B^{T}A + 2\beta (CC^{T} - I) C. \]
As shown by Lee & Seung [24], a multiplicative update rules (MUR) based algorithm can be derived by utilizing the complementary slackness in the KKT conditions (the last group in (3.8)). Therefore, a MUR based algorithm for our BNMtF problem can be written as:
\[ B \leftarrow B \odot \frac{AC^{T}S^{T} + 2\alpha B}{BSCC^{T}S^{T} + 2\alpha BB^{T}B}, \qquad S \leftarrow S \odot \frac{B^{T}AC^{T}}{B^{T}BSCC^{T}}, \qquad C \leftarrow C \odot \frac{S^{T}B^{T}A + 2\beta C}{S^{T}B^{T}BSC + 2\beta CC^{T}C}, \]
where the divisions are entrywise.
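The MUR scheme can be sketched numerically as follows, assuming the penalized objective J = 0.5*||A - BSC||_F^2 + (alpha/2)*||B^T B - I||_F^2 + (beta/2)*||C C^T - I||_F^2 (the matrix names, test sizes, and the `eps` guard are illustrative). With alpha = beta = 0 each sub-update is a standard Lee-Seung step on a nonnegative least squares subproblem, so the objective is nonincreasing:

```python
import numpy as np

def mub_step(A, B, S, C, alpha=0.0, beta=0.0, eps=1e-9):
    """One sweep of multiplicative updates derived from complementary
    slackness; alpha and beta weight the orthogonality penalties."""
    B = B * (A @ C.T @ S.T + 2 * alpha * B) / \
        (B @ S @ C @ C.T @ S.T + 2 * alpha * (B @ B.T @ B) + eps)
    S = S * (B.T @ A @ C.T) / (B.T @ B @ S @ C @ C.T + eps)
    C = C * (S.T @ B.T @ A + 2 * beta * C) / \
        (S.T @ B.T @ B @ S @ C + 2 * beta * (C @ C.T @ C) + eps)
    return B, S, C

rng = np.random.default_rng(0)
A = rng.random((15, 12))
B, S, C = rng.random((15, 3)), rng.random((3, 3)), rng.random((3, 12))
errs = []
for _ in range(30):
    B, S, C = mub_step(A, B, S, C)      # alpha = beta = 0 here
    errs.append(0.5 * np.linalg.norm(A - B @ S @ C) ** 2)
```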
The complete MUR algorithm is given in Algorithm 7, and the additive update rules (AUR) version is given in Algorithm 8 (see, e.g., [29, 30] for a detailed discussion of AUR based NMF algorithms). As shown, the AUR algorithm can be initialized using nonnegative matrices, as it does not inherit the zero locking problem from its MUR counterpart.
(3.9) \[ B \leftarrow B - \bar{B} \oslash \big( \bar{B}SCC^{T}S^{T} + 2\alpha \bar{B}\bar{B}^{T}\bar{B} + \delta_B \big) \odot \nabla_B J, \]
(3.10) \[ S \leftarrow S - \bar{S} \oslash \big( B^{T}B\bar{S}CC^{T} + \delta_S \big) \odot \nabla_S J, \]
(3.11) \[ C \leftarrow C - \bar{C} \oslash \big( S^{T}B^{T}BS\bar{C} + 2\beta \bar{C}\bar{C}^{T}\bar{C} + \delta_C \big) \odot \nabla_C J, \]
where $\oslash$ denotes entrywise division. The matrices $\bar{B}$, $\bar{S}$, and $\bar{C}$ in Algorithm 8 are modifications to avoid the zero locking problem. The following gives their definitions:
\[ \bar{b}_{ij} = \begin{cases} b_{ij} & \text{if } (\nabla_B J)_{ij} \geq 0, \\ \max(b_{ij}, \sigma) & \text{otherwise}, \end{cases} \]
with $\sigma$ a small positive number and $\bar{s}_{ij}$ and $\bar{c}_{ij}$ defined analogously; $\bar{B}$, $\bar{S}$, and $\bar{C}$ are the matrices that contain $\bar{b}_{ij}$, $\bar{s}_{ij}$, and $\bar{c}_{ij}$ respectively. And there are also the variables $\delta_B$, $\delta_S$, and $\delta_C$ in Algorithm 8. As shown in the appendix, these variables play a crucial role in guaranteeing the convergence of the algorithm.
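The effect of the max(·, σ) modification can be seen on a single entry: when an entry is zero and its partial derivative is negative, a multiplicative step can never move it away from zero (the zero locking problem), while the additive step with σ can. A scalar sketch, with illustrative variable names:

```python
def aur_entry(x, grad, denom, sigma=1e-8, delta=1e-12):
    """Additive update for one entry: x <- x - xbar/(denom + delta) * grad,
    where xbar = x if grad >= 0 and max(x, sigma) otherwise."""
    xbar = x if grad >= 0 else max(x, sigma)
    return x - xbar / (denom + delta) * grad

# an entry stuck at zero but with a descent direction (grad < 0)
print(aur_entry(0.0, grad=-1.0, denom=2.0) > 0)   # True: escapes zero
# with a nonnegative gradient at zero, the entry correctly stays at zero
print(aur_entry(0.0, grad=1.0, denom=2.0))        # 0.0
```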
Algorithm 9 shows the modifications to Algorithm 8 needed to guarantee convergence, as suggested by Theorems 30, 31, and 32, where step is a constant that determines how fast $\delta_B$, $\delta_S$, and $\delta_C$ grow in order to satisfy the nonincreasing property. Note that we set the same step value for all sequences, but different values can also be employed.
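The mechanism can be sketched schematically on one entry: after a trial update, if the objective increased, δ is multiplied by step and the update is retried; these retries are the inner iterations reported in section 4. The 1-D quadratic J and all constants here are invented for illustration and do not reproduce Algorithm 9 itself:

```python
def safeguarded_step(x, grad, denom, J, step=10.0, delta0=1e-12, max_inner=100):
    """Grow delta by a factor of `step` until the additive update
    does not increase the objective J."""
    delta = delta0
    for _ in range(max_inner):                 # the inner iterations
        x_new = max(x - x / (denom + delta) * grad, 0.0)
        if J(x_new) <= J(x):                   # nonincreasing property holds
            return x_new, delta
        delta *= step                          # damp the step and retry
    return x, delta

J = lambda x: 0.5 * (x - 1.0) ** 2             # toy objective, minimum at 1
x_new, delta = safeguarded_step(x=0.5, grad=0.5 - 1.0, denom=1e-6, J=J)
```

Here the raw step overshoots badly, so δ is grown several times before the nonincreasing condition is met.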
4 Experimental results
We will now analyze the convergence of the proposed Algorithm 7 (MUB) and Algorithm 9 (AUB) numerically. However, as it is generally difficult to reach a stationary point in an acceptable computational time, only the nonincreasing property (or lack of it) will be shown. And because BNMtF was originally designed for clustering purposes, we will also analyze this property. As the dataset, we use the Reuters-21578 data corpus. The dataset is especially interesting because many NMF-based clustering methods are tested using it, e.g., [33, 9, 38]. A detailed discussion of the dataset and the preprocessing steps can be found in our previous work [30]. In summary, datasets were formed by combining the top 2, 4, 6, 8, 10, and 12 classes from the corpus. The table below summarizes the statistics of these test datasets, where #doc, #word, %nnz, max, and min refer to the number of documents, the number of words, the percentage of nonzero entries, the maximum cluster size, and the minimum cluster size respectively. And as the corpus is bipartite, clustering can be done either for documents or for words; we will evaluate both document clustering and word clustering. For comparison, we use the following algorithms:

- the standard NMF algorithm [24] (LS),
- the original UNMF algorithm [9] (DU),
- the original BNMtF algorithm [9] (DB),
- the MUR based algorithm for UNMF, i.e., algorithm 3 in [30] (MUU), and
- the convergent algorithm for UNMF proposed in our previous work, i.e., algorithm 4 in [30] (AUU).
All experiments were developed in Octave under Linux platform using a notebook with 1.86 GHz Intel processor and 2 GB RAM.
Dataset  #doc  #word  %nnz  max  min 

Reuters2  6090  8547  0.363  3874  2216 
Reuters4  6797  9900  0.353  3874  333 
Reuters6  7354  10319  0.347  3874  269 
Reuters8  7644  10596  0.340  3874  144 
Reuters10  7887  10930  0.336  3874  114 
Reuters12  8052  11172  0.333  3874  75 
4.1 The nonincreasing property
Figures 5–8 show graphically the nonincreasing property (or lack of it) of MUB and AUB. Because there are two adjustable parameters, $\alpha$ and $\beta$, we fix one parameter while studying the other. Figures 5 and 6 show the results with $\beta$ fixed ($\alpha$ varied), and Figures 7 and 8 with $\alpha$ fixed ($\beta$ varied). As shown, while MUB fails to show the nonincreasing property for large $\alpha$ and $\beta$ values, AUB successfully preserves this property regardless of the $\alpha$ and $\beta$ values. Note that we set the same fixed parameter values for MUB and AUB in all experiments. A similar study for the UNMF algorithms can be found in our previous work [30].
Not surprisingly, there are computational consequences of this property for large $\alpha$ and/or $\beta$. Table 3 shows time comparisons between these algorithms for the Reuters4 dataset. Note that ($\alpha$) or ($\beta$) is appended to an algorithm's acronym to indicate which parameter is being varied; for example, AUB($\beta$) means AUB with fixed $\alpha$ and varied $\beta$. Note that LS, DU, and DB do not have these parameters. As shown, the computational times of the MU algorithms are practically independent of the $\alpha$ and $\beta$ values, while the AU algorithms seem to become slower for some large $\alpha$ or $\beta$. This is probably because for large $\alpha$ or $\beta$ values the AU algorithms execute the inner iterations (shown as loops). Also, there are some anomalies in the AUB($\alpha$) and AUB($\beta$) cases where, for some $\alpha$ or $\beta$ values, execution times are unexpectedly very fast. To investigate this, we display the number of iterations (#iter) and the number of inner iterations (#initer) for the AU algorithms in Table 4 (note that the MU algorithms reach the predefined maximum number of iterations, 20, in all cases). As displayed, in the cases where the AU algorithms performed worse than their MU counterparts, they executed the inner iterations. And when an AU algorithm performed better, its #iter is smaller than the #iter of the corresponding MU algorithm and the inner iterations were not executed. These explain the differences in computational times in Table 3.
α or β  MUU  AUU  MUB(α)  AUB(α)  MUB(β)  AUB(β) 

0.01  110  110  121  41.1  122  27.2 
0.05  110  110  121  40.9  121  40.7 
0.1  109  109  121  40.8  121  41.2 
0.3  110  109  121  40.4  121  41.1 
0.7  110  110  121  272  121  41.2 
1  110  110  121  40.8  121  273 
3  110  110  121  40.4  121  40.7 
7  110  110  121  40.4  121  273 
10  110  110  121  40.8  121  41.1 
30  109  110  121  272  121  442 
70  109  137  121  332  121  525 
100  110  232  121  382  121  605 
300  110  232  121  514  121  579 
700  110  461  121  607  121  606 
1000  110  411  121  606  121  365 
α or β  AUU  AUB(α)  AUB(β) 

#iter / #initer  #iter / #initer  #iter / #initer  
0.01  20 / 0  3 / 0  2 / 0 
0.05  20 / 0  3 / 0  3 / 0 
0.1  20 / 0  3 / 0  3 / 0 
0.3  20 / 0  3 / 0  3 / 0 
0.7  20 / 0  20 / 0  3 / 0 
1  20 / 0  3 / 0  20 / 0 
3  20 / 0  3 / 0  3 / 0 
7  20 / 0  3 / 0  20 / 0 
10  20 / 0  3 / 0  3 / 0 
30  20 / 0  20 / 0  20 / 44 
70  20 / 7  20 / 23  20 / 66 
100  20 / 32  20 / 22  20 / 88 
300  20 / 32  20 / 65  20 / 81 
700  20 / 92  20 / 75  20 / 88 
1000  20 / 79  20 / 90  20 / 24 
4.2 The maximum number of iterations
The maximum number of iterations is very crucial in MU and AU algorithms since these algorithms are known to be very slow [29, 14, 19, 21, 33, 1, 20, 18, 17, 22, 28]. As shown by Lin [28], LS is very fast at minimizing the objective over the first few iterations, but then tends to become slower. In Table 5, we display the errors over the first few iterations for LS, MUU, AUU, MUB, and AUB (DU and DB do not minimize the objective functions they intend to minimize, so it is irrelevant to investigate this property in their cases). Note that error0 refers to the initial error before the algorithms start running, and error$k$ is the error at the $k$-th iteration. As shown, all algorithms are exceptionally good at reducing the errors in the first iterations, but then the improvements are rather negligible with respect to the corresponding first improvements and the sizes of the datasets. Accordingly, we set the maximum number of iterations to 20. Note that, in this case, AUB converged at the third iteration.
error0  error1  error2  error3  error4  error5  

LS  1373  0.476  0.474  0.472  0.469  0.466 
MUU  4652  1.681  1.603  1.596  1.591  1.583 
AUU  4657  1.681  1.605  1.595  1.586  1.573 
MUB  12474  2.164  2.104  2.103  2.102  2.102 
AUB  12680  2.137  2.104  2.103     
4.3 Determining α and β
In the proposed algorithms, there are two dataset-dependent parameters, $\alpha$ and $\beta$, that have to be learned first. Because the orthogonal NMFs were introduced to improve the clustering capability of the standard NMF [9], these parameters will be learned based on clustering results on a test dataset; we use Reuters4 for this purpose. These parameters do not exist in the original orthogonal NMFs nor in other orthogonal NMF algorithms [39, 40, 6]. However, we notice that our formulations resemble the sparse NMF formulation [19, 21, 20], also known in the general case as constrained NMF [32]. As shown in [19, 20, 21], sparse NMF can usually give good results if $\alpha$ and/or $\beta$ are rather small positive numbers. To determine $\alpha$ and $\beta$, we evaluate the clustering qualities produced by our algorithms as the $\alpha$ or $\beta$ values grow, measured by the standard clustering metrics: mutual information (MI), entropy (E), purity (P), and F-measure (F). Detailed discussions of these metrics can be found in [30]. Note that while larger MI, F, and P indicate better results, smaller E indicates better results. As shown in Figure 9, for the UNMF algorithms (MUU and AUU) a rather small $\alpha$ seems to be a good choice, and acceptable settings can likewise be identified for MUB and for AUB. Based on these results, we decided to use the same $\alpha$ and $\beta$ settings for all datasets and algorithms.
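For concreteness, purity and entropy can be computed from the contingency table of cluster assignments versus reference classes; the sketch below uses the standard formulas (see [30] for the exact definitions used in the experiments, which may differ in normalization):

```python
import numpy as np

def purity(clusters, classes):
    """Fraction of points that belong to the majority class of their cluster."""
    clusters, classes = np.asarray(clusters), np.asarray(classes)
    total = sum(np.bincount(classes[clusters == k]).max()
                for k in np.unique(clusters))
    return total / len(classes)

def entropy(clusters, classes):
    """Cluster-size-weighted entropy of the per-cluster class distribution
    (not normalized by log2 of the number of classes)."""
    clusters, classes = np.asarray(clusters), np.asarray(classes)
    n, h = len(classes), 0.0
    for k in np.unique(clusters):
        size = (clusters == k).sum()
        p = np.bincount(classes[clusters == k]) / size
        p = p[p > 0]
        h += size / n * -(p * np.log2(p)).sum()
    return h

print(purity([0, 0, 1, 1], [1, 1, 0, 0]))   # 1.0 (perfect up to relabeling)
print(entropy([0, 0, 1, 1], [1, 1, 0, 0]))  # 0.0
```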
4.4 Times, #iterations, and errors
To evaluate the computational performance of the algorithms, we measure the average and maximum running times, the average and maximum #iterations, and the average and maximum errors produced at the last iterations over 10 trials. Tables 6–8 show the results.
As shown in Table 6, LS is generally the fastest; however, when MUB or AUB converge before reaching the maximum number of iterations (20), these algorithms outperform LS. Our uni-orthogonal algorithms (MUU and AUU) seem to have running times comparable to LS. MUB seems to be slower for the smaller datasets but performs better than MUU and AUU for the larger datasets, Reuters10 and Reuters12. Since AUB usually converges before reaching the maximum number of iterations (see Table 7), the comparison can be made using the maximum running times, for which full 20-iteration data is available for Reuters4, Reuters6, Reuters10, and Reuters12. As shown in Table 7, AUB is the slowest in computation per iteration. There are also abrupt increases in the running times at Reuters10 and Reuters12 for all algorithms, which is unfortunate since, as shown in the dataset statistics table, the sizes of these datasets are only slightly larger. Figure 10 shows the average running times as the sizes of the datasets grow.
Average and maximum errors at the last iterations are shown in Table 8. The results for DU and DB are, unsurprisingly, very high, as these algorithms do not minimize the objectives they are supposed to minimize. Because only the MUU & AUU and MUB & AUB pairs share the same objective, we compare average errors for these pairs in Figure 11 for each dataset. There is no significant difference between MUU & AUU in the average errors, but as shown in Figure 10, MUU has better average running times, especially for the larger datasets. For MUB & AUB, the differences in the average errors grow slightly as the size and the number of classes of the datasets grow, with significant differences occurring at Reuters10 and Reuters12, where MUB significantly outperformed AUB.
Data  Time  LS  DU  DB  MUU  AUU  MUB  AUB 

Reuters2  Av.  77.266  83.655  104.98  78.068  77.825  66.318  38.367 
Max.  79.031  84.743  106.25  79.075  79.176  83.960  49.477  
Reuters4  Av.  108.84  119.42  152.77  109.04  109.12  119.46  86.745 
Max.  109.39  119.55  153.17  109.20  109.28  119.72  271.40  
Reuters6  Av.  134.02  149.32  194.43  133.91  134.19  149.63  75.432 
Max.  134.50  149.62  194.75  134.27  134.51  149.95  327.70  
Reuters8  Av.  158.37  173.43  228.59  153.53  155.03  173.00  56.464 
Max.  181.58  175.71  235.54  155.15  159.19  174.05  59.021  
Reuters10  Av.  834.69  892.91  911.34  874.18  914.93  859.31  601.57 
Max.  1004.5  1141.2  1127.3  1137.5  1162.0  1059.0  2794.1  
Reuters12  Av.  1249.2  1348.4  1440.1  1319.7  1335.6  1309.0  1602.4 
Max.  1389.0  1590.4  1746.1  1565.7  1529.4  1506.7  4172.2 
Data  #iter.  LS  DU  DB  MUU  AUU  MUB  AUB 

Reuters2  Av.  20  20  20  20  20  16.2  4.9 
Max.  20  20  20  20  20  20  6  
Reuters4  Av.  20  20  20  20  20  20  7.2 
Max.  20  20  20  20  20  20  20  
Reuters6  Av.  20  20  20  20  20  20  5.5 
Max.  20  20  20  20  20  20  20  
Reuters8  Av.  20  20  20  20  20  20  4 
Max.  20  20  20  20  20  20  4  
Reuters10  Av.  20  20  20  20  20  20  5.6 
Max.  20  20  20  20  20  20  20  
Reuters12  Av.  20  20  20  20  20  20  8.8 
Max.  20  20  20  20  20  20  20 
Data  Error  LS  DU  DB  MUU  AUU  MUB  AUB 
Reuters2  Av.  1.3763  3435.6  3626.5  1.4106  1.4138  1.7955  1.8021 
Max.  1.3854  3587.2  3867.4  1.4201  1.4230  1.8022  1.8025  
Reuters4  Av.  1.4791  9152.8  8689.0  1.5299  1.5310  2.0708  2.0962 
Max.  1.4855  9474.9  9297.9  1.5408  1.5402  2.0880  2.1028  
Reuters6  Av.  1.5229  17135  15823  1.5844  1.5878  2.2627  2.2921 
Max.  1.5301  17971  16955  1.5884  1.5952  2.2758  2.2998  
Reuters8  Av.  1.5434  25913  22893  1.6215  1.6171  2.3863  2.4421 
Max.  1.5473  27462  25553  1.6342  1.6262  2.3993  2.4422  
Reuters10  Av.  1.5696  34154  30518  1.6533  1.6533  1.8836  2.5673 
Max.  1.5801  35236  35152  1.6662  1.6618  1.9529  2.5718  
Reuters12  Av.  1.5727  42739  37038  1.6620  1.6621  1.8860  2.6551 
Max.  1.5815  44325  41940  1.6705  1.6713  1.9193  2.6697 
4.5 Document clustering
The results of document clustering are shown in Tables 9–12. On average, MUU gives the best performance in all metrics, especially for datasets with a small number of clusters, followed by LS, AUU, and DU with small margins. LS seems to be better for datasets with a large number of clusters. Generally, MUU, LS, AUU, and DU give consistent results for varied numbers of clusters, but unfortunately this is not the case for DB, MUB, and AUB, which are all BNMtF algorithms. AUB in particular seems to offer only slightly better clustering than random assignment.
Data  LS  DU  DB  MUU  AUU  MUB  AUB 
Reuters2  0.40392  0.42487  0.36560  0.42150  0.057799  0.00087646  
Reuters4  0.62879  0.61723  0.48007  0.63640  0.32142  0.072621  
Reuters6  0.79459  0.81831  0.52498  0.81811  0.37924  0.078201  
Reuters8  0.92285  0.90260  0.54534  0.92720  0.48435  0.013518  
Reuters10  1.0275  0.62125  1.0063  1.0138  0.50980  0.072014  
Reuters12  1.0865  0.58469  1.1195  1.0821  0.47697  0.16389  
Average  0.82071  0.81283  0.52032  0.81754  0.37160  0.066853 
Data  LS  DU  DB  MUU  AUU  MUB  AUB 
Reuters2  0.54193  0.52098  0.58025  0.52435  0.88805  0.94498  
Reuters4  0.40202  0.40780  0.47638  0.39822  0.55571  0.68011  
Reuters6  0.38391  0.37473  0.48821  0.37481  0.54459  0.66105  
Reuters8  0.35568  0.36242  0.48151  0.35423  0.50184  0.65879  
Reuters10  0.34023  0.46253  0.34661  0.34434  0.49608  0.62786  
Reuters12  0.33239  0.47236  0.32319  0.33362  0.50241  0.58974  
Average  0.38985  0.389760  0.49354  0.38787  0.58145  0.69375 
Data  LS  DU  DB  MUU  AUU  MUB  AUB 

Reuters2  0.82154  0.83599  0.80452  0.82507  0.66102  0.63612  
Reuters4  0.79417  0.78023  0.73778  0.79704  0.70119  0.59657  
Reuters6  0.74510  0.68844  0.74868  0.75069  0.66433  0.54569  
Reuters8  0.73982  0.66536  0.74869  0.73987  0.65033  0.50680  
Reuters10  0.73120  0.64845  0.72813  0.73330  0.63194  0.50639  
Reuters12  0.73877  0.72719  0.62223  0.72340  0.60118  0.52019  
Average  0.76331  0.76207  0.69446  0.76156  0.65166  0.55196 
Data  LS  DU  DB  MUU  AUU  MUB  AUB 
Reuters2  0.81904  0.83234  0.79163  0.82241  0.58237  0.50399  
Reuters4  0.56154  0.53754  0.44352  0.54267  0.36917  0.24585  
Reuters6  0.46225  0.47714  0.33910  0.47270  0.26372  0.17171  
Reuters8  0.40408  0.40554  0.25052  0.41822  0.23904  0.10869  
Reuters10  0.38001  0.23309  0.36923  0.35947  0.19552  0.094912  
Reuters12  0.35671  0.17387  0.35214  0.34435  0.16401  0.099949  
Average  0.49727  0.49851  0.37196  0.49526  0.30231  0.20418 
4.6 Word clustering
In some cases, the ability of clustering methods to simultaneously group similar documents with related words (co-clustering) can be an added value. And because the original BNMtF is designed to have this ability, we will also investigate the quality of word clustering (in the context of co-clustering) produced by all algorithms. Since word clustering has no reference classes, we adopt the idea from [9] where the authors proposed to create reference classes by using word frequencies: each word is assigned to the class in which the word appears with the highest frequency. Tables 13–16 show the results. As shown, DU has the best overall results, followed by LS, MUU, and AUU by small margins. MUU is especially good for small numbers of clusters and LS is good for large numbers of clusters. However, all BNMtF algorithms, DB, MUB, and AUB, which are designed to accommodate the co-clustering task, seem to have poor results. These results are in accord with the document clustering results, where the BNMtFs also perform poorly.
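The reference-class construction borrowed from [9] can be sketched as follows, assuming a word-by-document frequency matrix and document class labels (the matrix, labels, and names are invented for the example):

```python
import numpy as np

def word_reference_classes(W, doc_labels, n_classes):
    """Assign each word to the class in whose documents it appears with the
    highest total frequency, following the idea of Ding et al. [9].
    W is a word-by-document frequency matrix."""
    doc_labels = np.asarray(doc_labels)
    class_freq = np.stack([W[:, doc_labels == c].sum(axis=1)
                           for c in range(n_classes)], axis=1)
    return class_freq.argmax(axis=1)

W = np.array([[5, 4, 0, 1],    # word 0 occurs mostly in class-0 documents
              [0, 1, 3, 6]])   # word 1 occurs mostly in class-1 documents
labels = word_reference_classes(W, doc_labels=[0, 0, 1, 1], n_classes=2)
print(labels)                  # [0 1]
```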
Data  LS  DU  DB  MUU  AUU  MUB  AUB 

Reuters2  0.15715  0.16609  0.12966  0.14978  0.013995  0.00029807  
Reuters4  0.42558  0.39193  0.21495  0.41663  0.11812  0.026943  
Reuters6  0.54112  0.26971  0.54239  0.54828  0.12460  0.035309  
Reuters8  0.63022  0.63368  0.29277  0.64699  0.15692  0.0037071  
Reuters10  0.70386  0.33046  0.66262  0.68367  0.025320  0.029618  
Reuters12  0.80111  0.77959  0.28412  0.76128  0.73517  0.013483  0.073478 
Average  0.54317  0.25361  0.53549  0.53188  0.075407  0.028226 
Data  LS  DU  DB  MUU  AUU  MUB  AUB 

Reuters2  0.76778  0.75884  0.79527  0.77515  0.91094  0.92463  
Reuters4  0.62965  0.64647  0.73496  0.63412  0.78338  0.82897  
Reuters6  0.56184  0.66683  0.56134  0.55906  0.72297  0.75751  
Reuters8  0.52006  0.51891  0.63255  0.51447  0.67783  0.72890  
Reuters10  0.50612  0.61852  0.51853  0.51220  0.71038  0.70909  
Reuters12  0.48811  0.62632  0.49322  0.50050  0.70181  0.68507  
Average  0.57792  0.67908  0.57806  0.58199  0.75122  0.77236 
Data  LS  DU  DB  MUU  AUU  MUB  AUB 

Reuters2  0.76987  0.77082  0.75378  0.76021  0.67006  0.65988  
Reuters4  0.64400  0.62881  0.60566  0.64184  0.55808  0.53116  
Reuters6  0.59830  0.55949  0.59763  0.59103  0.52966  0.49661  
Reuters8  0.58935  0.54296  0.59179  0.58770  0.50933  0.46499  
Reuters10  0.58123  0.51576  0.57045  0.58724  0.44765  0.45395  
Reuters12  0.59563  0.49555  0.58628  0.56846  0.43611  0.44882  
Average  0.63185  0.57887  0.62837  0.62274  0.52515  0.50923 
Data  LS  DU  DB  MUU  AUU  MUB  AUB 
Reuters2  0.59287  0.59471  0.58733  0.59427  0.52628  0.49976  
Reuters4  0.46891  0.43469  0.36397  0.46180  0.32520  0.27101  
Reuters6  0.37490  0.38365  0.27356  0.38026  0.21620  0.17572  
Reuters8  0.32488  0.32674  0.20820  0.33527  0.17127  0.12565  
Reuters10  0.29864  0.18626  0.28930  0.28573  0.10700  0.10545  
Reuters12  0.29072  0.14255  0.27525  0.27380  0.088517  0.095880  
Average  0.39189  0.38970  0.29365  0.38973  0.23908  0.21224 
5 Conclusions
We have presented BNMtF algorithms based on the additive update rules with a rigorous convergence proof.
The only way to numerically evaluate whether the algorithm has converged to a stationary point is to check whether it satisfies the KKT conditions at that point. While the nonnegativity conditions are easy to check, the complementary slackness conditions are hard, since we must check them for all iterations $k \geq k^*$, where $k^*$ denotes the first iteration number at which stationarity is reached. Not only are there some large matrix multiplications which can be numerically inaccurate, but we must also make sure that the stationary point is reachable in a reasonable amount of time. Accordingly, only the nonincreasing property was evaluated, and as shown in section 4.1, the convergent version of our algorithm keeps this property even for large $\alpha$ or $\beta$.
The maximum number of iterations is an important issue in the multiplicative and additive update rules based NMF algorithms since these algorithms are known to be slow. As shown in Table 5, the multiplicative and additive update rules based algorithms were exceptionally good at reducing the errors in the first iterations, but the errors were only slightly reduced in the remaining iterations. This led us to use 20 iterations as the maximum number of iterations. Because this is a rather small number, it is very likely that the algorithms stop before reaching stationarity.
There were differences in the running times of the algorithms, but they were not significant since all algorithms have the same computational complexity, i.e., $O(MNR)$ per iteration times the number of iterations, where $M \times N$ denotes the size of the data matrix and $R$ denotes the number of decomposition factors.
The document clustering results favoured our MUU algorithm proposed in [30], which showed the best average performance for all metrics, followed closely by LS, AUU, and DU. MUU was especially good for small numbers of clusters and LS for large numbers of clusters. It is possible that because we learned $\alpha$ and $\beta$ from the Reuters4 dataset, MUU performed best on the small datasets. All BNMtF algorithms, DB, MUB, and AUB, performed rather poorly on these datasets. These results conflict with some previous works where it was reported that DB outperformed LS and DU [9, 27].
The results of word clustering are not as conclusive as those of document clustering since there is no prior labelling for the words; hence, we used the strategy introduced in [9] to label the words. In this task, DU offered the best overall performance, followed closely by LS, MUU, and AUU. As in the document clustering, all BNMtF algorithms performed poorly.
Acknowledgment. We are very grateful to the referees for their most interesting comments and suggestions.
References
 [1] M.W. Berry, M. Browne, A. Langville, V.P. Pauca, and R.J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155–173, 2007.
 [2] N. Bertin, R. Badeau, and E. Vincent. Enforcing harmonicity and smoothness in Bayesian nonnegative matrix factorization applied to polyphonic music transcription. IEEE Transactions on Audio, Speech, and Language Processing, 18(3):538–549, 2010.
 [3] A. Bertrand and M. Moonen. Blind separation of nonnegative source signals using multiplicative updates and subspace projection. Signal Processing, 90(10):2877–2890, 2010.
 [4] J.P. Brunet, P. Tamayo, T.R. Golub, and J.P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl. Acad. Sci. USA, 101(12):4164–4169, 2004.
 [5] P. Carmona-Saez, R.D. Pascual-Marqui, F. Tirado, J.M. Carazo, and A. Pascual-Montano. Biclustering of gene expression data by non-smooth nonnegative matrix factorization. BMC Bioinformatics, 7(1):78, 2006.
 [6] S. Choi. Algorithms for orthogonal nonnegative matrix factorization. In Proc. IEEE Int’l Joint Conf. on Neural Networks, pages 1828–1832, 2008.
 [7] A. Cichocki, S. Amari, R. Zdunek, R. Kompass, G. Hori, and Z. He. Extended smart algorithms for nonnegative matrix factorization. In Proc. of the 8th international conference on Artificial Intelligence and Soft Computing, pages 548–562, 2006.
 [8] K. Devarajan. Nonnegative matrix factorization: An analytical and interpretive tool in computational biology. PLoS Computational Biology, 4(7):e1000029, 2008.
 [9] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In Proc. 12th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, pages 126–135, 2006.
 [10] P. Fogel, S.S. Young, D.M. Hawkins, and N. Ledirac. Inferential, robust nonnegative matrix factorization analysis of microarray data. Bioinformatics, 23(1):44–49, 2007.
 [11] Y. Gao and G. Church. Improving molecular cancer class discovery through sparse nonnegative matrix factorization. Bioinformatics, 21(21):3970–3975, 2005.
 [12] N. Gillis and F. Glineur. A multilevel approach for nonnegative matrix factorization. Journal of Computational and Applied Mathematics, 236(7):1708–1723, 2012.
 [13] L. Grippo and M. Sciandrone. On the convergence of the block nonlinear Gauss-Seidel method under convex constraints. Operations Research Letters, 26(3):127–136, 2000.
 [14] P.O. Hoyer. Nonnegative matrix factorization with sparseness constraints. The Journal of Machine Learning Research, 5:1457–1469, 2004.
 [15] K. Inamura, T. Fujiwara, Y. Hoshida, T. Isagawa, M.H. Jones, C. Virtanen, M. Shimane, Y. Satoh, S. Okumura, K. Nakagawa, E. Tsuchiya, S. Ishikawa, H. Aburatani, H. Nomura, and Y. Ishikawa. Two subclasses of lung squamous cell carcinoma with different gene expression profiles and prognosis identified by hierarchical clustering and nonnegative matrix factorization. Oncogene, 24(47):7105–7113, 2005.
 [16] S. Jia and Y. Qian. Constrained nonnegative matrix factorization for hyperspectral unmixing. IEEE Transactions on Geoscience and Remote Sensing, 47(1):161–173, 2009.
 [17] D. Kim, S. Sra, and I.S. Dhillon. Fast newtontype methods for the least squares nonnegative matrix approximation problem. In in Data Mining, Proceedings of SIAM Conference on, pages 343–354, 2007.
 [18] D. Kim, S. Sra, and I.S. Dhillon. Fast projection-based methods for the least squares nonnegative matrix approximation problem. Stat. Anal. Data Min., 1:38–51, February 2008.
 [19] H. Kim and H. Park. Sparse nonnegative matrix factorizations via alternating nonnegativity constrained least squares for microarray data analysis. Bioinformatics, 23(12):1495–1502, 2007.
 [20] H. Kim and H. Park. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl., 30(2):713–730, 2008.
 [21] J. Kim and H. Park. Sparse nonnegative matrix factorization for clustering. Technical report, CSE Technical Reports, Georgia Institute of Technology, 2008.
 [22] J. Kim and H. Park. Toward faster nonnegative matrix factorization: A new algorithm and comparisons. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 353–362, 2008.
 [23] D.D. Lee and H.S. Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401:788–791, 1999.
 [24] D.D. Lee and H.S. Seung. Algorithms for nonnegative matrix factorization. In Proc. Advances in Neural Information Processing Systems, pages 556–562, 2000.
 [25] H. Li, T. Adali, W. Wang, and D. Emge. Kernel k-means: spectral clustering and normalized cuts. In Proc. IEEE Workshop on Machine Learning for Signal Processing, pages 253–258, 2005.
 [26] S.Z. Li, X.W. Hou, H.J. Zhang, and Q.S. Cheng. Learning spatially localized, parts-based representation. In Proc. IEEE Comp. Soc. Conf. on Computer Vision and Pattern Recognition, pages 207–212, 2001.
 [27] T. Li and C. Ding. The relationships among various nonnegative matrix factorization methods for clustering. In Proc. IEEE 6th Int’l Conf. on Data Mining, pages 362–371, 2006.
 [28] C.J. Lin. Projected gradient methods for nonnegative matrix factorization. Technical Report ISSTECH-95-013, Department of CS, National Taiwan University, 2005.
 [29] C.J. Lin. On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Transactions on Neural Networks, 18(6):1589–1596, 2007.
 [30] A. Mirzal. A convergent algorithm for orthogonal nonnegative matrix factorization. Journal of Computational and Applied Mathematics, 260:149–166, 2014.
 [31] A. Pascual-Montano, J.M. Carazo, K. Kochi, D. Lehman, and R.D. Pascual-Marqui. Nonsmooth nonnegative matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(3):403–415, 2006.
 [32] V.P. Pauca, J. Piper, and R.J. Plemmons. Nonnegative matrix factorization for spectral data analysis. Linear Algebra and Its Applications, 416(1):29–47, 2006.
 [33] F. Shahnaz, M.W. Berry, V.P. Pauca, and R.J. Plemmons. Document clustering using nonnegative matrix factorization. Information Processing & Management, 42(2):373–386, 2006.
 [34] T. Virtanen, A.T. Cemgil, and S. Godsill. Bayesian extensions to nonnegative matrix factorisation for audio signal modelling. In IEEE Int’l Conf. on Acoustics, Speech and Signal Processing, pages 1825–1828, 2008.
 [35] D. Wang and H. Lu. Online learning parts-based representation via incremental orthogonal projective nonnegative matrix factorization. Signal Processing, 93(6):1608–1623, 2013.
 [36] G. Wang, A.V. Kossenkov, and M.F. Ochs. LS-NMF: A modified nonnegative matrix factorization algorithm utilizing uncertainty estimates. BMC Bioinformatics, 7(1):175, 2006.
 [37] J.J.Y. Wang, X. Wang, and X. Gao. Nonnegative matrix factorization by maximizing correntropy for cancer clustering. BMC Bioinformatics, 14(1):107, 2013.
 [38] W. Xu, X. Liu, and Y. Gong. Document clustering based on nonnegative matrix factorization. In Proc. ACM SIGIR, pages 267–273, 2003.
 [39] J. Yoo and S. Choi. Orthogonal nonnegative matrix factorization: Multiplicative updates on Stiefel manifolds. In Proceedings of the 9th International Conference on Intelligent Data Engineering and Automated Learning, pages 140–147. Springer-Verlag, 2008.
 [40] J. Yoo and S. Choi. Orthogonal nonnegative matrix tri-factorization for co-clustering: Multiplicative updates on Stiefel manifolds. Inf. Process. Manage., 46(5):559–570, 2010.
 [41] N. Yuvaraj and P. Vivekanandan. An efficient SVM based tumor classification with symmetry nonnegative matrix factorization using gene expression data. In Proc. Int’l Conf. on Information Communication and Embedded Systems, pages 761–768, 2013.
 [42] C.H. Zheng, D.S. Huang, D. Zhang, and X.Z. Kong. Tumor clustering using nonnegative matrix factorization with gene selection. IEEE Transactions on Information Technology in Biomedicine, 13(4):599–607, 2009.
 [43] G. Zhou, Z. Yang, and S. Xie. Online blind source separation using incremental nonnegative matrix factorization with volume constraint. IEEE Transactions on Neural Networks, 22(4):550–560, 2011.
A Convergence analysis
From results on the convergence of the block coordinate descent method [29, 28, 13], algorithm LABEL:algorithm8 has a convergence guarantee if the following conditions are satisfied:

the sequence $\{J(\mathbf{B}^{(k)},\mathbf{S}^{(k)},\mathbf{C}^{(k)})\}$ of objective values is nonincreasing,

any limit point of the sequence $\{(\mathbf{B}^{(k)},\mathbf{S}^{(k)},\mathbf{C}^{(k)})\}$ generated by algorithm LABEL:algorithm8 is a stationary point, and

the sequence $\{(\mathbf{B}^{(k)},\mathbf{S}^{(k)},\mathbf{C}^{(k)})\}$ has at least one limit point.
Because algorithm LABEL:algorithm8 uses the alternating strategy, the sequences $\{\mathbf{B}^{(k)}\}$, $\{\mathbf{S}^{(k)}\}$, and $\{\mathbf{C}^{(k)}\}$ can be analyzed separately [24, 29]. And because the update rule for $\mathbf{B}$ in eq. LABEL:eq98 is analogous to the update rule for $\mathbf{C}$ in eq. LABEL:eq99, it suffices to prove the nonincreasing property for one of them.
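As a numerical illustration of why the alternating strategy can be analyzed factor by factor, the sketch below runs plain multiplicative tri-factorization updates and checks that the objective sequence is nonincreasing. This is a minimal stand-in, not algorithm LABEL:algorithm8 itself: it assumes a factorization $\mathbf{A}\approx\mathbf{BSC}$ with the Frobenius objective $\tfrac{1}{2}\|\mathbf{A}-\mathbf{BSC}\|_F^2$, no orthogonality penalties, and no $\delta$-modification.

```python
import numpy as np

rng = np.random.default_rng(0)

def tri_factorize(A, p, q, iters=50, eps=1e-12):
    """Unconstrained nonnegative tri-factorization A ~ B S C via alternating
    multiplicative updates; returns the objective value after each sweep."""
    M, N = A.shape
    B = rng.random((M, p))
    S = rng.random((p, q))
    C = rng.random((q, N))
    history = []
    for _ in range(iters):
        # Each line is a Lee-Seung-style majorization step for one factor,
        # holding the other two factors fixed (eps guards against 0/0).
        B *= (A @ C.T @ S.T) / (B @ S @ C @ C.T @ S.T + eps)
        S *= (B.T @ A @ C.T) / (B.T @ B @ S @ C @ C.T + eps)
        C *= (S.T @ B.T @ A) / (S.T @ B.T @ B @ S @ C + eps)
        history.append(0.5 * np.linalg.norm(A - B @ S @ C, "fro") ** 2)
    return history

A = rng.random((30, 20))
hist = tri_factorize(A, p=5, q=4)
# Largest increase between consecutive sweeps; should be <= 0 up to round-off.
print(float(np.diff(hist).max()))
```

Each of the three updates is the standard multiplicative majorization step for its own subproblem, which is why the recorded objective values never increase (up to the small eps guard in the denominators).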
A.1 Nonincreasing property of $J(\mathbf{C})$
By using the auxiliary function approach [24], the nonincreasing property of $J(\mathbf{C})$ can be proven if the following statement is true:
$$J(\mathbf{C}^{(k+1)}) \le G\big(\mathbf{C}^{(k+1)},\mathbf{C}^{(k)}\big) \le G\big(\mathbf{C}^{(k)},\mathbf{C}^{(k)}\big) = J(\mathbf{C}^{(k)}).$$
To define $G$, let us rearrange $\mathbf{C}$ into:
$$\bar{\mathbf{c}} \equiv \big[\tilde{\mathbf{c}}_1,\ldots,\tilde{\mathbf{c}}_q\big]^T,$$
where $\tilde{\mathbf{c}}_i$ is the $i$-th row of $\mathbf{C}$. And also let us define:
$$\nabla J(\bar{\mathbf{c}}) \equiv \big[\nabla J(\tilde{\mathbf{c}}_1),\ldots,\nabla J(\tilde{\mathbf{c}}_q)\big]^T,$$
where $\nabla J(\tilde{\mathbf{c}}_i)$ is the $i$-th row of $\nabla J(\mathbf{C})$. Then we define $\mathbf{D}(\bar{\mathbf{c}}^{(k)})$, a diagonal matrix whose diagonal entries are chosen so that minimizing the resulting auxiliary function reproduces the update rule for $\mathbf{C}$; here $\mathcal{I}_i$ denotes the set of non-KKT indices in the $i$-th row of $\mathbf{C}^{(k)}$, and the division symbol is defined so that $0/0 \equiv 0$ and $a/0 \equiv \infty$ for $a > 0$.
Then, the auxiliary function can be defined as:
(A.12) $G(\bar{\mathbf{c}}, \bar{\mathbf{c}}^{(k)}) \equiv J(\bar{\mathbf{c}}^{(k)}) + (\bar{\mathbf{c}} - \bar{\mathbf{c}}^{(k)})^T \nabla J(\bar{\mathbf{c}}^{(k)}) + \frac{1}{2}(\bar{\mathbf{c}} - \bar{\mathbf{c}}^{(k)})^T \mathbf{D}(\bar{\mathbf{c}}^{(k)})(\bar{\mathbf{c}} - \bar{\mathbf{c}}^{(k)}).$
Note that $J(\bar{\mathbf{c}})$ and $\nabla J(\bar{\mathbf{c}})$ are equivalent to $J(\mathbf{C})$ and $\nabla J(\mathbf{C})$ with $\mathbf{C}$ rearranged into $\bar{\mathbf{c}}$, and other variables are reordered accordingly. By definition, $\mathbf{D}(\bar{\mathbf{c}}^{(k)})$ is positive definite for all $\bar{\mathbf{c}}^{(k)}$ that do not satisfy the KKT conditions, so $G(\bar{\mathbf{c}}, \bar{\mathbf{c}}^{(k)})$ is a strictly convex function of $\bar{\mathbf{c}}$ and consequently has a unique minimum:
(A.13) $\bar{\mathbf{c}}^{(k+1)} = \bar{\mathbf{c}}^{(k)} - \mathbf{D}(\bar{\mathbf{c}}^{(k)})^{-1} \nabla J(\bar{\mathbf{c}}^{(k)}),$
which is exactly the update rule for $\mathbf{C}$.
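To make the last step concrete: when each diagonal entry of $\mathbf{D}$ equals the positive part of the gradient divided by the current entry, the additive step $\bar{\mathbf{c}}^{(k)} - \mathbf{D}^{-1}\nabla J(\bar{\mathbf{c}}^{(k)})$ collapses into an elementwise multiplicative rule. Below is a small numerical check of this identity using a plain two-factor least-squares term; the matrices and shapes are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((8, 6))
B = rng.random((8, 3))
C = rng.random((3, 6))

grad = B.T @ B @ C - B.T @ A  # gradient of 0.5*||A - BC||^2 w.r.t. C
num = B.T @ A                 # negative part of the gradient
den = B.T @ B @ C             # positive part of the gradient

# Additive step x - D^{-1} grad with diagonal scaling D_ij = den_ij / C_ij:
additive = C - (C / den) * grad
# Classical multiplicative rule:
multiplicative = C * num / den

print(np.allclose(additive, multiplicative))  # prints True
```

Since grad = den - num elementwise, the two expressions are algebraically identical; this is exactly the mechanism by which the minimizer of the auxiliary function turns into a multiplicative update.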
By using the Taylor series expansion, $J(\bar{\mathbf{c}})$ can also be written as:
(A.14) $J(\bar{\mathbf{c}}) = J(\bar{\mathbf{c}}^{(k)}) + (\bar{\mathbf{c}} - \bar{\mathbf{c}}^{(k)})^T \nabla J(\bar{\mathbf{c}}^{(k)}) + \frac{1}{2}(\bar{\mathbf{c}} - \bar{\mathbf{c}}^{(k)})^T \mathbf{H} (\bar{\mathbf{c}} - \bar{\mathbf{c}}^{(k)}),$
where $\mathbf{H}$ is the block-diagonal matrix whose blocks are the Hessians of the row-wise subproblems, arranged along its diagonal (there are $q$ such blocks).
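The role of $\mathbf{H}$ versus $\mathbf{D}$ can be illustrated numerically: whenever $\mathbf{D} - \mathbf{H}$ is positive semidefinite, the quadratic model built with curvature $\mathbf{D}$ dominates the one built with curvature $\mathbf{H}$ at every point. The toy sketch below uses a generic PSD Hessian and the simple choice $\mathbf{D} = \mathbf{H} + \mathbf{I}$ (an illustrative assumption, not the paper's $\mathbf{D}$):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
M = rng.random((n, n))
H = M @ M.T                       # a PSD stand-in for the Hessian
g = rng.standard_normal(n)        # a stand-in for the gradient at xk
xk = rng.random(n)

def F(x):
    # second-order expansion of the objective around xk (exact for a quadratic)
    d = x - xk
    return g @ d + 0.5 * d @ H @ d

def G(x):
    # same expansion, but with curvature D = H + I, so D - H = I is PSD
    D = H + np.eye(n)
    d = x - xk
    return g @ d + 0.5 * d @ D @ d

xs = rng.standard_normal((1000, n)) + xk
gaps = np.array([G(x) - F(x) for x in xs])
print(bool((gaps >= -1e-12).min()))  # prints True: G majorizes F everywhere
```

Here the gap is exactly $\tfrac{1}{2}\|\mathbf{x}-\mathbf{x}^{(k)}\|^2 \ge 0$, which mirrors the quadratic-form argument used in eq. (A.15) below with $\mathbf{D}-\mathbf{H}$ in place of $\mathbf{I}$.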
To show the nonincreasing property of $J(\mathbf{C})$, the following statements must be proven:

$G(\bar{\mathbf{c}}, \bar{\mathbf{c}}) = J(\bar{\mathbf{c}})$,

$G(\bar{\mathbf{c}}^{(k+1)}, \bar{\mathbf{c}}^{(k)}) \le G(\bar{\mathbf{c}}^{(k)}, \bar{\mathbf{c}}^{(k)})$,

$G(\bar{\mathbf{c}}, \bar{\mathbf{c}}^{(k)}) \ge J(\bar{\mathbf{c}})$, and

$J(\bar{\mathbf{c}}^{(k+1)}) = J(\bar{\mathbf{c}}^{(k)})$ if and only if $\bar{\mathbf{c}}^{(k)}$ satisfies the KKT conditions.
The first and second statements are obvious from the definition of $G$ in eq. LABEL:eq109; the third and the fourth statements will be proven in theorems LABEL:theorem19 and LABEL:theorem20, and the boundedness of $\mathbf{B}^{(k)}$, $\mathbf{S}^{(k)}$, and $\mathbf{C}^{(k)}$ will be proven in theorem LABEL:theorem32.
Theorem A.1.
Given a sufficiently large $\delta$ and the boundedness of $\mathbf{B}^{(k)}$, $\mathbf{S}^{(k)}$, and $\mathbf{C}^{(k)}$, it can be shown that $G(\bar{\mathbf{c}}, \bar{\mathbf{c}}^{(k)}) \ge J(\bar{\mathbf{c}})$. Moreover, the equality holds if and only if $\bar{\mathbf{c}}^{(k)}$ satisfies the KKT conditions.
Proof.
By subtracting eq. LABEL:eq111 from eq. LABEL:eq109, we get:
(A.15) $G(\bar{\mathbf{c}}, \bar{\mathbf{c}}^{(k)}) - J(\bar{\mathbf{c}}) = \frac{1}{2}(\bar{\mathbf{c}} - \bar{\mathbf{c}}^{(k)})^T \big(\mathbf{D}(\bar{\mathbf{c}}^{(k)}) - \mathbf{H}\big)(\bar{\mathbf{c}} - \bar{\mathbf{c}}^{(k)}).$
Let $\boldsymbol{\Delta} \equiv \mathbf{D}(\bar{\mathbf{c}}^{(k)}) - \mathbf{H}$; then:
where <