A Convergent Algorithm for Bi-orthogonal Nonnegative Matrix Tri-Factorization

Received by the editors on Date of Submission. Accepted for publication on Date of Acceptance. Handling Editor: Handling Editor.

Andri Mirzal, Department of Innovation and Technology Management, College of Graduate Studies, Arabian Gulf University, Manama, Bahrain (andrim@agu.edu.bh).
Abstract

We extend our previous work on a convergent algorithm for uni-orthogonal nonnegative matrix factorization (UNMF) to the case where the data matrix is decomposed into three factors, two of which are constrained to be orthogonal while the third absorbs the approximation error. Because of the way the factorization is performed, we refer to it as bi-orthogonal nonnegative matrix tri-factorization (BNMtF). This factorization was first introduced by Ding et al. [9] with the intent of further improving the clustering capability of their version of UNMF. However, as shown in this paper, their BNMtF algorithm not only lacks the convergence property but also fails to minimize the objective function it is intended to minimize. We tackle this problem by utilizing a technique presented in our previous work and prove that our algorithm converges to a stationary point inside the solution space. As a practical demonstration, the proposed algorithm is applied to clustering a text corpus; however, contrary to the claim in the original work, both BNMtF algorithms (the original one by Ding et al. [9] and our proposed algorithm) perform poorly compared to the standard NMF algorithm by Lee & Seung [24] and our UNMF algorithm based on multiplicative update rules. This implies that the additional complexity introduced by BNMtF, originally intended to improve the clustering capability of UNMF, is probably unnecessary.

Key words. nonnegative matrix factorization, orthogonality constraint, convergent algorithm, clustering methods.

AMS subject classifications. 65F30, 15A23.

1 Introduction

The nonnegative matrix factorization (NMF) was popularized by the work of Lee & Seung [23] in which they showed that this technique can be used to learn parts of faces and semantic features of text. In its basic form, the NMF seeks to decompose a nonnegative matrix into a pair of other nonnegative matrices with lower ranks:

$$\mathbf{A} \approx \mathbf{B}\mathbf{C},$$

where $\mathbf{A} \in \mathbb{R}_+^{M \times N}$ denotes the data matrix, $\mathbf{B} \in \mathbb{R}_+^{M \times K}$ denotes the basis matrix, $\mathbf{C} \in \mathbb{R}_+^{K \times N}$ denotes the coefficient matrix, $K$ denotes the number of factors which usually is chosen so that $K \ll \min(M, N)$, and $\mathbb{R}_+^{M \times N}$ denotes the set of $M$-by-$N$ nonnegative real matrices. The Frobenius norm is usually used as a distance measure to compute $\mathbf{B}$ and $\mathbf{C}$ in the following minimization problem:

(1.1) $$\min_{\mathbf{B},\,\mathbf{C}}\; J(\mathbf{B},\mathbf{C}) = \frac{1}{2}\bigl\|\mathbf{A} - \mathbf{B}\mathbf{C}\bigr\|_F^2 \quad \text{s.t.} \quad \mathbf{B} \geq 0,\ \mathbf{C} \geq 0,$$

where the symbol $\geq$ denotes entrywise comparison.
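To make problem (1.1) concrete, the following is a minimal Octave sketch of the standard multiplicative updates of Lee & Seung [24] for this objective; the matrix sizes, the random initialization, and the small constant added to the denominators are illustrative choices, not part of the formulation above.

  % Minimal sketch of the Lee & Seung multiplicative updates for (1.1).
  M = 100; N = 80; K = 5;             % illustrative sizes
  A = rand(M, N);                     % nonnegative data matrix
  B = rand(M, K); C = rand(K, N);     % random positive initialization
  guard = 1e-9;                       % avoids division by zero
  for iter = 1:20
    B = B .* (A * C') ./ (B * (C * C') + guard);
    C = C .* (B' * A) ./ ((B' * B) * C + guard);
    err = 0.5 * norm(A - B * C, 'fro')^2;   % objective value of (1.1)
    printf('iter %d: J = %.4f\n', iter, err);
  end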

The standard NMF objective function in eq. (1.1) can be extended to accommodate other auxiliary constraints to direct the factors to have certain characteristics such as sparseness [14, 26, 11, 19], smoothness [33, 32, 16, 2], and orthogonality [9, 39, 40, 6, 25]. These extensions are sometimes useful for improving the performance of the NMF in applications such as clustering [33, 38, 4, 11, 19, 8, 20, 5, 15, 10, 42, 36, 37, 41], image analysis [14, 26, 35, 31, 12], spectral analysis, and blind source separation [7, 43, 2, 3, 34].

In order to improve the clustering capability of NMF, Ding et al. [9] introduced two types of orthogonal NMFs: uni-orthogonal NMF (UNMF) and bi-orthogonal NMtF (BNMtF), where the former imposes an orthogonality constraint on either the columns of $\mathbf{B}$ or the rows of $\mathbf{C}$, and the latter imposes orthogonality constraints on both the columns of $\mathbf{B}$ and the rows of $\mathbf{C}$ simultaneously. Due to the tight constraints in the latter, they introduced a third factor to absorb the approximation error. They then proposed an algorithm for each orthogonal NMF and claimed that their algorithms are convergent. However, as their algorithms are based on multiplicative update rules, which only have the nonincreasing property [24], we were challenged to design convergent algorithms for the orthogonal NMFs. A convergent algorithm for the former has been presented in our previous work [30], and in this paper we develop a convergent algorithm for the latter. As the orthogonality constraints cannot be recast into the alternating nonnegativity-constrained least squares (ANLS) framework (see [19, 20] for a discussion of ANLS), convergent algorithms for the standard NMF, e.g., [20, 29, 18, 17, 22, 28], cannot be extended to this problem.

2 Bi-orthogonal NMtF

BNMtF puts orthogonality constraints on both the columns of $\mathbf{B}$ and the rows of $\mathbf{C}$, and it is expected that this technique can be used to simultaneously cluster the columns and rows of $\mathbf{A}$ (biclustering). The following is the original BNMtF objective function proposed by Ding et al. [9]:

(2.2) $$\min_{\mathbf{B},\,\mathbf{S},\,\mathbf{C}}\; J(\mathbf{B},\mathbf{S},\mathbf{C}) = \frac{1}{2}\bigl\|\mathbf{A} - \mathbf{B}\mathbf{S}\mathbf{C}\bigr\|_F^2 \quad \text{s.t.}\quad \mathbf{B} \geq 0,\ \mathbf{S} \geq 0,\ \mathbf{C} \geq 0,\ \mathbf{B}^{\mathsf T}\mathbf{B} = \mathbf{I},\ \mathbf{C}\mathbf{C}^{\mathsf T} = \mathbf{I},$$

where $\mathbf{B} \in \mathbb{R}_+^{M \times P}$, $\mathbf{C} \in \mathbb{R}_+^{Q \times N}$, and $\mathbf{S} \in \mathbb{R}_+^{P \times Q}$ is introduced to absorb the different scales of $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ due to the strict orthogonality constraints on $\mathbf{B}$ and $\mathbf{C}$. We will set $P = Q = K$ for the rest of this paper (for a biclustering task, it is natural to have the same number of clusters for both columns and rows of the data matrix).

The Karush-Kuhn-Tucker (KKT) function of the objective function can be defined as:

$$L(\mathbf{B},\mathbf{S},\mathbf{C}) = J(\mathbf{B},\mathbf{S},\mathbf{C}) + \operatorname{tr}\bigl[\boldsymbol{\Lambda}_B(\mathbf{B}^{\mathsf T}\mathbf{B} - \mathbf{I})\bigr] + \operatorname{tr}\bigl[\boldsymbol{\Lambda}_C(\mathbf{C}\mathbf{C}^{\mathsf T} - \mathbf{I})\bigr] - \operatorname{tr}\bigl[\boldsymbol{\Gamma}_B\mathbf{B}^{\mathsf T}\bigr] - \operatorname{tr}\bigl[\boldsymbol{\Gamma}_S\mathbf{S}^{\mathsf T}\bigr] - \operatorname{tr}\bigl[\boldsymbol{\Gamma}_C\mathbf{C}^{\mathsf T}\bigr],$$

where $\boldsymbol{\Gamma}_B$, $\boldsymbol{\Gamma}_S$, $\boldsymbol{\Gamma}_C$, $\boldsymbol{\Lambda}_B$, and $\boldsymbol{\Lambda}_C$ are the KKT multipliers.

An objective function equivalent to eq. (2.2) was proposed by Ding et al. [9] to absorb the orthogonality constraints into the objective:

(2.3) $$\min_{\mathbf{B},\,\mathbf{S},\,\mathbf{C}\,\geq\,0}\; J(\mathbf{B},\mathbf{S},\mathbf{C}) + \operatorname{tr}\bigl[\boldsymbol{\Lambda}_B(\mathbf{B}^{\mathsf T}\mathbf{B} - \mathbf{I})\bigr] + \operatorname{tr}\bigl[\boldsymbol{\Lambda}_C(\mathbf{C}\mathbf{C}^{\mathsf T} - \mathbf{I})\bigr].$$

It is easy to see that both objective functions, eq. (2.2) and eq. (2.3), have the same KKT function; however, they are not exactly the same problem, as the orthogonality constraints are now absorbed into the minimization objective.

The KKT conditions for the objective in eq. (2.3) are:

$$\mathbf{B} \geq 0,\quad \mathbf{S} \geq 0,\quad \mathbf{C} \geq 0,\qquad \nabla_{\mathbf{B}}L \geq 0,\quad \nabla_{\mathbf{S}}L \geq 0,\quad \nabla_{\mathbf{C}}L \geq 0,\qquad \nabla_{\mathbf{B}}L \odot \mathbf{B} = 0,\quad \nabla_{\mathbf{S}}L \odot \mathbf{S} = 0,\quad \nabla_{\mathbf{C}}L \odot \mathbf{C} = 0,$$

where $\odot$ denotes the entrywise multiplication operation, and $\nabla_{\mathbf{B}}L$, $\nabla_{\mathbf{S}}L$, and $\nabla_{\mathbf{C}}L$ denote the gradients of the absorbed objective with respect to $\mathbf{B}$, $\mathbf{S}$, and $\mathbf{C}$.

Then, by using the same strategy as discussed in [24], Ding et al. [9] derived the BNMtF algorithm as follows:

(2.4) $$\mathbf{B} \leftarrow \mathbf{B} \odot \sqrt{\frac{\mathbf{A}\mathbf{C}^{\mathsf T}\mathbf{S}^{\mathsf T}}{\mathbf{B}\mathbf{B}^{\mathsf T}\mathbf{A}\mathbf{C}^{\mathsf T}\mathbf{S}^{\mathsf T}}},$$
(2.5) $$\mathbf{C} \leftarrow \mathbf{C} \odot \sqrt{\frac{\mathbf{S}^{\mathsf T}\mathbf{B}^{\mathsf T}\mathbf{A}}{\mathbf{S}^{\mathsf T}\mathbf{B}^{\mathsf T}\mathbf{A}\mathbf{C}^{\mathsf T}\mathbf{C}}},$$
(2.6) $$\mathbf{S} \leftarrow \mathbf{S} \odot \sqrt{\frac{\mathbf{B}^{\mathsf T}\mathbf{A}\mathbf{C}^{\mathsf T}}{\mathbf{B}^{\mathsf T}\mathbf{B}\mathbf{S}\mathbf{C}\mathbf{C}^{\mathsf T}}},$$

where the divisions and square roots are taken entrywise, and the multipliers $\boldsymbol{\Lambda}_B$ and $\boldsymbol{\Lambda}_C$ used in the derivation are obtained exactly for the diagonal entries and only approximately for the off-diagonal entries by relaxing the nonnegativity constraints.

The complete BNMtF algorithm proposed in [9] is shown in Algorithm 1, where $\epsilon$ denotes some small positive number (note that the normalization step is not recommended as it changes the objective value). As there are approximations in deriving $\boldsymbol{\Lambda}_B$ and $\boldsymbol{\Lambda}_C$, Algorithm 1 may or may not minimize the objective in eq. (2.3). Further, the auxiliary function used by the authors to prove the nonincreasing property is constructed for the update rules in eqs. (2.4)–(2.6), not for Algorithm 1, so there is no guarantee that Algorithm 1 has the nonincreasing property. Figure 1 shows the error per iteration of Algorithm 1, with the error calculated using the objective value in eq. (2.3). As shown, the assumptions used to obtain $\boldsymbol{\Lambda}_B$ and $\boldsymbol{\Lambda}_C$ are not acceptable: Algorithm 1 not only lacks the nonincreasing property but also fails to minimize the objective function.

  Initialize $\mathbf{B}$, $\mathbf{S}$, and $\mathbf{C}$ with positive matrices to avoid zero locking.
  for $k = 1, \ldots, \mathrm{maxiter}$ do
     Update $\mathbf{B}$, $\mathbf{S}$, and $\mathbf{C}$ using eqs. (2.4)–(2.6).
  end for
Algorithm 1 : Original BNMtF algorithm.
Figure 1: Error per iteration of Algorithm 1 for the Reuters4 dataset (see Section 4 for detailed information about the dataset).
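For illustration, below is a minimal Octave sketch of the update loop of Algorithm 1 together with the per-iteration error monitoring behind Figure 1. The updates follow eqs. (2.4)–(2.6) as reconstructed above and should be checked against [9]; for simplicity, the monitored error here is only the reconstruction term of the objective, and the normalization step is omitted.

  % Sketch of the original BNMtF updates (Algorithm 1) with error monitoring.
  M = 100; N = 80; K = 4;               % illustrative sizes
  A = rand(M, N);                       % nonnegative data matrix
  B = rand(M, K); S = rand(K, K); C = rand(K, N);   % positive initialization
  for iter = 1:20
    B = B .* sqrt((A * C' * S') ./ (B * (B' * A * C' * S') + eps));
    C = C .* sqrt((S' * B' * A) ./ ((S' * B' * A * C') * C + eps));
    S = S .* sqrt((B' * A * C') ./ ((B' * B) * S * (C * C') + eps));
    err = 0.5 * norm(A - B * S * C, 'fro')^2;   % reconstruction term only
    printf('iter %d: error = %.4f\n', iter, err);
  end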

3 A Convergent Algorithm for BNMtF

We define the BNMtF problem with the following objective:

(3.7) $$\min_{\mathbf{B},\,\mathbf{S},\,\mathbf{C}\,\geq\,0}\; J(\mathbf{B},\mathbf{S},\mathbf{C}) = \frac{1}{2}\bigl\|\mathbf{A} - \mathbf{B}\mathbf{S}\mathbf{C}\bigr\|_F^2 + \frac{\alpha}{2}\bigl\|\mathbf{B}^{\mathsf T}\mathbf{B} - \mathbf{I}\bigr\|_F^2 + \frac{\beta}{2}\bigl\|\mathbf{C}\mathbf{C}^{\mathsf T} - \mathbf{I}\bigr\|_F^2,$$

where $\alpha$ and $\beta$ denote some positive constants that adjust the degree of orthogonality of $\mathbf{B}$ and $\mathbf{C}$ respectively. The KKT function of the objective can be written as:

$$L(\mathbf{B},\mathbf{S},\mathbf{C}) = J(\mathbf{B},\mathbf{S},\mathbf{C}) - \operatorname{tr}\bigl[\boldsymbol{\Gamma}_B\mathbf{B}^{\mathsf T}\bigr] - \operatorname{tr}\bigl[\boldsymbol{\Gamma}_S\mathbf{S}^{\mathsf T}\bigr] - \operatorname{tr}\bigl[\boldsymbol{\Gamma}_C\mathbf{C}^{\mathsf T}\bigr],$$

where $\boldsymbol{\Gamma}_B$, $\boldsymbol{\Gamma}_S$, and $\boldsymbol{\Gamma}_C$ are the KKT multipliers of the nonnegativity constraints.

And the KKT conditions are:

(3.8) $$\mathbf{B} \geq 0,\quad \mathbf{S} \geq 0,\quad \mathbf{C} \geq 0,\qquad \nabla_{\mathbf{B}}J \geq 0,\quad \nabla_{\mathbf{S}}J \geq 0,\quad \nabla_{\mathbf{C}}J \geq 0,\qquad \nabla_{\mathbf{B}}J \odot \mathbf{B} = 0,\quad \nabla_{\mathbf{S}}J \odot \mathbf{S} = 0,\quad \nabla_{\mathbf{C}}J \odot \mathbf{C} = 0,$$

where $\nabla_{\mathbf{B}}J$, $\nabla_{\mathbf{S}}J$, and $\nabla_{\mathbf{C}}J$ denote the gradients of the objective in eq. (3.7) with respect to $\mathbf{B}$, $\mathbf{S}$, and $\mathbf{C}$.

As shown by Lee & Seung [24], a multiplicative update rule (MUR) based algorithm can be derived by utilizing the complementary slackness in the KKT conditions (the last line of eq. (3.8)), multiplying each factor entrywise by the ratio of the negative to the positive part of the corresponding gradient. The complete MUR algorithm is given in Algorithm 2, and the additive update rule (AUR) version is given in Algorithm 3 (see, e.g., [29, 30] for a detailed discussion of AUR based NMF algorithms). As shown, the AUR algorithm can be initialized with nonnegative matrices as it does not inherit the zero-locking problem of its MUR counterpart.

  Initialize $\mathbf{B}$, $\mathbf{S}$, and $\mathbf{C}$ with positive matrices to avoid zero locking.
  for $k = 1, \ldots, \mathrm{maxiter}$ do
     Update $\mathbf{B}$, $\mathbf{S}$, and $\mathbf{C}$ using the multiplicative update rules.
  end for
Algorithm 2 : The MUR based algorithm for the BNMtF problem in eq. (3.7).
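As a concrete sketch of Algorithm 2, the Octave fragment below implements multiplicative updates obtained by splitting each gradient of the penalized objective assumed in eq. (3.7) into its positive and negative parts; the matrix sizes, the parameter values, and the eps guard in the denominators are illustrative.

  % Sketch of MUR updates for the penalized objective assumed in (3.7):
  %   J = 0.5*||A-B*S*C||_F^2 + 0.5*alpha*||B'*B-I||_F^2 + 0.5*beta*||C*C'-I||_F^2.
  % Each factor is multiplied entrywise by (negative gradient part) ./
  % (positive gradient part), following the complementary slackness in (3.8).
  M = 100; N = 80; K = 4; alpha = 0.1; beta = 0.1;
  A = rand(M, N);
  B = rand(M, K); S = rand(K, K); C = rand(K, N);   % positive initialization
  for iter = 1:20
    B = B .* (A * C' * S' + 2 * alpha * B) ...
          ./ (B * S * (C * C') * S' + 2 * alpha * B * (B' * B) + eps);
    S = S .* (B' * A * C') ./ ((B' * B) * S * (C * C') + eps);
    C = C .* (S' * B' * A + 2 * beta * C) ...
          ./ (S' * (B' * B) * S * C + 2 * beta * (C * C') * C + eps);
  end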
  Initialize $\mathbf{B}$, $\mathbf{S}$, and $\mathbf{C}$ with nonnegative matrices.
  for $k = 1, \ldots, \mathrm{maxiter}$ do
     Update $\mathbf{B}$, $\mathbf{S}$, and $\mathbf{C}$ using the additive update rules:
(3.9)
(3.10)
(3.11)
  end for
Algorithm 3 : The AUR based algorithm for the BNMtF problem in eq. (3.7).

The variables $\bar{\mathbf{B}}$, $\bar{\mathbf{S}}$, and $\bar{\mathbf{C}}$ in Algorithm 3 are modifications of $\mathbf{B}$, $\mathbf{S}$, and $\mathbf{C}$ introduced to avoid the zero-locking problem: entries that would otherwise be locked at zero are replaced by a small positive number $\sigma$. There are also the variables $\delta_{\mathbf{B}}$, $\delta_{\mathbf{S}}$, and $\delta_{\mathbf{C}}$ in Algorithm 3; as shown in the appendix, these variables play a crucial role in guaranteeing the convergence of the algorithm.

Algorithm 4 shows the modifications to Algorithm 3 needed to guarantee convergence, as suggested by the theorems in the appendix, where step is a constant that determines how fast $\delta_{\mathbf{B}}$, $\delta_{\mathbf{S}}$, and $\delta_{\mathbf{C}}$ grow in order to satisfy the nonincreasing property. Note that we set the same step value for all sequences, but different values can also be employed.

  Initialize $\mathbf{B}$, $\mathbf{S}$, and $\mathbf{C}$ with nonnegative matrices, and choose a small positive number $\sigma$ and an integer value for step.
  for $k = 1, \ldots, \mathrm{maxiter}$ do
     Compute the update of $\mathbf{B}$ using eq. (3.9).
     repeat
        Enlarge $\delta_{\mathbf{B}}$ according to step and recompute the update of $\mathbf{B}$ if the objective value has increased.
     until the objective value does not increase
     Compute the update of $\mathbf{S}$ using eq. (3.10).
     repeat
        Enlarge $\delta_{\mathbf{S}}$ according to step and recompute the update of $\mathbf{S}$ if the objective value has increased.
     until the objective value does not increase
     Compute the update of $\mathbf{C}$ using eq. (3.11).
     repeat
        Enlarge $\delta_{\mathbf{C}}$ according to step and recompute the update of $\mathbf{C}$ if the objective value has increased.
     until the objective value does not increase
  end for
Algorithm 4 : A convergent algorithm for the BNMtF problem.
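To illustrate the control flow of Algorithm 4, the Octave sketch below handles the factor $\mathbf{B}$ only; $\mathbf{S}$ and $\mathbf{C}$ are treated analogously. The exact additive update of eq. (3.9) is not reproduced here, so the update formula, the zero-locking safeguard, and the multiplicative growth of delta_B are assumptions meant only to show how the repeat-until loop enforces the nonincreasing property.

  % Sketch of Algorithm 4's inner loop for the factor B only.
  % The additive update is assumed to have the usual form
  %   B <- B - Bbar ./ (positive gradient part + delta_B) .* grad_B.
  M = 100; N = 80; K = 4; alpha = 0.1; beta = 0.1;
  sigma = 1e-8; step = 10; delta_B = 0;
  A = rand(M, N); B = rand(M, K); S = rand(K, K); C = rand(K, N);
  obj = @(B) 0.5 * norm(A - B * S * C, 'fro')^2 ...
           + 0.5 * alpha * norm(B' * B - eye(K), 'fro')^2 ...
           + 0.5 * beta * norm(C * C' - eye(K), 'fro')^2;
  for iter = 1:20
    grad_B = B * S * (C * C') * S' - A * C' * S' + 2 * alpha * B * (B' * B - eye(K));
    pos_B  = B * S * (C * C') * S' + 2 * alpha * B * (B' * B);
    Bbar   = B; Bbar(grad_B < 0 & B < sigma) = sigma;   % avoid zero locking
    J_old  = obj(B);
    do
      B_new   = max(B - Bbar ./ (pos_B + delta_B) .* grad_B, 0);  % clip as a safeguard
      delta_B = max(delta_B, 1) * step;     % enlarge delta_B for the next try
    until (obj(B_new) <= J_old)
    B = B_new; delta_B = 0;                 % reset delta_B in this sketch
  end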

4 Experimental results

We will now analyze the convergence of the proposed algorithms, Algorithm 2 (MU-B) and Algorithm 4 (AU-B), numerically. However, as it is generally difficult to reach a stationary point in an acceptable computational time, only the nonincreasing property (or the lack of it) will be shown. And because BNMtF was originally designed for clustering purposes, we will also analyze this property. As the dataset, we use the Reuters-21578 corpus. The dataset is especially interesting because many NMF-based clustering methods are tested on it, e.g., [33, 9, 38]. A detailed discussion of the dataset and the preprocessing steps can be found in our previous work [30]. In summary, test datasets were formed by combining the top 2, 4, 6, 8, 10, and 12 classes of the corpus. Table 1 summarizes the statistics of these test datasets, where #doc, #word, %nnz, max, and min refer to the number of documents, the number of words, the percentage of nonzero entries, the maximum cluster size, and the minimum cluster size respectively. And as the corpus is bipartite, clustering can be done either for documents or for words; we will evaluate both document clustering and word clustering. For comparison, we use the following algorithms:

  • the standard NMF algorithm [24] (LS),

  • the original UNMF algorithm [9] (D-U),

  • the original BNMtF algorithm [9] (D-B),

  • the MUR based algorithm for UNMF, i.e., algorithm 3 in [30] (MU-U), and

  • the convergent algorithm for UNMF proposed in our previous work, i.e., algorithm 4 in [30] (AU-U).

All experiments were carried out in Octave under the Linux platform using a notebook with a 1.86 GHz Intel processor and 2 GB of RAM.

Dataset #doc #word %nnz max min
Reuters2 6090 8547 0.363 3874 2216
Reuters4 6797 9900 0.353 3874 333
Reuters6 7354 10319 0.347 3874 269
Reuters8 7644 10596 0.340 3874 144
Reuters10 7887 10930 0.336 3874 114
Reuters12 8052 11172 0.333 3874 75
Table 1: Statistics of the test datasets.

4.1 The nonincreasing property

Figures 2–5 show graphically the nonincreasing property (or the lack of it) of MU-B and AU-B. Because there are two adjustable parameters, $\alpha$ and $\beta$, we fix one parameter while studying the other. Figures 2 and 3 show the results for fixed $\beta$, and figures 4 and 5 for fixed $\alpha$. As shown, while MU-B fails to show the nonincreasing property for large $\alpha$ and $\beta$ values, AU-B successfully preserves this property regardless of the $\alpha$ and $\beta$ values. Note that the remaining settings were kept the same for MU-B and AU-B in all experiments. A similar study for the UNMF algorithms can be found in our previous work [30].

(a) Small $\alpha$, (b) medium to large $\alpha$, (c) some values of $\alpha$.
Figure 2: MU-B($\alpha$) error per iteration for the Reuters4 dataset ($\beta$ fixed).
(a) Small $\alpha$, (b) medium to large $\alpha$, (c) some values of $\alpha$.
Figure 3: AU-B($\alpha$) error per iteration for the Reuters4 dataset ($\beta$ fixed).
(a) Small $\beta$, (b) medium to large $\beta$, (c) some values of $\beta$.
Figure 4: MU-B($\beta$) error per iteration for the Reuters4 dataset ($\alpha$ fixed).
(a) Small $\beta$, (b) medium to large $\beta$, (c) some values of $\beta$.
Figure 5: AU-B($\beta$) error per iteration for the Reuters4 dataset ($\alpha$ fixed).

Not surprisingly, there are computational consequences of this property for large $\alpha$ and/or $\beta$. Table 2 shows time comparisons between these algorithms for the Reuters4 dataset. Note that an $\alpha$ or $\beta$ is appended to the algorithm acronyms to indicate which parameter is being varied; for example, AU-B($\alpha$) means AU-B with fixed $\beta$ and varied $\alpha$. Note also that LS, D-U, and D-B do not have these parameters. As shown, the computational times of the MU algorithms are practically independent of the $\alpha$ and $\beta$ values, while the AU algorithms become slower for some large $\alpha$ or $\beta$, probably because for large $\alpha$ or $\beta$ values the AU algorithms execute the inner iterations (shown as repeat-until loops). Also, there are some anomalies in the AU-B($\alpha$) and AU-B($\beta$) cases where, for some $\alpha$ or $\beta$ values, execution times are unexpectedly fast. To investigate these, we display the number of iterations (#iter) and the number of inner iterations (#initer) of the AU algorithms in table 3 (note that the MU algorithms reach the predefined maximum number of iterations, 20, in all cases). As displayed, in the cases where the AU algorithms performed worse than their MU counterparts, they executed the inner iterations; and when an AU algorithm performed better, its #iter is smaller than the #iter of the corresponding MU algorithm and the inner iterations were not executed. These observations explain the differences in computational times in table 2.

$\alpha$ or $\beta$ MU-U AU-U MU-B($\alpha$) AU-B($\alpha$) MU-B($\beta$) AU-B($\beta$)

0.01 110 110 121 41.1 122 27.2
0.05 110 110 121 40.9 121 40.7
0.1 109 109 121 40.8 121 41.2
0.3 110 109 121 40.4 121 41.1
0.7 110 110 121 272 121 41.2
1 110 110 121 40.8 121 273
3 110 110 121 40.4 121 40.7
7 110 110 121 40.4 121 273
10 110 110 121 40.8 121 41.1
30 109 110 121 272 121 442
70 109 137 121 332 121 525
100 110 232 121 382 121 605
300 110 232 121 514 121 579
700 110 461 121 607 121 606
1000 110 411 121 606 121 365
Table 2: Time comparison (in seconds) for the Reuters4 dataset.
$\alpha$ or $\beta$ AU-U AU-B($\alpha$) AU-B($\beta$)
#iter / #initer #iter / #initer #iter / #initer
0.01 20 / 0 3 / 0 2 / 0
0.05 20 / 0 3 / 0 3 / 0
0.1 20 / 0 3 / 0 3 / 0
0.3 20 / 0 3 / 0 3 / 0
0.7 20 / 0 20 / 0 3 / 0
1 20 / 0 3 / 0 20 / 0
3 20 / 0 3 / 0 3 / 0
7 20 / 0 3 / 0 20 / 0
10 20 / 0 3 / 0 3 / 0
30 20 / 0 20 / 0 20 / 44
70 20 / 7 20 / 23 20 / 66
100 20 / 32 20 / 22 20 / 88
300 20 / 32 20 / 65 20 / 81
700 20 / 92 20 / 75 20 / 88
1000 20 / 79 20 / 90 20 / 24
Table 3: #iter and #initer of AU algorithms (Reuters4).

4.2 The maximum number of iterations

The maximum number of iterations is crucial in MU and AU algorithms since these algorithms are known to be very slow [29, 14, 19, 21, 33, 1, 20, 18, 17, 22, 28]. As shown by Lin [28], LS is very fast at minimizing the objective during the first few iterations, but then tends to become slower. In table 4, we display the errors over the first iterations of LS, MU-U, AU-U, MU-B, and AU-B (D-U and D-B do not minimize the objective functions they intend to minimize, so it is irrelevant to investigate this property in their case). Note that error0 refers to the initial error before the algorithms start running, and error$k$ is the error at the $k$-th iteration. As shown, all algorithms are exceptionally good at reducing the errors in the first iterations, but the subsequent improvements are rather negligible with respect to the corresponding first improvements and the sizes of the datasets. Accordingly, we set the maximum number of iterations to 20. Note that, in this case, AU-B converged at the third iteration.

Algorithm error0 error1 error2 error3 error4 error5

LS 1373 0.476 0.474 0.472 0.469 0.466
MU-U 4652 1.681 1.603 1.596 1.591 1.583
AU-U 4657 1.681 1.605 1.595 1.586 1.573
MU-B 12474 2.164 2.104 2.103 2.102 2.102
AU-B 12680 2.137 2.104 2.103 - -
Table 4: Errors for some first iterations (Reuters4).

4.3 Determining $\alpha$ and $\beta$

In the proposed algorithms, there are two dataset-dependent parameters, $\alpha$ and $\beta$, that have to be learned first. Because orthogonal NMFs are introduced to improve the clustering capability of the standard NMF [9], these parameters will be learned from clustering results on a test dataset; we use Reuters4 for this purpose. These parameters exist neither in the original orthogonal NMFs nor in other orthogonal NMF algorithms [39, 40, 6]. However, we notice that our formulations resemble the sparse NMF formulation [19, 21, 20], also known in the general case as constrained NMF [32]. As shown in [19, 20, 21], sparse NMF can usually give good results if the regularization parameters are rather small positive numbers. To determine $\alpha$ and $\beta$, we evaluate the clustering qualities produced by our algorithms as the $\alpha$ or $\beta$ values grow, measured by the standard clustering metrics: mutual information (MI), entropy (E), purity (P), and F-measure (F). Detailed discussions of these metrics can be found in [30]. Note that while larger MI, F, and P indicate better results, smaller E indicates better results. As shown in figure 6, suitable values can be read off for the UNMF algorithms (MU-U and AU-U) as well as for MU-B and AU-B. Based on these results, we fixed $\alpha$ and $\beta$ accordingly for all datasets and algorithms.
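For reference, the Octave sketch below computes purity and entropy from cluster assignments and reference classes using their standard definitions; the exact normalizations used in [30] may differ slightly, and the label vectors here are only illustrative.

  % Purity and entropy of a clustering from cluster and reference class labels.
  clust  = [1 1 2 2 2 3 3];             % illustrative cluster assignments
  labels = [1 1 1 2 2 2 2];             % illustrative reference classes
  K = max(clust); Q = max(labels); total = numel(clust);
  n = zeros(K, Q);                      % n(k, j): members of cluster k in class j
  for i = 1:total
    n(clust(i), labels(i)) += 1;
  end
  purity = sum(max(n, [], 2)) / total;
  p = n ./ max(sum(n, 2), 1);                     % class distribution per cluster
  h = -sum(p .* log2(p + (p == 0)), 2);           % cluster-wise entropies
  E = sum(sum(n, 2) .* h) / (total * log2(Q));    % normalized entropy
  printf('purity = %.3f, entropy = %.3f\n', purity, E);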

(a) MU-U, (b) AU-U, (c) MU-B($\alpha$), (d) AU-B($\alpha$), (e) MU-B($\beta$), (f) AU-B($\beta$).
Figure 6: Clustering qualities as functions of $\alpha$ or $\beta$ for Reuters4.

4.4 Times, #iterations, and errors

To evaluate the computational performance of the algorithms, we measure the average and maximum running times, the average and maximum #iterations, and the average and maximum errors at the last iterations over 10 trials. Tables 5–7 show the results.

As shown in table 5, LS is generally the fastest; however, when MU-B or AU-B converge before reaching the maximum number of iterations (20), these algorithms outperform LS. Our uni-orthogonal algorithms (MU-U and AU-U) have running times comparable to LS. MU-B is slower for the smaller datasets but performs better than MU-U and AU-U for the larger datasets, Reuters10 and Reuters12. Since AU-B usually converges before reaching the maximum number of iterations (see table 6), a comparison can be made using the maximum running times, for which Reuters4, Reuters6, Reuters10, and Reuters12 provide full-length runs. As the maximum times in table 5 and the #iterations in table 6 show, AU-B is the slowest to perform the calculation per iteration. There are also abrupt increases in the running times for Reuters10 and Reuters12 for all algorithms, which are unexpected since, as shown in table 1, the sizes of these datasets are only slightly larger. Figure 7 shows the average running times as the sizes of the datasets grow.

The average and maximum errors at the last iterations are shown in table 7. The results for D-U and D-B are unsurprisingly very high, as these algorithms do not minimize the objectives they are supposed to minimize. Because only the MU-U & AU-U and MU-B & AU-B pairs share the same objective, we compare the average errors of these pairs in figure 8 for each dataset. There is no significant difference between MU-U and AU-U in the average errors, but as shown in figure 7, MU-U has better average running times, especially for the larger datasets. For MU-B and AU-B, the differences in the average errors grow slightly as the size and the number of classes of the datasets grow, with significant differences occurring at Reuters10 and Reuters12, where MU-B significantly outperformed AU-B.

Data Time LS D-U D-B MU-U AU-U MU-B AU-B
Reuters2 Av. 77.266 83.655 104.98 78.068 77.825 66.318 38.367
Max. 79.031 84.743 106.25 79.075 79.176 83.960 49.477
Reuters4 Av. 108.84 119.42 152.77 109.04 109.12 119.46 86.745
Max. 109.39 119.55 153.17 109.20 109.28 119.72 271.40
Reuters6 Av. 134.02 149.32 194.43 133.91 134.19 149.63 75.432
Max. 134.50 149.62 194.75 134.27 134.51 149.95 327.70
Reuters8 Av. 158.37 173.43 228.59 153.53 155.03 173.00 56.464
Max. 181.58 175.71 235.54 155.15 159.19 174.05 59.021
Reuters10 Av. 834.69 892.91 911.34 874.18 914.93 859.31 601.57
Max. 1004.5 1141.2 1127.3 1137.5 1162.0 1059.0 2794.1
Reuters12 Av. 1249.2 1348.4 1440.1 1319.7 1335.6 1309.0 1602.4
Max. 1389.0 1590.4 1746.1 1565.7 1529.4 1506.7 4172.2
Table 5: Average and maximum running time.
Data #iter. LS D-U D-B MU-U AU-U MU-B AU-B
Reuters2 Av. 20 20 20 20 20 16.2 4.9
Max. 20 20 20 20 20 20 6
Reuters4 Av. 20 20 20 20 20 20 7.2
Max. 20 20 20 20 20 20 20
Reuters6 Av. 20 20 20 20 20 20 5.5
Max. 20 20 20 20 20 20 20
Reuters8 Av. 20 20 20 20 20 20 4
Max. 20 20 20 20 20 20 4
Reuters10 Av. 20 20 20 20 20 20 5.6
Max. 20 20 20 20 20 20 20
Reuters12 Av. 20 20 20 20 20 20 8.8
Max. 20 20 20 20 20 20 20
Table 6: Average and maximum #iteration.
Data Error LS D-U D-B MU-U AU-U MU-B AU-B
Reuters2 Av. 1.3763 3435.6 3626.5 1.4106 1.4138 1.7955 1.8021
Max. 1.3854 3587.2 3867.4 1.4201 1.4230 1.8022 1.8025
Reuters4 Av. 1.4791 9152.8 8689.0 1.5299 1.5310 2.0708 2.0962
Max. 1.4855 9474.9 9297.9 1.5408 1.5402 2.0880 2.1028
Reuters6 Av. 1.5229 17135 15823 1.5844 1.5878 2.2627 2.2921
Max. 1.5301 17971 16955 1.5884 1.5952 2.2758 2.2998
Reuters8 Av. 1.5434 25913 22893 1.6215 1.6171 2.3863 2.4421
Max. 1.5473 27462 25553 1.6342 1.6262 2.3993 2.4422
Reuters10 Av. 1.5696 34154 30518 1.6533 1.6533 1.8836 2.5673
Max. 1.5801 35236 35152 1.6662 1.6618 1.9529 2.5718
Reuters12 Av. 1.5727 42739 37038 1.6620 1.6621 1.8860 2.6551
Max. 1.5815 44325 41940 1.6705 1.6713 1.9193 2.6697
Table 7: Average and maximum errors at the last iteration.
Figure 7: Average running time comparison as the datasets grow.
(a) MU-U and AU-U.
(b) MU-B and AU-B.
Figure 8: Average errors comparison as the datasets grow.

4.5 Document clustering

The results of document clustering are shown in tables 8–11. On average, MU-U gives the best performance on all metrics, especially for datasets with a small number of clusters, followed by LS, AU-U, and D-U with small margins. LS seems to be better for datasets with a large number of clusters. Generally, MU-U, LS, AU-U and D-U give consistent results for varied numbers of clusters, but unfortunately this is not the case for D-B, MU-B and AU-B, which are all BNMtF algorithms. AU-B in particular seems to offer only slightly better clustering than random assignment.

Data LS D-U D-B MU-U AU-U MU-B AU-B
Reuters2 0.40392 0.42487 0.36560 0.42150 0.057799 0.00087646
Reuters4 0.62879 0.61723 0.48007 0.63640 0.32142 0.072621
Reuters6 0.79459 0.81831 0.52498 0.81811 0.37924 0.078201
Reuters8 0.92285 0.90260 0.54534 0.92720 0.48435 0.013518
Reuters10 1.0275 0.62125 1.0063 1.0138 0.50980 0.072014
Reuters12 1.0865 0.58469 1.1195 1.0821 0.47697 0.16389
Average 0.82071 0.81283 0.52032 0.81754 0.37160 0.066853
Table 8: Average mutual information over 10 trials (document clustering).
Data LS D-U D-B MU-U AU-U MU-B AU-B
Reuters2 0.54193 0.52098 0.58025 0.52435 0.88805 0.94498
Reuters4 0.40202 0.40780 0.47638 0.39822 0.55571 0.68011
Reuters6 0.38391 0.37473 0.48821 0.37481 0.54459 0.66105
Reuters8 0.35568 0.36242 0.48151 0.35423 0.50184 0.65879
Reuters10 0.34023 0.46253 0.34661 0.34434 0.49608 0.62786
Reuters12 0.33239 0.47236 0.32319 0.33362 0.50241 0.58974
Average 0.38985 0.389760 0.49354 0.38787 0.58145 0.69375
Table 9: Average entropy over 10 trials (document clustering).
Data LS D-U D-B MU-U AU-U MU-B AU-B
Reuters2 0.82154 0.83599 0.80452 0.82507 0.66102 0.63612
Reuters4 0.79417 0.78023 0.73778 0.79704 0.70119 0.59657
Reuters6 0.74510 0.68844 0.74868 0.75069 0.66433 0.54569
Reuters8 0.73982 0.66536 0.74869 0.73987 0.65033 0.50680
Reuters10 0.73120 0.64845 0.72813 0.73330 0.63194 0.50639
Reuters12 0.73877 0.72719 0.62223 0.72340 0.60118 0.52019
Average 0.76331 0.76207 0.69446 0.76156 0.65166 0.55196
Table 10: Average purity over 10 trials (document clustering).
Data LS D-U D-B MU-U AU-U MU-B AU-B
Reuters2 0.81904 0.83234 0.79163 0.82241 0.58237 0.50399
Reuters4 0.56154 0.53754 0.44352 0.54267 0.36917 0.24585
Reuters6 0.46225 0.47714 0.33910 0.47270 0.26372 0.17171
Reuters8 0.40408 0.40554 0.25052 0.41822 0.23904 0.10869
Reuters10 0.38001 0.23309 0.36923 0.35947 0.19552 0.094912
Reuters12 0.35671 0.17387 0.35214 0.34435 0.16401 0.099949
Average 0.49727 0.49851 0.37196 0.49526 0.30231 0.20418
Table 11: Average Fmeasure over 10 trials (document clustering).

4.6 Word clustering

In some cases, the ability of clustering methods to simultaneously group similar documents with related words (co-clustering) can be an added value. Because the original BNMtF is designed to have this ability, we also investigate the quality of the word clustering (in the context of co-clustering) produced by all algorithms. Since word clustering has no reference classes, we adopt the idea from [9] where the authors proposed to create reference classes using word frequencies: each word is assigned to the class in which it appears with the highest frequency. Tables 12–15 show the results. As shown, D-U has the best overall results, followed by LS, MU-U and AU-U by small margins. MU-U is especially good for a small number of clusters and LS for a large number of clusters. However, all BNMtF algorithms (D-B, MU-B, and AU-B), which are designed to accommodate the co-clustering task, produce poor results. These results are in accord with the document clustering results, where the BNMtF algorithms also perform poorly.
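A minimal Octave sketch of this reference-class construction is given below, assuming a word-by-document count matrix and document class labels; the variable names and the toy data are illustrative.

  % Build word reference classes as in [9]: each word is assigned to the
  % class in which it appears with the highest frequency.
  W = [5 0 1 0; 0 3 0 4; 2 2 1 1];    % illustrative 3 words x 4 documents
  doc_class = [1 1 2 2];              % class label of each document
  Q = max(doc_class);
  freq = zeros(rows(W), Q);
  for j = 1:Q
    freq(:, j) = sum(W(:, doc_class == j), 2);   % per-class word frequencies
  end
  [~, word_class] = max(freq, [], 2);            % argmax over the classes
  disp(word_class');                             % reference class of each word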

Data LS D-U D-B MU-U AU-U MU-B AU-B
Reuters2 0.15715 0.16609 0.12966 0.14978 0.013995 0.00029807
Reuters4 0.42558 0.39193 0.21495 0.41663 0.11812 0.026943
Reuters6 0.54112 0.26971 0.54239 0.54828 0.12460 0.035309
Reuters8 0.63022 0.63368 0.29277 0.64699 0.15692 0.0037071
Reuters10 0.70386 0.33046 0.66262 0.68367 0.025320 0.029618
Reuters12 0.80111 0.77959 0.28412 0.76128 0.73517 0.013483 0.073478
Average 0.54317 0.25361 0.53549 0.53188 0.075407 0.028226
Table 12: Average mutual information over 10 trials (word clustering).
Data LS D-U D-B MU-U AU-U MU-B AU-B
Reuters2 0.76778 0.75884 0.79527 0.77515 0.91094 0.92463
Reuters4 0.62965 0.64647 0.73496 0.63412 0.78338 0.82897
Reuters6 0.56184 0.66683 0.56134 0.55906 0.72297 0.75751
Reuters8 0.52006 0.51891 0.63255 0.51447 0.67783 0.72890
Reuters10 0.50612 0.61852 0.51853 0.51220 0.71038 0.70909
Reuters12 0.48811 0.62632 0.49322 0.50050 0.70181 0.68507
Average 0.57792 0.67908 0.57806 0.58199 0.75122 0.77236
Table 13: Average entropy over 10 trials (word clustering).
Data LS D-U D-B MU-U AU-U MU-B AU-B
Reuters2 0.76987 0.77082 0.75378 0.76021 0.67006 0.65988
Reuters4 0.64400 0.62881 0.60566 0.64184 0.55808 0.53116
Reuters6 0.59830 0.55949 0.59763 0.59103 0.52966 0.49661
Reuters8 0.58935 0.54296 0.59179 0.58770 0.50933 0.46499
Reuters10 0.58123 0.51576 0.57045 0.58724 0.44765 0.45395
Reuters12 0.59563 0.49555 0.58628 0.56846 0.43611 0.44882
Average 0.63185 0.57887 0.62837 0.62274 0.52515 0.50923
Table 14: Average purity over 10 trials (word clustering).
Data LS D-U D-B MU-U AU-U MU-B AU-B
Reuters2 0.59287 0.59471 0.58733 0.59427 0.52628 0.49976
Reuters4 0.46891 0.43469 0.36397 0.46180 0.32520 0.27101
Reuters6 0.37490 0.38365 0.27356 0.38026 0.21620 0.17572
Reuters8 0.32488 0.32674 0.20820 0.33527 0.17127 0.12565
Reuters10 0.29864 0.18626 0.28930 0.28573 0.10700 0.10545
Reuters12 0.29072 0.14255 0.27525 0.27380 0.088517 0.095880
Average 0.39189 0.38970 0.29365 0.38973 0.23908 0.21224
Table 15: Average Fmeasure over 10 trials (word clustering).

5 Conclusions

We have presented BNMtF algorithms based on the additive update rules with a rigorous convergence proof.

The only way to numerically evaluate whether the algorithm has converged to a stationary point is to check whether it has satisfied the KKT conditions at that point. While the nonnegativity conditions are easy to check, the complementary slackness conditions are hard since we must check them for every iteration $k \geq k^*$, where $k^*$ denotes the first iteration number at which stationarity has been reached. Not only are there some large matrix multiplications which can be numerically inaccurate, but we must also make sure that the stationary point is reachable in a reasonable amount of time. Accordingly, only the nonincreasing property was evaluated; as shown in section 4.1, the convergent version of our algorithm kept this property even for large $\alpha$ or $\beta$.
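For completeness, the check discussed above can be sketched in Octave as follows, assuming the penalized objective form of eq. (3.7); the gradient expressions, the tolerance, and the random candidate point are illustrative, and the large matrix products involved are exactly the numerical burden mentioned in the text.

  % Numerical KKT check for a candidate point (B, S, C) of the assumed (3.7):
  % nonnegativity, nonnegative gradients, and complementary slackness.
  M = 100; N = 80; K = 4; alpha = 0.1; beta = 0.1; tol = 1e-6;
  A = rand(M, N); B = rand(M, K); S = rand(K, K); C = rand(K, N);  % candidate
  grad_B = B * S * (C * C') * S' - A * C' * S' + 2 * alpha * B * (B' * B - eye(K));
  grad_S = (B' * B) * S * (C * C') - B' * A * C';
  grad_C = S' * (B' * B) * S * C - S' * B' * A + 2 * beta * (C * C' - eye(K)) * C;
  kkt_ok = all([B(:); S(:); C(:)] >= 0) ...
        && all([grad_B(:); grad_S(:); grad_C(:)] >= -tol) ...
        && max(abs([grad_B(:) .* B(:); grad_S(:) .* S(:); grad_C(:) .* C(:)])) <= tol;
  printf('KKT conditions satisfied (within tol): %d\n', kkt_ok);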

The maximum number of iterations is an important issue in multiplicative and additive update rule based NMF algorithms since these algorithms are known to be slow. As shown in table 4, the multiplicative and additive update rule based algorithms were exceptionally good at reducing the errors in the first iterations, but the errors were only slightly reduced in the remaining iterations. This led us to use 20 iterations as the maximum. Because this is a rather small number, it is very likely that the algorithms stop before reaching stationarity.

There were differences in the running times of the algorithms, but they were not significant since all algorithms have the same computational complexity, i.e., $O(MNK \times \#\text{iterations})$, where $M \times N$ denotes the size of the data matrix and $K$ denotes the number of decomposition factors.

The document clustering results favoured our MU-U algorithm proposed in [30], which showed the best average performance on all metrics, followed closely by LS, AU-U, and D-U. MU-U was especially good for a small number of clusters and LS for a large number of clusters. It is possible that, because $\alpha$ and $\beta$ were learned from the Reuters4 dataset, MU-U performed best on the smaller datasets. All BNMtF algorithms (D-B, MU-B, and AU-B) performed rather poorly on these datasets. These results conflict with some previous works which reported that D-B outperformed LS and D-U [9, 27].

The results of word clustering are not as conclusive as those of document clustering since there is no prior labelling of the words; hence, we used the strategy introduced in [9] to label the words. In this task, D-U offered the best overall performance, followed closely by LS, MU-U and AU-U. As in document clustering, all BNMtF algorithms performed poorly.


Acknowledgment. We are very grateful to the referees for their most interesting comments and suggestions.

References

  • [1] M.W. Berry, M. Browne, A. Langville, V.P. Pauca, and R.J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. In Computational Statistics and Data Analysis, pages 155–173, 2006.
  • [2] N. Bertin, R. Badeau, and E. Vincent. Enforcing harmonicity and smoothness in bayesian non-negative matrix factorization applied to polyphonic music transcription. IEEE Transactions on Audio, Speech, and Language Processing, 18(3):538–549, 2010.
  • [3] A. Bertrand and M. Moonen. Blind separation of non-negative source signals using multiplicative updates and subspace projection. Signal Processing, 90(10):2877–2890, 2010.
  • [4] J.P. Brunet, P. Tamayo, T.R. Golub, and J.P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. In Proc. Natl Acad. Sci. USA 101, pages 4164–4169, 2003.
  • [5] P. Carmona-Saez, R.D. Pascual-Marqui, F. Tirado, J.M Carazo, and A. Pascual-Montano. Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC Bioinformatics, 7(1):78, 2006.
  • [6] S. Choi. Algorithms for orthogonal nonnegative matrix factorization. In Proc. IEEE Int’l Joint Conf.  on Neural Networks, pages 1828–1832, 2008.
  • [7] A. Cichocki, S. Amari, R. Zdunek, R. Kompass, G. Hori, and Z. He. Extended smart algorithms for non-negative matrix factorization. In Proc. of the 8th international conference on Artificial Intelligence and Soft Computing, pages 548–562, 2006.
  • [8] K. Devarajan. Nonnegative matrix factorization: An analytical and interpretive tool in computational biology. PLoS Computational Biology, 4(7):e1000029, 2008.
  • [9] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In Proc. 12th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, pages 126–135, 2006.
  • [10] P. Fogel, S.S. Young, D.M. Hawkins, and N. Ledirac. Inferential, robust non-negative matrix factorization analysis of microarray data. Bioinformatics, 23(1):44–49, 2007.
  • [11] Y. Gao and G. Church. Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics, 21(21):3970–3975, 2005.
  • [12] N. Gillis and F. Glineur. A multilevel approach for nonnegative matrix factorization. Journal of Computational and Applied Mathematics, 236(7):1708–1723, 2012.
  • [13] L. Grippo and M. Sciandrone. On the convergence of the block nonlinear gauss-seidel method under convex constraints. Operation Research Letters, 26(3):127–136, 2000.
  • [14] P.O. Hoyer. Non-negative matrix factorization with sparseness constraints. The Journal of Machine Learning Research, 5:1457–1469, 2004.
  • [15] K. Inamura, T. Fujiwara, Y. Hoshida, T. Isagawa, M.H. Jones, C. Virtanen, M. Shimane, Y. Satoh, S. Okumura, K. Nakagawa, E. Tsuchiya, S. Ishikawa, H. Aburatani, H. Nomura, and Y. Ishikawa. Two subclasses of lung squamous cell carcinoma with different gene expression profiles and prognosis identified by hierarchical clustering and non-negative matrix factorization. Oncogene, 24(47):7105–7113, 2005.
  • [16] S. Jia and Y. Qian. Constrained nonnegative matrix factorization for hyperspectral unmixing. IEEE Transactions on Geoscience and Remote Sensing, 47(1):161–173, 2009.
  • [17] D. Kim, S. Sra, and I.S. Dhillon. Fast newton-type methods for the least squares nonnegative matrix approximation problem. In in Data Mining, Proceedings of SIAM Conference on, pages 343–354, 2007.
  • [18] D. Kim, S. Sra, and I.S. Dhillon. Fast projection-based methods for the least squares nonnegative matrix approximation problem. Stat. Anal. Data Min., 1:38–51, February 2008.
  • [19] H. Kim and H. Park. Sparse non-negative matrix factorizations via alternating non-negativity constrained least squares for microarray data analysis. Bioinformatics, 23(12):1495–1502, 2007.
  • [20] H. Kim and H. Park. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl., 30(2):713–730, 2008.
  • [21] J. Kim and H. Park. Sparse nonnegative matrix factorization for clustering. Technical report, CSE Technical Reports, Georgia Institute of Technology, 2008.
  • [22] J. Kim and H. Park. Toward faster nonnegative matrix factorization: A new algorithm and comparisons. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pages 353–362, 2008.
  • [23] D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.
  • [24] D.D. Lee and H.S. Seung. Algorithms for non-negative matrix factorization. In Proc. Advances in Neural Processing Information Systems, pages 556–562, 2000.
  • [25] H. Li, T. Adali, W. Wang, and D. Emge. Kernel k-means: spectral clustering and normalized cuts. In Proc. IEEE Workshop on Machine Learning for Signal Processing, pages 253–258, 2005.
  • [26] S.Z. Li, X.W. Hou, H.J. Zhang, and Q.S. Cheng. Learning spatially localized, parts-based representation. In Proc. IEEE Comp. Soc. Conf. on Computer Vision and Pattern Recognition, pages 207–212, 2001.
  • [27] T. Li and C. Ding. The relationships among various nonnegative matrix factorization methods for clustering. In Proc. ACM 6th Int’l Conf. on Data Mining, pages 362–371, 2006.
  • [28] C.J. Lin. Projected gradient methods for non-negative matrix factorization. Technical report, Technical Report ISSTECH-95-013, Department of CS, National Taiwan University, 2005.
  • [29] C.J. Lin. On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Transactions on Neural Networks, 18(6):1589–1596, 2007.
  • [30] A. Mirzal. A convergent algorithm for orthogonal nonnegative matrix factorization. Journal of Computational and Applied Mathematics, 260:149–166, 2014.
  • [31] A. Pascual-Montano, J.M. Carazo, K. Kochi, D. Lehman, and R.D. Pascual-Marqui. Nonsmooth nonnegative matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(3):403–415, 2006.
  • [32] V.P. Pauca, J. Piper, and R.J. Plemmons. Nonnegative matrix factorization for spectral data analysis. Linear Algebra and Its Applications, 416(1):29–47, 2006.
  • [33] F. Shahnaz, M.W. Berry, V. Pauca, and R.J. Plemmons. Document clustering using nonnegative matrix factorization. Information Processing & Management, 42(2):267–273, 2003.
  • [34] T. Virtanen, A.T. Cemgil, and S. Godsill. Bayesian extensions to non-negative matrix factorisation for audio signal modelling. In IEEE Int’l Conf. on Acoustics, Speech and Signal Processing, pages 1825–1828, 2008.
  • [35] D. Wang and H. Lu. On-line learning parts-based representation via incremental orthogonal projective non-negative matrix factorization. Signal Processing, 93(6):1608–1623, 2013.
  • [36] G. Wang, A.V. Kossenkov, and M.F. Ochs. Ls-nmf: A modified non-negative matrix factorization algorithm utilizing uncertainty estimates. BMC Bioinformatics, 7(1):175, 2006.
  • [37] J.J.Y. Wang, X. Wang, and X. Gao. Non-negative matrix factorization by maximizing correntropy for cancer clustering. BMC Bioinformatics, 14(1):107, 2013.
  • [38] W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In Proc. ACM SIGIR, pages 267–273, 2003.
  • [39] J. Yoo and S. Choi. Orthogonal nonnegative matrix factorization: Multiplicative updates on stiefel manifolds. In Proceedings of the 9th International Conference on Intelligent Data Engineering and Automated Learning, pages 140–147. Springer-Verlag, 2008.
  • [40] J. Yoo and S. Choi. Orthogonal nonnegative matrix tri-factorization for co-clustering: Multiplicative updates on stiefel manifolds. Inf. Process. Manage., 46(5):559–570, 2010.
  • [41] N. Yuvaraj and P. Vivekanandan. An efficient svm based tumor classification with symmetry non-negative matrix factorization using gene expression data. In Proc. Int’l Conf. on Information Communication and Embedded Systems, pages 761–768, 2013.
  • [42] C.H. Zheng, D.S. Huang, D. Zhang, and X.Z. Kong. Tumor clustering using nonnegative matrix factorization with gene selection. IEEE Transactions on Information Technology in Biomedicine, 13(4):599–607, 2009.
  • [43] G. Zhou, Z. Yang, and S. Xie. Online blind source separation using incremental nonnegative matrix factorization with volume constraint. IEEE Transactions on Neural Networks, 22(4):550–560, 2011.

A Convergence analysis

From a result in the convergence analysis of the block coordinate descent method [29, 28, 13], Algorithm 3 has a convergence guarantee if the following conditions are satisfied:

  1. the sequence of objective values $\{J(\mathbf{B}^{(k)},\mathbf{S}^{(k)},\mathbf{C}^{(k)})\}$ has the nonincreasing property,

  2. any limit point of the sequence $\{(\mathbf{B}^{(k)},\mathbf{S}^{(k)},\mathbf{C}^{(k)})\}$ generated by Algorithm 3 is a stationary point, and

  3. the sequence $\{(\mathbf{B}^{(k)},\mathbf{S}^{(k)},\mathbf{C}^{(k)})\}$ has at least one limit point.

Because Algorithm 3 uses the alternating strategy, the sequences generated for $\mathbf{B}$, $\mathbf{S}$, and $\mathbf{C}$ can be analyzed separately [24, 29]. And because the update rule for $\mathbf{B}$ is similar in form to the update rule for $\mathbf{C}$, it suffices to prove the nonincreasing property for one of them.

A.1 Nonincreasing property of the update for $\mathbf{B}$

By using the auxiliary function approach [24], the nonincreasing property of the update can be proven if the following statement is true:

To define , let us rearrange into:

where is the -th row of . And also let us define:

where is the -th row of . Then we define:

where is a diagonal matrix with its diagonal entries defined as:

with

denotes the set of non-KKT indices in -th row of , and the symbol is defined so that and .

Then, the auxiliary function can be defined as:

(A.12)

Note that and are equivalent to and with is rearranged into , and other variables are reordered accordingly. And:

By definition, this matrix is positive definite at every point that does not satisfy the KKT conditions, so the auxiliary function is strictly convex and consequently has a unique minimum.

(A.13)

which is exactly the update rule for $\mathbf{B}$.

By using the Taylor series expansion, the objective can also be written as:

(A.14)

where

and

with the components arranged along its diagonal area.

To show the nonincreasing property of , the following statements must be proven:

  1. ,

  2. ,

  3. , and

  4. ,

The first and second statements are obvious from the definition of the auxiliary function in eq. (A.12); the third and the fourth statements will be proven in the two theorems below; and the boundedness of the relevant sequences will be proven in the last theorem of this appendix.

Theorem A.1.

Given a sufficiently large $\delta_{\mathbf{B}}$ and the boundedness of the relevant sequences, the stated inequality can be shown to hold; moreover, the equality holds if and only if the current iterate satisfies the KKT conditions.

Proof.

By subtracting eq. (A.12) from eq. (A.14), we get:

(A.15)

Let , then:

where <