Multiple Partitions Aligned Clustering
Abstract
Multi-view clustering is an important yet challenging task due to the difficulty of integrating the information from multiple representations. Most existing multi-view clustering methods explore the heterogeneous information in the space where the data points lie. Such common practice may cause significant information loss because of unavoidable noise or inconsistency among views. Since different views admit the same cluster structure, the natural space for fusion is the space of all partitions. Orthogonal to existing techniques, in this paper, we propose to leverage the multi-view information by fusing partitions. Specifically, we align each partition to form a consensus cluster indicator matrix through a distinct rotation matrix. Moreover, a weight is assigned to each view to account for the differences in clustering capacity among views. Finally, the basic partitions, weights, and consensus clustering are jointly learned in a unified framework. We demonstrate the effectiveness of our approach on several real datasets, where significant improvement is found over other state-of-the-art multi-view clustering methods.
1 Introduction
As an important problem in machine learning and data mining, clustering has been extensively studied for many years [8]. Technology advances have produced large volumes of data with multiple views. Multi-view features depict the same object from different perspectives, thereby providing complementary information. To leverage the multi-view information, multi-view clustering methods have drawn increasing interest in recent years [2]. Due to its unsupervised nature, multi-view clustering remains a challenging task. The key question is how to reach a consensus clustering among all views.
In the clustering field, two dominating methods are k-means [8] and spectral clustering [17]. Numerous variants of them have been developed over the past decades [3, 16, 26, 9]. Among them, some can tackle multi-view data, e.g., multi-view kernel k-means (MKKM) [23], robust multi-view k-means (RMKM) [1], co-trained multi-view spectral clustering (Cotrain) [14], and co-regularized multi-view spectral clustering (Coreg) [15]. Along with the development of the non-negative matrix factorization (NMF) technique, multi-view NMF has also gained a lot of attention. For example, a multi-manifold regularized NMF (MNMF) is designed to preserve the local geometrical structure of the manifolds for multi-view clustering [29].
Recently, subspace clustering methods have shown impressive performance. A subspace clustering method first obtains a graph, which reveals the relationships between data points; then applies spectral clustering to achieve an embedding of the original data; and finally utilizes k-means to obtain the final clustering result [4, 10]. Inspired by this, subspace clustering based multi-view clustering methods [5, 28, 6] have become popular in recent years. For instance, Gao et al. proposed the multi-view subspace clustering (MVSC) method [5]. In this approach, multiple graphs are constructed and forced to share the same cluster pattern; therefore, the final clustering is a negotiated result and might not be optimal. [24] supposes that the graphs of different views should be close to each other; after the graphs are obtained, their average is utilized to perform spectral clustering. This averaging strategy might be too simple to fully take advantage of the heterogeneous information. Furthermore, it is a two-stage algorithm: the constructed graph might not be optimal for the subsequent clustering [13].
By contrast, another class of graph-based multi-view clustering methods learns a common graph based on the adaptive neighbors idea [18, 27]. Specifically, point $x_i$ is connected to point $x_j$ with probability $s_{ij}$. $s_{ij}$ should have a large value if the distance between $x_i$ and $x_j$ is small; otherwise, $s_{ij}$ should be small. Therefore, the obtained $s_{ij}$ is treated as the similarity between $x_i$ and $x_j$. In [18], all views share the same similarity graph. Moreover, a weight for each view is automatically assigned based on the loss value. Though this approach has shown its competitiveness, one shortcoming is that it fails to consider the flexible local manifold structures of different views.
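To make the adaptive neighbors idea concrete, below is a minimal sketch of one standard closed-form construction in this family, where each point's probability vector solves a small simplex-constrained problem. The function name, the fixed neighborhood size $k$, and this particular closed form are our illustrative assumptions, not the exact procedure of [18]:

```python
import numpy as np

def adaptive_neighbor_graph(X, k=5):
    """Sketch: connect each x_i to x_j with probability s_ij that decays
    with the squared distance; the k-sparse closed form below is one
    common instantiation of the adaptive-neighbors idea."""
    n = X.shape[0]                       # X is n x d (one row per point)
    d = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # squared distances
    S = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(d[i])           # idx[0] == i (distance zero)
        di = d[i, idx[1:k + 2]]          # k nearest neighbors plus the (k+1)-th
        denom = k * di[k] - di[:k].sum() + 1e-12
        S[i, idx[1:k + 1]] = np.maximum((di[k] - di[:k]) / denom, 0)
    return (S + S.T) / 2                 # symmetrized similarity graph
```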
Although proven effective in many cases, existing graph-based multi-view clustering methods are limited in several aspects. First, they integrate the multi-view information in the feature space via some simple strategies. Due to the generally unavoidable noise in the data representations, the graphs might be severely damaged and fail to represent the true similarities among data points [11]. It would make more sense to reach the consensus clustering directly in the partition space, where a common cluster structure is shared by all views even though the graphs might differ considerably across views. Hence, partitions from various views might be less affected by noise and find it easier to reach an agreement. Second, most existing algorithms follow a multi-stage strategy, which might degrade the final performance; for example, the learned graph might not be suitable for the subsequent clustering task. A joint learning method is desired for this kind of problem.
Regarding the problems mentioned above, we propose a novel multiple Partitions Aligned Clustering (mPAC) method. Fig. 1 shows the idea of our approach. mPAC performs graph construction, spectral embedding, and partition integration via joint learning. In particular, an iterative optimization strategy allows the consensus clustering to guide the graph construction, which in turn contributes to a new unified clustering. To sum up, our contributions are twofold:

- Orthogonal to existing multi-view clustering methods, we integrate multi-view information in the partition space. This change of paradigm brings several benefits.

- An end-to-end, single-stage model is developed that goes from graph construction to the final clustering. In particular, we assume that the unified clustering is reachable from each view through a distinct transformation. Moreover, the output of our algorithm is the discrete cluster indicator matrix, so no subsequent step is needed.
Notations. In this paper, matrices and vectors are represented by capital and lowercase letters, respectively. For a matrix $M$, $m^i$ and $m_j$ represent the $i$-th row and $j$-th column of $M$, respectively. The $\ell_2$-norm of a vector $x$ is defined as $\|x\|_2 = \sqrt{x^\top x}$, where $^\top$ means transpose. $\mathrm{Tr}(M)$ denotes the trace of $M$, and $\|M\|_F$ denotes the Frobenius norm of $M$. The vector $\mathbf{1}$ has all elements equal to one, and $I$ refers to the identity matrix of a proper size. $\mathrm{Ind}$ represents the set of cluster indicator matrices. We use the superscript or subscript $v$ to denote the $v$-th view of a quantity interchangeably when convenient.
2 Subspace Clustering Revisited
In general, for data $X \in \mathbb{R}^{d \times n}$ with $d$ features and $n$ samples, the popular subspace clustering method can be formulated as:
$$\min_{Z}\ \|X - XZ\|_F^2 + \alpha f(Z), \quad \text{s.t.}\ \mathrm{diag}(Z) = 0, \qquad (1)$$
where $\alpha > 0$ is a balance parameter and $f(\cdot)$ is some regularization function, which varies across algorithms [21]. For simplicity, we just apply the Frobenius norm in this paper. $\mathrm{diag}(Z)$ is the vector consisting of the diagonal elements of $Z$. $W = (|Z| + |Z^\top|)/2$ is treated as the affinity graph. Therefore, once $Z$ is obtained, we can run a spectral clustering algorithm to obtain the clustering results, i.e.,
$$\min_{F}\ \mathrm{Tr}(F^\top L F), \quad \text{s.t.}\ F^\top F = I, \qquad (2)$$
where $L$ is the Laplacian of the graph $W$, $F \in \mathbb{R}^{n \times k}$ is the spectral embedding, and $k$ is the number of clusters. The graph Laplacian is defined by $L = D - W$, where $D$ is a diagonal matrix with $d_{ii} = \sum_j w_{ij}$. Since $F$ is not discrete, k-means is often used to recover the indicator matrix $Y$.
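For concreteness, here is a minimal sketch of this single-view pipeline (Eqs. (1)-(2) followed by k-means). It uses the Frobenius-norm regularizer and, as a simplification of ours, enforces $\mathrm{diag}(Z)=0$ by zeroing the diagonal after solving the unconstrained ridge problem:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def subspace_clustering(X, k, alpha=1.0):
    """Single-view sketch: learn Z (Eq. (1)), build the graph Laplacian,
    embed with its k smallest eigenvectors (Eq. (2)), then run k-means."""
    n = X.shape[1]                       # X is d x n
    Z = np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ X)
    np.fill_diagonal(Z, 0)               # approximate diag(Z) = 0
    W = (np.abs(Z) + np.abs(Z.T)) / 2    # affinity graph
    L = np.diag(W.sum(axis=1)) - W       # Laplacian L = D - W
    _, F = eigh(L, subset_by_index=[0, k - 1])   # spectral embedding
    return KMeans(n_clusters=k, n_init=10).fit_predict(F)
```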
When data with multiple views are available, Eq. (2) can be extended to this scenario accordingly. $\{X^v\}_{v=1}^m$ denotes the data with $m$ views, where $X^v \in \mathbb{R}^{d_v \times n}$ represents the $v$-th view with $d_v$ features. Basically, most methods in the literature solve the following problem
$$\min_{\{Z^v\}}\ \sum_{v=1}^{m}\Big[\|X^v - X^v Z^v\|_F^2 + \alpha f(Z^v)\Big] + \Phi(Z^1, \dots, Z^m), \qquad (3)$$
where $\Phi(\cdot)$ represents some strategy to obtain a consensus graph $Z$. For example, [5] enforces each graph $Z^v$ to share the same cluster indicator matrix; [24] penalizes the discrepancy between graphs, and their average is then used as the input to spectral clustering.
We observe that several drawbacks are shared by these approaches. First and foremost, they still lack an effective way to integrate multi-view knowledge while simultaneously accounting for the heterogeneity among views: simply taking the average of graphs or assigning a unique spectral embedding is not enough to take full advantage of the rich information, and the graph representation itself might not be optimal for characterizing the multi-view information. Second, they adopt a multi-stage approach. Since there is no mechanism to ensure the quality of the learned graphs, this approach might lead to suboptimal clustering results, which often occurs when noise exists. To address the above challenging issues, we propose the multiple Partitions Aligned Clustering (mPAC) method.
3 Proposed Approach
Unlike Eq. (3), which learns a unique graph $Z$ based on multiple graphs $\{Z^v\}_{v=1}^m$, we propose to learn a partition for each graph. Specifically, we adopt a joint learning strategy and formulate our objective function as
$$\min_{\{Z^v, F^v\}}\ \sum_{v=1}^{m}\Big[\|X^v - X^v Z^v\|_F^2 + \alpha \|Z^v\|_F^2 + \beta\,\mathrm{Tr}\big(F^{v\top} L^v F^v\big)\Big], \quad \text{s.t.}\ \mathrm{diag}(Z^v) = 0,\ F^{v\top} F^v = I. \qquad (4)$$
Here $F^v$ is the spectral embedding of the $v$-th view. Next, we propose a way to fuse the multi-view information in the partition space. For multi-view clustering, a shared cluster structure is assumed, so it is reasonable to posit one cluster indicator matrix $Y$ for all views. Unfortunately, the elements of each $F^v$ are continuous, and discrepancies also exist among the $F^v$'s; thus, it is challenging to integrate multiple $F^v$'s. To recover the underlying clusters $Y$, we assume that each partition $F^v$ is a perturbation of $Y$ and can be aligned with $Y$ through a rotation [12, 19]. Mathematically, this can be formulated as
$$\min_{Y, \{R^v\}}\ \sum_{v=1}^{m} \|Y - F^v R^v\|_F^2, \quad \text{s.t.}\ Y \in \mathrm{Ind},\ R^{v\top} R^v = I, \qquad (5)$$
where $R^v \in \mathbb{R}^{k \times k}$ represents an orthogonal matrix. Eq. (5) treats each view equally. As shown by many researchers, it is necessary to distinguish their contributions. Therefore, we introduce a weight parameter $\lambda^v$ for view $v$. Deploying a unified framework, we eventually reach the objective of mPAC:
$$\min_{\{Z^v, F^v, R^v\}, Y, \lambda}\ \sum_{v=1}^{m}\Big[\|X^v - X^v Z^v\|_F^2 + \alpha \|Z^v\|_F^2 + \beta\,\mathrm{Tr}\big(F^{v\top} L^v F^v\big) + \frac{\gamma}{\lambda^v}\,\|Y - F^v R^v\|_F^2\Big],$$
$$\text{s.t.}\ \mathrm{diag}(Z^v) = 0,\ F^{v\top} F^v = I,\ R^{v\top} R^v = I,\ Y \in \mathrm{Ind},\ \sum_{v=1}^m \lambda^v = 1,\ \lambda^v \ge 0. \qquad (6)$$
We can observe that the proposed approach is distinct from other methods in several aspects:

- Orthogonal to existing multi-view clustering techniques, Eq. (6) integrates heterogeneous information in the partition space. Considering that a common cluster structure is shared by all views, it is natural to perform information fusion based on partitions.

- Learning with a multi-stage strategy generally leads to suboptimal performance. We adopt a joint learning framework in which the learning of similarity graphs, spectral embeddings, view weights, and the unified cluster indicator matrix is seamlessly integrated.

- $Y$ is the final discrete cluster indicator matrix, so no discretization procedure is needed. This eliminates the k-means post-processing step, which is sensitive to initialization. With input $\{X^v\}_{v=1}^m$, Eq. (6) directly outputs the final discrete $Y$; thus, ours is an end-to-end, single-stage learning method.

- Multiple graphs are learned in our approach; hence, the local manifold structure of each view is well taken care of.

- As a matter of fact, Eq. (6) is not a simple unification of a pipeline of steps: it attempts to learn graphs with a structure that is optimal for clustering. According to spectral graph theory, the ideal graph has exactly $k$ connected components if there are $k$ clusters [12]; in other words, its Laplacian matrix $L$ has $k$ zero eigenvalues. Approximately, we can minimize the sum of the $k$ smallest eigenvalues of $L$, which, by Ky Fan's theorem, is equivalent to $\min_{F^\top F = I} \mathrm{Tr}(F^\top L F)$; a numerical illustration follows this list. Hence, the third term in Eq. (6) pushes each graph toward a structure that is optimal for clustering.
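The connection between the number of clusters and the zero eigenvalues of $L$ can be checked numerically. The short script below (an illustration of the cited property, not part of the paper) builds a graph with three connected components and counts the near-zero Laplacian eigenvalues:

```python
import numpy as np

# A block-diagonal affinity matrix with 3 connected components.
W = np.zeros((12, 12))
W[0:4, 0:4] = 1; W[4:7, 4:7] = 1; W[7:12, 7:12] = 1
np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W          # graph Laplacian L = D - W
eigvals = np.linalg.eigvalsh(L)
print(np.sum(eigvals < 1e-10))          # prints 3: one zero eigenvalue per component
```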
4 Optimization Methods
To handle the objective function in Eq. (6), we apply an alternating minimization scheme: one block of variables is updated while the others are fixed.
4.1 Update $Z^v$ for Each View
By fixing the other variables, we solve for $Z^v$ according to
$$\min_{Z^v}\ \|X^v - X^v Z^v\|_F^2 + \alpha \|Z^v\|_F^2 + \beta\,\mathrm{Tr}\big(F^{v\top} L^v F^v\big), \quad \text{s.t.}\ \mathrm{diag}(Z^v) = 0. \qquad (7)$$
It can be seen that each $Z^v$ is independent of the other views; therefore, we can solve each view separately. To simplify the notation, we omit the view index for now. Note that $L$ is a function of $W$ and hence of $Z$. Using $\mathrm{Tr}(F^\top L F) = \frac{1}{2}\sum_{i,j} \|f^i - f^j\|_2^2\, z_{ij}$, we can equivalently solve, for each column $z_j$ of $Z$,
$$\min_{z_j}\ \|x_j - X z_j\|_2^2 + \alpha \|z_j\|_2^2 + \frac{\beta}{2}\, q_j^\top z_j, \qquad (8)$$
where $q_j \in \mathbb{R}^{n}$ is the vector whose $i$-th component is defined by $q_{ij} = \|f^i - f^j\|_2^2$. By setting the first-order derivative with respect to $z_j$ to zero, we obtain
$$z_j = \big(X^\top X + \alpha I\big)^{-1}\Big(X^\top x_j - \frac{\beta}{4}\, q_j\Big). \qquad (9)$$
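A compact sketch of this closed-form update for one view is given below; as in Eq. (9), all columns are solved in one batched linear system. Zeroing the diagonal afterwards as a stand-in for the $\mathrm{diag}(Z)=0$ constraint is our simplification:

```python
import numpy as np

def update_Z(X, F, alpha, beta):
    """Eq. (9) for one view: X is d x n, F is the n x k spectral embedding."""
    n = X.shape[1]
    sq = np.sum(F ** 2, axis=1)
    Q = sq[:, None] + sq[None, :] - 2 * F @ F.T   # Q_ij = ||f^i - f^j||_2^2
    # all columns at once: Z = (X^T X + alpha I)^{-1} (X^T X - beta/4 * Q)
    Z = np.linalg.solve(X.T @ X + alpha * np.eye(n), X.T @ X - (beta / 4) * Q)
    np.fill_diagonal(Z, 0)
    return Z
```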
4.2 Update $F^v$ for Each View
Similarly, we drop all terms unrelated to $F^v$ and again omit the view index. This yields
$$\min_{F}\ \beta\,\mathrm{Tr}(F^\top L F) + \frac{\gamma}{\lambda}\,\|Y - F R\|_F^2, \quad \text{s.t.}\ F^\top F = I. \qquad (10)$$
This subproblem is an optimization over the Stiefel manifold and can be efficiently solved with the feasible method for orthogonality-constrained optimization developed in [25].
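For intuition, one feasible iteration of that solver can be sketched as a Cayley-transform step: it moves along the negative gradient while keeping $F^\top F = I$ exactly. The fixed step size below is our simplification; [25] selects it with a Barzilai-Borwein line search:

```python
import numpy as np

def update_F_step(F, L, Y, R, beta, gamma_over_lam, tau=1e-2):
    """One curvilinear step for Eq. (10) on the Stiefel manifold."""
    # Euclidean gradient of beta*Tr(F^T L F) + (gamma/lambda)*||Y - F R||_F^2
    G = 2 * beta * (L @ F) - 2 * gamma_over_lam * (Y - F @ R) @ R.T
    A = G @ F.T - F @ G.T                         # skew-symmetric matrix
    I = np.eye(F.shape[0])
    # Cayley transform: the new iterate still satisfies F^T F = I
    return np.linalg.solve(I + (tau / 2) * A, (I - (tau / 2) * A) @ F)
```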
4.3 Update $R^v$ for Each View
With respect to $\{R^v\}$, the objective function is additive, so we can solve for each $R^v$ individually. Specifically, dropping the view index,
$$\min_{R}\ \|Y - F R\|_F^2, \quad \text{s.t.}\ R^\top R = I. \qquad (11)$$
Lemma 1.
For the problem
$$\min_{R}\ \|A - B R\|_F^2, \quad \text{s.t.}\ R^\top R = I, \qquad (12)$$
its closed-form solution is $R^\star = U V^\top$, where $U$ and $V$ are the left and right unitary matrices of the SVD decomposition of $B^\top A$, respectively [22]. Applying Lemma 1 to Eq. (11) with $B = F$ and $A = Y$ gives the optimal rotation directly.
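In code, the rotation update is a one-liner via the SVD, e.g.:

```python
import numpy as np

def update_R(F, Y):
    """Lemma 1 applied to Eq. (11): the minimizer of ||Y - F R||_F^2
    over orthogonal R is U V^T, where U S V^T is the SVD of F^T Y."""
    U, _, Vt = np.linalg.svd(F.T @ Y)
    return U @ Vt
```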
4.4 Update $Y$
With the other variables fixed, the subproblem for $Y$ is
$$\min_{Y \in \mathrm{Ind}}\ \sum_{v=1}^{m} \frac{\gamma}{\lambda^v}\,\|Y - F^v R^v\|_F^2. \qquad (13)$$
Let us unfold the above objective function: $\|Y - F^v R^v\|_F^2 = \mathrm{Tr}(Y^\top Y) - 2\,\mathrm{Tr}(Y^\top F^v R^v) + \|F^v R^v\|_F^2$, where $\mathrm{Tr}(Y^\top Y) = n$ and $\|F^v R^v\|_F^2$ are constants with respect to $Y$.
Thus, we can equivalently solve
$$\max_{Y \in \mathrm{Ind}}\ \mathrm{Tr}\big(Y^\top M\big), \quad \text{where}\ M = \sum_{v=1}^{m} \frac{1}{\lambda^v}\, F^v R^v. \qquad (14)$$
It admits a closed-form solution, that is, for every $i$ and $j$,
$$Y_{ij} = \begin{cases} 1, & j = \arg\max_{c} M_{ic}, \\ 0, & \text{otherwise}. \end{cases} \qquad (15)$$
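A sketch of this update: accumulate the weighted, aligned embeddings and take a row-wise argmax to produce the one-hot indicator matrix:

```python
import numpy as np

def update_Y(Fs, Rs, lams):
    """Eqs. (14)-(15): Fs, Rs, lams are per-view lists of F^v, R^v, lambda^v."""
    M = sum(F @ R / lam for F, R, lam in zip(Fs, Rs, lams))
    Y = np.zeros_like(M)
    Y[np.arange(M.shape[0]), np.argmax(M, axis=1)] = 1   # one-hot rows
    return Y
```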
4.5 Update $\lambda^v$ for Each View
Let us denote $\|Y - F^v R^v\|_F^2$ by $h^v$; then this subproblem can be expressed as
$$\min_{\lambda}\ \sum_{v=1}^{m} \frac{h^v}{\lambda^v}, \quad \text{s.t.}\ \sum_{v=1}^{m} \lambda^v = 1,\ \lambda^v \ge 0. \qquad (16)$$
Based on the Cauchy-Schwarz inequality, we have
$$\sum_{v=1}^{m} \frac{h^v}{\lambda^v} = \Big(\sum_{v=1}^{m} \frac{h^v}{\lambda^v}\Big)\Big(\sum_{v=1}^{m} \lambda^v\Big) \ge \Big(\sum_{v=1}^{m} \sqrt{h^v}\Big)^2. \qquad (17)$$
The minimum, which is a constant, is achieved when $\lambda^v \propto \sqrt{h^v}$. Thus, the optimal $\lambda$ is given by, $\forall v$,
$$\lambda^v = \frac{\sqrt{h^v}}{\sum_{u=1}^{m} \sqrt{h^u}}. \qquad (18)$$
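The weight update is equally direct; a view with a larger alignment residual $h^v$ receives a larger $\lambda^v$, which shrinks its effective coefficient $\gamma/\lambda^v$ in Eq. (6) and thus down-weights unreliable views:

```python
import numpy as np

def update_lambda(Fs, Rs, Y):
    """Eq. (18): lambda^v proportional to sqrt(h^v), h^v = ||Y - F^v R^v||_F^2."""
    h = np.array([np.linalg.norm(Y - F @ R) ** 2 for F, R in zip(Fs, Rs)])
    return np.sqrt(h) / np.sqrt(h).sum()
```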
For clarity, we summarize the procedure for solving Eq. (6) in Algorithm 1. Our code is available at https://github.com/sckangz/mPAC.
5 Experiments
Table 1: Statistics of the benchmark data sets.

Data        Handwritten  Caltech7  Caltech20  BBCSport
# Views     6            6         6          4
# Points    2000         1474      2386       116
# Clusters  10           7         20         5
5.1 Experimental Setup
We conduct experiments on four benchmark data sets: BBCSport, Caltech7, Caltech20, and Handwritten Numerals. Their statistics are summarized in Table 1. We compare the proposed mPAC with several state-of-the-art methods from different categories, including Cotrain [14], Coreg [15], MKKM [23], RMKM [1], MVSC [5], MNMF [29], and parameter-free auto-weighted multiple graph learning (AMGL) [18]. Furthermore, the classical k-means (KM) method with concatenated features (i.e., all features, AllFea for short) is included as a baseline; that is, all views are treated as equally important. Following [7], all values of each view are normalized into the range $[0, 1]$. To achieve a comprehensive evaluation, we apply five widely-used metrics to examine the effectiveness of our method: F-score, Precision, Recall, Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). We initialize our algorithm with the results from [20].
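For reproducibility, the metrics can be computed as below. F-score, Precision, and Recall here are the pair-counting variants commonly used for clustering evaluation; whether the paper's evaluation code uses exactly these conventions is our assumption:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score
from sklearn.metrics.cluster import pair_confusion_matrix

def evaluate(y_true, y_pred):
    """Pairwise F-score/Precision/Recall plus NMI and ARI."""
    (tn, fp), (fn, tp) = pair_confusion_matrix(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {"F-score": 2 * precision * recall / (precision + recall),
            "Precision": precision,
            "Recall": recall,
            "NMI": normalized_mutual_info_score(y_true, y_pred),
            "ARI": adjusted_rand_score(y_true, y_pred)}
```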
Table 2: Clustering performance on BBCSport.

Method       F-score          Precision        Recall           NMI              ARI
KM(AllFea)   0.3834(0.0520)   0.2345(0.0463)   0.6616(0.2161)   0.1701(0.0763)   0.1561(0.0863)
Cotrain      0.3094(0.0107)   0.2348(0.0034)   0.4556(0.0398)   0.1591(0.0160)   0.1144(0.0064)
Coreg        0.3116(0.0305)   0.2337(0.0053)   0.4879(0.1173)   0.1599(0.0192)   0.1166(0.0090)
MKKM         0.3779(0.0162)   0.2359(0.0156)   0.7679(0.1402)   0.1160(0.0392)   0.1248(0.0309)
RMKM         0.3774(0.0167)   0.2476(0.0113)   0.8416(0.1563)   0.1754(0.0259)   0.1100(0.0200)
MVSC         0.3540(0.0270)   0.2459(0.0406)   0.7017(0.0801)   0.1552(0.0812)   0.1292(0.0666)
MNMF         0.3755(0.0307)   0.2685(0.0117)   0.8558(0.1261)   0.2576(0.0614)   0.1274(0.0515)
AMGL         0.3963(0.0167)   0.2801(0.0226)   0.6976(0.0971)   0.2686(0.0419)   0.0785(0.0399)
mPAC         0.6780           0.7500           0.6187           0.6146           0.5617
Table 3: Clustering performance on Caltech7.

Method       F-score          Precision        Recall           NMI              ARI
KM(AllFea)   0.4688(0.0327)   0.7868(0.0080)   0.3618(0.0371)   0.4278(0.0120)   0.3172(0.0297)
Cotrain      0.4678(0.0172)   0.7192(0.0136)   0.3550(0.0168)   0.3235(0.0226)   0.3342(0.0157)
Coreg        0.4981(0.0092)   0.7014(0.0076)   0.3622(0.0098)   0.3738(0.0061)   0.2894(0.0046)
MKKM         0.4804(0.0059)   0.7659(0.0178)   0.3663(0.0040)   0.4530(0.0132)   0.3053(0.0096)
RMKM         0.4514(0.0409)   0.7491(0.0277)   0.3236(0.0376)   0.4220(0.0197)   0.2865(0.0429)
MVSC         0.3341(0.0102)   0.5387(0.0271)   0.2427(0.0130)   0.1938(0.0185)   0.1242(0.0140)
MNMF         0.4414(0.0303)   0.7587(0.0330)   0.3115(0.0262)   0.4111(0.0175)   0.3456(0.0576)
AMGL         0.6422(0.0139)   0.6638(0.0125)   0.6219(0.0164)   0.5711(0.0149)   0.4295(0.0208)
mPAC         0.6763           0.6306           0.7292           0.5741           0.4963
Table 4: Clustering performance on Caltech20.

Method       F-score          Precision        Recall           NMI              ARI
KM(AllFea)   0.3697(0.0071)   0.6235(0.0212)   0.2583(0.0095)   0.5578(0.0133)   0.2850(0.0063)
Cotrain      0.3750(0.0287)   0.6375(0.0253)   0.2749(0.0238)   0.4895(0.0117)   0.3085(0.0281)
Coreg        0.3719(0.0087)   0.6245(0.0137)   0.2882(0.0070)   0.5615(0.0042)   0.2751(0.0084)
MKKM         0.3583(0.0114)   0.6724(0.0158)   0.2865(0.0092)   0.5680(0.0142)   0.3039(0.0110)
RMKM         0.3955(0.0113)   0.6307(0.0144)   0.2712(0.0096)   0.5899(0.0092)   0.2952(0.0112)
MVSC         0.5417(0.0239)   0.4100(0.0245)   0.7994(0.0110)   0.4875(0.0113)   0.3800(0.0246)
MNMF         0.3643(0.0157)   0.6509(0.0119)   0.2530(0.0136)   0.5367(0.0132)   0.3128(0.0042)
AMGL         0.4017(0.0248)   0.3503(0.0479)   0.4827(0.0450)   0.5656(0.0387)   0.2618(0.0453)
mPAC         0.5645           0.4350           0.8035           0.5986           0.5083
Table 5: Clustering performance on Handwritten Numerals.

Method       F-score          Precision        Recall           NMI              ARI
KM(AllFea)   0.6671(0.0105)   0.6550(0.0154)   0.6889(0.0180)   0.7183(0.0106)   0.6443(0.0122)
Cotrain      0.6859(0.0172)   0.6634(0.0281)   0.7109(0.0252)   0.7222(0.0149)   0.6498(0.0227)
Coreg        0.6840(0.0269)   0.6360(0.0336)   0.6413(0.0198)   0.7583(0.0197)   0.6266(0.0314)
MKKM         0.6756(0.0000)   0.6501(0.0000)   0.7050(0.0000)   0.7526(0.0000)   0.7009(0.0000)
RMKM         0.6542(0.0258)   0.6218(0.0350)   0.6915(0.0158)   0.7431(0.0209)   0.6013(0.0300)
MVSC         0.6753(0.0335)   0.6193(0.0537)   0.7537(0.0215)   0.7566(0.0186)   0.6079(0.0419)
MNMF         0.7068(0.0272)   0.6957(0.0294)   0.7183(0.0250)   0.7431(0.0227)   0.6407(0.0056)
AMGL         0.7404(0.1070)   0.6650(0.1372)   0.8457(0.0560)   0.8392(0.0543)   0.7066(0.1235)
mPAC         0.7473           0.7348           0.7200           0.7370           0.7069
5.2 Experimental Results
We run each method 10 times and report the mean and standard deviation (std). Our proposed method only needs to be run once, since no k-means step is involved. The clustering performance on the four data sets is summarized in Tables 2-5, respectively. We can observe that our mPAC method achieves the best performance in most cases, which validates the effectiveness of our approach. In general, our method works better than the k-means and NMF based techniques, and the improvement is remarkable. With respect to graph-based clustering methods, our approach also demonstrates its superiority. In particular, both MVSC and AMGL assume that all graphs produce the same partition, whereas our method learns one partition for each view and finds the underlying clusters through the alignment mechanism.
To visualize the effect of partition alignment, we apply t-SNE to the clustering results on the Handwritten Numerals data. As shown in Fig. 2, some partitions exhibit a good cluster structure, so it might be easy to find a good $Y$. On the other hand, although the partition of view 5 is poor, we can still achieve a good solution $Y$. This indicates that our method can reliably obtain a good clustering since it operates in the partition space. By contrast, previous methods may not consistently provide a good solution.
5.3 Sensitivity Analysis
Taking Caltech7 as an example, we demonstrate the influence of the parameters on clustering performance. From Fig. 3, we can observe that our performance is quite stable over a wide range of parameter settings. In particular, it becomes more robust to $\alpha$ and $\beta$ when $\gamma$ increases, which indicates the importance of the partition alignment term.
6 Conclusion
In this paper, a novel multi-view clustering method is developed. Different from existing approaches, it seeks to integrate multi-view information in the partition space. We assume that each partition can be aligned with the consensus clustering through a rotation matrix. Furthermore, graph learning and clustering are performed in a unified framework, so that they can be jointly optimized. The proposed method is validated on four benchmark data sets.
Acknowledgments
This work was supported in part by grants from the Natural Science Foundation of China (Nos. 61806045 and 61572111), two Fundamental Research Funds for the Central Universities of China (Nos. ZYGX2017KYQD177 and A03017023701012), and a 985 Project of UESTC (No. A1098531023601041).
References
[1] (2013) Multi-view k-means clustering on big data. In IJCAI, pp. 2598–2604.
[2] (2017) A survey on multi-view clustering. arXiv preprint arXiv:1712.06246.
[3] (2013) TW-k-means: automated two-level variable weighting clustering algorithm for multiview data. IEEE Transactions on Knowledge and Data Engineering 25(4), pp. 932–944.
[4] (2013) Sparse subspace clustering: algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(11), pp. 2765–2781.
[5] (2015) Multi-view subspace clustering. In ICCV, pp. 4238–4246.
[6] (2019) Auto-weighted multi-view clustering via kernelized graph learning. Pattern Recognition 88, pp. 174–184.
[7] (2018) Self-weighted multiview clustering with soft capped norm. Knowledge-Based Systems 158, pp. 1–8.
[8] (2010) Data clustering: 50 years beyond k-means. Pattern Recognition Letters 31(8), pp. 651–666.
[9] (2018) Self-weighted multiple kernel learning for graph-based clustering and semi-supervised classification. In IJCAI, pp. 2312–2318.
[10] (2019) Similarity learning via kernel preserving embedding. In AAAI.
[11] (2019) Robust graph learning from noisy data. IEEE Transactions on Cybernetics, pp. 1–11.
[12] (2018) Unified spectral clustering with optimal graph. In AAAI.
[13] (2017) Twin learning for similarity and clustering: a unified kernel approach. In AAAI.
[14] (2011) A co-training approach for multi-view spectral clustering. In ICML, pp. 393–400.
[15] (2011) Co-regularized multi-view spectral clustering. In NIPS, pp. 1413–1421.
[16] (2018) Partition level constrained clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(10), pp. 2469–2483.
[17] (2002) On spectral clustering: analysis and an algorithm. In NIPS, pp. 849–856.
[18] (2016) Parameter-free auto-weighted multiple graph learning: a framework for multiview clustering and semi-supervised classification. In IJCAI, pp. 1881–1887.
[19] (2018) Multiview clustering via adaptively weighted Procrustes. In SIGKDD, pp. 2022–2030.
[20] (2016) The constrained Laplacian rank algorithm for graph-based clustering. In AAAI.
[21] (2018) Connections between nuclear-norm and Frobenius-norm-based representations. IEEE Transactions on Neural Networks and Learning Systems 29(1), pp. 218–224.
[22] (1966) A generalized solution of the orthogonal Procrustes problem. Psychometrika 31(1), pp. 1–10.
[23] (2012) Kernel-based weighted multi-view clustering. In ICDM, pp. 675–684.
[24] (2016) Iterative views agreement: an iterative low-rank based structured optimization method to multi-view spectral clustering. In IJCAI, pp. 2153–2159.
[25] (2013) A feasible method for optimization with orthogonality constraints. Mathematical Programming 142(1–2), pp. 397–434.
[26] (2018) Fast spectral clustering learning with hierarchical bipartite graph for large-scale data. Pattern Recognition Letters.
[27] (2017) Graph learning for multiview clustering. IEEE Transactions on Cybernetics, pp. 1–9.
[28] (2017) Latent multi-view subspace clustering. In CVPR, pp. 4279–4287.
[29] (2017) Multi-view clustering via multi-manifold regularized non-negative matrix factorization. Neural Networks 88, pp. 74–89.