Integrative Generalized Convex Clustering Optimization and Feature Selection for Mixed Multi-View Data
Abstract
In mixed multi-view data, multiple sets of diverse features are measured on the same set of samples. By integrating all available data sources, we seek to discover common group structure among the samples that may be hidden in separate cluster analyses of any single data view. While several techniques for such integrative clustering have been explored, we propose and develop a convex formulation that inherits the strong statistical, mathematical and empirical properties of increasingly popular convex clustering methods. Specifically, our Integrative Generalized Convex Clustering Optimization (iGecco) method employs different convex distances, losses, or divergences for each of the different data views with a joint convex fusion penalty that leads to common groups. Additionally, integrating mixed multi-view data is often challenging when each data source is high-dimensional. To perform feature selection in such scenarios, we develop an adaptive shifted group-lasso penalty that selects features by shrinking them towards their loss-specific centers. Our so-called iGecco+ approach selects the features from each data view that are best for determining the groups, often leading to improved integrative clustering. To fit our model, we develop a new type of generalized multi-block ADMM algorithm using subproblem approximations that fits our model more efficiently for big data sets. Through a series of numerical experiments and real data examples on text mining and genomics, we show that iGecco+ achieves superior empirical performance for high-dimensional mixed multi-view data.
Keywords: Integrative clustering, convex clustering, feature selection, convex optimization, sparse clustering, GLM deviance, Bregman divergences
1 Introduction
As the volume and complexity of data grows, statistical data integration has gained increasing attention, as it can lead to discoveries that are not evident in analyses of a single data set. We study a specific data-integration problem in which we seek to leverage common samples measured across multiple diverse sets of features of different types (e.g., continuous, count-valued, categorical, or skewed continuous). This type of data is often called mixed, multi-view data (hall1997introduction; acar2011all; lock2013joint; tang2018integrated; baker2019feature). While many techniques have been developed to analyze each individual data type separately, there are currently few methods that can directly analyze mixed multi-view data jointly. Yet, such data is common in many areas such as electronic health records, integrative genomics, multi-modal imaging, remote sensing, national security, online advertising, and environmental studies. For example, in genomics, scientists often study gene regulation by exploring only gene expression data, but other data types, such as short RNA expression and DNA methylation, are all part of the same gene regulatory system. Joint analysis of such data can give scientists a more holistic view of the problem they study. But this presents a major challenge, as each individual data type is high-dimensional (i.e., has a larger number of features than samples) with many uninformative features. Further, each data view contains a different data type: expression of genes or short RNAs measured via sequencing is typically count-valued or zero-inflated plus skewed continuous data, whereas DNA methylation data is typically proportion-valued. In this paper, we seek to leverage multiple sources of mixed data to better cluster the common samples and to select the relevant features that distinguish the inherent group structure.
We propose a convex formulation which integrates mixed types of data with different data-specific losses, clusters common samples with a joint fusion penalty, and selects informative features that separate groups. Due to the convex formulation, our methods enjoy strong statistical, mathematical and empirical properties. We make several methodological contributions. First, we consider employing different types of losses to better handle non-Gaussian data with Generalized Convex Clustering Optimization (Gecco), which replaces the Euclidean distances in convex clustering with more general convex losses. We show that for different losses, Gecco's fusion penalty forms different types of centroids, which we call loss-specific centers. To integrate mixed multi-view data and perform clustering, we incorporate different convex distances, losses, or divergences for each of the data views with a joint convex fusion penalty that leads to common groups; this gives rise to Integrative Generalized Convex Clustering (iGecco). Further, when dealing with high-dimensional data, practitioners seek interpretability by identifying important features which separate the groups. To facilitate feature selection in Gecco and iGecco, we develop an adaptive shifted group-lasso penalty that selects features by shrinking them towards their loss-specific centers, leading to Gecco+ and iGecco+, which perform clustering and variable selection simultaneously. To solve our methods in a computationally efficient manner, we develop a new general multi-block ADMM algorithm using subproblem approximations, and we make an optimization contribution by proving that this new class of algorithms converges to the global solution.
1.1 Related Literature
Our goal is to develop a unified, convex formulation of integrative clustering with feature selection, based on increasingly popular convex clustering methods. pelckmans2005convex; lindsten2011just; hocking2011clusterpath proposed convex clustering, which uses a fusion penalty to achieve agglomerative clustering in the manner of hierarchical clustering. This convex formulation guarantees a globally optimal solution, enjoys strong statistical and mathematical properties, and often demonstrates empirical performance superior to competing approaches. Specifically, pelckmans2005convex; chi2017convex showed that it yields solutions stable to small perturbations of the data or tuning parameters; radchenko2017convex studied statistical consistency; tan2015statistical established its link to hierarchical clustering as well as prediction consistency; and many others have studied further appealing theoretical properties (zhu2014convex; sui2018convex; chi2019recovering). Despite these advantages, convex clustering has not yet gained widespread popularity due to its intensive computation. Recently, fast and efficient algorithms have been proposed to solve convex clustering and estimate its regularization paths (chi2015splitting; weylandt2019dynamic). Meanwhile, convex clustering has been extended to biclustering (chi2017convex) and many other applications (chi2018provable; choi2019regularized).
One potential drawback of convex clustering, however, is that thus far it has only been studied with Euclidean distances between data points and their corresponding cluster centers. As is well known, the Euclidean metric performs poorly on data that is highly non-Gaussian, such as binary, count-valued, or skewed data, or data with outliers. While wang2016robust studied robust convex clustering and sui2018convex investigated convex clustering with metric learning, there has been no general investigation of convex clustering for non-Gaussian data, and data integration on mixed data has not been studied. Many others, however, have proposed clustering methods for non-Gaussian data in other contexts. One approach is to perform standard clustering procedures on transformed data (anders2010differential; bullard2010evaluation; marioni2008rna; robinson2010scaling). But choosing an appropriate transformation that retains the original cluster signal is a challenging problem. Another popular approach is hierarchical clustering with distance metrics specified for non-Gaussian data (choi2010survey; fowlkes1983method). Closely related to this, banerjee2005clustering studied clustering algorithms utilizing a large class of loss functions via Bregman divergences. Yet the proposed methods are all extensions of existing clustering approaches and hence inherit both the good and bad properties of those approaches. There has also been work on model-based clustering, which assumes that the data are generated by a finite mixture model; for example, banfield1993model; si2013model propose such models for the Poisson and negative binomial distributions. Still, these methods have non-convex formulations and, like all model-based clustering methods, yield local solutions.
We propose to adopt an approach similar to that of banerjee2005clustering and study convex clustering using different loss functions; hence our method inherits the desirable properties of convex clustering while also handling non-Gaussian data. More importantly, there is currently no literature on data integration using convex clustering; we achieve this by integrating different types of general convex losses with a joint fusion penalty.
Integrative clustering, however, has been well-studied in the literature. The most popular approach is to use latent variables to capture the inherent structure of multiple types of data. This achieves a joint dimension reduction, and clustering is then performed on the joint latent variables (shen2009integrative; shen2012integrative; shen2013sparse; mo2013pattern; mo2017fully; meng2015mocluster). Similar in nature to the latent-variable approach, matrix factorization methods assume that the data has an intrinsic low-dimensional representation, with the dimension often corresponding to the number of clusters (lock2013joint; hellton2016integrative; zhang2012discovery; chalise2017integrative; zhang2011novel; yang2015non). There are a few major drawbacks of latent-variable or dimension-reduction approaches, however. First, it is often hard to directly interpret latent factors or low-dimensional projections. Second, many of these approaches are based on non-convex formulations yielding local solutions. And third, choosing the rank of the factors or projections is known to be very challenging in practice and often impacts the resulting clustering solutions. Another approach to integrative clustering is clustering of clusters (COC), which performs a cluster analysis on each single data set and then integrates the primary clustering results into final group assignments using consensus clustering (hoadley2014multiplatform; lock2013bayesian; kirk2012bayesian; savage2013identifying; wang2014similarity). This, however, has several potential limitations, as each individual data set might not have enough signal to discern the joint clusters, or the individual cluster assignments may be too disparate to reach a meaningful consensus.
Finally, others have proposed distance-based clustering for mixed types of data by first defining an appropriate distance metric for mixed data (for example, the Gower distance of gower1971general) and then applying an existing distance-based clustering algorithm such as hierarchical clustering (ahmad2007k; ji2012fuzzy). Consequently, these methods inherit both the good and bad properties of distance-based clustering approaches. Notice that all of these approaches are either two-step procedures or algorithmic or non-convex problems that yield local solutions. In practice, such approaches often lead to unreliable and unstable results.
Clustering is known to perform poorly on high-dimensional data, as most techniques are highly sensitive to uninformative features. One common approach is to reduce the dimensionality of the data via PCA, NMF, or t-SNE before clustering (ghosh2002mixture; bernardo2003bayesian; tamayo2007metagene). A major limitation of such approaches is that the resulting clusters are not directly interpretable in terms of feature importance. To address this, several authors have proposed sparse clustering for high-dimensional data, which performs clustering and feature selection simultaneously by iteratively applying clustering techniques to subsets of features selected via regularization (witten2010framework; sun2012regularized; chang2014sparse). This approach, however, is non-convex and highly susceptible to poor local solutions. Others have proposed penalized model-based clustering that selects features (raftery2006variable; wang2008variable; pan2007penalized). Still, these methods inherit several disadvantages of model-based clustering approaches. Moreover, sparse integrative clustering is relatively understudied. shen2013sparse; mo2013pattern extended iCluster using a penalized latent-variable approach to jointly model multiple omics data types, inducing sparsity on the latent-variable coefficients via regularization. As feature selection is performed on the latent variables, however, this is less interpretable in terms of selecting the features directly responsible for distinguishing clusters. Recently, and most closely related to our own work, wang2018sparse proposed sparse convex clustering, which adds a group-lasso penalty term on the cluster centers to shrink them towards zero, thus selecting relevant features. This penalty, however, is only appropriate for Euclidean distances with centered data; otherwise, the penalty term shrinks towards the incorrect cluster centers.
For feature selection with different distances and losses, we propose an adaptive shifted group-lasso penalty that selects features by shrinking them towards their appropriate centroid.
2 Integrative Generalized Convex Clustering with Feature Selection
In this section, we introduce our new methods, beginning with Gecco and iGecco, and then show how to achieve feature selection via regularization. We also discuss some practical considerations for applying our methods and develop an adaptive version of our approaches.
2.1 Generalized Convex Clustering Optimization (Gecco)
In many applications, we seek to cluster data that is non-Gaussian. In the literature, most do this using distance metrics other than the Euclidean distance (choi2010survey; fowlkes1983method; de2004clustering). Others use losses based on exponential-family deviances, closely related to Bregman divergences (banerjee2005clustering).
To account for different types of losses for non-Gaussian data, we propose to replace the Euclidean distances in convex clustering with more general convex losses; this gives rise to Generalized Convex Clustering Optimization (Gecco):

$$\underset{U}{\text{minimize}} \;\; \sum_{i=1}^{n} \ell(x_i, u_i) \; + \; \gamma \sum_{i < j} w_{ij} \, \| u_i - u_j \|_2 .$$

Here, our data $X$ is an $n \times p$ matrix consisting of $n$ observations and $p$ features; $U$ is an $n \times p$ centroid matrix with the $i$th row, $u_i$, the cluster centroid attached to point $x_i$. The general loss $\ell(x_i, u_i)$ refers to a convex loss metric that measures the dissimilarity between the data point $x_i$ and its assigned centroid $u_i$. $\| \cdot \|_q$ is the $\ell_q$ norm of a vector, and usually $q \in \{1, 2, \infty\}$ is considered (hocking2011clusterpath). Here we prefer using the $\ell_2$ norm in the fusion penalty, as it encourages entire rows of similar observations to be fused together simultaneously and is also rotation-invariant; but one could use the $\ell_1$ or $\ell_\infty$ norms as well. $\gamma$ is a positive tuning constant and $w_{ij}$ is a non-negative weight. When $\gamma$ equals zero, each data point occupies a unique cluster. As $\gamma$ increases, the fusion penalty encourages some of the rows of the cluster-center matrix $U$ to be exactly fused, forming clusters. When $\gamma$ becomes sufficiently large, all centroids eventually coalesce to a single cluster centroid, which we define as the loss-specific center associated with $\ell$. Hence $\gamma$ regulates both the cluster assignment and the number of clusters, providing a family of clustering solutions. The weights $w_{ij}$ should be specified by the user in advance and are not tuning parameters; we discuss choices of weights for various convex losses in Section 2.5.
Going beyond Euclidean distances, we propose to employ convex distance metrics as well as the deviances associated with exponential-family distributions and Bregman divergences, which are always convex. Interestingly, we show that each of these possible loss functions shrinks the cluster centers $u_i$ to a different loss-specific center, instead of to the mean-based centroid as in convex clustering with Euclidean distances. For example, one may want to use least absolute deviations (the $\ell_1$ norm, or Manhattan distance) for skewed data or for data with outliers; with this loss, we show that all observations fuse to the median when $\gamma$ is sufficiently large. We emphasize loss-specific centers here as they will be important for feature selection in the next section. For completeness, we list common distances and deviance-based losses, as well as their respective loss-specific centers, in Table 1. (See Appendix F for all calculations associated with loss-specific centers; we provide a formal proof when studying the properties of our approaches in Section 2.4.)
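To make the notion of a loss-specific center concrete, the following sketch numerically checks two rows of Table 1: the squared Euclidean loss over a set of points is minimized at their mean, while the Manhattan ($\ell_1$) loss is minimized at their median. The skewed toy data and grid search are illustrative, not part of the paper's method.

```python
import numpy as np

# Loss-specific centers: minimize sum_i loss(x_i, u) over a fine grid of
# candidate centers u, for two choices of loss.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=101)  # skewed, count-like data

grid = np.linspace(x.min(), x.max(), 20001)          # candidate centers u
sq_loss = ((x[:, None] - grid[None, :]) ** 2).sum(axis=0)  # sum_i (x_i - u)^2
l1_loss = np.abs(x[:, None] - grid[None, :]).sum(axis=0)   # sum_i |x_i - u|

u_sq = grid[sq_loss.argmin()]   # squared Euclidean loss -> mean
u_l1 = grid[l1_loss.argmin()]   # Manhattan loss -> median
print(abs(u_sq - x.mean()) < 1e-2)
print(abs(u_l1 - np.median(x)) < 1e-2)
```

This is exactly why a feature-selection penalty for an $\ell_1$ loss must shrink toward the median rather than toward zero or the mean.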
Table 1: Common distances and deviance-based losses with their loss-specific centers.

| Data Type | Loss Type | Loss Function | Loss-specific Center |
|---|---|---|---|
| Continuous | Euclidean ($\ell_2^2$) | $\frac{1}{2}\|x_i - u_i\|_2^2$ | mean, $\bar{x}$ |
| Skewed Continuous | Manhattan ($\ell_1$) | $\|x_i - u_i\|_1$ | median$(x)$ |
| | Minkowski ($\ell_q$) | $\|x_i - u_i\|_q$ | no closed form |
| | Mahalanobis (weighted $\ell_2$) | $(x_i - u_i)^T \Sigma^{-1} (x_i - u_i)$ | no closed form |
| | Chebychev ($\ell_\infty$) | $\|x_i - u_i\|_\infty$ | no closed form |
| | Canberra (weighted $\ell_1$) | $\sum_j \frac{\lvert x_{ij} - u_{ij}\rvert}{\lvert x_{ij}\rvert + \lvert u_{ij}\rvert}$ | no closed form |
| Binary | Bernoulli log-likelihood | $\sum_j [\log(1 + e^{u_{ij}}) - x_{ij} u_{ij}]$ | $\log\big(\bar{x} / (1 - \bar{x})\big)$ |
| | Binomial Deviance | $2\sum_j \big[x_{ij} \log\frac{x_{ij}}{u_{ij}} + (1 - x_{ij}) \log\frac{1 - x_{ij}}{1 - u_{ij}}\big]$ | $\bar{x}$ |
| | Hinge Loss | $\sum_j \max(0, 1 - x_{ij} u_{ij})$ | mode |
| | KL divergence | $\sum_j x_{ij} \log\frac{x_{ij}}{u_{ij}}$ | no closed form |
| | Hamming ($\ell_0$) | $\sum_j 1(x_{ij} \neq u_{ij})$ | mode$(x)$ |
| Count | Poisson log-likelihood | $\sum_j [e^{u_{ij}} - x_{ij} u_{ij}]$ | $\log \bar{x}$ |
| | Poisson Deviance | $2\sum_j \big[x_{ij} \log\frac{x_{ij}}{u_{ij}} - x_{ij} + u_{ij}\big]$ | $\bar{x}$ |
| | Negative Binomial log-likelihood | $\sum_j [-x_{ij} u_{ij} - \theta \log(1 - e^{u_{ij}})]$ | $\log\big(\bar{x} / (\bar{x} + \theta)\big)$ |
| | Negative Binomial Deviance | $2\sum_j \big[x_{ij} \log\frac{x_{ij}}{u_{ij}} - (x_{ij} + \theta) \log\frac{x_{ij} + \theta}{u_{ij} + \theta}\big]$ | $\bar{x}$ |
| | Manhattan ($\ell_1$) | $\|x_i - u_i\|_1$ | median$(x)$ |
| | Canberra (weighted $\ell_1$) | $\sum_j \frac{\lvert x_{ij} - u_{ij}\rvert}{\lvert x_{ij}\rvert + \lvert u_{ij}\rvert}$ | no closed form |
| Categorical | Multinomial log-likelihood | $-\sum_j x_{ij} \log u_{ij}$ | $\bar{x}_j$, $\sum_j \bar{x}_j = 1$ |
| | Multinomial Deviance | $2\sum_j x_{ij} \log\frac{x_{ij}}{u_{ij}}$ | $\bar{x}_j$, $\sum_j \bar{x}_j = 1$ |
2.2 Integrative Generalized Convex Clustering (iGecco)
In data integration problems, we observe data from multiple sources and would like to gain a holistic understanding of the problem by analyzing all of the data simultaneously. In our framework, we integrate mixed multi-view data and perform clustering by employing a different convex loss for each of the data views with a joint convex fusion penalty that leads to common groups. Hence we propose Integrative Generalized Convex Clustering (iGecco), which can be formulated as follows:

$$\underset{U^{(1)}, \dots, U^{(K)}}{\text{minimize}} \;\; \sum_{k=1}^{K} \pi_k \sum_{i=1}^{n} \ell_k\big(x_i^{(k)}, u_i^{(k)}\big) \; + \; \gamma \sum_{i < j} w_{ij} \, \Big\| \big( u_i^{(1)} - u_j^{(1)}, \, \dots, \, u_i^{(K)} - u_j^{(K)} \big) \Big\|_2 .$$

Here, we have $K$ data sources. The $k$th data view $X^{(k)}$ is an $n \times p_k$ matrix consisting of $n$ observations and $p_k$ features; $U^{(k)}$ is also an $n \times p_k$ matrix, and its $i$th row, $u_i^{(k)}$, is the cluster center associated with the point $x_i^{(k)}$. And $\ell_k$ is the loss function associated with the $k$th data view. Each loss function is weighted by $\pi_k$, which is fixed by the user in advance. We have found that setting $\pi_k$ to be inversely proportional to the null deviance evaluated at the loss-specific center, i.e., $\pi_k = 1 / \ell_k(X^{(k)}, \bar{X}^{(k)})$, performs well in practice. Note that $\bar{X}^{(k)}$ is the $n \times p_k$ matrix each of whose columns $j$ equals the loss-specific center $c_j^{(k)}$. We employ this loss-function weighting scheme to ensure equal scaling across data sets of different types. Finally, notice that we employ a joint group-lasso penalty on all of the $U^{(k)}$'s; this incorporates information from each of the data sources and enforces the same group structure amongst the shared observations. We study this further and prove these properties in Section 2.4.
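The loss-weighting scheme above can be sketched as follows: each view's weight is the reciprocal of its null deviance, so every weighted view contributes the same amount at the loss-specific center. The two toy views and the particular deviance implementations below are illustrative assumptions, not the paper's code.

```python
import numpy as np

# Sketch: pi_k = 1 / (null deviance of view k), i.e. the view's loss
# evaluated at its loss-specific center, to put views on a common scale.
rng = np.random.default_rng(1)
X_gauss = rng.normal(size=(50, 10))            # continuous view
X_count = rng.poisson(lam=3.0, size=(50, 5)).astype(float)  # count view

def gaussian_null_dev(X):
    # squared-error loss at the columnwise mean (its loss-specific center)
    return ((X - X.mean(axis=0)) ** 2).sum()

def poisson_null_dev(X):
    # Poisson deviance at the columnwise mean; 0 * log(0) is treated as 0
    mu = X.mean(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(X > 0, X * np.log(X / mu), 0.0)
    return 2.0 * (term - (X - mu)).sum()

pi = np.array([1.0 / gaussian_null_dev(X_gauss),
               1.0 / poisson_null_dev(X_count)])
# by construction, each weighted null deviance equals exactly 1
print(pi[0] * gaussian_null_dev(X_gauss))
```

A view with a very large null deviance (e.g. raw counts on a big scale) thus gets a proportionally small weight, preventing it from dominating the joint objective.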
2.3 Feature Selection: Gecco+ and iGecco+
In high dimensions, it is important to perform feature selection both for clustering purity and for interpretability. Recently, wang2018sparse proposed sparse convex clustering, which imposes a group-lasso-type penalty on the cluster centers and achieves feature selection by shrinking noise features towards zero. This penalty, however, is only appropriate for Euclidean distances with centered data; otherwise, the penalty term shrinks towards the incorrect cluster centers. For example, the median is the cluster center under the $\ell_1$ or Manhattan distance. Thus, to select features in this scenario, we need to shrink them towards the median, and we should enforce "sparsity" with respect to the median and not the origin. To address this, we propose adding a shifted group-lasso-type penalty which forces each cluster-center column to shrink toward the appropriate loss-specific center for each feature. Both the cluster fusion penalty and this new shifted group-lasso-type feature-selection penalty then shrink towards the same loss-specific center.
To facilitate feature selection with the adaptive shifted group-lasso penalty for one data type, our Generalized Convex Clustering Optimization with Feature Selection (Gecco+) is formulated as follows:

$$\underset{U}{\text{minimize}} \;\; \sum_{i=1}^{n} \ell(x_i, u_i) \; + \; \gamma \sum_{i < j} w_{ij} \, \| u_i - u_j \|_2 \; + \; \alpha \sum_{j=1}^{p} z_j \, \| U_{\cdot j} - c_j \mathbf{1}_n \|_2 .$$

Again, $U$ is an $n \times p$ matrix and $c_j$ is the loss-specific center for the $j$th feature introduced in Table 1. The tuning parameter $\alpha$ controls the number of informative features, and the feature weight $z_j$ is a user input which plays an important role in adaptively penalizing the features. (We discuss choices of $z_j$ in Section 2.5.2 when we introduce the adaptive version of our method.) When $\alpha$ is small, all features are selected and contribute to defining the cluster centers. When $\alpha$ grows sufficiently large, each feature's values coalesce at the same value, the loss-specific center $c_j$, and hence no features are selected or contribute towards determining the clusters. Another way of interpreting this is that the fusion penalty exactly fuses some of the rows of the cluster-center matrix $U$, hence determining groups of rows. On the other hand, the shifted group-lasso penalty shrinks whole columns of $U$ towards their loss-specific centers, thereby essentially removing the effect of uninformative features. The selected features are then the columns of $U$ that were not shrunken to their loss-specific centers $c_j$. These selected features, then, exhibit differences across the clusters determined by the fusion penalty. Clearly, the sparse convex clustering of wang2018sparse is a special case of Gecco+ using Euclidean distances with centered data. Our approach using both a row and a column penalty is also reminiscent of convex biclustering (chi2017convex), which uses fusion penalties on both the rows and columns to achieve checkerboard-like biclusters.
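The shifted group-lasso term can be sketched directly: it sums, over features, the Euclidean distance of each centroid column from its loss-specific center, weighted by $z_j$. The function name, toy matrix, and centers below are illustrative.

```python
import numpy as np

# Sketch of the shifted group-lasso feature penalty
#   alpha * sum_j z_j * || U[:, j] - c_j * 1 ||_2,
# which shrinks whole columns of the centroid matrix U toward their
# loss-specific centers c_j rather than toward zero.
def shifted_group_lasso(U, centers, z, alpha=1.0):
    resid = U - centers[None, :]    # broadcast c_j down each column
    return alpha * (z * np.linalg.norm(resid, axis=0)).sum()

U = np.array([[3.0, 5.0],
              [6.0, 5.0]])
centers = np.array([3.0, 5.0])      # e.g. medians for an l1 loss
z = np.ones(2)
# feature 2 sits exactly at its center, so it contributes zero penalty;
# feature 1 contributes ||(0, 3)||_2 = 3
print(shifted_group_lasso(U, centers, z))  # -> 3.0
```

A column shrunk all the way to its center contributes nothing to the penalty and is dropped from the set of selected features, mirroring the discussion above.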
Building upon integrative generalized convex clustering in Section 2.2 and our proposed feature-selection penalty above, our Integrative Generalized Convex Clustering Optimization with Feature Selection (iGecco+) is formulated as follows:

$$\underset{U^{(1)}, \dots, U^{(K)}}{\text{minimize}} \;\; \sum_{k=1}^{K} \pi_k \sum_{i=1}^{n} \ell_k\big(x_i^{(k)}, u_i^{(k)}\big) + \gamma \sum_{i < j} w_{ij} \Big\| \big( u_i^{(1)} - u_j^{(1)}, \dots, u_i^{(K)} - u_j^{(K)} \big) \Big\|_2 + \alpha \sum_{k=1}^{K} \sum_{j=1}^{p_k} z_j^{(k)} \big\| U_{\cdot j}^{(k)} - c_j^{(k)} \mathbf{1}_n \big\|_2 . \quad (1)$$

Again, $U^{(k)}$ is an $n \times p_k$ matrix and $c_j^{(k)}$ is the loss-specific center for the $j$th feature of the $k$th data type. By construction, iGecco+ directly clusters mixed multi-view data and selects features from each data view simultaneously. Similarly, an adaptive choice of $z_j^{(k)}$ gives rise to adaptive iGecco+, which is discussed in Section 2.5.2. Detailed discussions of practical choices of tuning parameters and weights can also be found in Section 2.5.
2.4 Properties
In this section, we develop some properties of our methods, highlighting several advantages of our convex formulation. The corresponding proofs can be found in Section A of the Appendix.
Define the objective function in (1) as $F(U; \gamma, \alpha)$, where $U = (U^{(1)}, \dots, U^{(K)})$. Then, due to convexity, we have the following:
propositionprop1 (Global solution) If $\ell_k$ is convex for all $k$, then any minimizer of $F$, $\hat{U}$, is a global minimizer. If $\ell_k$ is strictly convex for all $k$, then $\hat{U}$ is unique.
propositionprop2 (Continuity with respect to data and input parameters) The global minimizer of iGecco+ exists and depends continuously on the data $X^{(1)}, \dots, X^{(K)}$, the tuning parameters $\gamma$ and $\alpha$, the weight matrix $w$, the loss weights $\pi_k$, and the feature weights $z_j^{(k)}$.
propositionprop3 (Loss-specific center) Define $\bar{X} = (\bar{X}^{(1)}, \dots, \bar{X}^{(K)})$, where each column $j$ of $\bar{X}^{(k)}$ equals the loss-specific center $c_j^{(k)}$. Suppose each observation $i$ corresponds to a node in a graph, with an edge between nodes $i$ and $j$ whenever $w_{ij} > 0$. If this graph is fully connected, then $F$ is minimized by the loss-specific centers, $\hat{U} = \bar{X}$, when $\gamma$ is sufficiently large or $\alpha$ is sufficiently large.
Remark: As Gecco, Gecco+ and iGecco are special cases of iGecco+, it is easy to show that all of our properties hold for these methods as well.
These properties illustrate some important advantages of our convex clustering approaches. Specifically, many other widely used clustering methods are known to suffer from poor local solutions, but any minimizer of our problem achieves the global solution. Additionally, we show that iGecco+ is continuous with respect to the data, tuning parameters, and other input parameters. Together, these two properties are very important in practice and illustrate that the global solution of our method remains stable to small perturbations in the data and input parameters. Stability is a desirable property, as one would question the validity of a clustering result that can change dramatically with small changes to the data or parameters. Importantly, most popular clustering methods, such as k-means, hierarchical clustering, model-based clustering, or low-rank-based clustering, do not enjoy these same stability properties.
Finally, in Proposition 2.4, we verify that when the tuning parameters are sufficiently large, full fusion of all observations to the loss-specific centers is achieved. Hence, our methods indeed behave as intended, achieving joint clustering of observations. We illustrate this property in Figure 1, where we apply Gecco+ to the authors data set (described fully in Section 4). Here, we illustrate how our solution $\hat{U}$ changes as a function of $\gamma$ and $\alpha$. This so-called "cluster solution path" begins with each observation as its own cluster center when $\gamma$ is small and ends with all observations fused to the loss-specific center when $\gamma$ is sufficiently large. In between, we see observations fusing together as $\gamma$ increases. Similarly, when $\alpha$ is small, all features are selected, and as $\alpha$ increases, some of the features are fused to their loss-specific centers.
2.5 Practical Considerations and Adaptive iGecco+
In this section, we discuss some practical considerations for applying our method to real data. We discuss choosing user-specified inputs such as weights, as well as how to select tuning parameters. In doing so, we also introduce an adaptive version of our method.
Choice of Weights and Tuning Parameters
In practice, a good choice of the fusion weights $w_{ij}$ has been shown to enhance both the computational efficiency and the clustering quality of convex clustering (chi2015splitting). It has been empirically demonstrated that using weights inversely proportional to the distances between observations yields superior clustering performance; this approach is widely adopted in practice. Further, setting many of the weights to zero helps reduce computation cost. Considering these two points, the most common weight choice for convex clustering is a $k$-nearest-neighbors method with a Gaussian kernel. Specifically, the weight between the sample pair $(i, j)$ is set as $w_{ij} = \iota_{ij}^k \exp(-\phi \, d^2(x_i, x_j))$, where $\iota_{ij}^k$ equals 1 if observation $j$ is among observation $i$'s $k$ nearest neighbors, or vice versa, and 0 otherwise. However, this choice of weights based on Euclidean distances may not work well for non-Gaussian data in Gecco(+) or for mixed data in iGecco(+). To account for different data types and better measure the similarity between observations, we still adopt the nearest-neighbors method with an exponential kernel, but extend it by employing appropriate distance metrics for specific data types in the exponential kernel. In particular, for the weights in Gecco and Gecco+, we suggest using the same distance functions or deviances as in the loss function. For the weights in iGecco and iGecco+, the challenge is that we need a distance metric that measures mixed types of data. In this case, the Gower distance, a metric used to measure the dissimilarity of two observations measured on different data types (gower1971general), addresses our problem. To be specific, the overall Gower distance between observations $i$ and $i'$ can be defined as $d(i, i') = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{p_k} \sum_{j=1}^{p_k} d_{i i' j}^{(k)}$, where $d_{i i' j}^{(k)} = | x_{ij}^{(k)} - x_{i'j}^{(k)} | / R_j^{(k)}$ refers to the Gower distance between observations $i$ and $i'$ for feature $j$ in data view $k$, and $R_j^{(k)}$ is the range of feature $j$ in data view $k$.
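The weight construction above can be sketched in a few lines: range-normalized absolute differences averaged within and then across views (a common Gower-style convention for numeric features, assumed here), fed into a k-nearest-neighbor exponential kernel. The toy views, `k`, and `phi` are illustrative choices.

```python
import numpy as np

# Gower-style distance across views, then kNN Gaussian-kernel fusion weights.
def gower_distances(views):
    n = views[0].shape[0]
    D = np.zeros((n, n))
    for V in views:
        rng_ = V.max(axis=0) - V.min(axis=0)
        rng_ = np.where(rng_ == 0, 1.0, rng_)   # guard constant features
        diffs = np.abs(V[:, None, :] - V[None, :, :]) / rng_
        D += diffs.mean(axis=2)                 # average within this view
    return D / len(views)                       # average across views

def knn_kernel_weights(D, k=3, phi=1.0):
    # w_ij = exp(-phi * d_ij^2) if j is among i's k nearest neighbors
    # (or vice versa), else 0
    n = D.shape[0]
    order = np.argsort(D, axis=1)[:, 1:k + 1]   # position 0 is self (d=0)
    knn = np.zeros((n, n), dtype=bool)
    knn[np.repeat(np.arange(n), k), order.ravel()] = True
    knn = knn | knn.T                           # symmetrize "or vice versa"
    W = np.where(knn, np.exp(-phi * D ** 2), 0.0)
    np.fill_diagonal(W, 0.0)
    return W

rng = np.random.default_rng(2)
views = [rng.normal(size=(10, 4)),
         rng.poisson(3.0, size=(10, 6)).astype(float)]
W = knn_kernel_weights(gower_distances(views), k=3, phi=2.0)
print(np.allclose(W, W.T))  # symmetric, sparse fusion weights
```

Symmetry and sparsity of `W` are exactly what the fusion penalty needs: each nonzero entry becomes one edge in the fusion graph.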
In the literature, the Gower distance has been commonly used as a distance metric for clustering mixed types of data (wangchamhan2017efficient; hummel2017clustering; akay2018clustering) and has been shown to yield superior performance to other distance metrics (ali2013k; dos2015categorical).
Alternatively, we also propose and explore using stochastic neighbor embedding weights based on symmetrized conditional probabilities (maaten2008visualizing). These have been shown to yield superior performance in high dimensions and when there are potential outliers. Specifically, the symmetrized conditional probabilities are defined as $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$, where $p_{j|i} = \frac{\exp(-d^2(x_i, x_j) / 2\sigma_i^2)}{\sum_{m \neq i} \exp(-d^2(x_i, x_m) / 2\sigma_i^2)}$. We propose to use the weights $w_{ij} = \iota_{ij}^k \, p_{ij}$, where $\iota_{ij}^k$ again equals 1 if observation $j$ is among observation $i$'s $k$ nearest neighbors, or vice versa, and 0 otherwise. Again, we suggest using distance metrics appropriate for specific data types, or the Gower distance for mixed data. In empirical studies, we experimented with both weight choices and found that the stochastic neighbor embedding weights tended to work better in high-dimensional settings and when there are outliers. Hence, we recommend these and employ them in our empirical investigations in Sections 4 and 5.
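A minimal sketch of these stochastic-neighbor-embedding weights follows: Gaussian conditional probabilities are row-normalized, symmetrized, and masked to the k-nearest-neighbor graph. For brevity, a single fixed bandwidth `sigma` replaces the usual per-point perplexity calibration; that simplification, and all the names below, are our assumptions.

```python
import numpy as np

# SNE-style fusion weights: p_{j|i} ~ exp(-d_ij^2 / 2 sigma^2), row-normalized,
# then p_ij = (p_{j|i} + p_{i|j}) / (2n), masked to k nearest neighbors.
def sne_weights(D, k=3, sigma=1.0):
    n = D.shape[0]
    logits = -(D ** 2) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)             # enforce p_{i|i} = 0
    P_cond = np.exp(logits)
    P_cond /= P_cond.sum(axis=1, keepdims=True)   # each row sums to 1
    P = (P_cond + P_cond.T) / (2.0 * n)           # symmetrize
    order = np.argsort(D, axis=1)[:, 1:k + 1]     # skip self at position 0
    knn = np.zeros((n, n), dtype=bool)
    knn[np.repeat(np.arange(n), k), order.ravel()] = True
    return np.where(knn | knn.T, P, 0.0)

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 5))
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # any metric works
W = sne_weights(D, k=3, sigma=np.median(D))
print(np.allclose(W, W.T))  # symmetric by construction
```

Because the row normalization adapts to local density, distant outliers receive uniformly tiny probabilities, which is one intuition for why these weights behave well with outliers.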
Estimating the number of clusters in a data set is always challenging. The current literature on tuning parameter selection focuses mainly on stability selection or consensus clustering (wang2010consistent; fang2012selection) and holdout validation (chi2017convex). In this paper, we adopt a holdout validation approach for tuning parameter selection, following closely the approach described in chi2017convex; we have found that this performs well in empirical studies.
For the choice of the feature-selection tuning parameter $\alpha$, we find that the clustering result is fairly robust to the choice of $\alpha$. Hence, we suggest considering only a few candidate values for $\alpha$ and choosing the combination of $\gamma$ and $\alpha$ which jointly minimizes the holdout error or within-cluster deviance. In many cases, we know the number of clusters a priori (or have an idea of an appropriate range for the number of clusters), and we can directly choose the $\gamma$ which minimizes the holdout error or within-cluster deviance for that number of clusters.
Adaptive Gecco+ and iGecco+ to Weight Features
Finally, we consider how to specify the feature weights $z_j^{(k)}$ used in the shifted group-lasso penalty. While employing these weights is not strictly necessary, we have found, as did wang2018sparse, that like the fusion weights, well-specified $z_j^{(k)}$'s can both improve performance and speed computation. But unlike the fusion weights, where we can utilize the pairwise distances, we do not have prior information on which features may be relevant to clustering. Thus, we propose an adaptive scheme that first fits iGecco+ with no feature weights and uses this initial estimate to define feature importances for use in the weights. This is similar to many adaptive approaches in the literature (zou2006adaptive; wang2018sparse).
Our adaptive iGecco+ approach is given in Algorithm 1; it applies to adaptive Gecco+ as a special case as well. We assume that the number of clusters (or a range for the number of clusters) is known a priori. We begin by fitting iGecco+ with a fixed small $\alpha$ and uniform feature weights $z_j^{(k)} = 1$. We then find the $\gamma$ which gives the desired number of clusters, yielding the initial estimate $\hat{U}^0$. (Alternatively, we can use holdout validation to select $\gamma$.) Next, we use this initial estimate to adaptively weight the features by proposing the following weights: $z_j^{(k)} = 1 / \| \hat{U}_{\cdot j}^{0 (k)} - c_j^{(k)} \mathbf{1}_n \|_2$. These weights place a large penalty on noise features, as $\| \hat{U}_{\cdot j}^{0 (k)} - c_j^{(k)} \mathbf{1}_n \|_2$ is close to zero in this case. We also notice that noise features impact the distances used in the fusion weights as well. Hence, we suggest updating the distances adaptively by using the selected features to better measure the similarities between observations. To this end, we propose a new scheme to compute weighted Gower distances. First, we scale the features within each data view so that informative features in different data views contribute equally and on the same scale. Then, we employ the inverse of $\pi_k$, i.e., the null deviance, to weight the distances from the different data types, resulting in an aggregated and weighted Gower distance, as further detailed in Algorithm 1. Note that if the clustering signal from one particular data type is weak and there are few informative features for this data type, then our weighting scheme will downweight this entire data type in the weighted Gower distance. In practice, our adaptive iGecco+ scheme works well, as evidenced by our empirical investigations in the next sections.
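The adaptive re-weighting step can be sketched as follows: after an initial fit with uniform weights, each feature's new weight is the reciprocal of how far its estimated centroid column sits from the loss-specific center, so noise features get a very large penalty. `U0`, `centers`, and `eps` are illustrative placeholders for one data view.

```python
import numpy as np

# Adaptive feature weights: z_j = 1 / || U0[:, j] - c_j * 1 ||_2 (+ eps),
# computed from an initial unweighted fit U0.
def adaptive_feature_weights(U0, centers, eps=1e-8):
    dist = np.linalg.norm(U0 - centers[None, :], axis=0)
    return 1.0 / (dist + eps)   # noise features (dist ~ 0) -> huge weight

U0 = np.array([[0.0, 2.0],
               [0.0, 4.0]])     # feature 1 already shrunk to its center
centers = np.array([0.0, 3.0])  # loss-specific centers from Table 1
z = adaptive_feature_weights(U0, centers)
print(z[0] > z[1])  # the uninformative feature is penalized far more
```

In the second-stage fit, these weights make it cheap to keep informative features unshrunken while pushing noise features quickly to their centers, mirroring the adaptive lasso logic.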
3 iGecco+ Algorithm
In this section, we introduce our algorithm to solve iGecco+, which extends easily to Gecco, Gecco+ and iGecco. We first propose a simple, but rather slow, ADMM algorithm as a baseline approach. To save computational cost, we then develop a new multi-block ADMM-type procedure using an inexact one-step approximation of the subproblems. Our algorithm is novel from an optimization perspective, as we extend multi-block ADMM to a higher number of blocks and combine it with the literature on ADMM with inexactly solved subproblems, which often results in major computational savings.
3.1 Full ADMM to Solve iGecco+ (Naive Algorithm)
Given the shifted group-lasso and fusion penalties along with general losses, developing an optimization routine for the iGecco+ method is less straightforward than for convex clustering or sparse convex clustering. In this section, we propose a simple ADMM algorithm to solve iGecco+ as a baseline and point out its drawbacks.
The most common approach to solving problems with more than two nonsmooth functions is multi-block ADMM (lin2015global; deng2017parallel), which decomposes the original problem into several smaller subproblems and solves them in parallel at each iteration. chen2016direct established a sufficient condition for the convergence of three-block ADMM. We develop a multi-block ADMM approach to fit our problem for certain types of losses and prove its convergence.
We first recast the iGecco+ problem (1) as the equivalent constrained optimization problem:
Recently, weylandt2019dynamic derived the ADMM for convex clustering in matrix form, and we adopt a similar approach. We index each centroid pair, define the set of edges over the nonzero weights, and introduce a new variable to account for the difference between each pair of centroids. This new variable is a matrix containing the pairwise differences between connected rows of the centroid matrix, and the equality constraint couples it to the directed difference matrix corresponding to the nonzero fusion weights. It is clear that this subproblem has a closed-form solution at each iteration. We give the general-form multi-block ADMM (Algorithm 2) to solve iGecco+; here, prox denotes the proximal mapping of the corresponding function.
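As a concrete sketch (with hypothetical names), the directed difference matrix can be built from the edge set of nonzero fusion weights so that multiplying it by the centroid matrix stacks all penalized pairwise differences:

```python
import numpy as np

def directed_difference_matrix(edges, n):
    """Directed difference matrix for an edge set over n observations.

    Row l of D picks out the centroid difference u_{l1} - u_{l2}
    for the l-th edge (l1, l2), so D @ U stacks the pairwise
    differences corresponding to nonzero fusion weights.
    """
    D = np.zeros((len(edges), n))
    for l, (i, j) in enumerate(edges):
        D[l, i] = 1.0   # head of the directed edge
        D[l, j] = -1.0  # tail of the directed edge
    return D
```

For example, with edges (0, 1) and (1, 2) over three observations, `D @ U` returns the two row differences of the centroid matrix `U`.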
Notice that there is no closed-form solution to the subproblem for general losses. Typically, we need to apply an inner optimization routine to solve the subproblem to full convergence at each iteration. In the next section, we seek to speed up this algorithm using subproblem approximations. But first, we propose two different approaches to fully solve the subproblem based on specific loss types, and then use these to develop a one-step update that solves the subproblem approximately with guaranteed convergence.
3.2 iGecco+ Algorithm
We have introduced Algorithm 2, a simple baseline ADMM approach to solve iGecco+. In this section, we consider different ways to solve the subproblem in Algorithm 2. First, based on specific loss types (differentiable and non-differentiable), we propose two different algorithms to solve the subproblem to full convergence. These approaches, however, are rather slow for general losses, as the lack of a closed-form solution results in nested iterative updates. To address this, and in connection with the current literature on variants of ADMM with subproblem approximations, we propose the iGecco+ algorithm, a multi-block ADMM which solves the subproblems approximately by taking a single one-step update. We prove convergence of this general class of algorithms, a novel result in the optimization literature.
Differentiable Case
When the loss is differentiable, we consider solving the subproblem with proximal gradient descent, which is often used when the objective function decomposes into a differentiable and a non-differentiable function. While there are many other possible optimization routines for the subproblem, we choose proximal gradient descent because there is existing literature proving convergence of ADMM algorithms whose subproblems are solved approximately using proximal gradient descent (liu2013linearized; lu2016fast). We discuss in detail how to solve the subproblem approximately via a one-step approximation in Section 3.2.3. Based on this, we propose Algorithm 3, which solves the subproblem by running full iterative proximal gradient descent to convergence.
In Algorithm 3, and in (proximal) gradient descent algorithms generally, we need to choose an appropriate step size to ensure convergence. Usually one employs a fixed step size computed from the Lipschitz constant, as in the squared error loss case; but in our method, the Lipschitz constant is hard to compute for most of our general losses. Instead, we suggest the backtracking line-search procedure of beck2009gradient; parikh2014proximal, a common way to determine the step size with guaranteed convergence. Further, we find that decomposing the subproblem into separate subproblems brings several advantages: (i) better convergence properties (than updating all blocks together) due to an adaptive step size for each subproblem, and (ii) lower computational cost by solving each in parallel. Hence, in this case, we propose to use proximal gradient descent on each separate subproblem. To achieve this, we assume that the loss is elementwise, which is satisfied by every deviance-based loss. Finally, as mentioned, there are other possible ways to solve the subproblem besides proximal gradient descent, such as ADMM; we find that when the loss is the squared Euclidean distance, or when the Hessian of the loss can be upper-bounded by a fixed matrix, this alternative saves further computation. We provide all implementation details discussed above in Section C of the Appendix.
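A minimal sketch of one proximal gradient step with backtracking line search, in the spirit of beck2009gradient. The function names and the generic `f`/`prox_g` split are illustrative, not the paper's exact subproblem:

```python
import numpy as np

def prox_grad_step_backtracking(x, f, grad_f, prox_g, t0=1.0, beta=0.5, max_iter=50):
    """One proximal gradient step with backtracking line search.

    The step size t is shrunk by beta until the standard
    sufficient-decrease condition
        f(x+) <= f(x) + grad_f(x)'(x+ - x) + ||x+ - x||^2 / (2t)
    holds, which guarantees descent without a Lipschitz constant.
    """
    t = t0
    fx, g = f(x), grad_f(x)
    for _ in range(max_iter):
        x_new = prox_g(x - t * g, t)          # forward (gradient) then backward (prox) step
        diff = x_new - x
        if f(x_new) <= fx + g @ diff + (diff @ diff) / (2 * t):
            return x_new, t
        t *= beta                             # condition failed: shrink the step
    return x_new, t
```

In Algorithm 3 this step would be iterated to convergence; in the one-step variant of Section 3.2.3 it is taken exactly once per outer ADMM iteration.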
Nondifferentiable Case
When the loss is non-differentiable, we can no longer use the proximal gradient method to solve the subproblem, as the objective is now a sum of more than one separable nonsmooth function. To address this, as mentioned, we can use multi-block ADMM; in this case, we introduce new blocks for the nonsmooth functions and hence develop a full three-block ADMM approach to fit our problem.
To augment the non-differentiable term, we assume that our loss function can be written as the composition of a convex but non-differentiable function with an affine map. This condition is satisfied by all distance-based losses; for example, Manhattan distances take this form. The benefit of this reformulation is that the subproblem now has a closed-form solution. In particular, we can rewrite the subproblem as:
where the auxiliary variable is a matrix whose columns all equal a common scalar.
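For instance, the proximal operator of a scaled Euclidean norm, the kind of closed form behind such updates, is block soft-thresholding. This is a standard identity rather than the paper's specific update:

```python
import numpy as np

def prox_l2(v, sigma):
    """Proximal operator of sigma * ||.||_2 (block soft-thresholding):

        prox(v) = (1 - sigma / ||v||_2)_+ * v

    Vectors with norm at most sigma are shrunk exactly to zero,
    which is what fuses centroids (or zeroes out features) in
    group-lasso-type penalties.
    """
    nv = np.linalg.norm(v)
    if nv <= sigma:
        return np.zeros_like(v)
    return (1.0 - sigma / nv) * v
```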
It is clear that we can use multi-block ADMM to solve the problem above, and each primal variable has a simple update with a closed-form solution. We propose Algorithm 4, a full iterative multi-block ADMM, to solve the subproblem when the loss is a non-differentiable distance-based function. Algorithm 4 applies to iGecco+ with various distances such as Manhattan, Minkowski and Chebyshev distances; details are given in Section D of the Appendix.
iGecco+ Algorithm: Fast ADMM with Inexact One-step Approximation to the Subproblem
Notice that both Algorithm 3 and Algorithm 4 must be run iteratively to full convergence to solve the subproblem at each outer iteration, which is dramatically slow in practice. To address this, many variants of ADMM with guaranteed convergence have been proposed that find an inexact, one-step approximate solution to the subproblem (without fully solving it); these include generalized ADMM (deng2016global), proximal ADMM (shefi2014rate; banert2016fixing) and proximal linearized ADMM (liu2013linearized; lu2016fast). Thus, we propose to solve the subproblem approximately by taking a single one-step update of the corresponding algorithm for both types of losses, and we prove convergence. For the differentiable loss case, we apply the proximal linearized ADMM approach, while for the non-differentiable case, we show that taking a one-step update of Algorithm 4 is equivalent to applying a four-block ADMM to the original problem, and we provide a sufficient condition for the convergence of four-block ADMM. Our algorithm, to the best of our knowledge, is the first to incorporate higher-order multi-block ADMM and inexact ADMM with a one-step update to solve subproblems for general loss functions.
When the loss is differentiable, as mentioned for Algorithm 3, one can use full iterative proximal gradient descent to solve the subproblem, which, however, is computationally burdensome. To avoid this, many have proposed variants of ADMM that find approximate solutions to the subproblems. Most closely related to our problem, liu2013linearized; lu2016fast proposed proximal linearized ADMM, which solves the subproblems efficiently by linearizing the differentiable part and then applying a proximal step to handle the non-differentiable part. We find their approach fits our problem, and hence develop a proximal linearized two-block ADMM to solve iGecco+ when the loss is differentiable and its gradient is Lipschitz continuous. It can be shown that applying proximal linearized two-block ADMM to Algorithm 2 is equivalent to taking a one-step update of Algorithm 3 along with the remaining updates in Algorithm 2. In this way, we avoid running full iterative proximal gradient updates to convergence for the subproblem as in Algorithm 3, and hence save computational cost.
When the loss is non-differentiable, we still seek a one-step update to solve the subproblem. We achieve this by noticing that taking a one-step update of Algorithm 4, along with the remaining updates in Algorithm 2, is equivalent to applying multi-block ADMM to the original iGecco+ problem recast as follows (for non-differentiable distance-based losses):
Typically, general higher-order multi-block ADMM algorithms do not always converge, even for convex functions (chen2016direct). We prove convergence of our algorithm and establish a novel convergence result by casting iGecco+ with non-differentiable losses as a four-block ADMM, proposing a sufficient condition for the convergence of higher-order multi-block ADMMs, and finally showing that our problem satisfies this condition. (Details are given in the proof of Theorem 3.2.3 in Appendix B.) Therefore, taking a one-step update of Algorithm 4 converges for iGecco+ with non-differentiable losses.
So far, we have proposed an inexact-solve, one-step update approach for both the differentiable and the non-differentiable loss cases. For mixed types of losses, we combine these two approaches; this gives Algorithm 5, a multi-block ADMM with inexact one-step approximation to the subproblem to solve iGecco+. We also establish the following convergence result.
Theorem (iGecco+ convergence). If every loss is convex, Algorithm 5 converges to a global solution. In addition, if each loss is strictly convex, it converges to the unique global solution.
Remark: The corresponding Theorem 3.2.3 establishes a novel convergence result, as it is the first to show the convergence of four-block or higher ADMM using approximate subproblem solutions for both differentiable and non-differentiable losses.
It is easy to see that Algorithm 5 solves the other Gecco-related methods as special cases: with a single data view it solves Gecco+; with the feature-selection penalty dropped it solves iGecco; and with both restrictions it solves Gecco.
To conclude this section, we compare in Figure 2 the convergence of the full ADMM and the inexact ADMM with a one-step subproblem update for solving Gecco+. The left plots show the number of iterations needed to reach optimization convergence, while the right plots show computation time. We see that Algorithm 5 (one-step update of the subproblem) requires far less computation time than Algorithm 2 (full solve of the subproblem). Although Algorithm 5 takes more iterations to converge due to the inexact approximation at each iteration, it still reduces computation time dramatically, as its cost per iteration is much lower than that of the full-solve approach.
4 Simulation Studies
In this section, we first evaluate the performance of Gecco+ against existing methods on non-Gaussian data. We then compare iGecco+ with other methods on mixed multi-view data.
4.1 Non-Gaussian Data
In this subsection, we evaluate the performance of Gecco and (adaptive) Gecco+ by comparing them with k-means, hierarchical clustering and sparse convex clustering. For simplicity, we adopt the following naming convention for all methods: loss type name + Gecco(+). For example, Poisson Deviance Gecco+ refers to generalized convex clustering with feature selection using the Poisson deviance, and Sparse CC refers to sparse convex clustering using Euclidean distances. We measure the accuracy of clustering results using the adjusted Rand index (hubert1985comparing). The adjusted Rand index is the corrected-for-chance version of the Rand index, measuring the agreement between the estimated clustering assignment and the true group labels; a larger adjusted Rand index implies a better clustering result. For all methods considered, we assume the oracle number of clusters for fair comparison. For Gecco+, we choose the largest regularization parameter which minimizes the within-cluster variance or hold-out error.
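The adjusted Rand index is available in scikit-learn; the snippet below illustrates its chance correction and invariance to label permutation (the label vectors are toy examples):

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]
relabeled   = [1, 1, 2, 2, 0, 0]   # same partition, different label names
scrambled   = [0, 1, 0, 1, 0, 1]   # splits every true cluster apart

print(adjusted_rand_score(true_labels, relabeled))   # 1.0: invariant to relabeling
print(adjusted_rand_score(true_labels, scrambled))   # below 0: worse than chance
```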
Each simulated data set comprises observations belonging to 3 clusters of equal size. Only the first 10 features are informative, while the rest are noise. We consider the following simulation scenarios.

S1: Spherical data with outliers
The first 10 informative features in each group are generated from a Gaussian distribution with a different mean vector for each class. The outliers in each class are generated from a Gaussian distribution with the same mean centroid but a higher variance. The remaining features are pure Gaussian noise.
In the first setting (S1A), the number of noise features ranges up to 225 with the proportion of outliers fixed at 5%. We also consider a setting where the variance of the noise features increases with the number of features and the number of outliers fixed (S1B), and a high-dimensional setting where the number of features ranges up to 1000 (S1C).
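A sketch of this data-generating process appears below; the class means, variances, and sizes are illustrative placeholders, since the paper's exact values are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_spherical(n_per_class=30, p_noise=50, outlier_frac=0.05):
    """Sketch of scenario S1: 3 Gaussian classes on 10 informative
    features plus p_noise pure-noise features; a fraction of each
    class is replaced by high-variance outliers. The class means
    (-2, 0, 2) and variances are illustrative only."""
    mus = [-2.0, 0.0, 2.0]
    X, y = [], []
    for k, mu in enumerate(mus):
        block = rng.normal(mu, 1.0, size=(n_per_class, 10))
        n_out = max(1, int(outlier_frac * n_per_class))
        # outliers: same mean centroid, larger variance
        block[:n_out] = rng.normal(mu, 4.0, size=(n_out, 10))
        X.append(block)
        y += [k] * n_per_class
    X = np.vstack(X)
    noise = rng.normal(0.0, 1.0, size=(X.shape[0], p_noise))  # uninformative features
    return np.hstack([X, noise]), np.array(y)
```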

S2: Non-spherical data with three half-moons
Here we consider the standard simulated data of three interlocking half-moons, as suggested by chi2015splitting and wang2018sparse. The first 10 features are informative, with each pair making up two-dimensional three interlocking half-moons. We randomly select 5% of the observations in each group and make them outliers. The remaining features are pure Gaussian noise, with the number of noise features ranging up to 225. In both S1 and S2, we compare Manhattan Gecco+ with other existing methods.

S3: Count-valued data
The first 10 informative features in each group are generated from a Poisson distribution with a different rate for each class. The remaining noise features are generated from Poisson distributions whose rates are shared across classes, each rate being a randomly generated integer from 1 to 10. The number of noise features ranges up to 225.
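A sketch of this count-valued scenario, with illustrative class rates (2, 6, 10) standing in for the paper's exact values:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_counts(n_per_class=30, p_noise=50):
    """Sketch of scenario S3: three classes whose first 10 features
    have class-specific Poisson rates; noise features share one
    random integer rate (1-10) per feature across all classes."""
    class_rates = [2.0, 6.0, 10.0]                       # illustrative rates
    informative = np.vstack([rng.poisson(lam, size=(n_per_class, 10))
                             for lam in class_rates])
    noise_rates = rng.integers(1, 11, size=p_noise)      # same rate for every class
    noise = rng.poisson(noise_rates, size=(3 * n_per_class, p_noise))
    y = np.repeat([0, 1, 2], n_per_class)
    return np.hstack([informative, noise]).astype(float), y
```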
We summarize the simulation results in Figure 3. We find that for spherical data with outliers, adaptive Manhattan Gecco+ performs best in high dimensions. Manhattan Gecco performs well in low dimensions but poorly as the number of noise features increases. Manhattan Gecco+ performs well as the dimension increases, but adaptive Manhattan Gecco+ outperforms it by adaptively penalizing the features, so that noise features are quickly zeroed out along the clustering path and only the informative features play important roles in clustering. We see that, without the adaptive scheme, we do not achieve the full benefit of performing feature selection; with adaptive Gecco+, we show vast improvement in clustering purity as the number of noise features grows, where regular Gecco performs poorly. Sparse convex clustering performs worst, as it tends to pick outliers as singleton classes. Our simulation results also show that adaptive Manhattan Gecco+ works well for non-spherical data by selecting the correct features. For count data, all three adaptive Gecco+ methods perform better than k-means, hierarchical clustering and sparse convex clustering. We note that there are several linkage options for hierarchical clustering; for visualization purposes, we show only the linkages with the best and worst performance, and we use the appropriate data-specific distance metrics in hierarchical clustering. Table 2 shows the variable selection accuracy of sparse convex clustering and adaptive Gecco+ in terms of the F score, with the number of noise features fixed in all scenarios. We see that adaptive Gecco+ selects the correct features, whereas sparse convex clustering performs poorly.
Method  Scenario 1 (A)  Scenario 2  Scenario 3 

Sparse Convex Clustering  0.37 (3.1e2)  0.25 (2.4e2)  0.14 (7.2e3) 
Adaptive Gecco+  0.97 (1.9e2)  0.99 (1.0e2)  0.81 (8.0e2) 
4.2 Multi-View Data
In this subsection, we evaluate the performance of iGecco and (adaptive) iGecco+ on mixed multi-view data by comparing them with hierarchical clustering, iCluster+ (mo2013pattern) and Bayesian Consensus Clustering (lock2013bayesian). Again, we measure the accuracy of clustering results using the adjusted Rand index (hubert1985comparing).
As before, each simulated data set comprises observations belonging to 3 clusters of equal size, and only the first 10 features are informative while the rest are noise. We have three data views consisting of continuous data, count-valued data, and binary/proportion-valued data. We investigate scenarios with different dimensions for each data view and consider the following simulation scenarios:

S1: Spherical data with

S2: Three half-moon data with

S3: Spherical data with , ,

S4: Three half-moon data with , ,

S5: Spherical data with , ,

S6: Three half-moon data with , ,
We employ a simulation setup similar to that in Section 4.1 to generate each data view. The difference is that here, for the informative features, we increase the within-cluster variance for the Gaussian data and decrease the differences between cluster mean centroids for the binary and count data, so that the clusters overlap. Specifically, for the spherical cases, the Gaussian data is generated with class-specific means, the count data from Poisson distributions with class-specific rates, and the binary data from Bernoulli distributions with class-specific probabilities. For the half-moon cases, the continuous data is simulated with larger noise, and the count- and proportion-valued data are generated via a copula transform. In this manner, we create a challenging simulation scenario where accurate clustering cannot be achieved by considering any single data view.
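One way to realize such a copula transform is sketched below, under the assumption of a Gaussian copula with an illustrative Poisson marginal (the paper does not specify these details):

```python
import numpy as np
from scipy import stats

def copula_transform_counts(Z, lam=5.0):
    """Map continuous features Z (e.g., half-moon coordinates) to
    count-valued features with the same rank structure: standardize,
    push through the normal CDF, then through the Poisson inverse
    CDF. The rate lam is an illustrative choice."""
    Zs = (Z - Z.mean(axis=0)) / Z.std(axis=0)   # standardize each feature
    U = stats.norm.cdf(Zs)                      # uniform scores, ranks preserved
    return stats.poisson.ppf(U, mu=lam)         # impose count marginals
```

Because the transform is monotone in each feature, the cluster geometry of the half-moons is preserved in the count-valued view.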
Again, for fair comparison across methods, we assume the oracle number of clusters. When applying the iGecco methods, we employ Euclidean distances for the continuous data, Manhattan distances for the count-valued data, and the Bernoulli log-likelihood for the binary and proportion-valued data; we use the latter two losses as they perform well compared with their counterpart losses in Gecco+ and are faster to compute. Again, we choose the largest regularization parameter that minimizes the within-cluster deviance.
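The three per-view losses can be sketched as follows; treating the Bernoulli log-likelihood as a negative log-likelihood (deviance-style) loss is our reading, and the clipping constant is an implementation detail added for numerical safety:

```python
import numpy as np

def view_loss(X, U, loss):
    """Per-view losses used in the multi-view experiments (sketch):
    squared Euclidean for continuous data, Manhattan for skewed or
    count-valued data, Bernoulli negative log-likelihood for
    binary/proportion data. X: (n, p) data view; U: (n, p) centroids.
    """
    if loss == "euclidean":
        return 0.5 * np.sum((X - U) ** 2)
    if loss == "manhattan":
        return np.sum(np.abs(X - U))
    if loss == "bernoulli":
        eps = 1e-10                            # guard against log(0)
        U = np.clip(U, eps, 1 - eps)
        return -np.sum(X * np.log(U) + (1 - X) * np.log(1 - U))
    raise ValueError(f"unknown loss: {loss}")
```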
Simulation results in Table 3 and Table 4 show that our methods perform better than existing methods. In low dimensions, iGecco performs comparably with iCluster and Bayesian Consensus Clustering for spherical data. For nonspherical data, iGecco performs much better. For high dimensions, iGecco+ performs better than iGecco while adaptive iGecco+ performs the best as it achieves the full benefit of feature selection.
Method  Scenario 1  Scenario 2 

Hclust:  0.35 (2.9e2)  0.54 (1.3e2) 
Hclust:  0.53 (4.6e2)  0.61 (4.0e2) 
Hclust:  0.52 (2.2e2)  0.70 (3.0e2) 
Hclust:  Euclidean  0.68 (4.7e2)  0.66 (4.4e2) 
Hclust:  Gower  0.86 (1.5e2)  0.84 (4.0e2) 
iCluster+ with  0.90 (1.6e2)  0.70 (8.0e3) 
Bayesian Consensus Clustering  0.95 (1.2e2)  0.63 (1.0e2) 
iGecco  0.93 (4.7e3)  1.00 (0.0e0) 
Method  Scenario 3  Scenario 4  Scenario 5  Scenario 6 

Hclust:  0.57 (1.8e2)  0.57 (1.4e2)  0.44 (2.4e2)  0.49 (1.7e2) 
Hclust:  0.22 (1.9e2)  0.20 (1.8e2)  0.51 (1.7e2)  0.51 (2.6e2) 
Hclust:  0.28 (1.1e2)  0.25 (2.6e2)  0.57 (2.7e2)  0.48 (3.3e2) 
Hclust:  Euclidean  0.72 (2.2e2)  0.43 (1.9e2)  0.53 (1.9e2)  0.56 (2.3e2) 
Hclust:  Gower  0.78 (1.0e2)  0.41 (3.0e2)  0.58 (4.2e2)  0.64 (2.6e2) 
iCluster+  0.61 (2.5e2)  0.74 (2.8e2)  0.62 (1.7e2)  0.61 (1.4e2) 
Bayesian Consensus Clustering  0.47 (1.1e1)  0.53 (1.0e2)  0.60 (1.0e2)  0.63 (1.1e2) 
iGecco  0.14 (8.7e2)  0.13 (8.2e2)  0.45 (2.9e2)  0.42 (4.4e2) 
iGecco+  0.37 (6.7e2)  0.37 (5.6e2)  0.48 (2.8e2)  0.48 (4.7e2) 
Adaptive iGecco+  0.91 (6.1e3)  0.92 (8.3e3)  0.96 (2.4e2)  0.94 (4.3e2) 
We also show the variable selection results in Table 5 and compare our method to iCluster+. Our adaptive iGecco+ outperforms iCluster+ in all scenarios.
Overall  Gaussian  Count  Binary  

iCluster+  A iGecco+  iCluster+  A iGecco+  iCluster+  A iGecco+  iCluster+  A iGecco+  
S3  0.88 (1.1e2)  0.96 (4.5e3)  0.96 (9.2e3)  1.00 (0.0e0)  0.81 (1.3e2)  0.89 (5.7e3)  0.87 (1.3e2)  0.98 (7.8e3) 
S4  0.93 (1.5e2)  0.99 (4.5e3)  0.97 (2.0e2)  0.99 (7.0e3)  0.93 (1.5e2)  1.00 (4.8e3)  0.89 (2.2e2)  0.99 (6.3e3) 
S5  0.95 (3.0e2)  1.00 (3.3e3)  0.95 (3.3e2)  1.00 (0.0e0)  0.93 (3.6e2)  0.99 (1.0e2)  0.96 (2.2e2)  1.00 (0.0e0) 
S6  0.93 (3.1e2)  1.00 (1.6e3)  0.95 (3.3e2)  1.00 (0.0e0)  0.88 (4.4e2)  1.00 (0.0e0)  0.95 (2.5e2)  1.00 (0.0e0) 
5 Real Data Examples
In this section, we apply our methods to various real data sets and evaluate our methods against existing ones. We first evaluate the performance of Gecco+ for several real data sets and investigate the features selected by various Gecco+ methods.
5.1 Authors Data
The authors data set consists of word counts from chapters written by four famous English-language authors (Austen, London, Shakespeare, and Milton). The classes contain unbalanced numbers of observations, and each observation has 69 features. The features are counts of common “stop words” such as “a”, “be” and “the”, which are typically removed before text-mining analysis. We use Gecco+ not only to cluster book chapters, comparing the clustering assignment with the true author labels, but also to identify which key words help distinguish the authors. We choose tuning parameters using hold-out validation.
Method  Adjusted Rand Index 

K-means  0.73 
Hierarchical Clustering  0.73 
Sparse Convex Clustering  0.50 
Manhattan Gecco+  0.96 
Poisson LL Gecco+  0.96 
Poisson Deviance Gecco+  0.96 
In Table 6, we compare Gecco+ with existing methods in terms of clustering quality. For hierarchical clustering, we show only the linkage with the best performance (throughout this section). Our methods outperform k-means and the best hierarchical clustering method. We next examine the words selected by Gecco+. As shown in Table 7, Jane Austen tended to use the word “her” more frequently than the other authors; this is expected, as the protagonists of her novels are typically female. The word “was” seems to separate Shakespeare and Jack London well: Shakespeare favored the present tense while Jack London favored the past tense. To summarize, Gecco+ not only achieves superior clustering performance but also selects interpretable features.
Method  Features  

Manhattan Gecco+ 


Poisson LL Gecco+  “an” , “her” , “our”, “your”  
Poisson Deviance Gecco+ 

5.2 TCGA Breast Cancer Data
The TCGA data set consists of log-transformed Level III RPKM gene expression levels for 445 breast cancer patients with 353 features from The Cancer Genome Atlas Network (cancer2012comprehensive). Five PAM50 breast cancer subtypes are included: Basal-like, Luminal A, Luminal B, HER2-enriched, and Normal-like. Only the 353 genes (out of 50,000) with somatic mutations in COSMIC (forbes2010cosmic) are retained. The data is Level III TCGA BRCA RNA-sequencing gene expression data preprocessed as follows: i) reads normalized by RPKM, and ii) corrected for overdispersion by a log-transformation. We remove the 7 patients belonging to the Normal-like group, leaving 438 subjects, and we combine Luminal A with Luminal B as they are often considered one aggregate group (choi2014identification).
Method  Adjusted Rand Index 

K-means  0.44 
Hierarchical Clustering  0.26 
Sparse Convex Clustering  0.01 
Manhattan Gecco+  0.76 
Poisson LL Gecco+  0.72 
Poisson Deviance Gecco+  0.72 
From Table 8, our method outperforms k-means and the best hierarchical clustering method. Next, we look at the genes selected by Gecco+ in Table 9. FOXA1 is known to be a key gene characterizing the luminal subtypes in DNA microarray analyses (badve2007foxa1). GATA binding protein 3 (GATA3) is a transcriptional activator highly expressed by the luminal epithelial cells in the breast (mehra2005identification). ERBB2 is known to be associated with the HER2 subtype and has been well studied in breast cancer (harari2000molecular). Hence, Gecco+ not only outperforms existing methods but also selects genes that are biologically relevant and have been implicated in previous scientific studies.
Method  Features  

Manhattan Gecco+ 


Poisson LL Gecco+  “ERBB2” “FOXA1” “GATA3”  
Poisson Deviance Gecco+ 

Next we evaluate the performance of iGecco+ for mixed multiview data sets and investigate the features selected by iGecco+.
5.3 Multi-omics Data
One promising application for integrative clustering for multiview data lies in integrative cancer genomics. Biologists seek to integrate data from multiple platforms of highthroughput genomic data to gain a more thorough understanding of disease mechanisms and detect cancer subtypes. In this case study, we seek to integrate four different types of genomic data to study how epigenetics and short RNAs influence the gene regulatory system in breast cancer.
We use the data set from cancer2012comprehensive. lock2013bayesian analyzed this data set using integrative methods, and we follow the same preprocessing procedure: i) filter out genes in the expression data whose standard deviation is less than 1.5, ii) take the square root of the methylation data, and iii) take the log of the miRNA data. We end up with a data set of 348 tumor samples including:

RNAseq gene expression (GE) data for 645 genes,

DNA methylation (ME) data for 574 probes,

miRNA expression (miRNA) data for 423 miRNAs,

Reverse phase protein array (RPPA) data for 171 proteins.
The data set contains samples measured on each platform with associated subtype calls from each technology platform, as well as integrated cluster labels from biologists; we use the integrated labels as the true labels. We also merge subtypes 3 and 4 of the integrated labels, as these two subtypes correspond to Luminal A and Luminal B, respectively, in the labels predicted from the gene expression data (PAM50 mRNA).
Figure 6 in Appendix H gives the distribution of the data from the different platforms. For our iGecco+ methods, we use Euclidean distances for the gene expression and protein data, as the distributions of these two data sets appear Gaussian; the binomial deviance for the methylation data, as its values lie between 0 and 1; and Manhattan distances for the miRNA data, as this data is highly skewed. We compare our adaptive iGecco+ with other existing methods. From Table 10, we see that our method outperforms all of them.
Method  Adjusted Rand Index 

Hclust: GE  0.51 
Hclust: Meth  0.39 
Hclust: miRNA  0.21 
Hclust: Protein  0.24 
Hclust:  Euclidean  0.51 
Hclust:  Gower  0.40 
iCluster+  0.36 
Bayesian Consensus Clustering  0.35 
Adaptive iGecco+  0.71 
We also investigate the features selected by adaptive iGecco+, shown in Table 11, and find that most are known in the breast cancer literature, validating our method. For example, FOXA1 is known to segregate the luminal subtypes from the others (badve2007foxa1), and AGR3 is a known biomarker for breast cancer prognosis and early detection from blood (garczyk2015agr3). Several well-known miRNAs were selected, including MIR135b, which is upregulated in breast cancer and promotes cell growth (hua2016mir), as well as MIR190, which suppresses breast cancer metastasis (yu2018mir). Several known proteins were also selected, including ERalpha, which is overexpressed in early stages of breast cancer (hayashi2003expression), and GATA3, which plays an integral role in breast luminal cell differentiation and breast cancer progression (cimino2013gata3).
Data view  Features  

Gene Expression 


miRNA 


Methylation 


Protein  “ER.alpha”, “GATA3”, “AR”, “Cyclin_E1” 
We also visualize resulting clusters from adaptive iGecco+. In Figure 4, we see that there is a clear separation between groups and adaptive iGecco+ identifies meaningful subtypes.
6 Discussion
In this paper, we develop a convex formulation of integrative clustering for high-dimensional mixed multi-view data. We propose a unified, elegant methodological solution to two critical issues in clustering and data integration: (i) dealing with mixed types of data and (ii) selecting interpretable features in high-dimensional settings. Specifically, we show that clustering for mixed multi-view data can be achieved using different data-specific convex losses with a joint fusion penalty. We also introduce a shifted group-lasso penalty that shrinks noise features to their loss-specific centers, hence selecting the features that play important roles in separating groups. In addition, we make an optimization contribution by proposing, and proving the convergence of, a new general multi-block ADMM algorithm with subproblem approximations that efficiently solves our problem. Empirical studies show that iGecco+ outperforms existing clustering methods and selects interpretable features separating the clusters.
This paper focuses on the methodological development of integrative clustering and feature selection, but there are many possible avenues for future research related to this work. For example, we expect in future work to show that our methods inherit the strong theoretical properties of other convex clustering approaches, such as clustering consistency and prediction consistency. An important problem in practice is choosing which loss function is appropriate for a given data set. While this is beyond the scope of this paper, an interesting direction for future research would be to learn the appropriate convex loss function in a data-driven manner. Additionally, many have shown that block-missing structure is common in mixed data (yu2019optimal; xiang2013multi); a potentially interesting direction for future work would be to develop an extension of iGecco+ that appropriately handles block-missing multi-view data. Furthermore, weylandt2019dynamic developed a fast algorithm to compute the entire convex clustering solution path and used it to visualize the results via a dendrogram and pathwise plot; in future work, we expect that such algorithmic regularization path approaches can also be applied to our method, allowing our solution to be represented as a dendrogram alongside other dynamic visualizations. Finally, while we develop an efficient multi-block ADMM algorithm, there may be further room to speed up the computation of iGecco+, potentially through distributed optimization approaches.
In this paper, we demonstrate that our method can be applied to integrative genomics, yet it can be applied to other fields such as multimodal imaging, national security, online advertising, and environmental studies where practitioners aim to find meaningful clusters and features at the same time. In conclusion, we introduce a principled, unified approach to a challenging problem that demonstrates strong empirical performance and opens many directions for future research.
Acknowledgements
The authors would like to thank Michael Weylandt and Tianyi Yao for helpful discussions. GA and MW also acknowledge support from NSF DMS1554821, NSF NeuroNex1707400, and NSF DMS1264058.
Integrative Generalized Convex Clustering Optimization and Feature Selection for Mixed MultiView Data: Supplementary Materials
Minjie Wang, Genevera I. Allen
The supplementary materials are organized as follows. In Appendix A, we prove the properties of our methods discussed in Section 2.4. In Appendix B, we provide a detailed proof of Theorem 3.2.3. We provide implementation details for Gecco+ with differentiable losses in Appendix C, and with non-differentiable losses in Appendix D. We introduce multinomial Gecco(+) in Appendix E. In Appendix F, we show how to calculate the loss-specific centers in Table 1. In Appendix G, we visualize the results for the authors data discussed in Section 5.1. In Appendix H, we show the distribution of the data from the different platforms used in Section 5.3.
Appendix A Proof of Propositions
Propositions 2.4 and 2.4 are direct extensions of Proposition 2.1 in chi2015splitting. Notice that they proved the solution path depends continuously on the tuning parameter and the weight matrix w. The same argument applies to our tuning parameters, the loss weights, and the feature weights. It is also clear that the loss is continuous with respect to the data.
We show in detail how to prove Proposition 2.4. First, we rewrite the objective as:
By definition, the loss-specific cluster center minimizes the corresponding loss. Since the loss is convex, this is equivalent to the first-order condition that zero lies in the subdifferential of the loss at the center. Hence,
We use a proof approach similar to that of chi2015splitting. A point furnishes a global minimum of a convex function if and only if all forward directional derivatives at that point are nonnegative. We calculate the directional derivatives:
Note:
The generalized Cauchy-Schwarz inequality therefore implies
Hence