Distributed Linear Model Clustering over Networks: A Tree-Based Fused-Lasso ADMM Approach
Abstract
In this work, we aim to improve model estimation efficiency by aggregating the neighbors' information while also identifying the subgroup membership of each node in the network. A tree-based penalty is proposed to save computation and communication costs. We design a decentralized generalized alternating direction method of multipliers (ADMM) algorithm for solving the objective function in parallel. Theoretical properties are derived to guarantee both model consistency and algorithm convergence. Thorough numerical experiments are also conducted to back up our theory, and they show that our approach compares favorably in terms of estimation accuracy, computation speed, and communication cost.
Department of Statistics, Iowa State University Jia Liu jialiu@iastate.edu
Department of Computer Science, Iowa State University Zhengyuan Zhu zhuz@iastate.edu
Department of Statistics, Iowa State University
1 Introduction
In this paper, we consider a fundamental distributed linear model clustering problem over networks (also sometimes referred to as subgroup analysis in the statistics literature). Suppose there are $m$ nodes in the network, each of which holds a dataset denoted as $\mathcal{D}_i = \{(x_{i,r}, y_{i,r})\}_{r=1}^{n}$, where $x_{i,r} \in \mathbb{R}^p$ and $y_{i,r} \in \mathbb{R}$ ($r = 1, \dots, n$) represent the $r$th covariate vector and response in the $i$th dataset, respectively, and $n$ denotes the size of the dataset. For ease of exposition, the datasets are assumed to be balanced (i.e., all nodes have $n$ samples); our algorithms and results in this paper can easily be extended to cases with datasets of unbalanced sizes. Hence, the total sample size in the network is $N = mn$. We assume that there exist $S$ underlying clusters of the nodes, and the data pair $(x, y)$ from the $s$th cluster follows a common linear model:
$$ y = x^\top \beta_s + \varepsilon, \tag{1} $$
where $\beta_s$ is a $p$-dimensional coefficient vector for the $s$th cluster, and the independent error $\varepsilon$ has zero mean and a known variance $\sigma^2$. The linear model in (1) varies across the underlying clusters, i.e., the datasets in the same cluster share the same coefficient and vice versa. Our goal is to identify the cluster membership of each node and the corresponding coefficients. However, due to communication limitations or privacy restrictions, one cannot merge these datasets at a single location. Thus, the main challenge of this problem is to perform clustering and estimate the coefficients of each cluster in the network in a distributed fashion.
The above problem naturally arises in many machine learning applications. For example, a wireless sensor network is deployed in a large spatial domain to collect data and learn the relationship between the soil temperature and air temperature Lee et al. (2015). The domain can be divided into several subregions according to land-cover types, such as forest and grassland, and temperature relationships may vary geographically: sensors in the same subregion may share the same regression relationship, and the coefficients vary across different subregions. Similar scenarios could also emerge in other applications, such as meta-analysis on medical data Tang and Song (2016) and federated learning for speech analysis Konecny et al. (2015), to name just a few.
Unfortunately, distributively clustering nodes based on regression models over networks is challenging, as it involves two nontrivial, interdependent, and conflicting subtasks: i) statistical estimator design and ii) distributed optimization under the proposed estimator. In the literature, there exist tree-based centralized estimator designs that achieve strong statistical performance guarantees with low computational complexity (e.g., Tang and Song (2016); Li and Sang (2018); see Section 2 for detailed discussions). However, the tree-based penalty architectures make it difficult to design distributed optimization algorithms. On the other hand, there exist efficient distributed algorithms for solving related clustering problems over networks (e.g., Jiang et al. (2018); Wang et al. (2018); Hallac et al. (2015); see Section 2 for details). However, it is unclear whether they can provide statistical performance guarantees, such as selection consistency and estimation normality. Moreover, they all suffer from high computational and communication costs. In light of the limitations of these existing works, we ask the following fundamental question: Could we develop a new distributed approach that achieves both strong statistical performance guarantees and low computation and communication costs? In other words, could we achieve the best of both worlds of the existing methods in the literature?
In this paper, we show that the answer to the above question is affirmative. The main contribution of this paper is that, for the first time, we develop a new minimum spanning tree (MST) based fused-lasso approach for solving the network clustering problem. Our approach enjoys oracle statistical performance and enables low-complexity distributed optimization algorithm design with a linear convergence rate. The main results of this paper are summarized as follows:

Low-Complexity Estimator Design: We propose a new MST-based penalty function for the clustering problem with $O(m)$ complexity. Specifically, by comparing the coefficient similarities between the nodes, we construct a minimum spanning tree from the original network graph, and only the edges in the tree are considered in the penalty function. Under this approach, the number of terms in the penalty function is reduced to $m-1$ (hence $O(m)$, as opposed to $O(m^2)$).

Statistical Performance Guarantee: Based on the MST structure, we propose the use of adaptive lasso to penalize the linear model coefficient differences. We show that our proposed estimator enjoys elegant oracle properties (cf. Fan and Li (2001)), which means that our method can identify the nodes’ cluster memberships almost surely (i.e., with probability one) as the size of datasets increases and the estimators achieve asymptotic normality.

Distributed Optimization Algorithm Design: Due to the restrictions imposed by the tree-based estimator design, traditional gradient or ADMM-type distributed methods cannot be applied to solve the objective function and find the nodes’ cluster memberships distributively. In this paper, we develop a novel decentralized generalized ADMM algorithm for solving the tree-based fused-lasso problem. Moreover, we show that our algorithm has a simple node-based structure that is easy to implement and also enjoys linear convergence.
Collectively, our results in this paper contribute to the theories of low-complexity model inference/clustering over networks and distributed optimization. Due to space limitations, we relegate most of the proof details to the supplementary material.
2 Related work
In the literature, many approaches have been developed to cluster heterogeneous data, such as the mixture model methods Hastie and Tibshirani (1996); Shen and He (2015); Chaganty and Liang (2013), the spectral clustering methods Rohe et al. (2011), etc. However, most of the literature focuses on clustering the observations $y$ rather than the relationship between $y$ and the covariate $x$. The authors of Ma and Huang (2017); Ma et al. (2018) are among the first to investigate the network clustering problem under the subgroup analysis framework. Specifically, they considered a pairwise fusion penalty term for clustering the intercepts and the regression coefficients, respectively. In Tang and Song (2016), the authors proposed a fused-lasso method termed FLARCC to identify heterogeneity patterns of coefficients and to merge the homogeneous parameter clusters across multiple datasets in regression analysis with low computational complexity. However, FLARCC does not exploit any spatial network structure to further improve the performance. The authors of Li and Sang (2018) proposed a spatially clustered coefficient (SCC) regression method, which is based on a minimum spanning tree (MST) of the network graph to capture the spatial relationships among the nodes. In line with these works, we adopt the penalty-function-based framework to recover cluster identities by adding a penalty term to the ordinary least squares problem for (1). Unlike all the above methods, which are implemented on a single centralized machine, a key distinguishing feature of our work is that we need to conduct clustering in a distributed fashion. As will be shown later, we improve the tree-based fusion penalty approach proposed in Li and Sang (2018) to enhance the estimation efficiency as well as significantly reduce the computation and communication load for distributed algorithm design.
Our work also contributes to the theory of distributed optimization over networks, which has attracted a flurry of recent research (see, e.g., Nedic and Ozdaglar (2009); Yuan et al. (2016); Shi et al. (2014); Eisen et al. (2017)). In the general framework of distributed optimization, all nodes in a connected network distributively and collaboratively solve an optimization problem of the form $\min_{x} \sum_{i=1}^{m} f_i(x)$, where each $f_i$ is the objective function observable only to the $i$th node and $x$ is a global decision variable across all nodes. By introducing a local copy $x_i$ at each node, the above distributed optimization problem can be reformulated in the following penalized version of the so-called consensus form: $\min_{x_1,\dots,x_m} \sum_{i=1}^{m} f_i(x_i) + \sum_{(i,j)} p_{ij}\|x_i - x_j\|_2^2$, where the second sum runs over the edges $(i,j)$ of the network and $p_{ij}$ is a weight parameter for penalizing the disagreement between the $i$th and $j$th nodes. Interestingly, in this work, in contrast to traditional distributed algorithms that focus on the consensus problems stated above, we consider whether there exist disagreements among the true $\beta_i$. The nodes are to be classified into several clusters, and the nodes in each cluster share the same $\beta$. We note that the authors of Jiang et al. (2018) and Wang et al. (2018) also focused on discovering the clustering patterns among the nodes with decentralized algorithms. However, they adopted a pairwise $\ell_2$ penalty function to obtain consensus of the inner-cluster weights, which can be reformulated as the well-known Laplacian penalty Ando and Zhang (2007). A main limitation of the Laplacian penalty is that it cannot shrink the pairwise differences of the parameter estimates to zero (which is also verified in our simulations).
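The limitation of the Laplacian penalty noted above can be seen in a one-dimensional toy problem. The following minimal sketch is an illustration under our own assumptions, not an experiment from the paper: for an observed difference `a`, a squared (Laplacian-type) penalty only rescales the difference, whereas an absolute-value (lasso-type) penalty sets it exactly to zero once the penalty level exceeds `|a|`. The function names and the value of `a` are hypothetical.

```python
def quadratic_shrink(a, lam):
    """Minimizer over d of 0.5*(d - a)**2 + lam*d**2 (Laplacian-type penalty)."""
    return a / (1.0 + 2.0 * lam)


def l1_shrink(a, lam):
    """Minimizer over d of 0.5*(d - a)**2 + lam*abs(d) (lasso-type penalty)."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0


a = 0.3  # hypothetical observed coefficient difference between two nodes
for lam in (0.5, 5.0, 50.0):
    # The quadratic penalty shrinks the difference but never reaches zero;
    # the absolute-value penalty returns exactly zero whenever lam >= |a|.
    print(lam, quadratic_shrink(a, lam), l1_shrink(a, lam))
```

This is why a fusion penalty based on absolute differences can declare two nodes to be in the same cluster, while a Laplacian penalty can only make their estimates similar.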
The most related work to ours is Hallac et al. (2015), where the network lasso method was introduced. In the network lasso method, the authors adopted an $\ell_2$-norm penalty for each edge in the network graph. They also proposed a distributed alternating direction method of multipliers (ADMM) to solve the network lasso problem. Our work differs from Hallac et al. (2015) in the following key aspects: 1) The number of penalty terms in Hallac et al. (2015) depends on the number of edges in the network graph, which yields an $O(m^2)$ computation complexity and does not scale to large networks. In this paper, we consider a tree-based penalty function, which contains exactly $m-1$ penalty terms; 2) The penalty function in the network lasso method Hallac et al. (2015) adopts the $\ell_2$ norm for the difference, while we consider an adaptive $\ell_1$ norm for the vector difference, which enjoys elegant oracle properties (i.e., selection consistency and asymptotic normality); 3) The algorithm in Hallac et al. (2015) is based on the classical ADMM algorithm with two constraints on each edge, while we propose a new generalized ADMM method with only one constraint on each edge, which significantly reduces the algorithm’s implementation complexity; 4) We rigorously prove the statistical consistency and algorithmic convergence of our proposed approach, neither of which was studied in Hallac et al. (2015).
3 Model and problem statement
Given a network $G = (V, E)$, where $V$ and $E$ represent the node and edge sets, respectively, our goal is to estimate the coefficients $\beta_i$ and determine the cluster membership for each node. This problem can be formulated as minimizing the following loss function:
$$ L_G(\beta) = \frac{1}{2}\sum_{i=1}^{m} \|y_i - X_i\beta_i\|_2^2 + \sum_{(i,j)\in E} P_\lambda(\beta_i - \beta_j), \tag{2} $$
where $i$ denotes the $i$th node in the network; $y_i$ and $X_i$ represent the responses and design matrix at the $i$th node, respectively; and $P_\lambda(\cdot)$ is a penalty function with tuning parameter $\lambda$. Note that the objective function in (2) consists of two parts: the first part is an ordinary least squares (OLS) problem for all the coefficients $\beta_i$; the second part is a penalty term designed to shrink the difference of any two coefficient vectors if the corresponding nodes are connected. Note that the second term in (2) depends on the network topology. Thus, we make the following assumption, which is necessary to guarantee that the problem is well-defined in terms of estimation accuracy:
Assumption 1
Given a connected network $G = (V, E)$, for any node $i$ from a cluster with more than one member, there exists another node $j$ from the same cluster such that the edge $(i,j) \in E$.
Under Assumption 1, each node is connected with a member of its own cluster if the cluster size is larger than one. Hence, by removing inter-cluster edges, i.e., identifying edges with nonzero coefficient differences, the original network graph can be reduced to $S$ subgraphs, which are the subgroup clusters. For the objective function in (2), several important remarks are in order:
Remark 1
The penalty terms in the objective function (2) consist of the pairwise coefficient differences over all edges in the network graph. If the penalty function is chosen as $P_\lambda(\cdot) = \lambda\|\cdot\|_2$, then (2) has the same form as in the network lasso method Hallac et al. (2015). The objective function (2) can also be viewed as a variant of the method proposed in Ma and Huang (2017), where the penalty terms are all pairwise differences of the nodes, and hence the total number of penalty terms is exactly $m(m-1)/2$. Thanks to Assumption 1, we only need to consider the differences between the end nodes of edges. Thus, the number of penalty terms can be reduced to exactly $|E|$. However, the value of $|E|$ still implies that the number of penalty terms in (2) could scale as $O(m^2)$ if the network is dense, which will in turn result in heavy computation and communication loads as the network size gets large. To address this problem, we propose a simplified tree-based penalty function in Section 4.
4 Problem reformulation: a tree-based approach
As mentioned earlier, and as has long been noted in the statistics (see, e.g., Tang and Song (2016); Li and Sang (2018)) and optimization (see, e.g., Chow et al. (2016)) communities, directly including all edges in the penalty terms incurs high computational and communication complexity. To reduce the redundant penalty terms, several strategies have been proposed, including the order method in Ke et al. (2015); Tang and Song (2016) and the minimum spanning tree (MST) approach in Li and Sang (2018). Specifically, in Ke et al. (2015); Tang and Song (2016), the authors first determined the OLS estimates of the coefficients and then ordered the coefficients. They then presumed that similar coefficients will be neighbors with high probability, and only the regularization terms associated with adjacent coefficients are considered. By contrast, in Li and Sang (2018), the authors used the spatial distance to construct an MST, and only the penalty terms in the tree are preserved. In essence, these two strategies are both tree-based approaches, with the only difference being the definition of the distance measure for the tree: the first one uses model similarity, while the second one uses spatial distances. In this paper, we propose a new tree-based approach, where the distance measure for the tree can be viewed as integrating the above two measures in some sense. We will show that this new distance measure achieves surprising performance gains.
Specifically, we construct an MST as follows: First, local OLS estimators are determined at each node individually: $\hat\beta_i^{OLS} = (X_i^\top X_i)^{-1} X_i^\top y_i$. Then, the weight for two nodes $i$ and $j$ is defined based on their local model similarity and their connection relationship in the graph as follows:
$$ w_{ij} = \begin{cases} \|\hat\beta_i^{OLS} - \hat\beta_j^{OLS}\|_2, & \text{if } (i,j) \in E, \\ \infty, & \text{otherwise}. \end{cases} \tag{3} $$
The weight in (3) contains two important pieces of information: one is the network topology, which is characterized by spatial distances (e.g., in a sensor network, the nodes can only be connected within a certain communication range); the other is the local model similarity, which reflects the likelihood of two nodes being in the same cluster. Based on (3), an MST can be constructed so that only the penalty terms associated with the MST are considered in the objective function:
$$ L_T(\beta) = \frac{1}{2}\sum_{i=1}^{m} \|y_i - X_i\beta_i\|_2^2 + \sum_{(i,j)\in \mathrm{MST}_w} P_\lambda(\beta_i - \beta_j), \tag{4} $$
where the notation $\mathrm{MST}_w$ signifies that the MST is based on the model similarity weights $w_{ij}$. Note that the estimation efficiency and clustering accuracy significantly depend on the penalty function. The following lemma guarantees that the nodes in the same cluster are connected in the $\mathrm{MST}_w$ based on the weights defined in (3) (see Section 2.1 in the supplementary material for proof details).
Lemma 1
Under Assumption 1, given an $\mathrm{MST}_w$ based on the weights defined in (3), as the local sample size $n \to \infty$, with probability one, for any node $i$ in a cluster with more than one member, there exists a node $j$ from the same cluster such that the edge $(i,j)$ is in the $\mathrm{MST}_w$.
With Lemma 1, the number of inter-cluster edges is $S-1$. Thus, the $\mathrm{MST}_w$ is a connected graph with the smallest possible number of inter-cluster edges. Also, the $\mathrm{MST}_w$ can be separated into $S$ clusters by identifying these $S-1$ inter-cluster edges. We note that there exist distributed methods to find the MST (e.g., the GHS algorithm Gallager et al. (1983)), and their implementation details are beyond the scope of this paper.
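The tree construction described in this section can be sketched as follows. This is a minimal centralized illustration, assuming the edge weight is the Euclidean distance between local OLS estimates of connected nodes; it uses Kruskal's algorithm rather than a distributed method such as GHS, and the function name, the toy estimates, and the toy graph are all hypothetical.

```python
import math


def mst_from_local_estimates(beta_hat, graph_edges):
    """Kruskal's algorithm with weight ||beta_hat[i] - beta_hat[j]||_2 on graph edges.

    Node pairs that are not graph edges are never considered, which plays the
    role of an infinite weight.
    """
    parent = list(range(len(beta_hat)))

    def find(u):
        # Union-find root lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    weighted = sorted((math.dist(beta_hat[i], beta_hat[j]), i, j)
                      for i, j in graph_edges)
    tree = []
    for _, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding (i, j) does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree


# Hypothetical local OLS estimates: nodes 0-2 look alike, nodes 3-4 look alike.
beta_hat = [(1.0, 0.1), (1.1, 0.0), (0.9, 0.05), (3.0, 2.0), (3.2, 2.1)]
graph_edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (2, 4)]
tree = mst_from_local_estimates(beta_hat, graph_edges)
print(tree)
```

In this toy graph the tree keeps the cheap within-group edges and only one expensive edge joining the two groups, which mirrors the role of the $S-1$ inter-cluster edges discussed above.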
5 Statistical model: an adaptive fused-lasso based approach
For convenience, we use $\beta_{i,k}$ to denote the $k$th element of vector $\beta_i$. Based on the MST constructed in Section 4, we specialize the loss function in (4) by adopting the following adaptive lasso penalty:
$$ L(\beta) = \frac{1}{2}\sum_{i=1}^{m} \|y_i - X_i\beta_i\|_2^2 + \lambda \sum_{i=1}^{m} \sum_{j\in\mathcal{N}_i,\, j>i} \sum_{k=1}^{p} \pi_{ij,k}\,|\beta_{i,k} - \beta_{j,k}|, \tag{5} $$
where $\mathcal{N}_i$ represents the set of the neighboring nodes of node $i$ in the MST, and $\pi_{ij} = (\pi_{ij,1}, \dots, \pi_{ij,p})^\top$ is an adaptive weight vector defined as $\pi_{ij,k} = 1/|\hat\beta^{OLS}_{i,k} - \hat\beta^{OLS}_{j,k}|^{\gamma}$ for some constant $\gamma > 0$. Therefore, our proposed estimator is $\hat\beta = \arg\min_{\beta} L(\beta)$.
Remark 2
Here, our use of an adaptive lasso penalty is motivated by the following: 1) Adaptive lasso is known to be an oracle procedure for related variable selection problems in statistics Fan and Li (2001); 2) With an adaptive lasso penalty, the objective function in (5) is strongly convex as long as the design matrix is of full column rank. This implies that the minimizer of (5) is unique. In Ma and Huang (2017); Ma et al. (2018), similar clustering methods were proposed based on the minimax concave penalty (MCP) and the smoothly clipped absolute deviation (SCAD) penalty, both of which are concave penalties and have been shown to be statistically efficient. However, from an optimization perspective, concave penalties render the objective function nonconvex, which in turn makes algorithm design intractable. In Li and Sang (2018), a lasso penalty was also adopted, but there is no proof of the oracle properties of their estimator.
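To make the adaptive weighting concrete, the following minimal sketch computes coordinate-wise adaptive weights from local OLS estimates and evaluates the fused penalty over tree edges. It assumes the weight is the inverse of the absolute OLS difference raised to the power `gamma`; the small constant `eps` guarding against division by zero is our own addition, and all names and numbers are hypothetical.

```python
def adaptive_weights(beta_ols, tree_edges, gamma=1.0, eps=1e-12):
    """Coordinate-wise weights 1 / |beta_ols[i][k] - beta_ols[j][k]|**gamma per tree edge.

    eps is a hypothetical safeguard against division by zero.
    """
    return {
        (i, j): [1.0 / (abs(a - b) + eps) ** gamma
                 for a, b in zip(beta_ols[i], beta_ols[j])]
        for i, j in tree_edges
    }


def fused_penalty(beta, weights, lam):
    """lam * sum over tree edges and coordinates of weight * |beta_i,k - beta_j,k|."""
    return lam * sum(
        w * abs(a - b)
        for (i, j), ws in weights.items()
        for w, a, b in zip(ws, beta[i], beta[j])
    )


# Hypothetical local OLS estimates: nodes 0 and 1 are similar, node 2 differs.
beta_ols = [[1.0, 2.0], [1.1, 1.9], [3.0, 0.5]]
tree_edges = [(0, 1), (1, 2)]
weights = adaptive_weights(beta_ols, tree_edges)

# A candidate in which nodes 0 and 1 are fused to a common coefficient vector.
beta_fused = [[1.05, 1.95], [1.05, 1.95], [3.0, 0.5]]
penalty_unfused = fused_penalty(beta_ols, weights, lam=1.0)
penalty_fused = fused_penalty(beta_fused, weights, lam=1.0)
print(penalty_unfused, penalty_fused)
```

Small OLS differences receive large weights, so fusing nodes 0 and 1 lowers the penalty substantially, while the large difference on the edge to node 2 is penalized only lightly.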
For more compact notation in the subsequent analysis, we rewrite the objective function (5) in the following matrix form:
$$ L(\beta) = \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{k=1}^{(m-1)p} \pi_k \left|[(H \otimes I_p)\beta]_k\right|, \tag{6} $$
where $y = (y_1^\top, \dots, y_m^\top)^\top$, $X = \mathrm{diag}(X_1, \dots, X_m)$, and $\beta = (\beta_1^\top, \dots, \beta_m^\top)^\top$ are the response vector, the design matrix, and the coefficient vector, respectively; and $\otimes$ denotes the Kronecker product. In (6), $H$ is the incidence matrix of the $\mathrm{MST}_w$, which has full row rank, and each entry $H_{l,i}$ of $H$ is defined as:
$$ H_{l,i} = \begin{cases} 1, & \text{if } i = s(l), \\ -1, & \text{if } i = e(l), \\ 0, & \text{otherwise}, \end{cases} \tag{7} $$
where $s(l)$ and $e(l)$ denote the starting and ending node indices of edge $l$ in the MST, respectively, with $s(l) < e(l)$. In (6), $\pi = (\pi_1, \dots, \pi_{(m-1)p})^\top$, where $\pi_k = 1/|[(H \otimes I_p)\hat\beta^{OLS}]_k|^{\gamma}$ and $\hat\beta^{OLS} = ((\hat\beta_1^{OLS})^\top, \dots, (\hat\beta_m^{OLS})^\top)^\top$ is the vector form of the OLS estimates. Note that by adding one more row to $H$, we can form a square and full rank matrix $\tilde{H}$ Li and Sang (2018), and the objective function (6) can be equivalently rewritten as:
$$ L(\beta) = \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{k=1}^{(m-1)p} \pi_k \left|[(\tilde{H} \otimes I_p)\beta]_k\right|, \tag{8} $$
where $\tilde{H}$ is a full rank square matrix. Define $\Theta = (\tilde{H} \otimes I_p)\beta$, whose first $(m-1)p$ entries are the differences of the connected nodes' coefficients. It then follows that the objective function in (8) can be rewritten in terms of $\Theta$ as:
$$ \tilde{L}(\Theta) = \frac{1}{2}\|y - X(\tilde{H}^{-1} \otimes I_p)\Theta\|_2^2 + \lambda \sum_{k=1}^{(m-1)p} \pi_k |\Theta_k|. \tag{9} $$
Our estimator then becomes $\hat\Theta = \arg\min_{\Theta} \tilde{L}(\Theta)$. Since there is a one-to-one transformation between $\beta$ and $\Theta$ (i.e., $\Theta = (\tilde{H} \otimes I_p)\beta$), we can instead focus on the theoretical properties of $\hat\Theta$. Denote the true coefficients as $\beta^0$, and let $\Theta^0 = (\tilde{H} \otimes I_p)\beta^0$. Note that if two connected nodes are from the same cluster, the corresponding elements in $\Theta^0$ are zero. We denote the set of nonzero elements in $\Theta^0$ as $\mathcal{A}$. Similarly, the set of nonzero elements in $\hat\Theta$ is denoted as $\hat{\mathcal{A}}$. To prove the oracle properties of $\hat\Theta$, we need the following assumptions for the linear model in (1):
Assumption 2
For the linear model in (1): i) the errors are i.i.d. random variables with zero mean and variance $\sigma^2$; ii) $\frac{1}{n}\tilde{X}^\top \tilde{X} \to C$ for some positive definite matrix $C$ as $n \to \infty$, where $\tilde{X} = X(\tilde{H}^{-1} \otimes I_p)$.
We note that in the second condition in Assumption 2, since $\tilde{H}$ is a full rank square matrix, $\tilde{X}^\top \tilde{X}$ is positive definite if $X$ has full column rank. Now, we state the oracle properties of $\hat\Theta$ as follows:
Theorem 1
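To illustrate the one-to-one reparameterization between node coefficients and edge differences used earlier in this section, the following minimal sketch treats the scalar case ($p = 1$). It assumes that the extra row appended to the incidence matrix is the constant row with entries $1/\sqrt{m}$; this specific choice is our assumption, since any row outside the row space of the incidence matrix would serve. The function names and the toy tree are hypothetical.

```python
import math


def to_theta(beta, tree_edges):
    """Edge differences beta[s] - beta[e] (with s < e), plus one extra coordinate."""
    m = len(beta)
    diffs = [beta[min(s, e)] - beta[max(s, e)] for s, e in tree_edges]
    return diffs + [sum(beta) / math.sqrt(m)]  # assumed extra row: ones / sqrt(m)


def to_beta(theta, tree_edges):
    """Invert to_theta by walking the tree from node 0 and fixing the common level."""
    m = len(tree_edges) + 1
    adj = {i: [] for i in range(m)}
    for l, (s, e) in enumerate(tree_edges):
        s, e = min(s, e), max(s, e)
        adj[s].append((e, -theta[l]))  # beta[e] = beta[s] - theta[l]
        adj[e].append((s, theta[l]))   # beta[s] = beta[e] + theta[l]
    offset = {0: 0.0}
    stack = [0]
    while stack:
        u = stack.pop()
        for v, delta in adj[u]:
            if v not in offset:
                offset[v] = offset[u] + delta
                stack.append(v)
    # The extra coordinate pins down the level of node 0.
    level = (theta[-1] * math.sqrt(m) - sum(offset.values())) / m
    return [level + offset[i] for i in range(m)]


beta = [1.0, 1.0, 3.0, 3.0, 2.5]               # hypothetical node coefficients
tree_edges = [(0, 1), (1, 2), (2, 3), (1, 4)]  # hypothetical spanning tree
theta = to_theta(beta, tree_edges)
beta_back = to_beta(theta, tree_edges)
print(theta[:-1], beta_back)
```

Edges joining nodes with equal coefficients produce zero entries in the transformed vector, which is the sparsity pattern that the oracle properties refer to.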
6 Optimization algorithm: an ADMM based distributed approach
In this section, we design a distributed algorithm for minimizing (5). Due to the penalty structure in (5), one natural idea is to use the popular ADMM method Boyd et al. (2011), which has been shown to be particularly suited for solving lasso-related problems (e.g., Ma and Huang (2017); Ma et al. (2018); Wahlberg et al. (2012); Zhu (2017)). However, in what follows, we first illustrate why it is challenging to use a regular ADMM approach to solve the MST-based fused-lasso clustering problem over networks in a distributed fashion. As a result, it is highly nontrivial to design a new ADMM-based algorithm that exploits the special problem structure in the MST regularizer. To this end, we first note that the penalty term in (5) can be written as:
$$ \lambda \sum_{l=1}^{m-1} \sum_{k=1}^{p} \pi_{l,k}\,|\beta_{s(l),k} - \beta_{e(l),k}|, \tag{10} $$
where $l$ represents the $l$th edge in the MST. In (10), $s(l)$ and $e(l)$ denote the starting and ending node indices of edge $l$, respectively, with $s(l) < e(l)$; and $\pi_l = \pi_{s(l)e(l)}$ is the corresponding adaptive weight vector for the $l$th edge. With the same notation as in Section 5, the weight difference at edge $l$ is $\theta_l = \beta_{s(l)} - \beta_{e(l)}$, and $\theta = (\theta_1^\top, \dots, \theta_{m-1}^\top)^\top$. Note that there are $m-1$ edges in the MST. Thus, the problem of minimizing the loss function in (5) can be reformulated as:
$$ \begin{aligned} \text{Minimize} \quad & \frac{1}{2}\sum_{i=1}^{m} \|y_i - X_i\beta_i\|_2^2 + \lambda \sum_{l=1}^{m-1} \sum_{k=1}^{p} \pi_{l,k}\,|\theta_{l,k}| \\ \text{subject to} \quad & \beta_{s(l)} - \beta_{e(l)} = \theta_l, \quad l = 1, \dots, m-1. \end{aligned} \tag{11} $$
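Then, we can construct an augmented Lagrangian with penalty parameter $\rho > 0$ for (11) as follows:
$$ L_\rho(\beta, \theta, z) = \frac{1}{2}\sum_{i=1}^{m} \|y_i - X_i\beta_i\|_2^2 + \lambda \sum_{l=1}^{m-1} \sum_{k=1}^{p} \pi_{l,k}\,|\theta_{l,k}| - \sum_{l=1}^{m-1} z_l^\top\big(\beta_{s(l)} - \beta_{e(l)} - \theta_l\big) + \frac{\rho}{2}\sum_{l=1}^{m-1} \big\|\beta_{s(l)} - \beta_{e(l)} - \theta_l\big\|_2^2, \tag{12} $$
where $z = (z_1^\top, \dots, z_{m-1}^\top)^\top$ is the vector of dual variables corresponding to the $m-1$ edges. In what follows, we derive the updating rules for $(\theta, \beta, z)$. First, given the primal and dual pair $(\beta^t, z^t)$, to determine the weight difference $\theta^{t+1}$, we need to solve the subproblem $\theta^{t+1} = \arg\min_{\theta} L_\rho(\beta^t, \theta, z^t)$, and hence for the $l$th edge with end nodes $s(l)$ and $e(l)$, it follows that (see Section 1 in the supplementary material for derivation details):
$$ \theta_l^{t+1} = S_{\lambda\pi_l/\rho}\big(\beta_{s(l)}^{t} - \beta_{e(l)}^{t} - z_l^{t}/\rho\big), \tag{13} $$
where $S_{\kappa}(\cdot)$ is the coordinate-wise soft-thresholding operator with $[S_{\kappa}(a)]_k = \mathrm{sign}(a_k)\max(|a_k| - \kappa_k, 0)$. Next, we derive the updating rule for $\beta^{t+1}$. With the classical ADMM, it can be shown that (see Section 1 in the supplementary material for derivation details):
$$ \beta^{t+1} = \big(X^\top X + \rho\,(H^\top H) \otimes I_p\big)^{-1}\big(X^\top y + (H \otimes I_p)^\top (z^t + \rho\,\theta^{t+1})\big). \tag{14} $$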
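Unfortunately, the matrix inverse in (14) cannot be computed in a distributed fashion due to the coupled structure of the Laplacian matrix $H^\top H$. Here, we show that the generalized ADMM studied in Deng and Yin (2016) can be leveraged to derive an updating rule for $\beta^{t+1}$ that can be implemented in a parallel fashion. To this end, instead of directly solving the subproblem $\arg\min_{\beta} L_\rho(\beta, \theta^{t+1}, z^t)$, we add a quadratic term to the subproblem ($P$ is positive semidefinite): $\beta^{t+1} = \arg\min_{\beta} L_\rho(\beta, \theta^{t+1}, z^t) + \frac{1}{2}(\beta - \beta^t)^\top P (\beta - \beta^t)$.
Now, the key step is to recognize that we can choose the matrix $P = \rho\,(D - H^\top H) \otimes I_p$, where $D = \mathrm{diag}(d_1, \dots, d_m)$ with positive scalars $d_i$ for node $i$, and $H^\top H$ is the Laplacian matrix for the MST. It follows that $\rho\,(H^\top H) \otimes I_p + P = \rho\,D \otimes I_p$, which is block diagonal. Plugging in $P$, we have the following local weight update:
$$ \beta_i^{t+1} = \big(X_i^\top X_i + \rho d_i I_p\big)^{-1}\Big[X_i^\top y_i + \rho\,(d_i - |\mathcal{N}_i|)\,\beta_i^{t} + \rho \sum_{j\in\mathcal{N}_i} \beta_j^{t} + \sum_{l \sim i} H_{l,i}\,\big(z_l^{t} + \rho\,\theta_l^{t+1}\big)\Big], \tag{15} $$
where $l \sim i$ means node $i$ is an end node of edge $l$, and $|\mathcal{N}_i|$ is the degree of node $i$ (i.e., the number of MST edges incident to node $i$). Thus, the updating of $\beta_i$ only requires the local and connected neighbors' information, which facilitates distributed implementation. Also, the matrix $P$ plays an important role in the algorithm convergence. Recall that $P = \rho\,(D - H^\top H) \otimes I_p$. To guarantee $P \succeq 0$ based on the Gershgorin circle theorem, we can choose $d_i$ as $d_i \ge 2|\mathcal{N}_i|$. Lastly, the dual variables can be updated as $z^{t+1} = z^t - \rho\big((H \otimes I_p)\beta^{t+1} - \theta^{t+1}\big)$, and hence for the $l$th edge, the corresponding dual update is (see Section 1 in the supplementary material for details):
$$ z_l^{t+1} = z_l^{t} - \rho\,\big(\beta_{s(l)}^{t+1} - \beta_{e(l)}^{t+1} - \theta_l^{t+1}\big). \tag{16} $$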
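The Gershgorin argument for choosing the node scalars can be checked numerically on a small tree. The sketch below is an illustration under our own assumptions: it forms the tree Laplacian, subtracts it from a diagonal matrix whose entries are twice the node degrees, and computes the Gershgorin lower bound for each row. A symmetric matrix whose bounds are all nonnegative is positive semidefinite. The tree and the function names are hypothetical.

```python
def tree_laplacian(m, tree_edges):
    """Graph Laplacian: node degrees on the diagonal, -1 for each tree edge."""
    lap = [[0] * m for _ in range(m)]
    for s, e in tree_edges:
        lap[s][s] += 1
        lap[e][e] += 1
        lap[s][e] -= 1
        lap[e][s] -= 1
    return lap


def gershgorin_lower_bounds(mat):
    """Per row: diagonal entry minus the sum of absolute off-diagonal entries."""
    size = len(mat)
    return [mat[i][i] - sum(abs(mat[i][j]) for j in range(size) if j != i)
            for i in range(size)]


def diag_minus_laplacian(d, lap):
    size = len(lap)
    return [[(d[i] if i == j else 0) - lap[i][j] for j in range(size)]
            for i in range(size)]


m = 5
tree_edges = [(0, 1), (1, 2), (1, 3), (3, 4)]  # hypothetical spanning tree
lap = tree_laplacian(m, tree_edges)
degree = [lap[i][i] for i in range(m)]

# Choosing twice the degree makes every Gershgorin disc lie in [0, inf).
safe_bounds = gershgorin_lower_bounds(
    diag_minus_laplacian([2 * g for g in degree], lap))
# Choosing only the degree gives negative bounds, so the test is inconclusive.
loose_bounds = gershgorin_lower_bounds(diag_minus_laplacian(degree, lap))
print(safe_bounds, loose_bounds)
```

The check only certifies positive semidefiniteness; larger node scalars remain valid but make each step more conservative.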
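Note, however, that the updating rules (13) and (16) are edge-based while (15) is node-based. To make the updating rules consistent, we define some additional notation for each edge $l$ with end nodes $i = s(l)$ and $j = e(l)$: At node $i$, we let $\theta_{ij} = \theta_l$ and $z_{ij} = z_l$; at node $j$, we let $\theta_{ji} = -\theta_l$ and $z_{ji} = -z_l$. With simple derivations, it can be verified that if $\theta_{ij}^{t} = -\theta_{ji}^{t}$ and $z_{ij}^{t} = -z_{ji}^{t}$ are satisfied in iteration $t$, then in iteration $t+1$, $\theta_{ij}^{t+1} = -\theta_{ji}^{t+1}$ and $z_{ij}^{t+1} = -z_{ji}^{t+1}$ still hold based on the following node-based updating rules: $\theta_{ij}^{t+1} = S_{\lambda\pi_{ij}/\rho}\big(\beta_i^{t} - \beta_j^{t} - z_{ij}^{t}/\rho\big)$ and
$$ z_{ij}^{t+1} = z_{ij}^{t} - \rho\,\big(\beta_i^{t+1} - \beta_j^{t+1} - \theta_{ij}^{t+1}\big). \tag{17} $$
Thus, we can set $\theta_{ij}^{0} = \theta_{ji}^{0} = 0$ and $z_{ij}^{0} = z_{ji}^{0} = 0$, which satisfy the above conditions. Note that the updating rule for $\theta_{ji}$ has the same structure as that for $\theta_{ij}$ because $\pi_{ij} = \pi_{ji}$ and $S_{\kappa}(-a) = -S_{\kappa}(a)$. Our method is summarized in Algorithm 1. The outputs of the algorithm are the estimated coefficients $\hat\beta_i$ and the coefficient differences $\hat\theta_{ij}$. Whether two connected nodes $i$ and $j$ are in the same cluster can be determined by checking whether $\hat\theta_{ij} = 0$. The following theorem guarantees the convergence speed of Algorithm 1.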
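The update cycle described in this section can be sketched for the scalar case ($p = 1$) as follows. This is a minimal single-process simulation under stated assumptions, not the authors' Algorithm 1: it uses the sign convention in which the dual variable is decreased by $\rho$ times the constraint residual, takes the node scalars as twice the node degrees, starts from the local OLS estimates, and adds a small constant to the adaptive-weight denominator. The toy data, tree, function names, and parameter values are hypothetical.

```python
import math


def soft_threshold(v, kappa):
    """Scalar soft-thresholding: sign(v) * max(|v| - kappa, 0)."""
    return math.copysign(max(abs(v) - kappa, 0.0), v)


def make_data(beta_true, n=20):
    """Deterministic toy data (hypothetical): y = beta * x + small bounded noise."""
    xs, ys = [], []
    for i, bt in enumerate(beta_true):
        xi = [(r % 5) - 1.5 for r in range(n)]
        yi = [bt * x + 0.1 * math.sin(1.7 * r + 0.9 * i) for r, x in enumerate(xi)]
        xs.append(xi)
        ys.append(yi)
    return xs, ys


def tree_admm(xs, ys, tree_edges, lam, rho=10.0, gamma=1.0, iters=3000):
    m = len(xs)
    a = [sum(x * x for x in xi) for xi in xs]                           # X_i^T X_i
    b = [sum(x * y for x, y in zip(xi, yi)) for xi, yi in zip(xs, ys)]  # X_i^T y_i
    ols = [b[i] / a[i] for i in range(m)]                               # local OLS
    deg = [0] * m
    for s, e in tree_edges:
        deg[s] += 1
        deg[e] += 1
    d = [2 * g for g in deg]  # node scalars chosen as twice the degree
    # Adaptive weights; 1e-12 is a hypothetical guard against division by zero.
    pi = [1.0 / (abs(ols[s] - ols[e]) + 1e-12) ** gamma for s, e in tree_edges]
    beta = ols[:]
    theta = [0.0] * len(tree_edges)
    z = [0.0] * len(tree_edges)
    for _ in range(iters):
        # Edge differences: soft-threshold the dual-shifted neighbor difference.
        theta = [soft_threshold(beta[s] - beta[e] - z[l] / rho, lam * pi[l] / rho)
                 for l, (s, e) in enumerate(tree_edges)]
        # Node coefficients: each node uses only its own and its neighbors' values.
        rhs = [b[i] + rho * d[i] * beta[i] for i in range(m)]
        for l, (s, e) in enumerate(tree_edges):
            v = z[l] + rho * theta[l] - rho * (beta[s] - beta[e])
            rhs[s] += v
            rhs[e] -= v
        beta = [rhs[i] / (a[i] + rho * d[i]) for i in range(m)]
        # Dual variables: step along the constraint residual of each edge.
        z = [z[l] - rho * (beta[s] - beta[e] - theta[l])
             for l, (s, e) in enumerate(tree_edges)]
    return beta, theta


beta_true = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]  # two hypothetical clusters
xs, ys = make_data(beta_true)
tree_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
beta_hat, theta_hat = tree_admm(xs, ys, tree_edges, lam=5.0)
inter_cluster = [edge for edge, th in zip(tree_edges, theta_hat) if th != 0.0]
print(inter_cluster, [round(v, 3) for v in beta_hat])
```

Edges whose estimated difference is exactly zero join nodes of the same cluster, so removing the remaining edges from the tree yields the estimated clusters. In an actual deployment each node would hold only its own data and exchange coefficient and edge variables with its tree neighbors; the loop above merely simulates those exchanges in one process.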
Theorem 2
Denote the KKT point for problem (11) as $u^* = (\beta^*, \theta^*, z^*)$. With a proper selection of $D$ such that $P \succeq 0$, the iterates $u^t = (\beta^t, \theta^t, z^t)$ converge to $u^*$ in the sense of the $G$-norm: $\|u^t - u^*\|_G \to 0$, where $\|\cdot\|_G$ represents the seminorm $\|u\|_G^2 = u^\top G u$, where $G$ is defined as:
Further, the convergence rate is linear, i.e., there exists $\delta > 0$ such that $\|u^{t+1} - u^*\|_G^2 \le (1+\delta)^{-1}\|u^{t} - u^*\|_G^2$.
7 Numerical Results
[Figure 1: simulated network graphs under the two connection-radius settings, panels (a) and (b).]
Due to space limitations, we only provide the numerical results on the impact of the choice of regularization on accuracy and cost. More detailed numerical studies can be found in the supplementary materials. We compare our MST-based regularization ($m-1$ penalty terms) to the pairwise regularization over all graph edges ($|E|$ penalty terms), which will be referred to as Graph regularization in this section. Both models are solved distributively by our proposed generalized ADMM algorithm. In the distributed algorithm, the nodes need to update the local and in each iteration. Clearly, the amount of data being transmitted grows as the graph becomes denser. We simulate a node network and each node contains samples. We adjust the network denseness by changing the connection radius . Two settings are compared: and (see Figure 1 (a) and (b)). We compare the accuracy and costs of the two models with simulations. The MSEs and the estimated cluster number are used for measuring accuracy. We set the baseline to be the average computation time and the average communication cost for the MST model under . The boxplots for the accuracy, the computation time ratios, and the communication cost ratios are shown in Figure 2. We can see that our method outperforms Graph regularization in all aspects: Our method improves the MSE at least , while reducing at least computation time and communication cost.
8 Conclusion
In this work, we considered the problem of distributively learning the regression coefficient heterogeneity over networks. We developed a new minimum spanning tree based adaptive fused-lasso model and a low-complexity distributed generalized ADMM algorithm to solve the problem. We investigated the theoretical properties of both the model consistency and the algorithm convergence. We showed that our model enjoys the oracle properties (i.e., selection consistency and asymptotic normality) and that our distributed optimization algorithm has a linear convergence rate. An interesting future topic is to generalize our framework to a more general class of regression problems, including generalized linear models and semiparametric linear models.
 Ando and Zhang (2007) Rie K Ando and Tong Zhang. Learning on graph with laplacian regularization. In Advances in neural information processing systems, pages 25–32, 2007.
 Boyd et al. (2011) S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1 – 122, 2011.
 Chaganty and Liang (2013) Arun Tejasvi Chaganty and Percy Liang. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning, pages 1040–1048, 2013.
 Chow et al. (2016) Yat-Tin Chow, Wei Shi, Tianyu Wu, and Wotao Yin. Expander graph and communication-efficient decentralized optimization. In 2016 50th Asilomar Conference on Signals, Systems and Computers, pages 1715–1720. IEEE, 2016.
 Deng and Yin (2016) Wei Deng and Wotao Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 66(3):889–916, 2016.
 Eisen et al. (2017) Mark Eisen, Aryan Mokhtari, and Alejandro Ribeiro. Decentralized quasi-Newton methods. IEEE Transactions on Signal Processing, 65(10):2613–2628, 2017.
 Fan and Li (2001) Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
 Gallager et al. (1983) Robert G. Gallager, Pierre A. Humblet, and Philip M. Spira. A distributed algorithm for minimum-weight spanning trees. ACM Transactions on Programming Languages and Systems (TOPLAS), 5(1):66–77, 1983.
 Hallac et al. (2015) David Hallac, Jure Leskovec, and Stephen Boyd. Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 387–396. ACM, 2015.
 Hastie and Tibshirani (1996) Trevor Hastie and Robert Tibshirani. Discriminant analysis by gaussian mixtures. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):155–176, 1996.
 Jiang et al. (2018) Zhanhong Jiang, Kushal Mukherjee, and Soumik Sarkar. On consensus-disagreement tradeoff in distributed optimization. In 2018 Annual American Control Conference (ACC), pages 571–576. IEEE, 2018.
 Ke et al. (2015) Zheng Tracy Ke, Jianqing Fan, and Yichao Wu. Homogeneity pursuit. Journal of the American Statistical Association, 110(509):175–194, 2015.
 Konecny et al. (2015) Jakub Konecny, Brendan McMahan, and Daniel Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.
 Lee et al. (2015) DJ Lee, Z Zhu, and P Toscas. Spatiotemporal functional data analysis for wireless sensor networks data. Environmetrics, 26(5):354–362, 2015.
 Li and Sang (2018) Furong Li and Huiyan Sang. Spatial homogeneity pursuit of regression coefficients for large datasets. Journal of the American Statistical Association, (just-accepted):1–37, 2018.
 Ma and Huang (2017) Shujie Ma and Jian Huang. A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association, 112(517):410–423, 2017.
 Ma et al. (2018) Shujie Ma, Jian Huang, and Zhiwei Zhang. Exploration of heterogeneous treatment effects via concave fusion. arXiv preprint arXiv:1607.03717, 2018.
 Nedic and Ozdaglar (2009) Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48, 2009.
 Rohe et al. (2011) Karl Rohe, Sourav Chatterjee, Bin Yu, et al. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915, 2011.
 Shen and He (2015) Juan Shen and Xuming He. Inference for subgroup analysis with a structured logistic-normal mixture model. Journal of the American Statistical Association, 110(509):303–312, 2015.
 Shi et al. (2014) Wei Shi, Qing Ling, Kun Yuan, Gang Wu, and Wotao Yin. On the linear convergence of the admm in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761, 2014.
 Tang and Song (2016) Lu Tang and Peter XK Song. Fused lasso approach in regression coefficients clustering: learning parameter heterogeneity in data integration. The Journal of Machine Learning Research, 17(1):3915–3937, 2016.
 Wahlberg et al. (2012) Bo Wahlberg, Stephen Boyd, Mariette Annergren, and Yang Wang. An admm algorithm for a class of total variation regularized estimation problems. IFAC Proceedings Volumes, 45(16):83–88, 2012.
 Wang et al. (2018) Weiran Wang, Jialei Wang, Mladen Kolar, and Nathan Srebro. Distributed stochastic multi-task learning with graph regularization. arXiv preprint arXiv:1802.03830, 2018.
 Yuan et al. (2016) Kun Yuan, Qing Ling, and Wotao Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.
 Zhu (2017) Yunzhang Zhu. An augmented admm algorithm with application to the generalized lasso problem. Journal of Computational and Graphical Statistics, 26(1):195–204, 2017.