An Efficient Linearly Convergent Regularized Proximal Point Algorithm for Fused Multiple Graphical Lasso Problems
Abstract
Analysing data from different classes or over a temporal grid has attracted a great deal of interest in recent years. As a result, various multiple graphical models for learning a collection of graphical models simultaneously have been derived by introducing sparsity in graphs and similarity across multiple graphs. This paper focuses on the fused multiple graphical Lasso model, which encourages not only a shared pattern of sparsity but also shared values of edges across different graphs. For solving this model, we develop an efficient regularized proximal point algorithm, where the subproblem in each iteration of the algorithm is solved by a superlinearly convergent semismooth Newton method. To implement the semismooth Newton method, we derive an explicit expression for the generalized Jacobian of the proximal mapping of the fused multiple graphical Lasso regularizer. Unlike widely used first order methods, our approach heavily exploits the underlying second order information through the semismooth Newton method. This can not only accelerate the convergence of the algorithm but also improve its robustness. The efficiency and robustness of our proposed algorithm are demonstrated by comparing it with some state-of-the-art methods on both synthetic and real data sets. Supplementary materials for this article are available online.
Keywords: Fast linear convergence; Network estimation; Semismooth Newton method; Sparse Jacobian
1 Introduction
Undirected graphical models have been especially popular for learning conditional independence structures among a large number of variables where the observations are drawn independently and identically from the same distribution. The Gaussian graphical model is one of the most widely used undirected graphical models. In the high-dimensional and low-sample-size setting, it is commonly assumed that the conditional independence structure, or equivalently the precision matrix, is sparse in a certain sense. In other words, its corresponding undirected graph is expected to be sparse. To promote sparsity, there has been a great deal of interest in using the $\ell_1$ norm penalty in statistical applications (Banerjee et al. 2008; Friedman et al. 2008; Rothman et al. 2008). In many conventional applications, a single Gaussian graphical model is typically enough to capture the conditional independence structure of the random variables. However, due to the heterogeneity or similarity of the data involved, it is increasingly appealing to fit a collection of such models jointly, such as inferring time-varying networks and finding the change-points (Ahmed and Xing 2009; Monti et al. 2014; Gibberd and Nelson 2017; Hallac et al. 2017; Yang and Peng 2018) and estimating multiple precision matrices simultaneously for variables from distinct but related classes (Guo et al. 2011; Danaher et al. 2014; Yang et al. 2015).
Multiple graphical models refer to the models that can estimate a collection of precision matrices jointly. Specifically, let $y^{(1)}, \dots, y^{(K)} \in \mathbb{R}^p$ be random vectors (from different classes or over a temporal grid) drawn independently from $K$ different distributions. Assume that the multivariate random variable $y^{(k)}$ has $n_k$ observations $y^{(k)}_1, \dots, y^{(k)}_{n_k}$, for each $k = 1, \dots, K$. Then the sample means are $\mu^{(k)} = \frac{1}{n_k}\sum_{i=1}^{n_k} y^{(k)}_i$ and the sample covariance matrices are $S^{(k)} = \frac{1}{n_k}\sum_{i=1}^{n_k} \big( y^{(k)}_i - \mu^{(k)} \big)\big( y^{(k)}_i - \mu^{(k)} \big)^{T}$, $k = 1, \dots, K$. The multiple graphical model for estimating the $K$ precision matrices jointly is the model with the variable $\Theta = (\Theta^{(1)}, \dots, \Theta^{(K)})$:

$$\min_{\Theta \in \mathbb{S}^{p,K}_{++}} \ \sum_{k=1}^{K} \Big( \langle S^{(k)}, \Theta^{(k)} \rangle - \log\det \Theta^{(k)} \Big) + \mathcal{P}(\Theta), \tag{1}$$

where $\mathcal{P}$ is a penalty function, which usually promotes sparsity in each $\Theta^{(k)}$ and similarities among different $\Theta^{(k)}$'s. Various penalties have been considered in the literature (Ahmed and Xing 2009; Guo et al. 2011; Danaher et al. 2014; Monti et al. 2014; Yang et al. 2015; Gibberd and Nelson 2017).
In this paper, we focus on the following fused graphical Lasso (FGL) regularizer, which was used by Ahmed and Xing (2009) and Yang et al. (2015):

$$\mathcal{P}(\Theta) := \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j} \big| \Theta^{(k)}_{ij} \big| + \lambda_2 \sum_{k=2}^{K} \sum_{i \neq j} \big| \Theta^{(k)}_{ij} - \Theta^{(k-1)}_{ij} \big|, \qquad \lambda_1, \lambda_2 > 0. \tag{2}$$

We refer to problem (1) with the FGL regularizer in (2) as the FGL problem. The FGL regularizer is in some sense a generalized fused Lasso regularizer (Tibshirani et al. 2005). It applies the $\ell_1$ penalty to all the off-diagonal elements of the precision matrices and to the consecutive differences of these elements across successive precision matrices. Many elements with the same indices in the estimated matrices will be close or even identical when the parameter $\lambda_2$ is large enough. Therefore, the FGL regularizer encourages not only a shared pattern of sparsity but also shared values across different graphs.
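To fix ideas, the following minimal NumPy sketch evaluates the FGL regularizer in (2) for a stack of $K$ symmetric matrices. The function name and the (K, p, p) array layout are our illustrative choices and are not part of the model itself.

```python
import numpy as np

def fgl_penalty(Theta, lam1, lam2):
    """Evaluate the FGL regularizer (2) for Theta of shape (K, p, p)."""
    K, p, _ = Theta.shape
    off = ~np.eye(p, dtype=bool)                               # off-diagonal mask
    l1_term = np.abs(Theta[:, off]).sum()                      # sparsity term
    fused_term = np.abs(np.diff(Theta, axis=0)[:, off]).sum()  # successive differences
    return lam1 * l1_term + lam2 * fused_term

# toy usage: three symmetric 4-by-4 matrices
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 4))
Theta = (A + A.transpose(0, 2, 1)) / 2                         # symmetrize
print(fgl_penalty(Theta, lam1=0.1, lam2=0.05))
```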
Existing algorithms for solving the FGL problem are quite limited in the literature. One of the most extensively used algorithms for solving this class of problems is the alternating direction method of multipliers (ADMM) (Danaher et al. 2014; Hallac et al. 2017; Gibberd and Nelson 2017). Besides, a proximal Newton-type method (Hsieh et al. 2011; Lee et al. 2014) was implemented by Yang et al. (2015) for solving the FGL problem. As is well known, ADMM can be a practical first order method for finding approximate solutions of low or moderate accuracy. However, ADMM hardly utilizes any second order information, which generally must be used in order to obtain highly accurate solutions. Although the proximal Newton-type method does incorporate some form of second order information, a complicated quadratic approximation problem has to be solved in each iteration, and this computation is usually time-consuming. It is worth mentioning that the regularizers are often introduced to promote certain structures in the estimated precision matrices, and the trade-off between biases and variances in the resulting estimators is controlled by the regularization parameters (Fan and Lv 2010). But in practice, it is extremely hard to find the optimal regularization parameters. Therefore, a sequence of regularization parameters is applied in practice, and consequently, a sequence of corresponding optimization problems must be solved (Fan and Tang 2013). Under such a circumstance, a highly efficient and robust algorithm for solving the FGL model becomes particularly important.
In this paper, we will design a semismooth Newton (SSN) based regularized proximal point algorithm (rPPA) for solving the FGL problem, which is inspired by Li et al. (2018b), where they have convincingly demonstrated the superior numerical performance of the SSN based augmented Lagrangian method (ALM), known as Ssnal, for solving the fused Lasso problem (Tibshirani et al. 2005). Thanks to the fact that the FGL problem has close connections to the fused Lasso problem, many of the virtues and theoretical insights of the Ssnal for solving the fused Lasso problem can be observed in our approach. However, we should emphasize that solving the FGL problem is much more challenging than solving the fused Lasso problem. Specifically, the difficulties are mainly due to the log-determinant function and the matrix variables, as described below.

- Unlike the simple quadratic functions in the fused Lasso problem, the log-determinant function is defined only on the cone of positive definite matrices. Therefore, the FGL model requires the positive definiteness of its solutions. This greatly increases the difficulty and complexity of the theoretical analysis and numerical implementation.
- Li et al. (2018b) constructed an efficiently computable element in the generalized Jacobian of the proximal mapping of the fused Lasso regularizer, which is an essential step for solving the fused Lasso problem. Based on their constructions, we can obtain an efficiently computable generalized Jacobian of the proximal mapping of the FGL regularizer. However, this process needs more complicated manipulations of coordinates for a collection of matrix variables, unlike the vector case of the fused Lasso problem.
The key issue in the implementation of the rPPA for solving the FGL model is the computation of the solution of the subproblem in each rPPA iteration. For this purpose, we will design an SSN method to solve those subproblems. We note that the numerical performance of the SSN method relies critically on the efficient calculation of the generalized Jacobian of the proximal mapping of the FGL regularizer and that of the log-determinant function. Fortunately, the generalized Jacobian of the proximal mapping of the FGL regularizer can be constructed efficiently based on that of the proximal mapping of the fused Lasso regularizer given by Li et al. (2018b). As a result, the generalized Jacobian of the proximal mapping of the FGL regularizer inherits the structured sparsity (referred to as second order sparsity) from that of the fused Lasso regularizer. Due to the structured sparsity, the computation of a matrix-vector product in the SSN method is reasonably cheap, and thus the SSN method is quite efficient for solving each subproblem. To summarize, it can be proven that our rPPA for solving the FGL problem has a linear convergence guarantee, and the convergence rate can be made arbitrarily fast by choosing a sufficiently large proximal penalty parameter. Moreover, the SSN method for solving each of the rPPA subproblems can be shown to be superlinearly convergent. Thus, based on these excellent convergence properties and the novel exploitation of the second order sparsity, we can expect the SSN based rPPA for solving the FGL problem to be highly efficient. Indeed, our numerical experiments have confirmed the high efficiency and robustness of the proposed algorithm for solving the FGL problems accurately.
The remainder of this paper is organized as follows. Section 2 presents some definitions and preliminary results. In Section 3, we present a semismooth Newton based regularized proximal point algorithm for solving the FGL problem and its convergence properties. The numerical performance of our proposed algorithm on time-varying stock price data sets and categorical text data sets is evaluated in Section 4. Section 5 gives the conclusion.
Notations. $\mathbb{S}^p_+$ ($\mathbb{S}^p_{++}$) denotes the cone of positive semidefinite (definite) matrices in the space of $p \times p$ real symmetric matrices $\mathbb{S}^p$. For any $X, Y \in \mathbb{S}^p$, we denote $X \succeq Y$ ($X \succ Y$) if $X - Y \in \mathbb{S}^p_+$ ($X - Y \in \mathbb{S}^p_{++}$). In particular, $X \succeq 0$ ($X \succ 0$) indicates $X \in \mathbb{S}^p_+$ ($X \in \mathbb{S}^p_{++}$). We let $\mathbb{S}^{p,K}_{+}$ and $\mathbb{S}^{p,K}$ be the Cartesian product of $K$ positive semidefinite cones $\mathbb{S}^p_+$ and that of $K$ spaces of symmetric matrices $\mathbb{S}^p$, respectively ($\mathbb{S}^{p,K}_{++}$ is defined analogously). $\mathbb{R}^n$ denotes the $n$-dimensional Euclidean space, and $\mathbb{R}^{m \times n}$ denotes the set of all $m \times n$ real matrices. For any $x, y \in \mathbb{R}^n$, $\langle x, y \rangle = x^{T} y$, $\|x\| = \sqrt{\langle x, x \rangle}$, and $\|x\|_1 = \sum_{i=1}^{n} |x_i|$. We use the Matlab notation $[A; B]$ to denote the matrix obtained by appending $B$ below the last row of $A$, when the number of columns of $A$ and $B$ is identical. For any matrix $X$, $X_{ij}$ denotes the $(i,j)$th element of $X$. For any $\Theta = (\Theta^{(1)}, \dots, \Theta^{(K)}) \in \mathbb{S}^{p,K}$, $\Theta_{ij}$ denotes the column vector obtained by taking out the $(i,j)$th elements across all matrices $\Theta^{(1)}, \dots, \Theta^{(K)}$, that is, $\Theta_{ij} = \big( \Theta^{(1)}_{ij}, \dots, \Theta^{(K)}_{ij} \big)^{T} \in \mathbb{R}^{K}$. $\mathrm{Diag}(X_1, \dots, X_n)$ denotes the block diagonal matrix whose $i$th diagonal block is the matrix $X_i$, $i = 1, \dots, n$. $I_p$ denotes the $p \times p$ identity matrix, and $\mathcal{I}$ denotes an identity matrix or map when the dimension is clear from the context. The function composition is denoted by $\circ$, that is, for any functions $f$ and $g$, $(f \circ g)(x) = f(g(x))$. The Hadamard product is denoted by $\odot$.
2 Preliminaries
Let $\mathcal{X}$ be a finite-dimensional real Hilbert space, and let $\theta: \mathcal{X} \to (-\infty, +\infty]$ be a proper and closed convex function. The Moreau-Yosida regularization (Moreau 1965; Yosida 1964) of $\theta$ is defined by

$$\varphi_{\theta}(x) := \min_{u \in \mathcal{X}} \Big\{ \theta(u) + \frac{1}{2} \|u - x\|^{2} \Big\}, \quad x \in \mathcal{X}. \tag{3}$$

The proximal mapping $\mathrm{Prox}_{\theta}(\cdot)$ associated with $\theta$ maps $x$ to the unique minimizer of (3), defined by

$$\mathrm{Prox}_{\theta}(x) := \arg\min_{u \in \mathcal{X}} \Big\{ \theta(u) + \frac{1}{2} \|u - x\|^{2} \Big\}, \quad x \in \mathcal{X}. \tag{4}$$

Moreover, $\varphi_{\theta}$ is a continuously differentiable convex function (Lemaréchal and Sagastizábal 1997; Rockafellar and Wets 2009), and its gradient is given by

$$\nabla \varphi_{\theta}(x) = x - \mathrm{Prox}_{\theta}(x), \quad x \in \mathcal{X}. \tag{5}$$
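As a concrete instance of (3)-(5), for $\theta = \lambda \|\cdot\|_1$ the proximal mapping (4) is the soft-thresholding operator, and the gradient formula (5) can be verified numerically by finite differences. The sketch below is purely illustrative.

```python
import numpy as np

lam = 0.7

def prox_l1(x):
    # proximal mapping (4) of theta = lam * ||.||_1: soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def envelope(x):
    # Moreau-Yosida regularization (3), evaluated at its minimizer prox_l1(x)
    u = prox_l1(x)
    return lam * np.abs(u).sum() + 0.5 * np.sum((u - x) ** 2)

x = np.array([1.5, -0.3, 0.05, -2.0])
grad_formula = x - prox_l1(x)                              # gradient formula (5)
eps = 1e-6
grad_fd = np.array([(envelope(x + eps * e) - envelope(x - eps * e)) / (2 * eps)
                    for e in np.eye(4)])
print(np.allclose(grad_formula, grad_fd, atol=1e-5))       # True
```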
For notational convenience, define $\theta_{\mathrm{ld}}: \mathbb{S}^p \to (-\infty, +\infty]$ by
$$\theta_{\mathrm{ld}}(X) := \begin{cases} -\log\det X, & X \succ 0, \\ +\infty, & \text{otherwise}. \end{cases}$$
Let $\nu > 0$ be given. Define two scalar functions as follows:
$$\phi^{+}_{\nu}(x) := \frac{\sqrt{x^{2} + 4\nu} + x}{2}, \qquad \phi^{-}_{\nu}(x) := \frac{\sqrt{x^{2} + 4\nu} - x}{2}, \qquad x \in \mathbb{R}.$$
In addition, the matrix counterparts of these two scalar functions can be defined by
$$\phi^{\pm}_{\nu}(X) := P\, \mathrm{Diag}\big( \phi^{\pm}_{\nu}(d_1), \dots, \phi^{\pm}_{\nu}(d_p) \big)\, P^{T}$$
for any $X \in \mathbb{S}^p$ with its eigenvalue decomposition $X = P\, \mathrm{Diag}(d_1, \dots, d_p)\, P^{T}$, where $P$ is orthogonal. It is easy to show that $\phi^{+}_{\nu}$ and $\phi^{-}_{\nu}$ are well-defined. Moreover, $\phi^{+}_{\nu}(X)$ and $\phi^{-}_{\nu}(X)$ are positive definite for any $X \in \mathbb{S}^p$.
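The matrix counterparts are computed through a single eigenvalue decomposition. The sketch below implements them and checks the identities recorded in Proposition 2.2 below; the function name is ours.

```python
import numpy as np

def phi_pm(X, nu):
    """Matrix counterparts of phi^{+/-}_nu(x) = (sqrt(x^2 + 4 nu) +/- x) / 2,
    applied through the eigenvalue decomposition X = P Diag(d) P^T."""
    d, P = np.linalg.eigh(X)
    r = np.sqrt(d ** 2 + 4.0 * nu)
    plus = P @ np.diag((r + d) / 2.0) @ P.T
    minus = P @ np.diag((r - d) / 2.0) @ P.T
    return plus, minus

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5)); X = (A + A.T) / 2
nu = 0.3
Xp, Xm = phi_pm(X, nu)
print(np.allclose(Xp - Xm, X))                    # X = phi+(X) - phi-(X)
print(np.allclose(Xp @ Xm, nu * np.eye(5)))       # phi+(X) phi-(X) = nu I
print(np.all(np.linalg.eigvalsh(Xp) > 0))         # phi+(X) is positive definite
```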
Proposition 2.1.
(Wang et al. 2010, Lemma 2.1(b)) The function $\phi^{+}_{\nu}$ is continuously differentiable, and its directional derivative at $X \in \mathbb{S}^p$ for any $H \in \mathbb{S}^p$ is given by
$$(\phi^{+}_{\nu})'(X)[H] = P \big( \Gamma \odot (P^{T} H P) \big) P^{T},$$
where $X$ admits the eigenvalue decomposition $X = P\, \mathrm{Diag}(d_1, \dots, d_p)\, P^{T}$, and $\Gamma \in \mathbb{S}^p$ is defined by
$$\Gamma_{ij} = \frac{\phi^{+}_{\nu}(d_i) + \phi^{+}_{\nu}(d_j)}{\sqrt{d_i^{2} + 4\nu} + \sqrt{d_j^{2} + 4\nu}}, \quad i, j = 1, \dots, p.$$
Proposition 2.2.
(Yang et al. 2013, Proposition 2.3) For any $X \in \mathbb{S}^p$ and $\nu > 0$, it holds that
$$\mathrm{Prox}_{\nu \theta_{\mathrm{ld}}}(X) = \phi^{+}_{\nu}(X), \qquad X = \phi^{+}_{\nu}(X) - \phi^{-}_{\nu}(X), \qquad \phi^{+}_{\nu}(X)\, \phi^{-}_{\nu}(X) = \nu I_p.$$
2.1 Surrogate Generalized Jacobian of $\mathrm{Prox}_{\mathcal{P}}$
In this section, we analyse the proximal mapping of the regularizer $\mathcal{P}$ defined by (2). For any $\Theta \in \mathbb{S}^{p,K}$, one might observe that the penalty term merely penalizes the off-diagonal elements, and it is the same fused Lasso regularizer that acts on each vector $\Theta_{ij}$, $i \neq j$. It holds that
$$\big( \mathrm{Prox}_{\mathcal{P}}(\Theta) \big)_{ij} = \begin{cases} \Theta_{ij}, & i = j, \\ \mathrm{Prox}_{f}(\Theta_{ij}), & i \neq j. \end{cases}$$
The function $f: \mathbb{R}^{K} \to \mathbb{R}$, $f(x) := \lambda_1 \|x\|_1 + \lambda_2 \|Bx\|_1$, is the fused Lasso regularizer, and the matrix $B \in \mathbb{R}^{(K-1) \times K}$ is defined by $(Bx)_k = x_{k+1} - x_k$, $k = 1, \dots, K-1$. The formula for the generalized Jacobian of $\mathrm{Prox}_{f}$ has been derived by Li et al. (2018b) and will be used in our subsequent algorithmic design. Define the surrogate generalized Jacobian $\mathcal{M}(\Theta)$ of $\mathrm{Prox}_{\mathcal{P}}$ at $\Theta$ as follows:

$$\mathcal{M}(\Theta) := \Big\{ \mathcal{V} : \mathbb{S}^{p,K} \to \mathbb{S}^{p,K} \ \Big|\ \big( \mathcal{V}[H] \big)_{ij} = \begin{cases} H_{ij}, & i = j, \\ V_{ij} H_{ij}, & i \neq j, \end{cases} \ V_{ij} = V_{ji} \in \mathcal{M}_{f}(\Theta_{ij}), \ H \in \mathbb{S}^{p,K} \Big\}, \tag{6}$$

where $\mathcal{M}_{f}(\cdot)$ is the surrogate generalized Jacobian of $\mathrm{Prox}_{f}$ (Li et al. 2018b, Equation 22), whose explicit form is given in the supplementary materials. From Theorem 1 by Li et al. (2018b), one can obtain the following theorem, which justifies why $\mathcal{M}(\Theta)$ in (6) can be used as the surrogate generalized Jacobian of $\mathrm{Prox}_{\mathcal{P}}$ at $\Theta$.
Theorem 2.1.
The surrogate generalized Jacobian $\mathcal{M}$ defined in (6) is a nonempty compact valued, upper semicontinuous multifunction. Given any $\Theta \in \mathbb{S}^{p,K}$, any element in the set $\mathcal{M}(\Theta)$ is self-adjoint and positive semidefinite. Moreover, there exists a neighborhood $\mathcal{U}$ of $\Theta$ such that for all $\Theta' \in \mathcal{U}$,
$$\mathrm{Prox}_{\mathcal{P}}(\Theta') - \mathrm{Prox}_{\mathcal{P}}(\Theta) - \mathcal{V}[\Theta' - \Theta] = 0, \quad \forall\, \mathcal{V} \in \mathcal{M}(\Theta').$$
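The coordinate-wise structure of $\mathrm{Prox}_{\mathcal{P}}$ can be sketched as follows. The helper `prox_fl_1d` is a hypothetical stand-in: for $\lambda_2 = 0$ the one-dimensional fused Lasso prox reduces to plain soft-thresholding, which is all we implement here; a full implementation would compose soft-thresholding with a total-variation prox, as in Lemma 1.1 of the supplementary materials.

```python
import numpy as np

def prox_fl_1d(v, lam1, lam2):
    # Stand-in for the 1D fused-lasso prox on R^K.  For lam2 = 0 it is
    # soft-thresholding; in general, compose soft-thresholding with a
    # total-variation prox (see Lemma 1.1 in the supplementary materials).
    assert lam2 == 0.0, "this stand-in only covers lam2 = 0"
    return np.sign(v) * np.maximum(np.abs(v) - lam1, 0.0)

def prox_fgl(Theta, lam1, lam2):
    """Prox of the FGL regularizer: identity on diagonal entries, the same
    K-dimensional fused-lasso prox on every off-diagonal vector Theta_ij."""
    K, p, _ = Theta.shape
    out = Theta.copy()
    for i in range(p):
        for j in range(i + 1, p):
            v = prox_fl_1d(Theta[:, i, j], lam1, lam2)
            out[:, i, j] = v
            out[:, j, i] = v                      # preserve symmetry
    return out

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5, 5)); Theta = (A + A.transpose(0, 2, 1)) / 2
out = prox_fgl(Theta, 0.5, 0.0)
print(np.allclose(np.diagonal(out, axis1=1, axis2=2),
                  np.diagonal(Theta, axis1=1, axis2=2)))   # diagonals untouched
```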
2.2 Lipschitz Continuity of the Solution Mapping
By introducing an auxiliary variable $\Omega \in \mathbb{S}^{p,K}$, we can rewrite problem (1) equivalently as

$$\min_{\Theta,\, \Omega \in \mathbb{S}^{p,K}} \ \Big\{ \sum_{k=1}^{K} \big( \langle S^{(k)}, \Theta^{(k)} \rangle - \log\det \Theta^{(k)} \big) + \mathcal{P}(\Omega) \ \Big|\ \Theta - \Omega = 0 \Big\}. \tag{7}$$

The Lagrangian function of the above problem is given by
$$l(\Theta, \Omega; \Lambda) := \sum_{k=1}^{K} \big( \langle S^{(k)}, \Theta^{(k)} \rangle - \log\det \Theta^{(k)} \big) + \mathcal{P}(\Omega) + \langle \Lambda, \Theta - \Omega \rangle.$$
The dual problem of (7) takes the following form (Borwein and Lewis 2010, Theorem 3.3.5):

$$\max_{\Lambda \in \mathbb{S}^{p,K}} \ \Big\{ \sum_{k=1}^{K} \Big( \log\det \big( S^{(k)} + \Lambda^{(k)} \big) + p \Big) - \mathcal{P}^{*}(\Lambda) \Big\}, \tag{8}$$

where $\mathcal{P}^{*}$ denotes the conjugate function of $\mathcal{P}$ and $S^{(k)} + \Lambda^{(k)} \succ 0$, $k = 1, \dots, K$, is required implicitly. The Karush-Kuhn-Tucker (KKT) optimality conditions (Han et al. 2018) for (7) are given as follows:

$$\big( S^{(k)} + \Lambda^{(k)} \big) \Theta^{(k)} = I_p, \ k = 1, \dots, K; \qquad \Omega - \mathrm{Prox}_{\mathcal{P}}(\Omega + \Lambda) = 0; \qquad \Theta - \Omega = 0. \tag{9}$$
We make the following assumption on the existence of solutions to the KKT system.
Assumption 2.1.
The solution set to the KKT system (9) is nonempty.
Define an operator $\mathcal{T}_{l}$ by
$$\mathcal{T}_{l}(\Theta, \Omega, \Lambda) := \big\{ (\Theta', \Omega', \Lambda') \ \big|\ (\Theta', \Omega', -\Lambda') \in \partial l(\Theta, \Omega, \Lambda) \big\}.$$
Since the function $-\log\det(\cdot)$ is strictly convex, under Assumption 2.1, we can see that the KKT system (9) has a unique KKT point, denoted by $(\bar{\Theta}, \bar{\Omega}, \bar{\Lambda})$, and $\mathcal{T}_{l}^{-1}(0) = \{ (\bar{\Theta}, \bar{\Omega}, \bar{\Lambda}) \}$.
Proposition 2.3.
There exists a nonnegative scalar $\kappa$ such that for some $r > 0$ it holds that
$$\big\| (\Theta, \Omega, \Lambda) - (\bar{\Theta}, \bar{\Omega}, \bar{\Lambda}) \big\| \leq \kappa \| W \| \quad \text{whenever } (\Theta, \Omega, \Lambda) \in \mathcal{T}_{l}^{-1}(W), \ \| W \| \leq r,$$
that is, $\mathcal{T}_{l}^{-1}$ is Lipschitz continuous at the origin with modulus $\kappa$.
Proof.
Note that the regularizer $\mathcal{P}$ defined by (2) is a positively homogeneous function. Therefore, it follows from Example 11.4(a) by Rockafellar and Wets (2009) that the conjugate function $\mathcal{P}^{*}$ is an indicator function of a nonempty convex polyhedral set. This, together with Theorem 2.7 by Li et al. (2018a) and Proposition 6 by Cui et al. (2018), proves the required result. ∎
3 Regularized Proximal Point Algorithm
In this section, we present a regularized proximal point algorithm (rPPA) for solving the problem (7) with the FGL regularizer $\mathcal{P}$ defined by (2). Given a sequence of positive scalars $\{\sigma_j\}_{j \geq 0}$, the $j$th iteration of the PPA for solving (7) is given by

$$(\Theta^{j+1}, \Omega^{j+1}) \approx \arg\min_{(\Theta, \Omega)} \Big\{ F(\Theta, \Omega) + \frac{1}{2\sigma_j} \big( \|\Theta - \Theta^{j}\|^{2} + \|\Omega - \Omega^{j}\|^{2} \big) \Big\}, \tag{10}$$

where $\sigma_j > 0$ and $F$ is the objective function of the problem (7), with the constraint $\Theta - \Omega = 0$ incorporated through its indicator function.
There are many ways to solve (10). Inspired by recent progress in solving large scale convex optimization problems (Yang et al. 2013; Li et al. 2018a, b; Zhang et al. 2019), we shall adopt the approach of solving (10) via employing a sparse SSN method on its dual. The dual of (10) takes the following form:

$$\max_{\Lambda \in \mathbb{S}^{p,K}} \ \Psi_j(\Lambda) := \inf_{\Theta} \Big\{ \sum_{k=1}^{K} \big( \langle S^{(k)}, \Theta^{(k)} \rangle - \log\det \Theta^{(k)} \big) + \langle \Lambda, \Theta \rangle + \frac{1}{2\sigma_j} \|\Theta - \Theta^{j}\|^{2} \Big\} + \inf_{\Omega} \Big\{ \mathcal{P}(\Omega) - \langle \Lambda, \Omega \rangle + \frac{1}{2\sigma_j} \|\Omega - \Omega^{j}\|^{2} \Big\}.$$

By the definition of the Moreau-Yosida regularization (3), we can write $\Psi_j$ explicitly as follows:

$$\Psi_j(\Lambda) = \sum_{k=1}^{K} \Big( \frac{1}{\sigma_j} \varphi_{\sigma_j \theta_{\mathrm{ld}}} \big( \Theta^{j,(k)} - \sigma_j (S^{(k)} + \Lambda^{(k)}) \big) + \langle S^{(k)} + \Lambda^{(k)}, \Theta^{j,(k)} \rangle - \frac{\sigma_j}{2} \| S^{(k)} + \Lambda^{(k)} \|^{2} \Big) + \frac{1}{\sigma_j} \varphi_{\sigma_j \mathcal{P}} \big( \Omega^{j} + \sigma_j \Lambda \big) - \langle \Lambda, \Omega^{j} \rangle - \frac{\sigma_j}{2} \|\Lambda\|^{2}.$$

Therefore, by Proposition 2.2 and the definition of the proximal mapping (4), the $j$th iteration of the PPA (10) can be written as

$$\Theta^{j+1,(k)} = \phi^{+}_{\sigma_j} \big( \Theta^{j,(k)} - \sigma_j (S^{(k)} + \Lambda^{j+1,(k)}) \big), \ k = 1, \dots, K, \qquad \Omega^{j+1} = \mathrm{Prox}_{\sigma_j \mathcal{P}} \big( \Omega^{j} + \sigma_j \Lambda^{j+1} \big),$$

where $\Lambda^{j+1}$ approximately solves the problem $\max_{\Lambda \in \mathbb{S}^{p,K}} \Psi_j(\Lambda)$. Since the objective function of the dual problem (8) is not strongly concave in general, we consider the following rPPA, whose dual subproblems are strongly concave thanks to the proximal terms in (10).
Algorithm 1 (rPPA). Choose $(\Theta^{0}, \Omega^{0}, \Lambda^{0}) \in \mathbb{S}^{p,K}_{++} \times \mathbb{S}^{p,K} \times \mathbb{S}^{p,K}$ and $\sigma_0 > 0$. Iterate the following steps for $j = 0, 1, 2, \dots$:
1. Compute
$$\Lambda^{j+1} \approx \arg\max_{\Lambda \in \mathbb{S}^{p,K}} \Psi_j(\Lambda). \tag{11}$$
2. Compute $\Omega^{j+1} = \mathrm{Prox}_{\sigma_j \mathcal{P}}(\Omega^{j} + \sigma_j \Lambda^{j+1})$ and, for $k = 1, \dots, K$,
$$\Theta^{j+1,(k)} = \phi^{+}_{\sigma_j} \big( \Theta^{j,(k)} - \sigma_j (S^{(k)} + \Lambda^{j+1,(k)}) \big).$$
3. Update $\sigma_{j+1}$.
Since in practice the inner subproblem (11) can only be solved inexactly, we will use the following standard stopping criteria studied by Rockafellar (1976):
$$\text{(A)} \quad \sup \Psi_j - \Psi_j(\Lambda^{j+1}) \leq \frac{\varepsilon_j^{2}}{2\sigma_j}, \quad \varepsilon_j \geq 0, \ \sum_{j=0}^{\infty} \varepsilon_j < \infty;$$
$$\text{(B)} \quad \sup \Psi_j - \Psi_j(\Lambda^{j+1}) \leq \frac{\delta_j^{2}}{2\sigma_j} \big\| (\Theta^{j+1}, \Omega^{j+1}) - (\Theta^{j}, \Omega^{j}) \big\|^{2}, \quad \delta_j \geq 0, \ \sum_{j=0}^{\infty} \delta_j < \infty.$$
The reason for using the above stopping criteria is the fact that Algorithm 1 is equivalent to the primal-dual PPA in the sense of Rockafellar (1976). Moreover, we have the following convergence results.
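For intuition, an exact proximal point iteration is very simple once the proximal mapping of the objective is available; the toy sketch below runs it on $F(x) = \|x\|_1$, whose prox is soft-thresholding. In Algorithm 1, of course, each subproblem (11) is solved only inexactly by the SSN method of the next subsection.

```python
import numpy as np

def ppa(prox_F, x0, sigmas):
    # generic PPA: x^{j+1} = argmin_x { F(x) + ||x - x^j||^2 / (2 sigma_j) }
    x = x0
    for sigma in sigmas:
        x = prox_F(x, sigma)          # exact subproblem solve
    return x

prox_l1 = lambda x, s: np.sign(x) * np.maximum(np.abs(x) - s, 0.0)
x = ppa(prox_l1, np.array([3.0, -1.2, 0.4]), sigmas=[1.0] * 5)
print(x)                              # converges to 0, the minimizer of ||x||_1
```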
Theorem 3.1.
Let $\{(\Theta^{j}, \Omega^{j}, \Lambda^{j})\}$ be an infinite sequence generated by Algorithm 1 under stopping criterion (A). Then the sequence $\{(\Theta^{j}, \Omega^{j})\}$ converges to the unique solution $(\bar{\Theta}, \bar{\Omega})$ of (7), and the sequence $\{\Lambda^{j}\}$ converges to the unique solution $\bar{\Lambda}$ of (8). Furthermore, if the criterion (B) is also executed in Algorithm 1, there exists $\bar{j} \geq 0$ such that for all $j \geq \bar{j}$,
$$\big\| (\Theta^{j+1}, \Omega^{j+1}, \Lambda^{j+1}) - (\bar{\Theta}, \bar{\Omega}, \bar{\Lambda}) \big\| \leq \rho_j \big\| (\Theta^{j}, \Omega^{j}, \Lambda^{j}) - (\bar{\Theta}, \bar{\Omega}, \bar{\Lambda}) \big\|,$$
where the convergence rate
$$\rho_j = \frac{\kappa (\kappa^{2} + \sigma_j^{2})^{-1/2} + \delta_j}{1 - \delta_j} \ \longrightarrow\ \kappa (\kappa^{2} + \sigma_{\infty}^{2})^{-1/2} < 1 \quad \text{with } \sigma_j \uparrow \sigma_{\infty} \leq \infty,$$
and the parameter $\kappa$ is from Proposition 2.3. In particular, the rate tends to zero if $\sigma_{\infty} = \infty$.
Proof.
The results follow from Theorems 1 and 2 by Rockafellar (1976) together with Assumption 2.1 and Proposition 2.3. ∎
3.1 Semismooth Newton Method for Solving Subproblem (11)
From (5), Proposition 2.2, and Theorem 31.5 by Rockafellar (2015), we know that $\Psi_j$ is a continuously differentiable, strongly concave function and
$$\nabla \Psi_j(\Lambda) = \Theta(\Lambda) - \Omega(\Lambda), \quad \Lambda \in \mathbb{S}^{p,K},$$
where $\Theta(\Lambda)^{(k)} := \phi^{+}_{\sigma_j} \big( \Theta^{j,(k)} - \sigma_j (S^{(k)} + \Lambda^{(k)}) \big)$, $k = 1, \dots, K$, and $\Omega(\Lambda) := \mathrm{Prox}_{\sigma_j \mathcal{P}}(\Omega^{j} + \sigma_j \Lambda)$. Therefore, one can obtain the unique solution to problem (11) by solving the nonsmooth system

$$\nabla \Psi_j(\Lambda) = 0. \tag{12}$$

Recall that $\phi^{+}_{\sigma_j}$ is differentiable and its derivative is given by Proposition 2.1. Thus, the surrogate generalized Jacobian of $\nabla \Psi_j$ at $\Lambda$ is defined as follows:

$$\widehat{\partial} (\nabla \Psi_j)(\Lambda) := \Big\{ -\sigma_j \big( \mathcal{D}_{\Lambda} + \mathcal{V} \big) \ \Big|\ \mathcal{V} \in \mathcal{M} \big( \Omega^{j} + \sigma_j \Lambda \big) \Big\}, \tag{13}$$

where $\mathcal{D}_{\Lambda} : \mathbb{S}^{p,K} \to \mathbb{S}^{p,K}$ is the linear operator defined by $\mathcal{D}_{\Lambda}[H]^{(k)} := (\phi^{+}_{\sigma_j})' \big( \Theta^{j,(k)} - \sigma_j (S^{(k)} + \Lambda^{(k)}) \big) [H^{(k)}]$, $k = 1, \dots, K$, and $\mathcal{M}$ is given by (6).
With the surrogate generalized Jacobian (13) of $\nabla \Psi_j$, we are ready to solve equation (12) by the SSN method, where the Newton systems are solved inexactly by the conjugate gradient (CG) method.
Algorithm 2 (SSN for (11)). Given $\mu \in (0, 1/2)$, $\bar{\eta} \in (0, 1)$, $\tau \in (0, 1]$, and $\delta \in (0, 1)$. Choose $\Lambda^{0} \in \mathbb{S}^{p,K}$. Iterate the following steps for $m = 0, 1, 2, \dots$:
1. (Newton direction) Choose one specific map $\mathcal{W}_m \in \widehat{\partial}(\nabla \Psi_j)(\Lambda^{m})$. Apply the CG method to find an approximate solution $D^{m}$ to
$$\mathcal{W}_m[D] = -\nabla \Psi_j(\Lambda^{m})$$
such that
$$\| \mathcal{W}_m[D^{m}] + \nabla \Psi_j(\Lambda^{m}) \| \leq \min\big( \bar{\eta}, \|\nabla \Psi_j(\Lambda^{m})\|^{1+\tau} \big).$$
2. (Line search) Set $\alpha_m = \delta^{i_m}$, where $i_m$ is the smallest nonnegative integer $i$ for which
$$\Psi_j(\Lambda^{m} + \delta^{i} D^{m}) \geq \Psi_j(\Lambda^{m}) + \mu \delta^{i} \langle \nabla \Psi_j(\Lambda^{m}), D^{m} \rangle.$$
3. Set $\Lambda^{m+1} = \Lambda^{m} + \alpha_m D^{m}$.
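To illustrate the mechanics of a semismooth Newton iteration on a small self-contained problem, the sketch below solves the prox-based optimality system of an $\ell_1$-regularized quadratic program. This is a generic illustration under our own simplifications (a dense Newton solve instead of CG, simple backtracking on the residual norm), not a transcription of Algorithm 2.

```python
import numpy as np

# Solve F(x) = x - prox_{lam||.||_1}(x - (Qx + c)) = 0, the optimality
# system of min 0.5 x'Qx + c'x + lam ||x||_1 with Q positive definite.
rng = np.random.default_rng(2)
n = 8
A = rng.standard_normal((n, n)); Q = A @ A.T / n + np.eye(n)
c = rng.standard_normal(n); lam = 0.5

prox = lambda z: np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
F = lambda x: x - prox(x - (Q @ x + c))

x = np.zeros(n)
for it in range(40):
    r = F(x)
    if np.linalg.norm(r) < 1e-10:
        break
    z = x - (Q @ x + c)
    D = (np.abs(z) > lam).astype(float)              # Clarke Jacobian of prox: Diag(D)
    V = np.eye(n) - D[:, None] * (np.eye(n) - Q)     # generalized Jacobian of F
    d = np.linalg.solve(V, -r)                       # Newton system (CG in the paper)
    t = 1.0                                          # backtracking on ||F||
    while np.linalg.norm(F(x + t * d)) > (1 - 1e-4 * t) * np.linalg.norm(r) and t > 1e-8:
        t *= 0.5
    x = x + t * d
print(it, np.linalg.norm(F(x)))                      # a handful of iterations suffice
```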
Next, we derive the convergence result of the SSN method (Algorithm 2).
Theorem 3.2.
Let $\{\Lambda^{m}\}$ be the sequence generated by Algorithm 2. Then $\{\Lambda^{m}\}$ converges to the unique solution $\bar{\Lambda}^{j+1}$ of problem (11), and the convergence rate is at least superlinear:
$$\| \Lambda^{m+1} - \bar{\Lambda}^{j+1} \| = O\big( \| \Lambda^{m} - \bar{\Lambda}^{j+1} \|^{1+\tau} \big).$$
Proof.
Since the proximal mapping $\mathrm{Prox}_{f}$ is piecewise linear and Lipschitz continuous, we know from Theorem 7.5.17 by Facchinei and Pang (2007) that $\mathrm{Prox}_{f}$ is directionally differentiable. This, together with Theorem 2.1, implies that $\mathrm{Prox}_{\mathcal{P}}$ is strongly semismooth with respect to the multifunction $\mathcal{M}$ (for its definition, see, e.g., Definition 1 by Li et al. (2018b)). Therefore, the conclusion follows from the strong concavity of $\Psi_j$, Proposition 2.1, and Proposition 7 & Theorem 3 by Li et al. (2018b). ∎
4 Numerical Experiments
In this section, we compare the performance of our algorithm rPPA with the alternating direction method of multipliers (ADMM) and the proximal Newtontype method implemented by Yang et al. (2015) (referred to as MGL here) for which the solver is available at http://senyang.info/.
The following paragraph describes the measurement of the accuracy of an approximate optimal solution and the stopping criteria of the three methods. Since both rPPA and ADMM generate primal and dual approximate solutions, we can assess the accuracy of their solutions by the relative KKT residuals. Unlike these primal-dual methods, MGL merely gives the primal solution, and the KKT residual of a solution generated by MGL is not available. Instead, we measure the relative error of the objective value obtained by MGL with respect to that computed by rPPA. Based on the KKT optimality condition (9), the accuracy of an approximate optimal solution $(\Theta, \Omega, \Lambda)$ generated by rPPA (Algorithm 1) is measured by defining the following relative residual:
$$\eta_{\mathrm{P}} := \max \Big\{ \frac{\|\Theta - \Omega\|}{1 + \|\Theta\|}, \ \frac{\| \Omega - \mathrm{Prox}_{\mathcal{P}}(\Omega + \Lambda) \|}{1 + \|\Omega\|}, \ \max_{1 \leq k \leq K} \frac{\| (S^{(k)} + \Lambda^{(k)}) \Theta^{(k)} - I_p \|}{1 + \|\Theta^{(k)}\|} \Big\}.$$
Likewise, the accuracy of an approximate optimal solution generated by ADMM is measured by the relative KKT residual $\eta_{\mathrm{A}}$ (defined in the supplementary material) that is analogous to $\eta_{\mathrm{P}}$.
In our numerical experiments, we terminate rPPA if it satisfies the condition $\eta_{\mathrm{P}} \leq \epsilon$ for a given accuracy tolerance $\epsilon$; similarly for ADMM with the stopping condition $\eta_{\mathrm{A}} \leq \epsilon$. Note that the terminating condition for MGL is different. Let $\mathrm{obj}_{\mathrm{P}}$ and $\mathrm{obj}_{\mathrm{M}}$ be the primal objective function values computed by rPPA and MGL, respectively. MGL will be terminated when the relative difference of its objective value with respect to that obtained by rPPA is smaller than the given tolerance $\epsilon$, i.e.,

$$\eta_{\mathrm{M}} := \frac{ | \mathrm{obj}_{\mathrm{M}} - \mathrm{obj}_{\mathrm{P}} | }{ 1 + | \mathrm{obj}_{\mathrm{P}} | } \leq \epsilon. \tag{14}$$
We adopt a warm-start strategy to initialize rPPA. That is, we first run ADMM (with identity matrices as the starting point) for a fixed number of iterations to generate a good initial point to warm-start rPPA. We also stop ADMM as soon as the relative KKT residual of the computed iterate falls below a prescribed (looser) threshold. Note that such a warm-starting strategy is sound since in the initial phase of rPPA, where the iterates are not close to the optimal solution (as measured by the associated relative KKT residual), it is computationally wasteful to use the more expensive rPPA iteration when the fast local linear convergence behavior of the algorithm has yet to kick in. Under such a scenario, one would naturally use cheaper iterations such as those of ADMM to generate approximate solution points until the relative KKT residual has been sufficiently reduced.
For the tuning parameters $\lambda_1$ and $\lambda_2$, for each test instance we select three pairs that lead to reasonable sparsity. In the following tables, "P" stands for rPPA; "A" stands for ADMM; "M" stands for MGL; "nnz" denotes the number of nonzero entries in the solution $\Theta$ obtained by rPPA using the estimation
$$\mathrm{nnz} := \min \Big\{ k \ \Big|\ \sum_{i=1}^{k} |\hat{x}_i| \geq 0.999 \sum_{i} |\hat{x}_i| \Big\},$$
where $\hat{x}$ is the vector obtained via sorting all elements in $\Theta$ by magnitude in a descending order; "density" denotes the quantity $\mathrm{nnz}/(K p^{2})$. The time is displayed in the format of "hours:minutes:seconds", and the fastest method in terms of running time can be read off from the Time columns. The errors presented in the tables are the relative KKT residuals $\eta_{\mathrm{P}}$ for rPPA and $\eta_{\mathrm{A}}$ for ADMM, while the error for MGL is $\eta_{\mathrm{M}}$ in (14).
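A sketch of the nnz estimation rule as stated above; the 0.999 truncation level follows the convention used by Li et al. (2018a, b) and should be treated as an assumption here.

```python
import numpy as np

def estimate_nnz(x, level=0.999):
    # smallest k such that the k largest magnitudes capture `level` of ||x||_1
    s = np.sort(np.abs(np.asarray(x).ravel()))[::-1]
    return int(np.searchsorted(np.cumsum(s), level * s.sum()) + 1)

print(estimate_nnz([5.0, -2.0, 1e-8, 3e-9]))   # 2: the two tiny entries are discarded
```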
4.1 Nearest-neighbour Networks
In this section, we assess the effectiveness of the FGL model on a simulated network: the nearest-neighbour network. The nearest-neighbour network is generated by modifying the data generation mechanism described by Li and Gui (2006). We set $K = 3$ in our simulation. For each $k = 1, 2, 3$, we generate 10,000 independently and identically distributed observations from a multivariate Gaussian distribution $\mathcal{N}\big( 0, (\Theta^{(k)})^{-1} \big)$, where $\Theta^{(k)}$ is the precision matrix of the $k$th class. The details of the generation of $\Theta^{(k)}$ are as follows (a code sketch is given after this paragraph). First of all, $p$ points are randomly generated on a unit square, their pairwise distances are calculated, and the $m$ nearest neighbours of each point in terms of distance are found. The nearest-neighbour network is obtained by linking any two points that are nearest neighbours of each other. The integer $m$ controls the degree of sparsity of the network and is fixed in our simulation. Subsequently, we add heterogeneity to the common structure by further creating individual links as follows: for each $k$, a pair of symmetric zero elements is randomly selected and replaced with a value uniformly drawn from a prescribed interval. This procedure is repeated a fixed number of times proportional to $M$ (rounded up with ceil, the nearest integer greater than or equal to the argument), where $M$ is the number of edges in the nearest-neighbour graph. The true number of edges in the three networks is thus determined by this construction.
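The following sketch mirrors the mutual-nearest-neighbour construction of the common structure just described; the edge magnitudes and the diagonal loading used to enforce positive definiteness are our own choices rather than the paper's exact values.

```python
import numpy as np

def nearest_neighbour_precision(p, m, rng):
    """Link two points iff each is among the other's m nearest neighbours on
    the unit square; return a sparse symmetric positive definite precision matrix."""
    pts = rng.random((p, 2))
    D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    nbrs = np.argsort(D, axis=1)[:, :m]                  # m nearest neighbours
    adj = np.zeros((p, p), dtype=bool)
    for i in range(p):
        adj[i, nbrs[i]] = True
    adj &= adj.T                                         # keep mutual neighbours only
    vals = rng.uniform(0.4, 0.8, (p, p)) * rng.choice([-1.0, 1.0], (p, p))
    Theta = np.where(adj, vals, 0.0)
    Theta = (Theta + Theta.T) / 2
    np.fill_diagonal(Theta, np.abs(Theta).sum(axis=1) + 0.5)   # diagonal dominance
    return Theta

Theta0 = nearest_neighbour_precision(p=50, m=5, rng=np.random.default_rng(3))
print(np.all(np.linalg.eigvalsh(Theta0) > 0))            # True: valid precision matrix
```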
There is a pair of tuning parameters $\lambda_1$ and $\lambda_2$ which must be specified. In the FGL model, $\lambda_1$ drives sparsity and $\lambda_2$ drives similarity, and we say that $\lambda_1$ and $\lambda_2$ are the sparsity and similarity control parameters, respectively. In order to show the diversity of sparsity in our experiments, we choose a series of values of $\lambda_1$ for the FGL model with $\lambda_2$ fixed. Figure 1 shows the relative ability of the FGL model to recover the network structures and to detect the change-points.
Figure 1(a) displays the number of true positive edges selected (i.e., TP edges) against the number of false positive edges selected (i.e., FP edges) for the FGL model. We say that an edge $(i, j)$ in the $k$th network is selected in the estimate $\hat{\Theta}^{(k)}$ if $\hat{\Theta}^{(k)}_{ij} \neq 0$, and we say that the edge is true in the precision matrix $\Theta^{(k)}$ if $\Theta^{(k)}_{ij} \neq 0$ and false if $\Theta^{(k)}_{ij} = 0$. We can see from the figure that the FGL model with a suitably chosen $\lambda_2$ can recover almost all of the true positive edges without false positive edges. Figure 1(a) also shows that for the FGL model the smaller similarity control parameter is much better than the larger one in terms of the ability of true edge detection. With the larger $\lambda_2$, the FGL model can merely detect about 3000 true positive edges while the number of false positive edges increases to over 600. One possible reason is that this $\lambda_2$ is too large compared with the underlying optimal one in this case.
Figure 1(b) illustrates the sum of squared errors between estimated edge values and true edge values, i.e., $\sum_{k=1}^{K} \sum_{i \neq j} \big( \hat{\Theta}^{(k)}_{ij} - \Theta^{(k)}_{ij} \big)^{2}$. When the number of total edges selected is increasing (i.e., the sparsity control parameter $\lambda_1$ is decreasing), the error is decreasing and finally reaches a fairly low value.
Figure 1(c) plots the number of true positive differential edges against false positive differential edges. A differential edge is an edge that differs between classes and thus corresponds to a change-point. We say that the edge $(i, j)$ is estimated to be differential between the $k$th and the $(k+1)$th networks if $\hat{\Theta}^{(k)}_{ij} \neq \hat{\Theta}^{(k+1)}_{ij}$, and we say that it is truly differential if $\Theta^{(k)}_{ij} \neq \Theta^{(k+1)}_{ij}$. The number of differential edges is computed for all successive pairs of networks. The best point in Figure 1(c) is the red one, which has approximately 2700 true positive differential edges and almost no false positive ones. We can also see from Figure 1(c) that all the blue points have no false positive differential edges but only small numbers of true positive differential edges. This might be caused by the larger similarity control parameter, which forces an excessive number of edges across networks to be similar.
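Under the edge-selection definitions used in this section, the TP/FP counts plotted in Figure 1 can be computed as in the following sketch; the tolerance for declaring an entry nonzero is our choice.

```python
import numpy as np

def edge_counts(Theta_hat, Theta_true, tol=1e-8):
    """Count true/false positive edges for K estimated precision matrices
    (shape (K, p, p)) against the truth, over off-diagonal entries."""
    K, p, _ = Theta_hat.shape
    off = ~np.eye(p, dtype=bool)
    selected = np.abs(Theta_hat[:, off]) > tol
    true_edge = np.abs(Theta_true[:, off]) > tol
    tp = int(np.sum(selected & true_edge)) // 2          # symmetric: each edge twice
    fp = int(np.sum(selected & ~true_edge)) // 2
    return tp, fp
```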
4.2 Standard & Poor’s 500 Stocks
In this section, we compare rPPA, ADMM, and MGL on the Standard & Poor's 500 stock price data sets. The stock price data sets contain daily returns of 500 stocks over a long period and can be downloaded from the link www.yahoo.com. The dependency structures of different stocks vary over time, but it appears that the dependency networks change smoothly over time. Therefore, the FGL model might be able to find the interactions among these stocks and how they evolve over time.
We first consider a relatively short three-year time period from January 2004 to December 2006. During this period, there are in total 755 daily returns of 370 stocks. We call this data set SPX3a. For each year, it contains approximately 250 daily returns of each stock. Considering the limited number of observations in each year and the interpretation of the results, we choose to analyse random smaller subsets of all involved stocks, whose sizes are chosen to be $p = 100$ and $p = 200$, over $K = 3$ yearly periods.
In addition to the above data set over three years, a relatively long period from January 2004 to December 2014 is also considered in the experiments, which is referred to as SPX11b. Since this time period is longer than the previous one, the number of stocks becomes smaller as some stocks might disappear. During the 11-year time period, there are 2769 daily closing prices of 272 stocks. We can set a relatively large parameter $K = 11$ according to the 11 years from January 2004 to December 2014. Again, we choose to analyse two random subsets of all existing stocks, of which the sizes are selected to be $p = 100$ and $p = 200$.
Table 1. Comparison of rPPA (P), ADMM (A), and MGL (M) on the stock price data sets; problems are labeled by data set and $(p, K)$.

Problem          (λ1, λ2)        Density   Iteration: P / A / M   Time: P / A / M             Error: P / A / M
SPX3a (100,3)    (1e-04,1e-05)   0.039     25 / 3701 / 6          06 / 33 / 33                6.0e-07 / 9.5e-07 / 1.1e-06
                 (5e-05,5e-06)   0.144     24 / 3701 / 9          08 / 33 / 43                9.9e-07 / 9.5e-07 / 4.0e-06
                 (2e-05,2e-06)   0.241     26 / 5359 / 19         11 / 49 / 09                7.0e-07 / 1.0e-06 / 1.2e-05
SPX3a (200,3)    (1e-04,1e-05)   0.025     24 / 3301 / 9          15 / 01:14 / 01:26          8.1e-07 / 8.2e-07 / 2.3e-06
                 (5e-05,5e-06)   0.086     24 / 3301 / 17         22 / 01:15 / 03:22          6.8e-07 / 8.3e-07 / 4.7e-06
                 (2e-05,2e-06)   0.150     26 / 5920 / 44         30 / 02:17 / 03:59          6.4e-07 / 1.0e-06 / 8.0e-06
SPX11b (100,11)  (5e-04,5e-05)   0.028     24 / 3701 / 8          18 / 01:23 / 02:40          9.6e-07 / 1.0e-06 / 3.2e-06
                 (1e-04,1e-05)   0.126     24 / 3701 / 111        27 / 01:55 / 04:59          9.3e-07 / 1.0e-06 / 4.9e-06
                 (5e-05,5e-06)   0.206     24 / 3710 / 388        30 / 01:59 / 13:16          9.4e-07 / 1.0e-06 / 5.6e-06
SPX11b (200,11)  (5e-04,5e-05)   0.017     22 / 3501 / 28         54 / 03:58 / 46:47          6.2e-07 / 9.9e-07 / 1.7e-06
                 (1e-04,1e-05)   0.081     24 / 3601 / 477        01:19 / 05:13 / 01:34:40    9.5e-07 / 8.8e-07 / 3.5e-06
                 (5e-05,5e-06)   0.134     24 / 3573 / 1076       01:38 / 05:21 / 03:00:00    9.8e-07 / 1.0e-06 / 4.0e-05
Table 1 shows the comparison of rPPA, ADMM, and MGL on the stock price data sets SPX3a and SPX11b with $p = 100$ and $p = 200$ selected stocks. One outstanding observation from the table is that rPPA outperforms ADMM and MGL by an obvious margin: rPPA is faster than the other two methods except for one instance. For the exceptional instance, rPPA is still faster than ADMM and merely two seconds slower than MGL. In addition, we find that both rPPA and ADMM succeeded in solving all instances, while MGL failed to solve two of them within one hour. This might imply that MGL is not robust for solving the FGL model when applied to the stock price data sets. The numerical results show convincingly that our algorithm rPPA can solve the FGL problems efficiently and robustly. The superior performance of rPPA can mainly be attributed to our ability to extract and exploit the sparsity structure (in the surrogate generalized Jacobian of $\mathrm{Prox}_{\mathcal{P}}$) within the SSN method to solve each rPPA subproblem very efficiently.
Figure 2 displays the sparsity patterns of the estimated precision matrices from year 2004 to year 2014 on the data set SPX11b with $(\lambda_1, \lambda_2) = (10^{-4}, 10^{-4})$ in the application of the FGL model. It should be noted that this period covers the 2008 financial crisis. We manually split the time points into three stages (one stage corresponds to one row in Figure 2) to aid the interpretation of the results. Each red pattern in the left panel presents the common structure across the estimated precision matrices in its stage, and each blue pattern visualizes the individual edges specific to its own precision matrix. Generally, one can hardly expect a meaningful common structure across all the time points, and thus we provide here the common structure across parts of nearby precision matrices. One can clearly see that more edges are detected in the middle stage, and the number of common edges across years 2008-2011 is correspondingly larger than that in the earlier and later stages. The increased amount of interactions among the stocks over this period is likely due to the 2008 global financial crisis and its sustained effects. Another observation is that the number of edges had a drastic increase in 2007, stayed at a high level during the 2008 global financial crisis and for a certain period after that, and then went down to a level still higher than that of the pre-crisis period (years 2004-2006). The sudden increase in 2007 might be seen as a prediction of the oncoming financial crisis in 2008. The increased amount of interactions among stocks after the financial crisis compared to that in the pre-crisis period may indicate some essential changes of the financial landscape. To a certain degree, these observations agree well with those observed by Yang and Peng (2018).
4.3 University Webpages
Here we evaluate the numerical performances of rPPA, ADMM, and MGL on the data set university webpages, which is provided by Cardoso-Cachopo (2007) and available at http://ana.cachopo.org/datasetsforsinglelabeltextcategorization. The original pages were collected from computer science departments of various universities in 1997. We selected the four largest and most meaningful classes in our experiment: Student, Faculty, Course, and Project. For each class, the collection contains pages from four universities, Cornell, Texas, Washington, and Wisconsin, and other miscellaneous pages from other universities. Furthermore, the original text data have been preprocessed by stemming techniques, that is, by reducing words to their morphological roots. The preprocessed data sets downloaded from the link above contain two files: two thirds of the pages were randomly chosen as the training set (Webtrain) and the remaining third as the testing set (Webtest). Table 2 presents the distribution of documents per class.
Class        Student  Faculty  Course  Project  Total
#train docs  1097     750      620     336      2803
#test docs   544      374      310     168      1396
Next, we apply the FGL model to the data set for the purpose of interpreting the data. We choose tuning parameters that enforce high sparsity and similarity; in this experiment, $\lambda_1$ and $\lambda_2$ are both set to relatively large values. The resulting common structure is displayed in Figure 3. The thickness of an edge is proportional to the magnitude of the associated average partial correlation. Figure 3 shows that some standard phrases in computer science, such as program-languag, oper-system, distribut-system, and softwar-engin (shown in their stemmed forms), possess high partial correlations among their constituent words in all four classes. It successfully demonstrates the effectiveness of the FGL model for exploring the similarity across related classes. Table 3 shows the comparison of the three methods rPPA, ADMM, and MGL on the webpages data sets with data dimensions $p = 100, 200, 300$ and $K = 4$. As can be seen, rPPA outperforms ADMM and MGL by a large margin for most of the tested webpages data sets.
Table 3. Comparison of rPPA (P), ADMM (A), and MGL (M) on the webpages data sets; problems are labeled by data set and $(p, K)$.

Problem           (λ1, λ2)        Density   Iteration: P / A / M   Time: P / A / M             Error: P / A / M
Webtest (100,4)   (1e-02,1e-03)   0.015     16 / 2401 / 4          06 / 25 / 05                5.8e-07 / 9.9e-07 / 1.2e-07
                  (5e-03,5e-04)   0.047     16 / 2401 / 6          07 / 27 / 10                5.6e-07 / 9.9e-07 / 3.1e-07
                  (1e-03,1e-04)   0.219     15 / 701 / 38          05 / 09 / 49                5.7e-07 / 6.1e-07 / 9.0e-07
Webtest (200,4)   (1e-02,1e-03)   0.008     18 / 2101 / 7          11 / 01:03 / 59             7.3e-07 / 8.5e-07 / 8.6e-07
                  (5e-03,5e-04)   0.025     18 / 2101 / 8          12 / 01:04 / 01:05          6.8e-07 / 8.5e-07 / 4.7e-07
                  (1e-03,1e-04)   0.156     18 / 2101 / 72         12 / 01:12 / 07:31          5.5e-07 / 7.1e-07 / 9.3e-07
Webtest (300,4)   (5e-03,5e-04)   0.016     18 / 2101 / 9          25 / 02:12 / 03:44          5.9e-07 / 8.3e-07 / 3.7e-07
                  (1e-03,1e-04)   0.119     17 / 2101 / 258        49 / 02:16 / 39:58          7.9e-07 / 8.6e-07 / 1.0e-06
                  (5e-04,5e-05)   0.244     19 / 2901 / 1393       01:18 / 03:07 / 02:22:23    5.6e-07 / 8.0e-07 / 1.0e-06
Webtrain (100,4)  (1e-02,1e-03)   0.011     24 / 20000 / 3         12 / 03:11 / 04             8.4e-07 / 7.0e-06 / 3.4e-06
                  (5e-03,5e-04)   0.030     24 / 20000 / 5         13 / 03:35 / 08             8.1e-07 / 7.0e-06 / 8.9e-07
                  (1e-03,1e-04)   0.162     24 / 20000 / 22        14 / 03:52 / 01:06          7.3e-07 / 7.0e-06 / 3.5e-06
Webtrain (200,4)  (5e-03,5e-04)   0.015     24 / 20000 / 5         46 / 10:53 / 17             8.5e-07 / 6.7e-06 / 1.2e-06
                  (1e-03,1e-04)   0.105     24 / 15227 / 33        44 / 08:23 / 03:02          7.2e-07 / 1.0e-06 / 2.6e-06
                  (5e-04,5e-05)   0.210     24 / 20000 / 95        52 / 10:41 / 06:34          6.9e-07 / 6.6e-06 / 2.7e-06
Webtrain (300,4)  (5e-03,5e-04)   0.010     24 / 20000 / 7         01:31 / 21:31 / 02:07       8.1e-07 / 5.9e-06 / 1.2e-08
                  (1e-03,1e-04)   0.077     24 / 20000 / 52        01:33 / 22:06 / 20:47       6.3e-07 / 5.9e-06 / 1.8e-06
                  (5e-04,5e-05)   0.168     24 / 20000 / 155       01:51 / 21:56 / 18:09       6.1e-07 / 5.9e-06 / 1.8e-06
5 Conclusion
We have designed an efficient and globally convergent regularized proximal point algorithm for solving the primal formulation of the fused graphical Lasso problem. From a theoretical perspective, we established the Lipschitz continuity of the solution mapping and consequently obtained that the primal and dual sequences are locally linearly convergent. This lays the foundation for the efficiency of the proposed algorithm. Moreover, the second order information is fully exploited, which further contributes to the high efficiency of the proposed algorithm. Numerically, we demonstrated the superior efficiency and robust performance of the proposed method by comparing it with the extensively used alternating direction method of multipliers and the proximal Newton-type method (Yang et al. 2015) on both synthetic and real data sets. In summary, the proposed semismooth Newton based regularized proximal point algorithm is a highly efficient method for solving fused graphical Lasso problems.
6 Supplementary Materials
- Supplementary material: technical details (the surrogate generalized Jacobian of the proximal mapping of the fused Lasso regularizer and the implementation of ADMM) and additional numerical results (on the data sets University Webpages and 20 Newsgroups). (pdf file)
References
Ahmed, A. and E. P. Xing (2009). Recovering time-varying networks of dependencies in social and biological studies. Proceedings of the National Academy of Sciences 106(29), 11878–11883.
Banerjee, O., L. E. Ghaoui, and A. d'Aspremont (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research 9, 485–516.
Borwein, J. and A. S. Lewis (2010). Convex Analysis and Nonlinear Optimization: Theory and Examples. Springer Science & Business Media.
Cardoso-Cachopo, A. (2007). Improving Methods for Single-label Text Categorization. PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa.
Cui, Y., D. F. Sun, and K.-C. Toh (2018). On the R-superlinear convergence of the KKT residuals generated by the augmented Lagrangian method for convex composite conic programming. Mathematical Programming, DOI: 10.1007/s10107-018-1300-6.
Danaher, P., P. Wang, and D. M. Witten (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(2), 373–397.
Facchinei, F. and J.-S. Pang (2007). Finite-dimensional Variational Inequalities and Complementarity Problems. Springer Science & Business Media.
Fan, J. and J. Lv (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica 20(1), 101.
Fan, Y. and C. Y. Tang (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B 75(3), 531–552.
Friedman, J., T. Hastie, and R. Tibshirani (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441.
Gibberd, A. J. and J. D. Nelson (2017). Regularized estimation of piecewise constant Gaussian graphical models: The group-fused graphical lasso. Journal of Computational and Graphical Statistics 26(3), 623–634.
Guo, J., E. Levina, G. Michailidis, and J. Zhu (2011). Joint estimation of multiple graphical models. Biometrika 98(1), 1–15.
Hallac, D., Y. Park, S. Boyd, and J. Leskovec (2017). Network inference via the time-varying graphical lasso. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 205–213. ACM.
Han, D., D. F. Sun, and L. Zhang (2018). Linear rate convergence of the alternating direction method of multipliers for convex composite programming. Mathematics of Operations Research 43(2), 622–637.
Hsieh, C.-J., I. S. Dhillon, P. K. Ravikumar, and M. A. Sustik (2011). Sparse inverse covariance matrix estimation using quadratic approximation. In Advances in Neural Information Processing Systems, pp. 2330–2338. Curran Associates, Inc.
Lee, J. D., Y. Sun, and M. A. Saunders (2014). Proximal Newton-type methods for minimizing composite functions. SIAM Journal on Optimization 24(3), 1420–1443.
Lemaréchal, C. and C. Sagastizábal (1997). Practical aspects of the Moreau–Yosida regularization: Theoretical preliminaries. SIAM Journal on Optimization 7(2), 367–385.
Li, H. and J. Gui (2006). Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics 7(2), 302–317.
Li, X., D. F. Sun, and K.-C. Toh (2018a). A highly efficient semismooth Newton augmented Lagrangian method for solving lasso problems. SIAM Journal on Optimization 28(1), 433–458.
Li, X., D. F. Sun, and K.-C. Toh (2018b). On efficiently solving the subproblems of a level-set method for fused lasso problems. SIAM Journal on Optimization 28(2), 1842–1862.
Monti, R. P., P. Hellyer, D. Sharp, R. Leech, C. Anagnostopoulos, and G. Montana (2014). Estimating time-varying brain connectivity networks from functional MRI time series. NeuroImage 103, 427–443.
Moreau, J.-J. (1965). Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France 93(2), 273–299.
Rockafellar, R. T. (1976). Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization 14(5), 877–898.
Rockafellar, R. T. (2015). Convex Analysis. Princeton University Press.
Rockafellar, R. T. and R. J.-B. Wets (2009). Variational Analysis, Volume 317. Springer Science & Business Media.
Rothman, A. J., P. J. Bickel, E. Levina, and J. Zhu (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics 2, 494–515.
Tibshirani, R., M. Saunders, S. Rosset, J. Zhu, and K. Knight (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B 67(1), 91–108.
Wang, C. J., D. F. Sun, and K.-C. Toh (2010). Solving log-determinant optimization problems by a Newton-CG primal proximal point algorithm. SIAM Journal on Optimization 20(6), 2994–3013.
Yang, J. and J. Peng (2018). Estimating time-varying graphical models. arXiv preprint arXiv:1804.03811.
Yang, J. F., D. F. Sun, and K.-C. Toh (2013). A proximal point algorithm for log-determinant optimization with group lasso regularization. SIAM Journal on Optimization 23(2), 857–893.
Yang, S., Z. Lu, X. Shen, P. Wonka, and J. Ye (2015). Fused multiple graphical lasso. SIAM Journal on Optimization 25(2), 916–943.
Yosida, K. (1964). Functional Analysis. Springer, Berlin.
Zhang, Y., N. Zhang, D. F. Sun, and K.-C. Toh (2019). An efficient Hessian based algorithm for solving large-scale sparse group Lasso problems. Mathematical Programming, DOI: 10.1007/s10107-018-1329-6.
Supplementary material to “An Efficient Linearly Convergent Regularized Proximal Point Algorithm for Fused Multiple Graphical Lasso Problems”
Ning Zhang, Yangjing Zhang, Defeng Sun, and Kim-Chuan Toh
February 19, 2019
Appendix 1 Generalized Jacobian of the Proximal Mapping of the Fused Lasso Regularizer
In this section, we recall the characterization of the generalized Jacobian of the proximal mapping of the fused Lasso regularizer (Tibshirani et al. 2005), which will be used to derive the explicit expression of the generalized Jacobian of the proximal mapping of the fused graphical Lasso (FGL) regularizer.
The fused Lasso regularizer $f: \mathbb{R}^{K} \to \mathbb{R}$ is defined by $f(x) := \lambda_1 \|x\|_1 + \lambda_2 \|Bx\|_1$, where the matrix $B \in \mathbb{R}^{(K-1) \times K}$ is defined by $(Bx)_k = x_{k+1} - x_k$, $k = 1, \dots, K-1$. Denote the proximal mapping of $f$ by $\mathrm{Prox}_{f}(\cdot)$.
Lemma 1.1.
(Friedman et al. 2007, Proposition 1) Given $\lambda_1, \lambda_2 \geq 0$, it holds that
$$\mathrm{Prox}_{f}(v) = \mathrm{Prox}_{\lambda_1 \|\cdot\|_1} \big( \mathrm{Prox}_{\lambda_2 \|B(\cdot)\|_1}(v) \big), \quad v \in \mathbb{R}^{K}.$$
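Lemma 1.1 can be checked numerically. The sketch below compares the fused Lasso prox computed directly with the composition of soft-thresholding and the pure total-variation prox; CVXPY is used as an assumed off-the-shelf modeling package for both convex subproblems.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
v = rng.standard_normal(10)
lam1, lam2 = 0.3, 0.5

def prox_solve(l1, l2, target):
    # prox of l1*||.||_1 + l2*||B .||_1 at `target`, solved as a small program
    x = cp.Variable(10)
    obj = 0.5 * cp.sum_squares(x - target) + l1 * cp.norm1(x) + l2 * cp.norm1(cp.diff(x))
    cp.Problem(cp.Minimize(obj)).solve()
    return x.value

direct = prox_solve(lam1, lam2, v)                 # Prox_f(v) computed directly
tv = prox_solve(0.0, lam2, v)                      # total-variation prox only
composed = np.sign(tv) * np.maximum(np.abs(tv) - lam1, 0.0)   # then soft-threshold
print(np.max(np.abs(direct - composed)) < 1e-4)    # True, up to solver accuracy
```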
Lemma 1.2.
(Li et al. 2018, Lemma 1) Given , it holds that