An Efficient Linearly Convergent Regularized Proximal Point Algorithm for Fused Multiple Graphical Lasso Problems

# An Efficient Linearly Convergent Regularized Proximal Point Algorithm for Fused Multiple Graphical Lasso Problems

Ning Zhang  Yangjing Zhang  Defeng Sun  Kim-Chuan Toh Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong (ningzhang_2008@yeah.net).(Corresponding author) Department of Mathematics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076 (zhangyangjing@u.nus.edu).Department of Applied Mathematics, The Hong Kong Polytechnic University, Hung Hom, Hong Kong (defeng.sun@polyu.edu.hk). This author is supported by Hong Kong Research Grant Council grant PolyU153014/18p.Department of Mathematics, and Institute of Operations Research and Analytics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076 (mattohkc@nus.edu.sg). The research of this author is supported in part by the Ministry of Education, Singapore, Academic Research Fund (Grant number: R-146-000-257-112).
February 19, 2019
###### Abstract

Nowadays, analysing data from different classes or over a temporal grid has attracted a great deal of interest. As a result, various multiple graphical models for learning a collection of graphical models simultaneously have been derived by introducing sparsity in graphs and similarity across multiple graphs. This paper focuses on the fused multiple graphical Lasso model which encourages not only shared pattern of sparsity, but also shared values of edges across different graphs. For solving this model, we develop an efficient regularized proximal point algorithm, where the subproblem in each iteration of the algorithm is solved by a superlinearly convergent semismooth Newton method. To implement the semismooth Newton method, we derive an explicit expression for the generalized Jacobian of the proximal mapping of the fused multiple graphical Lasso regularizer. Unlike those widely used first order methods, our approach has heavily exploited the underlying second order information through the semismooth Newton method. This can not only accelerate the convergence of the algorithm, but also improve its robustness. The efficiency and robustness of our proposed algorithm are demonstrated by comparing with some state-of-the-art methods on both synthetic and real data sets. Supplementary materials for this article are available online.

Keywords: Fast linear convergence Network estimation Semismooth Newton method Sparse Jacobian

## 1 Introduction

Undirected graphical models have been especially popular for learning conditional independence structures among a large number of variables where the observations are drawn independently and identically from the same distribution. The Gaussian graphical model is one of the most widely used undirected graphical models. In the high-dimensional and low-sample-size settings, it is always assumed that the conditional independence structure or the precision matrix is sparse in a certain sense. In other words, its corresponding undirected graph is expected to be sparse. To promote sparsity, there has been a great deal of interest in using the norm penalty in statistical applications (Banerjee et al. 2008; Friedman et al. 2008; Rothman et al. 2008). In many conventional applications, a single Gaussian graphical model is typically enough to capture the conditional independence structure of the random variables. However, due to the heterogeneity or similarity of the data involved, it is increasingly appealing to fit a collection of such models jointly, such as inferring the time-varying networks and finding the change-points (Ahmed and Xing 2009; Monti et al. 2014; Gibberd and Nelson 2017; Hallac et al. 2017; Yang and Peng 2018) and estimating multiple precision matrices simultaneously for variables from distinct but related classes (Guo et al. 2011; Danaher et al. 2014; Yang et al. 2015).

Multiple graphical models refer to the models that can estimate a collection of precision matrices jointly. Specifically, let be random vectors (from different classes or over a temporal grid) drawn independently from different distributions . Assume that the multivariate random variable has observations , for each . Then the sample means are and the sample covariance matrices are , . The multiple graphical model for estimating the precision matrices jointly is the model with the variable :

 minΘL∑l=1(−logdetΘ(l)+⟨S(l),Θ(l)⟩)+P(Θ), (1)

where is a penalty function, which usually promotes sparsity in each and similarities among different ’s. Various penalties have been considered in the literature (Ahmed and Xing 2009; Guo et al. 2011; Danaher et al. 2014; Monti et al. 2014; Yang et al. 2015; Gibberd and Nelson 2017).

In this paper, we focus on the following fused graphical Lasso (FGL) regularizer which was used by Ahmed and Xing (2009) and Yang et al. (2015):

 P(Θ)=λ1L∑l=1∑i≠j|Θ(l)ij|+λ2L∑l=2∑i≠j|Θ(l)ij−Θ(l−1)ij|. (2)

We refer to problem (1) with the FGL regularizer in (2) as the FGL problem. The FGL regularizer is in some sense a generalized fused Lasso regularizer (Tibshirani et al. 2005). It applies the penalty to all the off-diagonal elements of the precision matrices and the consecutive differences of the elements of successive precision matrices. Many elements with the same indices in the estimated matrices will be close or even identical when the parameter is large enough. Therefore, the FGL regularizer encourages not only shared pattern of sparsity, but also shared values across different graphs.

Existing algorithms for solving the FGL problem are quite limited in the literature. One of the most extensively used algorithms for solving this class of problems is the alternating direction method of multipliers (ADMM) (Danaher et al. 2014; Hallac et al. 2017; Gibberd and Nelson 2017). Besides, a proximal Newton-type method (Hsieh et al. 2011; Lee et al. 2014) was implemented by Yang et al. (2015) for solving the FGL problem. As we know, ADMM could be a practical first order method for finding approximate solutions of low or moderate accuracy. However, ADMM hardly utilizes any second order information, which generally must be used in order to obtain highly accurate solutions. Although the proximal Newton-type method does incorporate some forms of second order information, a complicated quadratic approximation problem has to be solved in each iteration, and this computation is usually time-consuming. It is worth mentioning that the regularizers are often introduced to promote certain structures in the estimated precision matrices, and the trade-off between biases and variances in the resulting estimators is controlled by the regularization parameters (Fan and Lv 2010). But in practice, it is extremely hard to find the optimal regularization parameters. Therefore, a sequence of regularization parameters is applied in practice, and consequently, a sequence of corresponding optimization problems must be solved (Fan and Tang 2013). Under such a circumstance, a highly efficient and robust algorithm for solving the FGL model becomes particularly important.

In this paper, we will design a semismooth Newton (SSN) based regularized proximal point algorithm (rPPA) for solving the FGL problem, which is inspired by Li et al. (2018b), where they have convincingly demonstrated the superior numerical performance of the SSN based augmented Lagrangian method (ALM), known as Ssnal, for solving the fused Lasso problem (Tibshirani et al. 2005). Thanks to the fact that the FGL problem has close connections to the fused Lasso problem, many of the virtues and theoretical insights of the Ssnal  for solving the fused Lasso problem can be observed in our approach. However, we should emphasize that solving the FGL problem is much more challenging than solving the fused Lasso problem. Specifically, the difficulties are mainly due to the log-determinant function and the matrix variables, as described below.

• Unlike the simple quadratic functions in the fused Lasso problem, the function is defined on the space of positive definite matrices. Therefore, the FGL model requires the positive definiteness of their solutions. This greatly increases the difficulty and complexity of theoretical analysis and numerical implementation.

• Li et al. (2018b) constructed an efficiently computable element in the generalized Jacobian of the proximal mapping of the fused Lasso regularizer, which is an essential step for solving the fused Lasso problem. Based on the constructions, we could obtain an efficiently computable generalized Jacobian of the proximal mapping of the FGL regularizer. However, this process needs more complicated manipulations of coordinates for a collection of matrix variables, unlike the vector case of the fused Lasso problem.

The key issue in the implementation of rPPA for solving the FGL model is the computation of the solution of the subproblem in each rPPA iteration. For this purpose, we will design an SSN method to solve those subproblems. We note that the numerical performance of the SSN method relies critically on the efficient calculation of the generalized Jacobian of the proximal mapping of the FGL regularizer and that of the log-determinant function. Fortunately, the generalized Jacobian of the proximal mapping of the FGL regularizer can be constructed efficiently based on that of the proximal mapping of the fused Lasso regularizer given by Li et al. (2018b). As a result, the generalized Jacobian of the proximal mapping of the FGL regularizer would inherit the structured sparsity (referred to as second order sparsity) from that of the fused Lasso regularizer. Due to the structured sparsity, the computation of a matrix-vector product in the SSN method is reasonably cheap and thus the SSN method is quite efficient for solving each subproblem. To summarize, it can be proven that our rPPA for solving the FGL problem has a linear convergent guarantee, and the convergence rate can be arbitrarily fast by choosing a sufficiently large proximal penalty parameter. Moreover, the SSN method for solving each of rPPA subproblems can be shown to be superlinearly convergent. Thus, based on these excellent convergent properties and the novel exploitation of the second order sparsity, we can expect the SSN based rPPA for solving the FGL problem to be highly efficient. Indeed, our numerical experiments have confirmed the high efficiency and robustness of the proposed algorithm for solving the FGL problems accurately.

The remaining parts of this paper are as follows. Section 2 presents some definitions and preliminary results. In section 3, we present a semismooth Newton based regularized proximal point algorithm for solving the FGL problem and its convergence properties. The numerical performance of our proposed algorithm on time-varying stock prices data sets and categorical text data sets are evaluated in section 4. Section 5 gives the conclusion.

Notations. () denotes the cone of positive semidefinite (definite) matrices in the space of real symmetric matrices . For any , we denote if (). In particular, indicates (. We let and to be the Cartesian product of positive semidefinite cones and that of spaces of symmetric matrices , respectively. denotes the -dimensional Euclidean space, and denotes the set of all real matrices. For any , , and . We use the Matlab notation to denote the matrix obtained by appending below the last row of , when the number of columns of and is identical. For any matrix , denotes the -th element of . For any , denotes the column vector obtained by taking out the -th elements across all matrices . denotes the block diagonal matrix whose -th diagonal block are the matrix , . denotes the identity matrix, and denotes an identity matrix or map when the dimension is clear from the context. The function composition is denoted by , that is, for any functions and , . The Hadamard product is denoted by .

## 2 Preliminaries

Let be a finite-dimensional real Hilbert space, and be a proper and closed convex function. The Moreau-Yosida regularization (Moreau 1965; Yosida 1964) of is defined by

 ΨΞ(u):=minu′{Ξ(u′)+12∥u′−u∥2},∀u∈E. (3)

The proximal mapping associated with is the unique minimizer of (3) defined by

 ProxΞ(u):=argminu′{Ξ(u′)+12∥u′−u∥2},∀u∈E. (4)

Moreover, is a continuously differentiable convex function (Lemaréchal and Sagastizábal 1997; Rockafellar and Wets 2009), and its gradient is given by

 ∇ΨΞ(u)=u−ProxΞ(u),∀u∈E. (5)

For notational convenience, define by

 ϑ(A)={−logdet(A),if A∈Sp++;+∞,otherwise.

Let be given. Define two scalar functions as follows:

 ϕ+β(x):=(√x2+4β+x)/2,ϕ−β(x):=(√x2+4β−x)/2,∀x∈R.

In addition, the matrix counterparts of these two scalar functions can be defined by

 ϕ+β(A):=QDiag(ϕ+β(d1),…,ϕ+β(dp))QT,ϕ−β(A):=QDiag(ϕ−β(d1),…,ϕ−β(dp))QT

for any with its eigenvalue decomposition , where . It is easy to show that and are well-defined. Moreover, and are positive definite for any .

###### Proposition 2.1.

(Wang et al. 2010, Lemma 2.1 (b)) The function is continuously differentiable, and its directional derivative at for any is given by

 (ϕ+β)′(A)[B]=Q[Γ⊙(QTBQ)]QT,

where admits the eigenvalue decomposition , and is defined by

 Γij=(ϕ+β(di)+ϕ+β(dj))/(√d2i+4β+√d2j+4β),i,j=1,2,…,p.
###### Proposition 2.2.

(Yang et al. 2013, Proposition 2.3) For any , it holds that

### 2.1 Surrogate Generalized Jacobian of ProxP

In this section, we analyse the proximal mapping of the regularizer defined by (2). For any , one might observe that the penalty term merely penalizes the off-diagonal elements, and it is the same fused Lasso regularizer that acts on each vector . It holds that The function is the fused Lasso regularizer, and the matrix is defined by , . The formula for the generalized Jacobian of has been derived by Li et al. (2018b) and will be used in our subsequent algorithmic design. Define the surrogate generalized Jacobian of at as follows:

 ⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩W∈ˆ∂ProxP(X) if and only if there exist M(ij)∈ˆ∂Proxφ(X[ij]), i

where is the surrogate generalized Jacobian of (Li et al. 2018b, Equation 22) and will be given in the supplementary materials. From Theorem 1 by Li et al. (2018b), one can obtain the following theorem, which justifies why in (6) can be used as the surrogate generalized Jacobian of at .

###### Theorem 2.1.

The surrogate generalized Jacobian defined in (6) is a nonempty compact valued, upper semicontinuous multifunction. Given any , any element in the set is self-adjoint and positive semidefinite. Moreover, there exists a neighborhood of such that for all ,

 ProxP(Y)−ProxP(X)−W[Y−X]=0,∀W∈ˆ∂ProxP(Y).

### 2.2 Lipschitz Continuity of the Solution Mapping

By introducing an auxiliary variable , we can rewrite problem (1) equivalently as

 minΘ,Ω{f(Θ,Ω):=L∑l=1(ϑ(Ω(l))+⟨S(l),Θ(l)⟩)+P(Θ)∣∣Θ−Ω=0}. (7)

The Lagrangian function of the above problem is given by

 L(Θ,Ω,X)=f(Θ,Ω)−⟨Θ−Ω,X⟩,∀(Θ,Ω,X)∈X×X×Y.

The dual problem of (7) takes the following form (Borwein and Lewis 2010, Theorem 3.3.5):

 maxXL∑l=1(−ϑ(X(l))+p)−P∗(X−S). (8)

The Karush-Kuhn-Tucker (KKT) optimality conditions (Han et al. 2018) for (7) are given as follows:

 Θ−ProxP(Θ+X−S)=0,Ω(l)−Proxϑ(Ω(l)−X(l))=0,l=1,…,L,Θ−Ω=0. (9)

We make the following assumption on the existence of solutions to the KKT system.

###### Assumption 2.1.

The solution set to the KKT system (9) is nonempty.

Define an operator by Since the function is strictly convex, under Assumption 2.1, we can see that the KKT system (9) has a unique KKT point, denoted by , and .

###### Proposition 2.3.

There exists a nonnegative scalar such that for some it holds that

###### Proof.

Note that the regularizer defined by (2) is a positive homogeneous function. Therefore, it follows from Example 11.4(a) by Rockafellar and Wets (2009) that the conjugate function is an indicator function of a nonempty convex polyhedral set. This, together with Theorem 2.7 by Li et al. (2018a) and Proposition 6 by Cui et al. (2018), proves the required result. ∎

## 3 Regularized Proximal Point Algorithm

In this section, we present a regularized proximal point algorithm (rPPA) for solving the problem (7) with the FGL regularizer defined by (2). Given a sequence of positive scalars , the -th iteration of PPA for solving (7) is given by

 (Θk+1,Ωk+1)≈argminΘ,Ω{f(Θ,Ω)+12σk(∥Θ−Θk∥2+∥Ω−Ωk∥2)|Θ−Ω=0}, (10)

where and is the objective function of the problem (7).

There are many ways to solve (10). Inspired by recent progresses in solving large scale convex optimization problems (Yang et al. 2013; Li et al. 2018a, b; Zhang et al. 2019), we shall adopt the approach of solving (10) via employing a sparse SSN method to its dual. The dual of (10) takes the following form:

 supX{Φk(X):=infΘ,Ω{L(Θ,Ω,X)+12σk(∥Θ−Θk∥2+∥Ω−Ωk∥2)}}.

By the definition of the Moreau-Yosida regularization (3), we can write explicitly as follows:

 Φk(X)=infΘ{P(Θ)+⟨Θ,S−X⟩+12σk∥Θ−Θk∥2}  +∑Ll=1infΩ(l){ϑ(Ω(l))+⟨Ω(l),X(l)⟩+12σk∥Ω(l)−(Ω(l))k∥2}=1σkΨσkP(Θk+σk(X−S))+∑Ll=11σkΨσkϑ((Ωk)(l)−σkX(l))  −12σk∥Θk+σk(X−S)∥2+12σk∥Θk∥2−∑Ll=1(12σk∥(Ωk)(l)−σkX(l)∥2−12σk∥(Ω(l))k∥2).

Therefore, by Proposition 2.2 and the definition of the proximal mapping (4), the -th iteration of PPA (10) can be written as

 {Θk+1=ProxσkP(Θk+σk(Xk+1−S)),(Ω(l))k+1=Proxσkϑ((Ω(l))k−σk(X(l))k+1),l=1,2,…,L,

where approximately solves the following problem: Since is not strongly concave in general, we consider the following rPPA.

Since in practice the inner subproblem (11) can only be solved inexactly, we will use the following standard stopping criteria studied by Rockafellar (1976):

 (A)∥∇ˆΦk(Xk+1)∥≤εk/σk,εk≥0,∑∞k=0εk<∞;(B)∥∇ˆΦk(Xk+1)∥≤(δk/σk)∥(Θk+1,Ωk+1)−(Θk,Ωk)∥,δk≥0,∑∞k=0δk<∞.

The reason for using the above stopping criteria is due to the fact that Algorithm 1 is equivalent to the primal-dual PPA in the sense of Rockafellar (1976). Moreover, we have the following convergence results.

###### Theorem 3.1.

Let be an infinite sequence generated by Algorithm 1 under stopping criterion (A). Then the sequence converges to the unique solution of (7), and the sequence converges to the unique solution of (8). Furthermore, if the criterion is also executed in Algorithm 1, there exists such that for all ,

 ∥(Θk+1,Ωk+1,Xk+1)−(¯¯¯¯Θ,¯¯¯¯Ω,¯¯¯¯¯X)∥≤μk∥(Θk,Ωk,Xk)−(¯¯¯¯Θ,¯¯¯¯Ω,¯¯¯¯¯X)∥,

where the convergence rate

 1>μk:=[κ(κ2+σ2k)−1/2+δk]/(1−δk)→μ∞=κ(κ2+σ2∞)−1/2(μ∞=0ifσ∞=∞)

and the parameter is from Proposition 2.3.

###### Proof.

The global convergence of Algorithm 1 can be obtained from Theorem 1 by Rockafellar (1976) and the uniqueness of the KKT point. The linear rate of convergence can be derived from Proposition 2.3 and Theorem 2 by Rockafellar (1976). ∎

### 3.1 Semismooth Newton Method for Solving Subproblem (11)

From (5), Proposition 2.2, and Theorem 31.5 by Rockafellar (2015), we know that is a continuously differentiable, strongly concave function and

 ∇Φk(X)=−ProxσkP(Uk(X))+(ϕ+σk(W(1)k(X)),…,ϕ+σk(W(L)k(X))),

where and ,  . Therefore, one can obtain the unique solution to problem (11) by solving the nonsmooth system

 ∇ˆΦk(X)=∇Φk(X)−(X−Xk)/σk=0. (12)

Recall that is differentiable and its derivative is given by Proposition 2.1. Thus, the surrogate generalized Jacobian of at is defined as follows:

 ⎧⎪ ⎪⎨⎪ ⎪⎩V∈ˆ∂(∇Φk)(X) if and only if there exists G∈ˆ∂ProxP(Uk(X)/σk) such thatV[D]=−σkG[D]          −σk((ϕ+σk)′(W(1)k(X))[D(1)],…,(ϕ+σk)′(W(L)k(X))[D(L)]),∀D∈Y. (13)

With the generalized Jacobian of , we are ready to solve equation (12) by the SSN method, where the Newton systems are solved inexactly by the conjugate gradient (CG) method.

Next, we derive the convergence result of the SSN method (Algorithm 2).

###### Theorem 3.2.

Let be the infinite sequence generated by Algorithm 2. Then converges to the unique optimal solution of (12) and .

###### Proof.

Since the proximal mapping is piecewise linear and Lipschitz continuous, we know from Theorem 7.5.17 by Facchinei and Pang (2007) that is directionally differentiable. This, together with Theorem 2.1, implies that is strongly semismooth with respect to the multifunction (for its definition, see e.g., Definition 1 by Li et al. (2018b)). Therefore, the conclusion follows from the strong convexity of , Proposition 2.1, and Proposition 7 & Theorem 3 by Li et al. (2018b). ∎

## 4 Numerical Experiments

In this section, we compare the performance of our algorithm rPPA with the alternating direction method of multipliers (ADMM) and the proximal Newton-type method implemented by Yang et al. (2015) (referred to as MGL here) for which the solver is available at http://senyang.info/.

The following paragraph describes the measurement of the accuracy of an approximate optimal solution and the stopping criteria of the three methods. Since both rPPA and ADMM can generate primal and dual approximate solutions, we can assess the accuracy of their solutions by the relative KKT residuals. Unlike the primal-dual method, MGL merely gives the primal solution and the KKT residual of a solution generated by MGL is not available. Instead, we measure the relative error of the objective value obtained by MGL with respect to that computed by rPPA. Based on the KKT optimality condition (9), the accuracy of an approximate optimal solution generated by rPPA (Algorithm 1) is measured by defining the following relative residuals:

 ηP:=max{∥Θ−ProxP(Θ+X−S)∥1+∥Θ∥,∥Θ−Ω∥1+∥Θ∥,max1≤l≤L{∥Ω(l)X(l)−I∥1+√p}}.

Likewise, the accuracy of an approximate optimal solution generated by ADMM is measured by the relative KKT residual (defined in the supplementary material) that is analogous to .

In our numerical experiments, we terminate rPPA if it satisfies the condition for a given accuracy tolerance ; similarly for ADMM with the stopping condition Note that the terminating condition for MGL is different. Let “” and “” be the primal objective function values computed by rPPA and MGL, respectively. MGL will be terminated when the relative difference of its objective value with respect to that obtained by rPPA is smaller than the given tolerance , i.e.,

 ΔM:=(pobjM−pobjP)/(1+|pobjM|+|pobjP|)<ε. (14)

We adopt a warm-start strategy to initialize rPPA. That is, we first run ADMM (with identity matrices as the starting point) for a fixed number of iterations to generate a good initial point to warm-start rPPA. We also stop ADMM as soon as the relative KKT residual of the computed iterate is less than . Note that such a warm-starting strategy is sound since in the initial phase of rPPA where the iterates are not close to the optimal solution (as measured by the associated relative KKT residual), it is computationally wasteful to use the more expensive rPPA iteration when the fast local linear convergence behavior of the algorithm has yet to kick in. Under such a scenario, naturally one would use cheaper iterations such as those of ADMM to generate the approximate solution points until the relative KKT residual has been sufficiently reduced.

For the tuning parameters and , for each test instance we select three pairs that lead to reasonable sparsity. In the following tables, “P” stands for rPPA; “A” stands for ADMM; “M” stands for MGL; “nnz” denotes the number of nonzero entries in the solution obtained by rPPA using the estimation: where is the vector obtained via sorting all elements in by magnitude in a descending order; “density” denotes the quantity nnz. The time is displayed in the format of “hours:minutes:seconds”, and the fastest method in terms of running time is highlighted in red. The errors presented in the tables are the relative KKT residuals for rPPA and for ADMM; while the error for MGL is in (14).

### 4.1 Nearest-neighbour Networks

In this section, we assess the effectiveness of the FGL model on a simulated network: nearest-neighbour network. The nearest-neighbour network is generated by modifying the data generation mechanism described by Li and Gui (2006). We set and . For each , we generate 10,000 independently and identically distributed observations from a multivariate Gaussian distribution , where is the precision matrix of the -th class. The details of the generation of are as follows. First of all, points are randomly generated on a unit square, their pairwise distances are calculated, and -nearest neighbours of each point in terms of distance are found. The nearest-neighbour network is obtained by linking any two points that are -nearest neighbours of each other. The integer controls the degree of sparsity of the network, and we set in our simulation. Subsequently, we add heterogeneity to the common structure by further creating individual links as follows: for each , a pair of symmetric zero elements is randomly selected and replaced with a value uniformly drawn from the interval . This procedure is repeated ceil (the nearest integer greater than or equal to ) times, where is the number of edges in the nearest-neighbour graph. In our simulation, the true number of edges in the three networks is .

There is a pair of tuning parameters and which must be specified. In the FGL model, drives sparsity and drives similarity, and we say that and are the sparsity and similarity control parameters respectively. In order to show the diversity of sparsity in our experiments, we choose a series of for the FGL model with fixed. Figure 1 shows the relative ability of the FGL model to recover the network structures and to detect the change-points.

Figure 0(a) displays the number of true positive edges selected (i.e., TP edges) against the number of false edges selected (i.e., FP edges) for the FGL model. We say that an edge in the -th network is selected in the estimate if , and we say that the edge is true in the precision matrix if and false if . We can see from the figure that the FGL model with can recover almost all of the true positive edges without false positive edges. Figure 0(a) also shows that for the FGL model the similarity control parameter is much better than in terms of the ability of true edges detection. When , the FGL model can merely detect about 3000 true positive edges while the the number of false positive edges is increased to over 600. One possible reason is that is too large compared with the underlying optimal one in this case.

Figure 0(b) illustrates the sum of squared errors between estimated edge values and true edge values, i.e., . When the number of the total edges selected is increasing (i.e., the sparsity control parameter is decreasing), the error is decreasing and finally reaches a fairly low value.

Figure 0(c) plots the number of true positive differential edges against false positive differential edges. A differential edge is an edge that differs between classes and thus corresponds to a change-point. We say that the edge is estimated to be differential between the -th and the -th networks if , and we say that it is truly differential if . The number of differential edges is computed for all successive pairs of networks. The best point in Figure 0(c) is the red one which has approximately 2700 true positive differential edges and almost no false one. We can also see from Figure 0(c) that all the blue points have no false positive differential edge and small numbers of true positive differential edges. This might be caused by the larger similarity control parameter which forces an excessive number of edges across networks to be similar.

### 4.2 Standard & Poor’s 500 Stocks

In this section, we compare rPPA, ADMM, and MGL on the Standard & Poor’s 500 stock price data sets. The stock price data sets contain daily returns of 500 stocks over a long period, and can be downloaded from the link www.yahoo.com. The dependency structures of different stocks vary over time. But it appears that the dependency networks change smoothly over time. Therefore, the FGL model might be able to find the interactions among these stocks and how they evolve over time.

We first consider a relatively short three-year time period from January 2004 to December 2006. During this period, there are totally 755 daily returns of 370 stocks. We call this data set SPX3a. For each year, it contains approximately 250 daily returns of each stock. Considering the limited number of observations in each year and the interpretation of the results, we choose to analyse random smaller subsets of all involved stocks, whose sizes are chosen to be and , over periods.

In addition to the above data set over three years, a relatively long period from January 2004 to December 2014 is also considered in the experiments, which is referred to as SPX11b. Since the time period is longer than the previous one, the number of stocks becomes smaller as some stocks might disappear. During the 11-year time period, there are 2769 daily closing prices of 272 stocks. We can set a relatively large parameter according to years from January 2004 to December 2014. Again, we choose to analyse two random subsets of all existing stocks, of which the sizes are selected to be and .

Table 1 shows the comparison of rPPA, ADMM, and MGL on the stock price data sets SPX3a and SPX11b with and selected stocks. One outstanding observation from the table is that rPPA outperforms ADMM and MGL by an obvious margin and rPPA is faster than the other two methods except for one instance. For the exceptional instance, rPPA is still faster than ADMM and merely two seconds slower than MGL. In addition, we find that both rPPA and ADMM succeeded in solving all instances; while MGL failed to solve two of them within one hour. This might imply that MGL is not robust for solving the FGL model when applied to the stock price data sets. The numerical results show convincingly that our algorithm rPPA can solve the FGL problems efficiently and robustly. The superior performance of rPPA can mainly be attributed to our ability to extract and exploit the sparsity structure (in the surrogate generalized Jacobian of ) within the SSN method to solve each rPPA subproblem very efficiently.

Figure 2 displays the sparse patterns of estimated precision matrices from year 2004 to year 2014 on the data set SPX11b with (1e-4,1e-4) in application of the FGL model. It should be noted that this period covers the 2008 financial crisis. We manually split the time points into three stages (one stage corresponds to one row in Figure 2) to aid the interpretation of the results. Each red pattern in the left panel presents the common structure across the estimated precision matrices in its stage. And each blue pattern visualizes the individual edges specific to its own precision matrix. Generally, one can hardly expect a meaningful common structure across all the time points, and thus we provide here the common structure across parts of nearby precision matrices. One can clearly see that more edges are detected in the middle stage, and the number of the common edges across year 2008, 2009, 2010, 2011 is correspondingly larger than that in the earlier and later stages. The increased amount of interactions among the stocks over this period is likely due to the 2008 global financial crisis and its sustained effects. Another observation is that the number of edges had a drastic increase in 2007, kept at a high level during the 2008 global financial crisis and a certain period after that, and then went down to a level still higher than that of the pre-crisis period (year 2004, 2005, 2006). The sudden increase in 2007 might be seen as a prediction of the oncoming financial crisis in 2008. The increased amount of interactions among stocks after the financial crisis compared to that in the pre-crisis period may indicate some essential changes of the financial landscape. To a certain degree, the observations agree well with those observed by Yang and Peng (2018).

### 4.3 University Webpages

Here we evaluate the numerical performances of rPPA, ADMM, and MGL on the data set university webpages, which is provided by Cardoso-Cachopo (2007) and available at http://ana.cachopo.org/datasets-for-single-label-text-categorization. The original pages were collected from computer science departments of various universities in 1997. We selected four largest and meaningful classes in our experiment: Student, Faculty, Staff, and Department. For each class, the collection contains pages from four universities: Cornell, Texas, Washington, Wisconsin, and other miscellaneous pages from other universities. Furthermore, the original text data have been preprocessed by stemming techniques, that is, reducing words to their morphological roots. The preprocessed data sets downloaded from the link above contain two files: two thirds of the pages were randomly chosen as training set (Webtrain) and the remaining third as testing set (Webtest). Table 2 presents the distribution of documents per class.

Next, we apply the FGL model to the data set for the purpose of interpreting the data. We choose tuning parameters that enforce high sparsity and similarity. In our experiment, we set and . The resulting common structure is displayed in Figure 3. The thickness of an edge is proportional to the magnitude of the associated average partial correlation. Figure 3 shows that some standard phrases in computer science, such as program-languag, oper-system, distribut-system, softwar-engin, possess high partial correlations among their constituent words in all four classes. It successfully demonstrates the effectiveness of the FGL model for exploring the similarity across related classes. Table 3 shows the comparison of the three methods rPPA, ADMM, and MGL on the webpages data sets with data dimension , , and . As can be seen, rPPA outperforms ADMM and MGL by a large margin for most of the tested webpages data sets.

## 5 Conclusion

We have designed an efficient and globally convergent regularized proximal point algorithm for solving the primal formulation of the fused graphical Lasso problem. From a theoretical perspective, we established the Lipschitiz continuity of the solution mapping and consequently obtained that the primal and dual sequences are locally linearly convergent. This lays the foundation for the efficiency of the proposed algorithm. Moreover, the second order information was also fully exploited, which further leads to the high efficiency of the proposed algorithm. Numerically, we demonstrated the superior efficiency and robust performance of the proposed method by comparing it with the extensively used alternating direction method of multipliers and the proximal Newton-type method (Yang et al. 2015) on both synthetic and real data sets. In summary, the proposed semismooth Newton based regularized proximal point algorithm is a highly efficient method for solving the fused graphical Lasso problems.

## 6 Supplementary Materials

Supplementary material:

It contains technical details (generalized Jacobian of the proximal mapping of the fused Lasso regularizer and implementation of ADMM) and numerical results (on data University Webpages and 20 Newsgroups). (pdf file)

## References

• Ahmed and Xing (2009) Ahmed, A. and E. P. Xing (2009). Recovering time-varying networks of dependencies in social and biological studies. Proceedings of the National Academy of Sciences 106(29), 11878–11883.
• Banerjee et al. (2008) Banerjee, O., L. E. Ghaoui, and A. d’Aspremont (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research 9, 485–516.
• Borwein and Lewis (2010) Borwein, J. and A. S. Lewis (2010). Convex analysis and nonlinear optimization: theory and examples. Springer Science & Business Media.
• Cardoso-Cachopo (2007) Cardoso-Cachopo, A. (2007). Improving methods for single-label text categorization. PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa.
• Cui et al. (2018) Cui, Y., D. F. Sun, and K.-C. Toh (2018). On the R-superlinear convergence of the KKT residuals generated by the augmented Lagrangian method for convex composite conic programming. Mathematical Programming, DOI: 10.1007/s10107-018-1300-6.
• Danaher et al. (2014) Danaher, P., P. Wang, and D. M. Witten (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76(2), 373–397.
• Facchinei and Pang (2007) Facchinei, F. and J.-S. Pang (2007). Finite-dimensional Variational Inequalities and Complementarity Problems. Springer Science & Business Media.
• Fan and Lv (2010) Fan, J. and J. Lv (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica 20(1), 101.
• Fan and Tang (2013) Fan, Y. and C. Y. Tang (2013). Tuning parameter selection in high dimensional penalized likelihood. Journal of the Royal Statistical Society: Series B 75(3), 531–552.
• Friedman et al. (2008) Friedman, J., T. Hastie, and R. Tibshirani (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441.
• Gibberd and Nelson (2017) Gibberd, A. J. and J. D. Nelson (2017). Regularized estimation of piecewise constant gaussian graphical models: The group-fused graphical lasso. Journal of Computational and Graphical Statistics 26(3), 623–634.
• Guo et al. (2011) Guo, J., E. Levina, G. Michailidis, and J. Zhu (2011). Joint estimation of multiple graphical models. Biometrika 98(1), 1–15.
• Hallac et al. (2017) Hallac, D., Y. Park, S. Boyd, and J. Leskovec (2017). Network inference via the time-varying graphical lasso. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 205–213. ACM.
• Han et al. (2018) Han, D., D. F. Sun, and L. Zhang (2018). Linear rate convergence of the alternating direction method of multipliers for convex composite programming. Mathematics of Operations Research 43(2), 622–637.
• Hsieh et al. (2011) Hsieh, C.-J., I. S. Dhillon, P. K. Ravikumar, and M. A. Sustik (2011). Sparse inverse covariance matrix estimation using quadratic approximation. In Advances in Neural Information Processing Systems, pp. 2330–2338. Curran Associates, Inc.
• Lee et al. (2014) Lee, J. D., Y. Sun, and M. A. Saunders (2014). Proximal Newton-type methods for minimizing composite functions. SIAM Journal on Optimization 24(3), 1420–1443.
• Lemaréchal and Sagastizábal (1997) Lemaréchal, C. and C. Sagastizábal (1997). Practical aspects of the Moreau–Yosida regularization: Theoretical preliminaries. SIAM Journal on Optimization 7(2), 367–385.
• Li and Gui (2006) Li, H. and J. Gui (2006). Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics 7(2), 302–317.
• Li et al. (2018a) Li, X., D. F. Sun, and K.-C. Toh (2018a). A highly efficient semismooth Newton augmented Lagrangian method for solving lasso problems. SIAM Journal on Optimization 28(1), 433–458.
• Li et al. (2018b) Li, X., D. F. Sun, and K.-C. Toh (2018b). On efficiently solving the subproblems of a level-set method for fused lasso problems. SIAM Journal on Optimization 28(2), 1842–1862.
• Monti et al. (2014) Monti, R. P., P. Hellyer, D. Sharp, R. Leech, C. Anagnostopoulos, and G. Montana (2014). Estimating time-varying brain connectivity networks from functional MRI time series. NeuroImage 103, 427–443.
• Moreau (1965) Moreau, J.-J. (1965). Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France 93(2), 273–299.
• Rockafellar (1976) Rockafellar, R. T. (1976). Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization 14(5), 877–898.
• Rockafellar (2015) Rockafellar, R. T. (2015). Convex Analysis. Princeton University Press.
• Rockafellar and Wets (2009) Rockafellar, R. T. and R. J.-B. Wets (2009). Variational Analysis, Volume 317. Springer Science & Business Media.
• Rothman et al. (2008) Rothman, A. J., P. J. Bickel, E. Levina, and J. Zhu (2008). Sparse permutation invariant covariance estimation. Electronic Journal of Statistics 2, 494–515.
• Tibshirani et al. (2005) Tibshirani, R., M. Saunders, S. Rosset, J. Zhu, and K. Knight (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B 67(1), 91–108.
• Wang et al. (2010) Wang, C. J., D. F. Sun, and K.-C. Toh (2010). Solving log-determinant optimization problems by a Newton-CG primal proximal point algorithm. SIAM Journal on Optimization 20(6), 2994–3013.
• Yang and Peng (2018) Yang, J. and J. Peng (2018). Estimating time-varying graphical models. arXiv preprint arXiv:1804.03811.
• Yang et al. (2013) Yang, J. F., D. F. Sun, and K.-C. Toh (2013). A proximal point algorithm for log-determinant optimization with group lasso regularization. SIAM Journal on Optimization 23(2), 857–893.
• Yang et al. (2015) Yang, S., Z. Lu, X. Shen, P. Wonka, and J. Ye (2015). Fused multiple graphical lasso. SIAM Journal on Optimization 25(2), 916–943.
• Yosida (1964) Yosida, K. (1964). Functional analysis. Springer Berlin.
• Zhang et al. (2019) Zhang, Y., N. Zhang, D. F. Sun, and K.-C. Toh (2019). An efficient Hessian based algorithm for solving large-scale sparse group Lasso problems. Mathematical Programming, DOI: 10.1007/s10107-018-1329-6.

Supplementary material to “An Efficient Linearly Convergent Regularized Proximal Point Algorithm for Fused Multiple Graphical Lasso Problems”

Ning Zhang     Yangjing Zhang     Defeng Sun     Kim-Chuan Toh

February 19, 2019

## Appendix 1 Generalized Jacobian of the Proximal Mapping of the Fused Lasso Regularizer

In this section, we recall the characterization of the generalized Jacobian of the fused Lasso regularizer (Tibshirani et al. 2005), which will be used to derive the explicit expression of the generalized Jacobian of the fused graphical Lasso (FGL) regularizer.

The fused Lasso regularizer is defined by where the matrix is defined by . Denote the proximal mapping of by

###### Lemma 1.1.

(Friedman et al. 2007, Proposition 1) Given , it holds that

###### Lemma 1.2.

(Li et al. 2018, Lemma 1) Given , it holds that