# Decomposing Linearly Constrained Nonconvex Problems by a Proximal Primal Dual Approach: Algorithms, Convergence, and Applications

Mingyi Hong

M. Hong is with the Department of Industrial and Manufacturing Systems Engineering (IMSE), Iowa State University, Ames, IA 50011, USA. Email: mingyi@iastate.edu. M. Hong is supported in part by NSF under Grant CCF-1526078 and by AFOSR under Grant 15RT0767.
April 4th, 2016
###### Abstract

In this paper, we propose a new decomposition approach named the proximal primal-dual algorithm (Prox-PDA) for smooth nonconvex linearly constrained optimization problems. The proposed approach is primal-dual based, where the primal step minimizes a certain approximation of the augmented Lagrangian of the problem, and the dual step performs an approximate dual ascent. The approximation used in the primal step is able to decompose the variable blocks, making it possible to obtain simple subproblems by leveraging the problem structure. Theoretically, we show that whenever the penalty parameter in the augmented Lagrangian is larger than a given threshold, the Prox-PDA converges to the set of stationary solutions, globally and in a sublinear manner (i.e., a certain measure of stationarity decreases at the rate of $\mathcal{O}(1/r)$, where $r$ is the iteration counter). Interestingly, when applying a variant of the Prox-PDA to the problem of distributed nonconvex optimization (over a connected undirected graph), the resulting algorithm coincides with the popular EXTRA algorithm [Shi et al 2014], which is only known to work in convex cases. Our analysis implies that EXTRA and its variants converge globally sublinearly to stationary solutions of certain nonconvex distributed optimization problems. There are many possible extensions of the Prox-PDA, and we present one particular extension to a certain nonconvex distributed matrix factorization problem.

## 1 Introduction

Consider the following optimization problem

$$\min_{x\in\mathbb{R}^N} f(x),\quad {\rm s.t.}\ Ax=b \tag{1}$$

where $f:\mathbb{R}^N\to\mathbb{R}$ is a closed and smooth function (possibly nonconvex); $A\in\mathbb{R}^{M\times N}$ is a rank deficient matrix; $b\in\mathbb{R}^M$ is a known vector. In this paper we propose a first-order primal-dual method for solving such nonconvex problems. Below we list a few applications of the above model.

### 1.1 Motivating Examples

Distributed Optimization Over Networks. Consider a network consisting of $N$ agents who collectively optimize the following problem

$$\min_{y\in\mathbb{R}} f(y):=\sum_{i=1}^N f_i(y), \tag{2}$$

where $f_i$ is a function local to agent $i$ (here $y$ is assumed to be scalar for ease of presentation). Suppose the agents are connected by a network defined by an undirected graph $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$, with $|\mathcal{V}|=N$ vertices and $|\mathcal{E}|=E$ edges. Each agent can only communicate with its immediate neighbors, and it is responsible for optimizing one component function $f_i$. This problem has found applications in various domains such as distributed consensus [1, 2], distributed communication networking [3, 4], distributed and parallel machine learning [5, 6, 7] and distributed signal processing [8, 3]; for more applications we refer the readers to a recent survey [9].

Define the node-edge incidence matrix $A\in\mathbb{R}^{E\times N}$ as follows: if edge $e\in\mathcal{E}$ connects vertices $i$ and $j$ with $i>j$, then $A_{ev}=1$ if $v=i$, $A_{ev}=-1$ if $v=j$, and $A_{ev}=0$ otherwise. Using this definition, the signed graph Laplacian matrix $L_-$ is given by

$$L_-:=A^TA.$$

Introduce $N$ local variables $x:=[x_1,\cdots,x_N]^T\in\mathbb{R}^N$, and suppose the graph is connected. Then it is clear that the following formulation is equivalent to the global consensus problem (2), and it is precisely problem (1):

$$\min_{x\in\mathbb{R}^N} f(x):=\sum_{i=1}^N f_i(x_i),\quad {\rm s.t.}\ Ax=0. \tag{3}$$
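To make the incidence-matrix construction concrete, the following sketch builds $A$ for a small graph and checks that $A^TA$ is indeed the signed graph Laplacian and that $Ax=0$ encodes consensus. The 4-node path graph is an assumed example, not one from the paper:

```python
import numpy as np

def incidence_matrix(num_nodes, edges):
    """Signed node-edge incidence matrix A (one row per edge).

    For an edge e connecting i and j with i > j: A[e, i] = 1 and
    A[e, j] = -1 (any consistent orientation yields the same A^T A).
    """
    A = np.zeros((len(edges), num_nodes))
    for e, (i, j) in enumerate(edges):
        A[e, max(i, j)] = 1.0
        A[e, min(i, j)] = -1.0
    return A

# a 4-node path graph: 0 - 1 - 2 - 3
A = incidence_matrix(4, [(0, 1), (1, 2), (2, 3)])
L_minus = A.T @ A  # signed graph Laplacian L_- = A^T A

# node degrees on the diagonal, -1 for each edge
expected = np.array([[ 1., -1.,  0.,  0.],
                     [-1.,  2., -1.,  0.],
                     [ 0., -1.,  2., -1.],
                     [ 0.,  0., -1.,  1.]])
assert np.allclose(L_minus, expected)

# for a connected graph, A x = 0 exactly when all x_i agree (consensus)
assert np.allclose(A @ np.full(4, 3.7), 0.0)
assert np.linalg.norm(A @ np.array([1.0, 1.0, 2.0, 1.0])) > 0
```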

Multi-Block Linearly Constrained Problem. Consider the following multi-block linearly constrained problem

$$\min_{\{y_i\in\mathbb{R}^N\}} f(y)\quad {\rm s.t.}\ \sum_{i=1}^K A_i y_i = b \tag{4}$$

where $A_i\in\mathbb{R}^{M\times N}$ and $b\in\mathbb{R}^M$; $y:=[y_1;\cdots;y_K]$.

Define a new variable $x:=[y_1;\cdots;y_K]\in\mathbb{R}^{NK}$ and a new matrix $C:=[A_1,\cdots,A_K]\in\mathbb{R}^{M\times NK}$. Then the above problem can also be cast as a special case of problem (1):

$$\min_{x\in\mathbb{R}^{NK}} f(x)\quad {\rm s.t.}\ Cx=b.$$
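The stacking step can be sanity-checked numerically; a small sketch (the random blocks are assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N = 3, 4, 2
A_blocks = [rng.standard_normal((M, N)) for _ in range(K)]  # A_1, ..., A_K
y_blocks = [rng.standard_normal(N) for _ in range(K)]       # y_1, ..., y_K

C = np.hstack(A_blocks)       # C = [A_1, ..., A_K], of size M x NK
x = np.concatenate(y_blocks)  # x = [y_1; ...; y_K], of size NK

# sum_i A_i y_i equals C x, so (4) is a special case of (1)
lhs = sum(Ai @ yi for Ai, yi in zip(A_blocks, y_blocks))
assert np.allclose(lhs, C @ x)
```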

Such problems, convex or nonconvex, have wide applications in practice, such as in distributed optimization and coordination (more specifically, the sharing problem) [10, 11], robust principal component analysis [12], and the rate maximization problem in downlink broadcast communication channels [13].

### 1.2 Literature Review.

The augmented Lagrangian (AL) method, or the method of multipliers, pioneered by Hestenes [14] and Powell [15], is a classical algorithm for solving nonlinear nonconvex constrained optimization problems [16, 17]. Many existing packages such as LANCELOT [18, 19] are implemented based on this method. Recently, due to the need to solve very large scale nonlinear optimization problems, the AL method and its variants have regained their popularity; see recent developments in [20, 21, 22] and the references therein. Reference [23] has also developed an AL based algorithm for nonconvex nonsmooth optimization, where subgradients of the augmented Lagrangian are used in the primal update. When the problem is convex and the constraints are linear, Lan and Monteiro [24] have analyzed the iteration complexity of the AL method. More specifically, they have characterized the total number of Nesterov's optimal iterations [25] required to reach high quality primal-dual solutions. However, despite the generality of these methods, it appears that the AL method does not decompose well over the optimization variables. Further, the AL method, at least in its classical form, is difficult to implement in a distributed manner.

In this paper, we answer the following research question: Is it possible to develop augmented Lagrangian-like decomposition schemes for the linearly constrained nonconvex problem (1), with global convergence rate guarantees? Ideally, the resulting algorithm should be able to decompose the updates of different variable blocks so that each of its steps can be easily implemented in a distributed manner and/or in parallel. Further, it is desirable that the resulting algorithm have global convergence and rate of convergence guarantees. To this end, we study a primal-dual algorithm, where the primal step minimizes a certain approximation of the augmented Lagrangian of problem (1), and the dual step performs an approximate dual ascent. The approximation used in the primal step is able to decompose the variables, making it possible to obtain simple subproblems by leveraging the problem structure. Theoretically, we show that whenever the penalty parameter in the augmented Lagrangian is larger than a given threshold, the Prox-PDA converges to the set of stationary solutions, globally and in a sublinear manner (i.e., a certain measure of stationarity decreases at the rate of $\mathcal{O}(1/r)$, where $r$ is the iteration counter). We also analyze various extensions of the algorithm, and discuss their applications to the distributed nonconvex optimization problem (3).

## 2 The Proposed Algorithm

The proposed algorithm builds upon the classical augmented Lagrangian method (also known as the method of multipliers) [16, 15]. Let us introduce the augmented Lagrangian for problem (1) as

$$L_\beta(x,\mu)=f(x)+\langle \mu, Ax-b\rangle+\frac{\beta}{2}\|Ax-b\|^2 \tag{5}$$

where $\mu\in\mathbb{R}^M$ is the Lagrangian dual variable and $\beta>0$ is a penalty parameter. Let $B$ be some arbitrary matrix (of appropriate size). Then the steps of the proposed proximal primal-dual algorithm are given in the following table:

Algorithm 1. The Proximal Primal-Dual Algorithm (Prox-PDA)

At iteration $0$, initialize $\mu^0=0$ and $x^0\in\mathbb{R}^N$. At each iteration $r+1$, update the variables by:

$$x^{r+1}=\arg\min_{x\in\mathbb{R}^N}\; f(x)+\langle \mu^r, Ax-b\rangle+\frac{\beta}{2}\|Ax-b\|^2+\frac{\beta}{2}\|x-x^r\|^2_{B^TB}; \tag{6a}$$

$$\mu^{r+1}=\mu^r+\beta(Ax^{r+1}-b). \tag{6b}$$

In Prox-PDA, the primal iteration (6a) minimizes the augmented Lagrangian plus a proximal term $\frac{\beta}{2}\|x-x^r\|^2_{B^TB}$. It is important to note that the proximal term is critical in both the algorithm implementation and the analysis. It is used to ensure the following key properties:

1. The primal problem is strongly convex, hence easily solvable;

2. The primal problem is decomposable over different variable blocks.

To see why the first point above is possible, suppose $B$ is chosen such that $A^TA+B^TB\succeq I_N$, and that $f$ has Lipschitz continuous gradient with constant $L$. Then by a result in [45, Theorem 2.1], we know that for any $\beta>L$, the objective function of the $x$-problem (6a) is strongly convex.

We illustrate the second point through an example. Consider the distributed optimization problem (3). Define the signless incidence matrix $B:=|A|$, where the absolute value is taken for each component of $A$, and $A$ is the signed incidence matrix defined in Section 1. Using this choice of $B$, we have $L_+:=B^TB$, which is the signless graph Laplacian whose $(i,i)$th diagonal entry is the degree of node $i$, and whose $(i,j)$th entry is $1$ if $e=(i,j)\in\mathcal{E}$, and $0$ otherwise. The $x$-update step (6a) then becomes

$$\begin{aligned} x^{r+1}&=\arg\min_{x}\;\sum_{i=1}^N f_i(x_i)+\langle \mu^r, Ax-b\rangle+\frac{\beta}{2}x^TL_-x+\frac{\beta}{2}(x-x^r)^TL_+(x-x^r)\\ &=\arg\min_{x}\;\sum_{i=1}^N f_i(x_i)+\langle \mu^r, Ax-b\rangle+\frac{\beta}{2}x^T(L_-+L_+)x-\beta x^TL_+x^r\\ &=\arg\min_{x}\;\sum_{i=1}^N f_i(x_i)+\langle \mu^r, Ax-b\rangle+\beta x^TDx-\beta x^TL_+x^r\end{aligned}$$

where $D:=\frac{1}{2}(L_-+L_+)$ is the degree matrix, with $D_{ii}=d_i$ denoting the degree of node $i$. Clearly this problem is separable over the nodes, and therefore it can be solved in a fully distributed manner.

We remark that one can always add an additional proximal term $\frac{1}{2}\|x-x^r\|^2_{\Omega}$ (where $\Omega$ is some positive semidefinite matrix) to the $x$-subproblem (6a); our analysis (to be presented shortly) remains valid. However, in order to reduce the notational burden, we will solely focus on analyzing Algorithm 1 as presented in the table. Also, Algorithm 1 admits many useful extensions. For example, one can solve the $x$-subproblem inexactly by performing a proximal gradient step, in which the linear approximation $f(x^r)+\langle\nabla f(x^r),x-x^r\rangle$ is used in place of $f(x)$ in (6a). We will discuss these extensions in later sections.
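To illustrate the structure of Algorithm 1, the following sketch runs Prox-PDA on the consensus problem (3) over a 3-node path graph, assuming convex quadratic local costs $f_i(x_i)=\frac{1}{2}(x_i-a_i)^2$ for illustration (so the separable $x$-update has a per-node closed form and convergence is not in doubt; this is a numerical sketch of the update structure, not the nonconvex analysis itself):

```python
import numpy as np

# consensus problem (3) on the path graph 0 - 1 - 2, with b = 0 and
# (assumed) quadratic local costs f_i(x_i) = 0.5 * (x_i - a_i)^2
a = np.array([0.0, 3.0, 6.0])
edges = [(0, 1), (1, 2)]
N = 3

A = np.zeros((len(edges), N))          # signed incidence matrix
for e, (i, j) in enumerate(edges):
    A[e, max(i, j)], A[e, min(i, j)] = 1.0, -1.0
B = np.abs(A)                          # signless incidence matrix
L_minus, L_plus = A.T @ A, B.T @ B     # signed / signless Laplacians
deg = np.diag(L_minus).copy()          # node degrees d_i

beta = 1.0
x, mu = np.zeros(N), np.zeros(len(edges))
for _ in range(2000):
    # (6a): since L_- + L_+ = 2D, each node solves its own scalar problem:
    #   x_i = (a_i - (A^T mu)_i + beta (L_+ x^r)_i) / (1 + 2 beta d_i)
    x_new = (a - A.T @ mu + beta * (L_plus @ x)) / (1.0 + 2.0 * beta * deg)
    # algebraic check: the subproblem optimality condition holds exactly
    grad = (x_new - a) + A.T @ mu + beta * (L_minus @ x_new) \
        + beta * L_plus @ (x_new - x)
    assert np.allclose(grad, 0.0, atol=1e-8)
    x = x_new
    mu = mu + beta * (A @ x)           # (6b): approximate dual ascent

# the iterates reach consensus at the average of the a_i
assert np.allclose(x, a.mean(), atol=1e-3)
assert np.linalg.norm(A @ x) < 1e-3
```

Here the per-node division is exactly the decomposition promised by the choice $B=|A|$: each agent only needs the multipliers on its incident edges and the previous iterates of its neighbors.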

## 3 The Convergence Analysis

In this section we provide convergence analysis for Algorithm 1. The key is the construction of a new potential function that decreases at every iteration of the algorithm. The constructed potential function is a conic combination of the augmented Lagrangian, a certain proximal term, and the size of the violation of the equality constraint; thus it measures the progress of both the primal and dual updates.

### 3.1 The Assumptions

We first state our main assumptions.

• [A1] The function $f$ is differentiable and has Lipschitz continuous gradient, i.e.,

$$\|\nabla f(x)-\nabla f(y)\|\le L\|x-y\|,\quad\forall\, x,y\in\mathbb{R}^N.$$

Further assume that $A^TA+B^TB\succeq I_N$.

• [A2] There exists a constant $\delta\ge 0$ such that

$$\exists\, \underline{f}>-\infty,\quad {\rm s.t.}\quad f(x)+\frac{\delta}{2}\|Ax-b\|^2\ge \underline{f},\quad\forall\, x\in\mathbb{R}^N.$$

Without loss of generality, and for simplicity of notation, in the following we will assume that $\underline{f}=0$. (This is without loss of generality because we have assumed that $\underline{f}$ is finite; therefore we can consider an equivalent problem with objective function $f(x)-\underline{f}$, which is always lower bounded by $0$.)

• [A3] The constraint $Ax=b$ is feasible over $x\in\mathbb{R}^N$.

Below we provide a few nonconvex smooth functions that satisfy Assumptions [A1]–[A3]. Note that the first few nonconvex functions are of particular interest in learning neural networks, as they are commonly used as activation functions.

• The sigmoid function. The sigmoid function is given by

$${\rm sigmoid}(x)=\frac{1}{1+e^{-x}}\in(0,1).$$

Clearly it satisfies [A2]. We have ${\rm sigmoid}'(x)={\rm sigmoid}(x)\big(1-{\rm sigmoid}(x)\big)\in(0,1/4]$, and the boundedness of this derivative (and, similarly, of the second-order derivative) implies that [A1] is true (by the first-order mean value theorem).

• The $\arctan$ function. Note that $\arctan(x)\in(-\pi/2,\pi/2)$, so it clearly satisfies [A2]. We have $\arctan'(x)=\frac{1}{1+x^2}\in(0,1]$, so the derivative is bounded, which implies that [A1] is true. Finally, note that

$$\arctan''(x)=\frac{-2x}{(1+x^2)^2},$$

so it is a nonconvex function.

• The $\tanh$ function. Note that we have

$$\tanh(x)\in[-1,1],\qquad \tanh'(x)=1-\tanh^2(x)\in[0,1].$$

Therefore the $\tanh$ function satisfies [A1]–[A2].

• The logit function. Since the logistic function is related to the $\tanh$ function as follows:

$$2\,{\rm logit}(x)=\frac{2e^x}{e^x+1}=1+\tanh(x/2),$$

then Assumptions [A1]-[A2] are again satisfied.

• The function $\log(1+x^2)$. This function has applications in structured matrix factorization [46]. The function itself is obviously nonconvex and lower bounded. Its first-order derivative is $\frac{2x}{1+x^2}$, which is also bounded.

• The quadratic function $f(x)=\frac{1}{2}x^TQx+\langle q,x\rangle$. Suppose that $Q$ is a symmetric matrix, but not necessarily positive semidefinite, and suppose that $f$ is strongly convex in the null space of $A$. Then it can be shown that there exists a large enough $\delta$ such that [A2] is true; see, e.g., [47, 16].

Additional nonconvex functions satisfying [A1]–[A2] can be constructed in a similar manner.
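The stated derivative bounds for $\tanh$ and $\arctan$ can be verified numerically; a small sketch:

```python
import numpy as np

xs = np.linspace(-10.0, 10.0, 20001)

# tanh is bounded in [-1, 1]  ->  [A2] holds with delta = 0
t = np.tanh(xs)
assert t.min() >= -1.0 and t.max() <= 1.0

# tanh'(x) = 1 - tanh(x)^2 lies in [0, 1]
dt = 1.0 - t**2
assert dt.min() >= 0.0 and dt.max() <= 1.0

# tanh''(x) = -2 tanh(x)(1 - tanh(x)^2) changes sign, so tanh is nonconvex
d2t = -2.0 * t * (1.0 - t**2)
assert d2t.min() < 0.0 < d2t.max()

# arctan''(x) = -2x / (1 + x^2)^2 also changes sign
d2a = -2.0 * xs / (1.0 + xs**2) ** 2
assert d2a.min() < 0.0 < d2a.max()
```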

### 3.2 The Analysis Steps

Below we provide the analysis of Prox-PDA. Our analysis consists of a series of lemmas leading to a theorem characterizing the convergence and iteration complexity of Prox-PDA.

In the first step we provide a bound on the size of the constraint violation using a quantity related to the primal iterates. Let $\sigma_{\min}(A^TA)$ denote the smallest non-zero eigenvalue of $A^TA$. We have the following result.

###### Lemma 3.1

Suppose Assumptions [A1] and [A3] are satisfied. Then the following is true for Prox-PDA.

$$\frac{1}{\beta}\|\mu^{r+1}-\mu^r\|^2\le \frac{2L^2}{\beta\,\sigma_{\min}(A^TA)}\|x^r-x^{r+1}\|^2+\frac{2\beta}{\sigma_{\min}(A^TA)}\big\|B^TB\big((x^{r+1}-x^r)-(x^r-x^{r-1})\big)\big\|^2. \tag{7}$$

Proof. From the optimality condition of the problem (6a) we have

$$\nabla f(x^{r+1})+A^T\mu^r+\beta A^T(Ax^{r+1}-b)+\beta B^TB(x^{r+1}-x^r)=0.$$

Applying (6b), we have

$$A^T\mu^{r+1}=-\nabla f(x^{r+1})-\beta B^TB(x^{r+1}-x^r). \tag{8}$$

Note that by Assumption [A3], $b$ lies in the column space of $A$. Therefore we must have

$$\mu^{r+1}-\mu^r=\beta(Ax^{r+1}-b)\in{\rm col}(A).$$

That is, the difference of the dual variables lies in the column space of $A$. Also, from the initialization $\mu^0=0$, we have that the dual variable itself lies in the column space of $A$:

$$\mu^r=\beta\sum_{t=1}^r(Ax^t-b)\in{\rm col}(A),\quad\forall\, r=1,\cdots. \tag{9}$$

Applying these facts to (8), and recalling that $\sigma_{\min}(A^TA)$ denotes the smallest non-zero eigenvalue of $A^TA$, we have

$$\sigma^{1/2}_{\min}(A^TA)\,\|\mu^{r+1}-\mu^r\|\le \|A^T(\mu^{r+1}-\mu^r)\|.$$

This inequality combined with (8) implies that

$$\begin{aligned}\|\mu^{r+1}-\mu^r\| &\le \frac{1}{\sigma^{1/2}_{\min}(A^TA)}\big\|-\nabla f(x^{r+1})-\beta B^TB(x^{r+1}-x^r)-\big(-\nabla f(x^r)-\beta B^TB(x^r-x^{r-1})\big)\big\|\\ &=\frac{1}{\sigma^{1/2}_{\min}(A^TA)}\big\|\nabla f(x^r)-\nabla f(x^{r+1})-\beta B^TB\big((x^{r+1}-x^r)-(x^r-x^{r-1})\big)\big\|.\end{aligned}$$

Squaring both sides, dividing by $\beta$, and applying the inequality $\|u+v\|^2\le 2\|u\|^2+2\|v\|^2$ together with the Lipschitz continuity of $\nabla f$ (Assumption [A1]), we obtain the desired result. Q.E.D.

Our second step bounds the descent of the augmented Lagrangian.

###### Lemma 3.2

Suppose Assumptions [A1] and [A3] are satisfied. Then the following is true for Algorithm 1

$$L_\beta(x^{r+1},\mu^{r+1})-L_\beta(x^r,\mu^r)\le -\left(\frac{\beta-L}{2}-\frac{2L^2}{\beta\,\sigma_{\min}(A^TA)}\right)\|x^{r+1}-x^r\|^2+\frac{2\beta}{\sigma_{\min}(A^TA)}\big\|B^TB\big((x^{r+1}-x^r)-(x^r-x^{r-1})\big)\big\|^2. \tag{10}$$

Proof. Since $f$ has Lipschitz continuous gradient and $A^TA+B^TB\succeq I_N$ by Assumption [A1], it is known that if $\beta>L$, then the $x$-subproblem (6a) is strongly convex with modulus $\gamma:=\beta-L$; cf. [45, Theorem 2.1]. That is, we have

$$\begin{aligned}&L_\beta(x,\mu^r)+\frac{\beta}{2}\|x-x^r\|^2_{B^TB}-\Big(L_\beta(z,\mu^r)+\frac{\beta}{2}\|z-x^r\|^2_{B^TB}\Big)\\ &\qquad\ge \big\langle \nabla_x L_\beta(z,\mu^r)+\beta B^TB(z-x^r),\,x-z\big\rangle+\frac{\gamma}{2}\|x-z\|^2,\quad\forall\, x,z\in\mathbb{R}^N,\ \forall\,\mu^r. \tag{11}\end{aligned}$$

Using this property, we have

$$\begin{aligned} &L_\beta(x^{r+1},\mu^{r+1})-L_\beta(x^r,\mu^r)\\ &=L_\beta(x^{r+1},\mu^{r+1})-L_\beta(x^{r+1},\mu^r)+L_\beta(x^{r+1},\mu^r)-L_\beta(x^r,\mu^r)\\ &\le L_\beta(x^{r+1},\mu^{r+1})-L_\beta(x^{r+1},\mu^r)+L_\beta(x^{r+1},\mu^r)+\frac{\beta}{2}\|x^{r+1}-x^r\|^2_{B^TB}-L_\beta(x^r,\mu^r)\\ &\stackrel{\rm (i)}{\le}\frac{\|\mu^{r+1}-\mu^r\|^2}{\beta}+\big\langle \nabla_x L_\beta(x^{r+1},\mu^r)+\beta B^TB(x^{r+1}-x^r),\,x^{r+1}-x^r\big\rangle-\frac{\gamma}{2}\|x^{r+1}-x^r\|^2\\ &\stackrel{\rm (ii)}{\le}-\frac{\gamma}{2}\|x^{r+1}-x^r\|^2+\frac{\|\mu^{r+1}-\mu^r\|^2}{\beta}\\ &\le-\frac{\gamma}{2}\|x^{r+1}-x^r\|^2+\frac{1}{\sigma_{\min}(A^TA)}\left(\frac{2L^2}{\beta}\|x^r-x^{r+1}\|^2+2\beta\big\|B^TB\big((x^{r+1}-x^r)-(x^r-x^{r-1})\big)\big\|^2\right)\\ &=-\left(\frac{\beta-L}{2}-\frac{2L^2}{\beta\,\sigma_{\min}(A^TA)}\right)\|x^{r+1}-x^r\|^2+\frac{2\beta}{\sigma_{\min}(A^TA)}\big\|B^TB\big((x^{r+1}-x^r)-(x^r-x^{r-1})\big)\big\|^2 \tag{12}\end{aligned}$$

where in (i) we have used (11) with the identification $x=x^r$ and $z=x^{r+1}$, together with the dual update (6b); in (ii) we have used the optimality condition of the $x$-subproblem (6a); and the last inequality follows from Lemma 3.1. The claim is proved. Q.E.D.

A key observation from Lemma 3.2 is that, no matter how large $\beta$ is, the rhs of (10) cannot be made negative, as the second term is increasing in $\beta$. This observation suggests that the augmented Lagrangian alone cannot serve as the potential function for Prox-PDA.

In search of an appropriate potential function, we need a new object that is decreasing in the order of $\beta\|(x^{r+1}-x^r)-(x^r-x^{r-1})\|^2_{B^TB}$. The following lemma shows that the descent of the sum of the constraint violation and the proximal term has the desired term.

###### Lemma 3.3

Suppose Assumption [A1] is satisfied. Then the following is true

$$\begin{aligned}&\frac{\beta}{2}\big(\|Ax^{r+1}-b\|^2+\|x^{r+1}-x^r\|^2_{B^TB}\big)\\ &\quad\le L\|x^{r+1}-x^r\|^2+\frac{\beta}{2}\big(\|x^r-x^{r-1}\|^2_{B^TB}+\|Ax^r-b\|^2\big)\\ &\qquad-\frac{\beta}{2}\big(\|(x^r-x^{r-1})-(x^{r+1}-x^r)\|^2_{B^TB}+\|A(x^{r+1}-x^r)\|^2\big). \tag{13}\end{aligned}$$

Proof. From the optimality condition of the $x$-subproblem (6a) we have

$$\big\langle \nabla f(x^{r+1})+A^T\mu^r+\beta A^T(Ax^{r+1}-b)+\beta B^TB(x^{r+1}-x^r),\,x^{r+1}-x\big\rangle\le 0,\quad\forall\, x\in\mathbb{R}^N;$$

$$\big\langle \nabla f(x^r)+A^T\mu^{r-1}+\beta A^T(Ax^r-b)+\beta B^TB(x^r-x^{r-1}),\,x^r-x\big\rangle\le 0,\quad\forall\, x\in\mathbb{R}^N.$$

Plugging $x=x^r$ into the first inequality and $x=x^{r+1}$ into the second, adding the resulting inequalities, and utilizing the $\mu$-update step (6b), we obtain

$$\big\langle \nabla f(x^{r+1})-\nabla f(x^r)+A^T(\mu^{r+1}-\mu^r)+\beta B^TB\big((x^{r+1}-x^r)-(x^r-x^{r-1})\big),\,x^{r+1}-x^r\big\rangle\le 0.$$

Rearranging, we have

$$\big\langle A^T(\mu^{r+1}-\mu^r),\,x^{r+1}-x^r\big\rangle\le -\big\langle \nabla f(x^{r+1})-\nabla f(x^r)+\beta B^TB\big((x^{r+1}-x^r)-(x^r-x^{r-1})\big),\,x^{r+1}-x^r\big\rangle. \tag{14}$$

Let us bound the lhs and the rhs of (14) separately.

First, the lhs of (14) can be expressed as

$$\begin{aligned}\big\langle A^T(\mu^{r+1}-\mu^r),\,x^{r+1}-x^r\big\rangle &=\big\langle \beta A^T(Ax^{r+1}-b),\,x^{r+1}-x^r\big\rangle\\ &=\big\langle \beta(Ax^{r+1}-b),\,Ax^{r+1}-b-(Ax^r-b)\big\rangle\\ &=\beta\|Ax^{r+1}-b\|^2-\beta\big\langle Ax^{r+1}-b,\,Ax^r-b\big\rangle\\ &=\frac{\beta}{2}\big(\|Ax^{r+1}-b\|^2-\|Ax^r-b\|^2+\|A(x^{r+1}-x^r)\|^2\big). \tag{15}\end{aligned}$$

Second, we have the following bound for the rhs of (14):

$$\begin{aligned}&-\big\langle \nabla f(x^{r+1})-\nabla f(x^r)+\beta B^TB\big((x^{r+1}-x^r)-(x^r-x^{r-1})\big),\,x^{r+1}-x^r\big\rangle\\ &\quad\le L\|x^{r+1}-x^r\|^2-\beta\big\langle B^TB\big((x^{r+1}-x^r)-(x^r-x^{r-1})\big),\,x^{r+1}-x^r\big\rangle\\ &\quad=L\|x^{r+1}-x^r\|^2+\frac{\beta}{2}\big(\|x^r-x^{r-1}\|^2_{B^TB}-\|x^{r+1}-x^r\|^2_{B^TB}-\|(x^r-x^{r-1})-(x^{r+1}-x^r)\|^2_{B^TB}\big). \tag{16}\end{aligned}$$

Combining the bounds (15) and (16) with (14), we have

$$\begin{aligned}&\frac{\beta}{2}\big(\|Ax^{r+1}-b\|^2+\|x^{r+1}-x^r\|^2_{B^TB}\big)\\ &\quad\le L\|x^{r+1}-x^r\|^2+\frac{\beta}{2}\big(\|x^r-x^{r-1}\|^2_{B^TB}+\|Ax^r-b\|^2\big)\\ &\qquad-\frac{\beta}{2}\big(\|(x^r-x^{r-1})-(x^{r+1}-x^r)\|^2_{B^TB}+\|A(x^{r+1}-x^r)\|^2\big).\end{aligned}$$

The desired claim is proved. Q.E.D.

It is interesting to observe that the new object, $\frac{\beta}{2}\big(\|Ax^{r+1}-b\|^2+\|x^{r+1}-x^r\|^2_{B^TB}\big)$, increases in $\|x^{r+1}-x^r\|^2$ and decreases in $\|(x^{r+1}-x^r)-(x^r-x^{r-1})\|^2_{B^TB}$, while the augmented Lagrangian behaves in the opposite manner (cf. Lemma 3.2). More importantly, in our new object the constant in front of $\|x^{r+1}-x^r\|^2$ is independent of $\beta$. Although neither of these two objects decreases by itself, quite surprisingly, a proper conic combination of them decreases at every iteration of Prox-PDA. To precisely state the claim, let us define the potential function for Algorithm 1 as

$$P_{c,\beta}(x^{r+1},x^r,\mu^{r+1})=L_\beta(x^{r+1},\mu^{r+1})+\frac{c\beta}{2}\big(\|Ax^{r+1}-b\|^2+\|x^{r+1}-x^r\|^2_{B^TB}\big) \tag{17}$$

where $c>0$ is some constant to be determined later. We have the following result.

###### Lemma 3.4

Suppose the assumptions made in Lemmas 3.1–3.3 are satisfied. Then we have the following estimate:

$$\begin{aligned}P_{c,\beta}(x^{r+1},x^r,\mu^{r+1}) &\le P_{c,\beta}(x^r,x^{r-1},\mu^r)-\left(\frac{\beta-L}{2}-\frac{2L^2}{\beta\,\sigma_{\min}(A^TA)}-cL\right)\|x^{r+1}-x^r\|^2\\ &\quad-\left(\frac{c\beta}{2}-\frac{2\beta\|B^TB\|}{\sigma_{\min}(A^TA)}\right)\big\|(x^{r+1}-x^r)-(x^r-x^{r-1})\big\|^2_{B^TB}. \tag{18}\end{aligned}$$

Proof. Multiplying both sides of (13) by the constant $c$ and then adding them to (10), we obtain

$$\begin{aligned}&L_\beta(x^{r+1},\mu^{r+1})+\frac{c\beta}{2}\big(\|Ax^{r+1}-b\|^2+\|x^{r+1}-x^r\|^2_{B^TB}\big)\\ &\le L_\beta(x^r,\mu^r)+cL\|x^{r+1}-x^r\|^2+\frac{c\beta}{2}\big(\|x^r-x^{r-1}\|^2_{B^TB}+\|Ax^r-b\|^2\big)\\ &\quad-\left(\frac{\beta-L}{2}-\frac{2L^2}{\beta\,\sigma_{\min}(A^TA)}\right)\|x^{r+1}-x^r\|^2+\frac{2\beta}{\sigma_{\min}(A^TA)}\big\|B^TB\big((x^{r+1}-x^r)-(x^r-x^{r-1})\big)\big\|^2\\ &\quad-\frac{c\beta}{2}\big(\|(x^r-x^{r-1})-(x^{r+1}-x^r)\|^2_{B^TB}+\|A(x^{r+1}-x^r)\|^2\big)\\ &\le L_\beta(x^r,\mu^r)+\frac{c\beta}{2}\big(\|x^r-x^{r-1}\|^2_{B^TB}+\|Ax^r-b\|^2\big)\\ &\quad-\left(\frac{\beta-L}{2}-\frac{2L^2}{\beta\,\sigma_{\min}(A^TA)}-cL\right)\|x^{r+1}-x^r\|^2\\ &\quad-\left(\frac{c\beta}{2}-\frac{2\beta\|B^TB\|}{\sigma_{\min}(A^TA)}\right)\big\|(x^{r+1}-x^r)-(x^r-x^{r-1})\big\|^2_{B^TB}.\end{aligned}$$

The desired result is proved. Q.E.D.

From the above analysis, it is easy to see that as long as $c$ and $\beta$ are chosen large enough, the potential function decreases at each iteration of Prox-PDA. Below we derive precise bounds for $c$ and $\beta$.

First, it is clear that a sufficient condition for $\frac{c\beta}{2}-\frac{2\beta\|B^TB\|}{\sigma_{\min}(A^TA)}\ge 0$ is

$$c\ge\max\left\{\frac{\delta}{L},\ \frac{4\|B^TB\|}{\sigma_{\min}(A^TA)}\right\}. \tag{19}$$

Note that the term $\delta/L$ (where $\delta$ is defined in Assumption [A2]) in the max operator is needed for later use (cf. Lemma 3.5). Importantly, this bound on $c$ is independent of $\beta$.

Second, for any given $c$, we need $\beta$ to satisfy

$$\frac{\beta-L}{2}-\frac{2L^2}{\beta\,\sigma_{\min}(A^TA)}-cL>0,$$

which, after multiplying through by $2\beta>0$ and taking the larger root of the resulting quadratic inequality $\beta^2-(2c+1)L\beta-\frac{4L^2}{\sigma_{\min}(A^TA)}>0$, implies the following:

$$\beta>\frac{L}{2}\left(2c+1+\sqrt{(2c+1)^2+\frac{16}{\sigma_{\min}(A^TA)}}\right). \tag{20}$$

Clearly, combining the bounds for $c$ and $\beta$, we see that $\beta>L$. We conclude that if both (19) and (20) are satisfied, then the potential function decreases at every iteration.
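As a numerical sanity check, the following sketch computes an admissible pair $(c,\beta)$ from (19)–(20) for a small path-graph instance (the instance, $L=1$, $\delta=0$, and the 1% margin on $c$ are assumptions for illustration) and verifies that both descent coefficients in (18) are then positive:

```python
import numpy as np

# hypothetical instance: path graph on 4 nodes, with L = 1 and delta = 0
edges = [(0, 1), (1, 2), (2, 3)]
N, L, delta = 4, 1.0, 0.0
A = np.zeros((len(edges), N))
for e, (i, j) in enumerate(edges):
    A[e, max(i, j)], A[e, min(i, j)] = 1.0, -1.0
B = np.abs(A)

AtA, BtB = A.T @ A, B.T @ B
eigs = np.linalg.eigvalsh(AtA)
sigma_min = min(v for v in eigs if v > 1e-9)   # smallest NONZERO eigenvalue
norm_BtB = np.linalg.eigvalsh(BtB).max()

# (19), with a 1% margin so the second coefficient in (18) is strictly > 0
c = 1.01 * max(delta / L, 4.0 * norm_BtB / sigma_min)

# (20): beta strictly above the larger root of the quadratic in beta
beta = 0.5 * L * (2*c + 1 + np.sqrt((2*c + 1)**2 + 16.0 / sigma_min)) + 1e-3

# both descent coefficients in (18) should now be positive, and beta > L
coef1 = (beta - L) / 2 - 2 * L**2 / (beta * sigma_min) - c * L
coef2 = c * beta / 2 - 2 * beta * norm_BtB / sigma_min
assert coef1 > 0 and coef2 > 0 and beta > L
```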

Our next step shows that, with the particular choices of $c$ and $\beta$ in (19) and (20), the constructed potential function is lower bounded.

###### Lemma 3.5

Suppose [A1]–[A3] are satisfied, and $(c,\beta)$ are chosen according to (19) and (20). Then the following statement holds true:

$$\exists\ \underline{P}\quad {\rm s.t.}\quad P_{c,\beta}(x^{r+1},x^r,\mu^{r+1})\ge \underline{P}>-\infty,\quad\forall\, r>0. \tag{21}$$

Proof. To prove this we need to utilize the boundedness assumption in [A2].

First, we can express the augmented Lagrangian function as follows:

$$\begin{aligned}L_\beta(x^{r+1},\mu^{r+1}) &=f(x^{r+1})+\big\langle \mu^{r+1},Ax^{r+1}-b\big\rangle+\frac{\beta}{2}\|Ax^{r+1}-b\|^2\\ &=f(x^{r+1})+\frac{1}{\beta}\big\langle \mu^{r+1},\mu^{r+1}-\mu^r\big\rangle+\frac{\beta}{2}\|Ax^{r+1}-b\|^2\\ &=f(x^{r+1})+\frac{1}{2\beta}\big(\|\mu^{r+1}\|^2-\|\mu^r\|^2+\|\mu^{r+1}-\mu^r\|^2\big)+\frac{\beta}{2}\|Ax^{r+1}-b\|^2.\end{aligned}$$

Therefore, summing over $r=1,\cdots,T$, we obtain

$$\sum_{r=1}^T L_\beta(x^{r+1},\mu^{r+1})=\sum_{r=1}^T\left(f(x^{r+1})+\frac{\beta}{2}\|Ax^{r+1}-b\|^2+\frac{1}{2\beta}\|\mu^{r+1}-\mu^r\|^2\right)+\frac{1}{2\beta}\big(\|\mu^{T+1}\|^2-\|\mu^1\|^2\big).$$

Suppose Assumption [A2] is satisfied and $(c,\beta)$ are chosen according to (19) and (20) (which implies $\beta>cL\ge\delta$); then clearly the above sum is lower bounded, since

$$f(x)+\frac{\beta}{2}\|Ax-b\|^2\ge f(x)+\frac{\delta}{2}\|Ax-b\|^2\ge 0,\quad\forall\, x\in\mathbb{R}^N.$$

This fact implies that the sum of the potential function is also lower bounded (note that the remaining terms in the potential function are all nonnegative); that is,

$$\sum_{r=1}^T P_{c,\beta}(x^{r+1},x^r,\mu^{r+1})>-\infty,\quad\forall\, T>0.$$

Note that if $c$ and $\beta$ are chosen according to (19) and (20), then $P_{c,\beta}$ is nonincreasing. Combined with the lower boundedness of the sum of the potential function, we conclude that the following is true:

$$P_{c,\beta}(x^{r+1},x^r,\mu^{r+1})>-\infty,\quad\forall\, r>0. \tag{22}$$

This completes the proof. Q.E.D.

Now we are ready to present the main result of this section on the convergence and the rate of convergence of Prox-PDA. To this end, define $Q(x^{r+1},\mu^r)$ as the optimality gap of problem (1), given by

$$Q(x^{r+1},\mu^r):=\|\nabla_x L_\beta(x^{r+1},\mu^r)\|^2+\|Ax^{r+1}-b\|^2. \tag{23}$$

It is easy to see that $Q(x^{r+1},\mu^r)\to 0$ implies that every limit point $(x^*,\mu^*)$ is a KKT point of (1), i.e., it satisfies the following conditions:

$$0=\nabla f(x^*)+A^T\mu^*,\qquad Ax^*=b. \tag{24}$$

To see this, first observe that $\|Ax^{r+1}-b\|^2\to 0$ implies $Ax^*=b$; therefore the second condition in (24) holds. Second, using the fact that the first term in the optimality gap also goes to zero, we have

$$\lim_{r\to\infty}\nabla f(x^{r+1})+A^T\mu^r+\beta A^T(Ax^{r+1}-b)=\nabla f(x^*)+A^T\mu^*=0.$$

Therefore the first condition in (24) is true.
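As a concrete check of (23)–(24), for a small consensus instance (quadratic local costs assumed purely for illustration), the averaged point together with a least-squares multiplier drives the gap $Q$ to zero:

```python
import numpy as np

# consensus instance (b = 0) on the path graph 0 - 1 - 2, with assumed
# quadratic local costs f_i(x_i) = 0.5 * (x_i - a_i)^2
a = np.array([0.0, 3.0, 6.0])
A = np.array([[-1.0, 1.0, 0.0],
              [0.0, -1.0, 1.0]])       # signed incidence matrix (E x N)
beta = 1.0

# candidate primal point: the average (the consensus solution here), and
# a multiplier solving the first KKT condition in the least-squares sense
x_star = np.full(3, a.mean())
grad_f = x_star - a                    # gradient of the quadratic costs
mu_star, *_ = np.linalg.lstsq(A.T, -grad_f, rcond=None)

# optimality gap (23): || grad_x L_beta ||^2 + || A x - b ||^2
grad_L = grad_f + A.T @ mu_star + beta * A.T @ (A @ x_star)
Q = np.linalg.norm(grad_L)**2 + np.linalg.norm(A @ x_star)**2
assert Q < 1e-12                       # the conditions (24) hold
```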

In the following result we show that the optimality gap not only decreases to zero, but does so in a sublinear manner. This is the main result of this section.

###### Theorem 3.1

Suppose Assumptions [A1]–[A3] are satisfied. Further suppose that the conditions on $c$ and $\beta$ in (19) and (20) are satisfied. Then we have the following claims for the sequence $\{x^r,\mu^r\}$ generated by Prox-PDA.

• (Eventual Feasibility). The constraint $Ax=b$ is satisfied in the limit, i.e.,

$$\mu^{r+1}-\mu^r\to 0,\qquad Ax^r\to b,\qquad {\rm and}\qquad x^{r+1}-x^r\to 0,\quad {\rm as}\ r\to\infty.$$
• (Boundedness of Sequence). Further, if $\{\nabla f(x^r)\}$ is bounded for all $r$, then the dual sequence $\{\mu^r\}$ is also bounded; if $f(x)+\frac{\beta}{2}\|Ax-b\|^2$ is coercive, then the primal sequence $\{x^r\}$ is bounded.

• (Convergence to Stationary Points). Every limit point of the iterates $\{x^r,\mu^r\}$ generated by Algorithm 1 is a stationary point of problem (1). Further, $Q(x^{r+1},\mu^r)\to 0$.

• (Sublinear Convergence Rate). For any given $\varphi>0$, let us define $T$ to be the first time that the optimality gap reaches below $\varphi$, i.e.,

$$T:=\arg\min_r\ Q(x^{r+1},\mu^r)\le\varphi.$$

Then there exists a constant $\nu>0$ such that the following is true:

$$\varphi\le\frac{\nu}{T-1}.$$

That is, the optimality gap converges sublinearly.

Proof. First we prove part (1). Combining Lemmas 3.4 and 3.5, we conclude that $x^{r+1}-x^r\to 0$ and $\|(x^{r+1}-x^r)-(x^r-x^{r-1})\|^2_{B^TB}\to 0$. Then, according to (7), in the limit we have $\mu^{r+1}-\mu^r\to 0$, or equivalently $\beta(Ax^{r+1}-b)\to 0$. That is, the constraint violation vanishes in the limit.

Then we prove part (2). From the optimality condition of the $x$-update step (6a) we have

$$\nabla f(x^{r+1})+A^T\mu^r+\beta A^T(Ax^{r+1}-b)+\beta B^TB(x^{r+1}-x^r)=0.$$

Then we argue that $\{\mu^r\}$ is a bounded sequence if $\{\nabla f(x^r)\}$ is bounded. Indeed, the facts that $x^{r+1}-x^r\to 0$ and $\mu^{r+1}-\mu^r\to 0$ imply that both $\beta B^TB(x^{r+1}-x^r)$ and $\beta A^T(Ax^{r+1}-b)$ are bounded. The boundedness of $\{\mu^r\}$ then follows from the assumption that $\nabla f(x^r)$ is bounded for all $r$, and from the fact that $\mu^r$ lies in the column space of $A$ (cf. (9)).

Then we argue that $\{x^r\}$ is bounded if