A second order primal-dual method for
nonsmooth convex composite optimization
We develop a second order primal-dual method for optimization problems in which the objective function is given by the sum of a strongly convex twice differentiable term and a possibly nondifferentiable convex regularizer. After introducing an auxiliary variable, we utilize the proximal operator of the nonsmooth regularizer to transform the associated augmented Lagrangian into a function that is once, but not twice, continuously differentiable. The saddle point of this function corresponds to the solution of the original optimization problem. We employ a generalization of the Hessian to define second order updates on this function and prove global exponential stability of the corresponding differential inclusion. Furthermore, we develop a globally convergent customized algorithm that utilizes the primal-dual augmented Lagrangian as a merit function. We show that the search direction can be computed efficiently and prove quadratic/superlinear asymptotic convergence. We use the -regularized least squares problem and the problem of designing a distributed controller for a spatially-invariant system to demonstrate the merits and the effectiveness of our method.
We study a class of composite optimization problems in which the objective function is given by the sum of a differentiable strongly convex component and a nondifferentiable convex component. Problems of this form are encountered in diverse fields including compressive sensing, machine learning, statistics, image processing, and control [1, 2, 3, 4, 5]. They often arise in structured feedback synthesis where it is desired to balance controller performance (e.g., the closed-loop or norm) with structural complexity [5, 6].
The lack of differentiability in the regularization term precludes the use of standard descent methods for smooth objective functions. Proximal gradient methods [7, 8, 9] and their accelerated variants  generalize gradient descent, but typically require the nonsmooth term to be separable.
An alternative approach introduces an auxiliary variable to split the smooth and nonsmooth components of the objective function. The reformulated problem facilitates the use of splitting methods such as the alternating direction method of multipliers (ADMM) . This augmented-Lagrangian-based method divides the optimization problem into simpler subproblems, allows for a broader class of regularizers than proximal gradient, and it is convenient for distributed implementation. In , we exploited the structure of proximal operators associated with nonsmooth regularizers and introduced the proximal augmented Lagrangian. This continuously differentiable function enables the use of the standard method of multipliers (MM) for nonsmooth optimization and it is convenient for distributed implementation via the Arrow-Hurwicz-Uzawa gradient flow dynamics. Recent work has extended MM to incorporate second order updates of the primal and dual variables [13, 14, 15] for nonconvex problems with twice continuously differentiable objective functions.
Since first order approaches tend to converge slowly to high-accuracy solutions, much work has focused on developing second order methods for nonsmooth composite optimization. A generalization of Newton’s method was developed in [16, 17, 18, 19] and it requires solving a regularized quadratic subproblem to determine a search direction. Related ideas have been utilized for sparse inverse covariance estimation in graphical models [20, 21] and topology design in consensus networks .
Generalized Newton updates for identifying stationary points of (strongly) semismooth gradient mappings were first considered in [23, 24, 25]. In [26, 27, 28], the authors introduce the once-continuously differentiable Forward-Backward Envelope (FBE) and solve composite problems by minimizing the FBE using line search, quasi-Newton methods, or second order updates based on an approximation of the generalized Hessian.
We develop a second order primal-dual algorithm for nonsmooth composite optimization by leveraging these advances. Second order updates for the once continuously differentiable proximal augmented Lagrangian are formed using a generalized Hessian. We employ the merit function introduced in  to assess progress towards the optimal solution and develop a globally convergent customized algorithm with fast asymptotic convergence rate. When the proximal operator associated with a nonsmooth regularizer is (strongly) semismooth, our algorithm exhibits local (quadratic) superlinear convergence.
Our presentation is organized as follows. In Section II, we formulate the problem and provide necessary background material. In Section III, we define second order updates to find the saddle points of the proximal augmented Lagrangian. In Section IV, we prove global exponential stability of a continuous-time differential inclusion. In Section V, we develop a customized algorithm that converges globally to the saddle point and exhibits superlinear or quadratic asymptotic convergence rates. In Section VI, we provide examples to illustrate the utility of our method. We discuss interpretations of our approach and connections to the alternative algorithms in Section VII and conclude the paper in Section VIII.
Ii Problem formulation and background
We consider the problem of minimizing the sum of two convex functions over an optimization variable ,
where . Problem (1) was originally formulated in the context of compressive sensing to incorporate structural considerations into traditional sensing or regression [1, 2, 3]. As specified in Assumption 1, the differentiable part of the objective function , which quantifies loss or performance, is strictly convex with a Lipschitz continuous gradient. In contrast, the regularization function may be nondifferentiable and is used to incorporate structural requirements on the optimization variable. In structured feedback synthesis [5, 6], typically quantifies the closed-loop performance, e.g., the norm, and imposes structural requirements on the controller, e.g., by penalizing the amount of network traffic [29, 30, 31, 32]. Although our strongest results require strong convexity of , our theory and techniques are applicable as long as the Hessian of is positive definite.
The function is twice continuously differentiable, has an Lipschitz continuous gradient , and is strictly convex with ; the function is proper, lower semicontinuous, and convex; and the matrix has full row rank.
The matrix is important when the desired structure has a simple representation in the co-domain of , but it makes the problem more challenging. One approach is to reformulate (1) by introducing an auxiliary optimization variable ,
Problem (2) is convenient for constrained optimization algorithms based on the augmented Lagrangian,
where is the Lagrange multiplier and is a positive parameter. Relative to the standard Lagrangian, contains an additional quadratic penalty on the linear constraint in (2).
In the remainder of this section, we provide background on proximal operators and describe generalizations of the gradient for nondifferentiable functions. We also briefly overview existing approaches for solving (1).
Ii-A1 Proximal operators
The proximal operator of the function is the minimizer of the sum of and a proximal term,
|where is a positive parameter and is a given vector. When is convex, its proximal operator is Lipschitz continuous with parameter , differentiable almost everywhere, and firmly non-expansive . The value function associated with (3a) specifies the Moreau envelope of ,|
|The Moreau envelope is continuously differentiable, even when is not, and its gradient|
is Lipschitz continuous with parameter .
For example, the proximal operator associated with the norm, , is determined by soft-thresholding,
the associated Moreau envelope is the Huber function,
and its gradient is the saturation function,
Other regularizers with efficiently computable proximal operators include indicator functions of simple convex sets as well as the nuclear and Frobenius norms of matricial variables. Such regularizers can enforce bounds on , promote low rank solutions, and enhance group sparsity, respectively.
Ii-A2 Generalization of the gradient and Jacobian
Although is typically not differentiable, it is Lipschitz continuous and therefore differentiable almost everywhere . One generalization of the gradient for such functions is given by the -subdifferential set , which applies to locally Lipschitz continuous functions : . Let be a set at which is differentiable. Each element in the set is the limit point of a sequence of gradients evaluated at a sequence of points whose limit is ,
If is continuously differentiable in the neighborhood of a point , the -subdifferential set becomes single valued and it is given by the gradient, . In general, is not a convex set; e.g., if , .
The Clarke subdifferential set of : at is the convex hull of the -subdifferential set ,
When is a convex function, the Clarke subdifferential set is equal to the subdifferential set which defines the supporting hyperplanes of at [36, Chapter VI]. For a function : , the -generalization of the Jacobian at a point is given by
where each is a member of the -subdifferential set of the th component of evaluated at . The Clarke generalization of the Jacobian at a point , , has the same structure where each is a member of the Clarke subdifferential of .
The mapping : is semismooth at if for any sequence , the sequence of Clarke generalized Jacobians provides a first order approximation of ,
where denotes that as tends to infinity . The function is strongly semismooth if this approximation satisfies the stronger condition,
where signifies that for some positive constant and positive .
(Strong) semismoothness of the proximal operator leads to fast asymptotic convergence of the differential inclusion (see Section IV) and the efficient algorithm (see Section V). Proximal operators associated with many typical regularization functions (e.g., the and nuclear norms , piecewise quadratic functions , and indicator functions of affine convex sets ) are strongly semismooth. In general, semismoothness of follows from semismoothness of the projection onto the epigraph of . However, there are convex sets onto which projection is not directionally differentiable . The indicator functions associated with such sets or functions whose epigraph is described by such sets may induce proximal operators which are not semismooth.
Ii-B Existing methods
Problem (1) is encountered in a host of applications and it has been the subject of extensive study. Herein, we provide a brief overview of existing approaches to solving it.
Ii-B1 First order methods
When is identity or a diagonal matrix, the proximal gradient method, which generalizes gradient descent to certain classes of nonsmooth composite optimization problems [10, 9], can be used to solve (1),
where is the current iterate and is the step size. When , we recover gradient descent, when is the indicator function of the convex set , it simplifies to projected gradient descent, and when is the norm, it corresponds to the Iterative Soft-Thresholding Algorithm (ISTA). Nesterov-style techniques can also be employed for acceleration .
When the matrix is not diagonal, the alternating direction method of multipliers (ADMM) provides an appealing option for solving (1) via (2) by alternating between minimization of over (a continuously differentiable problem), minimization of over (amounts to evaluating ), and a gradient ascent step in ,
Although each step in ADMM is conveniently computable, its convergence rate is strongly influenced by the parameter .
Ii-B2 Second order methods
The slow convergence of first order methods to high-accuracy solutions motivates the development of second order methods for solving (1). A generalization of Newton’s method for nonsmooth problems (1) with was developed in [17, 16, 19, 18]. A sequential quadratic approximation of the smooth part of the objective function is utilized and a search direction is obtained as the solution of a regularized quadratic subproblem,
where is the current iterate and is the Hessian of . This method generalizes the projected Newton method  to a broader class of regularizers. For example, when is the norm, this amounts to solving a LASSO problem , which can be a challenging task. Coordinate descent is often used to solve this subproblem  and it has been observed to perform well in practice [20, 21, 22].
The Forward-Backward Envelope (FBE) was introduced in [26, 27, 28]. FBE is once-continuously differentiable nonconvex function of and its minimum corresponds to the solution of (1) with . As demonstrated in Section VII, FBE can be obtained from the proximal augmented Lagrangian (that we introduce in Section III). Since the generalized Hessian of FBE involves third-order derivatives of (which may be expensive to compute), references [26, 27, 28] employ either truncated- or quasi-Newton methods to obtain a second order update to .
Iii The proximal augmented Lagrangian and second order updates
In this section, we transform into a form that is once but not twice continuously differentiable. For the resulting function, which we call the proximal augmented Lagrangian, we define second order updates to find its saddle points, show that they are always well defined, and prove that they are locally (quadratically) superlinearly convergent when is (strongly) semismooth.
Iii-a Proximal augmented Lagrangian
The continuously differentiable proximal augmented Lagrangian was recently introduced in . This was done by rewriting via completion of squares,
and restricting it to the manifold that corresponds to explicit minimization over the auxiliary variable ,
where the minimizer of over is determined by the proximal operator of the function ,
and it defines the aforementioned manifold.
Theorem 1 (Theorem 1 in )
The proximal augmented Lagrangian contains the Moreau envelope of and its introduction allows the use of the method of multipliers (MM) to solve problem (2). MM requires joint minimization of over and which is, in general, challenging because the -minimization subproblem is nondifferentiable. However, Theorem 1 enables an equivalent implementation of MM
which improves performance relative to ADMM and has guaranteed convergence to a local minimum even when is nonconvex .
Continuous differentiability of also enables a joint update of and via the primal-dual Arrow-Hurwicz-Uzawa gradient flow dynamics,
where and are given by (9). When and are sparse mappings, this method is convenient for distributed implementation and it is guaranteed to converge at an exponential rate for strongly convex and sufficiently large .
In what follows, we extend the primal-dual algorithm to incorporate second order information of and thereby achieve fast convergence to high-accuracy solutions.
Iii-B Second order updates
Even though Newton’s method is primarily used for solving minimization problems in modern optimization, it was originally formulated as a root-finding technique and it has long been employed for finding stationary points . In , a generalized Jacobian was used to extend Newton’s method to semismooth problems. We employ this generalization of Newton’s method to in order to compute the saddle point of the proximal augmented Lagrangian. The unique saddle point of is given by the optimal primal-dual pair () and it thus provides the solution to (1).
Iii-B1 Generalized Newton updates
Let . We use the -generalized Jacobian of the proximal operator , , to define the set of -generalized Hessians of the proximal augmented Lagrangian,
|and the Clarke generalized Jacobian to define the set of Clarke generalized Hessians of the proximal augmented Lagrangian,|
Note that because .
In the rest of the paper, we introduce the composite variable, use interchangeably with , and suppress the dependance of and on to reduce notational clutter. For simplicity of exposition, we assume that is semismooth and state the results for the Clarke generalized Hessian (11b), i.e., . As described in Remark 4 in Section IV, analogous convergence results for non-semismooth can be obtained for the -generalized Hessian (11a), i.e., .
We use the Clarke generalized Hessian (11b) to obtain a second order update by linearizing the stationarity condition around the current iterate ,
Since is firmly nonexpansive, . In Lemma 2 we use this fact to prove that the second order update is well-defined for any generalized Hessian (11) of the proximal augmented Lagrangian as long as is strictly convex with for all .
Let be symmetric positive definite, , let be symmetric positive semidefinite with eigenvalues less than one, , let be full row rank, and let . Then, the matrix
is invertible and it has positive and negative eigenvalues.
Matrix (13a) is positive definite because and both and are positive semidefinite. Matrix (13b) is negative definite because the kernels of and have no nontrivial intersection and has full row rank. \qed
Iii-B2 Fast local convergence
The use of generalized Newton updates for solving the nonlinear equation for nondifferentiable was studied in . We apply this framework to the stationarity condition when is (strongly) semismooth and show that second order updates (12) converge (quadratically) superlinearly within a neighborhood of the optimal primal-dual pair.
Let be (strongly) semismooth, and let be defined by (12). Then, there is a neighborhood of the optimal solution in which the second order iterates converge (quadratically) superlinearly to .
Lemma 2 establishes that is nonsingular for any . Since the gradient of the proximal augmented Lagrangian is Lipschitz continuous by Theorem 1, nonsingularity of and (strong) semismoothness of the proximal operator guarantee (quadratic) superlinear convergence of the iterates by [23, Theorem 3.2]. \qed
Iv A globally convergent differential inclusion
Since we apply a generalization of Newton’s method to a saddle point problem and the second order updates are set valued, convergence to the optimal point is not immediate. Although we showed local convergence rates in Proposition 3 by leveraging the results of , proof of the global convergence is more subtle and it is established next.
To justify the development of a discrete-time algorithm based on the search direction resulting from (12), we first examine the corresponding differential inclusion,
where is the Clarke generalized Hessian (11b) of . We assume existence of a solution and prove asymptotic stability of (14) under Assumption 1 and global exponential stability under an additional assumption that is strongly convex.
Differential inclusion (14) has a solution.
Iv-a Asymptotic stability
We first establish asymptotic stability of differential inclusion (14).
Lyapunov function candidate (15) is a positive function of everywhere apart from the optimal primal-dual pair where it is zero. It remains to show that is decreasing along the solutions of (14), i.e., that is strictly negative for all ,
For Lyapunov function candidates which are differentiable with respect to , . Although (15) is not differentiable with respect to , we show that is differentiable along the solutions of (14). Instead of employing the chain rule, we use the limit that defines the derivative,
which gives in the limit . We then rewrite as the limit point of a sequence of functions so that
and use the Moore-Osgood theorem [45, Theorem 7.11] to exchange the order of the limits and establish that .
Let denote a subset of over which is differentiable (and therefore is differentiable with respect to ) and let be a sequence of points in that converges to . We define the sequence of functions ,
where , as we establish below, converges to . To employ the Moore-Osgood theorem, it remains to show that converges pointwise (for any ) as and that converges uniformly on some interval as .
Since , is differentiable for every and . It thus follows that
pointwise (for any ).
for sufficiently large , where . Lemma 2 and [23, Proposition 3.1] imply that is bounded within some neighborhood of and thus that can be written as where . This implies convergence of to and, combined with local Lipschitz continuity of with respect to , uniform convergence of to on where .
Iv-B Global exponential stability
To establish global asymptotic stability, we show that the Lyapunov function (15) is radially unbounded, and to prove exponential stability we bound it with quadratic functions. We first provide two lemmas that characterize the mappings and in terms of the spectral properties of matrices that describe the corresponding input-output relations at given points.
Lemma 5 (Lemma 2 in )
Let be a proper, lower semicontinuous, convex function and let : be the corresponding proximal operator. Then, for any , there exists a symmetric matrix satisfying such that
Let be strongly convex with parameter and let its gradient be Lipschitz continuous with parameter . Then, for any there exists a symmetric matrix satisfying such that
Let , , , and
Clearly, by construction, . It is also readily verified that when . It thus remains to show that (i) when ; and (ii) .
(i) Since is strongly convex and is Lipschitz continuous, is convex and is Lipschitz continuous. Furthermore, we have , and [46, Proposition 5] implies
This shows that only if and, thus, when . Therefore, there always exist a symmetric matrix such that .
Although matrices and in Lemmas 5 and 6 depend on the operating point, their spectral properties, and , hold for all and . These lemmas can be interpreted as a combination between a generalization of the mean value theorem [45, Theorem 5.9] to vector-valued functions and spectral bounds on the operators : and : arising from firm nonexpansiveness of , strong convexity of , and Lipschitz continuity of .
We now combine Lemmas 5 and 6 to establish quadratic upper and lower bounds for Lyapunov function (15) and thereby prove global exponential stability of differential inclusion (14) for strongly convex .
Given the assumptions, Theorem 4 establishes asymptotic stability of (14) with the dissipation rate It remains to show the existence of positive constants and such that Lyapunov function (15) satisfies
As noted in Remark 2, although and depend on the operating point, their spectral properties hold for all .
Since , we can write
and express Lyapunov function (15) as
for some ,
The set is closed and bounded and the minimum eigenvalue of is a continuous function of and . Thus, the extreme value theorem [45, Theorem 4.14] implies that its infimum over ,
The rate of exponential convergence established by Theorem 7 is independent of , , and . This is a consequence of insensitivity of Newton-like methods to poor conditioning. In contrast, the first order primal-dual method considered in  requires a sufficiently large for exponential convergence. In our second order primal-dual method, problem conditioning and parameter selection affect the multiplicative constant but not the rate of convergence.
When differential inclusion (14) is defined with the -generalized Hessian (11a), Theorems 4 and 7 hold even for proximal operators which are not semismooth. This follows from defining and choosing such that in the proof of Theorem 4. Such a choice of is possible by the definition of the -subdifferential (4); since the Clarke subgradient is the convex hull of the -subdifferential, it contains points outside of and thus such a sequence is not guaranteed to exist for any . When defined in this manner, is constant with respect to and thus uniform convergence of to is immediate.
V A second order primal-dual algorithm
An algorithm based on the second order updates (12) requires step size selection to ensure global convergence. This is challenging for saddle point problems because standard notions, such as sufficient descent, cannot be applied to assess the progress of the iterates. Instead, it is necessary to identify a merit function whose minimum lies at the stationary point and whose sufficient descent can be used to evaluate progress towards the saddle point.
An approach based on discretization of differential inclusion (14) and Lyapunov function (15) as a merit function leads to Algorithm 2 in Appendix -A. However, such a merit function is nonconvex and nondifferentiable in general which makes the utility of backtracking (e.g., the Armijo rule) unclear. Moreover, Algorithm 2 employs a fixed penalty parameter . A priori selection of this parameter is difficult and it has a large effect on the convergence speed.
Instead, we employ the primal-dual augmented Lagrangian introduced in  as a merit function and incorporate an adaptive update. This merit function is convex in both and and it facilitates an implementation with outstanding practical performance. Drawing upon recent advancements for constrained optimization of twice differentiable functions [14, 15], we show that our algorithm converges to the solution of (2). Finally, our algorithm exhibits local (quadratic) superlinear convergence for (strongly) semismooth .
V-a Merit function
The primal-dual augmented Lagrangian,
was introduced in , where is an estimate of the optimal Lagrange multiplier . Following [13, Theorem 3.1], it can be shown that the optimal primal-dual pair of optimization problem (2) is a stationary point of . Furthermore, for any fixed , is a convex function of , , and and it has a unique global minimizer.