An Efficient Inexact ABCD Method for Least Squares Semidefinite Programming
We consider least squares semidefinite programming (LSSDP) where the primal matrix variable must satisfy given linear equality and inequality constraints, and must also lie in the intersection of the cone of symmetric positive semidefinite matrices and a simple polyhedral set. We propose an inexact accelerated block coordinate descent (ABCD) method for solving LSSDP via its dual, which can be reformulated as a convex composite minimization problem whose objective is the sum of a coupled quadratic function involving four blocks of variables and two separable non-smooth functions involving only the first and second block, respectively. Our inexact ABCD method enjoys the attractive $O(1/k^2)$ iteration complexity if the subproblems are solved progressively more accurately. The design of our ABCD method relies on recent advances in the symmetric Gauss-Seidel technique for solving a convex minimization problem whose objective is the sum of a multi-block quadratic function and a non-smooth function involving only the first block. Extensive numerical experiments on various classes of over 600 large scale LSSDP problems demonstrate that our proposed ABCD method not only can solve the problems to high accuracy, but is also far more efficient than (a) the well-known BCD (block coordinate descent) method, (b) the eARBCG (an enhanced version of the accelerated randomized block coordinate gradient) method, and (c) the APG (accelerated proximal gradient) method.
Keywords: Least squares SDP, accelerated block coordinate descent, symmetric Gauss-Seidel.
AMS subject classifications. 90C06, 90C22, 90C25, 65F10.
Let $\mathcal{S}^n$ be the space of $n \times n$ real symmetric matrices endowed with the standard trace inner product $\langle \cdot, \cdot \rangle$ and its induced norm $\|\cdot\|$. We denote by $\mathcal{S}^n_+$ the cone of positive semidefinite matrices in $\mathcal{S}^n$. For any matrix $X \in \mathcal{S}^n$, we use $X \succ 0$ ($X \succeq 0$) to indicate that $X$ is a symmetric positive definite (positive semidefinite) matrix.
Consider the following semidefinite programming (SDP) problem:
where and are given data, and are two given linear maps, and are two nonempty simple closed convex sets, e.g., with being given matrices and with being given vectors. When applying a proximal point algorithm (PPA) [25, 26] to solve (1), we need to solve the following subproblem in each iteration for a given point and a parameter :
This motivated us to study the following least squares semidefinite programming (LSSDP) which includes (2) as a particular case:
where , are given data. In order for the PPA to be efficient for solving (1), it is of great importance for us to design an efficient algorithm to solve the above problem (P). Thus, the objective of this paper is to achieve this goal via solving the dual of (P).
The dual of (P) is given by
where for any given set , is the indicator function over such that if and otherwise, and is the conjugate function of defined by
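To make the dual objects concrete: the proximal mapping of the indicator function of a closed convex set is the metric projection onto that set, and for the cone of positive semidefinite matrices this projection is available through an eigenvalue decomposition. The following numpy sketch is illustrative only (the function name and test data are ours, not from the paper); it also checks the Moreau decomposition of a symmetric matrix with respect to the PSD cone and its polar:

```python
import numpy as np

def proj_psd(X):
    """Project a symmetric matrix onto the PSD cone.

    This is the proximal mapping of the indicator function of the
    PSD cone: eigendecompose and clip negative eigenvalues to zero."""
    w, V = np.linalg.eigh((X + X.T) / 2)   # symmetrize for numerical safety
    return (V * np.maximum(w, 0)) @ V.T

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
X = A + A.T                      # a random symmetric test matrix
P = proj_psd(X)

# By the Moreau decomposition, P is PSD, X - P is negative
# semidefinite, and the two parts are orthogonal: <P, X - P> = 0.
```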
Problem (D) belongs to a general class of multi-block convex optimization problems of the form:
where , and each is a finite dimensional real Euclidean space equipped with an inner product and its induced norm . Here , and are proper, lower semi-continuous convex functions. We assume that is continuously differentiable on an open neighborhood containing and its gradient is Lipschitz continuous. Note that one can write (D) in the form of (4) in a number of different ways. One natural choice is of course to express it in the form of (4) for with . Another possibility is to express it in the form of (4) for with
For the problem in (4), a well-known technique for solving it is the block coordinate descent (BCD) method; for example, see [10, 28, 32, 33] and references therein. Specifically, at iteration , one may update the blocks successively in the Gauss-Seidel fashion (other rules can also be applied):
When the subproblems in (1) are not easily solvable, a popular approach is to use a single step of the proximal gradient (PG) method, thus yielding the block coordinate gradient descent (BCGD) method [35, 2].
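For concreteness, one cyclic Gauss-Seidel BCD sweep on a made-up two-block convex quadratic (not the paper's method, just the generic scheme, with each block subproblem solved exactly) can be sketched as follows:

```python
import numpy as np

# Cyclic (Gauss-Seidel) block coordinate descent on a 2-block convex
# quadratic f(x) = 0.5 x'Qx - b'x.  Block sizes and data are made up;
# each block subproblem is solved exactly by a linear solve.
rng = np.random.default_rng(1)
n1, n2 = 3, 4
M = rng.standard_normal((n1 + n2, n1 + n2))
Q = M @ M.T + np.eye(n1 + n2)              # positive definite Hessian
b = rng.standard_normal(n1 + n2)
Q11, Q12, Q22 = Q[:n1, :n1], Q[:n1, n1:], Q[n1:, n1:]

x1, x2 = np.zeros(n1), np.zeros(n2)
for _ in range(2000):
    x1 = np.linalg.solve(Q11, b[:n1] - Q12 @ x2)    # minimize over block 1
    x2 = np.linalg.solve(Q22, b[n1:] - Q12.T @ x1)  # minimize over block 2

x_star = np.linalg.solve(Q, b)             # true minimizer, for comparison
```

For strongly convex quadratics this cyclic scheme converges linearly, which the comparison with the direct solve confirms.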
Problem (4) can also be solved by the accelerated proximal gradient (APG) method with an iteration complexity of $O(1/k^2)$, such as in [17, 18, 19, 1, 34]. In the best case, BCD-type methods have an iteration complexity of $O(1/k)$ (see [27, 2]), and they can hardly be accelerated to $O(1/k^2)$ as in the case of the APG method. Nevertheless, some researchers have tried to tackle this difficulty from different aspects. Beck and Tetruashvili proposed an accelerated BCGD method for solving (4) under the assumption that the nonsmooth terms vanish. Very recently, Chambolle and Pock presented an accelerated BCD method for solving (4) by assuming that the objective takes a special structured form. In theory, this method can be applied to problem (D) with an appropriate choice of the blocks. But the serious practical disadvantage is that the method does not cater for inexact solutions of the associated subproblems, and hence it is not suitable for large scale problems, since for (D) the subproblems must be solved numerically.
Besides BCD-type methods based on a deterministic updating order, there has been wide interest in randomized BCD-type methods. Nesterov presented a randomized BCD method with unconstrained and constrained versions in which the selection of the blocks is not done by a deterministic rule (such as the cyclic rule (1)), but rather via a prescribed distribution. Furthermore, an accelerated variant was studied for the unconstrained version. To extend the accelerated version to the more generic problem (4), Fercoq and Richtárik proposed an accelerated randomized block coordinate gradient (ARBCG) method. For strongly convex functions, Lin, Lu and Xiao showed that a variant of this method can achieve a linear convergence rate. However, from our numerical experience, the ARBCG method usually can only solve (D) to a moderate accuracy, since only the maximal eigenvalue of the relevant Hessian block is used in the updates. Even a numerically much enhanced version of the ARBCG method (denoted eARBCG) with a weighted norm (for which the theoretical convergence remains to be studied) is still typically several times slower than the accelerated block coordinate descent (ABCD) method with a special deterministic rule which we will propose later.
In this paper we aim to design an efficient inexact accelerated BCD-type method whose worst-case iteration complexity is $O(1/k^2)$ for (D). We achieve this goal by first proposing an inexact accelerated block coordinate gradient descent (ABCGD) method for the general convex programming problem (4), and then applying it to (D) to obtain an inexact ABCD method. Note that when the smooth coupling function is a convex quadratic, the ABCGD method and the ABCD method are identical. Our ABCD method is designed based on three components. First, we apply a Danskin-type theorem to eliminate one block of variables in (4). Then we adapt the inexact APG framework of Jiang, Sun and Toh to solve the resulting reduced problem. By choosing an appropriate proximal term and adapting the recently developed inexact symmetric Gauss-Seidel decomposition technique for solving a multi-block convex quadratic minimization problem (possibly with a single nonsmooth block), we show that each subproblem in the inexact APG method can be solved efficiently in a fashion almost like the symmetric Gauss-Seidel update.
As already mentioned, one can also apply the APG method directly to solve (D), or more generally (4). In this paper, we also adapt the APG method to directly solve (D) for the sake of numerical comparison. In addition, since the ARBCG method does not perform well for solving (D), again for the sake of numerical comparison, we propose an enhanced version (called eARBCG) of the accelerated randomized block coordinate gradient method for solving (D). As one can see later from the extensive numerical experiments conducted to evaluate the performance of various methods, though the BCD, APG and eARBCG methods are natural choices for solving (D), they are substantially less efficient than the ABCD method that we have designed. In particular, for solving (D), the ABCD method is at least ten times faster than the BCD method for the vast majority of the tested problems. It is quite surprising that a simple novel acceleration step with a special BCD cycle, as in the case of the ABCD method, can improve the performance of the standard Gauss-Seidel BCD method by such a dramatic margin.
The paper is organized as follows. In the next section, we introduce the key ingredients needed to design our proposed algorithm for solving (4), namely, a Danskin-type theorem for parametric optimization, the inexact symmetric Gauss-Seidel decomposition technique, and the inexact APG method. In Section 3, we describe the integration of the three ingredients to design our inexact ABCGD method for solving (4). Section 4 presents some specializations of the introduced inexact ABCGD method to solve the dual LSSDP problem (D), as well as discussions on the numerical computations involved in solving the subproblems. In Section 5, we describe the direct application of the APG method for solving (D). In addition, we also propose an enhancement of the accelerated randomized block coordinate gradient method for solving (D). Extensive numerical experiments to evaluate the performance of the ABCD, APG, eARBCG and BCD methods are presented in Section 6. Finally, we conclude the paper in the last section.
Notation. For any given self-adjoint positive semidefinite operator $\mathcal{T}$ that maps a real Euclidean space $\mathcal{X}$ into itself, we use $\mathcal{T}^{1/2}$ to denote the unique self-adjoint positive semidefinite operator such that $\mathcal{T}^{1/2}\mathcal{T}^{1/2} = \mathcal{T}$, and define $\|x\|_{\mathcal{T}} := \sqrt{\langle x, \mathcal{T} x \rangle}$ for all $x \in \mathcal{X}$.
2.1 A Danskin-type theorem
Here we shall present a Danskin-type theorem for parametric optimization problems.
Let and be two finite dimensional real Euclidean spaces each equipped with an inner product and its induced norm and be a lower semi-continuous function and be a nonempty open set in . Denote the effective domain of by , which is assumed to be nonempty. Let be a continuous function. Define the function by
For any given , let denote the solution set, possibly empty, to the optimization problem (6). The following theorem, largely due to Danskin, can be proven by essentially following the proof given in [8, Theorem 10.2.1].
Suppose that is a lower semi-continuous function and is a continuous function. Assume that for every , is differentiable on and is continuous on . Let be given. Suppose that there exists an open neighborhood of such that for each , is nonempty and that the set is bounded. Then the following results hold:
The function is directionally differentiable at and for any given ,
If reduces to a singleton, say , then is Gâteaux differentiable at with
Danskin’s Theorem 2.1 leads to the following results when convexity assumptions are imposed on the underlying functions.
Suppose that is a proper closed convex function, is an open convex set of and is a closed convex function. Assume that for every , is differentiable on and is continuous on and that for every , is a singleton, denoted by . Then the following results hold:
Let be given. If there exists an open neighborhood of such that is bounded on any nonempty compact subset of , then the convex function is differentiable on with
Suppose that there exists a convex open neighborhood such that is bounded on any nonempty compact subset of . Assume that for any , is Lipschitz continuous on and that there exists a self-adjoint positive semidefinite linear operator such that for all and ,
where for any given , is the generalized Hessian of at . Then
Moreover, if , then is Lipschitz continuous on with the Lipschitz constant (the spectral norm of ) and for any ,
where denotes the generalized Hessian of at .
Assume that for every , is continuously differentiable on an open neighborhood containing . Suppose that there exist two constants and such that for all ,
Then is Lipschitz continuous on such that for all ,
Part (ii). For any , we obtain from part (i) and the generalized mean value theorem (MVT) [12, Theorem 2.3] that
which, together with the convexity of , implies that (7) holds.
Assume that . By using (7) and [18, Theorem 2.1.5], we can assert that is globally Lipschitz continuous with the Lipschitz constant . So by Rademacher’s Theorem, the Hessian of exists almost everywhere in . From (7), we can observe that for any such that the Hessian of at exists, it holds that
Thus, (8) follows from the definition of the generalized Hessian of .
Part (iii). The conclusions of part (iii) follow from part (i), the maximal monotonicity of , the assumptions and the fact that for every ,
We omit the details here. ∎
2.2 An inexact block symmetric Gauss-Seidel iteration
Let be a given integer and , where ’s are real finite dimensional Euclidean spaces. For any , we write with . Let be a given self-adjoint positive semidefinite linear operator. Consider the following block decomposition
where $\mathcal{Q}_{ii}$, $i = 1, \dots, s$, are self-adjoint positive semidefinite linear operators and $\mathcal{Q}_{ij}$, $i < j$, are linear maps. Note that $\mathcal{Q} = \mathcal{D} + \mathcal{U} + \mathcal{U}^*$, where $\mathcal{D}$ is the block diagonal part of $\mathcal{Q}$ and $\mathcal{U}$ is its strictly upper block triangular part.
Let be given. Define the convex quadratic function by
Let be a given lower semi-continuous proper convex function.
Here, we further assume that the diagonal blocks are positive definite. Define
with the convention that
Suppose that , are given error vectors, with . Denote and . Define the following operator and vector:
Let be given. Define
The following proposition, which describes an equivalent BCD-type procedure for computing the optimal solution defined in (11), is the key ingredient for our subsequent algorithmic developments. The proposition was introduced by Li, Sun and Toh for the sake of making their Schur complement based alternating direction method of multipliers more explicit.
Assume that the self-adjoint linear operators , are positive definite. Let be given. For define by
Then the optimal solution defined by (11) can be obtained exactly via
Furthermore, is positive definite.
For later purpose, we also state the following proposition.
Suppose that is positive definite. Let . Then
The proof is straightforward by using the factorization of . ∎
In the computation in (12) and (13), we should interpret the computed solutions as approximate solutions to the minimization problems without the terms involving the error vectors. Once these approximate solutions have been computed, they generate the corresponding error vectors. With these known error vectors, the computed approximate solutions are then the exact solutions to the minimization problems in (12) and (13).
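In the exact case (zero error vectors), the symmetric Gauss-Seidel decomposition of Proposition 2.3 can be verified numerically. The following numpy sketch uses a two-block quadratic $q(y) = \frac{1}{2}\langle y, \mathcal{Q}y\rangle - \langle b, y\rangle$ with $\mathcal{Q} = \mathcal{D} + \mathcal{U} + \mathcal{U}^*$, and checks that one backward sweep followed by one forward sweep from a point $\bar{y}$ equals the exact minimizer of $q(y) + \frac{1}{2}\|y - \bar{y}\|^2_{\mathcal{T}}$ with the sGS proximal operator $\mathcal{T} = \mathcal{U}\mathcal{D}^{-1}\mathcal{U}^*$ (all data are made up for illustration):

```python
import numpy as np

# Two-block symmetric Gauss-Seidel decomposition check, exact case.
rng = np.random.default_rng(2)
n1, n2 = 3, 3
A1 = rng.standard_normal((n1, n1))
A2 = rng.standard_normal((n2, n2))
D1 = A1 @ A1.T + np.eye(n1)          # positive definite diagonal blocks
D2 = A2 @ A2.T + np.eye(n2)
U12 = rng.standard_normal((n1, n2))  # off-diagonal coupling block
Q = np.block([[D1, U12], [U12.T, D2]])
b = rng.standard_normal(n1 + n2)
y_bar = rng.standard_normal(n1 + n2)

# backward sweep: tentative block-2 update with block 1 fixed at y_bar
y2_tilde = np.linalg.solve(D2, b[n1:] - U12.T @ y_bar[:n1])
# forward sweep: update block 1, then block 2
y1 = np.linalg.solve(D1, b[:n1] - U12 @ y2_tilde)
y2 = np.linalg.solve(D2, b[n1:] - U12.T @ y1)
y_sgs = np.concatenate([y1, y2])

# direct solve of the proximal problem with T = U D^{-1} U^*,
# i.e. (Q + T) y = b + T y_bar
T = np.zeros((n1 + n2, n1 + n2))
T[:n1, :n1] = U12 @ np.linalg.solve(D2, U12.T)
y_direct = np.linalg.solve(Q + T, b + T @ y_bar)
```

The two vectors agree to machine precision, which is exactly the content of the sGS decomposition in the error-free case.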
2.3 An inexact APG method
For more generality, we consider the following minimization problem
where is a finite-dimensional real Euclidean space. The functions , are proper, lower semi-continuous convex functions (possibly nonsmooth). We assume that is continuously differentiable on and its gradient is Lipschitz continuous with modulus on , i.e.,
Given any positive definite linear operator , define by
Let be a given convergent sequence of nonnegative numbers such that
Suppose that for each , we have an approximate minimizer:
that satisfies the following admissible conditions
where (Note that for to be an approximate minimizer, we must have .) Then the inexact APG algorithm described in Algorithm 1 has the following iteration complexity result.
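In the exact case (error tolerances set to zero and each subproblem minimized exactly), the inexact APG framework reduces to the classical FISTA-type scheme. A self-contained sketch on a made-up $\ell_1$-regularized least squares instance (data, step rule and stopping test are ours, for illustration only):

```python
import numpy as np

# Exact APG (FISTA-type) for min p(x) + f(x), with p = lam*||x||_1
# and smooth f(x) = 0.5||Ax - c||^2; grad f is Lipschitz with
# modulus L = ||A||_2^2.
rng = np.random.default_rng(3)
m, n, lam = 20, 10, 0.1
A = rng.standard_normal((m, n))
c = rng.standard_normal(m)
L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of grad f

def prox_l1(v, t):
    """Proximal mapping of t*||.||_1 (soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def obj(x):
    return 0.5 * np.sum((A @ x - c) ** 2) + lam * np.sum(np.abs(x))

x = x_prev = np.zeros(n)
t = t_prev = 1.0
for _ in range(1000):
    y = x + ((t_prev - 1) / t) * (x - x_prev)       # extrapolation step
    grad = A.T @ (A @ y - c)
    x_prev, x = x, prox_l1(y - grad / L, lam / L)   # proximal gradient step
    t_prev, t = t, (1 + np.sqrt(1 + 4 * t ** 2)) / 2
```

The fixed-point residual of the proximal gradient map serves as an optimality measure for the computed iterate.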
See [13, Theorem 2.1]. ∎
3 An inexact accelerated block coordinate gradient descent method
We consider the problem
where , and and are three closed proper convex functions and are finite dimensional real Euclidean spaces. We assume that problem (21) is solvable with an optimal solution . Define the function by
For any given , let denote the solution set to the optimization problem in (22). Suppose that the assumptions in Part (ii) of Proposition 2.2 imposed on and hold for . (For the subsequent discussions, we do not need the assumptions in Part (iii) of Proposition 2.2, though the results there have their own merits.) Then, by Part (ii) of Proposition 2.2, we know that is continuously differentiable and its gradient is globally Lipschitz continuous, and for all ,
where is a self-adjoint positive semidefinite linear operator such that for any ,
where is the generalized Hessian of at . One natural choice, though not necessarily the best, for is , where is a self-adjoint positive semidefinite linear operator that satisfies
Here for any given , is the generalized Hessian of at .
Now we consider an equivalent problem of (21):
Given a self-adjoint positive semidefinite linear operator such that
where is the unique solution to . From (23), we have
The iteration complexity result for the inexact ABCGD algorithm described in Algorithm 2 follows from Theorem 2.5 without much difficulty.
Denote and . Let be the sequence generated by Algorithm 2. Then
where , , Consequently,
(a) Note that we can use the symmetric Gauss-Seidel iteration described in Proposition 2.3 to compute in (31). Therefore, Step 1 is actually one special block coordinate gradient descent (BCGD) cycle performed in a symmetric Gauss-Seidel order.
(b) Assume that the quantities in the backward sweep have been computed. One may try to estimate the corresponding forward-sweep solutions by reusing them. In this case the corresponding residual vector would be given by:
In practice, we may accept such an approximate solution, provided a condition of the form (for some given constant) is satisfied. When such an approximation is admissible, the linear systems involved need only be solved once instead of twice. We should emphasize that this saving can be significant when the linear systems are solved by a Krylov iterative method such as the preconditioned conjugate gradient method. Of course, if the linear systems are solved by computing the Cholesky factorization (which is done only once at the start of the algorithm), then the saving would not be as substantial.
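The factor-once strategy mentioned above can be sketched as follows (the operator, sizes and right-hand sides are made up; `cho_factor`/`cho_solve` are from scipy): one Cholesky factorization is computed at the start and then reused for every subsequent right-hand side.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# "Factor once, solve many": when the same SPD operator B appears in
# every iteration, its Cholesky factorization is computed a single
# time and reused for all later right-hand sides.
rng = np.random.default_rng(4)
n = 50
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)          # SPD system matrix (illustrative)

factor = cho_factor(B)               # done once, at the start
rhs = rng.standard_normal((n, 3))    # several right-hand sides over iterations
X = cho_solve(factor, rhs)           # each solve is two triangular solves
```

By contrast, a Krylov method such as preconditioned conjugate gradients would redo iterative work for every right-hand side, which is why avoiding a second solve per cycle matters more in that setting.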
4 Two variants of the inexact ABCD method for (D)
Now we can apply Algorithm 2 directly to (D) to obtain two variants of the inexact accelerated block coordinate descent method. In the first variant, we apply Algorithm 2 to (D) by expressing it in the form of (4) with and . In the second variant, we express (D) in the form of (4) with and . For the remaining parts of this paper, we assume that the linear map is onto and the solution set of (D) is nonempty. The convergence of both variants follows from Theorem 3.1 for Algorithm 2 and will not be repeated here.
The detailed steps of the first variant of the inexact ABCD method are given as follows.
Algorithm ABCD-1: An inexact ABCD method for (D). Select an initial point with . Let be a summable sequence of nonnegative numbers, and . Set . Iterate the following steps. Step 1. Suppose that and are error vectors such that