Parameter Insensitivity in ADMM-Preconditioned Solution of Saddle-Point Problems
We consider the solution of linear saddle-point problems, using the alternating direction method-of-multipliers (ADMM) as a preconditioner for the generalized minimum residual method (GMRES). We show, using theoretical bounds and empirical results, that ADMM is made remarkably insensitive to the parameter choice with Krylov subspace acceleration. We prove that ADMM-GMRES can consistently converge, irrespective of the exact parameter choice, to an -accurate solution of a -conditioned problem in iterations. The accelerated method is applied to randomly generated problems, as well as the Newton direction computation for the interior-point solution of semidefinite programs in the SDPLIB test suite. The empirical results confirm this parameter insensitivity, and suggest a slightly improved iteration bound of .
We consider iteratively solving very large scale instances of the saddle-point problem
with data matrices with invertible, with invertible, symmetric positive definite , and data vectors , , . Note that the special case of reduces (1) to the familiar block saddle-point structure
Additionally, we assume that efficient solutions (i.e. black-box oracles) to the two subproblems
are available for a fixed choice of .
Saddle-point problems with this structure arise in numerous settings, ranging from nonlinear optimization to the numerical solution of partial differential equations (PDEs); the subproblems (2) are often solved with great efficiency by exploiting application-specific features. For example, when the data matrices are large-and-sparse, the smaller saddle-point problems (2) can admit highly sparse factorizations, based on nested dissection or minimum degree orderings [31, 30]. Also, the Schur complements and are symmetric positive definite, and can often be interpreted as discretized Laplacian operators, for which many fast solvers are available [29, 26, 32, 25]. In some special cases, a triangular factorization or a diagonalization may be available analytically . The reader is referred to  for a more comprehensive review of possible applications.
The problem structure has an interpretation of establishing consensus between the two subproblems. To see this, note that (1) is the Karush–Kuhn–Tucker (KKT) optimality condition associated with the equality-constrained least-squares problem
and the solution to (1) is the unique optimal point. This problem is easy to solve if one of two variables were held fixed. For instance, holding fixed, the minimization of (3) over to -accuracy can be made with just a single call to the first subproblem in (2), taking the parameter to be . The difficulty of the overall problem, then, lies entirely in the need for consensus, i.e. for two independent minimizations to simultaneously satisfy a single equality constraint.
The alternating direction method-of-multipliers (ADMM) is a popular first-order method widely used in signal processing, machine learning, and related fields, to solve consensus problems like the one posed in (3); cf.  for an extensive review. Each ADMM iteration calls the subproblems in (2), with serving as the step-size parameter for the underlying gradient ascent. Under the assumptions on the data matrices stated at the start of the paper, ADMM converges at a linear rate (with error scaling at the -th iteration), starting from any initial point.
where is the condition number associated with the rescaled matrix . This bound is asymptotically optimal, in the sense that the square-root factor cannot be improved [18, Thm. 2.1.13].
Unfortunately, explicitly estimating the optimal parameter choice can be challenging. Picking any arbitrarily value, say , often results in convergence that is so slow as to be essentially stagnant, even on well-conditioned problems [14, 19]. A heuristic that works well in practice is to adjust the parameter after each iteration, using a rule-of-thumb based on keeping the primal and dual residuals within the same order of magnitude [17, 33, 7]. However, varying the value of between iterations can substantially increase the cost of solving the subproblems in (2).
1.1. Main results
When applied to a least-squares problem, ADMM reduces to a classic block Gauss-Seidel splitting on the corresponding KKT equations, i.e. the original saddle-point problem in (1). Viewing ADMM as the resulting linear fixed-point iterations, convergence can be optimally accelerated by using a Krylov subspace method like generalized minimum residual (GMRES) [24, 23]. Or equivalently, viewing ADMM as a preconditioner, it may be used to improve the conditioning of the KKT equations for a Krylov subspace method like GMRES. We refer to the GMRES-accelerated version of ADMM (or the ADMM-preconditioned GMRES) as ADMM-GMRES.
In this paper, we show, using theoretical bounds and empirical results, that ADMM-GMRES (nearly) achieves the optimal convergence rate in (4) for every parameter choice. Figure 1 makes this comparison for two representative problems. Our first main result (Theorem 7) conclusively establishes the optimal iteration bound when is very large or very small. Our second main result (Theorem 9) proves a slightly weaker statement: ADMM-GMRES converges within iterations for all remaining choices of , subject to a certain normality assumption. The two bounds gives us the confidence to select the parameter choice in order to maximize numerical stability.
To validate these results, we benchmark the performance of ADMM-GMRES with a randomly selected in Section 7 against regular ADMM with an optimally selected . Two problem classes are considered: (1) random problems generated by selecting random orthonormal bases and singular values; and (2) the Newton direction subproblems associated with the interior-point solution of large-scale semidefinite programs. Our numerical results suggest that ADMM-GMRES converges in iterations for all values of , which is a slightly stronger iteration bound that the one we have proved.
1.2. Related ideas
When the optimal parameter choice for (regular) ADMM is explicitly available, we showed in a previous paper  that ADMM-GMRES can consistently converge in just iterations, which is an entire order of magnitude better than the optimal bound (4). Problems that would otherwise require thousands of iterations to solve using ADMM are reduced to just tens of ADMM-GMRES iterations. However, the improved rate cannot be guaranteed over all problems, and there exist problem classes where ADMM-GMRES convergences in iterations for all choices of .
More generally, the idea of using a preconditioned Krylov subspace method to solve a saddle-point system has been explored in-depth by a number of previous authors [2, 28, 20, 1, 3, 4]. We make special mention of the Hemitian / Skew-Hermitian (HSS) splitting method, first proposed by Bai, Golub & Ng  and used as a preconditioner for saddle-point problems by Benzi & Golub , which also makes use of efficient solutions to the subproblems in (2). It is curious to note that HSS has a strikingly similar expression for its optimal parameter choice and the resulting convergence rate, suggesting that the two methods may be closely related.
Note that ADMM is an entirely distinct approach from the augmented Lagrangian / method of multipliers (MM) in optimization, or equivalently, the Uzawa method in saddle-point problems [10, 8]. In MM, convergence is guaranteed in a small, constant number of iterations, but each step requires the solution of an ill-conditioned symmetric positive definite system of equations, often via preconditioned conjugate gradients. In ADMM, convergence is slowly achieved over a large number of iterations, but each iteration is relatively inexpensive. We refer the reader to  for a more detailed comparison of the two methods.
Finally, it remains unknown whether these benefits extend to nonlinear saddle-point problems (or equivalently, nonlinear versions of the consensus problem), where the ADMM update equations are also nonlinear. There are a number of competiting approaches to generalize GMRES to nonlinear fixed-point iterations [22, 9, 11]. Their application to ADMM is the subject of future work.
1.3. Definitions & Notation
Given a matrix , we use to refer to its -th eigenvalue, and to denote its set of eigenvalues, including multiplicities. If the eigenvalues are purely-real, then refers to its most positive eigenvalue, and its most negative eigenvalue. Let denote the vector norm, as well as the associated induced norm, also known as the spectral norm. We use to refer to the -th largest singular value.
Define and as the strong convexity parameter and the gradient Lipschitz constant for the quadratic form associated with the matrix . The quantity is the corresponding condition number.
2. Application of ADMM to the Saddle-point Problem
Beginning with a choice of the quadratic-penalty / step-size parameter and initial points , the method generates iterates
Note that the local and global variable updates can each be implemented by calling one of the two subproblems in (2). Since the KKT optimality conditions are linear with respect to the decision variables, the update equations are also linear, and can be written
with iteration matrix
upon the vector of local, global, and multiplier variables, . We will refer to the residual norm in all discussions relating to convergence.
Definition 1 (Residual convergence).
Given the initial and final iterates and , we say residual convergence is achieved in iterations if , where and are the KKT matrix and vector in (1).
2.1. Basic Spectral Properties
Convergence analysis for linear fixed-point iterations is normally performed by examining the spectral properties of the corresponding iteration matrix. Using dual feasibility arguments, a block-Schur decomposition for (6) can be explicitly specified.
Lemma 2 ([34, Lem. 11]).
Define the QR decomposition with and , and define as its orthogonal complement. Then defining the orthogonal matrix and the scaling matrix ,
yields a block-Schur decomposition of
where the size inner iteration matrix is defined in terms of the matrix
We may immediately conclude that has zero eigenvalues, and nonzero eigenvalues the lie within a disk on the complex plane centered at , with radius of . It is straightforward to compute the radius of this disk exactly.
Let , and define and . Then the spectral norm of is given
Also, we see from (8) that each Jordan block associated with a zero eigenvalue of is at most size . After two iterations, the behavior of ADMM becomes entirely dependent upon the inner iteration matrix .
Lemma 4 ([34, Lem. 13]).
For any and any polynomial , we have
where is defined in terms of the matrices in Lemma 2, as in
One application of Lemma 4 is to bound the spectral norm of the -th power iteration, i.e. , thereby yielding the following iteration estimate.
2.2. Accelerating Convergence using GMRES
In the context of quadratic objectives, the convergence of ADMM can be accelerated by GMRES in a largely plug-and-play manner. Given a specific choice of parameter and an initial point , we may task GMRES with the fixed-point equation associated with the ADMM update equation (5)
which is indeed a linear system of equations when is held fixed. It is an elementary fact that the resulting iterates will always converge onto the fixed-point point faster than regular ADMM (under a suitably defined metric) .
Alternatively, the fixed-point equation (11) is equivalent to the left-preconditioned system of equations
where and are the KKT matrix and residual defined in (1), the ADMM preconditioner matrix is
Note that the ADMM iteration matrix satisfies by definition. In turn, GMRES-accelerated ADMM is equivalent to a preconditioned GMRES solution to the KKT system, , with preconditioner . Matrix-vector products with can always be implemented as the composition of an augmentation operation and a single iteration of ADMM, as seen in the factorization in (13).
GMRES can also be used to solve the right-preconditioned system
and the solution is recovered via . The resulting method performs essentially the same steps as the one above, but optimizes the iterates under a more preferable metric. Starting from the same initial point , the -th iterate of GMRES as applied to (14), written , is guaranteed to produce a KKT residual norm that is smaller than or equal to that of the -th iterate of regular ADMM, written , as in . This property is preferable as becomes progressively ill-conditioned and numerical precision becomes an issue; cf. [34, Sec. 7] for details.
Throughout this paper, we will refer to both methods as GMRES-accelerated ADMM, or ADMM-GMRES for short, and reserve the “left-preconditioned” or the “right-preconditioned” specifications only where the distinctions are important. The reason is that both methods share a common bound for the purposes of convergence analysis.
Given an arbitrary linear system, , GMRES generates iterates that satisfies the minimal residual property 
where is the -th residual vector. Furthermore, Lemma 4 yields for all ,
and the last equality is due to the existence of a bijective linear map between and . In the left-preconditioned system (12), the matrix is , and the factor arises by bounding the residuals . In the right-preconditioned system (12), the matrix is , and the factor arises via . ∎
In order to use Proposition 6 to derive useful convergence estimates, the polynomial norm-minimization problem can be reduced into a polynomial min-max approximation problem over a set of points on the complex plane. More specifically, consider the following sequence of inequalities
which makes the following normality assumption, that is standard within this context.
Assumption 1 ( is bounded).
Given fixed , the matrix , defined in (9), is diagonalizable with eigendecomposition, . Furthermore, the condition number for the matrix-of-eigenvectors, , is bounded from above by an absolute constant.
We refer to this last problem in (17) as the eigenvalue approximation problem. Only in very rare cases is an explicit closed-form solution known, but any heuristic choice of polynomial will provide a valid upper-bound.
3. Convergence Analysis for ADMM-GMRES
Our main result in this paper is that ADMM-GMRES converges to an -accurate solution in iterations for any value of , in the sense of the residual. We split the precise statement into two parts. First, for very large and very small values of , we can conclusively establish that ADMM-GMRES convergences in iterations. This is asymptotically the same as the optimal figure for regular ADMM.
Theorem 7 (Extremal ).
GMRES-accelerated ADMM achieves residual convergence in
for any choice of or .
For intermediate choices of , the same result almost holds. Subject to the normality assumption in Assumption 1, ADMM-GMRES converges in iterations. Accordingly, we conclude that ADMM-GMRES converge within this number of iterations for every fixed value of .
Theorem 9 (Intermediate ).
GMRES-accelerated ADMM achieves residual convergence in
for any choice of .
As we reviewed in Section 2, convergence analysis for GMRES can be reduced to a polynomial approximation problem over the eigenvalues of , written in (17). Our proofs for Theorems 7 & 9 are based on solving this polynomial approximation problem heuristically. The discrete eigenvalue distribution of is enclosed within simple regions on the complex plane in Section 4 for different values of . The polynomial approximation problems associated with these outer enclosures are solved in their most general form in Section 5. These results are pieced together in Section 6, yielding proofs to our main results.
4. Eigenvalue Distribution of the Iteration Matrix
The eigenvalues of the iteration matrix play a pivotal role in driving the convergence of both ADMM as well as ADMM-GMRES. Figure 2 plots these for a fixed, randomly generated problem, while sweeping the value of . Initially, we see two clusters of purely-real eigenvalues, tightly concentrated about , that enlargen and shift closer towards the origin and towards each other with increasing . As the two clusters coalesce, some of the purely-real eigenvalues become complex. The combined radius of the two clusters reaches its minimum at around , at which point most of the eigenvalues are complex. Although not shown, the process is reversed once moves past ; the two clusters shrink, become purely-real, break apart, and move away from the origin, ultimately reverting into two clusters concentrated about .
Three concrete findings can be summarized from these observations. First, despite the fact that is nonsymmetric, its eigenvalues are purely-real over a broad range of .
Let or . Then is diagonalizable and its eigenvalues are purely real. Furthermore, let be the condition number for the matrix-of-eigenvectors as defined in Assumption 1. Then this quantity is bound
Furthermore, the eigenvalues are partitioned into two distinct, purely-real clusters that only become complex once they coalesce.
Define the positive scalar , which satisfies by construction. If , then is enclosed within the union of a disk and an interval:
If , then is enclosed within a single interval:
Finally, if , then is enclosed within the union of two disjoint intervals:
Furthermore, if , then contains at least one eigenvalue within each interval.
In the limits and , the two clusters in (21) concentrate about , and the spectral radius of the ADMM iteration matrix converges towards 1.
Let or . Then for , there exists an eigenvalue of the ADMM iteration matrix whose modulus is lower-bounded
The spectral radius determines the asympotic convergence rate, so given Corollary 13, it is unsurprising that ADMM stagnates if is poorly chosen. But the situation is different with ADMM-GMRES, because it is able to exploit the clustering of eigenvalues. As we will see later, this is the mechanism that allows ADMM-GMRES to be insensitive to the parameter choice.
4.1. Properties of -symmetric matrices
Most of our characterizations for the eigenvalues of are based on a property known as “-symmetry”. In the following discussion, we will drop all arguments with respect to for clarity. Returning to its definition in (9), we note that has the block structure
with subblocks , , and ,
and the matrices and with orthonormal columns are defined as in Lemma 2. From the block structure in (22) we see that the matrix is self-adjoint with respect to the indefinite product (assuming ) defined by :
The -symmetric matrix in (22) has at most eigenvalues with nonzero imaginary parts, counting conjugates. These eigenvalues are contained within the disk of radius
where denotes the space of polynomials.
Benzi & Simoncini  provide a succinct proof for the first statement. The second statement is based on the fact that every eigenpair of satisfying must have an eigenvector that is “-neutral”, i.e. satisfying ; cf. [5, Thm. 2.1]. Hence, the following bound holds
for every with . Taking the Lagrangian dual of (25) yields the desired statement. ∎
Also, we can derive a simple sufficient condition for the eigenvalues of to be purely real, based on the ideas described in .
Suppose that there exists a real scalar to make the matrix positive definite. Then is diagonalizable with eigendecomposition, , its eigenvalues are purely-real, and the condition number of the matrix-of-eigenvectors satisfies
It is easy to verify that is also symmetric with respect to , as in . Since is positive definite, there exists a symmetric positive definite matrix satisfying , and the -symmetry implies
Hence we conclude that is similar to the real symmetric matrix , with purely-real eigenvalues and eigendecomposition , where is orthogonal. The corresponding eigendecomposition for is with . ∎
Finally, we may use the block-generalization of Gershgorin’s circle theorem to decide when the off-diagonal block is sufficiently “small” such that the eigenvalues of become similar to the block diagonal matrix .
Given -symmetric matrix in (22), define the two Gershgorin sets
Then . Moreover, if and are disjoint, i.e. , then contains exactly eigenvalues in and eigenvalues in .
This is a straightforward application of the block Gershgorin’s theorem for matrices with normal pivot blocks [12, Thm. 4]. ∎
4.2. Proof of Lemma 11
The cases of and are trivial. In the remaining cases, is -symmetric, and we will use Proposition 15 to prove the statement. Noting that is unitarily similar with , simple direct computation reveals that is positive definite for , and is positive definite for . Hence, for these choices of , the eigenvalues of are purely real. Some further computation reveals that and , so the condition number of is bound
Taking the square-root and substituting yields the desired estimate for in (18). ∎
4.3. Proof of Lemma 12
Again, the cases of and are trivial. In all remaining cases, is -symmetric. The single interval case (20) is a trivial consequence of our purely-real result in Lemma 11. The disk-and-interval case (19) arises from Proposition 14, in which we use the disk of radius
to enclose the eigenvalues with nonzero imaginary parts, and the spectral norm disk to enclose the purely-real eigenvalues. The two-interval case (21) is a consequence of the block Gershgorin theorem in Proposition 16. Some direct computation on the block matrices in (22) yields the following two statements
the latter of which also uses the fact that . The projections of the corresponding Gershgorin regions onto the real line satisfy
Once , the regions become separated. Taking the intersection between each disjoint Gershgorin region, the real line, and the spectral norm disk yields (21). ∎
5. Solving the Approximation Problems
With the distribution of eigenvalues characterized in Lemma 12, we now consider solving each of the three accompanying approximation problems in their most general form.
5.1. Chebyshev approximation over the real interval
The optimal approximation for an interval over the real line has a closed-form solution due to a classic result attributed to Chebyshev.
Let denote the interval on the real line. Then assuming that , the polynomial approximation problem has closed-form solution
where is the degree- Chebyshev polynomial of the first kind, and is the condition number for the interval. The minimum is attained by the Chebyshev polynomial .
See e.g. . ∎
Whereas approximating a general -conditioned region to -accuracy requires an order polynomial, approximating a real interval of the same conditioning and to the same accuracy only requires an order polynomial. This is the underlying mechanism that grants the conjugate gradients method a square-root factor speed-up over gradient descent; cf. [16, Ch. 3] for a more detailed discussion. For future reference, we also note the following identity.
Given any , define the corresponding condition number as . Then
5.2. Real intervals symmetric about the imaginary axis
Now, consider the polynomial approximation problem for two real, non-overlapping intervals with respect to the constraint point , illustrated in Fig. 3, which arises as the eigenvalue distribution (21) in Lemma 12.
Given and , define the two closed intervals
such that and . Then the following holds
where is the condition number for the segment .
Of course, the union of the two intervals, i.e. , lies within a single real interval with condition number , so Theorem 17 can also be used to obtain an estimate. However, the explicit treatment of clustering in Lemma 19 yields a considerably tighter bound, because it is entirely possible for each and to be individually well-conditioned while admitting an extremely ill-conditioned union. For a concrete example, consider setting and taking the limit ; the condition number for is fixed at , but the condition number for the union diverges . In this case, Lemma 19 predicts extremely rapid convergence for all values of , whereas Theorem 17 does not promise convergence at all.
To prove Lemma 19, we will begin by stating a technical lemma.
Define with domain . Then the quotient is monotonously increasing with infimum attained at the limit point .
By definition, we see that both and are nonzero for all . Taking the derivatives
reveals that is monotonously increasing for all , so we also have . Finally, we observe that for all . Combining these three observations with the quotient rule reveals that is monotonously increasing
Hence, the infimum for must be attain at its lower limit point . Using l’Hôpital’s rule yields . ∎