A Doubly Stochastic Gauss-Seidel Algorithm for Solving Linear Equations and Certain Convex Minimization Problems


M. Razaviyayn, M. Hong, N. Reyhanian, and Z.-Q. Luo (equal contributions). M. Razaviyayn is with the Department of Industrial and Systems Engineering, University of Southern California. Email: razaviya@usc.edu. M. Hong and N. Reyhanian are with the Department of Electrical and Computer Engineering, University of Minnesota, USA. Email: {mhong, navid}@umn.edu. Z.-Q. Luo is with The Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen, China. Email: luozq@cuhk.edu.cn. This research is supported by NSFC grants 61731018 and 61571384, and by the Peacock project of SRIBD.
Submitted in April 2018
Abstract

Consider the classical problem of solving a general linear system of equations $Ax = b$. It is well known that the (successively over-relaxed) Gauss-Seidel scheme and many of its variants may not converge when $A$ is neither diagonally dominant nor symmetric positive definite. Can we have a linearly convergent G-S type algorithm that works for any $A$? In this paper we answer this question affirmatively by proposing a doubly stochastic G-S algorithm that is provably linearly convergent (in the mean square error sense) for any feasible linear system of equations. The key in the algorithm design is to introduce a nonuniform doubly stochastic scheme for picking the equation and the variable in each update step, as well as a stepsize rule. These techniques also generalize to certain iterative alternating projection algorithms for solving the linear feasibility problem with an arbitrary $A$, as well as certain high-dimensional convex minimization problems. Our results demonstrate that a carefully designed randomization scheme can make an otherwise divergent G-S algorithm converge.

1 Introduction: Solving Linear System of Equations

Consider the generic problem of solving a linear system of equations

$Ax = b,$    (1)

where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^{m}$. We use $[m]$ and $[n]$ to denote the sets $\{1, \ldots, m\}$ and $\{1, \ldots, n\}$, respectively. A classical approach to solve (1) is by the Gauss-Seidel (G-S) algorithm, whereby at each iteration only one variable is updated by using the information from only one equation. More precisely, let $a_i$ and $b_i$ denote the $i$th row of $A$ and the $i$th element of $b$, respectively. We define the G-S type algorithm as follows.

Definition 1

An iterative algorithm for solving (1) is of Gauss-Seidel type if, at each iteration, the algorithm updates one variable $x_j$, $j \in [n]$, by utilizing only $(a_i, b_i)$ for some $i \in [m]$.

A natural application of the G-S type algorithms is in the setting where $n$ players play a game in which $x_i$ is the variable of player $i$. In this case we have $m = n$, and the objective of player $i$ is to satisfy the $i$-th equation $a_i x = b_i$. The best response strategy for player $i$ is given by

$x_i = \frac{1}{A_{ii}}\Big(b_i - \sum_{j \ne i} A_{ij} x_j\Big).$    (2)

Using a step-size $\gamma$, this best response strategy leads to the following successive over-relaxation (SOR) update rule:

$x_i \leftarrow (1-\gamma)\, x_i + \frac{\gamma}{A_{ii}}\Big(b_i - \sum_{j \ne i} A_{ij} x_j\Big),$    (3)

where $A_{ij}$ is the $(i,j)$th element of $A$ and $x_j$ is the $j$th element of $x$. The central question is how to choose the step-size $\gamma$ and determine the order in which the players update their variables so that the G-S type algorithm (3) will eventually lead to an equilibrium (or equivalently, a solution to (1)).
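To make the locked update concrete, the following short Python sketch performs one relaxed best-response (SOR-style) step in which equation $i$ is used to update variable $x_i$. The function name and the explicit coding of the update are illustrative assumptions that follow the description above, not a verbatim restatement of (3).

```python
import numpy as np

def sor_step(A, b, x, i, gamma):
    """One SOR-style step: use equation i to update variable x[i],
    relaxed by the stepsize gamma (locked variable-to-equation pairing)."""
    # best response of "player" i: solve the i-th equation for x[i],
    # keeping all other variables fixed
    residual = b[i] - A[i, :] @ x + A[i, i] * x[i]
    best_response = residual / A[i, i]
    # relaxed update: convex combination of the old value and the best response
    x = x.copy()
    x[i] = (1.0 - gamma) * x[i] + gamma * best_response
    return x
```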

1.1 Background on the Convergence of G-S Type Algorithm

To better understand the convergence behavior of G-S algorithm and its variants, let us consider the following example.

Example 1: Consider the following special case of (1):

where is some given constant. The best response strategy in (3) leads to the following update rule:

(4)
(5)

Assume .

Let us consider the following five different update rules which are all of the G-S type.

  1. Cyclic Successive Over Relaxation (SOR): At iteration $r$, we perform

    (6)
  2. Symmetric SOR: At iteration $r$, the variables are updated using a forward-sweep G-S step followed by a backward-sweep G-S step Saad03book ():

  3. Uniformly Randomized (UR) SOR: At iteration $r$, randomly pick one of the two variables with equal probability. Update according to (4) or (5) based on which variable is selected, while fixing the other variable at its previous value.

  4. Non-Uniformly Randomized (NUR) SOR: At iteration $r$, let $p_1$ and $p_2$ satisfy $p_1 + p_2 = 1$, $p_1, p_2 \ge 0$; randomly pick variable $i \in \{1, 2\}$ with probability $p_i$. Update according to (4) or (5) based on which variable is selected, while fixing the remaining variable at its previous value.

  5. Random Permutation (RP) SOR: At iteration $r$, randomly select a permutation of the index set $\{1, 2\}$; the variables are updated according to

    (7)

    Note that this method is referred to as the shuffled SOR in Oswald15b ().

It is easily seen that for any update order listed above, the resulting algorithm has the following property:

On the other hand, the solution of the system of linear equations does not satisfy this property; hence none of these algorithms will find the solution.

Next we give a brief literature review on the convergence analysis of G-S type algorithms, as well as other related algorithms for solving a linear system of equations or inequalities.

The convergence of G-S type algorithms. It is well-known that when $A$ is either diagonally dominant or symmetric positive definite (PD), the classical SOR method with the cyclic update rule converges to the solution of (1) (bertsekas97, Propositions 6.7, 6.8, 6.10). More specifically, if $A$ is symmetric and PD, the convergence of SOR (for any $\gamma \in (0, 2)$) can be established by showing that each iteration of the SOR algorithm is equivalent to a step of the coordinate descent (CD) algorithm for minimizing the strictly convex cost function $\frac{1}{2} x^{\top} A x - b^{\top} x$ (bertsekas97, Section 2.6.3). Note that the convergence rate in this case is linear, although the rate is not easily expressible in terms of the condition number of the matrix $A$ if the classical cyclic rule is used Golub96 (). Without the symmetry or the positive definiteness of $A$, the convergence of the SOR algorithm is only known when $A$ is diagonally dominant (bertsekas97, Section 2.6.2). In particular, using the matrix splitting $A = L + U + D$, where $L$, $U$, $D$ are the strictly lower-triangular, strictly upper-triangular, and diagonal parts of $A$, respectively, we can write the SOR iteration as

$x^{r+1} = (D + \gamma L)^{-1}\big[(1-\gamma) D - \gamma U\big]\, x^{r} + \gamma (D + \gamma L)^{-1} b.$    (8)

Hence, the convergence of the SOR algorithm is guaranteed if the spectral radius of the iteration matrix $(D + \gamma L)^{-1}\big[(1-\gamma) D - \gamma U\big]$ is strictly less than one. Recently, the work Oswald15b () shows that when $A$ is symmetric and positive semidefinite (PSD), the G-S algorithm with the RP rule can yield a better convergence rate than the cyclic G-S (in the asymptotic regime where $n$ is large). From the above discussion it is clear that the classical G-S type algorithm does not work for an arbitrary matrix $A$. A natural question is: can a G-S type algorithm converge for any matrix $A$?
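This spectral condition can be checked numerically. The sketch below forms the SOR iteration matrix from the splitting discussed above and reports whether its spectral radius is below one; the splitting convention, the function name, and the small example matrix are assumptions introduced only for illustration.

```python
import numpy as np

def sor_iteration_matrix(A, gamma):
    """Iteration matrix of SOR under the splitting A = L + U + D, with
    L strictly lower triangular, U strictly upper triangular, D diagonal."""
    D = np.diag(np.diag(A))
    L = np.tril(A, -1)
    U = np.triu(A, 1)
    return np.linalg.solve(D + gamma * L, (1.0 - gamma) * D - gamma * U)

# SOR with a fixed pairing converges iff the spectral radius is below one.
A = np.array([[4.0, 1.0], [2.0, 3.0]])   # a hypothetical diagonally dominant example
rho = max(abs(np.linalg.eigvals(sor_iteration_matrix(A, gamma=1.0))))
print(rho < 1)   # True here; the test can fail for a general matrix A
```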

The convergence of Kaczmarz type algorithm. Another popular method that bears similarity to the G-S type method for iteratively solving (1) is the Kaczmarz method KACZMARZ93 (), whose iteration is expressed as

$x^{r+1} = x^{r} + \frac{b_i - a_i x^{r}}{\|a_i\|_2^2}\, a_i^{\top},$    (9)

where $a_i$ is the $i$th row of $A$. This method has been used in many applications, but its rate of convergence was only analyzed in 2008 by Strohmer and Vershynin Strohmer2008 (), who proposed a randomized Kaczmarz (RK) method for over-determined linear systems. In the RK method, the $i$-th equation is selected for update randomly with probability proportional to $\|a_i\|_2^2$. This method can be seen as a particular case of the stochastic gradient descent algorithm for minimizing a least-squares cost function, and the iterates converge linearly to a solution of (1). The RK method has a convergence rate that depends only on a certain scaled condition number of the matrix $A$.
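For concreteness, here is a minimal sketch of the RK iteration just described: a row is sampled with probability proportional to its squared norm and the iterate is projected onto the hyperplane of the selected equation. The function name, the zero initialization, and the assumption that every row of $A$ is nonzero are illustrative choices.

```python
import numpy as np

def randomized_kaczmarz(A, b, num_iters, seed=0):
    """Randomized Kaczmarz: sample row i with probability
    ||a_i||^2 / ||A||_F^2, then project onto {x : a_i^T x = b_i}."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    row_norms2 = np.sum(A**2, axis=1)      # assumes every row of A is nonzero
    probs = row_norms2 / row_norms2.sum()
    x = np.zeros(n)
    for _ in range(num_iters):
        i = rng.choice(m, p=probs)
        x = x + (b[i] - A[i] @ x) / row_norms2[i] * A[i]
    return x
```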

In a related work Leventhal10 (), Leventhal and Lewis studied randomized variants of two classical algorithms: one is CD for solving systems of linear equations (as discussed above), and the other is iterated projection Deutsch01 () for systems of linear inequalities (which contains the RK algorithm Strohmer2008 () as a special case). The authors show that global linear convergence can be established for the first algorithm when $A$ is symmetric (of size $n \times n$) and PSD, and for the second algorithm when the system has a nonempty solution set. Further, the authors show that the linear rate can be bounded in terms of natural linear-algebraic condition numbers of the problems. Note that for the iterated projection method and the RK algorithm, global linear convergence does not require $A$ to have full column rank.
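As a point of reference, the sketch below shows one standard form of a randomized iterated-projection step for a system of linear inequalities $Ax \le b$: a row is sampled with probability proportional to its squared norm, and the iterate is projected onto the corresponding halfspace only if it is violated. The sampling rule and function name follow the common presentation of such methods and are included as an illustrative assumption, not as a verbatim restatement of the cited algorithms.

```python
import numpy as np

def randomized_projection_inequalities(A, b, num_iters, seed=0):
    """Randomized iterated projection for Ax <= b: sample row i with
    probability ||a_i||^2 / ||A||_F^2 and project onto the halfspace
    {x : a_i^T x <= b_i} whenever it is violated."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    row_norms2 = np.sum(A**2, axis=1)      # assumes every row of A is nonzero
    probs = row_norms2 / row_norms2.sum()
    x = np.zeros(n)
    for _ in range(num_iters):
        i = rng.choice(m, p=probs)
        violation = A[i] @ x - b[i]
        if violation > 0:                  # project only if the inequality is violated
            x = x - (violation / row_norms2[i]) * A[i]
    return x
```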

Other recent works along this line include WHEATON2017 (); Gower15 (); DeLoera17 (); Needell10 (); Needell13 (); Oswald15 (). In Gower15 (), a stochastic dual ascent (SDA) algorithm, which contains RK as a special case, was introduced for finding the projection of a given vector onto the solution space of a linear system. The method is dual in nature, the dual being an unconstrained concave quadratic maximization problem. In each iteration of SDA, the dual variable is updated by choosing a point in a subspace spanned by the columns of a random matrix drawn independently from a fixed distribution. In DeLoera17 (), the authors combined the relaxation method of Motzkin Motzkin54 () (also known as the Kaczmarz method with the "most violated constraint" control) and the randomized Kaczmarz method Strohmer2008 () to obtain a family of algorithms called Sampling Kaczmarz-Motzkin (SKM) for solving linear feasibility problems. In SKM, at each iteration a subset of inequalities is picked, and the variables are updated based on the projection onto the set corresponding to the most violated linear equality/inequality. The reference WHEATON2017 () proposed a new algorithm in which each iteration consists of projecting the current approximate solution onto the hyperplanes defined by the individual equations, followed by a proper combination of these projections to yield the next iterate. Different from the Kaczmarz method KACZMARZ93 (), this method requires information from all equations to update one variable. Needell Needell10 () extended the RK method to the case of inconsistent equations, and showed that global linear convergence can be achieved until some fixed convergence horizon is reached. Needell and Tropp Needell13 () analyzed a block version of the RK algorithm, in which at each iteration the iterate is projected onto the solution space of many equations simultaneously by selecting a block of rows rather than a single row. The convergence rate of the resulting algorithm is analyzed using the notion of row paving of a matrix. Recently, Liu and Wright proposed schemes to accelerate the RK method; the resulting scheme converges faster than the RK algorithm on ill-conditioned problems Liu13ARK ().

Recently, a few works have analyzed the randomized CD method proposed in Leventhal10 () and the RK method Strohmer2008 (); see Hefny17 (); MA15 (). It is shown in MA15 () that the randomized CD method can be extended to yield the minimum norm solution. In Hefny17 (), variants of RK and randomized CD for solving Tikhonov-regularized regression are proposed, and the corresponding convergence rates are derived. The derived rates indicate that the RK-based methods are preferable in some regimes of the problem dimensions, while the randomized CD based methods are preferable in others.

It has been recognized that randomization can be effective in simplifying the analysis of Kaczmarz type methods. In particular, Leventhal and Lewis Leventhal10 () used randomization in the Kaczmarz method and strengthened the convergence analysis of the resulting randomized algorithm. They concluded that "randomization here provides a framework for simplifying the analysis of algorithms, allowing easy bounds on the rates of linear convergence in terms of natural linear-algebraic condition measures…". The present paper goes a step further in trying to understand the power of randomization.


(Q1)   Can randomization make the (otherwise divergent) G-S algorithm convergent for a general linear system (1)?

1.2 Contribution of This Work

In this paper, we answer the above question affirmatively. In particular, we propose a doubly stochastic G-S algorithm that is provably linearly convergent (in expectation) for any feasible linear system of equations. The key in the algorithm design is to introduce a nonuniform double randomization scheme for picking the equation and the variable in each update step of the G-S algorithm, along with an appropriate stepsize rule. Interestingly, these randomization techniques also generalize to certain iterative alternating projection algorithms for solving the linear feasibility problem with an arbitrary $A$, as well as to certain high-dimensional convex minimization problems. Our results demonstrate that a carefully designed randomization scheme can make an otherwise divergent G-S algorithm converge linearly.

1.3 Notations

For any matrix , let denote the smallest constant such that for all . Let us define the relative condition number of as ; the scaled condition number is defined as . It is easy to verify that

(10)

We use and to denote the th column and th row of , respectively. For a symmetric matrix , we use , and to denote its maximum, the minimum and the minimum nonzero eigenvalues.

2 A Doubly Stochastic G-S Algorithm

2.1 Simultaneously selecting equations and variables?

Recall that in the G-S algorithm (3), we always use equation $i$ to update variable $x_i$. In other words, variable $x_i$ is locked to equation $i$. This locking is arbitrary, since one can simply re-order the equations (or re-index the variables) without affecting the solution of the linear system, and yet different variable-to-equation couplings give rise to different G-S update schemes, some of which may be divergent while others may be convergent. Figure 1 gives an illustrative example in $\mathbb{R}^2$, whereby if we use equation 1 to update variable $x_1$ and equation 2 to update variable $x_2$, then the G-S algorithm diverges (left subfigure), but if we use equation 1 to update variable $x_2$ and equation 2 to update variable $x_1$, then the G-S algorithm converges linearly (right subfigure). This example suggests that the variable-to-equation association can greatly affect the convergence of the G-S algorithm.

Figure 1: Convergence behavior of the G-S algorithm under two different variable to equation associations.

Thus, a natural way to design a convergent G-S algorithm for a general linear system (1) is to carefully select a fixed matching that determines which variable is to be updated by which equation. Clearly, the choice of a good matching (one that can lead to a convergent G-S algorithm) will depend on the coefficient matrix $A$. Unfortunately, this is a challenging task since the number of possible matchings is $n!$, which grows exponentially in $n$. Moreover, for a non-square linear system ($m \ne n$), it is not clear how to define such a matching between variables and equations.
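To make this concrete, the short sketch below enumerates all fixed variable-to-equation matchings of a small square system and checks, via the spectral radius of the associated iteration matrix, which matchings yield a convergent (unit-stepsize) G-S sweep. The example matrix, the unit stepsize, and the function names are hypothetical choices made only for illustration.

```python
import numpy as np
from itertools import permutations

def gs_iteration_matrix(A):
    """Iteration matrix of the unit-stepsize G-S sweep under the
    locked pairing 'equation j updates variable j'."""
    D_plus_L = np.tril(A)
    U = np.triu(A, 1)
    return -np.linalg.solve(D_plus_L, U)

# Hypothetical 2x2 system: which of the 2! = 2 matchings converge?
A = np.array([[1.0, 2.0], [3.0, 1.0]])
for perm in permutations(range(A.shape[0])):
    A_perm = A[list(perm), :]              # equation perm[j] updates variable j
    if np.any(np.diag(A_perm) == 0):       # pairing undefined if A[perm[j], j] = 0
        continue
    rho = max(abs(np.linalg.eigvals(gs_iteration_matrix(A_perm))))
    print(perm, "converges" if rho < 1 else "diverges", f"(rho = {rho:.2f})")
```

For this hypothetical matrix, one matching diverges while the swapped matching converges, mirroring the behavior illustrated in Figure 1.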

In this paper, we propose to unlock the fixed pairing of each variable to a unique equation in the G-S algorithm. Moreover, since determining which variable is to be updated by which equation is hard, we propose to simply do so randomly! More specifically, at each G-S iteration, we can randomly select a pair $(i, j)$, where $i$ is an index for an equation while $j$ is an index for a variable. Using this strategy, after picking the pair $(i, j)$, one can update variable $x_j$ using equation $i$ as follows (according to the G-S update rule)

(11)

To answer (Q1), it is then natural to ask the following question:


(Q2)   Can unlocking the fixed variable-equation pairing and randomization ensure the convergence of a G-S type algorithm for an arbitrary linear system (1)?

To understand the impact of unlocking and randomization, let us consider the following example.

Example 2: Consider the same system as given in Example 1, and let us consider the unlocked version of the G-S algorithm outlined above.

Specifically, after selecting the pair $(i, j)$, we can use one of the following four update rules to update the variables:

  1. Case 1. ;

  2. Case 2. ;

  3. Case 3. ;

  4. Case 4. .

Consider a uniform randomized update rule where, at each iteration, one of the above update rules is selected (each with probability 1/4) and used to update the corresponding variable. Consider an initialization in which $x_1^0$ and $x_2^0$ have the same sign. Without loss of generality, let us assume that both are positive. Define the random process . Using the uniform randomized update rule, one can show that

(12)

In order to show the divergence of the algorithm, it suffices to show that with probability one. To show this, we define the random process with and

(13)

Clearly, . Hence we only need to show with probability one. Notice that where is an i.i.d process with

(14)

It is not hard to see that . Therefore, due to the law of large numbers. Consequently, which implies , regardless of stepsize .

Furthermore, if one uses the cyclic update rule, then at each iteration of the G-S algorithm, each one of the above cases will be selected once according to some deterministic order. Therefore, in order to study the convergence of the resulting algorithm, one needs to look at the spectral radius of the resulting linear mapping. For example, if we select the update rules in the order 1), 2), 3), and 4), one needs to study the spectral radius of the product of the following matrices, where

One can check that any permutation of the above matrices will result in a product matrix whose spectral radius is larger than 1. Consequently, the resulting algorithm will diverge for almost all initializations and for any stepsize $\gamma$. Furthermore, one can numerically check that the randomly permuted rule will also diverge for almost all initializations and for any $\gamma$.

The above example suggests that, by simply unlocking the variable-equation association, the G-S type methods still may not converge. Moreover, Figure 2 shows an example in $\mathbb{R}^2$ whereby the G-S type algorithm will not converge when the stepsize is too large, regardless of the variable-equation association, but will converge for a sufficiently small stepsize, for any variable-equation association. Thus, stepsize control is also necessary (in addition to randomization) for the convergence of the G-S type algorithm.

Figure 2: Convergence behavior of the G-S algorithm under different stepsizes

Surprisingly, we will show in the subsequent sections that, by properly selecting a data-dependent updating probability as well as the stepsize $\gamma$, the randomized unlocked G-S algorithm (Algorithm 1 below) is globally linearly convergent in the mean squared error sense for any feasible linear system (1).

2.2 A Doubly Stochastic G-S Type Algorithm

We propose a doubly stochastic G-S type algorithm below. This randomized algorithm combines the unlocking idea with a certain non-uniform random selection of both equations and variables. The key that ensures the convergence of the resulting algorithm, compared with the divergent cases in Example 2, is a judicious selection of the probability that governs how the equations and the variables are picked, together with a stepsize rule. Note that if a particular entry $A_{ij}$ of the matrix is zero, then its corresponding probability is zero, so the $(i, j)$-th index pair will never be picked. Therefore the update (16) is well defined.

Algorithm 1. A doubly stochastic G-S (DSGS) algorithm. Choose a stepsize $\gamma$ and randomly generate $x^0$. At iteration $r$, randomly pick the index pair $(i, j)$ with the probability given in (15). Update $x_j$ by the following: (16)
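For readers who want to experiment, here is a minimal Python sketch in the spirit of Algorithm 1. Since the probability (15) and the update (16) are not reproduced above, the sampling rule (pair $(i, j)$ drawn with probability proportional to $A_{ij}^2$) and the damped correction of $x_j$ are assumptions made for illustration, not a verbatim statement of the algorithm. Note that, consistent with the remark above, zero entries of $A$ receive zero probability and are never sampled, so the division is always well defined.

```python
import numpy as np

def dsgs(A, b, gamma, num_iters, seed=0):
    """Sketch of a doubly stochastic G-S iteration: both the equation index i
    and the variable index j are sampled at random (here with probability
    proportional to A[i, j]**2 -- an assumption), and only x[j] is updated
    using equation i and the stepsize gamma."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    probs = (A**2).ravel() / np.sum(A**2)   # zero entries are never picked
    x = np.zeros(n)
    for _ in range(num_iters):
        k = rng.choice(m * n, p=probs)
        i, j = divmod(k, n)
        x[j] += gamma * (b[i] - A[i] @ x) / A[i, j]   # assumed form of the update
    return x
```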

2.3 Case I: $A$ has full column rank

We first consider the case where $A$ has full column rank. Let $x^*$ be the feasible solution for (1). For simplicity of notation we use the superscript to denote the new iteration. Let us define

(17)

We have the following result.

Theorem 1

Consider a consistent system (1) with $A$ having full column rank. Then there holds

Thus the DSGS algorithm converges globally linearly in the mean squared error sense for a suitable range of stepsizes. Moreover, if we choose the stepsize appropriately, then the DSGS algorithm achieves the following convergence rate:

(18)

Proof. Clearly we have the following relation

(19)

Also let us choose We have the following series of equalities

In the last equation, the notation denotes the th row of the matrix .

Summing the above equation over , we have

Clearly, to make the error converge geometrically, it suffices to have

Let us pick , then we have

This shows that the expected value of the optimality gap shrinks globally geometrically. Q.E.D.

Remark 1

Let us compare the rate obtained above with existing results in the literature. First, the rate of the randomized CD method obtained in (Leventhal10, Theorem 3.4) (for solving (1) with $A$ being symmetric and PD) is given by

(20)

We can see that our rate obtained in (18) takes a similar form, except that our rate is proportional to . This is reasonable due to the lack of symmetry and positive definiteness of $A$.

Alternatively, the rate obtained in Strohmer2008 () when using the RK method for solving (1) with $A$ having full column rank is given by (see, e.g., (Leventhal10, Theorem 4.2), (DeLoera17, Proposition 2))

(21)

Note that at each iteration of the RK method, all $n$ variables are updated. In contrast, our rate in (18) is proportional to , but at each iteration only one variable is updated. When $m$ is large and $n$ is large, we have

which indicates that the two rates are asymptotically comparable.

2.4 Case II: $A$ does not have full column rank

In this subsection, we assume that $A$ does not have full column rank, and that system (1) has at least one solution $x^*$. In this case, the previous analysis does not work because the bound (18) no longer implies linear convergence. Instead, we use a different analysis technique.

Let us define

(22)

Below we will show that the quantity converges linearly to zero in expectation.

Theorem 2

Consider a consistent system (1) with an arbitrary $A$. Let us pick

Then the doubly stochastic G-S algorithm achieves the following convergence rate

(23)

Proof. First, from the update rule (16), it is clear that when the tuple $(i, j)$ is picked, we have

(24)

where $e_j$ is the $j$th elementary vector. Left-multiplying both sides by $A$, we have

(25)

According to the definition (22), we further have

(26)

Since each tuple $(i, j)$ is picked with the probability given in (15), we have the following estimate

(27)

Summing over all pairs $(i, j)$, we have

Note that by definition , and the system is consistent, so . It follows that we have

Therefore, it suffices to ensure the following conditions

(28)

These imply that

(29)

Let us pick

(30)

then we have

The claim is proved. Q.E.D.

Remark 2

Let us compare the rate obtained above with the rate of a randomized CD method (Algorithm 3.5 in Leventhal10 ()), a method which updates only one variable at each iteration while utilizing one column of the matrix $A$. It is shown that for a consistent linear system of equations (1) with an arbitrary non-zero matrix $A$, the randomized CD method achieves the following rate

(31)

Clearly the above rate is closely related to the one given in (23).
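For context, here is a minimal sketch of one standard column-based randomized CD update of the kind described in this remark, applied to the least-squares reformulation of (1). The column-norm sampling rule and the exact update are assumptions taken from the common presentation of such methods, not a restatement of (31) or of Algorithm 3.5 in Leventhal10 ().

```python
import numpy as np

def randomized_cd(A, b, num_iters, seed=0):
    """Column-based randomized CD: sample column j with probability
    ||A[:, j]||^2 / ||A||_F^2 and exactly minimize 0.5 * ||A x - b||^2
    over the single coordinate x_j."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    col_norms2 = np.sum(A**2, axis=0)       # assumes every column of A is nonzero
    probs = col_norms2 / col_norms2.sum()
    x = np.zeros(n)
    residual = A @ x - b
    for _ in range(num_iters):
        j = rng.choice(n, p=probs)
        step = (A[:, j] @ residual) / col_norms2[j]
        x[j] -= step                        # exact minimization along coordinate j
        residual -= step * A[:, j]          # keep the residual A x - b up to date
    return x
```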

Remark 3

Although we are mainly interested in addressing Question (Q1), namely how to design a G-S scheme that converges for any matrix $A$, the proposed doubly stochastic algorithm does have important practical value. Consider the distributed computation setting where there is a central controller node connected to a number of distributed computing nodes, each holding a subset of the rows of the data matrix $A$ and of the vector $b$. The central controller stores the variable $x$ and the error term, and is capable of broadcasting to every distributed node. At each iteration, the central node randomly picks a pair $(i, j)$ and sends the scalar $x_j$ to the distributed node that holds row $i$; that node updates $x_j$ according to (16) using its local information. Then the new $x_j$ is transmitted back to the central node, which recomputes the error term and continues the process. Therefore, after $r$ iterations of the algorithm, the total number of messages transmitted between the local nodes and the central node is proportional to $r$. In comparison, if one implements the RK method in the same distributed network, then at each iteration on the order of $n$ messages have to be communicated from the local nodes to the central node.

3 A Doubly Stochastic Alternating Projection Algorithm

In this section, we extend our previous analysis to the problem of finding a point in the intersection of multiple polyhedral sets. The algorithm to be developed has the flavor of the classical alternating projection algorithm, except that we perform the alternating projection coordinate-wise. Specifically, we consider the following problem:

Find $x \in \mathbb{R}^n$ such that $Ax \le b$.    (32)

We will assume in the rest of this section that the system is feasible.

The proposed algorithm is closely related to Algorithm 1, except that an update is performed only when the selected inequality is violated.

Algorithm 2. The doubly stochastic alternating projection algorithm. Choose a stepsize $\gamma$ and randomly generate $x^0$. At iteration $r$, randomly pick the index pair $(i, j)$ with the prescribed probability.
Update $x_j$ by the following: (33) (34)
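As with Algorithm 1, a minimal sketch in the same spirit is given below. Because the sampling probability and the updates (33)-(34) are not reproduced above, the sampling rule and the form of the correction here are assumptions; the sketch only illustrates the stated principle that a randomly chosen coordinate is corrected using a randomly chosen inequality, and only when that inequality is violated.

```python
import numpy as np

def ds_alt_projection(A, b, gamma, num_iters, seed=0):
    """Sketch of a coordinate-wise alternating projection step for the
    feasibility problem Ax <= b: pick a pair (i, j) at random (here with
    probability proportional to A[i, j]**2 -- an assumption) and correct
    x[j] only if the i-th inequality is violated."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    probs = (A**2).ravel() / np.sum(A**2)   # zero entries are never picked
    x = np.zeros(n)
    for _ in range(num_iters):
        k = rng.choice(m * n, p=probs)
        i, j = divmod(k, n)
        violation = A[i] @ x - b[i]
        if violation > 0:                        # update only violated inequalities
            x[j] -= gamma * violation / A[i, j]  # assumed form of (33)-(34)
    return x
```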

To facilitate our analysis, let us define the following function

(35)

We note that any feasible solution of (32) makes this function zero. Further, each component function is differentiable, and its gradient vanishes whenever the corresponding inequality is satisfied. For a given iteration $r$, define the index set

(36)

and define the diagonal matrix whose $i$th diagonal entry equals one if $i$ belongs to this index set and zero otherwise. Then we have

(37)

where , and is defined similarly.

Note that by using the above definition, we have

(38)

3.1 Case 1: $A$ has full row rank

In this subsection we make the following assumption

(39)

Similarly as before, we will use (resp. ) to denote the new (resp. previous) iteration; we will use , and to denote at iteration , respectively.

Theorem 3

Suppose $A$ has full row rank, and the stepsize $\gamma$ is chosen as

(40)

Then we have

(41)

Proof. Suppose that the pair $(i, j)$ gets selected and that the $j$th coordinate gets updated (this means that the $i$th inequality is violated at the current iterate); then we can rewrite the iterate as follows

(42)

In this case, we can estimate the component function based on whether the $i$th inequality is satisfied at the new iterate. Suppose that it is not (i.e., the $i$th inequality is not satisfied); then we have