Stochastic Reformulations of Linear Systems: Algorithms and Convergence Theory

# Stochastic Reformulations of Linear Systems: Algorithms and Convergence Theory

Peter Richtárik KAUST and University of Edinburgh    Martin Takáč Lehigh University
###### Abstract

We develop a family of reformulations of an arbitrary consistent linear system into a stochastic problem. The reformulations are governed by two user-defined parameters: a positive definite matrix defining a norm, and an arbitrary discrete or continuous distribution over random matrices. Our reformulation has several equivalent interpretations, allowing for researchers from various communities to leverage their domain specific insights. In particular, our reformulation can be equivalently seen as a stochastic optimization problem, stochastic linear system, stochastic fixed point problem and a probabilistic intersection problem. We prove sufficient, and necessary and sufficient conditions for the reformulation to be exact.

Further, we propose and analyze three stochastic algorithms for solving the reformulated problem—basic, parallel and accelerated methods—with global linear convergence rates. The rates can be interpreted as condition numbers of a matrix which depends on the system matrix and on the reformulation parameters. This gives rise to a new phenomenon which we call stochastic preconditioning, and which refers to the problem of finding parameters (matrix and distribution) leading to a sufficiently small condition number. Our basic method can be equivalently interpreted as stochastic gradient descent, stochastic Newton method, stochastic proximal point method, stochastic fixed point method, and stochastic projection method, with fixed stepsize (relaxation parameter), applied to the reformulations.

Key words. linear systems, stochastic methods, iterative methods, randomized Kaczmarz, randomized Newton, randomized coordinate descent, random pursuit, randomized fixed point.

AMS subject classifications. 15A06, 15B52, 65F10, , 68W20, 65N75, 65Y20, 68Q25, 68W40, 90C20

## 1 Introduction

Linear systems form the backbone of most numerical codes used in academia and industry. With the advent of the age of big data, practitioners are looking for ways to solve linear systems of unprecedented sizes. The present work is motivated by the need to design such algorithms. As an algorithmic tool enabling faster computation, randomization is well developed, understood and appreciated in several fields, typically traditionally of a “discrete” nature, most notably theoretical computer science [38]. However, probabilistic ideas are also increasingly and successfully penetrating “continuous” fields, such as numerical linear algebra [69, 10, 11, 72, 37, 62, 1], optimization [30, 44, 46, 57, 71, 51], control theory [4, 5, 74], machine learning [64, 58, 26, 14, 55], and signal processing [7, 27].

In this work111All theoretical results in this paper were obtained by August 2016 and a first draft was circulated to a few selected colleagues in September 2016. The first author gave several talks on these results before this draft was made publicly available on arXiv: Linear Algebra and Parallel Computing at the Heart of Scientific Computing, Edinburgh, UK (Sept 21, 2016), Seminar on Combinatorics, Games and Optimisation, London School of Economics, London, UK, (Nov 16, 2016), Workshop on Distributed Machine Learning, Télécom ParisTech, Paris, France (Nov 25, 2016), Skoltech Seminar, Moscow, Russia (Dec 1, 2016), BASP Frontiers Workshop, Villars-sur-Ollon, Switzerland (Feb 1, 2017), and SIAM Conference on Optimization, Vancouver, Canada (May 22, 2017). In addition, the first author has included the results of this paper in the MSc/PhD course Modern optimization methods for big data problems, delivered in Spring 2017 at the University of Edinburgh, as an introduction into the role of randomized decomposition in linear algebra, optimization and machine learning. All main results of this paper were distributed to the students in the form of slides. we are concerned with the problem of solving a consistent linear system.

In particular, consider the problem

 solveAx=b, \hb@xt@.01(1.1)

where . We shall assume throughout the paper that the system is consistent, i.e., . Problem (LABEL:eq:linear_system) is arguably one of the most important problems in linear algebra. As such, a tremendous amount of research has been done to design efficient iterative algorithms [63]. However, surprisingly little is know about randomized iterative algorithms for solving linear systems. In this work we aim to contribute to closing this gap.

### 1.1 Stochastic reformulations of linear systems

We propose a fundamental and flexible way of reformulating each consistent linear system into a stochastic problem. To the best of our knowledge, this is the first systematic study of such reformulations. Stochasticity is introduced in a controlled way, into an otherwise deterministic problem, as a decomposition tool which can be leveraged to design efficient, granular and scalable randomized algorithms.

#### Parameters defining the reformulation

Stochasticity enters our reformulations through a user-defined distribution describing an ensemble of random matrices . We make use of one more parameter: a user-defined symmetric positive definite matrix . Our approach and underlying theory support virtually all thinkable distributions222We only require that the the expectation exists, where .. The choice of the distribution should ideally depend on the problem itself, as it will affect the conditioning of the reformulation. However, for now we leave such considerations aside.

#### One stochastic reformulation in four disguises

Our reformulation of (LABEL:eq:linear_system) as a stochastic problem has several seemingly different, yet equivalent interpretations, and hence we describe them here side by side.

a) Stochastic optimization problem. Consider the problem

 minimizef(x)def=ES∼\@fontswitchD[fS(x)], \hb@xt@.01(1.2)

where , , and denotes the Moore-Penrose pseudoinverse. When solving the problem, we do not have (or do not wish to exercise, as it may be prohibitively expensive) explicit access to , its gradient or Hessian. Rather, we can repeatedly sample and receive unbiased samples of these quantities at points of interest. That is, we may obtain local information about the stochastic function , such as the stochastic gradient , and use this to drive an iterative process for solving (LABEL:eq:problem:stoch_opt).

b) Stochastic linear system. Consider now a preconditioned version of the linear system (LABEL:eq:linear_system) given by

 solveB−1A⊤ES∼\@fontswitchD[H]Ax=B−1A⊤ES∼\@fontswitchD[H]b, \hb@xt@.01(1.3)

where is the preconditioner. The preconditioner is not assumed to be known explicitly. Instead, when solving the problem, we are able to repeatedly sample , obtaining an unbiased estimate of the preconditioner (not necessarily explicitly), , for which we coin the name stochastic preconditioner. This gives us access to an unbiased sample of the preconditioned system (LABEL:eq:problem:preconditioning): As we shall see—in an analogy with stochastic optimization—the information contained in such systems can be utilized by an iterative algorithm to solve (LABEL:eq:problem:preconditioning).

c) Stochastic fixed point problem. Let denote the projection of onto , in the norm . Consider the stochastic fixed point problem

 solvex=ES∼\@fontswitchD[ΠB\@fontswitchLS(x)]. \hb@xt@.01(1.4)

That is, we seek to find a fixed point of the mapping . When solving the problem, we do not have an explicit access to the average projection map. Instead, we are able to repeatedly sample , and use the stochastic projection map .

d) Probabilistic intersection problem. Note that for all . We would wish to design in such a way that a suitably chosen notion of an intersection of the sets is equal to . The correct notion is what we call probabilistic intersection, denoted , and defined as the set of points which belong to with probability one. This leads to the problem:

 findx∈∩S∼\@fontswitchD\@fontswitchLSdef={x:Prob(x∈\@fontswitchLS)=1}. \hb@xt@.01(1.5)

As before, we typically do not have an explicit access to the probabilistic intersection when designing an algorithm. Instead, we can repeatedly sample , and utilize the knowledge of to drive the iterative process. If is a discrete distribution, probabilistic intersection reduces to standard intersection.

All of the above formulations have a common feature: they all involve an expectation over , and we either do not assume this expectation is known explicitly, or even if it is, we prefer, due to efficiency or other considerations, to sample from unbiased estimates of the objects (e.g., stochastic gradient , stochastic preconditioner , stochastic projection map , random set ) appearing in the formulation.

#### Equivalence and exactness

We show that all these stochastic reformulations are equivalent (see Theorem LABEL:thm:X). In particular, the following sets are identical: the set of minimizers of the stochastic optimization problem (LABEL:eq:problem:stoch_opt), the solution set of the stochastic linear system (LABEL:eq:problem:preconditioning), the set of fixed points of the stochastic fixed point problem (LABEL:eq:problem:fixed_point), and the probabilistic intersection (LABEL:eq:problem:intersection). Further, we give necessary and sufficient conditions for this set to be equal to . If this is the case, we say the the reformulation is exact (see Section LABEL:sec:exactness). Distributions satisfying these conditions always exist, independently of any assumptions on the system beyond consistency. The simplest, but also the least useful choice of a distribution is to pick (the identity matrix), with probability one. In this case, all of our reformulations become trivial.

### 1.2 Stochastic algorithms

Besides proposing a family of stochastic reformulations of the linear system (LABEL:eq:linear_system), we also propose three stochastic algorithms for solving them: Algorithm LABEL:alg:alg1 (basic method), Algorithm LABEL:alg:alg2 (parallel/minibatch method), and Algorithm LABEL:alg:alg3 (accelerated method). Each method can be interpreted naturally from the viewpoint of each of the reformulations.

#### Basic method

Below we list some of the interpretations of Algorithm LABEL:alg:alg1 (basic method), which performs updates of the form

 xk+1=ϕω(xk,Sk)def=xk−ωB−1A⊤Sk(S⊤kAB−1A⊤Sk)†S⊤k(Axk−b), \hb@xt@.01(1.6)

where is sampled independently in each iteration. The method is formally presented and analyzed in Section LABEL:sec:Basic_Method.

a) Stochastic gradient descent. Algorithm LABEL:alg:alg1 can be seen as stochastic gradient descent [60], with fixed stepsize, applied to (LABEL:eq:problem:stoch_opt). At iteration of the method, we sample , and compute , which is an unbiased stochastic approximation of . We then perform the step

 xk+1=xk−ω∇fSk(xk), \hb@xt@.01(1.7)

where is a stepsize.

Let us note that in order to achieve linear convergence it is not necessary to use any explicit variance reduction strategy [64, 26, 28, 9], nor do we need to use decreasing stepsizes. This is because the stochastic gradients vanish at the optimum, which is a consequence of the consistency assumption. Surprisingly, we get linear convergence in spite of the fact that we deal with a non-finite-sum problem (LABEL:eq:problem:stoch_opt), and without the need to assume boundedness of the stochastic gradients, and without being strongly convex. To the best of our knowledge, this is the first linearly convergent accelerated method for stochastic optimization without requiring strong convexity. This beats the minimax bounds given by Srebro [70]. This is because (LABEL:eq:problem:stoch_opt) is not a black-box stochastic optimization objective; indeed, we have constructed it in a particular way from the underlying linear system (LABEL:eq:linear_system).

b) Stochastic Newton method. However, Algorithm LABEL:alg:alg1 can also be seen as a stochastic Newton method. At iteration we sample , and instead of applying the inverted Hessian of to the stochastic gradient (this is not possible as the Hessian is not necessarily invertible), we apply a pseudoinverse (which always exists). That is, we perform the step

 \hb@xt@.01(1.8)

where is a stepsize and the -pseudoinverse of a matrix is defined as . While any pseudoinverse will resolve non-invertibility issue, since we work in the geometry induced by the –inner product, the -pseudoinverse is the right choice. One may wonder why methods (LABEL:eq:alg:SGD) and (LABEL:eq:alg:SNM) are equivalent; after all, the (stochastic) gradient descent and (stochastic) Newton methods are not equivalent in general. However, in our setting it turns out that the stochastic gradient is always an eigenvector of , with eigenvalue 1 (see Lemma LABEL:lem:all_sort_of_stuff_is_equal).

Stochastic Newton-type methods were recently developed and analyzed in the optimization and machine learning literature [50, 49, 52, 39]. However, they are design to solve different problems, and operate in a different manner.

c) Stochastic proximal point method. If we restrict our attention to stepsizes satisfying , then Algorithm LABEL:alg:alg1 can be equivalently (see Theorem LABEL:thm:SPP in the Appendix) written down as

 xk+1=argminx∈Rn{fSk(x)+1−ω2ω∥x−xk∥2B}. \hb@xt@.01(1.9)

That is, (LABEL:eq:alg:SPP) is a stochastic variant of the proximal point method for solving (LABEL:eq:problem:stoch_opt), with a fixed regularization parameter [61]. The proximal point method is obtained from (LABEL:eq:alg:SPP) by replacing with . If we define the prox operator of a function with respect to the -norm as then iteration (LABEL:eq:alg:SPP) can be written compactly as

d) Stochastic fixed point method. From the perspective of the stochastic fixed point problem (LABEL:eq:problem:fixed_point), Algorithm LABEL:alg:alg1 can be interpreted as a stochastic fixed point method, with relaxation. We first reformulate the problem into an equivalent form using relaxation, which is done to improve the contraction properties of the map. We pick a parameter , and instead consider the equivalent fixed point problem . Now, at iteration , we sample , which enables us to obtain an unbiased estimate of the new fixed point mapping, and then simply perform one step of a fixed point method on this mapping:

 xk+1=ωΠB\@fontswitchLSk(xk)+(1−ω)xk. \hb@xt@.01(1.10)

e) Stochastic projection method. Algorithm LABEL:alg:alg1 can also be seen as a stochastic projection method applied to the probabilistic intersection problem (LABEL:eq:problem:intersection). By sampling , we are one of the sets defining the intersection, namely . We then project the last iterate onto this set, in the -norm, followed by a relaxation step with relaxation parameter . That is, we perform the update

 xk+1=xk+ω(ΠB\@fontswitchLSk(xk)−xk). \hb@xt@.01(1.11)

This is a randomized variant of an alternating projection method. Note that the representation of as a probabilistic intersection of sets is not given to us. Rather, we construct it with the hope to obtain faster convergence.

An optimization algorithm utilizing stochastic projection steps was developed in [41]. For a comprehensive survey of projection methods for convex feasibility problems, see [2].

#### Parallel method

A natural parallelization strategy is to perform one step of the basic method independently times, starting from the same point , and average the results:

 xk+1=1ττ∑i=1ϕω(xk,Sik), \hb@xt@.01(1.12)

where are independent samples from (recall that is defined in (LABEL:eq:alg1-informal)). This method is formalized as Algorithm LABEL:alg:alg2, and studied in Section LABEL:sec:minibatch. Betrayed by our choice of the name, this method is useful in scenarios where parallel workers are available, allowing for the basic steps to be computed in parallel, followed by an averaging operation.

From the stochastic optimization viewpoint, this is a minibatch method. Considering the SGD interpretation (LABEL:eq:alg:SGD), we can equivalently write (LABEL:eq:parallel_method_intro) in the form This is minibatch SGD. Iteration complexity of minibatch SGD was first understood in the context of training support vector machines with the hinge loss [73]. Complexity under a lock-free paradigm, in a different setting from ours, was first studied in [47]. Notice that in the limit , we obtain gradient descent. It is therefore interesting to study the complexity of the parallel method as a function . Of course, this method can also be interpreted as a minibatch stochastic Newton method, minibatch proximal point method and so on.

From the probabilistic intersection point of view, method (LABEL:eq:parallel_method_intro) can be interpreted as a stochastic variant of the parallel projection method. In particular, we obtain the iterative process

 xk+1=xk+ω[(1ττ∑i=1ΠB\@fontswitchLSik(xk))−xk].

That is, we move from the current iterate, , towards the average of the projection points, with undershooting (if ), precisely landing on (if ), or overshooting (if ) the average. Projection methods have a long history and are well studied [12, 3]. However, much less is known about stochastic projection methods.

#### Accelerated method

In order to obtain acceleration without parallelization—that is, acceleration in the sense of Nesterov [45]—we suggest to perform an update step in which depends on both and . In particular, we take two dependent steps of Algorithm LABEL:alg:alg1, one from and one from , and then take an affine combination of the results. That is, the process is started with , and for involves an iteration of the form

 xk+1=γϕω(xk,Sk)+(1−γ)ϕω(xk−1,Sk−1), \hb@xt@.01(1.13)

where the matrices are independent samples from , and is an acceleration parameter. Note that by choosing (no acceleration), we recover Algorithm LABEL:alg:alg1. This method is formalized as Algorithm LABEL:alg:alg3 and analyzed in Section LABEL:sec:acceleration. Our theory suggests that should be always between and . In particular, for well conditioned problems333The condition number, , is defined in (LABEL:eq:condition_number)., one should choose , and for ill conditioned problems, one should choose .

### 1.3 Complexity

The iteration complexity of our methods is completely described by the spectrum of the (symmetric positive semidefinite) matrix

 Wdef=B−1/2A⊤ES∼\@fontswitchD[H]AB−1/2.

Let be the eigenvalue decomposition of , where are the eigenvectors, are the eigenvalues, and . It can be shown that the largest eigenvalue, is bounded above by 1 (see Lemma LABEL:lem:spectrumxx). Let be the smallest nonzero eigenvalue.

With all of the above reformulations we associate the same condition number

 ζ=ζ(A,B,\@fontswitchD)def=∥W∥∥W†∥=λmaxλ+min, \hb@xt@.01(1.14)

where is the spectral norm, is the largest eigenvalue of and is the smallest nonzero eigenvalue of . Note that, for example, is the condition number of the Hessian of , and also the condition number of the stochastic linear system (LABEL:eq:problem:preconditioning). Natural interpretations from the viewpoint of the stochastic fixed point and probabilistic intersection problems are also possible. As one varies the parameters defining the reformulation ( and ), the condition number changes. For instance, choosing with probability one gives .

#### Exact formula for the evolution of expected iterates

We first show (Theorem LABEL:thm:alg1_complexity_first) that after the canonical linear transformation , the expected iterates of the basic method satisfy the identity

 E[U⊤B1/2(xk−x∗)]=(I−ωΛ)kU⊤B1/2(x0−x∗), \hb@xt@.01(1.15)

where is an arbitrary solution of the linear system (i.e., ). This identity seems to suggest that zero eigenvalues cause an issue, preventing convergence of the corresponding elements of the error to zero. Indeed, if , then (LABEL:eq:i98t8ghdiee) implies that , which does not change with . However, it turns out that under exactness we have whenever if we let to be the projection, in the -norm, of onto (Theorem LABEL:thm:alg1_complexity_first). This is then used to argue (Corollary LABEL:thm:corollary) that converges to zero if and only if . The choice of stepsize issue is discussed in detail in Section LABEL:subsec:_omega.

The main complexity results obtained in this paper are summarized in Table LABEL:tbl:summary_intro. The full statements including the dependence of the rate on these parameters, as well as other alternative results (such as lower bounds, ergodic convergence) can be found in the theorems referenced in the table.

#### L2 convergence

The rate of decay of the quantity for three different stepsize choices is summarized in the first three rows of Table LABEL:tbl:summary_intro. In particular, the default stepsize leads to the complexity , the long stepsize gives the improved complexity , and the optimal stepsize gives the best complexity . However, if we are interested in the convergence of the larger quantity (L2 convergence), it turns out that is the optimal choice, leading to the complexity .

#### Parallel and accelerated methods

The parallel method improves upon the basic method in that it is capable of faster L2 convergence. We give a complexity formula as a function of , recovering the complexity the rate of the basic method in the case, and achieving the improved asymptotic complexity as (recall that , whence the improvement). Because of this, is the quantity driving parallelizability. If is close to 1, then there is little or no reason to parallelize. If is very small, parallelization helps. The smaller is, the more gain is achieved by utilizing more processors.

With the correct stepsize () and acceleration () parameters, the accelerated method improves the complexity achieved by the basic method to . However, this is established for the quantity . We conjecture that the L2 convergence rate of the accelerated method (for a suitable choice of the parameters and ) is .

#### Novelty

All convergence results presented in this paper are new in one way or another. Indeed, we extend the methods from [19, 20] to include a stepsize/relaxation parameter, or analyze these methods under a weaker condition (basic method with under the “exactness” assumption), or develop new methods (parallel and accelerated variants). Let us focus on the case of the basic method with unit stepsize as this method was considered in [19] and [20]: see lines 1 and 4 of Table LABEL:tbl:summary_intro. Even in this case, our results hold under weaker assumptions; and the unit stepsize is not optimal for the convergence rate of the quantity in line 1 (this can be seen by looking at the improved rates in lines 2 and 3). While [20] weakens the (rather strong) assumption in [19] (full column rank of ), our assumption (which we call “exactness”) is even weaker. The focus of paper [20] was both to weaken the assumptions in [19], and also to develop a duality theory for what we now call the “basic method”. Here we do not touch on duality theory at all (the extension is rather straightforward).

### 1.4 Stochastic preconditioning

We coin the phrase stochastic preconditioning to refer to the general problem of designing matrix and distribution such that the appropriate condition number of is well behaved. For instance, one might be interested in minimizing (or reducing) the condition number if the basic method with unit stepsize is used, and the quantity we wish to converge to zero is either , , or (see Lines 1, 4 and 5 of Table LABEL:tbl:summary_intro). On the other hand, if we can estimate , then one may use the basic method with the larger stepsize , in which case we may wish to minimize (or reduce) the condition number (see Line 2 of Table LABEL:tbl:summary_intro).

One possible approach to stochastic preconditioning is to choose some and then focus on a reasonably simple parametric family of distributions , trying to find the parameters which minimize (or reduce) the condition number of interest. The distributions in this family should be “comparable” in the sense that the cost of one iteration of the method of interest should be comparable for all distributions; as otherwise comparing bounds on the number of iterations does not make sense.

To illustrate this through a simple example, consider the family of discrete uniform distributions over vectors in (that is, ), where the vectors themselves are the parameters defining the family. The cost of one iteration of the basic method will be proportional to the cost of performing a matrix-vector product of the form , which is comparable across all distributions in the family (assuming the vectors are dense, say). To illustrate this approach, consider this family, and further assume that is symmetric and positive definite. Choose . It can be shown that is maximized precisely when correspond to the eigenvectors of . In this case, , and hence our stochastic preconditioning strategy results in a condition number which is independent of the condition number of . If we now apply the basic method to the stochastic optimization reformulation, we can interpret it as a spectral variant of stochastic gradient descent (spectral SGD). Ignoring logarithmic terms, spectral SGD only needs to perform matrix vector multiplications to solve the problem. While this is not a practical preconditioning strategy—computing the eigenvectors is hard, and if we actually had access to them, we could construct the solution directly, without the need to resort to an iterative scheme—it sheds light on the opportunities and challenges associated with stochastic preconditioning.

All standard sketching matrices can be employed within our framework, including the count sketch [6] and the count-min sketch [8]. In the context to this paper (since we sketch with the transpose of ), is a count-sketch matrix (resp. count-min sketch) if it is assembled from random columns of (resp ), chosen uniformly with replacement, where is the identity matrix.

The notion of importance sampling developed in the last 5 years in the randomized optimization and machine learning literature [56, 76, 53, 51] can be seen a type of stochastic preconditioning, somewhat reverse to what we have outlined above. In these methods, the atoms forming the distribution are fixed, and one is seeking to associate them with appropriate probabilities. Thus, the probability simplex is the parameter space defining the class of distributions one is considering.

Stochastic preconditioning is fundamentally different from the idea of randomized preconditioning [62, 1], which is based on a two-stage procedure. In the first step, the input matrix is randomly projected and an good preconditioning matrix is extracted. In the second step, an iterative least squares solver is applied to solve the preconditioned system.

Much like standard preconditioning, different stochastic preconditioning strategies will need to be developed for different classes of problems, with structure of informing the choice of and . Due to its inherent difficulty, stochastic preconditioning is beyond the scope of this paper.

### 1.5 Notation

For convenience, a table of the most frequently used notation is included in Appendix LABEL:sec:notation_glossary. All matrices are written in bold capital letters. By and we mean the range space and null space of matrix , respectively. Given a symmetric positive definite matrix , we equip with the Euclidean inner product defined by . We also define the induced norm: . The short-hand notation means , where is the identity matrix. We shall often write for matrix being merely positive semidefinite; this constitutes a pseudonorm.

## 2 Further Connections to Existing Work

In this section we outline several connections of our work with existing developments. We do not aim to be comprehensive.

### 2.1 Randomized Kaczmarz method, with relaxation and acceleration

Let , and choose as follows: with probability . Since

 W=B−1/2E[Z]B−1/2=E[Z]=m∑i=1piA⊤i:Ai:∥Ai:∥22=1∥A∥2Fm∑i=1A⊤i:Ai:=A⊤A∥A∥2F.

the condition number is

 ζ=∥W∥∥W†∥=∥E[Z]∥∥E[Z]†∥=λmax(A⊤A)λ+min(A⊤A). \hb@xt@.01(2.1)

#### Basic method

In this setup, Algorithm LABEL:alg:alg1 simplifies to

 xk+1=xk−ω(Ai:xk−bi)∥Ai:∥22A⊤i:.

For , this reduces to the celebrated randomized Kaczmarz method (RK) of Strohmer and Vershyinin [72]. For , this is RK with overrelaxation – a new method not considered before. Based on Theorem LABEL:thm:alg1_bestomega, for the iteration complexity of Algorithm LABEL:alg:alg1 is This is an improvement on standard RK method (with ), whose complexity depends on instead of . Thus, the improvement can be as large as by a factor .

#### Accelerated method

In the same setup, Algorithm LABEL:alg:alg3 simplifies to

 xk+1=γ(xk−ω(Aik:xk−bik)∥Aik:∥22A⊤ik:)+(1−γ)(xk−1−ω(Aik−1:xk−1−bik−1)∥Aik−1:∥22A⊤ik−1:)

This is accelerated RK method with overrelaxation – a new method not considered before. Based on Theorem LABEL:thm:main-accelerated, for the parameter choice and , the iteration complexity of this method is If we instead choose and , the iteration complexity gets slightly worse: . To the best of our knowledge, this is the best known complexity for a variant of RK. Let us remark that an asynchronous accelerated RK method was developed in [31].

The randomized Kaczmarz method, its variants have received considerable attention recently [42, 77, 43, 54], and several connections to existing methods were made. Kaczmarz-type methods in a Hilbert setting were developed in [48].

### 2.2 Basic method with unit stepsize

The method was first proposed and analyzed (under a full rank assumption on ) in [19]. Note that in view of (LABEL:eq:alg:SFP), this is the basic method with unit stepsize. However, it was not interpreted as a method for solving any of the reformulations presented here, and as a result, all the interpretations we are giving here also remained undiscovered. Instead, it was developed and presented as a method for finding the unique solution of (LABEL:eq:linear_system).

### 2.3 Duality

As we have seen, all three methods developed in this paper converge to a specific solution of the linear system (LABEL:eq:linear_system), namely, to the projection of the starting point onto the solution space: . Therefore, our methods solve the best approximation problem

 minx∈Rn{∥x−x0∥2B:Ax=b}. \hb@xt@.01(2.2)

The “dual” of the basic method with unit stepsize was studied in this context in [20]. The Fenchel dual of the best approximation problem is an unconstrained concave quadratic maximization problem of the form where the dual objective depends on and . In [20] it was shown that the basic method with unit stepsize closely related to a dual method (stochastic dual subspace ascent) performing iterations of the form

 yk+1=yk+Skλk, \hb@xt@.01(2.3)

where , and is chosen in such a way that the dual objective is as large as possible. Notice that the dual method in each iteration performs exact “line search” in a random subspace of spanned by the columns of the random matrix , and passing through . In particular, the iterates of the basic method with unit stepsize arise as affine images of the iterates of the dual method: .

In a similar fashion, it is possible to interpret the methods developed in this paper as images of appropriately designed dual methods. In [20], the authors focus on establishing convergence of various quantities, such as dual objective, primal-dual gap, primal objective and so on. They obtain the complexity , which is identical to the rate we obtain here for the basic method with unit stepsize. However, their results require a stronger assumption on (their assumption implies exactness, but not vice versa). We perform a much deeper analysis from the novel viewpoint of stochastic reformulations of linear systems, include a stepsize, and propose and analyze parallel and accelerated variants.

In the special case when is chosen to be a random unit coordinate vector, (LABEL:eq:SDA) specializes to the randomized coordinate descent method, first analyzed by Leventhal and Lewis [30]. In the special case when is chosen as a random column submatrix of the identity matrix, (LABEL:eq:SDA) specializes to the randomized Newton method of Qu et al. [52]. Randomized coordinate descent methods are the state of the art methods for certain classes of convex optimization problems with a very large number of variables. The first complexity analysis beyond quadratics was performed in [66, 46, 57], a parallel method was developed in [58], duality was explored in [67] and acceleration in [14].

### 2.4 Randomized gossip algorithms

It was shown in [19, 33] that for a suitable matrix encoding the structure of a graph, and for , the application of the randomized Kaczmarz and randomized block Kaczmarz methods to (LABEL:eq:s98y0f9yhfsee) lead to classical and new randomized gossip algorithms developed and studied in the signal processing literature, with new insights and novel proofs. Likewise, when applied in the same context, our new methods lead to new parallel and accelerated gossip algorithms.

### 2.5 Empirical risk minimization

Regularized empirical risk minimization (ERM) problems are optimization problems of the form

 minx∈Rn1mm∑i=1fi(x)+g(x), \hb@xt@.01(2.4)

where is a loss function and a regularizer. Problems of this form are of key importance in machine learning and statistics [65]. Let if and otherwise, further let . In this setting, the ERM problem (LABEL:eq:ERMxx) becomes equivalent to (LABEL:eq:s98y0f9yhfsee). While quadratic regularizers similar to are common in machine learning, zero/infinity loss functions are not used. For this reason, this specific instance of ERM was not studied in the machine learning literature. Since all our methods solve (LABEL:eq:s98y0f9yhfsee), they can be seen as stochastic algorithms for solving the ERM problem (LABEL:eq:ERMxx).

Since there is no reason to expect that any of our methods will satisfy for any finite , the ERM objective value can remain to be equal to throughout the entire iterative process. From this perspective, the value of the ERM objective is unsuitable as a measure of progress.

### 2.6 Matrix inversion and quasi-Newton methods

Given an invertible matrix , its inverse is the unique solution of the matrix equation . In [22] the authors have extended the “sketch and project” method [19] to this equation. In each iteration of the method, one projects the current iterate matrix , with respect to a weighted Frobenius norm, onto the sketched equation . This is a similar iterative process to the basic method with unit stepsize. The authors of [22] prove that the iterates of method converge to the inverse matrix at a linear rate, and detail connections of their method to quasi-Newton updates and approximate inverse preconditioning. A limited memory variant of the stochastic block BFGS method has been used to develop new efficient stochastic quasi-Newton methods for empirical risk minimization problems appearing in machine learning [16].

It is possible to approach the problem in the same way we approach the system (LABEL:eq:linear_system) in our paper, writing down stochastic reformulations, and then developing new variants of the sketch and project method [22]: a basic method with a stepsize, and parallel and accelerated methods. This would lead to the development of new variants of stochastic quasi-Newton rules, notably parallel and accelerated block BFGS. We conjecture that these rules will have superior performance to classical BFGS in practice.

Similar extensions and improvements can be done in relation to the problem of computing the pseudoinverse of very large rectangular matrices [21].

## 3 Stochastic Reformulations of Linear Systems

In this section we formally derive the four stochastic formulations outlined in the introduction: stochastic optimization, stochastic linear system, stochastic fixed point problem and probabilistic intersection. Along the way we collect a number of results and observations which will be useful in the complexity analysis of our methods.

### 3.1 Projections

For a closed convex set , we let denote the projection operator onto , in the -norm. That is, The -pseudoinverse of a matrix is defined as

 M†Bdef=B−1M⊤(MB−1M⊤)†. \hb@xt@.01(3.1)

The projection onto is given by

 ΠB\@fontswitchL(x)=x−B−1A⊤(AB−1A⊤)†(Ax−b)x−A†B(Ax−b). \hb@xt@.01(3.2)

Note that for , we get , and hence the -pseudoinverse reduces to the standard Moore-Penrose pseudoinverse. The -pseudoinverse satisfies

### 3.2 Stochastic functions

Let be an arbitrary distribution over matrices, which we shall denote as . We shall write to say that is drawn from . We shall often refer to matrix expressions involving and . In order to keep the expressions brief throughout the paper, it will be useful to define

 Hdef=S(S⊤AB−1A⊤S)†S⊤, \hb@xt@.01(3.3)

and

 \hb@xt@.01(3.4)

Notice that That is, is the projection matrix corresponding to projection onto in the -norm. In particular, we have the relations

 (B−1Z)2=B−1Z% andZB−1Z=Z. \hb@xt@.01(3.5)

Given , we define the stochastic (random) function

 fS(x)def=12∥Ax−b∥2H=12(Ax−b)⊤H(Ax−b). \hb@xt@.01(3.6)

By combining (LABEL:eq:prodstoch) and (LABEL:eq:Z), this can be also written in the form

 fS(x)=12(x−x∗)⊤Z(x−x∗),x∈Rn,x∗∈\@fontswitchL. \hb@xt@.01(3.7)

For all and all we have the expansion , where

 ∇fS(x)=B−1A⊤H(Ax−b)and∇2fS=B−1Z \hb@xt@.01(3.8)

are the gradient and Hessian of with respect to the -inner product, respectively.444If , then is the standard Euclidean inner product, and we recover formulas for the standard gradient and Hessian. Note that is both self-adjoint and positive semidefinite with respect to the