Weighted SGD for ℓp Regression with Randomized Preconditioning¹

¹A conference version of this paper appears under the same title in Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, Arlington, VA, 2016 (Yang et al., 2016a).


Abstract

In recent years, stochastic gradient descent (SGD) methods and randomized linear algebra (RLA) algorithms have been applied to many large-scale problems in machine learning and data analysis. SGD methods are easy to implement and applicable to a wide range of convex optimization problems. In contrast, RLA algorithms provide much stronger performance guarantees but are applicable to a narrower class of problems. We aim to bridge the gap between these two methods in solving constrained overdetermined linear regression problems, e.g., ℓ1 and ℓ2 regression problems.

• We propose a hybrid algorithm named pwSGD that uses RLA techniques for preconditioning and constructing an importance sampling distribution, and then performs an SGD-like iterative process with weighted sampling on the preconditioned system.

• By rewriting a deterministic ℓp regression problem as a stochastic optimization problem, we connect pwSGD to several existing ℓp solvers, including RLA methods with algorithmic leveraging (RLA for short).

• We prove that pwSGD inherits faster convergence rates that only depend on the lower dimension d of the linear system, while maintaining low computational complexity. Such SGD convergence rates are superior to those of related SGD algorithms such as the weighted randomized Kaczmarz algorithm.

• Particularly, when solving ℓ1 regression with size n by d, pwSGD returns an approximate solution with ε relative error in the objective value in O(log n · nnz(A) + poly(d)/ε²) time. This complexity is uniformly better than that of RLA methods in terms of both ε and d when the problem is unconstrained. In the presence of constraints, pwSGD only has to solve a sequence of much simpler and smaller optimization problems over the same constraints. In general this is more efficient than solving the constrained subproblem required in RLA.

• For ℓ2 regression, pwSGD returns an approximate solution with ε relative error in the objective value and the solution vector measured in prediction norm in O(log n · nnz(A) + poly(d) log(1/ε)/ε) time. We show that for unconstrained ℓ2 regression, this complexity is comparable to that of RLA and is asymptotically better than several state-of-the-art solvers in the regime where the desired accuracy ε, high dimension n and low dimension d satisfy d ≥ 1/ε and n ≥ d²/ε.

We also provide lower bounds on the coreset complexity for more general ℓp regression problems, indicating that new ideas will still be needed to extend similar RLA preconditioning ideas to weighted SGD algorithms for more general ℓp regression problems. Finally, the effectiveness of such algorithms is illustrated numerically on both synthetic and real datasets, and the results are consistent with our theoretical findings and demonstrate that pwSGD converges to a medium-precision solution, e.g., with ε = 10⁻³, more quickly.

1 Introduction

Many novel algorithms for large-scale data analysis and machine learning problems have emerged in recent years, among which stochastic gradient descent (SGD) methods and randomized linear algebra (RLA) algorithms have received much attention—both for their strong performance in practical applications and for their interesting theoretical properties (Bottou, 2010; Mahoney, 2011). Here, we consider the ubiquitous ℓ1 and ℓ2 regression problems, and we describe a novel RLA-SGD algorithm called pwSGD (preconditioned weighted SGD). Our new algorithm combines the advantages of both RLA and SGD methods for solving constrained overdetermined ℓ1 and ℓ2 regression problems.

Consider the overdetermined ℓp regression problem

 min_{x∈𝒵} f(x) = ∥Ax − b∥_p, (1)

where p ∈ [1, ∞], A ∈ ℝ^{n×d}, b ∈ ℝⁿ, and n ≫ d. When 𝒵 = ℝᵈ, i.e., the solution space is unconstrained, the cases p = 1 and p = 2 are respectively known as the Least Absolute Deviations (LAD, or ℓ1) and Least-squares (LS, or ℓ2) regression problems. Classically, the unconstrained ℓ2 regression problem can be solved by eigenvector-based methods with worst-case running time O(nd²) (Golub and Van Loan, 1996), or by iterative methods for which the running time depends on the condition number of A (Barrett et al., 1994; Kelley, 1995; Saad, 2003), while the unconstrained ℓ1 regression problem can be formulated as a linear program (Portnoy and Koenker, 1997; Chen et al., 2001) and solved by an interior-point method (Portnoy and Koenker, 1997; Portnoy, 1997).
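As a concrete (and purely illustrative) instance of the unconstrained setting above, the following NumPy sketch builds an overdetermined system and solves the ℓ2 case with a direct method; the data and sizes (x_true, n, d, the noise level) are our own hypothetical choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5                      # overdetermined: n >> d
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.01 * rng.standard_normal(n)

# Unconstrained least-squares (p = 2): solvable by a direct method.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# The objective f(x) = ||Ax - b||_p for p = 1 and p = 2.
f1 = np.linalg.norm(A @ x_ls - b, 1)
f2 = np.linalg.norm(A @ x_ls - b, 2)
```

With a small noise level, the least-squares estimate closely recovers the planted vector; the ℓ1 objective dominates the ℓ2 objective at any point, as always.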

For these and other regression problems, SGD algorithms are widely used in practice because of their scalability and efficiency. In contrast, RLA algorithms have better theoretical guarantees but (thus far) have been less flexible, e.g., in the presence of constraints. For example, they may use an interior point method for solving a constrained subproblem, and this may be less efficient than SGD. (Without constraints, RLA methods can be used to construct subproblems to be solved exactly, or they can be used to construct preconditioners for the original problem; see Yang et al. (2016b) for details and implementations of these RLA methods to compute low, medium, and high precision solutions on up to terabyte-sized input data.) In this paper, we combine these two algorithmic approaches to develop a method that takes advantage of the strengths of both of these approaches.

1.1 Overview of our main algorithm

Our main algorithm pwSGD is a hybrid method for solving constrained overdetermined ℓ1 and ℓ2 regression problems. It consists of two main steps. First, apply RLA techniques for preconditioning and construct an importance sampling distribution. Second, apply an SGD-like iterative phase with weighted sampling on the preconditioned system. Such an algorithm preserves the simplicity of SGD and the high-quality theoretical guarantees of RLA. In particular, we prove that after preconditioning, the number of iterations required to converge to a target accuracy is fully predictable and only depends on the low dimension d, i.e., it is independent of the high dimension n. We show that, with a proper choice of preconditioner, pwSGD runs in O(log n · nnz(A) + poly(d)/ε²) time to return an approximate solution with ε relative error in the objective for constrained ℓ1 regression; and in O(log n · nnz(A) + poly(d)/ε) time to return an approximate solution with ε relative error in the solution vector in prediction norm for constrained ℓ2 regression. Furthermore, for unconstrained ℓ2 regression, pwSGD runs in O(log n · nnz(A) + poly(d) log(1/ε)/ε) time to return an approximate solution with ε relative error in the objective.

To provide a quick overview of how pwSGD compares to existing algorithms, in Tables 1 and 2 we summarize the complexity required to compute a solution with relative error ε for several solvers for unconstrained ℓ1 and ℓ2 regression. In Table 1, RLA with algorithmic leveraging (RLA for short) (Clarkson et al., 2013; Yang et al., 2014) is a popular method for obtaining a low-precision solution, and randomized IPCPM is an iterative method for finding a higher-precision solution (Meng and Mahoney, 2013b) for unconstrained ℓ1 regression. Clearly, pwSGD has a uniformly better complexity than that of RLA methods in terms of both d and ε, no matter which underlying preconditioning method is used. This makes pwSGD a more suitable candidate for obtaining a medium-precision solution, e.g., with ε = 10⁻³.

In Table 2, all the methods require constructing a sketch first. Among them, “low-precision” solvers refer to “sketching + direct solver” type algorithms; see (Drineas et al., 2011; Clarkson and Woodruff, 2013) for projection-based examples and (Clarkson and Woodruff, 2013; Drineas et al., 2012) for sampling-based examples. “High-precision” solvers refer to “sketching + preconditioning + iterative solver” type algorithms; see (Avron et al., 2010; Meng et al., 2014) for examples. One can show that, when d ≥ 1/ε and n ≥ d²/ε, pwSGD is asymptotically better than all the solvers shown in Table 2. Moreover, although high-precision solvers are more efficient when a high-precision solution is desired, usually they are designed for unconstrained problems, whereas pwSGD also works for constrained problems.

We remark that, compared to general SGD algorithms, our RLA-SGD hybrid algorithm pwSGD works for problems in a narrower range, i.e., ℓ1 and ℓ2 regression, but inherits the strong theoretical guarantees of RLA. When solving ℓ2 regression, for which traditional RLA methods are well designed, pwSGD has a comparable complexity. On the other hand, when solving ℓ1 regression, due to the efficiency of the SGD update, pwSGD has a strong advantage over traditional RLA methods. See Sections 4.3 and 4.4 for more detailed discussions.

Finally, in Section 5, empirically we show that pwSGD performs favorably compared to other competing methods, as it converges to a medium-precision solution more quickly.

1.2 Connection to related algorithms

As a side point of potentially independent interest, a connection between ℓp regression and stochastic optimization allows us to unify our main algorithm pwSGD and some existing ℓp regression solvers under the same framework. In Figure 1, we present the basic structure of this framework, which provides a view of pwSGD from another perspective. To be more specific, we (formally, in Proposition 4) reformulate the deterministic ℓp regression problem in (1) as a stochastic optimization problem, i.e.,

 min_{y∈𝒴} ∥Uy − b∥_p^p = min_{y∈𝒴} E_{ξ∼P}[ |U_ξ y − b_ξ|^p / p_ξ ],

where U is a basis for the range space of A and ξ is a random variable over {1, …, n} with distribution P = {p_i}_{i=1}^n. As suggested in Figure 1, to solve this stochastic optimization problem, one typically needs to answer the following three questions.

• (C1): How to sample: SAA (Sample Average Approximation, i.e., draw samples in a batch mode and deal with the subproblem) or SA (Stochastic Approximation, i.e., draw a mini-batch of samples in an online fashion and update the weight after extracting useful information)?

• (C2): Which probability distribution P (uniform distribution or not) and which basis U (preconditioning or not) to use?

• (C3): Which solver to use (e.g., how to solve the subproblem in SAA or how to update the weight in SA)?

Some combinations of these choices may lead to existing solvers; see Figure 1 and Section 3 for more details. A natural question arises: is there a combination of these choices that leverages the algorithmic benefits of RLA preconditioning to improve the performance of SGD-type algorithms? Recall that RLA methods (in particular, those that exploit algorithmic leveraging; see Appendix B and also (Drineas et al., 2012; Yang et al., 2016b)) inherit strong theoretical guarantees because the underlying sampling distribution captures most of the important information of the original system; moreover, such a carefully constructed leverage-based distribution is defined based on a well-conditioned basis U, e.g., an orthogonal matrix for p = 2. One immediate idea is to develop an SGD-like algorithm that uses the same choice of U and P as in RLA methods. This simple idea leads to our main algorithm pwSGD, which is an online algorithm (C1) that performs a gradient descent update (C3) with a non-uniform sampling distribution on a preconditioned system (C2), as Figure 1 suggests.

Indeed, for least-squares problems (unconstrained ℓ2 regression), pwSGD is highly related to the weighted randomized Kaczmarz (RK) algorithm (Strohmer and Vershynin, 2009; Needell et al., 2014): both are SGD algorithms with non-uniform sampling, but pwSGD runs on a well-conditioned basis while weighted RK does not involve preconditioning. In Section 4.5 we show that this preconditioning step dramatically reduces the number of iterations required for pwSGD to converge to a (fixed) desired accuracy.
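For reference, a minimal sketch of the weighted randomized Kaczmarz baseline mentioned above, in the spirit of Strohmer and Vershynin (2009): rows are sampled with probability proportional to ∥A_i∥₂² and no preconditioning is applied. The sizes, seed, and iteration count are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 300, 3
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true                      # consistent system, so RK converges to x_true

probs = np.linalg.norm(A, axis=1) ** 2
probs /= probs.sum()                # row-norm sampling distribution

x = np.zeros(d)
for i in rng.choice(n, size=3000, p=probs):
    # Project the current iterate onto the hyperplane {x : A_i x = b_i}.
    x += (b[i] - A[i] @ x) / (A[i] @ A[i]) * A[i]
```

On this well-conditioned Gaussian instance RK converges rapidly; the point of the preconditioning step in pwSGD is that the iteration count stays small even when A is poorly conditioned.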

1.3 Main contributions

Now we are ready to state our main contributions.

• We reformulate the deterministic ℓp regression problem (1) into a stochastic optimization problem (3) and make connections to existing ℓp solvers, including RLA methods with algorithmic leveraging and the weighted randomized Kaczmarz algorithm (Sections 3 and 4.5).

• We develop a hybrid algorithm for solving constrained overdetermined ℓ1 and ℓ2 regression called pwSGD, which is an SGD algorithm with preconditioning and a non-uniform sampling distribution constructed using RLA techniques. We present several choices of the preconditioner and their tradeoffs. We show that with a suitable preconditioner, the convergence rate of the SGD phase only depends on the low dimension d and is independent of the high dimension n (Sections 4.1 and 4.2).

• We prove that pwSGD returns an approximate solution with ε relative error in the objective value in O(log n · nnz(A) + poly(d)/ε²) time for ℓ1 regression. This complexity is uniformly better than that of RLA methods in terms of both d and ε when the problem is unconstrained. In the presence of constraints, pwSGD only has to solve a sequence of much simpler and smaller optimization problems over the same constraints, which in general can be more efficient than solving the constrained subproblem required in RLA (Sections 4.3 and 4.4).

• We prove that pwSGD returns an approximate solution with ε relative error in the objective value and the solution vector measured in prediction norm in O(log n · nnz(A) + poly(d) log(1/ε)/ε) time for ℓ2 regression. We show that for unconstrained ℓ2 regression, this complexity is asymptotically better than that of several state-of-the-art solvers in the regime where d ≥ 1/ε and n ≥ d²/ε (Sections 4.3 and 4.4).

• Empirically, we show that when solving ℓ1 and ℓ2 regression problems, pwSGD inherits faster convergence rates and performs favorably in the sense that it obtains a medium-precision solution much faster than other competing SGD-like solvers do. Also, the theory regarding several choices of preconditioners is numerically verified (Section 5).

• We show connections between RLA algorithms and coreset methods for empirical optimization problems under the framework of Feldman and Langberg (2011). We show that they are equivalent for ℓp regression and provide lower bounds on the coreset complexity for some more general regression problems. We also discuss the difficulties in extending similar RLA preconditioning ideas to general SGD algorithms (Section 6).

1.4 Other prior related work

Numerous RLA algorithms have been proposed to solve ℓp regression problems (Yang et al., 2016b). RLA theory shows that, to achieve a relative-error bound, the required sampling size only depends on d, independent of n, and the running time also depends on the time to implement the random projection at the first step. Regarding the performance on unconstrained ℓp regression problems, Dasgupta et al. (2009) provide an algorithm that constructs a well-conditioned basis by ellipsoid rounding and a subspace-preserving sampling matrix for ℓp regression in O(nd⁵ log n) time; a sampling algorithm based on Lewis weights for ℓp regression has been proposed by Cohen and Peng (2015); the algorithms in (Sohler and Woodruff, 2011) and (Clarkson et al., 2013) use the “slow” and “fast” Cauchy transforms, respectively, to compute a low-distortion embedding matrix and solve the over-constrained ℓ1 regression problem; the algorithms in (Drineas et al., 2012) estimate the ℓ2 leverage scores up to a small factor and solve the ℓ2 regression problem in O(nd log n) time; and the algorithms in (Clarkson and Woodruff, 2013; Meng and Mahoney, 2013a; Nelson and Nguyen, 2013) solve the problem via sparse random projections in nearly input-sparsity time, i.e., O(nnz(A)) time, plus lower-order terms, with a tighter analysis provided by Cohen (2016). As for iterative algorithms, the algorithms in (Avron et al., 2010; Meng et al., 2014) use randomized linear algebra to compute a preconditioner and call iterative solvers such as LSQR to solve the preconditioned problem.

In contrast, SGD algorithms update the solution vector in an iterative fashion and are simple to implement and scalable to large datasets (Bottou and Le Cun, 2004; Shalev-Shwartz and Srebro, 2008; Bottou and Bousquet, 2008). Moreover, these methods can be easily extended to problems with general convex loss functions and constraints, such as Pegasos (Shalev-Shwartz et al., 2007) for regularized SVM and stochastic coordinate descent (SCD) for ℓ1 regularization (Shalev-Shwartz and Tewari, 2009). Several techniques, such as SAGE (Hu et al., 2009), AdaGrad (Duchi et al., 2011), and SVRG (Johnson and Zhang, 2013), have recently been proposed to accelerate the convergence rate of SGD, and Niu et al. (2011) also show that SGD is favorable for parallel/distributed computation. More recently, several works regarding SGD with weighted sampling, e.g., (Zhao and Zhang, 2015; Needell et al., 2014), have been proposed, in which the authors show that the performance of SGD can be improved by using a nonuniform sampling distribution.

In addition, as we point out in Section 4.2, pwSGD has a close relationship to second-order methods. It can be viewed as an algorithm with approximate Hessians obtained by sketching and stochastic gradients. This is related to the iterative Hessian sketching algorithm for solving constrained least-squares problems proposed by Pilanci and Wainwright (2014), which is essentially a Newton-type algorithm with iteratively sketched Hessians and batch gradients. Moreover, the idea of using approximate Hessians and stochastic gradients has been discussed in several recent papers. For example, (Moritz et al., 2016; Byrd et al., 2016; Curtis, 2016) exploit the idea of approximating the Hessian with L-BFGS type updates and (variance-reduced) stochastic updates.

2 Preliminaries

For any matrix A ∈ ℝ^{n×d}, we use A_i and A^j to denote the i-th row and j-th column of A, respectively. We assume A has full rank, i.e., rank(A) = d. Also denote by κ(A) the usual condition number of A, by nnz(A) the number of nonzero elements in A, and by poly(d) a low-degree polynomial in d. We also use [n] to denote the set of indices {1, …, n}.

Throughout this subsection, the definitions apply to general p ∈ [1, ∞). We denote by ∥A∥_p the element-wise ℓp norm of a matrix: ∥A∥_p = (Σ_{i=1}^n Σ_{j=1}^d |A_{ij}|^p)^{1/p}. In particular, when p = 2, ∥A∥_2 is equivalent to the Frobenius norm.

The following two notions of well-conditioned bases and leverage scores are crucial to our methods. The first notion was originally introduced by Clarkson (2005) and stated more precisely in Dasgupta et al. (2009); it is used to justify the well-posedness of an ℓp regression problem.

Definition 1 ((α,β,p)-conditioning and well-conditioned basis).

A matrix U ∈ ℝ^{n×d} is (α, β, p)-conditioned if ∥U∥_p ≤ α and, for all z ∈ ℝᵈ, ∥z∥_q ≤ β∥Uz∥_p, where 1/p + 1/q = 1. Define κ̄_p(U) as the minimum value of αβ such that U is (α, β, p)-conditioned. We say that a basis U for range(A) is a well-conditioned basis if κ̄_p(U) is a low-degree polynomial in d, independent of n.

The notion of leverage scores captures how important each row in the dataset is, and is used in the construction of the sampling probability.

Definition 2 (ℓp leverage scores).

Given A ∈ ℝ^{n×d}, suppose U is an (α, β, p) well-conditioned basis for range(A). Then the i-th ℓp leverage score of A is defined as λ_i = ∥U_i∥_p^p, for i ∈ [n].
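For p = 2 this definition has a particularly simple computational form: a thin QR factorization of A gives an orthonormal basis Q, which is (√d, 1, 2)-conditioned (∥Q∥₂ = √d in the element-wise norm, and ∥z∥₂ = ∥Qz∥₂), so the i-th ℓ2 leverage score is the squared norm of the i-th row of Q. A small NumPy check (data sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 4
A = rng.standard_normal((n, d))

# Thin QR: Q has orthonormal columns, i.e., a well-conditioned basis for p = 2.
Q, _ = np.linalg.qr(A)
lev = np.linalg.norm(Q, axis=1) ** 2   # i-th l2 leverage score = ||Q_i||_2^2
```

Two standard sanity properties hold: the ℓ2 leverage scores always sum to rank(A) = d and each lies in [0, 1].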

2.1 Preconditioning

Here, we briefly review the preconditioning methods that will be used in our main algorithms. A detailed summary of various preconditioning methods can be found in Yang et al. (2014, 2016b). The procedure for computing a preconditioner can be summarized in the following two steps.

• Given a matrix A ∈ ℝ^{n×d} with full rank, we first construct a sketch SA of A satisfying

 σ_S · ∥Ax∥_p ≤ ∥SAx∥_p ≤ κ_S σ_S · ∥Ax∥_p, ∀x ∈ ℝᵈ, (1)

where κ_S is the distortion factor of S, independent of n.

• Next, we compute the QR factorization SA = QR, whose size only depends on d. Return R.

The following lemma guarantees that the preconditioner R is such that AR⁻¹ is well-conditioned, since κ_S and the sketch size s depend only on d, independent of n.

Lemma 3.

Let R be the matrix returned by the above preconditioning procedure; then we have

 κ̄_p(AR⁻¹) ≤ κ_S d^{max{1/2, 1/p}} s^{|1/p − 1/2|}. (2)

Various ways of computing a sketching matrix S satisfying (1) have been proposed recently. It is worth mentioning that sketching algorithms that run in nearly input-sparsity time, i.e., in time proportional to nnz(A), to obtain such a sketching matrix for p = 1 and p = 2 are available via random projections composed of sparse matrices; see Clarkson and Woodruff (2013); Meng and Mahoney (2013a); Woodruff and Zhang (2013); Nelson and Nguyen (2013) for details. In Tables 5 and 6 in Appendix A we provide a short summary of these sketching methods and the resulting running time and condition number.
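The two-step procedure above can be sketched for p = 2 with a dense Gaussian projection (a simple stand-in for the faster sparse projections cited above); the sketch size s = 40d, the column scaling, and all sizes are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 5000, 10
A = rng.standard_normal((n, d)) * np.logspace(0, 6, d)  # badly scaled columns

# Step 1: sketch SA with a Gaussian l2 subspace embedding.
s = 40 * d
S = rng.standard_normal((s, n)) / np.sqrt(s)

# Step 2: QR of the small sketch, SA = QR; R is the preconditioner.
_, R = np.linalg.qr(S @ A)

cond_before = np.linalg.cond(A)                     # roughly 1e6 here
cond_after = np.linalg.cond(A @ np.linalg.inv(R))   # O(1), independent of n
```

The sketch has only s = O(d) rows, so the QR step costs O(sd²) regardless of n, yet AR⁻¹ is well-conditioned.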

3 A connection to stochastic optimization

In this section, we describe our framework for viewing deterministic regression problems from the perspective of stochastic optimization. This framework will recover both RLA and SGD methods in a natural manner; and by combining these two approaches in a particular way we will obtain our main algorithm.

We reformulate overdetermined ℓp regression problems of the form (1) into stochastic optimization problems of the form (3); this reformulation is stated formally in Proposition 4. Note that the result holds for general p ∈ [1, ∞).

Proposition 4.

Let U ∈ ℝ^{n×d} be a basis of the range space of A in the form U = AF, where F ∈ ℝ^{d×d}. The constrained overdetermined ℓp regression problem (1) is equivalent to

 min_{y∈𝒴} ∥Uy − b∥_p^p = min_{y∈𝒴} E_{ξ∼P}[H(y, ξ)], (3)

where ξ is a random variable over [n] with distribution P = {p_i}_{i=1}^n, y is the decision variable in 𝒴, and H(y, ξ) = |U_ξ y − b_ξ|^p / p_ξ. The constraint set of y is 𝒴 = {y ∈ ℝᵈ | y = F⁻¹x, x ∈ 𝒵}.
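Proposition 4 can be checked numerically: for any distribution P with strictly positive weights, the expectation E_{ξ∼P}[|U_ξ y − b_ξ|^p / p_ξ] reproduces the deterministic objective ∥Uy − b∥_p^p exactly, since the p_ξ in the denominator cancels the sampling probability. A NumPy sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p = 200, 4, 1
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
U, _ = np.linalg.qr(A)              # one particular basis for range(A)
y = rng.standard_normal(d)

probs = rng.random(n) + 0.1         # any strictly positive weights
probs /= probs.sum()

residual = np.abs(U @ y - b) ** p
objective = residual.sum()                        # ||Uy - b||_p^p
expectation = (probs * (residual / probs)).sum()  # E[|U_xi y - b_xi|^p / p_xi]
```

The choice of P does not change the value of the stochastic objective, only the variance of its one-sample estimates, which is exactly why the choice in (C2) matters for SA-type algorithms.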

With Proposition 4, as suggested in Figure 1, one can solve the overdetermined ℓp regression problem (1) by applying either SAA or SA, i.e., (C1), to the stochastic optimization problem (3). In addition to the choice of SA versus SAA, one also has to choose the basis U and the distribution P, i.e., (C2), and determine the underlying solver, i.e., (C3).

Assume that if SAA is used, then for (C3) we solve the subproblem exactly, i.e., we compute a high-precision solution of the subproblem; this leads to a class of randomized linear algebra (RLA) algorithms for solving ℓp regression. Alternatively, if SA is used, then we extract first-order information, i.e., a sub-gradient of the sample, and update the weight in a gradient descent fashion; this leads to a family of stochastic gradient descent (SGD) algorithms for solving ℓp regression.

For (C2), we need to choose a basis U that converts (1) into an equivalent problem represented by U, and choose a distribution P according to which the algorithm samples a row at every iteration. In general, different choices of U and P lead to different algorithms. In the following two subsections, we discuss their effects on SAA and SA and make connections between existing solvers and our new solution methods. For simplicity, we assume there are no constraints, i.e., 𝒵 = ℝᵈ (although much of this framework generalizes to nontrivial constraints).

3.1 Using RLA (SAA) to solve ℓp regression

In this subsection, we briefly discuss the algorithms induced by applying SAA to (3) with different choices of the basis U and the distribution P in Proposition 4.

We first show that the choice of the basis U has no effect on the resulting sampling algorithm. Let S be the equivalent sampling matrix in the sampling algorithm. That is,

 S_{ij} = 1/p_j if the j-th row is sampled in the i-th iteration, and S_{ij} = 0 otherwise.

Then the subproblem can be cast as min_{y∈𝒴} ∥S(Uy − b)∥_p^p, which is equivalent to min_{x∈𝒵} ∥S(Ax − b)∥_p^p. Therefore, with a given distribution P, applying SAA to the stochastic optimization problem associated with any basis U is equivalent to applying SAA to the original problem with matrix A.
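This basis-independence is easy to verify numerically: for any invertible change of basis F, the sketched subproblems attain the same optimal value. A dense Gaussian matrix stands in for the sampling matrix S below, and all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, s = 300, 4, 60
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
F = np.triu(np.ones((d, d)))        # some invertible change of basis (det = 1)
S = rng.standard_normal((s, n)) / np.sqrt(s)

# Minimize over the transformed basis U = AF, and over the original A.
y, *_ = np.linalg.lstsq(S @ A @ F, S @ b, rcond=None)
x, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
val_basis = np.linalg.norm(S @ A @ F @ y - S @ b)
val_orig = np.linalg.norm(S @ A @ x - S @ b)
```

Both minimizations run over the same column space, so the subproblem optima coincide; only the parametrization of the solution changes (x = Fy).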

Next, we discuss the effect of the choice of P, i.e., the sampling distribution in SAA, on the required sampling size.

Naive choice of P

One choice of P is the uniform distribution, i.e., p_i = 1/n for i ∈ [n]. The resulting SAA algorithm uniformly samples some rows from the original n rows and solves the subproblem induced by the selected rows. If all the rows are equally “important”, such an algorithm can be expected to work. However, consider the following toy example for which uniform sampling gives undesirable answers with high probability. Suppose the first row of the design matrix A contains the only nonzero element in the first column of A. Since the only measurement of the first coordinate lies in the first row, in order to recover the optimal value, the first row of A is crucial. However, when a uniform sampling scheme is used, the sampling size required in order to sample the first row is Ω(n). This implies that RLA with uniform sampling will fail with high probability unless the sampling size is Ω(n).
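A quick simulation makes the failure mode above concrete: with uniform sampling, a run of s draws almost never touches the single informative row when s ≪ n. The numbers n, s, and the trial count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
n, s, trials = 100000, 500, 2000

# Fraction of trials in which s uniform draws include the informative row 0.
hits = sum(1 for _ in range(trials) if 0 in rng.integers(0, n, size=s))
hit_rate = hits / trials            # close to 1 - (1 - 1/n)^s, about 0.005 here
```

So a sampling size of s = 500 from n = 100000 rows sees the crucial row in only about half a percent of runs; any subproblem built from the other runs cannot recover the first coordinate.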

Smarter choice of P

In the above example, it is not hard to show that the leverage score of the first row is 1, i.e., it is much larger than the average value of the leverage scores. This inspires us to put more weight on “important” rows, i.e., rows with higher leverage scores. An immediate solution is to define P based on the leverage scores as follows:

 p_i = λ_i / Σ_{j=1}^n λ_j,

where λ_i is the i-th ℓp leverage score of A (which depends on whether one is working with ℓ1, ℓ2, or more general ℓp regression). Applying SAA with this distribution and solving the subproblem exactly recovers the recently proposed RLA methods with algorithmic leveraging for solving overdetermined ℓp regression problems; see (Mahoney, 2011; Dasgupta et al., 2009; Clarkson et al., 2013; Yang et al., 2014; Clarkson and Woodruff, 2013; Meng and Mahoney, 2013a; Ma et al., 2014) for details. (In RLA, this is simply solving a subproblem of the original problem, but in statistical learning theory, this has the interpretation of Empirical Risk Minimization.) This algorithm is formally stated in Algorithm 3 in Appendix B. We also include its quality-of-approximation results from (Dasgupta et al., 2009) in Appendix B, which state that the resulting approximate solution produces a (1 + ε)-approximation to the objective if the sampling size is large enough. (Note, in particular, that “large enough” here means that when the desired accuracy ε and failure probability are fixed, the required sampling size only depends on the lower dimension d, independent of n.)
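The sampling-and-rescaling scheme above can be sketched compactly for the ℓ2 case: sample s rows with probability proportional to the leverage scores, rescale the sampled rows by 1/√(s p_i), and solve the small subproblem exactly. The sampling size s = 500 and all data here are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 5000, 5
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)

# Leverage-score sampling distribution (exact l2 leverage scores via QR).
Q, _ = np.linalg.qr(A)
lev = np.linalg.norm(Q, axis=1) ** 2
probs = lev / lev.sum()

s = 500                              # depends on d and epsilon, not on n
idx = rng.choice(n, size=s, p=probs)
w = 1.0 / np.sqrt(s * probs[idx])    # standard rescaling for the l2 subproblem
x_sub, *_ = np.linalg.lstsq(w[:, None] * A[idx], w * b[idx], rcond=None)
x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)

ratio = np.linalg.norm(A @ x_sub - b) / np.linalg.norm(A @ x_opt - b)
```

The subproblem has only s rows, yet its solution attains an objective value within a small relative factor of the optimum on the full problem.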

3.2 Using SGD (SA) to solve ℓp regression

Applying SA to (3) and updating the weight vector using first-order information results in an SGD algorithm. It is not hard to show that, given U = AF and P, the update rule is as follows. Suppose the ξ_t-th row is sampled; then the weight vector is updated by

 x_{t+1} = x_t − η c_t H⁻¹ A_{ξ_t}^T,

where H = (FF^T)⁻¹, η is the step size, and c_t is a constant that depends on p and the sample ξ_t.

Next, we discuss how different choices of U and P affect the convergence rates of the resulting SGD algorithms. For simplicity, we restrict our discussion to unconstrained ℓ1 regression.

Naive choice of U and P

Consider the following choices of U and P that lead to undesirable convergence rates. Let U = A, i.e., F = I. If we apply SGD with some distribution P = {p_i}_{i=1}^n, simple arguments in the SGD convergence rate analysis lead to a relative approximation error of

 (f(x̂) − f(x*)) / f(x̂) = O( ∥x*∥_2 · max_{1≤i≤n} (∥A_i∥_1 / p_i) / ∥Ax* − b∥_1 ), (4)

where x̂ is the output of the SGD algorithm and x* is the optimal solution. When P is the uniform distribution, (4) becomes O(n ∥x*∥_2 M / ∥Ax* − b∥_1), where M = max_{1≤i≤n} ∥A_i∥_1 is the maximum ℓ1 row norm of A. Alternatively, if one chooses p_i to be proportional to the row norms of A, i.e., p_i = ∥A_i∥_1 / Σ_{j=1}^n ∥A_j∥_1, then (4) becomes O(∥x*∥_2 Σ_{i=1}^n ∥A_i∥_1 / ∥Ax* − b∥_1). Consider the following scenario. Given A and b, we continue to append rows A_i and entries b_i satisfying A_i x* = b_i to A and b, respectively. This process keeps x* and ∥Ax* − b∥_1 unchanged, while the values of n and Σ_{i=1}^n ∥A_i∥_1 increase. Thus, in this case, the expected time for convergence of SGD with these naive sampling distributions might blow up as the size of the matrix grows.
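The blow-up scenario above is easy to reproduce numerically: appending rows that are consistent with a fixed reference point leaves the ℓ1 residual untouched, while both n · max_i ∥A_i∥_1 (the uniform-sampling factor) and Σ_i ∥A_i∥_1 (the row-norm-sampling factor) keep growing. All data here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)
n, d = 200, 3
A = rng.standard_normal((n, d))
x_ref = rng.standard_normal(d)
b = A @ x_ref + rng.standard_normal(n)

def stats(M, v):
    resid = np.linalg.norm(M @ x_ref - v, 1)
    row_l1 = np.abs(M).sum(axis=1)
    # (residual, uniform-sampling factor, row-norm-sampling factor)
    return resid, M.shape[0] * row_l1.max(), row_l1.sum()

before = stats(A, b)
extra = rng.standard_normal((5000, d))
A2 = np.vstack([A, extra])
b2 = np.concatenate([b, extra @ x_ref])   # appended rows satisfy A_i x_ref = b_i
after = stats(A2, b2)
```

The residual at x_ref is unchanged, but both error-bound factors in (4) grow with the number of appended rows, so neither naive distribution gives a size-independent rate.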

Smarter choice of U and P

To avoid this problem, we need to precondition the linear regression problem. If we work with a well-conditioned basis U for the range space of A and choose the sampling probabilities proportional to the row norms of U, i.e., the leverage scores of A, then the resulting convergence rate on the relative error of the objective becomes O(∥y*∥_2 Σ_{i=1}^n ∥U_i∥_1 / ∥Uy* − b∥_1), where y* is an optimal solution to the transformed problem. By Definition 1, if U is a well-conditioned basis, then one obtains Σ_{i=1}^n ∥U_i∥_1 = ∥U∥_1 ≤ α and ∥y*∥_2 ≤ √d ∥y*∥_∞ ≤ √d β ∥Uy*∥_1. Since the condition number αβ of a well-conditioned basis depends only on d, and since the ratio ∥Uy*∥_1 / ∥Uy* − b∥_1 is a constant, the resulting SGD inherits a convergence rate in a relative scale that depends on d and is independent of n.

The idea of using a preconditioner and a sampling distribution according to the leverage scores leads to our main algorithm.

4 Our Main Algorithm

In this section, we state our main algorithm pwSGD (Algorithm 1) for solving the constrained overdetermined ℓ1 and ℓ2 regression problems. Its main steps can be summarized as follows.

First, we compute a well-conditioned basis U (Definition 1) for the range space of A implicitly via a conditioning method; see Tables 5 and 6 in Appendix A for a summary of recently proposed randomized conditioning methods. We refer to this as the “implicit” method, i.e., it focuses on computing a matrix R such that U = AR⁻¹. A typical way of obtaining R is via the QR decomposition of SA, where SA is a sketch of A; see Appendix A for more details.

Second, we either exactly compute or quickly approximate the leverage scores (Definition 2), i.e., the row norms of U = AR⁻¹, as {λ_i}_{i=1}^n. To compute them exactly, we have to form the matrix U explicitly, which takes time O(nd²). Alternatively, we can estimate the row norms of AR⁻¹ without computing the product between A and R⁻¹, in order to further reduce the running time; see Appendix A for more details. We assume that the approximations {λ_i}_{i=1}^n satisfy

 (1 − γ)∥U_i∥_p^p ≤ λ_i ≤ (1 + γ)∥U_i∥_p^p, (5)

where γ is the approximation factor of the estimation. When the leverage scores are exact, the approximation factor is γ = 0. From these, we can define a distribution P over [n] based on {λ_i}_{i=1}^n as follows:

 p_i = λ_i / Σ_{j=1}^n λ_j. (6)

Third, in each iteration a new sample corresponding to a row of A is drawn according to distribution P, and we apply an SGD process to solve the following equivalent problem with a specific choice of F ∈ ℝ^{d×d}:

 min_{y∈𝒴} h(y) = ∥AFy − b∥_p^p = E_{ξ∼P}[ |A_ξ F y − b_ξ|^p / p_ξ ]. (7)

Here the matrix F is called the preconditioner for the linear system being solved; see Section 4.2 for several choices of F. Below, we show that with a suitable choice of F, the convergence rate of the SGD phase can be improved significantly. Indeed, we can perform the update rule in the original domain (with solution vector x = Fy instead of y), i.e., (10). Notice that if 𝒵 = ℝᵈ and F = I, then the update rule can be simplified as

 x_{t+1} = x_t − η c_t A_{ξ_t}^T. (8)

If 𝒵 = ℝᵈ and F = R⁻¹, then the update rule becomes

 x_{t+1} = x_t − η c_t H⁻¹ A_{ξ_t}^T, (9)

where H = (FF^T)⁻¹ = R^T R. In the presence of constraints, (10) only needs to solve an optimization problem with a quadratic objective over the same constraints, whose size is independent of n.

Finally, the output is the averaged value of all iterates, i.e., x̄ = (1/T) Σ_{t=1}^T x_t, for ℓ1 regression, or the last iterate, i.e., x_T, for ℓ2 regression.
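The three steps above can be sketched end-to-end in NumPy for a simplified, unconstrained ℓ2 instance: a Gaussian sketch plus QR for the preconditioner (F = R⁻¹), exact leverage scores for the distribution (6), and the weighted SGD update (9). The sizes, step-size, and iteration count are illustrative choices of ours, not tuned values from Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 4000, 5
A = rng.standard_normal((n, d)) * np.logspace(0, 3, d)   # poorly scaled columns
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Step 1: preconditioner R from a sketch, SA = QR.
s = 50 * d
S = rng.standard_normal((s, n)) / np.sqrt(s)
_, R = np.linalg.qr(S @ A)
Rinv = np.linalg.inv(R)

# Step 2: leverage scores (computed exactly here) and the distribution (6).
U = A @ Rinv                         # well-conditioned basis
lev = np.linalg.norm(U, axis=1) ** 2
probs = lev / lev.sum()

# Step 3: weighted SGD in the original domain, update (9) with H^{-1} = F F^T.
Hinv = Rinv @ Rinv.T
x = np.zeros(d)
eta = 0.02
for i in rng.choice(n, size=20000, p=probs):
    c = 2.0 * (A[i] @ x - b[i]) / probs[i]   # stochastic l2 gradient scale
    x -= eta * c * (Hinv @ A[i])

x_opt, *_ = np.linalg.lstsq(A, b, rcond=None)
ratio = np.linalg.norm(A @ x - b) / np.linalg.norm(A @ x_opt - b)
```

Despite the poor column scaling of A, the preconditioned iteration reaches a small relative error in the objective with an iteration count that does not grow with n.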

4.1 Main results for ℓ1 and ℓ2 regression problems

The quality-of-approximation of Algorithm 1 is presented in Proposition 5 and Proposition 6 for ℓ1 and ℓ2 regression, respectively, in which we give the expected number of iterations that pwSGD needs to converge within a small tolerance. We show that pwSGD inherits faster convergence rates whose constant terms only depend on the lower dimension d when the preconditioner F = R⁻¹ is used. Worth mentioning is that for ℓ2 regression, our bound on the solution vector is measured in the prediction norm, i.e., ∥A(x̂ − x*)∥_2. For completeness, we present the non-asymptotic convergence analysis of pwSGD in Proposition 14 and Proposition 15 in Appendix A. All the proofs can be found in Appendix C. The analysis of these results is based on the convergence properties of SGD; see Appendix D for technical details.

In the following results, R is the matrix computed in step 3 of Algorithm 1, U = AR⁻¹, {λ_i}_{i=1}^n are the leverage scores computed in step 4, and F is the preconditioner chosen in step 6 of Algorithm 1. Denote by κ̄_p(U) the condition number of the well-conditioned basis U and by γ the approximation factor of the leverage scores λ_i, i ∈ [n], that satisfy (5). For any vector x, denote by ∥x∥_H = √(x^T H x) the ellipsoidal norm of x induced by a positive definite matrix H. For any non-singular matrix B, denote by κ(B) = σ_max(B)/σ_min(B) its condition number; κ̂ denotes the corresponding quantity defined in the proofs. The exact form of the step-sizes used can be found in the proofs.

Proposition 5.

For A ∈ ℝ^{n×d} and b ∈ ℝⁿ, define f(x) = ∥Ax − b∥_1 and suppose f(x*) > 0. Then there exists a step-size η such that after

 T = d κ̄₁²(U) κ̂²(RF) c₁² c₂ c₃² / ε²

iterations, Algorithm 1 with p = 1 returns a solution vector estimate x̄ that satisfies the expected relative error bound

 (E[f(x̄)] − f(x*)) / f(x*) ≤ ε.

Here, the expectation is taken over all the samples ξ₁, …, ξ_T, and x* is an optimal solution to the problem min_{x∈𝒵} f(x). The exact expressions for the constants c₁, c₂ and c₃ are given in Appendix C.

Proposition 6.

For $p = 2$, define $f(x) = \|Ax - b\|_2^2$ and suppose $\|Ax^*\|_2 > 0$. Then there exists a step-size $\eta$ such that after

 $$T = c_1\,\bar{\kappa}_2^2(U)\,\kappa^2(RF)\cdot\log\!\left(\frac{2\,c_2\,\kappa^2(U)\,\kappa^2(RF)}{\epsilon}\right)\cdot\left(1+\frac{\kappa^2(U)\,\kappa^2(RF)\,c_3}{\epsilon}\right)$$

iterations, Algorithm 1 with $p = 2$ returns a solution vector estimate $x_T$ that satisfies the expected relative error bound

 $$\frac{\mathbb{E}\left[\|A(x_T-x^*)\|_2^2\right]}{\|Ax^*\|_2^2} \le \epsilon.$$

Furthermore, when the problem is unconstrained and $f(x^*) > 0$, there exists a step-size $\eta$ such that after

 $$T = c_1\,\bar{\kappa}_2^2(U)\cdot\log\!\left(\frac{c_2\,\kappa^2(U)}{\epsilon}\right)\cdot\left(1+\frac{2\,\kappa^2(U)}{\epsilon}\right)$$

iterations, Algorithm 1 with $p = 2$ returns a solution vector estimate $x_T$ that satisfies the expected relative error bound

 $$\frac{\mathbb{E}\left[f(x_T)\right]-f(x^*)}{f(x^*)} \le \epsilon.$$

Here, the expectation is taken over all the random samples, and $x^*$ is the optimal solution to problem (1). The exact values of the constants $c_1$, $c_2$ and $c_3$ in $T$ are given in the proof in Appendix C.

The above results indicate two important properties of pwSGD. First, recall that the condition number $\bar{\kappa}_p(U)$ of the well-conditioned basis is a polynomial of $d$ that is independent of $n$. Thus, with the preconditioner $F = R^{-1}$ and an appropriate step-size in pwSGD, the number of iterations required to achieve an arbitrarily small relative error depends only on the low dimension $d$ of the input matrix $A$. Second, pwSGD is robust to leverage score approximations, i.e., the expected convergence rate is affected only by a small distortion factor even when the approximation has low accuracy, e.g., a constant-factor approximation.

Remark. For constrained $\ell_2$ regression, the bound is on the solution vector measured in the prediction norm. By the triangle inequality, this directly implies a bound on the error of the objective value $\|Ax_T - b\|_2$.

Remark. Our approach can also be applied to other types of linear regression problems, such as ridge regression, in which SGD can be invoked in a standard way. In this case, the “condition number” governing SGD is lower due to the regularization term. The randomized preconditioning methods discussed in Section 2.1 can be used, but they are an “overkill”. More sophisticated preconditioning methods can be devised, e.g., based on ridge leverage scores (Cohen et al., 2015b).

4.2 The choice of the preconditioner F

As we can see, the preconditioner $F$ plays an important role in our algorithm. It converts the original regression problem in (1) into the stochastic optimization problem in (7). From Proposition 5 and Proposition 6, clearly, different choices of $F$ lead to different convergence rates in the SGD phase (reflected in the term $\kappa(RF)$) and different additional computational costs (reflected in the cost of the update (10)).

When $F = R^{-1}$, the effect of $RF$ on the convergence rate vanishes since $\kappa(RF) = 1$. In this case, $H = (FF^\top)^{-1} = R^\top R$ is also a good approximation to the Hessian $A^\top A$. This is because usually $R$ is the $R$-factor in the QR decomposition of $SA$, where $SA$ is a “sketch” of $A$ that shares similar properties with $A$. Together we have $H = R^\top R = (SA)^\top (SA) \approx A^\top A$. This implies that (9) is close to a Newton-type update. However, as a tradeoff, since $F$ is a dense matrix, an additional $O(d^2)$ cost per iteration is required to perform the SGD update (10).
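Both claims above can be checked numerically; the following sketch (a Gaussian sketch with illustrative sizes, not the paper's exact construction) verifies that $U = AR^{-1}$ is well-conditioned even when $A$ is not, and that $R^\top R = (SA)^\top(SA)$ is close to $A^\top A$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
# A poorly conditioned A (badly scaled columns).
A = rng.standard_normal((n, d)) * np.logspace(0, 4, d)

# R-factor of a sketched QR: SA = QR with a Gaussian sketch S.
m = 50 * d
S = rng.standard_normal((m, n)) / np.sqrt(m)
_, R = np.linalg.qr(S @ A)

U = A @ np.linalg.inv(R)   # preconditioned basis; kappa(U) is close to 1
H = R.T @ R                # equals (SA)^T (SA), approximates A^T A

print(np.linalg.cond(A), np.linalg.cond(U))
```

Here the conditioning of $U$ is governed by the sketch size $m$, not by the conditioning of $A$.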

On the other hand, when $F = I$, no extra matrix-vector multiplication is needed in updating $x_t$. However, based on the discussion above, one should expect $\kappa(RF) = \kappa(R)$ to be close to $\kappa(A)$. The term $\kappa^2(RF)$ can then be large if $A$ is poorly conditioned, which might lead to undesirable performance in the SGD phase.

Besides the obvious choices of $F$ such as $R^{-1}$ and $I$, one can also choose $F$ to be a diagonal preconditioner that scales $R$ to have unit column norms. According to van der Sluis (1969), the resulting condition number $\kappa(RF)$ is always upper bounded by the original condition number $\kappa(R)$, while the additional cost per iteration to perform SGD updates with a diagonal preconditioner is only $O(d)$. In Section 5 we illustrate the tradeoffs among these three choices of preconditioners empirically.
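A small numerical illustration of the diagonal choice (the dimensions and column scalings below are assumed for the demo): when the columns of $A$, and hence of its $R$-factor, are badly scaled, the van der Sluis scaling recovers good conditioning at negligible cost.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 5
# A matrix with badly scaled columns; its R-factor inherits the bad
# scaling, so kappa(R) is large.
A = rng.standard_normal((n, d)) * np.logspace(0, 4, d)
_, R = np.linalg.qr(A)

# van der Sluis diagonal preconditioner: scale the columns of R to
# unit norm.  Applying the diagonal F costs only O(d) per SGD update.
F = np.diag(1.0 / np.linalg.norm(R, axis=0))
RF = R @ F

print(np.linalg.cond(R), np.linalg.cond(RF))
```

For this example the condition number drops by several orders of magnitude, although a diagonal $F$ cannot in general match the $\kappa(RF)=1$ achieved by $F = R^{-1}$.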

4.3 Complexities

Here, we discuss the complexity of pwSGD with $F = R^{-1}$. The running time of Algorithm 1 consists of three parts. First, for computing a matrix $R$ such that $U = AR^{-1}$ is well-conditioned, Appendix A provides a brief overview of various recently proposed preconditioning methods for both the $\ell_1$ and $\ell_2$ norms; see also Table 5 and Table 6 for their running times and preconditioning quality. In particular, several sparse preconditioning methods are available that run in $O(\mathrm{nnz}(A))$ time plus lower-order terms (Clarkson and Woodruff, 2013; Meng and Mahoney, 2013a; Nelson and Nguyen, 2013; Yang et al., 2016b; Woodruff and Zhang, 2013). Second, to estimate the leverage scores, i.e., the row norms of $U = AR^{-1}$, Drineas et al. (2012) and Clarkson et al. (2013) proposed several algorithms for approximating the $\ell_1$ and $\ell_2$ leverage scores without forming the matrix $U$. For a target constant approximation quality, the running time of these algorithms is $O(\mathrm{nnz}(A)\cdot\log n)$. Third, Proposition 5 and Proposition 6 provide upper bounds on the expected algorithmic complexity of our proposed SGD algorithm when a target accuracy $\epsilon$ is fixed. Combining these, we have the following results.

Proposition 7.

Suppose the matrix $R$ in step 3 of Algorithm 1 is computed by a method from Table 5 or Table 6. Then, with constant probability, one of the following events holds for pwSGD with $F = R^{-1}$. To return a solution with relative error $\epsilon$ on the objective,

• It runs in $\mathrm{time}(R) + O(\mathrm{nnz}(A)\cdot\log n) + \mathrm{poly}(d)/\epsilon^2$ time for unconstrained $\ell_1$ regression.

• It runs in $\mathrm{time}(R) + O(\mathrm{nnz}(A)\cdot\log n) + \mathrm{poly}(d)\cdot\mathrm{time}_{\mathrm{update}}/\epsilon^2$ time for constrained $\ell_1$ regression.

• It runs in $\mathrm{time}(R) + O(\mathrm{nnz}(A)\cdot\log n) + \mathrm{poly}(d)/\epsilon$ time for unconstrained $\ell_2$ regression.

• It runs in $\mathrm{time}(R) + O(\mathrm{nnz}(A)\cdot\log n) + \mathrm{poly}(d)\cdot\mathrm{time}_{\mathrm{update}}/\epsilon$ time for constrained $\ell_2$ regression.

In the above, $\mathrm{time}(R)$ denotes the time needed to compute the matrix $R$ and $\mathrm{time}_{\mathrm{update}}$ denotes the time needed to solve the optimization problem in (10).

Notice that, since $T$ depends only on $d$, an immediate conclusion is that, by using sparse preconditioning methods, to find an $\epsilon$-approximate solution, pwSGD runs in $O(\log n\cdot\mathrm{nnz}(A) + \mathrm{poly}(d)/\epsilon^2)$ time for $\ell_1$ regression and $O(\log n\cdot\mathrm{nnz}(A) + \mathrm{poly}(d)/\epsilon)$ time for $\ell_2$ regression (in terms of the solution vector in prediction norm for constrained problems, or the objective value for unconstrained problems).

Also, as can be seen in Proposition 7, for the complexity of $\ell_1$ regression, the tradeoffs in choosing preconditioners from Table 5 are reflected here. On the other hand, for $\ell_2$ regression, as all the preconditioning methods in Table 6 provide similar preconditioning quality, $\mathrm{time}(R)$ becomes the key factor in choosing a preconditioning method. In Table 3, we summarize the complexity of pwSGD using various sketching methods for solving unconstrained $\ell_1$ and $\ell_2$ regression problems. The results are obtained by a direct combination of Tables 2, 5 and 6. We remark that, with decaying step-sizes, it is possible to improve the dependence on $\epsilon$ by a logarithmic factor (Rakhlin et al., 2012).
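As an illustration of the leverage-score estimation step discussed above, the following sketch estimates the row norms of $U = AR^{-1}$ without ever materializing $U$, in the spirit of Drineas et al. (2012): multiply $R^{-1}$ by a small Gaussian test matrix and read off the row norms of the resulting $n \times k$ product. The dimensions below are purely illustrative; in practice $k = O(\log n)$, which is much smaller than $d$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 300, 5, 100
A = rng.standard_normal((n, d)) * np.logspace(0, 3, d)

# Sketched QR gives R such that U = A R^{-1} is well-conditioned.
m = 20 * d
S = rng.standard_normal((m, n)) / np.sqrt(m)
_, R = np.linalg.qr(S @ A)

# Johnson-Lindenstrauss estimate of the squared row norms of U:
# row norms of U G = A (R^{-1} G) are preserved up to small distortion,
# and the dense n x d matrix U is never formed explicitly.
G = rng.standard_normal((d, k)) / np.sqrt(k)
Z = np.linalg.solve(R, G)                    # Z = R^{-1} G, a d x k matrix
lev_est = np.einsum('ij,ij->i', A @ Z, A @ Z)
probs = lev_est / lev_est.sum()              # sampling distribution
```

These estimated scores are exactly what steps 4 and 7 of pwSGD need: only their ratios matter, so a constant-factor approximation suffices.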