1 Introduction
The classical problem of $\ell_\infty$ regression corresponds to finding a point $x$ minimizing $\|Ax - b\|_\infty$ for a given matrix $A$ and vector $b$.
In this work, we are primarily concerned with developing iterative algorithms for the approximate minimization of $\ell_\infty$ regression. We use OPT to denote the optimal value of this problem, and our goal is to find an $\epsilon$-approximate minimizer of the regression function, i.e. a point $x$ such that $\|Ax - b\|_\infty \le \mathrm{OPT} + \epsilon$.
This regression problem has fundamental implications in many areas of statistics and optimization [She13, LS14, LS15a, SWWY18]. In many of these settings, it is also useful to design iterative method machinery for the following more general problem of box-constrained regression:
for some pairs of scalars $\ell_i \le u_i$ (possibly infinite). Note that the constrained optimization problem is strictly more general than the standard one, as setting $\ell_i = -\infty$ and $u_i = +\infty$ recovers the $\ell_\infty$ regression problem. For this work, the domain constraint will only be a scaled $\ell_\infty$ ball, i.e. $\ell_i = -r$ and $u_i = r$ for some constant $r > 0$ (though our results apply to the more general case). For short, we will simply refer to this problem as the constrained regression problem.
Many natural optimization problems arising in combinatorial settings can be written in the form of constrained regression, such as the maximum flow problem and other structured linear programs, and thus faster methods for solving such regression problems can imply faster algorithms for common problems in theoretical computer science. Therefore, the central goal of this paper is to provide faster algorithms for computing approximate minimizers to constrained regression which, when specialized to the maximum flow problem, achieve faster running times.
1.1 Regression Results
In this paper we show how to apply ideas from the literature on accelerated coordinate descent methods (see Section 1.3) to provide faster algorithms for approximately solving constrained regression. We show that by assuming particular sampling and smoothness oracles, or sparsity assumptions on $A$, we can obtain a randomized algorithm which improves upon the classic gradient descent based methods across a broad range of parameters and attains an $\epsilon^{-1}$ dependence in the runtime. Formally, we show the following:
Theorem 1.1 (Accelerated constrained regression)
There is an algorithm, initialized at a given point $x_0$, that computes an $\epsilon$-approximate minimizer of the box-constrained regression problem (for any $\epsilon > 0$) within a number of iterations scaling as $\epsilon^{-1}$, provided each column of $A$ has at most $s$ nonzero entries. Furthermore, after nearly linear preprocessing, each iteration can be implemented in time nearly linear in the column sparsity $s$.
In particular, note that when no column of $A$ has more than polylogarithmically many nonzero entries, each iteration can be implemented in nearly constant time. The only other known algorithm achieving such an $\epsilon^{-1}$ dependence in the running time for $\ell_\infty$ regression, improving upon the classic $\epsilon^{-2}$ dependence without paying a running time penalty in terms of dimension or domain size, is the recent breakthrough result of [She17]. Our running times match those of [She17] across a broad range of parameters, and in certain cases improve upon it, due to our algorithm's tighter dependence on the norm, and therefore sparsity, of the optimal solution, as well as a more fine-grained dependence on the problem's smoothness parameters. Because of these tighter dependences, in many parameter regimes, including the maximum flow problem on even slightly dense graphs, our result improves upon [She17].
Interestingly, our work provides an alternative approach to [She17] for accelerating gradient descent for certain highly structured optimization problems, i.e. $\ell_\infty$ regression. Whereas Sherman's work introduced an intriguing notion of area convexity and new regularizations of regression, our results are achieved by working with the classic smoothing of the $\ell_\infty$ norm and by providing a new variant of accelerated coordinate descent. We achieve our tighter bounds by exploiting local smoothness properties of the problem and dynamically sampling according to these changing smoothnesses.
Our algorithm leverages recent advances in nonuniform sampling for accelerated coordinate descent [AQRY16, QR16, NS17] and is similar in spirit to work on accelerated algorithms for approximately solving packing and covering linear programs [AO15], which also works with nonstandard notions of smoothness. Our paper overturns some conventional wisdom that these techniques do not extend nicely to $\ell_\infty$ regression and the maximum flow problem. Interestingly, our algorithms gain an improved dependence on dimension and sparsity over [She17] in certain cases, while losing the parallelism of [She17]. It is an open direction for future work to see whether or not these approaches can be combined into a more general approach to minimizing smooth functions.
1.2 Maximum Flow Results
The classical maximum flow problem asks, given a graph with (capacitated) edges and vertices together with a specified source node and sink node, how to send as many units of flow from the source to the sink while preserving conservation at all other vertices and without violating edge capacity constraints (i.e. the flow cannot put more units on an edge than the edge's capacity).
The maximum flow problem is known to be easily reducible to the more general problem of minimum congestion flow. Instead of specifying a source and sink, this problem takes as input a demand vector $d$ such that $d^\top \mathbf{1} = 0$, where $\mathbf{1}$ is the all-ones vector. The goal of minimum congestion flow is to find a flow $f$ which routes $d$, meaning that the imbalance of $f$ at each vertex $v$ is given by $d_v$, and subject to this constraint minimizes the congestion, $\max_e |f_e|/w_e$,
where $f_e$ is the flow on edge $e$, and $w_e$ is the capacity of that edge. We refer to the vector with entries $f_e / w_e$ as the congestion vector.
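To make these definitions concrete, here is a small sketch (the graph, capacities, and flow are our own toy choices, not from the paper) computing the vertex imbalances and the congestion of a flow; rows of the edge-vertex incidence matrix $B$ have $+1$ at an edge's tail and $-1$ at its head.

```python
import numpy as np

# Toy sketch (graph, capacities, and flow are hypothetical): compute a flow's
# vertex imbalances B^T f and its congestion max_e |f_e| / w_e.
edges = [(0, 1), (1, 2), (0, 2)]        # arbitrarily oriented edges
w = np.array([2.0, 2.0, 1.0])           # hypothetical capacities w_e

n, m = 3, len(edges)
B = np.zeros((m, n))                    # edge-vertex incidence matrix
for e, (u, v) in enumerate(edges):
    B[e, u], B[e, v] = 1.0, -1.0        # +1 at tail, -1 at head

f = np.array([1.0, 1.0, 0.5])           # a candidate flow
imbalance = B.T @ f                     # demand vector this flow routes
congestion_vec = f / w                  # entries f_e / w_e
congestion = np.max(np.abs(congestion_vec))
print(imbalance, congestion)            # routes 1.5 units from vertex 0 to 2
```

Here the flow routes a demand of $1.5$ out of vertex 0 and into vertex 2 with congestion $0.5$ on every edge.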
A recent line of work beginning in [She13, KLOS14] solves the maximum flow problem by further reducing to constrained regression. To give intuition for the reduction used in this work, broadly inspired by [She13, KLOS14], we note that maximum flow in uncapacitated graphs can be rephrased as asking for the smallest congestion of a feasible flow, namely to solve the problem
where the restriction $B^\top f = d$, for $B$ the edge-vertex incidence matrix of the graph and $d$ the demands, enforces the flow constraints. This can be solved up to logarithmic factors in the running time by fixing some value for the congestion and asking to optimally solve the problem
where we note that the constraint can be decomposed as the indicator of a box, so that this objective matches the form of Equation 1. The exact reduction we use has a few modifications: the box constraint is more simply replaced by a scaled $\ell_\infty$ ball, the regression objective involves a combinatorially-constructed preconditioner whose goal is to improve the condition number (and convergence rate) of the problem, and the problem is rescaled for capacitated graphs (for a more detailed description, see Section 4.2).
In this paper we show how to use our improved algorithms for structured regression in order to obtain faster algorithms for maximum flow. We do so by leveraging the tighter dependence on the domain size (measured in the $\ell_2$ rather than the $\ell_\infty$ norm) and the coordinate smoothness properties of the function to be minimized (due to the structure of the regression matrix). In particular we show the following.
Theorem 1.2 (Accelerated approximate maximum flow)
There is an algorithm that finds an $\epsilon$-approximate maximum flow in time depending on $m$, $\epsilon$, and a parameter $s$, where $s$ is the ratio of the $\ell_2$ norm squared of the congestion vector of any optimal flow to the congestion of that flow.
Our running time improves upon the previous fastest running time for this problem. In particular, we achieve a faster running time whenever the graph is even slightly dense, i.e. $m \ge n^{1+c}$ for any constant $c > 0$.
Interestingly our algorithm achieves even faster running times when there is a sparse maximum flow, i.e. a maximum flow in which the average path length in the flow decomposition of the optimal flow is small. Leveraging this we provide several new results on exact undirected and directed maximum flow on uncapacitated graphs as well.
Theorem 1.3 (Improved algorithms for exact maximum flows)
There are algorithms for finding an exact maximum flow in the following types of uncapacitated graphs.

There is an algorithm which finds a maximum flow in an undirected, uncapacitated graph in time .

There is an algorithm which finds a maximum flow in an undirected, uncapacitated graph with maximum flow value in time .

There is an algorithm which finds a maximum flow in an undirected, uncapacitated graph with a maximum flow that uses at most edges in time .

There is an algorithm which finds a maximum flow in a directed, uncapacitated graph in time .
Each of these runtimes improves upon previous work in some range of parameters. For example, the bound of for undirected, uncapacitated graphs improves upon the previous best running times of achievable by [She17] whenever and of achievable by [KL02] whenever .
We also separately include the following result (which has no dependence on the sparsity $s$) for finding exact flows in general uncapacitated directed graphs, as it improves upon the running time achieved by [GR98] in a certain range of parameters.
Theorem 1.4 (Exact maximum flow for directed uncapacitated graphs)
There is an algorithm which finds a maximum flow in a directed, uncapacitated graph in time .
Although the runtime of [GR98] has been improved upon by the recent works of [Mad13] and [LS14], whose running times dominate ours, they do so using sophisticated advances in interior point methods, whereas our algorithm operates using a first-order method which only queries gradient information of the objective function, rather than second-order Hessian information. In particular, our algorithm is the first to improve runtimes for directed graphs while relying only on first-order information of the objective function. We find it interesting that our result achieves any running time improvement for unit-capacity maximum flow over [GR98] without appealing to interior point machinery, and think this may motivate further research in this area, namely designing first-order methods for structured linear programs.
1.3 Previous work
Here we embark on a deeper dive into the context of the problems and tools discussed in this paper.
Solving the $\ell_\infty$ regression problem. For a non-differentiable function like the $\ell_\infty$ norm, it is possible to use the toolkit for linear programming (including interior point and cutting plane methods [LS14, LS15b]) to obtain iterative algorithms for approximate minimization, but these particular algorithms have a larger dependence on dimension, and it is widely believed that they are inherently super-linear. Traditionally, iterative algorithms with a better dependence on dimension for approximately solving the regression problem proceed in two stages. First, the algorithm constructs a smooth approximation to the original function, typically derived explicitly by regularizing the dual function with a regularizer which is both smooth and bounded in range [Nes05]. The smooth approximation is constructed such that approximately minimizing it is sufficient to approximately minimize the original function. Second, an iterative first-order method, such as gradient descent in a particular norm or one of its many variants, is applied to approximately minimize the smoothed function.
One of the earlier works to develop algorithms using first-order methods under this framework to solve the $\ell_\infty$ regression problem is [Nes05]. One regularizer used in this work, for optimization over a dual variable in the simplex, was the entropy regularizer, which yields a smooth approximation to the $\ell_\infty$ norm. Essentially, the methods presented in this work converge to an approximate solution in a number of iterations proportional to either $\epsilon^{-1}$ or $\epsilon^{-2}$, hiding problem-specific dependencies on smoothness and domain size. However, the cost of each iteration involves computing a whole gradient, which incurs another multiplicative loss of the dimension in the runtime.
Several other works aiming to solve the $\ell_\infty$ regression problem via constructing a smooth approximation, including [Nem04] and [Nes07], incurred the same fundamental barrier in convergence rate. These works posed the (smooth) regularized regression problem as finding the saddle point of a convex-concave function via a specially-constructed first-order method. The main barrier to improving upon prior work at this point had been the inability to construct regularizers of small range which are strongly convex with respect to the $\ell_\infty$ norm. For some time, these issues posed a barrier to finding faster algorithms for the $\ell_\infty$ regression problem and many related problems.
Very recently, Sherman [She17] presented an alternative method which was able to break this barrier and obtain an $\epsilon^{-1}$ iteration count for finding approximate solutions to the regression problem, where each iteration computes a gradient. The algorithm used was a variation of Nesterov's dual extrapolation method [Nes07] for approximately finding a saddle point of a convex-concave function, adapted to work for regularizers satisfying a weaker property known as area convexity, together with an analysis of its convergence. As a corollary, this algorithm was able to obtain the currently fastest-known algorithms for maximum flow and multicommodity flow.
Year  Author  Method  Iteration Complexity  Iteration Cost  Norm 

2003  [Nes05]  Smoothing  
2004  [Nem04]  Mirror prox  
2005  [Nes07]  Dual extrapolation  
2017  [She17]  Area-convexity  
2018  This paper  Local smoothness 
Abbreviated history of first-order methods, emphasizing coordinate-based methods. First-order methods for convex optimization have a long history. Gradient descent methods, converging at an $\epsilon^{-2}$ rate for Lipschitz functions and at an $\epsilon^{-1}$ rate for smooth functions, have been well studied (for example, see [Nes03] or [Bub15] for a more detailed exposition) and have been applied in many important settings.
Nesterov was the first to note that there exists a first-order algorithm for minimizing functions smooth in the Euclidean norm which converges at the accelerated $\epsilon^{-1/2}$ rate. The method is optimal in the sense that matching lower bounds for generic smooth functions had long been known. Unfortunately, this method does not apply generically to functions which are smooth in other norms, in the way that unaccelerated variants do, without possibly paying an additional dependence on the dimension. In particular, the accelerated convergence rate depends on the regularizer that the mirror descent steps use, and thus the analysis incurs a loss based on the size of the regularizer, which is the barrier in the aforementioned smooth function case. More formally, it is a folklore result that any function which is strongly convex with respect to the $\ell_\infty$ norm over the $\ell_\infty$ ball must have a range which grows with the dimension, which we show in Section A.1.
There has been much interest in applying randomized first-order methods to more efficiently obtain an approximate minimizer in expectation, when the convex optimization problem has certain structure. Two examples of these randomized methods in the literature are stochastic gradient and coordinate gradient descent methods. Stochastic gradient descent is useful when the function to be minimized efficiently admits a stochastic gradient oracle (for example, in finite-sum settings), which is defined as an oracle which at any iterate $x$ returns a noisy estimate $g$ such that $\mathbb{E}[g] = \nabla f(x)$.
Coordinate methods, on the other hand, are an alternative family of randomized algorithms, first studied in [Nes12]. With a similar motivation to stochastic gradient methods, the idea is that by using crude, computationally efficient approximations to the full gradient, one is still able to find an approximate minimizer in expectation. One benefit is that coordinate descent admits a more fine-grained analysis of convergence rate based on structural properties of the function, i.e. the smoothness of the function in each coordinate. There is a line of work which uses the fact that there is a dual function of the finite-sum problem (which admits a natural stochastic gradient oracle) that separates by coordinate, implying an interesting duality between coordinate descent and stochastic gradient descent algorithms.
Generalizations of standard coordinate descent have received much attention recently, for both their theoretical and practical implications. [Nes12] provided an accelerated version of the standard coordinate descent algorithm, but the naive implementation of its steps was inefficient, taking time linear in the dimension. The study of efficient accelerated coordinate descent methods (which converge at an accelerated rate without an additional dependence on dimension) was pioneered by [LS13], and since then a flurry of other works, including [FR15, AQRY16, QR16], have improved the rate of convergence and generalized the methods to composite functions with a separable composite term, of the form $f(x) + \sum_i \psi_i(x_i)$. We remark that our box constraint can be represented as such a separable composite term in the objective, and our constrained accelerated coordinate descent algorithm is an adaptation of such composite methods. For a more detailed history of the study of coordinate descent methods, we refer the reader to [FR15].
Accelerated coordinate-based methods have proven useful in many ways when applied to problems in theoretical computer science. For example, the authors of [LS13] framed graph Laplacian system solvers as a coordinate descent problem to give better runtime guarantees. One particularly interesting example highlighting the potential of accelerated coordinate descent for minimizing entropy-based functions was the work of [AO15] on solving packing and covering LPs, where the constraint matrix is nonnegative, in which they also attained an $\epsilon^{-1}$ rate of convergence. Conventional wisdom is that these results are specific to the structure of the particular problem, so any exploration of accelerated methods in greater generality is particularly interesting.
The maximum flow problem. The maximum flow problem is a fundamental problem in combinatorial optimization that has been studied extensively for several decades. Up until recently, the toolkit used to solve the problem has been primarily combinatorial, culminating in algorithms with runtime roughly for finding a maximum flow in graphs with edges and vertices and polynomially bounded capacities [GR98], and for finding a maximum flow in undirected graphs with edges, vertices, and a maximum flow value of [KL02].
Breakthroughs in the related problem of electrical flow using tools from continuous optimization and numerical linear algebra were first achieved by Spielman and Teng [ST04] who showed that solving a linear system in the Laplacian of a graph could be done in nearly linear time, which is equivalent to computing an electrical flow.
Notably, the electrical flow problem corresponds to approximately solving an $\ell_2$ regression problem, and the maximum flow problem corresponds to approximately solving an $\ell_\infty$ regression problem. Accordingly, using the faster algorithms for electrical flow combined with a multiplicative weights approach, the authors of [CKM11] were able to make a breakthrough, approximately solving maximum flow in time $\tilde{O}(m^{4/3}\,\mathrm{poly}(1/\epsilon))$, where $\tilde{O}(\cdot)$ hides logarithmic factors in the runtime. Then, using constructions presented in [Mad10], the authors of [She13] and [KLOS14] were able to reduce this runtime to nearly linear, essentially using variants of preconditioned gradient descent in the $\ell_\infty$ norm. This runtime was further improved by Peng in [Pen16] using a recursive construction of the combinatorial preconditioner. As previously mentioned, the $\epsilon^{-2}$ dependence in the runtime was a barrier typical of algorithms for minimizing smooth functions, and it was broken in [She17], which attained an $\epsilon^{-1}$ dependence even for the multicommodity flow problem.
Year  Author  Complexity  Weighted  Directed 
1998  [GR98]  Yes  Yes  
1998  [Kar98]  Yes  No  
2002  [KL02]  Yes  No  
2011  [CKM11]  Yes  No  
2012  [LRS13]  No  No  
2013  [She13], [KLOS14]  Yes  No  
2013  [Mad13]  No  Yes  
2014  [LS14]  Yes  Yes  
2016  [Pen16]  Yes  No  
2017  [She17]  Yes  No  
2018  This work  Yes  No  

1.4 Organization
The rest of this paper is organized as follows. Many proofs are deferred to the appendices.

Section 2: Overview. We introduce the definitions and notation we use throughout the paper, and give a general framework motivating our work.

Section 3: Regression. We first give a framework for accelerated randomized algorithms which minimize the constrained regression function based on uniform sampling, as well as a faster one based on nonuniform sampling which assumes access to a coordinate smoothness and sampling oracle, then give efficient implementations for these oracles for structured problems.

Section 4: Maximum Flow. We show how to attain a faster algorithm for maximum flow by providing implementations for the oracles via exploiting the combinatorial structure of the flow regression problem (indeed, our implementation applies in more generality to $\ell_\infty$ regression problems in a column-sparse matrix). Furthermore, we give the exact maximum flow runtimes achieved via rounding the resulting approximate flow of our method.
2 Overview
2.1 Basic Definitions
First, we define some basic objects and properties which we will deal with throughout this paper.
General Notation. We use $\tilde{O}(f)$ to denote runtimes of the following form: $O(f \log^c f)$, where $c$ is a constant. With an abuse of notation, we let $\tilde{O}(1)$ denote runtimes hiding polynomials in $\log n$ when the variable $n$ is clear from context, and refer to such runtimes as “nearly constant.”
Generally, we work with functions whose arguments are vector-valued variables in $n$-dimensional space, and which may depend on a linear operator $A$. Correspondingly, we use $i$ and $j$ to index into these sets of dimensions, where $[k]$ denotes the set $\{1, 2, \dots, k\}$. We use $e_i$ to denote the $i$-th standard basis vector, i.e. the vector which is 1 in dimension $i$ and 0 everywhere else. We use $x \circ y$ to denote the vector which is the coordinate-wise product of $x$ and $y$, i.e. its $i$-th coordinate is $x_i y_i$.
Matrices. Generally in this work, we deal with matrices $A \in \mathbb{R}^{m \times n}$ unless otherwise specified. Accordingly, we index into rows of $A$ with one of the indices above and into columns with the other, and refer to individual rows and columns of $A$ when it is clear from context.
We use $\mathrm{diag}(x)$ to denote the diagonal matrix whose diagonal entries are the coordinates of the vector $x$. We call a square symmetric matrix $M$ positive semidefinite if $x^\top M x \ge 0$ holds for all vectors $x$. For positive semidefinite matrices we apply the Loewner ordering and write $M \preceq N$ if $x^\top M x \le x^\top N x$ holds for all vectors $x$.
Finally, we say that a matrix $A$ is $s$-column-sparse if no column of $A$ has more than $s$ nonzero entries.
Norms. We use $\|\cdot\|$ to denote an arbitrary norm when one is not specified. For scalar-valued $p \ge 1$, including $p = \infty$, we use $\|x\|_p$ to denote the $\ell_p$ norm. For vector-valued $w > 0$, we use $\|x\|_w = (\sum_i w_i x_i^2)^{1/2}$ to denote the weighted quadratic norm.
For a norm $\|\cdot\|$, we write the dual norm as $\|\cdot\|_*$, defined as $\|y\|_* = \max_{\|x\| \le 1} y^\top x$. It is well known that the dual norm of $\ell_p$ is $\ell_q$ with $1/p + 1/q = 1$. For a matrix $A$ and a vector norm $\|\cdot\|$, we correspondingly define the induced matrix norm $\|A\| = \max_{\|x\| \le 1} \|Ax\|$. For example, $\|A\|_\infty$ is the largest $\ell_1$ norm of a row of $A$.
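As a quick numerical illustration of the last example (our own sketch, with a random matrix), the induced $\ell_\infty$ norm of a matrix equals its largest row $\ell_1$ norm, attained at a sign vector:

```python
import numpy as np

# Sketch: the operator norm induced by l_inf, max_{||x||_inf <= 1} ||Ax||_inf,
# equals the largest l_1 norm of a row of A (attained at that row's signs).
rng = np.random.default_rng(5)
A = rng.standard_normal((4, 6))
row_bound = np.max(np.abs(A).sum(axis=1))   # largest row l_1 norm

i = int(np.argmax(np.abs(A).sum(axis=1)))   # maximizing row
x = np.sign(A[i])                           # feasible: ||x||_inf = 1
attained = np.max(np.abs(A @ x))
assert abs(attained - row_bound) < 1e-9     # the bound is attained

for _ in range(200):                        # and never exceeded on the ball
    x = rng.uniform(-1.0, 1.0, size=6)
    assert np.max(np.abs(A @ x)) <= row_bound + 1e-9
print("||A||_inf = max row l_1 norm verified")
```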
Functions. We will primarily be concerned with minimizing convex functions subject to the argument being restricted by a box constraint, where the domain will be a scaled $\ell_\infty$ ball unless otherwise specified. Whenever the function is clear from context, $x^*$ will refer to any minimizing argument of the function. We use the term $\epsilon$-approximate minimizer of a function $f$ to mean any point $x$ such that $f(x) \le f(x^*) + \epsilon$. Furthermore, we define the OPT operator to be such that $\mathrm{OPT}(f)$ is the optimal value of $f$, when this optimal value is well-defined.
For differentiable functions $f$ we let $\nabla f$ be the gradient and let $\nabla^2 f$ be the Hessian. We let $\nabla_i f(x)$ be the value of the $i$-th partial derivative at $x$; we also abuse notation and use it to denote the vector $\nabla_i f(x) \cdot e_i$ when it is clear from context.
Properties of functions. We say that a function $f$ is $L$-smooth with respect to some norm $\|\cdot\|$ if it obeys $\|\nabla f(x) - \nabla f(y)\|_* \le L\|x - y\|$, i.e. the dual norm of the gradient difference is Lipschitz-bounded. It is well known in the optimization literature that when $f$ is convex, this is equivalent to $f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|^2$ for all $x, y$ and, for twice-differentiable $f$, to $y^\top \nabla^2 f(x)\, y \le L\|y\|^2$ for all $x, y$.
We say that a function $f$ is $L_i$-coordinate smooth in the $i$-th coordinate if the restriction of the function to the $i$-th coordinate is $L_i$-smooth, i.e. $|\nabla_i f(x + c e_i) - \nabla_i f(x)| \le L_i |c|$ for all $x$ and scalars $c$. Equivalently, for twice-differentiable convex $f$, $\nabla^2_{ii} f(x) \le L_i$ for all $x$.
Graphs. We primarily study capacitated undirected graphs with edge set $E$ and positive edge capacities $w \in \mathbb{R}^E_{>0}$. When referring to graphs, we let $m = |E|$ and $n = |V|$. Throughout this paper, we assume that the graph is connected.
We associate the following matrices with the graph, when the graph is clear from context. The matrix of edge weights is defined as $W = \mathrm{diag}(w)$. Orienting the edges of the graph arbitrarily, the vertex-edge incidence matrix $B$ is defined by $B_{e,v} = 1$ if $v$ is the tail of $e$, $B_{e,v} = -1$ if $v$ is the head of $e$, and $B_{e,v} = 0$ otherwise. Finally, define the Laplacian matrix $L = B^\top W B$.
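These definitions can be sanity-checked on a toy graph (a sketch; the graph and weights are ours, not from the paper): building $W$, $B$, and $L = B^\top W B$ recovers the familiar degree-minus-adjacency form of the Laplacian.

```python
import numpy as np

# Illustrative sketch: build W, B, and L = B^T W B for a small weighted graph,
# and check L against the degree-minus-adjacency form.
edges = [(0, 1), (1, 2), (2, 0), (1, 3)]
w = np.array([1.0, 2.0, 3.0, 4.0])
n, m = 4, len(edges)

W = np.diag(w)
B = np.zeros((m, n))
for e, (u, v) in enumerate(edges):       # arbitrary orientation u -> v
    B[e, u], B[e, v] = 1.0, -1.0

L = B.T @ W @ B

A = np.zeros((n, n))                     # weighted adjacency matrix
for e, (u, v) in enumerate(edges):
    A[u, v] = A[v, u] = w[e]
D = np.diag(A.sum(axis=1))               # weighted degree matrix
assert np.allclose(L, D - A)
print("L == D - A holds")
```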
2.2 Overview of our algorithms
Here, we give an overview of the main ideas used in our algorithms for approximately solving $\ell_\infty$ regression problems. The main conceptual contribution of this work is a new variation of coordinate descent which uses the novel concept of local coordinate smoothness in order to get tighter guarantees for accelerated algorithms. The result for $\ell_\infty$ regression in particular follows from a bound on the local coordinate smoothnesses of the smoothed objective, which is described in full detail in Section 3.4. Finally, in order to implement the steps of the accelerated algorithm, it is necessary to efficiently compute overestimates of the square roots of the local coordinate smoothnesses, and furthermore to sample coordinates proportional to these overestimates. This procedure is fully described in Lemma 3.13.
Local coordinate smoothness. In this work, we introduce the concept of local coordinate smoothness at a point . This generalizes the concept of global coordinate smoothness to a particular point. This definition is crucial to the analysis throughout the rest of the paper.
Definition 2.1 (Local coordinate smoothness)
We say a function $f$ is $L_i(x)$-locally coordinate smooth in coordinate $i$ at a point $x$ if, for $y = x - \frac{1}{L_i(x)} \nabla_i f(x)\, e_i$, we have $f(y) \le f(x) + \nabla_i f(x)(y_i - x_i) + \frac{L_i(x)}{2}(y_i - x_i)^2$. Equivalently, for differentiable convex $f$, $\nabla^2_{ii} f(z) \le L_i(x)$ for all $z$ between $x$ and $y$.
The proof of this fact is the same as the proof of equivalence for the standard definition of smoothness, restricted to an interval. Note that this says that a coordinate descent step using the local smoothness at a point exhibits essentially the same behavior as a single step of coordinate descent with global smoothnesses. In particular, for the point which the coordinate descent algorithm would step to, the function values exhibit the same quadratic-like upper bound along the coordinate. For a more motivating discussion of this definition, we refer the reader to the analysis of coordinate descent presented in Section A.4. We will drop the $x$ from the notation when the point we are discussing is clear, i.e. a particular iterate of one of our algorithms.
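To illustrate the definition on a simple example of our own choosing (not the paper's objective), take $f(x) = \sum_i \exp(x_i)$, whose coordinate Hessian is $\exp(x_i)$: along a descent step in coordinate $i$ the Hessian only decreases, so $\exp(x_i)$ is a valid local coordinate smoothness at $x$, and a step using it achieves the usual guaranteed decrease.

```python
import numpy as np

# Sketch (example function is ours): f(x) = sum_i exp(x_i) has coordinate
# Hessian exp(x_i). A descent step moves coordinate i left, so exp(x_i) upper
# bounds the Hessian on the whole step interval: a valid *local* smoothness.
def f(x):
    return np.sum(np.exp(x))

x = np.array([1.0, -2.0, 0.5])
i = 0
g = np.exp(x[i])                    # grad_i f(x)
L_loc = np.exp(x[i])                # local coordinate smoothness at x
y = x.copy()
y[i] -= g / L_loc                   # the coordinate descent step

decrease = f(x) - f(y)
assert decrease >= g**2 / (2 * L_loc) - 1e-12   # standard descent guarantee
print(f"guaranteed decrease {g**2/(2*L_loc):.4f}, actual {decrease:.4f}")
```

The global coordinate smoothness of this $f$ is unbounded, yet the local value at $x$ suffices for the step taken, which is exactly the leverage the definition provides.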
Bounding the progress of coordinate descent in $\ell_\infty$-smooth functions. Here, we sketch the main idea underlying our accelerated methods. Why is it possible to hope to accelerate gradient methods in the $\ell_\infty$ norm via coordinate descent? One immediate reason is that smoothness in this norm imposes a strong bound on the sum of the local coordinate smoothness values of $f$.
As we show in Appendix A, gradient descent for an $L$-smooth function initialized at $x_0$ takes roughly $L\|x_0 - x^*\|^2/\epsilon$ iterations to converge to a solution which has $\epsilon$ additive error, whereas coordinate descent with appropriate sampling probabilities $p_i \propto L_i$, for coordinate smoothnesses $L_i$, takes roughly $(\sum_i L_i)\|x_0 - x^*\|_2^2/\epsilon$ iterations to converge to the same quality of solution.
Note that when the norm in the gradient descent method is $\ell_\infty$, the two iteration counts are comparable whenever $\sum_i L_i$ is comparable to $L$, but the coordinate iterates can be up to $n$ times cheaper because they do not require a full gradient computation. Furthermore, the coordinate method can be accelerated. So, if we can demonstrate $\sum_i L_i(x) \le L$, we can hope to match and improve the runtime. Of course, there are several caveats: we can only implement such an algorithm if we are able to efficiently update overestimates of the local smoothnesses and sample efficiently by them, issues which we will discuss later. To be more concrete, we will demonstrate the following fact.
Lemma 2.2
Suppose for some point $x$, $f$ is convex and $L$-smooth with respect to $\|\cdot\|_\infty$, let $H = \nabla^2 f(x)$, and let $L_i(x) = H_{ii}$. Then $\sum_i L_i(x) \le L$.
Proof: Throughout, fix the point $x$, and define $H = \nabla^2 f(x)$ and $S = \sum_i H_{ii}$. Consider drawing $y$ uniformly at random from $\{-1, 1\}^n$. By the smoothness assumption, we have $y^\top H y \le L\|y\|_\infty^2 = L$. Also, note that
$\mathbb{E}\left[y^\top H y\right] = \sum_{i,j} H_{ij}\,\mathbb{E}[y_i y_j] = \sum_i H_{ii} = S,$
since $\mathbb{E}[y_i y_j] = 0$ for $i \ne j$. Thus, by the probabilistic method, there exists some $y \in \{-1,1\}^n$ such that $S \le y^\top H y \le L$, as desired.
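The lemma is easy to verify numerically (a sketch with a random PSD Hessian): for small $n$ the $\ell_\infty$ smoothness constant of the quadratic $y \mapsto y^\top H y$ can be computed exactly by enumerating sign vectors, since a convex function over a box is maximized at a vertex, and the trace never exceeds it.

```python
import itertools
import numpy as np

# Sketch verifying sum_i H_ii <= L for L = max_{||y||_inf <= 1} y^T H y:
# enumerate the vertices {-1, 1}^n of the box (where the maximum of the
# convex quadratic is attained) and compare against trace(H).
rng = np.random.default_rng(1)
G = rng.standard_normal((6, 6))
H = G.T @ G                                        # random PSD matrix

L = max(np.array(y) @ H @ np.array(y)
        for y in itertools.product([-1.0, 1.0], repeat=6))
assert np.trace(H) <= L + 1e-9
print("trace(H) <= L, as Lemma 2.2 predicts")
```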
While this gives a bound on the number of iterations required by a coordinate descent algorithm, it requires being able to compute and sample by the $L_i(x)$. Note that as we take coordinate descent steps, it is not clear how the local coordinate smoothnesses will change, or how to update and compute them. Naively, at each iteration we could recompute the local smoothnesses, but this requires as much work as a full gradient computation, if not more. Furthermore, we need to implement sampling the coordinates in an appropriate way, and show how the algorithm behaves under acceleration. However, a key idea in our work is that if we take steps within regions where the smoothness values do not change by much, we can still make iterates computationally cheap, as we will show.
Implementation of local smoothness estimates for accelerated algorithms. One useful property of coordinate descent is that as long as we implement the algorithm with overestimates of the local smoothness values, the convergence rate is preserved, with a dependence on the quality of the overestimates. Our full algorithm for $\ell_\infty$ regression proceeds by showing how to compute, and sample proportional to, slight overestimates of the local smoothnesses for regression problems in a column-sparse matrix. We do so by first proving that the smooth approximation to $\ell_\infty$ regression admits local smoothnesses which can be bounded in a structured way, in Section 3.2. Further, using a lightweight data structure, we are able to maintain these overestimates and sample by them in nearly constant time, yielding a very efficient implementation, which we show in Lemma 3.13.
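The paper's data structure is specified in Lemma 3.13; purely for intuition, here is a hypothetical sketch of one standard way to support both operations in $O(\log n)$ time, via a binary-indexed (Fenwick) tree over the currently maintained weights.

```python
import random

# Hypothetical sketch: a Fenwick tree maintaining positive weights, supporting
# O(log n) weight updates and O(log n) sampling proportional to the weights.
class WeightedSampler:
    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)
        self.w = [0.0] * self.n
        for i, wi in enumerate(weights):
            self.update(i, wi)

    def update(self, i, new_w):              # set weight of coordinate i
        delta, self.w[i] = new_w - self.w[i], new_w
        j = i + 1
        while j <= self.n:                   # propagate to covering nodes
            self.tree[j] += delta
            j += j & (-j)

    def sample(self):                        # return i with prob w[i]/total
        r = random.random() * sum(self.w)
        pos, bit = 0, 1 << self.n.bit_length()
        while bit:                           # descend by prefix sums
            nxt = pos + bit
            if nxt <= self.n and self.tree[nxt] < r:
                r -= self.tree[nxt]
                pos = nxt
            bit >>= 1
        return pos                           # 0-indexed coordinate

random.seed(0)
s = WeightedSampler([1.0, 3.0, 6.0])
counts = [0, 0, 0]
for _ in range(20000):
    counts[s.sample()] += 1
print([c / 20000 for c in counts])           # roughly [0.1, 0.3, 0.6]
```

In the setting described above, the maintained weights would be the overestimates of the (square roots of the) local smoothnesses, updated as the iterates move.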
Furthermore, in Section 3.3 and Section 3.4, we use modified algorithms from the literature on accelerated proximal coordinate descent, adapted to our method of local smoothness coordinate descent. Combining these pieces, we are able to give the full algorithm for $\ell_\infty$ regression.
In Section 4, we study the maximum flow problem as an example of a problem which can be reduced to $\ell_\infty$ regression in a column-sparse matrix. Using a careful analysis of bounds on the local smoothness values, we show that a direct application of our accelerated regression algorithm yields the fastest currently known approximate maximum flow algorithm.
3 Minimizing $\|Ax - b\|_\infty$ subject to a box constraint
We will now discuss how to turn the framework presented in the previous section into different specialized algorithms for the problem of box-constrained regression in the $\ell_\infty$ norm, and analyze the rates of convergence depending on the sampling probabilities associated with the (accelerated) coordinate descent method. Recall throughout that our goal is to compute an $\epsilon$-approximate minimizer of the constrained regression problem at an $\epsilon^{-1}$ rate.
Definition 3.1 (Boxconstrained regression problem)
This is the problem of finding a minimizer of the function $f(x) = \|Ax - b\|_\infty$, where the argument $x$ has domain $\{x : \|x\|_\infty \le r\}$ for some $r > 0$.
In the style of previous approaches to solving $\ell_\infty$ regression, because $\|Ax - b\|_\infty$ is not a smooth function, we choose to minimize a suitable smooth approximation instead. Intuitively, the $\epsilon^{-1}$ rate comes from accelerating gradient descent applied to a function whose smoothness scales with $\epsilon^{-1}$. The function error of the iterate with respect to OPT is the sum of the approximation error of the smoothing and the optimization error,
so if we wish for an $\epsilon$-approximate minimizer, it suffices to pick the smoothing parameter so that each term is at most $\epsilon/2$.
3.1 Constructing the smooth approximation to regression
In this section, we define the smooth approximation of $\ell_\infty$ regression we use throughout the paper and provide some technical facts about this approximation. Note that these approximations are standard in the literature. First, we define the smax function, which is used throughout.
Definition 3.2 (Softmax)
For all real-valued vectors $x$ and smoothing parameter $t > 0$, we let $\mathrm{smax}_t(x) = t \log\left(\sum_i \exp(x_i / t)\right)$.
Fact 3.3 (smax Additive Error)
For all $x \in \mathbb{R}^m$, $\max_i x_i \le \mathrm{smax}_t(x) \le \max_i x_i + t \log m$.
Proof: This follows from monotonicity of $\log$ and positivity of $\exp$: letting $i^*$ be the maximal index of $x$, $\mathrm{smax}_t(x) \ge t \log(\exp(x_{i^*}/t)) = x_{i^*}$, and $\mathrm{smax}_t(x) \le t \log(m \exp(x_{i^*}/t)) = x_{i^*} + t \log m$.
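A quick numerical check of these bounds (our sketch; the max-subtraction is a standard log-sum-exp stabilization, an implementation detail rather than part of the definition):

```python
import numpy as np

# Sketch of smax_t(x) = t * log(sum_i exp(x_i / t)) and its additive-error
# bound: max(x) <= smax_t(x) <= max(x) + t * log(m).
def smax(x, t):
    mx = np.max(x)                       # stabilize before exponentiating
    return mx + t * np.log(np.sum(np.exp((x - mx) / t)))

rng = np.random.default_rng(2)
t = 0.1
for _ in range(100):
    x = rng.standard_normal(50)
    s = smax(x, t)
    assert np.max(x) <= s <= np.max(x) + t * np.log(50) + 1e-12
print("smax additive-error bounds hold")
```

Note how shrinking $t$ tightens the approximation while (as discussed below) worsening the smoothness, which is the source of the $\epsilon^{-1}$ trade-off.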
Definition 3.4
Inside the scope of the remainder of this section, let $f(x) = \mathrm{smax}_t(Ax - b)$.
Note that these properties say something about the quality of approximation smax provides on the maximum element of a vector, instead of its $\ell_\infty$ norm. However, we could have simply applied it to the vector in $\mathbb{R}^{2m}$ which has the first $m$ coordinates the same as $Ax - b$ and the last $m$ the same as $b - Ax$. For the $\ell_\infty$ regression problem, we could thus consider the minimization problem applied to $\mathrm{smax}_t(\hat{A}x - \hat{b})$, defined with the proxy matrix $\hat{A} := \binom{A}{-A}$ and the proxy vector $\hat{b} := \binom{b}{-b}$. For notational convenience, we will focus on minimizing $f$ defined above, but with $A$ and $b$ in the original dimensionalities, which preserves all dependencies on the dimension and structural sparsity assumptions used later in this work up to a constant.
Furthermore, note that setting $t = \epsilon / (2 \log(2m))$ and asking for an $\epsilon/2$-approximate minimizer of $f$ is sufficient for finding an $\epsilon$-approximate minimizer of the original regression problem. Thus, for notational convenience, we will fix this choice of $t$ for the remainder of this work, and concentrate on finding an $\epsilon/2$-approximate minimizer of $f$, which up to constants approximately solves the original problem.
Next, we state some technical properties of our approximation. We drop the subscript $t$ from many definitions because the $t$ we choose for all our methods is fixed.
Definition 3.5
For $x \in \mathbb{R}^n$ and $i \in [n]$, let $p_i(x)$ be defined as $p_i(x) := \exp(x_i / t) / \sum_{j \in [n]} \exp(x_j / t)$.
We note that, as defined, the $p_i(x)$ for any $x$ form a probability distribution. Moreover, they are defined in this way because they are used directly in the calculation of the gradient and Hessian of smax.
Fact 3.6 (Softmax Calculus)
The gradient and Hessian of smax are as follows: $\nabla \mathrm{smax}_t(x) = p(x)$ and $\nabla^2 \mathrm{smax}_t(x) = \frac{1}{t}\big(\mathrm{diag}(p(x)) - p(x) p(x)^\top\big)$, for the vector $p(x)$ with entries $p_i(x)$.
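These identities are easy to sanity-check numerically. The standalone sketch below compares the claimed gradient $p(x)$, and the Hessian diagonal $\frac{1}{t}(p_i - p_i^2)$, against central finite differences of smax:

```python
import numpy as np

def smax(x, t):
    m = np.max(x)
    return m + t * np.log(np.sum(np.exp((x - m) / t)))

def p(x, t):
    """p_i(x) = exp(x_i / t) / sum_j exp(x_j / t), computed stably."""
    z = np.exp((x - np.max(x)) / t)
    return z / z.sum()

t = 0.5
x = np.array([0.3, -1.0, 0.7, 0.1])
n, h = x.size, 1e-5

# Gradient of smax_t at x should equal p(x).
num_grad = np.array([(smax(x + h * e, t) - smax(x - h * e, t)) / (2 * h)
                     for e in np.eye(n)])
assert np.allclose(num_grad, p(x, t), atol=1e-6)

# Hessian diagonal should equal (1/t) * (p_i - p_i^2).
num_diag = np.array([(smax(x + h * e, t) - 2 * smax(x, t) + smax(x - h * e, t)) / h**2
                     for e in np.eye(n)])
assert np.allclose(num_diag, (p(x, t) - p(x, t)**2) / t, atol=1e-4)
```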
These facts about smax can be verified by direct calculation. Now, taking a cue from our earlier naive analysis of coordinate descent, we wish to make steps in regions where the coordinate smoothnesses do not change by much. Thus, we will show the following key stability property, which lets us provide estimates of the coordinate smoothnesses. In particular, we demonstrate that within some box around our current iterate, the function exhibits good local coordinate smoothness behavior.
Lemma 3.7
For all $x, \delta \in \mathbb{R}^n$ with $\|\delta\|_\infty \le t$, the following upper bound holds: $\nabla^2 \mathrm{smax}_t(x + \delta) \preceq \frac{e^2}{t}\, \mathrm{diag}(p(x))$.
Proof: For $\|\delta\|_\infty \le t$, let $y := x + \delta$. Then, we have
$$\nabla^2 \mathrm{smax}_t(y) \preceq \frac{1}{t}\, \mathrm{diag}(p(y)). \qquad (3.1)$$
Indeed, based on Fact 3.6, we know that $\nabla^2 \mathrm{smax}_t(y) = \frac{1}{t}\big(\mathrm{diag}(p(y)) - p(y)p(y)^\top\big) \preceq \frac{1}{t}\,\mathrm{diag}(p(y))$. Also, note that $p_i(y) \le e^2 p_i(x)$ for all $i \in [n]$ for the following reason: for $\|\delta\|_\infty \le t$ the numerator $\exp(x_i/t)$ of $p_i(x)$ underestimates $\exp(y_i/t)$ by at most a factor of $e$, and the denominator $\sum_{j \in [n]} \exp(x_j/t)$ similarly is an overestimate of $\sum_{j \in [n]} \exp(y_j/t)$ by at most a factor of $e$, so the discrepancy is at most a factor of $e^2$. So, $p_i(y) \le e^2 p_i(x)$. Plugging this into Equation (3.1), we have the desired conclusion.
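The stability of $p$ inside an $\ell_\infty$ ball of radius $t$ can be probed empirically. In the standalone sketch below, the $e^{\pm 2}$ window is the bound obtained by combining the two factor-of-$e$ estimates in the proof:

```python
import numpy as np

def p(x, t):
    z = np.exp((x - np.max(x)) / t)
    return z / z.sum()

rng = np.random.default_rng(1)
t = 0.2
x = rng.normal(size=20)
base = p(x, t)
for _ in range(1000):
    delta = rng.uniform(-t, t, size=20)      # ||delta||_inf <= t
    ratio = p(x + delta, t) / base
    # each p_i moves by at most a factor of e^2 in either direction
    assert np.all(ratio <= np.e**2 + 1e-9)
    assert np.all(ratio >= np.e**-2 - 1e-9)
```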
3.2 Local coordinate smoothness estimates of the approximation
In order to analyze convergence rates of local smoothness coordinate descent algorithms, we need to provide estimates of the progress of a single step in the style of Lemma A.4 and Lemma A.5. These corresponding results for bounding the convergence of gradient descent and coordinate descent operate by giving an upper bound on the function value, and then showing that the descent step minimizes the upper bound. Therefore, we proceed by providing overestimates of the local coordinate smoothnesses at a given iterate. The next couple of lemmas prove, respectively, a lower bound on the progress of a single step via our overestimate of the local smoothness, and an upper bound on the sum of the smoothness overestimates. This allows us to analyze our later algorithm in full.
In particular, we claim that for the function $f$ that we have defined, at a point $x$, choosing $L_i := \frac{e^2}{t}\, p^\top |A_{:i}|^2$ suffices as a local coordinate smoothness overestimate, where $|A_{:i}|$ is the absolute value applied entrywise on the $i$-th column of $A$. For notational convenience, throughout the rest of this section we will fix a point $x$ and use $p$ to denote $p(Ax - b)$, which is fixed for the duration of an iteration of coordinate descent. Though this estimate looks daunting, the intuition for it comes from fairly straightforward analysis of what we have already derived.
Lemma 3.8
Let $L_i := \frac{e^2}{t}\, p^\top (A_{:i})^2$, where $(A_{:i})^2$ refers to the vector obtained by squaring each entry of the $i$-th column of $A$. Then, at $x$, $f$ is $L_i$ locally coordinate smooth in coordinate $i$. Consequently, writing $g_i := \partial_i f(x)$, whenever $|g_i|/L_i \le t / \|A_{:i}\|_\infty$ we have $f\big(x - \frac{g_i}{L_i} e_i\big) \le f(x) - \frac{g_i^2}{2 L_i}$.
Proof: Firstly, we know that $\partial_i f(x) = A_{:i}^\top p(Ax - b)$ and $\partial_{ii}^2 f(x) = A_{:i}^\top \nabla^2 \mathrm{smax}_t(Ax - b)\, A_{:i}$. Furthermore, we have the following upper bound based on the analysis of Lemma 3.7. Indeed, specialized to the case where the argument is perturbed by $\delta = \eta A_{:i}$ with $\|\delta\|_\infty \le t$, we have
$$A_{:i}^\top \nabla^2 \mathrm{smax}_t(Ax - b + \delta)\, A_{:i} \le \frac{e^2}{t}\, p^\top (A_{:i})^2 = L_i. \qquad (3.2)$$
Let $g_i := \partial_i f(x)$ and let $\bar{\eta} := t / \|A_{:i}\|_\infty$. We wish to prove that for $|\eta| \le \bar{\eta}$, the guarantee $f(x + \eta e_i) \le f(x) + \eta g_i + \frac{L_i}{2}\eta^2$ holds.
Note that setting $|\eta| = \bar{\eta}$ is the largest step size we can take to stay within the box where the quadratic upper bound given by Equation 3.3 holds. So if we take a step size of smaller absolute value than $\bar{\eta}$, since then $\|\eta A_{:i}\|_\infty \le t$, certainly we stay within the valid region. Also, note that since $L_i$ is an overestimate of $\partial_{ii}^2 f$ throughout this region by Equation (3.2), the quadratic upper bound
$$f(x + \eta e_i) \le f(x) + \eta\, \partial_i f(x) + \frac{L_i}{2}\eta^2 \qquad (3.3)$$
certainly holds for all $|\eta| \le \bar{\eta}$. Consequently, $f$ is $L_i$ locally coordinate smooth at $x$. The progress bound of the step holds as an immediate corollary of local coordinate smoothness.
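To make the step concrete, here is a minimal standalone sketch of a coordinate descent pass driven by such a local overestimate, on the stacked formulation from Section 3.1, assuming an $\frac{e^2}{t}\, p^\top (A_{:i})^2$ form of the estimate and a synthetic random instance; the clamp implements the box restriction on the step size:

```python
import numpy as np

def smax(y, t):
    m = np.max(y)
    return m + t * np.log(np.sum(np.exp((y - m) / t)))

def p(y, t):
    z = np.exp((y - np.max(y)) / t)
    return z / z.sum()

rng = np.random.default_rng(2)
A0 = rng.normal(size=(30, 8))
b0 = rng.normal(size=30)
A = np.vstack([A0, -A0])               # stacked proxy matrix, so that
b = np.concatenate([b0, -b0])          # max_i (Ax - b)_i = ||A0 x - b0||_inf
t = 0.1
f = lambda x: smax(A @ x - b, t)

x = np.zeros(8)
for i in range(8):                     # one sweep of coordinate steps
    prob = p(A @ x - b, t)
    g = A[:, i] @ prob                 # partial derivative d_i f(x)
    L = (np.e**2 / t) * (prob @ A[:, i]**2)  # local smoothness overestimate
    cap = t / np.max(np.abs(A[:, i]))  # largest step staying inside the box
    step = np.clip(-g / L, -cap, cap)
    f_old = f(x)
    x[i] += step
    assert f(x) <= f_old + 1e-10       # each clamped step is a descent step
```

Because the clamped step never leaves the width-$t$ box, the quadratic upper bound is valid along the whole segment, so every step decreases the objective.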
It will prove to be useful that we can further use any overestimates of the $L_i$ defined in the statement of Lemma 3.8. As we show in the application to column-sparse matrices and maximum flow, it is possible that using these overestimates yields more efficient sampling and smoothness oracles in implementation, without affecting the runtime by more than a constant factor. The next lemma gives the particular overestimates we use in our paper.
Lemma 3.9
Let . Then, , and .
Proof: First, we show . is obvious and follows from the triangle inequality. follows from
Second, we show that the sum of these estimates is not too large. Indeed, for any , the sum up to scaling by is upper bounded by the following:
(3.4) 
For the remainder of this section, whenever we refer to the $L_i$, unless otherwise specified, we will mean these particular overestimates from Lemma 3.9.
3.3 Acceleration analysis under local smoothness sampling and dual initializations
We now state the following theorem, adapted from [QR16], for performing coordinate descent under local smoothness estimates. We note that this adaptation differs from its original presentation in two ways, which may be of independent interest. Firstly, it is stated specifically for the use of local coordinate smoothnesses, a key contribution of this work. Furthermore, the convergence rate depends on two different points, $x_0$ and $z_0$, which are initializations of a primal and a dual variable respectively. This allows us to obtain a better dependence on the domain size of the problem: the original algorithm requires reinitializing both the primal and the dual point multiple times in order to converge, whereas our algorithm restarts with a new primal point but the same dual point. In particular, the new algorithm pays a domain dependence only on the initial dual point, whereas the original algorithm pays a domain dependence with respect to the worst dual point, giving us a better dependence on the sparsity of the solution.
The proof is essentially the same as in its original form: we note that due to the choice of sampling probabilities, the only norm used in the analysis is a multiple of the $\ell_2$ norm, and we carefully change the argument to extend the convergence guarantee to differing primal and dual initial points. Nonetheless, we include the proof for completeness in the appendices; on a first read, one may use the statement of convergence as a black-box result for further analysis. We follow the presentation of [QR16].
Theorem 3.10 (Acceleration for local coordinate smoothnesses and dual initializations)
For some choice of $l \le u$, let $\psi$ be the convex function defined such that $\psi(x) = f(x)$ for $x \in [l, u]^n$ and is $+\infty$ otherwise. Further assume that there is a global constant $S$, and values $L_i(x)$ at every point $x$, such that for each point $x$, $f$ is $L_i(x)$ locally coordinate smooth in the $i$-th coordinate at the point, and $\sum_{i \in [n]} \sqrt{L_i(x)} \le S$. There is an algorithm, APCD, initialized at some primal point $x_0$ and some dual point $z_0$, both in $[l, u]^n$, which in $K$ iterations returns a point $x_K$ such that
$$\mathbb{E}[\psi(x_K)] - \psi(x^\star) \le \frac{C}{K^2}\left( \psi(x_0) - \psi(x^\star) + S^2 \|z_0 - x^\star\|_2^2 \right)$$
for a minimizer $x^\star$ and a globally computable constant $C$.
Each iteration of the algorithm requires (1) maintaining the sum $\sum_{i \in [n]} \sqrt{L_i(x)}$, (2) sampling a coordinate from the distribution proportional to $\sqrt{L_i(x)}$, (3) computing the local smoothness $L_i(x)$ for the coordinate sampled, and (4) solving a one-dimensional constrained quadratic minimization problem over $[l, u]$ for some vector-valued iterate.
In particular, assuming nearly-constant time solutions to the subproblems (1), (2), (3) and (4), each iteration takes nearly-constant time to implement. In our setting of interval constraints, it is easy to see that there is a constant-time implementation for (4). Thus, the bottleneck for a full implementation of this algorithm is usually the efficient sampling / smoothness calculation step. In Lemma 3.13, we show that all these steps are implementable in nearly-constant time for regression in a column-sparse matrix.
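Steps (2) and (4) are simple in this setting, and a standalone sketch makes the constant-time nature of (4) apparent; the helper names below are illustrative, not from the paper:

```python
import numpy as np

def sample_coordinate(rng, L):
    """(2): sample coordinate i with probability proportional to sqrt(L_i);
    the normalizer S = sum_i sqrt(L_i) is the running sum maintained in (1)."""
    w = np.sqrt(L)
    return rng.choice(len(L), p=w / w.sum())

def box_quadratic_min(y_i, g_i, L_i, lo, hi):
    """(4): argmin over z in [lo, hi] of g_i*(z - y_i) + (L_i/2)*(z - y_i)^2,
    solved in O(1) by clipping the unconstrained minimizer to the interval."""
    return min(max(y_i - g_i / L_i, lo), hi)

rng = np.random.default_rng(3)
L = np.array([1.0, 4.0, 9.0])
i = sample_coordinate(rng, L)
assert i in (0, 1, 2)
# unconstrained minimizer 0.5 - 2.0/4.0 = 0.0 lies inside [-1, 1]
assert box_quadratic_min(0.5, 2.0, 4.0, -1.0, 1.0) == 0.0
# with a tighter box, the answer clips to the boundary
assert box_quadratic_min(0.5, 2.0, 4.0, 0.25, 1.0) == 0.25
```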
3.4 Accelerated constrained regression: reducing the iteration count
We are now ready to put together the tools we have developed to provide runtime guarantees for the constrained regression problem. Suppose for now we have as inputs $R_f$, the function distance to OPT from some initial point $x_0$, and $R^2$, the squared distance from $x_0$ to the minimizer. We will discuss the details of this assumption at the end of this section, and why it is not restrictive. Then, we give our algorithm in full. Recall that per the guarantees in Theorem 3.10, in $K$ iterations of APCD initialized at $x_0$, $z_0$, we can guarantee a point $x_K$ such that
$$\mathbb{E}[f(x_K)] - \mathrm{OPT} \le \frac{C}{K^2}\left( f(x_0) - \mathrm{OPT} + S^2 \|z_0 - x^\star\|_2^2 \right)$$
for some globally computable constant $C$, where $S$ is some upper bound on the sum of square roots of the local coordinate smoothnesses of $f$ at any point. The high-level idea is that we will run phases of APCD, each for $K$ iterations. At the end of each phase, we will either approximately minimize the original function, or obtain a point which has halved the function error with respect to OPT. For the next phase, we will set the primal initial point to be the point we obtained, but we will reset the dual point to be $z_0$, the very first dual point. In this way, the domain dependence we pay is only with respect to the first dual point, but the function error we obtain halves each phase.
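The phase structure can be sketched as follows. This is a toy standalone illustration of the control flow only: `apcd_phase` is a hypothetical stand-in for a phase of APCD that is assumed to halve the function error, and the 1-D quadratic instance is synthetic; the point is that the primal point carries over between phases while the dual point is reset to the original $z_0$:

```python
def accel_regress(apcd_phase, x0, z0, err0, eps):
    """Restart scheme sketch: each phase runs the inner solver from the
    current primal point but the ORIGINAL dual point z0; each phase halves
    the function-error bound, so O(log(err0 / eps)) phases suffice."""
    x, err, phases = x0, err0, 0
    while err > eps:
        x = apcd_phase(x, z0)   # dual point reset to z0 every phase
        err /= 2                # error bound halves per phase
        phases += 1
    return x, phases

# Toy inner solver on f(x) = x^2: halving x quarters the error, which
# certainly meets the halving guarantee a phase must provide.
toy_phase = lambda x, z0: x / 2
x, phases = accel_regress(toy_phase, x0=8.0, z0=8.0, err0=64.0, eps=1.0)
assert x**2 <= 1.0
assert phases == 6
```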
Theorem 3.11
AccelRegress, initialized at $(x_0, z_0)$, computes an $\epsilon$-approximate minimizer to the constrained regression problem.