Randomized sketch descent methods for non-separable linearly constrained optimization

# Randomized sketch descent methods for non-separable linearly constrained optimization

## Abstract

In this paper we consider large-scale smooth optimization problems with multiple linear coupled constraints. Due to the non-separability of the constraints, arbitrary random sketching would not be guaranteed to work. Thus, we first investigate necessary and sufficient conditions for the sketch sampling to have well-defined algorithms. Based on these sampling conditions we developed new sketch descent methods for solving general smooth linearly constrained problems, in particular, random sketch descent and accelerated random sketch descent methods. From our knowledge, this is the first convergence analysis of random sketch descent algorithms for optimization problems with multiple non-separable linear constraints. For the general case, when the objective function is smooth and non-convex, we prove for the non-accelerated variant sublinear rate in expectation for an appropriate optimality measure. In the smooth convex case, we derive for both algorithms, non-accelerated and accelerated random sketch descent, sublinear convergence rates in the expected values of the objective function. Additionally, if the objective function satisfies a strong convexity type condition, both algorithms converge linearly in expectation. In special cases, where complexity bounds are known for some particular sketching algorithms, such as coordinate descent methods for optimization problems with a single linear coupled constraint, our theory recovers the best-known bounds. We also show that when random sketch is sketching the coordinate directions randomly produces better results than the fixed selection rule. Finally, we present some numerical examples to illustrate the performances of our new algorithms.

## 1 Introduction

During the last decade first order methods, that eventually utilize also some curvature information, have become the methods of choice for solving optimization problems of large sizes arising in all areas of human endeavor where data is available, including machine learning [21, 27, 31], portfolio optimization [17, 6], internet and multi-agent systems [10], resource allocation [18, 40] and image processing [39]. These large-scale problems are often highly structured (e.g., sparsity in data, separability in objective function, convexity) and it is important for any optimization method to take advantage of the underlying structure. It turns out that gradient-based algorithms can really benefit from the structure of the optimization models arising in these recent applications [5, 22].

Why random sketch descent methods? The optimization problem we consider in this paper has the following features: the size of data is very large so that usual methods based on whole gradient/Hessian computations are prohibitive; moreover the constraints are coupled. In this case, an appropriate way to approach these problems is through sketch descent methods due to their low memory requirements and low per-iteration computational cost. Sketching is a very general framework that covers as a particular case the (block) coordinate descent methods [15] when the sketch matrix is given by sampling columns of the identity matrix. Sketching was used, with a big success, to either decrease the computation burden when evaluating the gradient in first order methods [22] or to avoid solving the full Newton direction in second order methods [24]. Another crucial advantage of sketching is that for structured problems it keeps the computation cost low, while preserving the amount of data brought from RAM to CPU as for full gradient or Newton methods, and consequently allows for better CPUs utilization on modern multi-core machines. Moreover, in many situations general sketching keeps the per-iteration running-time almost unchanged when compared to the particular sketching of the identity matrix (i.e. comparable to coordinate descent settings). This, however, leads to a smaller number of iterations needed to achieve the desired quality of the solution as observed e.g. in [25].

In second order methods sketching was used to either decrease the computation cost when evaluating the full Hessian or to avoid solving the full Newton direction. In [24, 3] a Newton sketch algorithm was proposed for unconstrained self-concordant minimization, which performs an approximate Newton step, wherein each iteration only a sub-sampled Hessian is used. This procedure significantly reduces the computation cost, and still guarantees superlinear convergence for self-concordant objective functions. In [25], a random sketch method was used to minimize a smooth function which admits a non-separable quadratic upper-bound. In each iteration a block of coordinates was chosen and a subproblem involving a random principal submatrix of the Hessian of the quadratic approximation was solved to obtain an improving direction.

In first order methods particular sketching was used, by choosing as sketch matrix (block) columns of the identity matrix, in order to avoid computation of the full gradient, leading to coordinate descent framework. The main differences in all variants of coordinate descent methods consist in the criterion of choosing at each iteration the coordinate over which we minimize the objective function and the complexity of this choice. Two classical criteria used often in these algorithms are the cyclic and the greedy coordinate search, which significantly differs by the amount of computations required to choose the appropriate index. For cyclic coordinate search estimates on the rate of convergence were given recently in [2, 8, 32], while for the greedy coordinate search (e.g. Gauss-Southwell rule) the convergence rates were given in [35, 15]. Another approach is based on random choice rule, where the coordinate search is random. Complexity results on random coordinate descent methods for smooth convex objective functions were obtained in [22, 18]. The extension to composite convex objective functions was given e.g. in [27, 19, 14, 21, 28]. These methods are inherently serial. Recently, accelerated [5, 4], parallel [19, 29, 34], asynchronous [13] and distributed implementations [33, 16] of coordinate descent methods were also analyzed. Let us note that the idea of sketching or sub-sampling was also successfully applied in various other settings, including [37, 30, 7].

Related work. However, most of the aforementioned sketch descent methods assume essentially unconstrained problems, which at best allow separable constraints. In contrast, in this paper we consider sketch descent methods for general smooth problems with linear coupled constraints. Particular sketching-based algorithms, such as greedy coordinate descent schemes, for solving linearly constrained optimization problems were investigated in [35, 15], while more recently in [1] a greedy coordinate descent method is developed for minimizing a smooth function subject to a single linear equality constraint and additional bound constraints on the decision variables. Random coordinate descent methods that choose at least 2 coordinates at each iteration have been also proposed recently for solving convex problems with a single linear coupled constraint in [18, 21, 20]. In all these papers, detailed convergence analysis is provided for both, convex and non-convex settings. Motivated by the work in [20] several recent papers have tried to extended the random coordinate descent settings to multiple linear coupled constraints [6, 21, 26]. In particular, in [26] an extension of the 2-random coordinate descent method from [20] has been analyzed, however under very conservative assumptions, such as full rank condition on each block of the matrix describing the linear constraints. In [6] a particular sketch descent method is proposed, where the sketch matrices specify arbitrary subspaces that need to generate the kernel of the matrix describing the coupled constraints. However, in the large-scale context and for general linear constraints it is very difficult to generate such sketch matrices. Another strand of this literature develops and analysis center-free gradient methods [40], augmented Lagrangian based methods [9] or Newton methods [38].

Our approach and contribution. Our approach introduces general sketch descent algorithms for solving large-scale smooth optimization problems with multiple linear coupled constraints. Since we have non-separable constraints in the problem formulation, a random sketch descent scheme needs to consider new sampling rules for choosing the coordinates. We first investigate conditions on the sketching of the coordinates over which we minimize at each iteration in order to have well-defined algorithms. Based on these conditions we develop new random sketch descent methods for solving our linearly constrained convex problem, in particular, random sketch descent and accelerated random sketch descent methods. However, unlike existing methods such as coordinate descent, our algorithms are capable of utilizing curvature information, which leads to striking improvements in both theory and practice.

Our contribution. To this end, our main contribution can be summarized as follows:

(i) Since we deal with optimization problems having non-separable constraints we need to design sketch descent schemes based on new sampling rules for choosing the sketch matrix. We derive necessary and sufficient conditions on the sketching of the coordinates over which we minimize at each iteration in order to have well-defined algorithms. To our knowledge, this is the first complete work on random sketch descent type algorithms for problems with more than one linear constraint. Our theoretical results consist of new optimization algorithms, accompanied with global convergence guarantees to solve a wide class of non-separable optimization problems.

(ii) In particular, we propose a random sketch descent algorithm for solving such general optimization problems. For the general case, when the objective function is smooth and non-convex, we prove sublinear rate in expectation for an appropriate optimality measure. In the smooth convex case we obtain in expectation an -accurate solution in at most iterations, while for strongly convex functions the method converges linearly.

(iii) We also propose an accelerated random sketch descent algorithm. From our knowledge, this is the first analysis of an accelerated variant for optimization problems with non-separable linear constraints. In the smooth convex case we obtain in expectation an -accurate solution in at most iterations. For strongly convex functions the new random sketch descent method converges linearly.

Let us emphasize the following points of our contribution. First, our sampling strategies are for multiple linear constraints and thus very different from the existing methods designed only for one linear constraint. Second, our (accelerated) sketch descent schemes are the first designed for this class of problems. Thirdly, our non-accelerated sketch descent algorithm covers as special cases some methods designed for problems with a single linear constraint and coordinate sketch. In these special cases, where convergence bounds are known, our theory recovers the best known bounds. We also illustrate, that for some problems, random sketching of the coordinates produces better results than deterministic selection of them. Finally, our theory can be used to further develop other methods such as Newton-type schemes.

Paper organization. The rest of this paper is organized as follows. Section 2 presents necessary and sufficient conditions for the sampling of the sketch matrix. Sections 3 provides a full convergence analysis of the random sketch descent method, while Section 4 extends this convergence analysis to an accelerated variant. In Section 5 we show the benefits of general sketching over fixed selection of coordinates.

### 1.1 Problem formulation

We consider the following large-scale general smooth optimization problem with multiple linear coupled constraints:

 (1) f∗=minx∈Rnf(x)s.t.Ax=b,

where is a general differentiable function and , with , is such that the feasible set is nonempty. The last condition is satisfied if e.g. has full row rank. The simplest case is when , that is we have a single linear constraint as considered in [1, 18, 21, 20]. Note that we do not necessarily impose to be a convex function. From the optimality conditions for our optimization problem (1) we have that is a stationary point if there exists some such that:

 ∇f(x∗)+ATλ∗=0andAx∗=b.

However, if is convex, then any satisfying the previous optimality conditions is a global optimum for optimization problem (1). Let us define the set of these points. Therefore, is a stationary (optimal) point if it is feasible and satisfies the condition:

 ∇f(x∗)∈range(AT).

### 1.2 Motivation

We present below several important applications from which the interest for problems of type (1) stems.

#### Page ranking

This problem has many applications in google ranking, network control, data analysis [10, 22, 18]. For a given graph let be its incidence matrix, which is sparse. Define , where denotes the vector with all entries equal to . Since , i.e. the matrix is column stochastic, the goal is to determine a vector such that: and . This problem can be written directly in optimization form:

 minx∈Rnf(x)(:=12∥Ex−x∥2)s.t.eTx=1,

which is a particular case of our optimization problem (1) with and sparse matrix.

#### Machine learning

Consider the optimization problem associated with the loss minimization of linear predictors without regularization for a training data set containing observations [31]:

 minw∈Rm1nn∑i=1ϕi(wTai).

Here is some loss function, e.g. SVM , logistic regression , ridge regression , regression with the absolute value and support vector regression for some predefined insensitivity parameter . Moreover, in classification the labels , while in regression . Further, let denote the Fenchel conjugate of . Then the dual of this problem becomes:

 minx∈Rnf(x)(=1nn∑i=1ϕ∗i(xi))s.t.Ax=0,

where . Clearly, this problem fits into our model (1), with representing the number of features, the number of training data, and the objective function is separable.

#### Portfolio optimization

In the basic Markowitz portfolio selection model [17], see also [6] for related formulations, one assumes a set of assets, each with expected returns , and a covariance matrix , where is the covariance between returns of assets and . The goal is to allocate a portion of the budget into different assets, i.e. represents a portion of the wealth to be invested into asset , leading to the first constraint: . Then, the expected return (profit) is and the variance of the portfolio can be computed as . The investor seeks to minimize risk (variance) and maximize the expected return, which is usually formulates as maximizing profit while limiting the risk or minimizing risk while requiring given expected return. The later formulation can be written as:

 minx∈RnxTΣxs.t.n∑i=1μixi=r,n∑i=1xi=1,

which clearly fits again into our optimization model (1) with . We can further assume that each asset belongs exactly to one class , e.g. financials, health care, industrials, etc. The investor would like to diversify its portfolio in such a way that the net allocation in class is : for all , where if asset is in class and otherwise. One can observer that in this case we get a similar problem as above, but with additional linear constraints ().

## 2 Random sketching

It is important to note that stochasticity enters in our algorithmic framework through a user-defined distribution describing an ensemble of random matrices (also called sketch matrices). We assume that , in fact we usually require and note that can also be random (i.e. the can return matrices with different ). Our schemes and the underlying convergence theory support virtually all thinkable distributions. The choice of the distribution should ideally depend on the problem itself, as it will affect the convergence speed. However, for now we leave such considerations aside. The basic idea of our algorithmic framework consists of a given feasible , a sample sketch matrix and a basic update of the form:

 (2) x+=x+Sdsuch thatASd=0,

where the requirement ensures that the new point will remain feasible. Clearly, one can choose a distribution which will not guarantee convergence to stationary/optimal point. Therefore, we need to impose some minimal necessary conditions for such a scheme to be well-defined. In particular, in order to avoid trivial updates, we need to choose such that the homogeneous linear system admits also nontrivial solutions, that is we require:

 (3) range(S)∩ker(A)≠0.

Moreover, since for any feasible an optimal solution satisfies , it is necessary to require that with our distribution we can generate :

 (4) ker(A)=Span(∪S∼S(range(S)∩ker(A))).

Note that the geometric conditions (3)-(4) are only necessary for a sketch descent type scheme to be well-defined. However, for a discrete probability distribution, having e.g. the property that for all , condition (4) is also sufficient. In Section 2.3 (see Assumption 2.3) we will provide sufficient conditions for a general probability distribution in order to obtain well-defined algorithms based on such sketching. Below we provide several examples of distributions satisfying our geometric conditions (3)-(4).

### 2.1 Example 1 (finite case)

Let us consider a finite (or even countable) probability distribution . Further, let be a particular solution of the linear system . For example, if denotes the pseudo-inverse of the matrix , then we can take . Moreover, by the properties of the pseudo-inverse, is a projection matrix onto , that is . Therefore, any solution of the linear system can be written as:

 x=A†b+(In−A†A)y,

for any . Thus, we may consider a finite (the extension to countable case is straightforward) set of matrices endowed with a probability distribution for all and condition (4) requires that the span of the image spaces of contains or is equal to :

 (5) ker(A)=range(In−A†A)=Span(∪i:Pi>0(range(Si)∩ker(A))).

In particular, we have several choices for the sampling for a finite distribution:

1. If one can compute a basis for , then we can take as random sketch matrix any block of elements of this basis endowed with some probability (for the case the matrix represents a single element of this basis generating ). This sampling was also considered in [6]. Clearly, in this particular case condition (3) and condition (4) or equivalently (5) hold since .

2. However, for a general matrix it is difficult to compute a basis of . A simple alternative is to consider then any -tuple , with , and the corresponding random sketch matrix , where denotes the th column of the identity matrix , with some probability distribution over the set of -tuples in . It is clear that for this choice condition (3) and condition (4) or equivalently (5) also hold. For the particular case when we have a single linear coupled constraint, i.e. , we can take random matrices also considered e.g. in [18]. This particular sketch matrix based on sampling columns of the identity matrix leads to coordinate descent framework. However, the other examples (including those from Section 2.2) show that our sketching framework is more general than coordinate descent.

3. Instead of working with the matrix , as considered previously, we can take any orthogonal or full rank matrix having the columns and thus forming a basis of . Then, we can consider tuples , with , and the corresponding random sketch matrix , with some probability distribution over the set of -tuples in . It is clear that for this choice of the random sketch matrices the condition (3) and condition (4) or equivalently (5) still hold.

### 2.2 Example 2 (infinite case)

Let us now consider a continuous (uncountable) probability distribution . We can consider in this case two simple sampling strategies:

1. If one can sample easily a random matrix such that , then we can choose one or several columns from this matrix as a sketch matrix . In this case .

2. Alternatively, we can sample random full rank matrices in and then define to be random columns. Furthermore, since it is known that random Gaussian matrices are full rank almost surely, then we can define to be a random Gaussian matrix. Similarly, we can consider random uniform matrices and define e.g. .

A sufficient condition for a well-defined sampling in the infinite case is to ensure that in expectation one can move in any direction in . Considering the general update rule (2), we see that if we sample , then our update can be only for some . Now, we also have a condition, that we want to stay in the , and therefore cannot be anything, but has to be chosen such that . Now, this restricts the set of possible ’s to be such that:

 ASd=0⇒d=(Ip−(AS)†(AS))t

for some . Recall, that we allow to be also random, hence to derive the sufficient condition we need to have some quantity with dimension independent on . Note that each can be represented as for some (possibly non-unique) . Therefore, we see that if is sampled, then we can move in the direction:

 Sd=S(I−(AS)†(AS))t=S(I−(AS)†(AS))STt′,

hence, we have the ability to move in . Now, the condition to be able to move in can be expressed as requiring that on expectation we can move anywhere in :

 (6) range(E[S(I−(AS)†(AS))ST])=ker(A),

provided that the expectation exists and is finite. Note, that this condition must hold also for a discrete probability distribution, however the condition (4) is more intuitive in the discrete case. In the next section we provide algebraic sufficient conditions on the sampling for a general probability distribution in order to obtain well-defined algorithms.

### 2.3 Sufficient conditions for sketching

It is well known that in order to derive any reasonable convergence guarantees for a minimization scheme we need to impose some smoothness property on the objective function. Therefore, throughout the paper we consider the following blanket assumption on the smoothness of :

###### Assumption \thetheorem

For any feasible there exists a positive semidefinite matrix such that is positive definite on and the following inequality holds:

 (7) f(y)≤f(x)+⟨∇f(x),y−x⟩+12(y−x)TM(y−x),∀x,y∈x0+ker(A).

Note that for a general (possibly non-convex) differentiable function the smoothness inequality (7) does not imply that the objective function has Lipschitz continuous gradient, so our assumption is less conservative than requiring Lipchitz gradient assumption. However, when is convex the condition (7) is equivalent with Lipschitz continuity of the gradient of on [23]. In particular, if for some Lipschitz constant we recover the usual definition of Lipschitz continuity of the gradient for the class of convex functions. Our sketching methods derived below are based on (7) and therefore they have the capacity to utilize curvature information. In particular, if the objective function is quadratic, our methods can be interpreted as novel extensions to more general optimization models of the recently introduced iterative Hessian sketch method for minimizing self-concordant objective functions [24]. The reader should also note that we can further relax the condition (7) and require smoothness of with respect to any image space generated by the random matrix . More precisely, it is sufficient to assume that for any sample there exists a positive semidefinite matrix such that is positive definite on and the following inequality holds:

 f(y)≤f(x)+⟨∇f(x),y−x⟩+12(y−x)TMS(y−x)∀x,y∈x0+ker(A) ∧ x−y∈range(S).

Note that if for all we recover the relation (7). For simplicity of the exposition in the sequel we assume (7) to be valid, although all our convergence results can be also extended under previous smoothness condition given in terms of .

From the above discussion it is clear that the direction in our basic update (2) needs to be in the kernel of matrix . However, it is well known that the projection onto is given by the projection matrix:

 PS=Ip−(AS)†(AS).

Clearly, we have . Let us further define the matrix:

 (8) ZS=SPS(PTSSTMSPS)†PTSST∈Rn×n.

The matrix has some important properties that we will derive below since they are useful for algorithm development. First we observe that: {lemma} For any probability distribution the matrix is symmetric (), positive semidefinite (), and for any we have , that is . Moreover, the following identity holds .

{proof}

It is clear that is positive semidefinite matrix since is assumed positive semidefinite. It is well-known that for any given matrix its pseudo-inverse satisfies and . Now, for the first statement given the expression of it is sufficient to prove that for . However, if then there exists such that and consequently we have:

 PTSSTu =PTSSTATy=(I−(AS)†(AS))T(AS)Ty =[(AS)(I−(AS)†(AS))]Ty =((AS)−(AS)(AS)†(AS))Ty=0,

where in the last equality we used the first property of pseudo-inverse . For the second part of the lemma we use the expression of and the second property of the pseudo-inverse applied to the matrix , that is:

 ZSMZS=[SPS(PTSSTMSPS)†PTSST]M[SPS(PTSSTMSPS)†PTSST]=ZS,

which concludes our statements.

Now, since the random matrix is positive semidefinite, then we can define its expected value, which is also a symmetric positive semidefinite matrix:

 (9) Z=ES[ZS].

In the sequel we also consider the following assumption on the expectation matrix :

###### Assumption \thetheorem

We assume that the distribution is chosen such that has a finite mean, that is the matrix is well defined, and positive definite (notation ) on .

As we will see below, Assumption 2.3 is a sufficient condition on the probability distribution in order to ensure convergence of our algorithms that will be defined in the sequel. To our knowledge this algebraic characterization of the probability distribution defining the sketch matrices for problems with multiple non-separable linear constraints seems to be new.

Note that the necessary condition (3) holds provided that . Indeed, from Lemma (2.3) we have for all and . Therefore, we get that and we know that . Let , , then there exists unique and such that . Moreover, we have , i.e. , which implies that . Thus, and . From the last relation, we get:

 range(ZS)⊆ker(A).

Moreover, from the definition of the symmetric matrix we have , which combined with the previous relation leads to:

 range(ZS)⊆ker(A)∩range(S),

and consequently proving that the condition (3) holds provided that . Moreover, we can show that the necessary condition (4) holds if satisfies Assumption 2.3:

{lemma}

Under Assumption 2.3 the necessary condition (4) is valid. Additionally, the following identity takes place:

 range(AT)=ker(Z)

and consequently is a projection matrix onto , where denotes the pseudo-inverse of the matrix .

{proof}

Note that Assumption 2.3 holds, i.e. on , if and only if on . Moreover, for any non-zero , we have , that is . In conclusion, we get . But, , from which we can conclude (4).

For the second part we use again Lemma (2.3): for all . This means that . The other inclusion follows by reducing to absurd. Assume that there exists such that , or equivalently for some . However, note that on if and only if on , which contradicts our assumption. In conclusion, the second statement holds. Finally, it is well-known that is an orthogonal projector onto and the rest follows from standard algebraic arguments.

The primal-dual ”norms”. Since the matrix is positive semidefinite, matrix is also positive semidefinite. Moreover, from Lemma 2.3 we conclude that . In the sequel we assume that such that is a positive definite matrix on and consequently on (see Assumption 2.3). Then, we can define a norm induced by the matrix on or even . This norm will be used subsequently for measuring distances in the subspace . More precisely, we define the primal norm induced by the positive semidefinite matrix as:

 ∥u∥Z=√uTZu∀u∈Rn.

Note that for all (see Lemma 2.3) and for all . On the subspace we introduce the extended dual norm:

 ∥x∥∗Z=maxu∈Rn:∥u∥Z≤1⟨x,u⟩∀x∈ker(A).

Using the definition of conjugate norms, the Cauchy-Schwartz inequality holds:

 (10) ⟨u,x⟩≤∥u∥Z⋅∥x∥∗Z∀x∈ker(A),u∈RN.
{lemma}

Under Assumption 2.3 the primal and dual norms have the following expressions:

 (11) ∥u∥Z=√uTZu,∥x∥∗Z=√xTZ†x∀u∈Rn,∀x∈ker(A).
{proof}

Let us consider any . Then, the dual norm can be computed for any as follows:

 ∥x∥∗Z=maxu∈Rn:⟨Zu,u⟩≤1⟨x,u⟩=maxu:⟨Z(u−^u),u−^u⟩≤1⟨x,u−^u⟩ =maxu:⟨Zu,u⟩≤1,u∈ker(A)⟨x,u⟩=maxu:⟨Zu,u⟩≤1,Au=0⟨x,u⟩ =maxu:⟨Zu,u⟩≤1,uTATAu≤0⟨x,u⟩ =minν,μ≥0maxu∈Rn[⟨x,u⟩+μ(1−⟨Zu,u⟩)−ν⟨ATAu,u⟩] =minν,μ≥0μ+⟨(μZ+νATA)−1x,x⟩=minν≥0minμ≥0[μ+1μ⟨(Z+νμATA)−1x,x⟩] =minζ≥0√⟨(Z+ζATA)−1x,x⟩.

We obtain an extended dual norm that is well defined on the subspace :

 (12) ∥x∥∗Z=minζ≥0√⟨(Z+ζATA)−1x,x⟩∀x∈ker(A).

The eigenvalue decomposition of the positive semidefinite matrix can be written as , where are its positive eigenvalues and the columns of orthogonal matrix are the corresponding eigenvectors, generating and generating . Then, we have:

 (Z+ζATA)−1=Udiag(λ1,⋯,λr,ζλr+1,⋯,ζλn)−1UT,

where are the nonzero eigenvalues of symmetric matrix . From (12) it follows that our newly defined dual norm has the following closed form:

 ∥x∥∗Z=√xTZ†x∀x∈ker(A),

where denotes the pseudoinverse of matrix .

The following example shows that the 2-coordinate sampling proposed in [20] (in the presence of a single linear constraint ) is just a special case of the sketching analyzed in this paper:

###### Example \thetheorem

Let us consider the following optimization problem:

 f(x)=n∑i=1fi(xi)subject ton∑i=1xi=b.

In this case, assuming that each scalar function has Lipschitz continuous gradient, then . Moreover, we can take any random pair of coordinates with and consider the particular sketch matrix . Note that, for simplicity, we focus here on Lipschitz dependent probabilities for choosing the pair , that is with . Following basic derivations we get:

 Z(ij)=1Li+LjS(ij)[1−1−11]ST(ij)=1Li+Lj(ei−ej)(ei−ej)T, (13) Z=n(n−1)L(In−1neeT),Z†=(n−1)Ln(In−1neeT).

Clearly, on and thus Assumption 2.3 holds. Similarly, we can compute explicitly and for the fixed selection of the pair of coordinates with .

## 3 Random Sketch Descent (RSD ) algorithm

For the large-scale optimization problem (1) methods which scale cubically, or even quadratically, with the problem size is already out of the question; instead, linear scaling of the computational costs per-iteration is desired. Clearly, optimization problem (1) can be solved using projected first order methods, such as gradient or accelerated gradient, both algorithms having comparable cost per iteration [23]. In particular, both methods require the computation of the full gradient and finding the optimal solution of a subproblem with quadratic objective over the subspace :

For example, for the projected gradient method since we assume positive definite on (see Assumption 2.3), then the previous subproblem has a unique solution leading to the following gradient iteration:

 (15) xk+1G=xkG−ZIn∇f(xkG),

where is obtained by replacing in the definition of the matrix . However, for very large even the first iteration is not computable, since the cost of computing is cubic in the problem dimension (i.e. of order operations) for a dense matrix . Moreover, since usually is a dense matrix regardless of the matrix being dense or spare, the cost of the subsequent iterations is quadratic in the problem size (i.e. ). Therefore, the development of new optimization algorithms that target linear cost per iteration and nearly dimension-independent convergence rate is needed. These properties can be achieved using the sketch descent framework. In particular, let us assume that the initial iterate is a feasible point, i.e. . Then, the first algorithm we propose, Random Sketch Descent (RSD ) algorithm, chooses at each iteration a random sketch matrix according to the probability distribution and find a new direction solving a simple subproblem (see Algorithm 1 below).

Let us explain the update rule of our algorithm RSD . Note that the new direction in the update of RSD is computed from a subproblem with quadratic objective over the subspace that it is simpler than subproblem (14) corresponding to the full gradient:

 dk=argmind∈Rp:ASd=0f(xk)+⟨∇f(xk),Sd⟩+12 dTSTMSd.

We observe that from the feasibility condition we can compute as:

 d=PSt(:=(Ip−(AS)†(AS))t),

for some . Then, the constrains will not be violated. Now, let’s plug this into the objective function of the subproblem, to obtain an unconstrained problem in :

 tk=argmint∈Rp⟨∇f(xk),S((Ip−(AS)†(AS))t)⟩+12∥S(Ip−(AS)†(AS))t∥2M.

Then, from the first order optimality conditions we obtain that:

 PTSSTMSPStk=−PTSST∇f(xk),

and hence we can define as

 tk=−(PTSSTMSPS)†PTSST∇f(xk).

In conclusion we obtain the following update rule for our RSD algorithm:

 (16) xk+1=xk−SPS(PTSSTMSPS)†PTSST=ZS∇f(xk)=xk−ZS∇f(xk).

After iterations of the RSD algorithm, we generate a random output , which depends on the observed implementation of the random variable:

 Fk=(S0,⋯,Sk−1).

Let us define the expected value of the objective function w.r.t. :

 ϕk=E[f(xk)].

Next, we compute the decrease of the objective function after one random step:

 f(xk+1) =f(xk+SPStk)=f(xk−ZS∇f(xk)) f(xk)−⟨∇f(xk),ZS∇f(xk)⟩+12∥ZS∇f(xk)∥2M =f(xk)−⟨∇f(xk),ZS∇f(xk)⟩+12∇f(xk)TZSMZS∇f(xk) =f(xk)−⟨∇f(xk),ZS∇f(xk)⟩+12∇f(xk)TZS∇f(xk) (17) =f(xk)−12⟨∇f(xk),ZS∇f(xk)⟩.

Then, we obtain the following strict decrease for the objective function in the conditional expectation:

 (18) E[f(xk+1)|Fk] ≤f(xk)−12∥∇f(xk)∥2Z,

provided that is not optimal. This holds since we assume that on and since any feasible satisfying is optimal for the original problem. Therefore, RSD algorithm belongs to the class of descent methods.

### 3.1 Computation cost per-iteration for RSD

It is easy to observe that if the cost of updating the gradient is negligible, then the cost per iteration in RSD is given by the computational effort of finding the solution of the subproblem. The sketch sampling can be completely dense (e.g. Gaussian random matrix) or can be extremely sparse (e.g. a few columns of the identity matrix).

Case 1: dense sketch matrix . In this case, since we assume (in fact we usually choose of order or even smaller), then the computational cost per-iteration in the update (16) is linear in (more precisely of order ) plus the cost of computing the matrix . Clearly, if is also a dense matrix, then the cost of computing the matrix is quadratic in . However, it can be reduced substantially, that is the cost of computing this matrix depends linearly on , when e.g. we have available a decomposition of the matrix as , with and , or is sparse.

Case 2: sparse sketch matrix . For simplicity, we can assume that is chosen as few columns of the identity matrix and thus obtaining a coordinate descent type method. In this case, the cost per-iteration of RSD is independent of the problem size . For example, the cost of computing is , while the cost of computing is .

In conclusion, in all situations the iteration (16) of RSD is much computationally cheaper (at least one order of magnitude) than the iteration (15) corresponding to the full gradient. Based on the decrease of the objective function (18) we can derive different convergence rates for our algorithm RSD depending on the assumptions imposed on the objective function .

### 3.2 Convergence rate: smooth case

We derive in this section the convergence rate of the sequence generated by the RSD algorithm when the objective function is only smooth (Assumption 2.3). Recall that in the non-convex settings a feasible is a stationary point for optimization problem (1) if . On the other hand, for any feasible we have the unique decomposition of :

 ∇f(x)=ATλ+∇f(x)⊥,whereλ∈Rm,∇f(x)⊥∈ker(A).

It is clear that if a feasible satisfies , then such an is a stationary point for (1). In conclusion, a good measure of optimality for a feasible is described in terms of . The theorem below provides a convergence rate for the sequence generated by RSD in terms of this optimality measure: {theorem} Let be bounded from below, i.e. there exists such that we have and Assumptions 2.3 and 2.3 hold. Then, the iterates of RSD have the following sublinear convergence rate in expectation:

 (19) min0≤l≤k−1E[∥∇f(xl)⊥∥2Z]≤2(f(x0)−¯f)k.
{proof}

Taking expectation over the entire history in (18) we get:

 (20) ϕk+1≤ϕk−12E[∥∇f(xk)∥2Z].

Summing the previous relation and using that is bounded from below we further get:

 k−1∑l=0E[∥∇f(xl)∥2Z]≤2(ϕ0−ϕk)≤2(ϕ0−¯f).

Using the unique decomposition for all and since , then we obtain . Therefore, taking the limit as we obtain the asymptotic convergence . Moreover, since on and we also get:

 min0≤l≤k−1E[∥∇f(xl)⊥∥2Z]≤2(f(x0)−¯f)k,

which concludes our statement.

### 3.3 Convergence rate: smooth convex case

In order to estimate the rate of convergence of our algorithm when the objective function is smooth and convex, we introduce the following distance that takes into account that our algorithm is a descent method:

 (21) R(x0)=max{x∈x0+ker(A):f(x)≤f(x0)}