# Dual optimization for convex constrained objectives without the gradient-Lipschitz assumption

## Abstract

The minimization of convex objectives coming from linear supervised learning problems, such as penalized generalized linear models, can be formulated as finite sums of convex functions. For such problems, a large set of stochastic first-order solvers based on the idea of variance reduction are available and combine both computational efficiency and sound theoretical guarantees (linear convergence rates) [18], [31], [32], [12]. Such rates are obtained under both gradient-Lipschitz and strong convexity assumptions. Motivated by learning problems that do not meet the gradient-Lipschitz assumption, such as linear Poisson regression, we work under another smoothness assumption, and obtain a linear convergence rate for a shifted version of Stochastic Dual Coordinate Ascent (SDCA) [32] that improves the current state-of-the-art. Our motivation for considering a solver working on the Fenchel-dual problem comes from the fact that such objectives include many linear constraints, that are easier to deal with in the dual. Our approach and theoretical findings are validated on several datasets, for Poisson regression and another objective coming from the negative log-likelihood of the Hawkes process, which is a family of models which proves extremely useful for the modeling of information propagation in social networks and causality inference [11], [13].

## 1 Introduction

In the recent years, much effort has been made to minimize strongly convex finite sums with first order information. Recent developments, combining both numerical efficiency and sound theoretical guarantees, such as linear convergence rates, include SVRG [18], SAG [31], SDCA [32] or SAGA [12] to solve the following problem:

(1) |

where the functions correspond to a loss computed at a sample of the dataset, and is a (eventually non-smooth) penalization. However, theoretical guarantees about these algorithms, such as linear rates guaranteeing a numerical complexity to obtain a solution -distant to the minimum, require both strong convexity of and a gradient-Lipschitz property on each , namely for any , where stands for the Euclidean norm on and is the Lipschitz constant. However, some problems, such as the linear Poisson regression, which is of practical importance in web-marketing for instance [8], do not meet such a smoothness assumption. Indeed, we have in this example for where are the features vectors and are the labels, and where the model-weights must satisfy the linear constraints for all .

Motivated by machine learning problems described in Section 3 below, that do not satisfy the gradient-Lipschitz assumption, we consider a more specific task relying on a new smoothness assumption. Given convex functions with , a vector , features vectors corresponding to the rows of a matrix we consider the objective

(2) |

where , where is a -strongly convex function and is the open polytope

(3) |

Note that the linear term can be included in the regularization but the problem stands clearer if it is kept out. {definition} We say that a function is - smooth, where , if it is a differentiable and strictly monotone convex function that satisfies

for . All along the paper, we assume that the functions are - smooth. Note that is - smooth, for which we actually have an equality. Note also that the Poisson regression objective fits in this setting, where is - smooth and . See Section 3.1 below for more details.

#### Related works.

Standard first-order batch solvers (non stochastic) for composite convex objectives are ISTA and its accelerated version FISTA [4] and first-order stochastic solvers are mostly built on the idea of Stochastic Gradient Descent (SGD) [30]. Recently, stochastic solvers based on a combination of SGD and the Monte-Carlo technique of variance reduction [31], [32], [18], [12] turn out to be both very efficient numerically (each update has a complexity comparable to vanilla SGD) and very sound theoretically, because of strong linear convergence guarantees, that match or even improve the one of batch solvers. These algorithms involve gradient steps on the smooth part of the objective and theoretical guarantees justify such steps under the gradient-Lipschitz assumptions thanks to the descent lemma [5, Proposition A.24]. Without this assumption, such theoretical guarantees fall apart. Also, stochastic algorithms loose their numerical efficiency if their iterates are projected on the feasible set at each iteration as Equation (2) requires. STORC [17] can deal with constrained objectives without a full projection but is restricted to compact sets of constraints which is not the case of . Then, a modified proximal gradient method from [35] provides convergence bounds relying on self-concordance rather than the gradient-Lipschitz property. However, the convergence rate is guaranteed only once the iterates are close to the optimal solution and we observed in practice that this algorithm is simply not working (since it ends up using very small step-sizes) on the problems considered here. Recently, [6] has provided descent lemma working with a wider set of functions. Though, this set reduces to the gradient-Lipschitz functions when the canonical penalization, , is used.

#### Our contribution.

The first difficulty with the objective (2) is to remain in the open polytope . To deal with simpler constraints we rather perform optimization on the dual problem derived in Appendix 6.1

(4) |

where for a function , the Fenchel conjugate is given by , and the domain of the function . This strategy is the one used by Stochastic Dual Coordinate Ascent (SDCA) [32]. The dual problem solutions are box-constrained to which is much easier to maintain than the open polytope . Note that if is strictly increasing (resp. decreasing), then its dual is defined on (resp. ). By design, this approach keeps these constraints maintained all along the iterations so that the primal iterate converges to a point of .

In this paper, we derive linear convergence rates for SDCA without the gradient-Lipschitz assumption, by replacing it with -log smoothness, see Definition 1. Our results provide a state-of-the-art optimization technique for the considered problem (2), with sound theoretical guarantees (see Section 2) and very strong empirical properties as illustrated on experiments conducted with several datasets for Poisson regression and Hawkes processes likelihood (see Section 4). We study also SDCA with importance sampling [37] under -log smoothness and prove that it improves both theoretical guarantees and convergence rates observed in practical situations, see Sections 2.2 and 4. We provide also a heuristic initialization technique in Section 4.3 and a “mini-batch” [29] version of the algorithm in Section 4.4 that allows to end up with a particularly efficient solver for the considered problems.

We motivate even further the problem considered in this paper in Figure 1, where we consider a toy Poisson regression problem (with 2 features and 3 data points), for which L-BFGS-B typically fails while SDCA works. This illustrates the difficulty of the problem even on such an easy example.

#### Outline.

We first present the shifted SDCA algorithm in Section 2 and state its convergence guarantees in Theorem 2 under the -log smoothness assumption. We also provide theoretical guarantees for variants of the algorithm, one using proximal operators [33] and the second using importance sampling [37] which leads to better convergence guarantees in Theorem 2.2. In Section 3 we focus on two specific problems, namely Poisson regression and Hawkes processes, and explain how they fit into the considered setting. Section 4 contains experiments that illustrate the benefits of our approach compared to baselines. This Section also proposes a very efficient heuristic initialization and numerical details allowing to optimize over several indices at each iteration, which is a trick allowing to accelerate even further the algorithm.

## 2 The Shifted SDCA algorithm

The dual objective (4) cannot be written as a composite sum of convex functions as in 1, which is required for stochastic algorithms such as SRVG [18] or SAGA [12]. It is better to use a coordinate-wise approach to optimize this problem, which leads to SDCA [33], in which the starting point has been shifted by . This shift is induced by the relation linking primal and dual variables at the optimum,

(5) |

where is the solution of (1) and of (4). We first present the general algorithm (Algorithm 1), then its proximal alternative (Algorithm 2) and finally how importance sampling leads to better theoretical results.

We assume that we know bounds such that for any , such bounds can be explicitly computed from the data in the particular cases considered in the paper, see Section 3 for more details.

The next theorem provides a linear convergence rate for Algorithm 1 where we assume that each is - smooth (see Definition 1). {theorem} Suppose that we known bounds such that for and assume that all are - smooth and is 1-strongly convex. Then, Algorithm 1 satisfies

(6) |

where

(7) |

The proof of Theorem 2 is given in Section 6.5. It states that in considered setting, SDCA achieves a linear convergence rate for the dual objective. Bounds are provided in Section 3 below for two particular cases: Poisson regression and likelihood Hawkes processes. The existence of the bounds is not sufficient to retrieve the gradient-Lipschitz property. Hence, in order to compare the rate obtained in Theorem 2 with already known linear rates for SDCA under the gradient-Lipschitz assumption (see [32]), we need a stronger assumption that implies smoothness, as stated in the following Proposition. {proposition} Let be a strictly monotone convex function and be its Fenchel conjugate and be the second derivative of . Then, if there is such that

(8) |

for any , is - smooth. This proposition is proved in Section 6.6. If satisfies (8) and if bounds are known, then it is easy to see that is -strongly convex. Now, following carefully the proof in [32] in this setting leads to the convergence rate given in Equation (6) but with

Since for any , Theorem 2 provides a faster convergence rate under a weaker assumption. The comparative gain depends on the values of and but it increases logarithmically with the value of . Table 1 below compares the explicit values of these linear rates on a dataset used in our experiments for Poisson regression.

Convergence rates for the primal objective are not provided since the primal iterate typically belongs to only when it is close enough to the optimum. This would make most of the values of the primal objective undefined and therefore not comparable to .

### 2.1 Proximal algorithm

Algorithm 1 maximizes the dual over one coordinate at Line 4 whose solution might not be explicit and require inner steps to obtain . Whenever can be written as

(9) |

where is a convex, prox capable and possibly non-differentiable function, we use the same technique as Prox-SDCA [33] with a proximal lower bound that leads to

with

see Section 6.2 for details. This leads to a proximal variant described in Algorithm 2 below, which is able to handle various regularization techniques and which has the same convergence guarantees as Algorithm 1 given in Theorem 2. Also, note that assuming that can be written as (9) with a prox-capable function is rather unrestrictive, since one can always add a ridge penalization term in the objective.

### 2.2 Importance sampling

Importance sampling consists in adapting the probabilities of choosing a sample (which is basically done uniformly at random, see Line 3 from Algorithm 1) using the improvement which is expected by sampling it. Consider a distribution on with probabilities such that for any and . The Shifted SDCA and Shifted Prox-SDCA with importance sampling algorithms are simply obtained by modifying the way is sampled in Line 3 of Algorithms 1 and 2: instead of sampling uniformly at random, we sample using such a distribution . The optimal sampling probability is obtained in the same way as [37] and it also leads under our - smoothness assumption to a tighter convergence rate, as stated in Theorem 2.2 below. {theorem} Suppose that we known bounds such that for and assume that all are - smooth and that is 1-strongly convex. Consider defined by (7) and consider the distribution

(10) |

for . Then, Algorithm 1 and 2 where Line 3 is replaced by sampling satisfy

where . The proof is given in Appendix 6.7. This convergence rate is stronger than the previous one from Theorem 2 since . Table 1 below compares the explicit values of these linear rates on a dataset used in our experiments for Poisson regression (facebook dataset). We observe that the smooth rate with importance sampling is orders of magnitude better than the one obtained with the standard theory for SDCA which exploits only the strong convexity of the functions .

strongly convex | strongly convex with | log smooth | log smooth with |

importance sampling | importance sampling | ||

## 3 Applications to Poisson regression and Hawkes processes

In this Section we describe two important models that fit into the setting of the paper. We precisely formulate them as in Equation (2) and give the explicit value of bounds such as , where is the solution to (4).

### 3.1 Linear Poisson regression

Linear Poisson regression is widely used in image reconstruction [15], where the original image is retrieved from photons counts distributed as a Poisson distribution with intensity , that are received while observing the image, where different detectors are represented by a vector . It is also very useful in survival analysis to model additive effects, as opposed to multiplicative effects [7], which can be more accurate for some datasets and for web-marketing [8], where the intensity corresponds to an intensity of clicks on banners in web-marketing. Consider a training dataset with and and assume without loss of generality that for while for where (this simply means that we put first the samples corresponding to a label ). The negative log-likelihood of this model with a penalization function can be written as

where corresponds to the level of penalization, with the constraint that for . This corresponds to Equation (2) with for , which is --smooth functions and

for . The algorithms and results proposed in Section 2 can therefore be applied for this model.

### 3.2 Hawkes processes

Hawkes processes are used to study cross causality that might occur in one or several events series. First, they were introduced to study earthquake propagation, the network across which the aftershocks propagate can be recovered given all tremors timestamps [28]. Then, they have been used in high frequency finance to describe market reactions to different types of orders [3]. In the recent years Hawkes processes have found many new applications including crime prediction [21] or social network information propagation [20]. A Hawkes process [16] is a multivariate point-process: it models timestamps of nodes using a multivariate counting process with a particular auto-regressive structure in its intensity. More precisely, we say that a multivariate counting process where for is a Hawkes process if the intensity of has the following structure:

The are called baselines intensities, and correspond to the exogenous intensity of events from node , and the functions for are called kernels, that quantify the influence of past events from node on the intensity of events from node . The main parametric model for the kernels is the so-called exponential kernel, in which we consider

(11) |

with . In this model is understood as an adjacency matrix, since entry quantifies the impact of the activity of node on the activity of node , while are memory parameters. We stack these parameters into a vector containing the baselines and the self and cross-excitation parameters . Note that in this model the memory parameters are supposed to be given. The associated goodness-of-fit is the negative log-likelihood, which is given by the general theory of point processes (see [10]) as

Let us define the following weights

(12) |

that can be computed efficiently thanks to recurrence formulas (complexity is linear with respect to the number of events of each node). Using Equation (11) on the kernels , we can rewrite the negative log-likelihood as

(13) |

where is the number of events of node , where and where

Now, each can be rewritten in a vectorial form. Let us define

which corresponds to the model weights involved in , then let us define

which corresponds to the vector involved in the linear part of (2) and finally put

which contains all the timestamps data computed in the weights computed in Equation (12). With these notations the negative log-likelihood for node can be written as

(14) |

which shows in particular that the negative log-likelihood can be separated into independent sub-problems with goodness-of-fit which correspond to node (although the timestamps data is shared across all the ). Each subproblem is a particular case of the primal objective given in Equation 2, where all the labels are equal to . As a consequence, we can use the algorithms and results from Section 2 to train penalized multivariate Hawkes processes efficiently.

### 3.3 Closed form solution and bounds on dual variables

In this Section with provide the explicit solution to Line 4 of Algorithm 2 when the objective corresponds to the linear Poisson regression or the Hawkes process goodness-of-fit. In both cases, the dual loss is given by for any (with for the Hawkes processes). This dual loss is indeed - smooth. In Proposition 3.3 below we provide the closed-form solution of the local maximization step corresponding to Line 4 of Algorithm 2.

For Poisson regression and Hawkes processes, Line 4 of Algorithm 2 has a closed form solution, namely

This closed-form expression allows to derive a numerically very efficient training algorithm, as illustrated in Section 4 below. For this specific dual loss, we can provide also upper bounds for all optimal dual variables , as stated in the next Proposition. {proposition} For Poisson regression and Hawkes processes, if and if for all , we have the following upper bounds on the dual variables at the optimum:

for any . The proofs of Propositions 3.3 and 3.3 are provided in Section 6.8. Note that the inner product assumption from Proposition 3.3 is mild: it is always met for the Hawkes process with kernels given by (11) and it it met for Poisson regression whenever one applies for instance a min-max scaling on the features matrix.

## 4 Experiments

To evaluate efficiently Shifted SDCA we have compared it with other optimization algorithms that can handle the primal problem (2) nicely, without the gradient-Lipschitz assumptions. We have discarded the modified proximal gradient method from [35] since most of the time it was diverging while computing the initial step with the Barzilai-Borwein method on the considered examples. We consider the following algorithms.

#### Svrg.

This is a stochastic gradient descent algorithm with variance reduction introduced in [18, 36]. We used a variant introduced in [34], which uses Barzilai-Borwein in order to adapt the step-size, since gradient-Lipschitz constants are unavailable in the considered setting. We consider this version of variance reduction, since alternatives such as SAGA[12] and SAG [31] do not propose variants with Barzilai-Borwein type of step-size selection.

#### L-Bfgs-B.

L-BFGS-B is a a limited-memory quasi-Newton algorithm [26, 27]. It relies on an estimation of the inverse of the Hessian based on gradients differences. This technique allows L-BFGS-B to consider the curvature information leading to faster convergence than other batch first order algorithms such as ISTA and FISTA [4].

#### Newton algorithm.

This is the standard second-order Newton algorithm which computes at each iteration the hessian of the objective to solve a linear system with it. In our experiments, the considered objectives are both -smooth and self-concordant [24]. The self-concordant property bounds the third order derivative by the second order derivative, giving explicit control of the second order Taylor expansion [1]. This ensures supra-linear convergence guarantees and keeps all iterates in the open polytope (3) if the starting point is in it [25]. However, the computational cost of the hessian inversion makes this algorithm scale very poorly with the number of dimensions (the size of the vectors ).

#### Sdca.

This is the Shifted-SDCA algorithm, see Algorithm 2, without importance sampling. Indeed, the bounds given in Proposition 3.3 are not tight enough to improve convergence when used for importance sampling in the practical situations considered in this Section (despite the fact that the rates are theoretically better).

SVRG and L-BFGS-B are often diverging in these experiments in the same way as in the simple example considered in Figure 1. Hence, the problems are tuned to avoid any violation of the open polytope constraint (3), and to output comparable results between algorithms. Namely, to ensure that for any iterate , we scale the vectors so that they contain only non-negative entries, and the iterates of SVRG and L-BFGS-B are projected onto . This highlights two first drawbacks of these algorithms: they cannot deal with a generic feature matrix and their solutions contain only non-negative coefficients. For each run, the simply take where . This simple choice seemed relevant for all the considered problems.

### 4.1 Poisson regression

For Poisson regression we have processed our feature matrices to obtain coefficients between 0 and 1.
Numerical features are transformed with a min-max scaler and categorical features are one hot encoded.
We run our experiments on six datasets found on UCI dataset repository
[19] and Kaggle^{2}

dataset | wine ^{3} |
facebook ^{4} |
vegas ^{5} |
news ^{6} |
property ^{7} |
simulated ^{8} |
---|---|---|---|---|---|---|

# lines | 4898 | 500 | 2215 | 504 | 50099 | 100000 |

# features | 11 | 41 | 102 | 160 | 194 | 100 |

These datasets are used to predict a number of interactions for a social post (news and facebook), the rating of a wine or a hotel (wine and vegas) or the number of hazards occurring in a property (property). The last one comes from simulated data which follows a Poisson regression. In Figure 2 we present the convergence speed of the four algorithms. As our algorithms follow quite different schemes, we measure this speed regarding to the computational time. In all runs, SVRG and L-BFGS-B cannot reach the optimal solution as the problem minimizer contains negative values. This is illustrated in detail in Figure 3 for vegas dataset where it appears that all solvers obtain similar results for the positive values of but only Newton and SDCA algorithms are able to estimate the negatives values of . As expected, the Newton algorithm becomes very slow as the number of features increases. SDCA is the only first order solver that can reach the optimal solution and combines the best of both world, the scalability of a first order solver and the ability to reach solutions with negative entries.

### 4.2 Hawkes processes

If the adjacency matrix is forced to be entrywise positive, then no event type can have an inhibitive effect on another. In Figure 4 we present the aggregated influence of the kernels obtained after training a Hawkes process on a finance dataset exploring market microstructure [2]. While L-BFGS-B (or SVRG) recover only excitation in the adjacency matrix, SDCA also retrieves inhibition that one event type can have on another.

It is expected that when stocks are sold (resp. bought) the price is unlikely to go up (resp. down) but this is retrieved by SDCA only. On simulated data this is even clearer and in Figure 5 we observe the same behavior when the ground truth contains inhibitive effects. Our experiment consists in two simulated Hawkes processes with 10 nodes and sum-exponential kernels with 3 decays. There are only excitation effects - all are positive - in the first case and we allow inhibitive effects in the second. Events are simulated according to these kernels that we try to recover. While it would be standard to compare the performances in terms of log-likelihood obtained on the a test sample, nothing ensures that the problem optimizer lies in the feasible set of the test set. Hence the results are compared by looking at the estimation error (RMSE) of the adjacency matrix across iterations. Figure 5 shows that SDCA always converges faster to its solution and when the adjacency matrix contains inhibitive effects, SDCA obtains a better estimation error than L-BFGS-B.

### 4.3 Heuristic initialization

The default dual initialization in [33] () is not a feasible dual point. For Poisson regression and Hawkes processes, we design, from three properties, a vector that is linearly linked to and then rely on Proposition 4.3 to find a heuristic starting point from .

#### Property 1: link with .

Proposition 4.3 relates exactly to the inverse of the norm of . {proposition} For Poisson regression and Hawkes processes, the value of the dual optimum is linearly linked to the inverse of the norm of . Namely, if there is such that for any , then the solution of the dual problem

satisfies for all . This Proposition is proved in Section 6.11. It suggests to consider .

#### Property 2: link with .

#### Property 3: link with the features matrix.

The inner product is positive and at the optimum, the Karush-Kuhn-Tucker Condition (5) (which links to through ) tells that is likely to be large if is poorly correlated to other features, i.e. if is small. So, the choice

(15) |

takes these three properties into account.

Figure 6 below plots the optimal dual variables from the Poisson regression experiments of Section 4.1 against the the vector from Equation 15. We observe in these experiments a good correlation between the two, but is only a good guess for initialization up to a multiplicative factor that the following proposition aims to find.

For Poisson regression and Hawkes processes and , if we constraint the dual solution to be collinear with a given vector , i.e. for some then the optimal value for is given by

Combined with the previous Properties, we suggest to consider

(16) |

as an initial point, where is defined in Equation (15). This Proposition is proved in Section 6.12. Figure 7 presents the values of given its initial value for and shows that the rescaling has worked properly. We validate this heuristic initialization by showing that it leads to a much faster convergence in Figure 8 below. Indeed, we observe that SDCA initialized with Equation (16) reaches optimal objective much faster than when initialization consists in setting all dual variables to 1.

### 4.4 Using mini batches

At each step , SDCA [32] maximizes the dual objective by picking one index and maximizing the dual objective over the coordinate of the dual vector , and sets

In some cases this maximization has a closed-form solution, such as for Poisson regression, see Proposition 3.3) or least-squares regression where leads to the explicit update

In some other cases, such as logistic regression, this closed form solution cannot be exhibited and one must perform a Newton descent algorithm. Each Newton step consists in computing and , which are one dimensional operations given and . Hence, in a large dimensional setting with sparse features, namely whenever the number of non-zeros in is much smaller than , the main cost of the steps resides mostly in computing and . Since and must also be computed when using a closed-form solution, using Newton steps instead of the closed-form is eventually not much more expensive computationally. So, in order to obtain a better trade-off between Newton steps and inner-products computations, we can consider more than a single index on which we will maximize the dual objective. This is called the mini-batch approach, see Stochastic Dual Newton Ascent (SDNA) [29]. It consists in selecting a set of indices at each iteration of SDCA. The value of becomes in this case

The two extreme cases are , which is the standard SDCA algorithm, and for which we perform a full Newton algorithm. After computing the inner products and for all each iteration will simply performs up to 10 Newton steps in which the bottleneck is to solve a linear system. This allows to better exploit curvature and obtain better convergence guarantees for gradient-Lipschitz losses [29].

We can apply this to Poisson regression and Hawkes processes where . The maximization steps of Line 4 in Algorithm 2 is now performed on a set of coordinates and consists in finding

where we denote by the sub-vector of of size containing the values of all indices in . We initialize the vector to the corresponding values of the coordinates of in and then perform the Newton steps, i.e.

(17) |

The gradient and the hessian can be computed using the following formulas

and

Note that is a concave function hence will be positive semi-definite and the system in Equation (17) can be solved very efficiently with BLAS and LAPACK libraries. Let us explicit computations when . Suppose that and put