A Distributed Second-Order Algorithm You Can Trust
Abstract
Due to the rapid growth of data and computational resources, distributed optimization has become an active research area in recent years. While first-order methods seem to dominate the field, second-order methods are nevertheless attractive as they potentially require fewer communication rounds to converge. However, there are significant drawbacks that impede their wide adoption, such as the computation and the communication of a large Hessian matrix. In this paper we present a new algorithm for distributed training of generalized linear models that only requires the computation of diagonal blocks of the Hessian matrix on the individual workers. To deal with this approximate information we propose an adaptive approach that, akin to trust-region methods, dynamically adapts the auxiliary model to compensate for modeling errors. We provide theoretical rates of convergence for a wide class of problems including regularized objectives. We also demonstrate that our approach achieves state-of-the-art results on multiple large benchmark datasets.
1 Introduction
The last decade has witnessed a growing number of successful machine learning applications in various fields, along with the availability of larger training datasets. However, the speed at which training datasets grow in size is strongly outpacing the evolution of the computational power of single devices, as well as their memory capacity. Therefore, distributed approaches for training machine learning models have become tremendously important while also being increasingly accessible to users with the rise of cloud computing. Scaling up optimization algorithms for training machine learning models in such a setting poses many challenges. One key aspect is communication efficiency: because communication is often more expensive than local computation, the overall speed of distributed algorithms strongly depends on how frequently information is exchanged between workers. In order to develop communication-efficient distributed algorithms, we advocate the use of second-order methods, which benefit from faster rates of convergence compared to their first-order (gradient-based) counterparts and hence require fewer communication rounds to achieve the same accuracy. However, second-order methods have the significant drawback of requiring the computation and storage (and potentially the communication) of a Hessian matrix. Exact methods are therefore elusive for large datasets and one has to resort to approximate methods. In this paper, we propose a method where every worker uses local Hessian information only (i.e., with respect to the local parameters on that worker), hence it does not require any second-order information to be communicated. Conceptually, this approach relies on approximating the full Hessian matrix with a block-diagonal version. At the same time, to automatically adapt to the model misfit, we use an adaptive approach similar in spirit to trust-region methods Conn et al. (2000).
Problem Setup & Distributed Setting.
We address the problem of training generalized linear models which are ubiquitous in machine learning, including e.g. logistic regression, support vector machines as well as sparse linear models such as lasso and elastic net. Formally, we address convex optimization problems with an objective of the form
(1)  $\min_{\alpha \in \mathbb{R}^n} F(\alpha) := f(A\alpha) + \sum_{i=1}^{n} g_i(\alpha_i)$
where we assume $f$ to be smooth and convex, and the $g_i$ to be convex functions. $A \in \mathbb{R}^{d \times n}$ is a given data matrix and $\alpha \in \mathbb{R}^n$ the parameter vector to be learned from data.
We assume that every worker only has access to its own local part of the data, which corresponds to a subset of the columns of the matrix $A$. In machine learning, these columns typically correspond to a subset of the features or data examples, depending on the application. For example, in the case where (1) corresponds to the objective of a regularized generalized linear model, i.e., where $f$ is a data-dependent loss and $\sum_i g_i$ a regularization term, the columns of $A$ correspond to features. In another scenario where (1) corresponds to the dual representation of the respective problem, such as typically chosen for SVM models, the columns of $A$ correspond to data examples.
Block-separable model.
In such a distributed setting we suggest optimizing a block-separable auxiliary model which can be split over workers. This auxiliary model is then updated in each round, upon receiving a summary of the updates from all workers. A significant advantage of such a model is that the workload of a single round can be parallelized across the individual workers, where each worker computes an update for its own model parameters by solving a local optimization task. Then, to synchronize the work, each worker communicates this update to the master node, which aggregates all the updates, applies them to the global model and shares this information with all the workers. One common problem faced with this type of distributed approach is to evaluate whether the local models can be trusted in order to update the global model. This is usually addressed by the selection of an appropriate step-size or by relying on a line-search approach. However, the latter uses a fixed model and typically requires multiple model evaluations, which can be computationally expensive. In this paper, we instead leverage ideas from trust-region methods Conn et al. (2000), where we dynamically adapt the model based on how much we trust the approximate second-order information.
Contributions.
We propose a new distributed Newton’s method, built on an adaptive block-separable approximation of the objective function, and allowing the use of arbitrary solvers to solve the local subproblems approximately. Two characteristics differentiate our approach from existing work. First, unlike previous methods that rely on fixed step-size schedules or line-search strategies, our algorithm evaluates the fit of the auxiliary model using a trust-region approach. This yields an efficient method with global convergence guarantees for convex functions, while providing full adaptivity to the quality of the second-order model. Second, our method, to the best of our knowledge, is the first to give convergence guarantees for a distributed second-order method applied to problems with general regularizers (not necessarily strongly convex). This includes regularized objectives such as Lasso and sparse logistic regression as very important application cases, which were not covered by earlier methods such as Shamir et al. (2014); Zhang & Lin (2015); Wang et al. (2017); Lee & Chang (2017).
2 Method Description
We present an iterative descent algorithm that minimizes the objective introduced in (1). At each step, we optimize an auxiliary block-separable model that acts as a surrogate for the objective. This auxiliary model is adaptive and changes depending on its approximation quality.
2.1 Block-Separable Model
Let us, in every iteration of our algorithm, consider the following auxiliary model replacing (1):
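(2)  $\mathcal{M}_\sigma(\Delta\alpha; \alpha) := f_\sigma(\Delta\alpha; \alpha) + \sum_{i=1}^{n} g_i(\alpha_i + \Delta\alpha_i)$
where $f_\sigma$ is a second-order approximation of the data-dependent term in (1), i.e.,
(3)  $f_\sigma(\Delta\alpha; \alpha) := f(A\alpha) + \nabla f(A\alpha)^\top A \Delta\alpha + \frac{\sigma}{2} \Delta\alpha^\top \tilde{H} \Delta\alpha$
The parameter $\sigma$ is introduced to control the approximation quality of the auxiliary model; its role will be detailed in Section 2.2.
Let us consider (3) for the case where $\tilde{H}$ is chosen to be the Hessian matrix $A^\top \nabla^2 f(A\alpha) A$. Then, the auxiliary model (2) with $\sigma = 1$ corresponds to a classical second-order approximation of the function $F$. However, this choice of $\tilde{H}$ is not feasible in a distributed setting where the data is partitioned among the workers, since the computation of the Hessian matrix requires access to the entire data matrix.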
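The statement that the full Hessian with a unit scaling parameter gives the classical second-order approximation can be checked numerically. The sketch below is illustrative only: it assumes a quadratic model of the form f(Aα) + ∇f(Aα)ᵀAΔα + (σ/2)ΔαᵀHΔα, picks a least-squares loss for f (so the quadratic model is exact), and uses a diagonal-only matrix as a stand-in for a structured approximation. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 7, 5
A = rng.standard_normal((d, n))      # data matrix (d rows, n model coordinates)
b = rng.standard_normal(d)           # targets of the illustrative least-squares loss
alpha = rng.standard_normal(n)       # current iterate
delta = rng.standard_normal(n)       # candidate update


def f(v):
    """Least-squares loss on v = A @ alpha; its Hessian in v is the identity."""
    return 0.5 * np.sum((v - b) ** 2)


def quadratic_model(delta, H, sigma):
    """f(A alpha) + grad^T A delta + (sigma / 2) * delta^T H delta."""
    v = A @ alpha
    grad_f = v - b                   # gradient of f at v
    return f(v) + grad_f @ (A @ delta) + 0.5 * sigma * (delta @ H @ delta)


H_full = A.T @ A                          # A^T (Hessian of f) A, with Hessian = I
H_diag_only = np.diag(np.diag(H_full))    # structured approximation (assumption)

exact = f(A @ (alpha + delta))
model_full = quadratic_model(delta, H_full, sigma=1.0)
model_approx = quadratic_model(delta, H_diag_only, sigma=1.0)

print("true value          :", exact)
print("full Hessian model  :", model_full)
print("approximate model   :", model_approx)
```

With the full Hessian the model reproduces the true value, while the structured approximation leaves a modeling error; that error is what the scaling parameter is meant to absorb.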
Partitioning.
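In particular, we assume each worker has access to a subset of the columns of $A$. In our setting, $\{\mathcal{P}_k\}_{k=1}^{K}$ are disjoint index sets such that $\bigcup_k \mathcal{P}_k = \{1, \dots, n\}$ and $n_k := |\mathcal{P}_k|$ denotes the size of partition $k$. Hence, each machine stores in its memory the submatrix $A_{[k]} \in \mathbb{R}^{d \times n_k}$ corresponding to its partition $\mathcal{P}_k$.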
Given such a partitioning, we suggest choosing to be a block diagonal approximation to the Hessian matrix aligned with the partitioning of the model parameters, such that
(4) 
We use the notation to denote the vector with only nonzero coordinates for . As a consequence of (4) the model presented in (2) splits over the partitions, i.e.,
(5) 
where each subproblem only requires access to the local data indexed by , the respective coordinates of the model , as well as :
(6) 
Hence, in a distributed setting, each worker is assigned the subproblem corresponding to its partition. These individual subproblems can be optimized independently and in parallel on the different workers. We note that this requires access to the shared information on every node; we will detail in Section 3 how this can be efficiently achieved in a distributed setting. A significant benefit of this model is that it is based on local second-order information and does not require sending gradients and Hessian matrices to the master node, which would be a significant cost in terms of communication.
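The way a block-diagonal Hessian approximation makes the quadratic term split over workers can be illustrated with a few lines of NumPy. This is a minimal sketch under stated assumptions: `f_hess` is a random positive diagonal matrix standing in for the Hessian of the loss at the current shared vector, the partition is an even split of the coordinates, and the function and variable names are hypothetical.

```python
import numpy as np


def block_diagonal_hessian(A, f_hess, partitions):
    """Keep only the diagonal blocks of A^T f_hess A, one block per partition."""
    n = A.shape[1]
    H_tilde = np.zeros((n, n))
    for idx in partitions:
        A_k = A[:, idx]                                   # columns stored on worker k
        H_tilde[np.ix_(idx, idx)] = A_k.T @ f_hess @ A_k  # needs local data only
    return H_tilde


rng = np.random.default_rng(0)
d, n, K = 6, 8, 4
A = rng.standard_normal((d, n))
f_hess = np.diag(rng.uniform(0.1, 1.0, size=d))   # stand-in Hessian of the loss
partitions = np.array_split(np.arange(n), K)      # disjoint index sets

H_tilde = block_diagonal_hessian(A, f_hess, partitions)
delta = rng.standard_normal(n)

# Quadratic term evaluated centrally ...
central = delta @ H_tilde @ delta
# ... and as a sum of terms that each touch one worker's data and coordinates.
per_worker = sum(
    delta[idx] @ (A[:, idx].T @ f_hess @ A[:, idx]) @ delta[idx]
    for idx in partitions
)
print(central, per_worker)
```

The two numbers agree, which is the separability property: each worker can evaluate and minimize its own term without seeing the other workers' columns.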
2.2 Approximation Quality of the Model
The role of the parameter $\sigma$ introduced in (2) is to account for the loss of information that arises by enforcing the approximate Hessian matrix of $f$ to have a block-diagonal structure. The better the approximation, the closer to 1 the optimal parameter $\sigma$ is. If the Hessian approximation is unreliable, then the model should be adapted accordingly by changing the value of $\sigma$. An alternative model to (2) would be to include a damping factor in the second-order term, i.e., use $\tilde{H} + \lambda I$ where $\lambda \geq 0$. This type of model is usually employed in trust-region methods Conn et al. (2000) where $\sigma = 1$, and $\lambda$ is chosen to ensure strong convexity. The use of $\lambda$ might therefore not be necessary for models that are already (strongly) convex. We conducted a set of experiments to determine whether this alternative model would achieve better empirical performance and we found little difference between the two models. We will therefore report results for our suggested model with $\lambda = 0$ in the experimental section.
Adaptive Choice of $\sigma$.
We have established that the parameter $\sigma$ has a central role for the convergence and the practical performance of our method, and we therefore need an efficient way to choose and update this parameter in an adaptive manner. Here we suggest updating $\sigma$ at each iteration of the algorithm using an update rule inspired by trust-region methods Cartis et al. (2011a), where $\sigma$ acts as the reciprocal of the trust-region radius. Further details are provided in Section 2.3.
2.3 Algorithm Procedure
The pseudocode of the proposed approach, denoted as Adaptive Distributed Newton method (ADN), is summarized in Algorithm 1 and the four-stage iterative procedure is illustrated in Figure 1. We focus on a master-worker setting in this paper, but our algorithm could similarly be applied in a non-centralized fashion. Specifically, in every round, each worker works on its local subproblem (6) to find an update to its local parameters of the model (stage 1). Then, it communicates this update to the master node (stage 2), which aggregates the updates and decides on a new $\sigma$ for the next iteration based on the misfit of the current model (stage 3). Finally, the master node broadcasts the new model together with $\sigma$ to every worker (stage 4) for the next round. Note that in Algorithm 1 we have not explicitly stated the communication of the shared vector and of the two scalars that are necessary for evaluating the function values in a distributed fashion; we will elaborate more on this in Section 3.
Local Solver.
The computation of the model update on every worker (stage 1 of our algorithm) can be done using an arbitrary solver, depending on user preference or the available hardware resources. As in Smith et al. (2018), the amount of computation time spent in the local solver is a tunable hyperparameter. This allows the algorithm to be adjusted according to the trade-off between communication and computation cost of a given system. To reflect this flexibility in our theory we will assume the local subproblems (6) are not necessarily optimized exactly but approximately, i.e., the local updates are such that:
(7) 
where .
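To make the idea of an arbitrary, approximately solved local subproblem concrete, here is a minimal sketch of one possible local solver. It is not the solver used in the paper: it assumes an L1 regularizer, a local quadratic model built from a local gradient `grad_k` and a local Hessian block `H_k`, and plain proximal-gradient steps. The argument `num_steps` plays the role of the tunable local computation budget; all names are hypothetical.

```python
import numpy as np


def soft_threshold(x, t):
    """Proximal operator of t * |.|, applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def local_value(delta, alpha_k, grad_k, H_k, sigma, lam):
    """Local quadratic model plus L1 term (terms constant in delta are dropped)."""
    quad = grad_k @ delta + 0.5 * sigma * (delta @ H_k @ delta)
    return quad + lam * np.abs(alpha_k + delta).sum()


def local_solver(alpha_k, grad_k, H_k, sigma, lam, num_steps):
    """Run num_steps proximal-gradient steps on the local model, from delta = 0."""
    step = 1.0 / (sigma * np.linalg.eigvalsh(H_k)[-1])   # 1 / Lipschitz constant
    delta = np.zeros_like(alpha_k)
    for _ in range(num_steps):
        smooth_grad = grad_k + sigma * (H_k @ delta)
        target = alpha_k + delta - step * smooth_grad
        delta = soft_threshold(target, step * lam) - alpha_k
    return delta


rng = np.random.default_rng(2)
B = rng.standard_normal((6, 4))
H_k = B.T @ B                          # positive semidefinite local Hessian block
alpha_k = rng.standard_normal(4)
grad_k = rng.standard_normal(4)

# More local steps give a better (lower) local model value.
values = [
    local_value(local_solver(alpha_k, grad_k, H_k, 1.0, 0.1, s),
                alpha_k, grad_k, H_k, 1.0, 0.1)
    for s in (0, 1, 10, 100)
]
print(values)
```

Stopping after a few steps yields an inexact update that still decreases the local model, which is the kind of approximate solution the analysis allows for.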
As previously mentioned, one of the key steps in the adaptive approach presented in Algorithm 1 is the strategy for adapting the model over iterations. This is done by adjusting $\sigma$ in every iteration, resulting in a schedule described by the sequence $\{\sigma_t\}$. In particular, after every iteration, we adjust $\sigma$ based on the agreement between the model function (2) and the objective (1) for the current iterate. This is measured by the variable $\rho_t$ defined in (8) in Algorithm 1. If $\rho_t$ is close to 1, there is a good agreement between the model and the function and we retain our current model. On the other hand, if the model overestimates the objective, we decrease $\sigma$ for the next iteration, which can be thought of as adjusting the trust in the current approximation of the Hessian. Conversely, if our model underestimates the objective, we increase $\sigma$. In addition, we only apply updates for which $\rho_t$ reaches a minimum acceptance threshold and which hence provide sufficient function decrease. If this is not fulfilled, the step is rejected and a new update is computed in the next iteration, based on the adjusted model. In order to adequately deal with all these cases that influence $\sigma$, we introduce two constants that control how to update $\sigma$ based on the value of $\rho_t$ (see (10) in Algorithm 1). We will discuss the choice of these constants in the experiment section.
(8) 
(9) 
(10) 
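The accept/reject test and the three-way adjustment of the scaling parameter described above can be sketched as follows. This is a hedged illustration rather than Algorithm 1 itself: the agreement measure is taken to be the usual trust-region ratio of actual to predicted decrease, and the names and default values of `accept_at`, `low`, `high` and `factor` are placeholders chosen for the example, not the constants of the paper.

```python
def agreement_ratio(F_old, F_new, model_new):
    """Actual decrease of the objective divided by the decrease the model predicted."""
    predicted = F_old - model_new
    if predicted <= 0.0:
        return float("-inf")        # no predicted decrease: treat as a failed step
    return (F_old - F_new) / predicted


def adapt_sigma(sigma, rho, accept_at=1e-4, low=0.8, high=1.25, factor=2.0):
    """Return (accept_step, next_sigma) for one round.

    All thresholds are illustrative assumptions.
    """
    accept = rho >= accept_at
    if rho > high:                  # model overestimated the objective: trust more
        next_sigma = sigma / factor
    elif rho < low:                 # model underestimated the objective: trust less
        next_sigma = sigma * factor
    else:                           # good agreement: keep the current model
        next_sigma = sigma
    return accept, next_sigma


# Objective went from 10 to 8 while the model predicted a value of 9.
rho = agreement_ratio(10.0, 8.0, 9.0)
print(rho, adapt_sigma(1.0, rho))
```

A rejected step leaves the iterate unchanged but still enlarges the scaling parameter, so the next round works with a more conservative model.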
3 Implementation
In order to implement Algorithm 1 efficiently in a distributed environment, two key aspects need to be considered.
3.1 Shared Information
We have seen in Section 2.1 that every worker needs access to the vector $v := A\alpha$ in order to evaluate the gradient $\nabla f(A\alpha)$ for solving the local subproblem. To avoid the evaluation of $A\alpha$ in every round we suggest sharing and updating the vector $v$ throughout the algorithm; thus the term shared vector. Hence, if the model parameters are updated locally, the respective change in $v$ is shared between workers, whereas the local model parameters are kept local on every worker. A similar approach to achieve communication efficiency is suggested in Smith et al. (2018). They also emphasize that the vector to be communicated is $d$-dimensional, which can be preferable compared to the $n$-dimensional model vector $\alpha$, depending on the dimensionality of the problem. This shared vector modification is a minor change of step 6 in Algorithm 1, where the change in $v$ is aggregated and shared instead of the change in $\alpha$.
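The bookkeeping behind the shared vector can be sketched in a few lines. The example assumes the shared vector is the product of the data matrix and the model, that each worker owns a block of columns, and that the local updates are random placeholders rather than outputs of a real solver; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, K = 5, 12, 3
A = rng.standard_normal((d, n))
alpha = rng.standard_normal(n)
partitions = np.array_split(np.arange(n), K)

v = A @ alpha                                   # shared vector, computed once

# Placeholder local updates, one per worker (a real run would use a local solver).
local_updates = [rng.standard_normal(len(idx)) for idx in partitions]

# Each worker sends a d-dimensional message built from its own columns only.
messages = [A[:, idx] @ upd for idx, upd in zip(partitions, local_updates)]
v_new = v + np.sum(messages, axis=0)            # aggregation on the master

# Reference: apply the updates to the full model and recompute from scratch.
alpha_new = alpha.copy()
for idx, upd in zip(partitions, local_updates):
    alpha_new[idx] += upd

print(np.allclose(v_new, A @ alpha_new))
```

The aggregated vector matches a full recomputation, so no node ever needs the complete model or the complete data matrix to keep the shared vector current.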
3.2 Communication-Efficient Function Evaluation
Let us detail how $\rho_t$ in Step 9 of Algorithm 1 can be evaluated efficiently without central access to the model $\alpha$. We therefore consider the individual terms in (8) separately: The cost at the current iterate is known from the previous iteration and can be stored in memory. The cost at the new iterate is composed of two terms, where the first term, $f$ evaluated at the updated shared vector, can be computed on the master locally, and the second term, the sum of the $g_i$, needs to be computed in a distributed fashion. Every node computes the sum of its own $g_i$ based on its local model parameters and sends the resulting value to the master node, which adds the overall sum to the first term, completing the evaluation of the new objective value. Similarly, the model cost is computed in a distributed fashion by every node independently evaluating its local model (6) and then sharing the result. Note that this step can be computationally expensive, since it requires one pass through the local data on every node; the communication cost of the two scalar values is negligible.
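The split of the objective evaluation between master and workers can be illustrated as follows. The sketch assumes an L1-regularized logistic regression objective with labels in {-1, +1}, rows of the data matrix as examples and columns as features; these choices and all names are assumptions made for the example.

```python
import numpy as np


def f(v, y):
    """Logistic loss evaluated on the shared vector v (master side)."""
    return np.sum(np.logaddexp(0.0, -y * v))


def local_regularizer(alpha_k, lam):
    """Single scalar computed by one worker from its own coordinates."""
    return lam * np.sum(np.abs(alpha_k))


rng = np.random.default_rng(3)
d, n, K, lam = 10, 9, 3, 0.05
A = rng.standard_normal((d, n))
y = rng.choice([-1.0, 1.0], size=d)
alpha = rng.standard_normal(n)
partitions = np.array_split(np.arange(n), K)

v = A @ alpha                                    # shared vector held by the master
scalars = [local_regularizer(alpha[idx], lam) for idx in partitions]  # one per worker
F_distributed = f(v, y) + sum(scalars)

# Reference value computed with central access to the whole model.
F_central = f(A @ alpha, y) + lam * np.sum(np.abs(alpha))
print(F_distributed, F_central)
```

Only one scalar per worker crosses the network for the regularization term; the same pattern applies to the model cost.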
4 Convergence Analysis
Theorem 1 (non-strongly convex $g_i$).
Let be smooth and be convex functions. Assume the sequence is bounded by .
iterations, where is a constant defined as where are such that and , and is the initial suboptimality.
For the special case where the $g_i$ are strongly convex, Algorithm 1 achieves a faster rate of convergence as described in the following theorem.
Theorem 2 (strongly convex $g_i$).
Let be smooth and strongly convex. Assume the sequence is bounded by . Then, Algorithm 1 reaches a suboptimality within a total number of
iterations, where is a constant defined as with and measures the initial suboptimality.
Note that for strongly convex functions $g_i$, global rates of convergence similar to the one derived in Theorem 2 are obtained by existing distributed second-order methods such as Lee & Chang (2017); Wang et al. (2017). However, we are not aware of any result similar to Theorem 1 in the more general case where the $g_i$ are non-strongly convex functions.
Proof Sketch
We summarize the main steps in the proofs of Theorems 1 and 2; a detailed derivation is provided in the Appendix.
Step 1. Recall that the model with block-diagonal Hessian approximation, described in Section 2.1, acts as a surrogate to minimize the function introduced in (1). The first step is therefore to establish a bound on the decrease of the auxiliary model for every step of the algorithm, given that each local subproblem is solved approximately. This bound on the model decrease, stated in Lemma 4, is established using a primal-dual perspective on the problem, similar to Shalev-Shwartz & Zhang (2013).
Lemma (model decrease). Assume $f$ is smooth and the $g_i$ are $\mu$-strongly convex with $\mu \geq 0$. Then, the per-step model decrease of Algorithm 1 can be lower bounded as:
where denotes the duality gap, and
with
Step 2. For iterations that are successful (i.e., they provide sufficient function decrease as measured by $\rho_t$ in step 10 of Algorithm 1), the construction of Algorithm 1 allows us to relate the model decrease from Lemma 4 to the function decrease through the acceptance threshold. This yields a lower bound on the function decrease for every successful update as provided in Lemma 4 below.
Lemma (function decrease). The function decrease of Algorithm 1 for a successful update can be bounded as:
where and is defined as in Lemma 4.
Step 3. At this stage, we have shown that each successful iteration decreases the function value, therefore making progress towards the optimum. However, unsuccessful iterations (for which the update is rejected) do not decrease the objective, and overall convergence to an optimum can only occur if the number of these iterations is limited. The next step is therefore to bound the number of unsuccessful iterations. This is accomplished by showing that the construction of the sequence $\{\sigma_t\}$ is such that the number of successive unsuccessful iterations is bounded and, hence, increasing $\sigma$ will eventually yield a successful iteration that will allow us to decrease the objective function. This results in a bound on the number of successful and unsuccessful iterations derived in the Appendix. Finally, the rates of convergence in Theorem 1 and Theorem 2 are obtained by combining the bound on the number of steps with the function decrease for each successful step.
5 Related Work
First-order Methods. Most first-order stochastic methods require frequent communication, which comes with high costs in distributed settings; they are therefore often preferred in multi-core settings. This is for example the case for the popular Hogwild! algorithm Niu et al. (2011) that relies on asynchronous SGD updates in a lock-free setting and requires communication after each optimization step. Alternatives include variance-reduced methods such as Lee et al. (2015) and coordinate descent methods such as Richtárik & Takáč (2016); however, they suffer from similar communication bottlenecks.
Trust-region Methods. These methods use a surrogate model to approximate the objective within a region around the current iterate. The size of the trust region is expanded or contracted according to the fitness of the surrogate model to the true objective. For efficiency reasons, the surrogate model is often a quadratic model Conn et al. (2000); Karimireddy et al. (2018), although cubic models can also be used Nesterov & Polyak (2006). Though trust-region methods have been extensively used in a single-machine setting, to the best of our knowledge we are the first to apply a trust-region-like approach in a distributed setting.
Line-search vs Trust-region. Line-search techniques are a popular way to guarantee convergence and they have recently been explored in distributed settings, e.g., Hsieh et al. (2016); Lee & Chang (2017); Trofimov & Genkin (2017); Mahajan et al. (2017); Lee et al. (2018). Our trust-region approach has clear advantages compared to line-search methods: i) a line-search method assumes a fixed auxiliary model, which may be an arbitrarily bad approximation of the true objective, that is used to find an acceptable step size. In contrast, our approach adaptively tunes the auxiliary model to ensure that it is a good fit to the true objective. ii) In general, a line-search method requires multiple objective value evaluations in order to test different step sizes, while our approach only needs one objective value evaluation to calculate $\rho_t$. The advantages of our method are verified empirically in Section 6.
Approximate Newton-type Methods. For distributed regularized problems Andrew & Gao (2007) proposed a quasi-Newton method without convergence guarantees. Most of the literature on Newton-type methods is otherwise designed to optimize strongly convex objectives. DANE Shamir et al. (2014) is a distributed approximate Newton method with a linear rate of convergence for quadratic functions. AIDE Reddi et al. (2016) is an accelerated version using the Catalyst scheme. Another similar approach is DiSCO Zhang & Lin (2015), which consists of an inexact damped Newton method using conjugate gradient steps, achieving a linear rate of convergence for self-concordant functions. Finally, GIANT Wang et al. (2017) relies on conjugate gradient steps and achieves a local linear-quadratic convergence rate but does not provide a global rate of convergence. It was shown empirically to outperform DANE, AIDE and DiSCO. Note that the convergence results of these approaches require each subproblem to be solved with high accuracy, which is often prohibitive for large-scale datasets. Some approaches suggest using a block-diagonal Hessian approximation, such as Hsieh et al. (2016); Lee & Chang (2017); Lee & Wright (2018), but they all rely on a line-search approach which is shown to be inferior to our adaptive approach in the experimental section. While both our approach and Lee & Chang (2017) require the same order of iterations to reach a given accuracy for strongly convex $g_i$, we further provide a rate of convergence for the more general case where the $g_i$ are non-strongly convex.
Distributed Primal-Dual Methods. Approaches such as (Yang, 2013; Jaggi et al., 2014; Zhang & Lin, 2015; Zheng et al., 2017; Wang et al., 2017) are restricted to strongly convex regularizers, and typically work on the dual formulation of the objective. CoCoA (Smith et al., 2018) provides an extension to a wider class of regularizers, including L1, as of interest here. Although it allows for the use of arbitrary solvers on each worker to regulate the amount of communication, this approach is inherently based on a first-order model of the objective and does not use second-order information.
In an earlier work by Gargiani (2017) a modification of CoCoA was discussed which incorporates local second-order information for the general class of problems (1). We here extend this approach to be adaptive to the quality of the local surrogate model in a trust-region sense, in contrast to using fixed Hessian information Hsieh et al. (2016); Gargiani (2017); Lee & Chang (2017); Lee & Wright (2018).
6 Experimental Results
We devote the first part of this section to analysing the properties of our adaptive scheme. In the second part we evaluate its performance for training a logistic regression model regularized with L1 and L2 regularization. We compare ADN to state-of-the-art distributed solvers on four large-scale datasets (see Table 1). All algorithms presented in this section are implemented in C++; they are optimized for sparse data structures and use MPI to handle communication between workers. If not stated otherwise, we use workers.
Table 1: Datasets used in the experiments.
dataset    # examples    # features    sparsity
url        2’396’130     3’230’442     3.58e-05
webspam    262’938       680’715       2.24e-04
kdda       8’407’751     19’306’083    1.80e-06
criteo     45’840’617    1’000’000     1.95e-06
6.1 Algorithm Properties
Initialization of $\sigma$.
Given the wide dissemination of machine learning models to diverse fields, it is becoming increasingly important to develop algorithms that can be deployed without requiring expert knowledge to choose parameters. In this context we first check the sensitivity of our algorithm to the choice of the initial value of $\sigma$. The results shown in Figure 2 demonstrate that our adaptive scheme dynamically finds an appropriate value of $\sigma$, independently of the initialization.
Parameter-Free Update Strategy.
In addition to the initial value of $\sigma$ there are three more parameters in Algorithm 1 that determine how to update $\sigma$: the acceptance threshold on $\rho_t$ and the two constants of the update rule (10). The most natural choice for the acceptance threshold is a small positive value, as we do not want to discard updates that would yield a function decrease. The convergence of Algorithm 1 is guaranteed for any admissible choice of the two update constants, and we found empirically that the performance is not very sensitive to the choice of these parameters and that the optimal values are robust across different datasets. However, to completely eliminate these parameters from the algorithm we suggest the following practical parameter-free update schedule:
This scheme is not only parameter-free, but it also adapts proportionally to the misfit of the model. The evaluation of this scaling factor does not add any additional computation to the evaluation of $\rho_t$. Note that for this scheme to meet the required conditions of convergence presented in Section 4, we need to ensure that the sequence of $\sigma_t$ is bounded, which can easily be done by defining an arbitrary maximum value, although we empirically found that this was not necessary. Because of this appealing property of not requiring any tuning we will use this strategy for the following experiments.
Gain of Adaptive Strategy.
In this section we investigate the benefits of using an adaptive $\sigma$ as opposed to a static one. We focus on a dual L2-regularized logistic regression model where $f$ is a quadratic function and thus its Hessian corresponds to a scaled identity matrix. This allows us to study the effect of adaptivity in isolation. It also allows us to compare to a reference model with a safe fixed value of $\sigma$ that comes with convergence guarantees, see Smith et al. (2018). In Figure 3 we compare the two approaches and observe that with an increasing number of workers, the gains provided by the adaptive approach increase. This comes from the fact that the more workers we have, the less accurate the block-diagonal approximation in the auxiliary model is, and thus it is increasingly difficult to establish a safe fixed value for $\sigma$ that covers any partitioning of the data in an ad hoc fashion. Note that the adaptive strategy not only improves over the safe fixed value of $\sigma$ as shown in Figure 3, but it also enables convergence to be guaranteed for objectives where no tight practical bound is known.
6.2 Performance for Logistic Regression
We now analyse the performance of ADN for training a logistic regression model on multiple large-scale datasets and compare it to different state-of-the-art methods. First, we will consider L2 regularization, which results in a strongly convex objective function. This enables the application of a broad range of existing methods. In the second part of this section we focus on L1 regularization, where, to the best of our knowledge, the only existing baselines that come with convergence guarantees are CoCoA Smith et al. (2018) and slower mini-batch proximal SGD.
Baselines.
We compare our approach against GIANT as a representative scheme for the class of approximate Newton methods. This approach was shown in Wang et al. (2017) to achieve performance competitive with other similar algorithms such as DANE or DiSCO. The main difference between these methods and ours is that they build updates based on a local approximation of the full Hessian matrix, whereas we work with exact blocks of the full Hessian matrix. In order to establish a fair comparison, we reimplemented GIANT using MPI while following the open source implementation provided by the authors.
Our second baseline is the approach presented in Lee & Chang (2017) which is similar to ours as it builds on the same block diagonal approximation of the Hessian matrix. However, it uses a fixed model and then relies on a backtracking line search approach to guarantee convergence. We will refer to this scheme as LS in our experiments.
The third baseline is CoCoA, which approximates the Hessian by a scaled identity matrix using the smoothness property of $f$. Their quadratic model performs well if $f$ is indeed a quadratic function such as the least squares loss or the dual of the L2 regularizer. However, we will see that this is not a good model for the logistic loss function.
L1 Regularization.
We consider the L1-regularized logistic regression problem on the datasets introduced in Table 1. We compare CoCoA (applied to the L1 primal problem) and LS to ADN in Figure 4. In general, we see significant gains from ADN over CoCoA, which can be attributed to CoCoA using a quadratic approximation to the logistic function which is not a good fit. The performance of LS is similar to or slightly worse than that of our approach, depending on the dataset. However, as shown in Figure 4(c), it can be unstable since the line-search approach used in Lee & Chang (2017) does not come with any theoretical guarantees for functions that are not strongly convex.
L2 Regularization.
For L2-regularized logistic regression, CoCoA, ADN and LS use a dual solver. The results presented in Figure 5 show that CoCoA is competitive in this case since it uses the same block-diagonal approximation of the Hessian matrix and benefits from cheap iterations as no function evaluations are needed. However, we can see that using an adaptive strategy nevertheless pays off and we can achieve a gain over CoCoA. For very high accuracy solutions, a solver that uses the full Hessian should be preferred if possible.
7 Conclusion
We have presented a novel distributed second-order algorithm that optimizes an auxiliary model with a block-diagonal Hessian matrix. The separable structure of this model makes its optimization easily parallelizable. Each worker optimizes its own local model and sends a minimal amount of information to the master node. Our framework therefore avoids the computation and communication of an expensive Hessian matrix. In order to adjust for the approximation error of the model, we proposed using an adaptive scheme that resembles trust-region methods. This allows us to derive global guarantees of convergence for convex functions. Specializing our approach to strongly convex functions recovers convergence results derived by existing distributed second-order methods. From the practical side, we have proposed a parameter-free version of our algorithm, discussed how to develop an efficient implementation and demonstrated significant speedups over state-of-the-art baselines on several large-scale datasets.
Appendix
Appendix A Analysis
In order to prove convergence of Algorithm 1 we proceed as follows:

For iterations that are successful (i.e., they achieve and thus successfully decrease the objective function, see Definition 1), the construction of Algorithm 1 allows us to relate the model decrease to the function decrease through the constant . This lets us establish a convergence rate in terms of the number of successful iterations, which is shown in Lemma 3, 4 for nonstrongly convex and strongly convex , respectively.

Finally, in order to establish overall convergence of Algorithm 1 we need to bound the number of unsuccessful iterations (i.e., iterations for which where no update is applied to the model parameters). This is accomplished by showing that the construction of the sequence is such that the number of successive unsuccessful iterations is limited and Algorithm 1 will therefore eventually yields a successful iteration which will allow us to decrease the objective function. In details, this is accomplished as follows:

show that Algorithm 1 finds a successful step as soon as the penalty parameter exceeds some critical value, so that the sequence is guaranteed to stay within some bounded positive interval.

use this boundedness to establish an upper bound on the maximum number of unsuccessful iterations and hence on the total number of steps needed to reach a target suboptimality.

lastly, in Section A.7, establish the boundedness of the sequence for two general situations.
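The accept/reject logic outlined in the steps above can be illustrated with a minimal sketch. It is not Algorithm 1 itself. It assumes a toy least-squares objective, a standard trust-region-style acceptance test with a hypothetical threshold `xi`, a hypothetical scaling factor `gamma`, and a penalty `sigma` that is decreased after each successful step; all names are illustrative. For this toy objective, a step is always accepted once `sigma` is at least the number of blocks `K`, so the penalty sequence stays in a bounded positive interval.

```python
import numpy as np

def adaptive_block_descent(A, b, blocks, sigma=0.1, xi=0.5, gamma=2.0,
                           max_iters=200, tol=1e-10):
    """Toy adaptive scheme for f(alpha) = 0.5 * ||A alpha - b||^2 (an assumption)."""
    f = lambda a: 0.5 * np.sum((A @ a - b) ** 2)
    alpha = np.zeros(A.shape[1])
    trace = []  # (was_successful, sigma_used, objective_after_step)
    for _ in range(max_iters):
        grad = A.T @ (A @ alpha - b)
        delta = np.zeros_like(alpha)
        quad = 0.0
        # Each "worker" k minimizes its own local quadratic model, which only
        # needs the diagonal Hessian block A_k^T A_k scaled by sigma.
        for idx in blocks:
            A_k = A[:, idx]
            delta[idx] = -np.linalg.solve(sigma * (A_k.T @ A_k), grad[idx])
            quad += np.sum((A_k @ delta[idx]) ** 2)
        model_decrease = -(grad @ delta + 0.5 * sigma * quad)
        if model_decrease < tol:  # numerically converged
            break
        actual_decrease = f(alpha) - f(alpha + delta)
        # Assumed acceptance test: actual decrease vs. a fraction of model decrease.
        success = actual_decrease >= xi * model_decrease
        sigma_used = sigma
        if success:
            alpha = alpha + delta      # apply the update
            sigma = sigma / gamma      # trust the model more (assumption)
        else:
            sigma = sigma * gamma      # no update; penalize the model more
        trace.append((success, sigma_used, f(alpha)))
    return alpha, trace

rng = np.random.default_rng(0)
K = 2
A = rng.standard_normal((40, 6))
b = rng.standard_normal(40)
blocks = np.array_split(np.arange(6), K)
alpha, trace = adaptive_block_descent(A, b, blocks)
```

In this sketch unsuccessful iterations leave `alpha` unchanged and only increase `sigma`, so the objective recorded in `trace` never increases, and the number of consecutive rejections is limited by how many multiplications by `gamma` are needed to reach `K`.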

A.1 Model Decrease
Proof.
Given that the updates optimize the respective local models (defined in (7)) approximately, we can relate the model decrease provided by to the optimal model decrease as follows:
where and .
From here we proceed by bounding the model decrease for the optimal update, i.e., which, using (2), can be written as
where we omit the superscript for reasons of readability. Since is the minimizer of the following inequality must hold for an arbitrary update direction :
(11) 
Hence, let us consider the specific update for some and . We find
(13)  
Furthermore, using strong convexity of with , (i.e., the bound also holds for in which case is convex), we get
which combined with (13) yields
(14)  
To further simplify this bound we choose such that where denote the columns of the data matrix , and denotes the convex conjugate of the function . For this particular choice the term “(gap)” in (14) corresponds to the duality gap of the objective at the iterate . To see this, note that the duality gap (see, e.g., Dünner et al. (2016)) for (1) can be written as
(15)  
where equality holds for any since for such an optimal the Fenchel-Young inequality holds with equality, i.e.,
(16) 
Now combining (14) with (15) and (16) we find
(17) 
and Lemma 4 follows. ∎
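The Fenchel-Young step used in the proof above can be checked numerically on a simple example. The sketch below assumes a hypothetical quadratic term g(x) = (lam/2) x^2, whose convex conjugate is g*(u) = u^2 / (2 lam); this choice is for illustration only and is not taken from the paper. The inequality g(x) + g*(u) >= x u holds for all pairs, and it is tight exactly when u equals the derivative of g at x.

```python
import numpy as np

lam = 2.0  # hypothetical curvature parameter

def g(x):
    return 0.5 * lam * x ** 2

def g_conj(u):
    # Convex conjugate of g: sup_x (x * u - g(x)) = u^2 / (2 * lam)
    return u ** 2 / (2.0 * lam)

def fenchel_young_gap(x, u):
    # Non-negative for every pair (x, u); zero iff u = g'(x) = lam * x.
    return g(x) + g_conj(u) - x * u

xs = np.linspace(-3.0, 3.0, 13)
us = np.linspace(-3.0, 3.0, 13)
gaps = np.array([[fenchel_young_gap(x, u) for u in us] for x in xs])
tight = np.array([fenchel_young_gap(x, lam * x) for x in xs])
```

For this quadratic the gap equals (lam x - u)^2 / (2 lam), which makes both the non-negativity and the equality case visible.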
A.2 Function Decrease
In order to relate the model decrease to the function decrease we use the fact that every update applied to the parameter vector in Algorithm 1 is successful in the following sense.
Definition 1 (successful update).
The update is called successful if the following inequality is satisfied:
(18) 
otherwise it is called unsuccessful.
A.3 Rate of Convergence
Let denote the set of successful iterations as
Further, let us define two disjoint index sets and , which represent the unsuccessful and successful steps that have occurred up to some iteration :
Now, we will use Lemma 4 to establish convergence of Algorithm 1 as a function of the number of successful iterations. We start with convex functions, for which we show sublinear convergence, and then show that for strongly convex functions , this result can be improved to a linear rate of convergence.
Non-strongly Convex .
Lemma 3.
(non-strongly convex ) Let be smooth and be convex with bounded support. Assume the sequence is bounded above by . Then, we can bound the suboptimality of Algorithm 1 as
where counts the number of successful updates up to iteration , and with .
Proof.
For non-strongly convex (i.e., ) we know from Lemma 4 that for any for successful update the function decrease at iteration can be lower bounded as
(20) 
For our block-diagonal Hessian approximation
it holds that
(21) 
with . Inequality (21) relies on the assumption that has bounded support: a) by the duality between Lipschitzness and bounded support, Dünner et al. (2016), of the univariate functions we have since is in the support of , and b) by the equivalence between Lipschitzness and bounded subgradients we also have since . Together this yields and the bound (21) follows.
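To make the block-diagonal approximation used in this step concrete, the following sketch builds it for a hypothetical quadratic data term whose Hessian is A^T A, with the columns of A partitioned across K workers. It does not reproduce inequality (21). It only illustrates the generic fact that the full quadratic form is bounded by K times the block-diagonal one, which follows from the Cauchy-Schwarz inequality. The matrix sizes and the partition are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 30, 8, 4
A = rng.standard_normal((n, d))           # hypothetical data matrix
blocks = np.array_split(np.arange(d), K)  # column indices owned by each worker

H_full = A.T @ A
H_block = np.zeros_like(H_full)
for idx in blocks:
    # Worker k only needs its own columns A_k to form its diagonal block.
    H_block[np.ix_(idx, idx)] = A[:, idx].T @ A[:, idx]

ratios = []
for _ in range(1000):
    v = rng.standard_normal(d)
    full_form = v @ H_full @ v    # equals ||A v||^2
    block_form = v @ H_block @ v  # equals sum_k ||A_k v_k||^2
    ratios.append(full_form / block_form)
max_ratio = max(ratios)
```

The observed `max_ratio` never exceeds `K`, which is why a sufficiently large scaling of the block-diagonal model yields an upper bound on the true quadratic in this toy setting.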
In the following we assume that the sequence is bounded by . We write for the suboptimality at step and use that the duality gap upper-bounds the suboptimality, i.e., . Combining this with (20) yields
(22) 
Let be positive constants defined as and ; then the above inequality can be written as
(23) 
and holds for any . Now let us choose to minimize the RHS of (23) which yields and to have we further constrain to
Now let us consider the two cases separately:

. In this case we choose . Thus, from (23) we get
(24) 
. In this case we choose and hence from (23) we get
(25)
Note that the inequalities (24) and (25) together with non-negativity of and imply that and thus is a decreasing sequence. Combining the two inequalities (24) and (25) we get the following bound which holds for both cases and thus for every :
(26) 
where we used . Thus it holds that
(27) 
Dividing both sides by yields
(28) 
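The mechanism behind this kind of argument can be seen on a generic recursion. The sketch below assumes a suboptimality sequence satisfying eps_{k+1} = eps_k - c * eps_k^2 with a hypothetical constant c and c * eps_0 < 1; the constants are illustrative and are not those of (26) to (28). Dividing by eps_k * eps_{k+1} gives 1/eps_{k+1} >= 1/eps_k + c, and unrolling yields the sublinear bound eps_k <= 1 / (1/eps_0 + c * k).

```python
def run_recursion(eps0, c, steps):
    """Iterate eps_{k+1} = eps_k - c * eps_k**2 (worst case of the assumed bound)."""
    eps = [eps0]
    for _ in range(steps):
        eps.append(eps[-1] - c * eps[-1] ** 2)
    return eps

eps0, c, steps = 1.0, 0.25, 200  # hypothetical values with c * eps0 < 1
eps = run_recursion(eps0, c, steps)
# Closed-form bound obtained by unrolling 1/eps_{k+1} >= 1/eps_k + c.
bounds = [1.0 / (1.0 / eps0 + c * k) for k in range(steps + 1)]
```

Each entry of `eps` stays below the corresponding entry of `bounds`, so the error decays at least as fast as 1/k in the number of (successful) steps.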
Applying this bound recursively and plugging in the definition of