A Distributed Second-Order Algorithm You Can Trust

A Distributed Second-Order Algorithm You Can Trust

Abstract

Due to the rapid growth of data and computational resources, distributed optimization has become an active research area in recent years. While first-order methods seem to dominate the field, second-order methods are nevertheless attractive as they potentially require fewer communication rounds to converge. However, there are significant drawbacks that impede their wide adoption, such as the computation and the communication of a large Hessian matrix. In this paper we present a new algorithm for distributed training of generalized linear models that only requires the computation of diagonal blocks of the Hessian matrix on the individual workers. To deal with this approximate information we propose an adaptive approach that - akin to trust-region methods - dynamically adapts the auxiliary model to compensate for modeling errors. We provide theoretical rates of convergence for a wide class of problems including -regularized objectives. We also demonstrate that our approach achieves state-of-the-art results on multiple large benchmark datasets.

\printAffiliationsAndNotice

1 Introduction

The last decade has witnessed a growing number of successful machine learning applications in various fields, along with the availability of larger training datasets. However, the speed at which training datasets grow in size is strongly outpacing the evolution of the computational power of single devices, as well as their memory capacity. Therefore, distributed approaches for training machine learning models have become tremendously important while also being increasingly more accessible to users with the rise of cloud-computing. Scaling up optimization algorithms for training machine learning models in such a setting poses many challenges. One key aspect is communication efficiency; because communication is often more expensive than local computation, the overall speed of distributed algorithms strongly depends on how frequently information is exchanged between workers. In order to develop communication-efficient distributed algorithms, we advocate the use of second-order methods which benefit from faster rates of convergence compared to their first-order (gradient-based) counterparts, and hence require less communication rounds to achieve the same accuracy. However, second-order methods have the significant drawback of requiring the computation and storage – and potentially the communication – of a Hessian matrix. Exact methods are therefore elusive for large datasets and one has to resort to approximate methods. In this paper, we propose a method where every worker uses local Hessian information only (i.e., with respect to the local parameters on that worker), hence it does not require any second-order information to be communicated. Conceptually, this approach relies on approximating the full Hessian matrix with a block-diagonal version. At the same time, to automatically adapt to the model misfit, we use an adaptive approach similar in spirit to trust-region methods Conn et al. (2000).

Problem Setup & Distributed Setting.

We address the problem of training generalized linear models which are ubiquitous in machine learning, including e.g. logistic regression, support vector machines as well as sparse linear models such as lasso and elastic net. Formally, we address convex optimization problems with an objective of the form

(1)

where we assume to be smooth and convex, and to be convex functions. is a given data matrix and the parameter vector to be learned from data.

We assume that every worker only has access to its own local part of the data, which corresponds to a subset of the columns of the matrix . In machine learning, these columns typically correspond to a subset of the features or data examples, depending on the application. For example, in the case where (1) corresponds to the objective of a regularized generalized linear model – i.e., where is a data dependent loss and a regularization term – the columns of  correspond to features. In another scenario where (1) corresponds to the dual representation of the respective problem, such as typically chosen for SVM models, the columns of correspond to data examples.

Block-separable model.

In such a distributed setting we suggest optimizing a block-separable auxiliary model which can be split over workers. This auxiliary model is then updated in each round, upon receiving a summary of the updates from all workers. A significant advantage of such a model is that the workload of a single round can be parallelized across the individual workers, where each worker computes an update for its own model parameters by solving a local optimizaition task. Then, to synchronize the work, each worker communicates this update to the master node which aggregates all the updates, applies them to the global model and shares this information with all the workers. One common problem faced with this type of distributed approach is to evaluate whether the local models can be trusted in order to update the global model. This is usually addressed by the selection of an appropriate step-size or by relying on a line-search approach. However, the latter uses a fixed model and typically requires multiple model evaluations which can therefore be computationally expensive. In this paper, we instead leverage ideas from trust-region methods Conn et al. (2000), where we dynamically adapt the model based on how much we trust the approximate second-order information.

Contributions.

We propose a new distributed Newton’s method, built on an adaptive block-separable approximation of the objective function, and allowing the use of arbitrary solvers to solve the local subproblems approximately. Two characteristics differentiate our approach from existing work. First, unlike previous methods that rely on fixed step-size schedules or line-search strategies, our algorithm evaluates the fit of the auxiliary model using a trust-region approach. This yields an efficient method with global convergence guarantees for convex functions, while providing full adaptivity to the quality of the second-order model. Second, our method, to the best of our knowledge, is the first to give convergence guarantees for a distributed second-order method applied to problems with general regularizers (not necessarily strongly convex). This includes -regularized objectives such as Lasso and sparse logistic regression as very important application cases, which were not covered by earlier methods such as Shamir et al. (2014); Zhang & Lin (2015); Wang et al. (2017); Lee & Chang (2017).

2 Method Description

We present an iterative descent algorithm that minimizes the objective introduced in (1). At each step, we optimize an auxiliary block-separable model that acts as a surrogate for the objective . This auxiliary model is adaptive and changes depending on its approximation quality.

2.1 Block-Separable Model

Let us, in every iteration of our algorithm, consider the following auxiliary model replacing (1):

(2)

where is a second-order approximation of the data-dependent term in (1), i.e.,

(3)

The parameter is introduced to control the approximation quality of the auxiliary model; its role will be detailed in Section 2.2.

Let us consider (3) for the case where is chosen to be the Hessian matrix . Then, the auxiliary model (2) with corresponds to a classical second-order approximation of the function . However, this choice of is not feasible in a distributed setting where the data is partitioned among the workers, since the computation of the Hessian matrix requires access to the entire data matrix.

Partitioning.

In particular, we assume each worker has access to a subset of the columns of . In our setting, are disjoint index sets such that and denotes the size of partition . Hence, each machine stores in its memory the submatrix corresponding to its partition .

Given such a partitioning, we suggest choosing to be a block diagonal approximation to the Hessian matrix  aligned with the partitioning of the model parameters, such that

(4)

We use the notation to denote the vector with only non-zero coordinates for . As a consequence of (4) the model presented in (2) splits over the partitions, i.e.,

(5)

where each subproblem only requires access to the local data indexed by , the respective coordinates of the model , as well as :

(6)

Hence, in a distributed setting, each worker is assigned the subproblem corresponding to its partition. These individual subproblems can be optimized independently and in parallel on the different workers. We note that this requires access to the shared information on every node; we will detail in Section 3 how this can be efficiently achieved in a distributed setting. A significant benefit of this model is that it is based on local second-order information and does not require sending gradients and Hessian matrices to the master node, which would be a significant cost in terms of communication.

2.2 Approximation Quality of the Model

The role of the parameter introduced in (2) is to account for the loss of information that arises by enforcing the approximate Hessian matrix of to have a block diagonal structure. The better the approximation, the closer to the optimal parameter is. If the Hessian approximation is unreliable, then the model should be adapted accordingly by changing the value of . An alternative model to (2) would be to include a damping factor to the second-order term, i.e., use where . This type of model is usually employed in trust-region methods Conn et al. (2000) where , and is chosen to ensure strong-convexity. The use of might therefore not be necessary for models that are already (strongly)-convex. We conducted a set of experiments to determine whether this alternative model would achieve better empirical performance and we found little difference between the two models. We will therefore report results for our suggested model with in the experimental section.

Adaptive Choice of .

We have established that the parameter has a central role for the convergence and the practical performance of our method, and we therefore need an efficient way to choose and update this parameter in an adaptive manner. Here we suggest updating at each iteration of the algorithm using an update rule inspired by trust-region methods Cartis et al. (2011a), where acts as the reciprocal of the trust-region radius. Further details are provided in Section 2.3.

Figure 1: Four-stage algorithmic procedure of ADN. Every worker only has access to its local partition of the data matrix and the respective block of the Hessian matrix. Arrows indicate the (synchronous) communication per round.

2.3 Algorithm Procedure

The pseudo-code of the proposed approach, denoted as Adaptive Distributed Newton method (ADN), is summarized in Algorithm 1 and the four-stage iterative procedure is illustrated in Figure 1. We focus on a master-worker setting in this paper, but our algorithm could similarly be applied in a non-centralized fashion. Specifically, in every round, each worker works on its local subproblem (6) to find an update  to its local parameters of the model (stage 1). Then, it communicates this update to the master node (stage 2) which, aggregates the updates, and decides a new for the next iteration based on the misfit of the current model (stage 3). Finally, the master node broadcasts the new model together with to every worker (stage 4) for the next round. Note that in Algorithm 1 we have not explicitly stated the communication of and two scalars that are necessary for evaluating the function values distributedly; we will elaborate more on this in Section 3.

Local Solver.

The computation of the model update  on every worker (stage 2 of our algorithm) can be done using any arbitrary solver, depending on user preference or the available hardware resources. As in Smith et al. (2018), the amount of computation time spent in the local solver is a tunable hyperparameter. This allows the algorithm to be optimally adjusted according to the trade-off between communication and computation cost of a given system. To reflect this flexibility in our theory we will assume the local subproblems (6) are not necessarily optimized exactly but -approximately, i.e., the local updates are such that:

(7)

where .1

As previously mentioned, one of the key steps in the adaptive approach presented in Algorithm 1 is the strategy for adapting the model over iterations. This is done by adjusting  in every iteration resulting in a schedule described by the sequence . In particular, after every iteration, we adjust based on the agreement between the model function (2) and the objective (1) for the current iterate. This is measured by the variable defined in (8) in Algorithm 1. If  is close to there is a good agreement between the model and the function and we retain our current model. On the other hand, if the model over-estimates the objective, we decrease for the next iteration, which can be thought of as adjusting the trust in the current approximation of the Hessian. On the contrary, if our model under-estimates the objective we increase . In addition, we only apply updates that satisfy and hence provide sufficient function decrease. If this is not fulfilled, the step is rejected and a new update is computed in the next iteration, based on the adjusted model. In order to adequately deal with all these cases that influence , we introduce two constants and that control how to update  based on the value of  (see (10) in Algorithm 1). We will discuss the choice of these constants in the experiment section.

1:  Input: (e.g.,  ) and ., and
2:  for  do
3:     for  do
4:        Obtain by minimizing -approximately
5:     end for
6:     Aggregate updates
7:     Compute (distributed over workers)
8:     Compute (distributed over workers)
9:     Evaluate
(8)
10:     Set
(9)
11:     Set
(10)
12:  end for
Algorithm 1 Adaptive Distributed Newton Method (ADN)

3 Implementation

In order to implement Algorithm 1 efficiently in a distributed environment, two key aspects need to be considered.

3.1 Shared Information

We have seen in Section 2.1 that every worker needs access to in order to evaluate the gradient for solving the local subproblem. To avoid the evaluation of in every round we suggest sharing and updating the vector throughout the algorithm – thus, the term shared vector. Hence, if the model parameters are updated locally, the respective change is shared between workers, whereas the local model parameters are kept local on every worker. A similar approach to achieve communication-efficiency is suggested in Smith et al. (2018). They also emphasize that the vector to be communicated is -dimensional which can be preferable compared to the -dimensional model vector , depending on the dimensionality of the problem. This shared vector modification is a minor change of step 6 in Algorithm 1, where is aggregated and shared instead of .

3.2 Communication-Efficient Function Evaluation

Let us detail how in Step 9 of Algorithm 1 can be evaluated efficiently without central access to the model . We therefore consider the individual terms in (8) separately: The cost is known from the previous iteration and can be stored in memory. The cost at the new iterate is composed of two terms, where the first term can be computed on the master locally as and the second term needs to be computed in a distributed fashion. Every node computes based on its local model parameters and sends the resulting value to the master node, which adds the overall sum to the first term, completing the evaluation of the new objective value. Similarly, the model cost is computed distributedly by every node independently evaluating and then sharing the result. Note that this step can be computationally expensive, since it requires one pass through the local data on every node; the communication cost of the two scalar values is negligible.

4 Convergence Analysis

We now establish the convergence of Algorithm 1 for the general class of functions fitting (1).

Theorem 1 (non-strongly convex ).

Let be -smooth and be convex functions. Assume the sequence is bounded by .2 Then, Algorithm 1 reaches a suboptimality within a total number of

iterations, where is a constant defined as where are such that and , and is the initial suboptimality.

For the special case where are strongly-convex, Algorithm 1 achieves a faster rate of convergence as described in the following theorem.

Theorem 2 (strongly-convex ).

Let be -smooth and -strongly convex. Assume the sequence is bounded by . Then, Algorithm 1 reaches a suboptimality within a total number of

iterations, where is a constant defined as with and measures the initial suboptimality.

Note that for strongly-convex functions , similar global rates of convergence to the one derived in Theorem 2 are obtained by existing distributed second-order methods such as Lee & Chang (2017); Wang et al. (2017). However, we are not aware of any result similar to Theorem 1 in the more general case where are non-strongly convex functions.

Proof Sketch

We summarize the main steps in the proof of Theorem 1 and 2, a detailed derivation is provided in the Appendix.

Step 1. Recall that the model with block diagonal Hessian approximation, described in Section 2.1, acts as a surrogate to minimize the function introduced in (1). The first step is therefore to establish a bound on the decrease of the auxiliary model for every step of the algorithm, given that each local subproblem is solved -approximately. This bound on the model decrease , stated in Lemma 4, is established using a primal-dual perspective on the problem, similar to Shalev-Shwartz & Zhang (2013).

{restatable}

[]lemmalemmathree Assume is -smooth and are -strongly convex with . Then, the per-step model decrease of Algorithm 1 can be lower bounded as:

where denotes the duality gap, and

with 3.

Step 2. For iterations that are successful (i.e., they provide sufficient function decrease as measured by in step 10 of Algorithm 1), the construction of Algorithm 1 allows us to relate the model decrease from Lemma 4 to the function decrease through the parameter . This yields a lower bound on the function decrease for every successful update as provided in Lemma 4 below.

{restatable}

[]lemmalemmafour The function decrease of Algorithm 1 for a successful update can be bounded as:

where and is defined as in Lemma 4.

Step 3. At this stage, we have shown that each successful iteration decreases the function value, therefore making progress towards the optimum. However, unsuccessful iterations (for which ) do not decrease the objective and overall convergence to an optimum can only occur if the number of these iterations is limited. The next step is therefore to bound the number of unsuccessful iterations. This is accomplished by showing that the construction of the sequence is such that the number of successive unsuccessful iterations is bounded and, hence, increasing will eventually yield a successful iteration that will allow us to decrease the objective function. This results in a bound on the number of successful and unsuccessful iterations derived in the Appendix. Finally, the rate of convergence in Theorem 1 and Theorem 2 are obtained by combining the bound on the number of steps with the function decrease for each successful step.

Remark. Note that the update scheme (10) in Algorithm 1 is one of many that satisfy the conditions required for proving convergence. For further details, we refer the reader to the literature on trust-region methods Conn et al. (2000).

5 Related Work

First-order Methods. Most first-order stochastic methods require frequent communication which comes with high costs in distributed settings, thus they are often prefered in multi-core settings. This is for example the case for the popular Hogwild! algorithm Niu et al. (2011) that relies on asynchronous SGD updates in a lock-free setting and requires communication after each optimization step. Alternatives include variance-reduced methods such as Lee et al. (2015) and coordinate descent methods such as Richtárik & Takáč (2016), however, they suffer similar communication bottlenecks.

Trust-region Methods.  These methods use a surrogate model to approximate the objective within a region around the current iterate. The size of the trust region is expanded or contracted according to the fitness of the surrogate model to the true objective. For efficiency reasons, the surrogate model is often a quadratic model Conn et al. (2000); Karimireddy et al. (2018), although cubic models can also be used Nesterov & Polyak (2006). Though trust-region methods have been extensively used in a single-machine setting, to the best of our knowledge we are the first to apply a trust-region-like approach in a distributed setting.

Line-search vs Trust-region. Line-search techniques are a popular way to guarantee convergence and they have recently been explored in distributed settings, e.g., Hsieh et al. (2016); Lee & Chang (2017); Trofimov & Genkin (2017); Mahajan et al. (2017); Lee et al. (2018). Our trust-region approach has clear advantages compared to line-search methods: i) a line-search method assumes a fixed auxiliary model –which may be an arbitrarily bad approximation of the true objective– that is used to find an acceptable step size. In contrast, our approach adaptively tunes the auxiliary model to ensure that it is a good fit to the true objective. ii) in general, a line-search method requires multiple objective value evaluations in order to test different step sizes, while our approach only needs one objective value evaluation to calculate . The advantages of our method are verified empirically in Section 6.

Approximate Newton-type Methods. For distributed -regularized problems Andrew & Gao (2007) proposed a quasi-newton method without convergence guarantees. Most of the literature on Newton-type methods are otherwise designed to optimize strongly-convex objectives. DANE Shamir et al. (2014) is a distributed approximate Newton-method with a linear rate of convergence for quadratic functions. AIDE Reddi et al. (2016) is an accelerated version using the Catalyst scheme. Another similar approach is DiSCO Zhang & Lin (2015) which consists of an inexact damped Newton method using conjugate gradient steps, achieving a linear rate of convergence for self-concordant functions. Finally, GIANT Wang et al. (2017) relies on conjugate gradient steps and achieves a local linear-quadratic convergence rate but does not provide a global rate of convergence. It was shown empirically to outperform DANE, AIDE and DiSCO. Note that the convergence results of these approaches require each subproblem to be solved with high accuracy, which is often prohibitive for large-scale datasets. Some approaches suggest using a block-diagonal Hessian approximation such as Hsieh et al. (2016); Lee & Chang (2017); Lee & Wright (2018) but they all rely on a line-search approach which is shown to be inferior to our adaptive approach in the experimental section. While both our approach and Lee & Chang (2017) require iterations to reach accuracy for a strongly-convex , we further provide a rate of convergence for the more general case where is non-strongly convex.

Distributed Primal-Dual Methods. Approaches such as (Yang, 2013; Jaggi et al., 2014; Zhang & Lin, 2015; Zheng et al., 2017; Wang et al., 2017) are restricted to strongly-convex regularizers, and typically work on the dual formulation of the objective. CoCoA (Smith et al., 2018) provides an extension to a wider class of regularizers, including , as of interest here. Although it allows for the use of arbitrary solvers on each worker to regulate the amount of communication, this approach is inherently based on a first-order model of the objective and does not use second-order information.

In an earlier work by Gargiani (2017) a modification of CoCoA was discussed which incorporates local second-order information for the general class of problems (1). We here extend this approach to be adaptive to the quality of the local surrogate model in a trust region sense, in contrast to using fixed Hessian information Hsieh et al. (2016); Gargiani (2017); Lee & Chang (2017); Lee & Wright (2018).

6 Experimental Results

We devote the first part of this section to analysing the properties of our adaptive scheme. In the second part we evaluate its performance for training a logistic regression model regularized with and regularization. We compare ADN to state-of-the-art distributed solvers on four large-scale datasets (see Table 1). All algorithms presented in this section are implemented in C++, they are optimized for sparse data structures and use MPI to handle communication between workers. If not stated otherwise, we use workers.

# examples # features sparsity
url 2’396’130 3’230’442 3.58 E-05
webspam 262’938 680’715 2.24 E-04
kdda 8’407’751 19’306’083 1.80 E-06
criteo 45’840’617 1’000’000 1.95 E-06
Table 1: Datasets used for the experiments.
Figure 2: Robustness to initialization: Training the dual logistic regression model on a subsample (1 million examples) of the criteo dataset for different with , and .

6.1 Algorithm Properties

Initialization of .

Given the wide dissemination of machine learning models to diverse fields, it is becoming increasingly important to develop algorithms that can be deployed without requiring expert knowledge to choose parameters. In this context we first check the sensitivity of our algorithm to the choice of . The results shown in Figure 2 demonstrate that our adaptive scheme dynamically finds an appropriate value of , independently of the initialization.

Parameter-Free Update Strategy.

In addition to there are three more parameters in Algorithm 1 – namely , and – that determine how to update . The most natural choice for is a small positive value, as we do not want to discard updates that would yield a function decrease; we therefore choose . The convergence of Algorithm 1 is guaranteed for any choice of , and we found empirically that the performance is not very sensitive to the choice of these parameters and the optimal values are robust across different datasets (e.g., is generally a good choice). However, to completely eliminate these parameters from the algorithm we suggest the following practical parameter-free update schedule:

This scheme is not only parameter-free, but it also adapts  proportionally to the misfit of the model. The evaluation of this scaling factor does not add any additional computation to the evaluation of . Note that for this scheme to meet the required conditions of convergence presented in Section 4, we need to ensure that the sequence of is bounded, which can easily be done by defining an arbitrary maximum value although we empirically found that this was not necessary. Because of this appealing property of not requiring any tuning we will use this strategy for the following experiments.

(a) K=32
(b) K=16
(c) K=8
(d) K=4
Figure 3: Comparison of using an adaptive approach for vs. using a fixed safe value for for different numbers of workers (). Training logistic regression on a subsample (10 million examples) of the criteo dataset.
(a) url
(b) webspam
(c) kdda
(d) criteo
Figure 4: Performance comparison of primal solver for -regularized Logistic Regression.
Gain of Adaptive Strategy.

In this section we investigate the benefits of using an adaptive as opposed to a static one. We focus on a dual -regularized logistic regression model where is a quadratic function and thus, its Hessian corresponds to a scaled identity matrix. This allows us to study the effect of adaptivity in isolation. It also allows us to compare to a reference model with which comes with convergence guarantees, see Smith et al. (2018). In Figure 3 we compare the two approaches and observe that with an increasing number of workers, the gains provided by the adaptive approach increase. This comes from the fact that the more workers we have, the less accurate the block diagonal approximation in the auxiliary model is and thus it is increasingly difficult to establish a safe fixed value for that covers any partitioning of the data in an ad hoc fashion. Note that the adaptive strategy does not only improve over the safe fixed value of as shown in Figure 3 but it also enables convergence for objectives to be guaranteed where no tight practical bound is known.

6.2 Performance for Logistic Regression

We now analyse the performance of ADN for training a Logistic Regression model on multiple large-scale datasets and compare it to different state-of-the-art methods. First, we will consider regularization, which results in a strongly-convex objective function. This enables the application of a broad range of existing methods. In the second part of this section we focus on regularization, where – to the best of our knowledge – the only existing baselines that come with convergence guarantees are CoCoA Smith et al. (2018) and slower mini-batch proximal SGD.

Baselines.

We compare our approach against GIANT as a representative scheme for the class of approximate Newton methods. This approach was shown in Wang et al. (2017) to achieve competitive performance to other similar algorithms such as DANE or DiSCO. The main difference between these methods and ours is that they build updates based on a local approximation of the full Hessian matrix, whereas we work with exact blocks of the full Hessian matrix. In order to establish a fair comparison, we re-implemented GIANT using MPI while following the open source implementation provided by the authors4. We use conjugate gradient descent as a local solver and implemented the suggested backtracking line-search approach.

Our second baseline is the approach presented in Lee & Chang (2017) which is similar to ours as it builds on the same block diagonal approximation of the Hessian matrix. However, it uses a fixed model and then relies on a backtracking line search approach to guarantee convergence. We will refer to this scheme as LS in our experiments.

The third baseline is CoCoA which approximates the Hessian by a scaled identity matrix using the smoothness property of . Their quadratic model performs well if is indeed a quadratic function such as the least squares loss or the dual of the regularizer. However, we will see that this is not a good model for the logistic loss function.

Figure 5: Performance comparison for -regularized logistic regression on url dataset (left) and criteo dataset (right) for solving the primal problem (top) and the dual problem (bottom).
Regularization.

We consider the -regularized logistic regression problem on the datasets introduced in Table 1. We compare CoCoA (applied to the L1 primal problem) and LS to ADN in Figure 4. In general, we see significant gains from ADN over CoCoA which can be attributed to CoCoA using a quadratic approximation to the logistic function which is not a good fit. The performance of LS is similar or slightly worse than our approach, depending on the dataset. However, as shown in Figure 4(c), it can be unstable since the line-search approach used in Lee & Chang (2017) does not come with any theoretical guarantees for functions that are not strongly-convex.

Regularization.

For -regularized logistic regression, CoCoA, ADN and LS use a dual solver. The results presented in Figure 5 show that CoCoA is competitive in this case since it uses the same block diagonal approximation of the Hessian matrix and benefits from cheap iterations as no function evaluations are needed. However, we can see that using an adaptive strategy nevertheless pays off and we can achieve a gain over CoCoA. For very high accuracy solutions (<), a solver that uses the full Hessian should be preferred if possible.

7 Conclusion

We have presented a novel distributed second-order algorithm that optimizes an auxiliary model with a block-diagonal Hessian matrix. The separable structure of this model makes its optimization easily parallelizable. Each worker optimizes its own local model and sends a minimal amount of information to the master node. Our framework therefore avoids the computation and communication of an expensive Hessian matrix. In order to adjust for the approximation error of the model, we proposed using an adaptive scheme that resembles trust-region methods. This allows us to derive global guarantees of convergence for convex functions. Specializing our approach to strongly-convex functions recovers convergence results derived by existing distributed second-order methods. From the practical side, we have proposed a parameter-free version of our algorithm, discussed how to develop an efficient implementation and demonstrated significant speed-ups over state-of-the-art baselines on several large-scale datasets.

Appendix

Appendix A Analysis

In order to prove convergence of Algorithm 1 we proceed as follows:

  1. For the auxilary model with block diagonal as described in Section 2.1, we lower bound the decrease in the model achieved in each step of Algorithm 1. This yields the lower bound on provided in Lemma 4.

  2. For iterations that are successful (i.e., they achieve and thus successfully decrease the objective function, see Definition 1), the construction of Algorithm 1 allows us to relate the model decrease to the function decrease through the constant . This lets us establish a convergence rate in terms of the number of successful iterations, which is shown in Lemma 3, 4 for non-strongly convex and strongly convex , respectively.

  3. Finally, in order to establish overall convergence of Algorithm 1 we need to bound the number of unsuccessful iterations (i.e., iterations for which where no update is applied to the model parameters). This is accomplished by showing that the construction of the sequence is such that the number of successive unsuccessful iterations is limited and Algorithm 1 will therefore eventually yields a successful iteration which will allow us to decrease the objective function. In details, this is accomplished as follows:

    • show that Algorithm 1 finds a successful step as soon as the penalty parameter exceeds some critical value, thereby the sequence is guaranteed to stay within some bounded positive interval.

    • use the boundedness of to establish an upper bound on the maximum number of unsuccessful iterations and hence the total number of steps to reach a target suboptimality.

    • lastly in Section A.7 we establish the boundedness of the sequence for two general situations.

a.1 Model Decrease

\lemmathree

*

Proof.

Given that the updates optimize the respective local models (defined in (7)) -approximately, we can relate the model decrease provided by to the optimal model decrease as follows:

where and .

From here we proceed by bounding the model decrease for the optimal update, i.e., which, using (2), can be written as

where we omit the superscript for reasons of readability. Since is the minimizer of the following inequality must hold for an arbitrary update direction :

(11)

Hence, let us consider the specific update for some and . We find

(13)

Furthermore, using -strong convexity of with , (i.e., the bound also holds for in which case is convex), we get

which combined with (13) yields

(14)

To further simplify this bound we choose such that where denote the columns of the data matrix , and denotes the convex conjugate of the function . For this particular choice the term “(gap)” in (14) corresponds to the duality gap of the objective at the iterate . To see this, note that the duality gap (see, e.g., Dünner et al. (2016)) for (1) can be written as

(15)

where equality holds for any since for such an optimal the Fenchel-Young inequality holds with equality, i.e.,

(16)

Now combining (14) with (15) and (16) we find

(17)

and Lemma 4 follows. ∎

a.2 Function Decrease

In order to relate the model decrease to the function decrease we use the fact that every update applied to the parameter vector in Algorithm 1 is successful in the following sense.

Definition 1 (successful update).

The update is called successful if the following inequality is satisfied:

(18)

otherwise it is called unsuccessful.

\lemmafour

*

Proof.

Starting from (18), and observing that we have

(19)

Combining this inequality with Lemma 4 concludes the proof. ∎

a.3 Rate of Convergence

Let denote the set of successful iterations as

Further, let us define two disjoint index sets and , which represent the un- and successful steps that have occurred up to some iteration ;

Now, we will use Lemma 4 to establish convergence of Algorithm 1 as a function of the number of successful iterations. Therefore, we will start with convex functions where we show sublinear convergence and then show that for strongly convex functions , this result can be improved to obtain a linear rate of convergence.

Non-strongly Convex .
Lemma 3.

(non-strongly convex ) Let be -smooth and be convex with -bounded support. Assume the sequence is bounded above by . Then, we can bound the suboptimality of Algorithm 1 as

where counts the number of successful updates up to iteration , and with .

Proof.

For non-strongly convex (i.e., ) we know from Lemma 4 that for any for successful update the function decrease at iteration can be lower bounded as

(20)

For our block diagonal hessian approximation

it holds that

(21)

with . Inequality (21) relies on the assumption that has -bounded support: a) by duality between Lipschitzness and Bounded-Support Dünner et al. (2016) of the univariate functions we have since is in the support of and b) by the equivalence between Lipschitzness and bounded subgradient we also have since . Together this yields and the bound (21) follows.

In the following we assume that the sequence is bounded by . We write for the suboptimality at step and use that the duality gap upper-bounds the suboptimality, i.e, . Combining this with (20) yields

(22)

Let being positive constants defined as and , then the above inequality can be written as

(23)

and holds for any . Now let us choose to minimize the RHS of (23) which yields and to have we further constrain to

Now let us consider the two cases separately:

  1. . In this case we choose . Thus, from (23) we get

    (24)
  2. . In this case we choose and hence from (23) we get

    (25)

Note that the inequalities (24) and(25) together with non-negativity of and imply that and thus is a decreasing sequence. Combining the two inequalities (24) and (25) we get the following bound which holds for both cases and thus for every :

(26)

where we used . Thus it holds that

(27)

Deviding both sides by yields

(28)

Applying this bound recursively and plugging in the definition of