A Multi-Batch L-BFGS Method for Machine Learning

# A Multi-Batch L-BFGS Method for Machine Learning

## Abstract

The question of how to parallelize the stochastic gradient descent (SGD) method has received much attention in the literature. In this paper, we focus instead on batch methods that use a sizeable fraction of the training set at each iteration to facilitate parallelism, and that employ second-order information. In order to improve the learning process, we follow a multi-batch approach in which the batch changes at each iteration. This can cause difficulties because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.

## 1Introduction

It is common in machine learning to encounter optimization problems involving millions of parameters and very large datasets. To deal with the computational demands imposed by such applications, high performance implementations of stochastic gradient and batch quasi-Newton methods have been developed [1]. In this paper we study a batch approach based on the L-BFGS method [20] that strives to reach the right balance between efficient learning and productive parallelism.

In supervised learning, one seeks to minimize empirical risk,

where denote the training examples and is the composition of a prediction function (parametrized by ) and a loss function. The training problem consists of finding an optimal choice of the parameters with respect to , i.e., At present, the preferred optimization method is the stochastic gradient descent (SGD) method [23], and its variants [14], which are implemented either in an asynchronous manner (e.g. when using a parameter server in a distributed setting) or following a synchronous mini-batch approach that exploits parallelism in the gradient evaluation [2]. A drawback of the asynchronous approach is that it cannot use large batches, as this would cause updates to become too dense and compromise the stability and scalability of the method [16]. As a result, the algorithm spends more time in communication as compared to computation. On the other hand, using a synchronous mini-batch approach one can achieve a near-linear decrease in the number of SGD iterations as the mini-batch size is increased, up to a certain point after which the increase in computation is not offset by the faster convergence [26].

An alternative to SGD is a batch method, such as L-BFGS, which is able to reach high training accuracy and allows one to perform more computation per node, so as to achieve a better balance with communication costs [27]. Batch methods are, however, not as efficient learning algorithms as SGD in a sequential setting [6]. To benefit from the strength of both methods some high performance systems employ SGD at the start and later switch to a batch method [1].

Multi-Batch Method. In this paper, we follow a different approach consisting of a single method that selects a sizeable subset (batch) of the training data to compute a step, and changes this batch at each iteration to improve the learning abilities of the method. We call this a multi-batch approach to differentiate it from the mini-batch approach used in conjunction with SGD, which employs a very small subset of the training data. When using large batches it is natural to employ a quasi-Newton method, as incorporating second-order information imposes little computational overhead and improves the stability and speed of the method. We focus here on the L-BFGS method, which employs gradient information to update an estimate of the Hessian and computes a step in flops, where is the number of variables. The multi-batch approach can, however, cause difficulties to L-BFGS because this method employs gradient differences to update Hessian approximations. When the gradients used in these differences are based on different data points, the updating procedure can be unstable. Similar difficulties arise in a parallel implementation of the standard L-BFGS method, if some of the computational nodes devoted to the evaluation of the function and gradient are unable to return results on time — as this again amounts to using different data points to evaluate the function and gradient at the beginning and the end of the iteration. The goal of this paper is to show that stable quasi-Newton updating can be achieved in both settings without incurring extra computational cost, or special synchronization. The key is to perform quasi-Newton updating based on the overlap between consecutive batches. The only restriction is that this overlap should not be too small, something that can be achieved in most situations.

Contributions. We describe a novel implementation of the batch L-BFGS method that is robust in the absence of sample consistency; i.e., when different samples are used to evaluate the objective function and its gradient at consecutive iterations. The numerical experiments show that the method proposed in this paper — which we call the multi-batch L-BFGS method — achieves a good balance between computation and communication costs. We also analyze the convergence properties of the new method (using a fixed step length strategy) on both convex and nonconvex problems.

## 2The Multi-Batch Quasi-Newton Method

In a pure batch approach, one applies a gradient based method, such as L-BFGS [20], to the deterministic optimization problem . When the number of training examples is large, it is natural to parallelize the evaluation of and by assigning the computation of the component functions to different processors. If this is done on a distributed platform, it is possible for some of the computational nodes to be slower than the rest. In this case, the contribution of the slow (or unresponsive) computational nodes could be ignored given the stochastic nature of the objective function. This leads, however, to an inconsistency in the objective function and gradient at the beginning and at the end of the iteration, which can be detrimental to quasi-Newton methods . Thus, we seek to find a fault-tolerant variant of the batch L-BFGS method that is capable of dealing with slow or unresponsive computational nodes.

A similar challenge arises in a multi-batch implementation of the L-BFGS method in which the entire training set is not employed at every iteration, but rather, a subset of the data is used to compute the gradient. Specifically, we consider a method in which the dataset is randomly divided into a number of batches — say 10, 50, or 100 — and the minimization is performed with respect to a different batch at every iteration. At the -th iteration, the algorithm chooses a batch , computes

and takes a step along the direction , where is an approximation to . Allowing the sample to change freely at every iteration gives this approach flexibility of implementation and is beneficial to the learning process, as we show in Section 4. (We refer to as the sample of training points, even though only indexes those points.)

The case of unresponsive computational nodes and the multi-batch method are similar. The main difference is that node failures create unpredictable changes to the samples , whereas a multi-batch method has control over sample generation. In either case, the algorithm employs a stochastic approximation to the gradient and can no longer be considered deterministic. We must, however, distinguish our setting from that of the classical SGD method, which employs small mini-batches and noisy gradient approximations. Our algorithm operates with much larger batches so that distributing the function evaluation is beneficial and the compute time of is not overwhelmed by communication costs. This gives rise to gradients with relatively small variance and justifies the use of a second-order method such as L-BFGS.

Robust Quasi-Newton Updating. The difficulties created by the use of a different sample at each iteration can be circumvented if consecutive samples and overlap, so that One can then perform stable quasi-Newton updating by computing gradient differences based on this overlap, i.e., by defining

in the notation given in . The correction pair can then be used in the BFGS update. When the overlap set is not too small, is a useful approximation of the curvature of the objective function along the most recent displacement, and will lead to a productive quasi-Newton step. This observation is based on an important property of Newton-like methods, namely that there is much more freedom in choosing a Hessian approximation than in computing the gradient [7]. Thus, a smaller sample can be employed for updating the inverse Hessian approximation than for computing the batch gradient in the search direction . In summary, by ensuring that unresponsive nodes do not constitute the vast majority of all working nodes in a fault-tolerant parallel implementation, or by exerting a small degree of control over the creation of the samples in the multi-batch method, one can design a robust method that naturally builds upon the fundamental properties of BFGS updating.

We should mention in passing that a commonly used strategy for ensuring stability of quasi-Newton updating in machine learning is to enforce gradient consistency [?], i.e., to use the same sample to compute gradient evaluations at the beginning and the end of the iteration. Another popular remedy is to use the same batch for multiple iterations [19], alleviating the gradient inconsistency problem at the price of slower convergence. In this paper, we assume that achieving such sample consistency is not possible (in the fault-tolerant case) or desirable (in a multi-batch framework), and wish to design a new variant of L-BFGS that imposes minimal restrictions in the sample changes.

### 2.1Specification of the Method

At the -th iteration, the multi-batch BFGS algorithm chooses a set and computes a new iterate

where is the step length, is the batch gradient and is the inverse BFGS Hessian matrix approximation that is updated at every iteration by means of the formula

To compute the correction vectors , we determine the overlap set consisting of the samples that are common at the -th and -st iterations. We define

and compute the correction vectors as in . In this paper we assume that is constant.

In the limited memory version, the matrix is defined at each iteration as the result of applying BFGS updates to a multiple of the identity matrix, using a set of correction pairs kept in storage. The memory parameter is typically in the range 2 to 20. When computing the matrix-vector product in it is not necessary to form that matrix since one can obtain this product via the two-loop recursion [20], using the most recent correction pairs . After the step has been computed, the oldest pair is discarded and the new curvature pair is stored.

A pseudo-code of the proposed method is given below, and depends on several parameters. The parameter denotes the fraction of samples in the dataset used to define the gradient, i.e., . The parameter denotes the length of overlap between consecutive samples, and is defined as a fraction of the number of samples in a given batch , i.e., .

### 2.2Sample Generation

We now discuss how the sample is created at each iteration (Line 8 in Algorithm ?).

Distributed Computing with Faults. Consider a distributed implementation in which slave nodes read the current iterate from the master node, compute a local gradient on a subset of the dataset, and send it back to the master node for aggregation in the calculation . Given a time (computational) budget, it is possible for some nodes to fail to return a result. The schematic in Figure ?a illustrates the gradient calculation across two iterations, and , in the presence of faults. Here , denote the batches of data that each slave node receives (where ), and is the gradient calculation using all nodes that responded within the preallocated time.

Let and be the set of indices of all nodes that returned a gradient at the -th and -st iterations, respectively. Using this notation and , and we define . The simplest implementation in this setting preallocates the data on each compute node, requiring minimal data communication, i.e., only one data transfer. In this case the samples will be independent if node failures occur randomly. On the other hand, if the same set of nodes fail, then sample creation will be biased, which is harmful both in theory and practice. One way to ensure independent sampling is to shuffle and redistribute the data to all nodes after a certain number of iterations.

Multi-batch Sampling. We propose two strategies for the multi-batch setting.

Figure ?b illustrates the sample creation process in the first strategy. The dataset is shuffled and batches are generated by collecting subsets of the training set, in order. Every set (except ) is of the form , where and are the overlapping samples with batches and respectively, and are the samples that are unique to batch . After each pass through the dataset, the samples are reshuffled, and the procedure described above is repeated. In our implementation samples are drawn without replacement, guaranteeing that after every pass (epoch) all samples are used. This strategy has the advantage that it requires no extra computation in the evaluation of and , but the samples are not independent.

The second sampling strategy is simpler and requires less control. At every iteration , a batch is created by randomly selecting elements from . The overlapping set is then formed by randomly selecting elements from (subsampling). This strategy is slightly more expensive since requires extra computation, but if the overlap is small this cost is not significant.

## 3Convergence Analysis

In this section, we analyze the convergence properties of the multi-batch L-BFGS method (Algorithm ?) when applied to the minimization of strongly convex and nonconvex objective functions, using a fixed step length strategy. We assume that the goal is to minimize the empirical risk given in , but note that a similar analysis could be used to study the minimization of the expected risk.

### 3.1Strongly Convex case

Due to the stochastic nature of the multi-batch approach, every iteration of Algorithm ? employs a gradient that contains errors that do not converge to zero. Therefore, by using a fixed step length strategy one cannot establish convergence to the optimal solution , but only convergence to a neighborhood of [18]. Nevertheless, this result is of interest as it reflects the common practice of using a fixed step length and decreasing it only if the desired testing error has not been achieved. It also illustrates the tradeoffs that arise between the size of the batch and the step length.

In our analysis, we make the following assumptions about the objective function and the algorithm.

Assumptions A.

1. is twice continuously differentiable.

2. There exist positive constants and such that for all and all sets .

3. There is a constant such that for all and all sets .

4. The samples are drawn independently and is an unbiased estimator of the true gradient for all , i.e.,

Note that Assumption implies that the entire Hessian also satisfies

for some constants . Assuming that every sub-sampled function is strongly convex is not unreasonable as a regularization term is commonly added in practice when that is not the case.

We begin by showing that the inverse Hessian approximations generated by the multi-batch L-BFGS method have eigenvalues that are uniformly bounded above and away from zero. The proof technique used is an adaptation of that in [8].

Utilizing Lemma ?, we show that the multi-batch L-BFGS method with a constant step length converges to a neighborhood of the optimal solution.

The bound provided by this theorem has two components: (i) a term decaying linearly to zero, and (ii) a term identifying the neighborhood of convergence. Note that a larger step length yields a more favorable constant in the linearly decaying term, at the cost of an increase in the size of the neighborhood of convergence. We will consider again these tradeoffs in Section 4, where we also note that larger batches increase the opportunities for parallelism and improve the limiting accuracy in the solution, but slow down the learning abilities of the algorithm.

One can establish convergence of the multi-batch L-BFGS method to the optimal solution by employing a sequence of step lengths that converge to zero according to the schedule proposed by Robbins and Monro [23]. However, that provides only a sublinear rate of convergence, which is of little interest in our context where large batches are employed and some type of linear convergence is expected. In this light, Theorem ? is more relevant to practice.

### 3.2Nonconvex case

The BFGS method is known to fail on noconvex problems [17]. Even for L-BFGS, which makes only a finite number of updates at each iteration, one cannot guarantee that the Hessian approximations have eigenvalues that are uniformly bounded above and away from zero. To establish convergence of the BFGS method in the nonconvex case cautious updating procedures have been proposed [15]. Here we employ a cautious strategy that is well suited to our particular algorithm; we skip the update, i.e., set , if the curvature condition

is not satisfied, where is a predetermined constant. Using said mechanism we show that the eigenvalues of the Hessian matrix approximations generated by the multi-batch L-BFGS method are bounded above and away from zero (Lemma ?). The analysis presented in this section is based on the following assumptions.

Assumptions B.

1. is twice continuously differentiable.

2. The gradients of are -Lipschitz continuous, and the gradients of are -Lipschitz continuous for all and all sets .

3. The function is bounded below by a scalar .

4. There exist constants and such that for all and all sets .

5. The samples are drawn independently and is an unbiased estimator of the true gradient for all , i.e.,

We can now follow the analysis in [4] to establish the following result about the behavior of the gradient norm for the multi-batch L-BFGS method with a cautious update strategy.

This result bounds the average norm of the gradient of after the first iterations, and shows that the iterates spend increasingly more time in regions where the objective function has a small gradient.

## 4Numerical Results

In this Section, we present numerical results that evaluate the proposed robust multi-batch L-BFGS scheme (Algorithm ?) on logistic regression problems. Figure 2 shows the performance on the webspam dataset1, where we compare it against three methods: (i) multi-batch L-BFGS without enforcing sample consistency (L-BFGS), where gradient differences are computed using different samples, i.e., ; (ii) multi-batch gradient descent (Gradient Descent), which is obtained by setting in Algorithm ?; and, (iii) serial SGD, where at every iteration one sample is used to compute the gradient. We run each method with 10 different random seeds, and, where applicable, report results for different batch () and overlap () sizes. The proposed method is more stable than the standard L-BFGS method; this is especially noticeable when is small. On the other hand, serial SGD achieves similar accuracy as the robust L-BFGS method and at a similar rate (e.g., ), at the cost of communications per epochs versus communications per epoch. Figure 2 also indicates that the robust L-BFGS method is not too sensitive to the size of overlap. Similar behavior was observed on other datasets, in regimes where was not too small; see Appendix B.1. We mention in passing that the L-BFGS step was computed using the a vector-free implementation proposed in [9].

We also explore the performance of the robust multi-batch L-BFGS method in the presence of node failures (faults), and compare it to the multi-batch variant that does not enforce sample consistency (L-BFGS). Figure 3 illustrates the performance of the methods on the webspam dataset, for various probabilities of node failures , and suggests that the robust L-BFGS variant is more stable; see Appendix B.2 for further results.

Lastly, we study the strong and weak scaling properties of the robust L-BFGS method on artificial data (Figure 4). We measure the time needed to compute a gradient (Gradient) and the associated communication (Gradient+C), as well as, the time needed to compute the L-BFGS direction (L-BFGS) and the associated communication (L-BFGS+C), for various batch sizes (). The figure on the left shows strong scaling of multi-batch LBFGS on a dimensional problem with samples. The size of input data is 24GB, and we vary the number of MPI processes, . The time it takes to compute the gradient decreases with , however, for small values of , the communication time exceeds the compute time. The figure on the right shows weak scaling on a problem of similar size, but with varying sparsity. Each sample has non-zero elements, thus for any the size of local problem is roughly GB (for size of data 192GB). We observe almost constant time for the gradient computation while the cost of computing the L-BFGS direction decreases with ; however, if communication is considered, the overall time needed to compute the L-BFGS direction increases slightly. For more details see Appendix C.

## 5Conclusion

This paper describes a novel variant of the L-BFGS method that is robust and efficient in two settings. The first occurs in the presence of node failures in a distributed computing implementation; the second arises when one wishes to employ a different batch at each iteration in order to accelerate learning. The proposed method avoids the pitfalls of using inconsistent gradient differences by performing quasi-Newton updating based on the overlap between consecutive samples. Numerical results show that the method is efficient in practice, and a convergence analysis illustrates its theoretical properties.

#### Acknowledgements

The first two authors were supported by the Office of Naval Research award N000141410313, the Department of Energy grant DE-FG02-87ER25047 and the National Science Foundation grant DMS-1620022. Martin Takáč was supported by National Science Foundation grant CCF-1618717.

## AProofs and Technical Results

### a.1Assumptions

We first restate the Assumptions that we use in the Convergence Analysis section (Section 3). Assumption and are used in the strongly convex and nonconvex cases, respectively.

##### Assumptions A
1. is twice continuously differentiable.

2. There exist positive constants and such that

for all and all sets .

3. There is a constant such that

for all and all batches .

4. The samples are drawn independently and is an unbiased estimator of the true gradient for all , i.e.,

Note that Assumption implies that the entire Hessian also satisfies

for some constants .

##### Assumptions B
1. is twice continuously differentiable.

2. The gradients of are -Lipschitz continuous and the gradients of are -Lipschitz continuous for all and all sets .

3. The function is bounded below by a scalar .

4. There exist constants and such that

for all and all batches .

5. The samples are drawn independently and is an unbiased estimator of the true gradient for all , i.e.,

### a.2Proof of Lemma 3.1 (Strongly Convex Case)

The following Lemma shows that the eigenvalues of the matrices generated by the multi-batch L-BFGS method are bounded above and away from zero if is strongly convex.

If Assumptions A.1-A.2 above hold, there exist constants such that the Hessian approximations generated by the multi-batch L-BFGS method (Algorithm 1) satisfy

Instead of analyzing the inverse Hessian approximation , we study the direct Hessian approximation . In this case, the limited memory quasi-Newton updating formula is given as follows

1. Set and ; where is the memory in L-BFGS.

2. For set and compute

3. Set

The curvature pairs and are updated via the following formulae

A consequence of Assumption is that the eigenvalues of any sub-sampled Hessian ( samples) are bounded above and away from zero. Utilizing this fact, the convexity of component functions and the definitions , we have

On the other hand, strong convexity of the sub-sampled functions, the consequence of Assumption and definitions , provide a lower bound,

Combining the upper and lower bounds and

The above proves that the eigenvalues of the matrices at the start of the L-BFGS update cycles are bounded above and away from zero, for all . We now use a Trace-Determinant argument to show that the eigenvalues of are bounded above and away from zero.

Let and denote the trace and determinant of matrix , respectively, and set . The trace of the matrix can be expressed as,

for some positive constant , where the inequalities above are due to , and the fact that the eigenvalues of the initial L-BFGS matrix are bounded above and away from zero.

Using a result due to Powell [21], the determinant of the matrix generated by the multi-batch L-BFGS method can be expressed as,

for some positive constant , where the above inequalities are due to the fact that the largest eigenvalue of is less than and Assumption .

The trace and determinant inequalities derived above imply that largest eigenvalues of all matrices are bounded above, uniformly, and that the smallest eigenvalues of all matrices are bounded away from zero, uniformly.

### a.3Proof of Theorem 3.2 (Strongly Convex Case)

Utilizing the result from Lemma ?, we now prove a linear convergence result to a neighborhood of the optimal solution, for the case where Assumptions hold.

Suppose that Assumptions A.1-A.4 above hold, and let , where is the minimizer of . Let be the iterates generated by the multi-batch L-BFGS method (Algorithm 1) with

starting from . Then for all ,

We have that

where the first inequality arises due to , and the second inequality arises as a consequence of Lemma ?.

Taking the expectation (over ) of equation

where in the first inequality we make use of Assumption , and the second inequality arises due to Lemma ? and Assumption .

Since is -strongly convex, we can use the following relationship between the norm of the gradient squared, and the distance of the -th iterate from the optimal solution.

Using the above with ,

Let

where the expectation is over all batches and all history starting with . Thus can be expressed as,

from which we deduce that in order to reduce the value with respect to the previous function value, the step length needs to be in the range

Subtracting from either side of yields

Recursive application of yields

and thus,

Finally using the definition of with the above expression yields the desired result,

### a.4Proof of Lemma 3.3 (Nonconvex Case)

The following Lemma shows that the eigenvalues of the matrices generated by the multi-batch L-BFGS method are bounded above and away from zero (nonconvex case).

Suppose that Assumptions B.1-B.2 hold and let be given. Let be the Hessian approximations generated by the multi-batch L-BFGS method (Algorithm ?), with the modification that the Hessian approximation update is performed only when

else . Then, there exist constants such that

Similar to the proof of Lemma ?, we study the direct Hessian approximation .

The curvature pairs and are updated via the following formulae

The skipping mechanism provides both an upper and lower bound on the quantity , which in turn ensures that the initial L-BFGS Hessian approximation is bounded above and away from zero. The lower bound is attained by repeated application of Cauchy’s inequality to condition . We have from that

and therefore

It follows that

and hence

The upper bound is attained by the Lipschitz continuity of sample gradients,

Re-arranging the above expression yields the desired upper bound,

Combining and ,

The above proves that the eigenvalues of the matrices at the start of the L-BFGS update cycles are bounded above and away from zero, for all . The rest of the proof follows the same trace-determinant argument as in the proof of Lemma ?, the only difference being that the last inequality in Equation 21 comes as a result of the cautious update strategy.

### a.5Proof of Theorem 3.4 (Nonconvex Case)

Utilizing the result from Lemma ?, we can now establish the following result about the behavior of the gradient norm for the multi-batch L-BFGS method with a cautious update strategy.

Suppose that Assumptions B.1-B.5 above hold. Let be the iterates generated by the multi-batch L-BFGS method (Algorithm ?) with

where is the starting point. Also, suppose that if

for some , the inverse L-BFGS Hessian approximation is skipped, . Then, for all ,

Starting with ,

where the second inequality holds due to Assumption , and the fourth inequality is obtained by using the upper bound on the step length. Taking an expectation over all batches and all history starting with yields

where . Summing over the first iterations

The left-hand-side of the above inequality is a telescoping sum

Substituting the above expression into and re-arranging terms

Dividing the above equation by completes the proof.

## BExtended Numerical Experiments - Real Datasets

In this Section, we present further numerical results, on the datasets listed in Table 1, in both the multi-batch and fault-tolerant settings. Note, that some of the datasets are too small, and thus, there is no reason to run them on a distributed platform; however, we include them as they are part of the standard benchmarking datasets.

Notation.

Let denote the number of training samples in a given dataset, the dimension of the parameter vector , and the number of MPI processes used. The parameter denotes the fraction of samples in the dataset used to define the gradient, i.e., . The parameter denotes the length of overlap between consecutive samples, and is defined as a fraction of the number of samples in a given batch , i.e., .

We focus on logistic regression classification; the objective function is given by

where denote the training examples and is the regularization parameter.

### b.1Multi-batch L-BFGS Implementation

For the experiments in this section (Figures Figure 7-Figure 10), we run four methods:

• (Robust L-BFGS) robust multi-batch L-BFGS (Algorithm ?),

• (L-BFGS) multi-batch L-BFGS without enforcing sample consistency; gradient differences are computed using different samples, i.e., ,

• (Gradient Descent) multi-batch gradient descent; obtained by setting in Algorithm ?,

• (SGD) serial SGD; at every iteration one sample is used to compute the gradient.

In Figures Figure 7-Figure 10 we show the evolution of for different step lengths , and for various batch () and overlap () sizes. Each Figure (Figure 7-Figure 10) consists of 10 plots that illustrate the performance of the methods with the following parameters:

• Top 3 plots: , and

• Middle 3 plots: , and

• Bottom 4 plots: , and

As is expected for quasi-Newton methods, robust L-BFGS performs best with a step-size , for the most part.