Asynchronous Stochastic Quasi-Newton MCMC for Non-Convex Optimization
Abstract
Recent studies have illustrated that stochastic gradient Markov Chain Monte Carlo techniques have a strong potential in non-convex optimization, where local and global convergence guarantees can be shown under certain conditions. Building on this recent theory, in this study we develop an asynchronous-parallel stochastic L-BFGS algorithm for non-convex optimization. The proposed algorithm is suitable for both distributed and shared-memory settings. We provide a formal theoretical analysis and show that the proposed method achieves an ergodic convergence rate of O(1/√N) (N being the total number of iterations) and that it can achieve a linear speedup under certain conditions. We perform several experiments on both synthetic and real datasets. The results support our theory and show that the proposed algorithm provides a significant speedup over the recently proposed synchronous distributed L-BFGS algorithm.
1 Introduction
Quasi-Newton (QN) methods are powerful optimization techniques that attain fast convergence rates by incorporating local geometric information through an approximation of the inverse of the Hessian matrix. The L-BFGS algorithm (Nocedal & Wright, 2006) is a well-known limited-memory QN method that aims at solving the following optimization problem:
θ* = argmin_{θ ∈ R^d} f(θ),    (1)
where f: R^d → R is a twice continuously differentiable function that can be convex or non-convex, and is often referred to as the empirical risk. In a typical machine learning context, a dataset with N independent and identically distributed (i.i.d.) data points is considered, which renders f a sum of N different functions, i.e. f(θ) = Σ_{i=1}^N f_i(θ).
In large-scale applications, the number of data points N often becomes prohibitively large, and using a 'batch' L-BFGS algorithm therefore becomes computationally infeasible. As a remedy, stochastic L-BFGS methods have been proposed (Byrd et al., 2016; Schraudolph et al., 2007; Moritz et al., 2016; Zhou et al., 2017; Yousefian et al., 2017; Zhao et al., 2017), which aim to reduce the computational requirements of L-BFGS by replacing the full gradients ∇f that L-BFGS requires with stochastic gradients computed on small subsets of the dataset. However, using stochastic gradients within L-BFGS turns out to be a challenging task since it brings additional technical difficulties, which we will detail in Section 2.
In a very recent study, Berahas et al. (2016) proposed a parallel stochastic L-BFGS algorithm, called multi-batch L-BFGS (mb-L-BFGS), which is suitable for synchronous distributed architectures. This work illustrated that carrying out L-BFGS in a distributed setting introduces further theoretical and practical challenges; however, if these challenges are addressed, stochastic L-BFGS can be powerful in a distributed setting as well, outperforming conventional algorithms such as distributed stochastic gradient descent (SGD), as shown by their experimental results.
Despite the fact that synchronous parallel algorithms have clear advantages over serial optimization algorithms, the computational efficiency of synchronous algorithms is often limited by the overhead induced by synchronization and coordination among the worker processes. Inspired by asynchronous parallel stochastic optimization techniques (Agarwal & Duchi, 2011; Lian et al., 2015; Zhang et al., 2015; Zhao & Li, 2015; Zheng et al., 2017), in this study we propose an asynchronous parallel stochastic L-BFGS algorithm for large-scale non-convex optimization problems. The proposed approach aims at speeding up the synchronous algorithm presented in (Berahas et al., 2016) by allowing all the workers to work independently of each other, circumventing the inefficiencies caused by synchronization and coordination.
Extending stochastic L-BFGS to asynchronous settings is a highly non-trivial task and brings several challenges. In our strategy, we first reformulate the optimization problem (1) as a sampling problem, where the goal becomes drawing random samples from a distribution whose density is concentrated around the global optimum θ*. We then build our algorithm upon recent stochastic gradient Markov Chain Monte Carlo (SG-MCMC) techniques (Chen et al., 2015, 2016b) that have close connections with stochastic optimization techniques (Dalalyan, 2017; Raginsky et al., 2017; Zhang et al., 2017) and have proven successful in large-scale Bayesian machine learning. We provide a formal theoretical analysis and prove non-asymptotic guarantees for the proposed algorithm. Our theoretical results show that the proposed algorithm achieves ergodic global convergence with rate O(1/√N), where N denotes the total number of iterations. Our results further imply that the algorithm can achieve a linear speedup under ideal conditions.
To evaluate the proposed method, we conduct several experiments on synthetic and real datasets. The experimental results support our theory: our experiments on a large-scale matrix factorization problem show that the proposed algorithm provides a significant speedup over the synchronous parallel L-BFGS algorithm.
2 Technical Background
Preliminaries: As opposed to the classical optimization perspective, we look at the optimization problem (1) from a maximum a-posteriori (MAP) estimation point of view, where we consider θ as a random variable in R^d and θ* as the optimum of a Bayesian posterior whose density is given as π(θ) ∝ exp(−U(θ)), where x = {x_i}_{i=1}^N is a set of i.i.d. observed data points. Within this context, U is often called the potential energy and is defined as U(θ) = −(log p(x|θ) + log p(θ)), where p(x|θ) is the likelihood function and p(θ) is the prior density. In a classical optimization context, the negative log-likelihood would correspond to the data loss and the negative log-prior would correspond to a regularization term. Throughout this study, we will assume that problem (1) has a unique solution θ* in R^d.
We define a stochastic gradient ∇Ũ(θ), an unbiased estimator of ∇U(θ), as follows: ∇Ũ(θ) = −∇ log p(θ) − (N/|Ω|) Σ_{i∈Ω} ∇ log p(x_i|θ), where Ω ⊂ {1, …, N} denotes a random data subsample that is drawn with replacement and |Ω| is the cardinality of Ω. In the sequel, we will occasionally use the notations ∇Ũ_k and ∇Ũ_Ω to denote the stochastic gradient computed at iteration k of a given algorithm, or on a specific data subsample Ω, respectively.
The L-BFGS algorithm: The L-BFGS algorithm iteratively applies the following update in order to find the MAP estimate defined in (1):
θ_{k+1} = θ_k − η_k H_k ∇f(θ_k),    (2)
where k denotes the iterations. Here, H_k is an approximation to the inverse Hessian at θ_k and is computed by using the past M values of the 'iterate differences' s_k = θ_k − θ_{k−1} and the 'gradient differences' y_k = ∇f(θ_k) − ∇f(θ_{k−1}). The collection of the iterate and gradient differences is called the L-BFGS memory. The matrix-vector product H_k ∇f(θ_k) is often implemented by using the two-loop recursion (Nocedal & Wright, 2006), which has linear time and space complexity O(Md).
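As a concrete illustration (not the authors' code), the following minimal NumPy sketch of the classical two-loop recursion computes the product of the implicit inverse-Hessian approximation with a gradient in O(Md) time; the initial scaling γ = sᵀy / yᵀy is the standard choice from Nocedal & Wright:

```python
import numpy as np

def lbfgs_two_loop(grad, s_list, y_list):
    """Compute H @ grad via the L-BFGS two-loop recursion.

    s_list, y_list hold the last M iterate/gradient differences
    (s_j = theta_j - theta_{j-1}, y_j = grad_j - grad_{j-1}),
    oldest pair first. Time and space are O(M d)."""
    q = grad.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: newest to oldest pair.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        alphas.append(alpha)
    # Initial Hessian guess H0 = gamma * I with gamma = s'y / y'y.
    s, y = s_list[-1], y_list[-1]
    q *= np.dot(s, y) / np.dot(y, y)
    # Second loop: oldest to newest pair.
    for s, y, rho, alpha in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * np.dot(y, q)
        q += (alpha - beta) * s
    return q
```

On a quadratic f(θ) = ½ θᵀAθ with d linearly independent pairs, the recursion reproduces A⁻¹∇f exactly, which is a convenient sanity check.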
In order to achieve computational scalability, stochastic L-BFGS algorithms replace ∇f with the stochastic gradients ∇Ũ. This turns out to be problematic, since the gradient differences become inconsistent: the stochastic gradients in different iterations are computed on different data subsamples, i.e. y_k = ∇Ũ_{Ω_k}(θ_k) − ∇Ũ_{Ω_{k−1}}(θ_{k−1}) with Ω_k ≠ Ω_{k−1}. Moreover, in the presence of stochastic gradients, L-BFGS is no longer guaranteed to produce positive definite approximations, even in convex problems; therefore, extra care must be taken to make sure that H_k is positive definite.
Stochastic Gradient Markov Chain Monte Carlo: Along with the recent advances in MCMC techniques, diffusion-based algorithms have become increasingly popular due to their applicability in large-scale machine learning applications. These techniques, the so-called Stochastic Gradient MCMC (SG-MCMC) algorithms, aim at generating samples from the posterior distribution as opposed to finding the MAP estimate, and have strong connections with stochastic optimization techniques (Dalalyan, 2017). In this line of work, Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011) is one of the pioneering algorithms; it generates an approximate sample from the posterior by iteratively applying the following update equation:
θ_{k+1} = θ_k − η_{k+1} ∇Ũ(θ_k) + √(2 η_{k+1} / β) Z_{k+1},    (3)
where η_k is the step-size and Z_k is a collection of standard Gaussian random variables in R^d. Here, β is called the inverse temperature: it is fixed to β = 1 in vanilla SGLD, and when β > 1 the algorithm is called 'tempered'. In an algorithmic sense, SGLD is identical to SGD except that it injects Gaussian noise at each iteration, and it coincides with SGD as β goes to infinity.
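The update above can be sketched in a few lines; this is a generic SGLD step under the notation just introduced, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld_step(theta, stoch_grad, stepsize, beta=1.0):
    """One SGLD update: a stochastic gradient step plus Gaussian noise
    scaled by sqrt(2 * stepsize / beta). beta = 1 is vanilla SGLD,
    beta > 1 is the 'tempered' variant, and beta -> infinity recovers
    plain SGD (the noise term vanishes)."""
    noise = rng.standard_normal(theta.shape)
    return theta - stepsize * stoch_grad + np.sqrt(2.0 * stepsize / beta) * noise
```

For very large β the noise term becomes numerically negligible and the step reduces to an ordinary SGD step, mirroring the limit discussed above.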
SGLD has been extended in several directions (Ma et al., 2015; Chen et al., 2015; Şimşekli et al., 2016b; Şimşekli, 2017). In (Şimşekli et al., 2016a), we proposed an L-BFGS-based SGLD algorithm that aimed to improve the convergence speed of vanilla SGLD. We showed that a straightforward way of combining L-BFGS with SGLD would incur an undesired bias; however, the remedy for this bias resulted in numerical instability, which limits the applicability of the algorithm. In other recent studies, SGLD has also been extended to synchronous (Ahn et al., 2014) and asynchronous (Chen et al., 2016b; Springenberg et al., 2016) distributed MCMC settings.
SGLD can be seen as a discrete-time simulation of a continuous-time Markov process that is the solution of the following stochastic differential equation (SDE):
dθ_t = −∇U(θ_t) dt + √(2/β) dW_t,    (4)
where W_t denotes the standard Brownian motion in R^d. Under mild regularity conditions on U, the solution process attains a unique stationary distribution with a density that is proportional to exp(−β U(θ)) (Roberts & Stramer, 2002). An important property of this distribution is that, as β goes to infinity, this density concentrates around the global minimum of U (Hwang, 1980; Gelfand & Mitter, 1991). Therefore, for large enough β, a random sample drawn from the stationary distribution of (4) would be close to θ*. Due to this property, SG-MCMC methods have recently started drawing attention from the non-convex optimization community. Chen et al. (2016a) developed an annealed SG-MCMC algorithm for non-convex optimization, which was recently extended by Ye et al. (2017). Raginsky et al. (2017) and Xu et al. (2017) provided finite-time guarantees for SGLD to find an 'approximate' global minimizer close to θ*, which imply that the additive Gaussian noise in SGLD can help the algorithm escape poor local minima. In a complementary study, Zhang et al. (2017) showed that SGLD enters a neighborhood of a local minimum of U in polynomial time, meaning that even if SGLD fails to find the global optimum, it will still find a point close to one of the local optima. Even though these results show that SG-MCMC is promising for optimization, it is still not clear how an asynchronous stochastic L-BFGS method could be developed within an SG-MCMC framework.
3 Asynchronous Stochastic L-BFGS
In this section, we propose a novel asynchronous L-BFGS-based (tempered) SG-MCMC algorithm that aims to provide an approximate optimum close to θ* by generating samples from a distribution whose density is proportional to exp(−β U(θ)). We call the proposed algorithm asynchronous parallel stochastic L-BFGS (as-L-BFGS). Our method is suitable for both distributed and shared-memory settings. We will describe the algorithm only for the distributed setting; the shared-memory version is almost identical to the distributed version as long as the updates are ensured to be atomic.
We consider a classical asynchronous optimization architecture, which is composed of a master node, several worker nodes, and a data server. The main task of the master node is to maintain the newest iterate of the algorithm. At each iteration, the master node receives an additive update vector from a worker node, adds this vector to the current iterate in order to obtain the next iterate, and then sends the new iterate back to the worker node that sent the update vector. The worker nodes, on the other hand, work in a completely asynchronous manner. A worker node receives the iterate from the master node, computes an update vector, and sends the update vector to the master node. However, since the iterate may be modified by other workers running asynchronously in the meantime, the update vector sent to the master will typically have been computed on an old iterate, which causes both practical and theoretical challenges. Such updates are aptly called 'delayed' or 'stale'. The full data is kept in the data server, and we assume that all the workers have access to the data server.
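The master-worker protocol above can be sketched in shared memory with Python threads. This is a minimal illustration of stale additive updates under our own naming (Master, worker, the staleness counter), not the authors' implementation; plain gradient steps stand in for the L-BFGS update vectors:

```python
import threading
import numpy as np

class Master:
    """Holds the newest iterate; applies additive updates atomically."""
    def __init__(self, theta0):
        self.theta = theta0.astype(float)
        self.version = 0                 # number of updates applied so far
        self.lock = threading.Lock()

    def read(self):
        with self.lock:
            return self.theta.copy(), self.version

    def apply(self, delta, read_version):
        with self.lock:
            staleness = self.version - read_version  # updates applied since the read
            self.theta += delta
            self.version += 1
            return staleness

def worker(master, grad_fn, stepsize, n_steps, staleness_log):
    for _ in range(n_steps):
        theta_old, v = master.read()            # possibly stale copy of the iterate
        delta = -stepsize * grad_fn(theta_old)  # update computed on the stale iterate
        staleness_log.append(master.apply(delta, v))
```

Running several workers against one Master on a simple quadratic shows the iterate still converging despite non-zero recorded staleness, which is exactly the regime the analysis in Section 4 covers.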
The proposed algorithm iteratively applies the following update equations in the master node:
(5) 
where k is the iteration index, p is called the momentum variable, and the two update vectors are computed by the worker nodes. A worker node runs the following equations in order to compute the update vectors:
(6)  
(7) 
where η is the step-size, γ is the friction parameter that determines the weight of the momentum, β is the inverse temperature, Z_k denotes standard Gaussian random variables, and H denotes the L-BFGS matrix at the corresponding iteration. Here, τ denotes the 'staleness' of a particular update and measures the delay between the current update and the up-to-date iterate stored in the master node. We assume that the delays are bounded, i.e. τ ≤ τ_max. Note that the matrix-vector products have O(Md) time and space complexity.
Due to the asynchrony, the stochastic gradients and the L-BFGS matrices are computed on delayed variables. As opposed to asynchronous stochastic gradient algorithms, where the main difficulty stems from the delayed gradients, our algorithm faces further challenges, since it is not straightforward to obtain the gradient and iterate differences required for the L-BFGS computations in an asynchronously parallel setting.
We propose the following approach for the computation of the L-BFGS matrices. As opposed to the mb-L-BFGS algorithm, which uses a central L-BFGS memory (i.e. the collection of the gradient and iterate differences) stored in the master node, we let each worker keep its own local L-BFGS memory, since the master node would not be able to keep track of gradient and iterate differences that are received in an asynchronous manner. In our strategy, each worker updates its own L-BFGS memory right after sending the update vector to the master node. The overall algorithm is illustrated in Algorithms LABEL:algo:master and LABEL:algo:worker (W denotes the number of workers).
In order to have consistent gradient differences, each worker applies a multi-batch subsampling strategy similar to that of mb-L-BFGS. We divide each data subsample Ω into two disjoint subsets S and O, i.e. Ω = S ∪ O and S ∩ O = ∅. The main idea is to choose |O| small and use O as an overlapping subset for the gradient differences: in addition to the gradients that are computed on S and O, we also perform an extra gradient computation on the previous overlapping subset at the end of each iteration, so that the gradients entering a difference are evaluated on the same subset. As |O| will be small, this extra cost is not significant. Finally, in order to ensure that the L-BFGS matrices are positive definite, we use a 'cautious' update mechanism that is useful for non-convex settings (Li & Fukushima, 2001; Zhang & Sutton, 2011; Berahas et al., 2016), as shown in Algorithm LABEL:algo:worker.
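A worker's local memory update can be sketched as follows. This is an illustrative sketch under our own naming: grad_on(theta, idx) is an assumed helper that returns the stochastic gradient on a given subsample, and the cautious test sᵀy ≥ ε‖s‖² is the standard safeguard cited above:

```python
import numpy as np

def worker_memory_update(memory, theta_prev, theta_new, grad_on, overlap_idx,
                         M=10, eps=1e-4):
    """Append one (s, y) pair to a worker's local L-BFGS memory.

    The gradient difference y is computed on the SAME overlapping subset
    O at both theta_prev and theta_new, so it stays consistent despite
    subsampling. The 'cautious' test s.y >= eps * ||s||^2 only admits
    pairs with positive implied curvature, keeping the L-BFGS matrices
    positive definite in non-convex problems.
    grad_on(theta, idx) is an assumed helper (hypothetical name) that
    returns the stochastic gradient on the subsample indexed by idx."""
    s = theta_new - theta_prev
    y = grad_on(theta_new, overlap_idx) - grad_on(theta_prev, overlap_idx)
    if np.dot(s, y) >= eps * np.dot(s, s):   # cautious update: accept the pair
        memory.append((s, y))
        if len(memory) > M:                  # keep only the last M pairs
            memory.pop(0)
    return memory
```

Note that on a region of negative curvature the pair is simply discarded, so the memory never encodes an indefinite approximation.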
Note that, in addition to asynchrony, the proposed algorithm also extends current stochastic L-BFGS methods by introducing momentum. This brings two critical practical features: (i) without the momentum variables, the injected Gaussian noise must depend on the L-BFGS matrices, as shown in (Şimşekli et al., 2016a), which results in an algorithm with a higher time complexity, whereas our algorithm retains O(Md) time complexity; (ii) the use of momentum significantly alleviates the numerical instabilities caused by the asynchronous updates, since the momentum inherently encapsulates a direction for the iterates, providing the algorithm with additional information besides the gradients and the L-BFGS computations. Furthermore, in a very recent study (Loizou & Richtárik, 2017), the use of momentum variables has been shown to be useful in other second-order optimization methods. On the other hand, despite its advantages, the momentum variable also drives the proposed algorithm away from the original L-BFGS formulation. However, even such approximate approaches have proven useful in various scenarios (Zhang & Sutton, 2011; Fu et al., 2016). Also note that, when β goes to infinity and the L-BFGS matrices are fixed to the identity for all iterations, the algorithm coincides with SGD with momentum. A more detailed illustration is given in the supplementary document.
4 Theoretical Analysis
In this section, we provide non-asymptotic guarantees for the proposed algorithm. Our analysis strategy differs from conventional approaches to analyzing stochastic optimization and makes use of tools from the analysis of SDEs. In particular, we first develop a continuous-time Markov process whose marginal stationary measure admits a density proportional to exp(−β U(θ)). We then show that (5)–(7) form an approximate Euler-Maruyama integrator that approximately simulates this continuous process in discrete time. Finally, we analyze this approximate numerical scheme and provide a non-asymptotic error bound. All the proofs are given in the supplementary document.
We start by considering the following stochastic dynamical system:
(8) 
where r_t is also called the momentum variable, H denotes the L-BFGS matrix at time t, and Γ(θ) is a vector that is defined as follows:
Γ_i(θ) ≜ Σ_{j=1}^d ∂H_ij(θ) / ∂θ_j,    (9)
where Γ_i denotes the i-th component of the vector Γ and, similarly, H_ij denotes a single element of the matrix H.
In order to analyze the invariant measure of the SDE defined in (8), we need certain conditions to hold. First, we have two regularity assumptions on and :
H 1.
The gradient of the potential is Lipschitz continuous, i.e. ‖∇U(θ) − ∇U(θ′)‖ ≤ L ‖θ − θ′‖ for all θ, θ′ ∈ R^d.
H 2.
The L-BFGS matrices have bounded second-order derivatives and are Lipschitz continuous, i.e. ‖H(θ) − H(θ′)‖ ≤ L_H ‖θ − θ′‖ for all θ, θ′ ∈ R^d.
The assumptions H 1 and H 2 are standard conditions in the analysis of SDEs (Duan, 2015), and similar assumptions have also been considered for stochastic gradient (Moulines & Bach, 2011) and stochastic L-BFGS algorithms (Zhou et al., 2017). Besides, H 2 provides direct control over the partial derivatives of H, which will be useful for analyzing the overall numerical scheme. We now present our first result, which establishes the invariant measure of the SDE (8).
Proposition 1.
This result shows that, if the SDE (8) could be simulated exactly, the marginal distribution of the samples θ_t would converge to a measure with a density proportional to exp(−β U(θ)). Therefore, for large enough β and t, θ_t would be close to the global optimum θ*.
We note that when β = 1, the SDE (8) shares similarities with the SDEs presented in (Fu et al., 2016; Ma et al., 2015). Beyond the use of the tempering scheme, (Fu et al., 2016) further differs from our approach as it directly discards the correction term Γ, since it operates within a Metropolis-Hastings framework, which is not adequate for large-scale applications. On the other hand, the stochastic gradient Riemannian Hamiltonian Monte Carlo algorithm given in (Ma et al., 2015) chooses the preconditioning matrix as the Fisher information matrix, a quantity that requires quadratic space and time complexity and is not analytically available in general.
We will now show that the proposed algorithm (5)–(7) forms an approximate method for simulating (8) in discrete time. For illustration, we first consider the Euler-Maruyama integrator for (8), given as follows:
(10)  
(11) 
Here, the term Γ introduces an additional computational burden while its contribution is insignificant (due to H 2, its magnitude is of a negligible order). Therefore, we discard this term, reparametrize the remaining variables, and use the resulting quantities in (S3) and (S1). We then obtain the following reparametrized Euler integrator:
The detailed derivation is given in the supplementary document. Finally, we replace the exact gradients with stochastic gradients, replace the current variables with their stale counterparts in the update vectors, and obtain the ultimate update equations given in (5). Note that, due to neglecting Γ, the proposed approach requires a large β and is therefore not suitable for classical posterior sampling settings, where β = 1.
We now analyze the ergodic error of the iterate averages with respect to the global optimum θ*. This error resembles the bias of a statistical estimator; however, as opposed to the bias, it directly measures the expected discrepancy to the global optimum. Similar ergodic error notions have been considered in the analysis of non-convex optimization methods (Lian et al., 2015; Chen et al., 2016a; Berahas et al., 2016).
In our proof strategy, we decompose the error into two terms, a bias term and a term measuring the distance between the stationary measure and the global optimum, and upper-bound these terms separately.
The first term turns out to be the bias of a statistical estimator, which we can analyze by using ideas from recent SG-MCMC studies. However, existing tools cannot be directly used because of the additional difficulties introduced by the L-BFGS matrices. In order to bound this term, we first require the following smoothness and boundedness condition.
H 3.
Let φ be a functional that is the unique solution of the Poisson equation defined as follows:

L φ(z) = f(z) − f̄,    (12)

where z ≜ (θ, r), f̄ denotes the expectation of f under the stationary measure of (8), and L is the generator of (8), formally defined in the supplementary document. The functional φ and its derivatives up to the third order are bounded by a function V. Furthermore, the expectation of V along the iterates is bounded, and V is smooth in the sense that it grows at most polynomially along straight lines between any two points.
Assumption H 3 is also standard in SDE analysis and SG-MCMC (Mattingly et al., 2010; Teh et al., 2016; Chen et al., 2015; Durmus et al., 2016) and gives us control over the weak error of the numerical integrator. We further require the following regularity conditions in order to control the error induced by the delayed stochastic gradients.
H 4.
The variance of the stochastic gradients is bounded, i.e. E‖∇Ũ(θ) − ∇U(θ)‖² ≤ σ² for some σ > 0.
H 5.
For a smooth and bounded function h, the remainder term in the following Taylor expansion is bounded:
(13) 
The following lemma presents an upper bound for the first term.
Here, one of the terms in (14) appears due to the neglect of Γ. In order to bound the second term, we follow a strategy similar to (Raginsky et al., 2017), where we use H 1 and the following moment condition on the stationary measure.
H 6.
The second-order moments of the stationary measure π are bounded, and π satisfies E_π ‖θ − θ*‖² ≤ C for some C < ∞.
This assumption is mild, since the stationary measure concentrates around θ* as β tends to infinity. The order of the moment is arbitrary, hence the assumption can be further relaxed. The following lemma establishes an upper bound for the second term.
Theorem 1.
More explicit constants and a discussion of the relation of the theorem to other recent theoretical results are provided in the supplementary document.
Theorem 1 provides a non-asymptotic guarantee for convergence to a point close to the global optimizer even when U is non-convex, thanks to the additive Gaussian noise. The bound suggests an optimal rate of convergence of O(1/√N), which is in line with the current rates of non-convex asynchronous algorithms (Lian et al., 2015). Furthermore, if we assume that the total number of iterations N is a linear function of the number of workers W, e.g. N = W N_w, where N_w is the number of iterations executed by a single worker, Theorem 1 implies that, in the ideal case, the proposed algorithm can achieve a linear speedup with increasing W, provided that the maximum staleness remains bounded.
Despite their nice theoretical properties, it is well-known that tempered sampling approaches often get stuck near a local minimum. In our case, this behavior would mainly be due to the hidden constant in (14), which can be exponential in the dimension d, as illustrated in (Raginsky et al., 2017) for SGLD. On the other hand, Theorem 1 does not guarantee that the proposed algorithm will converge to a neighborhood of a local minimum; however, we believe that local convergence guarantees can also be proved by using the techniques provided in (Zhang et al., 2017; Tzen et al., 2018), which we leave as future work.
5 Experiments
The performance of asynchronous stochastic gradient methods has been evaluated in several studies, where the advantages and limitations have been illustrated in various scenarios, to name a few (Dean et al., 2012; Zhang et al., 2015; Zheng et al., 2017). In this study, we explore the advantages of using L-BFGS in an asynchronous environment. In order to illustrate the advantages of asynchrony, we compare as-L-BFGS with mb-L-BFGS (Berahas et al., 2016); and in order to illustrate the advantages brought by higher-order geometric information, we compare as-L-BFGS to asynchronous SGD (aSGD) (Lian et al., 2015). We also explore the speedup behavior of as-L-BFGS for an increasing number of workers.
We conduct experiments on both synthetic and real datasets. For the real data experiments, we implemented all three algorithms in C++ by using a low-level message passing protocol for parallelism, namely the OpenMPI library. This code can be used both in a distributed environment and on a single computer with multiple processors. For the experiments on synthetic data, we implemented the algorithms in MATLAB, by developing a realistic discrete-event simulator. This simulated environment is particularly useful for understanding the behavior of the algorithms in detail, since we can explicitly control the computation time spent at the master or worker nodes and the communication time between the nodes. This simulation strategy also enables us to explicitly control the variation among the computational powers of the worker nodes, a feature that is much harder to control in real distributed environments.
Linear Gaussian model: We conduct our first set of experiments on synthetic data, where we consider a rather simple convex quadratic problem whose optimum is analytically available. The problem is formulated as finding the MAP estimate of the following linear Gaussian probabilistic model:
We assume that the variance parameters of the model are known and we aim at computing the MAP estimate. For these experiments, we develop a parametric discrete-event simulator that aims to simulate the algorithms in a controllable yet realistic way. The simulator simulates a distributed optimization algorithm once it is provided with four parameters: (i) t_m: the average computational time spent by the master node at each iteration, (ii) t_w: the average computational time spent by a single worker at each iteration, (iii) s_w: the standard deviation of the computational time spent by a single worker per iteration, and (iv) t_c: the time spent on communication per iteration. All these parameters are expressed in a generic base time unit. Once these parameters are provided for one of the three algorithms, the simulator simulates the (a)synchronous distributed algorithm by drawing random computation times from a log-normal distribution whose mean and variance are specified by t_w and s_w. Figure 1 illustrates a typical outcome of the real and simulated implementations of as-L-BFGS, where we observe that the simulator provides realistic simulations that can very well reflect even the fluctuations of the algorithm.
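The core of such a simulator can be sketched as follows. This is a simplified sketch under our own naming (t_master, t_worker, s_worker, t_comm for the four parameters above), not the MATLAB simulator itself; it only contrasts a synchronous round, which waits for the slowest worker, with an idealized asynchronous round, which does not:

```python
import numpy as np

rng = np.random.default_rng(1)

def lognormal_times(mean, std, size):
    """Draw per-iteration compute times from a log-normal distribution,
    moment-matched so the draws have the requested mean and std."""
    var = std ** 2
    mu = np.log(mean ** 2 / np.sqrt(var + mean ** 2))
    sigma = np.sqrt(np.log(1.0 + var / mean ** 2))
    return rng.lognormal(mu, sigma, size)

def wallclock(n_iters, n_workers, t_master, t_worker, s_worker, t_comm, sync):
    """Rough wall-clock time for n_iters total updates.

    sync=True: each round blocks on the slowest of the workers.
    sync=False: workers proceed independently, so a round costs the
    average worker time rather than the maximum (stragglers don't block)."""
    rounds = n_iters // n_workers
    draws = lognormal_times(t_worker, s_worker, (rounds, n_workers))
    if sync:
        per_round = draws.max(axis=1) + t_comm + t_master
    else:
        per_round = draws.mean(axis=1) + t_comm + t_master
    return per_round.sum()
```

Since the per-round maximum always dominates the per-round mean, the sketch reproduces the qualitative finding of this section: the gap between synchronous and asynchronous wall-clock time widens as the worker-time variability s_worker grows.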
In our first experiment, we randomly generate and fix the design vectors in such a way that there is strong correlation in the posterior distribution, and we finally generate a true parameter vector and the observations by using the generative model.
For each algorithm, we fix t_m, t_w, and t_c to realistic values and investigate the effect of the variation among the workers by comparing the running times that the algorithms need in order to achieve a fixed accuracy for different values of s_w. We repeat each experiment several times. In all our experiments, we tried several values for the hyper-parameters of each algorithm and we report the best results. All the hyper-parameters are provided in the supplementary document.
Figure 2 visualizes the results of the first experiment. We observe that, for smaller values of s_w, as-L-BFGS and mb-L-BFGS perform similarly, whereas aSGD requires more computational time to achieve the same accuracy. However, as we increase the value of s_w, mb-L-BFGS requires more computational time in order to collect a sufficient number of stochastic gradients. The results show that both asynchronous algorithms are more robust to the variability of the computational power of the workers, with as-L-BFGS showing the better performance.
In our second experiment, we investigate the speedup behavior of as-L-BFGS within the simulated setting. Here, we consider a highly varying set of workers and vary the number of workers W. As illustrated in Figure 3, as W increases, the staleness increases as well, and the algorithm hence requires more iterations in order to achieve the same accuracy, since a smaller step-size needs to be used. However, this increase in the number of iterations is compensated by the increased number of workers: we observe that the required computational time gracefully decreases with increasing W. We observe a similar behavior for different values of s_w, where the speedup is more prominent for smaller s_w.
Large-scale matrix factorization: In our next set of experiments, we consider a large-scale matrix factorization problem (Gemulla et al., 2011; Şimşekli et al., 2015, 2017), where the goal is to obtain the MAP solution of a probabilistic model in which the observed entries of the data matrix are Gaussian-distributed around the corresponding entries of the product of two low-rank factor matrices, which are to be estimated.
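A single serial stochastic gradient step for this kind of Gaussian matrix factorization model can be sketched as follows (an illustrative sketch only, not the distributed C++ implementation; the quadratic regularizer lam stands in for the Gaussian prior on the factors):

```python
import numpy as np

def mf_sgd_step(W, H, rows, cols, vals, stepsize, lam=0.01):
    """One stochastic gradient pass over a mini-batch of observed entries.

    The model is X ~= W @ H with Gaussian noise, so each observed entry
    (rows[i], cols[i]) = vals[i] contributes a squared-error term; lam is
    an assumed quadratic regularizer corresponding to a Gaussian prior."""
    for i, j, x in zip(rows, cols, vals):
        err = W[i] @ H[:, j] - x              # residual on one rating
        gW = err * H[:, j] + lam * W[i]       # gradient w.r.t. row factor
        gH = err * W[i] + lam * H[:, j]       # gradient w.r.t. column factor
        W[i] -= stepsize * gW
        H[:, j] -= stepsize * gH
    return W, H
```

Because each entry touches only one row of W and one column of H, updates on disjoint entries commute, which is precisely what makes this problem well suited to the asynchronous parallel setting considered here.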
In this context, we evaluate the algorithms on three large-scale movie ratings datasets, namely MovieLens 1 Million (ML-1M), 10 Million (ML-10M), and 20 Million (ML-20M) (grouplens.org), which contain 1, 10, and 20 million non-zero entries (ratings), respectively. We conducted these experiments on a cluster of interconnected computers, each of which is equipped with CPUs and memory of variable quality. In these experiments, we found that numerical stability is improved when the L-BFGS matrix is slightly regularized by adding a small multiple of the identity. This small modification does not violate our theoretical results. The hyper-parameters are provided in the supplementary document.
Figure 4 shows the performance of the three algorithms on the MovieLens datasets in terms of the root-mean-square error (RMSE), a standard metric for recommendation systems, and the norm of the gradients through iterations. In these experiments, the same settings are used for all three datasets. The results show that, in all datasets, as-L-BFGS provides a significant speedup over mb-L-BFGS thanks to asynchrony. We observe that even when the speed of convergence of mb-L-BFGS is comparable to aSGD and as-L-BFGS (cf. the plots showing the norm of the gradients), the final RMSE yielded by mb-L-BFGS is poorer than that of the two other methods, which indicates that the asynchronous algorithms are able to find a better local minimum. On the other hand, the asynchrony causes more fluctuations in as-L-BFGS when compared to aSGD.
As opposed to the synthetic data experiments, in all three MovieLens datasets we observe that as-L-BFGS provides only a slight improvement in convergence speed when compared to aSGD. This indicates that aSGD is able to achieve a comparable convergence speed by taking more steps while as-L-BFGS is computing the matrix-vector products. However, this gap can be widened by considering a more efficient, yet more sophisticated, implementation of the L-BFGS computations (Chen et al., 2014).
In our last experiment, we investigate the speedup properties of as-L-BFGS in the real distributed setting. In this experiment, we consider only one of the MovieLens datasets and run the as-L-BFGS algorithm with different numbers of workers. Figure 5 illustrates the results of this experiment. For moderate numbers of workers, we obtain a decent speedup that is close to linear. However, for the largest number of workers, the algorithm becomes unstable, since the staleness-related term in (15) dominates. In this regime, we therefore need to decrease the step-size, which requires the algorithm to be run for a longer time in order to achieve the same error as with fewer workers. On the other hand, the algorithm achieves a linear speedup in terms of iterations; the corresponding result is provided in the supplementary document due to space constraints.
6 Conclusion
In this study, we proposed an asynchronous parallel L-BFGS algorithm for non-convex optimization. We developed the algorithm within the SG-MCMC framework, where we reformulated the problem as sampling from a concentrated probability distribution. We proved non-asymptotic guarantees, showing that as-L-BFGS achieves ergodic global convergence and can attain a linear speedup under certain conditions. Our experiments supported our theory and showed that the proposed algorithm provides a significant speedup over the synchronous parallel L-BFGS algorithm.
Acknowledgments
The authors would like to thank Murat A. Erdoğdu for fruitful discussions. This work is partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX project (ANR-16-CE23-0014), by the Scientific and Technological Research Council of Turkey (TÜBİTAK) grant number 116E580, and by the industrial chair Machine Learning for Big Data from Télécom ParisTech.
Agarwal, A. and Duchi, J. C. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pp. 873–881, 2011.
Ahn, S., Shahbaba, B., and Welling, M. Distributed stochastic gradient MCMC. In International conference on machine learning, pp. 1044–1052, 2014.
Berahas, A. S., Nocedal, J., and Takáč, M. A multi-batch L-BFGS method for machine learning. In Advances in Neural Information Processing Systems, pp. 1055–1063, 2016.
Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.
Chen, C., Ding, N., and Carin, L. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pp. 2269–2277, 2015.
Chen, C., Carlson, D., Gan, Z., Li, C., and Carin, L. Bridging the gap between stochastic gradient MCMC and stochastic optimization. In AISTATS, 2016a.
Chen, C., Ding, N., Li, C., Zhang, Y., and Carin, L. Stochastic gradient MCMC with stale gradients. In Advances In Neural Information Processing Systems, pp. 2937–2945, 2016b.
Chen, W., Wang, Z., and Zhou, J. Large-scale L-BFGS using MapReduce. In Advances in Neural Information Processing Systems, pp. 1332–1340, 2014.
Şimşekli, U., Badeau, R., Cemgil, A. T., and Richard, G. Stochastic quasi-Newton Langevin Monte Carlo. In ICML, 2016a.
Şimşekli, U., Badeau, R., Richard, G., and Cemgil, A. T. Stochastic thermodynamic integration: efficient Bayesian model selection via stochastic gradient MCMC. In ICASSP, 2016b.
Şimşekli, U., Durmus, A., Badeau, R., Richard, G., Moulines, E., and Cemgil, A. T. Parallelized stochastic gradient Markov Chain Monte Carlo algorithms for non-negative matrix factorization. In ICASSP, 2017.
Dalalyan, A. S. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. Proceedings of the 2017 Conference on Learning Theory, 2017.
Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.
Duan, J. An Introduction to Stochastic Dynamics. Cambridge University Press, New York, 2015.
Durmus, A., Şimşekli, U., Moulines, E., Badeau, R., and Richard, G. Stochastic gradient Richardson-Romberg Markov Chain Monte Carlo. In NIPS, 2016.
Fu, T., Luo, L., and Zhang, Z. Quasi-Newton Hamiltonian Monte Carlo. In UAI, 2016.
Gelfand, S. B. and Mitter, S. K. Recursive stochastic algorithms for global optimization in R^d. SIAM Journal on Control and Optimization, 29(5):999–1018, 1991.
Gemulla, R., Nijkamp, E., Haas, P. J., and Sismanis, Y. Large-scale matrix factorization with distributed stochastic gradient descent. In ACM SIGKDD, 2011.
Hwang, C. Laplace’s method revisited: weak convergence of probability measures. The Annals of Probability, pp. 1177–1182, 1980.
Li, D.H. and Fukushima, M. On the global convergence of the BFGS method for nonconvex unconstrained optimization problems. SIAM Journal on Optimization, 11(4):1054–1064, 2001.
Lian, X., Huang, Y., Li, Y., and Liu, J. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 2737–2745, 2015.
Loizou, N. and Richtárik, P. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.
Ma, Y. A., Chen, T., and Fox, E. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pp. 2899–2907, 2015.
Mattingly, J. C., Stuart, A. M., and Tretyakov, M. V. Convergence of numerical timeaveraging and stationary measures via Poisson equations. SIAM Journal on Numerical Analysis, 48(2):552–577, 2010.
Moritz, P., Nishihara, R., and Jordan, M. A linearly-convergent stochastic L-BFGS algorithm. In Artificial Intelligence and Statistics, pp. 249–258, 2016.
Moulines, E. and Bach, F. R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp. 451–459, 2011.
Nocedal, J. and Wright, S. J. Numerical optimization. Springer, Berlin, 2006.
Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Proceedings of the 2017 Conference on Learning Theory, volume 65, pp. 1674–1703, 2017.
Roberts, G. O. and Stramer, O. Langevin Diffusions and Metropolis-Hastings Algorithms. Methodology and Computing in Applied Probability, 4(4):337–357, December 2002. ISSN 1387-5841.
Schraudolph, N. N., Yu, J., and Günter, S. A stochastic quasi-Newton method for online convex optimization. In Artificial Intelligence and Statistics, pp. 436–443, 2007.
Şimşekli, U. Fractional Langevin Monte Carlo: exploring Lévy-driven stochastic differential equations for Markov chain Monte Carlo. In ICML, pp. 3200–3209, 2017.
Şimşekli, U., Koptagel, H., Güldaş, H., Cemgil, A. T., Öztoprak, F., and Birbil, Ş. İ. Parallel stochastic gradient Markov Chain Monte Carlo for matrix factorisation models. arXiv preprint arXiv:1506.01418, 2015.
Springenberg, J. T., Klein, A., Falkner, S., and Hutter, F. Asynchronous stochastic gradient MCMC with elastic coupling. arXiv preprint arXiv:1612.00767, 2016.
Teh, Y. W., Thiery, A. H., and Vollmer, S. J. Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17:1–33, 2016.
Tzen, B., Liang, T., and Raginsky, M. Local optimality and generalization guarantees for the Langevin algorithm via empirical metastability. In Proceedings of the 2018 Conference on Learning Theory, 2018.
Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, pp. 681–688, 2011.
Xu, P., Chen, J., and Gu, Q. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. arXiv preprint arXiv:1707.06618, 2017.
Ye, N., Zhu, Z., and Mantiuk, R. Langevin dynamics with continuous tempering for training deep neural networks. In Advances in Neural Information Processing Systems, pp. 618–626. 2017.
Yousefian, F., Nedić, A., and Shanbhag, U. On stochastic and deterministic quasi-Newton methods for non-strongly convex optimization: convergence and rate analysis. arXiv preprint arXiv:1710.05509, 2017.
Zhang, S., Choromanska, A. E., and LeCun, Y. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pp. 685–693, 2015.
Zhang, Y. and Sutton, C. A. Quasi-Newton methods for Markov Chain Monte Carlo. In Advances in Neural Information Processing Systems, pp. 2393–2401, 2011.
Zhang, Y., Liang, P., and Charikar, M. A hitting time analysis of stochastic gradient Langevin dynamics. In Proceedings of the 2017 Conference on Learning Theory, volume 65, pp. 1980–2022, 2017.
Zhao, R., Haskell, W. B., and Tan, V. Y. F. Stochastic L-BFGS: Improved convergence rates and practical acceleration strategies. IEEE Transactions on Signal Processing, 2017.
Zhao, S.-Y. and Li, W.-J. Fast asynchronous parallel stochastic gradient descent. arXiv preprint arXiv:1508.05711, 2015.
Zheng, S., Meng, Q., Wang, T., Chen, W., Yu, N., Ma, Z., and Liu, T. Asynchronous stochastic gradient descent with delay compensation. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 4120–4129, 2017.
Zhou, C., Gao, W., and Goldfarb, D. Stochastic adaptive quasi-Newton methods for minimizing expected values. In International Conference on Machine Learning, pp. 4150–4159, 2017.
Umut Şimşekli, Çağatay Yıldız, Thanh Huy Nguyen, Gaël Richard, A. Taylan Cemgil
1: LTCI, Télécom ParisTech, Université ParisSaclay, 75013, Paris, France
2: Department of Computer Science, Aalto University, Espoo, 02150, Finland
3: Department of Computer Engineering, Boğaziçi University, 34342, Bebek, Istanbul, Turkey
1 The Approximate Euler-Maruyama Scheme
1.1 Connection with gradient descent with momentum
The standard Euler-Maruyama scheme for the SDE (8) can be developed as follows:
(S1)  
(S2)  
(S3) 
where is the stepsize and is a collection of standard Gaussian random variables.
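Since the symbols of (S1)-(S3) are not reproduced here, the scheme can be illustrated generically: an Euler-Maruyama step advances the state by the drift scaled by the stepsize, plus Gaussian noise scaled by the square root of the stepsize. The sketch below, with names and a test SDE of our own choosing, shows this pattern:

```python
import numpy as np

def euler_maruyama(drift, sigma, theta0, h, n_steps, rng):
    """Generic Euler-Maruyama discretization of the SDE
        d(theta_t) = drift(theta_t) dt + sigma dW_t,
    with stepsize h:
        theta_{k+1} = theta_k + h * drift(theta_k) + sqrt(h) * sigma * xi_k,
    where xi_k is a standard Gaussian random variable.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        xi = rng.standard_normal(theta.shape)  # standard Gaussian noise
        theta = theta + h * drift(theta) + np.sqrt(h) * sigma * xi
    return theta
```

Setting sigma to zero recovers plain (deterministic) gradient-flow discretization, which is the degenerate case used below to connect the scheme with gradient descent with momentum.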
We can obtain simplified update rules if we define and use it in (S3). The modified update rules are given as follows:
(S4)  
(S5)  
(S6)  
(S7) 
where . If we use the modified Euler scheme as described in (Neal, 2010) and replace with in (S1), we obtain the following update equation:
(S8)  
(S9) 
Note that, when we have the following update rules:
(S10)  
(S11) 
which coincide with gradient descent with momentum when for all .
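Since the original symbols are not reproduced here, the correspondence can be sketched in generic notation: in the noise-free limit, the update rules above take the classical heavy-ball form,

```latex
% Heavy-ball (momentum) form of the noise-free update, in generic notation:
% p: momentum, theta: iterate, mu: momentum coefficient, eta: step size.
\begin{align}
  p^{(k+1)}      &= \mu \, p^{(k)} - \eta \, \nabla f\bigl(\theta^{(k)}\bigr), \\
  \theta^{(k+1)} &= \theta^{(k)} + p^{(k+1)}.
\end{align}
```

Identifying the momentum coefficient and the effective step size with the corresponding friction- and stepsize-dependent factors of the discretized SDE recovers the claimed equivalence.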
1.2 Numerical integration with stale stochastic gradients
We now focus on (S1) and (S2). We first drop the term , replace the gradients with the stochastic gradients, and then modify the update rules by using stale parameters and . The resulting scheme is given as follows:
(S12)  
(S13) 
By using a similar argument to the one used in Section 1.1, we define , , , and obtain the following update equations:
(S14)  
(S15) 
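The effect of the stale parameters in (S12)-(S15) can be mimicked with a toy simulation in which the gradient at iteration k is evaluated at a parameter that is tau_k iterations old, as happens when asynchronous workers read the shared state with a delay. The function below is an illustrative sketch in our own notation, not the paper's implementation:

```python
import numpy as np

def delayed_sgd(grad, theta0, h, n_steps, max_delay, rng):
    """Gradient iteration with stale parameters: at step k the gradient is
    evaluated at theta_{k - tau_k} for a random delay tau_k <= max_delay,
    mimicking asynchronous workers reading an out-of-date shared state."""
    history = [np.asarray(theta0, dtype=float)]
    for k in range(n_steps):
        # Random staleness, capped so we never look before the initial point.
        tau = int(rng.integers(0, min(k, max_delay) + 1))
        stale_theta = history[-1 - tau]
        theta = history[-1] - h * grad(stale_theta)
        history.append(theta)
    return history[-1]
```

With max_delay = 0 this reduces to plain gradient descent; for small positive delays and a small enough stepsize the iteration still makes progress, which is the regime the theory covers, whereas too much staleness relative to the stepsize causes the instability observed in the experiments.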
2 Proof of Proposition 1
Proof.
We start by rewriting the SDE given in (8) as follows:
(S16) 
Here, we observe that is positive semidefinite and is antisymmetric. Furthermore, for all , we observe that
(S17) 
Assumptions H2 and H1 directly imply that the function is locally Lipschitz continuous in for all . The desired result is then obtained by applying Theorem 1 of (Ma et al., 2015) and Proposition 4.2.2 of (Kunze, 2012). ∎
3 Proof of Lemma 1
3.1 Preliminaries
In the rest of this document, unless specified otherwise, the notation will denote the expectation taken over all the random sources contained in .
Before providing the proof of Lemma 1, let us consider the following Itô diffusion:
(S18) 
where , , , and is Brownian motion in . The generator for (S18) is formally defined as follows:
(S19) 
where is any twice differentiable function whose support is compact. Here, $\langle \cdot, \cdot \rangle$ denotes the inner product between vectors; for matrices $A$ and $B$, it is defined as $\langle A, B \rangle = \operatorname{tr}(A^{\top} B)$. The generator for the diffusion (S16) considered in our study is then obtained by specializing (S19) to the drift and diffusion terms of (S16):
(S20) 
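In generic notation (the symbols of (S18) are not reproduced here), the generator of an Itô diffusion acts on a smooth, compactly supported test function as follows:

```latex
% Generator of the Ito diffusion dZ_t = b(Z_t) dt + sigma(Z_t) dW_t,
% acting on a twice-differentiable, compactly supported test function f:
\begin{align}
  \mathcal{L} f(z)
    = \bigl\langle \nabla f(z),\, b(z) \bigr\rangle
    + \tfrac{1}{2} \bigl\langle \sigma(z)\sigma(z)^{\top},\, \nabla\nabla f(z) \bigr\rangle,
  \qquad
  \langle A, B \rangle := \operatorname{tr}\!\bigl(A^{\top} B\bigr).
\end{align}
```

The first term tracks the deterministic drift and the second the second-order effect of the diffusion; (S19) and (S20) are instances of this formula for the specific drift and diffusion of the SDEs considered here.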
We also define the following operator for the approximate Euler-Maruyama scheme with delayed updates:
(S21) 
By using the definitions and , we obtain the following identity:
(S22) 
where . This operator essentially captures all the errors induced by the approximate integrator.
3.2 Proof of Lemma 1
Proof.
We first consider the Euler-Maruyama integrator of the SDE (S16), which combines (S1) and (S3) into a single equation, given as follows:
where is a standard Gaussian random variable in , h is the stepsize, and , , are defined in (S16). We then modify this scheme by replacing the gradient with the stale stochastic gradients and discarding the term . The resulting numerical integrator is given as follows:
(S23) 
Note that (S23) coincides with the proposed algorithm, given in (5).
In the sequel, we follow a strategy similar to that of (Chen et al., 2016b). However, we face additional difficulties caused by the use of the L-BFGS matrices, which are reflected in the operator . Since we are using the Euler-Maruyama integrator, we have the following inequality (Chen et al., 2015):
(S24) 
By summing both sides of (S24) over , taking the expectation, and using (S22), we obtain the following:
(S25) 
By rearranging the terms and dividing all the terms by , we obtain:
(S26) 
By using the Poisson equation given in (12) for each and rearranging the terms, we obtain:
(S27) 
By Assumption H3, the term is uniformly bounded. Then, by Assumption H3 and Lemma S5, we obtain the following bound:
(S28) 
as desired. ∎
Remark 1.
Theorem 1 differs significantly from other recent results. First of all, none of the references we are aware of provides an analysis for an asynchronous stochastic L-BFGS algorithm. Aside from this fact, when compared to (Chen et al., 2016a), our bound handles the case of delayed updates and provides an explicit dependence on . When compared to (Chen et al., 2016b), our analysis considers the tempered case and handles the additional difficulties brought by the L-BFGS matrices and their derivatives. On the other hand, our analysis is also significantly different from the ones presented in (Raginsky et al., 2017) and (Xu et al., 2017), as it considers the asynchrony and the L-BFGS matrices, and provides a bound for the ergodic error.
4 Proof of Lemma 2
Proof.
We use the same proof technique as in (Raginsky et al., 2017, Proposition 11). We assume that admits a density with respect to the Lebesgue measure, denoted as , where is the normalization constant: . We start by using the definition of , as follows:
(S29) 
where is the differential entropy, defined as follows:
(S30) 
We now aim at upper-bounding and lower-bounding . By Assumption H6, the distribution has a finite second-order moment; therefore, its differential entropy is upper-bounded by the differential entropy of a Gaussian distribution with the same second-order moment. We then obtain
(S31)  
(S32)  
(S33) 
where denotes the covariance matrix of the Gaussian distribution. In (S32) we used the relation between the arithmetic and geometric means, and in (S33) we used Assumption H6.
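In generic notation, the chain of inequalities (S31)-(S33) instantiates the Gaussian maximum-entropy bound followed by the arithmetic-geometric mean inequality applied to the eigenvalues of the covariance matrix:

```latex
% Maximum-entropy and AM-GM steps behind (S31)-(S33), in generic notation,
% with Sigma the covariance matrix of the matching Gaussian in dimension d:
\begin{align}
  h(p) &\le \tfrac{1}{2} \log\bigl( (2\pi e)^{d} \det \Sigma \bigr)
        && \text{(a Gaussian maximizes entropy for fixed $\Sigma$)} \\
       &\le \tfrac{d}{2} \log\Bigl( 2\pi e \, \tfrac{\operatorname{tr} \Sigma}{d} \Bigr)
        && \text{(AM--GM: $\det \Sigma \le (\operatorname{tr} \Sigma / d)^{d}$)}.
\end{align}
```

The trace term is then controlled by the second-order moment bound of Assumption H6, which yields the final upper bound on the differential entropy.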