A Model Parallel Proximal Stochastic Gradient Algorithm for Partially Asynchronous Systems
Large models are prevalent in modern machine learning scenarios, including deep learning, recommender systems, etc., which can have millions or even billions of parameters. Parallel algorithms have become an essential solution technique to many large-scale machine learning jobs. In this paper, we propose a model parallel proximal stochastic gradient algorithm, AsyB-ProxSGD, to deal with large models using model parallel blockwise updates while in the meantime handling a large amount of training data using proximal stochastic gradient descent (ProxSGD). In our algorithm, worker nodes communicate with the parameter servers asynchronously, and each worker performs proximal stochastic gradient for only one block of model parameters during each iteration. Our proposed algorithm generalizes ProxSGD to the asynchronous and model parallel setting. We prove that AsyB-ProxSGD achieves a convergence rate of to stationary points for nonconvex problems under constant minibatch sizes, where is the total number of block updates. This rate matches the best-known rates of convergence for a wide range of gradient-like algorithms. Furthermore, we show that when the number of workers is bounded by , we can expect AsyB-ProxSGD to achieve linear speedup as the number of workers increases. We implement the proposed algorithm on MXNet and demonstrate its convergence behavior and near-linear speedup on a real-world dataset involving both a large model size and large amounts of data.
Many machine learning problems can be formulated as the following general minimization framework:
where is typically a smooth yet possibly nonconvex loss function of the -th sample, and is a convex yet nonsmooth regularizer term that promotes some structures. Examples include deep learning with regularization (Dean et al., 2012; Chen et al., 2015; Zhang et al., 2015), LASSO (Tibshirani et al., 2005), sparse logistic regression (Liu et al., 2009), robust matrix completion (Xu et al., 2010; Sun and Luo, 2015), and sparse support vector machine (SVM) (Friedman et al., 2001).
Many classical deterministic (non-stochastic) algorithms are available to solve problem (1), including the proximal gradient (ProxGD) method (Parikh et al., 2014) and its accelerated variants (Li and Lin, 2015) as well as the alternating direction method of multipliers (ADMM) (Hong et al., 2016). These methods leverage the so-called proximal operators (Parikh et al., 2014) to handle the nonsmoothness in the problem. However, these algorithms require calculating the gradient of all samples in each iteration, which is expensive in modern machine learning problems. The trend to deal with large volumes of data is the use of stochastic algorithms. As the number of training samples increases, the cost of updating the model taking into account all error gradients becomes prohibitive. To tackle this issue, stochastic algorithms make it possible to update using only a small subset of all training samples at a time.
Stochastic gradient descent (SGD) is one of the first algorithms widely implemented in an asynchronous parallel fashion; its convergence rates and speedup properties have been analyzed for both convex (Agarwal and Duchi, 2011; Mania et al., 2017) and nonconvex (Lian et al., 2015) optimization problems. Nevertheless, SGD is mainly applicable to the case of smooth optimization, and yet is not suitable for problems with a nonsmooth term in the objective function, e.g., an norm regularizer. In fact, such nonsmooth regularizers are commonplace in many practical machine learning problems or constrained optimization problems. In these cases, SGD becomes ineffective, as it is hard to obtain gradients for a nonsmooth objective function.
With rapidly growing data volumes and model complexity, the need to scale up machine learning has sparked broad interests in developing efficient parallel optimization algorithms. A typical parallel optimization algorithm usually decomposes the original problem into multiple subproblems, each handled by a worker node. Each worker iteratively downloads the global model parameters and computes its local gradients to be sent to the master node or servers for model updates. Recently, asynchronous parallel optimization algorithms (Niu et al., 2011; Li et al., 2014b; Lian et al., 2015), exemplified by the Parameter Server architecture (Li et al., 2014a), have been widely deployed in industry to solve practical large-scale machine learning problems. Asynchronous algorithms can largely reduce overhead and speedup training, since each worker may individually perform model updates in the system without synchronization.
Existing parallel algorithms fall into two categories: data parallelism and model parallelism. In data parallelism, each worker takes a subset of training samples and calculates their loss functions ’s and/or gradients in parallel. For example, a typical implementation of parallel SGD is to divide a minibatch with samples into several smaller minibatches (each with samples), and each worker computes gradients on samples. This is preferred when the size of data is large. In model parallelism, the model parameters is partitioned into blocks, where with and . Since proximal operator on can be decomposed into those on individual blocks (Parikh et al., 2014), proximal gradient and its stochastic version (proximal stochastic gradient descent or ProxSGD) is again a natural candidate to solve (1). (Zhou et al., 2016) shows that Block Proximal Gradient Descent works in an asynchronous parallel mode, and in the meantime proves that Block ProxSGD can converge to critical points when the maximum staleness is bounded. However, theoretical understanding of the behavior of Asynchronous Block ProxSGD, a more useful algorithm in practical Parameter Servers, is a gap yet to be filled.
In this paper, we propose AsyB-ProxSGD (Asynchronous Block Proximal Stochastic Gradient Descent), an extension of proximal stochastic gradient (ProxSGD) algorithm to the model parallel paradigm and to the partially asynchronous protocol (PAP) setting. In AsyB-ProxSGD, workers asynchronously communicate with the parameter servers, which collectively store model parameters in blocks. In an iteration, each worker pulls the latest yet possibly outdated model from servers, calculates partial gradients for only one block based on stochastic samples, and pushes the gradients to the corresponding server. As workers can update different blocks in parallel, AsyB-ProxSGD is different from traditional data parallel ProxSGD can handle both a large model size and a large number of training samples, a case frequently observed in reality.
Our theoretical contribution is summarized as follows. We prove that AsyB-ProxSGD can converge to stationary points of the nonconvex and nonsmooth problem (1) with an ergodic convergence rate of , where is the total number of times that any block in is updated. This rate matches the convergence rate known for asynchronous SGD. The latter, however, is suitable only for smooth problems. To our best knowledge, this is the first work that provides convergence rate guarantees for ProxSGD in a model parallel mode, especially in an asynchronous setting. We also provide a linear speedup guarantee as the number of workers increases, provided that the number of workers is bounded by . This result has laid down a theoretical ground for the scalability and performance of AsyB-ProxSGD in practice. Evaluation based on a real-world dataset involving both a large model and a large dataset has corroborated our theoretical findings on the convergence and speedup behavior of AsyB-ProxSGD, under a Parameter Server implementation.
In this section, we first introduce some notations to be used throughout the paper. Then we introduce the stochastic optimization problem to be studied. Finally, we introduce proximal operators and enumerate fundamental assumptions made in the model.
We use to denote the norm of the vector , and to denote the inner product of two vectors and . We use to denote the “true” gradient and use to denote the stochastic gradient for a function . Let be the derivative w.r.t. , the -th coordinate of ; and let and represent and , respectively. For a random variable or vector , let be the conditional expectation of w.r.t. a sigma algebra .
2.1 Stochastic Optimization Problems
In this paper, we consider the following stochastic optimization problem instead of the original deterministic version (1):
where the stochastic nature comes from the random variable , which in our problem settings, represents a random index selected from the training set . Therefore, (2) attempts to minimize the expected loss of a training sample plus a regularizer . When it comes to large models, we decompose into blocks, and rewrite (2) into the following block optimization form:
where , for those and , and .
In this work, we assume all ’s for problem (3) are proper, closed and convex, yet not necessarily smooth. To handle the potential non-smoothness, we introduce the following generalized notion of derivatives to be used in convergence analysis.
Definition 1 (Subdifferential e.g., (Parikh et al., 2014)).
We say a vector is a subgradient of the function at , if for all ,
Moreover, denote the set of all such subgradients at by , which is called the subdifferential of at .
Definition 2 (Critical point (Attouch et al., 2013)).
A point is a critical point of , iff .
2.2 Proximal Gradient Descent
Definition 3 (Proximal operator).
The proximal operator of a point under a proper and closed function with parameter is defined as:
In its vanilla version, proximal gradient descent performs the following iterative updates:
for , where is the step size at iteration .
In ProxSGD, the gradient is replaced by the gradients from a random subset of training samples, denoted by at iteration . Since is a random variable indicating a random index in , is a random loss function for the random sample , such that .
With these definitions, we now introduce our metric used in ergodic convergence analysis:
which is also called the gradient mapping in the literature, e.g., (Parikh et al., 2014). For non-convex problems, it is a standard approach to measure convergence (to a stationary point) by gradient mapping according to the following lemma:
Therefore, we can use the following definition as a convergence metric:
Definition 4 (Iteration complexity (Reddi et al., 2016)).
A solution is called -accurate, if for some . If an algorithm needs at least iterations to find an -accurate solution, its iteration complexity is .
We make the following assumptions throughout the paper. Other algorithm specific assumptions will be introduced later in the corresponding sections.
We assume that is a smooth function with the following properties:
Assumption 1 (Lipschitz Gradient).
For function there are Lipschitz constants such that
where is an indicator vector that is only valid at block with value 1, and vanishes to zero at other blocks. Clearly we have .
As discussed above, assume that (or ) is a proper, closed and convex function, which is yet not necessarily smooth. If the algorithm has been executed for iterations, we let denote the set that consists of all the samples used up to iteration . Since for all , the collection of all such forms a filtration. Under such settings, we can restrict our attention to those stochastic gradients with an unbiased estimate and bounded variance, which are common in the analysis of stochastic gradient descent or stochastic proximal gradient algorithms, e.g., (Lian et al., 2015; Ghadimi et al., 2016).
Assumption 2 (Unbiased gradient).
For any , we have .
Assumption 3 (Bounded variance).
The variance of the stochastic gradient is bounded by .
2.3 Parallel Stochastic Optimization
Recent years have witnessed rapid development of parallel and distributed computation frameworks for large-scale machine learning problems. One popular architecture is called parameter server (Dean et al., 2012; Li et al., 2014a), which consists of some worker nodes and server nodes. In this architecture, one or multiple master machines play the role of parameter servers, which maintain the model . All other machines are worker nodes that communicate with servers for training machine learning models. In particular, each worker has two types of requests: pull the current model from servers, and push the computed gradients to servers.
Before proposing our AsyB-ProxSGD algorithm in the next section, let us first introduce its synchronous version. Suppose we execute ProxSGD with a minibatch of 128 random samples on 8 workers and our goal is to use these 8 workers to train a model with 8 blocks
Note that in this scenario, all workers have to calculate the whole minibatch of the computation. Thanks to the decomposition property of proximal operator (Parikh et al., 2014), updating blocks can be done in parallel by servers, which corresponds to model parallelism in the literature (e.g., (Niu et al., 2011; Zhou et al., 2016; Pan et al., 2016)). Another type of parallelism is called data parallelism, in which each worker uses only part of random samples in the minibatch to compute a full gradient on (e.g., (Agarwal and Duchi, 2011; Ho et al., 2013)). We handle the issue of large by using stochastic algorithms, and our main focus is to handle the large model challenge by model parallelism.
3 AsyB-ProxSGD: Asynchronous Block Proximal Stochastic Gradient
We now present our main contribution in this paper, Asynchronous Block Proximal Stochastic Gradient (AsyB-ProxSGD) algorithm. Recall that asynchronous algorithm tries to alleviate random delays in computation and communication in different iterations. When model is big, it is hard to put the whole model in a single node (a single machine or device), and we have to split it into blocks. In this case, no single node maintains all of the parameters in memory and the nodes can update in parallel. The idea of model parallelism has been used in many applications, including deep learning (Dean et al., 2012) and factorization machine (Li et al., 2016).
We now formally introduce how our proposed algorithm works. The main idea of our proposed algorithm is to update block in parallel by different workers. In Algorithm 1, the first step is to ensure that the staleness is upper bounded by , which is essential to ensure convergence. Here we use to emphasize that the pulled model parameters may not be consistent with that stored on parameter servers. Since blocks are scattered on multiple servers, different blocks may be not consistent with updates and thus results in different delays. For example, suppose the server stores model , and we have two workers that updates and in parallel. Our expectation is that is updated by them and it becomes . However, in partially asynchronous protocol (PAP) where workers may skip synchronization, the following case may happen. At time 1, worker 1 pushes and pulls ; thus, worker 1 gets . At time 2, worker 2 pushes and pulls ; thus, worker 2 gets . We can see that the next update by worker 1 is based on , which has different delays on two blocks.
Let us discuss this in more implementation details for distributed clusters. In distributed clusters, we split a large model into blocks, and one server only maintains a single block to achieve model parallelism. Thus, different block may be updated at different iterations by different workers. The same phenomenon also exist in shared memory systems (i.e., a single machine with multiple CPU cores or GPUs, etc.). In these systems, the model is stored on the main memory and we can regard it as a “logical” server. In these systems, “reading” and “writing” can be done simultaneously, thus block may be “pulled” while it is being updated. In summary, model parameters may be inconsistent with any actual state on the server side.
In our algorithm, workers can update multiple blocks in parallel, and this is the spirit of model parallelism here. However, we note that on the server side, push request is usually more time consuming than pull request since it needs additional computations of the proximal operator. Therefore, we should let workers gather more stochastic gradients before pushing to the sever, and that is the reason we let each worker to compute gradients on all samples in a minibatch. That is, a worker iteration should compute
where is the index of block to be updated at iteration , and is the partial gradient w.r.t. block at model pulled at iteration and on sample .
4 Convergence Analysis
To facilitate the analysis of Algorithm 1, we rewrite it in an equivalent global view (from the server’s perspective), as described in Algorithm 2. In this algorithm, we define one iteration as the time to update any single block of and to successfully store it at the corresponding server. We use a counter to record how many times the model has been updated; increments every time a push request (model update request) is completed for a single block. Note that such a counter is not required by workers to compute gradients and is different from the counter in Algorithm 1— is maintained by each worker to count how many sample gradients have been computed locally.
In particular, for every worker, it takes stochastic sample gradients and aggregates them by averaging:
where is the random index chosen at iteration , denotes the delay vector, i.e., the delays of different blocks in when computing the gradient for sample at iteration , and is the delay of a specific block . In addition, we denote as a vector of model parameters pulled from the server side. Then, the server updates to using proximal gradient descent.
4.1 Assumptions and Metrics
Assumption 4 (Bounded delay).
There exists an constant such that for all , all values in delay vector are upper bounded by : for all .
Assumption 5 (Independence).
All random variables including selected indices and samples for all and in Algorithm 2 are mutually independent.
The assumption of bounded delay is to guarantee that gradients from workers should not be too old. Note that the maximum delay is roughly proportional to the number of workers in practice. We can enforce all workers to wait for others if it runs too fast, like step 3 of workers in Algorithm 1. This setting is also called partially synchronous parallel (Ho et al., 2013; Li et al., 2014b; Zhou et al., 2016) in the literature. Another assumption on independence can be met by selecting samples with replacement, which can be implemented using some distributed file systems like HDFS (Borthakur et al., 2008). These two assumptions are common in convergence analysis for asynchronous parallel algorithms, e.g., (Lian et al., 2015; Davis et al., 2016).
4.2 Theoretical Results
We present our main convergence theorem as follows:
Taking a closer look at Theorem 1, we can properly choose the step size as a constant value and obtain the following results on convergence rate:
Let the step length be a constant, i.e.,
If the delay bound satisfies
then the output of Algorithm 2 satisfies the following ergodic convergence rate as
Remark 1. (Linear speedup w.r.t. the staleness) When the maximum delay is bounded by , we can see that the gradient mapping decreases regardless of , and thus linear speedup is achievable (if other parameters are constants). In other words, we can see that by (12) and (13), as long as is no more than , the iteration complexity (from a global perspective) to achieve -optimality is , which is independent from .
Remark 2. (Linear speedup w.r.t. number of workers) We note that the delay bound is roughly proportional to the number of workers, so the total iterations w.r.t. can be an indicator of convergence w.r.t. the number of workers. As the iteration complexity is to achieve -optimality, and it is independent from , we can conclude that the total iterations will be shortened to of a single worker’s iterations if workers work in parallel. This shows that our algorithm nearly achieves linear speedup.
Remark 3. (Consistency with ProxSGD) When , our proposed AsyB-ProxSGD reduces to the vanilla proximal stochastic gradient descent (ProxSGD) (e.g., (Ghadimi et al., 2016)). Thus, the iteration complexity is according to (13), attaining the same result as that in (Ghadimi et al., 2016) yet without assuming increased minibatch sizes.
We now present numerical results to confirm that our proposed algorithms can be used to solve the challenging non-convex non-smooth problems in machine learning.
Setup: In our experiments, we consider the sparse logistic regression problem:
The -regularized logistic regression is widely used for large scale risk minimization. We consider the Avazu dataset
We use a cluster of 16 instances on Google Cloud. Each server or worker process uses just one core. Up to 8 instances serve as server nodes, while the other 8 instances serve as worker nodes. To show the advantage of asynchronous parallelism, we set up four experiments adopting 1, 2, 4, and 8 worker nodes, respectively. For all experiments, the whole dataset is shuffled and all workers have a copy of this dataset. When computing a stochastic gradient, each worker takes one minibatch of random samples from its own copy. This way, each sample is used by a worker with an equal probability empirically to mimic the scenario of our analysis.
We consider the suboptimality gap as our performance metric, which is defined as the gap between and . Here we estimate the optimal value by performing as many iterations as needed for convergence. The hyper-parameters are set as follows. For all experiments, the coefficients are set as and . We set the minibatch size to 8192. The step size is set according to at iteration .
We implemented our algorithm on MXNet (Chen et al., 2015), a flexible and efficient deep learning library with support for distributed machine learning. Due to the sparse nature of the dataset, the model is stored as a sparse ndarray, and in each iteration, a worker only pulls those blocks of that are actively related to its sampled minibatch, and then calculates the gradient w.r.t. this minibatch of data, and pushes the gradients for those activated blocks only.
Results: Empirically, Assumption 4 (bounded delays) are observed to hold for this cluster. In our experiments, the maximum delay does not exceed 100 iterations unless some worker nodes fail. Fig. 1(a) and Fig. 1(b) show the convergence behavior of AsyB-ProxSGD algorithm in terms of objective function values. We can clearly observe the convergence of our proposed algorithm, confirming that asynchrony with tolerable delays can still lead to convergence. In addition, the running time drops in trend when the number of workers increases.
For our proposed AsyB-ProxSGD algorithm, we are particularly interested in two kinds of speedup, namely, iteration speedup and running time speedup. If we need iterations (with sample gradients processed by servers) to achieve a certain suboptimality level using one worker, and iterations to achieve the same level using workers, then the iteration speedup is defined as (Lian et al., 2015). Note that iterations are counted on the server side, which is actually the number of minibatch gradients are processed by the server. On the other hand, the time speedup is simply defined as the ratio between the running time of using one worker and that of using workers to achieve the same suboptimality level. We summarize iteration and running time speedup in Table 1.
We further evaluate the relationship between the number of servers and the convergence behavior. Since the model has millions of parameters to be trained, storing the whole model in a single machine can be ineffective. In fact, from Fig. 2 we can even see nearly linear speedup w.r.t. the number of servers. The reason here is that, more servers can significantly decrease the length of request queue at the server side. When we have only one server, the blue dashed curve in Fig. 2(b) looks like a tilt staircase, and further investigation shows that some push requests take too long time to be processed. Therefore, we have to set more than one servers to observe parallel speedup in Fig. 1 so that servers are not the bottleneck.
6 Related Work
Robbins and Monro (1951) propose a classical stochastic approximation algorithm for solving a class of strongly convex problems, which is regarded as the seminal work of stochastic optimization problems. For nonconvex problems, Ghadimi and Lan (2013) prove that SGD has an ergodic convergence rate of , which is consistent with the convergence rate of SGD for convex problems. To deal with nonsmoothness, proximal gradient algorithm is widely considered and its stochastic variant is heavily studied for convex problems. Duchi and Singer (2009) show ProxSGD can converge at the rate of for -strongly convex objective functions when the step size diminishes during iterations. However, for nonconvex problems, rather limited studies on ProxSGD exist so far. To our best knowledge, the seminal work on ProsSGD for nonconvex problems was done by Ghadimi et al. (2016), in which the convergence analysis is based on the assumption of an increasing minibatch size.
Updating a single block in each iteration is also referred to as block coordinate methods in the literature. Block coordinate methods for smooth problems with separable, convex constraints (Tseng, 1991) and general nonsmooth regularizers (Razaviyayn et al., 2014; Davis, 2016; Zhou et al., 2016) are proposed. However, the study on stochastic coordinate descent is limited and existing work like (Liu and Wright, 2015) focuses on convex problems. Xu and Yin (2015) study block stochastic proximal methods for nonconvex problems. However, they only analyze convergence to stationary points assuming an increasing minibatch size, and the convergence rate is not provided. Davis et al. (2016) presents a stochastic block coordinate method, which is the closest one with our work in this paper. However, the algorithm studied in (Davis et al., 2016) depends on the use of a noise term with diminishing variance to guarantee convergence. Our convergence results of ProxSGD do not rely on the assumption of increasing batch sizes, variance reduction or the use of additional noise terms.
7 Concluding Remarks
In this paper, we propose AsyB-ProxSGD as an extension of the proximal stochastic gradient (ProxSGD) algorithm to asynchronous model parallelism setting. Our proposed algorithm aims at solving nonconvex nonsmooth optimization problems involved with large data size and model dimension. Theoretically, we prove that the AsyB-ProxSGD method has a convergence rate to critical points with the same order as ProxSGD, as long as workers have bounded delay during iterations. Our convergence result does not rely on minibatch size increasing, which is required in all existing works for ProxSGD (and its variants). We further prove that AsyB-ProxSGD can achieve linear speedup when the number of workers is bounded by . We implement AsyB-ProxSGD on Parameter Server and experiments on large scale real-world dataset confirms its effectiveness.
Appendix A Auxiliary Lemmas
Lemma 2 ([Ghadimi et al., 2016]).
For all , we have:
By the definition of proximal function, there exists a such that:
which proves the lemma.
By applying the above lemma for each block, we have the following corollary, which is useful in convergence analysis for Algorithm LABEL:alg:abpsgd-global.
For all , we have:
Lemma 3 ([Ghadimi et al., 2016]).
For all , if is a convex function, we have
For all , we have
Lemma 4 ([Ghadimi et al., 2016]).
For any and , we have
It can be obtained by directly applying Lemma 3 and the definition of gradient mapping.
Let . Then, for any and , we have
Lemma 5 ([Reddi et al., 2016]).
Suppose we define for some . Then for , the following inequality holds:
for all .
Suppose we define for some , and the index is chosen among indices with uniform distribution. For other , we assume . Then the following inequality holds:
for all .
From the definition of proximal operator, we have
By rearranging the above inequality, we have
Since is -Lipschitz, we have
Adding these two inequality we have
Lemma 6 (Young’s Inequality).
We recall and define some notations for convergence analysis in the subsequent. We denote as the average of delayed stochastic gradients and as the average of delayed true gradients, respectively:
Moreover, we denote as the difference between these two differences.
Appendix B Convergence analysis for B-PAPSG
To simplify notations, we use instead of in this section. Since we update one block only in each iteration, we define an auxiliary function as follows:
where the variables and take the -th coordinate.
b.1 Milestone Lemmas
Lemma 7 (Descent Lemma).
Suppose we have a sequence by Algorithm 2. Then, we have
Suppose we have a sequence by Algorithm 2. Then, we have