Hybrid-DCA: A Double Asynchronous Approach for Stochastic Dual Coordinate Ascent


Soumitra Pal    mitra@uconn.edu
Tingyang Xu    tix11001@engr.uconn.edu
Computer Science and Engineering, University of Connecticut, Storrs, CT 06269 USA
Tianbao Yang    tianbao-yang@uiowa.edu
Computer Science, University of Iowa, Iowa City, IA 52242 USA
Sanguthevar Rajasekaran    rajasek@engr.uconn.edu
Jinbo Bi    jinbo@engr.uconn.edu
Computer Science and Engineering, University of Connecticut, Storrs, CT 06269 USA
Abstract

In prior works, stochastic dual coordinate ascent (SDCA) has been parallelized in a multi-core environment where the cores communicate through shared memory, or in a multi-processor distributed-memory environment where the processors communicate through message passing. In this paper, we propose a hybrid SDCA framework for multi-core clusters, the most common high performance computing environment, which consists of multiple nodes, each having multiple cores and its own shared memory. We distribute the data across nodes, where each node solves a local problem in an asynchronous parallel fashion on its cores, and the local updates are then aggregated via an asynchronous across-node update scheme. The proposed double asynchronous method converges to a global solution for Lipschitz continuous loss functions, and at a linear convergence rate if a smooth convex loss function is used. Extensive empirical comparisons show that our algorithm scales better than the best known shared-memory methods and runs faster than previous distributed-memory methods. Big datasets, such as one of 280 GB from the LIBSVM repository, cannot be accommodated on a single node and hence cannot be solved by a shared-memory parallel algorithm. For such a dataset, our hybrid algorithm takes 30 seconds on 16 nodes each using 8 cores to reach a duality gap for which the best known distributed algorithms, such as CoCoA+, take more than 300 seconds on 16 nodes.


Keywords: dual coordinate descent, distributed computing, optimization

1 Introduction

The immense growth of data has made it important to solve large scale machine learning problems efficiently. It is necessary to take advantage of modern high performance computing (HPC) environments, such as multi-core settings where the cores communicate through shared memory, or multi-processor distributed-memory settings where the processors communicate by passing messages. In particular, a large class of supervised learning formulations, including support vector machines (SVMs), logistic regression, ridge regression and many others, solve the following generic regularized risk minimization (RRM) problem: given a set of instance-label pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ with data points $\mathbf{x}_i \in \mathbb{R}^d$,

$$\min_{\mathbf{w} \in \mathbb{R}^d} \; P(\mathbf{w}) := \frac{1}{n} \sum_{i=1}^{n} \phi_i(\mathbf{w}^\top \mathbf{x}_i, y_i) + \lambda g(\mathbf{w}), \qquad (1)$$

where $y_i$ is the label for the data point $\mathbf{x}_i$, $\mathbf{w} \in \mathbb{R}^d$ is the linear predictor to be optimized, $\phi_i$ is a loss function that is convex with respect to its first argument, and $\lambda > 0$ is a regularization parameter that balances between the loss and a regularizer $g(\mathbf{w})$, especially the $\ell_2$-norm penalty $g(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$.

Many efficient sequential algorithms have been developed in the past decades to solve (1), e.g., stochastic gradient descent (SGD) Zhang (2004) and the alternating direction method of multipliers (ADMM) Boyd et al. (2011). In particular, the (stochastic) dual coordinate ascent (DCA) algorithm Shalev-Shwartz and Zhang (2013) has been one of the most widely used algorithms for solving (1). It efficiently optimizes the following dual formulation (2):

$$\max_{\boldsymbol{\alpha} \in \mathbb{R}^n} \; D(\boldsymbol{\alpha}) := \frac{1}{n} \sum_{i=1}^{n} -\phi_i^*(-\alpha_i) - \lambda g^*\!\Big(\frac{1}{\lambda n} X \boldsymbol{\alpha}\Big), \qquad (2)$$
$$\mathbf{w}(\boldsymbol{\alpha}) = \nabla g^*\!\Big(\frac{1}{\lambda n} X \boldsymbol{\alpha}\Big) = \frac{1}{\lambda n} X \boldsymbol{\alpha}, \qquad (3)$$

where $X = [\mathbf{x}_1, \dots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ is the data matrix, $\phi_i^*$ and $g^*$ are the convex conjugates of $\phi_i$ and $g$, respectively, defined as, e.g., $\phi_i^*(u) = \max_{z} \big(zu - \phi_i(z, y_i)\big)$, and the last equality in (3) holds for $g(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$. It is known that if $\boldsymbol{\alpha}^*$ is an optimal dual solution then $\mathbf{w}(\boldsymbol{\alpha}^*)$ is an optimal primal solution and $P(\mathbf{w}(\boldsymbol{\alpha}^*)) = D(\boldsymbol{\alpha}^*)$. The dual objective has a separate dual variable associated with each training data point. Stochastic DCA updates the dual variables, one at a time, while maintaining the primal variable by calculating (3) from the dual variables.
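For concreteness, the following minimal C++ sketch shows one sequential SDCA epoch for the hinge loss. It is our own illustrative code, written in the equivalent liblinear parameterization $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ with box constraint $\alpha_i \in [0, C]$ and $C = 1/(\lambda n)$; each step solves the single-variable dual problem in closed form (Hsieh et al., 2008) and maintains the primal-dual relationship.

// One sequential SDCA epoch for the hinge loss (L1-SVM dual).
#include <algorithm>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

struct Example { std::vector<std::pair<int, double>> x; double y; };  // sparse row, label in {-1,+1}

void sdca_epoch(const std::vector<Example>& data, std::vector<double>& alpha,
                std::vector<double>& w, double C /* = 1/(lambda*n) */, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, data.size() - 1);
    for (std::size_t t = 0; t < data.size(); ++t) {
        const std::size_t i = pick(rng);
        const Example& e = data[i];
        double xi_sq = 0.0, margin = 0.0;
        for (const auto& [j, v] : e.x) { xi_sq += v * v; margin += w[j] * v; }
        if (xi_sq == 0.0) continue;
        // Closed-form solution of the single-variable dual problem, projected onto [0, C].
        const double grad = e.y * margin - 1.0;
        const double new_alpha = std::min(std::max(alpha[i] - grad / xi_sq, 0.0), C);
        const double delta = new_alpha - alpha[i];
        alpha[i] = new_alpha;
        // Maintain the primal-dual relationship w = sum_i alpha_i * y_i * x_i.
        for (const auto& [j, v] : e.x) w[j] += delta * e.y * v;
    }
}

The distributed and parallel variants discussed below differ only in which coordinates each core owns and in how the primal vector is kept up to date.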

Recently, many efforts have been undertaken to solve (1) in a distributed or parallel framework. It has been shown that distributed DCA algorithms have comparable and sometimes even better convergence than SGD-based or ADMM-based distributed algorithms Yang (2013). The distributed DCA algorithms can be grouped into two sets. The first set contains synchronous algorithms in which a random dual variable is updated by each processor and the primal variables are synchronized across the processors in every iteration Jaggi et al. (2014); Ma et al. (2015b); Yang (2013). This approach incurs a large communication overhead. The second set of algorithms avoids communication overhead by exploiting the shared memory in a multi-core setting Hsieh et al. (2015), where the primal variables are stored in primary memory shared across all the processors. Further speedups have been obtained by using (asynchronous) atomic memory operations instead of costly locks for shared memory updates Hsieh et al. (2015); Peng et al. (2015). Nevertheless, this approach is difficult to scale up for big datasets that cannot be fully accommodated in the shared memory. This leads to a challenging question: how do we scale up the asynchronous shared-memory approach for big data while maintaining the speedup?

Figure 1: (a) A simplified view of a modern HPC system and (b) algorithms on this architecture.

We address this challenge by proposing and implementing a hybrid strategy. Modern HPC platforms can be viewed as a collection of nodes interconnected through a network, as shown in Fig. 1(a). Each node contains a memory shared among its processing cores. Our strategy exploits this architecture by equally distributing the data across the local shared memories of the nodes. Each of the cores within a node runs a computing thread that asynchronously updates a random dual variable from those associated with the data allocated to the node. Each node also runs a communicating thread. One of the communicating threads is designated as master and the rest are workers. After every round of local iterations in each computing thread, each worker thread sends the local update to the master. After accumulating the local updates from a subset of the workers, the master broadcasts the global update to the contributing workers. However, to avoid a slower worker falling too far behind, the master ensures that within a bounded number of consecutive global updates there is at least one local update from each worker. Fig. 1(b) shows how our scheme generalizes the existing approaches: with a single node, our setup coincides with the shared-memory multi-core setting Hsieh et al. (2015), and with a single core per node and fully synchronous merging, it coincides with the synchronous algorithms in the distributed-memory setting Jaggi et al. (2014); Ma et al. (2015b); Yang (2013). With a proper adjustment of the parameters, our strategy can balance the computation time of the first setting with the communication time of the second one, while ensuring scalability in big data applications.

Thus, our contributions are: 1) we propose and analyze Hybrid-DCA, a hybrid asynchronous shared-memory and asynchronous distributed-memory implementation of the widely used DCA algorithm to solve (1); 2) we prove a strong guarantee of convergence for Lipschitz continuous loss functions, and further a linear convergence rate when a smooth convex loss function is used; and 3) experimental results using our light-weight OpenMP+MPI implementation show that our algorithm is much faster than existing distributed-memory algorithms Jaggi et al. (2014); Ma et al. (2015b), and easily scales up with the volume of data in comparison with the shared-memory algorithms Hsieh et al. (2015), whose shared memory size is limited.

2 Related Work

Sequential Algorithms. SGD is the oldest and simplest method for solving problem (1). Though SGD is easy to implement and converges to modest accuracy quickly, it requires a long tail of iterations to reach ‘good’ solutions and also requires adjusting a step-size parameter. On the other hand, SDCA methods are free of learning-rate parameters and have a faster convergence rate toward the end Moulines and Bach (2011); Needell et al. (2014). A modified SGD has also been proposed with faster convergence, which switches to SDCA after quickly reaching a modest solution Shalev-Shwartz and Zhang (2013). Recently, ‘variance reduced’ modifications of the original SGD have also caught attention. These modifications estimate gradients with small variance as the iterates approach an optimal solution. Mini-batch algorithms have also been proposed to update several dual variables (data points) in a batch rather than a single data point per iteration Takac et al. (2013). Mini-batch versions of both SGD and SDCA have slower convergence when the batch size increases Richtárik and Takáč (2013); Shalev-Shwartz and Zhang (2016). All these sequential algorithms become ineffective when the datasets get bigger.

Distributed Algorithms. In the early single communication scheme McWilliams et al. (2014); Heinze et al. (2015); Mcdonald et al. (2009), a dataset is ‘decomposed’ into smaller parts that can be solved independently. The final solution is reached by ‘accumulating’ the partial solutions using a single round of communications. This method has limited utility because most datasets cannot be decomposed in such a way. Using the primal-dual relationship (3), fully distributed DCA algorithms were later developed in which each processor updates a separate block of dual variables, which is then used to update the primal variable, and the primal variable is synchronized across all processors (e.g., CoCoA Jaggi et al. (2014)). To trade off communications vs. computations, a processor can solve its subproblem with multiple dual updates before synchronizing the primal variable (e.g., CoCoA+ Ma et al. (2015b), DisDCA Yang (2013)). In Yang (2013); Ma et al. (2015b), a more general framework is proposed in which the subproblem can be solved using not only SDCA but also any other sequential solver that guarantees an approximation of the local solution of a specified quality. Nevertheless, the synchronized update to the primal variables has the inherent drawback that the overall algorithm runs at the speed of the slowest processor even when there are fast processors Agarwal et al. (2014).

Parallel Algorithms. Multi-core shared-memory systems have also been exploited, where the primal variables are maintained in a shared memory, removing the communication cost. However, updates to shared memory require synchronization primitives, such as locks, which again slow down computation. Recent methods Hsieh et al. (2015); Liu et al. (2015) avoid locks by exploiting (asynchronous) atomic memory updates in modern memory systems. There is even a wild version in Hsieh et al. (2015) that takes arbitrarily one of the simultaneous updates. Though the shared-memory algorithms are faster than the distributed versions, they have the inherent drawback of not being scalable, as there can be only a few cores on a processor board.

Other Distributed Methods for RRM. Besides distributed DCA methods, there are several recent distributed versions of other algorithms with faster convergence, including distributed Newton-type methods (DISCO Zhang and Xiao (2015), DANE Shamir et al. (2014)) and the distributed stochastic variance reduced gradient method (DSVRG Lee et al. (2015)). It has been shown that they can achieve solutions of the same accuracy using fewer rounds of communication, however, with additional computational overhead. In particular, DISCO and DANE need to solve a linear system in each round, which can be very expensive in higher dimensions. DSVRG requires each machine to load and store a second subset of the data sampled from the original training data, which also increases its running time.

ADMM Boyd et al. (2011) and quasi-Newton methods such as L-BFGS also have distributed solutions. These methods have low communication cost; however, their inherent need to compute the full batch gradient leaves no computation vs. communication trade-off. In the context of consensus optimization, Zhang and Kwok (2014) give an asynchronous distributed ADMM algorithm, but it does not directly apply to solving (1).

To the best of our knowledge, this paper is the first to propose, implement and analyze a hybrid approach exploiting modern HPC architecture. Our approach is the amalgamation of three different ideas – 1) CoCoA+/DisDCA distributed framework, 2) asynchronous multi-core shared-memory solver Hsieh et al. (2015) and 3) asynchronous distributed approach Zhang and Kwok (2014) – taking the best of each of them. In a sense ours is the first algorithm which asynchronously uses updates which themselves have been computed using asynchronous methods.

3 Algorithm

At the core of our algorithm, the data is distributed across $K$ nodes, and each node, called a worker, repeatedly solves a perturbed dual formulation on its data partition and sends the local update to one of the nodes designated as the master, which merges the local updates and sends back the accumulated global update to the workers so that they can solve the subproblem once again, unless global convergence is reached. Let $\mathcal{P}_k$ denote the indices of the data points and the dual variables residing on node $k$, and let $n_k = |\mathcal{P}_k|$. For any $\boldsymbol{\alpha} \in \mathbb{R}^n$, let $\boldsymbol{\alpha}_{[k]}$ denote the vector in $\mathbb{R}^n$ defined in such a way that its $i$-th component equals $\alpha_i$ if $i \in \mathcal{P}_k$ and $0$ otherwise. Let $X_{[k]}$ denote the matrix consisting of the columns of $X$ indexed by $\mathcal{P}_k$ and replaced with zeros in all other columns.
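For illustration, suppose $n = 4$ and node 1 holds $\mathcal{P}_1 = \{1, 2\}$ (example values only); then for $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \alpha_3, \alpha_4)^\top$ the block notation gives
$$\boldsymbol{\alpha}_{[1]} = (\alpha_1, \alpha_2, 0, 0)^\top, \qquad X_{[1]} = [\mathbf{x}_1 \;\; \mathbf{x}_2 \;\; \mathbf{0} \;\; \mathbf{0}], \qquad X\boldsymbol{\alpha} = \sum_{k=1}^{K} X_{[k]} \boldsymbol{\alpha}_{[k]}.$$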

Ideally, the dual problem solved by node $k$ is (2) with $\boldsymbol{\alpha}$ and $X$ replaced by $\boldsymbol{\alpha}_{[k]}$ and $X_{[k]}$, respectively, and hence is independent of the other nodes. However, following the efficient practical implementation idea in Yang (2013); Ma et al. (2015b), we let the workers communicate among themselves a vector $\mathbf{v}$, an estimate of $\frac{1}{\lambda n} X \boldsymbol{\alpha}$ that summarizes the last known global solution $\boldsymbol{\alpha}$. Also following Yang (2013); Ma et al. (2015b), for faster convergence each worker in our algorithm solves the following perturbed local dual problem, which we henceforth call the subproblem:

$$\max_{\Delta\boldsymbol{\alpha}_{[k]} \in \mathbb{R}^n} \; \mathcal{G}_k^{\sigma'}(\Delta\boldsymbol{\alpha}_{[k]}; \mathbf{v}, \boldsymbol{\alpha}_{[k]}) := -\frac{1}{n} \sum_{i \in \mathcal{P}_k} \phi_i^*\big(-\alpha_i - (\Delta\boldsymbol{\alpha}_{[k]})_i\big) - \frac{\lambda}{\kappa}\, g^*(\mathbf{v}) - \frac{1}{n} \mathbf{v}^\top X_{[k]} \Delta\boldsymbol{\alpha}_{[k]} - \frac{\lambda \sigma'}{2} \Big\| \frac{1}{\lambda n} X_{[k]} \Delta\boldsymbol{\alpha}_{[k]} \Big\|^2, \qquad (4)$$

where $\Delta\boldsymbol{\alpha}_{[k]}$ denotes the local (incremental) update to the dual variables on node $k$, and the scaling parameter $\sigma' \geq 1$ measures the difficulty of solving the given data partition (see Yang (2013); Ma et al. (2015b)) and must be chosen such that

$$\sigma' \;\geq\; \gamma \, \max_{\boldsymbol{\alpha} \in \mathbb{R}^n} \frac{\|X \boldsymbol{\alpha}\|^2}{\sum_{k=1}^{K} \|X_{[k]} \boldsymbol{\alpha}_{[k]}\|^2}, \qquad (5)$$

where the aggregation parameter $\gamma \in (0, 1]$ is the weight given by the master to each of the $\kappa$ local updates from the contributing workers while computing the global update (the barrier bound $\kappa$ is defined in Section 3.2). The second term in the objective of our subproblem has denominator $\kappa$ instead of $K$. Unlike the synchronous all-reduce approach in Ma et al. (2015b), our asynchronous method merges the local updates from only $\kappa$ out of the $K$ nodes in each global update. By Lemma 3.2 in Ma et al. (2015b), $\sigma' = \gamma \kappa$ is a safe choice to satisfy condition (5).
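For concreteness, if $S$ denotes the set of $\kappa$ contributing workers in a given round, the master's merge step amounts to the additive updates
$$\boldsymbol{\alpha} \;\leftarrow\; \boldsymbol{\alpha} + \gamma \sum_{k \in S} \Delta\boldsymbol{\alpha}_{[k]}, \qquad \mathbf{v} \;\leftarrow\; \mathbf{v} + \gamma \sum_{k \in S} \frac{1}{\lambda n} X_{[k]} \Delta\boldsymbol{\alpha}_{[k]};$$
this is a sketch in the notation above, and the precise bookkeeping is given in Algorithms 1 and 2 below.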

3.1 Asynchronous updates by cores in a worker node

Input: Initial $\boldsymbol{\alpha}_{[k]} = \mathbf{0}$ and $\mathbf{v} = \mathbf{0}$, data partition $\mathcal{P}_k$,
scaling parameter $\sigma'$, aggregation parameter $\gamma$
1   logically divide $\mathcal{P}_k$ into subparts $\mathcal{P}_{k,1}, \dots, \mathcal{P}_{k,P}$, one per core;
2   for $t = 1, 2, \dots$ do
3       $\Delta\boldsymbol{\alpha}_{[k]} \leftarrow \mathbf{0}$;   $\Delta\mathbf{v}_k \leftarrow \mathbf{0}$;
4       for cores $c = 1, \dots, P$ in parallel do
5           for $h = 1, \dots, H$ do
6               randomly pick $i$ from $\mathcal{P}_{k,c}$;
7               compute the step size $\delta_i$ by solving the single-variable problem (6);
8               $(\Delta\boldsymbol{\alpha}_{[k]})_i \leftarrow (\Delta\boldsymbol{\alpha}_{[k]})_i + \delta_i$;
9               $\Delta\mathbf{v}_k \leftarrow \Delta\mathbf{v}_k + \frac{1}{\lambda n}\,\delta_i\,\mathbf{x}_i$   (atomic update);
10      send $\Delta\mathbf{v}_k$ to the master;
11      receive the global update $\Delta\bar{\mathbf{v}}$ from the master;
12      $\boldsymbol{\alpha}_{[k]} \leftarrow \boldsymbol{\alpha}_{[k]} + \gamma\,\Delta\boldsymbol{\alpha}_{[k]}$;   $\mathbf{v} \leftarrow \mathbf{v} + \Delta\bar{\mathbf{v}}$;
Algorithm 1 Hybrid-DCA: Worker node $k$

In each communication round, each worker solves its subproblem using the parallel asynchronous DCA method of Hsieh et al. (2015) on its $P$ cores. Let the data partition $\mathcal{P}_k$ stored in the shared memory be logically divided into $P$ subparts, where subpart $\mathcal{P}_{k,c}$, $c = 1, \dots, P$, is exclusively used by core $c$. In each of its $H$ iterations, core $c$ chooses a random coordinate $i \in \mathcal{P}_{k,c}$ and updates $\boldsymbol{\alpha}$ in the $i$-th unit direction by a step size $\delta_i$ computed by solving a single-variable optimization problem:

$$\delta_i = \arg\max_{\delta \in \mathbb{R}} \; \mathcal{G}_k^{\sigma'}\big(\Delta\boldsymbol{\alpha}_{[k]} + \delta\, \mathbf{e}_i; \mathbf{v}, \boldsymbol{\alpha}_{[k]}\big), \qquad (6)$$

which has a closed-form solution for SVM problems Fan et al. (2008), and a solution using an iterative solver for logistic regression problems Yu et al. (2011). The local updates to $\mathbf{v}$ are also maintained appropriately. While the coordinates used by any two cores, and hence the corresponding updates to $\boldsymbol{\alpha}$, are independent of each other, there might be conflicts in the updates to $\mathbf{v}$ if the corresponding columns of $X_{[k]}$ have nonzero values in some common row. We use lock-free atomic memory updates to handle such conflicts. When all cores complete their $H$ iterations, worker $k$ sends the accumulated update $\Delta\mathbf{v}_k$ from the current round to the master, waits until it receives the global update to $\mathbf{v}$ from the master, and repeats for another round unless the master indicates termination.
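The lock-free update can be realized with OpenMP atomic adds; the following sketch is our own simplification (variable names are illustrative) of how one core applies the contribution of a single coordinate update to the node's shared accumulated update vector.

// Apply one coordinate's contribution delta to the shared update vector without
// locks.  Each element of the sparse column x_i is added atomically, so
// concurrent cores touching a common row cannot lose each other's writes.
#include <omp.h>
#include <utility>
#include <vector>

void apply_update_atomic(std::vector<double>& delta_v,                  // shared Delta-v_k
                         const std::vector<std::pair<int, double>>& xi, // sparse column x_i
                         double delta, double inv_lambda_n) {
    for (const auto& [row, val] : xi) {
        const double add = inv_lambda_n * delta * val;
        #pragma omp atomic
        delta_v[row] += add;
    }
}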

3.2 Merging updates from workers by master

If the master had to wait for the updates from all the workers, it could compute the global update only after the slowest worker finished. To avoid this problem, we use a bounded barrier: in each round, the master waits for updates from only a subset of $\kappa$ workers, and sends them back the global update to $\mathbf{v}$. However, due to this relaxation, there might be some slow workers with an out-of-date $\mathbf{v}$. When updates from such workers are merged by the master, they may degrade the quality of the global solution and hence may cause slow convergence or even divergence. We ensure sufficient freshness of the updates using a bounded delay: the master makes sure that no worker has a stale update older than $\tau$ rounds. This asynchronous approach has two benefits: 1) the overall progress is no longer bottlenecked by the slowest processor, and 2) the total number of communications is reduced. On the flip side, convergence may slow down for a very small $\kappa$ or a very large $\tau$.

Input: Initial $\mathbf{v} = \mathbf{0}$, aggregation parameter $\gamma$,
barrier bound $\kappa$, delay bound $\tau$
1   $S \leftarrow \emptyset$;   $r_k \leftarrow 0$ for every worker $k$;
2   for $t = 1, 2, \dots$ do
3       while $|S| < \kappa$ or $\min_k r_k \leq t - \tau$ do
4           receive an update $\Delta\mathbf{v}_k$ from some worker $k$;
5           $S \leftarrow S \cup \{k\}$;   $r_k \leftarrow t$;
6       $S_t \leftarrow$ the $\kappa$ workers in $S$ with the oldest updates;
7       $\Delta\bar{\mathbf{v}} \leftarrow \gamma \sum_{k \in S_t} \Delta\mathbf{v}_k$;   $\mathbf{v} \leftarrow \mathbf{v} + \Delta\bar{\mathbf{v}}$;
8       foreach $k \in S_t$ do $S \leftarrow S \setminus \{k\}$;
9       broadcast $\Delta\bar{\mathbf{v}}$ to all workers in $S_t$;
Algorithm 2 Hybrid-DCA: Master node
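An MPI realization of the master's bounded-barrier loop can look like the sketch below. This is a simplified illustration under our own assumptions (dense buffers, one age counter per worker, and merging every update that arrives while waiting, rather than exactly the $\kappa$ oldest); the tags and helper variables are of our choosing.

// One master round: keep receiving worker updates until at least kappa have
// arrived and no worker's last merged contribution is older than tau rounds,
// then merge with weight gamma and send the global increment back.
#include <mpi.h>
#include <algorithm>
#include <vector>

void master_round(std::vector<double>& v, int dim, int kappa, int tau, double gamma,
                  std::vector<int>& age /* rounds since worker k (rank k) was last merged */) {
    std::vector<double> delta(dim), merged(dim, 0.0);
    std::vector<int> contributors;
    while (static_cast<int>(contributors.size()) < kappa ||
           *std::max_element(age.begin(), age.end()) >= tau) {
        MPI_Status st;
        MPI_Recv(delta.data(), dim, MPI_DOUBLE, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
        const int k = st.MPI_SOURCE;                  // worker ranks are 1..K
        for (int j = 0; j < dim; ++j) merged[j] += gamma * delta[j];
        contributors.push_back(k);
        age[k - 1] = 0;                               // this worker's update is now fresh
    }
    for (int& a : age) ++a;                           // everyone else ages by one round
    for (int j = 0; j < dim; ++j) v[j] += merged[j];  // apply the global update
    for (int k : contributors)                        // return the increment to contributors
        MPI_Send(merged.data(), dim, MPI_DOUBLE, k, 1, MPI_COMM_WORLD);
}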

Example: Figure 2 shows a possible sequence of important events in a run of our algorithm on a small dataset using three worker nodes, each having two cores, such that each core works with only a few data points. The activities in solving the subproblem using the local iterations of a round are shown in a rectangular box. For the first subproblem, core 1 and core 2 in worker 1 randomly select dual coordinates whose corresponding data points have nonzero entries in certain dimensions, some of which overlap. Each core first reads the entries of $\mathbf{v}$ corresponding to these nonzero data dimensions, then computes its update, and finally applies the update. The atomic memory updates ensure that all conflicting writes, such as in the first write cycle, complete correctly. At the end of the local iterations of each core, worker 1 sends its accumulated update to the master, whose responsibility is shared by one of the nodes but which is shown separately in the figure. By this time, the faster workers 2 and 3 have already completed 3 rounds. With barrier bound $\kappa = 2$, the master takes the first two updates, from workers 2 and 3, and computes the global update. However, because of the delay bound, the master holds back the third updates from workers 2 and 3 until the first update from worker 1 reaches the master. The subsequent events of the run are omitted from the figure.

Figure 2: Sequence of important events in an example run of Hybrid-DCA with three workers, two cores each, and barrier bound $\kappa = 2$.

4 Convergence Analysis

In this section we prove the convergence of the global solution computed by our hybrid algorithm. For ease of exposition we prove the results for $g(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2$; the proofs can be extended similarly to other strongly convex regularizers $g$. The analysis is divided into two parts. First we show that the solution of the subproblem computed by each node locally is indeed not far from its optimum. Using this result on the subproblem, we next show the convergence of the global solution. Though our proofs for the two parts are based on the works of Hsieh et al. (2015) and Ma et al. (2015a), respectively, we need to make significant adjustments in the proofs because our modified framework handles two cascaded levels of asynchronous updates. Here we outline the modifications; the complete proofs are given in the appendix.

4.1 Near optimality of the solution to the local subproblem

Definition (Θ-approximate solution)

For given $\boldsymbol{\alpha}$ and $\mathbf{v}$, a solution $\Delta\boldsymbol{\alpha}_{[k]}$ to the subproblem (4) is said to be $\Theta$-approximate, $\Theta \in [0, 1)$, if

$$\mathbb{E}\big[\mathcal{G}_k^{\sigma'}(\Delta\boldsymbol{\alpha}_{[k]}^*) - \mathcal{G}_k^{\sigma'}(\Delta\boldsymbol{\alpha}_{[k]})\big] \;\leq\; \Theta\, \big(\mathcal{G}_k^{\sigma'}(\Delta\boldsymbol{\alpha}_{[k]}^*) - \mathcal{G}_k^{\sigma'}(\mathbf{0})\big), \qquad (7)$$

where $\Delta\boldsymbol{\alpha}_{[k]}^*$ is the optimum solution to (4).

The main challenge in using the results of Hsieh et al. (2015) to prove (7) for the solution returned by the parallel asynchronous stochastic DCA solver used by each worker in Algorithm 1 is to tackle the following changes: 1) the solver here solves only a part of the dual problem, and 2) the subproblem is now perturbed (see Section 3). While the first change is simply handled by considering the updates by the cores in worker $k$ only, the second change needs modifications in each step of the proof in Hsieh et al. (2015), including the definitions of the proximal operators and the assumptions. Here we rewrite the statements of the modified lemmas and assumptions.

We consider the updates made in the current round by all the cores in ascending order of the actual time point (marked in Figure 2) at which the step size of the update is computed (breaking ties arbitrarily), and prove (7) by showing sufficient progress between two successive updates in this order, under some assumptions similar to those used in Hsieh et al. (2015).

Because of the atomic updates, a step size computation may not include all the latest updates; however, we assume that all sufficiently old updates have already been written into the shared memory. The analysis also uses a normalized data matrix in which each row is scaled to unit norm, together with constants bounding the column norms of the local and of the global data matrices.

Assumption 1 (Bounded delay of updates)

There is a constant $q$ such that, when the step size of the $j$-th update is computed, all updates before the $(j - q)$-th update have already been written into the shared memory.     (8)
Definition (Global error bound)

For a convex function $f$, the optimization problem $\min_{\mathbf{x}} f(\mathbf{x})$ admits a global error bound if there is a constant $\eta$ such that

$$\|\mathbf{x} - P_S(\mathbf{x})\| \;\leq\; \eta\, \|T(\mathbf{x}) - \mathbf{x}\|, \qquad (9)$$

where $P_S$ is the Euclidean projection onto the set $S$ of optimal solutions, and $T$ is the operator defined by one exact coordinate-wise minimization of $f$ from $\mathbf{x}$, i.e., $T_i(\mathbf{x}) = \arg\min_{u} f\big(\mathbf{x} + (u - x_i)\mathbf{e}_i\big)$ for each coordinate $i$.

The optimization problem admits a relaxed condition, called global error bound from the beginning, if (9) holds for any $\mathbf{x}$ satisfying $f(\mathbf{x}) \leq M$ for some constant $M$.

Assumption 2

The local subproblem formulation (4) admits a global error bound from the beginning for any $\mathbf{v}$ and any update $\Delta\boldsymbol{\alpha}_{[k]}$.

The global error bound helps prove that our subproblem solver achieves significant improvement after each update. It has been shown that when the loss functions are the hinge loss or the squared hinge loss, the local subproblem formulation (4) does indeed satisfy the global error bound condition Hsieh et al. (2015).

Assumption 3

The local subproblem objective (4) is Lipschitz continuous.

Assumption 4 (Boundedness)

Lemma

When Assumptions 1-4 hold, the solutions computed in two successive updates by the local subproblem solver have a linear convergence rate in expectation, i.e.,

$$\mathbb{E}\big[\mathcal{G}_k^{\sigma'}(\Delta\boldsymbol{\alpha}_{[k]}^*) - \mathcal{G}_k^{\sigma'}(\Delta\boldsymbol{\alpha}^{(j+1)})\big] \;\leq\; \rho\, \mathbb{E}\big[\mathcal{G}_k^{\sigma'}(\Delta\boldsymbol{\alpha}_{[k]}^*) - \mathcal{G}_k^{\sigma'}(\Delta\boldsymbol{\alpha}^{(j)})\big],$$

where $\Delta\boldsymbol{\alpha}^{(j)}$ is the iterate after the $j$-th update and $\rho < 1$ is a constant that depends on the delay bound $q$, the data, and the size of the largest data subpart. Moreover, the output of a round is a $\Theta$-approximate solution for

$$\Theta = \rho^{\,PH}. \qquad (10)$$

4.2 Convergence of global solution

Although we showed that the local subproblem solver outputs a $\Theta$-approximate solution, we cannot directly apply the results of Ma et al. (2015a) to the global solution, because our algorithm uses updates from only a subset of workers, unlike the synchronous all-reduce of the updates from all workers used in Ma et al. (2015a). We need to handle this asynchronous nature of the global updates, just as we handled the asynchronous updates for the local subproblem. Let us consider the global updates in the order in which the master computed them (marked in Figure 2), and for each global update consider the value of $\boldsymbol{\alpha}$ distributed across all the nodes at the time the master computed that update. Local updates merged sufficiently many rounds earlier are guaranteed to be included in this value, whereas the most recent local updates may not be; by the design of our algorithm, the delay bound $\tau$ limits how many rounds any worker's contribution can lag behind the global iterate.

Lemma

For any dual $\boldsymbol{\alpha}$, primal $\mathbf{w}(\boldsymbol{\alpha})$, and real values $\gamma$ and $\sigma'$ satisfying (5), it holds that

(11)
Assumption 5

There exists a constant such that

(12)
Lemma

If the conjugate functions $\phi_i^*$ are all strongly convex and Assumptions 1-5 are satisfied, then for any $\gamma$ satisfying (5), any round of Algorithm 2 satisfies

(13)

where the quantities involved are given by

(14)
(15)

Using the main results in Ma et al. (2015b) and combining the lemma of Section 4.1 with the lemmas of Section 4.2 gives us the following two convergence results, one for smooth loss functions and the other for Lipschitz continuous loss functions. The theorems are stated in terms of the quantity $\Theta$ given by (10).

Theorem

If the loss functions are all smooth, then in $O\!\big(\frac{1}{\gamma(1-\Theta)} \log\frac{1}{\epsilon_D}\big)$ rounds Algorithm 2 finds a solution whose dual objective is at most $\epsilon_D$ from the optimum, where $\Theta$ is given by (10). Furthermore, in $O\!\big(\frac{1}{\gamma(1-\Theta)} \log\frac{1}{\epsilon}\big)$ rounds it finds a solution with duality gap at most $\epsilon$.

Theorem

If the loss functions are all Lipschitz continuous, then in $O\!\big(\frac{1}{\gamma(1-\Theta)} \cdot \frac{1}{\epsilon}\big)$ rounds Algorithm 2 finds a solution with duality gap at most $\epsilon$ for the average iterate, where $\Theta$ is given by (10).

The second theorem establishes convergence for Lipschitz continuous loss functions, and the first establishes a linear convergence rate for smooth convex loss functions.

5 Communication Cost Analysis

In each communication round, the algorithms based on synchronous updates on all $K$ nodes require $2K$ transmissions, each consisting of all values of $\Delta\mathbf{v}_k$ or $\mathbf{v}$. Half of these transmissions are from the workers to the master and the rest are from the master to the workers. In contrast, our asynchronous update scheme requires only $2\kappa$ transmissions in each round.
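For illustration, with $K = 16$ worker nodes and barrier bound $\kappa = 8$ (values chosen only for this example), a synchronous round involves $2K = 32$ messages, whereas a Hybrid-DCA round involves only $2\kappa = 16$.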

6 Experimental Results

We implemented our algorithm in C++, where each node runs exactly one MPI process, which in turn runs one OpenMP thread on each core available within the node, and the main thread handles the inter-node communication. We evaluated the hinge loss, though the method applies to other loss functions too, on four datasets from the LIBSVM website, as shown in Table 1, using up to 16 nodes of the Hornet cluster at the University of Connecticut, where each node has 24 Xeon E5-2690 cores and 128 GB of main memory. Table 1 also gives the total number of non-zero entries in the data matrix for each of the four datasets we used.
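The process/thread layout just described can be sketched as follows; this skeleton is illustrative only (rank 0 acting as master, a fixed number of rounds, and all omitted data structures are our own simplifications).

// One MPI process per node; the master (rank 0) merges updates while every
// other rank runs OpenMP compute threads and communicates from its main thread.
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv) {
    int provided = 0, rank = 0, nprocs = 0;
    // FUNNELED: only the main thread of each process issues MPI calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    (void)nprocs;                               // K = nprocs - 1 worker nodes

    const int max_rounds = 100;                 // stipulated number of rounds
    if (rank == 0) {
        // Master: run the bounded-barrier merge loop (see the earlier sketch).
    } else {
        for (int round = 0; round < max_rounds; ++round) {
            #pragma omp parallel
            {
                // Each core asynchronously performs its share of the H local
                // coordinate updates on this node's data partition.
            }
            // Main thread only: send the accumulated local update to the
            // master and wait for the merged global update.
        }
    }
    MPI_Finalize();
    return 0;
}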

We experimented with the following algorithms: 1) Baseline: an implementation of sequential DCA Hsieh et al. (2008), 2) CoCoA+ Ma et al. (2015a), 3) PassCoDe Hsieh et al. (2015), and 4) our Hybrid-DCA. The algorithm parameters were varied as follows: 1) the regularization parameter $\lambda$, 2) the number of local iterations $H$, 3) the aggregation parameter $\gamma$, and 4) the scaling parameter $\sigma'$ for CoCoA+ and Hybrid-DCA, respectively, as recommended in Ma et al. (2015a). For different combinations of these parameters we observed similar patterns of results and hence report a representative setting only. The details of the other parameter values are given later.

Dataset       LIBSVM name      #examples     #features     #nonzeros          File size
rcv1          rcv1_test        677,399       47,236        49,556,258         1.2 GB
webspam       webspam          280,000       16,609,143    1,045,051,224      20 GB
kddb          kddb train       19,264,097    29,890,095    566,345,888        5.1 GB
splicesite    splice_site.t    4,627,840     11,725,480    15,383,587,858     280 GB
Table 1: Datasets

6.1 Comparison with existing algorithms

Figure 3 shows the progress of the duality gap achieved by the four algorithms on the three smaller datasets. We chose the number of nodes and the number of cores per node such that the total number of worker cores is the same (16) for all algorithms except Baseline. The duality gap is measured as $P(\mathbf{w}(\boldsymbol{\alpha})) - D(\boldsymbol{\alpha})$, where $\mathbf{w}(\boldsymbol{\alpha})$ is computed from the estimate of $\mathbf{v}$ shared across the nodes. When more than one node is used, it is not possible for the master in Hybrid-DCA to gather the parts of $\boldsymbol{\alpha}$ from all workers at the end of each round. We therefore let the master temporarily store the global iterate on disk after each round; at the end of all stipulated rounds, the workers compute their respective contributions from the stored iterates and the master computes the duality gap using a series of synchronous all-reduce computations over all the workers. The bottom row shows the progress of the duality gap over time, while the top row shows the progress per round, where a round consists of a communication round in CoCoA+ and Hybrid-DCA and of one pass of local updates in Baseline and PassCoDe. In this experiment, Hybrid-DCA merges updates from all workers in every round, making the global updates synchronous. The progress of Baseline is slow as it applies only a fraction of the updates applied per round by the other algorithms. In terms of time, Hybrid-DCA clearly outperforms both CoCoA+, as expected, and PassCoDe, which incurs a larger number of cache misses when many cores are used. In terms of the number of rounds, PassCoDe outperforms both CoCoA+ and Hybrid-DCA, as expected. However, PassCoDe is not scalable beyond the number of cores in a single node. As the number of nodes increases, the convergence of Hybrid-DCA becomes slower due to the costly merging of many distributed updates.

Figure 3: Performance of different solvers on three datasets, rcv1 (left column), webspam (middle column), and kddb (right column), in terms of the progress of the duality gap across the number of rounds (top row) and across the wall time taken (bottom row).
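The final duality-gap evaluation reduces per-node partial sums; a minimal sketch is shown below, assuming each node can evaluate its local share of the losses and of the conjugate terms (the argument names are placeholders of our choosing).

// Combine per-node partial sums of the primal losses and the dual conjugate
// terms with one all-reduce and return the global duality gap P(w) - D(alpha).
#include <mpi.h>

double duality_gap(double local_primal_loss,   // (1/n) * sum of phi_i over local points
                   double local_dual_term,     // (1/n) * sum of -phi_i^*(-alpha_i) locally
                   double reg_term) {          // lambda/2 * ||w||^2, same on every node
    double loc[2] = {local_primal_loss, local_dual_term};
    double glob[2] = {0.0, 0.0};
    MPI_Allreduce(loc, glob, 2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    const double primal = glob[0] + reg_term;  // P(w)
    const double dual   = glob[1] - reg_term;  // D(alpha)
    return primal - dual;
}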

6.2 Speedup

We ran sufficiently many rounds of each of the four algorithms so that the duality gap fell below a threshold, and noted the time taken by each algorithm to achieve this threshold. Figure 4 shows the speedup of all the algorithms except Baseline, computed as the ratio of the time taken by Baseline to the time taken by the target algorithm on a given number of nodes and cores. Dataset-specific thresholds were used for rcv1, webspam, and kddb. PassCoDe can be run only on a single node, so we vary only the number of cores. CoCoA+ uses only 1 core per node. We ran CoCoA+ and Hybrid-DCA on varying numbers of nodes and plotted them separately. For each number of nodes, Hybrid-DCA uses a varying number of cores per node, under the restriction that the total number of worker cores does not exceed 144, a limit set by our HPC usage policy. Beyond a certain number of cores per node, the number of cache misses increases due to thread switching on the physical cores and reduces the speedup for both PassCoDe and Hybrid-DCA. This could be improved by carefully scheduling the OpenMP threads on the same physical cores. Nevertheless, Hybrid-DCA shows good speedup below that point.

Figure 4: Speedup of different parallel or distributed solvers with respect to the sequential implementation Baseline.

6.3 Effects of the barrier bound $\kappa$

Figure 5 shows the results of varying the barrier bound $\kappa$ with the delay bound $\tau$ fixed. When only a minority of the workers contribute in a round, the duality gap does not progress below a certain level. On the other hand, when at least half of the workers contribute in each round, it is possible to achieve the same duality gap obtained using all the workers. However, the reduction in time per round is eventually eaten up by the larger number of rounds required to achieve the same duality gap. Nevertheless, the approach is useful for HPC platforms with heterogeneous nodes, unlike ours, where waiting for updates from all workers carries a larger per-round penalty, or for cases where the goal is to run for a specified number of rounds and quickly achieve a reasonably good duality gap.

Figure 5: Effect of varying the barrier bound $\kappa$ on the worker nodes, with the delay bound $\tau$ fixed at 10.

6.4 Effects of the delay bound $\tau$

Figure 6 shows the results of varying the delay bound $\tau$ with the barrier bound $\kappa$ fixed. We do not see much effect of $\tau$, as the HPC platform used for our experiments has homogeneous nodes. Our experiments showed that even with a large delay bound, the value of $\mathbf{v}$ at any worker was stale for at most a few rounds. We expect to see a larger variance of staleness in the case of heterogeneous nodes.

Figure 6: Effect of varying the delay bound $\tau$ on the worker nodes, with the barrier bound $\kappa$ fixed at 8.

6.5 Performance on a big dataset

We evaluated our hybrid algorithm on the big dataset splicesite, of size about 280 GB, and compared with the previous best algorithm, CoCoA+. Because of its enormous size, the dataset cannot be accommodated on a single node, and hence PassCoDe cannot be run on this dataset. In this experiment, we used a fixed number of local iterations per round. The results are shown in Figure 7, where the progress of the duality gap across the rounds of communication is shown on the left and across the wall time on the right. To achieve the target duality gap on 16 nodes, CoCoA+ took more than 300 seconds, which is roughly consistent with the 1200 seconds (20 minutes) on 4 nodes reported in Ma et al. (2015a). Hybrid-DCA on 16 nodes each using 8 cores took about 30 seconds to achieve the same duality gap, providing strong evidence of the scalability of our algorithm. One could also argue that CoCoA+ can be run on all these 16x8=128 cores, treating each core as a distributed node. We also experimented with this configuration, which achieved better progress on the duality gap than CoCoA+ on 16 nodes but still performed far worse than Hybrid-DCA in terms of both the number of rounds and the time taken.

Figure 7: Performance of Hybrid-DCA on big dataset splicesite.

7 Conclusions

In this paper, we present a hybrid parallel and distributed asynchronous stochastic dual coordinate ascent algorithm that utilizes modern HPC platforms with many nodes of multi-core shared-memory systems. We analyze the convergence properties of this novel algorithm, which uses asynchronous updates at two cascading levels: across cores within a node and across nodes. Experimental results show that our algorithm is faster than the state-of-the-art distributed algorithms and scales better than the state-of-the-art parallel algorithms.

References

  • Agarwal et al. [2014] Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. A Reliable Effective Terascale Linear Learning System. Journal of Machine Learning Research, 15(1):1111–1133, 2014.
  • Boyd et al. [2011] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
  • Fan et al. [2008] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
  • Heinze et al. [2015] Christina Heinze, Brian McWilliams, and Nicolai Meinshausen. Dual-loco: Distributing statistical estimation using random projections. arXiv preprint arXiv:1506.02554, 2015.
  • Hsieh et al. [2008] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th international conference on Machine Learning (ICML), pages 408–415, 2008.
  • Hsieh et al. [2015] Cho-Jui Hsieh, Hsiang-Fu Yu, and Inderjit S Dhillon. PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
  • Jaggi et al. [2014] Martin Jaggi, Virginia Smith, Martin Takác, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I Jordan. Communication-Efficient Distributed Dual Coordinate Ascent. In Advances in Neural Information Processing Systems (NIPS), pages 3068–3076, 2014.
  • Lee et al. [2015] Jason Lee, Qihang Lin, Tengyu Ma, and Tianbao Yang. Distributed stochastic variance reduced gradient methods and a lower bound for communication complexity. CoRR, arXiv:1507.07595, 2015.
  • Liu et al. [2015] Ji Liu, Stephen J Wright, Christopher Ré, Victor Bittorf, and Srikrishna Sridhar. An Asynchronous Parallel Stochastic Coordinate Descent Algorithm. Journal of Machine Learning Research, 16:285–322, 2015.
  • Ma et al. [2015a] Chenxin Ma, Jakub Konečnỳ, Martin Jaggi, Virginia Smith, Michael I Jordan, Peter Richtárik, and Martin Takáč. Distributed optimization with arbitrary local solvers. arXiv preprint arXiv:1512.04039, 2015a.
  • Ma et al. [2015b] Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I Jordan, Peter Richtárik, and Martin Takáč. Adding vs. Averaging in Distributed Primal-Dual Optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015b.
  • Mcdonald et al. [2009] Ryan Mcdonald, Mehryar Mohri, Nathan Silberman, Dan Walker, and Gideon S Mann. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems, pages 1231–1239, 2009.
  • McWilliams et al. [2014] Brian McWilliams, Christina Heinze, Nicolai Meinshausen, Gabriel Krummenacher, and Hastagiri P Vanchinathan. Loco: Distributing ridge regression with random projections. arXiv preprint arXiv:1406.3469, 2014.
  • Moulines and Bach [2011] Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
  • Needell et al. [2014] Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. In Advances in Neural Information Processing Systems, pages 1017–1025, 2014.
  • Peng et al. [2015] Zhimin Peng, Yangyang Xu, Ming Yan, and Wotao Yin. Arock: an algorithmic framework for asynchronous parallel coordinate updates. arXiv preprint arXiv:1506.02396, 2015.
  • Richtárik and Takáč [2013] Peter Richtárik and Martin Takáč. Distributed coordinate descent method for learning with big data. arXiv preprint arXiv:1310.2059, 2013.
  • Shalev-Shwartz and Zhang [2013] Shai Shalev-Shwartz and Tong Zhang. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. Journal of Machine Learning Research, 14(1):567–599, 2013.
  • Shalev-Shwartz and Zhang [2016] Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105–145, 2016.
  • Shamir et al. [2014] Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication-Efficient Distributed Optimization using an Approximate Newton-type Method. In Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China, 21-26 June 2014, pages 1000–1008, 2014.
  • Takac et al. [2013] Martin Takac, Avleen Bijral, Peter Richtarik, and Nati Srebro. Mini-Batch Primal and Dual Methods for SVMs. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1022–1030, 2013.
  • Yang [2013] Tianbao Yang. Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent. In Advances in Neural Information Processing Systems, pages 629–637, 2013.
  • Yu et al. [2011] Hsiang-Fu Yu, Fang-Lan Huang, and Chih-Jen Lin. Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning, 85(1-2):41–75, 2011.
  • Zhang and Kwok [2014] Ruiliang Zhang and James Kwok. Asynchronous distributed ADMM for consensus optimization. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 1701–1709, 2014.
  • Zhang [2004] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-first international conference on Machine learning, page 116. ACM, 2004.
  • Zhang and Xiao [2015] Yuchen Zhang and Lin Xiao. Communication-efficient distributed optimization of self-concordant empirical loss. CoRR, abs/1501.00263, 2015.


Appendix A Proof of near optimality of local subproblem solution

Definition

First, we define the following: let $\{\Delta\boldsymbol{\alpha}^{(j)}\}$ denote the sequence of iterates generated by the local atomic solver, and let $\hat{\mathbf{v}}^{(j)}$ denote the actual values of $\mathbf{v}$ maintained in shared memory at update $j$ of the local atomic solver.

Lemma

Under Assumption 1, the iterates of the local subproblem solver satisfy:

(16)

where $\Delta\boldsymbol{\alpha}^{(j)}$ represents the $j$-th update to $\boldsymbol{\alpha}_{[k]}$ in the local solver.

A.1 Proof of Lemma A

Proof

We omit the subscript specifying the $k$-th data partition throughout the proof. For all updates $j$, we use proximal-operator-based definitions analogous to those in Hsieh et al. (2015); both operators are revised from Hsieh et al. (2015) to match the subproblem (4).

Proposition
(17)

Proposition
(18)

Proposition
(19)

Proposition

Let the relevant quantities be defined as in Hsieh et al. (2015), adapted to the subproblem (4). Then

(20)

Proposition

For all updates $j$, we have

(21)
(22)

The above propositions are similar to those in Hsieh et al. (2015); we keep their conclusions for later use in our proof. We prove Eq. (16) by induction. As shown in Hsieh et al. (2015), we have

(23)

The second factor on the r.h.s. of Eq. (23) is bounded as follows with the revisions:

(24)
(25)
Now we start the induction. Although some steps may be the same as the steps in Hsieh et al. (2015), we keep them here to make the proof self-contained.

Induction Hypothesis. We prove the following equivalent statement for all updates $j$:

Induction Basis. When $j = 1$,

By Proposition A.1 and the AM-GM inequality, which states that for any $a, b$ and any $\beta > 0$, $ab \leq \frac{\beta}{2} a^2 + \frac{1}{2\beta} b^2$, we have

(26)

Therefore, we have

Therefore,

which implies