Parallel and Distributed Block-Coordinate Frank-Wolfe Algorithms
Abstract
We develop parallel and distributed Frank-Wolfe algorithms; the former on shared memory machines with minibatching, and the latter in a delayed update framework. Whenever possible, we perform computations asynchronously, which helps attain speedups on multicore machines as well as in distributed environments. Moreover, instead of worst-case bounded delays, our methods only depend (mildly) on expected delays, allowing them to be robust to stragglers and faulty worker threads. Our algorithms assume block-separable constraints, and subsume the recent Block-Coordinate Frank-Wolfe (BCFW) method [24]. Our analysis reveals problem-dependent quantities that govern the speedups of our methods over BCFW. We present experiments on structural SVM and Group Fused Lasso, obtaining significant speedups over competing state-of-the-art (and synchronous) methods.
1 Introduction
The classical Frank-Wolfe (FW) algorithm [13] has witnessed a huge surge of interest recently [7, 20, 21, 2]. The FW algorithm iteratively solves the problem
(1)  $\min_{x \in \mathcal{M}} f(x),$
where $f$ is a smooth function (typically convex) and $\mathcal{M}$ is a closed convex set. The key factor that makes FW appealing is its use of a linear oracle that solves $\min_{s \in \mathcal{M}} \langle s, \nabla f(x) \rangle$, instead of a projection (quadratic) oracle that solves $\min_{s \in \mathcal{M}} \|s - y\|^2$ for a given point $y$, especially because the linear oracle can be much simpler and faster.
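To make the contrast between the two oracles concrete, the following minimal Python sketch treats the special case where the feasible set is the probability simplex. This choice of set, and the function names `lmo_simplex` and `project_simplex`, are assumptions made for illustration only. The linear oracle reduces to picking the coordinate with the smallest gradient entry, whereas the Euclidean projection requires a sort.

```python
from typing import List


def lmo_simplex(grad: List[float]) -> List[float]:
    """Linear oracle: argmin over the simplex of <s, grad> is a vertex."""
    j = min(range(len(grad)), key=lambda i: grad[i])
    return [1.0 if i == j else 0.0 for i in range(len(grad))]


def project_simplex(y: List[float]) -> List[float]:
    """Projection oracle: Euclidean projection of y onto the simplex.

    Uses the standard sort-based thresholding rule (O(d log d)).
    """
    u = sorted(y, reverse=True)
    cumsum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:  # holds for a prefix; the last hit gives the threshold
            theta = t
    return [max(v - theta, 0.0) for v in y]


if __name__ == "__main__":
    print(lmo_simplex([0.3, -1.2, 0.5]))
    print(project_simplex([0.2, 0.9, -0.4]))
```

For sets such as trace-norm balls the gap between the two oracles is much larger than in this toy case, which is the situation the text has in mind.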
This appeal has motivated several new variants of basic FW, e.g., regularized FW [41, 6, 17], linearly convergent special cases [23, 16], stochastic/online versions [34, 18, 25], and a randomized block-coordinate FW [24].
But despite this progress, parallel and distributed FW variants are barely studied. In this work, we develop new parallel and distributed FW algorithms, in particular for block-separable instances of (1) that assume the form
(2)  $\min_{x} f(x) \quad \text{s.t.} \quad x = [x_{(1)}, \dots, x_{(n)}] \in \prod_{i=1}^{n} \mathcal{M}_i,$
where $\mathcal{M}_i \subset \mathbb{R}^{m_i}$ ($1 \le i \le n$) is a compact convex set and $x_{(i)}$ are coordinate blocks of $x$. This setting for FW was considered in [24], who introduced the Block-Coordinate Frank-Wolfe (Bcfw) method.
Such problems arise in many applications, notably, structural SVMs [24], routing [26], group fused lasso [1, 5], trace-norm based tensor completion [29], reduced rank nonparametric regression [12], and structured submodular minimization [22], among others.
One approach to solve (2) is via block-coordinate (gradient) descent (BCD), which forms a local quadratic model for a block of variables, and then solves a projection subproblem [32, 36, 3]. However, for many problems, including the ones noted above, projection can be expensive (e.g., projecting onto the trace norm ball, onto base polytopes [15]), or even computationally intractable [8].
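Frank-Wolfe (FW) methods excel in such scenarios as they rely only on linear oracles solving $\min_{s \in \mathcal{M}} \langle s, \nabla f(x) \rangle$. For $\mathcal{M} = \prod_{i=1}^{n} \mathcal{M}_i$, this breaks into the $n$ independent problems
(3)  $\min_{s_{(i)} \in \mathcal{M}_i} \langle s_{(i)}, \nabla_{(i)} f(x) \rangle, \quad i = 1, \dots, n,$
where $\nabla_{(i)} f(x)$ denotes the gradient w.r.t. the coordinates $x_{(i)}$. It is immediate that these subproblems can be solved in parallel (an idea dating back to at least [26]). But there is a practical impediment: updating all the coordinates at each iteration (serially or in parallel) is expensive, hampering the use of FW on big-data problems.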
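The separability claim can be checked numerically on a toy instance. The sketch below assumes, purely for illustration, that every block constraint is a probability simplex; it compares a brute-force joint minimization over all vertex combinations with the block-by-block solution. The helper names are hypothetical.

```python
from itertools import product
from typing import List, Sequence


def vertices(d: int) -> List[List[float]]:
    """Vertices of the d-dimensional probability simplex."""
    return [[1.0 if i == j else 0.0 for i in range(d)] for j in range(d)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def blockwise_lmo(block_grads: List[List[float]]) -> List[List[float]]:
    """Solve one small linear problem per block, independently."""
    return [min(vertices(len(g)), key=lambda v: dot(v, g)) for g in block_grads]


def joint_lmo(block_grads: List[List[float]]) -> List[List[float]]:
    """Brute force: minimize the full linear objective over the product set."""
    candidates = product(*(vertices(len(g)) for g in block_grads))
    best = min(
        candidates,
        key=lambda combo: sum(dot(v, g) for v, g in zip(combo, block_grads)),
    )
    return list(best)


if __name__ == "__main__":
    grads = [[0.4, -0.1, 0.3], [1.0, 2.0], [-0.5, -0.7, 0.2, 0.0]]
    print(blockwise_lmo(grads) == joint_lmo(grads))
```

The joint search enumerates a number of candidates that grows as the product of the block sizes, while the block-wise version touches each block once, and each block can be handled by a different worker.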
This drawback is partially ameliorated by Bcfw [24], a method that randomly selects a block at each iteration and performs FW updates with it. However, this procedure is strictly sequential: it does not take advantage of modern multicore architectures or of high-performance distributed clusters.
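As a rough illustration of the sequential block-coordinate scheme just described, the sketch below runs it on a toy problem: a separable quadratic $\frac{1}{2}\|x - b\|^2$ over a product of probability simplices. The toy objective, the function names, and the use of an exact line search are assumptions made for this example; the alternative step size $2n/(k+2n)$ is the predefined schedule used for Bcfw in [24].

```python
import random
from typing import List

Block = List[float]


def objective(x: List[Block], b: List[Block]) -> float:
    """Toy objective: 0.5 * ||x - b||^2 summed over all blocks."""
    return 0.5 * sum(
        (xi - bi) ** 2 for xb, bb in zip(x, b) for xi, bi in zip(xb, bb)
    )


def lmo_simplex(g: Block) -> Block:
    """Block linear oracle over a probability simplex."""
    j = min(range(len(g)), key=lambda i: g[i])
    return [1.0 if i == j else 0.0 for i in range(len(g))]


def bcfw(b: List[Block], iters: int, seed: int = 0,
         line_search: bool = True) -> List[Block]:
    rng = random.Random(seed)
    n = len(b)
    x = [[1.0 / len(bb)] * len(bb) for bb in b]  # start at block centers
    for k in range(iters):
        i = rng.randrange(n)                      # pick one block at random
        grad_i = [xi - bi for xi, bi in zip(x[i], b[i])]
        s = lmo_simplex(grad_i)
        d = [si - xi for si, xi in zip(s, x[i])]  # FW direction in block i
        if line_search:
            dd = sum(v * v for v in d)
            gd = sum(g * v for g, v in zip(grad_i, d))
            # Exact minimizer of the quadratic along d, clipped to [0, 1].
            gamma = 0.0 if dd == 0.0 else min(1.0, max(0.0, -gd / dd))
        else:
            gamma = 2.0 * n / (k + 2.0 * n)
        x[i] = [xi + gamma * di for xi, di in zip(x[i], d)]
    return x


if __name__ == "__main__":
    targets = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.25, 0.25, 0.5]]
    x_final = bcfw(targets, iters=300)
    print(objective(x_final, targets))
```

Each iteration touches a single block, so the loop is inherently serial, which is the limitation the text points out.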
Contributions. In light of the above, we develop scalable FW methods, and make the following main contributions:

Parallel and distributed block-coordinate Frank-Wolfe algorithms, henceforth both referred to as ApBcfw, that allow asynchronous computation. ApBcfw depends only (mildly) on the expected delay, and is therefore robust to stragglers and faulty worker threads.

An analysis of the primal and primal-dual convergence of ApBcfw and its variants for any minibatch size and potentially unbounded maximum delay. When the maximum delay is actually bounded, we show stronger results using max-load bounds from load balancing.

Insightful deterministic conditions under which minibatching provably improves the convergence rate for a class of problems (sometimes by orders of magnitude).

Experiments that demonstrate on real data how our algorithm solves a structural SVM problem several times faster than the state of the art.
In short, our results contribute to making FW more attractive for big-data applications. To lend further perspective, we compare our methods to some closely related works below. Space limits our summary; we refer the reader to Jaggi [21], Zhang et al. [40], Lacoste-Julien et al. [24], and Freund & Grigas [14] for additional notes and references.
Bcfw and Structural SVM. Our algorithm ApBcfw extends and generalizes Bcfw to parallel computation using minibatches. Our convergence analysis follows the proof structure in Lacoste-Julien et al. [24], but with different stepsizes that must be carefully chosen. Our results contain Bcfw as a special case. A large portion of Lacoste-Julien et al. [24] focuses on more explicit (and stronger) guarantees for Bcfw on structural SVM. While we mainly focus on a more general class of problems, the particular subroutine needed by structural SVM requires special treatment; we discuss the details in Appendix C.
Parallelization of sequential algorithms. The idea of parallelizing sequential optimization algorithms is not new. It dates back to [38] for stochastic gradient methods; more recently Richtárik & Takáč [36], Liu et al. [30], Lee et al. [27] study parallelization of BCD. The conditions under which these parallel BCD methods succeed, e.g., expected separable overapproximation (ESO) and coordinate Lipschitz conditions, bear a close resemblance to our conditions in Section 2.2, but are not the same due to differences in how solutions are updated and what subproblems arise. In particular, our conditions are affine invariant. We provide detailed comparisons to parallel coordinate descent methods in Appendix D.4.
Asynchronous algorithms. Asynchronous algorithms that allow delayed parameter updates have been proposed earlier for stochastic gradient descent [33] and parallel BCD [30]. We propose the first asynchronous algorithm for Frank-Wolfe. Our asynchronous scheme not only permits delayed minibatch updates, but also allows the updates for coordinate blocks within each minibatch to have different delays. Therefore, each update may not be a solution of (3) for any single $x$. In addition, we obtain a strictly better dependency on the delay parameter than our predecessors (e.g., an exponential improvement over Liu et al. [30]), possibly due to a sharper analysis.
Other related work. While preparing our manuscript, we discovered the preprint [4], which also studies distributed Frank-Wolfe. We note that [4] focuses on Lasso-type problems and communication costs, and hence is not directly comparable to our results.
Notation. We briefly summarize our notation now. The vector $x \in \mathbb{R}^m$ denotes the parameter vector, possibly split into $n$ coordinate blocks. For block $i = 1, \dots, n$, $E_i \in \mathbb{R}^{m_i \times m}$ is the projection matrix which projects $x \in \mathbb{R}^m$ down to $x_{(i)} \in \mathbb{R}^{m_i}$; thus $x_{(i)} = E_i x$. The adjoint operator $E_i^*$ maps $\mathbb{R}^{m_i} \to \mathbb{R}^m$, thus $x_{[i]} = E_i^* x_{(i)}$ is $x$ with zeros in all dimensions except $x_{(i)}$ (note the subscript $[i]$). We denote the size of a minibatch by $\tau$, and the number of parallel workers (threads) by $T$. Unless otherwise stated, $k$ denotes the iteration/epoch counter and $\gamma$ denotes a stepsize. Finally, $C_f^\tau$ (and other such constants) denotes some curvature measure associated with function $f$ and minibatch size $\tau$. Such constants are important in our analysis, and will be described in greater detail in the main text.
2 Algorithm
In this section, we develop and analyze an asynchronous parallel block-coordinate Frank-Wolfe algorithm, hereafter ApBcfw, to solve (2).
Our algorithm is designed to run fully asynchronously on either a shared-memory multicore architecture or on a distributed system. For the shared-memory model, the computational work is divided amongst worker threads, each of which has access to a pool of coordinates that it may work on, as well as to the shared parameters. This setup matches the system assumptions in Niu et al. [33], Richtárik & Takáč [36], Liu et al. [30], and most modern multicore machines permit such an arrangement. On a distributed system, the parameter server [28, 9] periodically broadcasts the most recent parameter vector to each worker, and workers keep sending updates to the parameter vector after solving the subroutines corresponding to a randomly chosen parameter. In either setting, we do not wait for slower workers or synchronize the parameters at any point of the algorithm; therefore many updates sent from the workers could be calculated based on a delayed parameter.
The above scheme is made explicit by the pseudocode in Algorithm 1, following a server-worker terminology. The shared-memory version of the pseudocode is very similar, and is hence deferred to the Appendix. The three most important questions pertaining to Algorithm 1 are:

Does it converge?

If so, then how fast? And how much faster is it compared to Bcfw (minibatch size $\tau = 1$)?

How do delayed updates affect the convergence?
We answer the first two questions in Sections 2.1 and 2.2. Specifically, we show ApBcfw converges at the familiar $O(1/k)$ rate. Our analysis reveals that the speedup of ApBcfw over Bcfw through parallelization is problem dependent. Intuitively, we show that the extent to which minibatching ($\tau > 1$) can speed up convergence depends on the average “coupling” of the objective function across different coordinate blocks. For example, we show that if $f$ has a block symmetric diagonally dominant Hessian, then ApBcfw converges $\tau$ times faster. We address the third question in Section 2.3, where we establish convergence results that depend only mildly on the “expected” delay $\kappa$. The bound is proportional to $\kappa$ when we allow the delay to grow unboundedly, and proportional to $\sqrt{\kappa}$ when the delay is bounded by a small $\kappa_{\max}$.
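The server-worker scheme described in this section can be mimicked by a purely sequential simulation. The sketch below is not Algorithm 1: it assumes a toy quadratic objective over a product of simplices, draws each update's delay uniformly from a hypothetical range `0..max_delay`, and lets the server choose the step by an exact line search on the current iterate. All names are illustrative.

```python
import random
from typing import Dict, List

Block = List[float]


def objective(x: List[Block], b: List[Block]) -> float:
    return 0.5 * sum(
        (xi - bi) ** 2 for xb, bb in zip(x, b) for xi, bi in zip(xb, bb)
    )


def lmo_simplex(g: Block) -> Block:
    j = min(range(len(g)), key=lambda i: g[i])
    return [1.0 if i == j else 0.0 for i in range(len(g))]


def ap_bcfw_sim(b: List[Block], tau: int, iters: int,
                max_delay: int = 0, seed: int = 0) -> List[Block]:
    """Minibatch block FW where oracle answers may use stale iterates."""
    rng = random.Random(seed)
    n = len(b)
    x = [[1.0 / len(bb)] * len(bb) for bb in b]
    history = [[blk[:] for blk in x]]  # history[k] = iterate at start of step k
    for k in range(iters):
        batch = rng.sample(range(n), tau)  # tau distinct blocks
        answers: Dict[int, Block] = {}
        for i in batch:
            # Worker side: the oracle is solved on an iterate that is
            # `delay` steps old (each block may have a different delay).
            delay = rng.randint(0, min(max_delay, k))
            stale = history[k - delay]
            answers[i] = lmo_simplex(
                [xi - bi for xi, bi in zip(stale[i], b[i])]
            )
        # Server side: exact line search along the aggregated direction,
        # evaluated with the current (not stale) gradient.
        num = den = 0.0
        for i in batch:
            for xi, bi, si in zip(x[i], b[i], answers[i]):
                num += (xi - bi) * (si - xi)
                den += (si - xi) ** 2
        gamma = 0.0 if den == 0.0 else min(1.0, max(0.0, -num / den))
        for i in batch:
            x[i] = [xi + gamma * (si - xi) for xi, si in zip(x[i], answers[i])]
        history.append([blk[:] for blk in x])
    return x


if __name__ == "__main__":
    targets = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.25, 0.25, 0.5],
               [0.6, 0.4], [0.05, 0.95], [0.3, 0.3, 0.4]]
    for d in (0, 5):
        xf = ap_bcfw_sim(targets, tau=3, iters=200, max_delay=d, seed=1)
        print(d, objective(xf, targets))
```

Because the step is clipped to $[0, 1]$ and chosen on the current iterate, a stale direction can at worst yield a zero step in this toy setting; a real asynchronous implementation with predefined step sizes does not have that safeguard, which is why the delay analysis is needed.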
2.1 Main convergence results
Before stating the results, we need to define a few quantities. The first key quantity—also key to the analysis of several other FW methods—is the notion of curvature. Since ApBcfw updates a subset of coordinate blocks at a time, we define set curvature for an index set as
(4)  
For index sets of size , we define the expected set curvature over a uniform choice of subsets as
(5) 
These curvature definitions are closely related to the global curvature constant of [21] and the coordinate curvature and product curvature of [24]. Lemma 1 makes this relation more precise.
Lemma 1 (Curvature relations).
Suppose with cardinality and . Then,

;

.
The way the average set curvature scales with is critical for bounding the amount of speedup we can expect over Bcfw; we provide a detailed analysis of this speedup in Section 2.2.
The next key object is an approximate linear minimizer. At iteration $k$, as in Jaggi [21] and Lacoste-Julien et al. [24], we also allow the core computational subroutine that solves (3) to yield an approximate minimizer $s_{(i)}$. The approximation is quantified by an additive constant $\delta \ge 0$: for a minibatch $S$ of size $\tau$, the approximate solution obeys in expectation that
(6) 
where the expectation is taken over both the random coins in selecting $S$ and any other source of uncertainty in this oracle call during the entire history up to step $k$. Condition (6) is strictly weaker than what is required in Jaggi [21] and Lacoste-Julien et al. [24], as we only need the approximation to hold in expectation. With definitions (5) and (6) in hand, we are ready to state our first main convergence result.
Theorem 1 (Primal Convergence).
At first glance, the term in the numerator might seem bizarre, but as we will see in the next section, can be as small as . This is the scale of the constant one should keep in mind when comparing the rate to other methods, e.g., coordinate descent. Also note that so far this convergence result does not explicitly cover delayed updates, which we analyze separately in Section 2.3 via the approximation parameter .
For FW methods, one can also easily obtain a convergence guarantee in an appropriate primal-dual sense. To this end, we introduce our version of the surrogate duality gap [21]; we define this as
(7)  
To see why (7) is actually a duality gap, note that since is convex, the linearization is always smaller than the function evaluated at any , so that
This duality gap is obtained for “free” in batch Frank-Wolfe, but not in Bcfw or ApBcfw. Here, we only have an unbiased estimator . As gets large, is close to with high probability (McDiarmid’s Inequality), and can still be useful as a stopping criterion.
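The idea of estimating the gap from a minibatch can be sketched as follows. We assume here the standard surrogate gap $\max_{s} \langle x - s, \nabla f(x) \rangle$, which splits into a sum of per-block gaps on a product set, and we assume one natural estimator: rescale the sum of the sampled block gaps by $n$ divided by the batch size. Both the estimator and the simplex constraints are assumptions of this example, and the names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence

Block = List[float]


def block_gap(x_i: Block, g_i: Block) -> float:
    """Gap of one simplex block: <x_i, g_i> - min_j g_i[j] (always >= 0)."""
    return sum(a * c for a, c in zip(x_i, g_i)) - min(g_i)


def full_gap(x: List[Block], grads: List[Block]) -> float:
    """Batch surrogate duality gap: sum of all block gaps."""
    return sum(block_gap(xi, gi) for xi, gi in zip(x, grads))


def minibatch_gap_estimate(x: List[Block], grads: List[Block],
                           batch: Sequence[int]) -> float:
    """Rescaled sum of the gaps of the sampled blocks only."""
    n = len(x)
    return (n / len(batch)) * sum(block_gap(x[i], grads[i]) for i in batch)


if __name__ == "__main__":
    x = [[0.5, 0.5], [0.2, 0.3, 0.5], [1.0, 0.0], [0.25, 0.75]]
    grads = [[0.1, -0.4], [0.3, 0.0, -0.2], [0.6, 0.1], [-0.3, 0.2]]
    batches = list(combinations(range(len(x)), 2))
    avg = sum(minibatch_gap_estimate(x, grads, s) for s in batches) / len(batches)
    print(avg, full_gap(x, grads))
```

Averaging the estimate over all equally likely batches recovers the full gap, which is what unbiasedness means under uniform sampling; a single estimate can still be far off when the batch is small.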
Theorem 2 (Primal-Dual Convergence).
Relation with FW and Bcfw: The above convergence guarantees can be thought of as an interpolation between Bcfw and batch FW. If we take , this gives exactly the convergence guarantee for Bcfw [24, Theorem 2] and if we take , we can drop from (with a small modification in the analysis) and it reduces to the classic batch guarantee as in [21].
Dependence on initialization: Unlike classic FW, the convergence rate for our method depends on the initialization. When and , the convergence is slower by a factor of . The same concern was also raised in [24] with . We can actually remove the from as long as we know that . By Lemma 1, the expected set curvature increases with , so the fast convergence region becomes larger when we increase . In addition, if we pick , the rate of convergence is not affected by initialization anymore.
Speedup: The careful reader may have noticed the term in the numerator. This is undesirable as can be large (for instance, in structural SVM is the total number of data points). The saving grace in Bcfw is that when , is as small as (see [24, Lemmas A1 and A2]), and it is easy to check that the dependence in is the same even for . What really matters is how much speedup one can achieve over Bcfw, and this speedup critically relies on how depends on . Analyzing this dependence will be our main focus in the next section.
2.2 Effect of parallelism / minibatching
To understand when minibatching is meaningful and to quantify its speedup, we take a more careful look at the expected set curvature in this section. In particular, we analyze and present a set of insightful conditions that govern its relationship with . The key idea is to roughly quantify how strongly different coordinate blocks interact with each other.
To begin, assume that there exists a positive semidefinite matrix such that for any
(8) 
The matrix may be viewed as a generalization of the gradient’s Lipschitz constant (a scalar) to a matrix. For quadratic functions , we can take . For twice differentiable functions, we can choose
Since (we write instead of for brevity), we separate into blocks; so represents the block corresponding to and such that we can take the product . Now, we define a boundedness parameter for every , and an incoherence condition with parameter for every block coordinate pair such that
Then, using these quantities, we obtain the following bound on the expected set curvature.
Theorem 3.
If problem (2) obeys expected boundedness and expected incoherence, then
(9) 
It is clear that when the incoherence term is large, the expected set curvature is proportional to , and when is close to 0, then is proportional to . In other words, when the interaction between coordinate blocks is small, one gains from parallelizing block-coordinate Frank-Wolfe. This is analogous to the situation in parallel coordinate descent [36, 30], and we will compare the rate of convergence explicitly with them in the next section.
Remark 1.
Let us form a matrix with on the diagonal and on the off-diagonal. If is symmetric diagonally dominant (SDD), i.e., the sum of absolute off-diagonal entries in each row is no greater than the diagonal entry, then is proportional to .
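The role of coupling can be illustrated by brute force on a tiny quadratic. The sketch assumes $f(x) = \frac{1}{2} x^\top H x$ with scalar blocks constrained to $[0, 1]$, so that the curvature restricted to an index set reduces to the maximum of $d^\top H_{SS}\, d$ over directions $d \in [-1, 1]^{|S|}$, attained at a sign vector. This reduction, and the two test matrices, are assumptions chosen for the example rather than quantities taken from the analysis above.

```python
from itertools import combinations, product
from typing import List, Sequence

Matrix = List[List[float]]


def set_curvature(H: Matrix, S: Sequence[int]) -> float:
    """max over sign vectors d of d^T H_SS d (vertices of [-1, 1]^|S|)."""
    best = 0.0
    for signs in product((-1.0, 1.0), repeat=len(S)):
        val = sum(
            signs[a] * H[i][j] * signs[b]
            for a, i in enumerate(S)
            for b, j in enumerate(S)
        )
        best = max(best, val)
    return best


def expected_set_curvature(H: Matrix, tau: int) -> float:
    """Average set curvature over all index sets of size tau."""
    subsets = list(combinations(range(len(H)), tau))
    return sum(set_curvature(H, S) for S in subsets) / len(subsets)


if __name__ == "__main__":
    n = 6
    uncoupled = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    coupled = [[1.0] * n for _ in range(n)]
    for tau in (1, 2, 3, 4):
        print(tau,
              expected_set_curvature(uncoupled, tau),
              expected_set_curvature(coupled, tau))
```

With the identity matrix (no interaction between blocks, and trivially SDD) the averaged curvature grows linearly in the minibatch size, whereas with the all-ones matrix (every pair of blocks fully coupled) it grows quadratically, which removes the benefit of larger minibatches.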
The above result depends on the parameters and . We now derive specific instances of the above results for the structural SVM and Group Fused Lasso. For the structural SVM, a simple generalization of [24, Lemmas A.1, A.2] shows that in the worst case, using offers no gain at all. Fortunately, if we are willing to consider a more specific problem and consider the average case instead, using larger does make the algorithm converge faster (and this is the case according to our experiments).
Example 1 (Structural SVM for multilabel classification (with random data)).
We describe the application to structural SVMs in detail in Appendix C (please see that appendix for details on notation). Here, we describe the convergence rate for this application. According to [39], the compatibility function for multiclass classification will be where the only nonzero block that we fill with the feature vector is the th block. So looks like . This already ensures that provided lie on a unit sphere. Suppose we have classes and each class has a unique feature vector drawn randomly from a unit sphere in ; furthermore, for simplicity assume we always draw data points with distinct labels [Footnote 2: This is an oversimplification, but it offers a rough rule of thumb. In practice, should be in the same ballpark as our estimate here.] for some constant . In addition, if , then with high probability
which yields a convergence rate , where
using notation from Lemmas A.1 and A.2 of [24].
This analysis suggests that a good rule of thumb is that we should choose to be at most the number of categories for the classification. If each class is a mixture of random draws from the unit sphere, then we can choose to be the underlying number of mixture components.
Example 2 (Group Fused Lasso).
The Group Fused Lasso aims to solve (typically for )
(10) 
where , and column of is an observed noisy dimensional feature vector at time . The matrix is the differencing matrix that takes the difference of feature vectors at adjacent time points (columns). The formulation aims to filter the trend that has some piecewise constant structures. The dual to (10) is
s.t. 
where is conjugate to , i.e., . This block-constrained problem fits our structure (2). For this problem, we find that and , which yields the bound
Consequently, the rate of convergence becomes . In this case, batch FW will have a better rate of convergence than Bcfw. [Footnote 3: Observe that does not have an term in the denominator to cancel out the numerator. This is because the objective function is not appropriately scaled with , as it is in the structural SVM formulation.]
2.3 Convergence with delayed updates
Due to delays in communication, it frequently happens that some updates pushed back by workers are calculated based on delayed parameters that were broadcast earlier. Dropping these updates or enforcing synchronization creates a huge system overhead, especially when the minibatch size is small. Ideally, we want to simply accept the delayed updates as if they were correct, and broadcast new parameters to workers without locking the updates. The question is, does this actually work? In this section, we model the delay of every update as i.i.d. from an unknown distribution. Under weak assumptions, we show that the effect of delayed updates can be treated as a form of approximate oracle evaluation as in (6), with a specific constant that depends on the expected delay and the maximum delay parameter (when it exists), thereby establishing that the convergence results in the previous section remain valid for this variant. The results also depend on the following diameter and gradient Lipschitz constant for a norm
Theorem 4 (Delayed Updates as Approximate Oracle).
For each norm of choice, let and be defined as above. Let the random delay be and let be the expected delay from any worker; moreover, assume that the algorithm drops any update with delay greater than at iteration . Then, for the version of the algorithm without line search, the delayed oracle will produce such that (6) holds with
(11) 
Furthermore, if we assume that there is a such that for all , then (6) holds with where
(12) 
The results above imply that ApBcfw (without line search) converges in both primal optimality and in duality gap according to Theorems 1 and 2.
Note that (11) depends on the expected delay rather than the maximum delay and as we allow the maximum delay to grow unboundedly. This allows the system to automatically deal with heavy-tailed delay distributions and sporadic stragglers. When we do have a small bounded delay, we produce stronger bounds (12) with a multiplier that is either a constant (when for any ), proportional to (when ) or proportional to (when is large). The whole expression often has sublinear dependency on the expected delay . To be more precise, when is the Euclidean norm, by Jensen’s inequality. Therefore in this case the bound is essentially proportional to . This is strictly better than Niu et al. [33], which has quadratic dependency in , and Liu et al. [30], which has exponential dependency in . Our mild dependency for the cases suggests that (12) remains proportional to even when we allow the maximum delay parameter to be as large as or larger without significantly affecting the convergence. Note that this allows some workers to be delayed for several data passes.
Observe that when , where the result reduces to a lock-free variant of Bcfw, becomes proportional to . This is always greater than (see e.g., [21, Appendix D]) but due to the flexibility of choosing the norm, this quantity corresponding to the most favorable norm is typically a small constant. For example, when is a quadratic function, we show that (see Appendix D.2). When , is often for an appropriately chosen norm. Therefore, (11) and (12) are roughly in the order of and respectively. [Footnote 4: For details, see our discussion in Appendix D.2.]
Lastly, we remark that and are not independent. When we increase , we update the parameters less frequently and gets smaller. Consider a real distributed system with constant throughput, measured in the number of oracle solves per second from all workers. If the average delay is a fixed number in clock time, specified by the communication time, then is roughly a constant regardless of how is chosen.
3 Experiments
In this section, we experimentally demonstrate the performance gains of the three key features of our algorithm: minibatches of data, parallel workers, and asynchronous updates.
3.1 Minibatches of Data
We conduct simulations to study the effect of minibatch size , where larger implies greater degrees of parallelism as each worker can solve one or more subproblems in a minibatch. In our simulation for structural SVM we use a sequence labeling task on a subset of the OCR dataset [37]. The subproblem can be solved using the Viterbi algorithm. The speedup on this dataset is shown in Figure 1(a). For this dataset, we use with weighted averaging and line search throughout. We measure the speedup for a particular in terms of the number of epochs (Algorithm 1) required to converge relative to , which corresponds to Bcfw. Figure 1(a) shows that ApBcfw achieves linear speedup for minibatch size up to . Further speedup is sensitive to the convergence criteria, where more stringent thresholds lead to lower speedups. This is because large minibatch sizes introduce errors, which reduce progress per update; this is consistent with existing work on the effect of parameter staleness on convergence [19, 10]. This suggests that it might be possible to use more workers initially for a large speedup and reduce parallelism as the algorithm approaches the optimum.
In our simulation for Group Fused Lasso, we generate a piecewise constant dataset of size (, in Eq. 2) with Gaussian noise. We use and a primal suboptimality threshold as our convergence criterion. At each iteration, we solve subproblems (i.e. the minibatch size). Figure 1(b) shows the speedup over (Bcfw). Similar to the structural SVM, the speedup is almost perfect for small () but tapers off for large to varying degrees depending on the convergence thresholds.
3.2 Shared Memory Parallel Workers
We implement ApBcfw for the structural SVM in a multicore shared-memory system using the full OCR dataset. All shared-memory experiments were implemented in C++ and conducted on a 16-core machine with Intel(R) Xeon(R) CPU E5-2450 2.10GHz processors and 128 GB RAM. We first fix the number of workers at and vary the minibatch size . Figure 2(a) shows the absolute convergence (i.e., the convergence per second). We note that ApBcfw outperforms single-threaded Bcfw under all investigated , showing the efficacy of parallelization. Within ApBcfw, convergence improves with increasing minibatch sizes up to , but worsens when as the error from the large minibatch size dominates additional computation. The optimal for a given number of workers () depends on both the dataset (how “coupled” the coordinates are) and the system implementation (how costly the synchronization is as the system scales).
Since speedup for a given depends on , we search for the optimal across multiples of to find the best speedup for each . Figure 2(b) shows faster convergence of ApBcfw over Bcfw () when workers are available. It is important to note that the x-axis is wall-clock time rather than the number of epochs.
Figure 2(c) shows the speedup with varying . ApBcfw achieves near-linear speedup for smaller . The speedup curve tapers off for larger for two reasons: (1) Large incurs higher system overheads, and thus needs larger to utilize CPU efficiently; (2) Larger incurs errors as shown in Fig. 1(a). If the subproblems were more time-consuming to solve, the effect of system overhead would be reduced. We simulate harder subproblems by simply solving them Uniform times instead of just once. The speedup is nearly perfect as shown in Figure 2(d). Again, we observe that a more generous convergence threshold produces higher speedup, suggesting that resource scheduling could be useful (e.g., allocate more CPUs initially and fewer as the algorithm converges).
3.3 Performance gain with asynchronous updates
We compare ApBcfw with a synchronous version of the algorithm (SpBcfw) where the server assigns subproblems to each worker, then waits for and accumulates the solutions before proceeding to the next iteration. We simulate workers of varying slowdowns in our shared-memory setup by assigning a return probability to each worker . After solving each subproblem, worker reports the solution to the server with probability . Thus a worker with will drop 20% of the updates on average, corresponding to slowdown.
We use workers for the experiments in this section. We first simulate the scenario with just one straggler with return probability while the other workers run at full speed . Figure 3(a) shows that the average time per effective data pass (over 20 passes and 5 runs) of ApBcfw stays almost unchanged with the slowdown factor of the straggler, whereas it increases linearly for SpBcfw. This is because ApBcfw relies on the average available worker processing power, while SpBcfw is only as fast as the slowest worker.
Next, we simulate a heterogeneous environment where the workers have varying speeds. While varying a parameter , we set for . Figure 3(b) shows that ApBcfw slows down by only a factor of 1.4 compared to the no-straggler case. Assuming that the server and worker each takes about half the (wall-clock) time on average per epoch, we would expect the run time to increase by 50% if average worker speed halves, which is the case if (i.e., ). Therefore a factor of 1.4 is reasonable. The performance of SpBcfw is almost identical to that in the previous experiment as its speed is determined by the slowest worker. Thus our experiments show that ApBcfw is robust to stragglers and system heterogeneity.
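The qualitative difference between the two schemes can be captured by an idealized back-of-the-envelope model, sketched below. It assumes each worker delivers useful solutions at a rate equal to its return probability (one subproblem per unit time at full speed), ignores server time and randomness, and uses hypothetical function names; it is not the experimental setup itself.

```python
from typing import List


def sync_time_per_pass(n_blocks: int, speeds: List[float]) -> float:
    """Synchronous: every round waits for the slowest worker."""
    rounds = n_blocks / len(speeds)            # one subproblem per worker per round
    return rounds * max(1.0 / p for p in speeds)


def async_time_per_pass(n_blocks: int, speeds: List[float]) -> float:
    """Asynchronous: throughput is the sum of the workers' rates."""
    return n_blocks / sum(speeds)


if __name__ == "__main__":
    n_blocks = 120
    for speeds in ([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 0.5]):
        print(speeds,
              sync_time_per_pass(n_blocks, speeds),
              async_time_per_pass(n_blocks, speeds))
```

In this model a single worker at half speed doubles the synchronous time per pass, while the asynchronous time grows only in proportion to the lost aggregate throughput.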
3.4 Convergence under unbounded heavy-tailed delay
In this section, we illustrate the mild effect of delay on convergence by randomly drawing an independent delay variable for each worker. For simplicity, we use (Bcfw) on the same group fused lasso problem as in Section 3.1. We sample using either a Poisson distribution or a heavy-tailed Pareto distribution (rounded to the nearest integer). The Pareto distribution is chosen with shape parameter and scale parameter such that and . During the experiment, at iteration , any updates that were based on a delay greater than are dropped (as our theory demands). The results are shown in Figure 4. Observe that for both cases, the impact of the delay is rather mild. With expected delay up to , the algorithm takes fewer than twice as many iterations to converge.
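A heavy-tailed delay model of this kind can be simulated with the Python standard library alone. In the sketch below, the shape and scale values, the cap passed to `keep_update`, and all function names are hypothetical placeholders; the actual parameters and the iteration-dependent dropping threshold of the experiment are not reproduced here.

```python
import random
from typing import List


def pareto_mean(shape: float, scale: float) -> float:
    """Mean of a Pareto(shape, scale) variable; finite only if shape > 1."""
    if shape <= 1.0:
        return float("inf")
    return shape * scale / (shape - 1.0)


def sample_pareto_delays(num: int, shape: float, scale: float,
                         seed: int = 0) -> List[int]:
    """Draw Pareto delays and round them to the nearest integer."""
    rng = random.Random(seed)
    # random.paretovariate(shape) has scale 1, so multiply by the scale.
    return [int(round(scale * rng.paretovariate(shape))) for _ in range(num)]


def keep_update(delay: int, cap: int) -> bool:
    """Drop rule: accept an update only if its delay does not exceed the cap."""
    return delay <= cap


if __name__ == "__main__":
    delays = sample_pareto_delays(10, shape=2.0, scale=5.0, seed=0)
    print(delays, pareto_mean(2.0, 5.0))
    print([d for d in delays if keep_update(d, cap=20)])
```

With a shape parameter of 2 or less the variance of the Pareto distribution is infinite, so occasional very large delays appear even when the mean delay is moderate; the drop rule is what keeps such outliers from entering the iterate.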
4 Conclusion
In this paper, we propose an asynchronous parallel generalization of the block-coordinate Frank-Wolfe method [24] and provide intuitive conditions under which it has a provable speedup over Bcfw. The asynchronous updates allow our method to be robust to stragglers and node failure, as the speed of ApBcfw depends on average worker speed instead of the slowest. We demonstrate the effectiveness of the algorithm in structural SVM and Group Fused Lasso with both controlled simulation and real-data experiments on a multicore workstation. For the structural SVM, it leads to a speedup over the state-of-the-art Bcfw by an order of magnitude using 16 parallel processors. Because it is a projection-free Frank-Wolfe method, we expect our algorithm to be very competitive in large-scale constrained optimization problems, especially when projections are expensive. Future work includes analysis for the strongly convex case and ultimately releasing a carefully implemented software package for practitioners to deploy in Big Data applications.
References
 Alaíz et al. [2013] Alaíz, Carlos M, Barbero, Álvaro, and Dorronsoro, José R. Group fused lasso. In Artificial Neural Networks and Machine Learning–ICANN 2013, pp. 66–73. Springer, 2013.
 Bach [2013] Bach, Francis. Conditional gradients everywhere. 2013.
 Beck & Tetruashvili [2013] Beck, Amir and Tetruashvili, Luba. On the convergence of block coordinate descent type methods. SIAM Journal on Optimization, 23(4):2037–2060, 2013.
 Bellet et al. [2014] Bellet, Aurélien, Liang, Yingyu, Garakani, Alireza Bagheri, Balcan, Maria-Florina, and Sha, Fei. Distributed Frank-Wolfe algorithm: A unified framework for communication-efficient sparse learning. CoRR, abs/1404.2644, 2014.
 Bleakley & Vert [2011] Bleakley, Kevin and Vert, Jean-Philippe. The group fused lasso for multiple change-point detection. arXiv, 2011.
 Bredies et al. [2009] Bredies, Kristian, Lorenz, Dirk A, and Maass, Peter. A generalized conditional gradient method and its connection to an iterative shrinkage method. Computational Optimization and Applications, 42(2):173–193, 2009.
 Clarkson [2010] Clarkson, Kenneth L. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms (TALG), 6(4):63, 2010.
 Collins et al. [2008] Collins, Michael, Globerson, Amir, Koo, Terry, Carreras, Xavier, and Bartlett, Peter L. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. JMLR, 9:1775–1822, 2008.
 Dai et al. [2013] Dai, Wei, Wei, Jinliang, Zheng, Xun, Kim, Jin Kyu, Lee, Seunghak, Yin, Junming, Ho, Qirong, and Xing, Eric P. Petuum: A framework for iterative-convergent distributed ML. arXiv:1312.7651, 2013.
 Dai et al. [2014] Dai, Wei, Kumar, Abhimanu, Wei, Jinliang, Ho, Qirong, Gibson, Garth, and Xing, Eric P. High-performance distributed ML at scale through parameter server consistency models. In AAAI, 2014.
 Fercoq & Richtárik [2015] Fercoq, Olivier and Richtárik, Peter. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015.
 Foygel et al. [2012] Foygel, Rina, Horrell, Michael, Drton, Mathias, and Lafferty, John D. Nonparametric reduced rank regression. In NIPS’12, pp. 1628–1636, 2012.
 Frank & Wolfe [1956] Frank, Marguerite and Wolfe, Philip. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.
 Freund & Grigas [2014] Freund, Robert M. and Grigas, Paul. New analysis and results for the Frank–Wolfe method. Mathematical Programming, 155(1):199–230, 2014. ISSN 1436-4646.
 Fujishige & Isotani [2011] Fujishige, Satoru and Isotani, Shigueo. A submodular function minimization algorithm based on the minimumnorm base. Pacific Journal of Optimization, 7(1):3–17, 2011.
 Garber & Hazan [2013] Garber, Dan and Hazan, Elad. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv:1301.4666, 2013.
 Harchaoui et al. [2015] Harchaoui, Zaid, Juditsky, Anatoli, and Nemirovski, Arkadi. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152(1-2):75–112, 2015. ISSN 0025-5610.
 Hazan & Kale [2012] Hazan, Elad and Kale, Satyen. Projection-free online learning. In ICML’12, 2012.
 Ho et al. [2013] Ho, Qirong, Cipar, James, Cui, Henggang, Lee, Seunghak, Kim, Jin Kyu, Gibbons, Phillip B., Gibson, Garth A., Ganger, Greg, and Xing, Eric. More effective distributed ML via a stale synchronous parallel parameter server. In NIPS’13, 2013.
 Jaggi [2011] Jaggi, Martin. Sparse convex optimization methods for machine learning. PhD thesis, Diss., Eidgenössische Technische Hochschule ETH Zürich, Nr. 20013, 2011.
 Jaggi [2013] Jaggi, Martin. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML’13, pp. 427–435, 2013.
 Jegelka et al. [2013] Jegelka, Stefanie, Bach, Francis, and Sra, Suvrit. Reflection methods for user-friendly submodular optimization. In NIPS’13, pp. 1313–1321, 2013.
 Lacoste-Julien & Jaggi [2015] Lacoste-Julien, Simon and Jaggi, Martin. On the global linear convergence of Frank-Wolfe optimization variants. In NIPS’15, pp. 496–504, 2015.
 Lacoste-Julien et al. [2013] Lacoste-Julien, Simon, Jaggi, Martin, Schmidt, Mark, and Pletscher, Patrick. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML’13, pp. 53–61, 2013.
 Lafond et al. [2015] Lafond, Jean, Wai, Hoi-To, and Moulines, Eric. Convergence analysis of a stochastic projection-free algorithm. arXiv:1510.01171, 2015.
 LeBlanc et al. [1975] LeBlanc, Larry J, Morlok, Edward K, and Pierskalla, William P. An efficient approach to solving the road network equilibrium traffic assignment problem. Transportation Research, 9(5):309–318, 1975.
 Lee et al. [2014] Lee, Seunghak, Kim, Jin Kyu, Zheng, Xun, Ho, Qirong, Gibson, Garth A, and Xing, Eric P. On model parallelization and scheduling strategies for distributed machine learning. In NIPS’14, pp. 2834–2842, 2014.
 Li et al. [2013] Li, Mu, Zhou, Li, Yang, Zichao, Li, Aaron, Xia, Fei, Andersen, David G, and Smola, Alexander. Parameter server for distributed machine learning. In NIPS Workshop: Big Learning, 2013.
 Liu et al. [2013] Liu, Ji, Musialski, Przemyslaw, Wonka, Peter, and Ye, Jieping. Tensor completion for estimating missing values in visual data. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1):208–220, 2013.
 Liu et al. [2014] Liu, Ji, Wright, Stephen J, Ré, Christopher, and Bittorf, Victor. An asynchronous parallel stochastic coordinate descent algorithm. JMLR, 2014.
 Mitzenmacher [2001] Mitzenmacher, Michael. The power of two choices in randomized load balancing. Parallel and Distributed Systems, IEEE Transactions on, 12(10):1094–1104, 2001.
 Nesterov [2012] Nesterov, Yu. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
 Niu et al. [2011] Niu, Feng, Recht, Benjamin, Ré, Christopher, and Wright, Stephen J. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. arXiv:1106.5730, 2011.
 Ouyang & Gray [2010] Ouyang, Hua and Gray, Alexander G. Fast stochastic Frank-Wolfe algorithms for nonlinear SVMs. In SDM, 2010.
 Raab & Steger [1998] Raab, Martin and Steger, Angelika. Balls into bins: a simple and tight analysis. In Randomization and Approximation Techniques in Computer Science, pp. 159–170. Springer, 1998.
 Richtárik & Takáč [2012] Richtárik, Peter and Takáč, Martin. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873, 2012.
 Taskar et al. [2004] Taskar, Ben, Guestrin, Carlos, and Koller, Daphne. Max-margin Markov networks. In NIPS’04, pp. 25–32. MIT Press, 2004.
 Tsitsiklis et al. [1986] Tsitsiklis, John N, Bertsekas, Dimitri P, Athans, Michael, et al. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE transactions on automatic control, 31(9), 1986.
 Yu & Joachims [2009] Yu, Chun-Nam John and Joachims, Thorsten. Learning structural SVMs with latent variables. In ICML’09, pp. 1169–1176. ACM, 2009.
 Zhang et al. [2012] Zhang, Xinhua, Yu, Yaoliang, and Schuurmans, Dale. Accelerated training for matrix-norm regularization: A boosting approach. In NIPS’12, pp. 2915–2923, 2012.
 Zhang et al. [2013] Zhang, Xinhua, Yu, Yao-Liang, and Schuurmans, Dale. Polar operators for structured sparse estimation. In NIPS’13, pp. 82–90, 2013.
Appendix A Convergence analysis
We provide a self-contained convergence proof in this section. The skeleton of the proof closely follows Lacoste-Julien et al. [24] and Jaggi [21]. A few subtle modifications and improvements are needed because our definition of an approximate oracle call is weaker: the oracle is required to be nearly correct only in expectation. To the best of our knowledge, the convergence analysis under delayed updates is new; it relies on a simple result from “load balancing” [31].
For clarity of presentation, we focus on the primal and primal-dual convergence of the versions of the algorithms with predefined step sizes and an additive approximate subroutine. The same analysis extends readily to the line-search variant and to multiplicative approximation.
A.1 Primal Convergence
Lemma 2.
Denote the gap between current and the optimal to be . The iterative updates in Algorithm 1 (with arbitrary fixed stepsize or by the coordinate-line search) obey
where the expectation is taken over the joint randomness all the way to iteration .
Proof.
Let for notational convenience. We prove the result for Algorithm 1 first. Apply the definition of and then apply the definition of the additive approximation in (6), to get
Subtracting from both sides, we get:
Now, taking the expectation over the entire history and applying (6) and the definition of the surrogate duality gap (7), we obtain
(13)  
The last inequality follows from the property of the surrogate duality gap due to the fact that . This completes the proof of the descent lemma. ∎
We are now ready to prove Theorem 1.
Proof of Theorem 1.
We follow the proof of Theorem C.1 in Lacoste-Julien et al. [24] to prove the statement for Algorithm 1. The difference is that we use a different, carefully chosen sequence of step sizes.
Take and denote as for short. The inequality in Lemma 2 simplifies to
Now we will prove for by induction. The base case is trivially true since . Assuming that the claim holds for , we apply the induction hypothesis, and the above inequality reduces to
This completes the induction and hence the proof of primal convergence for Algorithm 1. ∎
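As an informal sanity check of this induction pattern, the following Python sketch iterates a recursion of the form used in Theorem C.1 of [24], namely h_{k+1} = (1 - gamma_k / n) h_k + gamma_k^2 C / (2n) with gamma_k = 2n / (k + 2n), and compares it against the rate 2n (C + h_0) / (k + 2n). This is the single-block recursion of [24], taken here as an assumption; it is not the exact inequality or step size of Theorem 1. The function names and the values of n, h_0, and C are hypothetical.

```python
def bcfw_step_size(k, n):
    """Step size gamma_k = 2n / (k + 2n), as in the BCFW analysis of [24]."""
    return 2.0 * n / (k + 2.0 * n)


def worst_case_gaps(h0, curvature, n, num_iters):
    """Iterate h_{k+1} = (1 - gamma_k / n) h_k + gamma_k^2 * C / (2n) with equality.

    Equality is the worst case permitted by a descent inequality of this form.
    """
    gaps = [h0]
    for k in range(num_iters):
        gamma = bcfw_step_size(k, n)
        next_gap = (1.0 - gamma / n) * gaps[-1] + gamma ** 2 * curvature / (2.0 * n)
        gaps.append(next_gap)
    return gaps


def claimed_bound(k, h0, curvature, n):
    """Rate 2n (C + h0) / (k + 2n) that the induction establishes."""
    return 2.0 * n * (curvature + h0) / (k + 2.0 * n)


# Hypothetical problem constants, chosen only for illustration.
n, h0, curvature = 10, 5.0, 3.0
gaps = worst_case_gaps(h0, curvature, n, 1000)
bound_holds = all(
    h <= claimed_bound(k, h0, curvature, n) + 1e-12 for k, h in enumerate(gaps)
)
```

Because the recursion is run with equality, `bound_holds` being true for every k indicates that the step-size schedule and the claimed rate are mutually consistent for these constants; it is a numerical illustration, not a substitute for the induction.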
A.2 Convergence of the surrogate duality gap
Proof of Theorem 2.
We closely follow the proof of the analogous result in Lacoste-Julien et al. [24, Section C.3], and use the same notation for and as in the proof of primal convergence. Moreover, denote . First, from (13) in the proof of Lemma 2, we have
Rearranging the terms, we get
(14) 
The idea is that if we take an arbitrary convex combination of , the result lies within the convex hull, namely between the minimum and the maximum, which proves the existence claim in the theorem. By choosing weights , where the normalization constant , and taking the convex combination of both sides of (14), we have
(15) 
Note that , so we simply dropped a negative term in the last line. Applying the step size , we get
Plugging the above back into (15) and using the bound , we get
This completes the proof for . ∎
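The convex-combination argument used above can be illustrated numerically. The sketch below assumes linear weights rho_k = k / S_K with S_K = K (K + 1) / 2, which is our reading of the choice made in [24, Section C.3]; the gap sequence is a made-up, non-monotone example, and the function names are hypothetical.

```python
def linear_weights(K):
    """Convex weights rho_k = k / S_K for k = 0..K, where S_K = K (K + 1) / 2."""
    S_K = K * (K + 1) / 2.0
    return [k / S_K for k in range(K + 1)]


def weighted_average(values, weights):
    """Convex combination sum_k weights[k] * values[k]."""
    return sum(w * v for w, v in zip(weights, values))


K = 50
# Hypothetical duality-gap sequence: decaying on average, but not monotone.
gaps = [10.0 / (k + 1) + (0.5 if k % 7 == 0 else 0.0) for k in range(K + 1)]
rho = linear_weights(K)
average_gap = weighted_average(gaps, rho)
smallest_gap = min(gaps)
```

Since the weights are nonnegative and sum to one, `smallest_gap` can never exceed `average_gap`. Any upper bound derived for the weighted average therefore also holds for at least one iterate, which is the existence claim.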
Proof of Convergence with Delayed Gradient
The idea is to treat the updates calculated from delayed gradients as an additive error and then invoke our convergence results, which allow the oracle to be approximate. We first present a lemma that will be used in the proof of Theorem 4.
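To make the "delayed gradient as additive error" viewpoint concrete, the following Python sketch compares the answer of a linear oracle queried at a stale iterate with the exact answer, both evaluated against the current gradient. Everything in it is an illustrative assumption: the objective f(x) = 0.5 ||x - b||^2 (gradient Lipschitz constant 1 in the l2 norm), the probability simplex as the feasible set (l2 diameter sqrt(2)), and all function names. The check uses the standard bound error <= diameter * L * ||x - x_stale||, obtained from optimality of the stale answer and the Cauchy-Schwarz inequality; it does not reproduce the step-size-dependent statement of the lemma that follows.

```python
import math
import random


def gradient(x, b):
    """Gradient of f(x) = 0.5 * ||x - b||^2, which is 1-Lipschitz in the l2 norm."""
    return [xi - bi for xi, bi in zip(x, b)]


def simplex_oracle(grad):
    """Linear oracle over the probability simplex: vertex at the smallest gradient entry."""
    best = min(range(len(grad)), key=lambda j: grad[j])
    return [1.0 if j == best else 0.0 for j in range(len(grad))]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def l2_distance(u, v):
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def random_simplex_point(rng, d):
    """Random point of the simplex obtained by normalizing positive weights."""
    weights = [rng.random() + 1e-9 for _ in range(d)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
d = 8
diameter = math.sqrt(2.0)  # l2 diameter of the probability simplex
lipschitz = 1.0            # gradient Lipschitz constant of the assumed objective
errors_within_bound = True
for _ in range(200):
    b = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    x = random_simplex_point(rng, d)        # current iterate
    x_stale = random_simplex_point(rng, d)  # iterate the delayed worker saw
    g = gradient(x, b)
    s_exact = simplex_oracle(g)
    s_stale = simplex_oracle(gradient(x_stale, b))
    # Additive error of the stale answer, measured against the current gradient.
    error = dot(s_stale, g) - dot(s_exact, g)
    bound = diameter * lipschitz * l2_distance(x, x_stale)
    if not (-1e-12 <= error <= bound + 1e-12):
        errors_within_bound = False
```

The error is nonnegative because the exact answer minimizes the linear form at the current gradient, and it shrinks with the distance between the current and stale iterates, which is why a bound on how far the iterate moves during the delay translates into an additive oracle error.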
Lemma 3.
Let , be a norm, , be the gradient Lipschitz constant with respect to the given norm . Let be the maximum staleness of the gradient, be the largest stepsize in the past steps. Then
Proof.
Because minimizes over , we can write
Using this and Hölder’s inequality, we can write
It remains to bound .
where we used the fact that is at most steps away from . Assume is the stepsize used and