Quantized Frank-Wolfe: Faster Optimization, Lower Communication, and Projection Free


Mingrui Zhang (mingrui.zhang@yale.edu)
Department of Statistics and Data Science, Yale University, New Haven, CT 06511

Lin Chen (lin.chen@yale.edu)
Yale Institute for Network Science and Department of Electrical Engineering, Yale University, New Haven, CT 06511

Aryan Mokhtari (aryanm@mit.edu)
Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139

Hamed Hassani (hassani@seas.upenn.edu)
Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104

Amin Karbasi (amin.karbasi@yale.edu)
Department of Electrical Engineering and Computer Science, Yale University, New Haven, CT 06511
Abstract

How can we efficiently mitigate the overhead of gradient communications in distributed optimization? This problem is at the heart of training scalable machine learning models and has been mainly studied in the unconstrained setting. In this paper, we propose Quantized Frank-Wolfe (QFW), the first projection-free and communication-efficient algorithm for solving constrained optimization problems at scale. We consider both convex and non-convex objective functions, expressed as a finite sum or, more generally, a stochastic optimization problem, and provide strong theoretical guarantees on the convergence rate of QFW. This is accomplished by proposing novel quantization schemes that efficiently compress gradients while controlling the variance of the noise introduced during this process. Finally, we empirically validate the efficiency of QFW, in terms of communication and the quality of the returned solution, against natural baselines.


1 Introduction

The Frank-Wolfe (FW) method (Frank and Wolfe, 1956), also known as the conditional gradient method, has recently received considerable attention in the machine learning community as a projection-free algorithm for various constrained convex (Jaggi, 2013; Garber and Hazan, 2014; Lacoste-Julien and Jaggi, 2015; Garber and Hazan, 2015; Hazan and Luo, 2016; Mokhtari et al., 2018b) and non-convex (Lacoste-Julien, 2016; Reddi et al., 2016) optimization problems. In order to apply the FW algorithm to large-scale problems, e.g., training deep neural networks (Ravi et al., 2018; Schramowski et al., 2018; Berrada et al., 2018) or RBMs (Ping et al., 2016), parallelization is unavoidable. To this end, distributed FW variants have been proposed for specific problems, e.g., online learning (Zhang et al., 2017), learning low-rank matrices (Zheng et al., 2018), and optimization under block-separable constraint sets (Wang et al., 2016). A significant performance bottleneck of distributed optimization methods is the cost of communicating gradients, typically handled by using a parameter-server framework. Intuitively, if each worker in the distributed system transmits the entire gradient, then at least $d$ floating-point numbers are communicated by each worker, where $d$ is the dimension of the problem. This communication cost can be a huge burden on the performance of parallel optimization algorithms (Chilimbi et al., 2014; Seide et al., 2014; Strom, 2015). To circumvent this drawback, communication-efficient parallel algorithms have received significant attention. One major approach is to quantize the gradients while maintaining sufficient information (De Sa et al., 2015; Abadi et al., 2016; Wen et al., 2017). For unconstrained optimization, where no projection is required to implement Stochastic Gradient Descent (SGD), several communication-efficient distributed methods have been proposed, including QSGD (Alistarh et al., 2017), SIGN-SGD (Bernstein et al., 2018), and Sparsified-SGD (Stich et al., 2018).

In the constrained setting, and in particular for distributed FW algorithms, communication-efficient versions have only been studied for specific problems such as sparse learning (Bellet et al., 2015; Lafond et al., 2016). In this paper, in contrast, we develop Quantized Frank-Wolfe (QFW), a general communication-efficient distributed FW framework for both convex and non-convex objective functions. We study the performance of QFW in two widely recognized settings: 1) stochastic and 2) finite-sum optimization.

Let $\mathcal{C}$ be the constraint set. For constrained stochastic optimization, the goal is to solve

$\min_{x \in \mathcal{C}} F(x) := \mathbb{E}_{z \sim P}[\tilde{F}(x, z)],$   (1)

where $x$ is the optimization variable and $z$ is a random variable drawn from a distribution $P$, which determines the choice of a stochastic function $\tilde{F}(x, z)$. For constrained finite-sum optimization, we further assume that $P$ is a uniform distribution over $n$ component functions, and the goal is to solve a special case of Problem (1), namely,

$\min_{x \in \mathcal{C}} F(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x).$   (2)

In parallel settings, we suppose that there is a master and $N$ workers, and each worker maintains a local copy of the optimization variable. At every iteration of the stochastic case, each worker has access to independent stochastic gradients of $F$; in the finite-sum case, we assume that the $n$ component functions are partitioned evenly among the workers, so that the objective decomposes into per-worker averages and each worker has access to the exact gradients of its own component functions.

Figure 1: Stages of our general Quantized Frank-Wolfe scheme at time $t$. In the first stage, each worker computes its local gradient estimate and sends a quantized version to the master node. In the second stage, the master computes the average of the decoded received signals and sends a quantized version of this average back to the workers. In the third stage, workers combine the decoded gradient average with their previous gradient estimate to form a new gradient estimate via a variance reduction (VR) scheme. Once the variance-reduced gradient approximation is evaluated, workers compute the new variable by following the Frank-Wolfe (FW) update.

This way, the task of computing gradients is divided among the workers. The master node aggregates the local gradients from the workers and sends the aggregated gradient back to them, so that each worker can update the model (i.e., its own iterate) locally. Thus, by transmitting quantized gradients, we can reduce the communication complexity (i.e., the number of transmitted bits) significantly. The workflow of the distributed quantization scheme is summarized in Figure 1. Finally, we should highlight that there is a trade-off between gradient quantization and information flow: more aggressive quantization reduces the communication cost, but it also discards more information, which may slow down convergence.
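As a rough, purely illustrative comparison (the per-coordinate budget $b$ below is generic and not tied to any experiment in this paper), the per-worker cost of one communication round is

$\underbrace{32\, d}_{\text{full-precision (32-bit floats)}} \quad \text{vs.} \quad \underbrace{b\, d + 32}_{\text{$b$ bits per coordinate plus one 32-bit scalar}} \quad \text{bits},$

so for a small per-coordinate budget $b$, the communication load drops by roughly a factor of $32/b$.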

Our contributions: In this paper, we propose a novel distributed projection-free framework that handles quantization for constrained convex and non-convex optimization problems in both stochastic and finite-sum cases. It is well known that, unlike projected gradient-based methods, FW methods may diverge when fed with stochastic gradients (Mokhtari et al., 2018b). A similar issue arises in the distributed setting, where nodes exchange quantized gradients, which are noisy estimates of the true gradients. By incorporating appropriate variance reduction techniques in each setting, we show that, even with quantized gradients, we obtain a provably convergent method that preserves the convergence rates of the vanilla unquantized method in most cases. To the best of our knowledge, this work presents the first quantized, distributed, and projection-free method, in contrast to all previous works, which consider quantization only in the unconstrained setting. Our theoretical results for Quantized Frank-Wolfe (QFW) are summarized in Table 1, where the SFO complexity is the required number of stochastic gradients in the stochastic case and the IFO complexity is the required number of exact gradients of component functions in the finite-sum case. More specifically, we show that (i) QFW improves the IFO complexity of the SVRF method (Hazan and Luo, 2016) in the finite-sum convex case by using the recently proposed SPIDER variance reduction technique; (ii) QFW preserves the SFO/IFO complexities of the SFW algorithm (Mokhtari et al., 2018b) in the stochastic convex case and of the accelerated NFWU method (Shen et al., 2019) in the finite-sum non-convex case; and (iii) QFW has a slightly worse SFO complexity than SVFW-S (Reddi et al., 2016) in the stochastic non-convex case, while using only quantized gradients.

Setting Function SFO/IFO Complexity Average Bits
stoch. convex
stoch. non-convex
finite-sum convex
finite-sum non-convex
Table 1: SFO/IFO complexity and average number of communicated bits per round in the different settings, where $N$ is the number of workers.

2 Gradient Quantization Schemes

As mentioned earlier, the communication cost can be reduced effectively by sending quantized gradients. In this section, we introduce a quantization scheme called the s-Partition Encoding Scheme. Consider a gradient vector $g \in \mathbb{R}^d$ and let $g_k$ be the $k$-th coordinate of the gradient. The s-Partition Encoding Scheme randomly encodes each coordinate, together with its sign, into one of the magnitude levels $\{0, 1/s, \dots, 1\}$. To do so, we first compute the ratio $|g_k|/\|g\|_\infty$ and find the index $l_k$ such that $l_k/s \le |g_k|/\|g\|_\infty \le (l_k+1)/s$. Then we define the random variable $h_k(g)$ as

$h_k(g) = \begin{cases} (l_k+1)/s & \text{with probability } s\,|g_k|/\|g\|_\infty - l_k, \\ l_k/s & \text{otherwise.} \end{cases}$   (3)

Finally, instead of transmitting $g_k$ itself, we send its sign and the level $h_k(g)$, alongside the norm $\|g\|_\infty$. It can be verified that $\mathbb{E}[h_k(g)] = |g_k|/\|g\|_\infty$, so we define the corresponding decoding scheme as $\hat{g}_k = \|g\|_\infty \, \mathrm{sign}(g_k)\, h_k(g)$ to ensure that $\hat{g}_k$ is an unbiased estimator of $g_k$. We note that this quantization scheme is similar to the Stochastic Quantization method in (Alistarh et al., 2017), except that we use the $\ell_\infty$-norm while they adopt the $\ell_2$-norm. In the s-Partition Encoding Scheme, for each coordinate $k$, we need 1 bit to transmit the sign of $g_k$. Moreover, since $h_k(g)$ takes one of $s+1$ values, we need $\lceil \log_2(s+1) \rceil$ bits to send it. Finally, we need 32 bits to transmit $\|g\|_\infty$. Hence, the total number of communicated bits is $d(1 + \lceil \log_2(s+1) \rceil) + 32$. Here, by “bits” we mean the number of 0's and 1's transmitted.
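To make the scheme concrete, the following is a minimal NumPy sketch of one plausible implementation, assuming the QSGD-style stochastic rounding described above with $\ell_\infty$ normalization and levels $\{0, 1/s, \dots, 1\}$; the function names and the packing of signs and levels are illustrative, not the paper's reference implementation.

```python
import numpy as np

def s_partition_encode(g, s, rng=None):
    """Stochastically quantize each coordinate of g to one of s+1 magnitude levels.

    Returns (signs, levels, norm): signs in {-1, 0, +1}, integer levels in {0, ..., s},
    and the l-infinity norm of g (transmitted once as a 32-bit float).
    """
    rng = np.random.default_rng() if rng is None else rng
    norm = float(np.max(np.abs(g)))
    if norm == 0.0:
        return np.zeros(g.shape, np.int8), np.zeros(g.shape, np.int32), 0.0
    r = np.abs(g) / norm                  # ratios in [0, 1]
    lower = np.floor(r * s)               # index l_k with l_k/s <= r_k <= (l_k + 1)/s
    p_up = r * s - lower                  # round up with probability s*r_k - l_k
    levels = (lower + (rng.random(g.shape) < p_up)).astype(np.int32)
    signs = np.sign(g).astype(np.int8)
    return signs, levels, norm

def s_partition_decode(signs, levels, norm, s):
    """Unbiased reconstruction: E[decode(encode(g))] = g under the rounding above."""
    return norm * signs.astype(float) * levels.astype(float) / s
```

Under this construction, each coordinate costs one sign bit plus $\lceil \log_2(s+1) \rceil$ bits for its level, matching the bit count above, and the variance of each decoded coordinate is at most $\|g\|_\infty^2 / (4s^2)$.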

One major advantage of the s-Partition Encoding Scheme is that, by tuning the partition parameter $s$ (or, equivalently, the number of bits assigned to each coordinate), we can smoothly control the trade-off between gradient quantization and information loss, which helps distributed algorithms attain their best performance. We proceed to characterize the variance of the s-Partition Encoding Scheme.

Lemma 1

The variance of the s-Partition Encoding Scheme for any $g \in \mathbb{R}^d$ is bounded by

(4)

If we set $s = 1$, we obtain the Sign Encoding Scheme, which requires communicating only the encoded scalars and the norm $\|g\|_\infty$. Since each encoded scalar needs only a constant number of bits, the overall number of communicated bits for each worker is $O(d)$ per round. We characterize its variance in Lemma 2.
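Continuing the sketch above (with the same caveats), the Sign Encoding Scheme is simply the case $s = 1$:

```python
import numpy as np

g = np.array([0.3, -1.2, 0.0, 0.7])
# Uses s_partition_encode / s_partition_decode from the sketch above.
signs, levels, norm = s_partition_encode(g, s=1)
g_hat = s_partition_decode(signs, levels, norm, s=1)   # entries in {-norm, 0, +norm}
```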

Lemma 2

The variance of the Sign Encoding Scheme is given by

(5)
Remark 1

For the probability distribution of the random variable $h_k(g)$, instead of the $\ell_\infty$-norm we can use other $\ell_p$-norms (where $p \ge 1$). However, it can be verified that the $\ell_\infty$-norm leads to the smallest variance for the Sign Encoding Scheme, which is also the reason why we do not use the $\ell_2$-norm as in (Alistarh et al., 2017).

3 Stochastic Optimization

In this section, we aim to solve the constrained stochastic optimization problem defined in (1) in a distributed fashion. In particular, we are interested in projection-free (Frank-Wolfe-type) methods and apply quantization to reduce the communication cost between the workers and the master. Recall that we assume that, at each round, each worker has access to an unbiased estimator of the gradient of the objective function $F$, and that these stochastic gradients are independent of each other.

1:  Input: constraint set $\mathcal{C}$, iteration horizon $T$, initial point $x_1 \in \mathcal{C}$, momentum parameters $\rho_t$, step sizes $\eta_t$
2:  Output: $x_{T+1}$ or $x_o$, where $o$ is chosen from $\{1, \dots, T\}$ uniformly at random
3:  for $t = 1$ to $T$ do
4:     Each worker computes an independent stochastic gradient of $F$ at $x_t$
5:     Each worker encodes its local gradient and pushes the encoded message to the master
6:     Master decodes the received messages
7:     Master computes the average of the decoded gradients
8:     Master encodes this average and broadcasts it to all the workers
9:     Workers decode the broadcast message to obtain $\bar{g}_t$
10:     Workers compute the momentum-based gradient estimate $\bar{d}_t$ via (6)
11:     Workers update $x_{t+1} = x_t + \eta_t (v_t - x_t)$, where $v_t = \arg\min_{v \in \mathcal{C}} \langle v, \bar{d}_t \rangle$
12:  end for
Algorithm 1 Stochastic Quantized Frank-Wolfe (S-QFW)

In our proposed Stochastic Quantized Frank-Wolfe (S-QFW) method, at iteration $t$, each worker first computes its local stochastic gradient. It then encodes this gradient, so that it can be transmitted at a low communication cost, and pushes the encoded message to the master. Once the master receives all the coded stochastic gradients, it uses the corresponding decoding scheme to recover them; by design, each of the decoded signals is an unbiased estimator of the gradient of the objective function. The master then evaluates the average of the decoded signals, encodes this average using a proper quantization scheme, and broadcasts the coded signal to all the workers. The workers decode the received signal to obtain the network-average gradient estimate $\bar{g}_t$ and use it to improve their gradient approximation.

Note that, even in the unquantized setting, if we simply replace the exact gradient with a stochastic gradient, Frank-Wolfe may diverge (Mokhtari et al., 2018b). As a result, we need to further reduce the variance. To do so, each worker maintains a momentum-based local vector $\bar{d}_t$ to update the iterates, defined by

$\bar{d}_t = (1 - \rho_t)\, \bar{d}_{t-1} + \rho_t\, \bar{g}_t,$   (6)

where $\rho_t \in [0, 1]$ is a momentum parameter. As the update of $\bar{d}_t$ in (6) computes a weighted average of the previous gradient approximation $\bar{d}_{t-1}$ and the updated network-average stochastic gradient $\bar{g}_t$, it has a lower variance than the vector $\bar{g}_t$ itself. The key fact that allows us to prove convergence is that the estimation error of $\bar{d}_t$ approaches zero as time passes (see Appendix C). After computing the gradient estimate based on (6), workers update their variables by following the FW scheme, i.e., $x_{t+1} = x_t + \eta_t (v_t - x_t)$, where $v_t = \arg\min_{v \in \mathcal{C}} \langle v, \bar{d}_t \rangle$. S-QFW is outlined in Algorithm 1. Finally, note that different quantization schemes can be used in S-QFW, leading to different convergence rates and communication costs.
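For concreteness, one round of S-QFW can be sketched as follows, simulated on a single machine; the arguments encode, decode, and lmo are caller-supplied placeholders, and the helper name is ours, not the paper's.

```python
import numpy as np

def sqfw_round(x, d_prev, worker_grads, encode, decode, lmo, rho, eta):
    """One round of Stochastic Quantized Frank-Wolfe, simulated on one machine.

    worker_grads  : list of stochastic gradients (one per worker) evaluated at x.
    encode, decode: a quantization scheme, e.g. the s-Partition scheme of Section 2.
    lmo           : linear minimization oracle, lmo(d) = argmin_{v in C} <v, d>.
    rho, eta      : momentum parameter and Frank-Wolfe step size for this round.
    """
    # Stage 1: workers push quantized local gradients; the master decodes and averages them.
    g_avg = np.mean([decode(encode(g)) for g in worker_grads], axis=0)
    # Stage 2: the master broadcasts a quantized version of the average; workers decode it.
    g_bar = decode(encode(g_avg))
    # Stage 3: momentum-based variance reduction, eq. (6), then the Frank-Wolfe step.
    d_new = (1.0 - rho) * d_prev + rho * g_bar
    v = lmo(d_new)
    return x + eta * (v - x), d_new
```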

Now we proceed to analyze S-QFW and first focus on convex settings.

Assumption 1

The constraint set $\mathcal{C}$ is convex and compact, with diameter $D = \max_{x, y \in \mathcal{C}} \|x - y\|$.

Assumption 2

The function $F$ is convex, bounded, and $L$-smooth over $\mathcal{C}$.

Assumption 3

For each worker and iteration, the stochastic gradients are unbiased estimators of the gradient of $F$ at the current iterate and have uniformly bounded variance.

Assumption 4

For any iteration $t$ and the vectors generated by Stochastic Quantized Frank-Wolfe, the quantization scheme produces unbiased estimates whose variance is uniformly bounded.

Theorem 1

[Convex] Under Assumptions 1 to 4, with a suitable choice of the momentum parameters $\rho_t$ and step sizes $\eta_t$ in Algorithm 1, after $T$ iterations the output satisfies

where $x^*$ is a global minimizer of $F$ over $\mathcal{C}$.

Theorem 1 shows that the suboptimality gap of S-QFW converges to zero at a sublinear rate. Hence, after running sufficiently many iterations, we can find a solution that is arbitrarily close to the optimum. We also characterize the exact complexity bound for S-QFW when the Sign Encoding Scheme is used for quantization, and show how many communication rounds are needed to obtain an $\epsilon$-accurate solution; this result is presented in Appendix E due to space limits. Note that, as each communication round of the Sign Encoding Scheme requires $O(d)$ bits, the corresponding bound translates directly into the overall communication cost of finding an $\epsilon$-suboptimal solution.

With slightly different parameters, S-QFW can be applied to non-convex settings as well. In unconstrained non-convex optimization problems, the gradient norm is usually a good measure of convergence, since a vanishing gradient implies convergence to a stationary point. In the constrained setting, however, we study the Frank-Wolfe gap (Jaggi, 2013; Lacoste-Julien, 2016), defined as

$\mathcal{G}(x) = \max_{v \in \mathcal{C}} \langle v - x, -\nabla F(x) \rangle.$   (7)

For the constrained optimization problem (1), if a point $x \in \mathcal{C}$ satisfies $\mathcal{G}(x) = 0$, then it is a first-order stationary point. Also, by definition, we have $\mathcal{G}(x) \ge 0$ for all $x \in \mathcal{C}$. We analyze the convergence rate of Algorithm 1 under the following assumption on the objective function $F$.

Assumption 5

The function $F$ is bounded and $L$-smooth over $\mathcal{C}$.

Theorem 2

[Non-convex] Under Assumptions 1, 3, 4, and 5, and given the iteration horizon $T$, with a suitable choice of the parameters in Algorithm 1, the output satisfies

Theorem 2 indicates that, in the non-convex setting, S-QFW finds an $\epsilon$-first-order stationary point after sufficiently many iterations. When the Sign Encoding Scheme is used, each round of communication requires $O(d)$ bits; therefore, the number of rounds needed to find an $\epsilon$-first-order stationary point directly determines the overall communication cost.

4 Finite-Sum Optimization

In this section, we analyze the finite-sum problem defined in (2). Recall that there are $n$ component functions and $N$ workers in total, and that each worker has access to its own subset of the component functions. The major difference from the stochastic setting is that we can use a more aggressive variance reduction scheme when communicating quantized gradients. Nguyen et al. (2017a, b, 2019) developed the StochAstic Recursive grAdient algoritHm (SARAH), a stochastic recursive gradient update framework. Recently, Fang et al. (2018) proposed the Stochastic Path-Integrated Differential Estimator (SPIDER) technique, a variant of SARAH, for unconstrained optimization in centralized settings. In this paper, we generalize SPIDER to the constrained and distributed settings.

We first consider the case where no quantization is performed. Let $K$ be a period parameter. At the beginning of each period, namely when $t \bmod K = 0$, each worker computes the average of all its local gradients and sends it to the master. The master then calculates the average of the received signals and broadcasts it to all workers, and the workers update their gradient estimate with this exact average.

Note that this gradient estimate is identical across all the workers. In the rest of the period, i.e., when $t \bmod K \neq 0$, each worker samples a set of its local component functions of a fixed size uniformly at random, computes the average of the corresponding gradients, and sends it to the master. The master then calculates the average of these signals and broadcasts it to all the workers, and the workers update their gradient estimate as

(8)

So the gradient estimate is still identical across all the workers. To incorporate quantization, each worker simply pushes a quantized version of its average gradient. The master then decodes the received quantizations, encodes the average of the decoded signals in a quantized fashion, and broadcasts this quantized average. Finally, each worker decodes the broadcast signal and updates its gradient estimate locally. The full description of our proposed Finite-Sum Quantized Frank-Wolfe (F-QFW) algorithm is given in Algorithm 2.

1:  Input: constraint set $\mathcal{C}$, iteration horizon $T$, number of workers $N$, initial point $x_1 \in \mathcal{C}$, period parameter $K$, sample size $b$, step sizes $\eta_t$
2:  Output: $x_{T+1}$ or $x_o$, where $o$ is chosen from $\{1, \dots, T\}$ uniformly at random
3:  for $t = 1$ to $T$ do
4:     if $t \bmod K = 0$ then
5:        Each worker computes the exact average gradient of its local component functions at $x_t$
6:        Each worker encodes this gradient and pushes the encoded message to the master
7:        Master decodes the received messages
8:        Master computes the average of the decoded gradients
9:        Master encodes this average and broadcasts it to all workers
10:        Workers decode the broadcast message and set their gradient estimate $\bar{d}_t$ accordingly
11:     else
12:        Each worker samples $b$ of its local component functions uniformly at random
13:        Each worker computes the exact gradients of the sampled component functions
14:        Each worker encodes the resulting local quantity and pushes it to the master
15:        Master decodes the received signals
16:        Master computes their average
17:        Master encodes the average and broadcasts it to all workers
18:        Workers decode the broadcast message and update their gradient estimate $\bar{d}_t$ via (8)
19:     end if
20:     Each worker updates locally by $x_{t+1} = x_t + \eta_t (v_t - x_t)$, where $v_t = \arg\min_{v \in \mathcal{C}} \langle v, \bar{d}_t \rangle$
21:  end for
Algorithm 2 Finite-Sum Quantized Frank-Wolfe (F-QFW)
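For intuition, the gradient-estimation part of Algorithm 2 can be sketched as follows. This reflects our reading of the SPIDER-style recursion; the exact form of the update (8), the helper names, and the handling of the mini-batch are assumptions, not the paper's reference code.

```python
import numpy as np

def fqfw_gradient_estimate(t, K, x_t, x_prev, d_prev,
                           full_local_grads, sampled_grad_diffs, encode, decode):
    """SPIDER-style distributed gradient estimate with quantized communication.

    full_local_grads   : list of callables; worker i returns the exact average gradient
                         of its local component functions at a given point.
    sampled_grad_diffs : list of callables; worker i returns the mini-batch average of
                         grad f_j(x_t) - grad f_j(x_prev) over a uniformly sampled set
                         of its local component functions.
    """
    if t % K == 0:
        # Checkpoint round: quantized full local gradients, averaged and re-quantized by the master.
        local = [decode(encode(g(x_t))) for g in full_local_grads]
        return decode(encode(np.mean(local, axis=0)))
    # Inner rounds: recursively correct the previous estimate with averaged gradient differences.
    local = [decode(encode(diff(x_t, x_prev))) for diff in sampled_grad_diffs]
    return d_prev + decode(encode(np.mean(local, axis=0)))
```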

To analyze the convex case, we first make an assumption on the component functions.

Assumption 6

The component functions $f_i$ are convex, $L$-smooth on $\mathcal{C}$, and uniformly bounded over $\mathcal{C}$.

Theorem 3

[Convex] Consider F-QFW outlined in Algorithm 2. Recall that each node holds the same number of local component functions and that $b$ denotes the size of the mini-batch used in (8). Under Assumptions 1 and 6, with a suitable choice of the period parameter $K$, the sample size $b$, and the step sizes, and with appropriately chosen s-Partition Encoding Schemes as the quantization schemes in Algorithm 2, the output satisfies

where $x^*$ is a minimizer of $F$ over $\mathcal{C}$.

Theorem 3 indicates that, in the convex setting, if we use the recommended quantization schemes, then the output of Finite-Sum Quantized Frank-Wolfe is $\epsilon$-suboptimal after a bounded number of rounds, which also bounds the Linear-optimization Oracle (LO) complexity, since one linear optimization is performed per round. The total Incremental First-order Oracle (IFO) complexity and the average number of communicated bits per round follow from the chosen quantization levels and are reported in Table 1.

Algorithm 2 can also be applied to the non-convex setting with a slight change in parameters. We first make a standard assumption on the component functions.

Assumption 7

The component functions $f_i$ are $L$-smooth on $\mathcal{C}$ and uniformly bounded over $\mathcal{C}$.

Theorem 4

[Non-convex] Under Assumptions 1 and 7, with a suitable choice of the period parameter $K$, the sample size $b$, and the step sizes, and with appropriately chosen s-Partition Encoding Schemes as the quantization schemes in Algorithm 2, the output satisfies

Theorem 4 shows that, for non-convex minimization, if we adopt the recommended quantization schemes, then Algorithm 2 finds an $\epsilon$-first-order stationary point after a bounded number of rounds; the corresponding total IFO complexity and the average number of communicated bits per round are reported in Table 1.

5 Experiments

We evaluate the performance of the algorithms by plotting their optimality gap (for convex settings) or their loss (for non-convex settings), as well as their test accuracy, versus the number of transmitted bits. The experiments were performed on 20 Intel Xeon E5-2660 cores, so the number of workers is 20. For each curve in the figures below, we ran at least 50 repeated experiments, and the height of the shaded regions represents two standard deviations.

In our first setup, we consider a multinomial logistic regression problem. Consider a dataset whose samples belong to several different classes. We aim to find a model that classifies these sample points under the condition that the solution has a small norm. Therefore, we aim to solve the following convex problem

(9)

In our experiments, we use the MNIST and CIFAR-10 datasets. For the MNIST dataset, we assume that the training images are split evenly among the workers. The results on CIFAR-10 are similar and are deferred to Appendix J.
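For intuition on the projection-free update used in this experiment, the linear minimization oracle has a simple closed form when the constraint set is a norm ball. The sketch below assumes, for concreteness, an $\ell_1$-ball constraint; the choice of constraint and radius is illustrative, not necessarily the one used in our runs.

```python
import numpy as np

def lmo_l1_ball(grad, radius):
    """argmin over {v : ||v||_1 <= radius} of <v, grad>: place all the mass on the
    coordinate with the largest absolute gradient, with the opposite sign."""
    v = np.zeros_like(grad, dtype=float)
    k = int(np.argmax(np.abs(grad)))
    v[k] = -radius * np.sign(grad[k])
    return v

# One Frank-Wolfe step with step size eta and the current gradient estimate d_t:
#   x = x + eta * (lmo_l1_ball(d_t, radius) - x)
```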

In our second setup, our goal is to minimize the loss of a three-layer neural network under constraints on the norm of the solution. Before stating the problem precisely, let us define the log-loss of a predicted probability vector with respect to the true label as the negative log-probability assigned to that label. We aim to solve the following non-convex problem

(10)

where the hidden layer applies the sigmoid function and the output layer applies the softmax function. The imposed constraint on the weights leads to a sparse network. We further remark that Frank-Wolfe methods are well suited for training a neural network subject to such a norm constraint, as this is equivalent to a form of dropout regularization (Ravi et al., 2018). We use the MNIST and CIFAR-10 datasets; for the MNIST dataset, the training images are split evenly among the workers. We obtain similar results on CIFAR-10 and discuss them in Appendix J.
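As a concrete, illustrative instance of the objective in (10), a NumPy sketch of the forward pass and log-loss of such a network could look as follows; the variable names and layer shapes are ours, not the paper's.

```python
import numpy as np

def three_layer_logloss(W1, W2, X, Y):
    """Average log-loss of a network with one sigmoid hidden layer and a softmax output.

    X : (batch, d_in) inputs.    Y : (batch, n_classes) one-hot labels.
    W1: (d_in, d_hidden).        W2: (d_hidden, n_classes).
    """
    H = 1.0 / (1.0 + np.exp(-X @ W1))               # sigmoid hidden layer
    Z = H @ W2
    Z = Z - Z.max(axis=1, keepdims=True)            # numerically stable softmax
    P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))
```

In QFW, the quantized, variance-reduced gradient of this loss with respect to (W1, W2) would be passed to the linear minimization oracle over the norm ball, in place of a projection step.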

(a) Optimality gap vs. bits transmitted
(b) Testing accuracy vs. bits transmitted
Figure 2: Comparison in terms of optimality gap (left) and test accuracy (right) versus the number of transmitted bits for a multinomial logistic regression problem. The best performance belongs to QFW with the Sign Encoding Scheme, and FW without quantization has the worst performance.
(a) Loss vs. bits transmitted
(b) Testing accuracy vs. bits transmitted
Figure 3: Comparison of the algorithms in terms of loss (left) and test accuracy (right) versus the number of transmitted bits for a three-layer neural network. FW without quantization significantly underperforms the quantized FW methods.

In our third setup, we study a multi-task least-squares regression problem (Zheng et al., 2018). Its setting and results are discussed in Appendix J.

For all of the considered settings, we vary the quantization level $s$ of the s-Partition Encoding Scheme and include FW without quantization as a baseline. We also propose SignFW, an effective heuristic based on QFW, in which the norm of the gradient is discarded and only the sign of each coordinate is transmitted. Even though this method may not enjoy the strong theoretical guarantees of QFW (and may even diverge), we observed in our experiments that it performs on par with QFW in practice. Let us emphasize that the proposed SignFW algorithm is similar to QFW with the Sign Encoding Scheme, except that the gradient norm is not transmitted and only the signs are (see Section 2).

In Figure 2, we observe the performance of SignFW, FW without quantization, and different variants of QFW for solving the multinomial logistic regression problem in (9). QFW with the Sign Encoding Scheme has the best performance, and all quantized variants of FW outperform the FW method without quantization, both in terms of training error and test accuracy. Specifically, QFW with the Sign Encoding Scheme requires the fewest transmitted bits to hit the lowest optimality gap in Figure 2(a), while QFW with larger quantization levels requires more bits to achieve the same error. Furthermore, FW without quantization requires substantially more bits to reach the same error; that is, quantization reduces the communication load by at least an order of magnitude.

Figure 3 demonstrates the performance of SignFW, FW without quantization, and different variants of QFW for training the three-layer neural network in (10). The relative behavior of the considered methods in Figure 3 is similar to that in Figure 2: QFW with the Sign Encoding Scheme attains a given loss level after transmitting the fewest bits, SignFW and QFW with larger quantization levels need more bits to attain the same loss, and if no quantization is applied, the number of required bits grows by at least two orders of magnitude. The same ordering holds for the number of bits required to reach a given test accuracy.

6 Conclusion

In this paper, we developed Quantized Frank-Wolfe (QFW), the first general-purpose projection-free and communication-efficient framework for constrained optimization. Along with proposing various quantization schemes, QFW addresses both convex and non-convex optimization settings in stochastic and finite-sum cases. We provided theoretical guarantees on the convergence rate of QFW and validated its efficiency empirically on training multinomial logistic regression models and neural networks. Our theoretical results highlight the importance of variance reduction techniques for stabilizing Frank-Wolfe and achieving a favorable trade-off between communication complexity and convergence rate in distributed settings.

References

  • Abadi et al. [2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
  • Alistarh et al. [2017] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
  • Bellet et al. [2015] Aurélien Bellet, Yingyu Liang, Alireza Bagheri Garakani, Maria-Florina Balcan, and Fei Sha. A distributed frank-wolfe algorithm for communication-efficient sparse learning. In Proceedings of the 2015 SIAM International Conference on Data Mining, pages 478–486. SIAM, 2015.
  • Bernstein et al. [2018] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Anima Anandkumar. signsgd: compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434, 2018.
  • Berrada et al. [2018] Leonard Berrada, Andrew Zisserman, and M Pawan Kumar. Deep frank-wolfe for neural network optimization. arXiv preprint arXiv:1811.07591, 2018.
  • Chilimbi et al. [2014] Trishul M Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project adam: Building an efficient and scalable deep learning training system. In OSDI, volume 14, pages 571–582, 2014.
  • De Sa et al. [2015] Christopher M De Sa, Ce Zhang, Kunle Olukotun, and Christopher Ré. Taming the wild: A unified analysis of hogwild-style algorithms. In Advances in neural information processing systems, pages 2674–2682, 2015.
  • Fang et al. [2018] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 687–697, 2018.
  • Frank and Wolfe [1956] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics (NRL), 3(1-2):95–110, 1956.
  • Garber and Hazan [2014] Dan Garber and Elad Hazan. Faster rates for the frank-wolfe method over strongly-convex sets. arXiv preprint arXiv:1406.1305, 2014.
  • Garber and Hazan [2015] Dan Garber and Elad Hazan. Faster rates for the frank-wolfe method over strongly-convex sets. In ICML, volume 15, pages 541–549, 2015.
  • Hazan and Luo [2016] Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. In ICML, pages 1263–1271, 2016.
  • Jaggi [2013] Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In ICML, pages 427–435, 2013.
  • Lacoste-Julien [2016] Simon Lacoste-Julien. Convergence rate of frank-wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345, 2016.
  • Lacoste-Julien and Jaggi [2015] Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of frank-wolfe optimization variants. In Advances in Neural Information Processing Systems, pages 496–504, 2015.
  • Lafond et al. [2016] Jean Lafond, Hoi-To Wai, and Eric Moulines. D-fw: Communication efficient distributed algorithms for high-dimensional sparse optimization. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 4144–4148. IEEE, 2016.
  • Mokhtari et al. [2018a] Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Conditional gradient method for stochastic submodular maximization: Closing the gap. In AISTATS, pages 1886–1895, 2018a.
  • Mokhtari et al. [2018b] Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Stochastic conditional gradient methods: From convex minimization to submodular maximization. arXiv preprint arXiv:1804.09554, 2018b.
  • Nguyen et al. [2017a] Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2613–2621. JMLR. org, 2017a.
  • Nguyen et al. [2017b] Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Stochastic recursive gradient algorithm for nonconvex optimization. arXiv preprint arXiv:1705.07261, 2017b.
  • Nguyen et al. [2019] Lam M Nguyen, Marten van Dijk, Dzung T Phan, Phuong Ha Nguyen, Tsui-Wei Weng, and Jayant R Kalagnanam. Optimal finite-sum smooth non-convex optimization with sarah. arXiv preprint arXiv:1901.07648, 2019.
  • Ping et al. [2016] Wei Ping, Qiang Liu, and Alexander T Ihler. Learning infinite rbms with frank-wolfe. In Advances in Neural Information Processing Systems, pages 3063–3071, 2016.
  • Ravi et al. [2018] Sathya N Ravi, Tuan Dinh, Vishnu Sai Rao Lokhande, and Vikas Singh. Constrained deep learning using conditional gradient and applications in computer vision. arXiv preprint arXiv:1803.06453, 2018.
  • Reddi et al. [2016] Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic frank-wolfe methods for nonconvex optimization. arXiv preprint arXiv:1607.08254, 2016.
  • Schramowski et al. [2018] Patrick Schramowski, Christian Bauckhage, and Kristian Kersting. Neural conditional gradients. arXiv preprint arXiv:1803.04300, 2018.
  • Seide et al. [2014] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  • Shen et al. [2019] Zebang Shen, Cong Fang, Peilin Zhao, Junzhou Huang, and Hui Qian. Complexities in projection-free stochastic non-convex minimization. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2868–2876, 2019.
  • Stich et al. [2018] Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified sgd with memory. In Advances in Neural Information Processing Systems, pages 4452–4463, 2018.
  • Strom [2015] Nikko Strom. Scalable distributed dnn training using commodity gpu cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • Wang et al. [2016] Yu-Xiang Wang, Veeranjaneyulu Sadhanala, Wei Dai, Willie Neiswanger, Suvrit Sra, and Eric Xing. Parallel and distributed block-coordinate frank-wolfe algorithms. In International Conference on Machine Learning, pages 1548–1557, 2016.
  • Wen et al. [2017] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems, pages 1509–1519, 2017.
  • Zhang et al. [2017] Wenpeng Zhang, Peilin Zhao, Wenwu Zhu, Steven CH Hoi, and Tong Zhang. Projection-free distributed online learning in networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 4054–4062. JMLR. org, 2017.
  • Zheng et al. [2018] Wenjie Zheng, Aurélien Bellet, and Patrick Gallinari. A distributed frank–wolfe framework for learning low-rank matrices with the trace norm. Machine Learning, 107(8-10):1457–1475, 2018.

Appendix A Proof of Lemma 1

Proof

For any given vector $g$ and coordinate $k$, the ratio $|g_k|/\|g\|_\infty$ lies in an interval of the form $[l_k/s, (l_k+1)/s]$, where $l_k \in \{0, 1, \dots, s-1\}$. Hence, for that specific coordinate, the following inequalities

(11)

are satisfied. Moreover, based on the probability distribution of $h_k(g)$, we know that

(12)

Therefore, based on the inequalities in (11) and (12) we can write

(13)

Hence, we can show that the variance of the s-Partition Encoding Scheme is upper bounded by