Nonconvex Variance Reduced Optimization with Arbitrary Sampling

Samuel Horváth
King Abdullah University of Science and Technology (KAUST), Kingdom of Saudi Arabia

Peter Richtárik
School of Mathematics, University of Edinburgh, United Kingdom; King Abdullah University of Science and Technology (KAUST), Kingdom of Saudi Arabia; Moscow Institute of Physics and Technology (MIPT), Russia
Abstract

We provide the first importance sampling variants of variance reduced algorithms for empirical risk minimization with non-convex loss functions. In particular, we analyze non-convex versions of SVRG, SAGA and SARAH. Our methods have the capacity to speed up the training process by an order of magnitude compared to the state of the art on real datasets. Moreover, we also improve upon the current minibatch analysis of these methods by proposing importance sampling for minibatches in this setting. Surprisingly, our approach can in some regimes lead to superlinear speedup with respect to the minibatch size, which is not usually present in stochastic optimization. All of the above results follow from a general analysis of the methods which works with arbitrary sampling, i.e., a fully general randomized strategy for the selection of subsets of examples to be sampled in each iteration. Finally, we also perform a novel importance sampling analysis of SARAH in the convex setting.

1 Introduction

Empirical risk minimization (ERM) is a key problem in machine learning, as it underpins the training of supervised learning models for classification and regression, including support vector machines, logistic regression and deep learning. A generic ERM problem has the finite-sum form

\min_{x \in \mathbb{R}^d} \; f(x) := \frac{1}{n} \sum_{i=1}^n f_i(x), \qquad (1)

where $x \in \mathbb{R}^d$ corresponds to the parameters/features defining a model, $f_i(x)$ is the loss of the model associated with data point $i$, and $f(x)$ is the average (empirical) loss across the entire training dataset. In this paper we focus on the case when the functions $f_i$ are $L_i$-smooth but non-convex. We assume the problem has a solution $x^*$.

One of the most popular algorithms for solving (1) is stochastic gradient descent (SGD) [19, 18]. In recent years, tremendous effort was exerted to improve its performance, leading to various enhancements which use acceleration [1], momentum [14], minibatching [35], distributed implementation [16, 15], importance sampling [36, 6, 27, 4], higher-order information [26, 9], and a number of other techniques.

1.1 Variance reduced methods

A particularly important recent advance has to do with the design of variance-reduced (VR) stochastic gradient methods, such as SAG [33], SDCA [34, 32], SVRG [11], S2GD [13], SAGA [7], MISO [17], FINITO [8] and SARAH [22], which operate by modifying the classical stochastic gradient direction in each step of the training process in various clever ways so as to progressively reduce its variance as an estimator of the true gradient. We note that SAG and SARAH, the historically oldest and newest VR methods in this list, respectively, use a biased estimator of the gradient. In theory, all these methods enjoy linear convergence rates on smooth and strongly convex functions, which is in contrast with the slow sublinear rate of SGD. VR methods are also easier to implement, as they do not rely on a decreasing learning rate schedule. VR methods were recently extended to work with (non-strongly) convex losses [13], and more recently also to non-convex losses [29, 28, 2, 23], in all cases leading to the best current rates for (1) in a given function class.
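To make the variance reduction idea concrete, here is the control-variate estimator used by SVRG-type methods (SAG, SDCA and SARAH use different constructions), shown for a uniformly sampled index $i_t$ and a reference point $w$; the unbiasedness and the variance bound below follow from the $L_i$-smoothness of the $f_i$'s and are standard facts, not statements specific to this paper.

```latex
% SVRG-type control-variate estimator at a reference point w (uniform sampling of i_t):
g_t = \nabla f_{i_t}(x_t) - \nabla f_{i_t}(w) + \nabla f(w),
\qquad \mathbb{E}_{i_t}\!\left[g_t\right] = \nabla f(x_t),
\qquad \mathbb{E}_{i_t}\!\left[\|g_t - \nabla f(x_t)\|^2\right] \le \max_i L_i^2 \, \|x_t - w\|^2 .
% Unlike for plain SGD, the variance vanishes as x_t approaches the reference point w.
```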

1.2 Importance sampling, minibatching and non-convex models

In the context of problem (1), importance sampling refers to the technique of assigning carefully designed non-uniform probabilities to the functions $f_1, \dots, f_n$, and using these, as opposed to uniform probabilities, to sample the next data point (stochastic gradient) during the training process.

Despite the huge theoretical and practical success of VR methods, there are still considerable gaps in our understanding. For instance, an importance sampling variant of the popular SAGA method, with the “correct” convergence rate, was only designed very recently [10], and the analysis applies to the strongly convex case only. A coordinate descent variant of SVRG with importance sampling, also in the strongly convex case, was analyzed in [12]. However, that method does not seem to admit a fast implementation. For dual methods based on coordinate descent, importance sampling is relatively well understood [20, 32, 24, 27, 3].

The territory is completely unmapped in the non-convex case, however. To the best of our knowledge, no importance sampling VR methods have been designed or analyzed in the popular case when the functions $f_i$ are non-convex. An exception to this is dfSDCA [5]; however, this method applies to an explicitly regularized version of (1), and while the individual functions $f_i$ are allowed to be non-convex, their average is assumed to be convex. Given the dominance of stochastic gradient type methods in training large non-convex models such as deep neural networks, a theoretical investigation of VR methods that can benefit from importance sampling is much needed.

The situation is worse still when one asks for importance sampling of minibatches. To the best of our knowledge, there are only a handful of papers on this topic [30, 6], none of which apply to the non-convex setting considered here, nor to the methods we analyze, and the problem remains open. This is despite the fact that minibatch methods are the de-facto norm for training deep nets, due to the volume of data that feeds into the training and the necessity to fully utilize the parallel processing power of GPUs and other hardware accelerators. In practice, relatively small (in comparison with $n$) minibatch sizes are typically used.

1.3 Contributions

Algorithm      Uniform sampling      Arbitrary sampling [NEW]
SVRG [11]
SAGA [7]
SARAH [23]
Table 1: Stochastic gradient evaluation complexity for achieving $\mathbb{E}\|\nabla f(x)\|^2 \le \varepsilon$ with two variants each of SVRG, SAGA and SARAH for minimizing the average of $n$ smooth nonconvex functions. The rates involve universal constants and the (average) minibatch size $b$ (hidden in the bounds); the sampling-dependent smoothness quantity appearing in them can, for specific samplings, be smaller than $\max_i L_i$ and decreasing with increasing $b$, which can lead to superlinear speedup in $b$. For SARAH this guarantee holds for one outer loop with minibatch size $b$, where we assume that the minibatch size is not too big compared to the required precision.

The main contributions of this paper are:

  • Arbitrary sampling. We perform a general analysis of three popular VR methods (SVRG [11], SAGA [7] and SARAH [22]) in the arbitrary sampling paradigm [30, 24, 25, 27, 4]. That is, we prove general complexity results which hold for an arbitrary random set-valued mapping (aka arbitrary sampling) generating the minibatches of examples used by the algorithms in each iteration.

  • Rates. Our bounds improve on the best current rates for these methods, for SVRG and SAGA even under the same $L_i$'s. Our importance sampling can be significantly faster than the current state of the art (see Table 1 and Appendix C). Our methods can enjoy linear, or for some specific samplings even superlinear, speedup in the minibatch size. That is, the number of iterations needed to output a solution of a given accuracy drops by a factor equal to or greater than the minibatch size used. This is of utmost relevance to the practice of training neural nets with minibatch stochastic methods, as our results predict that this is to be expected. We design importance sampling and approximate importance sampling for minibatches which in our experiments vastly outperform the standard uniform minibatch strategies.

  • Best rates for SARAH under convexity. Lastly, we also perform an analysis of an importance sampling variant of SARAH in the convex and strongly convex cases (Appendix I). These are currently the fastest rates for SARAH.

2 Importance Sampling for Minibatches

As mentioned in the introduction, we assume throughout that the functions $f_i$ are smooth, but not necessarily convex. In particular, we assume that $f_i$ is $L_i$-smooth; that is,

\|\nabla f_i(x) - \nabla f_i(y)\| \le L_i \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^d,

where $\|\cdot\|$ is the standard Euclidean norm. Let us define $\bar{L} := \frac{1}{n}\sum_{i=1}^n L_i$ and $L_{\max} := \max_i L_i$. Without loss of generality, assume that the $L_i$'s are sorted.

2.1 Samplings

Let $S$ be a random set-valued mapping (“sampling”) with values in $2^{[n]}$, where $[n] := \{1, 2, \dots, n\}$. A sampling $S$ is uniquely defined by assigning probabilities to all $2^n$ subsets of $[n]$. With each sampling $S$ we associate a probability matrix $\mathbf{P} \in \mathbb{R}^{n \times n}$ defined by

\mathbf{P}_{ij} := \mathbb{P}(\{i, j\} \subseteq S).

The probability vector $p = (p_1, \dots, p_n)$ associated with $S$ is the vector composed of the diagonal entries of $\mathbf{P}$: $p_i = \mathbb{P}(i \in S)$. We say that $S$ is proper if $p_i > 0$ for all $i$. It is easy to show that

b := \mathbb{E}[|S|] = \sum_{i=1}^n p_i. \qquad (2)

From now on, we will refer to $b = \mathbb{E}[|S|]$ as the minibatch size of sampling $S$. It is known that $\mathbf{P}$ is a symmetric positive semidefinite matrix [31].

Let us, without loss of generality, assume that the indices are ordered so that the $p_i$'s equal to one come first, and define the constant $k$ to be the number of $p_i$'s which are not equal to one.

While our complexity results are general in the sense that they hold for any proper sampling, we shall now consider three special samplings, all with minibatch size $b$:

  • Standard uniform minibatch sampling. In the literature, this is often referred to as the $b$-nice sampling [31, 25]. The set $S$ is chosen uniformly at random from all subsets of $[n]$ of cardinality $b$. Clearly, $|S| = b$ with probability 1. The probability matrix is given by

    \mathbf{P}_{ij} = \begin{cases} \frac{b}{n}, & i = j, \\ \frac{b(b-1)}{n(n-1)}, & i \neq j. \end{cases} \qquad (3)
  • Independent sampling. Fix any proper probabilities $p_1, \dots, p_n$. For each $i \in [n]$ we independently flip a coin, and with probability $p_i$ include element $i$ into $S$. Hence, by construction, $\mathbb{P}(i \in S) = p_i$ and $\mathbb{E}[|S|] = \sum_i p_i = b$. The probability matrix of the independent sampling is

    \mathbf{P}_{ij} = \begin{cases} p_i, & i = j, \\ p_i p_j, & i \neq j. \end{cases}
  • Approximate independent sampling. The independent sampling has the disadvantage that $k$ coin tosses (only $k$ rather than $n$, since the remaining elements are included with probability one) need to be performed in order to generate the random set. However, we would like to sample at the cost of fewer coin tosses. We now design a sampling which has this property and which, in a certain precise sense explained below, approximates the independent sampling. In particular, given an independent sampling with prescribed probabilities, we first sample a single set of a fixed cardinality using the standard uniform minibatch sampling, and subsequently apply an independent sampling to the elements of this set, with appropriately rescaled selection probabilities. The resulting set is the approximate independent sampling.

    Since the two probability matrices are close, the probability matrix of the approximate independent sampling approximates that of the independent sampling. Note that this construction includes both the standard uniform minibatch sampling and the independent sampling as special cases: the former is obtained by choosing equal probabilities for all $i$, and the latter by letting the first-stage set be all of $[n]$. A Python sketch of all three samplings follows this list.
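The following NumPy sketch implements the three samplings described above. The second-stage rescaling $q_i = \min(1, n p_i / m)$ in the approximate independent sampling is an assumption chosen so that $\mathbb{P}(i \in S) \approx p_i$; the paper's exact parameters may differ.

```python
import numpy as np

def uniform_minibatch_sampling(n, b, rng):
    """Standard uniform (b-nice) sampling: a subset of [n] of size b, uniformly at random."""
    return rng.choice(n, size=b, replace=False)

def independent_sampling(p, rng):
    """Independent sampling: include element i with probability p[i]; E|S| = sum(p)."""
    return np.flatnonzero(rng.random(len(p)) < p)

def approximate_independent_sampling(p, m, rng):
    """Two-stage approximation: draw a uniform set of size m, then keep each of its
    elements independently with a rescaled probability, so that P(i in S) ~ p[i]."""
    n = len(p)
    stage1 = rng.choice(n, size=m, replace=False)   # one uniform minibatch of size m
    q = np.minimum(1.0, n * p[stage1] / m)          # rescaled inclusion probabilities (assumption)
    return stage1[rng.random(m) < q]

rng = np.random.default_rng(0)
p = np.array([0.9, 0.5, 0.3, 0.2, 0.1])             # proper probabilities with E|S| = 2
print(independent_sampling(p, rng), approximate_independent_sampling(p, m=3, rng=rng))
```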

2.2 Key lemma

The following lemma, which provides the upper bound on the variance of our gradient estimators, plays a key role in our analysis.

Lemma 1.

Let $\zeta_1, \dots, \zeta_n$ be vectors in $\mathbb{R}^d$ and let $\bar{\zeta} := \frac{1}{n}\sum_{i=1}^n \zeta_i$ be their average. Let $S$ be a proper sampling (i.e., assume that $p_i > 0$ for all $i$). Assume that there is $v = (v_1, \dots, v_n) \in \mathbb{R}^n_+$ such that

\mathbf{P} - p p^\top \preceq \mathbf{Diag}(p_1 v_1, \dots, p_n v_n). \qquad (4)

Then

\mathbb{E}\left[\left\| \frac{1}{n} \sum_{i \in S} \frac{\zeta_i}{p_i} - \bar{\zeta} \right\|^2\right] \;\le\; \frac{1}{n^2} \sum_{i=1}^n \frac{v_i}{p_i} \, \|\zeta_i\|^2, \qquad (5)

where the expectation is taken over the sampling $S$. Whenever (4) holds, it must be the case that

v_i \ge 1 - p_i. \qquad (6)

Moreover, (4) can always be satisfied by some finite $v$. Further, if $|S| \le \tau$ holds with probability 1 for some $\tau$, then (4) holds with $v_i = \tau$ for all $i$. The standard uniform minibatch sampling admits the choice $v_i = \frac{n-b}{n-1}$, the independent sampling admits $v_i = 1 - p_i$, and the approximate independent sampling admits a closely related choice of $v$.
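As a sanity check on Lemma 1, the following sketch empirically estimates the variance of the importance-weighted minibatch estimator $\frac{1}{n}\sum_{i \in S} \zeta_i / p_i$ under the independent sampling and compares it with the bound in (5) using $v_i = 1 - p_i$. The estimator form and the bound are as reconstructed above; this is an illustration, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 100, 10, 10
zeta = rng.standard_normal((n, d))                  # the vectors zeta_1, ..., zeta_n
zeta_bar = zeta.mean(axis=0)
p = np.full(n, b / n)                               # independent sampling with E|S| = b

def estimator(S):
    return (zeta[S] / p[S, None]).sum(axis=0) / n   # (1/n) sum_{i in S} zeta_i / p_i

trials = 20000
sq_err = np.empty(trials)
for t in range(trials):
    S = np.flatnonzero(rng.random(n) < p)           # draw the independent sampling
    sq_err[t] = np.linalg.norm(estimator(S) - zeta_bar) ** 2

bound = ((1 - p) / p * (zeta ** 2).sum(axis=1)).sum() / n ** 2   # right-hand side of (5), v_i = 1 - p_i
print(f"empirical variance: {sq_err.mean():.4f}  vs  Lemma 1 bound: {bound:.4f}")
```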

The following quantities, which come from Lemma 1 and the smoothness assumption, play a key role in our general complexity results:

(7)

Our theory shows that it is a good idea to design samplings for which these values are as small as possible, which leads to the notion of an optimal sampling. The following result sheds light on how the sampling probabilities should be chosen, among samplings of a given minibatch size $b$, in order to minimize (7).

Lemma 2.

Fix a minibatch size $b$. Then the quantity in (7) is minimized for the independent sampling with the probabilities

(8)

where the threshold index in (8) is the largest integer satisfying the corresponding condition (such an integer always exists). Usually, if the $L_i$'s are not too different from one another, the probabilities are simply proportional to the $L_i$'s; for instance, this is the case when all the $L_i$'s are equal. If we restrict ourselves to the approximate independent sampling, then (7) is minimized for (8) with

(9)

where the two cases in (9) distinguish the indices with $p_i = 1$ from those with $p_i < 1$. Moreover, under a mild additional assumption on the $L_i$'s (it can always be satisfied by uplifting the smallest $L_i$'s, because if a function is $L_i$-smooth, then it is also smooth with any larger constant), the optimal value of (7) simplifies further.

From now on, let IS and AIS denote the independent sampling and the approximate independent sampling, respectively, with the probabilities defined in (8).

Remark 1.

Lemma 2 guarantees that the importance sampling defined by (8) is optimal. Moreover, in theory we obtain superlinear speedup in the minibatch size, up to a certain threshold batch size, for all three algorithms.
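The sketch below shows one standard way to turn smoothness constants into minibatch importance sampling probabilities of the kind Lemma 2 describes: probabilities proportional to $L_i$, capped at one, and summing to the minibatch size $b$. The capping loop is a common construction in the importance-sampling-for-minibatches literature and stands in for the exact formula (8).

```python
import numpy as np

def importance_probabilities(L, b):
    """Probabilities p_i ~ L_i, capped at 1, with sum(p) == b (a stand-in for (8))."""
    L = np.asarray(L, dtype=float)
    n = len(L)
    assert 1 <= b <= n
    p = np.zeros(n)
    capped = np.zeros(n, dtype=bool)
    while True:
        free = ~capped
        # distribute the remaining budget proportionally to L_i over the uncapped indices
        p[free] = (b - capped.sum()) * L[free] / L[free].sum()
        newly_capped = free & (p >= 1.0)
        if not newly_capped.any():
            break
        p[newly_capped] = 1.0
        capped |= newly_capped
    return p

# Example: a few "hard" examples (large L_i) get probability one, the rest share the budget.
L = np.array([100.0, 50.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
p = importance_probabilities(L, b=3)
print(p, p.sum())   # sums to 3; the two largest L_i are sampled with probability 1
```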

3 SVRG, SAGA and SARAH

In all of the results of this section we assume that $S$ is an arbitrary proper sampling. Let $b = \mathbb{E}[|S|]$ be the (average) minibatch size. We assume that $v$ satisfies (4) and that the quantity from (7), which depends on the sampling, is defined accordingly. All complexity results will depend on these quantities.

We propose three methods, Algorithms 1, 2 and 3, which are generalizations of the original SVRG [28], SAGA and SARAH to the arbitrary sampling setting, respectively. The original non-minibatch methods arise as special cases for the serial sampling which picks a single example uniformly at random, and the original minibatch methods arise as special cases for the standard uniform minibatch sampling (described in Section 2.1).

1:  ,
2:  for  to  do
3:     
4:     
5:     for  to  do
6:         Draw a random subset (minibatch)
7:         
8:         
9:     end for
10:     
11:  end for
12:  Output: Iterate chosen uniformly at random from the generated iterates.
Algorithm 1 SVRG with arbitrary sampling
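Since the listing above lost its formulas, the following Python sketch spells out one plausible realization of a minibatch SVRG loop with arbitrary sampling. The importance-weighted estimator $g = \nabla f(\tilde{x}) + \frac{1}{n}\sum_{i \in S} (\nabla f_i(x) - \nabla f_i(\tilde{x}))/p_i$, the `sampler` interface and the parameter names are illustrative assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def svrg_arbitrary_sampling(grad_i, full_grad, x0, sampler, p, eta, n, m, epochs, rng):
    """SVRG with an arbitrary sampling `sampler(rng) -> index array` and probabilities p."""
    x_tilde = x0.copy()
    iterates = []
    for _ in range(epochs):                    # outer loop
        g_tilde = full_grad(x_tilde)           # full gradient at the reference point
        x = x_tilde.copy()
        for _ in range(m):                     # inner loop
            S = sampler(rng)                   # draw a minibatch from any proper sampling
            corr = sum((grad_i(i, x) - grad_i(i, x_tilde)) / p[i] for i in S)
            g = g_tilde + corr / n             # importance-weighted variance-reduced estimator
            x = x - eta * g
            iterates.append(x)
        x_tilde = x                            # update the reference point
    return iterates[rng.integers(len(iterates))]   # output chosen uniformly at random
```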

Our general result for SVRG [11] follows.

Theorem 3 (Complexity of SVRG with arbitrary sampling).

There exist universal constants such that the output of Algorithm 1 with minibatch size $b$, an appropriately chosen step size, and appropriately chosen parameters (with the total number of iterations a multiple of the inner loop length) satisfies:

Thus, in terms of the number of stochastic gradient evaluations needed to obtain an $\varepsilon$-accurate solution, one needs the following number of iterations:

In the next theorem we provide a generalization of the results in [29].

1:   for ,
2:  
3:  for  to  do
4:     Draw a random subset (minibatch)
5:     Pick random subset of such that
6:     
7:     
8:      for and for
9:     
10:  end for
11:  Output: Iterate chosen uniformly at random from the generated iterates.
Algorithm 2 SAGA with arbitrary sampling
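Similarly, here is a hedged sketch of a minibatch SAGA step with a stored gradient table $\alpha$ and importance weights. The estimator $g = \bar{\alpha} + \frac{1}{n}\sum_{i \in S} (\nabla f_i(x) - \alpha_i)/p_i$ is the standard SAGA construction adapted to arbitrary sampling; Algorithm 2 may differ in details such as which table entries are refreshed (its line 5 draws a separate subset for that purpose).

```python
import numpy as np

def saga_arbitrary_sampling(grad_i, x0, sampler, p, eta, n, iters, rng):
    """SAGA with an arbitrary sampling; keeps a table alpha of the last seen gradients."""
    x = x0.copy()
    alpha = np.stack([grad_i(i, x0) for i in range(n)])   # gradient table, one row per example
    alpha_bar = alpha.mean(axis=0)
    iterates = []
    for _ in range(iters):
        S = sampler(rng)                                   # minibatch from any proper sampling
        fresh = {i: grad_i(i, x) for i in S}               # gradients at the current iterate
        g = alpha_bar + sum((fresh[i] - alpha[i]) / p[i] for i in S) / n
        x = x - eta * g                                    # take the step
        for i in S:                                        # refresh the table on the sampled indices
            alpha_bar = alpha_bar + (fresh[i] - alpha[i]) / n
            alpha[i] = fresh[i]
        iterates.append(x.copy())
    return iterates[rng.integers(len(iterates))]           # output chosen uniformly at random
```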
Theorem 4 (Complexity of SAGA with arbitrary sampling).

There exist universal constants such that the output of Algorithm 2 with minibatch size $b$, an appropriately chosen step size, and an appropriately chosen parameter satisfies:

Thus, in terms of the number of stochastic gradient evaluations needed to obtain an $\varepsilon$-accurate solution, one needs the following number of iterations:

We now introduce Algorithm 3: a general form of the SARAH algorithm [23].

1:  
2:  for  to  do
3:     
4:     
5:     for  to  do
6:         Draw a random subset (minibatch)
7:         
8:         
9:     end for
10:      chosen uniformly at random from
11:  end for
12:  Output: Iterate
Algorithm 3 SARAH with arbitrary sampling
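SARAH differs from SVRG in that its inner estimator is recursive and biased: the new direction reuses the previous one rather than a fixed reference gradient. Below is a hedged sketch of one outer loop, where the importance-weighted recursion $v_t = v_{t-1} + \frac{1}{n}\sum_{i \in S}(\nabla f_i(x_t) - \nabla f_i(x_{t-1}))/p_i$ is assumed as the arbitrary-sampling analogue of the SARAH update.

```python
import numpy as np

def sarah_one_outer_loop(grad_i, full_grad, x0, sampler, p, eta, n, m, rng):
    """One outer loop of SARAH with an arbitrary sampling (recursive, biased estimator)."""
    x_prev = x0.copy()
    v = full_grad(x_prev)                  # v_0: full gradient at the starting point
    x = x_prev - eta * v
    iterates = [x_prev, x]
    for _ in range(m):
        S = sampler(rng)                   # minibatch from any proper sampling
        v = v + sum((grad_i(i, x) - grad_i(i, x_prev)) / p[i] for i in S) / n
        x_prev, x = x, x - eta * v         # recursive update of the direction and the iterate
        iterates.append(x)
    return iterates[rng.integers(len(iterates))]   # output chosen uniformly at random
```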
Theorem 5 (Complexity of SARAH with arbitrary sampling).

Consider one outer loop of Algorithm 3 with

(10)

Then the output satisfies:

Thus, in terms of the number of stochastic gradient evaluations needed to obtain an $\varepsilon$-accurate solution, one needs the following number of iterations:

If all the $L_i$'s are the same and we choose the sampling to be the standard uniform one with minibatch size $b$, we recover the original result from [23]. Taking the full batch, we recover gradient descent with the correct step size.

4 Additional Results

4.1 Gradient dominated functions

Definition 1.

A function $f$ is $\tau$-gradient dominated if $f(x) - f(x^*) \le \tau \|\nabla f(x)\|^2$ for all $x \in \mathbb{R}^d$, where $x^*$ is an optimal solution of (1).

  Input: , ,
  for  to  do
     
  end for
  Output:
Algorithm 4 GD-Algorithm

Gradient dominance is a weaker assumption than strong convexity: if a function is $\mu$-strongly convex, then it is $\tau$-gradient dominated with $\tau = \frac{1}{2\mu}$.

Any of the non-convex methods in this paper can be used as a subroutine of Algorithm 4, where the first input is the number of steps of the subroutine and the second input is the set of optimal parameters for the subroutine. The number of steps is chosen according to Theorem 3 for SVRG and Theorem 4 for SAGA; in the case of SARAH, it is obtained by solving the bound of Theorem 5 for the inner loop length. Using Theorems 3, 4 and 5 and this special choice, we get

Combined with Definition 1, this guarantees linear convergence with the same constants as we had before in our analysis.
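The restart construction of Algorithm 4 is straightforward to express in code: repeatedly run one of the variance-reduced subroutines for its prescribed number of steps, each time restarting from the returned point. In the sketch below, `subroutine` stands in for any of Algorithms 1 to 3 with its parameters fixed as above, and `K` is the number of restarts; both names are assumptions.

```python
def gd_algorithm(subroutine, x0, K):
    """Restart wrapper (Algorithm 4 style): under gradient dominance, each restart
    contracts the suboptimality f(x) - f(x*) by a constant factor."""
    x = x0
    for _ in range(K):
        x = subroutine(x)   # one full run of SVRG / SAGA / SARAH started from x
    return x
```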

4.2 Importance sampling for SARAH under convexity

In addition to the results presented in the previous sections, we also establish importance sampling results for SARAH in the convex and strongly convex cases (Appendix I), with similar improvements to those obtained for the non-convex algorithms. These are the best current rates for SARAH in these settings.

Further, we also provide specialized non-minibatch versions of non-convex SAGA, SARAH and SVRG, which are either special cases of their minibatch versions presented in the main body, or slight modifications thereof with slightly improved guarantees.

5 Experiments

In this section, we perform experiments with regression for binary classification, where the loss of each example is built from the sigmoid function $\sigma(t) = 1/(1 + e^{-t})$, and is thus smooth but non-convex. We use four LIBSVM datasets (the LIBSVM dataset collection is available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/): covtype, ijcnn1, splice, australian.
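As a concrete stand-in for this loss (the formula below is one common smooth non-convex choice for such experiments and is assumed here, not quoted from the paper), one can take the squared difference between the sigmoid prediction and the binary label, $f_i(w) = (\sigma(a_i^\top w) - y_i)^2$ with $y_i \in \{0, 1\}$:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def loss_i(w, a_i, y_i):
    """Smooth, non-convex per-example loss: squared error of the sigmoid prediction."""
    return (sigmoid(a_i @ w) - y_i) ** 2

def grad_i(w, a_i, y_i):
    """Gradient of loss_i via the chain rule: 2 (sigma - y) sigma (1 - sigma) a_i."""
    s = sigmoid(a_i @ w)
    return 2.0 * (s - y_i) * s * (1.0 - s) * a_i
```

With this choice, the per-example smoothness constants scale with $\|a_i\|^2$, which is exactly the kind of heterogeneity the importance sampling of Section 2 exploits.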

Parameters of each algorithm are chosen as suggested by the theorems in Section 3, and the starting point is set to the all-zeros vector. The vertical axis shows the norm of the gradient, and the horizontal axis shows how many times the algorithm evaluates the equivalent of the full gradient; evaluating the gradient of a single function from the empirical loss costs $1/n$ of a full gradient evaluation. For SARAH, the inner loop length was chosen accordingly.

5.1 Importance and uniform sampling comparison

Here we provide a comparison of the methods with uniform and with importance sampling.

Figure 1: Comparison of all methods with uniform and importance sampling

From Figure 1 one can see that the methods with importance sampling outperform their uniform sampling counterparts and can be faster by orders of magnitude; the improvement is most pronounced in the first graph.

5.2 Linear or more than linear speedup

Theory suggests that linear or even superlinear speedup can be obtained using the proposed importance samplings. Our experiments suggest that this is indeed possible for all three algorithms.


Figure 2: ijcnn1 dataset; "# iterations" means the number of stochastic gradient evaluations divided by the minibatch size $b$, assuming parallelism over $b$ cores.

Figure 2 confirms that linear, and even superlinear, speedup (with increasing batch size, we need the same number of full gradient evaluations or fewer) can be obtained in practice, not just in theory. However, the effect is limited, and for this dataset it holds only up to a certain minibatch size. The upper graphs in Figure 2 visualize convergence in a multi-core setting, where we assume that the number of cores equals the minibatch size.

5.3 Independent Sampling vs. Approximate Independent Sampling

In theory, the independent sampling (IS) is slightly better than the approximate independent sampling (AIS). However, it is more expensive in terms of computation. The goal of the next experiment is to show that in practice AIS yields comparable or faster convergence. Hence, AIS is the more reasonable choice for datasets where the number of data points is large (if implemented efficiently, its cost is almost independent of $n$). The intuition behind why AIS can work better is that it has smaller variance in the batch size than IS.

Figure 3: IS vs. AIS, splice dataset.

It can be seen from Figure 3 that AIS can outperform IS; thus, even though IS is optimal in theory, one may prefer AIS in practice.

References

  • [1] Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. STOC 2017: Symposium on Theory of Computing, 19-23, 2016.
  • [2] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In The 33rd International Conference on Machine Learning, pages 699–707, 2016.
  • [3] Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In The 33rd International Conference on Machine Learning, pages 1110–1119, 2016.
  • [4] Antonin Chambolle, Matthias J. Ehrhardt, Peter Richtárik, and Carola-Bibiane Schöenlieb. Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications. arXiv:1706.04957, 2017.
  • [5] Dominik Csiba and Peter Richtárik. Primal method for ERM with flexible mini-batching schemes and non-convex losses. arXiv:1506.02227, 2015.
  • [6] Dominik Csiba and Peter Richtárik. Importance sampling for minibatches. arXiv:1602.02283, 2016.
  • [7] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems 27, 2014.
  • [8] Aaron Defazio, Tiberio Caetano, and Justin Domke. Finito: A faster, permutable incremental gradient method for big data problems. The 31st International Conference on Machine Learning, 2014.
  • [9] Robert Mansel Gower, Donald Goldfarb, and Peter Richtárik. Stochastic block BFGS: squeezing more curvature out of data. In The 33rd International Conference on Machine Learning, pages 1869–1878, 2016.
  • [10] Robert Mansel Gower, Peter Richtárik, and Francis Bach. Stochastic quasi-gradient methods: variance reduction via Jacobian sketching. arXiv:1805.02632, 2018.
  • [11] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323, 2013.
  • [12] Jakub Konečný, Zheng Qu, and Peter Richtárik. S2CD: Semi-stochastic coordinate descent. Optimization Methods and Software, 32(5):993–1005, 2017.
  • [13] Jakub Konečný and Peter Richtárik. S2GD: Semi-stochastic gradient descent methods. Frontiers in Applied Mathematics and Statistics, pages 1–14, 2017.
  • [14] Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. arXiv:1712.09677, 2017.
  • [15] Chenxin Ma, Jakub Konečný, Martin Jaggi, Virginia Smith, Michael I Jordan, Peter Richtárik, and Martin Takáč. Distributed optimization with arbitrary local solvers. Optimization Methods and Software, 32(4):813–848, 2017.
  • [16] Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan, Peter Richtárik, and Martin Takáč. Adding vs. averaging in distributed primal-dual optimization. In The 32nd International Conference on Machine Learning, pages 1973–1982, 2015.
  • [17] Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
  • [18] A Nemirovski, A Juditsky, G Lan, and A Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  • [19] Arkadi Nemirovsky and David B. Yudin. Problem complexity and method efficiency in optimization. Wiley, New York, 1983.
  • [20] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
  • [21] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
  • [22] Lam Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. The 34th International Conference on Machine Learning, 2017.
  • [23] Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Stochastic recursive gradient algorithm for nonconvex optimization. arXiv:1705.07261, 2017.
  • [24] Zheng Qu and Peter Richtárik. Coordinate descent with arbitrary sampling I: algorithms and complexity. Optimization Methods and Software, 31(5):829–857, 2016.
  • [25] Zheng Qu and Peter Richtárik. Coordinate descent with arbitrary sampling II: expected separable overapproximation. Optimization Methods and Software, 31(5):858–884, 2016.
  • [26] Zheng Qu, Peter Richtárik, Martin Takáč, and Olivier Fercoq. SDNA: stochastic dual newton ascent for empirical risk minimization. In The 33rd International Conference on Machine Learning, pages 1823–1832, 2016.
  • [27] Zheng Qu, Peter Richtárik, and Tong Zhang. Quartz: Randomized dual coordinate ascent with arbitrary sampling. In Advances in Neural Information Processing Systems 28, pages 865–873, 2015.
  • [28] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In The 33rd International Conference on Machine Learning, pages 314–323, 2016.
  • [29] Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Fast incremental method for smooth nonconvex optimization. In Decision and Control (CDC), 2016 IEEE 55th Conference on, pages 1971–1977. IEEE, 2016.
  • [30] Peter Richtárik and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 10(6):1233–1243, 2016.
  • [31] Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1-2):433–484, 2016.
  • [32] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(2):1–38, 2014.
  • [33] Nicolas Le Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.
  • [34] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
  • [35] Martin Takáč, Avleen Bijral, Peter Richtárik, and Nathan Srebro. Mini-batch primal and dual methods for SVMs. In 30th International Conference on Machine Learning, 2013.
  • [36] Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In The 32nd International Conference on Machine Learning, pages 1–9, 2015.

Appendix

Appendix A Proof of Lemma 1

Proof.

Let $\theta_i = 1$ if $i \in S$ and $\theta_i = 0$ otherwise. Likewise, $\theta_i \theta_j = 1$ if both $i, j \in S$ and $\theta_i \theta_j = 0$ otherwise. Note that $\mathbb{E}[\theta_i] = p_i$ and $\mathbb{E}[\theta_i \theta_j] = \mathbf{P}_{ij}$. Next, let us compute the mean of the estimator $\frac{1}{n}\sum_{i \in S} \zeta_i / p_i$:

(11)

Let us collect the scaled vectors $\zeta_i / p_i$ into a matrix, and let $e$ be the vector of all ones in $\mathbb{R}^n$. We now write the variance of the estimator in a form which will be convenient for establishing the bound:

(12)

Since (4) holds by assumption, we can further bound the resulting quadratic form.

To obtain (5), it remains to combine this with (12).

Inequality (6) follows by comparing the diagonal elements of the two matrices in (4). Let us now verify the formulas for $v$ for the individual samplings.

  • Since the probability matrix $\mathbf{P}$ is positive semidefinite [31], condition (4) can always be satisfied for a suitable (possibly large) choice of $v$.

  • It was shown in [25, Theorem 4.1] that the required matrix bound holds provided that $|S| \le \tau$ with probability 1. Hence (4) holds, which means that $v_i = \tau$ is admissible for all $i$.

  • Consider now the independent sampling. Clearly, $\mathbf{P} - pp^\top$ is a diagonal matrix with entries $p_i(1 - p_i)$, so (4) holds with $v_i = 1 - p_i$.

  • Consider the $b$-nice sampling (standard uniform minibatch sampling). Direct computation shows that the probability matrix is given by $\mathbf{P}_{ij} = \frac{b}{n}$ for $i = j$ and $\mathbf{P}_{ij} = \frac{b(b-1)}{n(n-1)}$ for $i \neq j$, as claimed in (3). Therefore, (4) holds with $v_i = \frac{n-b}{n-1}$.

  • Finally, for the approximate independent sampling, a direct computation along the same lines shows that its probability matrix satisfies a bound of the same form, from which the stated choice of $v$ follows.