New Optimisation Methods for Machine Learning



Aaron Defazio

A thesis submitted for the degree of

Doctor of Philosophy

of The Australian National University

November 2014

© Aaron Defazio 2014

Except where otherwise indicated, this thesis is my own original work.

Aaron Defazio

November 12th 2014

Acknowledgements

I would like to thank several NICTA researchers for conversations and brainstorming sessions during the course of my PhD, particularly Scott Sanner and my supervisor Tiberio Caetano. I would like to thank Justin Domke for many discussions about the Finito algorithm, and his assistance with developing and checking the proof. Likewise, for the SAGA algorithm I would like to thank Francis Bach and Simon Lacoste-Julien for discussion and assistance with the proofs. The SAGA algorithm was discovered in collaboration with them while visiting the INRIA lab, with some financial support from INRIA.

I would also like to thank my family for all their support during the course of my PhD, particularly my mother, for giving me a place to stay for part of the duration of the PhD as well as food, love and support. I do not thank her often enough.

I also would like to thank NICTA for their scholarship during the course of the PhD. NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.

Abstract

In this work we introduce several new optimisation methods for problems in machine learning. Our algorithms broadly fall into two categories: optimisation of finite sums and of graph structured objectives. The finite sum problem is simply the minimisation of objective functions that are naturally expressed as a summation over a large number of terms, where each term has a similar or identical weight. Such objectives most often appear in machine learning in the empirical risk minimisation framework in the non-online learning setting. The second category, that of graph structured objectives, consists of objectives that result from applying maximum likelihood to Markov random field models. Unlike the finite sum case, all the non-linearity is contained within a partition function term, which does not readily decompose into a summation.

For the finite sum problem, we introduce the Finito and SAGA algorithms, as well as variants of each. The Finito algorithm is best suited to strongly convex problems where the number of terms is of the same order as the condition number of the problem. We prove the fast convergence rate of Finito for strongly convex problems and demonstrate its state-of-the-art empirical performance on 5 datasets.

The SAGA algorithm we introduce is complementary to the Finito algorithm. It is more generally applicable, as it can be applied to problems without strong convexity, and to problems that have a non-differentiable regularisation term. In both cases we establish strong convergence rate proofs. It is also better suited to sparse problems than Finito. The SAGA method has a broader and simpler theory than any existing fast method for the problem class of finite sums; in particular it is the first such method that can provably be applied to non-strongly convex problems with non-differentiable regularisers without introduction of additional regularisation.

For graph-structured problems, we take three complementary approaches. We look at learning the parameters for a fixed structure, learning the structure independently, and learning both simultaneously. Specifically, for the combined approach, we introduce a new method for encouraging graph structures with the "scale-free" property. For the structure learning problem, we establish SHORTCUT, an approximate structure learning method for Gaussian graphical models with reduced expected running time. For problems where the structure is known but the parameters unknown, we introduce an approximate maximum likelihood learning algorithm that is capable of learning a useful subclass of Gaussian graphical models.

Our thesis as a whole introduces a new suite of techniques for machine learning practitioners that increases the size and type of problems that can be efficiently solved. Our work is backed by extensive theory, including proofs of convergence for each method discussed.

Contents

Chapter 1 Introduction and Overview

Numerical optimisation is in many ways the core problem in modern machine learning. Virtually all learning problems can be tackled by formulating a real valued objective function expressing some notion of loss or suboptimality which can be optimised over. Indeed approaches that do not have well-founded objective functions are rare, with perhaps contrastive divergence (Hinton, 2002) and some sampling schemes being notable examples. Many methods that started as heuristics were able to be significantly improved once well-founded objectives were discovered and exploited. Often a convex variant can be developed. A prime example is belief propagation, its relation to the Bethe approximation (Yedidia et al., 2000), and the later development of tree weighted variants (Wainwright et al., 2003) which allowed a convex formulation.

The core of this thesis is the development of several new numerical optimisation schemes, primarily focusing on convex objectives, which either address limitations of existing approaches, or improve on the performance of state-of-the-art algorithms. These methods increase the breadth and depth of machine learning problems that are tractable on modern computers.

1.1 Convex Machine Learning Problems

In this work we particularly focus on problems that have convex objectives. This is a major restriction, and one at the core of much of modern optimisation theory, but one that nevertheless requires justification. The primary reasons for targeting convex problems are their ubiquity in applications and the relative ease of solving them. Logistic regression, least-squares, support vector machines, conditional random fields and tree-weighted belief propagation all involve convex models. All of these techniques have seen real world application, although their use has been overshadowed in recent years by non-convex models such as neural networks.

Convex optimisation is still of interest when addressing non-convex problems though. Many algorithms that were developed for convex problems, motivated by their provably fast convergence, have later been applied to non-convex problems with good empirical results. Additionally, often the best approach to solving a non-convex problem is through the repeated solution of convex sub-problems, or by replacing the problem entirely with a close convex surrogate.

The class of convex numerical problems is sometimes considered synonymous with that of computationally tractable problems. This is not strictly true in the usual computer science sense of tractability, as some convex problems on complicated but convex sets can still be NP-hard (de Klerk and Pasechnik, 2006). On the other hand, we can sometimes approximately solve non-convex problems of massive scale using modern approaches (e.g. Dean et al., 2012). Instead, convex problems can be better thought of as the reliably solvable problems. For convex problems we can almost always establish theoretical results giving a practical bound on the amount of computation time required to solve a given convex problem (Nesterov and Nemirovski, 1994). For non-convex problems we can rarely do better than finding a locally optimal solution. Together with the small amount of tuning (often none) required by convex optimisation algorithms, this means they can be used as building blocks within larger programs; details of the problem can be abstracted away from the users. This is rarely the case for non-convex problems, where the most commonly used methods require substantial hand tuning.

When tasked with solving a convex problem, we have at our disposal powerful and flexible algorithms such as interior point methods and in particular Newton's method. While Newton's method is strikingly successful on small problems, its approximately cubic running time per iteration, resulting from the need to do a linear solve, means that it scales extremely poorly to problems with large numbers of variables. It is also unable to directly handle non-differentiable problems common in machine learning. Both of these shortcomings have been addressed to some degree (Nocedal, 1980; Liu and Nocedal, 1989; Andrew and Gao, 2007), by the use of low-rank approximations and tricks for specific non-differentiable structures, although problems remain.

An additional complication is a divergence between the numerical optimisation and machine learning communities. Numerical convex optimisation researchers in the 80s and 90s largely focused on solving problems with large numbers of complex constraints, particularly Quadratic Programming (QP) and Linear Programming (LP) problems. These advances were applicable to the kernel methods of the early 2000s, but at odds with many of the more modern machine learning problems, which are characterised by large numbers of potentially non-differentiable terms.
The core examples would be linear support vector machines, other max-margin methods and neural networks with non-differentiable activation functions. The problem we address in Chapter 7 also fits into this class.

In this thesis we will focus on smooth optimisation problems, allowing only a controlled level of non-smooth structure in the form of certain non-differentiable regularisation terms (detailed in Section 1.2). The notion of smoothness we use is that of Lipschitz smoothness. A function $f$ is Lipschitz smooth with constant $L$ if its gradients are Lipschitz continuous. That is, for all $x$ and $y$:

\[ \left\Vert \nabla f(x)-\nabla f(y)\right\Vert \leq L\left\Vert x-y\right\Vert . \]

Lipschitz smooth functions are differentiable, and if their Hessian matrix exists it is bounded in spectral norm. The other assumption we will sometimes make is that of strong convexity. A function $f$ is strongly convex with constant $\mu$ if for all $x$, $y$ and $\theta\in[0,1]$:

\[ f\left(\theta x+(1-\theta)y\right) \leq \theta f(x)+(1-\theta)f(y)-\frac{\mu}{2}\theta(1-\theta)\left\Vert x-y\right\Vert ^{2}. \]

Essentially, rather than the usual convexity interpolation bound $f(\theta x+(1-\theta)y)\leq\theta f(x)+(1-\theta)f(y)$, we have it strengthened by a quadratic term.
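For twice-differentiable functions, both conditions can be summarised by bounds on the Hessian; the following standard characterisation (our addition, included only for reference) makes the relationship explicit:

```latex
% Standard Hessian characterisation of L-smoothness and mu-strong convexity,
% valid when f is twice differentiable (our addition, for reference only):
\mu I \preceq \nabla^{2}f(x) \preceq L I \quad \text{for all } x,
% which is equivalent to the pair of quadratic bounds
f(x) + \left\langle \nabla f(x),\, y-x\right\rangle + \frac{\mu}{2}\left\Vert y-x\right\Vert^{2}
\;\leq\; f(y) \;\leq\;
f(x) + \left\langle \nabla f(x),\, y-x\right\rangle + \frac{L}{2}\left\Vert y-x\right\Vert^{2}.
```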

1.2 Problem Structure and Black Box Methods

The last few years have seen a resurgence in convex optimisation centred around the technique of exploiting problem structure, an approach we take as well. When no structure is assumed by the optimisation method about the problem other than the degree of convexity, very strong results are known about the best possible convergence rates obtainable. These results date back to the seminal work of Nemirovsky and Yudin (1983) and Nesterov (1998, earlier work in Russian). These results have contributed to the widely held attitude that convex optimisation is a solved problem. But when the problem has some sort of additional structure, these worst-case theoretical results are no longer applicable. Indeed, a series of recent results suggest that practically all problems of interest have such structure, allowing advances in theoretical, not just practical convergence. For example, non-differentiable problems under reasonable Lipschitz continuity assumptions can at best be solved with an error reduction of order $1/\sqrt{t}$ after $t$ iterations, for standard measures of convergence rate (Nesterov, 1998, Theorem 3.2.1). In practice, virtually all non-differentiable problems can be treated by a smoothing transformation, giving a $1/t$ reduction in error after $t$ iterations when an optimal algorithm is used (Nesterov, 2005).

Many problems of interest have a structure where most terms in the objective involve only a small number of variables. This is the case for example in inference problems on graphical models. In such cases block coordinate descent methods can give better theoretical and practical results (Richtarik and Takac, 2011).

Another exploitable structure involves a sum of two terms $f(x)+h(x)$, where the first term $f$ is structurally nice, say smooth and differentiable, but potentially complex to evaluate, and where the second term $h$ is non-differentiable. As long as $h$ is simple in the sense that its proximal operator is easy to evaluate, then algorithms exist with the same theoretical convergence rate as if $h$ was not part of the objective at all (Beck and Teboulle, 2009). The proximal operator is a key construction in this work, and indeed in modern optimisation theory. It is defined for a function $h$ and constant $\gamma$ as:

\[ \operatorname{prox}_{\gamma}^{h}(v)=\operatorname*{argmin}_{x}\left\{ h(x)+\frac{1}{2\gamma}\left\Vert x-v\right\Vert ^{2}\right\} . \]

Some definitions of the proximal operator use the weighting $\frac{\gamma}{2}\left\Vert x-v\right\Vert ^{2}$ instead of $\frac{1}{2\gamma}\left\Vert x-v\right\Vert ^{2}$; we use the latter form throughout this work. The proximal operator is itself an optimisation problem, and so in general it is only useful when the function $h$ is simple. In many cases of interest the proximal operator has a closed form solution.

The first four chapters of this work focus on quite possibly the simplest problem structure, that of a finite summation. This occurs when there is a large number of terms with similar structure added together or averaged in the objective. Recent results have shown that for strongly convex problems, better convergence rates are possible under such summation structures than are possible for black box problems (Schmidt et al., 2013; Shalev-Shwartz and Zhang, 2013b). We provide three new algorithms for this problem structure, discussed in Chapters 3 and 4. We also discuss properties of problems in the finite sum class extensively in Chapter 5.
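As an illustration of a function whose proximal operator has a closed form, the sketch below evaluates the proximal operator of the L1 norm, which reduces to element-wise soft-thresholding. This is our own example (using the $\frac{1}{2\gamma}$ weighting convention and the hypothetical name prox_l1), not code from the thesis.

```python
import numpy as np

def prox_l1(v, gamma):
    """Proximal operator of h(x) = ||x||_1 with weight gamma:
    argmin_x ||x||_1 + (1/(2*gamma)) * ||x - v||^2.
    The closed form is element-wise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - gamma, 0.0)

print(prox_l1(np.array([3.0, -0.2, 0.7]), gamma=0.5))  # [ 2.5 -0.   0.2]
```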

1.3 Early & Late Stage Convergence

Figure 1.1: Schematic illustration of convergence rates

When dealing with problems with a finite sum structure, practitioners have traditionally had to make a key trade-off between stochastic methods, which access the objective one term at a time, and batch methods, which work directly with the full objective. Stochastic methods such as SGD (stochastic gradient descent, Robbins and Monro, 1951) exhibit rapid convergence during early stages of optimisation, yielding a good approximate solution quickly, but this convergence slows down over time; getting a high accuracy solution is nearly impossible with SGD. Fortunately, in machine learning it is often the case that a low accuracy solution gives just as good a result as a high accuracy solution for minimising the test loss on held out data. A high accuracy solution can effectively over-fit to the training data. Running SGD for a small number of epochs is common in practice.

Batch methods on the other hand are slowly converging but steady; if run for long enough they yield a high accuracy solution. For strongly convex problems, the difference in convergence is between an $O(1/t)$ error after $t$ iterations for SGD versus a geometric $O(\rho^{t})$ error ($\rho<1$) for LBFGS, the most popular batch method (Nocedal, 1980). (Quasi-Newton methods such as LBFGS are often cited as having local super-linear convergence. This is only true if the dimensionality of the underlying parameter space is comparable to the number of iterations used. In machine learning the parameter space is usually much larger in effective dimension than the number of iterations.) We have illustrated the difference schematically in Figure 1.1. The SGD and LBFGS lines here are typical of simple prediction problems, where SGD gives acceptable solutions after 5-10 epochs (passes over the data), whereas LBFGS eventually gives a better solution, taking 30-100 iterations to do so. LBFGS is particularly well suited to use in a distributed computing setting, and it is sometimes the case that LBFGS will give better results ultimately on the test loss, particularly for poorly conditioned (high-curvature) problems.

Figure 1.1 also illustrates the kind of convergence that the recently developed class of incremental gradient methods potentially offers. Incremental gradient methods have the same linear $O(\rho^{t})$ error after $t$ epochs as a batch method, but with a dramatically better $\rho$ coefficient. The difference is in theory thousands of times faster convergence, and in practice usually 10-20 times better. With favourable problem structure, incremental gradient methods have the potential to offer the best of both worlds, having rapid initial convergence without the later stage slow-down of SGD.

Another traditional advantage of batch methods over stochastic methods is their ease of use. Methods such as LBFGS require no hand tuning to be applied to virtually any smooth problem. Some tuning of the memory constant that holds the number of past gradients to remember at each step can give faster convergence, but bad choices of this constant still result in convergence. SGD and other traditional stochastic methods on the other hand require a step size parameter and a parameter annealing schedule to be set. SGD is sensitive to these choices, and will diverge for poor choices.

Incremental gradient methods offer a solution to the tuning problem as well. Most incremental gradient algorithms have only a single step size parameter that needs to be set. Fortunately the convergence rate is fairly robust to the value of this parameter.
The SDCA algorithm (Shalev-Shwartz and Zhang, 2013b) reduces this to zero parameters, but at the expense of being limited to problems with efficient-to-compute proximal operators.

1.4 Approximations

The exploitation of problem structure is not always directly possible with the objectives we encounter in machine learning. A case we focus on in this work is the learning of weight parameters in a Gaussian graphical model structure. This is an undirected graph structure with weights associated with both edges and nodes. These weights are the entries of the precision matrix (inverse covariance matrix) of a Gaussian distribution. Absent edges effectively have a weight of zero (Figure 1.2). A formal definition is given in Chapter 6. A key approach to such problems is the use of approximations that introduce additional structure in the objective which we can exploit.

Figure 1.2: Gaussian graphical model defined by a precision matrix, together with the non-sparse covariance matrix it induces (rounded to 1 significant figure). Correlations are indicated by negative edge weights in a Gaussian model.

The regularised maximum likelihood objective for fitting a Gaussian graphical model can require cubic time in the number of variables to evaluate (theoretically it takes time equivalent to the big-O cost of a fast matrix multiplication such as Strassen's algorithm, but in practice simpler techniques are used). This is prohibitively long on many problems of interest. Instead, approximations can be introduced that decompose the objective, allowing more efficient techniques to be used. In Chapter 9 we show how the Bethe approximation may be applied for learning the edge weights on restricted classes of Gaussian graphical models. This approximation allows for the use of an efficient dual decomposition optimisation method, and has direct practical applicability in the domain of recommendation systems.

Besides parameter learning, the other primary task involving graphs is directly learning the structure. Structure learning for Gaussian graphical models is a problem that has seen a lot of interest in machine learning. The structure can be used in a machine learning pipeline as the precursor to parameter learning, or it can be used for its own sake as an indicator of correlation structure in a dataset. The use of approximations in structure learning is more widespread than in parameter learning, and we give an overview of approaches in Chapter 6. We improve on an existing technique in Chapter 8, where we show that an existing approximation can be further approximated, giving a substantial practical and theoretical speed-up.

1.5 Non-differentiability in Machine Learning

As mentioned, machine learning problems tend to have substantial non-differentiable structure compared to the constraint structures more commonly addressed in numerical optimisation. These two forms of structure are in a sense two sides of the same coin, as for convex problems the transformation to the dual problem can often convert one to the other. The primary example is support vector machines, where non-differentiability in the primal hinge loss is converted to a constraint set when the dual is considered.

Recent progress in optimisation has seen the use of proximal methods as the tool of choice for handling both structures in machine learning problems. When using a regularised loss objective of the form $f(x)+h(x)$ as mentioned above in Section 1.2, the non-differentiability can be in the regulariser $h$ or the loss term $f$. We introduce methods addressing both cases in this work. The SAGA algorithm of Chapter 4 is a new primal method, the first primal incremental gradient method able to be used on non-strongly convex problems with non-differentiable regularisers directly. It makes use of the proximal operator of the regulariser. It can also be used on problems with constraints, where the function $h$ is the indicator function of the constraint set, and the proximal operator is projection onto the constraint set.

In this work we also introduce a new non-differentiable regulariser for the above mentioned graph structure learning problem, which can also be attacked using proximal methods. Its non-differentiable structure is atypically complex compared to other regularisers used in machine learning, requiring a special optimisation procedure to be used just to evaluate the proximal operator.

For non-differentiable losses, we introduce the Prox-Finito algorithm (Section 3.6). This incremental gradient algorithm uses the proximal operator of the single datapoint loss. It provides a bridge between the Finito algorithm (Section 3.1) and the SDCA algorithm (Shalev-Shwartz and Zhang, 2013b), having properties of both methods.

1.6 Publications Related to This Thesis

The majority of the content in this thesis has been published as conference articles. For the work on incremental gradient methods, the Finito method has been published as Defazio et al. (2014b), and the SAGA method as Defazio et al. (2014a). Chapters 3 & 4 contain much more detailed theory than has been previously published. Some of the discussion in Chapter 5 appears in Defazio et al. (2014b) also. For the portion of this thesis on Gaussian graphical models, Chapter 7 largely follows the publication Defazio and Caetano (2012a). Chapter 9 is based on the work in Defazio and Caetano (2012b), although heavily revised.

Chapter 2 Incremental Gradient Methods

In this chapter we give an introduction to the class of incremental gradient (IG) methods. Incremental gradient methods are simply a class of methods that can take advantage of known summation structure in an optimisation objective by accessing the objective one term at a time. Objectives that are decomposable as a sum of a number of terms come up often in applied mathematics and scientific computing, but are particularly prevalent in machine learning applications. Research in the last two decades on optimisation problems with a summation structure has focused more on the stochastic approximation setting, where the summation is assumed to be over an infinite set of terms. The finite sum case that incremental gradient methods cover has seen a resurgence in recent years after the discovery that there exist fast incremental gradient methods whose convergence rates are better than any possible black box method for finite sums with particular (common) structures. We provide an extensive overview of all known fast incremental gradient methods in the later parts of this chapter. Building on the described methods, in Chapters 3 & 4 we introduce three novel fast incremental gradient methods. Depending on the problem structure, each of these methods can have state-of-the-art performance.

2.1 Problem Setup

We are interested in minimising functions of the form

\[ f(x)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x), \]

where $x\in\mathbb{R}^{d}$ and each $f_{i}$ is convex and Lipschitz smooth with constant $L$. We will also consider the case where each $f_{i}$ is additionally strongly convex with constant $\mu$. See Appendix A.1 for definitions of Lipschitz smoothness and strong convexity. Incremental gradient methods are algorithms that at each step evaluate the gradient and function value of only a single $f_{i}$. We will measure convergence rates in terms of the number of these $f_{i}$ evaluations; normally they are much cheaper computationally than evaluations of the whole gradient $\nabla f$, such as performed by the gradient descent algorithm. We use the notation $x^{*}$ to denote a minimiser of $f$. For strongly convex problems this is the unique minimiser.

This setup differs from the traditional black box smooth convex optimisation problem only in that we are assuming that our function is decomposable into a finite sum structure. This finite sum structure is widespread in machine learning applications. For example, the standard framework of Empirical Risk Minimisation (ERM) takes this form, where for a loss function $l$ and data-label tuples $(x_{i},y_{i})$, we have:

\[ f(w)=\frac{1}{n}\sum_{i=1}^{n}l\left(h_{w}(x_{i}),y_{i}\right), \]

where $h_{w}$ is the hypothesis function that we intend to optimise over. The most common case of ERM is minimisation of the negative log-likelihood, for instance the classical logistic regression problem (see standard texts such as Bishop 2006). Often ERM is an approximation to an underlying stochastic programming problem, where the summation is replaced with an expectation over a population of data tuples from which we can sample.
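To make the finite sum structure concrete, the sketch below writes out one common ERM instance: L2-regularised binary logistic regression, with the regulariser split evenly across the terms so that every $f_i$ is strongly convex. The data is synthetic and the function names are our own, not from the thesis.

```python
import numpy as np

def f_i(w, x_i, y_i, mu):
    """One term of the finite sum: logistic loss on example (x_i, y_i), y_i in {-1,+1},
    plus an L2 term so that each f_i is mu-strongly convex."""
    return np.log1p(np.exp(-y_i * x_i.dot(w))) + 0.5 * mu * w.dot(w)

def grad_f_i(w, x_i, y_i, mu):
    """Gradient of the above term."""
    s = 1.0 / (1.0 + np.exp(y_i * x_i.dot(w)))  # = sigmoid(-y_i * <x_i, w>)
    return -y_i * s * x_i + mu * w

def f(w, X, y, mu):
    """The full objective f(w) = (1/n) sum_i f_i(w)."""
    return np.mean([f_i(w, X[i], y[i], mu) for i in range(len(y))])

rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 5)), rng.choice([-1.0, 1.0], size=100)
print(f(np.zeros(5), X, y, mu=0.1))  # = log(2) at w = 0
```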

2.1.1 Exploiting problem structure

Given the very general nature of the finite sum structure, we can not expect to get faster convergence than we would by accessing the whole gradient without additional assumptions. For example, suppose the summation only has one term, or alternatively that each $f_{i}$ is the zero function except for one of them.

Notice that the Lipschitz smoothness and strong convexity assumptions we made are on each $f_{i}$ rather than on $f$. This is a key point. If the directions of maximum curvature of each term are aligned and of similar magnitude, then we can expect the term Lipschitz smoothness to be similar to the smoothness of the whole function. However, it is easy to construct problems for which this is not the case; in fact the Lipschitz smoothness constant of $f$ may be $n$ times smaller than that of each $f_{i}$. In that case the incremental gradient methods will give no improvement over black box optimisation methods. For machine learning problems, and particularly the empirical risk minimisation problem, this worst case behaviour is not common. The curvature and hence the Lipschitz constants are defined largely by the loss function, which is shared between the terms, rather than by the data points. Common data preprocessing methods such as data whitening can improve this even further.

The requirement that the magnitudes of the Lipschitz constants be approximately balanced can be relaxed in some cases. It is possible to formulate IG methods where the convergence is stated in terms of the average of the Lipschitz constants of the $f_{i}$ instead of the maximum. This is the case for the Prox-Finito algorithm described in Section 3.6. All known methods that make use of the average Lipschitz constant require knowledge of the ratios of the Lipschitz constants of the terms, which unfortunately limits their practicality.

Regardless of the condition number of the problem, if we have a summation with enough terms, optimisation becomes easy. This is made precise in the definition that follows.

Definition 2.1.

The big data condition: for some known constant $\beta$,

\[ n\geq\beta\frac{L}{\mu}. \]

This condition obviously requires strong convexity and Lipschitz smoothness so that $L/\mu$ is well defined. It is a very strong assumption for small $\beta$, as the condition number $L/\mu$ in typical machine learning problems is at least in the thousands. For applications of this assumption, $\beta$ is typically a small constant, such as 2. Several of the methods we describe below have a fixed and very fast convergence rate independent of the condition number when this big-data condition holds.

2.1.2 Randomness and expected convergence rates

This thesis works extensively with optimisation methods that make random decisions during the course of the algorithm. Unlike the stochastic approximation setting, we are dealing with deterministic, known optimisation problems; the stochasticity is introduced by our optimisation methods, it is not inherent in the problem. We introduce randomness because it allows us to get convergence rates faster than that of any currently known deterministic method. The caveat is that these convergence rates are in expectation, so they don't always hold precisely. This is not as bad as it first seems though. Determining that the expectation of a general random variable converges is normally quite a weak result, as its value may vary around the expectation substantially in practice, potentially by far more than it converges by. The reason why this is not an issue for the optimisation methods we consider is that all the random variables we bound are non-negative. A non-negative random variable $X$ with a very small expectation, say

\[ \mathbb{E}\left[X\right]=\epsilon, \]

is with high probability close to its expectation. This is a fundamental result implied by Markov's inequality. For example, suppose we want to bound the probability that $X$ is greater than $100\epsilon$, i.e. a factor of 100 worse than its expectation. Then Markov's inequality tells us that:

\[ P\left(X\geq100\epsilon\right)\leq\frac{\mathbb{E}\left[X\right]}{100\epsilon}=\frac{1}{100}. \]

So there is only a 1% chance of $X$ being larger than 100 times its expected value here. We will largely focus on methods with linear convergence in the following chapters, so in order to increase the probability of the bound holding by a further factor, only a logarithmic number of additional iterations in that factor is required. We would also like to note that Markov's inequality can be quite conservative. Our experiments in later chapters show little in the way of random noise attributable to the optimisation procedure, particularly when the amount of data is large.
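The last point can be made precise with a short calculation, which we include here as a sketch (our own addition, with illustrative constants): applying Markov's inequality to a linearly convergent bound on a non-negative suboptimality.

```latex
% Suppose X_k >= 0 is the suboptimality after k iterations and E[X_k] <= rho^k C
% for some rho < 1 (a linear convergence rate in expectation). By Markov's inequality,
P\left(X_{k}\geq\epsilon\right)\;\leq\;\frac{\mathbb{E}\left[X_{k}\right]}{\epsilon}
  \;\leq\;\frac{\rho^{k}C}{\epsilon},
% so to drive the failure probability below delta it suffices to take
k\;\geq\;\frac{\log\left(C/(\epsilon\delta)\right)}{\log(1/\rho)},
% i.e. only an additional log(1/delta)/log(1/rho) iterations are needed beyond
% those required to make the expectation itself smaller than epsilon.
```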

2.1.3 Data access order

The source of randomness in all the methods considered in this chapter is the order of accessing the terms $f_{i}$. By access we mean the evaluation of $f_{i}$ and $\nabla f_{i}$ at a point of our choice. This is more formally known as an oracle evaluation (see Section 5.1), and typically constitutes the most computationally expensive part of the main loop of each algorithm we consider. The access order is defined on a per-epoch basis, where an epoch is $n$ evaluations. Only three different access orders are considered in this work:

Cyclic

At each step $k$ we access term $i=k\bmod n$. Effectively we access the $f_{i}$ in the order they appear, then loop back to the beginning at the end of every epoch.

Permuted

Within each epoch, the index $i$ used at each step is sampled without replacement from the set of indices not accessed yet in that epoch. This is equivalent to permuting the $f_{i}$ at the beginning of each epoch, then using the cyclic order within the epoch.

Randomised

The value of $i$ is sampled uniformly at random with replacement from $\{1,\dots,n\}$.

The “permuted” terminology is our nomenclature, whereas the other two terms are standard.
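The three access orders can be summarised by the index generators sketched below; this is our own illustration and the generator names are not from the thesis.

```python
import numpy as np

def cyclic(n, epochs):
    """Cyclic: indices 0, 1, ..., n-1 in order, repeated every epoch."""
    for _ in range(epochs):
        yield from range(n)

def permuted(n, epochs, rng):
    """Permuted: a fresh without-replacement shuffle at the start of each epoch."""
    for _ in range(epochs):
        yield from rng.permutation(n)

def randomised(n, epochs, rng):
    """Randomised: uniform sampling with replacement, n draws per epoch."""
    for _ in range(epochs):
        yield from rng.integers(0, n, size=n)

rng = np.random.default_rng(0)
print(list(cyclic(4, 2)))           # [0, 1, 2, 3, 0, 1, 2, 3]
print(list(permuted(4, 2, rng)))    # e.g. [2, 0, 1, 3, 3, 1, 0, 2]
print(list(randomised(4, 2, rng)))  # e.g. [3, 1, 0, 0, 0, 3, 2, 1]
```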

2.2 Early Incremental Gradient Methods

The classical incremental gradient (IG) method is simply a step of the form:

\[ x^{(k+1)}=x^{(k)}-\gamma_{k}\nabla f_{i}(x^{(k)}), \]

where at step $k$ we use cyclic access, taking $i=k\bmod n$. This is similar to the more well known stochastic gradient descent, but with a cyclic order of access of the data instead of a random order. We have introduced here a superscript notation $x^{(k)}$ for the variable $x$ at step $k$. We use this notation throughout this work. It turns out to be much easier to analyse such methods under a random access ordering. For the random order IG method (i.e. SGD) on smooth strongly convex problems, the following rate holds for appropriately chosen step sizes:

\[ \mathbb{E}\left[f(x^{(k)})\right]-f(x^{*})=O\left(\frac{1}{k}\right). \]

The step size scheme required is of the form $\gamma_{k}=c/k$, where $c$ is a constant that depends on the gradient norm bound as well as the degree of strong convexity $\mu$. It may be required to be quite small in some cases. This is what's known as a sublinear rate of convergence, as the dependence on $k$ is of the form $1/k$, which is asymptotically slower than the linear rate $\rho^{k}$ for any $\rho<1$.

Incremental gradient methods for strongly convex smooth problems were of less practical utility in machine learning up until the development of fast variants (discussed below), as the sublinear rates for the previously known methods did not compare favourably to the (super-)linear rates of quasi-Newton methods. For non-strongly convex problems, or strongly convex but non-smooth problems, the story is quite different. In those cases, the theoretical and practical rates are hard to beat with full (sub-)gradient methods. The non-convex case is of particular interest in machine learning. SGD has been the de facto standard optimisation method for neural networks, for example, since the 1980s (Rumelhart et al., 1986). Such incremental gradient methods have a long history, having been applied to specific problems as far back as the 1960s (Widrow and Hoff, 1960). An up-to-date survey can be found in Bertsekas (2012).
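A minimal sketch of the loop described above is given below; it is our own illustration, where grad_f_i(x, i) is assumed to return $\nabla f_i(x)$ for the problem at hand, and the decreasing $c/k$ step size scheme is used.

```python
import numpy as np

def sgd(grad_f_i, x0, n, epochs, c, randomise=True, seed=0):
    """Classical incremental gradient / SGD:
    x <- x - gamma_k * grad f_i(x), with gamma_k = c / k.
    randomise=False gives the cyclic-order (classical IG) variant."""
    rng = np.random.default_rng(seed)
    x, k = x0.copy(), 1
    for _ in range(epochs):
        for j in range(n):
            i = rng.integers(n) if randomise else j
            x = x - (c / k) * grad_f_i(x, i)
            k += 1
    return x
```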

2.3 Stochastic Dual Coordinate Descent (SDCA)

The stochastic dual coordinate descent method (Shalev-Shwartz and Zhang, 2013b) is based on the principle that for problems with explicit quadratic regularisers, the dual takes a particularly easy-to-work-with form. Recall the finite sum structure defined earlier. Instead of assuming that each $f_{i}$ is strongly convex, we instead need to consider the regularised objective:

\[ f(x)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x)+\frac{\mu}{2}\left\Vert x\right\Vert ^{2}. \]

For any strongly convex $f$, we may transform our function to this form by replacing each $f_{i}(x)$ with $f_{i}(x)-\frac{\mu}{2}\left\Vert x\right\Vert ^{2}$, then including a separate regulariser. This changes the Lipschitz smoothness constant for each term to $L-\mu$, and preserves convexity. We are now ready to consider the dual transformation. We apply the technique of dual decomposition, where we decouple the terms in our objective as follows:

\[ \min_{x,x_{1},\dots,x_{n}}\;\frac{1}{n}\sum_{i=1}^{n}f_{i}(x_{i})+\frac{\mu}{2}\left\Vert x\right\Vert ^{2}\quad\text{subject to}\quad x_{i}=x\;\text{ for all }i. \]

This reformulation initially achieves nothing, but the key idea is that we now have a constrained optimisation problem, and so we may apply Lagrangian duality (Section A.3). The Lagrangian function is:

(2.1)

where $\alpha_{i}$ are the introduced dual variables. The Lagrangian dual function is formed by taking the minimum of the Lagrangian with respect to each $x_{i}$ and $x$, leaving $\alpha$, the set of $\alpha_{i}$, free:

(2.2)

Now recall that the definition of the convex conjugate (Section A.2) says that:

\[ f_{i}^{*}(\alpha_{i})=\sup_{x_{i}}\left\{ \left\langle x_{i},\alpha_{i}\right\rangle -f_{i}(x_{i})\right\} . \]

Clearly we can plug this in for each $f_{i}$ term to get:

We still need to simplify the remaining term involving $x$, which is also in the form of a convex conjugate. We know that squared norms are self-conjugate, and scaling a function by a positive constant $c$ transforms its conjugate from $f^{*}(y)$ to $cf^{*}(y/c)$, so we in fact have:

This is the objective directly maximised by SDCA. As the name implies, SDCA is randomised (block) coordinate ascent on this objective, where only one $\alpha_{i}$ is changed at each step. In coordinate descent we have the option of performing a gradient step in a coordinate direction, or an exact minimisation. For the exact coordinate minimisation, the update is easy to derive:

(2.3)

The primal point $x^{(k)}$ corresponding to the dual variables at step $k$ is the minimiser of the conjugate problem, which in closed form is simply a constant times the sum of the dual variables. This can be used to further simplify Equation 2.3. The full method is Algorithm 2.1.

Initialise $x$ and each $\alpha_{i}$ as the zero vector. Step $k+1$:

  1. Pick an index $j$ uniformly at random.

  2. Update $\alpha_{j}$, leaving the other $\alpha_{i}$ ($i\neq j$) unchanged:

  3. Update the primal point $x$ to reflect the change in $\alpha_{j}$.

At completion, for smooth $f_{i}$ return the final $x$. For non-smooth $f_{i}$, return a tail average of the $x$ sequence.

Algorithm 2.1 SDCA (exact coordinate descent)

The SDCA method has a geometric convergence rate in the dual objective $D(\alpha)$ of the form:

\[ \mathbb{E}\left[D(\alpha^{*})-D(\alpha^{(k)})\right]\leq\left(1-\frac{1}{n+L/\mu}\right)^{k}\left[D(\alpha^{*})-D(\alpha^{(0)})\right]. \]

This is easily extended to a statement about the duality gap, and hence the primal suboptimality, by using the relation:

\[ f(x^{(k)})-f(x^{*})\leq f(x^{(k)})-D(\alpha^{(k)}). \]

2.3.1 Alternative steps

The full coordinate minimisation step discussed in the previous section is not always practical. If we are treating each element $f_{i}$ in the summation as a single datapoint loss, then even for the simple binary logistic loss there is not a closed form solution for the exact coordinate step. We can use a black-box 1D optimisation method to find the coordinate minimiser, but this will generally require 20-30 exponential function evaluations, together with one vector dot product. For multiclass logistic loss, the subproblem solve is not fast enough to yield a usable algorithm. In the case of non-differentiable losses, the situation is better. Most non-differentiable functions we use in machine learning, such as the hinge loss, yield closed form solutions. For performance reasons we often want to treat each $f_{i}$ as a mini-batch loss, in which case we virtually never have a closed form solution for the subproblem, even in the non-differentiable case. Shalev-Shwartz and Zhang (2013a) describe a number of other possible steps which lead to the same theoretical convergence rate as the exact minimisation step, but which are more usable in practice:

Interval Line search:

It turns out that it is sufficient to perform the minimisation in Equation 2.3 along the interval between the current dual variable $\alpha_{j}$ and the point corresponding to the gradient of $f_{j}$ at the current primal iterate. The update takes the form:

Constant step:

If the value of the Lipschitz smoothness constant $L$ is known, we can calculate a conservative value for the line search parameter instead of optimising over it with an interval line search. This gives an update of the form:

This method is much slower in practice than performing a line search, just as a fixed $1/L$ step size with gradient descent is much slower than performing a line search.

2.3.2 Reducing storage requirements

We have presented the SDCA algorithm in full generality above. This results in dual variables $\alpha_{i}$ of the same dimension as $x$, for which the total storage can be prohibitive. In practice, the dual variables often lie on a low-dimensional subspace. This is the case with linear classifiers and regressors, where a $c$-class problem has gradients lying on a $c$-dimensional subspace per data point. A linear classifier takes the form $f_{i}(x)=l_{i}(X_{i}x)$ for a fixed loss $l_{i}$ and data instance matrix $X_{i}$; in the simplest case $X_{i}$ is just the data point duplicated as rows. Then the dual variables are $c$ dimensional, and the updates change accordingly:

This is the form of SDCA presented by Shalev-Shwartz and Zhang (2013a), although with the negation of our dual variables.
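For a loss with a simple conjugate, the exact coordinate step of Algorithm 2.1 has a closed form. The sketch below works through this for ridge regression with scalar per-example dual variables (the reduced storage form above); it is our own illustration using the standard closed-form coordinate update, not code from the thesis.

```python
import numpy as np

def sdca_ridge(X, y, lam, epochs, seed=0):
    """SDCA with exact coordinate steps for ridge regression:
    min_w (1/n) sum_i 0.5*(x_i^T w - y_i)^2 + (lam/2)*||w||^2,
    maintaining the primal-dual relation w = (1/(lam*n)) * sum_i alpha_i * x_i."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha, w = np.zeros(n), np.zeros(d)
    for _ in range(epochs):
        for i in rng.integers(0, n, size=n):
            # Exact maximisation of the dual objective in coordinate i (closed form).
            delta = (y[i] - X[i].dot(w) - alpha[i]) / (1.0 + X[i].dot(X[i]) / (lam * n))
            alpha[i] += delta
            w += (delta / (lam * n)) * X[i]
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
w_true = rng.standard_normal(10)
y = X.dot(w_true) + 0.01 * rng.standard_normal(200)
print(np.linalg.norm(sdca_ridge(X, y, lam=0.1, epochs=30) - w_true))  # small residual
```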

2.3.3 Accelerated SDCA

The SDCA method is also currently the only fast incremental gradient method to have a known accelerated variant. By acceleration, we refer to the modification of an optimisation method to improve the convergence rate by an amount greater than any constant factor. This terminology is common in optimisation although a precise definition is not normally given. The ASDCA method (Shalev-Shwartz and Zhang, 2013a) works by utilising the regular SDCA method as a sub-procedure. It has an outer loop, which at each step invokes SDCA on a modified problem, whose additional regularisation is centred at a point chosen as an over-relaxed step of the form:

for some known constant. The amount of additional regularisation is likewise computed from the Lipschitz smoothness and strong convexity constants. These regularised sub-problems have a greater degree of strong convexity than $f$, and so individually are much faster to solve. By a careful choice of the accuracy to which they are computed, the total number of steps made across all of the subproblem solves is much smaller than would be required if regular SDCA were applied directly to $f$ to reach the same accuracy. In particular, they state that to reach an accuracy of $\epsilon$ in expectation for the function value, they need the following number of iterations:

The $\tilde{O}$ notation used in this bound hides constant factors. This rate is not of the same precise form as the other convergence rates we will discuss in this chapter. We can make some general statements though. When $n$ is in the range of the big-data condition, this rate is no better than regular SDCA's rate, and probably worse in practice due to overheads hidden by the notation. When $n$ is much smaller than $L/\mu$, then potentially it can be much faster than regular SDCA.

Unfortunately, the ASDCA procedure has significant computational overheads that make it not necessarily the best choice in practice. Probably the biggest issue however is a sensitivity to the Lipschitz smoothness and strong convexity constants. It assumes these are known, and if the values used differ from the true values, it may be significantly slower than regular SDCA. In contrast, regular SDCA requires no knowledge of the Lipschitz smoothness constants (for the prox variant at least), just the strong convexity (regularisation) constant.

2.4 Stochastic Average Gradient (SAG)

The SAG algorithm (Schmidt et al., 2013) is the closest in form to the classical SGD algorithm among the fast incremental gradient methods. Instead of storing dual variables like SDCA above, we store a table of past gradients $y_{i}$, one for each term, which has the same storage cost in general. The SAG method is given in Algorithm 2.2. The key equation for SAG is the step:

\[ x^{(k+1)}=x^{(k)}-\frac{\gamma}{n}\sum_{i=1}^{n}y_{i}^{(k)}, \]

where $y_{i}^{(k)}$ is the gradient of $f_{i}$ evaluated at the last point at which term $i$ was accessed. Essentially we move in the direction of the average of the past gradients. Note that this average contains one past gradient for each term, and they are equally weighted. This can be contrasted to the SGD method with momentum, which uses a geometrically decaying weighted sum of all past gradient evaluations. SGD with momentum however is not a linearly convergent method. It is surprising that using equal weights like this actually yields a much faster converging algorithm, even though some of the gradients in the summation can be extremely out of date. SAG is an evolution of the earlier incremental averaged gradient method (IAG, Blatt et al., 2007), which has the same update with a different constant factor, and with cyclic access used instead of randomised. IAG has a more limited convergence theory covering quadratic or bounded gradient problems, and a much slower rate of convergence.

Initialise $x$ as the zero vector, and $y_{i}=0$ for each $i$. Step $k+1$:

  1. Pick an index $j$ uniformly at random.

  2. Update $x$ using step length constant $\gamma$:

\[ x^{(k+1)}=x^{(k)}-\frac{\gamma}{n}\left(\nabla f_{j}(x^{(k)})-y_{j}+\sum_{i=1}^{n}y_{i}\right). \]

  3. Set $y_{j}=\nabla f_{j}(x^{(k)})$, using the gradient computed in step 2. Leave $y_{i}$ unchanged for $i\neq j$.

Algorithm 2.2 SAG
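The sketch below implements the SAG update while maintaining a running sum of the gradient table, so that each step costs a single term-gradient evaluation plus $O(d)$ bookkeeping. It is our own illustration; grad_f_i(x, i) is assumed to return $\nabla f_i(x)$, and the $1/(16L)$ step size from the theory is used.

```python
import numpy as np

def sag(grad_f_i, x0, n, L, epochs, seed=0):
    """SAG: refresh the stored gradient at a random index j, then take a step
    in the direction of the average of all stored gradients."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    y = np.zeros((n, x.size))   # table of past gradients y_i
    g_sum = np.zeros(x.size)    # running sum of the table
    gamma = 1.0 / (16 * L)      # theoretical step; 1/L is often better in practice
    for _ in range(epochs):
        for j in rng.integers(0, n, size=n):
            g = grad_f_i(x, j)
            g_sum += g - y[j]   # update the running sum in O(d)
            y[j] = g
            x = x - (gamma / n) * g_sum
    return x
```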

The convergence rate of SAG for strongly convex problems is of the same order as SDCA, although the constants are not quite as good. In particular, we have an expected convergence rate in terms of function value suboptimality of:

\[ \mathbb{E}\left[f(x^{(k)})\right]-f(x^{*})\leq\left(1-\min\left\{ \frac{\mu}{16L},\frac{1}{8n}\right\} \right)^{k}C_{0}, \]

where $C_{0}$ is a complex expression involving a quadratic form of the initial point and each of the initial stored gradients. This theoretical convergence rate is between 8 and 16 times worse than SDCA. In practice SAG is often faster than SDCA though, suggesting that the SAG theory is not tight. A nice feature of SAG is that unlike SDCA, it can be directly applied to non-strongly convex problems. Differentiability is still required though. The convergence rate is then in terms of the average iterate $\bar{x}^{(k)}$:

The SAG algorithm has great practical performance, but it is surprisingly difficult to analyse theoretically. The above rates are likely conservative by a significant constant factor. Due to the difficulty of analysis, the proximal version for composite losses has not yet had its theoretical convergence established.

2.5 Stochastic Variance Reduced Gradient (SVRG)

The SVRG method (Johnson and Zhang, 2013) is a recently developed fast incremental gradient method. It was developed to address the potentially high storage costs of SDCA and SAG, by trading off storage against computation. The SVRG method is given in Algorithm 2.3. Unlike the other methods discussed, there is a tunable parameter $m$, which specifies the number of iterations to complete before the current gradient approximation is “recalibrated” by computing a full gradient $\nabla f(\tilde{x})$ at the last iterate $\tilde{x}$ before the recalibration. Essentially, instead of maintaining a table of past gradients for each $f_{i}$ like SAG does, the algorithm just stores the location $\tilde{x}$ at which those gradients should be evaluated, then re-evaluates them when needed by just computing $\nabla f_{i}(\tilde{x})$. Like the SAG algorithm, at each step we need to know the updated term gradient $\nabla f_{j}(x^{(k)})$, the old term gradient $\nabla f_{j}(\tilde{x})$ and the average of the old gradients $\nabla f(\tilde{x})$. Since we are not storing the old term gradients, just their average, we need to calculate two term gradients at each step instead of the one term gradient calculated by SAG.

Initialise $x^{(0)}$ as the zero vector, and set $\tilde{x}=x^{(0)}$. Step $k+1$:

  1. Pick an index $j$ uniformly at random.

  2. Update $x$:

\[ x^{(k+1)}=x^{(k)}-\gamma\left(\nabla f_{j}(x^{(k)})-\nabla f_{j}(\tilde{x})+\nabla f(\tilde{x})\right). \]

  3. Every $m$ iterations, set $\tilde{x}=x^{(k+1)}$ and recalculate the full gradient at that point:

\[ \nabla f(\tilde{x})=\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(\tilde{x}). \]

    Otherwise leave $\tilde{x}$ and $\nabla f(\tilde{x})$ unchanged.

At completion return the final iterate $x^{(k)}$.

Algorithm 2.3 SVRG
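The sketch below gives the SVRG inner and outer loops as described above, with the simple choice of $\tilde{x}$ being the last iterate (the step 3 variant). It is our own illustration; grad_f_i(x, i) is assumed to return $\nabla f_i(x)$.

```python
import numpy as np

def svrg(grad_f_i, x0, n, gamma, m, outer_iters, seed=0):
    """SVRG: every m inner steps, recalibrate with a full gradient at x_tilde.
    Inner step: x <- x - gamma * (grad_j(x) - grad_j(x_tilde) + full_grad)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(outer_iters):
        x_tilde = x.copy()
        full_grad = np.mean([grad_f_i(x_tilde, i) for i in range(n)], axis=0)
        for j in rng.integers(0, n, size=m):
            x = x - gamma * (grad_f_i(x, j) - grad_f_i(x_tilde, j) + full_grad)
    return x
```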

The S2GD method (Konečný and Richtárik, 2013) was concurrently developed with SVRG. It has the same update as SVRG, just differing in the theoretical choice of $\tilde{x}$ discussed in the next paragraph. We use SVRG henceforth to refer to both methods. The $\tilde{x}$ update in step 3 above is technically not supported by the theory. Instead, one of the following two updates is used:

  1. $\tilde{x}$ is the average of the $x$ values from the last $m$ iterations. This is the variant suggested by Johnson and Zhang (2013).

  2. $\tilde{x}$ is an $x$ randomly sampled from the last $m$ iterations. This is used in the S2GD variant (Konečný and Richtárik, 2013).

These alternative updates are required theoretically as the convergence between recalibrations is expressed in terms of the average of the function values of the last $m$ points,

\[ \frac{1}{m}\sum_{t=k-m+1}^{k}f(x^{(t)}), \]

instead of in terms of $f(x^{(k)})$ directly. Variant 1 avoids this issue by using Jensen's inequality to pull the summation inside:

\[ f\left(\frac{1}{m}\sum_{t=k-m+1}^{k}x^{(t)}\right)\leq\frac{1}{m}\sum_{t=k-m+1}^{k}f(x^{(t)}). \]

Variant 2 uses a sampled $x^{(t)}$, which in expectation will also have the required value. In practice, there is a very high probability that $f(x^{(k)})$ is less than the last-$m$ average, so just taking $\tilde{x}=x^{(k)}$ works. The SVRG method has the following convergence rate if $k$ is a multiple of $m$:

Note also that each step requires two term gradients, so the rate must be halved when comparing against the other methods described in this chapter. There is also the cost of the recalibration pass, which (depending on $m$) can further increase the run time to three times that of SAG per step. This convergence rate has quite a different form from that of the other methods considered in this section, making direct comparison difficult. However, for most parameter values this theoretical rate is worse than that of the other fast incremental gradient methods. In Section 4.7 we give an analysis of SVRG that requires additional assumptions, but gives a rate that is directly comparable to the other fast incremental gradient methods.

Chapter 3 New Dual Incremental Gradient Methods

In this chapter we introduce a novel fast incremental gradient method for strongly convex problems that we call Finito. Like SDCA, SVRG and SAG, Finito is a stochastic method that is able to achieve linear convergence rates for strongly convex problems. Although the Finito algorithm only uses primal quantities directly, the proof of its convergence rate uses lower bounds extensively, so it can be considered a dual method, like SDCA. Similar to SDCA, its theory does not support its use on non-strongly convex problems, although there are no practical issues with its application.

In Section 3.7 we prove the convergence rate of the Finito method under the big-data condition described in the previous chapter. This theoretical rate is better than the SAG and SVRG rates but not quite as good as the SDCA rate. In Section 3.3 we compare Finito empirically against SAG and SDCA, showing that it converges faster. This difference is most pronounced when using a permuted access order, which unfortunately is not covered by current convergence theory.

The relationship between Finito and SDCA allows a kind of midpoint algorithm to be constructed, which has favourable properties of both methods. We call this midpoint Prox-Finito. It is described in Section 3.6. An earlier version of the work in this chapter has been published as Defazio et al. (2014b).

3.1 The Finito Algorithm

As discussed in Chapter 2, we are interested in convex functions of the form

\[ f(x)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(x). \]

We assume that each $f_{i}$ is Lipschitz smooth with constant $L$ and is strongly convex with constant $\mu$. We will focus on the big data setting:

\[ n\geq\beta\frac{L}{\mu}, \]

for a known constant $\beta$, as described in Section 2.1.1.

3.1.1 Additional notation

We omit the upper limit $n$ on summations throughout, and subscript with $i$, with the implication that indexing starts at $1$. When we use separate arguments for each $f_{i}$, we denote them $\phi_{i}$. Let $\bar{\phi}$ denote the average $\frac{1}{n}\sum_{i}\phi_{i}$. Our step length constant, which depends on $\beta$ (the big-data constant), is denoted $\alpha$. Note that $\alpha$ is an inverse step length, in the sense that a large $\alpha$ results in smaller steps. We use angle bracket notation $\left\langle \cdot,\cdot\right\rangle$ for dot products.

3.1.2 Method

We start with a table of known $\phi_{i}$ values, and a table of known gradients $\nabla f_{i}(\phi_{i})$, for each $i$. We will update these two tables during the course of the algorithm. The Finito method is described in Algorithm 3.1.

Initialise $\phi_{i}=x^{(0)}$ for some $x^{(0)}$ and all $i$, and calculate and store each $\nabla f_{i}(\phi_{i})$. The step for iteration $k+1$ is:

  1. Update $x$ using the step:

\[ x^{(k+1)}=\bar{\phi}^{(k)}-\frac{1}{\alpha\mu n}\sum_{i=1}^{n}\nabla f_{i}(\phi_{i}^{(k)}). \]
    (3.1)
  2. Pick an index $j$ uniformly at random, or using without-replacement sampling as discussed in Section 3.2.

  3. Set $\phi_{j}=x^{(k+1)}$ in the table, and leave the other table entries the same ($\phi_{i}$ unchanged for $i\neq j$).

  4. Calculate and store $\nabla f_{j}(\phi_{j})=\nabla f_{j}(x^{(k+1)})$ in the gradient table.

Algorithm 3.1 Finito Algorithm
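A compact sketch of the Finito update as described above, maintaining running sums of the two tables so that each step costs one gradient evaluation plus $O(d)$ bookkeeping. This is our own illustration, using $\alpha = 2$ and the permuted access order of Section 3.2; grad_f_i(x, i) is assumed to return $\nabla f_i(x)$.

```python
import numpy as np

def finito(grad_f_i, x0, n, mu, alpha=2.0, epochs=10, seed=0):
    """Finito: maintain tables phi_i and grad f_i(phi_i); at each step set
    x = mean(phi) - (1/(alpha*mu*n)) * sum_i grad f_i(phi_i),
    then refresh both tables at the chosen index j."""
    rng = np.random.default_rng(seed)
    phi = np.tile(x0, (n, 1))                              # table of phi_i
    grads = np.stack([grad_f_i(x0, i) for i in range(n)])  # table of gradients
    phi_sum, grad_sum = phi.sum(axis=0), grads.sum(axis=0)
    for _ in range(epochs):
        for j in rng.permutation(n):                       # permuted access order
            x = phi_sum / n - grad_sum / (alpha * mu * n)
            g = grad_f_i(x, j)
            phi_sum += x - phi[j]
            grad_sum += g - grads[j]
            phi[j], grads[j] = x, g
    return phi_sum / n                                     # return the average phi
```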

We have established the theoretical convergence rate of Finito under the big-data condition:

Theorem 3.1.

When the big-data condition holds with $\beta=2$, $\alpha=2$ may be used. In that setting, the expected convergence rate is:

\[ \mathbb{E}\left[f(\bar{\phi}^{(k)})\right]-f(x^{*})\leq\left(1-\frac{1}{2n}\right)^{k}C_{0}, \]

where $C_{0}$ is a constant depending on the initialisation.

See Section 3.7 for the proof. This can be compared to the SAG method, which achieves a $\left(1-\frac{1}{8n}\right)^{k}$ geometric rate under a comparable big-data condition. The SDCA method has a $\left(1-\frac{1}{n+L/\mu}\right)^{k}$ rate, which is very slightly better than Finito's. Note that on a per epoch basis, the Finito rate corresponds to an error reduction of approximately $e^{-1/2}\approx0.61$. Lemma B.7 discusses this approximation in more detail. To put that rate into context, 10 epochs will see the error bound reduced by more than 148x.

One notable feature of our method is the fixed step size. In typical machine learning problems the strong convexity constant is given by the strength constant of the quadratic regulariser used. Since this is a known quantity, as long as the big-data condition holds, $\alpha=2$ may be used without any tuning or adjustment of Finito required. This lack of tuning is a major feature of Finito.

Our theory is not as complete as for SDCA and SAG. In cases where the big-data condition does not hold, we conjecture that the step size must be reduced proportionally to the violation of the big-data condition. In practice, the most effective step size can be found by testing a number of step sizes, as is usually done with other stochastic optimisation methods. We do not have any convergence theory for when the big-data condition doesn't hold, except for when very small step sizes are used (Section 5.3).

3.1.3 Storage costs

Another difference compared to the SAG method is that we store both gradients $\nabla f_{i}(\phi_{i})$ and points $\phi_{i}$. We do not actually need twice as much memory however, as they can be stored summed together. In particular, we can store the combined quantities $e_{i}=\phi_{i}-\frac{1}{\alpha\mu}\nabla f_{i}(\phi_{i})$, so that the step becomes $x^{(k+1)}=\frac{1}{n}\sum_{i}e_{i}$. This trick does not work when step lengths are adjusted during optimisation unfortunately. When using this trick it would on the surface appear that storage of the $\phi_{i}$ is a disadvantage when the gradients are sparse but the $\phi_{i}$ values are not. However, when $f_{i}$ is strongly convex (which is one of our assumptions) this cannot occur.

There is an additional possible method for avoiding the storage of the $\phi_{i}$ values. If we can easily evaluate the gradient of the convex conjugate of each $f_{i}$, then we can use the relation (Section A.2):

\[ \phi_{i}=\nabla f_{i}^{*}\left(\nabla f_{i}(\phi_{i})\right). \]

This is possible because for strongly convex functions there is an isomorphism between the dual space that the gradients live in and the primal space.

3.2 Permutation & the Importance of Randomness

One of the most interesting aspects of Finito and the other fast incremental gradient methods is the random choice of index at each iteration. We are not in a stochastic approximation setting, so there is no inherent randomness in the problem. Yet it seems that randomisation is required for Finito. It diverges in practice if a cyclic access order is used. It is hard to emphasise enough the importance of randomness here. The technique of pre-permuting the data, then doing in-order passes after that, is not enough. Even reducing the step size in SAG or Finito by 1 or 2 orders of magnitude does not fix convergence.

The permuted ordering described in Section 2.1.3 is particularly well suited to use with Finito. Recall that for the permuted ordering, at each step within an epoch the data point is sampled without replacement from the points not accessed yet in that epoch. In practice, this approach does not give any speed-up with SAG, however it works spectacularly well with Finito. We see speed-ups of up to a factor of two using this approach. This is one of the major differences in practice between SAG and Finito. We should note that we have no theory to support this case however.

The SDCA method is also sometimes used with a permuted ordering (Shalev-Shwartz and Zhang, 2013b); our experiments in Section 3.3 show that this sometimes results in a speed-up over uniform random sampling, although it does not appear to be as reliable as with Finito.

3.3 Experiments

In this section we compare Finito, SAG, SDCA and LBFGS. The SVRG method was not published at the time these experiments were run. We only consider problems where the regulariser is large enough so that the big-data condition holds, as this is the case our theory supports. However, in practice our method can be used with smaller step sizes in the more general case, in much the same way as SAG.

Since we do not know the Lipschitz smoothness constant for these problems exactly, the SAG method was run for a variety of step sizes, with the one that gave the fastest rate of convergence plotted. The best step size for SAG is usually not what the theory suggests; Schmidt et al. (2013) suggest using a larger step size than the theoretical one. For Finito, we find that using $\alpha=2$ gives the fastest rate when the big-data condition holds. This is the step suggested by our theory. Interestingly, reducing $\alpha$ to 1 does not improve the convergence rate; instead, we see no further improvement in our experiments.

For both SAG and Finito we used a different step size rule than suggested by the theory for the first pass. For Finito, during the first pass, since we do not have derivatives for each $\phi_{i}$ yet, we simply sum over the terms seen so far:

where we process data points in cyclic order for the first pass only. A similar trick is suggested by Schmidt et al. (2013) for SAG.

For our test problems we choose log loss for 3 binary classification datasets, and quadratic loss for 2 regression tasks. For classification, we tested on the ijcnn1 and covtype datasets (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), as well as MNIST (http://yann.lecun.com/exdb/mnist/), classifying 0-4 against 5-9. For regression, we chose two datasets from the UCI repository: the million song year regression dataset, and the slice-localisation dataset. The training portions of the datasets are of size , , , and respectively.

Figure 3.4: Classification tasks: (a) MNIST, (b) ijcnn1, (c) covtype.

Figure 3.7: Regression tasks: (a) million song, (b) slice.

Figures 3.4 & 3.7 show the results of our experiments. Firstly we can see that LBFGS is not competitive with any of the incremental gradient methods considered. Secondly, the non-permuted SAG, Finito and SDCA often converge at very similar rates. The observed differences are usually down to the speed of the very first pass, where SAG and Finito are using the above mentioned trick to speed their convergence. After the first pass, the slopes of the lines are usually comparable. When considering the methods with permutation each pass, we see a clear advantage for Finito. Interestingly, it gives very flat lines, indicating very stable convergence. SAG with permutation is not shown as it is much slower and more unstable than randomised SAG.

3.4 The MISO Method

The MISO (Mairal, 2013) method is an incremental gradient method applicable when each of the $f_{i}$ terms can easily be majorised (that is, upper bounded everywhere) by known functions. When quadratic functions are chosen for the majorisation step, the method has the same update as Finito but with an alternative and much smaller step size. MISO maintains the following upper bound $B$ on $f$ at each step:

\[ B^{(k)}(x)=\frac{1}{n}\sum_{i=1}^{n}\left[f_{i}(\phi_{i}^{(k)})+\left\langle \nabla f_{i}(\phi_{i}^{(k)}),\,x-\phi_{i}^{(k)}\right\rangle +\frac{L}{2}\left\Vert x-\phi_{i}^{(k)}\right\Vert ^{2}\right]. \]

This is just the sum of the Lipschitz smoothness upper bounds around each $f_{i}$, at the different points $\phi_{i}$. The method alternates updating randomly chosen $\phi_{j}$ values, using the update:

\[ \phi_{j}^{(k+1)}=\operatorname*{argmin}_{x}B^{(k)}(x). \]

This update is sensible. We are using the minimum of the upper bound as our best guess at the solution, just like in gradient descent and Newton's method. Taking the gradient of $B^{(k)}$ to zero gives an explicit formula for the minimiser:

\[ \operatorname*{argmin}_{x}B^{(k)}(x)=\bar{\phi}^{(k)}-\frac{1}{Ln}\sum_{i=1}^{n}\nabla f_{i}(\phi_{i}^{(k)}), \]
(3.2)

which is identical to Finito but with an inverse step size of $L$ instead of $\alpha\mu$. Since the condition number of practical problems is large, generally in the thousands to millions, this step size is dramatically smaller. Mairal (2013) establishes the following convergence rate for this method in the strongly convex case:

(3.3)

Note that on a per epoch basis, this rate essentially matches that of gradient descent, as the geometric factor raised to the power $n$ is approximately $1-\frac{\mu}{L}$. This approximation is described further in Lemma B.4. Gradient descent is not considered a competitive method for optimisation, and likewise, experiments suggest that this method is even slower in practice than gradient descent with line searches. In Theorem 5.8 (Chapter 5) we improve on this result, establishing a convergence rate with a significantly better geometric constant:

although this result is for the specific form of MISO given in Equation 3.2 rather than the general upper bound minimisation scheme they describe. A recent technical report (Mairal, 2014) gives a proof of the MISO method's convergence rate under the larger step sizes we consider with Finito. Although this no longer fits into the upper bound minimisation framework of MISO, they still refer to it as the MISO method. Their report was released while Finito was under peer review. They establish, for a step size comparable to the one used in the Finito proof (see Equation 3.1), a rate of

This constant is slightly worse than the constant we get. They make the same big-data assumption that we do.

3.5 A Primal Form of SDCA

At first glance, the SDCA method bears no obvious relation to Finito. However, there is a simple transformation that can be applied to SDCA that makes the relation clear. This leads to a novel primal formulation of SDCA. It also allows us to construct an algorithm that sits at the midpoint between SDCA and Finito.

Initialise $\phi_{i}=x^{(0)}$ for some initial $x^{(0)}$, and compute the table of gradients $\nabla f_{i}(\phi_{i})$ for each $i$. At step $k+1$:

  1. Pick an index $j$ uniformly at random.

  2. Compute the new point $\phi_{j}^{(k+1)}$ as the proximal operator of $f_{j}$ applied at the point defined by:

    (3.4)
  3. Store the gradient $\nabla f_{j}(\phi_{j}^{(k+1)})$ in the table at location $j$. For $i\neq j$, the table entries are unchanged ($\phi_{i}^{(k+1)}=\phi_{i}^{(k)}$).

At completion, return the corresponding primal iterate (see Theorem 3.2).

Algorithm 3.2 Primal SDCA

The primal version of SDCA is given as Algorithm 3.2. At each step, this algorithm evaluates the proximal operator of a term $f_{j}$ at a particular point, yielding the new point $\phi_{j}^{(k+1)}$. The gradient table is then updated with $\nabla f_{j}(\phi_{j}^{(k+1)})$. The point at which the proximal operator is evaluated is just a constant times the table's average gradient, excluding location $j$. Compare this to the other fast incremental gradient methods that use a table average, such as SAG and Finito, where in those cases the average includes the location $j$.

It's not immediately obvious that the update is actually computing $\nabla f_{j}(\phi_{j}^{(k+1)})$ in Algorithm 3.2. To see why it works, consider the optimality condition of the proximal operator from Step 2, found by taking the definition of the proximal operator and setting its gradient to zero:

Relating this to the update in Step 3 shows that the step does indeed compute the gradient. We claim that this algorithm is exactly equivalent to SDCA when exact coordinate minimisation is used (the standard variant). In fact, a simple one-to-one relation between the dual variables $\alpha_{i}$ and the stored gradients holds. Note that in SDCA the quantities $\phi_{i}$ are not used explicitly, whereas in our notation here we have made them explicit.

Theorem 3.2.

Primal SDCA (Algorithm 3.2) is equivalent to SDCA (Algorithm 2.1) when each $\alpha_{i}$ is initialised to $-\nabla f_{i}(\phi_{i}^{(0)})$. In particular the iterate defined as $x^{(k)}=-\frac{1}{\mu n}\sum_{i}\nabla f_{i}(\phi_{i}^{(k)})$ is identical to the SDCA iterate $x^{(k)}$, due to the relation $\alpha_{i}^{(k)}=-\nabla f_{i}(\phi_{i}^{(k)})$.

Proof.

We will prove this by induction, by showing that the relation $\alpha_{i}^{(k)}=-\nabla f_{i}(\phi_{i}^{(k)})$ holds at each step. For the base case, we have assumed that each $\alpha_{i}$ is initialised to $-\nabla f_{i}(\phi_{i}^{(0)})$, so it holds by construction. Next suppose that it holds at step $k$. Then we just need to show that a single step of Algorithm 3.2 gives $\alpha_{j}^{(k+1)}=-\nabla f_{j}(\phi_{j}^{(k+1)})$. The relation between SDCA and the primal variant we describe stems from the fact that the standard dual version performs a minimisation of the following form at each step:

(3.5)

Here the $\alpha_{i}$ are dual variables and $x^{(k)}$ is the current primal point. As noted by Shalev-Shwartz and Zhang (2013a), this is actually an instance of the proximal operator of the convex conjugate of $f_{j}$. Recall the definition of the proximal operator of a function $f$ with weight $\gamma$:

Clearly the exact coordinate step (Equation 3.5) is the proximal operator of $f_{j}^{*}$ in the following form:

(3.6)
(3.7)
(3.8)

In the last step we have used our inductive assumption relating $\alpha_{i}^{(k)}$ to $\nabla f_{i}(\phi_{i}^{(k)})$. Our primal formulation exploits this definition of the step using a relation between the proximal operator of a function and its convex conjugate known as the Moreau decomposition:

See Section A.2 for more details on the Moreau decomposition. This decomposition allows us to compute the proximal operator of the conjugate via the primal proximal operator. As this is the only use of the conjugate function in the basic SDCA method, applying this decomposition allows us to completely eliminate the "dual" aspect of the algorithm. The same trick can be used to interpret Dykstra's set intersection method as a primal algorithm instead of a dual block coordinate descent algorithm (Combettes and Pesquet, 2011). We now apply the Moreau decomposition to the proximal operator appearing in our primal SDCA (Algorithm 3.2) update, giving:

(3.9)

Now consider Step 3 of Algorithm 3.2. Combining it with Equation 3.9, we have:

This shows that the update for $\alpha_{j}$ in SDCA is identical to the update in Algorithm 3.2. ∎

Remark 3.3.

The initialisation of SDCA typically takes each $\alpha_{i}$ as the zero vector. This is a somewhat unnatural initialisation to use in the primal variant, where it is more natural to take $\phi_{i}=x^{(0)}$ for all $i$. The SDCA-style initialisation can be done using the assignment $\phi_{i}=\nabla f_{i}^{*}(0)$ if desired, but that requires working with the convex conjugate function.

3.6 Prox-Finito: a Novel Midpoint Algorithm

The primal form of SDCA has some resemblance to Finito; the primary difference being that it assumes strong convexity is induced by a separate strongly convex regulariser, rather than each $f_{i}$ being strongly convex. We now show how to modify SDCA so that it works without a separate regulariser. Consider the dual problem maximised by SDCA, using our relation between the $\alpha_{i}$ and the stored gradients, and the corresponding primal point. Using the Lagrangian from Equation 2.1: