
# SGDLibrary: A MATLAB library for stochastic gradient descent algorithms

Hiroyuki Kasai Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo, Japan (kasai@is.uec.ac.jp).
July 24, 2019
###### Abstract

We consider the problem of finding the minimizer of a function of the form $f(\boldsymbol{w}) = \frac{1}{n}\sum_{i=1}^{n} f_i(\boldsymbol{w})$. This problem has been studied intensively in recent years in the field of machine learning. One typical but promising approach for large-scale data is to use stochastic optimization algorithms. SGDLibrary is a flexible, extensible, and efficient pure-MATLAB library containing a collection of stochastic optimization algorithms. The purpose of the library is to provide researchers and implementers with a comprehensive environment for evaluating those algorithms on various machine learning problems.

## 1 Introduction

This work aims to facilitate research on stochastic optimization for large-scale data. We particularly address a regularized empirical loss minimization problem defined as

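$$\min_{\boldsymbol{w} \in \mathbb{R}^d} f(\boldsymbol{w}) := \frac{1}{n}\sum_{i=1}^{n} f_i(\boldsymbol{w}) = \frac{1}{n}\sum_{i=1}^{n} L(\boldsymbol{w}, x_i, y_i) + \lambda R(\boldsymbol{w}), \tag{1}$$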

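where $\boldsymbol{w} \in \mathbb{R}^d$ represents the model parameter and $n$ denotes the number of samples. $L(\boldsymbol{w}, x_i, y_i)$ is the loss function and $R(\boldsymbol{w})$ is the regularizer with the regularization parameter $\lambda \geq 0$. A wide variety of machine learning models fall into this problem class. Considering $L(\boldsymbol{w}, x_i, y_i) = (\boldsymbol{w}^\top x_i - y_i)^2$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, and $R(\boldsymbol{w}) = \|\boldsymbol{w}\|_2^2$, this results in the $\ell_2$-norm regularized linear regression problem (a.k.a. ridge regression) for $n$ training samples $(x_1, y_1), \ldots, (x_n, y_n)$. In the case of binary classification with the desired class label $y_i \in \{+1, -1\}$ and $R(\boldsymbol{w}) = \|\boldsymbol{w}\|_1$, the $\ell_1$-norm regularized logistic regression (LR) problem is obtained as $f_i(\boldsymbol{w}) = \log(1 + \exp(-y_i \boldsymbol{w}^\top x_i)) + \lambda \|\boldsymbol{w}\|_1$, which encourages sparsity of the solution $\boldsymbol{w}$. Other problems include matrix completion, support vector machines (SVM), and sparse principal component analysis, to name but a few.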
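To make the finite-sum structure of (1) concrete, the following Python sketch evaluates the ridge regression objective and the $\ell_1$-norm regularized LR objective on a two-sample toy dataset. It is an illustration under stated assumptions, not code from SGDLibrary: the function names, the toy data, and the value of the regularization parameter are all hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def ridge_objective(w, xs, ys, lam):
    """f(w) = (1/n) * sum_i (w^T x_i - y_i)^2 + lam * ||w||_2^2."""
    n = len(xs)
    loss = sum((dot(w, x) - y) ** 2 for x, y in zip(xs, ys)) / n
    return loss + lam * dot(w, w)

def l1_logistic_objective(w, xs, ys, lam):
    """f(w) = (1/n) * sum_i log(1 + exp(-y_i w^T x_i)) + lam * ||w||_1."""
    n = len(xs)
    loss = sum(math.log1p(math.exp(-y * dot(w, x))) for x, y in zip(xs, ys)) / n
    return loss + lam * sum(abs(wj) for wj in w)

# Hypothetical toy data: two samples in two dimensions.
xs = [[1.0, 0.0], [0.0, 1.0]]
ridge_value = ridge_objective([1.0, -1.0], xs, [1.0, 1.0], lam=0.5)
logistic_value = l1_logistic_objective([0.0, 0.0], xs, [1, -1], lam=0.5)
```

Both functions share the same shape, an average of per-sample losses plus a regularizer, which is what lets one solver handle either problem.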

Full gradient descent (GD) with a step size $\eta$ is the most straightforward approach for (1) as $\boldsymbol{w}_{k+1} \leftarrow \boldsymbol{w}_k - \eta \nabla f(\boldsymbol{w}_k)$, where the update reduces to $\boldsymbol{w}_{k+1} \leftarrow \boldsymbol{w}_k - \frac{\eta}{n}\sum_{i=1}^{n} \nabla f_i(\boldsymbol{w}_k)$. However, this is expensive, especially when $n$ is extremely large. In fact, one needs a sum of $n$ inner products of $d$-dimensional vectors, leading to $\mathcal{O}(nd)$ cost overall per iteration. For this issue, a popular and effective alternative is the stochastic gradient update $\boldsymbol{w}_{k+1} \leftarrow \boldsymbol{w}_k - \eta \boldsymbol{v}_k$, where $\boldsymbol{v}_k$ is a random vector. A popular choice is to set $\boldsymbol{v}_k = \nabla f_{i_k}(\boldsymbol{w}_k)$ for the $i_k$-th sample chosen uniformly at random, which is called stochastic gradient descent (SGD). Its update rule is $\boldsymbol{w}_{k+1} \leftarrow \boldsymbol{w}_k - \eta \nabla f_{i_k}(\boldsymbol{w}_k)$. SGD relies on this being an unbiased estimator of the full gradient, $\mathbb{E}_{i_k}[\nabla f_{i_k}(\boldsymbol{w}_k)] = \nabla f(\boldsymbol{w}_k)$. As the update rule makes clear, the calculation cost is independent of $n$, resulting in $\mathcal{O}(d)$. Mini-batch SGD uses $\boldsymbol{v}_k = \frac{1}{|\mathcal{S}_k|}\sum_{i \in \mathcal{S}_k} \nabla f_i(\boldsymbol{w}_k)$, where $\mathcal{S}_k$ is the set of samples of size $|\mathcal{S}_k|$. SGD also needs a diminishing stepsize algorithm to guarantee convergence, which causes a severely slow convergence rate.

To accelerate this rate, there are two active research directions in machine learning. Variance reduction (VR) techniques [Johnson_NIPS_2013_s, Roux_NIPS_2012_s, Shalev_JMLR_2013_s, Defazio_NIPS_2014_s, Nguyen_ICML_2017] explicitly or implicitly exploit a full gradient estimation to reduce the variance of the noisy stochastic gradient, leading to superior convergence properties. We can regard this approach as a hybrid algorithm of GD and SGD. Another promising direction is to modify deterministic second-order algorithms for stochastic settings, which addresses the potential weakness of first-order algorithms on ill-conditioned problems. A direct extension of quasi-Newton (QN) is known as online BFGS [Schraudolph_AISTATS_2007_s]. Its variants include a regularized version (RES) [Mokhtari_IEEETranSigPro_2014], a limited memory version (oLBFGS) [Schraudolph_AISTATS_2007_s, Mokhtari_JMLR_2015_s], stochastic QN (SQN) [Byrd_SIOPT_2016], incremental QN [Mokhtari_ICASSP_2017], and a non-convex version [Wang_SIOPT_2017]. Lastly, hybrid algorithms combining the stochastic QN algorithm with VR have been proposed [Moritz_AISTATS_2016_s, Kolte_OPT_2015]. Others include [Duchi_JMLR_2011, Bordes_JMLR_2009, De_AISTATS_2017].
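The GD, SGD, and variance-reduced updates can be compared on a toy problem. The Python sketch below is an illustration only, not SGDLibrary code: the scalar least-squares data, the constant stepsize, the iteration counts, and the function names `gd`, `sgd`, and `svrg` are assumptions made for this example. The SVRG-style update follows the usual snapshot form, in which a full gradient at a snapshot point corrects each stochastic gradient. Because the toy data are noise-free, constant-stepsize SGD happens to converge here; in general it needs a diminishing stepsize.

```python
import random

# Hypothetical noise-free scalar data; the minimizer is w* = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
n = len(xs)

def grad_i(w, i):
    """Gradient of the per-sample loss f_i(w) = (w * x_i - y_i)^2."""
    return 2.0 * xs[i] * (w * xs[i] - ys[i])

def full_grad(w):
    """Full gradient: the average of all n per-sample gradients."""
    return sum(grad_i(w, i) for i in range(n)) / n

def gd(w, eta, iters):
    for _ in range(iters):
        w -= eta * full_grad(w)            # n gradient evaluations per step
    return w

def sgd(w, eta, iters, rng):
    for _ in range(iters):
        i = rng.randrange(n)               # one sample, uniformly at random
        w -= eta * grad_i(w, i)            # one gradient evaluation per step
    return w

def svrg(w, eta, epochs, inner, rng):
    for _ in range(epochs):
        w_snap, g_snap = w, full_grad(w)   # snapshot point and its full gradient
        for _ in range(inner):
            i = rng.randrange(n)
            # Stochastic gradient corrected by the snapshot (variance reduction).
            v = grad_i(w, i) - grad_i(w_snap, i) + g_snap
            w -= eta * v
    return w

rng = random.Random(0)
w_gd = gd(0.0, 0.01, 200)
w_sgd = sgd(0.0, 0.01, 500, rng)
w_svrg = svrg(0.0, 0.01, 20, 10, rng)
```

All three runs approach the minimizer; what differs is the number of per-sample gradient evaluations each step consumes.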

The performance of stochastic optimization algorithms is strongly influenced not only by the distribution of data but also by the stepsize algorithm. Therefore, we often encounter experimental results that differ completely from those reported in papers. Consequently, an evaluation framework to test and compare the algorithms at hand is crucially important for fair and comprehensive experiments. SGDLibrary is a flexible, extensible, and efficient pure-MATLAB library containing a collection of stochastic optimization algorithms. The purpose of the library is to provide researchers and implementers with a collection of state-of-the-art stochastic optimization algorithms that solve a variety of large-scale optimization problems. SGDLibrary provides easy access to many solvers and problem examples including, for example, linear/non-linear regression problems and classification problems. It also provides visualization tools to show classification results and convergence behaviors. To the best of my knowledge, no report in the literature describes a comprehensive experimental environment specialized for stochastic optimization algorithms. The code is available at https://github.com/hiroyuki-kasai/SGDLibrary.

## 2 Software architecture

The software architecture of SGDLibrary follows a typical module-based architecture, which separates the problem descriptor from the optimization solver. To use the library, the user selects one problem descriptor of interest and at least one optimization solver to be compared.

Problem descriptor: The problem descriptor, denoted as problem, specifies the problem of interest with respect to $\boldsymbol{w}$, noted as w in the library. The user does nothing other than call a problem definition function, for instance, logistic_regression() for the $\ell_2$-norm regularized LR problem. Each problem definition includes the functions necessary for solvers: (i) the (full) cost function $f(\boldsymbol{w})$, (ii) the mini-batch stochastic derivative $\boldsymbol{v} = \frac{1}{|\mathcal{S}|}\sum_{i \in \mathcal{S}} \nabla f_i(\boldsymbol{w})$ for the set of samples $\mathcal{S}$, which is denoted as indices, (iii) the stochastic Hessian for indices, and (iv) the stochastic Hessian-vector product for a vector v and indices. The built-in problems include $\ell_2$-norm regularized multidimensional linear regression, $\ell_2$-norm regularized linear SVM, $\ell_2$-norm regularized LR, $\ell_2$-norm regularized softmax classification (multinomial LR), $\ell_1$-norm multidimensional linear regression, and $\ell_1$-norm LR. The problem descriptor also provides additional functions that are specific to the problem. For example, the LR problem includes the prediction function and the classification accuracy calculation function.
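The role of a problem descriptor can be illustrated with a small Python analogue for $\ell_2$-norm regularized LR. This is a sketch under stated assumptions, not SGDLibrary's MATLAB implementation: the factory name `make_logistic_problem`, the dictionary layout, and the $\frac{\lambda}{2}\|\boldsymbol{w}\|_2^2$ scaling of the regularizer are choices made for this example. It bundles a full cost function, a mini-batch gradient over `indices`, and a Hessian-vector product over `indices`.

```python
import math

def make_logistic_problem(xs, ys, lam):
    """Bundle the functions a solver needs (hypothetical descriptor layout)."""
    n, d = len(xs), len(xs[0])

    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    def sigmoid(t):
        return 1.0 / (1.0 + math.exp(-t))

    def cost(w):
        # Full cost: average logistic loss plus (lam / 2) * ||w||^2.
        loss = sum(math.log1p(math.exp(-y * dot(w, x))) for x, y in zip(xs, ys)) / n
        return loss + 0.5 * lam * dot(w, w)

    def grad(w, indices):
        # Mini-batch gradient averaged over the samples in `indices`.
        g = [lam * wj for wj in w]
        for i in indices:
            coef = -ys[i] * sigmoid(-ys[i] * dot(w, xs[i])) / len(indices)
            for j in range(d):
                g[j] += coef * xs[i][j]
        return g

    def hess_vec(w, v, indices):
        # Mini-batch Hessian-vector product; the Hessian is never formed.
        hv = [lam * vj for vj in v]
        for i in indices:
            s = sigmoid(ys[i] * dot(w, xs[i]))
            coef = s * (1.0 - s) * dot(xs[i], v) / len(indices)
            for j in range(d):
                hv[j] += coef * xs[i][j]
        return hv

    return {"n": n, "d": d, "cost": cost, "grad": grad, "hess_vec": hess_vec}

# Hypothetical toy data: two samples with labels +1 and -1.
problem = make_logistic_problem([[1.0, 0.0], [0.0, 1.0]], [1, -1], lam=0.1)
w0 = [0.0, 0.0]
all_indices = list(range(problem["n"]))
g0 = problem["grad"](w0, all_indices)
hv0 = problem["hess_vec"](w0, [1.0, 1.0], all_indices)
```

A solver written against this interface never needs to know which loss or regularizer it is minimizing, which is the point of separating the two modules.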

Optimization solver: The optimization solver implements the main routine of the stochastic optimization algorithm. Once the solver function is called with one selected problem descriptor problem as the first argument, it solves the optimization problem by calling the corresponding functions via problem, such as the cost function and the stochastic gradient calculation function. Calling the solver function with the selected problem binds the two together. The supported optimization solvers in the library are grouped into the following categories: (i) SGD methods: Vanilla SGD [Robbins_MathStat_1951], SGD with classical momentum, SGD with classical momentum with Nesterov’s accelerated gradient [Sutskever_ICML_2013], AdaGrad [Duchi_JMLR_2011], RMSProp, AdaDelta, Adam, and AdaMax; (ii) Variance reduction (VR) methods: SVRG [Johnson_NIPS_2013_s], SAG [Roux_NIPS_2012_s], SAGA [Defazio_NIPS_2014_s], and SARAH [Nguyen_ICML_2017]; (iii) Second-order methods: SQN [Bordes_JMLR_2009], oBFGS-Inf [Schraudolph_AISTATS_2007_s, Mokhtari_JMLR_2015_s], oBFGS-Lim (oLBFGS) [Schraudolph_AISTATS_2007_s, Mokhtari_JMLR_2015_s], Reg-oBFGS-Inf (RES) [Mokhtari_IEEETranSigPro_2014], and Damp-oBFGS-Inf [Wang_SIOPT_2017]; (iv) Second-order methods with VR: SVRG-LBFGS [Kolte_OPT_2015], SS-SVRG [Kolte_OPT_2015], and SVRG-SQN [Moritz_AISTATS_2016_s]; (v) Others: BB-SGD [De_AISTATS_2017] and SVRG-BB. The solver also receives optional parameters as the second argument, which is a struct designated as options in the library. It contains elements such as the maximum number of epochs, the batch size, and the stepsize algorithm with an initial stepsize. Finally, the solver returns to the caller the final solution w and rich statistical information, for instance, the histories of the cost function value, the optimality gap, and the number of gradient calculations.
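The calling convention described here (problem first, options second, solution and statistics returned) can be sketched in Python as follows. This is a minimal illustration, not the library's solver: the option names (`w_init`, `max_epoch`, `batch_size`, `step_init`, `seed`), the keys of the returned `info` dictionary, and the toy scalar problem are assumptions chosen for this example.

```python
import random

def sgd_solver(problem, options=None):
    """Mini-batch SGD driven by a problem descriptor (hypothetical interface).

    `problem` must provide n, cost(w), and grad(w, indices).
    """
    opts = {"w_init": 0.0, "max_epoch": 10, "batch_size": 1,
            "step_init": 0.01, "seed": 0}
    opts.update(options or {})              # user options override defaults
    rng = random.Random(opts["seed"])
    w, n, b = opts["w_init"], problem["n"], opts["batch_size"]

    # Histories recorded once per epoch, starting with the initial point.
    info = {"epoch": [0], "grad_calc_count": [0], "cost": [problem["cost"](w)]}
    grad_calcs = 0
    for epoch in range(1, opts["max_epoch"] + 1):
        perm = list(range(n))
        rng.shuffle(perm)                   # new sample order each epoch
        for start in range(0, n, b):
            indices = perm[start:start + b]
            w -= opts["step_init"] * problem["grad"](w, indices)
            grad_calcs += len(indices)
        info["epoch"].append(epoch)
        info["grad_calc_count"].append(grad_calcs)
        info["cost"].append(problem["cost"](w))
    return w, info

# Hypothetical scalar least-squares problem with minimizer w* = 2.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
problem = {
    "n": len(xs),
    "cost": lambda w: sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs),
    "grad": lambda w, idx: sum(2.0 * xs[i] * (w * xs[i] - ys[i]) for i in idx) / len(idx),
}
w_sgd, info_sgd = sgd_solver(problem, {"max_epoch": 50, "step_init": 0.01})
```

Recording the cumulative number of gradient calculations alongside the cost is what makes it possible to compare solvers that do different amounts of work per iteration.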

Others: SGDLibrary accommodates a user-defined stepsize algorithm. This is achieved by setting options.stepsizefun=@my_stepsize_alg, which is delivered to solvers. Additionally, when the regularizer in the minimization problem (1) is non-smooth, such as the $\ell_1$-norm regularizer $\|\boldsymbol{w}\|_1$, the solver calls the proximal operator as problem.prox(w,stepsize), which is the wrapper function defined in each problem. The $\ell_1$-norm regularized LR problem, for example, calls the soft-threshold function as w = prox(w,stepsize)=soft_thresh(w,stepsize*lambda), where stepsize is the stepsize $\eta$ and lambda is the regularization parameter $\lambda$.
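Both mechanisms can be sketched briefly in Python. The code below is an illustration under stated assumptions rather than the library's implementation: `decay_stepsize` is a hypothetical user-defined rule, `proximal_sgd_step` is a hypothetical helper name, and only the soft-thresholding formula itself (the standard proximal operator of the $\ell_1$ norm) is fixed by the mathematics.

```python
def soft_thresh(w, t):
    """Elementwise soft-thresholding: the proximal operator of t * ||w||_1."""
    return [wj - t if wj > t else wj + t if wj < -t else 0.0 for wj in w]

def decay_stepsize(iteration, step_init=0.1, decay=0.01):
    """Hypothetical user-defined rule: step_init / (1 + decay * iteration)."""
    return step_init / (1.0 + decay * iteration)

def proximal_sgd_step(w, grad, iteration, lam, stepsize_fun=decay_stepsize):
    """One proximal stochastic gradient step for a lam * ||w||_1 regularizer."""
    step = stepsize_fun(iteration)
    # Gradient step on the smooth loss only.
    w_half = [wj - step * gj for wj, gj in zip(w, grad)]
    # Proximal step handles the non-smooth l1 term.
    return soft_thresh(w_half, step * lam)

shrunk = soft_thresh([3.0, -0.5, 1.0, -2.0], 1.0)
w_next = proximal_sgd_step([1.0, -1.0], [0.0, 0.0], iteration=0, lam=5.0)
```

Entries whose magnitude falls below the threshold are set exactly to zero, which is how the $\ell_1$ regularizer produces sparse solutions. Passing the stepsize rule as a function argument mirrors the idea of handing a function handle to the solver.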

## 3 Tour of the SGDLibrary

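We embark on a tour of SGDLibrary, using the $\ell_2$-norm regularized LR problem as an example. The LR model generates $n$ pairs of $(x_i, y_i)$ for an (unknown) model parameter $\boldsymbol{w}$, where $x_i$ is an input $d$-dimensional feature vector and $y_i \in \{-1, 1\}$ is the binary class label, as

$$P(y_i \mid x_i, \boldsymbol{w}) = \frac{1}{1 + \exp(-y_i \boldsymbol{w}^\top x_i)}.$$

Then, the problem seeks the unknown parameter $\boldsymbol{w}$ that fits the regularized LR model to the generated data $(x_i, y_i)$. This problem is cast as a minimization problem as

$$\min_{\boldsymbol{w} \in \mathbb{R}^d} f(\boldsymbol{w}) := \frac{1}{n}\sum_{i=1}^{n} \log\left[1 + \exp(-y_i \boldsymbol{w}^\top x_i)\right] + \frac{\lambda}{2}\|\boldsymbol{w}\|_2^2.$$

The code for this particular problem is in Listing 1.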

First, we generate train/test datasets d using logistic_regression_data_generator(), where each input feature vector $x_i$ is $d$-dimensional, $n$ is the number of samples, and $y_i$ is the class label of $x_i$. The LR problem is defined by calling logistic_regression(), which internally contains the functions for the cost value, the gradient, and the Hessian. This is stored in problem. Then, we execute the optimization solvers, i.e., SGD and SVRG, by calling the solver functions sgd() and svrg() with problem and options, after setting some options in the options struct. They return the final solutions {w_sgd,w_svrg} and the statistical information {info_sgd,info_svrg}. Finally, display_graph() visualizes the behavior of the cost function values in terms of the number of gradient evaluations. It is noteworthy that each algorithm requires a different number of evaluations of samples in each epoch. Therefore, it is common to use this value, rather than the number of iterations, to evaluate the algorithms. An illustrative result is presented in Figure 1(a). Figures 1(b) and 1(c) are generated, respectively, by the same display_graph() and by another function, display_classification_result(), which is specialized for classification problems. As these examples show, SGDLibrary also provides rich visualization tools.

## Appendix A Supported stochastic optimization solvers

Table A summarizes the supported stochastic optimization algorithms and configurations.
