Abstract
We propose a new stochastic optimization framework for empirical risk minimization problems such as those that arise in machine learning. The traditional approaches, such as (minibatch) stochastic gradient descent (SGD), utilize an unbiased gradient estimator of the empirical average loss. In contrast, we develop a computationally efficient method to construct a gradient estimator that is purposely biased toward those observations with higher current losses. On the theory side, we show that the proposed method minimizes a new ordered modification of the empirical average loss, and is guaranteed to converge at a sublinear rate to a global optimum for convex loss and to a critical point for weakly convex (nonconvex) loss. Furthermore, we prove a new generalization bound for the proposed algorithm. On the empirical side, the numerical experiments show that our proposed method consistently improves the test errors compared with the standard minibatch SGD in various models including SVM, logistic regression, and deep learning problems.
1 Introduction
Stochastic Gradient Descent (SGD), as the workhorse training algorithm for most machine learning applications including deep learning, has been extensively studied in recent years (e.g., see a recent review by Bottou et al. 2018). At every step, SGD draws one training sample uniformly at random from the training dataset, and then uses the (sub)gradient of the loss over the selected sample to update the model parameters. The most popular version of SGD in practice is perhaps the minibatch SGD (Bottou et al., 2018; Dean et al., 2012), which is widely implemented in the state-of-the-art deep learning frameworks, such as TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2017) and CNTK (Seide and Agarwal, 2016). Instead of choosing one sample per iteration, minibatch SGD randomly selects a minibatch of the samples, and uses the (sub)gradient of the average loss over the selected samples to update the model parameters.
Both SGD and minibatch SGD utilize uniform sampling during the entire learning process, so that the stochastic gradient is always an unbiased gradient estimator of the empirical average loss over all samples. On the other hand, it appears to practitioners that not all samples are equally important, and indeed most of them could be ignored after a few epochs of training without affecting the final model
(Katharopoulos and Fleuret, 2018). For example, intuitively, the samples near the final decision boundary should be more important for building the model than those far away from the boundary in classification problems. In particular, as we will illustrate later in Figure 1, there are cases where those faraway samples may corrupt the model when the average loss is used. In order to further exploit such structures, we propose an efficient sampling scheme on top of minibatch SGD. We call the resulting algorithm ordered SGD; it learns a different type of model with the goal of improving test performance.
The above motivation of ordered SGD is related to that of importance sampling SGD, which has been extensively studied recently in order to improve the convergence speed of SGD (Needell et al., 2014; Zhao and Zhang, 2015; Alain et al., 2015; Loshchilov and Hutter, 2015; Gopal, 2016; Katharopoulos and Fleuret, 2018). However, our goals, algorithms and theoretical results are fundamentally different from those in the previous studies on importance sampling SGD. Indeed, all aforementioned studies aim to accelerate the minimization of the empirical average loss, whereas our proposed method turns out to minimize a new objective function by purposely constructing a biased gradient.
Our main contributions can be summarized as follows: i) we propose a computationally efficient and easily implementable algorithm, ordered SGD, with principled motivations (Section 3), ii) we show that ordered SGD minimizes an ordered empirical risk at a sublinear rate for convex and weakly convex (nonconvex) loss functions (Section 4), iii) we prove a generalization bound for ordered SGD (Section 5), and iv) our numerical experiments show that ordered SGD consistently improves on minibatch SGD in test error (Section 6).
2 Empirical Risk Minimization
Empirical risk minimization is one of the main tools for building a model in machine learning. Let ((x_1, y_1), …, (x_n, y_n)) be a training dataset of n samples, where x_i is the input vector and y_i is the target output vector for the i-th sample. The goal of empirical risk minimization is to find a prediction function h_θ by minimizing
	L(θ) := (1/n) Σ_{i=1}^n ℓ_i(θ) + r(θ),   (1)
where θ is the parameter vector of the prediction model, the function ℓ_i(θ) := ℓ(h_θ(x_i), y_i) is the loss of the i-th sample, and r(θ) is a regularizer. For example, in logistic regression, h_θ is a linear function of the input vector x_i, and ℓ is the logistic loss function with ℓ_i(θ) = log(1 + exp(−y_i h_θ(x_i))). For a neural network, h_θ(x_i) represents the pre-activation output of the last layer.
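As a concrete instance of objective (1), the following numpy sketch (our illustration; the logistic loss and ℓ2 regularizer follow the example in the text, but the function name, data, and hyperparameter values are our own choices) evaluates a regularized empirical risk:

```python
import numpy as np

def empirical_risk(theta, X, y, lam=0.01):
    """L(theta) = (1/n) * sum_i loss_i(theta) + r(theta), with the logistic
    loss loss_i(theta) = log(1 + exp(-y_i * <theta, x_i>)) for y_i in {-1, +1}
    and the L2 regularizer r(theta) = (lam/2) * ||theta||^2."""
    margins = y * (X @ theta)
    losses = np.log1p(np.exp(-margins))  # per-sample logistic loss
    return losses.mean() + 0.5 * lam * theta @ theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
theta_true = np.ones(5)
y = np.sign(X @ theta_true)
print(empirical_risk(np.zeros(5), X, y))  # log(2) at theta = 0 (reg. term is 0)
```

At θ = 0 every margin is zero, so each per-sample loss equals log 2 and the regularizer vanishes, which gives a quick correctness check.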
3 Algorithm
In this section, we introduce ordered SGD and provide an intuitive explanation of its advantage by looking at two-dimensional toy examples with linear classifiers and small artificial neural networks (ANNs). Let us first introduce a new notation as an extension of the standard notation argmax: {definition} Given a set of real numbers {a_1, …, a_n}, an index subset I ⊆ {1, …, n}, and a positive integer q ≤ |I|, we define argtop-q_{i∈I} a_i to be a set of indexes of the q largest values of {a_i : i ∈ I}.
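The argtop-q operation just defined can be computed without a full sort. A minimal sketch (ours, not the paper's code; `argtop_q` is a hypothetical helper name):

```python
import numpy as np

def argtop_q(values, q):
    """Return the set of indices of the q largest entries of `values`
    (ties broken arbitrarily). np.argpartition runs in O(len(values)) time,
    so we avoid a full O(n log n) sort."""
    values = np.asarray(values)
    idx = np.argpartition(-values, q - 1)[:q]
    return set(idx.tolist())

losses = np.array([0.2, 1.5, 0.7, 3.1, 0.1, 2.4])
print(argtop_q(losses, 3))  # {1, 3, 5}: the three largest losses 3.1, 2.4, 1.5
```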
Algorithm 1 describes the pseudocode of our proposed algorithm, ordered SGD. The procedure of ordered SGD follows that of minibatch SGD except for the following modification: after drawing a minibatch of size s, ordered SGD updates the parameter vector based on the (sub)gradient of the average loss over the top-q samples in the minibatch in terms of individual loss values (lines 4 and 5 of Algorithm 1). This modification purposely builds a biased gradient estimator that puts more weight on the samples with larger losses. As can be seen in Algorithm 1, ordered SGD is easily implementable, requiring a change of only a single line or a few lines on top of a minibatch SGD implementation.
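The update described above can be sketched in a few lines of numpy. This is our minimal illustration of the procedure, not the authors' implementation; the logistic-loss model, minibatch size s = 8, q = 2, and all other hyperparameter values are our own choices:

```python
import numpy as np

def ordered_sgd_step(theta, X, y, rng, s=8, q=2, lr=0.1, lam=0.01):
    """One ordered SGD step: draw a minibatch of size s, keep the q samples
    with the largest losses, and take a (sub)gradient step on their average
    loss plus an L2 regularizer."""
    n = X.shape[0]
    batch = rng.choice(n, size=s, replace=False)   # draw a minibatch
    margins = y[batch] * (X[batch] @ theta)
    losses = np.log1p(np.exp(-margins))            # per-sample logistic loss
    top = batch[np.argsort(-losses)[:q]]           # top-q samples by loss
    # gradient of the average loss over the top-q samples
    m = y[top] * (X[top] @ theta)
    coef = -y[top] / (1.0 + np.exp(m))             # d(logistic loss)/d(margin)
    grad = (coef[:, None] * X[top]).mean(axis=0) + lam * theta
    return theta - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)); w = np.ones(5); y = np.sign(X @ w)
theta = np.zeros(5)
for _ in range(500):
    theta = ordered_sgd_step(theta, X, y, rng)
acc = np.mean(np.sign(X @ theta) == y)
print(acc)  # training accuracy on this separable toy data
```

Replacing `top` with `batch` recovers plain minibatch SGD, which is the sense in which the change is "a single line or a few lines".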
Figure 1 illustrates the motivation of ordered SGD by looking at two-dimensional toy problems of binary classification. To avoid an extra degree of freedom due to the hyperparameter q, we employed a single fixed procedure to set q in the experiments for Figure 1 and the other experiments in Section 6, which is further explained in Section 6. The details of the experimental settings for Figure 1 are presented in Section 6 and in Appendix C.
It can be seen from Figure 1 that ordered SGD adapts better to imbalanced data distributions than minibatch SGD. It can better capture the information of the smaller subclusters that contribute less to the empirical average loss: e.g., the small subclusters in the middle of Figures 1(a) and 1(b), as well as the small inner ring structure in Figures 1(c) and 1(d) (the two inner rings contain only 40 data points while the two outer rings contain 960 data points). The smaller subclusters are informative for training a classifier when they are not outliers or byproducts of noise. A subcluster of data points is less likely to be an outlier as its size increases. The value of q in ordered SGD can control the size of the subclusters that a classifier should be sensitive to: with smaller q, the output model becomes more sensitive to smaller subclusters. In an extreme case with q = 1, ordered SGD minimizes the maximal loss (Shalev-Shwartz and Wexler, 2016), which is highly sensitive to every smallest subcluster of each single data point.
4 Optimization Theory
In this section, we answer the following three questions: (1) what objective function does ordered SGD solve as an optimization method, (2) what is the convergence rate of ordered SGD for minimizing the new objective function, and (3) what is the asymptotic structure of the new objective function.
Similarly to the notation of order statistics, we first introduce the notation of ordered indexes: given a model parameter θ, let ℓ_{(1)}(θ) ≥ ℓ_{(2)}(θ) ≥ … ≥ ℓ_{(n)}(θ) be the decreasing values of the individual losses ℓ_1(θ), …, ℓ_n(θ), where (j) ∈ {1, …, n} (for all j). That is, ((1), …, (n)), as a permutation of (1, …, n), defines the order of sample indexes by loss values. Throughout this paper, whenever we encounter ties in the values, we employ a tie-breaking rule to ensure the uniqueness of such an order.
	L_q(θ) := (1/q) Σ_{j=1}^n γ_j ℓ_{(j)}(θ) + r(θ),   (2)
where the parameter γ_j depends on the tuple (n, s, q), and is defined by
	γ_j := Σ_{l=0}^{q−1} C(j−1, l) C(n−j, s−1−l) / C(n, s),   (3)
with the convention C(a, b) = 0 for b < 0 or b > a; that is, γ_j is the probability that the sample with the j-th largest loss is drawn into the minibatch and ranked among its top-q losses.
Then, ordered SGD is a stochastic first-order method for minimizing L_q in the sense that the stochastic (sub)gradient used in ordered SGD is an unbiased estimator of a (sub)gradient of L_q.
Although the order of the individual losses changes with different θ, L_q is a well-defined function: for any given θ, the order of the individual losses is fixed and L_q(θ) has a unique value, which means that L_q is a function of θ.
All proofs in this paper are deferred to Appendix A. As we can see from Theorem 4, the objective function minimized by ordered SGD (i.e., L_q) depends on the hyperparameters of the algorithm through the values of γ. Therefore, it is of practical interest to obtain a deeper understanding of how the hyperparameters affect the objective function through γ. The next proposition presents the asymptotic value of γ (as n grows), which shows that a rescaled γ converges to the cumulative distribution function of a Beta distribution: {proposition} Consider the rescaled weights (n/s)γ_j indexed by the fraction t = j/n. In the limit, these rescaled weights converge pointwise to a limit function Γ(t) for t ∈ (0, 1).
Moreover, it holds that Γ is the cumulative distribution function of a Beta distribution whose parameters are determined by s and q.
To better illustrate the structure of γ in the non-asymptotic regime, Figure 2 plots the rescaled weights γ̂ and the limit Γ for different values of (n, s, q), where γ̂ is a rescaled version of γ (and the value of γ̂ between consecutive indexes is defined by linear interpolation for better visualization). As we can see from Figure 2, γ̂ monotonically decays. In each subfigure, with (s, q) fixed, the cliff gets smoother and γ̂ converges to Γ as n increases. Comparing Figures 2(a) and 2(b), we can see that as n, s and q all increase proportionally, the cliff gets steeper. Comparing Figures 2(b) and 2(c), we can see that with n and s fixed, the cliff shifts to the right as q increases.
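The weights γ_j can also be estimated empirically. The Monte Carlo sketch below is our own illustration (the function `gamma_mc` and its parameter values are not from the paper): it estimates, for each rank j, the probability that the j-th largest-loss sample lands in the minibatch and among its top-q losses, reproducing the decaying shape discussed above:

```python
import numpy as np

def gamma_mc(n, s, q, trials=20000, seed=0):
    """Monte Carlo estimate of gamma_j: the probability that the sample ranked
    j-th by loss (rank 0 = largest loss) is drawn into a size-s minibatch AND
    is among the minibatch's top-q losses."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n)
    for _ in range(trials):
        batch = rng.choice(n, size=s, replace=False)  # ranks 0..n-1
        batch.sort()                                  # smaller rank = larger loss
        counts[batch[:q]] += 1                        # top-q losses of the batch
    return counts / trials

g = gamma_mc(n=50, s=10, q=3)
print(g[:5], g[-5:])  # weights decay from front (large losses) to back
```

Two sanity checks follow from the construction: the weights sum to q (exactly q samples are selected per draw), and the estimated curve decays with the rank, matching the shape of γ̂ in Figure 2.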
As a direct extension of Theorem 4, we can now obtain the computational guarantees of ordered SGD for minimizing by taking advantage of the classic convergence results of SGD:
Let (θ^t)_{t≥0} be a sequence generated by ordered SGD (Algorithm 1). Suppose that ℓ_i is Lipschitz continuous for each i, and that r is Lipschitz continuous. Suppose that a minimizer θ* of L_q exists and L_q(θ*) is finite. Then, the following two statements hold:

(1) (Convex setting) If each ℓ_i and r are convex, then for any stepsize sequence (η_t), it holds that

(2) (Weakly convex setting) Suppose that each ℓ_i is ρ-weakly convex (i.e., ℓ_i(θ) + (ρ/2)‖θ‖² is convex) and r is convex. Recall the definition of the Moreau envelope: φ_λ(θ) := min_{θ'} { L_q(θ') + (1/(2λ))‖θ − θ'‖² }. Denote θ̄^T as a random variable taking values in {θ^0, …, θ^{T−1}} according to a probability distribution proportional to the stepsizes. Then for any constant λ ∈ (0, 1/ρ), it holds that
Theorem 4 shows that, in particular, if we choose stepsizes on the order of 1/√T, the optimality gap in statement (1) and the Moreau-envelope gradient norm in statement (2) decay at the rate of O(1/√T).
The Lipschitz continuity assumption in Theorem 4 is a standard assumption in the analysis of stochastic optimization algorithms. This assumption is generally satisfied by the logistic loss, the hinge loss and the Huber loss without any constraints on θ, and by the square loss when one can presume that θ stays in a compact space (which is typically the case of interest in practice). For the weakly convex setting, the Moreau-envelope gradient norm (which appears in Theorem 4 (2)) is a natural measure of near-stationarity for a nondifferentiable weakly convex function (Davis and Drusvyatskiy, 2018). Weak convexity (also known as negative strong convexity or almost convexity) is a standard assumption for analyzing nonconvex optimization problems in the optimization literature (Davis and Drusvyatskiy, 2018; Allen-Zhu, 2017). With a standard loss criterion such as the logistic loss, the individual objective ℓ_i with a neural network using sigmoid or tanh activation functions is weakly convex (a neural network with ReLU activations is not weakly convex and falls outside our setting).
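The Moreau envelope used above can be illustrated numerically. The sketch below is our own illustration (not from the paper): it approximates φ_λ(x) = min_y { f(y) + (1/(2λ))(x − y)² } by grid search for f = |·|, whose envelope with λ = 1 is known in closed form (the Huber function):

```python
import numpy as np

def moreau_envelope(f, x, lam=1.0, grid=None):
    """phi_lam(x) = min_y { f(y) + (x - y)^2 / (2*lam) }, approximated by
    minimizing over a fine grid around x."""
    if grid is None:
        grid = np.linspace(x - 10, x + 10, 200001)
    return np.min(f(grid) + (grid - x) ** 2 / (2 * lam))

# For f = |.| and lam = 1, the Moreau envelope is the Huber function:
# phi(x) = x^2/2 if |x| <= 1, else |x| - 1/2.
for x in [0.3, 2.0]:
    exact = x**2 / 2 if abs(x) <= 1 else abs(x) - 0.5
    print(x, moreau_envelope(np.abs, x), exact)
```

The envelope is differentiable even though f is not, which is why its gradient norm serves as a stationarity measure for nonsmooth weakly convex objectives.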
5 Generalization Bound
This section presents the generalization theory for ordered SGD. To make the dependence on a training dataset S explicit, we write L_q(θ; S) and ℓ_{(j)}(θ; S) for L_q(θ) and ℓ_{(j)}(θ), where (j) defines the order of sample indexes by loss value, as stated in Section 4. Denote the weight vector γ = (γ_1, …, γ_n), where γ depends on (n, s, q). Given an arbitrary set Θ of parameter vectors, we define the (standard) Rademacher complexity of the associated set of loss functions:
where σ_1, …, σ_n are independent uniform random variables taking values in {−1, +1} (i.e., Rademacher variables). Given a tuple (ℓ, Θ), define M as the least upper bound on the difference of individual loss values: |ℓ_i(θ) − ℓ_j(θ)| ≤ M for all θ ∈ Θ and all pairs (i, j). For example, M = 1 if ℓ is the 0-1 loss function. Theorem 5 presents a generalization bound for ordered SGD:
Let Θ be a fixed subset of the parameter space. Then, for any δ > 0, with probability at least 1 − δ over an i.i.d. draw of n examples S = ((x_i, y_i))_{i=1}^n, the following holds for all θ ∈ Θ:
(4)  
where
The expected error on the left-hand side of Equation (4) is a standard objective for generalization, whereas the right-hand side is an upper bound that depends on the algorithm parameters s and q. Let us first look at the asymptotic case as n → ∞. Let Θ be constrained such that the Rademacher complexity term vanishes as n → ∞, which has been shown to be satisfied for various models and sets Θ (Bartlett and Mendelson, 2002; Mohri et al., 2012; Bartlett et al., 2017; Kawaguchi et al., 2017). With M being bounded, the third term on the right-hand side of Equation (4) then disappears as n → ∞. Thus, it holds with high probability that the expected error is bounded by the ordered empirical loss L_q(θ; S) up to vanishing terms, where L_q is minimized by ordered SGD as shown in Theorem 4. From this viewpoint, ordered SGD minimizes the expected error for generalization as n → ∞.
A special case of Theorem 5 recovers the standard generalization bound for the empirical average loss (e.g., Mohri et al., 2012). That is, if q = s, ordered SGD becomes the standard minibatch SGD and Equation (4) becomes
(5) 
which is the standard generalization bound (e.g., Mohri et al., 2012). This is because if q = s, then γ_j = s/n for all j, and hence L_q(θ; S) reduces to the empirical average loss.
For the purpose of a simple comparison of ordered SGD and (minibatch) SGD, consider the case where we fix a single subset Θ. Let θ_OSGD and θ_SGD be the parameter vectors obtained by ordered SGD and (minibatch) SGD, respectively, as the results of training. Then, with M being bounded, the upper bound on the expected error for ordered SGD (the right-hand side of Equation 4) is (strictly) less than that for (minibatch) SGD (the right-hand side of Equation 5) whenever the ordered empirical loss of θ_OSGD is (strictly) less than the empirical average loss of θ_SGD.
6 Experiments
Data Aug | Dataset | Model | minibatch SGD | OSGD | Improve (%)
No | Semeion | Logistic model | 10.76 (0.35) | 9.31 (0.42) | 13.48
No | MNIST | Logistic model | 7.70 (0.06) | 7.35 (0.04) | 4.55
No | Semeion | SVM | 11.05 (0.72) | 10.25 (0.51) | 7.18
No | MNIST | SVM | 8.04 (0.05) | 7.66 (0.07) | 4.60
No | Semeion | LeNet | 8.06 (0.61) | 6.09 (0.55) | 24.48
No | MNIST | LeNet | 0.65 (0.04) | 0.57 (0.06) | 11.56
No | KMNIST | LeNet | 3.74 (0.08) | 3.09 (0.14) | 17.49
No | Fashion-MNIST | LeNet | 8.07 (0.16) | 8.03 (0.26) | 0.57
No | CIFAR-10 | PreActResNet18 | 13.75 (0.22) | 12.87 (0.32) | 6.41
No | CIFAR-100 | PreActResNet18 | 41.80 (0.40) | 41.32 (0.43) | 1.17
No | SVHN | PreActResNet18 | 4.66 (0.10) | 4.39 (0.11) | 5.95
Yes | Semeion | LeNet | 7.47 (1.03) | 5.06 (0.69) | 32.28
Yes | MNIST | LeNet | 0.43 (0.03) | 0.39 (0.03) | 9.84
Yes | KMNIST | LeNet | 2.59 (0.09) | 2.01 (0.13) | 22.33
Yes | Fashion-MNIST | LeNet | 7.45 (0.07) | 6.49 (0.19) | 12.93
Yes | CIFAR-10 | PreActResNet18 | 8.08 (0.17) | 7.04 (0.12) | 12.81
Yes | CIFAR-100 | PreActResNet18 | 29.95 (0.31) | 28.31 (0.41) | 5.49
Yes | SVHN | PreActResNet18 | 4.45 (0.07) | 4.00 (0.08) | 10.08
In this section, we empirically evaluate ordered SGD with various datasets, models and settings. To avoid an extra degree of freedom due to the hyperparameter q, we introduce a single fixed setup of adaptive values of q as the default setting: q is set to the minibatch size s at the beginning of training and is then decreased in stages as the training accuracy passes fixed thresholds. The value of q was automatically updated at the end of each epoch based on this simple rule. This rule was derived from the intuition that in the early stage of training all samples are informative for building a rough model, while the samples around the boundary (with larger losses) are more helpful for building the final classifier in the later stage. In the figures and tables of this section, we refer to ordered SGD with this rule as 'OSGD', and to ordered SGD with a fixed value q as 'OSGD: q'.
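A staged rule of this kind can be sketched as follows. Note that the specific accuracy thresholds and divisors below are illustrative placeholders of our own, NOT the paper's exact rule, which is stated in the text above:

```python
def adaptive_q(train_acc, s):
    """A hypothetical staged schedule for q: start with q = s, then shrink q
    as training accuracy improves, so that later updates focus on the samples
    with the largest losses. Thresholds/divisors here are illustrative only."""
    if train_acc < 0.80:
        return s                # early stage: use the whole minibatch
    elif train_acc < 0.90:
        return max(1, s // 2)
    elif train_acc < 0.95:
        return max(1, s // 4)
    else:
        return max(1, s // 8)   # late stage: focus on the hardest samples

print([adaptive_q(a, 64) for a in (0.5, 0.85, 0.92, 0.99)])  # [64, 32, 16, 8]
```

In this pattern the update would be applied once per epoch, matching the rule described above.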
Experiment with fixed hyperparameters. For this experiment, we fixed all hyperparameters a priori across all different datasets and models by using a standard hyperparameter setting of minibatch SGD, instead of aiming for state-of-the-art test errors for each dataset with a possible issue of overfitting to test and validation datasets (Dwork et al., 2015; Rao et al., 2008). We fixed the minibatch size to 64, along with fixed values of the weight decay rate, the initial learning rate, and the momentum coefficient. See Appendix C for more details of the experimental settings. The code to reproduce all the results is publicly available at: [the link is hidden for anonymous submission].
Table 1 compares the test performance of ordered SGD and minibatch SGD for different models and datasets, and consistently shows that ordered SGD improved on minibatch SGD in test error. The table reports the mean and the standard deviation of test errors (i.e., 100 × the average of 0-1 losses on the test dataset) over repeated runs with different random seeds. The table also summarizes the relative improvement of ordered SGD over minibatch SGD, defined as 100 × ((mean test error of minibatch SGD) − (mean test error of ordered SGD)) / (mean test error of minibatch SGD). Logistic model refers to the linear multinomial logistic regression model, SVM refers to the linear multiclass support vector machine, LeNet refers to a standard variant of LeNet (LeCun et al., 1998) with ReLU activations, and PreActResNet18 refers to the pre-activation ResNet with 18 layers (He et al., 2016).
Figure 3 shows the test error and the average training loss of minibatch SGD and ordered SGD versus the number of epochs. As shown in the figure, ordered SGD with a fixed value of q also outperformed minibatch SGD in general. In the figures, the reported training losses refer to the standard empirical average loss measured at the end of each epoch. When compared to minibatch SGD, ordered SGD had lower test errors while having higher training losses in Figures 3(a), 3(d) and 3(g), because ordered SGD optimizes the ordered empirical loss instead. This is consistent with our motivation and the theory of ordered SGD in Sections 3, 4 and 5. Qualitatively similar behaviors were also observed in all of the 18 settings, as shown in Appendix C.
Datasets | minibatch SGD | OSGD
MNIST | 14.44 (0.54) | 14.77 (0.41)
KMNIST | 12.17 (0.33) | 11.42 (0.29)
CIFAR-10 | 48.18 (0.58) | 46.40 (0.97)
CIFAR-100 | 47.37 (0.84) | 44.74 (0.91)
SVHN | 72.29 (1.23) | 67.95 (1.54)
Moreover, ordered SGD is a computationally efficient algorithm. Table 2 shows the wall-clock time in illustrative experiments, whereas Table 4 in Appendix C summarizes the wall-clock time in all experiments. The wall-clock time of ordered SGD measures the time spent by all computations of ordered SGD, including the extra computation of finding the top-q samples in a minibatch (line 4 of Algorithm 1). This extra computation is generally negligible and can be completed in O(s) or O(s log s) time by using a selection or sorting algorithm. The ordered SGD algorithm can be faster than minibatch SGD because ordered SGD only computes the (sub)gradients of the top-q samples (line 5 of Algorithm 1). As shown in Tables 2 and 4, ordered SGD was faster than minibatch SGD for all larger models with PreActResNet18. This is because the computational reduction in backpropagation in ordered SGD can dominate the small extra cost of finding the top-q samples in larger problems.
Experiment with different q values. Figure 4 shows the effect of different fixed q values for CIFAR-10 with PreActResNet18. Ordered SGD improved the test errors of minibatch SGD across the different fixed q values. We also report the same observation with different datasets and models in Appendix C.
Experiment with different learning rates and minibatch sizes. Figures 5 and 6 in Appendix C consistently show the improvement of ordered SGD over minibatch SGD with different learning rates and minibatch sizes.
Experiment with the best learning rate, mixup, and random erasing. Table 3 summarizes the experimental results with the data augmentation methods of random erasing (RE) (Zhong et al., 2017) and mixup (Zhang et al., 2017; Verma et al., 2019) on the CIFAR-10 dataset. For this experiment, we purposefully adopted a setting that favors minibatch SGD: for both minibatch SGD and ordered SGD, we used hyperparameters tuned for minibatch SGD. For RE and mixup, we used the same tuned hyperparameter settings (including learning rates) and the codes as those in the previous studies that used minibatch SGD (Zhong et al., 2017; Verma et al., 2019) (with WRN-28-10 for RE and with PreActResNet18 for mixup). For standard data augmentation, we first searched for the best learning rate of minibatch SGD based on the test error (purposefully overfitting to the test dataset for minibatch SGD) by using a grid search over learning rates of 1.0, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005 and 0.0001. Then, we used the best learning rate of minibatch SGD for ordered SGD (instead of using the best learning rate of ordered SGD for ordered SGD). As shown in Table 3, ordered SGD with hyperparameters tuned for minibatch SGD still outperformed fine-tuned minibatch SGD under the different data augmentation methods.
Data Aug | minibatch SGD | OSGD | Improve (%)
Standard | 6.94 | 6.46 | 6.92
RE | 3.24 | 3.06 | 5.56
Mixup | 3.31 | 3.05 | 7.85
7 Related work and extension
Although there is no direct predecessor of our work, the following fields are related to this paper.
Other minibatch stochastic methods. The proposed sampling strategy and our theoretical analyses are generic and can be extended to other (minibatch) stochastic methods, including Adam (Kingma and Ba, 2014), stochastic mirror descent (Beck and Teboulle, 2003; Nedic and Lee, 2014; Lu, 2017; Lu et al., 2018; Zhang and He, 2018), and proximal stochastic subgradient methods (Davis and Drusvyatskiy, 2018). Thus, our results open up a research direction for further studying the proposed stochastic optimization framework with different base algorithms such as Adam and AdaGrad. To illustrate this, we present ordered Adam and report the numerical results in Appendix C.
Importance Sampling SGD. Stochastic gradient descent with importance sampling has been an active research area for the past several years (Needell et al., 2014; Zhao and Zhang, 2015; Alain et al., 2015; Loshchilov and Hutter, 2015; Gopal, 2016; Katharopoulos and Fleuret, 2018). In the convex setting, Zhao and Zhang (2015) and Needell et al. (2014) show that the optimal sampling distribution for minimizing the empirical average loss is proportional to the per-sample gradient norm. However, maintaining the gradient norms of individual samples can be computationally expensive when the dataset size or the parameter vector size is large, in particular for many applications of deep learning. These importance sampling methods are inherently different from ordered SGD in that importance sampling is used to reduce the number of iterations for minimizing the empirical average loss, whereas ordered SGD is designed to learn a different type of model by minimizing the new objective function L_q.
Average Top-k Loss. The average top-k loss was introduced by Fan et al. (2017) as an alternative to the empirical average loss. The ordered loss function L_q differs from the average top-k loss, as shown in Section 4. Furthermore, our proposed framework is fundamentally different from that of the average top-k loss. First, the algorithms are different: the stochastic method proposed by Fan et al. (2017) utilizes duality of the objective and is unusable for deep neural networks (and other nonconvex problems), while our proposed method is a modification of minibatch SGD that is usable for deep neural networks (and other nonconvex problems) and scales well to large problems. Second, the optimization results are different; in particular, the objective functions are different, and we provide a convergence analysis for weakly convex (nonconvex) functions. Finally, the focus of the generalization analysis is different: Fan et al. (2017) focus on calibration for the binary classification problem, while we focus on a generalization bound that applies to general classification and regression problems.
Random-then-Greedy Procedure. Ordered SGD randomly picks a subset of samples and then greedily utilizes a part of that subset, which is related to the random-then-greedy procedure recently proposed in a different context: the greedy weak learner for gradient boosting (Lu and Mazumder, 2018).
8 Conclusion
We have presented an efficient stochastic first-order method, ordered SGD, for learning an effective predictor in machine learning problems. We have shown that ordered SGD minimizes a new ordered empirical loss L_q, based on which we have developed the optimization and generalization properties of ordered SGD. The numerical experiments confirmed the effectiveness of the proposed algorithm.
Appendix
Appendix A Proofs
In Appendix A, we provide complete proofs of the theoretical results.
a.1 Proof of Theorem 4
Proof.
We just need to show that the stochastic (sub)gradient used in ordered SGD is an unbiased estimator of a subgradient of L_q at θ.
At first, it holds that
where g_i is a subgradient of ℓ_i at θ. In the above equality chain, the third equality is simply the definition of expectation, and the last equality holds because ((1), …, (n)) is a permutation of (1, …, n).
For any given index j, define p_j as the probability that the j-th ordered sample is selected by ordered SGD (i.e., it is drawn in the minibatch and ranked among its top-q losses); then
(6) 
Notice that the minibatch is chosen uniformly at random from the sample index set {1, …, n} without replacement. There are in total C(n, s) different sets of size s. Among them, there are C(n−1, s−1) different sets that contain the index j; thus
(7) 
Given the condition that the minibatch contains the index j, the event that it contains l items from {1, …, j−1} means that it contains s−1−l items from {j+1, …, n}; thus there are C(j−1, l) C(n−j, s−1−l) such possible sets, whereby it holds that
(8) 
Substituting Equations (7) and (8) into Equation (6), we arrive at
Therefore,
where the last inequality is due to the additivity of subgradients (for both convex and weakly convex functions). ∎
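The counting argument in this proof can be verified by brute force for small (n, s). The check below is our own illustration: it enumerates all minibatches of size s from {1, …, n} and compares the counts against the binomial coefficients used above:

```python
from itertools import combinations
from math import comb

# Among all C(n, s) minibatches of size s drawn from {1, ..., n}, exactly
# C(n-1, s-1) contain a fixed index j, and among those, C(j-1, l)*C(n-j, s-1-l)
# contain exactly l indexes smaller than j (i.e., l samples ranked above j).
n, s, j, l = 7, 3, 4, 1
batches = list(combinations(range(1, n + 1), s))
with_j = [B for B in batches if j in B]
with_j_and_l = [B for B in with_j if sum(1 for i in B if i < j) == l]

print(len(batches), len(with_j), len(with_j_and_l))
print(comb(n, s), comb(n - 1, s - 1), comb(j - 1, l) * comb(n - j, s - 1 - l))
```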
a.2 Proof of Proposition 4
We just need to show that
(9) 
then we finish the proof by a change of variables.
First, Stirling's approximation yields that, when the arguments of the factorials are sufficiently large, it holds that
(10) 
Thus,
(11) 
where the first equality utilizes Equation (10) and the fact that the lower-order factors are negligible in the limit (apart from the exponential terms).
On the other hand, it holds by rearranging the factorial numbers that
(12) 
By noticing , it holds that
In other words, Γ is the cumulative distribution function of a Beta distribution in the limit. ∎
a.3 Proof of Theorem 4
Proof.
Notice that the update direction used by ordered SGD is a subgradient of the average of the top-q sampled losses plus the regularizer. Suppose that it decomposes as the sum of a subgradient of the loss term and a subgradient of r. Then
(13) 
Meanwhile, it follows from Theorem 4 that the stochastic (sub)gradient is an unbiased estimator of a subgradient of L_q. Together with Equation (13), we obtain statement (1) by the analysis of convex stochastic subgradient descent in Boyd and Mutapcic (2008).
Furthermore, suppose that each ℓ_i is weakly convex (so that its quadratically shifted version is convex); then the corresponding shifted version of L_q is also convex, whereby L_q is weakly convex. We obtain statement (2) by substituting into Theorem 2.1 of Davis and Drusvyatskiy (2018). ∎
a.4 Proof of Theorem 5
Before proving Theorem 5, we first show the following proposition, which gives an upper bound on γ_j: {proposition} For any j ∈ {1, …, n}, γ_j ≤ s/n.
Proof.
The value of γ_j is equal to the probability of ordered SGD choosing the j-th sample in the ordered sequence, which is at most the probability of minibatch SGD choosing the j-th sample. The probability of minibatch SGD choosing the j-th sample is s/n. ∎
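The identity behind the last step, C(n−1, s−1)/C(n, s) = s/n, can be checked exactly with rational arithmetic (our own sanity check, not part of the paper's proof):

```python
from fractions import Fraction
from math import comb

# Probability that minibatch SGD includes a fixed sample:
# C(n-1, s-1) / C(n, s) = s / n, verified exactly for a few (n, s) pairs.
for n, s in [(10, 3), (100, 64), (1000, 7)]:
    p = Fraction(comb(n - 1, s - 1), comb(n, s))
    assert p == Fraction(s, n)
    print(n, s, p)
```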
We are now ready to prove Theorem 5 by finding an upper bound on based on McDiarmid’s inequality.
Proof of Theorem 5. In this proof, our objective is to provide an upper bound on the generalization gap, viewed as a function φ(S) of the training dataset, by using McDiarmid's inequality. To apply McDiarmid's inequality, we first show that φ satisfies its bounded-differences condition. Let S and S' be two datasets differing by exactly one point at an arbitrary index; i.e., the samples coincide at all other indexes. Then, we provide an upper bound on the difference φ(S) − φ(S') as follows:
where the first line follows from the property of the supremum, sup(f) − sup(g) ≤ sup(f − g), the second line follows from the definition of the ordered loss, and the last line follows from Proposition A.4 (γ_j ≤ s/n).
We now bound the last term. This requires a careful examination because the ordered losses may differ between S and S' at more than one index (although S and S' differ by exactly one point); that is, it is possible to have ℓ_{(j)}(θ; S) ≠ ℓ_{(j)}(θ; S') for many indexes j. To analyze this effect, we now conduct a case analysis. Define the positions of the differing sample within the two ordered sequences accordingly.
Consider the case where . Let and . Then,
where the first line uses the fact that the two datasets differ in only one sample. The second line follows from the equality of the remaining terms in this case. The third line follows from the definition of the ordering of the indexes. The fourth line follows from the cancellation of terms in the third line.
Consider the case where