Abstract

We propose a new stochastic optimization framework for empirical risk minimization problems such as those that arise in machine learning. Traditional approaches, such as (mini-batch) stochastic gradient descent (SGD), utilize an unbiased gradient estimator of the empirical average loss. In contrast, we develop a computationally efficient method to construct a gradient estimator that is purposely biased toward those observations with higher current losses. On the theory side, we show that the proposed method minimizes a new ordered modification of the empirical average loss, and is guaranteed to converge at a sublinear rate to a global optimum for convex losses and to a critical point for weakly convex (non-convex) losses. Furthermore, we prove a new generalization bound for the proposed algorithm. On the empirical side, numerical experiments show that our proposed method consistently improves the test errors compared with the standard mini-batch SGD across various models, including SVMs, logistic regression, and deep neural networks.

1 Introduction

Stochastic Gradient Descent (SGD), as the workhorse training algorithm for most machine learning applications including deep learning, has been extensively studied in recent years (e.g., see the recent review by Bottou et al. 2018). At every step, SGD draws one training sample uniformly at random from the training dataset, and then uses the (sub-)gradient of the loss over the selected sample to update the model parameters. The most popular version of SGD in practice is perhaps mini-batch SGD (Bottou et al., 2018; Dean et al., 2012), which is widely implemented in state-of-the-art deep learning frameworks such as TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2017) and CNTK (Seide and Agarwal, 2016). Instead of choosing one sample per iteration, mini-batch SGD randomly selects a mini-batch of samples and uses the (sub-)gradient of the average loss over the selected samples to update the model parameters.

Both SGD and mini-batch SGD utilize uniform sampling during the entire learning process, so that the stochastic gradient is always an unbiased gradient estimator of the empirical average loss over all samples. On the other hand, it appears to practitioners that not all samples are equally important, and indeed most of them could be ignored after a few epochs of training without affecting the final model (Katharopoulos and Fleuret, 2018). For example, intuitively, the samples near the final decision boundary should be more important for building the model than those far away from the boundary in classification problems. In particular, as we will illustrate later in Figure 1, there are cases where those far-away samples may corrupt the model when the average loss is used. In order to further explore such structures, we propose an efficient sampling scheme on top of mini-batch SGD. We call the resulting algorithm ordered SGD; it is used to learn a different type of model with the goal of improving test performance.

The above motivation of ordered SGD is related to that of importance sampling SGD, which has been extensively studied recently in order to improve the convergence speed of SGD (Needell et al., 2014; Zhao and Zhang, 2015; Alain et al., 2015; Loshchilov and Hutter, 2015; Gopal, 2016; Katharopoulos and Fleuret, 2018). However, our goals, algorithms, and theoretical results are fundamentally different from those in the previous studies on importance sampling SGD. Indeed, all aforementioned studies aim to accelerate the minimization of the empirical average loss, whereas our proposed method turns out to minimize a new objective function by purposely constructing a biased gradient.

Our main contributions can be summarized as follows: i) we propose a computationally efficient and easily implementable algorithm, ordered SGD, with principled motivations (Section 3), ii) we show that ordered SGD minimizes an ordered empirical risk at a sublinear rate for convex and weakly convex (non-convex) loss functions (Section 4), iii) we prove a generalization bound for ordered SGD (Section 5), and iv) our numerical experiments show that ordered SGD consistently improves over mini-batch SGD in test error (Section 6).

2 Empirical Risk Minimization

Empirical risk minimization is one of the main tools for building a model in machine learning. Let $\{(x_i, y_i)\}_{i=1}^{n}$ be a training dataset of $n$ samples, where $x_i$ is the input vector and $y_i$ is the target output vector of the $i$-th sample. The goal of empirical risk minimization is to find a prediction function $f(\cdot\,;\theta)$ by minimizing

$$L(\theta) \;:=\; \frac{1}{n}\sum_{i=1}^{n} \ell_i(\theta) \;+\; r(\theta), \qquad\qquad (1)$$

where $\theta$ is the parameter vector of the prediction model, $\ell_i(\theta) := \ell(f(x_i;\theta), y_i)$ is the loss of the $i$-th sample with loss criterion $\ell$, and $r(\theta)$ is a regularizer. For example, in logistic regression, $f(x;\theta)$ is a linear function of the input vector $x$, and $\ell$ is the logistic loss function. For a neural network, $f(x;\theta)$ represents the pre-activation output of the last layer.
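To make the objective in Equation (1) concrete, the following minimal NumPy sketch evaluates a regularized logistic-regression instance of $L(\theta)$; the variable names and the choice of an L2 regularizer are illustrative assumptions rather than notation fixed by the paper.

```python
import numpy as np

def empirical_risk(theta, X, y, lam=1e-4):
    """Regularized empirical average loss L(theta) for logistic regression.

    X: (n, d) inputs; y: (n,) labels in {-1, +1}; lam: L2 regularization weight (assumed).
    """
    margins = y * (X @ theta)               # f(x_i; theta) = theta^T x_i
    losses = np.log1p(np.exp(-margins))     # logistic loss of each sample
    return losses.mean() + lam * np.dot(theta, theta)

# Example usage on random data:
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.choice([-1.0, 1.0], size=100)
print(empirical_risk(np.zeros(5), X, y))    # equals log(2) at theta = 0
```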

3 Algorithm

In this section, we introduce ordered SGD and provide an intuitive explanation of its advantage by looking at two-dimensional toy examples with linear classifiers and small artificial neural networks (ANNs). Let us first introduce a new notation as an extension of standard notation:

Definition. Given a set of real numbers $\{a_i\}_{i=1}^{n}$, an index subset $S \subseteq \{1, \dots, n\}$, and a positive integer $q \le |S|$, we define $\mathrm{top}_q\{a_i\}_{i \in S}$ to be a set of $q$ indexes of the $q$ largest values among $\{a_i\}_{i \in S}$; i.e., $j \in \mathrm{top}_q\{a_i\}_{i \in S}$ and $j' \in S \setminus \mathrm{top}_q\{a_i\}_{i \in S}$ imply $a_j \ge a_{j'}$.
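As a quick illustration of this notation, the sketch below returns the indexes of the $q$ largest values with an average-case $O(|S|)$ selection routine; the helper name top_q_indexes is ours, not the paper's.

```python
import numpy as np

def top_q_indexes(values, q):
    """Return (unsorted) indexes of the q largest entries of `values`."""
    values = np.asarray(values)
    return np.argpartition(values, -q)[-q:]

# Example: losses of a mini-batch of s = 5 samples, with q = 2.
print(sorted(top_q_indexes([0.3, 2.1, 0.7, 1.5, 0.2], q=2)))  # -> [1, 3]
```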

1:  Inputs: an initial vector $\theta^{(0)}$ and a learning rate sequence $\{\eta_t\}$
2:  for $t = 0, 1, 2, \dots$ do
3:     Randomly choose a mini-batch of $s$ samples: an index set $S \subseteq \{1, \dots, n\}$ with $|S| = s$.
4:     Find the set of top-$q$ samples in $S$ in terms of loss values: $Q = \mathrm{top}_q\{\ell_i(\theta^{(t)})\}_{i \in S}$.
5:     Compute a subgradient of the average loss over the top-$q$ samples: $g^{(t)} = \frac{1}{q}\sum_{i \in Q} g_i + g_r$, where $g_i \in \partial \ell_i(\theta^{(t)})$, $g_r \in \partial r(\theta^{(t)})$, and $\partial$ denotes the set of sub-gradients of a function.
6:     Update parameters: $\theta^{(t+1)} = \theta^{(t)} - \eta_t\, g^{(t)}$.
Algorithm 1: Ordered Stochastic Gradient Descent (ordered SGD)

Algorithm 1 describes the pseudocode of our proposed algorithm, ordered SGD. The procedure of ordered SGD follows that of mini-batch SGD except for the following modification: after drawing a mini-batch of size $s$, ordered SGD updates the parameter vector based on the (sub-)gradient of the average loss over the top-$q$ samples of the mini-batch in terms of individual loss values (lines 4 and 5 of Algorithm 1). This modification purposely builds a biased gradient estimator that puts more weight on the samples with larger losses. As can be seen in Algorithm 1, ordered SGD is easily implementable, requiring only a single line or a few lines of change on top of a mini-batch SGD implementation.
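As a concrete illustration, the following minimal PyTorch-style sketch performs one ordered SGD update (lines 3-6 of Algorithm 1) given a mini-batch; the helper name, the use of a per-sample loss function built with reduction='none', and the handling of the regularizer through the optimizer's weight decay are our assumptions, not code from the paper.

```python
import torch

def ordered_sgd_step(model, optimizer, loss_fn, x_batch, y_batch, q):
    """One ordered SGD update: back-propagate only the q largest per-sample losses.

    `loss_fn` must return per-sample losses (e.g., built with reduction='none').
    """
    optimizer.zero_grad()
    per_sample_losses = loss_fn(model(x_batch), y_batch)   # shape: (s,)
    top_q_losses, _ = torch.topk(per_sample_losses, k=q)   # line 4 of Algorithm 1
    loss = top_q_losses.mean()                             # average over the top-q samples
    loss.backward()                                        # line 5: sub-gradient via autograd
    optimizer.step()                                       # line 6: parameter update
    return loss.item()
```

In a full training loop, this step would be paired with a base optimizer such as torch.optim.SGD(model.parameters(), lr=..., momentum=..., weight_decay=...), so that the regularizer is handled through weight decay.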

Figure 1: Decision boundaries of mini-batch SGD predictors (top row) and ordered SGD predictors (bottom row) with 2D synthetic datasets for binary classification; panels (a) and (b) use linear classifiers, panel (c) uses a small ANN, and panel (d) uses a tiny ANN. In these examples, the ordered SGD predictors correctly classify more data points than the mini-batch SGD predictors, because an ordered SGD predictor can focus on a smaller yet informative subset of data points, instead of on the average loss dominated by a larger subset of data points.

Figure 1 illustrates the motivation of ordered SGD by looking at two-dimensional toy problems of binary classification. To avoid an extra degree of freedom due to the hyper-parameter $q$, we employed a single fixed procedure to set $q$ in the experiments for Figure 1 and the other experiments in Section 6; this procedure is further explained in Section 6. The details of the experimental settings for Figure 1 are presented in Section 6 and in Appendix C.

It can be seen from Figure 1 that ordered SGD adapts better to imbalanced data distributions than mini-batch SGD. It can better capture the information of the smaller sub-clusters that contribute less to the empirical average loss: e.g., the small sub-clusters in the middle of Figures 1(a) and 1(b), as well as the small inner ring structure in Figures 1(c) and 1(d) (the two inner rings contain only 40 data points while the two outer rings contain 960 data points). The smaller sub-clusters are informative for training a classifier when they are not outliers or by-products of noise, and a sub-cluster of data points is less likely to be an outlier as its size increases. The value of $q$ in ordered SGD can control the size of the sub-clusters that a classifier should be sensitive to: with smaller $q$, the output model becomes more sensitive to smaller sub-clusters. In the extreme case with $q = 1$ and $s = n$, ordered SGD minimizes the maximal loss (Shalev-Shwartz and Wexler, 2016), which is highly sensitive to the smallest possible sub-clusters, namely single data points.
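To make the role of $q$ concrete, the toy computation below (our illustration, not the paper's code) shows how the per-batch quantity that ordered SGD descends interpolates between the average loss ($q = s$) and the maximal loss ($q = 1$).

```python
import numpy as np

def batch_objective(losses, q):
    """Average of the q largest losses in a mini-batch."""
    return np.sort(losses)[-q:].mean()

losses = np.array([0.1, 0.2, 0.3, 5.0])        # one sample with a much larger loss
print(batch_objective(losses, q=len(losses)))  # q = s: ordinary average loss -> 1.4
print(batch_objective(losses, q=1))            # q = 1: maximal loss          -> 5.0
```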

4 Optimization Theory

In this section, we answer the following three questions: (1) what objective function does ordered SGD solve as an optimization method, (2) what is the convergence rate of ordered SGD for minimizing the new objective function, and (3) what is the asymptotic structure of the new objective function.

Similarly to the notation of order statistics, we first introduce the notation of ordered indexes: given a model parameter $\theta$, let $\ell_{(1)}(\theta) \ge \ell_{(2)}(\theta) \ge \cdots \ge \ell_{(n)}(\theta)$ be the decreasing values of the individual losses $\ell_1(\theta), \dots, \ell_n(\theta)$, where $(j)$ denotes the index of the $j$-th largest loss (for all $j \in \{1, \dots, n\}$). That is, $(\cdot)$ as a permutation of $\{1, \dots, n\}$ defines the order of sample indexes by loss values. Throughout this paper, whenever we encounter ties on the values, we employ a tie-breaking rule in order to ensure the uniqueness of such an order. Theorem 4 shows that ordered SGD is a stochastic first-order method for minimizing a new ordered empirical loss $L_q$.

Theorem. Consider the following objective function:

$$L_q(\theta) \;:=\; \frac{1}{n}\sum_{j=1}^{n} \gamma_j\, \ell_{(j)}(\theta) \;+\; r(\theta), \qquad\qquad (2)$$

where the parameter $\gamma_j \ge 0$ depends on the tuple $(n, s, q)$ and is defined by

(3)

Then, ordered SGD is a stochastic first-order method for minimizing $L_q$ in the sense that the update direction $g^{(t)}$ used in ordered SGD is an unbiased estimator of a (sub-)gradient of $L_q$ at $\theta^{(t)}$.

Figure 2: The rescaled weights $\tilde\gamma$ and their asymptotic limit for different values of $(n, s, q)$, where $\tilde\gamma$ is a rescaled version of $\gamma$; panels (a)-(c) correspond to three different settings of $(n, s, q)$.

Although the order of the individual losses changes with $\theta$, $L_q$ is a well-defined function: for any given $\theta$, the order of the individual losses is fixed and $L_q(\theta)$ takes a unique value, which means that $L_q$ is a function of $\theta$.

All proofs in this paper are deferred to Appendix A. As we can see from Theorem 4, the objective function minimized by ordered SGD (i.e., $L_q$) depends on the hyper-parameters of the algorithm through the values of $\gamma_j$. Therefore, it is of practical interest to obtain a deeper understanding of how the hyper-parameters affect the objective function through $\gamma_j$. The next proposition presents the asymptotic value of $\gamma_j$ (as $n \to \infty$), which shows that a rescaled version of $\gamma$ converges to the cumulative distribution function of a Beta distribution:

Proposition. It holds that

Moreover, it holds that is the cumulative distribution function of .

To better illustrate the structure of $\gamma$ in the non-asymptotic regime, Figure 2 plots the rescaled weights $\tilde\gamma$ for different values of $(n, s, q)$, where $\tilde\gamma$ is a rescaled version of $\gamma$ (with linear interpolation between consecutive points for better visualization). As we can see from Figure 2, $\tilde\gamma$ monotonically decays. In each subfigure, the cliff gets smoother and converges to the asymptotic curve as $n$ increases. Comparing Figures 2(a) and 2(b), we can see that as $n$, $s$ and $q$ all increase proportionally, the cliff gets steeper. Comparing Figures 2(b) and 2(c), we can see that the cliff shifts to the right as $q$ increases.
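Because the closed-form expression for $\gamma$ is not reproduced above, the sketch below estimates the per-rank weights empirically by simulating the sampling procedure of Algorithm 1 (draw a mini-batch of size $s$ uniformly without replacement and keep the $q$ samples with the largest losses, i.e., the smallest ranks); the normalization by $n/q$ is our assumption, chosen so that every weight equals 1 when $q = s$.

```python
import numpy as np

def estimate_gamma(n, s, q, trials=100_000, seed=0):
    """Monte Carlo estimate of the per-rank weights induced by ordered SGD.

    Rank 1 corresponds to the largest loss (0-based index 0 below). Returns an
    array whose j-th entry estimates (n / q) * P(rank-j sample is in the top-q).
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros(n)
    for _ in range(trials):
        batch = rng.choice(n, size=s, replace=False)  # ranks drawn into the mini-batch
        counts[np.sort(batch)[:q]] += 1               # smallest ranks = largest losses
    return (n / q) * counts / trials

gamma = estimate_gamma(n=100, s=10, q=3)
print(gamma[:3], gamma[-3:])  # weights decay monotonically from the smallest ranks toward 0
```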

As a direct extension of Theorem 4, we can now obtain computational guarantees of ordered SGD for minimizing $L_q$ by taking advantage of classic convergence results for SGD:

Theorem.

Let $(\theta^{(t)})_{t \ge 0}$ be a sequence generated by ordered SGD (Algorithm 1). Suppose that each $\ell_i$ is Lipschitz continuous and that $r$ is Lipschitz continuous. Suppose also that a finite minimizer of $L_q$ exists and that the optimal value is finite. Then, the following two statements hold:

  1. (Convex setting) If $\ell_i$ (for every $i$) and $r$ are both convex, then for any step-size sequence $\{\eta_t\}$, it holds that

  2. (Weakly convex setting) Suppose that each $\ell_i$ is $\rho$-weakly convex (i.e., $\ell_i + \frac{\rho}{2}\|\cdot\|_2^2$ is convex) and that $r$ is convex. Recall the definition of the Moreau envelope (the standard definition is recalled below). Let $T$ be a random variable taking values in $\{0, 1, \dots, t\}$ according to a probability distribution determined by the step sizes. Then, for any suitable constant, it holds that

Theorem 4 shows that, in particular, with a suitable choice of step sizes, the optimality gap in the convex setting and the near-stationarity measure in the weakly convex setting both decay at a sublinear rate.

The Lipschitz continuity assumption in Theorem 4 is standard in the analysis of stochastic optimization algorithms. It is generally satisfied by the logistic loss, the hinge loss and the Huber loss without any constraints on $\theta$, and by the squared loss when one can presume that $\theta$ stays in a compact set (which is typically the case of interest in practice). For the weakly convex setting, the gradient norm of the Moreau envelope (which appears in Theorem 4 (2)) is a natural measure of near-stationarity for a non-differentiable weakly convex function (Davis and Drusvyatskiy, 2018). Weak convexity (also known as negative strong convexity or almost convexity) is a standard assumption for analyzing non-convex optimization problems in the optimization literature (Davis and Drusvyatskiy, 2018; Allen-Zhu, 2017). With a standard loss criterion such as the logistic loss, the individual objective of a neural network using sigmoid or tanh activation functions is weakly convex (a neural network with the ReLU activation function is not weakly convex and falls outside our setting).
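For completeness, the standard notions referenced here can be stated for a generic function $\varphi$ and constants $\rho, \lambda > 0$, following Davis and Drusvyatskiy (2018): $\varphi$ is $\rho$-weakly convex if and only if $\varphi + \frac{\rho}{2}\|\cdot\|_2^2$ is convex, and the Moreau envelope is defined by
\[
\varphi_{\lambda}(\theta) := \min_{\beta}\Big\{ \varphi(\beta) + \tfrac{1}{2\lambda}\,\|\beta - \theta\|_2^2 \Big\}.
\]
A small gradient norm $\|\nabla \varphi_{\lambda}(\theta)\|$ of the Moreau envelope certifies that $\theta$ is close to a point at which $\varphi$ is nearly stationary, which is why this quantity serves as the near-stationarity measure in Theorem 4 (2).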

5 Generalization Bound

This section presents the generalization theory for ordered SGD. To make the dependence on a training dataset explicit, we rewrite the individual losses and the ordered loss as functions of the dataset, where the permutation introduced in Section 4 defines the order of sample indexes by loss value. We also continue to use the weights $\gamma_j$, which depend on $(n, s, q)$. Given an arbitrary set of real-valued functions, we define its (standard) Rademacher complexity as:

where $\sigma_1, \dots, \sigma_n$ are independent uniform random variables taking values in $\{-1, +1\}$ (i.e., Rademacher variables). We also define a constant, depending on the loss $\ell$ and the model class, as the least upper bound on the difference of individual loss values; for example, this constant equals $1$ if $\ell$ is the 0-1 loss function. Theorem 5 presents a generalization bound for ordered SGD:

Theorem.

Let $\Theta$ be a fixed subset of the parameter space. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over an i.i.d. draw of $n$ examples, the following holds for all $\theta \in \Theta$:

(4)

where

The expected error on the left-hand side of Equation (4) is a standard objective for generalization, whereas the right-hand side is an upper bound that depends on the algorithm parameters $s$ and $q$. Let us first look at the asymptotic case as $n \to \infty$. Let $\Theta$ be constrained such that the Rademacher complexity term vanishes as $n \to \infty$, which has been shown to be satisfied for various models and sets $\Theta$ (Bartlett and Mendelson, 2002; Mohri et al., 2012; Bartlett et al., 2017; Kawaguchi et al., 2017). With the loss-difference bound being finite, the third term on the right-hand side of Equation (4) also disappears as $n \to \infty$. Thus, with high probability, the expected error is asymptotically bounded by the ordered empirical loss, which is minimized by ordered SGD as shown in Section 4. From this viewpoint, ordered SGD minimizes the expected error for generalization as $n \to \infty$.

A special case of Theorem 5 recovers the standard generalization bound for the empirical average loss (e.g., Mohri et al., 2012). That is, if $q = s$, ordered SGD becomes the standard mini-batch SGD and Equation (4) becomes

(5)

which is the standard generalization bound (e.g., Mohri et al., 2012). This is because, if $q = s$, every sample in the mini-batch is selected, the weights $\gamma_j$ become uniform, and hence the ordered empirical loss reduces to the empirical average loss.
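For reference, the standard bound referred to here takes the following textbook form (e.g., Mohri et al., 2012); the symbols below ($\Theta$ for the parameter set, $M$ for an upper bound on the loss values, $\mathfrak{R}_n$ for the Rademacher complexity of the induced loss class) follow the usual convention rather than the exact notation of Equation (5): with probability at least $1 - \delta$, simultaneously for all $\theta \in \Theta$,
\[
\mathbb{E}_{(x,y)}\big[\ell(f(x;\theta), y)\big]
\;\le\;
\frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i;\theta), y_i)
\;+\; 2\,\mathfrak{R}_n\big(\{(x, y) \mapsto \ell(f(x;\theta), y) : \theta \in \Theta\}\big)
\;+\; M\sqrt{\frac{\ln(1/\delta)}{2n}}.
\]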

For the purpose of a simple comparison of ordered SGD and (mini-batch) SGD, consider the case where we fix a single subset $\Theta$. Let $\hat\theta$ and $\bar\theta$ be the parameter vectors obtained by ordered SGD and (mini-batch) SGD, respectively, as the results of training. Then, with the loss-difference bound being bounded, the upper bound on the expected error for ordered SGD (the right-hand side of Equation 4) is (strictly) less than that for (mini-batch) SGD (the right-hand side of Equation 5) under either of two conditions comparing the corresponding empirical terms of the two bounds.

For a given model, whether Theorem 5 provides a non-vacuous bound depends on the choice of $\Theta$. In Appendix B, we discuss this effect as well as a standard way to derive various data-dependent bounds from Theorem 5.

6 Experiments

Data Aug Datasets Model mini-batch SGD OSGD Improve
No Semeion Logistic model 10.76 (0.35) 9.31 (0.42) 13.48
No MNIST Logistic model 7.70 (0.06) 7.35 (0.04) 4.55
No Semeion SVM 11.05 (0.72) 10.25 (0.51) 7.18
No MNIST SVM 8.04 (0.05) 7.66 (0.07) 4.60
No Semeion LeNet 8.06 (0.61) 6.09 (0.55) 24.48
No MNIST LeNet 0.65 (0.04) 0.57 (0.06) 11.56
No KMNIST LeNet 3.74 (0.08) 3.09 (0.14) 17.49
No Fashion-MNIST LeNet 8.07 (0.16) 8.03 (0.26) 0.57
No CIFAR-10 PreActResNet18 13.75 (0.22) 12.87 (0.32) 6.41
No CIFAR-100 PreActResNet18 41.80 (0.40) 41.32 (0.43) 1.17
No SVHN PreActResNet18 4.66 (0.10) 4.39 (0.11) 5.95
Yes Semeion LeNet 7.47 (1.03) 5.06 (0.69) 32.28
Yes MNIST LeNet 0.43 (0.03) 0.39 (0.03) 9.84
Yes KMNIST LeNet 2.59 (0.09) 2.01 (0.13) 22.33
Yes Fashion-MNIST LeNet 7.45 (0.07) 6.49 (0.19) 12.93
Yes CIFAR-10 PreActResNet18 8.08 (0.17) 7.04 (0.12) 12.81
Yes CIFAR-100 PreActResNet18 29.95 (0.31) 28.31 (0.41) 5.49
Yes SVHN PreActResNet18 4.45 (0.07) 4.00 (0.08) 10.08
Table 1: Test errors (%) of mini-batch SGD and ordered SGD (OSGD). The last column labeled “Improve” shows relative improvements (%) from mini-batch SGD to ordered SGD. In the other columns, the numbers indicate the mean test errors (and standard deviations in parentheses) over ten random trials. The first column shows ‘No’ for no data augmentation, and ‘Yes’ for data augmentation.
Figure 3: Test error and training loss (in log scales) versus the number of epochs, without data augmentation in subfigures (a)-(d) and with data augmentation in subfigures (e)-(h): (a) MNIST & Logistic, (b) MNIST & LeNet, (c) KMNIST, (d) CIFAR-10, (e) Semeion & LeNet, (f) KMNIST, (g) CIFAR-100, (h) SVHN. The lines indicate the mean values over 10 random trials, and the shaded regions represent intervals of the sample standard deviations.

In this section, we empirically evaluate ordered SGD with various datasets, models and settings. To avoid an extra degree of freedom due to the hyper-parameter $q$, we introduce a single fixed setup of adaptive values of $q$ as the default setting: $q$ is set to the mini-batch size $s$ at the beginning of training and is then decreased in stages once the training accuracy exceeds successively higher thresholds. The value of $q$ was automatically updated at the end of each epoch based on this simple rule. The rule was derived from the intuition that, in the early stage of training, all samples are informative for building a rough model, while the samples around the decision boundary (those with larger losses) are more helpful for building the final classifier in the later stage. In the figures and tables of this section, we refer to ordered SGD with this rule as 'OSGD', and to ordered SGD with a fixed value of $q$ as 'OSGD: $q$'.
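A schedule of this form can be implemented in a few lines. The sketch below is illustrative only: the accuracy thresholds and the divisors of the mini-batch size are hypothetical placeholders, not the exact schedule used in the experiments.

```python
def adaptive_q(train_accuracy, batch_size):
    """Illustrative adaptive-q rule: start with q = s and shrink q as training
    accuracy rises. Thresholds and divisors are hypothetical placeholders."""
    if train_accuracy < 0.80:
        return batch_size                 # early stage: use every sample in the mini-batch
    elif train_accuracy < 0.90:
        return max(1, batch_size // 2)
    elif train_accuracy < 0.95:
        return max(1, batch_size // 4)
    else:
        return max(1, batch_size // 8)    # late stage: focus on the hardest samples

# Called once per epoch, e.g.: q = adaptive_q(epoch_train_accuracy, s)
```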

Experiment with fixed hyper-parameters. For this experiment, we fixed all hyper-parameters a priori across all different datasets and models by using a standard hyper-parameter setting of mini-batch SGD, instead of aiming for state-of-the-art test errors for each dataset with a possible issue of over-fitting to test and validation datasets (Dwork et al., 2015; Rao et al., 2008). We fixed the mini-batch size to be 64, the weight decay rate to be , the initial learning rate to be , and the momentum coefficient to be . See Appendix C for more details of the experimental settings. The code to reproduce all the results is publicly available at: [the link is hidden for anonymous submission].

Table 1 compares the test performance of ordered SGD and mini-batch SGD for different models and datasets, and consistently shows that ordered SGD improved over mini-batch SGD in test error. The table reports the mean and the standard deviation of test errors (i.e., 100 × the average 0-1 loss on the test dataset) over ten random trials with different random seeds. The table also summarises the relative improvement of ordered SGD over mini-batch SGD, defined as 100 × ((mean test error of mini-batch SGD) − (mean test error of ordered SGD)) / (mean test error of mini-batch SGD). Logistic model refers to the linear multinomial logistic regression model, SVM refers to the linear multiclass support vector machine, LeNet refers to a standard variant of LeNet (LeCun et al., 1998) with ReLU activations, and PreActResNet18 refers to the pre-activation ResNet with 18 layers (He et al., 2016).
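As a worked instance of this definition, using the first row of Table 1 (Semeion with the logistic model):

```python
mini_batch_sgd, osgd = 10.76, 9.31     # mean test errors (%) from the first row of Table 1
relative_improvement = 100 * (mini_batch_sgd - osgd) / mini_batch_sgd
print(round(relative_improvement, 2))  # -> 13.48, matching the "Improve" column
```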

Figure 3 shows the test error and the average training loss of mini-batch SGD and ordered SGD versus the number of epochs. As shown in the figure, ordered SGD with a fixed value of $q$ also outperformed mini-batch SGD in general. In the figures, the reported training losses refer to the standard empirical average loss measured at the end of each epoch. When compared to mini-batch SGD, ordered SGD had lower test errors while having higher training losses in Figures 3(a), 3(d) and 3(g), because ordered SGD optimizes the ordered empirical loss instead. This is consistent with our motivation and theory of ordered SGD in Sections 3, 4 and 5. Qualitatively similar behaviors were also observed across all 18 problem settings, as shown in Appendix C.

Datasets mini-batch SGD OSGD
MNIST 14.44 (0.54) 14.77 (0.41)
KMNIST 12.17 (0.33) 11.42 (0.29)
CIFAR-10 48.18 (0.58) 46.40 (0.97)
CIFAR-100 47.37 (0.84) 44.74 (0.91)
SVHN 72.29 (1.23) 67.95 (1.54)
Table 2: Average wall-clock time (seconds) per epoch with data augmentation. PreActResNet18 was used for CIFAR-10, CIFAR-100, and SVHN, while LeNet was used for MNIST and KMNIST.
Figure 4: Effect of different values of $q$ with CIFAR-10.

Moreover, ordered SGD is a computationally efficient algorithm. Table 2 shows the wall-clock time in several illustrative experiments, whereas Table 4 in Appendix C summarizes the wall-clock time in all experiments. The wall-clock time of ordered SGD measures the time spent by all computations of ordered SGD, including the extra computation of finding the top-$q$ samples in a mini-batch (line 4 of Algorithm 1). This extra computation is generally negligible: it can be completed in $O(s \log s)$ time by sorting, or in $O(s)$ expected time by using a selection algorithm. The ordered SGD algorithm can even be faster than mini-batch SGD because ordered SGD only computes the (sub-)gradient of the top-$q$ samples (line 5 of Algorithm 1). As shown in Tables 2 and 4, ordered SGD was faster than mini-batch SGD for all larger models with PreActResNet18. This is because the computational reduction in back-propagation for ordered SGD can dominate the small extra cost of finding the top-$q$ samples in larger problems.

Experiment with different values of $q$. Figure 4 shows the effect of different fixed values of $q$ for CIFAR-10 with PreActResNet18. Ordered SGD improved the test errors of mini-batch SGD across the different fixed values of $q$. We also report the same observation with different datasets and models in Appendix C.

Experiment with different learning rates and mini-batch sizes. Figures 5 and 6 in Appendix C consistently show the improvement of ordered SGD over mini-batch SGD with different learning rates and mini-batch sizes.

Experiment with the best learning rate, mixup, and random erasing. Table 3 summarises the experimental results with the data augmentation methods of random erasing (RE) (Zhong et al., 2017) and mixup (Zhang et al., 2017; Verma et al., 2019) on the CIFAR-10 dataset. For this experiment, we purposefully adopted a setting that favors mini-batch SGD: for both mini-batch SGD and ordered SGD, we used hyper-parameters tuned for mini-batch SGD. For RE and mixup, we used the same tuned hyper-parameter settings (including learning rates) and code as in the previous studies that used mini-batch SGD (Zhong et al., 2017; Verma et al., 2019) (with WRN-28-10 for RE and with PreActResNet18 for mixup). For standard data augmentation, we first searched for the best learning rate of mini-batch SGD based on the test error (purposefully overfitting to the test dataset for mini-batch SGD) by using a grid search over learning rates of 1.0, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, and 0.0001. Then, we used the best learning rate of mini-batch SGD for ordered SGD (instead of using the best learning rate of ordered SGD for ordered SGD). As shown in Table 3, ordered SGD with hyper-parameters tuned for mini-batch SGD still outperformed the fine-tuned mini-batch SGD with the different data augmentation methods.

Data Aug mini-batch SGD OSGD Improve
Standard 6.94 6.46 6.92
RE 3.24 3.06 5.56
Mixup 3.31 3.05 7.85
Table 3: Test errors (%) by using the best learning rate of mini-batch SGD with various data augmentation methods for CIFAR-10.

7 Related work and extension

Although there is no direct predecessor of our work, the following fields are related to this paper.

Other mini-batch stochastic methods. The proposed sampling strategy and our theoretical analyses are generic and can be extended to other (mini-batch) stochastic methods, including Adam (Kingma and Ba, 2014), stochastic mirror descent (Beck and Teboulle, 2003; Nedic and Lee, 2014; Lu, 2017; Lu et al., 2018; Zhang and He, 2018), and proximal stochastic subgradient methods (Davis and Drusvyatskiy, 2018). Thus, our results open up a research direction for further studying the proposed stochastic optimization framework with different base algorithms such as Adam and AdaGrad. To illustrate this, we present ordered Adam and report the numerical results in Appendix C.
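Because the top-$q$ selection in Algorithm 1 is independent of the base update rule, an ordered variant of another optimizer only changes the optimizer object. The self-contained toy sketch below (our illustration with an arbitrary regression model and Adam; none of the names come from the paper) performs one step of such an ordered Adam.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # base optimizer: Adam
loss_fn = nn.MSELoss(reduction='none')                      # per-sample losses
x, y = torch.randn(64, 10), torch.randn(64, 1)

q = 16
optimizer.zero_grad()
per_sample = loss_fn(model(x), y).view(-1)                  # shape: (64,)
per_sample.topk(q).values.mean().backward()                 # top-q selection, as in Algorithm 1
optimizer.step()                                            # Adam update on the biased gradient
```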

Importance Sampling SGD. Stochastic gradient descent with importance sampling has been an active research area for the past several years (Needell et al., 2014; Zhao and Zhang, 2015; Alain et al., 2015; Loshchilov and Hutter, 2015; Gopal, 2016; Katharopoulos and Fleuret, 2018). In the convex setting, Zhao and Zhang (2015) and Needell et al. (2014) show that the optimal sampling distribution for minimizing the empirical average loss is proportional to the per-sample gradient norm. However, maintaining the gradient norm of individual samples can be computationally expensive when the dataset size or the parameter vector size is large, in particular for many applications of deep learning. These importance sampling methods are inherently different from ordered SGD in that importance sampling is used to reduce the number of iterations for minimizing the empirical average loss, whereas ordered SGD is designed to learn a different type of model by minimizing the new objective function $L_q$.

Average Top-k Loss. The average top-$k$ loss was introduced by Fan et al. (2017) as an alternative to the empirical average loss. The ordered loss function $L_q$ differs from the average top-$k$ loss, as shown in Section 4. Furthermore, our proposed framework is fundamentally different from that of the average top-$k$ loss. First, the algorithms are different: the stochastic method proposed by Fan et al. (2017) utilizes duality of the average top-$k$ loss and is unusable for deep neural networks (and other non-convex problems), while our proposed method is a modification of mini-batch SGD that is usable for deep neural networks (and other non-convex problems) and scales well to large problems. Second, the optimization results are different; in particular, the objective functions are different and we provide a convergence analysis for weakly convex (non-convex) functions. Finally, the focus of the generalization analysis is different: Fan et al. (2017) focus on calibration for the binary classification problem, while we focus on a generalization bound that works for general classification and regression problems.

Random-then-Greedy Procedure. Ordered SGD randomly picks a subset of samples and then greedily utilizes a part of that subset, which is related to the random-then-greedy procedure recently proposed in a different context, namely greedy weak learners for gradient boosting (Lu and Mazumder, 2018).

8 Conclusion

We have presented an efficient stochastic first-order method, ordered SGD, for learning an effective predictor in machine learning problems. We have shown that ordered SGD minimizes a new ordered empirical loss $L_q$, based on which we have developed the optimization and generalization properties of ordered SGD. The numerical experiments confirmed the effectiveness of the proposed algorithm.

Appendix

Appendix A Proofs

In Appendix A, we provide complete proofs of the theoretical results.

A.1 Proof of Theorem 4

Proof.

We just need to show that the update direction used by ordered SGD is an unbiased estimator of a sub-gradient of $L_q$ at $\theta$.

First, it holds that

where $g_i$ is a sub-gradient of $\ell_i$ at $\theta$. In the above equality chain, the third equality is simply the definition of expectation, and the last equality holds because $(\cdot)$ is a permutation of $\{1, \dots, n\}$.

For any given index , define , then

(6)

Notice that $S$ is randomly chosen from the sample index set $\{1, \dots, n\}$ without replacement. There are in total $\binom{n}{s}$ different sets $S$ with $|S| = s$. Among them, there are $\binom{n-1}{s-1}$ different sets that contain a given index, thus

(7)

Given the condition , contains items in means contains items in , thus there are such possible set , whereby it holds that

(8)

Substituting Equations (7) and (8) into Equation (6), we arrive at

Therefore,

where the last inequality is due to the additivity of sub-gradients (for both convex and weakly convex functions). ∎

A.2 Proof of Proposition 4

We just need to show that

(9)

then we finish the proof by a change of variables.

First, Stirling's approximation yields that, when the quantities involved are both sufficiently large, it holds that

(10)

Thus,

(11)

where the first equality utilizes Equation (10) and the fact that the remaining factors are negligible in the limit (except for the exponent terms).

On the other hand, by rearranging the factorials, it holds that

(12)

Combining Equations (11) and (12) and summing, we arrive at Equation (9).

By noticing , it holds that

In other words, this is the cumulative distribution function of the Beta distribution in the limit.

A.3 Proof of Theorem 4

Proof.

Notice that is a sub-gradient of where . Suppose where is a sub-gradient of and is a sub-gradient of . Then

(13)

Meanwhile, it follows from Theorem 4 that the update direction is an unbiased estimator of a sub-gradient of $L_q$. Together with Equation (13), we obtain statement (1) from the analysis of convex stochastic sub-gradient descent in Boyd and Mutapcic (2008).

Furthermore, suppose is convex for any , then is also convex, whereby is -weakly convex. We obtain the statement (2) by substituting into Theorem 2.1 in Davis and Drusvyatskiy (2018). ∎

A.4 Proof of Theorem 5

Before proving Theorem 5, we first show the following proposition, which gives an upper bound related to the selection probabilities of ordered SGD:

Proposition. For any $j \in \{1, \dots, n\}$, the probability that ordered SGD selects the sample with the $j$-th largest loss in a given iteration is at most $s/n$.

Proof.

The probability of ordered SGD choosing the $j$-th sample in the ordered sequence is at most the probability of mini-batch SGD choosing the $j$-th sample, because ordered SGD can select a sample only if it is first drawn into the mini-batch. The probability of mini-batch SGD choosing the $j$-th sample is $s/n$. ∎

We are now ready to prove Theorem 5 by finding an upper bound on based on McDiarmid’s inequality.

Proof of Theorem 5.  In this proof, our objective is to provide an upper bound on the generalization gap by using McDiarmid's inequality. To apply McDiarmid's inequality, we first show that the bounded-differences condition is satisfied. Let two datasets differ in exactly one point, at an arbitrary index. Then, we provide an upper bound on the resulting difference as follows:

where the first line follows from the property of the supremum, the second line follows from the definition, and the last line follows from Proposition A.4.

We now bound the last term. This requires a careful examination because the ordered losses may differ for more than one index (although the two datasets differ in only one point): changing a single point can change the ordering of the losses, so the $j$-th largest loss may differ for many indexes $j$. To analyze this effect, we now conduct a case analysis based on the position of the differing sample in the two orderings.

Consider the case where . Let and . Then,

where the first line uses the fact that the individual losses can differ only at the index of the sample that differs between the two datasets. The second line follows from the equality that holds in this case. The third line follows from the definition of the ordering of the indexes, and the fourth line follows from the cancellation of terms in the third line.

Consider the case where