# Coresets for Data-efficient Training of Machine Learning Models

## Abstract

Incremental gradient (IG) methods, such as stochastic gradient descent and its variants, are commonly used for large-scale optimization in machine learning. Despite the sustained effort to make IG methods more data-efficient, it remains an open question how to select a training data subset that can theoretically and practically perform on par with the full dataset. Here we develop CRAIG, a method to select a weighted subset (or coreset) of training data that closely estimates the full gradient by maximizing a submodular function. We prove that applying IG to this subset is guaranteed to converge to the (near-)optimal solution with the same convergence rate as that of IG for convex optimization. As a result, CRAIG achieves a speedup that is inversely proportional to the size of the subset. To our knowledge, this is the first rigorous method for data-efficient training of general machine learning models. Our extensive set of experiments shows that CRAIG, while achieving practically the same solution, speeds up various IG methods by up to 6x for logistic regression and 3x for training deep neural networks.

## 1 Introduction

Mathematical optimization lies at the core of training large-scale machine learning systems, and is now widely used over massive data sets with great practical success, assuming sufficient data resources are available. Achieving this success, however, also requires large amounts of (often GPU) computing, as well as concomitant financial expenditures and energy usage (Strubell et al., 2019). Significantly decreasing these costs without decreasing the learnt system’s resulting accuracy is one of the grand challenges of machine learning and artificial intelligence today (Asi and Duchi, 2019).

Training machine learning models often reduces to optimizing a regularized empirical risk function. Given a convex loss $l$ and a $\mu$-strongly convex regularizer $r$, one aims to find the model parameter vector $w_*$ over the parameter space $\mathcal{W}$ that minimizes the loss $f$ over the training data $V$:

$$w_* \in \arg\min_{w \in \mathcal{W}} f(w), \qquad f(w) := \sum_{i \in V} f_i(w) + r(w), \qquad f_i(w) = l(w, (x_i, y_i)), \tag{1}$$

where $V$ is an index set of the training data, and functions $f_i$ are associated with training examples $(x_i, y_i)$, where $x_i$ is the feature vector, and $y_i$ is the label of point $i$.

However, the direction that remains largely unexplored is how to carefully select a small subset $S$ of the full training data $V$, so that the model is trained only on the subset $S$ while still (approximately) converging to the globally optimal solution (i.e., the model parameters that would be obtained if training/optimizing on the full $V$). If such a subset $S$ can be quickly found, then this would directly lead to a speedup of $|V|/|S|$ (which can be very large if $|S| \ll |V|$) per epoch of IG.

There are four main challenges in finding such a subset $S$. First, a guiding principle for selecting $S$ is unclear. For example, selecting training points close to the decision boundary might allow the model to fine-tune the decision boundary, while picking the most diverse set of data points would allow the model to get a better sense of the training data distribution. Second, finding $S$ must be fast, as otherwise identifying the set $S$ may take longer than the actual optimization, and so no overall speed-up would be achieved. Third, finding a subset $S$ is not enough: one also has to decide on a gradient stepsize for each data point in $S$, as the stepsizes affect the convergence. And last, while the method might work well empirically on some data sets, one also requires theoretical understanding and mathematical convergence guarantees.

Here we develop Coresets for Accelerating Incremental Gradient descent (CRAIG), for selecting a subset of training data points to speed up training of large machine learning models. Our key idea is to select a weighted subset $S$ of training data that best approximates the full gradient of $V$. We prove that the subset $S$ that minimizes an upper-bound on the error of estimating the full gradient maximizes a submodular facility location function. Hence, $S$ can be efficiently found using a fast greedy algorithm.

We also provide theoretical analysis of CRAIG and prove its convergence. Most importantly, we show that any incremental gradient method (IG) on $S$ converges in the same number of epochs as the same IG would on the full $V$, which means that we obtain a speed-up inversely proportional to the size of $S$. In particular, for a $\mu$-strongly convex risk function and a subset $S$ selected by CRAIG that estimates the full gradient by an error of at most $\epsilon$, we prove that IG on $S$ with diminishing stepsize at epoch $k$ (with and ), converges to an neighborhood of the optimal solution at rate . Here, where $d_0$ is the initial distance to the optimum, $C$ is an upper-bound on the norm of the gradients, , and is the largest weight for the elements in the subset obtained by CRAIG. Moreover, we prove that if in addition to the strong convexity, component functions have smooth gradients, IG with the same diminishing step size on subset $S$ converges to a neighborhood of the optimum solution at rate .

The above implies that IG on $S$ converges to the same solution and in the same number of epochs as IG on the full $V$. But because every epoch only uses a subset of the data, it requires fewer gradient computations and thus leads to a speedup over traditional IG methods, while still (approximately) converging to the optimal solution. We also note that CRAIG is complementary to various incremental gradient (IG) methods (SGD, SAGA, SVRG, Adam), and such methods can be used on the subset $S$ found by CRAIG.

We also demonstrate the effectiveness of CRAIG via an extensive set of experiments using logistic regression (a convex optimization problem) as well as training deep neural networks (non-convex optimization problems). We show that CRAIG speeds up incremental gradient methods, including SGD, SAGA, SVRG, and Adam. In particular, CRAIG, while achieving practically the same loss and accuracy as the underlying incremental gradient descent methods, speeds up gradient methods by up to 6x for convex and 3x for non-convex loss functions.

## 2 Related Work

Convergence of IG methods has been long studied under various conditions (Zhi-Quan and Paul, 1994; Mangasariany and Solodovy, 1994; Bertsekas, 1996; Solodov, 1998; Tseng, 1998), however IG’s convergence rate has been characterized only more recently (see (Bertsekas, 2015) for a survey). In particular, Nedić and Bertsekas (2001) provides a convergence rate for diminishing stepsizes per epoch under a strong convexity assumption, and Gürbüzbalaban et al. (2015) proves a convergence rate with diminishing stepsizes for under an additional smoothness assumption for the components. While these works provide convergence on the full dataset, our analysis provides the same convergence rates on subsets obtained by CRAIG.

Techniques for speeding up SGD are mostly focused on variance reduction (Roux et al., 2012; Shalev-Shwartz and Zhang, 2013; Johnson and Zhang, 2013; Hofmann et al., 2015; Allen-Zhu et al., 2016), and on accelerated gradient methods when the regularization parameter is small (Frostig et al., 2015; Lin et al., 2015; Xiao and Zhang, 2014). Very recently, Hofmann et al. (2015) and Allen-Zhu et al. (2016) exploited neighborhood structure to further reduce the variance of stochastic gradient descent and improve its running time. Our CRAIG method and analysis are complementary to variance reduction and accelerated methods: CRAIG can be applied to all of these methods to speed them up.

Coresets are weighted subsets of the data, which guarantee that models fitting the coreset also provide a good fit for the original data. Coreset construction methods traditionally perform importance sampling with respect to sensitivity score, to provide high-probability solutions (Har-Peled and Mazumdar, 2004; Lucic et al., 2017; Cohen et al., 2017) for a particular problem, such as $k$-means and $k$-median clustering (Har-Peled and Mazumdar, 2004), naïve Bayes and nearest-neighbors (Wei et al., 2015), mixture models (Lucic et al., 2017), low rank approximation (Cohen et al., 2017), spectral approximation (Agarwal et al., 2004; Li et al., 2013), Nyström methods (Agarwal et al., 2004; Musco and Musco, 2017), and Bayesian inference (Campbell and Broderick, 2018). Unlike existing coreset construction algorithms, our method is not problem specific and can be applied for training general machine learning models.

## 3 Coresets for Accelerating Incremental Gradient Descent (CRAIG)

We proceed as follows: First, we define an objective function $L$ for selecting an optimal set $S$ of size $r$ that best approximates the gradient of the full training dataset $V$ of size $n$. Then, we show that $L$ can be turned into a submodular function $F$ and thus $S$ can be efficiently found using a fast greedy algorithm. Crucially, we also show that for convex loss functions the approximation error between the estimated and the true gradient can be efficiently minimized in a way that is independent of the actual optimization procedure. Thus, CRAIG can simply be used as a preprocessing step before the actual optimization starts.

Incremental gradient methods aim at estimating the full gradient over $V$ by iteratively making a step based on the gradient of every function $f_i$. Our key idea in CRAIG is that if we can find a small subset $S$ such that the weighted sum of the gradients of its elements closely approximates the full gradient over $V$, we can apply IG only to the set $S$ (with stepsizes equal to the weight of the elements in $S$), and we should still converge to the (approximately) optimal solution, but much faster.

Specifically, our goal in CRAIG is to find the smallest subset $S \subseteq V$ and corresponding per-element stepsizes $\gamma_j \geq 0$ that approximate the full gradient with an error at most $\epsilon$ for all the possible values of the optimization parameters $w \in \mathcal{W}$.

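$$S^* = \arg\min_{S \subseteq V,\ \gamma_j \geq 0\ \forall j} |S|, \quad \text{s.t.} \quad \max_{w \in \mathcal{W}} \Big\| \sum_{i \in V} \nabla f_i(w) - \sum_{j \in S} \gamma_j \nabla f_j(w) \Big\| \leq \epsilon. \tag{2}$$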

Given such an $S^*$ and associated weights $\{\gamma_j\}_{j \in S^*}$, we are guaranteed that gradient updates on $S^*$ will be similar to the gradient updates on $V$ regardless of the value of $w$.

Unfortunately, directly solving the above optimization problem is not feasible, due to two problems. Problem 1: Eq. (2) requires us to calculate the gradient of all the functions $f_i$ over the entire space $\mathcal{W}$, which is too expensive and would not lead to overall speedup. In other words, it would appear that solving for $S^*$ is as difficult as solving Eq. (1), as it involves calculating $\sum_{i \in V} \nabla f_i(w)$ for various $w \in \mathcal{W}$. Problem 2: even if calculating the normed difference between the gradients in Eq. (2) were fast, as we discuss later, finding the optimal subset is NP-hard. In the following, we address these two problems and discuss how we can quickly find a near-optimal subset $S$.

### 3.1 Upper-bound on the Estimation Error

We first address Problem 1, i.e., how to quickly estimate the error/discrepancy of the weighted sum of gradients of functions associated with data points $j \in S$, vs the full gradient, for every $w \in \mathcal{W}$.

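Let $S$ be a subset of $r$ data points. Furthermore, assume that there is a mapping $\varsigma_w : V \to S$ that for every $w \in \mathcal{W}$ assigns every data point $i \in V$ to one of the elements $j$ in $S$, i.e., $\varsigma_w(i) = j \in S$. Let $C_j = \{ i \in V \mid \varsigma_w(i) = j \}$ be the set of data points that are assigned to $j \in S$, and $\gamma_j = |C_j|$ be the number of such data points. Hence, the sets $\{C_j\}_{j \in S}$ form a partition of $V$. Then, for any arbitrary (single) $w \in \mathcal{W}$ we can write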

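\begin{align}
\sum_{i \in V} \nabla f_i(w) &= \sum_{i \in V} \Big( \nabla f_i(w) - \nabla f_{\varsigma_w(i)}(w) + \nabla f_{\varsigma_w(i)}(w) \Big) \tag{3} \\
&= \sum_{i \in V} \Big( \nabla f_i(w) - \nabla f_{\varsigma_w(i)}(w) \Big) + \sum_{j \in S} \gamma_j \nabla f_j(w). \tag{4}
\end{align}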

Subtracting $\sum_{j \in S} \gamma_j \nabla f_j(w)$ and then taking the norm of both sides, we get an upper bound on the error of estimating the full gradient with the weighted sum of the gradients of the functions $f_j$ for $j \in S$. I.e.,

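$$\Big\| \sum_{i \in V} \nabla f_i(w) - \sum_{j \in S} \gamma_j \nabla f_j(w) \Big\| \leq \sum_{i \in V} \big\| \nabla f_i(w) - \nabla f_{\varsigma_w(i)}(w) \big\|, \tag{5}$$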

where the inequality follows from the triangle inequality. The upper-bound in Eq. (5) is minimized when $\varsigma_w$ assigns every $i \in V$ to an element in $S$ with most gradient similarity at $w$, or minimum Euclidean distance between the gradient vectors at $w$. That is: $\varsigma_w(i) \in \arg\min_{j \in S} \| \nabla f_i(w) - \nabla f_j(w) \|$. Hence,

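$$\min_{S \subseteq V} \Big\| \sum_{i \in V} \nabla f_i(w) - \sum_{j \in S} \gamma_j \nabla f_j(w) \Big\| \leq \sum_{i \in V} \min_{j \in S} \big\| \nabla f_i(w) - \nabla f_j(w) \big\|. \tag{6}$$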

The right hand side of Eq. (6) is minimized when $S$ is the set of medoids (exemplars) for all the components in the gradient space.
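The bound in Eqs. (5) and (6) can be checked numerically on a toy example. The sketch below is illustrative only: the gradient vectors are made-up numbers at a single fixed parameter value, and the helper name `assignment_bound` is hypothetical. It assigns every point to its nearest selected element in gradient space, uses the cluster sizes as weights, and compares the actual estimation error with the upper bound.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def assignment_bound(grads, subset):
    """Return (weights, actual_error, upper_bound) for a candidate subset.

    grads[i] stands in for the gradient of f_i at one fixed w (toy values).
    """
    dim = len(grads[0])
    weights = {j: 0 for j in subset}
    upper_bound = 0.0
    for g in grads:
        # Nearest selected element in gradient space (the mapping in Eq. (5)).
        j = min(subset, key=lambda s: norm(sub(g, grads[s])))
        weights[j] += 1
        upper_bound += norm(sub(g, grads[j]))
    # Full gradient versus the weighted sum over the subset.
    full = [sum(g[d] for g in grads) for d in range(dim)]
    approx = [sum(weights[j] * grads[j][d] for j in subset) for d in range(dim)]
    return weights, norm(sub(full, approx)), upper_bound

# Two tight clusters of toy gradients; one representative is picked per cluster.
grads = [[1.0, 0.0], [1.1, 0.1], [0.9, -0.1], [-1.0, 2.0], [-1.2, 2.1]]
weights, err, bound = assignment_bound(grads, subset=[0, 3])
print(weights, err, bound)
```

When the selected elements sit near the centers of the gradient clusters, both the bound and the actual error stay small, which is the intuition behind choosing medoids.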

So far, we considered upper-bounding the gradient estimation error at a particular $w$. To bound the estimation error for all $w \in \mathcal{W}$, we consider a worst-case approximation of the estimation error over the entire parameter space $\mathcal{W}$. Formally, we define a distance metric $d_{ij}$ between gradients of $f_i$ and $f_j$ as the maximum normed difference between $\nabla f_i(w)$ and $\nabla f_j(w)$ over all $w \in \mathcal{W}$:

$$d_{ij} \triangleq \max_{w \in \mathcal{W}} \big\| \nabla f_i(w) - \nabla f_j(w) \big\|. \tag{7}$$

Thus, by solving the following minimization problem, we obtain the smallest weighted subset $S^*$ that approximates the full gradient by an error of at most $\epsilon$ for all $w \in \mathcal{W}$:

$$S^* = \arg\min_{S \subseteq V} |S|, \quad \text{s.t.} \quad L(S) \triangleq \sum_{i \in V} \min_{j \in S} d_{ij} \leq \epsilon. \tag{8}$$

Note that Eq. (8) requires that the gradient error is bounded over $\mathcal{W}$. However, we show (Appendix B.1) that for several classes of convex problems, including linear regression, ridge regression, logistic regression, and regularized support vector machines (SVMs), the normed gradient difference between data points can be efficiently upper-bounded by (Allen-Zhu et al., 2016; Hofmann et al., 2015):

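$$\forall w, i, j \quad \big\| \nabla f_i(w) - \nabla f_j(w) \big\| \leq d_{ij} \leq \max_{w \in \mathcal{W}} O(\|w\|) \cdot \|x_i - x_j\| = \text{const.}\ \|x_i - x_j\|. \tag{9}$$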

Note that when $\|w\|$ is bounded for all $w \in \mathcal{W}$, i.e., $\max_{w \in \mathcal{W}} O(\|w\|) < \infty$, upper-bounds on the Euclidean distances between the gradients can be pre-computed. This is crucial, because it means that the estimation error of the full gradient can be efficiently bounded independent of the actual optimization problem (i.e., point $w$). Thus, these upper-bounds can be computed only once as a pre-processing step before any training takes place, and then used to find the subset $S$ by solving the optimization problem (8). We address upper-bounding the normed difference between gradients for deep models in Section 3.3.
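As a rough illustration of this preprocessing step, the sketch below computes the pairwise Euclidean feature distances that appear on the right hand side of Eq. (9). It assumes the problem-dependent constant has been dropped, so the matrix is only a proxy for $d_{ij}$ up to scaling; the function name `pairwise_feature_distances` is hypothetical.

```python
import math

def pairwise_feature_distances(X):
    """Return the symmetric matrix of Euclidean distances ||x_i - x_j||.

    Up to the constant in Eq. (9), entry [i][j] upper-bounds the gradient
    distance d_ij, and it does not depend on the parameter vector w.
    """
    n = len(X)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(X[i], X[j])
            dist[i][j] = dist[j][i] = d
    return dist

# Toy feature vectors (made-up values).
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = pairwise_feature_distances(X)
print(D)
```

Because the matrix depends only on the features, it can be computed once before training and reused by the subset selection step.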

### 3.2 The CRAIG Algorithm

Optimization problem (8) produces a subset $S$ of elements with their associated weights $\{\gamma_j\}_{j \in S}$ or per-element stepsizes that closely approximates the full gradient. Here, we show how to efficiently approximately solve the above optimization problem to find a near-optimal subset $S$.

The optimization problem (8) is NP-hard as it involves calculating the value of $L(S)$ for all the subsets $S \subseteq V$. We show, however, that we can transform it into a submodular set cover problem, which can be efficiently approximated.

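Formally, a set function $F$ is submodular if $F(S \cup \{e\}) - F(S) \geq F(T \cup \{e\}) - F(T)$ for any $S \subseteq T \subseteq V$ and $e \in V \setminus T$. We denote the marginal utility of an element $e$ w.r.t. a subset $S$ as $F(e|S) = F(S \cup \{e\}) - F(S)$. Function $F$ is called monotone if $F(e|S) \geq 0$ for any $e \in V \setminus S$ and $S \subseteq V$. The submodular cover problem is defined as finding the smallest set $S$ that achieves utility $\rho$. Precisely,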

$$S^* = \arg\min_{S \subseteq V} |S|, \quad \text{s.t.} \quad F(S) \geq \rho. \tag{10}$$

Although finding $S^*$ is NP-hard, since it captures well-known NP-hard problems such as Minimum Vertex Cover, for many classes of submodular functions (Nemhauser et al., 1978; Wolsey, 1982) a simple greedy algorithm is known to be very effective. The greedy algorithm starts with the empty set $S_0 = \emptyset$, and at each iteration $i$ it chooses an element $e \in V$ that maximizes $F(e|S_{i-1})$, i.e., $S_i = S_{i-1} \cup \{\arg\max_{e \in V} F(e|S_{i-1})\}$. Greedy gives us a logarithmic approximation, i.e. $|S| \leq \big(1 + \ln(\max_e F(e|\emptyset))\big) |S^*|$. The computational complexity of the greedy algorithm is $O(|V| \cdot |S|)$. However, its running time can be reduced to $O(|V|)$ using stochastic algorithms (Mirzasoleiman et al., 2015a) and further improved using lazy evaluation (Minoux, 1978) and distributed implementations (Mirzasoleiman et al., 2015b, 2016).

Given a subset $S \subseteq V$, the facility location function quantifies the coverage of the whole data set $V$ by the subset $S$ by summing the similarities between every $i \in V$ and its closest element $j \in S$. Formally, facility location is defined as $F_{fl}(S) = \sum_{i \in V} \max_{j \in S} s_{i,j}$, where $s_{i,j}$ is the similarity between $i, j \in V$. The facility location function has been used in various summarization applications (Lin et al., 2009; Lin and Bilmes, 2012). By introducing an auxiliary element $s_0$ we can turn $L(S)$ in Eq. (8) into a monotone submodular facility location function,

$$F(S) = L(\{s_0\}) - L(S \cup \{s_0\}), \tag{11}$$

where $L(\{s_0\})$ is a constant. In words, $F$ measures the decrease in the estimation error associated with the set $S$ versus the estimation error associated with just the auxiliary element. For a suitable choice of $s_0$, maximizing $F$ is equivalent to minimizing $L$. Therefore, we apply the greedy algorithm to approximately solve the following problem to get the subset $S$ defined in Eq. (8):

$$S^* = \arg\min_{S \subseteq V} |S|, \quad \text{s.t.} \quad F(S) \geq L(\{s_0\}) - \epsilon. \tag{12}$$
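The monotonicity and submodularity of the function in Eq. (11) can be verified by brute force on a tiny instance. The sketch below is only a sanity check under stated assumptions: the distances come from made-up points on a line, the auxiliary element $s_0$ is modeled as being at a constant distance `d0` from every point (chosen here as the largest pairwise distance), and the function names are hypothetical.

```python
from itertools import combinations

def L(S, dist, d0):
    # Upper bound of Eq. (8) evaluated on S together with the auxiliary
    # element s0, which is assumed to sit at distance d0 from every point.
    return sum(min([d0] + [row[j] for j in S]) for row in dist)

def F(S, dist, d0):
    # Eq. (11): decrease in the bound relative to using s0 alone.
    return L((), dist, d0) - L(S, dist, d0)

def is_monotone_submodular(dist, d0, tol=1e-12):
    n = len(dist)
    subsets = [c for r in range(n + 1) for c in combinations(range(n), r)]
    for A in subsets:
        for B in subsets:
            if not set(A) <= set(B):
                continue
            for e in range(n):
                if e in B:
                    continue
                gain_A = F(A + (e,), dist, d0) - F(A, dist, d0)
                gain_B = F(B + (e,), dist, d0) - F(B, dist, d0)
                # Monotone: gains are non-negative.
                # Submodular: gains shrink as the set grows.
                if gain_B < -tol or gain_A < gain_B - tol:
                    return False
    return True

xs = [0.0, 1.0, 2.0, 10.0]                      # toy points on a line
dist = [[abs(a - b) for b in xs] for a in xs]
d0 = max(max(row) for row in dist)
print(is_monotone_submodular(dist, d0))
```

The exhaustive check is exponential in the number of points, so it is only meant for instances of a handful of elements.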

At every step, the greedy algorithm selects an element that reduces the upper bound on the estimation error the most. In fact, the size of the smallest subset $S$ that estimates the full gradient by an error of at most $\epsilon$ depends on the structural properties of the data. Intuitively, as long as the marginal gains of facility location are considerably large, we need more elements to improve our estimation of the full gradient. Having found $S$, the weight $\gamma_j$ of every element $j \in S$ is the number of components that are closest to it in the gradient space, and is used as the stepsize of element $j$ during IG. The pseudocode for CRAIG is outlined in Algorithm 1.
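The selection procedure described above can be sketched as follows. This is a minimal plain-greedy illustration, not a reproduction of Algorithm 1: it assumes a precomputed matrix of distance upper bounds, models the auxiliary element as being at the largest pairwise distance from every point, and uses the hypothetical name `craig_greedy`. The stochastic and lazy variants mentioned earlier are not implemented.

```python
def craig_greedy(dist, eps):
    """Greedy facility-location selection on a pairwise distance matrix.

    dist[i][j] plays the role of d_ij and eps is the target value for L(S).
    Returns the ordered subset and the per-element weights gamma_j.
    """
    n = len(dist)
    # Assumed distance of every point to the auxiliary element s0.
    d0 = max(max(row) for row in dist)
    closest = [d0] * n              # current distance to the nearest of S and s0
    selected = []
    while sum(closest) > eps and len(selected) < n:
        best, best_gain = None, -1.0
        for e in range(n):
            if e in selected:
                continue
            # Marginal gain F(e | S): total decrease of the upper bound.
            gain = sum(max(0.0, closest[i] - dist[i][e]) for i in range(n))
            if gain > best_gain:
                best, best_gain = e, gain
        selected.append(best)
        closest = [min(closest[i], dist[i][best]) for i in range(n)]
    # gamma_j = number of points whose nearest selected element is j.
    weights = {j: 0 for j in selected}
    if selected:
        for i in range(n):
            nearest = min(selected, key=lambda j: dist[i][j])
            weights[nearest] += 1
    return selected, weights

xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]          # two toy clusters on a line
dist = [[abs(a - b) for b in xs] for a in xs]
subset, gammas = craig_greedy(dist, eps=5.0)
print(subset, gammas)
```

On this toy input the procedure stops after picking one element per cluster, and the weights sum to the number of data points, which is what makes the weighted subset gradient comparable in scale to the full gradient.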

Notice that CRAIG creates subset incrementally one element at a time, which produces a natural order to the elements in . Adding the element with largest marginal gain improves our estimation from the full gradient by an amount bounded by the marginal gain. At every step , we have . Hence,

$$\Big\| \sum_{i \in V} \nabla f_i(w) - \sum_{j \in S} \gamma_j \nabla f_j(w) \Big\| \leq \mathrm{cnt} - \big(1 - e^{-i/|S|}\big) L(S^*). \tag{13}$$

Intuitively, the first elements of the ordering contribute the most to provide a close approximation of the full gradient, and the rest of the elements further refine the approximation. Hence, the first incremental gradient updates get us close to $w_*$, and the rest of the updates further refine the solution.

### 3.3 Application of CRAIG to Deep Networks

As discussed, CRAIG selects a subset that closely approximates the full gradient, and hence can also be applied to speed up the training of deep networks. The challenge here is that we cannot use inequality (9) to bound the normed difference between gradients for all $w \in \mathcal{W}$ and find the subset as a preprocessing step.

However, for deep neural networks, the variation of the gradient norms is mostly captured by the gradient of the loss w.r.t. the input to the last layer (Section 3.2 of Katharopoulos and Fleuret, 2018). We show (Appendix B.1) that the normed gradient difference between data points can be efficiently bounded approximately by

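$$\big\| \nabla f_i(w) - \nabla f_j(w) \big\| \leq C_1 \Big\| \Sigma'_L\big(z_i^{(L)}\big) \nabla f_i^{(L)}(w) - \Sigma'_L\big(z_j^{(L)}\big) \nabla f_j^{(L)}(w) \Big\| + C_2, \tag{14}$$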

where $\Sigma'_L\big(z_i^{(L)}\big) \nabla f_i^{(L)}(w)$ is the gradient of the loss w.r.t. the input to the last layer for data point $i$, and $C_1, C_2$ are constants. The above upper-bound depends on the parameter vector $w$, which changes during the training process. Thus, we need to use CRAIG to update the subset $S$ after a number of parameter updates.

The above upper-bound is often only slightly more expensive than calculating the loss. For example, in cases where we have cross entropy loss with soft-max as the last layer, the gradient of the loss w.r.t. the -th input to the soft-max is simply , where are logits (dimension for classes) and is the one-hot encoded label. In this case, CRAIG does not need any backward pass or extra storage. Note that, although CRAIG needs an additional complexity (or using stochastic greedy) to find the subset at the beginning of every epoch, this complexity does not involve any (exact) gradient calculations and is negligible compared to the cost of backpropagations performed during the epoch. Hence, as we show in the experiments CRAIG is practical and scalable.
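To make the cross-entropy case concrete, the sketch below uses the standard identity that the gradient of the softmax cross-entropy loss w.r.t. the inputs of the softmax equals the vector of predicted probabilities minus the one-hot label, so it is available from a forward pass alone. The function names and the example values are illustrative assumptions.

```python
import math

def softmax(z):
    m = max(z)                                   # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, label):
    return -math.log(softmax(z)[label])

def last_layer_gradient(z, label):
    """Gradient of the cross-entropy loss w.r.t. the softmax inputs z."""
    p = softmax(z)
    return [p_c - (1.0 if c == label else 0.0) for c, p_c in enumerate(p)]

z = [0.5, -1.0, 2.0]                             # toy last-layer inputs
g = last_layer_gradient(z, label=2)
print(g)
```

These per-example vectors can stand in for the full gradients when computing pairwise distances for subset selection, which is why no backward pass is needed in this setting.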

## 4 Convergence Rate Analysis of CRAIG

The idea of CRAIG is to select a subset that closely approximates the full gradient, and hence it can be applied to speed up most IG variants, as we show in our experiments. Here, we briefly introduce the original IG method, and then prove the convergence rate of IG applied to CRAIG subsets.

### 4.1 Incremental Gradient Methods (IG)

Incremental gradient (IG) methods are core algorithms for solving Problem (1) and are widely used and studied. IG aims at approximating the standard gradient method by sequentially stepping along the gradient of the component functions $f_i$ in a cyclic order. Starting from an initial point $w_0^1$, it makes $k$ passes over all the $n$ components. At every epoch $k \geq 1$, it iteratively updates $w_{i-1}^k$ based on the gradient of $f_i$ for $i = 1, \dots, n$ using stepsize $\alpha_k > 0$. I.e.,

$$w_i^k = w_{i-1}^k - \alpha_k \nabla f_i(w_{i-1}^k), \qquad i = 1, 2, \cdots, n, \tag{15}$$

with the convention that $w_0^{k+1} = w_n^k$. Note that for a closed and convex subset $\mathcal{W}$ of $\mathbb{R}^d$, the results can be projected onto $\mathcal{W}$, and the update rule becomes

$$w_i^k = P_{\mathcal{W}}\big( w_{i-1}^k - \alpha_k \nabla f_i(w_{i-1}^k) \big), \qquad i = 1, 2, \cdots, n, \tag{16}$$

where $P_{\mathcal{W}}$ denotes projection on the set $\mathcal{W}$.

IG with diminishing stepsizes converges at rate for strongly convex sum function (Nedić and Bertsekas, 2001). If in addition to the strong convexity of the sum function, every component function is smooth, IG with diminishing stepsizes converges at rate (Gürbüzbalaban et al., 2015).

The convergence rate analysis of IG is valid regardless of the order of processing the elements. However, in practice, the convergence rate of IG is known to be quite sensitive to the order of processing the functions (Bertsekas, 2015; Gürbüzbalaban et al., 2017). If problem-specific knowledge can be used to find a favorable order $\sigma$ (defined as a permutation $\{\sigma_1, \dots, \sigma_n\}$ of $\{1, \dots, n\}$), IG can be updated to process the functions according to this order, i.e.,

$$w_i^k = w_{i-1}^k - \alpha_k \nabla f_{\sigma_i}(w_{i-1}^k), \qquad i = 1, 2, \cdots, n. \tag{17}$$

In general, a favorable order is not known in advance, and a common approach is to sample the function indices with replacement from the set $\{1, \dots, n\}$, which is called the Stochastic Gradient Descent (SGD) method.
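The cyclic update of Eqs. (15) and (17) can be sketched on a one-dimensional toy problem. The example below is illustrative only: the component functions $f_i(w) = \frac{1}{2}(w - a_i)^2$ and the values $a_i$ are made up, the diminishing schedule $\alpha_k = \alpha / k$ is an assumed choice for the sketch, and the function name is hypothetical.

```python
def incremental_gradient(grad_fns, w0, epochs, alpha, order=None):
    """Cyclic incremental gradient with per-epoch stepsize alpha / k.

    grad_fns[i](w) returns the gradient of f_i at w; order is an optional
    permutation of the component indices (the sigma of Eq. (17)).
    """
    idx = list(order) if order is not None else list(range(len(grad_fns)))
    w = w0
    for k in range(1, epochs + 1):
        step = alpha / k                 # assumed diminishing schedule
        for i in idx:
            w = w - step * grad_fns[i](w)
    return w

# f_i(w) = 0.5 * (w - a_i)^2, so the sum is minimized at the mean of a_i.
targets = [1.0, 2.0, 3.0, 6.0]
grad_fns = [lambda w, a=a: w - a for a in targets]
w_hat = incremental_gradient(grad_fns, w0=0.0, epochs=500, alpha=0.5)
print(w_hat)
```

Passing a different `order` changes the path of the iterates within each epoch, while the limit point for this toy problem remains the mean of the targets.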

### 4.2 Convergence Rate of IG on CRAIG Subsets

Next we analyze the convergence rate of IG applied to the weighted and ordered subset $S$ found by CRAIG. In particular, we show that (1) applying IG to $S$ converges to a close neighborhood of the optimal solution and that (2) this convergence happens at the same rate (same number of epochs) as IG on the full data. Formally, every step of IG on the subset becomes

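$$w_i^k = w_{i-1}^k - \alpha_k \gamma_{s_{\sigma_i}} \nabla f_{s_{\sigma_i}}(w_{i-1}^k), \qquad i = 1, 2, \cdots, r, \quad s_i \in S, \quad |S| = r. \tag{18}$$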

Here, $\sigma$ is a permutation of $\{1, \dots, r\}$, and the per-element stepsize $\gamma_{s_i}$ for every function $f_{s_i}$ is the weight of the element $s_i \in S$ and is fixed for all epochs.
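A minimal sketch of the weighted update in Eq. (18) is given below. It reuses a made-up one-dimensional problem with $f_i(w) = \frac{1}{2}(w - a_i)^2$; the subset and its weights are chosen by hand for the illustration (one representative per cluster, weighted by cluster size), and the schedule $\alpha_k = \alpha / k$ is an assumption of the sketch.

```python
def weighted_incremental_gradient(grad_fns, subset, weights, w0, epochs, alpha):
    """Incremental gradient restricted to a weighted subset (Eq. (18)).

    Each selected element j takes a step scaled by its weight gamma_j, so one
    epoch costs len(subset) gradient evaluations instead of len(grad_fns).
    """
    w = w0
    for k in range(1, epochs + 1):
        step = alpha / k                 # assumed diminishing schedule
        for j in subset:
            w = w - step * weights[j] * grad_fns[j](w)
    return w

# Two clusters of toy targets; the full-data optimum is their mean, 3.0.
targets = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
grad_fns = [lambda w, a=a: w - a for a in targets]
subset = [0, 3]                          # hand-picked representatives
weights = {0: 3, 3: 3}                   # cluster sizes
w_sub = weighted_incremental_gradient(grad_fns, subset, weights,
                                      w0=0.0, epochs=1000, alpha=0.2)
print(w_sub, len(subset) / len(targets))
```

In this toy case the weighted subset gradient matches the full gradient exactly, so the iterates approach the full-data optimum while each epoch touches only a third of the components.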

### 4.3 Convergence for Strongly Convex f

We first provide the convergence analysis for the case where the function in Problem (1) is strongly convex, i.e. we have .

###### Theorem 1.

Assume that is strongly convex, and is a weighted subset of size such that . Then for the iterates generated by applying IG to with per-epoch stepsize with and , we have

• if , then ,

• if , then , for

• if , then

where is an upper-bound on the norm of the component function gradients, i.e. , is the largest per-element step size, and , where is the initial distance to the optimum .

All the proofs can be found in the Appendix. The above theorem shows that IG on converges at the same rate of IG on the entire data set . However, compared to IG on , the speedup of IG on comes at the price of getting an extra error term, .

### 4.4 Convergence for Smooth and Strongly Convex f

If in addition to strong convexity of the expected risk, each component function has a Lipschitz gradient, i.e. we have , then we get the following results about the iterates generated by applying IG to the weighted subset returned by CRAIG.

###### Theorem 2.

Assume that is strongly convex and let be convex and twice continuously differentiable component functions with Lipschitz gradients on . Given a subset such that . Then for the iterates generated by applying IG to with per-epoch stepsize with and , we have

• if , then ,

• if , then , for

• if , then ,

where is the sum of gradient Lipschitz constants of the component functions.

The above theorem shows that for , IG applied to converges to a neighborhood of the optimal solution, with a rate of which is the same convergence rate for IG on the entire data set . As shown in our experiments, in real data sets small weighted subsets constructed by CRAIG provide a close approximation to the full gradient. Hence, applying IG to the weighted subsets returned by CRAIG provides a solution of the same or higher quality compared to the solution obtained by applying IG to the whole data set, in a considerably shorter amount of time.

## 5 Experiments

In our experimental evaluation we wish to address the following questions: (1) How do loss and accuracy of IG applied to the subsets returned by CRAIG compare to loss and accuracy of IG applied to the entire data? (2) How small is the size of the subsets that we can select with CRAIG and still get a comparable performance to IG applied to the entire data? And (3) How well does CRAIG scale to large data sets and extend to non-convex problems? In our experiments, we report the run-time as the wall-clock time for subset selection with CRAIG, plus minimizing the loss using IG or other optimizers with the specified learning rates. For the classification problems, we separately select subsets from each class while maintaining the class ratios in the whole data, and apply IG to the union of the subsets. We separately tune each method so that it performs at its best.

### 5.1 Convex Experiments

In our convex experiments, we apply CRAIG to SGD, as well as SVRG (Johnson and Zhang, 2013), and SAGA (Defazio et al., 2014). We apply L2-regularized logistic regression: to classify the following two datasets from LIBSVM: (1) covtype.binary including 581,012 data points of 54 dimensions, and (2) Ijcnn1 including 49,990 training and 91,701 test data points of 22 dimensions. As covtype does not come with labeled test data, we randomly split the training data into halves to make the training/test split (training and test sets are consistent for different methods).

For the convex experiments, we tuned the learning rate for each method (including the random baseline) by preferring smaller training loss from a large number of parameter combinations for two types of learning scheduling: exponential decay and -inverse with parameters and to adjust. Furthermore, following Johnson and Zhang (2013) we set to .

CRAIG effectively minimizes the loss. Figure 1(top) compares training loss residual of SGD, SVRG, and SAGA on the 10% CRAIG set (blue), 10% random set (green), and the full dataset (orange). CRAIG effectively minimizes the training data loss (blue line) and achieves the same minimum as the entire dataset training (orange line) but much faster. Also notice that training on the random 10% subset of the data does not effectively minimize the training loss.

CRAIG has a good generalization performance. Figure 1(bottom) shows the test error rate of models trained on CRAIG vs. random vs. the full data. Notice that training on CRAIG subsets achieves the same generalization performance (test error rate) as training on the full data.

CRAIG achieves significant speedup. Figure 1 also shows that CRAIG achieves a similar training loss (top) and test error rate (bottom) as training on the entire set, but much faster. In particular, we obtain a speedup of 2.75x, 4.5x, 2.5x from applying IG, SVRG and SAGA on the subsets of size 10% from covtype obtained by CRAIG. Furthermore, Figure 3 compares the speedup achieved by CRAIG to reach a similar loss residual as that of SGD for subsets of size of Ijcnn1. We get a 5.6x speedup by applying SGD to subsets of size 30% obtained by CRAIG.

### 5.2 Non-convex Experiments

Our non-convex experiments involve applying CRAIG to train the following two neural networks: (1) Our smaller network is a fully-connected hidden layer of 100 nodes and ten softmax output nodes; sigmoid activation and L2 regularization with and mini-batches of size 10 on MNIST dataset of handwritten digits containing 60,000 training and 10,000 test images. (2) Our large neural network is ResNet-32 for CIFAR10 with convolution, average pooling and dense layers with softmax outputs and L2 regularization with . CIFAR 10 includes 50,000 training and 10,000 test images from 10 classes, and we used mini-batches of size 128. Both MNIST and CIFAR10 data sets are normalized into [0, 1] by division with 255.

CRAIG achieves considerable speedup. Figure 4 shows training loss, and test accuracy for training a 2-layer neural net on MNIST. For this problem, we used a constant learning rate of . Here, we apply CRAIG to select a subset of 30%-40% of the data at the beginning of every epoch and train only on the selected subset with the corresponding per-element stepsizes. Interestingly, in addition to achieving a speedup of 2x to 3x for training neural networks, the subsets selected by CRAIG provide a better generalization performance compared to models trained on the entire dataset.

CRAIG is data-efficient for training neural networks. Figure 5 shows test accuracy vs. the fraction of data selected for training ResNet-32 on CIFAR10. At the beginning of every epoch a subset of size 50% is chosen at random or by CRAIG from the training data. The network is trained only on the selected subset for that epoch. We apply both SGD and Adaptive Moment Estimation (Adam), which is a popular method for training neural networks (Kingma and Ba, 2014), to subsets obtained by CRAIG. Adam computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients. For SGD, we used the standard learning rate schedule for training ResNet-32 on CIFAR10, i.e., we start with an initial learning rate of 0.1, and decay the learning rate by a factor of 0.1 at epochs 100 and 150. For Adam, we used a learning rate of 0.001. It can be seen that CRAIG can identify the data points that are effective for training the neural network, and achieves a superior test accuracy by training on a smaller fraction of the training data.

## 6 Conclusion

We developed a method, CRAIG, for selecting a subset (coreset) of data points with their corresponding per-element stepsizes to speed up incremental gradient (IG) methods. In particular, we showed that weighted subsets that minimize the upper-bound on the estimation error of the full gradient maximize a submodular facility location function. Hence, the subset can be found using a fast greedy algorithm. We proved that IG on subsets $S$ found by CRAIG converges at the same rate as IG on the entire dataset $V$, while providing a speedup. In our experiments, we showed that various IG methods, including SAGA, SVRG, and Adam, run up to 6x faster on convex and up to 3x faster on non-convex problems on subsets found by CRAIG while achieving practically the same training loss and test error.

## Appendix A Convergence Rate Analysis

We first prove the following lemma, which is an extension of Chung (1954, Lemma 4).

###### Lemma 3.

Let $\{u_k\}$ be a sequence of real numbers. Assume there exists $k_0$ such that

$$u_{k+1} \leq \Big(1 - \frac{c}{k}\Big) u_k + \frac{e}{k^p} + \frac{d}{k^{p+1}}, \qquad \forall k \geq k_0,$$

where $c$, $d$, and $e$ are given real numbers. Then

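\begin{align}
u_k &\leq (d k^{-1} + e)(c - p + 1)^{-1} k^{-p+1} + o(k^{-p+1}) && \text{for } c > p - 1,\ p \geq 1 \tag{19} \\
u_k &= O(k^{-c} \log k) && \text{for } c = p - 1,\ p > 1 \tag{20} \\
u_k &= O(k^{-c}) && \text{for } c < p - 1,\ p > 1 \tag{21}
\end{align}
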
###### Proof.

Let $c > p - 1$ and $v_k = k^{p-1} u_k - \frac{d}{k(c-p+1)} - \frac{e}{c-p+1}$. Then, using Taylor approximation we can write

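\begin{align}
v_{k+1} &= (k+1)^{p-1} u_{k+1} - \frac{d}{(k+1)(c-p+1)} - \frac{e}{c-p+1} \tag{23} \\
&\leq k^{p-1} \Big(1 + \frac{1}{k}\Big)^{p-1} \Big( \Big(1 - \frac{c}{k}\Big) u_k + \frac{e}{k^p} + \frac{d}{k^{p+1}} \Big) - \frac{d}{(k+1)(c-p+1)} - \frac{e}{c-p+1} \tag{24} \\
&= k^{p-1} u_k \Big(1 - \frac{c-p+1}{k} + o\Big(\frac{1}{k}\Big)\Big) + \frac{e}{k} \Big(1 + \frac{p-1}{k} + o\Big(\frac{1}{k}\Big)\Big) \tag{25} \\
&\quad + \frac{d}{k^2} \Big(1 + \frac{p-1}{k} + o\Big(\frac{1}{k}\Big)\Big) - \frac{d}{(k+1)(c-p+1)} - \frac{e}{c-p+1} \tag{26} \\
&= \Big( v_k + \frac{d}{k(c-p+1)} + \frac{e}{c-p+1} \Big) \Big(1 - \frac{c-p+1}{k} + o\Big(\frac{1}{k}\Big)\Big) \tag{27} \\
&\quad + \frac{e}{k} \Big(1 + \frac{p-1}{k} + o\Big(\frac{1}{k}\Big)\Big) + \frac{d}{k^2} \Big(1 + \frac{p-1}{k} + o\Big(\frac{1}{k}\Big)\Big) \tag{28} \\
&\quad - \frac{d}{(k+1)(c-p+1)} - \frac{e}{c-p+1} \tag{29} \\
&= v_k \Big(1 - \frac{c-p+1}{k} + o\Big(\frac{1}{k}\Big)\Big) + \frac{d/(c-p+1)}{k(k+1)} + \frac{e(p-1)}{k^2} + \frac{d(p-1)}{k^3} + o\Big(\frac{1}{k^2}\Big) \tag{30}
\end{align}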

Note that for $c > p - 1$, we have

$$\sum_{k=0}^{\infty} \Big(1 - \frac{c-p+1}{k} + o\Big(\frac{1}{k}\Big)\Big) = \infty$$

and

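$$\Big( \frac{d/(c-p+1)}{k(k+1)} + \frac{e(p-1)}{k^2} + \frac{d(p-1)}{k^3} + o\Big(\frac{1}{k^2}\Big) \Big) \Big(1 - \frac{c-p+1}{k} + o\Big(\frac{1}{k}\Big)\Big)^{-1} \to 0.$$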

Therefore, , and we get Eq. 19. For , we have . Hence, converges into the region , with ratio .

Moreover, for we have

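\begin{align}
v_{k+1} &= u_{k+1} (k+1)^c \leq \Big[ \Big(1 - \frac{c}{k}\Big) u_k + \frac{e}{k^p} + \frac{d}{k^{p+1}} \Big] k^c \Big(1 + \frac{c}{k} + \frac{c^2}{2k^2} + o\Big(\frac{1}{k^2}\Big)\Big) \tag{31} \\
&= \Big(1 - \frac{c^2}{2k^2} + o\Big(\frac{1}{k^2}\Big)\Big) v_k + \frac{d}{k^{p-c+1}} \Big(1 + O\Big(\frac{1}{k}\Big)\Big) + \frac{e}{k^{p-c}} \Big(1 + \frac{c}{k} + O\Big(\frac{1}{k^2}\Big)\Big) \tag{32} \\
&\leq v_k + \frac{e'}{k^{p-c}} \tag{33}
\end{align}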

for sufficiently large . Summing over , we obtain that is bounded for (since the series converges for ) and for (since ). ∎

In addition, based on [Chung (1954), Lemma 5] for , we can write

 uk+1≤(1−cks)uk+ekp+dkt,0

Then, we have

$$u_k \leq \frac{e}{c} \frac{1}{k^{p-s}} + o\Big(\frac{1}{k^{p-s}}\Big). \tag{35}$$

### Proof of Theorem 1

We now provide the convergence rate for strongly convex functions building on the analysis of Nedić and Bertsekas (2001). For non-smooth functions, gradients can be replaced by sub-gradients.

Let . For every IG update on subset we have

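\begin{align}
\|w_j^k - w_*\|^2 &= \|w_{j-1}^k - \alpha_k \gamma_j \nabla f_j(w_{j-1}^k) - w_*\|^2 \tag{36} \\
&= \|w_{j-1}^k - w_*\|^2 - 2 \alpha_k \gamma_j \nabla f_j(w_{j-1}^k) \cdot (w_{j-1}^k - w_*) + \alpha_k^2 \|\gamma_j \nabla f_j(w_{j-1}^k)\|^2 \tag{37} \\
&\leq \|w_{j-1}^k - w_*\|^2 - 2 \alpha_k \big( f_j(w_{j-1}^k) - f_j(w_*) \big) + \alpha_k^2 \|\gamma_j \nabla f_j(w_{j-1}^k)\|^2. \tag{38}
\end{align}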

Adding the above inequalities over elements of $S$ we get

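\begin{align}
\|w^{k+1} - w_*\|^2 &\leq \|w^k - w_*\|^2 - 2 \alpha_k \sum_{j \in S} \big( f_j(w_{j-1}^k) - f_j(w_*) \big) + \alpha_k^2 \sum_{j \in S} \|\gamma_j \nabla f_j(w_{j-1}^k)\|^2 \tag{39} \\
&= \|w^k - w_*\|^2 - 2 \alpha_k \sum_{j \in S} \big( f_j(w^k) - f_j(w_*) \big) + 2 \alpha_k \sum_{j \in S} \big( f_j(w_{j-1}^k) - f_j(w^k) \big) + \alpha_k^2 \sum_{j \in S} \|\gamma_j \nabla f_j(w_{j-1}^k)\|^2 \tag{40}
\end{align}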

Using strong convexity we can write

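$$\|w^{k+1} - w_*\|^2 \leq \|w^k - w_*\|^2 - 2 \alpha_k \Big( \sum_{j \in S} \gamma_j \nabla f_j(w_*) \cdot (w^k - w_*) + \frac{\mu}{2} \|w^k - w_*\|^2 \Big) + 2 \alpha_k \sum_{j \in S} \big( f_j(w_{j-1}^k) - f_j(w^k) \big) + \alpha_k^2 \sum_{j \in S} \|\gamma_j \nabla f_j(w_{j-1}^k)\|^2 \tag{41}$$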

Using the Cauchy–Schwarz inequality, we know

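$$\Big| \sum_{j \in S} \gamma_j \nabla f_j(w_*) \cdot (w^k - w_*) \Big| \leq \Big\| \sum_{j \in S} \gamma_j \nabla f_j(w_*) \Big\| \cdot \|w^k - w_*\|. \tag{42}$$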

Hence,

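$$- \sum_{j \in S} \gamma_j \nabla f_j(w_*) \cdot (w^k - w_*) \leq \Big\| \sum_{j \in S} \gamma_j \nabla f_j(w_*) \Big\| \cdot \|w^k - w_*\|. \tag{43}$$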

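From the reverse triangle inequality, and the facts that $S$ is chosen in a way that $\big\| \sum_{i \in V} \nabla f_i(w_*) - \sum_{j \in S} \gamma_j \nabla f_j(w_*) \big\| \leq \epsilon$, and that $\sum_{i \in V} \nabla f_i(w_*) = 0$, we have $\big\| \sum_{j \in S} \gamma_j \nabla f_j(w_*) \big\| \leq \big\| \sum_{i \in V} \nabla f_i(w_*) \big\| + \epsilon = \epsilon$. Therefore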

$$\Big\| \sum_{j \in S} \gamma_j \nabla f_j(w_*) \Big\| \cdot \|w^k - w_*\| \leq \epsilon \cdot \|w^k - w_*\| \tag{44}$$

For a continuously differentiable function, the following condition is implied by the strong convexity condition

$$\|w^k - w_*\| \leq \frac{1}{\mu} \Big\| \sum_{j \in S} \gamma_j \nabla f_j(w^k) \Big\|. \tag{45}$$

Assuming gradients have a bounded norm $\max_{w, j} \|\nabla f_j(w)\| \leq C$, and the fact that $\sum_{j \in S} \gamma_j = n$, we can write

$$\Big\| \sum_{j \in S} \gamma_j \nabla f_j(w^k) \Big\| \leq n \cdot C. \tag{46}$$

Thus for initial distance $d_0$, we have

$$\|w^k - w_*\| \leq \min(n \cdot C, d_0) = R$$