# On Coresets for Logistic Regression

Alexander Munteanu
Department of Computer Science
TU Dortmund University
44227 Dortmund, Germany
alexander.munteanu@tu-dortmund.de
Chris Schwiegelshohn
Department of Computer Science
Sapienza University of Rome
00185 Rome, Italy
schwiegelshohn@diag.uniroma1.it
Christian Sohler
Department of Computer Science
TU Dortmund University
44227 Dortmund, Germany
christian.sohler@tu-dortmund.de
David P. Woodruff
Department of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA
dwoodruf@cs.cmu.edu
###### Abstract

Coresets are one of the central methods to facilitate the analysis of large data sets. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show a negative result, namely, that no strongly sublinear sized coresets exist for logistic regression. To deal with intractable worst-case instances we introduce a complexity measure $\mu(X)$, which quantifies the hardness of compressing a data set for logistic regression. $\mu(X)$ has an intuitive statistical interpretation that may be of independent interest. For data sets with bounded $\mu(X)$-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear $(1\pm\varepsilon)$-coreset. We illustrate the performance of our method by comparing to uniform sampling as well as to state-of-the-art methods in the area. The experiments are conducted on real-world benchmark data for logistic regression.

Preprint. Work in progress.

## 1 Introduction

Scalability is one of the central challenges of modern data analysis and machine learning. Algorithms with polynomial running time might be regarded as efficient in a conventional sense, but nevertheless become intractable when facing massive data sets. As a result, performing data reduction techniques in a preprocessing step to speed up a subsequent optimization problem has received considerable attention. A natural approach is to sub-sample the data according to a certain probability distribution. This approach has been successfully applied to a variety of problems including clustering (Langberg & Schulman, 2010; Feldman & Langberg, 2011), mixture models (Feldman et al., 2011; Lucic et al., 2016), low rank approximation (Cohen et al., 2017), spectral approximation (Alaoui & Mahoney, 2015; Li et al., 2013), and Nyström methods (Alaoui & Mahoney, 2015; Musco & Musco, 2017).

The unifying feature of these works is that the probability distribution is based on the sensitivity score of each point. Informally, the sensitivity of a point corresponds to the importance of the point with respect to the objective function we wish to minimize. If the total sensitivity, i.e., the sum of all sensitivity scores, is bounded by a reasonably small value, there exists a collection of input points known as a coreset with very strong aggregation properties. Given any candidate solution (e.g., a set of $k$ centers for $k$-means, or a hyperplane for linear regression), the objective function computed on the coreset evaluates to the objective function of the original data up to a small multiplicative error. See Sections 2 and 4 for formal definitions of sensitivity and coresets.

Our Contribution We investigate coresets for logistic regression within the sensitivity framework. Logistic regression is an instance of a generalized linear model where we are given data $Z \in \mathbb{R}^{n \times d}$ and labels $Y \in \{-1, 1\}^n$, and the optimization task consists of minimizing the negative log-likelihood with respect to the parameter $\beta \in \mathbb{R}^d$ (McCullagh & Nelder, 1989)

$$\mathcal{L}(\beta \mid Z, Y) = \sum_{i=1}^{n} \ln\bigl(1 + \exp(-Y_i Z_i \beta)\bigr).$$

Our first contribution is an impossibility result: logistic regression has no sublinear streaming algorithm. Due to a standard reduction between coresets and streaming algorithms, this also implies that logistic regression admits no coresets or bounded sensitivity scores in general.

Our second contribution is an investigation of available sensitivity sampling distributions for logistic regression. For points with large contribution, where $-Y_i Z_i \beta \gg 0$, the objective function increases by a term almost linear in $-Y_i Z_i \beta$. This questions the use of sensitivity scores designed for problems with squared cost functions such as $\ell_2$-regression, $k$-means, and $\ell_2$-based low-rank approximation, since we have to deal with an $\ell_1$-related cost function. Instead, we propose sampling from a mixture distribution with one component proportional to the square root of the leverage score, which is a good upper bound on the $\ell_1$-based sensitivity. The other component is uniform sampling to deal with the remaining domain, where the cost function consists of an exponential decay towards zero. Our experiments show that this distribution outperforms uniform and $k$-means based sensitivity sampling by a wide margin on real data sets. The algorithm is space efficient and can be implemented in a variety of models used to handle large data sets, such as $2$-pass streaming and massively parallel frameworks such as Hadoop and MapReduce. It can also be implemented in input sparsity time, that is, in time governed by $\mathrm{nnz}(Z)$, the number of non-zero entries of the data matrix (Clarkson & Woodruff, 2013).

Our third contribution is an analysis of our sampling distribution for a parametrized class of instances we call $\mu$-complex, placing our work in the framework of beyond worst-case analysis (Balcan et al., 2015; Roughgarden, 2017). The parameter $\mu$ roughly corresponds to the ratio between the log of correctly estimated odds and the log of incorrectly estimated odds. The condition of small $\mu$ is justified by the fact that for instances with large $\mu$, logistic regression exhibits methodological problems like imbalance and separability, cf. (Mehta & Patel, 1995; Heinze & Schemper, 2002). We show that the total sensitivity of logistic regression can be bounded in terms of $\mu$, and that our sampling scheme produces the first coreset of provably sublinear size, provided that $\mu$ is small enough.

Related Work There is more than a decade of extensive work on sampling based methods relying on the sensitivity framework for $\ell_2$-regression (Drineas et al., 2006, 2008; Li et al., 2013; Cohen et al., 2015) and $\ell_1$-regression (Clarkson, 2005; Sohler & Woodruff, 2011; Clarkson et al., 2016). These were generalized to $\ell_p$-regression for all $p \in [1, \infty)$ (Dasgupta et al., 2009; Woodruff & Zhang, 2013). More recent works study sampling methods for $M$-estimators (Clarkson & Woodruff, 2015a, b) and extensions to generalized linear models (Huggins et al., 2016; Molina et al., 2018). The contemporary theory behind coresets has been applied to logistic regression, first by Reddi et al. (2015) using first order gradient methods, and subsequently via sensitivity sampling by Huggins et al. (2016). In the latter work, the authors recovered the result that bounded sensitivity scores for logistic regression imply coresets. Explicit sublinear bounds on the sensitivity scores, as well as an algorithm for computing them, were left as an open question. Instead, they proposed using sensitivity scores derived from any $k$-means clustering for logistic regression. While high sensitivity scores of an input point for $k$-means provably do not imply a high sensitivity score of the same point for logistic regression, the authors observed that they can outperform uniform random sampling on a number of instances with a clustering structure. Recently and independently of our work, Tolochinsky & Feldman (2018) gave a coreset construction for logistic regression in a more general framework. Our construction is without regularization and can therefore also be applied to any regularized version of logistic regression, but we have constraints regarding the $\mu$-complexity of the input. Their result, in contrast, relies on regularization, which significantly changes the objective and does not carry over to the unconstrained version. They do not constrain the input, but their domain of optimization is bounded. This indicates that both results differ in many important points and are of independent interest.

All proofs and additional plots from the experiments are in Appendices A and B, respectively.

## 2 Preliminaries and Problem Setting

In logistic regression we are given a data matrix $Z \in \mathbb{R}^{n \times d}$ and labels $Y \in \{-1, 1\}^n$. Logistic regression has a negative log-likelihood (McCullagh & Nelder, 1989)

$$\mathcal{L}(\beta \mid Z, Y) = \sum_{i=1}^{n} \ln\bigl(1 + \exp(-Y_i Z_i \beta)\bigr)$$

which, from a learning and optimization perspective, is the objective function that we would like to minimize over $\beta \in \mathbb{R}^d$. For brevity we fold for all $i \in [n]$ the labels $Y_i$ as well as the factor $-1$ in the exponent into $X \in \mathbb{R}^{n \times d}$ comprising row vectors $x_i = -Y_i Z_i$. Let $g(z) = \ln(1 + \exp(z))$. For technical reasons we deal with a weighted version for weights $w \in \mathbb{R}^n$, where each weight satisfies $w_i > 0$. Any positive scaling of the all ones vector corresponds to the unweighted case. We denote by $D_w$ a diagonal matrix carrying the entries of $w$, i.e., $(D_w)_{ii} = w_i$, so that multiplying $D_w$ to a vector or matrix has the effect of scaling row $i$ by a factor of $w_i$. The objective function becomes

$$f_w(X\beta) = \sum_{i=1}^{n} w_i\, g(x_i\beta) = \sum_{i=1}^{n} w_i \ln\bigl(1 + \exp(x_i\beta)\bigr).$$
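The weighted objective can be evaluated in a few lines. The following is a minimal Python sketch, not the implementation used in our experiments; the helper names `fold_labels` and `logistic_objective` and the toy data are hypothetical and only serve as illustration. It relies on `numpy.logaddexp` to evaluate $\ln(1+\exp(z))$ without overflow for large $z$.

```python
import numpy as np

def fold_labels(Z, Y):
    """Fold labels Y_i in {-1, +1} and the factor -1 into rows x_i = -Y_i Z_i."""
    return -Y[:, None] * Z

def logistic_objective(X, w, beta):
    """Weighted objective f_w(X beta) = sum_i w_i * ln(1 + exp(x_i beta))."""
    z = X @ beta
    # logaddexp(0, z) computes ln(exp(0) + exp(z)) in a numerically stable way
    return float(np.sum(w * np.logaddexp(0.0, z)))

# Toy data, made up for illustration
Z = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]])
Y = np.array([1.0, -1.0, 1.0])
beta = np.array([0.2, -0.4])

X = fold_labels(Z, Y)
w = np.ones(len(Y))  # all-ones weights correspond to the unweighted case
value = logistic_objective(X, w, beta)
```

With all-ones weights, `value` coincides with the negative log-likelihood of the original data `Z` and labels `Y`; scaling `w` by a positive constant scales the objective by the same constant.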

In this paper we assume we have a very large number of observations in a moderate number of dimensions, that is, $n \gg d$. In order to speed up the computation and to lower memory and storage requirements we would like to significantly reduce the number of observations without losing much information in the original data. A suitable data compression reduces the size to a sublinear number of $o(n)$ data points while the dependence on $d$ and the approximation parameters may be polynomials of low degree. To achieve this, we design a so-called coreset construction for the objective function. A coreset is a possibly (re)weighted and significantly smaller subset of the data that approximates the objective value for any possible query point. More formally, we define coresets for the weighted logistic regression function.

###### Definition 1 ((1±ε)-coreset for logistic regression).

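Let $X \in \mathbb{R}^{n \times d}$ be a set of points weighted by $w \in \mathbb{R}^n_{>0}$. Then a set $C \in \mathbb{R}^{k \times d}$, (re)weighted by $u \in \mathbb{R}^k_{>0}$, is a $(1\pm\varepsilon)$-coreset of $X$ for $f_w$, if $k \ll n$ and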

$$\forall \beta \in \mathbb{R}^d:\ |f_w(X\beta) - f_u(C\beta)| \le \varepsilon \cdot f_w(X\beta).$$
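The coreset condition quantifies over all $\beta$, which cannot be checked exhaustively. As a sanity check one can at least evaluate the relative deviation on a finite set of queries. The sketch below is an illustration under our own assumptions (the function names and the random test queries are hypothetical); a large value refutes the coreset property, while a small value does not certify it.

```python
import numpy as np

def f(X, w, beta):
    """Weighted logistic objective sum_i w_i * ln(1 + exp(x_i beta))."""
    return float(np.sum(w * np.logaddexp(0.0, X @ beta)))

def coreset_violation(X, w, C, u, betas):
    """Largest observed |f_w(X b) - f_u(C b)| / f_w(X b) over the given queries."""
    return max(abs(f(X, w, b) - f(C, u, b)) / f(X, w, b) for b in betas)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w = np.ones(200)

# A trivial exact "coreset": every row appears twice in the input,
# so one copy with doubled weight reproduces the objective exactly.
X_dup = np.vstack([X, X])
w_dup = np.ones(400)
betas = rng.normal(size=(50, 3))
err = coreset_violation(X_dup, w_dup, X, 2.0 * w, betas)
```

Here `err` is zero up to floating point error, because merging exact duplicates preserves the objective for every query.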

$\mu$-Complex Data Sets We will see in Section 3 that in general, there is no sublinear one-pass streaming algorithm approximating the objective function up to any finite constant factor. More specifically, there exists no sublinear summary or coreset construction that works for all data sets. For the sake of developing coreset constructions that work reasonably well, as well as conducting a formal analysis beyond worst-case instances, we introduce a measure $\mu$ that quantifies the complexity of compressing a given data set.

###### Definition 2.

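Given a data set $X \in \mathbb{R}^{n \times d}$ weighted by $w \in \mathbb{R}^n_{>0}$ and a vector $\beta \in \mathbb{R}^d$, let $(D_w X\beta)^-$ denote the vector comprising only the negative entries of $D_w X\beta$. Similarly let $(D_w X\beta)^+$ denote the vector of positive entries. We define the hardness of compressing $X$ weighted by $w$ to be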

$$\mu_w(X) = \sup_{\beta \in \mathbb{R}^d \setminus \{0\}} \frac{\|(D_w X\beta)^+\|_1}{\|(D_w X\beta)^-\|_1}.$$

$X$ weighted by $w$ is called $\mu$-complex if $\mu_w(X) \le \mu$.

The size of our $(1\pm\varepsilon)$-coreset constructions for logistic regression for a given $\mu$-complex data set $X$ will have low polynomial dependency on $\mu$, $d$, and $1/\varepsilon$ but only sublinear dependency on its original size parameter $n$. So for $\mu$-complex data sets having small $\mu$ we have the first $(1\pm\varepsilon)$-coreset of provably sublinear size. The above definition implies, for $\mu_w(X) \le \mu$, the following inequalities. The reader should keep in mind that for all $\beta \in \mathbb{R}^d$

$$\mu^{-1}\|(D_w X\beta)^-\|_1 \le \|(D_w X\beta)^+\|_1 \le \mu\|(D_w X\beta)^-\|_1.$$

We note that the value of $\mu_w(X)$ can be approximated efficiently.

###### Theorem 3.

Let weighted by . Then a -approximation to the value of can be computed in time.

The parameter $\mu_w(X)$ has an intuitive interpretation and might be of independent interest. The odds of a binary random variable $V$ are defined as $\frac{P[V=1]}{P[V=0]}$. The model assumption of logistic regression is that for every sample $x_i$, the logarithm of the odds is a linear function of $x_i\beta$. For a candidate $\beta$, multiplying all odds and taking the logarithm is then exactly $\|(X\beta)^+\|_1 - \|(X\beta)^-\|_1$. Our definition now relates the probability mass due to the incorrectly predicted odds and the probability mass due to the correctly predicted odds. We say that the ratio between these two is upper bounded by $\mu$. For logistic regression, assuming they are within some order of magnitude is not uncommon. One extreme is the (degenerate) case where the data set is exactly separable. Choosing $\beta$ to parameterize a separating hyperplane for which $X\beta$ is all positive implies that $\mu_w(X) = \infty$. Another case is when we have a large ratio between the number of positively and negatively labeled points, which is a lower bound to $\mu_w(X)$. Note that under either of these conditions, logistic regression exhibits methodological weaknesses due to the separation or imbalance between the given classes, cf. (Mehta & Patel, 1995; Heinze & Schemper, 2002).
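To build intuition for Definition 2, the ratio inside the supremum can be evaluated for a fixed $\beta$, and maximizing it over sampled directions yields a lower bound on $\mu_w(X)$. The sketch below is only a heuristic illustration with hypothetical function names; it is not the approximation algorithm behind Theorem 3, and random search gives no guarantee of getting close to the supremum.

```python
import numpy as np

def mu_ratio(X, w, beta):
    """||(D_w X beta)^+||_1 / ||(D_w X beta)^-||_1 for one fixed beta."""
    z = w * (X @ beta)       # entries of D_w X beta
    pos = z[z > 0].sum()     # l1 norm of the positive entries
    neg = -z[z < 0].sum()    # l1 norm of the negative entries
    return np.inf if neg == 0 else pos / neg

def mu_lower_bound(X, w, num_directions=1000, seed=0):
    """Maximum ratio over random Gaussian directions: a lower bound on mu_w(X)."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(num_directions):
        beta = rng.normal(size=X.shape[1])
        best = max(best, mu_ratio(X, w, beta))
    return best

# One-dimensional toy example: two opposite points with weights 3 and 1.
X = np.array([[1.0], [-1.0]])
w = np.array([3.0, 1.0])
estimate = mu_lower_bound(X, w, num_directions=100)
```

In the toy example every positive $\beta$ attains the ratio $3$ and every negative $\beta$ attains $1/3$, so the estimate equals the true value $3$. For separable data some direction makes all entries positive and the ratio becomes infinite.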

## 3 Lower Bounds

In the following we will show that no efficient streaming algorithms or coresets for logistic regression can exist in general, even if we assume that the points lie in -dimensional Euclidean space. To this end we will reduce from the INDEX communication game. In its basic variant, there exist two players Alice and Bob. Alice is given a binary bit string $x \in \{0,1\}^n$ and Bob is given an index $i \in [n]$. The goal is to determine the value of $x_i$ with constant probability while using as little communication as possible. Clearly, the difficulty of the problem is inherently one-way; otherwise Bob could simply send his index to Alice. If the entire communication consists of only a single message sent by Alice to Bob, the message must contain $\Omega(n)$ bits (Kremer et al., 1999).

###### Theorem 4.

Let be an instance of logistic regression in -dimensional Euclidean space. Any one-pass streaming algorithm that approximates the optimal solution of logistic regression up to any finite multiplicative approximation factor requires bits of space.

The same reduction also holds if Alice’s message can only consist of input points forming a coreset. Hence, the following corollary holds.

###### Corollary 5.

Let be an instance of logistic regression in -dimensional Euclidean space. Any coreset of for logistic regression consists of at least points.

We note that independently, Tolochinsky & Feldman (2018) gave a linear lower bound in a more general context based on a worst case instance to the sensitivity approach due to Huggins et al. (2016). Our lower bounds and theirs are incomparable; they show that if a coreset can only consist of input points it comprises the entire data set in the worst-case. We show that no coreset with can exist, irrespective of whether input points are used. While the distinction may seem minor, a number of coreset constructions in literature necessitate the use of non-input points, see (Agarwal et al., 2004) and (Feldman et al., 2013).

## 4 Sampling via Sensitivity Scores

Our sampling based coreset constructions are obtained with the following approach, called sensitivity sampling. Suppose we are given a data set $X \in \mathbb{R}^{n \times d}$ together with weights $w \in \mathbb{R}^n_{>0}$ as in Definition 1. Recall the function under study is $f_w(X\beta) = \sum_{i=1}^{n} w_i\, g(x_i\beta)$. Associate with each point $x_i$ the function $g_i(\beta) = g(x_i\beta)$. Then we have the following definition.

###### Definition 6.

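(Langberg & Schulman, 2010) Consider a family of functions $\mathcal{F} = \{g_1, \ldots, g_n\}$ mapping from $\mathbb{R}^d$ to $[0, \infty)$ and weighted by $w \in \mathbb{R}^n_{>0}$. The sensitivity of $g_i$ for $f_w(\beta) = \sum_{i=1}^{n} w_i g_i(\beta)$ is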

$$\varsigma_i = \sup \frac{w_i g_i(\beta)}{f_w(\beta)} \tag{1}$$

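where the $\sup$ is over all $\beta \in \mathbb{R}^d$ with $f_w(\beta) > 0$. If this set is empty then $\varsigma_i = 0$. The total sensitivity is $\mathfrak{S} = \sum_{i=1}^{n} \varsigma_i$.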
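The supremum in (1) is generally hard to compute, but the ratio is easy to evaluate for a fixed query. The following sketch, with hypothetical function names and random test queries of our own choosing, computes the ratios for single queries and their pointwise maximum over a finite query set. The result is only a lower bound on the true sensitivities, so it illustrates the definition rather than providing a sampling distribution with guarantees.

```python
import numpy as np

def sensitivity_ratios(X, w, beta):
    """The ratios w_i g_i(beta) / f_w(beta) for all i at one fixed query beta."""
    costs = w * np.logaddexp(0.0, X @ beta)  # w_i * ln(1 + exp(x_i beta))
    return costs / costs.sum()

def sensitivity_lower_bounds(X, w, betas):
    """Pointwise maximum of the ratios over a finite set of queries."""
    return np.max([sensitivity_ratios(X, w, b) for b in betas], axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w = np.ones(100)
betas = rng.normal(size=(200, 3))

ratios = sensitivity_ratios(X, w, betas[0])
lower = sensitivity_lower_bounds(X, w, betas)
```

For a single query the ratios sum to one. Taking the maximum over several queries can only increase each entry, which is why the total sensitivity is at least one and can be much larger.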

The sensitivity of a point measures its worst-case importance for approximating the objective function on the entire input data set. Performing importance sampling proportional to the sensitivities of the input points thus yields a good approximation. Computing the sensitivities exactly is often intractable and involves solving the original optimization problem to near-optimality, which is the problem we want to solve in the first place, as pointed out in (Braverman et al., 2016). To get around this, it was shown that sampling proportional to any upper bounds on the sensitivities also has provable guarantees. However, the number of samples needed then depends on the sum of these upper bounds, which takes the place of the total sensitivity, so we need to carefully control this quantity. Another complexity measure that plays a crucial role in the sampling complexity is the VC dimension of the range space induced by the set of functions under study.

###### Definition 7.

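A range space is a pair $\mathfrak{R} = (\mathcal{F}, \mathrm{ranges})$ where $\mathcal{F}$ is a set and $\mathrm{ranges}$ is a family of subsets of $\mathcal{F}$. The VC dimension $\Delta(\mathfrak{R})$ of $\mathfrak{R}$ is the size $|G|$ of the largest subset $G \subseteq \mathcal{F}$ such that $G$ is shattered by $\mathrm{ranges}$, i.e., $|\{G \cap R \mid R \in \mathrm{ranges}\}| = 2^{|G|}$.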

###### Definition 8.

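Let $\mathcal{F}$ be a finite set of functions mapping from $\mathbb{R}^d$ to $\mathbb{R}_{\ge 0}$. For every $x \in \mathbb{R}^d$ and $r \in \mathbb{R}_{\ge 0}$, let $\mathrm{range}_{\mathcal{F}}(x, r) = \{f \in \mathcal{F} \mid f(x) \ge r\}$, and $\mathrm{ranges}(\mathcal{F}) = \{\mathrm{range}_{\mathcal{F}}(x, r) \mid x \in \mathbb{R}^d, r \in \mathbb{R}_{\ge 0}\}$, and $\mathfrak{R}_{\mathcal{F}} = (\mathcal{F}, \mathrm{ranges}(\mathcal{F}))$ be the range space induced by $\mathcal{F}$.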

Recently a framework combining the sensitivity scores with a theory on the VC dimension of range spaces was developed in (Braverman et al., 2016). For technical reasons we use a slightly modified version.

###### Theorem 9.

Consider a family of functions mapping from to and a vector of weights . Let . Let . Let . Given one can compute in time a set of

$$O\left(\frac{S}{\varepsilon^2}\left(\Delta \log S + \log\left(\frac{1}{\delta}\right)\right)\right)$$

weighted functions such that with probability $1 - \delta$ we have for all $\beta \in \mathbb{R}^d$ simultaneously

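$$\left|\sum_{f_i \in \mathcal{F}} w_i f_i(\beta) - \sum_{f_i \in R} u_i f_i(\beta)\right| \le \varepsilon \sum_{f_i \in \mathcal{F}} w_i f_i(\beta).$$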

where each element of is sampled i.i.d. with probability from , denotes the weight of a function that corresponds to , and where is an upper bound on the VC dimension of any range space induced by that can be obtained by defining to be the set of functions from where each function is scaled by a non-negative scalar.
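The sampling step of such a scheme can be sketched as follows. This is a minimal illustration under our own assumptions, not a restatement of the theorem: given upper bounds `s` on the sensitivities, indices are drawn i.i.d. with probability `s[i] / s.sum()`, and each sampled function receives the weight `w[i] / (k * p[i])`, which is the standard importance sampling choice that makes the subsampled objective an unbiased estimate of the full one. The function name and the toy inputs are hypothetical.

```python
import numpy as np

def sensitivity_sample(w, s, k, seed=0):
    """Draw k indices i.i.d. with p_i = s_i / S and return them with new weights."""
    rng = np.random.default_rng(seed)
    p = s / s.sum()                              # sampling probabilities
    idx = rng.choice(len(s), size=k, replace=True, p=p)
    u = w[idx] / (k * p[idx])                    # equals S * w_i / (s_i * k)
    return idx, u

# With identical scores the scheme reduces to uniform sampling,
# and every sampled point gets weight n / k.
n, k = 10, 5
w = np.ones(n)
s = np.ones(n)
idx, u = sensitivity_sample(w, s, k)
```

Points with a large score are sampled more often but receive a smaller weight, so their expected contribution to the objective stays the same.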

Now we show that the VC dimension of the range space induced by the set of functions studied in logistic regression equals the VC dimension of the set of linear classifiers.

###### Lemma 10.

The range space induced by satisfies .

Theorem 9 also holds for the weighted case. To see this, note that we can scale the weights such that they become positive integers and treat the resulting expansion as a sum over an unweighted multiset of points. This does not affect the multiplicative error guarantee of Theorem 9. Lemma 10 applies to the set of functions associated with the inflated set of points, since the bound does not depend on the number of elements. It remains for us to derive tight and efficiently computable upper bounds on the sensitivities.
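The reduction from integer weights to an unweighted multiset can be checked numerically. The snippet below is a small sketch with made-up data: it repeats row $i$ exactly $w_i$ times and compares the two objective values.

```python
import numpy as np

def f(X, w, beta):
    """Weighted logistic objective sum_i w_i * ln(1 + exp(x_i beta))."""
    return float(np.sum(w * np.logaddexp(0.0, X @ beta)))

X = np.array([[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]])
w = np.array([3, 1, 2])                      # positive integer weights
X_expanded = np.repeat(X, w, axis=0)         # row i is repeated w_i times
beta = np.array([0.3, -0.1])

weighted = f(X, w, beta)
unweighted = f(X_expanded, np.ones(len(X_expanded)), beta)
```

Both values agree up to floating point error, since the expanded sum contains the term for row $i$ exactly $w_i$ times.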

Base Algorithm We show that sampling proportional to the square root of the -leverage scores augmented by yields a coreset whose size is roughly linear in and the dependence on the input size is roughly . In what follows, let .

We make a case distinction covered by Lemmas 11 and 12. The intuition in the first case is that for a sufficiently large positive entry $z$, we have that $z \le g(z) \le 2z$. The lower bound holds even for all non-negative entries. Moreover, for $\mu$-complex inputs we are able to relate the $\ell_1$ norm of all entries to the positive ones, which will yield the desired bound, arguing similarly to the techniques of Clarkson & Woodruff (2015b).
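The intuition for the first case can be made concrete with a short calculation. This is a sketch under the assumption that the relevant bound has the form $z \le g(z) \le 2z$ for $g(z) = \ln(1+\exp(z))$; the threshold $\ln 2$ is chosen here for illustration and need not match the constant used in the proofs.

$$
g(z) = \ln(1 + e^{z}) \ge \ln(e^{z}) = z \quad \text{for all } z \in \mathbb{R},
$$

$$
g(z) = \ln(1 + e^{z}) \le \ln(2 e^{z}) = z + \ln 2 \le 2z \quad \text{for all } z \ge \ln 2.
$$

The second line uses $1 \le e^{z}$ for $z \ge 0$. Hence for large positive entries the cost $g(x_i\beta)$ behaves like the entry $x_i\beta$ itself up to a factor of two, which is what connects the logistic cost to an $\ell_1$-type cost.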

###### Lemma 11.

Let $X$ weighted by $w$ be $\mu$-complex. If for index $i$, the supremum in (1) satisfies then .

In the second case, the element under study is bounded by a constant. We consider two subcases. If there are many contributions that are not too small, and thus cost at least a constant each, then we can lower bound the total cost by a constant times their total weight. If on the other hand there are many very small negative values, then this implies again that the cost is within a fraction of the total weight.

###### Lemma 12.

Let $X$ weighted by $w$ be $\mu$-complex. If for index $i$, the supremum in (1) satisfies then .

Combining both lemmas yields general upper bounds on the sensitivities that we can use as an importance sampling distribution. We also derive an upper bound on the total sensitivity that will be used to bound the sampling complexity.

###### Lemma 13.

Let weighted by be -complex. Let be an orthonormal basis for the columnspace of . For each , the sensitivity of for the weighted logistic regression function is bounded by . The total sensitivity is bounded by .
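A sampling distribution of this kind can be sketched with an exact QR decomposition. The code below is an illustration under stated assumptions rather than the construction analyzed in the lemma: we assume the orthonormal basis is taken for the column space of $D_w X$, that $D_w X$ has full column rank, and that the uniform component is $w_i$ divided by the total weight. The function name is hypothetical. The square root of the $i$-th leverage score is the Euclidean norm of the $i$-th row of the orthonormal basis.

```python
import numpy as np

def qr_sampling_scores(X, w):
    """Square roots of the leverage scores of D_w X plus a uniform component."""
    DX = w[:, None] * X                        # D_w X: row i scaled by w_i
    Q, _ = np.linalg.qr(DX, mode="reduced")    # orthonormal basis of the column space
    sqrt_leverage = np.linalg.norm(Q, axis=1)  # ||Q_i||_2 = sqrt of i-th leverage score
    scores = sqrt_leverage + w / w.sum()       # assumed uniform part: w_i / total weight
    return sqrt_leverage, scores

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
w = np.ones(500)
sqrt_leverage, scores = qr_sampling_scores(X, w)
probabilities = scores / scores.sum()
```

The leverage scores always sum to the rank of the matrix, here $d = 4$, which is a convenient check. The normalized `probabilities` can then be used for importance sampling.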

We combine the above results into the following theorem.

###### Theorem 14.

Let weighted by be -complex. Let . There exists a -coreset of for logistic regression of size . Such a coreset can be constructed in two passes over the data, in time, and with success probability for any absolute constant .

Recursive Algorithm Here we develop a recursive algorithm, inspired by the recursive sampling technique of Clarkson & Woodruff (2015a) for the Huber $M$-estimator, though adapted here for logistic regression. This yields a better dependence on the input size. More specifically, we can diminish the leading factor to only for an absolute constant . One complication is that the parameter $\mu$ grows in the recursion, which we need to control, while another complication is having to deal with the separate $\ell_1$ and uniform parts of our sampling distribution.

We apply the algorithm of Theorem 14 recursively. To do so, we need to ensure that after one stage of subsampling and reweighting, the resulting data set remains $\mu'$-complex for a value $\mu'$ that is not too much larger than $\mu$. To this end, we first bound the VC dimension of a range space induced by an $\ell_1$-related family of functions.

###### Lemma 15.

The range space induced by satisfies .

Applying Theorem 9 to this family implies that the subsample of Theorem 14 satisfies a so-called $\varepsilon$-subspace embedding property for $\ell_1$. Note that, by linearity of the $\ell_1$-norm, we can fold the weights into $D_w X$.

###### Lemma 16.

Let $T$ be a sampling and reweighting matrix according to Theorem 14, i.e., $T D_w X$ is the resulting reweighted sample when Theorem 14 is applied to $\mu$-complex input $X$ weighted by $w$. Then with probability , for all $\beta \in \mathbb{R}^d$ simultaneously

$$(1-\varepsilon)\|D_w X\beta\|_1 \le \|T D_w X\beta\|_1 \le (1+\varepsilon)\|D_w X\beta\|_1.$$

Using this, we can show that the -complexity property is not violated too much after one stage of sampling.

###### Lemma 17.

Let be a sampling and reweighting matrix according to Theorem 14 with parameter . That is is the resulting reweighted sample when Theorem 14 succeeds on -complex input . Suppose that simultaneously Lemma 16 holds. Let

$$\mu' = \mu_{Tw}(X) = \sup_{\beta \in \mathbb{R}^d} \frac{\|(T D_w X\beta)^+\|_1}{\|(T D_w X\beta)^-\|_1}.$$

Then we have

Now we are ready to prove our theorem regarding the recursive subsampling algorithm.

###### Theorem 18.

Let be -complex. Let , and . There exists a -coreset of for logistic regression of size . Such a coreset can be constructed in time in passes over the data for a small , assuming the machine has access to memory. The success probability is for any absolute constant .

## 5 Experiments

We ran a series of experiments to illustrate the performance of our coreset method. All experiments were run on a Linux machine using an Intel i7-6700, 4 core CPU at 3.4 GHz, and 32GB of RAM. We implemented our algorithms in Python. We compare our basic algorithm to simple uniform sampling and to sampling proportional to the sensitivity upper bounds given by Huggins et al. (2016).

Implementation Details The approach of Huggins et al. (2016) is based on a $k$-means++ (Arthur & Vassilvitskii, 2007) clustering on a small uniform subset of the data and was performed using standard parameters taken from the publication. For this purpose we used parts of their original Python code. However, we removed the restriction of the domain of optimization to a region of small radius around the origin. This way, we enabled unconstrained regression in the domain $\mathbb{R}^d$.

The exact QR-decomposition is rather slow on large data matrices. We thus optimized the running time of our approach in the following way. We used a fast approximation algorithm based on the sketching techniques of Clarkson & Woodruff (2013), cf. (Woodruff, 2014). That leads to a provable constant approximation of the square root of the leverage scores with constant probability, cf. (Drineas et al., 2012), which means that the total sensitivity bounds given in our theory will grow by only a small constant factor. A detailed description of the algorithm is in the proof of Theorem 14.
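The idea of replacing the exact QR-decomposition by a sketched one can be illustrated as follows. This is a simplified sketch under our own assumptions, not the code used in the experiments: a CountSketch matrix compresses the data to `m` rows, the triangular factor `R` of the sketched matrix is computed, and the row norms of $A R^{-1}$ serve as approximate square roots of the leverage scores. The sketch size `m`, the seed, and the function names are hypothetical. For simplicity the product $A R^{-1}$ is formed exactly, which costs $O(nd^2)$ time; the faster variants from the literature additionally apply a random projection at this step.

```python
import numpy as np

def countsketch(A, m, rng):
    """Apply a CountSketch: each row of A is added with a random sign to one of m buckets."""
    n = A.shape[0]
    buckets = rng.integers(0, m, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    SA = np.zeros((m, A.shape[1]))
    np.add.at(SA, buckets, signs[:, None] * A)  # unbuffered accumulation per bucket
    return SA

def approx_sqrt_leverage(A, m, seed=0):
    """Approximate square roots of the leverage scores of A via a sketched QR."""
    rng = np.random.default_rng(seed)
    SA = countsketch(A, m, rng)
    _, R = np.linalg.qr(SA, mode="reduced")     # SA = Q' R with R of size d x d
    AR_inv = np.linalg.solve(R.T, A.T).T        # A R^{-1}, computed without inverting R
    return np.linalg.norm(AR_inv, axis=1)

rng = np.random.default_rng(1)
A = rng.normal(size=(2000, 5))
Q, _ = np.linalg.qr(A, mode="reduced")
exact = np.linalg.norm(Q, axis=1)               # exact square roots of leverage scores
approx = approx_sqrt_leverage(A, m=500)
ratio = approx / exact
```

If the sketch preserves the norms of all vectors in the column space of `A` up to a constant factor, the entries of `ratio` stay within constant factors of one, which is the type of guarantee needed for the sampling distribution.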

The subsequent optimization was done for all approaches with the standard gradient based optimizer from the scipy.optimize package.

Data Sets We briefly introduce the data sets that we used. The Webb Spam data consists of unigrams with features from web pages which have to be classified as spam or normal pages ( positive). The Covertype data consists of cartographic observations of different forests with features. The task is to predict the type of trees at each location ( positive). The KDD Cup ’99 data comprises network connections with features and the task is to detect network intrusions ( positive).

Experimental Assessment For each data set we assessed the total running times for computing the sampling probabilities, sampling and optimizing on the sample. In order to assess the approximation accuracy we examined the relative error of the negative log-likelihood for the maximum likelihood estimators obtained from the full data set and the subsamples .
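The evaluation protocol can be sketched on synthetic data. The snippet below is an illustration under our own assumptions: the data are simulated from a logistic model, the subsample is drawn uniformly, and the relative error is taken to be the excess negative log-likelihood of the subsample estimator on the full data, divided by the optimal value. All names, sizes, and the choice of `L-BFGS-B` in `scipy.optimize.minimize` are hypothetical and not taken from our experimental code.

```python
import numpy as np
from scipy.optimize import minimize

def objective_and_gradient(beta, X, w):
    """Weighted logistic objective and its gradient with respect to beta."""
    z = X @ beta
    log_terms = np.logaddexp(0.0, z)           # ln(1 + exp(z)), overflow safe
    value = np.sum(w * log_terms)
    sigmoid = np.exp(z - log_terms)            # exp(z) / (1 + exp(z))
    gradient = X.T @ (w * sigmoid)
    return value, gradient

def fit(X, w):
    """Minimize the weighted objective with a gradient based optimizer."""
    result = minimize(objective_and_gradient, np.zeros(X.shape[1]),
                      args=(X, w), jac=True, method="L-BFGS-B")
    return result.x

rng = np.random.default_rng(0)
n, d, k = 5000, 3, 500
Z = rng.normal(size=(n, d))
true_beta = np.array([1.0, -1.0, 0.5])
prob_pos = 1.0 / (1.0 + np.exp(-(Z @ true_beta)))
Y = np.where(rng.random(n) < prob_pos, 1.0, -1.0)
X = -Y[:, None] * Z                            # fold labels into the data
w = np.ones(n)

beta_full = fit(X, w)                          # estimator on the full data
idx = rng.choice(n, size=k, replace=False)     # uniform subsample
beta_sub = fit(X[idx], np.full(k, n / k))      # reweighted by n / k

full_value, _ = objective_and_gradient(beta_full, X, w)
sub_value, _ = objective_and_gradient(beta_sub, X, w)
relative_error = (sub_value - full_value) / full_value
```

Any other sampling distribution can be plugged in by replacing `idx` and the weights passed to `fit`, which allows a comparison of sampling schemes at equal sample size.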

For each data set, we ran all three subsampling algorithms for a number of thirty regular subsampling steps in the range . For each step, we present the mean relative error as well as the trade-off between mean relative error and running time, taken over twenty independent repetitions, in Figure 1. Relative running times, standard deviations and absolute values are presented in Figure 2 respectively in Table 1 in Appendix B.

Evaluation The accuracy of the QR-sampling distribution outperforms uniform sampling and the distribution derived from $k$-means on all instances. This is especially true for small sampling sizes, where the relative error, especially for uniform sampling, tends to deteriorate. While $k$-means sampling occasionally improved over uniform sampling for small sample sizes, the behavior of both distributions was similar for larger sampling sizes. The standard deviations had a similarly low magnitude as the mean values, where the QR method usually showed the lowest values.

The trade-off between the running time and relative errors shows a common picture for Webb Spam and Covertype. QR is nearly always more accurate than the other algorithms for a similar time budget, except for regions where the relative error is large, say above 5-10% while for larger time budgets, QR is better by a factor between - and drops more quickly towards . The conclusion so far could be that for a quick guess, say a -approximation, the competitors are faster, but to provably obtain a reasonably small relative error below 5%, QR outperforms its competitors. However, for KDD Cup ’99, QR always has a lower error than its competitors. Their relative errors remain above 15% or much worse, while QR never exceeds 22% and drops quickly below 4%.

The relative running time for the QR-distribution was comparable to $k$-means and only slightly higher than uniform sampling. However, it never exceeded a factor of two compared to its competitors and remained negligible compared to the full optimization task, see Figure 2 in Appendix B. The standard deviations were negligible except for the $k$-means algorithm and the KDD Cup ’99 data set, where the uniform and $k$-means based algorithms showed larger values. The QR method had much lower standard deviations. This indicates that the resulting coresets are more stable for the subsequent numerical optimization.

We note that the savings of all presented data reduction methods become even more significant when performing more time consuming data analysis tasks like MCMC sampling in a Bayesian setting, see e.g., (Huggins et al., 2016; Geppert et al., 2017).

## 6 Conclusions

We first showed that (sublinear) coresets for logistic regression do not exist in general. It is thus necessary to make further assumptions on the nature of the data. To this end we introduced a new complexity measure $\mu(X)$, which quantifies the amount of overlap of positive and negative classes and the balance in their cardinalities. We developed the first rigorous sublinear $(1\pm\varepsilon)$-coresets for logistic regression, given that the original data has small $\mu$-complexity. The leading factor is . We have further developed a recursive coreset construction that reduces the dependence on the input size to only for absolute constant . This comes at the cost of an increased dependence on $\mu$. However, it is beneficial for very large and well-behaved data. Our algorithms are space efficient and can be implemented in a variety of models used to tackle the challenges of large data sets, such as $2$-pass streaming and massively parallel frameworks like Hadoop and MapReduce. They can also be implemented to run in input sparsity time, which is especially beneficial for sparsely encoded input data.

Our experimental evaluation shows that our implementation of the basic algorithm outperforms uniform sampling as well as state of the art methods in the area of coresets for logistic regression while being competitive to both regarding its running time.

## Acknowledgments

This work was partly supported by the German Science Foundation (DFG) Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Analysis", projects A2 and C4.

## References

• Agarwal et al. (2004) Agarwal, Pankaj K., Har-Peled, Sariel, and Varadarajan, Kasturi R. Approximating extent measures of points. Journal of the ACM, 51(4):606–635, 2004.
• Alaoui & Mahoney (2015) Alaoui, Ahmed El and Mahoney, Michael W. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems 28 (NIPS), pp. 775–783, 2015.
• Arthur & Vassilvitskii (2007) Arthur, David and Vassilvitskii, Sergei. k-means++: the advantages of careful seeding. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1027–1035, 2007.
• Balcan et al. (2015) Balcan, Marina-Florina, Manthey, Bodo, Röglin, Heiko, and Roughgarden, Tim. Analysis of algorithms beyond the worst case (Dagstuhl seminar 14372). Dagstuhl Reports, 4(9):30–49, 2015.
• Blumer et al. (1989) Blumer, Anselm, Ehrenfeucht, Andrzej, Haussler, David, and Warmuth, Manfred K. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
• Braverman et al. (2016) Braverman, Vladimir, Feldman, Dan, and Lang, Harry. New frameworks for offline and streaming coreset constructions. arXiv preprint CoRR, abs/1612.00889, 2016.
• Chao (1982) Chao, M. T. A general purpose unequal probability sampling plan. Biometrika, 69(3):653–656, 1982.
• Clarkson (2005) Clarkson, Kenneth L. Subgradient and sampling algorithms for $\ell_1$ regression. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 257–266, 2005.
• Clarkson & Woodruff (2013) Clarkson, Kenneth L. and Woodruff, David P. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing (STOC), pp. 81–90, 2013.
• Clarkson & Woodruff (2015a) Clarkson, Kenneth L. and Woodruff, David P. Sketching for M-estimators: A unified approach to robust regression. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 921–939, 2015a.
• Clarkson & Woodruff (2015b) Clarkson, Kenneth L. and Woodruff, David P. Input sparsity and hardness for robust subspace approximation. In IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), pp. 310–329, 2015b.
• Clarkson et al. (2016) Clarkson, Kenneth L., Drineas, Petros, Magdon-Ismail, Malik, Mahoney, Michael W., Meng, Xiangrui, and Woodruff, David P. The fast Cauchy transform and faster robust linear regression. SIAM Journal on Computing, 45(3):763–810, 2016.
• Cohen et al. (2015) Cohen, Michael B., Lee, Yin Tat, Musco, Cameron, Musco, Christopher, Peng, Richard, and Sidford, Aaron. Uniform sampling for matrix approximation. In Proceedings of the Conference on Innovations in Theoretical Computer Science (ITCS), pp. 181–190, 2015.
• Cohen et al. (2017) Cohen, Michael B., Musco, Cameron, and Musco, Christopher. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1758–1777, 2017.
• Dasgupta et al. (2009) Dasgupta, Anirban, Drineas, Petros, Harb, Boulos, Kumar, Ravi, and Mahoney, Michael W. Sampling algorithms and coresets for regression. SIAM Journal on Computing, 38(5):2060–2078, 2009.
• Drineas et al. (2006) Drineas, Petros, Mahoney, Michael W., and Muthukrishnan, S. Sampling algorithms for regression and applications. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1127–1136, 2006.
• Drineas et al. (2008) Drineas, Petros, Mahoney, Michael W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881, 2008.
• Drineas et al. (2012) Drineas, Petros, Magdon-Ismail, Malik, Mahoney, Michael W., and Woodruff, David P. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:3475–3506, 2012.
• Feldman & Langberg (2011) Feldman, Dan and Langberg, Michael. A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), pp. 569–578, 2011.
• Feldman et al. (2011) Feldman, Dan, Faulkner, Matthew, and Krause, Andreas. Scalable training of mixture models via coresets. In Advances in Neural Information Processing Systems 24 (NIPS), pp. 2142–2150, 2011.
• Feldman et al. (2013) Feldman, Dan, Schmidt, Melanie, and Sohler, Christian. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1434–1453, 2013.
• Geppert et al. (2017) Geppert, Leo N., Ickstadt, Katja, Munteanu, Alexander, Quedenfeld, Jens, and Sohler, Christian. Random projections for Bayesian regression. Statistics and Computing, 27(1):79–101, 2017.
• Golub & van Loan (2013) Golub, Gene H. and van Loan, Charles F. Matrix computations (4. ed.). J. Hopkins Univ. Press, 2013.
• Heinze & Schemper (2002) Heinze, Georg and Schemper, Michael. A solution to the problem of separation in logistic regression. Statistics in Medicine, 21(16):2409–2419, 2002.
• Huggins et al. (2016) Huggins, Jonathan H., Campbell, Trevor, and Broderick, Tamara. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems 29 (NIPS), pp. 4080–4088, 2016.
• Kearns & Vazirani (1994) Kearns, Michael J. and Vazirani, Umesh V. An Introduction to Computational Learning Theory. MIT Press, 1994.
• Kremer et al. (1999) Kremer, Ilan, Nisan, Noam, and Ron, Dana. On randomized one-round communication complexity. Computational Complexity, 8(1):21–49, 1999.
• Langberg & Schulman (2010) Langberg, Michael and Schulman, Leonard J. Universal $\varepsilon$-approximators for integrals. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 598–607, 2010.
• Li et al. (2013) Li, Mu, Miller, Gary L., and Peng, Richard. Iterative row sampling. In 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 127–136, 2013.
• Lucic et al. (2016) Lucic, Mario, Bachem, Olivier, and Krause, Andreas. Strong coresets for hard and soft Bregman clustering with applications to exponential family mixtures. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1–9, 2016.
• McCullagh & Nelder (1989) McCullagh, P. and Nelder, J. A. Generalized Linear Models. Chapman & Hall, London, 1989.
• Mehta & Patel (1995) Mehta, Cyrus R. and Patel, Nitin R. Exact logistic regression: Theory and examples. Statistics in Medicine, 14(19):2143–2160, 1995.
• Molina et al. (2018) Molina, Alejandro, Munteanu, Alexander, and Kersting, Kristian. Core dependency networks. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), 2018.
• Musco & Musco (2017) Musco, Cameron and Musco, Christopher. Recursive sampling for the Nyström method. In Advances in Neural Information Processing Systems 30 (NIPS), pp. 3836–3848, 2017.
• Reddi et al. (2015) Reddi, Sashank J., Póczos, Barnabás, and Smola, Alexander J. Communication efficient coresets for empirical loss minimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (UAI), pp. 752–761, 2015.
• Roughgarden (2017) Roughgarden, Tim. Beyond worst-case analysis, 2017. Invited talk held at the Highlights of Algorithms conference (HALG), 2017.
• Sohler & Woodruff (2011) Sohler, Christian and Woodruff, David P. Subspace embeddings for the $\ell_1$-norm with applications. In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), pp. 755–764, 2011.
• Tolochinsky & Feldman (2018) Tolochinsky, Elad and Feldman, Dan. Coresets for monotonic functions with applications to deep learning. CoRR, abs/1802.07382, 2018.
• Vapnik (1995) Vapnik, Vladimir N. The Nature of Statistical Learning Theory. Springer, New York, USA, 1995.
• Woodruff (2014) Woodruff, David P. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.
• Woodruff & Zhang (2013) Woodruff, David P. and Zhang, Qin. Subspace embeddings and $\ell_p$-regression using exponential random variables. In The 26th Conference on Learning Theory (COLT), pp. 546–567, 2013.

## Appendix A Proofs

###### Proof of Theorem 3.

Let . In time, find an -well-conditioned basis (Clarkson et al., 2016) of , such that

$$\forall \beta \in \mathbb{R}^d:\quad \|\beta\|_1 \le \|U\beta\|_1 \le \operatorname{poly}(d)\,\|\beta\|_1.$$

Then and are the same since and span the same columnspace. By linearity it suffices to optimize over unit- vectors . If we minimize over unit- vectors , and is the minimum value, then is at most , and at least by the well-conditioned basis property, so we just need to find , which can be done with the following linear program:

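$$
\begin{aligned}
\min\quad & \sum_{i=1}^{n} b_i \\
\text{s.t.}\quad & \forall i \in [n]:\ (U\beta)_i = a_i - b_i \\
& \forall i \in [d]:\ \beta_i = c_i - d_i \\
& \sum_{i=1}^{d} (c_i + d_i) \ge 1 \\
& a_i, b_i, c_i, d_i \ge 0 \ \text{ for all indices } i
\end{aligned}
$$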

Note that ensures , but to minimize the objective function, one will always have . Further, if both and are positive for some , they can both be reduced, reducing the objective function. So exactly corresponds to the minimum over of . ∎
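The correspondence between a vector β and a feasible point of the linear program can be illustrated with a small sketch. For a fixed β, splitting Uβ and β into positive and negative parts gives LP variables whose objective is the sum of the negative parts of Uβ. The helper name `lp_point` and the toy matrix are illustrative assumptions, not part of the original proof, and the sketch only evaluates one feasible point; minimizing over β would still require an LP solver.

```python
def lp_point(U, beta):
    """Map beta to LP variables (a, b, c, d) and the objective sum(b)."""
    # z = U @ beta, written out with plain lists
    z = [sum(u * bj for u, bj in zip(row, beta)) for row in U]
    a = [max(v, 0.0) for v in z]      # positive parts of (U beta)_i
    b = [max(-v, 0.0) for v in z]     # negative parts of (U beta)_i
    c = [max(v, 0.0) for v in beta]   # positive parts of beta_i
    d = [max(-v, 0.0) for v in beta]  # negative parts of beta_i
    return z, a, b, c, d, sum(b)


# Toy instance (hypothetical): n = 3 rows, d = 2 columns.
U = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
beta = [0.5, -0.5]
z, a, b, c, d, objective = lp_point(U, beta)
```

Because at most one of each pair (a_i, b_i) and (c_i, d_i) is positive in this construction, the constraint on the sum of c and d is exactly the ℓ1 norm of β.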

###### Proof of Theorem 4.

Assume we had a streaming algorithm using space. We construct the following protocol for INDEX: Consider an instance of INDEX, i.e., Alice has a string and Bob has an index . We transform the instance into an instance for logistic regression. For each , Alice adds a point . Note that all of these points have unit Euclidean norm and hence any single point may be linearly separated from the others. All of Alice’s points have label . Alice summarizes the point set by running the streaming algorithm and sends a message containing the working memory of the streaming algorithm to Bob. Bob now adds the point for small enough with label . From the contents of Alice’s message and , Bob now obtains a solution to the logistic regression instance. Clearly, if Alice added and hence then the optimal solution will have cost at least , since there will be at least one misclassification. If, on the other hand, Alice did not add and hence , then the two point sets are linearly separable and the cost tends to . Distinguishing between these two cases, i.e. approximating the cost of logistic regression beyond a factor solves the INDEX problem.

To conclude the theorem, let us consider the space required to encode the points added by Alice. For the reduction to work, it is only important that any point added by Alice can be linearly separated from the others. This can be achieved by using bits per point, i.e., the space of Alice’s point set is at most . The space bound now follows from the lower bound of bits due to Kremer et al. (1999) for the INDEX problem. ∎

###### Proof of Corollary 5.

If we had a coreset construction with points, we have a protocol for INDEX: Alice computes a coreset for her point set defined in the proof of Theorem 4 and sends it to Bob. Bob computes an optimal solution on the union of the coreset and his point. This solves INDEX using communication, which contradicts the lower bound of Kremer et al. (1999). So Alice’s coreset cannot exist. ∎

###### Proof of Lemma 10.

(cf. Huggins et al. (2016)) For all , we have

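$$\left|\{G \cap R \mid R \in \operatorname{ranges}(\mathcal{F}_{\log})\}\right| = \left|\{\operatorname{range}_G(\beta, r) \mid \beta \in \mathbb{R}^d,\ r \in \mathbb{R}_{\ge 0}\}\right|$$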

Note that is invertible and monotone. Also note that maps surjectively into . For all we thus have

$$\operatorname{range}_G(\beta, r) = \{g_i \in G \mid g_i(\beta) \ge r\} = \{g_i \in G \mid g(x_i\beta) \ge r\} = \{g_i \in G \mid x_i\beta \ge g^{-1}(r)\}.$$

Now note that corresponds to the set of points that is shattered by the affine hyperplane classifier . We can conclude that

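$$\left|\{\operatorname{range}_G(\beta, r) \mid \beta \in \mathbb{R}^d,\ r \in \mathbb{R}_{\ge 0}\}\right| = \left|\{\{g_i \in G \mid x_i\beta - s \ge 0\} \mid \beta \in \mathbb{R}^d,\ s \in \mathbb{R}\}\right|$$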

which means that the VC dimension of is since the VC dimension of the set of hyperplane classifiers is (Kearns & Vazirani, 1994; Vapnik, 1995). ∎
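The key step, that thresholding a monotone invertible loss selects the same points as thresholding the linear function itself, can be checked numerically. The sketch below assumes the logistic loss g(z) = ln(1 + e^z), which is an assumption here since the definition of g lies outside this appendix; the inner products and the threshold are hypothetical toy values.

```python
import math


def g(z):
    """Logistic loss, assumed form g(z) = ln(1 + exp(z))."""
    return math.log1p(math.exp(z))


def g_inv(r):
    """Inverse of g on r > 0: g_inv(r) = ln(exp(r) - 1)."""
    return math.log(math.expm1(r))


# Hypothetical inner products x_i * beta for five points, and a threshold r > 0.
inner = [-2.0, -0.5, 0.0, 0.7, 3.0]
r = 0.9

# Range defined through the loss values ...
by_loss = [i for i, z in enumerate(inner) if g(z) >= r]
# ... equals the range defined by an affine halfspace with offset g_inv(r).
by_halfspace = [i for i, z in enumerate(inner) if z >= g_inv(r)]
```

Both index lists coincide, which is why the range space inherits the VC dimension bound of affine hyperplane classifiers.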

###### Proof of Lemma 11.

Let , where is an orthonormal basis for the columnspace of . It follows from and monotonicity of that

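$$
\begin{aligned}
w_i\, g(x_i\beta) &= w_i\, g\!\left(\frac{\|U_i\|_2 \|D_w X\beta\|_2}{w_i}\right) \le w_i \frac{2}{w_i} \|U_i\|_2 \|D_w X\beta\|_2 \le 2\|U_i\|_2 \|D_w X\beta\|_1 \\
&\le 2\|U_i\|_2 (1+\mu) \|(D_w X\beta)^+\|_1 = 2\|U_i\|_2 (1+\mu) \sum_{j:\, w_j x_j\beta \ge 0} w_j |x_j\beta| \\
&\le 2\|U_i\|_2 (1+\mu) \sum_{j:\, x_j\beta \ge 0} w_j\, g(x_j\beta) \le 2\|U_i\|_2 (1+\mu)\, f_w(X\beta).
\end{aligned}
$$
∎

###### Proof of Lemma 12.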

Let and . Note that and . Also,

Thus if then

$$f_w(X\beta) = \sum w_j\, g(x_j\beta) \ge \sum_{j\in[n]} \frac{w_j}{20} \ge \frac{W}{20\, w_i} \cdot w_i\, g(x_i\beta).$$

If on the other hand then . Thus

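$$f_w(X\beta) \ge \|(D_w X\beta)^+\|_1 \ge \frac{\|(D_w X\beta)^-\|_1}{\mu} \ge \frac{1}{\mu}\left(2 \cdot \sum_{j\in[n]} \frac{w_j}{2}\right) \ge \frac{W}{\mu\, w_i} \cdot w_i\, g(x_i\beta).$$
∎

###### Proof of Lemma 13.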

From Lemma 11 and Lemma 12 we have for each

$$\varsigma_i = \sup_{\beta} \frac{g(x_i\beta)}{f_w(X\beta)} \le 2(1+\mu)\|U_i\|_2 + (20+\mu)\frac{w_i}{W} \le (20+2\mu)\left(\|U_i\|_2 + \frac{w_i}{W}\right)$$

From this, the second claim follows via the Cauchy-Schwarz inequality and using the fact that the Frobenius norm satisfies due to orthonormality of . We have

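$$S = \sum_{i=1}^{n} \varsigma_i \le (20+2\mu) \sum_{i=1}^{n} \left(\|U_i\|_2 + \frac{w_i}{W}\right) \le 22\mu\left(\sqrt{n}\,\|U\|_F + 1\right) \le 44\mu\sqrt{nd}.$$
∎
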
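The individual steps behind the bound on the total sensitivity can be spelled out as follows. This is a sketch under three assumptions suggested by the surrounding lemmas but not restated here: $W = \sum_j w_j$, $\mu \ge 1$, and $U$ has $d$ orthonormal columns.

$$
\begin{aligned}
\sum_{i=1}^{n} \|U_i\|_2 &\le \sqrt{n}\,\Big(\sum_{i=1}^{n} \|U_i\|_2^2\Big)^{1/2} = \sqrt{n}\,\|U\|_F = \sqrt{nd} && \text{(Cauchy-Schwarz, orthonormality)} \\
\sum_{i=1}^{n} \frac{w_i}{W} &= 1 && \text{(assuming } W = \textstyle\sum_j w_j\text{)} \\
20 + 2\mu &\le 22\mu && \text{(assuming } \mu \ge 1\text{)} \\
\sqrt{nd} + 1 &\le 2\sqrt{nd} && \text{(since } nd \ge 1\text{)}
\end{aligned}
$$

Multiplying the last two lines gives the factor $44\mu\sqrt{nd}$.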
###### Proof of Theorem 14.

The algorithm computes the QR-decomposition of . Note that is an orthonormal basis for the columnspace of . It uses the upper bounds on the sensitivities from Lemma 13. Namely, it samples the input points proportional to the sampling probabilities

$$p_i = \frac{s_i}{\sum_{j=1}^{n} s_j} = \frac{\|Q_i\|_2 + w_i/W}{\sum_{j=1}^{n} \left(\|Q_j\|_2 + w_j/W\right)}.$$

From Lemma 13 we know that . Lemma 10 bounds the VC dimension of the range space for logistic regression . Putting all these pieces into Theorem 9 for error parameter and failure probability , we have that a random sample of size

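$$
\begin{aligned}
k &\in O\!\left(\frac{S}{\varepsilon^2}\left(\Delta(\mathfrak{R}_{\mathcal{F}_{\log}}) \log S + \log\frac{1}{\delta}\right)\right) \\
&\subseteq O\!\left(\frac{\mu\sqrt{nd}}{\varepsilon^2}\left(d \log(\mu\sqrt{nd}) + \log(n^c)\right)\right) \\
&\subseteq O\!\left(\frac{\mu\sqrt{n}}{\varepsilon^2}\, d^{3/2} \log(\mu n d)\right)
\end{aligned}
$$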

is a coreset with probability as claimed.
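A dependency-free sketch of the sampling step may help make the probabilities concrete. It assumes that the decomposed matrix is the row-weighted data matrix $D_w X$ (as the proofs of Lemmas 11 and 16 suggest), that $W$ is the sum of the weights, and that sampled points are reweighted by the standard importance factor $w_i/(k p_i)$. The function names are hypothetical, and the Gram-Schmidt routine stands in for a library QR decomposition.

```python
import math
import random


def orthonormal_basis(rows):
    """Modified Gram-Schmidt on the columns; assumes full column rank."""
    n, d = len(rows), len(rows[0])
    cols = [[rows[i][j] for i in range(n)] for j in range(d)]
    q_cols = []
    for v in cols:
        v = list(v)
        for q in q_cols:
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]  # v <- v - <q, v> q
        norm = math.sqrt(sum(a * a for a in v))
        q_cols.append([a / norm for a in v])
    return [[q_cols[j][i] for j in range(d)] for i in range(n)]


def sampling_probabilities(X, w):
    """p_i proportional to ||Q_i||_2 + w_i / W, with Q a basis of D_w X."""
    W = sum(w)
    Q = orthonormal_basis([[wi * x for x in row] for wi, row in zip(w, X)])
    s = [math.sqrt(sum(q * q for q in Qi)) + wi / W for Qi, wi in zip(Q, w)]
    total = sum(s)
    return [si / total for si in s]


def sample_coreset(X, w, k, rng):
    """Draw k indices i.i.d. from p; return (index, new weight) pairs."""
    p = sampling_probabilities(X, w)
    idx = rng.choices(range(len(X)), weights=p, k=k)
    return [(i, w[i] / (k * p[i])) for i in idx]


# Toy data (hypothetical): four points in two dimensions with unit weights.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
w = [1.0, 1.0, 1.0, 1.0]
p = sampling_probabilities(X, w)
coreset = sample_coreset(X, w, 3, random.Random(0))
```

The sample size $k$ is left as a free parameter here; the bound above indicates how large it must be for the coreset guarantee.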

It remains to prove the claims regarding streaming and running time. We can compute the QR-decomposition of in time , see (Golub & van Loan, 2013). Once is available, we can inspect it row-by-row computing and give it as input together with to independent copies of a weighted reservoir sampler (Chao, 1982), which takes time to collect all sampled non-zero entries. This gives a total running time of since the computations are dominated by the QR-decomposition.
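The use of independent reservoir samplers can be sketched as below. This is a simplified one-item sampler in the spirit of, but not identical to, the plan of Chao (1982): each copy holds a single item and replaces it with probability equal to the new weight divided by the running total, so that after the pass item $i$ is held with probability $s_i / \sum_j s_j$. The class and function names are hypothetical.

```python
import random


class WeightedReservoir:
    """Holds one item; item i ends up selected with probability s_i / sum_j s_j."""

    def __init__(self, rng):
        self.rng = rng
        self.total = 0.0
        self.item = None

    def offer(self, item, weight):
        if weight <= 0:
            return  # zero-weight items are never sampled
        self.total += weight
        # Replace the held item with probability weight / running total.
        if self.rng.random() < weight / self.total:
            self.item = item


def stream_sample(stream, k, seed=0):
    """One pass over (item, weight) pairs using k independent samplers."""
    rng = random.Random(seed)
    samplers = [WeightedReservoir(rng) for _ in range(k)]
    for item, weight in stream:
        for s in samplers:
            s.offer(item, weight)
    return [s.item for s in samplers]


# Hypothetical stream of (row index, sensitivity bound) pairs.
stream = [("a", 1.0), ("b", 0.0), ("c", 3.0)]
picks = stream_sample(stream, 5)
```

Each sampler does constant work per stream element, so $k$ copies cost $O(k)$ per element in this naive form.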

We argue how to implement the first step in one streaming pass over the data in time . Using the sketching techniques of Clarkson & Woodruff (2013), cf. (Woodruff, 2014), we can obtain a provably constant approximation of the square root of the leverage scores with constant probability (Drineas et al., 2012). This means that the total sensitivity bound grows only by a small constant factor and does not affect the asymptotic analysis presented above. The idea is to first sketch the data matrix to a significantly smaller matrix , where . This takes only time, where the and factors are only needed to amplify the success probability from constant to (Woodruff, 2014). Performing the QR-decomposition takes time.

Now, to compute a fast approximation to the row norms, we use a Johnson-Lindenstrauss transform, i.e., a matrix , whose entries are i.i.d. . We compute the approximation to the row norms used in our sampling probabilities in a second pass over the data, as , for . As we do so, we can feed these augmented with the corresponding weight directly to the reservoir sampler. The latter is a streaming algorithm itself and updates its sample in constant time. The matrix product takes at most time, and the streaming pass can be done in

This sums up to two passes over the data and a running time of . ∎
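The row-norm approximation via a Johnson-Lindenstrauss transform can be illustrated with a small sketch. It assumes Gaussian entries with mean zero and variance $1/k$, where $k$ is the target dimension; this scaling is a common choice and an assumption here, as are the function name and the toy rows. In the setting of the proof the rows would come from the implicitly represented orthonormal basis rather than from an explicit matrix.

```python
import math
import random


def jl_row_norms(rows, k, rng):
    """Approximate the Euclidean norm of each row by the norm of row @ G."""
    d = len(rows[0])
    # G is d x k with i.i.d. N(0, 1/k) entries (assumed scaling).
    G = [[rng.gauss(0.0, 1.0 / math.sqrt(k)) for _ in range(k)] for _ in range(d)]
    approx = []
    for row in rows:
        proj = [sum(row[j] * G[j][t] for j in range(d)) for t in range(k)]
        approx.append(math.sqrt(sum(v * v for v in proj)))
    return approx


# Toy rows (hypothetical) with exact norms 5 and 1.
rows = [[3.0, 4.0], [1.0, 0.0]]
approx = jl_row_norms(rows, 400, random.Random(1))
```

With a few hundred columns the estimates concentrate tightly around the true norms; the proof only needs a constant-factor approximation, for which far fewer columns suffice.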

###### Proof of Lemma 15.

Fix an arbitrary . Let . We attempt to bound the quantity

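$$
\begin{aligned}
\left|\{G \cap R \mid R \in \operatorname{ranges}(\mathcal{F}_{\ell_1})\}\right|
&= \left|\{\operatorname{range}_G(\beta, r) \mid \beta \in \mathbb{R}^d,\ r \in \mathbb{R}_{\ge 0}\}\right| \\
&= \Big|\bigcup_{(\beta,r)\in\Omega} \{\{h_i \in G \mid h_i(\beta) \ge r\}\}\Big| \\
&= \Big|\bigcup_{(\beta,r)\in\Omega} \{\{h_i \in G \mid w_i x_i\beta \ge r \,\lor\, -w_i x_i\beta \ge r\}\}\Big| \\
&\le \Big|\bigcup_{(\beta,r)\in\Omega} \{\{h_i \in G \mid w_i x_i\beta \ge r\}\}\Big| \cdot \Big|\bigcup_{(\beta,r)\in\Omega} \{\{h_i \in G \mid -w_i x_i\beta \ge r\}\}\Big| \\
&= \Big|\bigcup_{(\beta,r)\in\Omega} \{\{h_i \in G \mid w_i x_i\beta \ge r\}\}\Big|^2.
\end{aligned}
\tag{2}
$$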

The inequality holds since each non-empty set in the collection on the LHS satisfies at least one of the conditions defining the sets in the two collections on the RHS, and is thus the union of two of those sets, one from each collection. The LHS collection can therefore comprise at most all unions obtained by combining any two of these sets. The last equality holds since for each fixed we also union over as we range over all . The two collections are thus equal.

Now note that each set equals the set of weighted points that is shattered by the affine hyperplane classifier . Note that the VC dimension of the set of hyperplane classifiers is (Kearns & Vazirani, 1994; Vapnik, 1995). To conclude the claimed bound on it is sufficient to show that the above term (2) is bounded strictly below for . By a bound given in (Blumer et al., 1989; Kearns & Vazirani, 1994) we have for this particular choice

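$$(2) \le \left|\{\{h_i \in G \mid w_i x_i\beta - r \ge 0\} \mid \beta \in \mathbb{R}^d,\ r \in \mathbb{R}\}\right|^2 \le \left(\frac{e|G|}{d+1}\right)^{2(d+1)} < 2^{2(d+1)\log(30)} \le 2^{2(d+1)\cdot 5} = 2^{|G|}$$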

which implies that . ∎
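The arithmetic in the final chain can be verified directly. The sketch below takes $|G| = 10(d+1)$, which is inferred from the closing equality $2^{2(d+1)\cdot 5} = 2^{|G|}$ rather than quoted from the text, and compares exponents in base 2; the function name is hypothetical.

```python
import math


def shatter_bound_holds(d):
    """Check (e|G|/(d+1))^(2(d+1)) < 2^|G| for |G| = 10(d+1), in log2 scale."""
    size = 10 * (d + 1)  # |G|, inferred from 2(d+1)*5 = |G|
    log2_lhs = 2 * (d + 1) * math.log2(math.e * size / (d + 1))
    return log2_lhs < size


# e|G|/(d+1) = 10e is about 27.2 < 30, and log2(30) is about 4.91 <= 5.
checks = [shatter_bound_holds(d) for d in range(1, 50)]
```

Since the ratio $e|G|/(d+1)$ equals $10e$ regardless of $d$, the check passes for every dimension, not only the range tested here.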

###### Proof of Lemma 16.

Consider any . Let where is an orthonormal basis for the columnspace of . As in Lemma 11 we have for each index

$$|w_i x_i\beta| = |U_i R\beta| \le \|U_i\|_2 \|R\beta\|_2 = \|U_i\|_2 \|D_w X\beta\|_2 \le \|U_i\|_2 \|D_w X\beta\|_1 \tag{3}$$

The sensitivity for the norm function of is thus

$$\sup_{\beta \in \mathbb{R}^d \setminus \{0\}} \frac{w_i |x_i\beta|}{\|D_w X\beta\|_1} \le \|U_i\|_2.$$

Note that our upper bounds on the sensitivities satisfy . Thus also holds. Also, by Lemma 15, we have a bound of on the VC dimension of the class of functions . Now, rescaling the error probability parameter that we put into Theorem 9 by a factor of , and taking a union bound over the two sets of functions and , the sample in Theorem 14 simultaneously satisfies the claims of Theorem 14 and of this lemma. ∎

###### Proof of Lemma 17.

For brevity of presentation let . First note that by Lemma 16 we have for all

$$(1-\varepsilon')\|X'\beta\|_1 \le \|TX'\beta\|_1 \le (1+\varepsilon')\|X'\beta\|_1.$$

Note that since the weights are non-negative, sampling and reweighting does not change the sign of the entries. This implies for and that

From this and