Finding Heavily-Weighted Features in Data Streams
Abstract
We introduce a new sublinear-space data structure—the Weight-Median Sketch—that captures the most heavily weighted features in linear classifiers trained over data streams. This enables memory-limited execution of several statistical analyses over streams, including online feature selection, streaming data explanation, relative deltoid detection, and streaming estimation of pointwise mutual information. In contrast with related sketches that capture the most commonly occurring features (or items) in a data stream, the Weight-Median Sketch captures the features that are most discriminative of one stream (or class) compared to another. The Weight-Median Sketch adopts the core data structure used in the Count-Sketch, but, instead of sketching counts, it captures sketched gradient updates to the model parameters. We provide a theoretical analysis of this approach that establishes recovery guarantees in the online learning setting, and demonstrate substantial empirical improvements in accuracy-memory tradeoffs over alternatives, including count-based sketches and feature hashing.
1 Introduction
With the rapid growth of streaming data volumes, memory-efficient sketches are an increasingly important workhorse in analytics tasks including finding frequent items (Charikar et al., 2002; Cormode and Muthukrishnan, 2005b; Metwally et al., 2005; Larsen et al., 2016), estimating quantiles (Greenwald and Khanna, 2001; Luo et al., 2016), and approximating the number of distinct items (Flajolet, 1985). Sketching algorithms compute approximations of these analyses in exchange for significant reductions in memory utilization; therefore, they are well-suited to settings where highly accurate estimation is not essential, or where practitioners wish to trade off between memory usage and approximation accuracy (Boykin et al., 2014; Yu et al., 2013). These properties make sketches a natural fit for stream processing applications where it is infeasible to store the entire stream in memory.
Simultaneously, a wide range of streaming analytics workloads can be formulated as machine learning problems over streaming data. For example, in streaming data explanation (Bailis et al., 2017; Meliou et al., 2014), analyses seek to differentiate between subpopulations in the data—for example, between an inlier class and an outlier class, as determined by some metric of interest. In network monitoring, analyses seek to identify significant relative differences between streams of network traffic (Cormode and Muthukrishnan, 2005a), such as in detecting destinations that are substantially more popular on one link versus another. In natural language processing on text streams, several applications require the identification of strongly-associated groups of tokens (Durme and Lall, 2009)—this pertains more broadly to the problem of identifying groups of events which tend to co-occur. In the language of machine learning, these tasks can all be framed as instances of streaming classification between two or more classes of interest, followed by the interpretation of the learned model in order to identify the features that are the most discriminative between the classes.
In streaming classification applications such as the above, we can identify several key desiderata:
Low Memory. An emerging application domain for ML systems is in running training and inference on resource-constrained mobile and Internet of Things (IoT) devices (Kumar et al., 2017; Gupta et al., 2017). A widely-deployed example is the Arduino Uno Rev3 board, which operates with 2KB of onboard SRAM and 32KB of flash memory; predictive models operating in these settings must therefore contend with strict limits on model size. The use of on-device models is not limited to simply performing prediction with a pre-trained model; it is often desirable to support model updates in these memory-limited environments as well. For instance, an on-device model can be updated “in the field” on the basis of input data streams like local sensor readings. Even in the traditional setting of stream processing on commodity servers, memory-constrained techniques can be useful for processing a large number of distinct streams on a single node.
Model Interpretability. In the stream processing applications described earlier, the goal is not solely to achieve high classification accuracy, but also to identify important features in the learned model, where each feature corresponds to items or attributes in the data stream. Learned model weights can themselves be a useful output of the learning process; a model trained to discriminate between two classes can yield actionable information regarding the attributes most characteristic of each subpopulation. While “model interpretability” is a broadly construed term in the literature (Lipton, 2016), in this work we consider the specific task of identifying the largest-magnitude weights in the model; intuitively, this corresponds to identifying which features are most influential towards making model predictions. Beyond the above applications, the identification of influential features in a learned model is relevant to issues in feature selection (Zhang et al., 2016), fairness (Corbett-Davies et al., 2017), model trustworthiness (Ribeiro et al., 2016), and legally-mandated rights to explanation in scenarios involving algorithmic decision-making (Goodman and Flaxman, 2016).
Fast Updates. The high throughput requirements of modern stream analytics necessitate the use of models that support fast inference and update times. For example, network traffic monitoring requires methods that support updates at line rates.
Existing sketch-based methods may seem to be attractive building blocks for developing streaming classification systems that satisfy these requirements. However, as we show, simple adaptations of existing sketch algorithms can fail to capture influential features in the model. For example, a heavy-hitters algorithm (Cormode and Hadjieleftheriou, 2008) will identify frequently-occurring features, but these features are not necessarily the most discriminative features between classes—a simple instance of this failure is when the most frequent features are statistically independent of the class variable. On the other hand, online learning algorithms with sparsity-inducing regularization impose only soft constraints on memory usage, and it is difficult to a priori select a regularization parameter to satisfy hard memory constraints without strong assumptions on the data. Thus, existing methods are not fully satisfactory for our target applications.
In response, we develop new small-space algorithms for finding heavily-weighted features in data streams. In this paper, we formalize the problem by introducing the Heavy-Weights problem as a generalization of the well-studied Heavy-Hitters problem. We introduce a new sketch algorithm for the Heavy-Weights problem—the Weight-Median Sketch—that solves this problem for streaming linear classification using space logarithmic in the feature dimension. This sketch can be applied in many of the previously-mentioned stream analytics workloads, including: (i) online feature selection, (ii) streaming data explanation, (iii) detecting large relative differences in data streams (i.e., relative deltoids (Cormode and Muthukrishnan, 2005a)), and (iv) streaming estimation of highly-correlated pairs of items via pointwise mutual information (Durme and Lall, 2009).
The key intuition behind the Weight-Median Sketch is that by sketching the gradient updates to a linear classifier, we can incrementally maintain a compressed version of the true, high-dimensional model that supports efficient approximate recovery of the model weights. In particular, the Weight-Median Sketch maintains a linear projection of the weight vector of the linear classifier, similar to how a Count-Sketch (Charikar et al., 2002) maintains a sketch of a vector of frequency counts. Unlike heavy-hitters sketches that simply increment or decrement counters, the Weight-Median Sketch updates its state using online gradient descent (Hazan et al., 2016). Since these updates themselves depend on the current weight estimates, a careful analysis is needed to ensure that the estimated weights do not diverge from the true model parameters over the course of multiple online updates. In this paper, we provide theoretical guarantees on the approximation error of these weight estimates. Additionally, in extensive experiments over real data, we show that the Weight-Median Sketch outperforms alternative approaches, including those based on tracking the most frequent features, truncation-based techniques, and feature hashing. For example, on the standard Reuters RCV1 binary classification benchmark, the Weight-Median Sketch recovers the most heavily-weighted features in the model with 4× better approximation error than a baseline using the Space Saving heavy-hitters algorithm and 10× better than a naïve weight truncation baseline, while using the same amount of memory.
To summarize, we make the following contributions in this work:

We introduce the Weight-Median Sketch, a new sketch for identifying heavily-weighted features in linear classifiers over data streams.

We provide a theoretical analysis of the Weight-Median Sketch that provides guarantees on the accuracy of the sketch estimates. In particular, we show that for feature dimension d and with success probability 1 − δ, we can learn a compressed model of dimension polynomial in 1/ε and polylogarithmic in d/δ that supports approximate recovery of the optimal weight vector w*, where the absolute error of each recovered weight is bounded above by a small multiple ε of the norm of w*. For a given input vector x, this structure can be updated in time proportional to the number of nonzero entries of x, up to logarithmic factors.

We present optimizations to the basic Weight-Median Sketch—in particular, using an active set of features—to improve both weight recovery and classification accuracy.

We demonstrate that the Weight-Median Sketch empirically outperforms several alternatives in terms of memory-accuracy tradeoffs across a range of real-world datasets.
2 Preliminaries
In this section, we present our problem description as well as several theoretical concepts that we draw upon in our design and analysis of the Weight-Median Sketch.
Conventions and Notation. We denote all vector quantities by bolded lowercase letters, e.g., v. The notation v_i or (v)_i (where appropriate for clarity) denotes the i-th element of v. The notation [n] denotes the set {1, ..., n}. We write ℓp norms as ||v||_p, where the ℓp norm of a vector v is defined as ||v||_p = (Σ_i |v_i|^p)^(1/p).
2.1 The Heavy-Weights Problem
The well-studied approximate Heavy-Hitters problem can be defined as follows:
Definition 1.
(Cormode and Hadjieleftheriou, 2008) Given a stream of n items, let f_i denote the count of item i over the stream. Given a threshold φ and any ε ∈ (0, φ), the ε-approximate Heavy-Hitters problem is to return a set of items S such that for all i ∈ S, f_i > (φ − ε)n, and there is no i ∉ S such that f_i > φn.
We formalize the Heavy-Weights problem as the analog of the Heavy-Hitters problem for general optimization problems where a unique solution exists. As is typical in the Heavy-Hitters formulation, we allow for an arbitrary approximation factor ε.
Definition 2.
((φ, ε)-Approximate Heavy-Weights Problem) For any function F: ℝ^d → ℝ for which a unique minimizer exists, define w* = argmin_w F(w). Given a threshold φ and any ε ∈ (0, φ), the (φ, ε)-approximate Heavy-Weights problem is to return a set S ⊆ [d] such that for all i ∈ S, |w*_i| > φ − ε, and there is no i ∉ S such that |w*_i| > φ.
The function F represents an objective or cost to be minimized. Note that the familiar Heavy-Hitters problem can be posed as an optimization problem with the objective F(w) = (1/2)||w − f/n||_2^2, where f is the vector of true counts, with f_i equal to the count of item i and Σ_i f_i = n. This objective is minimized by w* = f/n. Therefore, the approximate Heavy-Hitters problem is a special case of the approximate Heavy-Weights problem.
We can also define the related Weight Estimation problem:
Definition 3.
((ε)-Approximate Weight Estimation Problem) For any function F: ℝ^d → ℝ for which a unique minimizer exists, define w* = argmin_w F(w). The ε-approximate Weight Estimation problem is to return, for any index i ∈ [d], an estimate ŵ_i such that |ŵ_i − w*_i| ≤ ε.
If φ is known, an algorithm that solves the Weight Estimation problem also solves the Heavy-Weights problem, since we can simply enumerate all d weights and output those for which ŵ_i > φ − ε. In practice, it is often more convenient to work in terms of retrieving estimates of the top-K weights of w* in absolute value; this formulation is equivalent to the approximate Heavy-Weights problem for some implicit value of φ. Note that if we have an algorithm that solves the ε-approximate Weight Estimation problem and are given an upper bound K, we can retrieve a set—possibly of size larger than K—that is guaranteed to contain the top-K weights using the following procedure: enumerate the weight estimates and return those whose ±ε error interval intersects the error interval of any estimate among the K largest.
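To make the retrieval step concrete, the following sketch (our own illustration; the function name and list-based representation are not from the paper) returns a superset that is guaranteed to contain the true top-K indices, given estimates accurate to within ±ε:

```python
def topk_candidates(estimates, k, eps):
    """Return indices whose +/- eps error interval could place them in the
    true top-k by magnitude. The returned set may have more than k elements,
    but it always contains the true top-k indices."""
    mags = sorted((abs(v) for v in estimates), reverse=True)
    # The k-th largest true magnitude is at least mags[k-1] - eps, and any
    # true top-k index has an estimate within eps of its true weight.
    threshold = mags[k - 1] - eps
    return {i for i, v in enumerate(estimates) if abs(v) + eps >= threshold}
```

Indices whose intervals are well separated from the top-K interval are safely excluded, so the candidate set is typically only slightly larger than K.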
2.2 Heavy-Weights in Linear Classifiers
The Heavy-Weights and Weight Estimation problems defined above are extremely broad, as they concern general optimization problems. In this paper, we specialize to the class of objectives corresponding to binary classification with regularized linear models. As we show in our subsequent exploration of specific applications (Sec. 6), this model family is sufficiently expressive to capture a variety of applications in stream processing, including streaming feature selection, data explanation, relative deltoid detection, and pointwise mutual information estimation.
We consider a binary linear classifier parameterized by a vector w ∈ ℝ^d. (In Sec. 8, we describe an extension to the multi-class setting using a standard reduction to binary classification. We also assume linear classifiers with no bias term for simplicity, but it is straightforward to extend our methods to the nonzero-bias case.) Given an input feature vector x ∈ ℝ^d, the classifier outputs a prediction ŷ = sign(w^T x).

Given that T examples have been observed in the stream, define the loss F_T as the following:

F_T(w) = (1/T) Σ_{t=1}^{T} ℓ(y_t w^T x_t) + (λ/2) ||w||_2^2,

where {(x_t, y_t)}_{t=1}^{T} is the set of examples observed in the stream so far, ℓ is a convex, differentiable function of its argument, and λ > 0 controls the strength of ℓ2 regularization. The choice of ℓ defines the linear classification model to be used. For example, the logistic loss ℓ(z) = log(1 + exp(−z)) defines logistic regression, and smoothed versions of the hinge loss define close relatives of linear support vector machines. Our goal is to obtain good estimates of w* = argmin_w F_T(w), the optimal weight vector in hindsight over the observed examples.
We consider the online learning setting where the model weights are updated over a series of rounds t = 1, 2, ..., T. In each round, we update the model weights via the following process:

Receive an input example (x_t, y_t).

Incur loss ℓ_t(w_t) = ℓ(y_t w_t^T x_t).

Update the state w_t to w_{t+1}.
The online learning literature describes numerous update rules for selecting the next state w_{t+1} given the current state w_t. For concreteness, we specify our algorithms to perform updates with online gradient descent (OGD) (Hazan et al., 2016, Chp. 3), which uses the following update rule:

w_{t+1} = w_t − η_t ∇ℓ_t(w_t),

where η_t is the learning rate at step t. OGD enjoys the advantages of simplicity and minimal space requirements (needing to maintain only a representation of the weight vector and a global scalar learning rate), but may be suboptimal given the particular structure of the problem. Generalizations to other online learning algorithms are straightforward.
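As a simplified, concrete rendering of the OGD update with ℓ2 weight decay, consider the following sketch (the function name, sparse-dict input format, and step-size schedule are our own illustrative choices):

```python
import math

def ogd_logistic(stream, dim, lam=1e-3, eta0=0.5):
    """Online gradient descent on l2-regularized logistic loss.
    `stream` yields (x, y) pairs, where x is a sparse dict {index: value}
    and y is a label in {-1, +1}."""
    w = [0.0] * dim
    for t, (x, y) in enumerate(stream, start=1):
        eta = eta0 / math.sqrt(t)             # decaying step size
        margin = y * sum(v * w[i] for i, v in x.items())
        coef = -y / (1.0 + math.exp(margin))  # derivative of log(1 + e^{-z})
        for i in range(dim):                  # l2 weight decay on all weights
            w[i] *= (1.0 - eta * lam)
        for i, v in x.items():                # gradient step on the loss term
            w[i] -= eta * coef * v
    return w
```

Note that the decay loop touches every coordinate; Section 3 discusses how this cost can be avoided with a global scale factor.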
Clearly, if we were to explicitly maintain the weight vector w_t at each step, both the Heavy-Weights problem and the Weight Estimation problem would be trivial to solve. The challenge is the following: is it possible to maintain, for each time step, a compact summary z_t ∈ ℝ^k, where k ≪ d, that supports efficient recovery of the highest-magnitude weights in the model? In the following sections, we will show that in a restricted online setting where the order in which the examples are observed is non-adversarial, there exists an algorithm using space polylogarithmic in d that solves the approximate Weight Estimation problem.
2.3 Relevant Background
At its core, our proposed method combines techniques from (1) online learning, (2) sketching, and (3) norm-preserving embeddings via random projections. In the previous section, we gave a brief overview of online learning. Here, we present the relevant background for areas (2) and (3).
Count-Sketch. The Count-Sketch (Charikar et al., 2002) is a linear projection of a vector v ∈ ℝ^d that supports efficient approximate recovery of the entries of v. For a given size k and depth s, the Count-Sketch algorithm maintains a collection of s hash tables, each with width k/s (see Fig. 2). Each index i ∈ [d] is assigned a random bucket h_j(i) in table j along with a random sign σ_j(i). Increments to the i-th coordinate are multiplied by σ_j(i) and added to the corresponding buckets h_j(i). The estimator for the i-th coordinate is the median over the s tables of the values in the assigned buckets, each multiplied by the corresponding sign flip. Charikar et al. (2002) showed the following recovery guarantee:
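A minimal Python rendering of this data structure may help fix ideas (our illustration; the hash construction here is chosen for clarity and determinism, not for the independence properties or speed a production sketch would use):

```python
import hashlib
from statistics import median

class CountSketch:
    """A minimal Count-Sketch: `depth` tables of `width` buckets each."""
    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.tables = [[0.0] * width for _ in range(depth)]

    def _bucket_sign(self, j, i):
        # Derive the bucket h_j(i) and sign sigma_j(i) from a hash digest.
        h = int.from_bytes(
            hashlib.blake2b(f"{j}:{i}".encode(), digest_size=8).digest(), "big")
        return h % self.width, (1 if (h >> 40) & 1 else -1)

    def update(self, i, delta=1.0):
        for j in range(self.depth):
            b, s = self._bucket_sign(j, i)
            self.tables[j][b] += s * delta

    def estimate(self, i):
        # Median of the sign-corrected bucket values across tables.
        return median(s * self.tables[j][b]
                      for j in range(self.depth)
                      for b, s in [self._bucket_sign(j, i)])
```

The random signs cause colliding items to cancel in expectation, and the median across tables suppresses the occasional bucket that suffers a heavy collision.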
Lemma 1.
(Charikar et al., 2002) Let v̂ be the Count-Sketch estimate of the vector v. For any vector v, with probability 1 − δ, a Count-Sketch matrix with width O(1/ε^2) and depth O(log(d/δ)) satisfies ||v̂ − v||_∞ ≤ ε ||v||_2.
This result implies that the Count-Sketch solves the approximate Heavy-Hitters problem.
Johnson-Lindenstrauss (JL) property. A random projection matrix R ∈ ℝ^{k×d} is said to have the Johnson-Lindenstrauss (JL) property (Johnson and Lindenstrauss, 1984) if it preserves the norm of a vector with high probability:
Definition 4.
A random matrix R ∈ ℝ^{k×d} has the JL property with error ε and failure probability δ if, for any given x ∈ ℝ^d, we have with probability 1 − δ:

| ||Rx||_2^2 − ||x||_2^2 | ≤ ε ||x||_2^2.
The JL property holds for dense matrices with independent Gaussian or Bernoulli entries (Achlioptas, 2003), and recent work has shown that it applies to certain sparse matrices as well (Kane and Nelson, 2014). Intuitively, JL matrices preserve the geometry of a set of points, and we leverage this key fact to ensure that we can still recover the original solution after projecting to low dimension.
3 Finding Heavily-Weighted Features
In this section, we present the Weight-Median Sketch for identifying heavily-weighted features in linear classifiers. First, for intuition and comparison, we describe a number of simple baseline methods based on maintaining a sparse weight vector at each iteration. The first two are based on truncating the weight vector to its largest-magnitude components; the next two are extensions of streaming heavy-hitters algorithms. We subsequently describe our new sketch-based approaches for finding heavily-weighted features.
3.1 Baseline Methods
Simple Truncation. Given a budget of K weights, a natural baseline method is to simply truncate the weight vector after each update to the K entries with highest absolute value, setting all other entries to zero. The simple truncation baseline is similar to the truncated Perceptron algorithm proposed by Hoi et al. (2012). We give a pseudocode description in Appendix B.
Probabilistic Truncation. A problem with the simple truncation method is that it may end up “stuck” with a bad set of weights: a “good” index that would have been included in the top-K set by the unconstrained classifier may fail to be included in the feature set under Alg. 3 if its gradient updates are insufficiently large relative to the smallest weight in the set; this results in the weight being repeatedly zeroed out in each iteration. To remedy this problem, we can instead adopt a randomized approach where indices are accepted into the sparse set with probability proportional to the magnitude of their weights. Therefore, even if some feature has a small but nonzero weight after an update, there is still a positive probability that it is accepted into the feature set. This “probabilistic truncation” algorithm is inspired by weighted reservoir sampling (Efraimidis and Spirakis, 2006). We give the pseudocode in Appendix B.
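The two truncation baselines can be sketched as follows (our own minimal rendering; the pseudocode in Appendix B additionally handles the streaming update loop around these steps):

```python
import heapq
import random

def simple_truncate(w, k):
    """Keep the k largest-magnitude entries of a sparse weight dict."""
    return dict(heapq.nlargest(k, w.items(), key=lambda kv: abs(kv[1])))

def probabilistic_truncate(w, k, rng=random):
    """Keep k entries, sampled with probability increasing in magnitude,
    using Efraimidis-Spirakis style keys u^(1/|w_i|): larger weights get
    keys closer to 1 and are more likely to survive."""
    keyed = [(rng.random() ** (1.0 / abs(v)), i, v)
             for i, v in w.items() if v != 0.0]
    return {i: v for _, i, v in heapq.nlargest(k, keyed)}
```

In the probabilistic variant, even a small-magnitude entry occasionally draws a large key and survives, which is exactly the escape hatch the deterministic rule lacks.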
Count-Min Frequent Features. The Count-Min sketch (Cormode and Muthukrishnan, 2005b) is a commonly-used method for finding frequent items in data streams. This baseline uses a Count-Min sketch to identify the most frequently-occurring features; the weights for these frequent features are maintained, while the remaining weights are set to 0.
Space Saving Frequent Features. This method is identical to the previous approach except for the use of the Space Saving algorithm (Metwally et al., 2005) in place of the Count-Min sketch for frequent item estimation. The Space Saving algorithm has previously been found to empirically outperform Count-Min in insertion-only settings such as ours (Cormode and Hadjieleftheriou, 2008).
3.2 Weight-Median Sketch
We propose a new algorithm—the Weight-Median Sketch (WM-Sketch)—for the problem of identifying heavily-weighted features. The main data structure in the WM-Sketch is identical to that used in the Count-Sketch. The sketch is parameterized by size k, depth s, and width k/s. We initialize the sketch with a size-k array set to zero. For a given depth s, we view this array as being arranged in s rows, each of width k/s (assume that k is a multiple of s). We denote this array as z, and equivalently view it as a vector in ℝ^k.
The high-level idea is that each row of the sketch is a compressed version of the model weight vector w, where each index i ∈ [d] is mapped to some assigned bucket in [k/s]. Since k/s ≪ d, there will be many collisions between these weights; therefore, we maintain s rows—each with different assignments of features to buckets—in order to disambiguate weights.
Hashing Features to Buckets. In order to avoid explicitly storing these mappings, which would require space linear in d, we implement the feature-to-bucket maps using hash functions. For each row j ∈ [s], we maintain a pair of hash functions, h_j: [d] → [k/s] and σ_j: [d] → {−1, +1}. We now explain how these hash functions are used to update and query the sketch.
Updates. Suppose an update δ_i to the weight of the i-th feature is given. We apply this update using the following procedure: for each row j, add the value σ_j(i) δ_i to bucket h_j(i) in that row. Let the matrix A denote the linear map implicitly represented by the hash functions h_j and σ_j. We can then write the update to z as z ← z + A δ, where δ is the d-dimensional vector of weight updates.
Instead of being provided the updates δ, we must compute them as a function of the input example (x_t, y_t) and the sketch state z_t. Given the loss function ℓ, we define the update to z as the online gradient descent update for the sketched example A x_t, where we first make a prediction τ_t = z_t^T (A x_t), and then compute the gradient of the loss at this value:

z_{t+1} = z_t − η_t y_t ℓ'(y_t τ_t) A x_t,

where η_t is the learning rate.
To build intuition, it is helpful to compare this update to the Count-Sketch update rule (Charikar et al., 2002). In the heavy-hitters setting, the input x_t is a one-hot encoding for the item seen in that time step. The update to the Count-Sketch state z_t is the following:

z_{t+1} = z_t + A x_t,

where A is defined identically as above. Therefore, our update rule is simply the Count-Sketch update scaled by a constant that depends on the current loss gradient. However, an important detail to note is that the Count-Sketch update is independent of the sketch state z_t, whereas the WM-Sketch update does depend on z_t. This cyclical dependency between the state and state updates is the main challenge in our analysis of the WM-Sketch.
Queries. To obtain an estimate ŵ_i of the i-th weight, we return the median of the values σ_j(i) z[j, h_j(i)] for j ∈ [s]. This is identical to the query procedure for the Count-Sketch.
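The update and query procedures above can be sketched together as follows. This is our own simplified illustration, not the paper's Algorithm 1: it averages per-row predictions rather than using the paper's exact scaling, omits regularization, and uses a clarity-oriented hash in place of the hash families discussed below.

```python
import hashlib
import math
from statistics import median

class WMSketch:
    """Simplified WM-Sketch: logistic-loss OGD updates on a
    Count-Sketch-style array, with median-of-rows weight queries."""
    def __init__(self, width, depth, lr=0.3):
        self.width, self.depth, self.lr = width, depth, lr
        self.z = [[0.0] * width for _ in range(depth)]

    def _bucket_sign(self, j, i):
        h = int.from_bytes(
            hashlib.blake2b(f"{j}:{i}".encode(), digest_size=8).digest(), "big")
        return h % self.width, (1 if (h >> 40) & 1 else -1)

    def _margin(self, x):
        # Prediction on the sketched example, averaged over rows.
        total = 0.0
        for j in range(self.depth):
            for i, v in x.items():
                b, s = self._bucket_sign(j, i)
                total += s * v * self.z[j][b]
        return total / self.depth

    def update(self, x, y):
        # Gradient step on the logistic loss of the sketched example;
        # the margin is clipped to keep exp() in range.
        m = max(-30.0, min(30.0, y * self._margin(x)))
        coef = -y / (1.0 + math.exp(m))
        for j in range(self.depth):
            for i, v in x.items():
                b, s = self._bucket_sign(j, i)
                self.z[j][b] -= self.lr * coef * s * v

    def weight(self, i):
        # Count-Sketch-style query: median of sign-corrected buckets.
        return median(s * self.z[j][b]
                      for j in range(self.depth)
                      for b, s in [self._bucket_sign(j, i)])
```

Note how the update reads the sketch (via the margin) before writing to it; this is precisely the state-dependence discussed above.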
We summarize the update and query procedures for the WM-Sketch in Algorithm 1. In the next section, we show how the sketch size k and depth s can be chosen to satisfy an approximation guarantee with failure probability δ over the randomness in the sketch matrix.
Efficient Weight Decay. It is important that ℓ2 regularization can be applied efficiently on the sketch state of size k. A naïve implementation that scales each entry in z by (1 − λη_t) in each iteration incurs an update cost of O(k), which masks the computational gains that can be realized when x_t is sparse. Here, we use a standard trick (Shalev-Shwartz et al., 2011): we maintain a global scale parameter c that scales the sketch values z. Initially, c = 1, and we update c ← (1 − λη_t) c to implement weight decay over the entire feature vector. Our weight estimates are therefore scaled by c: ŵ_i = c · median_j σ_j(i) z[j, h_j(i)]. This optimization reduces the cost of each sketch update from O(k) to O(s · nnz(x_t)).
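The global-scale trick can be illustrated as follows (our own sketch, with a naive reference implementation for comparison; the maintained invariant is that the true weight equals c times the stored value):

```python
def run_with_lazy_decay(updates, lam_eta=0.01):
    """Apply per-step weight decay via a single global scale c instead of
    rescaling every entry. Invariant: true weight w[i] == c * z[i]."""
    z, c = {}, 1.0
    for i, delta in updates:
        c *= (1.0 - lam_eta)              # decays *all* weights in O(1)
        z[i] = z.get(i, 0.0) + delta / c  # store the update in rescaled units
    return {i: c * v for i, v in z.items()}

def run_with_naive_decay(updates, lam_eta=0.01):
    """Reference: rescale every stored entry on every step (O(k) per step)."""
    w = {}
    for i, delta in updates:
        for j in w:
            w[j] *= (1.0 - lam_eta)
        w[i] = w.get(i, 0.0) + delta
    return w
```

One practical caveat (not specific to this paper): c shrinks geometrically, so the stored values z[i] grow over long streams and an implementation may periodically fold c back into the array to avoid overflow.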
Hash Functions and Independence. Our analysis of the WM-Sketch requires hash functions with a degree of independence that grows logarithmically with the dimension. In comparison, the Count-Sketch requires only 2-independent hashing. While hash functions satisfying this level of independence can be constructed using polynomial hashing (Carter and Wegman, 1977), hashing each input value would then require time proportional to the degree of independence, which can be costly when the dimension d is large. Instead of satisfying the full independence requirement, our implementation simply uses fast, 3-wise independent tabulation hashing (Pătraşcu and Thorup, 2012). In our experiments (Sec. 5), we did not observe any significant degradation in performance from this choice of hash function, which is consistent with empirical results in other hashing applications showing that the degree of independence does not significantly impact performance on real-world data (Mitzenmacher and Vadhan, 2008).
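For reference, simple tabulation hashing (which is 3-wise independent) can be sketched in a few lines; the factory-function shape and parameter names here are our own:

```python
import random

def make_tabulation_hash(num_buckets, key_bytes=4, seed=0):
    """Simple tabulation hashing: split the key into bytes and XOR together
    entries from per-position random lookup tables. Evaluation costs one
    table lookup and XOR per key byte, independent of the desired degree."""
    rng = random.Random(seed)
    tables = [[rng.getrandbits(32) for _ in range(256)]
              for _ in range(key_bytes)]
    def h(key):
        out = 0
        for pos in range(key_bytes):
            out ^= tables[pos][(key >> (8 * pos)) & 0xFF]
        return out % num_buckets
    return h
```

The space cost is key_bytes × 256 table entries, which is constant in the dimension d, and the per-key work is a handful of lookups; this is why tabulation hashing is attractive relative to evaluating a high-degree polynomial per input.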
3.3 Active-Set Weight-Median Sketch
We now describe a simple, heuristic extension to the WM-Sketch that significantly improves the recovery accuracy of the sketch in practice. We refer to this variant as the Active-Set Weight-Median Sketch (AWM-Sketch).
To efficiently track the top-K elements across sketch updates, we can use a min-heap ordered by the absolute value of the estimated weights. This technique is also used alongside heavy-hitters sketches to identify the most frequent items in the stream (Charikar et al., 2002). In the basic WM-Sketch, the heap merely functions as a mechanism to passively maintain the heaviest weights. This baseline scheme can be improved by noting that weights that are already stored in the heap need not be tracked in the sketch; instead, the sketch can be updated lazily only when a weight is evicted from the heap. This heuristic has previously been used in the context of improving count estimates derived from a Count-Min Sketch (Roy et al., 2016). The intuition here is the following: since we are already maintaining a heap of heavy items, we can utilize this structure to reduce error in the sketch that results from collisions with heavy items.
The heap can be thought of as an “active set” of high-magnitude weights, while the sketch estimates the contribution of the tail of the weight vector. Since the weights in the heap are represented exactly, this active set heuristic should intuitively yield better estimates of the heavily-weighted features in the model.
As a general note, similar coarse-to-fine approximation schemes have been proposed in other online learning settings. A similar scheme for memory-constrained sparse linear regression was analyzed by Steinhardt and Duchi (2015). Their algorithm similarly uses a Count-Sketch for approximating weights, but in a different setting (sparse linear regression) and with a different update policy for the active set.
4 Analysis
We derive bounds on the recovery error achieved by the WM-Sketch for given settings of the size k and depth s. The main challenge in our analysis is that the updates to the sketch depend on gradient estimates which in turn depend on the state of the sketch. This reflexive dependence makes it difficult to straightforwardly transplant the standard analysis for the Count-Sketch to our setting. Instead, we turn to ideas drawn from norm-preserving random projections and online convex optimization. In this section, we begin with an analysis of recovery error in the batch setting, where we are given access to a fixed dataset consisting of the first T examples observed in the stream and are allowed multiple passes over the data. Subsequently, we use this result to show guarantees in a restricted online case where we are only allowed a single pass through the data, but with the assumption that the order of the data is not chosen adversarially.
4.1 Batch Setting
To begin, we briefly outline the main ideas in our analysis. With high probability, we can sample a random projection to dimension k that satisfies the JL norm preservation property (Definition 4). We use this property to show that for any fixed dataset of size T, optimizing a projected version of the objective yields a solution that is close to the projection of the minimizer of the original, high-dimensional objective. We then use the observation that our JL map can be identified as a Count-Sketch projection; therefore, we can make use of existing error bounds for Count-Sketch estimates to bound the error of our recovered weight estimates.
Let R denote the sketching matrix implicitly defined by the hashing construction described in Algorithm 1, scaled by 1/√s: R = (1/√s) A. This is the hashing-based sparse JL projection proposed by Kane and Nelson (2014). We consider the following pair of objectives defined over the batch of observed examples {(x_t, y_t)}_{t=1}^{T}—the first defines the problem in the original space and the second defines the corresponding problem where the learner observes sketched examples R x_t:

F(w) = (1/T) Σ_{t=1}^{T} ℓ(y_t w^T x_t) + (λ/2) ||w||_2^2,

F̃(z) = (1/T) Σ_{t=1}^{T} ℓ(y_t z^T R x_t) + (λ/2) ||z||_2^2,

with regularization parameter λ > 0 in both cases.
Suppose we optimized these objectives to obtain solutions w* = argmin_w F(w) and z* = argmin_z F̃(z). How then does z* relate to w*, given our choice of sketching matrix R and regularization parameter λ? Intuitively, if we stored all the data observed up to time T and optimized over this dataset, we should hope that the optimal solution z* to the sketched problem is close to Rw*, the sketch of w*, in order to have any chance of recovering the largest weights of w*. We show that in this batch setting, ||z* − Rw*||_2 is indeed small; we then use this property to show elementwise error guarantees for the Count-Sketch recovery process. We first define a useful regularity condition on functions:
Definition 5.
A function f: ℝ^d → ℝ is β-strongly smooth with respect to a norm ||·|| if f is everywhere differentiable and if for all x, y we have:

f(y) ≤ f(x) + ∇f(x)^T (y − x) + (β/2) ||y − x||^2.
We now state our main result for recovery error in the batch setting:
Theorem 1.
Let the loss function ℓ be β-strongly smooth (with respect to ||·||_2). For fixed constants c1, c2, let the sketch size k and depth s be set as polynomial functions of 1/ε and log(d/δ) (the precise settings are given in Appendix A.1).

Let w* be the minimizer of the original objective function F and ŵ be the estimate of w* returned by performing Count-Sketch recovery on the minimizer z* of the projected objective function F̃. Then with probability 1 − δ over the choice of R,

||ŵ − w*||_∞ ≤ ε ||w*||_1.
We defer the full proof to Appendix A.1 but briefly discuss some salient properties of this recovery result. First, the feature dimension d enters only polylogarithmically into the sketch size k and the sparsity (i.e., depth) parameter s: this establishes that memory-efficient learning and recovery is possible in the high-dimensional regime k ≪ d that we are interested in. Moreover, our result is independent of the number of examples T; this is relevant as our applications involve learning over streams and datasets containing possibly millions of examples. For standard loss functions such as the logistic loss and the smoothed hinge loss, the smoothness parameter β is a small constant. Finally, k and s scale inversely with the strength of regularization λ: this is intuitive because additional regularization will shrink both w* and z* towards zero. We observe this inverse relationship between recovery error and regularization in practice (see Fig. 5). Also, note that the recovery error depends on the maximum ℓ2 norm of the data points x_t, and the bound is most optimistic when this norm is small. Across all of the applications we consider in Section 5 and Section 6, the data points are sparse with small ℓ2 norm, and hence the bound is meaningful across a number of interesting settings.
The per-parameter recovery error in Theorem 1 is bounded above by a multiple of the ℓ1 norm of the optimal weights w* for the uncompressed problem. This supports the intuition that sparse solutions with small ℓ1 norm should be more easily recovered. In practice, we can augment the objective with an additional ℓ1 penalty to induce sparsity; this corresponds to elastic net-style composite ℓ1/ℓ2 regularization on the parameters of the model (Zou and Hastie, 2005).
4.2 Online Setting
We now provide guarantees for the WM-Sketch in the online setting. We make two small modifications to the WM-Sketch for the convenience of analysis. First, we assume that the iterate z_t is projected onto an ℓ2 ball of radius D at every step. Second, we also assume that we perform the final Count-Sketch recovery on the average (1/T) Σ_t z_t of the weight vectors, instead of on the current iterate z_T. While using this averaged sketch is useful for the analysis, maintaining a duplicate data structure in practice for the purpose of accumulating the average would double the space cost of our method. Therefore, in our implementation of the WM-Sketch, we simply maintain the current iterate z_T. As we show in the next section, this approach achieves good performance on real-world datasets, in particular when combined with the active set heuristic.
Our guarantee holds in expectation over uniformly random permutations of the data points. In other words, we achieve low recovery error on average over all orderings in which the data points could have been presented. We believe this condition is necessary to avoid worst-case adversarial orderings of the data points: since the WM-Sketch update at any time step depends on the state of the sketch itself, adversarial orderings can potentially lead to high error accumulation.
Theorem 2.
Let the loss function be strongly smooth (with respect to the ℓ2 norm) and have its derivative bounded, and assume that the norms of the data points and of the gradients are bounded at every time step. For suitable settings of the sketch width and depth, let w* be the minimizer of the original objective function and let ŵ be the estimate returned by the WM-Sketch algorithm with averaging and projection onto the ball. Then, with high probability over the choice of the sketch's random hash functions, the expected recovery error is bounded, where the expectation is taken with respect to a uniformly random permutation of the order in which the samples are received.
Again, we defer the full proof to Appendix A.2.
Remark. Let us compare the guarantees for finding heavy-hitters in data streams with our guarantees for finding heavily-weighted features. The Count-Sketch uses O((1/ε²) log d) space to obtain frequency estimates with error ε‖f‖2, where f is the true frequency vector (Lemma 6), while the Count-Min Sketch uses O((1/ε) log d) space for error bounded by ε‖f‖1 (Cormode and Muthukrishnan, 2005b). In comparison, we use space polynomial in 1/ε and polylogarithmic in d to achieve error ε‖w*‖1 (Theorem 2). Thus, we obtain a guarantee of a similar flavor to bounds for heavy-hitters in this more general framework, but with somewhat worse polynomial dependencies.
5 Evaluation
In this section, we evaluate the Weight-Median Sketch on three standard binary classification datasets. Our goal here is to compare the WM-Sketch and AWM-Sketch against alternative limited-memory methods in terms of (1) recovery error in the estimated top weights, (2) classification error rate, and (3) runtime performance. In the next section, we explore specific applications of the WM-Sketch in stream processing tasks.
5.1 Datasets and Experimental Setup
We evaluated our proposed sketches on several standard benchmark datasets as well as in the context of specific streaming applications. Table 1 lists summary statistics for these datasets.
Classification Datasets. We evaluate recovery error on ℓ2-regularized online logistic regression trained on three standard binary classification datasets: Reuters RCV1 (Lewis et al., 2004), malicious URL identification (Ma et al., 2009), and the Algebra dataset from the KDD Cup 2010 large-scale data mining competition (Stamper et al., 2010; Yu et al., 2010). We use the standard training split for each dataset except for RCV1, where we use the larger “test” split, as is common in experimental evaluations using this dataset (Golovin et al., 2013).
Dataset | # Examples | # Features | Space (MB)
Reuters RCV1 | | | 0.4
Malicious URLs | | | 25.8
KDD Cup Algebra | | | 161.8
Senate/House Spend. | | | 4.2
Packet Trace | | | 1.0
Newswire | | | 375.2
For each dataset, we make a single pass through the set of examples. Across all our experiments, we use the same initial learning rate. We used the following set of space constraints: 2KB, 4KB, 8KB, 16KB, and 32KB. For each setting of the space budget and for each method, we evaluate a range of configurations compatible with that space constraint; for example, for the WM-Sketch, this corresponds to varying the space allocated to the heap and the sketch, as well as trading off between the sketch depth and width. For each setting, we run 10 independent trials with distinct random seeds; our plots show medians and the range between the worst and best run.
Memory Cost Model. In our experiments, we control for memory usage and configure each method to satisfy the given space constraints using the following cost model: we charge 4B of memory utilization for each feature identifier, feature weight, and auxiliary weight (e.g., random keys in Algorithm 4 or counts in the Space Saving baseline) used. For example, a simple truncation instance (Algorithm 3) with 128 entries uses 128 identifiers and 128 weights, corresponding to a memory cost of 1024B.
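The cost model above can be made concrete with a small helper; the function name is ours, and the second example assumes an active-set layout in which a 128-entry heap (identifiers plus weights) and a 256-bucket depth-1 sketch together fill a 2KB budget:

```python
def memory_cost_bytes(n_identifiers, n_weights, n_aux=0, bytes_per_field=4):
    """Cost model from the text: 4B per feature identifier, per feature
    weight, and per auxiliary value (e.g., random keys or counts)."""
    return bytes_per_field * (n_identifiers + n_weights + n_aux)

# Truncation with 128 entries: 128 identifiers + 128 weights -> 1024B.
truncation_cost = memory_cost_bytes(128, 128)
```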
5.2 Recovery Error Comparison
We measure the accuracy with which our methods recover the top-K weights in the model using the following relative error metric:
RelErr(ŵ_K, w) = ||ŵ_K − w||_2 / ||w_K − w||_2,
where ŵ_K is the sparse vector representing the top-K weights returned by a given method, w is the weight vector obtained by the uncompressed model, and w_K is the sparse vector representing the true top-K weights in w. The relative error metric is therefore bounded below by 1 and quantifies the relative suboptimality of the estimated top-K weights. The best configurations for the WM-Sketch and AWM-Sketch on RCV1 are listed in Table 2; the optimal configurations for the remaining datasets are similar.
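Assuming the standard form of this metric, RelErr = ||ŵ_K − w||_2 / ||w_K − w||_2, it can be computed as follows; the function names are illustrative, and sparse vectors are represented as dicts:

```python
import math

def top_k(w, k):
    """Sparse vector keeping only the k largest-magnitude entries of w."""
    keep = sorted(w, key=lambda i: abs(w[i]), reverse=True)[:k]
    return {i: w[i] for i in keep}

def relative_error(w_hat_k, w, k):
    """||w_hat_k - w||_2 / ||w_k - w||_2; equals 1 for perfect recovery."""
    w_k = top_k(w, k)

    def dist(u):
        return math.sqrt(sum((u.get(i, 0.0) - w[i]) ** 2 for i in w))

    return dist(w_hat_k) / dist(w_k)

# Small worked example: the true top-2 entries of example_w are a and b.
example_w = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.5}
```

Recovering exactly the true top-2 entries gives a relative error of 1; mistakenly keeping a smaller entry inflates it above 1.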
We compare our methods across datasets (Fig. 3) and across memory constraints on a single dataset (Fig. 4). For clarity, we omit the Count-Min Frequent Features baseline since we found that the Space Saving baseline achieved consistently better performance. We found that the AWM-Sketch consistently achieved lower recovery error than alternative methods on our benchmark datasets. The Space Saving baseline is competitive on RCV1 but underperforms the simple Probabilistic Truncation baseline on URL: this demonstrates that tracking frequent features can be effective if frequently-occurring features are also highly discriminative, but this property does not hold across all datasets. Standard feature hashing achieves poor recovery error since colliding features cannot be disambiguated.
In Fig. 5, we compare recovery error on RCV1 across different settings of the regularization parameter λ. Higher regularization results in lower recovery error, since both the true weights and the sketched weights are closer to 0; however, settings that are too high can result in increased classification error.
Budget (KB) | WM-Sketch (heap size, width, depth) | AWM-Sketch (heap size, width, depth)
2 | 128, 128, 2 | 128, 256, 1
4 | 256, 256, 2 | 256, 512, 1
8 | 128, 128, 14 | 512, 1024, 1
16 | 128, 128, 30 | 1024, 2048, 1
32 | 128, 256, 31 | 2048, 4096, 1
5.3 Classification Error Rate
We evaluated the classification performance of our models by measuring their online error rate (Blum et al., 1999): for each observed pair (x, y), we record whether the prediction (made without observing y) is correct before updating the model. The error rate is defined as the cumulative number of mistakes divided by the number of iterations. Our results are summarized in Fig. 6. For each dataset, we used the value of the regularization parameter λ that achieved the lowest error rate across all our memory-limited methods. For each method and for each memory budget, we chose the configuration that achieved the lowest error rate. For the WM-Sketch, this corresponded to a width of 128 or 256 with depth scaling proportionally with the memory budget; for the AWM-Sketch, the configuration that uniformly performed best allocated half the space to the active set and the remainder to a depth-1 sketch (i.e., a single hash table without any replication).
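Progressive validation as described above can be sketched as a generic harness; `predict` and `update` stand in for any of the evaluated models:

```python
def progressive_error_rate(stream, predict, update):
    """Online (progressive validation) error rate: predict each label
    before updating the model on the revealed example."""
    mistakes, seen = 0, 0
    for x, y in stream:
        if predict(x) != y:
            mistakes += 1
        update(x, y)
        seen += 1
    return mistakes / seen
```

For example, a constant +1 predictor on a balanced label stream has an error rate of exactly 0.5.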
We found that across all tested memory constraints, the AWM-Sketch consistently achieved a lower error rate than the heavy-hitter-based methods. Surprisingly, the AWM-Sketch also outperformed feature hashing by a small but consistent margin: 0.5–3.7% on RCV1, 0.1–0.4% on URL, and 0.2–0.5% on KDDA, with larger gains seen at smaller memory budgets. This suggests that the AWM-Sketch benefits from the precise representation of the largest, most influential weights in the model, and that these gains are sufficient to offset the increased collision rate due to the smaller hash table. The Space Saving baseline exhibited inconsistent performance across the three datasets, demonstrating that tracking the most frequent features is an unreliable heuristic: features that occur frequently are not necessarily the most predictive. We note that higher values of the regularization parameter λ correspond to greater penalization of rarely-occurring features; therefore, we would expect the Space Saving baseline to better approximate the performance of the unconstrained classifier as λ increases.
5.4 Runtime Performance
We evaluated runtime performance relative to a memory-unconstrained logistic regression model, using the same configurations as those chosen to minimize recovery error (Table 2). In all our timing experiments, we ran our implementations of the baseline methods, the WM-Sketch, and the AWM-Sketch on an Intel Xeon E5-2690 v4 processor with a 35MB cache, using a single core. The memory-unconstrained logistic regression weights were stored in an array of 32-bit floating point values with size equal to the dimensionality of the feature space, with the highest-weighted features tracked using a min-heap; reads and writes to the weight vector therefore required single array accesses. The remaining methods tracked heavy weights alongside 32-bit feature identifiers using a min-heap sized according to the corresponding configuration.
In our experiments, the fastest method was feature hashing, which incurred roughly a 2x overhead over the baseline; this overhead is due to the additional hashing step needed for each read and write to a feature index. The AWM-Sketch incurred a further 2x overhead over feature hashing due to more frequent heap maintenance operations.
6 Applications
We now show that a variety of tasks in stream processing can be framed as memory-constrained classification. The unifying theme between these applications is that classification is a useful abstraction whenever the use case calls for discriminating between streams or between subpopulations of a stream. These distinct classes can be identified by partitioning a single stream into quantiles (Sec. 6.1), comparing separate streams (Sec. 6.2), or even by generating synthetic examples to be distinguished from real samples (Sec. 6.3).
6.1 Streaming Explanation
In data analysis workflows, it is often necessary to identify characteristic attributes that are particularly indicative of a given subset of data (Meliou et al., 2014). For example, in order to diagnose the cause of anomalous readings in a sensor network, it is helpful to identify common features of the outlier points such as geographical location or time of day. This use case has motivated the development of methods for finding common properties of outliers found in aggregation queries (Wu and Madden, 2013) and in data streams (Bailis et al., 2017).
This task can be framed as a classification problem: assign positive labels to the outliers and negative labels to the inliers, then train a classifier to discriminate between the two classes. The identification of characteristic attributes then reduces to the problem of identifying heavily-weighted features in the trained model. In order to identify indicative conjunctions of attributes, we can simply augment the feature space to include arbitrary combinations of singleton features.
The relative risk, or risk ratio, is a statistical measure of the relative occurrence of the positive label when a feature is active versus when it is inactive. In the context of stream processing, the relative risk has been used to quantify the degree to which a particular attribute or attribute combination is indicative of a data point being an outlier relative to the overall population (Bailis et al., 2017). Here, we are interested in comparing our classifier-based approach to identifying high-risk features against the approach used in MacroBase (Bailis et al., 2017), an existing system for explaining outliers over streams that identifies candidate attributes using a variant of the Space Saving heavy-hitters algorithm.
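For reference, the relative risk can be computed from a 2x2 contingency table of attribute presence versus label; this is an illustrative helper with our own naming, not MacroBase's implementation:

```python
def relative_risk(outliers_with, inliers_with, outliers_without, inliers_without):
    """Relative risk of the outlier label for an attribute:
    P(outlier | attribute present) / P(outlier | attribute absent),
    computed from 2x2 contingency counts."""
    p_present = outliers_with / (outliers_with + inliers_with)
    p_absent = outliers_without / (outliers_without + inliers_without)
    return p_present / p_absent
```

For example, an attribute seen in 30 outliers and 10 inliers, versus 20 outliers and 140 inliers when absent, has relative risk (30/40) / (20/160) = 6.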
Experimental Setup. We used a publicly-available dataset of itemized disbursements by candidates in U.S. House and Senate races from 2010–2016 (FEC candidate disbursements data: http://classic.fec.gov/data/CandidateDisbursement.do). The outlier points were set to be the disbursements in the top 20% by dollar amount. For each row of the data, we generated a sequence of 1-sparse feature vectors corresponding to the observed attributes. (We could also generate a single feature vector per row, with sparsity greater than 1, but the learned weights would then correlate more weakly with the relative risk, due to the effect of correlations between features.) We set a space budget of 32KB for the AWM-Sketch.
Results. Our results are summarized in Figs. 8 and 9. The former empirically demonstrates that the heuristic of filtering features on the basis of frequency can be suboptimal for a fixed memory budget. This is due to features that are frequent in both the inlier and outlier classes: it is wasteful to maintain counts for these items since they have low relative risk. In Fig. 8, the top row shows the distribution of relative risks among the most frequent items within the positive class (left) and across both classes (right). In contrast, our classifier-based approaches use the allocated space more efficiently by identifying features at the extremes of the relative risk scale.
In Fig. 9, we show that the learned classifier weights are strongly correlated with the relative risk values estimated from true counts. Indeed, logistic regression weights can be interpreted in terms of log odds ratios, a quantity related to relative risk. These results show that the AWM-Sketch is a superior filter compared to heavy-hitters-based approaches for identifying high-risk features.
6.2 Network Monitoring
IP network monitoring is one of the primary application domains for sketches and other small-space summary methods (Venkataraman et al., 2005; Bandi et al., 2007; Yu et al., 2013). Here, we focus on the problem of finding packet-level features (for instance, source/destination IP addresses and prefixes, port numbers, network protocols, and header or payload characteristics) that differ significantly in relative frequency between a pair of network links.
This problem of identifying significant relative differences, also known as relative deltoids, was studied by Cormode and Muthukrishnan (2005a). Concretely, the problem is to estimate, for each item i, the ratio c1(i)/c2(i), where c1 and c2 denote occurrence counts in each stream, and to identify those items for which this ratio, or its reciprocal, is large. Here, we are interested in identifying differences between traffic streams that are observed concurrently; in contrast, the empirical evaluation in Cormode and Muthukrishnan (2005a) focused on comparisons between different time periods.
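As a point of reference, an exact-count (non-sketched) version of relative deltoid detection can be written as follows; the additive smoothing constant is an illustrative choice to avoid division by zero for items absent from one stream:

```python
from collections import Counter

def relative_deltoids(stream_a, stream_b, threshold, smoothing=1.0):
    """Exact-count reference for relative deltoids: items whose smoothed
    count ratio between two streams, in either direction, meets the
    threshold."""
    ca, cb = Counter(stream_a), Counter(stream_b)
    deltoids = []
    for item in set(ca) | set(cb):
        ratio = (ca[item] + smoothing) / (cb[item] + smoothing)
        if ratio >= threshold or 1.0 / ratio >= threshold:
            deltoids.append(item)
    return sorted(deltoids)
```

The sketched approaches in this section approximate the ranking this reference computes, without storing per-item counts.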
Experimental Setup. We used a subset of an anonymized, publicly-available passive traffic trace dataset recorded at a peering link for a large ISP (CAIDA OC48 traces). The positive class was the stream of outbound source IP addresses and the negative class was the stream of inbound destination IP addresses. We compared against several baseline methods, including ratio estimation using a pair of Count-Min sketches (as in Cormode and Muthukrishnan (2005a)). For each method, we retrieved the top-2048 features (i.e., IP addresses in this case) and computed the recall against the set of features above a given ratio threshold, where the reference ratios were computed using exact counts.
Results. We found that the AWM-Sketch performed comparably to the memory-unconstrained logistic regression baseline on this benchmark. It significantly outperformed the paired Count-Min baseline, achieving over 4x higher recall at the same memory budget, and also outperformed a paired Count-Min baseline that was allocated 8x the memory budget. These results indicate that linear classifiers can be used effectively to identify relative deltoids over pairs of data streams.
6.3 Streaming Pointwise Mutual Information
Pointwise mutual information (PMI), a measure of the statistical correlation between a pair of events x and y, is defined as:
PMI(x, y) = log [ p(x, y) / ( p(x) p(y) ) ].
Intuitively, positive values of the PMI indicate events that are positively correlated, negative values indicate events that are negatively correlated, and a PMI of 0 indicates uncorrelated events.
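Concretely, the PMI and its plug-in estimate from corpus counts can be computed as follows (the helper names are ours):

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information of events x and y from their joint
    and marginal probabilities: log p(x, y) / (p(x) p(y))."""
    return math.log(p_xy / (p_x * p_y))

def pmi_from_counts(c_xy, c_x, c_y, n_pairs, n_tokens):
    """Plug-in PMI estimate from bigram and unigram counts."""
    return pmi(c_xy / n_pairs, c_x / n_tokens, c_y / n_tokens)
```

With p(x) = 0.1 and p(y) = 0.2, a joint probability of 0.04 (co-occurring twice as often as independence predicts) gives a positive PMI, while 0.01 gives a negative one.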
In natural language processing, PMI is a frequently-used measure of word association (Turney and Pantel, 2010). Traditionally, the PMI is estimated using empirical counts of unigrams and bigrams obtained from a text corpus. The key problem with this approach is that the number of bigrams in standard natural language corpora can grow very large; for example, we found 47M unique co-occurring pairs of tokens in a small subset of a standard newswire corpus. This combinatorial growth in the feature dimension is further amplified when considering higher-order generalizations of PMI.
More generally, streaming PMI estimation can be used to detect pairs of events whose occurrences are strongly correlated. For example, we can consider a streaming log monitoring use case where correlated events are potentially indicative of cascading failures or trigger events resulting in exceptional behavior in the system. Therefore, we expect that the techniques developed here should be useful beyond standard NLP applications.
Sparse Online PMI Estimation. Streaming PMI estimation using approximate counting has previously been studied (Durme and Lall, 2009); however, this approach has the drawback that memory usage still scales linearly with the number of observed bigrams. Here, we explore streaming PMI estimation from a different perspective: we pose a binary classification problem over the space of bigrams with the property that the model weights asymptotically converge to an estimate of the PMI. (This classification formulation is used in the popular word2vec skip-gram method for learning word embeddings (Mikolov et al., 2013); the connection to PMI approximation was first observed by Levy and Goldberg (2014). To our knowledge, we are the first to apply this formulation in the context of sparse PMI estimation.)
The classification problem is set up as follows: in each iteration, with probability 1/2 we sample a bigram (u, v) from the observed bigram distribution and set the label y = +1; otherwise, we sample (u, v) from the product of the unigram distributions and set y = −1. The input is the 1-sparse vector x where the index corresponding to (u, v) is set to 1. We train a logistic regression model to discriminate between the true and synthetic samples. With balanced sampling, the model asymptotically converges to σ(w_{uv}) = P(y = +1 | (u, v)) for all pairs (u, v), where σ is the logistic function; it follows that w_{uv} converges to log [ p(u, v) / ( p(u) p(v) ) ], which is exactly the PMI of (u, v). If negative samples are drawn more frequently than true samples, we obtain an estimate that is biased, but with reduced variance in the estimates for rare bigrams.
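The sampling scheme above can be sketched in a few lines; this toy version stores weights in a plain dict rather than in the AWM-Sketch, and its hyperparameters are illustrative:

```python
import math
import random

def estimate_pmi_weights(bigrams, n_steps=20000, lr=0.2, seed=0):
    """Toy sparse online PMI estimation: logistic regression over 1-sparse
    bigram indicators. Positives come from the empirical bigram distribution,
    negatives from the product of the unigram marginals; with balanced
    sampling the weight for each pair tends toward its PMI."""
    rng = random.Random(seed)
    left = [u for u, _ in bigrams]
    right = [v for _, v in bigrams]
    w = {}
    for _ in range(n_steps):
        if rng.random() < 0.5:
            pair, y = rng.choice(bigrams), 1.0                      # true sample
        else:
            pair, y = (rng.choice(left), rng.choice(right)), -1.0   # synthetic
        z = y * w.get(pair, 0.0)
        grad = -y / (1.0 + math.exp(z))  # logistic-loss derivative
        w[pair] = w.get(pair, 0.0) - lr * grad
    return w
```

On a corpus where "new york" always co-occurs, the learned weight for ("new", "york") is positive (its joint probability exceeds the product of its marginals), while never-observed cross pairs such as ("new", "dog") receive negative weight.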
Pair | PMI | Est. | Pair | PMI
prime minister | 6.339 | 7.609 | , the | 0.044
los angeles | 7.197 | 7.047 | the , | 0.082
http / | 6.734 | 7.001 | the of | 0.611
human rights | 6.079 | 6.721 | the . | 0.057
Experimental Setup. We train on a subset of a standard newswire corpus (Chelba et al., 2013); the subset contains 77.7M tokens, 605K unique unigrams, and 47M unique bigrams over a sliding window of size 6. In our implementation, we approximate sampling from the unigram distribution by sampling from a reservoir sample of tokens (May et al., 2017; Kaji and Kobayashi, 2017). We estimated weights using the AWM-Sketch with heap size 1024 and depth 1; the reservoir size was fixed at 4000. We make a single pass through the dataset and generate 5 negative samples for every true sample. Strings were first hashed to 32-bit values using MurmurHash3 (https://github.com/aappleby/smhasher/wiki/MurmurHash3); these identifiers were hashed again to obtain sketch bucket indices.
Results. For the width settings we tested, our implementation’s total memory usage remained modest; in this regime, memory usage was dominated by the storage of strings in the heap and the unigram reservoir. For comparison, the standard approach to PMI estimation requires 188MB of space to store exact 32-bit counts for all bigrams, excluding the space required for storing strings or the token indices corresponding to each count. In Table 3, we show sample pairs retrieved by our method; the PMI values estimated from exact counts are well-approximated by the classifier weights. In Fig. 11, we show that at small widths, the high collision rate results in the retrieval of noisy, low-PMI pairs; as the width increases, we retrieve higher-PMI pairs, which typically occur with lower frequency. Further, regularization helps discard low-frequency pairs, but can result in the model missing high-PMI but less-frequent pairs.
7 Related Work
Heavy-Hitters in Data Streams. The heavy-hitters problem has been extensively studied in the streaming algorithms literature. Given a sequence of items, the goal is to return the set of all items whose frequency exceeds a specified fraction of the total number of items. Algorithms for finding frequent items can be broadly categorized into counter-based approaches (Manku and Motwani, 2002; Demaine et al., 2002; Karp et al., 2003; Metwally et al., 2005), quantile algorithms (Greenwald and Khanna, 2001; Shrivastava et al., 2004), and sketch-based methods (Charikar et al., 2002; Cormode and Muthukrishnan, 2005b).
Mirylenka et al. (2015) develop streaming algorithms for finding conditional heavy-hitters, where the goal is to identify items that are frequent in the context of a separate “parent” item. While these methods can be used in applications related to ours, their counter-based algorithms differ significantly from our classification-based approach.
Characterizing Changes in Data Streams. The problem of detecting and characterizing significant absolute and relative differences between data streams has been studied by several authors. Cormode and Muthukrishnan (2005a) proposed a Count-Min-based algorithm for identifying items whose occurrence rates differ significantly; Schweller et al. (2004) propose reversible hashes in this context to avoid storing key information. These approaches focus primarily on identifying differences between time periods, whereas we focus on the case where streams are observed concurrently. In order to explain anomalous traffic flows, Brauckhoff et al. (2012) proposed techniques using histogram-based detectors and association rules. Using histograms to identify changes in feature occurrence is appropriate for detecting large absolute differences; in contrast, we focus on detecting relative differences, a problem which has previously been found to be challenging (Cormode and Muthukrishnan, 2005a).
Resource-Efficient ML. Gupta et al. (2017) and Kumar et al. (2017) explore the use of tree-based and nearest-neighbor-based classifiers on highly resource-constrained devices. A large body of work studies resource-efficient inference at test time (Xu et al., 2012, 2013; Hsieh et al., 2014). Unlike budget kernel methods (Crammer et al., 2004) that maintain a small set of support vectors, we aim to directly restrict the number of parameters stored. Our work also differs from model compression or distillation, which aims to imitate a large, expensive model using a smaller one with lower memory and computation costs (Buciluǎ et al., 2006; Ba and Caruana, 2014; Hinton et al., 2015).
Sparsity-Inducing Regularization. ℓ1 regularization is a standard technique for encouraging parameter sparsity in online learning (Langford et al., 2009; Duchi and Singer, 2009; Xiao, 2010; McMahan, 2011). In practice, it is difficult to select an ℓ1 regularization strength a priori in order to satisfy a given sparsity budget; it can therefore be problematic to apply standard online learning methods in settings with hard memory constraints. In this paper, we propose a different approach: we first fix a memory budget and then use the allocated space to approximate a classifier, with the property that our approximation will be better for parameter vectors with small ℓ1 norm (see Theorem 2). We note that the WM-Sketch and AWM-Sketch are compatible with the use of an ℓ1 regularizer in addition to the requisite ℓ2 term.
Feature Hashing. Feature hashing (Shi et al., 2009; Weinberger et al., 2009) is a technique where the classifier is trained on features that have been hashed to a fixed-width table. This approach lowers memory usage by reducing the dimension of the feature space and by obviating the need for storing feature identifiers, but at the cost of model interpretability.
Compressed Learning. Calderbank et al. introduced compressed learning, where a classifier is trained on data obtained via compressive sensing (Candes and Tao, 2006; Donoho, 2006). The authors focus on classification performance in the compressed domain and do not consider the problem of recovering weights in the original space.
8 Discussion
Active Set vs. Multiple Hashing. In the basic WM-Sketch, multiple hashing is needed in order to disambiguate features that collide in a heavy bucket; we expect that features with truly high weight will correspond to large values in the majority of the buckets they hash to. The active set approach uses a different mechanism for disambiguation. Suppose that all the features that hash to a heavy bucket are added to the active set; we expect that the weights for those features that were erroneously added will eventually decay (due to regularization) to the point that they are evicted from the active set, while the truly high-weight features are retained. The AWM-Sketch can therefore be interpreted as a variant of feature hashing where the highest-weighted features are not hashed.
The Cost of Interpretability. We initially motivated the development of the WM-Sketch with the dual goals of memory-efficient learning and model interpretability. A natural question to ask is: what is the cost of interpretability? What do we sacrifice in classification accuracy when we allocate memory to storing feature identifiers relative to feature hashing, which maintains only feature weights? A surprising finding in our evaluation on standard binary classification datasets was that the AWM-Sketch achieved uniformly better classification accuracy compared to feature hashing. We hypothesize that the observed gains are due to reduced collisions with heavily-weighted features. This result suggests that in some cases, we can essentially gain model interpretability for free.
Per-Feature Learning Rates. In previous work on online learning applications, practitioners have found that per-feature learning rates can significantly improve classification performance (McMahan et al., 2013). An open question is whether variable learning rates across features are worth the associated memory cost in the streaming setting.
Multiclass Classification. The WM-Sketch can be extended to the multiclass setting as follows. Given K output classes, for each row in the sketch, hash each feature using K independent hash functions, thus maintaining K weights for each feature. In order to predict the output distribution, evaluate the inner product of the feature vector with each set of weights and apply a softmax function to the result. For large K, for instance in language modeling applications, this procedure can be computationally expensive, since update time scales linearly with K. In this regime, we can apply noise contrastive estimation (Gutmann and Hyvärinen, 2010), a standard reduction to binary classification, to learn the model parameters.
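A minimal sketch of the prediction step of this multiclass extension, with illustrative CRC32 hashing and our own naming, assuming sparse binary features:

```python
import math
import zlib

def class_bucket(feat, cls, width):
    """One independent hash per class maps (feature, class) to a slot."""
    return zlib.crc32(f"{cls}|{feat}".encode()) % width

def predict_distribution(feats, table):
    """Softmax over per-class scores; table[k] is class k's weight row."""
    width = len(table[0])
    scores = [sum(row[class_bucket(f, k, width)] for f in feats)
              for k, row in enumerate(table)]
    m = max(scores)                      # stabilize the softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

A gradient update would touch the same K slots per feature, which is the source of the linear-in-K update cost noted above.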
Further Extensions. The WM-Sketch is likely amenable to further extensions that can improve update time and further reduce memory usage. Our method is equivalent to online gradient descent using random projections of input features, and gradient updates can be performed asynchronously with relaxed cache coherence requirements between cores (Recht et al., 2011). Additionally, our methods are orthogonal to reduced-precision techniques like randomized rounding (Raghavan and Tompson, 1987; Golovin et al., 2013) and approximate counting (Flajolet, 1985); these methods can be used in combination to realize further memory savings.
9 Conclusions
In this paper, we introduced the Weight-Median Sketch for the problem of identifying heavily-weighted features in linear classifiers over streaming data. We showed theoretical guarantees for our method, drawing on techniques from online learning and norm-preserving random projections. In our empirical evaluation, we showed that the Active Set extension to the basic Weight-Median Sketch is highly effective, achieving superior weight recovery and competitive classification error compared to baseline methods across several standard binary classification benchmarks. Finally, we explored promising applications of our methods by framing existing stream processing tasks as classification problems. We believe this machine learning perspective on sketch-based stream processing may prove to be a fruitful direction for future research in advanced streaming analytics.
References
 (1) The CAIDA UCSD Anonymized Passive OC48 Internet Traces dataset. http://www.caida.org/data/passive/passive_oc48_dataset.xml.
 Achlioptas (2003) Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.
 Ba and Caruana (2014) Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in neural information processing systems, pages 2654–2662, 2014.
 Bailis et al. (2017) Peter Bailis, Edward Gan, Samuel Madden, Deepak Narayanan, Kexin Rong, and Sahaana Suri. Macrobase: Prioritizing attention in fast data. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 541–556. ACM, 2017.
 Bandi et al. (2007) Nagender Bandi, Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Fast data stream algorithms using associative memories. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 247–256. ACM, 2007.
 Blum et al. (1999) Avrim Blum, Adam Kalai, and John Langford. Beating the holdout: Bounds for kfold and progressive crossvalidation. In Proceedings of the twelfth annual conference on Computational learning theory, pages 203–208. ACM, 1999.
 Boykin et al. (2014) Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin. Summingbird: A framework for integrating batch and online mapreduce computations. Proceedings of the VLDB Endowment, 7(13):1441–1451, 2014.
 Brauckhoff et al. (2012) Daniela Brauckhoff, Xenofontas Dimitropoulos, Arno Wagner, and Kavè Salamatian. Anomaly extraction in backbone networks using association rules. IEEE/ACM Transactions on Networking (TON), 20(6):1788–1799, 2012.
 Buciluǎ et al. (2006) Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541. ACM, 2006.
 (10) Robert Calderbank, Sina Jafarpour, and Robert Schapire. Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain.
 Candes and Tao (2006) Emmanuel J Candes and Terence Tao. Nearoptimal signal recovery from random projections: Universal encoding strategies? IEEE transactions on information theory, 52(12):5406–5425, 2006.
 Carter and Wegman (1977) J Lawrence Carter and Mark N Wegman. Universal classes of hash functions. In Proceedings of the ninth annual ACM symposium on Theory of computing, pages 106–112. ACM, 1977.
 Charikar et al. (2002) Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Automata, Languages and Programming, pages 784–784, 2002.
 Chelba et al. (2013) Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.
 Corbett-Davies et al. (2017) Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. arXiv preprint arXiv:1701.08230, 2017.
 Cormode and Hadjieleftheriou (2008) Graham Cormode and Marios Hadjieleftheriou. Finding frequent items in data streams. Proceedings of the VLDB Endowment, 1(2):1530–1541, 2008.
 Cormode and Muthukrishnan (2005a) Graham Cormode and S Muthukrishnan. What’s new: Finding significant differences in network data streams. IEEE/ACM Transactions on Networking (TON), 13(6):1219–1232, 2005a.
 Cormode and Muthukrishnan (2005b) Graham Cormode and Shan Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005b.
 Crammer et al. (2004) Koby Crammer, Jaz Kandola, and Yoram Singer. Online classification on a budget. In Advances in neural information processing systems, pages 225–232, 2004.
 Demaine et al. (2002) Erik D Demaine, Alejandro López-Ortiz, and J Ian Munro. Frequency estimation of internet packet streams with limited space. In European Symposium on Algorithms, pages 348–360. Springer, 2002.
 Donoho (2006) David L Donoho. Compressed sensing. IEEE Transactions on information theory, 52(4):1289–1306, 2006.
 Duchi and Singer (2009) John Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10(Dec):2899–2934, 2009.
 Van Durme and Lall (2009) Benjamin Van Durme and Ashwin Lall. Streaming pointwise mutual information. In Advances in Neural Information Processing Systems, pages 1892–1900, 2009.
 Efraimidis and Spirakis (2006) Pavlos S Efraimidis and Paul G Spirakis. Weighted random sampling with a reservoir. Information Processing Letters, 97(5):181–185, 2006.
 Flajolet (1985) Philippe Flajolet. Approximate counting: a detailed analysis. BIT Numerical Mathematics, 25(1):113–134, 1985.
 Golovin et al. (2013) Daniel Golovin, D Sculley, Brendan McMahan, and Michael Young. Large-scale learning with less RAM via randomization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 325–333, 2013.
 Goodman and Flaxman (2016) Bryce Goodman and Seth Flaxman. EU regulations on algorithmic decision-making and a "right to explanation". In ICML Workshop on Human Interpretability in Machine Learning (WHI 2016), New York, NY. http://arxiv.org/abs/1606.08813v1, 2016.
 Greenwald and Khanna (2001) Michael Greenwald and Sanjeev Khanna. Spaceefficient online computation of quantile summaries. In ACM SIGMOD Record, volume 30, pages 58–66. ACM, 2001.
 Gupta et al. (2017) Chirag Gupta, Arun Sai Suggala, Ankit Goyal, Harsha Vardhan Simhadri, Bhargavi Paranjape, Ashish Kumar, Saurabh Goyal, Raghavendra Udupa, Manik Varma, and Prateek Jain. ProtoNN: Compressed and accurate kNN for resource-scarce devices. In International Conference on Machine Learning, pages 1331–1340, 2017.
 Gutmann and Hyvärinen (2010) Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304, 2010.
 Hazan et al. (2016) Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3–4):157–325, 2016.
 Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 Hoi et al. (2012) Steven CH Hoi, Jialei Wang, Peilin Zhao, and Rong Jin. Online feature selection for mining big data. In Proceedings of the 1st international workshop on big data, streams and heterogeneous source mining: Algorithms, systems, programming models and applications, pages 93–100. ACM, 2012.
 Hsieh et al. (2014) Cho-Jui Hsieh, Si Si, and Inderjit S Dhillon. Fast prediction for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 3689–3697, 2014.
 Johnson and Lindenstrauss (1984) William B Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
 Kaji and Kobayashi (2017) Nobuhiro Kaji and Hayato Kobayashi. Incremental skipgram model with negative sampling. arXiv preprint arXiv:1704.03956, 2017.
 Kane and Nelson (2014) Daniel M Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM (JACM), 61(1):4, 2014.
 Karp et al. (2003) Richard M Karp, Scott Shenker, and Christos H Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems (TODS), 28(1):51–55, 2003.
 Kumar et al. (2017) Ashish Kumar, Saurabh Goyal, and Manik Varma. Resource-efficient machine learning in 2 KB RAM for the internet of things. In International Conference on Machine Learning, pages 1935–1944, 2017.
 Langford et al. (2009) John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10(Mar):777–801, 2009.
 Larsen et al. (2016) Kasper Green Larsen, Jelani Nelson, Huy L Nguyên, and Mikkel Thorup. Heavy hitters via cluster-preserving clustering. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 61–70. IEEE, 2016.
 Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pages 2177–2185, 2014.
 Lewis et al. (2004) David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of machine learning research, 5(Apr):361–397, 2004.
 Lipton (2016) Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
 Luo et al. (2016) Ge Luo, Lu Wang, Ke Yi, and Graham Cormode. Quantiles over data streams: experimental comparisons, new analyses, and further improvements. The VLDB Journal, 25(4):449–472, 2016.
 Ma et al. (2009) Justin Ma, Lawrence K Saul, Stefan Savage, and Geoffrey M Voelker. Identifying suspicious URLs: an application of large-scale online learning. In Proceedings of the 26th annual international conference on machine learning, pages 681–688. ACM, 2009.
 Manku and Motwani (2002) Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th international conference on Very Large Data Bases, pages 346–357. VLDB Endowment, 2002.
 May et al. (2017) Chandler May, Kevin Duh, Benjamin Van Durme, and Ashwin Lall. Streaming word embeddings with the space-saving algorithm. arXiv preprint arXiv:1704.07463, 2017.
 McMahan (2011) Brendan McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 525–533, 2011.
 McMahan et al. (2013) H Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1222–1230. ACM, 2013.
 Meliou et al. (2014) Alexandra Meliou, Sudeepa Roy, and Dan Suciu. Causality and explanations in databases. In VLDB, 2014.
 Metwally et al. (2005) Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Efficient computation of frequent and top-k elements in data streams. In International Conference on Database Theory, pages 398–412. Springer, 2005.
 Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 Mirylenka et al. (2015) Katsiaryna Mirylenka, Graham Cormode, Themis Palpanas, and Divesh Srivastava. Conditional heavy hitters: detecting interesting correlations in data streams. The VLDB Journal, 24(3):395–414, 2015.
 Mitzenmacher and Vadhan (2008) Michael Mitzenmacher and Salil Vadhan. Why simple hash functions work: exploiting the entropy in a data stream. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pages 746–755. Society for Industrial and Applied Mathematics, 2008.
 Pătraşcu and Thorup (2012) Mihai Pătraşcu and Mikkel Thorup. The power of simple tabulation hashing. Journal of the ACM (JACM), 59(3):14, 2012.
 Raghavan and Tompson (1987) Prabhakar Raghavan and Clark D Tompson. Randomized rounding: a technique for provably good algorithms and algorithmic proofs. Combinatorica, 7(4):365–374, 1987.
 Recht et al. (2011) Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011.
 Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
 Roy et al. (2016) Pratanu Roy, Arijit Khan, and Gustavo Alonso. Augmented sketch: Faster and more accurate stream processing. In Proceedings of the 2016 International Conference on Management of Data, pages 1449–1463. ACM, 2016.
 Schweller et al. (2004) Robert Schweller, Ashish Gupta, Elliot Parsons, and Yan Chen. Reversible sketches for efficient and accurate change detection over network data streams. In Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pages 207–212. ACM, 2004.
 Shalev-Shwartz et al. (2011) Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical programming, 127(1):3–30, 2011.
 Shamir (2016) Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 46–54. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6245-without-replacement-sampling-for-stochastic-gradient-methods.pdf.
 Shi et al. (2009) Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alexander L Strehl, Alex J Smola, and SVN Vishwanathan. Hash kernels. In International Conference on Artificial Intelligence and Statistics, pages 496–503, 2009.
 Shrivastava et al. (2004) Nisheeth Shrivastava, Chiranjeeb Buragohain, Divyakant Agrawal, and Subhash Suri. Medians and beyond: new aggregation techniques for sensor networks. In Proceedings of the 2nd international conference on Embedded networked sensor systems, pages 239–249. ACM, 2004.
 Stamper et al. (2010) J. Stamper, A. Niculescu-Mizil, S. Ritter, G.J. Gordon, and K.R. Koedinger. Algebra I 2008–2009. Challenge data set from KDD Cup 2010 Educational Data Mining Challenge, 2010. Available at http://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp.
 Steinhardt and Duchi (2015) Jacob Steinhardt and John Duchi. Minimax rates for memory-bounded sparse linear regression. In Conference on Learning Theory, pages 1564–1587, 2015.
 Turney and Pantel (2010) Peter D Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research, 37:141–188, 2010.
 Venkataraman et al. (2005) Shoba Venkataraman, Dawn Song, Phillip B Gibbons, and Avrim Blum. New streaming algorithms for fast detection of superspreaders. In Proceedings of the Network and Distributed System Security Symposium (NDSS), 2005.
 Weinberger et al. (2009) Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113–1120. ACM, 2009.
 Wu and Madden (2013) Eugene Wu and Samuel Madden. Scorpion: Explaining away outliers in aggregate queries. Proceedings of the VLDB Endowment, 6(8):553–564, 2013.
 Xiao (2010) Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543–2596, 2010.
 Xu et al. (2012) Zhixiang Xu, Kilian Weinberger, and Olivier Chapelle. The greedy miser: Learning under test-time budgets. arXiv preprint arXiv:1206.6451, 2012.
 Xu et al. (2013) Zhixiang Xu, Matt Kusner, Kilian Weinberger, and Minmin Chen. Cost-sensitive tree of classifiers. In International Conference on Machine Learning, pages 133–141, 2013.
 Yang et al. (2015) Tianbao Yang, Lijun Zhang, Rong Jin, and Shenghuo Zhu. Theory of dual-sparse regularized randomized reduction. In International Conference on Machine Learning, pages 305–314, 2015.
 Yu et al. (2010) Hsiang-Fu Yu, Hung-Yi Lo, Hsun-Ping Hsieh, Jing-Kai Lou, Todd G McKenzie, Jung-Wei Chou, Po-Han Chung, Chia-Hua Ho, Chun-Fu Chang, Yin-Hsuan Wei, et al. Feature engineering and classifier ensemble for KDD Cup 2010. In KDD Cup, 2010.
 Yu et al. (2013) Minlan Yu, Lavanya Jose, and Rui Miao. Software defined traffic measurement with OpenSketch. In NSDI, volume 13, pages 29–42, 2013.
 Zhang et al. (2016) Ce Zhang, Arun Kumar, and Christopher Ré. Materialization optimizations for feature selection workloads. ACM Transactions on Database Systems (TODS), 41(1):2, 2016.
 Zhang et al. (2014) Lijun Zhang, Mehrdad Mahdavi, Rong Jin, Tianbao Yang, and Shenghuo Zhu. Random projections for classification: A recovery approach. IEEE Transactions on Information Theory, 60(11):7300–7316, 2014.
 Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML03), pages 928–936, 2003.
 Zou and Hastie (2005) Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
Appendix A Proofs
A.1 Proof of Theorem 1
We will use the duals of $F$ and $\tilde{F}$ to show that $\tilde{z}$ is close to $Aw^*$. Some other works, such as Zhang et al. (2014) and Yang et al. (2015), have also analyzed random projections via the dual; this has the advantage that the dual variables are often easier to compare, as they at least have the same dimensionality. Note that $Aw^*$ is essentially the count-sketch projection of $w^*$; hence, showing that $\tilde{z}$ is close to $Aw^*$ will allow us to show that doing count-sketch recovery using $\tilde{z}$ is comparable to doing count-sketch recovery from the projection of $w^*$ itself, and hence give us the desired error bounds. We first derive the dual form of the objective function $F$; the dual of $\tilde{F}$ can be derived analogously. Let $u_i = y_i (w \cdot x_i)$. Then we can write the primal as:

$F(w, u) = \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{T}\sum_{i=1}^{T} \ell(u_i)$

subject to $u_i = y_i (w \cdot x_i)$ for $i = 1, \dots, T$.
Define $\theta_i = y_i x_i$, i.e., the $i$th data point times its label. Let $X$ be the matrix of data points such that the $i$th column is $\theta_i$. Let $K = X^\top X$ be the kernel matrix corresponding to the original data points. Taking the Lagrangian, and minimizing with respect to the primal variables $w$ and $u$, gives us the following dual objective function in terms of the dual variable $\alpha$:

$D(\alpha) = \frac{1}{T}\sum_{i=1}^{T} \ell^*(-\alpha_i) + \frac{1}{2\lambda T^2}\,\alpha^\top K \alpha,$
where $\ell^*$ is the Fenchel conjugate of $\ell$. Also, if $\alpha^*$ is the minimizer of $D$, then the minimizer of $F$ is given by $w^* = \frac{1}{\lambda T} X \alpha^*$. Similarly, let $\tilde{K} = (AX)^\top (AX)$ be the kernel matrix corresponding to the projected data points. We can write down the dual of the projected primal objective function $\tilde{F}$ in terms of the dual variable $\alpha$ as follows:

$\tilde{D}(\alpha) = \frac{1}{T}\sum_{i=1}^{T} \ell^*(-\alpha_i) + \frac{1}{2\lambda T^2}\,\alpha^\top \tilde{K} \alpha.$
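As a concrete instance (an illustrative choice on our part, not one fixed by the analysis), taking the loss $\ell$ to be the hinge loss makes the Fenchel conjugate explicit and recovers the familiar SVM dual:

```latex
% Hinge loss and its Fenchel conjugate:
\ell(u) = \max(0,\, 1 - u), \qquad
\ell^*(v) =
  \begin{cases}
    v       & \text{if } -1 \le v \le 0, \\
    +\infty & \text{otherwise,}
  \end{cases}
% so that \ell^*(-\alpha_i) = -\alpha_i under the constraint
% 0 \le \alpha_i \le 1, and the dual objective becomes
D(\alpha) = -\frac{1}{T}\sum_{i=1}^{T} \alpha_i
            + \frac{1}{2\lambda T^2}\,\alpha^\top K \alpha,
\qquad 0 \le \alpha_i \le 1.
```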
As before, if $\tilde{\alpha}^*$ is the minimizer of $\tilde{D}$, then the minimizer of $\tilde{F}$ is given by $\tilde{z} = \frac{1}{\lambda T} (AX) \tilde{\alpha}^*$. We will first express the distance between $\tilde{z}$ and $Aw^*$ in terms of the distance between the dual variables. We can write:

$\|\tilde{z} - Aw^*\|_2 = \frac{1}{\lambda T}\,\|AX(\tilde{\alpha}^* - \alpha^*)\|_2 \le \frac{1}{\lambda T}\,\|AX\|_2\,\|\tilde{\alpha}^* - \alpha^*\|_2.$  (1)
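As a numerical sanity check on these primal-dual relationships, the following standalone sketch (our own construction; the smoothed loss, dimensions, and variable names are illustrative and not from the paper) verifies that the dual minimizer recovers the primal one and that the bound in Eq. 1 holds on a random instance:

```python
import numpy as np

# Use the smoothed loss l(u) = (u - 1)^2 / 2, chosen because both the
# primal and dual minimizers have closed forms. Its Fenchel conjugate is
# l*(v) = v + v^2/2, so the dual
#   D(a) = (1/T) sum_i l*(-a_i) + (1/(2*lam*T^2)) a' K a
# is minimized where (I + K/(lam*T)) a = 1, with w* = (1/(lam*T)) X a*.
rng = np.random.default_rng(0)
d, T, lam = 50, 30, 0.1

X = rng.normal(size=(d, T))  # column i is theta_i = y_i * x_i
K = X.T @ X                  # kernel matrix of the original data

# Primal: F(w) = (lam/2)||w||^2 + (1/(2T)) sum_i (theta_i . w - 1)^2;
# setting grad F = 0 gives (lam*I + X X'/T) w = X 1 / T.
w_star = np.linalg.solve(lam * np.eye(d) + X @ X.T / T, X @ np.ones(T) / T)

# Dual minimizer in closed form, and primal recovery w* = (1/(lam*T)) X a*.
alpha_star = np.linalg.solve(np.eye(T) + K / (lam * T), np.ones(T))
w_from_dual = X @ alpha_star / (lam * T)
assert np.allclose(w_star, w_from_dual)

# Count-sketch-style projection A: each of the d coordinates is hashed
# to one of m buckets with a random sign.
m = 20
buckets = rng.integers(0, m, size=d)
signs = rng.choice([-1.0, 1.0], size=d)
A = np.zeros((m, d))
A[buckets, np.arange(d)] = signs

# Projected problem: the same objective over the projected data AX.
AX = A @ X
K_tilde = AX.T @ AX
alpha_tilde = np.linalg.solve(np.eye(T) + K_tilde / (lam * T), np.ones(T))
z_tilde = AX @ alpha_tilde / (lam * T)

# Eq. 1: ||z~ - A w*|| <= (1/(lam*T)) ||AX|| ||a~ - a*||.
lhs = np.linalg.norm(z_tilde - A @ w_star)
rhs = np.linalg.norm(AX, 2) * np.linalg.norm(alpha_tilde - alpha_star) / (lam * T)
assert lhs <= rhs + 1e-9
```

The first assertion checks the dual-to-primal recovery exactly; the second checks Eq. 1, which holds for any projection $A$ since it is just the operator-norm inequality.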
Hence our goal will be to upper bound $\|\tilde{\alpha}^* - \alpha^*\|_2$. Define $\Delta K = \tilde{K} - K$. We will show that $\|\tilde{\alpha}^* - \alpha^*\|_2$ can be upper bounded in terms of $\Delta K$ as follows.
Lemma 2. $(\tilde{\alpha}^* - \alpha^*)^\top (K + \tilde{K}) (\tilde{\alpha}^* - \alpha^*) \le (\alpha^*)^\top \Delta K\, \alpha^* - (\tilde{\alpha}^*)^\top \Delta K\, \tilde{\alpha}^*.$
Proof.
We first claim that,

$D(\tilde{\alpha}^*) \ge D(\alpha^*) + \frac{1}{2\lambda T^2}\,(\tilde{\alpha}^* - \alpha^*)^\top K (\tilde{\alpha}^* - \alpha^*).$  (2)
Note that $D(\tilde{\alpha}^*) \ge D(\alpha^*)$ holds simply because $\alpha^*$ is the minimizer of $D$; hence, Eq. 2 is essentially giving an improvement over this simple bound. In order to prove Eq. 2, define $g(\alpha) = \frac{1}{T}\sum_{i=1}^{T} \ell^*(-\alpha_i)$. As $g$ is a convex function (because $\ell^*(-a)$ is convex in $a$), from the definition of convexity,

$g(\tilde{\alpha}^*) \ge g(\alpha^*) + \nabla g(\alpha^*)^\top (\tilde{\alpha}^* - \alpha^*).$  (3)
It is easy to verify that,

$\frac{1}{2\lambda T^2}\,(\tilde{\alpha}^*)^\top K \tilde{\alpha}^* = \frac{1}{2\lambda T^2}\,(\alpha^*)^\top K \alpha^* + \frac{1}{\lambda T^2}\,(K\alpha^*)^\top (\tilde{\alpha}^* - \alpha^*) + \frac{1}{2\lambda T^2}\,(\tilde{\alpha}^* - \alpha^*)^\top K (\tilde{\alpha}^* - \alpha^*).$  (4)
Adding Eq. 3 and Eq. 4, we get,

$D(\tilde{\alpha}^*) \ge D(\alpha^*) + \nabla D(\alpha^*)^\top (\tilde{\alpha}^* - \alpha^*) + \frac{1}{2\lambda T^2}\,(\tilde{\alpha}^* - \alpha^*)^\top K (\tilde{\alpha}^* - \alpha^*) \ge D(\alpha^*) + \frac{1}{2\lambda T^2}\,(\tilde{\alpha}^* - \alpha^*)^\top K (\tilde{\alpha}^* - \alpha^*),$

where the second inequality holds because $\alpha^*$ minimizes the convex function $D$, so $\nabla D(\alpha^*)^\top (\tilde{\alpha}^* - \alpha^*) \ge 0$; this verifies Eq. 2. We can derive a similar bound for $\tilde{D}$,
$\tilde{D}(\alpha^*) \ge \tilde{D}(\tilde{\alpha}^*) + \frac{1}{2\lambda T^2}\,(\tilde{\alpha}^* - \alpha^*)^\top \tilde{K} (\tilde{\alpha}^* - \alpha^*).$  (5)
As $\tilde{D}$ is minimized by $\tilde{\alpha}^*$ and is convex,