# Categorical Feature Compression via Submodular Optimization

###### Abstract

In the era of big data, learning from categorical features with very large vocabularies (e.g., 28 million for the Criteo click prediction dataset) has become a practical challenge for machine learning researchers and practitioners. We design a highly scalable vocabulary compression algorithm that seeks to maximize the mutual information between the compressed categorical feature and the target binary labels, and we furthermore show that its solution is guaranteed to be within a $1 - 1/e \approx 63\%$ factor of the global optimal solution. To achieve this, we introduce a novel re-parametrization of the mutual information objective, which we prove is submodular, and design a data structure to query the submodular function in amortized $O(\log n)$ time (where $n$ is the input vocabulary size). Our complete algorithm is shown to operate in $O(n \log n)$ time. Additionally, we design a distributed implementation in which the query data structure is decomposed across $O(k)$ machines such that each machine only requires $O(n/k)$ space, while still preserving the approximation guarantee and using only logarithmic rounds of computation. We also provide analysis of simple alternative heuristic compression methods to demonstrate that they cannot achieve any approximation guarantee. Using the large-scale Criteo learning task, we demonstrate better performance in retaining mutual information and also verify competitive learning performance compared to other baseline methods.

Authors are listed in alphabetical order.

- bateni@google.com, Google Research
- Lin Chen, lin.chen@yale.edu, Google Research and Yale University
- Hossein Esfandiari, esfandiari@google.com, Google Research
- Thomas Fu, thomasfu@google.com, Google Research
- Vahab S. Mirrokni, mirrokni@google.com, Google Research
- Afshin Rostamizadeh, rostami@google.com, Google Research

## 1 Introduction

In modern large scale machine learning tasks, the presence of categorical features with extremely large vocabularies is a standard occurrence. For example, in tasks such as product recommendation and click-through rate prediction, categorical variables corresponding to inventory id, webpage identifier, or other such high cardinality values can easily contain anywhere from hundreds of thousands to tens of millions of unique values. The size of machine learning models generally grows at least linearly with the vocabulary size and, thus, the memory required to serve the model, the training and inference cost, and the risk of overfitting all become issues with very large vocabularies. In the particular case of neural network models, one generally uses an embedding layer to consume categorical inputs. The number of parameters in the embedding layer is $O(nh)$, where $n$ is the size of the vocabulary and $h$ is the number of units in the first hidden layer.

To give a concrete example, the Criteo click prediction benchmark has about 28 million categorical feature values (CriteoLabs, 2014), resulting in an embedding layer with more than 1 billion parameters for a modestly sized first hidden layer. Note that this number dwarfs the number of parameters found in the remainder of the neural network: even assuming a very deep fully connected network, the number of parameters in the hidden layers is still an order of magnitude smaller than the embedding layer alone. This motivates the problem of compressing the vocabulary into a smaller size while still retaining as much information as possible.

In this work, we model the compression task by considering the problem of maximizing the mutual information between the compressed version of the categorical features and the target label. We first observe a connection between this problem and the quantization problem for discrete memoryless channels, and note a polynomial-time algorithm for the problem (Kurkoski and Yagi, 2014; Iwata and Ozawa, 2014). The resulting algorithm, however, is based on solving a quadratic-time dynamic program, and is not scalable. Our main goal in this paper is to develop a scalable and distributed algorithm with a guaranteed approximation factor. We achieve this goal by developing a novel connection to submodular optimization. Although entropy-based set functions are known to be submodular in some settings, this is not the case for the mutual information objective we consider (mutual information with respect to the target labels). Our main insight is in proving the submodularity of a particular transformation of the mutual information objective, which still allows us to provide an approximation guarantee on the quality of the solution with respect to the original objective. We also provide a data structure that allows us to query this newly defined submodular function in amortized logarithmic time. This logarithmic-time implementation of the submodular oracle allows us to use the fastest known algorithm for submodular maximization (Mirzasoleiman et al., 2015), which leads to a sequential quasi-linear-time $(1 - 1/e - \epsilon)$-approximation algorithm for binary vocabulary compression.

Next, we provide a distributed implementation for binary vocabulary compression. Previous distributed algorithms for submodular maximization assume direct access to the query oracle on every machine (e.g., see (Barbosa et al., 2015; Mirrokni and Zadimoghaddam, 2015; Mirzasoleiman et al., 2013)). However, the query oracle itself requires $O(n)$ space for a vocabulary of size $n$, which may be restrictive in the large scale setting. In this work, we provide a truly distributed implementation of the submodular maximization algorithm of (Badanidiyuru and Vondrák, 2014) (or similarly (Kumar et al., 2015)) for our application by distributing the query oracle. In this distributed implementation we manage to decompose the query oracle across $O(k)$ machines such that each machine only requires $O(n/k)$ space to store the partial query oracle. As a result, we successfully provide a distributed $(1 - 1/e - \epsilon)$-approximation algorithm for vocabulary compression in logarithmic rounds of computation. Our structural results on the submodularity of this new set function are the main technical contribution of this paper, and can also be of independent interest in other settings that seek to maximize mutual information.

We also study a number of heuristic and baseline algorithms for the problem of maximizing mutual information, and show that they do not achieve a guaranteed approximation for the problem. Furthermore, we study the empirical performance of our algorithms on two fronts. First, we show the effectiveness of our greedy scalable approximation algorithm for maximizing mutual information. Our study confirms that this algorithm not only achieves a theoretical guarantee, but also beats the heuristic algorithms for maximizing mutual information. Second, we examine the performance of this algorithm on the vocabulary compression problem itself, and confirm the effectiveness of the algorithm in producing a high-quality solution for vocabulary compression in large scale learning tasks.

Organization. In the remainder of this section we review related previous works and introduce the problem formally along with appropriate notation. Then in Section 2, we introduce the novel compression algorithm and corresponding theoretical guarantees as well as analysis of some basic heuristic baselines. In Section 3, we present our empirical evaluation of optimizing the mutual information objective as well as an end-to-end learning task.

### 1.1 Related Work

Feature Clustering: The use of vocabulary compression has been studied previously, especially in text classification applications where it is commonly known as feature (or word) clustering. In particular, Baker and McCallum (1998) and Slonim and Tishby (2001) both propose agglomerative clustering algorithms, which start with singleton clusters that are iteratively merged, using a Jensen-Shannon divergence based function to measure similarity between clusters, until the desired number of clusters is found. Both algorithms are greedy in nature and do not provide any guarantee with respect to a global objective. In (Dhillon et al., 2003), the authors introduce an algorithm that empirically performs better than the aforementioned methods and that also seeks to optimize the same global mutual information objective that is analyzed in this work. Their proposed iterative algorithm is guaranteed to improve the objective at each iteration and arrive at a local minimum; however, no guarantee with respect to the global optimum is provided. Furthermore, each iteration of the algorithm requires $O(mn)$ time (where $m$ is the size of the compressed vocabulary and $n$ is the size of the original vocabulary) and the number of iterations is only guaranteed to be finite (but potentially exponential). Later in this work, we compare the empirical performance of this algorithm with our proposed method.

Compression in Discrete Memoryless Channels: An area of information theory that is closely related to our vocabulary compression problem, and from which our algorithm draws inspiration, is compression in discrete memoryless channels (DMC) (Cicalese et al., 2018; Zhang and Kurkoski, 2016; Iwata and Ozawa, 2014). In this problem, we assume there is a DMC which (in machine learning terminology) receives a class label and produces a corresponding categorical feature value drawn according to a fixed underlying distribution. The goal is to design a quantizer that maps the space of categorical features to a lower cardinality set, while preserving as much of the mutual information between the class label and the newly constructed vocabulary as possible. In Figure 1, we present a diagram that illustrates the DMC quantization problem and the vocabulary compression problem, as well as the translation of terminology between these two problems. The results of Kurkoski and Yagi (2014) are of particular interest, as they show that a cubic-time dynamic programming based algorithm is able to provide an optimal solution in the case of binary labels. Following this work, Iwata and Ozawa (2014) improve the computational complexity of this approach to quadratic time using the SMAWK algorithm (Aggarwal et al., 1987). Such algorithms are useful in the smaller scale regime; however, a cubic- or even quadratic-time algorithm is infeasible for our massive vocabulary size use cases. Finally, Mumey and Gedeon (2003) show that in the general case of more than two class labels, finding the optimal compression is NP-complete. In this work, we focus on the binary label setting.

Feature Selection: A related method for dealing with very large vocabularies is feature selection, in which we simply select a subset of the vocabulary values and remove the rest (see (Guyon and Elisseeff, 2003) and the many references therein). One can view this approach as a special case of vocabulary compression, where we are restricted to only singleton “clusters”. Restricting the problem to selecting a subset of the vocabulary may have some benefits, such as potentially simplifying the optimization problem and allowing the use of a simple filter to transform data at inference time. However, the obvious downside to this restriction is the loss of information and potentially poorer learning performance (see the introduction of (Jiang et al., 2011)). In this work we focus on the more general vocabulary compression setting.

Other Feature Extraction Approaches: Clustering features in order to compress a vocabulary is only one approach to lower dimensional feature extraction. There are of course many classical approaches to feature extraction (see Chapter 15 of (Mohri et al., 2018)), such as learning linear projections (e.g., Principal Component Analysis, Linear Discriminant Analysis) or non-linear transformations (e.g., Locally Linear Embeddings, ISOMAP, Laplacian Eigenmaps). However, these classical methods generally incur more than quasilinear computational cost, for both learning and applying the transformation, and are not feasible for our setting.

### 1.2 Notation

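In the vocabulary compression problem we are given a correlated pair of random variables $X$ (a categorical feature) and $C$ (a label), where $X \in \{1, 2, \dots, n\}$ and $C \in \{0, 1\}$. We aim to define a random variable $Z \in \{1, 2, \dots, m\}$ as a function of $X$ that maximizes the mutual information with the label $C$, i.e., $I(Z; C)$, where for general random variables $A$ and $B$ taking values in $\mathcal{A}$ and $\mathcal{B}$, respectively,

$$I(A; B) = \sum_{a \in \mathcal{A}} \sum_{b \in \mathcal{B}} P(A = a, B = b) \log \frac{P(A = a, B = b)}{P(A = a)\, P(B = b)}. \tag{1}$$

Note that $Z$ is a function of $X$ and hence we have $I(Z; C) \le I(X; C)$. If we let $m \ge n$, $Z = X$ maximizes the mutual information $I(Z; C)$. We are interested in the nontrivial case of $m < n$. Intuitively, we are compressing the vocabulary of feature $X$ from size $n$ to a smaller size $m$, while preserving the maximum amount of information about $C$.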
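As a small illustration of the notation, the following sketch computes the mutual information of a joint distribution and shows that merging feature values cannot increase it. The function names and the toy probabilities are made up for this example and are not taken from the paper.

```python
import math


def mutual_information(joint):
    """Mutual information (in nats) of a joint distribution {(a, b): prob}."""
    p_a, p_b = {}, {}
    for (a, b), p in joint.items():
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    return sum(
        p * math.log(p / (p_a[a] * p_b[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


def compress(joint, mapping):
    """Joint distribution of (Z, label) when Z = mapping[feature value]."""
    out = {}
    for (x, c), p in joint.items():
        key = (mapping[x], c)
        out[key] = out.get(key, 0.0) + p
    return out


# Made-up joint distribution over 4 feature values and a binary label.
joint_xc = {
    (1, 0): 0.05, (1, 1): 0.20,
    (2, 0): 0.10, (2, 1): 0.15,
    (3, 0): 0.15, (3, 1): 0.10,
    (4, 0): 0.20, (4, 1): 0.05,
}
mi_full = mutual_information(joint_xc)
# Merge values {1, 2} and {3, 4}: vocabulary size 4 -> 2.
mi_compressed = mutual_information(compress(joint_xc, {1: 1, 2: 1, 3: 2, 4: 2}))
```

Here `mi_compressed` is strictly smaller than `mi_full` because the merged values have different label distributions; the identity mapping would leave the mutual information unchanged.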

## 2 Algorithm and Analysis

In this section, we first show how to transform the original objective into a set function and then prove that this set function is in fact submodular. Next, we describe the components of a quasi-linear and parallelizable algorithm to optimize the objective. Finally, we consider a few simple intuitive baselines and show that they may create features that fail to capture any mutual information with the label.

### 2.1 Objective Transformation

Without loss of generality assume for is sorted in increasing order. Once the feature values are sorted in this order, Lemma 3 of Kurkoski and Yagi (2014) crucially shows that in the optimum solution each value of corresponds to a consecutive subsequence of — this is a significant insight that we take from the quantization for DMC literature. Thus, we will cluster consecutive feature values into clusters, with each cluster corresponding to a value in the compressed vocabulary of . Formally, define a function as follows: Let , and assume . For simplicity, and without any loss in quality, we set and . Let be a random variable constructed from that has value , if and only if . We define . Notice that we have

where is a function of with vocabulary size . The non-negativity of mutual information implies that the function is always non-negative (Cover and Thomas, 2006, p. 28). The monotonicity is equivalent to for any , where and are the random variables constructed from and , respectively. Since represents a subdivision of , is a function of . By the data-processing inequality, we have (Cover and Thomas, 2006, p. 34). In the following section, we show that the function is in fact submodular.
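The set-function view can be sketched as follows: sort the feature values, treat a set of cut positions as the boundaries of consecutive clusters, and evaluate the mutual information between the cluster index and the label. This is a minimal sketch under stated assumptions: the sort key (the conditional probability of label 0), the 0-based cut positions, and the names `mi_of_clusters` and `cut_value` are all choices made for this example, since the extracted text omits the formulas.

```python
import math


def mi_of_clusters(clusters):
    """MI (nats) between cluster index and a binary label.

    `clusters` is a list of (mass_label0, mass_label1) joint probabilities.
    """
    total0 = sum(a for a, _ in clusters)
    total1 = sum(b for _, b in clusters)
    mi = 0.0
    for a, b in clusters:
        for joint, label_mass in ((a, total0), (b, total1)):
            if joint > 0:
                mi += joint * math.log(joint / ((a + b) * label_mass))
    return mi


def cut_value(joint, cuts):
    """Value of a set of cut positions over the sorted feature values.

    `joint` is a list of (P(x, label 0), P(x, label 1)) with positive mass per
    value; a cut at position s separates sorted items [0, s) from [s, n).
    """
    order = sorted(joint, key=lambda ab: ab[0] / (ab[0] + ab[1]))
    bounds = [0] + sorted(cuts) + [len(order)]
    clusters = [
        (sum(a for a, _ in order[lo:hi]), sum(b for _, b in order[lo:hi]))
        for lo, hi in zip(bounds, bounds[1:])
    ]
    return mi_of_clusters(clusters)


# Made-up example with four equally likely feature values.
example = [(0.20, 0.05), (0.05, 0.20), (0.15, 0.10), (0.10, 0.15)]
value_no_cuts = cut_value(example, [])
value_all_cuts = cut_value(example, [1, 2, 3])
```

With no cuts every value falls into one cluster and the value is zero; adding cuts can only refine the clustering, so the value never decreases.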

### 2.2 Submodularity of

For a set and a number we define . Let be the item right before when we sort . Note that, the items that are mapped to by are either mapped to or by . We first observe the following useful technical lemma (the proof of all lemmas can be found in the supplement).

###### Lemma 1

Define the quantities , , and , then the following equality holds

(2) |

where the following convex function over :

(3) |

Next, we provide several inequalities that precisely analyze expressions of the same form as with various values of and .

###### Lemma 2

Pick , and . Let and let be an arbitrary convex function. We have

Replacing and in Lemma 2 with and and multiplying both sides by implies the following corollary.

###### Corollary 1

Pick , and . Let be an arbitrary convex function. We have

Similarly, we have the following corollary (simply by looking at instead of ).

###### Corollary 2

Pick , and . Let be an arbitrary convex function. We have

We require one final lemma before proceeding to the main theorem.

###### Lemma 3

Pick , and such that . Let be an arbitrary convex function. We have

###### Theorem 3 (Submodularity)

For any pair of sets and any we have

Proof

Let and be the items right before and right after when we sort . Also, let and be the random variables corresponding to and respectively. Similarly let and be items right before and right after when we sort , and let and be the random variables corresponding to and respectively.

Let us set , , and . Similarly let us set , , and . Note that since , we have and hence we have and (see Figure 2). Therefore, we have following set of inequalities

(4) | ||||

(5) |

Since in the definition of the elements are ordered by , we have the following set of inequalities

(6) | ||||

(7) | ||||

(8) | ||||

(9) |

Finally, we have

where and follow from equality 2,
follows from Corollary 1 and inequalities
(9) and (7), follows from
Corollary 2 and inequalities (8) and
(6), follows from Lemma 3 and
inequality (4), and follows from
Lemma 3 and inequality (5).
This completes the proof.

### 2.3 Submodular Optimization Algorithms

Given that we have shown that the objective is submodular, we now show two approaches to optimization: a single machine algorithm that runs in $O(n \log n)$ time, as well as an algorithm which allows the input to be processed in a distributed fashion, at the cost of an additional logarithmic factor in running time.

Single Machine Algorithm: We will make use of a $(1 - 1/e - \epsilon)$-approximation algorithm for submodular maximization that makes only $O(n \log(1/\epsilon))$ queries to the submodular function (Mirzasoleiman et al., 2015). First, fix an arbitrary small constant $\epsilon > 0$ (this appears as a loss in the approximation factor as well as in the running time). The algorithm starts with an empty solution set and then proceeds in $k$ iterations, where $k$ is the size of the solution. In each iteration, we sample a set of $\frac{n}{k} \log \frac{1}{\epsilon}$ elements uniformly at random from the elements that have not been added to the solution so far, and then add the sampled element with maximum marginal increase to the solution.

In general, we may expect that computing requires at least time, which might be as large as . However, we note that the algorithm of (Mirzasoleiman et al., 2015) (similar to most other algorithms for submodular maximization) only queries for incrementally growing subsets of the final solution . In that case, we can compute each incremental value of in logarithmic time using a data structure that costs time to construct (see Algorithm 1). By using this query oracle, we do not require communicating the whole set for every query. Moreover, we use a red-black tree to maintain , and hence we can search for neighbors ( and ) in logarithmic time. Thus, combining the submodular maximization algorithm that requires only a linear number of queries with the logarithmic time query oracle implies the following theorem.

###### Theorem 4

For any arbitrarily small $\epsilon > 0$, there exists a $(1 - 1/e - \epsilon)$-approximation algorithm for vocabulary compression that runs in $O(n \log n)$ time.
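The single machine procedure can be sketched as below. This is an illustrative sketch, not the paper's Algorithm 1: prefix sums give each cluster's contribution in constant time, and a sorted Python list with `bisect` stands in for the red-black tree (so an insertion costs linear time in the list length here, and `remaining.remove` is also linear). The names `make_segment_value` and `stochastic_greedy_cuts` are hypothetical, and the input is assumed to be already sorted.

```python
import bisect
import math
import random
from itertools import accumulate


def make_segment_value(sorted_joint):
    """Return g(lo, hi): MI contribution of the cluster of sorted items [lo, hi)."""
    pre0 = [0.0] + list(accumulate(a for a, _ in sorted_joint))
    pre1 = [0.0] + list(accumulate(b for _, b in sorted_joint))
    tot0, tot1 = pre0[-1], pre1[-1]

    def segment_value(lo, hi):
        a, b = pre0[hi] - pre0[lo], pre1[hi] - pre1[lo]
        value = 0.0
        if a > 0:
            value += a * math.log(a / ((a + b) * tot0))
        if b > 0:
            value += b * math.log(b / ((a + b) * tot1))
        return value

    return segment_value


def stochastic_greedy_cuts(sorted_joint, k, eps=0.1, seed=0):
    """Pick up to k cut positions by sampling candidates in each iteration."""
    n = len(sorted_joint)
    segment_value = make_segment_value(sorted_joint)
    rng = random.Random(seed)
    cuts = [0, n]                    # sentinel boundaries, always present
    remaining = list(range(1, n))    # candidate cut positions
    k = min(k, len(remaining))
    if k == 0:
        return []
    sample_size = max(1, math.ceil(len(remaining) / k * math.log(1 / eps)))

    def gain(x):
        # Only the neighbouring cuts of x matter for its marginal gain.
        i = bisect.bisect_left(cuts, x)
        lo, hi = cuts[i - 1], cuts[i]
        return segment_value(lo, x) + segment_value(x, hi) - segment_value(lo, hi)

    for _ in range(k):
        sample = rng.sample(remaining, min(sample_size, len(remaining)))
        best = max(sample, key=gain)
        bisect.insort(cuts, best)
        remaining.remove(best)
    return cuts[1:-1]


# Made-up input, already sorted by the conditional probability of label 0.
sorted_joint = [(0.05, 0.20), (0.10, 0.15), (0.15, 0.10), (0.20, 0.05)]
one_cut = stochastic_greedy_cuts(sorted_joint, 1)
```

For this toy input the sample covers every candidate, so the single best cut is the middle position.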

Distributed Algorithm: Again, fix an arbitrary small number $\epsilon > 0$ (for simplicity assume $\epsilon k$ and $\frac{n}{\epsilon k}$ are integers). In this distributed implementation we use $\epsilon k$ machines, require $O(\frac{n}{\epsilon k})$ space per machine, and use a logarithmic number of rounds of computation.

To define our distributed algorithm we start with the (non-distributed) submodular maximization algorithm of (Badanidiyuru and Vondrák, 2014), which provides a approximate solution using queries to the submodular function oracle. The algorithm works by defining a decreasing sequence of thresholds , where is the maximum marginal increase of a single element, and . The algorithm proceeds in rounds, where in round the algorithm iterates over all elements and inserts an element into the solution set if . The algorithm stops once it has selected elements or if it finishes rounds, whichever comes first. As usual, this algorithm only queries for incrementally growing subsets of the final solution , and hence we can use Algorithm 1 to solve vocabulary compression in time.
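The descending-thresholds idea can be sketched generically as follows. The geometric decay by a factor of `1 - eps` and the stopping threshold `(eps / n) * d` follow the usual presentation of this kind of algorithm and are assumptions here, since the extracted text omits the formulas. A small coverage function stands in for the objective; it is not the paper's query oracle.

```python
def threshold_greedy(elements, f, k, eps=0.1):
    """Descending-thresholds greedy for a monotone submodular f with |S| <= k."""
    n = len(elements)
    solution = []
    base = f(solution)
    d = max(f([e]) - base for e in elements)  # best single-element gain
    w = d
    while d > 0 and w >= (eps / n) * d and len(solution) < k:
        for e in elements:
            if e in solution:
                continue
            current = f(solution)
            if f(solution + [e]) - current >= w:
                solution.append(e)
                if len(solution) == k:
                    break
        w *= 1 - eps  # lower the threshold for the next round
    return solution


# Stand-in objective: number of items covered by the chosen sets (made-up data).
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}


def coverage(chosen):
    return len(set().union(*(sets[e] for e in chosen)))


picked = threshold_greedy(list(sets), coverage, k=2)
```

In the vocabulary compression setting, the calls to `f` would be replaced by marginal-gain queries that only look at the neighbouring cut points.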

Now, we show how to distribute this computation across multiple machines. First, for all $j \in \{1, \dots, \epsilon k\}$, we select the $(j \cdot \frac{n}{\epsilon k})$-th element and add it to the solution set. This decomposes the elements into $\epsilon k$ contiguous subsets, each of size $\frac{n}{\epsilon k}$, and each of which we assign to one machine. Note that the marginal gain of an element $x$ only depends on the previous item and next item of $x$ in the solution set and, due to the way that we created the initial solution set and decomposed the input elements, the previous item and next item of $x$ are always both assigned to the same machine as $x$. Hence each machine can compute marginal gains locally. However, we assigned the first $\epsilon k$ elements to the solution set blindly and their marginal gain may be very small. Intuitively, we are potentially throwing away some of our selection budget for the ease of computation. Next we show that by forcing these elements into the solution we do not lose more than $\epsilon$ on the approximation factor.
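The locality argument can be illustrated with a short sketch: once the block boundaries are forced into the solution, the two neighbouring cuts of any candidate position lie inside that candidate's own block. The helper names and the exact indexing of the forced positions are assumptions made for this example.

```python
import bisect


def contiguous_blocks(n, machines):
    """Forced cut positions and per-machine [lo, hi] blocks of sorted positions."""
    if n % machines != 0:
        raise ValueError("this sketch assumes that machines divides n")
    size = n // machines
    forced = [j * size for j in range(1, machines)]
    bounds = [0] + forced + [n]
    return forced, list(zip(bounds, bounds[1:]))


def neighbors(x, cuts):
    """Previous and next cut around position x in a sorted list of cuts."""
    i = bisect.bisect_left(cuts, x)
    return cuts[i - 1], cuts[i]


forced, blocks = contiguous_blocks(12, 3)
# Solution so far: sentinels 0 and 12, the forced cuts, and one local cut at 6.
cuts = [0] + sorted(forced + [6]) + [12]
prev_cut, next_cut = neighbors(5, cuts)
```

Position 5 belongs to the block between 4 and 8, and both of its neighbouring cuts fall within that block, so the machine holding the block can evaluate the marginal gain without any other data.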

First of all, notice that if we force a subset of the elements to be included in the solution, the objective function is still submodular over the remaining elements. That is, the marginal impact of an element still shrinks as the solution set grows. Next we show that if we force $\epsilon k$ elements into the solution, it does not decrease the value of the optimum solution by more than a $(1 - \epsilon)$ factor. This means that if we provide a $(1 - 1/e - \epsilon)$-approximation to the new optimum solution, it is at least a $(1 - 1/e - 2\epsilon)$-approximate solution to the original optimum.

Let $S^*$ be a solution of size $k$ that maximizes the objective. Decompose $S^*$ into $1/\epsilon$ subsets of size $\epsilon k$. Note that by submodularity the value of $S^*$ is at least the sum of the marginal impacts of each subset (given the remainder of the subsets). Therefore, by the pigeonhole principle, the marginal impact of one of these subsets of $S^*$ is at most an $\epsilon$ fraction of the value of $S^*$. If we remove this subset from $S^*$ and add the $\epsilon k$ forced elements, then by monotonicity we find a solution of size (at most) $k$ that contains all of the forced elements and has value at least $(1 - \epsilon)$ times the value of $S^*$, as desired. Hence, by forcing these initial elements to be in the solution we lose only an $\epsilon$ fraction on the approximation factor.

Now, to implement the algorithm of (Badanidiyuru and Vondrák, 2014), in iteration , each machine independently finds and inserts all of its elements with marginal increase more than . If the number of selected elements exceeds , we remove the last few elements to have exactly elements in the solution. This implies the following theorem.

###### Theorem 5

For any arbitrary small , there exists a -approximation -round distributed algorithm for vocabulary compression with space per machine and total work.

### 2.4 Heuristic Algorithms

In this subsection we review a couple of heuristics that can serve as simple alternatives to the algorithm we suggest and show that they can, in fact, fail entirely for some inputs. We also provide an empirical comparison to these, as well as the algorithm of Dhillon et al. (2003), in Section 3.

#### Bucketing Algorithm:

This algorithm splits the range of probabilities into equal size intervals . Then it uses these intervals (or buckets) to form the compressed vocabulary. Specifically, each interval represents all elements such that . Note that there exists a set that such that correspond to the mutual information of the outcome of the bucketing algorithm and the labels. First we show that it is possible to give an upper bound on the mutual information loss, i.e., .

###### Theorem 6

Let be the random variable provided by the bucketing algorithm. The total mutual information loss of the bucketing algorithm is bounded as follows.

where and is defined in equation (3).

Let be the index of the interval corresponding to . Then, by convexity of , we have

and

Therefore we have

This together with Equations (2.4) and (11) show that
and completes the theorem.

The above theorem states that the information loss of the bucketing
algorithm is no more than how much changes within one
interval of size . Note that this is an absolute loss and is not
comparable to the approximation guarantee that we provide submodular
maximization. The main problem with the bucketing algorithm is that it
is to some extent oblivious to the input data and, thus, will fail badly for
certain inputs as shown in the following proposition.

###### Proposition 7

There is an input to the bucketing algorithm such that $I(X; C) > 0$ and $I(Z; C) = 0$, where $X$ is the input feature, $C$ is the label, and $Z$ is the output of the bucketing algorithm.

Proof Fix a number . In this example for half of the items we have
and for the other half we
have . We also set the
probability of all values of to be the same, and hence
. The mutual information of
with the label is non-zero since . However, the bucketing algorithm merges all
of the elements in the range ,
thereby merging all values together giving us and
completing the proof.

Note, we can further strengthen the previous example by giving a tiny
mass to all buckets, so that all values do not collapse into a single
bucket. However, still in this case, the bucketing method can only
hope to capture a tiny fraction of mutual information since the vast
majority of mass falls into a single bucket.
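The failure mode can be reproduced with a few lines of code. The sketch below buckets feature values by their conditional probability of label 0 using equal-width intervals of the unit range; the specific probabilities and helper names are made up for this example and differ from the construction in the proof.

```python
import math


def mi_of_clusters(clusters):
    """MI (nats) between cluster index and a binary label from (mass0, mass1) pairs."""
    total0 = sum(a for a, _ in clusters)
    total1 = sum(b for _, b in clusters)
    mi = 0.0
    for a, b in clusters:
        for joint, label_mass in ((a, total0), (b, total1)):
            if joint > 0:
                mi += joint * math.log(joint / ((a + b) * label_mass))
    return mi


def bucketing_compress(values, num_buckets):
    """Merge values whose P(label 0 | value) falls in the same equal-width bucket."""
    buckets = {}
    for p0, p1 in values:
        cond = p0 / (p0 + p1)
        index = min(int(cond * num_buckets), num_buckets - 1)
        mass = buckets.setdefault(index, [0.0, 0.0])
        mass[0] += p0
        mass[1] += p1
    return [tuple(mass) for _, mass in sorted(buckets.items())]


# Four equally likely values: two with P(label 0 | x) = 0.01, two with 0.09.
values = [(0.0025, 0.2475)] * 2 + [(0.0225, 0.2275)] * 2
mi_before = mi_of_clusters(values)
mi_after = mi_of_clusters(bucketing_compress(values, 10))
```

With 10 buckets both groups land in the first bucket, so `mi_after` is zero (up to rounding) even though `mi_before` is positive; a data-aware method would separate the two groups with a single cut.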

#### Frequency Based Filtering:

This is a very simple compression method (more precisely, a feature selection method) that is popular in practice. Given a vocabulary budget, we compute a frequency threshold and remove any vocabulary value that appears in fewer instances of our dataset than the threshold, which results in a vocabulary of the desired size. Even though the frequency based algorithm is not entirely oblivious to the input, it is oblivious to the label and hence oblivious to conditional distributions. Similar to the bucketing algorithm, we show with a simple example in the following proposition that the frequency based algorithm fails to provide any approximation guarantee.

###### Proposition 8

There is an input to the frequency based algorithm such that $I(X; C) > 0$ and $I(Z; C) = 0$, where $X$ is the input feature, $C$ is the label, and $Z$ is the outcome of the frequency based algorithm.

Proof Assume we have values for , namely . For all define , and for all we have . Note that the first values are the most frequent values, however, we are going to define them such that they are independent of the label.

For let ; for let
; and for
let . Note that we have
. Therefore the mutual information of
the most frequent values with the label is zero, which
implies for a certain vocabulary budget, and thereby
frequency threshold, . Observe that even if we merge
the last
values and use it as a new value (as opposed to ignoring
them), the label corresponding to the merged value is
with probability half, and hence has no mutual information
with the label. However, we have
, which completes the proof.
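A toy instance in the spirit of this proof is sketched below: the frequent values carry no information about the label, while the rare values are perfectly predictive but cancel out once merged. The counts and helper names are made up for this example and are not the construction used in the proof.

```python
import math


def mi_from_counts(table):
    """MI (nats) between the row key and a binary label from {key: (n0, n1)} counts."""
    total = sum(n0 + n1 for n0, n1 in table.values())
    label_totals = [sum(row[c] for row in table.values()) for c in (0, 1)]
    mi = 0.0
    for row in table.values():
        for c in (0, 1):
            if row[c] > 0:
                ratio = row[c] * total / (sum(row) * label_totals[c])
                mi += row[c] / total * math.log(ratio)
    return mi


def frequency_filter(table, budget, other="<OTHER>"):
    """Keep the `budget` most frequent values and merge the rest into `other`."""
    ranked = sorted(table, key=lambda x: sum(table[x]), reverse=True)
    kept = set(ranked[:budget])
    out = {}
    for x, (n0, n1) in table.items():
        key = x if x in kept else other
        m0, m1 = out.get(key, (0, 0))
        out[key] = (m0 + n0, m1 + n1)
    return out


# Frequent values "a" and "b" are uninformative; rare "c" and "d" are decisive.
counts = {"a": (50, 50), "b": (40, 40), "c": (10, 0), "d": (0, 10)}
mi_before = mi_from_counts(counts)
mi_after = mi_from_counts(frequency_filter(counts, budget=2))
```

Here `mi_before` is positive while `mi_after` is zero, whether the rare values are merged (as in this sketch) or dropped.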

## 3 Empirical Evaluation

In this section we report our empirical evaluation of optimizing the submodular function described in the previous section. All the experiments are performed using the Criteo click prediction dataset (CriteoLabs, 2014), which consists of 37 million instances for training and 4.4 million held-out points. (Note that we use the labeled training file from this challenge and chronologically partition it into train/hold-out sets.) In addition to 13 numerical features, this dataset contains 26 categorical features with a combined total vocabulary of more than 28 million values. These features have varying vocabulary sizes, from a handful up to millions of values. Five features, in particular, have more than a million distinct feature values each.

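In order to execute a mutual information based algorithm, we require estimates of the conditional probabilities $P(C = c \mid X = x)$ and marginal probabilities $P(X = x)$ for each feature value $x$ and label $c$. Here, we simply use the maximum likelihood estimate based on the empirical count, i.e., given a sample of feature value and label pairs $(x_1, c_1), \dots, (x_N, c_N)$, we have

$$\hat{P}(X = x) = \frac{|\{i : x_i = x\}|}{N}, \qquad \hat{P}(C = c \mid X = x) = \frac{|\{i : x_i = x,\ c_i = c\}|}{|\{i : x_i = x\}|}.$$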

We note that such estimates may sometimes be poor, especially when certain feature values appear very rarely. Evaluating more robust estimates is outside the scope of the current study, but an interesting direction for future work.
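The count-based estimates can be computed in a single pass over the data. The sketch below is illustrative only; the function name and the toy sample are made up.

```python
from collections import Counter


def empirical_estimates(pairs):
    """MLE of P(X = x) and P(C = 1 | X = x) from (feature value, binary label) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    value_counts = Counter(x for x, _ in pairs)
    positive_counts = Counter(x for x, c in pairs if c == 1)
    marginal = {x: count / n for x, count in value_counts.items()}
    conditional = {x: positive_counts[x] / count for x, count in value_counts.items()}
    return marginal, conditional


marginal, conditional = empirical_estimates([("a", 1), ("a", 0), ("b", 1), ("a", 1)])
```

Note that the value "b" appears only once, so its conditional estimate is based on a single observation, which is exactly the kind of rare-value estimate that may be unreliable.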

### 3.1 Mutual information evaluation

We first evaluate the ability of our algorithm to maximize the mutual information retained by the compressed vocabulary and compare it to other baseline methods.

In particular, we compare our algorithm to the iterative divisive clustering algorithm introduced by Dhillon et al. (2003), as well as the frequency filtering and bucketing heuristics introduced in the previous section. The divisive clustering algorithm resembles a version of the $k$-means algorithm where $k$ is set to be the vocabulary size and distances between points and clusters are defined in terms of the KL divergence between the conditional distribution of the label variable given a feature value and the conditional distribution of the label variable given a cluster center. Notice that due to the large size of the dataset, we cannot run the dynamic programming algorithm introduced by Kurkoski and Yagi (2014), which would find the theoretically optimal solution. For ease of reference, we call our algorithm Submodular, and the other algorithms Divisive, Bucketing and Frequency.

Note that our algorithm, as well as previous algorithms, seek to maximize the mutual information between a single categorical variable and the label, while in the Criteo dataset we have several categorical variables that we wish to apply a global vocabulary size budget to. In the case of the Frequency heuristic, this issue is addressed by sorting the counts of feature values across all categorical variables and applying a global threshold. In the case of Submodular, we run the almost linear-time algorithm for each categorical variable to obtain a sorted list of feature values and their marginal contributions to the global objective. Then we sort these marginal values and pick the top-score feature values to obtain the desired target vocabulary size. Thus, both Submodular and Frequency are able to naturally employ global strategies in order to allocate the total vocabulary budget across different categorical features.
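The global allocation step for Submodular can be sketched as a top-score selection over the per-feature marginal gains. The data layout and the name `allocate_global_budget` are assumptions made for this example; the sketch also assumes that gains are recorded in selection order, so that a later pick is rarely chosen without its predecessors.

```python
import heapq


def allocate_global_budget(per_feature_gains, budget):
    """Pick the `budget` highest-gain cut points across all features.

    `per_feature_gains` maps a feature name to a list of (cut, marginal_gain)
    pairs in the order in which the per-feature algorithm selected them.
    """
    scored = (
        (gain, feature, cut)
        for feature, gains in per_feature_gains.items()
        for cut, gain in gains
    )
    chosen = {feature: [] for feature in per_feature_gains}
    for gain, feature, cut in heapq.nlargest(budget, scored):
        chosen[feature].append(cut)
    return chosen


# Made-up marginal gains for two categorical features.
gains = {"f1": [(3, 0.5), (7, 0.2)], "f2": [(1, 0.4), (2, 0.05)]}
allocation = allocate_global_budget(gains, budget=3)
```

With a budget of three cut points, feature `f1` receives two and `f2` receives one, reflecting where the larger marginal gains are.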

For the Divisive and Bucketing algorithms, a natural global allocation policy is not available, as one needs to define an allocation of the vocabulary budget to each categorical feature a priori. In this study, we evaluate two natural allocation mechanisms. The uniform allocation assigns a uniform budget across all categorical features, whereas the MI allocation assigns a budget that is proportional to the mutual information of the particular categorical feature.

The original vocabulary of over 28 million values is compressed by a factor of up to 2000. Using the methods mentioned above, we obtain vocabularies of size 10K, 20K, 40K, 80K, 120K and 160K. Then we compute the loss in average mutual information, which is defined as follows: let denote the mutual information of uncompressed categorical feature with the label and denote mutual information of the corresponding compressed feature, then the average mutual information loss is equal to .

For the heuristic Frequency algorithm, the measured loss ranges from 0.520 (for budget of 160K) to 0.654 (for budget of 10K), while for Bucketing the loss ranges from to . As expected, the mutual information based methods perform significantly better, in particular, the loss for Submodular ranges from to and consistently outperforms the Divisive algorithm (regardless of allocation strategy). Figure 3 provides a closer look at the mutual information based methods. Thus, we find that not only is our method fast, scalable and exhibits a theoretical lower bound on the performance, but that in practice it maintains almost all the mutual information between data points and the labels.

### 3.2 Learning evaluation

Our focus thus far has been on optimizing the mutual information objective. In this section we also evaluate the compressed vocabularies in an end-to-end task to demonstrate their application in a learning scenario.

Given a compressed vocabulary, we train a neural network model on the training split and measure the log-loss on the hold-out set (further details are in the supplement). In order to alleviate the potential issue of poor conditional/marginal distribution estimates, we initially start with only feature values that appear in at least 100 instances. In Figure 4 we see that the mutual information based methods perform comparably to each other and significantly outperform the popular heuristic method Frequency. We observe that our scalable quasi-linear compression algorithm with provable approximation guarantees also performs competitively in end-to-end learning.

## 4 Conclusion

In this work we have shown the first scalable quasi-linear compression algorithm for maximizing mutual information with the label that also exhibits a $1 - 1/e$ factor approximation guarantee. The algorithm, as well as our insights into constructing a submodular objective function, might be of interest in other applications as well (for example, quantization in DMC). One future line of work is extending this work to the multiclass (non-binary) setting.

## References

- Abadi et al. [2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
- Aggarwal et al. [1987] Alok Aggarwal, Maria M Klawe, Shlomo Moran, Peter Shor, and Robert Wilber. Geometric applications of a matrix-searching algorithm. Algorithmica, 2(1-4):195–208, 1987.
- Badanidiyuru and Vondrák [2014] Ashwinkumar Badanidiyuru and Jan Vondrák. Fast algorithms for maximizing submodular functions. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1497–1514. SIAM, 2014.
- Baker and McCallum [1998] L Douglas Baker and Andrew Kachites McCallum. Distributional clustering of words for text classification. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 96–103. ACM, 1998.
- Barbosa et al. [2015] Rafael Barbosa, Alina Ene, Huy Nguyen, and Justin Ward. The power of randomization: Distributed submodular maximization on massive datasets. In International Conference on Machine Learning, pages 1236–1244, 2015.
- Cicalese et al. [2018] Ferdinando Cicalese, Luisa Gargano, and Ugo Vaccaro. Bounds on the entropy of a function of a random variable and their applications. IEEE Transactions on Information Theory, 64(4):2220–2230, 2018.
- Cover and Thomas [2006] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, New York, NY, USA, 2006.
- CriteoLabs [2014] CriteoLabs. Display Advertising Challenge, 2014. URL https://www.kaggle.com/c/criteo-display-ad-challenge.
- Dhillon et al. [2003] Inderjit S Dhillon, Subramanyam Mallela, and Rahul Kumar. A divisive information-theoretic feature clustering algorithm for text classification. Journal of machine learning research, 3(Mar):1265–1287, 2003.
- Guyon and Elisseeff [2003] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of machine learning research, 3(Mar):1157–1182, 2003.
- Iwata and Ozawa [2014] Ken-ichi Iwata and Shin-ya Ozawa. Quantizer design for outputs of binary-input discrete memoryless channels using SMAWK algorithm. In Information Theory (ISIT), 2014 IEEE International Symposium on, pages 191–195. IEEE, 2014.
- Jiang et al. [2011] Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee. A fuzzy self-constructing feature clustering algorithm for text classification. IEEE transactions on knowledge and data engineering, 23(3):335–349, 2011.
- Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kumar et al. [2015] Ravi Kumar, Benjamin Moseley, Sergei Vassilvitskii, and Andrea Vattani. Fast greedy algorithms in MapReduce and streaming. ACM Transactions on Parallel Computing (TOPC), 2(3):14, 2015.
- Kurkoski and Yagi [2014] Brian M Kurkoski and Hideki Yagi. Quantization of binary-input discrete memoryless channels. IEEE Transactions on Information Theory, 60(8):4544–4552, 2014.
- Mirrokni and Zadimoghaddam [2015] Vahab Mirrokni and Morteza Zadimoghaddam. Randomized composable core-sets for distributed submodular maximization. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 153–162. ACM, 2015.
- Mirzasoleiman et al. [2013] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems, pages 2049–2057, 2013.
- Mirzasoleiman et al. [2015] Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas Krause. Lazier than lazy greedy. In AAAI, pages 1812–1818, 2015.
- Mohri et al. [2018] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018.
- Mumey and Gedeon [2003] Brendan Mumey and Tomáš Gedeon. Optimal mutual information quantization is NP-complete. In Proc. Neural Inf. Coding (NIC) Workshop, 2003.
- Slonim and Tishby [2001] Noam Slonim and Naftali Tishby. The power of word clusters for text classification. In 23rd European Colloquium on Information Retrieval Research, volume 1, page 200, 2001.
- Zhang and Kurkoski [2016] Jiuyang Alan Zhang and Brian M Kurkoski. Low-complexity quantization of discrete memoryless channels. In Information Theory and Its Applications (ISITA), 2016 International Symposium on, pages 448–452. IEEE, 2016.

## Appendix A Supplement

### A.1 Proof of technical lemmas

Proof of Lemma 1 Let and be the random variables corresponding to and respectively. Note that we have

where we have

which is a convex function over . Next, we have

Notice that implies that or . Hence we have and

Now, if we set , , and , and combine the previous two inline equalities, we have