# A Model for Learned Bloom Filters, and Optimizing by Sandwiching

###### Abstract

Recent work has suggested enhancing Bloom filters by using a pre-filter, based on applying machine learning to determine a function that models the data set the Bloom filter is meant to represent. Here we model such learned Bloom filters, with the following outcomes: (1) we clarify what guarantees can and cannot be associated with such a structure; (2) we show how to estimate how small the learned function must be in order to yield improved performance; (3) we provide a simple method, sandwiching, for optimizing learned Bloom filters; and (4) we propose a design and analysis approach for a learned Bloomier filter, based on our modeling approach.

Michael Mitzenmacher School of Engineering and Applied Sciences Harvard University michaelm@eecs.harvard.edu

Preprint. Work in progress.

## 1 Introduction

An interesting recent paper, “The Case for Learned Index Structures” TCFLIS (), argues that standard index structures and related structures, such as Bloom filters, could be improved by using machine learning to develop what the authors dub learned index structures. However, this paper did not provide a suitable mathematical model for judging the performance of such structures. Here we aim to provide a more formal model for their variation of a Bloom filter, which they call a learned Bloom filter.

To describe our results, we first somewhat informally describe the learned Bloom filter. Like a standard Bloom filter, it provides a compressed representation of a set of keys $\mathcal{K}$ that allows membership queries. (We may sometimes also refer to the keys as elements.) Given a key $y$, a learned Bloom filter always returns yes if $y$ is in $\mathcal{K}$, so there will be no false negatives, and generally returns no if $y$ is not in $\mathcal{K}$, but may provide false positives. What makes a learned Bloom filter interesting is that it uses a function that can be obtained by “learning” the set $\mathcal{K}$ to help determine the appropriate answer; the function acts as a pre-filter that provides a probabilistic estimate that a query key $y$ is in $\mathcal{K}$. This learned function can be used to make an initial decision as to whether the key is in $\mathcal{K}$, and a smaller backup Bloom filter is used to prevent any false negatives.

Our more formal model provides interesting insights into learned Bloom filters, and how they might be effective. In particular, here we: (1) clarify what guarantees can and cannot be associated with such a structure; (2) show how to estimate how small the learned function must be in order to yield improved performance; (3) provide a simple method for optimizing learned Bloom filters; and (4) demonstrate that our approach may be useful for other similar structures.

We briefly summarize the outcomes above. First, we explain how the types of guarantees offered by learned Bloom filters differ significantly from those of standard Bloom filters. We thereby clarify what application-level assumptions are required for a learned Bloom filter to be effective. Second, we provide formulae for modeling the false positive rate for a learned Bloom filter, allowing for an estimate of how small the learned function needs to be in order to be effective. We then find, perhaps surprisingly, that a better structure uses a Bloom filter before as well as after the learned function. Because we optimize for two layers of Bloom filters surrounding the learned function, we refer to this as a sandwiched learned Bloom filter. We show mathematically and intuitively why sandwiching improves performance. We also discuss an approach to designing learned Bloomier filters, where a Bloomier filter returns a value associated with a set element (instead of just returning whether the element is in the set), and show it can be modeled similarly.

While the contents of this paper may be seen as relatively simple, we feel it is important to provide solid foundations in order for a wide community to understand the potential and pitfalls of data structures using machine learning components. We therefore remark that the simplicity is purposeful, and suggest it is desirable in this context. Finally, we note that this work incorporates and extends analysis that appeared in two prior working notes mypaper (); mypaper2 ().

## 2 Review: Bloom Filters

We start by reviewing standard Bloom filters and variants, following the framework provided by the reference BroderMitzenmacher ().

### 2.1 Definition of the Data Structure

A Bloom filter for representing a set $S = \{x_1, x_2, \ldots, x_n\}$ of $n$ elements corresponds to an array of $m$ bits, and uses $k$ independent hash functions $h_1, \ldots, h_k$ with range $\{0, \ldots, m-1\}$. Here we follow the typical assumption that these hash functions are perfect; that is, each hash function maps each item in the universe independently and uniformly to a number in $\{0, \ldots, m-1\}$. Initially all array bits are 0. For each element $x \in S$, the array bits $h_i(x)$ are set to 1 for $1 \le i \le k$; it does not matter if some bit is set to 1 multiple times. To check if an item $y$ is in $S$, we check whether all $h_i(y)$ are set to 1. If not, then clearly $y$ is not a member of $S$. If all $h_i(y)$ are set to 1, we conclude that $y$ is in $S$, although this may be a false positive. A Bloom filter does not produce false negatives.
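The construction above can be sketched in a few lines of Python. This is a minimal illustration rather than a tuned implementation: the class name `BloomFilter`, the one-byte-per-bit array, and the use of salted SHA-256 as a stand-in for perfectly random hash functions are all assumptions made for this sketch.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter: an array of m bits and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        if m <= 0 or k <= 0:
            raise ValueError("m and k must be positive")
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, chosen for readability

    def _positions(self, item: str):
        # Assumption: salted SHA-256 approximates k independent, uniformly
        # random hash functions with range {0, ..., m-1}.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1  # setting a bit more than once is harmless

    def __contains__(self, item: str) -> bool:
        # True may be a false positive; False is never a false negative.
        return all(self.bits[pos] for pos in self._positions(item))


keys = [f"key-{i}" for i in range(1000)]
bf = BloomFilter(m=8000, k=6)
for key in keys:
    bf.add(key)
```

Every inserted key is reported as present, while a query for a string that was never inserted is usually, but not always, rejected.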

The primary standard theoretical guarantee associated with a Bloom filter is the following. Let $y$ be an element of the universe such that $y \notin S$, where $y$ is chosen independently of the hash functions used to create the filter. Let $\rho$ be the fraction of bits set to 1 after the elements are hashed. Then

$$\Pr(y \text{ yields a false positive}) = \rho^k.$$

For a bit in the Bloom filter to be 0, it has to not be the outcome of the $kn$ hash values for the $n$ items. It follows that

$$\mathbf{E}[\rho] = 1 - \left(1 - \frac{1}{m}\right)^{kn} \approx 1 - e^{-kn/m},$$

and that via standard techniques using concentration bounds (see, e.g., MU ())

$$\Pr\left(|\rho - \mathbf{E}[\rho]| \ge \gamma\right) \le e^{-\Theta(\gamma^2 m)}$$

in the typical regime where $m/n$ and $k$ are constant. That is, $\rho$ is, with high probability, very close to its easily calculable expectation, and thus we know (up to very small random deviations) what the probability is that an element will be a false positive. Because of this tight concentration around the expectation, it is usual to talk about the false positive probability of a Bloom filter; in particular, it is generally referred to as though it is a constant depending on the filter parameters, even though it is a random variable, because it is tightly concentrated around its expectation.
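As a quick numerical check of these quantities, the sketch below evaluates the expected fraction of set bits, $1 - (1 - 1/m)^{kn}$, and the resulting false positive probability for an array of $m$ bits, $n$ elements, and $k$ hash functions. The function names are hypothetical, and the rule $k \approx (m/n)\ln 2$ for choosing the number of hash functions is the standard one from the Bloom filter literature rather than something derived in the text above.

```python
import math


def expected_fill(m: int, n: int, k: int) -> float:
    """Expected fraction of bits set to 1 after hashing n items k times each."""
    return 1.0 - (1.0 - 1.0 / m) ** (k * n)


def false_positive_probability(m: int, n: int, k: int) -> float:
    """Approximate false positive probability: (expected fill) ** k."""
    return expected_fill(m, n, k) ** k


m, n = 8000, 1000                       # 8 bits per element
k_opt = round((m / n) * math.log(2))    # standard choice of k, rounded
fpp = false_positive_probability(m, n, k_opt)
print(f"k = {k_opt}, expected fill = {expected_fill(m, n, k_opt):.4f}, fpp = {fpp:.4f}")
```

At 8 bits per element the result is roughly two percent, which is the usual figure quoted for a one-byte-per-item Bloom filter.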

Moreover, given a set of distinct query elements $Q = \{y_1, y_2, \ldots, y_q\}$ with $Q \cap S = \emptyset$ chosen a priori before the Bloom filter is instantiated, the fraction of false positives over these queries will similarly be concentrated near $\rho^k$. Hence we may talk about the false positive rate of a Bloom filter over queries, which (when the query elements are distinct) is essentially the same as the false positive probability. (When the query elements are not distinct, the false positive rate may vary significantly, depending on the distribution of the number of appearances of elements and which ones yield false positives; we focus on the distinct item setting here.) In particular, the false positive rate is a priori the same for any possible query set $Q$. Hence one approach to finding the false positive rate of a Bloom filter empirically is simply to test a random set of query elements (that does not intersect $S$) and find the fraction of false positives. Indeed, it does not matter what set $Q$ is chosen, as long as it is chosen independently of the hash functions.

We emphasize that, as we discuss further below, the term false positive rate often has a different meaning in the context of learning theory applications. Care must therefore be taken in understanding how the term is being used.

### 2.2 Additional Bloom Filter Benefits and Limitations

For completeness, we relate some of the other benefits and limitations of Bloom filters. More details can be found in BroderMitzenmacher ().

We have assumed in the above analysis that the hash functions are fully random. As fully random hash functions are not practically implementable, there are often questions relating to how well the idealization above matches the real world for specific hash functions. In practice, however, the model of fully random hash functions appears reasonable in many cases; see Simple () for further discussion on this point.

If an adversary has access to the hash functions used, or to the final Bloom filter, it can find elements that lead to false positives. One must therefore find other structures for adversarial situations. A theoretical framework for such settings is developed in naor2015bloom (). Variations of Bloom filters, which adapt to false positives and prevent them in the future, are described in adapt2 (); adapt1 (); while not meant for adversarial situations, they prevent repeated false positives with the same element.

One of the key advantages of a standard Bloom filter is that it is easy to insert an element (possibly slightly changing the false positive probability), although one cannot delete an element without using a more complex structure, such as a counting Bloom filter. However, there are more recent alternatives to the standard Bloom filter, such as the cuckoo filter cuckoof (), which can achieve the same or better space performance as a standard Bloom filter while allowing insertions and deletions. If the Bloom filter does not need to insert or delete elements, a well-known alternative is to develop a perfect hash function for the data set, and store a fingerprint of each element in each corresponding hash location (see, e.g., BroderMitzenmacher () for further explanation); this approach reduces the space required by approximately 30%.

## 3 Learned Bloom Filters

### 3.1 Definition of the Data Structure

We now consider the learned Bloom filter construction as described in TCFLIS (). We are given a set of positive keys $\mathcal{K}$ that correspond to the set to be held in the Bloom filter – that is, $\mathcal{K}$ corresponds to the set $S$ in the previous section. We are also given a set of negative keys $\mathcal{U}$ for training. We then train a neural network with $\mathcal{D} = \{(x_i, y_i = 1) \mid x_i \in \mathcal{K}\} \cup \{(x_i, y_i = 0) \mid x_i \in \mathcal{U}\}$; that is, they suggest using a neural network on this binary classification task to produce a probability, based on minimizing the log loss function

$$L = -\sum_{(x,y) \in \mathcal{D}} \bigl[\, y \log f(x) + (1-y)\log(1 - f(x)) \,\bigr],$$

where $f$ is the learned model from the neural network. Then $f(x)$ can be interpreted as a “probability” estimate that $x$ is a key from the set. Their suggested approach is to choose a threshold $\tau$ so that if $f(x) \ge \tau$ then the algorithm returns that $x$ is in the set, and no otherwise. Since such a process may provide false negatives for some keys in $\mathcal{K}$ that have $f(x) < \tau$, a secondary structure – such as a smaller standard Bloom filter holding the keys from $\mathcal{K}$ that have $f(x) < \tau$ – can be used to check keys with $f(x) < \tau$ to ensure there are no false negatives, matching this feature of the standard Bloom filter.

In essence, TCFLIS () suggests using a pre-filter ahead of the Bloom filter, where the pre-filter comes from a neural network and estimates the probability a key is in the set, allowing the use of a smaller Bloom filter than if one just used a Bloom filter alone. Performance improves if the combined size of the learned function and the smaller backup filter for false negatives is smaller than the size of a corresponding Bloom filter with the same false positive rate. Of course the pre-filter here need not come from a neural network; any approach that would estimate the probability an input key is in the set could be used.

This motivates the following formal definition:

###### Definition 1

A learned Bloom filter on a set of positive keys $\mathcal{K}$ and negative keys $\mathcal{U}$ is a function $f : U \to [0,1]$ and threshold $\tau$, where $U$ is the universe of possible query keys, and an associated standard Bloom filter $B$, referred to as a backup filter. The backup filter holds the set of keys $\{z : z \in \mathcal{K}, f(z) < \tau\}$. For a query $y$, the learned Bloom filter returns that $y \in \mathcal{K}$ if $f(y) \ge \tau$, or if $f(y) < \tau$ and the backup filter returns that $y \in \mathcal{K}$. The learned Bloom filter returns $y \notin \mathcal{K}$ otherwise.
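Definition 1 can be made concrete with a short Python sketch. Everything named here is an assumption for illustration: `toy_score` is a hand-written stand-in for a trained model, the key ranges and threshold are arbitrary toy values, and the backup filter is a small standard Bloom filter using salted SHA-256 in place of ideal hash functions.

```python
import hashlib
import random
from typing import Callable, Iterable


class BloomFilter:
    """Small standard Bloom filter, used here as the backup filter."""

    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item: int):
        # Assumption: salted SHA-256 approximates k independent random hashes.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: int) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: int) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


class LearnedBloomFilter:
    """Score function, threshold, and backup filter, as in Definition 1."""

    def __init__(self, keys: Iterable[int], f: Callable[[int], float],
                 tau: float, bits_per_backup_key: int = 10, k: int = 7) -> None:
        self.f, self.tau = f, tau
        # The backup filter holds exactly the keys whose score is below tau.
        self.false_negatives = [z for z in keys if f(z) < tau]
        m = max(1, bits_per_backup_key * len(self.false_negatives))
        self.backup = BloomFilter(m, k)
        for z in self.false_negatives:
            self.backup.add(z)

    def __contains__(self, y: int) -> bool:
        # Positive if the score clears the threshold or the backup says yes.
        return self.f(y) >= self.tau or y in self.backup


def toy_score(y: int) -> float:
    """Hypothetical learned function: confident only inside a dense range."""
    return 0.5 if 1000 <= y <= 2000 else 0.0


rng = random.Random(0)
dense_keys = rng.sample(range(1000, 2001), 500)
sparse_keys = rng.sample(range(2001, 1_000_000), 500)
lbf = LearnedBloomFilter(dense_keys + sparse_keys, toy_score, tau=0.4)
```

In this toy setup the 500 sparse keys all score below the threshold and end up in the backup filter, so no key is ever reported absent, while any non-key inside the dense range is a guaranteed false positive.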

### 3.2 Defining the False Positive Probability

The question remains how to determine or derive the false positive probability for a learned Bloom filter, and how to choose an appropriate threshold $\tau$. The approach in TCFLIS () is to find the false positive rate over a test set. This approach is, as we have discussed, suitable for a standard Bloom filter, where the false positive rate is guaranteed to be close to its expected value for any test set, with high probability. However, this methodology requires additional assumptions in the learned Bloom filter setting.

As an example, suppose the universe of elements is the range $[0, 1000000)$, and the set $\mathcal{K}$ of keys to store in our Bloom filter consists of a random subset of 500 elements from the range $[1000, 2000]$, and of 500 other random elements from outside this range. Our learning algorithm might determine that a suitable function $f$ yields that $f(y)$ is large (say $f(y) \approx 1/2$) for elements in the range $[1000, 2000]$ and close to zero elsewhere, and then a suitable threshold might be $\tau = 0.4$. The resulting false positive rate depends substantially on what elements are queried. If $\mathcal{Q}$ consists of elements primarily in the range $[1000, 2000]$, the false positive rate will be quite high, while if $\mathcal{Q}$ is chosen uniformly at random over the whole range, the false positive rate will be quite low. Unlike a standard Bloom filter, the false positive rate is highly dependent on the query set, and is not well-defined independently of the queries.

Indeed, it seems plausible that in many situations, the query set $\mathcal{Q}$ might indeed be similar to the set of keys $\mathcal{K}$, so that $f(y)$ for $y \in \mathcal{Q}$ might often be above naturally chosen thresholds. For example, in security settings, one might expect that queries for objects under consideration (URLs, network flow features) would be similar to the set of keys stored in the filter. Unlike in the setting of a standard Bloom filter, the false positive probability for a query $y$ can depend on $y$, even before the function $f$ is instantiated.

It is worth noting, however, that the problem we point out here can possibly be a positive feature in other settings; it might be that the false positive rate is remarkably low if the query set is suitable. Again, one can consider the range example above where queries are uniform over the entire space; the query set is very unlikely to hit the range where the learned function $f$ yields an above-threshold value in that setting for a key outside of $\mathcal{K}$. The data-dependent nature of the learned Bloom filter may allow it to circumvent lower bounds for standard Bloom filter structures.

While the false positive probability for learned Bloom filters does not have the same properties as for a standard Bloom filter, we can define the false positive rate of a learned Bloom filter with respect to a given query distribution.

###### Definition 2

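A false positive rate on a query distribution $\mathcal{D}$ over $U - \mathcal{K}$ for a learned Bloom filter $(f, \tau, B)$ is given by

$$\Pr_{y \sim \mathcal{D}}(f(y) \ge \tau) + \left(1 - \Pr_{y \sim \mathcal{D}}(f(y) \ge \tau)\right) F(B),$$

where $F(B)$ is the false positive rate of the backup filter $B$.

While technically $F(B)$ is itself a random variable, the false positive rate is well concentrated around its expectation, which depends only on the size of the filter $|B|$ and the number of false negatives from $\mathcal{K}$ that must be stored in the filter, which depends on $f$. Hence where the meaning is clear we may consider the false positive rate for a learned Bloom filter with function $f$ and threshold $\tau$ to be

$$\Pr_{y \sim \mathcal{D}}(f(y) \ge \tau) + \left(1 - \Pr_{y \sim \mathcal{D}}(f(y) \ge \tau)\right) \mathbf{E}[F(B)],$$

where the expectation $\mathbf{E}[F(B)]$ is meant to be over instantiations of the Bloom filter with given size $|B|$.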

Given sufficient data, we can determine an empirical false positive rate on a test set, and use that to predict future behavior. Under the assumption that the test set has the same distribution as future queries, standard Chernoff bounds provide that the empirical false positive rate will be close to the false positive rate on future queries, as both will be concentrated around the expectation. In many learning theory settings, this empirical false positive rate appears to be referred to as simply the false positive rate; we emphasize that false positive rate, as we have explained above, typically means something different in the Bloom filter literature.

###### Definition 3

The empirical false positive rate on a set $\mathcal{T}$, where $\mathcal{T} \cap \mathcal{K} = \emptyset$, for a learned Bloom filter $(f, \tau, B)$ is the number of false positives from $\mathcal{T}$ divided by $|\mathcal{T}|$.

###### Theorem 4

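Consider a learned Bloom filter $(f, \tau, B)$, a test set $\mathcal{T}$, and a query set $\mathcal{Q}$, where $\mathcal{T}$ and $\mathcal{Q}$ are both determined from samples according to a distribution $\mathcal{D}$. Let $X$ be the empirical false positive rate on $\mathcal{T}$, and $Y$ be the empirical false positive rate on $\mathcal{Q}$. Then

$$\Pr(|X - Y| \ge \epsilon) \le e^{-\Omega\left(\epsilon^2 \min(|\mathcal{T}|, |\mathcal{Q}|)\right)}.$$

Proof: Let $\alpha = \Pr_{y \sim \mathcal{D}}(f(y) \ge \tau)$, and $\beta$ be the false positive rate for the backup filter. We first show that for $\mathcal{T}$ and $X$ that

$$\Pr\left(|X - (\alpha + (1-\alpha)\beta)| \ge \epsilon\right) \le 2e^{-2\epsilon^2 |\mathcal{T}|}.$$

This follows from a direct Chernoff bound (e.g., MU ()[Exercise 4.13]), since each sample chosen according to $\mathcal{D}$ is a false positive with probability $\alpha + (1-\alpha)\beta$. A similar bound holds for $\mathcal{Q}$ and $Y$.

We can therefore conclude

$$\Pr(|X - Y| \ge \epsilon) \le \Pr\left(|X - (\alpha + (1-\alpha)\beta)| \ge \tfrac{\epsilon}{2}\right) + \Pr\left(|Y - (\alpha + (1-\alpha)\beta)| \ge \tfrac{\epsilon}{2}\right) \le 2e^{-\epsilon^2 |\mathcal{T}|/2} + 2e^{-\epsilon^2 |\mathcal{Q}|/2},$$

giving the desired result.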
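To get a feel for how large the test and query sets must be, the following sketch evaluates a Hoeffding-style version of the bound in the proof: each empirical rate deviates from the true rate by at least $\epsilon/2$ with probability at most $2e^{-\epsilon^2 n/2}$ over $n$ samples, and the two tails are added. The function names are hypothetical, and the explicit constants are those of the standard Hoeffding inequality, which is an assumption about which form of the Chernoff bound is applied.

```python
import math


def hoeffding_tail(eps: float, n: int) -> float:
    """Bound on Pr(|empirical rate - true rate| >= eps) over n i.i.d. samples."""
    return min(1.0, 2.0 * math.exp(-2.0 * eps * eps * n))


def fpr_gap_bound(eps: float, test_size: int, query_size: int) -> float:
    """Bound on Pr(|X - Y| >= eps), splitting eps evenly between the two sets."""
    total = hoeffding_tail(eps / 2, test_size) + hoeffding_tail(eps / 2, query_size)
    return min(1.0, total)


for size in (100, 1_000, 10_000, 100_000):
    print(size, fpr_gap_bound(0.05, size, size))
```

With a tolerance of 0.05 the bound is vacuous for a few hundred samples and becomes very small once both sets contain on the order of ten thousand samples.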

Theorem 4 also informs us that it is reasonable to find a suitable parameter $\tau$, given $f$, by trying a suitable finite discrete set of values for $\tau$, and choosing the best size-accuracy tradeoff for the application. By a union bound, all choices of $\tau$ will perform close to their expectation with high probability.

While Theorem 4 requires the test set and query set to come from the same distribution $\mathcal{D}$, the negative examples $\mathcal{U}$ do not have to come from that distribution. Of course, if negative examples $\mathcal{U}$ are drawn from $\mathcal{D}$, it may yield a better learning outcome $f$.

If the test set and query set distribution do not match, because for example the types of queries change after the original gathering of test data $\mathcal{T}$, Theorem 4 offers limited guidance. Suppose $\mathcal{T}$ is derived from samples from distribution $\mathcal{D}$ and $\mathcal{Q}$ from another distribution $\mathcal{D}'$. If the two distributions are close (say in $L_1$ distance), or, more specifically, if the changes do not significantly change the probability that a query $y$ has $f(y) \ge \tau$, then the empirical false positive rate on the test set may still be relatively accurate. However, in practice it may be hard to provide such guarantees on the nature of future queries. This explains our previous statement that learned Bloom filters appear most useful when the query stream can be modeled as coming from a fixed distribution, which can be sampled during the construction.

We can return to our previous example to understand these effects. Recall our set consists of 500 random elements from the range $[1000, 2000]$ and 500 other random elements from the range $[0, 1000000)$. Our learned Bloom filter has $f(y) \ge \tau$ for all $y$ in $[1000, 2000]$ and $f(y) < \tau$ otherwise. Our backup filter will therefore store 500 elements. If our test set is uniform over $[0, 1000000)$ (excluding elements stored in the Bloom filter), our false positive rate from elements with too large an $f$ value would be approximately $0.0005$; one could choose a backup filter with roughly the same false positive probability for a total empirical false positive probability of $0.001$. If, however, our queries are uniform over a restricted range $[0, 100000)$, then the false positive probability would jump to $0.0055$ for the learned Bloom filter, because the learned function would yield more false positives over the smaller query range.

### 3.3 Additional Learned Bloom Filter Benefits and Limitations

Learned Bloom filters can easily handle insertions into $\mathcal{K}$ by adding the key, if it does not already yield a (false) positive, to the backup filter. Such changes have a larger effect on the false positive probability than for a standard Bloom filter, since the backup filter is smaller. Keys cannot be deleted naturally from a learned Bloom filter. A deleted key would simply become a false positive, which (if needed) could possibly be handled by an additional structure.

As noted in TCFLIS (), it may be possible to re-learn a new function $f$ if the data set changes substantially via insertions and deletions of keys from $\mathcal{K}$. Of course, besides the time needed to re-learn a new function $f$, this requires storing the original set somewhere, which may not be necessary for alternative schemes. Similarly, if the false positive probability proves higher than desired, one can re-learn a new function $f$; again, doing so will require access to $\mathcal{K}$, and maintaining a (larger) set $\mathcal{U}$ of negative examples.

## 4 Size of the Learned Function

We now consider how to model the performance of the learned Bloom filter with the goal of understanding how small the representation of the function $f$ needs to be in order for the learned Bloom filter to be more effective than a standard Bloom filter.¹

¹ We thank Alex Beutel for pointing out that our analysis in mypaper2 () could be used in this manner.

Our model is as follows. The function $f$ associated with Definition 1 we treat as an oracle for the keys $\mathcal{K}$, where $|\mathcal{K}| = m$, that works as follows. For keys not in $\mathcal{K}$ there is an associated false positive probability $F_p$, and there are $F_n m$ false negatives for keys in $\mathcal{K}$. (The value $F_n$ is like a false negative probability, but given $\mathcal{K}$ this fraction is determined and known according to the oracle outcomes.) We note the oracle representing the function $f$ is meant to be general, so it could potentially represent other sorts of filter structures as well. As we have described in Section 3.2, in the context of a learned Bloom filter the false positive rate is necessarily tied to the query stream, and is therefore generally an empirically determined quantity, but we take the value $F_p$ here as a given. Here we show how to optimize over a single oracle, although in practice we may possibly choose from oracles with different values $F_p$ and $F_n$, in which case we can optimize for each pair of values and choose the best suited to the application.

We assume a total budget of $bm$ bits for the backup filter, and $|f| = \zeta$ bits for the learned function. If $|\mathcal{K}| = m$, the backup Bloom filter only needs to hold $m F_n$ keys, and hence we take the number of bits per stored key to be $b/F_n$. To model the false positive rate of a Bloom filter that uses $j$ bits per stored key, we assume the false positive rate falls as $\alpha^j$. This is the case for a standard Bloom filter (where $\alpha \approx 0.6185$ when using the optimal number of hash functions, as described in the survey BroderMitzenmacher ()), as well as for a static Bloom filter built using a perfect hash function (where $\alpha = 1/2$, again described in BroderMitzenmacher ()). The analysis can be modified to handle other functions for false positives in terms of $j$ in a straightforward manner. (For example, for a cuckoo filter cuckoof (), a good approximation for the false positive rate is $c\alpha^j$ for suitable constants $c$ and $\alpha$.)

The false positive rate of a learned Bloom filter is $F_p + (1 - F_p)\alpha^{b/F_n}$. This is because, for $y \notin \mathcal{K}$, $y$ causes a false positive from the learned function $f$ with probability $F_p$, or with remaining probability $(1 - F_p)$ it yields a false positive on the backup Bloom filter with probability $\alpha^{b/F_n}$.

A comparable Bloom filter using the same number of total bits, namely $bm + \zeta$ bits, would have a false positive probability of $\alpha^{b + \zeta/m}$. Thus we find an improvement using a learned Bloom filter whenever

$$F_p + (1 - F_p)\alpha^{b/F_n} \le \alpha^{b + \zeta/m},$$

which simplifies to

$$\zeta/m \le \log_\alpha\left(F_p + (1 - F_p)\alpha^{b/F_n}\right) - b,$$

where we have expressed the requirement in terms of a bound on $\zeta/m$, the number of bits per key the function $f$ is allowed.

This expression is somewhat unwieldy, but it provides some insight into what sort of compression is required for the learned function $f$, and how a practitioner can determine what is needed. First, one can determine possible thresholds and the corresponding rates of false positives and false negatives from the learned function. For example, the paper TCFLIS () considers situations where $F_p \approx 0.01$ and $F_n \approx 0.5$; let us consider $F_p = 0.01$ and $F_n = 0.5$ for clarity. If we have a target goal of one byte per item, a standard Bloom filter achieves a false positive probability of approximately $0.0214$. If our learned function uses 3 bits per item (or less), then the learned Bloom filter can use $5m$ bits for the backup Bloom filter, and achieve a false positive rate of approximately $0.0181$. The learned Bloom filter will therefore provide over a $15\%$ reduction in false positives with the same or less space. More generally, in practice one could determine or estimate different $F_p$ and $F_n$ values for different thresholds and different learned functions of various sizes, and use these equations to determine if better performance can be expected without in-depth experiments.
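The comparison in this section is easy to script. In the sketch below, `fp` and `fn` stand for the oracle's false positive probability and false negative fraction, `b` for the backup filter's bits per key of the original set, and `alpha` for the per-bit decay of the false positive rate; the function names and the illustrative values 0.01 and 0.5 are assumptions chosen for this example.

```python
import math

ALPHA_STANDARD = 0.6185  # standard Bloom filter with the optimal number of hashes


def standard_fpr(bits_per_key: float, alpha: float = ALPHA_STANDARD) -> float:
    """False positive rate alpha**j of a plain Bloom filter with j bits per key."""
    return alpha ** bits_per_key


def learned_fpr(fp: float, fn: float, b: float, alpha: float = ALPHA_STANDARD) -> float:
    """Oracle false positives plus backup-filter false positives on the remainder."""
    return fp + (1.0 - fp) * alpha ** (b / fn)


def max_function_bits_per_key(fp: float, fn: float, b: float,
                              alpha: float = ALPHA_STANDARD) -> float:
    """Largest function size per key for which the learned filter is no worse
    than a standard Bloom filter given the same total number of bits."""
    return math.log(learned_fpr(fp, fn, b, alpha), alpha) - b


fp, fn = 0.01, 0.5                         # illustrative oracle quality
print(standard_fpr(8))                     # one byte per key, no learned function
print(learned_fpr(fp, fn, b=5))            # 5 bits per key left for the backup filter
print(max_function_bits_per_key(fp, fn, b=5))
```

Under these assumed values the learned function may use a little over 3 bits per key before the standard Bloom filter with the same total space catches up.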

Indeed, an interesting question raised by this analysis is how learned functions scale in terms of typical data sets. In extreme situations, such as when the set being considered is a range of consecutive integers, it can be represented by just two integers, which does not grow with $|\mathcal{K}|$. If, in practice, as data sets grow larger the amount of information needed for a learned function $f$ to approximate key sets $\mathcal{K}$ grows sublinearly with $|\mathcal{K}|$, learned Bloom filters may prove very effective.

## 5 Sandwiched Learned Bloom Filters

### 5.1 The Sandwich Structure

Given the formalization of the learned Bloom filter, it seems natural to ask whether this structure can be improved. Here we show that a better structure is to use a Bloom filter before using the function $f$, in order to remove most queries for keys not in $\mathcal{K}$. We emphasize that this initial Bloom filter does not declare that an input $y$ is in $\mathcal{K}$, but passes forward all matching keys to the learned function $f$, and it returns $y \notin \mathcal{K}$ when the Bloom filter shows the key is not in $\mathcal{K}$. Then, as before, we use the function $f$ to attempt to remove false positives from the initial Bloom filter, and then use the backup filter to allow back in keys from $\mathcal{K}$ that were false negatives for $f$. Because we have two layers of Bloom filters surrounding the learned function $f$, we refer to this as a sandwiched learned Bloom filter. The sandwiched learned Bloom filter is represented pictorially in Figure 1.
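The query path through the sandwich can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: `toy_score` is a hypothetical stand-in for a learned function, the bit budgets are arbitrary, salted SHA-256 replaces ideal hash functions, and each filter picks its number of hash functions with the usual $(\text{bits}/\text{items})\ln 2$ rule.

```python
import hashlib
import math
import random
from typing import Callable, Iterable


class BloomFilter:
    def __init__(self, m: int, n_items: int) -> None:
        self.m = max(1, m)
        # Assumption: the usual k = (m/n) ln 2 rule, with at least one hash.
        self.k = max(1, round(self.m / max(1, n_items) * math.log(2)))
        self.bits = bytearray(self.m)

    def _positions(self, item: int):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: int) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: int) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


class SandwichedLearnedBloomFilter:
    """Initial Bloom filter, then learned function, then backup Bloom filter."""

    def __init__(self, keys: Iterable[int], f: Callable[[int], float], tau: float,
                 initial_bits_per_key: int, backup_bits_per_key: int) -> None:
        keys = list(keys)
        self.f, self.tau = f, tau
        # The initial filter holds every key.
        self.initial = BloomFilter(initial_bits_per_key * len(keys), len(keys))
        for z in keys:
            self.initial.add(z)
        # The backup filter holds only the keys the learned function rejects.
        false_negatives = [z for z in keys if f(z) < tau]
        self.backup = BloomFilter(backup_bits_per_key * len(keys), len(false_negatives))
        for z in false_negatives:
            self.backup.add(z)

    def __contains__(self, y: int) -> bool:
        if y not in self.initial:
            return False              # rejected up front: y is certainly not a key
        if self.f(y) >= self.tau:
            return True               # accepted by the learned function
        return y in self.backup       # last chance: recover false negatives


def toy_score(y: int) -> float:
    """Hypothetical learned function: confident only inside a dense range."""
    return 0.5 if 1000 <= y <= 2000 else 0.0


rng = random.Random(1)
all_keys = rng.sample(range(1000, 2001), 500) + rng.sample(range(2001, 1_000_000), 500)
sandwich = SandwichedLearnedBloomFilter(all_keys, toy_score, tau=0.4,
                                        initial_bits_per_key=3, backup_bits_per_key=5)
```

Both budgets are expressed per key of the full set, so the backup filter, which stores only the false negatives, ends up with more bits per stored key than its nominal budget suggests.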

In hindsight, our result that sandwiching improves performance makes sense. The purpose of the backup Bloom filter is to remove the false negatives arising from the learned function. If we can arrange to remove more false positives up front, then the backup Bloom filter can be quite porous, allowing most everything that reaches it through, and therefore can be quite small. Indeed, surprisingly, our analysis shows that the backup filter should not grow beyond a fixed size.

### 5.2 Analyzing Sandwiched Learned Bloom Filters

We model the sandwiched learned Bloom filter as follows. As before, the learned function $f$ in the middle of the sandwich we treat as an oracle for the keys $\mathcal{K}$, where $|\mathcal{K}| = m$. Also as before, for keys not in $\mathcal{K}$ there is an associated false positive probability $F_p$, and there are $F_n m$ false negatives for keys in $\mathcal{K}$.

We here assume a total budget of $bm$ bits to be divided between an initial Bloom filter of $b_1 m$ bits and a backup Bloom filter of $b_2 m$ bits. As before, we model the false positive rate of a Bloom filter that uses $j$ bits per stored key as $\alpha^j$ for simplicity. The backup Bloom filter only needs to hold $m F_n$ keys, and hence we take the number of bits per stored key to be $b_2/F_n$. If we find the best value of $b_2$ is $b$, then no initial Bloom filter is needed, but otherwise, an initial Bloom filter is helpful.

The false positive rate of a sandwiched learned Bloom filter is then $\alpha^{b_1}\left(F_p + (1 - F_p)\alpha^{b_2/F_n}\right)$. To see this, note that for $y \notin \mathcal{K}$, $y$ first has to pass through the initial Bloom filter, which occurs with probability $\alpha^{b_1}$. Then $y$ either causes a false positive from the learned function $f$ with probability $F_p$, or with remaining probability $(1 - F_p)$ it yields a false positive on the backup Bloom filter, with probability $\alpha^{b_2/F_n}$.

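As $\alpha$, $F_p$, $F_n$, and $b$ are all constants for the purpose of this analysis, we may optimize for $b_2$ in the equivalent expression

$$F_p\,\alpha^{b - b_2} + (1 - F_p)\,\alpha^{b + b_2(1/F_n - 1)}.$$

The derivative with respect to $b_2$ is

$$F_p(-\ln \alpha)\,\alpha^{b - b_2} + (1 - F_p)\left(\frac{1}{F_n} - 1\right)(\ln \alpha)\,\alpha^{b + b_2(1/F_n - 1)}.$$

This equals 0 when

$$\frac{F_p}{(1 - F_p)\left(\frac{1}{F_n} - 1\right)} = \alpha^{b_2/F_n}. \tag{1}$$

This further yields that the false positive rate is minimized when $b_2 = b_2^*$, where

$$b_2^* = F_n \log_\alpha \frac{F_p}{(1 - F_p)\left(\frac{1}{F_n} - 1\right)}. \tag{2}$$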

This result may be somewhat surprising, as here we see that the optimal value $b_2^*$ is a constant, independent of $b$. That is, the number of bits used for the backup filter is not a constant fraction of the total budgeted number of bits $bm$, but a fixed number of bits; if the number of budgeted bits increases, one should simply increase the size of the initial Bloom filter as long as the backup filter is appropriately sized.
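The optimum can be checked numerically. In the sketch below, `fp` and `fn` are the oracle's false positive probability and false negative fraction, `b1` and `b2` are the bits per key given to the initial and backup filters, and `alpha` is the per-bit decay of the false positive rate. The function names, the clamping of the split to the available budget, and the sample values (0.01, 0.5, and a budget of 8 bits per key) are assumptions made for illustration.

```python
import math


def optimal_backup_bits(fp: float, fn: float, alpha: float = 0.6185) -> float:
    """Closed-form optimum: fn * log_alpha(fp / ((1 - fp) * (1/fn - 1))).

    Requires 0 < fp < 1 and 0 < fn < 1.
    """
    return fn * math.log(fp / ((1.0 - fp) * (1.0 / fn - 1.0)), alpha)


def sandwiched_fpr(fp: float, fn: float, b1: float, b2: float,
                   alpha: float = 0.6185) -> float:
    """Pass the initial filter, then fail at the oracle or at the backup filter."""
    return alpha ** b1 * (fp + (1.0 - fp) * alpha ** (b2 / fn))


def best_split(fp: float, fn: float, b: float, alpha: float = 0.6185):
    """Return (b1, b2) with b1 + b2 = b, clamping the optimum into [0, b].

    Clamping is valid because the objective is convex in b2.
    """
    b2 = min(max(optimal_backup_bits(fp, fn, alpha), 0.0), b)
    return b - b2, b2


fp, fn, b = 0.01, 0.5, 8.0
b1, b2 = best_split(fp, fn, b)
print("backup bits per key:", b2)
print("without initial filter:", sandwiched_fpr(fp, fn, 0.0, b))
print("with optimal split:   ", sandwiched_fpr(fp, fn, b1, b2))
```

Raising the total budget in this sketch leaves the backup allocation unchanged and sends every additional bit to the initial filter, which is the behavior described above.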

In hindsight, returning to the expression for the false positive rate provides useful intuition. If we think of sequentially distributing the $bm$ bits among the two Bloom filters, the expression shows that bits assigned to the initial filter (the $b_1$ bits) reduce false positives arising from the learned function (the $F_p$ term) as well as false positives arising subsequent to the learned function (the $(1 - F_p)$ term), while the backup filter only reduces false positives arising subsequent to the learned function. Initially we would provide bits to the backup filter to reduce the rate of false positives subsequent to the learned function. Indeed, bits in the backup filter drive down this $(1 - F_p)$ term rapidly, because the backup filter holds fewer keys from the original set, leading to the $b_2/F_n$ (instead of just a $b_2$) in the exponent in the expression $\alpha^{b_2/F_n}$. Once the rate of false positives coming through the backup Bloom filter reaches an appropriate level, which, by plugging in the determined optimal value for $b_2$, we find is $F_p/(1/F_n - 1)$, then the tradeoff changes. At that point the gains from reducing the false positives by increasing the bits for the backup Bloom filter become smaller than the gains obtained by increasing the bits for the initial Bloom filter.

Again, we can look at situations discussed in TCFLIS () for some insight. Suppose we have a learned function $f$ where $F_n = 0.5$ and $F_p = 0.01$. We consider $\alpha = 0.6185$ (which corresponds to a standard Bloom filter). We do not consider the size of $f$ in the calculation below. Then the optimal value for $b_2$ is

$$b_2^* = \frac{1}{2}\log_\alpha \frac{1}{99} \approx 4.78.$$

Depending on our Bloom filter budget parameter , we obtain different levels of performance improvement by using the initial Bloom filter. At bits per key, the false positive rate drops from approximately to , over a factor of 2. At bits per key, the false positive rate drops from approximately to , almost an order of magnitude.

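We may also consider the implications for the oracle size. Again, if we let $\zeta$ represent the size of the oracle in bits, then a corresponding Bloom filter would have a false positive probability of $\alpha^{b + \zeta/m}$. Hence we have an improvement whenever

$$\alpha^{b_1}\left(F_p + (1 - F_p)\alpha^{b_2/F_n}\right) \le \alpha^{b + \zeta/m}.$$

For $b$ sufficiently large that $b_1 > 0$, we can calculate the false positive probability of the optimized sandwiched Bloom filter. Let $b_2^*$ be the optimal value for $b_2$ from equation 2 and $b_1^*$ be the corresponding value for $b_1$. First using the relationship from equation 1, we have a gain whenever

$$\alpha^{b_1^*}\, F_p\left(1 + \frac{1}{\frac{1}{F_n} - 1}\right) = \alpha^{b_1^*}\,\frac{F_p}{1 - F_n} \le \alpha^{b + \zeta/m}.$$

Using $b = b_1^* + b_2^*$ and equation 2 gives

$$\zeta/m \le \log_\alpha \frac{F_p}{1 - F_n} - F_n \log_\alpha \frac{F_p}{(1 - F_p)\left(\frac{1}{F_n} - 1\right)}.$$

Again, this expression is somewhat unwieldy, but one useful difference from the analysis of the original learned Bloom filter is that we see the improvement does not depend on the exact value of $b$ (as long as $b$ is large enough so that $b_1 > 0$, and we use the optimal value for $b_2$). For $\alpha = 0.6185$, $F_p = 0.01$, and $F_n = 0.5$, we find a gain whenever $\zeta/m$ falls below approximately $3.36$.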
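The bound on the oracle size for the optimized sandwich depends only on the oracle's false positive probability `fp`, its false negative fraction `fn`, and the decay constant `alpha`, which makes it simple to evaluate. The sketch below is a direct transcription under the assumption that the budget is large enough for the initial filter to be non-empty; the function name and sample values are illustrative.

```python
import math


def max_oracle_bits_per_key(fp: float, fn: float, alpha: float = 0.6185) -> float:
    """Largest oracle size per key for which the optimized sandwich still wins.

    Computes log_alpha(fp / (1 - fn)) minus the optimal backup allocation,
    assuming 0 < fp < 1, 0 < fn < 1, and a non-empty initial filter.
    """
    backup_bits = fn * math.log(fp / ((1.0 - fp) * (1.0 / fn - 1.0)), alpha)
    return math.log(fp / (1.0 - fn), alpha) - backup_bits


for fp, fn in [(0.01, 0.5), (0.01, 0.25), (0.001, 0.5)]:
    print(fp, fn, round(max_oracle_bits_per_key(fp, fn), 2))
```

A more accurate oracle, with lower `fp` or lower `fn`, can afford to spend more bits per key and still beat a plain Bloom filter of the same total size.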

A possible further advantage of the sandwich approach is that it makes learned Bloom filters more robust. As discussed previously, if the queries given to a learned Bloom filter do not come from the same distribution as the queries from the test set used to estimate the learned Bloom filter’s false positive probability, the actual false positive probability may be substantially larger than expected. The use of an initial Bloom filter mitigates this problem, as this issue then only affects the smaller number of keys that pass the initial Bloom filter.

We note that a potential disadvantage of the sandwich approach may be that it is more computationally complex than a learned Bloom filter without sandwiching, requiring possibly more hashing and memory accesses for the initial Bloom filter. The overall efficiency would be implementation dependent, but this remains a possible issue for further research.

## 6 Learned Bloomier Filters

In the supplemental material, we consider learned Bloomier filters. Bloomier filters are a variation of the Bloom filter idea where each key in the set $\mathcal{K}$ has an associated value. The Bloomier filter returns the value for every key of $\mathcal{K}$, and is supposed to return a null value for keys not in $\mathcal{K}$, but in this context there can be false positives where the return for a key outside of $\mathcal{K}$ is a non-null value with some probability. We derive related formulae for the performance of learned Bloomier filters.

## 7 Conclusion

We have focused on providing a more formal analysis of the proposed learned Bloom filter. As part of this, we have attempted to clarify a particular issue in the Bloom filter setting, namely the dependence of what is referred to as the false positive rate in TCFLIS () on the query set, and how it might affect the applications this approach is suited for. We have also found that our modeling leads to a natural and interesting optimization, based on sandwiching, and allows for generalizations to related structures, such as Bloomier filters. Our discussion is meant to encourage users to take care to realize all of the implications of the learned Bloom filter approach before adopting it. However, for sets that can be accurately predicted by small learned functions, the learned Bloom filter may provide a novel means of obtaining significant performance improvements over standard Bloom filter variants.

## Acknowledgments

The author thanks Suresh Venkatasubramanian for suggesting a closer look at [7], and thanks the authors of [7] for helpful discussions involving their work. This work was supported in part by NSF grants CCF-1563710, CCF-1535795, CCF-1320231, and CNS-1228598. Part of this work was done while visiting Microsoft Research New England.

## References

- (1) M. Bender, M. Farach-Colton, M. Goswami, R. Johnson, S. McCauley, and S. Singh. Bloom Filters, Adaptivity, and the Dictionary Problem. https://arxiv.org/abs/1711.01616, 2017.
- (2) A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. Internet Mathematics, 1(4):485-509, 2004.
- (3) D. Charles and K. Chellapilla. Bloomier Filters: A Second Look. In Proceedings of the European Symposium on Algorithms, pp. 259-270, 2008.
- (4) B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. The Bloomier Filter: an Efficient Data Structure for Static Support Lookup Tables. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 30-39, 2004.
- (5) K. Chung, M. Mitzenmacher, and S. Vadhan. Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream. Theory of Computing, 9(30):897-945, 2013.
- (6) B. Fan, D. Andersen, M. Kaminsky, and M. Mitzenmacher. Cuckoo Filter: Practically Better than Bloom. In Proceedings of the 10th ACM International Conference on Emerging Networking Experiments and Technologies, pp. 75-88, 2014.
- (7) T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The Case for Learned Index Structures. https://arxiv.org/abs/1712.01208, 2017.
- (8) M. Mitzenmacher. A Model for Learned Bloom Filters and Related Structures. https://arxiv.org/abs/1802.00884, 2018.
- (9) M. Mitzenmacher. Optimizing Learned Bloom Filters by Sandwiching. https://arxiv.org/abs/1803.01474, 2018.
- (10) M. Mitzenmacher, S. Pontarelli, and P. Reviriego. Adaptive Cuckoo Filters. In Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 36-47, 2018.
- (11) M. Mitzenmacher and E. Upfal. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, 2017.
- (12) M. Naor and E. Yogev. Bloom Filters in Adversarial Environments. In Proceedings of the Annual Cryptography Conference, pp. 565-584, 2015.

## Appendix A Supplemental: Learned Bloomier Filters Derivation

Bloomier filters are a variation of the Bloom filter idea where each key in the set $\mathcal{K}$ has an associated value, which for convenience we will assume is a $j$-bit value. The Bloomier filter returns the value for every key of $\mathcal{K}$, and is supposed to return a null value for keys not in $\mathcal{K}$, but in this context there can be false positives where the return for a key outside of $\mathcal{K}$ is a non-null value with some probability. For our purposes, this description will suffice (although we present a few more details below in presenting our model), but more information on Bloomier filters and their constructions can be found in [3, 4].

Here we imagine that we can derive a learned function $f$ that will return a value given an input, with the goal being that the function will return the appropriate $j$-bit value for a key in $\mathcal{K}$ and the null value otherwise. In this setting we refer to a false positive as a key outside of $\mathcal{K}$ that obtains a non-null value, and a false negative as a key in $\mathcal{K}$ that obtains an incorrect value, where that value may either be the null value or the wrong value for that key.

Notice in this setting that, because keys in $\mathcal{K}$ may obtain an incorrect value that is not merely null, the system to correct for false negatives must be slightly more complicated. We propose an approach in Figure 2. The input is first passed to the learned oracle, which provides a predicted value. To handle false negatives, we provide a two-stage scheme. First, we use a Bloom filter to hold any keys that lead to false negatives. If the Bloom filter reports a key as a positive, which we refer to as a hit on the Bloom filter to avoid ambiguity, it is assumed that the key is a false negative from the oracle and a backup Bloomier filter is used to determine its value. If the Bloom filter reports a key as a negative, it is assumed the learned oracle provided the correct value (whether null or non-null) for that key, and that value is returned. We can see that for every key in $\mathcal{K}$ the correct value is returned, so the question is what the false positive rate is for this chain.
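
The query path just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the construction analyzed here: `TinyBloom`, `build`, `lookup`, and `toy_oracle` are hypothetical names, the oracle is a hand-written stand-in for a learned function, and the backup Bloomier filter is replaced by an exact dictionary (so the backup itself contributes no false positives in this sketch).

```python
import hashlib
from typing import Dict, Iterator, Optional, Tuple


class TinyBloom:
    """A deliberately small Bloom filter (hypothetical helper for this sketch)."""

    def __init__(self, num_bits: int, num_hashes: int = 2) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key: str) -> Iterator[int]:
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = True

    def hit(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


def toy_oracle(key: str) -> Optional[int]:
    """Stand-in for the learned function; assumes keys look like 'key<N>'."""
    i = int(key[3:])
    if i < 40:
        return i % 4        # correct value for most keys in the set
    if i < 50:
        return (i + 1) % 4  # wrong value for a key in the set: false negative
    if i < 60:
        return 0            # non-null value for a non-key: false positive
    return None


def build(data: Dict[str, int], num_bits: int) -> Tuple[TinyBloom, Dict[str, int]]:
    """Store the oracle's false negatives in the Bloom filter, then give the
    backup structure every key of the set that hits the Bloom filter."""
    bloom = TinyBloom(num_bits)
    for key, value in data.items():
        if toy_oracle(key) != value:
            bloom.add(key)
    # Correctly predicted keys that happen to hit the filter need backup entries too.
    backup = {key: value for key, value in data.items() if bloom.hit(key)}
    return bloom, backup


def lookup(key: str, bloom: TinyBloom, backup: Dict[str, int]) -> Optional[int]:
    predicted = toy_oracle(key)
    if bloom.hit(key):          # a hit: defer to the backup structure
        return backup.get(key)  # exact dict here; None plays the role of null
    return predicted            # passed: trust the oracle's value


DATA = {f"key{i}": i % 4 for i in range(50)}
BLOOM, BACKUP = build(DATA, num_bits=64)
print(all(lookup(k, BLOOM, BACKUP) == v for k, v in DATA.items()))
print("backup entries:", len(BACKUP), "of", len(DATA), "keys")
```

The backup holds more than the ten false negatives whenever correctly predicted keys collide in the small Bloom filter, which is the effect the space analysis has to account for.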

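Our model is as follows. We again treat the function $f$ as an oracle for the keys $\mathcal{K}$, where $|\mathcal{K}| = m$, and the size of the oracle is $\zeta$. For keys not in $\mathcal{K}$ there is an associated false positive probability $F_p$, and there are $F_n m$ false negatives for keys in $\mathcal{K}$. The Bloom filter will consist of $bm$ bits and have its own false positive probability of $\alpha^{b/F_n}$, since it stores only the $F_n m$ false negatives and therefore has $b/F_n$ bits per stored key; here $\alpha$ depends on the type of Bloom filter used (approximately $0.6185$ for a standard Bloom filter). (Of course, instead of a standard Bloom filter we could use a learned Bloom filter in its place, but that is harder to model.)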

To model a Bloomier filter, we use the following approach: a Bloomier filter for $m$ keys uses space $c(j+r)m$, where $j$ is the number of bits in the return value, $c$ is a constant that is determined by the Bloomier filter design, and $r$ is a chosen parameter, with the result that the false positive probability for a key outside of $\mathcal{K}$ is $2^{-r}$. This setup, for example, matches the construction of [4]. One can think of it as having $cm$ cells of $j+r$ bits. The simple construction of [4] hashes a query key to (typically) three cells, and exclusive-ors their contents together; if the result is a valid $j$-bit value (say with $r$ leading zeroes), this value is returned, and otherwise a null value is returned. The table is initially filled so that the right values are returned for the $m$ keys, and other keys obtain a value uniformly distributed over $j+r$ bits, leading to the false positive probability of $2^{-r}$. This requires $cm$ cells for some $c > 1$ to provide enough “room” to set up suitable cell values initially.
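
The cell-based view above can be sketched in code. The following is an illustrative sketch, not the cited construction itself: the names `cells_for`, `build`, and `lookup` are hypothetical, the table is filled by a standard peeling order (one common way to find suitable cell values), and a key-dependent mask is XORed in so that non-keys obtain an approximately uniform value; the mask and the choice $c = 1.5$ are assumptions of this sketch.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


def _h(tag: str, seed: int, key: str) -> int:
    """64-bit hash of (tag, seed, key) derived from SHA-256."""
    digest = hashlib.sha256(f"{tag}:{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def cells_for(key: str, num_cells: int, seed: int) -> List[int]:
    """Three distinct cell indices for a key (requires num_cells >= 3)."""
    cells: List[int] = []
    attempt = 0
    while len(cells) < 3:
        idx = _h(f"cell{attempt}", seed, key) % num_cells
        if idx not in cells:
            cells.append(idx)
        attempt += 1
    return cells


def build(data: Dict[str, int], j: int, r: int, c: float = 1.5) -> Tuple[List[int], int]:
    """Fill about c*m cells of j+r bits so each key XORs to its j-bit value."""
    num_cells = max(3, int(c * len(data)) + 1)
    width_mask = (1 << (j + r)) - 1
    for seed in range(100):  # retry with fresh hashes if peeling fails
        touching = [set() for _ in range(num_cells)]
        for key in data:
            for idx in cells_for(key, num_cells, seed):
                touching[idx].add(key)
        order = []  # (key, cell owned by that key), in peeling order
        stack = [i for i in range(num_cells) if len(touching[i]) == 1]
        while stack:
            cell = stack.pop()
            if len(touching[cell]) != 1:
                continue  # stale entry: the cell's last key was already peeled
            (key,) = touching[cell]
            order.append((key, cell))
            for idx in cells_for(key, num_cells, seed):
                touching[idx].discard(key)
                if len(touching[idx]) == 1:
                    stack.append(idx)
        if len(order) < len(data):
            continue  # some keys could not be peeled; try another seed
        table = [0] * num_cells
        for key, cell in reversed(order):  # assign in reverse peeling order
            acc = data[key] ^ (_h("mask", seed, key) & width_mask)
            for idx in cells_for(key, num_cells, seed):
                if idx != cell:
                    acc ^= table[idx]
            table[cell] = acc
        return table, seed
    raise RuntimeError("peeling failed for every seed tried")


def lookup(key: str, table: List[int], seed: int, j: int, r: int) -> Optional[int]:
    """XOR the key's three cells and its mask; accept only r leading zeroes."""
    acc = _h("mask", seed, key) & ((1 << (j + r)) - 1)
    for idx in cells_for(key, len(table), seed):
        acc ^= table[idx]
    return acc if acc < (1 << j) else None


J, R = 8, 4
DATA = {f"key{i}": i % (1 << J) for i in range(200)}
TABLE, SEED = build(DATA, J, R)
hits = sum(lookup(f"other{i}", TABLE, SEED, J, R) is not None for i in range(5000))
print("empirical false positive rate:", hits / 5000, "vs 2**-r =", 2.0 ** -R)
```

Every key in `DATA` recovers its value exactly, while a non-key passes the leading-zero check with probability close to $2^{-r}$.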

If we just used a standard Bloomier filter for the keys $\mathcal{K}$, then we would use $c(j+r)m$ bits for a false positive probability of $2^{-r}$.

For our learned Bloomier filter construction, we start with the learned function $f$ of size $\zeta$. The function yields $F_n m$ false negatives; these false negatives can be stored in the Bloom filter using $bm$ bits and the corresponding values recovered by the backup Bloomier filter. The keys not hit by this Bloom filter, which we refer to as the keys passed by this Bloom filter, may now include false positives for our learned Bloomier filter. A key not in $\mathcal{K}$ will yield a false positive here with probability $F_p(1-\alpha^{b/F_n})$; that is, the key must have been a false positive for the learned oracle, but must have not been a hit on the Bloom filter. Also, note that some of the $(1-F_n)m$ keys from $\mathcal{K}$ that obtained the correct value from $f$ may hit the Bloom filter, and therefore will also have to have their values provided by the backup Bloomier filter. Of these keys from $\mathcal{K}$, a fraction $\alpha^{b/F_n}$ are expected to be false positives in the Bloom filter; as we have throughout the paper, we will use the expectation, keeping in mind the true result will be concentrated around this expectation. Hence, in total, we need the backup Bloomier filter for $\left(F_n + (1-F_n)\alpha^{b/F_n}\right)m$ keys. Suppose we use $c(j+r')\left(F_n + (1-F_n)\alpha^{b/F_n}\right)m$ bits for the backup Bloomier filter, for a chosen parameter $r'$ that gives the backup a false positive probability of $2^{-r'}$. Then our total space is

$$\zeta + bm + c(j+r')\left(F_n + (1-F_n)\alpha^{b/F_n}\right)m,$$

and our overall false positive probability is

$$F_p\left(1-\alpha^{b/F_n}\right) + \alpha^{b/F_n}\,2^{-r'},$$

where the first term is from false positives from the oracle that passed the Bloom filter, and the second term is from queries that hit the Bloom filter and give false positives in the backup Bloomier filter.
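
To get a feel for these quantities, the two expressions can be evaluated numerically. The sketch below assumes the model in this form: a standard Bloomier filter uses $c(j+r)m$ bits with false positive probability $2^{-r}$; the learned variant uses $\zeta + bm + c(j+r')(F_n + (1-F_n)\beta)m$ bits with false positive probability $F_p(1-\beta) + \beta\,2^{-r'}$, where $\beta = \alpha^{b/F_n}$ is the Bloom filter's false positive probability. The function names and the value $\alpha \approx 0.6185$ (a standard Bloom filter) are assumptions of this sketch.

```python
import math
from typing import Tuple

# Assumed per-bit base for a standard Bloom filter: 2**(-ln 2), about 0.6185.
ALPHA = 0.5 ** math.log(2)


def standard_bloomier(m: int, j: int, r: int, c: float) -> Tuple[float, float]:
    """(bits, false positive probability) for a standard Bloomier filter."""
    return c * (j + r) * m, 2.0 ** -r


def learned_bloomier(m: int, j: int, r_backup: int, c: float, zeta: float,
                     f_p: float, f_n: float, b: float) -> Tuple[float, float]:
    """(bits, false positive probability) for the learned chain, in expectation.

    zeta: size of the learned function in bits
    f_p, f_n: oracle false positive probability and false negative fraction
    b: Bloom filter bits per key of the full set (b*m bits in total)
    r_backup: the parameter r' of the backup Bloomier filter
    """
    # The Bloom filter stores only f_n * m keys, so it has b / f_n bits per stored key.
    beta = ALPHA ** (b / f_n) if f_n > 0 else 0.0
    backup_keys = (f_n + (1.0 - f_n) * beta) * m
    bits = zeta + b * m + c * (j + r_backup) * backup_keys
    fp = f_p * (1.0 - beta) + beta * 2.0 ** -r_backup
    return bits, fp


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    print(standard_bloomier(m=100_000, j=16, r=8, c=1.25))
    print(learned_bloomier(m=100_000, j=16, r_backup=8, c=1.25,
                           zeta=500_000, f_p=0.002, f_n=0.1, b=1.0))
```

Varying `f_n` and `zeta` in such a calculation shows directly how much of the per-key cost $c(j+r)$ a learned function must save to pay for itself.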

Again, these expressions are somewhat unwieldy because of the number of parameters. At a high level, however, these expressions reinforce helpful intuition. The cost per element in a Bloomier filter is rather high, because the value must be stored. Therefore if the fraction of false negatives, as given by $F_n$, can be driven down to a reasonable value with a small enough learned function, there should be space available to pay for the bits of the learned function as well as the additional Bloom filter.
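
One way to make this intuition concrete is to ask when the learned structure is smaller than a standard Bloomier filter. Assume the learned structure uses $\zeta + bm + c(j+r')\left(F_n + (1-F_n)\alpha^{b/F_n}\right)m$ bits and a standard Bloomier filter uses $c(j+r)m$ bits, with $r$ and $r'$ each chosen to meet the same target false positive probability. Rearranging gives a budget for the learned function in bits per key:

$$\frac{\zeta}{m} < c(j+r) - b - c(j+r')\left(F_n + (1-F_n)\alpha^{b/F_n}\right).$$

When $F_n$ is small, $\alpha^{b/F_n}$ is very small as well, so the last term is roughly $c(j+r')F_n$ and the budget approaches $c(j+r) - b$ bits per key.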