Indexing Cost-Sensitive Prediction
Abstract
Predictive models are often used for real-time decision making. However, typical machine learning techniques ignore feature evaluation cost and focus solely on the accuracy of the models obtained using all the features available. We develop algorithms and indexes to support cost-sensitive prediction, i.e., making decisions using machine learning models while taking feature evaluation cost into account. Given an item and an online computation cost (i.e., time) budget, we present two approaches that return an appropriately chosen machine learning model that will run within the specified time on the given item. The first approach returns the optimal machine learning model, i.e., the one with the highest accuracy that runs within the specified time, but requires significant upfront precomputation time. The second approach returns a possibly suboptimal machine learning model, but requires little upfront precomputation time. We study these two algorithms in detail and characterize the scenarios (using real and synthetic data) in which each performs well. Unlike prior work that focuses on a narrow domain or a specific algorithm, our techniques are very general: they apply to any cost-sensitive prediction scenario and any machine learning algorithm.
Predictive models are ubiquitous in real-world applications: ad networks predict which ad the user will most likely click on based on the user's web history, Netflix uses a user's viewing and voting history to pick movies to recommend, and content moderation services decide if an uploaded image is appropriate for young children. In these applications, the predictive model needs to process the input data and make a prediction within a bounded amount of time, or risk losing user engagement or revenue [?, ?, ?].
Unfortunately, traditional feature-based classifiers take a one-model-fits-all approach when placed in production, behaving the same way regardless of input size or time budget. From the classifier's perspective, the features used to represent an input item have already been computed, and this computation process is external to the core task of classification. This approach isolates and simplifies the core task of machine learning, allowing theorists to focus on concerns like quick learning convergence and accuracy. But it leaves out aspects of production systems that can be just as important as accuracy, such as tunable prediction speed.
In reality, the cost of computing features can easily dominate prediction time. For example, a content moderation application may use a support-vector machine (SVM) to detect inappropriate images. At runtime, an SVM need only perform a single dot product between a feature vector and a precomputed weight vector. But computing the feature vector may require several scans of an image, which may take longer than the allotted time budget.
If feature computation is the dominating cost factor, one might intuitively accommodate time constraints by computing and using only a subset of the available features. But selecting which subset to use at runtime, and guaranteeing that a model is available that was trained on that subset, is made challenging by a number of factors. Features vary in predictive power, e.g., skin tone colors might more accurately predict inappropriate images than image size. They also vary in prediction cost, e.g., looking up the image size is much faster than computing a color histogram over an entire image. This cost also varies with respect to input size: the size feature may take constant time, since it is stored in metadata, while the histogram cost grows with the number of pixels. Finally, for any n features, 2^n distinct subsets are possible, each with its own aggregate predictive power and cost, and each potentially requiring its own custom training run. As the number of potential features grows large, training a model for every possible subset is clearly prohibitively costly. All of these reasons highlight why deploying real-time prediction, while extremely important, is a particularly challenging problem.
Existing strategies for approaching this problem (see Section Indexing Cost Sensitive Prediction) tend to be either tightly coupled to a particular prediction task or to a particular mathematical model. While these approaches work for a particular problem, they are narrow in their applicability: if the domain (features) changes or the machine learning model is swapped for a new one (e.g., SVM for AdaBoost), the approach will no longer work.
In this paper, we develop a framework for cost-sensitive real-time classification as a wrapper over "off-the-shelf" feature-based classifiers. That is, given an item that needs to be classified or categorized in real time and a cost (i.e., time) budget for feature evaluation, our goal is to identify features to compute in real time that are within the budget, identify the appropriate machine learning model that has been learned in advance, and apply the model to the extracted features.
We take a systems approach by decoupling the problem of costsensitive prediction from the problem of machine learning in general. We present an algorithm for cost sensitive prediction that operates on any featurebased machine learning algorithm as a black box. The few assumptions it makes reasonably transfer between different feature sets and algorithms (and are justified herein). This decoupled approach is attractive for the same reason that machine learning literature did not originally address such problems: it segments reasoning about the core tasks of learning and prediction from system concerns about operationalizing and scaling. Additionally, encapsulating the details of classification as we do ensures advances in machine learning algorithms and feature engineering can be integrated without change to the costsensitivity apparatus.
Thus, our focus in this paper is on the systems issues underlying this wrapper-based approach, i.e., on intelligent indexing and pruning techniques to enable rapid online decision making, and not on the machine learning algorithms themselves. Our contribution is two approaches to this problem, as well as new techniques to mitigate the challenges of each approach. These two approaches represent two ends of a continuum of approaches to model-agnostic cost sensitivity:

Our PolyDom approach yields optimal solutions but requires significant offline precomputation,

Our Greedy approach yields relatively good solutions but does not require significant offline precomputation.
First, consider the Greedy approach: Greedy, and its two sub-variants GreedyAcc and GreedyCost (described in Section Indexing Cost Sensitive Prediction), are simple but effective techniques adapted from prior work by Xu et al. [?], wherein the technique only applied to a subclass of SVMs [?]. Here, we generalize the techniques to apply to any machine learning classification algorithm as a black box. Greedy is a "quick and dirty" technique that requires little precomputation, storage, and retrieval, and works well in many settings.
Then, consider our PolyDom approach, which is necessary whenever accuracy is paramount, a typical scenario in critical applications like credit card fraud detection, system performance monitoring, and ad-serving systems. In this approach, we conceptually store, for each input size, a skyline of predictive models along the axes of total real-time computation cost vs. accuracy. Then, given an input of that size, we can simply pick the predictive model along the skyline that fits within the total real-time computation cost budget, and thereby get the best possible accuracy.
However, there are many difficulties in implementing this skylinebased approach:

Computing the skyline in a naive fashion requires us to compute, for all 2^|F| subsets S of features (where F is the set of all features), the best machine learning model for that set and its total real-time computation time (or cost). If the number of features is large, say in the 100s or 1000s, computing the skyline is impossible, even with an unlimited amount of offline time. How do we intelligently reduce the amount of precomputation required to find the skyline?

The skyline, once computed, will require a lot of storage. How should this skyline be stored, and what index structures should we use to allow efficient retrieval of individual models on the skyline?

Computing and storing the skyline for each input size is simply infeasible: an image, for instance, can vary between 0 and 70 billion pixels (the size of the largest photo on earth [?]); we simply cannot store or precompute this much information. What can we do in such a case?
To deal with the challenges above, PolyDom uses a dual-pronged solution, with two precomputation steps:

Feature Set Pruning: We develop a number of pruning techniques that enable us to minimize the number of feature sets for which we need to learn machine learning models. Our lattice pruning techniques are provably correct under some very reasonable assumptions, i.e., they do not discard any feature sets that could be potentially optimal under certain input conditions. We find that our pruning techniques often allow us to prune up to 90% of the feature sets.

Polydom Index: Once we gather the collection of feature sets, we develop an index structure that allows us to represent the models learned using the feature sets in such a way that enables us to perform efficient retrieval of the optimal machine learning model given constraints on cost and input size. This index structure relies on reasoning about polynomials that represent cost characteristics of feature sets as a function of input size.
Overall, our approach offers a systems perspective on an increasingly important topic in the deployment of machine learning systems. The higher-level goal is to isolate and develop the mechanisms for storage and delivery of cost-sensitive prediction without having to break the encapsulation barrier that should surround the fundamental machinery of machine learning. Our techniques could be deployed alongside existing algorithms in a variety of real-time prediction scenarios, including:

An ad system needs to balance per-user ad customization against latency on the small scale, and allocate computational resources between low-value and high-value ad viewers on the aggregate scale.

Cloud-based financial software needs to run predictive models on portfolios of dramatically different input size and value. A maximum latency on results may be required, but the best possible model for each time and input size pairing is financially advantageous.

An autopilot system in an airplane has a limited time to respond to an error. Or more broadly, system performance monitors in a variety of industrial systems have fixed time to decide whether to alert a human operator of a failure.

A mobile sensor has limited resources to decide if an error needs to be flagged and sent to the central controller.
In the rest of the paper, we will first formally present our problem variants in Section Indexing Cost Sensitive Prediction, then describe our dual-pronged PolyDom solution in Section Indexing Cost Sensitive Prediction and our "quick and dirty" Greedy solution in Section Indexing Cost Sensitive Prediction, and finally present our experiments on both synthetic and real-world datasets in Section Indexing Cost Sensitive Prediction.
We begin by describing some notation that will apply to the rest of the paper, and then we will present the formal statement of the problems that we study.
Our goal is to classify an item I (e.g., image, video, text) in real time. We assume that the size of I, denoted |I|, is represented using a single number or dimension n, e.g., the number of words in a text. Our techniques also apply when the size is represented using a vector of dimensions, for example, (length, breadth) for an image; however, for ease of exposition, we focus on the single-dimension scenario. The entire set of features we can evaluate on I is denoted F; each individual feature is denoted f, while a non-empty set of features is denoted S ⊆ F.
We assume that we have some training data, wherein every single feature is evaluated for each item. Since training is done offline, it is not unreasonable to expect that we have the ability to compute all the features on each item. In addition, we have some testing data, where once again every single feature is evaluated for each item. We use this test data to estimate the accuracy of the machine learning models we discover offline.
Cost Function: We assume that the cost of evaluating a feature f on an item I depends only on the feature being computed and the size n = |I| of the item. We denote this cost as c(f, n). We can estimate c(f, n) during preprocessing time by running the subroutine corresponding to the feature evaluation on varying input sizes. Our resulting expression for c(f, n) could either be a constant (if it takes a fixed amount of time to evaluate the feature, no matter the size) or a function of n, e.g., a polynomial in n, if the evaluation time depends on the input size.
Then, the cost of computing a set of features S on I can be computed as follows:
c(S, n) = Σ_{f ∈ S} c(f, n)    (1)
We assume that each feature is computed independently, in sequential order. Although there may be cases where multiple features can be computed together (e.g., multiple features can simultaneously share scans over an image), we expect that users provide the features as "black-box" subroutines, and we do not want to place an additional burden on users by asking them to provide subroutines for combinations of features as well. That said, our techniques apply equally well when the cost model is more general than Equation 1, or when users have provided subroutines for generating multiple feature values simultaneously (e.g., extracting a word frequency vector from a text document).
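As a concrete illustration, the additive cost model of Equation 1 can be sketched as follows; the feature names and polynomial cost coefficients below are hypothetical stand-ins for profiled feature subroutines.

```python
# Sketch of the additive cost model (Equation 1), assuming each feature's
# cost is a polynomial in the input size n estimated during preprocessing.

def feature_cost(coeffs, n):
    """Evaluate a cost polynomial c(n) = coeffs[0] + coeffs[1]*n + ... at size n."""
    return sum(c * n ** i for i, c in enumerate(coeffs))

def set_cost(feature_coeffs, n):
    """Cost of a feature set: features are computed independently, so costs add."""
    return sum(feature_cost(coeffs, n) for coeffs in feature_coeffs.values())

# Hypothetical features with profiled cost polynomials:
costs = {
    "image_size": [1.0],             # constant: read from metadata
    "histogram":  [5.0, 0.0, 0.2],   # grows quadratically: full scan of the image
}
print(set_cost(costs, 10))  # 1.0 + (5.0 + 0.2 * 100) = 26.0
```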
Accuracy Function: We model the machine learning algorithm (e.g., SVM, decision tree, naive bayes) as a black-box function supplied by the user. This algorithm takes as input the entire training data as well as a set of features S, and outputs the best model M_S learned using the set of features S. We denote the accuracy of M_S on the testing data as a(S), possibly estimated using k-fold cross-validation. We assume that the training data is representative of the items classified online (as is typical), so that the accuracy estimate still holds online.
Note that we are implicitly assuming that the accuracy of the classification model only depends on the set of features inferred during test time, and not on the size of the item. This assumption is typically true in practice: whether or not an image needs to be flagged for moderation is independent of the size of the image.
Characterizing a Feature Set: Since we will often be dealing with sets of features at a time, we now describe what we mean by characterizing a feature set S. Overall, given S, as discussed above, we have a black box that returns

a machine learning model learned on some or all the features in S, represented as M_S.

a(S), i.e., an estimate of the accuracy of the model on the test data.
In addition, we can estimate c_S(n), i.e., the cost of extracting the features needed to apply the model at test time, as a function of the size n of the item. Note that unlike the last two quantities, this quantity will be expressed in symbolic form: for example, c_S(n) could be a polynomial expression in n.
Characterizing a feature set S thus involves learning all three quantities above: (M_S, a(S), c_S(n)). For the rest of the paper, we will operate on feature sets, implicitly assuming that a feature set is characterized by the best machine learning model for that feature set, an accuracy value for that model, and a cost function.
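To make the characterization step concrete, here is a minimal sketch. The learner is a stand-in (a majority-class predictor) for whatever black-box algorithm the user supplies, and the feature names, data, and per-feature cost functions are all hypothetical.

```python
# Sketch of characterizing a feature set S: (model M_S, accuracy a(S), cost c_S(n)).

def learn(train, S):
    """Black-box stand-in: predicts the majority label of the training data."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda item: majority

def accuracy(model, test):
    return sum(model(x) == y for x, y in test) / len(test)

def characterize(train, test, S, cost_fns):
    model = learn(train, S)                            # M_S
    acc = accuracy(model, test)                        # a(S)
    cost = lambda n: sum(cost_fns[f](n) for f in S)    # symbolic c_S(n)
    return model, acc, cost

train = [({"f1": 0}, 1), ({"f1": 1}, 1), ({"f1": 2}, 0)]
test = [({"f1": 0}, 1), ({"f1": 1}, 0)]
cost_fns = {"f1": lambda n: 2 * n}  # hypothetical per-feature cost
model, acc, cost = characterize(train, test, {"f1"}, cost_fns)
print(acc, cost(5))  # 0.5 10
```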
Problem Statements: The most general version of the problem is when neither C (i.e., the cost or time constraint) nor the size of an item is provided to us in advance:
Problem 1 (ProbGeneral)
Given F at preprocessing time, compute classification models and indexes such that the following task can be completed in real time:

Given an item I and a cost constraint C at real time, identify a set S such that c(S, |I|) ≤ C and a(S) is within Δ of the best accuracy achievable under the constraint, and return M_S(I).
That is, our goal is to build classification models and indexes such that, given a new item at real time, we select a feature set and the corresponding machine learning model that both obeys the cost constraint and is within Δ of the best accuracy among all feature sets that obey the cost constraint. The reason we allow a tolerance Δ is that, in contrast to a hard cost constraint, a slightly lower accuracy is often acceptable as long as the amount of computation required for computing, storing, and retrieving the appropriate models is manageable. We will consider Δ = 0, i.e., the absolute best accuracy, as a special case; however, for most of the paper, we will consider the more general variants.
There are two special cases of the general problem that we will consider. The first special case considers the scenario when the input size is provided upfront, e.g., when Yelp fixes the size of profile images uploaded to the website that need to be moderated.
Problem 2 (ProbFixedSize)
Given F and a fixed item size n at preprocessing time, compute classification models and indexes such that the following task can be completed in real time:

Given an item I of size n and a cost constraint C at real time, identify the set S such that c(S, n) ≤ C and a(S) is within Δ of the best accuracy achievable under the constraint, and return M_S(I).
We also consider the version where C is provided in advance but the item size is not, e.g., when an aircraft needs to respond to any signal within a fixed time.
Problem 3 (ProbFixedCost)
Given F and a fixed cost constraint C at preprocessing time, compute classification models and indexes such that the following task can be completed in real time:

Given an item I at real time, identify the set S such that c(S, |I|) ≤ C and a(S) is within Δ of the best accuracy achievable under the constraint, and return M_S(I).
Reusing Skyline Computation is Incorrect: We now argue that it is not sufficient to simply compute the skyline of classification models for a fixed item size and employ that skyline for all n. For the following discussion, we focus on Δ = 0; the general case is similar. Given a fixed n, we define the skyline as the set of all feature sets that are undominated in terms of cost and accuracy (or equivalently, error, which is 1 - accuracy). A feature set S is undominated if there is no feature set S' where a(S') ≥ a(S) and c(S', n) < c(S, n), and no feature set S'' where a(S'') > a(S) and c(S'', n) ≤ c(S, n). (Footnote 1: notice that the strict inequality is placed in different clauses in the two conditions.) A naive strategy is to enumerate each feature set, and characterize each by its cross-validation accuracy and average extraction cost over the training dataset. Once the feature sets are characterized, iterate through them in increasing order of cost, and keep the feature sets whose accuracy is greater than that of every feature set preceding them. The resulting set is the skyline. However, it is very expensive to enumerate and characterize all feature sets (especially when the number of features is large), and one of our key contributions will be to avoid this exhaustive enumeration. But for the purposes of discussion, let us assume that we have the skyline computed. Note that the skyline feature sets are precisely the ones we need to consider as possible solutions during real-time classification for various values of C.
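The naive skyline construction described above (sort by cost, then keep feature sets of strictly increasing accuracy) can be sketched as follows; the feature-set names, costs, and accuracies are made up for illustration.

```python
# Skyline of (cost, accuracy) pairs for a fixed input size n: sort feature sets
# by cost and keep those whose accuracy exceeds that of every cheaper set.

def skyline(feature_sets):
    """feature_sets: list of (name, cost, accuracy). Returns the undominated ones."""
    result, best_acc = [], -1.0
    for name, cost, acc in sorted(feature_sets, key=lambda t: (t[1], -t[2])):
        if acc > best_acc:  # strictly better than everything cheaper: keep it
            result.append((name, cost, acc))
            best_acc = acc
    return result

sets = [("A", 1.0, 0.60), ("B", 2.0, 0.80), ("C", 3.0, 0.75), ("D", 4.0, 0.90)]
print(skyline(sets))  # [('A', 1.0, 0.6), ('B', 2.0, 0.8), ('D', 4.0, 0.9)]
```

Note that "C" is pruned: it costs more than "B" yet is less accurate, so it is dominated at this input size.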
Then, one approach to solving Problem 1 could be to simply reuse this skyline for other values of n. However, this approach is incorrect, as depicted in Figure Indexing Cost Sensitive Prediction. The figure depicts the cost and error of each feature set and the error vs. cost skyline curve for one value of n, and then what happens when n changes to a larger value (the same holds when it changes to a smaller value): as can be seen in the figure, different feature sets move by different amounts, based on their cost functions c_S(n). This is because different polynomials behave differently as n varies: for instance, a quadratic cost is less than a linear one when n is small, but significantly larger for big values of n. As a result, a feature set that was on the skyline may no longer be on the skyline, and another one that was dominated could suddenly become part of the skyline.
Learning Algorithm Properties: We now describe a property of machine learning algorithms that we will leverage in subsequent sections. This property holds because adding features cannot take information away: even in the worst case, the additional features simply give us no new useful information that can help us in classification.
Axiom 2.1 (InformationNeverHurts)
If S1 ⊆ S2, then a(S1) ≤ a(S2).
While this property is known by the machine learning community to be anecdotally true [?], we experimentally validate it in our experiments. In fact, even if this property is violated in a few cases, our PolyDom algorithm can be made more robust by taking that into account, as we will see in Section Indexing Cost Sensitive Prediction.
Our solution follows three steps:

Feature Set Pruning: First, we start by constructing what we call a candidate set, that is, the set of all feature sets (and corresponding machine learning models) that could be solutions to Problem 1. As a side effect, we will find a solution to Problem 2. The candidate set will be a carefully constructed superset of the skyline feature sets, so that we do not discard any feature sets that could be useful for any n and C. We describe this in Section Indexing Cost Sensitive Prediction.

Polydom Index Construction: In Section Indexing Cost Sensitive Prediction, we describe a new data structure for Problem 1, called the polydominance index, which compactly represents the candidate set and allows it to be indexed into, given a specific item size and budget at query time. In particular, we would like to organize the candidate set so that it can be efficiently probed even for large candidate set sizes.

Online Retrieval: Lastly, we describe how the polydominance index is accessed during query time in Section Indexing Cost Sensitive Prediction.
We will construct the candidate set using a bidirectional search on the lattice of all subsets of features, depicted in Figure Indexing Cost Sensitive Prediction.
When a sequence of features is listed, this sequence corresponds to the feature set containing those features. In the figure, feature sets are listed along with their accuracies (listed below the feature set). (Footnote 2: we have chosen accuracy values that satisfy Axiom 2.1 from the previous section.) An edge connects two feature sets that differ in one feature. For now, ignore the markings next to the nodes; we will describe their meaning subsequently. The feature set corresponding to F is depicted at the top of the lattice, while the feature set corresponding to the empty set is depicted at the bottom. The feature sets in between have between one and |F| - 1 features. In the following, we use feature sets and nodes in the lattice interchangeably.
Bidirectional Search: At one extreme, we have the empty set ∅, and at the other extreme, we have F. We begin by learning and characterizing the best machine learning model for ∅ and for F: i.e., for each of these feature sets S, we learn the best machine learning model M_S, its accuracy a(S) (listed below the node), and the cost c_S(n). (Footnote 3: note that the cost of a model (i.e., feature set) is simply the sum of the costs of the individual features, and can reuse previously computed costs.) We call this step expanding a feature set, and a feature set thus operated on is called an expanded feature set.
At each round, we expand the feature sets in the next layer, in both directions. We stop once we have expanded all nodes. In our lattice in Figure Indexing Cost Sensitive Prediction, we expand the bottom and top layers, each consisting of one node, following which we expand the second-to-bottom layer consisting of four nodes and the second-to-top layer again consisting of four nodes, and then we finally expand the middle layer consisting of six nodes.
However, notice that the total number of nodes in the lattice is 2^|F|, and even for relatively small |F|, we simply cannot afford to expand all the nodes in the lattice. Therefore, we develop pruning conditions to avoid expanding all the nodes. Note that all the pruning conditions we develop are guaranteed to preserve correctness: we do not make approximations at any point that take away the optimality of the solution.
Dominated Feature Sets: We now define what we mean for a feature set to dominate another.
Definition 3.1 (dominance)
A feature set S1 dominates a feature set S2 if a(S1) ≥ a(S2) and c_{S1}(n) ≤ c_{S2}(n) for every input size n under consideration.
As an example from Figure Indexing Cost Sensitive Prediction, consider a pair of nodes on the right-hand extreme of the lattice, where one feature set extends the other by a single feature: the accuracies of both feature sets are the same, while the cost of the larger set is definitely higher (since an additional feature is evaluated). Here, we will always prefer the smaller set, and as a result, the larger set is dominated by the smaller one.
Overall, a feature set that is dominated is simply not under consideration for any n and C, because it is not going to be the solution to Problem 1, 2, or 3, given that the dominating feature set is a better solution. We formalize this as a theorem:
Theorem 3.2
A dominated feature set is never a solution to Problems 1, 2, or 3, for any n and C.
Given the property above, we need to find domination rules that allow us to identify and discard dominated feature sets. In particular, in our lattice, this corresponds to not expanding feature sets.
Pruning Properties: Our first property dictates that we should not expand a feature set that is strictly “sandwiched between” two other feature sets. It can be shown that any such feature set is dominated, and therefore, using Theorem 3.2, can never be a solution to any of the problems listed in the previous section.
Property 3.3 (SandwichProperty)
If S1 ⊂ S2 and a(S1) ≥ a(S2), then no S such that S1 ⊂ S ⊂ S2 needs to be expanded.
Intuitively, if a feature set S1 dominates a feature set S2 while S1 ⊂ S2, then all other feature sets between S1 and S2 are also dominated: by Axiom 2.1, each such in-between feature set has precisely the same accuracy as S1, but a higher cost, so it need not be expanded.
In Figure Indexing Cost Sensitive Prediction, let us consider how many expansions the previous property saves us while doing bidirectional search: we first expand ∅ and F, and then we expand all nodes in the second-to-top layer and the second-to-bottom layer. Then, from the middle layer, two nodes will not be expanded (using the argument from the previous paragraph), while the rest are expanded. Thus, we save two expansions. The expanded nodes in the lattice are marked in the figure.
Now, on relaxing the accuracy tolerance Δ slightly, the number of expansions goes down rapidly; the nodes expanded in this case are marked separately in the figure. Once again, nodes in the top two and bottom two layers are expanded. However, only one node in the middle layer needs to be expanded, because each of the remaining middle-layer feature sets is sandwiched between an already-expanded subset and superset.
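A minimal sketch of the sandwich test itself, assuming accuracies are stored per feature set; the sets and accuracy values below are hypothetical.

```python
# Sketch of the sandwich test (Property 3.3): a set S strictly between S1 and
# S2 in the lattice is dominated whenever a(S1) >= a(S2), since monotonicity
# pins a(S) to a(S1) while the cost of S strictly exceeds that of S1.

def sandwiched(S, S1, S2, acc):
    """True if S need not be expanded because S1 < S < S2 and a(S1) >= a(S2)."""
    return S1 < S < S2 and acc[S1] >= acc[S2]

acc = {frozenset("a"): 0.8, frozenset("abc"): 0.8}
print(sandwiched(frozenset("ab"), frozenset("a"), frozenset("abc"), acc))  # True
print(sandwiched(frozenset("ac"), frozenset("a"), frozenset("abc"), acc))  # True
```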
The previous property is hard to apply directly (e.g., before expanding every feature set, we would need to verify whether there exists a pair of feature sets that sandwich it). Next, we describe a property that specifies when it is safe to stop expanding all non-expanded ancestors of a specific node.
Property 3.4 (CoveringProperty)
If there exist feature sets S1, …, Sk such that each Si contains S, the Si together cover all the features in F, and each Si is dominated by S, then no feature set sandwiched between S and F needs to be expanded.
This property states that if a collection of feature sets 1) each contain S, 2) in aggregate covers all the features in F, and 3) are each dominated by S, then all feature sets between S and F do not need to be expanded. The inverse property, for pruning descendants, also holds.
We use this property to extend the bidirectional search with an additional pruning step. Let the top frontier be the set of feature sets expanded from the top for which no child feature set has been expanded, and let the bottom frontier be similarly defined from the bottom. By directly applying the CoveringProperty, we can prune the ancestors of a bottom-frontier feature set S whenever a collection of already-expanded feature sets contains S, covers F, and is dominated by S. We can similarly use the inverse of the property to prune feature sets in the top frontier.
Properties of Expanded Nodes: We have the following theorem, that is a straightforward consequence of Property 3.3:
Theorem 3.5
The set of expanded nodes forms a superset of the skyline nodes for any n.
In Figure Indexing Cost Sensitive Prediction, the set of expanded nodes (marked in the figure for each setting of Δ) are the ones relevant for any n.
Candidate Nodes: Given the expanded set of nodes, we use two properties to prune away some of the expanded but dominated nodes, yielding the candidate nodes. Both of these properties are straightforward consequences of the definition of dominance.
Property 3.6 (Subset PruningProperty)
If S1 ⊂ S2 and a(S1) ≥ a(S2), then S2 does not need to be retained as a candidate.
For instance, even when two feature sets are both expanded, the larger of the two need not be retained as a candidate node if the smaller one has at least the same accuracy (for any n).
The next property is a generalization of the previous, when we have a way of evaluating polynomial dominance.
Property 3.7 (PolyDom PruningProperty)
If a(S1) ≥ a(S2) and c_{S1}(n) ≤ c_{S2}(n) for all n, then S2 does not need to be retained as a candidate.
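A sketch of this polynomial-dominance test, under the assumption that cost polynomials have nonnegative coefficients, in which case a coefficientwise comparison is a sufficient condition for one cost curve lying below another at every n ≥ 0. The accuracies and coefficients below are hypothetical.

```python
# Sketch of polynomial-dominance pruning (Property 3.7). With nonnegative
# polynomial cost coefficients, coefficientwise comparison suffices to show
# c1(n) <= c2(n) for all sizes n >= 0.

def poly_leq(c1, c2):
    """True if polynomial c1 is <= c2 for every n >= 0 (coefficientwise test)."""
    length = max(len(c1), len(c2))
    pad = lambda c: list(c) + [0.0] * (length - len(c))
    return all(a <= b for a, b in zip(pad(c1), pad(c2)))

def dominates(acc1, cost1, acc2, cost2):
    """Feature set 1 dominates set 2: at least as accurate, never costlier."""
    return acc1 >= acc2 and poly_leq(cost1, cost2)

# Hypothetical candidates: (accuracy, cost polynomial coefficients)
print(dominates(0.9, [1, 2], 0.85, [3, 2, 1]))  # True: the second set is pruned
```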
The next theorem states that we have not made any incorrect decisions up to this point, i.e., the set of candidate nodes includes all the nodes that are solutions to Problems 1, 2, and 3 for all n and C.
Theorem 3.8
The set of candidate nodes forms a superset of the skyline nodes for any n.
Algorithm: The pseudocode for the algorithm can be found in the appendix split into: Algorithm 2 (wherein the lattice is traversed and the nodes are expanded) and Algorithm 1 (wherein the dominated expanded nodes are removed to give the candidate nodes).
In brief, Algorithm 2 maintains two frontiers: the boundaries of already-expanded nodes from the top and from the bottom of the lattice, respectively. Two further collections contain the next sets of nodes to be expanded. When a node is expanded, its children in the lattice are added to the next set if the node was expanded "from the top", while its parents in the lattice are added if the node was expanded "from the bottom".
Note that there may be smart data structures we could use to check if a node is sandwiched or not, or when enumerating the candidate set. Unfortunately, the main cost is dominated by the cost for expanding a node (which involves training a machine learning model given a set of features and estimating its accuracy), thus these minor improvements do not improve the complexity much.
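Setting pruning aside, the plain bidirectional order of expansion over the subset lattice can be sketched as follows; in the full algorithm, each yielded set would be expanded (trained and characterized) and the frontiers pruned as described above.

```python
from itertools import combinations

# Sketch of bidirectional layer order over the subset lattice (no pruning):
# layers are visited alternately from the bottom (small sets) and the top
# (large sets) until they meet in the middle.

def bidirectional_layers(features):
    """Yield feature sets layer by layer, in sizes 0, |F|, 1, |F|-1, ..."""
    F = list(features)
    lo, hi = 0, len(F)
    while lo <= hi:
        for S in combinations(F, lo):   # next layer from the bottom
            yield frozenset(S)
        if hi != lo:
            for S in combinations(F, hi):  # next layer from the top
                yield frozenset(S)
        lo, hi = lo + 1, hi - 1

order = [len(S) for S in bidirectional_layers("abc")]
print(order)  # [0, 3, 1, 1, 1, 2, 2, 2]
```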
Discussion: When the number of features in F is in the thousands, the lattice would be massive. In such cases, even the number of expanded nodes can be in the millions, and expanding each of these nodes can take significant time, since we would need to run our machine learning algorithm on each node (i.e., feature set). In such a scenario, we have two alternatives: (a) we apply a feature selection algorithm [?] that brings the number of features under consideration down to a smaller, more manageable number, or (b) we apply our pruning algorithm in a progressive modality. In this modality, a user provides a precomputation cost budget, and the pruning algorithm picks the best Δ to meet that budget (i.e., the smallest possible Δ for which we can apply the lattice pruning algorithm within the precomputation cost budget). The approach is the following: we start with a large Δ (say 2), and run the lattice pruning algorithm. Once complete, we reduce Δ by a small amount, rerun the lattice pruning algorithm, and so on, until we run out of the precomputation cost budget. We can make use of the following property:
Property 3.9
The nodes expanded for Δ1 are a superset of the nodes expanded for Δ2, whenever Δ1 ≤ Δ2.
Thus, no work that we do for larger Δs is wasted for smaller Δs. As a result, directly using the Δ that is best for the precomputation cost budget would be equivalent to the above procedure, since the procedure expands no more nodes than necessary.
AntiMonotonicity: Note that there may be practical scenarios where the assumption of monotonicity, i.e., Axiom 2.1, does not hold, but instead, a relaxed version of monotonicity holds, that is,
Axiom 3.10 (InformationNeverHurtsRelaxed)
If S ⊆ S', then a(S) ≤ a(S') + ε.
Here, if S is a subset of S', then a(S) cannot be larger than a(S') + ε. Intuitively, the violations of monotonicity, if any, are small: smaller than ε (we call ε the error in the monotonicity). Note that when ε = 0, we have Axiom 2.1 once again.
In such a scenario, only the lattice construction procedure is modified by ensuring that we do not prematurely prune away nodes (say, using the sandwich property) that can still be optimal. We use the following modified sandwich property:
Property 3.11 (SandwichPropertyRelaxed)
If S1 ⊆ S2, and a(S2) − a(S1) ≤ Δ − 2ε, then no S such that S1 ⊂ S ⊂ S2 needs to be expanded.
With the above property, we have a more stringent condition: the gap a(S2) − a(S1) must now be at most Δ − 2ε rather than simply at most Δ. As a result, fewer pairs of nodes qualify, and consequently fewer nodes are pruned without expansion.
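As a minimal sketch, the relaxed sandwich test might look as follows; the gap Δ − 2ε is a plausible reading of the elided condition (each of the two estimated accuracies can be off by up to ε), and the function name is hypothetical.

```python
def can_prune_between(acc_s1, acc_s2, delta, eps):
    """Relaxed sandwich test: given S1 subset of S2 with estimated accuracies
    acc_s1 <= acc_s2, return True iff every node strictly between them may be
    skipped despite monotonicity violations bounded by eps."""
    return acc_s2 - acc_s1 <= delta - 2 * eps
```

With eps = 0 this reduces to the original (unrelaxed) sandwich condition.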
Subsequently, when deciding whether to remove some of the expanded nodes to give candidate nodes, we have the true accuracies of the expanded nodes, so we no longer need to worry about violations of monotonicity.
We begin by collecting the set of candidate nodes from the previous step. We now describe how to construct the polydom index over these candidate nodes.
Alternate Visualization: Consider an alternate way of visualizing the set of candidate nodes, depicted in Figure Indexing Cost Sensitive Prediction(left). Here, we depict the cost of each candidate node as a function of the item size s. Each cost curve is also labeled with its accuracy; recall that unlike cost, accuracy stays constant independent of the input size. We call the curves corresponding to the candidate nodes candidate curves. The figure depicts four candidate curves, corresponding to feature sets F1 through F4, and five intersection points where the candidate curves cross each other, denoted in ascending order of s as i1 through i5. It is easy to see that the following holds:
Lemma 3.12
Between consecutive intersection points, the relative cost order of the candidate curves, and hence the skyline, does not change. For example, for s between two consecutive intersection points in the figure, suppose F1 has accuracy 0.8, F2 has accuracy 0.65, F3 has accuracy 0.76, and F4 has accuracy 0.7. The skyline of these four candidate sets in this range is {F1, F3, F4}; F2 is dominated by F3 and F4, both of which have lower cost and higher accuracy.
The lemma above captures the obvious fact that the relationships between candidate curves (and therefore nodes) do not change between the intersection points, and therefore we only need to record what changes at each intersection point. Unfortunately, with m candidate curves, there can be as many as O(m²) intersection points.
Thus, we have a naive approach to compute the index that allows us to retrieve the optimal candidate curve for each value of :

for each range between consecutive intersection points, we compute the skyline of candidate nodes, and maintain it ordered on cost;

when the item size and cost budget are provided at query time, we perform a binary search to identify the appropriate range for the item size, and a second binary search within that range to identify the candidate node that respects the condition on cost.
Our goal, next, is to identify ways to reduce the number of intersection points, so that we do not need to index and maintain the skyline of candidate nodes at every intersection point.
Traversing Intersection Points: Our approach is the following. We start at s = 0, and order the curves at that point in terms of cost; we maintain the set of curves in this ordered fashion throughout. Note that the first intersection point after the origin (i.e., the one with smallest s) has to be an intersection of two curves that are adjacent in the ordering at s = 0. (To see this, if two non-adjacent curves intersected first, then at least one of them would have had to intersect with an adjacent curve before that.) So, we compute the intersection points for all pairs of adjacent curves and maintain them in a priority queue (there are at most m − 1 such intersection points for m curves).
We pop out the smallest intersection point from this priority queue. If the intersection point satisfies certain conditions (described below), then we store the skyline for that intersection point. We call such an intersection point an interesting intersection point. If the point is not interesting, we do not need to store the skyline for that point. Either way, when we have finished processing this intersection point, we do the following: we first remove the intersection point from the priority queue. We then add two more intersection points to the priority queue, corresponding to the intersection points with the new neighbors of the two curves that intersected with each other. Subsequently, we may exploit the property that the next intersection point has to be one from the priority queue of intersection points of adjacent curves. We once again pop the next intersection point from the priority queue and the process continues.
The pseudocode for our procedure is listed in Algorithm 3 in the appendix. An array records the candidate curves sorted on cost, while the priority queue contains the intersection points of all currently adjacent curves. As long as the priority queue is not empty, we keep popping intersection points from it, updating the array so that the cost ordering stays correct, and adding the point to the list of skyline recomputation points if it is an interesting intersection point. Lastly, we add the two new intersection points of the curves that intersected at the current point.
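The adjacent-pair sweep can be sketched as follows for the special case of linear cost curves (the paper's curves are low-degree polynomials, and Algorithm 3 itself is in the appendix); all names here are hypothetical.

```python
import heapq

def sweep_intersections(curves, s_max):
    """Enumerate, in increasing item size s, the intersection points of the
    cost curves c(s) = b + m*s, inspecting only adjacent pairs in the current
    cost order. Each curve is a (b, m) pair; returns (s, i, j) triples."""
    def cost(c, s):
        return c[0] + c[1] * s

    def meet(c1, c2):
        if c1[1] == c2[1]:
            return None
        s = (c2[0] - c1[0]) / (c1[1] - c2[1])
        return s if 0 < s <= s_max else None

    # `order` plays the role of the sorted array; pq holds adjacent intersections
    order = sorted(range(len(curves)), key=lambda i: cost(curves[i], 1e-9))
    pq = []

    def push(pos):  # queue the intersection of the curves at pos and pos+1
        s = meet(curves[order[pos]], curves[order[pos + 1]])
        if s is not None:
            heapq.heappush(pq, (s, order[pos], order[pos + 1]))

    for p in range(len(order) - 1):
        push(p)

    points, seen = [], set()
    while pq:
        s, i, j = heapq.heappop(pq)
        if (min(i, j), max(i, j)) in seen:
            continue                          # two lines meet at most once
        p = order.index(i)
        if p + 1 >= len(order) or order[p + 1] != j:
            continue                          # stale entry: no longer adjacent
        order[p], order[p + 1] = j, i         # the two curves swap cost order
        seen.add((min(i, j), max(i, j)))
        points.append((s, i, j))
        if p > 0:
            push(p - 1)                       # intersections with new neighbors
        if p + 2 < len(order):
            push(p + 1)
    return points
```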
Pruning Intersection Points: Given a candidate intersection point, we need to determine if we need to store the skyline for that intersection point. We now describe two mechanisms we use to prune away “uninteresting” intersection points. First, we have the following theorem, which uses Figure Indexing Cost Sensitive Prediction:
Theorem 3.13
We assume that no two candidate nodes have the same accuracy. The only intersection points (depicted in Figure Indexing Cost Sensitive Prediction) where we need to recompute the skyline are the following:

Scenario 1: Curves 1 and 2 are both on the skyline, with Curve 1 the cheaper (and hence less accurate) of the two before the intersection. In this case, the skyline definitely changes, and therefore the point is an interesting intersection point.

Scenario 2: Curve 1 is not on the skyline while Curve 2 is, and Curve 1 becomes the cheaper of the two after the intersection. Here, we have two cases: if a(1) > a(2), then the skyline definitely changes, and if a(1) < a(2), then the skyline changes iff there is no curve below (i.e., cheaper than) Curve 2 whose accuracy is greater than a(1).
As an example of how we can use the above theorem, consider Figure Indexing Cost Sensitive Prediction, specifically the first intersection point, where two curves intersect and the lower curve (based on cost) before the intersection has the lower accuracy. Based on Theorem 3.13, after the intersection that curve is both more expensive and less accurate, so it becomes dominated, and as a result, the skyline changes. Thus, this intersection point is indeed interesting.
Next, consider an intersection point where the two intersecting curves are both off the skyline, each dominated by some other curve with lower cost and higher accuracy. Since neither of these curves is on the skyline, based on Theorem 3.13, we do not need to recompute the skyline for this point.
Recall that we did not use the approximation threshold Δ at all in this step. Since we already used Δ to prune candidate nodes in the first step, we do not use it again to prune potential curves or intersection points, since that may lead to incorrect results. In our experience, the lattice pruning step is more time-consuming (since we need to train a machine learning model for each expanded node), so it is more beneficial to use Δ in that step. We leave determining how to best integrate Δ into this step as future work. Finally, the user can easily integrate domain knowledge, such as the distribution of item sizes, into the algorithm to further avoid computing and indexing intermediate intersection points.
Determining the Skyline: As we are determining the set of interesting intersection points, it is also easy to determine and maintain the skyline for each of these points. We simply walk up the list of candidate nodes at that point, sorted by cost, and keep all points that have not been dominated by previous points. (We keep track of the highest accuracy seen so far.)
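This skyline walk can be sketched as follows, assuming candidate nodes are given as (cost, accuracy, node) triples evaluated within the current range:

```python
def skyline(candidates):
    """Walk the candidate nodes in increasing cost order, keeping a node only
    if its accuracy exceeds the highest accuracy seen so far (i.e., it is not
    dominated by a cheaper node)."""
    result, best_acc = [], float("-inf")
    for cost, acc, node in sorted(candidates):
        if acc > best_acc:
            result.append((cost, acc, node))
            best_acc = acc
    return result
```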
Index Construction: Thus, our polydom indexing structure is effectively a two-dimensional sorted array in which we store the skyline for different ranges of the item size. We have, first, a sorted array corresponding to the sizes of the input; attached to each of its locations is an array containing the skyline of candidate curves.
For the intersection curve depicted in Figure Indexing Cost Sensitive Prediction, the index that we construct is depicted in Figure Indexing Cost Sensitive Prediction.
Given an instance of Problem 1 at real-time classification time, finding the optimal model is simple and involves two steps. We describe these steps as they relate to Figure Indexing Cost Sensitive Prediction.

We perform a binary search on the ranges, i.e., horizontally on the bottommost array, to identify the range within which the input item size lies.

We then perform a binary search on the candidate nodes, i.e., vertically within the range identified in the previous step, to find the candidate node with the largest cost that is still less than the target cost budget. Note that we can perform binary search because this array is sorted in terms of cost. We then return the model corresponding to that candidate node.
Thus, the complexity of searching for the optimal machine learning model is O(log p + log q), where p is the number of interesting intersection points and q is the number of candidate nodes on the skyline.
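The two-step lookup can be sketched as follows, assuming the index stores the sorted interesting intersection points alongside, for each range, that range's skyline as (cost function, model) pairs in increasing cost order; within a range the cost order of the curves is fixed, so binary search on the cost evaluated at s is valid. Names are ours.

```python
import bisect

def lookup(points, skylines, s, budget):
    """Retrieve the most accurate model for item size s under a cost budget:
    one binary search over the range boundaries, then one over the range's
    cost-ordered skyline (on the skyline, higher cost implies higher
    accuracy, so the costliest affordable entry is the best)."""
    r = bisect.bisect_right(points, s) - 1     # range containing s
    sky = skylines[r]
    lo, hi = 0, len(sky)
    while lo < hi:                             # rightmost entry with cost <= budget
        mid = (lo + hi) // 2
        if sky[mid][0](s) <= budget:
            lo = mid + 1
        else:
            hi = mid
    return sky[lo - 1][1] if lo else None      # None: nothing fits the budget
```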
The second solution we propose, called Greedy, is a simple adaptation of the technique from Xu et al. [?]. Note that Xu et al.'s technique does not apply to generic machine learning algorithms, and only works with a specific class of SVMs; hence we had to adapt it to apply to any machine learning algorithm as a black box. Further, this algorithm (depicted in Algorithm 4 in the appendix) only works with a single size; multiple sizes are handled as described subsequently. For now, we assume that the following procedure is performed with the median size of items.
Expansion: Offline, the algorithm works as follows. For each value of the tradeoff parameter α in a predetermined range, the algorithm repeatedly adds, one feature at a time, the feature that most improves the function
gain = (increase in accuracy) − α × (increase in cost)
This is done by considering each candidate feature in turn, expanding the new set of features, and estimating its accuracy and cost. (The latter is a number rather than a polynomial; recall that the procedure works on a single item size.) Once the best feature is added, the corresponding machine learning model for that set of features is recorded. This procedure is repeated until all the features are added, and then repeated for different values of α. The intuition here is that α dictates the priority order in which features are added: a large α means a higher preference for low cost, and a smaller α means a higher preference for accuracy. Overall, these sequences (one per α) correspond to a number of depth-first explorations of the lattice, as opposed to PolyDom, which explores the lattice in a breadth-first manner.
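The expansion loop for a single tradeoff value can be sketched as follows; `alpha` stands in for the elided tradeoff symbol, and `accuracy`/`cost` are black-box estimates at the median item size (all names are ours).

```python
def greedy_sequence(features, accuracy, cost, alpha):
    """Build one Greedy expansion sequence: repeatedly add the feature
    maximizing gain = (increase in accuracy) - alpha * (increase in cost),
    recording the feature set (hence the model) after each addition."""
    current, seq = frozenset(), []
    while len(current) < len(features):
        def gain(f):
            s = current | {f}
            return ((accuracy(s) - accuracy(current))
                    - alpha * (cost(s) - cost(current)))
        best = max((f for f in features if f not in current), key=gain)
        current = current | {best}
        seq.append(current)
    return seq
```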
Indexing: We maintain two indexes: one which keeps track of the sequence of feature sets expanded for each α, and one which keeps the skyline of accuracy vs. cost over the entire set of expanded feature sets. The latter is sorted by cost. Notice that since we focus on a single size, for the latter index we do not need to worry about cost functions; we simply use the cost values at that size.
Retrieval: Online, when an item is provided, the algorithm performs a binary search on the skyline and picks the best feature set that fits within the cost budget. Then, we look up the α corresponding to that model, and add features starting from the first feature in that α's sequence, computing one feature at a time, until the cost budget is exhausted for the given item. Note that we may end up having evaluated more or fewer features than the feature set we started from, because the item need not be of the same size as the item size used for offline indexing. Even if the size is different, since we have the entire sequence of expanded feature sets recorded for each α, we can, when the item is larger, stop at a subset of the features and still reach a feature set (and therefore a machine learning model) that is good, or, when the item is smaller, continue to a superset and reach an even better model.
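The anytime retrieval walk might look like the following sketch, where `extract_cost` stands in for actually extracting a feature on the given item (its return value is the actual, possibly mispredicted, cost); names are hypothetical.

```python
def anytime_retrieve(sequence, extract_cost, budget):
    """Walk one alpha-sequence feature by feature, stopping when the next
    extraction would exceed the remaining budget; return the largest fully
    evaluated feature set (and hence the corresponding model)."""
    spent, prev, chosen = 0.0, frozenset(), None
    for fs in sequence:                 # each step adds exactly one feature
        (f,) = fs - prev
        c = extract_cost(f)
        if spent + c > budget:
            break
        spent, prev, chosen = spent + c, fs, fs
    return chosen
```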
Comparison: This algorithm has some advantages compared to PolyDom:

The number of models expanded is simply proportional to the number of α values times the number of features, unlike PolyDom, whose number of expanded models could grow exponentially in the number of features in the worst case.

Given that the number of models stored is small (again proportional to the number of α values times the number of features), the lookup can be simple and yet effective.

The algorithm is anytime; if, for whatever reason, a feature evaluation cost is not as predicted, it can still terminate early with a good model, or terminate later with an even better model.
Greedy also has some disadvantages compared to PolyDom:

It does not provide any guarantees of optimality.

Often, Greedy returns models that are worse than those returned by PolyDom. Thus, in cases where accuracy is crucial, we need to use PolyDom.

The set of α values we iterate over requires hand-tuning, and may not be easy to set; our results can be very sensitive to this choice.

Since Greedy uses a fixed size, it may not perform as well for items that are of a very different size.
We will study the advantages and disadvantages in our experiments.
Special Cases: We now describe two special cases of Greedy that merit attention: we will consider these algorithms in our experiments as well.

GreedyAcc: This algorithm is simply Greedy with α = 0; that is, it adds, one at a time, the feature that adds the most accuracy, with no regard to cost.

GreedyCost: This algorithm is simply Greedy with α → ∞; that is, it adds, one at a time, the feature with the smallest cost at the median size.
Note that these two algorithms remove one of the disadvantages of Greedy, namely specifying a suitable set of α values.
Online prediction depends on two separate phases—an offline phase to precompute machine learning models and data structures, and an online phase to make the most accurate prediction within a time budget. To this end, the goals of our evaluation are threefold: First, we study how effectively the PolyDom and Greedybased algorithms can prune the feature set lattice and thus reduce the number of models that need to be trained and indexed during offline precomputation. Second, we study how these algorithms affect the latency and accuracy of the models that are retrieved online. Lastly, we verify the extent to which our antimonotonicity assumption holds in realworld datasets.
To this end, we first run extensive simulations to understand the regimes in which each algorithm performs well (Section Indexing Cost Sensitive Prediction). We then evaluate how our algorithms perform on a real-world image classification task (Section \thefigure), and empirically study anti-monotonicity on that task (Section Indexing Cost Sensitive Prediction).
Metrics: We study multiple metrics in our experiments:

Offline Feature Set Expansion (Metric 1): Here, we measure the number of feature sets “expanded” by our algorithms, which represents the amount of training required by our algorithm.

Offline Index Size (Metric 2): Here, we measure the total size of the index necessary to look up the appropriate machine learning model given a new item and budget constraints.

Online Index lookup time (Metric 3): Here, we measure the amount of time taken to consult the index.

Online Accuracy (Metric 4): Here, we measure the accuracy of the algorithm on classifying items from the test set.
In the offline case, the reason we study these two metrics (1 and 2) separately is that in practice we may be bottlenecked in some cases by the machine learning algorithm (i.e., the first metric is more important), and in other cases by how many machine learning models we can store on a parameter server (i.e., the second metric is more important). The reasoning behind the two online metrics is similar.
Our Algorithms: We compare the following algorithms, which we have either developed or adapted from prior work:

PolyDom: The optimal algorithm, which requires more storage and precomputation.

Greedy: The algorithm adapted from Xu et al. [?], which requires less storage and precomputation than PolyDom, but may expand a suboptimal set of feature sets that results in lower accuracy when item sizes change.

GreedyAcc: This algorithm involves a single Greedy sequence (as opposed to multiple sequences), prioritizing the features that contribute most to accuracy.

GreedyCost: This algorithm involves a single Greedy sequence, prioritizing the features that have the least cost.
Comparison Points: We developed variations of our algorithms to serve as baseline comparisons for the lattice exploration and indexing portions of an online prediction task:

Lattice Exploration: NaiveExpandAll expands the complete feature set lattice.

Indexing: While PolyDom only indexes points where the dominance relationship changes, PolyDomIndexAll indexes every intersection point between all pairs of candidate feature sets. Alternatively, NaiveLookup does not create an index and instead scans all candidate feature sets online.
Our synthetic experiments explore how feature extraction costs, individual feature accuracies, interactions between feature accuracies, and item size variance affect our algorithms along each of our four metrics.
Synthetic Prediction: Our first setup uses a synthetic dataset whose item sizes vary between and . To explore the impact of nonconstant feature extraction costs, we use a training set whose sizes are all , and vary the item sizes in the test dataset. For the feature sets, we vary four key parameters:

Number of features : We vary the number of features in our experiments from 1 to 15. The default value in our experiments is .

Feature extraction cost: We randomly assign the cost functions to ensure a high degree of intersection. Each function is a polynomial of the form where the coefficients are picked as follows: , , . (The higher-order coefficients are typically smaller because they multiply higher powers of the item size.) We expect typical cost functions to be bounded by a small degree, and found this to be consistent with the cost functions from the real-world task. Note that PolyDom is insensitive to the exact cost functions; it depends only on the intersection points.

Single feature accuracy : Each feature’s accuracy is sampled to be either helpful with probability or not helpful with probability . If a feature is helpful, then its accuracy is sampled uniformly from within and within if it is not.

Feature interactions : We control how the accuracy of a feature set depends on the accuracy of its individual features using a parameterized combiner function:
where the combination is over the top-k most accurate features in the set. Thus, when λ = 0, the set's accuracy is equal to that of its most accurate single feature. When λ > 0, the set's accuracy increases as more features are added to the set. We will explore λ = 0 and λ = 1 as two extremes of this combiner function, and we denote the combiner function for a specific value of λ as comb_λ. Note that for any λ, the accuracy values are indeed monotone. We will explore the ramifications of non-monotone functions in the real-world experiments.
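Since the combiner formula itself does not survive in the text, the following is only a hypothetical instantiation with the stated properties (best single feature at one extreme; non-negative, diminishing gains otherwise, so monotonicity holds); `lam` names the interaction parameter.

```python
def combined_accuracy(feature_accs, lam):
    """Hypothetical combiner: at lam = 0 return the best single feature's
    accuracy; for lam > 0 each extra feature adds a non-negative residual
    gain; at lam = 1 this reduces to a noisy-OR style combination."""
    top = sorted(feature_accs, reverse=True)
    acc, miss = top[0], 1.0 - top[0]
    for a in top[1:]:
        acc += lam * a * miss           # residual gain, scaled by lam
        miss *= 1.0 - a
    return acc
```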
We use the following parameters to specify a specific synthetic configuration: the number of features, the probability that a feature is helpful, and λ, the amount the features interact with each other. In each run, we use these parameters to generate the specific features, cost functions and accuracies that are used for all of the algorithms. For Greedy, we iterate over a fixed set of α values. Although we vary α over a large range, in practice the majority of values give rise to identical sequences, because the dominating set of feature sets is fixed for a given item size (as assumed by Greedy).
Feature Set Expansions (Metric 1): We begin by comparing the number of models that PolyDom, Greedy and NaiveExpandAll train, as a function of the combiner function and the number of features. This is simply the total number of unique feature sets expanded: for PolyDom and NaiveExpandAll, the number of nodes in the lattice that are expanded, and for Greedy, the number of unique feature sets expanded across its sequences.
In Figure Indexing Cost Sensitive Prediction and Figure Indexing Cost Sensitive Prediction, we depict the number of feature sets expanded (in log scale) as a function of the number of features in the dataset along the x-axis, for the two extreme combiner functions respectively. The plots for other parameter values are similar.
For both combiner functions, the total number of possible feature sets (depicted as NaiveExpandAll) scales very rapidly, as expected. On the other hand, the number of feature sets expanded by PolyDom grows at a much slower rate in both graphs, because PolyDom's pruning rules allow it to "sandwich" a lot of potential feature sets and avoid expanding them. Consider first the λ = 0 combiner function: for 10 features, NaiveExpandAll expands around 1000 feature sets, while PolyDom expands about 50 for the smaller Δ and about 30 for the larger Δ. We find that the ability to effectively prune the lattice of feature sets depends quite a bit on the combiner function: while PolyDom with Δ > 0 continues to perform similarly under λ = 1, PolyDom with Δ = 0 expands as much as NaiveExpandAll; this is not surprising given that under λ = 1 all intermediate feature sets have accuracy values strictly greater than their immediate children. In comparison, Greedy expands about as many feature sets as PolyDom, but with a slower growth rate, as can be seen in both graphs; this is not surprising because in the worst case Greedy expands a number of feature sets linear in the number of features for each α.
Indexing Size and Retrieval (Metric 2 and 3): Here, we measure the indexing and retrieval time of the PolyDom algorithms, which use a more complex indexing scheme than the Greedy algorithms.
Metric 2 and 3 Summary: On the synthetic dataset, especially for larger numbers of features, the size of the PolyDom index is significantly smaller than the size of the PolyDomIndexAll index, and almost as small as the NaiveLookup index. However, while NaiveLookup has a smaller index size, NaiveLookup's retrieval time is much larger than PolyDom's (for multiple values of Δ), making it an unsuitable candidate.
In Figure Indexing Cost Sensitive Prediction and Figure Indexing Cost Sensitive Prediction, we plot, for the two combiner functions, the total size of the polydom index as the number of features is increased. Consider a fixed number of features under the first combiner function: PolyDom's and NaiveLookup's index sizes are both less than 200, while PolyDomIndexAll's index size is at the 1000 mark, and rapidly increases to 6000 for 12 features, making it an unsuitable candidate for large numbers of features. The reason PolyDom's index is smaller than PolyDomIndexAll's is that PolyDom only indexes those points where the dominance relationship changes, while PolyDomIndexAll indexes all intersection points between candidate feature sets. NaiveLookup, on the other hand, only needs to record the set of candidate sets for either combiner function, and therefore grows slowly as well.
On the other hand, for retrieval time, depicted in Figures Indexing Cost Sensitive Prediction and Indexing Cost Sensitive Prediction, we find that PolyDom's indexing scheme does much better than NaiveLookup, since we have organized the feature sets in such a way that it is quick to retrieve the appropriate feature set given a cost budget. NaiveLookup does significantly worse, since it linearly scans all candidate feature sets to pick the best one, and the gap widens as the number of features increases.
RealTime Accuracy (Metric 4): We now test the accuracy of the eventual model recommended by our algorithm.
Metric 4 Summary: On synthetic datasets, for both combiner functions and over a range of budgets, PolyDom (with either value of Δ) returns models with greater accuracies than Greedy and GreedyAcc, which in turn return models with greater accuracies than GreedyCost. Often the accuracy difference (for certain budgets) between PolyDom and Greedy, or between Greedy and GreedyCost, can be as high as 20%.
In Figures Indexing Cost Sensitive Prediction and Indexing Cost Sensitive Prediction, we plot the accuracy as a function of the budget for PolyDom with both values of Δ, and for Greedy, GreedyAcc and GreedyCost. Due to space constraints, we fix the item size and the number of features. PolyDom is almost always better than Greedy. For instance, at a budget of 1000 under the first combiner function, PolyDom has an accuracy of about 80%, while Greedy, GreedyAcc and GreedyCost all have an accuracy of about 50%; at the same budget under the second combiner function, PolyDom has an accuracy of more than 90%, while Greedy, GreedyAcc and GreedyCost all have accuracies of about 50%. This may be because Greedy, GreedyAcc, and GreedyCost all explore small portions of the lattice and may get stuck in local optima. That said, apart from "glitches" in the mid-tier budget range, all algorithms achieve optimality for large budgets, and are no better than random for low budgets.
Further, as can be seen in the figure, Greedy does better than GreedyCost and similarly to GreedyAcc. We have also seen other instances where Greedy does better than GreedyAcc and similarly to GreedyCost. Often, the performance of Greedy tracks one of GreedyAcc or GreedyCost.
Real Dataset: This subsection describes our experiments on a real image classification dataset [?]. The task is multi-class classification: identify each image as one of 15 possible scenes. There are 4485 labeled images in the original dataset. To test how our algorithms perform on varying image sizes, we rescale the images to several additional pixel sizes; in total, our dataset contains 17900 images. We use 8000 images for training and the rest as test images.
The task uses 13 image classification features (e.g., SIFT and GIST features) that vary significantly in cost and accuracy. In Figure Indexing Cost Sensitive Prediction, we plot the cost functions of eight representative features as a function of the item size, learned using least-squares curve fitting to the median cost at each training image size. As can be seen in the figure, some features' cost functions are relatively flat (e.g., gist), while others increase linearly (e.g., geo_map88) or superlinearly (e.g., texton).
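The curve-fitting step can be sketched as follows using only the standard library; with one median point per size group and three size groups, quadratic interpolation coincides with the least-squares fit described above (function and variable names are ours).

```python
from statistics import median

def fit_cost_curve(samples):
    """Fit c(s) = a0 + a1*s + a2*s^2 through the median observed extraction
    cost at three distinct item sizes; `samples` is a list of (size, cost)."""
    by_size = {}
    for s, c in samples:
        by_size.setdefault(s, []).append(c)
    x0, x1, x2 = sorted(by_size)                       # three distinct sizes
    y0, y1, y2 = (median(by_size[x]) for x in (x0, x1, x2))

    def c(s):  # Lagrange form of the interpolating quadratic
        return (y0 * (s - x1) * (s - x2) / ((x0 - x1) * (x0 - x2))
                + y1 * (s - x0) * (s - x2) / ((x1 - x0) * (x1 - x2))
                + y2 * (s - x0) * (s - x1) / ((x2 - x0) * (x2 - x1)))
    return c
```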
However, note that due to variance in feature evaluation time, we may have cases where the real evaluation cost does not exactly match the predicted or expected cost. In Figure Indexing Cost Sensitive Prediction, we depict the 10th and 90th percentiles of the cost given the item size for a single feature. As can be seen in the figure, there is significant variance, especially at larger image sizes.
To compensate for this variation, we compute the cost functions using the worst-case extraction costs rather than the median. In this way, we ensure that the predicted models in the experiment are always within budget. Note that Greedy does not need to do this, since it can seamlessly scale up or down the number of features evaluated as it traverses the sequence corresponding to a given α. We did not consider this dynamic approach for the PolyDom algorithm.
The “black box” machine learning algorithm we use is a linear classifier trained with stochastic gradient descent on the hinge loss with an L1 penalty. We first train the model over the training images for all possible combinations of features, and cache the resulting models and cross-validation (i.e., estimated) accuracies. The rest of the experiments can then look up the cached models rather than retrain them for each execution.
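A minimal stdlib sketch of such a learner is below; the experiments presumably use an off-the-shelf implementation, so this is illustrative only, and all names are ours.

```python
import random

def train_linear_sgd(data, dim, lr=0.1, l1=0.001, epochs=20, seed=0):
    """Linear classifier trained by SGD on the hinge loss with an L1 penalty.
    `data` is a list of (feature_vector, label) pairs with labels in {-1, +1}."""
    rng = random.Random(seed)
    w, b = [0.0] * dim, 0.0
    sign = lambda v: 1.0 if v > 0 else -1.0 if v < 0 else 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            for i in range(dim):
                # hinge subgradient (active only when margin < 1) plus L1 term
                g = (-y * x[i] if margin < 1 else 0.0) + l1 * sign(w[i])
                w[i] -= lr * g
            if margin < 1:
                b += lr * y
    return w, b

def predict(model, x):
    w, b = model
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```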
For this dataset, our default values for Δ and ε are set as described below. As we will see in the following, the impact of ε is small, even though our experiments described in Appendix Indexing Cost Sensitive Prediction find that small violations of monotonicity do occur.
Feature Set Expansions (Metric 1):
Metric 1 Summary: On the real dataset, the number of feature sets expanded by PolyDom's offline lattice pruning phase is orders of magnitude smaller than NaiveExpandAll's, with the gap increasing as Δ increases and as ε decreases. Greedy expands a similar number of feature sets as PolyDom with a moderate Δ.
In Figure Indexing Cost Sensitive Prediction, we depict the number of feature sets expanded (in log scale) as a function of the tolerance to non-monotonicity ε along the x-axis, for PolyDom with various values of Δ, and for Greedy.
As can be seen in the figure, while the total number of possible feature sets is close to 2^13 (which is what NaiveExpandAll would expand), the number of feature sets expanded by PolyDom is far smaller for moderate Δ or greater, and is even smaller for larger Δs (the more relaxed variant induces fewer feature set expansions). Greedy (depicted as a flat black line) expands a similar number of feature sets as PolyDom with a moderate Δ. On the other hand, for the smallest Δ, more than a quarter of the feature sets are expanded.
The number of feature sets expanded also increases as ε increases: assuming that violations of monotonicity are larger leads to more feature set expansions.
Indexing Size and Retrieval (Metric 2 and 3):
Metric 2 and 3 Summary: On the real dataset, the size of the index for PolyDom is two orders of magnitude smaller than PolyDomIndexAll's, while NaiveLookup's is one order of magnitude smaller still. The index size increases as ε increases and decreases as Δ increases. However, the retrieval time for PolyDom is minuscule compared to the retrieval time for NaiveLookup.
In Figure Indexing Cost Sensitive Prediction, we plot the total index size as the tolerance to non-monotonicity ε is increased. As can be seen in the figure, the index size for PolyDom grows slowly compared to PolyDomIndexAll's, while NaiveLookup's grows even slower. Then, in Figure Indexing Cost Sensitive Prediction, we show that the total index size decreases rapidly as Δ is increased.
On the other hand, if we look at retrieval time, depicted in Figures Indexing Cost Sensitive Prediction and Indexing Cost Sensitive Prediction (varying ε and varying Δ respectively), we find that NaiveLookup is much worse than PolyDom: PolyDom's indexes lead to near-zero retrieval times, while NaiveLookup's retrieval time is significant, at least an order of magnitude larger. Overall, as the number of candidate sets under consideration grows (i.e., as Δ decreases or ε increases), PolyDom does much better relative to PolyDomIndexAll in terms of space, and much better relative to NaiveLookup in terms of time. This is not surprising: the indexing scheme used by PolyDom pays richer dividends when the number of candidate sets is large.
RealTime Accuracy (Metric 4):
Metric 4 Summary: On the real dataset, we find that while PolyDom still performs well compared to the other algorithms for small Δ, some Greedy-based algorithms are competitive, and in a few cases, somewhat surprisingly, better than PolyDom. All other things being fixed, the accuracy increases as the item size decreases, as the budget increases, as Δ decreases, and as ε increases.
In Figure Indexing Cost Sensitive Prediction, we plot the average estimated accuracy of the model retrieved given a budget, for various values of Δ and ε, for PolyDom, Greedy, GreedyAcc, and GreedyCost, across a range of image sizes. As can be seen in the figure, PolyDom dominates Greedy and GreedyAcc in all but one configuration. Even in that configuration, PolyDom dominates GreedyAcc for larger ε: here, we see that a higher ε leads to better performance, which we did not see in other cases.
Perhaps the most surprising aspect of this dataset is that GreedyCost dominates all the others overall. While the reader may be surprised that a Greedybased algorithm can outperform PolyDom with , recall that the Greedy algorithms are anytime algorithms that can adapt to high variance in the feature evaluation cost, as opposed to PolyDom, which provisions for the worstcase and does not adapt to the variance. In future work, we plan to explore anytime variants of PolyDom, or hybrid variants of PolyDom with Greedybased algorithms.
In the corresponding figure, we plot the estimated accuracy of the retrieved model as a function of the budget for various image sizes (averaged over a number of images of the same size). As can be seen in the figure, for a fixed image size, estimated accuracy increases with the budget; for a fixed budget, estimated accuracy increases as image size decreases, since a smaller image allows the same budget to evaluate more features and use a more powerful model.
Despite its importance in applications, cost-sensitive real-time classification is not a particularly well-studied problem: typically, a feature selection algorithm [?] is used to identify a set of inexpensive features, which are then paired with an inexpensive machine learning model applied to all items, large or small; there is no dynamic decision that lets us use a more expensive feature set when the input parameters allow it. This approach yields a classifier that is suboptimal given the problem parameters. It has nevertheless been used for real-time classification in a variety of scenarios, including sensor-network processing [?, ?], object and people tracking [?, ?, ?], gesture understanding [?, ?], face recognition, speech understanding [?, ?], sound understanding [?, ?], scientific studies [?, ?], and medical analysis [?, ?]. All of these applications could benefit from the algorithms and indexing structures outlined in this paper.
Our techniques are designed using a wrapper-based approach [?] that is agnostic to the specific machine learning model, budget metric, and features to extract. For this reason, our approach can be applied in conjunction with a variety of machine learning classification or regression techniques, including SVMs [?], decision trees [?], and linear or ridge regression [?], among others [?]. In addition, the budget can be defined in terms of system resources, monetary metrics, time, or a combination thereof.
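To make the wrapper idea concrete, here is a minimal sketch of one way such a model-agnostic interface might look. All names here (`Candidate`, `best_within_budget`) and the toy numbers are illustrative assumptions, not the paper's actual API: the wrapper only needs each candidate's feature set, its estimated cost and accuracy, and an opaque predict function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Illustrative sketch: a candidate pairs an opaque trained model with the
# feature subset it needs, plus estimated cost and accuracy for that subset.
@dataclass
class Candidate:
    features: List[str]
    est_cost: float      # estimated total feature-evaluation cost
    est_accuracy: float  # estimated (e.g., held-out) accuracy
    predict: Callable[[Dict[str, float]], int]

def best_within_budget(candidates: List[Candidate], budget: float) -> Candidate:
    """Return the highest-accuracy candidate whose estimated cost fits the budget."""
    feasible = [c for c in candidates if c.est_cost <= budget]
    if not feasible:
        raise ValueError("no candidate fits the budget")
    return max(feasible, key=lambda c: c.est_accuracy)

# Toy usage with stub models.
cands = [
    Candidate(["f1"], 1.0, 0.70, lambda x: 0),
    Candidate(["f1", "f2"], 3.0, 0.80, lambda x: 1),
    Candidate(["f1", "f2", "f3"], 9.0, 0.85, lambda x: 1),
]
chosen = best_within_budget(cands, budget=4.0)
```

Because the model itself is hidden behind `predict`, any classifier or regressor can be slotted in, and the budget can be measured in time, money, or any other additive resource.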
There has been some work on adapting traditional algorithms to incorporate some notion of joint optimization with resource constraints, often motivated by a transition of these algorithms out of the research space and into industry. A few examples of this approach have been developed by the information retrieval and document ranking communities. In these papers, the setup is typically described as an additive cascade of classifiers intermingled with pruning decisions. Wang et al. note that if these classifiers are independent, the constrained selection problem is essentially the knapsack problem. In practice, members of an ensemble do not contribute independently to ensemble performance, posing a potentially exponential selection problem. For the ranking domain, Wang et al. apply an algorithm that attempts to identify and remove redundant features [?], assemble a cascade of ranking and pruning functions [?], and develop a set of metrics to describe the efficiency-effectiveness tradeoff of these functions [?]. Other work focuses specifically on input-sensitive pruning aggressiveness [?] and early cascade termination strategies [?]. These approaches are similar in spirit to ours but tightly coupled to the IR domain. For example, redundant feature removal relies on knowledge of shared information between features (e.g., unigrams and bigrams), and the structure of the cascade (cycles of pruning and ranking) is particular to that problem. Further, these approaches are tuned to the ranking application, and do not directly apply to classification.
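The knapsack reduction noted above can be illustrated with a short sketch. Assuming independent cascade members whose accuracy gains add (the classifier costs and gains below are invented for illustration), a standard 0/1 knapsack dynamic program selects which members to run under an integral cost budget:

```python
def knapsack_select(items, budget):
    """0/1 knapsack over (cost, gain) pairs; costs assumed integral.
    Returns (best_total_gain, sorted list of chosen item indices)."""
    n = len(items)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i, (cost, gain) in enumerate(items, 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]
            if cost <= b:
                best[i][b] = max(best[i][b], best[i - 1][b - cost] + gain)
    # Trace back which items were chosen.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= items[i - 1][0]
    return best[n][budget], sorted(chosen)

# Toy ensemble members as (evaluation cost, independent accuracy gain).
clf = [(2, 0.10), (3, 0.15), (4, 0.12)]
gain, picked = knapsack_select(clf, budget=5)
```

When contributions are not independent, as the text notes, this additive objective breaks down and the selection problem becomes exponential in the worst case.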
Xu's classifier cascade work [?, ?, ?, ?] considers the problem of post-processing classifiers for cost sensitivity. This approach yields benefits similar to ours (e.g., an expensive feature may be chosen first if its gains outweigh those of a combination of cheap features), but it is tailored to binary classification environments with a low positive classification rate and does not dynamically factor in input size at runtime. Others apply Markov decision processes to navigate the exponential space of feature combinations [?], terminate feature computation once a test point surpasses a certain similarity to training points [?], or greedily order feature computation [?]; however, none of these formalize the notion of budget or input size in the runtime model, making it difficult to know whether high-cost, high-reward features can be justified up front or should be forgone for an ensemble of lower-cost features. That said, our Greedy algorithm (along with its variants, GreedyAcc and GreedyCost) is adapted from these prior papers [?, ?].
Our PolyDom algorithms are also related to the broad literature on frequent itemset mining [?], specifically [?, ?], which explores a lattice of sets of items (or market baskets) incrementally; portions of the lattice that are dominated are simply never explored. Our PolyDom algorithm is also related to skyline computation [?], since we implicitly maintain a skyline at all "interesting" item sizes where the skyline changes drastically.
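The skyline (Pareto frontier) over (cost, accuracy) pairs can be sketched generically as follows; this single-pass computation over toy points is illustrative, not the paper's implementation:

```python
def skyline(points):
    """Keep the (cost, accuracy) points not dominated by any other point.
    A point dominates another if it has lower-or-equal cost and
    higher-or-equal accuracy, strictly better in at least one."""
    result = []
    # Sort by ascending cost; on cost ties, best accuracy first, so a
    # dominated same-cost point can never be appended.
    for cost, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        # With this order, a point survives iff its accuracy strictly
        # exceeds that of every cheaper (already-kept) point.
        if not result or acc > result[-1][1]:
            result.append((cost, acc))
    return result

pts = [(1, 0.6), (2, 0.7), (3, 0.65), (4, 0.9)]
frontier = skyline(pts)  # (3, 0.65) is dominated by (2, 0.7)
```

The resulting frontier is exactly the set of models worth indexing at a given item size: any model off the frontier is beaten by a cheaper, at-least-as-accurate alternative.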
Anytime algorithms are a concept from the planning literature describing algorithms that always produce some answer and continuously refine that answer given more time [?]. Our Greedy family of algorithms are anytime algorithms.
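To illustrate, here is a minimal sketch of an anytime, Greedy-style feature evaluator: it charges each feature's actual cost against the remaining budget and can be interrupted at any point, since the set of features evaluated so far is always a usable answer. The feature names and numbers are invented for illustration:

```python
def anytime_greedy(features, budget):
    """Evaluate features in decreasing benefit/cost order, charging actual
    cost against the remaining budget.
    features: list of (name, est_benefit, actual_cost) triples.
    Returns (names evaluated, total cost spent)."""
    order = sorted(features, key=lambda f: f[1] / f[2], reverse=True)
    selected, spent = [], 0.0
    for name, benefit, cost in order:
        if spent + cost > budget:
            continue  # skip what no longer fits; cheaper features may still fit
        spent += cost
        selected.append(name)
    return selected, spent

# Hypothetical features: (name, estimated benefit, actual evaluation cost).
feats = [("edge_hist", 0.30, 4.0), ("color_mean", 0.10, 1.0), ("sift", 0.40, 8.0)]
selected, spent = anytime_greedy(feats, budget=6.0)
```

Because the loop charges actual rather than estimated costs, it adapts to cost variance at runtime, which is the property that lets GreedyCost beat worst-case-provisioned PolyDom on high-variance inputs.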
In this paper, we designed machine-learning-model-agnostic cost-sensitive prediction schemes. We developed two core approaches (coupled with indexing techniques), titled PolyDom and Greedy, representing two extremes in how this cost-sensitive prediction wrapper can be architected. We found that PolyDom's optimization schemes allow us to maintain optimality guarantees while ensuring significant performance gains on various parameters relative to PolyDomIndexAll, NaiveLookup, and NaiveExpandAll, and often Greedy, GreedyAcc, and GreedyCost as well. We found that Greedy, along with its GreedyAcc and GreedyCost variants, enables "quick and dirty" solutions that are close to optimal in many settings.
In our work, we've taken a purely black-box approach to feature extraction, the machine learning algorithms, and the structure of the datasets. In future work, we plan to investigate how knowledge about the model, or correlations between features, can help us avoid expanding even more nodes in the lattice.
Furthermore, in this paper we simply used the size of the image as a signal for how long a feature would take to evaluate; we found that this often leads to estimates with high variance (see the corresponding figure), forcing us to provision for the worst case instead of the average case. We plan to investigate other "cheap" indicators of an item (like its size) that allow us to infer how much a feature evaluation would cost. Additionally, our focus here was on classifying a single item; if the goal were instead to evaluate an entire dataset within a time budget, for example to find the item most likely to belong to a special class, very different techniques would be needed.
References
 [1] Largest image. http://70gigapixel.cloudapp.net.
 [2] R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. In ACM SIGMOD Record, volume 22, pages 207–216. ACM, 1993.
 [3] J. N. Bailenson, E. D. Pontikakis, I. B. Mauss, J. J. Gross, M. E. Jabon, C. A. Hutcherson, C. Nass, and O. John. Real-time classification of evoked emotions using facial feature tracking and physiological responses. International Journal of Human-Computer Studies, 66(5):303–317, 2008.
 [4] J. Brutlag. Speed matters for google web search. http://googleresearch.blogspot.com/2009/06/speedmatters.html#!/2009/06/speedmatters.html. Published: 2009.
 [5] D. Burdick, M. Calimlim, and J. Gehrke. Mafia: A maximal frequent itemset algorithm for transactional databases. In Data Engineering, 2001. Proceedings. 17th International Conference on, pages 443–452. IEEE, 2001.
 [6] B. B. Cambazoglu, H. Zaragoza, O. Chapelle, J. Chen, C. Liao, Z. Zheng, and J. Degenhardt. Early exit optimizations for additive machine learned ranking systems. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM ’10, pages 411–420, New York, NY, USA, 2010. ACM.
 [7] M. Chen, Z. E. Xu, K. Q. Weinberger, O. Chapelle, and D. Kedem. Classifier cascade for minimizing feature evaluation cost. In AISTATS, pages 218–226, 2012.
 [8] D. Comaniciu, V. Ramesh, and P. Meer. Realtime tracking of nonrigid objects using mean shift. In Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 2, pages 142–149. IEEE, 2000.
 [9] B. Crawford, K. Miller, P. Shenoy, and R. Rao. Realtime classification of electromyographic signals for robotic control. In AAAI, volume 5, pages 523–528, 2005.
 [10] T. Dean and M. Boddy. Timedependent planning. AAAI ’88, pages 49–54, 1988.
 [11] P. Domingos. A few useful things to know about machine learning. Communications of the ACM, 55(10):78–87, 2012.
 [12] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern classification. John Wiley & Sons, 2012.
 [13] J. Hamilton. The cost of latency. http://perspectives.mvdirona.com/2009/10/31/TheCostOfLatency.aspx.
 [14] T. Hastie, R. Tibshirani, J. Friedman, T. Hastie, J. Friedman, and R. Tibshirani. The elements of statistical learning, volume 2. Springer, 2009.
 [15] M. A. Hearst, S. Dumais, E. Osman, J. Platt, and B. Scholkopf. Support vector machines. Intelligent Systems and their Applications, IEEE, 13(4):18–28, 1998.
 [16] Q. A. Holmes, D. R. Nuesch, and R. A. Shuchman. Textural analysis and realtime classification of seaice types using digital sar data. Geoscience and Remote Sensing, IEEE Transactions on, (2):113–120, 1984.
 [17] D. M. Karantonis, M. R. Narayanan, M. Mathie, N. H. Lovell, and B. G. Celler. Implementation of a realtime human movement classifier using a triaxial accelerometer for ambulatory monitoring. Information Technology in Biomedicine, IEEE Transactions on, 10(1):156–167, 2006.
 [18] S. Karayev, M. J. Fritz, and T. Darrell. Dynamic feature selection for classification on a budget. In International Conference on Machine Learning (ICML): Workshop on Prediction with Sequential Models, 2013.
 [19] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artif. Intell., 97(1-2):273–324, Dec. 1997.
 [20] S. M. LaConte, S. J. Peltier, and X. P. Hu. Realtime fmri using brainstate classification. Human brain mapping, 28(10):1033–1044, 2007.
 [21] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, volume 2, pages 2169–2178. IEEE, 2006.
 [22] A. J. Lipton, H. Fujiyoshi, and R. S. Patil. Moving target classification and tracking from realtime video. In Applications of Computer Vision, 1998. WACV’98. Proceedings., Fourth IEEE Workshop on, pages 8–14. IEEE, 1998.
 [23] A. Mahabal, S. Djorgovski, R. Williams, A. Drake, C. Donalek, M. Graham, B. Moghaddam, M. Turmon, J. Jewell, A. Khosla, et al. Towards realtime classification of astronomical transients. arXiv preprint arXiv:0810.4527, 2008.
 [24] F. Nan, J. Wang, K. Trapeznikov, and V. Saligrama. Fast marginbased costsensitive classification. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 2952–2956. IEEE, 2014.
 [25] P. Naula, A. Airola, T. Salakoski, and T. Pahikkala. Multilabel learning under feature extraction budgets. Pattern Recognition Letters, 40:56–65, 2014.
 [26] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemset lattices. Information systems, 24(1):25–46, 1999.
 [27] J. R. Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.
 [28] R. Rad and M. Jamzad. Real time classification and tracking of multiple vehicles in highways. Pattern Recognition Letters, 26(10):1597–1607, 2005.
 [29] M. Raptis, D. Kirovski, and H. Hoppe. Realtime classification of dance gestures from skeleton animation. In Proceedings of the 2011 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 147–156. ACM, 2011.
 [30] V. C. Raykar, B. Krishnapuram, and S. Yu. Designing efficient cascaded classifiers: tradeoff between accuracy and cost. In KDD, pages 853–860, 2010.
 [31] J. Rodriguez, A. Goni, and A. Illarramendi. Realtime classification of ecgs on a pda. Information Technology in Biomedicine, IEEE Transactions on, 9(1):23–34, 2005.
 [32] Y. Saeys, I. Inza, and P. Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507–2517, Sept. 2007.
 [33] J. Saunders. Realtime discrimination of broadcast speech/music. In Acoustics, Speech, and Signal Processing, 1996. ICASSP96 Vol 2. Conference Proceedings., 1996 IEEE International Conference on, volume 2, pages 993–996. IEEE, 1996.
 [34] E. Schurman and J. Brutlag. The user and business impact of server delays, additional bytes, and http chunking in web search. http://velocityconf.com/velocity2009/public/schedule/detail/8523, 2009.
 [35] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, and R. Moore. Realtime human pose recognition in parts from single depth images. Communications of the ACM, 56(1):116–124, 2013.
 [36] C. Stauffer and W. E. L. Grimson. Learning patterns of activity using realtime tracking. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):747–757, 2000.
 [37] K.-L. Tan, P.-K. Eng, B. C. Ooi, et al. Efficient progressive skyline computation. In VLDB, volume 1, pages 301–310, 2001.
 [38] N. Tonellotto, C. Macdonald, and I. Ounis. Efficient and effective retrieval using selective pruning. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM ’13, pages 63–72, New York, NY, USA, 2013. ACM.
 [39] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. Speech and Audio Processing, IEEE transactions on, 10(5):293–302, 2002.
 [40] L. Wang, J. Lin, and D. Metzler. Learning to efficiently rank. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’10, pages 138–145, New York, NY, USA, 2010. ACM.
 [41] L. Wang, J. Lin, and D. Metzler. A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, pages 105–114, New York, NY, USA, 2011. ACM.
 [42] L. Wang, D. Metzler, and J. Lin. Ranking under temporal constraints. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM ’10, pages 79–88, New York, NY, USA, 2010. ACM.
 [43] Z. E. Xu, M. J. Kusner, G. Huang, and K. Q. Weinberger. Anytime representation learning. In ICML (3), pages 1076–1084, 2013.
 [44] Z. E. Xu, M. J. Kusner, K. Q. Weinberger, and M. Chen. Costsensitive tree of classifiers. In ICML (1), pages 133–141, 2013.
 [45] Z. E. Xu, K. Q. Weinberger, and O. Chapelle. The greedy miser: Learning under testtime budgets. In ICML, 2012.
APPENDIX
The following is a proof of the PolyDom index construction algorithm described earlier in the paper.
Proof. The proof follows by case analysis:

Scenario 1: Curve 1 and 2 both on skyline and

Scenario 2: Curve 1 is not on the skyline while Curve 2 is

Scenario 3: Curve 1 and 2 are both not on skyline

Scenario 4: Curve 1 is on the skyline and Curve 2 is not, and
The argument is that in Scenarios 3 and 4, the skyline does not change: in Scenario 3, Curves 1 and 2 will still not be on the skyline, while in Scenario 4, since Curve 1 gets even better, it will still be on the skyline, while Curve 2 will be dominated by Curve 1 and therefore will not be.
On the other hand, in Scenario 1, Curve 1 will start dominating Curve 2, so Curve 2 is now removed from the skyline. Scenario 2 is a little trickier: Curve 1, which is not on the skyline because of its high cost, may move onto the skyline if no other curve dominates it (i.e., has lower cost and higher accuracy).
The other scenarios cannot happen:

Other half of Scenario 1: Curves 1 and 2 are both on the skyline; this cannot happen, because then Curve 2 would dominate Curve 1.

Other half of Scenario 4: Curve 1 is on the skyline while Curve 2 is not; this cannot happen, since Curve 1 would then be dominated by Curve 2.
In addition to the above performance metrics, we evaluated three sources of variation that deviate from our assumptions.
Antimonotonicity: First, we examine the monotonicity assumption that we made on the real dataset, which we used to define the values used in the experiments. Although the monotonicity principle is anecdotally known to be true among machine learning practitioners, we were unable to find a reference that evaluates monotonicity in practice on a real dataset; we view this as an additional contribution of our paper.
The corresponding figure plots the cumulative distribution of violations of monotonicity between pairs of ancestor-descendant feature sets in the lattice. Most of the violations are small: close to 95% of them fall below a small error threshold, and all are bounded. Thus, we believe the monotonicity principle largely holds, even on a real dataset. Note that we found only minor deviations in the quality of the retrieved models as we decreased the threshold, so the assumptions made in the paper do not hurt us in a significant way.
Next, we evaluate where in the lattice these violations of monotonicity occur. The corresponding figure shows the distribution of violations as a function of distance (the difference in the number of features between the ancestor and descendant feature sets). The black line is the median (with grey error bars) of all violations as a function of distance. The highest median, as well as the largest variance, occurs at small distances, and both quickly decrease toward zero as distance grows. This is a good sign: violations of monotonicity, if any, are local rather than global, with almost no violations between pairs of feature sets that are far apart. These results suggest that the Skyline algorithm is unlikely to falsely prune a feature set early on due to a violation of monotonicity, and that a modest slack value can compensate for the majority of violations.
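A sketch of how such violations might be measured, given estimated accuracies for feature subsets. The helper name and the toy accuracies below are hypothetical; a violation occurs whenever a strict superset of features scores lower than its subset:

```python
from itertools import combinations

def monotonicity_violations(acc):
    """acc: dict mapping frozenset-of-features -> estimated accuracy.
    For every ancestor/descendant pair (subset vs. strict superset), report
    (distance, violation) where violation = acc[subset] - acc[superset]
    when positive, i.e., adding features *lost* accuracy."""
    out = []
    for a, b in combinations(list(acc), 2):
        small, big = (a, b) if len(a) < len(b) else (b, a)
        if small < big:  # strict subset relationship
            v = acc[small] - acc[big]
            if v > 0:
                out.append((len(big) - len(small), v))
    return out

# Toy lattice: adding f2 to {f1} loses accuracy (a distance-1 violation).
acc = {
    frozenset({"f1"}): 0.70,
    frozenset({"f1", "f2"}): 0.68,
    frozenset({"f1", "f2", "f3"}): 0.75,
}
viol = monotonicity_violations(acc)
```

Binning the returned (distance, violation) pairs by distance gives exactly the kind of per-distance median/variance summary described above.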
Estimated Versus Actual: Next, we compare the estimated costs and accuracies of the real-world model with the true values, for large and small image sizes.
The corresponding figure plots estimated versus actual cost. We find that the cost function tends to overestimate the actual cost because the cost functions are trained on the worst case rather than the average case; we chose this because, had we provisioned for the mean cost, the PolyDom index might return models whose true costs exceed the time budget. The costs for Greedy are similar; however, because Greedy ignores item size during the offline phase, it severely underestimates the cost of small images, in contrast to PolyDom.
The corresponding figure plots the estimated and true accuracy of the retrieved models. We find that the estimated accuracy is indeed linearly correlated with the true accuracy. However, the model consistently overestimates the accuracy for small images, because they are downsampled and their features are correspondingly less accurate. Overall, this suggests that optimizing for estimated accuracy is a reliable proxy for the quality of predictions at test time.