Cost-Sensitive Feature-Value Acquisition Using Feature Relevance
In many real-world machine learning problems, feature values are not readily available. To make predictions, some of the missing features have to be acquired, which can incur a cost in money, computational time, or human time, depending on the problem domain. This leads us to the problem of choosing which features to use at prediction time. The chosen features should increase the prediction accuracy at a low cost, but determining which features will do that is challenging. The choice should take into account the previously acquired feature values as well as the feature costs. This paper proposes a novel approach to address this problem. The proposed approach chooses the most useful features adaptively based on how relevant they are for the prediction task as well as what the corresponding feature costs are. Our approach uses a generic neural network architecture, which is suitable for a wide range of problems. We evaluate our approach on three cost-sensitive datasets: the Yahoo! Learning to Rank Competition dataset and two health datasets. We show that our approach achieves high accuracy with a lower cost than the current state-of-the-art approaches.
Traditionally, research on machine learning algorithms has focused on achieving accurate predictions on fully available feature sets. However, in the real world, data is often incomplete and acquiring additional feature values incurs a cost. For example, when making a medical diagnosis, a doctor examines the patient and determines what further information is needed to make a diagnosis. The next question to ask has to be chosen out of a vast number of possibilities, but the number of potentially useful follow-up questions can be narrowed down with the answers received for each question. Once the doctor has acquired enough information on the patient, they make a diagnosis and choose the appropriate treatment. Acquiring information could mean, for example, performing medical tests (blood tests, imaging, etc.) or asking the patient for more subjective information on their symptoms. In this case, there are monetary costs for each test as well as for the doctor’s time, and these costs can be very different. Asking the patient for information is fast and cheap, whereas performing medical tests can have a high cost.
Money is not the only type of cost that needs to be taken into account. Medical tests can have negative side-effects or they can cause patient discomfort. In the field of mobile health, acquiring data can have an impact on the mobile device’s battery life. Individuals might also be uncomfortable disclosing specific information, in which case there is a privacy cost for acquiring data. There could even be a cost based on how much human time is required to acquire a feature value. The approaches proposed in this paper can be used with any type of cost, as long as the costs are quantifiable on a linear scale.
There is a wide range of existing research on minimizing the costs associated with machine learning. We group these approaches into four categories. The first category is feature selection, where the number of features is minimized in the training phase so that the cost becomes lower without affecting the prediction accuracy too much (Chandrashekar and Sahin, 2014). This approach, however, does not take into account that some inputs are easier to make predictions on than others and should therefore incur a lower cost. The second category is cascade algorithms, where at each step a decision is made to either acquire the next feature or stop and make a prediction (Xu et al., 2012; Chen et al., 2012; Trapeznikov et al., 2013; Wang, 2014). In these algorithms, the order of features is decided in the training phase, so they have only a limited ability to adapt to newly acquired information. The third category is trees of classifiers, in which the new information affects which feature is acquired next (Xu et al., 2013; Kusner et al., 2014; Wang et al., 2015). However, the number of classifiers can become considerable if there is a large number of features. The fourth category is fully adaptive algorithms, which can choose any feature based on what is most valuable in the current situation (Chen et al., 2015; Early et al., 2016; Kachuee et al., 2018, 2019). There are only a few algorithms in this category, and they tend to have a high computational cost.
The algorithms proposed in this paper belong to the last category, where any unknown feature can be chosen for acquisition. Furthermore, the feature acquisition decisions are based on the knowledge of the existing features. We have chosen to use neural networks as our predictive model, as they have been shown to perform well across many domains. We use generic neural network architectures to demonstrate that our proposed algorithms do not depend on domain-specific structures. To choose which features to acquire next, we derive our approach from the recent research on relevance propagation (Bach et al., 2015; Montavon et al., 2017). The focus of that line of research has been to explain why a neural network made a specific prediction, but we show that a similar approach can give us valuable information on the importance of missing feature values as well. To the best of our knowledge, relevance propagation has not been used in active feature acquisition previously. The benefit of our approach is the ease of training a model and the relatively low computational cost in both the training and testing phases.
In this paper, we propose two algorithms for active feature acquisition. The first algorithm achieves high accuracies at a low computational cost. The second algorithm improves the results of the first one at the expense of a slightly higher computational cost. Our main contributions can be summarized as follows:
- We demonstrate how relevance propagation can be utilized for the active feature acquisition problem.
- We develop an efficient algorithm that uses feature relevance information for feature acquisition. We show through experiments that feature relevances can give us valuable information on the usefulness of missing features with a very low computational cost.
- Based on the first algorithm, we develop a second algorithm that provides better results at a slightly increased computational cost.
- We compare our results with other state-of-the-art algorithms and show that our approach achieves a high accuracy with a lower cost on three realistic datasets.
2 Relevance Propagation
The approach proposed in this paper is related to the recent research on layer-wise relevance propagation (LRP) (Bach et al., 2015). The goal of LRP is to show how large an impact each of the input features had on the prediction, which is a non-trivial task on neural networks due to their non-linearities. To simplify this problem, relevances are propagated back to the input layer one layer at a time following a few specific rules. First, the relevance of the output layer is set to the predicted output value:

$$R^{(L)} = f(\mathbf{x}),$$

where $f(\mathbf{x})$ is the output of our neural network with an input vector $\mathbf{x}$, and $L$ denotes the output layer.
Second, each layer should also have the same total relevance:

$$\sum_{i=1}^{N_l} R_i^{(l)} = \sum_{j=1}^{N_{l+1}} R_j^{(l+1)},$$

where $R_i^{(l)}$ is the output relevance of neuron $i$ belonging to layer $l$ and $N_l$ is the number of neurons in layer $l$. This constraint guarantees that all of the relevance on the higher layer will be distributed to the neurons on the lower layer without adding or removing any relevance between layers. However, the distribution of the relevance within the layers can differ. Doing this at every layer means that the total relevance on the input layer is equal to the relevance on the output layer. Furthermore, the output relevance of a neuron should be equal to the sum of relevances of lower-level neurons that are directly connected to it:

$$R_j^{(l+1)} = \sum_{i=1}^{N_l} R_{i \leftarrow j},$$

where $R_{i \leftarrow j}$ is the amount of relevance attributed from a higher-level neuron $j$ to neuron $i$. These rules are fulfilled when a neuron distributes its output relevance entirely to its inputs according to specific rules.
There are multiple approaches for distributing the relevances to the lower layer. The approach proposed by Bach et al. (2015) is to distribute the relevance of one neuron to the input neurons in the same proportion as the input values received from each neuron:

$$R_{i \leftarrow j} = \frac{z_{ij}}{z_j} R_j^{(l+1)},$$

in which $z_{ij} = x_i w_{ij}$ and $z_j = \sum_{i'} z_{i'j}$, $x_i$ is an input value for the neuron, and $w_{ij}$ is the corresponding weight. As this rule might not be numerically stable with small activation values, they present a few alternative approaches to mitigate the problem, but the overall idea remains the same.
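As an illustration of the proportional rule, the following minimal sketch redistributes the relevance of each upper-layer neuron $j$ to its inputs $i$ in proportion to $x_i w_{ij}$. The function name `lrp_proportional`, the weights, and the small stabilising constant `eps` are assumptions made for this example and are not taken from the original work.

```python
def lrp_proportional(x, W, R_out, eps=1e-9):
    """Send each upper-layer relevance R_out[j] down to the inputs.

    x[i]    : activation of lower-layer neuron i
    W[i][j] : weight from lower neuron i to upper neuron j
    Input i receives the share z_ij / z_j of R_out[j], with z_ij = x[i] * W[i][j].
    """
    R_in = [0.0] * len(x)
    for j, r_j in enumerate(R_out):
        z = [x[i] * W[i][j] for i in range(len(x))]
        z_j = sum(z)
        # Stabiliser (assumption): nudge z_j away from zero, keeping its sign.
        z_j += eps if z_j >= 0 else -eps
        for i, z_ij in enumerate(z):
            R_in[i] += z_ij / z_j * r_j
    return R_in


x = [1.0, 2.0]
W = [[0.5, -1.0],
     [0.25, 1.0]]
R_out = [1.0, 1.0]
R_in = lrp_proportional(x, W, R_out)
print(R_in)                    # roughly [-0.5, 2.5]
print(sum(R_in), sum(R_out))   # totals agree up to the stabiliser
```

Individual relevances can be negative, but their sum stays (approximately) equal to the relevance that entered the layer.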
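Montavon et al. (2017) proposed another way to propagate relevances while still following the other rules defined by Bach et al. (2015). Their approach, called Deep Taylor Decomposition, approximates each neuron using a first-order Taylor series approximation to determine how much relevance should be propagated to each neuron input. Using this approach, they derive three rules for propagating the relevance based on the neuron’s input domain. In the rules below, $i$ indexes the neurons of the lower layer and $j$ the neurons of the higher layer. When the neuron has unconstrained input, the propagation rule becomes:

$$R_i = \sum_j \frac{w_{ij}^2}{\sum_{i'} w_{i'j}^2} R_j.$$

If the input is constrained to have only positive values, as is the case when the previous layer contains rectified linear units (ReLU), the propagation rule is:

$$R_i = \sum_j \frac{z_{ij}^+}{\sum_{i'} z_{i'j}^+} R_j,$$

where $z_{ij}^+ = x_i w_{ij}^+$, and $w_{ij}^+$ is the positive part of the weight $w_{ij}$. The positive part of a vector is defined as a vector that has all the negative values replaced by zeroes. Finally, if the neuron input is known to have specific upper ($h_i$) and lower ($l_i$) bounds, as is often the case in the first layer, the rule becomes:

$$R_i = \sum_j \frac{z_{ij} - l_i w_{ij}^+ - h_i w_{ij}^-}{\sum_{i'} \left( z_{i'j} - l_{i'} w_{i'j}^+ - h_{i'} w_{i'j}^- \right)} R_j,$$

where $z_{ij} = x_i w_{ij}$ and $w_{ij}^-$ is the negative part of the weight, defined analogously to the positive part.

Full derivations of these rules can be found in Montavon et al. (2017).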
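The three Deep Taylor Decomposition rules differ only in the non-negative score that decides each input's share of a neuron's relevance: squared weights for unconstrained inputs, $x_i w_{ij}^+$ for non-negative inputs, and $x_i w_{ij} - l_i w_{ij}^+ - h_i w_{ij}^-$ for inputs bounded by $l_i \le x_i \le h_i$. The sketch below applies them to a single layer. The function name `propagate`, the rule labels, and all numbers are assumptions made for illustration.

```python
def propagate(x, W, R_out, rule, low=None, high=None):
    """Redistribute R_out[j] to inputs i using one Deep Taylor rule.

    W[i][j] is the weight from input i to neuron j.
    rule is "w2" (unconstrained), "z+" (non-negative inputs) or
    "zB" (inputs bounded by low[i] <= x[i] <= high[i]).
    """
    n_in = len(W)
    R_in = [0.0] * n_in
    for j, r_j in enumerate(R_out):
        if rule == "w2":
            q = [W[i][j] ** 2 for i in range(n_in)]
        elif rule == "z+":
            q = [x[i] * max(W[i][j], 0.0) for i in range(n_in)]
        elif rule == "zB":
            q = [x[i] * W[i][j]
                 - low[i] * max(W[i][j], 0.0)
                 - high[i] * min(W[i][j], 0.0) for i in range(n_in)]
        else:
            raise ValueError("unknown rule: " + rule)
        total = sum(q)
        if total == 0.0:
            continue  # no input can receive this neuron's relevance
        for i in range(n_in):
            R_in[i] += q[i] / total * r_j
    return R_in


x = [0.2, 0.8]
W = [[1.0, -2.0],
     [-0.5, 3.0]]
R_out = [0.3, 0.7]
low, high = [0.0, 0.0], [1.0, 1.0]

for rule in ("w2", "z+", "zB"):
    R_in = propagate(x, W, R_out, rule, low, high)
    print(rule, R_in, sum(R_in))
```

In this example every rule yields non-negative input relevances whose sum equals the total relevance of `R_out`.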
3 Our Approach
3.1 Problem Definition
Let us consider a case where a feature vector $\mathbf{x} \in \mathbb{R}^d$ has all feature values available, but only part of these values are known to us at any time. We know the value of $\mathbf{x} \odot \mathbf{k}_t$, where $\odot$ denotes the element-wise product and $\mathbf{k}_t \in \{0, 1\}^d$ is an indicator vector containing values 1 or 0 depending on whether the corresponding feature value is known to us at time $t$. The number of known features at time $t$ is therefore $\sum_{i=1}^{d} k_{t,i}$. To acquire a new feature, we have to pay feature acquisition cost $c_i$, which can be different for each feature $i$.
The feature acquisition algorithm will then take the following steps. First, the algorithm should determine the most valuable feature using the knowledge of the acquired feature values, the current prediction, and feature costs. This feature value is then requested from an oracle that is able to give any value for a cost, and the value is added to the feature vector of known values $\mathbf{x} \odot \mathbf{k}_{t+1}$. The cumulative cost at each time step is therefore $\sum_{i=1}^{d} k_{t,i} c_i$. This process is repeated until a stopping condition is reached. Possible stopping conditions include the model certainty reaching a predefined level, the total cost becoming too high, or all of the feature values having been acquired. The appropriate stopping condition should be decided based on the problem domain. For example, in the medical domain it might be more appropriate to stop only when the model certainty is high enough, even if it leads to a higher cost. In less critical domains, minimizing the total cost might be more important.
Our task is then to define a value function $v(\mathbf{x} \odot \mathbf{k}_t, \hat{\mathbf{y}}_t, \mathbf{c})$, which uses the currently known features, the current prediction $\hat{\mathbf{y}}_t$, and the feature costs $\mathbf{c}$ to determine the value of acquiring each unknown feature. This function should give the highest value for features that increase the prediction accuracy the most while having a low cost.
In this paper, we propose two methods for cost-sensitive feature acquisition where relevance propagation is used to determine feature informativeness. On a high level, the goal of these algorithms is to determine which unknown feature is the most important for our current prediction at any given time. This importance should reflect our current knowledge of the available feature values. Our first method propagates the predicted value directly, as the earlier relevance propagation papers have done. Our second method propagates relevance through each output node separately to determine which feature is the most important over any potential class.
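The acquisition loop of the problem definition (score the unknown features, request the best one from an oracle, pay its cost, repeat until a stopping condition holds) can be sketched as follows. This is a generic skeleton under stated assumptions: `acquire_features`, the callback signatures, and the budget-based stopping condition are hypothetical names and choices made for this example, and the toy value function simply prefers cheap features.

```python
def acquire_features(oracle, costs, value_fn, predict_fn, budget):
    """Acquire features one by one until the budget or the features run out.

    oracle(i)                              -> true value of feature i
    value_fn(values, known, pred, costs)   -> one score per feature
    predict_fn(values, known)              -> current prediction
    """
    n = len(costs)
    known = [0] * n          # indicator vector: 1 once a value is acquired
    values = [0.0] * n       # acquired values (placeholders elsewhere)
    total_cost = 0.0
    while sum(known) < n:
        prediction = predict_fn(values, known)
        scores = value_fn(values, known, prediction, costs)
        candidates = [i for i in range(n) if not known[i]]
        best = max(candidates, key=lambda i: scores[i])
        if total_cost + costs[best] > budget:
            break            # stopping condition: budget would be exceeded
        values[best] = oracle(best)
        known[best] = 1
        total_cost += costs[best]
    return values, known, total_cost


true_x = [3.0, 1.0, 2.0]
costs = [1.0, 2.0, 5.0]
values, known, spent = acquire_features(
    oracle=lambda i: true_x[i],
    costs=costs,
    value_fn=lambda values, known, pred, c: [1.0 / ci for ci in c],
    predict_fn=lambda values, known: sum(v for v, k in zip(values, known) if k),
    budget=3.0,
)
print(values, known, spent)  # [3.0, 1.0, 0.0] [1, 1, 0] 3.0
```

A certainty-based stopping condition could be added by checking the prediction inside the loop before selecting the next feature.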
3.2 Direct Propagation
We start by training a neural network model with fully available features. Having fully available features is not a strict requirement, but we choose to use them to make the results comparable with the existing research in this field. The model $f$ takes an input vector $\mathbf{x}$ and produces a prediction $\hat{\mathbf{y}} = f(\mathbf{x})$. Our algorithm receives an input vector $\mathbf{x}$, which has no known values initially. To keep track of which feature values have already been acquired, we introduce an indicator vector $\mathbf{k}$, which has value 1 if the corresponding feature value has been acquired and value 0 otherwise. The goal is then to request these feature values one by one until a prediction can be made. First, we fill in the missing values of the given input vector $\mathbf{x}$:

$$\tilde{\mathbf{x}} = \mathbf{k} \odot \mathbf{x} + (\mathbf{1} - \mathbf{k}) \odot \mathbb{E}[\mathbf{x}],$$

where $\odot$ is the Hadamard product. The true expected value of $\mathbf{x}$ is not known, so we estimate it using the training data. This gives us a feature vector $\tilde{\mathbf{x}}$ where the missing values have been replaced with the expectation for those values. We then propagate forward through our model to acquire a prediction $\hat{\mathbf{y}} = f(\tilde{\mathbf{x}})$. This prediction is then used as the starting point for relevance propagation. Using the Deep Taylor Decomposition propagation rules described in Section 2, $\hat{\mathbf{y}}$ is propagated backward to the input nodes. This gives us a relevance $R_i$ for each input node $i$, as shown in Figure 1.
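The imputation step can be illustrated with a short sketch: known entries are kept, and unknown entries are replaced by the per-feature mean of the training data, which stands in for the unknown expectation. The helper names `feature_means` and `fill_missing` and the toy data are assumptions made for this example.

```python
def feature_means(train_rows):
    """Estimate the expected value of each feature from training rows."""
    n = len(train_rows)
    return [sum(col) / n for col in zip(*train_rows)]


def fill_missing(x, k, means):
    """Element-wise k * x + (1 - k) * mean: keep known values, impute the rest."""
    return [ki * xi + (1 - ki) * mi for xi, ki, mi in zip(x, k, means)]


train = [[1.0, 10.0],
         [3.0, 30.0]]
means = feature_means(train)           # [2.0, 20.0]

x = [5.0, 0.0]   # the second entry is only a placeholder; its value is unknown
k = [1, 0]       # feature 0 has been acquired, feature 1 has not
x_filled = fill_missing(x, k, means)
print(x_filled)  # [5.0, 20.0]
```

Because the indicator multiplies the placeholder by zero, whatever is stored in the unknown positions of `x` has no effect on the result.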
Next, we will introduce an adjusted relevance score to take into account the constraints that our problem has. First, we need to notice that the relevances can have either positive or negative values, depending on whether the corresponding feature increased or decreased the predicted value. High impact in either direction can be important, so we will use the absolute value. Next, we need to take into account the feature costs. Two features could provide the same amount of information but have a vastly different cost, so we will normalize the earlier value using the corresponding feature cost. Finally, we are only interested in the unknown features, which leads us to the adjusted relevance $\hat{\mathbf{R}}$, where individual values are defined as:

$$\hat{R}_i = (1 - k_i) \frac{|R_i|}{c_i}.$$

This defines all known feature values to have zero relevance, whereas unknown feature values have a relevance value depending on the magnitude of their original relevance as well as the associated feature acquisition cost. The relevance of known features is set to zero to avoid acquiring the same features multiple times. Finally, we acquire the feature value $x_j$, where $j = \arg\max_i \hat{R}_i$. The direct propagation approach is shown in Algorithm 1. This approach performs one forward and one backward propagation for each acquired feature, so acquiring $n$ features requires $O(n)$ propagations.
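As a small worked example of the adjusted relevance, the sketch below takes the absolute relevance of each feature, divides it by the feature cost, zeroes out the features that are already known, and selects the feature with the highest score. The function names and all numbers are illustrative assumptions.

```python
def adjusted_relevance(R, k, c):
    """(1 - k_i) * |R_i| / c_i for every feature i."""
    return [(1 - ki) * abs(ri) / ci for ri, ki, ci in zip(R, k, c)]


def next_feature(R, k, c):
    """Index of the unknown feature with the highest adjusted relevance."""
    adj = adjusted_relevance(R, k, c)
    return max(range(len(adj)), key=adj.__getitem__)


R = [0.9, -0.6, 0.2]   # raw input relevances (sign shows direction of impact)
k = [1, 0, 0]          # feature 0 is already known
c = [1.0, 2.0, 0.5]    # acquisition costs

print(adjusted_relevance(R, k, c))  # [0.0, 0.3, 0.4]
print(next_feature(R, k, c))        # 2
```

Feature 1 has the larger raw relevance among the unknown features, but feature 2 is chosen because it is four times cheaper.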
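Algorithm 1: Direct propagation
Input: feature vector $\mathbf{x}$, model $f$, cost vector $\mathbf{c}$
$\mathbf{k} \leftarrow \mathbf{0}$
repeat
  $\tilde{\mathbf{x}} \leftarrow \mathbf{k} \odot \mathbf{x} + (\mathbf{1} - \mathbf{k}) \odot \mathbb{E}[\mathbf{x}]$
  $\hat{\mathbf{y}} \leftarrow f(\tilde{\mathbf{x}})$
  $\mathbf{R} \leftarrow$ input relevances obtained by propagating $\hat{\mathbf{y}}$ backward
  $\hat{R}_i \leftarrow (1 - k_i) |R_i| / c_i$ for all $i$
  $j \leftarrow \arg\max_i \hat{R}_i$
  acquire $x_j$ and set $k_j \leftarrow 1$
until stopping condition is reached
return $\hat{\mathbf{y}}$

Algorithm 2: Multiple propagations
Input: feature vector $\mathbf{x}$, model $f$, cost vector $\mathbf{c}$
$\mathbf{k} \leftarrow \mathbf{0}$
repeat
  $\tilde{\mathbf{x}} \leftarrow \mathbf{k} \odot \mathbf{x} + (\mathbf{1} - \mathbf{k}) \odot \mathbb{E}[\mathbf{x}]$
  $\hat{\mathbf{y}} \leftarrow f(\tilde{\mathbf{x}})$
  $r_{\max} \leftarrow 0$
  for $o = 1 \ldots C$ do
    $\mathbf{R} \leftarrow$ input relevances obtained by propagating backward from output node $o$ only
    $\hat{R}_i \leftarrow (1 - k_i) |R_i| / c_i$ for all $i$
    if $\max_i \hat{R}_i > r_{\max}$ then
      $r_{\max} \leftarrow \max_i \hat{R}_i$, $j \leftarrow \arg\max_i \hat{R}_i$
    end if
  end for
  acquire $x_j$ and set $k_j \leftarrow 1$
until stopping condition is reached
return $\hat{\mathbf{y}}$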
3.3 Multiple Propagations
Our second algorithm modifies the direct propagation algorithm to take into account relevant features for each output node $o$. The intuition behind this approach is that some output nodes might receive a low predicted value due to some unknown input feature, which can cause that input feature to have a low relevance in the direct propagation algorithm. By propagating relevance through each output node separately, we can avoid such a situation at the cost of increased computational complexity.
The problem setting is the same as in our first algorithm. However, whereas the prediction $\hat{\mathbf{y}}$ was propagated backward in the previous method, this time we set $\hat{y}_o = 1$ for one output node $o$, while $\hat{y}_{o'} = 0$ for all $o' \neq o$. This is then propagated backward using the same rules as previously. This process is repeated for all values of $o$, so in total the backward propagation is performed $C$ times, once for each output class. We then choose the globally maximal adjusted relevance $\max_{o, i} \hat{R}_{o,i}$, and acquire the value of the corresponding feature $i$.
The multiple propagations method is shown in Algorithm 2. This method is computationally more demanding than the first method, as each feature acquisition requires one forward pass and $C$ backward passes, so acquiring $n$ features requires $O(nC)$ propagations. However, the algorithm can be modified to perform the backward passes in parallel, as they do not depend on each other. For clarity, only the serial method is shown here.
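To make the difference between the two selection strategies concrete, the following sketch runs both on a tiny hand-written network with three inputs, two ReLU hidden units, and two outputs. Everything here is an assumption made for illustration: the weights, the input bounds of 0 and 1, the use of a unit relevance on a single output node in the multiple-propagation variant, and the names `choose_direct` and `choose_multiple`. It is not the reference implementation.

```python
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 1.0]]    # W1[i][j]: input i -> hidden j
W2 = [[1.0, 0.2], [0.1, 1.5]]                  # W2[j][o]: hidden j -> output o
LOW, HIGH = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]   # assumed input bounds


def forward(x):
    h = [max(0.0, sum(x[i] * W1[i][j] for i in range(3))) for j in range(2)]
    y = [sum(h[j] * W2[j][o] for j in range(2)) for o in range(2)]
    return h, y


def spread(q, R_up):
    """Give lower unit i the share q[i][j] / sum_i q[i][j] of R_up[j]."""
    R_low = [0.0] * len(q)
    for j, r in enumerate(R_up):
        total = sum(row[j] for row in q)
        if total > 0.0:
            for i, row in enumerate(q):
                R_low[i] += row[j] / total * r
    return R_low


def input_relevance(x, R_out):
    h, _ = forward(x)
    # Output layer: rule for non-negative (ReLU) inputs.
    q2 = [[h[j] * max(w, 0.0) for w in W2[j]] for j in range(2)]
    R_hidden = spread(q2, R_out)
    # First layer: rule for inputs with known lower and upper bounds.
    q1 = [[x[i] * w - LOW[i] * max(w, 0.0) - HIGH[i] * min(w, 0.0)
           for w in W1[i]] for i in range(3)]
    return spread(q1, R_hidden)


def adjusted(R, k, c):
    return [(1 - ki) * abs(r) / ci for r, ki, ci in zip(R, k, c)]


def choose_direct(x_filled, k, c):
    """One backward pass, started from the full prediction."""
    _, y = forward(x_filled)
    adj = adjusted(input_relevance(x_filled, y), k, c)
    return max(range(len(adj)), key=adj.__getitem__)


def choose_multiple(x_filled, k, c):
    """One backward pass per output node; keep the global maximum."""
    n_out = len(W2[0])
    best_feature, best_value = None, -1.0
    for o in range(n_out):
        one_hot = [1.0 if p == o else 0.0 for p in range(n_out)]
        adj = adjusted(input_relevance(x_filled, one_hot), k, c)
        for i, value in enumerate(adj):
            if k[i] == 0 and value > best_value:
                best_feature, best_value = i, value
    return best_feature


x_filled = [0.5, 0.5, 0.5]   # feature 0 known, features 1 and 2 imputed
k = [1, 0, 0]
c = [1.0, 1.0, 4.0]
print(choose_direct(x_filled, k, c), choose_multiple(x_filled, k, c))
```

The per-output backward passes in `choose_multiple` are independent of each other, which is what makes them straightforward to run in parallel.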
We evaluate our approach on three cost-sensitive datasets: diabetes prediction, heart disease prediction, and the Yahoo! Learning to Rank Competition (LTRC) dataset. A summary of these datasets can be found in Table 1.
|Dataset||Examples||Features||Classes||Hidden layers||Layer size|
4.1 Diabetes Prediction
As medical diagnosis is an area where active feature acquisition can be highly beneficial, the first two experiments use medical datasets. We have derived two datasets from the National Health and Nutrition Examination Survey (NHANES), a long-term program that collected health and nutrition data from a nationally representative sample of about 5,000 people per year between 1999 and 2016 (Centers for Disease Control and Prevention, 2018). The full dataset contains questionnaire answers as well as results from physical and laboratory examinations. As this dataset contains data on a wide range of medical problems, we will first focus on predicting whether an individual has diabetes, prediabetes, or no diabetes.
To predict the level of diabetes, we look at the blood glucose levels and define an individual as healthy if their fasting plasma glucose level is below 100 mg/dL, prediabetic if the level is between 100 and 125 mg/dL, and diabetic if the level is over 125 mg/dL. These ranges have been defined by the Centers for Disease Control and Prevention (2017). We filter out features that are directly related to our target variable, i.e. any variable measuring blood glucose, as they would make the prediction trivial. We also leave out variables that have missing values for over 25% of the subjects. We define feature costs based on a rough estimate of how much money and effort is needed to acquire each feature. The feature costs are listed in Table 2. The dataset is provided in the supplemental materials with the code to reproduce our results. The data was randomly split into a training set (70%) and a test set (30%). Both sets were balanced by oversampling to have an equal number of examples from each class. All features were normalized to a common range.
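The preprocessing steps described above can be sketched as follows: labelling each subject from the fasting plasma glucose thresholds, balancing the classes by random oversampling, and rescaling the features. The helper names, the toy rows, and the choice of 0 to 1 as the target range are assumptions made for this example, and the code is not taken from the supplemental materials.

```python
import random


def glucose_class(fpg_mg_dl):
    """0 = healthy (< 100), 1 = prediabetic (100 to 125), 2 = diabetic (> 125)."""
    if fpg_mg_dl < 100:
        return 0
    if fpg_mg_dl <= 125:
        return 1
    return 2


def oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until all classes are equally large."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([y] * target)
    return out_rows, out_labels


def min_max_scale(rows):
    """Rescale every column to [0, 1] (assumed range); constant columns map to 0."""
    cols = list(zip(*rows))
    lo = [min(col) for col in cols]
    hi = [max(col) for col in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]


glucose = [90, 95, 99, 110, 130, 140]
rows = [[25.0, 22.0], [31.0, 24.5], [40.0, 27.0],
        [52.0, 29.5], [60.0, 33.0], [67.0, 35.5]]   # made-up feature values
labels = [glucose_class(g) for g in glucose]        # [0, 0, 0, 1, 2, 2]
bal_rows, bal_labels = oversample(rows, labels)
scaled = min_max_scale(bal_rows)
print(labels, [bal_labels.count(y) for y in (0, 1, 2)])
```

In practice, the scaling bounds would be computed on the training set and reused for the test set so that no test information leaks into training.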
We compare our approach to four other algorithms: the Greedy Miser (Xu et al., 2012), AdaptApprox (Nan and Saligrama, 2017), FACT (Kachuee et al., 2018), and Opportunistic Learning (OL) (Kachuee et al., 2019). We selected these algorithms based on their good performance and the availability of reference implementations. The Greedy Miser is an algorithm for learning cost-sensitive classification and regression trees. AdaptApprox learns a gating function to decide whether to use a cheap or an expensive classifier. FACT uses the sensitivity of a neural network to determine which feature is most likely to change the prediction. Opportunistic Learning uses reinforcement learning to learn which features to acquire. The results of the first experiment can be found in Figure 1(a).
In this experiment, both of our proposed algorithms (Relevance-DP, Relevance-MP) provide nearly identical results. Initially, FACT provides similar results to our algorithms, but converges to a lower accuracy. Opportunistic Learning (OL) improves its accuracy very slowly after the initial features. AdaptApprox reaches a good accuracy but slightly slower than our algorithms. The Greedy Miser has to acquire a large number of features before finally achieving similar accuracy to the other algorithms. In this experiment, our algorithms provide a fast convergence and a high final accuracy, thereby combining the best aspects of the other algorithms.
4.2 Heart Disease Prediction
Our next dataset is also derived from the NHANES dataset (Centers for Disease Control and Prevention, 2018). This time the goal is to predict whether an individual has heart disease, such as congestive heart failure, or has had a heart attack. We use the same data preprocessing steps and the same feature costs (shown in Table 2) as in the previous dataset. The heart disease dataset can be found in the supplemental materials with the code that can be used to reproduce our results.
The results are shown in Figure 1(b). This time there is a significant difference between our proposed algorithms. Using multiple propagations (Relevance-MP) provides faster convergence than using direct propagation (Relevance-DP). Again, FACT provides similar results to our Relevance-MP algorithm initially but converges to a lower accuracy. Opportunistic Learning (OL) starts slightly slower and converges to an accuracy similar to that of FACT. AdaptApprox starts with more expensive features and therefore stays at a low accuracy for longer. The Greedy Miser suffers from the same problem, but also reaches a good accuracy later. This experiment shows the benefit of using multiple propagations instead of the simpler algorithm.
4.3 Learning to Rank Competition
The goal of the Yahoo! Learning to Rank Competition (LTRC) dataset is to predict how relevant a specific document is given a query (Chapelle, 2011). The relevance is defined by an expert using a five-step scale. The feature vectors consist of features describing the query and the document. These features include, for example, how many times a document has been clicked on the result list, how recent the page is, and how well the document’s text matches the query. Each feature has an associated feature extraction cost between 1 and 200, which has been defined by Yahoo!. The split into training and evaluation sets has been defined in the original dataset and was used as-is for our experiments.
Chapelle (2011) suggests using Normalized Discounted Cumulative Gain (NDCG) to measure the quality of the predictions. This metric was introduced by Järvelin and Kekäläinen (2002), and it compares the relevance of the predicted result order to that of the optimal order. It has also been used by previous feature acquisition papers (Xu et al., 2012, 2013), so we will use it to measure the performance of our algorithms as well.
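As an illustration of the metric, the sketch below computes NDCG@k for a single query by ranking documents by predicted score and comparing the discounted gain of that ranking with the gain of the ideal ranking. It assumes the exponential gain $2^{rel} - 1$ with a $\log_2$ position discount, which is one common variant; the exact variant used in the experiments is not stated here, and the function names and numbers are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items, in the given order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(scores, relevances, k=5):
    """NDCG@k for one query: rank by predicted score, compare with the ideal order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order]
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal if ideal > 0 else 0.0


relevances = [3, 2, 0, 1]             # expert labels for four documents
good_scores = [0.9, 0.8, 0.1, 0.3]    # ranks the documents in the ideal order
bad_scores = [0.1, 0.2, 0.9, 0.3]     # ranks them in the reverse order

print(ndcg_at_k(good_scores, relevances))  # 1.0
print(ndcg_at_k(bad_scores, relevances))   # strictly between 0 and 1
```

A dataset-level score is typically obtained by averaging this value over all queries.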
As the ranking problem is different from the traditional classification problem, we compare our results to the algorithms that have been designed for this problem and have demonstrated the best performance: Cost-Sensitive Tree of Classifiers (CSTC) (Xu et al., 2013), Cronus (Chen et al., 2012), and Early Exit (Cambazoglu et al., 2010). CSTC builds a tree of classifiers, where the chosen path determines which features will be used for prediction. Cronus builds a cascade of classifiers so that the easy inputs will be handled by the earlier classifiers in the cascade with a low cost, and the more complicated inputs will go through more classifiers, leading to a higher cost. Early Exit scores documents gradually, dropping the ones with too low scores before fully evaluating them.
The results are shown in Figure 3. As can be seen, feature acquisition with relevance propagation reaches a higher NDCG@5 score than the other approaches. In addition, using multiple propagations (Relevance-MP) converges to a high score significantly faster than using the direct propagation algorithm (Relevance-DP).
In this paper, we have presented a novel approach for the cost-sensitive active feature acquisition problem. Our approach uses missing feature relevance as the core idea for choosing which features to acquire. We have demonstrated two different propagation algorithms using this approach. The first of them provided good results with one forward and one backward propagation per acquired feature, while the second algorithm improved the results further at the expense of a slightly increased computational cost. We evaluated the proposed algorithms on three realistic datasets: the Yahoo! Learning to Rank Competition dataset and two health datasets derived from the National Health and Nutrition Examination Survey (NHANES) dataset. Our results show that our first algorithm (direct propagation) performs well in most cases, while our second algorithm (multiple propagations) is more robust and outperforms the existing algorithms.
References
- Bach et al. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10 (7).
- Cambazoglu et al. (2010). Early exit optimizations for additive machine learned ranking systems. In Proceedings of the third ACM international conference on Web search and data mining, pp. 411–420.
- Centers for Disease Control and Prevention (2017). Diabetes home. Accessed July 23, 2018. https://www.cdc.gov/diabetes/basics/getting-tested.html
- Centers for Disease Control and Prevention (2018). Questionnaires, datasets, and related documentation. Accessed July 23, 2018. https://wwwn.cdc.gov/nchs/nhanes/Default.aspx
- Chandrashekar and Sahin (2014). A survey on feature selection methods. Computers and Electrical Engineering.
- Chapelle (2011). Yahoo! Learning to Rank Challenge Overview. JMLR: Workshop and Conference Proceedings 14, pp. 1–24.
- Chen et al. (2012). Classifier Cascade for Minimizing Feature Evaluation Cost. Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS).
- Chen et al. (2015). Computer adaptive testing using the same-decision probability. In CEUR Workshop Proceedings.
- Early et al. (2016). Test time feature ordering with FOCUS. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing - UbiComp ’16, pp. 992–1003.
- Järvelin and Kekäläinen (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems 20 (4), pp. 422–446.
- Kachuee et al. (2018). Dynamic feature acquisition using denoising autoencoders. IEEE Transactions on Neural Networks and Learning Systems.
- Kachuee et al. (2019). Opportunistic learning: budgeted cost-sensitive learning from data streams. In International Conference on Learning Representations.
- Kusner et al. (2014). Feature-cost sensitive learning with submodular trees of classifiers. Proceedings of the National Conference on Artificial Intelligence.
- Montavon et al. (2017). Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition.
- Nan and Saligrama (2017). Adaptive classification for prediction under a budget. In Advances in Neural Information Processing Systems, pp. 4727–4737.
- Trapeznikov et al. (2013). Multi-stage classifier design. Machine Learning 92 (2-3), pp. 479–502.
- Wang et al. (2015). Efficient Learning by Directed Acyclic Graph For Resource Constrained Prediction. Advances in Neural Information Processing Systems.
- Wang (2014). An LP for Sequential Learning Under Budgets. AISTATS.
- Xu et al. (2012). The Greedy Miser: Learning under Test-time Budgets. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 1175–1182.
- Xu et al. (2013). Cost-sensitive tree of classifiers. In International Conference on Machine Learning, pp. 133–141.