Minimax Classifier for Uncertain Costs
Abstract
Many studies on cost-sensitive learning assume that a unique cost matrix is known for a problem. However, this assumption may not hold for many real-world problems. For example, a classifier might need to be applied in several circumstances, each associated with a different cost matrix. Or, different human experts may have different opinions about the costs for a given problem. Motivated by these facts, this study aims to seek the minimax classifier over multiple cost matrices. We theoretically prove that, no matter how many cost matrices are involved, the minimax problem can be tackled by solving a number of standard cost-sensitive problems and subproblems that involve only two cost matrices. As a result, a general framework for obtaining the minimax classifier over multiple cost matrices is suggested and justified by preliminary empirical studies.
I Introduction
In many real-world classification problems, different types of misclassification commonly result in different costs. For example, in a fraud detection problem, predicting a normal client as fraudulent will cut the profit, while predicting a fraudulent client as normal would usually lead to a great loss [1]. In these scenarios, it is more desirable to minimize the total cost rather than the classification error. This kind of problem is referred to as the cost-sensitive learning problem [2], and it has attracted much interest in recent years due to its wide real-world applications [3, 4, 5].
So far, the majority of previous research on cost-sensitive learning assumes that the costs for different types of misclassification, typically represented as a cost matrix, are uniquely specified before the classifier is applied to new data. Specifically, if the cost matrix is known before the training procedure, it can be integrated into the learning algorithm to obtain a classifier with minimum total cost. This can be done by modifying the training data according to the cost matrix [6, 7], or by extending learning algorithms directly [8, 9]. In addition to these specialized methods, alternative approaches motivated by other learning problems can also be employed to address cost-sensitive learning. This category of methods, including calibration methods [10] and threshold moving [5] and its variants [11], typically post-processes the output of a classifier to optimize its performance with respect to an objective (e.g., minimizing the total cost or the classification error). In this sense, the cost matrix need not be known prior to the training phase, as long as it becomes available before testing. (Post-processing can sometimes be considered part of the training procedure; from this point of view, the cost matrix still needs to be specified before the training phase is finished.)
All the above-mentioned approaches assume that a unique cost matrix is known for a given cost-sensitive problem. Unfortunately, in the real world, it can be very difficult for a practitioner to specify the cost matrix uniquely, either because one may not have a clear sense of the exact values of the misclassification costs, or because the costs may vary under different circumstances and are thus uncertain in nature. In short, the cost matrix for a real-world problem may remain uncertain throughout both training and testing.
As a matter of fact, the difficulty of specifying a cost matrix has been acknowledged by many researchers. In the context of ROC analysis [12], it is claimed that a classifier can be built without any cost information while still performing well in scenarios where the cost matrix changes. Nevertheless, an underlying assumption behind this statement is that threshold moving (or some similar method) is employed to fine-tune the output of the classifier. Hence, as discussed above, a specified cost matrix is still required in the post-processing phase. Zadrozny and Elkan [13] considered the scenario where example-based misclassification costs are static but unknown. More recently, Liu and Zhou [14] investigated the problem of learning with cost intervals. Specifically, the misclassification cost is assumed to take a value within a predefined interval, and an approach is developed to train an SVM that performs well for every possible value of the cost.
Rather than striving to achieve satisfactory performance over all possible cost matrices, the aim of this work is to minimize the largest total cost over a finite set of possible cost matrices, i.e., to find the minimax classifier. Under mild assumptions, we prove that the minimax classifier over multiple cost matrices can be obtained by solving a set of standard cost-sensitive learning problems and a set of subproblems that involve only two cost matrices. This finding immediately suggests a general framework for seeking the minimax classifier over an arbitrary number of cost matrices. Moreover, since an interval can be transformed into a finite set of values via discretization, the framework is also applicable to scenarios where only the largest and smallest misclassification costs are available.
The rest of this paper is organized as follows. Preliminary background and related works are introduced in more detail in Section 2. Section 3 presents the theoretical analysis of the minimax problem. Experimental studies follow in Section 4, and we conclude the paper in Section 5.
II Preliminaries and Related Works
In this section, we first introduce the basic notation and background, and then review two works that are closely related to this study. One is the work of Liu and Zhou [14], which also deals with the uncertain cost problem but with a different formulation and learning target; the other focuses on finding the minimax classifier for uncertain class priors [15].
II-A Preliminaries
Consider a dataset $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i$ is the feature vector of instance $i$ and $y_i \in \{1, 2\}$ is the class label. Suppose that correct classification incurs no cost; a cost matrix can then be represented by two values $c_1$ and $c_2$, denoting the cost of misclassifying an instance from class 1 and class 2 respectively. Also, we use $\pi_1$ and $\pi_2$ to represent the class priors, so there are $\pi_1 n$ and $\pi_2 n$ instances in each class. For any classifier $f$ from the hypothesis space $F$, its total cost is,

(1)  $T(f) = c_1 \pi_1 e_1(f) + c_2 \pi_2 e_2(f)$

where $e_i(f)$ is the probability that $f$ misclassifies instances from class $i$ into the other class.
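As a concrete illustration, Eq. (1) transcribes directly into code; the variable names mirror the notation above, and the example numbers are hypothetical.

```python
def total_cost(e1, e2, pi1, pi2, c1, c2):
    # Eq. (1): T(f) = c1 * pi1 * e1(f) + c2 * pi2 * e2(f),
    # where e1, e2 are the per-class error rates, pi1, pi2 the class
    # priors, and c1, c2 the two misclassification costs.
    return c1 * pi1 * e1 + c2 * pi2 * e2

# A hypothetical classifier: 10% error on class 1, 20% on class 2,
# balanced priors, misclassifying class 1 is five times as costly.
cost = total_cost(0.10, 0.20, 0.5, 0.5, 5.0, 1.0)  # 0.25 + 0.10 = 0.35
```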
II-B Learning with Cost Intervals
In a recent work, Liu and Zhou [14] considered a special form of the uncertain cost problem where $c_2$ is fixed to $1$, and $c_1$ is uncertain but lies within a predefined interval $[C_{\min}, C_{\max}]$. Their objective is to construct a classifier that performs well for every individual cost within the interval. Technically, the problem was transformed into finding the best surrogate cost $\tilde{c}$ to train with, i.e., their learning target is,
(2)  $\tilde{c} = \arg\min_{c' \in [C_{\min}, C_{\max}]} \max_{c \in [C_{\min}, C_{\max}]} T_c(f_{c'})$

where $f_{c'}$ denotes the classifier trained with surrogate cost $c'$, and $T_c(\cdot)$ denotes the total cost evaluated at cost $c$.
An SVM-based algorithm was proposed there, which primarily minimizes the largest total cost (i.e., the total cost at $c = C_{\max}$) and secondarily minimizes the total cost at the mean cost $(C_{\min} + C_{\max})/2$. Solid experimental results reported there confirmed the efficacy of the method.
However, to fit the interval formulation, one needs to artificially rescale the original cost matrices by different factors to ensure that every $c_2$ equals $1$. Although this rescaling does no harm to traditional cost-sensitive learning or to the study in [14], it makes the comparison of total costs across different cost matrices meaningless. Considering that it is generally hard or even impossible to find a classifier that performs well for all costs over the interval (as suggested by [14] itself), the best classifier built this way may lead to a very large total cost on the original cost matrices of real-world problems.
II-C Minimax Classifier for Uncertain Class Priors
Among the many studies involving the minimax criterion [16], those that focus on building the minimax total cost classifier for uncertain class priors [15] are of particular interest to this study.
Formally, in the case of an uncertain class prior, the minimax classification problem is to find the following classifier,

(3)  $f^* = \arg\min_{f \in F} \max_{\pi_1 \in [0, 1]} T(f; \pi_1)$

where $T(f; \pi_1)$ is the total cost of $f$ under prior $\pi_1$ (with $\pi_2 = 1 - \pi_1$).
It is well known that the total cost of a fixed classifier is a linear function of the prior, while the optimal total cost (i.e., the Bayesian cost) is a concave function of the prior [17]. Therefore, if the best classifier for a given class prior $\pi$ is $f_\pi$, then the total cost function of $f_\pi$ w.r.t. the prior is a tangent line of the Bayesian total cost curve at $\pi$. Based on these elegant properties, Alaiz-Rodriguez et al. [15] proposed two algorithms based on neural network models to find the minimax classifier iteratively. Readers interested in the details of the algorithms are referred to that paper.
Noticing the deceptively symmetric positions of the prior and the cost in Eq. (1), one might think that all the analysis and algorithms for the uncertain prior problem could be employed directly for the uncertain cost problem concerned in this study. Unfortunately, that is not the case. Because both $c_1$ and $c_2$ are free variables (i.e., the sum-to-one property of the priors does not apply to the costs), the concavity of the Bayesian total cost with respect to the prior cannot be transferred to the costs. In the following, we approach the minimax problem for uncertain costs in a different way.
III Minimax Classifier for Uncertain Cost
III-A Problem Formulation
As mentioned above, this study focuses on minimizing the largest total cost over a finite set of possible cost matrices. Formally, given a set of cost matrices $C = \{C^{(1)}, \ldots, C^{(m)}\}$, where $C^{(k)} = (c_1^{(k)}, c_2^{(k)})$ is the $k$-th cost matrix, the learning target is to find,

(4)  $f^* = \arg\min_{f \in F} \max_{k \in \{1, \ldots, m\}} T_k(f)$

where $T_k(f)$ denotes the total cost of $f$ under cost matrix $C^{(k)}$.
Since the uncertain cost is formulated directly as a set, the problem is widely applicable in practice, ready for future study on multiclass problems, and amenable to theoretical analysis. On the other hand, the best classifier selected by the minimax criterion is much more reliable.
III-B Problem Analysis
For two different cost matrices $C^{(a)}$ and $C^{(b)}$ in $C$, if both $c_1^{(a)} \ge c_1^{(b)}$ and $c_2^{(a)} \ge c_2^{(b)}$ (with at least one inequality strict), then the total cost of any classifier under $C^{(b)}$ will be no larger than that under $C^{(a)}$. In this case, we say that $C^{(b)}$ is dominated by $C^{(a)}$. Furthermore, if there exists a cost matrix that dominates all others in $C$, the above minimax problem reduces to a standard cost-sensitive learning problem with that fixed cost matrix. Therefore, given a minimax classification problem over a set of cost matrices, the first step one should take is to check for and delete cost matrices that are dominated by any other cost matrix in $C$.
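A minimal sketch of this pruning step; representing each cost matrix simply by its pair of misclassification costs is an assumption for illustration.

```python
def dominates(a, b):
    # Cost matrix a = (c1, c2) dominates b when each cost in a is at
    # least the corresponding cost in b and the two differ: every
    # classifier then incurs at least as large a total cost under a.
    return a[0] >= b[0] and a[1] >= b[1] and a != b

def prune_dominated(matrices):
    # Dominated matrices can never attain the maximum total cost,
    # so they are irrelevant to the minimax problem.
    return [m for m in matrices if not any(dominates(o, m) for o in matrices)]

cost_set = [(5.0, 1.0), (2.0, 3.0), (1.0, 1.0)]  # (1.0, 1.0) is dominated
pruned = prune_dominated(cost_set)
```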
On the other hand, the performance of a classifier from the hypothesis space $F$ can be mapped to a point in the 2-D space with $e_1$ as the x-axis and $e_2$ as the y-axis. Similarly, for two different classifiers $f_a$ and $f_b$, if $e_1(f_a) \le e_1(f_b)$ and $e_2(f_a) \le e_2(f_b)$ (with at least one inequality strict), then $T(f_a) \le T(f_b)$ no matter what the cost matrix is. In this case, we say that $f_a$ dominates $f_b$. If a classifier is not dominated by any other classifier in $F$, it is a nondominated classifier. Following the concept in economics [18], the front formed by all nondominated classifiers in $F$ is named the Pareto front (see Fig. 1). When $F$ is an infinite hypothesis space and the dataset consists of enough samples, the front is continuous. Obviously, for both the standard cost-sensitive learning problem and the minimax problem concerned in this study, the optimal classifiers must lie on the Pareto front.
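The nondominated filtering over classifiers can be sketched in the same way; summarizing each classifier only by its error pair (e1, e2) is an assumption for illustration.

```python
def pareto_front(points):
    # Keep the classifiers, represented by their error pairs (e1, e2),
    # that are not dominated: no other point is at least as good on
    # both error rates while differing from them.
    return sorted(p for p in points
                  if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                             for q in points))

classifiers = [(0.1, 0.4), (0.2, 0.2), (0.3, 0.3), (0.4, 0.1)]
front = pareto_front(classifiers)  # (0.3, 0.3) is dominated by (0.2, 0.2)
```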
Let us first consider the situation with only one cost matrix $C^{(k)}$; Lemma 1 reveals the relative order between the total costs of any two classifiers on the front.
Lemma 1.
For any two classifiers $f_1$, $f_2$ on the Pareto front with $e_1(f_1) < e_1(f_2)$ and $e_2(f_1) > e_2(f_2)$, let $s = \frac{e_2(f_1) - e_2(f_2)}{e_1(f_2) - e_1(f_1)}$ denote the absolute value of the slope of the segment connecting them. The following conclusions hold: if $s > c_1 \pi_1 / (c_2 \pi_2)$, then $T(f_1) > T(f_2)$; if $s = c_1 \pi_1 / (c_2 \pi_2)$, then $T(f_1) = T(f_2)$; if $s < c_1 \pi_1 / (c_2 \pi_2)$, then $T(f_1) < T(f_2)$.
Proof.
By Eq. (1), $T(f_1) - T(f_2) = c_1 \pi_1 \left( e_1(f_1) - e_1(f_2) \right) + c_2 \pi_2 \left( e_2(f_1) - e_2(f_2) \right)$. Dividing by $c_2 \pi_2 \left( e_1(f_2) - e_1(f_1) \right) > 0$ shows that the sign of $T(f_1) - T(f_2)$ equals the sign of $s - c_1 \pi_1 / (c_2 \pi_2)$, which gives the three cases. ∎
Notice that the left-hand side of each case in Lemma 1, $s$, is the absolute value of the slope of the segment connecting $f_1$ and $f_2$, which is determined by the classifiers' performance, while $c_1 \pi_1 / (c_2 \pi_2)$ is a constant given the dataset and the cost matrix. That is, geometrically, the relative order between the total costs of a pair of classifiers on the front is determined by the slope of the segment connecting them.
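A small numeric check of this slope rule; the error rates, priors, and costs below are hypothetical.

```python
# Two classifiers on the front, given by their error pairs (e1, e2).
# Lemma 1 says that comparing the absolute slope of the segment
# joining them against c1*pi1/(c2*pi2) predicts which classifier
# has the lower total cost.
def total_cost(f, pi1, pi2, c1, c2):
    return c1 * pi1 * f[0] + c2 * pi2 * f[1]

f1, f2 = (0.1, 0.4), (0.3, 0.2)            # e1(f1) < e1(f2), e2(f1) > e2(f2)
pi1, pi2, c1, c2 = 0.5, 0.5, 3.0, 1.0

slope = (f1[1] - f2[1]) / (f2[0] - f1[0])  # |slope| of the segment: 1.0
threshold = (c1 * pi1) / (c2 * pi2)        # c1*pi1/(c2*pi2) = 3.0
# slope < threshold, so Lemma 1 predicts T(f1) < T(f2).
```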
Furthermore, since all dominated classifiers can be ignored w.r.t. our problem, the total cost in Eq. (1) can be treated as a function of the classifiers on the Pareto front. For brevity, we further consider it as a function of $e_1$, keeping in mind that $e_2$ is determined correspondingly. Hereafter, we denote the total cost function for cost matrix $C^{(k)}$ as $T_k$. The following lemma describes the behavior of $T_k$ along the front.
Lemma 2.
Assume the Pareto front is convex. (Analogous to the ROCCH technique, in case the Pareto front is not convex, one can construct the convex hull of all nondominated classifiers as a surrogate Pareto front; please refer to [12], particularly Theorem 7 there, for further details.) Then the total cost function first decreases monotonically to its minimum, and then increases monotonically over the front.
Proof.
Given any three adjacent classifiers $f_1$, $f_2$, $f_3$ on the front, without loss of generality we suppose $e_1(f_1) < e_1(f_2) < e_1(f_3)$. Since the curve of the Pareto front is decreasing and convex, $f_3$ must lie on the right side of the line passing through $f_1$ and $f_2$. Combined with the fact that $e_2(f_1) > e_2(f_2)$ and $e_2(f_2) > e_2(f_3)$, we have
$$\frac{e_2(f_1) - e_2(f_2)}{e_1(f_2) - e_1(f_1)} \ge \frac{e_2(f_2) - e_2(f_3)}{e_1(f_3) - e_1(f_2)}.$$
Since $f_1$, $f_2$, $f_3$ are three arbitrary adjacent classifiers on the front, it follows that the absolute value of the slope is monotonically nonincreasing along the front.
Suppose the classifier of minimal total cost for cost matrix $C^{(k)}$ is $f_k^*$, with total cost $T_k(f_k^*)$. Then, according to Lemma 1, $T_k$ first decreases monotonically to $T_k(f_k^*)$, and then increases monotonically.
∎
In fact, Lemma 2 describes the behavior of the total cost function for the standard cost-sensitive learning problem (i.e., with only one cost matrix), and many cost-sensitive learning methods published in the literature could hopefully be used to find the minimum point. See Fig. 2 for an illustration. (Note that the total cost curve is drawn for illustration purposes; the convexity it appears to have is neither implied nor proved.)
Now we are ready to consider the situation with multiple cost matrices. For a set of $m$ cost matrices, there are $m$ total cost curves correspondingly. Each of them first decreases to its own minimum and then increases. Fig. 3 shows an example consisting of two cost matrices. We can see that, in this case, the minimax total cost is located at the crossing point of the two curves. In general, the position of the minimax classifier for multiple cost matrices is confined by the following theorem.
Theorem 1.
For $m$ total cost curves, each of which first decreases to its own minimum and then increases monotonically, the minimax total cost is located at one of the following two types of positions:

the minimum point of an individual curve (a type-1 position), or

a point where two curves cross (a type-2 position).
Proof.
Suppose the minimax total cost is located at neither of the two types of positions; without loss of generality, assume it lies on the total cost curve of $C^{(k)}$ (i.e., on $T_k$). Denoting the minimax classifier by $f^*$, we have,
$$T_k(f^*) > T_j(f^*), \quad \forall j \ne k.$$
The inequality is strict because the minimax total cost is not obtained at a type-2 point. On the other hand, since the minimax total cost is not obtained at a type-1 point either, according to Lemma 2 we can find another classifier $f'$ close to $f^*$ on the front such that,
$$T_k(f') < T_k(f^*) \quad \text{and} \quad T_j(f') < T_k(f^*), \quad \forall j \ne k,$$
where the second condition holds by the continuity of the total cost curves.
This means the minimax total cost can be reduced, which contradicts the definition of minimax. Therefore, the theorem is proved. ∎
So, in order to find the minimax classifier, we just need to examine every classifier corresponding to the two types of positions. However, without further information, any pair of total cost curves may cross each other several times in practice, so it would be very expensive or even impossible to examine all these points without omission. Fortunately, this obstacle can be removed elegantly by the following corollary.
Corollary 1.
For a set of cost matrices $C$, the minimax total cost classifier belongs to one of the following two categories:

classifiers that minimize the total cost for an individual cost matrix, or

classifiers that minimax the total cost over a pair of cost matrices.
Proof.
According to Theorem 1, if the minimax total cost is obtained at one of the type-1 positions, then the minimax classifier falls into the first category, and the corollary is true.
Otherwise, the minimax total cost is obtained at a crossing point $f^*$ of total cost curves. At least two of the total cost curves must have different monotonicity at the crossing point; otherwise, we could move in the direction along which all the involved curves decrease, reducing the minimax total cost. Let $T_i$ be decreasing and $T_j$ be increasing at $f^*$. Then we know from Lemma 2 that any other classifier $f$ satisfies $\max(T_i(f), T_j(f)) > T_i(f^*) = T_j(f^*)$: moving to one side of $f^*$ increases $T_i$, and moving to the other side increases $T_j$. Hence the crossing point also gives the minimax total cost for the pair $\{C^{(i)}, C^{(j)}\}$, so the corollary is also true in this case.
∎
According to Corollary 1, the minimax classification problem over multiple cost matrices reduces to solving a set of standard cost-sensitive learning problems and a set of subproblems that involve only two cost matrices, avoiding the need to consider trade-offs among more than two cost matrices at once. Finally, the framework for solving the minimax classification problem over a set of cost matrices is summarized in Algorithm 1.
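As a rough sketch of the framework (the listing of Algorithm 1 is not reproduced here, so the function names and the representation of candidate classifiers by their error pairs are assumptions):

```python
def total_cost(f, c, priors=(0.5, 0.5)):
    # f = (e1, e2) summarizes a candidate classifier; c = (c1, c2).
    return c[0] * priors[0] * f[0] + c[1] * priors[1] * f[1]

def minimax_classifier(candidates, matrices):
    # Corollary 1: the minimax classifier is either (i) a minimum-cost
    # classifier for a single matrix or (ii) a minimax classifier for a
    # pair of matrices, so only those candidates need to be compared.
    pool = set()
    for c in matrices:                            # type (i) candidates
        pool.add(min(candidates, key=lambda f: total_cost(f, c)))
    for i in range(len(matrices)):                # type (ii) candidates
        for j in range(i + 1, len(matrices)):
            pair = (matrices[i], matrices[j])
            pool.add(min(candidates,
                         key=lambda f: max(total_cost(f, c) for c in pair)))
    # Return the pooled candidate with the smallest worst-case cost.
    return min(pool, key=lambda f: max(total_cost(f, c) for c in matrices))

candidates = [(0.1, 0.4), (0.2, 0.2), (0.4, 0.1)]
matrices = [(3.0, 1.0), (1.0, 3.0)]
best = minimax_classifier(candidates, matrices)
```

Here the balanced class distribution between the two cost matrices makes the middle candidate the minimax solution, even though each extreme candidate is optimal for one individual matrix.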
IV Experiments
In the experiments, we compared three frameworks for solving the minimax problem. The first builds the minimum total cost classifier for each possible cost matrix, without considering any trade-off among cost matrices, and then picks out the minimax classifier among them; the second is our framework described above; and the third builds the minimax classifier directly, with all possible cost matrices under consideration simultaneously. For brevity, we denote these three frameworks as S, SP, and M respectively.
IV-A Implementation
Although there are many standard cost-sensitive learning methods, striving to minimize the total cost for one cost matrix, that can be used to implement S and one part of SP, to the best of our knowledge there is no particular method that can be used to implement M or the other part of SP (i.e., minimaxing the total cost over two or more cost matrices). Hence, for comparison purposes, we adopted a simplified form of the Generalized Additive Model (GAM) to implement all three frameworks. The empirical studies presented below are therefore preliminary, and only intend to serve as a baseline for future study.
The GAM used to implement all the compared frameworks is,
(5)  $H(\mathbf{x}) = \operatorname{sign}\left( \sum_{t=1}^{T} h_t(\mathbf{x}) \right)$
where $T$ is the number of iterations, and each $h_t$ is a decision stump whose output is $+1$ or $-1$. At each iteration, we add one decision stump such that the current ensemble of decision stumps improves on the predefined objective. This process repeats until the iteration budget runs out or no improvement is possible.
With this simple GAM procedure, we are able to implement the three above-mentioned frameworks. That is, all necessary building blocks can be generated by setting the "predefined objective" to minimizing the total cost for a single cost matrix, minimaxing the total cost for a pair of cost matrices, or minimaxing the total cost for a set of cost matrices.
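A hedged sketch of this procedure follows; the exhaustive stump search and the plain training-error objective are simplifying assumptions, not the exact implementation used in the paper.

```python
def stump(feature, threshold, sign):
    # A decision stump whose output is +1 or -1.
    return lambda x: sign if x[feature] > threshold else -sign

def predict(ensemble, x):
    # Simplified additive model: an unweighted vote over the stumps.
    return 1 if sum(h(x) for h in ensemble) > 0 else -1

def fit_ensemble(X, y, objective, n_iter=20):
    # Greedily add one stump per iteration whenever some candidate
    # stump improves the predefined objective; stop when the iteration
    # budget runs out or no candidate improves it.
    ensemble, best = [], objective([])
    for _ in range(n_iter):
        cands = [ensemble + [stump(f, t, s)]
                 for f in range(len(X[0]))
                 for t in sorted({x[f] for x in X})
                 for s in (1, -1)]
        cand = min(cands, key=objective)
        if objective(cand) >= best:
            break
        ensemble, best = cand, objective(cand)
    return ensemble

# Toy 1-D data; the objective here is plain training error, but the
# same loop accepts a total-cost or minimax-cost objective instead.
X = [(-1.0,), (-0.5,), (0.5,), (1.0,)]
y = [-1, -1, 1, 1]
err = lambda ens: sum(predict(ens, x) != t for x, t in zip(X, y))
model = fit_ensemble(X, y, err)
```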
IV-B Experimental Setup
Ten datasets from the UCI machine learning repository [19] were used in the experiments. Brief information about these datasets is summarized in Table I.
Dataset  No. of Features  Class Distribution 

australian  14  307:383 
crx  16  307:383 
german  24  700:300 
heart  13  150:120 
hillvalley  100  612:600 
housevotes  16  168:267 
krvskp  36  1669:1527 
mushroom  22  3916:4208 
sonar  60  97:111 
wdbc  30  357:212 
Most of these ten classification problems are originally real-world cost-sensitive problems; for example, the australian, crx, and german problems are fraud detection problems, while the heart, mushroom, and wdbc problems are related to people's health. For these problems, the misclassification cost matrix is usually hard, if not impossible, for practitioners to specify, so experiments on them are appropriate.
For each of the datasets, we compared the 3 frameworks on 4 sets of cost matrices of different cardinalities: sets of 3, 5, 10, and 20 cost matrices. The value of each element of the matrices is randomly generated within a fixed range. Besides, it is ensured in advance that there is no dominated cost matrix in any set.
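One way to generate such a cost set is rejection sampling; the cost range used below is an assumption, since the exact range is not recoverable from the text.

```python
import random

def sample_cost_set(m, low=1.0, high=10.0, seed=0):
    # Rejection-sample m cost matrices (c1, c2) so that no matrix in
    # the returned set is dominated by another (i.e., both costs no
    # smaller while the matrices differ). The range [low, high] is
    # a hypothetical choice, not the paper's setting.
    rng = random.Random(seed)

    def dominated(a, b):  # True when b dominates a
        return b[0] >= a[0] and b[1] >= a[1] and a != b

    cost_set = []
    while len(cost_set) < m:
        c = (rng.uniform(low, high), rng.uniform(low, high))
        if not any(dominated(c, o) or dominated(o, c) for o in cost_set):
            cost_set.append(c)
    return cost_set

costs = sample_cost_set(5)
```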
The iteration number in the GAM was fixed in advance, and a 20-times 5-fold cross validation procedure was employed to obtain stable results. Hence, for each configuration of (dataset, set of cost matrices, compared method), there are $20 \times 5 = 100$ total cost values. Based on these values, we furthermore conducted the Wilcoxon signed rank test between the SP method and the other two methods at a fixed significance level.
IV-C Results
Table II and Table III present the comparisons over each dataset on training and testing data respectively. The value in each cell is the average total cost over the 20 times 5 folds, and the best performance for each (dataset, cost set) configuration is in boldface. Moreover, the results of the Wilcoxon signed rank test are denoted as superscripts on the values of the S and M methods: one superscript marks cases where SP is significantly better than the corresponding method, another marks cases where it is significantly worse, and no superscript means there is no statistically significant difference between SP and the corresponding method.
In summary, we can see that SP outperforms the other two methods in almost all cases, and remains statistically comparable in the remaining few. There is no case in which SP is statistically worse.
Of course, it is not surprising at all that SP defeats S completely in the experiments, since SP always checks a superset of the classifiers considered by S. But these results at least provide evidence that the S framework is not adequate for uncertain-cost problems. Moreover, on a closer examination of the results in each fold, we can see that the performances of S and SP are sometimes identical, and SP is better when they are not. This is consistent with Corollary 1, since the classifier obtained by S could, in theory, be the optimal minimax classifier.
On the other hand, the superiority of SP over M is more interesting. Unlike the S framework, M searches the hypothesis space with the true learning target directly (i.e., the minimax target). Therefore, the most plausible explanation is that the implementation of the M framework is not effective enough, since ideally it could perform as well as the SP framework. However, similar to the difficulties encountered in multiclass classification problems, designing algorithms that can handle multiple trade-offs simultaneously is never trivial.
In summary, although we implemented the three compared frameworks with a preliminary and less effective model, the results reported in this paper confirm the efficacy of Corollary 1. Once we are equipped with specially designed methods that can solve the minimax problem over only two cost matrices effectively, it would be exciting to see the full advantage of the SP framework.
V Conclusions and Discussions
For many real-world cost-sensitive learning problems, the costs associated with misclassifications are uncertain in nature. Many existing cost-sensitive learning algorithms, which require exact cost information (e.g., a unique cost matrix) to be available, are not applicable to these problems. In this paper, we consider the situation where the cost information is provided as a set of cost matrices, and aim to achieve the minimax classifier over the cost matrices. It is theoretically proved that the classifier with minimax total cost is either the optimal classifier for a single cost matrix in the set, or the minimax classifier over a pair of cost matrices in the set. This result immediately leads to a framework for achieving the minimax classifier over an arbitrary number of cost matrices. Furthermore, the framework is also applicable in cases where the cost information is provided as an infinite set, e.g., as intervals, by combining it with an appropriate sampling/discretization procedure. A preliminary empirical study has justified the efficacy of the framework.
Although many algorithms exist for standard cost-sensitive learning problems, achieving the minimax classifier over a pair of cost matrices remains the major technical obstacle. Therefore, novel algorithms should be developed for this purpose to exploit the usefulness of our framework to the full extent. Furthermore, the theoretical analysis conducted in this work needs to be extended to multiclass problems so that the resultant framework can be generalized. These issues will be investigated in the future.
References
 [1] R. J. Bolton and D. J. Hand, “Statistical fraud detection: A review,” Statistical Science, vol. 17, no. 3, pp. 235–255, 2002.
 [2] C. Elkan, “The foundations of cost-sensitive learning,” in Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2001, pp. 973–978.
 [3] N. Abe, B. Zadrozny, and J. Langford, “An iterative method for multi-class cost-sensitive learning,” in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 3–11.
 [4] Z.-H. Zhou and X.-Y. Liu, “On multi-class cost-sensitive learning,” in Proceedings of the 21st National Conference on Artificial Intelligence, 2006, pp. 567–572.
 [5] J. P. Dmochowski, P. Sajda, and L. C. Parra, “Maximum likelihood in cost-sensitive learning: Model specification, approximations, and upper bounds,” Journal of Machine Learning Research, vol. 11, pp. 3313–3332, 2010.
 [6] P. Domingos, “MetaCost: A general method for making classifiers cost-sensitive,” in Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 155–164.
 [7] K. M. Ting, “An instance-weighting method to induce cost-sensitive trees,” IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 3, pp. 659–665, 2002.
 [8] Z.-H. Zhou and X.-Y. Liu, “Training cost-sensitive neural networks with methods addressing the class imbalance problem,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77, 2006.
 [9] H. Masnadi-Shirazi and N. Vasconcelos, “Cost-sensitive boosting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 2, pp. 294–309, 2011.
 [10] B. Zadrozny and C. Elkan, “Transforming classifier scores into accurate multiclass probability estimates,” in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 694–699.
 [11] C. Bourke, K. Deng, S. D. Scott, R. E. Schapire, and N. V. Vinodchandran, “On reoptimizing multi-class classifiers,” Machine Learning, vol. 71, no. 2–3, pp. 219–242, 2008.
 [12] F. Provost and T. Fawcett, “Robust classification for imprecise environments,” Machine Learning, vol. 42, pp. 203–231, 2001.
 [13] B. Zadrozny and C. Elkan, “Learning and making decisions when costs and probabilities are both unknown,” in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, pp. 204–213.
 [14] X.-Y. Liu and Z.-H. Zhou, “Learning with cost intervals,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 403–412.
 [15] R. Alaiz-Rodriguez, A. Guerrero-Curieses, and J. Cid-Sueiro, “Minimax classifiers based on neural networks,” Pattern Recognition, vol. 38, no. 1, pp. 29–39, 2005.
 [16] D.-Z. Du and P. M. Pardalos, Eds., Minimax and Applications. Kluwer Academic Publishers, 1995.
 [17] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: John Wiley, 2001.
 [18] V. Pareto, Cours d’Economie Politique. Geneve: Librairie Droz, 1964.
 [19] A. Asuncion and D. Newman, “UCI machine learning repository,” http://www.ics.uci.edu/mlearn/MLRepository.html, 2007.