What are the Differences between Bayesian Classifiers and Mutual-Information Classifiers?
In this study, both Bayesian classifiers and mutual-information classifiers are examined for binary classifications with or without a reject option. The general decision rules in terms of distinctions on error types and reject types are derived for Bayesian classifiers. A formal analysis is conducted to reveal the parameter redundancy of cost terms when abstaining classifications are enforced. The redundancy implies an intrinsic problem of “non-consistency” for interpreting cost terms. If no data is given to the cost terms, we demonstrate the weakness of Bayesian classifiers in class-imbalanced classifications. On the contrary, mutual-information classifiers are able to provide an objective solution from the given data, which shows a reasonable balance among error types and reject types. Numerical examples of using two types of classifiers are given for confirming the theoretical differences, including the extremely-class-imbalanced cases. Finally, we briefly summarize the Bayesian classifiers and mutual-information classifiers in terms of their application advantages, respectively.
Bayes, entropy, mutual information, error types, reject types, abstaining classifier, cost sensitive learning.
The Bayesian principle provides a powerful and formal means of dealing with statistical inference in data processing, such as classifications. If classifiers are designed based on this principle, they are called “Bayesian classifiers” in this work. The learning targets for Bayesian classifiers are either the minimum error or the lowest cost. It was recognized that Chow was “among the earliest to use Bayesian decision theory for pattern recognition”. His pioneering work is so enlightening that its idea of an optimal tradeoff between error and reject still helps us to deepen our understanding of the subject, as well as to explore its applications widely in this information-explosion era. In recent years, cost sensitive learning and class-imbalanced learning have received much attention in various applications [12-18]. For classifications of imbalanced, or skewed, datasets, “the ratio of the small to the large classes can be drastic such as 1 to 100, 1 to 1,000, or 1 to 10,000 (and sometimes even more)”. It was pointed out by Yang and Wu that dealing with imbalanced and cost-sensitive data is among the ten most challenging problems in the study of data mining. In fact, the related subjects are not a new challenge but a more crucial concern than before, given the increasing need to search for useful information in massive data. Binary classifications will be a basic problem in such an application background. Classification based on cost terms for the tradeoff of error types is a conventional subject in medical diagnosis. A misclassification from a “type I error” (or “false positive”) differs significantly from one from a “type II error” (or “false negative”) in the context of medical practice. In other application domains, one also needs to discern error types to attain reasonable results in classifications.
Among all these investigations, cost terms, which are usually specified by users from a cost matrix, play a key role in class-imbalanced learning [11-14].
In binary classifications with a reject option, Bayesian classifiers require a cost matrix with six cost terms as the given data. Besides the prior probabilities of classes, this requirement can be another source of subjectivity that disqualifies Bayesian classifiers as an objective approach of induction. If objectivity is enforced for classifications with a reject option, Bayesian classifiers face a difficulty in assigning cost terms objectively. The cost terms for error types may be given from an application background, but are generally unknown for reject types. In binary classifications, Chow and early researchers usually assumed no distinctions among errors and among rejects. A later study considered different costs for correct classification and misclassifications, but not for rejects. More general settings for distinguishing error types and reject types were also reported. To overcome the problems of presetting cost terms manually, Pietraszek proposed two learning models, namely, “bounded-abstention” and “bounded-improvement”, and Grall-Maës and Beauseroy applied a strategy of adding performance constraints for class-selective rejection. If constraints are imposed either on the total reject or on the total error, they may result in no distinctions between the associated cost terms. Up to now, it seems that no study has been reported on the objective design of Bayesian classifiers that distinguishes error types and reject types at the same time.
Several investigations following Chow’s rule have been reported on classifier designs with a reject option [21-30]. In addition to the “ambiguity reject” studied by Chow, another kind, “distance reject”, was also considered. Ambiguity reject is made for a pattern located in an ambiguous region between or among classes. Distance reject represents a pattern far away from the means of all classes and is conventionally called an “outlier” in statistics. Ha proposed another important kind of reject, called “class-selective reject”, which defines a subset of classes. This scheme is more suitable for multiple-class classifications. For example, in three-class problems, Ha’s classifiers will output predictions including “ambiguity reject between Class 1 and 2”, “ambiguity reject among Class 1, 2 and 3”, and the other rejects from class combinations. Multiple rejects with such distinctions will be more informative than a single “ambiguity reject”. In all these investigations, the Bayesian principle is again applied as the design guideline for the classifiers.
While the Bayesian inference principle is widely applied in classifications, another principle based on the mutual information concept is rarely adopted for designing classifiers. Mutual information is one of the important definitions in entropy theory. Entropy is considered as a measure of uncertainty within random variables, and mutual information is the relative entropy between the joint distribution of two random variables and the product of their marginal distributions. If classifiers seek to maximize this relative entropy as their learning target, we refer to them as “mutual-information classifiers”. It seems that Quinlan was among the earliest to apply the concept of mutual information (but called “information gain” in his famous ID3 algorithm) in constructing the decision tree. Kvålseth and Wickens introduced the definition of normalized mutual information (NMI) for assessing a contingency table, which laid the foundation for the relationship between a confusion matrix and mutual information. As pioneers in using an information-based criterion for classifier evaluations, Kononenko and Bratko suggested the term “information score”, which was equivalent to the definition of mutual information. A research team led by Principe proposed a general framework, called “Information Theoretic Learning (ITL)”, for designing various learning machines, in which they suggested that mutual information, or other information theoretic criteria, can be set as an objective function in classifier learning. Mackay (page 533) once showed numerical examples for several given confusion matrices, and he suggested applying mutual information for ranking the classifier examples. Wang and Hu derived the nonlinear relations between mutual information and the conventional performance measures, such as accuracy, precision, recall and F1 measure, for binary classifications.
In , a general formula for normalized mutual information was established with respect to the confusion matrix for multiple-class classifications with/without a reject option, and the advantages and limitations of mutual-information classifiers were discussed. However, no systematic investigation is reported for a theoretical comparison between Bayesian classifiers and mutual-information classifiers in the literature.
This work focuses on exploring the theoretical differences between Bayesian classifiers and mutual-information classifiers in classifications with or without a reject option. In particular, this paper derives much from, and consequently extends, Chow’s work by distinguishing error types and reject types. To achieve analytical tractability without losing generality, we adopt the simplest yet most meaningful assumptions for classification problems. The following assumptions are the same as those in the closed-form studies of Bayesian classifiers by Chow and Duda, et al:
Classifications are made for two categories (or classes) over the feature variables.
All probability distributions of feature variables are exactly known.
One may argue that the assumptions above are too restrictive to offer practical generality in solving real-world problems. In fact, the power of Bayesian classifiers does not lie in their exact solutions to theoretical problems, but in their generic inference principle for guiding real applications, even under extreme approximations to the theory. We fully recognize that the assumption of complete knowledge of the relevant probability distributions may never hold in real-world problems. The closed-form solutions of Bayesian classifiers on binary classifications have demonstrated useful design guidelines that are applicable to multiple classes. The author believes that the analysis based on the assumptions above will provide sufficient information for revealing the theoretical differences between Bayesian classifiers and mutual-information classifiers, while the intended simplifications will help readers reach a better, or deeper, understanding of the advantages and limitations of each type of classifier.
The contributions of this work are twofold. First, the analytical formulas for Bayesian classifiers and mutual-information classifiers are derived to include the general cases with distinctions among error types and reject types for cost sensitive learning in classifications. Second, comparisons are conducted between the two types of classifiers to reveal their similarities and differences. Specific efforts are made on a formal analysis of the parameter redundancy of cost terms for Bayesian classifiers when a reject option is applied. Section II presents a general decision rule of Bayesian classifiers with or without a reject option. Section III provides the basic formulas for mutual-information classifiers. Section IV investigates the similarities and differences between the two types of classifiers, and numerical examples are given to highlight the distinct features in their applications. The question posed in the title of the paper is answered briefly in Section V.
II Bayesian Classifiers with a Reject Option
II-A General Decision Rule for Bayesian Classifiers
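Let $\mathbf{x}$ be a random pattern satisfying $\mathbf{x} \in \mathbb{R}^d$, which is in a $d$-dimensional feature space and will be classified. The true (or target) state $t$ of $\mathbf{x}$ is within the finite set of two classes, $t \in \{t_1, t_2\}$, and the possible decision output $y = f(\mathbf{x})$ is within three classes, $y \in \{y_1, y_2, y_3\}$, where $f$ is a function for classifications and $y_3$ represents a “reject” class. Let $p(t_i)$ be the prior probability of class $t_i$ and $p(\mathbf{x}|t_i)$ be the conditional probability density function of $\mathbf{x}$ given that it belongs to class $t_i$. The posterior probability $p(t_i|\mathbf{x})$ is calculated through the Bayes formula:
$$p(t_i|\mathbf{x}) = \frac{p(\mathbf{x}|t_i)\,p(t_i)}{p(\mathbf{x})}, \quad i = 1, 2, \qquad (1)$$
where $p(\mathbf{x}) = \sum_{i=1}^{2} p(\mathbf{x}|t_i)\,p(t_i)$ represents the mixture density for normalizing the probability. Based on the posterior probability, the Bayesian rule assigns a pattern $\mathbf{x}$ to the class that has the highest posterior probability. Chow first introduced the framework of Bayesian decision theory into the study of pattern recognition and derived the best error-type trade-off formulas and the related optimal reject rule. The purpose of the reject rule is to minimize the total risk (or cost) in classifications. Suppose $\lambda_{ij}$ is a cost term for the true class of a pattern to be $t_i$, but decided as $y_j$. Then, the conditional risk for classifying a particular $\mathbf{x}$ into $y_j$ is defined as:
$$Risk(y_j|\mathbf{x}) = \sum_{i=1}^{2} \lambda_{ij}\,p(t_i|\mathbf{x}), \quad j = 1, 2, 3, \qquad (2)$$
$$\lambda_{ik} > \lambda_{i3} > \lambda_{ii}, \quad i \neq k, \quad i, k = 1, 2. \qquad (3)$$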
The constraints imply that a misclassification will suffer a higher cost than a rejection, and a rejection will cost more than a correct classification. Relations about are the main concern in the study of cost-sensitive learning, and this issue will be addressed later in this work. The total risk for the decision output will be :
with integration over the entire observation space .
Definition 1 (Bayesian classifier)
If a classifier is determined from the minimization of its risk over all patterns:
or in another form on a given pattern x:
this classifier is called “Bayesian classifier”, or “Chow’s abstaining classifier” . The term of is usually called “Bayesian risk”, or “Bayesian error” in the cases that zero-one cost terms () are used for no rejection classifications.
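As a minimal sketch of Definition 1 for a single pattern, the Python snippet below evaluates the conditional risk of each of the three decisions from the posterior probabilities and a two-by-three cost matrix, and then picks the decision with the smallest risk. The function name `bayes_decision`, the matrix layout (rows for the true classes, columns for Class 1, Class 2 and reject) and the numerical costs are illustrative assumptions, not values taken from this work.

```python
from typing import Sequence


def bayes_decision(posteriors: Sequence[float],
                   cost: Sequence[Sequence[float]]) -> int:
    """Return 0 (Class 1), 1 (Class 2) or 2 (reject), whichever has minimum risk.

    posteriors: [p(t1|x), p(t2|x)], assumed to sum to one.
    cost[i][j]: assumed cost of deciding column j when the true class is row i.
    """
    # Conditional risk of decision j: sum over true classes of cost * posterior.
    risks = [sum(cost[i][j] * posteriors[i] for i in range(2)) for j in range(3)]
    return min(range(3), key=lambda j: risks[j])


# Hypothetical costs: zero for correct, 1 for errors, 0.2 for either reject.
cost = [[0.0, 1.0, 0.2],
        [1.0, 0.0, 0.2]]

print(bayes_decision([0.95, 0.05], cost))  # a confident pattern
print(bayes_decision([0.55, 0.45], cost))  # an ambiguous pattern
```

With these assumed costs, a pattern is rejected whenever neither posterior is large enough to make a class decision cheaper than the reject cost.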
In , a single threshold for a reject option was investigated. This setting was obtained under the assumption that cost terms are applied without distinction among errors and among rejects. Following Chow’s approach but extending it to general cost terms, one can derive the general decision rule on rejection for Bayesian classifiers.
The general decision rule for Bayesian classifiers is:
See Appendix A.
Note that eq. (6d) suggests general constraints over . The necessity for having such constraints is explained in Appendix A. A graphical interpretation of the two thresholds is illustrated in Fig. 1. Based on eq. (6c), the thresholds can be calculated from the following formulas:
Eq. (7) describes general relations between thresholds and cost terms on binary classifications, which enables the classifiers to make the distinctions among errors and among rejects. Note that the special settings of Chow’s rules  can be derived from eq. (7):
Another important relation in  can also be obtained:
Pietraszek derived the rational region of above through ROC curves. In that setting, the error costs can differ, but the reject costs cannot. Note, however, that the rejection thresholds will be different when . For advanced applications, Vanderlooy, et al generalized Chow’s rules by distinguishing error types and reject types, and derived the relations between two “likelihood ratio thresholds” and cost terms. Their rules of missing the terms and are not theoretically general, yet sufficient for applications. They derived formulas only from the inequality constraints of and , respectively. Up to now, it seems no one has reported the general constraints (6d) in the literature. Based on eq. (6d), one can derive the rational constraints (3), rather than relying on intuition.
By applying eq. (1) and the constraint , one can achieve the decision rules from eq. (6) with respect to the posterior probabilities and thresholds in a simple and better form for abstaining classifiers:
In comparison with the decision rules of eq. (6), which are expressed in terms of the likelihood ratio, eq. (10) together with Fig. 1 presents a better view for users to understand abstaining Bayesian classifiers. A plot of posterior probabilities shows advantages over a plot of the likelihood ratio (Figure 2.3 in ) for determining rejection thresholds. Note that in Fig. 1 the plots are depicted on a one-dimensional variable for Gaussian distributions of . The simplification supports the suggestion by Duda, et al, that one “should not obscure the central points illustrated in our simple example”. Two sets of geometric points are shown for the plots. One set is called “cross-over points”, denoted by , which are formed from two curves of and . The other set is termed “boundary points”, denoted by . The boundary points partition classification regions for one-dimensional problems. For a “no rejection” case, the boundary points are controlled by the ratio of . In abstaining classifications, those points are determined from two thresholds, respectively. For multidimensional problems, both types of points above become curves or even hypersurfaces.
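The link between cost terms and the two rejection thresholds can be sketched in Python by equating the conditional risk of a class decision with that of a rejection. The closed forms below follow from that comparison under an assumed cost-matrix layout (rows: true class; columns: Class 1, Class 2, reject) and under the assumption that the two thresholds sum to at most one, so that a reject region exists. The names `reject_thresholds` and `decide` and the cost values are hypothetical, and the expressions should be checked against eq. (7) and eq. (10) before use.

```python
import math


def reject_thresholds(cost):
    """Thresholds obtained by equating class-decision risk with reject risk.

    cost[i][j]: assumed cost of deciding column j (0: Class 1, 1: Class 2,
    2: reject) when the true class is row i.
    """
    (l11, l12, l13), (l21, l22, l23) = cost
    # Class 1 is no more costly than reject when p(t2|x) <= t_r1.
    t_r1 = (l13 - l11) / (l13 - l11 + l21 - l23)
    # Class 2 is no more costly than reject when p(t1|x) <= t_r2.
    t_r2 = (l23 - l22) / (l23 - l22 + l12 - l13)
    return t_r1, t_r2


def decide(p1, t_r1, t_r2):
    """Posterior-based rule; assumes t_r1 + t_r2 <= 1 (a reject region exists)."""
    p2 = 1.0 - p1
    if p2 <= t_r1:
        return "class 1"
    if p1 <= t_r2:
        return "class 2"
    return "reject"


# Hypothetical asymmetric reject costs: rejecting a Class-2 pattern costs more.
t_r1, t_r2 = reject_thresholds([[0.0, 1.0, 0.1],
                                [1.0, 0.0, 0.4]])
for p1 in (0.95, 0.60, 0.20):
    print(p1, decide(p1, t_r1, t_r2))
```

With equal reject costs and zero-one error costs, both thresholds reduce to the single reject cost, which matches the single-threshold setting discussed above.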
With the exact knowledge of , , and , one can calculate Bayesian risk from the following equation:
where , and are the probabilities of “Correct Recognition”, “Error”, and “Rejection” for the th class in the classifications, respectively; and to are the classification regions of Class 1, Class 2 and the reject class, respectively. The general relations among , and for binary classifications are given by :
where , , and represent total correct recognition, total error and total reject rates, respectively; and is the accuracy rate of classifications.
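For a one-dimensional problem, the rates above reduce to differences of cumulative distribution values. The following Python sketch assumes two Gaussian classes with a single reject interval between two given boundary points (Class 1 to the left, Class 2 to the right); the function name `rates`, the boundary values and the distribution parameters are illustrative assumptions only.

```python
from statistics import NormalDist


def rates(p1, nd1, nd2, x_b1, x_b2):
    """Total correct-recognition, error and reject rates.

    Assumed regions: Class 1 if x < x_b1, reject if x_b1 <= x <= x_b2,
    Class 2 if x > x_b2. p1 is the prior of Class 1.
    """
    p2 = 1.0 - p1
    cr1 = p1 * nd1.cdf(x_b1)                        # Class 1 decided as Class 1
    rej1 = p1 * (nd1.cdf(x_b2) - nd1.cdf(x_b1))     # Class 1 rejected
    e1 = p1 * (1.0 - nd1.cdf(x_b2))                 # Class 1 decided as Class 2
    e2 = p2 * nd2.cdf(x_b1)                         # Class 2 decided as Class 1
    rej2 = p2 * (nd2.cdf(x_b2) - nd2.cdf(x_b1))     # Class 2 rejected
    cr2 = p2 * (1.0 - nd2.cdf(x_b2))                # Class 2 decided as Class 2
    return {"CR": cr1 + cr2, "E": e1 + e2, "Rej": rej1 + rej2}


r = rates(0.5, NormalDist(-1.0, 1.0), NormalDist(1.0, 1.0), -0.5, 0.5)
print(r)
print(sum(r.values()))  # the three total rates should sum to one
```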
II-B Parameter Redundancy Analysis of Cost Terms
Bayesian classifiers present one of the general tools for cost sensitive learning. From this perspective, there is a need for a systematic parameter redundancy analysis of cost terms for Bayesian classifiers, which appears to be missing for the case of a reject option. This section attempts to develop such a theoretical analysis.
For Bayesian classifiers, when all cost terms are given along with the other relevant knowledge about classes, a unique set of solutions will be obtained. However, this phenomenon does not indicate that all cost terms will be independent for determining the final results of Bayesian classifiers. In the following, a parameter dependency analysis is conducted because it suggests a theoretical basis for a better understanding of relations among the cost terms and the outputs of Bayesian classifiers. Based on , we present the relevant definitions but derive a theorem from the functionals in eqs. (4) and (5) so that it holds generally for any distribution of features. Let a parameter vector be defined as , where is the total number of parameters in a model and S denotes the parameter space.
Definition 2 (Parameter redundancy )
A model is considered to be parameter redundant if it can be expressed in terms of a smaller sized parameter vector , where .
Definition 3 (Independent parameters)
A model is said to be governed by independent parameters if it can be expressed in terms of the smallest size of parameter vector . Let denote the total number of for the model .
Definition 4 (Function of parameters, parameter composition, input parameters, intermediate parameters)
Suppose three sets of parameter vectors are denoted by , , and . If for a model there exists for : and : , we call and to be functions of parameters, and to be parameter composition, where are called input parameters for , and are intermediate parameters.
Lemma 1
Suppose a model holds the relation for Definition 4. The total number of independent parameters of , denoted as for the model will be no more than , or in a form of:
Suppose without parameter composition, one can prove that . According to Definition 2, any increase of its size of over will produce a parameter redundancy in the model. Definition 3 indicates that the vector size will be an upper bound for in this situation. In the same principle, after parameter compositions are defined in Definition 4 for , the lowest parameter size within , and , will be the upper bound of .
For Bayesian classifiers defined by eq. (5a), one can rewrite it in a form of:
where and in binary classifications, with for their disjoint sets. Let (or ) be the total Bayesian error (or reject) in binary classifications:
where and are two functions of the parameters. are usually input parameters, but can serve as either intermediate parameters or input ones.
Theorem 2
In abstaining binary classifications, the total number of independent parameters within the cost terms for defining Bayesian classifiers, , should be at most two . Therefore, applications of cost terms of in the traditional cost sensitive learning will exhibit a parameter redundancy for calculating Bayesian and even after assuming , and as the conventional way in classifications .
Applying (14) and (13) in Lemma 1, one can have for defining Bayesian classifiers from . However, when imposing three constraints on , and , will provide three free parameters in the cost matrix in a form of:
which implies a parameter redundancy for calculating Bayesian and .
Theorem 2 indicates that Bayesian classifiers with a reject option have difficulty in interpreting cost terms uniquely. For example, one can even enforce the following two settings:
for achieving the same Bayesian classifier, as well as their and . However, the two sets of settings entail different meanings and do not show the equivalent relations except through eq. (7). Hence, confusion may arise when one attempts to understand the behaviors of error and reject rates with respect to different sets of cost terms. For this reason, cost terms may present an intrinsic problem for defining a generic form of settings in cost sensitive learning if a reject option is enforced.
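This non-uniqueness can be illustrated numerically. In the Python sketch below, two visibly different cost matrices map to the same pair of rejection thresholds, and hence to the same abstaining classifier. The threshold expressions are obtained by comparing conditional risks under an assumed matrix layout (rows: true class; columns: Class 1, Class 2, reject); the function name and all cost values are hypothetical and are not the settings used in this work.

```python
import math


def reject_thresholds(cost):
    """Map a 2x3 cost matrix to two rejection thresholds (assumed layout)."""
    (l11, l12, l13), (l21, l22, l23) = cost
    t_r1 = (l13 - l11) / (l13 - l11 + l21 - l23)
    t_r2 = (l23 - l22) / (l23 - l22 + l12 - l13)
    return t_r1, t_r2


# Setting A: zero cost for correct decisions, unit error costs.
cost_a = [[0.0, 1.0, 0.2],
          [1.0, 0.0, 0.2]]
# Setting B: non-zero cost for a correct Class-1 decision, unequal error costs.
cost_b = [[0.3, 2.3, 0.7],
          [2.0, 0.0, 0.4]]

thr_a = reject_thresholds(cost_a)
thr_b = reject_thresholds(cost_b)
print(thr_a, thr_b)
same = all(math.isclose(u, v) for u, v in zip(thr_a, thr_b))
print("same classifier:", same)
```

Six cost terms collapse into two thresholds here, which is consistent with the bound of at most two independent parameters stated above.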
While Theorem 2 only gives an estimate of the upper bound of for Bayesian classifiers with a reject option, because a closed-form solution of is missing, one can prove on for Bayesian classifiers without rejection. A single independent parameter from the cost terms can be formed as .
We suggest applying independent parameters for the design and cost analysis of Bayesian classifiers. The total number of independent parameters of is not fixed and depends on the reject option of Bayesian classifiers. If rejection is not considered, we suggest for the cost or error sensitivity analysis. A single independent cost parameter, , is capable of governing the complete behavior of the error rate. For a reject option, we suggest for the cost, error, or reject sensitivity analysis, which will lead to a unique interpretation of the analysis.
II-C Examples of Bayesian Classifiers on Univariate Gaussian Distributions
This section will consider abstaining Bayesian classifiers on Gaussian distributions. As a preliminary study, a univariate feature in is adopted in order to show the theoretical fundamentals as well as the closed-form solutions. Therefore, if the relevant knowledge of and is given, one can depict the plots of from calculation of eq. (1) (Fig. 1). Moreover, when is known, the classification regions of to in terms of will be fixed for Bayesian classifiers. After the regions to , or , are determined, Bayesian risk will be obtained directly. One can see that these boundaries can be obtained from the known data of by solving the equality in (6a) or (6b):
The data of can be realized either from cost terms , or from threshold (see eq. (6)). By substituting the exact data of and for Gaussian distributions, where and represent the mean and standard deviation to the th class, and the data of (say, for from the given ) into (18), one can obtain the closed-form solutions to the boundary points (say, for and ):
where is an intermediate variable defined by:
Eq. (19) is also valid for Bayesian classifiers in the case of “no rejection”. However, only cost terms, , will define the data of . The general solution to abstaining classifiers has four boundary points by substituting two thresholds and , respectively. For the conditions shown in Fig. 1d, will lead to and , and to and , respectively. Eq. (19a) shows a general form for achieving two boundary points from one data point of , and eq. (19b) is specific for reaching a single boundary point only when the standard deviations of two classes are the same. Substituting the other data of into eq. (19) will yield another pair of data and , or a single one , in a form similar to eq. (19).
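The way one threshold value yields two, one, or zero boundary points can be sketched in Python. The code assumes that a boundary point is where the ratio of prior-weighted Gaussian densities equals a given value, called `delta` here; taking logarithms turns that condition into a quadratic in x. The function names, the symbol `delta` and the numerical parameters are assumptions for illustration, and the coefficients are derived for this sketch rather than copied from eq. (19).

```python
import math


def weighted_ratio(x, mu1, s1, mu2, s2, p1):
    """p(x|t1) p(t1) / (p(x|t2) p(t2)) for univariate Gaussian classes."""
    g1 = math.exp(-(x - mu1) ** 2 / (2 * s1 ** 2)) / s1
    g2 = math.exp(-(x - mu2) ** 2 / (2 * s2 ** 2)) / s2
    return g1 * p1 / (g2 * (1.0 - p1))


def boundary_points(mu1, s1, mu2, s2, p1, delta):
    """Solve weighted_ratio(x) == delta; returns 0, 1 or 2 real points.

    Assumes mu1 != mu2 whenever s1 == s2 (otherwise the ratio is constant).
    """
    p2 = 1.0 - p1
    a = 1.0 / (2 * s2 ** 2) - 1.0 / (2 * s1 ** 2)
    b = mu1 / s1 ** 2 - mu2 / s2 ** 2
    c = (mu2 ** 2 / (2 * s2 ** 2) - mu1 ** 2 / (2 * s1 ** 2)
         + math.log(s2 / s1) - math.log(delta * p2 / p1))
    if math.isclose(a, 0.0, abs_tol=1e-15):   # equal standard deviations
        return [-c / b]
    disc = b * b - 4 * a * c
    if disc < 0:                              # no real solution
        return []
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])


print(boundary_points(-1.0, 1.0, 1.0, 1.0, 0.5, 1.0))   # equal variances
print(boundary_points(-1.0, 1.0, 1.0, 2.0, 0.5, 1.0))   # unequal variances
print(boundary_points(0.0, 1.0, 0.0, 2.0, 0.01, 1.0))   # one class dominates
```

Setting `delta` to one with zero-one costs gives the cross-over points of the two posterior curves, while other values of `delta` give the boundary points of a reject region.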
|(Fig. 1d)||,||and||General Rejection|
|,||“Class-1 and Reject-class” Classification|
|,||and||“Class-2 and Reject-class” Classification|
|“Majority-class and Reject-class”|
|“Minority-class and Reject-class”|
|Zero, one and Two|
|(Fig.1)||Rejection to All|
Like the solution for boundary points, cross-over point(s) can also be obtained from solving eq. (18) or (19) by substituting . One can prove that three specific cases will be met with the cross-over point(s) from the solution of eq. (18), namely, two, one, or zero cross-over point(s). The case for the two cross-over points appears only when in eq. (19c), and two curves of and demonstrate the non-monotonicity (Fig. 1b) through the equality . When the associated standard deviations are equal for the two classes, i.e., , only one cross-over point appears, which corresponds to the monotonic curves of and (Fig. 1a). The case for the zero cross-over point occurs when , which corresponds to no real-valued (but complex-valued) solution to eq. (19a) and to situations of non-monotonic curves of and . In the following, we will discuss several specific cases for rejections with respect to the cross-over points between the and curves, as well as to the associated settings on and . A term is applied to describe every case. For example, “CaseBU” indicates “k” for the th case, “B” (or “M”) for Bayesian (or mutual-information) classifiers, and “G” (or “U”) for Gaussian (or uniform) distributions.
For a binary classification, Chow  showed that, when , there exists no rejection for classifiers. The novel constraint of shown in eq. (6e) suggests that the setting should be when the thresholds are the input data. Users need to specify an option for “no rejection” or “rejection” as an input. When “no rejection” is selected, the conventional scheme of cost terms from a two-by-two matrix will be sufficient. Any usage of a two-by-three matrix will introduce some confusion that will be illustrated in the later section by Example 1. In addition, one cannot consider as the defaults for the cost matrix in this case.
Rejection to all or to a complete class.
In discussing this case, we relax the constraints in eq. (6e) to include zero values of the thresholds. Chow showed that, whenever , a classifier will reject all patterns. Substituting zero values for thresholds into eq. (7), one will obtain solutions for . These results imply that no cost is incurred even for a reject decision on a pattern. Obviously, a case like this should be avoided. In some situations, if one intends to reject a complete class (say, Class 1), its associated cost terms should be set to zero (say, ). We call these situations “one-class and reject-class” classification, since only two categories are identified, that is, “Class 2” and “Reject Class”, respectively.
Rejection in two cross-over points and .
The necessary condition for realizing this case is derived from eq. (18) for while assuming :
The general situation within this case is when and , in which the reject region is divided into two ranges. When and , only one class is identified, but all other patterns are classified into a reject class. Therefore, we refer to this situation as “Class 1 and Reject-class” classification. Table I also lists the other situations for the rejections from the different settings on .
Rejection in one cross-over point .
The general condition for realizing this case in the context of classifications is not based on setting an equality condition on (20) for . We neglect such a setting in this case, but assign it to CaseBG. As demonstrated in eq. (19b), the general condition of this case is simply the setting . Since the curves of and are monotonic in this case, a single reject region is formed (Fig. 1c).
Rejection in zero cross-over point.
The general condition for realizing this case corresponds to a violation of the criterion on (19a), or in (20). In this case, one class always shows a higher value of the posterior probability than the other one over the whole domain of . Following the definitions in the study of class-imbalanced datasets, if in binary classifications, Class 1 will be called a “majority” class and Class 2 a “minority” class. Supposing that , when , all patterns will be considered as Class 1. We call these situations “Majority-taking-all” classification. Due to the constraints like and , one is unable to realize a “Minority-taking-all” classification. When and , all patterns will be partitioned into one of two classes, that is, majority and rejection. We call these situations “Majority-class and Reject-class” classifications. The situations of “Minority-class and Reject-class” classification occur if and .
Consider a binary classification with an exact knowledge of one-dimensional Gaussian distributions. If a zero-one cost function is applied, Bayesian classifiers without rejection will satisfy the following rule:
which indicates that the classifiers have a tendency of reaching the maximum Bayesian error, , by misclassifying all rare-class patterns in imbalanced data learning.
We will prove the misclassification of all rare-class patterns first. Suppose represents the prior probability of the “minority” or “rare” class in imbalanced data learning, and first consider the special case of equal variances for the two classes (Fig. 1a). When approaches zero, will approach infinity from using eq. (19b) with . This result indicates that Bayesian classifiers will assign all patterns to the “majority” class in classifications. When the variances are not equal, eqs. (19a) and (19c) with will be applicable (Fig. 1b). One can obtain the relation for the case that no cross-over point occurs on plots when approaches zero. Only the “majority” class is identified by Bayesian classifiers in this case. The equality of suggests an upper bound of Bayesian error (See Appendix B). If this bound would be violated, Bayesian classifiers will adjust themselves to achieve the smallest error rate.
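The equal-variance part of this argument can be checked numerically. The Python sketch below assumes zero-one costs, no rejection, and two unit-variance Gaussian classes with the majority mean to the left of the minority mean; the boundary expression is derived for this sketch by equating the prior-weighted densities, and the function name and parameter values are hypothetical.

```python
import math
from statistics import NormalDist


def zero_one_bayes(p_minor, mu_major=-1.0, mu_minor=1.0, sigma=1.0):
    """Boundary point, Bayes error and missed-minority fraction.

    Assumes mu_major < mu_minor, so x < x_b is assigned to the majority class.
    """
    p_major = 1.0 - p_minor
    x_b = ((mu_major + mu_minor) / 2
           + sigma ** 2 * math.log(p_major / p_minor) / (mu_minor - mu_major))
    major = NormalDist(mu_major, sigma)
    minor = NormalDist(mu_minor, sigma)
    missed = minor.cdf(x_b)   # fraction of minority patterns sent to the majority
    error = p_major * (1.0 - major.cdf(x_b)) + p_minor * missed
    return x_b, error, missed


results = [(p, *zero_one_bayes(p)) for p in (0.5, 0.1, 0.01, 1e-4)]
for p, x_b, err, missed in results:
    print(f"p_minor={p:g}  x_b={x_b:.3f}  error={err:.6f}  missed={missed:.4f}")
```

As the minority prior shrinks, the boundary point moves away from both means, the missed fraction approaches one, and the Bayes error stays below the minority prior while approaching it.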
II-D Examples of Bayesian Classifiers on Univariate Uniform Distributions
Chow presented a study on rejection from Bayesian classifiers with uniform distributions for one-dimensional problems. This section will extend Chow’s results by providing general formulas of parameterized distributions. A binary classification is considered. The two uniform distributions on two classes are given:
Three specific cases, shown in Fig. 2, will appear, namely, “Partially overlapping”, “Fully overlapping by one class”, and “Separating” between two distributions for eq. (22). We will discuss each case with respect to their rejection settings.
Partially overlapping between two distributions.
Suppose that the constraints for this case are:
When the relevant knowledge of and is given, one is able to gain the posterior probabilities from eqs. (1) and (21) by a closed form:
Based on the Bayesian rules of eq. (10) and eq. (24), one can immediately determine and directly for Class 1 and Class 2, respectively, as shown in Fig. 2. The remaining range is denoted as , since it needs to be identified further depending on the thresholds defined in (7). Due to the simplicity of the uniform distributions, one is able to realize analytical solutions directly for Bayesian classifiers. The probabilities of errors and rejects are calculated from :