Attribute noise robust binary classification
Abstract
We consider the problem of learning linear classifiers when both features and labels are binary. In addition, the features are noisy, i.e., they could be flipped with an unknown probability. In the SyDe attribute noise model, where all features can be flipped together with the same probability, we show that 0-1 loss need not be robust but a popular surrogate, the squared loss, is. In the AsyIn attribute noise model, we prove that 0-1 loss is robust for any distribution over a 2-dimensional feature space. However, due to the computational intractability of 0-1 loss minimization, we resort to the squared loss and observe that it need not be AsyIn noise robust. Our empirical results support the SyDe robustness of squared loss for low to moderate noise rates.
Introduction
The quality of data is often compromised as its quantity grows. In the classification setup, bad quality data could be due to noise in the labels or noise in the features. Label noise research has gained a lot of attention in the last decade [Sastry and Manwani, 2016]. In contrast, feature or attribute noise remains largely unexplored. As opposed to continuous valued attributes, noise in categorical features, particularly binary ones, can drastically change the relative location of a data point and significantly impact the classifier's performance.
[Quinlan, 1986] studied the effect of attribute noise on decision tree learning. [Zhu and Wu, 2004, Khoshgoftaar and Van Hulse, 2009] study attribute noise from the perspective of detecting noisy data points and correcting them.
Our major contribution lies in identifying loss functions that are robust (or not) to binary valued attribute noise in the Empirical Risk Minimization (ERM) framework. This has the advantage that there is no need to know, cross-validate over, or estimate the noise rates.
Problem description
Let D be the joint distribution over X × Y, where X ⊆ {−1, +1}^n and Y = {−1, +1}. Let the decision function be f(x) = sign(w·x + b), the hypothesis class of all measurable functions be H, and the class of linear hypotheses be H_lin. We restrict our set of hypotheses to H_lin. Let D̃ denote the distribution on X × Y obtained by inducing noise into D. The corrupted sample is (x̃, y). The probability that the value of attribute j is flipped is given by p_j. We assume that the class/label does not change with noise in the attributes.
Based on the flipping probability and the dependence between flipping events for different attributes, we identify two attribute noise models. If all the attribute values are flipped together with the same probability p, it is referred to as the symmetric dependent attribute noise model (SyDe). If each attribute j flips with probability p_j independently of every other attribute, it is referred to as the asymmetric independent attribute noise model (AsyIn). Even though the SyDe attribute noise model is simple, it cannot be obtained by taking p_1 = … = p_n = p in the AsyIn attribute noise model, since SyDe flips are fully dependent. Real world example of SyDe (or AsyIn) noisy attributes: consider a room with many sensors connected in series (or with individual batteries) measuring temperature, humidity, etc., as binary values, i.e., high or low. A power failure (or individual battery failures) will lead to all (or individual) sensors/attributes providing noisy observations with the same (or different) probability.
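The two noise models above can be sketched as a minimal simulation. This is our own illustration, assuming attributes take values in {−1, +1} (so a flip is a sign change); the function and variable names are ours, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def syde_noise(X, p):
    # SyDe: one coin per sample; ALL attributes flip together with probability p.
    flip = rng.random(X.shape[0]) < p
    return np.where(flip[:, None], -X, X)

def asyin_noise(X, p_vec):
    # AsyIn: one coin per entry; attribute j flips independently with rate p_vec[j].
    flip = rng.random(X.shape) < np.asarray(p_vec)
    return np.where(flip, -X, X)

X = rng.choice([-1, 1], size=(5, 3))       # toy binary feature matrix
X_syde = syde_noise(X, 0.3)                # each row is either intact or fully negated
X_asyin = asyin_noise(X, [0.1, 0.2, 0.3])  # entries flip attribute-wise
```

Note that setting all entries of `p_vec` to the same p does not recover SyDe noise: AsyIn flips attributes independently, while SyDe flips them all together.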
We consider the ERM framework for classification. A natural choice of loss function is the 0-1 loss, i.e., ℓ_{0-1}(f(x), y) = 1[sign(f(x)) ≠ y]. The Bayes classifier and Bayes risk are f* = arg min_{f ∈ H} R_D(f) and R* = R_D(f*), where R_D(f) = E_D[ℓ_{0-1}(f(x), y)]. The corresponding quantities for the noisy distribution D̃ are f̃* and R̃*.
The nonconvex nature of 0-1 loss makes it difficult to optimize, and hence convex upper bounds (surrogate losses) are used in practice. In this work, we consider the squared loss ℓ_sq(f(x), y) = (y − f(x))², a differentiable and convex surrogate loss function. Our restriction of the hypothesis class to linear classifiers can be interpreted as a form of regularization. The expected squared clean and corrupted risks are R_sq,D(f) = E_D[(y − f(x))²] and R_sq,D̃(f) = E_D̃[(y − f(x̃))²]. The hypotheses in H_lin minimizing these clean and corrupted risks are denoted by f*_sq and f̃*_sq. Next, we define attribute noise robustness of a risk minimization scheme S.
Definition 1.
Let f_S and f̃_S be the classifiers obtained from the clean and corrupted distributions D and D̃ using an arbitrary scheme S. Then, scheme S is said to be attribute noise robust if
R_D(f_S) = R_D(f̃_S).
Also, the loss function minimized by S is then said to be an attribute noise robust loss function.
Attribute noise robust loss functions
We first consider the SyDe attribute noise model and present a counter example (Example 1) to show that 0-1 loss need not be robust to SyDe attribute noise. To circumvent this problem, we provide a positive result by showing that squared loss is SyDe attribute noise robust with origin passing linear classifiers (Theorem 1). Our hypothesis set belongs to H_lin, which can be further categorized into origin passing and non-origin passing classifiers (b = 0 or b ≠ 0). Details of examples and proofs are available in the Supplementary Material (SM).
Example 1.
Consider a population of two data points (in 1D), occurring with fixed probabilities, and a classifier of the form f(x) = sign(wx + b). The optimal clean classifier f* and the optimal SyDe attribute noise corrupted classifier f̃* (for noise rate p) then attain different clean risks. Since R_D(f*) ≠ R_D(f̃*), the 0-1 loss function need not be SyDe attribute noise robust; the numerical details are in Counter example 1 of the SM.
Theorem 1.
Consider a clean distribution D on X × Y and the SyDe attribute noise corrupted distribution D̃ with noise rate p < 1/2. Then, squared loss with origin passing linear classifiers is SyDe attribute noise robust, i.e.,

R_D(f_{w*}) = R_D(f_{w̃*}),   (1)

where w* and w̃* correspond to the optimal linear classifiers learnt using squared loss on the clean (D) and corrupted (D̃) distributions.
Remark 1.
SyDe robustness of squared loss is an interesting result because, given an attribute noise corrupted dataset, obtaining a linear classifier entails solving only a linear system of equations. (Demonstrated on UCI datasets.)
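As a hedged illustration of this remark, the sketch below (synthetic data; all names and values are our own assumptions, not the paper's) fits the squared-loss ERM classifier on clean and on SyDe-corrupted data via the normal equations, and checks that the two induced sign classifiers agree:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic clean sample: binary attributes in {-1, +1}, labels from a linear rule.
n, m, p = 5, 20000, 0.3
X = rng.choice([-1, 1], size=(m, n))
w_true = rng.normal(size=n)
y = np.sign(X @ w_true)

# SyDe corruption: all attributes of a sample flip together with probability p.
flip = rng.random(m) < p
X_noisy = np.where(flip[:, None], -X, X)

def lsq(A, y):
    # Squared-loss ERM with an origin-passing linear classifier reduces to
    # solving the normal equations (A^T A) w = A^T y.
    return np.linalg.solve(A.T @ A, A.T @ y)

w_clean = lsq(X, y)
w_noisy = lsq(X_noisy, y)

# Under SyDe noise x~ x~^T = x x^T, while E[x~ y] = (1 - 2p) E[x y], so
# w_noisy is (up to sampling error) the positive multiple (1 - 2p) of w_clean
# when p < 1/2; hence the induced classifiers sign(w . x) agree.
agreement = np.mean(np.sign(X @ w_clean) == np.sign(X @ w_noisy))
```

The corrupted fit costs no more than the clean one: both are a single linear solve.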
Now, we consider the AsyIn attribute noise model and show that 0-1 loss is robust to this noise with non-origin passing classifiers when n = 2 (Theorem 2). As 0-1 loss based ERM is computationally intractable, we resort to squared loss and present a counter example to show that it need not be AsyIn noise robust (Example 2).
Theorem 2.
Consider a clean distribution D on X × Y with X = {−1, +1}² (a population of 4 data points) and the AsyIn attribute noise corrupted distribution D̃ with noise rates p₁ and p₂. Then, 0-1 loss with non-origin passing linear classifiers is AsyIn attribute noise robust, i.e.,

R_D(f*) = R_D(f̃*),   (2)

where f* and f̃* correspond to the optimal linear classifiers learnt using 0-1 loss on the clean (D) and corrupted (D̃) distributions respectively.
Example 2.
Consider a population of 3 data points (in 2D), occurring with fixed probabilities, and a classifier of the form f(x) = sign(w₁x₁ + w₂x₂ + b). The optimal clean classifier f* and the optimal AsyIn attribute noise corrupted classifier f̃* (for noise rates p₁, p₂) then attain different clean risks. Since R_D(f*) ≠ R_D(f̃*), squared loss need not be AsyIn attribute noise robust; the numerical details are in Counter example 2 of the SM.
Experiments
Figure 1 demonstrates the SyDe attribute noise robustness of squared loss on 3 UCI datasets [Dheeru and Karra Taniskidou, 2017]; details in the SM. As the SPECT dataset is imbalanced, in addition to accuracy, we also report the arithmetic mean (AM) of the class-wise accuracies. To account for randomness in the noise, results are averaged over 15 trials of train-test partitioning (80-20). The lower accuracy in comparison to the clean classifier can be attributed to the finite samples available for learning the classifiers.
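A small sketch of the evaluation metric, assuming (as is standard for imbalanced data) that AM denotes the arithmetic mean of the per-class accuracies; the helper name is ours:

```python
import numpy as np

def am_metric(y_true, y_pred):
    # Arithmetic mean of the per-class accuracies: (TPR + TNR) / 2.
    # Unlike plain accuracy, a majority-class predictor scores only 0.5.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)
    tnr = np.mean(y_pred[y_true == -1] == -1)
    return (tpr + tnr) / 2

# On an imbalanced sample, always predicting the majority class gets high
# accuracy but only 0.5 AM.
y_true = np.array([1] * 9 + [-1])
y_pred = np.ones(10, dtype=int)
acc = np.mean(y_pred == y_true)
am = am_metric(y_true, y_pred)
```

This is why AM is reported alongside accuracy for SPECT, where one class dominates.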
Looking forward
Our work is an initial attempt at binary valued attribute noise; an extension to general discrete valued attributes would be interesting. The AsyIn attribute noise model raises some nontrivial questions w.r.t. the choice of loss function: robustness of 0-1 loss for n > 2, an explanation for the surprising non-robustness of squared loss as compared to the robustness of the difficult-to-optimize 0-1 loss, and the search for other surrogate loss functions that are robust. Finally, we believe that the attribute dimension could have a role to play in noise robustness.
References
 [Dheeru and Karra Taniskidou, 2017] Dheeru, D. and Karra Taniskidou, E. (2017). UCI machine learning repository.
 [Khoshgoftaar and Van Hulse, 2009] Khoshgoftaar, T. M. and Van Hulse, J. (2009). Empirical case studies in attribute noise detection. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 39(4):379–388.
 [Quinlan, 1986] Quinlan, J. R. (1986). The effect of noise on concept learning. Machine learning: An artificial intelligence approach, 2:149–166.
 [Sastry and Manwani, 2016] Sastry, P. S. and Manwani, N. (2016). Robust Learning of Classifiers in the Presence of Label Noise, chapter 6, pages 167–197.
 [Zhu and Wu, 2004] Zhu, X. and Wu, X. (2004). Class noise vs. attribute noise: A quantitative study. Artificial Intelligence Review, 22(3):177–210.
Supplementary material
Appendix A Proofs
Proof of Theorem 1
Proof.
Consider the squared loss based clean risk

R_sq,D(w) = E_D[(y − w·x)²].

We minimize this by differentiating the expectation term and equating it to 0, which yields the normal equations E_D[xxᵀ] w = E_D[xy], i.e., w* = E_D[xxᵀ]⁻¹ E_D[xy].

Now, to obtain the optimal noisy classifier, we minimize the squared loss based expected corrupted risk

R_sq,D̃(w) = E_D̃[(y − w·x̃)²].   (3)

Under SyDe noise, x̃ = x with probability 1 − p and x̃ = −x with probability p, independently of (x, y). Hence x̃x̃ᵀ = xxᵀ always, so E_D̃[x̃x̃ᵀ] = E_D[xxᵀ], while E_D̃[x̃y] = (1 − 2p) E_D[xy]. Differentiating the corrupted risk in equation (3) and equating it to 0 therefore gives

w̃* = (1 − 2p) E_D[xxᵀ]⁻¹ E_D[xy] = (1 − 2p) w*.

We can see that the noisy classifier is just a scaled version of the classifier, w*, obtained from the clean risk. Since for attribute noise robustness it is sufficient that sign(w̃*·x) = sign(w*·x) for all x, and 1 − 2p > 0, the aforementioned observation proves that the squared loss function is SyDe attribute noise robust when p < 1/2. ∎
Proof of Theorem 2
Proof.
We prove the robustness of 0-1 loss in 2 dimensions by taking an exhaustive search approach. Even though we consider a particular nonuniform distribution over the 4 data points in the population and particular values of the class probabilities and noise rates, the following claims hold for arbitrary probability values and any pair of noise rates (p₁, p₂). In the population, there are 16 ways in which the four data points can be assigned to two classes. By symmetry, we can restrict ourselves to just 4 cases.
The probabilities of the four points (−1, −1), (−1, +1), (+1, −1) and (+1, +1) in the distribution are fixed and sum to 1. The noise rates of the two attributes are p₁ and p₂. We consider a classifier of the form f(x) = sign(w₁x₁ + w₂x₂ + b) with b ≠ 0.
The clean 0-1 risk, denoted R_D(f), and the noisy 0-1 risk, denoted R_D̃(f), are computed exactly for each case as functions of (w₁, w₂, b). To find the minimizers of the clean and corrupted risks R_D and R_D̃, we plot them (in MATLAB) as functions of w₁, w₂ and b. In the 4 cases considered below, we observed that even though the minimum values of R_D and R_D̃ are different, they have the same minimizers. This implies that the clean 0-1 risk corresponding to the optimal clean and the optimal noisy classifier is the same. Hence, the condition for attribute noise robustness in Definition 1 is satisfied.
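The case analysis can be reproduced numerically. The sketch below is ours: the labels, probabilities and noise rates are assumed values (the paper's exact choices are in the figures and SM), and the exact clean and noisy 0-1 risks are evaluated on a grid of (w₁, w₂, b), mirroring the MATLAB plots:

```python
import itertools
import numpy as np

# Hypothetical case: the four binary points with assumed labels and probabilities.
points = [(-1, -1), (-1, +1), (+1, -1), (+1, +1)]
labels = {(-1, -1): -1, (-1, +1): +1, (+1, -1): +1, (+1, +1): +1}
probs = {(-1, -1): 0.1, (-1, +1): 0.2, (+1, -1): 0.3, (+1, +1): 0.4}
p1, p2 = 0.12, 0.23  # assumed AsyIn flip rates for the two attributes

def predict(w1, w2, b, x):
    return 1 if w1 * x[0] + w2 * x[1] + b >= 0 else -1

def clean_risk(w1, w2, b):
    # Exact 0-1 risk under the clean distribution.
    return sum(probs[x] for x in points if predict(w1, w2, b, x) != labels[x])

def noisy_risk(w1, w2, b):
    # Exact 0-1 risk under the AsyIn-corrupted distribution: attribute 1
    # flips with rate p1 and attribute 2 with rate p2, independently.
    risk = 0.0
    for x in points:
        for f1, f2 in itertools.product([0, 1], repeat=2):
            xt = (-x[0] if f1 else x[0], -x[1] if f2 else x[1])
            pf = (p1 if f1 else 1 - p1) * (p2 if f2 else 1 - p2)
            if predict(w1, w2, b, xt) != labels[x]:
                risk += probs[x] * pf
    return risk

# Exhaustive grid search over non-origin-passing classifiers (b != 0).
grid = np.linspace(-1, 1, 21)
cands = [(w1, w2, b) for w1 in grid for w2 in grid for b in grid if b != 0]
clean_min = min(clean_risk(*c) for c in cands)
noisy_min = min(noisy_risk(*c) for c in cands)
```

Comparing the sets of grid minimizers of `clean_risk` and `noisy_risk` reproduces the check described above for any chosen case.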
Next, we provide the details for each case.
Case 1: The set of classifiers obtained by minimizing the clean 0-1 risk is given in Figure 2, and the set obtained by minimizing the noisy 0-1 risk is given in Figure 3. A common classifier minimizes the clean as well as the noisy risk.
Case 2: The set of classifiers obtained by minimizing the clean 0-1 risk is given in Figure 4, and the set obtained by minimizing the noisy 0-1 risk is given in Figure 5. A common classifier minimizes the clean as well as the noisy risk.
Case 3: The set of classifiers obtained by minimizing the clean 0-1 risk is given in Figure 6, and the set obtained by minimizing the noisy 0-1 risk is given in Figure 7. A common classifier minimizes the clean as well as the noisy risk.
Case 4: The set of classifiers obtained by minimizing the noisy 0-1 risk is given in Figure 9. A common classifier minimizes the clean as well as the noisy risk.
As seen in the above four cases, the set of minimizers of the clean 0-1 risk and the corrupted 0-1 risk are the same. Hence, the classifiers learnt by minimizing the 0-1 risk on noisy distributions are AsyIn attribute noise robust with b ≠ 0.
∎
Appendix B Details of counter examples
Counter example 1
Consider a population of two data points in 1 dimension, occurring with fixed probabilities, and consider a classifier of the form f(x) = sign(wx + b). The clean 0-1 risk R_D(f) is written out as a function of (w, b).
Minimizing R_D(f) w.r.t. w and b gives the optimal clean classifier f* and the optimal clean risk R_D(f*).
Next, we consider the SyDe attribute noise corrupted risk R_D̃(f) with noise rate p.   (4)
Minimizing R_D̃(f) w.r.t. w and b gives the optimal noisy classifier f̃* and its clean 0-1 risk R_D(f̃*).
For the chosen values, R_D(f̃*) ≠ R_D(f*), implying that 0-1 loss need not be SyDe attribute noise robust.
Counter example 2
Consider a population of 3 data points in 2 dimensions, occurring with fixed probabilities, and a classifier of the form f(x) = sign(w₁x₁ + w₂x₂ + b). The clean squared loss based risk R_sq,D is written out as a function of (w₁, w₂, b).
Minimizing R_sq,D w.r.t. w₁, w₂ and b leads to a linear system of equations.
Solving this system gives the optimal squared loss based clean classifier f* and its clean risk R_D(f*).
Next, we consider the AsyIn attribute noise corrupted risk R_sq,D̃ in terms of p₁ and p₂.
Minimizing R_sq,D̃ w.r.t. (w₁, w₂, b) leads to another linear system of equations.
For particular noise rate values p₁ and p₂, solving this system gives the optimal squared loss based noisy linear classifier f̃* and its clean risk R_D(f̃*).
For these values, R_D(f̃*) ≠ R_D(f*), implying that squared loss need not be AsyIn attribute noise robust.
Appendix C UCI dataset details
In this section, we provide details on how we processed the datasets to obtain the experimental results. The number of features and the number of data points (with negative and positive labelled counts given separately) are provided in Table 1. Individual preprocessing details for each dataset are as follows:

Vote: This is a voting dataset with an original size of 435 data points. Since a missing entry in a cell meant that the person took a neutral stand, we removed such instances to fit the framework of binary valued attributes and finally used 232 data points. The two labels correspond to “Democrat” and “Republican”.

SPECT: This dataset has information extracted from cardiac SPECT images with values 0 and 1. To be consistent with the ±1 binary format, we replaced all 0's by −1, without loss of generality, in both attributes and labels.

KRvsKP: This dataset originally has 36 features, but we removed feature number 15 as it had three categorical values, and finally used the dataset with 35 attributes. The attributes are processed as follows: “f” replaced by “+1”, “t” replaced by “−1”, “n” replaced by “+1”, “g” replaced by “+1”, “l” replaced by “−1”. Finally, the two labels correspond to “Won” and “Nowin”.
S.no  Dataset name  n  m (per-class counts)
1  Vote  16  232 (124, 108)
2  SPECT  22  267 (212, 55)
3  KRvsKP  35  3196 (1569, 1527)
In Tables 2 and 3, we present the values used to generate the plots in Figure 1. The results are averaged over 15 trials. We observe that at high noise rates the theoretically proven robustness of squared loss breaks down empirically, because only finite samples are available.
Datasets (m, n): Vote (232 (124, 108), 16), SPECT (267 (212, 55), 22), KRvsKP (3196 (1569, 1527), 35). Each cell reports Mean / SD over 15 trials; the first row corresponds to the classifier learnt on clean data.

p      Vote            SPECT           KRvsKP
clean  0.969 / 0.0218  0.732 / 0.0589  0.940 / 0.0109
0      0.966 / 0.0203  0.693 / 0.0746  0.938 / 0.0096
0.1    0.955 / 0.0363  0.680 / 0.0681  0.924 / 0.0106
0.2    0.865 / 0.0839  0.638 / 0.0797  0.900 / 0.0150
0.3    0.821 / 0.0941  0.625 / 0.0664  0.866 / 0.0193
0.35   0.694 / 0.0919  0.579 / 0.0955  0.795 / 0.0263
0.4    0.573 / 0.1041  0.505 / 0.0793  0.706 / 0.0393
Datasets (m, n): Vote (232 (124, 108), 16), SPECT (267 (212, 55), 22), KRvsKP (3196 (1569, 1527), 35). Each cell reports Mean / SD over 15 trials; the first row corresponds to the classifier learnt on clean data.

p      Vote            SPECT           KRvsKP
clean  0.970 / 0.0207  0.665 / 0.0639  0.940 / 0.0111
0      0.967 / 0.0193  0.613 / 0.1214  0.937 / 0.0097
0.1    0.956 / 0.0373  0.605 / 0.1338  0.923 / 0.0109
0.2    0.868 / 0.0839  0.568 / 0.1353  0.899 / 0.0152
0.3    0.823 / 0.0950  0.586 / 0.1188  0.866 / 0.0186
0.35   0.698 / 0.0903  0.544 / 0.1542  0.794 / 0.0261
0.4    0.575 / 0.1045  0.540 / 0.1472  0.706 / 0.0399
Appendix D Additional examples
Example 3.
This is another example which demonstrates that 0-1 loss is robust to AsyIn attribute noise when n = 2. Let the input data be two dimensional and the clean training set be uniformly distributed over its points. Let the flipping probabilities of the first and second components be p₁ and p₂ respectively. Let us consider the loss function to be the 0-1 loss and the classifier to be of the form f(x) = sign(w₁x₁ + w₂x₂ + b). We compute the clean 0-1 risk R_D(f) as a function of (w₁, w₂, b).
We get a range of optimal values for (w₁, w₂, b); we can choose any one of them, which gives a clean risk of 0.
Now, for the corrupted case, we minimize the noisy 0-1 risk to obtain the noisy classifier f̃*.
The minimizer of the noisy risk is found by plotting it (in MATLAB), as shown in Figure 10. Here, the values of p₁ and p₂ are taken to be 0.12 and 0.23 respectively. The same pattern is observed for all values of p₁ and p₂.
The minimum value is obtained at particular values of (w₁, w₂, b).
Comparing the clean 0-1 risks of the classifiers f* and f̃*, we observe that they are equal (both 0), and hence the 0-1 loss function is attribute noise robust in this case.