Attribute noise robust binary classification

Aditya Petety, Sandhya Tripathi, N Hemachandra
National Institute of Science Education and Research, Bhubaneshwar
aditya.petety@niser.ac.in
Indian Institute of Technology Bombay
{sandhya.tripathi, nh}@iitb.ac.in
Abstract

We consider the problem of learning linear classifiers when both features and labels are binary. In addition, the features are noisy, i.e., they could be flipped with an unknown probability. In the Sy-De attribute noise model, where all features can be flipped together with the same probability, we show that the 0-1 loss ($\ell_{0\text{-}1}$) need not be robust but that a popular surrogate, the squared loss ($\ell_{sq}$), is. In the Asy-In attribute noise model, we prove that $\ell_{0\text{-}1}$ is robust for any distribution over a 2-dimensional feature space. However, due to the computational intractability of $\ell_{0\text{-}1}$, we resort to $\ell_{sq}$ and observe that it need not be Asy-In noise robust. Our empirical results support the Sy-De robustness of the squared loss for low to moderate noise rates.

Introduction

The quality of data is often compromised as its quantity grows. In a classification setup, bad quality data could be due to noise in the labels or noise in the features. Label noise research has gained a lot of attention in the last decade [Sastry and Manwani, 2016]. In contrast, feature or attribute noise is still relatively unexplored. As opposed to continuous valued attributes, noise in categorical features, particularly binary ones, can drastically change the relative location of a data point and significantly impact the classifier's performance.

[Quinlan, 1986] studied the effect of noise when the learning algorithms are decision trees. [Zhu and Wu, 2004, Khoshgoftaar and Van Hulse, 2009] study attribute noise from the perspective of detecting noisy data points and correcting them.

Our major contribution lies in identifying loss functions that are robust (or not) to attribute (binary valued) noise in the Empirical Risk Minimization (ERM) framework. This has the advantage that there is no need to know the true noise rates, to cross-validate over them, or to estimate them.

Problem description

Let $D$ be the joint distribution over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} = \{-1,+1\}^{n}$ and $\mathcal{Y} = \{-1,+1\}$. Let the decision function be $f: \mathcal{X} \rightarrow \mathbb{R}$, the hypothesis class of all measurable functions be $\mathcal{F}$, and the class of linear hypotheses be $\mathcal{F}_{lin} = \{f : f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b,\ \mathbf{w} \in \mathbb{R}^{n}, b \in \mathbb{R}\}$. We restrict our set of hypotheses to be in $\mathcal{F}_{lin}$. Let $\tilde{D}$ denote the distribution on $\mathcal{X} \times \mathcal{Y}$ obtained by inducing attribute noise into $D$. The corrupted sample is $(\tilde{\mathbf{x}}, y)$. The probability that the value of attribute $j$ is flipped is given by $p_{j}$. We assume that the class/label does not change with noise in the attributes.

Based on the flipping probability and the dependence between the flipping events of different attributes, we identify two attribute noise models. If all the attribute values are flipped together with the same probability $p$, then it is referred to as the symmetric dependent attribute noise model (Sy-De). If each attribute $j$ flips with probability $p_{j}$ independently of every other attribute, then it is referred to as the asymmetric independent attribute noise model (Asy-In). Even though the Sy-De attribute noise model is simple, it cannot be obtained by taking $p_{j} = p$ for all $j$ in the Asy-In attribute noise model, because the flips in Asy-In remain independent across attributes. Real world example of Sy-De (or Asy-In) noisy attributes: consider a room with many sensors connected in series (or with individual batteries) measuring temperature, humidity, etc., as binary values, i.e., high or low. A power failure (or individual battery failures) will lead to all (or individual) sensors/attributes providing noisy observations with the same (or different) probability.
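As a concrete illustration of the two noise models, the following minimal sketch (Python, with hypothetical helper names, assuming attributes coded as $\{-1,+1\}$ so that flipping an attribute amounts to negating it) corrupts a feature matrix under Sy-De and Asy-In noise.

    import numpy as np

    def corrupt_sy_de(X, p, rng):
        # Sy-De: all attributes of an example flip together with probability p.
        flip = rng.random(X.shape[0]) < p           # one coin per example
        return X * np.where(flip, -1, 1)[:, None]   # negate every attribute of flipped rows

    def corrupt_asy_in(X, p_vec, rng):
        # Asy-In: attribute j flips independently with its own probability p_j.
        flip = rng.random(X.shape) < np.asarray(p_vec)[None, :]   # one coin per cell
        return X * np.where(flip, -1, 1)

    rng = np.random.default_rng(0)
    X = rng.choice([-1, 1], size=(5, 3))            # toy sample of binary attributes
    print(corrupt_sy_de(X, 0.3, rng))
    print(corrupt_asy_in(X, [0.1, 0.2, 0.4], rng))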

We consider the ERM framework for classification. A natural choice of loss function is the 0-1 loss, i.e., $\ell_{0\text{-}1}(f(\mathbf{x}), y) = \mathbf{1}_{[\mathrm{sign}(f(\mathbf{x})) \neq y]}$. The Bayes classifier and the Bayes risk are $f^{*} = \arg\min_{f \in \mathcal{F}} R_{D}(f)$ and $R^{*} = R_{D}(f^{*})$, where $R_{D}(f) = \mathbb{E}_{D}[\ell_{0\text{-}1}(f(X), Y)]$. The corresponding quantities for the noisy distribution $\tilde{D}$ are $\tilde{f}^{*}$ and $\tilde{R}^{*}$.

The non-convex nature of the 0-1 loss makes it difficult to optimize, and hence convex upper bounds (surrogate losses) are used in practice. In this work, we consider the squared loss $\ell_{sq}(f(\mathbf{x}), y) = (y - f(\mathbf{x}))^{2}$, a differentiable and convex surrogate loss function. Our restriction of the hypothesis class to linear functions can be interpreted as a form of regularization. The expected squared clean and corrupted risks are $R_{sq,D}(f) = \mathbb{E}_{D}[(Y - f(X))^{2}]$ and $R_{sq,\tilde{D}}(f) = \mathbb{E}_{\tilde{D}}[(Y - f(\tilde{X}))^{2}]$. The hypotheses in $\mathcal{F}_{lin}$ minimizing these clean and corrupted risks are denoted by $f_{sq,D}$ and $f_{sq,\tilde{D}}$ respectively. Next, we define the attribute noise robustness of a risk minimization scheme $A$.

Definition 1.

Let $f_{A,D}$ and $f_{A,\tilde{D}}$ be obtained from the clean and corrupted distributions $D$ and $\tilde{D}$ using any arbitrary scheme $A$. Then, scheme $A$ is said to be attribute noise robust if

$R_{D}(f_{A,\tilde{D}}) = R_{D}(f_{A,D}).$

Also, the loss function used by scheme $A$ is then said to be an attribute noise robust loss function.
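On a finite population or sample, the condition in Definition 1 can be checked directly: learn with the same scheme on clean and corrupted data and compare the clean 0-1 risks of the two classifiers. A minimal sketch of such a check (hypothetical names; fit stands for any scheme $A$, and on samples this is only an empirical proxy for the population quantities in Definition 1):

    import numpy as np

    def clean_01_risk(w, X, y, weights):
        # Clean 0-1 risk of the linear classifier sign(X @ w) under point masses given by weights.
        return float(np.sum(weights * (np.sign(X @ w) != y)))

    def is_attribute_noise_robust(fit, X, y, X_tilde, weights, tol=1e-12):
        # Definition 1: the scheme is robust if both classifiers have the same clean risk.
        w_clean = fit(X, y, weights)         # learnt from the clean data (f_{A,D})
        w_noisy = fit(X_tilde, y, weights)   # learnt from corrupted attributes; labels untouched
        return abs(clean_01_risk(w_clean, X, y, weights)
                   - clean_01_risk(w_noisy, X, y, weights)) <= tol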

Attribute noise robust loss functions

We first consider the Sy-De attribute noise model and present a counter example (Example 1) to show that the 0-1 loss need not be robust to Sy-De attribute noise. To circumvent this problem, we provide a positive result by showing that the squared loss is Sy-De attribute noise robust with origin passing linear classifiers (Theorem 1). Our hypothesis set belongs to $\mathcal{F}_{lin}$, which can be further categorized into origin passing and non-origin passing classifiers ($b = 0$ or $b \neq 0$). Details of the examples and proofs are available in the Supplementary Material (SM).

Example 1.

Consider a population of two data points (in 1-D), each occurring with a specified probability, and linear classifiers of the form $\mathrm{sign}(wx + b)$. The optimal clean classifier and the optimal Sy-De attribute noise corrupted classifier turn out to have different clean 0-1 risks; since these risks differ, the 0-1 loss function need not be Sy-De attribute noise robust.
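Counter examples of this kind can be verified mechanically. The sketch below uses hypothetical data points, probabilities and a noise rate chosen purely for illustration (not the values used in Example 1); it expands a 1-D population under Sy-De noise, grid-searches classifiers of the form sign(wx + b) for minimizers of the clean and corrupted 0-1 risks, and compares their clean risks. With these assumed numbers the two clean risks come out different, mirroring the example.

    import numpy as np
    from itertools import product

    # Hypothetical 1-D population (x, y, probability) and noise rate, for illustration only.
    population = [(+1, +1, 0.7), (-1, -1, 0.3)]
    p = 0.4

    def risk_01(w, b, pop):
        # Population 0-1 risk of the classifier sign(w*x + b).
        return sum(q for x, y, q in pop if np.sign(w * x + b) != y)

    def corrupt_sy_de_1d(pop, p):
        # Sy-De noise in 1-D: the single +/-1 attribute is negated with probability p.
        return ([(x, y, q * (1 - p)) for x, y, q in pop]
                + [(-x, y, q * p) for x, y, q in pop])

    grid = list(product(np.linspace(-1, 1, 41), repeat=2))    # candidate (w, b) pairs
    w_D, b_D = min(grid, key=lambda wb: risk_01(*wb, population))
    w_N, b_N = min(grid, key=lambda wb: risk_01(*wb, corrupt_sy_de_1d(population, p)))
    print("clean 0-1 risk of clean-optimal classifier:", risk_01(w_D, b_D, population))
    print("clean 0-1 risk of noisy-optimal classifier:", risk_01(w_N, b_N, population))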

Theorem 1.

Consider a clean distribution $D$ on $\mathcal{X} \times \mathcal{Y}$ and the Sy-De attribute noise corrupted distribution $\tilde{D}$ on $\mathcal{X} \times \mathcal{Y}$ with noise rate $p < 0.5$. Then, the squared loss with origin passing linear classifiers is Sy-De attribute noise robust, i.e.,

$R_{D}(f_{sq,\tilde{D}}) = R_{D}(f_{sq,D}) \qquad (1)$

where $f_{sq,D}$ and $f_{sq,\tilde{D}}$ correspond to the optimal linear classifiers learnt using the squared loss on the clean ($D$) and corrupted ($\tilde{D}$) distributions respectively.

Remark 1.

Sy-De robustness of the squared loss is an interesting result because, given an attribute noise corrupted dataset, obtaining a linear classifier entails solving only a linear system of equations (demonstrated on UCI datasets in the Experiments section).
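As a concrete illustration of Remark 1, a minimal sketch (assuming attributes and labels coded as ±1): the empirical squared risk minimizer over origin passing linear classifiers is obtained by solving the normal equations of least squares, applied directly to the noise corrupted training matrix.

    import numpy as np

    def fit_squared_loss_linear(X, y):
        # Minimize (1/m) * sum_i (y_i - w.x_i)^2 over origin passing w:
        # solve the linear system X^T X w = X^T y (here via least squares).
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return w

    def predict(w, X):
        return np.sign(X @ w)

    # Usage on Sy-De corrupted training data X_tilde (labels y are unaffected by attribute noise):
    #   w_tilde = fit_squared_loss_linear(X_tilde, y)
    #   test_accuracy = np.mean(predict(w_tilde, X_test) == y_test)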

Now, we consider the Asy-In attribute noise model and show that the 0-1 loss is robust to this noise with non-origin passing classifiers when the feature space is 2-dimensional (Theorem 2). As 0-1 loss based ERM is computationally intractable, we consider the squared loss and present a counter example (Example 2) to show that it need not be Asy-In noise robust.

Theorem 2.

Consider a clean distribution $D$ with arbitrary probabilities on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} = \{-1,+1\}^{2}$ (a population of four data points), and the Asy-In attribute noise corrupted distribution $\tilde{D}$ on $\mathcal{X} \times \mathcal{Y}$ with noise rates $p_{1}$ and $p_{2}$. Then, the 0-1 loss with non-origin passing linear classifiers is Asy-In attribute noise robust, i.e.,

$R_{D}(f_{0\text{-}1,\tilde{D}}) = R_{D}(f_{0\text{-}1,D}) \qquad (2)$

where $f_{0\text{-}1,D}$ and $f_{0\text{-}1,\tilde{D}}$ correspond to the optimal linear classifiers learnt using the 0-1 loss on the clean ($D$) and corrupted ($\tilde{D}$) distributions respectively.

Example 2.

Consider a population of 3 data points (in 2-D), each occurring with a specified probability, and a linear classifier. The optimal clean classifier and the optimal Asy-In attribute noise corrupted classifier learnt using the squared loss turn out to have different clean 0-1 risks; since these risks differ, the squared loss need not be Asy-In attribute noise robust.

Experiments

Figure 1 demonstrates the Sy-De attribute noise robustness of the squared loss on 3 UCI datasets [Dheeru and Karra Taniskidou, 2017]; details are in the SM. As the SPECT dataset is imbalanced, in addition to accuracy we also report the arithmetic mean (AM) of the true positive and true negative rates. To account for the randomness in the noise, results are averaged over 15 trials of train-test partitioning (80-20). The lower accuracy in comparison to the clean classifier can be attributed to the finite samples available for learning the classifiers.

Figure 1: Test data performance of the squared loss based linear classifier with Sy-De attribute noise.
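A minimal sketch of one trial of this evaluation (assuming ±1 coded data, Sy-De corruption of the training attributes only, a clean test set, the least squares classifier of Remark 1, and AM computed as the mean of the true positive and true negative rates; helper names are hypothetical):

    import numpy as np

    def am_score(y_true, y_pred):
        # Arithmetic mean (AM) of true positive rate and true negative rate.
        tpr = np.mean(y_pred[y_true == +1] == +1)
        tnr = np.mean(y_pred[y_true == -1] == -1)
        return 0.5 * (tpr + tnr)

    def one_trial(X, y, p, rng):
        # 80-20 split, Sy-De corruption of training attributes, least squares fit, clean-test scores.
        m = len(y)
        perm = rng.permutation(m)
        tr, te = perm[: int(0.8 * m)], perm[int(0.8 * m):]
        flip = rng.random(len(tr)) < p                      # Sy-De: one coin per training example
        X_tr = X[tr] * np.where(flip, -1, 1)[:, None]
        w, *_ = np.linalg.lstsq(X_tr, y[tr], rcond=None)
        y_hat = np.sign(X[te] @ w)
        return np.mean(y_hat == y[te]), am_score(y[te], y_hat)

    # Accuracy and AM averaged over 15 random trials, e.g. for noise rate 0.2:
    #   scores = [one_trial(X, y, 0.2, np.random.default_rng(t)) for t in range(15)]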

Looking forward

Our work is an initial attempt at binary valued attribute noise; an extension to general discrete valued attributes would be interesting. The Asy-In attribute noise model raises some non-trivial questions w.r.t. the choice of loss function: the robustness of the 0-1 loss in dimensions greater than 2, an explanation for the surprising non-robustness of the squared loss as compared to the robustness of the difficult-to-optimize 0-1 loss, and the search for other surrogate loss functions that are robust. Finally, we believe that the attribute dimension could have a role to play in noise robustness.

References

  • [Dheeru and Karra Taniskidou, 2017] Dheeru, D. and Karra Taniskidou, E. (2017). UCI machine learning repository.
  • [Khoshgoftaar and Van Hulse, 2009] Khoshgoftaar, T. M. and Van Hulse, J. (2009). Empirical case studies in attribute noise detection. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 39(4):379–388.
  • [Quinlan, 1986] Quinlan, J. R. (1986). The effect of noise on concept learning. Machine learning: An artificial intelligence approach, 2:149–166.
  • [Sastry and Manwani, 2016] Sastry, P. S. and Manwani, N. (2016). Robust Learning of Classifiers in the Presence of Label Noise, chapter 6, pages 167–197.
  • [Zhu and Wu, 2004] Zhu, X. and Wu, X. (2004). Class noise vs. attribute noise: A quantitative study. Artificial Intelligence Review, 22(3):177–210.

Supplementary material

Appendix A Proofs

Proof of Theorem 1

Proof.

Consider the squared loss based clean risk of an origin passing linear classifier $f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x}$, given as follows:

$R_{sq,D}(\mathbf{w}) = \mathbb{E}_{D}\big[(Y - \mathbf{w}^{\top}X)^{2}\big].$

We minimize this by differentiating the expectation term and equating it to 0, which gives $\mathbb{E}[XX^{\top}]\,\mathbf{w}_{D} = \mathbb{E}[XY]$.

Now, to obtain the optimal noisy classifier we minimize the following squared loss based expected risk:

$R_{sq,\tilde{D}}(\mathbf{w}) = \mathbb{E}_{\tilde{D}}\big[(Y - \mathbf{w}^{\top}\tilde{X})^{2}\big]. \qquad (3)$

We minimize the corrupted risk given in equation (3) by differentiating and equating it to 0. Writing the Sy-De corruption of the $\{-1,+1\}$-valued attributes as $\tilde{X} = SX$, where $S$ is independent of $(X, Y)$ and takes the value $-1$ with probability $p$ and $+1$ otherwise, we have $\mathbb{E}[\tilde{X}\tilde{X}^{\top}] = \mathbb{E}[XX^{\top}]$ and $\mathbb{E}[\tilde{X}Y] = (1-2p)\,\mathbb{E}[XY]$, so that $\mathbb{E}[XX^{\top}]\,\mathbf{w}_{\tilde{D}} = (1-2p)\,\mathbb{E}[XY]$, i.e., $\mathbf{w}_{\tilde{D}} = (1-2p)\,\mathbf{w}_{D}$.

We can see that the noisy classifier is just a scaled version of the classifier, $\mathbf{w}_{D}$, obtained from the clean risk. Since, for attribute noise robustness, it is sufficient that $f_{sq,\tilde{D}} = c\, f_{sq,D}$ for some $c > 0$, the aforementioned observation proves that the squared loss function is Sy-De attribute noise robust when $p < 0.5$. ∎

Proof of Theorem 2

Proof.

We prove the robustness of the 0-1 loss in 2 dimensions by taking an exhaustive search approach. Even though we consider a particular non-uniform distribution over the four data points in the population and particular values of the point probabilities and noise rates, the following claims hold for any arbitrary values of the probabilities and any pair of noise rates. In the population, there are $2^{4} = 16$ ways in which the four data points can be assigned to two classes. By symmetry, we can restrict ourselves to just the 4 cases considered below.

The probabilities of the four points $(+1,+1)$, $(+1,-1)$, $(-1,+1)$ and $(-1,-1)$ in the distribution are taken to be particular values $q_{1}, q_{2}, q_{3}$ and $q_{4}$ respectively. Here, $q_{1} + q_{2} + q_{3} + q_{4} = 1$. The noise rates of the first and second attribute are $p_{1}$ and $p_{2}$. We consider a classifier of the form $\mathrm{sign}(w_{1} x_{1} + w_{2} x_{2} + b)$, where $w_{1}, w_{2}, b \in \mathbb{R}$.

The clean 0-1 risk, denoted by $R_{D}$, is given as follows, where $(x_{i1}, x_{i2})$ is the $i$-th point and $y_{i}$ its label:

$R_{D}(w_{1}, w_{2}, b) = \sum_{i=1}^{4} q_{i}\, \mathbf{1}\big[\mathrm{sign}(w_{1} x_{i1} + w_{2} x_{i2} + b) \neq y_{i}\big].$

The noisy 0-1 risk, denoted by $R_{\tilde{D}}$, is obtained analogously by replacing each point $(x_{i1}, x_{i2})$ with its four flip patterns $(\pm x_{i1}, \pm x_{i2})$, weighted by the corresponding products of $p_{1}$ or $1-p_{1}$ and $p_{2}$ or $1-p_{2}$.

To find the minimizers of the clean and corrupted risks $R_{D}$ and $R_{\tilde{D}}$, we plot them (in MATLAB) as functions of the classifier parameters. In the 4 cases considered below, we observed that even though the minimum values of $R_{D}$ and $R_{\tilde{D}}$ are different, they have the same minimizers. This implies that the clean 0-1 risk corresponding to the optimal clean and noisy classifiers is the same, and hence the condition for attribute noise robustness in Definition 1 is satisfied.
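The same search can also be reproduced numerically. A sketch of it in Python (rather than the MATLAB plots used here), with placeholder labelling, point probabilities and noise rates that should be set to the values of the case at hand:

    import numpy as np
    from itertools import product

    points = [(+1, +1), (+1, -1), (-1, +1), (-1, -1)]
    labels = {pt: +1 if i < 2 else -1 for i, pt in enumerate(points)}   # one of the 4 cases
    probs = dict(zip(points, [0.4, 0.3, 0.2, 0.1]))                     # placeholder probabilities
    p1, p2 = 0.12, 0.23                                                 # placeholder noise rates

    def clean_risk(w1, w2, b):
        return sum(probs[pt] for pt in points
                   if np.sign(w1 * pt[0] + w2 * pt[1] + b) != labels[pt])

    def noisy_risk(w1, w2, b):
        r = 0.0
        for pt in points:
            for f1, f2 in product([1, -1], repeat=2):                   # Asy-In: independent flips
                q = probs[pt] * (p1 if f1 == -1 else 1 - p1) * (p2 if f2 == -1 else 1 - p2)
                if np.sign(w1 * f1 * pt[0] + w2 * f2 * pt[1] + b) != labels[pt]:
                    r += q
        return r

    grid = list(product(np.linspace(-1, 1, 21), repeat=3))              # candidate (w1, w2, b)
    wc = min(grid, key=lambda v: clean_risk(*v))
    wn = min(grid, key=lambda v: noisy_risk(*v))
    print("clean 0-1 risk of the clean-risk minimizer:", clean_risk(*wc))
    print("clean 0-1 risk of the noisy-risk minimizer:", clean_risk(*wn))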

Next, we provide the details for each case.

Case 1: The set of classifiers obtained by minimizing the clean 0-1 risk is given in Figure 2.

Figure 2: Clean 0-1 risk for case 1.

Now, the set of classifiers obtained by minimizing the noisy 0-1 risk is given in Figure 3.

Figure 3: Noisy 0-1 risk for case 1.

We can see that there is a classifier which minimizes the clean as well as the noisy risk.

Case 2: The set of classifiers obtained by minimizing the clean 0-1 risk is given in Figure 4.

Figure 4: Clean 0-1 risk for case 2.

Now, the set of classifiers obtained by minimizing the noisy 0-1 risk is given in Figure 5.

Figure 5: Noisy 0-1 risk for case 2.

We can see that there is a classifier which minimizes the clean as well as the noisy risk.

Case 3: The set of classifiers obtained by minimizing the clean 0-1 risk is given in Figure 6.

Figure 6: Clean 0-1 risk for case 3.

Now, the set of classifiers obtained by minimizing the noisy 0-1 risk is given in Figure 7.

Figure 7: Noisy 0-1 risk for case 3.

We can see that there is a classifier which minimizes the clean as well as the noisy risk.

Case 4: The set of classifiers obtained by minimizing the clean 0-1 risk is given in Figure 8.

Figure 8: Clean 0-1 risk for case 4.

Now, the set of classifiers obtained by minimizing the noisy 0-1 risk is given in Figure 9.

Figure 9: Noisy 0-1 risk for case 4.

We can see that there is a classifier which minimizes the clean as well as the noisy risk.

As seen in the above four cases, we conclude that the sets of minimizers of the clean 0-1 risk and the corrupted 0-1 risk are the same. Hence, the classifiers learnt by minimizing the 0-1 risk on noisy distributions are Asy-In attribute noise robust with non-origin passing linear classifiers. ∎

Appendix B Details of counter examples

Counter example 1

Consider a population of two data points of the form $(x, y)$ in 1 dimension, each occurring with a specified probability, and consider a linear classifier of the form $\mathrm{sign}(wx + b)$. The clean 0-1 risk is then a function of $w$ and $b$.

Minimizing it w.r.t. $w$ and $b$ gives us the optimal clean classifier and the optimal clean 0-1 risk.

Next, we consider the Sy-De attribute noise corrupted 0-1 risk with noise rate $p$.

Minimizing this corrupted risk w.r.t. $w$ and $b$ gives us the optimal noisy classifier and its clean 0-1 risk.

Clearly, the clean 0-1 risk of the noisy-optimal classifier differs from that of the clean-optimal classifier, implying that the 0-1 loss need not be Sy-De attribute noise robust.

Counter example 2

Consider a population of 3 data points of the form $(x_{1}, x_{2}, y)$ in 2 dimensions, each occurring with a specified probability. We consider a linear classifier; the clean squared loss based risk is then a quadratic function of the classifier parameters.

Minimizing it w.r.t. the classifier parameters leads to a system of linear equations.

Solving this system of equations gives us the optimal squared loss based clean classifier and its clean 0-1 risk.

Next, we consider the Asy-In attribute noise corrupted squared risk in terms of the noise rates $p_{1}$ and $p_{2}$.

Minimizing it w.r.t. the classifier parameters leads to another system of linear equations.

Now, taking particular values for the noise rates $p_{1}$ and $p_{2}$ and solving the above system of equations gives us the optimal squared loss based noisy linear classifier and its clean 0-1 risk.

Clearly, the two clean 0-1 risks differ, implying that the squared loss need not be Asy-In attribute noise robust.
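Counter examples of this form can also be verified numerically: the population squared risk minimizers under the clean and the Asy-In corrupted distributions solve the corresponding normal equations, where the corrupted moments are computed exactly by enumerating the four flip patterns of each point. A sketch with placeholder points, probabilities and noise rates chosen purely for illustration (with these particular numbers the two clean 0-1 risks also come out different):

    import numpy as np
    from itertools import product

    # Placeholder 2-D population (x1, x2, y, probability) and Asy-In noise rates, for illustration.
    pop = [(+1, +1, +1, 0.5), (+1, -1, -1, 0.3), (-1, +1, -1, 0.2)]
    p1, p2 = 0.3, 0.45

    def lstsq_minimizer(atoms):
        # Population least squares minimizer w of E[(y - w.phi(x))^2] with phi(x) = (x1, x2, 1).
        A, b = np.zeros((3, 3)), np.zeros(3)
        for x1, x2, y, q in atoms:
            phi = np.array([x1, x2, 1.0])
            A += q * np.outer(phi, phi)
            b += q * y * phi
        return np.linalg.solve(A, b)

    def corrupt_asy_in(atoms, p1, p2):
        out = []
        for x1, x2, y, q in atoms:
            for f1, f2 in product([1, -1], repeat=2):    # enumerate the four flip patterns
                w = q * (p1 if f1 == -1 else 1 - p1) * (p2 if f2 == -1 else 1 - p2)
                out.append((f1 * x1, f2 * x2, y, w))
        return out

    def clean_01_risk(w, atoms):
        return sum(q for x1, x2, y, q in atoms if np.sign(w @ np.array([x1, x2, 1.0])) != y)

    w_D = lstsq_minimizer(pop)
    w_Dt = lstsq_minimizer(corrupt_asy_in(pop, p1, p2))
    print("clean 0-1 risk, learnt on clean data    :", clean_01_risk(w_D, pop))
    print("clean 0-1 risk, learnt on corrupted data:", clean_01_risk(w_Dt, pop))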

Appendix C UCI dataset details

In this section, we provide some details on the way we processed the datasets to obtain the experimental results. The number of features and the number of data points (with the numbers of positively and negatively labelled points given separately) are provided in Table 1. We provide individual pre-processing details for each dataset as follows:

  • Vote: This is a voting dataset with an original size of 435 data points. Since a missing entry in a cell meant that the person took a neutral stand, we removed such instances to fit the framework of binary valued attributes and finally used 232 data points. The two classes correspond to "Democrat" and "Republican".

  • SPECT: This dataset has information extracted from cardiac SPECT images, with attribute and label values 0 and 1. To be consistent with the $\{-1,+1\}$ binary format, we replaced all 0's by $-1$, without loss of generality, in both attributes and labels.

  • KR-vs-KP: This dataset originally has 36 features, but we removed feature number 15 as it had three categorical values, and finally used the dataset with 35 attributes. The attribute values are processed as follows: "f" replaced by "+1", "t" replaced by "-1", "n" replaced by "+1", "g" replaced by "+1", "l" replaced by "-1". The two classes correspond to "Won" and "Nowin".

S.no  Dataset name  n   m     (m+, m-)
1     Vote          16  232   (124, 108)
2     SPECT         22  267   (212, 55)
3     KR-vs-KP      35  3196  (1569, 1527)
Table 1: Details about the number of features (n), the total number of data points (m), the number of positively labelled data points (m+) and the number of negatively labelled data points (m-).

In Tables 2 and 3, we present the values used to generate the plots in Figure 1. The results are averaged over 15 trials. We observe that at high noise rates the theoretically proven robustness of the squared loss does not show up, because only finite samples are available.

Dataset   Vote                   SPECT                  KR-vs-KP
(m, n)    (232 (124,108), 16)    (267 (212,55), 22)     (3196 (1569,1527), 35)
p         Mean     SD            Mean     SD            Mean     SD
clean     0.969    0.0218        0.732    0.0589        0.940    0.0109
0         0.966    0.0203        0.693    0.0746        0.938    0.0096
0.1       0.955    0.0363        0.680    0.0681        0.924    0.0106
0.2       0.865    0.0839        0.638    0.0797        0.900    0.0150
0.3       0.821    0.0941        0.625    0.0664        0.866    0.0193
0.35      0.694    0.0919        0.579    0.0955        0.795    0.0263
0.4       0.573    0.1041        0.505    0.0793        0.706    0.0393
Table 2: Average (Mean) test accuracy along with standard deviation (SD) over 15 trials obtained by using the squared loss based linear classifier learnt on Sy-De (noise rate p) attribute noise corrupted data; the row "clean" corresponds to the classifier learnt on uncorrupted data. Even though the squared loss is theoretically shown to be Sy-De noise robust, it does not show good performance at high noise rates. This could be because the result in Theorem 1 is in expectation; in particular, the finite sample size starts showing its effect at high noise rates and the performance deteriorates.
Dataset   Vote                   SPECT                  KR-vs-KP
(m, n)    (232 (124,108), 16)    (267 (212,55), 22)     (3196 (1569,1527), 35)
p         Mean     SD            Mean     SD            Mean     SD
clean     0.970    0.0207        0.665    0.0639        0.940    0.0111
0         0.967    0.0193        0.613    0.1214        0.937    0.0097
0.1       0.956    0.0373        0.605    0.1338        0.923    0.0109
0.2       0.868    0.0839        0.568    0.1353        0.899    0.0152
0.3       0.823    0.0950        0.586    0.1188        0.866    0.0186
0.35      0.698    0.0903        0.544    0.1542        0.794    0.0261
0.4       0.575    0.1045        0.540    0.1472        0.706    0.0399
Table 3: Average (Mean) test AM value along with standard deviation (SD) over 15 trials obtained by using the squared loss based linear classifier learnt on Sy-De (noise rate p) attribute noise corrupted data; the row "clean" corresponds to the classifier learnt on uncorrupted data. Due to the imbalanced nature of the SPECT dataset, AM is a more suitable evaluation metric.

Appendix D Additional examples

Example 3.

This is another example which demonstrates that the 0-1 loss is robust to Asy-In attribute noise with non-origin passing linear classifiers. Let the input data be two dimensional and the clean training set be a small set of labelled points, uniformly distributed.

Let the flipping probabilities of the first and second component be $p_{1}$ and $p_{2}$ respectively. Let us consider the loss function to be the 0-1 loss and the classifier to be of the form $\mathrm{sign}(w_{1} x_{1} + w_{2} x_{2} + b)$. We calculate the clean 0-1 risk as a function of the classifier parameters.

We get a range of parameter values attaining the minimum; we can choose one of them, which gives a clean risk of 0, as the optimal clean classifier.

Now, for the corrupted case, we minimize the noisy 0-1 risk to obtain the optimal noisy classifier.

This minimizer is calculated by plotting the noisy risk (in MATLAB), as shown in Figure 10. Here, the values of $p_{1}$ and $p_{2}$ are taken to be 0.12 and 0.23 respectively. The same pattern is observed for all values of $p_{1}$ and $p_{2}$.

Figure 10: The noisy 0-1 risk is plotted on the vertical axis against the classifier parameters on the other two axes, as we are looking for the minimizers of the noisy risk.

The minimum value is attained at a particular pair of parameter values, which gives the optimal noisy classifier.

Comparing the clean 0-1 risks of the optimal clean and noisy classifiers, we observe that they are equal (both 0), and hence the 0-1 loss function is attribute noise robust in this case.
