# Label Sanitization against Label Flipping Poisoning Attacks

## Abstract

Many machine learning systems rely on data collected in the wild from untrusted sources, exposing the learning algorithms to data poisoning. Attackers can inject malicious data into the training dataset to subvert the learning process, compromising the performance of the algorithm and producing errors in a targeted or an indiscriminate way. Label flipping attacks are a special case of data poisoning, where the attacker can control the labels assigned to a fraction of the training points. Even though the capabilities of the attacker are constrained, these attacks have been shown to significantly degrade the performance of the system. In this paper we propose an efficient algorithm to perform optimal label flipping poisoning attacks and a mechanism to detect and relabel suspicious data points, mitigating the effect of such poisoning attacks.

Keywords: adversarial machine learning, poisoning attacks, label flipping attacks, label sanitization

## 1 Introduction

Many modern services and applications rely on data-driven approaches that use machine learning technologies to extract valuable information from the data received, provide advantages to the users, and allow the automation of many processes. However, machine learning systems are vulnerable, and attackers can gain a significant advantage by compromising the learning algorithms. Attackers can learn the blind spots and weaknesses of an algorithm and then either manipulate samples at test time to evade detection or inject malicious data into the training set to poison the learning algorithm [huang]. These attacks have already been reported in the wild against antivirus engines, spam filters, and systems that aim to detect fake profiles or news in social networks.

Poisoning attacks are considered one of the most relevant and emerging security threats for data-driven technologies [joseph], especially in cases where the data is untrusted, for example in IoT environments, sensor networks, applications that rely on the collection of users’ data, or settings where the labelling is crowdsourced from a set of untrusted annotators. Related work in adversarial machine learning has shown that optimal poisoning attacks can degrade the performance of popular machine learning classifiers, including Support Vector Machines (SVMs) [biggioSVM], embedded feature selection methods [xiao], and neural networks and deep learning systems [luis], by compromising a small fraction of the training dataset. These attacks assume that the attacker can manipulate both the features and the labels of the poisoning points. For some applications this is not possible, and the attacker’s capabilities are constrained to the manipulation of the labels. These are known as *label flipping attacks*. Even though these attacks are more constrained, they can still significantly degrade the performance of learning algorithms, including deep learning [zhang].

Few defensive mechanisms have been proposed against poisoning attacks. In [nelson] the authors propose an algorithm that evaluates the impact of each training sample on the performance of the learning algorithm. Although this can be effective in some cases, the algorithm does not scale well to large datasets. In [andrea], an outlier detection scheme is proposed to identify and remove suspicious samples. Although this defensive algorithm successfully mitigates the effect of optimal poisoning attacks, its ability to defend against label flipping attacks is limited.

In this paper we first propose an algorithm to perform label flipping poisoning attacks. The optimal formulation of the problem for the attacker is computationally intractable, so we have developed a heuristic that crafts effective label flipping attacks at a reduced computational cost. We also propose a defensive mechanism that mitigates the effect of label flipping attacks through label sanitization. We have developed an algorithm based on *k*-Nearest-Neighbours (*k*-NN) to detect malicious samples or data points that have a negative impact on the performance of machine learning classifiers. We empirically show the effectiveness of our algorithm in mitigating the effect of label flipping attacks on a linear classifier for 3 real datasets.

## 2 Related Work

Optimal poisoning attacks against machine learning classifiers can be formulated as a bi-level optimization problem where the attacker aims to inject malicious points into the training set that maximize some objective function (e.g. increase the overall test classification error), while, at the same time, the defender learns the parameters of the algorithm by minimizing some loss function evaluated on the tainted dataset. This strategy has been proposed against popular binary classification algorithms such as SVMs [biggioSVM], logistic regression [mei], and embedded feature selection [xiao]. An extension to multi-class classifiers was proposed in [luis], where the authors also devised an efficient algorithm to compute the poisoning points through back-gradient optimization, which makes it possible to poison a broader range of learning algorithms, including neural networks and deep learning systems. An approximation to optimal poisoning attacks was proposed in [koh], where the authors provide a mechanism to detect the most influential training points. The authors in [zhang] showed that deep networks are vulnerable to (random) label noise. In [biggioLabel], a more advanced label flipping poisoning attack strategy is proposed against two-class SVMs, where the attacker selects the subset of training points that, when their labels are flipped, maximizes the error evaluated on a separate validation set.

On the defender’s side, the authors of [nelson] propose to measure the impact of each training example on the classifier’s performance to detect poisoning points. Examples that negatively affect the performance are then discarded. Although effective in some cases, the algorithm scales poorly with the number of samples. In the same spirit, a more scalable approach is proposed in [koh] through the use of influence functions, where the algorithm aims to identify the impact of the training examples on the training cost function without retraining the model.

Optimal poisoning attack strategies usually overlook detectability constraints. Thus, the attack points can differ significantly from the genuine ones, and outlier detection has been shown to be effective at mitigating the effect of some poisoning attacks. In [steinhardt], the authors approximate a data-dependent upper bound on the performance of the learner under data poisoning with an online learning algorithm, assuming that some data sanitization is performed before training. In contrast, in [andrea], an outlier detection scheme is proposed to defend against data poisoning. Although the experimental evaluation supports the validity of this approach to mitigate optimal poisoning attacks, the capabilities of the algorithm to reduce the effect of more constrained attack strategies are limited.

## 3 Label Flipping Attacks

In a poisoning attack, the adversary injects malicious examples into the training dataset to influence the behaviour of the learning algorithm according to some arbitrary goal defined by the attacker. Typically, adversarial training examples are designed to maximize the error of the learned classifier. In line with most of the related work, in this paper we only consider binary classification problems. We restrict our analysis to *worst-case* scenarios, where the attacker has perfect knowledge about the learning algorithm, the loss function that the defender is optimizing, the training data, and the set of features used by the learning algorithm. Additionally, we assume that the attacker has access to a separate validation set, drawn from the same data distribution as the defender’s training and test sets. Although unrealistic for practical scenarios, these assumptions allow us to provide a worst-case analysis of the performance and the robustness of the learning algorithm when it is under attack. This is especially useful for applications that require certain levels of assurance on the performance of the system.

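We consider the problem of learning a binary linear classifier over a domain $\mathcal{X} \subseteq \mathbb{R}^d$ with labels in $\mathcal{Y} = \{-1, +1\}$. We assume that the classifiers are parametrized by $\mathbf{w} \in \mathbb{R}^d$, such that the output of the classifier is given by $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^T \mathbf{x})$. We assume the learner to have access to an i.i.d. training dataset $\mathcal{D}_{tr} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ drawn from an unknown distribution over $\mathcal{X} \times \mathcal{Y}$.

In a label flipping attack, the attacker’s goal is to find a subset of $p$ examples in $\mathcal{D}_{tr}$ such that, when their labels are flipped, some arbitrary objective function for the attacker is maximized. For the sake of simplicity, we assume that the objective of the attacker is to maximize the loss function, $\mathcal{L}(\mathbf{w}, \mathcal{D}_{val})$, evaluated on a separate validation dataset $\mathcal{D}_{val}$. Then, let $\mathbf{u} \in \{0, 1\}^n$ such that $\|\mathbf{u}\|_0 = p$, and let $\mathcal{D}_p = \{(\mathbf{x}_i, y'_i)\}_{i=1}^{n}$ be a set of examples defined such that $y'_i = -y_i$ if $u_i = 1$, and $y'_i = y_i$ otherwise. Thus, $\mathbf{u}$ is an indicator vector that specifies the samples whose labels are flipped, and $\mathcal{D}_p$ denotes the training dataset after those label flips. We can formulate the optimal label flipping attack strategy as the following bi-level optimization problem:

$$
\mathbf{u}^* \in \arg\max_{\mathbf{u} \in \{0,1\}^n,\ \|\mathbf{u}\|_0 = p} \mathcal{L}(\mathbf{w}, \mathcal{D}_{val}) \quad \text{s.t.} \quad \mathbf{w} = \mathcal{A}_{\mathcal{L}}(\mathcal{D}_p) \tag{1}
$$

where the parameters $\mathbf{w}$ are the result of a learning algorithm $\mathcal{A}_{\mathcal{L}}$ that aims to optimize a loss function $\mathcal{L}$ on the poisoned training set $\mathcal{D}_p$.^{1}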

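Algorithm 1: Label flipping attack

Input: training set $\mathcal{D}_{tr} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, validation set $\mathcal{D}_{val}$, # of examples to flip $p$.

Initialize: $\mathbf{u} = \mathbf{0}$, $\mathcal{D}_p = \mathcal{D}_{tr}$, $\mathcal{I} = [1, \ldots, n]$ (indices of the candidate examples).

- for $k = 1$ to $p$:
  - for $j = 1$ to $|\mathcal{I}|$:
    - $e(j) = \mathcal{L}(\mathcal{A}_{\mathcal{L}}(\mathcal{D}'), \mathcal{D}_{val})$, where $\mathcal{D}'$ is $\mathcal{D}_p$ with the label of example $\mathcal{I}(j)$ flipped
  - $i_k = \arg\max_{j} e(j)$, $u_{\mathcal{I}(i_k)} = 1$; remove $\mathcal{I}(i_k)$ from $\mathcal{I}$ and update $\mathcal{D}_p$ according to $\mathbf{u}$

Output: poisoned training set $\mathcal{D}_p$, flips $\mathbf{u}$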
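The greedy structure of the attack (an outer loop over the flips, an inner loop over the remaining candidates) can be illustrated with a short Python sketch. This is an illustration under stated assumptions, not the authors’ implementation: the learner `train_linear` (plain SGD on the hinge loss, no bias term), its hyperparameters `lr`, `epochs` and `seed`, and all function names are hypothetical choices made for this example.

```python
import random


def hinge_loss(w, X, y):
    """Average hinge loss of the linear model w on the dataset (X, y)."""
    total = 0.0
    for x, t in zip(X, y):
        margin = t * sum(wi * xi for wi, xi in zip(w, x))
        total += max(0.0, 1.0 - margin)
    return total / len(X)


def train_linear(X, y, lr=0.1, epochs=50, seed=0):
    """Stand-in learner: SGD on the hinge loss (hyperparameters are assumptions)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * sum(wi * xi for wi, xi in zip(w, X[i]))
            if margin < 1.0:  # subgradient step only when the margin is violated
                w = [wi + lr * y[i] * xi for wi, xi in zip(w, X[i])]
    return w


def greedy_label_flip(X_tr, y_tr, X_val, y_val, p):
    """Flip p training labels one at a time, each time committing the flip
    that maximizes the validation loss of the retrained model."""
    y_pois = list(y_tr)
    candidates = list(range(len(y_tr)))
    flipped = []
    for _ in range(p):
        best_idx, best_loss = None, float("-inf")
        for j in candidates:
            y_pois[j] = -y_pois[j]  # tentatively flip label j
            loss = hinge_loss(train_linear(X_tr, y_pois), X_val, y_val)
            y_pois[j] = -y_pois[j]  # undo the tentative flip
            if loss > best_loss:
                best_idx, best_loss = j, loss
        y_pois[best_idx] = -y_pois[best_idx]  # commit the best flip
        candidates.remove(best_idx)
        flipped.append(best_idx)
    return y_pois, flipped
```

With $n$ training points and $p$ flips, this sketch retrains the model at most $n \cdot p$ times, whereas an exhaustive search over all subsets of $p$ labels would need $\binom{n}{p}$ retrainings.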

## 4 Defence against Label Flipping Attacks

We can expect aggressive label flipping strategies, such as the one described in Sect. 3, to flip the labels of points that are far from the decision boundary in order to maximize the impact of the attack. Many of these poisoning points will therefore be far from the genuine points with the same label, and can be considered outliers.

To mitigate the effect of label flipping attacks we propose a mechanism to relabel points that are suspected of being malicious. The algorithm uses *k*-NN to assign the label to each instance in the training set. The goal is to enforce label homogeneity between instances that are close, especially in regions that are far from the decision boundary. The procedure is described in Algorithm 2. For each sample in the (possibly tainted) training set we find its $k$ nearest neighbours, using the Euclidean distance.^{2} If the fraction of those neighbours that share the most common label reaches a confidence threshold $\eta$, the sample is relabelled with that label; otherwise its label is kept.

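Algorithm 2: *k*-NN-based label sanitization

Parameters: $k$, $\eta$.

Input: training set $\mathcal{D}_{tr} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$.

- for $i = 1$ to $n$:
  - $\mathcal{S}_i = k$-NN$(\mathbf{x}_i, \mathcal{D}_{tr})$, the $k$ nearest neighbours of $\mathbf{x}_i$
  - $c_i$ = fraction of the points in $\mathcal{S}_i$ that carry the most common label, $\mathrm{mode}(\mathcal{S}_i)$
  - if $c_i \geq \eta$: $y'_i = \mathrm{mode}(\mathcal{S}_i)$
  - else: $y'_i = y_i$

Output: $\mathcal{D}'_{tr} = \{(\mathbf{x}_i, y'_i)\}_{i=1}^{n}$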

Poisoning points that are far from the decision boundary are likely to be relabelled, mitigating their malicious effect on the performance of the classifier. Although the algorithm can also relabel genuine points, for example in regions where the two classes overlap (especially for values of $\eta$ close to 0.5), we can expect a similar fraction of genuine samples to be relabelled in the two classes, so the label noise introduced by Algorithm 2 should be similar for the two classes. The performance of the classifier should therefore not be significantly affected by the application of our relabelling mechanism. Note that the algorithm is also applicable to multi-class classification problems, although in our experimental evaluation in Sect. 5 we only consider two-class classification.
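The relabelling rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors’ code: the function name `knn_sanitize`, the parameter names `k` and `eta`, and the choice to exclude each point from its own neighbourhood are assumptions made for this example. Neighbours are always looked up with the original labels, so the result does not depend on the order in which the points are visited.

```python
from collections import Counter


def knn_sanitize(X, y, k, eta):
    """Relabel each point with the majority label of its k nearest neighbours
    (Euclidean distance, the point itself excluded) when that majority reaches
    the confidence threshold eta; otherwise keep the original label."""
    relabelled = []
    for i, x in enumerate(X):
        # Squared Euclidean distance to every other point (same ordering as
        # the Euclidean distance, so the square root is not needed).
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, X[j])), j)
            for j in range(len(X))
            if j != i
        )
        neighbour_labels = [y[j] for _, j in dists[:k]]
        label, count = Counter(neighbour_labels).most_common(1)[0]
        confidence = count / len(neighbour_labels)
        relabelled.append(label if confidence >= eta else y[i])
    return relabelled


# Two well-separated clusters with one flipped label in each.
X = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3]]
y = [1, 1, -1, 1, -1, -1, 1, -1]
clean = knn_sanitize(X, y, k=3, eta=0.6)
```

In this toy example the two flipped labels sit inside otherwise homogeneous clusters, so both are restored. A larger `eta` makes the rule more conservative, since a stronger majority is needed before a label is changed.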

## 5 Experiments

We evaluated the performance of our label flipping attack and the proposed defence on 3 real datasets from the UCI repository:^{3} *BreastCancer*, *MNIST*, and *Spambase*, which are common benchmarks for classification tasks. The characteristics of the datasets are described in Table 1. Similar to [biggioSVM, luis], for *MNIST*, a multi-class problem for handwritten digit recognition, we transformed the problem into a two-class classification task, aiming at distinguishing digits 1 and 7. As the classifier, we used a linear classifier that aims to minimize the expected *hinge loss*, $\ell(\mathbf{w}, (\mathbf{x}, y)) = \max\{0, 1 - y\,\mathbf{w}^T \mathbf{x}\}$. We learned the parameters with stochastic gradient descent.
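For reference, one step of stochastic (sub)gradient descent on the hinge loss can be written out explicitly. The expression below is the standard textbook update, given here as an illustration rather than as the exact training procedure used in the experiments; the symbol $\alpha$ for the learning rate is an assumption introduced for this example. For a training point $(\mathbf{x}_i, y_i)$ with $y_i \in \{-1, +1\}$:

$$
\mathbf{w} \leftarrow
\begin{cases}
\mathbf{w} + \alpha\, y_i\, \mathbf{x}_i & \text{if } y_i\, \mathbf{w}^T \mathbf{x}_i < 1, \\
\mathbf{w} & \text{otherwise.}
\end{cases}
$$

Only points that violate the margin move the parameters, which is why a flipped label far from the decision boundary (a large margin violation on every pass) can pull $\mathbf{w}$ strongly in the wrong direction.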

| Name | # Features | # Examples | # Positive/Negative |
|---|---|---|---|
| BreastCancer | 30 | 569 | 212/357 |
| MNIST (1 vs 7) | 784 | 13,007 | 6,742/6,265 |
| SpamBase | 54 | 4,100 | 1,657/2,443 |

In our first experiment we evaluated the effectiveness of the label flipping attack described in Algorithm 1 to poison a linear classifier. We also assessed the performance of our defensive strategy in Algorithm 2 to mitigate the effect of this attack. For each dataset we created 10 random splits with points for training, for validation, and the rest for testing. For the learning algorithm we set the learning rate to and the number of epochs to . For the defensive algorithm, we set the confidence parameter $\eta$ to and selected the number of neighbours $k$ according to the performance of the algorithm evaluated on the validation dataset. We assume that the attacker does not have access to the validation data, so it cannot be poisoned. In practice, this requires the defender to have a small trusted validation dataset, which is reasonable for many applications. Note that typical poisoning scenarios arise when retraining the machine learning system using data collected in the wild, but small fractions of data points can be curated before the system is deployed.

From the experimental results in Fig. 1 we observe the effectiveness of the label flipping attack in degrading the performance of the classifier on the 3 datasets (when no defence is applied). Thus, after 20% of poisoning, the average classification error increases by a factor of , , and respectively for *BreastCancer*, *MNIST*, and *Spambase*. In Fig. 1 we also show that our defensive technique effectively mitigates the effect of the attack: the performance with 20% of poisoning points is similar to the performance on the clean dataset on *BreastCancer* and *Spambase*, and we observe only a very slight degradation of the performance on *MNIST*. When no attack is performed, we observe that our defensive strategy slightly degrades the performance of the classifier (compared to the case where no defence is applied). This can be due to the label noise introduced by the algorithm, which can relabel some genuine data points. However, this small loss in performance can be an acceptable price for a more secure machine learning system.
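Selecting the number of neighbours on a trusted validation set amounts to a small model-selection loop. The sketch below shows one way to organise it. It is an illustration only: `select_k` and its three callable arguments (`sanitize`, `train`, `val_error`) are hypothetical names, standing in for the label sanitization step, the learning algorithm, and the classification error on the trusted validation data.

```python
def select_k(candidate_ks, sanitize, train, val_error):
    """Return the k (and its validation error) that gives the lowest error
    on the trusted validation set.

    sanitize(k)      -> relabelled training set for that k
    train(dataset)   -> model fitted on the relabelled training set
    val_error(model) -> classification error on the trusted validation set
    """
    best_k, best_err = None, float("inf")
    for k in candidate_ks:
        model = train(sanitize(k))
        err = val_error(model)
        if err < best_err:  # ties keep the first (smallest listed) k
            best_k, best_err = k, err
    return best_k, best_err
```

Because the validation data is assumed to be out of the attacker’s reach, the error measured in this loop is not distorted by the flipped labels in the training set.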

In Fig. 2 we show the sensitivity of Algorithm 2 to the parameters $k$ and $\eta$. We report the average test classification error on the *BreastCancer* dataset for different configurations of our defensive strategy. In Fig. 2(a) we show the sensitivity of the algorithm to the number of neighbours $k$, setting the value of $\eta$ to . We observe that for bigger values of $k$ the algorithm performs better when the fraction of poisoning points is large, and the performance degrades more gracefully as the number of poisoning points increases. However, for smaller fractions of poisoning points, or when no attack is performed, smaller values of $k$ yield a slightly better classification error. In Fig. 2(b) we observe that Algorithm 2 is more sensitive to the confidence threshold $\eta$. For bigger values of $\eta$ the defence is less effective at mitigating label flipping attacks, since we can expect fewer points to be relabelled. Small values of $\eta$ show a more graceful degradation with the fraction of poisoning points, although the performance when no attack is present is slightly worse.

## 6 Conclusion

In this paper we propose a label flipping poisoning attack strategy that is effective at compromising machine learning classifiers. We also propose a defence mechanism based on *k*-NN to achieve label sanitization, aiming to detect malicious poisoning points. We empirically showed the significant degradation of performance produced by the proposed attack on linear classifiers, as well as the effectiveness of the proposed defence in mitigating the effect of such label flipping attacks. Future work will include the investigation of similar defensive strategies for less aggressive attacks, where the attacker considers detectability constraints. Similar to [vittorio], we will also consider cases where the attack points collude towards the same objective, where more advanced techniques are required to detect malicious points and defend against these attacks.

## References

### Footnotes

1. For simplicity we assume that the attacker aims to maximize the average loss on a separate validation dataset.
2. Any other distance, such as the Hamming distance, can be applied, depending on the set of features used.
3. https://archive.ics.uci.edu/ml/datasets.html