The Tradeoff Between Privacy and Accuracy in Anomaly Detection
Using Federated XGBoost
Privacy has raised considerable concerns recently, especially with the advent of the information explosion and the numerous data mining techniques that explore the information hidden in large volumes of data. In this context, a new distributed learning paradigm termed federated learning has recently become prominent for tackling privacy issues in distributed learning: only learning models, rather than users' own data, are transmitted from the distributed nodes to servers, thereby protecting user privacy.
In this paper, we propose a horizontal federated XGBoost algorithm to solve the federated anomaly detection problem, where anomaly detection aims to identify abnormalities in extremely unbalanced datasets and can be viewed as a special classification problem. Our federated XGBoost algorithm incorporates data aggregation and sparse federated update processes to balance the tradeoff between privacy and learning performance. In particular, we introduce virtual data samples by aggregating groups of users' data together at a single distributed node. We compute parameters based on these virtual data samples in the local nodes and aggregate the learning model in the central server. In the model update process, we focus on the previously misclassified data within the virtual samples and hence generate sparse learning model parameters. By carefully controlling the size of these sample groups, we can achieve a tradeoff between privacy and learning performance. Our experimental results show the effectiveness of the proposed scheme in comparison with existing state-of-the-art methods.
Mengwei Yang , Linqi Song , Jie Xu and Congduan Li , Guozhen Tan
City University of Hong Kong
University of Miami
Sun Yat-sen University
Dalian University of Technology
email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com
1 Introduction
Nowadays, many giant internet companies, such as Google, Amazon, and Alibaba, have established large-scale information technology infrastructures to cope with the current huge data streams and to provide numerous services to customers. However, the large volume of data also brings a number of serious privacy issues [?] and computing problems. For example, in social networks like Facebook, there is growing concern over the privacy risk of collecting large amounts of users' private data, including personal information, texts, pictures, and videos. These companies leverage such human data to train machine learning models for various data-intensive applications, yet users can do little to protect their data. As such, in May 2018, the European Union began to implement the General Data Protection Regulation (GDPR) to protect individual privacy [Voigt and Von dem Bussche, 2017], which is deemed the most important change in data privacy regulation in 20 years.
Even though in some areas data can be shared between different companies or concentrated on cloud servers, doing so still carries substantial risks and transmission issues. On one hand, transferring private data between different parties makes it more likely to be leaked or hacked. On the other hand, transmitting large amounts of data is inefficient. In this context, the federated learning framework has been proposed and plays an indispensable role in solving these problems [?]. Instead of transmitting raw data, federated learning transmits pre-trained learning models from users to servers, while keeping the users' data local. Thus, user privacy is protected, computing resources on the user side are efficiently utilized, and the communication cost is reduced.
Recently, federated learning has attracted broader attention, and three categories were put forward in [Yang et al., 2019]: horizontal federated learning, vertical federated learning, and federated transfer learning. In [Cheng et al., 2019], SecureBoost was presented, which achieves vertical federated learning with a tree-boosting algorithm.
In this work, we propose a horizontal federated XGBoost algorithm for anomaly detection, with an application to detecting fraudulent bank credit card transactions, and study the tradeoff between privacy preservation and anomaly detection performance. Compared with SecureBoost, which is deployed in the vertical federated learning framework, our federated XGBoost is a horizontal federated learning algorithm, where different data samples carrying all features are distributed among the nodes. Although a tree-boosting algorithm is also utilized in this work, this horizontal federated XGBoost is deployed in a fundamentally different way. First, the biggest difference is how parameters are transferred: in horizontal federated XGBoost it is far from enough to only pass the gradient statistics $g_i$ and $h_i$, because when using tree-boosting algorithms, each node has to obtain the instances of every feature so that the gain of a split can be calculated and the split point can be acquired. Another key difference is the use of a two-step method (data aggregation and federated update) to preserve the anonymity of individual data, which we describe later. Furthermore, a sparse federated update that focuses on wrongly classified data is utilized in federated XGBoost to improve the federated update process.
In particular, to transfer users' features in a privacy-preserving and efficient way, our proposed two-step method works as follows. The first step is Data Aggregation. First of all, the privacy of users should be protected, so users' information cannot be passed directly. However, for the purpose of calculating the gain of a split as mentioned above, users' features are required. In this paper, for the sake of protecting users' information, instead of directly transmitting the exact data of each feature, the original data of each feature is projected in an anonymous way using a modified $k$-anonymity, shown in Figure 1, where a group of data samples is mapped to a virtual data sample. The projection is applied to every feature. While finding the split point, what the tree-boosting algorithm needs from the original data is only the ordering under every feature; thus, after passing the number of virtual data samples for each feature, the gain of a split can still be calculated. Consequently, not only is the privacy of users protected, but the tree-boosting model can also decide the split point and build the tree.
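As a concrete sketch, the per-feature projection can be implemented by sorting the values of one feature and grouping every $k$ consecutive values (by rank) into one virtual-sample index. The function name and the equal-size grouping rule below are our illustrative assumptions; the paper does not prescribe a specific binning scheme.

```python
def to_virtual_samples(values, k):
    """Map raw feature values to anonymous virtual-sample indices.

    Values are sorted and every k consecutive values (by rank) share one
    virtual-sample index, so each reported index stands for a cluster of
    k users and individual values are never revealed.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank // k  # cluster index replaces the exact value
    return bins

print(to_virtual_samples([3.2, 0.1, 7.7, 2.5, 9.9, 4.4], k=2))
# → [1, 0, 2, 0, 2, 1]: each index is shared by two samples
```

Because only the ordering matters for finding split points, replacing exact values by cluster indices preserves enough information to compute split gains while anonymizing individual values.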
The second step is the Federated Update. In reality, the amount of data is quite large, so it is inefficient to transfer all data, and not all data is valuable for the update. In that case, it is necessary to filter the data so as to better update the models. Although the tree-boosting model can perform well in prediction by building trees, there are still many instances that cannot be classified correctly. Hence, in this paper, wrongly classified instances are given more focus and are then transferred to the server for the federated update. The reason is twofold. First, these instances are more valuable and help the model improve itself more effectively; moreover, since the data used in anomaly detection is extremely unbalanced, the boosting algorithm can alleviate this skew to some degree and improve the generalization ability of the model. Second, if the correctly classified data were not filtered out, those instances would affect the split process and the construction of trees during the federated update, which would adversely impact the improvement of the model.
We show a tradeoff between the detection accuracy and the privacy measured in terms of $k$-anonymity. Through simulation experiments, we find a reasonable number of virtual samples in data aggregation such that the privacy of users is better protected while the learning performance loss of federated XGBoost is kept to a minimum. We show that our proposed algorithm achieves up to 5% performance gains in terms of F1-score compared with existing state-of-the-art algorithms, while effectively protecting user privacy.
2 Related Work
Privacy-preserving and Federated Learning:
The transfer of data brings the problem of data leakage [?]. Consequently, decentralized methods (i.e., data is only stored locally) are used to process the data, thereby reducing the risk of data leakage [?]. Some works use encryption-based federated learning frameworks, such as homomorphic encryption [?]. Homomorphic encryption means that certain operations can act directly on encrypted data without decrypting it. However, homomorphic encryption has its disadvantages: taking Paillier-based encryption schemes as an example, the cost of generating threshold decryption keys is very high [?].
Federated learning is a new distributed learning paradigm proposed recently to utilize user-end computing resources and preserve users' privacy by transmitting only model parameters, instead of raw data, to the server [Konečnỳ et al., 2016; McMahan et al., 2016]. In federated learning, a general model is first trained, and then the model is distributed to each node as a local model [?; ?; ?]. Three categories were put forward in [Yang et al., 2019]: horizontal federated learning, vertical federated learning, and federated transfer learning. A federated secure XGBoost framework using vertical federated learning was proposed in [Cheng et al., 2019].
In contrast, in this work, we first preprocess the data by merging instances together to learn aggregate gradients, so that the communication and computation costs are significantly reduced. Next, we generate local models and aggregate those models in the central server to update the original model. By doing so, we can show a tradeoff between user privacy and learning performance.
Anomaly Detection: Anomaly detection [Patcha and Park, 2007] is the identification of events or observations that do not match the expected pattern or other items in the dataset (i.e., outliers) during data mining. Outliers can be divided into point exceptions, context exceptions, and collective exceptions [?]. Anomaly detection methods include the SMOTE algorithm [Chawla et al., 2002] and various machine learning models, such as the K-Nearest Neighbors algorithm [Liao and Vemuri, 2002], Random Forest [Zhang et al., 2008], Support Vector Machine (SVM) [Li et al., 2003], Gradient Boosting Classification Tree (GBT) [Krauss et al., 2017], XGBoost [Chen and Guestrin, 2016], and deep learning neural network models [Mukkamala et al., 2002]. In this paper, we focus on a point exception, fraud detection in credit card transactions, and use a dataset of credit card transactions to train the model.
3 Problem Formulation
We consider federated learning in an anomaly detection problem as follows. There are $M$ distributed nodes, e.g., bank institutions, denoted by $1, 2, \ldots, M$. In each node $m$, the local data $D_m$ is given, with $n_m$ data instances (i.e., data examples) and $d$ features, i.e., $D_m \in \mathbb{R}^{n_m \times d}$. We denote the union of all local data by $D = \bigcup_{m=1}^{M} D_m$. There is a central server node to aggregate the learning model. The entire system architecture is shown in Fig. 2.
The federated learning process is as follows. First, the local nodes preprocess their local data and send some learning model parameters to the server. Second, the server integrates the received model parameters to obtain a new global model, which is then transmitted back to the local nodes. This process is iterated over time to train a sufficiently good model. Throughout this learning process, the local data does not need to be exchanged, and user privacy is protected.
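The iterative process above can be sketched as follows. The node interface, the toy local update, and plain parameter averaging at the server are our illustrative assumptions; they stand in for the XGBoost-specific updates described in Section 4.

```python
class LocalNode:
    """A distributed node holding private data that never leaves it."""
    def __init__(self, data):
        self.data = data

    def local_update(self, model):
        # Toy local step: nudge each parameter toward the local data mean.
        local_mean = sum(self.data) / len(self.data)
        return [p + 0.1 * (local_mean - p) for p in model]

def aggregate(updates):
    """Server step: merge the nodes' parameter vectors by averaging."""
    n = len(updates)
    return [sum(vals) / n for vals in zip(*updates)]

def train(nodes, model, rounds):
    # Only model parameters travel between nodes and server; raw data stays put.
    for _ in range(rounds):
        model = aggregate([node.local_update(model) for node in nodes])
    return model

nodes = [LocalNode([1.0, 1.0]), LocalNode([3.0, 3.0])]
model = train(nodes, [0.0], rounds=1)  # one round of federated averaging
```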
In the anomaly detection problem, the data is often very unbalanced; namely, in most cases the data points are normal (i.e., most samples have label $0$), and in rare cases the data points are abnormal (i.e., rare samples have label $1$). Thus, the extremely skewed data on each node cannot represent the overall distribution, and each node needs to share data within the whole federated learning framework, which helps improve the existing model at each node. The goal is to train a federated learning model to detect the abnormalities using the federated learning system described above.
In this paper, we ask the question: if the local nodes choose different learning model parameters to transmit to the server, what is the tradeoff between the machine learning performance and user privacy?
Next, we describe the performance metrics of the anomaly detection and the user privacy.
Measurement for Privacy
$k$-anonymity is a property possessed by certain anonymized data, whereby one cannot distinguish a user out of $k$ candidates from this set of data [Sweeney, 2002]. Here, since local nodes transmit learning model parameters to the server, we define a modified $k$-anonymity metric as the privacy property that one cannot distinguish a user out of $k-1$ other candidates from the transmitted learning parameters, instead of from a set of data.
Measurement for Unbalanced Data
In anomaly detection, the data is unbalanced, and it is usually not a good idea to simply use accuracy to measure the learning performance, since classifying all data into the normal category would already result in a sufficiently high accuracy. For example, in the bank credit card fraud detection dataset that we use in the experiments, fraud cases account for only 0.172% of the total data. In actual banking transactions, fraudulent transactions are likewise a small minority [Phua et al., 2004]. Consequently, we use the confusion matrix, the F1-score, and the AUPRC (the area under the precision-recall curve) to measure the anomaly detection performance. An illustration of these concepts is shown in Fig. 3.
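For illustration, these metrics can be computed with scikit-learn as below; the toy labels and scores are made up to mimic a skewed dataset and are not from the paper.

```python
from sklearn.metrics import confusion_matrix, f1_score, average_precision_score

# Toy skewed data: 2 positives (anomalies) out of 10 samples.
y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.9, 0.35]
y_pred  = [int(s >= 0.5) for s in y_score]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
f1 = f1_score(y_true, y_pred)                     # balances precision and recall
auprc = average_precision_score(y_true, y_score)  # area under the PR curve

print(f"tp={tp} fp={fp} fn={fn}, F1={f1:.3f}, AUPRC={auprc:.3f}")
```

Note that predicting every sample as normal would give 80% accuracy on this toy data yet an F1-score of 0, which is exactly why F1 and AUPRC are the appropriate yardsticks for skewed data.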
4 The Federated XGBoost Framework
In this section, we describe the federated XGBoost algorithm that we use for the anomaly detection problem. We first give a recap of the XGBoost algorithm, then a general description of the federated XGBoost algorithm, and finally the specific designs tailored to the anomaly detection problem.
4.1 Preliminaries of XGBoost
We give a brief overview of the XGBoost algorithm; one can refer to [Chen and Guestrin, 2016] for more details.
For machine learning problems such as classification or regression, given a dataset with $n$ examples and $d$ features, the goal is to train a learning model with parameters $\theta$ to minimize the objective loss function
$$\text{obj}(\theta) = L(\theta) + \Omega(\theta),$$
where $L(\theta)$ is the training loss and $\Omega(\theta)$ is the regularization term. The XGBoost algorithm utilizes an ensemble of $K$ regression/classification trees $f_k$ to predict the output, where the predicted output for the $i$-th data example is
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i).$$
So the objective function of XGBoost is
$$\text{obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k),$$
where $\Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2$, with component $w_j$ of $w$ being the score/weight on the $j$-th of the $T$ leaves of the tree. Since the newly generated tree needs to fit the residual of the last prediction, the prediction at the $t$-th iteration can be written as $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$. Taking a second-order Taylor expansion of the objective gives
$$\text{obj}^{(t)} \approx \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),$$
where $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$. Here, $g_i$ and $h_i$ represent the first- and second-order gradient statistics of the loss function. A greedy algorithm is used to search for the best split, which aims to maximize the learning gain at each iteration:
$$\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma,$$
where $G_L = \sum_{i \in I_L} g_i$, $G_R = \sum_{i \in I_R} g_i$, $H_L = \sum_{i \in I_L} h_i$, and $H_R = \sum_{i \in I_R} h_i$. Here, $I_L$ and $I_R$ represent the left and right sets of data sample indices after the split. The Gain equation is used for evaluating a split point, and $w_j^* = -G_j / (H_j + \lambda)$ denotes the optimal weight of leaf $j$. When searching for the best split point, the instances' $g_i$ and $h_i$ in the left and right spaces are accumulated to obtain the value of Gain.
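The gain of a candidate split can be evaluated directly from the accumulated gradient statistics, as in this small sketch (the variable names are ours):

```python
def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Loss reduction of a split, given summed gradient statistics.

    G_L/H_L (G_R/H_R) are the sums of g_i (h_i) over the left (right)
    partition; lam and gamma are the regularization constants.
    """
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# A split that separates opposite-signed gradients yields positive gain.
print(split_gain(2.0, 1.0, -2.0, 1.0))  # → 2.0
```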
Without loss of generality, we consider the logistic loss function $l(y_i, \hat{y}_i) = y_i \ln(1 + e^{-\hat{y}_i}) + (1 - y_i) \ln(1 + e^{\hat{y}_i})$, so that
$$g_i = \frac{1}{1 + e^{-\hat{y}_i^{(t-1)}}} - y_i = p_i - y_i, \qquad h_i = p_i (1 - p_i),$$
where $p_i$ denotes the predicted probability from the previous iteration.
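For the logistic loss, the per-sample statistics reduce to the familiar sigmoid forms; a minimal sketch (the function name is ours):

```python
import math

def grad_hess(y_true, y_margin):
    """g_i and h_i of the logistic loss w.r.t. the raw margin prediction."""
    p = 1.0 / (1.0 + math.exp(-y_margin))  # predicted probability (sigmoid)
    g = p - y_true                          # first-order statistic
    h = p * (1.0 - p)                       # second-order statistic
    return g, h

# At a zero margin the model is maximally unsure: p = 0.5.
print(grad_hess(1, 0.0))  # → (-0.5, 0.25)
```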
These parameters are the quantities passed between nodes and server in the federated learning framework.
4.2 Federated XGBoost
In the federated learning framework, a simple idea to implement the XGBoost algorithm is to calculate the parameters $g_i$ and $h_i$ of each data sample at each local node, and then transmit these parameters to the central server to determine an optimal split.
Note that under a vertical data partition [Cheng et al., 2019], different nodes hold different parts of the same instance, so that by only passing parameters between each other, the model can make predictions in cooperation with the other nodes. In this paper, we consider horizontally partitioned data across the local nodes, which means that the data provided at different nodes has the same feature dimension and a single node holds all features of an instance. In this setting, it is not easy to update the models at other nodes if only the parameters ($g_i$ and $h_i$) are transmitted. Although averaging the parameters of different models does help, it still cannot improve the model much at each node.
Here, instead of simply transmitting the model parameters $g_i$ and $h_i$, we make two revisions tailored to the specific anomaly detection setting: a data aggregation process and a sparse federated update process.
First, in the data aggregation process, we map a range of data samples that are close to each other into a virtual data sample (or a cluster of samples) $c$. Treating each virtual data sample as a new data sample, we sum up the $g_i$s and $h_i$s ($i \in c$) in the cluster to obtain the aggregate statistics $G_c$ and $H_c$ for this virtual data sample. We then transmit the parameters corresponding to these virtual samples to the central server to train the model. However, when new learning models are obtained, the underlying data samples calculate their losses and parameters $g_i$ and $h_i$ separately. By controlling the size of the virtual sample, we can achieve a tradeoff between learning performance and privacy in terms of the modified $k$-anonymity.
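A sketch of this aggregation step, assuming a precomputed assignment of samples to clusters (all names are illustrative):

```python
def aggregate_cluster_stats(g, h, cluster_of):
    """Sum g_i and h_i within each virtual data sample (cluster).

    Only the per-cluster sums G_c and H_c are sent to the server, so
    the server never sees an individual sample's statistics.
    """
    G, H = {}, {}
    for i, c in enumerate(cluster_of):
        G[c] = G.get(c, 0.0) + g[i]
        H[c] = H.get(c, 0.0) + h[i]
    return G, H

# Samples 0 and 1 form cluster 0; sample 2 forms cluster 1.
G, H = aggregate_cluster_stats([0.5, -0.5, 0.25], [0.25, 0.25, 0.1875], [0, 0, 1])
print(G, H)  # → {0: 0.0, 1: 0.25} {0: 0.5, 1: 0.1875}
```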
In our modified $k$-anonymity, the instances of every feature are mapped into different virtual data nodes. As shown in Figure 1, we use the sequence numbers of the virtual data nodes (from $1$ up to the number of virtual nodes) to represent the exact information of a cluster of samples, which means that the individual values of every feature are replaced by a new category. Since every virtual data node represents a range of samples' values, the samples inside every node are anonymous and attackers cannot distinguish a user from the other candidates. Moreover, even if attackers obtain an instance's exact values, they still cannot acquire that instance's other sensitive information, because of the adoption of the new category. Therefore, our modified $k$-anonymity protects users' privacy in an anonymous way.
Second, to further improve the communication efficiency and the anomaly detection performance, the sparse federated update process focuses on tackling the wrongly classified instances. Our assumption is that at iteration $t$, the model $\hat{y}^{(t)}$ already gives sufficiently accurate estimations for most data samples. Therefore, at iteration $t+1$, we focus only on the wrongly classified samples: we calculate $G_c$ and $H_c$ for cluster $c$ by summing up only the $g_i$s and $h_i$s of those samples that have been wrongly classified by the learning model thus far.
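The sparse variant is the same aggregation restricted to misclassified samples; again, the function and argument names are our illustrative choices:

```python
def sparse_cluster_stats(g, h, y_true, y_pred, cluster_of):
    """Per-cluster sums of g_i and h_i over wrongly classified samples only.

    Correctly classified samples contribute nothing, so the transmitted
    update is sparse and focuses the next tree on the model's mistakes.
    """
    G, H = {}, {}
    for i, c in enumerate(cluster_of):
        if y_pred[i] != y_true[i]:  # keep only the model's mistakes
            G[c] = G.get(c, 0.0) + g[i]
            H[c] = H.get(c, 0.0) + h[i]
    return G, H

# Samples 0 and 2 are misclassified; sample 1 is dropped from the update.
G, H = sparse_cluster_stats(g=[0.9, -0.1, 0.8], h=[0.09, 0.09, 0.16],
                            y_true=[0, 0, 0], y_pred=[1, 0, 1],
                            cluster_of=[0, 0, 1])
print(G, H)  # → {0: 0.9, 1: 0.8} {0: 0.09, 1: 0.16}
```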
5 Experimental Results
In this section, we present our experimental results on a real dataset for credit card fraud detection. We first describe the characteristics of the dataset and then show our algorithm's performance (we use F-XGBoost to denote our algorithm) compared with existing state-of-the-art methods.
Credit Card Fraud Dataset (https://www.kaggle.com/mlg-ulb/creditcardfraud): This dataset contains credit card transactions, with 492 frauds out of 284,807 transactions, and is thus highly unbalanced. Each transaction has 30 features: features V1, V2, ..., V28 are principal components obtained with PCA, and only two features, Time and Amount, are kept in their original form.
Experimental Setting: We split the dataset into two parts: one for basic model training and the other for simulating newly acquired and wrongly classified instances. Nearly 1/5 of the dataset (59,875 tuples) is used for updating the existing models in the federated learning setting, and the remaining 4/5 is divided into testing (45,569 tuples) and training data (179,363 tuples). In the experiments, XGBoost (https://github.com/dmlc/xgboost), GBDT (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html), and Random Forest (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) are used for performance comparisons, and the parameter settings of XGBoost and the federated XGBoost framework are the same: the learning rate is set to 0.1, the maximum depth to 4, and the minimum child weight to 2.
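For reference, the stated hyperparameters correspond to the following XGBoost parameter dictionary; the binary-logistic objective and the training call shown in the comment are our assumptions, as the paper does not spell them out.

```python
# Hyperparameters from the experimental setting, in xgboost's native format.
params = {
    "objective": "binary:logistic",  # assumed: fraud detection as binary classification
    "eta": 0.1,                      # learning rate
    "max_depth": 4,
    "min_child_weight": 2,
}
# Hypothetical usage with the xgboost library:
#   bst = xgb.train(params, dtrain, num_boost_round=100)
```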
Experimental Results: The larger the virtual cluster's size, the better the privacy protection; however, there is a tradeoff between the cluster size and the accuracy. In the original dataset, the number of samples (sample clusters) in each feature is 275,665. As shown in Figure 4, Line B indicates that at the original dimension of 275,665, the F1-score of the federated XGBoost framework is 0.901408. Figure 4 shows the tradeoff between the learning performance and the privacy, where the horizontal axis represents the number of clusters and the vertical axis represents the F1-score. With an increasing number of sample clusters, the privacy-preserving ability decreases while the learning performance improves. For instance, when the number of clusters is 405, the F1-score is 0.895105.
Table 1 shows that all models attain high accuracy, which means that metrics such as accuracy are not appropriate for fully evaluating models on this extremely unbalanced dataset. Instead, more informative metrics such as the F1-score, AUC, and AUPRC should be deployed.
Table 1 also reports the AUC and F1-score for the different algorithms. Compared with Random Forest, GBDT, and the federated XGBoost before the update (original dimension), the updated federated XGBoost framework (original dimension) has a clearly better F1-score. The proposed algorithm outperforms existing methods by up to 3.4% in AUC and 5% in F1-score. For federated XGBoost, the dimension of 405 achieves a reasonable tradeoff between privacy and accuracy, even though the F1-score of the federated XGBoost framework (dimension 405) is 0.63% lower than that of the federated XGBoost framework (original dimension). The AUPRC further displays the improvement of the federated learning model over itself: the AUPRC performance is shown in Figures 6 and 7 for the training and test datasets. For the training loss in Figure 5, the learning curve of the federated XGBoost framework shows how the learning proceeds, and the training loss of our proposed federated XGBoost decreases faster than that of the GBDT algorithm.
Table 1: Performance of the federated XGBoost (F-XGBoost) variants.

| Model | Accuracy | AUC | F1-Score |
|---|---|---|---|
| F-XGBoost before update (Original Dimension) | 0.9994 | 0.9214 | 0.8169 |
| F-XGBoost before update (Dimension: 405) | 0.9996 | 0.9641 | 0.8652 |
| F-XGBoost after update (Original Dimension) | 0.9997 | 0.9789 | 0.9014 |
| F-XGBoost after update (Dimension: 405) | 0.9997 | 0.9794 | 0.8951 |
6 Conclusion and Future Work
In this paper, we proposed a federated XGBoost algorithm to solve the anomaly detection problem. We showed a tradeoff between learning performance and privacy, and in the experimental results we identified reasonable working points that balance privacy and accuracy. Moreover, by comparing with other algorithms, the effectiveness of the federated XGBoost framework is clearly demonstrated, with up to 5% performance gains. Our proposed federated XGBoost framework uses two techniques, data aggregation and sparse federated update, to reduce communication and computation costs while improving the anomaly detection ability. More importantly, user privacy is protected, and through the data aggregation process the risk of leaking users' information is reduced.
In future work, we will attempt to deploy differential privacy in federated XGBoost so as to better protect users' privacy. We also believe that many details remain to be considered, and more research on federated learning is needed to make it even more impactful.
- [Agrawal and Agrawal, 2015] Shikha Agrawal and Jitendra Agrawal. Survey on anomaly detection using data mining techniques. Procedia Computer Science, 60:708–713, 2015.
- [Bonawitz et al., 2017] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 1175–1191. ACM, 2017.
- [Chawla et al., 2002] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
- [Chen and Guestrin, 2016] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
- [Chen and Zhao, 2012] Deyan Chen and Hong Zhao. Data security and privacy protection issues in cloud computing. In Computer Science and Electronics Engineering (ICCSEE), 2012 International Conference on, volume 1, pages 647–651. IEEE, 2012.
- [Cheng et al., 2019] Kewei Cheng, Tao Fan, Yilun Jin, Yang Liu, Tianjian Chen, and Qiang Yang. Secureboost: A lossless federated learning framework. arXiv preprint arXiv:1901.08755, 2019.
- [Gilad-Bachrach et al., 2016] Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and John Wernsing. Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy. In International Conference on Machine Learning, pages 201–210, 2016.
- [Hard et al., 2018] Andrew Hard, Kanishka Rao, Rajiv Mathews, Françoise Beaufays, Sean Augenstein, Hubert Eichner, Chloé Kiddon, and Daniel Ramage. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604, 2018.
- [Konečnỳ et al., 2016] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
- [Krauss et al., 2017] Christopher Krauss, Xuan Anh Do, and Nicolas Huck. Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500. European Journal of Operational Research, 259(2):689–702, 2017.
- [Li et al., 2003] Kun-Lun Li, Hou-Kuan Huang, Sheng-Feng Tian, and Wei Xu. Improving one-class SVM for anomaly detection. In Machine Learning and Cybernetics, 2003 International Conference on, pages 3077–3081. IEEE, 2003.
- [Liao and Vemuri, 2002] Yihua Liao and V Rao Vemuri. Use of k-nearest neighbor classifier for intrusion detection1. Computers & Security, 21(5):439–448, 2002.
- [Liu et al., 2018] Yang Liu, Tianjian Chen, and Qiang Yang. Secure federated transfer learning. arXiv preprint arXiv:1812.03337, 2018.
- [McMahan et al., 2016] H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
- [Mukkamala et al., 2002] Srinivas Mukkamala, Guadalupe Janoski, and Andrew Sung. Intrusion detection using neural networks and support vector machines. In Neural Networks, 2002. IJCNN’02. Proceedings of the 2002 International Joint Conference on, pages 1702–1707. IEEE, 2002.
- [Patcha and Park, 2007] Animesh Patcha and Jung-Min Park. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Computer Networks, 51(12):3448–3470, 2007.
- [Phua et al., 2004] Clifton Phua, Damminda Alahakoon, and Vincent Lee. Minority report in fraud detection: classification of skewed data. ACM SIGKDD Explorations Newsletter, 6(1):50–59, 2004.
- [Shokri and Shmatikov, 2015] Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1310–1321. ACM, 2015.
- [Sweeney, 2002] Latanya Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002.
- [Voigt and Von dem Bussche, 2017] Paul Voigt and Axel Von dem Bussche. The eu general data protection regulation (gdpr). A Practical Guide, 1st Ed., Cham: Springer International Publishing, 2017.
- [Wang et al., 2010] Cong Wang, Qian Wang, Kui Ren, and Wenjing Lou. Privacy-preserving public auditing for data storage security in cloud computing. In INFOCOM, 2010 Proceedings IEEE, pages 1–9. IEEE, 2010.
- [Yang et al., 2019] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):12, 2019.
- [Zhang et al., 2008] Jiong Zhang, Mohammad Zulkernine, and Anwar Haque. Random-forests-based network intrusion detection systems. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(5):649–659, 2008.
- [Zhuo et al., 2019] Hankz Hankui Zhuo, Wenfeng Feng, Qian Xu, Qiang Yang, and Yufeng Lin. Federated reinforcement learning. arXiv preprint arXiv:1901.08277, 2019.