Preference-based performance measures for Time-Domain Global Similarity method
Abstract
For the Time-Domain Global Similarity (TDGS) method, which transforms the data cleaning problem into a binary classification problem about the physical similarity between channels, directly adopting common performance measures can only guarantee the performance with respect to physical similarity. Practical data cleaning tasks, however, have preferences concerning the correctness of the original data sequences. To obtain general expressions of performance measures based on the preferences of tasks, the mapping relations between the performance of the TDGS method about physical similarity and the correctness of data sequences are investigated by probability theory in this paper. Performance measures for the TDGS method in several common data cleaning tasks are set, and cases in which these preference-based performance measures can be simplified are introduced.
Ting Lan,$^{a}$ Jian Liu,$^{a,1}$ and Hong Qin$^{a,b}$

$^{1}$Corresponding author.

$^{a}$School of Nuclear Science and Technology and Department of Modern Physics, University of Science and Technology of China, Hefei, Anhui 230026, China

$^{b}$Plasma Physics Laboratory, Princeton University, Princeton, NJ 08543, USA
1 Introduction
To guarantee the availability and reliability of the data source, a general-purpose Time-Domain Global Similarity (TDGS) method based on machine learning techniques has been developed, which sorts out incorrect fusion data by classifying the physical similarity between channels [1]. In the model selection and evaluation process of the TDGS method, different performance measures lead to models with different generalization abilities [2, 3]. The choice of performance measure depends on the required generalization ability of the model, that is, on the preferences of the task. Setting preference-based performance measures therefore helps to perform the corresponding tasks better. For the TDGS method, directly adopting common performance measures, such as precision, recall, F-measure, the confusion matrix, and Receiver Operating Characteristic (ROC) graphs, can only guarantee the performance with respect to the physical similarity between data sequences [4, 5, 6]. Practical data cleaning tasks, however, impose requirements on the correctness of the original data sequences. For example, some data cleaning tasks require a high recall rate of incorrect data, while others require a high precision of correct data. To improve the performance of the TDGS method in data cleaning tasks, new performance measures based on the preferences of the corresponding tasks should be set.
Each sample of the TDGS method is the combination of two data sequences from different channels of a MUltichannel Measurement (MUM) system. By tagging a sample completely constituted by correct data as physically similar, and a sample containing at least one incorrect data sequence as physically dissimilar, the data cleaning problem turns into a binary classification problem about the physical similarity between data sequences. When defining the prediction performance of the TDGS method, True Positive (TP) means that the predicted result and the actual sample tag are both dissimilar, and True Negative (TN) means that they are both similar. However, when defining the required prediction performance for data cleaning tasks, TP and TN refer to the incorrect and correct sequences that are correctly predicted. To set performance measures according to the preferences of tasks, the mapping relations between the performance of the TDGS method about physical similarity and the correctness of data sequences should first be made explicit. These mapping relations are complex and influenced by many factors, such as the data structure of the samples, the performance of the models, the rule for judging the correctness of data from the given physical similarity, and the judging order. To obtain general expressions of preference-based performance measures for TDGS, the mapping relations between the performance of the TDGS method about physical similarity and the correctness of data sequences are investigated by probability theory in this paper. Based on these mapping relations, we set preference-based performance measures for several common data cleaning tasks. By adopting these new performance measures in the model selection and evaluation process, the models generated by the TDGS method can, in probability, best meet the preferences of the tasks.
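As an illustration of this sample construction, the pairing-and-tagging step can be sketched in a few lines of Python. This is our own minimal rendering, not part of the TDGS implementation; the function name and the per-channel correctness flags are assumptions for the example.

```python
from itertools import combinations

def build_samples(channels, correct_flags):
    """Pair every two channels of a MUM system into one sample and tag it.

    channels      -- one data sequence per channel
    correct_flags -- correct_flags[i] is True when channel i holds correct data
    Returns a list of ((i, j), tag) pairs, where tag True means the sample
    is physically similar (both member sequences are correct).
    """
    samples = []
    for i, j in combinations(range(len(channels)), 2):
        # A sample is similar only when BOTH member sequences are correct.
        tag = correct_flags[i] and correct_flags[j]
        samples.append(((i, j), tag))
    return samples
```

For a 4-channel system with one incorrect channel, this yields six samples, of which the three pairs among the correct channels are tagged similar.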
The mapping relations between the performance of the TDGS method about physical similarity and the correctness of data sequences are decided by the rule for judging the correctness of data from the given physical similarity. Here we adopt an absolute judging rule: scan through all samples tagged as similar first, tag the sequences contained in these similar samples as correct data, and tag all other sequences as incorrect data. Based on this judging rule, the mapping relations between the performance about physical similarity and the correctness of data sequences can be analyzed by probability theory. Since every prediction about physical similarity is independent of the others, the probability of a given judgment about the correctness of data is the product of the probabilities of all predictions employed in the judging process [7]. For example, according to the adopted judging rule, a correct data sequence $x$ is predicted as incorrect only if all samples containing $x$ are predicted as dissimilar. Therefore, the probability of judging a correct data sequence $x$ as incorrect is decided by the number of similar samples containing $x$, the probability of predicting similar samples as dissimilar, the number of dissimilar samples containing $x$, and the probability of predicting dissimilar samples as dissimilar. Based on the mapping relations between the performance of the TDGS method about physical similarity and the correctness of data, performance measures for several common data cleaning tasks are set in this paper. Meanwhile, the correlations between these preference-based performance measures and the performance parameters about physical similarity are analyzed. When a preference-based performance measure is strongly positively correlated with a certain parameter, the measure can be simplified to that parameter.
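The absolute judging rule described above can be sketched as follows. The dictionary format for the model's pairwise predictions is our own assumption for illustration:

```python
def judge_correctness(n_channels, predictions):
    """Absolute judging rule: a sequence is tagged correct iff it appears
    in at least one sample predicted as similar; all others are incorrect.

    predictions -- dict mapping a channel pair (i, j) to True when the
                   model predicts the pair to be physically similar
    Returns a list of booleans, True meaning "judged correct".
    """
    correct = [False] * n_channels
    for (i, j), similar in predictions.items():
        if similar:              # scan the samples predicted as similar ...
            correct[i] = True    # ... and tag both member sequences correct
            correct[j] = True
    return correct
```

Under this rule a sequence is judged incorrect exactly when every sample containing it is predicted as dissimilar, which is the event whose probability is analyzed below.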
The rest of this paper is organized as follows. In section 2, the mapping relations between the performance of the TDGS method about physical similarity and the correctness of data sequences are studied by probability theory. In section 3, performance measures for several common data cleaning tasks are investigated, and cases in which these performance measures can be simplified are introduced. In section 4, the results are summarized and further optimizations of setting preference-based performance measures for the TDGS method are discussed.
2 Mapping relations between performance of TDGS method about physical similarity and correctness of data sequences
In this section, the correctness of data sequences is analyzed by probability theory based on the performance of the TDGS method about physical similarity, and the corresponding mapping relations are exhibited explicitly.
An MUM system measures related yet distinct aspects of the same observed object with multiple independent measuring channels. Interferometer systems [8], polarimeter systems [9, 10, 11, 12, 13], and electron cyclotron emission imaging systems [14] are all typical MUM systems in Magnetic Confinement Fusion (MCF) devices. For the practical purpose of data cleaning in MCF devices, the samples of a validation set are generated from the diagnostic data of one discharge. For an $N$-channel MUM system, suppose $m$ and $\eta = m/N$ are the number and proportion of correct data sequences, respectively. By combining two data sequences from different channels of the MUM system into one sample, $C_N^2 = N(N-1)/2$ samples can be generated. Among them, $C_m^2 = m(m-1)/2$ samples are similar, and $C_N^2 - C_m^2$ samples are dissimilar. The prediction performance of the TDGS method about physical similarity can be divided into four types: $p_{TN}$ and $p_{FP}$ are the probabilities of correctly and incorrectly predicting similar samples, respectively, while $p_{TP}$ and $p_{FN}$ are the probabilities of correctly and incorrectly predicting dissimilar samples, respectively. The total probability of all predictions equals 1, i.e., $p_{TP} + p_{FN} + p_{TN} + p_{FP} = 1$. The recall rate of similar samples $R_s$ and the recall rate of dissimilar samples $R_d$ are typical performance parameters about physical similarity, defined as the fraction of correctly predicted samples of a class over the total samples of that class, namely

(2.1a)  $R_s = \dfrac{p_{TN}}{p_{TN} + p_{FP}}$,

(2.1b)  $R_d = \dfrac{p_{TP}}{p_{TP} + p_{FN}}$.

The proportions of similar and dissimilar samples are $\dfrac{m(m-1)}{N(N-1)}$ and $1 - \dfrac{m(m-1)}{N(N-1)}$, respectively. The total probability of correct and incorrect predictions of samples of a given class is the proportion of that class, i.e.,

(2.2a)  $p_{TN} + p_{FP} = \dfrac{m(m-1)}{N(N-1)}$,

(2.2b)  $p_{TP} + p_{FN} = 1 - \dfrac{m(m-1)}{N(N-1)}$.

Based on the given performance of the TDGS method about physical similarity, the correctness of data sequences can be analyzed by probability theory. Under the absolute judging rule, a correct data sequence $x$ is judged as incorrect only when all similar samples containing $x$ are incorrectly predicted as dissimilar and, at the same time, all dissimilar samples containing $x$ are correctly predicted as dissimilar. For the validation set from one discharge, the numbers of similar and dissimilar samples containing $x$ are $m-1$ and $N-m$, respectively. The probability of predicting a similar sample as dissimilar is $1 - R_s$, and the probability of correctly predicting a dissimilar sample is $R_d$. Considering that the proportion of correct data is $\eta$, the probability of incorrectly predicting correct data equals $\eta (1-R_s)^{m-1} R_d^{N-m}$. Since the total probability of correct and incorrect predictions of correct data sequences is $\eta$, the probability of correctly predicting correct data equals $\eta [1 - (1-R_s)^{m-1} R_d^{N-m}]$. Similarly, an incorrect data sequence $y$ is judged as incorrect only when all dissimilar samples containing $y$ are predicted as dissimilar. In view that the number of dissimilar samples containing $y$ is $N-1$ and the proportion of incorrect data sequences is $1-\eta$, the probability of correctly predicting incorrect data equals $(1-\eta) R_d^{N-1}$, and the probability of incorrectly predicting incorrect data equals $(1-\eta)(1 - R_d^{N-1})$.
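Under the independence assumption used in this derivation, the four mapped probabilities can be evaluated numerically. The sketch below is our own rendering of the analysis above, where R_s and R_d denote the per-sample recall rates of similar and dissimilar samples:

```python
def mapped_probabilities(N, m, R_s, R_d):
    """Map per-sample performance to per-sequence correctness probabilities.

    N   -- number of channels;        m   -- number of correct sequences
    R_s -- recall of similar samples; R_d -- recall of dissimilar samples
    Returns (correct->correct, correct->incorrect,
             incorrect->incorrect, incorrect->correct); the four sum to 1.
    """
    eta = m / N  # proportion of correct data sequences
    # A correct sequence is judged incorrect iff its m-1 similar samples are
    # all mispredicted and its N-m dissimilar samples are all predicted right.
    p_ci = eta * (1 - R_s) ** (m - 1) * R_d ** (N - m)
    p_cc = eta - p_ci
    # An incorrect sequence is judged incorrect iff all N-1 samples
    # containing it (all dissimilar) are predicted as dissimilar.
    p_ii = (1 - eta) * R_d ** (N - 1)
    p_ic = (1 - eta) - p_ii
    return p_cc, p_ci, p_ii, p_ic
```

For a perfect classifier (R_s = R_d = 1) the probability of misjudging correct data vanishes and all incorrect sequences are caught, as expected.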
3 Preference-based performance measures for TDGS method in several common data cleaning tasks
Based on the mapping relations between the performance of the TDGS method about physical similarity and the correctness of data sequences, performance measures for several common data cleaning tasks are set in this section.
Different data cleaning tasks have different preferences. Some tasks require a high recall rate of incorrect data; then the performance measure can be set as

(3.1)  $R_{\mathrm{incorrect}} = R_d^{\,N-1}$.

Some tasks require a high precision of incorrect data; then the performance measure can be set as

(3.2)  $P_{\mathrm{incorrect}} = \dfrac{(1-\eta)\, R_d^{\,N-1}}{(1-\eta)\, R_d^{\,N-1} + \eta\, (1-R_s)^{m-1} R_d^{\,N-m}}$.

Some tasks require a high recall rate of correct data; then the performance measure can be set as

(3.3)  $R_{\mathrm{correct}} = 1 - (1-R_s)^{m-1} R_d^{\,N-m}$.

Some tasks require a high precision of correct data; then the performance measure can be set as

(3.4)  $P_{\mathrm{correct}} = \dfrac{\eta\, [1 - (1-R_s)^{m-1} R_d^{\,N-m}]}{\eta\, [1 - (1-R_s)^{m-1} R_d^{\,N-m}] + (1-\eta)(1 - R_d^{\,N-1})}$.

Here $R_s$ and $R_d$ are the recall rates of similar and dissimilar samples defined in section 2, and $\eta = m/N$ is the proportion of correct data sequences.
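Assuming the absolute judging rule and uniform, independent per-sample recall rates, the four preference-based measures can be computed directly from R_s and R_d. The function name and closed forms coded here are our own rendering of the derivation:

```python
def preference_measures(N, m, R_s, R_d):
    """Evaluate the four preference-based measures (3.1)-(3.4).

    N, m     -- channel count and number of correct sequences
    R_s, R_d -- per-sample recall rates of similar / dissimilar samples
    """
    eta = m / N
    # Probability that a correct sequence slips through as "incorrect".
    miss = (1 - R_s) ** (m - 1) * R_d ** (N - m)
    recall_incorrect = R_d ** (N - 1)                                # (3.1)
    precision_incorrect = ((1 - eta) * recall_incorrect /            # (3.2)
                           ((1 - eta) * recall_incorrect + eta * miss))
    recall_correct = 1 - miss                                        # (3.3)
    precision_correct = (eta * recall_correct /                      # (3.4)
                         (eta * recall_correct
                          + (1 - eta) * (1 - recall_incorrect)))
    return (recall_incorrect, precision_incorrect,
            recall_correct, precision_correct)
```

All four measures reach 1 for a perfect per-sample classifier, and the recall of incorrect data depends on $R_d$ alone, which foreshadows the simplification discussed below.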
The relations between the performance parameters about physical similarity and the preference-based performance measures differ from case to case. In the case shown in figure 1, the recall of incorrect data and the precision of correct data are positively correlated with the recall rate of dissimilar samples $R_d$. In the model selection and evaluation process of this case, the recall of incorrect data and the precision of correct data can therefore be enhanced simply by improving the recall rate of dissimilar samples, and the performance measures $R_{\mathrm{incorrect}}$ and $P_{\mathrm{correct}}$ can be replaced with the simpler parameter $R_d$. When the channel number $N$ of the MUM system is larger, or the proportion of incorrect data is higher, this simplification is more reasonable because the correlations between $R_d$ and the performance measures are stronger.
4 Summary
Data cleaning tasks can be performed better by setting preference-based performance measures. In this paper, we derive the mapping relations between the performance of the TDGS method about physical similarity and the correctness of data sequences by probability theory. Based on these mapping relations, preference-based performance measures for several common data cleaning tasks are set for the TDGS method. Meanwhile, the correlations between these new performance measures and the performance parameters about physical similarity are analyzed.
By setting preference-based performance measures, the preferences of data cleaning tasks can best be met by the TDGS method in probability. When these new performance measures are strongly positively correlated with a certain parameter, the preference-based performance measures can be simplified. As a next step, we will further improve the performance of the TDGS method by adopting different rules for judging the correctness of data from the given physical similarity. The rule adopted in this paper is an absolute judging rule; a non-absolute judging rule could be adopted instead. For example, a sequence that is dissimilar from more than a certain fraction of the other sequences could be tagged as incorrect data. The degree parameter introduced by such a judging rule changes the mapping relations between the performance of the TDGS method about physical similarity and the correctness of data sequences. In some cases, a proper setting of the degree parameter would improve the data cleaning performance of the TDGS method.
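A minimal sketch of such a non-absolute judging rule, assuming the degree parameter is given as the fraction of dissimilar predictions a sequence may accumulate before being tagged incorrect (the parameter name and prediction format are our own illustration):

```python
def judge_with_degree(n_channels, predictions, degree=0.5):
    """Non-absolute judging rule: tag a sequence as incorrect when it is
    predicted dissimilar from more than a `degree` fraction of the other
    sequences.

    predictions -- dict mapping a channel pair (i, j) to True when the
                   model predicts the pair to be physically similar
    Returns a list of booleans, True meaning "judged correct".
    """
    dissimilar_count = [0] * n_channels
    for (i, j), similar in predictions.items():
        if not similar:
            dissimilar_count[i] += 1
            dissimilar_count[j] += 1
    threshold = degree * (n_channels - 1)
    return [count <= threshold for count in dissimilar_count]
```

Setting degree = 0 recovers the absolute rule, since a single dissimilar prediction then suffices to tag a sequence as incorrect.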
Acknowledgments
This research is supported by the Key Research Program of Frontier Sciences CAS (QYZDB-SSW-SYS004), the National Natural Science Foundation of China (NSFC-11575185, 11575186), the National Magnetic Confinement Fusion Energy Research Project (2015GB111003, 2014GB124005), the JSPS-NRF-NSFC A3 Foresight Program (NSFC-11261140328), and the Geo-Algorithmic Plasma Simulator (GAPS) Project.
References
 Lan et al. [2017] T. Lan, J. Liu, H. Qin, and L. Li Xu, arXiv e-prints (2017), arXiv:1705.04947.
 Karayiannis and Venetsanopoulos [2013] N. Karayiannis and A. N. Venetsanopoulos, Artificial neural networks: learning algorithms, performance evaluation, and applications, vol. 209 (Springer Science & Business Media, 2013).
 Kohavi et al. [1995] R. Kohavi et al., in IJCAI (Stanford, CA, 1995), vol. 14, pp. 1137–1145.
 Goutte and Gaussier [2005] C. Goutte and E. Gaussier, in European Conference on Information Retrieval (Springer, 2005), pp. 345–359.
 Powers [2011] D. M. Powers, Journal of Machine Learning Technologies 2, 37 (2011).
 Fawcett [2006] T. Fawcett, Pattern Recognition Letters 27, 861 (2006).
 Durrett [2010] R. Durrett, Probability: Theory and Examples (Cambridge University Press, 2010).
 Kawahata et al. [1999] K. Kawahata, K. Tanaka, Y. Ito, A. Ejiri, and S. Okajima, Review of Scientific Instruments 70, 707 (1999).
 Donné et al. [2004] A. Donné, M. Graswinckel, M. Cavinato, L. Giudicotti, E. Zilli, C. Gil, H. Koslowski, P. McCarthy, C. Nyhan, S. Prunty, et al., Review of Scientific Instruments 75, 4694 (2004).
 Brower et al. [2001] D. Brower, Y. Jiang, W. Ding, S. Terry, N. Lanier, J. Anderson, C. Forest, and D. Holly, Review of Scientific Instruments 72, 1077 (2001).
 Liu et al. [2014] H. Liu, Y. Jie, W. Ding, D. L. Brower, Z. Zou, W. Li, Z. Wang, J. Qian, Y. Yang, L. Zeng, et al., Review of Scientific Instruments 85, 11D405 (2014).
 Liu et al. [2016] H. Liu, Y. Jie, W. Ding, D. Brower, Z. Zou, J. Qian, W. Li, Y. Yang, L. Zeng, S. Zhang, et al., Journal of Instrumentation 11, C01049 (2016).
 Zou et al. [2016] Z. Zou, H. Liu, W. Li, H. Lian, S. Wang, Y. Yao, T. Lan, L. Zeng, and Y. Jie, Review of Scientific Instruments 87, 11E121 (2016).
 Luo et al. [2014] C. Luo, B. Tobias, B. Gao, Y. Zhu, J. Xie, C. Domier, N. Luhmann, T. Lan, A. Liu, H. Li, et al., Journal of Instrumentation 9, P12014 (2014).