Efficient Feature Selection of Power Quality Events using Two-Dimensional (2D) Particle Swarms
Abstract
A novel two-dimensional (2D) learning framework has been proposed to address the feature selection problem in Power Quality (PQ) events. Unlike the existing feature selection approaches, the proposed 2D learning explicitly incorporates the information about the subset cardinality (i.e., the number of features) as an additional learning dimension to effectively guide the search process. The efficacy of this approach has been demonstrated considering fourteen distinct classes of PQ events which conform to the IEEE Standard 1159. The search performance of the 2D learning approach has been compared to six other well-known feature selection wrappers by considering two induction algorithms: Naive Bayes (NB) and k-Nearest Neighbors (kNN). Further, the robustness of the selected/reduced feature subsets has been investigated considering seven different levels of noise. The results of this investigation convincingly demonstrate that the proposed 2D learning can identify significantly better and more robust feature subsets for PQ events.
keywords:
Classification, Dimensionality Reduction, Feature Selection, Particle Swarm Optimization, Pattern Recognition, Power Quality
1 Introduction
Over the past two decades, the landscape of the energy market has been going through a significant transformation due to the increasing share of nonlinear loads. The breakthrough progress in semiconductor technology has enabled a wide-scale deployment of power electronic converters, adjustable speed drives and consumer electronics. Furthermore, power electronic converters are the key components in the grid interface of renewable generation. In this scenario, one of the major challenges faced by the utilities is the deterioration in Power Quality (PQ). Since the success of the remedial action is critically dependent on the nature of the PQ event, the identification of a PQ event is vital in addressing poor PQ. This study, therefore, focuses on the identification of PQ events.
In essence, any departure from the ideal sinusoid can be considered as a PQ event. Various PQ events exist in the modern grid, which differ in terms of spectral content, magnitude and duration. Based on these ‘traits’, the IEEE Standard 1159 characterizes PQ events of distinct nature IEEE:1159 (). Arguably, the earliest attempts to identify PQ events can be traced back to the works of Santoso et al. Santoso:1996 (); Santoso:2000 (), Angrisani et al. Angrisani:1998 () and Gaouda et al. Gaouda:Salama:1999 (), in which feature extraction through multiresolution analysis was proposed. Subsequently, over the years, the PQ event identification problem has transformed into a pattern recognition problem, wherein a pattern is formed by various features extracted from the voltage measurements together with a label corresponding to the PQ event, i.e., each pattern pairs the output label with the features extracted from the PQ event. The patterns extracted from the PQ events are subsequently used to induce a classifier through a suitable induction algorithm. Most of the existing PQ event identification approaches are based on this framework Mahela:Shaik:2015 (); Khokhar:Zin:2015 (). Note that the utility/efficacy of the extracted features is usually not evaluated in most of the existing PQ identification approaches Mahela:Shaik:2017 (); Kumar:Singh:2015 (); Liu:Cui:2015 (); Biswal:Dash:2013 (); Naik:Hafiz:2016 (), which often leads to the inclusion of irrelevant, redundant and noisy features.
Despite the significant research efforts dedicated to machine learning and pattern recognition, the feature extraction process is still, in major part, dependent on expert knowledge. Since it is not trivial to estimate the required number of features/attributes beforehand, a large number of features are usually extracted in order to identify a pattern. This often leads to the inclusion of irrelevant and/or redundant features, which in turn leads to increased storage requirements, slower processing times and reduced generalization capability of the classifier. The objective of feature selection is to overcome these problems by the removal of irrelevant/redundant features. Note that feature selection is one of the fundamental problems of machine learning and it has been a topic of active research for the last six decades Marill:Green:1963 (); Blum:Langley:1997 (); Dash:Liu:1997 (); Guyon:Isabelle:2003 (); Xue:Zhang:2016 (). Over the years, several feature selection approaches have been developed with a proven ability to significantly reduce the number of features while maintaining/improving classification performance in many applications Guyon:Isabelle:2003 (); Xue:Zhang:2016 ().
Despite these advantages, feature selection has not received enough attention in PQ identification, except for a few existing studies Panigrahi:2009 (); Gunal:2009 (); Lee:Shen:2011 (); Manimala:Selvi:2011 (); Manimala:Selvi:2012 (); Ericsti:2013 (); Dalai:Chatterjee:2013 (); Hajian:Foroud:2014a (); Hajian:Foroud:2014b (); Abdoos:Mianaei:2016 (); Hafiz:Swain:2017 (); Khokhar:Zin:2017 (); Singh:Singh:2017 (). The choice of the feature selection method varies among the existing research, which includes Sequential Search Gunal:2009 (); Hajian:Foroud:2014a (); Hajian:Foroud:2014b (); Abdoos:Mianaei:2016 (), Genetic Algorithm (GA) Panigrahi:2009 (); Gunal:2009 (); Manimala:Selvi:2011 (); Manimala:Selvi:2012 (); Hafiz:Swain:2017 (), Simulated Annealing (SA) Manimala:Selvi:2011 (); Manimala:Selvi:2012 (), Binary Particle Swarm Optimization (BPSO) Hafiz:Swain:2017 (), Fully Informed Particle Swarm (FIPS) Lee:Shen:2011 (), Artificial Bee Colony (ABC) Khokhar:Zin:2017 (), the k-means apriori algorithm Ericsti:2013 () and rough sets Dalai:Chatterjee:2013 (). The common drawback of sequential search methods such as Sequential Forward Search (SFS) and Sequential Backward Search (SBS) is the so-called ‘nesting effect’, i.e., once a feature is included in/excluded from the subset, it cannot be removed/added. The floating sequential search approaches, such as Plus-l Take-away-r (PTA) and Sequential Forward Floating Search (SFFS), can prevent the nesting effect. However, these are still local search methods, as the overemphasis on a single feature neglects the correlation among features. Furthermore, all sequential search methods require specification of the reduced subset size a priori. Given that the size of the optimal feature subset is not known, the sequential search essentially addresses only half of the feature selection problem. The metaheuristic search methods such as GA, SA and ABC could overcome these limitations.
However, in Panigrahi:2009 (); Manimala:Selvi:2012 (); Khokhar:Zin:2017 (), these approaches have not been used to their full potential, as the size of the feature subset is fixed a priori. Such practice is usually not recommended, since fixing the subset size to a particular value, say ‘k’, reduces the feature search space from $2^N$ to $\binom{N}{k}$ candidate subsets (where ‘N’ is the total number of features), which may exclude the optimal feature subset from the search. The filter based feature selection approaches have been employed in Hajian:Foroud:2014a (); Hajian:Foroud:2014b (); Dalai:Chatterjee:2013 (), which usually involves a trade-off in the classification performance, as the nature of the induction algorithm is not considered. This is further corroborated by the results of the earlier investigation Hafiz:Swain:2017 (), which suggest that the indirect performance indicators used in the filter approach do not necessarily translate into equivalent classification performance. In Lee:Shen:2011 (), Probabilistic neural network-based Feature Selection (PFS) has been proposed, which employs a combination of the Fully Informed Particle Swarm and an Adaptive Probabilistic Neural Network to evaluate the efficacy of a feature subset. The major drawback of this approach is the assumption of a monotonic relation between the feature subset size and classification performance. This assumption can only be satisfied by an ideal Bayes classifier; in practice, classifiers do not follow the monotonicity property Siedlecki:Sklansky:1989 (); Kohavi:1994 (); Kohavi:John:1997 (); Yang:Honavar:1998 (). In addition, PFS is prone to the nesting effect, as it does not have any mechanism to reconsider the discarded features.
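The shrinkage of the search space caused by fixing the subset cardinality can be illustrated with a quick calculation (a generic sketch; the feature counts used here are arbitrary, not taken from this study):

```python
from math import comb

N = 50          # total number of extracted features (arbitrary example value)
k = 10          # subset size fixed a priori

full_space = 2 ** N        # all possible feature subsets
fixed_space = comb(N, k)   # only subsets of exactly k features

print(full_space)   # 1125899906842624
print(fixed_space)  # 10272278170
```

Fixing the cardinality removes roughly five orders of magnitude of candidate subsets in this example, any of which might have contained the optimum.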
Further, most of the existing research has included only deterministic feature selection approaches in the comparative analysis, whereas it is known that metaheuristic search could yield enhanced search performance Xue:Zhang:2016 (). However, among the metaheuristic search methods, so far only the performance of GA, SA and BPSO has been evaluated Panigrahi:2009 (); Manimala:Selvi:2011 (); Manimala:Selvi:2012 (); Hafiz:Swain:2017 (). In this scenario, the following questions arise:

Amongst the various metaheuristic approaches (such as GA, BPSO, ACO and others), which approach would yield the minimum feature subset without compromising the classification performance?

Is the classification performance obtained from the reduced feature subset robust against various levels of measurement noise?
The main aim of this study is to address these concerns through a comprehensive investigation. For this purpose, a comparative analysis of seven different feature selection wrappers is carried out using fourteen distinct classes of PQ events and two different induction algorithms. Further, the robustness of the reduced subsets against measurement noise is evaluated through seven different levels of zero-mean Gaussian white noise.
The other objective of this study is to investigate the efficacy of the new Two-Dimensional feature selection algorithm on PQ events. The Two-Dimensional (2D) learning algorithm has recently been developed by the authors as a generalized feature selection wrapper Hafiz:Swain:2017a (). The core idea of the 2D learning approach is to embed the information about the number of features (referred to as ‘cardinality’) into the learning process of particle swarms. This distinctive quality of the 2D learning approach has been shown to be effective in identifying compact feature subsets with improved classification performance on several benchmark machine learning datasets. This, therefore, has been the motivation behind the application of 2D learning to the PQ event identification problem. In particular, the present study focuses on the practical issues encountered in the feature selection of PQ events and therefore significantly differs from (Hafiz:Swain:2017a, ) in the following aspects:

A well-balanced, comprehensive and diverse PQ event dataset is built. The dataset includes both simulated and experimental PQ events which conform to the IEEE Std. 1159. Further, a relatively large variation of PQ events is considered in comparison to the existing research on PQ identification Lee:Shen:2011 (); Biswal:Dash:2013 (); Biswal:Dash:2013a (); Kumar:Singh:2015 (); Singh:Singh:2017 (); Mahela:Shaik:2017 (). For example, the PQ events include variations in magnitude, event duration, frequency and harmonic order. Thus, most of the PQ events expected in the conventional power distribution grid are well represented in the investigated dataset.

The robustness of the selected/reduced feature subsets against the measurement noise is defined and evaluated under various levels of zero-mean Gaussian white noise.

The Two-Dimensional (2D) learning algorithm is better formalized and explained through an illustrative example from the perspective of PQ event identification.

The efficacy of the 2D learning algorithm is demonstrated through a comparative evaluation against six established feature selection algorithms: GA, ACO, BPSO, CBPSO, chBPSO and SFFS.

A rigorous analysis is carried out to determine the statistical significance of the results. For instance, multiple nonparametric statistical comparisons are carried out using the Friedman test and Hommel’s post-hoc procedure to compare the search performance of the algorithms. In addition, the performance of the algorithms under various levels of noise is compared using Contrast Estimation based on medians.
Thus, in this study, most of the operating scenarios expected in the utility distribution grid have been accommodated. Further, several key issues related to the feature selection have been investigated.
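As a rough illustration of the nonparametric comparison used in the analysis, the Friedman statistic can be computed from the per-dataset ranks of the competing algorithms (a generic sketch with made-up accuracy values, not this study's results; ties within a row are ignored for simplicity):

```python
def friedman_statistic(scores):
    """Friedman chi-square statistic.

    scores[i][j] = performance of algorithm j on problem i
    (higher is better). Assumes no ties within a row.
    """
    n = len(scores)       # number of problems/datasets
    k = len(scores[0])    # number of algorithms

    # Rank the algorithms within each problem (rank 1 = best).
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j], reverse=True)
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank

    mean_ranks = [r / n for r in rank_sums]
    chi2 = (12.0 * n / (k * (k + 1))) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4.0
    )
    return chi2, mean_ranks

# Example: algorithm 0 always best, algorithm 2 always worst, 10 datasets.
scores = [[0.95, 0.90, 0.85]] * 10
chi2, ranks = friedman_statistic(scores)
print(ranks)  # [1.0, 2.0, 3.0]
print(chi2)   # 20.0
```

A large statistic indicates that the mean ranks differ significantly, after which a post-hoc procedure (such as Hommel's) identifies which pairwise differences are significant.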
The rest of the article is organized as follows: Section 2 provides a brief overview of feature selection approaches. The 2D learning approach is briefly discussed in Section 3. The investigation framework of this study is provided in Section 4. The results of the comparative evaluation and the robustness test are presented in Section 5, followed by the discussion in Section 6 and the conclusions in Section 7.
2 Brief Overview of Feature Selection Methods
Since the main objective of this work is to select the relevant feature subset for PQ event identification, it is pertinent to briefly discuss the existing feature selection methods.
The feature selection problem essentially begins by considering a dataset having ‘N’ input features, $X = \{x_1, x_2, \ldots, x_N\}$, and an output class label ‘y’. For this dataset, the task of the induced classifier is to determine the output label ‘y’ corresponding to the input features. The objective of feature selection is to identify a subset of features, ‘S’ ($S \subseteq X$), through which this task can be accomplished with similar or improved classification performance:

(1) $S^{\star} = \operatorname*{arg\,max}_{S \subseteq X} \; J(S), \quad |S| \le N$

where ‘$J(S)$’ is the criterion function which represents the classification performance of the feature subset and ‘$|S|$’ denotes the cardinality or number of features in the subset. The search for an optimal solution of the feature selection problem requires the evaluation of $2^N$ subsets. Hence, exhaustive search is intractable even for moderate size datasets Cover:Van:1977 (). Note that the selection of a relevant feature subset from the available features is essential for better classification performance and generalization capability of the classifier Blum:Langley:1997 ().
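In a wrapper, the criterion function is simply the accuracy of a classifier induced on the candidate subset. A minimal, self-contained sketch (a toy 1-NN classifier scored by leave-one-out accuracy on made-up data; the study itself uses NB and kNN with ten-fold cross-validation):

```python
def loo_1nn_accuracy(data, labels, subset):
    """J(S): leave-one-out accuracy of a 1-NN classifier
    restricted to the feature indices in `subset`."""
    correct = 0
    for i, (p, y) in enumerate(zip(data, labels)):
        best_d, best_y = float("inf"), None
        for j, (q, z) in enumerate(zip(data, labels)):
            if i == j:
                continue  # leave the i-th pattern out
            d = sum((p[f] - q[f]) ** 2 for f in subset)
            if d < best_d:
                best_d, best_y = d, z
        correct += (best_y == y)
    return correct / len(data)

# Toy patterns: features 0-1 are informative, feature 2 is constant noise.
data = [(0.0, 0.2, 5.0), (0.1, 0.0, 5.0), (0.2, 0.1, 5.0),
        (9.9, 10.1, 5.0), (10.0, 9.8, 5.0), (10.2, 10.0, 5.0)]
labels = [0, 0, 0, 1, 1, 1]

print(loo_1nn_accuracy(data, labels, subset=[0, 1]))  # 1.0
```

A feature selection wrapper would call such a criterion once per candidate subset, which is exactly why evaluating all $2^N$ subsets quickly becomes intractable.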
Over the past six decades, while the fundamental objective of the feature selection problem has not changed (i.e., to identify a subset of relevant features), machine learning algorithms have evolved significantly with time to meet the requirements of different applications. For example, in many applications, the class ‘labels’ may not be available, completely or partially Guyon:Isabelle:2003 (); Liu:Yu:2005 (). Such scenarios require an unsupervised/semi-supervised machine learning approach. A distinct feature selection strategy is required for each machine learning approach, such as supervised or unsupervised learning Guyon:Isabelle:2003 (); Liu:Yu:2005 (); Shang:Wang:2016 (); Shang:Wang:2018 (). In this study, we focus on the feature selection approaches for supervised learning, as the nature of PQ events has been well characterized by the IEEE Std. 1159 IEEE:1159 (). Hence, a comprehensive PQ event dataset has been developed through a combination of simulation and practical field measurements for the supervised learning.
Most of the existing feature selection methods for supervised learning can be distinguished by their approach (e.g., wrapper or filter) to evaluate the criterion function Blum:Langley:1997 (); Dash:Liu:1997 (); Kohavi:John:1997 (). The wrapper approach is straightforward: a classifier is induced for each feature subset under consideration and the resulting classification accuracy is used as the criterion. On the contrary, in the filter methods, the criterion is estimated using statistical or information-theoretic measures, without inducing a classifier. Note that the search landscape of the feature selection problem is conjointly defined by both the dataset and the induction algorithm, i.e., each induction algorithm has specific traits. Hence, the optimal feature subset for a particular induction algorithm may not be optimal for another, as will be shown by the results of this study (Section 5.1).
The choice of feature selection method involves a trade-off between speed and accuracy; the wrappers are more precise whereas filters are comparatively faster. The size of the feature set plays a major role in the selection of either approach. For small to medium size feature sets, wrappers are more appropriate. For larger datasets, the computational burden of the wrapper may be infeasible; hence filters are preferred in such a scenario. Usually, the PQ event datasets range from small to medium size; thus, in this work, the focus is on the wrapper approach.
Earlier approaches to feature selection, such as the sequential search methods Marill:Green:1963 (); Whitney:1971 (); Pudil:1994 (); Somol:Pudil:1999 () and the branch and bound methods Narendra:1977 (); Yu:Yuan:1993 (), are ‘deterministic’ in nature, i.e., for a given dataset, they give the same solution over independent runs. The core idea behind most of these approaches is to evaluate the utility/relevance of a feature by evaluating its discrimination capability over the output classes. However, due to the overemphasis on an individual feature, the correlation among features is neglected. In addition, most of the deterministic approaches require a priori selection of the subset cardinality, ‘k’. Consequently, the search space reduces from $2^N$ to $\binom{N}{k}$. Further, several deterministic approaches Narendra:1977 (); Yu:Yuan:1993 () assume a monotonic criterion function, which is often impractical.
Metaheuristic search methods such as Genetic Algorithm (GA), Ant Colony Optimization (ACO), Tabu Search (TS) and Particle Swarm Optimization (PSO) have been applied to the feature selection problem to address the shortcomings of the deterministic approaches Xue:Zhang:2016 (). Unlike deterministic methods, most metaheuristic search methods operate on feature subsets. Hence, the effects of feature correlation are accounted for in the search process. Further, the metaheuristic search is, in essence, a population-based search with implicit parallelism, which allows comparatively better sampling of the search space. For this reason, we focus on metaheuristic wrappers in this study.
3 Two-Dimensional (2D) Learning Framework for Particle Swarms
The core idea behind the Two-Dimensional (2D) learning approach is briefly discussed in the following subsections for the sake of completeness; a detailed discussion of this approach can be found in Hafiz:Swain:2017a (). Note that 2D learning is intended to be a generalized learning algorithm for particle swarm based feature selection methods. It is, therefore, possible to adapt most of the existing PSO variants Hafiz:Abdennour:2013 () in the continuous domain for feature selection problems in the binary domain following 2D learning. However, the results of the study in (Hafiz:Abdennour:2016, ) indicate that, among the popular PSO variants, the adapted Unified Particle Swarm Optimization (UPSO) (UPSO1, ) performs comparatively better for problems in the discrete domain. For this reason, in this work, UPSO has been adapted following the 2D learning approach and is referred to throughout the manuscript as ‘2D-UPSO’.
3.1 Philosophy of 2D Learning
The search for the optimal feature subset essentially entails the following two decisions: 1) How many features should be included? and 2) Which features should be included? The philosophy of the Two-Dimensional (2D) learning is to explicitly embed the information about the subset size (also referred to as cardinality) into the search process to effectively address these issues. As the name suggests, the 2D learning approach extends the learning dimension of a particle swarm to integrate the cardinality information into the search process. Since the cardinality of the subset is selected through an informed decision, only the features with higher selection likelihoods are selected and the redundant features are effectively discarded.
To further understand the philosophy of 2D learning, consider a feature selection problem associated with a dataset having ‘N’ features. For this dataset, the position of the i-th particle, $x_i$, is represented as an N-dimensional binary string as follows:

(2) $x_i = [\,b_{i,1},\, b_{i,2},\, \ldots,\, b_{i,N}\,], \quad b_{i,d} \in \{0, 1\}$

where a bit set to ‘1’ indicates that the corresponding feature is selected and a bit set to ‘0’ indicates otherwise, i.e., if $b_{i,d} = 1$ then the d-th feature is selected.
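A binary position maps to a feature subset in the obvious way; a minimal sketch (the feature names below are hypothetical, chosen only for illustration):

```python
def decode(position, feature_names):
    """Return the feature subset encoded by a binary position string."""
    return [name for bit, name in zip(position, feature_names) if bit == 1]

# Hypothetical 6-feature dataset; the position selects features 2, 4 and 5.
features = ["mean_d1", "std_d1", "energy_d2", "entropy_d3", "rms_a5", "kurt_d4"]
position = [0, 1, 0, 1, 1, 0]

print(decode(position, features))  # ['std_d1', 'entropy_d3', 'rms_a5']
```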
Note that in PSO, the ‘learning’ of a particle is accumulated in its velocity. For this reason, in 2D learning, the dimension of the particle velocity is extended, and it is represented by a two-dimensional matrix. The objective here is to store the selection likelihoods of the features and of the cardinality (number of features) in distinct dimensions. For the problem considered here, the velocity of the i-th particle, $V_i$, is represented by a two-dimensional matrix of size ($2 \times N$) which is given by,

(3) $V_i = \begin{bmatrix} v^{c}_{i,1} & v^{c}_{i,2} & \cdots & v^{c}_{i,N} \\ v^{f}_{i,1} & v^{f}_{i,2} & \cdots & v^{f}_{i,N} \end{bmatrix}$

The first row of the velocity matrix stores the selection likelihoods of the cardinality (i.e., subset size or number of features); the element $v^{c}_{i,j}$ gives the likelihood of including a total of $j$ features in the new position of the particle. In contrast, the elements in the second row of $V_i$ store the selection likelihoods of the corresponding features; the element $v^{f}_{i,d}$ gives the likelihood of including the d-th feature in the new position of the particle.
Note that, in order to update the selection likelihoods of the cardinality and the features, it is crucial to extract the beneficial information from the ‘learning exemplars’ such as the ‘personal best’ (pbest) and the ‘neighborhood best’ (nbest). This is accomplished through a ‘learning’ process in which a dedicated ‘learning set’ is derived from each exemplar. The learning set is likewise two-dimensional: its first row corresponds to the cardinality learning and its second row corresponds to the feature learning. In essence, the learning process extracts the following information from the exemplar and encodes it into a two-dimensional binary learning set: 1) the number of features in the exemplar, and 2) the features that have been included in the exemplar but not in the particle.

This is achieved as follows: let ‘β’ denote a learning exemplar of the particle, e.g., pbest, nbest or the current position. Note that the learning exemplar ‘β’ is also an N-dimensional binary string, similar to the particle position, and essentially encodes a feature subset. Further, let the N-dimensional vectors ‘$L^{c}$’ and ‘$L^{f}$’ respectively denote the ‘cardinality’ and ‘feature’ learning sets. For the i-th particle, $x_i$, and the corresponding learning exemplar, ‘β’, the learning sets are derived following the procedure outlined in Algorithm 1. This procedure is further explained by the illustrative example in A.
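Based on the description above, the derivation of the two learning sets can be sketched as follows (a plausible reading of Algorithm 1, not a verbatim reproduction: the cardinality set is one-hot at the exemplar's subset size, and the feature set marks features present in the exemplar but absent from the particle):

```python
def derive_learning_sets(particle, exemplar):
    """Derive the cardinality and feature learning sets from an exemplar.

    Both inputs are N-dimensional binary lists encoding feature subsets.
    """
    n_features = len(exemplar)
    cardinality = sum(exemplar)  # number of features in the exemplar

    # Cardinality learning set: one-hot at the exemplar's subset size.
    l_card = [0] * n_features
    if cardinality > 0:
        l_card[cardinality - 1] = 1

    # Feature learning set: features in the exemplar but not in the particle.
    l_feat = [int(e == 1 and p == 0) for e, p in zip(exemplar, particle)]
    return l_card, l_feat

particle = [1, 0, 1, 0, 0]
exemplar = [1, 1, 0, 1, 0]   # e.g., the particle's personal best

l_card, l_feat = derive_learning_sets(particle, exemplar)
print(l_card)  # [0, 0, 1, 0, 0]  -> the exemplar contains 3 features
print(l_feat)  # [0, 1, 0, 1, 0]  -> features 2 and 4 are 'new' to the particle
```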
3.2 Velocity and Position Update
In this study, the well-known PSO variant, UPSO, has been adapted for feature selection through the 2D learning approach. The core idea behind UPSO (UPSO1, ) is to combine the ‘global’ (Shi:Eberhart:1998, ) and ‘local’ (Kennedy:Mendes:2002, ) versions of PSO to achieve a balance between exploration and exploitation of the search landscape. The velocity update rule of UPSO adapted through the 2D learning approach (referred to as ‘2D-UPSO’) is given by,
(4)  
(5)  
(6) 
where the unification factor balances the global and local search components; the inertia weight, the acceleration constants and the uniform random numbers retain their usual roles in PSO; and the social learning sets are derived respectively from the global best (gbest) and the neighbourhood best (nbest). Similarly, a further learning set is derived from the current position of the particle. The influence of this self-learning set is controlled by an adaptive weight, which is given by,
(7) 
where the adaptive weight of a particle is determined from its fitness relative to the fitness of the entire swarm at the current iteration. For a given particle, this weight is adaptive and its value depends on the relative performance of the particle with respect to the worst particle of the swarm. As seen in (7), the particle with the minimum fitness will have a higher adaptive weight, which will lead to an increase in the selection likelihood of the corresponding features (those included in the particle). Further, as seen in (7), the adaptive weight is positive only when the particle leads to improved fitness.
Following the velocity update, the new position of the particle is determined in two steps. In the first step, the cardinality of the new position is determined; this is followed by the selection of the features. For this purpose, the selection likelihoods stored in the velocity matrix are used. This procedure is outlined in Algorithm 2 and further explained through an illustrative example in A. The pseudocode for 2D-UPSO is outlined in Algorithm 3.
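The two-step position update can be sketched as follows (an illustrative simplification, not Algorithm 2 itself: the cardinality is drawn by roulette wheel over the first row of the velocity matrix, and the features are then taken as the n most likely ones from the second row; the likelihood values are made up):

```python
import random

def update_position(velocity, rng=random):
    """Two-step position update from a 2xN velocity matrix of likelihoods."""
    card_row, feat_row = velocity
    n_features = len(feat_row)

    # Step 1: draw the subset cardinality n by roulette wheel selection.
    n = rng.choices(range(1, n_features + 1), weights=card_row, k=1)[0]

    # Step 2: pick the n features with the highest selection likelihoods
    # (a deterministic simplification of likelihood-based sampling).
    chosen = sorted(range(n_features), key=lambda d: feat_row[d], reverse=True)[:n]

    position = [0] * n_features
    for d in chosen:
        position[d] = 1
    return position

velocity = [
    [0.0, 1.0, 0.0, 0.0, 0.0],   # all cardinality mass on n = 2
    [0.9, 0.1, 0.2, 0.8, 0.3],   # features 1 and 4 are the most likely
]
print(update_position(velocity))  # [1, 0, 0, 1, 0]
```

Because the cardinality is decided first, the update can never select more features than the learned subset size warrants, which is the mechanism by which redundant features are discarded.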
4 Investigation Framework
In this study, we focus mainly on two aspects of the feature selection in PQ events: 1) selection of an appropriate feature selection method, and 2) robustness of the reduced subsets against measurement noise. For this purpose, the investigation has been carried out in two stages, as shown in Fig. 2. The objective of the first stage is to carry out the comparative analysis of different feature selection approaches. For this purpose, seven different search algorithms (discussed in Section 4.3) are applied as wrappers to two induction algorithms (discussed in Section 4.2) and fourteen distinct types of PQ events (discussed in Section 4.1). Note that the PQ events used for the feature selection purpose do not contain any noise, i.e., the Signal-to-Noise Ratio (SNR) is effectively infinite. In the second stage, the robustness of the reduced subsets obtained from the feature selection is evaluated. For this purpose, the classification performance of the reduced subsets is evaluated in the presence of seven levels of zero-mean Gaussian white noise.
The following subsections provide more details about the PQ events, the search algorithms and the induction algorithms used in this study.
4.1 PQ Events & Feature Extraction
For a comprehensive study, it is necessary to include a wide variety of PQ events in the investigation. For this purpose, the IEEE Std. 1159 IEEE:1159 () is followed in this study. A total of fourteen distinct PQ events are generated through the parametric models given in Table 1 and the experimental setup shown in Fig. 2. These events cover several distinct natures of PQ events, e.g., stationary, non-stationary, low frequency, high frequency, single and simultaneous events.
The synthetic PQ events were generated in MATLAB^{®} following the parametric models shown in Table 1. The real PQ events are acquired using the experimental PCC shown in Fig. 2, which consists of a harmonic source and a transient source. The harmonic source is emulated by a three-phase uncontrolled rectifier with a resistive load, which generates harmonics of various orders. The transient events are induced by the switching of a capacitor bank. Further, the sag events are induced by creating a single line-to-ground fault. A digital oscilloscope/recorder (HIOKI 887020 MEMORY HiCORDER^{®}) is used to capture the real events. Multiple instances of each class of PQ event are generated; a portion of these event instances is generated through the parametric models shown in Table 1 and the remaining are induced using the experimental PCC shown in Fig. 2. Each event instance is generated at the fundamental frequency, for a fixed number of cycles, and sampled at a fixed rate. The PQ events include variations in magnitude, frequency, harmonic order and event duration. In addition, to evaluate the effects of the measurement noise, seven different levels of zero-mean Gaussian white noise have been added to these events. This yields a set of PQ datasets, each having the same events but a different level of measurement noise, i.e., a different SNR.
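The addition of zero-mean Gaussian white noise at a prescribed SNR can be sketched as follows (a generic utility; the SNR value and test signal below are illustrative, not the study's settings):

```python
import math
import random

def add_awgn(signal, snr_db, rng=random.Random(0)):
    """Add zero-mean Gaussian white noise at the given SNR (in dB)."""
    p_signal = sum(s * s for s in signal) / len(signal)   # mean signal power
    p_noise = p_signal / (10 ** (snr_db / 10.0))          # target noise power
    sigma = math.sqrt(p_noise)
    return [s + rng.gauss(0.0, sigma) for s in signal]

# Twenty cycles of a unit sinusoid, 256 samples per cycle.
clean = [math.sin(2 * math.pi * t / 256.0) for t in range(256 * 20)]
noisy = add_awgn(clean, snr_db=30.0)

# The empirical SNR of the result should be close to the 30 dB target.
p_sig = sum(s * s for s in clean) / len(clean)
p_err = sum((n - s) ** 2 for n, s in zip(noisy, clean)) / len(clean)
print(round(10 * math.log10(p_sig / p_err), 1))
```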
For accurate identification of the PQ events, it is crucial to extract information in both the temporal and the frequency domain. This task can be accomplished through various signal processing techniques such as the Stockwell Transform (ST), the Wavelet Packet Transform (WPT) and the Discrete Wavelet Transform (DWT) Mahela:Shaik:2015 (); Khokhar:Zin:2015 (). Note that the selection of the signal processing technique is primarily dependent on the frequency bandwidth of the signal being investigated. In particular, a judicious selection is essential to accommodate the relatively large frequency bandwidth of PQ events, e.g., from DC offset to oscillatory transients IEEE:1159 (). In this scenario, among the existing signal processing techniques, DWT represents an ideal choice for PQ events, as it is computationally more efficient than WPT and ST. Therefore, DWT has been selected as the signal processing technique in this study.
It is well known that the selection of the base/mother wavelet is dependent on the nature of the application and is crucial to the performance of wavelet transforms. A detailed investigation on this topic Hafiz:Swain:2019 () suggests that, for PQ events, the optimum classification performance of a given induction algorithm can be obtained when it is paired with a specific base wavelet. For example, the optimum performance from k-Nearest Neighbor (kNN) and Naive Bayes (NB) (which are being used in this study) was obtained with a symlet base wavelet Hafiz:Swain:2019 (). Therefore, a symlet has been selected as the base wavelet in this study.
Further, each instance of the PQ event is decomposed using DWT, with the decomposition level selected following the rule of thumb given in B. Consequently, the detail coefficients at each level and the approximation coefficients at the final level are available. In order to extract meaningful information from the wavelet coefficients, statistical functions are applied to them (shown in C). Following this procedure, a fixed number of ‘features’ (statistical functions of the detail/approximation coefficients) is obtained for each instance of PQ event. Hence, a pattern, ‘p’, is obtained corresponding to each instance of PQ event, as follows:

(8) $p = [\,x_1,\, x_2,\, \ldots,\, x_N,\; y\,]$

where ‘N’ denotes the number of features, $x_1, \ldots, x_N$ denote the feature set and ‘y’ contains the label corresponding to the PQ event.
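The wavelet-based feature extraction described above can be sketched in a self-contained way; for portability this sketch uses a simple Haar filter bank repeated per level (the study uses a symlet base wavelet), and a handful of common statistics as stand-ins for the functions listed in C:

```python
import math
import statistics

def haar_dwt(signal):
    """One level of the Haar DWT: (approximation, detail) coefficients."""
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def wavelet_features(signal, levels=3):
    """Statistics of the detail coefficients at each level and of the
    final approximation, concatenated into one feature vector."""
    features, approx = [], list(signal)
    bands = []
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        bands.append(detail)
    bands.append(approx)  # final-level approximation
    for band in bands:
        energy = sum(c * c for c in band)
        features += [statistics.mean(band), statistics.pstdev(band), energy]
    return features

# A constant (event-free) signal has zero detail energy at every level.
feats = wavelet_features([1.0] * 64, levels=3)
```

Appending the class label to such a feature vector yields one pattern per event instance, exactly as in (8).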
4.2 Induction Algorithm
The results of our previous investigation Hafiz:Swain:2019 () suggest that relatively simple induction algorithms are more robust against measurement noise. In particular, Decision Tree (DT) and Naive Bayes (NB) Witten:Frank:Hall:2016 () were found to be comparatively more robust for PQ events. Since DT has a limited inherent feature selection capability, we have selected NB as one of the induction algorithms in this study. Further, it is well known that the search landscape of the feature selection problem is conjointly defined by the dataset and the induction algorithm Blum:Langley:1997 (); Dash:Liu:1997 (). For this reason, in addition to NB, k-Nearest Neighbor (kNN) Witten:Frank:Hall:2016 () is also used as an induction algorithm in this study.
Note that the classification performance of an induction algorithm is critically dependent on the ‘hyperparameters’ which are used to control the learning process, e.g., the ‘number of neighbors’ and the ‘distance metric’ in kNN, and the ‘kernel width’ in NB. In this study, the hyperparameters of both induction algorithms have been selected through a ‘grid search’ to maximize the average ten-fold classification accuracy, as shown in Fig. 2(a) (kNN) and Fig. 2(b) (NB).
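The grid search over hyperparameters amounts to exhaustively scoring each combination and keeping the best; a generic sketch (the parameter grid and scoring function here are made up, not the study's settings):

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Return the parameter combination that maximizes score_fn."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)          # e.g., mean ten-fold CV accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid for kNN; the score function stands in for CV accuracy.
grid = {"n_neighbors": [1, 3, 5, 7], "metric": ["euclidean", "cityblock"]}
mock_scores = {(1, "euclidean"): 0.91, (3, "euclidean"): 0.95,
               (5, "euclidean"): 0.94, (7, "euclidean"): 0.92,
               (1, "cityblock"): 0.90, (3, "cityblock"): 0.93,
               (5, "cityblock"): 0.93, (7, "cityblock"): 0.91}
score = lambda p: mock_scores[(p["n_neighbors"], p["metric"])]

best, acc = grid_search(grid, score)
print(best, acc)  # {'metric': 'euclidean', 'n_neighbors': 3} 0.95
```

In the actual study, the score function would be the average ten-fold classification accuracy of the induced kNN or NB model.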
4.3 Compared Feature Selection Algorithms
To evaluate the efficacy of the 2D learning approach, the following existing algorithms are considered: Genetic Algorithm (GA) Siedlecki:Sklansky:1989 (); Kudo:Sklansky:2000 (), Ant Colony Optimization (ACO) Yu:Gu:2009 (); Chen:Chen:2013 (), Binary PSO (BPSO) Kennedy:Eberhart:1997 (), Catfish BPSO (CBPSO) Chuang:Tsai:2011 () and Chaotic BPSO (chBPSO) Chuang:Yang:2011 (). In all these algorithms, the search agent (e.g., a chromosome in GA) encodes a feature subset by the binary string representation given in (2).
In this study, the classical variant of GA is considered, i.e., simple GA with roulette wheel selection, single-point crossover and flip-bit mutation operator, as outlined in Siedlecki:Sklansky:1989 (); Kudo:Sklansky:2000 (). For ACO, the variant proposed in Yu:Gu:2009 (); Chen:Chen:2013 () is considered, where the features are represented as nodes on a directed graph. Each node on this graph is linked by two distinct edges to indicate whether a node/feature is selected or not. A feature subset is represented by the path traversed by an ant over these edges. For a given edge, the probability of inclusion in the path is given by the corresponding pheromone intensity. In each iteration, the pheromone intensity is updated based on a positive feedback mechanism. In this study, the pheromone update procedure proposed for ACO in Yu:Gu:2009 () is implemented. Further, since the proposed 2DUPSO has been developed within the particle swarm framework, BPSO and its two variants (CBPSO and chBPSO) are also included in the comparative investigation. The CBPSO Chuang:Tsai:2011 () retains the learning mechanism of BPSO while introducing the concept of a ‘refresh gap’, i.e., a fixed number of worst performing particles are reinitialized if the swarm cannot locate an improved solution for a prefixed number of iterations. In chBPSO Chuang:Yang:2011 (), the velocity update rule of BPSO is modified to control the inertia weight following chaotic maps. In this study, a logistic map is used to determine the value of the inertia weight in chBPSO.
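The BPSO mechanics referred to above can be summarized in a short sketch. The sigmoid transfer function and velocity clamping follow the standard binary PSO formulation; the parameter values and the logistic-map form of the chaotic inertia weight are assumptions for illustration.

```python
import numpy as np

def bpso_step(x, v, pbest, gbest, w=0.9, c1=2.0, c2=2.0, vmax=4.0, rng=None):
    """One velocity/position update of binary PSO.
    x, pbest and gbest are 0/1 feature-selection bit strings; v is real-valued."""
    rng = np.random.default_rng() if rng is None else rng
    r1, r2 = rng.random(x.shape), rng.random(x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    v = np.clip(v, -vmax, vmax)                 # velocity clamping
    s = 1.0 / (1.0 + np.exp(-v))                # sigmoid transfer function
    x_new = (rng.random(x.shape) < s).astype(int)  # stochastic bit sampling
    return x_new, v

def logistic_map(w):
    """Chaotic inertia weight in the spirit of chBPSO (assumed logistic map)."""
    return 4.0 * w * (1.0 - w)
```

In chBPSO, `w` would be replaced by `logistic_map(w)` at every iteration; CBPSO would additionally reinitialize the worst particles whenever the refresh gap is exceeded.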
In addition to the metaheuristic algorithms, the Sequential Forward Floating Search (SFFS) Pudil:1994 () is also included to compare the search performance of the 2D learning with the existing deterministic search algorithms. The SFFS has been selected for this purpose, as it overcomes the limitations of the other sequential search methods Pudil:1994 (), e.g., the nesting effects of SFS and SBS, and the selection of appropriate ($l$, $r$) in the plus-l take-away-r floating search.
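As a rough illustration of the floating search, the following sketch implements SFFS for a caller-supplied score function to maximize; in a wrapper setting that score would be the negated cross-validation error of the induced classifier. The termination and backtracking conditions below follow the usual description of SFFS, but the details are a simplified assumption rather than a faithful reimplementation.

```python
import math

def sffs(n_features, score, target_size):
    """Sequential Forward Floating Search (sketch).
    Greedily adds the most significant feature, then conditionally removes
    features while doing so improves the best subset of the smaller size."""
    best = {0: (frozenset(), -math.inf)}        # size -> (subset, score)
    subset = frozenset()
    while len(subset) < target_size:
        # forward step: include the single most significant feature
        f = max((f for f in range(n_features) if f not in subset),
                key=lambda f: score(subset | {f}))
        subset = subset | {f}
        if score(subset) > best.get(len(subset), (None, -math.inf))[1]:
            best[len(subset)] = (subset, score(subset))
        # conditional (floating) backward steps
        while len(subset) > 2:
            g = max(subset, key=lambda g: score(subset - {g}))
            reduced = subset - {g}
            if score(reduced) > best[len(reduced)][1]:
                subset = reduced
                best[len(subset)] = (subset, score(subset))
            else:
                break
    return set(best[target_size][0])
```

Because a backward step is taken only when it strictly improves the best subset of that cardinality, the procedure cannot oscillate indefinitely.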
The search parameters of each algorithm are set following the procedures outlined in Siedlecki:Sklansky:1989 (); Yu:Gu:2009 (); Kennedy:Eberhart:1997 (); Chuang:Tsai:2011 (); Chuang:Yang:2011 () and are shown in Table 2. All of the compared algorithms are implemented in MATLAB.
4.4 Search Setup
The efficacy of a feature subset can be evaluated by either the filter or the wrapper approach. The selection of either approach requires a trade-off between precision and computational complexity. In this study, the total number of features is moderate, and therefore it is appropriate to select the wrapper approach due to its precision Blum:Langley:1997 (). In other words, for each subset under consideration, a classifier is induced by the induction algorithm (say NB or kNN) and the subsequent classification error is used as the criterion function, $J$.
Further, to remove any bias to the validation data, the mean classification error after 10-fold stratified cross-validation is used as the criterion function Witten:Frank:Hall:2016 (). For a given feature subset, $X$, this is given by,
(9) $J(X) = \frac{1}{10} \sum_{i=1}^{10} \varepsilon_i(X)$
where $\varepsilon_i(X)$ denotes the classification error on the $i^{th}$ fold.
Without loss of generality, the feature selection problem is approached as a minimization problem where the objective is to minimize the classification error, given by (9). To account for the inherent stochastic nature of the algorithms, 40 independent runs of each algorithm are executed. Each run is set to terminate after a fixed number of Function Evaluations (FEs).
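The wrapper criterion can be sketched as below. A nearest-centroid classifier stands in for NB/kNN so that the example stays self-contained; the stratified fold assignment and the stand-in classifier are illustrative assumptions.

```python
import numpy as np

def criterion_J(X, y, mask, folds=10, seed=0):
    """Mean k-fold stratified CV error of a nearest-centroid classifier on the
    selected columns (boolean mask) -- a stand-in for the NB/kNN wrappers."""
    Xs = X[:, mask]
    rng = np.random.default_rng(seed)
    # stratified folds: shuffle indices within each class, deal round-robin
    fold_id = np.empty(len(y), dtype=int)
    for c in np.unique(y):
        idx = rng.permutation(np.where(y == c)[0])
        fold_id[idx] = np.arange(len(idx)) % folds
    errs = []
    for f in range(folds):
        tr, te = fold_id != f, fold_id == f
        cents = {c: Xs[tr & (y == c)].mean(axis=0) for c in np.unique(y)}
        classes = np.array(sorted(cents))
        D = np.stack([np.linalg.norm(Xs[te] - cents[c], axis=1) for c in classes])
        pred = classes[np.argmin(D, axis=0)]    # nearest class centroid
        errs.append(np.mean(pred != y[te]))
    return float(np.mean(errs))
```

Each search agent's bit string maps directly to `mask`, so one criterion evaluation costs one full cross-validated induction, which is why the FE budget dominates the runtime.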
5 Results
As mentioned earlier, in this study we address the following two issues: 1) Which feature selection approach is more effective for PQ events? and 2) How robust are the reduced feature subsets against the measurement noise? For this purpose, the comparative evaluation of seven feature selection wrappers has been carried out using two induction algorithms considering fourteen distinct PQ events, following the procedure outlined in ‘Stage-I’ of Fig. 2. Note that the feature selection has been carried out using only the ‘pure’ (noise-free) PQ events. The results of this investigation are discussed in Section 5.1.
The next issue is to evaluate the robustness of the reduced feature subsets obtained by the feature selection algorithms. For this purpose, seven different levels of zero-mean Gaussian white noise (specified in terms of SNR) are added to the PQ events. The framework for this part of the investigation is outlined in ‘Stage-II’ of Fig. 2. The results of this test are discussed in Section 5.2.
5.1 Stage-I: Comparative evaluation of the feature selection approaches
For the purpose of comparative evaluation, 40 independent runs of each algorithm are recorded. Since the primary objective is to improve the classification performance through the removal of irrelevant/redundant features, the search performance of the compared algorithms is evaluated by two criteria, i.e., classification performance and size of the feature subset (cardinality).
The results obtained after 40 independent runs of each algorithm with the kNN and NB classifiers are shown in Tables 3 and 4, respectively. The results obtained with the 2D learning approach (2DUPSO) are shown in the last column of Tables 3 and 4. The best results obtained among the compared algorithms are shown in boldface.
To compare the classification performance, the average (Mean) and standard deviation (SD) of the criterion function, $J$, are shown in Tables 3 and 4. Similarly, to compare the reduction in subset size, the ‘Mean’ and ‘SD’ of the subset cardinality are computed over 40 runs for each algorithm, as shown in Tables 3 and 4. Further, the following two metrics are used to measure the performance improvement obtained by each of the algorithms,
(10) $PI = \frac{J_o - J_{avg}}{J_o} \times 100\%$
(11) $CR = \frac{n - \xi_{avg}}{n} \times 100\%$
where $J_o$ is the criterion function with the original feature set ($X_o$), $J_{avg}$ is the average of the criterion function over 40 runs obtained with the reduced feature sets, $\xi_{avg}$ is the average cardinality over 40 runs and $n$ is the total number of features.
The Performance Improvement metric, $PI$, gives the improvement in the classification performance with respect to the original feature set, $X_o$. The second metric, $CR$, shows the percentage reduction in the cardinality with respect to the total number of features, $n$. A higher positive value of these metrics implies a better search performance. In addition, the following metric is used to evaluate the overall performance of the algorithms:
(12) $score = \frac{1}{40} \sum_{r=1}^{40} \left( \frac{\xi_r}{n} + \varepsilon_r \right)$
where $X_r$, $\xi_r$ and $\varepsilon_r$ are the feature subset, its cardinality and the corresponding classification error at the end of the $r^{th}$ run of the algorithm, and $n$ denotes the total number of features. Note that this metric incorporates the information about both the cardinality and the classification performance. A lower value of this metric indicates that the algorithm could consistently find feature subsets with a lower cardinality and a lower classification error.
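The three metrics can be computed as follows. The $PI$ and $CR$ definitions are direct; the exact combination used for the overall score is not fully recoverable from the text, so the `overall_score` form below (normalized cardinality plus error, averaged over runs) is an assumption that preserves the stated property that lower cardinality and lower error yield a lower score.

```python
import numpy as np

def perf_improvement(J_o, J_runs):
    """PI: % improvement of the mean reduced-set error over the full-set error J_o."""
    return (J_o - np.mean(J_runs)) / J_o * 100.0

def cardinality_reduction(n, xi_runs):
    """CR: % reduction of the mean subset cardinality w.r.t. the n original features."""
    return (n - np.mean(xi_runs)) / n * 100.0

def overall_score(n, xi_runs, eps_runs):
    """Overall score (assumed form): mean over runs of normalized size plus error."""
    xi, eps = np.asarray(xi_runs, float), np.asarray(eps_runs, float)
    return float(np.mean(xi / n + eps))
```

For example, halving the error of a full set with $J_o = 0.2$ gives $PI = 50\%$, and keeping 5 of 10 features gives $CR = 50\%$.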
From the results shown in Tables 3 and 4, it is clear that the 2D learning approach is the most efficient, as it gives the lowest classification error amongst the compared algorithms. With both induction algorithms, 2DUPSO could achieve the highest $PI$ (with kNN, Table 3; with NB, Table 4). It is interesting to note that, in comparison to GA and ACO, all BPSO variants could yield better subsets with both kNN and NB.
Further, in the present study, the feature selection issue is approached as a single-objective problem where the primary objective is to minimize the classification error. Therefore, any reduction in the subset cardinality is the direct consequence of the ability of the search algorithms to distinguish useful features from the irrelevant/redundant features. Intuitively, the exploitation of the cardinality information in the 2D learning is likely to improve the search performance of 2DUPSO. The results shown in Tables 3-4 corroborate this notion: 2DUPSO could provide the smallest feature subsets with the highest reduction in cardinality, with both kNN and NB.
The overall performance of the compared algorithms is evaluated using the ‘overall score’ metric (12), which considers both the cardinality and the classification performance of the feature subsets obtained by the algorithm over 40 runs. As revealed by (12), a lower score indicates the consistent discovery of subsets with fewer features and lower classification error. As seen in Tables 3-4, the overall score obtained by 2DUPSO is the lowest amongst the compared algorithms, which indicates the best overall performance.
The results further show a shift in the search landscape with the change of the induction algorithm. For example, the $PI$ obtained by the compared algorithms with kNN (Table 3) is comparatively higher than that obtained with NB (Table 4). Similar effects are observed in $CR$ as well. These results further underline the need for the wrapper-based feature selection approach.
5.1.1 Nonparametric Statistical Evaluation
Due to the stochastic nature of the compared algorithms, further statistical analysis is carried out to determine the significance of the results shown in Table 3 and Table 4. In particular, the objective of this analysis is to determine whether the results (i.e., the cardinality, $\xi$, and the classification error, $\varepsilon$) obtained by 2DUPSO are significantly better than those of the compared algorithms. For this purpose, multiple nonparametric statistical comparisons are carried out following the guidelines in Derrac et al. Derrac:Salvador:2011 (). The test is carried out in the following two steps:
First, the Friedman Two-way Analysis of Variance by Ranks Sheskin:2003 (); Derrac:Salvador:2011 () is applied to determine whether the performance of two or more compared algorithms is significantly different. For this purpose, the results obtained by the algorithms are ranked from ‘1’ (best) to ‘6’ (worst). Subsequently, the average value of the ranks is determined over 40 independent runs. The test statistic and the corresponding p-value are determined following the procedures outlined in Sheskin:2003 (); Derrac:Salvador:2011 () and are shown in Table 5. The p-values obtained through the Friedman statistic strongly suggest a significant difference in the performance of the compared algorithms. Further, the average rankings obtained over 40 independent runs establish that 2DUPSO is the best amongst the compared algorithms.
In the second step, a set of hypotheses is evaluated for multiple comparisons of 2DUPSO with the other algorithms. Specifically, the evaluation of five interconnected null hypotheses is required to compare 2DUPSO with the five other algorithms, i.e., GA, ACO, BPSO, CBPSO and chBPSO. Each null hypothesis ($H_0$) states that there is no significant difference between the performance of 2DUPSO and the algorithm being compared. The test statistics, p-values and Adjusted p-values (APV), which are required to evaluate these interconnected hypotheses, are determined following the procedure outlined in Derrac:Salvador:2011 (); Garcia:Salvador:2010 (). Hommel’s post-hoc procedure is employed to derive the APV from the p-value. The outcomes of the multiple comparisons are shown in Table 6 (for kNN) and Table 7 (for NB). These results convincingly demonstrate that, among the compared algorithms, 2DUPSO could obtain feature subsets with significantly lower cardinality ($\xi$) and classification error ($\varepsilon$).
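The Friedman statistic used in the first step can be computed as follows (a numpy sketch; the Hommel post-hoc adjustment is omitted). A lower error receives the better, i.e., lower, rank, and ties receive their average rank.

```python
import numpy as np

def friedman_stat(errors):
    """errors: (N runs, k algorithms) matrix; lower error -> better rank.
    Returns the Friedman chi-square statistic and the average ranks."""
    N, k = errors.shape
    ranks = np.empty_like(errors, dtype=float)
    for i, row in enumerate(errors):
        order = np.argsort(row)
        r = np.empty(k)
        r[order] = np.arange(1, k + 1)
        for v in np.unique(row):                # average the tied ranks
            tie = row == v
            r[tie] = r[tie].mean()
        ranks[i] = r
    Rbar = ranks.mean(axis=0)
    chi2 = 12.0 * N / (k * (k + 1)) * np.sum(Rbar ** 2) - 3.0 * N * (k + 1)
    return chi2, Rbar
```

The resulting statistic would then be compared against the chi-square (or Iman-Davenport F) critical value to decide whether the algorithms differ significantly.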
5.1.2 Comparative analysis with SFFS
The results of the comparative analysis with SFFS are shown in Table 8. Note that SFFS requires a priori specification of the subset cardinality. Since the cardinality of the optimum feature subset is not known, the cardinality of the best subset found by 2DUPSO (out of 40 runs) is determined and denoted as ‘$\xi^\ast$’ for each induction algorithm. Subsequently, SFFS is applied as a wrapper to both kNN and NB to find the feature subset with the cardinality equal to ‘$\xi^\ast$’. Given the same cardinality, the objective here is to determine whether SFFS could identify a feature subset with comparable or better accuracy than 2DUPSO. As expected, the outcome of these tests (Table 8) indicates that SFFS could not yield a better feature subset with either kNN or NB.
5.2 Stage-II: Robustness of the Reduced Subsets
To evaluate the robustness of the reduced subsets against measurement noise, seven different levels of noise (specified in terms of SNR) have been added to the PQ events. Given that the noise introduced by the measurement chains cannot be estimated a priori, the objective here is to compare the classification performance of the reduced subsets and the original feature set ($X_o$) at various noise levels.
Note that the feature selection has been carried out using only pure (noise-free) PQ events. Further, each run of the compared algorithms provides a different feature subset due to their inherent stochastic nature. To ensure a fair comparison, for each algorithm the feature subset with the minimum classification error, $\varepsilon$, out of 40 runs is selected.
For each feature subset under consideration, the average classification accuracy after 10-fold stratified cross-validation, ‘$\mathcal{A}$’, is recorded at each noise level. This is given by,
(13) $\mathcal{A}(X) = \frac{1}{10} \sum_{i=1}^{10} a_i(X)$
where $X$ and $a_i(X)$ respectively denote the reduced subset under consideration and the corresponding classification accuracy on the $i^{th}$ fold. Note that the accuracy is the complement of the classification error in (9).
In order to evaluate the ‘robustness’ of a given feature subset, $X$, its classification accuracy, $\mathcal{A}(X)$, is compared with that of the original full feature set, $X_o$. As the feature selection was carried out using the ‘pure’ (noise-free) dataset, the performance of all reduced subsets is better compared to $X_o$ at this noise level. The objective here is to investigate whether the reduced subsets could maintain the improved performance in the presence of the other levels of noise. Essentially, at each noise level, we are interested in finding out whether $\Delta > 0$, which is given by,
(14) $\Delta = \mathcal{A}(X) - \mathcal{A}(X_o)$
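The noise injection and the robustness metric can be sketched as follows; the sinusoid in the check is a hypothetical stand-in for a PQ event waveform, and the target SNR value is arbitrary.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Add zero-mean Gaussian white noise at a target SNR (in dB)."""
    rng = np.random.default_rng(0) if rng is None else rng
    p_sig = np.mean(signal ** 2)
    p_noise = p_sig / (10.0 ** (snr_db / 10.0))   # SNR = 10*log10(Ps/Pn)
    return signal + rng.normal(0.0, np.sqrt(p_noise), signal.shape)

def delta(acc_reduced, acc_full):
    """Robustness metric of eq. (14): accuracy gain of the reduced subset."""
    return acc_reduced - acc_full
```

Repeating the 10-fold evaluation on the noisy copies at each SNR level, and taking `delta` against the full feature set, reproduces the Stage-II procedure.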
The classification accuracy, $\mathcal{A}$, and the performance difference, $\Delta$, of the reduced subsets are shown in Table 9 (with kNN) and Table 10 (with NB). The metric in (14) indicates the degree of improvement over $X_o$, i.e., a higher positive value of $\Delta$ is desirable. It is clear that the reduced subset obtained by 2DUPSO could achieve the highest $\Delta$ with both kNN (Table 9) and NB (Table 10). By integrating the cardinality information into the search process, 2DUPSO could find robust and effective feature subsets. These results convincingly demonstrate that, with a proper feature selection approach, it is possible to obtain a robust feature subset that can yield enhanced performance even in the presence of various levels of measurement noise.
Further, the results in Tables 9 and 10 clearly show the influence of the induction algorithms on the search landscape. For example, with kNN, all algorithms could identify robust feature subsets, i.e., a positive $\Delta$ at all noise levels, as seen in Table 9. In contrast, with NB, only 2DUPSO could yield a robust feature subset (Table 10). These results further underline the importance of the feature selection approach.
Finally, the statistical significance of the results shown in Tables 9 and 10 is determined. For this purpose, it is not feasible to apply the multiple nonparametric statistical comparisons (similar to Section 5.1.1), since the number of datasets is relatively small Derrac:Salvador:2011 (). Therefore, Contrast Estimation based on medians Garcia:Salvador:2010 (); Derrac:Salvador:2011 () is applied to compare the algorithms. This test essentially estimates a quantitative performance difference over multiple datasets for all possible pairs of algorithms.
The outcomes of this test are shown in Table 11 (with kNN) and Table 12 (with NB). Note that a higher positive value of the estimator is desirable for this test. For instance, with kNN, the contrast estimator for ACO is positive with respect to GA and chBPSO (Table 11), which indicates that ACO could yield better subsets in comparison to GA and chBPSO. For each of the algorithms, the positive outcomes are shown in boldface in Tables 11 and 12. As seen in Tables 11 and 12, the estimator for 2DUPSO is positive for each pairwise comparison, which further highlights the enhanced search performance of 2DUPSO. Furthermore, the shift in the search landscape can indirectly be illustrated by the search behavior of GA; for instance, with kNN, the estimator values for GA are negative with respect to all the algorithms (Table 11), whereas with NB, positive estimators for GA are obtained with respect to three algorithms (ACO, CBPSO and chBPSO, Table 12).
6 Discussion
The following observations are inferred from the results of the comparative evaluation (Section 5.1) and the robustness evaluation (Section 5.2):

The feature selection wrappers are often criticized for their computational complexity. However, as the results of this investigation suggest, wrappers can yield a significant reduction in the feature subset size while improving the classification performance; the compared algorithms could substantially reduce the original feature set while improving the classification performance with both kNN (Table 3) and NB (Table 4). Since the feature selection is performed only once, the wrapper approach is highly recommended for the PQ events.

With an effective search strategy, it is possible to identify a feature subset which is robust against the measurement noise. In this study, the worst-case scenario from the perspective of PQ events has been simulated, i.e., the feature selection is carried out using only ‘pure’ (noise-free) PQ events and subsequently the reduced subsets are evaluated under various levels of measurement noise. Nevertheless, under this scenario, 2DUPSO could identify robust feature subsets with both kNN and NB.

The results give empirical evidence for the hypothesis that the nature of the induction algorithm does affect the feature selection landscape. For example, all of the compared algorithms could yield robust feature subsets with kNN; however, only 2DUPSO could identify a robust feature subset with NB. This further underlines the need for a wrapper-based feature selection approach.
7 Conclusions
In this study, the issue of feature selection has comprehensively been investigated in the context of PQ event identification. In particular, the search performance of the Two-dimensional (2D) learning approach (2DUPSO) and six other feature selection wrappers has been compared considering fourteen distinct classes of PQ events. Further, the robustness of the reduced feature subsets has been defined and evaluated under seven different levels of measurement noise. The results of the comparative evaluation convincingly demonstrate that 2DUPSO can identify significantly better and robust feature subsets for PQ events. The key distinctive property of 2D learning is the integration of information about the subset size into the learning framework. This has been shown to lead to a significant improvement in the search performance in comparison to the other well-known algorithms, e.g., GA, ACO and BPSO.
Without loss of generality, this investigation is based on the assumption that the induction algorithm for the PQ event identification is prefixed. If this assumption does not hold, or a generalized reduced feature subset is desired, then the filter-based feature selection approaches are more appropriate. Hence, a detailed comparative investigation of the different filter approaches, such as Mutual Information, Minimum Redundancy Maximum Relevance and Correlation-based feature selection, may prove to be very useful to both the practicing engineers and the PQ researchers. This could be the subject of further research.
Acknowledgement
Faizal Hafiz is thankful to Education New Zealand for supporting this research through the New Zealand International Doctoral Research Scholarship (NZIDRS).
Appendix A Illustrative Example of 2D Learning
Consider a dataset with $n$ features. For this dataset, let the position of the particle and the learning exemplar at a particular search iteration be given by,
(15) 
Evaluation of the Learning Sets
The learning sets derived from them, as per Algorithm 1, are as follows:

Set the cardinality learning sets to null vectors:

Cardinality of and :
and . 
Set the bit of to ‘’:
and 
Evaluate feature learning sets:
and 
Evaluate the final learning sets:
and
Position Update
To understand the position update procedure, assume that the velocity of the particle is given by,
(16) 
The new position of the particle is determined as per the procedure outlined in Algorithm 2:

Set the new position to a null vector:

Isolate the selection likelihoods of the cardinality and the features: