Efficient Feature Selection of Power Quality Events using Two Dimensional (2D) Particle Swarms

Faizal Hafiz, Akshya Swain and Nitish Patel, Department of Electrical & Computer Engineering, The University of Auckland, Auckland, New Zealand (faizalhafiz@ieee.org); Chirag Naik, Sarvajanik College of Engineering & Technology, Surat, India
Abstract

A novel two-dimensional (2D) learning framework has been proposed to address the feature selection problem in Power Quality (PQ) events. Unlike the existing feature selection approaches, the proposed 2D learning explicitly incorporates the information about the subset cardinality (i.e., the number of features) as an additional learning dimension to effectively guide the search process. The efficacy of this approach has been demonstrated on fourteen distinct classes of PQ events which conform to the IEEE Standard 1159. The search performance of the 2D learning approach has been compared to six other well-known feature selection wrappers by considering two induction algorithms: Naive Bayes (NB) and k-Nearest Neighbors (k-NN). Further, the robustness of the selected/reduced feature subsets has been investigated at seven different levels of noise. The results of this investigation convincingly demonstrate that the proposed 2D learning can identify significantly better and more robust feature subsets for PQ events.

keywords:
Classification, Dimensionality Reduction, Feature Selection, Particle Swarm Optimization, Pattern Recognition, Power Quality
Journal: Applied Soft Computing

1 Introduction

Over the past two decades, the landscape of the energy market has been going through a significant transformation due to the increasing share of non-linear loads. The breakthrough progress in semiconductor technology has enabled a wide-scale deployment of power electronic converters, adjustable speed drives and consumer electronics. Furthermore, power electronic converters are the key components in the grid interface of renewable generation. In this scenario, one of the major challenges faced by the utilities is the deterioration in Power Quality (PQ). Since the success of the remedial action is critically dependent on the nature of the PQ event, the identification of a PQ event is vital in addressing poor PQ. This study, therefore, focuses on the identification of PQ events.

In essence, any departure from the ideal sinusoid can be considered as a PQ event. Various PQ events exist in the modern grid which differ in terms of spectral content, magnitude and duration. Based on these ‘traits’, the IEEE Standard 1159 characterizes PQ events of distinct nature IEEE:1159 (). Arguably, the earliest attempts to identify PQ events can be traced back to the works of Santoso et al. Santoso:1996 (); Santoso:2000 (), Angrisani et al. Angrisani:1998 () and Gaouda et al. Gaouda:Salama:1999 (), in which feature extraction through multi-resolution analysis was proposed. Subsequently, over the years, the PQ event identification problem has been transformed into a pattern recognition problem wherein a pattern is formed by various features extracted from the voltage measurements and a label corresponding to the PQ event, i.e., a pattern ‘p’ is given by p = {F, y}, where ‘y’ and ‘F’ respectively represent the output label and the features extracted from the PQ event. The patterns extracted from the PQ events are subsequently used to induce a classifier through a suitable induction algorithm. Most of the existing PQ event identification approaches are based on this framework Mahela:Shaik:2015 (); Khokhar:Zin:2015 (). Note that the utility/efficacy of the extracted features is usually not evaluated in most of the existing PQ identification approaches Mahela:Shaik:2017 (); Kumar:Singh:2015 (); Liu:Cui:2015 (); Biswal:Dash:2013 (); Naik:Hafiz:2016 (), which often leads to the inclusion of irrelevant, redundant and noisy features.

Despite the significant research efforts dedicated to machine learning and pattern recognition, the feature extraction process is still, in major part, dependent on expert knowledge. Since it is not trivial to estimate the required number of features/attributes beforehand, a large number of features are usually extracted in order to identify a pattern. This often leads to the inclusion of irrelevant and/or redundant features, which in turn leads to increased storage requirements, slower processing times and reduced generalization capability of the classifier. The objective of feature selection is to overcome these problems through the removal of irrelevant/redundant features. Note that feature selection is one of the fundamental problems of machine learning and it has been a topic of active research for the last six decades Marill:Green:1963 (); Blum:Langley:1997 (); Dash:Liu:1997 (); Guyon:Isabelle:2003 (); Xue:Zhang:2016 (). Over the years, several feature selection approaches have been developed with a proven ability to significantly reduce the number of features while maintaining/improving classification performance in many applications Guyon:Isabelle:2003 (); Xue:Zhang:2016 ().

Despite these advantages, feature selection has not received enough attention in PQ identification except for a few existing studies Panigrahi:2009 (); Gunal:2009 (); Lee:Shen:2011 (); Manimala:Selvi:2011 (); Manimala:Selvi:2012 (); Ericsti:2013 (); Dalai:Chatterjee:2013 (); Hajian:Foroud:2014a (); Hajian:Foroud:2014b (); Abdoos:Mianaei:2016 (); Hafiz:Swain:2017 (); Khokhar:Zin:2017 (); Singh:Singh:2017 (). The choice of the feature selection method varies among the existing research, which includes Sequential Search Gunal:2009 (); Hajian:Foroud:2014a (); Hajian:Foroud:2014b (); Abdoos:Mianaei:2016 (), Genetic Algorithm (GA) Panigrahi:2009 (); Gunal:2009 (); Manimala:Selvi:2011 (); Manimala:Selvi:2012 (); Hafiz:Swain:2017 (), Simulated Annealing (SA) Manimala:Selvi:2011 (); Manimala:Selvi:2012 (), Binary Particle Swarm Optimization (BPSO) Hafiz:Swain:2017 (), Fully Informed Particle Swarm (FIPS) Lee:Shen:2011 (), Artificial Bee Colony (ABC) Khokhar:Zin:2017 (), the k-means apriori algorithm Ericsti:2013 () and rough sets Dalai:Chatterjee:2013 (). The common drawback of sequential search methods such as Sequential Forward Search (SFS) and Sequential Backward Search (SBS) is the so-called ‘nesting effect’, i.e., once a feature is included in/excluded from the subset, it cannot subsequently be removed/added. The floating sequential search approaches such as Plus-l Take Away-r (PTA) and Sequential Forward Floating Search (SFFS) can prevent the nesting effect. However, these are still local searches, as the over-emphasis on a single feature neglects the correlation among features. Furthermore, all sequential search methods require specification of the reduced subset size a priori. Given that the size of the optimal feature subset is not known, the sequential search essentially addresses only half of the feature selection problem. The meta-heuristic search methods such as GA, SA and ABC could overcome these limitations. However, in Panigrahi:2009 (); Manimala:Selvi:2012 (); Khokhar:Zin:2017 (), these approaches have not been used to their full potential, as the size of the feature subset is fixed a priori. Such a practice is usually not recommended, since fixing the subset size to a particular value, say ‘k’, reduces the feature search space from the 2^n candidate subsets to only the (n choose k) subsets of size k, which may exclude the optimal feature subset from the search. The filter based feature selection approaches have been employed in Hajian:Foroud:2014a (); Hajian:Foroud:2014b (); Dalai:Chatterjee:2013 (), which usually involves a trade-off in the classification performance as the nature of the induction algorithm is not considered. This is further corroborated by the results of an earlier investigation Hafiz:Swain:2017 (), which suggest that the indirect performance indicators used in the filter approach do not necessarily translate into equivalent classification performance. In Lee:Shen:2011 (), a Probabilistic neural network-based Feature Selection (PFS) has been proposed which employs a combination of the Fully Informed Particle Swarm and an Adaptive Probabilistic Neural Network to evaluate the efficacy of a feature subset. The major drawback of this approach is the assumption of a monotonic relation between the feature subset size and the classification performance. This assumption can only be satisfied by an ideal Bayes classifier; in practice, classifiers do not follow the monotonicity property Siedlecki:Sklansky:1989 (); Kohavi:1994 (); Kohavi:John:1997 (); Yang:Honavar:1998 ().
In addition, PFS is prone to the nesting effect, as it does not have any mechanism to reconsider the discarded features.

Further, most of the existing research includes only deterministic feature selection approaches in the comparative analysis, whereas it is known that meta-heuristic search could yield enhanced search performance Xue:Zhang:2016 (). Among the meta-heuristic searches, so far only the performance of GA, SA and BPSO has been evaluated Panigrahi:2009 (); Manimala:Selvi:2011 (); Manimala:Selvi:2012 (); Hafiz:Swain:2017 (). In this scenario, the following questions arise:

  1. Amongst the various meta-heuristic approaches (such as GA, BPSO, ACO and others), which approach would yield the minimum feature subset without compromising the classification performance?

  2. Is the classification performance obtained from the reduced feature subset robust against various levels of measurement noise?

The main aim of this study is to address these concerns through a comprehensive investigation. For this purpose, a comparative analysis of seven different feature selection wrappers is carried out using fourteen distinct classes of PQ events and two different induction algorithms. Further, the robustness of the reduced subsets against measurement noise is evaluated using seven different levels of zero-mean Gaussian white noise.

The other objective of this study is to investigate the efficacy of the new two-dimensional feature selection algorithm on PQ events. The Two-Dimensional (2D) learning algorithm has recently been developed by the authors as a generalized feature selection wrapper Hafiz:Swain:2017a (). The core idea of the 2D learning approach is to embed the information about the number of features (referred to as ‘cardinality’) into the learning process of particle swarms. This distinctive quality of the 2D learning approach has been shown to be effective in identifying compact feature subsets with improved classification performance on several benchmark machine learning datasets. This, therefore, has been the motivation behind the application of 2D learning to the PQ event identification problem. In particular, the present study focuses on the practical issues encountered in the feature selection of PQ events and therefore significantly differs from (Hafiz:Swain:2017a, ) in the following aspects:

  • A well-balanced, comprehensive and diverse PQ event dataset is built. The dataset includes both simulated and experimental PQ events which conform to the IEEE Std. 1159. Further, a relatively large variation of PQ events is considered in comparison to the existing research on PQ identification Lee:Shen:2011 (); Biswal:Dash:2013 (); Biswal:Dash:2013a (); Kumar:Singh:2015 (); Singh:Singh:2017 (); Mahela:Shaik:2017 (). For example, the PQ events cover variations in magnitude, event duration, fundamental frequency and harmonic order. Thus, most of the PQ events expected in the conventional power distribution grid are well represented in the investigated dataset.

  • The robustness of the selected/reduced feature subsets against the measurement noise is defined and evaluated under various levels of zero-mean Gaussian white noise.

  • The Two-Dimensional (2D) learning algorithm is better formalized and explained through an illustrative example from the perspective of PQ event identification.

  • The efficacy of the 2D learning algorithm is demonstrated through a comparative evaluation against six established feature selection algorithms: GA, ACO, BPSO, CBPSO, chBPSO and SFFS.

  • A rigorous analysis is carried out to determine the statistical significance of the results. For instance, multiple non-parametric statistical comparisons are carried out using the Friedman test and the Hommel’s post-hoc procedure to compare the search performance of the algorithms. In addition, the performance of algorithms under various levels of noise is compared using the Contrast Estimation based on medians.

Thus, in this study, most of the operating scenarios expected in the utility distribution grid have been accommodated. Further, several key issues related to the feature selection have been investigated.

The rest of the article is organized as follows: Section 2 provides a brief overview of the feature selection approaches. The 2D-learning approach has been briefly discussed in Section 3. The investigation framework of this study is provided in Section 4. The results of the comparative evaluation and the robustness test are shown in Section 5, followed by the discussions in Section 6 and the conclusions in Section 7.

2 Brief Overview of Feature Selection Methods

Since the main objective of this work is to select the relevant feature subset for the PQ event identification, it is pertinent to discuss the existing feature selection methods briefly.

The feature selection problem essentially begins by considering a dataset having ‘n’ input features, X = {f_1, f_2, …, f_n}, and ‘c’ output classes. For this dataset, the task of the induced classifier is to determine the output label ‘y’ (y ∈ {1, …, c}) corresponding to the input features. The objective of feature selection is to identify a subset of features, ‘S’ (S ⊆ X), through which this task can be accomplished with similar or improved classification performance:

S* = arg max_{S ⊆ X} J(S)    (1)

where ‘J(·)’ is the criterion function which represents the classification performance of the feature subset and ‘ξ(S)’ denotes the cardinality, or number of features, of the subset S (ξ(S) ≤ n). The search for an optimal solution of the feature selection problem requires the evaluation of 2^n subsets; hence exhaustive search is intractable even for moderate-size datasets Cover:Van:1977 (). Note that the selection of a relevant feature subset from the available features is essential for better classification performance and generalization capability of the classifier Blum:Langley:1997 ().

Over the past six decades, while the fundamental objective of the feature selection problem has not changed (i.e., identify a subset of relevant features), machine learning algorithms have evolved significantly to meet the requirements of different applications. For example, in many applications, the class ‘labels’ may be only partially available or entirely absent Guyon:Isabelle:2003 (); Liu:Yu:2005 (). Such scenarios require an unsupervised/semi-supervised machine learning approach. A distinct feature selection strategy is required for each machine learning approach, such as supervised or unsupervised learning Guyon:Isabelle:2003 (); Liu:Yu:2005 (); Shang:Wang:2016 (); Shang:Wang:2018 (). In this study, we focus on the feature selection approaches for supervised learning, as the nature of PQ events has been well characterized by the IEEE Std. 1159 IEEE:1159 (). Hence, a comprehensive PQ event dataset has been developed through a combination of simulation and practical field measurements for supervised learning.

Most of the existing feature selection methods for supervised learning can be distinguished by their approach (e.g., wrapper or filter) to evaluate the criterion function, J Blum:Langley:1997 (); Dash:Liu:1997 (); Kohavi:John:1997 (). The wrapper approach is straightforward: a classifier is induced for each feature subset under consideration and the resulting classification accuracy is used as J. On the contrary, in the filter methods, J is estimated using statistical or information-theoretic measures without inducing a classifier. Note that the search landscape of the feature selection problem is conjointly defined by both the dataset and the induction algorithm, i.e., each induction algorithm has specific traits. Hence, the optimal feature subset for a particular induction algorithm may not be optimal for another, as will be shown by the results of this study (Section 5.1).

The choice of feature selection method involves a trade-off between speed and accuracy; the wrappers are more precise whereas the filters are comparatively faster. The size of the feature set plays a major role in the selection of either approach. For small to medium-size feature sets, wrappers are more appropriate. For larger datasets, the computational burden of the wrapper may be infeasible; hence, filters are preferred in such a scenario. Usually, the PQ event datasets are of small to medium size, thus in this work the focus is on the wrapper approach.

Earlier feature selection approaches, such as the sequential search methods Marill:Green:1963 (); Whitney:1971 (); Pudil:1994 (); Somol:Pudil:1999 () and the branch and bound methods Narendra:1977 (); Yu:Yuan:1993 (), are ‘deterministic’ in nature, i.e., for a given dataset, they give the same solution over independent runs. The core idea behind most of these approaches is to evaluate the utility/relevance of a feature by evaluating its discrimination capability over the output classes. However, due to the over-emphasis on an individual feature, the correlation among features is neglected. In addition, most of the deterministic approaches require a priori selection of the subset cardinality, ‘k’. Consequently, the search space reduces from the 2^n candidate subsets to only the (n choose k) subsets of size k. Further, several deterministic approaches Narendra:1977 (); Yu:Yuan:1993 () assume a monotonic criterion function, J, which is often impractical.

Meta-heuristic search methods such as Genetic Algorithm (GA), Ant Colony Optimization (ACO), Tabu Search (TS) and Particle Swarm Optimization (PSO) have been applied to the feature selection problem to address the shortcomings of the deterministic approaches Xue:Zhang:2016 (). Unlike the deterministic methods, most of the meta-heuristic search methods operate on feature subsets. Hence, the effects of feature correlation are accounted for in the search process. Further, the meta-heuristic search is, in essence, a population-based search with implicit parallelism, which allows comparatively better sampling of the search space. For this reason, we focus on the meta-heuristic wrappers in this study.

3 Two-Dimensional (2D) Learning Framework for Particle Swarms

The core idea behind the Two-Dimensional (2D) learning approach is briefly discussed in the following subsections for the sake of completeness. A detailed discussion of this approach can be found in Hafiz:Swain:2017a (). Note that 2D learning is intended to be a generalized learning algorithm for particle swarm based feature selection methods. It is, therefore, possible to adapt most of the existing PSO variants Hafiz:Abdennour:2013 () defined in the continuous domain to the feature selection problem (a binary/discrete domain) following 2D learning. However, the results of the study in (Hafiz:Abdennour:2016, ) indicate that, among the popular PSO variants, the adapted Unified Particle Swarm Optimization (UPSO) (UPSO1, ) performs comparatively better for problems in the discrete domain. For this reason, in this work, UPSO has been adapted following the 2D learning approach and is referred to throughout the manuscript as ‘2D-UPSO’.

Input : Particle position (x_i) and learning exemplar (β)
Output : Learning sets: L_c and L_f
1 */ Learning for subset cardinality
2 Set the cardinality learning set to an n-dimensional null vector, i.e., L_c = [0, 0, …, 0]
3 Determine the cardinality of the learning exemplar (ξ_β) and of the particle position (ξ_x):
4 ξ_β = Σ_j β_j and ξ_x = Σ_j x_{i,j}
5 Set the ξ_β-th bit of the cardinality learning set, ‘L_c’, to ‘1’, i.e., L_c(ξ_β) = 1
6 */ Learning for features
7 Evaluate the feature learning set: L_f = β AND (NOT x_i), i.e., the features included in the exemplar but not in the particle
Evaluate the final learning set: L = [L_c ; L_f]
Algorithm 1 Evaluation of the learning sets

3.1 Philosophy of 2D Learning

The search for the optimal feature subset essentially entails the following two decisions: 1) How many features should be included? and 2) Which features should be included? The philosophy of the Two-Dimensional (2D) learning is to explicitly embed the information about the subset size (also referred to as cardinality) into the search process to effectively address these issues. As the name suggests, the 2D learning approach extends the learning dimension of a particle swarm to integrate the cardinality information into the search process. Since the cardinality of the subset is selected through an informed decision, only the features with higher selection likelihoods are selected and the redundant features are effectively discarded.

To further understand the philosophy of 2D learning, consider a feature selection problem associated with a dataset having ‘n’ features. For this dataset, the position of the i-th particle, ‘x_i’, is represented as an n-dimensional binary string as follows:

x_i = [x_{i,1}, x_{i,2}, …, x_{i,n}], x_{i,j} ∈ {0, 1}    (2)

where a bit set to ‘1’ indicates that the corresponding feature is selected and a bit set to ‘0’ indicates otherwise, i.e., if x_{i,j} = 1 then the j-th feature, f_j, is selected.

Note that in PSO, the ‘learning’ of a particle is accumulated in its velocity. For this reason, in 2D learning, the dimension of the particle velocity is extended, and it is represented by a two-dimensional matrix. The objective here is to store the selection likelihoods of the features and of the cardinality (number of features) in distinct dimensions. For the problem considered here, the velocity of the i-th particle, ‘v_i’, is represented by a two-dimensional matrix of size (2 × n), which is given by,

v_i = [ v^c_{i,1} v^c_{i,2} … v^c_{i,n} ; v^f_{i,1} v^f_{i,2} … v^f_{i,n} ]    (3)

The first row of the velocity matrix stores the selection likelihood of each cardinality (i.e., subset size or number of features). For instance, the j-th element of the first row, v^c_{i,j}, gives the likelihood of including a total of j features in the new position of the particle. In contrast, the elements in the second row of v_i store the selection likelihoods of the corresponding features, i.e., v^f_{i,j} gives the likelihood of including the j-th feature in the new position of the particle.

Note that in order to update the selection likelihoods of cardinality and features, it is crucial to extract the beneficial information from the ‘learning exemplars’ such as the ‘personal best’ (pbest) and the ‘neighborhood best’ (nbest). This is accomplished through a ‘learning’ process in which a dedicated ‘learning set’ (L) is derived from each exemplar. The learning set is also a (2 × n) matrix, in which the first row corresponds to the cardinality learning and the second row corresponds to the feature learning. In essence, the learning process extracts the following information from the exemplar and encodes it into a two-dimensional binary learning set ‘L’: 1) the number of features in the exemplar and 2) the features that have been included in the exemplar but not in the particle.

This is achieved as follows: Let ‘β’ denote the learning exemplar of the particle, e.g., ‘pbest’, ‘nbest’ or ‘gbest’. Note that the learning exemplar ‘β’ is also an n-dimensional binary string similar to the particle position and essentially encodes a feature subset. Further, let the n-dimensional vectors ‘L_c’ and ‘L_f’ respectively denote the ‘cardinality’ and ‘feature’ learning sets. For the i-th particle, ‘x_i’, and the corresponding learning exemplar, ‘β’, the learning sets are derived following the procedure outlined in Algorithm 1. This procedure is further explained by the illustrative example in A.
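As an illustration of Algorithm 1, the following Python sketch derives the two learning-set rows from a binary exemplar and a binary particle position. It is a minimal re-implementation of the procedure described above (not the authors' MATLAB code); the function and variable names, and the example values, are chosen here for illustration only.

```python
import numpy as np

def learning_sets(x, beta):
    """Derive the cardinality and feature learning sets (Algorithm 1).

    x    : binary particle position, shape (n,)
    beta : binary learning exemplar (e.g., pbest, nbest or gbest), shape (n,)
    Returns (L_c, L_f), each a binary vector of length n.
    """
    x = np.asarray(x, dtype=int)
    beta = np.asarray(beta, dtype=int)
    n = x.size

    # Cardinality learning: mark the subset size of the exemplar.
    L_c = np.zeros(n, dtype=int)
    xi_beta = int(beta.sum())            # number of features in the exemplar
    if xi_beta > 0:
        L_c[xi_beta - 1] = 1             # set the xi_beta-th bit to 1

    # Feature learning: features present in the exemplar but not in the particle.
    L_f = (beta & (1 - x)).astype(int)
    return L_c, L_f

if __name__ == "__main__":
    # Hypothetical 6-feature example (values chosen for illustration only).
    x = [1, 0, 0, 1, 0, 0]               # particle currently selects features 1 and 4
    beta = [1, 1, 0, 0, 1, 0]            # exemplar selects features 1, 2 and 5
    L_c, L_f = learning_sets(x, beta)
    print("L_c =", L_c)                  # bit 3 set, since the exemplar cardinality is 3
    print("L_f =", L_f)                  # features 2 and 5 are 'new' w.r.t. the particle
```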

Input : Velocity matrix v_i of the i-th particle
Output : New particle position x_i
1 Set the new position to an n-dimensional null vector, i.e., x_i = [0, 0, …, 0]
2 Isolate the selection likelihoods of the cardinality and the features into the respective vectors, ‘v^c_i’ and ‘v^f_i’ */ Roulette wheel selection of the subset cardinality (ξ)
3 Evaluate the cumulative probabilities from the cardinality likelihoods, v^c_i
4 Generate a uniform random number, r ∈ [0, 1]
5 Determine the cardinality ξ of the particle as the first index whose cumulative probability exceeds r */ Selection of the features
6 Rank the features on the basis of their likelihoods ‘v^f_i’ and store the feature rankings in the vector ‘R’
7 for j = 1 to n do
8        if R_j ≤ ξ then
9               x_{i,j} = 1
10       end if
11 end for
Algorithm 2 2D learning approach to the position update of the particle
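A compact Python sketch of the position update in Algorithm 2 is given below. The roulette wheel over the cardinality row and the ranking of the feature row follow the description above; this is an illustrative re-implementation rather than the authors' code, and the example velocity matrix is invented.

```python
import numpy as np

def update_position(v, rng=None):
    """Build a new binary position from a (2 x n) velocity matrix (Algorithm 2).

    v[0, :] : selection likelihoods of the subset cardinalities 1..n
    v[1, :] : selection likelihoods of the individual features
    """
    rng = np.random.default_rng() if rng is None else rng
    v = np.asarray(v, dtype=float)
    n = v.shape[1]

    # Roulette wheel selection of the subset cardinality.
    card_likelihood = np.clip(v[0, :], 0.0, None)
    cum_prob = np.cumsum(card_likelihood) / card_likelihood.sum()
    xi = int(np.searchsorted(cum_prob, rng.random()) + 1)   # cardinality in 1..n

    # Select the xi features with the highest likelihoods.
    ranks = np.empty(n, dtype=int)
    ranks[np.argsort(-v[1, :])] = np.arange(1, n + 1)       # rank 1 = most likely
    return (ranks <= xi).astype(int)

if __name__ == "__main__":
    # Hypothetical (2 x 6) velocity matrix, values for illustration only.
    v = np.array([[0.1, 0.4, 0.3, 0.1, 0.05, 0.05],   # cardinality likelihoods
                  [0.9, 0.2, 0.7, 0.4, 0.8,  0.1]])   # feature likelihoods
    print(update_position(v, rng=np.random.default_rng(0)))
```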
Input : PQ dataset with n features and with infinite SNR (noise-free)
Output : Reduced feature subset ‘S’, S ⊆ X
1 Set the search parameters
2 Randomly initialize a swarm of ‘ps’ particles
3 Initialize the velocity (2 × n matrix) of each particle with uniformly distributed random numbers in [0, 1]
4 Evaluate the fitness of the swarm and determine pbest, nbest and gbest
5 for t = 1 to maximum iterations do
6        */ Swarm update
7        for i = 1 to ps do
8               */ Stagnation check
9               if the stagnation counter of the i-th particle exceeds the threshold then
10                     Re-initialize the velocity of the particle
11                     Set the stagnation counter to zero
12              end if
13             Extract the learning sets from each learning exemplar as per Algorithm 1
14              Update the velocity of the particle as per (4), (5) and (6)
15              Update the position of the particle following Algorithm 2
16       end for
17      Store the old fitness of the swarm
18       Evaluate the swarm fitness
19       Update the personal, neighborhood and global best positions (pbest, nbest and gbest)
20       */ Stagnation check
21       for i = 1 to ps do
22              if the fitness of the i-th particle has not improved then
23                     Increment its stagnation counter
24              end if
25       end for
26 end for
Algorithm 3 Pseudo code of the 2D-UPSO algorithm for the feature selection problem

3.2 Velocity and Position Update

In this study, the well-known PSO variant, UPSO, has been adapted for feature selection through the 2D learning approach. The core idea behind UPSO (UPSO1, ) is to combine the ‘global’ (Shi:Eberhart:1998, ) and ‘local’ (Kennedy:Mendes:2002, ) versions of PSO to achieve a balance between exploration and exploitation of the search landscape. The velocity update rule of UPSO adapted through the 2D learning approach (referred to as ‘2D-UPSO’) is given by,

v^g_i = w v_i(t) + c_1 r_1 L_pb + c_2 r_2 L_g    (4)
v^l_i = w v_i(t) + c_1 r_3 L_pb + c_2 r_4 L_nb    (5)
v_i(t+1) = u v^g_i + (1 − u) v^l_i + ψ_i L_x    (6)

where ‘u’ is the unification factor; ‘w’ is the inertia weight; the parameters c_1 and c_2 are acceleration constants; r_1, …, r_4 are uniform random numbers in [0, 1]; ‘L_g’ and ‘L_nb’ are the social learning sets derived respectively from the global best (gbest) and the neighbourhood best (nbest); ‘L_pb’ is the learning set derived from the personal best (pbest). Similarly, ‘L_x’ denotes the learning set derived from the current position of the particle, x_i. The adaptive weight used to control the influence of L_x is denoted as ‘ψ_i’ and it is given by,

(7)

where ‘F_i(t)’ is the fitness of the i-th particle and ‘F(t)’ is the vector which contains the fitness of the entire swarm at iteration ‘t’. For a given particle, the parameter ‘ψ_i’ is adaptive and its value depends on the relative performance of the particle with respect to the worst particle of the swarm. As seen in (7), the particle with the minimum fitness (i.e., the lowest classification error) will have a higher ψ_i, which leads to an increase in the selection likelihood of the corresponding features (included in the particle). Further, as seen in (7), ψ_i is positive only when the particle leads to improved fitness.

Following the velocity update, the new position of the particle is determined in two steps. In the first step, the cardinality of the new position is determined, which is followed by the selection of the features. For this purpose, the selection likelihoods stored in the velocity matrix are used. This procedure is outlined in Algorithm 2 and further explained through an illustrative example in A. The pseudo code for 2D-UPSO is outlined in Algorithm 3.
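To make the interplay of (4)-(6) concrete, the sketch below assembles one velocity update for a single particle. It assumes the learning sets have already been derived with Algorithm 1 and uses the symbols defined above (w, u, c1, c2 and the adaptive weight psi); the numeric defaults and the way psi is supplied are placeholders, not the values or the exact update used in the paper.

```python
import numpy as np

def velocity_update(v, L_pb, L_g, L_nb, L_x, psi,
                    w=0.7, u=0.5, c1=2.0, c2=2.0, rng=None):
    """One 2D-UPSO velocity update, following the structure of (4)-(6).

    v                    : current (2 x n) velocity matrix
    L_pb, L_g, L_nb, L_x : (2 x n) learning sets from pbest, gbest, nbest and
                           the particle's current position (Algorithm 1)
    psi                  : adaptive weight controlling the self-learning term
    """
    rng = np.random.default_rng() if rng is None else rng
    r1, r2, r3, r4 = rng.random(4)

    v_global = w * v + c1 * r1 * L_pb + c2 * r2 * L_g      # global component, cf. (4)
    v_local = w * v + c1 * r3 * L_pb + c2 * r4 * L_nb      # local component, cf. (5)
    return u * v_global + (1.0 - u) * v_local + psi * L_x  # unified update, cf. (6)
```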

Figure 1: Investigation framework
Figure 2: The experimental setup to generate PQ events. The setup contains a three-phase uncontrolled rectifier as a harmonic source and a capacitor bank as a transient source.

4 Investigation Framework

In this study, we focus mainly on two aspects of the feature selection in PQ events: 1) selection of an appropriate feature selection method and 2) robustness of the reduced subsets against measurement noise. For this purpose, this investigation has been carried out in two stages, as shown in Fig. 1. The objective of the first stage is to carry out the comparative analysis of different feature selection approaches. For this purpose, seven different search algorithms (discussed in Section 4.3) are applied as wrappers to two induction algorithms (discussed in Section 4.2) and fourteen distinct types of PQ events (discussed in Section 4.1). Note that the PQ events used for the feature selection purpose do not contain any noise, i.e., the Signal-to-Noise Ratio (SNR) is effectively infinite. In the second stage, the robustness of the reduced subsets obtained from the feature selection is evaluated. For this purpose, the classification performance of the reduced subsets is evaluated in the presence of seven levels of zero-mean Gaussian white noise.

The following subsections provide more details about the PQ events, Search Algorithms and Induction Algorithms being used in this study.

PQ event classes (parametric models conform to the IEEE Std. 1159): Pure Sinusoid; DC Offset; Sag; Swell; Interruption; Flicker; Notching; Harmonics; Oscillatory Transient; Sag with Harmonics; Swell with Harmonics; Flicker with Harmonics; Sag with Transient; Swell with Transient. For each event, a common event duration, fundamental frequency and sampling frequency are used.

Table 1: Power Quality Events

4.1 PQ Events & Feature Extraction

For a comprehensive study, it is necessary to include a wide variety of PQ events in the investigation. For this purpose, the IEEE Std. 1159 IEEE:1159 () is followed in this study. A total of fourteen distinct PQ events are generated through the parametric models given in Table 1 and the experimental setup shown in Fig. 2. These events cover several distinct natures of PQ events, e.g., stationary, non-stationary, low-frequency, high-frequency, single and simultaneous events.

The synthetic PQ events were generated in MATLAB® following the parametric models shown in Table 1. The real PQ events are acquired using the experimental PCC shown in Fig. 2, which consists of a harmonic source and a transient source. The harmonic source is emulated by a three-phase uncontrolled rectifier with a resistive load, which generates characteristic harmonics. The transient events are induced by switching of a capacitor bank. Further, the sag events are induced by creating a single line to ground fault. A digital oscilloscope/recorder (HIOKI 8870-20 MEMORY HiCORDER®) is used to capture the real events. For each class of PQ event, an equal number of instances is generated; a portion of these instances is generated through the parametric models shown in Table 1 and the remaining are induced using the experimental PCC shown in Fig. 2. Each event instance is generated at the fundamental frequency, spans a fixed number of cycles and is sampled at a fixed sampling frequency. The PQ events cover variations in magnitude, frequency, harmonic content and event duration. In addition, to evaluate the effects of the measurement noise, seven different levels of zero-mean Gaussian white noise have been added to these events. This gives a total of eight PQ datasets (the noise-free dataset and seven noisy ones); each dataset contains the same events but at a different level of the measurement noise.
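For readers who wish to reproduce a synthetic event, the snippet below generates a voltage sag with a generic parametric model of the kind listed in Table 1. The original MATLAB models and their exact parameter ranges are not reproduced here; the depth, timing, frequency and sampling values are illustrative assumptions.

```python
import numpy as np

def voltage_sag(depth=0.5, t_start=0.04, t_end=0.12, f=50.0, fs=3200.0, cycles=10):
    """Generate a synthetic voltage sag: the amplitude drops by 'depth'
    between t_start and t_end (a generic IEEE 1159-style parametric model)."""
    t = np.arange(0, cycles / f, 1.0 / fs)
    envelope = 1.0 - depth * ((t >= t_start) & (t <= t_end))
    return t, envelope * np.sin(2 * np.pi * f * t)

if __name__ == "__main__":
    t, v = voltage_sag()
    print(f"{len(v)} samples, minimum amplitude = {v.min():.2f}")
```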

For accurate identification of the PQ events, it is crucial to extract information in both the time and the frequency domain. This task can be accomplished through various signal processing techniques such as the Stockwell Transform (ST), the Wavelet Packet Transform (WPT) and the Discrete Wavelet Transform (DWT) Mahela:Shaik:2015 (); Khokhar:Zin:2015 (). Note that the selection of the signal processing technique is primarily dependent on the frequency bandwidth of the signal being investigated. In particular, a judicious selection is essential to accommodate the relatively large frequency bandwidth of PQ events, e.g., from DC offset to oscillatory transients IEEE:1159 (). In this scenario, among the existing signal processing techniques, DWT represents an ideal choice for PQ events as it is computationally more efficient than WPT and ST. Therefore, DWT has been selected as the signal processing technique in this study.

It is well known that the selection of the base/mother wavelet is dependent on the nature of the application and is crucial to the performance of the wavelet transform. A detailed investigation on this topic Hafiz:Swain:2019 () suggests that, for PQ events, the optimum classification performance of a given induction algorithm is obtained when it is paired with a specific base wavelet. For example, the optimum performance of the k-Nearest Neighbor (k-NN) and Naive Bayes (NB) classifiers (which are being used in this study) was obtained with a symlet base wavelet of a particular order Hafiz:Swain:2019 (). Therefore, in this study, this symlet has been selected as the base wavelet.

Further, each instance of the PQ event is decomposed using DWT, with the decomposition level selected following the rule of thumb given in B. Consequently, the detail coefficients at each level and the approximation coefficients at the final level are available. In order to extract meaningful information from the wavelet coefficients, statistical functions are used (shown in C). Following this procedure, a total of ‘n’ features (statistical functions applied to each set of detail/approximation coefficients) is obtained for each instance of a PQ event. Hence, a pattern, ‘p’, is obtained corresponding to each instance of a PQ event, as follows:

p = { F, y }, F = [ f_1, f_2, …, f_n ]    (8)

where ‘n’ denotes the number of features, ‘F’ denotes the feature set and ‘y’ contains the label corresponding to each PQ event.
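The following sketch shows this feature extraction pipeline using the PyWavelets package. The wavelet order ('sym4'), the decomposition level and the particular statistical functions are assumptions made for illustration; the paper's exact choices are given in its appendices B and C.

```python
import numpy as np
import pywt

def extract_features(signal, wavelet="sym4", level=5):
    """DWT-based feature vector: statistical descriptors of the detail
    coefficients at every level and of the final approximation."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)   # [cA_L, cD_L, ..., cD_1]
    features = []
    for c in coeffs:
        c = np.asarray(c)
        features += [
            np.mean(c),                 # mean
            np.std(c),                  # standard deviation
            np.sum(c ** 2),             # energy
            np.max(np.abs(c)),          # peak magnitude
        ]
    return np.array(features)

if __name__ == "__main__":
    t = np.arange(0, 0.2, 1 / 3200.0)                     # 10 cycles at 50 Hz, fs assumed
    sag = (1 - 0.5 * ((t > 0.04) & (t < 0.12))) * np.sin(2 * np.pi * 50 * t)
    print(extract_features(sag).shape)                    # (level + 1) * 4 features
```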

4.2 Induction Algorithm

The results of our previous investigation Hafiz:Swain:2019 () suggest that relatively simple induction algorithms are more robust against measurement noise. In particular, Decision Tree (DT) and Naive Bayes (NB) Witten:Frank:Hall:2016 () were found to be comparatively more robust for PQ events. Since DT has limited inherent feature selection capability, we have selected NB as one of the induction algorithms in this study. Further, it is well known that the search landscape of the feature selection problem is conjointly defined by the dataset and the induction algorithm Blum:Langley:1997 (); Dash:Liu:1997 (). For this reason, in addition to NB, k-Nearest Neighbor (k-NN) Witten:Frank:Hall:2016 () is also used as an induction algorithm in this study.

(a) Selection of the number of neighbors ‘k’ and the distance metric in k-NN. The maximum A is obtained with the ‘Manhattan’ distance.
(b) Selection of the kernel width in NB.
Figure 3: Grid search for hyperparameter selection. ‘A’ denotes the average ten-fold classification accuracy.

Note that the classification performance of an induction algorithm (IA) is critically dependent on the ‘hyperparameters’ which control the learning process, e.g., the number of neighbors (k) and the distance metric in k-NN, and the kernel width in NB. In this study, the hyperparameters of both IAs have been selected through a ‘grid search’ to maximize the average ten-fold classification accuracy (‘A’), as shown in Fig. 3(a) (k-NN) and Fig. 3(b) (NB).
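A scikit-learn sketch of the grid search used for the k-NN hyperparameters is given below; the candidate grids and the toy data are illustrative assumptions (the paper reports only that the ‘Manhattan’ distance was selected). The same pattern applies to the NB kernel width.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def tune_knn(X, y):
    """Grid search over k and the distance metric, scored by 10-fold accuracy."""
    grid = {
        "n_neighbors": list(range(1, 16)),               # candidate k values (assumed grid)
        "metric": ["manhattan", "euclidean", "chebyshev"],
    }
    search = GridSearchCV(
        KNeighborsClassifier(),
        grid,
        scoring="accuracy",
        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    )
    search.fit(X, y)
    return search.best_params_, search.best_score_

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 20))                       # toy stand-in for the PQ feature matrix
    y = rng.integers(0, 3, size=300)                     # toy labels
    print(tune_knn(X, y))
```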

4.3 Compared Feature Selection Algorithms

To evaluate the efficacy of the 2D learning approach, the following existing algorithms are considered: Genetic Algorithm (GA) Siedlecki:Sklansky:1989 (); Kudo:Sklansky:2000 (), Ant Colony Optimization (ACO) Yu:Gu:2009 (); Chen:Chen:2013 (), Binary PSO (BPSO) Kennedy:Eberhart:1997 (), Catfish BPSO (CBPSO) Chuang:Tsai:2011 () and Chaotic BPSO (chBPSO) Chuang:Yang:2011 (). In all these algorithms, the search agent (e.g., a chromosome in GA) encodes a feature subset through the binary string representation given in (2).

In this study, the classical variant of GA is considered, i.e., a simple GA with roulette wheel selection, single-point crossover and a flip-bit mutation operator, as outlined in Siedlecki:Sklansky:1989 (); Kudo:Sklansky:2000 (). For ACO, the variant proposed in Yu:Gu:2009 (); Chen:Chen:2013 () is considered, where the features are represented as nodes on a directed graph. Each node on this graph is linked by two distinct edges to indicate whether a node/feature is selected or not. A feature subset is represented by the path traversed by an ant over these edges. For a given edge, the probability of inclusion in the path is given by the corresponding pheromone intensity. In each iteration, the pheromone intensity is updated based on a positive feedback mechanism; in this study, the pheromone update procedure proposed for ACO in Yu:Gu:2009 () is implemented. Further, since the proposed 2D-UPSO has been developed within particle swarm theory, BPSO and its two variants (CBPSO and chBPSO) are also included in the comparative investigation. CBPSO Chuang:Tsai:2011 () retains the learning mechanism of BPSO while introducing the concept of a ‘refresh gap’, i.e., a fixed number of the worst-performing particles are reinitialized if the swarm cannot locate an improved solution for a pre-fixed number of iterations. In chBPSO Chuang:Yang:2011 (), the velocity update rule of BPSO is modified to control the inertia weight following chaotic maps; in this study, a logistic map is used to determine the value of the inertia weight in chBPSO.

In addition to the meta-heuristic algorithms, the Sequential Forward Floating Search (SFFS) Pudil:1994 () is also included to compare the search performance of the 2D learning with an established deterministic search algorithm. SFFS has been selected for this purpose as it overcomes the limitations of the other sequential search methods Pudil:1994 (), e.g., the nesting effects of SFS and SBS, and the selection of appropriate ‘l’ and ‘r’ in the Plus-l Take Away-r floating search.

The search parameters of each algorithm are set following the procedures outlined in Siedlecki:Sklansky:1989 (); Yu:Gu:2009 (); Kennedy:Eberhart:1997 (); Chuang:Tsai:2011 (); Chuang:Yang:2011 () and are shown in Table 2. All of the compared algorithms are implemented in MATLAB.

Algorithm / Search Parameters (General PSO Parameters; Other/Special Parameters): GA Siedlecki:Sklansky:1989 (); ACO Yu:Gu:2009 (); BPSO Kennedy:Eberhart:1997 (); CBPSO Chuang:Tsai:2011 () (refresh gap RG = 3); chBPSO Chuang:Yang:2011 () (variable inertia weight); 2D-UPSO. Legend: ‘ps’ - swarm size; ‘w’ - inertia weight; ‘c’ - acceleration constants; velocity and position limits; GA population size, crossover and mutation probabilities; ACO colony size, pheromone update factor, pheromone trail evaporation and pheromone boundaries; number of catfish particles; ‘u’ - unification factor.

Table 2: Search Parameter Settings

4.4 Search Setup

The efficacy of a feature subset can be evaluated by either the filter or the wrapper approach. The selection of either approach requires a trade-off between precision and computational complexity. In this study, the total number of features is moderate, and therefore it is appropriate to select the wrapper approach due to its precision Blum:Langley:1997 (). In other words, for each subset under consideration, a classifier is induced by the induction algorithm (NB or k-NN) and the resulting classification error is used as the criterion function, J.

Further, to remove any bias towards the validation data, the mean classification error after 10-fold stratified cross-validation is used as the criterion function Witten:Frank:Hall:2016 (). For a given feature subset, S, this is given by,

J(S) = (1/10) Σ_{k=1}^{10} ε_k(S)    (9)

where ε_k(S) denotes the classification error on the k-th validation fold.

Without loss of generality, the feature selection problem is approached as a minimization problem where the objective is to minimize the classification error given by (9). To account for the inherent stochastic nature of the algorithms, 40 independent runs of each algorithm are executed. Each run is set to terminate after a fixed budget of Function Evaluations (FEs).
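The wrapper criterion in (9) can be written compactly with scikit-learn, as sketched below: a binary mask selects the candidate subset and the mean 10-fold stratified cross-validation error is returned. The classifier (GaussianNB stands in for the paper's kernel NB), the toy data and the mask are placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

def criterion_J(mask, X, y, estimator=None):
    """Mean 10-fold stratified cross-validation error of the subset encoded by 'mask'."""
    if not np.any(mask):
        return 1.0                                       # empty subset: worst possible error
    estimator = GaussianNB() if estimator is None else estimator
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    acc = cross_val_score(estimator, X[:, np.asarray(mask, dtype=bool)], y,
                          scoring="accuracy", cv=cv)
    return 1.0 - acc.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(280, 30))                       # toy stand-in for the PQ feature matrix
    y = rng.integers(0, 4, size=280)
    mask = rng.random(30) < 0.3                          # a random candidate subset
    print(round(criterion_J(mask, X, y), 4))
```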

5 Results

As mentioned earlier, in this study we address the following two issues: 1) Which feature selection approach is more effective for PQ events? and 2) How robust are the reduced feature subsets against the measurement noise? For this purpose, the comparative evaluation of seven feature selection wrappers has been carried out using two induction algorithms and fourteen distinct PQ events, following the procedure outlined in ‘Stage-I’ of Fig. 1. Note that the feature selection has been carried out using only the ‘pure’ (noise-free) PQ events. The results of this investigation are discussed in Section 5.1.

The next issue is to evaluate the robustness of the reduced feature subsets which are obtained by the feature selection algorithms. For this purpose, seven different levels of zero-mean Gaussian white noise are added to the PQ events. The framework for this part of the investigation is outlined in ‘Stage-II’ of Fig. 1. The results of this test are discussed in Section 5.2.

Results GA ACO BPSO CBPSO chBPSO 2D-UPSO 0.0161 0.0147 0.00552 0.00558 0.0094 0.0044 53.1 57.1 83.9 83.7 72.5 87.0 47.6 45.7 39.8 38.7 41.1 20.9 5.0 5.3 4.3 4.9 4.4 7.0 52.0 53.9 59.8 60.9 58.5 78.9 Overall Score 0.3096 0.2719 0.0892 0.0879 0.1571 0.0399
‘Mean’ and ‘SD’ - mean and standard deviation over 40 runs; ‘PI(%)’ - improvement in the classification error relative to the original feature set; ζ_avg - average length of the feature subset over 40 runs; ‘SR(%)’ - percentage reduction in the subset size

Table 3: Search performance of the compared algorithms with k-NN (averaged over 40 runs)

Results GA ACO BPSO CBPSO chBPSO 2D-UPSO 0.0136 0.0128 0.00663 0.00652 0.0095 0.0061 31.8 35.6 66.7 67.3 52.4 69.4 49.3 49.0 43.0 43.8 46.3 37.0 3.8 4.8 4.2 3.8 4.0 4.4 50.3 50.6 56.6 55.8 53.3 62.6 Overall Score 0.2707 0.2546 0.1156 0.1156 0.1776 0.0913
‘Mean’ and ‘SD’ - mean and standard deviation of the classification error, J, over 40 runs; ‘PI(%)’ - improvement in the classification error relative to the original feature set; ζ_avg - average length of the feature subset over 40 runs; ‘SR(%)’ - percentage reduction in the subset size

Table 4: Search performance of the compared algorithms with NB (averaged over 40 runs)

5.1 Stage-I : Comparative evaluation of the feature selection approaches

For the purpose of comparative evaluation, 40 independent runs of each algorithm are recorded. Since the primary objective is to improve the classification performance through the removal of irrelevant/redundant features, the search performance of the compared algorithms is evaluated by two criteria: the classification performance and the size of the feature subset (cardinality).

The results obtained after 40 independent runs of each algorithm with the k-NN and NB classifiers are shown in Tables 3 and 4, respectively. The results obtained with the 2D learning approach (2D-UPSO) are shown in the last column of Tables 3 and 4. The best results obtained among the compared algorithms are shown in boldface.

To compare the classification performance, the average (Mean) and standard deviation (SD) of the criterion function, J, are shown in Tables 3 and 4. Similarly, to compare the reduction in subset size, the ‘Mean’ and ‘SD’ of the subset cardinality are computed over 40 runs for each algorithm, as shown in Tables 3 and 4. Further, the following two metrics are used to measure the performance improvement obtained by each of the algorithms,

PI (%) = [ (J_X − J_avg) / J_X ] × 100    (10)
SR (%) = [ (n − ζ_avg) / n ] × 100    (11)

where ‘J_X’ is the criterion function with the original feature set (X) and ‘J_avg’ is the average of the criterion function over 40 runs obtained with the reduced feature sets; ‘ζ_avg’ is the average cardinality over 40 runs and ‘n’ is the total number of features.

The Performance Improvement metric, PI, gives the improvement in the classification performance with respect to the original feature set, ‘X’. The second metric, SR, shows the percentage reduction in the cardinality with respect to the total number of features, ‘n’. A higher positive value of these metrics implies a better search performance. In addition, the following metric is used to evaluate the overall performance of the algorithms:

(12)

where ‘S_r’, ‘ζ_r’ and ‘J_r’ are the feature subset, its cardinality and the corresponding classification error at the end of the r-th run of the algorithm, and ‘n’ denotes the total number of features. Note that this metric incorporates the information about both the cardinality and the classification performance. A lower value of this metric indicates that the algorithm could consistently find feature subsets with a lower cardinality and a lower classification error.
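The two metrics in (10) and (11) are straightforward to compute from the run statistics, as sketched below; the overall score in (12) is omitted here since only its qualitative behaviour is described above. The numeric inputs in the example are placeholders, not values taken from the tables.

```python
def performance_improvement(J_all, J_avg):
    """PI(%): relative reduction of the classification error w.r.t. the full feature set, cf. (10)."""
    return 100.0 * (J_all - J_avg) / J_all

def subset_reduction(n_features, zeta_avg):
    """SR(%): relative reduction of the subset cardinality w.r.t. the full feature set, cf. (11)."""
    return 100.0 * (n_features - zeta_avg) / n_features

if __name__ == "__main__":
    # Hypothetical run statistics (not taken from the paper's tables).
    print(round(performance_improvement(J_all=0.034, J_avg=0.005), 1))   # PI in %
    print(round(subset_reduction(n_features=100, zeta_avg=21.0), 1))     # SR in %
```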

From the results shown in Tables 3 and 4, it is clear that the 2D learning approach is the most efficient, as it gives the lowest classification error amongst the compared algorithms. With both induction algorithms, 2D-UPSO achieves the highest PI (with k-NN, Table 3, and with NB, Table 4). It is interesting to note that, in comparison to GA and ACO, all BPSO variants could yield better subsets with both k-NN and NB.

Further, in the present study, the feature selection issue is approached as a single-objective problem where the primary objective is to minimize the classification error. Therefore, any reduction in the subset cardinality is the direct consequence of the ability of the search algorithm to distinguish useful features from the irrelevant/redundant ones. Intuitively, the exploitation of the cardinality information in the 2D learning is likely to improve the search performance of 2D-UPSO. The results shown in Tables 3-4 corroborate this notion: 2D-UPSO provides the smallest feature subsets with the highest reduction in the cardinality with both k-NN and NB.

The overall performance of the compared algorithms is evaluated using the ‘overall score’ metric (12), which considers both the cardinality and the classification performance of the feature subsets obtained by an algorithm over 40 runs. As revealed by (12), a lower score indicates the consistent discovery of subsets with fewer features and lower classification error. As seen in Tables 3-4, the overall score obtained by 2D-UPSO is the lowest amongst the compared algorithms, which indicates the best overall performance.

The results further show a shift in the search landscape with the change of induction algorithm. For example, the performance gain PI obtained by the compared algorithms with k-NN (Table 3) is comparatively higher than that obtained with NB (Table 4). Similar effects are observed in SR as well, whose variation differs between the k-NN and NB classifiers. These results further underline the need for the wrapper based feature selection approach.

5.1.1 Nonparametric Statistical Evaluation

Due to the stochastic nature of the compared algorithms, further statistical analysis is carried out to determine the significance of the results shown in Table 3 and Table 4. In particular, the objective of this analysis is to determine whether the results (i.e., the classification error J and the cardinality ζ) obtained by 2D-UPSO are significantly better than those of the compared algorithms. For this purpose, multiple non-parametric statistical comparisons are carried out following the guidelines in Derrac et al. Derrac:Salvador:2011 (). The test is carried out in the following two steps:

Results / Average Rank (GA ACO BPSO CBPSO chBPSO 2D-UPSO) / Friedman Statistic / p-value
k-NN 5.9 5.2 2.3 2.4 4.0 1.3 182.66 1.23E-10
     5.2 4.9 3.3 2.9 3.7 1.0 130.25 7.95E-11
NB   5.7 5.3 2.3 2.5 4.0 1.2 181.58 8.16E-11
     4.9 4.8 2.9 3.2 3.9 1.3 101.06 9.26E-11
the best average ranking is shown in bold-face

Table 5: Outcome of the Friedman Test

Algorithm / Test statistic, value, APV / Test statistic, value, APV
GA 10.88 1.49E-27 0.0100 10.07 7.51E-24 0.0100
ACO 9.20 3.47E-20 0.0125 9.17 4.58E-20 0.0125
BPSO 2.42 1.55E-02 0.0500 5.32 1.04E-07 0.0250
CBPSO 2.60 9.33E-03 0.0250 4.60 4.19E-06 0.0500
chBPSO 6.45 1.09E-10 0.0167 6.33 2.38E-10 0.0167
H_0 denotes the null hypothesis

Table 6: Outcome of the Hommel’s Post-hoc Procedure for Confidence Interval (k-NN)

First, the Friedman two-way analysis of variance by ranks Sheskin:2003 (); Derrac:Salvador:2011 () is applied to determine whether the performance of two or more of the compared algorithms is significantly different. For this purpose, the results obtained by the algorithms are ranked from ‘1’ (best) to ‘6’ (worst). Subsequently, the average rank is determined over 40 independent runs. The test statistic and the corresponding p-value are determined following the procedures outlined in Sheskin:2003 (); Derrac:Salvador:2011 () and are shown in Table 5. The p-values obtained through the Friedman statistic strongly suggest a significant difference in the performance of the compared algorithms. Further, the average rankings obtained over 40 independent runs establish that 2D-UPSO is the best amongst the compared algorithms.

In the second step, a set of hypotheses is evaluated for the multiple comparisons of 2D-UPSO with the other algorithms. Specifically, the evaluation of five interconnected null hypotheses is required to compare 2D-UPSO with the five other algorithms, i.e., GA, ACO, BPSO, CBPSO and chBPSO. Each null hypothesis (H_0) states that there is no significant difference between 2D-UPSO and the algorithm being compared. The test statistic, the p-value and the Adjusted p-values (APV) which are required to evaluate these interconnected hypotheses are determined following the procedure outlined in Derrac:Salvador:2011 (); Garcia:Salvador:2010 (). Hommel's post-hoc procedure is employed to derive the APV from the p-value. The outcomes of the multiple comparisons are shown in Table 6 (for k-NN) and Table 7 (for NB). These results convincingly demonstrate that, among the compared algorithms, 2D-UPSO could obtain feature subsets with a significantly lower cardinality (ζ) and classification error (J).
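The two-step procedure can be reproduced with standard Python statistics tooling, as sketched below: scipy provides the Friedman statistic and statsmodels provides Hommel-adjusted p-values for the pairwise comparisons against 2D-UPSO. The per-run error matrix is random placeholder data, and the pairwise statistic here is a Wilcoxon signed-rank test rather than the rank-based statistic used in the paper.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

# Rows: 40 independent runs; columns: classification error of each algorithm (placeholder data).
algorithms = ["GA", "ACO", "BPSO", "CBPSO", "chBPSO", "2D-UPSO"]
rng = np.random.default_rng(0)
errors = rng.uniform(0.004, 0.02, size=(40, len(algorithms)))

# Step 1: Friedman two-way analysis of variance by ranks.
stat, p = friedmanchisquare(*[errors[:, j] for j in range(errors.shape[1])])
print(f"Friedman statistic = {stat:.2f}, p-value = {p:.3g}")

# Step 2: pairwise comparisons of 2D-UPSO against the others, Hommel-adjusted.
control = errors[:, algorithms.index("2D-UPSO")]
raw_p = [wilcoxon(control, errors[:, j]).pvalue
         for j, name in enumerate(algorithms) if name != "2D-UPSO"]
reject, apv, _, _ = multipletests(raw_p, alpha=0.05, method="hommel")
for name, p_adj, rej in zip([a for a in algorithms if a != "2D-UPSO"], apv, reject):
    print(f"2D-UPSO vs {name}: APV = {p_adj:.3g}, reject H0 = {rej}")
```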

Algorithm / Test statistic, value, APV / Test statistic, value, APV
GA 10.79 3.97E-27 0.0100 8.46 2.76E-17 0.0100
ACO 9.71 2.70E-22 0.0125 8.25 1.62E-16 0.0125
BPSO 2.69 7.16E-03 0.0500 3.71 2.11E-04 0.0500
CBPSO 3.08 2.09E-03 0.0250 4.48 7.39E-06 0.0250
chBPSO 6.54 5.99E-11 0.0167 6.13 9.04E-10 0.0167
H_0 denotes the null hypothesis

Table 7: Outcome of the Hommel’s Post-hoc Procedure for Confidence Interval (NB)

IA / ζ* / J with SFFS Pudil:1994 () / J with 2D-UPSO
k-NN 10 0.0056 0.0029
NB 30 0.0070 0.0056

Table 8: Comparative analysis with SFFS

5.1.2 Comparative analysis with SFFS

The results of the comparative analysis with SFFS are shown in Table 8. Note that SFFS requires a priori specification of the subset cardinality. Since the cardinality of the optimum feature subset is not known, the cardinality of the best subset found by 2D-UPSO (out of 40 runs) is determined and denoted as ‘ζ*’. For example, the cardinality of the best subset found by 2D-UPSO with k-NN is 10 and with NB is 30; hence, ζ* = 10 (with k-NN) and ζ* = 30 (with NB). Subsequently, SFFS is applied as a wrapper to both k-NN and NB to find the feature subset with the cardinality equal to ‘ζ*’. Given the same cardinality, the objective here is to determine whether SFFS could identify a feature subset with comparable or better accuracy than 2D-UPSO. As expected, the outcome of these tests (Table 8) indicates that SFFS could not yield a better feature subset with either k-NN or NB.
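For reference, a floating forward search with a fixed target cardinality can be run with the mlxtend package as sketched below; the estimator settings, toy data and the target cardinality ζ* are placeholders, and the paper's own SFFS implementation is in MATLAB.

```python
import numpy as np
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier

def run_sffs(X, y, target_cardinality=10):
    """Sequential Forward Floating Search to a fixed subset size (wrapper on k-NN)."""
    sffs = SFS(KNeighborsClassifier(n_neighbors=1, metric="manhattan"),
               k_features=target_cardinality,
               forward=True,
               floating=True,          # the floating step avoids the nesting effect
               scoring="accuracy",
               cv=10)
    sffs = sffs.fit(X, y)
    return sffs.k_feature_idx_, 1.0 - sffs.k_score_    # selected features, CV error

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 30))                      # toy stand-in for the PQ features
    y = rng.integers(0, 3, size=200)
    idx, err = run_sffs(X, y, target_cardinality=5)
    print(sorted(idx), round(err, 3))
```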

Data Set (SNR) All Features Results GA ACO BPSO CBPSO chBPSO 2D-UPSO 98.39 98.65 99.47 99.50 99.21 99.68 1.82 2.08 2.90 2.93 2.64 3.11 96.78 97.51 97.83 97.51 97.54 98.86 1.96 2.69 3.02 2.69 2.72 4.04 95.98 97.30 97.48 97.24 96.95 98.77 1.38 2.70 2.87 2.64 2.35 4.16 95.40 96.31 96.86 96.45 95.95 98.48 1.85 2.76 3.31 2.91 2.41 4.93 94.40 95.48 96.13 95.51 95.04 97.86 1.84 2.93 3.57 2.96 2.49 5.30 93.70 94.34 94.58 94.28 93.79 96.34 1.85 2.49 2.72 2.43 1.93 4.48 91.23 90.30 91.38 91.59 91.29 93.84 1.67 0.73 1.82 2.02 1.73 4.28 86.04 86.13 86.66 86.45 86.45 88.48 0.59 0.67 1.20 0.99 1.00 3.02 ; the highest performance gain is shown in bold-face

Table 9: Classification accuracy obtained with the reduced subsets (with k-NN)

Data Set (SNR) All Features Results GA ACO BPSO CBPSO chBPSO 2D-UPSO 98.01 98.77 98.74 99.38 99.36 99.09 99.44 0.76 0.73 1.38 1.35 1.08 1.43 97.30 97.63 97.54 97.92 97.95 97.60 98.04 0.32 0.23 0.62 0.64 0.29 0.73 97.04 97.36 97.16 97.45 97.39 97.39 97.45 0.32 0.12 0.41 0.35 0.35 0.41 96.34 96.45 96.75 96.66 96.51 96.48 96.95 0.12 0.41 0.32 0.18 0.15 0.62 95.66 95.54 95.72 95.84 95.34 95.60 96.36 -0.12 0.06 0.18 -0.32 -0.06 0.70 94.28 94.78 94.05 94.58 93.72 93.99 95.10 0.50 -0.24 0.29 -0.56 -0.30 0.82 92.67 92.58 92.29 92.52 92.11 92.35 93.87 -0.09 -0.38 -0.15 -0.56 -0.32 1.20 89.62 89.94 89.68 90.15 89.77 90.00 91.41 0.32 0.06 0.53 0.14 0.38 1.79 ; the highest performance gain is shown in bold-face

Table 10: Classification accuracy obtained with the reduced subsets (with NB)

5.2 Stage-II : Robustness of the Reduced Subsets

To evaluate the robustness of the reduced subsets against measurement noise, seven different levels of zero-mean Gaussian white noise have been added to the PQ events. Given that the noise introduced by the measurement chain cannot be estimated a priori, the objective here is to compare the classification performance of the reduced subsets and the original feature set (X) at various noise levels.
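A common way to add zero-mean Gaussian white noise at a prescribed SNR is sketched below; it scales the noise power to the signal power, which is the usual convention, though the paper does not spell out its own noise-generation routine.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Add zero-mean Gaussian white noise so that the resulting SNR equals snr_db."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

if __name__ == "__main__":
    t = np.arange(0, 0.2, 1 / 3200.0)
    clean = np.sin(2 * np.pi * 50 * t)
    noisy = add_awgn(clean, snr_db=30, rng=np.random.default_rng(0))
    est_snr = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
    print(f"empirical SNR is about {est_snr:.1f} dB")
```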

Note that the feature selection has been carried out using only the pure (noise-free) PQ events. Further, each run of the compared algorithms provides a different feature subset due to their inherent stochastic nature. To ensure a fair comparison, for each algorithm the feature subset with the minimum classification error, J, out of the 40 runs is selected.

For each feature subset under consideration, the average classification accuracy after 10-fold stratified cross-validation, ‘A’, is recorded at each noise level. This is given by,

A(S) = ( 1 − J(S) ) × 100    (13)

where ‘S’ and ‘A(S)’ respectively denote the reduced subset under consideration and the corresponding classification accuracy in percent.

In order to evaluate the ‘robustness’ of a given feature subset, S, its classification accuracy, A(S), is compared with that of the original full feature set, A(X). As the feature selection was carried out using the ‘pure’ dataset (i.e., noise-free), the performance of all reduced subsets is better than A(X) at this noise level. The objective here is to investigate whether the reduced subsets could maintain the improved performance in the presence of the other levels of noise. Essentially, at each noise level, we are interested in finding out whether A(S) ≥ A(X), which is quantified by,

Δ = A(S) − A(X)    (14)

The classification accuracy, A, and the performance difference, Δ, of the reduced subsets are shown in Table 9 (with k-NN) and Table 10 (with NB). The metric in (14) indicates the degree of improvement over A(X), i.e., a higher positive value of Δ is desirable. It is clear that the reduced subset obtained by 2D-UPSO achieves the highest Δ with both k-NN (Table 9) and NB (Table 10). By integrating the cardinality information into the search process, 2D-UPSO could find robust and effective feature subsets. These results convincingly demonstrate that, with a proper feature selection approach, it is possible to obtain a robust feature subset that yields enhanced performance even in the presence of various levels of measurement noise.

Further, the results in Tables 9 and 10 clearly show the influence of the induction algorithm on the search landscape. For example, with k-NN, all algorithms could identify robust feature subsets, i.e., a positive Δ at all noise levels, as seen in Table 9. In contrast, with NB, only 2D-UPSO could yield a robust feature subset (Table 10). These results further underline the importance of the feature selection approach.

        GA      ACO     BPSO    CBPSO   chBPSO  2D-UPSO
GA      0       -0.6658 -1.078  -0.8050 -0.5517 -2.759
ACO     0.6658  0       -0.4125 -0.1392 0.1142  -2.093
BPSO    1.078   0.4125  0       0.2733  0.5267  -1.681
CBPSO   0.8050  0.1392  -0.2733 0       0.2533  -1.954
chBPSO  0.5517  -0.1142 -0.5267 -0.2533 0       -2.208
2D-UPSO 2.759   2.093   1.681   1.954   2.208   0

Table 11: Contrast Estimation (k-NN)

        GA      ACO     BPSO    CBPSO   chBPSO  2D-UPSO
GA      0       0.0925  -0.2042 0.1025  0.0017  -0.6025
ACO     -0.0925 0       -0.2967 0.01    -0.0908 -0.6950
BPSO    0.2042  0.2967  0       0.3067  0.2058  -0.3983
CBPSO   -0.1025 -0.01   -0.3067 0       -0.1008 -0.7050
chBPSO  -0.0017 0.0908  -0.2058 0.1008  0       -0.6042
2D-UPSO 0.6025  0.6950  0.3983  0.7050  0.6042  0

Table 12: Contrast Estimation (NB)

Finally, the statistical significance of the results shown in Tables 9 and 10 is determined. For this purpose, it is not feasible to apply the multiple non-parametric statistical comparisons (as in Section 5.1.1), since the number of datasets is relatively small Derrac:Salvador:2011 (). Therefore, Contrast Estimation based on medians Garcia:Salvador:2010 (); Derrac:Salvador:2011 () is applied to compare the algorithms. This test essentially estimates a quantitative performance difference over multiple datasets for all possible pairs of algorithms.

The outcomes of this test are shown in Table 11 (with k-NN) and Table 12 (with NB). Note that a higher positive value of the estimator is desirable in this test. For instance, with k-NN, the contrast estimator for ACO is positive with respect to GA (0.6658, Table 11) and chBPSO (0.1142, Table 11), which indicates that ACO could yield better subsets in comparison to GA and chBPSO. For each of the algorithms, the positive outcomes are shown in boldface in Tables 11 and 12. As seen in Tables 11 and 12, the estimator for 2D-UPSO is positive for every pairwise comparison, which further highlights the enhanced search performance of 2D-UPSO. Furthermore, the shift in the search landscape can indirectly be illustrated by the search behavior of GA: with k-NN, the estimator values for GA are negative with respect to all the algorithms (Table 11), whereas with NB, positive estimators for GA are obtained with respect to three algorithms (ACO, CBPSO and chBPSO, Table 12).
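A compact implementation of the contrast estimation based on medians, following the description in García et al. and Derrac et al., is sketched below; the accuracy matrix is placeholder data and the exact tabulation conventions of Tables 11-12 may differ.

```python
import numpy as np

def contrast_estimation(results):
    """Contrast estimation based on medians (Garcia et al.).

    results : (n_datasets x n_algorithms) performance matrix (higher is better).
    Returns an (n_algorithms x n_algorithms) matrix of estimated differences,
    entry [i, j] > 0 meaning algorithm i performs better than algorithm j.
    """
    results = np.asarray(results, dtype=float)
    k = results.shape[1]
    # Median of the pairwise performance differences over all datasets.
    Z = np.array([[np.median(results[:, i] - results[:, j]) for j in range(k)]
                  for i in range(k)])
    m = Z.mean(axis=1)                       # unadjusted estimate for each algorithm
    return m[:, None] - m[None, :]           # contrast of algorithm i over algorithm j

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acc = rng.uniform(90, 100, size=(8, 6))  # 8 noise levels x 6 algorithms (placeholder)
    print(np.round(contrast_estimation(acc), 3))
```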

6 Discussion

The following observations are inferred from the results of the comparative evaluation (Section 5.1) and the robustness evaluation (Section 5.2):

  • The feature selection wrappers are often criticized for their computational complexity. However, as the results of this investigation suggest, wrappers can yield a significant reduction in the feature subset size while improving the classification performance, as seen with both k-NN (Table 3) and NB (Table 4). Since the feature selection is performed only once, the wrapper approach is highly recommended for PQ events.

  • With an effective search strategy, it is possible to identify a feature subset which is robust against measurement noise. In this study, the worst-case scenario from the perspective of PQ events has been simulated, i.e., the feature selection is carried out using only ‘pure’ (noise-free) PQ events and subsequently the reduced subsets are evaluated under various levels of measurement noise. Nevertheless, under this scenario, 2D-UPSO could identify robust feature subsets with both k-NN and NB.

  • The results give empirical evidence for the hypothesis that the nature of the induction algorithm does affect the feature selection landscape. For example, all of the compared algorithms could yield robust feature subsets with k-NN; however, only 2D-UPSO could identify a robust feature subset with NB. This further underlines the need for a wrapper based feature selection approach.

7 Conclusions

In this study, the issue of feature selection has been comprehensively investigated in the context of PQ event identification. In particular, the search performance of the two-dimensional learning approach (2D-UPSO) and six other feature selection wrappers has been compared considering fourteen distinct classes of PQ events. Further, the robustness of the reduced feature subsets has been defined and evaluated under seven different levels of measurement noise. The results of the comparative evaluation convincingly demonstrate that 2D-UPSO can identify significantly better and more robust feature subsets for PQ events. The key distinctive property of 2D learning is the integration of the information about the subset size into the learning framework. This has been shown to lead to a significant improvement in the search performance in comparison to the other well-known algorithms, e.g., GA, ACO and BPSO.

Without loss of generality, this investigation is based on the assumption that the induction algorithm for PQ event identification is pre-fixed. If this assumption does not hold, or a generalized reduced feature subset is desired, then the filter based feature selection approaches are more appropriate. Hence, a detailed comparative investigation of different filter approaches, such as Mutual Information, Minimum Redundancy Maximum Relevance and Correlation based feature selection, may prove to be very useful to both the practicing engineers and the PQ researchers. This could be the subject of further research.

Acknowledgement

Faizal Hafiz is thankful to Education New Zealand for supporting this research through the New Zealand International Doctoral Research Scholarship (NZIDRS).

Appendix A Illustrative Example of 2D Learning

Consider a dataset having a small number of features, n. For this dataset, let the position of the i-th particle, ‘x_i’, and the learning exemplar ‘β’ at a particular search iteration ‘t’ be given by,

(15)

Evaluation of the Learning Sets

The learning sets derived from x_i and β, as per Algorithm 1, are as follows (a short code sketch of the same steps is given after this list):

  1. Set the cardinality learning set to a null vector.

  2. Determine the cardinality of β and of x_i.

  3. Set the ξ_β-th bit of the cardinality learning set to ‘1’.

  4. Evaluate the feature learning set, i.e., the features included in β but not in x_i.

  5. Evaluate the final learning set.
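The following self-contained Python sketch walks through the same two procedures with hypothetical values (a 6-feature problem invented purely for illustration); it does not reproduce the particular numbers used in (15) and (16).

```python
import numpy as np

# Hypothetical 6-feature example (the numeric values are illustrative only).
x = np.array([1, 0, 0, 1, 0, 0])         # current particle position
beta = np.array([1, 1, 0, 0, 1, 0])      # learning exemplar (e.g., gbest)
n = x.size

# Algorithm 1: learning sets derived from the exemplar.
L_c = np.zeros(n, dtype=int)
L_c[int(beta.sum()) - 1] = 1             # the cardinality of this exemplar is 3
L_f = beta & (1 - x)                     # features in the exemplar but not in the particle
print("L_c =", L_c, " L_f =", L_f)

# Algorithm 2: position update from an assumed (2 x n) velocity matrix.
v = np.array([[0.1, 0.4, 0.3, 0.1, 0.05, 0.05],   # cardinality likelihoods
              [0.9, 0.2, 0.7, 0.4, 0.8,  0.1]])   # feature likelihoods
rng = np.random.default_rng(1)
cum = np.cumsum(v[0]) / v[0].sum()
xi = int(np.searchsorted(cum, rng.random()) + 1)  # roulette wheel cardinality
ranks = np.empty(n, dtype=int)
ranks[np.argsort(-v[1])] = np.arange(1, n + 1)    # rank 1 = most likely feature
x_new = (ranks <= xi).astype(int)
print("xi =", xi, " new position =", x_new)
```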

Position Update

To understand the position update procedure, assume that the velocity of the particle is given by,

(16)

The new position of the particle is determined as per the procedure outlined in Algorithm 2:

  1. Set the new position to a null-vector:

  2. Isolate the selection likelihoods of the cardinality and the features into the respective vectors.