Unsupervised Feature Selection Based on Space Filling Concept
This paper adapts a new measure to unsupervised feature selection problems. The proposed measure is based on the space filling concept and is called the coverage measure. It was originally used to judge the quality of an experimental space filling design. In the present work, the coverage measure is adapted to select the smallest informative subset of variables by reducing redundancy in data. The paper proposes a simple analogy to apply this measure, which is implemented in a filter algorithm for unsupervised feature selection problems.
The proposed filter algorithm is robust with high dimensional data and can be implemented without extra parameters. Further, it is tested on simulated data and real world case studies including environmental data and a hyperspectral image. Finally, the results are evaluated by using the random forest algorithm.
Keywords: Unsupervised feature selection, Coverage measure, Space filling, Random forest, Machine learning
1 Introduction
In recent years, the techniques for collecting environmental data (such as wind speed, permafrost, rainfall, and pollution) have improved. Moreover, environmental phenomena are mostly non-linear and multivariate, and in many cases they are studied in high dimensional feature spaces KPT2009. Usually, the input space is constructed by considering available information and expert knowledge. An empirically designed input feature space can rapidly reach a high dimension. In addition, the input data almost always contain redundant features. As a result, the data points are not uniformly distributed in the experimental domain in which the data are embedded. In other words, the data space is not well filled or covered in the presence of redundancy. Consequently, modelling these data with all features can take a long time. Such problems are known as the curse of dimensionality.
To overcome this issue, feature selection (FS) algorithms play an important role in data driven modelling. Therefore, numerous methods and measures for FS have been proposed intro1; intro2. The main purpose is to retain only features that bring new and relevant information by reducing the existing redundancy in data. This procedure helps to manage the curse of dimensionality. In fact, it improves the accuracy of modelling, speeds up the learning process, and offers a good interpretation of the results.
The machine learning literature distinguishes two well-known families of FS techniques, according to the availability of the output variable: supervised and unsupervised feature selection jjg1; jjg2; jango1; jango2. These techniques try to find the smallest informative subset of features with respect to a defined measure or criterion.
Other methods are available, such as feature ranking rank1; rank2, which orders the features according to their importance. These methods are usually followed by a learning process that determines how many features to select.
Several measures and criteria are used for selecting the smallest subset of features: measures based on entropy entro1; entro2; entro3, fractal dimension frac1, intrinsic dimension (jango1; jango2; intd1), and also on distance dist1.
In unsupervised methods, the goal is mainly to carry out an exploratory analysis and to improve the discovery of hidden patterns. Therefore, the techniques of unsupervised feature selection (UFS) unsu1; unsu2; unsu3 do not require prior information (an output variable). They try to minimise the existing redundancy, which reduces the dimensionality of the data. Further, UFS techniques improve the understanding, the visualisation, and the interpretation of the results. In short, dimensionality reduction consists in choosing a subset of features that contain new and relevant information about the data.
This paper adapts a new measure based on the space filling concept, which is called the coverage measure. It has mainly been used in experimental designs cov1; cov2. Moreover, the measure was also used for the construction of spatial coverage designs in cov3, which proposes its implementation in Splus. Another implementation for spatial coverage is available in the R library spcosa proposed in cov4. The DiceDesign R library proposes an implementation of this measure in the context of space filling designs cov5.
The coverage measure is adapted here to UFS problems. It can be implemented in any search technique, such as exhaustive search exau1, sequential forward selection (SFS) sfs1, and sequential backward selection (SBS) sbf1. In this work, it is used with an SFS technique.
The proposed measure computes how well the space is covered by the data points. More precisely, it quantifies the uniformity of points in a hypercube by comparing the distribution of points to a regular grid cov1. A small coverage value means that the hypercube is well filled. Intuitively, the coverage measure is zero, or close to zero, if the data points are distributed as a regular grid, or nearly so, in the data space.
The analogy is simple: the selected features have to fill uniformly the space in which the data are embedded. The distribution of points expresses the amount of information they carry. Therefore, the smallest coverage value means that the variables cover well the space in which they are embedded. Moreover, the selected variables should contain new and relevant information about the data.
A filter algorithm is used to implement the coverage measure. It is applied to a simulated dataset and to several well-known benchmark datasets used for feature selection. In addition, real environmental data are used as well.
Further, the algorithm is tested with different scenarios of noise injection and data shuffling. Then, the results are verified and evaluated with the random forest algorithm rf1; rf2 by using a consistent methodology.
The remainder of this paper is organized as follows. Section 2 explains the coverage measure and its use in experimental designs. Section 3 presents the implementation of this measure for the UFS problems and introduces the corresponding filter algorithm. In section 4, the measure is evaluated by several datasets. In the last section, the conclusion is given with future developments.
2 Definition and basic notions
Design and modelling of experiments have always been a fundamental approach over the years. The experimenter has to propose and choose the suitable factor space (i.e. experimental domain) for the experiment under study. The most important early step to check is the coverage or the uniformity of the proposed design. There are many ways to select the best design regarding several conditions and criteria fangbo.
Numerous space filling designs have been proposed under some prior properties. They can be constructed by using algebraic methods, based either on incomplete block resolvable designs fangbo or on association schemes imane. Construction algorithms were also considered in (pack1; pack2; pack3). Other high quality designs, based on the space filling concept, were proposed in sf1; sf2; sf3. Furthermore, different measures for choosing the best design have been given in meas1; meas2; meas3; damblin.
In the literature of sampling methods, one strategy is to generate randomly different designs. Then, a comparison is carried out using a defined measure to find the best design. Another approach can be on the extension of an existing design. The objective is to add more points in the sampling design by taking into account the prior defined measure.
Another strategy for choosing the best design is to adopt an optimality criterion, such as:
The entropy criterion shannon; con_ent, which has been widely used in coding theory and statistics. The Shannon entropy measures the amount of information contained in the distribution of a set of points. In con_ent it is described as the classical idea of the information amount in an experiment. Moreover, it is proposed with a linear model (a simple Kriging model), and presents the corresponding maximum entropy designs.
The integrated mean squared error (imse), which is computationally more demanding and needs a powerful optimisation algorithm due to the large combinatorial design space. This criterion can be replaced by the maximum mean squared error involving a multidimensional optimisation mmse.
Minimax or maximin distance criteria, proposed in minmax, which measure how well the experimental points are distributed through the experimental domain. A minimax distance design minimises the maximum distance between points, whereas a maximin distance design maximises the minimum inter-site distance. Well-known maximin designs are the Plackett-Burman designs where the number of points where is a positive integer and represents the number of factors.
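The two distance criteria can be illustrated with a short Python sketch. It is not taken from the cited work: the function names are hypothetical, the maximin value is the smallest pairwise distance of the design, and the minimax value is read here as the largest distance from a location of the unit hypercube to its nearest design point, approximated on a regular candidate grid (the grid size is an arbitrary assumption).

```python
import itertools
import math


def maximin_criterion(points):
    """Smallest distance between any two design points (larger is better)."""
    return min(math.dist(p, q) for p, q in itertools.combinations(points, 2))


def minimax_criterion(points, grid_size=21):
    """Largest distance from a location of [0, 1]^d to its nearest design
    point (smaller is better), approximated on a regular candidate grid."""
    dim = len(points[0])
    axis = [i / (grid_size - 1) for i in range(grid_size)]
    return max(
        min(math.dist(candidate, p) for p in points)
        for candidate in itertools.product(axis, repeat=dim)
    )


# Two illustrative 2D designs with four points each.
corners = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
clustered = [(0.1, 0.1), (0.1, 0.2), (0.2, 0.1), (0.2, 0.2)]

for name, design in (("corners", corners), ("clustered", clustered)):
    print(name, maximin_criterion(design), minimax_criterion(design))
```

In this toy comparison the design spread over the corners is preferred by both criteria: its points are farther apart, and no location of the square is as far from a design point as in the clustered design.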
Besides, several uniformity measures have been proposed in fangbo. The best known is the discrepancy. Numerous kinds of discrepancies have been defined, such as the star discrepancy, the centred $L_2$-discrepancy, and the wrap-around $L_2$-discrepancy. These uniformity criteria are based on the Kolmogorov-Smirnov test: they compare the design to a uniform distribution.
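As an illustration of one such criterion, the sketch below evaluates the centred discrepancy of a design in the unit hypercube with the usual closed-form expression for its squared value. The function name is hypothetical and the code is only meant to show how a design is scored against the uniform distribution; it is not the implementation used in the cited references.

```python
import math


def centred_l2_discrepancy(points):
    """Centred L2-discrepancy of points in [0, 1]^d (smaller is more uniform)."""
    n = len(points)
    dim = len(points[0])
    term1 = (13.0 / 12.0) ** dim

    # Single sum over the points.
    term2 = 0.0
    for x in points:
        prod = 1.0
        for xk in x:
            a = abs(xk - 0.5)
            prod *= 1.0 + 0.5 * a - 0.5 * a * a
        term2 += prod

    # Double sum over all ordered pairs of points.
    term3 = 0.0
    for x in points:
        for y in points:
            prod = 1.0
            for xk, yk in zip(x, y):
                prod *= (1.0 + 0.5 * abs(xk - 0.5) + 0.5 * abs(yk - 0.5)
                         - 0.5 * abs(xk - yk))
            term3 += prod

    squared = term1 - 2.0 / n * term2 + term3 / (n * n)
    return math.sqrt(max(squared, 0.0))  # guard against rounding below zero


# Evenly spread points versus points packed near the origin (1D example).
spread = [(0.125,), (0.375,), (0.625,), (0.875,)]
packed = [(0.01,), (0.02,), (0.03,), (0.04,)]
print(centred_l2_discrepancy(spread), centred_l2_discrepancy(packed))
```

The evenly spread design obtains the smaller value, which is the behaviour expected from a uniformity criterion.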
In addition to the discrepancy, the coverage measure was also proposed to quantify the uniformity. In contrast to the discrepancy, the coverage measure compares the proposed design to a regular grid. Furthermore, the coverage measure is more stable than the discrepancy in a high dimensional design. Therefore, it can be applied to high dimensional data.
2.1 Coverage measure
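Let $X_n = \{x^1, \dots, x^n\} \subset [0, 1]^d$ be a sequence of $n$ points in dimension $d$. The coverage measure is defined as follows:
$$\lambda = \frac{1}{\bar{\vartheta}} \left( \frac{1}{n} \sum_{i=1}^{n} \left( \vartheta_i - \bar{\vartheta} \right)^2 \right)^{1/2}, \qquad (1)$$
where $\vartheta_i = \min_{k \neq i} \mathrm{dist}(x^i, x^k)$ is the minimal distance between $x^i$ and the other points of the sequence, $\bar{\vartheta} = \frac{1}{n} \sum_{i=1}^{n} \vartheta_i$ is the mean of the $\vartheta_i$, and $\mathrm{dist}$ is the Euclidean distance.
If the data points are distributed as a regular grid, then $\vartheta_1 = \dots = \vartheta_n = \bar{\vartheta}$. Hence, $\lambda = 0$.
The quality of the coverage is assessed through the minimum Euclidean distance between each point and the others, taking into account the dispersion of these distances. In fact, the coverage measure is the coefficient of variation of the $\vartheta_i$, also known as the relative standard deviation (the ratio of the standard deviation to the mean of the $\vartheta_i$).
The smaller the value of $\lambda$, the smaller the dispersion of the nearest-neighbour distances $\vartheta_i$ relative to their mean. In this case, the design is close to a regular grid. The best design should have the smallest coverage value $\lambda$.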
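The definition translates directly into code. The following Python sketch is an illustrative implementation, not the authors' R code: the function name `coverage` is an assumption, the standard deviation is the population one (division by the number of points), and the input is assumed to be already scaled to the unit hypercube.

```python
import math
import random


def coverage(points):
    """Coefficient of variation of the nearest-neighbour distances."""
    n = len(points)
    # Distance from each point to its closest other point.
    nearest = [
        min(math.dist(points[i], points[k]) for k in range(n) if k != i)
        for i in range(n)
    ]
    mean = sum(nearest) / n
    variance = sum((v - mean) ** 2 for v in nearest) / n
    return math.sqrt(variance) / mean


# A regular 10 x 10 grid and 100 uniformly drawn points in the unit square.
grid = [(i / 9, j / 9) for i in range(10) for j in range(10)]
rng = random.Random(0)
uniform = [(rng.random(), rng.random()) for _ in range(100)]

print(coverage(grid))     # numerically zero: all nearest distances coincide
print(coverage(uniform))  # clearly larger: the distances are dispersed
```

The brute-force search for the nearest neighbour costs a number of distance evaluations that grows with the square of the number of points, which is acceptable for a sketch but would call for a spatial index on large datasets.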
Figure 1: Different sequences of points with coverage values 0.2908866, 0.5243953, and 0.6546629 for the Halton, Sobol, and uniform sequences respectively.
Figure 1 shows the capability of the coverage measure to quantify the filling of space. Therefore, the use of such a measure helps to find the best experimental design regarding the distribution of points.
This observation is the starting point for adapting the measure to unsupervised feature selection problems. Furthermore, the simplicity of this measure makes it easy to implement in a filter algorithm for selecting variables.
It is important to note that the results given by this measure are acceptable regarding the selection of the informative feature subset. In addition, it can make use of parallel CPU computing and GPU computing to speed up the search procedure.
3 Unsupervised feature selection based on coverage
Numerous techniques exist for implementing redundancy reduction measures. SFS and SBS are the two most commonly used techniques for this purpose. They give acceptable results compared with the exhaustive search, in a much shorter time. The proposed measure can be implemented in any search technique. In the remainder of this paper, the search technique used is SFS. The implementation of the proposed measure is described in the following proposition.
For all subsets of features, the coverage measure is computed (as defined in equation 1). The best subset has the smallest value, regardless of the search technique used.
Since the present work uses SFS (see algorithm 1), the features are added step by step according to the obtained coverage value.
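A forward search driven by the coverage value can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a transcription of the authors' algorithm: the function names are hypothetical, each feature is rescaled to [0, 1] with a min-max transform, features are assumed non-constant, and the retained subset is taken as the prefix of the selection order with the smallest coverage value.

```python
import math
import random


def coverage(points):
    """Coefficient of variation of the nearest-neighbour distances."""
    n = len(points)
    nearest = [
        min(math.dist(points[i], points[k]) for k in range(n) if k != i)
        for i in range(n)
    ]
    mean = sum(nearest) / n
    if mean == 0.0:  # all points duplicated: treat as worst case
        return math.inf
    return math.sqrt(sum((v - mean) ** 2 for v in nearest) / n) / mean


def rescale(column):
    """Min-max scaling to [0, 1]; assumes the feature is not constant."""
    low, high = min(column), max(column)
    return [(v - low) / (high - low) for v in column]


def ufs_cov_sfs(columns):
    """Sequential forward selection; `columns` is a list of features."""
    scaled = [rescale(c) for c in columns]
    remaining = list(range(len(scaled)))
    order, values = [], []
    while remaining:
        # Coverage of the current subset extended by each candidate feature.
        scores = {}
        for j in remaining:
            subset = [scaled[f] for f in order + [j]]
            scores[j] = coverage(list(zip(*subset)))
        best = min(scores, key=scores.get)
        order.append(best)
        values.append(scores[best])
        remaining.remove(best)
    cut = values.index(min(values)) + 1
    return order, values, order[:cut]


# Toy data: two independent features and a linear copy of the first one.
rng = random.Random(1)
x1 = [rng.random() for _ in range(200)]
x2 = [rng.random() for _ in range(200)]
x3 = [2.0 * v + 1.0 for v in x1]
order, values, selected = ufs_cov_sfs([x1, x2, x3])
print(order, values, selected)
```

In this toy example a feature and its linear copy should not be picked together in the first two steps, because two redundant features place all points on a line, which leaves the square poorly filled and yields a high coverage value.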
Figure 2: Features with coverage values 0.5110806, 1.061314, and 1.106582 for random (non-redundant), linearly redundant, and non-linearly redundant features respectively.
Figure 2 shows clearly that redundancy is easily detected by this measure, whether it is linear or non-linear. Besides, the UFS based on coverage (UfsCov) algorithm takes into account the multivariate interactions between selected features. In addition, the UfsCov algorithm needs neither extra parameters nor a fixed threshold: the best subset is simply the one that gives the smallest coverage measure. Finally, the UfsCov algorithm can be programmed easily in R and MATLAB software.
4 Experimental case studies
The simulated and the real world datasets presented in this section are commonly used in several papers on machine learning and feature selection. Moreover, several scenarios of noise injection and shuffling data are proposed to evaluate and to explore the limitation of the UfsCov algorithm.
Further, this section discusses the quality of the obtained results. Finally, the results are verified and evaluated by using random forest algorithm.
4.1 Simulated case study
The simulated Butterfly dataset, introduced in (cran1), is composed of features , where are relevant and contain all the information of the dataset. The remaining features are constructed basically from with linear and non-linear relations. In fact, these features are redundant and do not bring new information. (See J. Golay et al. (jango1)).
Figure 3 shows the results for the Butterfly datasets with different numbers of points. The results show that the UfsCov algorithm easily finds the three important features regardless of the number of points used to generate the Butterfly dataset. The minimum value of the coverage measure is reached at the correct subset.
4.2 Noise injection
The robustness of UfsCov is evaluated against noise. Several noise injection experiments were performed for two different scenarios. The first one consists in injecting noise into all features of the Butterfly dataset. The second one consists in corrupting only the redundant features (). A Gaussian noise is used with a mean fixed at and a standard deviation set at: , and of the original standard deviation of each feature.
The objective of these experiments is to see if UfsCov can detect an existing redundancy in data corrupted by a Gaussian noise. Furthermore, it is important to find out the limitation of this algorithm against noise and at what level.
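The two noise scenarios can be reproduced with a small helper such as the Python sketch below. The function name, the default zero mean, and the noise level used in the example are assumptions made for illustration; only the principle (a standard deviation set as a fraction of each feature's own standard deviation) comes from the text.

```python
import random
import statistics


def inject_gaussian_noise(columns, level, targets=None, mean=0.0, seed=0):
    """Return a copy of `columns` (a list of features) in which every target
    feature receives Gaussian noise whose standard deviation is `level`
    times the standard deviation of that feature."""
    rng = random.Random(seed)
    if targets is None:  # first scenario: corrupt all features
        targets = range(len(columns))
    noisy = [list(c) for c in columns]
    for j in targets:
        sigma = level * statistics.pstdev(columns[j])
        noisy[j] = [v + rng.gauss(mean, sigma) for v in columns[j]]
    return noisy


features = [[0.0, 1.0, 2.0, 3.0], [5.0, 5.5, 6.0, 7.0]]
all_noisy = inject_gaussian_noise(features, level=0.1)
second_only = inject_gaussian_noise(features, level=0.1, targets=[1])
print(all_noisy)
print(second_only)
```

Passing `targets` restricts the corruption to a chosen group of features, which corresponds to the second scenario where only the redundant features are perturbed.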
Figure 4 shows the two proposed scenarios of noise injection. Figure 4.b presents the reaction of UfsCov with corrupted redundant features, at different levels of noise. The algorithm is still robust and detects the important features. However, at of noise, the minimum value of the coverage does not indicate the correct subset of features, which is expected for such a level of noise. On the other hand, the algorithm still gives a correct ranking of features regarding the importance and the provided information of each feature (see table 1). Therefore, it can be concluded that the UfsCov algorithm is robust against noise.
Table 1: The added features in the SFS technique (here, the Gaussian noise is injected to all features). Label: Without noise.
Table 2: The added features in the SFS technique (here, the Gaussian noise is injected to all features). Label: Without noise.
4.3 Shuffling features
In addition to injecting noise in data, shuffling of features can be an interesting experiment to evaluate the UfsCov algorithm. This operation was carried out with two scenarios: at the beginning, only two redundant features are shuffled ( , ). Then, three redundant features are shuffled. The results were expected, since the shuffling destroys the linear or non-linear relation between features. In fact, this can reduce the redundancy. As figure 5 presents, UfsCov selected features with relevant information (which are not redundant).
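The shuffling operation itself is simple to express. In the Python sketch below, the function name and the toy data are hypothetical; each targeted feature is permuted independently of the others, so its marginal distribution is preserved while its relation with the remaining features is broken.

```python
import random


def shuffle_features(columns, targets, seed=0):
    """Return a copy of `columns` (a list of features) in which the values
    of every target feature are randomly permuted across observations."""
    rng = random.Random(seed)
    shuffled = [list(c) for c in columns]
    for j in targets:
        rng.shuffle(shuffled[j])
    return shuffled


x = [i / 49 for i in range(50)]
redundant = [v * v for v in x]  # non-linear function of x
data = [x, redundant]
shuffled = shuffle_features(data, targets=[1])
print(shuffled[1][:5])
```

After the permutation, the second feature still takes the same set of values, but it is no longer a function of the first one.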
4.4 Benchmark case studies
Benchmark case studies UCI; data0 are also used to test the UfsCov algorithm. The datasets used in this work are: Parkinson, PageBlocks, Ionosphere, and COIL20 data0. Table 3 describes these datasets and the number of selected features for each dataset.
Table 3: Description of the used datasets and summary of the results obtained by the UfsCov algorithm. Columns: Data, Number of instances, Number of features, Selected features. Rows: Parkinson, PageBlocks, Ionosphere, COIL20.
4.4.1 Results and discussions
In addition to applying the UfsCov algorithm on simulated and real world datasets, this subsection discusses the evaluation of the results. Here, random forest algorithm is used as a classifier for the four datasets used above (Parkinson, PageBlocks, Ionosphere, and COIL20).
The used procedure of testing with random forest is applied once with all features of the datasets and once with only selected features. The procedure can be summarised as follows:
the data were split into training and testing sets ( for training and for testing);
the training set was used to find the optimal parameters of random forest (the number of trees and the number of predictors). Furthermore, the training step was performed by using a -fold cross-validation;
a random forest model was generated with the optimal parameters found above (previous step), and then applied to classify the testing set. Two classification evaluation metrics are used:
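the overall accuracy (OA) of classification is computed with the following formula:
$$\mathrm{OA} = \frac{1}{N} \sum_{i=1}^{N} I(y_i = \hat{y}_i),$$
where $y_i$ is the true class label and $\hat{y}_i$ is the predicted class label for the $i$th observation using the random forest model, and $N$ is the number of data points in the test subset. $I(y_i = \hat{y}_i)$ is an indicator variable with:
$$I(y_i = \hat{y}_i) = \begin{cases} 1 & \text{if } y_i = \hat{y}_i, \\ 0 & \text{otherwise.} \end{cases}$$
Therefore, the OA formula computes the fraction of correct classifications, which means that the best classification has the highest overall accuracy.
Cohen's Kappa coefficient $\kappa$ kappa1 is also used to compare the classification results of random forest. The Kappa evaluation metric is computed on the test subset by using the following formula:
$$\kappa = \frac{N \sum_{k=1}^{K} x_{kk} - \sum_{k=1}^{K} x_{k+} \, x_{+k}}{N^2 - \sum_{k=1}^{K} x_{k+} \, x_{+k}},$$
where $x_{kk}$ indicates the number of correctly classified samples for class $k$, $K$ is the number of classes, and $N$ is the number of data points in the test subset. $x_{k+}$ and $x_{+k}$ are the number of samples belonging to class $k$ and the number of samples classified in the same class $k$, respectively.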
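Both evaluation metrics can be computed from the true and predicted labels of the test subset. The Python sketch below is an illustration with hypothetical function names and toy labels; the Kappa coefficient is written in its usual confusion-matrix form, with the chance-agreement term built from the per-class counts of true and predicted labels.

```python
from collections import Counter


def overall_accuracy(y_true, y_pred):
    """Fraction of observations whose predicted label equals the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa from label lists (undefined if both contain one class)."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)  # samples belonging to each class
    pred_counts = Counter(y_pred)  # samples classified in each class
    chance = sum(true_counts[c] * pred_counts[c] for c in true_counts)
    return (n * correct - chance) / (n * n - chance)


y_true = ["a", "a", "a", "b", "b", "b"]
y_pred = ["a", "a", "b", "b", "b", "a"]
print(overall_accuracy(y_true, y_pred))  # 4 correct out of 6
print(cohen_kappa(y_true, y_pred))       # (6*4 - 18) / (36 - 18) = 1/3
```

The example shows why both metrics are reported: four of six labels are correct, but once the agreement expected by chance is removed the Kappa value is only one third.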
Table 4: In percent: random forest classification errors (20 repetitions with random splits) and the standard deviation as well. Columns: Data, All features, Selected features. Rows: Parkinson, PageBlocks, Ionosphere, COIL20.
Table 5: Mean Kappa coefficient (over 20 repetitions with random splits) and the standard deviation as well. Columns: Data, All features, Selected features. Rows: Parkinson, PageBlocks, Ionosphere, COIL20.
4.5 Environmental case studies
This section shows the potential of the proposed unsupervised feature selection algorithm on environmental data. In fact, the algorithm is applied on Permafrost data and the Indian Pines hyperspectral image.
4.5.1 Permafrost case study
The data were collected in the Swiss Alps. features (excluding the XY coordinates) are used to predict Permafrost presence or absence. For more details on the study, including more complete references and more information about the collected features, see N. Deluigi et al. (nicola).
Figure 8 presents the unsupervised feature selection results. The minimum of the coverage measure is reached at features. Furthermore, the given result is evaluated by using random forest algorithm. Table 6 shows the results of random forest, with all features and with only the selected features. The classification accuracy and the Kappa coefficient are shown in figure 9. In this figure, random forest is applied after each step of UfsCov algorithm.
Table 6: Random forest errors and the standard deviation after repetitions with different splitting (Permafrost dataset). Columns: Features, Accuracy, Kappa metric. Rows: All features (), Selected features ().
Figure 9: Random forest results for each step of the UfsCov (a) the overall accuracy, (b) the Kappa coefficient. (Permafrost dataset)
4.5.2 Indian Pines image
The image was captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor in Northwest Indiana, on June 12, 1992. The Indian Pines scene contains agricultural and forested regions (figure 10). The data consist of x pixels and spectral bands with a spatial resolution of m/pixel PURR1947.
In this work only bands are used for the experiments, after removing noisy bands (, , ) due to water absorption. The present case study of a hyperspectral image shows that the UfsCov algorithm is able to deal with remote sensing problems. Furthermore, the proposed algorithm helps to manage high dimensional datasets (more than features).
Figure 11 shows the results of the proposed algorithm. The minimum value is reached with the features. Table 7 compares the two random forest models, with all features and with only the selected features.
Table 7: Random forest errors and the standard deviation after repetitions with different splitting (for the Indian Pines image). Columns: Features, Accuracy, Kappa metric. Rows: All features ( bands), Selected features ( bands).
Figure 12: Random forest results for each step of the UfsCov (a) the overall accuracy, (b) the Kappa coefficient. (Indian Pines dataset)
5 Conclusion
The research introduced a space filling measure for unsupervised feature selection problems. A new filter algorithm based on the coverage measure was proposed. The proposed UfsCov algorithm minimises redundancy in data. It showed its efficiency on simulated and real world case studies including environmental data. Random forest results confirm the potential of the space filling concept in unsupervised feature selection problems. Finally, UfsCov was programmed in the R language and will be available on the CRAN repository in the SFtools library.
Future development could be in the adaptation of new measures based on space filling concept for machine learning use and data mining. Furthermore, it could be important to propose algorithms with a parallel CPU computing version and GPU computing to speed up the execution time.
This research was partly supported by the Swiss Government Excellence Scholarships for Foreign Scholars.
The authors would like to thank Nicola Deluigi for providing us with the Permafrost dataset. They also would like to thank Micheal Leuenberger, Jean Golay, and Fabian Guignard for fruitful discussions about machine learning.