A branch-and-bound feature selection algorithm for U-shaped cost functions
Abstract
This paper presents the formulation of a combinatorial optimization problem with the following characteristics: (i) the search space is the power set of a finite set structured as a Boolean lattice; (ii) the cost function forms a U-shaped curve when applied to any lattice chain. This formulation applies to feature selection in the context of pattern recognition. The known approaches for this problem are branch-and-bound algorithms and heuristics that partially explore the search space. Branch-and-bound algorithms are equivalent to the full search, while heuristics are not. This paper presents a branch-and-bound algorithm that differs from the other known ones by exploring the lattice structure and the U-shaped chain curves of the search space. The main contribution of this paper is the architecture of this algorithm, which is based on the representation and exploration of the search space by new lattice properties proven here. Several experiments with well-known public data indicate the superiority of the proposed method over SFFS, a popular heuristic that gives good results in very short computational time. In all experiments, the proposed method obtained better or equal results in similar or even smaller computational time.
Boolean lattice; branch-and-bound algorithm; U-shaped curve; classifiers; W-operators; feature selection; subset search; optimal search.
I Introduction
A combinatorial optimization algorithm chooses the object of minimum cost over a finite collection of objects, called the search space, according to a given cost function. The simplest architecture for this algorithm, called full search, accesses each object of the search space, but it does not work for huge spaces. In this case, what is possible is to access some objects and choose the one of minimum cost, based on the observed measures. Heuristics and branch-and-bound are two families of algorithms of this kind. A heuristic algorithm has no formal guarantee of finding the minimum cost object, while a branch-and-bound algorithm has mathematical properties that guarantee finding it.
Here we study a combinatorial optimization problem in which the search space is composed of all subsets of a finite set with $n$ points (i.e., a search space with $2^n$ objects), organized as a Boolean lattice, and the cost function has a U-shape in any chain of the search space or, equivalently, in any maximal chain of the search space.
This structure is found in some applied problems such as feature selection in pattern recognition [5, 7] and W-operator window design in mathematical morphology [8]. In these problems, a minimum subset of features that is sufficient to represent the objects should be chosen from a set of features. In W-operator design, the features are points of a finite rectangle called the window. The U-shaped functions are formed by error estimation of the classifiers or of the operators designed, or by some measures, such as the entropy, on the corresponding estimated joint distribution. This is a well-known phenomenon in pattern recognition: for a fixed amount of training data, increasing the number of features considered in the classifier design reduces the classifier error by increasing the separation between classes, until the available data becomes too small to cover the classifier domain and the consequent increase of the estimation error increases the classifier error. Some known approaches for this problem are heuristics. A relatively successful heuristic algorithm is SFFS [11], which gives good results in relatively small computational time.
There is a myriad of branch-and-bound algorithms in the literature that are based on monotonicity of the cost function [6, 10, 14, 15]. For a detailed review of branch-and-bound algorithms, refer to [13]. If the real joint probability distribution of the patterns and their classes were known, larger dimensionality would imply smaller classification errors. However, in practice, these distributions are unknown and must be estimated. A problem with the adoption of monotonic cost functions is that they do not take into account the estimation errors committed when many features are considered (the “curse of dimensionality”, also known as the “U-curve problem” or “peaking phenomenon” [7]).
This paper presents a branch-and-bound algorithm that differs from the other known ones by exploring the lattice structure and the U-shaped chain curves of the search space.
Some experiments were performed to compare SFFS to the U-curve approach. Applications such as W-operator window design, genetic network architecture identification and eight UCI repository data sets show encouraging results, since the U-curve algorithm beats SFFS (i.e., finds a node with smaller cost than the one found by SFFS) in smaller computational time for 27 out of 38 data sets tested. For all data sets, the U-curve algorithm gives a result equal to or better than that of SFFS, since the former covers the complete search space.
Though the results obtained by applying the method to pattern recognition problems are exciting, the great contribution of this paper is the discovery of some lattice algebra properties that lead to a new data structure for the search space representation, which is particularly adequate for updates after up-down lattice interval cuts (i.e., cuts by couples of intervals [0,X] and [X,W]). Classical tree-based search space representations do not have this property. For example, if Depth First Search were adopted to represent the Boolean lattice, only cuts in one direction could be performed.
Following this introduction, Section 2 presents the formalization of the problem studied. Section 3 describes structurally the branch-and-bound algorithm designed. Section 4 presents the mathematical properties that support the algorithm steps. Section 5 presents some experimental results comparing U-curve to SFFS. Finally, the Conclusion discusses the contributions of this paper and proposes some next steps of this research.
II The Boolean U-curve optimization problem
Let $W$ be a finite set, $\mathcal{P}(W)$ be the collection of all subsets of $W$, $\subseteq$ be the usual inclusion relation on sets and $|W|$ denote the cardinality of $W$. The search space is composed of $2^{|W|}$ objects organized in a Boolean lattice.
The partially ordered set $(\mathcal{P}(W), \subseteq)$ is a complete Boolean lattice of degree $|W|$ such that: the smallest and largest elements are, respectively, $\emptyset$ and $W$; the sum and product are, respectively, the usual union and intersection on sets; and the complement of a set $X$ in $\mathcal{P}(W)$ is its complement in relation to $W$, denoted by $X^c$.
Subsets of $W$ will be represented by strings of zeros and ones, with $0$ meaning that the point does not belong to the subset and $1$ meaning that it does. For example, if $W = \{w_1, w_2, w_3\}$, the subset $\{w_1, w_3\}$ will be represented by $101$. In an abuse of language, $X = 101$ means that $X$ is the set represented by $101$.
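The string representation described above can be sketched in a few lines. This is only an illustration: the function names and the three-point universe are hypothetical, and the order of the characters is assumed to follow the order in which the points are listed.

```python
def subset_to_bits(subset, universe):
    """Return one character per point of `universe`: '1' if the point
    belongs to `subset`, '0' otherwise (hypothetical helper)."""
    return "".join("1" if point in subset else "0" for point in universe)


def bits_to_subset(bits, universe):
    """Inverse mapping: recover the subset encoded by a zero/one string."""
    return {point for point, bit in zip(universe, bits) if bit == "1"}


# Illustrative universe with three points (an assumption, not from the paper).
universe = ["a", "b", "c"]
encoded = subset_to_bits({"a", "c"}, universe)
decoded = bits_to_subset(encoded, universe)
```

In the remaining sketches the same idea is used with integers as bitmasks, which makes inclusion tests cheap.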
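A chain $\mathcal{A}$ is a collection $\{A_1, A_2, \ldots, A_k\} \subseteq \mathcal{P}(W)$ such that $A_1 \subseteq A_2 \subseteq \ldots \subseteq A_k$. A chain $\mathcal{M} \subseteq \mathcal{X}$ is maximal in $\mathcal{X} \subseteq \mathcal{P}(W)$ if there is no other chain $\mathcal{C} \subseteq \mathcal{X}$ such that $\mathcal{C}$ properly contains $\mathcal{M}$.
Let $c$ be a cost function defined from $\mathcal{P}(W)$ to $\mathbb{R}$. We say that $c$ is decomposable in U-shaped curves if, for every maximal chain $\mathcal{M} \subseteq \mathcal{P}(W)$, the restriction of $c$ to $\mathcal{M}$ is a U-shaped curve, i.e., for every $A, X, B \in \mathcal{M}$, $A \subseteq X \subseteq B \Rightarrow \max(c(A), c(B)) \geq c(X)$.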
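For small lattices, the U-shape property can be checked by brute force. The sketch below assumes the condition is that no element of a chain costs more than both a smaller and a larger element of the same chain, i.e., the cost of X is at most the maximum of the costs of A and B whenever A ⊆ X ⊆ B. Subsets are encoded as integer bitmasks, and all names are hypothetical.

```python
def is_subset(a, b):
    """True if the set encoded by bitmask `a` is contained in `b`."""
    return a & b == a


def is_decomposable_in_u_curves(n, cost):
    """Brute-force check over all triples A <= X <= B of a degree-n lattice.

    Every such triple lies on some maximal chain, so checking all triples
    is equivalent to checking every maximal chain (feasible for small n only).
    """
    size = 1 << n
    costs = [cost(x) for x in range(size)]
    for a in range(size):
        for x in range(size):
            if not is_subset(a, x):
                continue
            for b in range(size):
                if is_subset(x, b) and costs[x] > max(costs[a], costs[b]):
                    return False
    return True


def popcount(x):
    """Number of points in the subset encoded by `x`."""
    return bin(x).count("1")
```

For example, a cost that depends only on how far the subset size is from 2 passes the check, while its negation does not.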
Figure 1 shows a complete Boolean lattice of degree with a cost function decomposable in U-shaped curves. In this figure, a maximal chain in and its cost function are emphasized. Figure 2 presents the curve of the same cost function restricted to some maximal chains in and in . Note the U-shape of the curves in Figure 2.
Our problem is to find the element (or elements) of minimum cost in a Boolean lattice of degree $n$. The full search in this space is an exponential problem, since this space is composed of $2^n$ elements. Thus, for moderately large $n$, the full search becomes infeasible.
III The U-curve algorithm
The U-shaped format of the restriction of the cost function to any maximal chain is the key to developing a branch-and-bound algorithm, the U-curve algorithm, to deal with the hard combinatorial problem of finding subsets of minimum cost.
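Let $A$ and $B$ be elements of the Boolean lattice $\mathcal{P}(W)$. An interval $[A, B]$ of $\mathcal{P}(W)$ is the subset of $\mathcal{P}(W)$ given by $[A, B] = \{X \in \mathcal{P}(W) : A \subseteq X \subseteq B\}$. The elements $A$ and $B$ are called, respectively, the left and right extremities of $[A, B]$. Intervals are very important for characterizing decompositions in Boolean lattices [2, 4].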
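With subsets encoded as integer bitmasks, membership in an interval reduces to two bitwise tests. The following is a minimal sketch with hypothetical function names; the enumeration helper is only meant for small lattices.

```python
def in_interval(x, a, b):
    """True if a <= x <= b in the inclusion order (bitmask encoding)."""
    return (a & x) == a and (x & b) == x


def interval(a, b, n):
    """List all elements of the interval [a, b] in a lattice of degree n."""
    return [x for x in range(1 << n) if in_interval(x, a, b)]
```

Note that the interval is empty when the left extremity is not contained in the right one.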
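Let $R$ be an element of $\mathcal{P}(W)$. In this paper, intervals of the type $[\emptyset, R]$ and $[R, W]$ are called, respectively, lower and upper intervals. The right extremity of a lower interval and the left extremity of an upper interval are called, respectively, lower and upper restrictions. Let $\mathcal{R}_L$ and $\mathcal{R}_U$ denote, respectively, collections of lower and upper restrictions. The search space will be the poset $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U)$ obtained by eliminating the intervals given by the collections of lower and upper restrictions from $\mathcal{P}(W)$, i.e., $\mathcal{X}(\mathcal{R}_L, \mathcal{R}_U) = \mathcal{P}(W) - \bigcup \{[\emptyset, R] : R \in \mathcal{R}_L\} - \bigcup \{[R, W] : R \in \mathcal{R}_U\}$. In cases in which only the lower or the upper intervals are eliminated, the resulting search space is denoted, respectively, by $\mathcal{X}(\mathcal{R}_L)$ and $\mathcal{X}(\mathcal{R}_U)$ and given, respectively, by $\mathcal{P}(W) - \bigcup \{[\emptyset, R] : R \in \mathcal{R}_L\}$ and $\mathcal{P}(W) - \bigcup \{[R, W] : R \in \mathcal{R}_U\}$.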
The search space is explored by an iterative algorithm that, at each iteration, explores a small subset of the search space, computes a local minimum, updates the list of minimum elements found and extends both restriction sets, eliminating the region just explored. The algorithm starts with three empty lists: minimum elements, lower restrictions and upper restrictions. It is executed until the whole space is explored, i.e., until the search space becomes empty. The subset eliminated at each iteration is defined from the exploration of a chain, which may be done in the down-up or up-down direction. Algorithm 1 describes this process. The direction selection procedure (line 5) can use a random or an adaptive method. The random method sets a static probability of selecting the down-up or up-down direction. The adaptive method calculates a new probability for each direction, giving more probability to the down-up direction if most of the local minima are closer to the bottom of the lattice and to the up-down direction otherwise.
An element $X$ of the poset is called a minimal element of the poset if there is no other element $X'$ of the poset with $X' \subseteq X$. In Figure 1, the minimal elements of are: , and . If the down-up direction is chosen, the DownUpDirection procedure is performed (Algorithm 2):

The MinimalElement procedure calculates a minimal element of the poset. Only the lower restriction set is used to calculate the minimal element. An element is said to be covered by the lower restriction set if it is contained in some lower restriction, and is said to be covered by the upper restriction set if it contains some upper restriction. When the calculated minimal element is covered by an upper restriction, it is discarded, i.e., the lower restriction set is updated with it and a new iteration begins (lines 1-5).
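The two coverage tests can be sketched with bitmasks as follows. The sketch assumes that a lower restriction covers every element it contains and that an upper restriction covers every element containing it; the function names and the list-based restriction sets are hypothetical.

```python
def covered_by_lower(x, lower):
    """True if x is contained in some lower restriction r (x <= r)."""
    return any((x & r) == x for r in lower)


def covered_by_upper(x, upper):
    """True if x contains some upper restriction r (r <= x)."""
    return any((r & x) == r for r in upper)


def in_search_space(x, lower, upper):
    """True if x has not been eliminated by any restriction."""
    return not covered_by_lower(x, lower) and not covered_by_upper(x, upper)
```

With these helpers, the search space is empty exactly when no element passes `in_search_space`.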

At this point, the element is the minimum element of the chain explored, and are, respectively, the lower and upper adjacent elements of (i.e., and ) by construction, . It can be proved that any element of , with , has cost bigger than and, any element of , with , has cost bigger than . By using this property, the lower and upper restrictions can be updated, respectively, by and (lines 12-17). Figure 3 shows a schematic representation of the first iteration of the algorithm and the elements contained in the intervals and .

The result list can be updated with the minimum element found (line 18), i.e., it will be included in the result list if its cost is lower than those of the elements already saved in the list. The result list can store a predefined number of elements with low costs or only the elements with the overall minimum cost.

In order to prevent visiting an element more than once, a recursive procedure called the minimum exhausting procedure is performed (line 19).
An element is called a minimum exhausted element in if all its adjacent elements (upper and lower) have cost bigger than it. This definition can be extended to the poset , i.e., all its adjacent elements (upper and lower) in have cost bigger than it. In Figure 1 we can see that the elements , and are minimum exhausted elements in , but is not a minimum exhausted element in . In this paper, the term minimum exhausted will always be applied referring to a poset .
The minimum exhausting procedure (Algorithm 3) is a recursive process that visits all the adjacent elements of a given element and turns all of them into minimum exhausted elements in the resulting poset. It uses a stack to perform the recursive process. The stack is initialized by pushing the given element onto it, and the process is performed while the stack is not empty (lines 2-22). At each iteration, the algorithm processes the top element of the stack: all the adjacent elements (upper and lower) of the top element that are in the search space and not in the stack are checked. If the cost of an adjacent element is lower than (or equal to) the cost of the top element, then the adjacent element is pushed onto the stack. If the cost of the adjacent element is bigger than the cost of the top element, then one of the restriction sets can be updated with the adjacent element: the lower restriction set if it is a lower adjacent element of the top element and the upper restriction set if it is an upper adjacent element (lines 5-16). If the top element is a minimum exhausted element in the search space, i.e., there is no adjacent element in the search space with cost lower than its own, then it is removed from the stack and, also, the restriction sets and the result list are updated with it (lines 19-21). At the end of this procedure, all the elements processed are minimum exhausted elements in the resulting poset.
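The stack-based process described above can be sketched as follows. This is not a transcription of Algorithm 3, whose listing is not reproduced here. It is a minimal sketch under stated assumptions: subsets are bitmasks, restriction sets are Python lists updated in place, an element is popped once none of its remaining neighbours can be pushed, and all function names are hypothetical.

```python
def covered(x, lower, upper):
    """True if x was eliminated by a lower or an upper restriction."""
    return (any((x & r) == x for r in lower)
            or any((r & x) == r for r in upper))


def add_lower(lower, x):
    """Add x as a lower restriction unless it is already covered."""
    if not any((x & r) == x for r in lower):
        lower[:] = [r for r in lower if (r & x) != r] + [x]


def add_upper(upper, x):
    """Add x as an upper restriction unless it is already covered."""
    if not any((r & x) == r for r in upper):
        upper[:] = [r for r in upper if (r & x) != x] + [x]


def minimum_exhausting(start, n, cost, lower, upper):
    """Exhaust the neighbourhood of `start`; return the popped elements."""
    stack, on_stack, exhausted = [start], {start}, []
    while stack:
        top = stack[-1]
        pushed = False
        for k in range(n):
            adjacent = top ^ (1 << k)        # flip one component
            if adjacent in on_stack or covered(adjacent, lower, upper):
                continue                     # skipped before computing its cost
            if cost(adjacent) <= cost(top):
                stack.append(adjacent)
                on_stack.add(adjacent)
                pushed = True
            elif adjacent < top:             # component removed: lower adjacent
                add_lower(lower, adjacent)
            else:                            # component added: upper adjacent
                add_upper(upper, adjacent)
        if not pushed:                       # top is minimum exhausted
            stack.pop()
            on_stack.discard(top)
            add_lower(lower, top)
            add_upper(upper, top)
            exhausted.append(top)
    return exhausted
```

Each element is pushed at most once, because a popped element is always covered by the lower restriction set afterwards. A caller would typically keep only the returned elements of smallest cost in its result list; that selection is left out of the sketch.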
Figure 4 shows a graphical representation of the minimum exhausting process. Figure 4A shows a chain construction process in the up direction; the chain has its edges emphasized. The orange-colored element has the minimum cost over the chain. The elements in black are the elements eliminated from the search space by the restrictions obtained from the lower and upper adjacent elements of the local minimum. The stack begins with the local minimum. Figure 4B shows the first iteration of the minimum exhausting process. The arrows and elements in red indicate the adjacent elements of the top of the stack that have cost lower than (or equal to) it. These elements are pushed onto the stack. The adjacent elements of the top of the stack with cost bigger than it can update the restriction sets, i.e., the lower adjacent element updates the lower restriction set and the upper adjacent element updates the upper restriction set. Figure 4C shows the second iteration: the adjacent elements with cost lower than (or equal to) that of the new top element are pushed onto the stack, and the other adjacent elements, with cost bigger than it, update, respectively, the lower and upper restriction sets. In Figure 4D the top element is a minimum exhausted element (grey color) and it is removed from the stack. In Figure 4E the elements eliminated by the new intervals are turned black. At this point, the new top element is a minimum exhausted element (grey color) and it is removed from the stack. From Figure 4F to Figure 4H all the elements are removed from the stack and the elements removed by the new restrictions are turned black. Figure 4H shows all the elements removed by a single minimum exhausting process.
The procedures to calculate minimal and maximal elements and the procedure to update lower and upper restriction sets will be discussed in the next section.
IV Mathematical foundations
This section introduces mathematical foundations of some modules of the algorithm.
IV-A Minimal and Maximal Construction Procedure
Each iteration of the algorithm requires the calculation of a minimal element of the poset defined by the lower restrictions or a maximal element of the poset defined by the upper restrictions. A simple solution for that is presented here. The next theorem is the key to this solution.
Theorem 1. For every ,
.
(in Appendix Section)
Algorithm 4 implements the minimal construction procedure. It builds a minimal element of the poset . The process begins with and and executes a loop (lines 3-16) trying to remove components from . At each step, a component is chosen exclusively from ( prevents multi-selecting). If the element resulting from by removing the component is contained in then is updated with (lines 7-15).
The minimal element calculated is equal to when . At this point, the poset is empty and the algorithm stops in the next iteration.
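The construction just described can be sketched as a single pass over the components in random order. This is a sketch under assumptions, not the authors' listing: the process is assumed to start from the full set, each component is tried exactly once, and returning `None` for an empty poset is a choice made here for illustration. All names are hypothetical.

```python
import random


def covered_by_lower(x, lower):
    """True if x is contained in some lower restriction."""
    return any((x & r) == x for r in lower)


def minimal_element(n, lower, rng=random):
    """Return a minimal element not covered by `lower`, or None if none exists."""
    x = (1 << n) - 1                      # start from the full set
    if covered_by_lower(x, lower):
        return None                       # every element is covered
    components = list(range(n))
    rng.shuffle(components)               # each component is tried once
    for k in components:
        candidate = x & ~(1 << k)         # remove component k
        if not covered_by_lower(candidate, lower):
            x = candidate                 # still in the poset: keep the removal
    return x
```

One pass suffices because coverage is closed under taking subsets: if removing a component produced a covered element at some step, removing it later from a smaller element would also produce a covered element.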
The next theorem proves the correctness of Algorithm 4.
Theorem 2. The element of returned by the minimal construction process (Algorithm 4) is a minimal element in .
(in Appendix Section)
The process to calculate a maximal element in is dual to the one to calculate a minimal one, i.e., it begins with and, at each step, when the complement of the resulting has nonempty intersection with all the elements of , adds a component to .
IV-B Lower and Upper Restrictions Update
The restriction sets and represent the search space. Thus, they are updated after each new search by the following rule: an element is added to the lower (or upper) restriction set if all elements of (or ) have costs bigger than or equal to .
The next theorem establishes the U-curve condition, which allows stopping the chain construction process and updating the restriction sets.
Theorem 3. Let be the chain constructed by Algorithm 2 (or its dual version). Let be the cost function from to decomposable in U-shaped curves and , then
.
(in Appendix Section)
By a similar proof to the one of Theorem 3, it can be proved that all the elements in contained in have also cost bigger than or equal to it. Figure 3 shows the chain obtained by the chain construction process and the resulting poset. The detached elements always have cost bigger than the elements or .
Algorithm 5 describes the update process of the lower restriction set by a new element. If the new element is already covered by the lower restriction set, i.e., there exists a lower restriction that contains it, then the process stops (lines 1-3). Otherwise, all the lower restrictions contained in the new element are removed from the set and the new element is added to it (lines 4-9). This procedure may diminish the cardinality of the restriction set, but does not diminish the cardinality of the resulting poset, since the removed restrictions are contained in the new element.
The upper restriction list updating procedure is dual to the lower one, i.e., in this case we look for elements contained in the new element instead of elements that contain it.
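Both update rules fit in a few lines when subsets are bitmasks. The sketch below returns a new list instead of modifying the restriction set in place, which is a choice made here for readability; the function names are hypothetical.

```python
def update_lower(lower, x):
    """Return the lower restriction list after trying to add x."""
    if any((x & r) == x for r in lower):       # x is inside some restriction
        return list(lower)
    kept = [r for r in lower if (r & x) != r]  # drop restrictions inside x
    return kept + [x]


def update_upper(upper, x):
    """Dual rule: return the upper restriction list after trying to add x."""
    if any((r & x) == r for r in upper):       # x contains some restriction
        return list(upper)
    kept = [r for r in upper if (r & x) != x]  # drop restrictions containing x
    return kept + [x]
```

After either update, the restriction list contains only mutually incomparable elements, provided the same held before the update.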
IV-C Minimum Exhausting Procedure
The computation of the cost function is in general heavy. Thus, it is desirable that each element be visited (and its cost computed) a single time. A way of preventing reprocessing is to apply the minimum exhausting procedure. This procedure is a recursive function (Algorithm 3). It uses a stack to process recursively the whole neighborhood of a given element contained in the poset. At each recursion, it visits the upper and lower adjacent elements of the top of the stack that are in the poset and not in the stack. The adjacent elements with cost bigger than the cost of the top element satisfy the U-curve condition, so they can update the restriction sets and, consequently, be removed from the search space. The adjacent elements with cost lower than or equal to that of the top element are pushed onto the stack to be processed in later iterations. Note that elements are not reprocessed during the exhausting procedure, since this procedure checks whether a newly explored element has already been eliminated by a restriction or is already in the stack before computing its cost. If the top element is a minimum exhausted element in the poset, then it is removed from the stack. After the whole procedure is finished, all elements processed are out of the resulting poset, so they will not be reprocessed in the next iterations. The fact that an element cannot be reprocessed along the procedure implies that the cardinality of the poset is an upper limit for the number of steps of the procedure. In search spaces that are lattices of high degree, this procedure may have to process a huge number of elements and some heuristics may be necessary, for example, stopping the search for adjacent elements smaller than a minimum after some unsuccessful trials.
The minimum exhausting procedure gives another interesting property to the U-curve algorithm. If the restrictions of the cost function to maximal chains are U-shaped curves with oscillations, as illustrated in Figure 5A, the U-curve algorithm may lose a local minimum element. Note that, in this case, the local minimum element after the oscillation has cost smaller than the cost of the one before. However, this minimum is not lost if there is another chain, with a true U-shaped cost function, containing both local minimum elements. Figure 5B shows an alternative chain (chain in red) that reaches the true minimum element of the chain (element in black). Note that the first local minimum (element in yellow) is contained in both chains. The true minimum, reached by the alternative chain, is obtained exactly by the exhausting of the first minimum found. Hence, the exhausting procedure allows relaxing the class of problems approached by the U-curve algorithm.
V Experimental Results
In this section, some results of applications of the U-curve algorithm to feature selection are given and compared to SFFS [11]. For this study several data sets were used: W-operator window design [8], architecture identification in genetic networks and several data sets from the UCI Machine Learning Repository [1]. In all cases, the value 3 was assigned to the parameter of SFFS. This parameter is a stop criterion of SFFS, usually adopted to prevent the algorithm from stopping the first time it reaches the desired dimension. In this way, it performs more feature inclusions and deletions before returning the subset with the desired dimension, alleviating the nesting effect. The value used here is the default value adopted by the original algorithm implementation [11].
All data sets used and the binary program with some documentation can be found at the supplementary material web page (http://www.vision.ime.usp.br/~davidjr/ucurve).
V-A Cost function adopted: penalized mean conditional entropy
Information theory originated from Shannon's works [12] and can be employed in feature selection problems [5]. Shannon's entropy is a measure of the randomness of a random variable $Y$ given by:
$$H(Y) = -\sum_{y \in Y} P(y) \log P(y), \qquad (1)$$
in which $P$ is the probability distribution function and, by convention, $0 \cdot \log 0 = 0$.
The conditional entropy is given by the following equation:
$$H(Y \mid X = x) = -\sum_{y \in Y} P(y \mid x) \log P(y \mid x), \qquad (2)$$
in which $X$ is a feature vector and $P(y \mid x)$ is the conditional probability of $Y = y$ given the observation of an instance $x \in X$. Finally, the mean conditional entropy of $Y$ given all the possible instances $x \in X$ is given by:
$$E[H(Y \mid X)] = \sum_{x \in X} P(x) \, H(Y \mid X = x). \qquad (3)$$
Lower values of $E[H(Y \mid X)]$ yield better feature subspaces (i.e., the lower $E[H(Y \mid X)]$, the larger is the information gained about $Y$ by observing $X$).
In practice, $P(x)$ and $P(y \mid x)$ are estimated. A way to embed the estimation error, committed by using feature vectors with large dimensions and an insufficient number of samples, is to attribute a high entropy to (i.e., penalize) the rarely observed instances. The penalization adopted here consists in changing the conditional probability distribution of the instances that present just a unique observation to the uniform distribution (i.e., the highest entropy). This makes sense because if an instance $x$ has only one observation, the value of $Y$ is fully determined (i.e., $H(Y \mid X = x) = 0$), but the confidence about the real distribution of $P(y \mid x)$ is very low. Adopting this penalization, the estimation of the mean conditional entropy becomes:
$$E[H(Y \mid X)] = \frac{N}{t} + \sum_{x \, : \, P(x) > \frac{1}{t}} P(x) \, H(Y \mid X = x), \qquad (4)$$
in which $t$ is the number of training samples and $N$ is the number of instances $x$ with $P(x) = \frac{1}{t}$ (i.e., just one observation). In this formula, it is assumed that the logarithm base is the number of possible classes $|Y|$, thus normalizing the entropy values to the interval $[0, 1]$. This cost function exhibits U-shaped curves, since, for a sufficiently large dimension, the number of instances with a single observation starts to increase, increasing the penalization and, consequently, increasing the cost function value (i.e., the next features included do not give enough information to compensate for the estimation error).
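The penalized estimate can be computed directly from a list of labelled samples. The sketch below assumes that each sample is a pair (instance, label) with a hashable instance, that probabilities are estimated by relative frequencies, and that the logarithm base is the number of classes (at least 2). The function name and the input layout are hypothetical.

```python
from collections import Counter, defaultdict
from math import log


def penalized_mean_conditional_entropy(samples, num_classes):
    """Estimate the penalized mean conditional entropy from (instance, label) pairs."""
    t = len(samples)
    labels_by_instance = defaultdict(Counter)
    for instance, label in samples:
        labels_by_instance[instance][label] += 1

    total = 0.0
    for label_counts in labels_by_instance.values():
        count = sum(label_counts.values())
        if count == 1:
            # Single observation: penalized as a uniform distribution,
            # whose entropy is 1 when the log base is the number of classes.
            total += 1.0 / t
        else:
            entropy = -sum((c / count) * log(c / count, num_classes)
                           for c in label_counts.values())
            total += (count / t) * entropy
    return total
```

To use it as a cost function for a feature subset, one would first project every instance onto the selected features and then call the function on the projected samples.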
V-B Data sets description
V-B1 W-operator window design
The W-operator window design problem consists in looking for subsets of a window for which the designed operator has the lowest estimation error (i.e., the transformed images generated by the operator are as similar as possible to the expected images). The training samples were obtained from the images presented in [8]. The data set is composed of 20 files with 18,432 samples each. There are 16 features assuming binary values and two classes.
V-B2 Biological classification
The biological classification problem studied is the problem of estimating a subset of predictor genes for a specific target gene from a time-course microarray experiment. The data set used for the tests is the one presented in [9]. The data are normalized and quantized in three levels using the same method described in [3]. The subset of predictors is obtained from a set of 27 genes. Thus, there are 27 features assuming three distinct values and three possible classes. The data set is composed of 10 files with 15 samples each.
V-B3 UCI Machine Learning Repository
The UCI Machine Learning Repository data sets considered are: pendigits, votes, ionosphere, dorothea_filtered, dexter_filtered, spambase, sonar and madelon. For all data sets, the feature values were normalized by subtracting their respective means and dividing by their respective standard deviations. After that, all values were binarized (i.e., associated to 0 if the normalized value is non-positive, and to 1 otherwise). Except for dorothea_filtered and dexter_filtered, all features were taken into account. The dorothea_filtered and dexter_filtered files are post-processed versions of the dorothea and dexter data sets, respectively. In the dorothea and dexter data sets, most features display a null value for almost every sample. So, dorothea_filtered considered only the features with 100 or more non-null values, while dexter_filtered considered the features with 50 or more non-null values.
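The normalization and binarization step can be sketched with the standard library alone. The sketch assumes the data are given as a list of rows of numeric feature values, uses the population standard deviation (the choice does not affect the sign of the normalized value), and maps constant features to 0. The function name is hypothetical.

```python
from statistics import mean, pstdev


def binarize_features(rows):
    """Z-score each feature column, then map non-positive values to 0 and the rest to 1."""
    columns = list(zip(*rows))
    binarized_columns = []
    for column in columns:
        mu = mean(column)
        sigma = pstdev(column)
        # A constant column has zero deviation; its normalized values are taken as 0.
        normalized = [(v - mu) / sigma if sigma > 0 else 0.0 for v in column]
        binarized_columns.append([1 if value > 0 else 0 for value in normalized])
    return [list(row) for row in zip(*binarized_columns)]
```

For instance, a column with values 1, 2 and 6 has mean 3, so only the last sample is mapped to 1.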
A description of each data set is presented in the following list:

pendigits: composed of 7494 samples, 16 binary features and 10 classes;
votes: composed of 435 samples, 16 ternary features and 2 classes;
ionosphere: composed of 351 samples, 34 binary features and 2 classes;
dorothea_filtered: composed of 800 samples, 38 binary features and 2 classes;
dexter_filtered: composed of 300 samples, 48 binary features and 2 classes;
spambase: composed of 4601 samples, 57 binary features and 2 classes;
sonar: composed of 208 samples, 60 binary features and 2 classes;
madelon: composed of 2000 samples, 500 binary features and 2 classes.
V-C Results
The feature selection problem may have cost functions with chains that present oscillations, and there is no theoretical guarantee of the existence of alternative chains to reach the local minima lost because of the oscillations. However, these cases were tested experimentally and in all observed cases the minimum exhausting procedure could find the local minimum elements using alternative chains. We have examined random curves in all data sets studied. For example, in the W-operator window design almost curves () contain oscillatory parts and in the biological classifier design almost curves () contain oscillatory parts. For all these oscillatory curves and also for those found in the UCI data sets, the minimum exhausting procedure reached the local minimum by alternative chains.
The results of the U-curve algorithm are divided in two sets: (i) until it beats the SFFS result (UC); (ii) until the search space is completely processed (UCC). The U-curve algorithm is stochastic and, at each test, it can reach the best result in a different processing time. So, the U-curve algorithm was processed times for each test and the quantitative results presented are means of the values obtained in these runs. The machine used for the tests was an AMD Turion 64 with 2 GB of RAM.
In the following, each of the three experiments performed is summarized by a table and all these tables have the same structure. The first column presents the winner of the comparison of SFFS with UC. The other columns present the cost in terms of processed nodes and computational time of SFFS, UC and UCC.
Table I shows the results for the W-operator window design experiment. Twenty tests were performed using the available training samples. UC beats SFFS in 8 of the 20 tests and reaches the same result in the remaining ones. In these last cases, both reach the global minimum element. In all cases, UC processes a smaller number of nodes, in a smaller time, than SFFS. The complete search (UCC) frequently needs to process more nodes (), taking more time (), than SFFS.
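Test | Winner | Computed nodes (SFFS / UC / UCC) | Time in sec. (SFFS / UC / UCC)
1 | EQUAL | |
2 | EQUAL | |
3 | EQUAL | |
4 | UC | |
5 | UC | |
6 | UC | |
7 | EQUAL | |
8 | UC | |
9 | UC | |
10 | EQUAL | |
11 | EQUAL | |
12 | EQUAL | |
13 | EQUAL | |
14 | UC | |
15 | EQUAL | |
16 | UC | |
17 | EQUAL | |
18 | UC | |
19 | EQUAL | |
20 | EQUAL | |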
Table II shows the results for the biological classifier design experiment. Ten tests were performed using different target genes. In these examples, the complete search space is quite big ($2^{27}$ nodes). SFFS reaches the best element, equalling UC, only three times. The processing of the whole space (UCC) improved the result of UC in times. UC processed many more nodes than SFFS, but their computational times are very similar. This happens because these experiments involve a small number of samples and, therefore, the computational time spent to process a node is very small. The preprocessing overhead accounts for most of the time consumed in this case.
Test | Winner | Computed nodes (SFFS / UC / UCC) | Time in sec. (SFFS / UC / UCC)
1 | EQUAL | |
2 | UC | |
3 | UC | |
4 | UC | |
5 | UC | |
6 | EQUAL | |
7 | EQUAL | |
8 | UC | |
9 | UC | |
10 | UC | |
Table III shows the results of 8 tests using public data sets. For each test, the value in parentheses is the number of features (n) in the data set. For tests with a high number of features, the results for the complete search (UCC) are not available. We can see that UC obtained better results than SFFS in six of the tests and equal results in two tests with a small number of features. In these two cases, SFFS reaches the best result but UC reaches it faster, processing fewer nodes.
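Test | Winner | Computed nodes (SFFS / UC / UCC) | Time in sec. (SFFS / UC / UCC)
pendigits (16) | EQUAL | |
votes (16) | EQUAL | |
ionosphere (34) | UC | UCC: NA | UCC: NA
dorothea_filtered (37) | UC | UCC: NA | UCC: NA
dexter_filtered (48) | UC | UCC: NA | UCC: NA
spambase (57) | UC | UCC: NA | UCC: NA
sonar (60) | UC | UCC: NA | UCC: NA
madelon (500) | UC | UCC: NA | UCC: NA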
These results show that UC is more efficient than SFFS for low order problems, obtaining the same results with less processing. For high order problems, UC is more accurate, but in some cases it processes more nodes and takes more time.
VI Conclusion
This paper introduces a new combinatorial problem, the Boolean U-curve optimization problem, and presents a stochastic branch-and-bound solution for it, the U-curve algorithm. This algorithm gives the optimal elements of a cost function decomposable in U-shaped chains, which may even be oscillatory in a given sense. This model allows describing the feature selection problem in the context of pattern recognition. Thus, the U-curve algorithm constitutes a new tool to approach feature selection problems.
The U-curve algorithm explores the particular structures of the domain and of the cost function. The Boolean nature of the domain allows representing the search space by a collection of upper and lower restrictions. At each iteration, a beginning-of-chain node is computed from the search space restrictions. The currently explored chain is constructed from this node by choosing upper or lower adjacent nodes. The choice of a beginning of chain and of an adjacent node usually has several options, and one of them is taken randomly. The cost function and domain structure allow making cuts in the search space when a local minimum is found in a chain. After a local minimum is found, all local minimum nodes connected to it are computed by the minimum exhausting procedure, and the corresponding cuts, by up-down intervals, are executed. The adjacency and connectivity relations adopted are the ones of the search space Hasse diagram, which is a graph in which the connectivity is induced by the partial order relation. The minimum exhausting procedure prevents a node from being visited more than once and generalizes the algorithm to cost functions decomposable in some class of U-shaped oscillatory chain functions. The procedures of the U-curve algorithm are supported by formal results.
In fact, the U-curve optimization technique constitutes a new framework to study a family of optimization problems. The restriction representation and the interval cuts, based on Boolean lattice properties, constitute a new optimization structure for combinatorial problems, with properties not found in conventional tree representations.
The U-curve algorithm was applied to practical problems and compared to SFFS. The experiments involved window operator design, genetic network identification and six public data sets obtained from the UCI repository. In all experiments, the results of the U-curve algorithm were equal to or better than those obtained from SFFS in precision and, in many cases, also in performance. The U-curve results used for comparison are the mean of several executions on the same input data, since it is a stochastic algorithm whose performance may differ from run to run.
The efficiency of the U-curve algorithm depends on the relative position of the local minima in the search space. The algorithm is more efficient when the local minima are near the extremities of the search space. The worst cases are those in which the local minima are near the middle of the lattice.
The results obtained so far are encouraging, but the present version of the U-curve algorithm is not a fast solution for high-dimensional problems with many local minima in the center of the search space lattice. Addressing these problems efficiently within the U-curve optimization approach opens several subjects for future research, including: developing additional cuts for the branch-and-bound formulation; designing and estimating distributions for the random parameters used in the choice of beginning nodes or adjacent paths in the construction of a chain, with the goal of reaching the best nodes earlier; and building parallelized versions of the algorithm.
Appendix
Theorem 1. For every ,
.
Theorem 2. The element of returned by the minimal construction process (Algorithm 4) is a minimal element in
By looking into the steps of the minimal construction procedure:

Proving that the resulting element is minimal in is equivalent to proving that .

Let and be the step of the procedure when the index is chosen to be removed from . and imply that , i.e., cannot be removed from at the end of step . This is avoided by the algorithm (lines 8–12), when there exists an element with . As , then and, by Theorem 1, . This implies that is a minimal element in .
Theorem 3. Let be the chain constructed by Algorithm 2 (or its dual version). Let be the cost function from to decomposable in Ushaped curves and . It is true that,
.
Suppose that and . It contradicts the hypothesis that is a function decomposable in U-shaped curves, since , but is either or , contradicting .
Acknowledgement
The authors are grateful to FAPESP (99/127652, 01/094010, 04/039670 and 05/005875), CNPq (300722/982, 468 413/006, 521097/010 474596/044 and 491323/050) and CAPES for financial support. This work was partially supported by grant 1 D43 TW0701501 from the National Institutes of Health, USA. We also thank Helena Brentani for her help with the data for biological analysis and Roberto M. Cesar Jr. for his help with the SFFS comparisons. The data sets used to generate the Table III results were obtained from the UCI Machine Learning Repository [1].
References
 [1] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007.
 [2] G. J. F. Banon and J. Barrera. Minimal representations for translation-invariant set mappings by mathematical morphology. SIAM J. Appl. Math., 51(6):1782–1798, 1991.
 [3] J. Barrera, R. M. Cesar Jr, D. C. Martins Jr, R. Z. N. Vencio, E. F. Merino, M. M. Yamamoto, F. G. Leonardi, C. A. B. Pereira, and H. A. del Portillo. Constructing probabilistic genetic networks of Plasmodium falciparum from dynamical expression signals of the intraerythrocytic development cycle, chapter 2, pages 11–26. Springer, 2006.
 [4] J. Barrera and G. P. Salas. Set operations on collections of closed intervals and their applications to the automatic programming of morphological machines. Electronic Imaging, 5(3):335–352, 1996.
 [5] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, volume 1, pages 1–19. Wiley-Interscience, 2nd edition, 2000.
 [6] A. Frank, D. Geiger, and Z. Yakhini. A distance-based branch and bound feature selection algorithm. In Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence (UAI-03), pages 241–248, San Francisco, CA, 2003. Morgan Kaufmann Publishers.
 [7] A. K. Jain, R. P. W. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4–37, 2000.
 [8] D. C. Martins Jr, R. M. Cesar Jr, and J. Barrera. W-operator window design by minimization of mean conditional entropy. Pattern Analysis & Applications, 9:139–153, 2006.
 [9] C. Lin, A. Ström, V. B. Vega, S. L. Kong, A. L. Yeo, J. S. Thomsen, W. C. Chan, B. Doray, D. K. Bangarusamy, A. Ramasamy, L. A. Vergara, S. Tang, A. Chong, V. B. Bajic, L. D. Miller, J. Gustafsson, and E. T. Liu. Discovery of estrogen receptor target genes and response elements in breast tumor cells. Genome Biology, 5(9):1–18, 2004.
 [10] S. Nakariyakul and D. P. Casasent. Adaptive branch & bound algorithm for selecting optimal features. Pattern Recognition Letters, 28:1415–1427, 2007.
 [11] P. Pudil, J. Novovičová, and J. Kittler. Floating search methods in feature selection. Pattern Recognition Letters, 15:1119–1125, 1994.
 [12] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656, July, October 1948.
 [13] P. Somol and P. Pudil. Fast branch & bound algorithms for optimal feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(7):900–912, July 2004.
 [14] Z. Wang, J. Yang, and G. Li. An improved branch & bound algorithm in feature selection. In Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing: 9th International Conference, Lecture Notes in Computer Science, pages 549–556, Chongqing, China, May 2003. Springer Berlin / Heidelberg.
 [15] S. Yang and P. Shi. Bidirectional automated branch and bound algorithm for feature selection. Journal of Shanghai University, 9(3):244–248, 2005.
Marcelo Ris received a B.Sc. in Computer Science (Universidade de São Paulo - USP, Brazil) and an M.Sc. in Computer Science (Universidade de São Paulo - USP, Brazil). His main research topics are in bioinformatics, including algorithm design for gene network identification, pattern recognition for computer vision and algorithm parallelism. He is currently a Ph.D. student in Bioinformatics at Universidade de São Paulo - USP, Brazil and Hospital do Câncer, Brazil.
Junior Barrera received a B.Sc. in Electrical Engineering (Universidade de São Paulo - USP, Brazil), an M.Sc. in Applied Computing (Instituto Nacional de Pesquisas Espaciais - INPE, Brazil) and a Ph.D. in Electrical Engineering (Universidade de São Paulo - USP, Brazil). His main research topics are the study of lattice operator representation and design, lattice dynamical systems, image processing, bioinformatics and computational biology. He is currently a full professor at the Department of Computer Science of IME - USP and president of the Brazilian Society for Bioinformatics and Computational Biology.
David C. Martins Jr received a B.Sc. in Computer Science (Universidade de São Paulo - USP, Brazil) and an M.Sc. in Computer Science (Universidade de São Paulo - USP, Brazil). His main research topics are in pattern recognition for computer vision and bioinformatics, including but not limited to gene network identification. He is currently a Ph.D. student in Computer Science at IME - USP and recently spent one year on a research internship at the Genomics Signal Processing Laboratory, Texas A&M University.