Evolutionary Dataset Optimisation: learning algorithm quality through evolution

Henry Wilde Vincent Knight Jonathan Gillard

In this paper we propose a new method for learning how algorithms perform. Classically, algorithms are compared on a finite number of existing (or newly simulated) benchmark data sets based on some fixed metrics. The algorithm(s) with the smallest value of this metric are chosen to be the ‘best performing’.

We offer a new approach to flip this paradigm. We instead aim to gain a richer picture of the performance of an algorithm by generating artificial data through genetic evolution, the purpose of which is to create populations of datasets for which a particular algorithm performs well. These datasets can then be studied to learn which attributes lead to a particular level of performance for a given algorithm.

Following a detailed description of the algorithm as well as a brief description of an open source implementation, a number of numeric experiments are presented to show the performance of the method which we call Evolutionary Dataset Optimisation.

1 Introduction

This work presents a novel approach to learning the quality and performance of an algorithm through the use of evolution. When an algorithm is developed to solve a given problem, the designer is presented with questions about the performance of their proposed method, and its relative performance against existing methods. This is an inherently difficult task. However, under the current paradigm, the standard response to this situation is to use a known fixed set of datasets - or simulate new data sets themselves - and a common metric amongst the proposed method and its competitors. The algorithm is then assessed based on this metric with often minimal consideration for both the appropriateness or reliability of the datasets being used, and the robustness of the method in question.

This notion is not so easily observed when travelling in the opposite direction. Suppose that, instead, the benchmark was a dataset of particular interest and a preferable algorithm was to be determined for some task. There exist a number of methods employed across disciplines to complete this task that take into account the characteristics of the data and the context of the research problem. These methods include the use of diagnostic tests. For instance, in the case of clustering, if the data displayed an indeterminate number of non-convex blobs, then one could recommend that an appropriate clustering algorithm would be DBSCAN [6]. Otherwise, for scalability, k-means may be chosen [28].

The approach presented in this work aims to flip the paradigm described here by allowing the data itself to be unfixed. This fluidity in the data is achieved by generating data for which the algorithm performs well (or better than some other) through the use of an evolutionary algorithm. The purpose of doing so is not to simply create a bank of useful datasets but rather to allow for the subsequent studying of these datasets. In doing so, the attributes and characteristics which lead to the success (or failure) of the algorithm may be described, giving a broader understanding of the algorithm on the whole. Our framework is described in Figure 1.

Figure 1: On the right: the current path for selecting some algorithm(s) based on their validity and performance for a given dataset. On the left: the proposed flip to better understand the space in which ‘good’ datasets exist for an algorithm.

This proposed flip has a number of motivations, and below is a non-exhaustive list of some of the problems that are presented by the established evaluation paradigm:

  1. How are these benchmark examples selected? There is no true measure of their reliability other than their frequent use. In some domains and disciplines there are well-established benchmarks so those found through literature may well be reliable, but in others less so.

  2. Sometimes, when there is a lack of benchmark examples, a ‘new’ dataset is simulated to assess the algorithm. This raises the question as to how and why that simulation is created. Not only this, but existing benchmarks are often chosen as a matter of convenience rather than on their merit.

  3. In disciplines where there are established benchmarks, there may still be underlying problems around the true performance of an algorithm:

    1. As an example, work by Torralba and Efros [27] showed that image classifiers trained and evaluated on a particular dataset, or datasets, did not perform reliably when evaluated using other benchmark datasets that were determined to be similar, leading to models which lack robustness.

    2. The amount of learning one can gain as to the characteristics of data which lead to good (or bad) performance of an algorithm is constrained to the finite set of attributes present in the benchmark data chosen in the first place.

Evolutionary algorithms (EAs) have been applied successfully to solve a wide array of problems - particularly where the complexity of the problem or its domain is significant. These methods are highly adaptive and their population-based construction (displayed in Figure 2) allows for the efficient solving of problems that are otherwise beyond the scope of traditional search and optimisation methods.

Figure 2: A general schematic for an evolutionary algorithm.

The use of EAs to generate artificial data is not a new concept. Its applications in data generation have included developing methods for the automated testing of software [11, 17, 22] and the synthesis of existing or confidential data [3]. Such methods also have a long history in the parameter optimisation of algorithms, and recently in the automated design of convolutional neural network (CNN) architecture [24, 25].

Other methods for the generation or synthesis of artificial data include simulated annealing [15] and generative adversarial networks (GANs) [7]. The unconstrained learning style of methods such as CNNs and GANs aligns with that proposed in this work. By allowing the EA to explore and learn about the search space in an organic way, less-prejudiced insight can be established that is not necessarily reliant on any particular framework or agenda.

Note that the proposed methodology is not simply to use an EA to optimise an algorithm over a search space with fixed dimension or datatype such as those set out in [3]. The size and sample space itself is considered as a property that can be traversed through the algorithm.

2 The evolutionary algorithm

In this section, the details of an algorithm that generates data for which a given function (or, equivalently, an algorithm) is well suited are described. This algorithm is referred to as “Evolutionary Dataset Optimisation” (EDO).

The EDO method is built as an evolutionary algorithm which follows a traditional (generic) schema with some additional features that keep the objective of artificial data generation in mind. With that, there are a number of parameters that are passed to EDO. The typical parameters of an evolutionary algorithm are a fitness function, which maps from an individual to a real number, as well as a population size, a maximum number of iterations, a selection parameter, and a mutation probability. In addition to these, EDO takes the following parameters:

  • A set of probability distribution families. Each family in this set has some parameter limits which form a part of the overall search space. For instance, the family of normal distributions would have limits on values for the mean, μ, and the standard deviation, σ.

  • A maximum number of “subtypes” for each family in the set. A subtype is an independent copy of the family that progresses separately from the others. These are the actual distribution objects which are traversed in the optimisation.

  • A probability vector with which to sample distributions from the set of families.

  • Limits on the number of rows an individual dataset can have.

  • Limits on the number of columns a dataset can have, given for each family. That is, these limits define the minimum and maximum number of columns a dataset may have from each distribution family in the set.

  • A second selection parameter to allow for a small proportion of ‘lucky’ individuals to be carried forward.

  • A shrink factor defining the relative size of a component of the search space to be retained after adjustment.

The concepts discussed in this section form the mechanisms of the evolutionary dataset optimisation algorithm. To use the algorithm practically, these components have been implemented in Python as a library built on the scientific Python stack [16, 19]. The library is fully tested and documented (at https://edo.readthedocs.io) and is freely available online under the MIT license [26]. The EDO implementation was developed to be consistent with the current best practices of open source software development [10].

Output: A full history of the populations and their fitnesses.
      create initial population of individuals
      find fitness of each individual
      record population and its fitness
      while current iteration less than the maximum and stopping condition not met do
            select parents based on fitness and selection proportions
            use parents to create new population through crossover and mutation
            find fitness of each individual
            update population and fitness histories
            if adjusting the mutation probability then
                  update mutation probability
            end if
            if using a shrink factor then
                  shrink the mutation space based on parents
            end if
      end while
Algorithm 1 The evolutionary dataset optimisation algorithm

Input: parents
Output: A new population of the required size
      add parents to the new population
      while the size of the new population is less than the population size do
            sample two parents at random
            create an offspring by crossing over the two parents
            mutate the offspring according to the mutation probability
            add the mutated offspring to the population
      end while
Algorithm 2 Creating a new population
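
As an illustration only, the high-level structure of Algorithms 1 and 2 can be sketched in a few lines of Python. This is not the interface of the EDO library: every function name here is hypothetical, the optional stopping, mutation-adjustment and shrinking steps are omitted, selection is reduced to plain truncation, and fitness is assumed to be minimised. The toy individuals are short lists of numbers rather than datasets.

```python
import random


def select_parents(population, fitnesses, n_best):
    """Plain truncation selection: keep the n_best lowest-fitness individuals."""
    ranked = sorted(range(len(population)), key=lambda i: fitnesses[i])
    return [population[i] for i in ranked[:n_best]]


def create_new_population(parents, size, crossover, mutate, rng):
    """Keep the parents, then add mutated offspring until the size is reached."""
    population = list(parents)
    while len(population) < size:
        first, second = rng.sample(parents, 2)  # needs at least two parents
        population.append(mutate(crossover(first, second, rng), rng))
    return population


def run_evolution(fitness, create, crossover, mutate, size, n_best, max_iter, seed=0):
    """Return the full history of (population, fitnesses) pairs."""
    rng = random.Random(seed)
    population = [create(rng) for _ in range(size)]
    fitnesses = [fitness(ind) for ind in population]
    history = [(population, fitnesses)]
    for _ in range(max_iter):
        parents = select_parents(population, fitnesses, n_best)
        population = create_new_population(parents, size, crossover, mutate, rng)
        fitnesses = [fitness(ind) for ind in population]
        history.append((population, fitnesses))
    return history


# Toy problem (an assumption for demonstration): minimise the sum of squares.
def toy_create(rng):
    return [rng.uniform(-1, 1) for _ in range(3)]


def toy_crossover(first, second, rng):
    return [rng.choice(pair) for pair in zip(first, second)]


def toy_mutate(individual, rng, prob=0.1):
    return [rng.uniform(-1, 1) if rng.random() < prob else x for x in individual]


def toy_fitness(individual):
    return sum(x * x for x in individual)


history = run_evolution(
    toy_fitness, toy_create, toy_crossover, toy_mutate, size=20, n_best=5, max_iter=10
)
best_per_generation = [min(fits) for _, fits in history]
```

Because the parents are carried into each new population, the best fitness in this sketch can never worsen from one generation to the next.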

The statement of the EDO algorithm is presented here to lay out its general structure from a high level perspective. Lower level discussion is provided below where additional algorithms for the individual creation, evolutionary operator and shrinkage processes are given along with diagrams (where appropriate). Note that there are no defined processes for how to stop the algorithm or adjust the mutation probability. This is down to their relevance to a particular use case. Some examples include:

  • Regular decreasing in mutation probability across the available attributes [12].

  • Stopping when no improvement in the best fitness is found within some consecutive iterations [13].

  • Utilising global behaviours in fitness to indicate a stopping point [14].

2.1 Individuals

Evolutionary algorithms operate in an iterative process on populations of individuals that each represent a solution to the problem in question. In a genetic algorithm, an individual is a solution encoded as a bit string of, typically, fixed length and treated as a chromosome-like object to be manipulated. In EDO, as the objective is to generate datasets and explore the space in which datasets exist, there is no encoding. As such the distinction is made that EDO is an evolutionary algorithm.

As is seen in Figure 3, an individual’s creation is defined by the generation of its columns. A set of instructions on how to sample new values (in mutation, for instance, Section 2.4) for that column are recorded in the form of a probability distribution. These distributions are sampled and created from the families passed to EDO. In EDO, the produced datasets and their metadata are manipulated directly so that the biological operators can be designed and be interpreted in a more meaningful way as will be seen later in this section.

However, one should not assume that the columns are a reliable representative of the distribution associated with them, or vice versa. This is particularly true of ‘shorter’ datasets with a small number of rows, whereas confidence in the pair could be given more liberally for ‘longer’ datasets with a larger number of rows. In any case, appropriate methods for analysis should be employed before formal conclusions are made.

Figure 3: An example of how an individual is first created.
Output: An individual defined by a dataset and some metadata
      sample a number of rows and columns
      create an empty dataset
      for each column in the dataset do
            sample a distribution from the set of families
            create an instance of the distribution
            fill in the column by sampling from this instance
            record the instance in the metadata
      end for
Algorithm 3 Creating an individual
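
A minimal sketch of Algorithm 3 in plain Python follows. It makes several simplifying assumptions: the `Uniform` class and `create_individual` function are hypothetical names, a dataset is held as a list of columns, and a single pair of column limits is used rather than one pair per family.

```python
import random
from dataclasses import dataclass


@dataclass
class Uniform:
    """One instance of a uniform family; its bounds are its parameters."""

    lower: float
    upper: float
    limits = (0.0, 1.0)  # hypothetical parameter limits shared by both bounds

    @classmethod
    def make_instance(cls, rng):
        lower, upper = sorted(rng.uniform(*cls.limits) for _ in range(2))
        return cls(lower, upper)

    def sample(self, n_rows, rng):
        return [rng.uniform(self.lower, self.upper) for _ in range(n_rows)]


def create_individual(families, weights, row_limits, col_limits, rng):
    """Return (columns, metadata) where metadata[j] generated columns[j]."""
    n_rows = rng.randint(*row_limits)
    n_cols = rng.randint(*col_limits)
    columns, metadata = [], []
    for _ in range(n_cols):
        family = rng.choices(families, weights=weights)[0]
        instance = family.make_instance(rng)
        columns.append(instance.sample(n_rows, rng))
        metadata.append(instance)  # instructions for sampling new values later
    return columns, metadata


rng = random.Random(1)
columns, metadata = create_individual([Uniform], [1.0], (3, 10), (2, 2), rng)
```

Keeping the distribution instance alongside each column is what later allows crossover and mutation to draw fresh values for that column.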

2.2 Selection

The selection operator describes the process by which individuals are chosen from the current population to generate the next. Almost always, the likelihood of an individual being selected is determined by their fitness. This is because the purpose of selection is to preserve favourable qualities and encourage some homogeneity within future generations [2].

Figure 4: The selection process with the inclusion of some lucky individuals.
Input: population, population fitness, selection parameter, second selection parameter
Output: A set of parent individuals
      calculate and sort the population by the fitness of its individuals
      take the first individuals, as set by the selection parameter, and make them parents
      if there are any individuals left then
            take the next individuals, as set by the second selection parameter, and make them parents
      end if
Algorithm 4 The selection process

In EDO, a modified truncation selection method is used [9], as can be seen in Figure 4. Truncation selection takes a fixed number of the fittest individuals in a population and makes them the ‘parents’ of the next. It has been observed that, despite its efficiency as a selection operator, truncation selection can lead to premature convergence at local optima [9, 18]. The modification for EDO is an optional stage after the best individuals have been chosen: with some small second selection parameter, a number of the remaining individuals can be selected at random to be carried forward. Allowing for a small number of randomly selected individuals in this way may encourage diversity and further exploration throughout the run of the algorithm. It should be noted that regardless of this step, an individual could potentially be present throughout the entirety of the algorithm.
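
The modified truncation selection can be sketched as below. The names `best_prop` and `lucky_prop` are hypothetical stand-ins for the two selection parameters, fitness is assumed to be minimised, and rounding the proportions up to whole individuals is an assumption of this sketch rather than a detail taken from the text.

```python
import math
import random


def select_parents(population, fitnesses, best_prop, lucky_prop, rng):
    """Take the fittest individuals, then some 'lucky' ones from the rest."""
    size = len(population)
    n_best = math.ceil(best_prop * size)
    n_lucky = math.ceil(lucky_prop * size)

    # Rank indices from best (lowest fitness) to worst.
    ranked = sorted(range(size), key=lambda i: fitnesses[i])
    best, rest = ranked[:n_best], ranked[n_best:]

    # Lucky individuals are drawn at random from whoever is left, if anyone.
    lucky = rng.sample(rest, min(n_lucky, len(rest)))
    return [population[i] for i in best + lucky]


rng = random.Random(0)
population = list("abcdefghij")
fitnesses = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
parents = select_parents(population, fitnesses, 0.2, 0.1, rng)
```

Here the two fittest individuals are always kept, and one further individual is drawn from the remaining eight regardless of its fitness.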

After the parents have been selected, there are two adjustments made to the current search space. The first is that the subtypes for each family are updated to only those present in the parents. The second adjustment is a process which acts on the distribution parameter limits for each subtype. This adjustment gives the ability to ‘shrink’ the search space about the region observed in a given population. This method is based on a power law described in [1] that relies on a shrink factor. At each iteration, every distribution subtype which is present in the parents has the limits of each of its parameters adjusted. This adjustment is such that the new limits are centred about the mean observed value for that parameter:


The shrinking process is given explicitly in Algorithm 5. Note that the behaviour of this process can produce reductive results for some use cases and is optional.

Input: parents, current iteration, shrink factor
Output: A new mutation space focussed around the parents
      for each distribution subtype do
            for each parameter of the distribution do
                  get the current values for parameter over all parent columns
                  find the mean of the current values
                  find the new lower (1) and upper (2) bounds around the mean
                  set the parameter limits
            end for
      end for
Algorithm 5 Shrinking the mutation space
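
To make the shrinking step concrete, the sketch below applies one plausible power-law update to a single parameter. The exact form of equations (1) and (2) is not reproduced in this text, so the update used here (an interval centred on the mean, with half-width equal to half the current width multiplied by the shrink factor raised to the iteration number, and clipped to the current limits) is an assumption, as is the function name.

```python
def shrink_limits(lower, upper, values, shrink, iteration):
    """Return new (lower, upper) limits centred on the mean of the values.

    Assumed update: the new interval has half-width
    0.5 * (upper - lower) * shrink ** iteration and never extends
    beyond the current limits.
    """
    mean = sum(values) / len(values)
    half_width = 0.5 * (upper - lower) * shrink**iteration
    return max(lower, mean - half_width), min(upper, mean + half_width)


# Parent columns used values 0.4 and 0.6 for a parameter limited to [0, 1].
new_limits = shrink_limits(0.0, 1.0, [0.4, 0.6], shrink=0.5, iteration=1)
```

Under this assumed form a shrink factor close to one narrows the space slowly, while a small shrink factor collapses it around the parents within a few iterations, which is consistent with the remark that shrinking can be reductive for some use cases.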

2.3 Crossover

Crossover is the operation of combining two individuals in order to create at least one offspring. In genetic algorithms, the term ‘crossover’ can be taken literally: two bit strings are crossed at a point to create two new bit strings. Another popular method is uniform crossover, which has been favoured for its efficiency and efficacy in combining individuals [21]. For EDO, this method is adapted to support dataset manipulation: a new individual is created by uniformly sampling each of its components (dimensions and then columns) from a set of two ‘parent’ individuals, as shown in Figure 5.

Figure 5: The crossover process between two individuals with different dimensions.

Observe that there is no requirement on the dimensions of the parents to be of similar or equal shapes. This is because the driving aim of the proposed method is to explore the space of all possible datasets. In the case where there is incongruence in the lengths of the two parents, missing values may appear in a shorter column that is sampled. To resolve this, values are sampled from the probability distribution associated with that column to fill in these gaps.

Input: Two parents
Output: An offspring made from the parents ready for mutation
      collate the columns and metadata from each parent in a pool
      sample each dimension from between the parents uniformly
      form an empty dataset with these dimensions
      for each column in the dataset do
            sample a column (and its corresponding metadata) from the pool
            if this column is longer than required then
                  randomly select entries and delete them as needed
            end if
            if this column is shorter than required then
                  sample new values from the metadata and append them to the column as needed
            end if
            add this column to the dataset and record its metadata
      end for
Algorithm 6 The crossover process
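
The crossover process might look as follows in plain Python. As assumptions for brevity, an individual is a pair `(columns, metadata)`, every column comes from a uniform distribution whose metadata is simply its `(lower, upper)` bounds, and columns are drawn from the pool with replacement; the text does not specify the last point.

```python
import random


def crossover(parent1, parent2, rng):
    """Create an offspring by sampling dimensions and columns from two parents."""
    (cols1, meta1), (cols2, meta2) = parent1, parent2
    pool = list(zip(cols1 + cols2, meta1 + meta2))

    # Each dimension is inherited from one parent or the other.
    n_rows = rng.choice([len(cols1[0]), len(cols2[0])])
    n_cols = rng.choice([len(cols1), len(cols2)])

    columns, metadata = [], []
    for _ in range(n_cols):
        column, (lower, upper) = rng.choice(pool)
        column = list(column)  # copy so the parent is left untouched
        while len(column) > n_rows:  # too long: delete random entries
            column.pop(rng.randrange(len(column)))
        while len(column) < n_rows:  # too short: sample from the metadata
            column.append(rng.uniform(lower, upper))
        columns.append(column)
        metadata.append((lower, upper))
    return columns, metadata


rng = random.Random(0)
parent1 = ([[0.1, 0.2, 0.3], [0.5, 0.6, 0.7]], [(0.0, 0.5), (0.5, 1.0)])
parent2 = ([[0.9, 0.8, 0.7, 0.6, 0.5]], [(0.5, 1.0)])
offspring_columns, offspring_metadata = crossover(parent1, parent2, rng)
```

The offspring always has a rectangular shape, even though the parents here have shapes of 3 by 2 and 5 by 1.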

2.4 Mutation

Mutation is used in evolutionary algorithms to encourage a broader exploration of the search space at each generation. Under this framework, the mutation process manipulates the phenotype of an individual where numerous things need to be modified including an individual’s dimensions, column metadata and the entries themselves. This process is described in Figure 6.

Figure 6: The stages of the mutation process.

As shown in Figure 6, each of the potential mutations occurs with the same probability. However, the way in which columns are maintained ensures that (assuming appropriate parameter choices) many mutations in the metadata and the dataset itself will only result in some incremental change in the individual’s fitness relative to, say, a completely new individual.

Input: An individual, mutation probability, row limits, column limits, distribution families, probability vector
Output: A mutated individual
      sample a random number
      if the number is less than the mutation probability and adding a row would not violate the row limits then
            sample a value from each distribution in the metadata
            append these values as a row to the end of the dataset
      end if
      sample a new random number
      if the number is less than the mutation probability and removing a row would not violate the row limits then
            remove a row at random from the dataset
      end if
      sample a new random number
      if the number is less than the mutation probability and adding a new column would not violate the column limits then
            create a new column using the distribution families and probability vector
            append this column to the end of the dataset
      end if
      sample a new random number
      if the number is less than the mutation probability and removing a column would not violate the column limits then
            remove a column (and its associated metadata) at random from the dataset
      end if
      for each distribution in the metadata do
            for each parameter of the distribution do
                  sample a random number
                  if the number is less than the mutation probability then
                        sample a new value from within the distribution parameter limits
                        update the parameter value with this new value
                  end if
            end for
      end for
      for each entry in the dataset do
            sample a random number
            if the number is less than the mutation probability then
                  sample a new value from the associated column distribution
                  update the entry with this new value
            end if
      end for
Algorithm 7 The mutation process
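
A compact sketch of the mutation stages is given below under the same simplifying assumptions as before: an individual is a pair `(columns, metadata)`, the only family is the uniform distribution with `(lower, upper)` bounds as its parameters, and there is one pair of column limits with a minimum of at least one column. Re-sorting the bounds after a parameter mutation is an extra assumption made so that the distribution stays valid.

```python
import random


def mutate(individual, prob, row_limits, col_limits, param_limits, rng):
    """Apply each mutation stage in turn, each with probability prob."""
    columns = [list(col) for col in individual[0]]
    metadata = list(individual[1])
    n_rows = len(columns[0])

    # Stage 1: dimensions (add row, remove row, add column, remove column).
    if rng.random() < prob and n_rows < row_limits[1]:
        for col, (lower, upper) in zip(columns, metadata):
            col.append(rng.uniform(lower, upper))
        n_rows += 1
    if rng.random() < prob and n_rows > row_limits[0]:
        index = rng.randrange(n_rows)
        for col in columns:
            del col[index]
        n_rows -= 1
    if rng.random() < prob and len(columns) < col_limits[1]:
        lower, upper = sorted(rng.uniform(*param_limits) for _ in range(2))
        columns.append([rng.uniform(lower, upper) for _ in range(n_rows)])
        metadata.append((lower, upper))
    if rng.random() < prob and len(columns) > col_limits[0]:
        index = rng.randrange(len(columns))
        del columns[index], metadata[index]

    # Stage 2: distribution parameters held in the metadata.
    for j, params in enumerate(metadata):
        params = [
            rng.uniform(*param_limits) if rng.random() < prob else value
            for value in params
        ]
        metadata[j] = tuple(sorted(params))

    # Stage 3: individual entries, resampled from their column's distribution.
    for col, (lower, upper) in zip(columns, metadata):
        for i in range(n_rows):
            if rng.random() < prob:
                col[i] = rng.uniform(lower, upper)
    return columns, metadata


rng = random.Random(0)
individual = ([[0.3, 0.4, 0.5], [0.6, 0.7, 0.8]], [(0.2, 0.8), (0.5, 0.9)])
mutant = individual
for _ in range(50):
    mutant = mutate(mutant, 0.5, (2, 6), (1, 3), (0.0, 1.0), rng)
```

Because existing entries are kept unless they are individually resampled, most mutations leave the bulk of the dataset untouched, which is why the resulting change in fitness tends to be incremental.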

3 Examples

3.1 k-means clustering

The following examples act as a form of validation for EDO, and also highlight some of the nuances in its use. The examples will be focussed around the clustering of data and, in particular, the k-means (Lloyd’s) algorithm. Clustering was chosen as it is a well-understood problem that is easily accessible - especially when restricted to two dimensions. The k-means algorithm is an iterative, centroid-based method that aims to minimise the ‘inertia’ of the current partition of some dataset:


A full statement of the algorithm to minimise (3) is given in A.1.

This inertia function is taken as the objective of the k-means algorithm, and is used for evaluating the final clustering. This is particularly true when the algorithm is not being considered an unsupervised classifier where accuracy may be used [8]. With that, the first example is to use this inertia as the fitness function in EDO. That is, to find datasets which minimise the inertia.
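
For reference, one way to compute such an inertia from a finished clustering is sketched below. The form used here, the sum over clusters of the mean squared Euclidean distance between each point and its cluster centre, is an assumption chosen to match the description of the inertia as a sum of mean errors; it may differ in detail from equation (3). The function name is hypothetical.

```python
def inertia(points, labels):
    """Sum over clusters of the mean squared distance to the cluster centre."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        # The centre is the coordinate-wise mean of the cluster's points.
        centre = [sum(coords) / len(members) for coords in zip(*members)]
        squared = [
            sum((x - c) ** 2 for x, c in zip(point, centre)) for point in members
        ]
        total += sum(squared) / len(members)
    return total


spread = inertia([(0, 0), (0, 2), (5, 5)], [0, 0, 1])
compact = inertia([(1, 1), (1, 1), (4, 4)], [0, 0, 1])
```

The second call illustrates why this fitness function rewards compaction: any dataset whose clusters each collapse to a single location attains the minimum value of zero.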

For the purposes of visualisation, in this example EDO is restricted to only two-dimensional datasets. In addition to this, all columns are formed from uniform distributions where the bounds are sampled from the unit interval. Thus, the only family in the set of distribution families is:


The remaining parameters are as follows: , , , , , , and shrinkage excluded. Figure 7 shows an example of the fitness (above) and dimension (below) progression of the evolutionary algorithm under these conditions up until the epoch.

There is a steep learning curve here; within the first 50 generations an individual is found whose fitness could not be improved on for a further 900 epochs. The same quick convergence is seen in the number of rows. This behaviour is quickly recognised as preferable and was dominant across all the trials conducted in this work. This preference for datasets with fewer rows makes sense given that the inertia is the sum of the mean error from each cluster centre. With that, when k is fixed a priori, reducing the number of points in each cluster (i.e. the terms of the second summation) quickly reduces the mean error of that cluster and thus the value of the inertia.


Figure 7: Progressions for final inertia and dimension across the first 50 epochs.

Something that may be seen as unwanted is a compaction of the cluster centres. Referring to Figure 8(a), the best and median individuals show two clusters that are essentially the same point whereas the worst is a random cloud across the whole of the unit square which was found in the initial population. The kind of behaviour exhibited by the best performing individuals occurs in part because it is allowed. There are two immediate ways in which this is allowed: first, that the ‘trivial’ case is included in the row limits and, secondly, that the fitness function does nothing to penalise the proximity of the inter-cluster means, as well as aiming to reduce the intra-cluster means. This kind of unwanted behaviour highlights a subtlety in how EDO should be used: experimentation and rigour are required to properly understand an algorithm’s quality.

Figure 8: Representative individuals based on inertia with: (a) the original row limits; (b) the adjusted row limits. Centroids displayed as crosses.

Hence, consider Figure 8(b) where the individuals have been generated with the same parameters as previously except with adjusted row limits so as to exclude this trivial case. In these trials, the results are equivalent: the worst performing individuals are without structure whilst the best-performing individuals display clusters that are dense about a single point despite the minimum number of rows being increased. Perhaps then, this compacted clustering is ‘optimal’.

However, more extensive study may be done. That is, the defined fitness function may require further attention. Indeed, the final inertia could be considered a flawed or fragile fitness function if it is supposed to evaluate the appropriateness or efficacy of the k-means algorithm. Incorporating the inter-cluster spread into the fitness of an individual dataset can reduce this observed compaction. The silhouette coefficient is a metric used to evaluate the appropriateness of a clustering to a dataset, and is given by the mean of the silhouette value of each point in each cluster:


The optimisation of the silhouette coefficient is analogous to finding a dataset which increases both the intra-cluster cohesion and the inter-cluster separation. Hence, the inertia is addressed by maximising cohesion. Meanwhile, the spread of the clusters themselves is considered by maximising separation.
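
The sketch below computes a silhouette coefficient in plain Python using the conventional definition: for each point, the mean distance to the other points in its own cluster is compared with the smallest mean distance to the points of any other cluster. This conventional form, the use of Euclidean distance, and the choice to give points in singleton clusters a value of zero are assumptions here and may differ in notation from equation (5).

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points; needs at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    values = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # convention for singleton clusters
            continue
        # Mean distance to the other members of the same cluster.
        within = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        between = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        scale = max(within, between)
        values.append(0.0 if scale == 0 else (between - within) / scale)
    return sum(values) / len(values)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
well_separated = silhouette_coefficient(points, [0, 0, 1, 1])
badly_assigned = silhouette_coefficient(points, [0, 1, 0, 1])
```

Unlike the inertia, this score drops when points assigned to different clusters sit close together, so it penalises the compaction of cluster centres seen in the first example.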

Repeating the trials with the same parameters as with inertia, the silhouette fitness function yields the results summarised in Figure 9. Irrespective of row limits, the datasets produced show increased separation from one another whilst maintaining low values in the final inertia of the clustering as shown in Figure 10. Again, the form of the individual clusters is much the same. The low values of inertia correspond to tight clusters, and the tightest clusters are those with a minimal number of points, i.e. a single point. As with the previous example, albeit at a much slower rate, the preferable individuals are those leading toward this case. That this gradual reduction in the dimension of the individuals occurs after the improvement of the fitness function bolsters the claim that the base case is also optimal.

However, due to the nature of the implementation, any individual from any generation may be retrieved and studied should the final results be too concentrated on any given case. This transparency in the history and progression of the proposed method is something that sets it apart from other methods of the same ilk such as GANs which have a reputation of providing so-called ‘black box’ solutions.

Figure 9: Progressions for silhouette and dimension across 1000 epochs at 100 epoch intervals.
Figure 10: Representative individuals based on silhouette with: (a) the original row limits; (b) the adjusted row limits. Centroids displayed as crosses.

3.2 Comparison with DBSCAN

The capabilities of EDO as a tool for understanding an algorithm are highlighted particularly when comparing an algorithm against another (or set of others) simultaneously. This is done by utilising the freedom of choice in a fitness function for EDO. Consider two algorithms and some common metric between them. Then understanding their similarities and contrasts can be done by considering the differences in this metric on the two algorithms. In terms of EDO, this means using the difference in the metric between the two algorithms, taken in either order, as the fitness function. By doing so, pitfalls, edge cases or fundamental conditions for the method can be highlighted. Overall, this process allows the researcher to more deeply learn about the method of interest.

As an example of this process, consider another clustering algorithm of a different form such as Density Based Spatial Clustering of Applications with Noise (DBSCAN) and suppose the objective is to find datasets for which k-means outperforms this alternative. Here there is no concept of inertia as DBSCAN is density-based and is able to identify outliers [6]. As such, a valid common metric must be chosen. One such metric is the silhouette score as defined in (5).

However, an adjustment to the fitness function must be made so as to accommodate the condition of the silhouette coefficient that there be more than one cluster present. Taking the silhouette coefficients of the clusterings found by k-means and DBSCAN respectively, the fitness function is defined to be:


There are several remarks to be made here. First, note the order of the subtraction here as EDO minimises fitness functions by default. Also, the fitness takes values in the range [-2, 2] where -2 is the best, i.e. when the silhouette coefficient of k-means is 1 and that of DBSCAN is -1. Likewise, 2 is the worst score. Finally, the silhouette coefficient requires at least two clusters to be present and so if DBSCAN identifies a single cluster then that individual will be penalised heavily under this fitness function when, in fact, that clustering may be of high quality. As such, this fitness function may require adjustment.
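
The structure of such a comparative fitness function can be sketched as follows. The function and argument names are hypothetical, a score of `None` is used here to signal that a silhouette coefficient could not be computed (fewer than two clusters), and the default penalty of infinity is a placeholder assumption since the penalty value from equation (6) is not reproduced in this text.

```python
def comparative_fitness(preferred_score, other_score, penalty=float("inf")):
    """Fitness to be minimised so that the preferred algorithm scores higher.

    Both scores are silhouette coefficients in [-1, 1], or None when the
    clustering has fewer than two clusters and no coefficient exists.
    """
    if preferred_score is None or other_score is None:
        return penalty
    # Subtract in this order because lower fitness is better.
    return other_score - preferred_score


best_case = comparative_fitness(preferred_score=1.0, other_score=-1.0)
worst_case = comparative_fitness(preferred_score=-1.0, other_score=1.0)
single_cluster = comparative_fitness(preferred_score=0.6, other_score=None)
```

Swapping the two arguments gives the fitness for the inverse question, namely which datasets favour the other algorithm.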

It must also be acknowledged that k-means and DBSCAN share no common parameters and so direct comparison is more difficult. For the purposes of this example, only one set of parameters is used but a thorough investigation should include a parameter sweep in cases such as these. The parameters being used are for k-means, and for DBSCAN. This set was chosen following informal experimentation using the Python library Scikit-learn [20] to find comparable parameters in the given search space defined by the EDO parameters used previously with .

Figure 11: Progressions for difference in silhouette (k-means-preferable) and dimension across 1000 epochs at 100 epoch intervals.
Figure 12: Representative individuals from a k-means-preferable run with clustering by: (a) k-means; (b) DBSCAN. Concave and convex hulls illustrated by shading and outline respectively.

Figure 11 shows a summary of the progression of EDO for this use case. As with the previous examples, the variation in the population fitness is unstable but there is a clear trend of improvement in the best individual over the course of the run. There is also a convergence seen in the number of rows a dataset has. The resting dimension varied across the trials conducted in this work but none exhibited a shift toward the lower limit of 50 rows as with previous examples. This is suggestive of a more competitive environment for individuals where slight changes to an individual can drastically alter their fitness.

The effect of such changes can be seen in Figure 12 where representative individuals are shown for this example. Here, the best performing individual, when clustered by k-means, shows three clear and nicely separated clusters. Note that they are not so tightly packed; again, this suggests that the route to an optimal individual is less clearly defined. In contrast, when the same dataset is clustered by DBSCAN a single cluster is found with a single noise point held within the convex hull of the cluster, i.e. there are overlapping clusters (since noise points form a single cluster). Hence, along with the fact that the larger cluster is widely spread, it follows that the clustering has a relatively small, negative silhouette coefficient.

Another point of interest here is the convexity of the clusters. One of the conditions for the success of k-means is that the presented clusters are of roughly equal size and are convex. This is due to the overall objective being to approximate the centroidal Voronoi tessellation [4]. Without this condition, up to the correct choice of k, the algorithm will fail to produce adequate results for either inertia or silhouette. DBSCAN, however, does not have this condition and is able to detect non-convex clusters so long as they are dense enough. Figure 12 shows both the clustering and the convex and concave hulls of the clusters found by each method. The ‘concave hull’ of a cluster is taken to be the α-shape of the cluster’s data points [5] where α is determined to be the smallest value such that all the points in the cluster are contained in a single polygon. The convexity of a cluster is then determined to be the ratio of the area of its concave hull to the area of its convex hull [23]:


With this definition, it should be clear that a perfectly convex cluster, such as a single point or line, would have a convexity of 1.

It can be seen that the convexity of the clustering found by k-means appears to be higher than that by DBSCAN. This was apparent across all trials conducted in this work and indicates that the condition for convex clusters is being sought out during the optimisation process. Meanwhile, however, it is not clear whether the performance of DBSCAN falls owing to its parameters or the method itself. This is a point where parameter sweeping would prove most useful so as to determine a crossing point for these two driving forces.

Now, to add to the discussion above, the inverse optimisation should be considered. That is, using the same parameters, the datasets for which DBSCAN outperforms $k$-means with respect to the silhouette coefficient are to be investigated. This is equivalent to negating the fitness function while keeping the same penalty for the case set out in (6).

Figure 13: Progressions for difference in silhouette (DBSCAN-preferable) and dimension across 1000 epochs at 100 epoch intervals.
Figure 14: Representative individuals from a DBSCAN-preferable run with clustering by: (a) $k$-means; (b) DBSCAN. Concave and convex hulls illustrated by shading and outline respectively.

Figures 13 and 14 show the same summary as above with the revised fitness function. Inspecting the former, it is seen that the best fitness found is worse than in the previous example. This, in part, is due to the fact that $k$-means cannot find a clustering with negative silhouette values as no clusters may overlap. The method can, however, produce results with small silhouette scores where the clusters are tightly packed. Hence, the best fitness score is now $-1$ whereas the worst is 2, still.
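The range of the revised fitness can be checked with a short derivation. The notation is an assumption made here for illustration: $S_k(X)$ and $S_D(X)$ denote the silhouette coefficients of a dataset $X$ under $k$-means and DBSCAN respectively, and the fitness, which is minimised, is taken to be their difference. The lower bound on $S_k(X)$ is the observation made above that $k$-means does not produce overlapping clusters.

```latex
\[
  f(X) = S_k(X) - S_D(X),
  \qquad S_k(X) \in [0, 1],
  \qquad S_D(X) \in [-1, 1]
\]
\[
  \min f = 0 - 1 = -1,
  \qquad
  \max f = 1 - (-1) = 2
\]
```

The best attainable score therefore rises from $-2$ in the $k$-means-preferable case to $-1$ here, while the worst remains at 2.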

Note in the first two frames of Figure 14(a) how $k$-means is forced to split what is evidently a single cluster in two, whereas DBSCAN is able to identify the single cluster and the outlying noise (Figure 14(b)). The proximity of these clusters has then dragged the silhouette score down for $k$-means. Referring to Figure 14(b), this kind of behaviour is certainly preferable for DBSCAN under these parameters: the initial individuals are likely random clouds (as seen in the rightmost two frames of the figure) and the simplest step toward a fit dataset is one that maintains that vaguely dense body with minimal noise points far from it.

4 Conclusion

In this paper we have introduced a novel approach to understanding the quality of an algorithm by exploring the space in which its well-performing datasets exist. Following a detailed explanation of its internal mechanisms, a case study in $k$-means clustering was offered as validation for the method. The method utilises biological operators to traverse the space of all possible datasets in an organic way with a minimal external framework attached. The generative nature of the proposed method also provides transparency and richness to the solution when compared to other contemporary techniques for artificial data generation, as the entire history of individuals is preserved.

The evolutionary dataset optimisation method depends on a number of parameters set out in this paper, perhaps the most important of which is the choice of distribution families; these families set out the general statistical shape of the columns of the datasets that are produced and also control the data types present. The relationship between columns and their associated distributions is not causal, and appropriate methods should be employed to understand the structure and characteristics of the data produced before formal conclusions are made, as set out in the examples provided.


The authors wish to thank the Cwm Taf Morgannwg University Health Board for their funding and support of the Ph.D. of which this work has formed a part.

Conflict of interest

The authors declare that they have no conflict of interest.


  • [1] Adil Amirjanov. Modeling the dynamics of a changing range genetic algorithm. Procedia Computer Science, 102:570–577, 2016.
  • [2] Thomas Bäck. Selective pressure in evolutionary algorithms: a characterization of selection mechanisms. In Proceedings of the First IEEE Conference on Evolutionary Computation. IEEE World Congress on Computational Intelligence, pages 57–62, 1994.
  • [3] Yingrui Chen, Mark Elliot, and Joseph Sakshaug. A genetic algorithm approach to synthetic data production. In PrAISe@ECAI, 2016.
  • [4] Qiang Du, Maria Emelianenko, and Lili Ju. Convergence of the Lloyd algorithm for computing centroidal Voronoi tessellations. SIAM Journal on Numerical Analysis, 44(1):102–119, 2006.
  • [5] H. Edelsbrunner, D. Kirkpatrick, and R. Seidel. On the shape of a set of points in the plane. IEEE Transactions on Information Theory, 29(4):551–559, July 1983.
  • [6] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaoming Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, 1996.
  • [7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
  • [8] Zhexue Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3):283–304, Sep 1998.
  • [9] Khalid Jebari. Selection methods for genetic algorithms. International Journal of Emerging Sciences, 3:333–344, 12 2013.
  • [10] Rafael C Jiménez, Mateusz Kuzak, Monther Alhamdoosh, Michelle Barker, Bérénice Batut, Mikael Borg, Salvador Capella-Gutierrez, Neil Chue Hong, Martin Cook, Manuel Corpas, Madison Flannery, Leyla Garcia, Josep Ll Gelpí, Simon Gladman, Carole Goble, Montserrat González Ferreiro, Alejandra Gonzalez-Beltran, Philippa C Griffin, Björn Grüning, Jonas Hagberg, Petr Holub, Rob Hooft, Jon Ison, Daniel S Katz, Brane Leskošek, Federico López Gómez, Luis J Oliveira, David Mellor, Rowland Mosbergen, Nicola Mulder, Yasset Perez-Riverol, Robert Pergl, Horst Pichler, Bernard Pope, Ferran Sanz, Maria V Schneider, Victoria Stodden, Radosław Suchecki, Radka Svobodová Vařeková, Harry-Anton Talvik, Ilian Todorov, Andrew Treloar, Sonika Tyagi, Maarten van Gompel, Daniel Vaughan, Allegra Via, Xiaochuan Wang, Nathan S Watson-Haigh, and Steve Crouch. Four simple recommendations to encourage best practices in research software. F1000Research, 6:ELIXIR–876, 6 2017.
  • [11] Chahine Koleejan, Bing Xue, and Mengjie Zhang. Code coverage optimisation in genetic algorithms and particle swarm optimisation for automatic software test data generation. 2015 IEEE Congress on Evolutionary Computation (CEC), pages 1204–1211, 2015.
  • [12] Matthias Kuehn, Thomas Severin, and Horst Salzwedel. Variable mutation rate at genetic algorithms: Introduction of chromosome fitness in connection with multi-chromosome representation. International Journal of Computer Applications, 72:31–38, 07 2013.
  • [13] Yiu-Wing Leung and Yuping Wang. An orthogonal genetic algorithm with quantization for global numerical optimization. IEEE Transactions on Evolutionary Computation, 5(1):41–53, 2001.
  • [14] Luis Martí, Jesús García, Antonio Berlanga, and José M. Molina. A stopping criterion for multi-objective optimization evolutionary algorithms. Information Sciences, 367–368:700–718, 2016.
  • [15] Justin Matejka and George Fitzmaurice. Same stats, different graphs: Generating datasets with varied appearance and identical statistics through simulated annealing. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI ’17, pages 1290–1294. ACM, 2017.
  • [16] Wes McKinney. Data structures for statistical computing in python. Proceedings of the 9th Python in Science Conference, 2010–. [Online; accessed 2019-03-01].
  • [17] Christoph C. Michael, Gary McGraw, and Michael Schatz. Generating software test data by evolution. IEEE Trans. Software Eng., 27:1085–1110, 2001.
  • [18] Tatsuya Motoki. Calculating the expected loss of diversity of selection schemes. Evolutionary Computation, 10(4):397–422, 2002.
  • [19] Travis Oliphant. NumPy: A guide to NumPy. USA: Trelgol Publishing, 2006–. [Online; accessed 2019-03-01].
  • [20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [21] Eugene Semenkin and Maria Semenkina. Self-configuring genetic algorithm with modified uniform crossover operator. In Advances in Swarm Intelligence, pages 414–421, 2012.
  • [22] Hossein Sharifipour, Mojtaba Shakeri, and Hassan Haghighi. Structural test data generation using a memetic ant colony optimization based on evolution strategies. Swarm and Evolutionary Computation, 40:76–91, 2018.
  • [23] Milan Sonka, Vaclav Hlavac, and Roger Boyle. Image Processing, Analysis and Machine Vision. Springer US, 1993.
  • [24] Masanori Suganuma, Shinichi Shirakawa, and Tomoharu Nagao. A genetic programming approach to designing convolutional neural network architectures. In GECCO, 2017.
  • [25] Yanan Sun, Bing Xue, Mengjie Zhang, and Gary G. Yen. Automatically designing CNN architectures using genetic algorithm for image classification. CoRR, abs/1808.03818, 2018.
  • [26] The EDO library developers. EDO: v0.2.1, 2019.
  • [27] A. Torralba and A. A. Efros. Unbiased look at dataset bias. Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, 2011.
  • [28] Xindong Wu and Vipin Kumar. The Top Ten Algorithms in Data Mining. CRC, 2009.

Appendix A

A.1 Lloyd’s algorithm

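Input: a dataset $X$, a number of centroids $k$, a distance metric $d$
Output: a partition of $X$ into $k$ parts, $Z = \{Z_1, \ldots, Z_k\}$
      select initial centroids, $z_1, \ldots, z_k$
      while any point changes cluster or some stopping criterion is not met do
            assign each point, $x \in X$, to cluster $Z_{j^*}$ where: $j^* = \arg\min_{j = 1, \ldots, k} d(x, z_j)$
            recalculate all centroids by taking the intra-cluster mean: $z_j = \frac{1}{|Z_j|} \sum_{x \in Z_j} x$
      end while
Algorithm 8: $k$-means (Lloyd’s)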
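The steps of the algorithm above can be sketched in plain Python as follows. This is an illustrative sketch and not the implementation used in the experiments: the function name `lloyd`, the random choice of initial centroids from the data, the Euclidean metric and the `max_iter` cap (standing in for "some stopping criterion") are all assumptions.

```python
import random
from math import dist


def lloyd(points, k, seed=0, max_iter=100):
    """Partition `points` into `k` clusters using Lloyd's algorithm."""
    rng = random.Random(seed)
    # Assumption: the initial centroids are k distinct data points.
    centroids = rng.sample(points, k)
    labels = None

    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: dist(point, centroids[j]))
            for point in points
        ]
        if new_labels == labels:
            break  # no point changed cluster
        labels = new_labels

        # Update step: move each centroid to its intra-cluster mean.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )

    return labels, centroids


# Two small, well-separated groups of points in the plane.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = lloyd(data, k=2)
```

On this toy dataset the two groups are recovered as separate clusters, although which group receives which label depends on the initial centroids.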

A.2 Implementation example

Below is an example of how the Python implementation was used to complete the first example, including the definition of the fitness function.

import edo
from edo.pdfs import Uniform
from sklearn.cluster import KMeans


def fitness(dataframe, seed):
    """Return the inertia of a seeded 2-means clustering of the dataset."""
    km = KMeans(n_clusters=2, random_state=seed).fit(dataframe)
    return km.inertia_


# Limit the "bounds" parameter of the uniform family to [0, 1].
Uniform.param_limits["bounds"] = [0, 1]

pop_history, fit_history = edo.run_algorithm(
    fitness, size=100, row_limits=[3, 100], col_limits=[2, 2],
    families=[Uniform], max_iter=1000, best_prop=0.2,
    mutation_prob=0.01, seed=0, root="out",
    fitness_kwargs={"seed": 0},
)