A Comparison of Evaluation Methods in Coevolution
Abstract
In this research, we compare four different evaluation methods in coevolution on the Majority Function problem. The size of the problem is selected such that an evaluation against all possible test cases is feasible. Two measures are used for the comparisons: the objective fitness derived from evaluating solutions against all test cases, and the objective fitness correlation (OFC), which is defined as the correlation coefficient between subjective and objective fitness. The results of our experiments suggest that a combination of average score and weighted informativeness may provide a more accurate evaluation in coevolution. To confirm this difference, a series of t-tests on the preference between each pair of evaluation methods is performed. The resulting significance is affirmative, and the tests for the two quality measures show similar preferences among the four evaluation methods.
TingShuo Yo 
Utrecht University 
PO Box 80.089 
3508 TB Utrecht 
The Netherlands 
tyo@cs.uu.nl 
Edwin D. de Jong 
Utrecht University 
PO Box 80.089 
3508 TB Utrecht 
The Netherlands 
dejong@cs.uu.nl 
Categories and Subject Descriptors: F.0 [General]

General Terms: Algorithms, Experimentation, Performance

Keywords: Coevolution, evaluation, performance comparison, objective fitness correlation, OFC
Coevolution offers an approach to adaptively select tests for the evaluation of learners [?, ?, ?, ?, ?, ?]. Using coevolution, the evaluation function is adapted as part of the evolutionary process. This approach can be useful if the quality of individuals can be assessed using some form of tests. For such test-based problems, the identification of an informative set of tests can reduce the amount of required computation, while potentially providing more useful information than any static selection of tests. Since an adaptive test set can render evaluation unstable, an important question is how coevolution can be set up to be sufficiently reliable.
A recent insight in coevolution research is that the design of a coevolutionary setup should begin with a consideration of the desired solution concept [?]. A solution concept specifies which elements of the search space qualify as solutions and which do not. Examples of solution concepts include: Maximum Expected Utility (maximizing the expected outcome against a randomly selected opponent, which for uniform selection is equivalent to maximizing the average outcome against all opponents), the Pareto-optimal set resulting from viewing each test as a separate objective, and Nash equilibria, in which no candidate solution or test can unilaterally deviate given the other candidate solutions and tests without decreasing its payoff.
For several of the main solution concepts used in coevolution, archive methods exist guaranteeing that when sufficiently diverse sets of new individuals are submitted to the archive, the archive will produce monotonically improving approximations of the solution concept. Recent examples of such archive methods are the Nash Memory [?, ?], which guarantees monotonicity for the Nash equilibrium solution concept; the IPCA algorithm, which guarantees monotonicity for the Pareto-optimal equivalence set [?]; and the MaxSolve algorithm [?], which guarantees monotonicity for the Maximum Expected Utility solution concept.
While theoretical guarantees of monotonic progress are important, so far no bounds or guarantees regarding the improvement of the approximation to the solution concept over time are available. Thus, approximating the solution concept to a desired degree of accuracy may take an infeasible amount of time. An important current practical question is therefore: how can coevolutionary algorithms be set up such that their dynamics lead to quick improvement over time? By using such efficient algorithms as generators of new individuals and coupling them to monotonic archives, thereby combining the guarantee of monotonic progress with efficiency, a principled approach to designing robust and efficient coevolution algorithms is obtained.
Efficiency in coevolutionary algorithms depends on selection, see e.g. [?], and evaluation. In this research we focus on evaluation. Our aim is to compare the efficiency, reflected in the improvement over time of an objective quality measure, that can be achieved using different coevolutionary evaluation methods. Since a main question is how sufficiently accurate evaluation may be achieved, the testing environment is chosen such that evaluating individuals on all tests is feasible; while this is not the case in practical applications of coevolution, this provides a possibility to compare evaluation methods with the maximally informative situation in which information about all possible tests is available. This setup permits investigating two important questions:

Given all information that may be relevant to evaluation, how can this information be used optimally?

Compared to evaluation based on all relevant information, how do different coevolutionary evaluation methods perform?
In this paper, we focus on the second question. Four different coevolutionary evaluation methods are compared to each other and to the baseline of testing against all tests. The test problem is a small variant of the Majority Function test problem [?] [?] [?] chosen such that evaluation against all test cases (initial conditions) is feasible. A new tool named the Objective Fitness Correlation (OFC) [?], the correlation between the subjective and the objective fitness measures, is used to assess the evaluation accuracy of the different methods.
The paper is structured as follows. In section 2 we discuss the evaluation methods and algorithms used in this research. The design of experiments, parameters and performance measures are described in section 3. The results are presented in section 4, and the discussions and concluding remarks are shown in section 5.
In this section, we describe the evaluation methods used in this study. Since our main focus is on test-based problems, we start by defining some terminology and fitness measures. Evaluation methods based on those fitness measures are introduced in the following sections.
For a test-based problem, there are two sample spaces: the set of test cases, $T$, and the set of possible solutions, $S$. The term interaction is defined as the result of letting one solution interact with one test case. For ease of analysis, an interaction function, $G(t, s)$, is designed to return one scalar outcome of the interaction between a test case $t \in T$ and a solution $s \in S$. To simplify the following discussion, we assume that the interaction function returns a binary result, $G(t, s) \in \{0, 1\}$, which represents whether the solution succeeds in solving the test case ($1$) or not ($0$). An affirmative interaction is preferred by the solution, but is unfavorable to the test case.
By evaluating all solutions, $s_i \in S$, against all test points, $t_k \in T$, an interaction matrix, $M$ with entries $m_{ki} = G(t_k, s_i)$, is obtained. The sum of column $i$, $\sum_k m_{ki}$, of this interaction matrix represents the number of test cases that $s_i$ has solved, which is a common performance measure for a solution. Similarly, the sum of row $k$, $\sum_i m_{ki}$, is the number of times test case $t_k$ has been solved, which reflects the difficulty of the test case: the lower the sum, the more difficult the test case.
In the simplest case, only the interaction outcomes are considered, so that the number of solutions that fail on $t_k$, $\sum_i (1 - m_{ki})$, can be used as the fitness of $t_k$ (i.e., the more difficult the better), and the number of test cases solved, $\sum_k m_{ki}$, as the fitness of $s_i$ (i.e., the more powerful the better).
Distinctions are defined as the ability to distinguish between good and bad solutions and may provide important information for selecting a proper set of test cases. This concept was first proposed in [?], and here we follow the notation in [?] for our experiments. Accordingly, a three-dimensional matrix $D$ is defined to represent “if test case $t_k$ distinguishes $s_i$ from $s_j$”, or mathematically:
$$ d_{kij} = \begin{cases} 1 & \text{if } G(t_k, s_i) > G(t_k, s_j) \\ 0 & \text{otherwise} \end{cases} \tag{1} $$
where $t_k \in T$ and $s_i, s_j \in S$.
The number of distinctions that test case $t_k$ has made may be represented by summing over all solution pairs in $D$, i.e., $\sum_{i,j} d_{kij}$. This value represents the ability of a test case to maintain diversity of the solutions, and it may also be considered as a fitness measure for test cases. Although we explain the concept of distinction in terms of “selecting a proper set of test cases”, the same procedure can also be applied to solutions. This yields the distinction of solutions, and is also a reasonable fitness measure for evaluating solutions. With those measures, four different evaluation methods are defined as follows.
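The bookkeeping described above can be illustrated with a minimal Python sketch. The function names and the toy threshold problem (`toy_interact`) are assumptions made purely for illustration and are not part of the original study; rows of the matrix index test cases and columns index solutions.

```python
from itertools import permutations


def interaction_matrix(tests, solutions, interact):
    """Rows index test cases, columns index solutions; entries are 0 or 1."""
    return [[interact(t, s) for s in solutions] for t in tests]


def column_sums(matrix):
    """Number of test cases solved by each solution."""
    return [sum(col) for col in zip(*matrix)]


def row_sums(matrix):
    """Number of solutions that solve each test case."""
    return [sum(row) for row in matrix]


def distinction_counts(matrix):
    """For each test case, count ordered solution pairs (i, j) whose
    outcomes differ in favour of i, i.e. row[i] > row[j]."""
    n = len(matrix[0])
    return [
        sum(1 for i, j in permutations(range(n), 2) if row[i] > row[j])
        for row in matrix
    ]


# Hypothetical toy problem: a solution "solves" a test if it is at least as large.
def toy_interact(test, solution):
    return 1 if solution >= test else 0


tests = [1, 2, 3]
solutions = [0, 2, 3]
matrix = interaction_matrix(tests, solutions, toy_interact)
solved_per_solution = column_sums(matrix)
solved_per_test = row_sums(matrix)
distinctions = distinction_counts(matrix)
```

For binary outcomes, the number of distinctions a test makes is simply the number of solutions that pass it times the number that fail it, so a test that every solution passes (or fails) makes no distinctions.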
For each evaluation method, we define the fitness for test cases and solutions as follows.
For a solution, the average score is defined as “the proportion of test points it has succeeded in solving.” For a test point, it is “the proportion of solutions that it has failed.” With an interaction matrix as described earlier, we define:
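$$ AS(s_i) = \frac{1}{|T|} \sum_{k} G(t_k, s_i), \qquad AS(t_k) = \frac{1}{|S|} \sum_{i} \bigl(1 - G(t_k, s_i)\bigr) \tag{2} $$
where $G(t_k, s_i) \in \{0, 1\}$ is the outcome of the interaction between test case $t_k$ and solution $s_i$ ($1$ if the solution succeeds), and $|T|$ and $|S|$ are the numbers of test cases and solutions evaluated. The values of $AS$ for both test cases and solutions lie in $[0, 1]$, and a higher value represents a better performance.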
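As a hedged illustration of the average score, the following Python sketch computes it for both populations from a binary interaction matrix. The function name and the example matrix are assumptions for illustration; rows are test cases and columns are solutions.

```python
def average_scores(matrix):
    """Return (scores for test cases, scores for solutions).

    A solution's score is the fraction of test cases it solves;
    a test case's score is the fraction of solutions that fail on it.
    """
    n_tests, n_solutions = len(matrix), len(matrix[0])
    as_tests = [sum(1 - x for x in row) / n_solutions for row in matrix]
    as_solutions = [sum(col) / n_tests for col in zip(*matrix)]
    return as_tests, as_solutions


# Hypothetical 3 x 3 interaction matrix (rows: tests, columns: solutions).
example = [[0, 1, 1], [0, 1, 1], [0, 0, 1]]
as_tests, as_solutions = average_scores(example)
```

In this example the third test case is failed by two of the three solutions and therefore scores highest among the tests, while the third solution solves every test and scores 1.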
Instead of giving each interaction an equal weight, the average scores described above may be used as weights for each test case and solution. That is to say, a test case earns more credits by failing a more powerful solution, and a solution gets a higher score when it succeeds in a tough test case. The mathematical expression of this evaluation can be represented as follows:
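$$ WS(s_i) = \frac{\sum_k AS(t_k)\, G(t_k, s_i)}{\sum_k AS(t_k)}, \qquad WS(t_k) = \frac{\sum_i AS(s_i)\, \bigl(1 - G(t_k, s_i)\bigr)}{\sum_i AS(s_i)} \tag{3} $$
where $AS$ denotes the average score and $G(t_k, s_i) \in \{0, 1\}$ is the interaction outcome between test case $t_k$ and solution $s_i$. The weights are checked to avoid division by zero and normalized so that the weighted scores range from 0 to 1. Hence $WS$ also returns a value in $[0, 1]$, and higher values are preferable. This definition is similar in spirit to the niching methods of Rosin [?] and Juillé [?], though their methods were not considered when we developed the weight function.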
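A minimal sketch of the weighted score follows, under the assumption that the normalization divides by the sum of the weights and that a score of 0 is returned when all weights are zero. Both choices are illustrative readings of the description above, not a confirmed reproduction of the original implementation.

```python
def average_scores(matrix):
    """Average scores for tests (rows) and solutions (columns)."""
    n_tests, n_solutions = len(matrix), len(matrix[0])
    as_tests = [sum(1 - x for x in row) / n_solutions for row in matrix]
    as_solutions = [sum(col) / n_tests for col in zip(*matrix)]
    return as_tests, as_solutions


def weighted_scores(matrix):
    """Weight each interaction by the average score of the opponent.

    Assumption: scores are normalized by the total weight, and are 0.0
    when the total weight is zero (guarding against division by zero).
    """
    as_tests, as_solutions = average_scores(matrix)
    total_t, total_s = sum(as_tests), sum(as_solutions)
    ws_solutions = [
        sum(w * row[i] for w, row in zip(as_tests, matrix)) / total_t
        if total_t > 0 else 0.0
        for i in range(len(matrix[0]))
    ]
    ws_tests = [
        sum(w * (1 - x) for w, x in zip(as_solutions, row)) / total_s
        if total_s > 0 else 0.0
        for row in matrix
    ]
    return ws_tests, ws_solutions


example = [[0, 1, 1], [0, 1, 1], [0, 0, 1]]
ws_tests, ws_solutions = weighted_scores(example)
```

In the example, the first two tests only defeat a solution that solves nothing, so they earn no credit, whereas the third test defeats a solution with a nonzero average score.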
The informativeness of a test measures the amount of information it provides about a given set of candidate solutions. In [?], a definition for informativeness based on the incomparable and equal elements of the order induced by a test is provided. Here, we measure the informativeness of a test based on Ficici’s notion of distinctions [?]. Since each distinction a test makes contributes to its informativeness, and since the set of all possible distinctions is sufficient to provide ideal evaluation [?], we measure the informativeness of a test as the normalized number of distinctions it makes. In this study, we define the distinction score, $DS$, as “the number of distinctions one test case (solution) makes.” This value can be derived from the distinction matrix described in Equation (1), and represents the informativeness. For the convenience of further computation, we normalize the distinction score with its maximum and minimum values, which are returned by the functions $\max$ and $\min$. Because the informativeness represents the individual’s ability to maintain diversity of the other population, we want to integrate it with the average score to form the fitness. Here we simply use a linear combination of the two scores, with weights $w_D$ and $w_A$ for the distinction score and the average score, respectively. These weights are based on the observation that the average scores are generally lower than the distinction scores due to the computational scheme we used, as well as on experience from the pilot experiments. The average informativeness is therefore defined as the linear combination of the normalized distinction score and the average score.
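$$ AI(x) = w_D \, \frac{DS(x) - \min(DS)}{\max(DS) - \min(DS)} + w_A \, AS(x) \tag{4} $$
where $x$ is a test case or a solution, $DS(x)$ is its distinction score, $AS(x)$ is its average score, and $w_D$ and $w_A$ are the weights of the linear combination.
Similar to the weighted score, each distinction that has been made can be weighted differently when the distinction score is derived. Let $d_{kij} = 1$ if test case $t_k$ distinguishes solution $s_i$ from $s_j$, and $0$ otherwise. Summing over all test cases, $\sum_k d_{kij}$, provides the information of “how many test cases have made the distinction between $s_i$ and $s_j$, such that $G(t_k, s_i) > G(t_k, s_j)$.” The inverse of this value can be used as the weight of this particular distinction, so that the more test cases can make a distinction, the less this distinction is worth. This operation can also be applied to solutions, and the weighted informativeness can be defined mathematically as: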
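$$ WDS(t_k) = \sum_{(i,j)\,:\,d_{kij} = 1} \frac{1}{\sum_{k'} d_{k'ij}}, \qquad WI(t_k) = w_D \, \frac{WDS(t_k) - \min(WDS)}{\max(WDS) - \min(WDS)} + w_A \, AS(t_k) \tag{5} $$
where $d_{kij} = 1$ if test case $t_k$ distinguishes solution $s_i$ from $s_j$ and $0$ otherwise, $AS$ is the average score, and $w_D$ and $w_A$ are the weights of the linear combination; the definition for solutions is analogous. In the following discussion, the four evaluation methods described above are referred to as AS, WS, AI, and WI, respectively.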
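The two informativeness measures can be sketched in Python as follows, for the test population only (the solution side is analogous). The weight values `w_dist` and `w_avg` are placeholders chosen for illustration, not the values used in the experiments, and mapping a constant score vector to zeros during normalization is likewise an assumption.

```python
from collections import Counter


def distinction_pairs(row):
    """Ordered solution pairs (i, j) that this test distinguishes."""
    n = len(row)
    return {(i, j) for i in range(n) for j in range(n) if row[i] > row[j]}


def min_max_normalize(values):
    """Scale to [0, 1]; a constant vector maps to zeros (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def informativeness(matrix, weighted, w_dist=0.5, w_avg=0.5):
    """Average (weighted=False) or weighted (weighted=True) informativeness
    of each test case. w_dist and w_avg are placeholder weights."""
    pairs = [distinction_pairs(row) for row in matrix]
    if weighted:
        # Each distinction is worth 1 / (number of tests that make it).
        made_by = Counter(p for ps in pairs for p in ps)
        raw = [sum(1.0 / made_by[p] for p in ps) for ps in pairs]
    else:
        raw = [float(len(ps)) for ps in pairs]
    avg = [sum(1 - x for x in row) / len(row) for row in matrix]
    return [w_dist * d + w_avg * a for d, a in zip(min_max_normalize(raw), avg)]


example = [[0, 1, 1], [0, 1, 1], [0, 0, 1]]
ai_tests = informativeness(example, weighted=False)
wi_tests = informativeness(example, weighted=True)
```

In this example all three tests make the same number of distinctions, so the unweighted measure cannot separate them on informativeness; the weighted measure rewards the third test because it is the only one that distinguishes the third solution from the second.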
There are two main algorithms used in this study: a single-population genetic algorithm (GA) and coevolution (CO). The former uses a single population of solutions and evaluates the population on all possible test cases to obtain the interactions defined earlier. Afterward, the average scores (AS) are used as the fitness in the GA. The results of this algorithm are used as the baseline for comparisons. The second algorithm uses coevolution between test cases and solutions, and all four evaluation methods are tested.
Algorithms 1 and 2 describe GA and CO, respectively. Functions used in the algorithms, e.g., INTERACTION, EVALUATE, SELECT and BREED, are specific to problems and experiments. The general concepts of INTERACTION and the four implementations of EVALUATE have already been discussed in this section. SELECT and BREED together define the selection, reproduction and replacement cycle in a generation. The choice of these two functions is problem specific, and they are specified in the experimental design section.
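The structure of the coevolutionary loop can be sketched as below. This is only an outline under stated assumptions: the toy interaction on bit vectors, the use of the average score as EVALUATE, and the truncation-plus-mutation stand-in for SELECT and BREED are all hypothetical simplifications and do not reproduce the operators used in the experiments.

```python
import random


def toy_interact(test, solution):
    """Hypothetical interaction: success if most bits match."""
    matches = sum(a == b for a, b in zip(test, solution))
    return 1 if 2 * matches > len(test) else 0


def evaluate_average_score(matrix):
    """EVALUATE stand-in: average score for tests (rows) and solutions (columns)."""
    n_tests, n_solutions = len(matrix), len(matrix[0])
    test_fitness = [sum(1 - x for x in row) / n_solutions for row in matrix]
    solution_fitness = [sum(col) / n_tests for col in zip(*matrix)]
    return test_fitness, solution_fitness


def select_and_breed(population, fitness, rng, elite_fraction=0.2, mutation_rate=0.05):
    """SELECT/BREED stand-in: keep the best individuals as elites and
    fill the rest of the population with mutated copies of elites."""
    order = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    n_elite = max(1, int(elite_fraction * len(population)))
    elites = [population[i] for i in order[:n_elite]]
    offspring = []
    while len(elites) + len(offspring) < len(population):
        parent = rng.choice(elites)
        offspring.append([bit ^ 1 if rng.random() < mutation_rate else bit for bit in parent])
    return elites + offspring


def coevolve(tests, solutions, generations, rng):
    """Evaluate both populations against each other, then breed both."""
    for _ in range(generations):
        matrix = [[toy_interact(t, s) for s in solutions] for t in tests]
        test_fitness, solution_fitness = evaluate_average_score(matrix)
        tests = select_and_breed(tests, test_fitness, rng)
        solutions = select_and_breed(solutions, solution_fitness, rng)
    return tests, solutions


rng = random.Random(0)
tests = [[rng.randint(0, 1) for _ in range(5)] for _ in range(8)]
solutions = [[rng.randint(0, 1) for _ in range(5)] for _ in range(10)]
final_tests, final_solutions = coevolve(tests, solutions, generations=5, rng=rng)
```

The single-population GA differs only in that the test set is fixed to all possible test cases and is never bred.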
In this section, the settings of our experiments are described. The software used in this research is implemented as an extension of ECJ [?], which is developed by George Mason University’s Evolutionary Computation Laboratory. All simulations use the basic evolutionary loop provided by ECJ, plus the evaluation methods and problem-specific functions in the extension. The design of the experiments, the parameters used in the test problem, and the performance measures are introduced as follows.
As mentioned in the previous section, two algorithms and four evaluation methods are used in this study. Table 1 shows the design of our experiments. Experiment 1 is a combination of the single-population genetic algorithm and the average score evaluation method. This GA-AS experiment evaluates the solutions against all possible test cases and serves as the baseline. Experiments 2, 3, 4, and 5 combine the four different evaluation methods with the coevolution algorithm. Each experiment is run ten times with ten different random seeds.
Exp No             1    2    3    4    5
Algorithm          GA   CO   CO   CO   CO
Evaluation Method  AS   AS   WS   AI   WI
Number of Runs     10   10   10   10   10
Table 1: Design of experiments.
In this study, we perform the experiments on the majority function problem. This problem is also known as the one-dimensional cellular automata problem. Mitchell and her colleagues discuss this problem in detail in [?] and [?]. Two major parameters are used for this problem: the radius of the neighbourhood ($r$) and the size of the one-dimensional lattice ($n$). Boolean vectors are used to represent both the initial conditions (test cases) and the rules (solutions). In order to evaluate against all test points, we chose small values for $r$ and $n$. Other parameters, e.g., the size of populations, elitism, crossover type, and mutation probability, are selected based on Mitchell’s work [?], and are summarized in Table 2.
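To make the interaction concrete, the following Python sketch runs a one-dimensional binary cellular automaton and checks whether a rule classifies the majority of an initial condition correctly. The periodic boundary, the number of update steps, and the example local-majority rule with radius 1 are assumptions for illustration; the settings used in the experiments are not reproduced here.

```python
def step(lattice, rule, r):
    """One synchronous update with periodic boundaries (assumption).

    The neighbourhood of 2r + 1 cells is read as a binary number,
    leftmost neighbour first, and used as an index into the rule table."""
    n = len(lattice)
    nxt = []
    for i in range(n):
        idx = 0
        for k in range(-r, r + 1):
            idx = (idx << 1) | lattice[(i + k) % n]
        nxt.append(rule[idx])
    return nxt


def solves(rule, initial, r, max_steps):
    """Return 1 if the lattice ends in all 1s (majority of 1s initially)
    or all 0s (majority of 0s initially), else 0. Assumes odd lattice size."""
    target = 1 if 2 * sum(initial) > len(initial) else 0
    lattice = list(initial)
    for _ in range(max_steps):
        lattice = step(lattice, rule, r)
    return int(all(cell == target for cell in lattice))


# Hypothetical example rule for r = 1: each cell takes the local majority.
local_majority = [1 if bin(idx).count("1") >= 2 else 0 for idx in range(8)]

easy = [1, 1, 1, 1, 1, 1, 1, 1, 0]
hard = [1, 1, 1, 0, 0, 0, 0, 1, 1]
easy_result = solves(local_majority, easy, r=1, max_steps=18)
hard_result = solves(local_majority, hard, r=1, max_steps=18)
```

The second initial condition has a majority of 1s, but local voting freezes into stable blocks and never reaches the all-1s state, which illustrates why the task requires more than a local majority vote.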
In the coevolution experiments (2, 3, 4, and 5), a symmetric setup is used, i.e., the same set of EVALUATE, SELECT and BREED functions is used for both the population of test cases and the population of solutions. However, there is one exception: the population size. Because there are only 512 possible test cases in total, the population size for test cases is set to 64 for convenience of comparison, while the population size for solutions is set to 100. Accordingly, the numbers of elites for the two populations also differ, while their proportions to the population sizes are the same (20%). Both initial populations are randomly created and checked to ensure that there are no redundant individuals at the beginning of the experiments.
Parameter           Value
MAX_GENERATION      200
Size of population  64/100
Elitism             12/20 (20%)
Selection           linear rank selection
Crossover           one point crossover
Mutation rate       0.01
Table 2: Parameters used for the majority function problem.
In order to compare all experiments fairly, an objective performance measure is required. Hence we define the subjective fitness and the objective fitness separately. The subjective fitness is the fitness value returned by the EVALUATE function used in the coevolution algorithm. In this study, the objective fitness is defined as the fitness used in experiment 1, i.e., the average score against all possible test cases. For each generation of each experiment, both the subjective and the objective fitness are recorded. All experiments are compared by their objective fitness, based on the number of interactions. In the coevolution experiments, the number of test cases is 1/8 of that in the GA-AS experiment (64 versus 512); therefore, during the analysis we compare the objective fitness of the coevolution experiments every 8 generations to the fitness of the control experiment every generation. Since these values are all derived from exhaustive testing, the progress over interactions may be seen as the performance of each method.
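The alignment by number of interactions can be checked with a short calculation. The sketch below assumes that each generation evaluates every solution against every test in use, and that the baseline GA also uses a solution population of 100; both are assumptions made for this illustration.

```python
def interactions_per_generation(n_tests, n_solutions):
    """Every solution meets every test once per generation (assumption)."""
    return n_tests * n_solutions


co_interactions = interactions_per_generation(64, 100)   # coevolution
ga_interactions = interactions_per_generation(512, 100)  # exhaustive baseline
co_generations_per_ga_generation = ga_interactions // co_interactions
```

Under these assumptions, one baseline generation costs as many interactions as eight coevolution generations.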
In addition to the progress of the fitness, the correlation coefficients between subjective and objective fitness are also computed. This correlation is defined as the Objective Fitness Correlation, OFC, in [?] as a new objective measure for evaluating coevolutionary algorithms. In our experiments, this measure is collected for each generation of every run. Since the OFC is always 1 in the baseline experiment, where the subjective fitness equals the objective fitness, we only compare the OFC among the four evaluation methods with coevolution. Comparisons are made with these two measures averaged over ten independent runs.
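Since the OFC is a correlation coefficient between two fitness vectors, it can be sketched as a Pearson correlation over the individuals of one generation. The function name, the handling of constant vectors, and the example values are assumptions for illustration.

```python
from math import sqrt


def objective_fitness_correlation(subjective, objective):
    """Pearson correlation between subjective and objective fitness.

    Returns NaN when either vector is constant, since the correlation
    is undefined in that case (a choice made for this sketch)."""
    n = len(subjective)
    mean_s = sum(subjective) / n
    mean_o = sum(objective) / n
    cov = sum((s - mean_s) * (o - mean_o) for s, o in zip(subjective, objective))
    var_s = sum((s - mean_s) ** 2 for s in subjective)
    var_o = sum((o - mean_o) ** 2 for o in objective)
    if var_s == 0 or var_o == 0:
        return float("nan")
    return cov / sqrt(var_s * var_o)


# Hypothetical fitness values for four solutions in one generation.
objective = [0.2, 0.5, 0.7, 0.9]
ofc_baseline = objective_fitness_correlation(objective, objective)
ofc_reversed = objective_fitness_correlation([0.9, 0.7, 0.5, 0.2], objective)
```

When the subjective fitness is identical to the objective fitness, as in the exhaustive baseline, the correlation is 1; a subjective ranking that inverts the objective one gives a negative value.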
Figure 1 shows the best objective fitness of the solutions averaged over 10 runs. As illustrated in the figure, the baseline experiment is outperformed by all coevolution experiments, especially when the number of total interactions is small. Among the four evaluation methods with coevolution, the weighted informativeness shows a consistently better performance than the others.
A series of paired, one-tailed t-tests is performed to examine the significance of the differences between each pair of evaluation methods. The best fitness of each generation is averaged over 10 runs, and is paired with the same quantity of another evaluation method for the t-test. The results of the t-tests show significance for WI > AI, WI > WS, WI > AS, AI > AS, and WS > AS, but not for AI > WS. These results, including the p-values, are summarized in Table 3.
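The test statistic of a paired t-test can be sketched with the standard library alone. The function name and the example values are assumptions for illustration; the one-tailed p-value would then be read from a Student t distribution with the returned degrees of freedom, for instance via a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b.

    A large positive t supports the one-tailed hypothesis that
    the values in a tend to exceed their partners in b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical per-generation best fitness for two evaluation methods.
method_a = [0.80, 0.90, 0.85, 0.95]
method_b = [0.70, 0.85, 0.80, 0.90]
t_value, dof = paired_t_statistic(method_a, method_b)
```

Pairing by generation removes the variation that both methods share over time, so the test looks only at the per-generation differences.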
The OFCs for the different evaluation methods with coevolution are shown in Figure 2. The values shown in the figure are averages of 10 independent runs, and therefore the fluctuations of the OFCs are already smoothed. In a single run of one experiment, the OFC can be negative for some generations.
In all experiments, the OFC starts from a high value. This is because the initial populations are randomly generated, and can therefore be seen as uniform samples from the set of all possible cases. Ideally, such a random sample represents the original search space well, which results in a higher correlation between the subjective and the objective fitness. As the coevolution proceeds, the populations move in particular directions and no longer resemble a uniformly random sample, and hence the OFC decreases over generations. However, as argued in [?], the OFC may still serve as a performance measure for comparing different evaluation methods in a coevolutionary algorithm.
As demonstrated in Figure 2, the weighted informativeness shows a consistently higher OFC throughout the generations than the other methods, while the other evaluation methods do not show clear differences between one another due to the fluctuations.
A set of t-tests is employed to verify the difference in OFC between each pair of evaluation methods. The results show significance for all of the following relations: WI > AI, WI > WS, WI > AS, AI > WS, AI > AS, and WS > AS. The preferences for WI over the other methods are highly significant, with p-values at the limit of precision of the computing software. These results are summarized in Table 4.
Table 3: Significance levels (p-values) of paired t-tests on objective fitness for each pair (i, j) of the experiments CO-AS, CO-WS, CO-AI and CO-WI (rows i: AS, WS, AI; columns j: WS, AI, WI). A p-value smaller than 0.05 is usually considered significant.
Table 4: Significance levels (p-values) of paired t-tests on OFC for each pair (i, j) of the experiments CO-AS, CO-WS, CO-AI and CO-WI (rows i: AS, WS, AI; columns j: WS, AI, WI).
The results of our experiments suggest that in coevolution, a combination of performance (average score) and diversity (weighted distinction) can achieve the accuracy of full evaluation with less computational cost. This holds both for the improvement of an objective quality measure over time and for the correlation between the subjective fitness and the objective measure. The advantage is even clearer if we compare the objective fitness by generation instead of by number of interactions. As shown in Figure 3, the WI evaluation method with coevolution progresses as fast as evaluating against all possible test cases, while the other evaluation methods are noticeably slower than the baseline.
In our study, the OFC is used as a quality measure in addition to the objective fitness. If the preference orderings of evaluation methods are considered, this measure provides information similar to the objective fitness. However, the difference between WS and AI shows significance in OFC but not in the objective fitness. This implies that two measures may contain different information, and a detailed discussion on OFC can be found in [?].
This research is the first time the OFC is calculated on a real problem, but in exchange we are only able to experiment on problems with a small number of possible test cases in total. The majority function problem was also tested with other sizes. The results are similar to the findings presented in the previous section, and the results for one of these settings are summarized in Figures 4 and 5. These figures show that OFC can distinguish AI and WI from AS and WS while the objective fitness cannot. This is consistent with the results of t-tests (not shown): t-tests on the objective fitness show no significant difference between pairs among AS, WS, and AI, while those on OFC show the same preference as in the experiments of the previous section.
In addition to the majority function problem, the same comparison is also performed on the odd-parity problem. Although Lee, Xu, and Chau [?] have proposed that the boolean parity problem may be transformed into a cellular automata problem, here we use Koza’s Genetic Programming approach in [?].
Figures 6 and 7 show the results for the odd-parity problem. As shown in Figure 6, WI still performs the best among the four evaluation methods in coevolution, and this is also confirmed by the t-tests (not shown). However, we also find some disagreement between the results from the majority function problem and the parity problem. First, in the parity problem, the single-population GA outperforms all CO experiments in this genetic programming approach, while Figure 1 shows the opposite in the majority function problem. Second, the OFC in the parity problem shows no significant pattern (as shown in Figure 7), while the OFC distinguishes different evaluation methods better than the objective fitness in the majority function problem. Finally, the t-tests hardly show any significant difference among AS, WS, and AI in the parity problem.
A detailed analysis of the output suggests a few possible reasons for the disagreement. First, solving the parity problem with the genetic programming approach is not a symmetric test-based problem in nature. That is to say, while the test cases are represented with boolean vectors as they are in the majority function problem, the solutions for the parity problem are represented with tree-like structures. As a result of this asymmetry, the same evolutionary operators (e.g., selection methods, mutation rate, crossover methods, etc.) may have different effects on the two populations, and hence a symmetric setting for coevolution is not suitable. Second, the initial population of the tree-like solutions is not randomly selected from “all possible solutions”. In Koza’s approach, the sizes of the solutions start from smaller values and grow over generations. This “biased sampling” may explain the weaker results for the OFC in the parity problem. Finally, since Koza’s approach has been studied for several years, the evolutionary operators and parameters are already well tuned. We believe that the disagreement may be reduced by developing proper asymmetric coevolution schemes.
Despite the disagreement, the results of the parity problem still favor WI in coevolution.
In this research, we compare four different evaluation methods in coevolution on a test-based problem. Two measures are used for the comparisons among average score (AS), weighted score (WS), average informativeness (AI), and weighted informativeness (WI). In addition to an objective quality measure, the objective fitness correlation (OFC) is also computed.
The experimental results show a strong preference for WI, which suggests that a combination of the performance and the ability to create distinctions may provide more accurate evaluation in coevolution. The resulting significance from the t-tests shows a similar preference when the two quality measures are used separately. This study also uses the recently proposed OFC to evaluate the accuracy of coevolutionary evaluation methods on a concrete test problem.
Although we have shown the advantages of using WI in coevolution, the way we combine these two measures is simply a weighted summation. It may be worth exploring more sophisticated methods to fuse this information together. Currently a multi-objective approach is in progress, and this may lead to a more detailed investigation of how to use both measures in coevolutionary algorithms.
 1 E. D. De Jong. The MaxSolve algorithm for coevolution. In H.-G. Beyer, editor, Proceedings of the Genetic and Evolutionary Computation Conference, GECCO-05, pages 483–489. ACM Press, 2005.
 2 E. D. De Jong. A monotonic archive for Pareto-coevolution. Evolutionary Computation, 15(1), 2007. to appear.
 3 E. D. De Jong. Objective fitness correlation: evaluating coevolutionary evaluation. 2007.
 4 E. D. De Jong and J. B. Pollack. Ideal evaluation from coevolution. Evol. Comput., 12(2):159–192, 2004.
 5 S. G. Ficici. Solution Concepts in Coevolutionary Algorithms. PhD thesis, Brandeis University, 2004.
 6 S. G. Ficici, O. Melnik, and J. B. Pollack. A game-theoretic and dynamical-systems analysis of selection methods in coevolution. IEEE Transactions on Evolutionary Computation, 9(6):580–602, 2005.
 7 S. G. Ficici and J. B. Pollack. Pareto optimality in coevolutionary learning. In ECAL ’01: Proceedings of the 6th European Conference on Advances in Artificial Life, pages 316–325, London, UK, 2001. Springer-Verlag.
 8 S. G. Ficici and J. B. Pollack. A game-theoretic memory mechanism for coevolution. In Cantú-Paz, E., et al., editor, Genetic and Evolutionary Computation – GECCO-2003, volume 2723 of LNCS, pages 286–297, Chicago, 12-16 July 2003. Springer-Verlag.
 9 D. W. Hillis. Coevolving parasites improve simulated evolution in an optimization procedure. Physica D, 42:228–234, 1990.
 10 H. Juillé and J. Pollack. Dynamics of coevolutionary learning. In Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, pages 526–534. MIT Press, 1996.
 11 H. Juillé and J. B. Pollack. Coevolving the “ideal” trainer: Application to the discovery of cellular automata rules. In Proceedings of the Third Annual Genetic Programming Conference, 1998.
 12 J. R. Koza. Genetic programming: on the programming of computers by means of natural selection. MIT Press, Cambridge, MA, USA, 1992.
 13 K. M. Lee, H. Xu, and H. F. Chau. Parity problem with a cellular automaton solution. Phys. Rev. E, 64(2):026702, Jul 2001.
 14 S. Luke. ECJ 15: A Java evolutionary computation library. http://cs.gmu.edu/eclab/projects/ecj/, 2006.
 15 M. Mitchell, J. P. Crutchfield, and P. T. Hraber. Evolving cellular automata to perform computations: mechanisms and impediments. Phys. D, 75(1-3):361–391, 1994.
 16 M. Mitchell, P. T. Hraber, and J. P. Crutchfield. Revisiting the edge of chaos: Evolving cellular automata to perform computations. Complex Systems, 7:89–130, 1993.
 17 L. Pagie and P. Hogeweg. Evolutionary consequences of coevolving targets. Evolutionary Computation, 5(4):401–418, 1998.
 18 J. Paredis. Coevolutionary computation. Artificial Life, 2(4), 1996.
 19 C. D. Rosin and R. K. Belew. New methods for competitive coevolution. Evolutionary Computation, 5(1):1–29, 1997.

