How to Evaluate Solutions in Pareto-based Search-Based Software Engineering? A Critical Review and Methodological Guidance
Abstract
With modern requirements, there is an increasing tendency to consider multiple objectives/criteria simultaneously in many Software Engineering (SE) scenarios. Such a multi-objective optimization scenario comes with an important issue: how to evaluate the outcome of optimization algorithms, which is typically a set of incomparable solutions (i.e., solutions that are Pareto non-dominated to each other). This issue can be challenging for the SE community, particularly for practitioners of Search-Based SE (SBSE). On the one hand, multi-objective optimization may still be relatively new to SE/SBSE researchers, who may not be able to identify the right evaluation methods for their problems. On the other hand, simply following the evaluation methods for general multi-objective optimization problems may not be appropriate for specific SE problems, especially when the problem nature or the decision maker’s preferences are explicitly/implicitly available. This is reflected in the literature by various inappropriate/inadequate selections and inaccurate/misleading uses of evaluation methods. In this paper, we carry out a critical review of quality evaluation for multi-objective optimization in SBSE. We survey 717 papers published between 2009 and 2019 from 36 venues in 7 repositories, and select 97 prominent studies, through which we identify five important but overlooked issues in the area. We then conduct an in-depth analysis of quality evaluation indicators and general situations in SBSE, which, together with the identified issues, enables us to provide methodological guidance on selecting and using evaluation methods in different SBSE scenarios.
1 Introduction
In software engineering (SE), it is not uncommon to face a scenario where multiple objectives/criteria need to be considered simultaneously [43, 17]. In such scenarios, there is usually no single optimal solution but rather a set of Pareto optimal solutions (termed a Pareto front in the objective space), i.e., solutions that cannot be improved on one objective without degrading some other objective. To tackle these multi-objective SE problems, different problem-solving ideas have been brought up. One of them is to generate a set of solutions to approximate the Pareto front. This, in contrast with the idea of aggregating objectives (by weighting) into a single-objective problem, provides different trade-offs between the objectives, from which the decision maker (DM) can choose their favourite solution.
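To make the notion of Pareto optimality concrete, the following is a minimal sketch of a dominance check and a non-dominated filter. It assumes that all objectives are to be minimized; the function names and the (cost, latency) values are hypothetical and serve only as an illustration.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` Pareto-dominates `b` (all objectives minimized).

    `a` must be no worse than `b` on every objective and strictly better
    on at least one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def nondominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the points that no other point in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (cost, latency) values of six candidate solutions.
candidates = [(1.0, 9.0), (2.0, 7.0), (3.0, 8.0), (4.0, 4.0), (6.0, 5.0), (7.0, 2.0)]
front = nondominated(candidates)
print(front)  # [(1.0, 9.0), (2.0, 7.0), (4.0, 4.0), (7.0, 2.0)]
```

The four remaining solutions are mutually incomparable: moving from one to another improves one objective only at the expense of the other, which is why a further evaluation step is needed to compare such sets.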
In such Pareto-based optimization, a fundamental issue is to evaluate the quality of solution sets (populations) obtained by computational search methods (e.g., greedy search, heuristics and evolutionary algorithms) in order to know how well the methods perform.
Since the obtained solution sets are typically not comparable to each other with respect to Pareto dominance
Another way to evaluate the solution sets is to report their descriptive statistics, such as the best, mean and median values on each objective from each solution set. This has been particularly common in Search-Based SE (SBSE) [5, 19, 7, 32, 124, 14, 120]. However, some of these statistics may easily give misleading evaluation results. That is, a solution set which is evaluated as better than its competitor may never be preferred by the DM under any circumstance. This is explained in detail in Section 4.1.2.
Quality indicators, arguably the most straightforward evaluation method, map a solution set to a real number that indicates one or several aspects of the set’s quality; they have emerged in the fields of evolutionary computation and operational research [107, 141, 55, 8]. Today, analyzing and designing quality indicators has become an important research topic. There are hundreds of them in the literature [69], with some measuring closeness of the solution set to the Pareto front, some gauging diversity of the set, some considering a comprehensive evaluation of the set, etc. The SBSE community benefits from this prosperity. A common practice in SBSE is to use some well-established quality indicators, such as hypervolume (HV) [142] and inverted generational distance (IGD) [23], to evaluate the obtained solution sets. However, some indicators may not be appropriate when it comes to practical SE optimization scenarios. For example, since the Pareto front of a practical SE problem is typically unavailable, indicators that require a reference set that well represents the problem’s Pareto front, such as IGD, may not be well suited [69].
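As an illustration of how such indicators map a solution set to a real number, the sketch below computes hypervolume for the bi-objective case and inverted generational distance against a given reference set. It is a simplified sketch under stated assumptions: both objectives are minimized, the reference point and reference set are hypothetical values chosen for the example, and the function names are ours rather than those of any particular library.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def hypervolume_2d(front: List[Point], ref: Point) -> float:
    """Area dominated by `front` and bounded by the reference point `ref`.

    Both objectives are minimized; points that do not strictly improve on
    `ref` in both objectives contribute nothing.
    """
    pts = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    area, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:                 # sweep in increasing order of f1
        if f2 < best_f2:               # dominated points add no new area
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


def igd(front: List[Point], reference_set: List[Point]) -> float:
    """Mean distance from each reference point to its nearest point in `front`."""
    nearest = (min(math.dist(r, p) for p in front) for r in reference_set)
    return sum(nearest) / len(reference_set)


# Hypothetical solution set and reference data.
front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
hv_small_ref = hypervolume_2d(front, ref=(4.0, 4.0))   # 6.0
hv_large_ref = hypervolume_2d(front, ref=(5.0, 5.0))   # 13.0
igd_single = igd([(2.0, 2.0)], reference_set=front)    # about 0.943
print(hv_small_ref, hv_large_ref, igd_single)
```

Two things are visible even in this small example: the hypervolume value depends on where the reference point is placed, and the inverted generational distance cannot be computed at all without a reference set standing in for the Pareto front.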
More importantly, specific SE problems usually have their own nature and requirements. Simply following indicators that were designed for general Pareto-based optimization may fail to reflect these requirements. Take the software product line configuration problem as an example. In this problem, the objective of a product’s correctness is always prioritized above other objectives (e.g., cost and richness of features). Rating these objectives equally by using general indicators (which in fact has been commonly practised in the literature [108, 110, 87, 57]) may return meaningless solutions to the DM, i.e., invalid products with good performance on the other objectives. This situation also applies to the test case generation problem, where the DM may first favor full code coverage and then other objectives (e.g., low cost).
Moreover, some SE problems may be associated with the DM’s explicit/implicit assumptions or preferences. Researchers are expected to select indicators with these assumptions/preferences in mind. For example, in many SE scenarios, the DM may prefer well-balanced trade-off solutions (i.e., knee points on the Pareto front) between conflicting objectives. In particular, when optimizing conflicting non-functional qualities of a software system (e.g., latency and energy consumption) via configuration and adaptation, knee points are typically the most preferred solutions, as in such cases it is often too difficult, if not impossible, to explicitly quantify the relative importance of the objectives. Under this circumstance, quality indicators that treat all points on the Pareto front equally may not be able to reflect this preference, despite the fact that they have been frequently used in such scenarios [32, 83].
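To give an intuition of what a knee point is, the sketch below uses one common heuristic for the bi-objective case: normalize the front and pick the point that lies farthest below the straight line joining the two extreme solutions. This is an assumption made for illustration, not a definition taken from the reviewed studies; the function name and the (latency, energy) values are hypothetical, and both objectives are assumed to be minimized.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def knee_point(front: List[Point]) -> Point:
    """Pick the knee of a bi-objective non-dominated front (minimization).

    After normalization the two extreme points sit at (0, 1) and (1, 0),
    so the line joining them is x + y = 1. The knee is taken to be the
    point with the largest value of 1 - x - y, i.e., the one bulging most
    toward the ideal point.
    """
    pts = sorted(front)
    (f1_min, f2_max), (f1_max, f2_min) = pts[0], pts[-1]
    if f1_max == f1_min or f2_max == f2_min:
        raise ValueError("front needs two distinct extreme points")

    def bulge(p: Point) -> float:
        x = (p[0] - f1_min) / (f1_max - f1_min)
        y = (p[1] - f2_min) / (f2_max - f2_min)
        return 1.0 - x - y

    return max(pts, key=bulge)


# Hypothetical (latency, energy) trade-offs of four configurations.
front = [(1.0, 9.0), (2.0, 4.0), (5.0, 3.0), (9.0, 1.0)]
print(knee_point(front))  # (2.0, 4.0)
```

Here the second configuration is selected: a small increase in latency relative to the first extreme buys a large reduction in energy, which is the kind of balanced trade-off a DM without explicit weights tends to favor.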
Finally, the selection of quality indicators in multi-objective optimization is itself a non-trivial task. Each indicator has its own specific quality implication, and the variety of indicators in the literature can easily overwhelm researchers and practitioners in the field. On the one hand, an accurate categorization of quality indicators is of high importance. Failing to do so can easily result in a misleading understanding of search algorithms’ behaviour, see [68]. On the other hand, even within the same category, different indicators have distinct quality implications, e.g., one indicator may prefer uniformly distributed solutions while another favors knee solutions. A careful selection needs to be made to ensure that the considered quality indicators are in line with the DM’s preferences. In addition, many quality indicators involve critical parameters (e.g., a reference point). It remains unclear how to properly set these parameters under different circumstances, particularly in the presence of the DM’s preferences.
Given the above, this paper aims to provide systematic and methodological guidance on selecting and using evaluation methods and quality indicators in various Pareto-based SBSE scenarios. Such guidance is of high practical value to the SE community, as research from the well-established community of multi-objective optimization may still be relatively new to SE researchers and practitioners. To do so, we start by providing a systematic survey of Pareto-based SBSE problems across all phases of the software development life cycle (SDLC), along with their problem nature/DM’s preferences and the considered quality indicators (Sections 2 and 3). The survey has covered 717 searched papers published between 2009 and 2019, in 36 venues from 7 repositories, leading to 97 prominent primary studies in the SBSE community. This is followed by a critical review of the selection and use of evaluation methods/quality indicators in those primary studies, based on which we identify 5 important issues that have been significantly overlooked (Section 4). Then, we carry out an in-depth analysis of frequently used quality indicators in the area (Section 5), in order to make it clear which indicators fit which situation. Finally, to mitigate the identified issues in future SBSE work, we provide methodological guidance and a procedure for selecting, adjusting and using evaluation methods in various SBSE scenarios (Section 6), based upon the fundamental goal of multi-objective optimization: supplying the DM with a set of solutions that are the most consistent with their preferences.
2 Review Methodology
To understand existing issues with the quality evaluation methods in the SBSE field, we conducted a systematic literature review covering the papers published between 2009 and 2019. The review methodology follows the best practice of systematic literature review for software engineering [53], consisting of a clear search strategy, inclusion/exclusion criteria, a formal data collection process and a pragmatic classification. Our review has three goals: (i) to overview what types of SBSE problems have been solved by using Pareto-based optimization; (ii) to understand the assumptions made about the DM’s preferences and the evaluation methods used for those problems; (iii) to uncover important but overlooked issues in evaluating solution sets under different SBSE problems.
2.1 Literature Search Strategy
All the studied papers in this work were carefully selected from a wide range of scientific literature sources, including ACM Library, IEEE Xplore, ScienceDirect, SpringerLink, Google Scholar, DBLP and the SBSE repository maintained by the CREST research group at UCL.
The defined search string aims to cover a variety of multiobjective problems’ nature and the application domain. Synonyms and keywords were properly linked via logical operators (AND, OR) to build the search term. The final search string is shown as below:
(“multi objective” OR “multi criteria” OR “Pareto based” OR “non dominated” OR “Pareto front”) AND “search based software engineering” AND optimization
When necessary, the search string was adapted to the particular scientific literature source. For each source, different semantically equivalent versions of the string were tried in pilot searches, e.g., with and without hyphens. Then, the one that returned the highest number of results was used.
Based on the results returned from the search, we conducted snowballing and manually examined the title and abstract of each paper, in order to exclude those that are not relevant to the goals of our review, e.g., those that are neither relevant to SBSE nor focused on multi-objective optimization. In this way, we aim to reduce the found references to a much smaller and more concise set, namely the candidate studies. Finally, by using the inclusion and exclusion criteria (see Section 2.2), we carried out a rigorous review of the candidate papers and kept only the most relevant ones, which are termed primary studies.
Journal  Search  Candidate  Primary
ACM Transactions on Software Engineering and Methodology  15  10  7 
Elsevier Information and Software Technology  65  33  6 
Elsevier Applied Soft Computing  17  4  0 
Springer Automated Software Engineering  16  6  2 
IEEE Transactions on Software Engineering  79  41  9 
Springer Empirical Software Engineering  37  17  3 
Elsevier Future Generation Computing Systems  3  3  1 
Springer Soft Computing  11  4  0 
IEEE Transactions on Evolutionary Computation  13  3  2 
IEEE Transactions on Services Computing  6  3  3 
Elsevier Journal of Systems and Software  70  35  8 
Elsevier Information Sciences  18  7  4 
Springer Requirements Engineering  5  3  1 
Springer Software Quality Journal  12  7  2 
Wiley Software Testing, Verification and Reliability  4  2  1 
Wiley Software: Practice and Experience  8  4  1 
Springer Software and Systems Modeling  2  1  1 
IEEE Transactions on Systems, Man, and Cybernetics  2  2  1 
Conference, Symposium and Congress  
IEEE/ACM Conference on Software Engineering  30  12  8 
Springer Symposium on Search Based Software Engineering  110  33  5 
IEEE Congress on Evolutionary Computation  21  8  0 
IEEE/ACM Conference on Automated Software Engineering  21  8  5 
ACM Conference and Symposium on the Foundations of Software Engineering  13  3  0 
ACM Genetic and Evolutionary Computation Conference  57  33  11 
IEEE Conference on Software Testing, Verification and Validation  20  11  3 
ACM Symposium on Software Testing and Analysis  8  5  3 
IEEE/ACM Conference on Empirical Software Engineering and Measurements  1  0  0 
ACM Systems and Software Product Line Conference  11  3  1 
IEEE Conference on Web Services  2  2  1 
IEEE Conference on Software Maintenance  15  1  1 
IEEE Conference on Software Maintenance and Reengineering  4  3  1 
IEEE Workshop on Combining Modelling and SearchBased Software Engineering  5  3  1 
IEEE Conference on Requirements Engineering  7  3  1 
ACM Conference on Performance Engineering  2  2  2 
IEEE/ACM Conference on Program Comprehension  3  2  1 
IEEE Conference on Software Architecture  4  2  1 
Total  717  319  97 
2.2 Inclusion and Exclusion Criteria
For all the candidate studies identified, we first extract the primary studies by using the inclusion criteria as below; papers meeting all of the criteria were temporarily chosen as the primary studies:

The paper should be primarily focused on, or have a section that discusses, a Pareto-based multi-objective solution to the SBSE problem. This means we do not consider papers whose multi-objective treatment relies on objective aggregation (e.g., weighted sum), unless they have explicitly compared the solution against a Pareto-based multi-objective solution. This is reasonable because, if a clear aggregation of objectives can be defined, there is almost no need to select quality indicators; one can rely on the said aggregation to obtain a utility value for comparison.

The paper should explicitly or implicitly discuss, or at least make assumptions about, the DM’s preferences between the objectives for the SBSE problem in hand. Note that this also includes the assumption of no preference.

The SBSE problem in hand should be framed into one phase of the SDLC process.

There is no restriction about a particular search algorithm that is used to solve the problem.

The paper should include quantitative experiment results with clear instruction on how the results were obtained.

There is no restriction on which quality indicator(s) and methods are used to assess the experiment results.
Subsequently, papers meeting any of the exclusion criteria below are filtered out from the temporary primary studies:

The paper neither explicitly nor implicitly mentions SBSE, where the computational search is the key.

The paper is not well-recognized or widely followed. We used the citation information from Google Scholar as a single metric to assess the popularity of a paper. In particular, we follow a pragmatic strategy: a paper is counted in if it has at least 5 citations per year since its year of publication, e.g., a 2010 paper would be expected to have at least 45 citations. The only exception is for papers published in the year of writing this article (i.e., 2019): we consider those that have been published for less than 6 months and have been cited more than once, together with the pre-press papers that have not yet been given an issue number, regardless of their citation counts.

The paper is a short paper, i.e., shorter than 8 pages (double column) or 15 pages (single column).

The paper is a review, survey or tutorial type of study.

The paper is published in a non-peer-reviewed venue, e.g., arXiv.
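The citation threshold in the exclusion criteria above is simple arithmetic, and can be sketched as follows. The function names are hypothetical, and the sketch deliberately covers only papers published before the review year; the special handling of 2019 papers (publication age and pre-press status) is not modeled here.

```python
REVIEW_YEAR = 2019        # year in which the review was written
CITATIONS_PER_YEAR = 5    # pragmatic popularity threshold


def required_citations(pub_year: int) -> int:
    """Minimum citation count expected of a paper published in `pub_year`."""
    if pub_year >= REVIEW_YEAR:
        raise ValueError("papers from the review year follow a separate rule")
    return CITATIONS_PER_YEAR * (REVIEW_YEAR - pub_year)


def is_widely_followed(pub_year: int, citations: int) -> bool:
    """True if the paper meets the citation threshold for its age."""
    return citations >= required_citations(pub_year)


print(required_citations(2010))        # 45
print(is_widely_followed(2015, 18))    # False, since 20 citations are required
```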
Finally, if multiple papers of the same research work are found, we applied the following process to determine if they should all be considered as primary studies (after considering both inclusion and exclusion criteria). The same procedure is applied if the same authors have published different papers for the same SBSE approach, and thereby only significant contributions are analyzed for the review.

All papers are considered if they report on the same SBSE problem but different solutions.

All papers are considered if they report on the same SBSE problem and solutions, but have different assumptions about the DM’s preference, nature of the problem or new findings about the problem.

When the above two points do not hold, only the latest version or the extended journal paper is considered.
3 Results of the Review
As shown in Table I, after the execution of the search process and the removal of redundancy, a total of 717 searched studies from different well-known journals and conferences were discovered. Then, the results were filtered based on the titles and abstracts, which led to 319 candidate studies. Finally, by using the inclusion and exclusion criteria (see Section 2.2), we carried out a rigorous review of the candidate papers, which further brought the number down to 97 primary studies. Error checks and investigations were also conducted to correct any issues found during the search procedure. It is worth noting that we include not only papers published in software engineering venues, but also relevant ones published in services, system and cloud engineering conferences/journals as well as those in computational intelligence venues, as long as they are related to problems in the software engineering domain and comply with the inclusion/exclusion criteria.
Next, we analyze the results obtained from our systematic literature review, which further motivate the remainder of our work.
3.1 Pareto-based SBSE Problems
The results of the systematic review reveal that Pareto-based SBSE problems are spread across all phases of the SDLC. From Table II, we can clearly see that certain problems have attracted more attention than others, as evidenced by the much higher number of primary studies included, such as the software product line and the white/black box test case generation problems. Notably, the software testing phase and the deployment and maintenance phase contain much more diverse problems than the other SDLC phases. This is probably because the nature of those problems, which usually belong to the later phases of the SDLC, fits the requirements of search-based optimization well.
Indeed, some of the Pareto-based SBSE problems can arguably fit into more than one phase of the SDLC; in this work, we classify those problems according to the phase that best matches the detailed formalization of the problem and the hypotheses that the authors made. Further, certain problems in the deployment and maintenance phase are not classic software engineering problems (e.g., resource management and service composition); however, they have recently attracted more and more attention from software engineering researchers and have been increasingly considered as important issues in the software engineering domain [42].
SDLC Phase  SBSE Problem  Description  Primary Studies

Planning  Effort Estimation  Optimize, e.g., accuracy and confidence interval, by changing the number of measured samples.  2: [82][106]
Planning  Project Scheduling  Optimize, e.g., duration and cost, by assigning employees to the tasks of a software project.  8: [116][115][105][38][30][21][13][116]
Requirement Analysis  Requirement Assignment  Optimize, e.g., completeness and familiarity, by assigning requirements to different stakeholders for their reviews.  1: [70]
Requirement Analysis  Next Release Problem  Optimize, e.g., robustness and cost, by selecting stakeholders’ requirements in the next release of software.  8: [31][27][41][63][135][136][13][137]
Design  Software Modeling and Architecting  Optimize, e.g., cohesion and coupling, by modeling the object-oriented concept of the software and its architecture using standard notations.  7: [102][7][118][119][44][9][52]
Design  Software Product Line  Optimize, e.g., correctness and richness, by finding the concrete products from the feature model.  15: [108][109][110][47][45][87][133][127][13][39][132][59][104][71][46]
Implementation  Library Recommendation  Optimize, e.g., library linked-usage and semantic similarity, by prioritizing the libraries that meet the required functionality to be used in the codebase.  1: [93]
Implementation  Program Improvement  Optimize, e.g., execution time and number of instructions, by producing semantically preserved software code.  1: [130]
Implementation  Software Modularization  Optimize, e.g., modularization quality, cohesion and coupling, by placing different classes of code into different clusters.  5: [58][5][101][32][6]
Testing  Code Smell Detection  Optimize, e.g., coverage of bad examples and detection of good examples, by identifying the code and modules that could potentially cause issues.  1: [75]
Testing  Defect Prediction  Optimize, e.g., effectiveness and cost, by adjusting the source code components to be predicted by the model.  4: [12][11][20][86]
Testing  Test Case Prioritization  Optimize, e.g., coverage of the code and cost of test, by ordering the test cases to be tested.  7: [4][96][78][113][28][99][117]
Testing  White Box Test Case Generation  Optimize, e.g., coverage of the code and cost of test, by identifying the test cases, inputs and test suites based on internal information of the software.  9: [3][134][138][77][51][29][94][50][95]
Testing  Black Box Test Case Generation  Optimize, e.g., length of the inputs, distance to the ideal inputs, and cost of test, by identifying the test cases, inputs and test suites without internal information about the software.  3: [114][100][2]
Deployment and Maintenance  Resource Management  Optimize, e.g., response time and cost, by changing the supported software and hardware resources such as in the Cloud environment.  3: [34][61][15]
Deployment and Maintenance  Software Configuration and Adaptation  Optimize, e.g., response time and energy consumption, by changing software specific configurations, structure and connectors at design time or runtime.  7: [98][36][79][60][1][16][10]
Deployment and Maintenance  Program Manipulation  Optimize, e.g., response time and memory consumption, by changing the parametrized variables within the program code.  1: [131]
Deployment and Maintenance  Service Composition  Optimize, e.g., latency and cost, by mapping the concrete services into abstract services within a workflow.  4: [125][124][121][18]
Deployment and Maintenance  Log Template Identification  Optimize, e.g., the frequency and specificity of the log message matched to a log template.  1: [80]
Deployment and Maintenance  Workflow Scheduling  Optimize, e.g., makespan and energy consumption, by assigning activities into a given application workflow.  1: [26]
Deployment and Maintenance  Software Refactoring  Optimize, e.g., number of defects found and semantics, by changing the design model or the program code.  10: [92][84][88][89][90][91][76][37][85][83]
3.2 Assumptions on Preferences of Problems
In Table III, we summarize the assumptions on the preferences of objectives for each Pareto-based SBSE problem reviewed, and the evaluation methods used to compare different solution sets. As we can see, a number of SBSE problems have concrete assumptions on the DM’s preferences and characteristics, while others do not clearly state the assumptions in this regard (noted as not specified). One of the most notable observations is that a particular SBSE problem may have multiple, distinct assumptions about the DM’s preferences. In fact, most of the problems have more than one assumption on the preferences; in particular, the project scheduling problem and the software configuration/adaptation problem involve up to four assumptions on the preferences. This reflects the fact that many problems are complex and the actual preferences can be situation-dependent. Yet, it does bring the requirement that all those situations be catered for, particularly when selecting and designing the quality evaluation methods. In contrast, problems such as effort estimation and requirement assignment have assumed only one type of preference, which implies a more straightforward selection and use in the quality evaluation process.
3.3 Evaluation Methods
According to the results from Table III, there are two ways of evaluating the quality of solution sets: (i) using quality indicators or (ii) using illustrative and descriptive statistical methods. Formally, a quality indicator is a metric that maps a set of solutions (i.e., solution vectors) to a real number that indicates one or several aspects of solution set quality [69], e.g., how close the set is to the Pareto front, how well the set covers the Pareto front, and how evenly solutions are distributed in the set. These quality indicators can be either problem-generic or problem-specific. As can be seen from Table III, a few generic indicators are the most widely used across almost all the SBSE problems, due presumably to their popularity as well as “inertia” (i.e., researchers tend to use indicators which were used before even though they are not the best fit) [69]. In particular, the test case prioritization and software product line problems involve the most diverse set of generic quality indicators according to the primary studies. There are also numerous problem-specific quality indicators, e.g., MoJoFM is a commonly used symmetric indicator for the software modularization problem, which aims to compare two resulting partitions of classes (i.e., two solutions) and is thus inapplicable to other optimization scenarios.
Apart from the quality indicators, a considerable number of primary studies have used illustrative and descriptive statistical methods. Two methods are most notable: plotting all the objective values of all solutions in the solution set, which we call Solution Set Plotting, and summarizing the direct statistical results of objective values, e.g., the best/mean/median of the solution set on each objective, which we call Descriptive Objective Evaluation for distinction.
SBSE Problem  Assumptions of Preferences on Problem  Quality Indicators and Evaluation Methods 

Effort Estimation  Not specified [82] [106]  , , , , 
Project Scheduling  Not specified [13] [116]  , , , , , 
Prefer solutions that favor certain objectives using Analytic Hierarchy Process [115]  , , , , , ,  
Prefer knee solutions [105] [38] [30]  , , , ,  
Prefer widely distributed solutions [21]  ,  
Requirement Assignment  Not specified [70]  , , 
Next Release Problem  Not specified [31] [27] [41] [63] [135] [13] [137]  , , , , , , , % of included requirements 
Prefer extreme solutions [136]  
Software Modeling and Architecting  Not specified [102] [7] [118]  , , , , % of within-range solutions, % of equivalent solutions 
Prefer knee solutions [52]  average correction, manual correction, recall, precision  
Prefer solutions that favor certain objectives as ranked by users [119]  
Prefer solutions that meet preferences in, e.g., the requirement documentations or the goal model [44] [9]  
Software Product Line  Not specified [132]  , , 
Prefer solutions that favor correctness objective over the others [108] [109] [110] [47] [45] [87] [133][46] [13] [39] [59] [104] [71]  , , , , indicator, , , , , number of required evaluations to find a valid solution, % of valid solutions  
Prefer balanced solutions [127]  ,  
Library Recommendation  Not specified [93]  , , , , accuracy, precision, recall 
Program Improvement  Prefer the program validity [130]  
Software Modularization  Prefer solutions that favor modularization quality objective over the others [58] [5] [101]  , , MoJoFM 
Prefer knee solutions [32] [6]  , , , precision, recall, manual precision, difficulty to perform task by human, possibility of manually fix the bug in solution by human, possibility of manually adapt the solution by human  
Code Smell Detection  Not specified [75]  , , precision, recall 
Defect Prediction  Not specified [12] [11] [20] [86]  , , , , precision, recall, AUC, cost of code inspection 
Test Case Prioritization  Not specified [4] [96] [78] [113] [28] [99] [117]  , , , , , , indicator, , , , , % of detected faults 
White Box Test Case Generation  Not specified [3] [134] [77] [29] [95]  , , , , , , 
Prefer solutions that favor certain objectives as ranked by users, i.e., reference point [50] [51]  , , , Average number of solutions in the region of interest  
Prefer solutions that favor coverage objective over the others [138] [94]  ,  
Black Box Test Case Generation  Not specified [100] [2]  , , , 
Prefer solutions that favor coverage objective over the others [114]  , p-measure  
Resource Management  Not specified [34]  , 
Prefer knee solutions [61] [15]  , , , , elasticity  
Software Configuration and Adaptation  Not specified [98] [36]  , , indicator, , , 
Prefer solutions that meet preferences from the natural descriptions of the stakeholders [79] [60] [1]  , expected value of total perfect information  
Prefer knee solutions [16]  , , , % of valid solutions  
Prefer robust solutions around a given region [10]  , modified indicator and according to problem nature  
Program Manipulation  Not specified [131]  , , , 
Service Composition  Not specified [125] [18]  , , , indicator 
Prefer extreme solutions [124] [121]  , , , coefficient of variation of objective values  
Log Template Identification  Prefer knee solutions [80]  , precision, recall, f-measure 
Workflow Scheduling  Not specified [26]  , , 
Software Refactoring  Not specified [92] [84] [88] [89] [90] [91] [76] [37] [85]  , , , precision, recall, defect correction ratio, reused refactoring, usefulness by human, % of fixed code smells, code change score, manual precision, quality gains, medium value of refactoring 
Prefer knee solutions [83]  , , , precision, recall, manual precision, quality gain, defect correction ratio, number of suggested refactoring, usefulness by human 

The acronyms are: Attainment Surface () [56], Parallel Coordinates () [67], Contribution Indicator () [81], (a.k.a. ) [142], Generational Distance () [123], Spread (including all variants) [25], Spacing () [111], Nondominated Front Size (), Indicator [141], Inverted Generational Distance () [23], Hypervolume () [142], Mean Fitness Value () [141], Descriptive Objective Evaluation (), Solution Set Plotting (). All problem-specific indicators are listed in full and marked as .
4 Issues on Quality Evaluation in Pareto-based SBSE
Based on our systematic literature review, this section provides a systematic analysis of 5 identified issues of quality evaluation, classified into two categories, from state-of-the-art Pareto-based SBSE work.
4.1 Problematic Use of Illustrative and Descriptive Statistic Evaluation Methods
As shown in Table III, many primary studies of Pareto-based SBSE, particularly in the early days, relied on plotting the returned solution set and/or reporting some descriptive statistical results to reflect the quality of solution sets. Despite being simple to apply, these methods may easily lead to inaccurate evaluations and conclusions.
4.1.1 Inadequacy of Solution Set Plotting
A straightforward way to evaluate/compare the quality of solution sets returned by search algorithms is to plot solution sets and judge intuitively how good they are. Such visual comparison is among the most frequently used methods in SBSE, but it may not be very practical in many cases.
First, it does not scale well: when the number of objectives is larger than three, direct observation of solution sets (by scatter plot) is unavailable. Second, visual comparison fails to quantify the difference between solution sets. Finally, when an algorithm involves stochastic elements, different runs usually result in different solution sets, so it may not be easy to decide which run should be considered, and plotting the solution sets obtained in all the runs can easily clutter the picture. As such, plotting solution sets does not suffice for quality evaluation in Pareto-based SBSE, despite the fact that it has been used as the sole means of comparing solution sets in many papers, e.g., [38, 136, 44, 9, 130, 61, 79]. Nevertheless, it is worth mentioning that plotting solution sets is useful as an extra evaluation method in addition to quality indicators, particularly in bi- and tri-objective cases. This will be discussed in the guidance section (Section 6) later on.
Descriptive Objective Evaluation Method  Used in 

Mean Fitness Value (MFV)  2: [70] [127] 
Analytic Hierarchy Process (AHP)  2: [115] [116] 
Mean, best, worst, median and/or statistical result of each objective for solutions in the population  26: [82] [7] [127] [58] [5] [18] [124] [32] [51] [26] [77] [101] [131] [63] [105] [118] [119] [108] [29] [98] [6] [100] [95] [70, 127, 113] 
Best of one objective over the population while another is below certain thresholds  1: [59] 

The mean over all repeated runs is reported.
Inappropriate Use of Descriptive Objective Evaluation (DOE)
Many Pareto-based SBSE studies evaluate solution sets by DOE, i.e., by descriptive statistics of the objective values in the obtained solution set(s). For example, as can be seen in Table IV, the mean objective value was considered in [105, 7, 118, 119, 108, 5, 98, 100, 95, 6]; the median value in [5, 32, 51]; the best value in [82, 7, 58, 101, 77, 131, 124, 26, 18]; the worst value in [7, 124]; the statistical significance between different solution sets’ objective values in [70, 127, 113]. Such measures need to be used in line with the DM’s preference. For example, comparing the best value of each objective can evaluate solution sets well if the DM prefers the extreme points (solutions), but is not well suited when balanced points are wanted, which, unfortunately, was practised in some studies such as [127]. Worse still, many measures may give a misleading evaluation, including those comparing the mean, median and worst values of each objective and those comparing statistically significant differences on each objective. That is to say, a measure may evaluate one solution set as better than another although the latter is always preferred by the DM under any circumstances. Figure 1 gives such an example (minimization) with respect to calculating the mean of each objective. As shown, the mean of one solution set on either objective is larger than that of the other set, so the former is regarded as inferior. Yet the former will always be favored by the DM, since it contains one solution that is better than any solution in the latter.
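The pitfall of comparing per-objective means can be reproduced with a few lines of Python. The sketch below is only illustrative: the two sets `A` and `B` and their values are hypothetical (they are not the values of Figure 1), and `dominates` and `objective_means` are helper names introduced here for a minimization scenario.

```python
from statistics import mean

def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def objective_means(solutions):
    """Mean value of each objective over a solution set."""
    return tuple(mean(column) for column in zip(*solutions))

# Hypothetical solution sets (assumed values, minimization on both objectives).
A = [(1.0, 1.0), (9.0, 9.0), (9.0, 9.0)]
B = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]

means_a = objective_means(A)
means_b = objective_means(B)

# The mean-based measure rates B as better on every objective ...
mean_prefers_b = all(mb < ma for ma, mb in zip(means_a, means_b))
# ... although one solution of A dominates every solution of B.
best_of_a_dominates_b = all(dominates(A[0], b) for b in B)

print(means_a, means_b, mean_prefers_b, best_of_a_dominates_b)
```

Under these assumed values, a DM would never pick a solution from `B`, yet the mean of each objective suggests the opposite.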
On the other hand, some primary studies selected one particular solution (by using a decision-making method) from the whole solution set produced by the Pareto-based search for comparison. For example, the studies in [70, 127] considered Mean Fitness Value (MFV) and the studies in [115, 116] considered the Analytic Hierarchy Process (AHP) [35]. However, this raises a question: if the DM’s preferences/weights for each objective are clearly known (so that only one solution from the whole set needs to be taken into account), why not directly integrate this information into the problem model, thus converting the multi-objective problem into an easier single-objective problem in the first place?
4.2 Problematic Use of Quality Indicators
It is common in Pareto-based SBSE to select or use quality indicators that cannot accurately reflect the quality of solution sets. This is mainly because the behavior, role and characteristics of indicators may not be well understood. As a result, studies either fail to select appropriate indicators to evaluate the generic quality of solution sets, or fail to align the considered indicators with the problem’s nature or the DM’s preferences.
Confusion of the Quality Aspects Covered by Quality Indicators
In general, the generic quality of a solution set in Pareto-based optimization can be interpreted as how well it represents the Pareto front. It can be broken down into four aspects: convergence, spread, uniformity, and cardinality [69]. Convergence of a solution set refers to how close the set is to the Pareto front. Spread refers to how large a region the set covers. Uniformity refers to how even the solution distribution is in the set. Spread and uniformity are closely related, and collectively they are often known as the set’s diversity. Cardinality refers to how many (nondominated) solutions are in the set. When the DM’s preferences are unknown a priori, an indicator (or a combination of indicators) is expected to cover all four quality aspects, since a solution set with these qualities can represent the Pareto front well and has a high probability of being preferred by the DM.
Unfortunately, in SBSE many studies only consider part of these quality aspects. For example, the studies in [41, 63] used the convergence indicator GD [123] as the sole indicator to compare the solution sets. The study in [31] relied completely on CI [81], an indicator that partially reflects the quality of convergence and cardinality [69]. The study in [134] considered two indicators, one of them from [45], which however are merely for convergence and cardinality. In addition, some indicators were used to evaluate quality aspects that they were not designed for, as shown in [68]. For example, the C indicator [142], designed for convergence evaluation, was considered for evaluating spread in [128]. SP [111], which can only reflect the uniformity of solution sets, was used to evaluate the diversity (i.e., both spread and uniformity) in [102]. NFS, which counts nondominated solutions in the set, was placed into the category of diversity indicators in [45, 128], and the $\epsilon$-indicator, which is able to reflect all the quality aspects of solution sets, was placed into the category of convergence indicators in [128].
Some indicators were used incorrectly. For example, the indicator Spread (i.e., $\Delta$ in [25]) as well as its variants (e.g., the one in [139]), which are only effective in the bi-objective case, were frequently used in optimization problems with three or more objectives, such as in [116, 115, 30, 105, 108, 110, 93, 45, 2]. Another example is the setting of the critical parameter of the HV indicator, the reference point, which has varied across studies. For example, some studies set it to the worst value obtained for each objective during all runs [116, 115, 70, 87, 29, 98, 121]; some set it precisely to the boundaries of the optimization problem [47, 133, 26]; some set it to the nadir point of the Pareto front [96]. The first two settings may overemphasize the boundary solutions (as the reference point may be far away from the set to be evaluated), while the last one may lead to the boundary solutions contributing nothing to the HV value.
It is worth mentioning that, as the problem’s Pareto front in SBSE is usually unavailable, for indicators which need the Pareto front for reference, a common practice is to take the nondominated set of all the solutions produced as an estimated Pareto front. However, different indicators have different sensitivity to this practice. For example, IGD and Spread require a Pareto front consisting of uniformly distributed points, while GD and the $\epsilon$-indicator do not [69]. Therefore, IGD and Spread may not be very suitable in SBSE, despite the fact that they were frequently used, e.g., in [116, 115, 105, 30, 27, 135, 45].
Oblivion of Context Information
In Pareto-based SBSE, many studies compare solution sets without bearing in mind the context information of the considered optimization problem. They typically adopt commonly used quality indicators to directly evaluate the set of all the solutions obtained, although some of these solutions may never or rarely be of interest to the DM. Figure 2 shows such an example, borrowed from [68], under a scenario of optimizing the code coverage and the cost of testing time on the software test case generation problem. As can be seen, the set $A$ is evaluated better than the set $B$ by all eight quality indicators commonly used in SBSE [128] (GD [123], the indicator in [22], the $\epsilon$-indicator [141], the indicator in [139], the indicator in [45], IGD [23], HV [142] and C [142]). However, depending on contexts as shown in Table III, the DM might first favor full code coverage and then the lowest possible cost [138, 94]. This would make set $B$ of more interest, as it has the solution that achieves full coverage at a lower cost than the one in $A$.
Similar observations have been made in optimal product selection in software product lines [109, 110, 45, 122, 13, 39, 59, 104, 71], where the correctness of configurations is regarded as one objective and rated equally with other objectives (e.g., richness of features and cost). This may lead to an invalid product being evaluated better than a valid product if the former performs better on other objectives, which is apparently of no value to the DM. In addition, in many SBSE problems, cost could be an objective to minimize, but solutions with zero cost are trivial, e.g., the solution with zero cost and zero coverage in Figure 2. Nevertheless, these solutions may largely affect the evaluation results. Therefore, it is necessary to remove solutions that would never be of interest to the DM before the evaluation, which, unfortunately, has rarely been practised in Pareto-based SBSE.
Noncompliance of the DM’s Preferences
Although every quality indicator is designed to reflect certain quality aspect(s) of solution sets (i.e., convergence, spread, uniformity, cardinality, or their combination), they do have their own implicit preferences. For example, the indicators HV and IGD, both designed to cover all of the four quality aspects, have rather distinct preferences. HV prefers knee points of a solution set, while IGD is in favor of a set of uniformly distributed solutions. Therefore, it is important to select indicators whose preferences are in line with the DM’s. Neglecting this can lead to misleading evaluation results. Figure 3 gives such an example: when knee points are preferred, considering the IGD indicator could return a misleading result. That is, the set having knee points is evaluated worse than the one having no knee point. Similar observations also apply to the other indicators shown in the figure.
Unfortunately, such misuse of indicators is not uncommon in the SBSE community. Examples include preferring knee points yet using IGD in [32, 83], or other indicators that do not favor knee points in [30, 105], and preferring extreme solutions yet using IGD and HV in [121, 124]. HV can be somewhat in favor of extreme solutions if the reference point is set far away from the considered set, but IGD certainly does not prefer extreme solutions. Therefore, it is of high importance to understand the behavior, role and characteristics of the considered indicators, which may not be very clear to the community. In the next section, we will detail widely used indicators (as well as other quality evaluation methods) in the area and explain the scope of their applicability.
5 Revisiting Quality Evaluation for Pareto-based Optimization
In Pareto-based optimization, the general goal for the algorithm designer is to supply the DM with a set of solutions from which they can select their preferred one. Clearly, the Pareto dominance relation is the foremost criterion, provided that the concept of optimum is solely based on the direct comparison of solutions’ objective values (rather than on other criteria, e.g., robustness and implementability with respect to decision variables). That is to say, the DM would never prefer a solution to one that dominates it.
The (weak) Pareto dominance relation between solutions is defined as follows. Without loss of generality, let us consider a minimization scenario. For two solutions (objective vectors) $a, b \in Z$ ($Z \subseteq \mathbb{R}^m$, where $m$ denotes the number of objectives), solution $a$ is said to weakly dominate $b$ (denoted as $a \preceq b$) if $a_i \le b_i$ for $1 \le i \le m$. If, in addition, there exists at least one objective $j$ on which $a_j < b_j$, we say that $a$ dominates $b$ (denoted as $a \prec b$). A solution $a \in Z$ is called Pareto optimal if there is no $b \in Z$ that dominates $a$. The set of all Pareto optimal solutions of a multi-objective optimization problem is called its Pareto front.
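The definitions above translate directly into code. The following is a minimal sketch for a minimization scenario; the function names (`weakly_dominates`, `dominates`, `pareto_optimal`) and the sample points are illustrative assumptions, and solutions are represented as tuples of objective values.

```python
def weakly_dominates(a, b):
    """a is no worse than b on every objective (minimization)."""
    return all(x <= y for x, y in zip(a, b))

def dominates(a, b):
    """a weakly dominates b and is strictly better on at least one objective."""
    return weakly_dominates(a, b) and any(x < y for x, y in zip(a, b))

def pareto_optimal(solutions):
    """Solutions not dominated by any other solution in the list (duplicates are kept)."""
    return [s for s in solutions if not any(dominates(t, s) for t in solutions)]

# Hypothetical bi-objective points: (3, 3) is dominated by (2, 2).
points = [(1, 5), (2, 2), (3, 3), (5, 1), (2, 2)]
front = pareto_optimal(points)
print(front)
```

Note that duplicate solutions weakly dominate each other without dominating each other, so both copies of `(2, 2)` survive the filter.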
The above relations between solutions can immediately be extended to relations between sets. Let $A$ and $B$ be two solution sets.
Relation 1
[Dominance between two sets [141]] We say that $A$ dominates $B$ (denoted as $A \prec B$) if for every solution $b \in B$ there exists at least one solution $a \in A$ that dominates $b$.
Relation 2
[Weak Dominance between two sets [141]] We say that $A$ weakly dominates $B$ (denoted as $A \preceq B$) if for every solution $b \in B$ there exists at least one solution $a \in A$ that weakly dominates $b$.
We can see that the weak dominance relation between two sets does not rule out their equality, while the dominance relation does, but it also rules out the case where the two sets share some solutions. Thus, we need another relation to define that $A$ is generally better than $B$.
Relation 3
[Better relation between two sets [141]] We say that $A$ is better than $B$ (denoted as $A \lhd B$) if for every solution $b \in B$ there exists at least one solution $a \in A$ that weakly dominates $b$, but there exists at least one solution in $A$ that is not weakly dominated by any solution in $B$.
The better relation represents the most general and weakest form of superiority between two sets; namely, $A \lhd B$ indicates that $A$ is at least as good as $B$, while $B$ is not as good as $A$. It meets any preference potentially articulated by the DM. If $A \lhd B$, then it is always safe for the DM to consider only solutions in $A$. Clearly, it is desirable that a quality evaluation method be able to capture this relation; that is to say, for any two solution sets $A$ and $B$, if $A \lhd B$, then $A$ is evaluated better than $B$. Unfortunately, very few quality evaluation methods hold this property; one of them is discussed in [140]. There is a weaker property called being Pareto compliant [141, 55], which is more commonly used in the literature. That is, a quality evaluation method is said to be Pareto compliant if and only if “at least as good” in terms of the dominance relation implies “at least as good” in terms of the evaluation values (i.e., $A \preceq B \Rightarrow I(A) \le I(B)$, where $I$ is the evaluation method, assuming that the smaller the better). Many quality indicators are not Pareto compliant, including widely used ones such as GD, IGD and Spread, among others. Pareto compliant indicators are mainly those falling into the category of evaluating convergence of solution sets and the category of evaluating comprehensive quality of solution sets (e.g., HV, the $\epsilon$-indicator, the indicators in [8] and [40], and PCI [66]). DCI [64] is the only known diversity indicator compliant with Pareto dominance when comparing two sets. In addition, some non-compliant indicators can become Pareto compliant after some modifications. For example, GD and IGD can be transformed into two Pareto compliant indicators (called GD+ and IGD+) if considering a “superiority” distance instead of the Euclidean distance between points [49]. Overall, it is highly recommended to consider (at least) Pareto compliant quality indicators to evaluate solution sets; otherwise, the evaluation may violate the basic assumption of the DM’s preferences.
That is, it may recommend to the DM a solution set in which each solution is inferior to, or can be replaced by (in the case of equality), some solution in the other set. This is what comparing the mean on each objective did in the example of Figure 1.
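The three set relations can be sketched in a few lines. This is an illustrative sketch for minimization, not a reference implementation; the function names and the example sets `A` and `B` are assumptions introduced here.

```python
def weakly_dominates(a, b):
    return all(x <= y for x, y in zip(a, b))

def dominates(a, b):
    return weakly_dominates(a, b) and any(x < y for x, y in zip(a, b))

def set_dominates(A, B):
    """Relation 1: every b in B is dominated by some a in A."""
    return all(any(dominates(a, b) for a in A) for b in B)

def set_weakly_dominates(A, B):
    """Relation 2: every b in B is weakly dominated by some a in A."""
    return all(any(weakly_dominates(a, b) for a in A) for b in B)

def better(A, B):
    """Relation 3: A weakly dominates B, and some a in A is not weakly dominated by any b in B."""
    return set_weakly_dominates(A, B) and any(
        not any(weakly_dominates(b, a) for b in B) for a in A
    )

# Hypothetical sets sharing the solution (1, 3).
A = [(1, 3), (3, 1)]
B = [(1, 3), (4, 2)]
print(set_dominates(A, B), set_weakly_dominates(A, B), better(A, B), better(B, A))
```

In this example `A` is better than `B` although it does not dominate `B`, because the shared solution `(1, 3)` rules out set dominance; a set is never better than itself.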
Now, one may ask why the better relation is not used directly to evaluate solution sets. The reason is that the better relation may leave many solution sets incomparable, since in most cases some solutions from different sets are nondominated to each other. Therefore, we need stronger assumptions about the DM’s preferences, which are reflected by quality evaluation methods. However, stronger assumptions (than the better relation) cannot guarantee that the favored set (under the assumptions) is certainly preferred by the DM, as in different situations the DM may indeed prefer different trade-offs between objectives. Consequently, it is vital to ensure that the considered evaluation methods are in line with the DM’s explicit or implicit preferences.
Back to the example in Figure 2, where the objectives code coverage and cost of testing time are optimized, the two solution sets are essentially not comparable with respect to the better relation, despite the fact that most solutions in $B$ are dominated by some solution in $A$. As stated, the DM may be more interested in full code coverage and then the lowest possible cost, thus preferring $B$ to $A$. However, the considered eight indicators fail to capture this information and give the opposite result. This clearly indicates the importance of understanding quality evaluation methods (including what kind of assumptions they imply). Next, we will review several commonly used quality evaluation methods along these lines.
5.1 Descriptive Objective Evaluation (DOE)
As stated before, DOE methods evaluate a solution set (or several sets obtained by a search algorithm in multiple runs) by directly reporting statistical results of the objective values of its solutions, such as the mean, median, best, worst and statistical significance (in comparison with other sets). Unfortunately, such methods are rarely Pareto compliant and are unlikely to be associated with the DM’s preferences. An exception is the method that considers the best value of some objective(s) in a solution set, since it is Pareto compliant and able to directly reflect the DM’s preferences in the case that they prefer extreme solutions. Overall, descriptive evaluation methods are not recommended, unless the DM explicitly expresses preferences in line with them.
5.2 Contribution Indicator (CI)
The CI indicator [81], which was designed to compare the convergence of two solution sets, has been frequently used in SBSE, e.g., in [106, 30, 105, 135, 87, 96, 134, 131]. CI calculates the ratio of the solutions of a set that are not dominated by any solution in the other set. Formally, given two sets $A$ and $B$,
$$CI(A,B) = \frac{|A \cap B|/2 + |W_A| + |N_A|}{|A \cap B| + |W_A| + |N_A| + |W_B| + |N_B|} \qquad (1)$$
where $W_A$ stands for the set of solutions in $A$ that dominate some solution of $B$ (i.e., $W_A = \{a \in A \mid \exists b \in B: a \prec b\}$), and $N_A$ stands for the set of solutions in $A$ that do not weakly dominate any solution in $B$ and also are not dominated by any solution in $B$ (i.e., $N_A = \{a \in A \mid \nexists b \in B: a \preceq b \vee b \prec a\}$); $W_B$ and $N_B$ are defined symmetrically.
The CI value is in the range of $[0, 1]$.
A higher value is preferable.
It is apparent that $CI(A,B) + CI(B,A) = 1$.
A clear strength of the CI indicator is that it holds the better relation.
A clear weakness of CI is that it relies completely on the dominance relation between solutions, thus providing little information about the extent to which one set outperforms another. Moreover, it may leave many solution sets incomparable if all solutions from the sets are nondominated to each other. This may happen frequently in many-objective optimization, where more objectives are considered.
There is another well-known dominance-based quality indicator (called C or Coverage) [142], used in e.g. [115, 4, 15]. It measures the proportion of solutions in a set that are weakly dominated by some solution in the other set; in other words, the percentage of a set that is covered by its opponent. The details of the indicator can be found in [69]. C tends to be more popular in the multi-objective optimization community, despite sharing the above strengths and weaknesses with CI. Finally, it is worth mentioning that, despite only partially reflecting the convergence of solution sets, such dominance-based indicators are useful since most problems in SBSE are combinatorial ones, where the size of the Pareto front may be relatively small and it is likely to have comparable solutions (i.e., dominated/duplicate solutions) from different sets [69].
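As a concrete illustration of the coverage idea described above, the sketch below computes the proportion of solutions in one set that are weakly dominated by some solution in the other set (minimization). The function name `coverage`, the argument order (the first set covers the second) and the example values are assumptions made for this sketch.

```python
def weakly_dominates(a, b):
    return all(x <= y for x, y in zip(a, b))

def coverage(A, B):
    """Fraction of solutions in B that are weakly dominated by at least one solution in A."""
    covered = sum(1 for b in B if any(weakly_dominates(a, b) for a in A))
    return covered / len(B)

# Hypothetical bi-objective sets.
A = [(1, 4), (2, 2), (4, 1)]
B = [(2, 3), (3, 3), (5, 0)]

c_ab = coverage(A, B)  # two of the three solutions in B are covered by (2, 2)
c_ba = coverage(B, A)  # no solution in A is covered by B
print(c_ab, c_ba)
```

The indicator is binary and not symmetric, so both directions are usually reported. If all solutions of the two sets are mutually nondominated, both values are zero and the sets cannot be distinguished.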
5.3 Generational Distance (GD)
As one of the most widely used convergence indicators in SBSE (used in e.g. [106, 115, 105, 30, 135, 93, 4, 3, 15, 98]), GD [123] measures how close the obtained solution set is to the Pareto front. Since the Pareto front is usually unknown a priori, a reference set $R$, which consists of the nondominated solutions of the collection of solutions obtained by all search algorithms considered, is typically used to represent the Pareto front in practice. Formally, given a solution set $A = \{a_1, a_2, \dots, a_N\}$, GD is defined as
$$GD(A) = \frac{1}{N}\left(\sum_{i=1}^{N} \min_{r \in R} d(a_i, r)^p\right)^{1/p} \qquad (2)$$
where $d(a_i, r)$ means the Euclidean distance between $a_i$ and $r$, and $p$ is a parameter determining what kind of mean of the distances is used, e.g., the quadratic mean or the arithmetic mean.
The GD value is to be minimized and the ideal value is zero, which indicates that the set is precisely on the Pareto front. In the original version, the parameter $p$ was set to 2. Unfortunately, this makes the evaluation value rather sensitive to outliers and also affected by the size of the solution set (when $N \to \infty$, $GD \to 0$ even if the set is far away from the Pareto front [112]). Setting $p = 1$ has now been commonly accepted.
Compared to dominance-based convergence indicators (e.g., CI and C), GD is more accurate in measuring the closeness of solution sets to the Pareto front, as it considers the distance between points. However, a clear weakness of GD is that it is not Pareto compliant [54, 141]. This is very undesirable since GD, as a convergence indicator, fails to provide reliable evaluation results with respect to the weakest assumption of the DM’s preferences. A simple example was given in [141]: consider two solution sets $A$ and $B$ in a bi-objective minimization scenario with a given reference set. Although $A$ dominates $B$, GD returns the opposite result, evaluating $B$ as better than $A$. Recently, a modified GD, called GD+ [49], was proposed to overcome this issue, where the Euclidean distance between $a_i$ and $r$ in Equation (2) is modified by only considering the objectives on which $r$ is superior to $a_i$. Specifically,
$$d^+(a_i, r) = \sqrt{\sum_{j=1}^{m} \left(\max\{a_{ij} - r_j, 0\}\right)^2} \qquad (3)$$
where $m$ denotes the number of objectives, and $a_{ij}$ denotes the value of solution $a_i$ on the $j$th objective. This modification makes the indicator compliant with Pareto dominance. Going back to the above example, GD+ now evaluates $A$ as better than $B$. Finally, note that for both GD and GD+, normalization of solution sets is needed as their calculation involves objective blending [69].
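The difference between GD and its modified version can be seen on a small example. The sketch below assumes the common form of GD in which the $p$-th root of the summed $p$-th powers of the nearest distances is divided by the set size, and a modified distance that only counts objectives on which the reference point is superior (minimization). The reference set `R` and the sets `A` and `B` are hypothetical values chosen for illustration; they are not the example of [141].

```python
import math

def gd(A, R, p=1):
    """Generational distance: mean (p = 1) of the distances from each solution to its nearest reference point."""
    dists = [min(math.dist(a, r) for r in R) for a in A]
    return sum(d ** p for d in dists) ** (1 / p) / len(A)

def d_plus(a, r):
    """Modified distance: only objectives where r is better than a contribute."""
    return math.sqrt(sum(max(x - y, 0.0) ** 2 for x, y in zip(a, r)))

def gd_plus(A, R, p=1):
    dists = [min(d_plus(a, r) for r in R) for a in A]
    return sum(d ** p for d in dists) ** (1 / p) / len(A)

# Hypothetical reference set and two single-solution sets; (5, 5) dominates (6, 10).
R = [(0, 10), (10, 0)]
A = [(5, 5)]
B = [(6, 10)]

print(gd(A, R), gd(B, R))            # GD rates B as closer to R than A
print(gd_plus(A, R), gd_plus(B, R))  # the modified version rates A better
```

With these values GD prefers the dominated set `B` because it happens to lie near one reference point, whereas the modified distance restores the order implied by dominance.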
5.4 Spread ($\Delta$)
The $\Delta$ indicator (aka Spread) [25] and its variants [139, 115] have been commonly adopted to evaluate the diversity (i.e., spread and uniformity) of solution sets in the field, e.g., in [116, 115, 30, 105, 108, 110, 93, 45, 27, 135]. Specifically, the $\Delta$ indicator of a solution set $A$ (assuming the set consists only of nondominated solutions) in a bi-objective scenario is defined as follows.
$$\Delta(A) = \frac{d_f + d_l + \sum_{i=1}^{N-1} |d_i - \bar{d}|}{d_f + d_l + (N-1)\bar{d}} \qquad (4)$$
where $N$ denotes the size of $A$, $d_i$ ($i = 1, \dots, N-1$) is the Euclidean distance between consecutive solutions in $A$, and $\bar{d}$ is the average of all the distances $d_i$. $d_f$ and $d_l$ are the Euclidean distances between the two extreme solutions of $A$ and the two extreme points of the Pareto front, respectively.
A small $\Delta$ value is preferred, which indicates a good distribution of the set in terms of both spread and uniformity. $\Delta = 0$ means that the solutions in the set are equidistantly spaced and their boundaries reach the Pareto front extremes.
A major weakness of $\Delta$ (including its variants) is that it only works reliably on bi-objective problems, where nondominated solutions are located consecutively on either objective. With more objectives, the neighbor of a solution on one objective may be far away on another objective [65]. This issue applies to any distance-based diversity indicator [69]. For problems with more than two objectives, region division-based diversity indicators are more accurate [69]. They typically divide the space into many equal-size cells and then consider cells instead of solutions (e.g., counting the number of occupied cells). This is based on the fact that a set of more diversified solutions usually populates more cells. However, such indicators may suffer from the curse of dimensionality as they typically need to record information of every cell. In this regard, the diversity indicator DCI [64] may be a pragmatic option since its calculation only involves non-empty cells and is thus independent of the number of cells (its computational cost increases linearly with the number of objectives).
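For the bi-objective case, the computation can be sketched as follows. The sketch assumes the usual form of the indicator from [25] (sum of the two extreme gaps and the absolute deviations of consecutive distances from their mean, divided by the extreme gaps plus the total of the mean distances), a set of at least two mutually nondominated solutions, and known extreme points of the Pareto front. The function name and the example sets are illustrative.

```python
import math

def spread(A, extremes):
    """Spread of a bi-objective nondominated set A, given the two extreme points of the Pareto front."""
    pts = sorted(A)  # sorting by the first objective orders a nondominated front consecutively
    d = [math.dist(pts[i], pts[i + 1]) for i in range(len(pts) - 1)]
    d_bar = sum(d) / len(d)
    first, last = sorted(extremes)
    d_f = math.dist(first, pts[0])   # gap between the front extreme and the set boundary
    d_l = math.dist(last, pts[-1])
    return (d_f + d_l + sum(abs(x - d_bar) for x in d)) / (d_f + d_l + len(d) * d_bar)

# Hypothetical front extremes and two sets that both reach them.
extremes = [(0, 4), (4, 0)]
even = [(0, 4), (1, 3), (2, 2), (3, 1), (4, 0)]
uneven = [(0, 4), (0.5, 3.5), (1, 3), (4, 0)]

print(spread(even, extremes), spread(uneven, extremes))
```

The evenly spaced set obtains a value of (numerically) zero, while the clustered set is penalized. The sort-based notion of "consecutive" solutions is exactly what breaks down with three or more objectives.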
5.5 Nondominated Front Size (NFS)
Used in e.g. [45, 96, 134], NFS (also called Pareto Front Size, PFS) simply counts how many nondominated solutions are in the obtained solution set. However, this indicator may not be very practical, as in many cases all solutions in the obtained set are nondominated to each other, particularly in many-objective optimization. In addition, since duplicate solutions are by definition nondominated to each other, a set full of duplicate solutions would be evaluated as good by NFS if no other solution in the set dominates them.
As such, a measure that only considers unique nondominated solutions which are not dominated by any other set seems more reasonable. Specifically, we can consider the ratio of the number of such solutions in each set to the size of the reference set (which consists of the unique nondominated solutions of the collection of solutions obtained by all the algorithms). In other words, we quantify the contribution of each set to the combined nondominated front of all the sets. Formally, let $A_u$ be the unique nondominated front of a given solution set $A$ (i.e., the nondominated solutions of $A$ with duplicates removed). Then, the indicator, denoted as Unique Nondominated Front Ratio (UNFR), is defined as
$$UNFR(A) = \frac{|\{a \in A_u \mid \nexists r \in R: r \prec a\}|}{|R|} \qquad (5)$$
where $R$ denotes the reference set which consists of the unique nondominated solutions of the collection of all solutions produced.
The UNFR value is in the range of $[0, 1]$. A high value is preferred. A value of zero means that for any solution in $A$ there always exists some better solution in the other sets. A value of one means that for any solution in the other sets there always exists some solution in $A$ better than (or at least equal to) it (i.e., the reference set consists precisely of solutions of $A$). In addition, UNFR is Pareto compliant.
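The ratio described above can be sketched as follows (minimization). The helper names, the way the reference set is built from all sets passed in, and the example values are assumptions for illustration only.

```python
def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def unique_nondominated(solutions):
    """Nondominated solutions with duplicates removed (first occurrence kept)."""
    unique = list(dict.fromkeys(tuple(s) for s in solutions))
    return [s for s in unique if not any(dominates(t, s) for t in unique)]

def unfr(A, all_sets):
    """Share of the combined unique nondominated front that is contributed by A."""
    reference = unique_nondominated([s for S in all_sets for s in S])
    front_a = unique_nondominated(A)
    kept = [a for a in front_a if not any(dominates(r, a) for r in reference)]
    return len(kept) / len(reference)

# Hypothetical sets: A contains a duplicate, and (3, 3) is dominated by (2, 2) from B.
A = [(1, 4), (1, 4), (3, 3)]
B = [(2, 2), (4, 1)]

print(unfr(A, [A, B]), unfr(B, [A, B]))
```

Here the duplicate in `A` is counted once and `(3, 3)` is discarded, so `A` contributes one of the three reference solutions and `B` the other two. A plain count of nondominated solutions would instead have credited `A` with three.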
5.6 Inverted Generational Distance (IGD)
IGD [23] is a well-known indicator in the field (e.g. in [116, 45, 32, 75, 4, 28, 34, 36, 10, 121, 83, 84]). As the name suggests, IGD, an inversion of GD, measures how close the Pareto front is to the obtained solution set. Formally, given a solution set $A$ and a reference set $R = \{r_1, r_2, \dots, r_M\}$, IGD is calculated as
$$IGD(A) = \frac{1}{M}\sum_{i=1}^{M} \min_{a \in A} d(r_i, a) \qquad (6)$$
where $d(r_i, a)$ is the Euclidean distance between $r_i$ and $a$. A low IGD value is preferable.
IGD is capable of reflecting the quality of a solution set in terms of all four aspects: convergence, spread, uniformity and cardinality. However, a major weakness of IGD is that the evaluation result heavily depends on the behavior of its reference set. A reference set of densely and uniformly distributed solutions along the Pareto front is required; otherwise it could easily return misleading results [69]. This is particularly problematic in SBSE since the reference set is normally created from the collection of all the obtained solutions, so its distribution cannot be controlled.
Consider the example in Figure 4(a), which compares two solution sets $A$ and $B$. The reference set is comprised of all the nondominated solutions, i.e., the three solutions of $A$ and the two boundary solutions of $B$. As can be seen, $B$ performs significantly worse than $A$ in terms of convergence, with its solutions being either dominated by some solution in $A$ or slightly better on one objective but much worse on the other objective; it is thus unlikely to be preferred by the DM. However, IGD gives the opposite evaluation, rating $B$ better than $A$.
In addition, the way in which the reference set is created makes IGD prefer a specific distribution pattern consistent with the majority of the considered solution sets [69]. In other words, if a solution set is distributed very differently from the others, then the set is likely to be assigned a poor IGD value whatever its actual distribution is. Figure 4(b) is such an example. When comparing $A$ with $B$ (with the reference set comprised of these two sets), $A$ is evaluated better than $B$. But if another set, which has a similar distribution pattern to $B$, is added to the evaluation so that the reference set is comprised of the three sets, $A$ is evaluated worse than $B$. A potential way to deal with this issue is to cluster crowded solutions in the reference set first and then to consider these well-distributed clusters instead of arbitrarily distributed points, as done in the indicator PCI [66]. Yet, this could induce another issue: how to properly cluster the solutions in the reference set given their potentially highly irregular distribution.
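The dependence on the reference set can be reproduced numerically. The sketch below assumes the usual form of IGD (average Euclidean distance from each reference point to its nearest solution in the evaluated set) and builds the reference set from the combined nondominated solutions, as is common practice. The two sets are hypothetical values constructed in the spirit of the first example above; they are not the data of Figure 4.

```python
import math

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def nondominated(solutions):
    return [s for s in solutions if not any(dominates(t, s) for t in solutions)]

def igd(A, R):
    """Average distance from each reference point to its nearest solution in A."""
    return sum(min(math.dist(r, a) for a in A) for r in R) / len(R)

# Hypothetical sets: A is well converged; B has two far-out boundary solutions
# that are marginally better on one objective, plus one dominated solution.
A = [(1, 3), (2, 2), (3, 1)]
B = [(0.9, 10), (5, 5), (10, 0.9)]

R = nondominated(A + B)  # three solutions of A plus the two boundary solutions of B
print(R)
print(igd(A, R), igd(B, R))  # IGD rates B better than A
```

The two boundary solutions of `B` enter the reference set and are far from every solution in `A`, which inflates the IGD value of `A` even though `A` is much closer to the region a DM would typically care about.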
5.7 Hypervolume (HV)
Like IGD, HV [142] evaluates the quality of a solution set in terms of all four aspects. Due to its desirable practical usability and theoretical properties, HV is arguably the most commonly used indicator in SBSE, e.g., used in [116, 115, 21, 87, 29, 98, 121, 47, 133, 26, 96, 93, 32]. For a solution set $A$, its HV value is the volume of the union of the hypercubes determined by each of its solutions and a reference point. It can be formulated as
$$HV(A) = \lambda\left(\bigcup_{a \in A} \{x \mid a \prec x \prec r\}\right) \qquad (7)$$
where $r$ denotes the reference point and $\lambda$ denotes the Lebesgue measure. A high HV value is preferred.
A limitation of the HV indicator is that its computational time increases exponentially with the number of objectives. Many efforts have been made to reduce its running time, theoretically and practically (see [69] for a summary), which make the indicator workable on a solution set with more than 10 objectives (under a reasonable set size).
As stated previously, HV is in favor of knee points of a solution set, and is thus a good choice when the DM prefers knee points of the problem’s Pareto front. In addition, the setting of the reference point can affect its evaluation results. Consider the two solution sets $A$ and $B$ in Figure 5, where $A$ consists of two boundary solutions, and $B$ consists of four uniformly distributed inner solutions. With the reference point of Figure 5(a), $A$ is evaluated worse than $B$. With the reference point of Figure 5(b), $A$ is evaluated better than $B$. Fortunately, we can make good use of this behavior of HV to enable the indicator to reflect the DM’s preferences. If the DM prefers the extreme points, then the reference point can be set fairly distant from the solution sets’ boundaries, e.g., by doubling the Pareto front’s range, namely $r_i = nadir_i + l_i$, where $nadir_i$ is the value of the nadir point of the Pareto front (or of the reference set, i.e., the combined nondominated front) on the $i$th objective, and $l_i$ is the range of the Pareto front (or the reference set) on the $i$th objective. If there is no clear preference from the DM, unfortunately, no consensus on how to set the reference point has been reached in the multi-objective optimization field. A common practice is to set it to 1.1 times the range of the combined nondominated front (i.e., $r_i = nadir_i + l_i/10$). Some recent studies [48] suggested setting it as $r_i = nadir_i + l_i/h$, where $h$ is an integer subject to a condition on $m$ and $N$ (the number of objectives and the size of the considered set, respectively). In any case, the reference point setting is nontrivial: an appropriate setting needs to consider not only the number of objectives and the size of the solution set, but also the actual dimensionality of the set, its shape, etc.
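The effect of the reference point can be checked with a small bi-objective computation. The sketch below uses a simple sweep that is only valid for two objectives (minimization); the function name, the two sets and the reference points are hypothetical and are not the data of Figure 5. The two reference points correspond to the nadir point plus one tenth of the range and the nadir point plus the full range, for an assumed front with nadir point (10, 10) and range 10 on each objective.

```python
def hypervolume_2d(A, ref):
    """Area dominated by A and bounded by the reference point (two objectives, minimization)."""
    # Keep only solutions that are strictly better than the reference point on both objectives.
    pts = sorted(p for p in set(A) if p[0] < ref[0] and p[1] < ref[1])
    area, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:           # sweep in increasing order of the first objective
        if f2 < best_f2:         # dominated points add no new area and are skipped
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area

# Hypothetical sets on the line f1 + f2 = 10.
A = [(0, 10), (10, 0)]           # two boundary solutions
B = [(3, 7), (5, 5), (7, 3)]     # uniformly distributed inner solutions

near = (11, 11)                  # nadir + range / 10
far = (20, 20)                   # nadir + range

print(hypervolume_2d(A, near), hypervolume_2d(B, near))  # inner solutions win
print(hypervolume_2d(A, far), hypervolume_2d(B, far))    # boundary solutions win
```

Moving the reference point away from the front increases the contribution of the boundary solutions until the ranking of the two sets flips, which is why the setting should follow the DM’s preference where one is known.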
5.8 $\epsilon$-indicator
The $\epsilon$-indicator is another well-established comprehensive indicator frequently appearing in SBSE, e.g., [45, 28, 36, 10, 125]. It measures the maximum difference between two solution sets and can be defined as
$$\epsilon(A, B) = \max_{b \in B} \min_{a \in A} \max_{j \in \{1, \dots, m\}} (a_j - b_j) \qquad (8)$$
where $a_j$ denotes the value of $a$ on the $j$th objective and $m$ is the number of objectives. A low value is preferred. $\epsilon(A, B) \le 0$ implies that $A$ weakly dominates $B$. When replacing $B$ with a reference set that represents the Pareto front, the $\epsilon$-indicator becomes a unary indicator, measuring the gap of the considered set to the Pareto front.
The $\epsilon$-indicator is Pareto compliant and user friendly (parameter-free and with quadratic computational effort). Yet, the calculation of the $\epsilon$-indicator only involves one particular objective of one particular solution in either set (where the maximum difference is), so its evaluation omits the differences on other objectives and other solutions. This may lead to different solution sets having the same/similar evaluation results, as reported in [72]. In addition, in some studies [103], the $\epsilon$-indicator has been empirically found to behave very similarly to another indicator in ranking solution sets.
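A minimal sketch of the additive form of this indicator is given below (minimization). It assumes the max-min-max formulation: for each solution of the second set, find the solution of the first set with the smallest worst-case objective difference, and return the largest such value. The function name and the example sets are illustrative.

```python
def epsilon_indicator(A, B):
    """Additive epsilon: smallest shift by which A, moved on every objective, weakly dominates B."""
    return max(
        min(max(x - y for x, y in zip(a, b)) for a in A)
        for b in B
    )

# Hypothetical sets in which every solution of B is dominated by one of A.
A = [(1, 3), (3, 1)]
B = [(2, 4), (4, 2)]

eps_ab = epsilon_indicator(A, B)
eps_ba = epsilon_indicator(B, A)
print(eps_ab, eps_ba)
```

A non-positive value in one direction means that the first set weakly dominates the second. Since the result is determined by a single objective of a single pair of solutions, quite different sets can obtain the same value.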
5.9 Summary
Indicator  Convergence  Spread  Uniformity  Cardinality  Pareto compliant  Usage note/caveats  Applicable conditions 

1) not able to distinguish between sets if their solutions are nondominated to each other, which may happen frequently in many-objective optimization; 2) binary indicator which evaluates the relative quality of two sets and cannot be converted into a unary indicator.  1) when the user wants to know the relative quality difference (in terms of the dominance relation) between two sets, and 2) when the Pareto front size is relatively small, e.g., on some low-dimensional combinatorial problems.  
()  1) not able to distinguish between sets if their solutions are nondominated to each other, which may happen frequently in many-objective optimization; 2) binary indicator which evaluates the relative quality of two sets and cannot be converted into a unary indicator; 3) removing duplicate solutions before the calculation.  1) when the user wants to know the relative quality difference (in terms of the dominance relation) between two sets, and 2) when the Pareto front size is relatively small, e.g., on some low-dimensional combinatorial problems.  
1) additional problem knowledge: a reference set that represents the Pareto front (not necessarily a set of uniformly-distributed points); 2) normalization needed for each objective; 3) may give misleading results due to not holding the Pareto compliance property.  1) when the user wants to know how close the obtained sets are to the Pareto front, 2) when the compared sets are nondominated to each other (i.e., no better relation between the sets), and 3) when the Pareto front range can be estimated properly (e.g., no DRS points in the reference set [69]).  
1) additional problem knowledge: a reference set that represents the Pareto front (not necessarily a set of uniformly-distributed points); 2) normalization needed for each objective.  1) when the user wants to know how close the obtained sets are to the Pareto front.  
()  1) additional problem knowledge: extreme points of the Pareto front; 2) normalization needed for each objective; 3) reliable only on bi-objective problems.  1) when the user wants to know the diversity (including both spread and uniformity) of the obtained sets on bi-objective problems, and 2) when the compared sets are nondominated to each other.  
1) additional problem knowledge: proper setting of the grid division; 2) n-ary indicator which evaluates the relative quality of sets, but can be converted into a unary indicator by comparing the obtained set with the Pareto front.  1) when the user wants to know the diversity of the obtained sets.  
()  1) normalization needed for each objective; 2) cannot reflect the spread of solution sets.  1) when the user wants to know the uniformity of the obtained sets, and 2) when the compared sets are nondominated to each other.  
1) not able to compare sets as it only counts the number of nondominated solutions in a set.  1) not reliable when the user wants to compare sets.  
1) when the user wants to compare the cardinality of sets, particularly how much they contribute to the combined nondominated front.  
1) additional problem knowledge: a reference set that well represents the Pareto front (i.e., densely and uniformly distributed points); 2) normalization needed for each objective; 3) may give misleading results due to not holding the Pareto compliance property.  1) when the user wants to know how well the obtained sets can represent the Pareto front, 2) when the compared sets are nondominated to each other, and 3) when a Pareto front representation with densely and uniformly distributed points is available.  
1) additional problem knowledge: a reference point that is worse than the nadir point of the Pareto front; 2) exponentially increasing computational cost in objective dimensionality; 3) the user can specify the reference point according to their preference for extreme solutions or for inner ones.  1) when the user wants to know the comprehensive quality of the obtained sets, especially suitable if the DM prefers knee points of the problem, and 2) when the objective dimensionality is not very high.  
ε-indicator  1) normalization needed for each objective; 2) binary indicator, but can be converted into a unary indicator by comparing the obtained set with the Pareto front; 3) differently-performing sets may have the same/similar evaluation results.  1) when the user wants to know the maximum difference between two solution sets (or between the obtained solution set and the Pareto front). 

“” generally means that the indicator can well reflect the specified quality (or meet the specified property). “” for convergence means that the indicator can reflect the convergence of a set to some extent; e.g., indicators only considering the dominance relation as convergence measure. “” for spread means that the indicator can only reflect the extensity of a set. “” for uniformity means that the indicator can reflect the uniformity of a set to some extent; i.e., a disturbance to an equallyspaced set may not certainly lead to a worse evaluation result. “” for cardinality means that adding a nondominated solution into a set is not surely but likely to lead to a better evaluation result and also it never leads to a worse evaluation result. “” for Pareto compliance means that the indicator holds the property subject to certain conditions.
Table V summarizes the above 12 indicators on several aspects, namely, 1) what kind of quality aspect(s) they are able to reflect, 2) if they are Pareto compliant, 3) where we need to take care when using them, and 4) what situation they are suitable for. The following guidelines can be derived from the table.

If the DM wants to know the convergence quality of a solution set to the Pareto front, (instead of ) could be an ideal choice — it is Pareto compliant and the reference set required can be set as the combined nondominated front of all the considered sets, not necessarily a set of uniformlydistributed points. If the DM wants to know the relative quality between two solution sets in terms of the Pareto dominance relation, (or ) could be a choice.

If the DM wants to know the diversity quality (both spread and uniformity) of a solution set, for biobjective cases, is a good choice, for problems with more objectives, can be used. can only reflect the uniformity of a solution set which may not be very useful — uniformly distributed solutions concentrating in a tiny area typically not in the DM’s favor.

The indicator should replace to measure the cardinality of solution sets.

Regarding comprehensive evaluation indicators, HV can generally be the first choice, especially when the DM prefers knee points. If the DM instead prefers extreme solutions, the reference point needs to be set fairly distant from the solution sets’ boundaries. The ε-indicator is user-friendly, but is less sensitive to quality differences between solution sets than HV, since its value depends on only one particular solution on one particular objective. IGD may not be very practical as it requires a Pareto front representation consisting of densely and uniformly distributed points.
6 Methodological Guidance to Quality Evaluation in Paretobased SBSE
In this section, we provide guidance on how to select and use quality evaluation methods in Pareto-based SBSE. As discussed previously, selecting and using quality evaluation methods needs to be aligned with the DM’s preferences.
6.1 When the DM’s Preferences Are Clear
Clear DM preferences typically fall into two categories in the SDLC. The first is when the relative importance/weighting among the objectives considered can be explicitly expressed and quantified, e.g., in [115]. It is worth noting that the weighting between objectives need not be fixed a priori. For example, in interactive Pareto-based SBSE for software modelling and architecting problems [119], the DM is asked to explicitly rank the relative importance of the objectives as the search proceeds. Under this circumstance, the weighted sum of the objectives can be used to find the fittest solution in a solution set, and the fitness of that solution then determines the quality of the set.
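The weighted-sum evaluation of a set can be sketched as follows. This is a minimal illustration, assuming all objectives are minimized and already normalized to comparable scales; the function name, the weights and the two solution sets are hypothetical.

```python
from typing import Sequence


def weighted_sum_quality(
    solution_set: Sequence[Sequence[float]], weights: Sequence[float]
) -> float:
    """Quality of a set, taken as the weighted sum of its fittest solution
    (lower is better, all objectives minimized)."""
    return min(sum(w * f for w, f in zip(weights, s)) for s in solution_set)


# Hypothetical normalized objective vectors and DM-specified weights.
weights = (0.7, 0.3)
set_a = [(0.1, 0.9), (0.5, 0.5), (0.9, 0.1)]
set_b = [(0.3, 0.4), (0.6, 0.3)]

quality_a = weighted_sum_quality(set_a, weights)
quality_b = weighted_sum_quality(set_b, weights)
```

With these weights, `set_b` is evaluated slightly better (about 0.33 against 0.34), even though `set_a` is more widely spread; only the single fittest solution of each set matters.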
The other category is when the DM prefers some objective to others (i.e., a clear priority can be assumed), or when the DM is only interested in solutions that are up to scratch on some objective (which could be seen as a constraint). This happens frequently in the software product line configuration problem [108, 109, 110, 47, 45, 87, 133], where the correctness of the products (i.e., the feature model’s dependency compliance) is always of higher priority than other objectives such as the richness and the cost of the model: only the solutions (products) that achieve full dependency compliance are of interest. The reason is that a dependency violation implies a faulty and incorrect configuration, which is valueless in practice. A similar situation applies to the test case generation problem [138, 96, 114, 46, 97], where the DM is typically interested in test suites with full coverage. In addition, the DM may only be interested in solutions that reach a certain level on some objective. For example, in software deployment and maintenance, it is not uncommon to have a statement like “The software service shall be available for at least 95% of the time”. In such a case, any availability value below 95% is unacceptable, while anything at or above 95% can be considered.
An appropriate way to perform evaluation under the above circumstance is to transfer the DM’s preferences into the solution set to be evaluated. This can be done by first removing from the set the solutions that are irrelevant. The set of remaining solutions is then evaluated according to one of two situations: if the remaining solutions all have the same value on the objective(s) where the DM articulates their preferences, the quality evaluation is performed only on the other objectives; otherwise, the evaluation is done on all the objectives. The former is common when the DM is only interested in solutions that achieve the best value of the objective, such as full coverage in the test case generation problem. The latter often applies when the DM is interested in a particular threshold of solution quality on the objective, such as in the software deployment and maintenance case mentioned above, where only the solutions with availability of at least 95% would be evaluated.
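The two situations above can be sketched in a few lines. This is only an illustration under stated assumptions: solutions are tuples of objective values, the DM's clear preference concerns a single objective, and the helper name `apply_clear_preference` and all the data are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def apply_clear_preference(
    solutions: Sequence[Tuple[float, ...]],
    index: int,
    is_acceptable: Callable[[float], bool],
) -> List[Tuple[float, ...]]:
    """Drop irrelevant solutions; if the survivors all share the same value on
    the preferred objective, keep only the other objectives for evaluation."""
    kept = [s for s in solutions if is_acceptable(s[index])]
    if kept and len({s[index] for s in kept}) == 1:
        # Same value on the preferred objective: evaluate on the others only.
        return [tuple(v for i, v in enumerate(s) if i != index) for s in kept]
    # Otherwise the evaluation is done on all the objectives.
    return kept


# Hypothetical test suites as (coverage, cost); only full coverage is of interest.
suites = [(1.0, 30.0), (1.0, 45.0), (0.9, 10.0)]
full_coverage = apply_clear_preference(suites, 0, lambda c: c == 1.0)

# Hypothetical deployments as (availability, cost); at least 95% is required.
deployments = [(0.99, 80.0), (0.96, 60.0), (0.90, 20.0)]
available = apply_clear_preference(deployments, 0, lambda a: a >= 0.95)
```

In the first case only the cost values remain to be evaluated, while in the second case both availability and cost of the two remaining solutions are passed on to the chosen quality indicator.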
6.2 When the DM’s Preferences Are Vague/Rough
[Figure 6: panels (a), (b), (c) and (d)]
It is not uncommon for important yet imprecise preferences to exist in the SDLC. In general, they are mainly derived from the non-functional requirements recorded in documentation, notes and specifications, which are often vague in nature, as in [44] [9] [79] [60] [1]. For example, some statements may be rather ambiguous, like “the first objective should be reasonable and the others are as good as possible”. In such a situation, one may not be able to integrate the preferences into the quality evaluation, since it is not possible to quantify qualitative descriptions like “reasonable”. As such, a safe choice is to treat them as a general multiobjective optimization case (i.e., without specific preferences).
In other situations, the user may give some preference information around some value/threshold on one (or several) objectives. This, in contrast to the case of clear preferences, allows some tolerance around the specified value/threshold. For example, a software product may have a requirement stating that “the cost shall be low while the product shall support ideally up to 3000 simultaneous users”. This typically happens for SMEs, where the budget of a software project (e.g., money for buying the required Cloud resources or for the consumption of data centers) is low, and thus it is more realistic to set a threshold point such that a certain level of performance (e.g., simultaneous users) would be sufficient (anything beyond it is deemed equivalent). However, while the requirement gives a clear cap on the best performance expected, it does not constrain the worst cases, implying that it allows some tolerance when the user’s goal cannot be met.
Although not impossible, it can be challenging to find a quality indicator that is able to reflect such preference information. First, the quality indicator should be capable of accommodating such preference information in the sense that the evaluation results can embody it. Second, the introduction of the preferences should neither compromise the general quality aspect that the indicator reflects nor violate properties that the indicator complies with (e.g., being Pareto compliant). In this regard, the HV indicator [142] could be a good choice since it 1) can relatively easily integrate the DM’s preference information [140, 126] and 2) can still be Pareto compliant after a careful introduction of the DM’s preference information [140].
To integrate preferences into HV, one approach, called the weighted HV and presented in [140], is to interpret the HV value as the volume of the objective space enclosed by the attainment function [24] and the axes, where the attainment function gives, for each vector in the objective space, the probability that it is weakly dominated by the outcome of the solution set. Then, to give different weights to different regions through a weight distribution function, the weighted HV is calculated as the integral over the product of the weight distribution function and the attainment function [140]. This essentially transforms the preference information into a weight distribution function that makes the contributions from different regions unequal. However, it may not be straightforward to design a weight distribution function that reflects vague/rough preferences expressed by the user.
Another (perhaps more pragmatic) approach is to directly transform the original objective values into new values which accommodate the preference information and then apply (or other quality indicators) to these new values, provided that such a transformation is in line with the selected indicator. For instance, consider the above example that the cost shall be low while the product shall ideally support up to users. Let us say that we have two solutions and . Solution has a lower cost while solution can support more users. However, according to the preference information, for the number of users anything beyond can be deemed as equivalent. As such, the user number of the solution can be transformed to , that is, now , worse than (i.e. dominated by) . The indicator can capture such dominance relation information — a dominated solution is always evaluated worse by than one dominating it. Next, we look at a case study based on this example to see how such transformation affects the evaluation results.
Consider a situation of designing a product with the requirements that “the cost shall be low while the product should be able to support at least 1500 simultaneous users and ideally reach 3000 users”. The first objective, cost, is a normal one (i.e., the lower the better), while for the second objective, the number of simultaneous users, there are two types of preferences: a clear one and a vague one. The statement “support at least 1500 users” is a clear one, which means the product is useless if it cannot support 1500 users. The statement “ideally reach 3000 users” is a vague one, which implies that it is acceptable to support fewer users than this threshold, and that supporting more users brings the same level of satisfaction. As such, we can have the following transformation function.
(9) 
where denotes the transformed value of solution on the th objective.
Now let us assume two solution sets obtained by two search algorithms, where . We want to evaluate and compare them under the circumstances with/without the preference information given above to see how transferring preferences into solutions affects the evaluation results. As seen in Figure 6(a) and (b), without considering the preferences is evaluated worse than (. This makes sense as the solutions of spread more widely than those of . Yet, when considering the preferences of the DM (transferred by Equation (9)), while the set stays unchanged (), the solution will be discarded and the solution will become . As a result, is evaluated significantly better than (), as shown in Figure 6(c) and (d). This shows that the integration of the DM’s preferences can completely change the evaluation results between solution sets.
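The effect described in this case study can be reproduced with a small sketch. It assumes the transformation stated in the requirements (discard solutions supporting fewer than 1500 users, and treat anything beyond 3000 users as 3000) and a simple two-objective hypervolume with cost minimized and users maximized. The solution values and the reference point below are hypothetical and are not the ones plotted in Figure 6.

```python
from typing import List, Sequence, Tuple

Solution = Tuple[float, float]  # (cost, simultaneous users)


def transform(
    solutions: Sequence[Solution], min_users: float = 1500, ideal_users: float = 3000
) -> List[Solution]:
    """Transfer the preferences into the solutions: drop those below the
    minimum and cap the number of users at the ideal level."""
    return [(c, min(u, ideal_users)) for c, u in solutions if u >= min_users]


def hypervolume_2d(
    solutions: Sequence[Solution], ref_cost: float, ref_users: float
) -> float:
    """Area dominated by the set and bounded by the reference point
    (cost is minimized, users are maximized)."""
    pts = sorted((c, u) for c, u in solutions if c < ref_cost and u > ref_users)
    area, best_users = 0.0, ref_users
    for i, (cost, users) in enumerate(pts):
        best_users = max(best_users, users)
        next_cost = pts[i + 1][0] if i + 1 < len(pts) else ref_cost
        area += (next_cost - cost) * (best_users - ref_users)
    return area


# Hypothetical sets: set_a is widely spread, set_b is concentrated.
set_a = [(10, 1000), (40, 2500), (90, 4500)]
set_b = [(35, 2600), (50, 2800), (60, 3000)]
ref = (100, 0)  # hypothetical reference point (worst cost, worst users)

hv_a, hv_b = hypervolume_2d(set_a, *ref), hypervolume_2d(set_b, *ref)
hv_a_pref = hypervolume_2d(transform(set_a), *ref)
hv_b_pref = hypervolume_2d(transform(set_b), *ref)
```

With these numbers, `set_a` is evaluated better without preferences (200000 against 187000), but worse once the preferences are transferred (155000 against 187000), since one of its solutions is discarded and another is capped while `set_b` stays unchanged.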
6.3 When the DM’s Interest Is in Some Specific Part of the Pareto Front
Sometimes, the DM may be more interested in a specific part (or specific solutions) of the Pareto front than in others. Knee points are certainly among such solutions and are preferred in many situations, e.g., in [105] [38] [30] [127] [32] [61] [15] [16] [83]. Knee points are points on the Pareto front where a small improvement on one objective would lead to a large deterioration on at least one other objective. They represent “good” trade-offs between conflicting objectives and are thus naturally of more interest to the DM. An example is the cloud auto-scaling problem [15], where different cloud tenants (users) may introduce conflicting objectives due to interference and shared infrastructure. From the perspective of the cloud vendor, ensuring fairness among tenants of the same class is often the top priority, and thus the knee solutions are of more interest. As we explained previously, HV is a good choice in such a situation, alongside other indicators like the ε-indicator [141] and those of [49] and [66], whereas unfortunately the indicator of [23] is not one of them, despite being widely used, e.g., in [32, 83].
Another relatively common situation is that the DM may be more interested in the extreme solutions (e.g., in [136] [124] [121]), namely, solutions achieving the best on one objective or another. For example, for the service composition problem [124], one may prefer the extreme solutions around the edges, e.g., those with low latency but high cost, or vice versa. For this situation, HV can also be a viable choice. As shown previously (Section 5.7), setting the reference point fairly distant from the combined nondominated solution set gives the extreme solutions a bigger weighting in the evaluation results. Besides, one may directly compare solution sets through their best values on the corresponding objective(s). Such a measure, in contrast to HV, which provides comprehensive evaluation results, returns objective values that are straightforward for the DM to understand.
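Comparing sets through their best value on each objective is simple to compute. The sketch below is an illustration only: the helper name `best_values` and the (latency, cost) values for the service composition example are hypothetical, and both objectives are assumed to be minimized.

```python
from typing import Sequence, Tuple


def best_values(
    solution_set: Sequence[Sequence[float]], minimize: Sequence[bool]
) -> Tuple[float, ...]:
    """Best value reached by the set on each objective; minimize[i] states
    whether objective i is to be minimized (True) or maximized (False)."""
    return tuple(
        (min if is_min else max)(s[i] for s in solution_set)
        for i, is_min in enumerate(minimize)
    )


# Hypothetical service compositions as (latency in ms, cost).
set_a = [(20.0, 9.0), (60.0, 5.0), (150.0, 1.5)]
set_b = [(35.0, 6.0), (70.0, 4.0), (110.0, 2.5)]

best_a = best_values(set_a, minimize=(True, True))
best_b = best_values(set_b, minimize=(True, True))
```

Here `set_a` reaches the better extreme on both objectives, which a DM interested in extreme solutions can read directly from the reported objective values.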
6.4 When the DM’s Preferences Are Completely Unavailable
As can be seen in Table III, the majority of studies in Pareto-based SBSE effectively do not involve any preference. For this situation, a solution set that well represents the whole Pareto front is preferred. This “good representation” can be broken down into several quality aspects: convergence, diversity (i.e., spread and uniformity), and cardinality. Naturally, one is expected to consider quality indicators that (together) are able to cover all of them.
In general, there are two ways to implement that in practice. One is to consider several indicators, each responsible for one specific aspect: for example, the indicator of [49] for a solution set’s convergence, that of [25] for diversity (in the bi-objective case), and a further one for cardinality. The other is to consider a comprehensive indicator that evaluates all the aspects. Such indicators include HV, IGD and the ε-indicator. Today, there is a tendency to use comprehensive indicators, and numerous recent studies used HV and IGD. However, as explained previously, IGD may not be an ideal indicator in Pareto-based SBSE, as a Pareto front representation with densely and uniformly distributed points is usually unavailable in practice.
In addition, when using comprehensive indicators, we suggest considering multiple differently-behaving indicators if applicable. Each indicator has its own (explicit or implicit) preferences. A solution set evaluated better on one indicator is often evaluated better on another similar indicator as well, which shows only that the set is favored under this type of preference. When a solution set is evaluated better on all of the considered indicators whose preferences are quite different, that set certainly has a higher chance of being chosen by the DM (otherwise, the set and its competitor may each fit different situations). Unfortunately, many comprehensive indicators, such as the indicator of [40] and the ε-indicator, behave similarly to HV [72, 103], namely, preferring knee points of the Pareto front rather than a set of uniformly-distributed solutions on the Pareto front (an exception is IGD, which, however, is typically not applicable). Therefore, as a complement to HV, considering a quality indicator that can well evaluate the diversity of a solution set sounds reasonable. In this regard, the indicator of [25] may be chosen for the bi-objective case and that of [64] for cases with more objectives.
6.5 Aided Evaluation Methods
The above are four general cases of solution sets’ quality evaluation on the basis of the DM’s preferences. On top of those, there also exist some quality indicators for specific SBSE scenarios (see Table III), which we call problem-specific quality indicators. For example, in library recommendation [93], top-k accuracy, precision and recall on history datasets are commonly used indicators for evaluating recommendation systems. As another example, in the software modularization problem, the MoJoFM [58] indicator, derived from the MoJo distance, compares a produced solution to a given ‘golden rule’ solution, which naturally represents the DM’s preference over objectives such as cohesion and coupling [129]. Overall, such problem-specific indicators not only represent a more “accessible” quality evaluation of solution sets (i.e., how they perform under the practical problem background), but also usually imply some preferences from the DM. Therefore, it is highly recommended to include them in the evaluation if they exist.
Nevertheless, it is worth noting that they usually need to work together with generic quality indicators (e.g., those in Table V) to provide reliable evaluations, since they may be irrelevant to Pareto-based optimization (e.g., only focusing on particular objectives in the evaluation). For example, the study [99] mainly relies on APFD, the average percentage of faults detected, to evaluate the solution set of prioritized test cases. Indeed, APFD is a frequently used problem-specific quality indicator in test case prioritization, but it can only reflect the rate of faults detected, not the reliance of test cases, both of which are objectives to be optimized in the problem.
In addition, plotting representative solution sets is also desirable as an auxiliary evaluation, as it gives the user a sense of what the solution sets look like. This is very helpful not only for solution set comparison, but also for the DM to understand the problem and then perhaps to refine their preferences further.
To plot a representative solution set for an algorithm involving stochastic elements, we suggest plotting the solution set of the particular run whose evaluation result (obtained by a comprehensive quality indicator, e.g., HV) is the closest to the median value over all the runs. Alternatively, for optimization problems with two and three objectives, median attainment surfaces [33, 56, 74] can be used to visualize the performance of the algorithm with respect to all the runs (these have already been adopted in the literature [21, 29, 131, 136]). For problems with more objectives, the parallel coordinates plot (instead of the scatter plot) is a helpful tool, which can reflect the convergence and diversity of a solution set to some extent [67], as practised in [83, 84, 133, 46, 132].
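Selecting the run to plot can be sketched as follows. This is a minimal illustration that assumes one indicator value per run is already available; the function name `representative_run` and the per-run values are hypothetical.

```python
import statistics
from typing import Sequence


def representative_run(indicator_values: Sequence[float]) -> int:
    """Index of the run whose indicator value is closest to the median
    over all runs (the first such run if several are equally close)."""
    median = statistics.median(indicator_values)
    return min(
        range(len(indicator_values)),
        key=lambda i: abs(indicator_values[i] - median),
    )


# Hypothetical indicator values of one algorithm over five independent runs.
values_per_run = [0.61, 0.70, 0.66, 0.58, 0.73]
run_to_plot = representative_run(values_per_run)
```

Here the median is 0.66, so the solution set of the third run (index 2) would be the one to plot.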
6.6 A General Procedure
Based on the above, we are now in a good position to provide a general procedure for evaluating solution sets in Pareto-based SBSE, shown in Figure 7. At first, we suggest some screening (P1 in the figure) to filter out trivial solutions in the considered solution sets according to the nature of the optimization problem. Trivial solutions are those that are straightforward to obtain and would never be of interest to the DM, but may affect the evaluation result, e.g., the solution with zero cost and zero coverage in the example of Figure 2.
After the filtering, solution sets are evaluated according to the DM’s preference information (D1). If no preference information is available at all, one needs to consider quality indicators that together accurately reflect all the quality aspects (D2–D5). One option is to evaluate distinct quality aspects separately, using one indicator for convergence (together with a dominance-based indicator if the dominance relation between sets is of interest), one for diversity (chosen according to the number of objectives involved), and one for cardinality. Alternatively, one can evaluate the comprehensive quality of solution sets, e.g., with HV in most cases, or even use some mix of the two if a specific quality aspect is of interest on top of the sets’ general quality. Whatever indicators are considered, their use needs to comply with their usage notes and caveats (see Table V).
If there is preference information available, the first step is to see whether part of it belongs to clear preferences (D6); if so, the clear preferences are transferred into the solutions (P2). Note that sometimes, after the transfer, there is only one objective left to be considered (e.g., in the example of Figure 2). In this case, the best value on that remaining objective represents the quality of the solution set.
After considering clear preferences, one needs to see whether there exist some vague preferences (D7). If so, those that are transferable are transferred into the solutions (D8, P3, D9), e.g., as in the example in Figure 6. After that, one checks whether the rest can be transferred into an indicator (D10); if so, it is transferred (P4), e.g., by turning certain preferences into a weight distribution function in the weighted HV [140], and that indicator is used to evaluate the solution sets.
When the preference information cannot be accommodated into an indicator, the next step is to check whether the DM prefers a specific part of the Pareto front (D11). If they prefer knee points on the Pareto front (i.e., well-balanced solutions between conflicting objectives), then HV is a good option. If they prefer boundary solutions, then HV with an unusual configuration of its reference point, alongside reporting the best value on each objective in the population, can be used.
Note that the DM may present several types of preference information. For example, they may specify a clear threshold on one objective and at the same time be interested in knee points on the Pareto front. Another example has been seen in the situation of Figure 6, where the DM’s preferences contain both clear and vague ones. In addition, there do exist some situations where the DM’s preferences cannot be quantified/transferred properly, e.g., the DM may state something like “the cost should be reasonable”. In such a situation, the procedure goes to D2, the general multiobjective optimization case (without specific preferences).
After going through all possible cases of the DM’s preferences, the final step is to check the last two quality evaluation methods: problem-specific evaluation (D13) and representative solution set plotting (D14). These are two extra methods, but they are very helpful for reflecting aspects of the solution set’s quality that may not be captured by general quality indicators.
7 Threats to Validity
Threats to construct validity can be raised by the research methodology, which may not serve the purpose of surveying the evaluation methods for Pareto-based optimization in existing SBSE studies. We have mitigated such threats by following the systematic review protocol proposed by Kitchenham et al. [53], which is a widely recognized methodology for conducting surveys in software engineering research.
Threats to internal validity may be introduced by inappropriate classification and interpretation of the SBSE papers, their implied preferences and the quality indicators/evaluation methods they used. We have limited this by conducting multiple rounds of paper reviews involving all the authors. Error checks and investigations were also conducted to correct any issues found during the search procedure. The key issues identified have also been discussed among the authors over multiple rounds.
Threats to external validity may restrict the generalizability of the proposed guidance and the considered cases. We have mitigated these by making the survey both broad and deep: it covers 717 searched papers published between 2009 and 2019, from 36 venues in 7 repositories, while extracting 97 prominent primary studies following the exclusion and inclusion procedure. This has included the 21 most noticeable SBSE problems, spread across the whole SDLC, in our analysis. The extracted assumptions of preferences/problem nature, together with our rigorous analysis of the 12 representative quality indicators (i.e., either used widely in SBSE or proposed herein for a more accurate evaluation), have provided rich sources for us to establish general methodological guidance for the community.
Finally, although our guidance has been designed in a way that it aims to cover a wide range of SBSE problems, it is always possible that there are situations which we have unfortunately missed; for example, the quality of solutions in the decision space (e.g. their diversity and robustness). In such cases, an evaluation involving both the solution sets in the decision space and their images in the objective space is needed, which is out of scope of this paper.
8 Conclusions
The nature of considering multiple (conflicting) objectives in many SE problems leads to a link between SE and multiobjective optimization. However, compared to the flourishing use/design of multiobjective optimizers in SBSE, the evaluation of the optimizers’ outcome remains relatively “casual”. People often work by analogy, namely, following popular (or previously used) quality evaluation methods without considering whether they are truly suitable for their specific situation. In this paper, we have carried out a systematic survey of quality evaluation in Pareto-based SBSE. We have found that in many studies the selection/use of evaluation methods is not appropriate, in the sense that a solution set evaluated better may not be preferred by the DM, based on which we codify five critical issues. Through revisiting the pros and cons of the most widely used quality indicators in SBSE, we have then provided methodological guidance and a procedure for selecting, adjusting and using quality evaluation methods on the basis of the availability/types of the DM’s preferences. We hope that such guidance will help to mitigate the identified issues in future work on SBSE with multiobjective optimization, and also encourage more currently single-objective SBSE research [62, 73] to investigate Pareto-based multiobjective search.
Footnotes
 A solution set is said to (Pareto) dominate a solution set if for any solution in there exists at least one solution in dominating it, where the dominance relation between two solutions can be seen as a natural “better” relation of the objective vectors, i.e., better or equal on all the objectives, and better at least on one objective [141].
 http://crestweb.cs.ucl.ac.uk/resources/sbse repository
 All the citations were counted by 23rd Nov 2019.
 Note that it is not unusual that binary indicators (i.e., those directly comparing two sets) hold the better relation [69].
 For convenienc