The Validity, Generalizability and Feasibility of Summative Evaluation Methods in Visual Analytics

Mosab Khayat    Morteza Karimzadeh    David S. Ebert    Fellow, IEEE    Arif Ghafoor    Fellow, IEEE

Many evaluation methods have been used to assess the usefulness of Visual Analytics (VA) solutions. These methods stem from a variety of origins with different assumptions and goals, which causes confusion about their ability to prove usefulness. Moreover, the lack of discussion about the evaluation processes may limit our potential to develop new evaluation methods specialized for VA. In this paper, we present an analysis of evaluation methods that have been used to summatively evaluate VA solutions. We provide a survey and taxonomy of the evaluation methods that have appeared in the VAST literature in the past two years. We then analyze these methods in terms of the validity and generalizability of their findings, as well as the feasibility of using them. We propose a new metric called summative quality to compare evaluation methods according to their ability to prove usefulness, and make recommendations for selecting evaluation methods based on their summative quality in the VA domain.

Summative evaluation, usefulness, evaluation process, taxonomy, visual analytics

DOI: 10.1109/TVCG.2019.2934264

Mosab Khayat, David S. Ebert, and Arif Ghafoor are with Purdue University. Email: {mkhayat, ebertd, ghafoor}. Morteza Karimzadeh is with the University of Colorado Boulder (formerly at Purdue University).

CCS concepts: Human-centered computing → Visualization → Visualization design and evaluation methods



Visual analytics (VA) solutions emerged in the past decade and tackled many problems in a variety of domains. The power of combining the abilities of human and machine creates fertile ground for new solutions to grow. However, the rise of these hybrid solutions complicates the process of evaluation. Unlike automated algorithmic solutions, the behavior of visual analytics solutions depends on the user who operates them. This creates a new dimension of variability in the performance of the solutions that needs to be accounted for in evaluation. The existence of a human in the loop, however, allows researchers to borrow evaluation methods from other domains, such as sociology [33], to extract information with the help of the user. Such evaluation methods allow developers to assess their solutions even when a formal summative evaluation is not feasible. The challenge in these methods, however, lies in gathering and analyzing qualitative data to build valid evidence.

Many methods have been used to evaluate VA solutions, each originally developed to answer different questions with different evaluation intentions, including formative, summative and exploratory [2]. Nevertheless, many of these methods have been extensively applied in summative evaluation even though some are suitable only for formative or exploratory assessment.

In this paper, we survey and analyze the evaluation methods commonly used with summative intentions in VA research. Specifically, we survey the papers presented at VAST 2017 and 2018, resulting in a seven-category taxonomy of evaluation methods. We identify the activities typically performed within each category, focusing on the activities that could introduce risks to the validity and the generalizability of the methods’ findings, and we use both of these factors to define summative quality. We also define feasibility based on the identified activities and the limitations in applying evaluation methods in various scenarios. Finally, we use summative quality and feasibility to compare summative evaluation methods. Unlike existing problem-driven prescriptions [53], we analyze the risks to the validity, generalizability, and feasibility of each evaluation method by focusing on the activities employed in each method. We then provide a prescription of evaluation methods based on (a) their ability to prove usefulness and (b) their feasibility.

The contributions of this paper can be summarized as follows:

  • A survey, taxonomy and risk-based breakdown and analysis of evaluation methods used in summative evaluation of VA solutions.

  • A summative quality metric to assess the summative quality of evaluation methods based on the potential risks to the validity and the generalizability of the methods’ findings.

  • An analysis and prescription of summative evaluation methods in terms of their summative quality and feasibility.

This paper is organized as follows: In Section 1, we provide important definitions, and we review related work in Section 2. In Section 3, we present our survey of evaluation methods used for summative assessment. We analyze these methods in Section 4, followed by a set of recommendations for practitioners in Section 5. Finally, we conclude the paper and provide directions for future research.

1 Usefulness and Summative Evaluation Definitions

The term “summative assessment” has roots in the field of education, which distinguishes it from formative assessment [77]. The former assesses students objectively at the end of a study period using standardized exams, while the latter focuses on the learning process and the students’ progress in meeting standards. In visualization and VA literature, summative evaluation has been traditionally referred to as the type of studies that measures the quality of a developed system using methods such as formal lab experiments [53]. This is in contrast with formative assessment, which seeks to inform the design and development processes by applying techniques such as expert feedback [56].

There have been some suggestions in the literature that evaluation intention should be unlinked from evaluation methods. Ellis and Dix [26] argue that a formal lab experiment of a completely developed system can be conducted with formative intentions to suggest improvement. On the other hand, Munzner [53] argues that formative methods such as expert feedback can be used with summative intentions to validate the outcome of different design stages. We agree with these arguments and believe that it is essential to give a formal definition of summative evaluation as an intention rather than an evaluation stage.

From the discussions in [77], we define summative evaluation as a systematic process which generates evidence about the degree of accomplishment of the given objectives (standards) for an assessed object (a solution) at a point in time. We use the term “solution” throughout this article to refer to different approaches for tackling a problem. VA research covers different types of solutions ranging from algorithms, to visualizations, to the integration of these in a holistic system (see [68] for design study contributions). “Standards” refer to benchmarks that are used to distinguish useful solutions from non-useful ones. These are commonly determined during the requirement elicitation stage, e.g. by conducting qualitative inquiries with domain experts.

Summative evaluation is used to determine the usefulness (i.e. the value) of a solution. From a technology point of view, usefulness is based on two main factors: effectiveness and efficiency [79]. The former can be defined as the ability of a solution to accomplish the desired goals (i.e. doing the right things). The latter concerns the ability of a solution to optimize resources, such as time or cost, while performing its tasks (i.e. doing things right). Most existing summative evaluations assess one or both of these two factors.

Effectiveness and efficiency could be assessed differently according to the nature of the evaluated solution and the problem it tackles. Some solutions can be assessed in a straightforward manner because of the availability of explicit objectives they seek to achieve. An example of such solutions is a classification algorithm, which can be evaluated by objective metrics such as accuracy. On the other hand, some solutions require extra effort to define valid objectives that can be used to assess their usefulness. Such effort can be seen in previous work targeted at finding valid objectives to determine the value of holistic visualization and visual analytics systems [79, 72, 63].

Usefulness of human-in-the-loop solutions can also be assessed by utility and usability objectives. A useful system has the needed functionalities (utility) designed in a manner that allows users to use them correctly (usability) [55]. The question of whether to prioritize utility or usability has been discussed in previous work [29]. We focus on the objectives used in utility and usability evaluation and view them with a broader lens as ways to assess effectiveness and efficiency.

We consider effectiveness and efficiency as generic objectives of summative evaluation. This permits us to put all the methods used to assess these two factors on the same footing and compare them in terms of the quality of their evidence and the feasibility of generating it.

2 Related Work

In this section, we review previous related work in three categories: (1) surveys of evaluation practices, (2) analysis of evaluation methodologies, and (3) prescription of evaluation methods.

Multiple studies have surveyed existing evaluation practices. Lam et al. [40] suggest that it is reasonable to generate a taxonomy of evaluation studies by defining scenarios of evaluation practices that are common in the literature. Their extensive survey is unique and provides many insights for researchers. Specifically, seven scenarios of evaluation practices are discussed along with the goals of each, with exemplar studies and methods used in each scenario. Isenberg et al. [34] continue this effort by extending the number of surveyed studies and introducing an eighth scenario of evaluation practices. These studies helped us build the backbone of our taxonomy as explained in Section 3.1. The initial code to group evaluation methods in our survey was derived from Lam et al. and Isenberg et al. We then gradually modified the coding of evaluation methods according to the studies we surveyed. In contrast to the previous surveys, which group methods according to common evaluation practices, we group evaluation methods based on the similarities in each method’s (sub)activities, with the ultimate goal of analyzing the potential risks associated with them, rather than simply describing the existing evaluation practices.

The next set of related work focuses on explaining and analyzing evaluation methodologies. Evaluation research in visualization and VA can be divided into two types from the perspective of human involvement: human-dependent evaluation and human-independent evaluation. The methodology of the first type draws on behavioral and social science methodologies to study the effect of visual artifacts on the human operator. One of the most well-known taxonomies for classifying behavioral and social science methodologies that has been ported to the Human-Computer Interaction (HCI) community is proposed by McGrath [49]. This taxonomy was built based on the three main dimensions that any behavioral study seeks to maximize, which are (1) generalizability, (2) realism, and (3) precision. Generalizability of a study determines the extent of applicability of the study findings to any observable cases in general. It is related to the concept of external validity of results. Realism is the representativeness of studied cases to situations that can be observed in the real world; i.e., it determines the level of ecological validity of the findings. Finally, the precision of a study measures the level of reliability and internal validity of the findings. McGrath argues that these dimensions cannot be maximized simultaneously, since increasing one adversely affects the others. He then reviews common methodologies in behavioral science and assigns them to a position in the space defined by the three dimensions. Our analysis of evaluation methods relies on many of the arguments made by McGrath. A key difference between our work and that of McGrath lies in the intention of targeted studies. Our work focuses on studies that have a summative intention of proving usefulness. Unlike the general view of McGrath’s work, summative evaluation studies have unique characteristics that permit ranking according to the quality of proving usefulness, as we explain in Section 4.

An early study that introduces McGrath’s work to the information visualization evaluation context is done by Carpendale [10], who provides a summary of different quantitative, qualitative and mixed methodologies along with a discussion about their limitations and challenges. A more recent work by Crisan and Elliott [23] revisits quantitative, qualitative and mixed methodologies and provides guidance on when and how to correctly apply them. Instead of taking a general view of behavioral methodologies, we use a unified lens to identify limitations in evaluation methods used to prove usefulness, which may follow different methodologies, but are indeed used with summative intentions. Similar to Crisan and Elliott, we use validity and generalizability as our analysis criteria and add the feasibility criterion to the analysis to determine the level of applicability of the methods.

The second type of evaluation in visualization and VA is human-independent. In this type of evaluation, researchers follow a quantitative methodology to assess visualization or VA systems without considering the human element. This includes computer science methods of evaluating automated algorithms [20] and statistical methods for assessing machine learning models (e.g. [39]). A unique quantitative methodology that has been used to evaluate visualization and VA solutions is the information-theoretic framework proposed by Chen and Jänicke [16]. This framework treats the pipeline of generating and consuming visual artifacts as a communication channel that communicates information from raw data, as the sender, to human perception, as the receiver. The information-theoretic framework has been used to define objective metrics such as the cost-benefit ratio [15], which has recently been used to build an ontological framework that supports the design and evaluation of VA systems [12]. We include human-independent methods in our analysis because they are summative by nature.

The last set of related work focuses on the prescription of evaluation methods by providing guidelines on what evaluation methods are suitable for different evaluation instances. Andrews [2] proposes four evaluation stages during the development cycle of a system: (1) before the design, (2) before the implementation, (3) during implementation, and (4) after implementation. Andrews suggests that the purpose, as well as the method of evaluation, is defined by the stage. For example, evaluation studies conducted after the implementation are summative in purpose and usually use methods such as formal experiments or guideline scoring. A more sophisticated prescription of evaluation methods is proposed by Munzner [53], who defines four nested levels, each having a set of unique problems and tasks. During the design stage, developers face multiple problems on their way to the inner level, which requires validation of the design choices. After implementation, a sequence of validation must be performed at each level to validate the implementation on the way out of the nest. Munzner then prescribes different evaluation methods to be used in each validation step. Meyer et al. [51] expand this model by focusing on each of the nested levels and proposing the concepts of blocks and guidelines. Blocks describe the outcomes of design studies at each level, and guidelines explain the relationship between blocks at the same level or across adjacent levels in the nest. Another extension to Munzner’s work is that of McKenna et al. [50], who link the nested model to a general design activity framework. The framework breaks down the process of developing a visualization into four activities: understand, ideate, make and deploy.

One argument made by Munzner [53] was the necessity of summative evaluation during each stage of design studies to evaluate the outcome of that individual stage. Sedlmair et al. [69] and McKenna et al. [50] made similar arguments while describing the process of design studies. They make the case for considering non-quantitative methods, such as heuristic evaluation, for summative purposes. While Munzner’s nested model [53] essentially prescribes evaluation methods based on the development stage, we base our analysis and prescription on the activities performed during evaluation, and judge the quality of evaluation findings (evidence of usefulness) based on the amount of risk introduced by the involved activities. Further, our approach adapts to different evaluation instances and prescribes a relatively smaller number of potential evaluation methods, compared to [53].

Another form of prescription studies is the study of correctly adopting existing evaluation methods in the context of VA. Most evaluation methods that have been applied in visualization and VA have been borrowed from the field of human-computer interaction (HCI). Scholtz [67] explains the main factors that need to be added or modified in existing HCI methods to increase their utility in VA research. In addition, she prescribes potential evaluation metrics that have been successfully applied to assessing VA solutions. Still, the necessity of searching for suitable evaluation metrics for visual analytics persists [67, 47, 36, 65].

3 Survey of Summative Evaluation Methods

In this section, we present our survey and taxonomy of methods used by other researchers for the summative evaluation of VA solutions. Our goal in developing a taxonomy is to identify the limitations of these methods in terms of their validity, generalizability, and feasibility. Because our objective is to analyze the evaluation methods themselves, it is important to note that we abstract the evaluated solutions and the problems they solve. For example, we do not distinguish between a study that reports a holistic evaluation of a complete VA system and another study that evaluates a part of the system, as long as they both use the same evaluation method. This abstraction is discussed in Section 4.

We focused our survey on papers that were published in VAST-17 and VAST-18. We initially considered 97 papers (52 from VAST-17 and 45 from VAST-18). We excluded papers that only included usage scenarios or did not report any evaluation at all. Usage scenarios are excluded since they only exemplify the utilization of solutions rather than systematically examining their usefulness. They differ from case studies and inspection methods, which have been used to systematically determine the usefulness of a solution, as we explain next. The final number of papers we include in our taxonomy is 82. Some of these papers report more than one type of assessment. The total number of evaluation studies we found in these 82 papers is 182. The number of included papers is relatively small compared to existing surveys [40, 34]. However, our deductive approach to identifying evaluation categories requires a smaller sample size than inductive approaches, which develop concepts by grounding them in data. We built on previous taxonomies [40, 34] to lay out ours, and then surveyed recent papers to guide the grouping, activity breakdown and risk analysis.

3.1 Survey Methodology

We followed a deductive approach to build our taxonomy, starting with an initial code based on the previous surveys [40, 34], and then progressively changing the concepts in the code by considering new dimensions that help highlight factors that affect validity, generalizability, and feasibility of evaluation methods.

Phase 1: Building the initial concepts

We based our taxonomy on two extensive surveys of evaluation practices in visualization and VA literature [40, 34]. The descriptive concepts developed in these works (i.e. the evaluation scenarios) are built for different objectives than our diagnostics. However, these works include the set of evaluation methods used in each scenario, which allowed us to determine our initial code. We consider each reported evaluation method as a concept in this phase and categorize the studies accordingly.

Phase 2: Selecting grouping dimensions

We looked for new dimensions that are key for diagnosing evaluation methods’ validity, generalizability, and feasibility. By examining the process of evaluation in each method, we identified four dimensions that are useful in grouping the evaluation methods to simplify our analysis: epistemology, methodology, human-dependency, and subjectivity. These dimensions can be seen as titles for each level of our taxonomy depicted in Figure 1 and are explained in more detail in the following sections.

Phase 3: Redefining concepts

We iteratively refined the taxonomy, which resulted in merging some concepts and splitting others. For example, one of our initial codes was “Quantitative-objective assessment” which included both “Quantitative User Testing” and “Quantitative Automation Testing” in our final code. The dimension responsible for splitting these two concepts is the “Human-dependency” dimension. On the other hand, we decided to merge “quantitative-subjective assessment test” and “quantitative-subjective comparison test” concepts into the single concept “Quantitative User Opinion”, because both concepts are similar in every grouping dimension that we considered.

Figure 1: A taxonomy of summative evaluation methods based on surveying 82 papers published in VAST-17 and 18. The leaves represent categories of evaluation methods distinguished by the dimensions shown in the left. The percentages show the distribution of surveyed studies.
| Abb. | Category | Description | Frequency | % | Examples |
|---|---|---|---|---|---|
| THEO | Theoretical Methods | Rational, objective, quantitative methods which do not rely on human subjects to generate evidence of usefulness. These methods rely on deductive reasoning to logically derive evidence. | 12 | 6.59% | [3, 11, 14] |
| QUT | Quantitative User Testing | Empirical methods that are objective, quantitative and estimate the performance of human subjects for assessment or comparison reasons. | 14 | 7.69% | [89, 4, 46] |
| QUO | Quantitative User Opinion | Similar to the previous category, except that it assesses subjective aspects instead of measuring objective performance. A conventional method in this category is structured questionnaires which use measurable scales, e.g. Likert scale [37], to evaluate user satisfaction and opinion. | 17 | 9.34% | [76, 61, 41] |
| AUTO | Quantitative Automation Testing | Empirical methods used to quantitatively and objectively assess human-independent solutions such as machine learning models. This includes evaluation methods such as cross-validation and hold-out test sets to predict the performance of supervised machine learning models [39]. | 19 | 10.44% | [83, 85, 81] |
| INST | Insight-based | A mixed method which relies on human subjects to qualitatively identify a set of insights that can be reached with the help of a solution. Insight-based methods map identified insights to measurable metrics, e.g. insight count, which are used for quantitative reasoning [63]. | 3 | 1.65% | [43, 87, 18] |
| CASE | Case Studies | Qualitative methods which allow researchers to determine objective values and subjective opinions about the evaluated solution by interacting with human subjects who are typically domain experts. This category encompasses different variants of case studies including Pair analytics [5] and Multi-dimensional In-depth Long-term Case studies “MILC” [70]. | 67 | 36.81% | [73, 80, 57] |
| INSP | Inspection Methods | Methods which assess objective or subjective potentials of a solution without testing or recruiting human subjects. Inspection methods help in checking the satisfaction of predefined requirements that characterize objective or subjective features needed in useful solutions [56, 67, 66, 78, 1]. | 50 | 27.47% | [35, 84, 60] |
Table 1: A summary of our survey of the evaluation studies reported in VAST-2017 and 2018. The table provides a brief description of our seven categories and the distribution of the surveyed studies within those categories. The total number of categorized studies is 182 reported in 82 papers.

3.2 Taxonomy

Figure 1 summarizes our taxonomy. The dimensions are independent and can be used separately to classify evaluation methods. Therefore, the order of dimensions in Figure 1 is not important. However, we chose to present a breakdown leading to our identified seven categories. In this section, we explain the dimensions that differentiate the seven categories of evaluation methods and the distribution of the surveyed papers in each level. The following section focuses on the analysis of the processes and activities in each method category.

3.2.1 Epistemology Dimension

Evaluation methods produce evidence that justifies our beliefs about the value of the evaluated solution. The process of justification in the evaluation methods can be categorized, according to epistemological views, into two classes: rational and empirical. Rational evaluation methods use deductive reasoning by relying on logically true premises. For example, the analysis of algorithmic complexity reported in [44, 3] is rational. This method is used to evaluate the efficiency of an algorithm by determining the time required to execute its instructions. Another rational method of evaluation is the information-theoretic framework [16] that is used in [14] to study the cost-benefit of visualization in a virtual environment. Both of these methods are built on top of a set of basic premises that are assumed to hold, such as the assumption of unit execution time per instruction of the algorithm and the axioms of probability, respectively.
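The unit-cost premise behind complexity analysis can be illustrated with a minimal sketch. The two functions below are hypothetical and are not taken from any surveyed paper; each charges one unit per loop iteration (one probe of the list) and reports how many units a search consumes, so that a linear scan and a binary search can be compared without timing anything.

```python
def linear_search_steps(items, target):
    """Count the probes a linear scan makes, charging one unit per probe."""
    steps = 0
    for value in items:
        steps += 1
        if value == target:
            break
    return steps


def binary_search_steps(items, target):
    """Count the probes a binary search makes on a sorted list."""
    steps = 0
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            break
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


# Illustrative input: searching for the last element of a sorted list.
data = list(range(1024))
worst_linear = linear_search_steps(data, 1023)
worst_binary = binary_search_steps(data, 1023)
```

Under this assumed cost model, the evidence of efficiency follows from the premise alone: the step counts depend only on the input size and the algorithm, not on any measurement of a running system.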

Empirical evaluation methods, on the other hand, follow inductive reasoning by collecting and using practical evidence to justify the value of the solution. Most categories of evaluation methods are empirical. An example of an empirical method is the estimation of automated models’ performance as reported in [83, 85]. Such estimations are performed empirically by measuring the performance of a solution in a number of test cases.
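As a rough illustration of estimating performance over a number of test cases, the sketch below implements k-fold cross-validation in plain Python. Everything in it is an assumption made for illustration: the majority-label "model" is a stand-in baseline, the toy labels are invented, and none of the names come from the surveyed papers. It assumes k does not exceed the number of samples.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def majority_label(labels):
    """Hypothetical baseline model: predict the most frequent training label."""
    return max(sorted(set(labels)), key=labels.count)


def cross_validated_accuracy(labels, k):
    """Average held-out accuracy of the majority baseline over k folds."""
    scores = []
    for held_out in k_fold_indices(len(labels), k):
        held = set(held_out)
        train = [labels[i] for i in range(len(labels)) if i not in held]
        prediction = majority_label(train)
        correct = sum(1 for i in held_out if labels[i] == prediction)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


# Invented toy labels, used only to exercise the functions.
toy_labels = ["a", "a", "b", "a", "a", "b", "a", "a", "b", "a"]
estimate = cross_validated_accuracy(toy_labels, k=5)
```

The estimate is inductive in the sense described above: it generalizes from the held-out cases that happened to be tested, so its quality depends on how representative those cases are.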

Our survey shows that 12 out of 182 evaluation studies (6.59%) were conducted using rational methods. Only one (0.55%) of these 12 studies uses the information-theoretic framework. The other 11 studies (6.04%) applied the traditional analysis of algorithms. Empirical evaluation methods are reported as the method of evaluation in the remaining 170 studies (93.41%).

3.2.2 Methodology Dimension

Evaluation methods are categorized, according to the methodology they follow, into three classes: quantitative, qualitative and mixed methods [10, 23]. Quantitative methods rely on measurable variables to interpret the evaluated criteria. They collect data in the form of quantities and analyze it using statistical procedures to generalize their findings. The evidence generated by these methods has high precision but a narrow scope, i.e. rejection of a hypothesis by measuring particular metrics. Thus, these methods are preferable for problems that are well-abstracted to a set of measurable objectives. Controlled experiments are examples of quantitative methods, used extensively in comparative evaluations such as the studies reported in [7] and [89]. These studies aim to justify the value of a solution by comparing it to counterpart solutions.

Qualitative methods, on the other hand, have fewer restrictions on the type of data that can be collected from a study. They evaluate the usefulness of solutions which tackle less abstract, concrete problems using data that is less precise but more descriptive, such as narratives, voice/screen recordings, and interaction logs. Such data can be generated as a result of observation, or with the active participation of human subjects such as in interviews and self-reporting techniques. The case studies reported in [73] and [80] are examples of qualitative methods used with a summative intention.

Mixed methodology integrates both quantitative and qualitative methods to produce more comprehensive studies [23]. The most common way of following this methodology is to perform multiple complementary studies that are independent but serve the same summative intention (called a convergence mixed method design [22]). For example, the authors of [58] report a controlled experiment as well as a case study with domain experts used to evaluate ConceptVector, a VA system that guides users in building lexicons for custom concepts. The results of both studies can be compared to support each other in proving the value of ConceptVector. Another way of mixing quantitative and qualitative methods is to connect the two types of data prior to analysis, such as in an insight-based evaluation method [63]. This method starts by collecting qualitative data in the form of written or self-reported insights, then transforms this data into quantities, e.g. insight count, for analysis, such as the evaluation reported in [43]. Since our taxonomy categorizes evaluation methods at individual resolution, we only categorize methods which follow embedded and merging designs, e.g. the insight-based method, as mixed methods.
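The transformation from qualitative insights to a quantity such as insight count can be sketched as follows. The records, participant identifiers, condition names and insight categories are all invented for illustration; real insight-based studies involve a coding step by the researchers that is not modeled here.

```python
from collections import Counter

# Hypothetical coded insights: (participant, condition, insight category).
coded_insights = [
    ("P1", "tool", "trend"),
    ("P1", "tool", "outlier"),
    ("P2", "tool", "trend"),
    ("P3", "baseline", "trend"),
    ("P4", "baseline", "outlier"),
    ("P2", "tool", "cluster"),
]


def insights_per_participant(records):
    """Count the insights reported by each participant."""
    return Counter(participant for participant, _, _ in records)


def mean_insights_by_condition(records):
    """Average insight count per participant within each condition."""
    per_participant = Counter((cond, p) for p, cond, _ in records)
    totals, members = Counter(), Counter()
    for (cond, _), count in per_participant.items():
        totals[cond] += count
        members[cond] += 1
    return {cond: totals[cond] / members[cond] for cond in totals}


counts = insights_per_participant(coded_insights)
means = mean_insights_by_condition(coded_insights)
```

Note that a participant who reports no insights never appears in the records, so this sketch would overstate the per-condition mean unless such participants are added explicitly.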

According to our survey, 62 studies (34.07%) out of 182 were conducted using quantitative methods, 117 studies (64.29%) were conducted using qualitative methods, and only 3 studies (1.65%) were conducted using the mixed method. Qualitative methods therefore constitute the majority of evaluations in VAST-17 and 18. At the paper level, 21 of the 82 surveyed papers (25.61%) apply the convergence mixed method design.

3.2.3 Human-dependency Dimension

Visual analytics solutions combine both human and automated processes to tackle problems [38]. Researchers may evaluate different components independently. For example, researchers may evaluate the efficiency of an automated algorithm [17, 54], or inspect the requirements of a user interface [75]. Another option is to assess human-related tasks such as estimating the performance of the users [52] or gathering expert feedback about the value of a VA system holistically [86].

The human-dependency dimension in our taxonomy affects all the factors we aim to analyze (i.e. validity, generalizability, and feasibility); therefore, we include it as a dimension in the taxonomy.

Our survey shows that 81 (44.51%) out of 182 studies summatively evaluated a solution without utilizing any human subjects. Among these studies, 31 studies (17.03%) used quantitative methods and 50 (27.47%) used qualitative methods in the form of inspection. On the other hand, 101 studies out of 182 (55.49%) used methods that rely on human subjects. This includes 31, 67, and 3 studies using quantitative, qualitative, and mixed methods respectively (17.03%, 36.81%, and 1.65% respectively). We remind the reader that the word solution is an abstract concept, which can represent automated algorithms, user interfaces or a complete VA system.

3.2.4 Subjectivity Dimension

The usefulness of a solution can be determined by assessing the objective level of accomplishments. However, the objectives are sometimes defined as abstract ideas that cannot be directly or independently assessed. For example, VA systems have a general objective of generating insights about available data [38]. Such an abstract objective may not always be assessable by defined measures. From another angle, a correlation between subjective assessment such as user satisfaction in information systems and the usefulness of these systems has been shown [28]. Therefore, researchers include subjective assessment methods as ways of determining a solution’s usefulness. Subjective assessment can be performed quantitatively [76, 61] or qualitatively [27], and can be done with the help of human subjects [88] or through inspecting the design without relying on human subjects [42]. Qualitative methods have the flexibility to assess both objective and subjective aspects.

There is a clear difference between summative evaluation methods that use objective versus subjective scopes. Objective methods assess effectiveness and efficiency of a solution in tackling the targeted problem, whereas subjective methods assess factors that correlate with that solution’s capabilities (indirect assessment of usefulness). This led us to include the subjectivity dimension in our taxonomy, to highlight the differences between objective and subjective categories in terms of validity, generalizability, and feasibility.

Our survey shows that 48 (26.37%) studies out of 182 applied objective evaluation methods. 17 studies (9.34%) applied subjective evaluation and 117 studies (64.29%) applied qualitative methods that are not restricted to a narrow scope and can assess both objective and subjective aspects.

3.2.5 The Seven Categories of Summative Evaluation Methods

Table 1 summarizes the surveyed evaluation studies in our seven categories of summative evaluation, fully listed in the supplementary material. The most reported evaluation category in VAST-17 and 18 is case studies, followed by the inspection category. These two types are used significantly more than other evaluation categories. The high feasibility of case studies and inspections could be the reason for their popularity, as we explain in Section 4.2. On the other hand, the least utilized evaluation category is the insight-based methods. Many of the reported studies that capture subjects’ insights do not perform the second stage of defining quantitative measures from captured insights, and thus, end up in the case studies category in our taxonomy.

4 An Analysis of Summative Evaluation Methods

We analyze the identified seven evaluation categories in terms of validity, generalizability and feasibility, in order to compare their capability of proving usefulness, which is the objective of summative evaluation. Some of these methods are originally designed to address different evaluation requirements, such as formative or exploratory questions. However, we include them here, since they have been used by others to prove usefulness. Our focus is to analyze the process of evaluation itself regardless of the type of solutions they evaluate.

Activity | Relevant categories | Description of the risk
Defining the objectives and the objective metric(s) | THEO, QUT, AUTO | Some tasks do not have a clear objective, e.g. exploratory tasks. (feasibility risk)
Abstracting the evaluated solution by a formal language | THEO, AUTO | Some solutions cannot be automated with our current knowledge, e.g. human-dependent solutions. (feasibility risk)
Deductively inferring the performance of the evaluated solution using a formal system | THEO | Building a new formal system requires extraordinary work and high abstraction skills. Reusing a formal system requires skills of mapping abstract problems and performing mathematical deduction. (feasibility risk)
Sampling problem instance(s) | QUT, QUO, AUTO, INST, CASE | Relying on unrepresentative problem instances. (validity, generalizability risk)
Sampling human subject(s) | QUT, QUO, INST, CASE | Relying on unrepresentative target users. (validity, generalizability risk)
Sampling competing solution(s) | QUT, QUO, AUTO, INST, CASE | Bias in selecting the competing solutions included in a comparative evaluation study. (validity, generalizability risk)
Identifying the ground truth | QUT, AUTO | Unavailable ground truth for a representative number of problem instances. (feasibility risk)
Organizing studied treatments | QUT, QUO, INST | Failing to eliminate confounders. (validity, generalizability risk)
Statistical testing | QUT, QUO, AUTO, INST | A potential reduction of risk as a result of testing the statistical significance of quantitative analysis findings. (validity, generalizability risk reduction)
Qualitatively identifying insights | INST | Subjects potentially misreporting reached insights / researchers potentially mis-collecting reached insights. (validity, generalizability risk)
Defining quantity from insights | INST | Defining a metric that does not reflect the value of solutions. (validity risk)
Collecting and interpreting qualitative data | CASE | Missing essential pieces of information / misinterpreting the value of a solution evaluated using collected information. (validity, generalizability risk)
Identifying the requirements / heuristics sources | INSP | Relying on a source that provides fewer requirements / heuristics than needed to distinguish a useful solution from another. (validity, generalizability risk)
Requirements / heuristics elicitation | INSP | Mis-eliciting requirements / heuristics from the identified source. (validity, generalizability risk)
Judging the satisfaction of the requirements / heuristics | INSP | Inspector subjectivity in checking the accomplishment of requirements / heuristics. (validity, generalizability risk)
Indirect inference of usefulness | QUO, CASE, INSP | Inferring the value of a solution from measures or findings that do not directly test the solution objectively. (validity, generalizability risk)
Table 2: The source of validity, generalizability and feasibility risks encountered when conducting summative evaluation studies.

4.1 Analysis Criteria

Validity and generalizability are well-known properties of generated evidence in scientific studies and have been broken down into many types. The primary types influencing our analysis are internal validity and external validity as defined in experimental quantitative studies [9], as well as credibility and transferability as defined in qualitative studies literature [45]. We view validity as the property of correctness of study findings, while we see generalizability as the extent to which study findings can be applied to similar but unstudied (unevaluated) cases.

By examining the findings of each evaluation method, we found four types of summative evidence which assess effectiveness or efficiency:

  1. quantities that represent the objective performance (measured or estimated by a method from THEO, QUT, AUTO, or INST),

  2. quantities that represent subjective satisfaction (estimated by a method from the QUO category),

  3. qualitative information about objective or subjective value of a solution (gathered by CASE methods),

  4. accomplishment of requirements/heuristics (inspected by a method belonging to INSP category).

Each evaluation method includes a set of activities resulting in one of the aforementioned four types of evidence. In our analysis, we outline the activities for each method and highlight risk factors associated with each activity. We rely on the definition of risk found in the software engineering literature [6], which defines exposure to risk as the probability-weighted impact of an event on a project (evaluation in our case). The identified risk factors may affect the validity and generalizability of the outcome of each method. For example, the generalizability of empirical evidence is affected by the sampling of cases for the study. Thus, in our analysis, we designate sampling as an activity for empirical evaluation methods and associate it with potential generalizability risk. On the contrary, some activities may reduce risks to validity or generalizability. For example, a typical activity to maintain the validity of quantitative empirical evidence is to apply inferential statistical tests [48]. Such testing activity is an example of what we call a risk reducer.

Besides validity and generalizability, feasibility is the third criterion we consider in our analysis. We include this criterion to reason about researchers’ decisions to evaluate solutions using methods with less summative quality. Table 2 describes the potential validity, generalizability and feasibility risks we identify for each of the summative evaluation categories, along with the source of these risks.

4.2 Evaluation Process Breakdown

We break down the (sub)activities common to the methods in each category of our taxonomy. Then, we highlight the risks introduced or reduced as a result of performing these activities. The process of identifying the activities and highlighting their associated risks was based on our personal experience and validated by the survey we report in Section 3. Figure 2 presents a summary of our analysis, along with risks highlighted on each activity.

Figure 2: A summary of our analysis of evaluation methods. We capture the main activities taken by evaluation methods which could introduce risk to evidence validity, generalizability and feasibility. We assign 3 risk categories for these criteria per activity, classify each risk factor to high, normal or reducer class, then compare the methods using their summative quality (SQ) and feasibility.

4.2.1 Theoretical Methods (THEO)

This category includes the complexity analysis of algorithms and information-theoretic frameworks. These rational methods start by defining an objective metric, e.g. time complexity, which is a useful measurement for assessment or comparison tasks. To measure the metric, researchers are required to abstract the behavior of the solution using a formal language, e.g. a programming language (Figure 2). This explicitly requires full knowledge about the behavior of the solution. The last activity is to build a formal system, e.g. a Turing machine [19], and use the premises in that system, e.g. unit execution time per instruction, to deductively measure the defined objective metric. Most rational studies captured in our survey apply the analysis-of-algorithms method to measure the time complexity of algorithms that are abstract by nature, and thus do not require the second activity. Moreover, the Turing machine is an applicable formal system that can be used to perform the deduction in this context. Another set of rational studies, which are more sophisticated, rely on information theory premises [13]. Most remarkably, these works present an abstraction activity for solutions that are not abstract by nature [14, 12].

The three activities we report for rational methods do not introduce any risk to the validity and generalizability criteria. They are rigorous activities that always measure what they claim to measure. Rational methods also evaluate abstract problems and solutions with well-defined behavior, and thus are completely generalizable to any untested cases. For example, finding that the worst-case time complexity of an algorithm is O(n) means that no observable input of size n will ever take longer than linear execution time.
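As a small illustration of this guarantee, the sketch below counts the comparisons made by a linear search, an algorithm chosen here only as a familiar example with worst-case time complexity O(n). The function name and the step-counting convention are assumptions made for illustration.

```python
def linear_search_steps(items, target):
    """Return (index, comparisons) for a linear scan; index is -1 if absent."""
    comparisons = 0
    for index, value in enumerate(items):
        comparisons += 1
        if value == target:
            return index, comparisons
    return -1, comparisons

# The bound is deduced from the code, not sampled: the loop body runs at
# most once per element, so comparisons <= len(items) for every input.
for n in (0, 1, 10, 1000):
    data = list(range(n))
    _, worst = linear_search_steps(data, -1)      # absent target: worst case
    assert worst == n
    _, typical = linear_search_steps(data, n // 2)
    assert typical <= n
```

The loop at the end only spot-checks the bound; the guarantee itself comes from reasoning about the code, which is what distinguishes a theoretical method from an empirical one.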

The issue of rational methods appears in the feasibility criterion. The first feasibility risk is introduced by the first activity, which defines an objective metric. In many problems, the objective metric might not be feasibly defined. For example, the general goal of VA systems is to generate insights about data, a goal that may not be easily assessed by measurable factors. The second activity introduces a much more severe risk to feasibility. Abstracting the evaluated solution’s behavior using a formal language requires sufficient knowledge about that solution’s behavior, which may not be possible for some types of solutions. For example, it is challenging to develop a formal language representation of human analytical processes, which practically limits the applicability of this type of evaluation to human-in-the-loop solutions. Since a human controls the behavior of these solutions, and we cannot replace the human with a completely automated machine, it is not feasible to describe the human user’s behavior using a formal language. If the behavior of the solution cannot be abstracted, the third activity becomes infeasible since it cannot be performed in a formal manner without an abstract, well-defined solution. Moreover, building a formal system to deductively infer the performance of a solution is challenging and requires high abstraction skills.

4.2.2 Quantitative User Testing (QUT)

In these empirical quantitative methods, researchers study human-in-the-loop solutions by either conducting a formal comparative experiment or measuring the performance of the solutions independently. The latter can be considered a special case of the former. These methods start by defining objective metrics, similar to rational methods. However, a typical activity in all empirical methods is to sample test cases. These cases are determined by sampling problem instances and human subjects. In comparative evaluation studies, the sampling of test cases includes the sampling of competing solutions. To objectively estimate the performance of the solution, researchers need to define ground truth for tested problem instances, which can be either sampled or synthesized [82]. After sampling the test cases, researchers organize human subjects into groups (treatments) according to the study design. Two common designs are the within-subject (repeated measures) and between-subject (independent measures) designs. After organizing the study according to the selected design, researchers test human subjects with the sampled problems and collect quantitative measures of performance for each subject. These performance measurements can subsequently be analyzed per treatment using statistical tests (e.g. analysis of variance, ANOVA). For assessment studies, statistics provide a confidence interval of the measured performance score for the solution. For comparative evaluation, the statistical tests establish the significance of the difference between the performance of treatments. Some criticize this typical hypothesis-testing methodology [24]. Nevertheless, null-hypothesis significance testing (NHST) remains the most recognized methodology in quantitative scientific work.
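To make the statistical step concrete, the following sketch computes the F statistic of a one-way ANOVA for a between-subject design using only the Python standard library. The task-completion times and the treatment names are made-up illustration data, and the function name is hypothetical; in practice a statistics package would also supply the p-value from the F distribution.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of per-treatment samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)
    n = len(all_values)
    # Variation explained by the treatment (between groups).
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Residual variation among subjects within the same treatment.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical completion times (minutes) for three competing solutions.
times = {
    "solution_a": [1.0, 2.0, 3.0],
    "solution_b": [2.0, 3.0, 4.0],
    "solution_c": [3.0, 4.0, 5.0],
}
f_stat, df_b, df_w = one_way_anova_f(list(times.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A larger F indicates that differences between treatments are large relative to the variation among subjects within a treatment; whether it is significant depends on the degrees of freedom and the chosen significance level.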

The activities in the QUT category introduce risk to every criterion we analyze. A risk to the validity and generalizability criteria can be introduced as a result of sampling bias that excludes cases included in the study claim, sampling an insufficient number of cases to prove the claim, or failing to eliminate confounders when organizing treatments. The second risk can be reduced by applying a statistical test to show the potential of observing the findings for represented cases in general. The third risk is not a concern for assessment methods that do not generate evidence of usefulness as a result of comparing treatments.
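One common way to limit confounders when organizing treatments is to randomize group membership in a between-subject design and to counterbalance treatment order in a within-subject design. The sketch below shows both under simple assumptions; the participant IDs, treatment names, and function names are hypothetical.

```python
import random

def assign_between_subjects(subjects, treatments, seed=0):
    """Randomly split subjects across treatments as evenly as possible."""
    rng = random.Random(seed)          # fixed seed keeps the sketch reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

def counterbalance_within_subjects(subjects, treatments):
    """Rotate treatment order per subject so no treatment is always first."""
    k = len(treatments)
    return {
        s: [treatments[(i + j) % k] for j in range(k)]
        for i, s in enumerate(subjects)
    }

subjects = [f"P{i}" for i in range(1, 9)]   # hypothetical participant IDs
treatments = ["new_tool", "baseline"]        # hypothetical competing solutions
between = assign_between_subjects(subjects, treatments)
within = counterbalance_within_subjects(subjects, treatments)
```

Random assignment does not remove sampling bias; it only helps keep subject-related differences from lining up systematically with one treatment.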

The activities of QUT introduce risk to the feasibility criterion as well. Sampling representative cases can be infeasible because of the unavailability of representative human subjects or representative problem instances with known ground truth. Moreover, as in rational methods, it may not always be possible to identify a clear, objective metric that correctly distinguishes useful solutions from non-useful ones.

4.2.3 Quantitative User Opinion (QUO)

The activities in this category are quite similar to the previous category. However, the focus here is on assessing subjective aspects instead of the objective performance, and there is no need to establish ground truth for the test cases.

The difference between subjective and objective methods in terms of risk can be illustrated as follows. The risk to validity is higher in subjective methods since, besides potential sampling and assignment biases, subjective methods do not assess usefulness directly. As we have mentioned earlier, the evaluation of usefulness by definition is a way to assess solutions objectively. Subjective methods approach this by assessing factors that are assumed to correlate with usefulness, such as user satisfaction. However, such correlation may not always be valid. According to Nelson [55], a system with limited utility could have high usability but would not be useful because of the missing functionalities. However, subjective methods are more feasible than objective methods. They require neither knowledge about ground truth nor quantified objectives, and thus can be applied in more cases.

4.2.4 Quantitative Automation Testing (AUTO)

These methods apply the same activities as THEO methods. The only difference between the two categories is the method of measuring the objective metrics for abstract solutions. In THEO, extensive work is devoted to building the formal system used in deduction, which is challenging because it requires high abstraction skills and sufficient knowledge about the problem domain. An alternative approach, taken by methods in the AUTO category, is to prove usefulness empirically by relying on sampled cases and statistics. For example, most methods used to evaluate machine learning models rely on estimating the performance with a set of testing problem instances [39].

The risk to the validity and generalizability of the evidence generated by a method from the AUTO category is slightly less than the risk associated with the QUT category. The reason is the reduction in sampling bias in AUTO methods as the result of excluding the human dimension. On the other hand, the exclusion of the human dimension explicitly means less feasibility of AUTO methods, since they are only capable of evaluating abstract solutions described by a formal language.

4.2.5 Insight-based Evaluations (INST)

As an empirical category, sampling activities are typical in INST. A unique activity in this category is the qualitative data collection of insights. This is done by asking human subjects to self-report any insights they reach during the analysis by applying techniques such as diary [64] or think-aloud protocols [71]. Another unique activity in this category is the creation of measurable quantities out of collected qualitative data. The typical quantity to generate is insights count, which gives an indication of the usefulness of analytical support solutions.
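The second activity, turning collected insights into a quantity, can be sketched as a simple tally. The records below are invented examples of coded insight reports, and the field names (`subject`, `solution`, `insight`) and function names are hypothetical.

```python
from collections import Counter

# Each record is one self-reported insight, already transcribed and coded
# by the researcher (hypothetical data for illustration only).
reports = [
    {"subject": "P1", "solution": "tool_a", "insight": "sales peak in Q4"},
    {"subject": "P1", "solution": "tool_a", "insight": "region X is an outlier"},
    {"subject": "P2", "solution": "tool_a", "insight": "sales peak in Q4"},
    {"subject": "P3", "solution": "tool_b", "insight": "region X is an outlier"},
]

def insight_counts(records):
    """Count reported insights per (solution, subject) pair."""
    return Counter((r["solution"], r["subject"]) for r in records)

def mean_insights_per_subject(records):
    """Average insight count per reporting subject, grouped by solution."""
    per_pair = insight_counts(records)
    totals, subjects = Counter(), Counter()
    for (solution, _subject), n in per_pair.items():
        totals[solution] += n
        subjects[solution] += 1
    return {s: totals[s] / subjects[s] for s in totals}

print(mean_insights_per_subject(reports))
```

Note that this tally only sees subjects who reported at least one insight; a real study would also need the full participant list so that subjects with zero insights are counted.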

Besides sampling bias, which can introduce risk to both validity and generalizability, INST’s unique activities may increase the risk to these criteria. For example, collecting insights as qualitative data introduces the possibility of misreporting some insights or misunderstanding reported ones. However, INST has a low feasibility risk since it does not require defining any objective metrics nor developing any tasks that ought to be evaluated quantitatively. INST also does not require prior knowledge about the ground truth of sampled problem instances.

4.2.6 Case Studies (CASE)

Instead of measuring the accomplishment of solutions with some predefined metric (which may not be feasible or known for concrete domain problems), CASE methods study realistic cases defined by actual real-world problem instances and intended users who are usually experts. To extract evidence of usefulness, evaluators pay extra attention to any data that can be captured during the examination. Collecting qualitative data is essential in case studies for creating a rich source of information, which helps in determining the usefulness of evaluated solutions. Many techniques can be implemented to generate qualitative data, including observation, semi-structured interviews, subject feedback, think-aloud protocols, video/audio recordings, interaction logs, eye tracking and screen capturing [10]. During data collection, researchers may assist human subjects to overcome learnability issues. From the collected qualitative data, researchers can infer the value of evaluated solutions from the human subjects’ perspective. This hypothesis about the evaluated solutions’ value can be used as evidence of usefulness, given that the human subjects are experts in the problem domain.

The risk to the validity and generalizability criteria for CASE methods can be explained as follows. Besides possible sampling bias, qualitative methods evaluate usefulness indirectly. The risk resulting from this indirectness can stem from two issues. The first is the potential misunderstanding of the human subjects when hypothesizing the value of the evaluated solution, which is typically known as the credibility of study findings. The second is the credibility of the subjects themselves, whose opinions are considered evidence of usefulness. This validity is affected primarily by how knowledgeable the subjects are about the problem domain, and secondarily by how much they know about using the evaluated solution. Another possible source of risk to the validity of case studies comes from the evaluators. The data collection and analysis in case studies can be profoundly affected by evaluators’ subjectivity. Inexperienced evaluators may miss relevant information during data collection or wrongly infer the value of the solution from collected data. The risk introduced by the evaluators can be minimized by experience and by following guidelines that reduce subjectivity. There is extensive existing literature on the correct application of qualitative studies [74, 21].

The advantage of case studies lies in their feasibility. They do not require specifying and measuring objective metrics or abstracting the solution. They also do not require knowledge about ground truth for the problem instances included in the test cases, because their objective assessment is derived from the opinions of experts, who are assumed to be capable of assessing the usefulness while testing the solution. The only feasibility risks in this category are the availability of expert human subjects and the sampling of representative realistic problem instances.

4.2.7 Inspection Methods (INSP)

The first activity of INSP is to identify the factors needed in useful systems through methods such as conducting a qualitative inquiry with stakeholders to identify requirements [84] or surveying the literature to identify known heuristics [56]. Once a set of requirements/heuristics is identified, researchers start inspecting the evaluated solution and judge whether it satisfies the identified requirements/heuristics.

INSP includes the most feasible methods, requiring neither human subjects nor testing with any problem instances. However, these methods prove usefulness marginally and with many validity and generalizability concerns. The risks to the validity and generalizability of the findings of INSP include (a) the credibility of the information source, (b) the exhaustiveness of the elicited requirements/heuristics, and (c) the subjectivity of the inspectors. Inspection methods have been shown to have significantly less potential for identifying usability issues compared to formal testing [25]. This finding inherently means high risks to both the validity and generalizability of INSP’s evidence of usefulness.

4.3 A Ranking of the Summative Evaluation Categories

After identifying risk factors to the validity and generalizability, we combine both criteria into a single metric which we call summative quality (SQ). The term is inspired by applied medical research for categorizing and ranking the quality of research evidence [30]. We define SQ as the probability of not falling into any of the potential validity and generalizability risks introduced by a set of activities, i.e. the probability that a piece of evidence is valid and generalizable. Similarly, we consider feasibility as the probability of not falling into any of the risk factors that threaten feasibility.

SQ can be calculated by equation 1. We assume that the risks introduced by different activities are independent. Thus, to measure the total quality from subsequent activities, we take the product of the complement of the probability of risk in each activity. Taking the product is typical in similar total probability calculations (e.g. [62]). It is worth mentioning that the granularity of describing evaluation methods should not affect the total risk calculation. A single activity in a coarse-grained description of a method should accumulate all risk probabilities of that method when described in a fine-grained manner.
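From the description above, equation 1 can be written in the following form. The notation is an assumption made here for readability: $A$ denotes the set of activities in an evaluation method and $p_a$ the probability that activity $a$ triggers one of its validity or generalizability risks.

```latex
\begin{equation}
SQ \;=\; \prod_{a \in A} \left(1 - p_a\right)
\end{equation}
```

Under the same assumed notation, merging two independent activities $a_1$ and $a_2$ into one coarse-grained activity $a$ with $p_a = 1 - (1 - p_{a_1})(1 - p_{a_2})$ leaves the product unchanged, which is the granularity property stated above.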


Equation 1 measures the product of the probabilities of not falling into any of the validity and generalizability risks. This model of risk assessment requires estimating the probability of the captured risks, which is a challenging task. To overcome this issue and to be able to compare evaluation methods, we categorize the risk factors into three groups: high risk (HR), normal risk (NR) and risk reducers (RR) (Figure 2). High risk factors are introduced by any activities that infer usefulness indirectly (i.e., from evidence that does not measure objective metrics). Such activities produce evidence of usefulness that has more uncertainty because of the evaluators’ high potential for subjectivity.

Using the categories of risk, we define an approximation of SQ to compare evaluation methods in lieu of the probability-based SQ of equation 1. This approximation can be defined as a triplet (HR, NR, RR), with each dimension representing the number of risk factors in each category. We calculate the triplet for all categories, then use the resultant triplets to observe any clear superiority of one category over another (e.g. (2,6,1) has less summative quality than (0,8,1), given its two high risks). Based on this, we rank evaluation methods in terms of their summative quality (Table 3). The table also ranks evaluation methods based on feasibility, which can be defined analogously as a tuple considering the feasibility risk factors. In case a clear superiority cannot be decided (e.g. (0,10,1) vs. (2,6,1)), we assign the same ranking to these methods (more examples of ranking calculations are in the supplementary material). We stress that even though some categories rank low for summative quality, they may still be suitable for other purposes, such as formative or exploratory evaluation.
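The comparison of risk-count triplets can be sketched as a partial order. The dominance rule below is an assumption chosen to reproduce the two examples in the text (fewer or equal high risks, fewer or equal risks in total, and at least as many reducers); the exact procedure is described in the supplementary material and may differ. The type and function names are hypothetical.

```python
from typing import NamedTuple

class RiskCounts(NamedTuple):
    high: int      # number of high-risk factors (HR)
    normal: int    # number of normal-risk factors (NR)
    reducers: int  # number of risk reducers (RR)

def clearly_better(a: RiskCounts, b: RiskCounts) -> bool:
    """Assumed rule: a beats b if it is no worse on every count and differs."""
    no_worse = (
        a.high <= b.high
        and a.high + a.normal <= b.high + b.normal
        and a.reducers >= b.reducers
    )
    return no_worse and a != b

def compare(a: RiskCounts, b: RiskCounts) -> str:
    """Return which triplet is clearly superior, or 'tie' if undecidable."""
    if clearly_better(a, b):
        return "first"
    if clearly_better(b, a):
        return "second"
    return "tie"   # no clear superiority: both receive the same rank

print(compare(RiskCounts(0, 8, 1), RiskCounts(2, 6, 1)))
print(compare(RiskCounts(0, 10, 1), RiskCounts(2, 6, 1)))
```

Because the rule is only a partial order, some pairs stay incomparable, which is why several categories can share a rank.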

Abbreviation | Category | Summative quality rank | Feasibility rank
THEO | Theoretical methods | 1 | 6
QUT | Quantitative user testing | 3 | 4
QUO | Quantitative user opinion | 5 | 2
AUTO | Quantitative automation testing | 2 | 5
INST | Insight-based evaluation | 4 | 2
CASE | Case studies | 4 | 3
INSP | Inspection methods | 4 | 1
Table 3: The ranking of the seven categories of summative evaluation methods based on the potential risk to their validity, generalizability, and feasibility. We rank the categories according to their summative quality and feasibility.

5 Recommendations

Based on our taxonomy and analysis of summative evaluation methods, we provide the following recommendations:

1- Always select a feasible method with the highest summative quality.

Prescribing an evaluation method for a given context can be done based on summative quality and feasibility. It is always encouraged to select the method with the highest summative quality. However, the feasibility of applying one of the methods in a given evaluation context may influence the selection. For example, rational methods are superior to empirical methods when testing usefulness; however, researchers may use an empirical method to evaluate a human-in-the-loop solution because abstracting human behavior using a formal language is infeasible, as previously mentioned.

Our approach complements the nested model [53], which prescribes potential evaluation methods for each level. For instance, four different methods were prescribed to validate a solution at the encoding level. Complementing such prescriptions with our approach can narrow the choice down to one of the methods that the nested model prescribes.

2- Provide reasoning for evaluation method choice.

We suggest providing solid reasoning when choosing an evaluation method for a summative evaluation. Our framework may help in this reasoning by considering the summative quality and feasibility as criteria. We note that it is always possible to use a weaker form of proving usefulness when it is feasible to generate stronger evidence with another method. For example, one can rely on subjective methods to assess the usefulness of a solution designed to tackle a problem that can be evaluated objectively. In such scenarios, evaluators should explain the limitation that prevents them from using the method that generates stronger evidence of usefulness.

An example from the literature of a study that could have provided such an explanation is [59]. The authors used the inspection method to evaluate the usefulness of DeepEyes, a VA system developed to enhance designing deep neural networks. DeepEyes could have been evaluated using a formal controlled experiment, i.e. by measuring the training time and the classification accuracy of the end architecture (when using DeepEyes vs. traditional trial and error). Inspection has less summative quality compared to controlled experiments; thus, choosing the former over the latter requires justification.

3- Encourage insight-based evaluation.

A surprising finding from our survey is the limited application of insight-based evaluation to published work in VA. According to our analysis, insight-based evaluation is one of the few methods that do not suffer from high risk factors. It is capable of assessing human analytical processes with realistic problems while generating quantitative outcomes that can be replicated and generalized. According to our survey, researchers favor case studies over insight-based methods in evaluation contexts that are suitable for both. We encourage performing insight quantification and quantitative analysis instead of case studies to increase the precision and generalization potential of the findings.

4- Apply multiple evaluation methods to minimize risk

Our final recommendation encourages practitioners to apply multiple evaluation methods to prove the usefulness of their developed solutions. All of the evaluation methods include activities that could potentially invalidate the evidence they generate. An easy remedy is to compare the level of usefulness reached by different methods. This recommendation is strongly encouraged for subjective methods and inspection methods because of their relatively high validity and generalizability risks. Subjective methods are usually utilized to complement objective assessment, which is an excellent strategy for measuring usefulness from different angles.

6 Conclusion and Future Work

We presented our survey of evaluation practices used with summative intentions in VA. We identified seven categories of evaluation, broke down the activities in each, and analyzed each category in terms of feasibility as well as the validity and generalizability of their findings. We proposed summative quality as the primary metric for selecting evaluation methods for the summative intention of proving usefulness. Based on the summative quality metric and the complementary feasibility metric, we proposed a ranking of the categories of evaluation.

One of the limitations in our analysis is the possible subjectivity in identifying risk factors. We attempted to minimize it by continuously consulting the literature and conducting a survey. Assigning risk factors to only two categories could also be considered a limitation. However, we favor robustness over precision when analyzing evaluation methods.

Even though we based risk analysis on extensive literature and our survey, our proposed ranking of evaluation methods might be considered subjective. Regardless, we argue that it characterizes the risks involved in selecting methods for summative evaluation, and most importantly, our risk analysis paves the way for future research and community ranking, similar to many repeated fruitful efforts in medical research [30]. Categorizing risks associated with activities and even quantifying such risks based on expert-assigned scores or probabilities is an established practice in system engineering and risk assessment [31], and our work lays the foundation for such analysis of evaluation methods in VA.

By identifying risk factors and providing a methodology, our work also enables community-driven prescription of evaluation methods. According to [32], experts have high potential in judging risk factors and assigning probabilities. This approach can be used to assign probabilities to our identified risk factors using equation 1 (see the supplementary materials for an example of such an approach). To reduce subjectivity in judgment, one can deploy a community-driven voting system to increase the accuracy of estimating the risk probabilities and to build standards to prescribe evaluation methods.
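A minimal sketch of this idea: several experts each estimate the probability of every risk factor of a method, the estimates are averaged, and the averages are combined as the product of complements described for equation 1. The activity names, the expert scores, and the use of a plain mean as the aggregation rule are all assumptions for illustration.

```python
from math import prod
from statistics import mean

def summative_quality(risk_probabilities):
    """Probability that none of the (assumed independent) risks occurs."""
    return prod(1.0 - p for p in risk_probabilities)

# Hypothetical expert votes: activity -> probabilities from three experts.
votes = {
    "sampling problem instances": [0.10, 0.20, 0.15],
    "sampling human subjects":    [0.20, 0.30, 0.25],
    "organizing treatments":      [0.05, 0.10, 0.15],
}
consensus = {activity: mean(ps) for activity, ps in votes.items()}
sq = summative_quality(consensus.values())
print(f"SQ = {sq:.3f}")
```

Other aggregation rules, such as a median or a weighted vote, could replace the mean without changing the rest of the calculation.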

Another direction to pursue in the future is to examine methods that have not been utilized in VA literature and their applicability in the field. An example of these methods is the formal specification and verification method [8], which can evaluate the effectiveness of algorithms instead of measuring their efficiency. We are also interested in exploring mixed methods from other domains that resemble insight-based methods, because they can assess usefulness along with providing explanatory information to consider in relation to the captured performance.

The authors wish to thank Christina Stober, Gourav Jhanwar, and Bryan Jimenez for their comments and for proofreading the paper.


  • [1] K. Allendoerfer, S. Aluker, G. Panjwani, J. Proctor, D. Sturtz, M. Vukovic, and C. Chen. Adapting the cognitive walkthrough method to assess the usability of a knowledge domain visualization. In Information Visualization, 2005. INFOVIS 2005. IEEE Symposium on, pp. 195–202, 2005.
  • [2] K. Andrews. Evaluation comes in many guises. In CHI workshop on BEyond time and errors: novel evaLuation methods for Information Visualization (BELIV), pp. 7–8, 2008.
  • [3] N. Andrienko, G. Andrienko, J. M. C. Garcia, and D. Scarlatti. Analysis of flight variability: a systematic approach. IEEE Transactions on Visualization and Computer Graphics, 25(1):54–64, 2019.
  • [4] M. Angelini, G. Blasilli, T. Catarci, S. Lenti, and G. Santucci. Vulnus: Visual vulnerability analysis for network security. IEEE Transactions on Visualization and Computer Graphics, 25(1):183–192, 2019.
  • [5] R. Arias-Hernandez, L. Kaastra, T. M. Green, and B. D. Fisher. Pair analytics: Capturing reasoning processes in collaborative visual analytics. In 2011 44th Hawaii International Conference on System Sciences, 2011.
  • [6] P. L. Bannerman. Risk and risk management in software projects: A reassessment. Journal of Systems and Software, 81(12):2118–2133, 2008.
  • [7] J. Bernard, M. Hutter, M. Zeppelzauer, D. Fellner, and M. Sedlmair. Comparing visual-interactive labeling with active learning: An experimental study. IEEE Transactions on Visualization and Computer Graphics, 24(1):298–308, 2018.
  • [8] G. Bernot, M.-C. Gaudel, and B. Marre. Software testing based on formal specifications: a theory and a tool. Software Engineering Journal, 6(6):387–405, 1991.
  • [9] D. T. Campbell and J. C. Stanley. Experimental and quasi-experimental designs for research. Rand McNally, Chicago, 1963.
  • [10] S. Carpendale. Evaluating Information Visualizations. Springer, Berlin, Heidelberg, 2008.
  • [11] G. Y.-Y. Chan, P. Xu, Z. Dai, and L. Ren. ViBr: Visualizing bipartite relations at scale with the minimum description length principle. IEEE Transactions on Visualization and Computer Graphics, 25(1):321–330, 2019.
  • [12] M. Chen and D. S. Ebert. An ontological framework for supporting the design and evaluation of visual analytics systems. Computer Graphics Forum, 38(3):131–144, 2019.
  • [13] M. Chen, M. Feixas, I. Viola, A. Bardera, H.-W. Shen, and M. Sbert. Information theory tools for visualization. AK Peters/CRC Press, 2016.
  • [14] M. Chen, K. Gaither, N. W. John, and B. McCann. An information-theoretic approach to the cost-benefit analysis of visualization in virtual environments. IEEE Transactions on Visualization and Computer Graphics, 25(1):32–42, 2019.
  • [15] M. Chen and A. Golan. What may visualization processes optimize? IEEE Transactions on Visualization and Computer Graphics, 22(12):2619–2632, 2016.
  • [16] M. Chen and H. Jaenicke. An information-theoretic framework for visualization. IEEE Transactions on Visualization and Computer Graphics, 16(6):1206–1215, 2010.
  • [17] Y. Chen, P. Xu, and L. Ren. Sequence synopsis: Optimize visual summary of temporal event data. IEEE Transactions on Visualization and Computer Graphics, 24(1):45–55, 2018.
  • [18] H. Chung, S. P. Dasari, S. Nandhakumar, and C. Andrews. Cricto: Supporting sensemaking through crowdsourced information schematization. In 2017 IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 139–150, 2017.
  • [19] B. J. Copeland. The essential turing. Clarendon Press, Oxford, England, UK, 2004.
  • [20] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to algorithms. MIT press, Cambridge, MA, 2009.
  • [21] J. W. Creswell. Qualitative inquiry & research design: choosing among five approaches. SAGE, Thousand Oaks, 2017.
  • [22] J. W. Creswell and V. L. P. Clark. Designing and conducting mixed methods research. SAGE, 2017.
  • [23] A. Crisan and M. Elliott. How to evaluate an evaluation study? comparing and contrasting practices in vis with those of other disciplines: Position paper. In 2018 IEEE Evaluation and Beyond-Methodological Approaches for Visualization (BELIV), pp. 28–36, 2018.
  • [24] G. Cumming. The new statistics: Why and how. Psychological Science, 25(1):7–29, 2014.
  • [25] H. W. Desurvire. Usability inspection methods. chap. Faster, Cheaper!! Are Usability Inspection Methods As Effective As Empirical Testing?, pp. 173–202. John Wiley & Sons, Inc., Hoboken, NJ, 1994.
  • [26] G. Ellis and A. Dix. An explorative analysis of user evaluation studies in information visualisation. In Proceedings of the 2006 Workshop on Beyond Time and Errors: Novel Evaluation Methods for Visualization (BELIV), BELIV ’06, pp. 1–7, 2006.
  • [27] S. Fu, H. Dong, W. Cui, J. Zhao, and H. Qu. How do ancestral traits shape family trees over generations? IEEE Transactions on Visualization and Computer Graphics, 24(1):205–214, 2018.
  • [28] M. Gelderman. The relation between user satisfaction, usage of information systems and performance. Information & Management, 34(1):11–18, 1998.
  • [29] G. Grinstein, A. Kobsa, C. Plaisant, and J. T. Stasko. Which comes first, usability or utility? In IEEE Visualization, 2003. VIS 2003., pp. 605–606, 2003.
  • [30] G. H. Guyatt, A. D. Oxman, G. E. Vist, R. Kunz, Y. Falck-Ytter, P. Alonso-Coello, and H. J. Schünemann. Grade: an emerging consensus on rating quality of evidence and strength of recommendations. The British Medical Journal (BMJ), 336(7650):924–926, 2008.
  • [31] Y. Y. Haimes. Risk modeling, assessment, and management. John Wiley & Sons, Hoboken, NJ, 2015.
  • [32] D. Hubbard and D. Evans. Problems with scoring methods and ordinal scales in risk assessment. IBM Journal of Research and Development, 54(3):2:1–2:10, 2010.
  • [33] P. Isenberg, T. Zuk, C. Collins, and S. Carpendale. Grounded evaluation of information visualizations. In Proceedings of the 2008 Workshop on Beyond Time and Errors: Novel Evaluation Methods for Visualization (BELIV), pp. 6:1–6:8, 2008.
  • [34] T. Isenberg, P. Isenberg, J. Chen, M. Sedlmair, and T. Möller. A systematic review on the practice of evaluating visualization. IEEE Transactions on Visualization and Computer Graphics, 19(12):2818–2827, 2013.
  • [35] S. Jänicke and D. J. Wrisley. Interactive visual alignment of medieval text versions. In 2017 IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 127–138, 2017.
  • [36] Y.-a. Kang, C. Gorg, and J. Stasko. Evaluating visual analytics systems for investigative analysis: Deriving design principles from a case study. In 2009 IEEE Symposium on Visual Analytics Science and Technology, pp. 139–146, 2009.
  • [37] M. C. Kaptein, C. Nass, and P. Markopoulos. Powerful and consistent analysis of likert-type ratingscales. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2391–2394, 2010.
  • [38] D. A. Keim, F. Mansmann, J. Schneidewind, J. Thomas, and H. Ziegler. Visual Analytics: Scope and Challenges, pp. 76–90. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.
  • [39] R. Kohavi et al. A study of cross-validation and bootstrap for accuracy estimation and model selection. In International Joint Conference on Artificial Intelligence (IJCAI), vol. 14, pp. 1137–1145, 1995.
  • [40] H. Lam, E. Bertini, P. Isenberg, C. Plaisant, and S. Carpendale. Empirical studies in information visualization: Seven scenarios. IEEE Transactions on Visualization and Computer Graphics, 18(9):1520–1536, 2012.
  • [41] P.-M. Law, R. C. Basole, and Y. Wu. Duet: Helping data analysis novices conduct pairwise comparisons by minimal specification. IEEE Transactions on Visualization and Computer Graphics, 25(1):427–437, 2019.
  • [42] P.-M. Law, Z. Liu, S. Malik, and R. C. Basole. MAQUI: Interweaving queries and pattern mining for recursive event sequence exploration. IEEE Transactions on Visualization and Computer Graphics, 25(1):396–406, 2019.
  • [43] R. A. Leite, T. Gschwandtner, S. Miksch, S. Kriglstein, M. Pohl, E. Gstrein, and J. Kuntner. Eva: Visual analytics to identify fraudulent events. IEEE Transactions on Visualization and Computer Graphics, 24(1):330–339, 2018.
  • [44] H. Lin, S. Gao, D. Gotz, F. Du, J. He, and N. Cao. Rclens: Interactive rare category exploration and identification. IEEE Transactions on Visualization and Computer Graphics, 24(7):2223–2237, 2018.
  • [45] Y. S. Lincoln and E. G. Guba. Naturalistic inquiry. SAGE, Thousand Oaks, 1985.
  • [46] S. Liu, C. Chen, Y. Lu, F. Ouyang, and B. Wang. An interactive method to improve crowdsourced annotations. IEEE Transactions on Visualization and Computer Graphics, 25(1):235–245, 2019.
  • [47] N. Mahyar, S.-H. Kim, and B. C. Kwon. Towards a taxonomy for evaluating user engagement in information visualization. In Workshop on Personal Visualization: Exploring Everyday Life, 2015.
  • [48] G. Marczyk, D. DeMatteo, and D. Festinger. Essentials of research design and methodology. John Wiley & Sons Inc, Hoboken, NJ, 2005.
  • [49] J. E. McGrath. Methodology matters: Doing research in the behavioral and social sciences. In Readings in Human–Computer Interaction, pp. 152–169. Morgan Kaufmann, 1995.
  • [50] S. McKenna, D. Mazur, J. Agutter, and M. Meyer. Design activity framework for visualization design. IEEE Transactions on Visualization and Computer Graphics, 20(12):2191–2200, 2014.
  • [51] M. Meyer, M. Sedlmair, and T. Munzner. The four-level nested model revisited: blocks and guidelines. In Proceedings of the 2012 Workshop on Beyond Time and Errors: Novel Evaluation Methods for Visualization (BELIV), p. 11, 2012.
  • [52] Y. Ming, H. Qu, and E. Bertini. RuleMatrix: Visualizing and understanding classifiers with rules. IEEE Transactions on Visualization and Computer Graphics, 25(1):342–352, 2019.
  • [53] T. Munzner. A nested process model for visualization design and validation. IEEE Transactions on Visualization & Computer Graphics, (6):921–928, 2009.
  • [54] P. K. Muthumanickam, K. Vrotsou, A. Nordman, J. Johansson, and M. Cooper. Identification of temporally varying areas of interest in long-duration eye-tracking data sets. IEEE Transactions on Visualization and Computer Graphics, 25(1):87–97, 2019.
  • [55] J. Nielsen. Usability engineering. Elsevier, Maryland Heights, MO, 1994.
  • [56] J. Nielsen. Usability inspection methods. chap. Heuristic Evaluation, pp. 25–62. John Wiley & Sons, Inc., New York, NY, USA, 1994.
  • [57] D. Orban, D. F. Keefe, A. Biswas, J. Ahrens, and D. Rogers. Drag and track: A direct manipulation interface for contextualizing data instances within a continuous parameter space. IEEE Transactions on Visualization and Computer Graphics, 25(1):256–266, 2019.
  • [58] D. Park, S. Kim, J. Lee, J. Choo, N. Diakopoulos, and N. Elmqvist. Conceptvector: Text visual analytics via interactive lexicon building using word embedding. IEEE Transactions on Visualization and Computer Graphics, 24(1):361–370, 2018.
  • [59] N. Pezzotti, T. Höllt, J. Van Gemert, B. P. Lelieveldt, E. Eisemann, and A. Vilanova. Deepeyes: Progressive visual analytics for designing deep neural networks. IEEE Transactions on Visualization and Computer Graphics, 24(1):98–108, 2017.
  • [60] N. Pezzotti, T. Höllt, J. Van Gemert, B. P. Lelieveldt, E. Eisemann, and A. Vilanova. Deepeyes: Progressive visual analytics for designing deep neural networks. IEEE Transactions on Visualization and Computer Graphics, 24(1):98–108, 2018.
  • [61] R. Pienta, F. Hohman, A. Endert, A. Tamersoy, K. Roundy, C. Gates, S. Navathe, and D. H. Chau. Vigor: interactive visual exploration of graph query results. IEEE Transactions on Visualization and Computer Graphics, 24(1):215–225, 2018.
  • [62] H. Rohani and A. K. Roosta. Calculating total system availability, 2014. Accessed: 2019-07-30.
  • [63] P. Saraiya, C. North, and K. Duca. An insight-based methodology for evaluating bioinformatics visualizations. IEEE Transactions on Visualization and Computer Graphics, 11(4):443–456, 2005.
  • [64] P. Saraiya, C. North, V. Lam, and K. A. Duca. An insight-based longitudinal study of visual analytics. IEEE Transactions on Visualization and Computer Graphics, 12(6):1511–1522, 2006.
  • [65] J. Scholtz. Beyond usability: Evaluation aspects of visual analytic environments. In 2006 IEEE Symposium On Visual Analytics Science And Technology, pp. 145–150, 2006.
  • [66] J. Scholtz. Developing guidelines for assessing visual analytics environments. Information Visualization, 10(3):212–231, 2011.
  • [67] J. C. Scholtz. User-centered evaluation of visual analytics. Synthesis digital library of engineering and computer science. Morgan & Claypool, San Rafael, CA, 2018.
  • [68] M. Sedlmair. Design study contributions come in different guises: Seven guiding scenarios. In Proceedings of the 2016 Workshop on Beyond Time and Errors: Novel Evaluation Methods for Visualization (BELIV), pp. 152–161, 2016.
  • [69] M. Sedlmair, M. Meyer, and T. Munzner. Design study methodology: Reflections from the trenches and the stacks. IEEE Transactions on Visualization and Computer Graphics, 18(12):2431–2440, 2012.
  • [70] B. Shneiderman and C. Plaisant. Strategies for evaluating information visualization tools: multi-dimensional in-depth long-term case studies. In Proceedings of the 2006 Workshop on Beyond Time and Errors: Novel Evaluation Methods for Visualization (BELIV), pp. 1–7, 2006.
  • [71] M. Smuc, E. Mayr, T. Lammarsch, W. Aigner, S. Miksch, and J. Gärtner. To score or not to score? tripling insights for participatory design. IEEE Computer Graphics and Applications, 29(3):29–38, 2009.
  • [72] J. Stasko. Value-driven evaluation of visualizations. In Proceedings of the 2014 Workshop on Beyond Time and Errors: Novel Evaluation Methods for Visualization (BELIV), pp. 46–53, 2014.
  • [73] M. Stein, H. Janetzko, A. Lamprecht, T. Breitkreutz, P. Zimmermann, B. Goldlücke, T. Schreck, G. Andrienko, M. Grossniklaus, and D. A. Keim. Bring it to the pitch: Combining video and movement data to enhance team sport analysis. IEEE Transactions on Visualization and Computer Graphics, 24(1):13–22, 2018.
  • [74] A. L. Strauss and J. Corbin. Basics of qualitative research: grounded theory procedures and techniques. SAGE, Thousand Oaks, 1998.
  • [75] H. Strobelt, S. Gehrmann, M. Behrisch, A. Perer, H. Pfister, and A. M. Rush. Seq2Seq-Vis: A visual debugging tool for sequence-to-sequence models. IEEE Transactions on Visualization and Computer Graphics, 25(1):353–363, 2019.
  • [76] N. Sultanum, D. Singh, M. Brudno, and F. Chevalier. Doccurate: A curation-based approach for clinical text visualization. IEEE Transactions on Visualization and Computer Graphics, 25(1):142–151, 2019.
  • [77] M. Taras. Assessment – Summative and Formative – Some Theoretical Reflections. British Journal of Educational Studies, 53(4):466–478, 2005.
  • [78] M. Tory and T. Moller. Evaluating visualizations: do expert reviews work? IEEE computer graphics and applications, 25(5):8–11, 2005.
  • [79] J. J. Van Wijk. The value of visualization. In VIS 05. IEEE Visualization, 2005., pp. 79–86, 2005.
  • [80] H. Wang, Y. Lu, S. T. Shutters, M. Steptoe, F. Wang, S. Landis, and R. Maciejewski. A visual analytics framework for spatiotemporal trade network analysis. IEEE Transactions on Visualization and Computer Graphics, 25(1):331–341, 2019.
  • [81] J. Wang, L. Gou, H.-W. Shen, and H. Yang. DQNViz: A visual analytics approach to understand deep q-networks. IEEE Transactions on Visualization and Computer Graphics, 25(1):288–298, 2019.
  • [82] M. A. Whiting, J. Haack, and C. Varley. Creating realistic, scenario-based synthetic data for test and evaluation of information analytics software. In Proceedings of the 2008 Workshop on Beyond Time and Errors: Novel Evaluation Methods for Visualization (BELIV), p. 8, 2008.
  • [83] L. Wilkinson. Visualizing big data outliers through distributed aggregation. IEEE Transactions on Visualization & Computer Graphics, (1):1–1, 2018.
  • [84] Y. Wu, X. Xie, J. Wang, D. Deng, H. Liang, H. Zhang, S. Cheng, and W. Chen. Forvizor: Visualizing spatio-temporal team formations in soccer. IEEE Transactions on Visualization and Computer Graphics, 25(1):65–75, 2019.
  • [85] C. Xie, W. Xu, and K. Mueller. A visual analytics framework for the detection of anomalous call stack trees in high performance computing applications. IEEE Transactions on Visualization and Computer Graphics, 25(1):215–224, 2019.
  • [86] J. Zhang, Y. Wang, P. Molino, L. Li, and D. S. Ebert. Manifold: A model-agnostic framework for interpretation and diagnosis of machine learning models. IEEE Transactions on Visualization and Computer Graphics, 25(1):364–373, 2019.
  • [87] J. Zhao, M. Glueck, P. Isenberg, F. Chevalier, and A. Khan. Supporting handoff in asynchronous collaborative sensemaking using knowledge-transfer graphs. IEEE Transactions on Visualization and Computer Graphics, 24(1):340–350, 2018.
  • [88] X. Zhao, Y. Wu, W. Cui, X. Du, Y. Chen, Y. Wang, D. L. Lee, and H. Qu. Skylens: Visual analysis of skyline on multi-dimensional data. IEEE Transactions on Visualization and Computer Graphics, 24(1):246–255, 2018.
  • [89] Y. Zhao, F. Luo, M. Chen, Y. Wang, J. Xia, F. Zhou, Y. Wang, Y. Chen, and W. Chen. Evaluating multi-dimensional visualizations for understanding fuzzy clusters. IEEE Transactions on Visualization and Computer Graphics, 25(1):12–21, 2019.