# A Bounded Measure for Estimating the Benefit of Visualization

Min Chen (ORCID 0000-0001-5320-5729), Mateu Sbert (ORCID 0000-0003-2164-6858), Alfie Abdul-Rahman, and Deborah Silver

University of Oxford, UK; University of Girona, Spain; King’s College London, UK; and Rutgers University, USA

## Abstract

Information theory can be used to analyze the cost-benefit of visualization processes. However, the current measure of benefit contains an unbounded term that is neither easy to estimate nor intuitive to interpret. In this work, we propose to revise the existing cost-benefit measure by replacing the unbounded term with a bounded one. We examine a number of bounded measures that include the Jensen-Shannon divergence and a new divergence measure formulated as part of this work. We use visual analysis to support the multi-criteria comparison, enabling the selection of the most logical and intuitive option. We applied the revised cost-benefit measure to two case studies, demonstrating its uses in practical scenarios, while the collected real world data further informs the selection of a bounded measure.

## 1 Introduction

It is now widely understood among visualization researchers and practitioners that the effectiveness of a visualization process depends on data, user, and task. One important aspect of a user is the user’s knowledge, which plays a critical role in reconstructing the information lost during visualization processes (e.g., data transformation and visual mapping). One major challenge in appreciating the significance of such knowledge is the difficulty of measuring or estimating the knowledge used by a user during visualization.

Chen and Golan proposed an information-theoretic measure [CG16] for measuring the cost-benefit of a data intelligence process. The measure features a term based on the Kullback-Leibler (KL) divergence [KL51] for measuring the potential distortion of a user in reconstructing the information that may have been lost or distorted during a visualization process. The cost-benefit ratio suggests that a user with more knowledge about the source data and its visual representation is likely to suffer less distortion. While using KL-divergence is mathematically intrinsic for measuring the potential distortion, its unboundedness property has some undesirable consequences. Kijmongkolchai et al. applied the formula of Chen and Golan to the results of an empirical study for estimating users’ knowledge used in visualization processes, and used a bounded approximation of the KL-divergence in their estimation [KARC17].

In this work, we propose to replace the KL-divergence with a bounded term. We first confirm that boundedness is a necessary property. We then use visual analysis to compare a number of bounded measures, which include the Jensen-Shannon (JS) divergence [Lin91] and a new divergence measure, $D^k_{\text{new}}$, formulated as part of this work. Based on our multi-criteria analysis, we narrow down our selections to the three most logical and intuitive options. We then apply the selected divergence measures, in conjunction with the revised cost-benefit measure, to the real world data collected in two case studies. The numerical calculation in the application further informs us about the relative merits of the selected measures, which enables us to make the final selection while demonstrating its uses in practical scenarios.

## 2 Related Work

Claude Shannon’s landmark article in 1948 [Sha48] signifies the birth of information theory. It has been underpinning the fields of data communication, compression, and encryption since. As a mathematical framework, information theory provides a collection of useful measures, many of which, such as Shannon entropy [Sha48], cross entropy [CT06], mutual information [CT06], and Kullback-Leibler divergence [KL51] are widely used in applications such as physics, biology, neurology, psychology, and computer science (e.g., visualization, computer graphics, computer vision, data mining, and machine learning). In this work, we will also consider Jensen-Shannon divergence [Lin91] in detail.

Information theory has been used extensively in visualization [CFV16]. The theory has enabled many applications in visualization, including scene and shape complexity analysis by Feixas et al. [FdBS99] and Rigau et al. [RFS05], light source placement by Gumhold [Gum02], view selection in mesh rendering by Vázquez et al. [VFSH04] and Feixas et al. [FSG09], attribute selection by Ng and Martin [NM04], view selection in volume rendering by Bordoloi and Shen [BS05], and Takahashi and Takeshima [TT05], multi-resolution volume visualization by Wang and Shen [WS05], focus of attention in volume rendering by Viola et al. [VFSG06], feature highlighting by Jänicke and Scheuermann [JWSK07, JS10], and Wang et al. [WYM08], transfer function design by Bruckner and Möller [BM10], and Ruiz et al. [RBB11, BRB13b], multimodal data fusion by Bramon et al. [BBB12], isosurface evaluation by Wei et al. [WLS13], measuring of observation capacity by Bramon et al. [BRB13a], measuring information content by Biswas et al. [BDSW13], proving the correctness of “overview first, zoom, details-on-demand” by Chen and Jänicke [CJ10] and Chen et al. [CFV16], confirming visual multiplexing by Chen et al. [CWB14].

Ward first suggested that information theory might be an underpinning theory for visualization [PAJKW08]. Chen and Jänicke [CJ10] outlined an information-theoretic framework for visualization, and it was further enriched by Xu et al. [XLS10] and Wang and Shen [WS11] in the context of scientific visualization. Chen and Golan proposed an information-theoretic measure for analyzing the cost-benefit of visualization processes and visual analytics workflows [CG16]. It was used to frame an observation study showing that human developers usually entered a huge amount of knowledge into a machine learning model [TKC17]. It motivated an empirical study confirming that knowledge could be detected and measured quantitatively via controlled experiments [KARC17]. It was used to analyze the cost-benefit of different virtual reality applications [CGJM19]. It formed the basis of a systematic methodology for improving the cost-benefit of visual analytics workflows [CE19]. This work continues the path of theoretical developments in visualization [CGJ17], and is intended to improve the original cost-benefit formula [CG16], in order to make it more intuitive in practical applications.

## 3 Overview, Motivation, and Problem Statement

Visualization is useful in most data intelligence workflows, but it is not universally true because the effectiveness of visualization is usually data-, user-, and task-dependent. The cost-benefit ratio proposed by Chen and Golan [CG16] captures some essence of such dependency. Below is the qualitative expression of the measure:

$$
\frac{\text{Benefit}}{\text{Cost}} = \frac{\text{Alphabet Compression} - \text{Potential Distortion}}{\text{Cost}} \tag{1}
$$

Consider the scenario of viewing some data through a particular visual representation. The term Alphabet Compression (AC) measures the amount of information loss due to visual abstraction [VCI20]. Since the visual representation is fixed in the scenario, AC is thus largely data-dependent. AC is a positive measure reflecting the fact that visual abstraction must be useful in many cases though it may result in information loss. This apparently counter-intuitive term is essential for asserting why visualization is useful. (Note that the term also helps assert the usefulness of statistics, algorithms, and interaction since they all usually cause information loss [CE19].)

The positive implication of the term AC is counterbalanced by the term Potential Distortion, while both are moderated by the term Cost. The term Cost encompasses all costs of the visualization process, including computational costs (e.g., visual mapping and rendering), cognitive costs (e.g., cognitive load), and consequential costs (e.g., impact of errors). The measure of cost (e.g., in terms of energy, time, or money) is thus data-, user-, and task-dependent.

The term Potential Distortion (PD) measures the informative divergence between viewing the data through visualization with information loss and viewing the data without any information loss. The latter might be ideal but is usually at an unattainable cost except for values in a very small data space (i.e., in a small alphabet as discussed in [CG16]). PD is data-dependent or user-dependent. Given the same data visualization with the same amount of information loss, one can postulate that a user with more knowledge about the data or visual representation usually suffers less distortion. This postulation is the main focus of this paper.

Consider the visual representation of a network of arteries in Figure 1. The image was generated from a volume dataset using the maximum intensity projection (MIP) method. While it is known that MIP cannot convey depth information well, it has been widely used for observing some classes of medical imaging data, such as arteries. The highlighted area in Figure 1 shows an apparently flat area, which is a distortion from the actuality of a tubular surface likely with some small wrinkles and bumps. The doctors who deal with such medical data are expected to have sufficient knowledge to reconstruct the reality adequately from the “distorted” visualization, while being able to focus on the more important task of making diagnostic decisions, e.g., about aneurysm.

As shown in some recent works, it is possible for visualization designers to estimate AC, PD, and Cost qualitatively [CGJM19, CE19] and quantitatively [TKC17, KARC17]. It is highly desirable to advance the scientific methods for quantitative estimation, towards the eventual realization of computer-assisted analysis and optimization in designing visual representations. This work focuses on one challenge of quantitative estimation, i.e., how to estimate human knowledge that may be used in a visualization process.

Building on the methods of observational estimation in [TKC17] and controlled experiment in [KARC17], one may reasonably anticipate a systematic method based on a short interview by asking potential viewers a few questions. For example, one may use the question in Figure 1 to estimate the knowledge of doctors, patients, and any other people who may view such a visualization. The question is intended to tease out two pieces of knowledge that may help reduce the potential distortion due to the “flat area” depiction. One piece is about the general knowledge that associates arteries with tube-like shapes. Another, which is more advanced, is about the surface texture of arteries and the limitations of the MIP method.

Let the binary options about whether the “flat area” is actually flat or curved be an alphabet . The likelihood of the two options is represented by a probability distribution or probability mass function (PMF) , where . Since most arteries in the real world are of tubular shapes, one can imagine that a ground truth alphabet might have a PMF strongly in favor of the curved option. However, the visualization seems to suggest the opposite, implying a PMF strongly in favor of the flat option. It is not difficult to interview some potential viewers, enquiring how they would answer the question. One may estimate a PMF from doctors’ answers, and another from patients’ answers.

Table 1 shows two scenarios where different probability data is obtained. The values of PD are computed using the most well-known divergence measure, KL-divergence [KL51], and are in bits. In Scenario 1, without any knowledge, the visualization process would suffer 6.50 bits of PD. As doctors are not fooled by the “flat area” shown in the MIP visualization, their knowledge is worth 6.50 bits. Meanwhile, patients would suffer 1.12 bits of PD on average, so their knowledge is worth about 5.38 (= 6.50 − 1.12) bits.

In Scenario 2, the PMFs of and depart further away, while and remain the same. Although doctors and patients would suffer more PD, their knowledge is worth more than that in Scenario 1 (i.e., bits and bits respectively).

Similarly, the binary options about whether the “flat area” is actually smooth or not can be defined by an alphabet . Table 2 shows two scenarios about collected probability data. In these two scenarios, doctors exhibit much more knowledge than patients, indicating that the surface texture of arteries is of specialized knowledge.

The above example demonstrates that using the KL-divergence to estimate PD can differentiate the knowledge variation between doctors and patients regarding the two pieces of knowledge that may reduce the distortion due to the “flat area”. When it is used in Eq. 1 in a relative or qualitative context (e.g., [CGJM19, CE19]), the unboundedness of the KL-divergence does not pose an issue.

However, this does become an issue when the KL-divergence is used to measure PD in an absolute and quantitative context. From the diverging PMFs in Table 1 and in Table 2, we can observe that the smaller the smallest probability value $\epsilon$ in a PMF is, the more divergent the two PMFs become and the higher value the PD has. Indeed, consider an arbitrary two-letter alphabet $Z = \{z_1, z_2\}$, and two PMFs defined upon $Z$: $P = \{\epsilon, 1-\epsilon\}$ and $Q = \{1-\epsilon, \epsilon\}$. When $\epsilon \to 0$, we have the KL-divergence $D_{KL}(P\|Q) \to \infty$.

Meanwhile, the Shannon entropy of $Z$, $H(Z)$, has an upper bound of 1 bit. It is thus not intuitive or practical to relate the value of $D_{KL}(P\|Q)$ to that of $H(Z)$. Many applications of information theory do not relate these two types of values explicitly. When reasoning about such relations is required, the common approach is to impose a lower-bound threshold for $\epsilon$ (e.g., [KARC17]). However, there is not yet a consistent method for defining such a threshold for various alphabets in different applications, while preventing a range of small or large values (i.e., $[0, \epsilon)$ or $(1-\epsilon, 1]$) in a PMF is often inconvenient in practice. In the following section, we discuss several approaches to defining a bounded measure for PD.
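The unboundedness can be illustrated numerically. The following is a minimal sketch, assuming a two-letter alphabet with two mirrored PMFs of the form $\{\epsilon, 1-\epsilon\}$ and $\{1-\epsilon, \epsilon\}$; the function name `kl_divergence` and the chosen values of $\epsilon$ are illustrative assumptions, not taken from the tables.

```python
import math


def kl_divergence(p, q):
    """D_KL(P||Q) in bits. Terms with p_i == 0 contribute 0.

    Assumes q_i > 0 wherever p_i > 0 (otherwise the divergence is infinite).
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical mirrored PMFs on a two-letter alphabet.
for eps in (1e-1, 1e-3, 1e-6, 1e-9):
    p = [eps, 1.0 - eps]
    q = [1.0 - eps, eps]
    print(f"eps={eps:g}  D_KL(P||Q)={kl_divergence(p, q):.3f} bits")
```

The printed divergence keeps growing as `eps` shrinks, while the Shannon entropy of any two-letter alphabet never exceeds 1 bit.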

Note: for an information-theoretic measure, we use an alphabet and its PMF interchangeably, e.g., $H(P(Z)) = H(P) = H(Z)$.

## 4 Bounded Measures for Potential Distortion (PD)

Let a process in a data intelligence workflow have an input alphabet $Z_i$ and an output alphabet $Z_{i+1}$. The process can be a human-centric process (e.g., visualization and interaction) or a machine-centric process (e.g., statistics and algorithms). In the original proposal [CG16], the value of Benefit in Eq. 1 is measured using:

$$
\text{Benefit} = \text{AC} - \text{PD} = H(Z_i) - H(Z_{i+1}) - D_{KL}(Z'_i \| Z_i) \tag{2}
$$

where $H()$ is the Shannon entropy of an alphabet and $D_{KL}()$ is the KL-divergence of an alphabet from a reference alphabet. Because the Shannon entropy of an alphabet with a finite number of letters is bounded, AC, which is the entropic difference between the input and output alphabets, is also bounded. On the other hand, as discussed in the previous section, PD is unbounded. Although Eq. 2 can be used for relative comparison, it is not quite intuitive in an absolute context, and it is difficult to imagine that the amount of informative distortion can be more than the maximum amount of information available.

In this section, we present the unpublished work by Chen and Sbert [CS19], which shows mathematically that for alphabets of a finite size, the KL-divergence used in Eq. 2 should ideally be bounded. In their arXiv report, they also outlined a new divergence metric and compared it with a few other bounded divergence measures. Building on the initial comparison in [CS19], we use visualization in Section 4.2 and real world data in Section 5 to assist the multi-criteria analysis and selection of a bounded divergence measure to replace the KL-divergence used in Eq. 2.

### 4.1 A Mathematical Proof of Boundedness

Let $Z$ be an alphabet with a finite number of letters, $\{z_1, z_2, \ldots, z_n\}$, and let $Z$ be associated with a PMF, $Q$, such that:

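$$
\begin{aligned}
q(z_n) &= \epsilon, \quad (\text{where } 0 < \epsilon < 2^{-(n-1)}),\\
q(z_{n-1}) &= (1-\epsilon)\,2^{-(n-1)},\\
q(z_{n-2}) &= (1-\epsilon)\,2^{-(n-2)},\\
&\;\;\vdots\\
q(z_2) &= (1-\epsilon)\,2^{-2},\\
q(z_1) &= (1-\epsilon)\,2^{-1} + (1-\epsilon)\,2^{-(n-1)}.
\end{aligned}
\tag{3}
$$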

When we encode this alphabet using an entropy binary coding scheme [Mos12], we can be assured to achieve an optimal code with the lowest average length for codewords. One example of such a code for the above probability is:

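$$
\begin{aligned}
z_1 &: 0, \quad z_2 : 10, \quad z_3 : 110, \quad \cdots\\
z_{n-1} &: 111\ldots10 \quad (\text{with } n-2 \text{ ``1''s and one ``0''})\\
z_n &: 111\ldots11 \quad (\text{with } n-1 \text{ ``1''s and no ``0''})
\end{aligned}
\tag{4}
$$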

In this way, $z_n$, which has the smallest probability, will always be assigned a codeword with the maximal length of $n-1$. Entropy coding is designed to minimize the average number of bits per letter when one transmits a “very long” sequence of letters in the alphabet over a communication channel. Here the phrase “very long” implies that the string exhibits the above PMF (Eq. 3).
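The construction in Eqs. 3 and 4 can be checked with a short script. This is a minimal sketch under the definitions above; the helper names `pmf_eq3` and `code_eq4` and the example values of $n$ and $\epsilon$ are illustrative assumptions.

```python
def pmf_eq3(n, eps):
    """Return [q(z_1), ..., q(z_n)] following Eq. 3."""
    if n < 2 or not (0 < eps < 2 ** -(n - 1)):
        raise ValueError("need n >= 2 and 0 < eps < 2^-(n-1)")
    q = [0.0] * n
    q[n - 1] = eps                        # q(z_n)
    for i in range(2, n):                 # q(z_2) ... q(z_{n-1})
        q[i - 1] = (1 - eps) * 2 ** -i
    q[0] = (1 - eps) * 2 ** -1 + (1 - eps) * 2 ** -(n - 1)  # q(z_1)
    return q


def code_eq4(n):
    """Return the codewords of Eq. 4 for z_1, ..., z_n."""
    codes = ["1" * (i - 1) + "0" for i in range(1, n)]  # z_1 ... z_{n-1}
    codes.append("1" * (n - 1))                          # z_n
    return codes


n, eps = 5, 0.01  # hypothetical example values
q = pmf_eq3(n, eps)
codes = code_eq4(n)
for i, (qi, c) in enumerate(zip(q, codes), start=1):
    print(f"z_{i}: q={qi:.5f}  code={c}")
print("sum of q:", sum(q))
```

The probabilities sum to 1, and the least probable letter $z_n$ receives a codeword of length $n-1$.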

Suppose that $Z$ is actually of PMF $P$, but is encoded as Eq. 4 based on $Q$. The transmission of $Z$ using this code will incur inefficiency. The inefficiency is usually measured using cross entropy $H_{CE}(P, Q)$, such that:

$$
H_{CE}(P, Q) = H(P) + D_{KL}(P \| Q) \tag{5}
$$

Clearly, the worst case is that the letter, $z_n$, which was encoded using $n-1$ bits, turns out to be the most frequently used letter in $P$ (instead of the least in $Q$). It is so frequent that all letters in the long string are $z_n$. So the average codeword length per letter of this string is $n-1$. The situation cannot be worse. Therefore, $n-1$ is the upper bound of the cross entropy. From Eq. 5, we can also observe that $D_{KL}(P\|Q)$ must also be bounded since $H_{CE}(P, Q)$ and $H(P)$ are both bounded as long as $Z$ has a finite number of letters. Let $\top_{CE}$ be the upper bound of $H_{CE}(P, Q)$. The upper bound for $D_{KL}(P\|Q)$, $\top_{KL}$, is thus:

$$
D_{KL}(P \| Q) = H_{CE}(P, Q) - H(P) \le \top_{CE} - \min_{\forall P(Z)} \bigl( H(P) \bigr) \tag{6}
$$

There is a special case worth noting. In practice, it is common to assume that $Q$ is a uniform distribution, i.e., $q_i = 1/n$ for all $i$, typically because $Q$ is unknown or varies frequently. Hence the assumption leads to a code with an average length equaling $\log_2 n$ (or in practice, the smallest integer $\ge \log_2 n$). Under this special (but rather common) condition, all letters in a very long string have codewords of the same length. The worst case is that all letters in the string turn out to be the same letter. Since there is no informative variation in the PMF $P$ for this very long string, i.e., $H(P) = 0$, in principle, the transmission of this string is unnecessary. The maximal amount of inefficiency is thus $\log_2 n$. This is indeed much lower than the upper bound $\top_{CE} = n - 1$, justifying the assumption or use of a uniform $Q$ in many situations.
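The two worst cases discussed above can be compared numerically. The sketch below uses the average codeword length per letter as the measure of transmission cost, and assumes the code lengths of Eq. 4 and a fixed-length code for the uniform case; the function names and the choice of $n = 16$ are illustrative.

```python
import math


def average_code_length(p, code_lengths):
    """Average number of bits per letter when letters follow PMF p."""
    return sum(pi * li for pi, li in zip(p, code_lengths))


def eq4_code_lengths(n):
    """Codeword lengths of Eq. 4: z_i has length i for i < n, z_n has n-1."""
    return list(range(1, n)) + [n - 1]


def fixed_code_length(n):
    """Smallest integer number of bits that can index n letters."""
    return math.ceil(math.log2(n))


n = 16  # hypothetical alphabet size
worst_p = [0.0] * (n - 1) + [1.0]  # every letter in the string is z_n

skewed = average_code_length(worst_p, eq4_code_lengths(n))
uniform = average_code_length(worst_p, [fixed_code_length(n)] * n)
print(f"worst case with the Eq. 4 code: {skewed} bits per letter")
print(f"worst case with a fixed-length code: {uniform} bits per letter")
```

For $n = 16$, the mismatched entropy code costs 15 bits per letter in the worst case, whereas the code built for a uniform PMF costs 4 bits per letter.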

### 4.2 Bounded Measures and Their Visual Analysis

While numerical approximation may provide a bounded KL-divergence, it is not easy to determine the value of the threshold $\epsilon$, and it is difficult to ensure that everyone uses the same $\epsilon$ for the same alphabet or comparable alphabets. It is therefore desirable to consider bounded measures that may be used in place of $D_{KL}$.

Jensen-Shannon divergence is such a measure:

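$$
D_{JS}(P \| Q) = \frac{1}{2}\bigl( D_{KL}(P \| M) + D_{KL}(Q \| M) \bigr) = D_{JS}(Q \| P) = \frac{1}{2} \sum_{i=1}^{n} \left( p_i \log_2 \frac{2 p_i}{p_i + q_i} + q_i \log_2 \frac{2 q_i}{p_i + q_i} \right) \tag{7}
$$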

where $P$ and $Q$ are two PMFs associated with the same alphabet $Z$ and $M$ is the average distribution of $P$ and $Q$. With the base 2 logarithm as in Eq. 7, $D_{JS}$ is bounded by 0 and 1.
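Eq. 7 translates directly into code. The following is a minimal sketch using only the standard library; the function name `js_divergence` and the example PMFs are illustrative assumptions.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two PMFs, as in Eq. 7."""
    total = 0.0
    for pi, qi in zip(p, q):
        s = pi + qi
        if pi > 0:
            total += pi * math.log2(2 * pi / s)
        if qi > 0:
            total += qi * math.log2(2 * qi / s)
    return 0.5 * total


print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # maximally divergent PMFs
print(js_divergence([0.3, 0.7], [0.3, 0.7]))  # identical PMFs
print(js_divergence([0.9, 0.1], [0.5, 0.5]))
```

Terms with a zero probability are skipped, following the usual convention that $0 \log 0 = 0$.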

Another bounded measure is the conditional entropy $H(P|Q)$:

$$
H(P|Q) = H(P) - I(P; Q) = H(P) - \sum_{i=1}^{n} \sum_{j=1}^{n} r_{i,j} \log_2 \frac{r_{i,j}}{p_i q_j} \tag{8}
$$

where $r_{i,j}$ is the joint probability of the two conditions of $z_i, z_j \in Z$ that are associated with $P$ and $Q$. $H(P|Q)$ is bounded by 0 and $H(P)$.
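Unlike the other measures, Eq. 8 needs a joint distribution in addition to the two marginal PMFs. The sketch below assumes the joint probabilities are given as a hypothetical $n \times n$ nested list `r`, from which $P$ and $Q$ are obtained as row and column sums.

```python
import math


def conditional_entropy(r):
    """H(P|Q) = H(P) - I(P;Q) in bits, from a joint probability table r.

    r[i][j] is the joint probability r_{i,j}; P is the row marginal and
    Q is the column marginal.
    """
    p = [sum(row) for row in r]
    q = [sum(col) for col in zip(*r)]
    h_p = -sum(pi * math.log2(pi) for pi in p if pi > 0)
    mutual_info = sum(
        rij * math.log2(rij / (p[i] * q[j]))
        for i, row in enumerate(r)
        for j, rij in enumerate(row)
        if rij > 0
    )
    return h_p - mutual_info


independent = [[0.25, 0.25], [0.25, 0.25]]  # hypothetical joint PMFs
dependent = [[0.5, 0.0], [0.0, 0.5]]
print(conditional_entropy(independent))  # equals H(P) = 1 bit
print(conditional_entropy(dependent))    # equals 0 bits
```

The two examples reach the two bounds: $H(P)$ when the conditions are independent, and 0 when one fully determines the other.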

The third bounded measure, proposed as part of this work, is referred to as $D^k_{\text{new}}$ and is defined as follows:

$$
D^k_{\text{new}}(P \| Q) = \frac{1}{2} \sum_{i=1}^{n} (p_i + q_i) \log_2 \bigl( |p_i - q_i|^k + 1 \bigr) \tag{9}
$$

where $k > 0$. $D^k_{\text{new}}$ is bounded by 0 and 1.

In this work, we focus on two options of $D^k_{\text{new}}$, each with a different value of $k$. Since the KL-divergence is non-commutative, we can also have a non-commutative version of $D^k_{\text{new}}$, i.e.,

$$
D^k_{\text{ncm}}(P \| Q) = \sum_{i=1}^{n} p_i \log_2 \bigl( |p_i - q_i|^k + 1 \bigr) \tag{10}
$$
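Eqs. 9 and 10 can be sketched in a few lines. The function names `d_new` and `d_ncm`, the example PMFs, and the values of $k$ used below are illustrative assumptions rather than the settings evaluated in the paper.

```python
import math


def d_new(p, q, k):
    """Commutative measure of Eq. 9 (base 2 logarithm, k > 0)."""
    return 0.5 * sum(
        (pi + qi) * math.log2(abs(pi - qi) ** k + 1) for pi, qi in zip(p, q)
    )


def d_ncm(p, q, k):
    """Non-commutative measure of Eq. 10 (base 2 logarithm, k > 0)."""
    return sum(pi * math.log2(abs(pi - qi) ** k + 1) for pi, qi in zip(p, q))


p = [0.7, 0.2, 0.1]  # hypothetical PMFs on a three-letter alphabet
q = [0.4, 0.3, 0.3]
for k in (1, 2):
    print(f"k={k}: d_new={d_new(p, q, k):.4f}  "
          f"d_ncm(P||Q)={d_ncm(p, q, k):.4f}  d_ncm(Q||P)={d_ncm(q, p, k):.4f}")
print(d_new([1.0, 0.0], [0.0, 1.0], 2))  # maximally divergent PMFs
```

Swapping the arguments leaves `d_new` unchanged but changes `d_ncm` for these three-letter PMFs. For a two-letter alphabet the absolute differences $|p_i - q_i|$ are the same for both letters, so the two measures coincide.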

As $D_{JS}$, $D^k_{\text{new}}$, and $D^k_{\text{ncm}}$ are bounded by [0, 1], if any of them is selected to replace $D_{KL}$, Eq. 2 can be rewritten as

$$
\text{Benefit} = H(Z_i) - H(Z_{i+1}) - H_{\max}(Z_i)\, D(Z'_i \| Z_i) \tag{11}
$$

where $H_{\max}$ denotes maximum entropy, while $D$ is a placeholder for $D_{JS}$, $D^k_{\text{new}}$, or $D^k_{\text{ncm}}$.
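A minimal sketch of Eq. 11 follows. It assumes the bounded divergence value has already been computed by one of the measures above and is passed in as a number in [0, 1]; the function names and the example numbers are illustrative.

```python
import math


def shannon_entropy(pmf):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in pmf if x > 0)


def benefit(pmf_in, pmf_out, divergence):
    """Benefit of Eq. 11.

    pmf_in and pmf_out are the PMFs of the input and output alphabets;
    divergence is a bounded value in [0, 1]. The maximum entropy of the
    input alphabet is log2 of its number of letters.
    """
    if not 0.0 <= divergence <= 1.0:
        raise ValueError("divergence must be in [0, 1]")
    h_max = math.log2(len(pmf_in))
    return shannon_entropy(pmf_in) - shannon_entropy(pmf_out) - h_max * divergence


# Hypothetical example: a uniform four-letter input reduced to a single
# certain output, with a divergence of 0.25.
print(benefit([0.25] * 4, [1.0], 0.25))
```

Because the divergence is scaled by the maximum entropy of the input alphabet, the potential distortion can never exceed the maximum amount of information that the alphabet can carry.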

The four measures in Eqs. 7, 8, 9, 10 all consist of logarithmic scaling of probability values, in the same form of Shannon entropy. They are entropic measures. In addition, we also considered a set of non-entropic measures in the form of Minkowski distances, which have the following general form:

$$
D^k_M(P, Q) = \sqrt[k]{\sum_{i=1}^{n} |p_i - q_i|^k} \quad (k > 0) \tag{12}
$$
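For comparison with the entropic measures, Eq. 12 can be sketched as follows; the function name `minkowski_distance` and the example PMFs are illustrative assumptions.

```python
def minkowski_distance(p, q, k):
    """Minkowski distance of order k between two PMFs, as in Eq. 12."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(abs(pi - qi) ** k for pi, qi in zip(p, q)) ** (1.0 / k)


p = [1.0, 0.0]  # hypothetical, maximally divergent PMFs
q = [0.0, 1.0]
print(minkowski_distance(p, q, 1))  # Manhattan distance
print(minkowski_distance(p, q, 2))  # Euclidean distance
```

Note that the range of values depends on $k$: for these two PMFs the distance is 2 when $k = 1$ and $\sqrt{2}$ when $k = 2$.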

To evaluate the suitability of the above measures, we can first consider three criteria. It is essential for the selected divergence measure to be bounded. Otherwise we can just use the KL-divergence. Another important criterion is the number of PMFs that the measure depends on. While all other measures considered depend on two PMFs, the conditional entropy $H(P|Q)$ depends on three. Because it requires some effort to obtain a PMF, especially a joint probability distribution, this makes $H(P|Q)$ less favourable. In addition, we also prefer to have an entropic measure as it is more compatible with the measure of alphabet compression. With these three criteria, we can start our multi-criteria analysis as summarized in Table 3, where we score each divergence measure against a criterion using an integer between 0 and 5, with 5 being the best. We will draw our conclusion about the multi-criteria analysis in Section 6.

We now consider several criteria using visualization. One desirable property is for a bounded measure to have a geometric behaviour similar to the KL-divergence. Since the KL-divergence is unbounded, we make use of a scaled version, , which does not rise up too quickly, though it is still unbounded.

Let us consider a simple alphabet , which is associated with two PMFs, and . We set , such that when , is most divergent away from . We can visualize how different measures numerically convey the divergence between and by observing their relationship with . Figure 2 compares several measures by varying the values of in the range of .

From Figure 2, we can observe that has almost a perfect match when , while is also fairly close. They thus score 5 and 4 respectively in Table 3. Meanwhile, the lines of curve in the opposite direction of . We score it 1. and are of similar shapes, with correlating with slightly better. We thus score 2 and 3. Note that for the above PMFs and , has the same curves as . Hence has the same score as in Table 3. With scored poorly, we focus on the other candidate measures in the rest of the analysis.

We now consider Figure 3, where the candidate measures are visualized in comparison with and in a range close to zero, i.e., . The ranges and are there only for references to the nearby contexts as they do not have the same logarithmic scale as that in the range . We can observe that in the curve of rises as almost quickly as . This confirms that simply scaling the KL-divergence is not an adequate solution. The curves of and converge to their maximum value 1.0 earlier than that of . If the curve of is used as a benchmark as in Figure 2, the curve of is closer to than that of . We thus score 5, , 4, 3, 3, and 2. Since we use the same PMFs and as in Figure 2, has the same curves and thus the same score as .

Let us consider a few numerical examples that may represent some practical scenarios. We use these scenarios to see if the values returned by different divergence measures make sense. Let be an alphabet with two letters, good and bad, for describing a scenario (e.g., an object or an event), which has the probability of good is , and that of bad is . In other words, . Imagine that a biased process (e.g., a distorted visualization, an incorrect algorithm, or a misleading communication) conveys the information about the scenario always bad, i.e., a PMF . Users at the receiving end of the process may have different knowledge about the actual scenario, and they will make a decision after receiving the output of the process. For example, we have five users and we have obtained the probability of their decisions as follows:

• LD — The user has a little doubt about the output of the process, and decides bad 90% of the time, and good 10% of the time, i.e., with PMF .

• FD — The user has a fair amount of doubt, with .

• RG — The user makes a random guess, with .

• UC — The user has adequate knowledge about , but under-compensates for it slightly, with .

• OC — The user has adequate knowledge about , but over-compensates for it slightly, with .

We can use different candidate measures to compute the divergence between and . Figure 4 shows different divergence values returned by these measures. Each value is decomposed into two parts, one for good and one for bad. All these measures can order these five users reasonably well. The users UC (under-compensate) and OC (over-compensate) have the same values with and , while considers OC has slightly more divergence than UC (0.014 vs. 0.010). returns relatively low values than other measures. For UC and OC, , , and return small values , which are a bit difficult to estimate.

and show strong asymmetric patterns between good and bad, reflecting the probability values in . In other words, the more decisions on good, the more good-related divergence. This asymmetric pattern is not in anyway incorrect, as the KL-divergence is also non-commutative and would also produce much stronger asymmetric patterns. Meanwhile an argument for supporting commutative measures would point out that the higher probability of good in should also influence the balance between the good-related divergence.

We decide to score 3 because of its lower valuation and its non-equal comparison of UC and OC. We score and 5; and and 4 as the values returned by and are slightly more intuitive.

We now consider a slightly more complicated scenario with four pieces of data, A, B, C, and D, which can be defined as an alphabet with four letters. The ground truth PMF is . Consider two processes that combine these into two classes AB and CD. These typify clustering algorithms, downsampling processes, discretization in visual mapping, and so on. One process is considered to be correct, which has a PMF for AB and CD as , and another biased process with . Let CG, CU, and CB be three users at the receiving end of the correct process, and BG, BS, and BM be three other users at the receiving end of the biased process. The users with different knowledge exhibit different abilities to reconstruct the original scenario featuring A, B, C, D from aggregated information about AB and CD. Similar to the good-bad scenario, such abilities can be captured by a PMF . For example, we have:

• CG makes random guess, .

• CU has useful knowledge, .

• CB is highly biased, .

• BG makes guess based on , .

• BS makes a small adjustment, .

• BM makes a major adjustment, .

Figure 5 compares the divergence values returned by the candidate measures for these six users. We can observe that and return values , which seem to be less intuitive. Meanwhile shows a large portion of divergence from the AB category, while shows more divergence in the CD category. In particular, for user BG, does not show any divergence in relation to A and B, though BG clearly has reasoned A and B rather incorrectly. shows a relatively balanced account of divergence associated with A, B, C, and D. On balance, we give scores 5, 4, 3, 2, 1 to , , , , and respectively.

With the major shortcomings of in this scenario, we can now focus on three commutative measures and in conjunction with two case studies.

## 5 Case Studies

To complement the visual analysis in Section 4.2, we conducted two surveys to collect some realistic examples that feature the use of knowledge in visualization. In addition to supporting the selection of a bounded measure for potential distortion, the surveys were also designed to demonstrate that one could use a few simple questions to estimate the cost-benefit of visualization in relation to individual users. Built on the visual analysis in the previous section, we focus on three divergence measures, namely the JS divergence and two versions of the new divergence, i.e., with and . We denote as , and as .

### 5.1 Volume Visualization

This survey, which involved ten surveyees, was designed to collect some real-world data that reflects the use of knowledge in viewing volume visualization images. The full set of questions was presented to surveyees in the form of slides, which are included in the supplementary materials. The full set of survey results is given in Appendix C. The featured volume datasets were from “The Volume Library” [Roe19], and visualization images were either rendered by the authors or from one of the four publications [NSW02, CSC06, WQ07, Jun19].

The transformation from a volumetric dataset to a volume-rendered image typically features a noticeable amount of alphabet compression. Some major algorithmic functions in volume visualization, e.g., iso-surfacing, transfer function, and rendering integral, all facilitate alphabet compression, hence information loss.

In terms of rendering integral, maximum intensity projection (MIP) incurs a huge amount of information loss in comparison with the commonly-used emission-and-absorption integral [MC10]. As shown in Figure 1, the surfaces of arteries are depicted more or less in the same color. The accompanying question intends to tease out two pieces of knowledge, “curved surface” and “with wrinkles and bumps”. Among the ten surveyees, one selected the correct answer B, while seven selected the relatively plausible answer A and one selected the less plausible answer D.

Let alphabet contain the four optional answers. One may assume a ground truth PMF since there might still be a small probability for a section of artery to be flat or smooth. The rendered image depicts a misleading impression, implying that answer C is correct or a false PMF . The amount of alphabet compression is thus .

When a surveyee gives an answer to the question, it can also be considered as a PMF . Different answers thus lead to different values of divergence as follows:

 A: B: C: D:

Without any knowledge, a surveyee would select answer C, leading to the highest value of divergence in terms of any of the three measures. Based PMF , we expect to have divergence values in the order of C > A > D B. Both and have produced values in that order, while indicates an order A > D > C B, which cannot be interpreted easily. We thus score 1 in Table 3, and leave it out in the following discussions.

Together with the alphabet compression and the maximum entropy of 2 bits, we can also calculate the informative benefit using Eq. 11. For surveyees with different answers, the lossy depiction of the surface of arteries brought about different amounts of benefit:

with $D_{JS}$: A: −0.889, B: 0.500, C: −1.351, D: −1.230

with $D_2$: A: −1.038, B: 0.586, C: −1.097, D: −1.088

The two sets of values both indicate that only those surveyees who gave answer B would benefit from such lossy depiction produced by MIP. One may also consider the scenarios where flat or smooth surfaces are more probable. For example, if the ground truth PMF were and , the amounts of benefit would be:

with $D_{JS}$: A: 0.480, B: 0.951, C: −0.337, D: −0.049

with $D_2$: A: 0.487, B: 1.044, C: 0.212, D: 0.257

Because the ground truth PMF would be less certain, the knowledge of “curved surface” and “with wrinkles and bumps” would become more useful. Further, because the probability of flat and smooth surfaces would have also increased, an answer C would not be as bad as when it is with the original PMF .

The above example of MIP rendering shows that to those users with the appropriate knowledge, the missing information in a visualization image is not really “lost”. Using the categorization of visual multiplexing [CWB14], the information about “curved surface” and “with wrinkles and bumps” is conveyed using a hollow visual channel. Volume visualization features some other forms of visual multiplexing. The viewers’ ability to de-multiplex depends on their knowledge, which can now be estimated quantitatively.

Figure 6 shows another volume-rendered image used in the survey. Two iso-surfaces of a head dataset are depicted with translucent occlusion, which is a type of visual multiplexing [CWB14]. Meanwhile, the voxels for soft tissue and muscle are not depicted at all, which can also be regarded as using a hollow visual channel. The visual representation has been widely used, and the viewers are expected to use their knowledge to infer the 3D relationships between the two iso-surfaces as well as the missing information about soft tissue and muscle. The question that accompanies the figure is for estimating such knowledge.

Although the survey offers only four options, it could in fact offer many other configurations as optional answers. Let us consider four color-coded segments similar to the configurations in answers C and D. Each segment could be one of four types: bone, skin, soft tissue and muscle, or background. There are a total of $4^4 = 256$ configurations. If one had to consider the variation of segment thickness, there would be many more options. Because it would not be appropriate to ask a surveyee to select an answer from 256 options, a typical assumption is that the selected four options are representative. In other words, considering that the 256 options are letters of an alphabet, any unselected letter has a probability similar to one of the four selected options.

For example, we can estimate a ground truth PMF such that among the 256 letters,

• Answer A and four other letters have a probability 0.01,

• Answer B and 64 other letters have a probability 0.0002,

• Answer C and 184 other letters have a probability 0.0001,

• Answer D has a probability 0.9185.

We have the entropy of this alphabet . Similar to the previous example, we can estimate the values of divergence as:

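$$
\begin{array}{ll}
\text{A:} & P = \{1, \ldots_{4}, 0, \ldots_{64}, 0, \ldots_{184}, 0\} \rightarrow D_{\mathrm{JS}} = 0.960,\; D^{2} = 0.903\\
\text{B:} & P = \{0, \ldots_{4}, 1, \ldots_{64}, 0, \ldots_{184}, 0\} \rightarrow D_{\mathrm{JS}} = 0.999,\; D^{2} = 0.905\\
\text{C:} & P = \{0, \ldots_{4}, 0, \ldots_{64}, 1, \ldots_{184}, 0\} \rightarrow D_{\mathrm{JS}} = 0.999,\; D^{2} = 0.905\\
\text{D:} & P = \{0, \ldots_{4}, 0, \ldots_{64}, 0, \ldots_{184}, 1\} \rightarrow D_{\mathrm{JS}} = 0.042,\; D^{2} = 0.009
\end{array}
$$

where $\ldots_{n}$ denotes $n$ zeros. With the maximum entropy being 8 bits, we can estimate the amounts of informative benefit as: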

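$$
\begin{array}{lllll}
\text{with } D_{\mathrm{JS}}, & \text{A: } -6.826, & \text{B: } -7.139, & \text{C: } -7.144, & \text{D: } 0.514\\
\text{with } D^{2}, & \text{A: } -6.374, & \text{B: } -6.392, & \text{C: } -6.392, & \text{D: } 0.777
\end{array}
$$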
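The Jensen-Shannon part of this calculation can be retraced with a short script. The following is a minimal sketch, not the authors' implementation. It assumes the standard Jensen-Shannon definition with base-2 logarithms, a letter ordering in which answer A is followed by its four similar letters, B by its 64, and C by its 184, and that the alphabet compression equals the entropy of the ground truth PMF (i.e., a single selected answer leaves no residual entropy). The names `Q`, `ANSWER_INDEX`, `one_hot`, and `benefit_js` are illustrative only.

```python
import math


def entropy(pmf):
    """Shannon entropy in bits; zero-probability letters contribute nothing."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


def kl(p, q):
    """KL-divergence in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Standard Jensen-Shannon divergence (bounded by 1 bit)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Ground truth PMF over 256 letters (assumed ordering):
# A + 4 others, B + 64 others, C + 184 others, then D.
Q = [0.01] * 5 + [0.0002] * 65 + [0.0001] * 185 + [0.9185]
ANSWER_INDEX = {"A": 0, "B": 5, "C": 70, "D": 255}


def one_hot(index, n=256):
    """PMF of a surveyee who selects exactly one letter."""
    return [1.0 if i == index else 0.0 for i in range(n)]


H_Q = entropy(Q)              # entropy of the alphabet
H_MAX = math.log2(len(Q))     # maximum entropy, 8 bits

js_values = {a: js(one_hot(i), Q) for a, i in ANSWER_INDEX.items()}
# Assumed benefit: alphabet compression minus H_max times the divergence.
benefit_js = {a: H_Q - H_MAX * d for a, d in js_values.items()}

if __name__ == "__main__":
    print(f"H(Q) = {H_Q:.3f} bits")
    for a in "ABCD":
        print(f"{a}: D_JS = {js_values[a]:.3f}, benefit = {benefit_js[a]:.3f}")
```

Under these assumptions the script should reproduce the $D_{\mathrm{JS}}$ row of both listings to about three decimal places. The $D^{2}$ row is not recomputed here because its definition is given earlier in the paper.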

Because both and have returned some sensible values, we give a score of 5 to each of them in Table 3.

### 5.2 London Underground Map

This survey was designed to collect some real-world data that reflects the use of some knowledge in viewing different London underground maps. It involved sixteen surveyees, twelve at King’s College London (KCL) and four at the University of Oxford. Surveyees were interviewed individually in a setup as shown in Figure 7. Each surveyee was asked to answer 12 questions using either map, followed by two further questions about their familiarity with metro systems and with London. A £5 Amazon voucher was offered to each surveyee in appreciation of their effort and time. The survey sheets and the full set of survey results are given in Appendix D.

Harry Beck first introduced the geographically-deformed design of the London underground maps in 1931. Today almost all metro maps around the world adopt this design concept. Information-theoretically, the transformation of a geographically-faithful map to such a geographically-deformed map causes a significant loss of information. Naturally, this affects some tasks more than others.

For example, the distances between stations on a deformed map are not as useful as those on a faithful map. The first four questions in the survey asked surveyees to estimate how long it would take to walk (i) from Charing Cross to Oxford Circus, (ii) from Temple to Leicester Square, (iii) from Stanmore to Edgware, and (iv) from South Ruislip to South Harrow. On the deformed map, the distances between the four pairs of stations are all about 50mm. On the faithful map, the distances are (i) 21mm, (ii) 14mm, (iii) 31mm, and (iv) 53mm respectively. According to Google Maps, the estimated walking distances and times are (i) 0.9 miles, 20 minutes; (ii) 0.8 miles, 17 minutes; (iii) 1.6 miles, 32 minutes; and (iv) 2.2 miles, 45 minutes respectively.

The averages and ranges of the walking-time estimates by the 12 surveyees at KCL are: (i) 19.25 [8, 30], (ii) 19.67 [5, 30], (iii) 46.25 [10, 240], and (iv) 59.17 [20, 120] minutes. The estimates by the four surveyees at Oxford are: (i) 16.25 [15, 20], (ii) 10 [5, 15], (iii) 37.25 [25, 60], and (iv) 33.75 [20, 60] minutes. The values correlate better with the Google estimates than would be implied by the similar distances on the deformed map. Clearly some surveyees were using some knowledge to make better inferences.

Let be an alphabet of integers between 1 and 256. The range is chosen partly to cover the range of the answers in the survey, and partly to round up the maximum entropy to 8 bits. For each pair of stations, we can define a PMF using a skew normal distribution peaked at the Google estimation $\xi$. As an illustration, we coarsely approximate the PMF as , where

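$$
q_i = \begin{cases}
0.01/236 & \text{if } 1 \le i \le \xi - 8 \quad \text{(wild guess)}\\
0.026 & \text{if } \xi - 7 \le i \le \xi - 3 \quad \text{(close)}\\
0.12 & \text{if } \xi - 2 \le i \le \xi + 2 \quad \text{(spot on)}\\
0.026 & \text{if } \xi + 3 \le i \le \xi + 12 \quad \text{(close)}\\
0.01/236 & \text{if } \xi + 13 \le i \le 256 \quad \text{(wild guess)}
\end{cases}
$$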

In the same way as in the previous case study, we can estimate the divergence for an answer in each range, resulting in:

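$$
D_{\mathrm{JS}} = \begin{cases}
0.725 & \text{if spot on}\\
0.913 & \text{if close}\\
1.000 & \text{if wild guess}
\end{cases}
\qquad
D^{2} = \begin{cases}
0.468 & \text{if spot on}\\
0.500 & \text{if close}\\
0.506 & \text{if wild guess}
\end{cases}
$$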

With the entropy of the alphabet as bits and the maximum entropy being 8 bits, we can estimate the amounts of informative benefit for different answers as:

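$$
\begin{array}{llll}
\text{with } D_{\mathrm{JS}}, & \text{spot on: } -1.765, & \text{close: } -3.266, & \text{wild guess: } -3.963\\
\text{with } D^{2}, & \text{spot on: } 0.287, & \text{close: } 0.033, & \text{wild guess: } -0.017
\end{array}
$$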
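The coarse PMF and the resulting benefit values can be checked numerically. The sketch below is illustrative rather than the authors' code. It assumes that $8 \le \xi \le 244$ so that all five ranges fit inside 1 to 256, that the alphabet compression equals the entropy of the PMF, and it takes the divergence values from the listing above as given (rounded to three decimals). The names `walking_time_pmf` and `benefit` are hypothetical.

```python
import math


def walking_time_pmf(xi, n=256):
    """Coarse PMF over the integers 1..n, peaked at the reference time xi.

    Assumes 8 <= xi <= n - 12 so that exactly 236 letters are wild guesses.
    """
    pmf = []
    for i in range(1, n + 1):
        if xi - 2 <= i <= xi + 2:
            pmf.append(0.12)            # spot on (5 letters)
        elif xi - 7 <= i <= xi - 3 or xi + 3 <= i <= xi + 12:
            pmf.append(0.026)           # close (15 letters)
        else:
            pmf.append(0.01 / 236)      # wild guess (236 letters)
    return pmf


def entropy(pmf):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


# Divergence values as tabulated in the text (rounded).
D_JS = {"spot on": 0.725, "close": 0.913, "wild guess": 1.000}
D_2 = {"spot on": 0.468, "close": 0.500, "wild guess": 0.506}

H_MAX = 8.0                              # log2(256)
H_Q = entropy(walking_time_pmf(xi=20))   # xi = 20 minutes, as for pair (i)


def benefit(divergence):
    """Assumed benefit: entropy of the alphabet minus H_max times divergence."""
    return {k: H_Q - H_MAX * d for k, d in divergence.items()}


if __name__ == "__main__":
    print(f"H(Q) = {H_Q:.3f} bits")
    print("with D_JS:", {k: round(v, 3) for k, v in benefit(D_JS).items()})
    print("with D^2: ", {k: round(v, 3) for k, v in benefit(D_2).items()})
```

The entropy does not depend on where the peak $\xi$ sits, only on the shape of the PMF, so the same value applies to all four station pairs. Because the divergences used here are already rounded, the recomputed benefits agree with the listed ones only to within a few thousandths.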

For instance, surveyee P9, who has lived in a city with a metro system for a period of 1-5 years and lived in London for several months, made similarly good estimations about the walking time with both types of underground maps. With one spot on answer and one close answer under each condition, the estimated benefit on average is bits if one uses or bits if one uses . Meanwhile, surveyee P3, who has lived in a city with a metro system for two months, provided all four answers in the wild guess category, leading to negative benefit with both and .

Among the first set of four questions, Questions 1 and 2 are about stations near KCL, and Questions 3 and 4 are about stations more than 10 miles away from KCL. The local knowledge of the surveyees from KCL clearly helped their answers. Among the answers given by the twelve surveyees from KCL,

• For Question 1, four spot on, five close, and three wild guess — the average benefit is with or with .

• For Question 2, two spot on, nine close, and one wild guess — the average benefit is with or with .

• For Question 3, three close, and nine wild guess — the average benefit is with or with .

• For Question 4, two spot on, one close, and nine wild guess — the average benefit is with or with .

From the above calculation, we also notice that tends to produce higher divergence values, and seems a bit “too eager” to give negative benefit values. With the above real world data, produces measures that can be interpreted more intuitively. We therefore give (i.e., ) a 5 score and a 3 score.

When we consider answering each of Questions 1–4 as performing a visualization task, we can estimate the cost-benefit ratio of each process. As the survey also collected the time used by each surveyee in answering each question, the cost in Eq. 1 can be approximated with the mean response time. For Questions 1–4, the mean response times by the surveyees at KCL are 9.27, 9.48, 14.65, and 11.40 seconds respectively. Using the benefit values based on , the cost-benefit ratios are thus 0.0113, 0.0075, -0.0003, and 0.0033 bits/second respectively. While these values indicate the benefits of the local knowledge used in answering Questions 1 and 2, they also indicate that when the local knowledge is absent in the case of Questions 3 and 4, the deformed map (i.e., Question 3) is less cost-beneficial.
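The four ratios can be retraced from the answer counts and mean response times reported above. The following sketch assumes that the ratios were obtained from the $D^{2}$-based benefit values (0.287, 0.033, and −0.017 bits), averaged over the twelve KCL surveyees and divided by the mean response time. This reading is an inference from the numbers, and the variable names are illustrative.

```python
# Benefit per answer category in bits, using the D^2-based values (assumption).
BENEFIT_D2 = {"spot on": 0.287, "close": 0.033, "wild guess": -0.017}

# Answer counts of the twelve KCL surveyees for Questions 1-4.
COUNTS = {
    1: {"spot on": 4, "close": 5, "wild guess": 3},
    2: {"spot on": 2, "close": 9, "wild guess": 1},
    3: {"spot on": 0, "close": 3, "wild guess": 9},
    4: {"spot on": 2, "close": 1, "wild guess": 9},
}

# Mean response times in seconds, used as the cost.
MEAN_TIME = {1: 9.27, 2: 9.48, 3: 14.65, 4: 11.40}


def average_benefit(counts, benefit):
    """Mean benefit over all surveyees who answered a question."""
    total = sum(counts.values())
    return sum(n * benefit[category] for category, n in counts.items()) / total


avg_benefit = {q: average_benefit(c, BENEFIT_D2) for q, c in COUNTS.items()}
ratio = {q: avg_benefit[q] / MEAN_TIME[q] for q in COUNTS}

if __name__ == "__main__":
    for q in sorted(COUNTS):
        print(f"Question {q}: benefit = {avg_benefit[q]:.4f} bits, "
              f"ratio = {ratio[q]:.4f} bits/second")
```

Under this assumption the computed ratios agree with the quoted 0.0113, 0.0075, −0.0003, and 0.0033 bits/second to four decimal places.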

## 6 Conclusions

In this paper, we have considered the need to improve the mathematical formulation of an information-theoretic measure for analyzing the cost-benefit of visualization as well as other processes in a data intelligence workflow [CG16]. The concern about the original measure is its unbounded term based on the KL-divergence. We have obtained a proof that as long as the input and output alphabets of a process have a finite number of letters, the divergence measure used in the cost-benefit formula should be bounded.

We have considered a number of bounded measures to replace the unbounded term, including a new divergence measure and its variation . We have conducted multi-criteria analysis to select the best measure among these candidates. In particular, we have used visualization to aid the observation of different properties of the candidate measures, assisting in the analysis of four criteria. We have conducted two case studies, both in the form of surveys. One consists of questions about volume visualizations, while the other features visualization tasks performed in conjunction with two types of London Underground maps. The case studies allowed us to test some of the most promising candidate measures with the real world data collected in the two surveys, providing important evidence for two aspects of the multi-criteria analysis.

From Table 3, we can observe the process of narrowing down from eight candidate measures to two measures. Taking the importance of the criteria into account, we consider that candidate is slightly ahead of . We therefore propose to revise the original cost-benefit ratio in [CG16] to the following:

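$$
\frac{\text{Benefit}}{\text{Cost}} = \frac{\text{Alphabet Compression} - \text{Potential Distortion}}{\text{Cost}} = \frac{H(Z_i) - H(Z_{i+1}) - H_{\max}(Z_i)\, D^{2}_{\text{new}}(Z'_i \| Z_i)}{\text{Cost}} \tag{13}
$$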

This cost-benefit measure was developed in the field of visualization, for optimizing visualization processes and visual analytics workflows. It is now being improved by using visual analysis and with the survey data collected in the context of visualization applications. We would like to continue our theoretical investigation into the mathematical properties of the new divergence measure. Meanwhile, having a bounded cost-benefit measure offers many new opportunities of using it in practical applications, especially in visualization and visual analytics.


Appendices

## Appendix A Further Details of the Original Cost-Benefit Ratio

This appendix contains an extraction from a previous publication [CE19], which provides a relatively concise but informative description of the cost-benefit ratio proposed in [CG16]. The inclusion is to minimize the readers’ effort to locate such an explanation. The extraction has been slightly modified.

Chen and Golan introduced an information-theoretic metric for measuring the cost-benefit ratio of a visual analytics (VA) workflow or any of its component processes [CG16]. The metric consists of three fundamental measures that are abstract representations of a variety of qualitative and quantitative criteria used in practice, including operational requirements (e.g., accuracy, speed, errors, uncertainty, provenance, automation), analytical capability (e.g., filtering, clustering, classification, summarization), cognitive capabilities (e.g., memorization, learning, context-awareness, confidence), and so on. The abstraction results in a metric with the desirable mathematical simplicity [CG16]. The qualitative form of the metric is as follows:

$$
\frac{\text{Benefit}}{\text{Cost}} = \frac{\text{Alphabet Compression} - \text{Potential Distortion}}{\text{Cost}} \tag{14}
$$

The metric describes the trade-off among the three measures:

• Alphabet Compression (AC) measures the amount of entropy reduction (or information loss) achieved by a process. As noted in [CG16], most visual analytics processes (e.g., statistical aggregation, sorting, clustering, visual mapping, and interaction) feature many-to-one mappings from input to output, hence losing information. Although information loss is commonly regarded as harmful, it cannot be all bad if it is a general trend of VA workflows. Thus the cost-benefit metric makes AC a positive component.

• Potential Distortion (PD) balances the positive nature of AC by measuring the errors typically due to information loss. Instead of measuring mapping errors using some third party metrics, PD measures the potential distortion when one reconstructs inputs from outputs. The measurement takes into account humans’ knowledge that can be used to improve the reconstruction processes. For example, given an average mark of 62%, the teacher who taught the class can normally guess the distribution of the marks among the students better than an arbitrary person.

• Cost (Ct) of the forward transformation from input to output and the inverse transformation of reconstruction provides a further balancing factor in the cost-benefit metric in addition to the trade-off between AC and PD. In practice, one may measure the cost using time or a monetary measurement.

## Appendix B Basic Formulas of Information-Theoretic Measures

This section is included for self-containment. Some readers who have the essential knowledge of probability theory but are unfamiliar with information theory may find these formulas useful.

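Let $Z = \{z_1, z_2, \ldots, z_n\}$ be an alphabet and $z_i$ be one of its letters. $Z$ is associated with a probability distribution or probability mass function (PMF) $P(Z) = \{p_1, p_2, \ldots, p_n\}$ such that $p_i = p(z_i) \ge 0$ and $\sum_{i=1}^{n} p_i = 1$. The Shannon entropy of $Z$ is: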

$$
H(Z) = H(P) = -\sum_{i=1}^{n} p_i \log_2 p_i \quad \text{(unit: bit)}
$$

Here we use the base-2 logarithm because the bit is a more intuitive unit in the context of computer science and data science.

An alphabet may have different PMFs in different conditions. Let $P$ and $Q$ be such PMFs. The KL-divergence describes the difference between the two PMFs in bits:

$$
D_{\mathrm{KL}}(P \| Q) = \sum_{i=1}^{n} p_i \log_2 \frac{p_i}{q_i} \quad \text{(unit: bit)}
$$

$D_{\mathrm{KL}}(P \| Q)$ is referred to as the divergence of $P$ from $Q$. This is not a metric since $D_{\mathrm{KL}}(P \| Q) = D_{\mathrm{KL}}(Q \| P)$ cannot be assured.

Related to the above two measures, Cross Entropy is defined as:

$$
H(P, Q) = H(P) + D_{\mathrm{KL}}(P \| Q) = -\sum_{i=1}^{n} p_i \log_2 q_i \quad \text{(unit: bit)}
$$

Sometimes, one may consider as two alphabets and with the same ordered set of letters but two different PMFs. In such a case, one may denote the KL-divergence as , and the cross entropy as .
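The three formulas in this appendix translate directly into code. The sketch below is a minimal illustration using base-2 logarithms. The example PMFs `P` and `Q` are made up for demonstration and do not come from the surveys. It also shows the two properties discussed in the paper: the KL-divergence is not symmetric, and it is unbounded when a letter with $p_i > 0$ has $q_i = 0$.

```python
import math


def entropy(p):
    """Shannon entropy H(P) in bits; terms with p_i = 0 contribute zero."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """KL-divergence D_KL(P||Q) in bits; infinite if q_i = 0 while p_i > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total


def cross_entropy(p, q):
    """Cross entropy H(P, Q) in bits; infinite if q_i = 0 while p_i > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total -= pi * math.log2(qi)
    return total


# Hypothetical PMFs over a three-letter alphabet.
P = [0.5, 0.0, 0.5]
Q = [0.8, 0.1, 0.1]

if __name__ == "__main__":
    print("H(P)        =", entropy(P))
    print("D_KL(P||Q)  =", kl_divergence(P, Q))
    print("D_KL(Q||P)  =", kl_divergence(Q, P))   # infinite: P gives letter 2 zero mass
    print("H(P, Q)     =", cross_entropy(P, Q))
    print("H(P) + D_KL =", entropy(P) + kl_divergence(P, Q))
```

In this example $D_{\mathrm{KL}}(P \| Q)$ is finite while $D_{\mathrm{KL}}(Q \| P)$ is infinite, which illustrates both the asymmetry and the unboundedness that motivate the bounded replacement.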

## Appendix C Survey Results of Useful Knowledge in Volume Visualization

This survey consists of eight questions presented as slides. The questionnaire is given as part of the supplementary materials. The ten surveyees are primarily colleagues from the UK, Spain, and the USA. They include doctors and experts of medical imaging and visualization, as well as several persons who are not familiar with the technologies of medical imaging and data visualization. Table 4 summarizes the answers from these ten surveyees.

## Appendix D Survey Results of Useful Knowledge in Viewing London Underground Maps

Figures 8, 9, and 10 show the questionnaire used in the survey about two types of London Underground maps. Table 5 summarizes the data from the answers by the 12 surveyees at King’s College London, while Table 6 summarizes the data from the answers by the four surveyees at the University of Oxford.

In Section 5.2, we have discussed Questions 1–4 in some detail. In the survey, Questions 5–8 constitute the second set. Each question asks surveyees to first identify two stations along a given underground line, and then determine how many stops there are between the two stations. All surveyees identified the stations correctly for all four questions, and most also counted the stops correctly. In general, for each of these cases, one can establish an alphabet of all possible answers in a way similar to the example of walking distances. However, we have not observed any interesting correlation between the correctness and the surveyees’ knowledge about metro systems or London.

In the third set of four questions, each question asks surveyees to identify the closest station for changing between two given stations on different lines. All surveyees identified the changing stations correctly for all questions.

The design of Questions 5–12 was also intended to collect data that might differentiate the deformed map from the faithful map in terms of the time required for answering questions. As shown in Figure 11, the questions were paired, such that the two questions in each pair feature the same level of difficulty. Although the comparison seems to suggest that the faithful map might have some advantage in the setting of this survey, we cannot be certain about this observation as the sample size is not large enough. In general, we cannot draw any meaningful conclusion about the cost in terms of time. We hope to collect more real world data about the timing cost of visualization processes for making further advances in applying information theory to visualization.

Meanwhile, we consider that the space cost is a valid consideration. While both maps have a similar size (i.e., deformed map: 850mm × 580mm, faithful map: 840mm × 595mm), their font sizes for station labels are very different. For long station names, “High Street Kensington” and “Totteridge & Whetstone”, the labels on the deformed map are 35mm and 37mm in length, while those on the faithful map are 17mm and 18mm long. Taking the height into account, the space used for station labels in the deformed map is about four times that in the faithful map. In other words, if the faithful map were to display its labels with the same font size, the cost of the space would be four times that of the deformed map.
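The factor of about four can be retraced from the label lengths, under the assumption (not stated in the survey data) that the label height scales by the same factor as the label length:

```latex
\[
\frac{35\,\text{mm}}{17\,\text{mm}} \approx 2.06, \qquad
\frac{37\,\text{mm}}{18\,\text{mm}} \approx 2.06, \qquad
\text{area ratio} \approx 2.06^{2} \approx 4.2
\]
```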

### References

1. Bramon R., Boada I., Bardera A., Rodríguez Q., Feixas M., Puig J., Sbert M.: Multimodal data fusion based on mutual information. IEEE Transactions on Visualization and Computer Graphics 18, 9 (2012), 1574–1587.
2. Biswas A., Dutta S., Shen H.-W., Woodring J.: An information-aware framework for exploring multivariate data sets. IEEE Transactions on Visualization and Computer Graphics 19, 12 (2013), 2683–2692.
3. Bruckner S., Möller T.: Isosurface similarity maps. Computer Graphics Forum 29, 3 (2010), 773–782.
4. Bramon R., Ruiz M., Bardera A., Boada I., Feixas M., Sbert M.: An information-theoretic observation channel for volume visualization. Computer Graphics Forum 32, 3pt4 (2013), 411–420.
5. Bramon R., Ruiz M., Bardera A., Boada I., Feixas M., Sbert M.: Information theory-based automatic multimodal transfer function design. IEEE Journal of Biomedical and Health Informatics 17, 4 (2013), 870–880.
6. Bordoloi U., Shen H.-W.: View selection for volume rendering. In Proc. IEEE Visualization (2005), pp. 487–494.
7. Chen M., Ebert D. S.: An ontological framework for supporting the design and evaluation of visual analytics systems. Computer Graphics Forum 38, 3 (2019), 131–144.
8. Chen M., Feixas M., Viola I., Bardera A., Shen H.-W., Sbert M.: Information Theory Tools for Visualization. A K Peters, 2016.
9. Chen M., Golan A.: What may visualization processes optimize? IEEE Transactions on Visualization and Computer Graphics 22, 12 (2016), 2619–2632.
10. Chen M., Grinstein G., Johnson C. R., Kennedy J., Tory M.: Pathways for theoretical advances in visualization. IEEE Computer Graphics and Applications 37, 4 (2017), 103–112.
11. Chen M., Gaither K., John N. W., McCann B.: Cost-benefit analysis of visualization in virtual environments. IEEE Transactions on Visualization and Computer Graphics 25, 1 (2019), 32–42.
12. Chen M., Jänicke H.: An information-theoretic framework for visualization. IEEE Transactions on Visualization and Computer Graphics 16, 6 (2010), 1206–1215.
13. Chen M., Sbert M.: On the upper bound of the Kullback-Leibler divergence and cross entropy. arXiv:1911.08334, 2019.
14. Correa C., Silver D., Chen M.: Feature aligned volume manipulation for illustration and visualization. IEEE Transactions on Visualization and Computer Graphics 12, 5 (2006), 1069–1076.
15. Cover T. M., Thomas J. A.: Elements of Information Theory. John Wiley & Sons, 2006.
16. Chen M., Walton S., Berger K., Thiyagalingam J., Duffy B., Fang H., Holloway C., Trefethen A. E.: Visual multiplexing. Computer Graphics Forum 33, 3 (2014), 241–250.
17. Feixas M., del Acebo E., Bekaert P., Sbert M.: An information theory framework for the analysis of scene complexity. Computer Graphics Forum 18, 3 (1999), 95–106.
18. Feixas M., Sbert M., González F.: A unified information-theoretic framework for viewpoint selection and mesh saliency. ACM Transactions on Applied Perception 6, 1 (2009), 1–23.
19. Gumhold S.: Maximum entropy light source placement. In Proc. IEEE Visualization (2002), pp. 275–282.
20. Jänicke H., Scheuermann G.: Visual analysis of flow features using information theory. IEEE Computer Graphics and Applications 30, 1 (2010), 40–49.
21. Jung Y.: instantreality 1.0. https://doc.instantreality.org/tutorial/volume-rendering/, last accessed in 2019.
22. Jänicke H., Wiebel A., Scheuermann G., Kollmann W.: Multifield visualization using local statistical complexity. IEEE Transactions on Visualization and Computer Graphics 13, 6 (2007), 1384–1391.
23. Kijmongkolchai N., Abdul-Rahman A., Chen M.: Empirically measuring soft knowledge in visualization. Computer Graphics Forum 36, 3 (2017), 73–85.
24. Kullback S., Leibler R. A.: On information and sufficiency. Annals of Mathematical Statistics 22, 1 (1951), 79–86.
25. Lin J.: Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37 (1991), 145–151.
26. Max N., Chen M.: Local and global illumination in the volume rendering integral. In Scientific Visualization: Advanced Concepts, Hagen H., (Ed.). Schloss Dagstuhl, Wadern, Germany, 2010.
27. Moser S. M.: A Student’s Guide to Coding and Information Theory. Cambridge University Press, 2012.
28. Ng C. U., Martin G.: Automatic selection of attributes by importance in relevance feedback visualisation. In Proc. Information Visualisation (2004), pp. 588–595.
29. Nagy Z., Schneider J., Westermann R.: Interactive volume illustration. In Proc. Vision, Modeling and Visualization (2002).
30. Purchase H. C., Andrienko N., Jankun-Kelly T. J., Ward M.: Theoretical foundations of information visualization. In Information Visualization: Human-Centered Issues and Perspectives, Springer LNCS 4950. 2008, pp. 46–64.
31. Ruiz M., Bardera A., Boada I., Viola I., Feixas M., Sbert M.: Automatic transfer functions based on informational divergence. IEEE Transactions on Visualization and Computer Graphics 17, 12 (2011), 1932–1941.
32. Rigau J., Feixas M., Sbert M.: Shape complexity based on mutual information. In Proc. IEEE Shape Modeling and Applications (2005).
33. Roettger S.: The volume library. http://schorsch.efi.fh-nuernberg.de/data/volume/, last accessed in 2019.
34. Shannon C. E.: A mathematical theory of communication. Bell System Technical Journal 27 (1948), 379–423.
35. Tam G. K. L., Kothari V., Chen M.: An analysis of machine- and human-analytics in classification. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2017).
36. Takahashi S., Takeshima Y.: A feature-driven approach to locating optimal viewpoints for volume visualization. In Proc. IEEE Visualization (2005), pp. 495–502.
37. Viola I., Chen M., Isenberg T.: Visual abstraction. In Foundations of Data Visualization, Chen M., Hauser H., Rheingans P., Scheuermann G., (Eds.). Springer, 2020. Preprint at arXiv:1910.03310, 2019.
38. Viola I., Feixas M., Sbert M., Gröller M. E.: Importance-driven focus of attention. IEEE Transactions on Visualization and Computer Graphics 12, 5 (2006), 933–940.
39. Vázquez P.-P., Feixas M., Sbert M., Heidrich W.: Automatic view selection using viewpoint entropy and its application to image-based modelling. Computer Graphics Forum 22, 4 (2004), 689–700.
40. Wei T.-H., Lee T.-Y., Shen H.-W.: Evaluating isosurfaces with level-set-based information maps. Computer Graphics Forum 32, 3 (2013), 1–10.
41. Wu Y., Qu H.: Interactive transfer function design based on editing direct volume rendered images. IEEE Transactions on Visualization and Computer Graphics 13, 5 (2007), 1027–1040.
42. Wang C., Shen H.-W.: LOD Map - a visual interface for navigating multiresolution volume visualization. IEEE Transactions on Visualization and Computer Graphics 12, 5 (2005), 1029–1036.
43. Wang C., Shen H.-W.: Information theory in scientific visualization. Entropy 13 (2011), 254–273.
44. Wang C., Yu H., Ma K.-L.: Importance-driven time-varying data visualization. IEEE Transactions on Visualization and Computer Graphics 14, 6 (2008), 1547–1554.
45. Xu L., Lee T. Y., Shen H. W.: An information-theoretic framework for flow visualization. IEEE Transactions on Visualization and Computer Graphics 16, 6 (2010), 1216–1224.