A review of possible effects of cognitive biases on interpretation of rule-based machine learning models
This paper investigates to what extent cognitive biases affect human understanding of interpretable machine learning models, in particular of rules discovered from data. Twenty cognitive biases (illusions, effects) are covered, as are possibly effective debiasing techniques that can be adopted by designers of machine learning algorithms and software. While there seems to be no universal approach for eliminating all the identified cognitive biases, it follows from our analysis that the effect of most biases can be ameliorated by making rule-based models more concise. Due to the lack of previous research, our review transfers general results obtained in cognitive psychology to the domain of machine learning. It needs to be followed by empirical studies specifically aimed at the machine learning domain.
keywords: cognitive bias, cognitive illusion, machine learning, interpretability, rule induction
This paper aims to investigate the effect of cognitive biases on human understanding of machine learning models, in particular inductively learned rules. We use the term cognitive bias as a representative term for various related cognitive phenomena (heuristics, effects, illusions and constraints) that manifest as seemingly irrational reasoning patterns that are thought to allow humans to make fast and risk-averse decisions. Following the “cognitive biases and heuristics” research program started by Tversky and Kahneman in the 1970s, over 50 cognitive biases have been discovered to date (Pohl, 2017). Their cumulative effect on human reasoning should not be underestimated, as even the early work showed that “cognitive biases seem reliable, systematic, and difficult to eliminate” (Kahneman and Tversky, 1972). The effect of some cognitive biases is more pronounced when people do not have well-articulated preferences (Tversky and Simonson, 1993), which is often the case in explorative machine learning.
Previous works have analysed the impact of cognitive biases on multiple types of human behaviour and decision making. A specific example is the seminal book “Social cognition” by Kunda (1999), which is concerned with their impact on social interaction. Another, more recent work by Serfas (2011) focuses on the context of capital investment. Closer to the domain of machine learning, in an article entitled “Psychology of Prediction”, Kahneman and Tversky (1973) warned that cognitive biases can lead to violations of Bayes' theorem when people make fact-based predictions under uncertainty. These results directly relate to inductively learned rules, since these are associated with measures such as confidence and support expressing the (un)certainty of the prediction they make. Despite the early papers (Michalski, 1969, 1983) showing the importance of studying cognitive phenomena for rule induction and machine learning in general, there has been a paucity of follow-up research. In previous work (Fürnkranz et al., 2018), we evaluated a selection of cognitive biases in the very specific context of whether minimizing the complexity or length of a rule will also lead to increased interpretability, which is often taken for granted in machine learning research.
In this paper, we attempt to systematically relate cognitive biases to the interpretation of machine learning results. To that end, we review twenty cognitive biases that can distort the interpretation of inductively learnt rules. The review is intended to help answer questions such as: Which cognitive biases affect understanding of symbolic machine learning models? What could help as the “debiasing antidote”? We summarize the review by proposing a model that describes which cognitive biases are triggered when humans assess the plausibility of inductively learned rules and whether they influence plausibility in a positive or negative way. The model is based on evidence transposed from general empirical studies in psychology to the particular domain of rule learning. The purpose is to give an indicative holistic view of the problem in the rule learning domain, and to foster empirical studies specifically aimed at the machine learning domain.
This paper is organized as follows. Section 2 provides a brief review of related work published at the intersection of rule learning and psychology, defining rule induction and cognitive bias along the way. Section 3 motivates our study using the example of the insensitivity to sample size effect. Section 4 describes the criteria that we applied to select a subset of cognitive biases for our review. The twenty selected biases are covered in Section 5. The individual cognitive biases have very disparate effects and causes. Section 6 provides a discussion of our results and a concise summary in the form of an illustrative visual model. In Section 7 we state the limitations of our review and outline directions for future work. The conclusions summarize the contributions.
2 Background and Related Work
We selected individual rules, as learnt by many machine learning algorithms, as the object of our study. Focusing on simple artefacts (individual rules) as opposed to entire models such as rule sets or rule lists allows a deeper, more focused analysis, since a rule is a small self-contained item of knowledge. Making a small change in one rule, such as adding a new condition, makes it possible to test the effect of an individual factor that can influence the perception of rule plausibility. In this section, we will briefly introduce the inductively learnt rule. Then, we will focus on rule plausibility as a measure of rule comprehension.
2.1 Decision Rules in Machine Learning
The type of inductively learned decision rule which we consider is shown in Figure 1. Following the terminology of Fürnkranz et al. (2012), $A$, $B$ and $C$ represent an arbitrary number of literals, i.e., Boolean expressions which are composed of an attribute name (e.g., veil) and its value (e.g., white). The conjunction of literals on the left side of the rule is called the antecedent, and the single literal predicted by the rule is called the consequent. Literals in the antecedent are sometimes referred to as conditions throughout the text. While this rule definition is restricted to conjunctive rules, other definitions, e.g., the formal definition given by Slowinski et al. (2006, page 2), also allow for negation and disjunction as connectives.
Rules on the output of rule learning algorithms are most commonly characterized by two parameters, confidence and support. The confidence of a rule is defined as $a/(a+b)$, where $a$ is the number of correctly classified objects, i.e. those matching the rule antecedent as well as the rule consequent, and $b$ is the number of misclassified objects, i.e. those matching the antecedent, but not the consequent. The support of a rule is defined either as $a/n$, where $n$ is the number of all objects (relative support), or simply as $a$ (absolute support).
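The two measures can be illustrated with a minimal Python sketch. The class name `RuleCounts`, its field names and the counts used below are hypothetical and chosen only for illustration; they are not taken from any particular rule learning system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleCounts:
    """Counts describing how one rule behaves on a dataset (hypothetical helper)."""
    correct: int        # objects matching the antecedent and the consequent
    misclassified: int  # objects matching the antecedent but not the consequent
    total: int          # all objects in the dataset


def confidence(rule: RuleCounts) -> float:
    """Share of objects covered by the antecedent that also match the consequent."""
    covered = rule.correct + rule.misclassified
    if covered == 0:
        raise ValueError("the rule antecedent does not cover any object")
    return rule.correct / covered


def relative_support(rule: RuleCounts) -> float:
    """Correctly classified objects as a share of all objects."""
    if rule.total <= 0:
        raise ValueError("the dataset must contain at least one object")
    return rule.correct / rule.total


def absolute_support(rule: RuleCounts) -> int:
    """Number of correctly classified objects."""
    return rule.correct


# Illustrative counts only: 100 objects match the antecedent, 80 of them
# also match the consequent, and the dataset has 1000 objects.
example = RuleCounts(correct=80, misclassified=20, total=1000)
example_confidence = confidence(example)
example_relative_support = relative_support(example)
example_absolute_support = absolute_support(example)
```

With these illustrative counts, the confidence is 80 out of 100 covered objects and the relative support is 80 out of 1000 objects.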
Some rule learning frameworks, in particular association rule learning (Agrawal et al., 1995; Zhang and Zhang, 2002), require the user to set thresholds for minimum confidence and support. Only rules with confidence and support values meeting or exceeding these thresholds are included in the output of rule learning and presented to the user.
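The thresholding step can be sketched as follows. This is a simplified illustration rather than the procedure of any specific association rule learner: the `ScoredRule` type, the function `filter_rules`, the rule texts and the threshold values are all assumptions made for the example.

```python
from typing import List, NamedTuple


class ScoredRule(NamedTuple):
    """A rule with precomputed quality measures (hypothetical container)."""
    text: str
    confidence: float  # value between 0 and 1
    support: float     # relative support, value between 0 and 1


def filter_rules(rules: List[ScoredRule],
                 min_confidence: float,
                 min_support: float) -> List[ScoredRule]:
    """Keep only rules that meet or exceed both user-set thresholds."""
    return [
        rule for rule in rules
        if rule.confidence >= min_confidence and rule.support >= min_support
    ]


# Made-up candidate rules used only to exercise the filter.
candidates = [
    ScoredRule("IF a THEN x", confidence=0.80, support=0.100),
    ScoredRule("IF a AND b THEN x", confidence=0.90, support=0.010),
    ScoredRule("IF c THEN x", confidence=0.55, support=0.200),
]

presented = filter_rules(candidates, min_confidence=0.7, min_support=0.05)
for rule in presented:
    print(f"{rule.text}  (confidence={rule.confidence:.0%}, support={rule.support:.0%})")
```

Note that the second candidate is dropped despite having the highest confidence, because its support falls below the chosen threshold.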
2.2 Study of Rules in Cognitive Science
Rules are a commonly embraced model of human reasoning in cognitive science (Smith et al., 1992; Nisbett, 1993; Pinker, 2015). They also closely relate to Bayesian inference, another frequently used model of human reasoning. A rule “IF A AND B THEN C” can be interpreted as a hypothesis corresponding to the logical implication $A \wedge B \Rightarrow C$. We can express the plausibility of such a hypothesis in terms of Bayesian inference as the conditional probability $P(C \mid A, B)$. This corresponds to the confidence of the rule, a term used in rule learning, and to strength of evidence, a term used by cognitive scientists (Tversky and Kahneman, 1974). [Footnote 1: In the terminology used within the scope of cognitive science (Griffin and Tversky, 1992), confidence corresponds to the strength of the evidence and support to the weight of the evidence. Interestingly, this problem was already mentioned by Keynes (1922) (according to Camerer and Weber (1992)), who drew attention to the problem of balancing the likelihood of the judgment and the weight of the evidence in the assessed likelihood.]
Given that $P(C \mid A, B)$ is a probability estimate computed on a sample, another relevant piece of information for determining the plausibility of the hypothesis is the robustness of this estimate. This corresponds to the number of observed instances for which the rule has been observed to be true. The size of the sample (typically expressed as a ratio) is known as rule support in machine learning and as weight of the evidence in cognitive science (Tversky and Kahneman, 1974).
Rule as object of study in cognitive science
A hypothesis (a rule inductively learned from data) is also a very specific form of alternative. Psychological research specifically on hypothesis testing in rule discovery tasks has been performed in cognitive science at least since the 1960s. The seminal article by Wason (1960) introduced what is widely referred to as Wason's 2-4-6 task. Participants are given the sequence of numbers 2, 4 and 6 and asked to come up with a rule that generates this sequence. In their search for the hypothesized rule, they can propose other sequences of numbers to the experimenter, such as 3-5-7, that are either supposed to conform to the rule or not. The experimenter answers yes or no. While the target rule is simply “ascending sequence”, people find it difficult to discover this specific rule, because they exhibit the confirmation bias, a human tendency to focus on evidence confirming the hypothesis at hand (Nickerson, 1998).
One of the later works in this area is entitled “Strategies of rule discovery in an inference task” (Tweney et al., 1980). While the title could suggest that this work is highly relevant to our machine learning problem, it is actually a psychological study of the inference processes (that is, the “meta-rules” people use in the reasoning process), which does not directly relate to the notion of “rules” used in machine learning as a particular pattern in data in a specific domain. Of similarly limited relevance are the follow-up works of Rossi et al. (2001) and Vallée-Tourangeau and Payton (2008).
Another related field is that of cognitive theories of human decision making, which study how humans combine multiple pieces of evidence, which in our case correspond to conditions (literals) in a rule. The contribution of individual conditions to the overall plausibility of the rule is an important part of our research problem, but there is a paucity of directly applicable research in cognitive science. Most of the research that we identified in our brief survey is based on Bayesian reasoning (studies by Gopnik and Tenenbaum (2007); Griffiths et al. (2010)), rather than rule induction.
If we consider an individual rule (hypothesis) as one of several alternatives between which the user has to decide, we can apply research on human decision-making processes. Most notably, this includes results of the research program on cognitive heuristics and biases started by Amos Tversky and Daniel Kahneman in the 1970s. In our work, we draw heavily from this intensely studied area of human cognition.
2.3 Cognitive Bias
According to the Encyclopedia of Human Behavior (Wilke and Mata, 2012), the term cognitive bias was introduced in the 1970s by Amos Tversky and Daniel Kahneman (Tversky and Kahneman, 1974), and is defined as:
Systematic error in judgment and decision-making common to all human beings which can be due to cognitive limitations, motivational factors, and/or adaptations to natural environments.
The research on cognitive biases and heuristics is considered as the most important psychological research done in the past 40 years (Wilke and Mata, 2012).
The narrow initial definition of cognitive bias as a shortcoming of human judgment was criticized by the German psychologist Gerd Gigerenzer, who in the late 1990s started the “Fast and frugal heuristic” program to emphasize the ecological rationality (validity) of cognitive biases.
As for terminology, the concept of “cognitive bias” includes many cognitive phenomena, many of which are not called “biases” but instead heuristics (e.g. Representativeness heuristic), effects (e.g. Mere exposure effect), fallacies (e.g. Conjunction fallacy), illusions (e.g. Illusory correlation) or otherwise.
Three types of cognitive biases are recognized in the recent authoritative work of Pohl (2017): those relating to thinking, judgment and memory. The Thinking category covers biases related to thinking processes. These require the person to apply a certain rule (such as Bayes' theorem). Since many people are not aware of this rule, they have to apply it intuitively, which can result in errors. The Judgment category covers biases exhibited by people when they are asked to rate some property of a given object (such as the plausibility of a rule). The Memory category covers biases related to memory, which deal mainly with the phenomena of various memory errors, such as omission, substitution and interpolation.
Function and validity of cognitive biases
In the introduction, we briefly characterized cognitive biases as “seemingly irrational reasoning patterns that are thought to allow humans to make fast and risk-averse decisions.” In fact, the function of cognitive biases is a subject of scientific debate. According to the review of functional views in Pohl (2017), there are three fundamental positions among researchers. The first group considers them as dysfunctional errors of the system, the second group as faulty by-products of otherwise functional processes and the third group as adaptive and thus functional responses. According to Pohl (2017), most researchers are in the second group, where cognitive biases (illusions) are considered to be “built-in errors of the human information-processing systems”.
In this work, we consider cognitive biases as strategies that evolved to improve the fitness and chances of survival of the individual in particular situations. This stand in defense of biases is succinctly expressed by the influential work of Haselton and Nettle (2006): “Both the content and direction of biases can be predicted theoretically and explained by optimality when viewed through the long lens of evolutionary theory. Thus, the human mind shows good design, although it is design for fitness maximization, not truth preservation.” According to the same paper, empirical evidence shows that cognitive biases are triggered or their effect strengthened by environmental cues and context (Haselton and Nettle, 2006).
Interpretation of statistical hypotheses and machine learning results is a very recent type of cognitive task. We thus assume that when interpreting machine learning results, the human mind applies many of the heuristics and biases inappropriately. It also follows that cognitive biases will not manifest in all humans and in all situations equally (and in many cases not at all).
2.4 Measures of Interpretability, Perceived and Objective Plausibility
We claim that cognitive biases can affect the interpretation of rule-based models. However, how does one measure interpretability? According to our literature review, there is no generally accepted measure of interpretability of machine learning models. Model size, which was used in several studies, has recently been criticized (Freitas, 2014; Stecher et al., 2016; Fürnkranz et al., 2018).
In our work, we embrace the concept of plausibility to measure interpretability. Plausibility is defined according to the Oxford Dictionary of US English as “seeming reasonable or probable” and according to the Cambridge dictionary of UK English as “Seeming likely to be true, or able to be believed”. In the previous sections, we linked the inductively learned rule of machine learning to the concept of “hypothesis” used in cognitive science. There is a body of work in cognitive science on analyzing the perceived plausibility of hypotheses (Gettys et al., 1978, 1986; Anderson and Fleming, 2016). Plausibility can also be directly elicited for inductively learnt rules (Kliegr, 2017). The concepts of “trust” and “acceptance” are used in connection with measuring comprehension of machine learning models in the influential position paper by Freitas (2014). Plausibility is also closely related to the term justifiability, which requires the expert to assess that the model is in line with existing domain knowledge. In a recent review of interpretability definitions by Bibal and Frénay (2016), the term plausibility is not explicitly covered, but justifiability is stated to depend on interpretability. Martens et al. (2011) define justifiability as “intuitively correct and in accordance with domain knowledge”.
We are aware of the fact that if a decision maker finds a rule plausible, it does not necessarily mean that the rule is correctly understood; in many cases, quite the contrary can be true. Nevertheless, we believe that the alignment of the perceived plausibility with the objective, data-driven plausibility of a hypothesis should be at the heart of an effort that strives for interpretable machine learning.
3 Motivational Example
It is well known in machine learning that chance rules with a deceptively high confidence can appear in the output of rule learning algorithms (Azevedo and Jorge, 2007). For this reason, the rule learning process typically outputs both confidence and support for the analyst to make an informed choice about merits of each rule.
In the example listing above, both rules are associated with values of confidence and support to inform about the strength and weight of evidence for both rules. While the first rule is less strong (80% vs 90% correct), its weight of evidence is ten times higher than that of the second rule.
According to the insensitivity to sample size effect (Tversky and Kahneman, 1974), there is a systematic bias in human thinking that makes humans put higher weight on the strength of evidence (confidence) than on the weight of evidence (support). It has been shown that this bias also applies to statistically sophisticated psychologists (Tversky and Kahneman, 1971) and thus can be applicable to the widening number of professions that are using rule learning to obtain insights from data.
The second bias that we consider is the base-rate fallacy, according to which people are unable to correctly process conditional probabilities. The conditional probability in our example is the confidence value, which, in the shown example, is the probability of a good rating on the condition of the film being released in 2006 and in English.
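Why the weight of evidence matters can be made concrete by attaching an interval to each confidence value, in the spirit of the debiasing suggestion listed for this bias in Table 1. The sketch below uses the Wilson score interval, which is our choice for illustration and is not prescribed by the cited studies. The absolute counts are hypothetical: they were picked only so that the two rules have 80% and 90% confidence and a tenfold difference in the number of correctly classified objects.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, covered: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a rule's confidence.

    correct: objects matching both antecedent and consequent
    covered: objects matching the antecedent
    z:       normal quantile (1.96 corresponds to roughly 95% coverage)
    """
    if covered <= 0 or not 0 <= correct <= covered:
        raise ValueError("need 0 <= correct <= covered and covered > 0")
    p_hat = correct / covered
    z2 = z * z
    denominator = 1 + z2 / covered
    centre = (p_hat + z2 / (2 * covered)) / denominator
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / covered + z2 / (4 * covered ** 2))
        / denominator
    )
    return centre - half_width, centre + half_width


# Hypothetical counts: rule 1 is correct on 360 of 450 covered objects (80%),
# rule 2 is correct on 36 of 40 covered objects (90%).
low_1, high_1 = wilson_interval(correct=360, covered=450)
low_2, high_2 = wilson_interval(correct=36, covered=40)

print(f"rule 1: confidence 80%, interval [{low_1:.3f}, {high_1:.3f}]")
print(f"rule 2: confidence 90%, interval [{low_2:.3f}, {high_2:.3f}]")
```

Under these assumed counts, the interval of the second rule is considerably wider than that of the first and the two intervals overlap, so the higher point estimate alone does not establish that the second rule is the more reliable one.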
The analysis of relevant literature from cognitive science not only reveals applicable biases, but also provides in some cases methods for limiting their effect (debiasing). The standard way used in rule learning software for displaying rule confidence and support metrics is to use ratios, as in our example. Extensive research in psychology has shown that if natural numbers are used instead then the number of errors in judgment drops (Gigerenzer and Goldstein, 1996; Gigerenzer and Hoffrage, 1995). Reflecting these suggestions, the first rule in our example could be presented as follows:
A correct understanding of machine learning models can be difficult even for experts. In this section, we tried to motivate why addressing cognitive biases can play an important role in making the results of inductive rule learning more understandable. In the remainder of this paper, both biases involved in our example will be revisited in greater depth, along with 18 other biases.
4 Selection Criteria
A number of cognitive biases have been discovered, experimentally studied, and extensively described in the literature. There are at least 51 different biases falling into the thinking and judgment categories (Evans et al., 2007; Pohl, 2017). As Pohl (2017) states in a recent authoritative book on cognitive illusions: “There is a plethora of phenomena showing that we deviate in our thinking, judgment and memory from some objective and arguably correct standard.” In the first phase of our research, we selected the subset of biases to be reviewed. To select applicable biases, we considered those that have some relation to the following properties of inductively learned rules: 1. rule length (number of literals in the antecedent), 2. rule interest measures (especially support and confidence), 3. position (ordering) of conditions in a rule and ordering of rules in the rule list, 4. specificity and predictive power of conditions (correlation with the target variable), 5. use of additional logical connectives (conjunction, disjunction, negation), 6. treatment of missing information (inclusion of conditions referring to a missing value), and 7. conflict between rules in the rule list.
Through the selection of appropriate learning heuristics, the rule induction algorithm can influence these properties. For example, most heuristics implement some form of trade-off between the coverage or support of a rule, and its implication strength or confidence (Fürnkranz and Flach, 2005; Fürnkranz et al., 2012).
Inclusion of “overlapping” biases
We do not consider the correlation between individual cognitive biases. For example, it is known that a number of cognitive biases (such as the conjunction fallacy, base-rate neglect, insensitivity to sample size, and confusion of the inverse) can all be attributed to a more general phenomenon called the representativeness heuristic (Ballin et al., 2008). To our knowledge, the correlation between cognitive biases has not yet been systematically studied. In our review, we thus include multiple biases even though they may overlap.
5 Review of Cognitive Biases
In this section, we cover a selection of twenty cognitive biases. For all of them, we include a short description and a paragraph which quantifies the effect of the cognitive bias. We pay particular attention to their potential effect on the interpretability of rule learning results, which has not been covered in previous works. For those biases that were categorized by Pohl (2017), the name of the category (Thinking, Judgment) appears in parentheses in the subsection heading. We have not included any biases from the Memory category.
For all cognitive biases, we suggest a debiasing technique that could be effective in aligning the perceived plausibility of the rule with its objective plausibility. The suggestions are based on empirical results obtained by psychologists; we indicate when they are our own conjectures in need of further validation. Most biases have a limited scope of validity. In our review, we thus pay attention to identifying groups of people who are (not) susceptible to the specific bias wherever this information is available. Also, for selected biases, we report the success rates, i.e., the number of people committing a fallacy corresponding to a specific bias in an experiment.
An overview of the main traits of the reviewed cognitive biases is presented in Table 1.
| phenomenon | implications for rule-learning | debiasing technique |
|---|---|---|
| biases that increase plausibility with rule length | | |
| Availability Heuristic | Predictive strength of literal is based on association between the literals in the antecedent and the consequent of the rule | Trigger System 2 |
| Averaging Heuristic | Probability of antecedent as the average of probabilities of literals | Reminder of probability theory |
| Base-rate Fallacy | Emphasis on confidence, neglect for support | Express confidence and support in natural frequencies |
| Confirmation Bias | Rules confirming the analyst's prior hypothesis are “cherry picked” | i) Explicit guidance to consider evidence for and against hypothesis, ii) Screen analysts for susceptibility using Defense Confidence Scale questionnaire |
| Disjunction Fallacy | Prefer more specific literals over less specific | Inform on taxonomical relation between literals, explain benefits of higher support* |
| Effect of Difficulty | Rules with small support and high confidence are “overrated” | Filter rules that do not pass a statistical significance test, explain benefits of higher support* |
| Information Bias | More literals increase preference | Visualize value of literals, sort by predictive value* |
| Insensitivity to Sample Size | Analyst does not realize the increased reliability of confidence estimate with increasing value of support | Use support to compute confidence (reliability) intervals for the value of confidence |
| Mere Exposure Effect | Repeated exposure to literal results in increased preference | Extra time or knowledge of literals* |
| Misunderstanding of “and” | “and” is understood as disjunction | Express literal as proposition rather than as category |
| Negativity Bias | Negated or negative literals are considered as more important | Avoid the use of negation* |
| Recognition Heuristic | Recognition of literals increases preference | Extra time or knowledge of literals |
| Reiteration Effect | Same literal present in multiple rules increases preference | Pruning algorithms* |
| Representativeness Heuristic | Overestimate the probability of literal representative of target | Use natural frequencies instead of ratios |
| Tradeoff Contrast | Preference for literal is influenced by other literals in the rule or in other rules | i) Pruning discovered rules, ii) Information on semantics of the literal and covariance with other literals* |
| Unit Bias | Literals are perceived to have same weight | Inform on discriminatory power of literals* |
| biases that decrease plausibility with rule length | | |
| Ambiguity Aversion | Prefer known literal over unknown literal | Textual description of literals* |
| Weak Evidence Effect | Literal only weakly perceived as predictive of target decreases plausibility | Omission of weak predictors from antecedent |
| effect independent of rule length | | |
| Confusion of the Inverse | Confusing the confidence of the rule, $P(\text{consequent} \mid \text{antecedent})$, with $P(\text{antecedent} \mid \text{consequent})$ | NA |
| Primacy Effect | Rules that are presented as first in the rule model are more preferred | Sort rules from strongest to weakest* |
5.1 Base-rate Fallacy (Thinking)
The base-rate fallacy indicates that people are unable to correctly process conditional probabilities.
In the original experiment reported in Kahneman and Tversky (1973) more than 95% of psychology graduate students committed the fallacy.
Implications for rule learning
The application of the base rate fallacy suggests that when facing two otherwise identical rules with different values of confidence and support metrics, an analyst’s preferences will be primarily shaped by the confidence of the rule.
It follows that by its preference for higher confidence, the base-rate fallacy will generally contribute to a positive correlation between rule length and plausibility, since longer rules can better adapt to a particular group in data and thus have a higher confidence than more general, shorter rules. This is in contrast to the general bias for simple rules that is implemented by state-of-the-art rule learning algorithms, because simple rules tend to be more general, have a higher support, and are thus statistically more reliable.
Our literature review has surfaced several techniques for addressing the base-rate fallacy. Gigerenzer and Hoffrage (1995) show that representations in terms of natural frequencies, rather than conditional probabilities, facilitate the computation of a cause's probability. To the authors' knowledge, confidence is typically presented as a ratio in current software systems. The support rule quality metric is sometimes presented as a ratio and sometimes as a natural number. It would foster correct understanding if analysts were consistently presented with natural frequencies in addition to ratios.
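A rule presentation that shows natural frequencies alongside ratios could be generated along the following lines. This is only a sketch of one possible wording: the function name `natural_frequency_summary`, the phrasing of the output and the counts are hypothetical and have not been validated in a user study.

```python
def natural_frequency_summary(correct: int, misclassified: int, total: int) -> str:
    """Describe rule quality with ratios followed by the underlying counts.

    correct:       objects matching the antecedent and the consequent
    misclassified: objects matching the antecedent but not the consequent
    total:         all objects in the dataset
    """
    covered = correct + misclassified
    if covered == 0:
        raise ValueError("the rule antecedent does not cover any object")
    if total < covered:
        raise ValueError("total must be at least the number of covered objects")
    confidence = correct / covered
    support = correct / total
    return (
        f"confidence {confidence:.0%} "
        f"({correct} of {covered} objects matching the rule conditions are classified correctly), "
        f"support {support:.0%} "
        f"({correct} of {total} objects in the data)"
    )


# Illustrative counts only.
summary = natural_frequency_summary(correct=80, misclassified=20, total=1000)
print(summary)
```

Keeping the ratio and the counts in the same sentence lets the analyst see both the strength of the rule and the size of the sample it was estimated on.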
5.2 Confirmation Bias and Positive Test Strategy (Thinking)
Confirmation bias is the best known and most widely accepted notion of inferential error of human reasoning (Evans, 1989, p. 552). [Footnote 2: Cited according to Nickerson (1998).] This bias refers to the notion that people tend to look for evidence supporting the current hypothesis, disregarding conflicting evidence. Research suggests that even neutral or unfavourable evidence can be interpreted to support existing beliefs, or, as Trope et al. (1997, p. 115-116) put it, “the same evidence can be constructed and reconstructed in different and even opposite ways, depending on the perceiver’s hypothesis.”
A closely related heuristic is the Positive Test Strategy (PTS) proposed by Klayman and Ha (1987). This heuristic suggests that when trying to test a specific hypothesis, people examine cases which they expect to confirm the hypothesis rather than the cases which have the best chance of falsifying it. The difference between PTS and confirmation bias is that PTS is applied to test a candidate hypothesis while the true confirmation bias is concerned with hypotheses that are already established (Pohl, 2004, p. 93). The experimental results of Klayman and Ha (1987) show that under realistic conditions, PTS can be a very good heuristic for determining whether a hypothesis is true or false, but it can also lead to systematic errors if applied to an inappropriate task.
Finally, it should be noted that, according to a review conducted by Klayman and Ha (1987), this heuristic is used as a “general default heuristic” in situations where either specific information that identifies some tests as more relevant than others is absent or when the cognitive demands of the task prevent a more careful strategy.
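The limits of the positive test strategy can be illustrated on Wason's 2-4-6 task with a small simulation. The target rule (“ascending sequence”) is the one used in the task; the participant's working hypothesis (“each number is two larger than the previous one”) and the tested sequences are assumptions made for this sketch.

```python
from typing import List, Tuple

Triple = Tuple[int, int, int]


def target_rule(seq: Triple) -> bool:
    """Experimenter's rule: any ascending sequence."""
    return seq[0] < seq[1] < seq[2]


def working_hypothesis(seq: Triple) -> bool:
    """Assumed participant hypothesis: each number increases by two."""
    return seq[1] - seq[0] == 2 and seq[2] - seq[1] == 2


def falsifying_tests(tests: List[Triple]) -> List[Triple]:
    """Tests where the experimenter's answer contradicts the hypothesis."""
    return [seq for seq in tests if working_hypothesis(seq) != target_rule(seq)]


# Positive tests: sequences the participant expects to receive a "yes".
positive_tests = [(3, 5, 7), (10, 12, 14), (1, 3, 5)]
# Negative tests: sequences the participant expects to receive a "no".
negative_tests = [(1, 2, 3), (5, 10, 20), (6, 4, 2)]

falsified_by_positive = falsifying_tests(positive_tests)
falsified_by_negative = falsifying_tests(negative_tests)

print("falsifying positive tests:", falsified_by_positive)
print("falsifying negative tests:", falsified_by_negative)
```

Because the assumed hypothesis is a special case of the target rule, every positive test is answered with “yes” and none of them can falsify it; only sequences that the participant expects to fail, such as 1-2-3, reveal that the hypothesis is too narrow.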
According to Mynatt et al. (1977, p. 404), 70% of the subjects did not abandon falsified hypotheses in an experiment that simulated a research environment. [Footnote 3: Result for the experiment performed in a “complex” environment.] This success rate is particularly relevant for the problem of comprehending rule learning results, as the simulated research environment is close to our target domain of analysts interpreting discovered rules.
Implications for rule learning
This bias can have significant impact depending on the purpose for which the rule learning results are used. If the analyst had some prior hypothesis before she obtained the rule learning results, according to the confirmation bias she will tend to “cherry pick” rules confirming this prior hypothesis and disregard rules that contradict it. Given that some rule learners may output contradicting rules, the analyst can select only the rules conforming to the hypothesis, disregarding applicable rules with the opposite conclusion, which could otherwise turn out to be more relevant.
Using evidence from MRI brain scans, Westen et al. (2006) observe confirmation bias and explain it by emotions related to the favoured hypothesis. Evidence that challenges such a preferred hypothesis is involuntarily suppressed. The experiments in this study were conducted by presenting information that challenged the moral integrity of the politician that the subject favoured. While it could be argued that data analysts interpreting the rule learning results are free of emotional bonds to the problem and can be trained to correctly interpret machine learning results, they may still be subject to a confirmation bias:
Stanovich et al. (2013) show that incidence of myside bias, which is closely related to confirmation bias, is surprisingly not related to general intelligence. This suggests that even highly intelligent analysts can be affected.
Some research can even be interpreted as indicating that data analysts can be more susceptible to the myside bias than the general population. An experiment reported by Wolfe and Britt (2008) shows that subjects who defined good arguments as those that can be ââproved by factsââ (this stance, we assume, would also apply to many data analysts) were more prone to exhibiting a myside bias.444This tendency is explained as follows: “For people with this belief, facts and support are treated uncritically. The intended audience is not part of the schema and thus ignored. More importantly, arguments and information that may support another side are not part of the schema and are also ignored.”
Tweney et al. (1980) successfully tested a modification of Wason’s 2,4,6 task. In its original setup, participants try to “discover” the rule according to which the sequence was created. The correct answer is “ascending sequence of numbers”. In the modification by Tweney et al. (1980), participants were asked to search for two rules (“any ascending sequence of numbers” and “all other sequences”) instead of one rule (“ascending sequence of numbers”). Following this, the response format was changed from positive and negative to whether the rule belongs to the first category “DAX” or the second category “MED”, which improved performance in the task. Thus, we conclude that relabeling categories from “positive” and “negative” to something more neutral can possibly help to debias the analysts’ interpretation of the rule learning result.
Albarracín and Mitchell (2004) suggest that the susceptibility to the confirmation bias can depend on one’s personality traits. This publication also presents a diagnostic tool called “defense confidence scale” that can identify individuals who are prone to confirmational strategies.
Wolfe and Britt (2008) successfully experimented with providing the subjects with explicit guidelines for considering evidence both for and against the hypothesis. While this research is not directly related to hypothesis testing, providing explicit guidance combined with modifications of the user interface of the system presenting the rule learning results could also prove to be an effective debiasing technique.
5.3 Conjunction Fallacy and Representativeness Heuristic (Thinking)
Human-perceived plausibility of hypotheses has been extensively studied in cognitive science. One of the best-known cognitive phenomena related to our focus area of rule plausibility is the conjunctive fallacy. This fallacy falls into the research program on cognitive biases and heuristics carried out by Amos Tversky and Daniel Kahneman since the 1970s.
The representativeness heuristic relates to the tendency to make judgments based on similarity, following the rule “like goes with like”, which is typically used to determine whether an object belongs to a specific category. According to Gilovich and Savitsky (2002), the representativeness heuristic can be held accountable for a number of widely held false and pseudo-scientific beliefs, including those in astrology or graphology. (Footnote: Gilovich and Savitsky (2002) give the following example: astrology relates the resemblance of the physical appearance of a sign, such as a crab, with personal traits, such as a tough appearance on the outside. For graphology, the following example is given: handwriting to the left is used to indicate that the person is holding something back.) It can also inhibit valid beliefs that do not meet the requirements of resemblance.
The conjunctive fallacy is in the literature often defined via the “Linda” problem (Tversky and Kahneman, 1983, page 299), which was first used to demonstrate it. (Footnote: Note that the paper (Tversky and Kahneman, 1983) also contains a different set of eight answer options for the Linda problem on page 297. The two option version on page 299 is prevalently used as a canonical version of the Linda problem in subsequent research (cf. the seminal paper of Gigerenzer (1996, page 592)), and is referred to by Daniel Kahneman as the “more direct version” of the Linda problem (Kahneman, 2003, page 712).) In this problem (Figure 2), subjects are asked to compare conditional probabilities $P(B \mid L)$ and $P(F \wedge B \mid L)$, where $B$ refers to “bank teller”, $F$ to “active in feminist movement” and $L$ to the description of Linda (Bar-Hillel, 1991).
Multiple studies have shown that humans tend to consistently select the second, longer hypothesis, which is in conflict with the elementary law of probability: the probability of a conjunction, $P(X \wedge Y)$, cannot exceed the probability of its constituents, $P(X)$ and $P(Y)$ (Tversky and Kahneman, 1983). In other words, it always holds for the Linda problem that
$$P(F \wedge B \mid L) \leq P(B \mid L).$$
Preference for alternative $F \wedge B$ (option b in Figure 2) is thus always a logical fallacy. The conjunction fallacy has been shown to hold across multiple settings (hypothetical scenarios, real-life domains), as well as for various kinds of subjects (university students, children, experts, as well as statistically sophisticated individuals) (Tentori and Crupi, 2012).
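The conjunction rule can be checked mechanically on any finite population. The following is a minimal sketch, assuming a made-up population of people who match Linda's description; the counts, the `probability` helper and the variable names are illustrative and do not come from any of the cited studies.

```python
from fractions import Fraction

# Hypothetical population of people matching Linda's description (L).
# Each record is (is_bank_teller, is_feminist); the counts are invented.
population = (
    [(True, True)] * 4        # bank teller and feminist
    + [(True, False)] * 1     # bank teller only
    + [(False, True)] * 80    # feminist only
    + [(False, False)] * 15   # neither
)

def probability(predicate, records):
    """Relative frequency of the records that satisfy the predicate."""
    hits = sum(1 for r in records if predicate(r))
    return Fraction(hits, len(records))

p_b = probability(lambda r: r[0], population)            # P(B | L)
p_f = probability(lambda r: r[1], population)            # P(F | L)
p_fb = probability(lambda r: r[0] and r[1], population)  # P(F and B | L)

print(p_b, p_f, p_fb)
```

Even though being a feminist is far more frequent than being a bank teller in this invented population, the conjunction is still no more probable than either constituent, because every record counted for the conjunction is also counted for each constituent.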
According to Tversky and Kahneman (1983), the results of the conjunctive fallacy experiments manifest that a conjunction can be more representative than one of its constituents. The conjunctive fallacy is a symptom of a more general phenomenon, in which people have a tendency to overestimate the probabilities of representative events and underestimate those of less representative ones. The reason is attributed to the application of the representativeness heuristic (Tversky and Kahneman, 1983). This heuristic provides humans with means for assessing a probability of an uncertain event. It is used to answer questions such as “What is the probability that object A belongs to class B? What is the probability that event A originates from process B?” According to the representativeness heuristic, probabilities are evaluated by the degree to which A is representative of B, that is by the degree to which A resembles B (Tversky and Kahneman, 1974).
The representativeness heuristic is not the only explanation for the results of the conjunctive fallacy experiments. Hertwig et al. (2008) hypothesized that the fallacy is caused by “a misunderstanding about conjunction”, in other words by a different interpretation of “probability” and “and” by the subjects than assumed by the experimenters. The validity of this alternate hypothesis has been subject to criticism (Tentori and Crupi, 2012); nevertheless, the problem of correct understanding of “and” exists and is of particular importance to machine learning. Another proposed hypothesis for explaining the conjunctive fallacy is the averaging heuristic (Fantino et al., 1997) (cf. Section 5.8).
Tversky and Kahneman (1983) report that 85% of the subjects indicate (b) as the more probable option for the Linda problem, which is defined in Figure 2. The actual proportion varies across studies: 83% was reported when the experiment was replicated by Hertwig and Gigerenzer (1999), and 58% when replicated by Charness et al. (2010).
Implications for rule learning
Rules are not composed only of conditions, but also of an outcome (value of a target variable). A higher number of conditions generally allows the rule to filter a purer set of objects with respect to the value of the target variable than a smaller number of conditions. This means that the conjunctive fallacy does not directly manifest itself when interpreting rule learning results, since it cannot be stated that the selection of a longer rule is a reasoning error in the rule learning context, even in cases when the set of conditions of the longer rule subsumes the set of conditions of the shorter rule. Nevertheless, application of the representativeness heuristic can affect human perception of rule plausibility, in that rules that are more “representative” of the user’s mental image of the concept may be preferred even in cases when their objective discriminatory power may be lower.
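To illustrate why preferring the longer rule is not in itself an error, the following sketch compares a short rule with a longer rule whose conditions subsume it. The toy data set, the attribute names and the `rule_stats` helper are hypothetical; support is taken here as the absolute number of covered instances with the predicted outcome, which is one of several conventions.

```python
# Toy data set; attribute names and values are invented for illustration.
rows = [
    {"outlook": "sunny", "windy": False, "play": "yes"},
    {"outlook": "sunny", "windy": False, "play": "yes"},
    {"outlook": "sunny", "windy": True, "play": "yes"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "rainy", "windy": False, "play": "no"},
    {"outlook": "rainy", "windy": True, "play": "no"},
]

def rule_stats(conditions, target, rows):
    """Return (support, confidence) of the rule IF conditions THEN target.

    support    = number of rows matching both the conditions and the target
    confidence = support / number of rows matching the conditions
    """
    covered = [r for r in rows if all(r[a] == v for a, v in conditions.items())]
    correct = [r for r in covered if all(r[a] == v for a, v in target.items())]
    confidence = len(correct) / len(covered) if covered else 0.0
    return len(correct), confidence

short_rule = rule_stats({"outlook": "sunny"}, {"play": "yes"}, rows)
long_rule = rule_stats({"outlook": "sunny", "windy": False}, {"play": "yes"}, rows)

print("short rule:", short_rule)
print("long rule: ", long_rule)
```

In this toy example the added condition raises the confidence of the rule while lowering its support, so the longer rule covers a purer but smaller set of objects.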
Several factors have been identified that decrease the proportion of subjects exhibiting the conjunctive fallacy, an undesired consequence of applying the representativeness heuristic when it is not rational to do so:
Charness et al. (2010) found that the number of committed fallacies is reduced under a monetary incentive. Such an addition is reported to drop the fallacy rate to 33%. The rate observed under a monetary incentive hints that this problem is less important for real-life decisions.
Zizzo et al. (2000) found that unless the decision problem is simplified neither monetary incentives nor feedback can ameliorate the fallacy rate. A reduced task complexity is a precondition for monetary incentives and feedback to be effective.
Stolarz-Fantino et al. (1996) observed that the number of fallacies is reduced but still strongly present when the subjects receive training in logic.
5.4 Availability Heuristic (Judgment)
The availability heuristic is a judgmental heuristic in which a person evaluates the frequency of classes or the probability of events by the ease with which relevant instances come to mind. This heuristic is explained by its discoverers, Tversky and Kahneman (1973), as follows: “That associative bonds are strengthened by repetition is perhaps the oldest law of memory known to man. The availability heuristic exploits the inverse form of this law, that is, it uses the strength of the association as a basis for the judgment of frequency.”
To determine availability, it is sufficient to assess the ease with which instances or associations could be brought to mind – it is not necessary to perform the actual operations of retrieval or construction. An illustration of this phenomenon by Tversky and Kahneman (1973) is: “One may estimate the probability that a politician will lose an election by considering the various ways he may lose support.”
Success rates reported for the availability heuristic vary widely and depend greatly on the experimental setup. Among other factors, they depend on the ease of recall (Schwarz et al., 1991). In one of the original experiments (judgment of word frequency) presented by Tversky and Kahneman (1973), the number of wrong judgments was 105 out of 152 (70%). The task was to estimate whether the letter “R” appears more frequently in the first or the third position in English texts. The reason why most subjects incorrectly assumed the first position is that it is easier to recall words starting with R than words with R in the third position.
Implications for rule learning
The application of the availability heuristic is based on the perceived association between the literals in the antecedent and the consequent of the rule. The stronger this perceived association, the higher the perceived confidence of the rule. It is our opinion that this heuristic will favour longer rules, since they have a higher chance of containing a literal which the analyst perceives as associated with the predicted label.
It is true that the longer rule is also more likely to contain literals not perceived as associated. It can be argued that while the remaining weakly associated literals will decrease the preference for the longer rule, this effect can be attributed to the weak evidence heuristic rather than the availability heuristic. However, according to our literature review, the availability heuristic can only increase the preference level.
Our initial review did not reveal any debiasing strategies for the availability heuristic. From a broader perspective, availability is associated with the associative System 1, which can be corrected by the rule-based System 2 (Kahneman, 2003). Therefore, inducing conditions known to trigger engagement of System 2 could be effective.
5.5 Effect of Difficulty (Judgment)
When an analyst is supposed to give a preference judgment between two competing hypotheses, one of the factors used in the decision making process is the difficulty of the problem and the corresponding confidence that is related to the judgment.
Griffin and Tversky (1992) developed a model that combines the strength of evidence with its weight (credibility). Their main research finding is that people tend to combine strength with weight in suboptimal ways, resulting in the decision maker being more or less confident about the hypothesis at hand than would be normatively appropriate given the information available. This discrepancy between the normative confidence and the decision maker’s confidence is called overconfidence or underconfidence. Research has revealed systematic patterns in overconfidence and underconfidence:
If the estimated difference between the two hypotheses is large, so that it is easy to say which one is better, there is a pattern of underconfidence.
As the degree of difficulty rises (the difference between the normative confidence of two competing hypotheses is decreasing), there is a strengthening pattern of overconfidence.
People use the provided data to assess the hypothesis at hand but they insufficiently regard the quality of the data. Griffin and Tversky (1992) illustrate this manifestation of bounded rationality as follows: “If people focus primarily on the warmth of the recommendation with insufficient regard for the credibility of the writer, or the correlation between the predictor and the criterion, they will be overconfident when they encounter a glowing letter based on casual contact, and they will be underconfident when they encounter a moderately positive letter from a highly knowledgeable source.”
Griffin and Tversky (1992) used regression to analyze the relation between the strength of evidence and weight of evidence. The conclusion was that the regression coefficient for strength was larger than the regression coefficient for weight for 30 out of 35 subjects, which was found statistically significant. The median ratio of these coefficients was established to be 2.2 to 1 in favour of strength.
Implications for rule learning
The strongest overconfidence was recorded for problems where the weight of evidence is low and the strength of evidence is high. This directly applies to rules with a high value of confidence and a low value of support. These are typically the longer rules. The empirical results related to the effect of difficulty therefore suggest that the predictive ability of such rules will be substantially overrated by analysts. This is particularly interesting because rule learning algorithms often suffer from a tendency to unduly prefer overly specific rules that have a high confidence on small parts of the data to more general rules that have a somewhat lower confidence, a phenomenon also known as overfitting. The above-mentioned results seem to indicate that humans suffer from a similar problem (albeit presumably for different reasons), which implies, for example, that a human-in-the-loop solution may not alleviate this problem.
Similar to fighting overfitting in machine learning, we conjecture that this effect could be ameliorated by filtering out of the output those rules that do not pass a statistical significance test, and by informing users about the value and meaning of statistical significance.
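A filter of this kind could look as follows. This is only a sketch under stated assumptions: the text does not prescribe a particular test, so a one-sided exact binomial test against the base rate of the predicted class is used as a stand-in, the rule records and the `alpha` threshold are hypothetical, and no correction for multiple comparisons is applied.

```python
from math import comb

def binomial_p_value(correct, covered, base_rate):
    """One-sided p-value: probability of at least `correct` hits among
    `covered` instances if the rule were no better than the base rate."""
    return sum(
        comb(covered, k) * base_rate**k * (1 - base_rate) ** (covered - k)
        for k in range(correct, covered + 1)
    )

def filter_rules(rules, base_rate, alpha=0.05):
    """Keep only rules whose confidence is significantly above the base rate."""
    return [
        r for r in rules
        if binomial_p_value(r["correct"], r["covered"], base_rate) < alpha
    ]

# Hypothetical rules: a specific rule with perfect confidence on three
# instances and a general rule with lower confidence on sixty instances.
rules = [
    {"name": "long rule", "correct": 3, "covered": 3},     # confidence 1.00
    {"name": "short rule", "correct": 45, "covered": 60},  # confidence 0.75
]
kept = filter_rules(rules, base_rate=0.5)
print([r["name"] for r in kept])
```

With a base rate of 0.5, three correct predictions out of three have a p-value of 0.125 and are filtered out, while the lower-confidence rule backed by sixty instances passes.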
5.6 Mere Exposure Effect (Judgment)
According to this heuristic (effect), repeated exposure to an object results in an increased preference for that object. The mere exposure effect and the recognition heuristic are, according to Pachur et al. (2011), two different phenomena, because unlike the latter, the mere exposure effect does not “require that the object is recognized as having been seen before”.
As with other biases, the success rates for the mere exposure effect are very varied and depend greatly on the experimental setup. Among other factors, they depend on whether the stimulus the subject is exposed to is exactly the same as in prior exposure or similar to it (Monahan et al., 2000). Instead of selecting one particular success rate from a specific experiment, we can refer to the well-established finding that when a concrete stimulus is repeatedly exposed, the preference for that stimulus increases logarithmically as a function of the number of exposures (Bornstein, 1989).
Implications for rule learning
Already the initial research of Zajonc (1968) included experimental evidence on the correlation between word frequency and affective connotation of the word. From this it follows that a longer rule—as measured by word length rather than the number of conditions—will have a greater chance of containing a word that the analyst had been strongly exposed to. Moreover, the exposure effects of individual words may possibly add up. This leads to the conclusion that mere exposure effect will increase plausibility of longer rules.
While our limited literature review did not reveal any debiasing techniques, we conjecture that similarly to the related recognition heuristic the knowledge of the criterion variable could ameliorate the mere exposure effect: presenting information on the semantics of the literal as well as on its covariance with other literals may suppress the heuristic.
5.7 Ambiguity Aversion
Ambiguity aversion corresponds to the finding that humans tend to prefer known risks over unknown risks. It is not a reasoning error. Consider the following comparison with the conjunctive fallacy. When the conjunctive fallacy is explained to typical subjects, they will recognize their reasoning as an “error”, and, as Al-Najjar and Weinstein (2009) put it, the subjects “feel embarrassed” for their irrational choice. This contrasts with the ambiguity aversion, as for example demonstrated by the Ellsberg paradox (Ellsberg, 1961), which shows that humans tend to systematically prefer a bet with a known albeit very small probability of winning over a bet with a not precisely known probability of winning, even if it would in practice mean a near guarantee of winning.
As follows from the research of Camerer and Weber (1992), ambiguity aversion is related to the information bias: the demand for information in cases when it has no effect on decision can be explained by the aversion to ambiguity: people dislike having missing information.
As noted by Camerer and Weber (1992), Ellsberg did not perform careful experiments. According to the same paper, follow-up empirical work can be divided into three categories: replications of Ellsberg’s experiment, determination of psychological causes of ambiguity, and studies of ambiguity in applied setting. The most relevant to our work are experiments focusing on the applied setting. Curley et al. (1984) describe an experiment in a medical domain where 20% of subjects avoided ambiguous treatments.
Implications for rule learning
The ambiguity aversion may have profound implications for rule learning. The typical data mining task will contain a number of attributes the analyst has no or very limited knowledge of. The ambiguity aversion will manifest itself in a preference for rules that do not contain ambiguous attributes or literals. Ambiguity aversion may also steer the analyst to shorter rules as these can be expected to have lower chance of containing an ambiguous literal.
We conjecture that this bias would be alleviated if textual description of the meaning of all the literals is made easily accessible to the analyst.
5.8 Averaging Heuristic
While the representativeness heuristic is the heuristic most commonly associated with the conjunctive fallacy, the averaging heuristic provides an alternate explanation: it suggests that people evaluate the probability of a conjoint event as the average of the probabilities of the component events (Fantino et al., 1997).
As reported by Zizzo et al. (2000): “approximately 49% of variance in subjects’ conjunctions could be accounted for by a model that simply averaged the separate component likelihoods that constituted a particular conjunction.” This high success rate suggests that the averaging heuristic may be an important subject of further study within machine learning.
Implications for rule learning
The averaging heuristic can be interpreted to increase preference for longer rules. The reason is that longer rules are more likely to contain literals with low probability. Due to the application of the averaging heuristic the analyst may not fully realise the consequences of the presence of a low-probability literal for the overall likelihood of the set of conditions in the antecedent of the rule.
Consider the following example: Let us assume that the learning algorithm only adds independent conditions that have a probability of $0.8$, and we compare a 3-condition rule to a 2-condition rule. Averaging would evaluate both rules equally, because both have an average probability of $0.8$. A correct computation of the joint probability, however, shows that the longer rule is considerably less likely ($0.8^3 = 0.512$ vs. $0.8^2 = 0.64$, because all conditions are assumed to be independent).
Averaging can also affect same-length rules. Fantino et al. (1997) derive from their experiments on the averaging heuristic that humans tend to judge “unlikely information [to be] relatively more important than likely information.” Continuing our example, if we compare the above 2-condition rule with another rule with two features with more diverse probability values, e.g., one condition has $1.0$ and the other has $0.6$, then averaging would again evaluate both rules the same, but in fact the correct interpretation would be that the rule with equal probabilities is more likely than the other ($0.8 \times 0.8 = 0.64 > 1.0 \times 0.6 = 0.6$). In this case, the low 0.6 probability in the new rule would “knock down” the normative conjoint probability below the one of the rule with two 0.8 conditions.
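The arithmetic behind this example can be reproduced with a few lines of code. The sketch below assumes independent conditions, a probability of 0.8 for the equal-probability conditions, and 1.0 and 0.6 for the diverse pair, which is one choice that keeps the average at 0.8; the function and rule names are illustrative.

```python
from math import prod

def averaged(probabilities):
    """Judgment predicted by the averaging heuristic."""
    return sum(probabilities) / len(probabilities)

def joint(probabilities):
    """Normative probability that all conditions hold, assuming independence."""
    return prod(probabilities)

rules = {
    "3 conditions": [0.8, 0.8, 0.8],
    "2 conditions": [0.8, 0.8],
    "2 conditions, diverse": [1.0, 0.6],
}

for name, ps in rules.items():
    print(f"{name}: averaged={averaged(ps):.3f} joint={joint(ps):.3f}")
```

All three rules receive the same averaged value, while the joint probabilities differ, which is the gap between the heuristic and the normative judgment described above.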
Experiments presented in Zizzo et al. (2000) showed that prior knowledge of probability theory, and a direct reminder of how probabilities are combined, are effective tools for decreasing the incidence of conjunctive fallacy, which is the hypothesized consequence of the averaging heuristic.
5.9 Confusion of the Inverse
This effect corresponds to confusing the probability of cause and effect, or, formally, the confidence of an implication $A \rightarrow B$ with that of its inverse $B \rightarrow A$, i.e., $P(B \mid A)$ is confused with $P(A \mid B)$. This confusion may manifest itself most strongly in the area of association rule learning, where an attribute can be of interest to the analyst both in the antecedent and in the consequent of a rule.
In a study referenced from Plous (1993) this fallacy was committed by 95% of physicians involved.
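How far the confidence of a rule can lie from the confidence of its inverse is easy to see on a small contingency table. The following sketch uses invented counts for a diagnostic setting; they are not the figures of the study referenced from Plous (1993), and the `confidence` helper is hypothetical.

```python
def confidence(n_both, n_antecedent):
    """Confidence of a rule: P(consequent | antecedent) estimated from counts."""
    return n_both / n_antecedent

# Hypothetical counts in a population of 1000 patients.
n_disease = 10    # patients who have the disease
n_positive = 59   # patients with a positive test (9 true, 50 false positives)
n_both = 9        # patients with the disease and a positive test

# IF disease THEN positive: P(positive | disease)
conf_disease_to_positive = confidence(n_both, n_disease)
# IF positive THEN disease: P(disease | positive)
conf_positive_to_disease = confidence(n_both, n_positive)

print(round(conf_disease_to_positive, 3), round(conf_positive_to_disease, 3))
```

Both rules are built from the same two attributes and the same co-occurrence count, yet their confidences differ by a large factor because the antecedents cover very different numbers of instances.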
Implications for rule learning
Obviously, the confusion of the direction of an implication sign has its consequences on the interpretation of a rule. Already Michalski (1983) has noted that there are two different kinds of rules, discriminative and characteristic. Discriminative rules can quickly discriminate an object of one category from objects of other categories. A simple example is the rule
IF trunk THEN elephant
which states that an animal with a trunk is an elephant. This implication provides a simple but effective rule for recognizing elephants among all animals.
Characteristic rules, on the other hand, try to capture all properties that are common to the objects of the target class. A rule for characterizing elephants could be
IF elephant THEN heavy, large, grey, bigEars, tusks, trunk.
Note that here the implication sign is reversed: we list all properties that are implied by the target class, i.e., by an animal being an elephant. From the point of understandability, characteristic rules are often preferable to discriminative rules. For example, in a customer profiling application, we might prefer to not only list a few characteristics that discriminate one customer group from the other, but are interested in all characteristics of each customer group.
Characteristic rules are very much related to formal concept analysis (Wille, 1982; Ganter and Wille, 1999). Informally, a concept is defined by its intent (the description of the concept, i.e., the conditions of its defining rule) and its extent (the instances that are covered by these conditions). A formal concept is then a concept where the extension and the intension are Pareto-maximal, i.e., a concept where no conditions can be added without reducing the number of covered examples. In Michalski’s terminology, a formal concept is both discriminative and characteristic, i.e., a rule where the head is equivalent to the body.
The confusion of the inverse thus seems to imply that humans will not clearly distinguish between these types of rules, and, in particular, tend to interpret an implication as an equivalence. From this, we can infer that characteristic rules, which add all possible conditions even if they do not have additional discriminative power, may be preferable to short discriminative rules.
Edgell et al. (2004) studied the effect of training analysts in probability theory and concluded that it is not effective in addressing the confusion of the inverse fallacy. Our literature review did not reveal any other applicable work.
5.10 Context and Tradeoff Contrast
Tversky and Simonson (1993) developed a theory that combines background context defined by prior options with local context which is given by the choice problem at hand. The contributions of both types of context are additive. While additivity is considered as not essential for the model, it is included because it “provides a good approximation in many situations and because it permits a more parsimonious representation”. The analyst adjusts the relative weights of attributes in the light of tradeoffs implied by the background.
The reference application scenario for the tradeoff contrast is that selection of one of the available alternatives, such as products or job candidates, can be manipulated by the addition or deletion of alternatives that are otherwise irrelevant. Tversky and Simonson (1993) attribute the tradeoff effect to the fact that “people often do not have a global preference order and, as a result, they use the context to identify the most ’attractive’ option.”
In one of the experiments described by Tversky and Simonson (1993), subjects were asked to choose between two microwave ovens (Panasonic priced 180 USD and Emerson priced 110 USD), both a third off the regular price. Of the subjects, 57% chose Emerson and 43% chose Panasonic. Another group of subjects was presented the same problem with the following manipulation: A more expensive Panasonic valued at 200 USD (10% off the regular price) was added to the list of possible options. The newly added device was described so as to appear inferior to the other Panasonic, but not to the Emerson device. After this manipulation, only 13% chose the more expensive Panasonic, but the number of subjects choosing the less expensive Panasonic rose from 43% to 60%.
It should be noted that according to Tversky and Simonson (1993) if people have well-articulated preferences, the background context has no effect on the decision.
Implications for rule learning
In rule learning, context manipulation will typically not be deliberate but a systematic result of the algorithmic process. It will manifest by presence of redundant rules or attributes within rules on the output.
The influence of context can be manifested by preference towards longer rules. The reason is that if a rule contains a literal with unknown predictive power and multiple other literals with known (positive) predictive power for the consequent of the rule, these known literals create a context which may make the analyst believe that also the unknown literal has positive predictive power. By doing so, the context provided by the longer rule can soften the effects of ambiguity aversion, which would otherwise have made the analyst prefer the shorter rule (cf. Subsection 5.7), and through the information bias (cf. Subsection 5.12) further increase the preference for the longer rule.
An attempt to make contextual attributes explicit was made by Gamberger and Lavrač (2003), who introduced supporting factors as a means for complementing the explanation delivered by conventional learned rules. Essentially, supporting factors are additional attributes that are not part of the learned rule, but nevertheless have very different distributions with respect to the classes of the application domain. In line with the results of Kononenko (1993), medical experts found that these supporting factors increase the plausibility of the found rules.
We conjecture that similarly to other effects, the influence of context can be suppressed by reducing the number of rules the analyst is presented and removal of irrelevant literals from the remaining rules.
5.11 Disjunction Fallacy
The disjunction fallacy is demonstrated by assessing the probability $P(X)$ to be higher than the probability $P(Z)$, where $Z$ is a union of event $X$ with another event $Y$. Bar-Hillel and Neter (1993) explain the disjunction fallacy with a preference for the narrower possibility over the broader one. In case the narrower category is unlikely, the broader possibility is preferred.
In experiments reported by Bar-Hillel and Neter (1993), $X$ and $Z$ were nested pairs of categories, such as Brazil and Latin America. Subjects were assigned problems such as: “Writes letter home describing a country with snowy wild mountains, clean streets, and flower decked porches. Where was the letter written?” It follows that since Latin America contains Brazil, the normative answer is Latin America. However, Brazil was the most likely answer.
The rate of the disjunction fallacy in the experiment presented by Bar-Hillel and Neter (1993) averaged 64%. The authors offer two explanations for why this is a lower fallacy rate than for the conjunction fallacy. The first one is that the disjunction rule is more compelling than the conjunction rule. The second favoured explanation is that the Linda experiments in Tversky and Kahneman (1983) used highly non-representative categories (bank teller), while in (Bar-Hillel and Neter, 1993) both levels of categories (Brazil and Latin America) were representative.
Implications for rule learning
In a data mining context, it can be the case that the feature space is hierarchically ordered. The analyst can thus be confronted with rules containing attributes (literals) on multiple levels of granularity. Following the disjunction fallacy, the analyst will generally prefer rules containing more specific attributes, which can result in a preference for rules with fewer backing instances and thus weaker statistical validity.
The disjunction fallacy can be generally expected to bias the analysts towards longer rules since these have a higher chance of containing a literal corresponding to a narrower category.
We conjecture that disjunction fallacy could be alleviated by making the analysts aware of the taxonomical relation of the individual attributes and educating them on the benefits of larger supporting sample, which is associated with more general attributes.
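The relation between the granularity of an attribute and the number of backing instances can be made visible by evaluating the same rule at two levels of a taxonomy. The sketch below assumes a hypothetical two-level taxonomy and invented observations; `rule_stats` is an illustrative helper.

```python
# Hypothetical two-level taxonomy and toy observations (country, outcome).
taxonomy = {
    "Brazil": "Latin America",
    "Peru": "Latin America",
    "Chile": "Latin America",
    "France": "Europe",
}
observations = [
    ("Brazil", "yes"), ("Brazil", "yes"),
    ("Peru", "yes"), ("Peru", "yes"),
    ("Chile", "no"), ("France", "no"),
]

def rule_stats(matches, outcome, observations):
    """Return (covered, confidence) for the rule IF matches(country) THEN outcome."""
    covered = [o for c, o in observations if matches(c)]
    confidence = covered.count(outcome) / len(covered) if covered else 0.0
    return len(covered), confidence

# Rule on the narrower category versus the rule on its parent category.
specific = rule_stats(lambda c: c == "Brazil", "yes", observations)
general = rule_stats(lambda c: taxonomy[c] == "Latin America", "yes", observations)

print("specific:", specific)
print("general: ", general)
```

The rule on the narrower category looks stronger by confidence alone, but it is backed by fewer instances than the rule on the broader category, which is the trade-off the analyst would need to be made aware of.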
5.12 Information Bias
Information bias relates to the tendency of people to consider more available information to improve the perceived validity of a statement even if the additional information is not relevant. The typical manifestation of the information bias is evaluating questions as worth asking even when the answer cannot affect the hypothesis that will be accepted (Baron et al., 1988).
Baron et al. (1988) performed four experiments to show the effect of information bias. For example, in their fourth experiment, subjects were asked to assess to what degree a medical test is suitable for deciding which of the three diseases to treat using a scale from 0 to 100. The test detected a chemical “Tutone”, which was with certain given probability associated with each of the three diseases. This probability was varied across the cases. There were ten cases evaluated, the test could normatively help only in two of those (no. 2 and 9)—the correct answer for the remaining eight was thus 0. For example, in case no. 1 and 10 the probability of Tutone being associated with all three diseases was equal—the knowledge of Tutone presence had no value for distinguishing between the three diseases—and the normative answer was 0. Even for these simple cases, the mean ratings were 21 and 9 instead of 0. The normative answer for cases 2 and 9 was equal at 24, while the subjects assigned 61 and 75 respectively.
Implications for rule learning
Rules often contain redundant, or nearly redundant conditions. By redundant it is meant that the knowledge of the particular piece of information represented by the additional condition (literal) has no or very small effect on rule quality. According to information bias, a rule containing additional (redundant) literals may be preferred to a rule not containing this literal. The information bias clearly steers the analyst towards longer rules.
We conjecture that this bias would be alleviated by a visualization of the information value (e.g. by predictive strength) of individual conditions in the rule.
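A minimal sketch of how such an information value could be computed is given below. It is an illustration under our own assumptions rather than a method taken from the cited literature: the "predictive strength" of a literal is taken to be the drop in rule confidence observed when that literal is removed from the antecedent, and the names `confidence`, `literal_contributions` and the toy dataset are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
Literal = Tuple[str, str]  # (attribute, value)


def confidence(rows: List[Row], antecedent: List[Literal], target: str, label: str) -> float:
    """Share of rows matching the antecedent that also carry the predicted label."""
    covered = [r for r in rows if all(r.get(a) == v for a, v in antecedent)]
    if not covered:
        return 0.0
    return sum(r[target] == label for r in covered) / len(covered)


def literal_contributions(rows: List[Row], antecedent: List[Literal],
                          target: str, label: str) -> Dict[Literal, float]:
    """Confidence lost when each literal is dropped from the rule, one at a time."""
    full = confidence(rows, antecedent, target, label)
    result = {}
    for literal in antecedent:
        reduced = [other for other in antecedent if other != literal]
        result[literal] = full - confidence(rows, reduced, target, label)
    return result


# Toy data (assumed): "day" carries no information about "play" among sunny days.
rows = [
    {"outlook": "sunny", "day": "weekday", "play": "yes"},
    {"outlook": "sunny", "day": "weekday", "play": "yes"},
    {"outlook": "sunny", "day": "weekend", "play": "yes"},
    {"outlook": "sunny", "day": "weekend", "play": "yes"},
    {"outlook": "rainy", "day": "weekday", "play": "no"},
    {"outlook": "rainy", "day": "weekday", "play": "no"},
    {"outlook": "rainy", "day": "weekend", "play": "no"},
    {"outlook": "rainy", "day": "weekend", "play": "yes"},
]
rule = [("outlook", "sunny"), ("day", "weekday")]
contributions = literal_contributions(rows, rule, target="play", label="yes")
```

In this toy example, dropping `outlook=sunny` lowers the confidence of the rule, whereas dropping `day=weekday` leaves it unchanged, which would mark the latter as a redundant literal in a visualization.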
5.13 Insensitivity to Sample Size
This effect implies that analysts are unable to appreciate the increased reliability of the confidence estimate with increasing value of support, i.e., they fail to appreciate that the strength of the connection between antecedent and consequent of a rule becomes more reliable with an increasing number of observations. Unlike the base-rate fallacy, this effect assumes that the size of the sample is understood: while the base-rate fallacy deals with the more complex case when people are presented with probabilistic information but are unable to understand it correctly, insensitivity to sample size is the related problem that people underestimate the increased robustness of estimates that are made on a larger sample.
Another bias to which insensitivity to sample size is connected is the frequency illusion, which relates to an overestimation of the base rate of an event as a result of selective attention and confirmation bias.
When the insensitivity to sample size effect was introduced by Tversky and Kahneman (1974), it was supported by experimental results from the so-called hospital problem. In this problem, subjects are asked which hospital is more likely to record more days in which more than 60 percent of the newborns are boys. The options are a larger hospital, a smaller hospital or two hospitals with about the same size. The correct expected answer—the smaller hospital—was chosen only by 22% of subjects, the fallacy rate is thus 78%. The experimental subjects were 95 undergraduate students.
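The normative answer to the hospital problem can be checked with a short binomial calculation. The sketch below assumes, for illustration only, 15 births per day in the smaller hospital and 45 in the larger one, and an equal probability of a boy and a girl; these numbers are not stated in the text above. The function name `prob_more_than_percent` is hypothetical.

```python
from math import comb


def prob_more_than_percent(n_births: int, percent: int = 60, p_boy: float = 0.5) -> float:
    """P(more than `percent` % of n_births are boys) under a binomial model."""
    return sum(
        comb(n_births, k) * p_boy ** k * (1 - p_boy) ** (n_births - k)
        for k in range(n_births + 1)
        if k * 100 > percent * n_births  # integer comparison avoids rounding issues
    )


small = prob_more_than_percent(15)  # assumed size of the smaller hospital
large = prob_more_than_percent(45)  # assumed size of the larger hospital
```

Under these assumptions `small` is about 0.15 and exceeds `large`, so the smaller hospital is expected to record more such days, simply because proportions computed on smaller samples fluctuate more.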
Implications for rule learning
If confronted with two rules, where one of them has a slightly higher confidence and the second rule a higher support, this cognitive bias states that the analyst will prefer the rule with higher confidence (all other factors equal). As typically rule length trades off coverage and precision—longer rules tend to be more precise but cover fewer examples—this may result in a preference for longer rules.
In our opinion, one possible approach for mitigation of this bias in rule learning research is to use the value of support to compute confidence (reliability) intervals for the value of confidence. Such confidence interval might be better understood than the original “raw” value of support.
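As a sketch of this idea, the interval could be computed with the Wilson score interval for a binomial proportion. The choice of the Wilson interval, the 95% level (`z = 1.96`) and the function name `wilson_interval` are our assumptions for illustration; other interval estimators could be used instead.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(hits: int, covered: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for rule confidence = hits / covered.

    hits    -- instances matching both antecedent and consequent
    covered -- instances matching the antecedent
    """
    if covered == 0:
        raise ValueError("rule covers no instances")
    p = hits / covered
    denom = 1 + z * z / covered
    centre = (p + z * z / (2 * covered)) / denom
    half = z * sqrt(p * (1 - p) / covered + z * z / (4 * covered * covered)) / denom
    return centre - half, centre + half


# Rule A: confidence 0.90 on 10 covered instances (low support).
low_a, high_a = wilson_interval(9, 10)
# Rule B: confidence 0.85 on 1000 covered instances (high support).
low_b, high_b = wilson_interval(850, 1000)
```

Although Rule A has the higher point estimate, its interval is much wider and its lower bound lies below that of Rule B, which conveys the information that the raw support value is meant to carry.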
5.14 Recognition Heuristic
Pachur et al. (2011) define the recognition heuristic as follows: “For two-alternative choice tasks, where one has to decide which of two objects scores higher on a criterion, the heuristic can be stated as follows: If one object is recognized, but not the other, then infer that the recognized object has a higher value on the criterion.” For example, when asked which of the two cities, Chongqing or Hong Kong, is bigger, subjects from the Western hemisphere tend to choose the latter because it is much better known.
The recognition heuristic can be differentiated from the availability heuristic as follows: “To make an inference, one version of the availability heuristic retrieves instances of the target event categories, such as the number of people one knows who have cancer compared to the number of people who have suffered from a stroke (Hertwig et al., 2005). The recognition heuristic, by contrast, bases the inference simply on the ability (or lack thereof) to recognize the names of the event categories.” (Pachur et al., 2011).
An experiment performed by Goldstein and Gigerenzer (1999) focused on estimating which of two cities in a presented pair is more populated. The estimates were analysed with respect to the recognition of the cities by subjects. The median proportion of judgments complying to the recognition heuristic was 93%. It should be noted that the application of this heuristic is in this case ecologically justified since recognition will be related to how many times the city appeared in a newspaper report, which in turn is related to the city size (Beaman et al., 2006).
Implications for rule learning
The recognition heuristic can manifest itself by preference for rules containing a recognized literal or attribute in the antecedent of the rule. Since the odds that a literal will be recognized increase with the length of the rule, the recognition heuristic generally increases the preference for longer rules.
One could argue that for longer rules, the odds of occurrence of an unrecognized literal will also increase. The counterargument is the empirical finding that – under time pressure – people assign a higher value to recognized objects than to unrecognized objects. This happens also in situations when recognition is a poor cue (Pachur and Hertwig, 2006).
As to the alleviation of effects of recognition heuristic in situations where it is ecologically unsuitable, Pachur and Hertwig (2006) note that suspension of the heuristic requires additional time or the direct knowledge of the “criterion variable”. This coincides with the intuition that the interpretation of rule learning results by experts should be less prone to recognition heuristic. However, in typical real-world machine learning tasks the data can include a high number of attributes that even subject-matter experts are not acquainted with in detail. When these recognized – but not understood – attributes are present in the rule model even the experts are liable to the recognition heuristic. We therefore conjecture that the experts can strongly benefit from easily accessible information on the meaning of individual attributes and literals.
5.15 Negativity Bias
According to this bias, negative evidence tends to have a greater effect than neutral or positive evidence of equal intensity.
Extensive experimental evidence for negativity bias was summarized by Rozin and Royzman (2001) for a range of domains. The most relevant to our focus appears to be the domain of attention and salience. In the experiments reported by Pratto and John (2005), it was investigated whether the valence of a word (desirable or undesirable trait) has an effect on the time required to identify the color in which the word appears on the screen. The result was that the subjects took 29 ms longer to name the color of an undesirable word than of a desirable word (679 vs 650 ms). As for the number of subjects affected, for 9 out of the 11 subjects the mean latency was higher for undesirable words. The authors explain the higher response time for undesirable words by the undesirable traits attracting more attention.
Implications for rule learning
There are two types of effects that we discuss in the following: 1) effect of a negated literal in the antecedent and 2) effect of a negative class in the consequent.
Most rule learning algorithms are capable of generating rules containing negated literals. For example, male gender can be represented as not(female). According to the negativity bias, the negative formulation of the same information will be given higher weight.
Considering a binary classification task, when one class is viewed as “positive” and the other class as “negative”, the rule model may contain a mix of rules with the positive and negative class in the consequent. According to the negativity bias rules with the negative class in the consequent will be given higher weight. This bias can also manifest in the multiclass setting, when one or more classes can be considered as “negative”. This effect can manifest also in the subsequent decision making based on the discovered and presented rules, because according to the principle of negative potency (Rozin and Royzman, 2001) and prospect theory (Kahneman and Tversky, 1979) people are more concerned with the potential losses than gains.
An interesting discovery applicable to both negation in antecedent and consequent shows that negativity is an “attention magnet” (Fiske, 1980; Ohira et al., 1998). This implies that a rule predicting a negative class will obtain more attention than a rule predicting a positive class, which may also apply to appearance of negated literals in the antecedent. Also, research suggests that negative information is better memorized and subsequently recognized (Robinson-Riegler and Winton, 1996; Ohira et al., 1998).
We conjecture that rule learning systems can mitigate the effects of the negativity bias by avoiding the use of negation: use gender=male instead of not(gender=female).
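For binary attributes, such a rewriting step could be sketched as follows. The representation of a literal as an `(attribute, value, negated)` triple and the function name `remove_negation` are assumptions made for this illustration. Literals over attributes with more than two values are left unchanged, since they have no single positive equivalent.

```python
from typing import Dict, List, Tuple

# (attribute, value, negated); negated=True stands for not(attribute=value)
Literal = Tuple[str, str, bool]


def remove_negation(literal: Literal, domains: Dict[str, List[str]]) -> Literal:
    """Replace not(attr=value) by attr=other_value when attr is binary."""
    attribute, value, negated = literal
    if not negated:
        return literal
    values = domains.get(attribute, [])
    if len(values) == 2 and value in values:
        other = values[0] if values[1] == value else values[1]
        return (attribute, other, False)
    return literal  # no unambiguous positive form, leave as is


domains = {"gender": ["male", "female"], "color": ["red", "green", "blue"]}
rewritten = remove_negation(("gender", "female", True), domains)
unchanged = remove_negation(("color", "red", True), domains)
```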
5.16 Primacy Effect
Once humans form an initial assessment of plausibility (favourability) toward an option, subsequent evaluations of this option will favour the initial disposition.
Bond et al. (2007) investigated to what extent changing the order in which information is presented to a potential buyer affects the propensity to buy. If the positive information (product description) was presented first, 48% of participants indicated they would buy the product. When the negative information (price) was presented first, this number decreased to 22%. Participants were 118 undergraduate students.
Additional experimental evidence was provided by Shteingart et al. (2013).
Implications for rule learning
Following the primacy effect, the analyst will favour rules that are presented first in the rule model. Rule learning algorithms, such as CBA (Liu et al., 1998), are natively capable of taking advantage of the primacy effect, since they naturally create rule models that contain rules sorted by their strength. Others order rules so that more general rules (i.e., rules that cover more examples) are presented first. This typically also corresponds to the order in which rules are learned with the commonly used separate-and-conquer or covering strategy (Fürnkranz, 1999). However, it has been pointed out by Webb (1994) that prepending (adding to the beginning) a new rule to the previously learned rules can produce simpler concepts. The intuition behind this argument is that there are often simple rules that would cover many of the positive examples, but also cover a few negative examples that have to be excluded as exceptions to the rule. Placing the simple general rule near the end of the rule list allows us to handle exceptions with rules that are placed before the general rule and keep the general rule simple. Experimental results confirmed this hypothesis with respect to the complexity of the rules, but did not directly evaluate comprehensibility.
A machine learning application can take advantage of the primacy effect by presenting the rules that are considered most plausible based on observed data first in the resulting rule model.
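One way to obtain such a presentation order is sketched below. Rules are sorted by confidence, then by support, which loosely follows the precedence used in CBA-style rule ranking; the final tie-break by antecedent length, the `Rule` class and the function name `presentation_order` are our assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    antecedent: Tuple[str, ...]
    consequent: str
    confidence: float
    support: float


def presentation_order(rules: List[Rule]) -> List[Rule]:
    """Strongest rules first: confidence, then support, then shorter antecedent."""
    return sorted(rules, key=lambda r: (-r.confidence, -r.support, len(r.antecedent)))


r1 = Rule(("a",), "X", confidence=0.80, support=0.30)
r2 = Rule(("a", "b"), "X", confidence=0.95, support=0.05)
r3 = Rule(("c",), "Y", confidence=0.95, support=0.10)
ordered = presentation_order([r1, r2, r3])
```

Note that this reordering is intended for presenting unordered rule sets. In an ordered rule list, where the first matching rule determines the prediction, changing the order would also change the predictions of the model.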
5.17 Reiteration Effect
The experiment performed by Hasher et al. (1977) presented subjects with general statements and asked them to assess their validity on a 7-point scale. Some of the statements were false and some were true. The experiment was conducted in several sessions, with some of the statements repeated in subsequent sessions. The average validity of repeated true statements rose between Session 1 and Session 3 from 4.52 to 4.80, while for non-repeated statements it dropped slightly. Similarly, for false statements, the validity rose from 4.18 to 4.67 for repeated statements and dropped for non-repeated statements. In this case, repeating false statements increased the subjectively perceived validity by about 12%.
Implications for rule learning
In the rule learning context, “the repeated statement which becomes more believable” corresponds to the entire rule or possibly a “sub rule” consisting of the consequent of the rule and a subset of conditions in its antecedent. A typical rule learning result contains multiple rules that are substantially overlapping. If the analyst is exposed to multiple similar statements, the reiteration effect will increase the analyst’s belief in the repeating “sub rule”. In the rule learning context the bias behind the reiteration effect may not be justified. Especially in the area of association rule learning, a very large set of redundant rules—covering the same, or nearly same set of examples—is routinely included in the output.
A possible remedy for the reiteration effect can be applied already at the algorithmic level by ensuring that rule learning output does not contain redundant rules. This can be achieved by pruning algorithms (Fürnkranz, 1997). We also conjecture that this effect can be alleviated by explaining the redundancy in rule learning output to the analyst, for example by clustering rules.
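The following sketch illustrates one simple notion of redundancy that such a filter could use. It is our assumption for illustration and not necessarily the pruning discussed by Fürnkranz (1997): a rule is dropped if a strictly more general rule (its antecedent is a proper subset) predicts the same class with at least the same confidence. The `Rule` class and the function name `prune_redundant` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Rule:
    antecedent: FrozenSet[str]
    consequent: str
    confidence: float


def prune_redundant(rules: List[Rule]) -> List[Rule]:
    """Keep only rules not dominated by a more general rule for the same class."""
    kept = []
    for rule in rules:
        dominated = any(
            other.consequent == rule.consequent
            and other.antecedent < rule.antecedent  # proper subset: more general
            and other.confidence >= rule.confidence
            for other in rules
        )
        if not dominated:
            kept.append(rule)
    return kept


general = Rule(frozenset({"a"}), "X", 0.90)
redundant = Rule(frozenset({"a", "b"}), "X", 0.90)    # adds b, gains nothing
informative = Rule(frozenset({"a", "c"}), "X", 0.95)  # adds c, confidence improves
pruned = prune_redundant([general, redundant, informative])
```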
5.18 Misunderstanding of “and”
The misunderstanding of “and” is a phenomenon affecting the syntactic comprehensibility of the logical connective “and”. As discussed by Hertwig et al. (2008), “and” in natural language can express several relationships, including temporal order, causal relationship, and most importantly, can also indicate a collection of sets instead of their intersection (as in “He invited friends and colleagues to the party”).
According to the two experiments reported in Hertwig et al. (2008), the conjunction “bank tellers and active feminists” used in the Linda problem (cf. Section 5.3) was found by about half of the subjects as ambiguous—they explicitly asked the experimenter how “and” is to be understood. The experiment involved determining understanding of “and” based on shading of Venn diagrams. The results indicate that 45 subjects interpreted “and” as intersection and 14 subjects as a union. The fallacy rate is thus 23%. Two thirds of subjects were university students and one third of subjects were professionals.
Implications for rule learning
This effect will increase the preference of longer rules for reasons similar to those discussed for the conjunctive fallacy (cf. Subsection 5.3).
According to Sides et al. (2002) “and” ceases to be ambiguous when it is used to connect propositions rather than categories. The authors give the following example of a sentence which is not prone to misunderstanding: “IBM stock will rise tomorrow and Disney stock will fall tomorrow.” Similar wording of rule learning results may be, despite its verbosity, preferred. We further conjecture that representations that visually express the semantics of “and” such as decision trees may be preferred over rules, which do not provide such visual guidance.
5.19 Weak Evidence Effect
According to this effect presenting weak evidence in favour of an outcome can actually decrease the probability that a person assigns to it. In an experiment in the area of forensic science reported by Martire et al. (2013), it was shown that participants presented with evidence weakly supporting guilt tended to “invert” the evidence, thereby counterintuitively reducing their belief in the guilt of the accused.
Martire et al. (2013) performed an experiment in the judicial domain. When the presented evidence provided by the expert was weak, but positive, the number of responses incongruent with the evidence provided was 62%. When the strength of evidence was moderate or high the corresponding average was 13%. The subjects were undergraduate psychology students and Amazon Mechanical Turk workers (altogether over 600 participants).
Implications for rule learning
The weak evidence effect can be directly applied on rules: the evidence is represented by rule antecedent; the consequent corresponds to the outcome. The analyst can intuitively interpret each of the conditions in the antecedent as a piece of evidence in favour of the outcome. Typical of many machine learning problems is the uneven contribution of individual attributes to the prediction. Let us assume that the analyst is aware of the prediction strength of the individual attributes. If the analyst is to choose from a shorter rule containing only the strong predictor and a longer rule containing a strong predictor and a weak (weak enough to trigger this effect) predictor, according to the weak evidence effect the analyst should choose the shorter rule.
Our review did not reveal any debiasing strategies. This is related to the fact that the weak evidence effect is a relatively recent discovery. Our conjecture is that this effect can be alleviated by intentional omission of weak predictors from rules either directly by the rule learner or as part of feature selection.
5.20 Unit Bias
This cognitive bias manifests by humans tending to consider each condition as a unit of equal weight at the expense of detailed scrutiny of the actual effect of the condition (Geier et al., 2006).
The effect of this bias was evaluated by Geier et al. (2006) on three food items: Tootsie Rolls, pretzels and M&Ms. These food items were offered in two sizes/scoops (on different days) and it was observed how this affected consumption. For Tootsie Rolls and M&Ms the larger unit size was four times the smaller one and for pretzels twice the smaller one. It follows from the figure included in Geier et al. (2006) that increasing the size of the unit had about a 50% effect on the amount consumed.
Implications for rule learning
From a technical perspective, the number of conditions (literals) in rules is not important. What matters is the actual discriminatory power of the individual conditions, which can vary substantially. However, following the application of unit bias, the number of conditions affects the subjective perception of discriminatory power of the antecedent as a whole. Under the assumption that the analyst will favour the rule with higher discriminatory power, this heuristic will clearly contribute to a preference for longer rules, since these contain more literals considered as “units”.
Unlike other modes of communication humans are used to, rules resulting from algorithmic analysis of data do not provide clues relating to the importance of individual conditions, since rules often place conditions of vastly different importance side by side, not even maintaining the order from the most important to the least important. Such computer-generated rules violate conversational rules or “maxims”, because they contain conditions which are not informative or relevant. (The relevance maxim is one of four conversational maxims proposed by philosopher Paul Grice; it was related to the conjunctive fallacy in the work of Gigerenzer and Hoffrage (1999), see also Mosconi and Macchi (2001).)
In summary, the application of the unit bias in the context of rule learning can result in gross errors in interpretation. When domain knowledge on the meaning of the literals in the rule is absent, the unit bias can manifest particularly strongly.
We conjecture that informing analysts about the discriminatory power of the individual conditions (literals) may alleviate unit bias. Such an indicator could possibly be generated automatically by listing the number of instances in the entire dataset that meet the condition. Second, rule learning algorithms should ensure that literals are present in the rules in the order of significance, complying with human conversational maxims.
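Both suggestions can be sketched together. In the illustration below, each literal is annotated with the number of instances in the dataset that meet it, and literals are ordered by a simple stand-in for significance: the share of those instances that belong to the predicted class. This significance measure, the function name `annotate_literals` and the toy dataset are our assumptions.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def annotate_literals(rows: List[Row], antecedent: List[Tuple[str, str]],
                      target: str, label: str) -> List[Tuple[str, str, int, float]]:
    """Return (attribute, value, n_matching, precision), most precise literal first."""
    annotated = []
    for attribute, value in antecedent:
        matching = [r for r in rows if r.get(attribute) == value]
        hits = sum(r[target] == label for r in matching)
        precision = hits / len(matching) if matching else 0.0
        annotated.append((attribute, value, len(matching), precision))
    return sorted(annotated, key=lambda item: item[3], reverse=True)


# Toy data (assumed): income predicts default, browser does not.
rows = [
    {"income": "low", "browser": "A", "default": "yes"},
    {"income": "low", "browser": "B", "default": "yes"},
    {"income": "low", "browser": "A", "default": "yes"},
    {"income": "high", "browser": "A", "default": "no"},
    {"income": "high", "browser": "B", "default": "no"},
    {"income": "high", "browser": "A", "default": "no"},
]
annotated = annotate_literals(
    rows, [("browser", "A"), ("income", "low")], target="default", label="yes"
)
```

In this toy example the literal `income=low` is moved to the front of the rule, and the instance counts shown next to each literal give the analyst a cue that the two conditions do not carry equal weight.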
6 A Model for Rule Plausibility and Recommendations for Interpretable Rule Learning
Based on the literature review and partly on experimental results presented in (Kliegr, 2017), we propose a graphical model of the plausibility of rules. The model is intended to raise awareness about the effect of cognitive biases on the perception of rule learning results among the designers of machine learning algorithms and software. It suggests which cognitive biases might be triggered when humans assess the plausibility of inductively learned rules and whether they influence plausibility. A somewhat similar model describing general factors influencing plausibility assessment of a hypothesis was proposed in Gettys et al. (1978, 1986). In our model, we focus on inductively learned rules and cognitive biases.
The model consists of two decision trees, which are presented in Figure 3. The first tree captures the hypothesized contributions of individual literals in the antecedent of the rule towards increase or decrease of human perceived plausibility of the rule. The second tree suggests how the individual literal contributions might be combined into perception of overall plausibility of the rule.
6.1 Categorization of Biases based on Agency
Inspection of the first tree (Figure 2(a)) hints that the effect of the individual literals in the rule largely depends on the domain knowledge of the person inspecting the model. In contrast, the way the contribution of the literals is aggregated into a final plausibility score in Figure 2(b) seems to depend on the general information processing style and preferences of the person. We thus propose to divide the biases reviewed in this paper into the following two groups:
Triggered by domain knowledge related to attributes and values in the rules. An example is aversion to ambiguous information.
Generic strategies applied when evaluating alternatives. An example is insensitivity to sample size, which implies that rule confidence is considered as more important than rule support.
While domain knowledge may be difficult to change, systematic errors in reasoning can often be avoided. One example is making the person aware of the fact that low rule support influences the reliability of the rule confidence estimate.
6.2 Implications for Algorithm Design and Visualizations in Machine Learning
This section provides a concise list of considerations that is aimed to raise awareness among machine learning practitioners regarding availability of measures that could potentially suppress effect of cognitive biases on comprehension of rule-based models. We expect part of the list to be useful also for other symbolic machine learning models, such as decision trees.
Remove near-redundant rules and near-redundant literals from rules. Rule models often incorporate output that is considered as marginally relevant. This can take the form of (near) redundant rules or (near) redundant literals in the rule. Our analysis shows that these redundancies can induce a number of biases. For example, a frequently occurring but otherwise not very important literal can – by virtue of the mere exposure effect – be perceived as more important than would be appropriate given the data.
Represent rule quality measures as frequencies not ratios. Currently, rule interest measures such as confidence and support are typically represented as ratios. Extensive research has shown that natural frequencies are better understood.
Make conjunctions unambiguous. There are several cognitive studies indicating that “and” is often misunderstood. The results of our experiments also support this conclusion. Machine learning software should thus make sure that the meaning of “and” in presented rules is clear.
Present confidence interval for rule confidence. The tendency of humans to ignore base rates and sample sizes (which closely relate to rule support) is a well-established fact in cognitive science; the results of our experiments on inductively learned rules also provide evidence for this conclusion. Our proposition is that this effect can be addressed by computing confidence (reliability) intervals for confidence. In this way, the “weight of evidence” will effectively be communicated through confidence.
Avoid the use of negated literals as well as positive/negative class labels. It is an established fact in cognitive science that negative information receives more attention and is associated with higher weight than positive information. There is research indicating that recasting a yes/no attribute to two “neutral” categories (such as “DAX/MED”) can improve human understanding.
Sort rules as well as literals in the rules from strongest to weakest. People tend to put higher emphasis on information they are exposed to first. By presenting the most important information first, machine learning software can also conform to human conversational maxims. The output could also visually delimit literals in the rules based on their significance, which would again correspond to humans using various non-verbal clues to convey significance in the spoken word.
Provide explanation for literals in rules. A number of biases can be triggered or strengthened by the lack of domain knowledge of literals in the rules. Examples include ambiguity aversion and unit bias. Providing the analyst with easily accessible information on literals in the rules, including their predictive power, can prove to be an effective debiasing technique.
Explain difference between negation and absence of a condition. Prior results in cognitive science as well as some experimental results in the rule learning domain (Kliegr, 2017) show that the absence of a condition can be misinterpreted as negation if the omitted condition is present in other rules. Consider the following pair of rules:
Rule 1: IF bankteller=yes THEN class=A
Rule 2: IF bankteller=yes AND feminist=yes THEN class=B.
In presence of Rule 2, Rule 1 can be read as
Rule 1’: IF bankteller=yes AND feminist=no THEN class=A.
Elicit and respect monotonicity constraints. Research has shown that if monotonicity constraints—such as that fuel consumption increases with increasing car weight—are observed, the plausibility of the rule model increases.
Educate and assess human analysts. One perhaps surprising result related to confirmation or myside bias is that its incidence is not related to intelligence. Some research even suggests that analysts who think that good arguments are those that can be “proved by facts” are even more susceptible to myside bias than the general population. There is a psychological test that can reveal the susceptibility of a person to myside bias. Several studies have shown that providing explicit guidance and education on formal logics, hypothesis testing and critical assessment of information can reduce fallacy rates in some tasks.
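As an illustration of the recommendation above to represent rule quality measures as frequencies rather than ratios, the conversion can be sketched as follows. The sketch assumes that support is defined as the relative frequency of instances matching both the antecedent and the consequent, and that the total number of instances is known; the function name `as_natural_frequency` and the wording of the output are hypothetical.

```python
def as_natural_frequency(confidence: float, support: float, n_instances: int) -> str:
    """Express rule confidence and support as counts of instances.

    support    -- share of all instances matching antecedent AND consequent
    confidence -- share of antecedent-matching instances that match the consequent
    """
    hits = round(support * n_instances)
    covered = round(hits / confidence) if confidence > 0 else 0
    return f"{hits} of the {covered} instances matching the rule belong to the predicted class"


message = as_natural_frequency(confidence=0.8, support=0.04, n_instances=1000)
```

For a rule with confidence 0.8 and support 0.04 on 1000 instances, the analyst would thus read that 40 of the 50 matching instances belong to the predicted class, which exposes the sample size together with the strength of the rule.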
7 Limitations and Future Work
Our goal was to validate whether cognitive biases affect interpretation of machine learning models and propose remedies if they do. Since this field is untapped from the machine learning perspective, we tried to approach this problem holistically. Our work yielded a number of partial contributions, rather than a single profound result. We mapped applicable cognitive biases, identified prior works on their suppression and proposed how these could be transferred to machine learning. All the shortcomings of human judgment pertaining to interpretation of inductively learned rules that we have reviewed are based on empirical cognitive science research. For each cognitive bias, we attempted to provide a justification of how it would relate to machine learning. Due to the absence of applicable prior research, this justification is mostly based on the authors’ experience in machine learning.
7.1 Incorporating Additional Biases
There are about 24 cognitive biases covered in Cognitive Illusions, the authoritative overview of cognitive biases by Pohl (2017), and even 51 different biases are covered by Evans et al. (2007). While doing the initial selection of cognitive biases to study, we tried to identify those most salient for machine learning research matching our criteria. This is the reason why we included the weak evidence effect, which has been discovered only recently and is not yet included into the latest edition of Cognitive Illusions. In the end, our review focused on a selection of 20 cognitive biases (effects, illusions). Future work might focus on expanding the review with additional relevant biases, such as labelling and overshadowing effects (Pohl, 2017).
7.2 Applicability of Results on Wason’s 2-4-6 Problem
According to our review, the results obtained in cognitive science have only rarely been integrated or aligned with research done in machine learning. As our review also showed, there are a number of results in cognitive science relevant for machine learning. Remarkably, since 1960 there has been a consistent line of work by psychologists studying cognitive processes related to rule induction, which is centred around the so-called Wason’s 2-4-6 problem.
Cognitive science research on rule induction in humans has so far gone completely unnoticed in the rule learning subfield of machine learning (based on our analysis of a cited reference search in Google Scholar for (Wason, 1960)). Analysing the significance of results obtained for Wason’s 2-4-6 problem for rule learning was outside the scope of this review; nevertheless, we believe that such an investigation could bring interesting insights for cognitively-inspired design of rule learning algorithms.
To our knowledge, cognitive biases have not yet been discussed in relation to interpretability of machine learning results. We thus initiated this review of research published in cognitive science with the intent to give a psychological basis to changes in inductive rule learning algorithms, and the way their results are communicated. Our review identified twenty cognitive biases, heuristics and effects that can give rise to systematic errors when inductively learned rules are interpreted.
For most biases and heuristics involved in our study, psychologists have proposed “debiasing” measures. Application of prior empirical results obtained in cognitive science allowed us to propose several methods that could be effective in suppressing these cognitive phenomena when machine learning models are interpreted. While each cognitive bias requires a different “antidote” two noticeable trends emerged from our analysis.
Our first finding indirectly supports the previous view of interpretability of machine learning models, which is that smaller models are more interpretable. However, in our review of literature from cognitive science, we did not identify results that would directly support this view. What our analysis did reveal is a number of cognitive phenomena that would make longer rules (or generally descriptions) more likely to trigger various cognitive biases than would shorter rules (descriptions). An example of such a bias is the information bias, i.e., the preference for more information even if it does not help to address the problem at hand. To summarize our contribution, we found indirect support in psychology for the “smaller is better” paradigm used in many machine learning algorithms. While small models may not necessarily be found to be more plausible by humans than larger models, they provide fewer opportunities for cognitive biases to be triggered, leading to better, more truthful, comprehension.
Our second observation is that there are two categories of cognitive biases. Those in the first category are associated with individual conditions in rules and mostly relate to domain knowledge about attributes and values in the rules. An example is aversion to ambiguous information. Those in the second category arise when a heuristic is applied in place of a generic, objectively valid reasoning rule, such as Bayes’ theorem, leading to distortion. An example of such a cognitive bias is insensitivity to sample size. The choice of debiasing technique depends on the category.
Overall, in our review we processed only a fraction of potentially relevant psychological studies of cognitive biases; however, we were unable to locate a single study focused on machine learning. Future research should thus focus on empirical evaluation of effects of cognitive biases in the machine learning domain.
TK was supported by long term institutional support of research activities by Faculty of Informatics and Statistics, University of Economics, Prague.
- Pohl (2017) R. Pohl, Cognitive illusions: A handbook on fallacies and biases in thinking, judgement and memory, Psychology Press, 2017. 2nd ed.
- Kahneman and Tversky (1972) D. Kahneman, A. Tversky, Subjective probability: A judgment of representativeness, Cognitive psychology 3 (1972) 430–454.
- Tversky and Simonson (1993) A. Tversky, I. Simonson, Context-dependent preference, Management science 39 (1993) 1179–1189.
- Kunda (1999) Z. Kunda, Social cognition: Making sense of people, MIT press, 1999.
- Serfas (2011) S. Serfas, Cognitive biases in the capital investment context, in: Cognitive Biases in the Capital Investment Context, Springer, 2011, pp. 95–189.
- Kahneman and Tversky (1973) D. Kahneman, A. Tversky, On the psychology of prediction, Psychological Review 80 (1973) 237–251.
- Michalski (1969) R. S. Michalski, On the quasi-minimal solution of the general covering problem, in: Proceedings of the V International Symposium on Information Processing (FCIP 69)(Switching Circuits), Yugoslavia, Bled, 1969, pp. 125–128.
- Michalski (1983) R. S. Michalski, A theory and methodology of inductive learning, in: Machine learning, Springer, 1983, pp. 83–134.
- Fürnkranz et al. (2018) J. Fürnkranz, T. Kliegr, H. Paulheim, On cognitive preferences and the interpretability of rule-based models, CoRR abs/1803.01316 (2018).
- Fürnkranz et al. (2012) J. Fürnkranz, D. Gamberger, N. Lavrač, Foundations of Rule Learning, Springer-Verlag, 2012.
- Slowinski et al. (2006) R. Slowinski, I. Brzezinska, S. Greco, Application of Bayesian confirmation measures for mining rules from support-confidence Pareto-optimal set, Artificial Intelligence and Soft Computing–ICAISC 2006 (2006) 1018–1026.
- Agrawal et al. (1995) R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, A. I. Verkamo, Fast discovery of association rules, in: U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press, 1995, pp. 307–328.
- Zhang and Zhang (2002) C. Zhang, S. Zhang, Association Rule Mining: Models and Algorithms, Springer-Verlag, 2002.
- Smith et al. (1992) E. E. Smith, C. Langston, R. E. Nisbett, The case for rules in reasoning, Cognitive science 16 (1992) 1–40.
- Nisbett (1993) R. E. Nisbett, Rules for reasoning, Psychology Press, 1993.
- Pinker (2015) S. Pinker, Words and rules: The ingredients of language, Basic Books, 2015.
- Tversky and Kahneman (1974) A. Tversky, D. Kahneman, Judgment under uncertainty: Heuristics and biases, Science 185 (1974) 1124–1131.
- Griffin and Tversky (1992) D. Griffin, A. Tversky, The weighing of evidence and the determinants of confidence, Cognitive psychology 24 (1992) 411–435.
- Keynes (1922) J. M. Keynes, A Treatise on Probability, Macmillan & Co, 1922.
- Camerer and Weber (1992) C. Camerer, M. Weber, Recent developments in modeling preferences: Uncertainty and ambiguity, Journal of Risk and Uncertainty 5 (1992) 325–370.
- Wason (1960) P. C. Wason, On the failure to eliminate hypotheses in a conceptual task, Quarterly journal of experimental psychology 12 (1960) 129–140.
- Nickerson (1998) R. S. Nickerson, Confirmation bias: A ubiquitous phenomenon in many guises, Review of general psychology 2 (1998) 175.
- Tweney et al. (1980) R. D. Tweney, M. E. Doherty, W. J. Worner, D. B. Pliske, C. R. Mynatt, K. A. Gross, D. L. Arkkelin, Strategies of rule discovery in an inference task, Quarterly Journal of Experimental Psychology 32 (1980) 109–123.
- Rossi et al. (2001) S. Rossi, J. P. Caverni, V. Girotto, Hypothesis testing in a rule discovery problem: When a focused procedure is effective, The Quarterly Journal of Experimental Psychology: Section A 54 (2001) 263–267.
- Vallée-Tourangeau and Payton (2008) F. Vallée-Tourangeau, T. Payton, Goal-driven hypothesis testing in a rule discovery task, in: Proceedings of the 30th Annual Conference of the Cognitive Science Society, Cognitive Science Society Austin, TX, 2008, pp. 2122–2127.
- Gopnik and Tenenbaum (2007) A. Gopnik, J. B. Tenenbaum, Bayesian networks, Bayesian learning and cognitive development, Developmental science 10 (2007) 281–287.
- Griffiths et al. (2010) T. L. Griffiths, N. Chater, C. Kemp, A. Perfors, J. B. Tenenbaum, Probabilistic models of cognition: Exploring representations and inductive biases, Trends in cognitive sciences 14 (2010) 357–364.
- Wilke and Mata (2012) A. Wilke, R. Mata, Cognitive bias, in: V. Ramachandran (Ed.), Encyclopedia of Human Behavior, 2nd ed., Academic Press, San Diego, 2012, pp. 531–535. URL: https://www.sciencedirect.com/science/article/pii/B978012375000600094X. doi:10.1016/B978-0-12-375000-6.00094-X.
- Haselton and Nettle (2006) M. G. Haselton, D. Nettle, The paranoid optimist: An integrative evolutionary model of cognitive biases, Personality and social psychology Review 10 (2006) 47–66.
- Freitas (2014) A. A. Freitas, Comprehensible classification models: a position paper, ACM SIGKDD explorations newsletter 15 (2014) 1–10.
- Stecher et al. (2016) J. Stecher, F. Janssen, J. Fürnkranz, Shorter rules are better, aren’t they?, in: Proceedings of the 19th International Conference on Discovery Science (DS-16), Bari, Italy, 2016, pp. 279–294. URL: https://doi.org/10.1007/978-3-319-46307-0_18. doi:10.1007/978-3-319-46307-0_18.
- Gettys et al. (1978) C. F. Gettys, S. D. Fisher, T. Mehle, Hypothesis Generation and Plausibility Assessment, Technical Report, Decision Processes Laboratory, University of Oklahoma, Norman, 1978. Annual report TR 15-10-78 (AD A060786).
- Gettys et al. (1986) C. F. Gettys, T. Mehle, S. Fisher, Plausibility assessments in hypothesis generation, Organizational Behavior and Human Decision Processes 37 (1986) 14–33.
- Anderson and Fleming (2016) J. Anderson, D. Fleming, Analytical procedures decision aids for generating explanations: Current state of theoretical development and implications of their use, Journal of Accounting and Taxation 8 (2016) 51.
- Kliegr (2017) T. Kliegr, Effect of Cognitive Biases on Human Understanding of Rule-based Machine Learning, Queen Mary University of London, London, United Kingdom, 2017. Dissertation Thesis.
- Bibal and Frénay (2016) A. Bibal, B. Frénay, Interpretability of machine learning models and representations: an introduction, in: Proceedings of the 24th European Symposium on Artificial Neural Networks (ESANN), 2016, pp. 77–82.
- Martens et al. (2011) D. Martens, J. Vanthienen, W. Verbeke, B. Baesens, Performance of classification models from a user perspective, Decision Support Systems 51 (2011) 782–793.
- Azevedo and Jorge (2007) P. J. Azevedo, A. M. Jorge, Comparing rule measures for predictive association rules, in: ECML, volume 7, Springer, 2007, pp. 510–517.
- Tversky and Kahneman (1971) A. Tversky, D. Kahneman, Belief in the law of small numbers, Psychological bulletin 76 (1971) 105.
- Gigerenzer and Goldstein (1996) G. Gigerenzer, D. G. Goldstein, Reasoning the fast and frugal way: models of bounded rationality, Psychological review 103 (1996) 650.
- Gigerenzer and Hoffrage (1995) G. Gigerenzer, U. Hoffrage, How to improve Bayesian reasoning without instruction: frequency formats, Psychological review 102 (1995) 684.
- Evans et al. (2007) J. S. B. Evans, et al., Hypothetical thinking: Dual processes in reasoning and judgement, volume 3, Psychology Press, 2007.
- Fürnkranz and Flach (2005) J. Fürnkranz, P. A. Flach, ROC ’n’ rule learning – towards a better understanding of covering algorithms, Machine Learning 58 (2005) 39–77.
- Ballin et al. (2008) M. Ballin, R. Carbini, M. F. Loporcaro, M. Lori, R. Moro, V. Olivieri, M. Scanu, The use of information from experts for agricultural official statistics, in: European Conference on Quality in Official Statistics (Q2008), 2008.
- Evans (1989) J. S. B. Evans, Bias in human reasoning: Causes and consequences, Lawrence Erlbaum Associates, Inc, 1989.
- Trope et al. (1997) Y. Trope, B. Gervey, N. Liberman, Wishful thinking from a pragmatic hypothesis-testing perspective, The mythomanias: The nature of deception and self-deception (1997) 105–131.
- Klayman and Ha (1987) J. Klayman, Y.-W. Ha, Confirmation, disconfirmation, and information in hypothesis testing, Psychological review 94 (1987) 211.
- Pohl (2004) R. Pohl, Cognitive illusions: A handbook on fallacies and biases in thinking, judgement and memory, Psychology Press, 2004.
- Mynatt et al. (1977) C. R. Mynatt, M. E. Doherty, R. D. Tweney, Confirmation bias in a simulated research environment: An experimental study of scientific inference, The quarterly journal of experimental psychology 29 (1977) 85–95.
- Westen et al. (2006) D. Westen, P. S. Blagov, K. Harenski, C. Kilts, S. Hamann, Neural bases of motivated reasoning: An fMRI study of emotional constraints on partisan political judgment in the 2004 US presidential election, Journal of cognitive neuroscience 18 (2006) 1947–1958.
- Stanovich et al. (2013) K. E. Stanovich, R. F. West, M. E. Toplak, Myside bias, rational thinking, and intelligence, Current Directions in Psychological Science 22 (2013) 259–264.
- Wolfe and Britt (2008) C. R. Wolfe, M. A. Britt, The locus of the myside bias in written argumentation, Thinking & Reasoning 14 (2008) 1–27.
- Albarracín and Mitchell (2004) D. Albarracín, A. L. Mitchell, The role of defensive confidence in preference for proattitudinal information: How believing that one is strong can sometimes be a defensive weakness, Personality and Social Psychology Bulletin 30 (2004) 1565–1584.
- Gilovich and Savitsky (2002) T. Gilovich, K. Savitsky, Like goes with like: The role of representativeness in erroneous and pseudo-scientific beliefs, Cambridge University Press, 2002.
- Tversky and Kahneman (1983) A. Tversky, D. Kahneman, Extensional versus intuitive reasoning: the conjunction fallacy in probability judgment, Psychological review 90 (1983) 293.
- Gigerenzer (1996) G. Gigerenzer, On narrow norms and vague heuristics: A reply to Kahneman and Tversky, Psychological Review (1996) 592–596.
- Kahneman (2003) D. Kahneman, A perspective on judgment and choice, American Psychologist 58 (2003).
- Bar-Hillel (1991) M. Bar-Hillel, Commentary on Wolford, Taylor, and Beck: The conjunction fallacy?, Memory & cognition 19 (1991) 412–414.
- Tentori and Crupi (2012) K. Tentori, V. Crupi, On the conjunction fallacy and the meaning of and, yet again: A reply to Hertwig, Benz, and Krauss (2008), Cognition 122 (2012) 123–134.
- Hertwig et al. (2008) R. Hertwig, B. Benz, S. Krauss, The conjunction fallacy and the many meanings of and, Cognition 108 (2008) 740–753.
- Fantino et al. (1997) E. Fantino, J. Kulik, S. Stolarz-Fantino, W. Wright, The conjunction fallacy: A test of averaging hypotheses, Psychonomic Bulletin & Review 4 (1997) 96–101.
- Hertwig and Gigerenzer (1999) R. Hertwig, G. Gigerenzer, The “conjunction fallacy” revisited: How intelligent inferences look like reasoning errors, Journal of Behavioral Decision Making 12 (1999) 275–305.
- Charness et al. (2010) G. Charness, E. Karni, D. Levin, On the conjunction fallacy in probability judgment: New experimental evidence regarding Linda, Games and Economic Behavior 68 (2010) 551–556.
- Zizzo et al. (2000) D. J. Zizzo, S. Stolarz-Fantino, J. Wen, E. Fantino, A violation of the monotonicity axiom: Experimental evidence on the conjunction fallacy, Journal of Economic Behavior & Organization 41 (2000) 263–276.
- Stolarz-Fantino et al. (1996) S. Stolarz-Fantino, E. Fantino, J. Kulik, The conjunction fallacy: Differential incidence as a function of descriptive frames and educational context, Contemporary Educational Psychology 21 (1996) 208–218.
- Tversky and Kahneman (1973) A. Tversky, D. Kahneman, Availability: A heuristic for judging frequency and probability, Cognitive psychology 5 (1973) 207–232.
- Schwarz et al. (1991) N. Schwarz, H. Bless, F. Strack, G. Klumpp, H. Rittenauer-Schatka, A. Simons, Ease of retrieval as information: Another look at the availability heuristic, Journal of Personality and Social psychology 61 (1991) 195.
- Pachur et al. (2011) T. Pachur, P. M. Todd, G. Gigerenzer, L. Schooler, D. G. Goldstein, The recognition heuristic: A review of theory and tests, Frontiers in psychology 2 (2011) 147.
- Monahan et al. (2000) J. L. Monahan, S. T. Murphy, R. B. Zajonc, Subliminal mere exposure: Specific, general, and diffuse effects, Psychological Science 11 (2000) 462–466.
- Bornstein (1989) R. F. Bornstein, Exposure and affect: overview and meta-analysis of research, 1968–1987, Psychological Bulletin 2 (1989) 265–289.
- Zajonc (1968) R. B. Zajonc, Attitudinal effects of mere exposure, Journal of personality and social psychology 9 (1968) 1.
- Al-Najjar and Weinstein (2009) N. I. Al-Najjar, J. Weinstein, The ambiguity aversion literature: a critical assessment, Economics and Philosophy 25 (2009) 249–284.
- Ellsberg (1961) D. Ellsberg, Risk, ambiguity, and the Savage axioms, The Quarterly Journal of Economics 75 (1961) 643–669.
- Curley et al. (1984) S. P. Curley, S. A. Eraker, J. F. Yates, An investigation of patient’s reactions to therapeutic uncertainty, Medical Decision Making 4 (1984) 501–511.
- Plous (1993) S. Plous, The psychology of judgment and decision making, McGraw-Hill Book Company, 1993.
- Michalski (1983) R. S. Michalski, A theory and methodology of inductive learning, Artificial Intelligence 20 (1983) 111–162.
- Wille (1982) R. Wille, Restructuring lattice theory: An approach based on hierarchies of concepts, in: I. Rival (Ed.), Ordered Sets, Reidel, Dordrecht-Boston, 1982, pp. 445–470.
- Ganter and Wille (1999) B. Ganter, R. Wille, Formal Concept Analysis – Mathematical Foundations, Springer, 1999.
- Edgell et al. (2004) S. E. Edgell, J. Harbison, W. P. Neace, I. D. Nahinsky, A. S. Lajoie, What is learned from experience in a probabilistic environment?, Journal of Behavioral Decision Making 17 (2004) 213–229.
- Gamberger and Lavrač (2003) D. Gamberger, N. Lavrač, Active subgroup mining: A case study in coronary heart disease risk group detection, Artificial Intelligence in Medicine 28 (2003) 27–57.
- Kononenko (1993) I. Kononenko, Inductive and Bayesian learning in medical diagnosis, Applied Artificial Intelligence 7 (1993) 317–337.
- Bar-Hillel and Neter (1993) M. Bar-Hillel, E. Neter, How alike is it versus how likely is it: A disjunction fallacy in probability judgments, Journal of Personality and Social Psychology 65 (1993) 1119.
- Baron et al. (1988) J. Baron, J. Beattie, J. C. Hershey, Heuristics and biases in diagnostic reasoning: II. Congruence, information, and certainty, Organizational Behavior and Human Decision Processes 42 (1988) 88–110.
- Hertwig et al. (2005) R. Hertwig, T. Pachur, S. Kurzenhäuser, Judgments of risk frequencies: tests of possible cognitive mechanisms, Journal of Experimental Psychology: Learning, Memory, and Cognition 31 (2005) 621.
- Goldstein and Gigerenzer (1999) D. G. Goldstein, G. Gigerenzer, The recognition heuristic: How ignorance makes us smart, in: Simple heuristics that make us smart, Oxford University Press, 1999, pp. 37–58.
- Beaman et al. (2006) C. P. Beaman, R. McCloy, P. T. Smith, When does ignorance make us smart? Additional factors guiding heuristic inference, in: Proceedings of the Cognitive Science Society, volume 28, 2006.
- Pachur and Hertwig (2006) T. Pachur, R. Hertwig, On the psychology of the recognition heuristic: Retrieval primacy as a key determinant of its use, Journal of Experimental Psychology: Learning, Memory, and Cognition 32 (2006) 983.
- Rozin and Royzman (2001) P. Rozin, E. B. Royzman, Negativity bias, negativity dominance, and contagion, Personality and social psychology review 5 (2001) 296–320.
- Pratto and John (2005) F. Pratto, O. P. John, Automatic vigilance: The attention-grabbing power of negative social information, Social cognition: key readings 250 (2005).
- Kahneman and Tversky (1979) D. Kahneman, A. Tversky, Prospect theory: An analysis of decision under risk, Econometrica: Journal of the econometric society (1979) 263–291.
- Fiske (1980) S. T. Fiske, Attention and weight in person perception: The impact of negative and extreme behavior, Journal of personality and Social Psychology 38 (1980) 889.
- Ohira et al. (1998) H. Ohira, W. M. Winton, M. Oyama, Effects of stimulus valence on recognition memory and endogenous eyeblinks: Further evidence for positive-negative asymmetry, Personality and Social Psychology Bulletin 24 (1998) 986–993.
- Robinson-Riegler and Winton (1996) G. L. Robinson-Riegler, W. M. Winton, The role of conscious recollection in recognition of affective material: Evidence for positive-negative asymmetry, The Journal of General Psychology 123 (1996) 93–104.
- Bond et al. (2007) S. D. Bond, K. A. Carlson, M. G. Meloy, J. E. Russo, R. J. Tanner, Information distortion in the evaluation of a single option, Organizational Behavior and Human Decision Processes 102 (2007) 240–254.
- Shteingart et al. (2013) H. Shteingart, T. Neiman, Y. Loewenstein, The role of first impression in operant learning, Journal of Experimental Psychology: General 142 (2013) 476.
- Liu et al. (1998) B. Liu, W. Hsu, Y. Ma, Integrating classification and association rule mining, in: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, KDD’98, AAAI Press, 1998, pp. 80–86.
- Fürnkranz (1999) J. Fürnkranz, Separate-and-conquer rule learning, Artificial Intelligence Review 13 (1999) 3–54.
- Webb (1994) G. I. Webb, Recent progress in learning decision lists by prepending inferred rules, in: Proceedings of the 2nd Singapore International Conference on Intelligent Systems, 1994, pp. B280–B285.
- Hertwig et al. (1997) R. Hertwig, G. Gigerenzer, U. Hoffrage, The reiteration effect in hindsight bias, Psychological Review 104 (1997) 194.
- Hasher et al. (1977) L. Hasher, D. Goldstein, T. Toppino, Frequency and the conference of referential validity, Journal of Verbal Learning and Verbal Behavior 16 (1977) 107–112.
- Fürnkranz (1997) J. Fürnkranz, Pruning algorithms for rule learning, Machine Learning 27 (1997) 139–172.
- Sides et al. (2002) A. Sides, D. Osherson, N. Bonini, R. Viale, On the reality of the conjunction fallacy, Memory & Cognition 30 (2002) 191–198.
- Martire et al. (2013) K. A. Martire, R. I. Kemp, I. Watkins, M. A. Sayle, B. R. Newell, The expression and interpretation of uncertain forensic science evidence: verbal equivalence, evidence strength, and the weak evidence effect, Law and human behavior 37 (2013) 197.
- Geier et al. (2006) A. B. Geier, P. Rozin, G. Doros, Unit bias: A new heuristic that helps explain the effect of portion size on food intake, Psychological Science 17 (2006) 521–525.
- Gigerenzer and Hoffrage (1999) G. Gigerenzer, U. Hoffrage, Overcoming difficulties in Bayesian reasoning: A reply to Lewis and Keren (1999) and Mellers and McGraw (1999), Psychological Review (1999) 425–430.
- Mosconi and Macchi (2001) G. Mosconi, L. Macchi, The role of pragmatic rules in the conjunction fallacy, Mind & Society 2 (2001) 31–57.