The division of labor in communication: Speakers help listeners account for asymmetries in visual perspective

Robert Hawkins Hyowon Gweon Noah Goodman Department of Psychology, Stanford University, Stanford, CA, US Department of Computer Science, Stanford University, Stanford, CA, US

Recent debates over adults’ theory of mind use have been fueled by surprising failures of perspective-taking in communication, suggesting that perspective-taking can be relatively effortful. How, then, should speakers and listeners allocate their limited cognitive resources to successfully understand one another? We argue for a resource-rational account of how agents navigate this division of labor. Under this account, the cognitive effort an agent chooses to allocate toward perspective-taking should depend flexibly on expectations about their interlocutor’s behavior in context. In particular, we investigate the behavior of speakers in the influential director-matcher task and show that they may be expected to take on more of this effort than previously assumed. In Experiment 1, we explicitly manipulated the presence or absence of occlusions and found that speakers systematically produced longer, more specific referring expressions when it was clear that additional objects could be in their partner’s view but not their own. In Experiment 2, we compare the scripted utterances used by confederates in prior work with those produced by unscripted speakers in the same task. We found that confederate speakers are systematically less informative than listeners would initially expect from naive speakers in this context, but that listeners may use violations to adjust their expectations over time. These results suggest that it may be boundedly rational for listeners to reduce the effort put toward perspective-taking to a certain extent given contextually appropriate pragmatic expectations.

Keywords: theory of mind, pragmatics, interaction, communication, social cognition, replication

1 Introduction

Our success as a social species depends on our ability to understand, and be understood by, different social partners across different contexts. Theory of mind, the ability to represent and reason about others’ mental states (Premack and Woodruff, 1978), is considered to be the key cognitive mechanism that supports such context-sensitivity in our everyday social interactions. Being able to infer what others see, want, and think allows us to make more accurate predictions about their future behavior in different contexts and adjust our own behaviors accordingly. These inferences do not necessarily come for free, however. Behavioral, developmental, and neural evidence increasingly suggests that at least some aspects of theory of mind use are computationally costly, requiring effortful processing under cognitive control (Saxe et al., 2006; Brown-Schmidt, 2009; Low and Perner, 2012; Ferguson et al., 2015; Bradford et al., 2015; Jouravlev et al., 2019; but see Rubio-Fernández et al., 2019).

How, then, should agents allocate their limited cognitive resources to successfully communicate with one another? One prominent proposal is that agents cope with these constraints by using egocentric heuristics (Keysar et al., 1998a; Keysar, 2007; Barr, 2014). An ‘anchor-and-adjust’ heuristic, in particular, allows agents to anchor on their own easily available perspective and effortfully adjust in the direction of another’s perspective to the extent that sufficient cognitive resources are available (Epley et al., 2004). Because the adjustment process satisfices at some threshold, heuristic accounts predict that completely optimal perspective-taking is rarely observed and communicative behavior is marked by egocentric biases. These accounts have provided a satisfying algorithmic-level explanation of a variety of key phenomena, such as the increase of egocentric biases under cognitive load and the effect of individual differences in working memory (Lin et al., 2010). At the same time, they have been complicated by apparently contradictory eye-tracking evidence showing sensitivity to another’s perspective from the earliest moments of processing, when the egocentric bias would be expected to be strongest (Nadig and Sedivy, 2002; Heller et al., 2008; Brown-Schmidt and Tanenhaus, 2008; Hanna et al., 2003).

More recent accounts have appealed to the computational level of analysis (Marr, 2010) to resolve this puzzle. Under a simultaneous integration account, formalized as a Bayesian model, listeners (Heller et al., 2016) and speakers (Mozuraitis et al., 2018) use a probabilistic weighting of different “referential domains” derived from different perspectives (see also Brown-Schmidt and Hanna, 2011; Degen and Tanenhaus, to appear). An intermediate weighting is found to account for prior results better than a purely egocentric or purely perspective-taking strategy, explaining why traces of the agent’s own perspective and that of their partner are found throughout processing. Yet probabilistic weighting models leave open a key puzzle: of all possible weightings, why do people use the weighting they do in a given context? What determines the particular proportion of egocentric knowledge that will be used in different scenarios? Heller et al. (2016) and Mozuraitis et al. (2018) raise this question and discuss a role for the cognitive demands of inhibiting one’s own perspective, but an endogenous mechanism accounting for the appropriate level of perspective-taking has not yet been pursued.

We argue in this paper for a resource rational account of perspective-taking in communication (Griffiths et al., 2015; Lieder and Griffiths, 2019) which seeks to bridge the computational and algorithmic levels of prior accounts. The recent development of resource rational analysis has provided a framework for understanding a range of costly but functionally important behaviors, from planning (Callaway et al., 2018) to decision-making under uncertainty (Shenhav et al., 2017; Lieder et al., 2018), through the application of rational principles under cognitive constraints. The key insight, motivated by recent work on the mechanisms of cognitive control, is that agents consider both the functional value of a computation as well as its costs (Shenhav et al., 2013; Kool and Botvinick, 2018), and behave in a way that is consistent with an approximately optimal tradeoff between them. In other words, “the question of interest has begun to shift from whether an individual is capable of exerting cognitive effort to whether the individual will choose to do so” (Kool and Botvinick, 2013). This view is consistent with the central role of executive control and recurrent processing in recent mechanistic frameworks for language processing (Ferreira, 2019).

In the setting of communication, resource rationality begins with the assumption that the participants in an interaction share the functional goal of understanding while minimizing joint effort (Tomasello, 2009; Clark, 1996). A resource rational account shares with simultaneous integration accounts the assumption that agents may be attending to and probabilistically weighting their partner’s perspective even at the outset of an interaction. It shares with purely heuristic accounts a central theoretical role for the process-level challenge of handling resource limitations. It differs from these models, however, in the extent to which agents may flexibly control their own resource usage. Rather than assuming agents are “reflexively mindblind” with no control over their egocentric biases, or are using a fixed weighting of perspectives, resource rationality assumes agents can anticipate the needs of the interaction and adaptively calibrate how much control to dedicate toward perspective-taking based on various contextual factors.

Communicative expectations under uncertainty about the visual context

Here, we focus on one particular contextual factor: communicative expectations derived from Gricean maxims. Theory of mind use not only incorporates people’s mental models of a partner’s knowledge or visual access but also reasoning about how their partner uses and interprets language. Just as making sense of an agent’s physical behaviors requires a broad, accurate mental model of how the agent’s visual access, beliefs, and intentions translate into motor plans (Jara-Ettinger et al., 2016; Baker et al., 2017), making sense of an agent’s linguistic behaviors depends on an accurate model of what a speaker would say, or what a listener would understand, in different situations (Bergen and Grodner, 2012; Goodman and Frank, 2016; Frank and Goodman, 2012; Franke and Jäger, 2016).

The Gricean notion of cooperativity (Grice, 1975; Clark, 1996) refers to the idea that speakers try to avoid saying things that are confusing or unnecessarily complicated given the current context, and that listeners expect this. For instance, imagine trying to help someone spot your dog at a busy dog park. It may be literally correct to call it a “dog,” but as a cooperative speaker you would understand that the listener would have trouble disambiguating the referent from many other dogs. Likewise, the listener would reasonably expect you to say something more informative than “dog” in this context. You may therefore prefer to use a more specific or informative expression, like “the little terrier with the blue collar,” even though it is more costly to produce (Brennan and Clark, 1996; van Deemter, 2016). Recent work has shown that listeners can successfully use expectations about informativity to make online inferences about the speaker’s knowledge (Rubio-Fernández, 2017; Rubio-Fernández and Jara-Ettinger, 2018).

Critically, however, you might also prefer more specific labels even when you happen to see only one dog at the moment, but know there are likely to be other dogs from the listener’s point of view. In the presence of uncertainty about their partner’s visual context, a cooperative speaker may tend toward additional specificity. We argue that speakers in the influential director-matcher task are in an analogous situation. In this task, which has been central to the debate over limits of theory of mind use, a speaker instructs a listener to move objects around a grid, but certain cells of the grid are covered to prevent the speaker from seeing some of the objects (e.g. Fig. 1). For example, on one trial a cassette tape was placed within view of both parties, while a roll of Scotch tape was placed behind an occluder and only visible to the listener. The speaker must generate a description such that a listener can identify the correct object among distractors, even though the speaker cannot be sure what all of the distractors are. It is therefore highly salient to the speaker that there exist hidden objects she cannot see but her partner can.

Figure 1: Critical trial of director-matcher task using the ambiguous utterance “the tape”: a cassette tape is in view of both players, but a roll of tape is occluded from the speaker’s view.

Gricean reasoning, as realized by recent formal models (Goodman and Frank, 2016; Frank and Goodman, 2012; Franke and Jäger, 2016), predicts that a speaker in this context will compensate for her uncertainty about the listener’s visual context by increasing the informativity of her utterance beyond what she would produce in a completely shared context. (See Appendix A for a formal model of pragmatic reasoning in this situation and a mathematical derivation of the qualitative informativity prediction.) The director-matcher task is therefore not only challenging for the listener. It also requires the use of theory of mind, vis a vis pragmatic audience design, on the part of the speaker to anticipate what level of informativity would be appropriate for the listener’s (unknown) visual context. While extensive prior work has examined how speakers adjust their utterances (or fail to adjust their utterances) depending on their own private information, that work has not considered the possibility that speakers pragmatically compensate for their salient lack of access to the listener’s private information by modifying their informativity.

In the following experiments, we ask whether people, as speakers, are sensitive to their own uncertainty about their partner’s visual context. Furthermore, we suggest that such sensitivity (and the listener’s expectations about this sensitivity) can help us understand why listeners make such frequent errors in the director-matcher task. A resource rational listener who expects the speaker to increase their informativity given the contextual presence of occlusions may devote relatively fewer resources to considering the speaker’s visual access, under the cognitive load of the task. This allocation may backfire and lead to errors when paired with a confederate speaker who is relatively less informative than expected. Listeners may then use the resulting prediction error to update their expectations about the speaker’s informativity and may decide it is worth dedicating more resources to monitoring the speaker’s perspective on subsequent trials.

To be clear, we are emphatically not arguing that speakers would ever be expected to shoulder all of the work or that Gricean considerations free listeners to completely ignore visual perspective. There is abundant evidence, consistent with the resource rational view, that speakers use vague or ambiguous language to reduce their own production costs because they can trust listeners to infer the intended meaning from context. Likewise, as reviewed above, listeners do weight their partner’s perspective alongside their own to some extent from the earliest moments of processing. In the view we are advancing, the perspective-taking effort each person chooses to exert is rarely all or none. It is a matter of degree. There is in principle a continuum of many acceptable divisions of labor, and no single weighting is objectively “optimal” for all interactions. The appropriate weighting for one agent depends on what the other agent is doing and is continually negotiated throughout an interaction.

Our main goal here is to directly establish the natural pragmatic behavior of speakers when faced with uncertainty about the listener’s context. This behavior establishes the reasonable baseline expectation that listeners use when initially deciding how much perspective-taking effort to allocate. First, we directly test our speaker model’s prediction by manipulating the presence and absence of occlusions in a simplified director-matcher task based on the design used by Keysar et al. (2003). Second, we conduct a replication of the landmark result reported by Keysar et al. (2003) with an additional unscripted condition to evaluate the gap between the scripted referring expressions used by confederate speakers in prior work and what a naive speaker would be expected to say in the same interactive context (Kuhlen and Brennan, 2013; Bavelas and Healing, 2013; Tanenhaus and Brown-Schmidt, 2008). Our broader claim emerges from establishing the plausibility of a resource-rational basis for some degree of perspective-neglect, and the role of pragmatic expectations in particular, by showing that speakers are adaptive to occlusions (Experiment 1) and listeners indeed make more errors when speakers violate their expectations (Experiment 2). Causally manipulating listener expectations is beyond the scope of the current work. We return to the broader implications and predictions of this account in the discussion.

2 Experiment 1: Speaker production under uncertainty about the listener’s context

How does a speaker refer to an object when there is uncertainty about exactly what her partner can see? Our computational model (Appendix A) predicts that speakers will go beyond what is necessary given their own view, anticipate possible confusion from the listener’s perspective, and err on the side of providing redundant information. To test this prediction empirically, we designed a simplified version of the director-matcher task that allows us to causally isolate the effect of occlusions on production. Note that we are not asking whether speakers produce strictly “optimal” referring expressions by some absolute standard — it is implausible that they would know the true underlying distribution of hidden objects, and would face their own resource constraints even if they did. The question is whether they contribute additional effort to produce more informative referring expressions in the presence of occlusions.

2.1 Methods

Figure 2: Stimuli in 2 × 2 design used in Experiment 1 (from speaker’s view; grey square indicates target).

2.1.1 Participants

We recruited 102 pairs of participants from Amazon Mechanical Turk and randomly assigned speaker and listener roles. After we removed 7 games that disconnected part-way through and 12 additional games according to our pre-registered exclusion criteria (due to being non-native English speakers, reporting confusion about the instructions, or clearly violating the instructions), we were left with a sample of 83 full games.

2.1.2 Materials & Procedure

On each trial, both players were presented with a grid containing objects. One target object was privately highlighted for the speaker, who freely typed a message into a chat box in order to get the listener to click the intended referent. The objects varied along three discrete features (shape, texture, and color), each of which took four discrete values (64 possible objects). See Appendix Fig. 7 for a screenshot of the full interface.
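The stimulus space described above can be sketched in a few lines. This is a minimal illustration, not the authors' stimulus code: only ‘star’ and ‘triangle’ are named in the text, so the remaining shape, color, and texture labels are hypothetical placeholders, as is the helper `has_same_shape_distractor`.

```python
import itertools

# Four values per feature. Only 'star' and 'triangle' appear in the text;
# every other label below is a hypothetical placeholder.
SHAPES = ("star", "triangle", "circle", "square")
COLORS = ("red", "blue", "green", "purple")
TEXTURES = ("solid", "striped", "dotted", "checkered")

# Each object is a (shape, color, texture) triple.
ALL_OBJECTS = list(itertools.product(SHAPES, COLORS, TEXTURES))


def has_same_shape_distractor(target, visible_distractors):
    """True if some mutually visible distractor shares the target's shape
    (a 'distractor-present' context in the terminology of the next paragraph)."""
    return any(obj[0] == target[0] for obj in visible_distractors)
```

With three features of four values each, `len(ALL_OBJECTS)` is 4 × 4 × 4 = 64, matching the count given above.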

There were four types of trials, forming a within-pair factorial design. The key manipulation was the presence or absence of occlusions (see Fig. 2). On ‘occlusion-absent’ trials, all objects were seen by both participants, but on ‘occlusion-present’ trials, two cells of the grid were covered with occluders (curtains) such that only the listener could see the contents of the cell. For comparison, we also included a well-studied informativity manipulation (e.g. Pechmann, 1989; Dale and Reiter, 1995; Brennan and Clark, 1996; Monroe et al., 2017). On ‘distractor-absent’ trials, the target is the only object with a particular shape; on ‘distractor-present’ trials, there is a distractor with the target’s shape in common ground, differing only in color or texture.

In order to make it clear to the speaker that there could be objects behind the occluders without providing a statistical cue to their identity or quantity on any particular trial, we randomized the total number of distractors in the grid on each trial (between 2 and 4) as well as the number of those distractors covered by curtains (1 or 2). If there were only two distractors, we did not allow both of them to be covered: there was always at least one visible distractor. Each trial type appeared 6 times for a total of 24 trials, and the sequence of trials was pseudo-randomized such that no trial type appeared more than twice in each block of eight trials. Participants were instructed to use visual properties of the objects rather than spatial locations in the grid.
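One way to generate a trial sequence satisfying these constraints is sketched below. It is an assumption-laden illustration rather than the experiment code: it satisfies the "no more than twice per block of eight" rule by giving every block exactly two trials of each type, which is stricter than the text requires, and the function and key names are hypothetical.

```python
import random

TRIAL_TYPES = [
    (occlusion, distractor)
    for occlusion in ("occlusion-absent", "occlusion-present")
    for distractor in ("distractor-absent", "distractor-present")
]


def make_trial(trial_type, rng):
    """Sample distractor counts for one trial under the stated constraints."""
    occlusion, distractor = trial_type
    n_distractors = rng.randint(2, 4)  # total distractors in the grid
    if occlusion == "occlusion-present":
        # 1 or 2 distractors hidden, but always leave at least one visible.
        n_hidden = rng.randint(1, min(2, n_distractors - 1))
    else:
        n_hidden = 0
    return {
        "occlusion": occlusion,
        "distractor": distractor,
        "n_distractors": n_distractors,
        "n_hidden": n_hidden,
    }


def make_sequence(seed=None):
    """24 trials in 3 blocks of 8; each block holds every trial type twice."""
    rng = random.Random(seed)
    trials = []
    for _ in range(3):
        block = TRIAL_TYPES * 2
        rng.shuffle(block)
        trials.extend(make_trial(t, rng) for t in block)
    return trials
```

Because each block contains every type exactly twice, each type appears 6 times across the 24 trials.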

Finally, we collected mouse-tracking data as a window into the real-time decision-making process. On each trial, we first asked the matcher to wait on an empty grid while the director typed their message. When the message was received, the matcher clicked a small circle in the center of the grid to show the objects and proceed with the trial. We recorded at 100 Hz from the matcher’s mouse in the decision window after this click, until the point where they started to move one of the objects. While we did not intend to analyze these data for Experiment 1, we anticipated using them in our second experiment below and wanted to use the same procedure across experiments for consistency.

2.2 Results

Our primary measure of speaker behavior is the length (in words) of naturally produced referring expressions sent through the chat box. We tested differences in speaker behavior across conditions using a mixed-effect regression of distractor- and occlusion-presence on the number of words produced, with maximal random effect structure containing intercept, slopes, and interaction. First, as a baseline, we restricted our analysis to occlusion-absent trials and examined the simple effect of whether a distractor of the same shape as the target was present vs. absent. We found that speakers used significantly more words on average ( words) when a distractor was present (; see Fig. 3A). This replicates the findings of extensive prior studies in experimental pragmatics that have established speaker sensitivity to what information is needed to disambiguate different objects in common ground. Next, we turn to the key simple effect of occlusion in ‘distractor-absent’ contexts, which are most similar to the displays used in the director-matcher task that we examine in Experiment 2. We found that speakers used significantly more words on average ( words) when they knew that additional objects could potentially be visible to their partner (). Lastly, we found a significant interaction () where the effect of occlusion was larger in distractor-absent trials, likely reflecting a ceiling on the level of informativity required to individuate objects in our simple stimulus space.
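For concreteness, the regression described above can be written out as follows. This is a sketch under assumptions not stated in the text: random effects are grouped by speaker $j$, the predictors $D_{ij}$ (distractor present) and $O_{ij}$ (occlusion present) are coded as indicators, and the residuals are Gaussian.

```latex
\begin{align*}
y_{ij} &= (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\, D_{ij} + (\beta_2 + u_{2j})\, O_{ij}
          + (\beta_3 + u_{3j})\, D_{ij} O_{ij} + \varepsilon_{ij}, \\
\mathbf{u}_j &= (u_{0j}, u_{1j}, u_{2j}, u_{3j})^\top \sim \mathcal{N}(\mathbf{0}, \Sigma),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\end{align*}
```

Here $y_{ij}$ is the number of words speaker $j$ produced on trial $i$. Under indicator coding, $\beta_1$ is the simple effect of a distractor on occlusion-absent trials, $\beta_2$ is the simple effect of occlusion on distractor-absent trials, and $\beta_3$ is the interaction.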

Figure 3: Results for Experiment 1. (A) Speakers used significantly more words when occlusions were present. (B) Utterances broken out by feature mentioned. Error bars on empirical data are bootstrapped 95% confidence intervals; model error bars are 95% credible intervals.

What are these additional words used for? As a secondary analysis, we annotated each utterance based on which of the three object features were mentioned (shape, texture, color). Because speakers nearly always mentioned shape (e.g. ‘star’, ‘triangle’) as the head noun of their referring expression regardless of context ( of trials), differences in utterance length across conditions must be due to differentially mentioning the other two features (color and texture). To test this observation, we ran separate mixed-effect logistic regressions for color and texture predicting mention from context; due to convergence issues, the maximal random effect structure supported by our data contains only speaker-level intercepts and slopes for the occlusion effect. We found simple effects of occlusion in distractor-absent contexts for both features ( for color; for texture, see Fig. 3B). In other words, in displays like the left column of Fig. 2 where the target was the only ‘star’, speakers were somewhat more likely to produce the star’s color (and much more likely to produce its texture) when there were occlusions present, even though shape alone is sufficient to disambiguate the target from visible distractors in both cases. Finally, we note that listener errors were rare: 88% of listeners made fewer than two errors (out of 24 trials), and there was no significant difference in error rates across the four conditions (). We test the connections between context-sensitive speaker behavior and listener error rates more explicitly in Experiment 2.

2.3 Model comparison

Figure 4: Modeling results for Experiment 1. Posterior predictives of each model are projected to the mean number of features produced in each condition (top) and directly compared to data across all context types, varying occlusion, number of distractors, and types of distractors (bottom). Error bars on empirical data are bootstrapped 95% confidence intervals; model error bars are 95% credible intervals.

While our behavioral results qualitatively support the hypothesis that speakers incur additional cost to be additionally informative in the presence of occlusions, formalizing this idea in a computational model allows a stronger test by generating graded quantitative predictions. To do so, we build on the probabilistic Rational Speech Act (RSA) framework (Frank and Goodman, 2012; Goodman and Frank, 2016; Franke and Jäger, 2016; Kao et al., 2014; Goodman and Stuhlmüller, 2013), which has successfully derived a variety of pragmatic phenomena from the basic mechanism of recursive social reasoning. In this framework, speakers are decision-theoretic agents attempting to (soft-)maximize a utility function balancing cost or effort (i.e., a preference for shorter, easier-to-produce utterances) with informativeness (i.e., the likelihood of an imagined listener agent having the intended interpretation). An ‘occlusion-sensitive’ speaker explicitly represents uncertainty over her partner’s visual context. In particular, she assumes a probability distribution over possible objects that might be hidden behind the occlusions and attempts to be informative on average. We compare this model with a baseline ‘occlusion-blind’ speaker who assumes her partner sees exactly the same objects she herself does. These two models have the same four free parameters: a speaker optimality parameter controlling the soft-max temperature, and three parameters controlling the costs of producing the features of shape, color, and texture (see Appendix B for details).
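The contrast between the two speaker models can be illustrated with a small sketch. It is not the authors' implementation (the actual specification is in Appendix B). It assumes that utterances are non-empty subsets of the target's features, that the imagined listener chooses uniformly among consistent objects, that costs add up across mentioned features, and that each hidden cell holds an object drawn uniformly from the 64-object space. All function names are hypothetical.

```python
import itertools
import math

# Objects are (shape, color, texture) triples with four values per feature.
ALL_OBJECTS = list(itertools.product(range(4), repeat=3))


def candidate_utterances(target):
    """Every non-empty subset of the target's features, as ((index, value), ...)."""
    return [
        tuple((i, target[i]) for i in idxs)
        for r in (1, 2, 3)
        for idxs in itertools.combinations(range(3), r)
    ]


def literal_listener(utterance, context):
    """P(target | utterance): uniform over context objects consistent with it.
    The context must contain the target, so at least one object is consistent."""
    n_consistent = sum(all(obj[i] == v for i, v in utterance) for obj in context)
    return 1.0 / n_consistent


def speaker_distribution(target, visible, alpha, costs, n_hidden=0):
    """Soft-max speaker. n_hidden=0 gives the occlusion-blind model; n_hidden>0
    averages informativity over uniformly sampled hidden objects (an assumption)."""
    hidden_options = list(itertools.product(ALL_OBJECTS, repeat=n_hidden))
    scores = {}
    for utt in candidate_utterances(target):
        informativity = sum(
            math.log(literal_listener(utt, list(visible) + list(hidden)))
            for hidden in hidden_options
        ) / len(hidden_options)
        cost = sum(costs[i] for i, _ in utt)
        scores[utt] = alpha * (informativity - cost)
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {u: math.exp(s - top) for u, s in scores.items()}
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}


def expected_length(distribution):
    """Expected number of features mentioned under a speaker distribution."""
    return sum(p * len(utt) for utt, p in distribution.items())
```

For example, with a distractor-absent display such as `visible = [(0, 0, 0), (1, 1, 1), (2, 2, 2)]` and target `(0, 0, 0)`, every utterance is fully informative for the occlusion-blind speaker, so only cost matters. Setting `n_hidden=2` penalizes short utterances that a hidden object might also match, which shifts probability toward longer descriptions.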

We conducted a Bayesian data analysis to infer these parameters, conditioning on our empirical data, and computed a Bayes Factor to compare the models. We found extremely strong support for the occlusion-sensitive model relative to the occlusion-blind model (; see Appendix Fig. 8 for likelihoods). To examine the pattern of behavior of each model, we computed the posterior predictive on the expected number of features mentioned in each trial type of our design. While the occlusion-blind speaker model successfully captured the simple effect of distractor-absent vs. distractor-present contexts, it failed to account for behavior in the presence of occlusions. The occlusion-sensitive model, on the other hand, accurately accounted for the full pattern of results (see Fig. 4). Finally, we examined parameter posteriors for the occlusion-sensitive model (see Appendix Fig. 9): the inferred production cost for texture was significantly higher than that for the other features, accounting for why participants were overall less likely to include texture in their descriptions relative to color.

2.4 Discussion

Experiment 1 directly tested the hypothesis that speakers increase their specificity in contexts with clear asymmetries in visual access. We found that speakers are not only context-sensitive in choosing referring expressions that distinguish target from distractors in the shared context, as extensive prior work has shown, but are also occlusion-sensitive. In the presence of occlusions, speakers were spontaneously willing to spend additional time and keystrokes to give further information beyond what they produce in the corresponding unoccluded contexts, even though that information would be redundant given the visible objects in their own display. This effect is larger than the simple and well-explored pragmatic effect of a similar distractor in common ground. These results validate our prediction that speakers increase their level of specificity in contexts containing occlusions. Critically, rather than planning their utterance purely in light of objects shared in common ground, which was held constant across conditions, this finding shows that speakers plan their utterance relative to what they think the listener privately knows.

3 Experiment 2: Comparing confederates to naive speakers

Our findings in Experiment 1 established that speakers naturally adjust their informativity in the presence of occlusions. Next, we examine the consequences of such adjustments for influential arguments about listener behavior in the same setting. To do so, we created an interactive, online reproduction of the director-matcher task used by Keysar et al. (2003), the task we used to derive the simplified design used in Experiment 1. We predicted that naive speakers would naturally provide more informative referring expressions than confederate directors used in prior work. This would suggest that the confederate directors in prior work carried less of the cognitive burden than listeners reasonably expected them to carry, with detrimental consequences for listener performance.

To be clear, this hypothesis is not a criticism of the use of a confederate or the choice of experimental stimuli in prior work, which served as intended to causally intervene on and reveal surprising lapses in the listener’s perspective-taking. Instead, our goal in Experiment 2 is to clarify what these lapses reveal about the listener’s initial expectations of the speaker and how listeners may use these expectations to strike a resource rational balance between accurate performance and the cost of perspective-taking. Clearly listeners do not expect the speaker to do all of the work, producing perfectly unambiguous utterances every time, or we would find more errors in their perspective-taking. But they may nonetheless expect speakers to take on a higher load than they encountered when playing with a confederate.

3.1 Methods

3.1.1 Participants

We recruited 200 pairs of participants from Amazon Mechanical Turk. 58 pairs were unable to complete the game due to a server outage. Following our preregistered exclusion criteria, we removed 24 games in which participants reported confusion, violated our instructions, or made multiple errors on filler items, as well as 2 additional games containing non-native English speakers. This left 116 pairs in our final sample.

3.1.2 Materials and Procedure

The materials and procedure were chosen to be as faithful as possible to those reported in Keysar et al. (2003) while allowing for interaction over the web (we discuss the potential impact of these differences below). Directors used a chat box to communicate where to move a privately cued target object in a grid (see Fig. 1). The listener then attempted to click and drag the intended object. In each of 8 object sets, mostly containing filler objects, one target belonged to a ‘critical pair’ of objects, such as a visible cassette tape and a hidden roll of tape that could both plausibly be called ‘the tape.’

We displayed instructions to the director as a series of arrows pointing from some object to a neighboring unoccupied cell. Trials were blocked into eight sets of objects, with four instructions each. As in Keysar et al. (2003), we collected baseline performance by replacing the hidden alternative (e.g. a roll of tape) with a filler object that did not fit the critical instruction (e.g. a battery) in half of the critical pairs. The assignment of items to conditions was randomized across participants, and the order of conditions was randomized under the constraint that the same condition would not be used on more than two consecutive items. All object sets, object placements, and corresponding instruction sets were fixed across participants. In case of a listener error, the object was placed back in its original position; both participants were given feedback and asked to try again.

We used a between-subject design to compare the scripted labels used by confederate directors in prior work against what participants naturally say in the same role. For participants assigned to the director role in the ‘scripted’ condition, a pre-scripted message using the precise wording from Keysar et al. (2003) automatically appeared in their chat box on half of trials (the 8 critical trials as well as nearly half of the fillers). Hence, the scripted condition served as a direct replication. To maintain an interactive environment, the director could freely produce referring expressions on the remainder of filler trials. In the ‘unscripted’ condition, directors were unrestricted and free to send whatever messages they deemed appropriate on all trials. In addition to analyzing messages sent through the chat box and errors made by matchers (listeners), we collected mouse-tracking data to examine the real-time decision process.

3.2 Results

3.2.1 Listener errors

Figure 5: Listener results for Experiment 2. (A) Distribution of errors with scripted and unscripted instructions. Participants in the unscripted condition made significantly fewer errors. (B) Even when they were correct, listeners in the scripted condition were more likely to hover their mouse cursor over the distractor relative to baseline while the unscripted condition shows no difference.

Our scripted condition successfully replicated the results of Keysar et al. (2003) with even stronger effects: listeners incorrectly moved the hidden object on approximately 50% of critical trials. However, on unscripted trials, the listener error rate dropped by more than half (Fig. 5A). While we found substantial heterogeneity in error rates across object sets (just 3 of the 8 object sets accounted for the vast majority of remaining unscripted errors; see Appendix Fig. 10), listeners in the unscripted condition made fewer errors for nearly every critical item. In a maximal logistic model with fixed effect of condition, random intercepts for each dyad, and random slopes and intercepts for each object set, we found a significant difference in error rates across conditions ().

Even if participants in the unscripted condition make fewer actual errors, they may still be considering the hidden object just as often on trials where they go on to make correct responses. To address this question, we conducted an analysis of mouse-tracking data. We computed the mean (logged) amount of time spent hovering over the hidden distractor and found a significant interaction between condition and the contents of the hidden cell (; Fig. 5B) in a mixed-effects regression using dyad-level and object-level random intercepts and slopes for the difference from baseline. That is, listeners in the scripted condition spent more time hovering over the hidden cell when it contained a confusable distractor relative to baseline. In the unscripted condition there was no difference from baseline (see Footnote 1).

Footnote 1: Mean hover time was exactly zero for the majority of trials; we thus conducted a follow-up analysis examining the binarized proportion of trials that listeners hovered over the hidden distractor at all, and found the same pattern of results. We also pre-registered an analysis of the latency before first hovering over the target but due to unexpectedly poor precision in aligning response times to the beginning of the trial, we did not pursue this analysis further.

3.2.2 Speaker informativity

Next, we test whether these improvements in listener performance in the unscripted condition are accompanied by more informative speaker behavior than the scripted utterances allowed. The simplest measure of speaker informativity is the raw number of words used in referring expressions. Compared to the scripted referring expressions, speakers in the unscripted condition used significantly more words to refer to critical objects ( in a mixed-effects regression on difference scores using a fixed intercept and random intercepts for object and dyads). However, this is a coarse measure: for example, the shorter “Pyrex glass” may be more specific than “large measuring glass” despite using fewer words. For a more direct measure, we extracted the referring expressions generated by speakers in all critical trials and standardized spelling and grammar, yielding 122 unique labels after including scripted utterances.

Figure 6: Speaker results for Experiment 2. (A) While speakers in the scripted condition were forced to use utterances that were judged to fit target and distractor roughly equally (by design), speakers in the unscripted condition naturally produced utterances that fit the target much better than the distractor. (B) The extent to which an utterance fits the target more than the distractor is highly predictive of error rates at an item-by-item level (dotted line is linear regression fit). All error bars are bootstrapped 95% confidence intervals.

We then recruited an independent sample of 20 judges on Amazon Mechanical Turk to rate how well each label fit the target and hidden distractor objects on a slider from “strongly disagree” (meaning the label “doesn’t match the object at all”) to “strongly agree” (meaning the label “matches the object perfectly”). They were shown objects in the context of the full grid (with no occlusions) such that they could feasibly judge spatial or relative references like “bottom block.” We excluded 4 judges for guessing with response times . Inter-rater reliability was relatively high, with intra-class correlation coefficient of . We computed the informativity of an utterance (the tape) as the difference in how well it was judged to apply to the target (the cassette tape) relative to the distractor object (the roll of tape).
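The informativity score defined above is a simple difference of mean ratings, which can be sketched as follows. The ratings are invented for illustration, and the 0 to 1 slider scale is an assumption, since the numeric range of the slider is not given here.

```python
from statistics import mean

def informativity(target_ratings, distractor_ratings):
    """Mean fit to the target minus mean fit to the hidden distractor.
    Positive values mean the label picks out the target better."""
    return mean(target_ratings) - mean(distractor_ratings)

# Hypothetical slider ratings (0 = "strongly disagree", 1 = "strongly agree").
ratings = {
    "the tape": {
        "target": [0.90, 0.80, 0.85],      # cassette tape
        "distractor": [0.90, 0.95, 0.80],  # roll of tape
    },
    "the clear audio cassette tape": {
        "target": [0.95, 0.90, 1.00],
        "distractor": [0.10, 0.05, 0.20],
    },
}

for label, r in ratings.items():
    score = informativity(r["target"], r["distractor"])
    print(f"{label!r}: informativity = {score:.2f}")
```

Under this definition, a label that fits both objects equally well scores near zero regardless of how well it fits the target in absolute terms.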

Our primary measure of interest is the difference in informativity across scripted and unscripted utterances. We found that speakers in the unscripted condition systematically produced more informative utterances than the scripted utterances (, 95% bootstrapped CI = ; see Appendix C for details). Scripted labels fit the hidden distractor just as well or better than the target, but unscripted labels fit the target better and the hidden distractor much worse (see Fig. 6A). In other words, the scripted labels used in Keysar et al. (2003) were less informative than expressions speakers would normally produce to refer to the same object in this context.

These results strongly suggest that the speaker’s informativity influences listener accuracy. In support of this hypothesis, we found a strong negative correlation between informativity and error rates across items and conditions: listeners make fewer errors when utterances are a better fit for the target relative to the distractor (, bootstrapped 95% CI ; Fig. 6B). In other words, a large proportion of the variance in listener errors can be explained by how well utterances fit each object in their own egocentric view, consistent with an expectation of higher speaker informativity.
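An item-level correlation with a bootstrapped confidence interval, of the kind reported above, can be computed along the following lines. This is a sketch under stated assumptions: it uses a percentile bootstrap over items, which may differ from the resampling scheme actually used, and the data points are hypothetical.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_ci(xs, ys, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the correlation, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2 or len(set(by)) < 2:
            continue  # correlation is undefined for a constant resample
        stats.append(pearson_r(bx, by))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical item-level data: informativity score and listener error rate.
informativity_scores = [-0.10, 0.00, 0.20, 0.40, 0.50, 0.70]
error_rates = [0.60, 0.50, 0.40, 0.20, 0.15, 0.05]

r = pearson_r(informativity_scores, error_rates)
ci_low, ci_high = bootstrap_ci(informativity_scores, error_rates)
print(f"r = {r:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```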

Finally, we examined how these error rates change over the course of the interaction. If the effort a listener chooses to exert depends on their expectations about the speaker’s informativity, we would expect them to gradually re-calibrate their expectations through repeated observations of the speaker’s behavior. That is, listeners (and speakers in unscripted interactions) may learn that the allocation of perspective-taking they initially adopted is not sufficient and flexibly adjust the extent to which they weight their partner’s perspective, leading to fewer errors on later trials. As a first test of this hypothesis, we ran a mixed-effects logistic regression predicting whether participants made an error on critical trials as a function of the trial’s position in the sequence (coded one through four). We included random intercepts and slopes for each pair of participants, and used a fully Bayesian fitting procedure (Bürkner, 2017) because the random effect structure was too complex to converge using standard maximum likelihood methods. We found a significant decrease in the probability of critical errors (i.e. attempting to move hidden objects) across both unscripted and scripted conditions () from an average of 43% on the first critical trial to only 30% on the fourth and final trial.
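One way to write down the regression described above is given below. The notation is illustrative rather than taken from the original analysis: $i$ indexes critical trials, $j$ indexes dyads, and $t_{ij} \in \{1, 2, 3, 4\}$ is the position of the critical trial in the sequence. The correlated random-effects covariance $\Sigma$ is an assumption, and any additional predictors (such as condition) are omitted because the text does not specify them.

```latex
\begin{align}
\operatorname{logit} P(\mathrm{error}_{ij} = 1) &= (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\, t_{ij}, \\
(u_{0j}, u_{1j})^\top &\sim \mathcal{N}(\mathbf{0}, \Sigma).
\end{align}
```

In this notation, a decrease in the probability of critical errors over the course of the interaction corresponds to $\beta_1 < 0$.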

3.3 Discussion

Experiment 2 was designed to test the broader consequences of the speaker behavior we directly isolated in our first experiment. If speakers do in fact allocate effort to produce more informative utterances in the presence of occlusions, how much effort should resource-rational listeners exert toward visual perspective-taking in return? We grounded this question in the prior literature by turning from the simplified 3-dimensional stimulus space used in Experiment 1 to the stimuli used by Keysar et al. (2003) to elicit surprising failures of listener perspective-taking. By comparing the utterances produced by a naive speaker to the scripted utterances produced by confederates in prior work, we found further evidence that naive speakers spontaneously produce costlier and more informative utterances than would be required from their own perspective and that their contribution to the division of labor leads to fewer listener errors. Additionally, error rates decreased over the course of interaction, suggesting that even if listeners’ initial expectations of the division of labor were violated, they could adapt by increasing their perspective-taking effort. These results prompt several questions.

First, while the small number of features along which the finite stimulus space varied in Experiment 1 made it straightforward for speakers to anticipate the identity of hidden objects and provide maximally distinguishing expressions, it is computationally implausible that speakers could enumerate all possible hidden distractors in the open-ended space of objects used in Experiment 2. What algorithm speakers use to nevertheless produce more redundant and informative descriptions in this open-ended space remains an open question. One possibility is that speakers use the distribution of visible objects as a cue to the distribution of hidden objects, or that visible objects serve as anchors in a truncated search of semantic space. Another possibility is that speakers do not consider specific distractors at all and use the uncertainty introduced by occlusions as a generic cue to increase their production effort along accessible properties.

Second, while our results closely matched those of Keysar et al. (2003), several key differences between the procedure of our online version and the original in-lab version prevent it from being considered a direct replication. Most prominently, there are important differences between the textual and verbal modalities with implications for the speaker’s cost of production and the listener’s processing mechanisms. Listeners in an in-lab version may make eye movements toward possible targets before the utterance has been completed, while participants in our version read the message in its entirety after it had been sent. Additionally, because we were not able to obtain the scripts that confederates in prior work used on filler instructions (or even the identity of filler objects), it is possible that listeners in our scripted condition adapted to different input between critical items. In particular, we observed that speakers in our scripted condition used highly specific descriptions for the portion of trials on which they were allowed to freely send messages (e.g. “the red over ear headphones” when there was only one pair of headphones). These filler trials perhaps set even stronger expectations of hyper-informativity, leading to larger prediction error when scripted labels were substituted in.

Finally, we note that the critical items introduced by Keysar et al. (2003) were highly heterogeneous from a linguistic point of view. They included homonyms (“mouse” for a visible stuffed animal and hidden computer device), shared basic-level terms (e.g. “brush” for a visible round-brush and a hidden flat-brush), size contrasts (e.g. “large candle” for a visible large candle and an even larger hidden candle), and position contrasts (e.g. “top block” for a visible block on the second-to-top row and a hidden block on the top row). The idea that speakers naturally take on more effort in the face of occlusions applies equally to all of these, with naive speakers often mentioning multiple properties (e.g. “the clear audio cassette tape”). This level of informativity was especially effective on contrast trials, where speakers tended to avoid the relative size or position adjectives used in the confederate’s script, and would instead appeal to other descriptive properties (e.g. the color or style of candle) which happened to be more disambiguating.

However, because the speaker cannot know the relevant dimension for distinguishing the target from a hidden distractor, their additional effort did not always pay off. For example, the highest proportion of errors made in the unscripted condition occurred on the “brush” item, where the target and hidden distractor were so similar that almost any increase in specificity would fail to distinguish them. This limitation of naive speakers reflects the importance of labor and resource rationality on both sides. Our findings suggest that the speaker is expected to contribute more effort than previously recognized, easing the listener’s perspective-taking burden, but speakers too are resource-limited and do not produce infinitely informative utterances. Even when they expect to be interacting with an appropriately informative speaker, listeners are not freed of the burden of perspective-taking; they are simply justified in allocating relatively less effort to perspective-taking than previously considered “optimal” or “rational.”

4 General Discussion

The longstanding debate over the role of theory of mind in communication has largely centered around the extent to which listeners (or speakers) deviate from “optimal” perspective-taking toward egocentric influences (Barr and Keysar, 2006; Hanna et al., 2003). Our work aims to present a more nuanced analysis of how resource-constrained speakers and listeners nonetheless make reasonable decisions about how to allocate these limited resources based on contextual expectations. In particular, the Gricean cooperative principle emphasizes a natural division of labor in how the joint effort of being cooperative is shared (Clark, 1996; Mainwaring et al., 2003). It can be asymmetric when one partner is initially expected to, and able to, take on more complex, costly reasoning than the other, in the form of visual perspective-taking, pragmatic inference, or avoiding further exchanges of clarification and repair. One such case is when the speaker has uncertainty over what the listener can see, as in the director-matcher task. Our Rational Speech Act (RSA) formalization of cooperative reasoning in this context predicts that speakers (directors) naturally increase the informativity of their referring expressions to hedge against the increased risk of misunderstanding; Experiment 1 presents direct evidence in support of this hypothesis.
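The qualitative prediction that uncertainty about hidden objects pushes speakers toward longer descriptions can be illustrated with a toy sketch in the spirit of RSA. It is not the model specification used in this paper: the two-feature object space, the uniform prior over what the hidden cell might contain, and the values of `alpha` and `cost` are all assumptions chosen for illustration.

```python
import math
from itertools import product

SHAPES = ["square", "circle"]
COLORS = ["red", "blue"]
OBJECTS = list(product(SHAPES, COLORS))  # toy object space (assumption)

def matches(utterance, obj):
    """Literal semantics: every word in the utterance is a feature of the object."""
    return all(word in obj for word in utterance)

def literal_listener(target, utterance, context):
    """Probability that a literal listener picks the target token,
    choosing uniformly among objects in `context` that match the utterance.
    `context` is assumed to include the target itself."""
    if not matches(utterance, target):
        return 0.0
    n_matching = sum(matches(utterance, o) for o in context)
    return 1.0 / n_matching

def speaker_probs(target, visible, hidden_candidates, alpha=5.0, cost=0.1):
    """Softmax speaker over utterances that are true of the target.
    Utility is the expected log listener accuracy, averaged over what the
    hidden cell might contain, minus a per-word production cost."""
    utterances = [(target[0],), (target[1],), target]
    scores = []
    for u in utterances:
        if hidden_candidates:
            expected = sum(
                math.log(literal_listener(target, u, visible + [h]))
                for h in hidden_candidates
            ) / len(hidden_candidates)
        else:
            expected = math.log(literal_listener(target, u, visible))
        scores.append(alpha * (expected - cost * len(u)))
    z = sum(math.exp(s) for s in scores)
    return {" ".join(u): math.exp(s) / z for u, s in zip(utterances, scores)}

target = ("square", "red")
visible = [target, ("circle", "blue")]

no_occlusion = speaker_probs(target, visible, hidden_candidates=[])
occlusion = speaker_probs(target, visible, hidden_candidates=OBJECTS)

print("P(two-word description), no occlusion:", round(no_occlusion["square red"], 3))
print("P(two-word description), occlusion:   ", round(occlusion["square red"], 3))
```

In this toy setting, either single word suffices when nothing is hidden, so the extra word is pure cost. Once the hidden cell may contain an object sharing a feature with the target, the two-word description lowers the expected risk of misunderstanding enough to outweigh its cost.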

Importantly, when the director (speaker) is expected to be additionally informative, communication can be successful even when the matcher (listener) contributes less than “optimal” perspective-taking effort. Indeed, the matcher will actually minimize joint effort by not taking the director’s visual perspective. This suggests a resource rational explanation of when and why listeners downweight the speaker’s visual perspective; they do so when they expect the speaker to disambiguate referents sufficiently. While adaptive in most natural communicative contexts, such neglect might backfire and lead to errors when the speaker (inexplicably) violates this expectation. From this point of view, the “failure” of listener theory of mind in these tasks is not really a failure; instead, it suggests that both speakers and listeners may use theory of mind to know when (and how much) they should expect others to be cooperative and informative, and subsequently allocate their resources accordingly (Griffiths et al., 2015). Experiment 2 is consistent with this hypothesis; when directors used underinformative scripted instructions (taken from prior work), listeners made significantly more errors than when speakers were allowed to provide referring expressions at their natural level of informativity, and speaker informativeness strongly modulated listener error rates.

Our work adds to the growing literature on the debate over the role of pragmatics in the director-matcher task. A recent study questions the communicative nature of the task itself by showing that selective attention alone is sufficient for successful performance on this task, and that listeners become suspicious of the director’s visual access when the director shows unexpectedly high levels of specificity in their referring expressions (Rubio-Fernández, 2017). Our results further bolster the argument that pragmatic reasoning about appropriate levels of informativity is an integral aspect of theory of mind use in the director-matcher task (and communication more generally). Note, however, that in Rubio-Fernández (2017), participants became suspicious, while in our study participants overtrusted the speaker to be informative; a more detailed look at differences between experimental paradigms, as well as further experimental work, is necessary to better understand why participants had different expectations about the speaker. Prior work also suggests that although speakers tend to be over-informative in their referring expressions (Koolen et al., 2011), a number of situational factors (e.g., perceptual saliency of referents) can modulate this tendency. Our work hints at an additional principle that guides speaker informativity: speakers maintain uncertainty about the listener’s visual context and their ability to disambiguate the referent in that context.

Additionally, while our model builds on probabilistic models weighting different perspectives (Heller et al., 2016; Mozuraitis et al., 2018), we leave the formal integration of resource-rational recursive reasoning mechanisms with perspective-weighting mechanisms for future work. While Mozuraitis et al. (2018) focused on cases where the speaker has private information unknown to the listener, our model focuses on the reverse case: how speakers behave when they know that the listener has additional private information (Keysar et al., 2003). Furthermore, whether the allocation of perspective-taking resources is a fixed strategy or one that adjusts dynamically remains an open question: given sufficient evidence of an unusually underinformative partner, listeners may realize that vigilance about which objects are occluded yields a more effective strategy for the immediate interaction. We reported indirect evidence in support of this idea, but an important direction for future work is to directly explore listener adaptability in adjusting their use of visual perspective-taking as a function of Gricean expectations for a given partner (Grodner and Sedivy, 2011; Pogue et al., 2016; Ryskin et al., 2019). Such adaptation could be particularly functionally important in light of individual differences in working memory or executive control: variability in the capabilities of different partners should lead to variability in the appropriate division of labor, and it may not be possible to anticipate at the outset of an interaction.

Finally, while our experiments have focused directly on the demands of asymmetries in visual perspective, closely following the design of Keysar et al. (2003), variations on this basic paradigm have also manipulated other dimensions of non-visual knowledge asymmetry, including those based on spoken information (Keysar et al., 1998b; Hanna et al., 2003), spatial cues (Schober, 1993; Galati and Avraamides, 2013), private pre-training on object labels (Wu and Keysar, 2007), cultural background (Isaacs and Clark, 1987), and other task-relevant information (Hanna and Tanenhaus, 2004; Yoon et al., 2012). We expect that each of these variants introduces subtly different processing demands and pragmatic expectations, but would be amenable to a similar resource rational analysis. Studies of speaker audience design during production have reversed the direction of the asymmetry so the speaker has private knowledge that the listener does not. We expect that resource rational consideration of the processing mechanisms of audience design in production (e.g., Ferreira, 2019) may similarly yield predictions about the extent to which private information leaks into speaker utterances (see also Nadig and Sedivy, 2002; Heller et al., 2012; Brown-Schmidt and Tanenhaus, 2008; Savitsky et al., 2011; Yoon and Brown-Schmidt, 2014; Lane et al., 2006).

In sum, our findings suggest that language use is well-adapted to contexts of uncertainty and knowledge asymmetry. The pragmatic use of theory of mind to navigate division of labor is also critical for other forms of social cooperation, including pedagogy (Shafto et al., 2014) and team-based problem solving (Woolley et al., 2010; Krafft, 2018). Enriching our notion of theory of mind use to encompass the resource rational deployment of these pragmatic inferences, not only expectations about what our partner knows or desires, may shed new light on the flexibility of social interaction more broadly.

5 Acknowledgements

This manuscript is based in part on work presented at the 38th Annual Conference of the Cognitive Science Society. An early pilot of Experiment 2 was originally conducted with input from Michael Frank and Desmond Ong. We’re grateful to Vic Ferreira and Judith Fan for thoughtful conversations and to Boaz Keysar for providing selected materials for our replication.

Unless otherwise mentioned, all analyses and materials were preregistered at Code and materials for reproducing the experiment as well as all data and analysis scripts are open and available at


  • C. L. Baker, J. Jara-Ettinger, R. Saxe, and J. B. Tenenbaum (2017) Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour 1, pp. 0064. Cited by: §1.
  • D. J. Barr and B. Keysar (2006) Perspective taking and the coordination of meaning in language use. In Handbook of Psycholinguistics (Second Edition), pp. 901–938. Cited by: §4.
  • D. J. Barr (2014) Perspective Taking and Its Impostors in Language Use: Four Patterns. The Oxford handbook of language and social psychology, pp. 98. Cited by: §1.
  • J. Bavelas and S. Healing (2013) Reconciling the effects of mutual visibility on gesturing: a review. Gesture 13 (1), pp. 63–92. Cited by: §1.
  • L. Bergen and D. J. Grodner (2012) Speaker knowledge influences the comprehension of pragmatic inferences.. J Exp Psychol Learn Mem Cogn 38 (5), pp. 1450. Cited by: §1.
  • E. E. Bradford, I. Jentzsch, and J. Gomez (2015) From self to social cognition: theory of mind mechanisms and their relation to executive functioning. Cognition 138, pp. 21–34. Cited by: §1.
  • S. E. Brennan and H. H. Clark (1996) Conceptual pacts and lexical choice in conversation.. J Exp Psychol Learn Mem Cogn 22 (6), pp. 1482. Cited by: §1, §2.1.2.
  • S. Brown-Schmidt and J. E. Hanna (2011) Talking in another person’s shoes: incremental perspective-taking in language processing. Dialogue and Discourse 2, pp. 11–33. Cited by: §1.
  • S. Brown-Schmidt and M. K. Tanenhaus (2008) Real-time investigation of referential domains in unscripted conversation: a targeted language game approach. Cognitive Science 32 (4), pp. 643–684. Cited by: §1, §4.
  • S. Brown-Schmidt (2009) The role of executive function in perspective taking during online language comprehension. Psychonomic bulletin & review 16 (5), pp. 893–900. Cited by: §1.
  • P. Bürkner (2017) Advanced bayesian multilevel modeling with the r package brms. arXiv preprint arXiv:1705.11123. Cited by: §3.2.2.
  • F. Callaway, F. Lieder, P. Das, S. Gul, P. M. Krueger, and T. Griffiths (2018) A resource-rational analysis of human planning.. In Proceedings of the 37th annual conference of the cognitive science society, Cited by: §1.
  • H. H. Clark (1996) Using language. Cambridge university press Cambridge. Cited by: §1, §1, §4.
  • R. Dale and E. Reiter (1995) Computational interpretations of the gricean maxims in the generation of referring expressions. Cognitive science 19 (2), pp. 233–263. Cited by: §2.1.2.
  • J. Degen and M.K. Tanenhaus (to appear) Constraint-based pragmatic processing. In Handbook of Experimental Semantics and Pragmatics., C. Cummins and N. Katsos (Eds.), Cited by: §1.
  • N. Epley, B. Keysar, L. Van Boven, and T. Gilovich (2004) Perspective taking as egocentric anchoring and adjustment.. Journal of personality and social psychology 87 (3), pp. 327. Cited by: §1.
  • H. J. Ferguson, I. Apperly, J. Ahmad, M. Bindemann, and J. Cane (2015) Task constraints distinguish perspective inferences from perspective use during discourse interpretation in a false belief task. Cognition 139, pp. 50–70. Cited by: §1.
  • V. S. Ferreira (2019) A Mechanistic Framework for Explaining Audience Design in Language Production. Annual Review of Psychology 70 (1), pp. 29–51. External Links: Link, Document Cited by: §1, §4.
  • M. C. Frank and N. D. Goodman (2012) Predicting pragmatic reasoning in language games. Science 336 (6084), pp. 998–998. Cited by: §1, §1, §2.3.
  • M. Franke and G. Jäger (2016) Probabilistic pragmatics, or why bayes’ rule is probably important for pragmatics. Zeitschrift für sprachwissenschaft 35 (1), pp. 3–44. Cited by: §1, §1, §2.3.
  • A. Galati and M. N. Avraamides (2013) Flexible spatial perspective-taking: conversational partners weigh multiple cues in collaborative tasks. Frontiers in human neuroscience 7, pp. 618. Cited by: §4.
  • N. D. Goodman and M. C. Frank (2016) Pragmatic language interpretation as probabilistic inference. Trends Cognitive Science 20 (11), pp. 818 – 829. Cited by: §1, §1, §2.3.
  • N. D. Goodman and A. Stuhlmüller (2013) Knowledge and implicature: modeling language understanding as social cognition. Topics in Cognitive Science 5 (1), pp. 173–184. Cited by: §2.3, Appendix A: Derivation of qualitative model predictions.
  • H. P. Grice (1975) Logic and conversation. In Syntax and Semantics, P. Cole and J. Morgan (Eds.), pp. 43–58. Cited by: §1.
  • T. L. Griffiths, F. Lieder, and N. D. Goodman (2015) Rational use of cognitive resources: levels of analysis between the computational and the algorithmic. Top Cognitive Science 7 (2), pp. 217–229. Cited by: §1, §4.
  • D. Grodner and J. C. Sedivy (2011) The effect of speaker-specific information on pragmatic inferences. The processing and acquisition of reference, pp. 239. Cited by: §4.
  • J. E. Hanna, M. K. Tanenhaus, and J. C. Trueswell (2003) The effects of common ground and perspective on domains of referential interpretation. J Mem Lang 49 (1), pp. 43–61. Cited by: §1, §4, §4.
  • J. E. Hanna and M. K. Tanenhaus (2004) Pragmatic effects on reference resolution in a collaborative task: evidence from eye movements. Cognitive Science 28 (1), pp. 105–115. Cited by: §4.
  • D. Heller, C. Parisien, and S. Stevenson (2016) Perspective-taking behavior as the probabilistic weighing of multiple domains.. Cognition 149, pp. 104. Cited by: §1, §4.
  • D. Heller, K. S. Gorman, and M. K. Tanenhaus (2012) To name or to describe: shared knowledge affects referential form. Topics in cognitive science 4 (2), pp. 290–305. Cited by: §4.
  • D. Heller, D. Grodner, and M. K. Tanenhaus (2008) The role of perspective in identifying domains of reference. Cognition 108 (3), pp. 831–836. Cited by: §1.
  • E. A. Isaacs and H. H. Clark (1987) References in conversation between experts and novices.. Journal of Experimental Psychology: General 116 (1), pp. 26. Cited by: §4.
  • J. Jara-Ettinger, H. Gweon, L. E. Schulz, and J. B. Tenenbaum (2016) The naïve utility calculus: computational principles underlying commonsense psychology. Trends Cognitive Science 20 (8), pp. 589–604. Cited by: §1.
  • O. Jouravlev, R. Schwartz, D. Ayyash, Z. Mineroff, E. Gibson, and E. Fedorenko (2019) Tracking colisteners’ knowledge states during language comprehension. Psychological science 30 (1), pp. 3–19. Cited by: §1.
  • J. T. Kao, J. Y. Wu, L. Bergen, and N. D. Goodman (2014) Nonliteral understanding of number words. Proceedings of the National Academy of Sciences 111 (33), pp. 12002–12007. Cited by: §2.3.
  • B. Keysar, D. J. Barr, and W. S. Horton (1998a) The egocentric basis of language use: insights from a processing approach. Current directions in psychological science 7 (2), pp. 46–49. Cited by: §1.
  • B. Keysar, D. J. Barr, J. A. Balin, and T. S. Paek (1998b) Definite reference and mutual knowledge: Process models of common ground in comprehension. Journal of Memory and Language 39 (1), pp. 1–20. External Links: ISSN 1096-0821(Electronic),0749-596X(Print) Cited by: §4.
  • B. Keysar, S. Lin, and D. J. Barr (2003) Limits on theory of mind use in adults. Cognition 89 (1), pp. 25 – 41. Cited by: §1, §3.1.2, §3.1.2, §3.1.2, §3.2.1, §3.2.2, §3.3, §3.3, §3.3, §3, §4, §4.
  • B. Keysar (2007) Communication and miscommunication: The role of egocentric processes. Intercultural Pragmatics 4 (1), pp. 71–84. Cited by: §1.
  • W. Kool and M. Botvinick (2013) The intrinsic cost of cognitive control. Behav Brain Sci 36 (6), pp. 697–8. Cited by: §1.
  • W. Kool and M. Botvinick (2018) Mental labour. Nature Human Behaviour 2 (12), pp. 899 (En). External Links: ISSN 2397-3374, Link, Document Cited by: §1.
  • R. Koolen, A. Gatt, M. Goudbeek, and E. Krahmer (2011) Factors causing overspecification in definite descriptions. J Pragmat 43 (13), pp. 3231–3250. Cited by: §4.
  • P. M. Krafft (2018) A simple computational theory of general collective intelligence. Topics in cognitive science. Cited by: §4.
  • A. K. Kuhlen and S. E. Brennan (2013) Language in dialogue: when confederates might be hazardous to your data. Psychon Bull Rev 20 (1), pp. 54–72. Cited by: §1.
  • L. W. Lane, M. Groisman, and V. S. Ferreira (2006) Don’t talk about pink elephants! Speakers’ control over leaking private information during language production. Psychological science 17 (4), pp. 273–277. Cited by: §4.
  • F. Lieder, T. L. Griffiths, and M. Hsu (2018) Overrepresentation of extreme events in decision making reflects rational use of cognitive resources.. Psychological review 125 (1), pp. 1. Cited by: §1.
  • F. Lieder and T. L. Griffiths (2019) Resource-rational analysis: understanding human cognition as the optimal use of limited computational resources. Behavioral and Brain Sciences, pp. 1–85. Cited by: §1.
  • S. Lin, B. Keysar, and N. Epley (2010) Reflexively mindblind: using theory of mind to interpret behavior requires effortful attention. J Exp Soc Psychol 46 (3), pp. 551–556. Cited by: §1.
  • J. Low and J. Perner (2012) Implicit and explicit theory of mind: state of the art. British Journal of Developmental Psychology 30 (1), pp. 1–13. Cited by: §1.
  • S. D. Mainwaring, B. Tversky, M. Ohgishi, and D. J. Schiano (2003) Descriptions of simple spatial scenes in english and japanese. Spatial Cognition and Computation 3 (1), pp. 3–42. Cited by: §4.
  • D. Marr (2010) Vision : A computational investigation into the human representation and processing of visual information. MIT Press, Cambridge, Mass.. Cited by: §1.
  • W. Monroe, R. X. D. Hawkins, N. D. Goodman, and C. Potts (2017) Colors in context: a pragmatic neural model for grounded language understanding. arXiv preprint arXiv:1703.10186. Cited by: §2.1.2.
  • M. Mozuraitis, S. Stevenson, and D. Heller (2018) Modeling reference production as the probabilistic combination of multiple perspectives. Cognitive Science 42 (S4). Cited by: §1, §4.
  • A. S. Nadig and J. C. Sedivy (2002) Evidence of perspective-taking constraints in children’s on-line reference resolution. Psychological Science 13 (4), pp. 329–336. Cited by: §1, §4.
  • T. Pechmann (1989) Incremental speech production and referential overspecification. Linguistics 27 (1), pp. 89–110. Cited by: §2.1.2.
  • A. Pogue, C. Kurumada, and M. K. Tanenhaus (2016) Talker-specific generalization of pragmatic inferences based on under-and over-informative prenominal adjective use. Front Psychol 6, pp. 2035. Cited by: §4.
  • D. Premack and G. Woodruff (1978) Does the chimpanzee have a theory of mind?. Behavioral and brain sciences 1 (04), pp. 515–526. Cited by: §1.
  • P. Rubio-Fernández and J. Jara-Ettinger (2018) Joint inferences of speakers’ beliefs and referents based on how they speak.. In Proceedings of the 37th annual conference of the cognitive science society, Cited by: §1.
  • P. Rubio-Fernández, F. Mollica, M. O. Ali, and E. Gibson (2019) How do you know that? automatic belief inferences in passing conversation. Cognition 193, pp. 104011. Cited by: §1.
  • P. Rubio-Fernández (2017) The director task: a test of theory-of-mind use or selective attention?. Psychon Bull Rev 24 (4), pp. 1121–1128. Cited by: §1, §4.
  • R. Ryskin, C. Kurumada, and S. Brown-Schmidt (2019) Information integration in modulation of pragmatic inferences during online language comprehension. Cognitive Science 43 (8), pp. e12769. Cited by: §4.
  • K. Savitsky, B. Keysar, N. Epley, T. Carter, and A. Swanson (2011) The closeness-communication bias: Increased egocentrism among friends versus strangers. Journal of Experimental Social Psychology 47 (1), pp. 269–273. Cited by: §4.
  • R. Saxe, L. E. Schulz, and Y. V. Jiang (2006) Reading minds versus following rules: dissociating theory of mind and executive control in the brain. Social neuroscience 1 (3-4), pp. 284–298. Cited by: §1.
  • M. F. Schober (1993) Spatial perspective-taking in conversation. Cognition 47 (1), pp. 1–24. Cited by: §4.
  • P. Shafto, N. D. Goodman, and T. L. Griffiths (2014) A rational account of pedagogical reasoning: teaching by, and learning from, examples. Cogn Psychol 71, pp. 55–89. Cited by: §4.
  • A. Shenhav, M. M. Botvinick, and J. D. Cohen (2013) The expected value of control: an integrative theory of anterior cingulate cortex function. Neuron 79 (2), pp. 217–240. Cited by: §1.
  • A. Shenhav, S. Musslick, F. Lieder, W. Kool, T. L. Griffiths, J. D. Cohen, and M. M. Botvinick (2017) Toward a rational and mechanistic account of mental effort. Annual review of neuroscience 40, pp. 99–124. Cited by: §1.
  • M. K. Tanenhaus and S. Brown-Schmidt (2008) Language processing in the natural world. Philos Trans R Soc Lond B Biol Sci 363 (1493), pp. 1105–1122. Cited by: §1.
  • M. Tomasello (2009) The cultural origins of human cognition. Harvard University Press. Cited by: §1.
  • K. van Deemter (2016) Computational models of referring: a study in cognitive science. MIT Press. Cited by: §1.
  • A. W. Woolley, C. F. Chabris, A. Pentland, N. Hashmi, and T. W. Malone (2010) Evidence for a collective intelligence factor in the performance of human groups. Science 330 (6004), pp. 686–688. Cited by: §4.
  • S. Wu and B. Keysar (2007) The effect of information overlap on communication effectiveness. Cognitive Science 31 (1), pp. 169–181. Cited by: §4.
  • S. O. Yoon and S. Brown-Schmidt (2014) Adjusting conceptual pacts in three-party conversation. Journal of Experimental Psychology: Learning, Memory, and Cognition 40 (4), pp. 919. Cited by: §4.
  • S. O. Yoon, S. Koh, and S. Brown-Schmidt (2012) Influence of perspective and goals on reference production in conversation. Psychon Bull Rev 19 (4), pp. 699–707. Cited by: §4.

Appendix A: Derivation of qualitative model predictions

Our experiments are motivated by the Gricean observation that speakers should attempt to be more informative when there is an asymmetry in visual access, such that their partner sees something they do not. In this appendix, we formalize this scenario in a computational model of communication as recursive social reasoning and prove that the predicted increase in informativity qualitatively holds under fairly unrestrictive conditions.

Following recent advances in the Rational Speech Act (RSA) framework, we define a speaker $S$ as a decision-theoretic agent who must choose a referring expression $u$ to refer to a target object $o$ in a context $C$ by (soft-)maximizing a utility function $U$, with speaker optimality parameter $\alpha$:

$$S(u \mid o, C) \propto \exp\{\alpha \, U(u; o, C)\}$$

The basic utility used in RSA models captures the informativeness of each utterance to an imagined literal listener agent $L$ who is attempting to select the target object from alternatives in context:

$$U_{\text{basic}}(u; o, C) = \log L(o \mid u, C)$$

This information-theoretic expression measures how certain the listener becomes about the intended object after hearing the utterance. The literal listener is assumed to update their beliefs about the target object according to Bayesian inference, conditioning on the literal meaning of the utterance being true of it:

$$L(o \mid u, C) \propto \mathcal{L}(u, o)$$

where normalization takes place over objects $o \in C$ and $\mathcal{L}$ represents the lexical semantics of $u$. If $u$ is true of $o$ then $\mathcal{L}(u, o) = 1$; otherwise, $\mathcal{L}(u, o) = 0$.
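The speaker, basic utility, and literal listener described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' WebPPL implementation: the toy objects, the word-matching semantics, the uniform listener prior, and all function names are assumptions made for the example.

```python
import math

# Hypothetical toy context: each object is a (color, shape) pair.
CONTEXT = [("blue", "square"), ("blue", "circle"), ("red", "circle")]
UTTERANCES = ["blue", "square", "blue square"]


def meaning(utterance, obj):
    """Boolean lexical semantics: 1 if every word names a feature of obj."""
    return 1.0 if all(word in obj for word in utterance.split()) else 0.0


def literal_listener(obj, utterance, context):
    """Probability of selecting obj, normalizing over objects in context."""
    total = sum(meaning(utterance, o) for o in context)
    return meaning(utterance, obj) / total if total > 0 else 0.0


def basic_utility(utterance, obj, context):
    """Log-probability that the literal listener selects the target."""
    p = literal_listener(obj, utterance, context)
    return math.log(p) if p > 0 else float("-inf")


def speaker(obj, context, utterances, alpha=1.0):
    """Soft-max choice over utterances (alpha > 0 is the optimality parameter)."""
    scores = [math.exp(alpha * basic_utility(u, obj, context)) for u in utterances]
    z = sum(scores)
    return {u: s / z for u, s in zip(utterances, scores)}


probs = speaker(("blue", "square"), CONTEXT, UTTERANCES)
```

In this toy context, "blue" is true of two objects while "square" and "blue square" each pick out the target uniquely, so the speaker assigns the ambiguous utterance the lowest probability.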

This basic setup assumes that the speaker reasons about a listener sharing the same context $C$ in common ground. How should it be extended to handle asymmetries in visual access between the speaker and listener, where the speaker has uncertainty over the possible distractors behind the occlusions? In the RSA framework, speaker uncertainty is represented straightforwardly by a prior over the state of the world: for example, Goodman and Stuhlmüller (2013) examined a case where the speaker has limited perceptual access to the objects they are describing. For the director-matcher task, we construct this prior by positing a space of alternative objects $\mathcal{O}$, introducing uncertainty $P(o_h)$ over which object $o_h \in \mathcal{O}$, if any, is hidden behind an occlusion, and marginalizing over these alternatives when reasoning about the listener.

This gives us a utility for conditions of asymmetries in visual access:

$$U_{\text{asym}}(u; o, C_s) = \sum_{o_h \in \mathcal{O}} P(o_h) \log L(o \mid u, C_s \cup \{o_h\})$$

where $C_s$ denotes the set of objects in context that the speaker perceives.
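As a rough illustration of this marginalization, the sketch below averages the listener's log-probability of the target over a small set of possible hidden objects. The toy objects, the uniform prior, the use of `None` to stand for "nothing is hidden", and the function names are all assumptions for the example rather than details taken from the authors' model code.

```python
import math


def meaning(utterance, obj):
    """Boolean lexical semantics: 1 if every word names a feature of obj."""
    return 1.0 if all(word in obj for word in utterance.split()) else 0.0


def literal_listener(obj, utterance, context):
    """Probability of selecting obj, normalizing over objects in context."""
    total = sum(meaning(utterance, o) for o in context)
    return meaning(utterance, obj) / total if total > 0 else 0.0


def asymmetric_utility(utterance, target, visible, alternatives, prior):
    """Expected log-probability of the target, marginalizing over hidden objects."""
    expected = 0.0
    for hidden, p_hidden in zip(alternatives, prior):
        # The listener's context is what the speaker sees plus the hidden object.
        context = visible + ([hidden] if hidden is not None else [])
        p = literal_listener(target, utterance, context)
        if p == 0:
            return float("-inf")
        expected += p_hidden * math.log(p)
    return expected


# Hypothetical scene: the speaker sees two objects; one more may be occluded.
visible = [("blue", "square"), ("red", "circle")]
target = ("blue", "square")
alternatives = [None, ("blue", "circle"), ("red", "square")]
prior = [1 / 3, 1 / 3, 1 / 3]

u_short = asymmetric_utility("square", target, visible, alternatives, prior)
u_long = asymmetric_utility("blue square", target, visible, alternatives, prior)
```

Both utterances identify the target perfectly among the visible objects, but only "blue square" remains unambiguous under every hypothesized hidden object, so it receives the higher utility once occlusions are taken into account.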

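We define “specificity” extensionally, in the sense that if $u_0$ is more specific than $u_1$, then the set of objects for which $u_0$ is true is a subset of the set of objects for which $u_1$ is true:


Utterance $u_0$ is said to be more specific than $u_1$ iff $\mathcal{L}(u_0, o_h) \le \mathcal{L}(u_1, o_h)$ for all $o_h \in \mathcal{O}$ and there exists a subset of objects $\mathcal{O}^* \subset \mathcal{O}$ such that $\sum_{o^* \in \mathcal{O}^*} P(o^*) > 0$ and $\mathcal{L}(u_0, o^*) < \mathcal{L}(u_1, o^*)$ for $o^* \in \mathcal{O}^*$.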
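For Boolean meanings, the extensional reading of specificity amounts to a proper-subset test on the sets of objects each utterance is true of. The sketch below illustrates only that reading; the toy objects and word-matching semantics are assumptions, and it ignores any condition on the prior over hidden objects.

```python
def extension(utterance, objects):
    """Set of objects that the utterance is literally true of."""
    return {obj for obj in objects if all(word in obj for word in utterance.split())}


def is_more_specific(u0, u1, objects):
    """True if u0's extension is a proper subset of u1's extension."""
    return extension(u0, objects) < extension(u1, objects)


# Hypothetical object space of (color, shape) pairs.
objects = [("blue", "square"), ("blue", "circle"), ("red", "square")]
narrower = is_more_specific("blue square", "blue", objects)
reverse = is_more_specific("blue", "blue square", objects)
unordered = is_more_specific("blue", "square", objects)
```

Here "blue square" is more specific than "blue", while "blue" and "square" are not ordered with respect to each other because neither extension contains the other.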

We now show that the recursive reasoning model predicts that speakers should prefer more informative utterances in contexts with occlusions. In other words, the asymmetry utility leads to a stronger preference for more specific referring expressions than the basic utility does.


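If $u_0$ is more specific than $u_1$, then the following holds for any target $o$ and shared context $C$:

$$\frac{S_{\text{asym}}(u_0 \mid o, C)}{S_{\text{asym}}(u_1 \mid o, C)} > \frac{S_{\text{basic}}(u_0 \mid o, C)}{S_{\text{basic}}(u_1 \mid o, C)}$$

Since $S(u_0 \mid o, C) / S(u_1 \mid o, C) = \exp\{\alpha \, [U(u_0; o, C) - U(u_1; o, C)]\}$, it is sufficient to show

$$U_{\text{asym}}(u_0; o, C) - U_{\text{asym}}(u_1; o, C) > U_{\text{basic}}(u_0; o, C) - U_{\text{basic}}(u_1; o, C)$$

We first break apart the sum on the left-hand side:

$$U_{\text{asym}}(u_0; o, C) - U_{\text{asym}}(u_1; o, C) = \underbrace{\sum_{o^* \in \mathcal{O}^*} P(o^*) \log \frac{L(o \mid u_0, C \cup \{o^*\})}{L(o \mid u_1, C \cup \{o^*\})}}_{(1)} + \underbrace{\sum_{o_h \in \mathcal{O} \setminus \mathcal{O}^*} P(o_h) \log \frac{L(o \mid u_0, C \cup \{o_h\})}{L(o \mid u_1, C \cup \{o_h\})}}_{(2)}$$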

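By the definition of “more specific” and because we defined $\mathcal{O}^*$ to be precisely the subset of objects for which $\mathcal{L}(u_0, o^*) < \mathcal{L}(u_1, o^*)$, for objects $o_h$ in the complementary set $\mathcal{O} \setminus \mathcal{O}^*$ we have $\mathcal{L}(u_0, o_h) = \mathcal{L}(u_1, o_h)$. Therefore, for 2, , giving us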

For the ratio in 2, we can substitute the definition of the listener and simplify:


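Note that this proof also holds when an utterance-level cost term $\text{cost}(u)$ penalizing longer or more effortful utterances is incorporated into the utilities

$$U_{\text{asym}}(u; o, C_s) = \sum_{o_h \in \mathcal{O}} P(o_h) \log L(o \mid u, C_s \cup \{o_h\}) - \text{cost}(u)$$
$$U_{\text{basic}}(u; o, C) = \log L(o \mid u, C) - \text{cost}(u)$$

since the same constant appears on both sides of the inequality. In principle, it can also be extended to real-valued meanings $\mathcal{L}(u, o) \in [0, 1]$, though additional assumptions must be made.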

Appendix B: Quantitative model fit for Experiment 1

In addition to the qualitative predictions derived in the previous section, our speaker model makes direct quantitative predictions about Experiment 1 data. Here, we describe the details of a Bayesian Data Analysis evaluating this model on the empirical data, and comparing it to an occlusion-blind model which does not reason about possible hidden objects.

Because there were no differences observed in production based on the particular levels of target features (e.g. whether the target was blue or red), we collapse across these details and only feed the model which features of each distractor differed from the target on each trial. After this simplification, there were only 4 possible contexts: distractor-absent contexts, where the other objects differed in every dimension, and three varieties of distractor-present contexts, where the critical distractor differed in only shape, shape and color, or shape and texture. In addition, we included in the model information about whether each trial had cells occluded or not.

The space of utterances used in our speaker model is derived from our feature annotations: for each trial, the speaker model selected among 7 utterances referring to each combination of features: only mentioning the target’s shape, only mentioning the target’s color, mentioning the shape and the color, and so on. For the set of alternative objects $\mathcal{O}$, we used the full 64-object stimulus space used in our experiment design, and we placed a uniform prior over these objects such that the occlusion-sensitive speaker assumed they were equally likely to be hidden.
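The count of 7 utterances follows from taking every non-empty combination of the three annotated features. The short sketch below enumerates them and builds a uniform prior over a 64-object space; the feature labels and the integer object identifiers are placeholders for illustration, not the stimulus encoding used in the experiment.

```python
from itertools import combinations

FEATURES = ("shape", "color", "texture")


def utterance_space(features=FEATURES):
    """All non-empty combinations of features a speaker could mention."""
    return [
        combo
        for size in range(1, len(features) + 1)
        for combo in combinations(features, size)
    ]


utterances = utterance_space()

# Uniform prior over a hypothetical 64-object space (objects indexed 0..63).
N_OBJECTS = 64
hidden_prior = {obj_id: 1 / N_OBJECTS for obj_id in range(N_OBJECTS)}
```

With three features there are 2^3 - 1 = 7 non-empty subsets, ranging from single-feature utterances such as `("shape",)` to the fully specified `("shape", "color", "texture")`.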

Our model has four free parameters which we infer from the data using Bayesian inference. (Note that this use of Bayesian statistics in analyzing and evaluating our cognitive model is completely dissociable from the assumption of Bayesian recursive reasoning within the model.) The speaker optimality parameter, $\alpha$, is a soft-max temperature such that at $\alpha = 1$, the speaker produces utterances directly proportional to their utility, and as $\alpha \to \infty$ the speaker maximizes. In addition, to account for the differential production of the three features (see Fig. 2B), we assume separate production costs for each feature: a texture cost $c_t$, a color cost $c_c$, and a shape cost $c_s$. We use (uninformative) uniform priors for all parameters.

We compute speaker predictions for a particular parameter setting using (nested) enumeration and infer the posterior over parameters using MCMC. We discard 5000 burn-in samples and then take 5000 samples from the posterior with a lag of 2. Our posterior predictives are computed from these posteriors by taking the expected number of features produced by the speaker, marginalizing over parameters and possible non-critical distractors in context (this captures the statistics of our experimental contexts, where there was always a distractor sharing the same color or texture as the target but differing in shape). Finally, to precisely compute the Bayes Factor, we enumerated over a discrete grid of parameter values in the prior. We implemented our models and conducted inference in the probabilistic programming language WebPPL (Goodman & Stuhlmüller, 2014). All code necessary to reproduce our model results is available at the project GitHub repository.

Appendix C: Multi-stage bootstrap procedure for Experiment 2

The statistical dependency structure of our ratings was more complex than standard mixed-effect model packages are designed to handle and the summary statistic we needed for our test was a simple difference score across conditions, so we instead implemented a simple multi-stage, non-parametric bootstrap scheme to appropriately account for different sources of variance. In particular, we needed to control for effects of judge, item, and speaker.

First, to control for the repeated measurements of each judge rating the informativity of all labels, we resampled our set of sixteen judge ids with replacement. For each label, we then computed informativity as the difference between the target and distractor fits within every judge’s ratings, and took the mean across our bootstrapped sample of judges. Next, we controlled for item effects by resampling our eight item ids with replacement. Finally, we resampled speakers from pairs within each condition (scripted vs. unscripted), and looked up the mean informativity of each utterance they produced for each of the resampled set of items. We then took the mean within each condition and computed the difference across conditions, which is our desired test statistic. We repeated this multi-stage resampling procedure 1000 times to get the bootstrapped distribution of our test statistic that we reported in the main text. Individual error bars in Fig. 4 are derived from the same procedure but without taking difference scores.
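The resampling stages described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' analysis code: the nested-dictionary data layout, the function name, the toy data, and the direction of the difference (unscripted minus scripted) are all hypothetical, and the sketch assumes every judge rated every label and every speaker produced a label for every item.

```python
import random
from statistics import mean


def bootstrap_difference(ratings, productions, n_boot=1000, seed=0):
    """Multi-stage bootstrap of the condition difference in informativity.

    ratings: {judge: {label: (target_fit, distractor_fit)}}
    productions: {condition: {speaker: {item: label}}}
    """
    rng = random.Random(seed)
    judges = list(ratings)
    labels = list(ratings[judges[0]])
    items = sorted({item for speakers in productions.values()
                    for by_item in speakers.values() for item in by_item})
    stats = []
    for _ in range(n_boot):
        # Stage 1: resample judges, then average each label's difference score.
        sampled_judges = rng.choices(judges, k=len(judges))
        informativity = {
            label: mean(ratings[j][label][0] - ratings[j][label][1]
                        for j in sampled_judges)
            for label in labels
        }
        # Stage 2: resample items.
        sampled_items = rng.choices(items, k=len(items))
        # Stage 3: resample speakers within each condition and look up scores.
        condition_means = {}
        for condition, speakers in productions.items():
            ids = list(speakers)
            sampled_speakers = rng.choices(ids, k=len(ids))
            condition_means[condition] = mean(
                informativity[speakers[s][item]]
                for s in sampled_speakers for item in sampled_items
            )
        # Test statistic: difference across conditions (direction assumed).
        stats.append(condition_means["unscripted"] - condition_means["scripted"])
    return stats


# Hypothetical toy data: label "a" is uninformative, label "b" is informative.
toy_ratings = {"j1": {"a": (5, 5), "b": (6, 5)}, "j2": {"a": (3, 3), "b": (4, 3)}}
toy_productions = {
    "scripted": {"s1": {"i1": "a", "i2": "a"}},
    "unscripted": {"s2": {"i1": "b", "i2": "b"}},
}
toy_stats = bootstrap_difference(toy_ratings, toy_productions, n_boot=50)
```

A confidence interval for the test statistic can then be read off the percentiles of the returned list.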

Figure 7: Screenshot of experiment interface.
Figure 8: Supplementary figure of model likelihoods.
Figure 9: Supplementary figure of parameter posteriors. All parameters shown on log scale. MAP estimates with 95% highest posterior density intervals are as follows: ; ; ;
Figure 10: Supplementary figure of heterogeneity in errors across the 8 object sets used in Experiment 2 (from Keysar, 2003). Error rates across object sets diverge significantly from a uniform distribution in both scripted () and unscripted () conditions under a non-parametric test.