# Overabundant Information and Learning Traps [1]

[1] We thank Vasilis Syrgkanis for insightful comments in early conversations about this project. We are also grateful to Aislinn Bohren, Ben Golub, Carlos Segura, and Yuichi Yamamoto for suggestions that improved the paper.

Annie Liang (University of Pennsylvania) and Xiaosheng Mu (Harvard University)

###### Abstract

We develop a model of social learning from overabundant information: Agents have access to many sources of information, and observation of all sources is not necessary in order to learn the payoff-relevant state. Short-lived agents sequentially choose to acquire a signal realization from the best source for them. All signal realizations are public. Our main results characterize two starkly different possible long-run outcomes, and the conditions under which each obtains: (1) efficient information aggregation, where the community eventually achieves the highest possible speed of learning; (2) “learning traps,” where the community gets stuck using a suboptimal set of sources and learns inefficiently slowly. A simple property of the correlation structure separates these two possibilities. In both regimes, we characterize which sources are observed in the long run and how often.

## 1 Introduction

In many learning problems, agents cannot design their information in a completely flexible way. Instead, they choose from a given, finite (though often large) set of information sources. For instance, a researcher studying depression cannot—at any cost—access arbitrarily precise signals about the importance of genetic factors. He can however acquire many kinds of information related to this question; for example, he might acquire neurochemical and genetic data from affected individuals, or observe the incidence of depression within families.

These sources of information contribute to learning in different ways, and the value of new information depends on its relationship to what is already known. Consequently, past information acquisitions can change the perception of which kinds of information are most useful. For example, research about the relationship between neurochemicals and depression increases the value of future neurochemical measurements. Models of information often abstract away from explicit description of heterogeneity across kinds of information, because it complicates the analysis of information acquisition.[2] But the relationships across these sources can have significant implications for behavior, in particular in dynamic learning environments where information is passed down over time.

[2] Exceptions include Borgers, Hernando-Veciana and Krahmer (2013), Chen and Waggoner (2016), and Chade and Eeckhout (2018), among others.

The main contribution of this paper is to identify a new externality driven by complementarities across sources, and to characterize the consequences of this externality for long-run aggregation of information. We show that past information acquisitions can shape long-run acquisitions in two starkly different ways:

• Efficient information aggregation: Past information helps future agents to identify the “best” kinds of information. At all sufficiently late periods, a social planner cannot improve on the history of acquisitions.

• Learning traps: Past information pushes future decision-makers to acquire information that leads to inefficiently slow learning. Early suboptimal choices propagate over time.

Relationships across the entire set of information sources are relevant to which of these outcomes emerges, and our main results reveal the key property that determines the outcome.

In our model, agents are indexed by (discrete) time and sequentially choose from a large number of information sources, each of which is associated with a signal about the payoff-relevant state. We allow for flexible correlation across the sources by modeling each kind of information as a (noisy) linear combination of the payoff-relevant state and a set of “confounding” variables.

In contrast to the classic sequential learning model (Bikhchandani, Hirshleifer and Welch, 1992; Banerjee, 1992; Smith and Sørensen, 2000), we suppose that all signal realizations are public. This departure permits us to focus on the externalities created by agents’ choice of kind of information, as opposed to the more frequently studied frictions that emerge from inference. We assume that each agent takes an action based on all prior information, and maximizes an individual objective that depends only on his action and the payoff-relevant state. We work with normal signals, which yields the additional tractability that each agent’s information choice is the one that maximizes that period’s reduction of uncertainty about the payoff-relevant state.

Our focus is on settings with many sources of information, some of which are redundant. Formally, agents can completely learn the payoff-relevant state from (repeated observation of) various subsets of signals. As a benchmark, we first derive the optimal long-run frequency of signal acquisitions, corresponding to the choices that maximize the speed of information revelation about the payoff-relevant state.

We then show that whether society’s acquisitions converge to this optimal long-run frequency depends critically on how many signals are needed to identify this state. The key intuition refers back to an observation made in Sethi and Yildiz (2016): An agent who repeatedly observes a source confounded by an unknown parameter learns both about the payoff-relevant state and also about the confounding term, and hence improves his interpretation of this source over time. In our setting, where a single confounding term can affect multiple sources, there is a further spillover effect: Learning from one source helps agents to interpret information from all sources confounded by the same parameters.

Suppose that in order to learn the payoff-relevant state, agents must observe a set of sources that additionally reveals all of the confounding terms. Then endogenously, agents will acquire information that (collectively) reveals all of the unknowns. The aggregated information eventually overwhelms prior beliefs, so that agents come to evaluate all sources by an “objective” asymptotic criterion. This leads them to discover the best set of sources. More formally, we obtain the following result: If recovering the payoff-relevant state requires as many sources as there are unknown states, then long-run acquisitions are optimal, independently of the prior belief.

In contrast, if it is possible to learn the payoff-relevant state without recovering all of the confounding terms, then long-run learning may be inefficient. This is because agents can persistently undervalue sources that provide information confounded by the remaining unknowns. Our second main result says that any set of sources that recovers the payoff-relevant state, and that contains fewer sources than there are unknown states, creates a “learning trap” under some set of prior beliefs. We further show that the long-run inefficiency under a learning trap (measured as the ratio of the optimal speed of learning to the achieved speed of learning) can be arbitrarily large.

The basic friction here is that investment in learning about confounding terms is socially beneficial, but not necessarily optimal for individuals. For example, better brain imaging may allow researchers to quickly uncover the causes of depression. If researchers are rewarded for their immediate contribution to understanding depression, however, they may prefer to exploit existing techniques rather than contribute towards (long-term) projects for developing these tools. Our main results show that this wedge between individual incentives and social objectives does not rule out efficient long-run learning. For certain correlation structures, individual incentives will endogenously drive individuals to acquire information in a way that is socially efficient.

In the remaining cases, interventions may be needed to transition agents towards better sets of sources. In the final part of our paper, we study such interventions. We show that policymakers can restore efficient information aggregation by providing certain kinds of free information (that we characterize), or by reshaping the reward structure so that agents’ payoffs depend on information that they acquire over many periods. The success of these interventions, however, depends on specific features of the informational environment.

### 1.1 Related Literature

A recent literature considers choice from different kinds of information (Sethi and Yildiz, 2016; Che and Mierendorff, 2017; Fudenberg, Strack and Strzalecki, 2017; Mayskaya, 2017; Liang, Mu and Syrgkanis, 2017). We build upon Liang, Mu and Syrgkanis (2017), which introduced the framework we describe in Section 2 under a restriction that the number of sources and states are the same (thus ruling out the possibility of informational overabundance, which is the focus of the present paper).

Sethi and Yildiz (2016, 2017) study long-run (myopic) acquisitions from a large number of Gaussian sources, as we do. Our model differs from this work in a few key ways: First, Sethi and Yildiz (2016, 2017) consider stochastic error variances, so that the “best” sources vary from period to period, while we fix error variances, so that there is (generically) a unique “best” asymptotic set. Second, Sethi and Yildiz (2016, 2017) focus on correlation structures that fall under our “learning traps” result (Theorem 2), while we explore arbitrary correlation structures and show that many lead to optimal learning; thus, the welfare comparisons that we make here are new.

Our model builds on the social learning and herding literatures (Banerjee, 1992; Bikhchandani, Hirshleifer and Welch, 1992), which consider information aggregation by short-lived agents who sequentially acquire information. At a high level, the externality identified in our paper relates to the classic externality from this literature: in both settings, the precision of public information can grow inefficiently slowly because of endogenous information acquisitions driven by past choices. But in the present paper, all signal realizations are publicly and perfectly observed, which turns off the inference problem essential to the existence of cascades in standard herding models. Our focus is on a new mechanism, in which externalities arise through choice of kind of information; as we will see, this externality has a rather different structure.

Our setting with choice of information connects to Burguet and Vives (2000), Mueller-Frank and Pai (2016), and Ali (2018), which introduced endogenous information acquisition to social learning. Relative to this work, our paper considers choice from a fixed set of information sources (with a capacity constraint), in contrast to choice from a flexible set of information sources (with a cost on precision). Our results focus on the speed of learning, as in Vives (1992), Golub and Jackson (2012), Hann-Caruthers, Martynov and Tamuz (2017), and Harel et al. (2018) among others.

Finally, our social planner problem in Section 4 is related to the experimental design literature in statistics, and in particular to the notion of $c$-optimality (choice of experiments to minimize the posterior variance of an unknown state). Chaloner (1984) showed that a $c$-optimal design exists on at most points. Our Theorem 1 extends this result, supplying a characterization of the optimal design itself and demonstrating uniqueness.[3]

[3] Another difference is that Chaloner (1984) studies the optimal continuous design, while we impose an integer constraint on signal counts.

## 2 Framework

There are persistent unknown states: a payoff-relevant state $\omega$, and additional states . We assume that the state vector $\theta$ follows a multivariate normal distribution ,[4] where the prior covariance matrix has full rank.[5]

[4] All vectors in this paper are column vectors.

[5] The full rank assumption is without loss of generality: If there is linear dependence across the states, the model can be mapped into an equivalent setting with a lower dimensional state-space.

Agents have access to sources of information. Observation of source produces an independent realization of the random variable

$$X_i = \langle c_i, \theta \rangle + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, 1)$$

where $c_i$ is a vector of constants, and the error terms $\epsilon_i$ are independent from each other and over time. (It is without loss to normalize the error terms, since the coefficients are unrestricted; thus, signals can be of differing precision levels.[6]) Throughout, let denote the coefficient matrix whose -th row is .

[6] Scaling up the coefficients is equivalent to scaling up the precision of the signal.

The payoff-irrelevant states produce correlations across the sources, and we can interpret them for example as:

• Confounding explanatory variables: Each observation of signal corresponds to observation of a tuple , where . For example, might be the average incidence of depression in a group of individuals with characteristic vector . The main state of interest is the coefficient on a given characteristic , and the payoff-irrelevant states are the unknown coefficients on the auxiliary characteristics. Different sources represent subpopulations with different characteristics.

• Knowledge and technologies that aid interpretation of information: Researchers can acquire measurements of various neurochemicals in individuals affected by depression. Each observation is obscured by measurement noise that depends on the state of the technology. Some neurochemicals are harder to measure than others, and the more advanced the tools for measuring a specific neurochemical, the more valuable it is to take that measurement.

Agents indexed by (discrete) time move sequentially. Each agent acquires an independent realization of one of the signals, and then chooses an action to maximize an individual objective . He bases his action on the realization of his own signal acquisition, as well as the history of signal acquisitions and realizations thus far. Thus, all signal realizations are public.

Payoff functions may differ across agents, but we assume that all decision problems depend only on the unknown state and the agent’s own action, and are moreover non-trivial in the following way.

###### Assumption 1 (Payoff Sensitivity to Mean).

For every , any variance and any action , there exists a positive Lebesgue measure of for which does not maximize .

That is, for every belief variance, we require that the expected value of affects the optimal action to take. This rules out cases with a “dominant” action and ensures that each agent strictly prefers to choose the most informative signal.

Throughout, we use to index the set of signals. We call a set of signals spanning if the vectors span the coordinate vector , so that it is possible to learn the payoff-relevant state by repeatedly observing signals from only . We call minimally spanning if it is spanning, and moreover no proper subset is spanning.
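The spanning definitions can be checked mechanically. The Python sketch below is illustrative only: the function names (`rank`, `is_spanning`, `minimal_spanning_sets`) and the brute-force enumeration are assumptions made here, not constructions from the paper. The coefficient rows are those of the three-signal environment in Example 3 (Section 5.1), with signals indexed from 0.

```python
from fractions import Fraction
from itertools import combinations


def rank(rows):
    """Rank of a list of equal-length vectors, by exact Gaussian elimination."""
    m = [[Fraction(x) for x in r] for r in rows]
    rk = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        pivot = next((r for r in range(rk, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(len(m)):
            if r != rk and m[r][col] != 0:
                factor = m[r][col] / m[rk][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk


def is_spanning(C, S):
    """True if e1 = (1, 0, ..., 0) lies in the span of the rows C[i], i in S."""
    rows = [C[i] for i in S]
    if not rows:
        return False
    e1 = [1] + [0] * (len(C[0]) - 1)
    # Adding e1 leaves the rank unchanged exactly when e1 is already in the span.
    return rank(rows) == rank(rows + [e1])


def minimal_spanning_sets(C):
    """All spanning sets that have no spanning proper subset (brute force)."""
    found = []
    for size in range(1, len(C) + 1):
        for S in combinations(range(len(C)), size):
            # Smaller minimal sets are already in `found`, so a subset test suffices.
            if is_spanning(C, S) and not any(set(T) <= set(S) for T in found):
                found.append(S)
    return found


# Rows are (coefficient on omega, coefficient on b1) for X1, X2, X3 of Example 3.
C = [[Fraction(1, 2), 0], [1, 1], [1, -1]]
print(minimal_spanning_sets(C))
```

The enumeration is exponential in the number of signals, so it is only meant for small illustrative environments.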

We assume in this paper that the complete set of signals is spanning, so that the payoff-relevant state can be recovered by observing all signals infinitely often.[7] This assumption nests two interesting cases. Say that the informational environment has exactly sufficient information if the complete set of signals is minimally spanning. Then, it is possible to recover $\omega$ by observing each information source infinitely often, but not by observing any proper subset of sources.

[7] This assumption is without loss, and our results do extend to situations where $\omega$ is not identified from the available signals. To see this, we first take a linear transformation and work with the following equivalent model: The state vector is -dimensional standard Gaussian, each signal , and the payoff-relevant parameter is for some fixed vector . Let be the subspace of spanned by . Then project onto : with and orthogonal to . Thus . By assumption, the random variable is independent from any random variable with (because they have zero covariance). Thus the uncertainty about cannot be reduced upon any signal observation. Consequently, agents only seek to learn about , returning to the case where the payoff-relevant parameter is identified.

Our main interest is in settings of informational overabundance, where the complete set of signals is spanning but not minimally spanning. In these cases, multiple different sets of signals allow for recovery of $\omega$, and a key point of our analysis is to compare the set of sources that “should” be observed in the long run with the set of sources that is in fact observed in the long run. Except for trivial cases, informational overabundance corresponds to having more signals than states.[8]

[8] It is possible for $\omega$ to be “overidentified” from a set of signals, e.g. , , and . In this case, the set is spanning, but not minimally spanning since both of its subsets and are also spanning. Although in this example, it is equivalent to a model in which there is a single bias , and the three signals are rewritten , and . Then we do have .

## 3 Preliminaries

Each agent faces a history consisting of all past signal choices and their realizations. The agent’s beliefs about the state vector, prior to making his own signal choice, are . Given an observation of the signal , his posterior beliefs become where is a deterministic function of the prior covariance matrix and the signal choice , and the posterior expected value is the random variable .

Agent $t$’s posterior belief about the payoff-relevant state is given by $\mathcal{N}(\mu^t_1, V^t_{11})$.[9] His maximum expected payoff (after observing his signal) is

$$\max_{a \in A} \; E\left[u_t(a, \omega) \mid \omega \sim \mathcal{N}(\mu^t_1, V^t_{11})\right]. \tag{1}$$

[9] Subscripts indicate particular entries of a vector or matrix.

Each agent chooses the signal that maximizes the expected value of (1), where the expectation is taken with respect to the random variable . From this we see that the agent’s expected payoff is measurable with respect to the posterior variance .

The signal acquisition that maximizes agent $t$’s payoffs is the one that minimizes his posterior variance about $\omega$.[10] Thus, we can track society’s acquisitions as a sequence of division vectors $m(t)$, where $m_i(t)$ is the number of times that signal $i$ has been observed up to and including time $t$. Let $f(q_1, \dots, q_K)$ denote the posterior variance about $\omega$, given the initial prior covariance matrix and $q_i$ observations of each signal $i$.[11] Then, $m(t)$ evolves deterministically according to the following rule: $m(0)$ is the zero vector, and for each time $t$ and signal $i$,

$$m_i(t+1) = \begin{cases} m_i(t) + 1 & \text{if } f(m_i(t)+1, m_{-i}(t)) \le f(m_j(t)+1, m_{-j}(t)) \;\; \forall j, \\ m_i(t) & \text{otherwise.} \end{cases}$$

That is, in each period the division vector increases by $1$ in exactly one coordinate, corresponding to the signal that allows for the greatest immediate reduction in posterior variance. We allow ties to be broken arbitrarily, so there may be multiple possible paths $m(t)$.

[10] Under our normality assumption, the signal that maximally reduces posterior variance about $\omega$ Blackwell dominates the remaining signals; see e.g. Hansen and Torgersen (1974). So the statement here is independent of the payoff function.

[11] For normal prior and signals, the posterior covariance matrix does not depend on signal realizations. See Appendix A for the complete (closed-form) expression for $f$.
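The myopic rule above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes the standard Gaussian update for a unit-variance signal, $V' = V - Vcc'V/(1 + c'Vc)$, in place of the closed form in Appendix A, and the names `update` and `myopic_path`, the lowest-index tie-breaking, and the two prior covariance matrices are all choices made here. The coefficients are those of Example 3 (Section 5.1).

```python
def update(V, c):
    """Posterior covariance after one observation of <c, theta> + N(0, 1) noise."""
    n = len(c)
    Vc = [sum(V[r][k] * c[k] for k in range(n)) for r in range(n)]
    denom = 1.0 + sum(c[k] * Vc[k] for k in range(n))
    return [[V[r][s] - Vc[r] * Vc[s] / denom for s in range(n)] for r in range(n)]


def myopic_path(prior, C, periods):
    """Indices of the signals chosen by successive myopic agents."""
    V = [row[:] for row in prior]
    path = []
    for _ in range(periods):
        # Each agent minimizes the posterior variance of the first state (omega).
        # Ties go to the lowest index, which is one admissible tie-breaking rule.
        best = min(range(len(C)), key=lambda i: update(V, C[i])[0][0])
        V = update(V, C[best])
        path.append(best)
    return path


# Rows are (coefficient on omega, coefficient on b1) for X1, X2, X3 of Example 3.
C = [[0.5, 0.0], [1.0, 1.0], [1.0, -1.0]]

# Hypothetical priors: omega and b1 independent, Var(omega) = 1,
# and Var(b1) equal to 10 (large) or 1 (small).
path_large = myopic_path([[1.0, 0.0], [0.0, 10.0]], C, 50)
path_small = myopic_path([[1.0, 0.0], [0.0, 1.0]], C, 50)

print([path_large.count(i) for i in range(3)])
print([path_small.count(i) for i in range(3)])
```

Under the large-variance prior every agent in this sketch picks the first signal, while under the small-variance prior the first two agents pick the second and third signals, which previews the two regimes studied in Section 5.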

The long-run frequencies of observation are $\lim_{t \to \infty} m_i(t)/t$ for each signal $i$. Our subsequent results in Section 5 show these limits to be well-defined. In Section 4, we first characterize the “optimal” acquisitions that a social planner might impose, and identify the corresponding long-run observation frequencies. In Section 5, we characterize the actual signal acquisitions, and compare these to the optimal benchmark.

## 4 Optimal Information Revelation

Our optimal benchmark describes the maximum possible information revelation about $\omega$. For each period $t$, define a $t$-optimal vector $n(t)$ to be any allocation of $t$ observations that minimizes posterior variance about $\omega$:

$$n(t) \in \operatorname*{argmin}_{(q_1, \dots, q_K) :\; q_i \in \mathbb{Z}_+,\; \sum_i q_i = t} f(q_1, \dots, q_K).$$

Then, $f(n(t))$ is the lowest achievable posterior variance by period $t$. Generically, there is a unique $t$-optimal division vector for every $t$.[12]

[12] Throughout the paper, “generic” means with probability $1$ for signal coefficients randomly drawn from a full support distribution on .

We interpret each $n(t)$ as the optimal social benchmark for the finite horizon problem with final period $t$. Suppose a social planner takes an action on behalf of the society at period $t$ to satisfy a payoff criterion that depends on his action and the payoff-relevant state $\omega$. Then, at every $t$, the social planner’s payoffs are maximized if the history of signal acquisitions corresponds to a $t$-optimal division.[13]

[13] The social planner’s optimal strategy is to observe each signal $i$ exactly $n_i(t)$ times, in an arbitrary order. In particular, such a strategy does not need to condition on signal realizations; see Liang, Mu and Syrgkanis (2017) for further discussion.
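For small horizons, a $t$-optimal allocation can be found by exhaustive search. The sketch below is a hedged illustration rather than the paper's method: it assumes the standard Gaussian update for unit-variance signals (instead of the closed form in Appendix A), enumerates every allocation of $t$ observations, and uses the Example 3 coefficients with a hypothetical identity prior. The names `posterior_variance` and `t_optimal` are introduced here.

```python
from itertools import product


def update(V, c):
    """Posterior covariance after one observation of <c, theta> + N(0, 1) noise."""
    n = len(c)
    Vc = [sum(V[r][k] * c[k] for k in range(n)) for r in range(n)]
    denom = 1.0 + sum(c[k] * Vc[k] for k in range(n))
    return [[V[r][s] - Vc[r] * Vc[s] / denom for s in range(n)] for r in range(n)]


def posterior_variance(prior, C, counts):
    """Posterior variance of the first state after counts[i] draws of signal i."""
    V = [row[:] for row in prior]
    for c, q in zip(C, counts):
        for _ in range(q):
            V = update(V, c)  # the order of updates does not affect the result
    return V[0][0]


def t_optimal(prior, C, t):
    """Brute-force search over all allocations of t observations."""
    allocations = (q for q in product(range(t + 1), repeat=len(C)) if sum(q) == t)
    return min(allocations, key=lambda q: posterior_variance(prior, C, q))


# Rows are (coefficient on omega, coefficient on b1) for X1, X2, X3 of Example 3.
C = [[0.5, 0.0], [1.0, 1.0], [1.0, -1.0]]
prior = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical prior, chosen for illustration

best = t_optimal(prior, C, 4)
print(best, posterior_variance(prior, C, best))
```

With this prior and $t = 4$, the search returns the allocation `(0, 2, 2)`: the planner splits observations evenly between the two confounded signals and never uses the first one.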

The limiting frequencies are well-defined under a subsequent condition (Assumption 2), and we refer to these as the optimal frequencies. Note that the strategy that samples signals (randomly) according to these optimal frequencies is best among stationary information acquisition strategies. Since payoffs are continuous in signal frequencies, such a strategy also approximates optimal aggregate payoffs under a -discounted criterion, as .141414We conjecture that under general conditions on agents’ utility functions, for any close to , the strategy that maximizes the -discounted objective eventually approximates the optimal frequencies. That is, if is the vector of signal counts at time under this optimal strategy (for the -discounted objective), then we conjecture . These observations further justify the use of as a benchmark.

We consider first a restricted version of the social planner problem, supposing that agents must acquire signals (only) from some minimal spanning set $S$. By definition, the setting is one of exactly sufficient information, where all signals in $S$ must be observed in order to recover $\omega$. In such a case, it is possible to decompose the first coordinate vector $e_1$ as the following (unique) linear combination of signals in $S$:

$$e_1 = \sum_{i \in S} \beta^S_i \cdot c_i,$$

where the coefficients $\beta^S_i$ are non-zero. We showed in prior work that each signal $i$ should be (asymptotically) observed in proportion to $|\beta^S_i|$:

###### Proposition 1 (Liang, Mu and Syrgkanis (2017)).

Suppose agents are constrained to a minimal spanning set $S$. Then, for every signal $i \in S$, the optimal count satisfies

$$n^S_i(t) = \frac{|\beta^S_i|}{\sum_{j \in S} |\beta^S_j|} \cdot t + O(1). \tag{2}$$

Throughout, $O(1)$ represents a residual term that remains bounded as $t \to \infty$.

This proposition implies the following corollary regarding the speed of learning achievable from signals in :

###### Corollary 1.

Suppose agents are constrained to a minimal spanning set $S$. The minimum achievable posterior variance after $t$ observations satisfies the following approximation:

$$f(n^S(t)) \sim \Big(\sum_{i \in S} |\beta^S_i|\Big)^2 \Big/ \, t,$$

where the notation “$\sim$” means that the ratio of the two sides converges to $1$ as $t \to \infty$.

Thus, sampling according to the frequencies in (2) will approximate (at large period $t$) a posterior variance of

$$\mathrm{AsympVar}_t(S) = \Big(\sum_{i \in S} |\beta^S_i|\Big)^2 \Big/ \, t := \frac{\phi(S)^2}{t}.$$

In what follows, we work with the simpler statistic $\phi(S)$ (roughly an asymptotic standard deviation), noting that the asymptotic variance is strictly increasing in $\phi(S)$. The smaller $\phi(S)$ is, the faster the community learns from $S$, so $\phi$ establishes an ordering over minimal spanning sets.[15]

[15] We can extend this definition to an arbitrary set (not necessarily minimally-spanning) of signals as follows. For any set that contains a minimal spanning set, define , where the minimum is taken over all minimal spanning sets contained in . If such does not exist (i.e., is not itself spanning), we let . In particular, represents the minimum asymptotic standard deviation achievable by only observing the signals in some minimal spanning set.
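The decomposition of $e_1$, the frequencies in Proposition 1, and the statistic $\phi(S)$ can all be computed by solving one small linear system. The Python sketch below is an illustration under stated assumptions: the helper names `beta`, `phi`, and `frequencies` are introduced here, the solver is plain exact Gaussian elimination, and the coefficient rows are taken from Example 3 (Section 5.1) with signals indexed from 0.

```python
from fractions import Fraction


def beta(C, S):
    """Solve sum_{i in S} beta_i * C[i] = e1 exactly, for a minimal spanning set S."""
    S = list(S)
    K = len(C[0])
    # One equation per state coordinate, one unknown per signal in S,
    # with the right-hand side e1 appended as the last column.
    A = [[Fraction(C[i][k]) for i in S] + [Fraction(int(k == 0))] for k in range(K)]
    row = 0
    for col in range(len(S)):
        p = next((r for r in range(row, K) if A[r][col] != 0), None)
        if p is None:
            raise ValueError("signals in S are linearly dependent")
        A[row], A[p] = A[p], A[row]
        pivot = A[row][col]
        A[row] = [a / pivot for a in A[row]]
        for r in range(K):
            if r != row:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[row])]
        row += 1
    if any(A[r][-1] != 0 for r in range(row, K)):
        raise ValueError("S does not span e1")
    return {i: A[j][-1] for j, i in enumerate(S)}


def phi(C, S):
    """Sum of |beta_i| over S: the asymptotic standard deviation statistic."""
    return sum(abs(b) for b in beta(C, S).values())


def frequencies(C, S):
    """Asymptotic observation frequencies |beta_i| / phi(S), as in equation (2)."""
    total = phi(C, S)
    return {i: abs(b) / total for i, b in beta(C, S).items()}


# Rows are (coefficient on omega, coefficient on b1) for X1, X2, X3 of Example 3.
C = [[Fraction(1, 2), 0], [1, 1], [1, -1]]
print(phi(C, (0,)), phi(C, (1, 2)), frequencies(C, (1, 2)))
```

Exact rational arithmetic is used so that comparisons between $\phi$ values of different sets are not affected by rounding.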

We assume throughout that there is a best minimal spanning set according to this ordering:

###### Assumption 2 (Unique Minimizer).

$\phi(S)$ has a unique minimizer $S^*$ among minimal spanning sets $S$.

This assumption is a restriction on the coefficient matrix , and it rules out examples such as the following:

###### Example 1.

The signals are and . Unique Minimizer fails, because learning occurs equally fast from either of the minimal spanning sets or .

###### Example 2.

The signals are , , , and . Unique Minimizer fails, because learning occurs equally fast from either of the minimal spanning sets and .

These examples are special, in the sense that Assumption 2 holds under arbitrarily small perturbations of the above environments.

If we restrict agents to sample exclusively from a single minimal spanning set, then the optimal sampling rule (under Assumption 2) is clearly the frequency vector $\lambda^*$ satisfying

$$\lambda^*_i = \begin{cases} \dfrac{|\beta^{S^*}_i|}{\sum_{j \in S^*} |\beta^{S^*}_j|} & \forall i \in S^* \\[2ex] 0 & \forall i \notin S^* \end{cases} \tag{3}$$

This sampling rule assigns zero frequency to signals outside of the set $S^*$, and samples signals within $S^*$ according to the frequencies given in Proposition 1.

In principle, the community may improve on $\lambda^*$ by sampling from multiple spanning sets. Our first theorem shows to the contrary that $\lambda^*$ remains optimal when we are permitted arbitrary sampling procedures. So long as the coefficient matrix satisfies Unique Minimizer, the best long-run strategy is to restrict to the best minimal spanning set, and to sample from that set as described above.

###### Theorem 1.

Under Assumption 2, let $\lambda^*$ be given by (3). Then $\lim_{t \to \infty} n_i(t)/t = \lambda^*_i$ for each signal $i$.[16]

[16] We conjecture that the stronger conclusion also holds. In Remark 2 in the appendix, we prove this conjecture assuming .

The conclusion can be loosely interpreted as stating that is the “most efficient linear representation” of the payoff-relevant state in terms of the signal coefficients.171717Specifically, consider the following constrained minimization problem: subject to It can be shown by linear programming that the minimum is attained exactly when — that is, when focusing on a single minimal spanning set.

We show in Appendix G that Assumption 2 is necessary: Indeed, in the environment described in Example 2, there are priors such that it is strictly optimal to observe all four available signals with positive frequency.

Theorem 1 directly implies the following comparative static: If signal is viewed with positive frequency in the social planner problem, then its optimal frequency is (locally) decreasing in its precision.

###### Corollary 2.

Suppose the coefficient matrix satisfies Unique Minimizer. Write each signal as , so that the precision of signal is increasing in . Then, either or is locally decreasing in .

Consider a problem complementary to ours, in which information sources choose the precision of their signals in order to maximize the optimal frequencies with which they are viewed (as given in (3)).[18] Corollary 2 highlights that there are two forces: each source should choose information sufficiently precise that it is included in the best set $S^*$; conditional on inclusion, however, each source wants to provide signals as imprecise as possible. These conflicting forces suggest that characterization of the equilibrium provisions of information precision is not straightforward.

[18] Similar comparative statics hold for society’s long-run frequencies, which are characterized later.

## 5 Main Results

In general, we may expect a difference between the best one-shot allocation of acquisitions, described in the previous section, and the set of acquisitions that are chosen by sequential decision-makers. Below, we show that whether society’s acquisitions eventually approximate the optimal acquisitions depends critically on how many signals are required to identify .

We first present our main results under the following technical assumption:

###### Assumption 3 (Strong Linear Independence).

and every submatrix of is of full rank.

Strong Linear Independence requires that every set of signals is linearly independent.191919Besides trivial cases with redundant signals, Strong Linear Independence also rules out settings such as the following: , , , , , and . Then but the four signals are not linearly independent. We impose this restriction in Sections 5.1 and 5.2 to allow for a simpler exposition of the main forces. Our results extend beyond Strong Linear Independence, and we characterize the general setting in Section 5.3.

### 5.1 Learning Traps

The following simple example demonstrates that sequential information acquisition need not lead to the optimal frequencies. Indeed, the set of signals that are observed with positive frequency in the long run can be disjoint from the optimal set $S^*$ (as defined in Assumption 2).

###### Example 3.

There are three available signals:

$$\begin{aligned} X_1 &= \omega/2 + \epsilon_1 \\ X_2 &= \omega + b_1 + \epsilon_2 \\ X_3 &= \omega - b_1 + \epsilon_3 \end{aligned}$$

Both $\{X_1\}$ and $\{X_2, X_3\}$ are minimal spanning sets, but the latter is optimal because $\phi(\{X_2, X_3\}) = 1 < 2 = \phi(\{X_1\})$.

Consider a prior where $\omega$ and $b_1$ are independent, and the prior variance of $b_1$ is sufficiently large. In the first period, the precision of the first signal exceeds that of the latter two signals (where all signals are interpreted as noisy observations of $\omega$).[20] Thus the best choice is to observe $X_1$. Since this observation does not affect the variance of $b_1$, the same argument shows that every agent observes signal $X_1$.

[20] The signal $X_1$ is equivalent to $2X_1 = \omega + 2\epsilon_1$, which is distributed as $\mathcal{N}(\omega, 4)$. Each of the signals $X_2$ and $X_3$ has greater variance conditional on $\omega$.
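As a worked check of Example 3, the arithmetic behind the comparison can be written out using the definitions of Section 4. This is our own calculation rather than a passage from the text, and $w$ is an illustrative symbol (not used elsewhere) for the prior variance of $b_1$, with $\omega$ and $b_1$ independent under the prior:

$$
\begin{aligned}
c_1 &= (1/2,\, 0)', \qquad c_2 = (1,\, 1)', \qquad c_3 = (1,\, -1)' \\
e_1 &= 2\, c_1 &&\Longrightarrow\quad \phi(\{X_1\}) = 2 \\
e_1 &= \tfrac{1}{2}\, c_2 + \tfrac{1}{2}\, c_3 &&\Longrightarrow\quad \phi(\{X_2, X_3\}) = \tfrac{1}{2} + \tfrac{1}{2} = 1 \\
\operatorname{Var}(2X_1 \mid \omega) &= 4, \qquad \operatorname{Var}(X_2 \mid \omega) = \operatorname{Var}(X_3 \mid \omega) = w + 1
\end{aligned}
$$

The asymptotic variance is therefore roughly $4/t$ when agents sample only $X_1$, against $1/t$ when they alternate between $X_2$ and $X_3$. In the first period, however, $X_1$ is the more precise signal about $\omega$ exactly when $w + 1 > 4$, and because observing $X_1$ never changes the variance of $b_1$, the same comparison holds in every later period.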

Generalizing this example, the result below (stated as a corollary, since it will follow from the subsequent Theorem 2) gives a sufficient condition for learning traps.

###### Corollary 3.

Under Strong Linear Independence, for every minimal spanning set that contains fewer than signals, there exists an open set of prior beliefs under which agents exclusively observe signals from .

Thus, every small set (fewer than signals) that identifies is a candidate learning trap.

We note that the size of the inefficiency, measured as the ratio of the optimal speed of learning to the achieved speed of learning, can be an arbitrarily large constant. Specifically, for any positive number $L$, there exists an environment in which

$$\frac{\phi(S)}{\phi(S^*)} > L,$$

where $S$ is the set of signals observed in the long run with positive frequency, and $S^*$ is the optimal set from before. This can be shown by direct construction: modify the example above so that with sufficiently large.[21]

[21] The “region of inefficient priors” (that result in suboptimal learning) does decrease in size as the level of inefficiency increases. As increases, the prior variance of has to increase correspondingly in order for the first agent to choose .

### 5.2 Efficient Information Aggregation

Suppose in contrast that repeated observation of sources is required to recover . Our next result shows that a very different long-run outcome obtains: Starting from any prior, information acquisition eventually approximates the optimal frequencies. Thus, despite the wedge between individual and societal objectives, agents will end up acquiring information in a way that is (eventually) socially best.

###### Corollary 4.

Under Unique Minimizer and Strong Linear Independence, if every minimal spanning set has size , then starting from any prior belief, it holds that for every signal .

This result obtains because if signals must be observed in order to recover , then the incentive to learn will endogenously drive agents to sample from at least different signals. Under the assumption of Strong Linear Independence, these signals further reveal all unknown states. Thus, as observations accumulate, agents not only learn about but about all of the confounding terms. This allows agents to eventually evaluate all signals according to an “objective” asymptotic value, and to identify the best set.

The condition that all minimal spanning sets have size is generically satisfied.[22] However, if we expect that sources are endogenous to design or strategic motivations, the relevant informational environments may not fall under this condition. For example, the existence of any source that directly reveals $\omega$ (that is, ) is non-generic in the probabilistic sense, but plausible in practice. Sets of signals that partition into different groups (with group-specific biases) are also economically interesting but non-generic. The previous Corollary 3 shows that inefficiency is a likely outcome in these cases.

[22] We point out that the set of coefficient matrices satisfying Unique Minimizer is “generic” in the following stronger sense: fixing the directions of the coefficient vectors and supposing that the precisions are drawn at random, different minimal spanning sets generically correspond to different speeds of learning. In contrast, whether every minimal spanning set has size is a condition on the directions themselves.

### 5.3 General Setting

We now state a more general version of our results that does not require Strong Linear Independence. Here we need to consider subspaces spanned by different signal sets. Formally, for any spanning set of signals , let be the set of available signals whose coefficient vectors belong to the subspace spanned by signals in . For example, if the available signals are and , and we define , then . We say a minimal spanning set is subspace-optimal if it uniquely maximizes the speed of learning among “feasible” sets of signals within its subspace.

###### Definition 1.

A minimal spanning set is subspace-optimal if it uniquely minimizes among minimal spanning subsets of .

For example, given the signals and described above, the set is minimally spanning but not subspace-optimal.

We introduce one final assumption, which strengthens Unique Minimizer to require the existence of a best minimal spanning set within every subspace.

###### Assumption 4 (Unique Minimizer in Every Subspace).

For every , there exists a unique minimal spanning set that minimizes among subsets of .

This assumption is guaranteed if different minimal spanning sets correspond to different -values.

Our next result generalizes both the learning trap result and the efficient information aggregation result from Sections 5.1 and 5.2. It says that long-run information acquisitions eventually concentrate on a set (starting from some prior belief) if and only if is a subspace-optimal minimal spanning set.

###### Theorem 2.

(a) Suppose is a subspace-optimal minimal spanning set. Then, there exists an open set of prior beliefs under which long-run frequencies are strictly positive for signals in , and zero everywhere else.

(b) Under Assumption 4, long-run frequencies exist for every signal. Moreover, if denotes the signals viewed with positive long-run frequencies, then is a minimal spanning set that is subspace-optimal.

This theorem directly implies our previous Corollaries 3 and 4. To see this implication, note that under Strong Linear Independence, for every minimal spanning set with fewer than signals. This implies that every minimal spanning set with fewer than signals is (trivially) optimal in its subspace, producing Corollary 3.

On the other hand, if every minimal spanning set has size and Strong Linear Independence is satisfied, then all minimal spanning sets belong to the same subspace. Under Unique Minimizer, there can only be one minimal spanning set that is optimal in this subspace, and this must also be the best set overall (in the sense of Section 4). This yields Corollary 4 from the theorem above.

## 6 Intuitions for Results

### 6.1 High-Level Argument

Each agent’s information acquisition decision is made by comparing the marginal value of observations from different sources. Thus, a necessary condition for efficient information aggregation to obtain is that agents eventually find the marginal values of signals in the optimal set (described in Section 4) to be persistently higher than the marginal values of signals outside of this set. This is also a sufficient condition: if agents eventually concentrate their acquisitions on the optimal set, then they will come to approximate the optimal frequencies.

The above argument relies on the assumption that agents repeatedly observe all signals in the best set. With exactly sufficient information, agents are driven to observe all available signals in order to learn . When information is overabundant, agents can learn from many different (proper) subsets of signals, and there is no guarantee that agents will observe signals in the best set at all.

It is exactly this difference that leads to our learning trap result (Corollary 3): Observation of different minimal spanning sets in the long run can be sustained by prior beliefs (and resulting posterior beliefs) that overvalue the signals within the set relative to signals outside of the set.

However, our analysis reveals that as agents repeatedly acquire signals from any fixed subspace of signals, they will eventually discover the asymptotic marginal value of each signal in that subspace (a value that is independent of the prior beliefs). In the long run, agents choose from the best set of signals within that subspace. Thus, only those sets of signals that are best in their subspace are potentially “self-sustaining.” And if all sets of signals that reveal $\omega$ span the entire space, agents will identify the best set of signals overall and achieve efficient information aggregation.

### 6.2 Proof Sketch for Theorem 2

Society’s acquisitions follow a procedure of “pseudo”-gradient descent, where the frequency vector evolves according to

$$\lambda(t+1) = \frac{t}{t+1}\,\lambda(t) + \frac{1}{t+1}\,e_i$$

and represents the coordinate vector that yields the greatest (immediate) reduction in the posterior variance function .
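The update rule above can be illustrated with a short simulation. The sketch below is not the authors' code: it assumes Gaussian signals with unit-variance noise, uses the three-signal learning trap example from this section (coefficient rows for $X_1, X_2, X_3$ over the states $(\omega, b_1)$), and picks two hypothetical diagonal priors. Tracking signal counts divided by $t$ is equivalent to applying the frequency recursion period by period.

```python
import numpy as np

def posterior_var_omega(precision):
    """Posterior variance of the first state (omega) for a given precision matrix."""
    return np.linalg.inv(precision)[0, 0]

def simulate_myopic(C, V0, T):
    """Each period, choose the signal whose single observation gives the
    smallest posterior variance about omega; return the signal counts."""
    precision = np.linalg.inv(V0)          # prior precision matrix
    counts = np.zeros(C.shape[0], dtype=int)
    for _ in range(T):
        candidates = [posterior_var_omega(precision + np.outer(c, c)) for c in C]
        i = int(np.argmin(candidates))     # myopically best signal this period
        precision = precision + np.outer(C[i], C[i])
        counts[i] += 1
    return counts

# Rows are coefficient vectors on (omega, b1) for X1, X2, X3 (unit noise assumed).
C = np.array([[0.5, 0.0], [1.0, 1.0], [1.0, -1.0]])
T = 200

# Hypothetical priors: large vs. small prior uncertainty about the confound b1.
diffuse = simulate_myopic(C, np.diag([1.0, 10.0]), T)
tight = simulate_myopic(C, np.diag([1.0, 0.1]), T)

print("frequencies with diffuse prior on b1:", diffuse / T)
print("frequencies with tight prior on b1:  ", tight / T)
```

Under these assumed priors, the diffuse case stays on $X_1$ in every period, while the tight case never samples $X_1$ and splits its observations between $X_2$ and $X_3$. Other priors or noise variances may behave differently.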

Instead of working directly with posterior variance, we define the following related function, which takes as input frequency vectors and describes a “normalized” asymptotic posterior variance:

$$f^*(\lambda_1, \dots, \lambda_N) = \lim_{t \to \infty} t \cdot f(\lambda_1 t, \dots, \lambda_N t).$$

We establish the following relationships between and . First, signal acquisitions chosen according to a frequency vector that minimizes will asymptotically also minimize the posterior variance function (Lemma 3); this justifies our study of . Second, is convex in and its unique minimum is the optimal frequency vector (Lemma 5). So the question of whether efficient information aggregation obtains is equivalent to the question of whether the frequencies come to minimize . Third, under a condition (that we show will be met at late periods), the signal that achieves the greatest reduction in also roughly achieves the greatest reduction in (Lemma 9). This allows us to consider (pseudo-)gradient descent in terms of .

While the convexity of $f^*$ ensures that standard gradient descent is well-behaved, the process of descent in our problem can only occur along a finite set of feasible directions (indexed by the available signals). This constraint corresponds to our assumption that each agent acquires a single, discrete observation of a chosen signal. The limitation is without loss whenever $f^*$ is differentiable, since all directional derivatives can then be rewritten as convex combinations of the partial derivatives along basis vectors. [Footnote 23: The limitation also goes away if each agent is allowed to acquire many observations of different signals; see Proposition 2 below.] The function $f^*$, however, is not differentiable everywhere. Consider our learning trap example with signals

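$$\begin{aligned} X_1 &= \omega/2 + \epsilon_1 \\ X_2 &= \omega + b_1 + \epsilon_2 \\ X_3 &= \omega - b_1 + \epsilon_3 \end{aligned}$$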

and set the frequency vector to be $\lambda = (1, 0, 0)$. It is easy to verify that beliefs are made less precise if we re-assign weight from $X_1$ to $X_2$, or from $X_1$ to $X_3$. But beliefs are made more precise if we simultaneously re-assign weight from $X_1$ to both $X_2$ and $X_3$. [Footnote 24: That is, is strictly smaller than both and , but it is strictly larger than .] This means that the derivatives of $f^*$ along the first two re-assignment directions are both positive, while its derivative along the third (simultaneous) direction is in fact negative. Hence, $f^*$ is not differentiable at $\lambda$.
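The failure of differentiability can be checked numerically. The following is a minimal sketch, assuming unit-variance noise, the starting frequency vector $(1,0,0)$, and the formula $f^*(\lambda) = [(C'\Lambda C)^{-1}]_{11}$ from Lemma 2. The tiny ridge `delta`, the step size `h`, and the three direction vectors are illustrative choices, not taken from the text.

```python
import numpy as np

# Rows are coefficient vectors on (omega, b1) for X1, X2, X3.
C = np.array([[0.5, 0.0], [1.0, 1.0], [1.0, -1.0]])

def f_star(lam, delta=1e-12):
    """Normalized asymptotic variance [(C' Lambda C)^{-1}]_{11}.
    A tiny ridge delta * I stands in for the limiting definition when
    C' Lambda C is singular (an assumption made for numerical convenience)."""
    X = C.T @ np.diag(lam) @ C + delta * np.eye(C.shape[1])
    return np.linalg.inv(X)[0, 0]

def directional_slope(lam, direction, h=1e-3):
    """One-sided finite-difference slope of f_star at lam along direction."""
    lam = np.asarray(lam, dtype=float)
    direction = np.asarray(direction, dtype=float)
    return (f_star(lam + h * direction) - f_star(lam)) / h

lam0 = [1.0, 0.0, 0.0]
slope_to_x2 = directional_slope(lam0, [-1, 1, 0])    # move weight X1 -> X2
slope_to_x3 = directional_slope(lam0, [-1, 0, 1])    # move weight X1 -> X3
slope_to_both = directional_slope(lam0, [-2, 1, 1])  # move weight X1 -> X2 and X3

print(slope_to_x2, slope_to_x3, slope_to_both)
```

If $f^*$ were differentiable at this point, the slope along the combined direction would equal the sum of the two single-signal slopes. Here the two single-signal slopes are positive while the combined slope is negative, which is consistent with the kink described above.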

Gradient descent can become stuck at vectors such as this, so that agents repeatedly sustain the frequency vector instead of moving to another frequency vector with smaller . This is reflected in our learning trap results: Corollary 3 and part (a) of Theorem 2. A key lemma towards efficient information aggregation shows that is differentiable whenever places nonzero weight on a spanning set of signals. Under the assumptions of Corollary 4, the community will satisfy this condition as agents try to recover . Thus, eventually the frequency vector puts positive weight on a spanning set of signals, at which point descent is well-behaved and ends at the global minimum . This explains the result in Corollary 4. Part (b) of Theorem 2 follows from a similar (albeit slightly more involved) argument.

## 7 Interventions

Section 5 demonstrated the possibility for sequential information acquisition to result in inefficient learning. We ask now whether it is possible for a policymaker to push agents towards efficient learning. Naturally, this question applies only when agents would otherwise achieve a suboptimal speed of learning (with conditions given in part (a) of Theorem 2).

We compare several possible policy interventions. One is to subsidize the quality of information acquisition, so that each individual observation is more informative. We show that this intervention is of limited effectiveness: Any set of signals that is a potential learning trap remains a potential learning trap under arbitrary improvements to signal precision. Another possibility is to redesign the incentive structure so that agents’ payoffs are based on information obtained over several periods (equivalent to acquisition of a block of signals each period). We show that it is possible to guarantee efficient information aggregation, but the length of the delay needed depends on subtle features of the information environment. Finally, we consider one-shot provision of free information, and provide an upper bound on the number of kinds of information that are needed to restore efficient learning.

### 7.1 More Precise Information

Consider an intervention that increases the information value of any signal draw. If sources represent experiments or data sets, this intervention can be interpreted as a subsidy that increases the number of data points. Formally, we suppose that each signal acquisition produces independent observations from that source.

The result below (which follows as a corollary from Theorem 2) says that every set of signals that is a potential learning trap given (as in our main model) remains a potential learning trap for every choice of .

###### Corollary 5.

Suppose that for , there is a set of priors given which signals in are (exclusively) viewed in the long run. Then, for every , there is a set of priors given which is exclusively viewed in the long run.

However, the sets of prior beliefs corresponding to different values of need not be the same. For a fixed prior belief, subsidizing higher quality acquisitions may or may not move the community out of a learning trap. To see this, consider first the informational environment and prior belief from Example 3. This is an environment in which increasing the precision of signals is ineffective: At the first period, the best choice is regardless of the value of . Then, our previous logic again implies that each subsequent agent will also choose signal , so that remains a learning trap. In Appendix H, we provide a contrasting example in which increasing the precision of signals can indeed break agents out of a learning trap from a specified prior belief. Which of these examples is relevant depends on fine details of the informational environment and the prior.

### 7.2 More Kinds of Information

Suppose that the policymaker can redesign the incentive structure, so that each agent has periods to acquire information prior to taking an action. This is equivalent to supposing that each agent receives a block of signals to allocate across sources. Then:

###### Proposition 2.

Under Unique Minimizer, there is a such that given acquisition of signals every period, long-run frequencies are starting from every prior belief.

Thus, given sufficiently many observations each period, agents will allocate observations in a way that eventually approximates the optimal frequencies.

The number of observations needed, however, does not admit a uniform bound over all environments of fixed size; that is, there does not exist a function such that acquisition of signals by each agent ensures efficient information aggregation for all signal structures with states and signals. The required instead depends on the difference in learning speed under the best set and under the alternative minimal spanning sets; see Appendix E. The larger this difference, the smaller can be.

### 7.3 Free Information

Finally, suppose the policy-maker can provide observations of signals , where each so that signal precisions are bounded by . At time , this information is made public. All subsequent agents update their prior beliefs based on this free information in addition to the history of signal acquisitions thus far.

Is there a sufficient number of (kinds of) signals, such that efficient learning can be guaranteed under such an intervention? We answer in the affirmative below: Suppose Unique Minimizer holds, and let be the size of the optimal set . Then precise signals are sufficient to produce efficient learning:

###### Proposition 3.

There exist and signals with such that with these free signals provided at , society’s long-run frequencies are starting from every prior belief.

We emphasize that the policy-maker does not need to teach directly about the payoff-relevant state $\omega$, which the agents will learn on their own. Rather, auxiliary information should be provided to help agents better interpret the confounding terms. Our proof shows that as long as agents understand those confounding terms that appear in the best set of signals (these parameters have dimension ), they will come to evaluate the signals in the best set according to their asymptotic marginal values. [Footnote 25: This intervention requires knowledge of the full correlation structure. An alternative intervention, with higher demands on information provision but lower demands on knowledge of the environment, is to provide signals about all of the confounding terms.]

## 8 Conclusion

We study a model of sequential learning, where short-lived agents choose what kind of information to acquire from a large set of available information sources. We compare their information acquisitions with an optimal benchmark, under which the speed of information revelation is maximized.

In general, because agents do not internalize the impact of their information acquisitions on later decision-makers, inefficient information acquisition may obtain. Specifically, past information acquisitions can increase the value of “low-quality” sources relative to “high-quality” sources, pushing future agents to acquire information from a set of sources that yields inefficiently slow learning. We show however that inefficiency is not guaranteed: depending on the correlation structure, myopic concerns can endogenously push agents to identify and observe only the most informative sources. Our main results separate these outcomes, and fully characterize the set of possible long-run outcomes.

Our framework and results highlight some of the forces that are important for the design of incentives for information acquisition. In particular, do the kinds of information that are of immediate societal interest also have spillovers for knowledge that is only of indirect value? When such spillovers are present, simple incentive schemes—in which agents care about immediate contributions to knowledge—are sufficient to enable efficient long-run learning. When these spillovers are not built into the environment, other incentives are needed. For example, forward-looking funding agencies can encourage investment in learning about unknowns that are not directly of interest, but which are useful as intermediate steps. Alternatively, agents can be rewarded for advancements developed across several contributions. These observations are consistent with practices that have arisen in academic research, including evaluation of the influence of a body of work, and the establishment of third-party funding agencies to support methodological and foundational research.

## References

• Ali (2018) Ali, Nageeb. 2018. “Herding with Costly Information.” Journal of Economic Theory.
• Banerjee (1992) Banerjee, Abhijit. 1992. “A Simple Model of Herd Behavior.” Quarterly Journal of Economics, 107(3): 797–817.
• Bikhchandani, Hirshleifer and Welch (1992) Bikhchandani, Sushil, David Hirshleifer, and Ivo Welch. 1992. “A Theory of Fads, Fashion, Custom, and Cultural Change as Information Cascades.” Journal of Political Economy, 100(5): 992–1026.
• Borgers, Hernando-Veciana and Krahmer (2013) Borgers, Tilman, Angel Hernando-Veciana, and Daniel Krahmer. 2013. “When Are Signals Complements Or Substitutes.” Journal of Economic Theory, 148(1): 165–195.
• Burguet and Vives (2000) Burguet, Roberto, and Xavier Vives. 2000. “Social Learning and Costly Information.” Economic Theory.
• Chade and Eeckhout (2018) Chade, Hector, and Jan Eeckhout. 2018. “Matching Information.” Theoretical Economics.
• Chaloner (1984) Chaloner, Kathryn. 1984. “Optimal Bayesian Experimental Design for Linear Models.” The Annals of Statistics, 12(1): 283–300.
• Chen and Waggoner (2016) Chen, Yiling, and Bo Waggoner. 2016. “Informational Substitutes.”
• Che and Mierendorff (2017) Che, Yeon-Koo, and Konrad Mierendorff. 2017. “Optimal Sequential Decision with Limited Attention.” Working Paper.
• Fudenberg, Strack and Strzalecki (2017) Fudenberg, Drew, Philip Strack, and Tomasz Strzalecki. 2017. “Stochastic Choice and Optimal Sequential Sampling.” Working Paper.
• Golub and Jackson (2012) Golub, Benjamin, and Matthew Jackson. 2012. “How Homophily Affects the Speed of Learning and Best-Response Dynamics.” The Quarterly Journal of Economics.
• Hann-Caruthers, Martynov and Tamuz (2017) Hann-Caruthers, Wade, Vadim Martynov, and Omer Tamuz. 2017. “The Speed of Sequential Asymptotic Learning.” Working Paper.
• Hansen and Torgersen (1974) Hansen, Ole Havard, and Eric N. Torgersen. 1974. “Comparison of Linear Normal Experiments.” The Annals of Statistics, 2: 367–373.
• Harel et al. (2018) Harel, Matan, Elchanan Mossel, Philipp Strack, and Omer Tamuz. 2018. “Groupthink and the Failure of Information Aggregation in Large Groups.” Working Paper.
• Liang, Mu and Syrgkanis (2017) Liang, Annie, Xiaosheng Mu, and Vasilis Syrgkanis. 2017. “Optimal Myopic Information Acquisition.” Working Paper.
• Mayskaya (2017) Mayskaya, Tatiana. 2017. “Dynamic Choice of Information Sources.” Working Paper.
• Mueller-Frank and Pai (2016) Mueller-Frank, Manuel, and Mallesh Pai. 2016. “Social Learning with Costly Search.” American Economic Journal: Microeconomics.
• Sethi and Yildiz (2016) Sethi, Rajiv, and Muhamet Yildiz. 2016. “Communication with Unknown Perspectives.” Econometrica, 84(6): 2029–2069.
• Sethi and Yildiz (2017) Sethi, Rajiv, and Muhamet Yildiz. 2017. “Culture and Communication.” Working Paper.
• Smith and Sørensen (2000) Smith, Lones, and Peter Sørensen. 2000. “Pathological Outcomes of Observational Learning.” Econometrica.
• Vives (1992) Vives, Xavier. 1992. “How Fast do Rational Agents Learn?” Review of Economic Studies.

## Appendix A Posterior Variance Function

### A.1 A Basic Lemma

Here we review and extend a basic result from Liang, Mu and Syrgkanis (2017). Specifically, we show that the posterior variance about weakly decreases over time, and the marginal value of any signal decreases in its signal count.

###### Lemma 1.

Given prior covariance matrix and observations of each signal , society’s posterior variance about is given by

$$f(q_1, \dots, q_N) = \left[\left((V^0)^{-1} + C'QC\right)^{-1}\right]_{11} \tag{4}$$

where . The function is decreasing and convex in each whenever these arguments take non-negative real values.

###### Proof.

Note that $(V^0)^{-1}$ is the prior precision matrix, and $C'QC$ is the total precision from the signals. Thus (4) simply represents the fact that for Gaussian prior and signals, the posterior precision matrix is the sum of the prior and signal precision matrices. To prove the monotonicity of $f$, consider the partial order $\succeq$ on positive semi-definite matrices where $A \succeq B$ if and only if $A - B$ is positive semi-definite. As $q_i$ increases, the matrices $Q$ and $C'QC$ increase in this order. Thus the posterior covariance matrix $((V^0)^{-1} + C'QC)^{-1}$ decreases in this order, which implies that the posterior variance about $\omega$ decreases. Intuitively, more information always improves the decision-maker’s estimates.

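To prove $f$ is convex, it suffices to prove $f$ is midpoint-convex since the function is clearly continuous. Take $q = (q_1, \dots, q_N)$ and $r = (r_1, \dots, r_N)$ with non-negative entries, and let $s = (q + r)/2$. Define the corresponding diagonal matrices to be $Q = \operatorname{diag}(q)$, $R = \operatorname{diag}(r)$, $S = \operatorname{diag}(s)$. Observe that $Q + R = 2S$. Thus by the AM-HM inequality for positive-definite matrices, we have in matrix order

$$\left((V^0)^{-1} + C'QC\right)^{-1} + \left((V^0)^{-1} + C'RC\right)^{-1} \succeq 2\left((V^0)^{-1} + C'SC\right)^{-1}.$$

Using (4), we conclude

$$f(q_1, \dots, q_N) + f(r_1, \dots, r_N) \ge 2 f(s_1, \dots, s_N).$$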

This proves the convexity of . ∎
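Lemma 1 can be spot-checked numerically. The sketch below is only an illustration under assumptions: it uses an identity prior covariance, the three-signal coefficient matrix from the learning trap example, and randomly drawn non-negative count vectors. It evaluates formula (4) and tests monotonicity and midpoint convexity on those draws, which is evidence for the lemma on this instance rather than a proof.

```python
import numpy as np

def posterior_variance(V0, C, q):
    """Formula (4): [((V0)^{-1} + C' Q C)^{-1}]_{11} with Q = diag(q)."""
    precision = np.linalg.inv(V0) + C.T @ np.diag(q) @ C
    return np.linalg.inv(precision)[0, 0]

rng = np.random.default_rng(0)
V0 = np.eye(2)                                         # assumed prior covariance
C = np.array([[0.5, 0.0], [1.0, 1.0], [1.0, -1.0]])    # assumed coefficient matrix

monotone_ok = True
midpoint_convex_ok = True
for _ in range(1000):
    q = rng.uniform(0.0, 20.0, size=3)
    r = rng.uniform(0.0, 20.0, size=3)

    # Monotonicity: one extra observation of a random signal cannot raise the variance.
    bump = np.zeros(3)
    bump[rng.integers(3)] = 1.0
    if posterior_variance(V0, C, q + bump) > posterior_variance(V0, C, q) + 1e-12:
        monotone_ok = False

    # Midpoint convexity: f(q) + f(r) >= 2 f((q + r) / 2).
    s = (q + r) / 2.0
    lhs = posterior_variance(V0, C, q) + posterior_variance(V0, C, r)
    if lhs < 2.0 * posterior_variance(V0, C, s) - 1e-12:
        midpoint_convex_ok = False

print(monotone_ok, midpoint_convex_ok)
```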

### A.2 Inverse of Positive Semi-definite Matrices

For future use, we provide a definition of $[X^{-1}]_{11}$ for positive semi-definite matrices $X$. When $X$ is positive definite, its eigenvalues are strictly positive, and its inverse matrix is defined as usual. In general, we can apply the spectral theorem to write

$$X = UDU'$$

with $U$ being a $K \times K$ orthogonal matrix whose columns are eigenvectors of $X$, and $D$ being a diagonal matrix consisting of non-negative eigenvalues. Even if some of these eigenvalues are zero, we can think of $X^{-1}$ as

$$X^{-1} = (UDU')^{-1} = UD^{-1}U' = \sum_{j=1}^{K} \frac{1}{d_j} \cdot [u_j u_j']$$

with $u_j$ being the $j$-th column vector of $U$. We thus define

$$[X^{-1}]_{11} = \sum_{j=1}^{K} \frac{(\langle u_j, e_1 \rangle)^2}{d_j}, \tag{5}$$

with the convention that . Note that by this definition,

$$[X^{-1}]_{11} = \lim_{\epsilon \to 0^+} \left( \sum_{j=1}^{K} \frac{(\langle u_j, e_1 \rangle)^2}{d_j + \epsilon} \right) = \lim_{\epsilon \to 0^+} \left[ (X + \epsilon I_K)^{-1} \right]_{11}$$

since the matrix $X + \epsilon I_K$ has the same set of eigenvectors as $X$, with eigenvalues increased by $\epsilon$. Hence our definition of $[X^{-1}]_{11}$ is a continuous extension of the usual definition to positive semi-definite matrices. Note that we allow $[X^{-1}]_{11}$ to be infinite.
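Definition (5) translates directly into a small routine. The sketch below is one possible implementation, not the authors' code: the tolerance `tol` for treating an eigenvalue or a projection as zero is an assumption, and a zero eigenvalue whose eigenvector is orthogonal to $e_1$ is taken to contribute nothing, while any other zero eigenvalue makes the result infinite.

```python
import numpy as np

def inv11_psd(X, tol=1e-12):
    """Extended [X^{-1}]_{11} for a symmetric positive semi-definite matrix X,
    computed from the eigendecomposition X = U D U'."""
    d, U = np.linalg.eigh(X)            # eigenvalues ascending, eigenvectors in columns
    total = 0.0
    for dj, uj in zip(d, U.T):
        weight = uj[0] ** 2             # (<u_j, e_1>)^2
        if dj > tol:
            total += weight / dj
        elif weight > tol:
            return np.inf               # e_1 loads on a zero-eigenvalue direction
    return total

# Singular, but e_1 is orthogonal to the null space: finite value.
X_finite = np.diag([0.25, 0.0])
# Singular, and e_1 loads on the null space: infinite value.
X_infinite = np.array([[1.0, 1.0], [1.0, 1.0]])

print(inv11_psd(X_finite), inv11_psd(X_infinite))

# Compare with the ridge approximation [(X + eps * I)^{-1}]_{11} for small eps.
eps = 1e-9
ridge_value = np.linalg.inv(X_finite + eps * np.eye(2))[0, 0]
print(ridge_value)
```

In the first case the ridge approximation stays close to the value from the eigendecomposition, matching the continuity statement above. In the second case the ridge value grows without bound as `eps` shrinks.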

## Appendix B Proof of Theorem 1

### B.1 Asymptotic Behavior of Posterior Variance

We first approximate the posterior variance as a function of the frequencies with which each signal is observed. Specifically,

###### Lemma 2.

For any , let . Then

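$$f^*(\lambda_1, \dots, \lambda_N) := \lim_{t \to \infty} t \cdot f(\lambda_1 t, \dots, \lambda_N t) = \left[ (C' \Lambda C)^{-1} \right]_{11} \tag{6}$$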

Note that the matrix is positive semi-definite. So the value of is well defined, see (5).

###### Proof.

Recall that with . Thus

$$t f(\lambda_1 t, \dots, \lambda_N t) = \left[ \left( \tfrac{1}{t} (V^0)^{-1} + C' \Lambda C \right)^{-1} \right]_{11}.$$

Hence by the continuity of in the matrix , we obtain the lemma. ∎

We note that is the Fisher Information Matrix when the signals are observed according to frequencies . Thus the above lemma can also be seen as an application of the Bayesian Central Limit Theorem.
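The convergence in Lemma 2 is easy to see on a small instance. The following sketch assumes an identity prior covariance, the three-signal coefficient matrix from the learning trap example, and an arbitrary interior frequency vector. All three are illustrative choices rather than values from the text.

```python
import numpy as np

C = np.array([[0.5, 0.0], [1.0, 1.0], [1.0, -1.0]])   # assumed coefficient matrix
V0 = np.eye(2)                                         # assumed prior covariance
lam = np.array([0.2, 0.4, 0.4])                        # assumed frequency vector

def f(q):
    """Posterior variance about omega after q[i] observations of signal i, as in (4)."""
    precision = np.linalg.inv(V0) + C.T @ np.diag(q) @ C
    return np.linalg.inv(precision)[0, 0]

# Right-hand side of (6); C' Lambda C is invertible for this choice of lam.
f_star = np.linalg.inv(C.T @ np.diag(lam) @ C)[0, 0]

horizons = [10, 1_000, 100_000]
scaled = [t * f(lam * t) for t in horizons]
gaps = [abs(value - f_star) for value in scaled]

for t, value in zip(horizons, scaled):
    print(t, value, f_star)
```

The gap between $t \cdot f(\lambda t)$ and $f^*(\lambda)$ shrinks as $t$ grows, because the prior precision enters only through the vanishing term $\tfrac{1}{t}(V^0)^{-1}$.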

### B.2 Reduction to the Study of $f^*$

The development of the function is useful for the following reason:

###### Lemma 3.

Suppose uniquely minimizes subject to (the -dimensional simplex), then the -optimal divisions satisfy for each .

###### Proof.

Fix any increasing sequence of times . It suffices to show that whenever the limit exists for each , this limit must be . Suppose not, then by assumption . For , define another vector with . By the continuity of , it holds that for sufficiently small .

Since , there exists sufficiently large such that for each and . Hence, for ,

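$$t_m \cdot f(n_1(t_m), \dots, n_N(t_m)) \ge t_m \cdot f(\tilde{\lambda}_1 \cdot t_m, \dots, \tilde{\lambda}_N \cdot t_m) \to f^*(\tilde{\lambda}_1, \dots, \tilde{\lambda}_N)$$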

The first inequality uses the monotonicity of . On the other hand,

$$t_m \cdot f(\hat{\lambda}_1 \cdot t_m, \dots, \hat{\lambda}_N \cdot t_m) \to f^*(\hat{\lambda}_1, \dots, \hat{\lambda}_N).$$

Comparing the above two displays, we see that for sufficiently large , . But this contradicts the -optimality of the division , as society could do better by following frequencies . The lemma is thus proved. ∎

### B.3 Crucial Lemma

We pause to demonstrate the following technical lemma:

###### Lemma 4.

Suppose uniquely minimizes and let be the submatrix of corresponding to the first signals. Further suppose is positive for . Then for any signal , we can write with .

###### Proof.

By assumption, we have the vector identity

$$e_1 = \sum_{j=1}^{K} \beta_j \cdot c_j \quad \text{with } \beta_j = \left[ (C^*)^{-1} \right]_{1j} > 0.$$

Suppose for contradiction that (the opposite case where the sum is can be similarly treated). In particular, some is positive. Without loss of generality, we assume is the largest among such ratios. Then and

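$$e_1 = \sum_{j=1}^{K} \beta_j \cdot c_j = \left( \sum_{j=2}^{K} \left( \beta_j - \frac{\beta_1}{\alpha_1} \cdot \alpha_j \right) \cdot c_j \right) + \frac{\beta_1}{\alpha_1} \cdot \left( \sum_{j=1}^{K} \alpha_j \cdot c_j \right)$$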

This represents as a linear combination of the vectors and , with coefficients and . Observe that these coefficients are non-negative: for each , is clearly positive if (since ). And if , then by assumption and is again non-negative.

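By definition, is the sum of the absolute values of these coefficients. This sum is

$$\sum_{j=2}^{K} \left( \beta_j - \frac{\beta_1}{\alpha_1} \cdot \alpha_j \right) + \frac{\beta_1}{\alpha_1} = \sum_{j=1}^{K} \beta_j + \frac{\beta_1}{\alpha_1} \cdot \left( 1 - \sum_{j=1}^{K} \alpha_j \right) \le \sum_{j=1}^{K} \beta_j.$$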

But then , leading to a contradiction. Hence the lemma must be true. ∎

### B.4 Proof of Theorem 1 when $|S^*| = K$

Given Lemma 3, Theorem 1 will follow once we show that uniquely minimizes over the simplex—recall that denotes the optimal frequencies for the minimal spanning set that minimizes . In this section, we prove is indeed the unique minimizer whenever this “best” subset contains exactly signals. Later on we will prove the same result even when , but that proof will require additional techniques.

###### Lemma 5.

Suppose is the unique minimizer of over minimal spanning sets. Define by

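$$\lambda_i^* = \frac{\left| \left[ (C^*)^{-1} \right]_{1i} \right|}{\sum_{j=1}^{K} \left| \left[ (C^*)^{-1} \right]_{1j} \right|}, \quad 1 \le i \le K$$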

with , [Footnote 26: For any subset and , write for the sub-matrix of with row indices in and column indices in . Likewise, let be the sub-matrix of after deleting rows in and columns in .] and . Then for any .
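The frequency formula in Lemma 5 can be evaluated on a concrete case. The sketch below is an illustration under assumptions: it takes the best set to be $\{X_2, X_3\}$ from the learning trap example with unit-variance noise, so that the rows of `C_star` are the coefficient vectors of those two signals, and it compares the resulting frequencies against a coarse grid of alternatives that use only these two signals.

```python
import numpy as np

# Assumed best minimal spanning set: X2 = omega + b1 + noise, X3 = omega - b1 + noise.
C_star = np.array([[1.0, 1.0], [1.0, -1.0]])

# First row of (C*)^{-1} gives the weights in e_1 = sum_j beta_j * c_j.
beta = np.linalg.inv(C_star)[0]
lam_star = np.abs(beta) / np.abs(beta).sum()

def f_star(lam):
    """Normalized asymptotic variance [(C*' Lambda C*)^{-1}]_{11} on this two-signal set."""
    return np.linalg.inv(C_star.T @ np.diag(lam) @ C_star)[0, 0]

# Compare with other interior splits of the unit budget between the two signals.
grid = np.linspace(0.05, 0.95, 19)
grid_values = [f_star(np.array([w, 1.0 - w])) for w in grid]

print("optimal frequencies:", lam_star)
print("f* at optimum:", f_star(lam_star), "best on grid:", min(grid_values))
```

For this symmetric pair the formula returns an equal split, and no grid point achieves a lower value of $f^*$. The comparison covers only frequency vectors supported on the two chosen signals, so it does not by itself rule out improvements from signals outside the set.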

###### Proof.

First, we will assume that is positive for . This is without loss because we can always work with the “negative” of any signal (replace with ), which does not affect agents’ behavior.

Since is convex in its arguments, is also convex in . To show , we only need to show for some . In other words, it suffices to show for in an -neighborhood of . By assumption, is minimally-spanning and so its signals are linearly independent. Thus its signals must span all of the states. From this it follows that the matrix is positive definite, and by (6) the function is differentiable near (see Remark 1 below).

We claim that the partial derivatives of satisfy the following inequality:

$$\partial_K f^*(\lambda^*) < \partial_i f^*(\lambda^*) \le 0, \quad \forall i > K. \tag{**}$$

Once this is proved, we will have, for close to ,

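$$f^*(\lambda_1, \dots, \lambda_K, \lambda_{K+1}, \dots, \lambda_N) \ge f^*(\lambda_1, \dots, \lambda_{K-1}, \lambda_K + \lambda_{K+1} + \cdots + \lambda_N, 0, \dots, 0) \ge f^*(\lambda^*). \tag{7}$$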

The first inequality is based on (**) and continuous differentiability of , while the second inequality holds because uniquely minimizes if society only observes the first signals. Moreover, when , one of these inequalities is strict so that strictly.

To prove (**), we recall that

 f∗(λ