# Sequential Cooperative Bayesian Inference

## Abstract

Cooperation is often implicitly assumed when learning from other agents. Cooperation implies that the agent selecting the data and the agent learning from the data have the same goal: that the learner infer the intended hypothesis. Recent models in human and machine learning have demonstrated the possibility of cooperation. We seek foundational theoretical results for cooperative inference by Bayesian agents through sequential data. We develop novel approaches for analyzing the consistency, rate of convergence, and stability of Sequential Cooperative Bayesian Inference (SCBI). Our analysis of effectiveness, sample efficiency, and robustness shows that cooperation is not only possible in specific instances but theoretically well-founded in general. We discuss implications for human-human and human-machine cooperation.

## 1 Introduction

Learning often occurs sequentially, as opposed to in batch, and from data provided by other agents, as opposed to from a fixed random sampling process. The canonical example of sequential learning from an agent occurs in educational contexts where the other agent is a teacher whose goal is to help the learner. However, instances appear across a wide range of contexts including informal learning, language, and robotics. In contrast with typical contexts considered in machine learning, it is reasonable to expect the cooperative agent to adapt their sampling process after each trial, consistent with the goal of helping the learner learn more quickly. It is also reasonable to expect that learners, in dealing with such cooperative agents, would know the other agent intends to cooperate and incorporate that knowledge when updating their beliefs. In this paper, we analyze basic statistical properties of such sequential cooperative inferences.

Large behavioral and computational literatures highlight the importance of cooperation for learning. Across the behavioral sciences, cooperative information sharing is believed to be a core feature of human cognition. Education, where a teacher selects examples for a learner, is perhaps the most obvious case. Other examples appear in linguistic pragmatics Frank and Goodman (2012), in speech directed to infants Eaves Jr et al. (2016), and in children’s learning from demonstrations Bonawitz et al. (2011). Indeed, theorists have posited that the ability to select data for others and to learn cooperatively from them explains humans’ ability to learn quickly in childhood and accumulate knowledge over generations Tomasello (1999); Csibra and Gergely (2009).

Across computational literatures, cooperative information sharing is also believed to be central to human-machine interaction. Examples include pedagogic-pragmatic value alignment in robotics Fisac et al. (2017), cooperative inverse reinforcement learning Hadfield-Menell et al. (2016), machine teaching Zhu (2013) and Bayesian teaching Eaves Jr et al. (2016) in machine learning, and teaching dimension in learning theory Zilles et al. (2008); Doliwa et al. (2014). Indeed, rather than building in knowledge or training on massive amounts of data, cooperative learning from humans is a strong candidate for advancing machine learning theory and improving human-machine teaming more generally.

While behavioral and computational research makes clear the importance of cooperation for learning, we lack mathematical results that would establish statistical soundness. In the development of probability theory, proofs of consistency and rate of convergence were celebrated results that put Bayesian inference on strong mathematical footing Doob (1949). Moreover, establishment of stability with respect to mis-specification ensured that theoretical results could apply despite the small differences between the model and reality Kadane and Chuang (1978); Berger et al. (1994). Proofs of consistency, convergence, and stability ensure that intuitions regarding probabilistic inference were formalized in ways that satisfied basic desiderata.

Our goal is to provide a comparable foundation for Sequential Cooperative Bayesian Inference as a form of statistical inference, in order to understand the strengths, limitations, and behavior of cooperating agents. Grounded strongly in machine learning (Murphy, 2012; Ghahramani, 2015) and human learning (Tenenbaum et al., 2011), we adopt a probabilistic approach. We approach consistency, convergence, and stability using a combination of new analytical and empirical methods. The result will be a model-agnostic understanding of whether and under what conditions sequential cooperative interactions result in effective and efficient learning.

Notation is introduced at the end of this section. Section 2 introduces the model of sequential cooperative Bayesian inference (SCBI), and Bayesian inference (BI) as the comparison. Section 3 presents a new analysis approach, which we apply to understanding the consistency of SCBI. Section 4 presents empirical results analyzing the sample efficiency of SCBI versus BI, showing that convergence of SCBI is considerably faster. Section 5 presents empirical results testing the robustness of SCBI to perturbations. Section 6 introduces an application of SCBI in a grid world model. Section 7 describes our contributions in the context of related work, and Section 8 discusses implications for machine learning and human learning.

Preliminaries. Throughout this paper, for a vector , we denote its -th entry by or . Similarly, for a matrix , we denote the vector of -th row by , the vector of -th column by , and the entry of -th row and -th column by or simply . Further, let be the column vectors representing the row and column marginals (sums along row/column) of . Let or simply be the vector of ones. The symbol is used to denote the normalization of a non-negative vector , i.e., . Similarly, the normalization of matrices is denoted by , with “col” indicating column normalization (for row normalization, write “row” instead), and denotes to which vector of sums the matrix is normalized. The set of probability distributions on a finite set is denoted by ; we do not distinguish it from the simplex . The language of statistical models and estimators follows the notation of Miescke and Liese (2008).

## 2 The Construction

In this paper, we consider cooperative communication models with two agents, which we call a teacher and a learner. Let be the set of hypotheses, i.e., concepts to teach. The shared goal is for the learner to infer the correct hypothesis which is only known by the teacher at the beginning. To facilitate learning, the teacher passes one element from a finite data set sequentially. The relation between and is given by , a positive matrix satisfying which represents the joint distribution (JD) between hypotheses and data. We assume that the sets , and the matrix satisfy:

(i) There are no fewer data than hypotheses ().
(ii) The hypotheses are distinguishable, i.e., there is no real number such that for any .

For simplicity, we denote the column-normalization of by which is more widely used.

In our setup, the teacher teaches in sequence. At each round the teacher chooses a data point from by sampling according to a distribution. And the learner learns by maintaining a posterior distribution on through Bayesian inference (with a sequence of likelihood distributions).

Precisely, the “teacher” represents the process of selecting data to convey the desired concept, essentially, a statistical model , where , with the power set of (as a -algebra of ), is the space of probability measures on , and is a probability measure on for the first rounds and true parameter , representing the probability of each set of data taught given to teach. The loss function is taken as the -distance on inherited from .

The “learner” represents the process of interpreting the received data, i.e., a sequence of estimators factoring through via natural projection, abused to . Note that also depend on the learner’s original prior on . We may assume .

Bayesian inference on sequential data is a well-studied approach to this problem. However, there is no cooperation in Bayesian inference, since the teaching distribution and the learning likelihood are constant over time (a teacher need not even be present). To introduce cooperation, following cooperative inference Yang et al. (2018), we propose Sequential Cooperative Bayesian Inference (SCBI), a sequential version of cooperative inference.

### 2.1 Sequential Cooperative Bayesian Inference

Sequential Cooperative Bayesian Inference (SCBI) assumes that the two agents—a teacher and a learner—cooperate to facilitate learning. Prior research has formalized this cooperation (in a single-round game) as a system of two interrelated equations in which the teacher’s choice of data depends on the learner’s inference, and the learner’s inference depends on reasoning about the teacher’s choice. This prior research into such Cooperative Inference has focused on batch selection of data Yang et al. (2018); Wang et al. (2019a), and has been shown to be formally equivalent to Sinkhorn scaling Wang et al. (2019b). Following this principle, we propose a new sequential setting in which the teacher chooses data sequentially, and both agents update the likelihood at each round to optimize learning.

Cooperative Inference. Let $P_{L_0}(h)$ be the learner’s prior of hypothesis $h$ and $P_{T_0}(d)$ be the teacher’s prior of selecting data $d$. For learning, let $P_T(d|h)$ be the teacher’s likelihood of selecting $d$ to convey $h$ and $P_L(h|d)$ be the learner’s posterior for $h$ given $d$. Teaching is similar. Cooperative inference is then a system of two equations shown below, with $P_L(d)$ and $P_T(h)$ the normalizing constants:

$$
P_L(h|d) = \frac{P_T(d|h)\,P_{L_0}(h)}{P_L(d)}, \qquad P_T(d|h) = \frac{P_L(h|d)\,P_{T_0}(d)}{P_T(h)}. \tag{1}
$$

It is shown (Wang et al., 2019a, b) that Eq. (1) can be solved using Sinkhorn scaling, where -Sinkhorn scaling of a matrix is simply the iterated alternation of row normalization of with respect to and column normalization of with respect to . The limit of such iterations exists if the sums of elements in and are the same Schneider (1989). We use here instead of since they have the same Sinkhorn scaling limit Hershkowitz et al. (1988).
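The alternating normalization described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function name `sinkhorn_scale`, the fixed iteration budget, and the 3x2 example matrix are all assumptions introduced here.

```python
def sinkhorn_scale(matrix, row_sums, col_sums, iterations=500):
    """Alternately normalize rows and columns of a positive matrix.

    `matrix` is a list of rows. `row_sums` and `col_sums` are the target
    marginals; their totals must agree for the iteration to have a limit.
    """
    scaled = [list(row) for row in matrix]
    n_rows, n_cols = len(scaled), len(scaled[0])
    for _ in range(iterations):
        # Row normalization: rescale every row to its target sum.
        for i in range(n_rows):
            factor = row_sums[i] / sum(scaled[i])
            scaled[i] = [value * factor for value in scaled[i]]
        # Column normalization: rescale every column to its target sum.
        for j in range(n_cols):
            factor = col_sums[j] / sum(scaled[i][j] for i in range(n_rows))
            for i in range(n_rows):
                scaled[i][j] *= factor
    return scaled


# Hypothetical 3x2 positive matrix: 3 data points (rows), 2 hypotheses (columns).
example = [[0.2, 0.1], [0.1, 0.3], [0.1, 0.2]]
scaled = sinkhorn_scale(example, row_sums=[1.0, 1.0, 1.0], col_sums=[1.8, 1.2])
print([sum(row) for row in scaled])                       # row marginals
print([sum(row[j] for row in scaled) for j in range(2)])  # column marginals
```

Because each step only multiplies a row or a column by a positive constant, cross-ratios such as `scaled[0][0] * scaled[1][1] / (scaled[0][1] * scaled[1][0])` are the same before and after scaling.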

Sequential Cooperation. SCBI allows multiple rounds of teaching and requires each choice of data to be generated based on cooperative inference, with the learner updating their beliefs between each round. In each round, based on the data being taught and the learner’s initial prior on as common knowledge, the teacher and learner update their common likelihood matrix according to cooperative inference (using Sinkhorn scaling), then the data selection and inference proceed based on the updated likelihood matrix.

Precisely, starting from the learner’s prior , let the data taught up to round be and the posterior of the learner after round be , which is actually predictable for both agents (obvious for and inductively correct for by a later argument). To teach, the teacher calculates the Sinkhorn scaling of given the uniform row sums and column sums (to make the sum of equal that of , which guarantees the existence of the limit in Sinkhorn scaling), denoted by . The teacher’s data selection is proportional to columns of . Thus let be the column normalization of by , i.e., . Then the teacher uses the distribution (treated as a column vector) to sample from and passes it to the learner. Thus, is defined.

On the learner’s side, with datum passed from the teacher, the learner calculates the likelihood matrix in the same way and applies ordinary Bayesian inference. First take the prior to be the posterior of the last round, , then multiply it by the likelihood of selecting , namely the row of corresponding to , which results in . Then the posterior is obtained by row normalizing . Inductively, in the next round, the learner will start with and . The learner’s calculation in round can be simulated by the teacher, so the teacher can predict , which inductively justifies the assumption in the previous paragraph.

There is a shortcut. Consider that the vector , which is proportional to the prior, is used in the normalization , then . Furthermore, since is row normalized to , each row of it is a distribution on . Thus .
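One round of SCBI, including the shortcut, might be sketched as follows. The function names, the fixed Sinkhorn iteration budget, and the example matrix are assumptions made for illustration; the logic follows the description above, with the scaled matrix having unit row sums and column sums proportional to the learner's current prior.

```python
import random


def sinkhorn_scale(matrix, row_sums, col_sums, iterations=500):
    """Alternate row and column normalization (fixed iteration budget)."""
    scaled = [list(row) for row in matrix]
    for _ in range(iterations):
        scaled = [[v * r / sum(row) for v in row] for row, r in zip(scaled, row_sums)]
        totals = [sum(col) for col in zip(*scaled)]
        scaled = [[v * c / t for v, c, t in zip(row, col_sums, totals)] for row in scaled]
    return scaled


def scbi_round(matrix, theta, true_h, rng):
    """One SCBI round: returns the sampled datum and the learner's posterior."""
    n = len(matrix)
    # Common likelihood update: row sums 1, column sums n * theta.
    scaled = sinkhorn_scale(matrix, [1.0] * n, [n * t for t in theta])
    # Teacher: sample a datum in proportion to the column of the true hypothesis.
    column = [row[true_h] for row in scaled]
    datum = rng.choices(range(n), weights=column)[0]
    # Learner (shortcut): the posterior is the row of the sampled datum,
    # re-normalized to absorb any residual Sinkhorn error.
    row_total = sum(scaled[datum])
    posterior = [value / row_total for value in scaled[datum]]
    return datum, posterior


rng = random.Random(0)
matrix = [[0.2, 0.1], [0.1, 0.3], [0.1, 0.2]]  # hypothetical joint distribution
theta = [0.5, 0.5]                              # learner's prior
history = []
for _ in range(5):
    datum, theta = scbi_round(matrix, theta, true_h=0, rng=rng)
    history.append(datum)
print(history, theta)
```

Since the teacher can run the same computation, the teacher knows `theta` after every round without any message from the learner.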

### 2.2 Bayesian Inference: the Control

In order to test the performance of SCBI, we recall the classical Bayesian inference (BI). In BI, a fixed likelihood matrix is used throughout the communication process. Bayes’ rule requires to be the conditional distribution on the set of data given each hypothesis, thus is the column-normalization defined before.

For the teacher, given , treated as a column vector, the teaching distribution is . Then the teacher selects data via i.i.d. sampling according to . Therefore, the probability of teaching a sequence of form is .

The learner first chooses a prior ( is part of the model, usually the uniform distribution), then uses Bayes’ rule with likelihood to update the posterior distribution repeatedly. Given taught datum , the map from the prior to the posterior distribution is denoted by . Thus the learner’s estimation over given a sequential data can be written recursively by , and . Thus, by induction, .
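The BI control can be sketched in the same style. The helper names and the example likelihood below are hypothetical; the teacher samples i.i.d. from the column of the true hypothesis and the learner applies Bayes' rule once per datum.

```python
import random


def bi_update(likelihood, prior, datum):
    """Bayes' rule: multiply the prior by the datum's row, then normalize."""
    unnormalized = [likelihood[datum][h] * prior[h] for h in range(len(prior))]
    total = sum(unnormalized)
    return [value / total for value in unnormalized]


def bi_posterior(likelihood, prior, data):
    """Apply bi_update once per datum, feeding each posterior back in as the prior."""
    posterior = list(prior)
    for datum in data:
        posterior = bi_update(likelihood, posterior, datum)
    return posterior


def bi_teach(likelihood, true_h, rounds, rng):
    """Teacher: i.i.d. samples from the column of the true hypothesis."""
    weights = [row[true_h] for row in likelihood]
    return rng.choices(range(len(likelihood)), weights=weights, k=rounds)


# Hypothetical column-normalized likelihood: rows are data, columns are hypotheses.
likelihood = [[0.5, 0.2], [0.3, 0.3], [0.2, 0.5]]
data = bi_teach(likelihood, true_h=0, rounds=20, rng=random.Random(0))
posterior = bi_posterior(likelihood, [0.5, 0.5], data)
# A hand-checkable case: data (0, 0, 2) gives weights 0.05 and 0.02, i.e. [5/7, 2/7].
fixed = bi_posterior(likelihood, [0.5, 0.5], [0, 0, 2])
```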

## 3 Consistency

We investigate the effectiveness of the estimators in both BI and SCBI by testing their consistency: setting the true distribution on hypotheses by , in a model , we examine the convergence (use loss function as distance on ) of the posterior sequence given sampled data as random variables and check whether the limit is .

Although the definition of consistency is with a specified true distribution , here, we focus on atomic cases where for some , meaning the truth is a fixed in rather than a general distribution on . Derivations and proofs can be found in the Supplementary Material.

### 3.1 BI and KL Divergence

The consistency of BI has been well studied since the work of Bernstein, von Mises, and Doob (1949). In this section, we state it in our situation and derive a formula for the rate of convergence, as a baseline for the cooperative theory.

###### Theorem 1.

[(Miescke and Liese, 2008, Theorem 7.115)] In BI, the model is strongly consistent at for each , with arbitrary choice of an interior point (i.e. for all ) as prior.

###### Remark 1.

For a fixed true distribution , strong consistency of a model is defined to be: the sequence of posteriors given by the estimator , as a sequence of random variables, converges to almost surely according to the teaching distribution . If the convergence is in probability, the sequence of estimators is said to be consistent.

###### Remark 2.

Theorem 1 also assumes that hypotheses are distinguishable (Section 2). It is critical to have for some , for BI with a general is almost never consistent or strongly consistent.

Consistency—independent of the choice of prior interior of —guarantees that BI is always effective.

Rate of Convergence. After effectiveness, we provide the efficiency of BI in terms of asymptotic rate of convergence.

###### Theorem 2.

In BI, with for some , let the -component of posterior given as random variables valued in . Then converges to almost surely.

###### Remark 3.

We call the asymptotic rate of convergence (RoC) of BI, denoted by .
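The rate can be checked numerically. The sketch below assumes the form derived in Appendix A.2, where $\frac{1}{k}\log\frac{1-\theta_k(h)}{\theta_k(h)}$ tends to the negative of the smallest KL divergence from the true column to any other column; the sign convention of the returned value, the helper names, and the example matrix are assumptions made here.

```python
import math
import random


def kl_divergence(p, q):
    """KL(p, q) for two strictly positive distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def column(matrix, j):
    return [row[j] for row in matrix]


def empirical_bi_rate(likelihood, prior, true_h, rounds, rng):
    """Return (1/k) * log(theta_k(h) / (1 - theta_k(h))) after k = rounds samples."""
    hypotheses = range(len(prior))
    weights = column(likelihood, true_h)
    # log_ratio[i] tracks log(theta_k(i) / theta_k(h)); log space avoids underflow.
    log_ratio = [math.log(prior[i] / prior[true_h]) for i in hypotheses]
    for datum in rng.choices(range(len(likelihood)), weights=weights, k=rounds):
        for i in hypotheses:
            log_ratio[i] += math.log(likelihood[datum][i] / likelihood[datum][true_h])
    others = [log_ratio[i] for i in hypotheses if i != true_h]
    peak = max(others)
    # log((1 - theta_k(h)) / theta_k(h)) via a numerically stable log-sum-exp.
    log_odds_against = peak + math.log(sum(math.exp(v - peak) for v in others))
    return -log_odds_against / rounds


likelihood = [[0.5, 0.2, 0.3], [0.3, 0.3, 0.4], [0.2, 0.5, 0.3]]  # hypothetical
true_h = 0
predicted = min(
    kl_divergence(column(likelihood, true_h), column(likelihood, j))
    for j in range(3) if j != true_h
)
observed = empirical_bi_rate(likelihood, [1 / 3] * 3, true_h, 20000, random.Random(0))
print(predicted, observed)
```

With a long enough teaching sequence the two printed numbers should be close, since the hypothesis nearest to the true one in KL divergence dominates the residual posterior mass.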

### 3.2 SCBI as a Markov Chain

From the proof of Theorem 1, the pivotal property is that the variables are commutative in posteriors (the variables can occur in any order without affecting the posterior) thanks to commutativity of multiplication. However, in SCBI, the commutativity does not hold, for the likelihood matrix depends on previous outcome. Thus the method used in BI analysis no longer works here.
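The difference in order dependence can be probed directly. In this sketch (function names and the example matrix are assumptions), the BI posterior is computed for the same two data in both orders and must agree, while the SCBI posterior is recomputed through Sinkhorn scaling after each datum and generally need not agree.

```python
def sinkhorn_scale(matrix, row_sums, col_sums, iterations=500):
    """Alternate row and column normalization (fixed iteration budget)."""
    scaled = [list(row) for row in matrix]
    for _ in range(iterations):
        scaled = [[v * r / sum(row) for v in row] for row, r in zip(scaled, row_sums)]
        totals = [sum(col) for col in zip(*scaled)]
        scaled = [[v * c / t for v, c, t in zip(row, col_sums, totals)] for row in scaled]
    return scaled


def bi_posterior(likelihood, prior, data):
    """BI: the likelihood matrix stays fixed across rounds."""
    theta = list(prior)
    for d in data:
        unnormalized = [likelihood[d][h] * theta[h] for h in range(len(theta))]
        total = sum(unnormalized)
        theta = [v / total for v in unnormalized]
    return theta


def scbi_posterior(matrix, prior, data):
    """SCBI: the likelihood matrix is rescaled from the current posterior each round."""
    n = len(matrix)
    theta = list(prior)
    for d in data:
        scaled = sinkhorn_scale(matrix, [1.0] * n, [n * t for t in theta])
        total = sum(scaled[d])
        theta = [v / total for v in scaled[d]]
    return theta


matrix = [[0.6, 0.2], [0.3, 0.3], [0.1, 0.5]]  # hypothetical, column-normalized
prior = [0.5, 0.5]
bi_forward = bi_posterior(matrix, prior, [0, 2])
bi_backward = bi_posterior(matrix, prior, [2, 0])
scbi_forward = scbi_posterior(matrix, prior, [0, 2])
scbi_backward = scbi_posterior(matrix, prior, [2, 0])
print(bi_forward, bi_backward)
print(scbi_forward, scbi_backward)
```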

Because the likelihood matrix depends only on the preceding state, the process is in fact Markov, and we may analyze the model as a Markov chain on the continuous state space .

To describe this process, let be the space of states, and let be the true hypothesis to teach (), let learner’s prior be , or say, the distribution of learner’s initial state is .

The operator . In the Markov chain, in each round, the transition operator maps the prior as a probability distribution on state space to the posterior as another, i.e., .

To make the formal definition of simpler, we need to define some maps. For any , let be the map bringing the learner’s prior to posterior when data is chosen by the teacher, that is, sends each normalized vector to according to SCBI. Each is a bijection based on the uniqueness of Sinkhorn scaling limits of , shown in Hershkowitz et al. (1988). Further, the map is continuous on and smooth in its interior according to Wang et al. (2019b). Continuity and smoothness of make it natural to induce a push-forward on Borel measures. Explicitly, for each Borel measure and each Borel measurable set . Let be the map of teacher’s adjusting sample distribution based on the learner’s prior, that is, given a learner’s prior , by definition of SCBI, the distribution of the teacher is adjusted to . Each component of is denoted by . We can use only for in which case teacher can trace learner’s state. Now we can define formally.

###### Definition 3.

Given a fixed hypothesis , or say , the operator translating a prior as a Borel measure to the posterior distribution according to one round of SCBI is given below, for any Borel measurable set .

$$
(\Psi(h)(\mu))(E) := \int_E \sum_{d\in\mathcal{D}} \tau_d\big(T_d^{-1}(\theta)\big)\, d\big(T^{*}_{d}(\mu)\big)(\theta). \tag{2}
$$

In our case, we start with a distribution where is the prior of the learner on the set of hypotheses. In each round of inference, there are different possibilities according to the data taught. Thus in any finite round , the distribution of the posterior is the sum of at most atoms (actually, we can prove is exact). Thus in the following discussions, we assume that is atomic. The action of $\Psi(h)$ on atomic distributions is determined by its action on a single atom:

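$$
\Psi(h)(\delta_\theta) = \sum_{i=1}^{n} \frac{M^{\langle n\theta\rangle}_{(i,h)}}{n\,\theta(h)}\; \delta_{\left(M^{\langle n\theta\rangle}_{(i,\_)}\right)}. \tag{3}
$$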

Moreover, since the SCBI behavior depends only on the prior (with fixed and ) as a random variable, the same operator applies to every round in SCBI. Thus we can conclude that the following proposition is valid:

###### Proposition 4.

Given , let , the sequence of estimators in SCBI forms a time-homogeneous Markov chain on state space with transition operator characterized by Eq. (2) and Eq. (3).
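The action of the transition operator on a single atom, as in Eq. (3), can be enumerated explicitly. The following is a sketch under assumptions: `psi_on_atom` is a hypothetical name, the Sinkhorn limit is approximated by a fixed number of iterations, and the example matrix is illustrative. Each datum contributes one atom (the corresponding row of the scaled matrix) weighted by the teacher's probability of selecting it.

```python
def sinkhorn_scale(matrix, row_sums, col_sums, iterations=500):
    """Alternate row and column normalization (fixed iteration budget)."""
    scaled = [list(row) for row in matrix]
    for _ in range(iterations):
        scaled = [[v * r / sum(row) for v in row] for row, r in zip(scaled, row_sums)]
        totals = [sum(col) for col in zip(*scaled)]
        scaled = [[v * c / t for v, c, t in zip(row, col_sums, totals)] for row in scaled]
    return scaled


def psi_on_atom(matrix, theta, true_h):
    """Return [(weight, next_state), ...] for one SCBI round started at theta."""
    n = len(matrix)
    scaled = sinkhorn_scale(matrix, [1.0] * n, [n * t for t in theta])
    # Weight of datum i: scaled entry (i, h) divided by the column sum n * theta(h).
    weights = [scaled[i][true_h] / (n * theta[true_h]) for i in range(n)]
    # Next state for datum i: row i of the scaled matrix (re-normalized for safety).
    atoms = [[v / sum(row) for v in row] for row in scaled]
    return list(zip(weights, atoms))


matrix = [[0.2, 0.1], [0.1, 0.3], [0.1, 0.2]]  # hypothetical joint distribution
theta = [0.5, 0.5]
true_h = 0
transition = psi_on_atom(matrix, theta, true_h)
expected_after = sum(weight * atom[true_h] for weight, atom in transition)
print(transition)
print(theta[true_h], expected_after)
```

Comparing the two printed values in the last line gives a numerical view of Lemma A.1 in the appendix, which states that the expected posterior on the true hypothesis does not decrease from one round to the next.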

Thanks to the fact that the SCBI is a time homogeneous Markov process, we can further show the consistency.

###### Theorem 5 (Consistency).

In SCBI, let be a positive matrix, the teacher is teaching one hypothesis (i.e., ), and the prior distribution satisfies with , then the estimator sequence is consistent, for each .

###### Remark 4.

The assumption in Theorem 5 that is necessary in any type of Bayesian inference since it is impossible to get the correct answer in posterior by Bayes’ rule, if it is excluded in the prior at the beginning. In practice, the prior distribution is usually chosen to be with the uniform distribution vector in , i.e., .

Rate of Convergence. Thanks to consistency, we can calculate the asymptotic rate of convergence for SCBI.

###### Theorem 6.

With matrix , hypothesis , and a prior as in Theorem 5, let denote the posterior after rounds of SCBI. Then

 (4)

where with . Thus we call the asymptotic rate of convergence (RoC) of SCBI.

## 4 Sample Efficiency

In this section, we present some empirical results comparing the sample efficiency of SCBI and BI.

### 4.1 Asymptotic RoC Comparison

We first compare the asymptotic rate of convergence ( for BI and for SCBI, see Theorems 2 and 6). Assume that the joint distribution is sampled uniformly. Equivalently, the corresponding is sampled through i.i.d. uniform distributions on , one for each column.

For each column-normalized matrix , we compute two variables to compare BI with SCBI: the probability and the expected value of averaged difference .

Two-column Cases. Consider the case where is of shape with the two columns sampled from uniformly and independently, we simulated for with a size- Monte Carlo method for each to calculate and . The result is shown in Fig. 1.

We can reduce the calculation of to a numerical integral .

Since goes too close to as the rank grows, we use to show the increase in detail.

More Columns of a Fixed Row Size. To verify the general cases, we simulated and by Monte Carlo on matrices of -row and various-column shapes (see Fig. 2).

Square Matrices. Fig. 3 shows the square cases with size from to , simulated by size Monte Carlo.

The empirical is the mean of (sample-size) i.i.d. variables valued or , thus the standard deviation of a single variable is smaller than . By the Central Limit Theorem, the standard deviation (precision threshold). We therefore draw lines in each log-scale figure, but the line lies within the visible area in only one of them.
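The bound behind this precision threshold can be spelled out. The following assumes that each simulated variable is an indicator taking values in $\{0,1\}$ with some success probability $p$, and writes $N$ for the Monte Carlo sample size; both symbols are introduced here for illustration.

```latex
\sigma_{\text{single}} = \sqrt{p(1-p)} \le \tfrac{1}{2},
\qquad
\sigma_{\text{mean}} = \frac{\sigma_{\text{single}}}{\sqrt{N}} \le \frac{1}{2\sqrt{N}}.
```

Differences between an empirical probability and a reference value that are smaller than a few multiples of $1/(2\sqrt{N})$ are therefore within sampling noise.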

In all simulated cases, we observe that and , indicating that SCBI converges faster than BI in most cases and on average. It is also observed that SCBI behaves even better as matrix size grows, especially when the teacher has more choices on the data to be chosen (i.e., more rows).

### 4.2 Distribution of Inference Results

The promise of cooperation is that one may infer hypotheses from small amounts of data. Hence, we compare SCBI with BI after small, fixed numbers of rounds.

We sample matrices of shape whose columns are distributed evenly in to demonstrate. Equivalently, they are column-normalizations of the uniformly sampled matrices whose sum of all entries is one.

Assume that the correct hypothesis to teach is . We first simulate a -round inference behavior, exploring all possible ways that the teacher may teach, then calculate the expectation and standard deviation of . With matrices sampled in the above way, Fig. 4 shows this comparison between BI and SCBI.

Similarly, we extend the number of rounds to by Monte Carlo, since an exact calculation exhausting all possible teaching paths becomes infeasible. With matrices sampled independently, we simulate a teacher teaching times up to round for each matrix, and the statistics are also shown in Fig. 4. From Fig. 4, we observe that SCBI has a better expectation and less variance in the short run.

In conclusion, experiments indicate that SCBI is more efficient both asymptotically and in the short run.

## 5 Stability

In this section, we test the robustness of SCBI by letting the initial conditions of the teacher and the learner differ. This could happen when agents do not have full access to their partner’s exact state.

Theory. Let and be the initial matrices (the previous matrix , but now perturbed into two matrices) that the teacher and the learner start with, respectively. Let and be elements in representing the prior on hypotheses that the teacher and learner use in the estimation of inference (teacher) and in the actual inference (learner), i.e., and . During the inference, let and be the distribution of posteriors of the teacher and the learner after round , and denote the corresponding random variables by and , for all positive and , where represents the limit in probability.

Let be a random variable on , we define an operator similar to the in Section 3. Let , then .

###### Proposition 1.

Given a sequence of independent, identically distributed -valued random variables following the uniform distribution. Let be a prior distribution on , and , then converges, in probability, to where .

###### Remark 5.

This proposition helps accelerate the simulation, that one may terminate the teaching process when is sufficiently close to , since Prop. 1 guarantees that the expectation of the learner’s posterior on the true hypothesis at that time is close enough to the eventual probability of getting , i.e. .

###### Definition 2.

We call the successful rate of the inference given , , and . By the setup in Section 2, the failure probability, , is where is the loss function.

Simulations with Perturbation on Priors. We simulated the square cases of rank and . We sample matrices ( to ) of size , whose columns distribute uniformly on , and priors ( to ) in , used as . Similarly, we sample matrices (, and ) of size , and priors (, , ) from in the same way as above. In both cases, we assume to be the true hypothesis to teach.

Our simulation is based on Monte Carlo method of teaching sequences (for each single point plotted) then use Proposition 1 to calculate the successful rate of inference. For matrices, we perturb in two ways: (1) take around distributed evenly on concentric circles, thus points for each are taken. In this area, there are points lying on given directions ( apart, see Supplementary Material for figures). (2) sample evenly in the whole simplex ( points for each ). For matrices, we perturb in two ways: (1) along randomly chosen directions in evenly take points on each direction, and (2) sample points evenly in . Then we have the following figure samples (for figures demonstrating the entire simulation, please see Supplementary Material).

From the figures we see: 1. left pictures indicate that the learner’s expected posterior on is roughly linear to perturbations along a line. 2. right pictures indicate that the learner’s expected posterior on is closely bounded by a multiple of the learner’s prior on true . Thus we have the following conjecture:

###### Conjecture 3.

Given and , let be the true hypothesis to teach. For any , let be learner’s prior with a distance to less than . Then the successful rate for sufficiently many rounds is greater than , where .

Simulations with Perturbation on Matrices. We now investigate robustness of SCBI to perturbations on agents’ common or shared likelihood matrix . Let and be a perturbed matrix. The simulations are performed on the matrices to mentioned above with a fixed common prior .

Let all matrices mentioned be column-normalized (this does not affect SCBI since cross-ratios and marginal conditions determine the Sinkhorn scaling results). We call the column determined by the true hypothesis (the first column in our simulation) the target column (“true hypo” on figures), the column which uses (argmin column) the relevant column (“rel. hypo”), and the other column the irrelevant column (“irrel. hypo”). Let , and is obtained from by perturbing along the relevant / irrelevant column.

Without loss of generality, we assume that only one column of the learner’s matrix is perturbed at a time as other perturbations may be treated as compositions of such.

For each and each column , we apply perturbations on concentric circles around (the disc), perturbations preserving the normalized-KL divergence ( used in ) from the target column and linear interpolations with target column. Each point in Fig. 6 is estimated using a size- Monte Carlo method using Proposition 1.

From the graphs, we can see that the successful rate varies continuously under perturbations, slowly in one direction (the yellow strip crossing the center) and rapidly in the perpendicular direction (where the color changes quickly to blue).

## 6 Grid World: an Application

Consider a grid world with two possible terminal goals, A and B, and a starting position as shown below. Let the reward at the terminal position be . Assuming no step costs, the value of a grid cell at distance from is then (in the RL sense), where is the discount factor.

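[Figure: grid world with terminal goals A and B and start position S, from which the available moves are left (⇐), up (⇑), and right (⇒).]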

Suppose the structure of the world is accessible to both agents whereas the true location of the goal is only known to a teacher. The teacher performs a sequence of actions to teach to a learner. At each round, there are three available actions, left, up and right. After observing the teacher’s actions, the learner updates their belief on accordingly.

We now compare the behaviours of BI and SCBI agents in this grid world. In terms of the previous notation, the hypothesis set , the data set . Let the learner’s prior over be and the true hypothesis be , then at each blue grid, agents’ (unnormalized) initial matrix . Assume both the BI teacher and the SCBI teacher start at grid . Based on , the BI teacher would choose equally between left and up, whereas the SCBI teacher is more likely to choose left, as the teacher’s likelihood matrix , obtained from Sinkhorn scaling on , assigns higher probability to left. Hence, compared to the BI teacher, who only aims for the final goal, the SCBI teacher tends to cooperate with the learner by selecting less ambiguous moves towards the goal. This point is aligned with the core idea of many existing models of cooperation in cognitive development Jara-Ettinger et al. (2016); Bridgers et al. (in press), pragmatic reasoning Frank and Goodman (2012); Goodman and Stuhlmüller (2013) and robotics Ho et al. (2016); Fisac et al. (2017).

Moreover, even under the same teaching data, the SCBI learner is more likely to infer than the BI learner. For instance, given the teacher’s trajectory , the left plot in Fig. 7 shows the SCBI and BI learners’ posteriors on the true hypothesis . Hence, compared to the BI learner, who reads the teacher’s actions literally, the SCBI learner interprets the teacher’s data cooperatively by updating belief sequentially after each round.

Regarding the stability, consider the case where the learner’s discount factor is either greater or less (with equal probability) than the teacher’s by 0.1. The right plot in Fig. 7 illustrates the expected difference between the learner’s posterior on after observing a teacher’s trajectory of length and the teacher’s estimation of it.

As discussed in Section 4.1 and shown in Fig. 1, Fig. 2 and Fig. 3, as the board gets wider and the number of possible goals grows (i.e. the number of hypotheses increases), the gap between the posteriors of SCBI and BI learners will increase, whereas the expected difference between agents for the same magnitude of perturbation will decrease. Thus, this example illustrates the consistency, sample efficiency, and stability of SCBI versus BI.

## 7 Related Work

Literatures on Bayesian teaching (Eaves and Shafto, 2016; Eaves Jr et al., 2016), Rational Speech Act theory (Frank and Goodman, 2012; Goodman and Stuhlmüller, 2013), and machine teaching (Zhu, 2015, 2013) consider the problem of selecting examples that improve a learner’s chances of inferring a concept. These literatures differ from ours in that they consider the single-step rather than the sequential problem, in that they do not formalize learners who reason about the teacher’s selection process, and in that they present models without a mathematical analysis.

The literature on pedagogical reasoning in human learning (Shafto and Goodman, 2008; Shafto et al., 2012, 2014) and cooperative inference (Yang et al., 2018; Wang et al., 2019a, b) in machine learning formalize full recursive reasoning from the perspectives of both the teacher and the learner. These only consider the problem of a single interaction between the teacher and learner.

The literature on curriculum learning considers sequential interactions with a learner by a teacher in which the teacher presents data in an ordered sequence (Bengio et al., 2009), and traces back to various literatures on human and animal learning (Skinner, 1958; Elman, 1993). Curriculum learning involves one of a number of methods for optimizing the sequence of data presented to the learner, most commonly starting with easier or simpler examples and gradually moving toward more complex or less typical examples. Curriculum learning considers only problems where the teacher optimizes the sequence of examples and the learner does not reason about the teaching.

## 8 Conclusions

Cooperation is central to learning in humans and machines. We set out to provide a mathematical foundation for sequential cooperative Bayesian inference (SCBI). We presented new analytic results demonstrating the consistency and asymptotic rate of convergence of SCBI. Empirically, we demonstrated the sample efficiency and stability to perturbations as compared to Bayesian inference, and illustrated with a simple reinforcement learning problem. We therefore provide strong evidence that SCBI satisfies basic desiderata. Future work will aim to provide mathematical proofs of the empirically observed efficiency and stability.

## Appendix A Proof of Consistency

### A.1 Proof of Theorem 1

###### Theorem 1.

[Theorem 1, (Miescke and Liese, 2008, Theorem 7.115)] In BI, the model is strongly consistent at for each , with arbitrary choice of an interior point (given for all ) as prior.

###### Proof.

We follow the same line as discussed right after this theorem in the paper. Let be the original prior, and let be the posterior after having data points . Then for and , by Bayes’ rule. In other words,

$$
\theta_l(i) = \frac{M_{(x_l,i)}\,\theta_{l-1}(i)}{\sum_{j=1}^{m} M_{(x_l,j)}\,\theta_{l-1}(j)}. \tag{5}
$$

This is a recursive formula, so we may move forward to calculate from a smaller round index with :

$$
\theta_l(i) = \frac{\left[\prod_{s=t}^{l} M_{(x_s,i)}\right]\theta_{t-1}(i)}{\sum_{j=1}^{m}\left[\prod_{s=t}^{l} M_{(x_s,j)}\right]\theta_{t-1}(j)}.
$$

This recursion stops at prior , so we have an explicit expression of :

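$$
\theta_k(i) = \frac{\left[\prod_{s=1}^{k} M_{(x_s,i)}\right]\theta_0(i)}{\sum_{j=1}^{m}\left[\prod_{s=1}^{k} M_{(x_s,j)}\right]\theta_0(j)}. \tag{6}
$$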

It can be seen that for each hypothesis , the denominator of the -th posterior on are the same, so we have

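$$
\frac{\theta_k(i)}{\theta_k(h)} = \frac{\left[\prod_{s=1}^{k} M_{(x_s,i)}\right]\theta_0(i)}{\left[\prod_{s=1}^{k} M_{(x_s,h)}\right]\theta_0(h)}. \tag{7}
$$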

So we define to be the frequency of the occurrence of data in the first rounds of an episode. And then

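$$
\log\left(\frac{\theta_k(i)}{\theta_k(h)}\right) = \log\left(\frac{\theta_0(i)}{\theta_0(h)}\right) + \sum_{d=1}^{n} \alpha_k(d)\,\log\left(\frac{M_{(d,i)}}{M_{(d,h)}}\right). \tag{8}
$$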

Since the data in the model are sampled i.i.d. from the distribution , then for a fixed , follows the multinomial distribution with parameter .

By the strong law of large numbers, almost surely as . Thus

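$$
\frac{1}{k}\log\left(\frac{\theta_k(i)}{\theta_k(h)}\right) \to \sum_{d=1}^{n} M_{(d,h)}\,\log\left(\frac{M_{(d,i)}}{M_{(d,h)}}\right) \quad \text{a.s.} \tag{9}
$$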

That is,

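$$
\frac{1}{k}\log\left(\frac{\theta_k(i)}{\theta_k(h)}\right) \to -\mathrm{KL}\big(M_{(\_,h)}, M_{(\_,i)}\big) \quad \text{a.s.} \tag{10}
$$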

By the assumption in Section 2 of the paper that has distinct columns, the KL divergence between the -th column and the -th column is strictly positive, thus almost surely, , or equivalently, , for any .

Therefore, almost surely, equivalently, BI at is strongly consistent. ∎
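Eq. (8) can be verified numerically by comparing the recursive Bayes update with the count-based closed form. The sketch below uses hypothetical names and a made-up likelihood matrix, and treats $\alpha_k(d)$ as the number of times datum $d$ occurs among the first $k$ data.

```python
import math
from collections import Counter


def bi_posterior(likelihood, prior, data):
    """Recursive Bayes update, one datum at a time (Eq. 5)."""
    theta = list(prior)
    for d in data:
        unnormalized = [likelihood[d][h] * theta[h] for h in range(len(theta))]
        total = sum(unnormalized)
        theta = [v / total for v in unnormalized]
    return theta


def log_ratio_closed_form(likelihood, prior, data, i, h):
    """Right-hand side of Eq. (8), using occurrence counts of each datum."""
    counts = Counter(data)
    return math.log(prior[i] / prior[h]) + sum(
        count * math.log(likelihood[d][i] / likelihood[d][h])
        for d, count in counts.items()
    )


likelihood = [[0.5, 0.2], [0.3, 0.3], [0.2, 0.5]]  # hypothetical, column-normalized
prior = [0.4, 0.6]
data = [0, 2, 1, 0, 0, 2]
posterior = bi_posterior(likelihood, prior, data)
recursive = math.log(posterior[1] / posterior[0])
closed = log_ratio_closed_form(likelihood, prior, data, i=1, h=0)
print(recursive, closed)
```

The two values agree up to floating-point error regardless of the order of `data`, which is the commutativity the proof relies on.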

### A.2 Proof of Theorem 2

###### Theorem 2.

[Theorem 2] In BI, with for some , let the -component of posterior given as random variables valued in . Then converges to almost surely.

###### Proof.

Follow the previous proof. First recall that almost surely. Let , then decays slowest among almost surely.

Therefore, asymptotically,

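$$
\frac{1}{k}\log\left[\frac{\theta_k(\eta)}{\theta_k(h)}\right] \le \frac{1}{k}\log\left[\frac{1-\theta_k(h)}{\theta_k(h)}\right] \le \frac{1}{k}\log\left[\frac{(m-1)\,\theta_k(\eta)}{\theta_k(h)}\right].
$$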

So when we are taking limits , with probability one, we have

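$$
\begin{aligned}
-\mathrm{KL}\big(M_{(\_,h)}, M_{(\_,\eta)}\big) &\le \lim_{k\to\infty}\frac{1}{k}\log\left[\frac{1-\theta_k(h)}{\theta_k(h)}\right] \\
&\le \lim_{k\to\infty}\left(-\mathrm{KL}\big(M_{(\_,h)}, M_{(\_,\eta)}\big) + \frac{1}{k}\log(m-1)\right) \\
&= -\mathrm{KL}\big(M_{(\_,h)}, M_{(\_,\eta)}\big).
\end{aligned} \tag{11}
$$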

### A.3 Proof of Theorem 5

To prove Theorem 5, we need the following lemmas.

###### Lemma A.1.

Given a fixed hypothesis , for any ,

$$
\mathbb{E}_{\mu}(\theta(h)) \le \mathbb{E}_{\Psi(h)(\mu)}(\theta(h)). \tag{12}
$$

Equality holds when for any , and -almost everywhere for .

###### Remark 6.

This lemma shows that the expectation of in each round is increasing, thus the sequence obtained from all the rounds has a limit, since the sequence is monotonic and upper bounded by . To prove the theorem, we then just need to show the limit is .

###### Proof.

We start from the right hand side of Eq. 12. Let denote for short.

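$$
\begin{aligned}
\mathbb{E}_{\Psi(h)(\mu)}(\theta(h)) &= \int_{\Delta} \theta(h)\, d\big(\Psi(h)(\mu)\big)(\theta) \\
&= \int_{\Delta} \sum_{d=1}^{n} \tau_d\big(T_d^{-1}(\theta)\big)\,\theta(h)\, d\big(T^{*}_{d}(\mu)\big)(\theta) \\
&= \sum_{d=1}^{n} \int_{\Delta} \tau_d(\theta)\,\big(T_d(\theta)\big)(h)\, d\big(T^{*}_{d}(\mu)\big)\big(T_d(\theta)\big) \\
&= \sum_{d=1}^{n} \int_{\Delta} \tau_d(\theta)\,\big(T_d(\theta)\big)(h)\, d\mu(\theta) \\
&= \sum_{d=1}^{n} \int_{\Delta}
\end{aligned}
$$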