# Bayesian MAP Model Selection of Chain Event Graphs

G. Freeman and J.Q. Smith
Department of Statistics, University of Warwick, Coventry, CV4 7AL
###### Abstract

The class of chain event graph models is a generalisation of the class of discrete Bayesian networks, retaining most of the structural advantages of the Bayesian network for model interrogation, propagation and learning, while more naturally encoding asymmetric state spaces and the order in which events happen. In this paper we demonstrate how with complete sampling, conjugate closed form model selection based on product Dirichlet priors is possible, and prove that suitable homogeneity assumptions characterise the product Dirichlet prior on this class of models. We demonstrate our techniques using two educational examples.

###### keywords:
chain event graphs, Bayesian model selection, Dirichlet distribution
journal: Journal of Multivariate Analysis

## 1 Introduction

Bayesian networks (BNs) are currently one of the most widely used graphical models for representing and analysing finite discrete multivariate distributions, thanks to their explicit coding of conditional independence relationships between a system’s variables (cowell_probabilistic_1999; lauritzen_graphical_1996). However, despite their power and usefulness, it has long been known that BNs cannot fully or efficiently represent certain common scenarios. These include situations where the state space of a variable is known to depend on other variables, or where the conditional independence between variables is itself dependent on the values of other variables. Examples of the latter scenario are given by Poole and Zhang (poole_exploiting_2003). To overcome such deficiencies, enhancements to the basic Bayesian network have been proposed, creating so-called “context-specific” Bayesian networks (poole_exploiting_2003). These have their own problems, however: either they represent too much of the information about a model in a non-graphical way, thus undermining the rationale for using a graphical model in the first place, or they struggle to represent a general class of models efficiently. Other graphical approaches that seek to account for context-specific beliefs suffer from similar problems.

This has led to the proposal of a new graphical model, the chain event graph (CEG), first propounded in (smith_conditional_2008). As well as solving the aforementioned problems associated with Bayesian networks and related graphical models, CEGs are able, not unrelatedly, to encode far more efficiently in a single graph the common structure in which models are elicited, namely as asymmetric processes. To this end, CEGs are based not on Bayesian networks, but on event trees (ETs) (shafer_art_1996). Event trees are trees whose nodes represent situations, i.e. scenarios in which a unit might find itself, and each node’s outgoing edges represent possible future situations that can develop from the current one. It follows that every atom of the event space is encoded by exactly one root-to-leaf path, and each root-to-leaf path corresponds to exactly one atomic event. It has been argued that ETs are expressive frameworks for directly and accurately representing beliefs about a process, particularly when the model is described most naturally, as in the example below, through how situations might unfold (shafer_art_1996). However, as explained in (smith_conditional_2008), ETs can contain excessive redundancy in their structure, with subtrees describing probabilistically isomorphic unfoldings of situations being represented separately. They are also unable to express a model’s non-trivial conditional independences explicitly. The CEG deals with these shortcomings by combining the subtrees that describe identical subprocesses (see smith_conditional_2008 for further details), so that the CEG derived from a particular ET has a simpler topology while expressing more conditional independence statements than is possible through an ET.

We illustrate the construction and the types of symmetries it is possible to code using a CEG with the following running example.

###### Example 1

Successful students on a one-year programme study components A and B, but not everyone will study the components in the same order: each student will be allocated to study either module A or module B for the first 6 months and then the other component for the final 6 months. After the first 6 months each student will be examined on their allocated module and be awarded a distinction (denoted D), a pass (P) or a fail (F), with an automatic opportunity to resit the module in the last case. If they resit then they can pass and be allowed to proceed to the other component of their course, or fail again and be permanently withdrawn from the programme. Students who have succeeded in proceeding to the second module can again either fail, pass or be awarded a distinction. On this second round, however, there is no possibility of resitting if the component is failed. With an obvious extension of the labelling, this system can be depicted by the event tree given in Figure 1.

To specify a full probability distribution for this model it is sufficient to only specify the distributions associated with the unfolding of each situation a student might reach. However, in many applications it is often natural to hypothesise a model where the distribution associated with the unfolding from one situation is assumed identical to another. Situations that are thus hypothesised to have the same transition probabilities to their children are said to be in the same stage. Thus in Example 1 suppose that as well as subscribing to the ET of Figure 1 we want to consider a model also embodying the following three hypotheses:

1. The chances of doing well in the second component are the same whether the student passed first time or after a resit.

2. The components A and B are equally hard.

3. The distribution of marks for the second component is unaffected by whether students passed or got a distinction for the first component.

These hypotheses can be identified with a partitioning of the non-leaf nodes (situations). In Figure 1 the set of situations is

 S={V0,A,B,P1,A,P1,B,D1,A,D1,B,F1,A,F1,B,PR,A,PR,B}.

The partition of S that encodes exactly the above three hypotheses consists of the stages {A, B}, {P1,A, P1,B, D1,A, D1,B, PR,A, PR,B} and {F1,A, F1,B}, together with the singleton {V0}. Thus the second stage, for example, implies that the probabilities on the edges leaving P1,A and D1,A are equal, as are the probabilities on the edges leaving P1,B and D1,B. Clearly the joint probability distribution of the model, whose atoms are the root-to-leaf paths of the tree, is determined by the conditional probabilities associated with the stages. A CEG is the graph that is constructed to encode a model that can be specified through an event tree combined with a partitioning of its situations into stages.

In this paper we suppose that we are in a context similar to that of Example 1, where, for any possible model, the sample space of the problem must be consistent with a single event tree, but where on the basis of a sample of students’ records we want to select one of a number of different possible CEG models, i.e. we want to find the “best” partitioning of the situations into stages. We take a Bayesian approach to this problem and choose the model with the highest posterior probability, the Maximum A Posteriori (MAP) model. This is the simplest and possibly most common Bayesian model selection method, advocated by, for example, Denison et al. (denison_bayesian_2002), Castelo (castelo_discrete_2002), and Heckerman (heckerman_tutoriallearning_1999), the latter two specifically for Bayesian network selection.

The paper is structured as follows. In the next section we review the definitions of event trees and CEGs. In Section 3 we develop the theory of how conjugate learning of CEGs is performed. In Section 4 we apply this theory by using the posterior probability of a CEG as its score in a model search algorithm that is derived using an analogous procedure to the model selection of BNs. We characterise the product Dirichlet distribution as a prior distribution for the CEGs’ parameters under particular homogeneity conditions. In Section 5 the algorithm is used to discover a good explanatory model for real students’ exam results. We finish with a discussion.

## 2 Definitions of event trees and chain event graphs

In this section we briefly define the event tree and chain event graph. We refer the interested reader to (smith_conditional_2008) for further discussion and more detail concerning their construction. Bayesian networks, which will be referenced throughout the paper, have been defined many times before; see (heckerman_tutoriallearning_1999) for an overview.

### 2.1 Event Trees

Let T = (V(T), E(T)) be a directed tree, where V(T) is its node set and E(T) its edge set. Let S(T) = V(T) \ L(T) be the set of situations of T, where L(T) is the set of leaf (or terminal) nodes. Furthermore, define Λ(T) = {λ(v0, l) : l ∈ L(T)}, where λ(v, v′) is the path from node v to node v′, and v0 is the root node, so that Λ(T) is the set of root-to-leaf paths of T. Each element of Λ(T) is called an atomic event, each one corresponding to a possible unfolding of events through time by using the partial ordering induced by the paths. Let ch(v) denote the set of children of v. In an event tree, each situation v ∈ S(T) has an associated random variable X(v) with sample space ch(v), defined conditional on v having been reached. The distribution of X(v) is determined by the primitive probabilities π(v′|v), v′ ∈ ch(v). With the random variables on the same path being mutually independent, the joint probability of the events on a path can be calculated by multiplying the appropriate primitive probabilities together. Each primitive probability π(v′|v) also serves as a colour for the directed edge (v, v′).

###### Example 2

Figure 2 shows a tree for two Bernoulli random variables, X1 and X2, with X1 occurring before X2. In an educational example X1 could be the indicator variable of a student passing one module, and X2 the indicator variable for a subsequent module.

Here we have the random variables X(v0), X(v1) and X(v2), where v1 and v2 denote the children of the root v0, together with primitive probabilities π(v1|v0), π(v2|v0), and so on for every other edge. Joint probabilities can be found by multiplying primitive probabilities along a path; for example, the probability of an atomic event passing through v1 is the product of π(v1|v0) and the primitive probabilities of the subsequent edges, as these edges lie on a single root-to-leaf path.
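The path-multiplication rule can be sketched in code. This is a minimal illustration with hypothetical node names and probability values, not those of Figure 2: each situation stores the primitive probabilities of its outgoing edges, and an atomic event’s probability is the product along its root-to-leaf path.

```python
# Hypothetical event tree for two binary variables: the outcome of a first
# module, then a second. Each situation maps child-node labels to the
# primitive probabilities on its outgoing edges.
tree = {
    "v0":    {"pass1": 0.7, "fail1": 0.3},   # root floret: first module
    "pass1": {"pass2": 0.8, "fail2": 0.2},   # second module after a pass
    "fail1": {"pass2": 0.4, "fail2": 0.6},   # second module after a fail
}

def path_probability(tree, path):
    """Multiply the primitive probabilities along a root-to-leaf path."""
    prob = 1.0
    for node, child in zip(path, path[1:]):
        prob *= tree[node][child]
    return prob

# P(pass first module, pass second module) = 0.7 * 0.8
print(path_probability(tree, ["v0", "pass1", "pass2"]))
```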

### 2.2 Chain Event Graphs

Starting with an event tree T, define a floret of v ∈ S(T) as

 F(v,T) = (V(F(v,T)), E(F(v,T)))

where V(F(v,T)) = {v} ∪ ch(v) and E(F(v,T)) = {(v, v′) ∈ E(T) : v′ ∈ ch(v)}. The floret of a vertex v is thus a sub-tree consisting of v, its children, and the edges connecting v and its children, as shown in Figure 3. This represents, as defined in Section 2.1, the random variable X(v) and its sample space ch(v).

One of the redundancies that can be eliminated from an ET concerns the floret edges of two situations, v and v′ say, which have identical associated edge probabilities despite being defined by different conditioning paths. We say these two situations are at the same stage. This concept is formally defined as follows.

###### Definition 3

Two situations v, v′ ∈ S(T) are in the same stage u if and only if X(v) and X(v′) have the same distribution under a bijection

 ψu(v,v′) : E(F(v,T)) → E(F(v′,T))

i.e. a bijection identifying the two sample spaces,

 ψu(v,v′) : X(v) → X(v′)

The set of stages of an ET partitions its set of situations S(T).

We can construct a staged tree from T by colouring the edges of its situations such that:

• If v ∈ u and u contains no other situations, then the edges of F(v,T) are left uncoloured;

• If v ∈ u and u contains other situations, then the edges of F(v,T) are coloured; and

• Whenever e′ = ψu(v,v′)(e) under the bijection of Definition 3, the two edges e and e′ must have the same colour.

There is another type of situation that is of further interest. When the whole development from two situations v and v′ has identical distributions, i.e. there exists a bijection between their respective subtrees similar to that between stages as defined in Definition 3, then the situations are said to be in the same position. This is defined formally as follows.

###### Definition 4

Two situations v, v′ ∈ S(T) are in the same position w if and only if there exists a bijection

 ϕw(v,v′) : Λ(v,T) → Λ(v′,T)

where Λ(v,T) is the set of paths in T from v to a leaf node of T, such that

• all edges in all of the paths in Λ(v,T) and Λ(v′,T) are coloured in the staged tree; and

• for every path λ ∈ Λ(v,T), the ordered sequence of colours in λ equals the ordered sequence of colours in ϕw(v,v′)(λ)

This ensures that when v and v′ are in the same position, the future development from either node follows identical probability distributions under the map ϕw(v,v′).

Positions are an obvious way of equating situations, because the different conditioning variables of different nodes in the same position have no effect on any subsequent development. It is clear that the set of positions is a finer partition of S(T) than the set of stages, and indeed that the position partition refines the stage partition, as situations in the same position will also be in the same stage.

We now use stages and positions to compress the event tree into a chain event graph. First, the probability graph of the event tree

 H(G(T))=H(T)=(V(H),E(H))

is drawn, where V(H) is the set of positions of T and E(H) is constructed as follows.

• For each pair of positions w, w′ ∈ V(H), if there exist v ∈ w and v′ ∈ w′ such that v′ ∈ ch(v), then an associated edge (w, w′) is drawn. Furthermore, if for a position w′ there exist v ∈ w and distinct children v′, v′′ ∈ ch(v) such that v′, v′′ ∈ w′, then a separate associated edge is drawn for each such child.

• The colour of each such edge (w, w′) is the same as the colour of the associated edge (v, v′).

Now the CEG can finally be constructed by taking the probability graph and connecting the positions that are in the same stage using undirected edges: let the CEG be a mixed graph with vertex set V(H), directed edge set E(H), and an undirected edge set joining the pairs of positions that are in the same stage.

An example of a CEG that could be constructed from the event tree in Figure 1 is shown in Figure 5.

## 3 Conjugate learning of CEGs

One convenient property of CEGs is that conjugate updating of the model parameters proceeds in a closely analogous fashion to that on a BN. Conjugacy is a crucial part of the model selection algorithm that will be described in Section 4, because it leads to closed form expressions for the posterior probabilities of candidate CEGs. This in turn makes it possible to search the often very large model space quickly to find optimal models. We demonstrate here how a conjugate analysis on a CEG proceeds.

Let a CEG C have set of stages {u1, …, uk}, and let each stage ui have ki emanating edges (labelled j = 1, …, ki) with associated probability vector πi = (πi1, …, πiki) (where πij ≥ 0 and ∑j πij = 1 for i = 1, …, k). Then, under random sampling, the likelihood of the CEG can be decomposed into a product of the likelihood of each probability vector, i.e.

 p(x|π,C) = ∏_{i=1}^{k} pi(xi|πi,C)

where π = (π1, …, πk), and x = (x1, …, xk) is the complete sample data, such that each xi = (xi1, …, xiki) is the vector of the numbers of units in the sample (for example, the students in Example 1) that arrive at stage ui and move along its edge j, for j = 1, …, ki.

If it is further assumed that the units in the sample behave independently, then

 pi(xi|πi,C) = ∏_{j=1}^{ki} πij^{xij} (1)

Thus, just as for the analogous situation with BNs, the likelihood of a random sample separates over the components of π. With BNs, a common modelling assumption is of local and global independence of the probability parameters (spiegelhalter_sequential_1990); the corresponding assumption here is that the parameters π1, …, πk of C are all mutually independent a priori. With the separable likelihood, it then follows that they will also be independent a posteriori.

If the probabilities πi are assigned a Dirichlet distribution, πi ∼ Dir(αi1, …, αiki), a priori, where αij > 0 for j = 1, …, ki, then for values of πi such that πij ≥ 0 and ∑j πij = 1, the density of πi, qi(πi|C), can be written

 qi(πi|C) = [Γ(αi1 + … + αiki) / (Γ(αi1)…Γ(αiki))] ∏_{j=1}^{ki} πij^{αij−1}

where Γ(·) is the Gamma function. It then follows that πi also has a Dirichlet distribution, πi | x ∼ Dir(α*i1, …, α*iki), a posteriori, where α*ij = αij + xij for j = 1, …, ki. The marginal likelihood of this model can be written down explicitly as a function of the prior and posterior Dirichlet parameters:

 p(x|C) = ∏_{i=1}^{k} [ Γ(∑_j αij)/Γ(∑_j α*ij) ∏_{j=1}^{ki} Γ(α*ij)/Γ(αij) ].

The computationally more useful logarithm of the marginal likelihood is therefore a linear combination of functions of the αij and α*ij. Explicitly,

 log p(x|C) = ∑_{i=1}^{k} [s(αi) − s(α*i)] + ∑_{i=1}^{k} [t(α*i) − t(αi)] (2)

where for any vector c = (c1, …, cn),

 s(c) = log Γ(∑_{v=1}^{n} cv) and t(c) = ∑_{v=1}^{n} log Γ(cv) (3)

So the posterior probability of a CEG C after observing x, q(C|x), can be calculated using Bayes’ Theorem, given a prior probability q(C):

 log q(C|x) = log p(x|C) + log q(C) + K (4)

for some value K which does not depend on C. This is the score that will be used when searching over the candidate set of CEGs for the model that best describes the data.
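Equations (2) and (3) translate directly into code. The following sketch (standard library only; the hyperparameters and counts are hypothetical) computes the log marginal likelihood of a CEG from its stage hyperparameters and observed edge counts.

```python
from math import lgamma

def s(c):
    # s(c) = log Γ(Σ_v c_v), as in Equation (3)
    return lgamma(sum(c))

def t(c):
    # t(c) = Σ_v log Γ(c_v), as in Equation (3)
    return sum(lgamma(cv) for cv in c)

def log_marginal_likelihood(alphas, counts):
    """log p(x|C) from Equation (2). alphas[i] holds the prior Dirichlet
    hyperparameters of stage i; counts[i] the observed edge counts there.
    The conjugate posterior hyperparameters are simply alpha + x."""
    total = 0.0
    for a, x in zip(alphas, counts):
        a_post = [av + xv for av, xv in zip(a, x)]
        total += s(a) - s(a_post) + t(a_post) - t(a)
    return total

# One two-edge stage with a Dir(1, 1) prior and counts (3, 1): the marginal
# likelihood of this particular sequence is Γ(2)Γ(4)Γ(2)/Γ(6) = 1/20.
print(log_marginal_likelihood([[1.0, 1.0]], [[3, 1]]))
```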

## 4 A Local Search Algorithm for Chain Event Graphs

### 4.1 Preliminaries

With the log marginal posterior probability of a CEG model, log q(C|x), as its score, searching for the highest-scoring CEG in the set of all candidate models is equivalent to trying to find the Maximum A Posteriori (MAP) model (bernardo_bayesian_1994). The intuitive approach to searching the candidate set of CEGs, namely calculating q(C|x) (or log q(C|x)) for every candidate C and choosing the maximiser, is infeasible for all but the most trivial problems. We describe in this section an algorithm for efficiently searching the model space by reformulating the model search problem as a clustering problem.

As mentioned in Section 2.2, every CEG that can be formed from a given event tree can be identified exactly with a partition of the event tree’s situations into stages. The coarsest partition places the situations in as few stages as possible; the finest partition has each situation in its own stage, except for the trivial cases of those nodes with only one outgoing edge. Defined this way, the search for the highest-scoring CEG is equivalent to searching for the highest-scoring clustering of situations into stages.

Various Bayesian clustering algorithms exist (lau_bayesian_2007), including many involving MCMC (richardson_bayesian_1997). We show here how to implement an exact Bayesian agglomerative hierarchical clustering (AHC) algorithm related to that of Heard et al. (heard_quantitative_2006). The AHC algorithm here is a local search algorithm that begins with the finest partition of the nodes of the underlying ET model and seeks at each step to find the two stages that will yield the highest-scoring CEG if combined.

Some optional steps can be taken to simplify the search, which we will implement here. The first of these involves the calculation of the scores of the proposed models in the algorithm. By assuming that the probability distributions of stages that are formed from the same nodes of the underlying ET are equal in all CEGs, it becomes more efficient to calculate the differences of model scores, i.e. the logarithms of the relevant Bayes factors, than to calculate the two individual model scores absolutely. This is because, if for two CEGs C1 and C2 their stage sets differ only in that the stages ua and ub of C1 are combined into the single stage uc of C2, with all other stages unchanged, then the calculation of the logarithm of their posterior Bayes factor depends only on the stages involved; using the notation of Equation (3),

 log [q(C1|x)/q(C2|x)] = log q(C1|x) − log q(C2|x) (5)
 = log q(C1) − log q(C2) + log p(x|C1) − log p(x|C2) (6)
 = log q(C1) − log q(C2) + ∑_i [s(α1i) − s(α*1i)] + ∑_i [t(α*1i) − t(α1i)] − ∑_i [s(α2i) − s(α*2i)] − ∑_i [t(α*2i) − t(α2i)] (7)
 = log q(C1) − log q(C2) + s(α1a) − s(α*1a) + t(α*1a) − t(α1a) + s(α1b) − s(α*1b) + t(α*1b) − t(α1b) − s(α2c) + s(α*2c) − t(α*2c) + t(α2c) (8)

Using the trivial result that for any three CEGs C1, C2 and C3,

 log q(C3|x) − log q(C2|x) = [log q(C3|x) − log q(C1|x)] − [log q(C2|x) − log q(C1|x)],

it can be seen that in the course of the AHC algorithm, comparing two proposal CEGs derived from the current CEG can be done equivalently by comparing their log Bayes factors against the current CEG, which as shown above requires fewer calculations.

Equation (4) shows that the score of each CEG C is formed of two components: the prior probability of the CEG being the true model and the marginal likelihood of the data. These must therefore be set before the algorithm can be run, and it is here that the other simplifications are made.

### 4.2 The prior over the CEG space

For any practical problem the set of all possible CEGs for a given ET is likely to be very large, making setting a value for the prior probability of each candidate a non-trivial task. An obvious way to set a non-informative or exploratory prior is to choose the uniform prior, so that q(C) is equal for every candidate. This has the advantages of being simple to set and of eliminating the log q(C1) − log q(C2) term in Equation (8).

A more sophisticated approach is to consider which potential clusters are more or less likely a priori, according to structural or causal beliefs, and to exploit the modular nature of CEGs by stating that the prior log Bayes factor of a CEG relative to the finest-partition CEG is the sum of the prior log Bayes factors of its individual clusters relative to those clusters’ components being completely unclustered, and that these priors are modular across CEGs. This approach makes it simple to elicit priors over the CEG space from a lay expert, by requiring the elicitation only of the prior probability of each possible stage.

A particular computational benefit of this approach arises when the prior Bayes factor of a CEG is believed to be zero, because one or more of its clusters is considered to be impossible. This is equivalent in the algorithm to not including that CEG in the search at all, as though it were never a candidate in the first place, with the obvious simplification of the search following.

### 4.3 The prior over the parameter space

Just as when attempting to set the prior over models, the size of the CEG space in practice makes setting the parameter prior for each CEG individually intractable. However, the task is again made possible by exploiting the structure of a CEG with judicious modelling assumptions.

Assuming independence between the likelihoods of the stages for every CEG, so that each stage likelihood is as determined by Equation (1), it is clear that setting the marginal likelihood for each CEG is equivalent to setting the prior over the CEG’s parameters, i.e. setting qi(πi|C) for each stage ui. With the two further structural assumptions that the stage priors are independent for all CEGs and that equivalent stages in different CEGs have the same prior distributions on their probability vectors, it can be seen that the problem is reduced to setting the parameter priors of each non-trivial floret in the finest-partition CEG and the parameter priors of stages that are clusters of stages of that CEG.

The usual prior put on the probability parameters of finite discrete BNs is the product Dirichlet distribution. In Geiger and Heckerman (geiger_characterization_1997) the surprising result was shown that a product Dirichlet prior is inevitable if local and global independence are assumed to hold over all Markov equivalent graphs on at least two variables. In this paper we show that a similar characterisation can be made for CEGs given the assumptions in the previous paragraph. We will first show that the floret parameters in the finest-partition CEG must have Dirichlet priors, and second that all CEGs formed by clustering its florets have Dirichlet priors on the stage parameters. One characterisation is given by Theorem 5.

###### Theorem 5

If it is assumed a priori that the rates at which units take the root-to-leaf paths in T are independent (“path independence”) and that the probability of which edge units take after arriving at a situation v is independent of the rate at which units arrive at v (“floret independence”), then the non-trivial florets of T have independent Dirichlet priors on their probability vectors.

{proof}

The proof is in the Appendix.

Thus the prior is entirely determined by the stated rates on the root-to-leaf paths of T. This is similar to the “equivalent sample sizes” method of assessing prior uncertainty of Dirichlet hyperparameters in BNs, as discussed in Section 2 of Heckerman (heckerman_tutoriallearning_1999).

Another way to show that all non-trivial situations in the finest-partition CEG have Dirichlet priors on their parameter spaces is to use the characterisation of the Dirichlet distribution first proven by Geiger and Heckerman (geiger_characterization_1997), repeated here as Theorem 6.

###### Theorem 6

Let {θij}, i = 1, …, k, j = 1, …, n, where k and n are integers greater than 1, be positive random variables summing to one and having a strictly positive pdf. Define θi• = ∑_{j=1}^{n} θij, θj|i = θij/θi•, θ = (θ1•, …, θk•), and θ|i = (θ1|i, …, θn|i) for i = 1, …, k.

Then if θ, θ|1, …, θ|k are mutually independent, the joint distribution of {θij} is Dirichlet.

{proof}

Theorem 2 of Geiger and Heckerman (geiger_characterization_1997).

###### Corollary 7

If T has a composite number of root-to-leaf paths and all Markov equivalent CEGs have independent floret distributions, then the vector of probabilities on the root-to-leaf paths of T must have a Dirichlet prior. This means in particular that, from the properties of the Dirichlet distribution, the floret of each situation with at least two outgoing edges has a Dirichlet prior on its edges.

{proof}

Construct an event tree T′ with kn = |Λ(T)| root-to-leaf paths, where the floret of the root node has k edges and each of the k florets extending from the children of the root has n edges terminating in leaf nodes. This will always be possible when |Λ(T)| is composite. T′ describes the same atomic events as T with a different decomposition.

Let the random variable associated with the root floret of T′ be Y, and let the random variable associated with the floret at the root’s ith child be Yi, for i = 1, …, k. Let θij denote the probability of the root-to-leaf path through the jth edge of the ith child’s floret. Then by the definition of event trees, P(Y = i) = ∑_j θij and P(Yi = j | Y = i) = θij/∑_j θij; in the notation of Theorem 6, these are θi• and θj|i respectively.

By hypothesis the floret distributions of T′ are independent. Therefore the condition of Theorem 6 holds and hence the joint distribution of the path probabilities {θij} is Dirichlet. From the equivalence of the atomic events, the distribution over the root-to-leaf path probabilities of T is also Dirichlet, and so by Lemma 16, all non-trivial florets of T have Dirichlet priors on their probability vectors.

To show that the stage parameters of all the other CEGs in the candidate set have independent Dirichlet priors, an inductive approach will be taken. Because of the assumption of consistency – that two identically composed stages in different CEGs have identical priors on their parameter spaces – for any given CEG whose stages all have independent Dirichlet priors on their parameter spaces, it is known that another CEG formed by clustering two of its stages into one stage u will have independent Dirichlet priors on all its stages apart from u. It is thus only required to show that u has a Dirichlet prior. We prove this result for a class of CEGs called regular CEGs.

###### Definition 8

A stage u is regular if and only if every path λ ∈ Λ(T) contains either one situation in u or none of the situations in u.

###### Definition 9

A CEG is regular if and only if every stage is regular.

###### Theorem 10

Let C1 be a regular CEG, and let C2 be the CEG that is formed from C1 by setting two of its stages, ua and ub, as being in the same stage uc, where uc is a regular stage, with all other attributes of the CEG unchanged from C1.

If all stages in C1 have Dirichlet priors, then, assuming that equivalent stages in different CEGs have equivalent priors, all stages in C2 have Dirichlet priors.

{proof}

Without loss of generality, let all situations in ua and ub have m children each, and let the total number of situations in ua and ub be n. Thus there are n situations in uc, each with m children. By the assumption of prior consistency across stages, all stages of C2 other than uc have Dirichlet priors on their parameter spaces, so it is only required to prove that uc has a Dirichlet prior.

Consider the CEG D1 formed as follows: let the root node of D1, w0, have 2 children, w1 and w2. Let w2 be a terminal node, and let w1 have n children, w11, …, w1n, which are equivalent to the situations in uc, including the property that they are in the same stage u. Lastly, let the m children of each w1i be leaf nodes in D1.

By construction, the prior for u is the same as that for uc.

Now construct another CEG D2 from D1 by reversing the order of the two clustered levels. The new CEG has root node w0 with the same distribution as before. w0 again has two children, w2 as before and w′1, which now has m children in the same stage. Each of these m nodes has n children, all of which are leaf nodes in D2.

The two CEGs D1 and D2 are Markov equivalent, as it is clear that they assign the same probabilities to the same atomic events. The probabilities on the floret of w′1 are thus equal to the probabilities shared by the situations in the stage u of D1, and hence to those of uc. Because w′1 forms a stage with only one situation, Theorem 5 implies that its probability vector has a Dirichlet prior. Therefore uc has a Dirichlet prior.

An alternative justification for assigning a Dirichlet prior to any stage that is formed by clustering situations with Dirichlet priors on their parameter spaces can be obtained without assuming Markov equivalence between CEGs derived from different event trees, by assuming a property analogous to “parameter modularity” for BNs (heckerman_bayesian_1995). This property states that the distribution over structures common to two CEGs should be identical.

###### Definition 11

Let u be a stage in a CEG C composed of the situations v1, …, vn from the finest-partition CEG, each of which has children vi1, …, vim such that the edges (vi, vij) are the same colour for all i, for each j. Then u has the property of margin equivalency if

 πuj = P(v1j or v2j or … or vnj | v1 or v2 or … or vn) (9)
 = ∑_{i=1}^{n} P(vij) / ∑_{i=1}^{n} P(vi) (10)

is the same for both C and the finest-partition CEG, for j = 1, …, m.

###### Definition 12

A CEG has margin equivalency if all of its stages have margin equivalency.

###### Theorem 13

Let u be a stage as defined in Definition 11 with margin equivalency. Then, assuming independent priors between the situations for the associated finest-partition CEG, where πi ∼ Dir(αi1, …, αim) for each i, it follows that, for both CEGs, πu ∼ Dir(α1, …, αm), where αj = ∑_{i=1}^{n} αij.

{proof}

From Theorem 5 or Corollary 7, every non-trivial floret in the finest-partition CEG has a Dirichlet prior on its edges, which includes in this case the florets of the situations v1, …, vn.

Let γij ∼ Gamma(αij, β) independently, for some β > 0. Then it is a well-known fact that (γi1, …, γim)/∑_j γij ∼ Dir(αi1, …, αim) for each i, and that sums of independent Gamma variables with common scale β are again Gamma with the shape parameters added. As πi ∼ Dir(αi1, …, αim), each πi can be represented in this way. Then by Lemma 15, aggregating the situations edge by edge,

 πu ∼ Dir(∑_i αi1, …, ∑_i αim)

By margin equivalency, πu must be set the same way for both CEGs.
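The aggregation claim of Theorem 13 can be checked by simulation. The sketch below (standard library only; the hyperparameter values are hypothetical) builds the merged-stage vector from independent Gamma variables, γij ∼ Gamma(αij, 1), and compares its Monte Carlo means with those of the claimed Dir(∑i αi1, …, ∑i αim) distribution.

```python
import random

random.seed(0)

# Hyperparameters for two situations with three edges each (hypothetical values).
alpha = [[1.0, 2.0, 3.0],
         [4.0, 1.0, 1.0]]
m = len(alpha[0])
N = 50_000

# Gamma construction of the merged-stage vector: gamma_ij ~ Gamma(alpha_ij, 1),
# pi_u[j] = sum_i gamma_ij / sum_ij gamma_ij.
# Theorem 13 claims pi_u ~ Dir(5, 3, 4) here, with mean (5/12, 3/12, 4/12).
totals = [0.0] * m
for _ in range(N):
    gammas = [[random.gammavariate(a, 1.0) for a in row] for row in alpha]
    col = [sum(row[j] for row in gammas) for j in range(m)]
    norm = sum(col)
    for j in range(m):
        totals[j] += col[j] / norm

mc_means = [tj / N for tj in totals]
expected = [sum(row[j] for row in alpha) / sum(map(sum, alpha)) for j in range(m)]
print(mc_means, expected)  # Monte Carlo means vs exact Dirichlet means
```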

Note that the posterior of πu for a stage u composed of the situations v1, …, vn is thus Dir(α*1, …, α*m), where α*j = ∑_i (αij + xij). Equation (8), therefore, becomes

 log [q(C1|x)/q(C2|x)] = log q(C1) − log q(C2) + s(α1a) − s(α*1a) + t(α*1a) − t(α1a) + s(α1b) − s(α*1b) + t(α*1b) − t(α1b) − s(α1a + α1b) + s(α*1a + α*1b) − t(α*1a + α*1b) + t(α1a + α1b) (11)
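Equation (11) gives the local score that decides a single merge. A minimal sketch, assuming equal model priors (so the log q(C1) − log q(C2) term vanishes) and hypothetical hyperparameters and counts:

```python
from math import lgamma

def s(c):
    return lgamma(sum(c))            # s(c) of Equation (3)

def t(c):
    return sum(lgamma(cv) for cv in c)   # t(c) of Equation (3)

def log_bayes_factor_merge(alpha_a, x_a, alpha_b, x_b):
    """Equation (11): log Bayes factor for keeping stages a and b separate (C1)
    versus merging them (C2), with the merged stage given the summed
    hyperparameters of Theorem 13. Negative values favour the merge."""
    post = lambda a, x: [av + xv for av, xv in zip(a, x)]
    a_star, b_star = post(alpha_a, x_a), post(alpha_b, x_b)
    ab = [p + q for p, q in zip(alpha_a, alpha_b)]
    ab_star = [p + q for p, q in zip(a_star, b_star)]
    return (s(alpha_a) - s(a_star) + t(a_star) - t(alpha_a)
            + s(alpha_b) - s(b_star) + t(b_star) - t(alpha_b)
            - s(ab) + s(ab_star) - t(ab_star) + t(ab))

# Identical observed proportions in the two stages favour the merge,
# so the log Bayes factor comes out negative.
print(log_bayes_factor_merge([1, 1], [40, 10], [1, 1], [40, 10]))
```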

### 4.4 The algorithm

The algorithm thus proceeds as follows:

1. Starting with the initial ET model, form the CEG with the finest possible partition, where all leaf nodes are placed in the terminal stage and all nodes with only one emanating edge are placed in the same stage. Calculate its score using (4).

2. For each pair of situations with the same number of edges, calculate the log Bayes factor of Equation (11), where the proposal CEG is formed by placing the pair in the same stage and keeping all other situations in their own stages; do not perform the calculation for any pair whose merging has zero prior probability.

3. Let the highest-scoring of these proposals be the new current CEG.

4. Now calculate the log Bayes factor for each pair of stages in the current CEG, except for those pairs already computed, and record the highest-scoring proposal.

5. Continue in this way until the coarsest partition has been reached.

6. Find the highest-scoring CEG in the resulting sequence, and select this as the MAP model.

We note that the algorithm can also be run backwards, starting from the coarsest partition and splitting one cluster in two at each step. This has the advantage of making the identification of positions in the MAP model easier.
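The forward steps above can be sketched as a greedy merge loop. This illustrative version (hypothetical stage names and data) assumes uniform model priors, sums hyperparameters on a merge as in Theorem 13, and stops as soon as no merge improves the score rather than sweeping all the way to the coarsest partition and then taking the maximum.

```python
from math import lgamma
from itertools import combinations

def log_ml(alpha, counts):
    """Log marginal likelihood contribution of one stage, as in Equation (2)."""
    post = [a + x for a, x in zip(alpha, counts)]
    return (lgamma(sum(alpha)) - lgamma(sum(post))
            + sum(lgamma(p) for p in post) - sum(lgamma(a) for a in alpha))

def ahc_map_search(stages):
    """Greedy AHC over stages, given as a dict name -> (alpha, counts).
    Repeatedly merges the pair with the best log Bayes factor until no
    merge raises the score; only stages with equally many edges may merge."""
    stages = dict(stages)
    while len(stages) > 1:
        best = None
        for a, b in combinations(stages, 2):
            (aa, xa), (ab, xb) = stages[a], stages[b]
            if len(aa) != len(ab):
                continue
            merged = ([p + q for p, q in zip(aa, ab)],
                      [p + q for p, q in zip(xa, xb)])
            gain = log_ml(*merged) - log_ml(aa, xa) - log_ml(ab, xb)
            if best is None or gain > best[0]:
                best = (gain, a, b, merged)
        if best is None or best[0] <= 0:
            break  # no merge improves the posterior score
        gain, a, b, merged = best
        stages[a + "+" + b] = merged
        del stages[a], stages[b]
    return stages

# Three situations with Dir(1, 1) priors; the first two behave near-identically,
# so the search should merge u1 with u2 and leave u3 alone.
result = ahc_map_search({
    "u1": ([1, 1], [40, 10]),
    "u2": ([1, 1], [38, 12]),
    "u3": ([1, 1], [5, 45]),
})
print(sorted(result))
```

Because only scores involving the newly merged stage change between sweeps, a production implementation would cache the pairwise Bayes factors rather than recomputing them all, exactly as the identity following Equation (8) suggests.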

## 5 Examples

### 5.1 Simulated data

To first demonstrate the efficacy of the algorithm described above, we run it on simulated data for Example 1, where the CEG generating the data was known and is as described in Section 1. Figure 4 shows the number of students in the sample who reached each situation in the tree.

In this complete dataset the progress of 1000 students has been tracked through the event tree. Half are assigned to take module A first and the other half module B. By finding the MAP CEG model in the light of this data we may find out whether the three hypotheses posed in the introduction are valid. We repeat them here for convenience:

1. The chances of doing well in the second component are the same whether the student passed first time or after a resit.

2. The components A and B are equally hard.

3. The distribution of marks for the second component is unaffected by whether students passed or got a distinction for the first component.

For illustration purposes we set a uniform prior over the candidate CEGs and a uniform prior on the root-to-leaf paths of the finest partition of the tree. The algorithm is then implemented as follows.

There are only two florets with two edges; with Beta(1,3) priors on each and a Beta(2,6) prior on the combined stage, the log Bayes factor is -1.85. Carrying out similar calculations for all the pairs of nodes with three edges, it is first decided to merge the nodes and , which has a log Bayes factor of -3.76 against leaving them apart. Applying the algorithm to the updated set of nodes and iterating, the CEG in Figure 5 is found to be the MAP one.

Under this model, it can be seen that all three hypotheses above are satisfied and that the MAP model is the correct one.

### 5.2 Student test data

In our second example we apply the learning algorithm to a real dataset in order to test its efficacy in a real-life situation and to identify remaining issues with its usage. The dataset we used was an appropriately disguised set of marks taken over a 10-year period from four core modules of the MORSE degree course taught at the University of Warwick. A part of the event tree used as the underlying model for the first two modules is shown in Figure 6, along with a few illustrative data points. This is a simplification of a much larger study that we are currently investigating, but it is large enough to illustrate the richness of inference possible with our model search.

For simplicity, the prior distributions on the candidate models and on the root-to-leaf paths for were both chosen to be uniform distributions.

The MAP CEG model was not , so there were some non-trivial stages. In total, 170 situations were clustered into 32 stages. Some of the more interesting stages of this model are described in Table 1.

By inspecting the membership of the stages it was possible to identify various situations that were discovered to share distributions. For example, students who reach one of the two situations in stage 7 have an expected probability of 0.47 of getting a high mark, an expected probability of 0.44 of getting a middling grade, and an expected probability of only 0.08 of achieving the lowest grade. Because these situations form a stage of their own, it can be deduced that students in them have qualitatively different prospects from students in any other situation. In contrast, students who reach one of the four situations in stage 17 have an expected probability of 0.66 of getting the lowest grade.
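The expected probabilities quoted here are posterior means under the conjugate Dirichlet update, E[theta_i | data] = (alpha_i + n_i)/(A + N). A minimal sketch with hypothetical counts and a uniform Dir(1,1,1) prior:

```python
def posterior_means(counts, alpha):
    """Posterior mean edge probabilities for a stage under a Dir(alpha) prior:
    E[theta_i | data] = (alpha_i + n_i) / (A + N)."""
    total = sum(alpha) + sum(counts)
    return [(a + n) / total for a, n in zip(alpha, counts)]

# Hypothetical stage counts (high / middling / lowest grade), uniform prior:
means = posterior_means([46, 43, 7], [1, 1, 1])  # rounds to [0.47, 0.44, 0.08]
```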

## 6 Discussion

In this paper we have shown that chain event graphs are not just an efficient way of storing the information contained in an event tree, but also a natural way to represent the information that is most easily elicited from a domain expert: the order in which events happen, the distributions of variables conditional on the process up to the point they are reached, and prior beliefs about the relative homogeneity of different situations. This strength is exploited when the MAP CEG is discovered, as this can be used in a qualitative fashion to detect homogeneity between seemingly disparate situations.

There are a number of extensions to the theory in this paper that are currently being pursued. These fall mostly into two categories: creating even richer model classes than those considered here; and developing even more efficient algorithms for selecting the MAP model in these model classes.

The first category includes dynamic chain event graphs. This framework can supply a number of different model classes. The simplest case involves selecting a CEG structure that is constant across time, but with a time series on its parameters. A bigger class would allow the MAP CEG structure to change over time. These larger model classes would clearly be useful in the educational setting considered in this paper, as they would allow for background changes in the students’ abilities, for example.

Another important model class is that which arises from uncertainty about the underlying event tree. A similar model search algorithm to the one described in this paper is possible in this case after setting a prior distribution on the candidate event trees.

In order to search any of these model classes more effectively, the problem of finding the MAP model can be reformulated as a weighted MAX-SAT problem, for which algorithms have been developed. This approach was used to great effect for finding a MAP BN by Cussens cussens_bayesian_2008 ().

## Appendix

Theorem 5 is based on three well-known results concerning properties of the Dirichlet distribution, which we review below.

###### Lemma 14

Let where for , and . Furthermore, let for , where .

Then .

{proof}
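Lemma 14 is the standard Gamma-normalisation construction of the Dirichlet distribution: if the gamma_i are independent Gamma(alpha_i, beta) variables with a common scale and gamma is their sum, then the normalised vector (gamma_1/gamma, ..., gamma_n/gamma) has a Dir(alpha_1, ..., alpha_n) distribution. A quick Monte Carlo check of the first moments (the parameter values below are illustrative):

```python
import numpy as np

# If gamma_i ~ Gamma(alpha_i, beta) independently with a common scale, then
# theta = gamma / gamma.sum() ~ Dir(alpha).  Compare empirical means of the
# normalised Gammas with the Dirichlet means E[theta_i] = alpha_i / A.
rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])
g = rng.gamma(shape=alpha, scale=1.0, size=(200_000, 3))  # common scale parameter
theta = g / g.sum(axis=1, keepdims=True)

empirical = theta.mean(axis=0)
theoretical = alpha / alpha.sum()  # [0.2, 0.3, 0.5]
```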
###### Lemma 15

Let , and .

Then for any partition of ,

\[ \theta(I) = \bigl(\theta(I[1]), \theta(I[2]), \dots, \theta(I[k])\bigr) \sim \mathrm{Dir}\bigl(\alpha(I[1]), \dots, \alpha(I[k])\bigr) \]

where .

{proof}

For any , , (a well-known result; see, for example, Weatherburn weatherburn_first_1949 ()), and for any partition of , . Therefore, as

\[ \theta(I[j]) = \sum_{i \in I[j]} \theta_i = \sum_{i \in I[j]} \frac{\gamma_i}{\gamma} = \frac{\gamma(I[j])}{\gamma}, \qquad j = 1, \dots, k \]

and , the result follows from Lemma 14.

###### Lemma 16

For any where ,

\[ \theta_{I[j]} = \left(\frac{\theta_i}{\theta(I[j])}\right)_{i \in I[j]} \sim \mathrm{Dir}\bigl((\alpha_i)_{i \in I[j]}\bigr) \]
{proof}
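Lemmas 15 and 16 state the aggregation and restriction properties of the Dirichlet distribution; both can likewise be checked numerically by simulation (the parameter choices and partition below are illustrative):

```python
import numpy as np

# Monte Carlo check of the aggregation (Lemma 15) and restriction (Lemma 16)
# properties.  With theta ~ Dir(1, 2, 3, 4) and the partition I[1] = {0, 1},
# I[2] = {2, 3}:
#   aggregation: (theta_0 + theta_1, theta_2 + theta_3) ~ Dir(3, 7)
#   restriction: theta_0 / (theta_0 + theta_1) ~ Beta(1, 2)
rng = np.random.default_rng(1)
alpha = np.array([1.0, 2.0, 3.0, 4.0])
theta = rng.dirichlet(alpha, size=200_000)

agg = theta[:, :2].sum(axis=1)                   # theta(I[1]); mean should be 3/10
restr = theta[:, 0] / theta[:, :2].sum(axis=1)   # mean should be 1/3
```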
###### Theorem 17

Let the rates of units along the root-to-leaf paths of an event tree have independent Gamma distributions with the same scale parameter, i.e.