Bayesian Structural Inference for Hidden Processes
Abstract
We introduce a Bayesian approach to discovering patterns in structurally complex processes. The proposed method of Bayesian Structural Inference (BSI) relies on a set of candidate unifilar hidden Markov model (uHMM) topologies for inference of process structure from a data series. We employ a recently developed exact enumeration of topological ε-machines. (A sequel then removes the topological restriction.) This subset of the uHMM topologies has the added benefit that inferred models are guaranteed to be ε-machines, irrespective of estimated transition probabilities. Properties of ε-machines and uHMMs allow for the derivation of analytic expressions for estimating transition probabilities, inferring start states, and comparing the posterior probability of candidate model topologies, despite process internal structure being only indirectly present in data. We demonstrate BSI’s effectiveness in estimating a process’s randomness, as reflected by the Shannon entropy rate, and its structure, as quantified by the statistical complexity. We also compare using the posterior distribution over candidate models and the single, maximum a posteriori model for point estimation and show that the former more accurately reflects uncertainty in estimated values. We apply BSI to in-class examples of finite- and infinite-order Markov processes, as well as to an out-of-class, infinite-state hidden process.
Keywords: stochastic process, hidden Markov model, ε-machine, causal states
pacs:
02.50.-r 89.70.+c 05.45.Tp 02.50.Ey 02.50.Ga
I Introduction
Emergent patterns are a hallmark of complex, adaptive behavior, whether exhibited by natural or designed systems. Practically, discovering and quantifying the structures making up emergent patterns from a sequence of observations lies at the heart of our ability to understand, predict, and control the world. But, what are the statistical signatures of structure? A common modeling assumption is that observations are independent and identically distributed (IID). This is tantamount, though, to assuming a system is structureless. And so, pattern discovery depends critically on testing when the IID assumption is violated. Said more directly, successful pattern discovery extracts the (typically hidden) mechanisms that create departures from IID structurelessness. In many applications, the search for structure is made all the more challenging by limited available data. The very real consequences, when pattern discovery is done incorrectly with finite data, are that structure can be mistaken for randomness and randomness for structure.
Here, we develop an approach to pattern discovery that removes these confusions, focusing on data series consisting of a sequence of symbols from a finite alphabet. That is, we wish to discover temporal patterns, as they occur in discrete-time and discrete-state time series. (The approach also applies to spatial data exhibiting one-dimensional patterns.) Inferring structure from data series of this type is integral to many fields of science, ranging from bioinformatics [1, 2], dynamical systems [3, 4, 5, 6], and linguistics [7, 8] to single-molecule spectroscopy [9, 10], neuroscience [11, 12], and crystallography [13, 14]. Inferred structure assumes a meaning distinctive to each field. For example, in single-molecule dynamics, structure reflects stable molecular configurations, as well as the rates and types of transition between them. In the study of coarse-grained dynamical systems and linguistics, structure often reflects forbidden words and relative frequencies of symbolic strings that make the language or dynamical system functional. Thus, the results of successful pattern discovery teach one much more about a process than models that are only highly predictive.
Our goal is to infer structure using a finite data sample from some process of interest and a set of candidate ε-machine model topologies. This choice of model class is made because ε-machines provide optimal prediction as well as being a minimal and unique representation [15]. In addition, given an ε-machine, structure and randomness can be quantified using the statistical complexity $C_\mu$ and Shannon entropy rate $h_\mu$. Previous efforts to infer ε-machines from finite data include subtree merging (SM) [16], ε-machine spectral reconstruction (MSR) [17], and causal-state splitting reconstruction (CSSR) [18, 19]. These methods produce a single best estimate of the appropriate ε-machine given the available data.
The following develops a distinctively different approach to the problem of structural inference: Bayesian Structural Inference (BSI). BSI requires a data series $D$ and a set of candidate unifilar hidden Markov model (uHMM) topologies, which we denote $\mathcal{M}$. However, for our present goal of introducing BSI, we consider only a subset of unifilar hidden Markov models, the topological ε-machines, that are guaranteed to be ε-machines irrespective of estimated transition probabilities [20]. Unlike the inference methods cited above, BSI’s output is not a single best estimate. Instead, BSI determines the posterior probability of each model topology conditioned on $D$ and $\mathcal{M}$. One result is that many model topologies are viable candidates for a given data set. The shorter the data series, the more prominent this effect becomes. We argue, in this light, that the most careful approach to structural inference and estimation is to use the complete set of model topologies, weighted according to their posterior probability. Another consequence, familiar in a Bayesian setting, is that principled estimates of uncertainty, including uncertainty in model topology, can be straightforwardly obtained from the posterior distribution.
The methods developed here draw from several fields, ranging from computational mechanics [15] and dynamical systems [21, 22, 23] to methods of Bayesian statistical inference [24]. As a result, elements of the following will be unfamiliar to some readers. To create a bridge, we provide an informal overview of foundational concepts in Sec. II before moving to BSI’s technical details in Sec. III.
II Process Structure, Model Topologies, and Finite Data
To start, we offer a nontechnical introduction to structural inference to be clear how we distinguish (i) a process and its inherent structure from (ii) model topology and these from (iii) sampled data series. A process represents all possible behaviors of a system of interest. It is the object of our focus. Saying that we infer structure means we want to find the process’s organization: the internal mechanisms that generate its observed behavior. However, in any empirical setting we only have samples of the process’s behavior in the form of finite data series. A data series necessarily provides an incomplete picture of the process due to the finite nature of the observation. Finally, we use a model or, more precisely, a model topology to express the process’s structure. The model topology (the set of states and transitions, their connections, and observed output symbols) explicitly represents the process’s structure. Typically, there are many model topologies that accurately describe the probabilistic structure of a given process. ε-Machines are special within the set of accurate models, however, in that they are the model topology that provides the unique and minimal representation of process structure.
To ground this further, let’s graphically survey different model topologies and consider what processes they represent and how they generate finite data samples. Figure 1 shows models with one or two states that generate binary processes, where observed behavior is a sequence of $0$s and $1$s. For example, the smallest model topology is shown in Fig. 1(a) and represents the IID binary process. This model generates data by starting in state $A$ and outputting a $0$ with probability $p(0|A)$ and a $1$ with probability $p(1|A) = 1 - p(0|A)$, always returning to state $A$.
A more complex model topology, shown in Fig. 1(g), has two states and four edges. In this case, when the model is in state $A$ it generates a $0$ with probability $p(0|A)$ and returns to state $A$, or it generates a $1$ with probability $p(1|A)$ and moves to state $B$. When in state $B$, a $0$ is generated with probability $p(0|B)$ and a $1$ with probability $p(1|B)$, moving to state $A$ in both cases. If $p(0|A) \neq p(0|B)$, this model topology represents a unique, structured process. However, if $p(0|A) = p(0|B)$, the probability of generating a $0$ or $1$ does not depend on states $A$ and $B$ and the resulting process is IID. Thus, this model topology with $p(0|A) = p(0|B)$ becomes an overly verbose representation of the IID process, which requires only a single state: the topology of Fig. 1(a). This setting of the transition probabilities is an example where a model topology describes the probabilistic behavior of a process, but does not reflect the structure. In fact, the model topology in Fig. 1(g) is not an ε-machine when $p(0|A) = p(0|B)$. Rather, the process structure is properly represented by Fig. 1(a), which is.
This example and other cases where specific model topologies are not minimal and unique representations of a process’s structure motivate identifying a subclass of model topologies. All model topologies in Fig. 1 are unifilar hidden Markov models (defined shortly). However, the six model topologies with two states and four edges, Fig. 1(g-i, l-n), are not minimal when $p(0|A) = p(0|B)$. As with the previous example, they all become overly complex representations of the IID process for this parameter setting. Excluding these uHMMs leaves a subset of topologies called topological ε-machines, Fig. 1(a-f, j-k), that are guaranteed to be minimal and unique representations of process structure for any setting of the transition probabilities other than $0$ or $1$. Partly to emphasize the role of process structure and partly to simplify technicalities, in this first introduction to BSI we only consider topological ε-machines. A sequel lifts this restriction, adapting BSI to work with all ε-machines.
In this way, we see how a process’s structure is expressed in model topology and how possible ambiguities arise. This is the forward problem of statistical inference. Now, consider the complementary inverse problem: Given an observed data series, find the model topology that most effectively describes the unknown process structure. In a Bayesian setting, the first step is to identify those model topologies that can generate the observed data. As just discussed, we do this by choosing a specific model topology and start state and attempting to trace the hidden-state path through the model, using the observed symbols to determine the edges to follow. If there is a path for at least one start state, the model topology is a viable candidate. This process is repeated for each model topology in a specified set, such as that displayed in Fig. 1. We call the procedure that lists and tests the model topologies in a set of candidates enumeration.
To clarify the procedure for tracing hidden-state paths, let’s consider a specific example of observed data consisting of the short binary sequence:
(1) 
If tested against each candidate in Fig. 1, eight of the fourteen model topologies are possible: (a, e, g-i, l-n). For example, using Fig. 1(i) and starting in state , the observed data is generated by the hidden-state path:
(2) 
One way to describe this path, one that is central to statistical estimation, is to count the number of times each edge in the model was traversed. Using $n(\sigma x | \sigma_0)$ to denote the number of times that symbol $x$ is generated using an edge from state $\sigma$ given that the sequence starts in state $\sigma_0$, we obtain: , , , and , again assuming the same start state. Similar paths and sets of edge counts are found for the eight viable topologies cited above. These counts are the basis for estimating a topology’s transition and start-state probabilities. From these, one can then calculate the probability that each model topology produced the observed data series, that is, each candidate’s posterior probability.
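The path-tracing and edge-counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the two-state topology, the state names `A` and `B`, the data string, and the function name `trace_path` are all hypothetical choices made for the example.

```python
from collections import Counter

# Hypothetical two-state unifilar topology: state -> {symbol: next_state}.
# Unifilarity means each (state, symbol) pair has at most one outgoing edge.
TOPOLOGY = {
    "A": {"0": "A", "1": "B"},
    "B": {"1": "A"},
}

def trace_path(topology, start, data):
    """Follow the unique hidden-state path for `data` from `start`.

    Returns (state_path, edge_counts), or None if some symbol has no
    matching edge, i.e. the data cannot be generated from this start state.
    """
    state = start
    path = [state]
    counts = Counter()
    for symbol in data:
        nxt = topology[state].get(symbol)
        if nxt is None:
            return None
        counts[(state, symbol)] += 1  # n(state symbol | start)
        state = nxt
        path.append(state)
    return path, counts

data = "0110111"  # illustrative data, not the sequence of Eq. (1)
for start in TOPOLOGY:
    print(start, trace_path(TOPOLOGY, start, data))
```

The topology is a viable candidate for the data if at least one start state returns a path; the state path is always one element longer than the data because it includes both the start and final states.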
By way of outlining what is to follow, let’s formalize the procedure just sketched in terms of the primary goal of estimating candidates’ posterior probabilities. First, Sec. III recapitulates what is known about the space of structured processes, reviewing how they are represented as ε-machines and how topological ε-machines are exactly enumerated. Then, Sec. IV adapts Bayesian inference methods to this model class, analyzing transition probability and start state estimation for a single, known topology. Next, setting the context for comparing model topologies, it explores the organization of the prior over the set of candidate models. Section IV closes with a discussion of how to estimate various process statistics from functions of model parameters. Finally, Sec. V applies BSI to a series of increasingly complex processes: (i) a finite-order Markov process, (ii) an infinite-order Markov process, and, finally, (iii) an infinite-memory process. Each illustrates BSI’s effectiveness by emphasizing its ability to accurately estimate a process’s stored information (statistical complexity $C_\mu$) and randomness (Shannon entropy rate $h_\mu$).
III Structured Processes
We describe a system of interest in terms of its observed behavior, following the approach of computational mechanics, as reviewed in [15]. Again, a process is the collection of behaviors that the system produces. A process’s probabilistic description is a bi-infinite chain of random variables, denoted by capital letters $\ldots X_{t-1} X_t X_{t+1} \ldots$. A realization is indicated by lowercase letters $\ldots x_{t-1} x_t x_{t+1} \ldots$. We assume the value $x_t$ belongs to a discrete alphabet $\mathcal{X}$. We work with blocks $X_{t:t'} = X_t \ldots X_{t'-1}$, where the first index is inclusive and the second exclusive.
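ε-Machines were originally defined in terms of prediction, in the so-called history formulation [16, 15]. Given a past realization $x_{-\infty:t} = \ldots x_{t-2} x_{t-1}$ and future random variables $X_{t:\infty} = X_t X_{t+1} \ldots$, the conditional distributions $\Pr(X_{t:\infty} | x_{-\infty:t})$ define the predictive equivalence relation $\sim$ over pasts:
$$x_{-\infty:t} \sim x'_{-\infty:t} \iff \Pr(X_{t:\infty} | x_{-\infty:t}) = \Pr(X_{t:\infty} | x'_{-\infty:t}) \tag{3}$$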
Within the history formulation, a process determines the ε-machine topology through $\sim$: The causal states are its equivalence classes and these, in turn, induce state-transition dynamics [15]. This way of connecting a process and its ε-machine influenced previous approaches to structural inference [16, 25, 19].
The ε-machine generator formulation, an alternative, was motivated by the problem of synchronization [26, 27]. There, an ε-machine topology defines the process that can be generated by it. Recently, the generator and history formulations were proven to be equivalent [28]. Although the history view is sometimes more intuitive, the generator view is useful in a variety of applications, especially the approach to structural inference developed here.
Following [26, 27, 28], we start with four definitions that delineate the model classes relevant for temporal pattern discovery.
Definition 1.
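A finite-state, edge-labeled hidden Markov model (HMM) consists of:
1. A finite set of hidden states $\mathcal{S} = \{\sigma_1, \ldots, \sigma_N\}$.
2. A finite output alphabet $\mathcal{X}$.
3. A set of $N \times N$ symbol-labeled transition matrices $T^{(x)}$, $x \in \mathcal{X}$, where $T^{(x)}_{ij}$ is the probability of transitioning from state $\sigma_i$ to state $\sigma_j$ and emitting symbol $x$. The corresponding overall state-to-state transition matrix is denoted $T = \sum_{x \in \mathcal{X}} T^{(x)}$.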
Definition 2.
A finite-state, edge-labeled, unifilar HMM (uHMM) is a finite-state, edge-labeled HMM with the following property:
1. Unifilarity: For each state $\sigma_i \in \mathcal{S}$ and each symbol $x \in \mathcal{X}$ there is at most one outgoing edge from state $\sigma_i$ that outputs symbol $x$.
Definition 3.
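A finite-state ε-machine is a uHMM with the following property:
1. Probabilistically distinct states: For each pair of distinct states $\sigma_i, \sigma_j \in \mathcal{S}$ there exists some finite word $w = x_0 x_1 \ldots x_{L-1}$ such that:
$$\Pr(w | \sigma_i) \neq \Pr(w | \sigma_j),$$
where $\Pr(w | \sigma)$ is the probability of generating $w$ when starting in state $\sigma$.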
Definition 4.
A topological ε-machine is a finite-state ε-machine where the transition probabilities for leaving each state are equal for all outgoing edges.
These definitions provide a hierarchy in the model topologies to be considered. The most general set (Def. 1) consists of finite-state, edge-labeled HMM topologies with few restrictions. These are similar to models employed in many machine learning and bioinformatics applications; see, e.g., [1]. Using Def. 2, the class of HMMs is further restricted to be unifilar. The inference methods developed here apply to all model topologies in this class, as well as all more restricted subclasses. As a point of reference, Fig. 1 shows all binary, full-alphabet (able to generate both $0$s and $1$s) uHMM topologies with one or two states. If all states in the model are probabilistically distinct, following Def. 3, these model topologies are also valid generator ε-machines. Whether a uHMM is also a valid ε-machine often depends on the specific transition probabilities for the machine; see Sec. II for an example. This dependence motivates the final restriction to topological ε-machines (Def. 4), which are guaranteed to be minimal even if transition probabilities are equal.
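As a small illustration of the unifilarity restriction in Def. 2, the following sketch checks whether an edge list has at most one outgoing edge per (state, symbol) pair. The edge-list representation, the function name `is_unifilar`, and both example models are hypothetical and chosen only for this example.

```python
from collections import defaultdict

def is_unifilar(edges):
    """Check Def. 2 for an iterable of (source, symbol, target, probability).

    A model is unifilar if every (source, symbol) pair with nonzero
    probability leads to at most one target state.
    """
    targets = defaultdict(set)
    for source, symbol, target, prob in edges:
        if prob > 0:
            targets[(source, symbol)].add(target)
    return all(len(dsts) <= 1 for dsts in targets.values())

# Hypothetical unifilar model: each symbol fixes the next state.
unifilar_edges = [
    ("A", "0", "A", 0.5),
    ("A", "1", "B", 0.5),
    ("B", "1", "A", 1.0),
]

# Hypothetical non-unifilar model: from A, symbol "1" can lead to A or B.
nonunifilar_edges = [
    ("A", "1", "A", 0.5),
    ("A", "1", "B", 0.5),
    ("B", "0", "A", 1.0),
]

print(is_unifilar(unifilar_edges), is_unifilar(nonunifilar_edges))
```

Checking Def. 3 (probabilistically distinct states) depends on the transition probabilities as well as the graph, so it is not covered by this purely topological test.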
Here, we employ the set of topological ε-machines for structural inference. Although specific settings of the transition probabilities are used to define the set of allowed model topologies, this does not affect the actual inference procedure. For example, in Fig. 1 only (a-f, j-k) are topological ε-machines. However, the set of topological ε-machines does exclude a variety of model topologies that might be useful for general time-series inference. For example, when Def. 4 is applied, all processes with full support (all words allowed) reduce to a single-state model. However, broadening the class of topologies beyond the set considered here is straightforward and so we address extending the present methods to them in a sequel. The net result emphasizes structure arising from the distribution’s support and guarantees that inferred models can be interpreted as valid ε-machines. And, the goal is to present BSI’s essential ideas for one class of structured processes: the topological ε-machines.
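Table 1: Number of binary, full-alphabet topological ε-machines as a function of the number of states.
States | Topological ε-machines
1 | 1
2 | 7
3 | 78
4 | 1,388
5 | 35,186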
The set of topological ε-machines can be exactly and efficiently enumerated [20], motivating the use of this model class as our first example application of BSI. Table 1 lists the number of full-alphabet topologies with one to five states and a binary alphabet. Compare this table with the model topologies in Fig. 1, where all one- and two-state uHMMs are shown. Only Fig. 1(a-f, j-k) are topological ε-machines, accounting for the difference between the eight one- and two-state models in the table above and the fourteen in Fig. 1. For comparison, the library has been enumerated up to eight states, containing approximately distinct topologies. However, for the examples to follow we employ all binary model topologies up to and including five states as the candidate basis for structural inference.
IV Bayesian Inference
Previously, we developed methods for $k$th-order Markov chains to infer models of discrete stochastic processes and coarse-grained continuous chaotic dynamical systems [29, 6]. There, we demonstrated that correct models for in-class data sources could be effectively and parsimoniously estimated. In addition, we showed that the hidden-state nature of out-of-class data sources could be extracted via model comparison between Markov orders as a function of data series length. Notably, we also found that the entropy rate can be accurately estimated, even when out-of-class data was considered.
The following extends the Markov chain methods to the topologically richer model class of unifilar hidden Markov models. The starting point depends on the unifilar nature of the HMM topologies considered here (Def. 2): each transition from a state is uniquely identified by its emitted symbol, which also fixes the destination state. As we demonstrated in Sec. II, unifilarity also means that, given an assumed start state, an observed data series corresponds to at most one path through the hidden states. This direct connection between observed data and hidden-state paths is not available in the more general class of HMMs (Def. 1) because they can have many, often exponentially many, possible hidden paths for a single observed data series. In contrast, as a result of unifilarity, our analytic methods previously developed for “nonhidden” Markov chains [29] can be applied to infer uHMMs and ε-machines by adding a latent (hidden) variable for the unknown start state. We note in passing that for the more general class of HMMs, including nonunifilar topologies, there are two approaches to statistical inference. The first is to convert them to a uHMM (if possible), using mixed states [30, 31]. The second is to use more conventional computational methods, such as Baum-Welch [32].
Setting aside these alternatives for now, we formalize the connection between observed data series and a candidate uHMM topology discussed in Sec. II. We assume that a data series $x_{0:T} = x_0 x_1 \ldots x_{T-1}$ of length $T$ has been obtained from the process of interest, with $x_t$ taking values in a discrete alphabet $\mathcal{X}$. When a specific model topology and start state are assumed, a hidden-state sequence corresponding to the observed data can sometimes, but not always, be found. We denote a hidden state at time $t$ as $\sigma_t$ and a hidden-state sequence corresponding to $x_{0:T}$ as $\sigma_{0:T+1} = \sigma_0 \sigma_1 \ldots \sigma_T$. Note that the state sequence is longer than the observed data series since the start and final states are included. Using this notation, an observed symbol $x_t$ is emitted when transitioning from state $\sigma_t$ to $\sigma_{t+1}$. For example, using the observed data in Eq. (1), a hidden-state path corresponding to Eq. (2) can be obtained by assuming topology Fig. 1(i) and the same start state as in that example.
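We can now write out the probability of an observed data series. We assume a stationary uHMM topology $M_i$ with a set of hidden states $\sigma \in \mathcal{S}_i$. We add the subscript $i$ to make it clear that we are analyzing a set of distinct, enumerated model topologies. As demonstrated in the example from Sec. II, edge counts $n(\sigma x | \sigma_{i,0})$ are obtained by tracing the hidden-state path given an assumed start state $\sigma_{i,0}$. Putting this all together, the probability of observed data and corresponding state path is:
$$\Pr(\sigma_{0:T+1}, x_{0:T} | M_i) = p(\sigma_{i,0}) \prod_{\sigma \in \mathcal{S}_i} \prod_{x \in \mathcal{X}} p(x|\sigma)^{n(\sigma x | \sigma_{i,0})} \tag{4}$$
A slight manipulation of Eq. (4) lets us write the probability of observed data and hidden dynamics, given an assumed start state $\sigma_{i,0}$, as:
$$\Pr(\sigma_{0:T+1}, x_{0:T} | \sigma_{i,0}, M_i) = \prod_{\sigma \in \mathcal{S}_i} \prod_{x \in \mathcal{X}} p(x|\sigma)^{n(\sigma x | \sigma_{i,0})} \tag{5}$$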
The development of Eq. (5) and the simple example provided in Sec. II lay the groundwork for our application of Bayesian methods. That is, given topology $M_i$ and start state $\sigma_{i,0}$, the probability of observed data and hidden dynamics can be calculated. For the purposes of inference, the combination of observed and hidden sequences is our data $D = (x_{0:T}, \sigma_{0:T+1})$.
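To make the product form of Eq. (5) concrete, the sketch below evaluates the log-likelihood of a traced path from its edge counts. The counts, the transition probabilities, and the function name `log2_likelihood` are hypothetical values chosen for illustration, not quantities from the paper.

```python
import math

def log2_likelihood(edge_counts, probs):
    """Sum n(state, symbol) * log2 p(symbol | state) over traversed edges.

    Returns -inf if a traversed edge has probability zero.
    """
    total = 0.0
    for edge, n in edge_counts.items():
        if n == 0:
            continue
        p = probs.get(edge, 0.0)
        if p == 0.0:
            return float("-inf")
        total += n * math.log2(p)
    return total

# Hypothetical edge counts for an assumed start state, keyed by (state, symbol).
edge_counts = {("A", "0"): 2, ("A", "1"): 3, ("B", "1"): 2}
# Hypothetical transition probabilities p(symbol | state).
probs = {("A", "0"): 0.5, ("A", "1"): 0.5, ("B", "1"): 1.0}

ll = log2_likelihood(edge_counts, probs)
print(ll, 2.0 ** ll)
```

Edges with probability one, such as the single edge out of `B` here, contribute nothing to the sum, which is why only states with more than one outgoing edge carry parameters to infer.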
IV.1 Inferring Transition Probabilities
The first step is to infer transition probabilities for a single uHMM or topological ε-machine $M_i$. As noted above, we must assume a start state $\sigma_{i,0}$ so that edge counts $n(\sigma x | \sigma_{i,0})$ can be obtained from the data. This requirement means that the inferred transition probabilities also depend on the assumed start state. At a later stage, when comparing model topologies, we demonstrate that the uncertainty in start state can be averaged over.
The set of parameters to estimate consists of those transition probabilities defined to be neither one nor zero by the assumed topology: $\theta_i = \{ p(x|\sigma) : \sigma \in \mathcal{S}_i^*, x \in \mathcal{X} \}$, where $\mathcal{S}_i^* \subseteq \mathcal{S}_i$ is the subset of hidden states with more than one outgoing edge. The resulting likelihood follows directly from Eq. (5):
$$\Pr(D | \theta_i, \sigma_{i,0}, M_i) = \prod_{\sigma \in \mathcal{S}_i} \prod_{x \in \mathcal{X}} p(x|\sigma)^{n(\sigma x | \sigma_{i,0})} \tag{6}$$
We note that the transition probabilities used in the above expression are unknown when doing statistical inference. However, we can still write the probability of the observed data given a setting for these unknown values, as indicated by the notation for the likelihood: $\Pr(D | \theta_i, \sigma_{i,0}, M_i)$. Although not made explicit above, the likelihood can also vanish for some, or all, start states if the observed data is not compatible with the topology. For example, if we attempt to use Fig. 1(d) for the observed data in Eq. (1), we find that neither start state leads to a viable path for the observed data, resulting in zero likelihood.
For later use, we denote the number of times a hidden state $\sigma$ is visited by $n(\sigma \bullet | \sigma_{i,0}) = \sum_{x \in \mathcal{X}} n(\sigma x | \sigma_{i,0})$.
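Equation (6) exposes the Markov nature of the dynamics on the hidden states and suggests adapting the methods we previously developed for Markov chains [29]. Said simply, states that corresponded there to histories of length $k$ for Markov chain models are replaced by a hidden state $\sigma \in \mathcal{S}_i$. Mirroring the earlier approach, we employ a conjugate prior for transition probabilities. This choice means that the posterior distribution has the same form as the prior, but with modified parameters. In the present case, the conjugate prior is a product of Dirichlet distributions:
$$\Pr(\theta_i | \sigma_{i,0}, M_i) = \prod_{\sigma \in \mathcal{S}_i^*} \left\{ \frac{\Gamma(\alpha(\sigma \bullet | \sigma_{i,0}))}{\prod_{x \in \mathcal{X}} \Gamma(\alpha(\sigma x | \sigma_{i,0}))} \, \delta\Big(1 - \sum_{x \in \mathcal{X}} p(x|\sigma)\Big) \prod_{x \in \mathcal{X}} p(x|\sigma)^{\alpha(\sigma x | \sigma_{i,0}) - 1} \right\} \tag{7}$$
where $\alpha(\sigma \bullet | \sigma_{i,0}) = \sum_{x \in \mathcal{X}} \alpha(\sigma x | \sigma_{i,0})$. In the examples to follow we take $\alpha(\sigma x | \sigma_{i,0}) = 1$ for all parameters of the prior. This results in a uniform density over the simplex for all transition probabilities to be inferred, irrespective of start state [33].
The product of Dirichlet distributions includes transition probabilities only from hidden states in $\mathcal{S}_i^*$ because these states have more than one outgoing edge. For transition probabilities from states $\sigma \notin \mathcal{S}_i^*$ there is no need for an explicit prior because the transition probability must be zero or one by definition of the uHMM topology. As a result, the prior expectation for transition probabilities is:
$$\mathbb{E}_{\mathrm{prior}}[p(x|\sigma)] = \frac{\alpha(\sigma x | \sigma_{i,0})}{\alpha(\sigma \bullet | \sigma_{i,0})} \tag{8}$$
for states $\sigma \in \mathcal{S}_i^*$.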
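Next, we employ Bayes’ Theorem to obtain the posterior distribution for the transition probabilities given data and prior assumptions. In this context, it takes the form:
$$\Pr(\theta_i | D, \sigma_{i,0}, M_i) = \frac{\Pr(D | \theta_i, \sigma_{i,0}, M_i) \, \Pr(\theta_i | \sigma_{i,0}, M_i)}{\Pr(D | \sigma_{i,0}, M_i)} \tag{9}$$
The terms in the numerator are already specified above as the likelihood and the prior, Eqs. (6) and (7), respectively.
The normalization factor in Eq. (9) is called the evidence, or marginal likelihood. This term integrates the product of the likelihood and prior with respect to the set of transition probabilities $\theta_i$:
$$\Pr(D | \sigma_{i,0}, M_i) = \int d\theta_i \, \Pr(D | \theta_i, \sigma_{i,0}, M_i) \, \Pr(\theta_i | \sigma_{i,0}, M_i) = \prod_{\sigma \in \mathcal{S}_i^*} \left\{ \frac{\Gamma(\alpha(\sigma \bullet | \sigma_{i,0}))}{\prod_{x} \Gamma(\alpha(\sigma x | \sigma_{i,0}))} \, \frac{\prod_{x} \Gamma(\alpha(\sigma x | \sigma_{i,0}) + n(\sigma x | \sigma_{i,0}))}{\Gamma(\alpha(\sigma \bullet | \sigma_{i,0}) + n(\sigma \bullet | \sigma_{i,0}))} \right\} \tag{10}$$
resulting in the average of the likelihood with respect to the prior. In addition to normalizing the posterior distribution (Eq. (9)), the evidence is important in our subsequent applications of Bayes’ Theorem. In particular, the quantity $\Pr(D | \sigma_{i,0}, M_i)$ is central to the model selection to follow and is used to (i) determine the start state given the model and (ii) compare model topologies.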
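For a product of Dirichlet priors and a likelihood that is a product of powers of the transition probabilities, the evidence integral has the standard closed form in terms of Gamma functions. The following sketch evaluates it in log space with `math.lgamma`. The function name `log_evidence`, the `branching` dictionary, and the counts are hypothetical and serve only to illustrate the calculation.

```python
import math

def log_evidence(edge_counts, branching, alpha=1.0):
    """Log marginal likelihood for one topology and one assumed start state.

    edge_counts: {(state, symbol): count} from the traced hidden path.
    branching:   {state: [symbols]} for states with more than one outgoing
                 edge; single-edge states contribute a factor of one.
    alpha:       common Dirichlet parameter (1.0 gives a uniform prior).
    """
    total = 0.0
    for state, symbols in branching.items():
        alpha_state = alpha * len(symbols)
        n_state = sum(edge_counts.get((state, x), 0) for x in symbols)
        total += math.lgamma(alpha_state) - math.lgamma(alpha_state + n_state)
        for x in symbols:
            n = edge_counts.get((state, x), 0)
            total += math.lgamma(alpha + n) - math.lgamma(alpha)
    return total

# Hypothetical counts: state A has two outgoing edges, state B only one.
edge_counts = {("A", "0"): 2, ("A", "1"): 3, ("B", "1"): 2}
branching = {"A": ["0", "1"]}

log_ev = log_evidence(edge_counts, branching)
print(log_ev, math.exp(log_ev))
```

With a uniform prior and counts (2, 3) this reduces to the Beta integral of $p^2 (1-p)^3$, which equals $1/60$. Working with logarithms avoids underflow when data series are long.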
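As discussed above, conjugate priors result in a posterior distribution of the same form, with prior parameters modified by observed counts:
$$\Pr(\theta_i | D, \sigma_{i,0}, M_i) = \prod_{\sigma \in \mathcal{S}_i^*} \left\{ \frac{\Gamma(\alpha(\sigma \bullet | \sigma_{i,0}) + n(\sigma \bullet | \sigma_{i,0}))}{\prod_{x} \Gamma(\alpha(\sigma x | \sigma_{i,0}) + n(\sigma x | \sigma_{i,0}))} \, \delta\Big(1 - \sum_{x} p(x|\sigma)\Big) \prod_{x} p(x|\sigma)^{\alpha(\sigma x | \sigma_{i,0}) + n(\sigma x | \sigma_{i,0}) - 1} \right\} \tag{11}$$
Comparing Eqs. (7) and (11), prior and posterior, respectively, shows that the distributions are very similar: $\alpha(\sigma x | \sigma_{i,0})$ (prior only) is replaced by $\alpha(\sigma x | \sigma_{i,0}) + n(\sigma x | \sigma_{i,0})$ (prior plus data). Thus, one can immediately write down the posterior mean for the transition probabilities:
$$\mathbb{E}_{\mathrm{post}}[p(x|\sigma)] = \frac{\alpha(\sigma x | \sigma_{i,0}) + n(\sigma x | \sigma_{i,0})}{\alpha(\sigma \bullet | \sigma_{i,0}) + n(\sigma \bullet | \sigma_{i,0})} \tag{12}$$
for states $\sigma \in \mathcal{S}_i^*$. As with the prior, probabilities for transitions from states $\sigma \notin \mathcal{S}_i^*$ are zero or one, as defined by the model topology.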
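The posterior mean is a simple ratio of prior parameters plus counts, which the following sketch computes for every branching state. The function name `posterior_mean`, the `branching` dictionary, and the counts are hypothetical, chosen only to show the arithmetic.

```python
def posterior_mean(edge_counts, branching, alpha=1.0):
    """Posterior mean of p(symbol | state) under a Dirichlet(alpha) prior.

    branching: {state: [symbols]} for states with more than one outgoing edge.
    Returns {(state, symbol): mean}.
    """
    means = {}
    for state, symbols in branching.items():
        denom = sum(alpha + edge_counts.get((state, x), 0) for x in symbols)
        for x in symbols:
            means[(state, x)] = (alpha + edge_counts.get((state, x), 0)) / denom
    return means

# Hypothetical counts for an assumed start state.
edge_counts = {("A", "0"): 2, ("A", "1"): 3, ("B", "1"): 2}
branching = {"A": ["0", "1"]}

means = posterior_mean(edge_counts, branching)
print(means)  # p(0|A) = 3/7 and p(1|A) = 4/7 for these counts
```

With no data the same function returns the prior expectation, here 1/2 for each of the two edges.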
Notably, the posterior mean for the transition probabilities does not completely specify our knowledge since the uncertainty, reflected in functions of the posterior’s higher moments, can be large. These moments are available elsewhere [33]. However, using methods detailed below, we employ sampling from the posterior at this level, as well as other inference levels, to capture estimation uncertainty.
IV.2 Inferring Start States
The next task is to calculate the probabilities for each start state given a proposed ε-machine topology and observed data. Although we are not typically interested in the actual start state, introducing this latent variable is necessary to develop the previous section’s analytic methods. And, in any case, another level of Bayes’ Theorem allows us to average over uncertainty in start state to obtain the probability of observed data for the topology, independent of start state.
We begin with the evidence derived in Eq. (10) to estimate transition probabilities. When determining the start state, the evidence (marginal likelihood) from inferring transition probabilities becomes the likelihood for start state estimation. As before, we apply Bayes’ Theorem, this time with unknown start states, instead of unknown transition probabilities:
$$\Pr(\sigma_{i,0} | D, M_i) = \frac{\Pr(D | \sigma_{i,0}, M_i) \, \Pr(\sigma_{i,0} | M_i)}{\Pr(D | M_i)} \tag{13}$$
This calculation requires defining a prior over start states $\Pr(\sigma_{i,0} | M_i)$. In practice, setting start states as equally probable a priori is a sensible choice in light of the larger goal of structural inference. The normalization $\Pr(D | M_i)$, or evidence, at this level follows by averaging over the uncertainty in $\sigma_{i,0}$:
$$\Pr(D | M_i) = \sum_{\sigma_{i,0} \in \mathcal{S}_i} \Pr(D | \sigma_{i,0}, M_i) \, \Pr(\sigma_{i,0} | M_i) \tag{14}$$
The result of this calculation no longer explicitly depends on start states or transition probabilities. The uncertainty created by these unknowns has been averaged over, producing a very useful quantity for comparing different topologies: $\Pr(D | M_i)$. However, one must not forget that inferring transition and start state probabilities underlies the structural comparisons to follow. In particular, the priors set at the levels of transition probabilities and start states can impact the structures detected due to the hierarchical nature of the inference: transition probabilities, then start states, then model topology.
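The start-state level can be sketched as follows, assuming a uniform prior over start states as suggested above. The function name `start_state_posterior` and the log-evidence values are hypothetical. An impossible start state is represented by a log evidence of negative infinity.

```python
import math

NEG_INF = float("-inf")

def start_state_posterior(log_ev):
    """Posterior over start states and log Pr(D | M) under a uniform prior.

    log_ev: {start_state: log Pr(D | start_state, M)}, with -inf when the
            data cannot be generated from that start state.
    Returns (posterior, log_evidence_for_topology).
    """
    finite = [v for v in log_ev.values() if v != NEG_INF]
    if not finite:
        # No start state can generate the data: the topology is not viable.
        return {s: 0.0 for s in log_ev}, NEG_INF
    top = max(finite)  # subtract the maximum for numerical stability
    weights = {s: (math.exp(v - top) if v != NEG_INF else 0.0)
               for s, v in log_ev.items()}
    z = sum(weights.values())
    posterior = {s: w / z for s, w in weights.items()}
    # Uniform prior 1/|S| over start states enters the evidence as -log|S|.
    log_ev_topology = top + math.log(z) - math.log(len(log_ev))
    return posterior, log_ev_topology

# Hypothetical values: only start state A can generate the data.
log_ev = {"A": math.log(1.0 / 60.0), "B": NEG_INF}
posterior, log_ev_topology = start_state_posterior(log_ev)
print(posterior, math.exp(log_ev_topology))
```

The uniform prior cancels in the posterior over start states but not in the topology-level evidence, which is why the number of states appears in the last line of the function.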
IV.3 Inferring Model Topology
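So far, we inferred transition probabilities and start states for a given model topology. Now, we are ready to compare different topologies in a set $\mathcal{M}$ of candidate models. As with inferring start states given a topology, we write down yet another version of Bayes’ Theorem, this time for model topology:
$$\Pr(M_i | D, \mathcal{M}) = \frac{\Pr(D | M_i, \mathcal{M}) \, \Pr(M_i | \mathcal{M})}{\Pr(D | \mathcal{M})} \tag{15}$$
writing the likelihood as $\Pr(D | M_i, \mathcal{M})$ to make the nature of the conditional distributions clear. This is exactly the same, however, as the evidence derived above in Eq. (14): $\Pr(D | M_i, \mathcal{M}) = \Pr(D | M_i)$. Equality holds because nothing in calculating the previous evidence term directly depends on the set of models considered. The evidence $\Pr(D | \mathcal{M})$, or normalization term, in Eq. (15) has the general form:
$$\Pr(D | \mathcal{M}) = \sum_{M_j \in \mathcal{M}} \Pr(D | M_j, \mathcal{M}) \, \Pr(M_j | \mathcal{M}) \tag{16}$$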
To apply Eq. (15) we must first provide an explicit prior over model topologies. One general form, tuned by a single parameter $\beta$, is:
$$\Pr(M_i | \mathcal{M}) = \frac{\exp(-\beta \, \phi(M_i))}{\sum_{M_j \in \mathcal{M}} \exp(-\beta \, \phi(M_j))} \tag{17}$$
where $\phi(M_i)$ is some desired function of model topology. In the examples to follow we use the number of causal states, $\phi(M_i) = |M_i|$, thereby penalizing for model size. This is particularly important when a short data series is being investigated. However, setting $\beta = 0$ removes the penalty, making all models in $\mathcal{M}$ a priori equally likely. It is important to investigate the effects of choosing a specific $\beta$ for a given set of candidate topologies. Below, we first demonstrate the effect of three choices of $\beta$. After that, however, we fix $\beta$ at the value that, in combination with the set of one- to five-state binary-alphabet topological ε-machines, produces a preference for one- and two-state machines for short data series and still allows for inferring larger machines with only a few thousand symbols. Experience with this $\beta$ shows that it is structurally conservative.
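The topology-level calculation can be sketched as below, assuming a size-penalizing prior proportional to $\exp(-\beta |M_i|)$ over a small candidate set. The function name `topology_posterior`, the model labels, the log evidences, and the two values of `beta` are hypothetical and are not the settings used in the paper. At least one candidate is assumed to have finite log evidence.

```python
import math

def topology_posterior(log_ev, sizes, beta):
    """Posterior over candidate topologies.

    log_ev: {model: log Pr(D | model)} (start states already averaged over).
    sizes:  {model: number of states}, used as the penalty function.
    beta:   penalty strength; beta = 0 gives a uniform prior.
    The prior's normalization constant cancels in the posterior, so only
    the unnormalized log weights log_ev - beta * size are needed.
    """
    log_w = {m: log_ev[m] - beta * sizes[m] for m in log_ev}
    top = max(log_w.values())
    weights = {m: math.exp(v - top) for m, v in log_w.items()}
    z = sum(weights.values())
    return {m: w / z for m, w in weights.items()}

# Hypothetical candidates: a one-state and a two-state topology.
log_ev = {"M1": -70.0, "M2": -68.0}
sizes = {"M1": 1, "M2": 2}

flat = topology_posterior(log_ev, sizes, beta=0.0)
penalized = topology_posterior(log_ev, sizes, beta=4.0)
print(flat, penalized)
map_model = max(penalized, key=penalized.get)
print(map_model)
```

In this toy case the unpenalized posterior prefers the larger model, while the penalized one prefers the smaller model, which illustrates why the effect of the penalty should be examined for a given candidate set and data length.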
In the examples we explore two approaches to using the results of structural inference. The first takes into account all model topologies in the set considered, weighted according to the posterior distribution given in Eq. (15). The second selects a single model that is the maximum a posteriori (MAP) topology:
$$M_{\mathrm{MAP}} = \arg\max_{M_i \in \mathcal{M}} \Pr(M_i | D, \mathcal{M}) \tag{18}$$
The difference between these methods is most dramatic for short data series. Also, using the MAP topology often underestimates the uncertainty in functions of the model parameters, which we discuss shortly. Of course, since one throws away any number of comparable models, estimating uncertainty in any quantity that explicitly depends on the model topology cannot be done properly if MAP selection is employed. However, we expect some will want or need to use a single model topology, so we consider both methods.
IV.4 Estimating Functions of Model Parameters
A primary goal in inference is estimating functions that depend on an inferred model’s parameters. We denote this $f(\theta_i)$ to indicate the dependence on transition probabilities. Unfortunately, substituting the posterior mean for the transition probabilities into some function of interest does not provide the desired expectation. In general, obtaining analytic expressions for the posterior mean of desired functions is quite difficult; see, for example, [34, 35]. Deriving expressions for the uncertainty in the resulting estimates is equally involved and typically not done; although see [34].
Above, the inference method required inferring transition probabilities, start state, and topology. Function estimation, as a result, should take into account all these sources of uncertainty. Instead of deriving analytic expressions for posterior means (if possible), we turn to numerical methods to estimate function means and uncertainties in great detail. We do this by repeatedly sampling from the posterior distribution at each level to obtain a sample ε-machine and evaluating the function of interest for the sampled parameter values. The algorithms in Fig. 2 detail the process of sampling using all candidate models (Algorithm 1) or the single MAP model (Algorithm 2). Given a set of samples of the function of interest, any summary statistic can be employed. In the examples, we generate samples from which we estimate a variety of properties. More specifically, these samples are employed to estimate the posterior mean and the 95%, equal-tailed, credible interval (CI) [24]. This means there is a 5% probability of samples being outside the specified interval, with equal probability of being above or below the interval. Finally, a Gaussian kernel density estimation (Gkde) is used to visualize the posterior density for the functions of interest.
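The sampling idea can be illustrated at its innermost level only: drawing transition probabilities from a Dirichlet posterior for a fixed topology and start state, evaluating a function on each draw, and summarizing the samples. This is a simplified sketch and not a reproduction of Algorithms 1 or 2, which also sample start states and (for Algorithm 1) topologies. The function names, the posterior parameters, the choice of function (the binary entropy of one state's outgoing distribution), the sample count, and the seed are all hypothetical.

```python
import math
import random

def sample_dirichlet(params, rng):
    """Draw one sample from Dirichlet(params) via normalized Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]

def binary_entropy(p):
    """Entropy in bits of the distribution (p, 1 - p)."""
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0.0)

def summarize(samples, level=0.95):
    """Posterior mean and equal-tailed credible interval from samples."""
    ordered = sorted(samples)
    tail = (1.0 - level) / 2.0
    lo = ordered[int(tail * (len(ordered) - 1))]
    hi = ordered[int((1.0 - tail) * (len(ordered) - 1))]
    return sum(ordered) / len(ordered), lo, hi

rng = random.Random(0)              # fixed seed so the sketch is repeatable
posterior_params = [1 + 2, 1 + 3]   # alpha + counts for two edges out of a state
samples = [binary_entropy(sample_dirichlet(posterior_params, rng)[0])
           for _ in range(5000)]
mean, lo, hi = summarize(samples)
print(mean, lo, hi)
```

Note that the mean of the sampled function values generally differs from the function evaluated at the posterior mean of the parameters, which is the reason for sampling in the first place.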
The examples demonstrate estimating process randomness and structure from data series using the two algorithms introduced above. For a known ε-machine topology with specified transition probabilities, these properties are quantified using the entropy rate $h_\mu$ and statistical complexity $C_\mu$, respectively. The entropy rate is:
(19) $h_\mu = -\sum_{\sigma \in \mathcal{S}} p(\sigma) \sum_{x \in \mathcal{X}} p(x \mid \sigma) \log_2 p(x \mid \sigma)$
and the statistical complexity is:
(20) $C_\mu = -\sum_{\sigma \in \mathcal{S}} p(\sigma) \log_2 p(\sigma)$
In these expressions, $p(x \mid \sigma)$ is the probability of emitting symbol $x$ from state $\sigma$, and the $p(\sigma)$ are the asymptotic state probabilities determined by the left eigenvector (normalized in probability) of the internal Markov chain transition matrix. Of course, $h_\mu$ and $C_\mu$ are also functions of the model topology and transition probabilities, so these quantities provide good examples of how to estimate functions of model parameters in general.
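As a minimal sketch of how the entropy rate and statistical complexity can be evaluated for a unifilar model, the code below computes both for a two-state machine with the Even Process topology. The state labels `A` and `B`, the dictionary layout, and the transition probability 0.5 are assumptions made for illustration. The stationary distribution is found by iterating a "lazy" version of the state-to-state chain, which has the same stationary distribution as the original chain.

```python
import math


def stationary_distribution(machine, iterations=500):
    """Asymptotic state probabilities via iteration of the lazy chain (I + T) / 2."""
    states = sorted(machine)
    pi = {s: 1.0 / len(states) for s in states}
    for _ in range(iterations):
        nxt = {s: 0.5 * pi[s] for s in states}
        for s in states:
            for target, prob in machine[s].values():
                nxt[target] += 0.5 * pi[s] * prob
        pi = nxt
    return pi


def entropy_rate(machine):
    """State-averaged uncertainty (bits per symbol) of the next emitted symbol."""
    pi = stationary_distribution(machine)
    total = 0.0
    for s, edges in machine.items():
        for _target, prob in edges.values():
            if prob > 0.0:
                total -= pi[s] * prob * math.log2(prob)
    return total


def statistical_complexity(machine):
    """Shannon entropy (bits) of the asymptotic state distribution."""
    pi = stationary_distribution(machine)
    return -sum(p * math.log2(p) for p in pi.values() if p > 0.0)


# Even Process topology with an assumed transition probability of 0.5.
# Layout: state -> {symbol: (next_state, probability)}.
even = {
    "A": {"0": ("A", 0.5), "1": ("B", 0.5)},
    "B": {"1": ("A", 1.0)},
}
h_mu = entropy_rate(even)
c_mu = statistical_complexity(even)
print(f"h_mu = {h_mu:.4f} bits/symbol, C_mu = {c_mu:.4f} bits")
```

For this choice of parameters the stationary probabilities are 2/3 and 1/3, so the entropy rate is 2/3 bit per symbol and the statistical complexity is log2(3) - 2/3 bits.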
V Examples
We divide the examples into two parts. First, we demonstrate inferring transition probabilities and start states for a known topology. Second, we focus on inferring ε-machine topology using the set of all binary, one- to five-state topological ε-machines; see Table 1. We use the convergence of estimates for the information-theoretic values $h_\mu$ and $C_\mu$ to monitor structure discovery. However, estimating model parameters is at the core of the later examples and so we start with this procedure.
For each example we generate a single data series of length . When analyzing convergence, we consider subsamples of lengths , using . For example, a four-symbol sequence starting at the first data point is designated . The overlapping analysis of a single data series gives insight into convergence for the inferred models and for the statistics estimated.
V.1 Estimating Parameters
V.1.1 Even Process
We first explore a single example of inferring properties of a known data source using Eqs. (6)(IV.1). We generate a data series from the Even Process and then, using the correct topology (Fig. 3), we infer start states and transition probabilities and estimate the entropy rate and statistical complexity. We do not concentrate on this level of inference in subsequent examples, preferring to focus instead on model topology and its representation of the unknown process structure. Nonetheless, the procedure detailed here underlies all of the examples.
The Even Process is notable because it has infinite Markov order. This means no finite-order Markov chain can reproduce its word distribution [29]. It can be finitely modeled, though, with a finite-state unifilar HMM: the ε-machine of Fig. 3. A single data series was generated using the Even Process ε-machine with . The start state was randomized before generating sequence data of length . As it turned out, the initial segment was , indicating that the unknown start state was on that realization. This is so because the first symbol is a , which can be generated starting in either state or , but the sequence is only possible by starting at node .
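The start-state argument above relies on unifilarity: an assumed start state and an observed word either fix a unique state path or are incompatible. A minimal sketch follows, assuming the usual two-state Even Process topology with hypothetical state labels `A` and `B`. The test words are illustrative and are not the initial segment of the data series analyzed here.

```python
# Assumed Even Process topology: state -> {symbol: next_state}.
EVEN_TOPOLOGY = {
    "A": {"0": "A", "1": "B"},
    "B": {"1": "A"},
}


def follow_path(topology, start, data):
    """Return the unique state path for `data` from `start`, or None if disallowed."""
    state = start
    path = [state]
    for symbol in data:
        if symbol not in topology[state]:
            return None          # the word cannot be generated from this start state
        state = topology[state][symbol]
        path.append(state)
    return path


def feasible_start_states(topology, data):
    """Start states from which the observed word can be generated."""
    return [s for s in sorted(topology)
            if follow_path(topology, s, data) is not None]


for word in ["1", "10", "110"]:
    print(word, feasible_start_states(EVEN_TOPOLOGY, word))
```

A single leading 1 is compatible with both start states, while slightly longer words can rule one of them out.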
Next, we estimate the transitions from the generated data series using length subsamples to track convergence. Although the mean and other moments of the Dirichlet posterior can be calculated analytically [33], we sample values using Algorithm 2 in Fig. 2. However, in this example we employ instead of because we are focused on the model parameters for a known topology. The posterior density for each subsample is plotted in Fig. 4 using Gaussian kernel density estimation (Gkde). The true value of is shown as a black, dashed line and the posterior mean as a solid, gray line. (Both lines connect values evaluated at each length .) The convergence of the posterior density to the correct value of with increasing data size is clear and, moreover, the true value is always in a region of positive probability.
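To make the parameter-estimation step concrete, here is a minimal sketch under stated assumptions: the Even Process topology with hypothetical state labels `A` and `B`, a known start state, and a flat Dirichlet prior (all hyperparameters equal to one). Under those assumptions the posterior for the probability of emitting 0 from state `A` is a Beta distribution determined by the transition counts along the unique state path. The short data word is illustrative only.

```python
import random
from collections import Counter

# Assumed Even Process topology: state -> {symbol: next_state}.
EVEN_TOPOLOGY = {
    "A": {"0": "A", "1": "B"},
    "B": {"1": "A"},
}


def count_transitions(topology, start, data):
    """Count (state, symbol) transitions along the unique path, or None if disallowed."""
    counts = Counter()
    state = start
    for symbol in data:
        if symbol not in topology[state]:
            return None
        counts[(state, symbol)] += 1
        state = topology[state][symbol]
    return counts


def sample_p0_given_a(counts, n_samples, rng, alpha=1.0):
    """Draw posterior samples of p(0 | A); alpha = 1 is the assumed flat prior."""
    a = alpha + counts[("A", "0")]
    b = alpha + counts[("A", "1")]
    return [rng.betavariate(a, b) for _ in range(n_samples)]


rng = random.Random(1)
counts = count_transitions(EVEN_TOPOLOGY, "A", "0110")
draws = sample_p0_given_a(counts, 20000, rng)
print(dict(counts), sum(draws) / len(draws))
```

State `B` has a single outgoing edge, so it contributes no free parameter; only the counts leaving `A` enter the posterior.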
For our final example using a known topology we estimate $h_\mu$ and $C_\mu$ from the Even Process data. This illustrates estimating these functions of model parameters when the ε-machine topology is known but there is uncertainty in start state and transition probabilities. As above, we use Algorithm 2 in Fig. 2 and employ the known ε-machine structure. We sample start states and transition probabilities, followed by calculating $h_\mu$ and $C_\mu$, via Eqs. (19) and (20), respectively, to build a posterior density for these quantities.
Figure 5 presents the joint distribution for $h_\mu$ and $C_\mu$ along with the Gkde estimation of their marginal densities. Samples from the joint posterior distribution are plotted in the lower left panel for subsample lengths and . Only of the available samples are displayed in this panel to minimize the graphic’s size. The marginal densities for $h_\mu$ (top panel) and $C_\mu$ (right panel) are plotted using a Gkde with all samples. Small data size (, indicated by black points) samples allow a wide range of structure and randomness constrained only by the Even Process ε-machine topology. The ranges of $h_\mu$ and $C_\mu$ reflect the flat priors set for start states and transition probabilities. We note that a uniform prior distribution over transition probabilities and start states does not produce a uniform distribution over $h_\mu$ or $C_\mu$. Increasing the size of the data subsample to (brown points) results in a considerable reduction in the uncertainty for both functions. For this amount of data, the possible values of entropy rate and statistical complexity curve around the true value in the $h_\mu$-$C_\mu$ plane and result in a shifted peak for the marginal density for . For subsample length (blue points) the estimates of both functions of model parameters converge to the true values, indicated by the black, dashed lines.
V.2 Inferring Process Structure
We are now ready to demonstrate BSI’s efficacy for structural inference via a series of increasingly complex processes, monitoring convergence using data subsamples up to a length of . In this, we determine the number of hidden states, number of edges connecting them, and symbols output on each transition. As discussed above, we use the set of topological ε-machines as candidates because an efficient and exhaustive enumeration is available.
For comparison, we first explore the organization of the prior over the set of candidate ε-machines using intrinsic informational coordinates: the process entropy rate $h_\mu$ and statistical complexity $C_\mu$. We focus on their joint distribution, as induced by various settings of the prior parameter . The results lead us to use for the subsequent examples. This value creates a preference for small models when little data is available but allows for a larger number of states when reasonable amounts of data support it.
We establish BSI’s effectiveness by inferring the structure of a finite-order Markov process, an infinite-order Markov process, and an infinite-memory process. Again, the proxy for convergence is estimating structure and randomness as a function of the data subsample length $L$. Comparing these quantities’ posterior distributions with their prior illustrates uncertainty reduction as more data is analyzed.
V.2.1 Priors for Structured Processes
Here, we use a prior over all binary-alphabet, topological ε-machines with one to five states. (Recall Table 1.) We denote the set of topological ε-machines detailed in Table 1 as . Equation (17) allows specifying a preference for smaller ε-machines by setting and defining the function of model structure to be the number of states: . Beyond setting this explicitly, there is an inherent bias toward smaller models inversely proportional to the parameter space dimension. The parameter space is that of the estimated transition probabilities. Its dimension is the number of states with more than one outgoing transition. However, candidate ε-machine topologies with many states and few transitions result in a small parameter space and so may be assigned high probability for short data series. In addition, the prior over topologies must take into account the increasing number of candidates as the number of states increases. Setting sufficiently high so that large models are not given high probability under these conditions is reasonable, as we would like to approach structure estimates ($C_\mu$) monotonically from below, as data size increases.
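The interplay between a size penalty and the rapidly growing number of candidates can be sketched numerically. The code below assumes the prior weight of a topology decays exponentially in its number of states, with a penalty parameter called `beta` here; both the functional form and the name are assumptions for illustration rather than a restatement of Eq. (17). The per-size candidate counts are rounded, illustrative values and are not taken from Table 1.

```python
import math


def prior_mass_by_size(counts_by_size, beta):
    """Total prior mass per state count when each topology has weight exp(-beta * n)."""
    weights = {n: count * math.exp(-beta * n)
               for n, count in counts_by_size.items()}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


# Illustrative numbers of candidate topologies with 1..5 states (hypothetical).
counts_by_size = {1: 3, 2: 7, 3: 80, 4: 1400, 5: 35000}

for beta in (0.0, 2.0, 4.0):
    mass = prior_mass_by_size(counts_by_size, beta)
    summary = ", ".join(f"{n}: {m:.3f}" for n, m in sorted(mass.items()))
    print(f"beta = {beta}: {summary}")
```

With no penalty the sheer number of five-state candidates dominates the prior mass, whereas a sufficiently large penalty shifts most of the mass onto the smallest topologies.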
Figure 6 plots samples from the resulting joint prior for $(h_\mu, C_\mu)$ as well as the corresponding Gkde for marginal densities of both quantities. The data are generated by using the method of Sec. IV.4 and replacing the posterior density with the prior density. Specifically, rather than sampling a topology from , we sample from . Similar substitutions are made at each level, using the distributions that do not depend on observed data, resulting in samples from the prior. Each color in the figure reflects samples using all ε-machines in with different values for the prior parameter: (black), (brown) and (blue). While has many samples at high $C_\mu$, reflecting the large number of five-state ε-machines, increasing to results in noticeable bands in the $h_\mu$-$C_\mu$ plane and peaks at , , bits, and so on. This reflects the fact that larger makes smaller ε-machines more likely. As a consequence, the emergence of patterns due to one-, two-, and three-state topologies is seen. Setting shows a stronger a priori preference for one- and two-state ε-machines, reflected by the strong peaks at bits and bit. Interestingly, the prior distribution over $h_\mu$ and $C_\mu$ is quite similar for and , with more distributional structure due to smaller ε-machines at . However, the prior distribution for $h_\mu$ and $C_\mu$ is quite different for , creating a strong preference for one- and two-state topologies. This results in an a priori preference for low $C_\mu$ and high $h_\mu$ that, as we demonstrate shortly, is modified for moderate amounts of data. We employ as a reasonable value in all subsequent examples. In practice, sensitivity to this choice should be tested in each application to verify that the resulting behavior is appropriate. We suggest small, nonzero values as reasonable starting points. As always, sufficient data makes the choice relatively unimportant for the resulting inference.
V.2.2 Markov Example: The Golden Mean Process
The first example of structural inference explores the Golden Mean Process, pictured in Fig. 7. Although it is illustrated as an HMM in the figure, it is effectively a Markov chain with no hidden states: observing a corresponds to state , whereas observing means the process is in state . Previously, we showed that this data source can be inferred using the model class of $k$th-order Markov chains, as expected [29]. However, the Golden Mean Process is also a member of the class of binary-alphabet, topological ε-machines considered here. As a result, structural inference from Golden Mean data is an example of in-class modeling.
We proceed using the approach laid out above for the Even Process transition probabilities and start states. We generated a single data series by randomizing the start state and creating a symbol sequence of length using the Golden Mean Process ε-machine. As above, we monitor the convergence using subsamples for lengths , . The candidate ε-machines consist of all ε-machine topologies in Table 1. Estimating $h_\mu$ and $C_\mu$ aids in monitoring convergence of inferred topology and related properties to the correct values. In addition, we provide supplementary tables and figures, using both the full candidate set and the maximum a posteriori model at each data length $L$, to give a detailed view of structural inference.
Figure 8 plots samples from the joint posterior over $(h_\mu, C_\mu)$, as well as their marginal distributions, for three subsample lengths. As in Fig. 5, we consider (black), (brown), and (blue). However, this example employs the full set of candidate topologies. For small data size () the distribution closely approximates the prior distribution for , as it should. At data size , the samples of both $h_\mu$ and $C_\mu$ are still broad, resulting in multimodal behavior with considerable weight given to both two- and three-state topologies. Consulting Table S2 in the supplementary material, we see that this is the shortest length that selects the correct topology for the Golden Mean Process (denoted n2k2id5 in Table S2). For smaller $L$, the single-state, two-edge topology is preferred (denoted n1k2id3). However, the probability of the correct model is only 78.7%, leaving a substantial probability for alternative candidates. The uncertainty is further reflected in the large credible interval for provided by the complete set of models (see Table S1), ranging from bits as the lower bound to bits as the upper bound. However, by subsample length the probability of the correct topology is 99.998%, given the set of candidate ε-machines, and estimates of both $h_\mu$ and $C_\mu$ have converged to accurately reflect the correct values.
In addition to Tables S1 and S2, the supplementary materials provide Fig. S1 showing the Gkde estimates of both $h_\mu$ and $C_\mu$ using the full candidate set and the MAP model as a function of subsample length. The four panels clearly show the convergence of estimates to the correct values as $L$ increases. For long data series, there is little difference between the inference made using the maximum a posteriori (MAP) model and the posterior over the entire candidate set. However, this is not true for short time series, where using the full set more accurately captures the uncertainty in estimation of the information-theoretic quantities of interest. We note that the estimates approach the true value from below, preferring small topologies when there is little data and selecting the correct, larger topology only as available data increases. This desired behavior results from setting for the prior over . Setting , shown in Fig. S2, does not have this effect. This value of is insufficient to overcome the large number of three-, four-, and five-state ε-machines. Finally, Fig. S3 plots samples from the joint posterior of $h_\mu$ and $C_\mu$ using only the MAP model for subsample lengths , and . This should be compared with Fig. 8 where the complete set is used. Again, there is a substantial difference for short data series and much in common for larger $L$.
Before moving to the next example, let’s briefly return to consider start-state inference. The data series generated to test inferring the Golden Mean Process started with the sequence . We note that the correct start state, which happens to be state in that realization, cannot be inferred and has lower probability than state due to the process’s structure: using Eq. (13). The reason for the inability to discern the start state is straightforward. Consulting Fig. 7, we can see that the string can be produced beginning in both states and . On the one hand, assuming , the state path would be with probability . On the other hand, assuming , the state path is with probability . The only difference in the probabilities is a factor of versus resulting in:
This calculation agrees nicely with the result stated above, using finite data and the inference calculations from Eq. (13).
It turns out that any observed data series from the Golden Mean Process that begins with a will have this ambiguity in start state. However, observed sequences that begin with a uniquely identify as the start state since a is not allowed leaving state . Despite this, the correct topology is inferred and accurate estimates of $h_\mu$ and $C_\mu$ are obtained.
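The start-state ambiguity can be reproduced with a small sketch that holds the transition probabilities fixed, in the spirit of the path-probability argument above (the calculation of Eq. (13) itself is not reproduced). Several things are assumed for illustration: state labels `A` and `B`, the convention that consecutive 0s are forbidden, a uniform prior over start states, the value `p = 0.5` for the probability of emitting 1 from `A`, and the test words.

```python
def golden_mean(p):
    """Assumed Golden Mean layout: state -> {symbol: (next_state, probability)}."""
    return {
        "A": {"1": ("A", p), "0": ("B", 1.0 - p)},
        "B": {"1": ("A", 1.0)},
    }


def path_likelihood(machine, start, data):
    """Probability of `data` along the unique path from `start` (0.0 if disallowed)."""
    state = start
    likelihood = 1.0
    for symbol in data:
        if symbol not in machine[state]:
            return 0.0
        next_state, prob = machine[state][symbol]
        likelihood *= prob
        state = next_state
    return likelihood


def start_state_posterior(machine, data):
    """Posterior over start states under a uniform prior and fixed transition probabilities."""
    likelihoods = {s: path_likelihood(machine, s, data) for s in machine}
    total = sum(likelihoods.values())
    if total == 0.0:
        raise ValueError("data cannot be generated from any start state")
    return {s: value / total for s, value in likelihoods.items()}


machine = golden_mean(0.5)
print(start_state_posterior(machine, "1101"))   # begins with 1: ambiguous
print(start_state_posterior(machine, "0110"))   # begins with 0: unambiguous
```

Both paths coincide after the first symbol, so the posterior odds depend only on the two probabilities of emitting that first symbol and do not sharpen with more data.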
V.2.3 Infinite-order Markov Example: The Even Process
Next, we consider inferring the structure of the Even Process using the same set of binary-alphabet, one- to five-state, topological ε-machines. To be clear, this example differs from Sec. V.1.1, where the correct topology was assumed. Now, we explore Even Process structure using the full candidate set. As noted above, the Even Process is an infinite-order Markov process and inference requires the set of topological ε-machines considered here. (However, see out-of-class inference of the Even Process using $k$th-order Markov chains in [29].) As a result, this is an example of in-class inference since the Even Process topology is contained within the candidate set. As with the previous example, a single data series was generated from the Even Process.
Figure 9 shows samples from the posterior distribution over $(h_\mu, C_\mu)$ using three subsample lengths , and as before. An equivalent plot using only the MAP model is provided in the supplementary materials for comparison; see Fig. S6. Again, for short data series the samples mirror the prior distribution as they should. (See black points for .) At subsample length the values of $h_\mu$ and $C_\mu$ are much more tightly delineated. Comparing samples for the Golden Mean Process in Fig. 8 shows that there is much less uncertainty in structure for the Even Process at this data size. Table S4 shows that the MAP topology for this value of $L$ is already the correct one (denoted n2k2id7), with a probability of 99.41%. This high probability is reflected by the smaller spread, when compared with the Golden Mean example, of the samples of $h_\mu$ and $C_\mu$. At subsample length the probability of the correct topology has grown to 99.998%. Estimates of both $h_\mu$ and $C_\mu$ are also very accurate, with small uncertainties, at this $L$; see Table S3.
The supplementary materials provide Figs. S4 and S5 to show the convergence of the posterior densities for $h_\mu$ and $C_\mu$ as a function of subsample length. Figure S4 shows estimates using both the full candidate set and the MAP model for , whereas Fig. S5 demonstrates the effects of using a small penalty () for model size. As seen with the Golden Mean Process, the difference is most apparent at small data sizes. At large $L$, the difference between using the complete set of models versus the MAP model is minor, as is the effect of choosing or . However, at small data sizes the choices impact the resulting inference. In particular, the choice of allows the inference machinery to approach the correct $C_\mu$ from below whereas the choice of approaches from above; see Figs. S4 and S5. This behavior, which we believe is desirable, is similar to the inference dynamics observed for the Golden Mean Process, further strengthening the apparent suitability of using .
Unlike the previous example, the start state for the correct structure is inferred with little data. In this example, the data series begins with the symbols , which can only be generated from state . So, at the start state for the correct topology is determined, but it takes more data— symbols in this case—for this structure to become the most probable in the set considered.
V.2.4 Out-of-Class Structural Inference: The Simple Nonunifilar Source
The Simple Nonunifilar Source (SNS) is our final and most challenging example of structural inference, due to its being out-of-class. Not only is the SNS infinite-order Markov, but any unifilar presentation of it requires an infinite number of states. In particular, its ε-machine, the minimal unifilar presentation, has a countable infinity of causal states [36]. We can see the difference between the SNS and previous processes by inspecting state , where both outgoing edges emit a symbol ‘1’. (See Fig. 10 for a hidden Markov model presentation that is not an ε-machine.) This makes the SNS a nonunifilar topology, as the name suggests. Importantly, even if we assume a start state, there is no longer a single, unique path through the hidden states for an observed output data series. This is completely different from the unifilar examples previously considered, where an assumed start state and observed data series either determined a unique path through hidden states or was disallowed. As a result, the inference tools developed here cannot use the HMM topology of Fig. 10. Concretely, this class of representation breaks our method for counting transitions.
Our goal, though, is to use the set of unifilar, topological ε-machines at our disposal to infer properties of the Simple Nonunifilar Source. (One reason to do this is that unifilar models are required to calculate $h_\mu$.) Typical data series generated by the SNS model are accepted by many of the unifilar topologies in the candidate set and a posterior distribution over these models can be calculated. As with previous examples, we demonstrate estimating $h_\mu$ and $C_\mu$ for the data source. Due to the nonunifilar nature of the source, we expect $C_\mu$ estimates to increase with the size of the available data series. However, the ability to estimate $h_\mu$ accurately is unclear a priori. Of course, in this example we cannot find the correct model topology because infinite structures are not contained in the candidate set.
Figure 11 presents the joint posterior for $(h_\mu, C_\mu)$ for three subsample lengths. As previously, a single data series of length is generated using the SNS and analysis of subsamples is employed to demonstrate convergence. The short subsample ($L = 1$, black points) is predictably uninteresting, reflecting the prior distribution over models. For subsamples shorter than the MAP model is the single-state, two-edge topology (denoted n1k2id3 in Table S6). At the Golden Mean Process topology becomes most probable with a posterior probability of 53.01%. The probability of the single-state topology is still 43.98%, though, resulting in the strongly bimodal marginal posterior for $C_\mu$ observed for . (See Fig. 11 brown points, right panel.) Bimodality also appears in the marginal posterior for $h_\mu$, with the largest peak coming from the two-state topology and the high entropy rates being contributed by the single-state model. At large data size (, blue points) $h_\mu$ has converged on the true value, while $C_\mu$ has sharp, bimodal peaks due to many nearly equally probable five-state topologies. Consulting Table S6, we see that the MAP structure for this value of $L$ has five states (denoted n5k2id22979, there) and a low posterior probability of only 8.63%. Further investigation reveals that there are four additional ε-machine topologies (making a total of five) with similar posterior probability. These general details persist for longer subsample sequences including the complete data series at length . Although estimating $h_\mu$ converges smoothly, the inference of structure as reflected by $C_\mu$ does not show signs of graceful convergence.
We provide supplementary plots in Figs. S7 and S8 that show the convergence of $h_\mu$ and $C_\mu$ using the full candidate set and the MAP model for prior parameters and , respectively. Again, the choice of matters most at small data sizes. While the $C_\mu$ estimate increases as a function of $L$ for , the use of results in posterior means for $C_\mu$ that first decrease as a function of $L$, then increase. Again, this supports the use of for this set of binary-alphabet, topological ε-machines. The need to employ the complete model set versus the MAP topology is most evident at small data sizes, as was also seen in previous examples. However, the inference in this example is more complicated due to the large number of five-state topologies with roughly equal probability. The MAP method selects just one model, of course, and so cannot represent the posterior distribution’s bimodal behavior. Given that the data source is out-of-class, this trouble is perhaps not surprising. Figure S9 shows samples from the joint posterior of $(h_\mu, C_\mu)$ using only the MAP topology. Using the latter also suffers from requiring one to select a single exemplar topology for a posterior distribution that is simply not well represented by a single ε-machine.
VI Discussion
The examples demonstrated structural inference of unifilar hidden Markov models using the set of one- to five-state, binary-alphabet, topological ε-machines. We found that in-class examples, including the Golden Mean and Even Processes, were effectively and efficiently discovered. That is, the correct topology was accorded the largest posterior probability and estimates of information coordinates $h_\mu$ and $C_\mu$ were accurate. However, we found that a sufficiently large value of , providing the model size penalty, was key to a conservative structural inference. Conservative means that estimates approach the true value from below, effectively counteracting the increasing number of topologies with larger state sets. For the out-of-class example, given by the Simple Nonunifilar Source, these broader patterns held true. However, structure could not be captured, as reflected in the increasing number of states inferred as a function of data length. Also, many topologies had relevant posterior probability for the SNS data, reflecting a lack of consensus and a large degeneracy with regard to structure. This resulted in a multimodal posterior distribution for $C_\mu$ and a MAP model with very low posterior probability.
One of the surprises was the number of accepting topologies for a given data set. By this we mean the number of candidate structures for which the data series of interest had a valid path through hidden states, resulting in nonzero posterior probability. In many ways, this aspect of structural inference mirrors grammatical inference for deterministic finite automata (DFAs) [37, 38]. In the supplementary material we provide plots for the three processes considered above showing the number of accepting topologies in the set of one- to five-state ε-machines used for . (See Supplemental Fig. S10.) For all three processes, a rapid decline in the number of accepting topologies occurs for the first to symbols, followed by a plateau at a set of accepting topologies. For smaller topologies, which come from the model class under consideration, this pattern makes sense. Often, the smaller topology is embedded within a larger set of states, some of which are never used. For out-of-class examples like the SNS this behavior is less transparent. The rejection of a data series by a given topology provides a first level of filtering by assigning zero posterior probability to the structure due to vanishing likelihood of the data given the model. For the examples given above, of the possible topologies, accepted Golden Mean data, topologies accepted Even Process data, and accepted SNS data when the full data series was considered.
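The notion of an accepting topology can be illustrated with a toy candidate set. The sketch below counts, for each prefix of a short word, how many candidates have at least one start state with a valid path. The five small topologies, their names, the state labels, and the data word are all hypothetical and stand in for the enumerated library, which is not reproduced here.

```python
# Toy candidate set: name -> {state: {symbol: next_state}}. All hypothetical.
CANDIDATES = {
    "full_shift": {"A": {"0": "A", "1": "A"}},
    "all_zeros": {"A": {"0": "A"}},
    "all_ones": {"A": {"1": "A"}},
    "golden_mean": {"A": {"1": "A", "0": "B"}, "B": {"1": "A"}},
    "even": {"A": {"0": "A", "1": "B"}, "B": {"1": "A"}},
}


def accepts_from(topology, start, data):
    """True if `data` has a valid state path beginning in `start`."""
    state = start
    for symbol in data:
        if symbol not in topology[state]:
            return False
        state = topology[state][symbol]
    return True


def accepts(topology, data):
    """True if some start state yields a valid path, i.e. nonzero likelihood."""
    return any(accepts_from(topology, s, data) for s in topology)


def accepting_counts(candidates, data):
    """Number of accepting topologies for each prefix length 1..len(data)."""
    return [sum(accepts(t, data[:n]) for t in candidates.values())
            for n in range(1, len(data) + 1)]


print(accepting_counts(CANDIDATES, "11010"))
```

Even in this tiny example the count can only fall or stay level as the prefix grows, since a topology that rejects a prefix also rejects every extension of it.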
In all of the examples the data sources were stationary, so that statistics did not change over the course of the data series. This is important because stationarity is built into the model class definition employed: the model topology and transition probabilities did not depend on time. However, given a general data series with unknown properties, it is unwise to assume stationarity holds. How can this be probed? One method is to subdivide the data into overlapping segments of equal length. Given these, inference using the full candidate set or the MAP model should return similar results for each segment. For in-class data sources like the Even and Golden Mean Processes, the true model should be returned for each data subsegment. For out-of-class but stationary sources like the Simple Nonunifilar Source, the true topology cannot be returned, but a consistent model within the candidate set should be returned for each data segment.
However, one form of relatively simple nonstationarity—a structural changepoint problem such as switching between the Golden Mean and Even Processes—can be detected by BSI applied to subsegments. The inferred topology for early segments returns the Golden Mean topology and later segments return the Even topology. Notably, the inferred topology using all of the data or a subsegment overlapping the switch returns a more complicated model topology reflecting both structures. Of course, detection of this behavior requires sufficient data and slow switching between data sources.
In a sequel we compare BSI to alternative structural inference methods. The range of alternatives, and of their differences from BSI, is large, so a comparison demands its own venue. The sequel also addresses expanding the model candidates beyond the set of topological ε-machines to the full set of unifilar hidden Markov models, a necessary step before useful comparisons can be explored.
VII Conclusion
We demonstrated effective and efficient inference of topological ε-machines using a library of candidate structures and the tools of Bayesian inference. Several avenues for further development are immediately obvious. First, as just noted, using full, unrestricted ε-machines (allowing models outside the set of topological ε-machines) is straightforward. This will provide a broad array of candidates within the more general class of unifilar hidden Markov models. In the present setting, by way of contrast, processes with full support (all words allowed) can map only to the single-state topology. Second, refining the eminently parallelizable Bayesian Structural Inference algorithms will allow them to take advantage of large compute clusters and cloud computing to dramatically expand the number of candidate topologies considered. For comparison, the current implementation uses nonoptimized Python on a single thread. This configuration (running on a contemporary Linux compute node) takes between and hours, depending on the number of accepting topologies, to calculate the posterior distribution over the candidates for a data series of length . An additional to minutes is needed to generate the samples from the posterior to estimate functions of model parameters, like $h_\mu$ and $C_\mu$.
We note that the methods of Bayesian Structural Inference can be applied to any set of unifilar hidden Markov models and, moreover, they do not have to employ a large, enumerated library. For example, a small set of candidate fifty-state topologies could be compared for a given data series. This ability opens the door to automated methods for generating candidate structures. Of course, as always, one must keep in mind that all inferences are then conditioned on the possibly limited or inappropriate set of model topologies chosen.
Finally, let’s return to the scientific and engineering problem areas cited in the introduction that motivated structural inference in the first place. Generally, Bayesian Structural Inference will find application in fields, such as those mentioned, that rely on finite-order Markov chains or the broader class of (nonunifilar) hidden Markov models. It will also find application in areas requiring accurate estimates of various system statistics. The model class considered here (ε-machines) consists of a novel set of topologies and usefully allows one to estimate both randomness and structure using $h_\mu$ and $C_\mu$, two of the most basic informational measures. As a result, we expect Bayesian Structural Inference to find an array of applications in bioinformatics, linguistics, and dynamical systems.
Acknowledgments
The authors thank Ryan James and Chris Ellison for helpful comments and implementation advice. Partial support was provided by ARO grants W911NF-12-1-0234 and W911NF-12-1-0288.
References
 [1] B.-J. Yoon. Hidden Markov models and their applications in biological sequence analysis. Curr. Genomics, 10:402–415, 2009.
 [2] L. Narlikar, N. Mehta, S. Galande, and M. Arjunwadkar. One size does not fit all: On how Markov model order dictates performance of genomic sequence analyses. Nucleic Acids Res., 2012.
 [3] R. L. Davidchack, Y.-C. Lai, E. M. Bollt, and M. Dhamala. Estimating generating partitions of chaotic systems by unstable periodic orbits. Phys. Rev. E, 61:1353–1356, 2000.
 [4] C. S. Daw, C. E. A. Finney, and E. R. Tracy. A review of symbolic analysis of experimental data. Rev. Sci. Instrum., 74:915–930, 2003.
 [5] M. B. Kennel and M. Buhl. Estimating good discrete partitions from observed data: Symbolic false nearest neighbors. Phys. Rev. Lett., 91:084102, 2003.
 [6] C. C. Strelioff and J. P. Crutchfield. Optimal instruments and models for noisy chaos. CHAOS, 17:043127, 2007.
 [7] R. P. N. Rao, N. Yadav, M. N. Vahia, H. Joglekar, R. Adhikari, and I. Mahadevan. A Markov model of the Indus script. Proc. Natl. Acad. Sci. USA, 106:13685–13690, 2009.
 [8] R. Lee, P. Jonathan, and P. Ziman. Pictish symbols revealed as a written language through application of Shannon entropy. Proc. Roy. Soc. A, 2010.
 [9] D. Kelly, M. Dillingham, A. Hudson, and K. Wiesner. A new method for inferring hidden Markov models from noisy time sequences. PLoS ONE, 7:e29703, 2012.
 [10] C.-B. Li, H. Yang, and T. Komatsuzaki. Multiscale complex network of protein conformational fluctuations in single-molecule time series. Proc. Natl. Acad. Sci. USA, 105:536–541, 2008.
 [11] P. Graben, J. D. Saddy, M. Schlesewsky, and J. Kurths. Symbolic dynamics of event-related brain potentials. Phys. Rev. E, 62:5518–5541, 2000.
 [12] R. Haslinger, K. L. Klinkner, and C. R. Shalizi. The computational structure of spike trains. Neural Comput., 22:121–157, 2009.
 [13] D. P. Varn, G. S. Canright, and J. P. Crutchfield. Inferring pattern and disorder in close-packed structures via ε-machine reconstruction theory: Structure and intrinsic computation in zinc sulphide. Acta. Cryst. Sec. B, 63(2):169–182, 2007.
 [14] D. P. Varn, G. S. Canright, and J. P. Crutchfield. ε-Machine spectral reconstruction theory: A direct method for inferring planar disorder and structure from X-ray diffraction studies. Acta. Cryst. Sec. A, 69(2):197–206, 2013.
 [15] J. P. Crutchfield. Between order and chaos. Nature Physics, 8(January):17–24, 2012.
 [16] J. P. Crutchfield and K. Young. Inferring statistical complexity. Phys. Rev. Lett., 63:105–108, 1989.
 [17] D. P. Varn, G. S. Canright, and J. P. Crutchfield. Discovering planar disorder in close-packed structures from X-ray diffraction: Beyond the fault model. Phys. Rev. B, 66(17):174110–3, 2002.
 [18] C. R. Shalizi, K. L. Shalizi, and J. P. Crutchfield. Pattern discovery in time series, Part I: Theory, algorithm, analysis, and convergence. 2002. Santa Fe Institute Working Paper 02-10-060; arXiv.org/abs/cs.LG/0210025.
 [19] C. R. Shalizi, K. L. Shalizi, and R. Haslinger. Quantifying self-organization with optimal predictors. Phys. Rev. Lett., 93:118701, 2004.
 [20] B. D. Johnson, J. P. Crutchfield, C. J. Ellison, and C. S. McTague. Enumerating finitary processes. 2012. SFI Working Paper 10-11-027; arxiv.org:1011.0036 [cs.FL].
 [21] E. Ott. Chaos in Dynamical Systems. Cambridge University Press, New York, 1993.
 [22] S. H. Strogatz. Nonlinear Dynamics and Chaos: with applications to physics, biology, chemistry, and engineering. Addison-Wesley, Reading, Massachusetts, 1994.
 [23] D. Lind and B. Marcus. An Introduction to Symbolic Dynamics and Coding. Cambridge University Press, New York, 1995.
 [24] A. B. Gelman, J. S. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall, CRC, 1995.
 [25] C. R. Shalizi and J. P. Crutchfield. Computational mechanics: Pattern and prediction, structure and simplicity. J. Stat. Phys., 104:817–879, 2001.
 [26] N. Travers and J. P. Crutchfield. Exact synchronization for finite-state sources. J. Stat. Phys., 145(5):1181–1201, 2011.
 [27] N. Travers and J. P. Crutchfield. Asymptotic synchronization for finite-state sources. J. Stat. Phys., 145(5):1202–1223, 2011.
 [28] N. Travers and J. P. Crutchfield. Equivalence of history and generator ε-machines. 2011. SFI Working Paper 11-11-051; arxiv.org:1111.4500 [math.PR].
 [29] C. C. Strelioff, J. P. Crutchfield, and A. W. Hübler. Inferring Markov chains: Bayesian estimation, model comparison, entropy rate, and out-of-class modeling. Phys. Rev. E, 76:011106, 2007.
 [30] C. J. Ellison, J. R. Mahoney, and J. P. Crutchfield. Prediction, retrodiction, and the amount of information stored in the present. J. Stat. Phys., 136(6):1005–1034, 2009.
 [31] C. J. Ellison and J. P. Crutchfield. States of states of uncertainty. arxiv.org:13XX.XXXX, in preparation.
 [32] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77:257–286, 1989.
 [33] S. S. Wilks. Mathematical Statistics. John Wiley & Sons, Inc., New York, NY, 1962.
 [34] D. H. Wolpert and D. R. Wolf. Estimating functions of probability distributions from a finite set of samples. Phys. Rev. E, 52:6841–6854, 1995.
 [35] L. Yuan and H. K. Kesavan. Bayesian estimation of Shannon entropy. Commun. Stat. Theory Methods, 26:139–148, 1997.
 [36] J. P. Crutchfield. The calculi of emergence: Computation, dynamics, and induction. Physica D, 75:11–54, 1994.
 [37] K. J. Lang, B. A. Pearlmutter, and R. A. Price. Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. In V. Honavar and G. Slutzki, editors, Grammatical Inference, volume 1433 of Lect. Notes Comp. Sci., pages 1–12. Springer Berlin Heidelberg, 1998.
 [38] C. de la Higuera. A bibliographical study of grammatical inference. Patt. Recog., 38:1332–1348, 2005.
Supplementary Material
Bayesian Structural Inference for Hidden Processes
Christopher C. Strelioff and James P. Crutchfield
Appendix A Overview
The supplementary materials provide tables and figures that give an in-depth picture of the Bayesian Structural Inference examples. Unless otherwise noted, all analyses presented here use the same single data series and parameter settings detailed in the main text. Please use the main text as the primary guide.
The first three sections address the Golden Mean, Even, and SNS processes. Each provides a table of estimates of $h_\mu$ and $C_\mu$ using the complete set of one- to five-state topological ε-machines, denoted $\mathcal{M}$. Estimates are given for each subsample length $L = 2^i$, where $i = 0, 1, \ldots, 17$, as in the main text. To be clear, this means that we analyze subsamples using different initial segments of a single long data series, allowing for a consistent view of estimate convergence. For both information-theoretic quantities, we list the posterior mean and the equal-tailed 95% credible interval (CI), constructed from the 2.5% and 97.5% quantiles estimated from samples of the posterior distribution. The CI is denoted by a parenthesized number pair. A second table provides the same estimates of $h_\mu$ and $C_\mu$ using only the maximum a posteriori (MAP) model. As a result, this table no longer reflects uncertainty in model topology, which may be small or large depending on the data and subsample length under consideration. An additional column in this second table provides the MAP topology along with its posterior probability, given in parentheses.
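The summary statistics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `subsample_lengths` and `posterior_summary` are hypothetical, the quantiles use simple linear interpolation between order statistics (the source does not say which quantile estimator was used), and the Beta-distributed draws are a stand-in for real posterior samples.

```python
import random
import statistics


def subsample_lengths(max_power=17):
    """Subsample lengths L = 2**i for i = 0..max_power (1 up to 131072)."""
    return [2 ** i for i in range(max_power + 1)]


def posterior_summary(samples, level=0.95):
    """Posterior mean and equal-tailed credible interval from samples."""
    ordered = sorted(samples)
    n = len(ordered)
    tail = (1.0 - level) / 2.0  # 2.5% in each tail for a 95% interval

    def quantile(q):
        # Linear interpolation between neighboring order statistics
        # (an assumed estimator, chosen for simplicity).
        pos = q * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

    return statistics.fmean(ordered), (quantile(tail), quantile(1.0 - tail))


# Hypothetical stand-in for posterior samples of an entropy-rate estimate.
rng = random.Random(0)
fake_samples = [rng.betavariate(20, 10) for _ in range(5000)]
mean, (low, high) = posterior_summary(fake_samples)

# Same layout as the table entries: mean (lower,upper).
print(f"{mean:.3e} ({low:.3e},{high:.3e})")
```

Each subsample would be the initial segment `data[:L]` of the single long series, so the summaries at successive `L` share their data and can be compared directly.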
In addition to the tables of estimates, figures demonstrate the convergence of the $h_\mu$ and $C_\mu$ marginal posterior distributions as a function of subsample length $L$. In this, we consider the difference between posteriors using the complete set of candidate models and those that only employ the MAP topology. This set of figures also illustrates the difference between and . (We use different data, but still a single time series, for the example.) In all plots, the marginal posterior distribution for the quantity of interest is estimated by Gaussian kernel density estimation (Gkde) applied to samples from the appropriate density. If there is little or no variation in the samples, the Gkde fails and no density is drawn. This happens, for example, when the MAP topology has one state, and $C_\mu = 0$, for small data sizes. The posterior samples are still valid, however, so the posterior mean and credible interval can be provided (see tables).
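The failure mode mentioned above (no density drawn when the samples do not vary) can be illustrated with a small hand-rolled Gaussian kernel density estimator. This is a sketch, not the software used for the figures: the function name `gaussian_kde_1d` is hypothetical, and the Scott's-rule bandwidth is an assumption, since the source does not state the bandwidth choice.

```python
import math
import statistics


def gaussian_kde_1d(samples):
    """Return a density function, or None when the samples do not vary."""
    n = len(samples)
    if n < 2:
        return None
    sigma = statistics.stdev(samples)
    if sigma == 0.0:
        # Degenerate case, e.g. every sample has C_mu = 0 for a one-state
        # topology: the bandwidth would be zero, so no density is drawn.
        return None
    bandwidth = sigma * n ** (-1.0 / 5.0)  # Scott's rule in one dimension
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Average of Gaussian bumps centered on each sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# All-zero samples: the estimator declines to build a density.
print(gaussian_kde_1d([0.0] * 100))

# Samples with spread: a callable density is returned.
kde = gaussian_kde_1d([0.60, 0.65, 0.66, 0.67, 0.70])
print(kde(0.66) > kde(0.90))
```

Even when the density cannot be drawn, the samples themselves remain usable for means and quantiles, which is why the tables still report values in those rows.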
Section E plots the number of accepting topologies as a function of subsample length for each of the example data sources in Fig. S10. The panels demonstrate that there are many valid candidate topologies for a given data series, even when subsamples of considerable length are available.
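As a rough illustration of what it means for a topology to accept a data series, the sketch below checks whether a unifilar topology has at least one start state from which every observed symbol follows an allowed transition. The dictionary encoding, the function name `accepting_start_states`, and the state labels are hypothetical; the example topology is a two-state structure that forbids the word 00, in the style of the Golden Mean Process, with the symbol convention assumed.

```python
def accepting_start_states(topology, states, data):
    """Start states from which the unifilar topology can generate `data`.

    `topology` maps (state, symbol) -> next state; a missing key means
    the transition is forbidden.
    """
    accepted = []
    for start in states:
        state = start
        for symbol in data:
            state = topology.get((state, symbol))
            if state is None:
                break  # forbidden transition: this start state is rejected
        else:
            accepted.append(start)  # every symbol was allowed
    return accepted


# Assumed example: state A may emit 0 or 1; state B must emit 1.
golden_mean_like = {("A", 0): "B", ("A", 1): "A", ("B", 1): "A"}

print(accepting_start_states(golden_mean_like, ["A", "B"], [1, 0, 1, 1, 0]))
print(accepting_start_states(golden_mean_like, ["A", "B"], [0, 0, 1]))
```

Under this reading, a topology counts as accepting when the returned list is non-empty. Short series rule out few topologies, which is consistent with the large counts reported for small subsample lengths.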
Finally, Sec. F illustrates all topologies that met the MAP criterion for the data sources considered. Notably, there are not many structures to consider despite the large number of topologies that accept the data.
Appendix B Golden Mean Process: Structural Inference
L  $h_\mu$ [bits/symbol]: posterior mean (95% CI)  $C_\mu$ [bits]: posterior mean (95% CI)

1  6.767e-01 (3.682e-02,9.994e-01)  1.467e-01 (0.000e+00,1.333e+00)
2  6.400e-01 (6.662e-02,9.990e-01)  1.074e-01 (0.000e+00,1.089e+00)
4  7.771e-01 (2.760e-01,9.996e-01)  1.146e-01 (0.000e+00,1.000e+00)
8  7.753e-01 (3.557e-01,9.994e-01)  1.441e-01 (0.000e+00,1.000e+00)
16  7.941e-01 (4.751e-01,9.976e-01)  1.128e-01 (0.000e+00,9.469e-01)
32  7.697e-01 (5.221e-01,9.773e-01)  2.564e-01 (0.000e+00,1.556e+00)
64  6.440e-01 (5.207e-01,6.942e-01)  1.052e+00 (8.235e-01,1.797e+00)
128  6.575e-01 (5.953e-01,6.930e-01)  9.209e-01 (8.667e-01,9.590e-01)
256  6.684e-01 (6.311e-01,6.917e-01)  9.128e-01 (8.740e-01,9.437e-01)
512  6.718e-01 (6.477e-01,6.889e-01)  9.107e-01 (8.835e-01,9.338e-01)
1024  6.622e-01 (6.428e-01,6.780e-01)  9.217e-01 (9.048e-01,9.369e-01)
2048  6.618e-01 (6.483e-01,6.736e-01)  9.225e-01 (9.107e-01,9.333e-01)
4096  6.587e-01 (6.490e-01,6.678e-01)  9.253e-01 (9.172e-01,9.329e-01)
8192  6.645e-01 (6.582e-01,6.704e-01)  9.203e-01 (9.143e-01,9.259e-01)
16384  6.643e-01 (6.599e-01,6.685e-01)  9.205e-01 (9.164e-01,9.245e-01)
32768  6.647e-01 (6.615e-01,6.676e-01)  9.202e-01 (9.173e-01,9.231e-01)
65536  6.662e-01 (6.640e-01,6.682e-01)  9.188e-01 (9.167e-01,9.208e-01)
131072  6.670e-01 (6.655e-01,6.684e-01)  9.180e-01 (9.165e-01,9.194e-01)
L  $h_\mu$ [bits/symbol]: posterior mean (95% CI)  $C_\mu$ [bits]: posterior mean (95% CI)  MAP topology (posterior probability)

1  7.221e-01 (9.729e-02,9.996e-01)  0.000e+00 (0.000e+00,0.000e+00)  n1k2id3 (8.570e-01)
2  6.603e-01 (6.849e-02,9.992e-01)  0.000e+00 (0.000e+00,0.000e+00)  n1k2id3 (8.954e-01)
4  8.116e-01 (3.066e-01,9.997e-01)  0.000e+00 (0.000e+00,0.000e+00)  n1k2id3 (8.896e-01)
8  8.129e-01 (3.811e-01,9.995e-01)  0.000e+00 (0.000e+00,0.000e+00)  n1k2id3 (8.600e-01)
16  8.141e-01 (4.787e-01,9.981e-01)  0.000e+00 (0.000e+00,0.000e+00)  n1k2id3 (8.795e-01)
32  8.134e-01 (5.668e-01,9.830e-01)  0.000e+00 (0.000e+00,0.000e+00)  n1k2id3 (7.324e-01)
64  6.636e-01 (5.842e-01,6.942e-01)  9.061e-01 (8.188e-01,9.622e-01)  n2k2id5 (7.873e-01)
128  6.577e-01 (5.962e-01,6.929e-01)  9.198e-01 (8.666e-01,9.583e-01)  n2k2id5 (9.971e-01)
256  6.684e-01 (6.316e-01,6.918e-01)  9.125e-01 (8.736e-01,9.433e-01)  n2k2id5 (9.987e-01)
512  6.717e-01 (6.477e-01,6.889e-01)  9.108e-01 (8.836e-01,9.338e-01)  n2k2id5 (9.994e-01)
1024  6.621e-01 (6.429e-01,6.781e-01)  9.217e-01 (9.046e-01,9.369e-01)  n2k2id5 (9.997e-01)
2048  6.617e-01 (6.481e-01,6.735e-01)  9.226e-01 (9.108e-01,9.335e-01)  n2k2id5 (9.998e-01)
4096  6.588e-01 (6.491e-01,6.677e-01)  9.253e-01 (9.172e-01,9.328e-01)  n2k2id5 (9.999e-01)
8192  6.645e-01 (6.582e-01,6.705e-01)  9.202e-01 (9.143e-01,9.259e-01)  n2k2id5 (1.000e+00)
16384  6.643e-01 (6.599e-01,6.685e-01)  9.205e-01 (9.164e-01,9.245e-01)  n2k2id5 (1.000e+00)
32768  6.646e-01 (6.616e-01,6.677e-01)  9.202e-01 (9.173e-01,9.231e-01)  n2k2id5 (1.000e+00)
65536  6.662e-01 (6.640e-01,6.682e-01)  9.188e-01 (9.167e-01,9.208e-01)  n2k2id5 (1.000e+00)
131072  6.670e-01 (6.655e-01,6.684e-01)  9.180e-01 (9.165e-01,9.194e-01)  n2k2id5 (1.000e+00)
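For orientation, the values that the large-$L$ rows settle near can be compared against a closed-form calculation for a two-state unifilar machine. The sketch below is not taken from the source: it assumes the common parameterization in which one state emits either symbol with probability 1/2 and the other state emits a single symbol with probability 1, and the helper names are hypothetical. It uses the standard expressions, entropy rate as the stationary average of per-state transition entropies and statistical complexity as the entropy of the stationary state distribution.

```python
import math


def stationary_two_state(p_ab, p_ba):
    """Stationary distribution of a two-state chain.

    p_ab is the probability of moving A -> B, p_ba of moving B -> A.
    """
    total = p_ab + p_ba
    return p_ba / total, p_ab / total


def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


p = 0.5  # assumed branching probability in state A (not given in this section)
pi_a, pi_b = stationary_two_state(p_ab=p, p_ba=1.0)

# Entropy rate: state A branches with entropy H(p); state B is deterministic.
h_mu = pi_a * entropy_bits([p, 1.0 - p]) + pi_b * entropy_bits([1.0])

# Statistical complexity: entropy of the stationary state distribution.
c_mu = entropy_bits([pi_a, pi_b])

print(f"h_mu = {h_mu:.3e} bits/symbol, C_mu = {c_mu:.3e} bits")
```

Under these assumptions the calculation gives $h_\mu = 2/3$ bits per symbol and $C_\mu \approx 0.918$ bits, which can be set beside the posterior means in the longest-subsample rows.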
Appendix C Even Process: Structural Inference
L  $h_\mu$ [bits/symbol]: posterior mean (95% CI)  $C_\mu$ [bits]: posterior mean (95% CI)

1  6.777e-01 (3.811e-02,9.994e-01)  1.480e-01 (0.000e+00,1.388e+00)
2  7.414e-01 (0.000e+00,9.997e-01)  2.222e-01 (0.000e+00,1.528e+00)
4  7.697e-01 (2.359e-01,9.996e-01)  1.191e-01 (0.000e+00,1.224e+00)
8  8.572e-01 (4.097e-01,9.998e-01)  1.249e-01 (0.000e+00,1.422e+00)
16  8.235e-01 (4.751e-01,9.998e-01)  3.080e-01 (0.000e+00,9.454e-01)
32  6.457e-01 (4.655e-01,9.616e-01)  6.909e-01 (0.000e+00,8.961e-01)
64  6.804e-01 (6.276e-01,6.942e-01)  8.746e-01 (7.675e-01,9.464e-01)
128  6.824e-01 (6.453e-01,6.942e-01)  8.854e-01 (8.166e-01,9.359e-01)
256  6.783e-01 (6.485e-01,6.939e-01)  8.993e-01 (8.568e-01,9.333e-01)
512  6.679e-01 (6.422e-01,6.868e-01)  9.151e-01 (8.890e-01,9.374e-01)
1024  6.756e-01 (6.602e-01,6.874e-01)  9.069e-01 (8.875e-01,9.243e-01)
2048  6.700e-01 (6.581e-01,6.801e-01)  9.144e-01 (9.016e-01,9.260e-01)
4096  6.666e-01 (6.578e-01,6.744e-01)  9.181e-01 (9.096e-01,9.263e-01)
8192  6.704e-01 (6.647e-01,6.757e-01)  9.142e-01 (9.080e-01,9.202e-01)
16384  6.666e-01 (6.623e-01,6.707e-01)  9.183e-01 (9.141e-01,9.225e-01)
32768  6.660e-01 (6.629e-01,6.689e-01)  9.189e-01 (9.160e-01,9.219e-01)
65536  6.657e-01 (6.635e-01,6.677e-01)  9.193e-01 (9.172e-01,9.213e-01)
131072  6.658e-01 (6.643e-01,6.672e-01)  9.192e-01 (9.177e-01,9.206e-01)
L  $h_\mu$ [bits/symbol]: posterior mean (95% CI)  $C_\mu$ [bits]: posterior mean (95% CI)  MAP topology (posterior probability)

1  7.226e-01 (1.003e-01,9.996e-01)  0.000e+00 (0.000e+00,0.000e+00)  n1k2id3 (8.570e-01)
2  8.426e-01 (3.541e-01,9.998e-01)  0.000e+00 (0.000e+00,0.000e+00)  n1k2id3 (7.893e-01)
4  8.100e-01 (2.982e-01,9.997e-01)  0.000e+00 (0.000e+00,0.000e+00)  n1k2id3 (8.721e-01)
8  9.027e-01 (5.764e-01,9.999e-01)  0.000e+00 (0.000e+00,0.000e+00)  n1k2id3 (8.626e-01)
16  9.517e-01 (7.735e-01,9.999e-01)  0.000e+00 (0.000e+00,0.000e+00)  n1k2id3 (6.023e-01)
32  6.316e-01 (4.650e-01,6.941e-01)  7.152e-01 (4.825e-01,8.861e-01)  n2k2id7 (9.434e-01)
64  6.802e-01 (6.282e-01,6.942e-01)  8.728e-01 (7.690e-01,9.445e-01)  n2k2id7 (9.941e-01)
128  6.823e-01 (6.456e-01,6.942e-01)  8.845e-01 (8.165e-01,9.351e-01)  n2k2id7 (9.973e-01)
256  6.783e-01 (6.483e-01,6.939e-01)  8.991e-01 (8.566e-01,9.334e-01)  n2k2id7 (9.989e-01)
512  6.681e-01 (6.426e-01,6.869e-01)  9.149e-01 (8.887e-01,9.370e-01)  n2k2id7 (9.995e-01)
1024  6.757e-01 (6.604e-01,6.873e-01)  9.068e-01 (8.878e-01,9.241e-01)  n2k2id7 (9.997e-01)
2048  6.700e-01 (6.581e-01,6.801e-01)  9.143e-01 (9.017e-01,9.260e-01)  n2k2id7 (9.999e-01)
4096  6.666e-01 (6.579e-01,6.744e-01)  9.181e-01 (9.096e-01,9.262e-01)  n2k2id7 (9.999e-01)
8192  6.704e-01 (6.647e-01,6.757e-01)  9.142e-01 (9.080e-01,9.202e-01)  n2k2id7 (1.000e+00)
16384  6.666e-01 (6.623e-01,6.707e-01)  9.183e-01 (9.141e-01,9.224e-01)  n2k2id7 (1.000e+00)
32768  6.660e-01 (6.629e-01,6.689e-01)  9.189e-01 (9.160e-01,9.219e-01)  n2k2id7 (1.000e+00)
65536  6.657e-01 (6.635e-01,6.678e-01)  9.193e-01 (9.172e-01,9.213e-01)  n2k2id7 (1.000e+00)
131072  6.658e-01 (6.642e-01,6.672e-01)  9.192e-01 (9.177e-01,9.207e-01)  n2k2id7 (1.000e+00)
Appendix D SNS Process: Structural Inference
L  $h_\mu$ [bits/symbol]: posterior mean (95% CI)  $C_\mu$ [bits]: posterior mean (95% CI)

1  6.780e-01 (3.817e-02,9.993e-01)  1.483e-01 (0.000e+00,1.325e+00)
2  7.425e-01 (0.000e+00,9.997e-01)  2.207e-01 (0.000e+00,1.525e+00)
4  7.698e-01 (2.398e-01,9.997e-01)  1.207e-01 (0.000e+00,1.225e+00)
8  7.781e-01 (3.449e-01,9.994e-01)  1.326e-01 (0.000e+00,1.357e+00)
16  7.952e-01 (2.702e-01,9.994e-01)  3.679e-01 (0.000e+00,2.084e+00)
32  7.555e-01 (4.978e-01,9.605e-01)  8.161e-02 (0.000e+00,8.579e-01)
64  7.228e-01 (5.935e-01,9.142e-01)  4.627e-01 (0.000e+00,1.043e+00)
128  6.808e-01 (6.365e-01,6.942e-01)  8.006e-01 (6.982e-01,8.808e-01)
256  6.756e-01 (6.411e-01,6.937e-01)  7.801e-01 (7.088e-01,8.407e-01)
512  6.799e-01 (6.562e-01,6.929e-01)  8.151e-01 (7.419e-01,1.390e+00)
1024  6.849e-01 (6.693e-01,6.931e-01)  9.021e-01 (7.717e-01,1.757e+00)
2048  6.827e-01 (6.701e-01,6.922e-01)  1.441e+00 (7.905e-01,2.219e+00)
4096  6.825e-01 (6.756e-01,6.896e-01)  1.787e+00 (1.673e+00,2.228e+00)
8192  6.828e-01 (6.782e-01,6.874e-01)  2.002e+00 (1.692e+00,2.233e+00)
16384  6.800e-01 (6.769e-01,6.832e-01)  2.198e+00 (2.168e+00,2.231e+00)
32768  6.789e-01 (6.766e-01,6.811e-01)  2.197e+00 (2.170e+00,2.229e+00)
65536  6.784e-01 (6.769e-01,6.800e-01)  2.199e+00 (2.174e+00,2.228e+00)
131072  6.788e-01 (6.777e-01,6.799e-01)  2.201e+00 (2.178e+00,2.230e+00)
L  $h_\mu$ [bits/symbol]: posterior mean (95% CI)  $C_\mu$ [bits]: posterior mean (95% CI)  MAP topology (posterior probability)

1  7.231e-01 (9.607e-02,9.996e-01)  0.000e+00 (0.000e+00,0.000e+00)  n1k2id3 (8.570e-01)
2  8.414e-01 (3.462e-01,9.998e-01)  0.000e+00 (0.000e+00,0.000e+00)  n1k2id3 (7.893e-01)
4  8.086e-01 (2.981e-01,9.997e-01)  0.000e+00 (0.000e+00,0.000e+00)  n1k2id3 (8.721e-01)
8  8.136e-01 (3.826e-01,9.996e-01)  0.000e+00 (0.000e+00,0.000e+00)  n1k2id3 (8.829e-01)
16  8.800e-01 (5.927e-01,9.997e-01)  0.000e+00 (0.000e+00,0.000e+00)  n1k2id3 (7.774e-01)
32  7.665e-01 (5.040e-01,9.641e-01)  0.000e+00 (0.000e+00,0.000e+00)  n1k2id3 (9.105e-01)
64  6.712e-01 (5.947e-01,6.942e-01)  7.842e-01 (6.406e-01,8.918e-01)  n2k2id5 (5.301e-01)
128  6.803e-01 (6.370e-01,6.942e-01)  7.981e-01 (7.021e-01,8.756e-01)  n2k2id5 (9.835e-01)
256  6.755e-01 (6.408e-01,6.937e-01)  7.786e-01 (7.083e-01,8.393e-01)  n2k2id5 (9.953e-01)
512  6.804e-01 (6.600e-01,6.928e-01)  7.887e-01 (7.416e-01,8.313e-01)  n2k2id5 (9.721e-01)
1024  6.858e-01 (6.746e-01,6.929e-01)  8.029e-01 (7.714e-01,8.321e-01)  n2k2id5 (8.989e-01)
2048  6.871e-01 (6.801e-01,6.922e-01)  8.066e-01 (7.848e-01,8.273e-01)  n2k2id5 (3.419e-01)
4096  6.826e-01 (6.760e-01,6.893e-01)  1.703e+00 (1.672e+00,1.733e+00)  n4k2id3334 (1.336e-01)
8192  6.834e-01 (6.792e-01,6.877e-01)  1.709e+00 (1.687e+00,1.730e+00)  n4k2id3334 (6.462e-02)
16384  6.800e-01 (6.769e-01,6.831e-01)  2.177e+00 (2.166e+00,2.188e+00)  n5k2id22979 (8.630e-02)
32768  6.789e-01 (6.766e-01,6.810e-01)  2.176e+00 (2.169e+00,2.184e+00)  n5k2id22979 (8.632e-02)
65536  6.784e-01 (6.769e-01,6.799e-01)  2.178e+00 (2.173e+00,2.184e+00)  n5k2id22979 (8.560e-02)
131072  6.788e-01 (6.777e-01,6.798e-01)  2.181e+00 (2.177e+00,2.185e+00)  n5k2id22979 (8.539e-02)