Tensor Variable Elimination for Plated Factor Graphs

Fritz Obermeyer* (Uber AI Labs), Eli Bingham* (Uber AI Labs), Martin Jankowiak* (Uber AI Labs), Justin Chiu (Harvard University), Neeraj Pradhan (Uber AI Labs), Alexander Rush (Harvard University), Noah Goodman (Uber AI Labs, Stanford University)

*Equal contribution. Correspondence to: Fritz Obermeyer <fritzo@uber.com>.

Keywords: Tensors, Message Passing, Probabilistic Programming

Abstract

A wide class of machine learning algorithms can be reduced to variable elimination on factor graphs. While factor graphs provide a unifying notation for these algorithms, they do not provide a compact way to express repeated structure when compared to plate diagrams for directed graphical models. To exploit efficient tensor algebra in graphs with plates of variables, we generalize undirected factor graphs to plated factor graphs and variable elimination to a tensor variable elimination algorithm that operates directly on plated factor graphs. Moreover, we generalize complexity bounds based on treewidth and characterize the class of plated factor graphs for which inference is tractable. As an application, we integrate tensor variable elimination into the Pyro probabilistic programming language to enable exact inference in discrete latent variable models with repeated structure. We validate our methods with experiments on both directed and undirected graphical models, including applications to polyphonic music modeling, animal movement modeling, and latent sentiment analysis.


1 Introduction

Factor graphs Kschischang et al. (2001) provide a unifying representation for a wide class of machine learning algorithms as undirected bipartite graphs between variables and factors. Factor graphs can be used with both directed and undirected graphical models to represent probabilistic inference algorithms performed by variable elimination Pearl (1986); Lauritzen & Spiegelhalter (1988). In the most common case, variable elimination is performed by sum-product inference, but other variable elimination algorithms can be derived through alternative semirings and adjoints.

In recent years researchers have exploited the equivalence of sum-product on discrete factor graphs and tensor contraction Hirata (2003); Smith & Gray (2018). Standard tensor contraction can provide efficient implementations of sum-product inference and the tensor contraction operations can be generalized to alternate semirings Kohlas & Wilson (2008); Belle & De Raedt (2016); Khamis et al. (2016).

Yet, a major downside of factor graphs as an intermediary representation is that they discard useful information from higher-level representations, in particular, repeated structure. Directed graphical models explicitly denote repeated structure through plate notation Buntine (1994). Plates have not seen widespread use in factor graphs or their inference algorithms (Dietz (2010) being an exception for directed factor graphs). Nor have plates been exploited by tensor contraction, despite the highly parallel nature of variable elimination algorithms. This gap can result in suboptimal algorithms, since repeated structure can provide information that can be directly exploited for inference optimizations.

In this work we consider the class of plated factor graphs and the corresponding tensor variable elimination algorithms. We propose a natural definition for plated factor graphs and lift a number of classic results and algorithms from factor graphs to the plated setting. In particular, we generalize treewidth-based bounds on computational complexity for factor graphs to bounds depending on plate sizes, and characterize the boundary between plated factor graphs leading to computational complexity either polynomial or exponential in the sizes of plates in the factor graph.

We consider several different applications of these techniques. First we describe how plated factor graphs can provide an efficient intermediate representation for generative models with discrete variables, and incorporate a tensor-based contraction into the Pyro probabilistic programming language. Next, we develop models for three real-world problems: polyphonic music modeling, animal movement modeling, and latent sentiment analysis. These models combine directed networks (Bayesian networks), undirected networks (conditional random fields), and deep neural networks. We show how plated factor graphs can be used to concisely represent structure and provide efficient general-purpose inference through tensor variable elimination.

2 Related Work

Dense sum-product problems have been studied by the HPC community under the name tensor contraction. Hirata (2003) implemented a Tensor Contraction Engine for performing optimized parallel tensor contractions, an instance of dense sum-product message passing. Solomonik et al. (2014) implement a similar framework for large-scale distributed computation of tensor contractions. Wiebe (2011) implemented a popular interface np.einsum for performing tensor contractions in NumPy. Smith & Gray (2018) implement a divide-and-conquer optimizer for einsum operations; we extended their implementation and use it as a primitive for non-plated variable elimination throughout this paper. Abseher et al. (2017) also address the problem of finding efficient junction tree decompositions.

Sparse sum-product problems have a long history in database query optimization, as they can be seen as a specific type of a join-groupby-aggregate query. Kohlas & Wilson (2008) define an abstract framework for variable elimination over valuations and semirings. Khamis et al. (2016) formulate an abstract variable elimination algorithm, prove complexity bounds, and connect these algorithms to database query optimization. Lifted inference algorithms Taghipour et al. (2013) developed for probabilistic databases are also concerned with extending classical methods for sum-product problems to exploit repeated structure or symmetry in graphical models.

A recent line of work formulates classical machine learning algorithms (e.g. solving convex optimization problems) such that they can be incorporated as differentiable submodules in neural networks (e.g. Djolonga & Krause (2017); Amos & Kolter (2017)). Our work can also be viewed in this light, with the difference that in our case the focus is not on enabling differentiability (which comes for free) but rather on efficient use of tensor abstractions.

3 Model: Plated Factor Graphs

Definition 1.

A factor graph is a bipartite graph $(V, F, E)$ whose vertices are either variables $v \in V$ or factors $f \in F$, and whose edges $(v, f) \in E \subseteq V \times F$ are pairs of vertices. We say factor $f$ involves variable $v$ iff $(v, f) \in E$. Each variable $v$ has a domain $X_v$, and each factor $f$ involving variables $v_1, \dots, v_k$ maps values $(x_{v_1}, \dots, x_{v_k})$ to scalars.

In this work we are interested in discrete factor graphs where variable domains are finite and factors are tensors with one dimension per neighboring variable. The key quantity of interest is defined through the sum-product contraction,
$$\operatorname{SumProduct}(F, V) = \sum_{x \in \prod_{v \in V} X_v} \; \prod_{f \in F} f(x),$$
where we use named tensor dimensions and assume that tensors broadcast by ignoring uninvolved variables, i.e. if $(v, f) \notin E$ then $f$ does not depend on the value $x_v$.
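For concreteness, the following minimal sketch (with made-up factor values) checks that this contraction agrees with an explicit sum over joint assignments for a toy factor graph with variables x, y and factors f(x), g(x, y):

import numpy as np

# Toy factors: f involves x (domain size 2); g involves x and y (domain size 3).
f = np.array([0.4, 0.6])
g = np.abs(np.random.randn(2, 3))

# SumProduct({f, g}, {x, y}): sum over all joint assignments of the product of factors.
brute_force = sum(f[x] * g[x, y] for x in range(2) for y in range(3))

# The same contraction expressed as an einsum over named dimensions.
via_einsum = np.einsum("x,xy->", f, g)

assert np.allclose(brute_force, via_einsum)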

Definition 2.

A plated factor graph is a labeled bipartite graph $(V, F, E, P)$ whose vertices are labeled by the plates on which they are replicated, $P : V \cup F \to 2^B$, where $B$ is a set of plates. We require that each factor is in each of the plates of its variables: $(v, f) \in E \implies P(v) \subseteq P(f)$. (While this final requirement could perhaps be relaxed, it is required if factors are to be represented as multi-dimensional tensors.)

To instantiate a plated factor graph, we assume a map $M : B \to \mathbb{N}$ that specifies the number of times to replicate each plate $b \in B$. Under this definition each plated factor $f$ is represented by a tensor with dimensions for both its plates $P(f)$ and its involved variables. Using the same partial tensor notation as above, we can access a specific grounded factor by indexing its plate dimensions, i.e. $f[i]$, where $1 \le i_b \le M(b)$ for each plate dimension $b \in P(f)$ and indexing ignores plates not in $P(f)$.

The key operation for plated factor graphs will be "unrolling" to standard factor graphs. (This operation is conceptual: it is used in our definitions and proofs, but none of our algorithms explicitly unroll factor graphs.) First define the following plate notation for either a factor or variable $w$: let $I_b(w) = \{1, \dots, M(b)\}$ if $b \in P(w)$ and $I_b(w) = \{()\}$ otherwise; this is the set of indices that index into the replicated variable or factor. Now define a function $\operatorname{unroll}_b$ to unroll a plate $b$: each variable or factor $w$ is replaced by copies $w^i$ for $i \in I_b(w)$, and an edge $(v, f) \in E$ induces edges $(v^i, f^j)$ whenever the indices $i$ and $j$ agree on plate $b$.

Sum-product contraction naturally generalizes to plated factor graphs via unrolling. We define
$$\operatorname{PlatedSumProduct}(V, F, E, P, M) = \operatorname{SumProduct}(\bar{F}, \bar{V}),$$
where $\bar{V}$ and $\bar{F}$ are constructed by unrolling each plate $b \in B$ in the original plated factor graph $(V, F, E, P)$.
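To make unrolling concrete, the sketch below (with made-up values) stores a factor replicated over a plate i as a tensor with a leading plate dimension, and checks that the sum-product of the explicitly unrolled factor graph matches the plated computation that first eliminates the plated variable and then product-reduces the plate:

import numpy as np
from itertools import product

M, X, Y = 3, 2, 2                      # plate size and variable domain sizes
h = np.abs(np.random.randn(M, X, Y))   # plated factor h[m, x, y_m] on plate i

# Unrolled computation: enumerate x and all M copies of y jointly.
unrolled = 0.0
for x in range(X):
    for ys in product(range(Y), repeat=M):
        unrolled += np.prod([h[m, x, ys[m]] for m in range(M)])

# Plated computation: eliminate y inside each plate slice, then product-reduce plate i.
plated = (h.sum(axis=2)    # SumProduct over y_m within each slice -> shape (M, X)
            .prod(axis=0)  # Product over plate i                  -> shape (X,)
            .sum())        # SumProduct over the remaining variable x
assert np.allclose(unrolled, plated)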

Example 3.1.

Consider a plated factor graph with two variables $x$ and $y$, three factors $f(x)$, $g(y)$, and $h(x, y)$, and two nested plates $i$ and $j$ (with $j$ nested inside $i$), where $y$ and $g$ are replicated over plate $i$ while $h$ is replicated over both plates.

Assuming sizes $M(i) = 2$ and $M(j) = 3$, this plated factor graph unrolls to a factor graph with variables $x, y_1, y_2$ and factors $f(x)$, $g_m(y_m)$, and $h_{m,n}(x, y_m)$ for $m \in \{1, 2\}$ and $n \in \{1, 2, 3\}$.

Example 3.2.

Consider a plated factor graph with two variables $x$ and $y$ and one factor $f(x, y)$ on two overlapping, non-nested plates $i$ and $j$, where $x$ is replicated over plate $i$, $y$ over plate $j$, and $f$ over both, denoting a Restricted Boltzmann Machine (RBM) Smolensky (1986).

Assuming sizes $M(i)$ and $M(j)$, this plated factor graph unrolls to a complete bipartite factor graph with variables $x_1, \dots, x_{M(i)}$ and $y_1, \dots, y_{M(j)}$ and a factor $f_{m,n}(x_m, y_n)$ for every pair $(m, n)$.

4 Inference: Tensor Variable Elimination

We now describe an algorithm, tensor variable elimination, for computing PlatedSumProduct on a tractable subset of plated factor graphs. We show that variable elimination cannot be generalized to run in polynomial time on all plated factor graphs, and that the algorithm succeeds for exactly those plated factor graphs on which it can run in polynomial time. Finally, we briefly discuss extensions of the algorithm, including a plated analog of the Viterbi algorithm on factor graphs.

4.1 An algorithm for tensor variable elimination

The main algorithm is formulated in terms of several standard functions: SumProduct, introduced above, computes the sum-product contraction of a set of tensors along a subset of their dimensions (here always variable dimensions) via variable elimination (available in many machine learning libraries as einsum); Product product-reduces a single tensor along a subset of its dimensions (here always plate dimensions); and Partition separates a bipartite graph into its connected components. SumProduct and Product each return a tensor, with dimensions corresponding to remaining plates and variables, unless all dimensions get reduced, in which case they return a scalar. Intuitively, SumProduct eliminates variables and Product eliminates plates.
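A minimal sketch of these three primitives, assuming dense numpy tensors whose dimensions are identified by single-character name strings (the helper names and calling conventions here are illustrative, not those of any particular library):

import numpy as np

def sum_product(factors, dims, sum_dims):
    # SumProduct: contract named tensors, summing out sum_dims (variable dimensions).
    # factors is a list of (names, tensor) pairs, e.g. ("ixy", array of shape (I, X, Y)).
    out_dims = "".join(d for d in dims if d not in sum_dims)
    equation = ",".join(names for names, _ in factors) + "->" + out_dims
    return out_dims, np.einsum(equation, *[t for _, t in factors])

def product_reduce(dims, tensor, plate_dims):
    # Product: multiply a single tensor along the given plate dimensions.
    for d in plate_dims:
        axis = dims.index(d)
        tensor = tensor.prod(axis=axis)
        dims = dims[:axis] + dims[axis + 1:]
    return dims, tensor

def partition(variables, factors, edges):
    # Partition: split a bipartite (variable, factor) graph into connected components.
    parent = {n: n for n in list(variables) + list(factors)}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n
    for v, f in edges:
        parent[find(v)] = find(f)
    components = {}
    for n in parent:
        components.setdefault(find(n), []).append(n)
    return list(components.values())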

At a high level, the strategy of the algorithm is to greedily eliminate variables and plates along a tree of factors. At each stage it picks the most deeply nested plate set, which we call a leaf plate. It eliminates all variables in exactly that plate set via standard variable elimination, producing a set of reduced factors. Each reduced factor is then replaced by a product factor that eliminates one or more plates. Repeating this procedure until no variables or plates remain, all scalar factors are finally combined with a product operation.

Algorithm 1 specifies the full algorithm. Elimination of variables and plates proceeds by modifying the input plated factor graph $(V, F, E, P)$ and a partial result $Z$ containing scalars. Both loops preserve the invariant that $(V, F, E, P)$ is a valid plated factor graph, and preserve the quantity
$$\operatorname{PlatedSumProduct}(V, F, E, P, M) \times \prod_{z \in Z} z.$$

At each leaf plate set $L$, the algorithm decomposes that plate set's factor graph into connected components. Each connected component is SumProduct-contracted to a single factor $f$ with no variables remaining in the leaf plate set $L$. If the resulting factor $f$ has no more variables, it is Product-reduced to a single scalar. Otherwise the algorithm seeks a smaller plate set $L'$ in which the remaining variables of $f$ can be eliminated; $f$ is partially Product-reduced over the plates $L \setminus L'$ to a factor that is added back to the plated factor graph, restoring edges to its remaining variables. Finally, when no more variables or plates remain, all scalar factors are product-combined by a trivial SumProduct. The algorithm can fail with an error if the search for a next plate set fails.

Algorithm 1: TensorVariableElimination
input: variables V, factors F, edges E;
input: plate sets P : V ∪ F → 2^B;
input: plate sizes M : B → N.
output: PlatedSumProduct(V, F, E, P, M) or error.
Initialize an empty list of scalars Z.
while F is not empty do
    Choose a leaf plate set L with a maximal number of plates.
    Let V_L be the variables whose plate set is exactly L.
    Let F_L be the factors whose plate set is exactly L.
    Let E_L be the edges between V_L and F_L.
    for each connected component (V_C, F_C) in Partition(V_L, F_L, E_L) do
        Let f = SumProduct(F_C, V_C).
        Let V_f be the set of f's remaining variables.
        Remove the component (V_C, F_C) from (V, F, E).
        if V_f is empty then
            Add Product(f, L) to the scalars Z.
        else
            Let L' = ∪ {P(v) : v ∈ V_f} be the next plate set where f has variables.
            if L' is not a strict subset of L then error("Intractable!")
            Let f' = Product(f, L \ L').
            Add f' to (V, F, E, P) appropriately.
return SumProduct(Z, ∅)

To help characterize when Algorithm 1 succeeds, we now introduce the concept of a graph minor.

Definition 3.

A plated graph $(V, E, P)$ (when considering graph minors, we view factor graphs as undirected graphs) is a minor of the plated graph $(V', E', P')$ if it can be obtained from $(V', E', P')$ by a sequence of edits of the form: deleting a vertex, deleting an edge, deleting a plate, or merging two vertices $v_1, v_2$ connected by an edge and in identical plates $P(v_1) = P(v_2)$.

Theorem 1.

Algorithm 1 succeeds iff the input plated factor graph has no plated graph minor consisting of two vertices $u$ and $v$ joined by an edge, together with two distinct plates $b_1 \in P(u) \setminus P(v)$ and $b_2 \in P(v) \setminus P(u)$, where both plates include variables.

Proof.

See Appendix A.1. ∎

This plated graph minor exclusion property essentially excludes the RBM Example 3.2 above, which is a minimal example of an intractable input (see Appendix C.1 for a detailed walk-through of Algorithm 1 on this model, leading to error). If Algorithm 1 fails, one could fall back to unrolling a plate and continuing (at cost exponential in plate size); we leave this for future research.

4.2 Complexity of tensor variable elimination

It is well known that message passing algorithms have computational complexity exponential in the treewidth of the input factor graph but only linear in the number of variables Chandrasekaran et al. (2012); Kwisthout et al. (2010). In this section we generalize this result to the complexity of plated message passing on tensor factors. We show that for an easily identifiable class of plated factor graphs, serial complexity is polynomial in the tensor sizes of the factors, and parallel complexity is sublinear in the size of each plate. Essentially this characterizes when the tensor size of one factor of a plated factor graph determines the treewidth of the unrolled non-plated factor graph: in the polynomial case, treewidth is independent of tensor size.

Example 4.1.

Consider the plated factor graph of Example 3.1 with nested plates. We wish to compute
$$\sum_x \sum_{y_1, \dots, y_{M(i)}} f(x) \prod_{m=1}^{M(i)} g_m(y_m) \prod_{n=1}^{M(j)} h_{m,n}(x, y_m) \;=\; \sum_x f(x) \prod_{m=1}^{M(i)} \sum_{y_m} g_m(y_m) \prod_{n=1}^{M(j)} h_{m,n}(x, y_m),$$
where we are able to commute the sums over the $y_m$ inside the product over $m$, thereby reducing an algorithm of cost exponential in $M(i)$ to a polynomial-cost algorithm.
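This identity can be checked numerically. The sketch below (with made-up tensor values) compares the brute-force contraction of the unrolled graph, whose cost is exponential in the plate size, against the factored computation that Algorithm 1 performs:

import numpy as np
from itertools import product

I, J, X, Y = 2, 3, 2, 2
f = np.abs(np.random.randn(X))           # f(x)
g = np.abs(np.random.randn(I, Y))        # g_m(y_m), replicated over plate i
h = np.abs(np.random.randn(I, J, X, Y))  # h_{m,n}(x, y_m), replicated over plates i, j

# Brute force over the unrolled graph: enumerate x and all I copies of y jointly.
brute = 0.0
for x in range(X):
    for ys in product(range(Y), repeat=I):
        term = f[x]
        for m in range(I):
            term *= g[m, ys[m]]
            for n in range(J):
                term *= h[m, n, x, ys[m]]
        brute += term

# Factored form: the sum over y_m is pushed inside the product over plate i.
per_plate = np.einsum("my,mxy->mx", g, h.prod(axis=1))  # eliminate y_m per slice
factored = (f * per_plate.prod(axis=0)).sum()           # product over i, then sum over x
assert np.allclose(brute, factored)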

Example 4.2.

Consider the plated factor graph of Example 3.2 with overlapping, non-nested plates. Although the plated factor graph is a tree of tensors, the unrolled factor graph has treewidth $\min(M(i), M(j))$; hence complexity will be exponential in the sizes of the plate dimensions $i, j$.

We can reach the same conclusion from sum-product,
$$\sum_{x_1, \dots, x_{M(i)}} \; \sum_{y_1, \dots, y_{M(j)}} \; \prod_{m=1}^{M(i)} \prod_{n=1}^{M(j)} f_{m,n}(x_m, y_n).$$
Here we cannot commute both Cartesian-product summations inside the product, and so the computation is necessarily exponential in $\min(M(i), M(j))$.

We now show Algorithm 1 accepts the largest class of plated factor graphs for which a polynomial time strategy exists.

Theorem 2.

Let $G$ be a plated factor graph. Assume variable domains have nontrivial sizes $|X_v| \ge 2$. Then Algorithm 1 succeeds on $G$ iff $\operatorname{PlatedSumProduct}(G, M)$ can be computed with complexity polynomial in the plate sizes $M$.

Proof sketch.

(see Appendix A.2 for full proof).

($\Rightarrow$) Algorithm 1 has complexity polynomial in $M$.

($\Leftarrow$) Appeal to Kwisthout et al. (2010) (which assumes the Exponential Time Hypothesis) to show that the unrolled factor graph must have uniformly bounded treewidth. Apply Ramsey theory arguments to show there is a single "plated junction tree" with certain properties. Show this tree can only exist if $G$ excludes the minor of Thm. 1, hence Algorithm 1 succeeds. ∎

When a plated factor graph is asymptotically tractable according to Thm. 2, it is also tractable on a parallel machine.

Theorem 3.

If Algorithm 1 runs in sequential time $T$ when plate sizes are given by $M$, then it runs in time $O(T/N)$ on a parallel machine with $N$ processors and perfect efficiency.

Proof.

Algorithm 1 depends on $M$ only through calls to SumProduct and Product. SumProduct operates independently over plates, and hence parallelizes with perfect efficiency. Product reductions over each plate incur only logarithmic time on a parallel machine. ∎

This result suggests that (tractable) plated factor graphs are a practical model class for modern hardware. See Sec. 6.4 for empirical verification of the computational complexity described in Thm. 3.

While tensor variable elimination enables parallelism without increasing arithmetic operation count, it also reduces overhead involved in graph manipulation. A significant portion of runtime in SumProduct is spent on constructing a junction tree. By exploiting repeated structure, tensor variable elimination can restrict its junction tree problems to much smaller factor graphs (constructed locally for each connected component of each plate set), leading to lower overhead and the opportunity to apply better heuristics (e.g. opt_einsum uses an optimal strategy for factor graphs with up to four variables).

4.3 Generic tensor variable elimination

Because Algorithm 1 is generic in its two operations (plus and multiply), in the sense of generic programming Musser & Stepanov (1988), it can immediately be repurposed to yield other algorithms on plated factor graphs, for example a polynomial-time PlatedMaxProduct algorithm that generalizes the MaxProduct algorithm on factor graphs.

Further extensions can be efficiently formulated as adjoint algorithms, which proceed by recording an adjoint compute graph alongside the forward computation and then traversing the adjoint graph backwards starting from the final result of the forward computation Darwiche (2003); Eisner (2016); Azuma et al. (2017); Belle & De Raedt (2016). These extensions include: computing marginal distributions of all variables (generalizing the forward-backward algorithm); maximum a posteriori estimation (generalizing Viterbi-like algorithms); and drawing joint samples of variables (generalizing the forward-filter backward-sample algorithm). Note that while most message passing algorithms need only assume that the sum and product operations have semiring structure Kohlas & Wilson (2008); Khamis et al. (2016), marginal computations additionally require division.
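For example, substituting max for sum in the same elimination order computes the plated max-product; a minimal numpy sketch (with made-up nonnegative factors, using the structure of Example 3.1):

import numpy as np
from itertools import product

I, J, X, Y = 2, 3, 2, 2
f = np.abs(np.random.randn(X))
g = np.abs(np.random.randn(I, Y))
h = np.abs(np.random.randn(I, J, X, Y))

# Brute-force max-product over the unrolled graph.
brute = max(
    f[x] * np.prod([g[m, ys[m]] * np.prod([h[m, n, x, ys[m]] for n in range(J)])
                    for m in range(I)])
    for x in range(X) for ys in product(range(Y), repeat=I))

# Same elimination order as the sum-product algorithm, with sum replaced by max
# (valid here because all factors are nonnegative, so max distributes over products).
per_plate = (g[:, None, :] * h.prod(axis=1)).max(axis=2)  # max over y_m -> shape (I, X)
plated = (f * per_plate.prod(axis=0)).max()
assert np.allclose(brute, plated)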

5 Application to Probabilistic Programming

Plated factor graphs provide a general-purpose intermediate representation for many applications in probabilistic modeling requiring efficient inference and easy parallelization. To make tensor variable elimination broadly usable for these applications, we integrate it into two frameworks (open-source implementations are available; see http://docs.pyro.ai/en/dev/ops.html) and use both frameworks in our experiments.

Figure 1: The tractable and intractable plated factor graphs in Examples 3.1 and 3.2 arise from the plated graphical models on the left and right, respectively. Here the two random variables $x$ and $y$ are unobserved and the random variable $z$ is observed.

5.1 Plated probabilistic programs in Pyro

First we integrate our implementation into the Pyro probabilistic programming language Bingham et al. (2018). This allows us to specify discrete latent variable models easily, programmatically constructing complex distributions and explicitly indicating repeated structure. The syntax relies on a plate context manager (we also introduce a markov context manager to deal with Markov structure). Inside these plates, sample statements are batched and assumed to be conditionally independent along the plate dimension.

Example 5.1.

The directed graphical model in Fig. 1 (left) can be specified by the Pyro program

import pyro
from pyro.distributions import Bernoulli

def model(z):
  I, J = z.shape
  # Px, Py, Pz are probability tensors defined elsewhere.
  x = pyro.sample("x", Bernoulli(Px))
  with pyro.plate("I", I, dim=-2):
    y = pyro.sample("y", Bernoulli(Py))
    with pyro.plate("J", J, dim=-1):
      pyro.sample("z", Bernoulli(Pz[x, y]),
                  obs=z)

To extract plated factor graphs from such programs without using static analysis, we use a nonstandard interpretation of Pyro sample statements Wingate et al. (2011) as vectorized enumeration over each distribution’s support. That is, when running a program forward we create a tensor at each sample site that lists all possible values of each distribution rather than draw random samples. To avoid conflict among multiple enumerated variables, we dynamically assign each variable a distinct tensor dimension along which its values are enumerated, relying on array broadcasting Walt et al. (2011) to correctly combine results of multiple nonstandard sample values (as in Pz[x,y] above).
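For instance, the exact log probability of the observed data under the program above can be computed by enumerating x and y. The sketch below assumes Pyro's TraceEnum_ELBO and config_enumerate utilities and that the probability tensors Px, Py, Pz used by the model are defined; with all discrete latent variables enumerated and an empty guide, the ELBO is exact:

import torch
from pyro.infer import TraceEnum_ELBO, config_enumerate

z = torch.ones(2, 3)  # hypothetical observed data of shape (I, J)

def guide(z):
    pass  # no latent variables remain to be sampled once x and y are enumerated

elbo = TraceEnum_ELBO(max_plate_nesting=2)
log_prob = -elbo.loss(config_enumerate(model, "parallel"), guide, z)  # = log p(z)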

5.2 Undirected graphical models using einsum

A standard interface for expressing inference in undirected graphical models is einsum notation. Basic einsum computes the product of a collection of tensors, with some dimensions preserved in the final result and remaining dimensions summed out. This operation behaves analogously to SumProduct, with the inputs to einsum given as: a string denoting the tensor dimensions of all input and output tensors, and a list of tensors. The former gives the factor graph topology and the latter the factors themselves. For example, einsum("xy,yz->xz",F,G) corresponds to matrix multiplication for matrices F,G. This einsum expression also corresponds to eliminating the variable y in a factor graph with three variable nodes x, y, z and two factors.

Plated einsum generalizes einsum notation, e.g.

    einsum("xy,iyz->xz",F,G,plates="i")
      == einsum("xy,yz,yz,yz->xz",F,G[0],G[1],G[2])

where dimensions x,z are preserved, variable y is sum-reduced, and plate i is product-reduced. More generally, factors can include plated variables, e.g. Example 3.1 can be written as

    einsum("x,iy,ijxy->",F,G,H,plates="ij")
      == einsum("x,y,z,xy,xy,xy,xz,xz,xz->",
                F,G[0],G[1],H[0,0],H[0,1],
                H[0,2],H[1,0],H[1,1],H[1,2])

where the rightmost dimension of G and H denotes a distinct variable for each slice i (y and z in the unrolled version). Thus the effective number of variables grows with the size of plate i, but the plated notation requires only a constant number of symbols. Formally we infer each variable's plate set as the largest set consistent with the input string. In addition, this version of einsum implements generic tensor variable elimination, and so it can also be used to compute marginals and draw samples with the same syntax.
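As a concrete check of the first identity above (using plain torch operations rather than any particular plated-einsum implementation), the plated contraction with plate i product-reduced can be evaluated by multiplying the slices of G before the ordinary einsum:

import torch

I = 3
F = torch.rand(4, 5)     # dims x, y
G = torch.rand(I, 5, 6)  # plate i, then dims y, z

# Right-hand side: unroll plate i into I separate factors and contract.
unrolled = torch.einsum("xy,yz,yz,yz->xz", F, G[0], G[1], G[2])

# Plated evaluation: product-reduce plate i, then sum-reduce the shared variable y.
plated = torch.einsum("xy,yz->xz", F, G.prod(dim=0))
assert torch.allclose(unrolled, plated)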

6 Experiments

We experiment with plated factor graphs as a modeling language for three tasks: polyphonic music prediction, animal movement modeling, and latent sentiment analysis. Experiments consider different variants of discrete probabilistic models and their combination with neural networks.

6.1 Hidden Markov Models with Autoregressive Likelihoods

In our first experiment we train variants of a hidden Markov model (HMM; for an introduction see e.g. Ghahramani (2001)) on a polyphonic music modeling task Boulanger-Lewandowski et al. (2012). The data consist of sequences $x_{1:T}$, where each $x_t \in \{0, 1\}^{88}$ denotes the presence or absence of 88 distinct notes. This task is a common challenge for latent variable models such as continuous state-space models (SSMs), where one of the difficulties of inference is that training is typically stochastic, i.e. the latent variables need to be sampled (this is true for all five reference models in this section, apart from the RTRBM). In contrast, the discrete latent variable models explored here, each defined through a plated factor graph, admit efficient tensor variable elimination, obviating the need for sampling.

We consider 12 different latent variable models, where the emission likelihood for each note is replicated on a plate of size 88, so that notes at each time step are conditionally independent. Writing these models as probabilistic programs allows us to easily experiment with dependency structure and parameterization, with variation along two dimensions:

  1. the dependency structure of the discrete latent variables

  2. whether the likelihood is autoregressive, i.e. whether it depends on $x_{t-1}$, and if so whether that dependence is parameterized by a neural network

Model   | JSB  | Piano | Nottingham
--------+------+-------+-----------
HMM     | 8.28 | 9.41  | 4.49
FHMM    | 8.40 | 9.55  | 4.72
PFHMM   | 8.30 | 9.49  | 4.76
2HMM    | 8.70 | 9.57  | 4.96
arHMM   | 8.00 | 7.30  | 3.29
arFHMM  | 8.22 | 7.36  | 3.57
arPFHMM | 8.39 | 9.57  | 4.82
ar2HMM  | 8.19 | 7.11  | 3.34
nnHMM   | 6.73 | 7.32  | 2.67
nnFHMM  | 6.86 | 7.41  | 2.82
nnPFHMM | 7.07 | 7.47  | 2.81
nn2HMM  | 6.78 | 7.29  | 2.81

Table 1: Negative log likelihoods for HMM variants on three polyphonic music test datasets; lower is better. See Sec. 6.1 for details.

Dependency structures include a vanilla HMM (HMM); two variants of a Factorial HMM (FHMM & PFHMM) Ghahramani & Jordan (1996); and a second-order HMM (2HMM), for which we use a parsimonious Raftery parameterization of the transition probabilities Raftery (1985). The models denoted by arXXX and nnXXX include an autoregressive likelihood: the former are explicitly parameterized with a conditional probability table, while the latter use a neural network to parameterize the likelihood. (See the supplementary materials for detailed descriptions.)
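The following is a simplified Pyro sketch of the vanilla HMM variant (not the reference implementation linked in the next paragraph; variable-length masking and the training loop are omitted, and probs_z and probs_x are assumed to be probability tensors of shapes (D, D) and (D, 88)):

import pyro
import pyro.distributions as dist

def hmm_model(sequences, probs_z, probs_x):
    # sequences: tensor of shape (num_sequences, T, 88) with 0/1 note indicators.
    num_sequences, T, num_notes = sequences.shape
    with pyro.plate("sequences", num_sequences, dim=-2):
        z = 0  # initial latent state
        for t in pyro.markov(range(T)):
            # Discrete latent state: enumerated exactly rather than sampled.
            z = pyro.sample(f"z_{t}", dist.Categorical(probs_z[z]),
                            infer={"enumerate": "parallel"})
            with pyro.plate(f"notes_{t}", num_notes, dim=-1):
                # Notes are conditionally independent given the latent state.
                pyro.sample(f"x_{t}", dist.Bernoulli(probs_x[z.squeeze(-1)]),
                            obs=sequences[:, t, :])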

We report our results in Table 1. We find that our ability to iterate over a large class of models (see https://git.io/fh95J for a reference implementation of a selection of HMM variants), in particular different latent dependency structures, each of which requires a different message passing algorithm, proved useful, since different datasets preferred different classes of models. Autoregressive models yield the best results, and factorial HMMs generally perform worse.

We note that these classic HMM variants (upgraded with neural network likelihoods) are competitive with baseline state space models, outperforming STORN Bayer & Osendorfer (2014) on 3 of 3 datasets, LV-RNN Gu et al. (2015) on 2 of 3 datasets, Deep Markov Model Krishnan et al. (2017) on 2 of 3 datasets, RTRBM Sutskever et al. (2009); Boulanger-Lewandowski et al. (2012) on 1 of 3 datasets, and TSBN Gan et al. (2015) on 3 of 3 datasets.

6.2 Hierarchical Mixed-Effect Hidden Markov Models

In our second set of experiments, we consider models for describing the movement of populations of wild animals. Recent advances in sensor technology have made it possible to capture the movements of multiple animals in a population at high spatiotemporal resolution McClintock et al. (2013). Time-inhomogeneous discrete SSMs, where the latent state encodes an individual’s behavior state (like “foraging” or “resting”) and the state transition matrix at each timestep is computed with a hierarchical discrete generalized linear mixed model, have become popular tools for data analysis thanks to their interpretability and tractability Zucchini et al. (2016); McClintock & Michelot (2018).

Rapidly iterating over different variants of such models, with nested plates and hierarchies of latent variables that couple large groups of individuals within a population, is difficult to do by hand but can be substantially simplified by expressing models as plated probabilistic programs and performing inference with tensor variable elimination (see https://git.io/fh95k for a reference implementation).

To illustrate this, we implement a version of the model selection process for movement data from a colony of harbour seals in the United Kingdom described in McClintock et al. (2013), fitting three-state hierarchical discrete SSMs with no random effects (No RE, a vanilla HMM), sex-level discrete random effects (Group RE), individual-level discrete random effects (Individual RE), and both sex- and individual-level discrete random effects (Individual+Group RE). See the supplement for details on the dataset, models and training procedure.

Model               | AIC
--------------------+----
No RE (HMM)         |
Individual RE       |
Group RE            |
Individual+Group RE |

Table 2: Akaike Information Criterion (AIC) scores for hierarchical mixed effect HMM variants fit with maximum likelihood on animal movement data. Lower is better. See Sec. 6.2.

We report AIC scores for all models in Table 2. Although our models do not exactly match those in the original analysis (see the supplement for a discussion of differences), our results support theirs in suggesting that including individual-level random effects is essential, because there is significant behavioral variation across individuals and sexes that is unexplained by the available covariates.

6.3 Latent Variable Classification

We next experiment with model flexibility by designing a conditional random field Lafferty et al. (2001) model for latent variable classification on a sentiment classification task. Experiments use the Sentihood dataset Saeidi et al. (2016), which consists of sentences containing named location entities with sentiment labels along different aspects. For a sentence $x$, labels are tuples that contain an aspect $a$, a location $l$, and a sentiment $s$. The task is to predict the sentiment of a sentence given a location and aspect; for example, for the sentence "Location1 is more expensive but has a better quality of life than Location2", the aspect price and location Location1 have negative sentiment.


Standard approaches to this sentence-level classification task use neural network models to directly make sentence-level predictions. We instead propose a latent variable approach that explicitly models the sentiment of each word with respect to all locations and aspects. This approach can provide clearer insight into the specific reasoning of the model and also permits conditional inference.


Figure 2: The graphical model for the sentiment analysis CRF in Sec. 6.3. The aspect and location are observed, while the word-level sentiments and sentence-level sentiment for the particular aspect and location pair must be inferred. See the supplementary material for the exact parameterization of the factors.

Our conditional random field model is represented as a plated factor graph in Figure 2. Here $y_t$ is the latent word-level sentiment of word $t$ for fixed aspect $a$ and location $l$. The two plated factors $\psi_1$ and $\psi_2$ represent the word aspect-location-sentiment potentials and the word-sentence potentials, respectively. We parameterize these factors with a bidirectional LSTM (BLSTM) over word embeddings of $x$ whose initial state is given by an embedding of the location $l$ and aspect $a$. We experiment with three variants (see https://github.com/justinchiu/sentclass for a reference implementation): 1) CRF-LSTM-Diag parameterizes $\psi_1$ with the output of the BLSTM and $\psi_2$ with a diagonal matrix; 2) CRF-LSTM-LSTM parameterizes both $\psi_1$ and $\psi_2$ with the output of the BLSTM; and 3) CRF-Emb-LSTM parameterizes $\psi_1$ with word embeddings and $\psi_2$ with the output of the BLSTM. As a baseline we reimplement the direct BLSTM model, LSTM-Final, from Saeidi et al. (2016). See the supplementary material for full details.
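Because the word-level sentiments are conditionally independent given the sentence-level sentiment, the sentence-level posterior reduces to a contraction that mirrors the plated structure of Figure 2. A minimal log-space sketch in torch (tensor names and shapes here are illustrative; the actual parameterizations are given in the supplement):

import torch

T, Y, S = 12, 3, 3               # words, word-level sentiments, sentence-level sentiments
log_psi1 = torch.randn(T, Y)     # word aspect-location-sentiment potentials
log_psi2 = torch.randn(T, Y, S)  # word-sentence potentials

# Eliminate each word-level sentiment y_t inside the word plate,
# then product-reduce (sum in log space) over the plate of words.
log_joint = torch.logsumexp(log_psi1.unsqueeze(-1) + log_psi2, dim=1)  # shape (T, S)
log_probs = log_joint.sum(dim=0)                                       # shape (S,)
log_probs = log_probs - torch.logsumexp(log_probs, dim=0)              # normalize over s

Conditioning instead on a fixed sentence-level sentiment gives per-word posteriors proportional to the product of the two potentials, which is consistent with how the conditional word-level sentiments in Fig. 3 are obtained.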



The accuracy of our models is given in Table 3 (the results of previous work are summarized succinctly in Liu et al. (2018)). Note that our goal is not to improve upon previous results (other models have higher accuracy); rather we aim to capitalize on the plated representation to infer latent word-level sentiment. The CRF models all achieve similar performance to the baseline LSTM-Final model. However, as a result of the factor graph representation, we can infer word-level sentiment conditioned on a sentence-level sentiment, and we observe sparsity of the conditional word-level sentiments for the CRF-Emb-LSTM model. We provide an example sentence and the inferred conditional word sentiments for the CRF-Emb-LSTM model in Fig. 3.



Model         | Acc   | F1
--------------+-------+------
LSTM-Final    | 0.821 | 0.780
CRF-LSTM-Diag | 0.805 | 0.764
CRF-LSTM-LSTM | 0.843 | 0.799
CRF-Emb-LSTM  | 0.833 | 0.779

Table 3: Test sentiment accuracies and aspect F1 scores for sentiment classification models on Sentihood. Higher is better for both metrics. Both metrics are calculated by ignoring examples with gold labelled 'none' sentiments; see Sec. 6.3 for details.
Figure 3: The inferred word-level sentiment for aspect transit-location and location Loc1 conditioned on a negative sentence-level sentiment. The ellipsis contains the elided words ‘is a great place to live’, all of which get none sentiment. The model assigns the none sentiment to most words, preventing them from influencing the polarity of the sentence-level sentiment.

6.4 Performance Evaluation

To measure the performance of our implementation we use the benchmark plated model of Fig. 4 (see Appendix C.2 for a detailed walkthrough of Algorithm 1 on this model). We compute PlatedSumProduct and four different adjoint operations: gradient, marginal, sample, and MAP. Figure 5 shows results obtained on an Nvidia Quadro P6000 GPU. We find that runtime is approximately constant until the GPU is saturated, and approximately linear in plate size for larger plate sizes, empirically validating Thm 3.


Figure 4: The tractable model used in our performance benchmark. Note that this model would be intractable if there were a factor directly connecting variables and .

Figure 5: Runtime of Algorithm 1 as we vary plate size (where ) and fix domain size for all variables.

7 Conclusion

This work argues for plated factor graphs as an intermediate representation that preserves shared structure for calculations with discrete random variables, and develops a tensor variable elimination algorithm for efficiently computing plated sum-product and related factor graph queries for a subset of plated models. We show how this approach can be used to compute key quantities for directed and undirected graphical models and demonstrate its implementation as a general purpose intermediary representation for probabilistic programming with discrete random variables. Applications further demonstrate that this provides a simple, flexible, and efficient framework for working with discrete graphical models, and that these models can often outperform more complicated counterparts, while providing interpretable latent representations. The work itself is integrated into a widely used probabilistic programming system, and we hope it provides a framework for experiments incorporating discrete variable models into large-scale systems.

Acknowledgements

We would like to thank many people for early discussions and feedback, including JP Chen, Theofanis Karaletsos, Ruy Ley-Wild, and Lawrence Murray. MJ would like to thank Felipe Petroski Such for help with infrastructure for efficient distribution of experiments.

References

  • Abseher et al. (2017) Abseher, M., Musliu, N., and Woltran, S. Improving the efficiency of dynamic programming on tree decompositions via machine learning. Journal of Artificial Intelligence Research, 58:829–858, 2017.
  • Amos & Kolter (2017) Amos, B. and Kolter, J. Z. Optnet: Differentiable optimization as a layer in neural networks. arXiv preprint arXiv:1703.00443, 2017.
  • Azuma et al. (2017) Azuma, A., Shimbo, M., and Matsumoto, Y. An algebraic formalization of forward and forward-backward algorithms. arXiv preprint arXiv:1702.06941, 2017.
  • Bayer & Osendorfer (2014) Bayer, J. and Osendorfer, C. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
  • Belle & De Raedt (2016) Belle, V. and De Raedt, L. Semiring programming: A framework for search, inference and learning. arXiv preprint arXiv:1609.06954, 2016.
  • Bingham et al. (2018) Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Horsfall, P., and Goodman, N. D. Pyro: Deep Universal Probabilistic Programming. arXiv preprint arXiv:1810.09538, 2018.
  • Boulanger-Lewandowski et al. (2012) Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392, 2012.
  • Buntine (1994) Buntine, W. L. Operations for learning with graphical models. Journal of artificial intelligence research, 2:159–225, 1994.
  • Chandrasekaran et al. (2012) Chandrasekaran, V., Srebro, N., and Harsha, P. Complexity of inference in graphical models. arXiv preprint arXiv:1206.3240, 2012.
  • Darwiche (2003) Darwiche, A. A differential approach to inference in bayesian networks. Journal of the ACM (JACM), 50(3):280–305, 2003.
  • Dietz (2010) Dietz, L. Directed factor graph notation for generative models. Max Planck Institute for Informatics, Tech. Rep, 2010.
  • Djolonga & Krause (2017) Djolonga, J. and Krause, A. Differentiable learning of submodular models. In Advances in Neural Information Processing Systems, pp. 1013–1023, 2017.
  • Eisner (2016) Eisner, J. Inside-outside and forward-backward algorithms are just backprop (tutorial paper). In Proceedings of the Workshop on Structured Prediction for NLP, pp. 1–17, 2016.
  • Gan et al. (2015) Gan, Z., Li, C., Henao, R., Carlson, D. E., and Carin, L. Deep temporal sigmoid belief networks for sequence modeling. In Advances in Neural Information Processing Systems, pp. 2467–2475, 2015.
  • Ghahramani (2001) Ghahramani, Z. An introduction to hidden markov models and bayesian networks. International journal of pattern recognition and artificial intelligence, 15(01):9–42, 2001.
  • Ghahramani & Jordan (1996) Ghahramani, Z. and Jordan, M. I. Factorial hidden markov models. In Advances in Neural Information Processing Systems, pp. 472–478, 1996.
  • Gu et al. (2015) Gu, S. S., Ghahramani, Z., and Turner, R. E. Neural adaptive sequential monte carlo. In Advances in Neural Information Processing Systems, pp. 2629–2637, 2015.
  • Hirata (2003) Hirata, S. Tensor contraction engine: Abstraction and automated parallel implementation of configuration-interaction, coupled-cluster, and many-body perturbation theories. The Journal of Physical Chemistry A, 107(46):9887–9897, 2003. URL http://iqc.udg.es/articles/cc/2003-jpca-hirata.pdf.
  • Honnibal & Montani (2017) Honnibal, M. and Montani, I. spacy 2: Natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017.
  • Khamis et al. (2016) Khamis, M. A., Ngo, H. Q., and Rudra, A. Faq: questions asked frequently. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 13–28. ACM, 2016.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kohlas & Wilson (2008) Kohlas, J. and Wilson, N. Semiring induced valuation algebras: Exact and approximate local computation algorithms. Artificial Intelligence, 172(11):1360–1399, 2008.
  • Krishnan et al. (2017) Krishnan, R. G., Shalit, U., and Sontag, D. Structured inference networks for nonlinear state space models. In AAAI, pp. 2101–2109, 2017.
  • Kschischang et al. (2001) Kschischang, F. R., Frey, B. J., and Loeliger, H.-A. Factor graphs and the sum-product algorithm. IEEE Transactions on information theory, 47(2):498–519, 2001.
  • Kwisthout et al. (2010) Kwisthout, J., Bodlaender, H. L., and van der Gaag, L. C. The necessity of bounded treewidth for efficient inference in bayesian networks. In ECAI, volume 215, pp. 237–242, 2010.
  • Lafferty et al. (2001) Lafferty, J. D., McCallum, A., and Pereira, F. C. N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pp. 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1-55860-778-1. URL http://dl.acm.org/citation.cfm?id=645530.655813.
  • Lauritzen & Spiegelhalter (1988) Lauritzen, S. L. and Spiegelhalter, D. J. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society. Series B (Methodological), pp. 157–224, 1988.
  • Liu et al. (2018) Liu, F., Cohn, T., and Baldwin, T. Recurrent entity networks with delayed memory update for targeted aspect-based sentiment analysis. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 278–283. Association for Computational Linguistics, 2018. doi: 10.18653/v1/N18-2045. URL http://aclweb.org/anthology/N18-2045.
  • Ma et al. (2018) Ma, Y., Peng, H., Khan, T., Cambria, E., and Hussain, A. Sentic LSTM: a hybrid network for targeted aspect-based sentiment analysis. Cognitive Computation, 10(4):639–650, 2018.
  • McClintock & Michelot (2018) McClintock, B. T. and Michelot, T. momentuhmm: R package for generalized hidden markov models of animal movement. Methods in Ecology and Evolution, 9(6):1518–1530, 2018. doi: 10.1111/2041-210X.12995. URL https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.12995.
  • McClintock et al. (2013) McClintock, B. T., Russell, D. J., Matthiopoulos, J., and King, R. Combining individual animal movement and ancillary biotelemetry data to investigate population-level activity budgets. Ecology, 94(4):838–849, 2013.
  • Musser & Stepanov (1988) Musser, D. R. and Stepanov, A. A. Generic programming. In International Symposium on Symbolic and Algebraic Computation, pp. 13–25. Springer, 1988.
  • Pearl (1986) Pearl, J. Fusion, propagation, and structuring in belief networks. Artificial intelligence, 29(3):241–288, 1986.
  • Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In In EMNLP, 2014.
  • Raftery (1985) Raftery, A. E. A model for high-order markov chains. Journal of the Royal Statistical Society. Series B (Methodological), pp. 528–539, 1985.
  • Ramsey (1930) Ramsey, F. P. On a problem of formal logic. In Classic Papers in Combinatorics, pp. 1–24. Springer, 1930.
  • Saeidi et al. (2016) Saeidi, M., Bouchard, G., Liakata, M., and Riedel, S. Sentihood: Targeted aspect based sentiment analysis dataset for urban neighbourhoods. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 1546–1556. The COLING 2016 Organizing Committee, 2016. URL http://aclweb.org/anthology/C16-1146.
  • Smith & Gray (2018) Smith, D. G. A. and Gray, J. opt_einsum - a python package for optimizing contraction order for einsum-like expressions. Journal of Open Source Software, 3(26):753, 2018. URL https://doi.org/10.21105/joss.00753.
  • Smolensky (1986) Smolensky, P. Information processing in dynamical systems: Foundations of harmony theory. Technical report, COLORADO UNIV AT BOULDER DEPT OF COMPUTER SCIENCE, 1986.
  • Solomonik et al. (2014) Solomonik, E., Matthews, D., Hammond, J. R., Stanton, J. F., and Demmel, J. A massively parallel tensor contraction framework for coupled-cluster computations. Journal of Parallel and Distributed Computing, 74(12):3176–3190, 2014.
  • Sutskever et al. (2009) Sutskever, I., Hinton, G. E., and Taylor, G. W. The recurrent temporal restricted boltzmann machine. In Advances in neural information processing systems, pp. 1601–1608, 2009.
  • Taghipour et al. (2013) Taghipour, N., Fierens, D., Davis, J., and Blockeel, H. Lifted variable elimination: Decoupling the operators from the constraint language. Journal of Artificial Intelligence Research, 47:393–439, 2013.
  • Towner et al. (2016) Towner, A. V., Leos-Barajas, V., Langrock, R., Schick, R. S., Smale, M. J., Kaschke, T., Jewell, O. J., and Papastamatiou, Y. P. Sex-specific and individual preferences for hunting strategies in white sharks. Functional Ecology, 30(8):1397–1407, 2016.
  • Walt et al. (2011) Walt, S. v. d., Colbert, S. C., and Varoquaux, G. The numpy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.
  • Wiebe (2011) Wiebe, M., January 2011. URL https://mail.python.org/pipermail/numpy-discussion/2011-January/054586.html.
  • Wingate et al. (2011) Wingate, D., Goodman, N., Stuhlmüller, A., and Siskind, J. M. Nonstandard interpretations of probabilistic programs for efficient inference. In Advances in Neural Information Processing Systems, pp. 1152–1160, 2011.
  • Zucchini et al. (2016) Zucchini, W., MacDonald, I. L., and Langrock, R. Hidden Markov models for time series: an introduction using R. Chapman and Hall/CRC, 2016.



Appendix A Proofs

A.1 Proof of Theorem 1

Note that while the ($\Rightarrow$) direction would follow from Theorem 4 below, we provide an independent proof that does not rely on the Exponential Time Hypothesis.

Proof of Theorem 1.

($\Leftarrow$) Suppose by way of contradiction that Algorithm 1 has failed. Failure means that for some contracted factor $f$ in a leaf plate set $L$, the next plate set $L'$ covering $f$'s remaining variables is not a strict subset of $L$, even though each remaining variable lies in strictly fewer plates than $L$. There must therefore be at least two distinct variables $v_1, v_2$ and two distinct plates $b_1, b_2$ such that $b_1 \in P(v_1) \setminus P(v_2)$ and $b_2 \in P(v_2) \setminus P(v_1)$. Taking the minor with vertices $v_1$ and $v_2$ joined by an edge (through the factor $f$) provides the required graph minor.

($\Rightarrow$) Suppose by way of contradiction that $G$ has such a graph minor. Algorithm 1 must eventually either fail or reach a factor $f_1$ in a leaf plate set where the plate $b_1$ is Product-reduced. Similarly at another time, the algorithm must reach a factor $f_2$ in a plate set where the plate $b_2$ is Product-reduced.

Consider the provenance of factors $f_1$ and $f_2$. Since the minor contains a common factor in both plates $b_1, b_2$, factors $f_1$ and $f_2$ must share that factor as a common ancestor. Note that in Algorithm 1's search for a next plate set $L'$, the plate set is strictly decreasing. Thus if $f_1$ and $f_2$ share a common ancestor, then either $b_2 \in P(f_1)$ or $b_1 \in P(f_2)$. But this contradicts the choice of $f_1$ and $f_2$ above. ∎

A.2 Proof of Theorem 2

We first define plated junction trees, then prove a crucial lemma, and finally prove a stronger form of Theorem 2.

Definition 4.

A junction tree (or tree decomposition) of a factor graph $(V, F, E)$ is a tree $T$ whose vertices are subsets of variables $V_t \subseteq V$ such that: (i) each variable is in at least one junction tree node (i.e. $\bigcup_t V_t = V$); (ii) for each factor $f$, there is a junction tree vertex containing all of that factor's variables; and (iii) each variable appears in nodes forming a contiguous subtree of $T$.

Definition 5.

A plated junction tree of a plated factor graph $(V, F, E, P)$ is a plated graph $(T, P_T)$ where $T$ is a junction tree of the factor graph, $P_T$ labels junction tree vertices with plates, and (iv) each variable $v$ appears in some junction tree vertex with exactly the same plates $P(v)$.

We extend unrolling from plated factor graphs to plated junction trees in the obvious way. Note that a plated junction tree may unroll to a junction graph that is not a tree, since unrolling may introduce cycles.

Definition 6.

The width of a junction tree is the size of its largest vertex. The treewidth of a factor graph is the width of its narrowest junction tree.

Lemma A.1.

Let $G$ be a plated factor graph and $c \in \mathbb{N}$. If for every plate size assignment $M$ there is a junction tree $T_M$ of the unrolled factor graph with width at most $c$, then there is a single plated junction tree $T$ of $G$ such that for every size assignment $M$, the unrolled $T$ is a junction tree of width at most $c$.

Proof.

By induction on it suffices to show that, splitting off a single plate , if there is a family of plated junction trees satisfying

(1)

then for a single plated junction tree ,

(2)

Thus let us assume the hypothesis (1) and prove (2).

Our strategy is to progressively strengthen the hypothesis (1) by forcing symmetry properties of each using Ramsey-theory arguments. Eventually we will show that for any , there is a plated junction tree satisfying hypothesis (1) that is the unrolled plated junction tree for some plated junction tree . Choosing , it follows that satisfies (2).

First we force to have nodes that are symmetric along plate . Split variables into plated and unplated sets. For each and color all plate indices by a coloring function

where we use the shorthand . By choosing , we can find with a subset of plate indices all of a single color. Rename these plate indices to construct a new whose vertices are symmetric in this sense.

Next for each , color all pairs of plate indices by a coloring function

By Ramsey’s theorem Ramsey (1930) we can choose a sufficiently large such that has a subset of plates such that all pairs have a single color. Indeed by choosing , we can force the color to be 0, since the contiguity property (iii) of Definition 4 would require a single junction tree node to contain at least vertices, violating our hypothesis.

At this point we have forced to be symmetric in plate indices up to duplicates, but duplicates may have been introduced in selecting out plate indices from a larger set (e.g. upon removing plate index 4, becomes ). We now show that these duplicates can be removed by merging them into other nodes.

Let be two duplicate nodes in the plated junction tree , so that . Let be the path connecting (with possibly ). By the contiguity property (iii) of Definition 4, must have at least the variables of , hence must have at least as deep plate nesting, i.e.  . Replace the edge with a new edge . No cycles can have been created, since is more deeply plated than . No instances of the forbidden graph minor can have been created, since the new edge lies in a single plate. Indeed hypothesis 1 is still preserved, and symmetry of is preserved. Merge into . Again the hypothesis and symmetry are preserved. Iterating this merging process we can force to have no duplicate nodes.

Now since is symmetric in plate and is defined in terms of , also must be symmetric in plate . We next force to be symmetric along plate .

For each and , color all plate indices by a coloring function

By choosing , we can find a with a subset of plate indices with symmetric within-plate-index edges.

For each and , color all pairs of plate indices by a coloring function

By Ramsey’s theorem we can choose a sufficiently large such that has a subset of plates such that all pairs have a single color. Indeed we can force the color to be 0, since for any complete bipartite graph would create cycles, violating the tree assumption. Hence there are no between-plate-index edges.

At this point, we can construct a that is symmetric in plate indices and such that never contains edges between nodes with -plated variables with different indices. Hence can be rolled into a plated junction tree that always unrolls to a junction tree. Finally, since never contains -plated variables with more than one index,

We now prove a strengthening of Theorem 2.

Theorem 4.

Let $G$ be a plated factor graph with nontrivial variable domains $|X_v| \ge 2$, and let $M$ be plate sizes. The following are equivalent:

  1. Algorithm 1 succeeds on $G$.

  2. $\operatorname{PlatedSumProduct}(G, M)$ can be computed with complexity polynomial in $M$.

  3. The treewidth of $G$'s unrolled factor graph is (asymptotically) independent of plate sizes $M$.

  4. There is a plated junction tree of $G$ that unrolls to a junction tree for all plate sizes $M$.

  5. There is a plated junction tree of $G$ that has no plated graph minor consisting of two vertices $u, v$ joined by an edge with plates $b_1 \in P(u) \setminus P(v)$ and $b_2 \in P(v) \setminus P(u)$.

  6. $G$ has no plated graph minor of the form excluded in Theorem 1.

Proof.

(1 ⟹ 2) Algorithm 1 has complexity polynomial in $M$, since both SumProduct and Product are polynomial in $M$ and the while loop is bounded independently of $M$.

(2 ⟹ 3) Apply Kwisthout et al. (2010) to the unrolled factor graph.

(3 ⟹ 4) Apply Lemma A.1.

(4 ⟹ 5) Any plated junction tree with the plated graph minor in (5) would unroll to a non-tree, since the minor would induce cycles when the two plate sizes are at least 2 (as in Example 3.2). Hence the plated junction tree of (4) must satisfy (5).

(5 ⟹ 6) If $G$ has such a plated graph minor, then any plated junction tree must have the corresponding plated graph minor.

(6 ⟹ 1) Apply Theorem 1. ∎

Proof of Theorem 2.

Apply Theorem 4 ((1) ⟺ (2)). ∎

Appendix B Experimental details

B.1 Hidden Markov Models with Autoregressive Likelihoods

The joint probability (for a single sequence) for the HMM model is given by
$$p(x_{1:T}, z_{1:T}) = \prod_{t=1}^{T} p(z_t \mid z_{t-1}) \, p(x_t \mid z_t), \tag{3}$$
where $z_{1:T}$ are the discrete latent variables and $x_{1:T}$ is the sequence of observations. For all twelve models the likelihood $p(x_t \mid z_t)$ is given by a Bernoulli distribution factorized over the 88 distinct notes. The two Factorial HMMs have two discrete latent variables at each timestep: $z^1_t$ and $z^2_t$. They differ in the dependence structure of $z^1_t$ and $z^2_t$. In particular for the FHMM the joint probability is given by
$$p(x_{1:T}, z^1_{1:T}, z^2_{1:T}) = \prod_{t=1}^{T} p(z^1_t \mid z^1_{t-1}) \, p(z^2_t \mid z^2_{t-1}) \, p(x_t \mid z^1_t, z^2_t), \tag{4}$$

i.e. the dependency structure of the discrete latent variables factorizes at each timestep, while for the PFHMM (i.e. Partially Factorial HMM) the joint probability is given by

(5)

All models correspond to tractable plated factor graphs (in the sense used in the main text) and admit efficient maximum likelihood gradient-based training.

The autoregressive models have the same dependency structure as in Eqn. 3-5, with the difference that the likelihood term at each timestep has an additional dependence on $x_{t-1}$. For the four arXXX models, this dependence is explicitly parameterized with a conditional probability table, with the likelihood for each note depending explicitly on that note's previous value (but not on the previous values of other notes). For the four nnXXX models this dependence is parameterized by a neural network so that the likelihood depends on the entire vector of notes $x_{t-1}$. In detail the computational path of the neural network is as follows. First, a 1-dimensional convolution with a kernel size of three is applied to $x_{t-1}$ to produce a hidden representation. We then apply separate affine transformations to the latent variables (either $z_t$ or $z^1_t$ and $z^2_t$) and to this hidden representation, add together the resulting hidden representations, and apply a ReLU non-linearity. A final affine transformation then maps the hidden units to the 88-dimensional logits space of the Bernoulli likelihood. We fix the number of hidden units to 50.
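A sketch of this neural likelihood (the channel count, padding, and module names are illustrative assumptions; only the overall computational path described above is taken from the text):

import torch
import torch.nn as nn

class NeuralEmitter(nn.Module):
    # Maps (z_t, x_{t-1}) to 88 Bernoulli logits for the nnXXX model variants.
    def __init__(self, num_states, num_notes=88, channels=16, hidden=50):
        super().__init__()
        self.conv = nn.Conv1d(1, channels, kernel_size=3, padding=1)
        self.lin_z = nn.Linear(num_states, hidden)            # affine map of one-hot latent state
        self.lin_x = nn.Linear(channels * num_notes, hidden)  # affine map of conv features
        self.out = nn.Linear(hidden, num_notes)

    def forward(self, z_onehot, x_prev):
        # x_prev: (batch, 88) previous notes; z_onehot: (batch, num_states).
        h_x = self.conv(x_prev.unsqueeze(1)).flatten(1)   # (batch, channels * 88)
        h = torch.relu(self.lin_z(z_onehot) + self.lin_x(h_x))
        return self.out(h)                                # (batch, 88) Bernoulli logits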

We evaluate our models on three of the polyphonic music datasets considered in Boulanger-Lewandowski et al. (2012), using the same train/test splits. Each dataset contains at least 7 hours of polyphonic music; after pre-processing each dataset consists of sequences of 88-dimensional binary note vectors. For each model we do a grid search over hyperparameters, train the model to approximate convergence, and report test log likelihoods on the held-out test set (normalized per timestep). For all models except the second-order HMMs we vary the hidden dimension $D$. For the Factorial HMMs $D$ is interpreted as the size of the entire latent space at each timestep, so that the dimension of each of the two discrete latent variables $z^1_t$ and $z^2_t$ at each timestep is given by $\sqrt{D}$. For the second-order HMMs we vary the hidden dimension over a more limited range so as to limit total memory usage (which scales as $D^{n+1}$ for an $n$-th order HMM).

We use the Adam optimizer with default momentum hyperparameters Kingma & Ba (2014); over the course of training we decay the learning rate geometrically from its initial value to a smaller final value. For the JSB and Piano datasets we train for up to 300 epochs, while for the Nottingham dataset we train for up to 200 epochs. We follow (noisy) gradient estimates of the log likelihood by subsampling the data into mini-batches of sequences; we use mini-batch sizes of 20, 15, and 30 for the JSB, Piano, and Nottingham datasets, respectively. We clamp all probabilities away from 0 and 1 to avoid numerical instabilities. In order to better avoid bad local minima, for each hyperparameter setting we train with 4 (2) different random number seeds for models without (with) neural networks in the likelihood, respectively. For each dataset we then report results for the model with the best training log likelihood.

B.2 Hierarchical Mixed Effect Hidden Markov Models

B.2.1 Harbour seal tracking dataset details

We downloaded the data from momentuHMM (McClintock & Michelot (2018)), an R package for analyzing animal movement data with generalized hidden Markov models. The raw datapoints are in the form of irregularly sampled time series (datapoints separated by 5-15 minutes on average) of GPS coordinates and diving activity for each individual in the colony (10 males and 7 females) over the course of a single day, recorded by lightweight tracking devices physically attached to each animal by researchers. We used the momentuHMM harbour seal example preprocessing code (https://github.com/bmcclintock/momentuHMM/blob/master/vignettes/harbourSealExample.R, whose functionality is described in detail in section 3.7 of McClintock & Michelot (2018)) to independently convert the raw data for each individual into smoothed, temporally regular time series of step sizes, turn angles, and diving activity, saving the results and using them for our population-level analysis.

B.2.2 Model details

Figure 6: A single state transition in the hierarchical mixed-effect hidden Markov model used in our experiments in Section 6.2 and described below. s and s are learnable parameters. We omit fixed effects from the diagram as there were none in our experiments.

Our models are special cases of a time-inhomogeneous discrete state space model whose state transition distribution is specified by a hierarchical generalized linear mixed model (GLMM). At each timestep , for each individual trajectory in each group (male and female in our experiments), we have

where correspond to plate indices, s are independent random variables, s are covariates, and s are parameter vectors. See Fig. 6 for the corresponding plate diagram. The models in our experiments did not include fixed effects as there were no covariates in the harbour seal data, but they are frequently used in the literature with similar models (see e.g. Towner et al. (2016)) so we include them in our general definition.

The values of the independent random variable and are each sampled from a set of three possible values shared across the individual and group plates, respectively. That is, for each individual trajectory in each group , we sample single random effect values for an entire trajectory:

Note that each is a matrix, where is the number of hidden states per timestep in the HMM.

Observations are represented as sequences of real-valued step lengths and turn angles, modelled by zero-inflated Gamma and von Mises likelihoods respectively. The seal models also include a third observed variable indicating the amount of diving activity between successive locations, which we model with a zero-inflated Beta distribution following McClintock & Michelot (2018). We grouped animals by sex and implemented versions of this model with (i) no random effects (as a baseline), and with random effects present at the (ii) group, (iii) individual, or (iv) group+individual levels.

We chose the Gamma and von Mises likelihoods because they were the ones most frequently used in other papers describing the application of similar models to animal movement data, e.g. Towner et al. (2016); another combination, used in McClintock & Michelot (2018), modelled step length with a Weibull distribution and turn angle with a wrapped Cauchy distribution.

Unlike our models, the models in McClintock et al. (2013); McClintock & Michelot (2018) incorporate substantial additional prior knowledge in the form of hard constraints on various parameters and random variables (e.g. a maximum step length corresponding to the maximum distance a harbour seal can swim in the interval of a single timestep).

B.2.3 Training details

We used the Adam optimizer with default momentum hyperparameters Kingma & Ba (2014), annealing the learning rate geometrically when the training loss stopped decreasing. We trained the models with restarts from random initializations, using batch gradient descent because the number of individuals (17) was relatively small. The number of random effect parameter values was taken from McClintock & Michelot (2018) and all other hyperparameters were set by choosing the best values on the model with no random effects.

B.3 Sentiment Analysis

B.3.1 Dataset details

Sentences in Sentihood were collected from Yahoo! Answers by filtering for answers about neighbourhoods of London. Specific neighbourhood mentions were replaced with generic location1 or location2 tokens. We follow the previous work Saeidi et al. (2016); Ma et al. (2018); Liu et al. (2018) and restrict training and evaluation to the 4 most common aspects: price, safety, transit-location, and general.

To give a clearer picture of the task, consider the sentence "Location1 is more expensive but has a better quality of life than Location2". The labels encode that with respect to Location1 the sentence is negative in aspect price, but positive in general. Similarly Location2 would have the opposite sentiments in those two aspects. The remaining aspects for both locations would have the none sentiment.

We preprocess the text with SpaCy Honnibal & Montani (2017) and lowercase all words. We append reserved start-of-sentence and end-of-sentence symbols to all sentences.




B.3.2 Model details

Our reimplemented LSTM-Final baseline uses a BLSTM on top of word embeddings. The hidden state of the BLSTM is initialized with an embedding of the location and aspect. A projection of the final hidden state is then used to classify the sentence-level sentiment by applying the softmax transformation. For the CRF-LSTM-Diag model, the potentials of the graphical model are given by:

(6)

where contains a dimensional vector for each sentiment, is the dimensionality of the LSTM, and the output of the LSTM is of dimension . The LSTM function takes as input a sequence of word embeddings as well as an initial hidden state given by embedding the location and aspect. The Emb function projects the words or location and aspect into a low-dimensional representation. The potentials of the CRF-LSTM-LSTM model are given by:

(7)

where is a matrix that dictates the interaction between positive/negative word sentiment and positive/negative sentence sentiment. The projection where is the dimensionality of the LSTM’s output. is then reshaped into an matrix. The potentials of the CRF-Emb-LSTM model, are similar:

(8)