Stochastic And-Or Grammars: A Unified Framework and Logic Perspective†
†This work was supported by the National Natural Science Foundation of China (61503248).
Abstract
Stochastic And-Or grammars (AOGs) extend traditional stochastic grammars of language to model other types of data such as images and events. In this paper we propose a representation framework of stochastic AOGs that is agnostic to the type of data being modeled and thus unifies various domain-specific AOGs. Many existing grammar formalisms and probabilistic models in natural language processing, computer vision, and machine learning can be seen as special cases of this framework. We also propose a domain-independent inference algorithm for stochastic context-free AOGs and show its tractability under a reasonable assumption. Furthermore, we provide two interpretations of stochastic context-free AOGs as a subset of probabilistic logic, which connect stochastic AOGs to the field of statistical relational learning and clarify their relation with a few existing statistical relational models.
1 Introduction
Formal grammars are a popular class of knowledge representation that is traditionally confined to the modeling of natural and computer languages. However, several extensions of grammars have been proposed over time to model other types of data such as images [1, 2, 3] and events [4, 5, 6]. One prominent type of extension is stochastic And-Or grammars (AOGs) [2]. A stochastic AOG simultaneously models compositions (i.e., a large pattern is the composition of several small patterns arranged according to a certain configuration) and reconfigurations (i.e., a pattern may have several alternative configurations), and in this way it can compactly represent a probability distribution over a large number of patterns. Stochastic AOGs can be used to parse data samples into their compositional structures, which help solve multiple tasks (such as classification, annotation, and segmentation of the data samples) in a unified manner. In this paper we will focus on the context-free subclass of stochastic AOGs, which serves as the skeleton in building more advanced stochastic AOGs.
Several variants of stochastic AOGs and their inference algorithms have been proposed in the literature to model different types of data and solve different problems, such as image scene parsing [7] and video event parsing [6]. Our first contribution in this paper is that we provide a unified representation framework of stochastic AOGs that is agnostic to the type of data being modeled; in addition, based on this framework we propose a domain-independent inference algorithm that is tractable under a reasonable assumption. The benefits of a unified framework of stochastic AOGs include the following. First, such a framework can help us generalize and improve existing ad hoc approaches for modeling, inference, and learning with stochastic AOGs. Second, it also facilitates applications of stochastic AOGs to novel data types and problems and enables the research of general-purpose inference and learning algorithms of stochastic AOGs. Further, a formal definition of stochastic AOGs as abstract probabilistic models makes it easier to theoretically examine their relation with other models such as constraint-based grammar formalisms [8] and sum-product networks [9]. In fact, we will show that many of these related models can be seen as special cases of stochastic AOGs.
Stochastic AOGs model compositional structures based on the relations between sub-patterns. Such probabilistic modeling of relational structures is traditionally studied in the field of statistical relational learning [10]. Our second contribution is that we provide probabilistic logic interpretations of the unified representation framework of stochastic AOGs and thus show that stochastic AOGs can be seen as a novel type of statistical relational model. The logic interpretations help clarify the relation between stochastic AOGs and a few existing statistical relational models and probabilistic logics that share certain features with stochastic AOGs (e.g., tractable Markov logic [11] and stochastic logic programs [12]). They may also facilitate the incorporation of ideas from statistical relational learning into the study of stochastic AOGs and at the same time contribute to the research of novel (tractable) statistical relational models.
2 Stochastic And-Or Grammars
An AOG is an extension of a constituency grammar used in natural language parsing [13]. Similar to a constituency grammar, an AOG defines a set of valid hierarchical compositions of atomic entities. However, an AOG differs from a constituency grammar in that it allows atomic entities other than words and compositional relations other than string concatenation. A stochastic AOG models the uncertainty in the composition by defining a probability distribution over the set of valid compositions.
Stochastic AOGs were first proposed to model images [2, 7, 14, 15], in particular the spatial composition of objects and scenes from atomic visual words (e.g., Gabor bases). They were later extended to model events, in particular the temporal and causal composition of events from atomic actions [6] and fluents [16]. More recently, these two types of AOGs were used jointly to model objects, scenes, and events from the simultaneous input of video and text [17].
In each of these previous works using stochastic AOGs, a different type of data is modeled with domain-specific and problem-specific definitions of atomic entities and compositions. Tu et al. [18] provided a first attempt towards a more unified definition of stochastic AOGs that is agnostic to the type of the data being modeled. We refine and extend their work by introducing parameterized patterns and relations in the unified definition, which allows us to reduce a wide range of related models to AOGs (as will be discussed in section 2.1). Based on the unified framework of stochastic AOGs, we also propose a domain-independent inference algorithm and study its tractability (section 2.2). Below we start with the definition of stochastic context-free AOGs, which are the most basic form of stochastic AOGs and are used as the skeleton in building more advanced stochastic AOGs.
A stochastic context-free AOG is defined as a 5-tuple ⟨Σ, N, S, θ, R⟩:

Σ is a set of terminal nodes representing atomic patterns that are not decomposable;

N is a set of nonterminal nodes representing high-level patterns, which is divided into two disjoint sets: And-nodes and Or-nodes;

S ∈ N is a start symbol that represents a complete pattern;

θ is a function that maps an instance of a terminal or nonterminal node to a parameter (the parameter can take any form such as a vector or a complex data structure; denote the maximal parameter size by d);

R is a set of grammar rules, each of which takes the form A → x₁x₂…xₙ, representing the generation from a nonterminal node A to a set of nonterminal or terminal nodes x₁, …, xₙ (we say that the rule is “headed” by node A and the nodes x₁, …, xₙ are the “child nodes” of A).
The set of rules R is further divided into two disjoint sets: And-rules and Or-rules.

An And-rule, parameterized by a triple ⟨A → x₁…xₙ, t, f⟩, represents the decomposition of a pattern into a configuration of non-overlapping sub-patterns. The And-rule specifies a production A → x₁x₂…xₙ for some n ≥ 2, where A is an And-node and x₁, …, xₙ are a set of terminal or nonterminal nodes representing the sub-patterns. A relation t between the parameters of the child nodes, t(θ(x₁), …, θ(xₙ)), specifies valid configurations of the sub-patterns. This so-called parameter relation is typically factorized into the conjunction of a set of binary relations. A parameter function f is also associated with the And-rule specifying how the parameter of the And-node is related to the parameters of the child nodes: θ(A) = f(θ(x₁), …, θ(xₙ)). We require that both the parameter relation and the parameter function take time polynomial in n and d to compute. There is exactly one And-rule that is headed by each And-node.

An Or-rule, parameterized by an ordered pair ⟨A → x, p⟩, represents an alternative configuration of a pattern. The Or-rule specifies a production A → x, where A is an Or-node and x is either a terminal or a nonterminal node representing a possible configuration. A conditional probability p = P(x | A) is associated with the Or-rule specifying how likely the configuration represented by x is selected given the Or-node A. The only constraint in the Or-rule is that the parameters of A and x must be the same: θ(A) = θ(x). There typically exist multiple Or-rules headed by the same Or-node, and together they can be written as A → x₁ | x₂ | … | xₖ.
Note that unlike in some previous work, in the definition above we assume deterministic And-rules for simplicity. In principle, any uncertainty in an And-rule can be equivalently represented by a set of Or-rules each invoking a different copy of the And-rule.
Fig. 1(a) shows an example stochastic context-free AOG of line drawings. Each terminal or nonterminal node represents an image patch and its parameter is a 2D vector representing the position of the patch in the image. Each terminal node denotes a line segment of a specific orientation while each nonterminal node denotes a class of line drawing patterns. The start symbol denotes a class of line drawing images (e.g., images of animal faces). In each And-rule, the parameter relation specifies the relative positions between the sub-patterns and the parameter function specifies the relative positions between the composite pattern and the sub-patterns.
With a stochastic context-free AOG, one can generate a compositional structure by starting from a data sample containing only the start symbol S and recursively applying the grammar rules in R to convert nonterminal nodes in the data sample until the data sample contains only terminal nodes. The resulting compositional structure is a tree in which the root node is S, each non-leaf node is a nonterminal node, and each leaf node is a terminal node; in addition, for each appearance of an And-node in the tree, its set of child nodes in the tree conforms to the And-rule headed by that And-node, and for each appearance of an Or-node in the tree, it has exactly one child node in the tree, which conforms to one of the Or-rules headed by that Or-node. The probability of the compositional structure is the product of the probabilities of all the Or-rules used in the generation process. Fig. 1(b) shows an image and its compositional structure generated from the example AOG in Fig. 1(a). Given a data sample consisting of only atomic patterns, one can also infer its compositional structure by parsing the data sample with the stochastic context-free AOG. We will discuss the parsing algorithm later.
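The generation process described above can be sketched in a few lines of code. The toy grammar below is purely illustrative (node names, offsets, and probabilities are invented rather than taken from Fig. 1): Or-nodes sample one child according to the Or-rule probabilities, And-nodes expand all of their children with hard-coded relative positions standing in for the parameter relation and parameter function, and the probability of the generated structure is the product of the Or-rule probabilities used.

```python
import random

# Toy stochastic context-free AOG in the style of the line-drawing
# example; all node names, offsets, and probabilities are illustrative.

# Or-rules: Or-node -> list of (child, conditional probability)
OR_RULES = {
    "S":   [("Face", 0.7), ("House", 0.3)],
    "Eye": [("hline", 0.5), ("vline", 0.5)],
}

# And-rules: And-node -> (children, child offsets); the offsets are a
# hard-coded instance of the parameter relation t and function f
# (each child's position relative to the composite pattern's position).
AND_RULES = {
    "Face":  (["Eye", "Eye", "Mouth"], [(-1, 1), (1, 1), (0, -1)]),
    "House": (["Roof", "Wall"], [(0, 1), (0, 0)]),
}

def generate(node, pos=(0, 0)):
    """Expand `node` at position `pos`; return (terminal instances,
    probability of the Or-rule choices made)."""
    if node in OR_RULES:  # Or-node: sample one alternative
        children, probs = zip(*OR_RULES[node])
        child = random.choices(children, probs)[0]
        terminals, p_sub = generate(child, pos)  # theta(A) = theta(x)
        return terminals, dict(OR_RULES[node])[child] * p_sub
    if node in AND_RULES:  # And-node: expand every child
        terminals, p = [], 1.0
        children, offsets = AND_RULES[node]
        for child, (dx, dy) in zip(children, offsets):
            sub, p_sub = generate(child, (pos[0] + dx, pos[1] + dy))
            terminals += sub
            p *= p_sub
        return terminals, p
    return [(node, pos)], 1.0  # terminal: an atomic pattern instance

random.seed(0)
print(generate("S"))
```

Because an Or-node passes its parameter unchanged to the selected child (θ(A) = θ(x)), the position argument is simply forwarded at Or-nodes and shifted at And-nodes.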
Our framework is flexible in that it allows different types of patterns and relations within the same grammar. Consider for example a stochastic AOG modeling visually grounded events (e.g., videos of people using vending machines). We would have two types of terminal or nonterminal nodes that model events and objects respectively. An event node represents a class of events or sub-events, whose parameter is the start/end time of an instance event. An object node represents a class of objects or sub-objects (possibly in a specific state or posture), whose parameter contains both the spatial information and the time-interval information of an instance object. We specify temporal relations between event nodes to model the composition of an event from sub-events; we specify spatial relations between object nodes to model the composition of an object from its component sub-objects as well as the composition of an atomic event from its participant objects; we also specify temporal relations between related object nodes to enforce the alignment of their time intervals.
Note that different nonterminal nodes in an AOG may share child nodes. For example, in Fig. 1 each terminal node representing a line segment may actually be shared by multiple parent nonterminal nodes representing different line drawing patterns. Furthermore, there could be recursive rules in an AOG, which means the direct or indirect production of a grammar rule may contain its left-hand side nonterminal. Recursive rules are useful in modeling languages and repetitive patterns.
In some previous work, stochastic AOGs more expressive than stochastic context-free AOGs are employed. A typical augmentation over context-free AOGs is that, while in a context-free AOG a parameter relation can only be specified within an And-rule, in more advanced AOGs parameter relations can be specified between any two nodes in the grammar. This can be very useful in certain scenarios. For example, in an image AOG of indoor scenes, relations can be added between all pairs of 2D faces to discourage overlap [7]. However, such relations make inference much more difficult. Another constraint in context-free AOGs that is sometimes removed in more advanced AOGs is the non-overlapping requirement between sub-patterns in an And-rule. For example, in an image AOG it may be more convenient to decompose a 3D cube into 2D faces that share edges [7]. We will leave the formal definition and analysis of stochastic AOGs beyond context-freeness to future work.
2.1 Related Models and Special Cases
Stochastic context-free AOGs subsume many existing models as special cases. Because of space limitations, here we informally describe these related models and their reduction to AOGs and leave the formal definitions and proofs to Appendix A.
Stochastic context-free grammars (SCFGs) are clearly a special case of stochastic context-free AOGs. Any SCFG can be converted into an And-Or normal form that matches the structure of a stochastic AOG [19]. In a stochastic AOG representing an SCFG, each node represents a string and the parameter of a node is the start/end positions of the string in the complete sentence; the parameter relation and parameter function in an And-rule specify string concatenation, i.e., the substrings must be adjacent and the concatenation of all the substrings forms the composite string represented by the parent And-node.
A variety of grammar formalisms that go beyond the concatenation relation of strings have been developed in the natural language processing community. For example, in some formalisms the substrings are interwoven to form the composite string [20, 21]. More generally, in a grammar rule a linear regular string function can be used to combine lists of substrings into a list of composite strings, as in a linear context-free rewriting system (LCFRS) [22]. All these grammar formalisms can be represented by context-free AOGs with each node representing a list of strings, the node parameter being a list of start/end positions, and in each And-rule the parameter relation and parameter function defining a linear regular string function. Since LCFRSs are known to generate the larger class of mildly context-sensitive languages, context-free AOGs when instantiated to model languages are at least as expressive as mildly context-sensitive grammars.
Constraint-based grammar formalisms [8] are another class of natural language grammars, which associate so-called feature structures with nonterminals and use them to specify constraints in the grammar rules. Such constraints can help model natural language phenomena such as English subject-verb agreement and underlie grammatical theories such as head-driven phrase structure grammars [23]. It is straightforward to show that constraint-based grammar formalisms are also special cases of context-free AOGs (with a slight generalization to allow unary And-rules), by establishing equivalence between feature structures and node parameters and between constraints and parameter relations/functions.
In computer vision and pattern recognition, stochastic AOGs have been applied to a variety of tasks as discussed in the previous section. In addition, several other popular models, such as the deformable part model [24] and the flexible mixture-of-parts model [25], can essentially be seen as special cases of stochastic context-free AOGs in which the node parameters encode spatial information of image patches and the parameter relations/functions encode spatial relations between the patches.
Sum-product networks (SPNs) [9] are a new type of deep probabilistic model that extends the ideas of arithmetic circuits [26] and AND/OR search spaces [27] and can compactly represent many probability distributions that traditional graphical models cannot tractably handle. It can be shown that any decomposable SPN has an equivalent stochastic context-free AOG: Or-nodes and And-nodes of the AOG can be used to represent sum nodes and product nodes of the SPN respectively, all the node parameters are set to null, parameter relations always return true, and parameter functions always return null. Because of this reduction, all the models that can be reduced to decomposable SPNs can also be seen as special cases of stochastic context-free AOGs, such as thin junction trees [28], mixtures of trees [29], and latent tree models [30].
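The reduction from decomposable SPNs can be made concrete with a small sketch: sum nodes become Or-nodes whose (normalized) weights become Or-rule probabilities, and product nodes become And-nodes with null parameters and trivially true parameter relations. The tiny SPN below, over two binary variables, is illustrative.

```python
# SPN: S = 0.6 * (x1 * x2) + 0.4 * (nx1 * nx2), where the leaves are
# indicator terminals; sum nodes map to Or-rules, products to And-rules.
SUM_NODES  = {"S": [("P1", 0.6), ("P2", 0.4)]}           # -> Or-rules
PROD_NODES = {"P1": ["x1", "x2"], "P2": ["nx1", "nx2"]}  # -> And-rules

def marginal(node, evidence):
    """Value of the network on `evidence` (the set of true indicators),
    computed over the AOG encoding of the SPN."""
    if node in SUM_NODES:   # Or-node: weighted sum over alternatives
        return sum(p * marginal(c, evidence) for c, p in SUM_NODES[node])
    if node in PROD_NODES:  # And-node: product over non-overlapping parts
        prod = 1.0
        for c in PROD_NODES[node]:
            prod *= marginal(c, evidence)
        return prod
    return 1.0 if node in evidence else 0.0  # terminal indicator

print(marginal("S", {"x1", "x2"}))    # 0.6
print(marginal("S", {"nx1", "nx2"}))  # 0.4
```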
2.2 Inference
The main inference problem associated with stochastic AOGs is parsing, i.e., given a data sample consisting of only terminal nodes, infer its most likely compositional structure (parse). A related inference problem is to compute the marginal probability of a data sample. It can be shown that both problems are NP-hard (see Appendix B for the proofs). Nevertheless, here we propose an exact inference algorithm for stochastic context-free AOGs that is tractable under a reasonable assumption on the number of valid compositions in a data sample. Our algorithm is based on bottom-up dynamic programming and can be seen as a generalization of several previous exact inference algorithms designed for special cases of stochastic AOGs (such as the CYK algorithm for text parsing).
Algorithm 1 shows the inference algorithm that returns the probability of the most likely parse. After the algorithm terminates, the most likely parse can be constructed by recursively backtracking the selected Or-rules from the start symbol to the terminals. To compute the marginal probability of a data sample, we simply replace the max operation with sum in line 20 of Algorithm 1.
In Algorithm 1 we assume that the input AOG is in a generalized version of Chomsky normal form, i.e., (1) each And-node has exactly two child nodes, which must be Or-nodes, (2) the child nodes of Or-nodes must not be Or-nodes, and (3) the start symbol is an Or-node. By extending previous studies [31], it can be shown that any context-free AOG can be converted into this form and that both the time complexity of the conversion and the size of the new AOG are polynomial in the size of the original AOG. We give more details in Appendix C.
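The core of such a conversion is the binarization of And-rules, sketched below under simplifying assumptions (the fresh-node naming is invented, and the full conversion in Appendix C also has to interleave Or-nodes and handle the start symbol): an And-rule with more than two children is replaced by a chain of binary And-rules over fresh intermediate nodes.

```python
def binarize(and_rules):
    """and_rules: dict mapping each And-node to its list of child nodes.
    Returns an equivalent rule set in which every And-rule is binary."""
    out = {}
    for head, children in and_rules.items():
        while len(children) > 2:
            fresh = f"{head}'{len(out)}"   # invented fresh-node name
            out[fresh] = children[-2:]      # fold the last two children
            children = children[:-2] + [fresh]
        out[head] = list(children)
    return out

# A 4-child And-rule becomes a chain of three binary And-rules.
print(binarize({"Face": ["Eye", "Eye", "Nose", "Mouth"]}))
```

Recursively expanding the binarized rules recovers exactly the original child sequence, which is why the conversion preserves the set of generated compositions.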
The basic idea of Algorithm 1 is to discover valid compositions of terminal instances of increasing sizes, where the size of a composition is defined as the number of terminal instances it contains. Size-1 compositions are simply the terminal instances (lines 2–6). To discover compositions of size k, the combinations of any two compositions of sizes i and k−i (for 0 < i < k) are considered (lines 7–20). A complete parse of the data sample is a composition of size n, the total number of terminal instances, with its root being the start symbol (line 21).
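Instantiated for strings, where the parameter relation reduces to an adjacency check and the procedure reduces to CYK, the bottom-up idea can be sketched as follows; the toy grammar (which generates strings of the form a…ab and is in the generalized Chomsky normal form) is illustrative.

```python
# A composition is a triple (node, start, end); the adjacency check
# e1 == s2 plays the role of the parameter relation in an And-rule.
OR_RULES = {"S": [("AB", 1.0)],
            "A": [("a", 1.0)],
            "B": [("b", 0.6), ("AB2", 0.4)]}
AND_RULES = {"AB": ("A", "B"), "AB2": ("A", "B")}  # binary And-rules

def parse_prob(tokens):
    """Probability of the most likely parse of `tokens` rooted at S."""
    n = len(tokens)
    chart = {1: {}}  # chart[k]: (node, start, end) -> best probability
    for i, tok in enumerate(tokens):           # size-1 compositions
        for head, alts in OR_RULES.items():
            for x, p in alts:
                if x == tok:
                    key = (head, i, i + 1)
                    chart[1][key] = max(chart[1].get(key, 0.0), p)
    for k in range(2, n + 1):                  # size-k compositions
        chart[k] = {}
        for i in range(1, k):
            for (b, s1, e1), pb in chart[i].items():
                for (c, s2, e2), pc in chart[k - i].items():
                    if e1 != s2:               # parameter relation fails
                        continue
                    for and_node, kids in AND_RULES.items():
                        if kids != (b, c):
                            continue
                        for head, alts in OR_RULES.items():
                            for x, p in alts:  # Or-rule above the And-node
                                if x == and_node:
                                    key = (head, s1, e2)
                                    chart[k][key] = max(
                                        chart[k].get(key, 0.0), p * pb * pc)
    return chart[n].get(("S", 0, n), 0.0) if n else 0.0

print(parse_prob(["a", "b"]))       # 0.6
print(parse_prob(["a", "a", "b"]))  # 0.24
```

Replacing the max in the chart update with a sum would compute the marginal probability of the token sequence instead, mirroring the modification to line 20 of Algorithm 1.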
The time complexity of Algorithm 1 is polynomial in n, the size of the grammar, and c = maxₖ |Cₖ|, where n is the number of terminal instances and Cₖ is the set of valid compositions of size k in the data sample x. In the worst case, when all possible compositions of terminal instances from the data sample are valid, we have |Cₖ| = (n choose k), which is exponential in n. To make the algorithm tractable, we restrict the value of |Cₖ| with the following assumption on the input data sample.
Composition Sparsity Assumption.
For any data sample x and any positive integer k, the number of valid compositions of size k in x is polynomial in the number of terminal instances in x.
This assumption is reasonable in many scenarios. For text data, for a sentence of length n, a valid composition is a substring of the sentence and the number of substrings of size k is n − k + 1. For image data, if we restrict the compositions to be rectangular image patches (as in the hierarchical space tiling model [14]), then for an image of n pixels it is easy to show that the number of valid compositions of any specific size k is no more than nk (each of the at most k width-height factorizations of k can be placed in at most n positions).
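Both counts are easy to check numerically; the sketch below (with invented sizes) enumerates substrings of a fixed size and rectangular patches of a fixed area.

```python
def substring_count(n, k):
    """Number of contiguous substrings of size k in a length-n sentence."""
    return sum(1 for i in range(n) if i + k <= n)

def rectangle_count(w, h, k):
    """Number of axis-aligned w0 x h0 patches with w0 * h0 == k in a
    w x h image."""
    total = 0
    for w0 in range(1, k + 1):
        if k % w0:
            continue
        h0 = k // w0
        if w0 <= w and h0 <= h:
            total += (w - w0 + 1) * (h - h0 + 1)
    return total

n, k = 10, 4
assert substring_count(n, k) == n - k + 1     # linear in n
w, h = 8, 6
assert rectangle_count(w, h, k) <= (w * h) * k  # polynomial bound
print(substring_count(n, k), rectangle_count(w, h, k))
```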
3 Logic Perspective of Stochastic AOGs
In a stochastic AOG, And-rules model the relations between terminal and nonterminal instances and Or-rules model the uncertainty in the compositional structure. By combining these two types of rules, stochastic AOGs can be seen as probabilistic models of relational structures and are hence related to the field of statistical relational learning [10]. In this section, we make this connection explicit by providing probabilistic logic interpretations of stochastic AOGs. By establishing this connection, we hope to facilitate the exchange of ideas and results between the two previously separate research areas.
3.1 Interpretation as Probabilistic Logic
We first discuss an interpretation of stochastic context-free AOGs as a subset of first-order probabilistic logic with a possible-world semantics. The intuition is that we interpret terminal and nonterminal nodes of an AOG as unary relations, use binary relations to connect the instances of terminal and nonterminal nodes to form the parse tree, and use material implication to represent grammar rules.
We first describe the syntax of our logic interpretation of stochastic context-free AOGs. There are two types of formulas in the logic: And-rules and Or-rules. Each And-rule takes the following form (for some n ≥ 2):

∀x A(x) ⇒ ∃x₁…∃xₙ B₁(x₁) ∧ r₁(x, x₁) ∧ … ∧ Bₙ(xₙ) ∧ rₙ(x, xₙ) ∧ f(θ(x), θ(x₁), …, θ(xₙ))

The unary relation A corresponds to the left-hand side And-node of an And-rule in the AOG; each unary relation Bᵢ corresponds to a child node of the And-rule. We require that for each unary relation A, there is at most one And-rule with A as the left-hand side. Each binary relation rᵢ is typically the HasPart relation between an object and one of its parts, but could also denote any other binary relation such as the Agent relation between an action and its initiator, or the HasColor relation between an object and its color. Note that these binary relations make explicit the nature of the composition represented by each And-rule of the AOG. θ is a function that maps an object to its parameter. f is a relation that combines the parameter relation and parameter function in the And-rule of the AOG and is typically factorized into the conjunction of a set of binary relations.
Each Or-rule takes the following form:

p : ∀x A(x) ⇒ B(x)

The unary relation A corresponds to the left-hand side Or-node and B to the child node of an Or-rule in the AOG; p is the conditional probability of B(x) being true when the grounded left-hand side A(x) is true. We require that for each true grounding A(a), among all the grounded Or-rules with A(a) as the left-hand side, exactly one is true. This requirement can be represented by two additional sets of constraint rules. First, Or-rules with the same left-hand side are mutually exclusive, i.e., for any two Or-rules A(x) ⇒ B₁(x) and A(x) ⇒ B₂(x), we have ∀x A(x) ⇒ (B₁(x) | B₂(x)), where | is the Sheffer stroke. Second, given a true grounding A(a), the Or-rules with A(a) as the left-hand side cannot all be false, i.e., ∀x A(x) ⇒ B₁(x) ∨ … ∨ Bₖ(x), where B₁, …, Bₖ range over the right-hand sides of all such Or-rules. Further, to simplify inference and avoid potential inconsistency in the logic, we require that the right-hand side unary relation of an Or-rule cannot appear in the left-hand side of any Or-rule (i.e., the second requirement in the generalized Chomsky normal form of AOGs described earlier).
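A brute-force enumeration confirms that the two constraint-rule sets jointly admit exactly the one-hot assignments to the Or-rule right-hand sides; the three-child grounding below is illustrative.

```python
from itertools import product

def admissible(bits):
    """bits[i] is the truth value of the i-th Or-rule's right-hand side
    for a fixed true grounding of the left-hand side."""
    pairwise = all(not (bits[i] and bits[j])   # Sheffer stroke: NAND
                   for i in range(len(bits))
                   for j in range(i + 1, len(bits)))
    coverage = any(bits)                       # B1 v ... v Bk
    return pairwise and coverage

sols = [bits for bits in product([False, True], repeat=3) if admissible(bits)]
print(sols)  # only the three one-hot assignments survive
```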
We can divide the set of unary relations into two categories: those that appear in the left-hand side of rules (corresponding to the nonterminal nodes of the AOG) and those that do not (corresponding to the terminal nodes). The first category is further divided into two subcategories depending on whether the unary relation appears in the left-hand side of And-rules or Or-rules (corresponding to the And-nodes and Or-nodes of the AOG respectively). We require these two subcategories to be disjoint. There is also a unique unary relation S that does not appear in the right-hand side of any rule, which corresponds to the start symbol of the AOG.
Now we describe the semantics of the logic. The interpretation of all the logical and non-logical symbols follows that of first-order logic. There are two types of objects in the universe of the logic: normal objects and parameters. There is a bijection between normal objects and parameters, and the function θ maps a normal object to its corresponding parameter. A possible world is represented by a pair ⟨O, L⟩ where O is a set of objects and L is a set of literals that are true. We require that there exists exactly one normal object o such that S(o) is true. In order for all the deterministic formulas (i.e., all the And-rules and the two sets of constraint rules of all the Or-rules) to be satisfied, the possible world must contain a tree structure in which:

each node denotes an object in O, with the root node being the object o;

each edge denotes a binary relation rᵢ defined in some And-rule;

for each leaf node x, there is exactly one terminal unary relation T such that T(x) is true;

for each non-leaf node x, there is exactly one And-node unary relation A such that A(x) is true, and for the child nodes x₁, …, xₙ of x in the tree, Bᵢ(xᵢ), rᵢ(x, xᵢ) and f(θ(x), θ(x₁), …, θ(xₙ)) hold according to the And-rule associated with relation A;

for each node x, if for some Or-node unary relation A we have A(x) true, then among all the Or-rules with A as the left-hand side, there is exactly one Or-rule such that B(x) is true, where B is the right-hand side unary relation of that Or-rule, and for each of the remaining Or-rules the right-hand side unary relation is false of x.
We enforce the following additional requirements to ensure that the possible world contains no more and no less than the tree structure:

No two nodes in the tree denote the same object.

O and L contain only the objects and literals specified above.
The probability of a possible world w is defined as follows. Denote by R_O the set of Or-rules. For each Or-rule r ∈ R_O, denote by p_r the conditional probability associated with r, and define n_r(w) to be the number of objects in w of which both the left-hand side and the right-hand side of r are true. Then we have:

P(w) = ∏_{r ∈ R_O} p_r^{n_r(w)}
In this logic interpretation, parsing corresponds to the inference problem of identifying the most likely possible world in which the terminal relations and parameters of the leaf nodes of the tree structure match the atomic patterns in the input data sample. Computing the marginal probability of a data sample corresponds to computing the sum of the probabilities of all the possible worlds that match the data sample.
Our logic interpretation of stochastic context-free AOGs resembles tractable Markov logic (TML) [11, 32] in many aspects, even though the two have very different motivations. Such similarity implies a deep connection between stochastic AOGs and TML and points to a potential research direction of investigating novel tractable statistical relational models by borrowing ideas from the stochastic grammar literature. There are a few minor differences between stochastic AOGs and TML, e.g., TML does not distinguish between And-nodes and Or-nodes, does not allow recursive rules, enforces that the right-hand side unary relation in each Or-rule is a subtype of the left-hand side unary relation, and disallows a unary relation from appearing in the right-hand side of more than one Or-rule.
3.2 Interpretation as a Stochastic Logic Program
Stochastic logic programs (SLPs) [12] are a type of statistical relational model that, like stochastic context-free AOGs, generalizes stochastic context-free grammars. They are essentially equivalent to two other representations, independent choice logic [33] and PRISM [34]. Here we show how a stochastic context-free AOG can be represented by a pure normalized SLP [35]. Since several inference and learning algorithms have been developed for SLPs and PRISM, our reduction enables the application of these algorithms to stochastic AOGs.
In our SLP program, we have one SLP clause for each And-rule and each Or-rule in the AOG. The overall structure is similar to the probabilistic logic interpretation discussed in section 3.1. For each And-rule, the corresponding SLP clause takes the following form:

a(X, P) :- b₁(X₁, P₁), …, bₙ(Xₙ, Pₙ), r₁(X, X₁), …, rₙ(X, Xₙ), union(X, X₁, …, Xₙ), f(P, P₁, …, Pₙ).

The head a(X, P) represents the left-hand side And-node of the And-rule, where X represents the set of terminal instances generated from the And-node and P is the parameter of the And-node. In the body of the clause, bᵢ(Xᵢ, Pᵢ) represents the i-th child node of the And-rule, rᵢ(X, Xᵢ) represents the relation between the And-node and its i-th child node, union(X, X₁, …, Xₙ) states that the terminal instance set of the And-node is the union of the instance sets from all the child nodes, and f(P, P₁, …, Pₙ) represents a relation that combines the parameter relation and parameter function of the And-rule. For the relations rᵢ and f, we need additional clauses to define them according to the type of data being modeled.
For each Or-rule in the AOG, if the right-hand side is a nonterminal, then we have:

p : a(X, P) :- b(X, P).

where p is the conditional probability associated with the Or-rule, and a and b represent the left-hand and right-hand sides of the Or-rule respectively, whose arguments X and P have the same meaning as explained above. If the right-hand side of the Or-rule is a terminal t, then we have:

p : a([t], P).

where t is the right-hand side terminal node and the second argument P represents the parameter of the terminal node.
Finally, the goal of the program is

:- s(X, P).

where s represents the start symbol of the AOG and the arguments X and P have the same meaning as explained above.
4 Conclusion
Stochastic And-Or grammars extend traditional stochastic grammars of language to model other types of data such as images and events. We have provided a unified representation framework of stochastic AOGs that can be instantiated for different data types. We have shown that many existing grammar formalisms and probabilistic models in natural language processing, computer vision, and machine learning can all be seen as special cases of stochastic context-free AOGs. We have also proposed an inference algorithm for parsing data samples using stochastic context-free AOGs and shown that the algorithm is tractable under the composition sparsity assumption. In the second part of the paper, we have provided interpretations of stochastic context-free AOGs as a subset of first-order probabilistic logic and stochastic logic programs. Our interpretations connect stochastic AOGs to the field of statistical relational learning and clarify their relation with a few existing statistical relational models.
References
 [1] King Sun Fu. Syntactic pattern recognition and applications, volume 4. Prentice-Hall, Englewood Cliffs, 1982.
 [2] Song-Chun Zhu and David Mumford. A stochastic grammar of images. Found. Trends. Comput. Graph. Vis., 2(4):259–362, 2006.
 [3] Ya Jin and Stuart Geman. Context and hierarchy in a probabilistic image model. In CVPR, 2006.
 [4] Yuri A. Ivanov and Aaron F. Bobick. Recognition of visual activities and interactions by stochastic parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):852–872, 2000.
 [5] M. S. Ryoo and J. K. Aggarwal. Recognition of composite human activities through context-free grammar based representation. In CVPR, 2006.
 [6] Mingtao Pei, Yunde Jia, and Song-Chun Zhu. Parsing video events with goal inference and intent prediction. In ICCV, 2011.
 [7] Yibiao Zhao and Song-Chun Zhu. Image parsing with stochastic scene grammar. In NIPS, 2011.
 [8] Stuart M. Shieber. Constraint-based grammar formalisms: parsing and type inference for natural and computer languages. MIT Press, 1992.
 [9] Hoifung Poon and Pedro Domingos. Sum-product networks: A new deep architecture. In UAI, 2011.
 [10] Lise Getoor and Ben Taskar. Introduction to statistical relational learning. MIT press, 2007.
 [11] Pedro Domingos and William Austin Webb. A tractable first-order probabilistic logic. In AAAI, 2012.
 [12] Stephen Muggleton. Stochastic logic programs. Advances in inductive logic programming, 32:254–264, 1996.
 [13] Christopher D. Manning and Hinrich Schütze. Foundations of statistical natural language processing. MIT Press, Cambridge, MA, USA, 1999.
 [14] Shuo Wang, Yizhou Wang, and Song-Chun Zhu. Hierarchical space tiling for scene modeling. In ACCV, 2013.
 [15] Brandon Rothrock, Seyoung Park, and Song-Chun Zhu. Integrating grammar and segmentation for human pose estimation. In CVPR, 2013.
 [16] A. Fire and S.-C. Zhu. Using causal induction in humans to learn and infer causality from video. In 35th Annual Cognitive Science Conference (CogSci), 2013.
 [17] Kewei Tu, Meng Meng, Mun Wai Lee, Tae Eun Choe, and Song-Chun Zhu. Joint video and text parsing for understanding events and answering queries. IEEE MultiMedia, 2014.
 [18] Kewei Tu, Maria Pavlovskaia, and Song-Chun Zhu. Unsupervised structure learning of stochastic and-or grammars. In NIPS, 2013.
 [19] Kewei Tu and Vasant Honavar. Unsupervised learning of probabilistic context-free grammar using iterative biclustering. In ICGI, 2008.
 [20] Carl Pollard. Generalized context-free grammars, head grammars and natural language. Ph.D. diss., Stanford University, 1984.
 [21] Mark Johnson. Parsing with discontinuous constituents. In ACL, 1985.
 [22] David Jeremy Weir. Characterizing mildly contextsensitive grammar formalisms. Ph.D. diss., University of Pennsylvania, 1988.
 [23] Carl Pollard and Ivan A. Sag. Informationbased Syntax and Semantics: Vol. 1: Fundamentals. Center for the Study of Language and Information, Stanford, CA, USA, 1988.
 [24] Pedro Felzenszwalb, David McAllester, and Deva Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
 [25] Yi Yang and Deva Ramanan. Articulated pose estimation with flexible mixturesofparts. In CVPR, 2011.
 [26] Adnan Darwiche. A differential approach to inference in bayesian networks. Journal of the ACM (JACM), 50(3):280–305, 2003.
 [27] Rina Dechter and Robert Mateescu. And/or search spaces for graphical models. Artificial intelligence, 171(2):73–106, 2007.
 [28] Francis R Bach and Michael I Jordan. Thin junction trees. In NIPS, 2001.
 [29] Marina Meila and Michael I Jordan. Learning with mixtures of trees. The Journal of Machine Learning Research, 1:1–48, 2001.
 [30] Myung Jin Choi, Vincent YF Tan, Animashree Anandkumar, and Alan S Willsky. Learning latent tree graphical models. The Journal of Machine Learning Research, 12:1771–1812, 2011.
 [31] Martin Lange and Hans Leiß. To CNF or not to CNF? an efficient yet presentable version of the CYK algorithm. Informatica Didactica, 8:2008–2010, 2009.
 [32] William Austin Webb and Pedro Domingos. Tractable probabilistic knowledge bases with existence uncertainty. In AAAI Workshop: Statistical Relational Artificial Intelligence, 2013.
 [33] David Poole. Probabilistic horn abduction and bayesian networks. Artificial intelligence, 64(1):81–129, 1993.
 [34] Taisuke Sato and Yoshitaka Kameya. Parameter learning of logic programs for symbolicstatistical modeling. Journal of Artificial Intelligence Research, pages 391–454, 2001.
 [35] James Cussens. Parameter estimation in stochastic logic programs. Machine Learning, 44(3):245–271, 2001.
 [36] Robert Peharz, Sebastian Tschiatschek, Franz Pernkopf, and Pedro Domingos. On theoretical properties of sumproduct networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 744–752, 2015.
Appendix A Related Models and Special Cases
A.1 Stochastic Context-Free Grammars
Definition 1.
A stochastic context-free grammar (SCFG) is a 4-tuple ⟨Σ, N, S, R⟩:

Σ is a set of terminal symbols;

N is a set of nonterminal symbols;

S ∈ N is a special nonterminal called the start symbol;

R is a set of production rules, each of the form A → α [p], where A ∈ N, α ∈ (N ∪ Σ)+, and p is the conditional probability P(A → α | A).
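As a concrete illustration of this definition, an SCFG can be encoded as a mapping from nonterminals to weighted alternatives and sampled top-down. The toy grammar and function names below are our own, not from the paper:

```python
import random

# Toy SCFG: keys are nonterminals; each alternative is (right-hand side, probability).
# Lowercase strings are terminals. This grammar is a made-up illustration.
RULES = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("she",), 0.6), (("the", "dog"), 0.4)],
    "VP": [(("runs",), 0.7), (("barks",), 0.3)],
}

def sample(symbol, rng=random):
    """Top-down sampling: expand nonterminals recursively, emit terminals."""
    if symbol not in RULES:          # terminal symbol
        return [symbol]
    rhss, probs = zip(*RULES[symbol])
    rhs = rng.choices(rhss, weights=probs)[0]   # pick one rule for this nonterminal
    return [w for sym in rhs for w in sample(sym, rng)]

sentence = sample("S")
```

Since each nonterminal's rule probabilities sum to one, the sampler defines a proper distribution over the (finite) language of this toy grammar.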
Any SCFG can be converted into And-Or normal form as described in [19]. The conversion results in a linear increase in the grammar size.
Definition 2.
An SCFG is in And-Or normal form iff its nonterminal symbols are divided into two disjoint subsets, And-symbols and Or-symbols, such that:

each And-symbol appears on the left-hand side of exactly one production rule, and the right-hand side of the rule contains a sequence of two or more terminal or nonterminal symbols;

each Or-symbol appears on the left-hand side of one or more rules, each of which has a single terminal or nonterminal symbol on the right-hand side.
Proposition 1.
Any SCFG can be converted into And-Or normal form with a linear increase in size.
Proof.
We construct an SCFG in And-Or normal form as follows. For each production rule A → α [p] with two or more symbols in α, create an And-symbol A_α and replace the rule with two new rules: A → A_α [p] and A_α → α. Regard all the nonterminals of the original SCFG as Or-symbols. ∎
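The conversion in this proof can be sketched as follows; the rule representation and the fresh-symbol naming scheme are our own, and probabilities ride along unchanged on the Or-rules:

```python
def to_and_or_normal_form(rules):
    """rules: list of (lhs, rhs_tuple, prob). Returns (or_rules, and_rules).
    Every rule A -> alpha [p] with |alpha| >= 2 is split into the Or-rule
    A -> A_alpha [p] and the And-rule A_alpha -> alpha, as in the proof."""
    or_rules, and_rules = [], {}
    for lhs, rhs, prob in rules:
        if len(rhs) >= 2:
            and_sym = lhs + "_" + "_".join(rhs)   # fresh And-symbol for this rule
            and_rules[and_sym] = rhs
            or_rules.append((lhs, (and_sym,), prob))
        else:
            or_rules.append((lhs, rhs, prob))     # already unary: stays an Or-rule
    return or_rules, and_rules

ors, ands = to_and_or_normal_form([
    ("S", ("NP", "VP"), 1.0),
    ("NP", ("she",), 0.6),
    ("NP", ("the", "dog"), 0.4),
])
```

Each original rule yields at most two new rules, which makes the linear size bound of the proposition directly visible.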
Proposition 2.
Any SCFG can be represented by a stochastic context-free AOG with a linear increase in size.
Proof.
We first convert the SCFG into And-Or normal form. We then construct an equivalent stochastic context-free AOG ⟨Σ, N, S, θ, R⟩:

Σ is the set of terminal symbols in the SCFG.

N is the set of nonterminal symbols in the SCFG, with a correspondence from And-symbols to And-nodes and from Or-symbols to Or-nodes.

S is the start symbol of the SCFG.

θ maps a substring represented by a terminal or nonterminal symbol to its start/end positions in the complete sentence.

R is constructed from the set of production rules in the And-Or normal form SCFG. Each rule headed by an And-symbol becomes an And-rule, whose parameter relation specifies that the substrings represented by the child nodes must be adjacent (by checking their start/end positions) and whose parameter function outputs the start/end positions of the concatenated string represented by the parent And-node (i.e., the start position of the leftmost substring and the end position of the rightmost substring). Each rule headed by an Or-symbol becomes an Or-rule with the same conditional probability.

It is easy to verify that the size of the stochastic context-free AOG is linear in the size of the original SCFG. ∎
A.2 Linear Context-Free Rewriting Systems
Linear context-free rewriting systems (LCFRS) [22] are a class of mildly context-sensitive grammars, which subsume a few other grammar formalisms [20, 21] as special cases.
Definition 3.
A linear context-free rewriting system is a 4-tuple ⟨Σ, N, S, R⟩:

Σ is a set of terminal symbols;

N is a set of nonterminal symbols;

S ∈ N is a special nonterminal called the start symbol;

R is a set of production rules, each of the form A → g(B_1, …, B_k) [p], such that p is the conditional probability of the rule given A; A, B_1, …, B_k ∈ N; g : (Σ*)^φ(B_1) × ⋯ × (Σ*)^φ(B_k) → (Σ*)^φ(A) (for k ≥ 0), where φ specifies the fan-out of a nonterminal symbol; X = {x_{i,j} | 1 ≤ i ≤ k, 1 ≤ j ≤ φ(B_i)} is a set of variables; and g is a composition function that is linear and regular, i.e., in the equation

g(⟨x_{1,1}, …, x_{1,φ(B_1)}⟩, …, ⟨x_{k,1}, …, x_{k,φ(B_k)}⟩) = ⟨t_1, …, t_φ(A)⟩, with each t_i ∈ (Σ ∪ X)*,

each variable in X appears at most once on each side of the equation and the two sides of the equation contain exactly the same set of variables.
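As an illustration of a linear, regular composition function, the toy LCFRS below (our own example, not from the paper) uses a fan-out-2 nonterminal A to generate the non-context-free language {aⁿbⁿcⁿdⁿ}:

```python
# Toy LCFRS for {a^n b^n c^n d^n}. A has fan-out 2: it derives pairs of
# substrings that the start symbol S finally concatenates.

def g(pair):               # A -> g(A): wraps each component with matching letters;
    x1, x2 = pair          # linear and regular (each variable used exactly once)
    return ("a" + x1 + "b", "c" + x2 + "d")

def f(pair):               # S -> f(A): concatenates the two components
    x1, x2 = pair
    return x1 + x2

def derive(n):
    """Derive a^n b^n c^n d^n by applying g (n-1) times to the base pair."""
    pair = ("ab", "cd")    # terminating rule A -> ("ab", "cd")
    for _ in range(n - 1):
        pair = g(pair)
    return f(pair)
```

Because the b/c boundary and the a/d boundary are kept in separate components of A, the counts stay synchronized across all four blocks, which no context-free grammar can enforce.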
We can define the And-Or normal form of LCFRS in a similar way as for SCFG.
Definition 4.
An LCFRS is in And-Or normal form iff its nonterminal symbols are divided into two disjoint subsets, And-symbols and Or-symbols, such that:

each And-symbol appears on the left-hand side of exactly one production rule, and the number of nonterminal symbols on the right-hand side of the rule plus the number of terminals inserted by the composition function is at least two;

each Or-symbol appears on the left-hand side of one or more rules, in each of which the number of nonterminal symbols on the right-hand side plus the number of terminals inserted by the composition function is exactly one.
Proposition 3.
Any LCFRS can be converted into And-Or normal form with a linear increase in size.
Proof.
The conversion can be done in the same way as for SCFG. ∎
Proposition 4.
Any LCFRS can be represented by a stochastic context-free AOG with a linear increase in size.
Proof.
We first convert the LCFRS into And-Or normal form. We then construct an equivalent stochastic context-free AOG ⟨Σ, N, S, θ, R⟩:

Σ is the set of terminal symbols in the LCFRS.

N is the set of nonterminal symbols in the LCFRS, with a correspondence from And-symbols to And-nodes and from Or-symbols to Or-nodes.

S is the start symbol of the LCFRS.

θ maps the list of substrings represented by a terminal or nonterminal symbol to the list of start/end positions of these substrings in the complete sentence.

R is constructed from the set of production rules in the And-Or normal form LCFRS:

Each rule headed by an And-symbol becomes an And-rule, whose right-hand side includes all the right-hand-side nonterminal symbols of the original rule as well as all the terminal symbols added by the composition function. Note that each of the substrings represented by the And-symbol is formed by the composition function by concatenating terminals and/or substrings represented by the nonterminal symbols on the right-hand side of the rule. The parameter relation enforces that these component substrings are adjacent (by checking their start/end positions), and the parameter function outputs the start/end positions of the concatenated strings.

Each rule headed by an Or-symbol becomes an Or-rule with the same conditional probability, whose right-hand side contains the single right-hand-side nonterminal symbol of the original rule or the single terminal symbol from the composition function.

It is easy to verify that the size of the stochastic context-free AOG is linear in the size of the original LCFRS. ∎
A.3 Constraint-Based Grammar Formalisms
Constraint-based grammar formalisms [8] associate feature structures with nonterminals and use them to specify constraints in the grammar rules.
Definition 5.
A feature structure is a set of attribute-value pairs. The value of an attribute is either an atomic symbol or another feature structure. A feature path in a feature structure is a list of attributes that leads to a particular value.
For example, in the feature structure [Agreement: [Number: singular]], ⟨Agreement Number⟩ is a feature path leading to the atomic value singular.
Definition 6.
A constraint-based grammar formalism is a 4-tuple ⟨Σ, N, S, R⟩:

Σ is a set of terminal symbols;

N is a set of nonterminal symbols;

S ∈ N is a special nonterminal called the start symbol;

R is a set of production rules, each of the form A → α [p, C], where p is the conditional probability P(A → α | A), A ∈ N, α ∈ (N ∪ Σ)+, and C is a set of feature constraints; each nonterminal symbol in the rule is associated with a feature structure; each feature constraint takes the form of either "⟨X feature-path⟩ = atomic-value" or "⟨X feature-path⟩ = ⟨Y feature-path⟩", where X and Y are nonterminal symbols in the rule.
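Feature constraints of these two forms can be evaluated mechanically. The sketch below encodes feature structures as nested dicts; the encoding and names are our own illustration, not part of any formalism's specification:

```python
def follow(fs, path):
    """Walk a feature path (a list of attributes) through a nested-dict feature structure."""
    for attr in path:
        fs = fs[attr]
    return fs

def satisfied(constraint, structures):
    """constraint is either (sym, path, '=', value)      for <sym path> = atomic-value,
    or (sym1, path1, '==', sym2, path2)                  for <sym1 path1> = <sym2 path2>.
    structures maps each nonterminal in the rule to its feature structure."""
    if constraint[2] == "=":
        sym, path, _, value = constraint
        return follow(structures[sym], path) == value
    sym1, path1, _, sym2, path2 = constraint
    return follow(structures[sym1], path1) == follow(structures[sym2], path2)

fs = {"NP": {"Agreement": {"Number": "singular"}},
      "VP": {"Agreement": {"Number": "singular"}}}
ok1 = satisfied(("NP", ["Agreement", "Number"], "=", "singular"), fs)   # atomic form
ok2 = satisfied(("NP", ["Agreement"], "==", "VP", ["Agreement"]), fs)   # path-equality form
```

A full unification-based parser would propagate values rather than merely check them, but the check above is what the parameter relation of Proposition 5 needs.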
Proposition 5.
Any constraint-based grammar formalism can be represented, with a linear increase in size, by a generalization of stochastic context-free AOGs that allows an And-rule to have only one symbol on the right-hand side.
Proof.
We construct an equivalent stochastic context-free AOG ⟨Σ, N, S, θ, R⟩ in which we allow an And-rule to have only one symbol on the right-hand side:

Σ is the set of terminal symbols in the constraint-based grammar formalism.

For N, all the nonterminal symbols of the constraint-based grammar formalism become Or-nodes, and for each production rule we create an And-node.

S is the start symbol of the constraint-based grammar formalism.

θ maps a word represented by a terminal symbol to the start/end positions of the word in the complete sentence, and maps a substring represented by a nonterminal symbol to a feature structure in addition to the start/end positions of the substring.

R is constructed as follows. For each rule A → α [p, C] in the constraint-based grammar formalism, create one Or-rule A → A_α [p] and one And-rule A_α → α, where A_α is a new And-node. Suppose C′ is a copy of C with all the appearances of A changed to A_α. Then the parameter relation of the And-rule is the conjunction of the constraints in C′ that do not involve A_α, plus the constraint that the substrings represented by the child nodes must be adjacent (by checking their start/end positions); the parameter function outputs the start/end positions of the concatenated string as well as a new feature structure constructed according to the constraints in C′ that involve A_α.

It is easy to verify that the size of the stochastic context-free AOG is linear in the size of the original constraint-based grammar formalism. ∎
A.4 Sum-Product Networks
Sum-product networks (SPNs) [9] are a type of deep probabilistic model that can be more compact than traditional graphical models.
Definition 7.
A sum-product network over random variables X_1, …, X_n is a rooted directed acyclic graph. Each leaf node is an indicator x_i or x̄_i (of X_i taking value 1 or 0, respectively). Each non-leaf node is either a sum node or a product node. A sum node computes a weighted sum of its child nodes. A product node computes the product of its child nodes. The value of an SPN is the value of its root node. The scope of a node is the set of variables appearing in its descendant leaf nodes. For an SPN to correctly compute the probability of all evidence, the children of any sum node must have identical scopes, and the children of any product node cannot contain conflicting descendant leaf nodes (i.e., x_i in one child and x̄_i in another).
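The bottom-up evaluation in this definition can be sketched as follows. The node encoding is our own, and the tiny SPN is a made-up example over two binary variables:

```python
import math

def spn_value(node, assignment):
    """Evaluate an SPN bottom-up. A node is one of:
       ('leaf', i, v)             -- indicator [X_i = v]
       ('sum', [(w, child), ...]) -- weighted sum of children
       ('prod', [child, ...])     -- product of children."""
    kind = node[0]
    if kind == "leaf":
        _, i, v = node
        return 1.0 if assignment[i] == v else 0.0
    if kind == "sum":
        return sum(w * spn_value(c, assignment) for w, c in node[1])
    return math.prod(spn_value(c, assignment) for c in node[1])

# Independent product over X0 and X1 with P(X0=1)=0.3 and P(X1=1)=0.8:
# children of each sum node share a scope; the product's children have disjoint scopes.
spn = ("prod", [
    ("sum", [(0.3, ("leaf", 0, 1)), (0.7, ("leaf", 0, 0))]),
    ("sum", [(0.8, ("leaf", 1, 1)), (0.2, ("leaf", 1, 0))]),
])
```

Because the example also satisfies decomposability (Definition 8), the values it computes over all complete assignments sum to one.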
Definition 8.
A decomposable SPN is an SPN in which the children of any product node have disjoint scopes.
It has been shown that any SPN can be converted into a decomposable SPN with a polynomial increase in size [36].
Proposition 6.
Any decomposable SPN can be represented by a stochastic context-free AOG with a linear increase in size.
Proof.
We construct an equivalent stochastic context-free AOG ⟨Σ, N, S, θ, R⟩:

Σ is the set of leaf nodes (indicators) in the SPN.

N is the set of non-leaf nodes in the SPN, with a correspondence from product nodes to And-nodes and from sum nodes to Or-nodes.

S is the root node of the SPN.

θ maps any node instance to null (i.e., we set all the instance parameters to null).

R is constructed as follows: for each product node in the SPN, create an And-rule with the product node as the left-hand side and the set of child nodes as the right-hand side, let the parameter relation be always true, and let the parameter function always return null; for each child node of each sum node in the SPN, create an Or-rule with the sum node as the left-hand side, the child node as the right-hand side, and the normalized weight of the child node as the conditional probability.

As shown in [36], normalizing the child node weights of the sum nodes does not change the distribution modeled by the SPN. Therefore, for any assignment to the random variables, the marginal probability computed by the constructed stochastic context-free AOG and the probability computed by the original SPN are always equal. It is easy to verify that the size of the stochastic context-free AOG is linear in the size of the original SPN. ∎
Note that although SPNs are also general-purpose probabilistic models that can be used in modeling many types of data, stochastic AOGs go beyond SPNs in a few important aspects. Specifically, stochastic AOGs can simultaneously model data samples of different sizes, explicitly model relations, reuse grammar rules over different scopes, and allow recursive rules. These differences make stochastic AOGs better suited for certain domains and applications, e.g., to model recursion in language and translation invariance in computer vision.
Appendix B Computational Complexity of Inference
We prove that the parsing problem of stochastic AOGs (i.e., given a data sample consisting of only terminal nodes, finding its most likely parse) is NP-hard.
Theorem 1.
The parsing problem of stochastic AOGs is NP-hard.
Proof.
Below we reduce 3-SAT to the parsing problem.
For a 3-SAT CNF formula with n variables v_1, …, v_n and m clauses c_1, …, c_m, we construct a stochastic AOG of size polynomial in n and m. The node parameters in this AOG always take the value of null (i.e., no parameter), and accordingly in any And-rule of the AOG the parameter relation always returns true and the parameter function always returns null. For each variable v_i, create one Or-node V_i, two And-nodes T_i and F_i, and two Or-rules V_i → T_i and V_i → F_i with equal probabilities. Create an And-rule S → V_1 V_2 ⋯ V_n, where S is the start symbol. For each clause c_j, create an Or-node C_j, a terminal node t_j, and two Or-rules C_j → t_j and C_j → ∅ with equal probabilities. Here ∅ represents the empty set. For each literal l (which can be either v_i or ¬v_i for some i), let A_l be the corresponding And-node (i.e., T_i or F_i); if l appears in one or more clauses c_{j1}, …, c_{js}, then create an And-rule A_l → C_{j1} ⋯ C_{js}; otherwise create an And-rule A_l → ∅. Note that the constructed AOG does not conform to the standard definition of AOGs in that it contains the empty-set symbol and that some And-rules may have only one child node. However, the constructed AOG can be converted to the standard form with at most a polynomial increase in grammar size. See [31] for a list of CFG conversion approaches, which can be extended to AOGs. For simplicity in the proof, we will still use the nonstandard form of the constructed AOG below.
We then construct a data sample which simply contains all the terminal nodes with no duplication: {t_1, …, t_m}.
We first prove that if the 3-SAT formula is satisfiable, then the most likely parse of the data sample can be found (i.e., there exists at least one valid parse). Given a truth assignment that satisfies the 3-SAT formula, we can construct a valid parse tree. First of all, the parse tree shall contain the start symbol and hence the production S → V_1 V_2 ⋯ V_n. For each variable v_i, if it is true in the assignment, then the parse tree shall contain the production V_i → T_i; if it is false, then the parse tree shall contain the production V_i → F_i. For each clause c_j, select one of its literals that are true and suppose A_l is the corresponding And-node; then the parse tree shall contain the production C_j → t_j along with the production based on the And-rule headed by A_l that produces C_j. In this way, all the terminal nodes in the data sample are covered by the parse tree. Finally, for any node C_j in the parse tree that does not produce t_j, add the production C_j → ∅ to the parse tree. The parse tree construction is now complete.
Next, we prove that if the most likely parse of the data sample can be found, then the 3-SAT formula is satisfiable. For each variable v_i, the parse tree must contain either the production V_i → T_i or the production V_i → F_i but not both. In the former case, we set v_i to true; in the latter case, we set it to false. We can show that this truth assignment satisfies the 3-SAT formula. For each clause c_j in the formula, suppose in the parse tree the corresponding terminal node t_j is a descendant of an And-node A_l (which can be T_i or F_i for some i). Let l be the literal corresponding to A_l. According to the construction of the AOG, clause c_j must contain l. Based on our truth assignment specified above, l must be true and hence c_j is true. Therefore, the 3-SAT formula is satisfied. ∎
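The grammar construction in this reduction can be sketched as follows. The rule encoding and the node names (V, T, F, C, t, EMPTY) are our own rendering of the proof's construction:

```python
def build_aog(num_vars, clauses):
    """Construct the And-Or grammar of the reduction for a 3-SAT instance.
    clauses: list of lists of nonzero ints (i means v_i, -i means NOT v_i).
    Returns (or_rules, and_rules); EMPTY plays the role of the empty set."""
    or_rules, and_rules = [], {}
    # S -> V1 V2 ... Vn (And-rule); Vi -> Ti | Fi (Or-rules, prob 1/2 each)
    and_rules["S"] = tuple(f"V{i}" for i in range(1, num_vars + 1))
    for i in range(1, num_vars + 1):
        or_rules += [(f"V{i}", f"T{i}", 0.5), (f"V{i}", f"F{i}", 0.5)]
    # Cj -> tj | EMPTY (Or-rules, prob 1/2 each)
    for j in range(1, len(clauses) + 1):
        or_rules += [(f"C{j}", f"t{j}", 0.5), (f"C{j}", "EMPTY", 0.5)]
    # each literal's And-node produces the clauses it appears in (or EMPTY if none)
    for i in range(1, num_vars + 1):
        for node, lit in ((f"T{i}", i), (f"F{i}", -i)):
            occ = tuple(f"C{j + 1}" for j, cl in enumerate(clauses) if lit in cl)
            and_rules[node] = occ if occ else ("EMPTY",)
    return or_rules, and_rules

ors, ands = build_aog(2, [[1, -2]])   # single clause (v1 OR NOT v2)
```

The grammar has O(n + m + sum of clause sizes) rules, making the polynomial size bound of the reduction explicit.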
Another inference problem for stochastic AOGs is to compute the marginal probability of a data sample. The proof above can be easily adapted to show that this problem is NP-hard as well (with the same AOG construction, one can show that the 3-SAT formula is satisfiable if and only if the marginal probability is nonzero).
Appendix C Conversion to Generalized Chomsky Normal Form
In our inference algorithm, we assume the input AOG is in a generalized version of Chomsky normal form, i.e., (1) each And-node has exactly two child nodes, which must be Or-nodes, (2) the child nodes of Or-nodes must not be Or-nodes, and (3) the start symbol is an Or-node.
By extending previous approaches for context-free grammars [31], we can convert any AOG into this generalized Chomsky normal form with the following steps. Both the time complexity of the conversion and the size of the new AOG are polynomial in the size of the original AOG.

(START) If the start symbol is an And-node, create a new Or-node as the start symbol that produces the original start symbol.

(BIN) For any And-rule that contains more than two nodes on the right-hand side, replace the And-rule with a set of binary And-rules, i.e., convert A → B_1 B_2 ⋯ B_n (n > 2) to A → B_1 X_1, X_1 → B_2 X_2, …, X_{n−2} → B_{n−1} B_n, where X_1, …, X_{n−2} are new And-nodes. We will discuss how to convert the parameter relation and function later.

(UNIT) For any Or-rule O_1 → O_2 with an Or-node O_2 on the right-hand side, remove the Or-rule, and for each Or-rule O_2 → β create a new Or-rule O_1 → β (unless it already exists in the grammar).

(ALT) If an And-rule contains an And-node or terminal node on the right-hand side, replace the node with a new Or-node that produces the node.
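The BIN step on the rule skeletons can be sketched as follows; parameter relations and functions are omitted here, and the fresh-name scheme is our own (a real implementation would guarantee globally unique intermediate names):

```python
def binarize(and_rules):
    """BIN: replace each And-rule A -> B1 B2 ... Bn (n > 2) with a chain of
    binary rules A -> B1 X1, X1 -> B2 X2, ..., X_{n-2} -> B_{n-1} Bn."""
    out = {}
    for lhs, rhs in and_rules.items():
        while len(rhs) > 2:
            fresh = f"{lhs}_{len(rhs)}"        # new intermediate And-node
            out[lhs] = (rhs[0], fresh)         # peel off the leftmost child
            lhs, rhs = fresh, rhs[1:]          # recurse on the remaining suffix
        out[lhs] = rhs
    return out

binary = binarize({"A": ("B", "C", "D", "E")})
```

A rule with n children becomes n − 1 binary rules, so the step adds only linearly many rules per original rule.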
In the BIN step, we have to binarize the parameter relation and function along with the production rule. Ideally, the relation and function of the original And-rule A → B_1 B_2 ⋯ B_n factorize over the binary rules, so that each binary rule in the chain has its own relation and function defined only on the parameters of its two right-hand-side nodes, and composing them along the chain recovers the original relation and function.
In some cases (e.g., the example AOG of line drawings in the main text), the parameter relation and function can be naturally factorized into this form. In general, however, we have to cache the parameters of the right-hand-side nodes of the And-rule in the intermediate parameters: we let the parameter of each new node X_i be the tuple ⟨x_{B_{i+1}}, …, x_{B_n}⟩ of the parameters of its descendant right-hand-side nodes. We then define the relation of each intermediate rule X_i → B_{i+1} X_{i+1} to be always true, with the parameter function returning the concatenation of x_{B_{i+1}} and the cached tuple of X_{i+1}, and define the relation and function of the top rule A → B_1 X_1 by applying the original relation and function to x_{B_1} and the parameters cached in X_1.
Note that the sizes of the intermediate parameters can be polynomial in n, the length of the right-hand side of the original And-rule. This actually violates the requirement that the parameter size be upper bounded by a constant. Nevertheless, when running our inference algorithm on the resulting Chomsky normal form AOG, the inference time complexity is only slightly affected, with the last factor changed to a function polynomial in n and the original parameter sizes, and hence the condition for tractable inference remains unchanged.