Mildly context sensitive grammar induction and variational bayesian inference


Eva Portelance (corresponding author), Dept. of Linguistics, Stanford University; Leon Bergen, Linguistics Dept., UC San Diego; Chris Bruno, Dept. of Linguistics, McGill University; Timothy J. O'Donnell, Dept. of Linguistics, McGill University
June 27, 2019

We define a generative model for a minimalist grammar formalism. We present a generalized algorithm for the application of variational Bayesian inference to lexicalized mildly context sensitive grammars. We apply this algorithm to the minimalist grammar model.

1 Introduction

It is well known that natural language exhibits complex syntactic relations such as long-distance dependencies and crossing dependencies, as in the two examples below.

(1) Long-distance dependency:
What did you see GAP?
The fronted wh-word what is linked to the object gap of see.

(2) Crossing dependency (Swiss German; Shieber, 1985):
…mer d'chind em Hans es huus lönd hälfe aastriiche
…we the children.ACC Hans.DAT the house.ACC let help paint
'…we let the children help Hans paint the house'
Each of the three verbs lönd, hälfe and aastriiche takes as its object one of the three preceding noun phrases (d'chind, em Hans and es huus, respectively), so the object dependencies cross.

Some of these dependencies cannot be expressed by simple context-free grammars (CFG). On the other hand, context-sensitive grammars (CSG), which can derive such relations, are considered unnecessarily expressive and thus too costly. For this reason, the class of mildly context-sensitive grammar formalisms (MCSG), first characterised by Joshi (1985), is best suited for deriving natural language: these formalisms are capable of expressing long-distance and crossing dependencies, but do not introduce excessive expressive power. Many different formalisms have emerged within this equivalency class, including Tree Adjoining grammars (TAG) (Joshi et al., 1969), Combinatory Categorial grammars (CCG) (Steedman, 1987; Szabolcsi, 1992), and the slightly more expressive Minimalist grammars (MG) (Stabler, 1997), Linear Context Free Rewriting Systems (LCFRS) (Weir, 1988) and Multiple Context Free grammars (MCFG) (Seki et al., 1991).

Probabilistic language models which use these formalisms allow us to build parsing systems which can capture the right semantic interpretations and to develop machines which interact with people more fluently. Such models are adaptable to new inputs and can help manage ambiguities which exist in natural language, as in the following example, where worn is ambiguous between a passive verb and an adjective:

  1. The sweater was worn (by Mary). - Passive verb

  2. The sweater was (very) worn. - Adjective

From a theoretical point of view, probabilistic language models present a standardised testing ground for hypotheses about how given syntactic structures and relations, such as long-distance and crossing dependencies, are learnt and pattern within natural language. Probabilistic inference has been applied to CCG (Bisk and Hockenmaier, 2013, 2012a, 2012b; Bisk, 2015; Wang, 2016) and TAG (Blunsom and Cohn, 2010), but none of these use a variational Bayesian approach. The variational Bayesian approach, first developed by Attias (1999), has been shown to be more efficient than sampling-based approaches (Zhai, 2014). It turns an inference problem into an optimization problem, which can be more tractable when working with complex generative models. This method has been applied to CFG (Kurihara and Sato, 2004) and Adaptor Grammars (Cohen and Smith, 2010), but has yet to be applied to the wider class of MCSG formalisms, including MGs, which we propose to do in this paper. In section 2, we first define a MG formalism and then present the generative model for this grammar. In section 3, we present our generalized algorithm for variational Bayesian inference applied to the MG described in the previous section, along with a set of conditions for its application to other equivalent formalisms.

2 Minimalist grammar induction

2.1 The choice of MG

Minimalist Grammars are a formalization of Chomsky (1995)'s Minimalist program, the theoretical framework underlying much of the current research in syntactic theory. MGs are classified as MCSG and are weakly equivalent to Multiple Context Free Grammars (MCFG) (Harkema, 2001). The interest in studying these grammars is that they offer a direct line of comparison with syntactic structure predictions in the minimalist linguistic literature, as well as a direct mapping between derivational and semantic relations within a given clause.

In the subsections which follow, we present a working definition of our grammar and a generative model, to which we then apply variational Bayesian inference to do unsupervised induction in section 3.

2.2 Defining a MG

We begin by presenting a more standard definition of a MG before introducing our generative model. The following MG is based on the formalisations of Harkema (2001) and Stabler (2011).

Let G be a MG, such that G = (Σ, F, Op), where:

  • Σ is the set of terminals, that is, all the possible phonological realisations of words and strings in a given language.

  • F is the set of syntactic and semantic features used to define lexical items. There exist the following subclasses of syntactic features which interact with the structure building operations in Op:

    1. category (e.g. v, d, p) - define the syntactic categories (verb, noun …);

    2. selector (e.g. =d^L, =p^R) - select arguments;

    3. licensor (e.g. +case, +wh) - select moving lexical items;

    4. licensee (e.g. -case, -wh) - selected moving lexical items.

  • Op is the set of structure building operations.

In an MG, the sequence of features determines which lexical items can merge. Each lexical item is categorized by one category feature. If a lexical item has selector features, these are checked by merging with lexical items carrying the corresponding category feature which have not yet merged into the derivation. Licensor features are used to move previously merged lexical items with the corresponding licensee feature.
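As an illustration of this feature-checking discipline, the following sketch (our own, not part of the formalism's definition; all feature names are illustrative) classifies features by their prefix and tests whether a head can merge with an argument:

```python
# A minimal sketch of feature-driven merge: a lexical item is a list of
# features consumed left to right. Feature kinds: category ("v"),
# selector ("=d^L"), licensor ("+wh"), licensee ("-wh").

def kind(feature):
    """Classify a feature string by its prefix."""
    if feature.startswith("="):
        return "selector"
    if feature.startswith("+"):
        return "licensor"
    if feature.startswith("-"):
        return "licensee"
    return "category"

def can_merge(head_features, arg_features):
    """The head's first feature must be a selector whose category
    matches the argument's first (category) feature."""
    if not head_features or not arg_features:
        return False
    sel, cat = head_features[0], arg_features[0]
    return (kind(sel) == "selector"
            and kind(cat) == "category"
            and sel.lstrip("=").rstrip("^LR") == cat)

# 'see' carries =d^L and can select a d-argument such as 'what'
print(can_merge(["=d^L", "v"], ["d", "-wh"]))  # True
```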

Let ℓ be a lexical item such that ℓ consists of a phonological form in Σ* and a feature sequence in F*, and let θ_ℓ be a probability weight on ℓ, such that 0 ≤ θ_ℓ ≤ 1. Let Lex be a lexicon and θ the vector of probabilities on its lexical items.

Op contains two structure building operations, merge and move, defined below. These operations apply to lexical items, both simple (terminals) and complex (non-terminals), and are defined as follows:

For every ℓ ∈ Lex, where the syntactic features of ℓ form a sequence f and the phonetic features of ℓ cover the words of a string s, we note the axiom s : f. merge is defined by the following cases:

  • merge-L (merge a non-moving item to the left)

  • merge-R (merge a non-moving item to the right)

  • merge-m (merge an eventually moving item)

move is defined by the following cases:
  • move-1 (move an item to its final landing position)

  • move-2 (move an item which will move again)

Here, chains are items which have merged into the derivation but still carry licensee features, i.e. items which have not yet reached their final landing position in the syntactic structure.

The following is an example of a derivation in this system, shown as an indented derivation tree (each node is labelled with its remaining features; the yield of each derived item appears in parentheses, with the moving item what carried along as a separate chain).

Derivation of 'what did you see?':

what did you see  (result of move applying to the root)
  merge: +wh c, -wh
    =i^R +wh c  (empty complementizer)
    merge: i, -wh  (did you see, what)
      =v^R i: did
      merge: v, -wh  (you see, what)
        d: you
        merge: =d^L v, -wh  (see, what)
          =d =d^L v: see
          d -wh: what

  1. merge-m: v selects d ('what');

  2. merge-L: v selects d ('you');

  3. merge-R: i selects v ('you see, what'), where 'what' is a lexical item which still has a licensee feature;

  4. merge-R: c selects i ('did you see, what');

  5. move: -wh moves to satisfy +wh ('what did you see');

  6. All features are satisfied.
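The feature checking in these steps can be replayed mechanically. The sketch below is our own illustration: the lexicon is read off the example tree (the empty complementizer is written "eps"), and each merge consumes the head's first selector against the argument's category feature, leaving move to check the remaining +wh against what's -wh.

```python
# Toy lexicon read off the example derivation (illustrative encoding).
lexicon = {
    "what": ["d", "-wh"],
    "you":  ["d"],
    "see":  ["=d", "=d^L", "v"],
    "did":  ["=v^R", "i"],
    "eps":  ["=i^R", "+wh", "c"],   # empty complementizer
}

def merge(head, arg):
    """Check that the head's first selector matches the argument's
    category feature, and return the remaining head features."""
    sel, cat = head[0], arg[0]
    assert sel.lstrip("=").rstrip("^LR") == cat, (sel, cat)
    return head[1:]

# merge-m: see selects what; merge-L: the result selects you
vp = merge(merge(lexicon["see"], lexicon["what"]), lexicon["you"])
ip = merge(lexicon["did"], vp)   # merge-R: did selects the v-phrase
cp = merge(lexicon["eps"], ip)   # merge-R: c selects the i-phrase
print(cp)  # ['+wh', 'c']: move then checks +wh against what's -wh
```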

For a more descriptive account of how derivations of structures in this MG work, we refer the reader to Harasim et al. (2017), a detailed account of our grammar formalism and of the parser we use to build our initial derivations for test corpora.

2.3 The generative model

The following generative model builds on the concepts presented in the previously defined MG. However, to guarantee that no probability mass is given to derivations which crash, i.e. cannot conclude due to mismatches in licensor and licensee features, the following generative model builds derivations top-down, rather than bottom-up as most MG formalizations do. Thus, we begin every derivation with the category feature of a lexical item. We then proceed to unmerge and unmove down the derivation tree. We will refer to lexical items' feature sets as chains in the derivation.

For every unmove operation we have exactly one licensor feature (+f_i), which belongs to the head chain of the derivation, and one licensee feature (-f_i), which is added as a new chain. For every unmerge operation there is exactly one selector feature (=x_i), which is added to the head branch, and one category feature, which is added to the argument branch. Argument branches can also take a set of licensee features from the chains which follow the head chain of the head branch, in one of two ways: by creating a new chain with the licensee features or by adding them to the chain of the argument branch head.

We can define unmerge and unmove as follows:
Let c, x be category features, =x a selector feature, and -f a licensee feature; chains are written in square brackets. unmerge is defined by the following cases:

  • unmerge-1 (licensee features stay in head branch)

  • unmerge-2 (licensee features move to merged chains of argument branch)

  • unmerge-3 (licensee features move to head chain of argument branch)

  • unmerge-4 (licensee features move to both head chain and merged chains of argument branch)



  • unmove-1

As a result of how these rules are defined, a chain which is not the head can contain at most one licensee feature. We define the spine of a derivation tree as the path from the root through all nodes headed by the root's category feature. Every derivation tree of a given lexical item will have the same spine.

To simplify the model, we will assume that our grammar contains only one licensee feature: -f. Thus, a derivation can have at most one open chain. This means that if we apply unmerge so that a head chain with a licensee chain [-f] requires an argument of category x, there are only three possible outcomes: either we apply unmerge-1 and the licensee chain stays on the head branch; or we apply unmerge-2 and the licensee chain joins the argument branch; or finally we apply unmerge-3 and the licensee feature is added to the head chain of the argument branch. We can therefore define 3|C| conditional distributions, where C is the set of category features in our grammar.
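To make the counting concrete, here is a small sketch (the category set is an assumption for illustration, not the paper's) of the bookkeeping implied above: one conditional distribution per (unmerge outcome, argument category) pair.

```python
# With a single licensee feature -f, unmerging a head that carries [-f]
# has three possible outcomes, so the model maintains 3|C| conditional
# distributions, one per (outcome, argument category) pair.
from itertools import product

categories = ["v", "d", "i", "c", "p"]   # an assumed category set C
outcomes = ["unmerge-1", "unmerge-2", "unmerge-3"]

# one conditional distribution per pair (empty placeholders here)
distributions = {pair: {} for pair in product(outcomes, categories)}
print(len(distributions))  # 15 = 3 * |C|
```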

Let G(c) return a distribution over lexical items of category c, where c ∈ C. Let θ_ℓ = p(ℓ | c), where ℓ is a lexical item.

We define a Dirichlet prior over θ, parameterized by α, a distribution over categories of lexical items.




Using variational Bayes, we can infer the optimal estimated distribution of θ.

3 The variational bayesian inference approach

The variational Bayesian approach provides an approximation of the true posterior distribution.


We wish to approximate the posterior over parameters and latent derivations with a variational distribution q. One way of optimizing this approximation is by minimizing the Kullback-Leibler divergence between q and the true posterior. This method is called 'mean-field variational Bayes'.


Minimizing the Kullback-Leibler divergence is equivalent to maximizing the lower bound on the log likelihood (Kurihara and Sato, 2004).
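The equivalence referred to here can be written out explicitly. Writing q for the variational distribution over latent derivations D and parameters θ given data X (notation assumed for this sketch), the standard decomposition of the log evidence is:

```latex
\log p(X)
= \underbrace{\mathbb{E}_{q(D,\theta)}\!\left[\log \frac{p(X, D, \theta)}{q(D,\theta)}\right]}_{\mathcal{F}(q)}
\; + \; \mathrm{KL}\!\left(q(D,\theta)\,\|\,p(D,\theta \mid X)\right).
```

Since log p(X) does not depend on q, maximizing the lower bound F(q) is the same as minimizing the KL divergence to the true posterior.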

3.1 MG induction using variational bayesian inference

Let us define our corpus X as X = (x_1, …, x_n), and the set of possible derivations for X as D. We can now define the posterior we are interested in as p(θ, D | X).


To optimize q by minimising the Kullback-Leibler divergence with the true posterior, we need an updating schema. By making the mean-field assumption, we can create the foundations of such a schema. Using the mean-field approximation method, we define our variational distribution as

q(θ, D) = q(θ) ∏_i q(d_i),

where d_i is some derivation of x_i.

We can now proceed to update q(θ) and q(D) in alternation, iterating until our variational approximation approaches the true posterior.

First, we update the parameters of q(θ). Following Kurihara and Sato (2004), q(θ) is a Dirichlet distribution whose parameters are updated as

α̂(ℓ) = α(ℓ) + ∑_i ∑_{d ∈ Δ(x_i)} q(d) c(ℓ, d).
Here α is a hyperparameter of the Dirichlet prior, Δ(x_i) returns all the possible derivations for x_i, and c(ℓ, d) returns the count of ℓ in d, given the current estimate of the grammar, where expected lexical item counts are calculated using Inside-Outside. Thus, we can proceed to update q(θ) as the Dirichlet distribution with parameters α̂.
Second, we update q(D). Each q(d_i) is re-estimated by re-weighting the lexical items it contains according to

π(ℓ) = exp(ψ(α̂(ℓ)) − ψ(∑_{ℓ'} α̂(ℓ'))),

where ψ is the digamma function.

By updating q(θ) and q(D) in alternation, given that they are codependent, we can optimize q. From q(θ), a probability distribution over distributions of lexical items, we can infer an estimated optimal probabilistic lexicon for the grammar.
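The alternating updates can be sketched in a few lines. This is a minimal illustration under assumed notation, not the paper's implementation: alpha_hat plays the role of the updated Dirichlet parameters, and the expected counts are made-up numbers standing in for an Inside-Outside pass.

```python
# One variational Bayes iteration: update the Dirichlet parameters from
# expected lexical item counts, then compute the per-item weights
# exp(psi(alpha_hat) - psi(sum alpha_hat)) used to re-score derivations.
import math

def digamma(x):
    """Digamma via the recurrence psi(x) = psi(x+1) - 1/x plus the
    standard asymptotic expansion for large arguments."""
    result = 0.0
    while x < 6.0:                # shift into the asymptotic regime
        result -= 1.0 / x
        x += 1.0
    return result + math.log(x) - 1/(2*x) - 1/(12*x**2) + 1/(120*x**4)

def update(alpha, expected_counts):
    """q(theta) from expected counts, then the per-lexical-item weights
    that re-score derivations in q(D)."""
    alpha_hat = {l: alpha + expected_counts[l] for l in expected_counts}
    total = sum(alpha_hat.values())
    return {l: math.exp(digamma(a) - digamma(total))
            for l, a in alpha_hat.items()}

# illustrative expected counts from one Inside-Outside pass
weights = update(0.5, {"what": 1.2, "see": 2.6, "you": 1.2})
print(sum(weights.values()) < 1.0)  # True: VB weights are sub-normalized
```

The sub-normalization of the weights is what distinguishes this update from plain EM, which would use the normalized counts directly.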

3.2 Variational inference for lexicalized MCSG formalisms

Given that variational Bayesian inference transforms an inference problem into an optimization problem, unlike sampling-based methods of inference, the method can easily be translated to work for other models as long as they satisfy certain assumptions. The previous example of variational inference applied to a MG was built on three assumptions.
Assumptions for generalized variational Bayesian inference for MCSGs:

  1. The latent derivation trees are context-free;

  2. The probabilities of the trees in 1 can be decomposed into the product of their components’ probabilities;

  3. There is a Dirichlet prior over lexical items/rules.

The first assumption supposes that each lexical item is well-formed and conditionally independent; however, the linearization assumption usually made for CFGs is not required of MCSG derivation trees. The variational inference schema presented here can be generalized to any lexicalized MCSG formalism that satisfies these three assumptions.
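Assumption 2 can be illustrated with a small sketch (the tree encoding and the probabilities are our own assumptions for illustration): the probability of a context-free latent derivation is simply the product of the probabilities of the lexical items it contains.

```python
# Assumption 2: with context-free latent derivations, a tree's
# probability decomposes into a product over its components.

def tree_prob(tree, theta):
    """tree: (item, [children]); theta: item -> probability."""
    item, children = tree
    p = theta[item]
    for child in children:
        p *= tree_prob(child, theta)
    return p

theta = {"see": 0.2, "what": 0.1, "you": 0.25}
tree = ("see", [("what", []), ("you", [])])
print(tree_prob(tree, theta))  # ≈ 0.005 = 0.2 * 0.1 * 0.25
```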

4 Conclusion

In this paper, we presented a generalized variational Bayesian approach to MCSG induction. We demonstrated how this approach can be applied to a specific grammar instantiation within this equivalency class using the case of a MG. In future work, we hope to implement a computable version of this algorithm, to be integrated into a MCSG parsing framework which is also currently being developed (Harasim et al., 2017). This new framework will then be used to test linguistic syntactic theories and to provide a baseline for future development of unsupervised language learning models, as well as more human-like natural language processing applications.


  • Attias (1999) H. Attias. Inferring parameters and structure of latent variable models by variational bayes. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 21–30. Morgan Kaufmann Publishers Inc., 1999.
  • Bisk and Hockenmaier (2012a) Y. Bisk and J. Hockenmaier. Induction of linguistic structure with combinatory categorial grammars. In Proceedings of the NAACL-HLT Workshop on the Induction of Linguistic Structure, pages 90–95. Association for Computational Linguistics, 2012a.
  • Bisk and Hockenmaier (2012b) Y. Bisk and J. Hockenmaier. Simple robust grammar induction with combinatory categorial grammars. In AAAI, 2012b.
  • Bisk and Hockenmaier (2013) Y. Bisk and J. Hockenmaier. An hdp model for inducing combinatory categorial grammars. Transactions of the Association for Computational Linguistics, 1:75–88, 2013.
  • Bisk (2015) Y. Y. Bisk. Unsupervised Grammar Induction with Combinatory Categorial Grammars. PhD thesis, University of Illinois at Urbana-Champaign, 2015.
  • Blunsom and Cohn (2010) P. Blunsom and T. Cohn. Unsupervised induction of tree substitution grammars for dependency parsing. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1204–1213. Association for Computational Linguistics, 2010.
  • Chomsky (1995) N. Chomsky. The Minimalist Program (Current Studies in Linguistics 28). MIT Press, Cambridge, MA, 1995.
  • Cohen and Smith (2010) S. B. Cohen and N. A. Smith. Covariance in unsupervised learning of probabilistic grammars. Journal of Machine Learning Research, 11(Nov):3017–3051, 2010.
  • Harasim et al. (2017) D. Harasim, C. Bruno, E. Portelance, and T. J. O’Donnell. A generalized parsing framework for abstract grammars technical report. arXiv, page submit/2054453, 2017.
  • Harkema (2001) H. Harkema. Parsing Minimalist Languages. PhD thesis, University of California, Los Angeles, 2001.
  • Joshi (1985) A. K. Joshi. Tree adjoining grammars: How much context-sensitivity is required to provide reasonable structural descriptions? In D. R. Dowty, L. Karttunen, and A. M. Zwicky, editors, Natural Language Parsing. Cambridge University Press, 1985.
  • Joshi et al. (1969) A. K. Joshi, S. R. Kosaraju, and H. Yamada. String adjunct grammars. In 10th Annual Symposium on Switching and Automata Theory (swat 1969), pages 245–262, Oct 1969. doi: 10.1109/SWAT.1969.23.
  • Kurihara and Sato (2004) K. Kurihara and T. Sato. An application of the variational bayesian approach to probabilistic context-free grammars. In IJCNLP-04 Workshop beyond shallow analyses, 2004.
  • Seki et al. (1991) H. Seki, T. Matsumura, M. Fujii, and T. Kasami. On multiple context-free grammars. Theoretical Computer Science, 88(2):191–229, 1991.
  • Shieber (1985) S. M. Shieber. Evidence against the context-freeness of natural language. The Formal complexity of natural language, 33:320–332, 1985.
  • Stabler (1997) E. Stabler. Derivational minimalism. Logical Aspects of Computational Linguistics, pages 68–95, 1997.
  • Stabler (2011) E. P. Stabler. Computational perspectives on minimalism. Oxford handbook of linguistic minimalism, pages 617–643, 2011.
  • Steedman (1987) M. Steedman. Combinatory grammars and parasitic gaps. Natural Language & Linguistic Theory, 5(3):403–439, 1987.
  • Szabolcsi (1992) A. Szabolcsi. Combinatory grammar and projection from the lexicon. Lexical matters, 1192, 1992.
  • Wang (2016) A. X. Wang. Linguistically Motivated Combinatory Categorial Grammar Induction. PhD thesis, University of Washington, 2016.
  • Weir (1988) D. J. Weir. Characterizing mildly context-sensitive grammar formalisms. PhD thesis, University of Pennsylvania, 1988.
  • Zhai (2014) K. Zhai. Models, Inference, and Implementation for Scalable Probabilistic Models of Text. PhD thesis, 2014.