Mildly context-sensitive grammar induction and variational Bayesian inference
Abstract
We define a generative model for a minimalist grammar formalism. We present a generalized algorithm for the application of variational Bayesian inference to lexicalized mildly context-sensitive grammars. We apply this algorithm to the minimalist grammar model.
1 Introduction
It is well known that natural language exhibits complex syntactic relations such as long-distance dependencies and crossing dependencies, as in examples 1 and 2.
{exe}
\ex Long-distance dependency.
{dependency}[theme=default]
{deptext}[column sep=1em]
What & did & you & see & GAP.
\depedge[edge unit distance=1ex]45OBJ
\depedge[edge unit distance=0.5ex]51WH
\ex Crossing dependency. (Swiss German (Shieber, 1985))
\gll…mer d’chind em Hans es huus lönd hälfe aastriiche
…we the children.ACC Hans.DAT the house.ACC let help paint
\trans’…we let the children help Hans paint the house’
{dependency}[theme=default]
{deptext}[column sep=1em]
…d’chind & em Hans & es huus & lönd & hälfe & aastriiche
\depedge[edge unit distance=0.5ex]41OBJ
\depedge[edge unit distance=1ex]52OBJ
\depedge[edge unit distance=1.5ex]63OBJ
Some of these dependencies cannot be expressed by simple Context Free grammars (CFG). On the other hand, Context sensitive grammars (CSG), which can derive such relations, are considered unnecessarily expressive and thus too costly. For this reason, the class of Mildly Context Sensitive grammar formalisms (MCSG), first characterised by Joshi (1985), is best suited for deriving natural language, since these formalisms are capable of expressing long-distance and crossing dependencies without introducing excessive expressive power. Many different formalisms have emerged within this equivalency class, including Tree Adjoining grammars (TAG) (Joshi et al., 1969), Combinatory Categorial grammars (CCG) (Steedman, 1987; Szabolcsi, 1992), and the slightly more expressive Minimalist grammars (MG) (Stabler, 1997), Linear Context Free Rewriting Systems (LCFRS) (Weir, 1988), and Multiple Context Free grammars (MCFG) (Seki et al., 1991).
Probabilistic language models which use these formalisms allow us to build parsing systems which can capture the right semantic interpretations and to develop machines which interact with people more fluently. Such models are adaptable to new inputs and can help manage ambiguities which exist in natural language, as in example 3. {exe} \ex

The sweater was worn (by Mary).  Passive verb

The sweater was (very) worn.  Adjective
From a theoretical point of view, probabilistic language models present a standardised testing ground for hypotheses about how given syntactic structures and relations, such as long-distance and crossing dependencies, are learnt and pattern within natural language. Probabilistic inference has been applied to CCG (Bisk and Hockenmaier, 2013, 2012a, 2012b; Bisk, 2015; Wang, 2016) and TAG (Blunsom and Cohn, 2010), but none of these use a variational Bayesian approach. The variational Bayesian approach, first developed by Attias (1999), has been shown to be more efficient than sampling-based approaches (Zhai, 2014). It turns an inference problem into an optimization problem, which can be seen as more tractable when working with complex generative models. This method has been applied to CFGs (Kurihara and Sato, 2004) and Adaptor Grammars (Cohen and Smith, 2010), but has yet to be applied to the wider class of MCSG formalisms, including MGs, which we propose to do in this paper. In section 2, we first define an MG formalism and then present the generative model for this grammar. In section 3, we present our generalized algorithm for variational Bayesian inference applied to the MG described in the previous section, along with a set of conditions for its application to other equivalent formalisms.
2 Minimalist grammar induction
2.1 The choice of MG
Minimalist Grammars are a formalization of Chomsky (1995)'s Minimalist program, a language processing model which is the theoretical ground for much of the current research in linguistic syntactic theory. MGs are classified as MCSG and are equivalent to Multiple Context Free Grammars (MCFG) (Harkema, 2001). The interest in studying these grammars is that they offer a direct line of comparison for resulting syntactic structure predictions within the linguistic minimalist literature, as well as a direct mapping between derivational and semantic relations within a given clause.
In the subsections which follow, we present a working definition of our grammar and a generative model, to which we then apply our variational Bayesian inference method for unsupervised induction in section 3.
2.2 Defining a MG
Let $G$ be an MG, such that $G = \langle \Sigma, F, Op \rangle$.

$\Sigma$ is the set of terminals, that is, all the possible phonological realisations of words and strings in a given language.

$F$ is the set of syntactic and semantic features used to define lexical items. There exist the following subclasses of syntactic features which interact with structure building operations:

category (e.g. v, d, p): define the syntactic categories (verb, noun, …);

selector (e.g. =d^L, =p^R): select arguments;

licensor (e.g. +case, +wh): select moving lexical items;

licensee (e.g. case, wh): mark moving lexical items to be selected.


$Op$ is the set of structure building operations.
In an MG, the sequence of features determines which lexical items can merge. Each lexical item is categorized by one category feature. If a lexical item has selector features, these are checked by merging with lexical items bearing the corresponding category feature which have not yet merged into the derivation. Licensor features are used to move previously merged lexical items with the corresponding licensee feature.
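To make the feature-checking mechanism concrete, here is a minimal Python sketch; all names and the feature encoding (strings like "=d^L") are our own illustration, not part of the formalism's definition:

```python
# Illustrative sketch (hypothetical encoding, not the paper's implementation):
# lexical items as ordered feature sequences, with merge licensed by a
# matching selector/category pair.
from dataclasses import dataclass

@dataclass
class LexicalItem:
    phon: str            # phonological form
    features: list       # ordered feature sequence, e.g. ["=d", "=d^L", "v"]

def can_merge(head: LexicalItem, arg: LexicalItem) -> bool:
    """A head whose first feature is a selector =x can merge with an
    argument whose first feature is the category x."""
    if not head.features or not arg.features:
        return False
    sel, cat = head.features[0], arg.features[0]
    # strip the "=" and any directionality marker like "^L"/"^R"
    return sel.startswith("=") and sel[1:].split("^")[0] == cat

see = LexicalItem("see", ["=d", "=d^L", "v"])
what = LexicalItem("what", ["d", "wh"])     # category d, licensee wh
print(can_merge(see, what))                 # True: =d selects category d
```

The same check drives every merge step in the derivation example below: each application consumes the head's first selector feature and the argument's category feature.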
Let $l = \langle \sigma, \phi \rangle$ be a lexical item such that $\sigma \in \Sigma^*$ and $\phi \in F^*$, and let $\theta_l$ be a probability weight on $l$, such that $0 \leq \theta_l \leq 1$. Let $L$ be a lexicon and $\theta$ a vector of probabilities on lexical items.
$Op$ contains two structure building operations, merge and move, defined below. These operations apply to lexical items, both simple (terminals) and complex (non-terminals).
For every lexical item whose syntactic features are $\phi$ and whose phonetic features cover words of a string $s$, we note the corresponding axiom.
We then define:
{exe}
\exmerge

mergeL (merge a non-moving item to the left)

mergeR (merge a non-moving item to the right)

mergem (merge an eventually moving item)
move

move1 (move an item to its final landing position)

move2 (move an item which will move again)
Here, chains represent items which have merged into the derivation but still carry licensee features, i.e. items which have not yet reached their final landing position in the syntactic structure.
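The two move cases above can be sketched as a single function over a head's features and its chains; this is an illustrative simplification under assumed encodings (strings paired with remaining feature lists), not the formal definition:

```python
# Hedged sketch of move: a derivation carries chains (merged-but-unfinished
# items); a licensor +f on the head attracts the chain whose next feature is f.
def move(head_features, chains):
    """chains: list of (string, remaining_features).
    Returns (remaining head features, remaining chains, landed string or None)."""
    assert head_features[0].startswith("+")
    f = head_features[0][1:]                       # "+wh" -> "wh"
    for i, (s, feats) in enumerate(chains):
        if feats[0] == f:
            rest = feats[1:]
            new_chains = chains[:i] + chains[i + 1:]
            if rest:                               # move2: item will move again
                new_chains.append((s, rest))
                return head_features[1:], new_chains, None
            return head_features[1:], new_chains, s  # move1: final landing
    raise ValueError("no matching licensee: derivation crashes")

feats, chains, landed = move(["+wh", "c"], [("what", ["wh"])])
print(feats, chains, landed)  # ['c'] [] what
```

The `ValueError` branch corresponds to a crashing derivation: a licensor with no matching licensee leaves features unchecked, which is exactly the case the top-down generative model in section 2.3 is designed to exclude.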
The following figure gives an example derivation in this system.
{exe}
\ex Derivation of ’what did you see?’
\Tree[.move
c
what did you see [.merge
+wh c, wh
[.=i^R +wh c ]
[.merge
i, wh
(did you see, what) [.=v^R i
did ]
[.merge
v, wh
(you see, what)
[.d
you ]
[.merge
=d^L v, wh
(see, what)
[.=d =d^L v
see ]
[.d wh
what ] ] ] ] ] ]

mergem: v selects d what;

mergeL: v selects d you;

mergeR: i selects v (you see, what), where what is a lexical item which still has a licensee feature;

mergeR: c selects i (did you see, what);

move: the wh item moves to satisfy +wh, yielding what did you see;

All features are satisfied.
For a more descriptive account of how derivations of structures in this MG work, we refer you to Harasim et al. (2017), a detailed account of our grammar formalism and of the parser we use to build our initial derivations for test corpora.
2.3 The generative model
The following generative model builds on the concepts presented in the previously defined MG. However, to guarantee that no probability mass is given to derivations which crash and cannot conclude due to mismatches in licensor and licensee features, the following generative model presents derivations in a top-down fashion, instead of generating them bottom-up as most MG formalizations do. Thus, we begin every derivation with the category feature of a lexical item. We then proceed to unmerge and unmove down the derivation tree. We will refer to lexical items' feature sets as chains in the derivation.
For every unmove operation we have exactly one licensor feature (+f_i), which belongs to the head chain of the derivation, and one licensee feature (f_i), which is added as a new chain. For every unmerge operation there is exactly one selector feature (=x_i), which is added to the head branch, and one category feature, which is added to the argument branch. Argument branches can also take a set of licensee features from the chains which follow the head chain of the head branch in one of two ways: by creating a new chain with the licensee features or by adding them to the chain of the argument branch head.
We can define unmerge and unmove as follows:
Let $c, x$ be category features, let $f_1, \dots, f_n$ be selectee features, let $=\!x$ be a selector feature, and let $\gamma$ be a set of features. We use $[\,\cdot\,]$ to indicate a chain and $\{\,\cdot\,\}$ to indicate a branch.
{exe}
\exunmerge

unmerge1 (licensee features stay in head branch)

unmerge2 (licensee features move to merged chains of argument branch)

unmerge3 (licensee features move to head chain of argument branch)

unmerge4 (licensee features move to both head chain and merged chains of argument branch)
unmove

unmove1
As a result of how these rules are defined, a chain which is not the head can contain at most one licensee feature. We define the spine of a derivation tree as the path from the root through all nodes headed by the root's category feature. Every derivation tree of a given lexical item will have the same spine.
To simplify the model, we will assume that our grammar contains only one licensee feature: f. Thus, a derivation can have at most one open chain. This means that if we apply unmerge so that a head chain with a licensee chain [f] requires an argument of category x, there are only three possible outcomes: either we apply unmerge1 and the licensee chain stays on the head branch, or we apply unmerge2 and the licensee chain joins the argument branch, or finally we apply unmerge3 and the licensee feature is added to the head chain of the argument branch. We can therefore define $3|C|$ conditional distributions, where $C$ is the set of category features in our grammar.
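The resulting parameterization can be counted out in a toy sketch; the category set below is an assumed illustration, not the grammar induced in our experiments:

```python
# Toy illustration of the parameterization: with a single licensee feature,
# each category pairs with one of three unmerge outcomes, so the model holds
# 3 * |C| conditional distributions over lexical items.
categories = ["c", "i", "v", "d"]                  # assumed toy category set C
outcomes = ["unmerge1", "unmerge2", "unmerge3"]

distributions = {(c, o): {} for c in categories for o in outcomes}
print(len(distributions))  # 12 = 3 * |C|
```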
Let $\theta_c$ be a distribution over lexical items of category $c$, where $c \in C$. Let $\theta_c(l)$ be the probability of $l$ under $\theta_c$, where $l$ is a lexical item.
We define a Dirichlet prior over each $\theta_c$, parameterized by a vector $\alpha_c$ over lexical items of category $c$.
(1) $p(\theta_c \mid \alpha_c) = \frac{1}{B(\alpha_c)} \prod_{l} \theta_c(l)^{\alpha_c(l) - 1}$
where,
(2) $B(\alpha_c) = \frac{\prod_{l} \Gamma(\alpha_c(l))}{\Gamma\!\left(\sum_{l} \alpha_c(l)\right)}$
Using variational Bayes, we can infer the optimal estimated distribution of $\theta$.
3 The variational bayesian inference approach
The variational Bayesian approach provides an approximation of the actual posterior distribution.
(3) $p(\theta, z \mid D) = \frac{p(\theta, z, D)}{p(D)}$
We wish to approximate the posterior as $q(\theta, z)$. One way of optimizing this approximation is by minimizing the Kullback-Leibler distance between $q(\theta, z)$ and $p(\theta, z \mid D)$. This method is called 'mean-field variational Bayes'.
(4) $\mathrm{KL}\big(q(\theta, z) \,\big\|\, p(\theta, z \mid D)\big) = \sum_{z} \int q(\theta, z) \log \frac{q(\theta, z)}{p(\theta, z \mid D)} \, d\theta$
Minimizing the Kullback-Leibler distance is equivalent to maximizing the lower bound on the log likelihood (Kurihara and Sato, 2004).
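This equivalence can be checked numerically on a toy discrete model: the log evidence decomposes exactly into the lower bound (ELBO) plus the KL term, so minimizing one maximizes the other. The distributions below are arbitrary illustrative numbers:

```python
# Numeric check that log p(D) = ELBO + KL(q || p(z|D)) on a toy model.
import math

p_joint = {"z1": 0.3, "z2": 0.1}         # joint p(z, D) for a fixed corpus D
q = {"z1": 0.7, "z2": 0.3}               # variational distribution over z
p_D = sum(p_joint.values())              # evidence p(D)
log_pD = math.log(p_D)

# ELBO = E_q[log p(z, D) - log q(z)]
elbo = sum(q[z] * (math.log(p_joint[z]) - math.log(q[z])) for z in q)
# KL(q || p(z | D)) with p(z | D) = p(z, D) / p(D)
kl = sum(q[z] * (math.log(q[z]) - math.log(p_joint[z] / p_D)) for z in q)

print(abs(log_pD - (elbo + kl)) < 1e-9)  # True
```

Since $\log p(D)$ is a constant of the data, pushing the KL term down by any amount raises the ELBO by exactly that amount.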
3.1 MG induction using variational bayesian inference
Let us define $D$, our corpus, as $D = (s_1, \dots, s_n)$, and $T_i$, the set of possible derivations for $s_i$, as $T_i = \{t \mid \mathrm{yield}(t) = s_i\}$. We can now define the posterior we are interested in as follows:
(5) $p(\theta, t \mid D) = \frac{p(D, t \mid \theta)\, p(\theta \mid \alpha)}{p(D)}$
To optimize $q$ by minimising the Kullback-Leibler distance with the true posterior, we need an updating schema. By making the mean-field assumption, we can create the foundations of such a schema. Using the mean-field approximation method, we define our variational distribution as:
(6) $q(\theta, t) = q(\theta) \prod_{i=1}^{n} q(t_i)$
where $t_i$ is some derivation of $s_i$.
We can now proceed to update $q(\theta)$ and $q(t)$ in alternation in an iterative manner until our variational approximation approaches our true posterior.
First, we update $q(\theta)$.
(7) $q(\theta) \propto \exp\!\big( \mathbb{E}_{q(t)}[\log p(D, t, \theta)] \big)$
where
(8) $\hat{\alpha}_c(l) = \alpha_c(l) + \mathbb{E}_{q(t)}[\mathrm{count}(l, t)]$
then,
(9) $\mathbb{E}_{q(t)}[\mathrm{count}(l, t)] = \sum_{i=1}^{n} \sum_{t \in T_i} q(t_i = t)\, \mathrm{count}(l, t)$
$\alpha$ is a hyperparameter of the Dirichlet prior, $T_i$ returns all the possible derivations for $s_i$, and $\mathrm{count}(l, t)$ returns the counts of $l$ in $t$, given the current estimate of the grammar, where expected lexical item counts were calculated using Inside-Outside. Thus, we can proceed to update $q(\theta)$.
(10) $q(\theta_c) = \mathrm{Dir}(\theta_c \mid \hat{\alpha}_c)$
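The pseudo-count update can be sketched in a few lines; the lexical items and counts below are toy values standing in for Inside-Outside output:

```python
# Hedged sketch of the q(theta) update: posterior Dirichlet pseudo-counts are
# the prior hyperparameters plus expected lexical-item counts (toy numbers).
alpha = {"see": 1.0, "what": 1.0, "you": 1.0}            # symmetric prior
expected_counts = {"see": 2.5, "what": 1.0, "you": 3.0}  # E_q(t)[count(l, t)]

alpha_hat = {l: alpha[l] + expected_counts[l] for l in alpha}
print(alpha_hat["see"])  # 3.5
```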
Second, we update $q(t)$.
(11) $q(t) \propto \exp\!\big( \mathbb{E}_{q(\theta)}[\log p(D, t \mid \theta)] \big)$
(12) $q(t_i = t) \propto \prod_{l} \tilde{\pi}(l)^{\mathrm{count}(l, t)}$
where
(13) $\tilde{\pi}(l) = \exp\!\Big( \psi\big(\hat{\alpha}_c(l)\big) - \psi\big(\textstyle\sum_{l'} \hat{\alpha}_c(l')\big) \Big)$
and $\psi$ is the digamma function.
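The digamma-based weights can be computed as follows; the sketch approximates $\psi$ numerically from log-gamma to stay dependency-free (in practice `scipy.special.digamma` would be used), and the pseudo-counts are toy values:

```python
# Hedged sketch of the q(t) rule weights:
#   pi(l) = exp(psi(alpha_hat(l)) - psi(sum of alpha_hat))
import math

def digamma(x, eps=1e-6):
    # central-difference derivative of log-gamma (illustration only)
    return (math.lgamma(x + eps) - math.lgamma(x - eps)) / (2 * eps)

alpha_hat = {"see": 3.5, "what": 2.0, "you": 4.0}   # toy pseudo-counts
total = sum(alpha_hat.values())
pi = {l: math.exp(digamma(a) - digamma(total)) for l, a in alpha_hat.items()}
print(sum(pi.values()))  # strictly less than 1: the weights are sub-normalized
```

The sub-normalization is expected: these are not probabilities but the mean-field substitutes for rule probabilities that make the $q(t)$ update a weighted parsing problem.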
By updating $q(\theta)$ and $q(t)$ in alternation, given that they are codependent, we can optimize $q(\theta, t)$. From $q(\theta)$, a probability distribution over distributions of lexical items, we can infer an estimated optimal probabilistic lexicon for $D$.
3.2 Variational inference for lexicalized MCSG formalisms
Given that variational Bayesian inference transforms an inference problem into an optimization one, unlike sampling-based methods of inference, the method can easily be translated to work for other models as long as they satisfy certain assumptions. The previous example of variational inference applied to an MG was built on three assumptions. {exe} \ex Assumptions for generalized variational Bayesian inference for MCSGs

The latent derivation trees are context-free;

The probabilities of these trees can be decomposed into the product of their components' probabilities;

There is a Dirichlet prior over lexical items/rules.
The first assumption supposes that each lexical item is well-formed and conditionally independent; however, the linearization assumption usually made for CFGs is not required of MCSG derivation trees. The variational inference schema presented here can be generalized to any lexicalized MCSG formalism that satisfies these three assumptions.
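Under these assumptions, the alternating schema of section 3.1 reduces to a small generic loop; `expected_counts_fn` is an assumed placeholder interface for the formalism-specific expected-count computation (e.g. Inside-Outside), not a real API:

```python
# Minimal sketch of the alternating mean-field schema for any lexicalized
# MCSG meeting the three assumptions (context-free derivations, product-
# decomposable tree probabilities, Dirichlet prior over rules).
def variational_update_loop(alpha, expected_counts_fn, n_iters=10):
    """alpha: prior pseudo-counts per rule; returns posterior pseudo-counts."""
    alpha_hat = dict(alpha)
    for _ in range(n_iters):
        counts = expected_counts_fn(alpha_hat)                  # update q(t)
        alpha_hat = {r: alpha[r] + counts.get(r, 0.0)           # update q(theta)
                     for r in alpha}
    return alpha_hat

# Degenerate E-step that always returns the same counts (illustration only):
result = variational_update_loop({"r1": 1.0, "r2": 1.0},
                                 lambda w: {"r1": 4.0})
print(result)  # {'r1': 5.0, 'r2': 1.0}
```

Only the E-step depends on the formalism; the Dirichlet update is identical across TAG, CCG, LCFRS, MCFG, and MG instantiations, which is what makes the schema generalize.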
4 Conclusion
In this paper, we presented a generalized variational Bayesian approach to MCSG induction. We demonstrated how this approach can be applied to a specific grammar instantiation within this equivalency class using the case of an MG. In future work, we hope to implement a computable version of this algorithm, which will be integrated into an MCSG parsing framework currently under development (Harasim et al., 2017). This new framework will then be used to test linguistic syntactic theories and provide a baseline for future development of unsupervised language learning models, as well as more human-like natural language processing applications.
References
 Attias (1999) H. Attias. Inferring parameters and structure of latent variable models by variational bayes. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pages 21–30. Morgan Kaufmann Publishers Inc., 1999.
 Bisk and Hockenmaier (2012a) Y. Bisk and J. Hockenmaier. Induction of linguistic structure with combinatory categorial grammars. In Proceedings of the NAACL-HLT Workshop on the Induction of Linguistic Structure, pages 90–95. Association for Computational Linguistics, 2012a.
 Bisk and Hockenmaier (2012b) Y. Bisk and J. Hockenmaier. Simple robust grammar induction with combinatory categorial grammars. In AAAI, 2012b.
 Bisk and Hockenmaier (2013) Y. Bisk and J. Hockenmaier. An hdp model for inducing combinatory categorial grammars. Transactions of the Association for Computational Linguistics, 1:75–88, 2013.
 Bisk (2015) Y. Y. Bisk. Unsupervised Grammar Induction with Combinatory Categorial Grammars. PhD thesis, University of Illinois at Urbana-Champaign, 2015.
 Blunsom and Cohn (2010) P. Blunsom and T. Cohn. Unsupervised induction of tree substitution grammars for dependency parsing. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1204–1213. Association for Computational Linguistics, 2010.
 Chomsky (1995) N. Chomsky. The Minimalist Program (Current Studies in Linguistics 28). MIT Press, Cambridge, MA, 1995.
 Cohen and Smith (2010) S. B. Cohen and N. A. Smith. Covariance in unsupervised learning of probabilistic grammars. Journal of Machine Learning Research, 11(Nov):3017–3051, 2010.
 Harasim et al. (2017) D. Harasim, C. Bruno, E. Portelance, and T. J. O’Donnell. A generalized parsing framework for abstract grammars technical report. arXiv, page submit/2054453, 2017.
 Harkema (2001) H. Harkema. Parsing Minimalist Languages. PhD thesis, University of California, Los Angeles, 2001.
 Joshi (1985) A. K. Joshi. Tree adjoining grammars: How much context-sensitivity is required to provide reasonable structural descriptions? In D. R. Dowty, L. Karttunen, and A. M. Zwicky, editors, Natural Language Parsing. Cambridge University Press, 1985.
 Joshi et al. (1969) A. K. Joshi, S. R. Kosaraju, and H. Yamada. String adjunct grammars. In 10th Annual Symposium on Switching and Automata Theory (swat 1969), pages 245–262, Oct 1969. doi: 10.1109/SWAT.1969.23.
 Kurihara and Sato (2004) K. Kurihara and T. Sato. An application of the variational Bayesian approach to probabilistic context-free grammars. In IJCNLP-04 Workshop Beyond Shallow Analyses, 2004.
 Seki et al. (1991) H. Seki, T. Matsumura, M. Fujii, and T. Kasami. On multiple context-free grammars. Theoretical Computer Science, 88(2):191–229, 1991. ISSN 0304-3975. doi: 10.1016/0304-3975(91)90374-B.
 Shieber (1985) S. M. Shieber. Evidence against the context-freeness of natural language. The Formal Complexity of Natural Language, 33:320–332, 1985.
 Stabler (1997) E. Stabler. Derivational minimalism. Logical Aspects of Computational Linguistics, pages 68–95, 1997.
 Stabler (2011) E. P. Stabler. Computational perspectives on minimalism. Oxford handbook of linguistic minimalism, pages 617–643, 2011.
 Steedman (1987) M. Steedman. Combinatory grammars and parasitic gaps. Natural Language & Linguistic Theory, 5(3):403–439, 1987.
 Szabolcsi (1992) A. Szabolcsi. Combinatory grammar and projection from the lexicon. Lexical matters, 1192, 1992.
 Wang (2016) A. X. Wang. Linguistically Motivated Combinatory Categorial Grammar Induction. PhD thesis, University of Washington, 2016.
 Weir (1988) D. J. Weir. Characterizing mildly context-sensitive grammar formalisms. PhD thesis, University of Pennsylvania, 1988.
 Zhai (2014) K. Zhai. Models, Inference, and Implementation for Scalable Probabilistic Models of Text. PhD thesis, 2014.