A probabilistic top-down parser
for minimalist grammars
Abstract
This paper describes a probabilistic top-down parser for minimalist grammars. Top-down parsers have the great advantage of being predictive during parsing, which proceeds in a left-to-right reading of the sentence. Such parsers have already been implemented and studied for context-free grammars (see for example [Roa01]), which are inherently top-down, but they are difficult to adapt to minimalist grammars, which generate sentences bottom-up. I propose here a way of rewriting minimalist grammars as linear context-free rewriting systems, which makes it easy to build a top-down parser. This rewriting also makes it possible to put a probability field on these grammars, which can be used to speed up the parser. I also propose a method for refining the probability field using algorithms from data compression.
Throughout this paper, a subtree of a tree refers to the set of nodes of that tree dominated by a particular node, which is the root of the subtree. A cut, on the other hand, is the set of leaves of a finite prefix tree of that tree.
1 Introduction
The idea of this parser is to see a minimalist grammar (MG) as a linear context-free rewriting system (LCFRS) on its derivation trees. This transformation allows us to work with a grammar without movement that generates sentences from top to bottom (unlike an MG, which generates sentences bottom-up), and to put a probability field on it.
1.1 Minimalist grammars
Minimalist grammars are designed to generate (subparts of) human natural languages. They are framed in Chomsky’s minimalist program [Cho95], and were first described by E. Stabler in [Sta97]. For the sake of clarity, I will use in this paper a slightly different convention to represent the trees generated by a minimalist grammar.
Minimalist grammars distinguish themselves from more classical context-free grammars by the fact that they allow movement, commonly required by syntacticians to generate such sentences as (for example) ‘Which mouse did the cat eat’, where ‘which mouse’ is base-generated at the end of the sentence (in the object position), and moves to the front. The tree corresponding to this sentence, as generated by the toy minimalist grammar we will consider here as an example, is the following:
[Figure: derived tree of ‘which mouse did the cat eat’, in which ‘which mouse’ has moved to the front of the sentence, leaving a trace in object position.]
In this tree, a trace marks the base position of the subtree ‘which mouse’, which has moved to the front of the sentence. A trace is kept for psychological reasons, as these traces can be shown to be still present for the computation of the meaning of sentences. They also make it possible to keep what is called the deep structure, corresponding to the tree where no movement has happened and all constituents are in their base position, where lexical selection takes place.
A minimalist grammar takes several lexical elements, and builds a tree with them. The toy grammar we consider will have the following lexical items:
.
This grammar generates (roughly) all affirmative/interrogative past sentences about a cat and a mouse eating each other.
As can be seen, many symbols are used next to the actual phonetic contents of the words (the, eat, cat, etc…). These are syntactic features, and the sequence of these in a lexical item represents its syntactic category, and is all that is needed to compute the tree. Two lexical items with the same lexical category can be freely interchanged without losing grammaticality.
The syntactic features may be of four types:

categories, represented by a string of letters, among which is the distinguished feature c, used to recognise the grammatical outputs. For example, n. The set of categories will be noted Cat.

selectors, represented by the string of letters of a category, preceded by a =. For example, =n. The set of selectors will be noted Sel.

licensees, represented by a string of letters preceded by a -. For example, -wh. The set of licensees will be noted Licensee.

licensors, represented by a string of letters corresponding to a licensee, preceded by a +. For example, +wh. The set of licensors will be noted Licensor.
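The four feature types can be told apart by their prefix alone. The following is a minimal sketch, assuming features are plain strings and that licensees carry a leading - (as in -wh); the function names `feature_type` and `matches` are hypothetical and not part of the grammar formalism.

```python
def feature_type(feature: str) -> str:
    """Classify a syntactic feature by its prefix."""
    if feature.startswith("="):
        return "selector"
    if feature.startswith("+"):
        return "licensor"
    if feature.startswith("-"):
        return "licensee"
    return "category"


def matches(f: str, g: str) -> bool:
    """True if f is a selector (resp. licensor) and g the category
    (resp. licensee) built on the same string of letters."""
    target = {"selector": "category", "licensor": "licensee"}
    return target.get(feature_type(f)) == feature_type(g) and f[1:] == g.lstrip("-")


print(feature_type("=n"), feature_type("+wh"), feature_type("-wh"), feature_type("n"))
print(matches("=n", "n"), matches("+wh", "-wh"), matches("=n", "-n"))
```

Here `matches("=n", "n")` is the condition under which merge may apply, and `matches("+wh", "-wh")` the condition under which move may apply.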
Syntactic features must follow a certain order, to ensure good formation of trees :
The trees are computed by using two functions on the lexical items, to form constituents (i.e., trees):

merge, when a selector selects a corresponding category,

move, when a licensee moves to a corresponding licensor.
Let’s see how this works on our little tree:

Take the lexical items for ‘which’ and ‘mouse’. We added a dot in front of the syntactic categories to keep track of the derivation. Here, the two features just right of the dot (the current features) are =n and n. It’s a selector and its corresponding category, so we can merge them into a bigger constituent:
[Figure: the constituent ‘which mouse’.]
Both of the syntactic categories are copied to the root of the new constituent, with their dots moved one step right, since the current features were used. The category with the selector always comes first. If, as is the case for the syntactic category of mouse, the dot ends up at the far right, the category may be left out, since it won’t play any role in the rest of the derivation. The label of the root indicates the head of the constituent, i.e. the constituent where the selector came from.

Then merging the new constituent with a lexical item whose current feature is the selector =d (matching the current feature d of the constituent) gives:
[Figure: the constituent ‘eat which mouse’.]
Note also that here, the second syntactic category still has something right of the dot, so it stays.

Together with the constituent ‘the cat’, it merges into:
[Figure: the constituent ‘the cat eat which mouse’.]
Note that for merging, only the first category in the list of syntactic categories is considered (in fact, only the features right of its dot).
Note also that if the constituent with the selector is complex (i.e. is not formed of a single lexical item), as is the case here, the merging happens the other way round: head on the right and selected constituent on the left. This has to do with the fact that English is an SVO language.

It can then merge with the lexical item for ‘did’, giving:
[Figure: the constituent ‘did the cat eat which mouse’.]
Finally, we can apply move. This function applies to a single constituent whose first syntactic category has a licensor right of the dot (here, +wh). It then scans the other categories to find a corresponding licensee right of a dot (here, -wh), and moves the corresponding constituent to the top of the tree, giving the final sentence:
[Figure: the final derived tree of ‘which mouse did the cat eat’.]
The dots are moved as usual, the label of the root indicates the constituent where the licensor was, and in this case, since the only feature right of a dot is the distinguished feature c, we know that the derivation yielded a grammatical output.
The constituent moved is the biggest constituent whose head is the lexical element containing the licensee under consideration (this corresponds to the syntactic notion of maximal projection).
The phonetic content of this sentence is the concatenation of the phonetic contents of its leaves, read from left to right: ‘which mouse did the cat eat’.
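Reading off the phonetic content amounts to a left-to-right traversal of the leaves. The sketch below assumes a hypothetical encoding in which a derived tree is a nested tuple, a leaf is a string, and a trace is represented by `None`; none of these choices come from the formalism itself.

```python
def phonetic_yield(tree):
    """Return the list of words at the leaves, from left to right.

    A trace (encoded here as None) and an empty string contribute nothing.
    """
    if tree is None:
        return []
    if isinstance(tree, str):
        return [tree] if tree else []
    words = []
    for child in tree:
        words.extend(phonetic_yield(child))
    return words


# Shape of the final derived tree of the running example (labels omitted).
derived = (("which", "mouse"), ("did", (("the", "cat"), ("eat", None))))
sentence = " ".join(phonetic_yield(derived))
print(sentence)  # which mouse did the cat eat
```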
The notion of head is an important notion in linguistics, since, by the principle of locality of selection, we want to restrict the amount of information that an item has access to. As such, the only information about a constituent accessible from outside (for a merge operation, for example) is the right-of-the-dot features of its head, that is, the features of the first syntactic category (minus the left-of-the-dot ones, kept only for the sake of historical bookkeeping).
1.2 Derivation trees
Another way of representing the constituents generated by a grammar is by using its derivation tree:
Definition 1.1.
The derivation tree of a constituent is a binary tree showing the history of its construction by the functions merge and move. Its leaves are lexical items, and its internal nodes are labelled either merge (binary nodes) or move (unary nodes).
For example, the derivation tree of the previous example 1.1 is:
[Figure: derivation tree of ‘which mouse did the cat eat’.]
Let’s note that to each subtree of a derivation tree corresponds a unique constituent, appearing in the construction of the main one. We can then label each node of a derivation tree by the syntactic category of the corresponding constituent.
2 Probabilistic minimalist grammars
2.1 Derivation trees of MG as LCFRS-derived trees
The basis of this method is to see minimalist derivation trees as trees generated by linear context-free rewriting systems (LCFRS). Putting a probability field on these is indeed very easy. A similar approach was also used by H. Harkema in the minimalist parser of his thesis [Har01].
We thus take a general minimalist grammar with lexicon Lex.
The closure of Lex under merge and move gives the outputs of the grammar. Here, we will not consider these outputs, but the derivation trees describing the process that yields them.
One important difference between context-free grammars, for which probabilistic versions are well studied, and minimalist grammars is that a CFG generates trees from top to bottom, by means of rules rewriting each nonterminal node into a number of other nodes, while an MG generates trees from bottom to top, by merging and moving elements. In the CFG case, we begin with a single symbol and then choose rules to rewrite it (which lets us assign probabilities to the process by assigning probabilities to the rewriting rules). In an MG, we begin with a collection of lexical items, not necessarily compatible with each other, and merge them together (occasionally moving them too). Here we present a way of seeing the generating process of an MG as an LCFRS, which, like a CFG, generates from top to bottom with a set of rules. The different nonterminal symbols are defined by closure of a certain set of axioms (starting symbols) under a set of inference rules, which gives a top-down way of generating the derivation trees of an MG.
2.1.1 Categories and partial outputs
In order to do this, we will first define a particular type of objects, called categories, which will be the nonterminal symbols of our LCFRS:
Definition 2.1.
A category is either a lexical item, a sequence of dotted feature strings whose features are elements of Syn, or a special start symbol. A simple category is a category with its first dot at the leftmost place. Otherwise, it is a complex category. The special start symbol is neither simple nor complex.
Categories correspond exactly to the lists of syntactic categories defined in 1.1, although our definition here also allows categories which cannot be generated by a minimalist grammar. We will of course only be interested in those that can.
We then define a partial output as a string of categories. These represent a particular stage in the construction of a minimalist derivation tree by the corresponding LCFRS, the different categories being the categories of the partial derivation tree being built.
2.1.2 Axiom
There is a single axiom, the special start category.
2.1.3 Inference rules
These rules correspond to the rewriting rules of the linear context-free rewriting system corresponding to our minimalist grammar. For each possible application of one of the functions merge or move of the grammar yielding a particular category, there is a corresponding inference rule (which gives quite a lot of rules…). Then, given a particular category, the rules tell how this particular type of tree (remember that categories describe a particular type of tree generated by the grammar) can be unmerged or unmoved into one (in the case of unmove) or two (in the case of unmerge) different types of trees. To these must be added the rules expanding the start category, and the lexicalisation rules. The former allow us to begin with a unique symbol, instead of all categories ending with the distinguished feature. The latter allow us to leave the lexical part of the parsing to the last moment. The rules are thus:

Start rules: for every lexical item ,
Start: 
Rewriting rules for complex categories:

Unmerge rules: the left-hand category is of the form

Cases where the selector was a simple tree :

For any lexical item of feature string ,
Unmerge1: 
For any element , with ,
Unmerge3, simple:
It should be noted that necessarily, .


Cases where the selector was a complex tree:

For any decomposition , and any lexical item of feature string ,
Unmerge2: 
For any element , and any decomposition ,
Unmerge3, complex:
As in 2iB, has to be non empty.



Unmove rules: the left-hand category is of the form

For any (necessarily unique by the Shortest Movement Constraint), with ,
Unmove2: 
If there is no , then for any lexical item of feature string ,
Unmove1:



Rewriting rules for simple categories: for any lexical item ,
Lexicalize:
.
The set of relevant partial outputs can thus be defined as the closure of the axiom under the inference rules. This set describes exactly all possible partial outputs given by the LCFRS, i.e. all possible strings of categories obtained by a cut through a tree generated by the LCFRS. Such a string corresponds to a selection of outputs (not necessarily complete) generated by the minimalist grammar, such that they can be put together by application of merge and move, in the same order (two categories will get merged only if they are adjacent in the string), to obtain a complete output. A relevant output is a relevant partial output consisting only of lexical items. It corresponds to a grammatical sentence.
The relevant categories are exactly the categories that appear in a relevant partial output. They correspond to the possible sets of similar partial trees generated by the grammar. There are finitely many of them, since, by the Shortest Movement Constraint, no two identical licensees can appear in the feature strings of a relevant category (omitting the first string). Thus two identical feature strings (differing only in the position of the dot) cannot appear together, and therefore the total length of all the feature strings of a relevant category is bounded by the sum of the lengths of all the feature strings of the lexical items, which is finite.
2.1.4 Derivation trees
With this formalism, we now have a quite straightforward way of defining minimalist derivation trees, in a way that makes it very simple to put probabilities on them: they are just the trees obtained by maximal application of rewriting rules to the axiom. The probability is simply given by a probability field on the rules.
2.2 Probabilities on MG derivation trees
To define a probability field on the derivation trees of an MG, we now just have to put conditional probabilities on the rules discussed before, given the initial relevant category. The probability of a given tree is then the product of the probabilities of the rules that generate it, as for regular probabilistic linear context-free rewriting systems. There can however be quite a lot of such rules and relevant categories, even if the MG is quite simple, but they can all be computed beforehand from the grammar alone, thanks to the definition of these categories by closure. Indeed, we will see a simple method for computing both the relevant categories and the inference rules that are needed.
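The product rule for tree probabilities can be illustrated as follows. This is only a sketch under assumptions: a derivation tree is encoded as a nested tuple whose first element is a rule name, and the rule names and probability values below are hypothetical, not taken from any grammar in this paper.

```python
from math import prod


def tree_probability(tree, rule_prob):
    """Probability of a derivation tree: the product of the probabilities
    of the rules used at each of its nodes."""
    rule, *children = tree
    return rule_prob[rule] * prod(tree_probability(c, rule_prob) for c in children)


# Hypothetical rule probabilities, conditioned on the left-hand side.
rule_prob = {"start": 0.5, "unmerge": 1.0, "lex_a": 0.25, "lex_b": 1.0}

# start -> unmerge -> (lex_a, lex_b)
tree = ("start", ("unmerge", ("lex_a",), ("lex_b",)))
p = tree_probability(tree, rule_prob)
print(p)  # 0.125
```

The probability of a subtree is obtained by calling the same function on that subtree.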
It should be noted that the functions (Merge1,2,3 and Move1,2) that may have given rise to a given relevant category are few (at most two) only if we use the dot notation, which keeps track of a minimal part of the history of the derivation. This is why the relevant categories should include all features of the lexical item potentially heading the tree (and not just the ones on the right of the dot).
To settle things a bit, we will here illustrate this method with a little example.
2.3 Example :
We will here consider the MG with the following lexical items ( being the empty string):
.
This grammar generates exactly the strings of the form , . Since this is a context-free language, we would not have needed licensors and licensees, but in order to get a language that is simple enough while having enough rules (especially movement ones), we will work on this one.
We now want to get the relevant categories of this language, and the corresponding ‘context-free rules’. A quite straightforward way to obtain them is to start from the axiom and follow the inference rules to close the set of relevant categories. From the axiom, we apply the schemes to get all applicable rules, apply them, get some new relevant categories, apply the schemes to get new rules, apply them, and so on. Since there are finitely many relevant categories, this algorithm eventually terminates, giving us all the relevant categories and needed rules (we will not get all rules, since the schemes could also apply to non-relevant categories, but we do not want those in any case).
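The closure computation just described is a standard worklist algorithm. The sketch below assumes a hypothetical function `expansions` that, given a category, returns the rules applicable to it as pairs (rule name, right-hand-side categories); the toy table stands in for the rule schemes and does not encode the grammar of this section.

```python
from collections import deque


def close(axiom, expansions):
    """Compute the relevant categories and rules reachable from the axiom.

    expansions(cat) must return an iterable of (rule_name, rhs) pairs,
    where rhs is a tuple of categories.
    """
    seen = {axiom}
    agenda = deque([axiom])
    rules = []
    while agenda:
        cat = agenda.popleft()
        for name, rhs in expansions(cat):
            rules.append((cat, name, rhs))
            for new in rhs:
                if new not in seen:  # each category is treated only once
                    seen.add(new)
                    agenda.append(new)
    return seen, rules


# Hypothetical rule table standing in for the inference schemes.
TABLE = {
    "S": [("r1", ("A", "B")), ("r2", ("B",))],
    "A": [("r3", ("B",))],
    "B": [("lex", ())],
}
categories, rules = close("S", lambda cat: TABLE.get(cat, []))
print(sorted(categories), len(rules))
```

Termination follows from the `seen` set: a category already discovered is never put back on the agenda.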
So here we go:

Start rules: we search for all lexical items whose features end with c. There are two here, giving two different relevant categories. We thus have two rules:

Start:

Start:
We have now two new relevant categories: and . We will now write the rules with these on the left side of the arrow.


correspond to case 3. There is but one lexical item with features c, which is , so we have a single rule:

Lexicalize:
No new relevant category is created, so we can move to the next one:


corresponds to the case 2b, so we can have two possibilities. Since there is no ‘’, only the case 2(b)ii can apply. We must then look for lexical items whose last feature is . There is but one (and thus only one corresponding relevant category), . So we have one possible rule:

Unmove1:
We have now a new relevant category, .


corresponds to case 2(a)i. For case 2iA, we have to look for a lexical item whose last feature is a. Since there is no such item, we fall back to 2iB. Here we have to look in ‘’ for feature strings of type . There is only one, namely , so we have one rule:

Unmerge3, simple:
We got here two more relevant categories, and .


corresponds to case 3, and there is but one lexical item with the corresponding features, so we have one additional rule:

Lexicalize:


corresponds to case 2(a)i. We first try case 2iA. We look for lexical items with last feature b. There are two such items, namely and . We then have two rules:

Unmerge1:

Unmerge1:
Since ‘’ is here empty, case 2iB can’t apply, and we move on to the three newly discovered relevant categories, , and .


is ready to be lexicalized, there is still only one corresponding lexical item, so we get the rule:

Lexicalize:


is in the same case, we thus have:

Lexicalize:


corresponds to the case 2(b)ii, with only one corresponding lexical item, thus the rule:

Unmove1:


corresponds to case 2iB, and we have one rule:

Unmerge3, simple:
Since has already been treated, we can move to the last untreated relevant category:


is ready to be lexicalized:

Lexicalize:

We are now ready to give probabilities to these rules, conditioned on the left-hand side. The assignment here is quite easy: apart from the two cases where there are two possible rules (the choice of the start rule, and the category admitting two Unmerge1 rules), the conditional probability is 1 (there is no choice). For the two other cases, we can assign any probability to one rule, and give the other the complementary probability. We can now give the following table:
Start  
Start  
Lexicalize  
Unmove1  
Unmerge3, simple  
Lexicalize  
Unmerge1, simple  
Unmerge1, simple  
Lexicalize  
Lexicalize  
Unmove1  
Unmerge3, simple  
Lexicalize 
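Whatever values are chosen, the probabilities of the rules sharing a left-hand side must sum to 1. The following sketch checks this condition; the triples (left-hand side, rule name, probability) and their values are hypothetical placeholders, not the entries of the table above.

```python
from collections import defaultdict
from math import isclose


def is_proper(rules):
    """True if, for every left-hand side, the conditional probabilities
    of its rules sum to 1."""
    totals = defaultdict(float)
    for lhs, _name, p in rules:
        totals[lhs] += p
    return all(isclose(total, 1.0) for total in totals.values())


# Hypothetical assignment: one free choice at the start symbol,
# probability 1 wherever only one rule applies.
assignment = [
    ("S", "Start (first item)", 0.3),
    ("S", "Start (second item)", 0.7),
    ("X", "Lexicalize", 1.0),
]
print(is_proper(assignment))  # True
```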
We will now end by giving the probability of a particular derivation tree:
[Figure: the derivation tree whose probability is computed.]
All the rules used here have probability 1, except the top one (the choice of the start rule) and the rules chosen at the two further nodes where a choice exists. The complete tree therefore has as probability the product of these three values, and the probability of any subtree is likewise the product of the probabilities of the rules it contains.
2.4 The Cats and Mice example
Let’s now get back to our toy grammar 1.1 and see how it rewrites:
.
The rules are the following:
1  Start  
2  Start  
3  Start  
4  Unmerge2  
5  Unmerge1  
6  Lexicalize  
7  Unmerge1  
8  Lexicalize  
9  Lexicalize  
10  Lexicalize  
11  Unmove1  
12  Unmerge1  
13  Lexicalize  
14  Unmerge2  
15  Unmerge3, complex  
16  Unmerge3, simple  
17  Lexicalize  
18  Unmerge1  
19  Lexicalize  
20  Unmerge1  
21  Unmerge1  
22  Lexicalize  
23  Unmerge2 
3 The probabilistic top-down parser
The parser works on an ordered list of hypotheses, which it expands in turn during the parse of the sentence. Before presenting the algorithm, some definitions are needed:
3.1 Definitions
One difficulty in working with derivation trees instead of regular derived trees is that the order of the words cannot be easily deduced (short of redoing the actual derivation). In order to keep track of the position of a category in the derived tree (so that the parser knows in which order to expand the tree), we introduce position indices, which denote positions in the derived tree, starting from its root, by a chain of digits (0 when going down left, 1 when going down right). From this perspective we can also define a successor operator on them, corresponding to a left-to-right sweep of the tree.
Consider the grammar given by the following lexical items:
.
This grammar will generate the derived tree:
[Figure: the derived tree generated by this grammar.]
Corresponding to this tree is the derivation tree:
[Figure: the corresponding derivation tree.]
The parser should try to expand the nodes leading to the first leaf of the derived tree 3.1, but what it actually builds is the derivation tree 3.1. In this example, it should therefore begin by expanding the rightmost nodes, then switch back to the leftmost ones once the first word has been parsed, in order to parse the next one, and so on. Position indices, recording the position of each category in the derived tree, can be computed online and incorporated into the derivation tree: for example, index 0 for all categories corresponding to the constituent whose final position is just one branch down and left from the root. The parser then just has to expand the unexpanded nodes with the lowest (i.e. leftmost) position indices. In order to do this, it can keep track of a pointer telling up to which point nodes have been expanded, and expand the corresponding one. Updating the pointer with the adequate notion of successor then keeps the parser working. To formalize this:
Definition 3.1.
A position index is a finite string over the digits 0 and 1.
Its successor is defined to be:
A position index corresponds to a pointer if it is equal to the pointer followed by zero or more 0s. In this case, we also say that the pointer points to the position index.
The notion of correspondence gives the parser some freedom in the pointer indicating the index to be expanded. Indeed, the parser will not try to expand the node whose index is exactly equal to the pointer, but one corresponding to it, that is, equal to the pointer followed by as many 0s as possible, or, in the derived tree, down the leftmost path from the node indicated by the pointer. This is what we want: the first unexpanded node below the pointer.
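The correspondence between a pointer and the node to expand can be sketched as follows, assuming position indices are encoded as strings over the digits 0 (left) and 1 (right), with the empty string for the root. The function names are hypothetical, and the successor operator is deliberately left out.

```python
def corresponds(index: str, pointer: str) -> bool:
    """True if index equals the pointer followed by zero or more 0s,
    i.e. lies on the leftmost path below the pointer."""
    return index.startswith(pointer) and set(index[len(pointer):]) <= {"0"}


def node_to_expand(leaf_indices, pointer):
    """Among the unexpanded leaves, return the index the pointer points to
    (the one with the most 0s after the pointer), or None if there is none."""
    candidates = [i for i in leaf_indices if corresponds(i, pointer)]
    return max(candidates, key=len) if candidates else None


leaves = ["0", "100", "11"]
print(node_to_expand(leaves, "1"))   # 100
print(node_to_expand(leaves, ""))    # 0
print(node_to_expand(leaves, "01"))  # None
```

With this encoding, sorting the indices of the leaves as strings gives their left-to-right order in the derived tree, since no leaf index is a prefix of another.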
Definition 3.2.
A situated category is a pair made of a sequence of position indices and a sequence of dotted feature strings (so that the latter is a category). For readability, each position index is written together with the feature string it locates.
Definition 3.3.
A hypothesis is a 5-tuple consisting of:

a finite set of situated categories (the nodes of the partial derivation tree),

a position index, the pointer, pointing to the next node to expand,

the probability of the hypothesis,

a dotted input string, and

the sequence of rules used to obtain this hypothesis from the axiom.
The dotted input string is the string of words of the phrase being parsed, with a dot indicating up to which point it has already been parsed (in fact, up to which point the words have been scanned). For example, the dotted input string ‘The cat has · eaten the mouse’ means that this hypothesis has already scanned (i.e., recognised a node for) the words “The”, “cat” and “has”, but not yet “eaten”, “the” and “mouse”.
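A hypothesis can be represented as a small immutable record. The sketch below is one possible encoding, not the one used by the parser described here: the field names are hypothetical, the dotted input string is stored as a tuple of words plus the position of the dot, and situated categories are left as opaque values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    nodes: frozenset     # situated categories of the partial derivation tree
    pointer: str         # position index of the next node to expand
    probability: float   # probability of the hypothesis
    words: tuple         # words of the phrase being parsed
    dot: int = 0         # number of words already scanned
    rules: tuple = ()    # rules used so far, starting from the axiom

    def scanned(self):
        return self.words[:self.dot]

    def remaining(self):
        return self.words[self.dot:]

    def fully_scanned(self):
        return self.dot == len(self.words)


h = Hypothesis(
    nodes=frozenset(),
    pointer="",
    probability=1.0,
    words=("The", "cat", "has", "eaten", "the", "mouse"),
    dot=3,
)
print(h.scanned())    # ('The', 'cat', 'has')
print(h.remaining())  # ('eaten', 'the', 'mouse')
```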
3.2 Position indices and nodes
The parser expands the hypothesis trees in a quite particular way, corresponding to a left-to-right reading of the output sentence. Since movement is possible in MG, the parser has to keep track of the ‘position’ of the different elements, so as to expand only the leaves corresponding to the word currently being parsed. This is the role of the position indices.
A position index different from will represent a particular subtree in the final derived tree, where the traces of the moved nodes are deleted (moving up its sister to the position of its mother). The position index corresponds to the subtree dominated by the node obtained by going down in the tree from its root, left if , right otherwise, then again left if , right otherwise, etc…
Back to our toy grammar 3.1:
The derivation tree with indexed relevant features corresponding to 3.1:
[Figure: the derivation tree of 3.1, each relevant category annotated with its position indices.]
The indexed relevant category at the root of the derivation tree has an empty position string, since it represents the derived tree itself. A position string 11, for example, is assigned to a relevant category that represents the tree under the node obtained by going right (1), then right again (1), from the root node of the derived tree (without the moving categories, since they will move and so will not end up at the same place). Similarly, a position string 0 is assigned to a subtree that ends up as the left (0) daughter of the root of the derived tree.
The assignment of these position strings is given by the inference rules, which will be discussed later.
3.3 Axiom
The axiom of the parser is exactly the axiom of the LCFRS corresponding to our MG discussed in 2, together with an empty position string (it represents the whole derived tree). Its probability is of course 1. The parser axiom for a given phrase thus combines this situated category, the initial pointer, probability 1, the input string with its dot at the far left, and an empty sequence of rules.
3.4 Inference rules
We have here exactly the same inference rules as before, except that they also assign position strings. They are as follows:

Start rules: for every lexical item ,
Start: 
Unmerge rules: the left-hand category is of the form

Cases where the selector was a simple tree :

For any lexical item of feature string ,
Unmerge1:\Tree[. [. ] [. ] ] is here if (and thus too), and otherwise.

For any element , with ,
Unmerge3, simple:\Tree[. [. ] [. ] ] is here if (and thus too), and otherwise. It should be noted that necessarily, .


Cases where the selector was a complex tree:

For any decomposition , and any lexical item of feature string ,
Unmerge2:\Tree[. [. ] [. ] ] is, as always, if (and thus has to be empty too), and otherwise.

For any element , and any decomposition ,
Unmerge3, complex:\Tree[. [. ] [. ] ] is still if (and thus has to be empty too), and otherwise. As in 2iB, has to be non empty.



Unmove rules: the left-hand category is of the form

For any (necessarily unique by the Shortest Movement Constraint), with ,
Unmove2:\Tree[. [. ] ] 
If there is no , then for any lexical item of feature string ,
Unmove1:\Tree[. [. ] ]

There is no lexicalise rule, since it will in fact be replaced by a ‘scan rule’, checking if the feature string of the word currently parsed corresponds to the current feature string.
3.5 Topdown parser
The parser takes an input string, a minimalist grammar rewritten as an LCFRS, and a beam function, which sets a threshold on the probability of the hypotheses that are kept. It works on a priority queue of hypotheses. The beam function can be very general; here we will consider that its argument is the priority queue. The parser works as follows:

Beginning: The parser starts with the queue of hypotheses consisting of the axiom of the grammar.

Expanding: At each step, the parser will:

take the top-ranked hypothesis (i.e. the hypothesis with the greatest probability) in the priority queue,

check the pointer of this hypothesis. If the pointer has run past the last position (no node is left to expand) and the parsing dot of the input string is at the far right, then the parser terminates and returns the sequence of rules of the hypothesis. If the pointer has run past the last position but the phrase is not completely parsed, the hypothesis is deleted and the parser moves to the next one. Otherwise, the parser moves to the next step, and tries to:

find the leaf of , , in which is the position string corresponding to the pointer . If , is set to .

expand . For this, we have two possibilities:

Expand: If is a complex situated category, the parser will delete the current hypothesis and add to the priority queue, for all possible inference rules , a new hypothesis , such that is where the node has been replaced by , and is either if the rule did change the value of the position string corresponding to (i.e. if the rule was Unmerge1, Unmerge2 and Unmove1, and the first element of had for position string), and in the other cases. is the concatenation operator.

Scan: If is a simple indexed category, say , then the parser will delete the current hypothesis , and try to lexicalize . It will do two things:

Scan, : If there is a rule , then a new hypothesis is added to the priority queue, where is where the leaf was replaced by .

Scan, : If is in the grammar, then a new hypothesis is added to the priority queue, where is where the leaf was replaced by .
If these two steps fail, then no hypothesis is added to the priority queue.

The new hypotheses are inserted in the priority queue at their ‘right place’, i.e. after all hypotheses of higher probability.


Prune: The parser deletes all hypotheses of the priority queue whose probability is lower than the threshold given by the beam function.
If the priority queue is empty, then the parse has failed and the sentence is judged ungrammatical.
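The control structure of the steps above (take the best hypothesis, expand or scan it, insert the results, prune) is a best-first search over a priority queue. The sketch below shows only this control loop under simplifying assumptions: a state is reduced to a pair (number of scanned words, rules used), the rule table `TOY_RULES` is hypothetical, and the expansion of situated categories is replaced by a toy `successors` function.

```python
import heapq
from itertools import count


def best_first(axiom_state, successors, is_complete, threshold):
    """Generic best-first loop with beam pruning.

    successors(state) yields (rule_probability, new_state) pairs;
    threshold(queue) returns the probability below which hypotheses are dropped.
    """
    tie = count()  # tie-breaker, so that states themselves are never compared
    queue = [(-1.0, next(tie), axiom_state)]  # the axiom has probability 1
    while queue:
        neg_p, _, state = heapq.heappop(queue)  # most probable hypothesis
        if is_complete(state):
            return -neg_p, state
        for rule_p, new_state in successors(state):
            heapq.heappush(queue, (neg_p * rule_p, next(tie), new_state))
        if queue:  # prune hypotheses below the beam threshold
            limit = threshold(queue)
            queue = [h for h in queue if -h[0] >= limit]
            heapq.heapify(queue)
    return None  # empty queue: the parse failed


# Hypothetical scan rules: (rule name, word, probability).
TOY_RULES = [("scan_a", "a", 0.5), ("scan_b_v1", "b", 0.3), ("scan_b_v2", "b", 0.2)]


def parse(words):
    words = tuple(words)

    def successors(state):
        dot, rules = state
        for name, word, p in TOY_RULES:
            if dot < len(words) and words[dot] == word:
                yield p, (dot + 1, rules + (name,))

    return best_first(
        (0, ()),
        successors,
        is_complete=lambda s: s[0] == len(words),
        threshold=lambda q: 1e-3 * -q[0][0],  # relative to the best hypothesis
    )


result = parse(["a", "b"])
print(result)
print(parse(["a", "c"]))  # None
```

The beam used here keeps every hypothesis whose probability is at least a thousandth of the best one; this particular choice is only an illustration of a beam function taking the queue as its argument.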

3.6 Example
Here we present a small example of the parsing of a particular sentence of the grammar we presented in 2.3, consisting of the lexical items:
The corresponding LCFRS consisted of these rules:
S1  Start  
S2  Start  
L1 