Constituent Parsing as Sequence Labeling
Abstract
We introduce a method to reduce constituent parsing to sequence labeling. For each word $w_i$, it generates a label that encodes: (1) the number of ancestors in the tree that the words $w_i$ and $w_{i+1}$ have in common, and (2) the nonterminal symbol at the lowest common ancestor. We first prove that the proposed encoding function is injective for any tree without unary branches. In practice, the approach is extended to all constituency trees by collapsing unary branches. We then use the PTB and CTB treebanks as testbeds and propose a set of fast baselines. We achieve 90% F-score on the PTB test set, outperforming the sequence-to-sequence parser of Vinyals et al. (2015). In addition, sacrificing some accuracy, our approach achieves the fastest constituent parsing speeds reported to date on PTB by a wide margin.
1 Introduction
Constituent parsing is a core problem in NLP where the goal is to obtain the syntactic structure of sentences expressed as a phrase structure tree.
Traditionally, constituent-based parsers have been built relying on chart-based, statistical models (Collins, 1997; Charniak, 2000; Petrov et al., 2006), which are accurate but slow, with typical speeds well below 10 sentences per second on modern CPUs (Kummerfeld et al., 2012).
Several authors have proposed more efficient approaches that gain speed while preserving (or even improving) accuracy. Sagae and Lavie (2005) present a classifier for constituency parsing that runs in linear time by relying on a shift-reduce stack-based algorithm instead of a grammar. It is essentially an extension of transition-based dependency parsing (Nivre, 2003). This line of research has been refined through the years (Wang et al., 2006; Zhu et al., 2013; Dyer et al., 2016; Liu and Zhang, 2017; Fernández-González and Gómez-Rodríguez, 2018).
With an aim more closely related to our work, other authors have reduced constituency parsing to tasks that can be solved faster or in a more generic way. Fernández-González and Martins (2015) reduce phrase structure parsing to dependency parsing. They propose an intermediate representation where dependency labels from a head to its dependents encode the nonterminal symbol and an attachment order that is used to arrange nodes into constituents. Their approach makes it possible to use off-the-shelf dependency parsers for constituency parsing. In a different line, Vinyals et al. (2015) address the problem by relying on a sequence-to-sequence model where trees are linearized in a depth-first traversal order. Their solution can be seen as a machine translation model that maps a sequence of words into a parenthesized version of the tree. Choe and Charniak (2016) recast parsing as language modeling. They train a generative parser that obtains the phrasal structure of sentences by relying on the intuition of Vinyals et al. (2015) and on the model of Zaremba et al. (2014) to build the basic language modeling architecture.
More recently, Shen et al. (2018) propose an architecture to speed up the current state-of-the-art chart parsers trained with deep neural networks (Stern et al., 2017; Kitaev and Klein, 2018). They introduce the concept of syntactic distances, which specify the order in which the splitting points of a sentence will be selected. The model learns to predict such distances and then recursively partitions the input in a top-down fashion.
Contribution
We propose a method to transform constituent parsing into sequence labeling. This reduces it to the complexity of tasks such as part-of-speech (PoS) tagging, chunking or named-entity recognition. The contribution is twofold.
First, we describe a method to linearize a tree into a sequence of labels (§2) whose length is that of the sentence minus one.
Second, we use this encoding to present different baselines that can effectively predict the structure of sentences (§3). To do so, we rely on a recurrent sequence labeling model based on BiLSTMs (Hochreiter and Schmidhuber, 1997; Yang and Zhang, 2018). We also test other models inspired by classic approaches to other tagging tasks (Schmid, 1994; Sha and Pereira, 2003). We use the Penn Treebank (PTB) and the Penn Chinese Treebank (CTB) as testbeds.
The comparison against Vinyals et al. (2015), the closest work to ours, shows that our method is able to train more accurate parsers. This is in spite of the fact that our approach addresses constituent parsing as a sequence labeling problem, which is simpler than a sequence-to-sequence problem, where the output sequence has variable/unknown length. Despite being the first sequence labeling method for constituent parsing, our baselines achieve decent accuracy in comparison to models coming from mature lines of research, and their speeds are, to our knowledge, the fastest reported.
2 Linearization of n-ary trees
Notation and Preliminaries
In what follows, we use bold style to refer to vectors and matrices (e.g. $\mathbf{x}$ and $\mathbf{W}$). Let $\mathbf{w} = [w_1, w_2, \ldots, w_{|\mathbf{w}|}]$ be an input sequence of words, where $w_i \in V$. Let $T_{|\mathbf{w}|}$ be the set of constituent trees with $|\mathbf{w}|$ leaf nodes that have no unary branches. For now, we will assume that the constituent parsing problem consists in mapping each sentence $\mathbf{w}$ to a tree in $T_{|\mathbf{w}|}$, i.e., we assume that correct parses have no unary branches. We will deal with unary branches later.
To reduce the problem to a sequence labeling task, we define a set of labels $L$ that allows us to encode each tree in $T_{|\mathbf{w}|}$ as a unique sequence of labels in $L^{(|\mathbf{w}|-1)}$, via an encoding function $\Phi_{|\mathbf{w}|}: T_{|\mathbf{w}|} \rightarrow L^{(|\mathbf{w}|-1)}$. Then, we can reduce the constituent parsing problem to a sequence labeling task where the goal is to predict a function $F_{|\mathbf{w}|,\theta}: V^{|\mathbf{w}|} \rightarrow L^{(|\mathbf{w}|-1)}$, where $\theta$ are the parameters to be learned. To parse a sentence, we label it and then decode the resulting label sequence into a constituent tree, i.e., we compute $\Phi^{-1}_{|\mathbf{w}|}(F_{|\mathbf{w}|,\theta}(\mathbf{w}))$.
For the method to be correct, we need the encoding of trees to be complete (every tree in $T_{|\mathbf{w}|}$ must be expressible as a label sequence, i.e., $\Phi_{|\mathbf{w}|}$ must be a total function, so we have full coverage of constituent trees) and injective (so that the inverse function $\Phi^{-1}_{|\mathbf{w}|}$ is well-defined). Surjectivity is also desirable, so that the inverse is a function on $L^{(|\mathbf{w}|-1)}$, and the parser outputs a tree for any sequence of labels that the classifier can generate.
We now define our $\Phi_{|\mathbf{w}|}$ and show that it is total and injective. Our encoding is not surjective per se. We handle ill-formed label sequences in §2.3.
2.1 The Encoding
Let $w_i$ be a word located at position $i$ in the sentence, for $1 \leq i \leq |\mathbf{w}|-1$. We will assign it a 2-tuple label $l_i = (n_i, c_i)$, where $n_i$ is an integer that encodes the number of common ancestors between $w_i$ and $w_{i+1}$, and $c_i$ is the nonterminal symbol at the lowest common ancestor.
Basic encodings
The number of common ancestors may be encoded in several ways.

Absolute scale: The simplest encoding is to make $n_i$ directly equal to the number of ancestors in common between $w_i$ and $w_{i+1}$.

Relative scale: A second and better variant consists in making $n_i$ represent the difference with respect to the number of ancestors encoded in $n_{i-1}$. Its main advantage is that the size of the label set is reduced considerably.
Figure 1 shows an example of a tree linearized according to both absolute and relative scales.
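As an illustration of the two scales, the following minimal Python sketch computes both encodings for a small tree. The nested `(label, children)` representation, the function names and the example sentence are assumptions made for this sketch and are not taken from the paper's implementation. The first relative value is assumed to equal the absolute one.

```python
def leaf_ancestors(tree, address=(), ancestors=()):
    """Yield (word, ancestors) for every leaf, from left to right.

    A tree is a (label, children) tuple; a child is either a word (str)
    or another tree. Each ancestor is identified by its address in the tree.
    """
    label, children = tree
    ancestors = ancestors + ((address, label),)
    for k, child in enumerate(children):
        if isinstance(child, str):
            yield child, ancestors
        else:
            yield from leaf_ancestors(child, address + (k,), ancestors)


def encode(tree, relative=True):
    """Return one (n_i, c_i) label for every word except the last one."""
    chains = [ancestors for _, ancestors in leaf_ancestors(tree)]
    labels, previous = [], 0
    for left, right in zip(chains, chains[1:]):
        common = 0
        for a, b in zip(left, right):
            if a != b:
                break
            common += 1
        nonterminal = left[common - 1][1]  # label of the lowest common ancestor
        labels.append((common - previous if relative else common, nonterminal))
        previous = common
    return labels


# Hypothetical example tree without unary branches.
tree = ("S", [("NP", ["the", "red", "toy"]), ("VP", ["fell", "down"]), "."])
print(encode(tree, relative=False))
# [(2, 'NP'), (2, 'NP'), (1, 'S'), (2, 'VP'), (1, 'S')]
print(encode(tree, relative=True))
# [(2, 'NP'), (0, 'NP'), (-1, 'S'), (1, 'VP'), (-1, 'S')]
```

In this example the absolute scale needs the values 1 and 2, while the relative scale uses small offsets around zero, which is what keeps the label set compact on deeper trees.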
Encoding for trees with exactly $k$ children
For trees where all branchings have exactly $k$ children, it is possible to obtain an even more efficient linearization in terms of the number of labels. To do so, we take the relative scale encoding as our starting point. If we build the tree incrementally in a left-to-right manner from the labels and find a negative $n_i$, we need to attach the word $w_{i+1}$ (or a new subtree with that word as its leftmost leaf) to the node located $|n_i|$ levels above the parent of $w_i$ in the path going from $w_i$ to the root. If every node must have exactly $k$ children, there is only one valid negative value of $n_i$: the one pointing to the first node in said path that has not received its $k$th child yet. Any smaller value would leave this node without enough children (which cannot be fixed later due to the left-to-right order in which we build the tree), and any larger value would create a node with too many children. Thus, we can map negative values to a single label. Figure 2 shows an example for the case of binarized trees ($k=2$).
Links to root
Another variant emerged from the empirical observation that some tokens that are usually linked to the root node (such as the final punctuation in Figure 1) were particularly difficult to learn for the simpler baselines. To successfully deal with these cases in practice, it makes sense to consider a simplified annotation scheme where a node is assigned a special tag (root, $c_i$) when it is directly linked to the root of the tree.
From now on, unless otherwise specified, we use the relative scale without the simplification for exactly $k$ children. This will be the encoding used in the experiments (§4), because the size of the label set is significantly lower than the one obtained by relying on the absolute one. Also, it works directly with non-binarized trees, in contrast to the encoding that we introduce for trees with exactly $k$ children, which is described only for completeness and possible interest for future work. For the experiments (§4), we also use the special tag (root, $c_i$) to further reduce the size of the label set and to simplify the classification of tokens connected to the root, where $|n_i|$ is expected to be large.
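The root tag can be sketched as follows. This is one possible reading, stated as an assumption: a label is replaced by the special tag whenever the lowest common ancestor of $w_i$ and $w_{i+1}$ is the root, i.e., whenever the absolute value is 1. The constant `ROOT` and both function names are hypothetical.

```python
ROOT = "ROOT"  # hypothetical name for the special tag


def relative_with_root(absolute_labels):
    """Turn absolute-scale labels into relative ones, using the root tag
    when the two words only share the root (assumed to mean n_i == 1)."""
    labels, previous = [], 0
    for n, c in absolute_labels:
        labels.append((ROOT, c) if n == 1 else (n - previous, c))
        previous = n
    return labels


def absolute_from_relative(labels):
    """Invert relative_with_root: the root tag resets the level to 1."""
    absolute, level = [], 0
    for n, c in labels:
        level = 1 if n == ROOT else level + n
        absolute.append((level, c))
    return absolute


absolute = [(2, "NP"), (2, "NP"), (1, "S"), (2, "VP"), (1, "S")]
print(relative_with_root(absolute))
# [(2, 'NP'), (0, 'NP'), ('ROOT', 'S'), (1, 'VP'), ('ROOT', 'S')]
```

Under this reading, a long run of closing brackets down to the root is always expressed with the same tag, instead of a different large negative offset for each depth.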
2.2 Theoretical correctness
We now prove that $\Phi_{|\mathbf{w}|}$ is a total function and injective for any tree in $T_{|\mathbf{w}|}$. Recall that trees in this set have no unary branches. Later (in §2.3) we describe how we deal with unary branches. To prove correctness, we use the relative scale. Correctness for the other scales follows trivially.
Completeness
Every pair of nodes in a rooted tree has at least one common ancestor, and a unique lowest common ancestor. Hence, for any tree in $T_{|\mathbf{w}|}$, the label $l_i = (n_i, c_i)$ defined in Section 2.1 is well-defined and unique for each word $w_i$, $1 \leq i \leq |\mathbf{w}|-1$; and thus $\Phi_{|\mathbf{w}|}$ is a total function from $T_{|\mathbf{w}|}$ to $L^{(|\mathbf{w}|-1)}$.
Injectivity
The encoding method must ensure that any given sequence of labels corresponds to at most one tree. Otherwise, we would have to deal with ambiguity, which is not desirable.
For simplicity, we will prove injectivity in two steps. First, we will show that the encoding is injective if we ignore nonterminals (i.e., equivalently, that the encoding is injective for the set of trees resulting from replacing all the nonterminals in trees in $T_{|\mathbf{w}|}$ with a generic nonterminal $X$). Then, we will show that it remains injective when we take nonterminals into account.
For the first part, let $\tau \in T_{|\mathbf{w}|}$ be a tree where nonterminals take a generic value $X$. We represent the label of the $i$th leaf node as $\bullet_i$. Consider the representation of $\tau$ as a bracketed string, where a single-node tree with a node labeled $v$ is represented by $(v)$, and a tree rooted at $v$ with child subtrees $C_1 \ldots C_n$ is represented as $(v\ C_1 \ldots C_n)$.
Each leaf node will appear in this string as a substring $(\bullet_i)$. Thus, the parenthesized string has the form $\alpha_0 (\bullet_1) \alpha_1 (\bullet_2) \ldots \alpha_{|\mathbf{w}|-1} (\bullet_{|\mathbf{w}|}) \alpha_{|\mathbf{w}|}$, where the $\alpha_i$s are strings that can only contain brackets and nonterminals, as by construction there can be no leaf nodes between $(\bullet_i)$ and $(\bullet_{i+1})$.
We now observe some properties of this parenthesized string. First, note that each of the substrings $\alpha_i$ must necessarily be composed of zero or more closing parentheses followed by zero or more opening parentheses with their corresponding nonterminal, i.e., it must be of the form $[)]^{*}[(X]^{*}$. This is because an opening parenthesis followed by a closing parenthesis would represent a leaf node, and there are no leaf nodes between $(\bullet_i)$ and $(\bullet_{i+1})$ in the tree.
Thus, we can write $\alpha_i$ as $\alpha_{i)}\alpha_{i(}$, where $\alpha_{i)}$ is a string matching the expression $[)]^{*}$ and $\alpha_{i(}$ a string matching the expression $[(X]^{*}$. With this, we can write the parenthesized string for $\tau$ as
$\alpha_{0)}\alpha_{0(} (\bullet_1) \alpha_{1)}\alpha_{1(} (\bullet_2) \ldots \alpha_{|\mathbf{w}|-1)}\alpha_{|\mathbf{w}|-1(} (\bullet_{|\mathbf{w}|}) \alpha_{|\mathbf{w}|)}\alpha_{|\mathbf{w}|(}$
Let us now denote by $\beta_i$ the string $\alpha_{i-1(} (\bullet_i) \alpha_{i)}$. Then, taking into account that $\alpha_{0)}$ and $\alpha_{|\mathbf{w}|(}$ are trivially empty in the previous expression due to bracket balancing, the expression for the tree becomes simply $\beta_1 \beta_2 \ldots \beta_{|\mathbf{w}|}$, where we know, by construction, that each $\beta_i$ is of the form $[(X]^{*} (\bullet_i) [)]^{*}$.
Since we have shown that each tree in $T_{|\mathbf{w}|}$ uniquely corresponds to a string $\beta_1 \beta_2 \ldots \beta_{|\mathbf{w}|}$, to show injectivity of the encoding it suffices to show that different values for a $\beta_i$ generate different label sequences.
To show this, we can say more about the form of $\beta_i$: it must be either of the form $[(X]^{*} (\bullet_i)$ or of the form $(\bullet_i) [)]^{*}$, i.e., it is not possible for $\beta_i$ to contain both opening parentheses before the leaf node and closing parentheses after the leaf node. This could only happen if the tree had a subtree of the form $(X\ (\bullet_i))$, but this is not possible since we are forbidding unary branches.
Hence, we can identify each $\beta_i$ with an integer $\delta(\beta_i)$: $0$ if $\beta_i$ has neither opening nor closing parentheses outside the leaf node, $+k$ if it has $k$ opening parentheses, and $-k$ if it has $k$ closing parentheses. It is easy to see that $\delta(\beta_1)\delta(\beta_2)\ldots\delta(\beta_{|\mathbf{w}|-1})$ corresponds to the values $n_i$ in the relative-scale label encoding of the tree $\tau$. To see this, note that the number of unclosed parentheses at the point right after $\beta_i$ in the string exactly corresponds to the number of common ancestors between the $i$th and $(i+1)$th leaf nodes. A positive $\delta(\beta_i) = k$ corresponds to opening $k$ parentheses before $(\bullet_i)$, so the number of common ancestors of $w_i$ and $w_{i+1}$ will be $k$ more than that of $w_{i-1}$ and $w_i$. A negative $\delta(\beta_i) = -k$ corresponds to closing $k$ parentheses after $(\bullet_i)$, so the number of common ancestors will conversely decrease by $k$. A value of zero means no opening or closing parentheses, and no change in the number of common ancestors.
Thus, different parenthesized strings generate different label sequences, which proves injectivity ignoring nonterminals (note that $\delta(\beta_{|\mathbf{w}|})$ does not affect injectivity as it is uniquely determined by the other values: it corresponds to closing all the parentheses that remain unclosed at that point).
It remains to show that injectivity still holds when nonterminals are taken into account. Since we have already proven that trees with different structure produce different values of $n_i$ in the labels, it suffices to show that trees with the same structure, but different nonterminals, produce different values of $c_i$. Essentially, this reduces to showing that every nonterminal in the tree is mapped into a concrete $c_i$. To see this, consider a tree $\tau \in T_{|\mathbf{w}|}$, and some nonterminal $X$ in $\tau$. Since trees in $T_{|\mathbf{w}|}$ do not have unary branches, $X$ has at least two children. Consider the rightmost word in the first child subtree, and call it $w_i$. Then, $w_{i+1}$ is the leftmost word in the second child subtree, and $X$ is the lowest common ancestor of $w_i$ and $w_{i+1}$. Thus, $c_i = X$, and a tree with identical structure but a different nonterminal at that position will generate a label sequence with a different value of $c_i$. This concludes the proof of injectivity.
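The correspondence between relative values and brackets used in the first part of the proof can be made concrete with a short Python sketch. It is only an illustration under simplifying assumptions: leaves are written as bare words rather than as bracketed leaf nodes, every nonterminal is the generic `X`, and the function name is hypothetical.

```python
def bracketing(words, deltas, nonterminal="X"):
    """Rebuild a bracketed string from relative-scale values.

    `deltas` holds one integer per word except the last one.
    """
    tokens, depth = [], 0
    for word, delta in zip(words, deltas):
        tokens += ["(" + nonterminal] * max(delta, 0)  # brackets opened before the word
        tokens.append(word)
        tokens += [")"] * max(-delta, 0)               # brackets closed after the word
        depth += delta
    tokens.append(words[-1])
    tokens += [")"] * depth                            # close whatever remains open
    return " ".join(tokens)


print(bracketing("the red toy fell down .".split(), [2, 0, -1, 1, -1]))
# (X (X the red toy ) (X fell down ) . )
```

Each value contributes only opening brackets or only closing brackets, never both, and the last word closes everything that is still open, which mirrors why the final value carries no extra information.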
2.3 Limitations
We have shown that our proposed encoding is a total, injective function from trees without unary branches with yield of length $|\mathbf{w}|$ to sequences of $|\mathbf{w}|-1$ labels. This will serve as the basis for our reduction of constituent parsing to sequence labeling. However, to go from theory to practice, we need to overcome two limitations of the theoretical encoding: non-surjectivity and the inability to encode unary branches. Fortunately, both can be overcome with simple techniques.
Handling of unary branches
The encoding function $\Phi_{|\mathbf{w}|}$ cannot directly assign the nonterminal symbols of unary branches, as no pair of adjacent words has such a node as its lowest common ancestor. Figure 3 illustrates it with an example.
It is worth remarking that this is not a limitation of our encoding, but of any encoding that would facilitate constituent parsing as sequence labeling, as the number of nonterminal nodes in a tree with unary branches is not bounded by any function of $|\mathbf{w}|$. The fact that our encoding works for trees without unary branches owes to the fact that such a tree cannot have more than $|\mathbf{w}|-1$ non-leaf nodes, and therefore it is always possible to encode all of them in labels associated with $|\mathbf{w}|-1$ leaf nodes.
To overcome this issue, we follow a collapsing approach, as is common in parsers that need special treatment of unary chains (Finkel et al., 2008; Narayan and Cohen, 2016; Shen et al., 2018). For clarity, we use the name intermediate unary chains to refer to unary chains that end in a nonterminal symbol and leaf unary chains to name those that yield a PoS tag (see Figure 3). Intermediate unary chains are collapsed into a chained single symbol, which can be encoded by $\Phi_{|\mathbf{w}|}$ as any other nonterminal symbol. On the other hand, leaf unary chains are collapsed together with the PoS tag, but these cannot be encoded and decoded by relying on $\Phi_{|\mathbf{w}|}$, as our encoding assumes a fixed sequence of leaf nodes and does not encode them explicitly. To overcome this, we propose two methods:

To use an extra function to enrich the PoS tags before applying our main sequence labeling function. This function is of the form $\Psi_{|\mathbf{w}|}: V^{|\mathbf{w}|} \rightarrow U^{|\mathbf{w}|}$, where $U$ is the set of labels of the leaf unary chains (without including the PoS tags) plus a dummy label $\emptyset$. $\Psi_{|\mathbf{w}|}$ maps $w_i$ to $\emptyset$ if there is no leaf unary chain at $w_i$, or to the collapsed label otherwise.

To extend our encoding function to predict them as a part of our labels $l_i$, by transforming them into 3-tuples $(n_i, c_i, u_i)$ where $u_i$ encodes the leaf unary chain collapsed label for $w_i$, if there is any, or none otherwise. We call this extended encoding function $\Phi'_{|\mathbf{w}|}$.
The former requires running two passes of sequence labeling to deal with leaf unary chains. The latter avoids this, but the number of labels is larger and sparser. In §4 we discuss how these two approaches behave in terms of accuracy and speed.
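The collapsing step can be sketched in a few lines of Python. The tree representation (internal nodes as `(label, [children])`, preterminals as `(pos_tag, word)`), the function name and the `+` separator used to chain symbols are assumptions made for this illustration.

```python
def collapse_unary(node, sep="+"):
    """Collapse unary chains into single chained symbols.

    Intermediate chains (ending in a nonterminal) become one nonterminal;
    leaf chains are merged with the PoS tag of the word they dominate.
    """
    label, content = node
    if isinstance(content, str):          # preterminal: (PoS tag, word)
        return node
    children = content
    while len(children) == 1:             # follow the unary chain downwards
        child_label, child_content = children[0]
        label = label + sep + child_label
        if isinstance(child_content, str):
            return (label, child_content)  # leaf unary chain + PoS tag
        children = child_content
    return (label, [collapse_unary(child, sep) for child in children])


tree = ("S", [("NP", [("PRP", "He")]),
              ("VP", [("VBD", "slept"), ("ADVP", [("RB", "well")])])])
print(collapse_unary(tree))
# ('S', [('NP+PRP', 'He'), ('VP', [('VBD', 'slept'), ('ADVP+RB', 'well')])])
```

After this step every internal node has at least two children, so the encoding of §2.1 applies. The chained leaf symbols such as `NP+PRP` are the ones that either an extra tagging pass or the enriched 3-tuple labels would have to predict.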
Nonsurjectivity
Our encoding, as defined formally in Section 2.1, is injective but not surjective, i.e., not every sequence of $|\mathbf{w}|-1$ labels of the form $(n_i, c_i)$ corresponds to a tree in $T_{|\mathbf{w}|}$. In particular, there are two situations where a label sequence formally has no tree, and thus $\Phi^{-1}_{|\mathbf{w}|}$ is not formally defined and we have to use extra heuristics or processing to define it:

Sequences with conflicting nonterminals. A nonterminal can be the lowest common ancestor of more than one pair of contiguous words when branches are non-binary. For example, in the tree in Figure 1, the lowest common ancestor of both “the” and “red” and of “red” and “toy” is the same node. This translates into two consecutive labels with the same nonterminal, $c_i = c_{i+1}$, in the label sequence. If we take that sequence and change one of the two to a different nonterminal, we obtain a label sequence that does not strictly correspond to the encoding of any tree, as it contains a contradiction: two elements referencing the same node indicate different nonterminal labels. In practice, this problem is trivial to solve: when a label sequence encodes several conflicting nonterminals at a given position in the tree, we compute $\Phi^{-1}_{|\mathbf{w}|}$ using the first such nonterminal and ignoring the rest.

Sequences that produce unary structures. There are sequences of values $n_i$ that do not correspond to a tree in $T_{|\mathbf{w}|}$ because the only tree structure satisfying the common ancestor conditions of their values (the one built by generating the string of $\beta_i$s in the injectivity proof) contains unary branchings, causing the problem described above where we do not have a specification for every nonterminal. An example of this, in absolute scaling, is the sequence introduced in Figure 3. In practice, as unary chains have been previously collapsed, any generated unary node is considered invalid and removed.
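A decoder that applies both heuristics can be sketched as follows. It is a minimal illustration rather than the implementation used in the experiments: it takes absolute-scale labels, represents trees as nested `(label, children)` tuples, and the function names are hypothetical. Clamping values below 1 is an extra assumption added so that every label sequence yields some tree.

```python
def decode(words, labels):
    """Build a tree from absolute-scale labels (n_i, c_i), one per word
    except the last. Nodes are [label, children] lists while open."""
    root = [None, []]
    path = [root]                          # open nodes, from the root downwards
    for word, (n, c) in zip(words, labels):
        n = max(n, 1)                      # assumption: at least the root is shared
        while len(path) < n:               # open the constituents missing above the word
            node = [None, []]
            path[-1][1].append(node)
            path.append(node)
        path[-1][1].append(word)           # attach the word to the deepest open node
        del path[n:]                       # close what the next word does not share
        if path[-1][0] is None:            # conflicting nonterminals: the first one wins
            path[-1][0] = c
    path[-1][1].append(words[-1])
    return freeze(root)


def freeze(node):
    """Convert to nested tuples, removing any unary node that was generated."""
    label, children = node
    kids = [c if isinstance(c, str) else freeze(c) for c in children]
    if len(kids) == 1:
        return kids[0]
    return (label, kids)


words = "the red toy fell down .".split()
print(decode(words, [(2, "NP"), (2, "ADJP"), (1, "S"), (2, "VP"), (1, "S")]))
# ('S', [('NP', ['the', 'red', 'toy']), ('VP', ['fell', 'down']), '.'])
```

In the usage example the second label contradicts the first one about the nonterminal of the same node, and the decoder keeps `NP`. A sequence such as `[(1, "S"), (3, "NP")]` over three words would create an unlabeled unary node, which `freeze` drops.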
3 Sequence Labeling
Sequence labeling is a structured prediction task that generates an output label for every token in an input sequence (Rei and Søgaard, 2018). Examples of practical tasks that can be formulated under this framework in natural language processing are PoS tagging, chunking or named-entity recognition, which are in general fast. However, to our knowledge, there is no previous work on sequence labeling methods for constituent parsing, as an encoding allowing it was lacking so far.
In this work, we consider methods ranging from traditional models to state-of-the-art neural models for sequence labeling, to test whether they are valid to train constituency-based parsers following our approach. We give the essential details needed to comprehend the core of each approach, but will mainly treat them as black boxes, referring the reader to the references for a careful and detailed mathematical analysis of each method. Appendix A specifies additional hyperparameters for the tested models.
Preprocessing
We add to every sentence both beginning and end tokens.
3.1 Traditional Sequence Labeling Methods
We consider two baselines to train our prediction function $F_{|\mathbf{w}|,\theta}$, based on popular sequence labeling methods used in NLP problems, such as PoS tagging or shallow parsing (Schmid, 1994; Sha and Pereira, 2003).
Conditional Random Fields
We use a conditional random field (Lafferty et al., 2001). Let $\text{crf}_{\theta}(\mathbf{x})$ be its prediction function; a CRF model computes conditional probability distributions of the form $p(\mathbf{y}|\mathbf{x})$ such that $\text{crf}_{\theta}(\mathbf{x}) = \arg\max_{\mathbf{y}'} p(\mathbf{y}'|\mathbf{x}) = \mathbf{y}$. In our work, the inputs to the CRF are words and PoS tags. To represent a word $w_i$, we use information from the word itself and also contextual information from $w_{i-1}$ and $w_{i+1}$.

We extract the word form (lowercased), the PoS tag and its prefix of length 2, from $w_{i-1}$, $w_i$ and $w_{i+1}$. For these words we also include binary features: whether it is the first word, the last word, a number, and whether the word is capitalized or uppercased.

Additionally, for $w_i$ we look at the suffixes of both length 3 and 2 (i.e. $w_i[-3:]$ and $w_i[-2:]$).
To build our CRF models, we relied on the sklearn-crfsuite library.
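The feature set described above can be sketched as a plain Python function that returns one dictionary per word, a layout that CRF toolkits such as sklearn-crfsuite commonly accept. The feature names and the function names are hypothetical, and the exact tests used for "number", "capitalized" and "uppercased" are assumptions.

```python
def token_features(words, tags, i):
    """Features for word i, taken from the word and its immediate neighbours."""
    features = {}
    for offset in (-1, 0, 1):
        j = i + offset
        if not 0 <= j < len(words):
            continue                      # no neighbour at the sentence boundary
        word, tag = words[j], tags[j]
        prefix = f"{offset:+d}:"          # "-1:", "+0:" or "+1:"
        features.update({
            prefix + "word.lower": word.lower(),
            prefix + "postag": tag,
            prefix + "postag[:2]": tag[:2],
            prefix + "is_first": j == 0,
            prefix + "is_last": j == len(words) - 1,
            prefix + "is_number": word.isdigit(),
            prefix + "is_capitalized": word[:1].isupper(),
            prefix + "is_upper": word.isupper(),
        })
    # Suffixes are only extracted for the current word.
    features["+0:suffix3"] = words[i][-3:]
    features["+0:suffix2"] = words[i][-2:]
    return features


def sentence_features(words, tags):
    return [token_features(words, tags, i) for i in range(len(words))]


feats = sentence_features(["The", "toy", "fell"], ["DT", "NN", "VBD"])
print(feats[1]["-1:word.lower"], feats[1]["+0:suffix2"], feats[1]["+1:postag[:2]"])
# the oy VB
```

The target for each word would then be its label from the encoding, serialized as a string.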
Multi-Layer Perceptron
We use a multi-layer perceptron (Rosenblatt, 1958) with one hidden layer. Let $\text{mlp}_{\theta}(\mathbf{x})$ be its prediction function; it treats sequence labeling as a set of independent predictions, one per word. The prediction for a word is computed as $\text{softmax}(W_2 \cdot \text{relu}(W_1 \cdot \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2)$, where $\mathbf{x}$ is the input vector and $W_i$ and $\mathbf{b}_i$ are the weights and biases to be learned at layer $i$.
We consider both a discrete ($\text{mlp}_d$) and an embedded ($\text{mlp}_e$) perceptron. For the former, we use as inputs the same set of features as for the CRF. For the latter, the vector $\mathbf{x}$ for $w_i$ is defined as a concatenation of word and PoS tag embeddings from a context window around $w_i$.
To build our MLPs, we relied on Keras.
3.2 Sequence Labeling Neural Models
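We use NCRF++ (Yang and Zhang, 2018) to build a sequence labeling model based on bidirectional long short-term memory networks (BiLSTMs). Let $\text{lstm}^{l}$ and $\text{lstm}^{r}$ be LSTMs that read the sequence of input vectors $\mathbf{x}$ from left to right and from right to left, respectively. The output of the BiLSTM at position $i$ is the concatenation ($\circ$) of their hidden vectors:
$\text{bilstm}(\mathbf{x}, i) = \text{lstm}^{l}(\mathbf{x}_{[1:i]}) \circ \text{lstm}^{r}(\mathbf{x}_{[|\mathbf{x}|:i]}) = \mathbf{h}^{l}_{i} \circ \mathbf{h}^{r}_{i} = \mathbf{h}_{i}$
In the case of multilayer BiLSTMs, the time-step outputs of the $m$th BiLSTM layer are fed as input to the $(m+1)$th one. The output label for each $w_i$ is finally predicted as $\text{softmax}(W \cdot \mathbf{h}_i + \mathbf{b})$.
Given a sentence $[w_1, w_2, \ldots, w_{|\mathbf{w}|}]$, the input to the sequence model is a sequence of embeddings $[\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{|\mathbf{w}|}]$ where each $\mathbf{x}_i = \mathbf{w}_i \circ \mathbf{p}_i \circ \mathbf{ch}_i$, such that $\mathbf{w}_i$ and $\mathbf{p}_i$ are a word and a PoS tag embedding, and $\mathbf{ch}_i$ is a word embedding obtained from an initial character embedding layer, also based on a BiLSTM. Figure 4 shows the architecture of the network.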
4 Experiments
We report results on models trained using the relative scale encoding and the special tag (root, $c_i$). As a reminder, to deal also with leaf unary chains, we proposed two methods in §2.3: to predict them relying on both the functions $\Psi_{|\mathbf{w}|}$ and $\Phi_{|\mathbf{w}|}$, or to predict them as part of an enriched label predicted by the function $\Phi'_{|\mathbf{w}|}$. For clarity, we name these models with the superscripts $\Psi,\Phi$ and $\Phi'$, respectively.
Datasets
We use the Penn Treebank (Marcus et al., 1994) and its official splits: Sections 2 to 21 for training, 22 for development and 23 for testing. For the Chinese Penn Treebank (Xue et al., 2005), articles 001–270 and 440–1151 are used for training, articles 301–325 for development, and articles 271–300 for testing. We use the version of the corpus with the predicted PoS tags of Dyer et al. (2016). We train the $\Phi$ models on the output predicted by the corresponding $\Psi$ model.
Metrics
We use the F-score from the EVALB script. Speed is measured in sentences per second. As the problem is reduced to sequence labeling, we also briefly comment on the accuracy (percentage of correctly predicted labels) of our baselines.
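To make the distinction between the two metrics concrete, the sketch below computes a simplified labeled bracketing F-score between two trees. It is not a reimplementation of EVALB, which applies additional conventions not modeled here; the nested `(label, children)` tree representation and the function names are assumptions.

```python
from collections import Counter


def labeled_spans(tree, start=0):
    """Return (Counter of (label, start, end) spans, end position)."""
    label, children = tree
    spans, position = Counter(), start
    for child in children:
        if isinstance(child, str):
            position += 1                 # a word covers one position
        else:
            child_spans, position = labeled_spans(child, position)
            spans += child_spans
    spans[(label, start, position)] += 1
    return spans, position


def bracket_f1(gold, predicted):
    """Harmonic mean of labeled bracket precision and recall."""
    gold_spans, _ = labeled_spans(gold)
    predicted_spans, _ = labeled_spans(predicted)
    matched = sum((gold_spans & predicted_spans).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(predicted_spans.values())
    recall = matched / sum(gold_spans.values())
    return 2 * precision * recall / (precision + recall)


gold = ("S", [("NP", ["the", "red", "toy"]), ("VP", ["fell", "down"]), "."])
predicted = ("S", [("NP", ["the", "red", "toy"]), "fell", "down", "."])
print(round(bracket_f1(gold, predicted), 3))
# 0.8
```

Label accuracy, in contrast, is simply the fraction of positions where the predicted label equals the gold label, so a single wrong label always costs the same regardless of how many brackets it affects.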
Source code
It can be found at https://github.com/aghie/tree2labels
Hardware
The models are run on a single thread of a CPU and on a single GPU (the exact hardware is given in the footnotes).
4.1 Results
Table 1 shows the performance of our baselines on the PTB development set. Since we use different libraries to train the models, these might show some differences in performance and speed beyond those expected in theory. For the BiLSTM model we test:

BiLSTM: It uses neither pretrained word embeddings nor character embeddings. The number of layers $m$ is set to 1.

BiLSTM: It includes character embeddings processed through a BiLSTM.

BiLSTM: $m$ is set to 2. No character embeddings.

BiLSTM: $m$ is set to 2.
Model  F-score  Acc.  Sent/s (CPU)  Sent/s (GPU)
CRF  60.4  63.9  83  -
MLP  72.6  78.1  16  49
MLP  74.8  79.3  503  666
CRF  60.3  65.4  6  -
MLP  71.9  78.0  31  95
MLP  75.4  79.7  342  890
BiLSTM  87.2  88.9  144  541
BiLSTM  88.3  89.8  144  543
BiLSTM  88.5  90.0  120  456
BiLSTM  89.7  90.7  72  476
BiLSTM  89.9  90.9  65  405
BiLSTM  87.3  89.3  206  941
BiLSTM  88.5  90.1  209  957
BiLSTM  88.0  90.0  180  808
BiLSTM  89.8  90.9  119  842
BiLSTM  89.7  90.9  109  716

Model  Testbed  CPU #Cores  CPU Sents/s  GPU #GPU  GPU Sents/s  F-score
Sequence labeling
MLP  WSJ23  1  501  1  669  74.1
MLP  WSJ23  1  349  1  929  74.8
BiLSTM  WSJ23  1  148  1  581  88.1
BiLSTM  WSJ23  1  221  1  1016  88.3
BiLSTM  WSJ23  1  66  1  434  89.9
BiLSTM  WSJ23  1  115  1  780  90.0
BiLSTM  WSJ23  1  74  1  506  90.0
BiLSTM  WSJ23  1  126  1  898  90.0
Sequence-to-sequence
Vinyals et al. (2015), 3-layer LSTM  WSJ23  70
Vinyals et al. (2015), 3-layer LSTM + Attention  WSJ23  Multi-core (number not specified)  120  88.3
Constituency parsing as dependency parsing
Fernández-González and Martins (2015)  WSJ23  1  41  90.2
Chart-based parsers
Charniak (2000)  WSJ23  1  6  89.5
Petrov and Klein (2007)  WSJ23  1  6  90.1
Stern et al. (2017)  WSJ23  16*  20  91.8
Kitaev and Klein (2018) + ELMo (Peters et al., 2018)  WSJ23  2  70  95.1
Chart-based parsers with GPU-specific implementation
Canny et al. (2013)  WSJ(30)  1  250
Hall et al. (2014)  WSJ(40)  1  404
Transition-based and other greedy constituent parsers
Zhu et al. (2013)  WSJ23  1  101  89.9
Zhu et al. (2013) + Padding  WSJ23  1  90  90.4
Dyer et al. (2016)  WSJ23  1  17  91.2
Fernández-González and Gómez-Rodríguez (2018)  WSJ23  1  18  91.7
Stern et al. (2017)  WSJ23  16*  76  91.8
Liu and Zhang (2017)  WSJ23  91.8
Shen et al. (2018)  WSJ23  1  111  91.8

Model  F-score
MLP  63.1
MLP  64.4
BiLSTM  84.4
BiLSTM  84.1
BiLSTM  84.4
BiLSTM  83.1
Zhu et al. (2013)  82.6
Zhu et al. (2013) + P  83.2
Dyer et al. (2016)  84.6
Liu and Zhang (2017)  86.1
Shen et al. (2018)  86.5
Fernández-González and Gómez-Rodríguez (2018)  86.8

The $\Psi,\Phi$ and the $\Phi'$ models obtain similar F-scores. When it comes to speed, the BiLSTM$^{\Phi'}$ models are notably faster than the BiLSTM$^{\Psi,\Phi}$ ones. $\Phi'$ models are expected to be more efficient, as leaf unary chains are handled implicitly. In practice, $\Phi'$ is a more expensive function to compute than the original $\Phi$, since the number of output labels is significantly larger, which reduces the expected gains with respect to the $\Psi,\Phi$ models. It is worth noting that our encoding is useful to train an MLP with a decent sense of phrase structure, while being very fast. Looking at the differences between F-score and accuracy for each baseline, we notice that the gap between them is larger for CRFs and MLPs. This shows the difficulties that these methods have, in comparison to the BiLSTM approaches, in predicting the correct label when a word $w_i$ has few common ancestors with $w_{i+1}$. For example, let 10x be the right (relative scale) label between $w_i$ and $w_{i+1}$, and let $l_1$=1x and $l_2$=9x be two possible wrong labels. In terms of accuracy, it makes no difference whether a model predicts $l_1$ or $l_2$, but in terms of constituent F-score, the first will be much worse, as many closed parentheses will remain unmatched.
5 Discussion
We are not aware of previous work that reduces constituency parsing to sequence labeling. The work that can be considered the closest to ours is that of Vinyals et al. (2015), who address it as a sequence-to-sequence problem, where the output sequence has variable/unknown length. In this context, even a perceptron with one hidden layer outperforms their 3-layer LSTM model without attention, while parsing hundreds of sentences per second. Our best models also outperform their 3-layer LSTM model with attention, and even a simple BiLSTM model with pretrained GloVe embeddings obtains a similar performance. In terms of F-score, the proposed sequence labeling baselines still lag behind mature shift-reduce and chart parsers. In terms of speed, they are clearly faster than both CPU and GPU chart parsers and are at least on par with the fastest shift-reduce ones. If phrase-structure representations are needed in large-scale tasks where the speed of current systems makes parsing infeasible (Gómez-Rodríguez, 2017; Gómez-Rodríguez et al., 2017), we can use the simpler, less accurate models to get speeds well above any parser reported to date, although with a significant loss of accuracy.
It is also worth noting that in their recent work, published while this manuscript was under review, Shen et al. (2018) developed a mapping of binary trees with $N$ leaves to sequences of $N-1$ integers (Shen et al., 2018, Algorithm 1). This encoding is different from the ones presented here, as it is based on the height of lowest common ancestors in the tree, rather than their depth. While their purpose is also different from ours, as they use this mapping to generate training data for a parsing algorithm based on recursive partitioning using real-valued distances, their encoding could also be applied with our sequence labeling approach. However, it has the drawback that it only supports binarized trees, and some of its theoretical properties are worse for our goal, as the way to define the inverse of an arbitrary label sequence can be highly ambiguous: for example, a sequence of $N-1$ equal labels in this encoding can represent any binary tree with $N$ leaves.
6 Conclusion
We presented a new parsing paradigm, based on a reduction of constituency parsing to sequence labeling. We first described a linearization function to transform a constituent tree (with $n$ leaves) into a sequence of $n-1$ labels that encodes it. We proved that this encoding function is total and injective for any tree without unary branches. We also discussed its limitations, namely how to deal with unary branches and non-surjectivity, and showed how these can be solved. We finally proposed a set of fast and strong baselines.
Acknowledgments
This work has received funding from the European Research Council (ERC), under the European Union’s Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150), from the TELEPARES-UDC project (FFI2014-51978-C2-2-R) and the ANSWER-ASAP project (TIN2017-85160-C2-1-R) from MINECO, and from Xunta de Galicia (ED431B 2017/01). We gratefully acknowledge NVIDIA Corporation for the donation of a GTX Titan X GPU.
Appendix A Setup configuration used to train our sequence labeling methods
Conditional Random Fields
We use the default configuration provided with the sklearn-crfsuite library.
Multi-Layer Perceptron
Both the discrete and distributed perceptrons are implemented in Keras.

Training hyperparameters: The model is trained for up to 30 epochs, with early stopping (patience=4). We use stochastic gradient descent (SGD) to optimize the objective function. The initial learning rate is set to 0.1.

Layer and embedding sizes: The dimension of the hidden layer is set to 100. For the perceptron fed with embeddings, we use 100 and 20 dimensions to represent a word and its PoS tag, respectively.
Bidirectional Long Short-Term Memory
We relied on the NCRF++ framework (Yang and Zhang, 2018).

Training hyperparameters: We use mini-batching (the batch size during training is set to 8). As optimizer, we use SGD, setting the initial learning rate to 0.2, momentum to 0.9 and a linear decay of 0.05. We train the model for up to 100 epochs and keep the model that performs best on the development set.

Layer and embedding sizes: We use 100, 30 and 20 dimensions to represent a word, a PoS tag and a character embedding, respectively. The output hidden layer from the character embeddings layer is set to 50. The left-to-right and right-to-left LSTMs each generate a hidden vector of size 400.
Footnotes
 A last dummy label is generated to fulfill the properties of sequence labeling tasks.
 We tried contextual information beyond the immediate previous and next word, but the performance was similar.
 https://sklearncrfsuite.readthedocs.io/en/latest/
 In contrast to the discrete input, larger contextual information was useful.
 https://keras.io/
 https://github.com/jiesutd/NCRFpp, with PyTorch.
 An Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz.
 A GeForce GTX 1080.
 A larger batch will likely result in faster parsing when executing the model on a GPU, but not necessarily on a CPU.
References
 John Canny, David Hall, and Dan Klein. 2013. A multi-teraflop constituency parser using GPUs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1898–1907.
 Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference, pages 132–139. Association for Computational Linguistics.
 Do Kook Choe and Eugene Charniak. 2016. Parsing as language modeling. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2331–2336, Austin, Texas. Association for Computational Linguistics.
 Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, pages 16–23. Association for Computational Linguistics.
 Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209. Association for Computational Linguistics.
 Daniel Fernández-González and Carlos Gómez-Rodríguez. 2018. Faster Shift-Reduce Constituent Parsing with a Non-Binary, Bottom-Up Strategy. ArXiv e-prints.
 Daniel Fernández-González and André F. T. Martins. 2015. Parsing as reduction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1523–1533. Association for Computational Linguistics.
 Jenny Rose Finkel, Alex Kleeman, and Christopher D Manning. 2008. Efficient, feature-based, conditional random field parsing. Proceedings of ACL-08: HLT, pages 959–967.
 Carlos Gómez-Rodríguez. 2017. Towards fast natural language parsing: FASTPARSE ERC Starting Grant. Procesamiento del Lenguaje Natural, 59.
 Carlos Gómez-Rodríguez, Iago Alonso-Alonso, and David Vilares. 2017. How important is syntactic parsing accuracy? An empirical evaluation on rule-based sentiment analysis. Artificial Intelligence Review.
 David Hall, Taylor Berg-Kirkpatrick, and Dan Klein. 2014. Sparser, better, faster GPU parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 208–217.
 Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
 Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313–327.
 Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia. Association for Computational Linguistics.
 Jonathan K. Kummerfeld, David Hall, James R. Curran, and Dan Klein. 2012. Parser showdown at the Wall Street corral: An empirical investigation of error types in parser output. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1048–1059, Jeju Island, Korea. Association for Computational Linguistics.
 John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
 Jiangming Liu and Yue Zhang. 2017. In-order transition-based constituent parsing. Transactions of the Association for Computational Linguistics, 5:413–424.
 Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn Treebank: Annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, HLT ’94, pages 114–119, Stroudsburg, PA, USA. Association for Computational Linguistics.
 Shashi Narayan and Shay B. Cohen. 2016. Optimizing spectral learning for parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1546–1556, Berlin, Germany. Association for Computational Linguistics.
 Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 149–160.
 Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
 Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237. Association for Computational Linguistics.
 Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 433–440. Association for Computational Linguistics.
 Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 404–411.
 Barbara Plank, Anders Søgaard, and Yoav Goldberg. 2016. Multilingual part-of-speech tagging with bidirectional long short-term memory models and auxiliary loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 412–418, Berlin, Germany. Association for Computational Linguistics.
 Marek Rei and Anders Søgaard. 2018. Zero-shot sequence labeling: Transferring knowledge from sentences to tokens. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 293–302. Association for Computational Linguistics.
 Frank Rosenblatt. 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386.
 Kenji Sagae and Alon Lavie. 2005. A classifier-based parser with linear run-time complexity. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 125–132. Association for Computational Linguistics.
 Helmut Schmid. 1994. Part-of-speech tagging with neural networks. In Proceedings of the 15th Conference on Computational Linguistics - Volume 1, COLING ’94, pages 172–176, Stroudsburg, PA, USA. Association for Computational Linguistics.
 Fei Sha and Fernando Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 134–141. Association for Computational Linguistics.
 Yikang Shen, Zhouhan Lin, Athul Paul Jacob, Alessandro Sordoni, Aaron Courville, and Yoshua Bengio. 2018. Straight to the tree: Constituency parsing with neural syntactic distance. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1171–1180. Association for Computational Linguistics.
 Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A minimal span-based neural constituency parser. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 818–827, Vancouver, Canada. Association for Computational Linguistics.
 Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pages 2773–2781.
 Mengqiu Wang, Kenji Sagae, and Teruko Mitamura. 2006. A fast, accurate deterministic parser for Chinese. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, pages 425–432, Stroudsburg, PA, USA. Association for Computational Linguistics.
 Naiwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. 2005. The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Natural language engineering, 11(2):207–238.
 Jie Yang and Yue Zhang. 2018. NCRF++: An open-source neural sequence labeling toolkit. In Proceedings of ACL 2018, System Demonstrations, pages 74–79, Melbourne, Australia. Association for Computational Linguistics.
 Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
 Muhua Zhu, Yue Zhang, Wenliang Chen, Min Zhang, and Jingbo Zhu. 2013. Fast and accurate shift-reduce constituent parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 434–443.