Training Dependency Parsers with Partial Annotation
Recently, there has been a surge of interest in obtaining partially annotated data for model supervision. However, a systematic study of how to train statistical models with partial annotation (PA) is still lacking. Taking dependency parsing as our case study, this paper describes and compares two straightforward approaches for three mainstream dependency parsers. The first approach, proposed in previous work, directly trains a log-linear graph-based parser (LLGPar) with PA based on a forest-based objective. The second approach, proposed here for the first time, directly trains a linear graph-based parser (LGPar) and a linear transition-based parser (LTPar) with PA based on the idea of constrained decoding. We conduct extensive experiments on Penn Treebank under three different settings for simulating PA, i.e., random dependencies, most uncertain dependencies, and dependencies with divergent outputs from the three parsers. The results show that LLGPar is the most effective in learning from PA, and that LTPar lags behind the graph-based counterparts by a large margin. Moreover, LGPar and LTPar achieve their best performance when LLGPar is used to complete PA into full annotation (FA).
Traditional supervised approaches for structural classification assume full annotation (FA), meaning that the training instances have complete manually-labeled structures. In the case of dependency parsing, FA means a complete parse tree is provided for each training sentence. However, recent studies suggest that it is more economical and effective to construct labeled data with partial annotation (PA). Much research effort has been devoted to obtaining partially-labeled data for different tasks via active learning [Sassano and Kurohashi, 2010; Mirroshandel and Nasr, 2011; Li et al., 2012; Marcheggiani and Artières, 2014; Flannery and Mori, 2015; Li et al., 2016], cross-lingual syntax projection [Spreyer and Kuhn, 2009; Ganchev et al., 2009; Jiang et al., 2010; Li et al., 2014], or mining natural annotation implicitly encoded in web pages [Jiang et al., 2013; Liu et al., 2014; Nivre et al., 2014; Yang and Vozila, 2014]. Figure 1 gives an example sentence partially annotated with two dependencies.
However, a systematic study of how to train structural models such as dependency parsers with PA is still lacking. Most of the previous works listed above rely on ad-hoc strategies designed only for basic dependency parsers. One exception is Li et al. (2014), who convert partial trees into forests and train a log-linear graph-based dependency parser (LLGPar) with PA based on a forest-based objective, showing promising results. Meanwhile, it is still unclear how PA can be used to train state-of-the-art linear graph-based (LGPar) and transition-based (LTPar) parsers. Please refer to Section 6 for a detailed discussion of previous methods for training parsers with PA.
This paper aims to thoroughly study this issue and to make a systematic comparison of different approaches to training parsers with PA. In summary, we make the following contributions.
We present a general framework for directly training state-of-the-art LGPar and LTPar with PA based on constrained decoding. The basic idea is to use the current feature weights to parse the sentence under the PA-constrained search space, and to use the best parse as a pseudo gold-standard reference for the feature weight update during perceptron training. We also implement the forest-objective based approach of Li et al. (2014) for LLGPar.
We make a thorough comparison among the different directly-train approaches under three different settings for simulating PA, i.e., random dependencies, most uncertain dependencies, and dependencies with divergent outputs from the three parsers. We also compare the proposed directly-train approaches with the straightforward complete-then-train approach.
Extensive experiments on Penn Treebank lead to several interesting and clear findings.
2 Dependency Parsing
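Given an input sentence $\mathbf{x} = w_0 w_1 \dots w_n$, dependency parsing builds a complete dependency tree rooted at $w_0$, where $w_0$ is an artificial token linking to the root of the sentence [Kübler et al., 2009]. A dependency tree comprises a set of dependencies, namely $\mathbf{d} = \{(h, m) : 0 \le h \le n, 0 < m \le n\}$, where $(h, m)$ is a dependency from a head word $w_h$ to a modifier word $w_m$. A complete dependency tree contains $n$ dependencies, namely $|\mathbf{d}| = n$, whereas a partial dependency tree contains fewer than $n$ dependencies, namely $|\mathbf{d}| < n$. Alternatively, FA can be understood as a special form of PA. For clarity, we denote a complete tree as $\mathbf{d}$ and a partial tree as $\mathbf{d}^p$. The decoding procedure aims to find an optimal complete tree $\mathbf{d}^*$:
$$\mathbf{d}^* = \arg\max_{\mathbf{d} \in \mathcal{Y}(\mathbf{x})} \mathrm{Score}(\mathbf{x}, \mathbf{d}; \mathbf{w}) = \arg\max_{\mathbf{d} \in \mathcal{Y}(\mathbf{x})} \mathbf{w} \cdot \mathbf{f}(\mathbf{x}, \mathbf{d})$$
where $\mathcal{Y}(\mathbf{x})$ defines the search space containing all legal trees for $\mathbf{x}$; $\mathrm{Score}(\mathbf{x}, \mathbf{d}; \mathbf{w})$ is a score/probability of $\mathbf{d}$; $\mathbf{f}(\mathbf{x}, \mathbf{d})$ is a sparse accumulated feature vector corresponding to $\mathbf{d}$; $\mathbf{w}$ is the feature weight vector.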
2.1 Graph-based Approach
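To facilitate efficient search, the graph-based method factorizes the score of a dependency tree into those of small subtrees $\mathbf{p}$:
$$\mathrm{Score}(\mathbf{x}, \mathbf{d}; \mathbf{w}) = \sum_{\mathbf{p} \subseteq \mathbf{d}} \mathrm{Score}(\mathbf{x}, \mathbf{p}; \mathbf{w})$$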
Dynamic programming based exact search is usually applied to find the optimal tree [McDonald et al., 2005; McDonald and Pereira, 2006; Carreras, 2007; Koo and Collins, 2010]. We adopt the second-order model of McDonald and Pereira (2006), which incorporates two kinds of subtrees, i.e., single dependencies and adjacent siblings, and the feature set described in Bohnet (2010).
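The factorized decoding described above can be illustrated with a minimal sketch. It assumes a first-order factorization (single dependencies only, not the second-order model used in this paper) and replaces dynamic programming with exhaustive enumeration, which is only feasible for very short sentences. The names `is_tree`, `all_trees`, `tree_score`, `decode`, and the `arc_score` table are hypothetical, and projectivity and single-root constraints are ignored.

```python
from itertools import product


def is_tree(heads):
    """Return True if every word reaches the artificial root 0 without a cycle.

    heads[m - 1] is the head index of word m (words are numbered 1..n).
    """
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False  # revisiting a word means the arcs form a cycle
            seen.add(node)
            node = heads[node - 1]
    return True


def all_trees(n):
    """Enumerate every head assignment over n words that forms a tree."""
    for heads in product(range(n + 1), repeat=n):
        if is_tree(heads):
            yield heads


def tree_score(heads, arc_score):
    """Sum first-order scores; arc_score[h][m] scores the dependency h -> m."""
    return sum(arc_score[h][m] for m, h in enumerate(heads, start=1))


def decode(n, arc_score):
    """Exhaustive argmax over all legal trees (toy replacement for exact DP search)."""
    return max(all_trees(n), key=lambda heads: tree_score(heads, arc_score))
```

For a two-word sentence with `arc_score = [[0, 5, 1], [0, 0, 4], [0, 2, 0]]`, `decode(2, arc_score)` selects the chain in which the root governs word 1 and word 1 governs word 2.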
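A log-linear graph-based parser (LLGPar) defines the conditional probability of $\mathbf{d}$ given $\mathbf{x}$ as
$$p(\mathbf{d} \mid \mathbf{x}; \mathbf{w}) = \frac{\exp\left(\mathrm{Score}(\mathbf{x}, \mathbf{d}; \mathbf{w})\right)}{\sum_{\mathbf{d}' \in \mathcal{Y}(\mathbf{x})} \exp\left(\mathrm{Score}(\mathbf{x}, \mathbf{d}'; \mathbf{w})\right)}$$
For training, $\mathbf{w}$ is optimized using gradient descent to maximize the likelihood of the training data.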
A linear graph-based parser (LGPar) uses perceptron-like online training to directly learn $\mathbf{w}$. The workflow is similar to Algorithm 1, except that the gold-standard reference is directly provided in the training data without the need for constrained decoding in line 6. Previous work mostly adopts linear models to build dependency parsers, since perceptron training is simple yet effective in achieving competitive parsing accuracy in a variety of languages. Recently, LLGPar has attracted more attention due to its capability of producing subtree probabilities and learning from PA [Li et al., 2014; Ma and Zhao, 2015].
2.2 Transition-based Approach
The transition-based method builds a dependency tree by applying a sequence of shift/reduce actions $\mathbf{a}$, and factorizes the score of a tree into the sum of the scores of each action in $\mathbf{a}$ [Yamada and Matsumoto, 2003; Nivre, 2003; Zhang and Nivre, 2011]:
$$\mathrm{Score}(\mathbf{x}, \mathbf{d}; \mathbf{w}) = \sum_{i=1}^{|\mathbf{a}|} \mathrm{Score}(\mathbf{x}, c_i, a_i; \mathbf{w})$$
where $a_i$ is the action taken at step $i$ and $c_i$ is the configuration status after taking actions $a_1 \dots a_{i-1}$. State-of-the-art transition-based methods usually use inexact beam search to find a highest-scoring action sequence, and adopt global perceptron-like training to learn $\mathbf{w}$. We build an arc-eager transition-based dependency parser with the state-of-the-art features described in [Zhang and Nivre, 2011], referred to as a linear transition-based parser (LTPar).
3 Directly training parsers with PA
As described in Li et al. (2014), LLGPar can naturally learn from PA based on the idea of ambiguous labeling, which allows a sentence to have multiple parse trees (a forest) as its gold-standard reference [Riezler et al., 2002; Dredze et al., 2009; Täckström et al., 2013]. First, a partial tree $\mathbf{d}^p$ is converted into a forest by adding all possible dependencies pointing to the remaining words without heads, with the constraint that a newly added dependency does not violate existing ones in $\mathbf{d}^p$. The forest can be formally defined as $\mathcal{F} = \{\mathbf{d} \in \mathcal{Y}(\mathbf{x}) : \mathbf{d}^p \subseteq \mathbf{d}\}$, whose conditional probability is the sum of the probabilities of all trees that it contains:
$$p(\mathcal{F} \mid \mathbf{x}; \mathbf{w}) = \sum_{\mathbf{d} \in \mathcal{F}} p(\mathbf{d} \mid \mathbf{x}; \mathbf{w})$$
Then, we can define a forest-based training objective function to maximize the likelihood of the training data, as described in Li et al. (2014).
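As a rough illustration of the forest probability, the sketch below computes the log-probability of the set of trees compatible with a partial tree under a log-linear model. It is a toy under stated assumptions: scores are first-order (`arc_score[h][m]`), trees are enumerated exhaustively instead of using the inside algorithm, and projectivity is ignored. The function names and the `partial` dictionary (modifier index mapped to head index) are hypothetical, and the partial tree is assumed to be compatible with at least one legal tree.

```python
import math
from itertools import product


def is_tree(heads):
    """True if every word 1..n reaches the artificial root 0 without a cycle."""
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def forest_log_prob(n, arc_score, partial):
    """log p(forest | x): log-mass of trees containing `partial` minus log Z."""
    all_scores, forest_scores = [], []
    for heads in product(range(n + 1), repeat=n):
        if not is_tree(heads):
            continue
        score = sum(arc_score[h][m] for m, h in enumerate(heads, start=1))
        all_scores.append(score)
        # A tree belongs to the forest if it agrees with every annotated head.
        if all(heads[m - 1] == h for m, h in partial.items()):
            forest_scores.append(score)
    return log_sum_exp(forest_scores) - log_sum_exp(all_scores)
```

Maximizing this quantity over the training data, rather than the log-probability of a single gold tree, is the idea behind the forest-based objective. With an empty `partial`, the forest covers the whole search space and the log-probability is zero.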
LGPar can be extended to directly learn from PA based on the idea of constrained decoding, as shown in Algorithm 1, which has been previously applied to Chinese word segmentation with partially labeled sequences [Jiang et al., 2010]. The idea is to use the best tree in the constrained search space (line 6) as a pseudo gold-standard reference for the weight update. In traditional perceptron training, the reference would be a complete parse tree provided in the training data. Constrained decoding is trivial to implement for graph-based parsers: we only need to disable some illegal combination operations during dynamic programming.
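The constrained-decoding update can be sketched as follows. This is a simplified illustration, not Algorithm 1 itself: it assumes toy word-pair features, unaveraged weights, and exhaustive search in place of dynamic programming. The names `features`, `decode`, and `perceptron_update` are hypothetical, `words[0]` is the artificial root token, and `partial` maps a modifier index to its annotated head.

```python
from collections import Counter
from itertools import product


def is_tree(heads):
    """True if every word 1..n reaches the artificial root 0 without a cycle."""
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def features(words, heads):
    """Toy feature vector: counts of (head word, modifier word) pairs."""
    return Counter((words[h], words[m]) for m, h in enumerate(heads, start=1))


def decode(words, weights, partial=None):
    """Best tree under the current weights, restricted to trees containing `partial`."""
    partial = partial or {}
    n = len(words) - 1
    candidates = [
        heads for heads in product(range(n + 1), repeat=n)
        if is_tree(heads) and all(heads[m - 1] == h for m, h in partial.items())
    ]

    def score(heads):
        return sum(weights.get(f, 0.0) * c for f, c in features(words, heads).items())

    return max(candidates, key=score)


def perceptron_update(words, partial, weights):
    """One update: the constrained 1-best acts as the pseudo gold-standard reference."""
    predicted = decode(words, weights)            # unconstrained search space
    reference = decode(words, weights, partial)   # PA-constrained search space
    if predicted == reference:
        return False
    for feat, count in features(words, reference).items():
        weights[feat] = weights.get(feat, 0.0) + count
    for feat, count in features(words, predicted).items():
        weights[feat] = weights.get(feat, 0.0) - count
    return True
```

Because the reference is produced by the current model, features shared by the predicted and the reference tree cancel out, and only the parts that disagree with the annotated dependencies drive the update.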
LTPar can also directly learn from PA in a similar way, as shown in Algorithm 1. Constrained decoding is performed to find a pseudo gold-standard reference (line 7). Designing constrained decoding is more complicated for transition-based parsing than for graph-based parsing. Fortunately, Nivre et al. (2014) propose a procedure that enables arc-eager parsers to decode in the search space constrained by some given dependencies. We omit the details due to space limitations.
4.1 Data, parameter settings, and evaluation metric
We conduct experiments on Penn Treebank (PTB), and follow the standard data split (sec - as training, sec as development, and sec as test). Original bracketed structures are converted into dependency structures using Penn2Malt with default head-finding rules. We build a CRF-based bigram part-of-speech (POS) tagger to produce automatic POS tags for all train/dev/test data (10-way jackknifing on training data), with tagging accuracy on test data. As suggested by an earlier anonymous reviewer, we further split the training data into two parts. We assume that the first training sentences are provided as small-scale data with FA, which can be obtained by a small amount of manual annotation or through cross-lingual projection methods. We simulate PA for the remaining sentences. Table 1 shows the data statistics.
We train LLGPar with stochastic gradient descent [Finkel et al., 2008]. We set the beam size to during both training and evaluation of LTPar. Following the standard practice established by Collins (2002), we adopt averaged weights for the evaluation of LGPar and LTPar, and use the early-update strategy when training LTPar.
Since we have two sets of training data, we adopt the simple corpus-weighting strategy of Li et al. (2014). In each iteration, we merge train- and a subset of random sentences from train-, shuffle them, and then use them for training. For all parsers, training terminates when the peak parsing accuracy on dev data does not improve in consecutive iterations. For evaluation, we use the standard unlabeled attachment score (UAS) excluding punctuation marks.
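For reference, UAS excluding punctuation can be computed as in the following minimal sketch. How punctuation tokens are identified is not specified here, so the sketch assumes the caller supplies a boolean flag per token; the function name `uas` and its argument layout (one list per sentence) are hypothetical.

```python
def uas(gold_sentences, pred_sentences, punct_flags):
    """Unlabeled attachment score (in percent) over non-punctuation words.

    Each argument is a list with one entry per sentence; gold and predicted
    entries are lists of head indices, and punct_flags entries are lists of
    booleans marking punctuation tokens to be skipped.
    """
    correct = total = 0
    for gold, pred, flags in zip(gold_sentences, pred_sentences, punct_flags):
        for g, p, is_punct in zip(gold, pred, flags):
            if is_punct:
                continue
            total += 1
            correct += int(g == p)
    return 100.0 * correct / total if total else 0.0
```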
4.2 Three settings for simulating PA on train-
To simulate PA for each sentence in train-, we only keep gold-standard dependencies (not considering punctuation marks), and remove all other dependencies. We experiment with three simulation settings to fully investigate the capability of different approaches in learning from PA.
Random ( or ): For each sentence in train-, we randomly select words, and only keep dependencies linking to these words. With this setting, we aim to study the issue in isolation, without biasing toward certain structures. This setting may best fit the scenario of automatic syntax projection based on bitext, where the projected dependencies tend to be arbitrary (and noisy) due to errors in automatic source-language parses and word alignments, and due to non-isomorphic syntax between languages.
Uncertain ( or ): In their work on active learning with PA, Li et al. (2016) show that the marginal probabilities from LLGPar are the most effective uncertainty measure for selecting the most informative words to be annotated. Following their work, we first train LLGPar on train- with FA, and then use LLGPar to parse train- and select the most uncertain words, keeping their heads.
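Following Li et al. (2016), we measure the uncertainty of a word $w_i$ according to the marginal probability gap between its two most likely heads $h_i^0$ and $h_i^1$:
$$\mathrm{Gap}(\mathbf{x}, i) = p(h_i^0 \rightarrow i \mid \mathbf{x}; \mathbf{w}) - p(h_i^1 \rightarrow i \mid \mathbf{x}; \mathbf{w})$$
The intuition is that the smaller the probability gap is, the more uncertain the model is about $w_i$. The marginal probability of a dependency is the sum of the probabilities of all legal trees that contain the dependency:
$$p(h \rightarrow m \mid \mathbf{x}; \mathbf{w}) = \sum_{\mathbf{d} \in \mathcal{Y}(\mathbf{x}) : (h, m) \in \mathbf{d}} p(\mathbf{d} \mid \mathbf{x}; \mathbf{w})$$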
This setting fits the scenario of active learning, which aims to save annotation effort by only annotating the most useful structures. From another perspective, this setting may be biased toward LLGPar, since it keeps the structures that are most useful for LLGPar.
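The selection step of the uncertain setting can be sketched as follows, assuming the head marginals have already been computed by the parser. The data layout (`marginals[m - 1][h]` holds the marginal probability that word `m` has head `h`) and the names `probability_gap` and `most_uncertain_words` are hypothetical; the number of words to keep, `k`, is passed in directly rather than derived from a ratio.

```python
def probability_gap(head_marginals):
    """Gap between the two largest head marginals of one word."""
    top_two = sorted(head_marginals, reverse=True)[:2]
    if len(top_two) < 2:
        return top_two[0]
    return top_two[0] - top_two[1]


def most_uncertain_words(marginals, k):
    """Return the 1-based indices of the k words with the smallest gap,
    most uncertain first."""
    gaps = [(probability_gap(row), m) for m, row in enumerate(marginals, start=1)]
    gaps.sort()  # smallest gap (most uncertain) first; ties broken by position
    return [m for _, m in gaps[:k]]
```

The gold-standard heads of the returned words would then be kept as the partial annotation for the sentence.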
Divergence (): We train all three parsers on train-, and use them to parse train-. If their output trees do not assign the same head to a word, then we keep the gold-standard dependency pointing to the word, leading to remaining dependencies. Unlike the uncertain setting, this setting is not biased toward any particular parser.
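The random and divergence settings can be simulated along the following lines. This is only a sketch under assumptions: trees are lists of head indices, punctuation handling is omitted, and the function names `simulate_random_pa` and `simulate_divergence_pa` are hypothetical. The number of words to keep in the random setting is passed in as `k`.

```python
import random


def simulate_random_pa(gold_heads, k, rng):
    """Keep the gold heads of k randomly chosen words (1-based indices)."""
    chosen = rng.sample(range(1, len(gold_heads) + 1), k)
    return {m: gold_heads[m - 1] for m in sorted(chosen)}


def simulate_divergence_pa(gold_heads, parser_outputs):
    """Keep the gold head of every word on which the parsers disagree.

    parser_outputs is a list of predicted head lists, one per parser.
    """
    partial = {}
    for m, predicted in enumerate(zip(*parser_outputs), start=1):
        if len(set(predicted)) > 1:  # at least two parsers chose different heads
            partial[m] = gold_heads[m - 1]
    return partial
```

Both functions return a partial tree as a mapping from modifier index to gold head, which is the form a constrained decoder or a forest-based objective would consume.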
4.3 Results of different parsers trained on FA
We train the three parsers on all the training data with FA. We also employ four publicly available parsers with their default settings. BerkeleyParser (v1.7) is a constituent-structure parser, whose results are converted into dependency structures [Petrov and Klein, 2007]. TurboParser (v2.1.0) is a linear graph-based dependency parser using linear programming for inference [Martins et al., 2013]. Mate-tool (v3.3) is a linear graph-based dependency parser very similar to our implemented LGPar [Bohnet, 2010]. ZPar (v0.6) is a linear transition-based dependency parser very similar to our implemented LTPar [Zhang and Clark, 2011]. The results are shown in Table 2. We can see that the three parsers that we implement achieve competitive parsing accuracy and serve as strong baselines.
4.4 Results of the directly-train approaches
|FA (random)||PA (random)||PA (uncertain)||PA (divergence)|
|LGPar||93.00 (-0.16)||91.76 (-0.17)||90.80 (-0.35)||91.63 (-0.76)||90.62 (-1.04)||92.46 (-0.56)||91.64 (-0.80)||91.69 (-0.73)|
|LTPar||92.77 (-0.39)||91.22 (-0.71)||90.35 (-0.80)||91.12 (-1.27)||90.12 (-1.54)||91.35 (-1.67)||90.99 (-1.45)||90.70 (-1.72)|
Comparing the three parsers, we have several clear findings. (1) LLGPar achieves the best performance over all settings and is very effective in learning from PA. (2) The accuracy gap between LGPar and LLGPar becomes larger with PA than with FA, indicating that LGPar is less effective in learning from PA than LLGPar. (3) LTPar lags behind LLGPar by a large margin and is ineffective in learning from PA.
FA (random) vs. PA (random): from the results in the two major columns, we can see that LLGPar achieves higher accuracy by about when trained on sentences with random dependencies than when trained on random sentences with FA. This is reasonable and can be explained under the assumption that LLGPar can make full use of PA in model training. In fact, in both cases, the training data contains approximately the same number of annotated dependencies. However, from the perspective of model training, given some dependencies in the case of PA, more information about the syntactic structure can be derived. [Footnote 1: Also, as suggested in the work of Li et al. (2016), annotating PA is more time-consuming than annotating FA in terms of averaged time for each dependency, since dependencies in the same sentence are correlated and earlier annotated dependencies usually make later annotation easier.] Taking Figure 1 as an example, “I” can only modify “saw” due to the single-root and single-head constraints; similarly, “Sarah” can only modify either “saw” or “with”; and so on. Therefore, given the same amount of annotated dependencies, random PA contains more syntactic information than random FA, which explains why LLGPar performs better with PA than FA.
In contrast, both LGPar and LTPar achieve slightly lower accuracy with PA than with FA. This is further evidence that LGPar and LTPar are less effective than LLGPar in learning from PA.
PA (random) vs. PA (uncertain): we can see that all three parsers achieve much higher accuracy in the latter case. [Footnote 2: The only exception is LTPar with PA, where the accuracy increases by only , which may be caused by the ineffectiveness of LTPar in learning from PA.] The annotated dependencies in PA (uncertain) are the most uncertain ones for the current statistical parser (i.e., LLGPar), and thus are more helpful for training the models than those in PA (random). Another phenomenon is that, in the case of PA (uncertain), increasing to actually doubles the number of annotated dependencies, but only boosts the accuracy of LLGPar by , which indicates that the newly added dependencies are much less useful, since the model can already handle these low-uncertainty dependencies well.
PA (uncertain, ) vs. PA (divergence): we can see that all three parsers achieve similar parsing accuracies. This indicates that the uncertainty measure based on LLGPar can actually discover useful dependencies to be annotated without particularly biasing toward LLGPar itself.
In summary, we can conclude from the results that LLGPar can effectively learn from PA, whereas LGPar is slightly less effective and LTPar is ineffective.
4.5 Results of the complete-then-train methods
|Parser for completion||No constraints||PA (random)||PA (uncertain)||PA (divergence)|
|LLGPar-||86.67||92.65 (+5.98)||90.02 (+3.35)||97.43 (+10.76)||94.43 (+7.76)||94.36 (+7.69)|
|LGPar-||86.05||92.16 (+6.11)||89.48 (+3.43)||97.30 (+11.25)||94.11 (+8.06)||94.21 (+8.16)|
|LTPar-||85.38||91.76 (+6.38)||88.89 (+3.51)||96.90 (+11.52)||93.35 (+7.97)||93.85 (+8.47)|
|LLGPar-+||–||95.55 (+2.90)||93.37 (+3.35)||98.30 (+0.87)||96.22 (+1.79)||95.57 (+1.21)|
The most straightforward method for learning from PA is the complete-then-train method [Mirroshandel and Nasr, 2011]. The idea is to first use an existing parser to complete the partial trees in train- into full trees based on constrained decoding, and then to train the target parser on train- with FA and train- with completed FA.
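The completion step can be pictured with the following toy sketch, which fills in the missing heads of a partial tree by searching only among trees that contain the annotated dependencies. It assumes first-order scores (`arc_score[h][m]`) from some existing parser and exhaustive search; the names `is_tree` and `complete` are hypothetical, and projectivity is ignored.

```python
from itertools import product


def is_tree(heads):
    """True if every word 1..n reaches the artificial root 0 without a cycle."""
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def complete(n, arc_score, partial):
    """Return the highest-scoring full tree that contains every dependency in
    `partial` (modifier -> head), or None if no legal tree is compatible."""
    best, best_score = None, float("-inf")
    for heads in product(range(n + 1), repeat=n):
        if any(heads[m - 1] != h for m, h in partial.items()):
            continue  # violates an annotated dependency
        if not is_tree(heads):
            continue
        score = sum(arc_score[h][m] for m, h in enumerate(heads, start=1))
        if score > best_score:
            best, best_score = heads, score
    return best
```

The completed trees are then treated as ordinary full annotation when training the target parser, so any errors made by the completing parser are passed on as if they were gold-standard.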
Results of completing via constrained decoding: Table 4 reports UAS of the completed trees on train- using two different strategies for completion. “No constraints ()” means that train- has no annotated dependencies and normal decoding without constraints is used. In the remaining columns, each parser performs constrained decoding on PA where dependencies are provided in each sentence.
Coarsely-trained-self for completion: We complete PA into FA using the corresponding parsers coarsely trained on only train- with FA. We call these parsers LLGPar-, LGPar-, and LTPar-, respectively.
Fine-trained-LLGPar for completion: We complete PA into FA using LLGPar fine-trained on both train- with FA and train- with PA. We call this parser LLGPar-+. Please note that LLGPar-+ actually performs a closed test in this setting, meaning that it parses its own training data. For example, LLGPar-+ trained on random () is employed to complete the same data by filling in the remaining dependencies.
Comparing the three parsers trained on train-, we can see that constrained decoding has similar effects on all three parsers, and is able to return much more accurate trees. Numbers in parentheses show the accuracy gap between normal () and constrained decoding. This suggests that constrained decoding itself is not responsible for the ineffectiveness of Algorithm 1 for LTPar.
Comparing the results of LLGPar- and LLGPar-+ (numbers in parentheses showing the accuracy gap), it is obvious that the latter produces much better full trees, since the fine-trained LLGPar can make extra use of PA in train- during training.
Results of training on completed FA: Table 5 compares the performance of the three parsers trained on train- with FA and train- with completed FA, from which we can draw several clear and interesting findings. First, unlike the case of directly training on PA, the three parsers achieve very similar parsing accuracies when trained on data with completed FA in both completion settings. Second, using parsers coarsely trained on train- for completion leads to very bad performance, which is even much worse than that of the directly-train method in Table 3, except for LTPar with uncertain (). Third, using the fine-trained LLGPar-+ for completion makes LGPar and LTPar achieve nearly the same accuracies as LLGPar, which may be because LLGPar provides complementary effects during completion, analogous to the scenario of co-training.
[Table 5: UAS on dev data of the three parsers trained on train- with FA and train- with completed FA. Column groups: completed by LLGPar/LGPar/LTPar-, and completed by LLGPar-+; each group covers PA (random), PA (uncertain), and PA (divergence).]
[Table 6: UAS on test data. Column groups: directly trained on train- with PA, and trained on train- with FA completed by LLGPar-+; each group covers PA (random), PA (uncertain), and PA (divergence).]
4.6 Results on test data: directly-train vs. complete-then-train
Table 6 reports UAS on test data of parsers directly trained on train- with FA and train- with PA, and of those trained on train- with FA and train- with FA completed by the fine-trained LLGPar-+. The results are consistent with those on dev data in Tables 3 and 5. Comparing the two settings, we can draw three interesting findings. First, LLGPar performs slightly better with the directly-train method. Second, LGPar performs slightly better with the complete-then-train method in most cases, except for uncertain (). Third, LTPar performs much better with the complete-then-train method.
5 Failed attempts to enhance LTPar
All experimental results in the previous section suggest that LTPar is ineffective in learning from PA, while Table 4 indicates that constrained decoding itself works well for LTPar. In contrast, LGPar is also based on constrained decoding and works much better than LTPar. The most important difference is that in lines 4-7 of Algorithm 1, LGPar uses a dynamic programming based exact search algorithm to find the highest-scoring tree according to the current model, whereas LTPar uses an approximate beam search algorithm. The approximate search procedure may cause the optimal tree to drop off the beam too soon, so that the returned reference may bias the model update toward certain wrong structures, which cannot be corrected later due to the lack of sufficient supervision in the scenario of PA.
We have tried three strategies to enhance LTPar so far, though little progress has been made. First, we set the beam size to , and hope LTPar may learn better from PA with a larger beam. Second, as suggested by an earlier anonymous reviewer, we use -best and instead of the 1-best outputs for feature weight update. We try to use the averaged feature vector of -best and/or -best in line 9. Third, we also try a conservative update strategy. The idea is that first we obtain (corresponding to a tree ) in line 5. Then, for each dependency in that is compatible with those in the partial tree , we temporarily insert it into . We use the enlarged in line 7. In this way, the returned is more similar to so that less risk is taken during model update. So far, the results for all three strategies are negative. However, we will keep looking into this issue in the future. We will give more detailed descriptions and results upon publication with extra space.
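The enlargement step of the conservative update strategy might look like the sketch below. The text does not spell out what "compatible" means, so the sketch assumes one plausible reading: a predicted dependency is inserted only if its modifier has no annotated head and adding it creates no cycle. Projectivity is not checked, and the names `creates_cycle` and `enlarge_partial_tree` are hypothetical.

```python
def creates_cycle(heads_so_far, modifier, head):
    """True if adding head -> modifier would close a cycle.

    heads_so_far maps modifier -> head and is assumed to be acyclic.
    """
    node = head
    while node != 0:
        if node == modifier:
            return True
        if node not in heads_so_far:
            return False  # chain ends at a word whose head is still unknown
        node = heads_so_far[node]
    return False


def enlarge_partial_tree(partial, predicted_heads):
    """Temporarily add predicted dependencies that do not conflict with `partial`."""
    enlarged = dict(partial)
    for m, h in enumerate(predicted_heads, start=1):
        if m in enlarged:
            continue  # an annotated head always takes priority
        if not creates_cycle(enlarged, m, h):
            enlarged[m] = h
    return enlarged
```

Constrained decoding would then be run against the enlarged set, so the resulting reference differs from the model's own prediction only where the annotation forces a change.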
6 Related work
In the parsing community, most previous works adopt ad-hoc methods to learn from PA. Sassano and Kurohashi (2010), Jiang et al. (2010), and Flannery and Mori (2015) convert partially annotated instances into local dependency/non-dependency classification instances, which may suffer from the lack of non-local correlation between dependencies in a tree.
Mirroshandel and Nasr (2011) and Majidi and Crane (2013) adopt the complete-then-train method. They use parsers coarsely trained on existing data with FA for completion via constrained decoding. However, our experiments show that this leads to a dramatic decrease in parsing accuracy.
Nivre et al. (2014) present a constrained decoding procedure for arc-eager transition-based parsers. However, their work focuses on allowing their parser to effectively exploit external constraints during the evaluation phase. In this work, we directly employ their method and show that constrained decoding is effective for LTPar and thus not responsible for its ineffectiveness in learning from PA.
Directly learning from PA based on constrained decoding was previously proposed by Jiang et al. (2013) for Chinese word segmentation, which is treated as a character-level sequence labeling problem. In this work, we are the first to apply the idea to LGPar and LTPar for directly learning from PA.
Directly learning from PA based on a forest-based objective in LLGPar was first proposed by Li et al. (2014), inspired by the idea of ambiguous labeling. Similar ideas have been extensively explored recently in sequence labeling tasks [Liu et al., 2014; Yang and Vozila, 2014; Marcheggiani and Artières, 2014].
Hwa (1999) pioneers the idea of exploring PA for constituent grammar induction based on a variant of the Inside-Outside re-estimation algorithm [Pereira and Schabes, 1992]. Clark and Curran (2006) propose to train a Combinatory Categorial Grammar parser using partially labeled data only containing predicate-argument dependencies. Mielens et al. (2015) propose to impute missing dependencies based on Gibbs sampling in order to enable traditional parsers to learn from partial trees.
This paper investigates the problem of training dependency parsers on partially labeled data. Particularly, we focus on the realistic scenario where we have a small-scale training dataset with FA and a large-scale training dataset with PA. We experiment with three settings for simulating PA. We compare several directly-train and complete-then-train approaches with three mainstream parsers, i.e., LLGPar, LGPar, and LTPar. Finally, we draw the following important conclusions. (1) For the complete-then-train approach, using parsers coarsely trained on small-scale data with FA for completion leads to unsatisfactory results. (2) LLGPar is able to make full use of PA for training. In contrast, LGPar is slightly inferior and LTPar performs badly in learning from PA. (3) The complete-then-train approach can make LGPar and LTPar on par with LLGPar in terms of parsing accuracy if using LLGPar fine trained on all data with both FA and PA for completion.
In future work, we will further investigate the reason behind the ineffectiveness of LTPar in learning from PA, and try to propose effective strategies to solve the issue. Our next plan is to employ the dynamic programming-enhanced beam search that merges equivalent states, proposed by Huang and Sagae (2010), which allows the parser to explore a larger search space during decoding. Moreover, we also plan to consider more constraints beyond dependencies. For example, Nivre et al. (2014) propose a constrained decoding procedure which can also incorporate bracketing constraints, i.e., a certain span forming a single-root subtree, which would be interesting yet challenging for graph-based parsers due to the complexity of designing dynamic programming based algorithms.
The authors would like to thank the anonymous reviewers for the helpful comments.
- [Bohnet, 2010] Bernd Bohnet. 2010. Top accuracy and fast dependency parsing is not a contradiction. In Proceedings of COLING, pages 89–97.
- [Carreras, 2007] Xavier Carreras. 2007. Experiments with a higher-order projective dependency parser. In Proceedings of EMNLP/CoNLL, pages 141–150.
- [Clark and Curran, 2006] Stephen Clark and James Curran. 2006. Partial training for a lexicalized-grammar parser. In Proceedings of the Human Language Technology Conference of the NAACL, pages 144–151.
- [Collins, 2002] Michael Collins. 2002. Discriminative training methods for Hidden Markov Models: Theory and experiments with perceptron algorithms. In Proceedings of EMNLP 2002, pages 1–8.
- [Dredze et al., 2009] Mark Dredze, Partha Pratim Talukdar, and Koby Crammer. 2009. Sequence learning from data with multiple labels. In ECML/PKDD Workshop on Learning from Multi-Label Data.
- [Finkel et al., 2008] Jenny Rose Finkel, Alex Kleeman, and Christopher D. Manning. 2008. Efficient, feature-based, conditional random field parsing. In Proceedings of ACL, pages 959–967.
- [Flannery and Mori, 2015] Daniel Flannery and Shinsuke Mori. 2015. Combining active learning and partial annotation for domain adaptation of a Japanese dependency parser. In Proceedings of the 14th International Conference on Parsing Technologies, pages 11–19.
- [Ganchev et al., 2009] Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar. 2009. Dependency grammar induction via bitext projection constraints. In Proceedings of ACL-IJCNLP 2009, pages 369–377.
- [Huang and Sagae, 2010] Liang Huang and Kenji Sagae. 2010. Dynamic programming for linear-time incremental parsing. In Proceedings of ACL 2010, pages 1077–1086.
- [Hwa, 1999] Rebecca Hwa. 1999. Supervised grammar induction using training data with limited constituent information. In Proceedings of ACL, pages 73–79.
- [Jiang et al., 2010] Wenbin Jiang and Qun Liu. 2010. Dependency parsing and projection based on word-pair classification. In ACL, pages 897–904.
- [Jiang et al., 2013] Wenbin Jiang, Meng Sun, Yajuan Lü, Yating Yang, and Qun Liu. 2013. Discriminative learning with natural annotations: Word segmentation as a case study. In Proceedings of ACL, pages 761–769.
- [Koo and Collins, 2010] Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In ACL, pages 1–11.
- [Kübler et al., 2009] Sandra Kübler, Ryan McDonald, and Joakim Nivre. 2009. Dependency Parsing (Synthesis Lectures On Human Language Technologies). Morgan and Claypool Publishers.
- [Li et al., 2012] Shoushan Li, Guodong Zhou, and Chu-Ren Huang. 2012. Active learning for Chinese word segmentation. In Proceedings of COLING 2012: Posters, pages 683–692.
- [Li et al., 2014] Zhenghua Li, Min Zhang, and Wenliang Chen. 2014. Soft cross-lingual syntax projection for dependency parsing. In COLING, pages 783–793.
- [Li et al., 2016] Zhenghua Li, Min Zhang, Yue Zhang, Zhanyi Liu, Wenliang Chen, Hua Wu, and Haifeng Wang. 2016. Active learning for dependency parsing with partial annotation. In ACL.
- [Liu et al., 2014] Yijia Liu, Yue Zhang, Wanxiang Che, Ting Liu, and Fan Wu. 2014. Domain adaptation for CRF-based Chinese word segmentation using free annotations. In Proceedings of EMNLP, pages 864–874.
- [Ma and Zhao, 2015] Xuezhe Ma and Hai Zhao. 2015. Probabilistic models for high-order projective dependency parsing. In arXiv:1502.04174, pages 1–22.
- [Majidi and Crane, 2013] Saeed Majidi and Gregory Crane. 2013. Active learning for dependency parsing by a committee of parsers. In Proceedings of IWPT, pages 98–105.
- [Marcheggiani and Artières, 2014] Diego Marcheggiani and Thierry Artières. 2014. An experimental comparison of active learning strategies for partially labeled sequences. In Proceedings of EMNLP, pages 898–906.
- [Martins et al., 2013] Andre Martins, Miguel Almeida, and Noah A. Smith. 2013. Turning on the turbo: Fast third-order non-projective turbo parsers. In Proceedings of ACL, pages 617–622.
- [McDonald and Pereira, 2006] Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proceedings of EACL, pages 81–88.
- [McDonald et al., 2005] Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of ACL, pages 91–98.
- [Mielens et al., 2015] Jason Mielens, Liang Sun, and Jason Baldridge. 2015. Parse imputation for dependency annotations. In Proceedings of ACL-IJCNLP, pages 1385–1394.
- [Mirroshandel and Nasr, 2011] Seyed Abolghasem Mirroshandel and Alexis Nasr. 2011. Active learning for dependency parsing using partially annotated sentences. In Proceedings of the 12th International Conference on Parsing Technologies, pages 140–149.
- [Nivre et al., 2014] Joakim Nivre, Yoav Goldberg, and Ryan McDonald. 2014. Constrained arc-eager dependency parsing. In Computational Linguistics, volume 40, pages 249–258.
- [Nivre, 2003] Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. In Proceedings of IWPT, pages 149–160.
- [Pereira and Schabes, 1992] Fernando Pereira and Yves Schabes. 1992. Inside-outside reestimation from partially bracketed corpora. In Proceedings of the Workshop on Speech and Natural Language (HLT), pages 122–127.
- [Petrov and Klein, 2007] Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Proceedings of NAACL.
- [Riezler et al., 2002] Stefan Riezler, Tracy H. King, Ronald M. Kaplan, Richard Crouch, John T. III Maxwell, and Mark Johnson. 2002. Parsing the Wall Street Journal using a lexical-functional grammar and discriminative estimation techniques. In Proceedings of ACL, pages 271–278.
- [Sassano and Kurohashi, 2010] Manabu Sassano and Sadao Kurohashi. 2010. Using smaller constituents rather than sentences in active learning for Japanese dependency parsing. In Proceedings of ACL, pages 356–365.
- [Spreyer and Kuhn, 2009] Kathrin Spreyer and Jonas Kuhn. 2009. Data-driven dependency parsing of new languages using incomplete and noisy training data. In CoNLL, pages 12–20.
- [Täckström et al., 2013] Oscar Täckström, Ryan McDonald, and Joakim Nivre. 2013. Target language adaptation of discriminative transfer parsers. In Proceedings of NAACL, pages 1061–1071.
- [Yamada and Matsumoto, 2003] Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proceedings of IWPT, pages 195–206.
- [Yang and Vozila, 2014] Fan Yang and Paul Vozila. 2014. Semi-supervised Chinese word segmentation using partial-label learning with conditional random fields. In Proceedings of EMNLP, pages 90–98.
- [Zhang and Clark, 2011] Yue Zhang and Stephen Clark. 2011. Syntactic processing using the generalized perceptron and beam search. Computational Linguistics, 37(1):105–151.
- [Zhang and Nivre, 2011] Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proceedings of ACL, pages 188–193.