Low-Resource Syntactic Transfer with Unsupervised Source Reordering
We describe a cross-lingual transfer method for dependency parsing that takes into account the problem of word order differences between source and target languages. Our model relies only on the Bible, a considerably smaller parallel dataset than the parallel data commonly used in transfer methods. We use the concatenation of projected trees from the Bible corpus and the gold-standard treebanks in multiple source languages, along with cross-lingual word representations. We demonstrate that reordering the source treebanks before training on them for a target language improves the accuracy for languages outside the European language family. Our experiments on 68 treebanks (38 languages) in the Universal Dependencies corpus achieve a high accuracy for all languages. Among them, our experiments on 16 treebanks of 12 non-European languages achieve an average absolute UAS improvement over a state-of-the-art method.
There has recently been a great deal of interest in cross-lingual transfer of dependency parsers, for which a parser is trained for a target language of interest using treebanks in other languages. Cross-lingual transfer can eliminate the need for the expensive and time-consuming task of treebank annotation for low-resource languages. Approaches include annotation projection using parallel data sets Hwa et al. (2005); Ganchev et al. (2009), direct model transfer through learning of a delexicalized model from other treebanks Zeman and Resnik (2008); Täckström et al. (2013), treebank translation Tiedemann et al. (2014), using synthetic treebanks Tiedemann and Agić (2016); Wang and Eisner (2016), using cross-lingual word representations Täckström et al. (2012); Guo et al. (2016); Rasooli and Collins (2017) and using cross-lingual dictionaries Durrett et al. (2012).
Recent results from Rasooli and Collins (2017) have shown accuracies exceeding 80% on unlabeled attachment score (UAS) for several European languages. (Specifically, Table 9 of Rasooli and Collins (2017) shows 13 datasets, and 11 languages, with UAS scores of over 80%; all of these datasets are in European languages.) However, non-European languages remain a significant challenge for cross-lingual transfer. One hypothesis, which we investigate in this paper, is that word-order differences between languages are a significant challenge for cross-lingual transfer methods. The main goal of our work is therefore to reorder gold-standard source treebanks to make those treebanks syntactically more similar to the target language of interest. We use two different approaches for source treebank reordering: 1) reordering based on dominant dependency directions according to the projected dependencies, 2) learning a classifier on the alignment data. We show that an ensemble of these methods with the baseline method leads to higher performance for the majority of datasets in our experiments. We show particularly significant improvements for non-European languages. (Specifically, our method gives an improvement of at least 2.3% absolute UAS on 11 datasets in 9 languages: Coptic, Basque, Chinese, Vietnamese, Turkish, Persian, Arabic, Indonesian, and Hebrew, with an average improvement of over 4.5% UAS.)
The main contributions of this work are as follows:
We propose two different syntactic reordering methods based on the dependencies projected using translation alignments. The first model is based on the dominant dependency direction in the target language according to the projected dependencies. The second model learns a reordering classifier from the small set of aligned sentences in the Bible parallel data.
We run an extensive set of experiments on 68 treebanks for 38 languages. We show that by just using the Bible data, we are able to achieve significant improvements in non-European languages. Our ensemble method is able to maintain a high accuracy in European languages.
We show that syntactic transfer methods can outperform a supervised model for cases in which the gold-standard treebank is very small. This indicates the strength of these models when the language is truly low-resource.
Unlike most previous work, in which a simple delexicalized model with gold part-of-speech tags is used, we use lexical features and automatic part-of-speech tags. Our final model improves over two strong baselines, one with annotation projection and the other one inspired by the non-neural state-of-the-art model of Rasooli and Collins (2017). Our final results improve the performance on non-European languages in terms of both average absolute UAS and average absolute LAS.
2 Related Work
There has recently been a great deal of research on dependency parser transfer. Early work on direct model transfer Zeman and Resnik (2008); McDonald et al. (2011); Cohen et al. (2011); Rosa and Zabokrtsky (2015); Wang and Eisner (2018a) considered learning a delexicalized parser from one or many source treebanks. A number of papers Naseem et al. (2012); Täckström et al. (2013); Zhang and Barzilay (2015); Ammar et al. (2016); Wang and Eisner (2017) have considered making use of topological features to overcome the problem of syntactic differences across languages. Our work instead reorders the source treebanks to make them similar to the target language before training on the source treebanks.
Agić (2017) uses part-of-speech sequence similarity between the source and target language for selecting the source sentences in a direct transfer approach. Ponti et al. (2018) preprocess source trees to increase the isomorphy between the source and the target language dependency trees. They apply their method to a simple delexicalized model, and their accuracy on the small set of languages that they have tried is significantly worse than ours in all languages. The recent work by Wang and Eisner (2018b) reorders delexicalized treebanks of part-of-speech sequences in order to make them more similar to the target language of interest. The latter work is similar to ours in its use of reordering. Our work goes further by using a full-fledged parsing model with automatic part-of-speech tags and every accessible dataset, such as projected trees and multiple source treebanks, as well as cross-lingual word embeddings for all languages.
Previous work Täckström et al. (2012); Duong et al. (2015); Guo et al. (2015, 2016); Ammar et al. (2016) has considered using cross-lingual word representations. A number of authors Durrett et al. (2012); Rasooli and Collins (2017) have used cross-lingual dictionaries. We also make use of cross-lingual word representations and dictionaries in this paper. We use the automatically extracted dictionaries from the Bible to translate words in the source treebanks to the target language. One other line of research in the delexicalized transfer approach is creating a synthetic treebank Tiedemann and Agić (2016); Wang and Eisner (2016, 2018b).
Annotation projection Hwa et al. (2005); Ganchev et al. (2009); McDonald et al. (2011); Ma and Xia (2014); Rasooli and Collins (2015); Lacroix et al. (2016); Agić et al. (2016) is another approach in parser transfer. In this approach, supervised dependencies are projected through word alignments and then used as training data. Similar to previous work Rasooli and Collins (2017), we make use of a combination of projected dependencies from annotation projection in addition to partially translated source treebanks. One other approach is treebank translation Tiedemann et al. (2014), in which a statistical machine translation system is used to translate source treebanks to the target language. These models need a large amount of parallel data to build an accurate translation system.
Using the Bible data goes back to the work of Diab and Finch (2000) and Yarowsky et al. (2001). Recently there has been more interest in using the Bible data for different tasks, due to its availability for many languages Christodouloupoulos and Steedman (2014); Agić et al. (2015, 2016); Rasooli and Collins (2017). Previous work Östling and Tiedemann (2017) has shown that the size of the Bible dataset does not provide a reliable machine translation model. Previous work in the context of machine translation Bisazza and Federico (2016); Daiber et al. (2016) presumes the availability of parallel data that is often much larger than the Bible data.
3 Baseline Model
Our model trains on the concatenation of projected dependencies and all of the source treebanks. The projected data is taken from the set of projected dependencies for which at least a threshold fraction of the words have projected dependencies, or for which there is a span of sufficient length such that all words in that span receive a projected dependency. This is the same as the definition of dense structures by Rasooli and Collins (2015).
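The dense-structure test described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `is_dense` and the parameters `min_fraction` and `min_span` are hypothetical, the threshold values are left to the caller, and reading "a span of length" as "a span of at least that length" is an assumption.

```python
from typing import List, Optional


def is_dense(heads: List[Optional[int]], min_fraction: float, min_span: int) -> bool:
    """Return True if a projected tree passes the (assumed) density test.

    heads[k] is the projected head index of word k, or None when the word
    received no projected dependency. Both thresholds are caller-supplied.
    """
    if not heads:
        return False
    covered = [h is not None for h in heads]
    # Criterion 1: a large enough fraction of the words has a projected head.
    if sum(covered) / len(heads) >= min_fraction:
        return True
    # Criterion 2: some contiguous run of fully covered words is long enough.
    run = 0
    for has_head in covered:
        run = run + 1 if has_head else 0
        if run >= min_span:
            return True
    return False
```

A sentence is kept for training when `is_dense` returns True; words without heads stay in the sentence and are simply unlabeled.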
We use our reimplementation of the state-of-the-art neural biaffine graph-based parser of Dozat and Manning (2016) (https://github.com/rasoolims/universal-parser). Because many words in the projected dependencies do not have a head assignment, the parser ignores words without heads during training. Inspired by Rasooli and Collins (2017), we replace every word in the source treebanks with its most frequent aligned translation word from the Bible data in the target language. If that word does not appear in the Bible, we use the original word. That way, we obtain code-switched data in which some of the words are translated. In addition to fine-tuning the word embeddings, we use fixed pre-trained cross-lingual word embeddings, trained with the approach of Rasooli and Collins (2017) on the Wikipedia data and the Bible dictionaries.
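The word replacement step can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names `build_dictionary` and `code_switch` are hypothetical, the input is assumed to be a flat list of aligned (source word, target word) pairs, and the example words are made up for illustration.

```python
from collections import Counter, defaultdict


def build_dictionary(aligned_pairs):
    """Map each source word to its most frequent aligned target word."""
    counts = defaultdict(Counter)
    for src_word, tgt_word in aligned_pairs:
        counts[src_word][tgt_word] += 1
    return {src: c.most_common(1)[0][0] for src, c in counts.items()}


def code_switch(sentence, dictionary):
    """Translate the words found in the dictionary; keep the others unchanged."""
    return [dictionary.get(word, word) for word in sentence]


pairs = [("house", "khaneh"), ("house", "khaneh"),
         ("house", "manzel"), ("big", "bozorg")]
dictionary = build_dictionary(pairs)
switched = code_switch(["the", "big", "house"], dictionary)
```

Here "the" has no aligned translation and is kept as-is, which is what produces the code-switched training data.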
Before making use of the source treebanks in the training data, we reorder each tree in the source treebanks to be syntactically more similar to the word order of the target language. In general, for a head that has several modifiers, we decide whether to put each of the dependents on the left or on the right of the head. After placing them on the correct side of the head, the order in the original source sentence is preserved. Figure 1 shows a real example of an English tree that is reordered for the sake of Persian as the target language. Here we see that we have a verb-final sentence, with nominal modifiers following the head noun. If one aims to translate this English sentence word by word, the reordered sentence gives a very good translation without any change in the sentence.
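The placement step can be sketched as a recursive linearization of the tree. This is a minimal sketch, not the released implementation: the function `reorder` and the `go_left` flags are hypothetical names, the left/right decisions are assumed to come from one of the two reordering models, and each subtree is assumed to move as one contiguous block.

```python
def reorder(heads, go_left):
    """Linearize a dependency tree after left/right decisions.

    heads[m] is the head index of word m (-1 for the root).
    go_left[m] is True when word m should precede its head.
    Returns the original word indices in their new order.
    """
    children = [[] for _ in heads]
    roots = []
    for m, h in enumerate(heads):
        # Appending in increasing m keeps the original order on each side.
        (roots if h < 0 else children[h]).append(m)

    def linearize(h):
        out = []
        for m in children[h]:
            if go_left[m]:
                out.extend(linearize(m))
        out.append(h)
        for m in children[h]:
            if not go_left[m]:
                out.extend(linearize(m))
        return out

    order = []
    for r in roots:
        order.extend(linearize(r))
    return order


# "I saw the dog": heads are saw <- I, ROOT <- saw, dog <- the, saw <- dog.
verb_final = reorder([1, -1, 3, 1], [True, False, True, True])
```

With the object moved to the left of the verb, `verb_final` is `[0, 2, 3, 1]`, which reads "I the dog saw".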
As mentioned earlier, we use two different approaches for source treebank reordering: 1) reordering based on dominant dependency directions according to the projected dependencies, 2) learning a classifier on the alignment data. We next describe these two methods.
4.1 Model 1: Reordering Based on Dominant Dependency Direction
The main goal of this model is to reorder source dependencies based on dominant dependency directions in the target language. We extract dominant dependency directions according to the projected dependencies from the alignment data, and use the information for reordering source treebanks.
Let the tuple $\langle i, m, h, r \rangle$ show the dependency of the $m$'th word in the $i$'th projected sentence for which the $h$'th word is the parent with the dependency label $r$. $\langle i, m, \text{NULL}, \text{NULL} \rangle$ shows an unknown dependency for the $m$'th word: this occurs when some of the words in the target sentence do not receive a projected dependency. We use the notations $h(i,m)$ and $r(i,m)$ to show the head index and dependency label of the $m$'th word in the $i$'th sentence.
Dependency direction: $d(i,m)$ shows the dependency direction of the $m$'th modifier word in the $i$'th sentence:
$$d(i,m) = \begin{cases} 1 & \text{if } h(i,m) > m \\ -1 & \text{if } h(i,m) < m \\ 0 & \text{if } h(i,m) = \text{NULL} \end{cases}$$
Dependency direction proportion: The dependency direction proportion of each dependency label $l$ with direction $d \in \{-1, 1\}$ is defined as:
$$\alpha^{(P)}(l, d) = \frac{\sum_{i,m} \mathbb{I}\big(r(i,m) = l \wedge d(i,m) = d\big)}{\sum_{i,m} \mathbb{I}\big(r(i,m) = l\big)}$$
Dominant dependency direction: For each dependency label $l$, we define the dominant dependency direction $\lambda^{(P)}(l) = d$ if $\alpha^{(P)}(l, d)$ exceeds a fixed threshold. In cases where there is no dominant dependency direction, $\lambda^{(P)}(l) = 0$.
We consider the following dependency labels for extracting dominant dependency direction information: nsubj, obj, iobj, csubj, ccomp, xcomp, obl, vocative, expl, dislocated, advcl, advmod, aux, cop, nmod, appos, nummod, acl, amod. We find that the direction of other dependency relations, such as most of the function word dependencies and other non-core dependencies such as conjunction, does not follow a fixed pattern in the Universal Dependencies corpus.
Given a set of projections $P$, we calculate the dominant dependency direction information for the projections, $\lambda^{(P)}$. Similar to the projected dependencies, we extract supervised dominant dependency directions from the gold-standard source treebank $D$: $\lambda^{(D)}$. When we encounter a gold-standard dependency relation $\langle i, m, h, r \rangle$ in a source treebank $D$, we change the direction if the following condition holds:
$$\lambda^{(P)}(r) \neq \lambda^{(D)}(r) \quad \text{and} \quad \lambda^{(P)}(r) = -d(i,m)$$
In other words, if the source and target languages do not have the same dominant dependency direction for $r$ and the dominant direction of the target language is the reverse of the current direction, we change the direction of that dependency. Reordering multiple dependencies in a gold-standard tree then results in a reordering of the full tree, as for example in the transformation from Figure 1(a) to Figure 1(b).
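The rule-based model can be sketched as follows. This is a minimal illustration under stated assumptions, not the released code: the function names are hypothetical, the sign convention (+1 when the head follows the modifier) is chosen for the sketch, the restriction to the label list above is omitted for brevity, and the threshold is a caller-supplied value assumed to be at least 0.5.

```python
from collections import Counter


def direction(m, head):
    """+1 if the head follows word m, -1 if it precedes it, 0 if unknown."""
    if head is None:
        return 0
    return 1 if head > m else -1


def dominant_directions(trees, threshold):
    """trees: sentences given as lists of (head, label), one pair per word.

    head is None for an unknown dependency and -1 for the root word.
    Returns {label: +1, -1, or 0 when no direction is dominant}.
    """
    total, by_dir = Counter(), Counter()
    for tree in trees:
        for m, (head, label) in enumerate(tree):
            if head is None or head < 0:
                continue  # skip unknown heads and the root attachment
            total[label] += 1
            by_dir[(label, direction(m, head))] += 1
    dominant = {}
    for label, n in total.items():
        dominant[label] = 0
        for d in (1, -1):
            if by_dir[(label, d)] / n > threshold:
                dominant[label] = d
    return dominant


def should_flip(m, head, label, target_dom, source_dom):
    """Flip when the languages disagree and the target prefers the reverse."""
    t = target_dom.get(label, 0)
    return t != source_dom.get(label, 0) and t == -direction(m, head)


# Toy projected (target) trees with preverbal objects, and one source tree.
target = dominant_directions(
    [[(2, "nsubj"), (2, "obj"), (-1, "root")],
     [(None, None), (2, "obj"), (-1, "root")]], 0.75)
source = dominant_directions(
    [[(1, "nsubj"), (-1, "root"), (1, "obj")]], 0.75)
```

In this toy setting the source object at position 2 is flipped, while the subject is left in place because both languages agree on its direction.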
4.2 Model 2: Reordering Classifier
We now describe our approach for learning a reordering classifier for a target language using the alignment data. Unlike the first model for which we learn concrete rules, this model learns a reordering classifier from automatically aligned data. This model has two steps; the first step prepares the training data from the automatically aligned parallel data, and the second step learns a classifier from the training data.
4.2.1 Preparing Training Data from Alignments
The goal of this step is to create training data for the reordering classifier. This data is extracted from the concatenation of parallel data from all source languages translated to the target language. Given a parallel dataset that contains pairs of source and target sentences, the following steps are applied to create training data:
Extracting reordering mappings from alignments: We first extract intersected word alignments for each source-target sentence pair. This is done by running Giza++ Och and Ney (2003) in both directions and intersecting the resulting alignments. We ignore sentence pairs in which more than half of the source words do not receive an alignment. We then create a mapping that maps each index in the original source sentence to a unique index in the reordered sentence.
Parsing source sentences: We parse each source sentence using the supervised parser of the source language. We use the mapping to obtain a reordered tree for each sentence. In cases for which the number of non-projective arcs in the projected tree increases compared to the original tree, we do not use the sentence in the final training data.
Extracting classifier instances: We create a training instance for every modifier word. The new direction of each dependency is decided by comparing the mapped index of the modifier with the mapped index of its head: the modifier is placed before its head if its mapped index is smaller, and after its head otherwise. In other words, we decide the new order of a dependency according to the mapping.
Figure 2 shows an example for the data preparation step. As shown in the figure, the new directions for the English words are decided according to the Persian alignments.
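The instance extraction step can be sketched as follows. This is a minimal sketch with hypothetical names (`enough_aligned`, `direction_labels`); it assumes the mapping from source positions to reordered positions has already been built, and it uses +1 for "head follows the modifier" as an illustrative label convention.

```python
def enough_aligned(alignment, n_source):
    """Keep a sentence pair only if at least half of its source words align.

    alignment maps source word indices to target word indices.
    """
    return 2 * len(alignment) >= n_source


def direction_labels(heads, mapping):
    """Derive one training label per modifier from the reordering mapping.

    heads[m] is the head index of word m (-1 for the root).
    mapping[m] is the position of word m in the reordered sentence.
    Returns {m: +1 if the head follows m in the new order, else -1}.
    """
    labels = {}
    for m, h in enumerate(heads):
        if h < 0:
            continue  # the root has no head to be placed against
        labels[m] = 1 if mapping[h] > mapping[m] else -1
    return labels


# "I saw the dog" mapped to a verb-final order: I, the, dog, saw.
labels = direction_labels([1, -1, 3, 1], [0, 3, 1, 2])
```

The object "dog" originally follows its head, but its label becomes +1 because the verb lands after it in the reordered sentence.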
The reordering classifier decides the new direction of each dependency from the recurrent representations of the head and dependent words. For a source sentence that belongs to a given source language, we first obtain its recurrent representation by running a deep (3 layers) bi-directional LSTM Hochreiter and Schmidhuber (1997) over the sentence. For every dependency tuple, we use a multi-layer Perceptron (MLP) to decide the new order of the modifier word with respect to its head. The MLP applies an output layer on top of a hidden layer that uses the rectified linear unit (relu) activation Nair and Hinton (2010). The input to the hidden layer is the concatenation of the following vectors: the recurrent representations of the modifier and head words; the embedding of the dependency relation, taken from a dependency relation embedding dictionary that embeds every dependency relation to a vector; the direction embedding for the original position of the modifier with respect to its head, which embeds each direction to a 2-dimensional vector; and the embedding of the source language id, taken from a language embedding dictionary.
The input to the recurrent layer is the concatenation of two input vectors. The first vector is the sum of the fixed pre-trained cross-lingual embeddings, and randomly initialized word vector. The second vector is the part-of-speech tag embeddings.
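The scoring step of the classifier can be sketched in plain Python as follows. This is an illustration only: the recurrent representations are taken as given vectors, the names `H`, `b`, and `W` for the hidden and output parameters are hypothetical, and the two-way softmax output is an assumption that matches a log-likelihood objective.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, a) for a in vector]


def softmax(vector):
    top = max(vector)
    exps = [math.exp(a - top) for a in vector]
    total = sum(exps)
    return [e / total for e in exps]


def reorder_probs(mod_vec, head_vec, rel_emb, dir_emb, lang_emb, H, b, W):
    """Return assumed probabilities for the two possible new directions."""
    # Concatenate modifier, head, relation, direction, and language vectors.
    q = mod_vec + head_vec + rel_emb + dir_emb + lang_emb
    hidden = relu([a + c for a, c in zip(matvec(H, q), b)])
    return softmax(matvec(W, hidden))


# Toy call with all-zero parameters: both directions are equally likely.
probs = reorder_probs([0.1, 0.2], [0.3, 0.4], [1.0], [0.0, 1.0], [0.5],
                      [[0.0] * 8 for _ in range(3)], [0.0] * 3,
                      [[0.0] * 3 for _ in range(2)])
```

In a trained model the parameters would be learned jointly with the BiLSTM; the zero-valued parameters here only demonstrate the shapes.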
Figure 3 shows a graphical depiction of the two reordering models that we use in this work.
Datasets and Tools
We use 68 datasets from 38 languages in the Universal Dependencies corpus version 2.0 Nivre et al. (2017). The languages are Arabic (ar), Bulgarian (bg), Coptic (cop), Czech (cs), Danish (da), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Basque (eu), Persian (fa), Finnish (fi), French (fr), Hebrew (he), Hindi (hi), Croatian (hr), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Latin (la), Lithuanian (lt), Latvian (lv), Dutch (nl), Norwegian (no), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Slovak (sk), Slovene (sl), Swedish (sv), Turkish (tr), Ukrainian (uk), Vietnamese (vi), and Chinese (zh).
We use the Bible data from Christodouloupoulos and Steedman (2014) for the 38 languages. We extract word alignments using the default Giza++ model Och and Ney (2003). Following Rasooli and Collins (2015), we obtain intersected alignments and apply soft POS consistency to filter potentially incorrect alignments. We use the Wikipedia dump data to extract monolingual data for the languages in order to train monolingual embeddings. We follow the method of Rasooli and Collins (2017) to use the extracted dictionaries from the Bible and monolingual text from Wikipedia to create cross-lingual word embeddings. We use the UDPipe pretrained models Straka and Straková (2017) to tokenize Wikipedia, and a reimplementation of the Perceptron tagger of Collins (2002) (https://github.com/rasoolims/SemiSupervisedPosTagger) to obtain automatic POS tags, trained on the training data of the Universal Dependencies corpus Nivre et al. (2017). We use word2vec Mikolov et al. (2013) (https://github.com/dav/word2vec) to obtain embedding vectors in both monolingual and cross-lingual settings.
Supervised Parsing Models
We trained our supervised models on the union of all datasets in a language to obtain a supervised model for each language. It is worth noting that there are two major changes that we make to the neural parser of Dozat and Manning (2016) in our implementation (https://github.com/rasoolims/universal-parser) using the Dynet library Neubig et al. (2017): first, we add a one-layer character BiLSTM to represent the character information for each word. The final character representation is obtained by concatenating the forward representation of the last character and the backward representation of the first character. The concatenated vector is summed with the randomly initialized as well as fixed pre-trained cross-lingual word embedding vectors. Second, inspired by Weiss et al. (2015), we maintain the moving average parameters to obtain more robust parameters at decoding time.
We excluded the following languages from the set of source languages for annotation projection due to their low supervised accuracy: Estonian, Hungarian, Korean, Latin, Lithuanian, Latvian, Turkish, Ukrainian, Vietnamese, and Chinese.
Baseline Transfer Models
We use two baseline models: 1) Annotation projection: This model trains only on the projected dependencies. 2) Annotation projection + direct transfer: This model trains on the projected dependencies together with the source treebanks. To speed up training, we cap the number of sentences sampled from each treebank, yielding training data of about 37K sentences.
5.1 Reordering Ensemble Model
We noticed that our reordering models perform better in non-European languages, and perform slightly worse in European languages. We use the following ensemble model to make use of all of the three models (annotation projection + direct transfer, and the two reordering models), to make sure that we always obtain an accurate parser.
The ensemble model is as follows: given three output trees for each sentence in the target language, where the first tree belongs to the baseline model and the second and third belong to the two reordering models, we weight each dependency edge using two factors. The first factor is a coefficient that puts more weight on the first or the other two outputs depending on the target language family. The second factor is a simple weighting that depends on the dominant order information.
The above coefficients are modestly tuned on Persian as our development language. We have not seen any significant change when modifying the numbers. Instead, two design choices suffice for the ensemble model: an arc that follows a dominant dependency direction is regarded as a more valuable arc, and the baseline has more effect in the European languages.
We run the Eisner first-order graph-based algorithm Eisner (1996) on top of the edge weights to extract the best possible tree.
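The edge-weighting step that precedes decoding can be sketched as follows. This is a hypothetical illustration: the function name, the per-model coefficients, the `bonus` multiplier, and the way the two factors are multiplied and summed over models are all assumptions made for the sketch, and the decoder itself is not shown.

```python
from collections import defaultdict


def ensemble_edge_weights(trees, model_coeffs, dominant, bonus):
    """Accumulate a weight for every (head, modifier) edge.

    trees: one output per model, each a list of (head, label) per word,
           with head = -1 for the root word.
    model_coeffs: one (assumed) coefficient per model.
    dominant: {label: +1 or -1 or 0}, with +1 meaning the head follows.
    bonus: (assumed) multiplier for arcs in the dominant direction.
    """
    weights = defaultdict(float)
    for tree, coeff in zip(trees, model_coeffs):
        for m, (head, label) in enumerate(tree):
            d = 1 if head > m else -1
            factor = bonus if dominant.get(label, 0) == d else 1.0
            weights[(head, m)] += coeff * factor
    return dict(weights)


baseline = [(1, "nsubj"), (-1, "root"), (1, "obj")]
reordered = [(2, "nsubj"), (2, "obj"), (-1, "root")]
weights = ensemble_edge_weights(
    [baseline, reordered, reordered], [1.0, 2.0, 2.0], {"obj": 1}, 2.0)
```

The resulting dictionary would then be passed as arc scores to a first-order decoder, which picks the highest-scoring well-formed tree.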
We run all of the transfer models with 4000 mini-batches, in which each mini-batch contains approximately 5000 tokens. We follow the same parameters as in Dozat and Manning (2016) and use a dimension of 100 for character embeddings. For the reordering classifier, we use the Adam algorithm Kingma and Ba (2014) with default parameters to optimize the log-likelihood objective. We filter the alignment data to keep only those sentences for which at least half of the source words have an alignment. We randomly hold out a portion of the reordering data for deciding when to stop training the reordering models. Table 1 shows the parameter values that we use in the reordering classifier.
|Parameter|Notation|Value|
|Dep. relation embedding| |50|
|Language ID embedding| |50|
|Number of BiLSTM layers|–|3|
|Mini-batch size (tokens)|–| |
Table 2 shows the results on the Universal Dependencies corpus Nivre et al. (2017). As shown in the table, the algorithm based on dominant dependency directions improves the accuracy on most of the non-European languages and performs slightly worse than the baseline model in the European languages. The ensemble model, in spite of its simplicity, improves over the baseline in most of the languages, leading to an average UAS improvement both over all languages and over the non-European languages. This improvement is very significant in many of the non-European languages, for example in LAS for Coptic and in UAS for Basque and Chinese. Our model also outperforms the supervised models in Ukrainian and Latvian. That is an interesting indicator that, in cases where the training data for a language is very small (37 sentences for Ukrainian, and 153 sentences for Latvian), our transfer approach outperforms the supervised model.
In this section, we briefly describe our analysis based on the results of the ensemble model and the baseline. For some languages such as Coptic, the number of dense projected dependencies is so small (two trees) that the parser gives a worse learned model than a random baseline. For some other languages, such as Norwegian and Spanish, this number is very high (more than twenty thousand trees), such that the baseline model performs very well.
The dominant dependency direction model generally performs better than the classifier. Our manual investigation shows that the classifier kept many of the dependency directions unchanged, while the dominant dependency direction model changed more directions. Therefore, the dominant direction model gives a higher recall at the expense of losing some precision. The training data for the reordering classifier is very noisy due to wrong alignments. We believe that the dominant direction model, besides its simplicity, is a more robust classifier for reordering, though the classifier is helpful in an ensemble setting.
Table 3 shows the differences in parsing f-score of dependency relations for adjectives, nouns and verbs as the head. As we see in the table, we are able to improve the head dependency relation for the three most important head POS tags in the dependency grammar. We see that this improvement is more consistent for all non-European languages. We skip the details of those analyses due to space limitations. A more thorough analysis can be found in (Rasooli, 2019, Chapter 6).
For a few languages such as Vietnamese, the best model, even though it improves over a strong baseline, still lacks enough accuracy to be considered a reliable parser in place of a supervised model. We believe that more research on those languages will address this problem. Our current model relies on supervised part-of-speech tags. Future work should study using transferred part-of-speech tags instead of supervised tags, leading to a much more realistic scenario for low-resource languages.
We have also calculated the POS trigram cosine similarity between the target language gold-standard treebanks and the three source training datasets (the original, and the two reordered datasets). In all of the non-European languages, the cosine similarity of the reordered datasets improved, by amounts that vary across languages. For Czech, Portuguese, German, Greek, English, Romanian, Russian, and Slovak, both of the reordered datasets slightly decreased the trigram cosine similarity. For other languages, the cosine similarity was roughly the same.
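The similarity measure can be sketched as follows. This is a minimal sketch with hypothetical helper names; it assumes raw trigram counts without sentence-boundary padding, which may differ from the exact setup used for the reported numbers.

```python
import math
from collections import Counter


def trigram_counts(tagged_sentences):
    """Count POS trigrams over a list of POS tag sequences."""
    counts = Counter()
    for tags in tagged_sentences:
        for k in range(len(tags) - 2):
            counts[tuple(tags[k:k + 3])] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[key] for key, v in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


target = trigram_counts([["PRON", "NOUN", "VERB"]])
original = trigram_counts([["PRON", "VERB", "NOUN"]])
reordered = trigram_counts([["PRON", "NOUN", "VERB"]])
gain = cosine(target, reordered) - cosine(target, original)
```

A positive `gain` indicates that reordering moved the source POS sequences closer to the target language, which is the trend described above for the non-European languages.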
We have described a cross-lingual dependency transfer method that takes into account the problem of word order differences between the source and target languages. We have shown that applying projection-driven reordering improves the accuracy of non-European languages while maintaining the high accuracies in European languages. The focus of this paper is primarily on dependency parsing. Future work should investigate the effect of our proposed reordering methods on truly low-resource machine translation.
We deeply thank the anonymous reviewers for their useful feedback and comments.
- Agić (2017) Željko Agić. 2017. Cross-lingual parser selection for low-resource languages. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), pages 1–10. Association for Computational Linguistics.
- Agić et al. (2015) Željko Agić, Dirk Hovy, and Anders Søgaard. 2015. If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 268–272.
- Agić et al. (2016) Željko Agić, Anders Johannsen, Barbara Plank, Héctor Alonso Martínez, Natalie Schluter, and Anders Søgaard. 2016. Multilingual projection for parsing truly low-resource languages. Transactions of the Association for Computational Linguistics, 4:301–312.
- Ammar et al. (2016) Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah Smith. 2016. Many languages, one parser. Transactions of the Association for Computational Linguistics, 4:431–444.
- Bisazza and Federico (2016) Arianna Bisazza and Marcello Federico. 2016. A survey of word reordering in statistical machine translation: Computational models and language phenomena. Computational linguistics, 42(2):163–205.
- Christodouloupoulos and Steedman (2014) Christos Christodouloupoulos and Mark Steedman. 2014. A massively parallel corpus: The Bible in 100 languages. Language Resources and Evaluation, pages 1–21.
- Cohen et al. (2011) Shay B. Cohen, Dipanjan Das, and Noah A. Smith. 2011. Unsupervised structure prediction with non-parallel multilingual guidance. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 50–61, Edinburgh, Scotland, UK. Association for Computational Linguistics.
- Collins (2002) Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 1–8. Association for Computational Linguistics.
- Daiber et al. (2016) Joachim Daiber, Miloš Stanojević, and Khalil Sima’an. 2016. Universal reordering via linguistic typology. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3167–3176.
- Diab and Finch (2000) Mona Diab and Steve Finch. 2000. A statistical word-level translation model for comparable corpora. In Content-Based Multimedia Information Access-Volume 2, pages 1500–1508. LE CENTRE DE HAUTES ETUDES INTERNATIONALES D’INFORMATIQUE DOCUMENTAIRE.
- Dozat and Manning (2016) Timothy Dozat and Christopher D Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.
- Duong et al. (2015) Long Duong, Trevor Cohn, Steven Bird, and Paul Cook. 2015. Cross-lingual transfer for unsupervised dependency parsing without parallel data. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 113–122, Beijing, China. Association for Computational Linguistics.
- Durrett et al. (2012) Greg Durrett, Adam Pauls, and Dan Klein. 2012. Syntactic transfer using a bilingual lexicon. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1–11, Jeju Island, Korea. Association for Computational Linguistics.
- Eisner (1996) Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the 16th conference on Computational linguistics-Volume 1, pages 340–345. Association for Computational Linguistics.
- Ganchev et al. (2009) Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar. 2009. Dependency grammar induction via bitext projection constraints. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 369–377, Suntec, Singapore. Association for Computational Linguistics.
- Guo et al. (2015) Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2015. Cross-lingual dependency parsing based on distributed representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1234–1244, Beijing, China. Association for Computational Linguistics.
- Guo et al. (2016) Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2016. A representation learning framework for multi-source transfer parsing. In The Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), Phoenix, Arizona, USA.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- Hwa et al. (2005) Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Bootstrapping parsers via syntactic projection across parallel texts. Natural language engineering, 11(03):311–325.
- Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
- Lacroix et al. (2016) Ophélie Lacroix, Lauriane Aufrant, Guillaume Wisniewski, and François Yvon. 2016. Frustratingly easy cross-lingual transfer for transition-based dependency parsing. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1058–1063, San Diego, California. Association for Computational Linguistics.
- Ma and Xia (2014) Xuezhe Ma and Fei Xia. 2014. Unsupervised dependency parsing with transferring distribution via parallel guidance and entropy regularization. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1337–1348, Baltimore, Maryland. Association for Computational Linguistics.
- McDonald et al. (2011) Ryan McDonald, Slav Petrov, and Keith Hall. 2011. Multi-source transfer of delexicalized dependency parsers. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 62–72, Edinburgh, Scotland, UK. Association for Computational Linguistics.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Nair and Hinton (2010) Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814.
- Naseem et al. (2012) Tahira Naseem, Regina Barzilay, and Amir Globerson. 2012. Selective sharing for multilingual dependency parsing. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 629–637. Association for Computational Linguistics.
- Neubig et al. (2017) Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, et al. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.
- Nivre et al. (2017) Joakim Nivre, Željko Agić, Lars Ahrenberg, Maria Jesus Aranzabe, Masayuki Asahara, et al. 2017. Universal Dependencies 2. LINDAT/CLARIN digital library at Institute of Formal and Applied Linguistics, Charles University in Prague.
- Och and Ney (2003) Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
- Östling and Tiedemann (2017) Robert Östling and Jörg Tiedemann. 2017. Neural machine translation for low-resource languages. arXiv preprint arXiv:1708.05729.
- Ponti et al. (2018) Edoardo Maria Ponti, Roi Reichart, Anna Korhonen, and Ivan Vulić. 2018. Isomorphic transfer of syntactic structures in cross-lingual NLP. In Proceedings of ACL, Melbourne, Australia. Association for Computational Linguistics.
- Rasooli (2019) Mohammad Sadegh Rasooli. 2019. Cross-Lingual Transfer of Natural Language Processing Systems. Ph.D. thesis, Columbia University.
- Rasooli and Collins (2015) Mohammad Sadegh Rasooli and Michael Collins. 2015. Density-driven cross-lingual transfer of dependency parsers. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 328–338, Lisbon, Portugal. Association for Computational Linguistics.
- Rasooli and Collins (2017) Mohammad Sadegh Rasooli and Michael Collins. 2017. Cross-lingual syntactic transfer with limited resources. Transactions of the Association for Computational Linguistics, 5(1):279–293.
- Rosa and Zabokrtsky (2015) Rudolf Rosa and Zdenek Zabokrtsky. 2015. KLcpos3 - a language similarity measure for delexicalized parser transfer. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 243–249, Beijing, China. Association for Computational Linguistics.
- Straka and Straková (2017) Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada. Association for Computational Linguistics.
- Täckström et al. (2013) Oscar Täckström, Ryan McDonald, and Joakim Nivre. 2013. Target language adaptation of discriminative transfer parsers. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1061–1071, Atlanta, Georgia. Association for Computational Linguistics.
- Täckström et al. (2012) Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 477–487. Association for Computational Linguistics.
- Tiedemann et al. (2014) Jörg Tiedemann, Željko Agić, and Joakim Nivre. 2014. Treebank translation for cross-lingual parser induction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 130–140, Ann Arbor, Michigan. Association for Computational Linguistics.
- Tiedemann and Agić (2016) Jörg Tiedemann and Željko Agić. 2016. Synthetic treebanking for cross-lingual dependency parsing. Journal of Artificial Intelligence Research, 55:209–248.
- Wang and Eisner (2016) Dingquan Wang and Jason Eisner. 2016. The Galactic Dependencies treebanks: Getting more data by synthesizing new languages. Transactions of the Association for Computational Linguistics, 4:491–505.
- Wang and Eisner (2017) Dingquan Wang and Jason Eisner. 2017. Fine-grained prediction of syntactic typology: Discovering latent structure with supervised learning. Transactions of the Association for Computational Linguistics, 5:147–161.
- Wang and Eisner (2018a) Dingquan Wang and Jason Eisner. 2018a. Surface statistics of an unknown language indicate how to parse it. Transactions of the Association for Computational Linguistics, 6:667–685.
- Wang and Eisner (2018b) Dingquan Wang and Jason Eisner. 2018b. Synthetic data made to order: The case of parsing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Weiss et al. (2015) David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 323–333, Beijing, China. Association for Computational Linguistics.
- Yarowsky et al. (2001) David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the First International Conference on Human Language Technology Research, HLT ’01, pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.
- Zeman and Resnik (2008) Daniel Zeman and Philip Resnik. 2008. Cross-language parser adaptation between related languages. In Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, pages 35–42.
- Zhang and Barzilay (2015) Yuan Zhang and Regina Barzilay. 2015. Hierarchical low-rank tensors for multilingual transfer parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1857–1867, Lisbon, Portugal. Association for Computational Linguistics.