Data Augmentation via Dependency Tree Morphing for Low-Resource Languages

Neural NLP systems achieve high scores in the presence of sizable training datasets. The lack of such datasets leads to poor system performance in the case of low-resource languages. We present two simple text augmentation techniques using dependency trees, inspired by image processing. We “crop” sentences by removing dependency links, and we “rotate” sentences by moving tree fragments around the root. We apply these techniques to augment the training sets of low-resource languages in the Universal Dependencies project. We implement a character-level sequence tagging model and evaluate the augmented datasets on the part-of-speech tagging task. We show that crop and rotate provide improvements over models trained with non-augmented data for the majority of the languages, especially for languages with rich case-marking systems.


1 Introduction

Recently, various deep learning methods have been proposed for many natural language understanding tasks, including sentiment analysis, question answering, dependency parsing and semantic role labeling. Although these methods have reported state-of-the-art results for resource-rich languages, no significant improvement has been announced for low-resource languages. In other words, feature-engineered statistical models still perform better than these neural models for low-resource languages.1 The generally accepted reason for the low scores is the size of the training data, i.e., the training labels are too sparse to extract meaningful statistics.

Label-preserving data augmentation techniques are known to help models generalize better by increasing the variance of the training data. It has been common practice among researchers in the computer vision field to apply data augmentation, e.g., flipping, cropping, scaling and rotating images, for tasks like image classification CiresanMS12; KrizhevskySH12. Similarly, speech recognition systems have made use of augmentation techniques like changing the tone and speed of the audio KoPPK15; RagniKRG14, noise addition hartmann2016two and synthetic audio generation TakahashiGPG16. Comparable data augmentation techniques are less obvious for NLP tasks, due to structural differences among languages. Only a small number of studies tackle data augmentation for NLP, such as ZhangL15 for text classification and FadaeeBM17a for machine translation.

In this work, we focus on languages with small training datasets that are made available by the Universal Dependencies (UD) project. These languages are dominantly from the Uralic, Turkic, Slavic and Baltic language families, which are known to have extensive morphological case-marking systems and relatively free word order. With these languages in mind, we propose an easily adaptable, multilingual text augmentation technique based on dependency trees, inspired by two common augmentation methods from image processing: cropping and rotation. As images are cropped to focus on a particular item, we crop sentences to form smaller, meaningful and focused sentences. As images are rotated around a center, we rotate the portable tree fragments around the root of the dependency tree to form synthetic sentences. We augment the training sets of these low-resource languages via crop and rotate operations. In order to measure the impact of augmentation, we implement a unified character-level sequence tagging model. We systematically train separate part-of-speech tagging models on the original and augmented training sets, and evaluate them on the original test sets. We show that crop and rotate provide improvements over the non-augmented data for the majority of the languages, especially for languages with rich case-marking systems.

2 Method

We borrow two fundamental label-preserving augmentation ideas from image processing: cropping and rotation. Image cropping can be defined as the removal of some of the peripheral areas of an image to focus on the subject/object (e.g., focusing on the flower in a large green field). Following this basic idea, we aim to identify the parts of the sentence that we want to focus on and remove the other chunks, i.e., form simpler/smaller meaningful sentences.2 To do so, we take advantage of dependency trees, which provide us with links to the foci, such as subjects and objects. The idea is demonstrated in Fig. LABEL:fig:crop on the Turkish sentence given in Fig. LABEL:fig:dt. Here, given a predicate (wrote) that governs a subject (her father), an indirect object (to her) and a direct object (a letter), we form three smaller sentences with a focus on the subject (first row in Fig. LABEL:fig:crop: her father wrote) and the objects (second and third rows) by removing all dependency links other than the focus (with its subtree). Obviously, cropping may cause semantic shifts at the sentence level. However, it preserves local syntactic tags and even shallow semantic labels.
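The cropping operation described above can be sketched over a toy dependency tree. This is a minimal illustration, not the authors' implementation: the `(id, form, head, deprel)` token encoding, the UD-style focus labels `nsubj`/`obj`/`iobj`, and the English gloss of the example sentence are all assumptions made for demonstration.

```python
# Sketch of dependency-tree cropping: keep the root predicate plus one
# focused argument subtree, drop everything else.
# A token is (id, form, head, deprel); head == 0 marks the root.

FOCUS_LABELS = {"nsubj", "obj", "iobj"}  # assumed focus relations

def subtree(tokens, root_id):
    """Collect root_id and all of its descendants."""
    ids = {root_id}
    changed = True
    while changed:
        changed = False
        for tid, _, head, _ in tokens:
            if head in ids and tid not in ids:
                ids.add(tid)
                changed = True
    return ids

def crop(tokens):
    """Yield smaller sentences: the root plus one focused argument subtree."""
    root = next(t for t in tokens if t[2] == 0)
    for tid, form, head, rel in tokens:
        if head == root[0] and rel in FOCUS_LABELS:
            keep = subtree(tokens, tid) | {root[0]}
            yield [t for t in tokens if t[0] in keep]

# English gloss of the running example "her father wrote her a letter":
sent = [
    (1, "her",    2, "nmod"),
    (2, "father", 3, "nsubj"),
    (3, "wrote",  0, "root"),
    (4, "a",      5, "det"),
    (5, "letter", 3, "obj"),
    (6, "her",    3, "iobj"),
]
for s in crop(sent):
    print(" ".join(form for _, form, _, _ in s))
```

Each cropped sentence keeps the original surface order of its surviving tokens, so local tag sequences are preserved even though the sentence-level meaning may shift.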

Images are rotated around a chosen center by a certain degree to enhance the training data. Similarly, we choose the root as the center of the sentence and rotate the flexible tree fragments around the root for augmentation. Flexible fragments are usually determined by the morphological typology of the language wordOrder. For instance, languages close to the analytic type, such as English, rarely have inflectional morphemes. They do not mark objects/subjects, and therefore words have to follow a strict order. For such languages, sentence rotation would mostly introduce noise. On the other hand, a large number of languages, such as Latin, Greek, Persian, Romanian, Assyrian, Turkish, Finnish and Basque, have no strict word order (though there is a preferred order) due to their extensive marking systems. Hence, the flexible parts are defined as marked fragments, which are, again, subjects and objects. Rotation is illustrated in Fig. LABEL:fig:flip on the same sentence.
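The rotation operation can likewise be sketched: collect the root's flexible (case-marked) argument subtrees and permute them around a fixed root. As before, this is an illustrative sketch under assumed conventions, not the authors' code; the verb-final toy sentence below is a hypothetical reconstruction of the running Turkish example, and the choice to place all permuted fragments before the root is one simple scheme for a verb-final language.

```python
import itertools

FLEXIBLE = {"nsubj", "obj", "iobj"}  # assumed flexible (marked) relations

def subtree(tokens, root_id):
    """Collect root_id and all of its descendants."""
    ids = {root_id}
    changed = True
    while changed:
        changed = False
        for tid, _, head, _ in tokens:
            if head in ids and tid not in ids:
                ids.add(tid)
                changed = True
    return ids

def rotate(tokens):
    """Yield sentences with the flexible fragments permuted before the root."""
    root = next(t for t in tokens if t[2] == 0)
    flex = [sorted(subtree(tokens, tid))      # each fragment keeps its order
            for tid, _, head, rel in tokens
            if head == root[0] and rel in FLEXIBLE]
    in_flex = {i for frag in flex for i in frag}
    fixed = [t[0] for t in tokens if t[0] not in in_flex]  # root + rest
    by_id = {t[0]: t for t in tokens}
    for perm in itertools.permutations(flex):
        order = [i for frag in perm for i in frag] + fixed
        yield [by_id[i] for i in order]

# Hypothetical verb-final (Turkish-like) rendering of the running example
# "her father wrote her a letter" (tokens are illustrative):
sent = [
    (1, "babası", 5, "nsubj"),  # her father
    (2, "ona",    5, "iobj"),   # to her
    (3, "bir",    4, "det"),    # a
    (4, "mektup", 5, "obj"),    # letter
    (5, "yazdı",  0, "root"),   # wrote
]
for s in rotate(sent):
    print(" ".join(form for _, form, _, _ in s))
```

With three flexible fragments this yields 3! = 6 orderings (including the original), each internally intact: fragments move as whole subtrees, so their case marking still identifies their grammatical role in every permutation.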


  1. For example, in the case of dependency parsing, the recent best results from the CoNLL-18 parsing shared task can be compared to the results of traditional language-specific models.
  2. Focus should not be confused with the grammatical category FOC.