Data Augmentation via Dependency Tree Morphing for Low-Resource Languages
Neural NLP systems achieve high scores in the presence of sizable training datasets. The lack of such datasets leads to poor system performance in the case of low-resource languages. We present two simple text augmentation techniques using dependency trees, inspired by image processing. We “crop” sentences by removing dependency links, and we “rotate” sentences by moving tree fragments around the root. We apply these techniques to augment the training sets of low-resource languages in the Universal Dependencies project. We implement a character-level sequence tagging model and evaluate the augmented datasets on the part-of-speech tagging task. We show that crop and rotate provide improvements over models trained with non-augmented data for the majority of the languages, especially for languages with rich case-marking systems.
Recently, various deep learning methods have been proposed for many natural language understanding tasks, including sentiment analysis, question answering, dependency parsing and semantic role labeling. Although these methods have reported state-of-the-art results for languages with rich resources, no significant improvement has been announced for low-resource languages. In other words, feature-engineered statistical models still perform better than these neural models for low-resource languages.
Label-preserving data augmentation techniques are known to help methods generalize better by increasing the variance of the training data. It has been common practice among researchers in the computer vision field to apply data augmentation, e.g., flipping, cropping, scaling and rotating images, for tasks like image classification CiresanMS12; KrizhevskySH12. Similarly, speech recognition systems have made use of augmentation techniques like changing the tone and speed of the audio KoPPK15; RagniKRG14, noise addition hartmann2016two and synthetic audio generation TakahashiGPG16. Comparable techniques for data augmentation are less obvious for NLP tasks, due to structural differences among languages. There are only a small number of studies that tackle data augmentation for NLP, such as ZhangL15 for text classification and FadaeeBM17a for machine translation.
In this work, we focus on languages with small training datasets made available by the Universal Dependencies (UD) project. These languages are dominantly from the Uralic, Turkic, Slavic and Baltic language families, which are known to have extensive morphological case-marking systems and relatively free word order. With these languages in mind, we propose an easily adaptable, multilingual text augmentation technique based on dependency trees, inspired by two common augmentation methods from image processing: cropping and rotation. As images are cropped to focus on a particular item, we crop sentences to form smaller, meaningful and focused sentences. As images are rotated around a center, we rotate the portable tree fragments around the root of the dependency tree to form synthetic sentences. We augment the training sets of these low-resource languages via crop and rotate operations. To measure the impact of augmentation, we implement a unified character-level sequence tagging model. We systematically train separate part-of-speech tagging models with the original and augmented training sets, and evaluate them on the original test set. We show that crop and rotate provide improvements over the non-augmented data for the majority of the languages, especially for languages with rich case-marking systems.
We borrow two fundamental label-preserving augmentation ideas from image processing: cropping and rotation. Image cropping can be defined as the removal of some of the peripheral areas of an image to focus on the subject/object (e.g., focusing on the flower in a large green field). Following this basic idea, we aim to identify the parts of the sentence we want to focus on and remove the other chunks, i.e., form simpler/smaller meaningful sentences.
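The cropping operation can be sketched as follows. This is an illustrative toy implementation, not the authors' released code: the tuple-based tree representation, the example English sentence, and the choice of which dependent to keep (`keep`) are all assumptions made for the sketch; the relation labels (`nsubj`, `obj`, …) follow UD conventions.

```python
# Illustrative sketch of sentence cropping on a toy dependency tree
# (hypothetical example; not the authors' implementation).
# Each token is (index, form, head_index, deprel); head 0 marks the root.
SENT = [
    (1, "Her", 2, "det"),
    (2, "father", 3, "nsubj"),
    (3, "bought", 0, "root"),
    (4, "a", 6, "det"),
    (5, "red", 6, "amod"),
    (6, "car", 3, "obj"),
]

def subtree(sent, idx):
    """Indices of the subtree rooted at idx (inclusive)."""
    ids = [idx]
    for i, _, head, _ in sent:
        if head == idx:
            ids.extend(subtree(sent, i))
    return ids

def crop(sent, keep="nsubj"):
    """Keep the root plus the whole subtree of one chosen dependent,
    producing a smaller but still meaningful sentence."""
    root = next(i for i, _, head, _ in sent if head == 0)
    kept = {root}
    for i, _, head, rel in sent:
        if head == root and rel == keep:
            kept.update(subtree(sent, i))
    # tokens are stored in surface order, so filtering preserves word order
    return [form for i, form, _, _ in sent if i in kept]

print(crop(SENT, keep="nsubj"))  # ['Her', 'father', 'bought']
print(crop(SENT, keep="obj"))    # ['bought', 'a', 'red', 'car']
```

Each choice of `keep` yields one focused crop; iterating over several dependents of the root generates multiple augmented sentences per tree.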
Images are rotated around a chosen center by a certain degree to enhance the training data. Similarly, we choose the root as the center of the sentence and rotate the flexible tree fragments around the root for augmentation. Flexible fragments are usually defined by the morphological typology of the language wordOrder. For instance, languages close to the analytic type, such as English, rarely have inflectional morphemes. They do not mark objects/subjects, therefore words have to follow a strict order. For such languages, sentence rotation would mostly introduce noise. On the other hand, a large number of languages such as Latin, Greek, Persian, Romanian, Assyrian, Turkish, Finnish and Basque have no strict word order (though there is a preferred order) due to their extensive marking systems. Hence, the flexible parts are defined as marked fragments, which are, again, subjects and objects. Rotation is illustrated in Fig. LABEL:fig:flip on the same sentence.
- For example, in the case of dependency parsing, the recent best results from the CoNLL-18 parsing shared task can be compared to the results of traditional language-specific models.
- Focus should not be confused with the grammatical category FOC.