
Machine Translation of Low-Resource Spoken Dialects:
Strategies for Normalizing Swiss German

Abstract

The goal of this work is to design a machine translation system for a low-resource family of dialects, collectively known as Swiss German. We list the parallel resources that we collected, and present three strategies for normalizing Swiss German input in order to address its regional and spelling diversity. We show that character-based neural MT is the best solution for text normalization and that, in combination with phrase-based statistical MT, we reach a BLEU score of 36. This score, however, decreases as the testing dialect becomes more remote from the training one.

Keywords: machine translation, low-resource languages, spoken dialects, character-based neural MT


Pierre-Édouard Honnet (1), Andrei Popescu-Belis (2), Claudiu Musat (3), Michael Baeriswyl (3)
(Work conducted while the second author was at the Idiap Research Institute, Martigny, Switzerland.)
(1) Idiap Research Institute, Rue Marconi 19, CP 592, CH-1920 Martigny, Switzerland. pehonnet@idiap.ch
(2) HEIG-VD / HES-SO, Route de Cheseaux 1, CP 521, CH-1401 Yverdon-les-Bains, Switzerland. andrei.popescu-belis@heig-vd.ch
(3) Swisscom (Schweiz) AG, Genfergasse 14, CH-3011 Bern, Switzerland. claudiu.musat@swisscom.com, michael.baeriswyl@swisscom.com


1 Introduction

In the era of social media, more and more people make online contributions in their own language. The diversity of these languages is, however, a barrier to information access or aggregation across languages. Machine translation can now overcome this limitation with considerable success for well-resourced languages, i.e. for language pairs endowed with parallel corpora large enough to train neural or statistical MT systems. This is not the case, though, for many low-resourced languages which have traditionally been considered oral rather than written means of communication, which often lack even a standardized spelling, and/or which exhibit significant variation across dialects. Such languages have an increasing presence in written communication, especially in social media, while remaining inaccessible to non-speakers.

This paper presents a first attempt to design a written MT system for a mostly spoken language with strong dialectal variation: Swiss German. Although spoken in a technologically developed country by around five million native speakers, Swiss German had never been significantly used in writing – with the exception of folklore or children's books – before the advent of social media. Rather, from primary school onward, speakers of Swiss German are taught to use High German in writing – a variety known to linguists as Swiss Standard German, which is one of the three official federal languages along with French and Italian. Still, Swiss German is used in social media such as Twitter or Flickr, but foreigners, and even Swiss speakers of the other official languages, cannot understand it.

In this paper, we describe the first end-to-end MT system from Swiss German to High German. In Section 2 we present the Swiss German dialects and review the scarce monolingual and even scarcer parallel language resources that can be used for training MT. In Section 3 we review previous work on Swiss German and on MT of low-resource languages. In Section 4 we address the major issue of dialectal variation and lack of standard spelling – which affects many other regional and/or spoken languages as well – through three solutions: the design of explicit conversion rules; the use of phonetic representations; and character-based neural MT. These solutions are then combined with phrase-based statistical MT into a standalone translation system, as explained in Section 5. In Section 6 we detail the evaluation results. We first find that the similarity between the regions covered by the training and test data has a stronger effect on performance than the similarity of text genre. Moreover, the results show that character-based NMT is beneficial for dealing with spelling variation. Our system is thus a first general-purpose MT system making Swiss German accessible to non-speakers, and can serve as a benchmark for future, better-resourced attempts.

2 Collecting Swiss German Resources

2.1 A Heterogeneous Family of Dialects

Definition.

“Swiss German” [Russ, 1990, Christen et al., 2013] refers to a family of dialects used mainly for spoken communication by about two thirds of the population of Switzerland (i.e. over five million speakers). Swiss German is typically learned at home as a first language, but from primary school onward it is replaced by High German for all written forms, as well as for official discourse (politics, media). Linguistically, the variety of High German (or Standard German) written and spoken in Switzerland is referred to as Swiss Standard German (see Russ [Russ, 1994], Chapter 4, p. 76–99), and is almost entirely intelligible to German or Austrian speakers. In contrast, Swiss German is typically not intelligible outside Switzerland.

In fact, Swiss German constitutes a group of heterogeneous dialects exhibiting considerable local variation. Moreover, due to their spoken nature, there is no standardized written form, and little teaching material is available to foreigners. For instance, the word kleine (meaning ‘small’ in Standard German) could be written as chlyni, chliini, chline, chli or chlii in Swiss German. Linguistic studies of the Swiss German dialects (see Russ [Russ, 1990] or Christen et al. [Christen et al., 2013]) generally put a large emphasis on phonetic, lexical or syntactic variations and their geographical distribution, often concluding that such variations are continuous and uncorrelated with each other.

Divisions.

The areas where each dialect is spoken are shaped both by administrative divisions (cantons and communes) and by natural borders (topography). Within the large group of Germanic languages, the dialects of Switzerland belong to the Alemannic group; however, while a majority of dialects are High Alemannic (yellow area on the map in Figure 1), those spoken in the city of Basel and in the Canton of Valais belong respectively to the Low Alemannic and the Highest Alemannic groups. Within the Alemannic group, a multitude of divisions have been proposed; one of the most consistent is the Brünig-Napf-Reuss line (red line in Figure 1) between the eastern and western groups. A fine-grained approach could easily identify one or more dialects for each canton.

For the purpose of this study, we distinguish only two additional sub-groups on each side of the Brünig-Napf-Reuss line, referring to each by its most important canton. Westwards, we distinguish the Bernese group from the group spoken around Basel (cantons of Basel-Country, Solothurn and parts of Aargau). Eastwards, we distinguish the Zürich group from the easternmost group around St. Gallen. Therefore, for training and testing machine translation on various “dialects”, we will consider in what follows six main variants of Swiss German, represented on the map in Figure 1.

Figure 1: Map of Switzerland, with the six main groups of dialects that we identified for the purpose of our research. The area in yellow indicates the High Alemannic dialects. Image source (map, yellow area, red line): https://commons.wikimedia.org/wiki/File:Brunig-Napf-Reuss-Linie.png.

Notations.

We will refer to Swiss German as ‘GSW’ (its ISO 639-2 code), followed by an indication of the variant: GSW-BS (city of Basel), GSW-BL (regions of Basel, Solothurn, parts of Aargau), GSW-BE (mainly canton of Bern), GSW-ZH (canton of Zurich and neighbors), GSW-SG (St. Gallen and easternmost part of Switzerland), GSW-VS (the German-speaking part of the canton of Valais/Wallis). These groups correspond to the dialect labels used in the Alemannic Wikipedia (see Section 2.2 below), from west to east: Basel, Baselbieter, Bern, Zurich, Ùndertòggeborg, and Wallis (Valais). In what follows, we will also append the genre of the training data to the dialect abbreviation.

Usage and Need for MT.

Swiss German is primarily used for spoken communication, but the widespread adoption of social media in Switzerland has significantly increased its written use for informal exchanges on social platforms or in text messages. No standardized spelling has emerged yet (a fact related to the lack of GSW teaching as a second language), and GSW is still written partly with reference to High German (or Swiss Standard German), and partly using a phonetic transcription, likewise inspired by German pronunciation. Access to such content in social media is nearly impossible for foreigners, and even for speakers of different dialects (e.g. Valaisan content for Bernese speakers). Our goal is to design an MT system translating all varieties of GSW (with their currently observed spelling) into High German, taking advantage of the relative similarity of the two languages. By pivoting through High German, other target languages can then be supported; moreover, if a speech-to-text system existed for Swiss German [Garner et al., 2014], our system would also enable spoken translation.

2.2 Parallel Resources

Despite attempts to use comparable corpora or even monolingual data only (reviewed in Section 3), parallel corpora aligned at the sentence level remain an essential resource for training statistical MT systems. In our case, while written resources in Swiss German are to some extent available (as reviewed in the next section), it is rare to find their High German translation, or vice-versa. When such a translation exists, the two documents are often not available in electronic form, which requires a time-consuming digitization effort to make them usable for MT. (Many of them are children's books, such as Pitschi by Hans Fischer, The Gruffalo by Julia Donaldson, or The Little Prince by Antoine de Saint-Exupéry.)

One of our goals is to collect the largest possible set of parallel GSW/DE texts, in a first stage regardless of their licensing status. We include among such resources parallel lexicons (“dictionaries”), as we will show that they are helpful for training MT. Table 1 summarizes the results of our collection effort, giving for each resource its GSW variant and domain; we describe each resource in more detail hereafter.

Dataset             Train    Dev.    Test    Total
GSW-BE-Novel         2,667     218     183    3,251
GSW-BE-Wikipedia         –     180      67      247
GSW-VS-Radio           463     100      50      613
GSW-ZH-Wikipedia         –      45      50       95
GSW-BE-Bible             –       –     126      126
GSW-Archimob        40,159   2,710   2,710   45,579
GSW-ZH-Lexicon       1,527       –       –    1,527
GSW-BE-Lexicon       1,224       –       –    1,224

Table 1: GSW/DE parallel datasets partitioned for MT training, tuning and testing (numbers of parallel sentences). The lexicons (last two lines) were not used for testing; for GSW-BE-Novel, a further 183 lines were held out for future testing.
GSW-BE-Novel.

Translations of books from DE into GSW are non-existent, so we searched for books written originally in GSW and then translated into DE. Among the growing body of literature published in Swiss German, we found only one volume translated into High German and available in electronic form: Der Goalie bin ig (in English: I am the Keeper), written in the Bernese dialect by Pedro Lenz in 2010. The DE translation stays close to the original GSW-BE text, so sentence-level alignment was straightforward, resulting in 3,251 pairs of sentences with 37,240 words in GSW-BE and 37,725 in DE.

GSW-BE-Wikipedia and GSW-ZH-Wikipedia.

The Alemannic version of Wikipedia (http://als.wikipedia.org) initially appeared to be a promising source of data. However, its articles are written not only in Swiss German, but also in other Alemannic dialects such as Alsatian, Badisch and Swabian. According to its policy, contributors should write in their own dialects, and therefore only a few articles are homogeneous and carry an explicit indication of their dialect (an infobox with one of the six labels indicated above). Among them, even fewer state explicitly that they were translated from High German and can thus serve as a GSW/DE parallel dataset. We identified two such pages and sentence-aligned them to serve as test data: the article on Hans Martin Sutermeister (a Swiss doctor), translated from DE into GSW-BE, and the one on Wädenswil (a town near Zürich), translated from DE into GSW-ZH. (These pages are available at https://de.wikipedia.org/wiki/Hans_Martin_Sutermeister and https://als.wikipedia.org/wiki/Hans_Martin_Sutermeister, and at https://de.wikipedia.org/wiki/W%E4denswil and https://als.wikipedia.org/wiki/W%E4denswil.)

GSW-VS-Radio.

A small corpus of Valaisan Swiss German (a.k.a. Wallisertiitsch) has been collected at the Idiap Research Institute [Garner et al., 2014] (www.idiap.ch/dataset/walliserdeutsch). It consists of transcriptions of a local radio broadcast (Radio Rottu, http://www.rro.ch) which have been translated into High German.

GSW-BE-Bible.

The Bible has been translated into several GSW dialects, but the only electronic version available to us consisted of online excerpts in Bernese (http://www.edimuster.ch/baernduetsch/bibel.htm). However, these are not translated from High German (but likely directly from the Greek Nestle-Aland text), so their alignment with any of the traditional or modern German Bibles (www.die-bibel.de/bibeln/online-bibeln/) is problematic. We selected the contemporary Gute Nachricht Bibel (1997) for its modern vocabulary, and generated parallel data from four excerpts of the Old and New Testament, while acknowledging their particular style and vocabulary. The following excerpts were aligned: Üse Vatter, D Wienachtsgschicht, Der barmhärzig Samaritaner and D Wält wird erschaffe.

GSW-Archimob.

ArchiMob (http://www.spur.uzh.ch/en/departments/korpuslab/ArchiMob.html) is a corpus of standardized Swiss German [Samardžić et al., 2016]: it consists of transcriptions of interviewees speaking Swiss German, with a “normalized” version of each transcribed word. The interviews record memories of WW II, and all areas of Switzerland are represented. In most cases the normalization is simply the corresponding High German word or group of words, but in other cases it is Swiss German in a “standardized” orthography chosen by the annotators (with no unified convention). Using our vocabulary of High German (see below), we filtered out all sentences whose normalizations included words outside this vocabulary. In other words, we kept only truly High German sentences, along with their original Swiss German counterparts, resulting in about 45,000 word-aligned GSW/DE sentence pairs.
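As an illustration of this filtering step, the following is a minimal sketch in Python; the tab-separated input format and the file names are our own assumptions for illustration, not the actual ArchiMob distribution format.

```python
def load_vocab(path):
    """Load a plain-text High German vocabulary, one word per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def filter_truly_german(pairs_path, vocab):
    """Keep only sentence pairs whose normalized (DE) side is fully in-vocabulary."""
    kept = []
    with open(pairs_path, encoding="utf-8") as f:
        for line in f:
            gsw, de = line.rstrip("\n").split("\t")
            # Keep the pair only if every normalized token is a known DE word,
            # i.e. the normalization is truly High German.
            if all(tok.lower() in vocab for tok in de.split()):
                kept.append((gsw, de))
    return kept

vocab = load_vocab("de_vocab.txt")                    # hypothetical file names
pairs = filter_truly_german("archimob_pairs.tsv", vocab)
print(f"kept {len(pairs)} truly High German sentence pairs")
```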

GSW-ZH-Lexicon and GSW-BE-Lexicon.

The last two parallel resources are vocabularies, i.e. lists of GSW words with their DE translations. As such, they are useful for training our systems, but not for testing them. The first one is based on Hoi Zäme, a manual of Zürich Swiss German intended for High German speakers. The data was obtained by scanning the printed version, performing OCR, and manually aligning the result. Although the book also contains parallel sentences, only the bilingual dictionary was used in our study, yielding 1,527 words and their translations. A similar dictionary for Bernese (GSW-BE vs. DE) was found online (www.edimuster.ch/baernduetsch/woerterbuechli.htm), with 1,224 words for which we checked and corrected the alignments.

2.3 Monolingual Resources

The Phonolex dictionary, a phonetic dictionary of High German, was used to train the grapheme-to-phoneme converter described in Section 4.2 below. It contains High German words with their phonetic transcriptions. We also used it as a High German dictionary to identify out-of-vocabulary (OOV) words.

About 75 pages from the Alemannic Wikipedia were also collected, covering the six GSW variants defined earlier. They were used to derive the orthographic normalization rules presented in Section 4.1.

To build language models (see Section 5), we used the News Crawl corpora 2007–2015 from the Workshop on Machine Translation (http://www.statmt.org/wmt17/translation-task.html).

3 Previous Work on Swiss German and the MT of Low-Resource Languages

The variability of Swiss German dialects has been investigated in linguistic studies [Russ, 1990, Christen et al., 2013], but proper language resources are extremely rare. The ArchiMob corpus [Samardžić et al., 2016] is unique in that it provides transcripts of spoken GSW narratives, along with a “normalization” into High German, or into an ad-hoc normalized spelling for words that do not exist in this language [Samardžić et al., 2015]. Character-level SMT has been studied as a way to perform this normalization automatically [Scherrer and Ljubešić, 2016].

Initial attempts at MT of GSW include a system generating GSW texts from DE, focusing on a fine-grained modeling of regional variations shown on a map [Scherrer, 2012], and a system combining ASR and MT of the Swiss German of Valais [Garner et al., 2014].

The MT of low-resource languages or dialects has been studied in many other important cases, in particular for Arabic dialects [Zbib et al., 2012], which are also predominantly used in spoken communication. The most frequent strategies are either the crowdsourcing of additional parallel data, or the use of large monolingual and comparable corpora to perform bilingual lexicon induction before training an MT system [Klementiev et al., 2012, Irvine and Callison-Burch, 2016, Carl et al., 2008]. In our case, however, even such corpora are unavailable.

The Workshops on Statistical MT (see www.statmt.org) have in some years proposed a translation task for “low-resourced” languages to/from English, such as Hindi in 2014, Finnish in 2015, or Latvian in 2017. In 2011, the featured translation task aimed at translating text messages from Haitian Creole into English, with a parallel corpus of a size similar to ours. The original system, built in the wake of the 2010 Haiti earthquake, leveraged a phonetic mapping from French to Haitian Creole to obtain a large bilingual lexicon [Lewis, 2010].

4 Normalizing Swiss German for MT

Three issues must be addressed when translating Swiss German into High German, all of which contribute to a large number of out-of-vocabulary (OOV) words (i.e. words unseen during training) in the source language:

  1. The scarcity of parallel GSW/DE data for training (see Section 2.2), which cannot be easily addressed by the strategies seen in Section 3.

  2. The variability of the dialects (regions) across training and testing data (which increases dialect-specific scarcity).

  3. The lack of a standard spelling for GSW, which introduces intra-dialect and intra-speaker variability.

There are several ways to address such variability. The most principled one is the normalization of all GSW input using unified spelling conventions, coupled with the design of a GSW/DE MT system for normalized input. However, such a goal is far beyond our scope. Instead, we propose here to “normalize” Swiss German input for the concrete purpose of machine translation, by converting unknown GSW words either into known GSW words, or into High German words, which are preserved by the GSW/DE MT system (the system is specifically built so that OOV words are copied into the target sentence rather than deleted) and thus increase the number of correctly translated words. This procedure rests on the assumption that many OOV GSW words are close to DE words, with a slightly different pronunciation and spelling (see examples in the third column of Table 2). Each of the three strategies below proceeds as follows, using the GSW vocabulary to determine OOV words:

  1. For each OOV word w, apply the normalization strategy. If this changes w into a new form w′, go to (2).

  2. If w′ is a known GSW word, then replace w with w′ and proceed to (4). If not, go to (3).

  3. If w′ is a known DE word, then replace w with w′. If not, leave w unchanged and go to (4).

  4. Translate the modified GSW text.

This “normalization” method thus has two ways of helping MT by converting OOV words: either into a GSW word known to the MT system, or into a correct DE word which passes through MT unmodified. We describe below the three strategies used to normalize GSW input before GSW/DE MT; they can be used separately or in combination.
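The following sketch, in Python, summarizes steps (1) to (4) at the word level. It is a minimal illustration: normalize stands for any of the three strategies described below, and gsw_vocab and de_vocab are sets of known GSW and DE words; all names here are ours, not from the original system.

```python
def normalize_oov(tokens, normalize, gsw_vocab, de_vocab):
    """Steps (1)-(4): try to map each OOV token to a known GSW or DE word
    before the sentence is passed to the GSW/DE MT system."""
    out = []
    for w in tokens:
        if w in gsw_vocab:                 # in-vocabulary word: keep as-is
            out.append(w)
            continue
        w2 = normalize(w)                  # step (1): one of the three strategies
        if w2 != w and w2 in gsw_vocab:    # step (2): now a known GSW word
            out.append(w2)
        elif w2 != w and w2 in de_vocab:   # step (3): a DE word, copied through MT
            out.append(w2)
        else:
            out.append(w)                  # leave unchanged
    return out                             # step (4): translate the modified text
```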

4.1 Explicit Spelling Conversion Rules

For every OOV word w:

  1. Modify w into w′ by applying, in sequence, the spelling conversion rules in Table 2.

  2. If w′ is in the GSW dictionary, then replace w with w′. If not, go to (3).

  3. If w′ is in the DE dictionary, then replace w with w′. If not, keep the original spelling.

We defined a set of orthographic rules to map specific patterns to a more “standard” spelling observed for Swiss German. Table 2 lists the orthographic rules implemented in our system, with some possible conversion examples.

Spelling    Convert to    Example
.*scht.*    .*st.*        Angscht → Angst
.*schp.*    .*sp.*        Schprache → Sprache
^gäge.*     ^gegen.*      Gägesatz → Gegensatz
CäC         CeC           Präsident → President
^gm.*       ^gem.*        Gmeinde → Gemeinde
^gf.*       ^gef.*        gfunde → gefunde(n)
^gw.*       ^gew.*        gwählt → gewählt
^aa.*       ^an.*         Aafang → Anfang
.*ig$       .*ung$        Regierig → Regierung
^ii.*       ^ein.*        Iiwohner → Einwohner

Table 2: Orthographic conversion rules, using POSIX metacharacters ^ and $ for the beginning and end of a word, .* for any sequence of characters, and C for any consonant.

After applying the conversion rules to a word, we check in both our GSW and DE dictionaries whether the word exists with this “standard” spelling. In the latter case, using the DE word directly amounts to translating that word before the full sentence translation is carried out.
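To make the rule cascade concrete, here is one possible implementation of a subset of the Table 2 rules as ordinary regular expressions, applied in sequence. This is our own illustrative reading of the table, not the authors' exact rule set.

```python
import re

# A subset of the Table 2 rules, applied in sequence. "C" in the table
# stands for any consonant, expanded here into a character class.
CONS = "bcdfghjklmnpqrstvwxz"
RULES = [
    (re.compile(r"scht"), "st"),                        # Angscht -> Angst
    (re.compile(r"schp"), "sp"),                        # Schprache -> Sprache
    (re.compile(r"^gäge"), "gegen"),                    # Gägesatz -> Gegensatz
    (re.compile(rf"([{CONS}])ä([{CONS}])"), r"\1e\2"),  # Präsident -> President
    (re.compile(r"^aa"), "an"),                         # Aafang -> Anfang
    (re.compile(r"ig$"), "ung"),                        # Regierig -> Regierung
    (re.compile(r"^ii"), "ein"),                        # Iiwohner -> Einwohner
]

def convert_spelling(word):
    """Apply the orthographic conversion rules of Table 2 in sequence."""
    w = word.lower()
    for pattern, repl in RULES:
        w = pattern.sub(repl, w)
    # German nouns are capitalized: restore the original capitalization.
    return w.capitalize() if word[:1].isupper() else w

assert convert_spelling("Angscht") == "Angst"
assert convert_spelling("Regierig") == "Regierung"
```

Note that rule order matters: the prefix rule ^gäge must fire before the generic CäC rule, so that Gägesatz becomes Gegensatz rather than Gegesatz.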

4.2 Using Phonetic Representations

The second approach is based on the assumption that, despite orthographic differences, some words share the same pronunciation. Converting an OOV word to its phonetic transcription may thus allow us to find an in-vocabulary word with the same pronunciation; substituting the OOV word with that word should help translation, under the assumption that two GSW words written differently but pronounced identically are the same word and should be translated identically.

For this, a grapheme-to-phoneme (g2p) converter is needed: an algorithm that converts character sequences into phonetic sequences, i.e. goes from the written form of a word to its pronunciation. We build it on Standard German, as we expect Swiss German to be written phonetically, so that the reverse conversion should be close to Standard German pronunciation rules. In our experiments, a g2p converter was trained on the Phonolex dictionary, and a GSW phonetic dictionary was created using this system. When an OOV word is to be translated, we follow this approach:

  1. Convert the OOV word w to its pronunciation.

  2. If the resulting pronunciation exists in the phonetic GSW dictionary, then replace w with the corresponding word. If not, go to (3).

  3. If the resulting pronunciation exists in the phonetic DE dictionary, then replace w with the corresponding word. If not, keep the original spelling.

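A minimal sketch of this phonetic lookup follows, assuming a trained g2p function (e.g. a model trained on Phonolex, not shown here) and pronunciation-to-word tables precomputed from the GSW and DE vocabularies; all names are illustrative.

```python
def phonetic_normalize(word, g2p, gsw_by_pron, de_by_pron):
    """Map an OOV word to a known word with the same pronunciation.

    g2p:         function mapping a written word to a phoneme string
                 (the exact output format is model-dependent and assumed
                 consistent across both lookup tables)
    gsw_by_pron: dict mapping phoneme strings to known GSW words
    de_by_pron:  dict mapping phoneme strings to known DE words
    """
    pron = g2p(word)              # step (1): written form -> pronunciation
    if pron in gsw_by_pron:       # step (2): a GSW word sounding the same
        return gsw_by_pron[pron]
    if pron in de_by_pron:        # step (3): a DE word sounding the same
        return de_by_pron[pron]
    return word                   # otherwise keep the original spelling
```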

4.3 Character-based Neural MT

Training mainstream neural MT systems (typically recurrent neural networks (RNNs) encoding a source sentence and decoding into the target language [Cho et al., 2014], augmented with an attention mechanism over the source sentence [Bahdanau et al., 2014]) is not possible for GSW/DE, as the size of our resources is several orders of magnitude below NMT requirements. However, several recent approaches have explored a new strategy: training the translation system at the character level [Ling et al., 2015, Costa-Jussa and Fonollosa, 2016, Chung et al., 2016, Bradbury et al., 2017].

Among these approaches, we use here Quasi-Recurrent Neural Networks (QRNNs) [Bradbury et al., 2017], which take advantage of both convolutional and recurrent layers. The increased parallelism introduced by the convolutional layers speeds up both training and testing of translation models. As above, we use character-based neural MT (noted CBNMT) to translate unknown words rather than entire sentences (for which training data is clearly insufficient). Using CBNMT for OOV translation only has two advantages. First, the training data may be sufficient to capture spelling conversion better than hand-crafted rules (such as those in Table 2). Second, we can use rather compact recurrent layers, as the sequences to translate are much shorter than sentences. (Character-based approaches are limited by the length, in characters, of the sequences that can be translated; to support longer sequences, one must increase the size of the recurrent layers.)

Our system for CBNMT of OOV GSW words is based on open-source scripts for TensorFlow available online, using the implementation of the QRNN architecture by Kyubyong Park (https://github.com/Kyubyong/quasi-rnn). We made the following modifications to it:

  1. We added a “start of word” symbol to avoid mistakes on the first letter of a word. This was done outside the translation scripts, by prepending the ‘:’ symbol to each word and removing it after translation.

  2. We implemented a “translation” script which takes two arguments: an input file (to be translated) and an output file (for the translation). This allows files to be translated outside of “test mode”, i.e. without requiring a reference translation for comparison.

  3. We added the possibility to translate an incomplete minibatch, by padding the last incomplete batch with empty symbols (0). (Originally, with a minibatch size of b and N = q·b + r sequences, where 0 < r < b, only the first q·b sequences were translated and the last r were ignored.)

  4. We set the following hyper-parameters for our task: the maximum number of characters was set to 40, as no longer words were found in our GSW vocabulary; the minibatch size was kept at 16 and the number of hidden units at 320, as in the default implementation.
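As an illustration of modifications (1) and (3), the sketch below shows one way the start-of-word marker and the minibatch padding could look; it is a schematic reconstruction under our own naming, not the actual modified scripts.

```python
PAD = 0      # empty symbol used for padding (index 0, as stated above)
START = ":"  # "start of word" marker (modification 1)

def mark_words(words):
    """Prepend ':' to each word so the model never errs on a bare first letter."""
    return [START + w for w in words]

def unmark_words(words):
    """Remove the marker from translated outputs."""
    return [w[1:] if w.startswith(START) else w for w in words]

def pad_last_batch(sequences, batch_size=16, max_chars=40):
    """Split encoded sequences into minibatches, padding the last one with
    empty sequences so that every word gets translated (modification 3)."""
    batches = [sequences[i:i + batch_size]
               for i in range(0, len(sequences), batch_size)]
    if batches and len(batches[-1]) < batch_size:
        filler = [PAD] * max_chars                      # an all-empty sequence
        batches[-1] += [filler] * (batch_size - len(batches[-1]))
    return batches
```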

We trained the CBNMT model using unique word pairs from the ArchiMob corpus (see Section 2.2 above), i.e. a Swiss German word and its normalized (generally High German) counterpart, with a training set of 40,789 word pairs and a development set of 2,780 word pairs.

5 Integration with Machine Translation

Using the Moses toolkit (http://www.statmt.org/moses/), we built a phrase-based statistical MT (PBSMT) system [Koehn et al., 2003], learning translation models from various subsets of the parallel GSW/DE data presented in Section 2.2 above. For the target language model, we trained a tri-gram model with IRSTLM (http://hlt-mt.fbk.eu/technologies/irstlm) over ca. 1.3 billion words of High German (see Section 2.3). We tuned each system on the development data indicated in Table 1. As explained above, the “normalization” strategies attempt to change GSW OOV words into in-vocabulary GSW or even DE words; in several experiments, we combined two strategies. We evaluate with the most commonly used automatic MT metric, the BLEU score [Papineni et al., 2002].
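For reference, BLEU can be computed with any standard implementation. Below is a minimal sketch using the sacrebleu Python package (our choice for illustration; the paper does not specify the exact scoring tool), with made-up example sentences.

```python
import sacrebleu  # pip install sacrebleu

# Toy example with invented sentences (not from our test sets): each
# hypothesis is paired with the reference at the same position in the
# single reference stream.
hypotheses = ["der goalie bin ich .", "die regierung hat entschieden ."]
references = [["der goalie bin ich .", "die regierung hat es entschieden ."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```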

6 Results and Discussion

6.1 Effects of Genre and Dialect

As the amount of parallel GSW/DE data was limited, the first system translating Swiss German into High German was built with Moses, trained on the Bernese novel corpus (GSW-BE-Novel in Table 1 above).

Table 3 gives the BLEU scores obtained when testing this system on test sets from different regions or topics. We also vary the tuning sets, including some closer to the target test domains, to assess their impact.

Test set            Tuning (dev) set     BLEU
GSW-BE-Novel        GSW-BE-Novel         35.3
GSW-BE-Wikipedia    GSW-BE-Novel         21.9
GSW-BE-Wikipedia    GSW-BE-Wikipedia     21.7
GSW-ZH-Wikipedia    GSW-BE-Novel         16.2
GSW-ZH-Wikipedia    GSW-ZH-Wikipedia     15.3
GSW-VS-Radio        GSW-BE-Novel          9.7

Table 3: BLEU scores for various tuning and test sets for the baseline system trained on GSW-BE-Novel. Performance decreases significantly as the dialect and domain become more remote from the training/tuning data.

The scores follow the expected trends:

  • When testing on similar data, i.e. the same dialect and domain (test data from the same book), the scores are the highest, in the same range as state-of-the-art EN-DE or EN-FR systems.

  • When changing domain (testing on Wikipedia data in the same dialect), the scores decrease.

  • When testing on different dialects, the scores decrease further. This holds for both GSW-ZH and GSW-VS: the more remote the dialect and domain are from the training data, the lower the score. GSW-VS is known to be very different from the other dialects, and radio broadcast data differs strongly from the novel used at training time.

6.2 Effect of the Size of Training Data

To evaluate the effect of using more training data, with larger vocabularies, a new system was trained using the same data as in the previous experiments, complemented with the two bilingual lexicons presented in Section 2.2.

Test set            BLEU
GSW-BE-Novel        36.2
GSW-BE-Wikipedia    23.6
GSW-ZH-Wikipedia    17.3
GSW-VS-Radio        10.0

Table 4: BLEU scores for various test sets (Bern, Zurich and Valais dialects) for a Moses-based system trained on data including the two GSW/DE dictionaries.

Table 4 presents the BLEU scores obtained with this system. The scores increase in all cases by about 1 BLEU point. As expected, using more training data, even in the form of bilingual lexicons, yields more reliable translation models.

Test set            Baseline1   Baseline2   Phon.   Orth.   Orth. & Phon.   CBNMT & Phon.   CBNMT
GSW-Archimob          10.9        10.8       10.8    11.2       13.9            27.9         32.9
GSW-BE-Novel          36.2        34.1       34.3    34.4       34.4            35.6         35.4
GSW-BE-Wikipedia      23.6        22.7       23.2    23.6       23.7            20.5         24.0
GSW-ZH-Wikipedia      17.3        17.7       17.1    18.9       18.2            22.0         22.1
GSW-VS-Radio          10.0        10.7       11.0    12.2       12.0             8.7         22.9
GSW-BE-Bible           5.7         5.8        6.2     6.1        6.4             6.3          6.3

Table 5: BLEU scores for several test sets and normalization strategies (orthographic, phonetic, and character-based NMT).

6.3 Out-of-Vocabulary GSW Words

The three normalization approaches were evaluated on the same datasets as the previous systems. Additionally, two combined approaches were evaluated: orthographic plus phonetic conversion on one side, and CBNMT plus phonetic conversion on the other. Table 5 summarizes the results for the baseline systems and the proposed approaches.

Baseline1 corresponds to the system whose language model is trained only on the parallel GSW/DE training data, while Baseline2 uses the larger language model, as described in Section 6.2. We can make the following observations:

  • In all cases except GSW-BE-Novel, the orthographic approach improves the BLEU score over the baseline, and the improvement is larger for more remote dialects and domains.

  • The phonetic approach improves the score in 4 out of 6 cases. In the case where the score decreases, we suspect that some words did not require pre-processing, and that the pre-processing converted them into false positives (i.e. the algorithm found a matching word, but not the correct one for translation).

  • Combining both approaches always yields better scores than the baseline, but in the case where the phonetic approach deteriorated the score, orthographic conversion alone performs better.

  • In all cases, combining CBNMT with the baseline PBSMT works best. The improvement is highest when the dialect or domain differ (except for the Bible test set), which was expected since, ultimately, more data was used to train the CBNMT models. This is especially true for the GSW-Archimob test set, whose data is similar to that used to train the CBNMT models.

  • Baseline1 outperforms all other systems on the GSW-BE-Novel test set. This is expected, as the training data comes from the same dialect and the same domain; moreover, the language model is trained on this same data.

7 Conclusion

In this paper, we proposed solutions for the machine translation of a family of dialects, Swiss German, for which parallel corpora are scarce. Our efforts on resource collection and MT design have yielded:

  • a small Swiss German / High German parallel corpus of about 60k words;

  • a larger list of resources which await digitization and alignment;

  • three solutions for input normalization, to address variability of region and spelling;

  • a baseline GSW-to-DE MT system reaching 36 BLEU points.

Among the three normalization strategies, we found that character-based neural MT was the most promising one. Moreover, we found that MT quality depended more strongly on the regional rather than topical similarity of test vs. training data.

These findings will be helpful to design MT systems for other spoken dialects without fixed written forms, such as numerous regional languages across Africa and Asia, which are a natural means of communication in social media.

8 Bibliographical References


  • Bahdanau et al., 2014 Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Bradbury et al., 2017 Bradbury, J., Merity, S., Xiong, C., and Socher, R. (2017). Quasi-recurrent neural networks. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Carl et al., 2008 Carl, M., Melero, M., Badia, T., Vandeghinste, V., Dirix, P., Schuurman, I., Markantonatou, S., Sofianopoulos, S., Vassiliou, M., and Yannoutsou, O. (2008). METIS-II: low resource machine translation. Machine Translation, 22(1):67–99.
  • Cho et al., 2014 Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • Christen et al., 2013 Christen, H., Glaser, E., Friedli, M., and Renn, M. (2013). Kleiner Sprachatlas der deutschen Schweiz. Verlag Huber, Frauenfeld, Switzerland.
  • Chung et al., 2016 Chung, J., Cho, K., and Bengio, Y. (2016). A character-level decoder without explicit segmentation for neural machine translation. arXiv preprint arXiv:1603.06147.
  • Costa-Jussa and Fonollosa, 2016 Costa-Jussa, M. R. and Fonollosa, J. A. (2016). Character-based neural machine translation. arXiv preprint arXiv:1603.00810.
  • Garner et al., 2014 Garner, P. N., Imseng, D., and Meyer, T. (2014). Automatic speech recognition and translation of a Swiss German dialect: Walliserdeutsch. In Proceedings of Interspeech, Singapore, September.
  • Irvine and Callison-Burch, 2016 Irvine, A. and Callison-Burch, C. (2016). End-to-end statistical machine translation with zero or small parallel texts. Natural Language Engineering, 22(4):517–548.
  • Klementiev et al., 2012 Klementiev, A., Irvine, A., Callison-Burch, C., and Yarowsky, D. (2012). Toward statistical machine translation without parallel corpora. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 130–140, Avignon, France.
  • Koehn et al., 2003 Koehn, P., Och, F. J., and Marcu, D. (2003). Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the ACL, pages 48–54, Edmonton, Canada.
  • Lewis, 2010 Lewis, W. D. (2010). Haitian Creole: How to build and ship an MT engine from scratch in 4 days, 17 hours, and 30 minutes. In Proceedings of the 14th Annual Conference of the European Association for Machine Translation (EAMT), Saint-Raphaël, France.
  • Ling et al., 2015 Ling, W., Trancoso, I., Dyer, C., and Black, A. W. (2015). Character-based neural machine translation. arXiv preprint arXiv:1511.04586.
  • Papineni et al., 2002 Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL), pages 311–318, Philadelphia, PA, USA.
  • Russ, 1990 Russ, C. V. J. (1990). High Alemannic. In Charles V. J. Russ, editor, The Dialects of Modern German. A linguistic survey, pages 364–393. Routledge, London, UK.
  • Russ, 1994 Russ, C. V. (1994). The German language today. A linguistic introduction. Routledge, London, UK.
  • Samardžić et al., 2015 Samardžić, T., Scherrer, Y., and Glaser, E. (2015). Normalising orthographic and dialectal variants for the automatic processing of Swiss German. In Proceedings of the 7th Language and Technology Conference (LTC), Poznan, Poland.
  • Samardžić et al., 2016 Samardžić, T., Scherrer, Y., and Glaser, E. (2016). ArchiMob – a corpus of spoken Swiss German. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia. European Language Resources Association (ELRA).
  • Scherrer and Ljubešić, 2016 Scherrer, Y. and Ljubešić, N. (2016). Automatic normalisation of the Swiss German ArchiMob corpus using character-level machine translation. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS), Bochum, Germany.
  • Scherrer, 2012 Scherrer, Y. (2012). Generating Swiss German sentences from Standard German: a multi-dialectal approach. Ph.D. thesis, University of Geneva, Switzerland.
  • Zbib et al., 2012 Zbib, R., Malchiodi, E., Devlin, J., Stallard, D., Matsoukas, S., Schwartz, R., Makhoul, J., Zaidan, O. F., and Callison-Burch, C. (2012). Machine translation of Arabic dialects. In Proceedings of the 2012 Conference of the North American Chapter of the ACL, pages 49–59, Montreal, Canada.