MIZĀN: A Large Persian-English Parallel Corpus
Machine translation, one of the most essential tasks in natural language processing, is now highly dependent upon multilingual parallel corpora. In this paper, we introduce the largest Persian-English parallel corpus to date, with more than one million sentence pairs collected from masterpieces of literature. We also present the acquisition process and statistics of the corpus, and report on a baseline statistical machine translation system trained on it.
The advent of digital computers in the mid-20th century revolutionized every aspect of science. New interdisciplinary areas such as corpus linguistics and computational linguistics shaped the current state of the art in automatic translation, now referred to as statistical machine translation (SMT), which relies on largely language-independent statistical methods trained on large parallel corpora containing source- and target-language sentence pairs [brown1993mathematics, koehn2003statistical].
There exist some multilingual parallel corpora for resource-rich languages such as Europarl [koehn2005europarl] and JRC-Acquis [steinberger2006jrc]. In addition, there are many bilingual corpora, with English as one end in most cases, such as corpora presented in [altenberg2000english, tadic2000building, germann2001aligned, ma2006corpus, utiyama2007japanese].
However, many low-resource languages, including Persian, lack parallel corpora large enough to benefit from SMT. The first attempt at Persian-English automatic translation was the Shiraz project, for which a parallel corpus of 3K sentence pairs was prepared [amtrup2000persian]. Mousavi \shortcitemosavi2009constructing proposed a proprietary corpus containing 100K sentence pairs. TEP is a publicly available corpus of about 550K sentence pairs with 8M terms, extracted from movie subtitles [pilevar2011tep]. This corpus is built from colloquial Persian, which in some cases differs from formal Persian in both morphology and syntax.
Researchers have thus attempted to build Persian-English parallel corpora, but due to the lack of resources and the huge amount of work required, the resulting corpora are unsatisfactory in size or quality. Therefore, in order to contribute to Persian-English machine translation research, we present MIZĀN, a manually aligned Persian-English parallel corpus that is publicly available for research purposes and contains 1 million sentence pairs with 25 million terms, making it the largest Persian-English parallel corpus to date. We evaluate MIZĀN through a translation task and study how well current SMT approaches perform on Persian-English translation and what improvements are possible.
2 Corpus Collection
Parallel content for building parallel corpora is usually collected from publicly available texts, mainly from the web. However, despite a broad search for available Persian-English parallel texts, we were unable to find enough suitable resources to build our corpus.
Therefore, searching for any available English text that might have a Persian equivalent of any extent, we decided to use copyright-free masterpieces of literature published through Project Gutenberg [hart1971project]. We collected a list of 500 titles and looked them up in the National Library and Archives of Iran to see whether they had ever been translated into Persian and published in Iran. About 180 of these titles had been translated into Persian, but we found that most of them were published more than 30 years ago, a fortunate incident, since their copyrights have expired, but also a challenge, since they are no longer available off the shelf. This forced us into the cumbersome process of finding used copies one by one.
In parallel with acquiring enough books, we started to digitize them. We initially chose OCR as a cheap and fast digitization process. However, after working on the first 10 titles, we observed that the error rate, and the time and expense needed to correct the errors, were so high that the more expensive and slower process of typewriting the books became reasonable. Therefore, the English side of our comparable text resource was downloaded from Project Gutenberg and the Persian side was manually typewritten from the corresponding translations. The transcription process took about 3 years and employed multiple typists.
Refinement is a common preprocessing step for SMT [habash2006arabic]. Persian text suffers from a vast number of computational issues, ranging from choosing the correct character set and encoding to morphological and orthographical ambiguities [kashefi2010towards, rasooliorthographic].
Persian, Arabic, and Urdu share most of their characters in Unicode. However, there are a handful of language-dependent yet homographic exceptions that may mistakenly be used interchangeably. For example, the letter Yeh is encoded at U+064A with the isolated form representation ي, at U+06CC with the isolated form representation ی, and at at least five more code points. Using these characters interchangeably produces strings that are computationally different but visually similar, which can seriously mislead any statistical analysis.
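The homograph problem above can be illustrated with a minimal normalization sketch. The mapping below is an assumption covering only Yeh and Kaf as examples; a full normalizer such as Virastyar handles many more code points:

```python
# Minimal sketch: map visually identical Arabic code points to their
# Persian counterparts. This mapping is illustrative, not exhaustive.
ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> Persian Yeh (U+06CC)
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> Persian Yeh
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> Persian Keheh (U+06A9)
})

def unify_characters(text: str) -> str:
    """Unify homographic code points so identical-looking strings compare equal."""
    return text.translate(ARABIC_TO_PERSIAN)

# Two renderings of the same word that differ only in the Yeh code point:
arabic_form = "\u0639\u0644\u064A"   # ends with U+064A
persian_form = "\u0639\u0644\u06CC"  # ends with U+06CC
assert arabic_form != persian_form
assert unify_characters(arabic_form) == persian_form
```

Without such unification, token counts and translation probabilities would be split across visually indistinguishable variants of the same word.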
Persian includes three main diacritic classes: Harekat, representing short vowel marks (i.e. ــَــِــُـ); Tashdid, indicating gemination (i.e. ــّـ); and Tanvin, indicating nunation (i.e. ــًــٍــٌـ). The use of diacritics in Persian is not mandatory; however, using diacritics in a word makes it computationally different from the same word without diacritics [kashefi2013novel].
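Since diacritics are optional, stripping them collapses diacritized and bare forms into one token. A minimal sketch, assuming the three diacritic classes fall in the Arabic combining-mark range U+064B–U+0652 (which covers Tanvin, Harekat, and Tashdid):

```python
import re

# Sketch: remove optional diacritics. U+064B-U+0652 spans Tanvin
# (U+064B-U+064D), Harekat (U+064E-U+0650), Tashdid (U+0651), and Sukun.
DIACRITICS = re.compile("[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    return DIACRITICS.sub("", text)

# A word with a Fatha (U+064E) becomes identical to its bare form:
assert strip_diacritics("\u0639\u064E\u0644\u06CC") == "\u0639\u0644\u06CC"
```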
Persian possesses an intra-word space in addition to the inter-word space (i.e. the regular white space). An example with the intra-word space, or pseudo-space, is شرکتها /SerkæthA:/; compare the inter-word space in شرکت ها and no space in شرکتها, all meaning "companies". The latter two are more common, but the former is correct.
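The pseudo-space is the zero-width non-joiner (ZWNJ, U+200C). A minimal sketch of spacing correction, using a single hand-written rule for the plural suffix ها as an assumption; a real tool such as Virastyar relies on a morphological lexicon rather than one pattern:

```python
import re

ZWNJ = "\u200C"  # zero-width non-joiner, the Persian pseudo-space

def fix_plural_spacing(text: str) -> str:
    """Replace an inter-word space before the plural suffix 'ها' with ZWNJ.
    Illustrative single rule only; not a full spacing corrector."""
    return re.sub(r"\s+(ها)(?=\s|$)", ZWNJ + r"\1", text)

# The space-separated and pseudo-space forms of "companies" are
# computationally different until normalized:
with_space = "شرکت ها"
with_zwnj = "شرکت" + ZWNJ + "ها"
assert with_space != with_zwnj
assert fix_plural_spacing(with_space) == with_zwnj
```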
To address these issues we use Virastyar111Virastyar is a free and open-source project providing fundamental Persian text processing tools. See http://sourceforge.net/projects/virastyar to correct and normalize non-standard characters based on ISIRI 6219222http://www.isiri.org/portal/files/std/6219.htm, remove all optional diacritics, unify the ezafe usage as short Yeh, and correct the spacing of inflected words.
To align the corresponding sentences of the refined books, we developed an alignment-aiding software tool operated by alignment specialists, who were mostly translators and linguists. It eases the process by providing basic operations such as break, merge, delete, and edit.
We automatically align corresponding books at the chapter level using the correspondence score presented in Rasooli \shortciterasooli2011extracting. Then, we change the granularity of alignment to paragraphs and recalculate the score to indicate whether paragraph pairs correspond one-to-one, one-to-two, or not at all. This information tells the alignment specialists how much attention and manual work (i.e. break, merge, or delete) each paragraph pair needs to ensure alignment. Changing the granularity from paragraph to sentence and repeating the same process, we align each pair of parallel books at the sentence level.
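The classification step above can be sketched as follows. This is not the actual correspondence score of Rasooli et al.; as an assumption for illustration, a simple character-length ratio stands in for it, and the 0.4 threshold and function names are hypothetical:

```python
# Sketch of the one-to-one / one-to-two / manual decision, using a
# length-ratio proxy instead of the real correspondence score.
def correspondence(fa: str, en: str) -> float:
    """Length ratio in [0, 1]; 1.0 means equally long segments."""
    a, b = len(fa), len(en)
    return min(a, b) / max(a, b) if max(a, b) else 0.0

def classify(fa: str, en_units: list) -> str:
    """Decide how a Persian paragraph maps onto the next English unit(s)."""
    one = correspondence(fa, en_units[0]) if en_units else 0.0
    two = (correspondence(fa, en_units[0] + " " + en_units[1])
           if len(en_units) > 1 else 0.0)
    if max(one, two) < 0.4:           # hypothetical threshold
        return "manual"                # specialist must break/merge/delete
    return "one-to-one" if one >= two else "one-to-two"
```

The same routine applies unchanged when the granularity drops from paragraphs to sentences, which is what makes the hierarchical chapter-paragraph-sentence pass efficient.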
2.3 Corpus Statistics
The MIZĀN corpus, containing 1,021,596 unique Persian-English sentence pairs, is released as two files encoded in Unicode. Each file contains the sentences of one language, each line represents one sentence, and sentences correspond to each other by line number. Table 1 shows the number of sentences and words of the corpus on each side.