MIZĀN: A Large Persian-English Parallel Corpus

Omid Kashefi
Intelligent Systems Program
University of Pittsburgh

Machine translation, one of the most essential tasks in natural language processing, is now highly dependent on multilingual parallel corpora. In this paper, we introduce the largest Persian-English parallel corpus, with more than one million sentence pairs collected from masterpieces of literature. We also present the acquisition process and statistics of the corpus, and experiment with a baseline statistical machine translation system trained on it.


1 Introduction

The advent of digital computers in the 20th century revolutionized the way every field of science is approached. New interdisciplinary areas, such as corpus linguistics and computational linguistics, have shaped the state of the art in automatic translation, now referred to as statistical machine translation (SMT), which is based on largely language-independent statistical methods trained on large parallel corpora containing foreign- and target-language sentence pairs [brown1993mathematics, koehn2003statistical].

Multilingual parallel corpora such as Europarl [koehn2005europarl] and JRC-Acquis [steinberger2006jrc] exist for resource-rich languages. In addition, there are many bilingual corpora, most of them with English on one side, such as those presented in [altenberg2000english, tadic2000building, germann2001aligned, ma2006corpus, utiyama2007japanese].

However, many limited-resource languages, including Persian, lack suitable parallel corpora to benefit from SMT. The first attempt at Persian-English automatic translation was the Shiraz Project, which prepared a parallel corpus of 3K sentence pairs [amtrup2000persian]. Mousavi [mosavi2009constructing] proposed a proprietary corpus containing 100K sentence pairs. TEP is a publicly available corpus containing about 550K sentence pairs with 8M terms from movie subtitles [pilevar2011tep]. It is built from colloquial Persian, which in some cases differs from formal Persian in both morphology and syntax.

Researchers have thus attempted to build Persian-English parallel corpora, but because of the lack of resources and the large amount of work required, the resulting corpora are unsatisfactory in size or quality. Therefore, to contribute to Persian-English machine translation research, we present MIZĀN, a manually aligned Persian-English parallel corpus that will be publicly available for research purposes. It contains 1 million sentence pairs with 25 million terms, which makes it the largest Persian-English parallel corpus to date. We evaluate MIZĀN through a translation task, study how well current SMT approaches perform for Persian-English translation, and discuss possible improvements.

2 Corpus Collection

The parallel content required for building parallel corpora is usually collected from publicly available texts, mainly from the web. However, despite a broad search for available Persian-English parallel texts, we were unable to find enough suitable resources to build our corpus.

Therefore, searching for any available English text that might have a Persian equivalent to any extent, we decided to use copyright-free masterpieces of literature published through Project Gutenberg [hart1971project]. We collected a list of 500 titles and looked them up in the National Library and Archive of Iran to see whether they had ever been translated into Persian and published in Iran. About 180 of the titles had been translated into Persian, but we found that most of them were published more than 30 years ago. This is fortunate, as their copyrights have expired, but also a challenge, since they are no longer available off the shelf. It forced us into the cumbersome process of finding used copies one by one.

While acquiring the books, we started to digitize them. We initially decided to use OCR as a cheap and fast way of digitizing books. However, after working on the first 10 titles, we observed that the error rate, and the time and expense needed to correct the errors, were so high that the more expensive and slower process of typing the books manually became reasonable. Therefore, the English side of our comparable text resource was downloaded from Project Gutenberg, and the Persian side was manually typed from the corresponding translations. The transcription process took about 3 years and employed multiple typists.

2.1 Refinement

Refinement is a common preprocessing step for SMT [habash2006arabic]. Persian text suffers from a large number of computational issues, from choosing the correct character set and encoding to morphological and orthographical ambiguities [kashefi2010towards, rasooliorthographic].

Persian, Arabic, and Urdu share most of their characters in Unicode. However, there are a handful of language-dependent homograph exceptions that can mistakenly be used interchangeably. For example, the letter Yeh is encoded at U+064A, with the isolated form ي, at U+06CC, with the isolated form ی, and in at least five more places. Using these characters interchangeably produces strings that are visually similar but computationally different, which can seriously mislead any statistical analysis.
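
As an illustration of the homograph problem, the following minimal sketch compares two spellings of the same word and unifies them with a small mapping table. The table is an assumption made for this example: it covers only the Yeh pair named above plus the analogous Kaf/Keheh pair, and it is not the mapping used by Virastyar or prescribed by ISIRI 6219.

```python
# Illustrative subset of Arabic-to-Persian character unification.
# The mapping below is an assumption for this sketch, not the full
# table applied in the corpus refinement.
ARABIC_TO_PERSIAN = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}
_TABLE = str.maketrans(ARABIC_TO_PERSIAN)


def unify_characters(text: str) -> str:
    """Replace Arabic-specific code points with their Persian counterparts."""
    return text.translate(_TABLE)


arabic_form = "\u0639\u0644\u064A"   # ends with Arabic Yeh (U+064A)
persian_form = "\u0639\u0644\u06CC"  # ends with Farsi Yeh (U+06CC)

print(arabic_form == persian_form)                    # False
print(unify_characters(arabic_form) == persian_form)  # True
```

Without such a step, a word count would treat the two spellings as distinct vocabulary entries.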

Persian includes three main diacritic classes: Harekat, which represents short vowel marks (i.e. ــَــِــُـ), Tashdid, which indicates gemination (i.e. ــّـ), and Tanvin, which indicates nunation (i.e. ــًــٍــٌـ). The use of diacritics in Persian is not mandatory; however, a word written with diacritics is computationally different from the same word without them [kashefi2013novel].
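
A minimal sketch of diacritic removal is given below. It assumes that every combining mark in the Unicode range U+064B to U+0652 (Tanvin, Harekat, Tashdid, and Sukun) can be dropped, which is a simplification: it does not distinguish optional diacritics from ones that should be kept.

```python
import re

# U+064B-U+064D: Tanvin, U+064E-U+0650: Harekat,
# U+0651: Tashdid, U+0652: Sukun.
# Assumption for this sketch: all marks in the range are removable.
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Remove Arabic-script combining diacritics from the text."""
    return DIACRITICS.sub("", text)


# The same word with and without diacritics.
vocalized = "\u0645\u064F\u0639\u064E\u0644\u0651\u0650\u0645"
plain = "\u0645\u0639\u0644\u0645"

print(vocalized == plain)                    # False
print(strip_diacritics(vocalized) == plain)  # True
```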

Persian possesses an intra-word space in addition to the inter-word space (i.e. regular white space). An example of the intra-word space, or pseudo-space, is شرکت‌ها /SerkæthA:/, compared to شرکت ها with an inter-word space and شرکتها without any space, all meaning "companies". The latter two are more common, but the first is the correct form.
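
The three spellings can be reproduced in code, where the pseudo-space is the zero-width non-joiner (U+200C). The helper `attach_plural_suffix` is a hypothetical heuristic written for this sketch: it only repairs the spaced form of the plural suffix ها, whereas repairing the fully joined form would require a lexicon or morphological analysis.

```python
import re

ZWNJ = "\u200C"  # zero-width non-joiner, the Persian pseudo-space

stem = "\u0634\u0631\u06A9\u062A"  # the stem of the example word
suffix = "\u0647\u0627"            # the plural suffix

pseudo_spaced = stem + ZWNJ + suffix  # correct form
spaced = stem + " " + suffix          # regular white space
joined = stem + suffix                # no space at all

# Three visually similar but computationally distinct strings.
print(len({pseudo_spaced, spaced, joined}))  # 3

# Hypothetical heuristic: a free-standing plural suffix preceded by a
# single space is attached to the previous word with a ZWNJ.
_SPACED_SUFFIX = re.compile(r"(?<=\S) (" + suffix + r")(?=\s|$)")


def attach_plural_suffix(text: str) -> str:
    """Replace 'word<space>suffix' with 'word<ZWNJ>suffix'."""
    return _SPACED_SUFFIX.sub(ZWNJ + r"\1", text)


print(attach_plural_suffix(spaced) == pseudo_spaced)  # True
```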

To address these issues, we use Virastyar (a free and open-source project providing fundamental Persian text processing tools; see http://sourceforge.net/projects/virastyar) to correct and normalize non-standard characters based on ISIRI 6219 (http://www.isiri.org/portal/files/std/6219.htm), remove all optional diacritics, unify the ezafe usage as short Yeh, and correct the spacing of inflected words.

2.2 Alignment

To align the corresponding sentences of the refined books, we developed alignment-aiding software operated by alignment specialists, who were mostly translators and linguists. The software eases the process by providing basic operations such as break, merge, delete, and edit.

We automatically align corresponding books at the chapter level using the correspondence score presented by Rasooli [rasooli2011extracting]. Then, we change the granularity of the alignment to paragraphs and recalculate the score to indicate whether paragraph pairs correspond one-to-one, one-to-two, or not at all. This information tells the alignment specialists how much attention and manual work (i.e. break, merge, or delete) each paragraph pair needs to ensure alignment. Changing the granularity from paragraph to sentence and repeating the same process, we align each pair of parallel books at the sentence level.
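
The kind of hint given to the specialists can be sketched as follows. The correspondence score of [rasooli2011extracting] is not reproduced here; as a stand-in, this example uses a plain word-count ratio, and the function names and the 0.6 threshold are assumptions chosen only for illustration.

```python
from typing import Optional


def length_ratio(source: str, target: str) -> float:
    """Word-count ratio in [0, 1]; a stand-in for a real correspondence score."""
    a, b = len(source.split()), len(target.split())
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)


def classify_pair(source: str, target: str,
                  next_target: Optional[str] = None,
                  threshold: float = 0.6) -> str:
    """Label a source unit as aligned 1-1, 1-2, or not at all ("none")."""
    one_to_one = length_ratio(source, target)
    one_to_two = 0.0
    if next_target is not None:
        one_to_two = length_ratio(source, target + " " + next_target)
    if max(one_to_one, one_to_two) < threshold:
        return "none"  # likely needs manual break, merge, or delete
    return "1-1" if one_to_one >= one_to_two else "1-2"


print(classify_pair("a b c d", "w x y z"))               # 1-1
print(classify_pair("a b c d e f", "w x y", "u v t"))    # 1-2
print(classify_pair("a b c d e f g h i j", "x"))         # none
```

The same function applies unchanged at the chapter, paragraph, or sentence level, since it only depends on the text units passed in.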

2.3 Corpus Statistics

The MIZĀN corpus, containing 1,021,596 unique Persian-English sentence pairs, is released in two files encoded in Unicode. Each file contains the sentences of one language, each line represents one sentence, and sentences correspond to each other by line number. Table 1 shows the number of sentences and words on each side of the corpus.

Language   Sentences   Words (Distinct)
Persian    1,011,085   12,049,952 (198,860)
English    1,011,085   11,667,272 (153,666)
Overall    1,011,085   23,717,224 (352,526)

Table 1: Number of sentences and words (distinct words in parentheses) on each side of the corpus.
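
Given the line-aligned release format, sentence pairs and per-side counts of this kind can be obtained with a few lines of code. This is a minimal sketch: the file paths are supplied by the caller, UTF-8 is assumed as the Unicode encoding, and tokens are split on white space, which may differ from the tokenization behind the reported figures.

```python
from typing import List, Tuple


def read_pairs(persian_path: str, english_path: str) -> List[Tuple[str, str]]:
    """Pair up sentences that share a line number in the two files."""
    with open(persian_path, encoding="utf-8") as fa_file:
        fa_lines = [line.rstrip("\n") for line in fa_file]
    with open(english_path, encoding="utf-8") as en_file:
        en_lines = [line.rstrip("\n") for line in en_file]
    if len(fa_lines) != len(en_lines):
        raise ValueError("both files must contain the same number of lines")
    return list(zip(fa_lines, en_lines))


def side_stats(sentences: List[str]) -> Tuple[int, int, int]:
    """Return (sentence count, word count, distinct word count)."""
    tokens = [token for sentence in sentences for token in sentence.split()]
    return len(sentences), len(tokens), len(set(tokens))


print(side_stats(["a b a", "c"]))  # (2, 4, 3)
```

Calling `side_stats` on the Persian and English halves of the pairs returned by `read_pairs` yields one row of counts per language.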
