Design Challenges in Named Entity Transliteration

Design Challenges in Named Entity Transliteration

Yuval Merhav
Amazon Alexa AI
Cambridge, MA
\AndStephen Ash11footnotemark: 1
Amazon AWS AI
Seattle, WA
*Y. Merhav and S. Ash contributed equally to this work

We analyze some of the fundamental design challenges that impact the development of a multilingual state-of-the-art named entity transliteration system, including curating bilingual named entity datasets and evaluation of multiple transliteration methods. We empirically evaluate the transliteration task using the traditional weighted finite state transducer (WFST) approach against two neural approaches: the encoder-decoder recurrent neural network method and the recent, non-sequential Transformer method. In order to improve availability of bilingual named entity transliteration datasets, we release personal name bilingual dictionaries mined from Wikidata for English to Russian, Hebrew, Arabic, and Japanese Katakana. Our code and dictionaries are publicly available111


This work is licensed under a Creative Commons Attribution 4.0 International License. License details:

1 Introduction

Named entity transliteration is the process of converting a named entity from one language script to another, using the correct characters that represent the entity in the target language. It is an important component in many search and language understanding tasks, such as robust cross-language information retrieval (CLIR) and machine translation (MT), among others.

A possible simple transliteration approach is mapping every character (or sequence of characters) in the source language to its most common counterpart in the target language. However, spelling and pronunciation of many languages (e.g., English, Japanese) can be ambiguous and inconsistent [\citenameFushimi et al.1999]. As a result, most transliteration systems are data driven and use context for disambiguation; e.g., [\citenameYan and Nivre2016, \citenameHaizhou et al.2004, \citenameEkbal et al.2006].

While transliteration has been a long studied problem, some important aspects received little attention. There is not clear guidance that addresses a number of common design considerations faced when building a robust multilingual transliteration system, such as data representation and the huge gap in results depending on the language pairs and transliteration direction [\citenameRosca and Breuel2016]. Like many other NLP fields recently, neural transliteration systems have gained popularity. However, it is still unclear if neural systems consistently outperform traditional approaches and what architecture is ideal for this task. For example, in [\citenameYan and Nivre2016] the authors applied a stack of convolutional layers and simple recurrent layer on top for English to Chinese transliteration, which achieved competitive results but still below a phrase-based statistical machine translation system. Other works show that bidirectional long short-term memory (LSTM) neural networks and the encoder-decoder architecture [\citenameSutskever et al.2014] achieve comparable results with WFST based n-gram models that are considered state-of-the-art [\citenameRao et al.2015, \citenameYao and Zweig2015]. However, these works only studied grapheme-to-phoneme (G2P) conversion from English to standard English pronunciation sets, such as the CMU pronunciation dictionary [\citenameWeide2014]. Also, there is not empirical evidence comparing the neural encoder-decoder method with LSTM to the recently proposed Transformer [\citenameVaswani et al.2017] neural method on the transliteration task. The Transformer method uses a simple neural network architecture based solely on attention mechanisms. That motivated us to learn if it can produce strong results on transliteration as it did on translation.

The contributions in this paper are summarized as follows:

  • We enumerate and experimentally evaluate a number of design considerations important to building a named entity transliteration system such as: handling infrequent or unique name tokens in training and test data, building a bilingual training corpus from Wikidata222, and top-1 versus top- result performance.

  • We present empirical results comparing design choices across four different language/script pairs mined from Wikidata: English to Hebrew, English to Russian, English to Arabic, and English to Katakana. We are releasing each of the bilingual datasets and the particular train, development, and test splits to encourage future experimentation and benchmarking.

  • We empirically evaluate the traditional Weighted Finite State Transducer (WFST) approach against two neural network-based approaches: the sequence-to-sequence encoder-decoder architecture and the recent Transformer approach based of self-attention.

The rest of this paper is organized as follows: first, we briefly describe the traditional and neural approaches to transliteration present in the literature; second, we describe data collection considerations and our multilingual datasets; lastly, we describe our experimental setup, present a number of empirical results, and provide guidance based on our experience and analysis.

2 Transliteration Approaches

2.1 Traditional Approaches

Early transliteration works utilized dictionaries and phonetic resources to learn probabilistic mapping rules between languages [\citenameKnight and Graehl1998, \citenameStalls and Knight1998]. Later works introduced more robust methods that do not require an intermediate phonemic mapping and can learn direct orthographical mapping between two languages given a bilingual dictionary [\citenameHaizhou et al.2004]. Traditionally, such works are modeled based on the common Grapheme-to-Phoneme (G2P) joint-sequence modeling technique. Given word-to-pronunciation examples, an initial alignment between corresponding grapheme (word) and phoneme (pronunciation) sequences is learned. Then, a language model is learned on the aligned tokens (see [\citenameBisani and Ney2008] for a detailed description of joint-sequence models for G2P). The open source Phonetisaurus [\citenameNovak et al.2012] has shown to achieve state-of-the-art scores on different G2P tasks [\citenameThu et al.2016]. Phonetisaurus implements the common EM-driven sequence alignment algorithm, with a few constraints such as allowing only -to-one and one-to- alignments. It trains an n-gram language model from the aligned pairs, which is converted into a Weighted Finite-State Transducer (WFST). Decoding can then be done by extracting the shortest path through the phoneme lattice created via composition with the input word.

2.2 Neural Approaches

Recent work in Neural Machine Translation (NMT) has proposed a number of approaches to use neural networks in variable-length sequence-to-sequence tasks such as transliteration. The encoder-decoder architecture [\citenameSutskever et al.2014] is a recurrent neural network setup with two parts. An encoder is fed input tokens one at a time and encodes them into a hidden state vector. At the end of the input sequence, an end-of-sentence token is fed to signify the end of the encoding phase. Next, the hidden state output of the encoder is fed into the decoder. The decoder emits tokens and updated hidden states, which are recursively fed into itself, until there are no more output tokens to produce. An additional mechanism, attention [\citenameBahdanau et al.2014], allows the decoder to focus on different parts of the input sequence and capture long-range dependencies. More recently, the Transformer [\citenameVaswani et al.2017] model was proposed, which avoids the need for sequential processing, relying only on self-attention. A benefit of this approach is there is no information bottleneck in the encoded hidden state vector as in the Encoder-Decoder approach. Additionally, because there is no longer a sequential recurrent network, model training can be better parallelized, decreasing model training time.

Previous work applied variations of the encoder-decoder approach to the name transliteration task [\citenameRosca and Breuel2016] and grapheme-to-phoneme transduction [\citenameRao et al.2015], suggesting strong results on both. [\citenameRosca and Breuel2016] reports that on English to Japanese Katakana transliteration a unidirectional encoder-decoder with 2 hidden layers using gated recurrent unit (GRU) as the underlying neural cell type achieves the best result: a word error rate (WER) of .

3 Data

The transliteration shared task, as part of the named entities workshop (NEWS)333, has been a continuous effort of benchmarking different transliteration approaches and systems across different languages. The workshop released multiple multilingual datasets over the years. However, datasets from previous years are not publicly available and their license is restrictive. That motivated us to create new multilingual datasets based on Wikidata that will be publicly available and free for all.

Wikipedia has been widely used in transliteration works (e.g., [\citenameIrvine et al.2010, \citenameRosca and Breuel2016, \citenamePasternack and Roth2009]). Wikidata is a central knowledge base in all languages for Wikipedia and other Wikimedia projects. Most Wikidata pages contain labels444A Wikidata label is the most common name that the item would be known by. It does not need to be unique, in that multiple items can have the same label. in one or multiple languages. We automatically collected all “en” (English) labels where they pair with one of the following labels: “ja” (Japanese), “he” (Hebrew), “ar” (Arabic), and “ru” (Russian). We filtered out labels containing characters not belonging to their main script (e.g., English labels containing non-Latin characters). For Japanese, we only included labels with Katakana characters. For English tokens, we did not strip out diacritics555We did not remove any Latin script characters as these may carry useful information in the context of named entity transliteration. We did convert all characters to lowercase and strip some punctuation such as underscores, braces, exclamation marks, etc.

The first version of our dataset contained many types of entities, such as song and book titles. A quick analysis of the data revealed that many label pairs are translations and not transliterations. Consequently, we used the Wikidata “instance-of” property to only include labels of type “human”. There are significantly more pages of this type than any other type on Wikidata, and they are less noisy than other types for the transliteration task. Since no direction specific information was used in data gathering, we evaluate performance on both directions (e.g. English Arabic and Arabic English).

We also report performance on two publicly available datasets: (1) The CMU pronunciation dictionary [\citenameWeide2014]; and (2) The Arabic to English dataset extracted from Wikipedia titles in [\citenameRosca and Breuel2016]. The former is used mainly for evaluating G2P systems and not named entity transliteration, but the two tasks are related and many papers use it solely for evaluation.

4 Experimental Setup

4.1 Data Representation

Pronunciation of English names usually does not carry context across tokens. In the majority of names, the pronunciation of a last name is independent of the previous tokens (middle or first names). Hence, it makes sense to train a transliteration system on name pairs that consist of a single token on each side. Our Wikidata datasets contain many multi-token names. Consequently, for constructing the train and test examples, we learn the word alignments of multi-token names, which is a simple task since the majority of name pairs contain the same word order, and we are not dealing with CJK languages with no token boundaries. The majority of our Katakana names contain a middledot to separate the tokens. We throw away English to Katakana name pairs if the English name contains more than one token and the Katakana name has no middledots or spaces. It is worth noting that the majority of Russian names in Wikidata are written as “last, first” and not “first last” as most other languages.

After alignment, we construct our cross-validation splits with only single, unique name tokens. Splitting names into single token examples produces many duplicates. For example, every name that starts with “John” would produce a “John” example. For train, development, and test splits, we only include a single instance for each unique name token. When we encounter multiple possible target script transliterations for the same English token, we only include the most frequent target transliteration666We resolve ties by picking one of the targets arbitrarily. We found that including multiple transliteration targets instead of only the single most-frequent one, only impacts performance in a small way, and thus, we opted for the simpler approach.

Dataset Total Size Training Set Size Avg. Source Length Avg. Target Length Source Alphabet Size Target Alphabet Size
WD-EN-RU 164,640 105,371 7.0 6.6 188 62
WD-EN-KA 98,820 63,246 7.0 4.8 170 105
WD-EN-AR 74,973 41,584 6.7 5.9 145 67
WD-EN-HE 50,039 32,036 6.7 5.6 135 57
Rosca and B., 2016 15,898 12,877 6.0 6.8 48 39
CMUdict 126,191 113,438 7.5 6.3 27 39
Table 1: Dataset statistics. Note that “EN” refers to English labels in Wikidata, but in our setup it actually means Latin script. That is why the source alphabet size exceeds 100 for all the four Wikidata datasets. As expected, the “EN-RU” dataset has the largest Latin alphabet size (source alphabet size).

We used a typical 64/16/20 cross-validation split of the single token data. We used 64% of the data for training, 16% for a development evaluation set to pick hyperparameters, and a 20% held-out test set for metric reporting in Section 5. Our curated datasets are being released publicly in two forms. We are releasing the original name phrases where each line contains two tab-separated columns with the English name phrase in the first and the target script name phrase in the second column. Additionally, we are releasing the particular 64/16/20 splits of the aligned single token data in order that future researchers can replicate and benchmark against our results. Table 1 provides more information about the aligned, single token versions of each dataset.

For evaluating performance on the Arabic to English dataset from [\citenameRosca and Breuel2016], we used the same train, dev, and test splits used in the paper. For evaluating CMUdict performance, we used the same 90% train, 10% split used in [\citenameNovak et al.2012]. We used these datasets without making any changes.

4.2 Libraries

We evaluate the previously discussed approaches using the following libraries:

For the experiments using Phonetisaurus, we used the default configuration with an 8-order MITLM [\citenameHsu and Glass2008] language model. For the many-to-many alignment, we disallowed -transitions on the source side and allowed them on the target side (with the --seq2_del option). For the encoder-decoder experiments, we simply adapted the Seq2Seq tutorial [\citenameLuong et al.2017] to use the library for our character-level transliteration task. We used an LSTM cell-type, 2 hidden layers, 128 units per layer, a dropout probability of 0.2, and default ‘luong’ attention mechanism. Lastly, for the Tensor2Tensor experiments, we used the default configuration but tested with both 64 units per layer and 128 units per layer. Using 128 units per layer improved WER for every language by  2-3 points, and thus we only report results with 128 units. All of the scripts used in these experiments along with the mined Wikidata datasets are publicly available101010

4.3 Evaluation Metric

Since many names may have multiple correct transliterations, in most experiments, we report the Word Error Rate (WER) metric, which is based on the proportion of name tokens that were transliterated and exactly match the expected target script from the test set (ground truth). The lower the WER score, the better. We report WER scores as -best, -best, and -best, which reflects when the correct transliteration occurs in the top spot, or in one of the top 2 spots, or in one of the top 3 spots. Since we do not penalize bad transliterations, 2-best is always at least as accurate as 1-best, and 3-best is always at least as accurate as 1-best and 2-best. We feel it is important to report in this way, because in many cases, such as information retrieval, it is feasible to generate multiple possible transliterations and use all of them at search time to improve recall. In other cases, such as text-to-speech, the top-1 result is the only result that matters, because the system only has a single opportunity to provide the correct transliteration.

5 Experimental Results

Table 2 summarizes the results of each method on different datasets. The recent Tensor2Tensor Transformer architecture outperforms the WFST approach and the Seq2Seq approach on every language. However, it is worth noting that the training time using Tensor2Tensor is between 5-8 hours using an AWS p3.2xlarge111111 instance type, which has a Tesla V100 GPU. The Phonetisaurus training time only takes a couple of minutes on a typical dual-core Intel Core-i7 personal laptop. We did not expect the WER difference between T2T and Phonetisaurus to be that significant. Our hypothesis was that given the small number of characters in every language and small average input size, the n-gram based models would be hard to beat. NMT approaches have a clear advantage in handling long term dependencies, but we expected that the small input size of our named entity tokens implied few important long term dependencies. The CMUdict dataset is the only case in which Phonetisaurus (slightly) outperforms the neural Transformer approach. One explanation is that the CMUdict dataset is not that challenging, as can be seen by the difference in WER compared to all other datasets.

Task Dataset Method 1-best WER 2-best WER 3-best WER
English Arabic WD-EN-AR T2T 0.45 0.30 0.24
Seq2Seq 0.53 0.41 0.36
Phonetisaurus 0.51 0.37 0.30

English Hebrew
WD-EN-HE T2T 0.44 0.27 0.22
Seq2Seq 0.49 0.36 0.31
Phonetisaurus 0.49 0.32 0.26

English Katakana
WD-EN-KA T2T 0.51 0.35 0.29
Seq2Seq 0.60 0.48 0.43
Phonetisaurus 0.57 0.43 0.36

English Russian
WD-EN-RU T2T 0.35 0.22 0.17
Seq2Seq 0.40 0.29 0.25
Phonetisaurus 0.38 0.24 0.19

Arabic English
Rosca and B., 2016 T2T 0.75 0.63 0.57
Seq2Seq 0.81 0.71 0.67
Phonetisaurus 0.75 0.65 0.59

English ARPAbet
CMUdict T2T 0.29 0.16 0.11
Seq2Seq 0.29 0.18 0.14
Phonetisaurus 0.27 0.14 0.10
Table 2: Word Error Rate (WER) using different methods for named entity transliteration. “WD” refers to our produced Wikidata datasets. “T2T” refers to the Tensor2Tensor system.
Figure 1: Phonetisaurus: Impact of the number of training examples on WER.

Another observation is the WER gap between the languages. Among the English to tasks, Katakana has the worse WER, followed by Arabic that is only slightly behind Hebrew. Given that Arabic and Hebrew belong to the same language family, it is not surprising that their WER is similar. Russian has the best WER by a large margin. It also has the largest training data and it is the closest language to English among the four evaluated. We were interested to learn the impact of training size on the error rate. Figure 2 shows learning curves for various datasets. We used Phonetisaurus since it was the fastest to train. One interesting observation is that training size is not a big factor. When the number of training examples is equal across all languages, Russian still has the best WER by a large margin and Katakana the worst by a large margin. Another interesting observation is that we reach close to optimal performance with about 50% of the training data. For example, WER on WD-EN-RU (English to Russian) is 0.40 and 0.38 after training on 50K and 100K examples, respectively. CMUdict has the steepest reduction in WER as the number of training examples increases, with WER of 0.33 and 0.28 after training on 50K and 100K examples, respectively.

The seq2seq system consistently performed the worst. Based on design guidance in Neural Machine Translation literature [\citenameSutskever et al.2014], we experimented with feeding the source string in reverse order as this showed a significant improvement in encoder-decoder approaches. In our experiments, this did not meaningfully impact the WER. The resulting score was either identical or within one point for each of the language pairs that we tested.

5.1 Removing tokens that occur only once

One design question that we set out to answer is how to be handle unique name token pairs in dataset curation (e.g., the ‘EN-KA’ transliteration pair [“escalada”, “{CJK}UTF8minエスカラーダ”] occurs only once in the data we collected from Wikidata). Name tokens, like other language tokens, follow a Zipfian distribution and thus there are likely to be many tokens that appear infrequently. On the one hand, tokens that occur only once may be erroneous, and it may be better to exclude them from the training and test sets. Taking the English to Hebrew dataset as an example, 77% of the name tokens only occur once. Our hypothesis was that given this distribution of token frequency, excluding them would likely remove more useful information than noise. Table 3 illustrates the impact of excluding single occurrence name tokens from both train and test against a baseline of training and testing on everything. All of these results use the Transformer neural approach with 128 units per layer and 2 hidden layers.

Comparing column 1 (baseline) and column 2 in Table 3 shows that excluding the infrequent tokens from test, results in a surprisingly high reduction in WER. If we also filter the tokens from train (as shown in column 4), we see an increase in WER. The worse WER is achieved when filtering the train set but testing on the full test set. In English to Hebrew (WD-EN-HE), filtering the train set but testing on the full set increases the WER by 7.0 points over the baseline. Industrial transliteration systems will contain dictionary look-ups for frequent named entity transliterations and employ automated methods, as evaluated here, to handle out-of-vocabulary cases. Thus, it stands to reason that a fair evaluation of automated methods should include infrequent names even in the presence of some noise. The results presented in other sections include unique tokens at both train and test time.

An interesting question is why infrequent names in our data happen to be the hardest instances. One possible factor is that these names are less known to the public so there is no spelling convention to follow, leading to higher ambiguity. Wikidata labels are updated by many people from all over the world, hence name frequency can be thought as a proxy for annotator agreement. We also found that infrequent names in our data are longer on average, with a strong negative correlation between token frequency and token length. For example, the average length of all single occurrence English name tokens based on all of our test datasets is 7.1 characters, while the average length of all other English name tokens (at least two occurrences) is 6.4. Word-based evaluation metrics like we use here might be biased towards shorter names. Some prior works have used character-based metrics for transliteration, such as length-normalized edit distance [\citenameIrvine et al.2010, \citenameNoeman and Madkour2010], which might be a good alternative (depending on the application).

5.2 English as source or target language

Dataset Full Train, Full Test (Baseline) Full Train, Filtered Test Filtered Train, Full Test Filtered Train, Filtered Test
WD-EN-AR 0.45 0.32 0.50 0.35
WD-EN-HE 0.43 0.31 0.50 0.35
WD-EN-RU 0.35 0.24 0.39 0.26
WD-EN-KA 0.50 0.36 0.56 0.41
Table 3: T2T 1-Best Word Error Rate (WER) when filtering single occurrence name tokens from training and/or testing.
Dataset Task 1-best WER 2-best WER 3-best WER
WD-EN-AR English Arabic 0.45 0.30 0.24
WD-EN-AR Arabic English 0.75 0.64 0.58
WD-EN-HE English Hebrew 0.44 0.27 0.22
WD-EN-HE Hebrew English 0.77 0.65 0.59
WD-EN-KA English Katakana 0.51 0.35 0.29
WD-EN-KA Katakana English 0.70 0.57 0.50
WD-EN-RU English Russian 0.35 0.22 0.17
WD-EN-RU Russian English 0.47 0.35 0.30
Table 4: T2T Word Error Rate (WER) when modeling as English to versus to English.

We performed experiments comparing the performance of modeling the transliteration problem as English  versus  English. Our hypothesis was that learning the transliteration with English as the source language was going to result in a lower word error rate due to the loss of information when going to languages like Hebrew or Arabic. Table 4 reports the differences in these two approaches. In every case using English as the source language results in much better word error rate, but the impact varies depending on the language. Hebrew and Arabic are impacted the most, and Russian is the least penalized. Previous works that focus on back-transliterating (recovering names of English origin) report similar findings [\citenameIrvine et al.2010]. In the error analysis section we provide some insights on why the reverse task is harder.

5.3 Error Analysis

Given the nature of Wikidata, our extracted datasets contain a diverse set of names from different origins, including many foreign names with different linguistic conventions. Names can be pronounced differently depending on origin and context can play a role as well. This explains why only outputting the top transliteration is usually not enough, as our results show. One example showing how hard the task can be is the Irish name “Domhnall”, pronounced as DONAL, which none of the models transliterated correctly. The Hebrew, Arabic, and Russian models all included an ‘m’ in each of their 3-best outputs.

Figure 2 shows error examples in the various tasks. Vowel confusion is a common error. As expected, the Hebrew and Arabic models are affected by it the most. The problem becomes even worse when English is the target language. Hebrew and Arabic names in Wikidata, like in most of the web, do not include diacritical marks. This means that often English names with vowels are written in Arabic and Hebrew without the vowels. For example, “barak” is written as “brk” (in Hebrew letters) in modern Hebrew. When transliterating to English, the model often needs to recover the missing vowels. One of the figure examples is the name “Rashid”/“Rashed” that was transliterated incorrectly since the model failed to recover the letter ‘i’/‘e’ that is not included in the original Arabic name. Vowel confusion is also a problem in Japanese, as the “ewan” example in the figure shows (the name Ewan MacGregor is quite known in Japan). The third best transliteration is “{CJK}UTF8minユーアン”, which differs from the correct transliteration only by the Katakana long vowel “{CJK}UTF8minー”. There seem to be different vowel variations in Katakana, such as long vs. short vowels. We found it often happens in the word final position. Having a special treatment to handle such cases could significantly boost accuracy. Some errors are more language specific. For example, in English to Russian the model often fails on the letter ‘w’ that can be ambiguous since the sound does not exist in Russian. The figure shows how the model made the same error in all three transliterations of the name “edwarda”, replacing ‘w’ with the Russian letter ‘B’ (’v’ sound), which is commonly used as a replacement.

Figure 2: Model errors across different languages

Another interesting observation is that the models also fail on common names. This happens in all languages but more frequently occurs when English is the target language. For example, the Hebrew to English model transliterated “hillary” from Hebrew to English as “hilari”, “hilarry”, and “hillari”. We believe this is a result of our problem setup. Since all our test names are unseen in training and every name only occurs once without making use of frequency info, the language model cannot learn that “hillary” is more common than “hilarry”, for example, unless it is a sub-token of a longer name in training. Incorporating name frequency or large character-based n-gram language models learned from a large corpus (as in [\citenameAl-Onaizan and Knight2002, \citenameIrvine et al.2010]) could possibly help such cases.

6 Future Directions

We will continue exploring best practices for multilingual transliteration. We are planning to evaluate design criteria for doing transliteration with CJK languages, where alignment is a more difficult problem. We are also interested to understand better the relationship between token frequency and transliteration quality. Resources like Wikidata are extremely useful for curating large multilingual datasets, but some thought should be given to how we can reduce bias and noise with minimal effort, aiming to achieve quality that is comparable with manually annotated datasets. We believe that token frequency can be used as a proxy for annotator agreement and are planning to study this further. Finally, we are interested to study how well character-based NMT systems such as [\citenameLing et al.2015] perform on named entity transliteration.

7 Conclusions

We described a number of design considerations that one must address when building a robust, multilingual named entity transliteration system. We empirically demonstrated that the Tensor2Tensor neural Transformer method produces consistently strong results against the LSTM-based encoder-decoder neural method and WFST-based traditional method. We described a number of challenges and design choices when building bilingual dictionaries from mined knowledge bases, such as Wikidata. Our Wikidata curated datasets, including both the name phrase bilingual data and the aligned, single token datasets are being released for further experimentation and benchmarking.


  • [\citenameAbadi et al.2015] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from
  • [\citenameAl-Onaizan and Knight2002] Yaser Al-Onaizan and Kevin Knight. 2002. Translating named entities using monolingual and bilingual resources. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 400–408. Association for Computational Linguistics.
  • [\citenameBahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
  • [\citenameBisani and Ney2008] Maximilian Bisani and Hermann Ney. 2008. Joint-sequence models for grapheme-to-phoneme conversion. Speech communication, 50(5):434–451.
  • [\citenameEkbal et al.2006] Asif Ekbal, Sudip Kumar Naskar, and Sivaji Bandyopadhyay. 2006. A modified joint source-channel model for transliteration. In Proceedings of the COLING/ACL on Main conference poster sessions, pages 191–198. Association for Computational Linguistics.
  • [\citenameFushimi et al.1999] Takao Fushimi, Matsuo Ijuin, Karalyn Patterson, and Itaru F Tatsumi. 1999. Consistency, frequency, and lexicality effects in naming japanese kanji. Journal of Experimental Psychology: Human Perception and Performance, 25(2):382.
  • [\citenameHaizhou et al.2004] Li Haizhou, Zhang Min, and Su Jian. 2004. A joint source-channel model for machine transliteration. In Proceedings of the 42nd Annual Meeting on association for Computational Linguistics, page 159. Association for Computational Linguistics.
  • [\citenameHsu and Glass2008] Bo-June Hsu and James Glass. 2008. Iterative language model estimation: efficient data structure & algorithms. In Ninth Annual Conference of the International Speech Communication Association.
  • [\citenameIrvine et al.2010] Ann Irvine, Chris Callison-Burch, and Alexandre Klementiev. 2010. Transliterating from all languages. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), pages 100–110.
  • [\citenameKnight and Graehl1998] Kevin Knight and Jonathan Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599–612.
  • [\citenameLing et al.2015] Wang Ling, Isabel Trancoso, Chris Dyer, and Alan W Black. 2015. Character-based neural machine translation. arXiv preprint arXiv:1511.04586.
  • [\citenameLuong et al.2017] Minh-Thang Luong, Eugene Brevdo, and Rui Zhao. 2017. Neural machine translation (seq2seq) tutorial.
  • [\citenameNoeman and Madkour2010] Sara Noeman and Amgad Madkour. 2010. Language independent transliteration mining system using finite state automata framework. In Proceedings of the 2010 Named Entities Workshop, pages 57–61. Association for Computational Linguistics.
  • [\citenameNovak et al.2012] Josef R Novak, Nobuaki Minematsu, and Keikichi Hirose. 2012. Wfst-based grapheme-to-phoneme conversion: Open source tools for alignment, model-building and decoding. In Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing, pages 45–49.
  • [\citenamePasternack and Roth2009] Jeff Pasternack and Dan Roth. 2009. Learning better transliterations. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 177–186. ACM.
  • [\citenameRao et al.2015] Kanishka Rao, Fuchun Peng, Haşim Sak, and Françoise Beaufays. 2015. Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 4225–4229. IEEE.
  • [\citenameRosca and Breuel2016] Mihaela Rosca and Thomas Breuel. 2016. Sequence-to-sequence neural network models for transliteration. arXiv preprint arXiv:1610.09565.
  • [\citenameStalls and Knight1998] Bonnie Glover Stalls and Kevin Knight. 1998. Translating names and technical terms in arabic text. In Proceedings of the Workshop on Computational Approaches to Semitic Languages, pages 34–41. Association for Computational Linguistics.
  • [\citenameSutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
  • [\citenameThu et al.2016] Ye Kyaw Thu, Win Pa Pa, Yoshinori Sagisaka, and Naoto Iwahashi. 2016. Comparison of grapheme-to-phoneme conversion methods on a myanmar pronunciation dictionary. In Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016), pages 11–22.
  • [\citenameVaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.
  • [\citenameWeide2014] R Weide. 2014. The CMU pronunciation dictionary, release 0.7b.
  • [\citenameYan and Nivre2016] Shao Yan and Joakim Nivre. 2016. Applying neural networks to english-chinese named entity transliteration. In Sixth Named Entity Workshop, joint with 54th ACL.
  • [\citenameYao and Zweig2015] Kaisheng Yao and Geoffrey Zweig. 2015. Sequence-to-sequence neural net models for grapheme-to-phoneme conversion. arXiv preprint arXiv:1506.00196.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description