Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance

Soumil Mandal, Karthick Nanmaran

Department of Computer Science & Engineering
SRM Institute of Science & Technology, Chennai, India
{soumil.mandal, karthicknanmaran}

Building tools for code-mixed data is rapidly gaining popularity in the NLP research community, as such data is growing rapidly on social media. Working with code-mixed data poses several challenges, especially grammatical inconsistencies and spelling variations, in addition to all the previously known challenges of social media text. In this article, we present a novel architecture focused on normalizing phonetic typing variations, which are commonly seen in code-mixed data. One of the main features of our architecture is that, in addition to normalization, it can also be used for back-transliteration and word identification in some cases. Our model achieved an accuracy of 90.27% on the test data.


1 Introduction

With the rising popularity of social media, the amount of data is rising exponentially. If mined, this data can prove to be useful for various purposes. In countries where the number of bilinguals is high, users tend to switch back and forth between multiple languages, a phenomenon known as code-mixing or code-switching. An interesting case is switching between languages that have different native scripts. On such occasions, one of the two languages is typed in its phonetically transliterated form in order to use a common script. Though there are some standard transliteration schemes, for example ITRANS and ISO, it is extremely difficult and unrealistic for people to follow them while typing. In practice, identical words are transliterated differently by different people based on their own phonetic judgment, influenced by dialect, location, or sometimes even the informality or casualness of the situation. Thus, for systems built on code-mixed data, normalization of transliterated text after language tagging is extremely important in order to identify each word and understand its semantics. This would help considerably in systems like opinion mining, and is necessary for tasks like summarization, translation, etc. A normalizing module will also be of immense help when building word embeddings for code-mixed data.
      In this paper, we present an architecture for automatic normalization of phonetically transliterated words to their standard forms. The language pair we have worked on is Bengali-English (Bn-En), where both are typed in Roman script, thus the Bengali words are in their transliterated form. The canonical or normalized form we have considered is the Indian Languages Transliteration (ITRANS) form of the respective Bengali word. Bengali is an Indo-Aryan language spoken by 8.10% of the total population of India, and it is also the official language of Bangladesh. The native script of Bengali is the Eastern Nagari script. Our architecture utilizes fully character-based sequence to sequence learning in addition to Levenshtein distance to give the final normalized form, or a form as close to it as possible. An additional advantage of our system is that, at an intermediate stage, the back-transliterated form of the word can be fetched (i.e. word identification), which is useful in several cases because original tools (i.e. tools using the native script) can then be utilized, for example emotion lexicons. Other important contributions of our research are the new lexicons that have been prepared (discussed in Sec 3), which can be used for building various other tools for studying Bengali-English code-mixed data.

2 Related Work

Normalization of text has been studied extensively (Sproat et al. (1999)), especially as it acts as a pre-processing step for several text processing systems. Using Conditional Random Fields (CRF), Zhu et al. (2007) performed text normalization on informal emails. Dutta et al. (2015) created a system based on the noisy channel model for text normalization, which handles wordplay, contracted words and phonetic variations in a code-mixed background. An unsupervised framework was presented by Sridhar (2015) for normalizing domain-specific and informal noisy texts using distributed representations of words. The soundex algorithm was used in Sitaram et al. (2015) and Sitaram and Black (2016) for spelling correction of transliterated words and for normalization in a speech to text scenario of code-mixed data, respectively. Sharma et al. (2016) built a normalization system using the noisy channel framework and the SILPA spell checker in order to build a shallow parser. Sproat and Jaitly (2016) built a system combining two models, where one is essentially a seq2seq model which checks the possible normalizations and the other is a language model which considers context information. Jaitly and Sproat (2017) used a seq2seq model with attention trained at sentence level, followed by error pruning using finite-state filters, to build a normalization system mainly targeted at text to speech purposes. A similar flow was adopted by Zare and Rohatgi (2017), where seq2seq was used for normalization and a window of size 20 was considered for context. Singh et al. (2018) exploited the fact that words and their variations share similar contexts in large noisy text corpora to build their normalizing model, using skip-gram and clustering techniques. To the best of our knowledge, the system architecture we propose has not been tried before, especially for code-mixed data.

3 Data Sets

On the whole, three data sets or lexicons were created. The first data set was a parallel lexicon (PL) where the 1st column had phonetically transliterated Bn words in Roman script taken from the code-mixed data prepared in Mandal et al. (2018b). The 2nd column consisted of the standard Roman transliterations (ITRANS) of the respective words. To get this, we first manually back-transliterated PL_col_1 to the original word in Eastern Nagari script, and then converted it into the standardized ITRANS format. The final size of the PL was 6000 entries. The second lexicon we created was a transliteration dictionary (BN_TRANS) where the first column had Bengali words in Eastern Nagari script taken from Samsad, while the second column had the standard transliterations (ITRANS). The number of entries in the dictionary was 21850. For testing, we took the data used in Mandal and Das (2018), language tagged it using the system described in Mandal et al. (2018a), and then collected the Bn tagged tokens. After manual checking and discarding of misclassified tokens, the size of the list was 905. Finally, each of the words was tagged with its ITRANS form using the same approach used for making PL. For PL_col_1 and the test data, some initial rule-based normalization techniques were used. If the input string contained a digit, it was replaced by the respective phone (e.g. ek for 1, dui for 2, etc.), and if there were n consecutive identical characters where n > 2 (elongation), they were trimmed down to 2 consecutive characters (e.g. baaaad becomes baad), as no word in its standard form has more than two consecutive identical characters.
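The two rule-based steps described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name pre_normalize and the table DIGIT_PHONES are hypothetical, only the phones for 1 and 2 come from the text (the remaining digits would have to be filled in), and applying digit replacement before elongation trimming is an assumption.

```python
import re

# Digit-to-phone table. Only "ek" and "dui" are given in the text;
# entries for the other digits are deliberately left out here.
DIGIT_PHONES = {"1": "ek", "2": "dui"}


def pre_normalize(word, digit_phones=DIGIT_PHONES):
    """Apply the two rule-based steps to a single token."""
    # Step 1: replace each known digit by its phone; other characters pass through.
    word = "".join(digit_phones.get(ch, ch) for ch in word)
    # Step 2: trim runs of 3 or more identical characters down to exactly 2.
    return re.sub(r"(.)\1{2,}", r"\1\1", word)
```

For instance, pre_normalize("baaaad") gives "baad", while a word with at most two repeated characters is returned unchanged.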

4 Proposed Method

Our method is a two-step modular approach comprising two degrees of normalization. The 1° normalization module performs an initial normalization and tries to bring the input string as close as possible to the standard transliteration. The 2° normalization module takes the output of the first module and matches it against the standard transliterations present in the dictionary (BN_TRANS). The closest-matching candidate is returned as the final normalized string.

5 1° Normalization Module

The purpose of this module is to phonetically normalize the word as close to the standard transliteration as possible, to make the work of the matching module easier. To achieve this, our idea was to train a sequence to sequence model where the input sequences are user transliterated words and the target sequences are the respective ITRANS transliterations. We specifically chose this architecture as it has performed very well in complex sequence mapping tasks like neural machine translation and summarization.

5.1 Seq2Seq Model

The sequence to sequence model (Sutskever et al. (2014)) is a relatively new idea for sequence learning using neural networks. It has been especially popular since it achieved state-of-the-art results in the machine translation task. Essentially, the model takes as input a sequence X = {x_1, x_2, …, x_n} and tries to generate another sequence Y = {y_1, y_2, …, y_m}, where x_i and y_i are the input and target symbols respectively. The architecture of the seq2seq model comprises two parts, the encoder and the decoder. As the input and target vectors were quite small (words), the attention mechanism (Vaswani et al. (2017)) was not incorporated.

5.1.1 Encoder

The encoder essentially takes a variable length sequence as input and encodes it into a fixed length vector, which is supposed to summarize its meaning while taking context into account as well. A recurrent neural network (RNN) cell is used to achieve this. The directional encoder reads the sequence from one end to the other (left to right in our case).

Here, E_x is the input embedding lookup table (dictionary), and f_enc is the transfer function of the recurrent unit, e.g. vanilla RNN, LSTM or GRU. A contiguous sequence of encodings C = {h_1, h_2, …, h_n} is constructed, which is then passed on to the decoder.

5.1.2 Decoder

The decoder takes the context C from the encoder as input and computes its own hidden state at each time step t.

Subsequently, a parametric function out_k returns the conditional probability of the next target symbol being k. Here, the concept of teacher forcing is utilized, i.e. the strategy of feeding the ground-truth target symbol from the prior time step as the decoder input during training, rather than the model's own prediction.

Here, Z is the normalizing constant of the probability distribution over target symbols.
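One way to write out the recurrences described in Sections 5.1.1 and 5.1.2 is sketched below, in the style of the character-level formulation of Lee et al. (2016) with the attention term dropped. This is an assumed form rather than a definitive statement of the model: the symbols s_t, f_dec and E_y (decoder state, decoder transfer function and target embedding table) are names introduced here for illustration, and the decoder is assumed to be conditioned on the encoder only through its final state h_n.

```latex
\begin{align}
h_t &= f_{enc}\big(E_x(x_t),\, h_{t-1}\big) \\
s_t &= f_{dec}\big(E_y(y_{t-1}),\, s_{t-1}\big), \qquad s_0 = h_n \\
p(y_t = k \mid y_{<t}, X) &= \frac{1}{Z}\,\exp\big(out_k(E_y(y_{t-1}),\, s_t)\big),
\qquad Z = \sum_{j} \exp\big(out_j(E_y(y_{t-1}),\, s_t)\big)
\end{align}
```

Under teacher forcing, y_{t-1} in the second and third lines is the ground-truth previous target symbol during training.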

5.2 Training

The model is trained by minimizing the negative log-likelihood. For training, we used the fully character-based seq2seq model (Lee et al. (2016)) with stacked LSTM cells. The input units were user typed phonetic transliterations (PL_col_1) while the target units were the respective standard transliterations (PL_col_2). Thus, the model learns to map user transliterations to standard transliterations, effectively learning to normalize phonetic variations. The lookup table E_x we used for character encoding was a dictionary where the keys were the 26 letters of the English alphabet and the values were the respective indices. Encodings at character level were then padded to the length of the longest word in the data set, which was 14, and were converted to one-hot encodings prior to feeding them to the seq2seq model. We created our seq2seq model using the Keras (Chollet et al. (2015)) library. The batch size was set to 64, and the number of epochs was set to 100. The size of the latent dimension was kept at 128. The optimizer we chose was rmsprop with the learning rate set at 0.001, the loss was categorical crossentropy, and the activation (transfer) function used was softmax. Accuracy and loss graphs during training with respect to epochs are shown in Fig 1.
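The character encoding step described above (index lookup, padding to length 14, one-hot conversion) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the names CHAR_INDEX, encode and one_hot are hypothetical, index 0 is assumed to be reserved for padding (the text does not say how padding is represented), and input is lowercased so that only the 26 letters occur.

```python
import string

MAX_LEN = 14  # length of the longest word, as stated in the text

# Lookup table: letter -> index. Index 0 is reserved for padding (assumption).
CHAR_INDEX = {ch: i for i, ch in enumerate(string.ascii_lowercase, start=1)}
NUM_TOKENS = len(CHAR_INDEX) + 1  # 26 letters + 1 padding symbol


def encode(word, max_len=MAX_LEN):
    """Map a word to a list of character indices, right-padded with 0."""
    indices = [CHAR_INDEX[ch] for ch in word.lower()]
    if len(indices) > max_len:
        raise ValueError("word is longer than max_len")
    return indices + [0] * (max_len - len(indices))


def one_hot(indices, num_tokens=NUM_TOKENS):
    """Turn a list of indices into a list of one-hot rows."""
    return [[1 if i == idx else 0 for i in range(num_tokens)] for idx in indices]
```

Each word then becomes a 14 x 27 matrix; a batch of such matrices is the kind of input a Keras LSTM layer expects. Characters outside a-z raise a KeyError in this sketch, so a target side with a larger symbol set would need a larger table.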

Figure 1: Training accuracy and loss.

As we can see from Fig 1, the accuracy reached at the end of training was not very high (around 41.2%) and the curve had flattened out. This is understandable, as the amount of training data was relatively low for the task and the phonetic variations were quite high. On running this module on our testing data, an accuracy of 51.04% was achieved. It should be noted that even a single character predicted wrongly by the softmax function reduces the accuracy.

6 2° Normalization Module

This module consists of a string matching algorithm. For this, we have used Levenshtein distance (LD) (Levenshtein (1966)), which is a string metric for measuring the difference between two sequences. It does so by calculating the minimum number of insertions, deletions and substitutions required to convert one sequence into the other. Here, the output from the previous module is compared with all the standard ITRANS entries present in BN_TRANS, and the string with the least Levenshtein distance is given as output, which is the final normalized form. If there are ties, the candidate with more matches when traversing from left to right is given priority. Also, on observing the errors of the 1° normalizer, we noticed that the character pairs {a,o} and {b,v} are often used interchangeably, both among phonetic transliterations and when compared with ITRANS. Thus, along with the standard approach, we tried a modified version as well, where the two characters of each such pair are treated as identical, i.e. substituting one for the other has no cost. This was done simply by assigning a special symbol to each pair and replacing the characters in the input parameters. For example, post replacement, distance(chalo, chala) will become distance(ch$l$, ch$l$).
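The matching step can be sketched as follows. This is an illustrative sketch, not the exact implementation used in the experiments: the function names are hypothetical, the symbol "#" for the {b,v} pair is an assumption (the text only shows "$" for {a,o}), and the tie-breaking rule is interpreted here as preferring the candidate with the longest common prefix.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Interchangeable pairs collapsed to one symbol each ("#" is an assumed choice).
MERGE = str.maketrans({"a": "$", "o": "$", "b": "#", "v": "#"})


def merge_pairs(word):
    return word.translate(MERGE)


def common_prefix_len(a, b):
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def closest_entry(word, lexicon, modified=True):
    """Return the lexicon entry nearest to word; ties go to the longer common prefix."""
    key = merge_pairs(word) if modified else word

    def rank(entry):
        cand = merge_pairs(entry) if modified else entry
        return (levenshtein(key, cand), -common_prefix_len(key, cand))

    return min(lexicon, key=rank)
```

With modified=True, chalo and chala both map to ch$l$ and so have distance 0, whereas the standard distance between them is 1.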

7 Evaluation

Our system was evaluated in two ways, one at word level and another at task level.

7.1 Word Level

Here, the basic idea was to compare the normalized words with the respective standard transliterations. For this, the testing data discussed in Sec 3 was used. For comparison purposes, three setups other than our proposed model (setup_4) were tested, all of which are described in Table 1.

Model     1° Norm   LD         Acc (%)
setup_1   no        standard   58.78
setup_2   no        modified   61.10
setup_3   yes       standard   89.72
setup_4   yes       modified   90.27
Table 1: Comparison of different setups.

From Table 1, we can see that the jump in accuracy from setup_1 to setup_3 is quite significant (30.94 percentage points). This indicates that, compared with a simple distance comparison against lexicon entries, a prior seq2seq normalization can have a great impact on performance. Additionally, we can see that when modified input is given to the Levenshtein distance (LD), the accuracies achieved are slightly better. On analyzing the errors, we found that the majority (92%) of them were due to the standard form not being present in BN_TRANS, i.e. being out of vocabulary. These words were mostly slang, expressions, or two words joined into a single one. The other 8% were due to the 1° module causing substantial deviation from the normal form. For deeper analysis, we collected the ITRANS forms of the out-of-vocabulary errors, and on comparison with the 1° normalizations, the mean LD was calculated to be 1.89, which suggests that if they had been present in BN_TRANS, the normalizer would have given the correct output.

7.2 Task Level

For task level evaluation, we decided to go with sentiment analysis using the exact setup and data described in Mandal et al. (2018b), on Bengali-English code-mixed data. All the training and testing data were normalized using our system along with the lexicons that are mentioned. Finally, the same steps were followed and the different metrics were calculated. The comparison of the systems prior (noisy) and post normalization (normalized) is shown in Table 2.

Model        Acc     Prec    Rec     F1
noisy        80.97   81.03   80.97   81.20
normalized   82.47   82.52   82.47   82.61
Table 2: Prior and post normalization results.

We can see an improvement in accuracy (by 1.5 percentage points). On further investigation, we saw that the unigram, bigram and trigram matches between the bag of n-grams and the testing data increased by 1.6%, 0.4% and 0.1% respectively. The accuracy could be improved further if back-transliteration were done and Bengali sentiment lexicons were used, but that is beyond the scope of this paper.

8 Discussion

Though our proposed model achieved high accuracy, it has some drawbacks. The first is the requirement of a parallel corpus (PL) for training the seq2seq model, as manual checking and back-transliteration are quite tedious. The processing speed in terms of words/second is not very high, because both seq2seq inference and Levenshtein distance calculation are computationally heavy, in addition to the O(n) search over the lexicon entries. For string matching, simpler and faster methods can be tested, and search area reduction algorithms (e.g. specifying search domains depending on the starting character) can be tried to improve the processing speed. A simple lexical checker can be added as well before the seq2seq and/or matching module to see if the word is already in its standard transliterated form.
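The search area reduction idea mentioned above, restricting candidates by starting character, could look roughly like the sketch below. It is a hypothetical illustration rather than something evaluated in this work: the names build_index and candidates are assumptions, and falling back to the full lexicon when a bucket is empty is one possible design choice.

```python
from collections import defaultdict


def build_index(lexicon):
    """Group lexicon entries by their first character."""
    index = defaultdict(list)
    for entry in lexicon:
        if entry:
            index[entry[0]].append(entry)
    return index


def candidates(word, index, lexicon):
    """Return the bucket sharing the word's first character, else the whole lexicon."""
    bucket = index.get(word[:1])
    return bucket if bucket else lexicon
```

The distance computation is then run only over candidates(word, index, lexicon). Since {a,o} and {b,v} are often interchanged, the first character itself may vary, so the index would probably need to be built on the pair-merged forms for this to be safe.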

9 Conclusion & Future Work

In this article, we have presented a novel architecture for the normalization of transliterated words in a code-mixed scenario. We have employed the seq2seq model with LSTM cells for initial normalization, followed by evaluation of the Levenshtein distance to retrieve the standard transliteration from a lexicon. Our approach achieved an accuracy of 90.27% on the testing data, and improved the accuracy of a pre-existing sentiment analysis system by 1.5 percentage points. In future, we would like to collect more transliterated words and increase the data size in order to improve both PL and BN_TRANS. Combining this module with a context capturing system and expanding to other Indic languages such as Hindi and Tamil are further goals.


References

  • Chollet et al. (2015) François Chollet et al. 2015. Keras.
  • Dutta et al. (2015) Sukanya Dutta, Tista Saha, Somnath Banerjee, and Sudip Kumar Naskar. 2015. Text normalization in code-mixed social media text. In Recent Trends in Information Systems (ReTIS), 2015 IEEE 2nd International Conference on, pages 378–382. IEEE.
  • Jaitly and Sproat (2017) Navdeep Jaitly and Richard Sproat. 2017. An rnn model of text normalization.
  • Lee et al. (2016) Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2016. Fully character-level neural machine translation without explicit segmentation. arXiv preprint arXiv:1610.03017.
  • Levenshtein (1966) Vladimir I Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10, pages 707–710.
  • Mandal and Das (2018) Soumil Mandal and Dipankar Das. 2018. Analyzing roles of classifiers and code-mixed factors for sentiment identification. arXiv preprint arXiv:1801.02581.
  • Mandal et al. (2018a) Soumil Mandal, Sourya Dipta Das, and Dipankar Das. 2018a. Language identification of bengali-english code-mixed data using character & phonetic based lstm models. arXiv preprint arXiv:1803.03859.
  • Mandal et al. (2018b) Soumil Mandal, Sainik Kumar Mahata, and Dipankar Das. 2018b. Preparing bengali-english code-mixed corpus for sentiment analysis of indian languages. arXiv preprint arXiv:1803.04000.
  • Sharma et al. (2016) Arnav Sharma, Sakshi Gupta, Raveesh Motlani, Piyush Bansal, Manish Srivastava, Radhika Mamidi, and Dipti M Sharma. 2016. Shallow parsing pipeline for hindi-english code-mixed social media text. arXiv preprint arXiv:1604.03136.
  • Singh et al. (2018) Rajat Singh, Nurendra Choudhary, and Manish Shrivastava. 2018. Automatic normalization of word variations in code-mixed social media text. arXiv preprint arXiv:1804.00804.
  • Sitaram and Black (2016) Sunayana Sitaram and Alan W Black. 2016. Speech synthesis of code-mixed text. In LREC.
  • Sitaram et al. (2015) Sunayana Sitaram, Sai Krishna Rallabandi, and SRAW Black. 2015. Experiments with cross-lingual systems for synthesis of code-mixed text. In 9th ISCA Speech Synthesis Workshop, pages 76–81.
  • Sproat et al. (1999) Richard Sproat, Alan W Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. 1999. Normalization of non-standard words. In WS’99 Final Report. Citeseer.
  • Sproat and Jaitly (2016) Richard Sproat and Navdeep Jaitly. 2016. Rnn approaches to text normalization: A challenge. arXiv preprint arXiv:1611.00068.
  • Sridhar (2015) Vivek Kumar Rangarajan Sridhar. 2015. Unsupervised text normalization using distributed representations of words and phrases. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 8–16.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.
  • Zare and Rohatgi (2017) Maryam Zare and Shaurya Rohatgi. 2017. Deepnorm-a deep learning approach to text normalization. arXiv preprint arXiv:1712.06994.
  • Zhu et al. (2007) Conghui Zhu, Jie Tang, Hang Li, Hwee Tou Ng, and Tiejun Zhao. 2007. A unified tagging approach to text normalization. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 688–695.