From English to Code-Switching: Transfer Learning with Strong Morphological Clues

Gustavo Aguilar    Thamar Solorio
Department of Computer Science
University of Houston
Houston, TX 77204-3010
{gaguilaralas, tsolorio}

Code-switching is still an understudied phenomenon in natural language processing mainly because of two related challenges: it lacks annotated data, and it combines a vast diversity of low-resource languages. Despite the language diversity, many code-switching scenarios occur in language pairs, and English is often a common factor among them. In the first part of this paper, we use transfer learning from English to English-paired code-switched languages for the language identification (LID) task by applying two simple yet effective techniques: 1) a hierarchical attention mechanism that enhances morphological clues from character n-grams, and 2) a secondary loss that forces the model to learn n-gram representations that are particular to the languages involved. We use the bottom layers of the ELMo architecture to learn these morphological clues by essentially recognizing what is and what is not English. Our approach outperforms the previous state of the art on Nepali-English, Spanish-English, and Hindi-English datasets. In the second part of the paper, we use our best LID models for the tasks of Spanish-English named entity recognition and Hindi-English part-of-speech tagging by replacing their inference layers and retraining them. We show that our retrained models are capable of using the code-switching information on both tasks to outperform models that do not have such knowledge.

1 Introduction

While code-switching is a common phenomenon among multilingual speakers, it is still considered an understudied area in the field of natural language processing (NLP). The main reason is the lack of annotated data combined with the high diversity of languages in which this phenomenon can occur. However, social media has captured this linguistic phenomenon into written language, making code-switched data more accessible to researchers. Nevertheless, it is expensive to annotate this data and the annotation process also requires fluent speakers for the languages involved. Additionally, not all the languages have the same incidence and predominance, making annotations impractical and expensive for every combination of languages.

Even though code-switching can occur with multiple languages in the same utterance, it is more common to see this phenomenon in language pairs. Many of those language pairs have English as a common factor. Given that most of the NLP efforts are English-centric, it is reasonable to explore approaches where English-based models can be tailored to perform on code-switching settings.

In this paper, we study the code-switching phenomenon using English as an anchor language from multiple code-switched language pairs. In the first part of the paper, we focus on the task of language identification (LID) using ELMo Peters et al. (2018) as our English source. Our hypothesis is that English-based models should be able to recognize what is and what is not English when they are retrained. To accomplish that, we rely on the ELMo character-based architecture. We enhance the character convolutions in its core to detect morphological clues based on different n-gram orders. In addition, we refine these morphological clues using a hierarchical attention mechanism. The enhanced character n-gram representations are used to 1) compute a secondary loss according to whether a token is English or not, and 2) inject morphological clues at the top of the model before performing inference for LID. Our models consistently outperform the LID state of the art on the Nepali-English Solorio et al. (2014), Spanish-English Molina et al. (2016), and Hindi-English Mave et al. (2018) datasets. In the second part of the paper, we transfer the learning of our best LID models to tasks such as named entity recognition (NER) for Spanish-English Aguilar et al. (2018) and part-of-speech (POS) tagging for Hindi-English Das (2016). We evaluate the impact of the pretrained code-switching knowledge for both NER and POS tagging compared to models that do not have such information. We also report a new state of the art on the NER task.

Our contributions can be summarized as follows: 1) framing code-switching tasks based on an anchor high-resource language (i.e., English) and effectively using transfer learning to overcome the challenges for low-resource code-switched language pairs, 2) outperforming previous state-of-the-art results on LID for three datasets with different language pairs, 3) leveraging the code-switched knowledge for tasks such as NER and POS tagging, and establishing a new state of the art on NER by adapting fine-tuned LID models. We also release the source code to the NLP community to allow replicability of all the experiments.

2 Methodology

We provide a cursory overview of the ELMo architecture (Section 2.1), which serves as background for a detailed description of the enhanced character n-gram representation (Section 2.2). Then, we define the overall sequence tagger architecture in Section 2.3 and describe our training procedure in Section 2.4.

2.1 ELMo: A Cursory Overview

ELMo is a character-based language model that provides deep contextualized word representations Peters et al. (2018). It has established a new state of the art in a wide variety of NLP applications by effectively capturing word contexts. The shaded box in Figure 2 describes the most important components of ELMo: character CNNs, a highway network Srivastava et al. (2015) with linear projection, and bidirectional LSTMs. The CNN layers are applied over the character embeddings to produce morphological features using different kernel widths. The resulting feature maps are max-pooled and flattened to obtain a single vector per word from its characters. The flattened vector is passed to the highway network and projected to a lower-dimensional space. Essentially, this word vector only accounts for local features without context. The global contextual information is provided by the bidirectional LSTM layers, which also yield the final word representations. We refer to Peters et al. (2018) for more details.

We choose ELMo for code-switched language identification because 1) it has been extensively trained on English data as a general-purpose language model, which is essential to adapt to the idea of recognizing what is and what is not English, 2) it extracts morphological information out of characters, which is crucial since certain combinations of characters can determine if a word is of one language or another, and 3) it generates powerful word representations that account for multiple meanings depending on the context.

2.2 Enhanced Character N-grams

Figure 1: Enhanced character n-gram representation of a word. We take the outputs of seven convolution layers with kernel widths from 1 to 7. The figure only shows kernel widths of 2 and 3 for simplicity. The features are added to the position embeddings and weighted by a hierarchical attention mechanism for the final character n-gram representation.

As described in Section 2.1, ELMo convolves character embeddings in its first layers. We use the resulting feature maps from such convolutions before they are max-pooled. These feature maps are essentially n-gram representations whose feature dimension depends on the number of channels on each convolution layer. The order of the n-grams is determined by the kernel widths of the CNNs. Figure 1 shows kernel widths of 2 (bi-grams) and 3 (tri-grams) for simplicity, but ELMo uses kernel sizes in the set $\{1, 2, \dots, 7\}$. We are interested in the resulting vectors because they effectively capture morphological patterns with different orders of n-grams; these vectors can ultimately describe particularities of the word morphology in the code-switched languages.
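To make the link between kernel widths and n-gram orders concrete, the following minimal Python sketch enumerates the character windows that a convolution of width n would cover. The function name `char_ngrams` is hypothetical, and the sketch ignores padding and any word-boundary markers the real model may use; it only illustrates which character spans each kernel width sees.

```python
def char_ngrams(word, n):
    """Return the contiguous character n-grams of `word` (no padding)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Kernel widths 1..7, matching the widths named in the Figure 1 caption.
KERNEL_WIDTHS = range(1, 8)

word = "tuning"
ngrams_by_order = {n: char_ngrams(word, n) for n in KERNEL_WIDTHS}

print(ngrams_by_order[2])  # bi-grams of the word
print(ngrams_by_order[3])  # tri-grams of the word
```

A word shorter than the kernel width yields no window in this simplified view, which is one reason practical implementations usually pad the character sequence.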

2.2.1 Position Embeddings

Convolutional networks are known to lose spatial information: they only capture patterns from a small section of the input within a window, regardless of where the patterns occur. In our case, this results in n-gram vectors that are not tied to the sequential order in which they appear. Consider the tri-grams of the English word tuning and the Spanish word ingeniero (which translates to engineer). Even though the suffix -ing is more frequent in English than in Spanish, the resulting convolutions for both words will provide the same vectors for the tri-gram ing, ignoring its place in the words. To avoid this behavior, we provide the model with position embeddings Gehring et al. (2017).

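Consider the sequence of character n-grams of order $n$, $\mathbf{X}_n = (\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_{T_n})$, where $\mathbf{x}_i \in \mathbb{R}^{M_n}$ is an n-gram vector from the character convolutions, and $M_n$ is the number of output channels. Also, consider a set of position embedding matrices $\mathbf{E} = \{\mathbf{E}_1, \dots, \mathbf{E}_7\}$ defined as $\mathbf{E}_n \in \mathbb{R}^{(k - n + 1) \times e}$, where $k$ is the maximum length of characters in a word and $e$ is the dimension of the embeddings. Then, the position vectors for the sequence $\mathbf{X}_n$ are defined by $\mathbf{P}_n = (\mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_{T_n})$, where $\mathbf{p}_i \in \mathbb{R}^{e}$ is the $i$-th vector from the position embedding matrix $\mathbf{E}_n$. We use $e = M_n$ to enable the addition of the position embeddings and the n-gram vectors. Thus, the resulting set of n-grams that account for positional information is given by $\hat{\mathbf{X}}_n = (\mathbf{x}_1 + \mathbf{p}_1, \dots, \mathbf{x}_{T_n} + \mathbf{p}_{T_n})$. Figure 1 shows the position embeddings for bi-grams and tri-grams.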
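The effect of adding position vectors can be illustrated with a small sketch. It is not the authors' implementation: the random vectors stand in for convolution outputs and learned position embeddings, and `DIM`, `MAX_LEN`, `ngram_vector`, and `encode` are hypothetical names chosen for the example. It shows that the tri-gram ing receives the same vector in tuning and ingeniero until a position vector of matching dimension is added.

```python
import random


def char_ngrams(word, n):
    """Return the contiguous character n-grams of `word` (no padding)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def add_positions(ngram_vectors, position_table):
    """Element-wise sum of the i-th n-gram vector and the i-th position vector."""
    return [
        [x + p for x, p in zip(vec, position_table[i])]
        for i, vec in enumerate(ngram_vectors)
    ]


random.seed(0)
DIM = 4       # hypothetical channel count; the position dimension must match it
MAX_LEN = 10  # hypothetical maximum word length in characters
ORDER = 3     # tri-grams

# Stand-in for convolution outputs: one fixed vector per distinct tri-gram,
# so the same tri-gram gets the same vector wherever it occurs.
ngram_table = {}


def ngram_vector(gram):
    if gram not in ngram_table:
        ngram_table[gram] = [random.uniform(-1, 1) for _ in range(DIM)]
    return ngram_table[gram]


# One position vector per tri-gram slot (MAX_LEN - ORDER + 1 slots).
position_table = [
    [random.uniform(-1, 1) for _ in range(DIM)]
    for _ in range(MAX_LEN - ORDER + 1)
]


def encode(word):
    grams = char_ngrams(word, ORDER)
    vectors = [ngram_vector(g) for g in grams]
    return grams, vectors, add_positions(vectors, position_table)


grams_en, raw_en, pos_en = encode("tuning")     # "ing" is the last tri-gram
grams_es, raw_es, pos_es = encode("ingeniero")  # "ing" is the first tri-gram

i_en, i_es = grams_en.index("ing"), grams_es.index("ing")
print(raw_en[i_en] == raw_es[i_es])  # same vector when positions are ignored
print(pos_en[i_en] == pos_es[i_es])  # vectors differ once positions are added
```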

2.2.2 Hierarchical Attention

ELMo down-samples the outputs of its convolutional layers by max-pooling over the feature maps. Because of the nature of code-switching data, we argue that this operation is not ideal to adapt to new morphological patterns from other languages as the model will tend to discard patterns from languages other than English. Instead, we propose a hierarchical attention mechanism that down-samples by prioritizing the n-gram features for 1) every n-gram order separately, and then 2) across the different n-gram orders jointly as described in Figure 1.

We use an attention mechanism similar to the one introduced by Bahdanau et al. (2014). The idea is to concentrate probability mass over the tokens that capture the most relevant information along the input. Our attention mechanism uses the following equations:

$$u_i = \mathbf{v}^{\top} \tanh(\mathbf{W} \mathbf{x}_i) \qquad (1)$$

$$\alpha_i = \frac{\exp(u_i)}{\sum_{j=1}^{N} \exp(u_j)} \qquad (2)$$

$$\mathbf{z} = \sum_{i=1}^{N} \alpha_i \mathbf{x}_i \qquad (3)$$

where $\mathbf{W} \in \mathbb{R}^{d_a \times d_x}$ is a projection matrix, $d_a$ is the dimension of the attention space, and $d_x$ is the dimension of the input features. $\mathbf{v} \in \mathbb{R}^{d_a}$ is the attention vector to be learned, and $\alpha_i$ is a scalar that describes the attention probability associated to the $i$-th token. $\mathbf{z}$ is the weighted sum of the input token vectors and the attention probabilities. Note that this mechanism is used independently for every order of n-grams, resulting in a set of vectors $\{\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_7\}$ from Equation 3. This allows the model to find relevant information across individual n-grams first (e.g., all bi-grams, all tri-grams, etc.). Then, we apply another instance (i.e., another set of parameters) of the same attention mechanism across all the n-gram vectors $\{\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_7\}$. The resulting vector $\hat{\mathbf{z}}$ is what we call the enhanced character n-gram representation (see Figure 1).
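The two-level pooling described above can be sketched in plain Python. This is an illustrative approximation rather than the released code: it scores each vector with an attention vector applied to a tanh projection, normalizes the scores with a softmax, and returns the weighted sum, first within each n-gram order and then across orders. The dimensions, the random inputs, the use of separate low-level parameters per order, and the assumption that all orders share one feature dimension are simplifications made for the example.

```python
import math
import random


def attention(vectors, W, v):
    """Additive attention: score each vector, softmax, return weights and weighted sum."""
    scores = []
    for x in vectors:
        hidden = [math.tanh(sum(w * xj for w, xj in zip(row, x))) for row in W]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(vectors[0])
    pooled = [sum(a * x[d] for a, x in zip(alphas, vectors)) for d in range(dim)]
    return alphas, pooled


def init_params(rng, d_att, d_in):
    """Random stand-ins for a learned projection matrix and attention vector."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(d_in)] for _ in range(d_att)]
    v = [rng.uniform(-0.5, 0.5) for _ in range(d_att)]
    return W, v


rng = random.Random(0)
D_IN, D_ATT = 4, 3   # hypothetical feature and attention dimensions
WORD_LEN = 6
ORDERS = (1, 2, 3)   # shortened for the example; the text uses orders 1 to 7

# Random stand-ins for position-aware n-gram vectors of one word.
ngram_vectors = {
    n: [[rng.uniform(-1, 1) for _ in range(D_IN)] for _ in range(WORD_LEN - n + 1)]
    for n in ORDERS
}

# Low level: attention within each n-gram order (assumed separate parameters).
low_params = {n: init_params(rng, D_ATT, D_IN) for n in ORDERS}
per_order = [attention(ngram_vectors[n], *low_params[n])[1] for n in ORDERS]

# High level: another parameter set attends across the pooled order vectors.
high_W, high_v = init_params(rng, D_ATT, D_IN)
order_weights, enhanced = attention(per_order, high_W, high_v)

print(order_weights)  # one attention probability per n-gram order
print(enhanced)       # pooled vector playing the role of the enhanced representation
```

Unlike max-pooling, every n-gram contributes to the pooled vector in proportion to its attention weight, which is the property the section argues for.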

2.3 Sequence Tagging

Figure 2: High-level overview of the proposed model.

Our proposed sequential tagger is built upon ELMo and the enhanced character n-gram mechanism as shown in Figure 2. The ELMo word representations are concatenated with word embeddings from Twitter Pennington et al. (2014) and fastText Bojanowski et al. (2017) following the approaches from Howard and Ruder (2018) and Mave et al. (2018). The resulting word representations are fed into a bidirectional LSTM. Since we want the model to emphasize the enhanced character n-gram representations, we concatenate such vectors with the output of the BLSTM layer before passing information to the inference layer. We use a conditional random field Lafferty et al. (2001) for the inference layer.

2.4 Training

Multi-Task learning. We train the model by minimizing the negative log-likelihood loss of the conditional random field classifier. We also force the model to minimize a secondary loss determined by the task of recognizing whether a token is in English, partially English, or another language based solely on morphological clues (see the softmax layer in Figure 2). The overall loss of our model is defined as follows:

$$\mathcal{L}_{task_t} = -\frac{1}{N} \sum_{i=1}^{N} y_i \log p(y_i \mid \Theta) \qquad (4)$$

$$\mathcal{L} = \mathcal{L}_{task_1} + \mathcal{L}_{task_2} + \lambda \sum_{k=1}^{|\Theta|} w_k^2 \qquad (5)$$

where $\mathcal{L}_{task_1}$ and $\mathcal{L}_{task_2}$ are the negative log-likelihood losses defined by Equation 4. $\mathcal{L}_{task_1}$ determines the primary loss for the LID task, whereas $\mathcal{L}_{task_2}$ is the secondary loss for the English, partially English, or not English task. The third term accounts for regularization and $\lambda$ is the penalty weight.
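As a rough illustration of how the primary loss, the secondary loss, and the regularization term combine, the sketch below adds two mean negative log-likelihoods and a weight penalty. Several parts are assumptions made for the example: the primary loss is approximated token by token rather than with a CRF, the penalty is taken to be a sum of squared weights, and the probabilities, label names, weights, and penalty value are invented toy data.

```python
import math


def nll(probabilities, gold):
    """Mean negative log-likelihood of the gold labels under per-token distributions."""
    return -sum(math.log(p[g]) for p, g in zip(probabilities, gold)) / len(gold)


def total_loss(primary, secondary, weights, penalty):
    """Primary loss + secondary loss + squared-weight penalty (assumed L2 form)."""
    return primary + secondary + penalty * sum(w * w for w in weights)


# Toy predictions for a two-token utterance (hypothetical values).
lid_probs = [
    {"lang1": 0.7, "lang2": 0.2, "other": 0.1},
    {"lang1": 0.1, "lang2": 0.8, "other": 0.1},
]
lid_gold = ["lang1", "lang2"]

# Secondary task: English, partially English, or not English.
eng_probs = [
    {"english": 0.9, "partial": 0.05, "not_english": 0.05},
    {"english": 0.2, "partial": 0.1, "not_english": 0.7},
]
eng_gold = ["english", "not_english"]

weights = [0.5, -0.25, 0.1]  # stand-in for the model parameters
penalty = 0.01               # hypothetical penalty weight

primary = nll(lid_probs, lid_gold)
secondary = nll(eng_probs, eng_gold)
loss = total_loss(primary, secondary, weights, penalty)
print(primary, secondary, loss)
```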

Fine-tuning. We fine-tune the model by progressively updating the parameters from the top to the bottom layers of the model. This avoids losing the pre-trained knowledge from ELMo and smoothly adapts the network to the new languages from the code-switched data. We use the slanted triangular learning rate scheduler with both gradual unfreezing and discriminative fine-tuning over the layers (i.e., different learning rates across layers) proposed by Howard and Ruder (2018). We group the non-ELMo parameters of our model apart from the ELMo parameters. We set the non-ELMo parameters to be the first group of parameters to be tuned (i.e., parameters from enhanced character n-grams, CRF, and BLSTM). Then, we further group the ELMo parameters as follows (top to bottom): 1) the second bidirectional LSTM layer, 2) the first bidirectional LSTM layer, 3) the highway network, 4) the linear projection from flattened convolutions to the token embedding space, 5) all the convolutional layers, and 6) the character embedding weights. Once all the layers have been unfrozen, we update all the parameters together. This technique allows us to get the most out of our model when moving from English to a code-switching setting.
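The grouping above can be sketched as a simple schedule. The group order follows the text, but the rest is assumed for illustration: unfreezing one additional group per epoch and dividing the learning rate by 2.6 per group are the values suggested by Howard and Ruder (2018), not settings confirmed for this model, and the base learning rate is arbitrary.

```python
# Parameter groups from top to bottom, in the order described in the text.
GROUPS = [
    "non-ELMo (enhanced n-grams, CRF, BLSTM)",
    "ELMo BLSTM layer 2",
    "ELMo BLSTM layer 1",
    "ELMo highway network",
    "ELMo linear projection",
    "ELMo convolutional layers",
    "ELMo character embeddings",
]


def trainable_groups(epoch):
    """Gradual unfreezing: one more group (top to bottom) becomes trainable per epoch."""
    return GROUPS[:min(epoch + 1, len(GROUPS))]


def discriminative_rates(base_lr, decay=2.6):
    """Discriminative fine-tuning: lower groups get smaller learning rates."""
    return {group: base_lr / decay ** depth for depth, group in enumerate(GROUPS)}


rates = discriminative_rates(base_lr=1e-3)  # hypothetical base learning rate
for epoch in range(len(GROUPS) + 1):
    active = trainable_groups(epoch)
    print(epoch, len(active), "lowest active lr:", rates[active[-1]])
```

Once the epoch index reaches the number of groups, every group is active and all parameters are updated together, matching the final stage described above.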

3 Experiments

We describe our experiments for language identification (LID) in Section 3.1. We take the best models for LID, and transfer the code-switching learning to tasks such as named entity recognition and part-of-speech tagging in Section 3.2.

3.1 Language Identification

| Corpus | Statistic | Train | Dev | Test |
|---|---|---|---|---|
| Nepali-English 2014 | Posts | 8,494 | 1,499 | 2,874 |
| | Tokens | 123,959 | 22,097 | 40,268 |
| | lang1 | 38,310 | 7,173 | 12,286 |
| | lang2 | 51,689 | 9,008 | 17,216 |
| Spanish-English 2016 | Posts | 11,400 | 3,014 | 10,716 |
| | Tokens | 139,539 | 33,276 | 121,446 |
| | lang1 | 78,814 | 16,821 | 16,944 |
| | lang2 | 33,709 | 8,652 | 77,047 |
| Hindi-English 2018 | Posts | 5,045 | 891 | 1,485 |
| | Tokens | 100,337 | 16,531 | 29,854 |
| | lang1 | 57,695 | 9,468 | 17,589 |
| | lang2 | 20,696 | 3,420 | 5,842 |

Table 1: The distribution of the language identification datasets. The labels lang1 and lang2 refer to English and either Nepali, Spanish, or Hindi, respectively. The full distribution can be found in Appendix A.


| Corpus | CMI-all | CMI-mixed |
|---|---|---|
| Nepali-English 2014 | 19.708 | 25.697 |
| Spanish-English 2016 | 7.685 | 22.114 |
| Hindi-English 2018 | 10.094 | 23.141 |

Table 2: Code-Mixing Index (CMI) for the language identification datasets. CMI-all: average over all utterances in the corpus. CMI-mixed: average over only code-switched instances.

| ID | Experiment | Nepali-English Dev | Nepali-English Test | Spanish-English Dev | Spanish-English Test | Hindi-English Dev | Hindi-English Test |
|---|---|---|---|---|---|---|---|
| | Baseline: ELMo combined with LSTM and CRF | | | | | | |
| Exp 1.0 | ELMo (frozen) | 89.387 | 87.849 | 87.039 | 86.935 | 86.446 | 87.659 |
| Exp 1.1 | ELMo (unfrozen) | 96.192 | 95.700 | 95.508 | 96.363 | 95.997 | 96.420 |
| Exp 1.2 | ELMo (unfrozen) + LSTM | 96.279 | 95.904 | 95.564 | 96.596 | 96.489 | 96.694 |
| Exp 1.3 | ELMo (unfrozen) + LSTM + CRF | 96.320 | 95.882 | 95.615 | 96.748 | 96.545 | 96.717 |
| | Approach 1: Adding other English embeddings (upon Exp 1.3 for all three datasets) | | | | | | |
| Exp 2.0 | Twitter | 96.395 | 95.938 | 95.962 | 96.968 | 96.368 | 96.854 |
| Exp 2.1 | fastText | 96.302 | 95.756 | 95.817 | 96.810 | 96.394 | 96.620 |
| Exp 2.2 | Twitter + fastText | 96.423 | 96.051 | 96.014 | 97.030 | 96.457 | 96.624 |
| | Approach 2: N-gram-based MTL from convolutions (upon Exp 2.2 for all three datasets) | | | | | | |
| Exp 3.0 | Flatten convolutions | 96.535 | 95.986 | 96.061 | 97.048 | 96.418 | 96.753 |
| Exp 3.1 | Attention on each n-gram (low level) | 96.588 | 95.937 | 96.089 | 97.140 | 96.572 | 96.950 |
| Exp 3.2 | Attention across n-grams (high level) | 96.555 | 96.029 | 96.079 | 97.087 | 96.497 | 96.874 |
| Exp 3.3 | Hierarchical attention (low and high level) | 96.610 | 96.032 | 96.127 | 97.242 | 96.588 | 96.972 |
| Exp 3.4 | Hierarchical attention + position emb. | 96.690 | 96.170 | 96.117 | 97.350 | 96.513 | 96.833 |
| Exp 3.5 | Hierarchical attention + position emb. † | 96.712 | 96.203 | 96.202 | 97.553 | 96.636 | 96.960 |
| Exp 3.6 | Hierarchical attention † | 96.614 | 96.044 | 96.131 | 97.325 | 96.776* | 97.001* |
| | Approach 3: Fine-tuning parameters (upon Exp 3.5 for Nepali-English and Spanish-English, upon Exp 3.6 for Hindi-English) | | | | | | |
| Exp 4.0 | STLR | 96.530 | 96.005 | 96.036 | 97.469 | 96.748 | 96.833 |
| Exp 4.1 | STLR + grad. unfreezing | 96.661 | 96.111 | 96.247 | 97.536 | 96.657 | 96.815 |
| Exp 4.2 | STLR + discr. fine-tuning | 96.643 | 96.189 | 95.665 | 96.603 | 95.596 | 96.218 |
| Exp 4.3 | STLR + grad. unfreezing + discr. fine-tuning | 96.755* | 96.504* | 96.408* | 97.690* | 96.194 | 96.528 |
| | Previous best published results | | | | | | |
| | Mave et al. (2018) | - | - | 96.510 | 97.060 | 96.6045 | 96.840 |

Table 3: The results of multiple incremental experiments on each dataset. STLR refers to the slanted triangular learning rate scheduler, † means that the character representations were concatenated to the vectors before the CRF layer (see Figure 2), and the superscript * denotes the best scores in each dataset. The scores are calculated using the weighted F-1 metric, and we highlight the best scores on each set of experiments in bold.

Datasets. We use three code-switching datasets for the language identification (LID) task. The first one contains Nepali-English data from the Computational Approaches to Linguistic Code-Switching (CALCS) 2014 workshop Solorio et al. (2014). The second one has Spanish-English data from CALCS 2016 Molina et al. (2016). The last one uses Hindi-English data, which was introduced by Mave et al. (2018). The three datasets follow the CALCS annotation scheme, which contains the labels lang1, lang2, other, ne, ambiguous, and mixed. English is used for the lang1 label, and Nepali, Spanish, or Hindi map to lang2 depending on the dataset. We show the distribution of lang1 and lang2 in Table 1. Additionally, we use the Code-Mixing Index (CMI) Gambäck and Das (2014) to measure the level of code mixing in these corpora (see Table 2). A higher CMI value indicates a higher level of language mixing, which makes language identification more difficult. We compute code-mixing in each utterance and average over all utterances in the corpus (CMI-all) and also over only code-switched instances (CMI-mixed). We observe that the Nepali-English corpus has a higher level of mixing than the other two language pairs.
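For readers unfamiliar with the metric, the following sketch computes an utterance-level CMI and the two corpus averages. It follows the commonly cited form of the Gambäck and Das (2014) index, 100 × (1 − max_i(w_i) / (n − u)), where n is the number of tokens, u the number of language-independent tokens, and max_i(w_i) the token count of the dominant language. Treating every tag other than lang1 and lang2 as language-independent, and counting an utterance as code-switched when its CMI is above zero, are assumptions of this example; the tag sequences are toy data.

```python
LANGUAGE_TAGS = {"lang1", "lang2"}


def cmi(tags):
    """Code-Mixing Index of one utterance given its token-level language tags."""
    n = len(tags)
    counts = {}
    for tag in tags:
        if tag in LANGUAGE_TAGS:
            counts[tag] = counts.get(tag, 0) + 1
    u = n - sum(counts.values())  # language-independent tokens (assumed: all other tags)
    if n == u:
        return 0.0
    return 100.0 * (1.0 - max(counts.values()) / (n - u))


def corpus_cmi(utterances):
    """Return (CMI-all, CMI-mixed) averaged over utterances."""
    scores = [cmi(tags) for tags in utterances]
    mixed = [s for s in scores if s > 0]
    cmi_all = sum(scores) / len(scores)
    cmi_mixed = sum(mixed) / len(mixed) if mixed else 0.0
    return cmi_all, cmi_mixed


utterances = [
    ["lang1", "lang1", "other"],           # monolingual
    ["lang1", "lang2", "lang2", "other"],  # code-switched
    ["lang2", "lang2"],                    # monolingual
]
cmi_all, cmi_mixed = corpus_cmi(utterances)
print(cmi_all, cmi_mixed)
```

Because monolingual utterances contribute zeros to CMI-all but are excluded from CMI-mixed, a corpus with few code-switched posts can show a low CMI-all and a much higher CMI-mixed, which is the pattern visible for Spanish-English in Table 2.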

Baselines. We define our baselines using the ELMo architecture combined with bidirectional LSTM and CRF on top. As shown in Table 3, ELMo needs to update its parameters during training to perform well (see Exp 1.0 and Exp 1.1). Further improvements are shown when the model uses LSTM and CRF, which is consistent with previous research for sequence labeling tasks Akbik et al. (2018); Peters et al. (2018); Devlin et al. (2018). Hence, we establish experiment Exp 1.3 as a baseline and build upon this model.

Approach 1. In the second set of experiments, we incorporate word embeddings from fastText and Twitter. The idea is to provide the model with more English-based knowledge that is more suitable to the social media aspects of the data (e.g., misspellings, subword-level information, Twitter expressions, etc.). As such, this information is complementary to the ELMo word embeddings. In fact, the results consistently improve across the datasets when we add these word embeddings, with the combination of fastText and Twitter embeddings being the best performer (see Table 3).

Approach 2. We focus on leveraging morphological clues from character n-grams by using multi-task learning. We max-pool and flatten the convolutions from ELMo and feed them into a secondary task (Exp 3.0). This experiment shows improvements over the single-task model, which emphasizes the importance of character n-grams for LID. More elaborate ways of down-sampling the convolutions further improve the performance. For instance, adding attention at each n-gram (Exp 3.1), across n-grams (Exp 3.2), or both mechanisms as in the hierarchical attention (Exp 3.3) helps the model to perform better across the datasets. However, when we provide position embeddings to tell the model where the character n-grams occur, the performance on Hindi-English data drops slightly, while the models for Spanish-English and Nepali-English data improve further. Our intuition is that the overlap among character n-grams from one language to another is a key factor in determining the importance of the position embeddings. In fact, Spanish-English data overlaps about 15% more on each n-gram order than Hindi-English data Mave et al. (2018). Lastly, we concatenate the enhanced n-gram representations with the output of the LSTM before the CRF layer. This makes the morphological features more prominent to the model at inference time, which yields improvements across datasets.

| LID System | lang1 | lang2 | WA F1 |
|---|---|---|---|
| Nepali-English 2014 | | | |
| Al-Badrashiny and Diab (2016) | 97.6 | 97.0 | 97.3 |
| Ours (Exp 4.3) | 98.124 | 95.170 | 97.387 |
| Spanish-English 2016 | | | |
| Al-Badrashiny and Diab (2016) | 88.6 | 96.9 | 95.2 |
| Jain and Bhat (2014) | 92.3 | 96.9 | 96.0 |
| Mave et al. (2018) | 93.184 | 98.118 | 96.840 |
| Ours (Exp 4.3) | 94.802 | 98.575 | 97.894 |
| Hindi-English 2018 | | | |
| Mave et al. (2018) | 98.241 | 95.657 | 97.596 |
| Ours (Exp 3.6) | 98.315 | 95.737 | 97.672 |

Table 4: Comparison of our best models with the best published scores for language identification. Scores are calculated with the F1 metric, and WA F1 is the weighted average F1 between both languages.
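As a reference for how a weighted average F1 is typically obtained, the sketch below computes per-label F1 and averages it with weights proportional to each label's gold frequency. This is the common support-weighted definition and is assumed here for illustration; it is not necessarily the exact scoring script behind the reported numbers, and the gold and predicted labels are toy data.

```python
from collections import Counter


def f1_per_label(gold, pred):
    """Per-label F1 computed as 2TP / (2TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    scores = {}
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def weighted_f1(gold, pred):
    """Average of per-label F1 weighted by the number of gold tokens per label."""
    support = Counter(gold)
    scores = f1_per_label(gold, pred)
    return sum(scores[label] * count for label, count in support.items()) / len(gold)


gold = ["lang1", "lang1", "lang1", "lang2"]
pred = ["lang1", "lang1", "lang2", "lang2"]
scores = f1_per_label(gold, pred)
print(scores["lang1"], scores["lang2"], weighted_f1(gold, pred))
```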

Approach 3. In this set of experiments, we try to smooth the transition from English to the code-switching setting by fine-tuning the best model architectures obtained from previous experiments. As shown in Table 3, we use the slanted triangular learning rate (STLR) scheduler Howard and Ruder (2018) in all the experiments for this set. This is a linear scheduler that rapidly increases the learning rate to lead the model to a good parameter space for the task, and then slowly decreases the learning rate to allow the model to converge. For Spanish-English and Nepali-English, we get the best scores by using gradual unfreezing and discriminative fine-tuning (Exp 4.3). In the case of Spanish-English data, while the dataset contains more English than Spanish, the model still needs a smooth transition to the code-switching setting because both languages share the same Latin root, which makes it hard to discriminate very similar words. For Nepali-English, the dataset has significantly more Nepali than English, which forces the model to learn more about Nepali while still keeping the English knowledge. For Hindi-English, we did not get better performance with the fine-tuning approach than what we got for Exp 3.6. Since this dataset has more English than Hindi tokens, the model tends to prioritize the English knowledge, reducing the effect of adapting to Hindi (the percentage of language tokens in the training set is 57.51% for English and 20.6% for Hindi).
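The shape of the STLR schedule can be reproduced with a few lines of Python. The formula follows Howard and Ruder (2018); the default values for `lr_max`, `cut_frac`, and `ratio` are the ones suggested in that work and are assumptions here, since the settings used in these experiments are not stated in this section. The guard that keeps `cut` at least 1 is an addition of this sketch to avoid division by zero on very short schedules.

```python
import math


def stlr(step, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular learning rate at a given training step."""
    cut = max(1, math.floor(total_steps * cut_frac))  # step where the rate peaks
    if step < cut:
        p = step / cut                                           # short linear warm-up
    else:
        p = 1 - (step - cut) / (cut * (1 / cut_frac - 1))        # long linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio


total = 100
schedule = [stlr(t, total) for t in range(total + 1)]
peak_step = max(range(total + 1), key=lambda t: schedule[t])
print(peak_step, schedule[0], schedule[peak_step], schedule[-1])
```

With these defaults the rate climbs during the first tenth of training and decays over the remaining steps, which matches the "rapid increase, slow decrease" behavior described above.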

Figure 3: Visualization of the attention weights at the trigram level for the Spanish-English 2016 dataset on the language identification task. The boxes contain the trigrams of the word below them. We also provide the predicted label by the model, and whether it was correct or wrong.

Attention analysis. Figure 3 shows the attention weights for trigrams in the Spanish-English dataset. The model is able to pick up suffixes that belong to one or the other language. In the case of the word coming, the trigram -ing is common in English for verbs in the present progressive tense. Because the -ing suffix is usually placed at the end of the word, the model can make use of positional information here. For the words including the trigrams aha and hah, the position does not provide any additional help. This is the case when a trigram tends to appear in only one language, so observing the n-gram is sufficient for the model. We observe similar instances for the Hindi-English dataset. For this dataset, the model learns trigrams like -ing and -ian for English, and iye and isi for Hindi (see Appendix D).

Error analysis. Morphological information of words is very useful for the task of language identification. However, when the language pairs share the same root, the words' surface forms are so similar that it becomes very difficult for the model to discriminate among them. In Figure 3, the word miserable has exactly the same spelling in both Spanish and English, making the level of ambiguity so high that the model is confused. We find similar cases for Hindi-English, where a good number of the words mislabeled by the model are due to common spellings in both languages (e.g., me, to, use). We also observe some examples with inconsistencies in the gold language labels, where our model predicts the correct language label.

3.2 Transfer Code-Switching Learning

| NER System | Dev F1 | Test F1 |
|---|---|---|
| ELMo + BLSTM + CRF | 59.91 | 62.53 |
| Exp 4.3: No CS | 59.87 | 62.42 |
| Exp 4.3: CS + inference retrained | 60.57 | 63.87 |
| Exp 4.3: CS + fully retrained | 61.03 | 66.06 |
| Exp 4.3: CS + fully retrained + MTL | 61.32 | 66.69 |
| Trivedi et al. (2018) | - | 63.76 |

Table 5: The results for Spanish-English named entity recognition. For reference, we add the scores from the winner of the shared task on NER as well as the scores of ELMo + BLSTM + CRF to simulate the same layout in which ELMo was used for monolingual NER. CS means that the code-switching knowledge was used from previously-trained LID models.

We evaluate how useful the code-switching information is, by transferring the knowledge to tasks such as named entity recognition and part-of-speech tagging. Our experimental settings mainly focus on 1) training the best architecture for LID without the pre-trained code-switching knowledge, 2) only training the inference layer with the pre-trained code-switching knowledge, and 3) setting the model fully trainable with the code-switching knowledge included. We also explore potential improvements when language identification annotations are provided given that our models can compute LID loss.

Named Entity Recognition. We use the dataset from CALCS 2018 Aguilar et al. (2018), which has the labels person, location, organization, group, title, product, event, time, and other. The label distribution is detailed in Appendix C. As shown in Table 5, when the model is trained with or without the code-switching knowledge, there is a difference in performance of about 1% on the F1 metric in the test set. Further improvements are achieved by using a fully trainable model that contains code-switching knowledge. Note that the last experiment achieves the best results when the secondary loss is used. This is an advantage of our model architecture, and it helps the model by providing a regularization effect when such labels are available.

| POS System | Dev F1 | Test F1 |
|---|---|---|
| Exp 3.6: No CS | 80.21 | 72.14 |
| Exp 3.6: CS + inference retrained | 80.89 | 72.53 |
| Exp 3.6: CS + fully retrained | 81.92 | 74.02 |
| Exp 3.6: CS + fully retrained + MTL | 82.18 | 74.84 |

Table 6: The results on the part-of-speech tagging for the Hindi-English dataset. CS means that the code-switching knowledge was used from previously-trained LID models.

Part-of-Speech Tagging. We use the POS tagging dataset on Hindi-English from the ICON 2016 contest Das (2016). The data distribution is provided in Appendix B. Unfortunately, we could not compare with the previously reported scores on this dataset since the data distribution does not match the one that the authors provided. Similar to the behavior on the NER task, we see improvements in the model when the code-switching knowledge is present. In fact, Table 6 shows that the same architecture does around 2% better on the F1 metric for both the validation and test sets when comparing the models with and without code-switching knowledge. This further supports our claim that the code-switching information learned by the LID models is useful when we retrain the models for different tasks.

4 Related Work

Transfer learning has become more practical in recent years, making it possible to apply very large neural networks to tasks where annotated data is limited Peters et al. (2018); Bahdanau et al. (2014); Devlin et al. (2018); Howard and Ruder (2018). Code-switching-related tasks are usually framed as low-resource problems because they involve a large number of languages that lack annotated data or have no pre-trained models available in such domains. In fact, researchers have mainly focused on traditional machine learning techniques because they perform better than deep learning models given the data constraints Mave et al. (2018); Yirmibeşoğlu and Eryiğit (2018); Al-Badrashiny and Diab (2016). Even though transfer learning has not yet been widely explored for code-switching, some researchers have tried to apply it from monolingual to code-switching tasks such as named entity recognition Trivedi et al. (2018); Winata et al. (2018). These works serve as evidence that transfer learning can improve code-switching tasks, potentially overcoming the absence of annotated data.

Deep learning approaches such as LSTM-based and CNN-based models have been recently explored for code-switching tasks Ball and Garrette (2018); Mager et al. (2019). However, we notice that the code-switching literature barely covers attention mechanisms. Attention was introduced by Bahdanau et al. (2014) in the task of machine translation. Since then, it has been broadly used in many other applications such as semantic slot filling and sentiment analysis. For code-switching, Wang et al. (2018) proposed a gated-based attention mechanism that chooses monolingual embeddings from one or the other language according to the data input. This work shows the potential of attention in code-switching settings. We employ a different attention component, more similar to Bahdanau et al. (2014), to handle down-sampling of character convolutions without losing essential information (in contrast to the behavior of max-pooling operations). Even though this is not explored in the literature, we believe that such an approach aligns better with the linguistic nature of code-switching, where morphology and character n-grams play a significant role in identifying the languages involved.

Another aspect that has not been considered for code-switching is position embeddings. Position embeddings combined with CNNs have proved useful in computer vision Gehring et al. (2017); they help to localize non-spatial features extracted by convolutional networks within an image. We apply the same principle to our code-switching data: we argue that character n-grams without position information may not be enough for a model to learn the actual morphological aspects of the languages. We empirically validate those aspects and discuss the impact of such a mechanism on our experiments.

5 Conclusion

We explored transfer learning from a high-resource language, such as English, to code-switched language pairs. Our experiments demonstrate that transfer learning enables large pre-trained models to be adapted to code-switching settings, where we can take advantage of the pre-trained knowledge. We established a new state of the art on language identification for the Nepali-English, Spanish-English, and Hindi-English datasets. Moreover, we explored to what extent these models can be used by evaluating them on other tasks such as named entity recognition and part-of-speech tagging for code-switched text. We found that, without requiring any preprocessing, the knowledge learned from LID models can be successfully transferred to these tasks, achieving state-of-the-art results for Spanish-English NER and competitive results on POS tagging for Hindi-English.


We thank Deepthi Mave for providing general statistics of the code-switching datasets.


  • G. Aguilar, F. AlGhamdi, V. Soto, M. Diab, J. Hirschberg, and T. Solorio (2018) Named Entity Recognition on Code-Switched Data: Overview of the CALCS 2018 Shared Task. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, Melbourne, Australia, pp. 138–147.
  • A. Akbik, D. Blythe, and R. Vollgraf (2018) Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649.
  • M. Al-Badrashiny and M. Diab (2016) LILI: a simple language independent approach for language identification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 1211–1219.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
  • K. Ball and D. Garrette (2018) Part-of-speech tagging for code-switched, transliterated texts without explicit language identification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3084–3089.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. ISSN 2307-387X.
  • A. Das (2016) Tool contest on POS tagging for code-mixed Indian social media (Facebook, Twitter, and Whatsapp) text. Note: retrieved 05-10-2019.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • B. Gambäck and A. Das (2014) On measuring the complexity of code-mixing. In Proceedings of the 11th International Conference on Natural Language Processing, Goa, India, pp. 1–7.
  • J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. CoRR abs/1705.03122.
  • J. Howard and S. Ruder (2018) Fine-tuned language models for text classification. CoRR abs/1801.06146.
  • N. Jain and R. A. Bhat (2014) Language identification in code-switching scenario. In Proceedings of the First Workshop on Computational Approaches to Code Switching, Doha, Qatar, pp. 87–93.
  • J. Lafferty, A. McCallum, and F. C. Pereira (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data.
  • M. Mager, Ö. Çetinoglu, and K. Kann (2019) Subword-level language identification for intra-word code-switching. CoRR abs/1904.01989.
  • D. Mave, S. Maharjan, and T. Solorio (2018) Language identification and analysis of code-switched social media text. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, Melbourne, Australia, pp. 51–61.
  • G. Molina, F. AlGhamdi, M. Ghoneim, A. Hawwari, N. Rey-Villamizar, M. Diab, and T. Solorio (2016) Overview for the second shared task on language identification in code-switched data. In Proceedings of the Second Workshop on Computational Approaches to Code Switching, Austin, Texas, pp. 40–49.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237.
  • T. Solorio, E. Blair, S. Maharjan, S. Bethard, M. Diab, M. Ghoneim, A. Hawwari, F. AlGhamdi, J. Hirschberg, A. Chang, and P. Fung (2014) Overview for the first shared task on language identification in code-switched data. In Proceedings of the First Workshop on Computational Approaches to Code Switching, Doha, Qatar, pp. 62–72.
  • R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Highway networks. CoRR abs/1505.00387.
  • S. Trivedi, H. Rangwani, and A. Kumar Singh (2018) IIT (BHU) submission for the ACL shared task on named entity recognition on code-switched data. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, Melbourne, Australia, pp. 148–153.
  • C. Wang, K. Cho, and D. Kiela (2018) Code-switched named entity recognition with embedding attention. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, Melbourne, Australia, pp. 154–158.
  • G. I. Winata, C. Wu, A. Madotto, and P. Fung (2018) Bilingual character representation for efficiently addressing out-of-vocabulary words in code-switching named entity recognition. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, Melbourne, Australia, pp. 110–114.
  • Z. Yirmibeşoğlu and G. Eryiğit (2018) Detecting code-switching between Turkish-English language pair. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, Brussels, Belgium, pp. 110–115.

Appendix for “From English to Code-Switching: Transfer Learning with Strong Morphological Clues”

Appendix A Language Identification Distributions

Table 7 shows the distribution of the language identification labels across the CALCS datasets.

Labels Nep-Eng ‘14 Spa-Eng ‘16 Hin-Eng ‘18
lang1 71,148 112,579 84,752
lang2 64,534 119,408 29,958
other 45,286 55,768 21,725
ne 5,053 5,693 9,657
ambiguous 126 404 13
mixed 177 54 58
fw 0 30 542
unk 0 325 17
Table 7: Label distribution for LID datasets.

We use the utterance-level distribution shown in Table 8 to compute the CMI (see Section 3.1).

Labels Nep-Eng ‘14 Spa-Eng ‘16 Hin-Eng ‘18
CS 9,868 8,733 3,237
lang1 1,374 8,427 3,842
lang2 1,614 7,273 298
other 11 697 44
Table 8: Utterance-level language distribution for the language identification datasets.

Appendix B Parts-of-Speech Label Distribution

Table 9 shows the distribution of the POS tags for Hindi-English. This dataset corresponds to the POS tagging experiments in Section 3.2.

POS Labels Train Dev Test
G_N 10,318 1,767 1,601
G_V 5,846 933 839
G_X 5,049 795 732
G_PRP 2,839 432 341
PSP 2,351 425 296
G_J 1,828 298 346
DT 1,395 190 82
G_R 1,316 208 139
G_PRT 1,044 162 213
CC 814 115 67
E 289 55 53
G_SYM 288 32 45
U 249 51 43
855 138 196
# 509 78 86
$ 388 51 42
~ 29 4 1
null 2 1 8
Table 9: The POS tag distribution for Hindi-English.

Appendix C Named Entity Recognition Label Distribution

Table 10 shows the distribution of the NER labels for Spanish-English. This dataset corresponds to the NER experiments in Section 3.2.

NER Classes Train Dev Test
Person 6,226 95 1,888
Location 4,323 16 803
Organization 1,381 10 307
Group 1,024 5 153
Title 1,980 50 542
Product 1,885 21 481
Event 557 6 99
Time 786 9 197
Other 382 7 62
NE Tokens 18,544 219 4,532
O Tokens 614,013 9,364 178,479
Tweets 50,757 832 15,634
Table 10: The distribution of labels for the Spanish-English NER dataset from CALCS 2018.
Figure 4: Visualization of the attention weights at the trigram level for the Hindi-English 2018 dataset on the language identification task. The boxes contain the trigrams of the word below them. We also provide the predicted label by the model, and whether it was correct or wrong.

Appendix D Visualization of Attention Weights for Trigrams: Hindi-English 2018

Figure 4 shows the attention behavior for trigrams on the Hindi-English dataset.
