From English to Code-Switching: Transfer Learning
with Strong Morphological Clues
Code-switching is still an understudied phenomenon in natural language processing mainly because of two related challenges: it lacks annotated data, and it combines a vast diversity of low-resource languages. Despite the language diversity, many code-switching scenarios occur in language pairs, and English is often a common factor among them. In the first part of this paper, we use transfer learning from English to English-paired code-switched languages for the language identification (LID) task by applying two simple yet effective techniques: 1) a hierarchical attention mechanism that enhances morphological clues from character n-grams, and 2) a secondary loss that forces the model to learn n-gram representations that are particular to the languages involved. We use the bottom layers of the ELMo architecture to learn these morphological clues by essentially recognizing what is and what is not English. Our approach outperforms the previous state of the art on Nepali-English, Spanish-English, and Hindi-English datasets. In the second part of the paper, we use our best LID models for the tasks of Spanish-English named entity recognition and Hindi-English part-of-speech tagging by replacing their inference layers and retraining them. We show that our retrained models are capable of using the code-switching information on both tasks to outperform models that do not have such knowledge.
1 Introduction

While code-switching is a common phenomenon among multilingual speakers, it is still considered an understudied area in the field of natural language processing (NLP). The main reason is the lack of annotated data combined with the high diversity of languages in which this phenomenon can occur. However, social media has captured this linguistic phenomenon in written language, making code-switched data more accessible to researchers. Nevertheless, annotating this data is expensive, and the annotation process requires fluent speakers of the languages involved. Additionally, not all languages have the same incidence and predominance, making annotation impractical and expensive for every combination of languages.
Even though code-switching can occur with multiple languages in the same utterance, it is more common to see this phenomenon in language pairs. Many of those language pairs have English as a common factor. Given that most of the NLP efforts are English-centric, it is reasonable to explore approaches where English-based models can be tailored to perform on code-switching settings.
In this paper, we study the code-switching phenomenon using English as an anchor language for multiple code-switched language pairs. In the first part of the paper, we focus on the task of language identification (LID) using ELMo Peters et al. (2018) as our English source. Our hypothesis is that English-based models should be able to recognize what is and what is not English when they are retrained. To accomplish that, we rely on the ELMo character-based architecture. We enhance the character convolutions in its core to detect morphological clues based on different n-gram orders. In addition, we refine these morphological clues using a hierarchical attention mechanism. The enhanced character n-gram representations are used to 1) compute a secondary loss according to whether a token is English or not, and 2) inject morphological clues at the top of the model before performing inference for LID. Our models consistently outperform the LID state of the art on the Nepali-English Solorio et al. (2014), Spanish-English Molina et al. (2016), and Hindi-English Mave et al. (2018) datasets. In the second part of the paper, we transfer the learning of our best LID models to tasks such as named entity recognition (NER) for Spanish-English Aguilar et al. (2018) and part-of-speech (POS) tagging for Hindi-English Das (2016). We evaluate the impact of the pretrained code-switching knowledge for both NER and POS tagging compared to models that do not have such information. We also report a new state of the art on the NER task.
Our contributions can be summarized as follows: 1) framing code-switching tasks around an anchor high-resource language (i.e., English) and effectively using transfer learning to overcome the challenges of low-resource code-switched language pairs, 2) outperforming previous state-of-the-art results on LID for three datasets with different language pairs, and 3) transferring the code-switched knowledge to tasks such as NER and POS tagging, and establishing a new state of the art on NER by adapting fine-tuned LID models. We also release the source code to the NLP community to allow replicability of all the experiments.
2 Methodology

We provide a cursory overview of the ELMo architecture (Section 2.1), which serves as background for a detailed description of the enhanced character n-gram representation (Section 2.2). Then, we define the overall sequence tagger architecture in Section 2.3 and describe our training procedure in Section 2.4.
2.1 ELMo: A Cursory Overview
ELMo is a character-based language model that provides deep contextualized word representations Peters et al. (2018). It has established a new state of the art in a wide variety of NLP applications by effectively capturing word contexts. The shaded box in Figure 2 describes the most important components of ELMo: character CNNs, a highway network Srivastava et al. (2015) with linear projection, and bidirectional LSTMs. The CNN layers are applied over the character embeddings to produce morphological features using different kernel widths. The resulting feature maps are max-pooled and flattened to obtain a single vector per word out of its characters. The flattened vector is passed to the highway network and projected to a smaller dimensional space. Essentially, this word vector only accounts for local features without context. The global contextual information is provided by the bidirectional LSTM layers, which also yield the final word representations. We refer to Peters et al. (2018) for more details (https://allennlp.org/elmo).
We choose ELMo for code-switched language identification because 1) it has been extensively trained on English data as a general-purpose language model, which is essential to adapt to the idea of recognizing what is and what is not English, 2) it extracts morphological information out of characters, which is crucial since certain combinations of characters can determine if a word is of one language or another, and 3) it generates powerful word representations that account for multiple meanings depending on the context.
2.2 Enhanced Character N-grams
As described in Section 2.1, ELMo convolves character embeddings in its first layers. We use the resulting feature maps from such convolutions before they are max-pooled. These feature maps are essentially n-gram representations whose feature dimension depends on the number of channels of each convolution layer. The order of the n-grams is determined by the kernel widths of the CNNs. Figure 1 shows kernel widths of 2 (bi-grams) and 3 (tri-grams) for simplicity, but ELMo uses kernel widths in the set {1, 2, 3, 4, 5, 6, 7}. We are interested in the resulting vectors because they effectively capture morphological patterns at different orders of n-grams; these vectors can ultimately describe particularities of the word morphology in the code-switched languages.
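The extraction of per-position n-gram vectors from character embeddings can be sketched in NumPy. This is a minimal illustration rather than ELMo's implementation: the filter weights are random stand-ins for the pretrained convolutions, and the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def char_ngram_feature_maps(char_embs, kernel_widths, channels):
    """Convolve character embeddings with one filter bank per n-gram order.

    char_embs: (num_chars, emb_dim) matrix for a single word.
    Returns a dict mapping each kernel width n to a matrix of shape
    (num_chars - n + 1, channels) whose rows are n-gram vectors; these are
    the feature maps taken *before* any max-pooling.
    """
    num_chars, emb_dim = char_embs.shape
    feature_maps = {}
    for n in kernel_widths:
        # Random weights stand in for pretrained convolution filters.
        W = rng.standard_normal((channels, n * emb_dim)) * 0.1
        rows = [
            np.tanh(W @ char_embs[i:i + n].ravel())
            for i in range(num_chars - n + 1)
        ]
        feature_maps[n] = np.stack(rows)
    return feature_maps

word = rng.standard_normal((6, 16))          # a 6-character word
maps = char_ngram_feature_maps(word, kernel_widths=(2, 3), channels=32)
print(maps[2].shape, maps[3].shape)          # (5, 32) (4, 32)
```

Each row of `maps[n]` corresponds to one n-gram occurrence in the word, which is exactly the granularity the attention mechanism below operates on.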
2.2.1 Position Embeddings
Convolutional networks are known to lose spatial information: they only capture patterns from a small section of the input within a window, regardless of where those patterns occur. In our case, this results in n-gram vectors that are not tied to the sequential order in which they appear. Consider the tri-grams of the English word tuning and the Spanish word ingeniero (engineer). Even though the suffix -ing is more frequent in English than in Spanish, the convolutions for both words will produce the same vector for the tri-gram ing, ignoring its place in each word. To avoid this behavior, we provide the model with position embeddings Gehring et al. (2017).
Consider the sequence of character n-grams $S_n = (s_1, \dots, s_k)$ of order $n$, where $s_i \in \mathbb{R}^{c}$ is an n-gram vector from the character convolutions and $c$ is the number of output channels. Also, consider a set of position embedding matrices $P = \{P_1, \dots, P_N\}$ defined as $P_n \in \mathbb{R}^{l \times e}$, where $l$ is the maximum length of characters in a word and $e$ is the dimension of the embeddings. Then, the position vectors $(p_1, \dots, p_k)$ for the sequence $S_n$ are defined such that $p_i$ is the $i$-th vector from the position embedding matrix $P_n$. We use $e = c$ to enable the addition of the position embeddings and the n-gram vectors. Thus, the resulting set of n-grams that account for positional information is given by $\hat{s}_i = s_i + p_i$. Figure 1 shows the position embeddings for bi-grams and tri-grams.
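A minimal NumPy sketch of this addition, assuming $e = c$ so the vectors can be summed; the table names and sizes are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

max_word_len, channels = 20, 32      # l and c from the text, with e = c

# One learned position-embedding matrix per n-gram order (here orders 2, 3).
position_tables = {n: rng.standard_normal((max_word_len, channels)) * 0.1
                   for n in (2, 3)}

def add_positions(ngram_vectors, order):
    """s_hat_i = s_i + p_i: tie each n-gram vector to where it occurs."""
    k = ngram_vectors.shape[0]
    return ngram_vectors + position_tables[order][:k]

trigram_vecs = rng.standard_normal((4, channels))   # k = 4 tri-grams
enhanced = add_positions(trigram_vecs, order=3)
# The same tri-gram (e.g., "ing") now yields different vectors depending on
# whether it appears word-initially or as a suffix.
print(enhanced.shape)
```

In training, the position tables would be learned parameters rather than fixed random matrices.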
2.2.2 Hierarchical Attention
ELMo down-samples the outputs of its convolutional layers by max-pooling over the feature maps. Because of the nature of code-switching data, we argue that this operation is not ideal to adapt to new morphological patterns from other languages as the model will tend to discard patterns from languages other than English. Instead, we propose a hierarchical attention mechanism that down-samples by prioritizing the n-gram features for 1) every n-gram order separately, and then 2) across the different n-gram orders jointly as described in Figure 1.
We use an attention mechanism similar to the one introduced by Bahdanau et al. (2014). The idea is to concentrate probability mass on the tokens that capture the most relevant information along the input. Our attention mechanism uses the following equations:

$$u_i = \tanh(W x_i) \quad (1)$$

$$\alpha_i = \frac{\exp(v^\top u_i)}{\sum_{j=1}^{k} \exp(v^\top u_j)} \quad (2)$$

$$z = \sum_{i=1}^{k} \alpha_i x_i \quad (3)$$

where $W \in \mathbb{R}^{d_a \times d_h}$ is a projection matrix, $d_a$ is the dimension of the attention space, and $d_h$ is the dimension of the input features. $v \in \mathbb{R}^{d_a}$ is the attention vector to be learned, and $\alpha_i$ is a scalar that describes the attention probability associated with the $i$-th token. $z$ is the weighted sum of the input token vectors $x_i$ and the attention probabilities. Note that this mechanism is used independently for every order of n-grams, resulting in a set of vectors $\{z_1, \dots, z_N\}$ from Equation 3. This allows the model to find relevant information across individual n-gram orders first (e.g., all bi-grams, all tri-grams, etc.). Then, we apply another instance (i.e., another set of parameters) of the same attention mechanism across all the n-gram vectors $z_n$. The resulting vector is what we call the enhanced character n-gram representation (see Figure 1).
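The two-level pooling can be sketched as follows: `attention_pool` implements the projection, softmax, and weighted sum described above, and the shapes, weights, and names are illustrative stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def attention_pool(X, W, v):
    """u_i = tanh(W x_i); alpha = softmax(v . u_i); z = sum_i alpha_i x_i."""
    U = np.tanh(X @ W.T)                  # (k, d_a)
    scores = U @ v                        # (k,)
    alpha = np.exp(scores - scores.max()) # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ X                      # (d_h,)

d_h, d_a = 32, 16
low_W, low_v = rng.standard_normal((d_a, d_h)), rng.standard_normal(d_a)
high_W, high_v = rng.standard_normal((d_a, d_h)), rng.standard_normal(d_a)

# Low level: pool within each n-gram order separately.
ngram_maps = {n: rng.standard_normal((7 - n, d_h)) for n in (2, 3, 4)}
z_per_order = np.stack([attention_pool(M, low_W, low_v)
                        for M in ngram_maps.values()])

# High level: a second parameter set pools across the per-order summaries.
enhanced_repr = attention_pool(z_per_order, high_W, high_v)
print(enhanced_repr.shape)
```

Unlike max-pooling, every n-gram contributes to the pooled vector in proportion to its attention weight, so patterns from the non-English language are not simply discarded.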
2.3 Sequence Tagging
Our proposed sequence tagger is built upon ELMo and the enhanced character n-gram mechanism, as shown in Figure 2. The ELMo word representations are concatenated with word embeddings from Twitter Pennington et al. (2014) and fastText Bojanowski et al. (2017), following the approaches of Howard and Ruder (2018) and Mave et al. (2018). The resulting word representations are fed into a bidirectional LSTM. Since we want the model to emphasize the enhanced character n-gram representations, we concatenate such vectors with the output of the BLSTM layer before passing information to the inference layer. We use a conditional random field Lafferty et al. (2001) for the inference layer.
2.4 Training

Multi-task learning. We train the model by minimizing the negative log-likelihood loss of the conditional random field classifier:

$$\mathcal{L}_{task}(\mathbf{y}) = -\log p(\mathbf{y} \mid \mathbf{x}) \quad (4)$$

We also force the model to minimize a secondary loss determined by the task of recognizing whether a token is in English, partially English, or another language based solely on morphological clues (see the softmax layer in Figure 2). The overall loss of our model is defined as follows:

$$\mathcal{L} = \mathcal{L}_{lid} + \mathcal{L}_{eng} + \lambda \lVert \theta \rVert_2^2 \quad (5)$$

where $\mathcal{L}_{lid}$ and $\mathcal{L}_{eng}$ are the negative log-likelihood losses defined by Equation 4. $\mathcal{L}_{lid}$ determines the primary loss for the LID task, whereas $\mathcal{L}_{eng}$ is the secondary loss for the English, partially-English, or not-English task. The third term accounts for L2 regularization, and $\lambda$ is the penalty weight.
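A plain-Python sketch of combining the two task losses with the regularization term; the function and argument names are ours, and the penalty weight is an illustrative value, not the one used in the experiments.

```python
import numpy as np

def total_loss(nll_lid, nll_eng, params, weight_decay=1e-4):
    """Overall loss: primary LID NLL + secondary English/not-English NLL
    + lambda * squared L2 norm of all parameters."""
    l2 = sum(float(np.sum(p ** 2)) for p in params)
    return nll_lid + nll_eng + weight_decay * l2

# Toy usage: two scalar NLL values and one parameter tensor.
params = [np.ones(4)]                       # ||theta||^2 = 4
loss = total_loss(1.5, 0.5, params, weight_decay=0.1)
print(loss)                                 # 1.5 + 0.5 + 0.1 * 4 = 2.4
```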
Fine-tuning. We fine-tune the model by progressively updating the parameters from the top to the bottom layers of the model. This avoids losing the pre-trained knowledge from ELMo and smoothly adapts the network to the new languages in the code-switched data. We use the slanted triangular learning rate scheduler with both gradual unfreezing and discriminative fine-tuning over the layers (i.e., different learning rates across layers), as proposed by Howard and Ruder (2018). We group the non-ELMo parameters of our model apart from the ELMo parameters. We set the non-ELMo parameters as the first group of parameters to be tuned (i.e., parameters from the enhanced character n-grams, CRF, and BLSTM). Then, we further group the ELMo parameters as follows (top to bottom): 1) the second bidirectional LSTM layer, 2) the first bidirectional LSTM layer, 3) the highway network, 4) the linear projection from flattened convolutions to the token embedding space, 5) all the convolutional layers, and 6) the character embedding weights. Once all the layers have been unfrozen, we update all the parameters together. This technique allows us to get the most out of our model when moving from English to a code-switching setting.
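The unfreezing order and per-layer learning rates described above can be sketched as follows. The decay factor of 2.6 is a common ULMFiT-style choice and, like the group names and base rate, is illustrative rather than the exact configuration used in the experiments.

```python
# Parameter groups ordered from first unfrozen (top) to last (bottom),
# following the grouping described in the text.
groups = [
    "non-ELMo (enhanced n-grams, BLSTM, CRF)",
    "ELMo BLSTM layer 2",
    "ELMo BLSTM layer 1",
    "highway network",
    "linear projection",
    "convolutional layers",
    "character embeddings",
]

def finetune_plan(base_lr=1e-3, decay=2.6):
    """Gradual unfreezing: group g becomes trainable at stage g.
    Discriminative fine-tuning: lower layers get smaller learning rates."""
    plan = []
    for g, name in enumerate(groups):
        plan.append({"group": name,
                     "unfreeze_stage": g,
                     "lr": base_lr / (decay ** g)})
    return plan

for row in finetune_plan():
    print(row)
```

Once the last group is unfrozen, all parameters are updated together, each at its group's learning rate.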
3 Experiments

We describe our experiments for language identification (LID) in Section 3.1. We then take the best LID models and transfer the code-switching learning to named entity recognition and part-of-speech tagging in Section 3.2.
3.1 Language Identification
Table 3: Language identification results (weighted F1) on the development and test sets of each language pair (* indicates the best scores).

| Exp | Model | Nep-Eng Dev | Nep-Eng Test | Spa-Eng Dev | Spa-Eng Test | Hin-Eng Dev | Hin-Eng Test |
|---|---|---|---|---|---|---|---|
| | *Baseline: ELMo combined with LSTM and CRF* | | | | | | |
| Exp 1.0 | ELMo (frozen) | 89.387 | 87.849 | 87.039 | 86.935 | 86.446 | 87.659 |
| Exp 1.1 | ELMo (unfrozen) | 96.192 | 95.700 | 95.508 | 96.363 | 95.997 | 96.420 |
| Exp 1.2 | ELMo (unfrozen) + LSTM | 96.279 | 95.904 | 95.564 | 96.596 | 96.489 | 96.694 |
| Exp 1.3 | ELMo (unfrozen) + LSTM + CRF | 96.320 | 95.882 | 95.615 | 96.748 | 96.545 | 96.717 |
| | *Approach 1: Adding other English embeddings (upon Exp 1.3)* | | | | | | |
| Exp 2.2 | Twitter + fastText | 96.423 | 96.051 | 96.014 | 97.030 | 96.457 | 96.624 |
| | *Approach 2: N-gram-based MTL from convolutions (upon Exp 2.2)* | | | | | | |
| Exp 3.0 | Flatten convolutions | 96.535 | 95.986 | 96.061 | 97.048 | 96.418 | 96.753 |
| Exp 3.1 | Attention on each n-gram (low level) | 96.588 | 95.937 | 96.089 | 97.140 | 96.572 | 96.950 |
| Exp 3.2 | Attention across n-grams (high level) | 96.555 | 96.029 | 96.079 | 97.087 | 96.497 | 96.874 |
| Exp 3.3 | Hierarchical attention (low and high level) | 96.610 | 96.032 | 96.127 | 97.242 | 96.588 | 96.972 |
| Exp 3.4 | Hierarchical attention + position emb. | 96.690 | 96.170 | 96.117 | 97.350 | 96.513 | 96.833 |
| Exp 3.5 | Hierarchical attention + position emb. | 96.712 | 96.203 | 96.202 | 97.553 | 96.636 | 96.960 |
| Exp 3.6 | Hierarchical attention | 96.614 | 96.044 | 96.131 | 97.325 | 96.776* | 97.001* |
| | *Approach 3: Fine-tuning parameters (upon Exp 3.5 for Nep-Eng and Spa-Eng; upon Exp 3.6 for Hin-Eng)* | | | | | | |
| Exp 4.1 | STLR + grad. unfreezing | 96.661 | 96.111 | 96.247 | 97.536 | 96.657 | 96.815 |
| Exp 4.2 | STLR + discr. fine-tuning | 96.643 | 96.189 | 95.665 | 96.603 | 95.596 | 96.218 |
| Exp 4.3 | STLR + grad. unfreezing + discr. fine-tuning | 96.755* | 96.504* | 96.408* | 97.690* | 96.194 | 96.528 |
| | *Previous best published results* | | | | | | |
| | Mave et al. (2018) | - | - | 96.510 | 97.060 | 96.6045 | 96.840 |
Datasets. We use three code-switching datasets for the language identification (LID) task. The first one contains Nepali-English data from the Computational Approaches to Linguistic Code-Switching 2014 (CALCS) workshop Solorio et al. (2014). The second one has Spanish-English data from CALCS 2016 Molina et al. (2016). The last one uses Hindi-English data, which was introduced by Mave et al. (2018). The three datasets follow the CALCS annotation scheme, which contains the labels lang1, lang2, other, ne, ambiguous, and mixed. English is used for the lang1 label, and Nepali, Spanish, or Hindi map to lang2 depending on the dataset. We show the distribution of lang1 and lang2 in Table 1. Additionally, we use the Code-Mixing Index (CMI) Gambäck and Das (2014) to measure the level of code mixing in these corpora (see Table 2). A higher CMI value indicates a higher level of language mixing, which makes language identification more difficult. We compute code-mixing in each utterance and average over all utterances in the corpus (CMI-all), as well as over only the code-switched instances (CMI-mixed). We observe that the Nepali-English corpus has a higher level of mixing than the other two language pairs.
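A minimal sketch of the per-utterance CMI computation following Gambäck and Das (2014). It assumes tags other than lang1/lang2 count as language-independent tokens (a simplification of the full scheme); the corpus-level CMI-all and CMI-mixed values would average this score over the corresponding utterances.

```python
def cmi(tokens):
    """Code-Mixing Index of one utterance.

    tokens: list of language tags, e.g. ["lang1", "lang2", "other"].
    CMI = 100 * (1 - max_i(w_i) / (n - u)), where n is the utterance length,
    u is the count of language-independent tokens, and w_i is the token
    count of language i. Returns 0 for fully language-independent input.
    """
    n = len(tokens)
    lang_counts = {}
    u = 0
    for t in tokens:
        if t in ("lang1", "lang2"):
            lang_counts[t] = lang_counts.get(t, 0) + 1
        else:
            u += 1
    if n == u:                    # no language-specific tokens
        return 0.0
    return 100.0 * (1.0 - max(lang_counts.values()) / (n - u))

print(cmi(["lang1", "lang1", "lang2", "other"]))   # mixed utterance
print(cmi(["lang1", "lang1", "other"]))            # monolingual -> 0.0
```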
Baselines. We define our baselines using the ELMo architecture combined with bidirectional LSTM and CRF on top. As shown in Table 3, ELMo needs to update its parameters during training to perform well (see Exp 1.0 and Exp 1.1). Further improvements are shown when the model uses LSTM and CRF, which is consistent with previous research for sequence labeling tasks Akbik et al. (2018); Peters et al. (2018); Devlin et al. (2018). Hence, we establish experiment Exp 1.3 as a baseline and build upon this model.
Approach 1. In the second set of experiments, we incorporate word embeddings from fastText and Twitter. The idea is to provide the model with English-based knowledge that is better suited to the social media aspects of the data (e.g., misspellings, subword-level information, Twitter expressions, etc.). As such, this information complements the ELMo word embeddings. In fact, the results consistently improve across the datasets when we add these word embeddings, with the combination of fastText and Twitter embeddings performing best (see Table 3).
Approach 2. We focus on leveraging morphological clues from character n-grams by using multi-task learning. We max-pool and flatten the convolutions from ELMo and feed them into a secondary task (Exp 3.0). This experiment shows improvements over the single-task model, which emphasizes the importance of character n-grams for LID. More elaborate ways of down-sampling the convolutions further improve performance. For instance, adding attention at each n-gram (Exp 3.1), across n-grams (Exp 3.2), or both mechanisms as in the hierarchical attention (Exp 3.3) helps the model perform better across the datasets. However, when we provide position embeddings to tell the model where the character n-grams occur, performance on the Hindi-English data drops slightly while the Spanish-English and Nepali-English models further improve. Our intuition is that the overlap among character n-grams from one language to another is a key factor in determining the usefulness of the position embeddings. In fact, Spanish-English data overlaps about 15% more at each n-gram order than Hindi-English data Mave et al. (2018). Lastly, we concatenate the enhanced n-gram representations with the output of the LSTM before the CRF layer. This makes the morphological features more prominent to the model at inference time, which yields improvements across datasets.
Table 4: Comparison with the previous state of the art per language pair.

| LID System | lang1 | lang2 | WA F1 |
|---|---|---|---|
| *Nepali-English* | | | |
| Al-Badrashiny and Diab (2016) | 97.6 | 97.0 | 97.3 |
| Ours (Exp 4.3) | 98.124 | 95.170 | 97.387 |
| *Spanish-English* | | | |
| Al-Badrashiny and Diab (2016) | 88.6 | 96.9 | 95.2 |
| Jain and Bhat (2014) | 92.3 | 96.9 | 96.0 |
| Mave et al. (2018) | 93.184 | 98.118 | 96.840 |
| Ours (Exp 4.3) | 94.802 | 98.575 | 97.894 |
| *Hindi-English* | | | |
| Mave et al. (2018) | 98.241 | 95.657 | 97.596 |
| Ours (Exp 3.6) | 98.315 | 95.737 | 97.672 |
Approach 3. In this set of experiments, we try to smooth the transition from English to the code-switching setting by fine-tuning the best model architectures obtained from the previous experiments. As shown in Table 3, we use the slanted triangular learning rate (STLR) scheduler Howard and Ruder (2018) in all the experiments of this set. This is a linear scheduler that rapidly increases the learning rate to lead the model to a good parameter space for the task, and then slowly decreases the learning rate to allow the model to converge. For Spanish-English and Nepali-English, we get the best scores by using gradual unfreezing and discriminative fine-tuning (Exp 4.3). In the case of the Spanish-English data, while the dataset contains more English than Spanish, the model still needs a smooth transition to the code-switching setting because both languages share the same Latin root, which makes very similar words hard to discriminate. For Nepali-English, the dataset has significantly more Nepali than English, which forces the model to learn more about Nepali while still keeping the English knowledge. For Hindi-English, we did not get better performance with the fine-tuning approach than what we got in Exp 3.6. Since this dataset has more English than Hindi tokens, the model tends to prioritize the English knowledge, reducing the effect of adapting to Hindi. (The percentage of language tokens in the training set is 57.51% for English and 20.6% for Hindi.)
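The STLR schedule can be sketched as follows, using the default hyperparameters from Howard and Ruder (2018) (cut_frac = 0.1, ratio = 32) as illustrative values.

```python
def stlr(t, T, cut_frac=0.1, ratio=32, lr_max=1e-3):
    """Slanted triangular learning rate: a short linear warm-up to lr_max
    over the first cut_frac fraction of the T training steps, followed by
    a long linear decay down to lr_max / ratio."""
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut                                   # warm-up phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decay phase
    return lr_max * (1 + p * (ratio - 1)) / ratio

T = 1000
lrs = [stlr(t, T) for t in range(T)]
print(max(lrs))   # peak equals lr_max, reached at step T * cut_frac
```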
Attention analysis. Figure 3 shows the attention weights for trigrams in the Spanish-English dataset. The model is able to pick up suffixes that belong to one language or the other. In the case of the word coming, the trigram -ing is common in English for verbs in the present progressive tense. This means that the -ing suffix is often placed at the end of the word, making use of positional information. For the words including the trigrams aha and hah, the position does not provide any additional help. This is the case when a combination of trigrams tends to appear in only one language, so observing the n-gram alone is sufficient for the model. We observe similar instances in the Hindi-English dataset, where the model learns trigrams like -ing and -ian for English, and iye and isi for Hindi (see Appendix D).
Error analysis. Morphological information is very useful for the task of language identification. However, when the language pair shares the same root, the words' surface forms are so similar that it becomes very difficult for the model to discriminate between them. In Figure 3, the word miserable has exactly the same spelling in Spanish and English, and this high level of ambiguity confuses the model. We find similar cases for Hindi-English, where a good number of the words mislabeled by the model are due to spellings common to both languages (e.g., me, to, use). We also observe some examples with inconsistencies in the gold language labels, where our model predicts the correct language label.
3.2 Transfer Code-Switching Learning
| NER System | Dev F1 | Test F1 |
|---|---|---|
| ELMo + BLSTM + CRF | 59.91 | 62.53 |
| Exp 4.3: No CS | 59.87 | 62.42 |
| Exp 4.3: CS + inference retrained | 60.57 | 63.87 |
| Exp 4.3: CS + fully retrained | 61.03 | 66.06 |
| Exp 4.3: CS + fully retrained + MTL | 61.32 | 66.69 |
| Trivedi et al. (2018) | - | 63.76 |
We evaluate how useful the code-switching information is by transferring the knowledge to named entity recognition and part-of-speech tagging. Our experimental settings mainly focus on 1) training the best LID architecture without the pre-trained code-switching knowledge, 2) retraining only the inference layer on top of the pre-trained code-switching knowledge, and 3) making the model fully trainable with the code-switching knowledge included. We also explore potential improvements when language identification annotations are provided, given that our models can compute an LID loss.
Named Entity Recognition. We use the dataset from CALCS 2018 Aguilar et al. (2018), which has the labels person, location, organization, group, title, product, event, time, and other. The label distribution is detailed in Appendix C. As shown in Table 5, training the model with rather than without the code-switching knowledge yields a difference of about 1% F1 on the test set. Further improvements are achieved by using a fully trainable model that contains the code-switching knowledge. Note that the last experiment achieves the best results when the secondary loss is used. This is an advantage of our model architecture: it provides a regularization effect when such labels are available.
| POS System | Dev F1 | Test F1 |
|---|---|---|
| Exp 3.6: No CS | 80.21 | 72.14 |
| Exp 3.6: CS + inference retrained | 80.89 | 72.53 |
| Exp 3.6: CS + fully retrained | 81.92 | 74.02 |
| Exp 3.6: CS + fully retrained + MTL | 82.18 | 74.84 |
Part-of-Speech Tagging. We use the Hindi-English POS tagging dataset from the ICON 2016 contest Das (2016). The data distribution is provided in Appendix B. (Unfortunately, we could not compare with the previously reported scores on this dataset because the data distribution does not match the one the authors provided.) Similar to the behavior on the NER task, we see improvements when the code-switching knowledge is present. In fact, Table 6 shows that the same architecture performs around 2% better on the F1 metric for both the validation and test sets when comparing the models with and without code-switching knowledge. This further supports our claim that the code-switching information learned by the LID models is useful when we retrain the models for different tasks.
4 Related Work
Transfer learning has become more practical in recent years, making it possible to apply very large neural networks to tasks where annotated data is limited Peters et al. (2018); Bahdanau et al. (2014); Devlin et al. (2018); Howard and Ruder (2018). Code-switching-related tasks are usually framed as low-resource problems because they involve a large number of languages that lack annotated data or pre-trained models in such domains. In fact, researchers have mainly focused on traditional machine learning techniques because they perform better than deep learning models given the data constraints Mave et al. (2018); Yirmibeşoğlu and Eryiğit (2018); Al-Badrashiny and Diab (2016). Even though transfer learning has not yet been vastly explored for code-switching, some researchers have applied it from monolingual to code-switching tasks such as named entity recognition Trivedi et al. (2018); Winata et al. (2018). These works serve as evidence that transfer learning can improve code-switching tasks, potentially overcoming the absence of annotated data.
Deep learning approaches such as LSTM-based and CNN-based models have recently been explored for code-switching tasks Ball and Garrette (2018); Mager et al. (2019). However, we notice that the code-switching literature barely covers attention mechanisms. Attention was introduced by Bahdanau et al. (2014) for machine translation. Since then, it has been broadly used in many other applications such as semantic slot filling and sentiment analysis. For code-switching, Wang et al. (2018) proposed a gate-based attention mechanism that chooses monolingual embeddings from one language or the other according to the input. This work shows the potential of attention in code-switching settings. We employ a different attention component, more similar to that of Bahdanau et al. (2014), to handle the down-sampling of character convolutions without losing essential information (in contrast to the behavior of max-pooling operations). Even though this has not been explored in the literature, we believe that such an approach aligns better with the linguistic nature of code-switching, where morphology and character n-grams play a significant role in identifying the languages involved.
Another aspect that has not been considered for code-switching is position embeddings. Position embeddings combined with CNNs have proved useful Gehring et al. (2017); they help to localize the non-spatial features extracted by convolutional networks. We apply the same principle to our code-switching data: we argue that character n-grams without position information may not be enough for a model to learn the actual morphological aspects of the languages. We empirically validate those aspects and discuss the effect of this mechanism in our experiments.
5 Conclusion

We explored transfer learning from a high-resource language, English, to code-switched language pairs. Our experiments demonstrate that transfer learning enables large pre-trained models to be adapted to code-switching settings, where we can take advantage of the pre-trained knowledge. We established a new state of the art on language identification for the Nepali-English, Spanish-English, and Hindi-English datasets. Moreover, we explored to what extent these models can be reused by evaluating them on other tasks, namely named entity recognition and part-of-speech tagging for code-switched text. We found that, without requiring any preprocessing, the knowledge learned by the LID models can be successfully transferred to these tasks, achieving state-of-the-art results for Spanish-English NER and competitive results on POS tagging for Hindi-English.
Acknowledgments

We thank Deepthi Mave for providing general statistics of the code-switching datasets.
References

- Aguilar et al. (2018). Named entity recognition on code-switched data: overview of the CALCS 2018 shared task. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, Melbourne, Australia, pp. 138–147.
- Akbik et al. (2018). Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649.
- Al-Badrashiny and Diab (2016). LILI: a simple language independent approach for language identification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 1211–1219.
- Bahdanau et al. (2014). Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
- Ball and Garrette (2018). Part-of-speech tagging for code-switched, transliterated texts without explicit language identification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3084–3089.
- Bojanowski et al. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146.
- Das (2016). Tool contest on POS tagging for code-mixed Indian social media (Facebook, Twitter, and WhatsApp) text. Retrieved 05-10-2019.
- Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Gambäck and Das (2014). On measuring the complexity of code-mixing. In Proceedings of the 11th International Conference on Natural Language Processing, Goa, India, pp. 1–7.
- Gehring et al. (2017). Convolutional sequence to sequence learning. CoRR abs/1705.03122.
- Howard and Ruder (2018). Fine-tuned language models for text classification. CoRR abs/1801.06146.
- Jain and Bhat (2014). Language identification in code-switching scenario. In Proceedings of the First Workshop on Computational Approaches to Code Switching, Doha, Qatar, pp. 87–93.
- Lafferty et al. (2001). Conditional random fields: probabilistic models for segmenting and labeling sequence data.
- Mager et al. (2019). Subword-level language identification for intra-word code-switching. CoRR abs/1904.01989.
- Mave et al. (2018). Language identification and analysis of code-switched social media text. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, Melbourne, Australia, pp. 51–61.
- Molina et al. (2016). Overview for the second shared task on language identification in code-switched data. In Proceedings of the Second Workshop on Computational Approaches to Code Switching, Austin, Texas, pp. 40–49.
- Pennington et al. (2014). GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
- Peters et al. (2018). Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237.
- Solorio et al. (2014). Overview for the first shared task on language identification in code-switched data. In Proceedings of the First Workshop on Computational Approaches to Code Switching, Doha, Qatar, pp. 62–72.
- Srivastava et al. (2015). Highway networks. CoRR abs/1505.00387.
- Trivedi et al. (2018). IIT (BHU) submission for the ACL shared task on named entity recognition on code-switched data. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, Melbourne, Australia, pp. 148–153.
- Wang et al. (2018). Code-switched named entity recognition with embedding attention. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, Melbourne, Australia, pp. 154–158.
- Winata et al. (2018). Bilingual character representation for efficiently addressing out-of-vocabulary words in code-switching named entity recognition. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, Melbourne, Australia, pp. 110–114.
- Yirmibeşoğlu and Eryiğit (2018). Detecting code-switching between Turkish-English language pair. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, Brussels, Belgium, pp. 110–115.
Appendix for "From English to Code-Switching: Transfer Learning with Strong Morphological Clues"
Appendix A Language Identification Distributions
Table 7 shows the distribution of the language identification labels across the CALCS datasets.
|Labels||Nep-Eng ‘14||Spa-Eng ‘16||Hin-Eng ‘18|
Appendix B Parts-of-Speech Label Distribution
Appendix C Named Entity Recognition Label Distribution
Appendix D Visualization of Attention Weights for Trigrams: Hindi-English 2018
Figure 4 shows the attention behavior for tri-grams on the Hindi-English dataset.