Prevalence of code mixing in semi-formal patient communication in low resource languages of South Africa
In this paper we address the problem of code-mixing in resource-poor language settings. We examine data consisting of 182k unique questions generated by users of the MomConnect helpdesk, part of a national-scale public health platform in South Africa. We show evidence of code-switching at the level of approximately 10% within this dataset, a level that is likely to pose challenges for future services. We use a natural language processing library (Polyglot) that supports detection of 196 languages, and we evaluate its performance at identifying English, isiZulu and code-mixed questions.
Code-mixing is a linguistic phenomenon where two languages are used spontaneously in one sentence. Code-mixing is widespread in multilingual and multicultural communities Mazibuko (2012). South Africa is a multilingual country where the Constitution recognises 11 official languages, namely Afrikaans, English, isiNdebele, isiXhosa, isiZulu, Sepedi, Sesotho, Setswana, siSwati, Tshivenda and Xitsonga, with about 98% of the total population speaking one of these as a first language Africa (2012). English is often used as a lingua franca and dominates the published media. However, it is only the fourth most prevalent first language in the country Africa (2012). Widespread use of numerous languages poses obvious challenges for the development of nationally relevant automated language processing tools. In the South African context, language tool development is further complicated by variation in each language’s use across diverse socio-economic and cultural contexts. Finally, the development of human language technologies (HLT), such as corpora, lexica and software, has been hindered by significantly lower levels of digital access among the populations speaking South African languages. As a result, even the most widely spoken South African languages are classified as low-resource (LR) languages. Nevertheless, it is becoming increasingly clear that the development of these tools is critically important to bridging the digital divide in a multilingual society. These tools are increasingly recognised as key to providing access to information and automated services.
In this paper we highlight the challenges of an automated question-answering task for the MomConnect programme run by the National Department of Health Daniel et al. (2019). The questions are sent by users who represent a large proportion of women attending their first antenatal care visit (ANC1), with registration rates increasing from 40% in 2015 to 55% in 2016 and to 64% in 2017 LeFevre et al. (2018). The rates at which users in this population selected each language at registration LeFevre et al. (2018) are given in Table 1.
Linguistic analysis and computational modelling are challenging enough in the LR setting alone, but the task is further complicated by the prevalence of code-mixing, contractions, non-standard spellings, and ungrammatical constructions in our data set. Code switching degrades the performance of natural language processing (NLP) techniques, and language identification at the token level is very challenging because fewer features are available than for document-level language identification.
To quantify these challenges, we present an analysis of the prevalence of code-switching in a data set generated on a national health platform, MomConnect (see Table 2). The dataset comprises 182k unique messages with examples from each of the 11 official South African languages. We present an algorithm to identify code switching, and evaluate its performance by comparing it to single-language identification. To our knowledge, this is the first such analysis of a national-scale text programme. The South African National Health Insurance proposed for 2026 could benefit from the use of natural language processing, and these results provide a basis for estimating the level of effort that will be required to support these services.
|Question|Language tags|
|---|---|
|kuyenzeka yini kuthi umakuqhume condom kuvele kuthi khulelwe after day|en, zu, xh|
|Mng kade ngagcina ukuthol msg evela kini|zu, en|
|why ningaphenduli if umuntu ebuza something|zu, en|
For this paper we focused on detecting code switching among English, isiXhosa and isiZulu, the three most common languages used in the MomConnect population LeFevre et al. (2018). We evaluated Polyglot as a means of tagging languages and code-switching by comparing its automated labels against four manually labelled samples. Our data was labelled by native speakers. The automated labelling consisted of the following stages:
1. Remove punctuation, emojis and digits from the data
2. Split each question into four chunks
3. Apply Polyglot to each chunk
4. Record the Polyglot label
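The stages above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the text does not say how the four chunks are formed, so splitting on whitespace into roughly equal contiguous chunks is an assumption, as are the helper names `clean`, `split_into_chunks` and `label_question`. The detector is passed in as a function so that the sketch runs without Polyglot installed; `toy_detect` and its tiny English word list are purely hypothetical stand-ins.

```python
from typing import Callable, List


def clean(text: str) -> str:
    """Keep only letters and whitespace (drops punctuation, emojis, digits)."""
    kept = "".join(ch if (ch.isalpha() or ch.isspace()) else " " for ch in text)
    return " ".join(kept.split())


def split_into_chunks(text: str, n_chunks: int = 4) -> List[str]:
    """Split into up to n_chunks contiguous, roughly equal token chunks.

    Assumption: chunking by whitespace tokens; the paper does not specify.
    """
    tokens = text.split()
    if not tokens:
        return []
    n = min(n_chunks, len(tokens))
    base, extra = divmod(len(tokens), n)
    chunks, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        chunks.append(" ".join(tokens[start:start + size]))
        start += size
    return chunks


def label_question(text: str, detect: Callable[[str], str],
                   n_chunks: int = 4) -> List[str]:
    """Return the distinct language codes detected across chunks, in order.

    More than one code in the result marks the question as code-switched.
    """
    labels: List[str] = []
    for chunk in split_into_chunks(clean(text), n_chunks):
        code = detect(chunk)
        if code not in labels:
            labels.append(code)
    return labels


# Hypothetical stand-in detector, for illustration only. With Polyglot
# installed, one would presumably pass something like
#   lambda s: Detector(s, quiet=True).language.code
# (from polyglot.detect import Detector); treat that wiring as an assumption.
HYPOTHETICAL_ENGLISH = {"why", "if", "something", "after", "day"}


def toy_detect(chunk: str) -> str:
    tokens = chunk.lower().split()
    return "en" if all(t in HYPOTHETICAL_ENGLISH for t in tokens) else "zu"


if __name__ == "__main__":
    print(label_question("why ningaphenduli if umuntu ebuza something", toy_detect))
```

The output of the toy detector says nothing about Polyglot's actual behaviour; it only shows how per-chunk labels combine into a single-language or code-switched tag.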
To evaluate the performance of this algorithm, we then compiled the following datasets.
- Full Data Sample: 400 randomly drawn sentences, ignoring the Polyglot labels, with manual language tags
- English: 400 randomly drawn sentences from those tagged by Polyglot as English, with manual language tags
- Zulu: 400 randomly drawn sentences from those tagged by Polyglot as Zulu, with manual language tags
- Code-switched: 400 randomly drawn sentences from those tagged by Polyglot as English + isiZulu, English + isiXhosa or isiZulu + isiXhosa, with manual language tags
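One way to draw these four samples is sketched below. It assumes each question already carries the set of language codes recorded from Polyglot. The function name `draw_samples`, the `(question, labels)` record layout, the fixed seed, and the choice to treat "tagged as English" as a label set of exactly `{"en"}` are illustrative assumptions rather than details given in the paper.

```python
import random
from typing import Dict, FrozenSet, List, Tuple

# (question text, set of Polyglot language codes recorded for it)
Record = Tuple[str, FrozenSet[str]]

# Label sets that define the Code-switched pool, as listed in the text.
CODE_SWITCHED = {
    frozenset({"en", "zu"}),
    frozenset({"en", "xh"}),
    frozenset({"zu", "xh"}),
}


def draw_samples(records: List[Record], n: int = 400,
                 seed: int = 0) -> Dict[str, List[Record]]:
    """Draw up to n questions without replacement from each pool."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    pools = {
        "Full Data Sample": records,  # ignores the Polyglot labels
        "English": [r for r in records if r[1] == frozenset({"en"})],
        "Zulu": [r for r in records if r[1] == frozenset({"zu"})],
        "Code-switched": [r for r in records if r[1] in CODE_SWITCHED],
    }
    return {name: rng.sample(pool, min(n, len(pool)))
            for name, pool in pools.items()}
```

Each drawn sample would then be passed to native speakers for manual tagging, so that the Polyglot labels can be compared against the manual ones.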
The breakdown of the different languages and language combinations is given in Table 3.
|Full Data Sample||4.5||2.75||0.50||3.25||76.50||4.50||3.25||4.50|
The distribution of languages in incoming questions (Full Data Sample) differs from the distribution of languages chosen during registration, given in Table 1 (χ² = 93.168, df = 3, p-value < 2.2e-16). Of the sample of 400 questions identified by the classifier as English, all were correct, whereas for isiZulu this fell to approximately 76%. Code switching was present in 65.5% of the questions that the classifier identified as code-switched. An evaluation of the performance of the classifier on the Full Data Sample gave an accuracy of 0.78, weighted precision of 0.89, and weighted recall of 0.78.
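The reported metrics can be illustrated with a short sketch, assuming the conventional support-weighted definitions (each class's precision and recall weighted by its share of the manual labels); the paper does not state which implementation was used. The function name and the toy labels are hypothetical. Under these definitions weighted recall is algebraically equal to accuracy for single-label classification, which is consistent with both being reported as 0.78.

```python
from collections import Counter
from typing import Dict, List


def weighted_metrics(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Accuracy plus support-weighted precision and recall."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    n = len(y_true)
    support = Counter(y_true)      # manual labels per class
    predicted = Counter(y_pred)    # classifier labels per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = 0.0
    recall = 0.0
    for label, count in support.items():
        weight = count / n
        # A class that is never predicted contributes zero precision here.
        if predicted[label]:
            precision += weight * correct[label] / predicted[label]
        recall += weight * correct[label] / count
    return {
        "accuracy": sum(correct.values()) / n,
        "weighted_precision": precision,
        "weighted_recall": recall,
    }


if __name__ == "__main__":
    # Toy labels for illustration only, not data from the paper.
    y_true = ["en"] * 6 + ["zu"] * 2 + ["en+zu"] * 2
    y_pred = ["en"] * 6 + ["zu", "en+zu"] + ["en+zu", "en"]
    print(weighted_metrics(y_true, y_pred))
```

On an imbalanced sample the majority class dominates both weighted scores, which is one reason these figures need to be read alongside a per-class breakdown.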
It is interesting that the level of interaction with the service in English is higher than the level of English registrations. In addition, there is evidence of extensive code switching in the data, at the level of approximately 10%. The classifier appears to work well in these examples, as evidenced by the high rate of positive predictions in the English, isiZulu, and code-switched data sets. However, more effort needs to be applied to evaluating the model using other techniques Platanios et al. (2017). Evaluating the model on the Full Data Sample gives an accuracy of 0.775, which is not significantly higher than what would be obtained by simply assuming all questions were English (0.765). However, this data set is highly imbalanced, and more effort needs to be spent exploring means of evaluating the accuracy, precision and sensitivity of the model.
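The size of that gap is easier to see as counts. The sketch below assumes the accuracy was computed over all 400 questions of the Full Data Sample; the function name and the reconstructed label list are illustrative, with the 306 English labels derived from the 76.50% English share rather than taken from the raw data.

```python
from collections import Counter
from typing import List, Tuple


def majority_baseline(y_true: List[str]) -> Tuple[str, float]:
    """Accuracy obtained by always predicting the most frequent label."""
    label, count = Counter(y_true).most_common(1)[0]
    return label, count / len(y_true)


N = 400  # assumed size of the Full Data Sample
classifier_correct = round(0.775 * N)       # questions the classifier got right
always_english_correct = round(0.765 * N)   # questions that are English
gap = classifier_correct - always_english_correct

if __name__ == "__main__":
    # Illustrative labels: 306 English, the remainder lumped together.
    labels = ["en"] * always_english_correct + ["other"] * (N - always_english_correct)
    print(majority_baseline(labels))
    print(classifier_correct, always_english_correct, gap)
```

Under these assumptions the classifier is right on 310 questions and the always-English baseline on 306, a difference of four questions out of 400.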
This paper demonstrates the challenges involved in natural language processing for resource-poor environments at a national scale. These include imbalanced language distributions and evidence of extensive code-switching. Language processing algorithms will need to be improved in the future to provide similar levels of digital access in these environments. A simple language classifier shows promise in identifying language and code-switching, although the evaluation of the model requires more thought given the imbalanced nature of the data set.
Presented at NeurIPS 2019 Workshop on Machine Learning for the Developing World.
- Statistics South Africa (2012) 2011 census in brief. Statistics South Africa, Pretoria, South Africa.
- Daniel et al. (2019) Towards automating healthcare question answering in a noisy multilingual low-resource setting. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 948–953.
- LeFevre et al. (2018) Unpacking the performance of a mobile health information messaging program for mothers (MomConnect) in South Africa: evidence on program reach and messaging exposure. BMJ Global Health 3 (Suppl 2).
- Mazibuko (2012) A socio-cultural approach to code-switching and code-mixing among speakers of isiZulu in KwaZulu-Natal: a contribution to spoken language corpora. Ph.D. Thesis, University of KwaZulu-Natal, Durban, South Africa.
- Platanios et al. (2017) Estimating accuracy from unlabeled data: a probabilistic logic approach. CoRR abs/1705.07086.