Analysing Emergent Users’ Text Messages Data and Exploring its Benefits


While users in the developed world can choose to adopt the technology that suits their needs, emergent users cannot afford this luxury; hence, they adapt themselves to the technology that is readily available. When a technology such as the mobile phone is designed, it is implicitly assumed that emergent users will adopt it in due course. However, such user groups have different needs, and they follow different usage patterns compared to users from the developed world. In this work, we target an emergent user base, i.e., users from a university in Pakistan, and analyse their texting behaviour on mobile phones. We see interesting results, such as the long-term linguistic adaptation of users in the absence of reasonable Urdu keyboards, the overt preference for communicating in Roman Urdu, and the social forces related to textual interaction. We also present two case studies on how a single dataset can effectively help understand emergent users, improve the usability of some tasks, and help users perform previously difficult tasks with ease.

ANAS BILAL1, AIMAL REXTIN1, AHMAD KAKAKHAIL1 AND MEHWISH NASIM2
1Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan
2ARC Centre of Excellence for Mathematical and Statistical Frontiers, University of Adelaide, Australia
Corresponding author: Aimal Rextin
Digital Object Identifier 10.1109/ACCESS.2017.DOI

INDEX TERMS Emergent Users, Roman Urdu, Text Message Analysis, Word Completion

I Introduction

Low-cost mobile phones have allowed the wide-scale adoption of smartphones in developing countries; for example, South Asian users alone constitute one of the largest mobile phone user bases in the world. In the last few years, researchers have referred to user groups from developing countries as emergent users. Such user groups are less educated, economically disadvantaged, geographically dispersed, or culturally heterogeneous [1]. Millions of people in the developing world now own mobile phones, and a large proportion of them have access to smartphones. Technology is designed primarily for the needs of users from the developed world, and it is assumed that emergent users will adopt it in a similar fashion. However, studies show that emergent users adopt these technologies in different ways [1] [24], with their own usage peculiarities [2, 3]. Jones et al. [4] argued that emergent users should be engaged in technology design to yield designs of greater value.

Studies recommend that interface design follow the traditional HCI life cycle. However, this approach needs both time and financial resources, and both are scarce in the case of emergent users. We propose an alternative approach, namely data-driven usability improvement. This approach is feasible today because of the large digital footprints that users leave on their computing devices. We present two case studies on how a single dataset can help understand emergent users and improve the usability of some tasks.

We decided to use data from text messages for our analysis because use of text messaging has increased in recent years. Pakistan Telecommunication Authority [5] estimates that there were million mobile subscribers in and more than billion text messages were sent in . We observed that Pakistani users generally use Roman Urdu when communicating by text messages. Roman Urdu is the colloquial term given to the practice of writing Urdu (the national language of Pakistan) in Roman script. It is generally agreed by local users that this practice evolved when users who were not comfortable in English started to communicate over text messages [32, 31, 13].

I-A Contributions

In this paper, we extend the work that we presented earlier [6] by performing additional analysis on the collected dataset. This dataset, which is available for academic use, is an original corpus of text messages from mobile phone users in Pakistan. It also includes a hand-crafted file that groups the various spelling variations of the same word. The following are our contributions for this extended study:

  1. Texting behaviour: Our first contribution is a study of the texting behaviour of Pakistani users, such as the predominant language they use and the spelling variations of different words. This is covered in Section III.

  2. Spelling Variations: Since Roman Urdu has naturally evolved, there are no standardized spellings in place. In Section IV, we will show that spelling variations have three main categories.

  3. Case Studies: The marked contribution of this paper is to show the importance of collecting datasets from developing countries in order to understand the use of technology in understudied populations. In this regard, we will discuss the following case studies:

    1. The availability of a word completion feature is a major convenience for smartphone users. In Section V, we show that this corpus can help emergent smartphone users by providing them with more accurate word completion.

    2. In Section VI, we will discuss how users’ text data can be used to understand the social dynamics governing the interactions among people in developing countries, especially the ones involving intimate relations.

I-B Background

Users of computing systems who are disadvantaged are generally referred to as emergent users [1]. These include users from developing countries who face many challenges due to their different cultures and languages, low education, and late access to technology. The number of mobile phone users from developing countries has increased dramatically in recent years due to the decreasing cost of smartphones and mobile internet. This has led to a steady stream of studies that analyse problems and potential solutions for these users. These include a study of text messages from a structural and functional point of view [6]; a study exploring the use of computing technology by young migrant workers in China [7]; and a job searching website designed for illiterate people in Pakistan [8].

It has been argued that emergent users, especially illiterate users, find it difficult to use text-based features on phones [9]. Various studies suggest that voice-based access to such features should be made easier, e.g., voice-based use of text messages [10] and a voice-based job searching mobile application for illiterate Pakistani users [11]. However, text-based communication has its own advantages, such as its asynchronous nature and privacy, which make it more convenient in certain social settings. The benefits of text messaging, combined with the difficulty of using it, led Pakistani users to adapt to the situation by using Roman script to convey their messages in the local language; this style of writing is known as Roman Urdu [6]. Any kind of analysis of Roman Urdu is complicated by the fact that Roman Urdu has no standard spelling, and different people may use different spellings for the same word [12, 13, 14].

Text messaging is becoming more popular day by day and has hence attracted the attention of researchers. For example, some studies suggest that text messages can be classified into two disjoint groups. The first class is called informational messages, which contain information of a practical nature, and the second class is called relational messages, which contain greetings, personal conversations etc. [15, 16, 17]. Other studies compare the structure and function of the text messages of users with different characteristics such as age, gender, experience etc. [18, 19, 20, 15, 21]. There are even studies that look into how the language used on Facebook, WhatsApp etc. differs from standard languages [22], and studies that examine the functions of emojis in one-to-one messaging via text [23].

Roman Urdu Corpora

Urdu is not only the national language of Pakistan but also the official language of many Indian states and is among the most widely spoken languages of the sub-continent [29, 12, 14]. Researchers have collected various corpora of Roman Urdu in natural settings, including a collection of tweets from Pakistan that was used to design an algorithm to separate English words from Roman Urdu words [30]. Hussain collected a corpus of text messages and analysed the linguistic patterns of emergent users [31]. There is also a corpus of Roman Urdu SMS and a Roman Urdu corpus containing Million messages from chat rooms [32]. We can see that the larger datasets are from web chat rooms, which represent a totally different setting from text messages.

There have also been a few studies that investigate possible applications such as bilingual classification and sentiment analysis [33]; transliteration [13, 34]; word prediction [35]; and tagging parts of speech etc.

There are several applications of natural language corpora, such as detecting spam emails and messages [25, 26, 27] and other natural language processing tasks. Hence it is no surprise that various researchers have collected natural language corpora. For example, Tagg [28] collected a large scale corpus and performed linguistic investigation on about text messages.

I-C Organization

Section I is the introduction and motivation. Section II describes our dataset. In Section III, we discuss the texting behaviour of our user group, whereas in Section IV we analyse spelling variations in Roman Urdu text messages. In Sections V and VI we report our two case studies. Finally, Section VII concludes our paper.

II Dataset

There are several ways to communicate textually via a mobile phone, including SMS, WhatsApp, Viber, email etc. In the planning stages of this research work, we wanted to collect text data from both SMS and WhatsApp. However, due to technical difficulties in accessing WhatsApp data (mainly API restrictions), we decided to check whether our research objectives could be fulfilled by SMS data alone. This was done by conducting a pretest. We asked a group of users a single question: how they perceive their communication to be divided among various channels, including WhatsApp, phone calls, SMS, and other less popular channels of communication. The results showed that on average our participants perceived that of their communication was done through calls, through WhatsApp, through other less popular text communication channels like Facebook Messenger and finally through SMS. Hence, SMS accounts for of all text communication channels on smartphones. This gave us the confidence that we could collect a large enough dataset to derive useful results.

Initially, we wanted to acquire text message data from mobile service providers. However, we faced two constraints in this regard. The first was that, legally, text content is stored for only days by the local service providers, after which the content is deleted. The second constraint was that the service providers were reluctant to share their data due to various security and privacy concerns (knowledge acquired through private communication, 2016). We therefore collected our dataset through a custom Android application specifically developed for this study. We gathered individual SMS messages from students of a local university. These egos exchanged text messages with alters (an ego is an individual who installed our application, and alters are his/her contacts). Hence, our results are valid not only for the limited number of participants but also for all the alters with whom they exchanged messages. We collected statistics such as the frequency of letters, message length, time of message, unique words etc. (The data collection process was approved by the Research Ethics Committee at the local university; details can be provided upon request.) We did not collect any data that would compromise the privacy of the users, such as the complete message contents. We were conscientious from the start of the experiment about the need to protect the personal information of our participants, and we took the following measures to ensure that:

  1. We did not store any personal information like phone numbers and names and instead stored unique codes to identify the various alters for each ego.

  2. We did not gather any significant information about the content of the text messages. Rather, we gathered individual words and word bigrams, and these were stored alphabetically over each individual’s full text message history. This made it impossible to reconstruct any meaningful information from them.

  3. Participants were given the autonomy to share their messages with us. A spreadsheet file was generated on their smartphones, which they could inspect before sending it to us by email. It was clearly communicated to them that they could delete some or all of their data if they felt uncomfortable sharing it.
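The measures above can be illustrated with a minimal sketch of how derived data might be produced on the device. This is not the collection application itself: the function name `derive_features`, the code format `A1`, `A2`, ..., and the choice to store counts next to each word are all assumptions made for illustration.

```python
from collections import Counter


def derive_features(messages):
    """Turn one ego's message history into privacy-preserving derived data.

    messages: list of (contact, text) pairs. All names here are hypothetical.
    """
    codes = {}                      # contact identifier -> opaque code
    per_alter = Counter()           # number of messages per coded alter
    words, bigrams = Counter(), Counter()

    for contact, text in messages:
        # Replace the real identifier with a code such as "A1", "A2", ...
        code = codes.setdefault(contact, f"A{len(codes) + 1}")
        per_alter[code] += 1

        tokens = text.lower().split()
        words.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    # Sorting alphabetically over the whole history discards message order,
    # so individual messages cannot be reassembled from the output.
    return {
        "alter_counts": dict(per_alter),
        "words": sorted(words.items()),
        "bigrams": sorted(bigrams.items()),
    }
```

Only the returned dictionary would leave the device in such a design; the mapping from contacts to codes stays local.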

We collected SMS data from females and males who were students in a local university. Our average participant had outgoing messages and incoming messages in a typical day. There are a total of words including both Roman Urdu and English words. This indicated that SMS is still popular despite the popularity of other messaging applications such as WhatsApp etc.

We gathered subjective information from our participants through a questionnaire whose results were partially discussed in our preliminary paper [6]. One of the questions that we asked was whether they deleted any data before initiating the data collection process. We found that of our participants deleted either or complete conversations with particular alters. The mean number of conversations deleted came out to be . The participants informed us that these alters were either very close or intimate friends. We will further discuss this aspect in Section VI.

III Initial Analysis

This section describes some important artefacts of the data specifically revolving around the dominant language in text messages. We studied whether the choice of language varies from alter to alter, and the degree to which it conforms to reciprocity in communication. We then studied the extent of spelling variations of a single word. We also looked at the different textisms present in text messages.

III-A Choice of language

We asked our participants about their preferred language for textual communication. We found at a confidence level that, generally type their messages in Roman Urdu. The same was confirmed from analysing uniformly random words from each participant, making a total of words. The primary reason driving their preference was ease of conveying their message as well as the perception that the message will be understandable by the recipient.

We also found at a confidence level of that participants use Roman Urdu when sending a text message to some alters and English for others. This could be attributed to many reasons, specifically to the nature of the ego’s relationship with the alter.

III-B Reciprocity

There is strong evidence that people in a dialogue naturally tend to match both the vocabulary and the sentence structure of their conversation partner [36]. This natural alignment is called reciprocity. We wish to test whether reciprocity is present in our dataset, i.e., whether some ego-alter pairs predominantly converse in English while others predominantly converse in Roman Urdu.

FIGURE 1: Boxplot showing the distribution of the reciprocity coefficient for all ego-alter pairs. We can see that most ego-alter pairs have a low value of the coefficient, indicating that most words exchanged between an ego-alter pair tend to be in one language or the other.

We generated a uniform random sample of ego-alter pairs out of a total of such pairs. We then obtained all unique words that were exchanged between these ego-alter pairs and manually assigned each a language label (Roman Urdu or English). Based on this we computed two quantities: , the proportion of words sent by the ego to the said alter in English, and , the proportion of words received by the ego from the said alter in English. Based on these we computed the following:


We call this quantity the reciprocity coefficient of a particular ego-alter pair (we would get the same value if we calculated it with respect to Urdu instead of English). Fig. 1 shows the distribution of the reciprocity coefficient for our data. The coefficient ranges between 0 and 1, both inclusive, as can be seen from its two extremes:

  • The coefficient is 0 when both ego and alter converse in one particular language exclusively.

  • The coefficient is 1 when one of them uses one language exclusively and the other uses the other language exclusively.
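As an illustration, one definition consistent with the properties above (range 0 to 1, the two extreme cases, and symmetry between English and Urdu) is the absolute difference between the two English proportions. The sketch below assumes that form; the function names and the `"en"`/`"ur"` labels are hypothetical and chosen only for this example.

```python
def english_proportion(labels):
    """Fraction of words labelled English in a list of 'en' / 'ur' labels."""
    labels = list(labels)
    if not labels:
        raise ValueError("no labelled words")
    return sum(1 for lab in labels if lab == "en") / len(labels)


def reciprocity_coefficient(sent_labels, received_labels):
    """Assumed form: |proportion sent in English - proportion received in English|.

    0 means ego and alter use the two languages in the same proportions;
    1 means each of them uses a different language exclusively.
    """
    return abs(english_proportion(sent_labels) - english_proportion(received_labels))
```

Under this assumed form, swapping the roles of English and Urdu replaces each proportion p by 1 - p and leaves the absolute difference unchanged.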

We define an ego-alter pair to predominantly converse in one language if . We can see from Fig. 1 that this is true for most of the ego-alter pairs. Moreover, the mean in our sample comes to be , indicating that most words exchanged between an ego-alter pair tend to be in one language or the other. We then tested the statistical significance of this finding with the following null and alternative hypotheses:


We applied the t test with and obtained a p-value of . Hence, we reject the null hypothesis, as the observed results are highly unlikely if it is true. Thus, we have indications that language reciprocity exists in our dataset.

III-C Textisms

Darkin et al. define textism as the different abbreviations, acronyms, slang, and emoticons typically used in text messages [37]. We found two types of textisms in our text message dataset that we will discuss below.

Numeric Homophones: Numeric homophones are digits that are used in place of a word due to their acoustic similarity. Our survey showed that participants regularly use numeric digits in this way. We note that the same practice was observed by Verheijen et al. in their study of WhatsApp and Facebook chats of Dutch teenagers [22]. Some examples include using 4 to replace for and using the digit 7 in place of saath (which means both the number ‘seven’ and ‘together’ in Urdu).

Character Repetition: We noticed that a large number of words contain repeated characters, e.g., pleasseeeeeeeee instead of please or yessss instead of yes. We wanted to find the users’ intention behind repeating characters. For this purpose, we conducted a small independent user study and asked our participants whether they followed this practice in their text messaging and, if so, what the reason behind it was. We got responses from participants ( males and females). Our results showed that participants regularly follow this practice and the thematic analysis of their responses showed that do it to put emphasis on that word.
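A simple way to flag such words in a word list is to look for any character repeated three or more times in a row. The sketch below is only an illustration; the threshold of three and the function name `has_character_repetition` are assumptions, and legitimate double letters (as in yess or pleasse) are deliberately not flagged.

```python
import re

# A run of the same character repeated at least three times in a row.
_REPEAT_RUN = re.compile(r"(.)\1{2,}")


def has_character_repetition(word):
    """Return True if the word contains a run of 3+ identical characters."""
    return _REPEAT_RUN.search(word) is not None


def flag_repetitions(words):
    """Keep only the words that contain an emphatic character run."""
    return [w for w in words if has_character_repetition(w)]
```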

FIGURE 2: Plot showing frequencies of spelling variations. We can see that most words have spelling variations, while some words have as high as spelling variations.

IV Spelling Variations

Many Roman Urdu words have multiple spellings because the writing scheme has evolved naturally, and even the same person may write the same word with slightly different spellings at different times. We wanted to better understand these spelling variations in Roman Urdu. For this purpose, we first hand-labeled the words having multiple variations in each user’s profile separately, then combined the files of all our participants, and finally listed all the words with multiple spellings along with their counts. We found unique words with multiple spellings. Refer to Fig. 2 for further details.

In a natural language processing (NLP) task on a writing scheme with non-standard spelling rules, like Roman Urdu, a necessary preprocessing step is to determine whether two strings correspond to the same word or not. An obvious first approach to this preprocessing step is to apply a machine learning algorithm. However, after inspecting the data, we noticed that most of these word-pairs vary in limited ways; for example, users seemed to use ‘i’ interchangeably with ‘e’. In order to systematically analyse the spelling variations, we designed an algorithm by modifying the Levenshtein Edit Distance algorithm [38]. Our algorithm lists all spelling changes between two variations of the same word; more specifically, it categorizes each change in spelling as either an addition/deletion of a character or a change of one character to another. Our data analysis revealed the following:

Add/Delete: We found that of the change operations were Add/Delete operations. Moreover, in about of the cases either a vowel or the character h or n was added or deleted as their sounds are close to certain Urdu letters.

Replace: The most common replace operation was interchanging e and i in a word-pair which accounted for of such spelling change operations. This was again not surprising because the Urdu letter Chotī ye is phonetically close to both ‘i’ and ‘e’.
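The idea of listing the individual spelling changes between two variants can be sketched with a standard Levenshtein table followed by a backtrace. This is a minimal illustration under the assumption of unit costs for all operations, not the modified algorithm used in the study; the function name `spelling_edits` and the operation labels are hypothetical.

```python
def spelling_edits(a, b):
    """List the edit operations that turn spelling `a` into spelling `b`.

    Returns a list of (operation, old_char, new_char) tuples, where the
    operation is "add", "delete" or "replace".
    """
    n, m = len(a), len(b)
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete from a
                             dist[i][j - 1] + 1,         # add from b
                             dist[i - 1][j - 1] + cost)  # match or replace

    # Walk back from the bottom-right corner to recover one optimal edit script.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1                          # characters match
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("replace", a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", a[i - 1], ""))
            i -= 1
        else:
            ops.append(("add", "", b[j - 1]))
            j -= 1
    ops.reverse()
    return ops
```

Counting the returned tuples over all hand-labeled word pairs (for example with `collections.Counter`) would give an empirical distribution of operation types of the kind discussed above.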

Hence, we showed that one can determine with reasonable accuracy whether two strings correspond to the same word or not, given the probability distribution of the various subtypes of variations. Such an algorithm can have many applications, such as Roman Urdu chatbots. We next look at the two case studies which explore the applications of our dataset.

V Case Study 1: Ease in Text Entry

These days a wide variety of applications is available for smartphone users, but text entry is still the most common activity on smartphones [39, 40]. Hence, it is reasonable to assume that improving the usability of text entry on a smartphone will also have a significant impact on the overall usability of the smartphone. This is probably why software developers as well as the research community continuously try to develop improved virtual keyboards for smartphones. Typing on a virtual keyboard is more difficult because of its smaller size and lack of tactile feedback [41]. Different techniques have been used to improve the typing experience on virtual keyboards; many existing virtual keyboards use a language model to speed up typing by predicting the complete word from the previously typed letters [42]. There has also been extensive research on improving text entry on virtual keyboards, for example when the user is stationary [43, 44, 45, 46] and when the user is walking [47, 48]. However, to the best of our knowledge, no work has studied improving text entry for Pakistani users in the context of Roman Urdu.

FIGURE 3: Subjective opinion of participants about English word completion versus Roman Urdu word completion. We can see that users find completing words on a phone with an English dictionary irritating.

In order to assess text entry needs of Pakistani users, we conducted a pretest, in the form of a survey. Survey data was collected from 106 individuals, of them being males while the rest were females. The age of these individuals ranged from to years with a mean of years. Through the survey, we found that of our respondents type messages in Roman Urdu and feel that a specialized keyboard for Roman Urdu would be beneficial. One way to improve typing in Roman Urdu would be to improve auto-complete features on current keyboards as they do not provide word completion on non standard languages like Roman Urdu. This argument was supported by the results of a survey of individuals which is summarized in Fig. 3.

We have observed that many Pakistani users, over time, manually enter Roman Urdu words into the phone dictionary to make their text entry easier. As most Pakistani people generally type in Roman Urdu, we wanted to estimate how much time would be saved if word completion algorithms were pre-trained with a Roman Urdu dictionary. In this regard, we conducted two tests. First, we used a computer simulation to estimate how many words are completed when a word completion algorithm is pre-trained. We then conducted a controlled experiment to measure the difference in time and subjective opinion when a dictionary is trained in Roman Urdu versus a standard smartphone dictionary available to Pakistani users. We discuss them one by one below.

V-A Computer Simulation

One of the most primitive ways to complete the spelling of a word, given a prefix, is a radix tree. We used a radix tree to estimate the increase in usability if the smartphone dictionary uses a dataset of local words. We conducted our experiment on three datasets. Two of them were English datasets, i.e., the SMS corpus collected by Tagg [28] and the most commonly used words in English given by Education First, and the third was the Roman Urdu dataset that we collected. We extracted unique words of all users from our Roman Urdu dataset. We divided our unique words data into two segments; data was used as training data while unique words used as our test data. We then built a radix tree using all words in the training set and iteratively checked whether each word in the test set is correctly completed through the same radix tree. As expected, the radix tree built with the Roman Urdu dataset outperformed the others by accurately completing words of the test data, while the accuracy of word completion when the radix tree is built with Tagg’s [28] dataset and the top words was significantly lower. Results of this experiment are given in Table 1.

Dataset Words completed Words not completed
Roman Urdu Corpus ()
Tagg’s English Corpus ()
Top most used English words ()
TABLE 1: Results of Word Completion Experiment using radix tree
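The simulation can be sketched as follows. For brevity, the example uses a plain (uncompressed) prefix tree in place of a radix tree; both return the same completions for a given prefix. The criterion for a word being "completed" (it appears among the completions of its first `prefix_len` characters) and all names in the code are assumptions made for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> TrieNode
        self.is_word = False  # True if a training word ends here


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def complete(self, prefix):
        """Return all stored words that start with `prefix`, sorted."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results, stack = [], [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


def completion_rate(train_words, test_words, prefix_len=2):
    """Fraction of test words offered as a completion of their own prefix."""
    trie = Trie(train_words)
    test_words = list(test_words)
    if not test_words:
        return 0.0
    hits = sum(1 for w in test_words if w in trie.complete(w[:prefix_len]))
    return hits / len(test_words)
```

Building the tree from an English word list and testing on Roman Urdu words would, under this criterion, count only those Roman Urdu words that happen to coincide with English entries.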

V-B User Study

FIGURE 4: Time taken by participants in English word completion versus Roman Urdu word completion. We can see that the participants using Roman Urdu word completion in general took less time to perform this activity.

We also conducted a controlled experiment to measure the time taken when a user types on a smartphone with standard word completion versus a smartphone with Roman Urdu word completion. We designed our experiment as a between-group experiment. We first selected participants ( male and females) with ages between and years. We then randomly divided these participants into two groups of each. Each group was asked to enter a piece of Roman Urdu text in a smartphone provided by the first author of this paper. This smartphone had a screen size of inches and was running Android operating system. One group entered the given text on a phone to which Roman Urdu words had previously been added, while the second group entered the text with the default word completion (i.e., English).

For both groups, we noted the time taken in seconds. The boxplot in Fig. 4 shows that the participants using Roman Urdu word completion in general took less time to perform this activity. We denote the mean time taken for English word completion as and we calculated it to be (SD= ; SD here refers to standard deviation). Similarly, we denote the mean time taken for Urdu word completion as and we calculated it to be (SD= ). We next wanted to test if this difference is statistically significant by applying a t-test with the following set of hypotheses:



The p-value was computed to be and we concluded that it is highly unlikely that we would get this data if the two means were equal.
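For readers who wish to reproduce this kind of comparison, the test statistic for two independent groups can be computed with the standard library alone. The sketch below uses Welch's unequal-variance form, which is an assumption, since the variant of the t-test is not specified here; the function name `welch_t` is hypothetical, and any timing values passed to it would have to come from the actual experiment.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and degrees of freedom for two independent samples."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a), variance(sample_b)  # sample variances (n - 1)
    se_a, se_b = va / na, vb / nb                    # squared standard errors
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, df
```

The p-value is then obtained from the t distribution with `df` degrees of freedom, for example from a statistical table or a statistics library.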

Hence, a straightforward and simple way to improve the typing speed of emergent users would be to use a text entry dataset to train the word completion software. The only condition is that such users write their local language in Roman script or use a local variation of English.

VI Case Study 2: Sociological Analysis

In this case study we analyse the sociological aspects of text messaging. We started by looking at the top word bigrams that our participants used in their text messages. It was surprising to notice a large proportion of intimate words. We divided the words into two categories: ordinary words and romantic/intimate words. For the top unique word bigrams, we found that words were of ordinary nature, whereas were of intimate nature. The remaining words seemed to be part of marketing and promotional messages. Encouraged by the high percentage of intimate bigrams, we extracted frequently occurring words which depict intimacy.

FIGURE 5: Distribution of romantic words used by our participants. We can see that most of our users used 100 or fewer such words, while there are a few outliers with a high count.

We counted such words for each individual. We found that there were participants who used intimate words. The number of such words ranged from words in the files of those participants. The total number of words expressing intimate relations were . Fig. 5 shows the distribution of romantic words count per user. We recall from Section II that many participants deleted complete conversations with or of the alters, some of which were their intimate alters. Hence the difference in results of intimate and non-intimate alters as discussed in this section are likely to be more extreme in reality.

FIGURE 6: Comparison of the three groups of participants. We can see that the users with a high number of romantic words communicate more at night than the other two groups. Night time is defined to start at 8:00 PM and end at 7:00 AM.

We grouped our participants into three disjoint sets based on the number of romantic words in their text messages. The three groups and the grouping criteria are given below:

  1. Low romantic group had or less romantic words.

  2. Medium romantic group had between and romantic words.

  3. High romantic had more than romantic words.

We next checked the temporal characteristics of these text messages. Fig. 6 shows the average proportion of text messages sent or received for each user group. It shows that with an increasing number of romantic words, users seem to communicate later in the evening. In Pakistani society, it is generally easier for people to communicate with their romantic partners late in the evenings since that time is relatively more private [49]. A few studies investigate how romantic couples use smartphones [50] and choose communication media [51], albeit to the best of our knowledge such studies have not been conducted in this particular sociological setting.

FIGURE 7: Time based comparison of the difference between the proportion of messages sent to intimate alters and non-intimate alters. We can see that overall our users communicated more with the intimate alters after 8:00 PM. Note that the dashed horizontal line shows a difference of zero while positive values indicate that more messages were sent to intimate alters.

In order to ensure that the trend shown in Fig. 6 was not due to factors such as different chronotypes of the different groups of participants, we selected a set of ego-alter pairs where the ego had sent five or more intimate words. We restricted this to sent messages in order to ensure that unsolicited text messages are not analysed. There were twenty-six such participants (egos). We divided the contacts (alters) of this set into two disjoint subsets: intimate alters and non-intimate alters. Here, intimate alters are the contacts with whom five or more intimate words were communicated, and non-intimate alters are the remaining contacts. We calculated the proportion of messages that were exchanged with intimate alters and the proportion that were exchanged with non-intimate alters. These proportions were calculated for each hour of the day. Next we computed their difference as . We can see from Fig. 7 that the same group of people tend to send more SMS messages to their intimate alters than to their non-intimate alters after 8:00 PM. To test this hypothesis, we applied a t-test and found that an ego is more likely to communicate with an intimate alter than with a non-intimate alter after 8:00 PM ( p-value ).
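The hourly comparison can be sketched as below. The input format (a list of the hours, 0 to 23, at which messages were sent to each kind of alter) and the function names are assumptions for illustration; the sign convention follows Fig. 7, where positive values mean that relatively more messages went to intimate alters.

```python
from collections import Counter


def hourly_proportions(hours):
    """Proportion of messages in each hour of the day (index 0..23)."""
    counts = Counter(hours)
    total = len(hours)
    return [counts[h] / total if total else 0.0 for h in range(24)]


def hourly_difference(intimate_hours, non_intimate_hours):
    """Per-hour difference: intimate proportion minus non-intimate proportion."""
    p_intimate = hourly_proportions(intimate_hours)
    p_other = hourly_proportions(non_intimate_hours)
    return [pi - po for pi, po in zip(p_intimate, p_other)]
```

Because each list of proportions sums to one, the 24 differences sum to zero, so a surplus for intimate alters in the evening must be balanced by a deficit at other hours.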

These results are supported by sociological theories which suggest that a person living in a modern urban society is part of several social groups, known as social circles, and communicates with the people in each social circle at a different pace during different times of the day [52].

VII Conclusions and Future Work

In summary, we looked at the structure, function, and possible benefits of SMS data of emergent users in Pakistan. Recall that we collected various derived data from the SMS data of Pakistani students, and that no personal information such as names or phone numbers was collected. We also collected qualitative information from the same users through a survey.

We found that most users use Roman Urdu to communicate in text messages. Two thirds of our participants think this is because they are able to express themselves better this way. Moreover, the choice of language was not always consistent, and we found it to be dependent on the recipient of the message. Users also seem to adjust their language selection to match the language of the alter. Specifically, we found that some ego-alter pairs mostly communicate in English, while others mostly communicate in Roman Urdu.

Our work has three main conclusions:

  1. Text entry is one of the most common tasks performed on smartphones. When we conducted a survey about the text entry needs of Pakistani users, we were surprised to find that they did not feel the need for a specialized Urdu keyboard in its Arabic script. Instead, they felt the need for a Roman Urdu keyboard that helps them type faster on smartphones. This seems to indicate that a quick and easy way to improve text entry for emergent users is to train the word completion module on a dataset of local users’ text messages. We showed that the proportion of words completed in this way is higher than with default word completion. Moreover, the satisfaction level of the users is also better when this is done.

  2. The dataset we collected can also help in creating Roman Urdu chatbots. Recall that a chatbot is a software system that conducts conversations with humans via textual or auditory methods. Although the first chatbot, ELIZA, was created about 50 years ago, it is only recently that chatbots have begun to change human-computer interaction through Apple Siri, Google Now and Amazon Alexa. Despite considerable research and development in chatbots and natural language processing, the spelling variation in Roman Urdu has posed a fundamental problem for developing chatbots that can converse with the user in Roman Urdu. Since Roman Urdu has evolved naturally, there are no standardized spellings in place. However, in Section IV we showed that the spelling variations follow certain patterns and that some patterns are significantly more probable than others. Hence, this preprocessing step is not difficult to solve. This information can be used to improve the design of many existing applications and can spin off new applications, for instance Roman Urdu chatbots that help emergent smartphone users achieve a number of tasks with greater ease. Moreover, a similar approach can be applied to other emergent users who use non-standard spellings. We are in the process of designing a Roman Urdu chatbot that would help people report domestic violence.

  3. We were very surprised to discover that many participants use SMS to communicate with their romantic partners. In our initial study [6], about of our participants suggested that their relational messages consist mainly of intimate messages. We were intrigued, as romantic relationships are generally discouraged. We studied this more deeply by hand-selecting words used in romantic conversations, and discovered that many participants had a high density of these words. We hence think that many young men and women have adopted text messaging for their more intimate conversations due to its cost efficiency and privacy. It might be useful to study this large segment of users more deeply to better understand their needs and requirements.
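
To illustrate the first conclusion, the following is a minimal sketch of a word completion module trained on a corpus of local text messages. It is not the module evaluated in the study: the frequency-ranked prefix matching, the function names `train` and `complete`, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def train(messages):
    """Count word frequencies over a list of text messages."""
    freq = Counter()
    for msg in messages:
        freq.update(msg.lower().split())
    return freq


def complete(freq, prefix, k=3):
    """Return up to k most frequent words that extend the given prefix.

    Ties are broken alphabetically so the output is deterministic.
    """
    prefix = prefix.lower()
    candidates = [
        (word, count)
        for word, count in freq.items()
        if word.startswith(prefix) and word != prefix
    ]
    candidates.sort(key=lambda wc: (-wc[1], wc[0]))
    return [word for word, _ in candidates[:k]]


# Toy Roman Urdu corpus, invented purely for illustration.
corpus = ["kya haal hai", "kya kar rahe ho", "kal milte hain", "haal theek hai"]
model = train(corpus)
suggestions = complete(model, "ha")
```

Because the frequencies come from the users' own messages, Roman Urdu words that a default English dictionary would not contain can still be suggested.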

In summary, this is the first such study of SMS data of a large but understudied population with significant benefits. However, like any research study, there are certain aspects that threaten its validity. For example, the sample of participants in this study is based on convenience sampling, hence some results may not generalize. Similarly, the results in Section IV are based on change operations detected by an algorithm; these operations might differ from how a human would classify them, which also makes this a future research topic. Finally, Section VI might be underestimating the difference in communication patterns between intimate and non-intimate alters, as a number of our participants seemed to have deleted whole conversations that had taken place with a few of their alters, as discussed in Section II.
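
To make the notion of algorithmically detected change operations concrete, the sketch below extracts character-level operations between two spellings of the same word. It assumes a standard Levenshtein alignment [38]; the exact algorithm of Section IV may differ, and the function name and example words are hypothetical.

```python
def edit_operations(src, dst):
    """Return a minimum-length list of (op, src_char, dst_char) turning src into dst."""
    n, m = len(src), len(dst)
    # dist[i][j] is the edit distance between src[:i] and dst[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == dst[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    # Walk back from the bottom-right cell to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i - 1] == dst[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i -= 1
            j -= 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("substitute", src[i - 1], dst[j - 1]))
            i -= 1
            j -= 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", src[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", dst[j - 1]))
            j -= 1
    ops.reverse()
    return ops


# Hypothetical spelling variants, chosen only for illustration.
ops = edit_operations("kia", "kya")
```

When several optimal alignments exist, the backtrace above returns only one of them, which is one reason why automatically detected operations may differ from a human's classification.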


We are thankful to the participants of this study, who helped us in completing this work.

MN acknowledges support from the ARC Centre of Excellence for Mathematical and Statistical Frontiers, Australia.


  • [1] A. Joshi et al., “Technology adoption by ‘emergent’ users: the user-usage model,” in Proceedings of the 11th Asia Pacific conference on computer human interaction, pp. 28–38, ACM, 2013.
  • [2] M. Nasim, A. Rextin, N. Khan, and M. M. Malik, “Understanding call logs of smartphone users for making future calls,” in Proceedings of the 18th International Conference on Human-Computer Interaction with Mobile Devices and Services, pp. 483–490, ACM, 2016.
  • [3] J. Pearson, S. Robinson, M. Jones, and C. Coutrix, “Evaluating deformable devices with emergent users,” in Proceedings of the 19th International Conference on Human-Computer Interaction with Mobile Devices and Services, p. 14, ACM, 2017.
  • [4] M. Jones, S. Robinson, J. Pearson, M. Joshi, D. Raju, C. C. Mbogo, S. Wangari, A. Joshi, E. Cutrell, and R. Harper, “Beyond ‘yesterday’s tomorrow’: future-focused mobile interaction design by and for emergent users,” Personal and Ubiquitous Computing, vol. 21, no. 1, pp. 157–171, 2017.
  • [5] PTA, “Telecom indicators,” Accessed: 2016-08-25, 2016.
  • [6] A. Bilal, A. Rextin, A. Kakakhel, and M. Nasim, “Roman-txt: forms and functions of roman urdu texting,” in Proceedings of the 19th International Conference on Human-Computer Interaction with Mobile Devices and Services, p. 15, ACM, 2017.
  • [7] X. Lang, E. Oreglia, and S. Thomas, “Social practices and mobile phone use of young migrant workers,” in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services, pp. 59–62, ACM, 2010.
  • [8] I. A. Khan, S. S. Hussain, S. Z. A. Shah, T. Iqbal, and M. Shafi, “Job search website for illiterate users of pakistan,” Telematics and Informatics, vol. 34, no. 2, pp. 481–489, 2017.
  • [9] I. M. Thies et al., “User interface design for low-literate and novice users: Past, present and future,” Foundations and Trends® in Human–Computer Interaction, vol. 8, no. 1, pp. 1–72, 2015.
  • [10] E. Friscira, H. Knoche, and J. Huang, “Getting in touch with text: designing a mobile phone application for illiterate users to harness sms,” in Proceedings of the 2nd ACM Symposium on Computing for Development, p. 5, ACM, 2012.
  • [11] A. A. Raza, F. Ul Haq, Z. Tariq, M. Pervaiz, S. Razaq, U. Saif, and R. Rosenfeld, “Job opportunities through entertainment: Virally spread speech-based services for low-literate users,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2803–2812, ACM, 2013.
  • [12] A. Rafae, A. Qayyum, M. Moeenuddin, A. Karim, H. Sajjad, and F. Kamiran, “An unsupervised method for discovering lexical variations in roman urdu informal text,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 823–828, 2015.
  • [13] T. Ahmed, “Roman to urdu transliteration using wordlist,” in Proceedings of the Conference on Language and Technology, pp. 305–309, 2009.
  • [14] A. Daud, W. Khan, and D. Che, “Urdu language processing: a survey,” Artificial Intelligence Review, pp. 1–33, 2016.
  • [15] C. Thurlow and A. Brown, “Generation txt? the sociolinguistics of young people’s text-messaging,” Discourse analysis online, vol. 1, no. 1, p. 30, 2003.
  • [16] N. Döring, K. Hellwig, and P. Klimsa, “Mobile communication among german youth,” A sense of place: The global and the local in mobile communication, pp. 209–217, 2005.
  • [17] X. Faulkner and F. Culwin, “When fingers do the talking: a study of text messaging,” Interacting with computers, vol. 17, no. 2, pp. 167–185, 2005.
  • [18] J. Bernicot, O. Volckaert-Legrier, A. Goumi, and A. Bert-Erboul, “Forms and functions of sms messages: A study of variations in a corpus written by adolescents,” Journal of Pragmatics, vol. 44, no. 12, pp. 1701–1715, 2012.
  • [19] A. Goumi, O. Volckaert-Legrier, A. Bert-Erboul, and J. Bernicot, “Sms length and function: A comparative study of 13-to 18-year-old girls and boys,” European Review of Applied Psychology, vol. 61, no. 4, pp. 175–184, 2011.
  • [20] A. Deumert and S. Oscar Masinyana, “Mobile language choices- the use of english and isixhosa in text messages (sms) evidence from a bilingual south african sample,” English World-Wide, vol. 29, no. 2, pp. 117–147, 2008.
  • [21] J. M. Crosswhite, D. Rice, and S. M. Asay, “Texting among united states young adults: An exploratory study on texting and its use within families,” The Social Science Journal, vol. 51, no. 1, pp. 70–78, 2014.
  • [22] L. Verheijen and W. Stoop, “Collecting facebook posts and whatsapp chats,” in International Conference on Text, Speech, and Dialogue, pp. 249–258, Springer, 2016.
  • [23] H. Cramer, P. de Juan, and J. Tetreault, “Sender-intended functions of emojis in us messaging,” in Proceedings of the 18th International Conference on Human-Computer Interaction with Mobile Devices and Services, pp. 504–509, ACM, 2016.
  • [24] M. Nasim, A. Rextin, S. Hayat, N. Khan, and M. M. Malik, “Data analysis and call prediction on dyadic data from an understudied population,” Pervasive and Mobile Computing, vol. 41, pp. 166–178, 2017.
  • [25] D.-N. Sohn, J.-T. Lee, K.-S. Han, and H.-C. Rim, “Content-based mobile spam classification using stylistically motivated features,” Pattern Recognition Letters, vol. 33, no. 3, pp. 364–369, 2012.
  • [26] L. Chen, Z. Yan, W. Zhang, and R. Kantola, “Trusms: a trustworthy sms spam control system based on trust management,” Future Generation Computer Systems, vol. 49, pp. 77–93, 2015.
  • [27] J. M. Gómez Hidalgo, G. C. Bringas, E. P. Sánz, and F. C. García, “Content based sms spam filtering,” in Proceedings of the 2006 ACM symposium on Document engineering, pp. 107–114, ACM, 2006.
  • [28] C. Tagg, A corpus linguistics study of SMS text messaging. PhD thesis, The University of Birmingham, 2009.
  • [29] S. Urooj, S. Shams, S. Hussain, and F. Adeeba, “Sense tagged cle urdu digest corpus,” Centre for Language Engineering, Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore, 2014.
  • [30] I. Javed and H. Afzal, “Creation of bi-lingual social network dataset using classifiers,” in International Workshop on Machine Learning and Data Mining in Pattern Recognition, pp. 523–533, Springer, 2014.
  • [31] M. N. Hussain, Language Of Text Messages A Corpus Based Linguistic Analysis Of Sms In Pakistan. PhD thesis, International Islamic University, Islamabad, 2013.
  • [32] A. Irvine, J. Weese, and C. Callison-Burch, “Processing informal, romanized pakistani text messages,” in Proceedings of the Second Workshop on Language in Social Media, pp. 75–78, Association for Computational Linguistics, 2012.
  • [33] I. Javed, H. Afzal, A. Majeed, and B. Khan, “Towards creation of linguistic resources for bilingual sentiment analysis of twitter data,” in International Conference on Applications of Natural Language to Data Bases/Information Systems, pp. 232–236, Springer, 2014.
  • [34] M. Kamran Malik, T. Ahmed, S. Sulger, T. Bögel, A. Gulzar, G. Raza, S. Hussain, and M. Butt, “Transliterating urdu for a broad-coverage urdu/hindi lfg grammar,” in LREC 2010, Seventh International Conference on Language Resources and Evaluation, pp. 2921–2927, 2010.
  • [35] S. Shahzadi, B. Fatima, K. Malik, and S. M. Sarwar, “Urdu word prediction system for mobile phones,” World Applied Sciences Journal, vol. 22, no. 1, pp. 113–120, 2013.
  • [36] T. Koulouri, S. Lauria, and R. D. Macredie, “Do (and say) as i say: Linguistic adaptation in human–computer dialogs,” Human–Computer Interaction, vol. 31, no. 1, pp. 59–95, 2016.
  • [37] K. Durkin, G. Conti-Ramsden, and A. J. Walker, “Txt lang: Texting, textism use and literacy abilities in adolescents with and without specific language impairment,” Journal of Computer Assisted Learning, vol. 27, no. 1, pp. 49–57, 2011.
  • [38] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” in Soviet physics doklady, vol. 10, pp. 707–710, 1966.
  • [39] T. M. T. Do, J. Blom, and D. Gatica-Perez, “Smartphone usage in the wild: a large-scale analysis of applications and context,” in Proceedings of the 13th international conference on multimodal interfaces, pp. 353–360, ACM, 2011.
  • [40] H. Falaki, R. Mahajan, S. Kandula, D. Lymberopoulos, R. Govindan, and D. Estrin, “Diversity in smartphone usage,” in Proceedings of the 8th international conference on Mobile systems, applications, and services, pp. 179–194, ACM, 2010.
  • [41] E. Hoggan, S. A. Brewster, and J. Johnston, “Investigating the effectiveness of tactile feedback for mobile touchscreens,” in Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 1573–1582, ACM, 2008.
  • [42] J. Goodman, G. Venolia, K. Steury, and C. Parker, “Language modeling for soft keyboards,” in Proceedings of the 7th international conference on Intelligent user interfaces, pp. 194–195, ACM, 2002.
  • [43] N. Henze, E. Rukzio, and S. Boll, “Observational and experimental investigation of typing behaviour using virtual keyboards for mobile devices,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2659–2668, ACM, 2012.
  • [44] D. Rudchenko, T. Paek, and E. Badger, “Text text revolution: a game that improves text entry on mobile touchscreen keyboards,” in Pervasive Computing, pp. 206–213, Springer, 2011.
  • [45] A. Sears, D. Revis, J. Swatski, R. Crittenden, and B. Shneiderman, “Investigating touchscreen typing: the effect of keyboard size on typing speed,” Behaviour & Information Technology, vol. 12, no. 1, pp. 17–22, 1993.
  • [46] A. Gunawardana, T. Paek, and C. Meek, “Usability guided key-target resizing for soft keyboards,” in Proceedings of the 15th international conference on Intelligent user interfaces, pp. 111–118, ACM, 2010.
  • [47] B. Schildbach and E. Rukzio, “Investigating selection and reading performance on a mobile phone while walking,” in Proceedings of the 12th international conference on Human computer interaction with mobile devices and services, pp. 93–102, ACM, 2010.
  • [48] S. Mizobuchi, M. Chignell, and D. Newton, “Mobile text entry: relationship between walking speed and text input task difficulty,” in Proceedings of the 7th international conference on Human computer interaction with mobile devices & services, pp. 122–128, ACM, 2005.
  • [49] M. Nasim, R. Charbey, C. Prieur, and U. Brandes, “Investigating link inference in partially observable networks: Friendship ties and interaction,” IEEE Transactions on Computational Social Systems, vol. 3, no. 3, pp. 113–119, 2016.
  • [50] M. Jacobs, H. Cramer, and L. Barkhuus, “Caring about sharing: Couples’ practices in single user device access,” in Group’16 Proceedings of the 19th International Conference on Supporting Group Work, Association for Computing Machinery, 2016.
  • [51] H. Cramer and M. L. Jacobs, “Couples’ communication channels: What, when & why?,” in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 709–712, ACM, 2015.
  • [52] G. Simmel, Die Großstädte und das Geistesleben. Jazzybee Verlag, 2012.