τ-SS3: a text classifier with dynamic n-grams for early risk detection over text streams

Sergio G. Burdisso sburdisso@unsl.edu.ar Marcelo Errecalde merreca@unsl.edu.ar Manuel Montes-y-Gómez mmontesg@inaoep.mx Universidad Nacional de San Luis (UNSL), Ejército de Los Andes 950, San Luis, San Luis, C.P. 5700, Argentina Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Luis Enrique Erro No. 1, Sta. Ma. Tonantzintla, Puebla, C.P. 72840, Mexico
Abstract

A recently introduced classifier, called SS3, has been shown to be well suited to deal with early risk detection (ERD) problems on text streams. It obtained state-of-the-art performance on early depression and anorexia detection on Reddit in the CLEF eRisk open tasks. SS3 was created to deal naturally with ERD problems, since it supports incremental training and classification over text streams and it can visually explain its rationale. However, SS3 processes the input using a bag-of-words model and therefore lacks the ability to recognize important word sequences. This could negatively affect the classification performance, and it also reduces the descriptiveness of the visual explanations. In standard document classification, word n-grams are commonly used to overcome some of these limitations. Unfortunately, when working with text streams, using n-grams is not trivial, since the system must learn and recognize which n-grams are important “on the fly”. This paper introduces τ-SS3, a variation of SS3 which expands the model to dynamically recognize useful patterns over text streams. We evaluate our model on the eRisk 2017 and 2018 tasks on early depression and anorexia detection. Experimental results show that τ-SS3 is able to improve both existing results and the richness of visual explanations.

keywords:
Early Text Classification. Dynamic Word N-Grams. Incremental Classification. SS3. Explainability. Trie. Digital Tree.
journal: Pattern Recognition Letters

1 Introduction

The analysis of sequential data is a very active research area that addresses problems where data is naturally processed as sequences, or can be better modeled that way, such as sentiment analysis, machine translation, video analytics, speech recognition, and time-series processing. A scenario that is gaining increasing interest in the classification of sequential data is so-called “early classification”, in which the problem is to classify a data stream as early as possible without a significant loss in accuracy. The reasons behind this requirement of “earliness” can be diverse. It could be necessary because the sequence length is not known in advance (e.g. a social media user’s content) or, for example, because savings of some sort (e.g. computational savings) can be obtained by classifying the input early. However, the most important (and interesting) cases are those in which the delay in that decision could also have negative or risky implications. This scenario, known as “early risk detection” (ERD), has gained increasing interest in recent years, with potential applications in rumor detection (ma2015detect; ma2016detecting; kwon2017rumor), sexual predator detection and aggressive text identification (escalante2017early), depression detection (losada2017erisk; losada2016test) and terrorism detection (iskandar2017terrorism).

ERD scenarios are difficult to deal with, since models need to: support classification and/or learning over sequential data (streams); provide a clear method to decide whether the data processed so far is enough to classify the input stream (early stopping); and, additionally, be able to explain their rationale, since people’s lives could be affected by their decisions.

A recently introduced text classifier (burdisso2019), called SS3, has been shown to be well suited to ERD problems on social media streams. Unlike standard classifiers, SS3 was created to deal naturally with ERD problems: it supports incremental training and classification over text streams and it has the ability to visually explain its rationale. It obtained state-of-the-art performance on early depression, anorexia and self-harm detection in the CLEF eRisk open tasks (burdisso2019; burdisso2019clef).

However, at its core, SS3 processes each sentence from the input stream using a bag-of-words model. As a consequence, SS3 cannot capture important word sequences, which could negatively affect its classification performance. Additionally, since single words are less informative than word sequences, this bag-of-words model reduces the descriptiveness of SS3’s visual explanations.

The weaknesses of bag-of-words representations are well-known in the standard document classification field, in which word n-grams are usually used to overcome them. Unfortunately, when dealing with text streams, using word n-grams is not a trivial task since the system has to dynamically identify, create and learn which n-grams are important “on the fly”.

In this paper, we introduce a modification of SS3, called τ-SS3, which expands its original definition to allow it to recognize important word sequences. In Section 2, the original SS3 definition is briefly introduced. Section 3 formally introduces τ-SS3, describing the needed equations and algorithms. In Section 4, we evaluate our model on the CLEF eRisk 2017 and 2018 tasks on early depression and anorexia detection. Finally, Section 5 summarizes the main conclusions derived from this work.

2 The SS3 Text Classifier

Figure 1: Classification example for the categories technology and sports. In this example, SS3 misclassified the document’s topic as sports, since it failed to capture sequences important for technology such as “machine learning” or “video game”. This was due to each sentence being processed as a bag of words.

As described in more detail in burdisso2019, during the training phase SS3 first builds a dictionary of words for each category, in which the frequency of each word is stored. Then, during the classification stage and using these word frequencies, it calculates a value for each word with a function gv(w, c) that values words in relation to categories. (gv stands for “global value” of a word; in contrast with the “local value”, lv, which values a word only according to its frequency in a given category, gv takes into account its relative frequency across all the categories.) gv takes a word w and a category c and outputs a number in the interval [0, 1] representing the degree of confidence with which w is believed to exclusively belong to c. For instance, given the categories C = {technology, sports, food}, we could have:

gv(sushi, technology) = 0.01;  gv(sushi, sports) = 0.02;  gv(sushi, food) = 0.85.

Additionally, a vectorial version of gv is defined as:

gv(w) = (gv(w, c1), gv(w, c2), …, gv(w, ck))

where ci ∈ C (the set of all the categories). That is, gv applied to a word alone outputs a vector in which each component is the word’s gv for the corresponding category ci. For instance, following the above example, we have:

gv(sushi) = (0.01, 0.02, 0.85)

The vector gv(w) is called the “confidence vector of w”. Note that each category ci is assigned a fixed position i in the vector. For instance, in the example above, (0.01, 0.02, 0.85) is the confidence vector of the word “sushi”, and the first position corresponds to technology, the second to sports, and the third to food.

The computation of gv involves three functions, lv, sg and sn, as follows:

gv(w, c) = lv_σ(w, c) · sg_λ(w, c) · sn_ρ(w, c)        (1)

  • lv_σ values a word based on the local frequency of w in c. As part of this process, the word distribution curve is smoothed by a factor controlled by the hyper-parameter σ.

  • sg_λ captures the global significance of w in c; it decreases the value in relation to the value of w in the other categories. The hyper-parameter λ controls how far the local value must deviate from the median to be considered significant.

  • sn_ρ sanctions w in relation to how many other categories w is significant (sg) to. That is, the more categories in which the sg value of w is high, the smaller the sn value. The hyper-parameter ρ controls how sensitive this sanction is.

To keep this paper shorter and simpler, we only introduce here the equation for lv, since the computation of both sg and sn is based solely on this function. Nonetheless, readers interested in how the sg and sn functions are actually computed are referred to the original SS3 paper (burdisso2019). Thus, lv is defined as:

lv_σ(w, c) = (P(w|c) / P(w_max|c))^σ        (2)

where w_max is the most probable word in c. After estimating the probabilities by analytical Maximum Likelihood Estimation (MLE), this leads to the actual definition:

lv_σ(w, c) = (tf_{w,c} / max_c)^σ        (3)

where tf_{w,c} denotes the frequency of w in c and max_c the maximum word frequency seen in c. The value σ is one of SS3’s hyper-parameters, called “smoothness”.
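To make the computation above concrete, the following minimal sketch (ours, not the authors' released implementation; the default value of sigma below is arbitrary) computes Equation 3 from the raw frequencies stored in a category's dictionary:

```python
def local_value(freq, max_freq, sigma=0.5):
    """Equation 3: lv_sigma(w, c) = (tf_{w,c} / max_c) ** sigma.

    `freq` is the frequency of w in c, `max_freq` the maximum word
    frequency seen in c, and `sigma` the "smoothness" hyper-parameter.
    """
    if max_freq == 0:  # empty category: no evidence at all
        return 0.0
    return (float(freq) / max_freq) ** sigma

# The most frequent word of a category gets the maximum local value, 1;
# less frequent words are smoothed towards it as sigma decreases.
print(local_value(500, 500))             # 1.0
print(local_value(125, 500, sigma=1.0))  # 0.25
```

With sigma = 1 the local value is just the frequency normalized by the maximum; values of sigma below 1 flatten the distribution, which is what the “smoothness” hyper-parameter controls.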

As illustrated in Figure 1, the SS3 classification algorithm can be thought of as a 2-phase process. In the first phase, the input is split into multiple blocks (e.g. paragraphs), and each block is in turn repeatedly divided into smaller units (e.g. sentences, words). Thus, the previously “flat” document is transformed into a hierarchy of blocks. In the second phase, the gv function is applied to each word to obtain the “level 0” confidence vectors, which are then reduced to “level 1” confidence vectors by means of a level 0 summary operator. This reduction process is recursively propagated up to higher-level blocks, using higher-level summary operators, until a single confidence vector is generated for the whole input. Finally, the actual classification is performed based on the values of this single confidence vector, using some policy, for example selecting the category with the highest confidence value. Note that, given the confidence vectors in this hierarchy of blocks, it is quite straightforward for SS3 to visually justify a classification by coloring the different blocks of the input in relation to their confidence values. This is quite relevant, especially for health-care systems, in which specialists should be able to manually analyze the reasons behind automatic user classifications.
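The reduction phase can be sketched on a toy two-category input as follows; this is our illustration, assuming plain vector addition as every summary operator (the actual operators are configurable in SS3, so addition is just one possible choice):

```python
def reduce_vectors(vectors):
    # Reduce a list of confidence vectors to one vector (here, by addition).
    return [sum(components) for components in zip(*vectors)]

# Toy "level 0" (word) confidence vectors for (technology, sports):
sentence_1 = [[0.1, 0.0], [0.3, 0.0]]  # a two-word sentence
sentence_2 = [[0.0, 0.2]]              # a one-word sentence

# Words -> sentence vectors ("level 1") -> document vector ("level 2"):
doc_vector = reduce_vectors([reduce_vectors(s) for s in [sentence_1, sentence_2]])

# Example policy: select the category with the highest confidence value.
label = ["technology", "sports"][doc_vector.index(max(doc_vector))]
```

Because every intermediate vector is kept, each block of the input can later be colored according to its own confidence values, which is what makes the visual explanations possible.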

Note that SS3 processes individual sentences using a bag-of-words model, since the summary operators reduce the confidence vectors of individual words into a single vector. Therefore, no relationship that could exist between these individual words is taken into account, for instance, between “machine” and “learning” or “video” and “game”. That is, the model cannot capture important word sequences that could improve the classification performance, as could have been possible in the example shown in Figure 1. In addition, by allowing the model to capture important sequences, we also improve its ability to visually explain its rationale. For instance, think of “self-driving cars” in relation to technology: this sequence could be highlighted and presented to the user, whereas each word on its own would be overlooked, since none of “driving”, “cars” and “self” is relevant to technology by itself. In standard document classification scenarios, this type of relationship could be captured using variable-length n-grams. Unfortunately, when working with text streams, using n-grams is not trivial, since the system has to dynamically identify and learn which n-grams are important “on the fly”. In the next section, we will introduce a variation of SS3, which we have called τ-SS3, that expands the model definition to allow it to dynamically recognize important word sequences.

Figure 2: τ-SS3 classification example. Since SS3 now has the ability to capture important word sequences, it is able to correctly classify the document’s topic as technology.

3 The τ-SS3 Text Classifier

Regarding the model’s formal definition, the only change we need to introduce is a generalized version of the lv function given in Equation 2. This is trivial, because it only involves allowing lv to value not only single words but also sequences of them. That is, in symbols, if s is a sequence of words, then lv is now defined as:

lv_σ(s, c) = (P(s|c) / P(s_max|c))^σ        (4)

where s_max is the sequence of words, of the same length as s, with the highest probability of occurring given that the category is c.

Then, in the same way as with Equation 3, the actual definition of lv becomes:

lv_σ(s, c) = (tf_{s,c} / max_{c,|s|})^σ        (5)

where tf_{s,c} denotes the frequency of the sequence s in c and max_{c,|s|} the maximum frequency seen in c among sequences of length |s|.

Thus, given any word sequence s, we can now use the original Equation 1 to compute its gv. For instance, suppose τ-SS3 has learned sequences such as “machine learning” and “video game” with high gv values for technology, e.g. gv(machine learning, technology) = 0.23.

Then, the previously misclassified example can now be correctly classified, as shown in Figure 2. In the following subsections, we will see how this formal expansion is implemented in practice, i.e. how the training and classification algorithms store learned sequences and dynamically recognize sequences like those of this example at classification time.

3.1 Training

Figure 3: Training example. Gray color and bold indicate an update. (a) The first two words have been consumed and the tree has 3 nodes, one for each word and one for the bigram “mobile APIs”; then a comma (,) is found in the input, so lines 9 and 10 of Algorithm 1 remove all the cursors, and a new cursor pointing to the root is later added (line 12). (b) The word “for” is consumed: a new node for this word is created from the node pointed to by the cursor (lines 14-15), the cursor is updated to point to this new node (line 21), the next term is read, and a new cursor is created at the root (line 12). (c) “mobile” is consumed: through the first cursor, the node for this word updates its frequency to 2 (line 17); a new node is created for the bigram “for mobile” through the second cursor; and a new cursor is created at the root node (line 12). (d) Finally, the word “developers” is consumed and, similarly, new nodes are created for the word “developers”, the bigram “mobile developers” and the trigram “for mobile developers”.
1:
2:procedure Learn-New-Document(d, c)
3:     input: d, a sequence of lexical units
4:                 c, the category the document belongs to
5:     local variables: cursors, a set of prefix tree nodes
6:     
7:     cursors ← an empty set
8:     for each term in d do
9:          if term is not a word then
10:               cursors ← an empty set
11:          else
12:               add c.Prefix-Tree.Root to cursors
13:               for each cursor in cursors do
14:                    if cursor has no child node for term then
15:                         cursor.Child-Node.New(term)                     
16:                    node ← cursor.Child-Node[term]
17:                    node.Freq ← node.Freq + 1
18:                    if node.Level = MAX-LEVEL then
19:                         remove cursor from cursors
20:                    else
21:                         replace cursor with node in cursors                                                   
22:end procedure
Algorithm 1 Learning algorithm. Note that d is a sequence of lexical units (terms), which includes not only words but also punctuation marks. MAX-LEVEL stores the maximum allowed sequence length, n.

The original SS3 learning algorithm only needs a dictionary of term-frequency pairs for each category. Each dictionary is updated as new documents are processed, i.e. unseen terms are added and the frequencies of already seen terms are updated. Note that these frequencies are the only elements we need to store, since to compute lv we only need to know the frequency of w in c, tf_{w,c}, and the maximum frequency, max_c (see Equation 3).

Likewise, the τ-SS3 learning algorithm only needs to store the frequencies of all word sequences seen while processing training documents. More precisely, given a fixed positive integer n, it must store information about all word k-grams seen during training, with 1 ≤ k ≤ n, i.e. single words, bigrams, trigrams, etc. To achieve this, the new learning algorithm uses a prefix tree, also called a trie (trie1960; crochemore2009trie), to store all the frequencies, as shown in Algorithm 1. Note that instead of having n different dictionaries, one for each k (e.g. one for words, one for bigrams, etc.), we have decided to use a single prefix tree, since all n-grams share common prefixes with the shorter ones. Additionally, note that instead of processing the input document n times, again one for each k, we have decided to use multiple cursors to simultaneously store all sequences, allowing the input to be processed as a stream. Finally, note that lines 9 and 10 of Algorithm 1 ensure that we only take into account n-grams that make sense, i.e. those composed only of words. All these observations, as well as the intuition behind the algorithm, are illustrated with an example in Figure 3. This example assumes that training has just begun and that the short sentence “Mobile APIs, for mobile developers” is the first document to be processed. Note that this tree will continue to grow as more documents are consumed.

Thus, each category will have a prefix tree storing information about word sequences, in which there is a tree node for each learned k-gram. Note that in Algorithm 1 there will never be more than n cursors and that the height of the trees will never exceed n, since nodes at level 1 store 1-grams, nodes at level 2 store 2-grams, and so on.

Finally, it is worth mentioning that this learning algorithm preserves the original one’s virtues. Namely, the training is still incremental (i.e. it supports online learning), since there is no need either to store all documents or to re-train from scratch every time new training documents are available; instead, it is only necessary to update the already created trees. Additionally, there is still no need to compute the document-term matrix because, during classification, and using Equations 5 and 1, gv can be dynamically computed based on the frequencies stored in the trees.
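A minimal Python sketch of Algorithm 1 (ours, not the paper's released code) may help fix the ideas; trie nodes are plain dicts, `term.isalnum()` is a crude stand-in for the paper's word/punctuation distinction, and `MAX_LEVEL` plays the role of n:

```python
MAX_LEVEL = 3  # n: the maximum n-gram length to learn

def new_node():
    return {"freq": 0, "children": {}}

def learn_document(root, terms):
    """Update a category's prefix tree with every k-gram (k <= n) in `terms`."""
    cursors = []  # (node, level) pairs, one per n-gram currently being grown
    for term in terms:
        if not term.isalnum():  # punctuation: n-grams must not span it
            cursors = []
            continue
        cursors.append((root, 0))  # a new cursor starts at the root
        grown = []
        for node, level in cursors:
            child = node["children"].setdefault(term, new_node())
            child["freq"] += 1
            if level + 1 < MAX_LEVEL:  # cursors that reached depth n are dropped
                grown.append((child, level + 1))
        cursors = grown

# The example of Figure 3:
root = new_node()
learn_document(root, ["mobile", "apis", ",", "for", "mobile", "developers"])
```

After this call, the node for “mobile” has frequency 2, and there are nodes for the bigrams “mobile apis”, “for mobile” and “mobile developers” as well as for the trigram “for mobile developers”, exactly as in Figure 3.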

3.2 Classification

1:
2:function Classify-Sentence(s) returns a confidence vector
3:     input: s, a sequence of lexical units
4:     local variables: ngrams, a sequence of n-grams
5:                                vectors, a list of confidence vectors
6:     
7:     ngrams ← Parse(s)
8:     vectors ← Map(gv, ngrams)
9:     return Reduce(⊕, vectors)
10:end function
11:
12:function Parse(s) returns a sequence of n-grams
13:     input: s, a sequence of lexical units
14:     global variables: C, the learned categories
15:     local variables: ngram, a sequence of words
16:                                ngrams, a sequence of n-grams
17:                                candidates, a list of n-grams
18:     
19:     cursor ← the first term in s
20:     while cursor has not reached the end of s do
21:          for each c in C do
22:               add Best-N-Gram(c, cursor) to candidates           
23:          ngram ← the n-gram with the highest gv in candidates
24:          add ngram to ngrams
25:          move cursor forward ngram.Length positions      
26:     return ngrams
27:end function
28:
29:function Best-N-Gram(c, cursor) returns an n-gram
30:     input: c, a category
31:                 cursor, a cursor pointing to a term in the sentence
32:     local variables: node, a node of c.Prefix-Tree
33:                                ngram, a sequence of words
34:                                best, a sequence of words
35:     
36:     node ← c.Prefix-Tree.Root
37:     word ← the term pointed to by cursor
38:     best ← the sequence containing only word
39:     while node has a child node for word do
40:          node ← node.Child-Node[word]
41:          add word to ngram
42:          word ← the next word in the sentence
43:          if gv(ngram, c) ≥ ε and gv(ngram, c) ≥ gv(best, c) then
44:               best ← ngram                               
45:     return best
46:end function
Algorithm 2 Sentence classification algorithm. Map applies the gv function to every n-gram in ngrams and returns the list of resulting vectors. Reduce reduces vectors to a single vector by cumulatively applying the summary operator ⊕.

The original classification algorithm remains mostly unchanged (see Algorithm 1 in burdisso2019); we only need to change the process by which sentences are split into single words, allowing them instead to be split into variable-length n-grams. Also, these n-grams must be “the best possible ones”, i.e. those having the maximum gv value. To achieve this goal, we use the prefix tree of each category as a deterministic finite automaton (DFA) to recognize the most relevant sequences. Virtually, every node is considered a final state if its gv is greater than or equal to a small constant ε. Thus, every DFA advances its input cursor until no valid transition can be applied, and then the state (node) with the highest gv value is selected. This process is illustrated in more detail in Figure 4.

Finally, the full algorithm is given in Algorithm 2. (For this algorithm to be included as part of SS3’s overall classification algorithm, we only need to modify the original definition of Classify-At-Level of Algorithm 1 in burdisso2019 so that, when called at the sentence level, it calls our new function, Classify-Sentence.) Note that instead of splitting the sentences into words simply by using a delimiter, we now call a Parse function (line 7). Parse intelligently splits the sentence into a list of variable-length n-grams. This is done by calling the Best-N-Gram function (line 22), which carries out the process illustrated in Figure 4 to return the best n-gram for a given category.
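The recognition step can be sketched as follows (our code, not the paper's; nodes are dicts with a `children` key, `gv` is a stand-in scoring function supplied by the caller, and for simplicity a single category's trie is shown, whereas Parse in Algorithm 2 takes the best candidate across all categories):

```python
EPSILON = 0.01  # a node is a "final state" if its gv reaches this constant

def best_ngram(root, words, start, gv):
    """Feed the DFA with words[start:]; return the best final state reached."""
    node, ngram = root, []
    best = (words[start],)  # fall back to the single word
    for word in words[start:]:
        if word not in node["children"]:
            break  # no valid transition left: stop feeding the DFA
        node = node["children"][word]
        ngram.append(word)
        if gv(tuple(ngram)) >= EPSILON and gv(tuple(ngram)) >= gv(best):
            best = tuple(ngram)
    return best

def parse(root, words, gv):
    """Split a sentence into the best variable-length n-grams."""
    ngrams, i = [], 0
    while i < len(words):
        ngram = best_ngram(root, words, i, gv)
        ngrams.append(ngram)
        i += len(ngram)  # advance the input cursor past the recognized n-gram
    return ngrams
```

For instance, with a trie containing the path machine → learning and gv values 0.1 for ("machine",) and 0.23 for ("machine", "learning"), parsing “machine learning is used” yields the blocks ("machine", "learning"), ("is",), ("used",), mirroring Figure 4.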

(a)
(b)
Figure 4: Example of recognizing the best n-gram for the first sentence block of Figure 2, “Machine learning is being widely used”. For simplicity, in this example we only show technology’s DFA. There are conceptually 2 cursors: a black one representing the input cursor and a white one, the “lookahead” cursor, used to feed the automatons. (a) The lookahead cursor has advanced, feeding the DFA with 3 words (“machine”, “learning”, and “is”), until no more state transitions were available. There were two possible final states, one for “machine” and another for “machine learning”; the latter is selected since it has the highest gv (0.23). (b) Finally, after the bigram “machine learning” was recognized (see the first two word blocks painted in gray in Figure 2), the input cursor advanced 2 positions and is ready to start the process again, using “is” as the first word to feed the automatons.

4 Experimental Results

4.1 Datasets

Experiments were conducted on the CLEF eRisk 2017 (losada2017erisk) and 2018 (losada2018overview) open tasks on early risk detection of depression and anorexia. These pilot tasks focused on sequentially processing the content posted by users on Reddit. The datasets used in these tasks are collections of writings (submissions) posted by users (referred to as “subjects”). The evaluation metric used in the eRisk tasks is called Early Risk Detection Error (ERDE). The ERDE measure takes into account not only the correctness of the system’s output but also the delay taken to emit the decision. Additionally, ERDE is parameterized by a parameter, o, which serves as the “deadline” for decision making, i.e. if a correct positive decision is made at a time k > o (time is measured in terms of user writings), it is taken by ERDE_o as if it were incorrect (a false positive). Finally, the performance of all participants was measured using ERDE_5 and ERDE_50.

Due to space limitations, we do not include here the details of the datasets or the ERDE measure; they can be found in losada2017erisk and losada2018overview.
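For readers who want the gist anyway, the following sketch implements ERDE as we read its definition in Losada & Crestani (2016); it is our illustration, and the cost values are placeholders, since the actual costs are corpus-dependent:

```python
import math

def erde(decision, truth, k, o, c_fp=0.1296, c_fn=1.0, c_tp=1.0):
    """ERDE_o for one subject, with the decision emitted after k writings.

    `decision` and `truth` are booleans (at-risk or not); `o` is the
    deadline parameter; c_fp, c_fn and c_tp are placeholder costs.
    """
    if decision and not truth:
        return c_fp                              # false positive
    if not decision and truth:
        return c_fn                              # false negative (missed risk)
    if decision and truth:                       # true positive: cost grows with delay
        latency_cost = 1.0 - 1.0 / (1.0 + math.exp(k - o))  # sigmoid in k
        return latency_cost * c_tp
    return 0.0                                   # true negative
```

Under this definition, a correct positive decision made well before the deadline costs almost nothing, while one made exactly at k = o already costs half of c_tp, which is why the measure rewards earliness.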

4.2 Implementation details

The classifier was manually coded in Python 2.7 using only built-in functions and data structures (such as dict, and the map and reduce functions). The source code is available at https://github.com/sergioburdisso/ss3. Additionally, to avoid wasting memory by letting the digital trees grow unnecessarily large, every million words we executed a “pruning” procedure in which all the nodes with a frequency less than or equal to 10 were removed. We also fixed the maximum n-gram length to 3, i.e. we set n = 3 (we tested values greater than 3 but did not notice any improvement in classification performance). Finally, we used the same hyper-parameter values as in burdisso2019. The early classification policy we used was to classify test subjects as positive as soon as the accumulated positive confidence value exceeded the negative one.
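The pruning pass can be sketched as follows, assuming trie nodes represented as plain dicts with a `freq` count and a `children` map (our representation, not the released code's):

```python
def prune(node, min_freq=10):
    """Recursively drop every child whose frequency is <= min_freq.

    In this trie a (k+1)-gram can never be more frequent than its k-gram
    prefix, so removing a low-frequency node safely removes its subtree.
    """
    node["children"] = {
        word: child
        for word, child in node["children"].items()
        if child["freq"] > min_freq
    }
    for child in node["children"].values():
        prune(child, min_freq)
    return node
```

Running such a pass periodically bounds the memory footprint while keeping the frequent n-grams that actually contribute to gv.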

4.3 Results

Results are shown in Table 1. As can be seen, except for ERDE_5 in D18 and ERDE_50 in A18, τ-SS3 outperformed the best models. It is worth mentioning that, although not included in the table, the new model improved the original SS3’s performance not only according to the ERDE measures but also according to most of the traditional/timeless measures. For instance, in the eRisk 2017 early depression detection task, τ-SS3’s F1, precision and recall were 0.55, 0.43 and 0.77 respectively, against SS3’s 0.52, 0.44 and 0.63. In addition, although these values were not the best among all participants, they were well above the average (0.39, 0.36 and 0.51), which is not bad considering that the hyper-parameter values of our model were selected to optimize the ERDE measure.

These results imply that the learned n-grams contributed to improving the performance of our classifier. Furthermore, these n-grams also improved the visual explanations given by SS3, as illustrated in Figure 5. (We have built an online live demo to try out τ-SS3, available at http://tworld.io/ss3, which, along with the classification result, gives interactive visual explanations similar to the one shown in this figure.)

Task  Model        ERDE_5    ERDE_50
D17   τ-SS3        12.6%     7.70%
      SS3          12.6%     8.12%
      UNSLA        13.66%    9.68%
      FHDO-BCSGB   12.70%    10.39%
D18   τ-SS3        9.48%     6.17%
      SS3          9.54%     6.35%
      UNSLA        8.78%     7.39%
      FHDO-BCSGA   9.50%     6.44%
A18   τ-SS3        11.31%    6.26%
      SS3          11.56%    6.69%
      FHDO-BCSGD   12.15%    5.96%
      UNSLB        11.40%    7.82%
Table 1: Results on the eRisk tasks measured by ERDE (the lower, the better). The tasks are “D17” and “D18” for depression detection on eRisk 2017 and 2018, respectively, and “A18” for anorexia detection on eRisk 2018. For comparison purposes, the models with the best ERDE_5 and ERDE_50 in the competitions are also included.

5 Conclusions

In this article, we introduced τ-SS3, a novel text classifier able to learn and recognize useful patterns over text streams “on the fly”. We saw how τ-SS3 is defined as an expansion of the SS3 classifier. The new model uses a prefix tree to store the variable-length n-grams seen during training. The same data structure is then used as a DFA to recognize important word sequences as the input is read. This allowed us to keep all the original SS3 virtues: support for incremental classification and learning over text streams; easy support for early classification; and visual explainability. τ-SS3 showed an improvement over the original SS3 in terms of standard performance metrics as well as ERDE metrics. It also showed an improvement in terms of the expressiveness of visual explanations.

(a) Sentence level
(b) Word level
Figure 5: This figure shows the visual description given in Figure 9 of burdisso2019: writing 60 of subject 9579 from the 2017 depression detection task. The visual description is shown at different levels: (a) sentences and (b) words. Blocks are painted proportionally to the actual confidence values obtained for the depression category after experimentation. Note that more useful information is now shown, such as the n-grams “I was feeling” and “kill myself”, improving the richness of the visual explanations.

References


  • (1) Burdisso, S.G., Errecalde, M., Montes-y-Gómez, M.: A text classification framework for simple and effective early depression detection over social media streams. Expert Systems with Applications 133, 182–197 (2019). https://doi.org/10.1016/j.eswa.2019.05.023
  • (2) Burdisso, S.G., Errecalde, M., Montes-y-Gómez, M.: UNSL at eRisk 2019: a unified approach for anorexia, self-harm and depression detection in social media. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 10th International Conference of the CLEF Association, CLEF 2019. Springer International Publishing, Lugano, Switzerland (2019)
  • (3) Crochemore, M., Lecroq, T.: Trie. Encyclopedia of Database Systems pp. 3179–3182 (2009)
  • (4) Escalante, H.J., Villatoro-Tello, E., Garza, S.E., López-Monroy, A.P., Montes-y Gómez, M., Villaseñor-Pineda, L.: Early detection of deception and aggressiveness using profile-based representations. Expert Systems with Applications 89, 99–111 (2017)
  • (5) Fredkin, E.: Trie memory. Communications of the ACM 3(9), 490–499 (1960)
  • (6) Iskandar, B.S.: Terrorism detection based on sentiment analysis using machine learning. Journal of Engineering and Applied Sciences 12(3), 691–698 (2017)
  • (7) Kwon, S., Cha, M., Jung, K.: Rumor detection over varying time windows. PloS one 12(1), e0168344 (2017)
  • (8) Losada, D.E., Crestani, F.: A test collection for research on depression and language use. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 28–39. Springer (2016)
  • (9) Losada, D.E., Crestani, F., Parapar, J.: eRisk 2017: CLEF lab on early risk prediction on the internet: experimental foundations. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 346–360. Springer (2017)
  • (10) Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk: early risk prediction on the internet. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 343–361. Springer (2018)
  • (11) Ma, J., Gao, W., Mitra, P., Kwon, S., Jansen, B.J., Wong, K.F., Cha, M.: Detecting rumors from microblogs with recurrent neural networks. In: IJCAI. pp. 3818–3824 (2016)
  • (12) Ma, J., Gao, W., Wei, Z., Lu, Y., Wong, K.F.: Detect rumors using time series of social context information on microblogging websites. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. pp. 1751–1754. ACM (2015)