τ-SS3: a text classifier with dynamic n-grams for early risk detection over text streams
A recently introduced classifier, called SS3, has been shown to be well suited to early risk detection (ERD) problems on text streams. It obtained state-of-the-art performance on early depression and anorexia detection on Reddit in the CLEF eRisk open tasks. SS3 was designed to deal naturally with ERD problems: it supports incremental training and classification over text streams and it can visually explain its rationale. However, SS3 processes the input using a bag-of-words model and therefore lacks the ability to recognize important word sequences. This could negatively affect classification performance and also reduces the descriptiveness of the visual explanations. In standard document classification, it is common to use word n-grams to overcome some of these limitations. Unfortunately, when working with text streams, using n-grams is not trivial, since the system must learn and recognize which n-grams are important “on the fly”. This paper introduces τ-SS3, a variation of SS3 that expands the model to dynamically recognize useful patterns over text streams. We evaluated our model on the eRisk 2017 and 2018 tasks on early depression and anorexia detection. Experimental results show that τ-SS3 is able to improve both existing results and the richness of visual explanations.
Keywords: Early Text Classification. Dynamic Word N-Grams. Incremental Classification. SS3. Explainability. Trie. Digital Tree.
1 Introduction
The analysis of sequential data is a very active research area that addresses problems where data is naturally processed as sequences or can be better modeled that way, such as sentiment analysis, machine translation, video analytics, speech recognition, and time-series processing. A scenario that is gaining increasing interest in the classification of sequential data is the one referred to as “early classification”, in which the problem is to classify the data stream as early as possible without a significant loss in accuracy. The reasons behind this requirement of “earliness” can be diverse. It could be necessary because the sequence length is not known in advance (e.g. a social media user’s content) or, for example, because savings of some sort (e.g. computational savings) can be obtained by classifying the input early. However, the most important (and interesting) cases are those in which the delay in that decision could also have negative or risky implications. This scenario, known as “early risk detection” (ERD), has gained increasing interest in recent years, with potential applications in rumor detection (ma2015detect; ma2016detecting; kwon2017rumor), sexual predator detection and aggressive text identification (escalante2017early), depression detection (losada2017erisk; losada2016test) or terrorism detection (iskandar2017terrorism).
ERD scenarios are difficult to deal with, since models need to: support classification and/or learning over sequential data (streams); provide a clear method to decide whether the data processed so far is enough to classify the input stream (early stopping); and, additionally, have the ability to explain their rationale, since people’s lives could be affected by their decisions.
A recently introduced text classifier (burdisso2019), called SS3, has been shown to be well suited to deal with ERD problems on social media streams. Unlike standard classifiers, SS3 was designed to deal naturally with ERD problems: it supports incremental training and classification over text streams and it has the ability to visually explain its rationale. It obtained state-of-the-art performance on early depression, anorexia and self-harm detection in the CLEF eRisk open tasks (burdisso2019; burdisso2019clef).
However, at its core, SS3 processes each sentence from the input stream using a bag-of-words model. As a consequence, SS3 lacks the ability to capture important word sequences, which could negatively affect classification performance. Additionally, since single words are less informative than word sequences, this bag-of-words model reduces the descriptiveness of SS3’s visual explanations.
The weaknesses of bag-of-words representations are well-known in the standard document classification field, in which word n-grams are usually used to overcome them. Unfortunately, when dealing with text streams, using word n-grams is not a trivial task since the system has to dynamically identify, create and learn which n-grams are important “on the fly”.
In this paper, we introduce a modification of SS3, called τ-SS3, which expands its original definition to allow it to recognize important word sequences. In Section 2, the original SS3 definition is briefly introduced. Section 3 formally introduces τ-SS3, describing the needed equations and algorithms. In Section 4, we evaluate our model on the CLEF eRisk 2017 and 2018 tasks on early depression and anorexia detection. Finally, Section 5 summarizes the main conclusions derived from this work.
2 The SS3 Text Classifier
As described in more detail in burdisso2019, SS3 first builds a dictionary of words for each category during the training phase, in which the frequency of each word is stored. Then, during the classification stage, it uses these word frequencies to calculate a value for each word by means of a function, gv, that values words in relation to categories. (gv stands for “global value” of a word; in contrast with the “local value” (lv), which values a word only according to its frequency in a given category, gv takes into account its relative frequency across all the categories.) gv takes a word w and a category c and outputs a number in the interval [0, 1] representing the degree of confidence with which w is believed to exclusively belong to c. For instance, given a set of categories, we could have:
Additionally, a vectorial version of gv is defined as:

$\vec{gv}(w) = (gv(w, c_1), gv(w, c_2), \dots, gv(w, c_n))$

where $c_i \in C$ (the set of all the categories). That is, $\vec{gv}$ is applied only to a word and outputs a vector in which each component is the word’s gv for the corresponding category $c_i$. For instance, following the above example, we have:

The vector $\vec{gv}(w)$ is called the “confidence vector of w”. Note that each category $c_i$ is assigned a fixed position $i$ in the vector. For instance, in the example above, the confidence vector of the word “sushi” has its first position corresponding to $c_1$, its second to $c_2$, and so on.
The computation of gv involves three functions, lv, sg and sn, as follows:

$gv(w, c) = lv(w, c) \cdot sg(w, c) \cdot sn(w, c)$

lv(w, c) values a word based on the local frequency of w in c. As part of this process, the word distribution curve is smoothed by a factor controlled by the hyper-parameter σ.

sg(w, c) captures the global significance of w in c; its value decreases in relation to the value of lv(w, c′) in the other categories c′. The hyper-parameter λ controls how far the local value must deviate from the median to be considered significant.

sn(w, c) sanctions w in relation to how many other categories w is significant (sg) to. That is, the more categories for which sg is high, the smaller the sn value. The hyper-parameter ρ controls how sensitive this sanction is.
To keep this paper shorter and simpler, we only introduce here the equation for lv, since the computation of both sg and sn is based solely on this function. Readers interested in how the sg and sn functions are actually computed are referred to the original SS3 paper (burdisso2019). Thus, lv is defined as:

$lv(w, c) = P(w|c)^{\sigma}$

which, after estimating the probability $P(w|c)$ by analytical Maximum Likelihood Estimation (MLE), leads to the actual definition:

$lv(w, c) = \left( \frac{tf_{w,c}}{\max_{w'} tf_{w',c}} \right)^{\sigma}$

where $tf_{w,c}$ denotes the frequency of w in c and $\max_{w'} tf_{w',c}$ the maximum word frequency seen in c. The value σ is one of SS3’s hyper-parameters, called “smoothness”.
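As an illustration, the local value can be sketched in a few lines of Python (a minimal sketch under our own naming; `cat_freqs` stands for a category’s word-frequency dictionary and is not an identifier from the original implementation):

```python
def local_value(word, cat_freqs, sigma=0.5):
    """lv(w, c): smoothed, max-normalized frequency of `word` in a category.

    cat_freqs -- dict mapping each word to its raw frequency in the category.
    sigma     -- the "smoothness" hyper-parameter.
    """
    if not cat_freqs or word not in cat_freqs:
        return 0.0
    max_freq = max(cat_freqs.values())
    return (cat_freqs[word] / float(max_freq)) ** sigma
```

Note that with σ = 1 the value is just the max-normalized frequency, while σ < 1 “smooths” the gap between frequent and infrequent words.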
As illustrated in Figure 1, the SS3 classification algorithm can be thought of as a 2-phase process. In the first phase, the input is split into multiple blocks (e.g. paragraphs), and each block is in turn repeatedly divided into smaller units (e.g. sentences, words). Thus, the previously “flat” document is transformed into a hierarchy of blocks. In the second phase, the gv function is applied to each word to obtain the “level 0” confidence vectors, which are then reduced to “level 1” confidence vectors by means of a level 0 summary operator. This reduction process is recursively propagated up to higher-level blocks, using higher-level summary operators, until a single confidence vector is generated for the whole input. Finally, the actual classification is performed based on the values of this single confidence vector, using some policy, for example, selecting the category with the highest confidence value. Note that, given the confidence vectors in this hierarchy of blocks, it is quite straightforward for SS3 to visually justify a classification by coloring the different blocks of the input in relation to their confidence values. This is quite relevant, especially for health-care systems, where specialists should be able to manually analyze the reasons behind automatic user classifications.
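The 2-phase process above can be sketched as follows (a toy version: paragraph and sentence splitting are hard-coded, and component-wise addition is used as every summary operator, one simple choice; the actual operators and the classification policy are configurable in SS3):

```python
def classify(document, gv, categories):
    """Split the document into a hierarchy (paragraphs > sentences > words),
    value each word with gv, and reduce confidence vectors level by level.
    Here every summary operator is component-wise addition (a simplification)."""
    def reduce_vectors(vectors):
        return [sum(comp) for comp in zip(*vectors)]  # component-wise sum

    paragraph_vectors = []
    for paragraph in document.split("\n"):
        sentence_vectors = []
        for sentence in paragraph.split("."):
            word_vectors = [gv(w) for w in sentence.split() if w]
            if word_vectors:
                sentence_vectors.append(reduce_vectors(word_vectors))
        if sentence_vectors:
            paragraph_vectors.append(reduce_vectors(sentence_vectors))
    d = reduce_vectors(paragraph_vectors)  # confidence vector of the whole input
    return categories[d.index(max(d))]    # policy: highest confidence wins
```

Here `gv` is assumed to return the confidence vector of a word, i.e. one value per category.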
Note that SS3 processes individual sentences using a bag-of-words model, since the summary operators reduce the confidence vectors of individual words into a single vector. Therefore, any relationship that could exist between these individual words, for instance, between “machine” and “learning” or “video” and “game”, is not taken into account. That is, the model cannot capture important word sequences that could improve classification performance, as could have been possible in the example shown in Figure 1. In addition, by allowing the model to capture important sequences, we also improve its ability to visually explain its rationale. For instance, think of “self-driving cars” in relation to technology: this sequence could be highlighted and presented to the user, instead of each word being overlooked because none of “driving”, “cars” and “self” is, on its own, relevant to technology. In standard document classification scenarios, this type of relationship could be captured using variable-length n-grams. Unfortunately, when working with text streams, using n-grams is not trivial, since the system has to dynamically identify and learn which n-grams are important “on the fly”. In the next section, we introduce a variation of SS3, which we have called τ-SS3, that expands the model definition to allow it to dynamically recognize important word sequences.
3 The τ-SS3 Text Classifier
Regarding the model’s formal definition, the only change we need to introduce is a generalized version of the lv function given in Equation 2. This is trivial, because it only involves allowing lv to value not only single words but also sequences of them. That is, in symbols, if $s$ is a sequence of words, then lv is now defined as:

$lv(s, c) = \left( \frac{P(s|c)}{P(s^{*}|c)} \right)^{\sigma}$

where $s^{*}$ is the sequence of words with the highest probability of occurring given that the category is c.

Then, in the same way as with Equation 3, the actual definition of lv becomes:

$lv(s, c) = \left( \frac{tf_{s,c}}{\max_{|s'|=|s|} tf_{s',c}} \right)^{\sigma}$

where $tf_{s,c}$ denotes the frequency of the sequence s in c and $\max_{|s'|=|s|} tf_{s',c}$ the maximum frequency seen in c for sequences of length |s|.
Thus, given any word sequence s, we can now use the original Equation 1 to compute its gv. For instance, suppose τ-SS3 has learned that the following word sequences have the gv values given below:

Then, the previously misclassified example could now be correctly classified, as shown in Figure 2. In the following subsections, we will see how this formal expansion is implemented in practice, i.e. how the training and classification algorithms are implemented to store learned sequences and to dynamically recognize sequences like those of this example at classification time.
The original SS3 learning algorithm only needs a dictionary of term-frequency pairs for each category. Each dictionary is updated as new documents are processed, i.e. unseen terms are added and the frequencies of already seen terms are updated. Note that these frequencies are the only elements we need to store, since to compute lv we only need to know the frequency of w in c (see Equation 3).
Likewise, the τ-SS3 learning algorithm only needs to store the frequencies of all word sequences seen while processing training documents. More precisely, given a fixed positive integer n, it must store information about all word k-grams seen during training, with k ≤ n, i.e. single words, bigrams, trigrams, etc. To achieve this, the new learning algorithm uses a prefix tree, also called a trie (trie1960; crochemore2009trie), to store all the frequencies, as shown in Algorithm 1. Note that instead of keeping n different dictionaries, one for each k-gram length (e.g. one for words, one for bigrams, etc.), we have decided to use a single prefix tree, since longer n-grams share common prefixes with the shorter ones. Additionally, instead of processing the input document n times, again one for each k-gram length, we use multiple cursors to simultaneously store all the sequences, allowing the input to be processed as a stream. Finally, note that lines 8 and 9 of Algorithm 1 ensure that we only take into account n-grams that make sense, i.e. those composed only of words. All these observations, as well as the intuition behind the algorithm, are illustrated with an example in Figure 3. This example assumes that training has just begun and that the short sentence “Mobile APIs, for mobile developers” is the first document to be processed. Note that this tree will continue to grow as more documents are consumed.

Thus, each category will have a prefix tree storing information about word sequences, with a tree node for each learned k-gram. Note that in Algorithm 1 there will never be more than n cursors and that the height of the trees will never exceed n, since nodes at level 1 store 1-grams, nodes at level 2 store 2-grams, and so on.
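The multi-cursor update over the prefix tree can be sketched as follows (a minimal dict-based trie under our own naming; unlike Algorithm 1, this sketch omits the punctuation handling of lines 8 and 9):

```python
def update_trie(trie, words, n=3):
    """Update a (nested-dict) prefix tree with all k-grams (k <= n) of a
    sentence, using at most n simultaneous cursors so the input is read once.
    Each node has the form {"freq": int, "children": {word: node}}."""
    cursors = []
    for word in words:
        cursors.append(trie)                 # open a new cursor at the root
        cursors = cursors[-n:]               # never more than n cursors
        next_cursors = []
        for node in cursors:
            children = node["children"]
            if word not in children:
                children[word] = {"freq": 0, "children": {}}
            children[word]["freq"] += 1      # one more occurrence of this k-gram
            next_cursors.append(children[word])
        cursors = next_cursors
    return trie
```

For the sentence of Figure 3, each word advances every open cursor one level down, so e.g. “mobile” ends up with frequency 2 at level 1, while the trigram “for mobile developers” is stored as a single level-3 node.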
Finally, it is worth mentioning that this learning algorithm preserves the original one’s virtues. Namely, training is still incremental (i.e. it supports online learning) since there is no need either to store all the documents or to re-train from scratch every time new training documents become available; it is only necessary to update the already created trees. Additionally, there is still no need to compute the document-term matrix because, during classification, gv can be dynamically computed, using Equation 5 and Equation 1, from the frequencies stored in the trees.
The original classification algorithm remains mostly unchanged (see Algorithm 1 from burdisso2019); we only need to change the process by which sentences are split into single words, allowing them to be split into variable-length n-grams instead. Moreover, these n-grams must be “the best possible ones”, i.e. those having the maximum gv value. To achieve this goal, we use the prefix tree of each category as a deterministic finite automaton (DFA) to recognize the most relevant sequences. Virtually, every node is considered a final state if its gv is greater than or equal to a small constant γ. Thus, every DFA advances its input cursor until no valid transition can be applied, and then the state (node) with the highest gv value is selected. This process is illustrated in more detail in Figure 4.

Finally, the full algorithm is given in Algorithm 2. (For this algorithm to be included as part of SS3’s overall classification algorithm, we only need to modify the original definition of Classify-At-Level of Algorithm 1 in burdisso2019 so that it calls our new function, Classify-Sentence.) Note that instead of splitting the sentences into words simply by using a delimiter, we now call a Parse function on line 6. Parse intelligently splits the sentence into a list of variable-length n-grams. This is done by calling the Best-N-Gram function on line 20, which carries out the process illustrated in Figure 4 and returns the best n-gram for a given category.
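The recognition step can be sketched as follows (a simplified, single-category version of Parse and Best-N-Gram, reusing a nested-dict trie whose nodes have the form `{"freq": ..., "children": ...}`; the gv lookup is passed in as a function and γ is the small threshold mentioned above):

```python
def best_n_gram(trie, words, start, gv, gamma=0.01):
    """Advance through the trie as a DFA from words[start] until no valid
    transition applies; return the length of the n-gram with the highest gv
    value (at least gamma), defaulting to the single word."""
    node, best_len, best_gv = trie, 1, 0.0
    for i in range(start, len(words)):
        children = node["children"]
        if words[i] not in children:
            break                            # no valid transition left
        node = children[words[i]]
        value = gv(words[start:i + 1])       # gv of the candidate n-gram
        if value >= gamma and value > best_gv:
            best_len, best_gv = i - start + 1, value
    return best_len

def parse(trie, words, gv):
    """Greedily split a sentence into the best variable-length n-grams."""
    ngrams, i = [], 0
    while i < len(words):
        k = best_n_gram(trie, words, i, gv)
        ngrams.append(words[i:i + k])
        i += k
    return ngrams
```

In the full algorithm this search is carried out against every category’s tree, and the overall best n-gram is kept; the sketch above shows the mechanism for a single tree.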
4 Experimental Results
4.2 Implementation details
The classifier was manually coded in Python 2.7 using only built-in functions and data structures (such as dict and the map and reduce functions). The source code is available at https://github.com/sergioburdisso/ss3. Additionally, to avoid wasting memory by letting the digital trees grow unnecessarily large, every million words we executed a “pruning” procedure in which all nodes with a frequency less than or equal to 10 were removed. We also fixed the maximum n-gram length to 3, i.e. we set n = 3. (We tested values greater than 3 but did not notice any improvement in classification performance.) Finally, we used the same hyper-parameter values as in burdisso2019. The early classification policy we used was to classify test subjects as positive as soon as the positive accumulated confidence value exceeded the negative one.
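The pruning procedure can be sketched as follows (a recursive pass over a nested-dict trie node of the form `{"freq": ..., "children": ...}`, as assumed in the earlier sketches; the threshold of 10 matches the value mentioned above):

```python
def count_nodes(node):
    """Number of descendants of a trie node."""
    return sum(1 + count_nodes(c) for c in node["children"].values())

def prune(node, min_freq=10):
    """Remove every trie node (and its whole subtree) whose frequency is
    <= min_freq; return how many nodes were removed."""
    removed = 0
    for word in list(node["children"]):      # copy keys: we delete while iterating
        child = node["children"][word]
        if child["freq"] <= min_freq:
            removed += 1 + count_nodes(child)  # the node plus its subtree
            del node["children"][word]
        else:
            removed += prune(child, min_freq)
    return removed
```

Since a k-gram can never be more frequent than its (k-1)-gram prefix, dropping a low-frequency node safely drops its even rarer descendants with it.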
Results are shown in Table 1. As can be seen, except for one measure on D18 and one on A18, τ-SS3 outperformed the best models. It is worth mentioning that, although not included in the table, the new model improved the original SS3’s performance not only according to the ERDE measures but also according to most of the traditional/timeless measures. For instance, in the eRisk 2017 early depression detection task, τ-SS3’s values for recall, precision and a third standard measure were 0.55, 0.43 and 0.77, respectively, against SS3’s 0.52, 0.44 and 0.63. In addition, although these values were not the best among all participants, they were well above the average (0.39, 0.36 and 0.51), which is not bad if we consider that the hyper-parameter values of our model were selected to optimize a different measure.
These results imply that the learned n-grams contributed to improving the performance of our classifier. Furthermore, these n-grams also improved the visual explanations given by SS3, as illustrated in Figure 5. (We have built an online live demo to try out τ-SS3, available at http://tworld.io/ss3, which, along with the classification result, gives interactive visual explanations similar to the one shown in this figure.)
5 Conclusions
In this article, we introduced τ-SS3, a novel text classifier that is able to learn and recognize useful patterns over text streams “on the fly”. We saw how τ-SS3 is defined as an expansion of the SS3 classifier. The new model uses a prefix tree to store the variable-length n-grams seen during training. The same data structure is then used as a DFA to recognize important word sequences as the input is read. This allowed us to keep all the original SS3 virtues: support for incremental classification and learning over text streams; easy support for early classification; and visual explainability. τ-SS3 showed an improvement over the original SS3 in terms of standard performance metrics as well as the ERDE metrics. It also showed an improvement in terms of the expressiveness of visual explanations.
- (1) Burdisso, S.G., Errecalde, M., Montes-y-Gómez, M.: A text classification framework for simple and effective early depression detection over social media streams. Expert Systems with Applications 133, 182–197 (2019). https://doi.org/10.1016/j.eswa.2019.05.023, http://www.sciencedirect.com/science/article/pii/S0957417419303525
- (2) Burdisso, S.G., Errecalde, M., Montes-y-Gómez, M.: UNSL at eRisk 2019: a unified approach for anorexia, self-harm and depression detection in social media. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 10th International Conference of the CLEF Association, CLEF 2019. Springer International Publishing, Lugano, Switzerland (2019)
- (3) Crochemore, M., Lecroq, T.: Trie. Encyclopedia of Database Systems pp. 3179–3182 (2009)
- (4) Escalante, H.J., Villatoro-Tello, E., Garza, S.E., López-Monroy, A.P., Montes-y-Gómez, M., Villaseñor-Pineda, L.: Early detection of deception and aggressiveness using profile-based representations. Expert Systems with Applications 89, 99–111 (2017)
- (5) Fredkin, E.: Trie memory. Communications of the ACM 3(9), 490–499 (1960)
- (6) Iskandar, B.S.: Terrorism detection based on sentiment analysis using machine learning. Journal of Engineering and Applied Sciences 12(3), 691–698 (2017)
- (7) Kwon, S., Cha, M., Jung, K.: Rumor detection over varying time windows. PloS one 12(1), e0168344 (2017)
- (8) Losada, D.E., Crestani, F.: A test collection for research on depression and language use. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 28–39. Springer (2016)
- (9) Losada, D.E., Crestani, F., Parapar, J.: eRisk 2017: CLEF lab on early risk prediction on the internet: experimental foundations. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 346–360. Springer (2017)
- (10) Losada, D.E., Crestani, F., Parapar, J.: Overview of erisk: early risk prediction on the internet. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 343–361. Springer (2018)
- (11) Ma, J., Gao, W., Mitra, P., Kwon, S., Jansen, B.J., Wong, K.F., Cha, M.: Detecting rumors from microblogs with recurrent neural networks. In: IJCAI. pp. 3818–3824 (2016)
- (12) Ma, J., Gao, W., Wei, Z., Lu, Y., Wong, K.F.: Detect rumors using time series of social context information on microblogging websites. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. pp. 1751–1754. ACM (2015)