An Automated Text Categorization Framework based on Hyperparameter Optimization

Eric S. Tellez (1, 2), Daniela Moctezuma (1, 3), Sabino Miranda-Jiménez (1, 2), Mario Graff (1, 2)

(1) CONACyT Consejo Nacional de Ciencia y Tecnología, Dirección de Cátedras, Insurgentes Sur 1582, Crédito Constructor 03940, Ciudad de México, México.
(2) INFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación, Circuito Tecnopolo Sur No 112, Fracc. Tecnopolo Pocitos II, Aguascalientes 20313, México.
(3) Centro de Investigación en Geografía y Geomática “Ing. Jorge L. Tamayo”, A.C., Circuito Tecnopolo Norte No. 117, Col. Tecnopolo Pocitos II, C.P. 20313, Aguascalientes, Ags, México.

April 2017

A great variety of text tasks, such as topic or spam identification, user profiling, and sentiment analysis, can be posed as supervised learning problems and tackled using a text classifier. A text classifier consists of several subprocesses; some of them are general enough to be applied to any supervised learning problem, whereas others are specifically designed to tackle a particular task, using complex and computationally expensive processes such as lemmatization and syntactic analysis. Contrary to traditional approaches, we propose a minimalistic and wide system, namely TC, able to tackle text classification tasks independently of domain and language. It is composed of some easy-to-implement text transformations, text representations, and a supervised learning algorithm. These pieces produce a competitive classifier even in the domain of informally written text. We provide a detailed description of TC along with an extensive experimental comparison with relevant state-of-the-art methods. TC was compared on 30 different datasets. Regarding accuracy, TC obtained the best performance in 20 datasets while achieving competitive results in the remaining 10. The compared datasets include several problems such as topic and polarity classification, spam detection, user profiling, and authorship attribution. Furthermore, it is important to state that our approach allows the usage of the technology even without knowledge of machine learning and natural language processing.

1 Introduction

Due to the large and continuously growing volume of textual data, automated text classification methods have attracted increasing interest from the research community. Although many efforts have been made in this direction, it remains an open problem. The arrival of massive data sources, like micro-blogging platforms, introduces new challenges where many of the prior techniques fail. Among the new challenges are the volume and noisy nature of the data, the shortness of the texts, which implies little context, and the informal style, plagued with misspellings and lexical errors.

These new data sources have popularized tasks such as sentiment analysis and user profiling. The sentiment analysis problem consists in determining the polarity of a given text, which can be a global polarity (about the whole text) or a polarity about a particular subject or entity. The user profiling task consists in, given a text, predicting some facts about the author, such as demographic information (e.g., gender, age, language, or region). Such is the importance of these problems that several international competitions have been carried out by the research community in recent years. For example, SemEval, TASS, and SENTIPOLC are challenges for sentiment classifiers for Twitter data in English, Spanish, and Italian, respectively. PAN also opens calls for author profiling systems for English, Spanish, and German. These problems are closely related to traditional text classification applications such as topic classification (e.g., classifying a news-like text into sports, politics, or economy), authorship attribution (e.g., identifying the author of a given text), and spam detection.

Usually, each of the aforementioned problems is treated in a particular way, i.e., a method is proposed to adequately solve one classification task. Traditionally, this approach cannot generalize to other related tasks and, consequently, the methods are dependent on the problem; however, it is worth mentioning that this specialization produces a lot of insight about the problem’s domain. Conversely, in this contribution, we propose a framework to create a text classifier regardless of both the domain and the language, based only on a training set of labeled examples.

The idea of creating a text classifier almost independent of the language and domain is not novel; in fact, in our previous work [1], we introduced a combinatorial framework for sentiment analysis. There, aspects of the language, such as stopwords and tokenizers, were considered, with special attention to lexical structures for negations. Particularities of the domain, like emoticons and emojis, were also considered. The present manuscript is a generalization and formalization of our previous work; this allows us to simplify the entire framework so that it works independently of both the language and the particular task, and to enable the use of more sophisticated text treatments whenever it is possible and necessary.

As stated above, we tackle the problem of creating text classifiers that work regardless of both the domain and the language, with nothing more than a training set to be learned. The general idea is to orchestrate a number of simple text transformations, tokenizers, and weighting schemes, along with a Support Vector Machine (SVM) as classifier, to produce effective text classification. In more detail, we look at the problem of creating effective text classifiers as a combinatorial optimization problem: there is a search space containing all possible combinations of different text transformations, tokenizers, and weighting procedures with their respective parameters, and, on this search space, a meta-heuristic is used to search for a configuration that produces a highly effective text classifier. This model selection procedure is commonly known in the literature as hyper-parameter optimization. To emphasize the simplicity of the approach, we named it micro Text Classification or simply TC.

This manuscript is organized as follows. The related work is presented in Section 2. Section 3 describes our contribution in depth. In Section 4, all the experimental details are described. In Section 5, we show an extensive experimental comparison of our approach with the relevant state-of-the-art methods over 30 different benchmarks. Finally, the conclusions are listed in Section 6.

2 Related work

Let us start by describing a typical text classifier, which can be summarized as a set of few, but complex, parts [2]. Firstly, the input text is passed to a lexical analyzer that both parses and normalizes the text; it outputs a list of tokens that represent the input text. The lexical analyzer typically includes some simple transformation functions, like the removal of diacritic symbols and lower casing the text, but it can also make use of sophisticated techniques like stemming, lemmatization, and misspelling correction. The tokens are commonly represented by words, pairs or triplets of adjacent words (bigrams or trigrams), and, in general, sequences of words (word n-grams). It is also possible to extend this approach to sequences of characters (character n-grams). When it is allowed to drop the middle words of word n-grams, we obtain skip-grams. The usage of these techniques is driven by human knowledge of the particular problem being tackled. Also, it is worth mentioning that the entire process is tightly linked to the input language.

Secondly, the output of the lexical analyzer is commonly used to create high dimensional vectors where each token of the vocabulary has a corresponding coordinate in the vector, so the value of each coordinate is associated with the weight of that token. The traditional way of weighting is to use the local and global statistics of tokens; popular examples of this approach are TF, IDF, TFIDF, and Okapi BM25. Alternatively, some information measures like the entropy are commonly used as term weights. It is often desirable to reduce the dimension of the vector space, and several techniques can be used for that purpose, such as PCA [3] (Principal Component Analysis) and LSI [4] (Latent Semantic Indexing).
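As an illustration of the weighting step described above, the following minimal sketch builds sparse TFIDF vectors from tokenized documents. It assumes the plain variant tf × log(N / df); many other variants exist, and the function name `tfidf_vectors` and the toy documents are hypothetical, not taken from the original work.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {token: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency inside this document
        vectors.append({token: freq * math.log(n_docs / df[token])
                        for token, freq in tf.items()})
    return vectors


# Toy, already tokenized collection (illustrative only).
docs = [["red", "car"], ["red", "tree"], ["blue", "car", "car"]]
vectors = tfidf_vectors(docs)
```

Under this variant, a token that appears in every document receives weight zero, which is the usual motivation for the IDF factor.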

Finally, the output of the weighting scheme is used to create a training set that can be learned by a classifier. A classifier is a machine learning algorithm that learns the instances of a training set $\mathcal{X}$. In more detail, the training set is a finite number of inputs and outputs, i.e., $\mathcal{X} = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$, where $x_i$ represents the $i$-th input and $y_i$ is the associated output. The objective is to find a function $f$ such that $f(x_i) = y_i$ for all $(x_i, y_i) \in \mathcal{X}$ and that could be evaluated in any element of the input space. In general, it is not possible to find a function that learns $\mathcal{X}$ perfectly. Consequently, a good classifier finds a function that minimizes an error function or maximizes a score function.

Perhaps one of the first generic text classifiers was proposed by Rocchio [5]; it works by generating object prototypes based on centroids of a Voronoi partition over TFIDF vectors. This strategy shows the effort to reduce the necessary memory to fit in the hardware available at that time. Rocchio uses nearest neighbor classifiers over the prototypes to perform the predictions; the preprocessing of the text was left to the expertise of the user. Rocchio was the baseline and the object of study in the area for a long time; such is the case of the work presented by Joachims [6], which describes a probabilistic analysis of the Rocchio algorithm.

With the purpose of improving the quality of the text classification task, Cardoso [7] proposes the use of centroids to enhance the power of several typical classifiers, such as kNN (k-nearest neighbors) and SVM (Support Vector Machines). Also, Cardoso published a number of datasets in various preprocessing stages, which are popular among the text classification community because using them allows focusing on the weighting and classification algorithms without having to tackle the text processing problem.

In [8], machine learning is used to create a spam detector. The proposed method uses a combination of features, preprocessing steps, and setup details, such as using lemmatization or not, using a stop-list or not, keyword patterns, and varying the length of the training corpus. A similar work is presented by Androutsopoulos et al. in [9].

In the topic classification task, [10] presents an experimental scheme with the Reuters dataset, three machine learning methods (Rocchio algorithm, k-NN, and SVM), and three term selection functions (information gain, chi-square, and gain ratio). [11] proposes a topic modelling algorithm based on Latent Dirichlet Allocation (LDA), which assigns one topic to an unlabeled document. Also, a combination of LDA and the Expectation-Maximization (EM) algorithm is proposed.

Another approach to text classification is to move the focus from text processing and text classification, to improve the term-weighting; this is a successful strategy followed by recent works. Cummins [12] proposes a method based on Genetic Programming to determine and evaluate several term weighting schemes for the vector space model. Escalante et al. [13] present an approach to improve the performance of classical term-weighting schemes using genetic programming. Their approach outperforms standard schemes, based on an extensive experimental comparison. The authors also compare the Cummins [12] approach over their benchmarks.

Lai et al. [14] use both recurrent and convolutional neural networks to produce a term-weighting scheme that captures semantics from the text. Similarly to word embeddings [15, 16], the authors represent words based on their context and, also, they use skip-grams for text representation. The experimental results show higher values of macro-F1 in comparison to other state-of-the-art methods.

Vilares et al. [17] introduce an unsupervised approach for multilingual sentiment analysis driven by syntax-based rules; the words are weighted based on the analysis of syntax-graphs. The authors provide experimental support for English, Spanish, and German. However, to support an additional language, several rules and a proper syntax parser need to be implemented.

Mozetič et al. [18] study the effect of the agreement among human taggers on the performance of sentiment classifiers. To this end, they compare several classifiers over a traditional text normalization and a vector representation with TFIDF weighting. They provide 14 tagged datasets for European languages; we selected some of them for our benchmarks. See Section 5 for more details.

Author profiling is another important task related to text categorization, where several advances have been proposed. In [19], the authors report their approach to author profiling; in particular, they describe the best classifier of the PAN’13 contest, which consists of a distributional word representation based on the membership to each class along with a number of standard text preprocessing steps, see [20]. Recently, in PAN’17 [21], some current works related to user profiling were presented. In this case, user profiling is related to gender and language-region classification. In this respect, in [22], an SVM with a linear kernel is employed in combination with word unigrams, character 3- to 5-grams, and POS features. In [23], the selected features were word and POS n-grams, the number of emojis in the text, document sentiment, and character flooding (counting the number of times that a sequence of three or more identical characters appears in the text). Finally, a lexicon of important words is also employed.

3 TC: A Combinatorial Framework for Text Classification

Our approach consists in finding a competitive text classifier for a given task among a (possibly large) set of candidate classifiers. A text classifier is represented by the parameters that determine the classifier’s functionality along with the input dataset. The search for the desired text classifier should be performed efficiently and accurately, in the sense that the final classifier should be competitive with respect to the best possible classifier in the defined space of classifiers.

In the first part of this section, we describe the structure of our approach, that is, we state the parameters defining our configuration space. Then, we define the TC graph, which is the core structure used by the meta-heuristics implemented to find a well-performing text classifier for a given task. Along the way, we also describe the score function that encapsulates the functionality of the classifier and provides the numerical output that the optimization process maximizes.

3.1 The configuration space

As mentioned previously, a text classifier consists of well-differentiated parts. For our purposes, a classifier has the following parts: i) a list of functions that normalize and transform the input text to the input of tokenizers, ii) a set of tokenizer functions that transform the given text into a multiset of tokens, iii) a function that generates a vector from the multiset of tokens; and finally, iv) a classifier that knows how to assign a label to a given vector. These pieces define a TC space of configurations, which is defined by the tuple $(\mathcal{T}, \mathcal{G}, \mathcal{H}, \Psi)$. A more detailed description is given in the following paragraphs.

  1. $\mathcal{T} = \{T_1, \ldots, T_n\}$ is the space of transformation functions, where each $T_i$ is defined as the identity function together with a set of related, mutually exclusive functions (the identity function is defined as $\mathrm{id}(x) = x$). We define the function $T$ such that $T(s) = (t_n \circ \cdots \circ t_1)(s)$ with $t_i \in T_i$, where the parameter $s$ is a text, i.e., a string of symbols.

  2. $\mathcal{G} = \{G_1, \ldots, G_m\}$ is the set of tokenizer functions. Each $G_i$ offers two choices: a function that returns $\emptyset$ or a simple tokenizer function, i.e., a function that extracts a list of subsequences of the given argument. More precisely, the function $G$ is defined as $G(s) = \biguplus_{i=1}^{m} g_i(s)$, where each $g_i \in G_i$ extracts a list of subsequences of $s$. The final multiset is named the bag of tokens.

  3. $\mathcal{H}$ is a set of functions that transform a bag of tokens into a vector of dimension $d$, i.e., each $h \in \mathcal{H}$ maps a bag of tokens $\{s_1, s_2, \ldots\}$, where each $s_j$ is a non-empty string, to a vector in $\mathbb{R}^d$. The proper value of each vector’s coordinate is also determined by $h$; the latter task is commonly known as the weighting scheme.

  4. Finally, $\Psi$ is a set of functions that create a classifier for a given labeled dataset as knowledge source.

Now, let $\mathcal{C}$ be the set of all possible configurations of the TC space; therefore, it is defined as follows:

$$\mathcal{C} = T_1 \times \cdots \times T_n \times G_1 \times \cdots \times G_m \times \mathcal{H} \times \Psi;$$

then, the size of $\mathcal{C}$ is described by

$$|\mathcal{C}| = \left(\prod_{i=1}^{n} |T_i|\right) \left(\prod_{i=1}^{m} |G_i|\right) |\mathcal{H}| \, |\Psi|.$$

Without loss of generality, the size of the search space can be summarized as $|\mathcal{C}| = c \, 2^{n+m} \, |\mathcal{H}| \, |\Psi|$, where the term $c = \prod_{i=1}^{n} |T_i| / 2 \geq 1$ captures the effect of $T_i$s with more than two member functions. This means that $|\mathcal{C}|$ is lower bounded by $2^{n+m}$, i.e., the case where all $T_i$s are binary and both $\mathcal{H}$ and $\Psi$ are singletons. Even in the simplest setup, the configuration space grows exponentially with the number of possible transformations and tokenizers. Thus, in order to find the best item, it is necessary to evaluate the entire space; this is computationally not feasible (for instance, evaluating each configuration takes about 10 minutes on a commodity workstation; more about this is detailed in the experimental section). A typical configuration space can contain billions of configurations, so an exhaustive evaluation is not feasible on current computers. To remain a practical approach, instead of performing an exhaustive evaluation of $\mathcal{C}$ to find the best configuration, we soften the problem to finding a (very) competitive configuration; then it can be solved as a combinatorial optimization problem, in particular, as the maximization of a score function.
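To make the growth of the configuration space concrete, the following sketch enumerates a deliberately tiny space and checks that its size equals the product of the sizes of its parts. All set names and members are hypothetical placeholders; the assumption is that one function is chosen per transformation set and that each tokenizer is independently switched on or off.

```python
from itertools import product
from math import prod

# Hypothetical toy space; every name and member below is illustrative only.
TRANSFORMATIONS = {
    "hashtags": ("identity", "remove", "group"),  # a trivalent function set
    "lowercase": ("identity", "apply"),           # a bivalent function set
}
TOKENIZERS = ("word-1gram", "word-2gram", "char-3gram")  # each one is on or off
WEIGHTINGS = ("tf", "tfidf")
CLASSIFIERS = ("linear-svm",)


def enumerate_configurations():
    """List every configuration as (transformations, tokenizer switches, weighting, classifier)."""
    transformation_choices = product(*TRANSFORMATIONS.values())
    tokenizer_switches = product((False, True), repeat=len(TOKENIZERS))
    return list(product(transformation_choices, tokenizer_switches, WEIGHTINGS, CLASSIFIERS))


def space_size():
    """Closed-form size: the product of the sizes of all parts."""
    return (prod(len(members) for members in TRANSFORMATIONS.values())
            * 2 ** len(TOKENIZERS) * len(WEIGHTINGS) * len(CLASSIFIERS))


configurations = enumerate_configurations()
```

Adding one more binary transformation set or one more tokenizer doubles the number of configurations, which is why exhaustive enumeration stops being practical very quickly.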

3.2 The configuration graph

In order to solve the combinatorial problem with local search-based meta-heuristics, it is necessary to create a graph where the vertex set corresponds to the set of configurations $\mathcal{C}$, and the edge set corresponds to the neighborhood of each vertex, $\{(u, v) \mid u \in \mathcal{C}, v \in N(u)\}$. The edges are simply denoted by the neighborhood function $N$, so $(\mathcal{C}, N)$ is a TC graph.

Our main assumption is simple and feasible: the score function varies slowly on similar configurations, such that we can assume some degree of local concavity, in the sense that a well-performing local maximum can be reached using greedy decisions at some given point. Even when this is not true in general, the solver algorithm should be robust enough to get a good approximation when the assumption is valid only with some degree of certainty. To induce the search properties, the neighborhood should be defined in such a way that it describes only similar configurations. For this matter, we should define a distance function between configurations. First, we must define a comparison function,

$$\Delta(a, b) = \begin{cases} 0 & \text{if } a = b, \\ 1 & \text{otherwise.} \end{cases}$$

Since each configuration is a tuple of functions, the Hamming distance over configurations is naturally defined as follows

$$d_H(u, v) = \sum_{i=1}^{|u|} \Delta(u_i, v_i),$$

where $u_i$ and $v_i$ denote the $i$-th components of the configurations $u$ and $v$.
Now, we can define $B(c, r) = \{u \in \mathcal{C} \mid d_H(c, u) \leq r\}$, where $d_H$ is the Hamming distance, for any radius $r$ and a configuration $c$. However, the number of items grows exponentially with the radius, and therefore, the notion of locality will be rapidly degraded. To maintain the locality, we define the neighborhood using a small, fixed radius $r$:

$$N(c) = \{u \in \mathcal{C} \mid 0 < d_H(c, u) \leq r\}.$$

Under this construction scheme, the diameter of the graph is determined by the length of the configuration tuple; the diameter determines the number of hops in the TC graph that an optimal opt algorithm will perform in the worst case. However, since the best configuration is unknown, we must use score as an oracle that leads our navigation at each step.
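The distance and neighborhood used to build the graph can be sketched as follows. This is only an illustration: configurations are modelled as plain tuples, `domains` is a hypothetical list with the admissible values of each position, and the neighborhood is assumed to contain the configurations at Hamming distance exactly one.

```python
def hamming(u, v):
    """Number of positions at which two equal-length configurations differ."""
    assert len(u) == len(v)
    return sum(a != b for a, b in zip(u, v))


def neighborhood(config, domains):
    """Yield every configuration that differs from `config` in exactly one position."""
    for i, domain in enumerate(domains):
        for value in domain:
            if value != config[i]:
                yield config[:i] + (value,) + config[i + 1:]


# Hypothetical domains: one transformation set, one tokenizer switch, one weighting.
domains = [("identity", "remove", "group"), (False, True), ("tf", "tfidf")]
config = ("identity", True, "tf")
neighbors = list(neighborhood(config, domains))
```

With this choice, the size of a neighborhood grows linearly with the number of parameters rather than exponentially, which keeps every local search step cheap.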

3.3 The score function

The score function evaluates the performance of the text classifier defined by a configuration $c$ with the given training and test sets. Without loss of generality, the evaluation of a configuration can be described by three main steps:

  1. The dataset $\mathcal{X}$ is divided into $\mathcal{X}_{\text{train}}$ and $\mathcal{X}_{\text{test}}$.

  2. The TC algorithm described by $c$ learns from $\mathcal{X}_{\text{train}}$.

  3. The prediction performance of $c$ is evaluated using the dataset $\mathcal{X}_{\text{test}}$; more details are given below.

These steps can be modified to support cross-validation schemes, like $k$-fold or bagging, which provide a more robust way to measure the performance of a classifier. The details of these measurement strategies are beyond the scope of this manuscript; the interested reader is referred to Ch. 9 of [24].
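The cross-validated variant of the three steps above can be sketched as follows. The helper names are hypothetical, folds are built by simple striding (no shuffling or stratification), and `score_fn` stands for any function that trains on the first argument and returns a performance value measured on the second.

```python
def k_fold_indices(n_items, k):
    """Yield (train_idx, test_idx) pairs; each item lands in exactly one test fold."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_val_score(score_fn, items, k):
    """Average score_fn(train, test) over the k train/test partitions."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(items), k):
        train = [items[j] for j in train_idx]
        test = [items[j] for j in test_idx]
        scores.append(score_fn(train, test))
    return sum(scores) / k


splits = list(k_fold_indices(10, 3))
```

Averaging over folds costs k trainings per configuration, so the choice of k trades robustness of the estimate against optimization time.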

Now, recall from §3.1 that a configuration $c$ contains the parameters for a number of functions that transform the input text into its associated label. Given a configuration $c$, with transformation function $T$, tokenizer function $G$, weighting function $h$, and classifier constructor $\psi$, a classifier is created using the labeled dataset by transforming all texts in the training set to their corresponding vector form, i.e., $h(G(T(x)))$ for $(x, y) \in \mathcal{X}_{\text{train}}$. Once the classifier is trained, the associated label for all $(x, y) \in \mathcal{X}_{\text{test}}$ is computed as the prediction of the classifier on $h(G(T(x)))$. Finally, the performance of $c$ is computed by comparing the predicted labels against the actual ones; a typical score function will use F1 (macro or micro), accuracy, precision, or recall to measure the quality of the text classifier.
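As a small illustration of the last step, the sketch below computes two of the performance measures mentioned above, macro-F1 and accuracy, from lists of actual and predicted labels. It is a plain reference implementation written for this example; classes with no true or predicted instances are assumed to contribute an F1 of zero.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels (illustrative only).
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]
```

Macro-F1 weighs every class equally, so it penalizes classifiers that ignore minority classes, whereas accuracy is dominated by the most frequent class.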

3.4 Optimization process

The core idea to solve the optimization problem is to navigate the graph using a combination of two meta-heuristics. In the following paragraphs, we briefly review the techniques we used to solve the combinatorial problem; a proper survey of the area is beyond the scope of this manuscript, and the interested reader is referred to [25, 26].

To keep TC within practical computational requirements, we select two types of fast meta-heuristics, the Random Search [27] and Hill Climbing [25, 26] algorithms. The former consists in selecting the best performing configuration among a set $\mathcal{S}$ randomly chosen from $\mathcal{C}$, that is,

$$\arg\max_{c \in \mathcal{S}} \mathsf{score}(c),$$

where the size of $\mathcal{S}$ is an open parameter dependent on the task. On the other hand, the core idea behind Hill Climbing is to explore the neighborhood $N(c)$ of an initial setup $c$ and then greedily update $c$ to be the best performing configuration in $N(c)$. The process is repeated until no improvement is possible, that is,

$$\mathsf{score}(c) \geq \max_{u \in N(c)} \mathsf{score}(u).$$

We improve the whole optimization process by applying a Hill Climbing procedure over the best configuration found by a Random Search. We also add memory to avoid evaluating a configuration twice (in principle, this is similar to Tabu search; however, our implementation is simpler than a typical implementation of Tabu search).
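The combination of Random Search, Hill Climbing, and memory can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `optimize` and the toy problem are hypothetical, configurations are hashable tuples, and the toy score has a single maximum so the outcome is easy to check.

```python
import random


def optimize(sample, neighbors, score, n_samples, seed=0):
    """Random Search followed by Hill Climbing, remembering every evaluated configuration."""
    rng = random.Random(seed)
    memory = {}  # configuration -> score, so nothing is evaluated twice

    def evaluate(config):
        if config not in memory:
            memory[config] = score(config)
        return memory[config]

    # Random Search: keep the best of n_samples randomly chosen configurations.
    current = max((sample(rng) for _ in range(n_samples)), key=evaluate)

    # Hill Climbing: move to the best neighbor until no neighbor improves the score.
    while True:
        best = max(neighbors(current), key=evaluate)
        if evaluate(best) <= evaluate(current):
            return current, memory
        current = best


# Toy problem (illustrative only): three parameters with five values each.
DOMAIN = range(5)
TARGET = (1, 3, 2)


def toy_score(config):
    return -sum((a - b) ** 2 for a, b in zip(config, TARGET))


def toy_sample(rng):
    return tuple(rng.choice(DOMAIN) for _ in TARGET)


def toy_neighbors(config):
    return [config[:i] + (value,) + config[i + 1:]
            for i in range(len(config)) for value in DOMAIN if value != config[i]]


best, memory = optimize(toy_sample, toy_neighbors, toy_score, n_samples=8)
```

In a real setting `score` would train and evaluate a classifier, so the memory dictionary is what saves most of the time: every entry stands for one avoided training run whenever a configuration is revisited.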

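Summarizing, the optimization process is driven by the tuple $(\mathcal{C}, \mathcal{X}, \mathsf{score}, \mathsf{opt})$, where i) $\mathcal{C}$ is the TC configuration space, ii) $\mathcal{X}$ is the training set of labeled texts, iii) $\mathsf{score}$ is the function to be maximized, and finally, iv) $\mathsf{opt}$ is a combinatorial optimization algorithm that uses $\mathcal{C}$ and $\mathsf{score}$ to find an almost optimal configuration in $\mathcal{C}$.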

4 Experimental setup

This section describes the general setup used to characterize and compare our method with the related state-of-the-art. In particular, we define the set of functions used to create our TC space; and also, we detail the benchmarks used in the comparison.

All the experiments were run on an Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz with 32 threads and 192 GiB of RAM running CentOS 7.1 Linux. We implemented TC in Python; it is available under the Apache 2 license. To characterize the performance of TC and compare it to the relevant state of the art, we selected a number of popular benchmarks in the literature; these datasets are described below. It is worth mentioning that we biased our selection toward benchmarks coming from popular international challenges. With the purpose of avoiding over-fitting, we performed the model selection using score as a $k$-fold cross-validation of the specified performance measure, see Table 1. We decided to use cross-validation for this stage because we observed over-fitting for small datasets, like those found in authorship attribution, when we used a static train-test partition to perform model selection. A brief experimental study of the effect of the validation schemes is presented in §5.2.

4.1 About our particular TC space

As stated before, TC is a framework to create text classifiers by searching for the best models in a configuration space. This space can be adjusted for any particular problem, but here we consider a space general enough to match a diverse set of benchmarks (listed below in §4.3).

When the knowledge about the domain is low, a large and generic configuration space should be used. It could be tempting to learn about the domain using the information found by the optimization process; this is clearly possible. However, it should be taken into account that the search process makes decisions to match the particular dataset, not the domain, and any generalization of the knowledge must be curated by an expert in the domain. It is important to mention that large configuration spaces consume a lot of computational time to be optimized.

On the other hand, a hand-crafted configuration space for a given problem can yield very fast processing times; however, a vast knowledge of the domain is required to reach this state. In this case, we give up what a more general configuration space can provide: the possibility of discovering new knowledge about the domain and of taking advantage of the particularities of the dataset.

To tackle the disparate list of benchmarks, we select a generic, large configuration space, defined in the following paragraphs.

Preprocessing functions

We associate the space of transformation functions with the following function sets.


This function set handles hash tags: the idea is to allow removing all hash tags or grouping them into a single tag; the identity function leaves the text unmodified. The format of a hash tag is the one introduced by Twitter, now popular across many data sources.


This function set contains functions to remove, group, or leave untouched numbers in the text.


This function set contains functions to remove, group, or leave untouched numbers in the text.


This function set contains functions to remove, group, or leave untouched users and host domains in the text. The pattern being tackled is @user; this is a popular way to denote users in several social networks. The pattern also naturally matches the domain part of email addresses.


This function set contains functions to remove, or leave untouched, diacritic symbols in the text. The objective is to reduce composed symbols like á, ä, ã, â, or à to simply a. This is a well-known source of errors in informal text written in languages that make heavy use of diacritic symbols.


This function set contains functions to remove, or leave untouched, duplicated contiguous symbols in the text.


This function set contains functions to remove, or leave untouched, duplicated punctuation symbols in the text. The list of punctuation symbols includes several symbols like ; : . - ’ " ( ) [ ] { } < > ? ! among others.


This function set contains functions to normalize the case of the text or leave it untouched.
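Several of the function sets described above can be sketched with a few regular expressions. The code below is only an illustration: the regular expressions, the grouping tags `_htag` and `_usr`, the configuration keys, and the order in which the functions are applied are all assumptions made for this example.

```python
import re
import unicodedata


def handle_pattern(text, pattern, tag, mode):
    """mode is 'identity', 'remove', or 'group' (replace every match by a single tag)."""
    if mode == "remove":
        return re.sub(pattern, "", text)
    if mode == "group":
        return re.sub(pattern, tag, text)
    return text


def remove_diacritics(text):
    """Decompose each symbol and drop its combining marks, e.g. á -> a."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def remove_duplicates(text):
    """Collapse runs of the same contiguous symbol into a single symbol."""
    return re.sub(r"(.)\1+", r"\1", text)


def normalize(text, config):
    """Apply the functions selected by `config`, a dict with hypothetical keys."""
    text = handle_pattern(text, r"#\w+", "_htag", config["hashtags"])
    text = handle_pattern(text, r"@\w+", "_usr", config["users"])
    if config["del_diacritics"]:
        text = remove_diacritics(text)
    if config["del_duplicates"]:
        text = remove_duplicates(text)
    if config["lowercase"]:
        text = text.lower()
    return text


config = {"hashtags": "group", "users": "remove", "del_diacritics": True,
          "del_duplicates": True, "lowercase": True}
result = normalize("@Ana Qué goool #Futbol", config)
```

Note that this version of the duplicate removal also collapses legitimate double letters (for instance, "hello" becomes "helo"), and removing combining marks also turns ñ into n; whether such lossy choices help is precisely what the search over configurations decides for each dataset.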

The list of tokenizers

After all text normalization and transformation, a list of tokens should be extracted. We use three schemes for our tokenizers.

Word n-grams.

This family of tokenizers first tokenizes the text into words, and then produces tokens of $n$ words, i.e., word $n$-grams. An $n$-gram is a string of $n$ consecutive words. For example, “The red car is in front of the tree” creates the following 3-grams: The red car, red car is, car is in, is in front, in front of, front of the, of the tree.

Character n-grams.

This family of tokenizers does not assume anything about the text and splits the input text into all $n$-sized substrings, i.e., $m - n + 1$ substrings of $n$ characters for a text of $m$ characters. For example, the character 4-grams of “I like the red car” are I_li, _lik, like, ike_, ke_t, e_th, _the, the_, he_r, e_re, _red, red_, ed_c, d_ca, _car. We use the symbol _ to denote the space symbol.


Skip-grams.

Skip-grams are similar to word n-grams but allow skipping the middle parts. For example, the skip-grams (two words, skipping one in the middle) of the previous example are I-the, like-red, the-car. The idea behind this family of tokenizers is to capture the occurrence of related words that are separated by some unrelated words.

Instead of selecting one tokenizer scheme or another, we allow selecting any subset of the available tokenizers and perform the union of the resulting multisets of tokens. In particular, our configuration space considers three word n-gram tokenizers, nine character n-gram tokenizers, and three skip-gram tokenizers.
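The three tokenizer families and the union of their outputs can be sketched as follows. The function names are hypothetical, words are obtained by splitting on whitespace, and the union of multisets is implemented by adding token counts; the examples reuse the sentences given above.

```python
from collections import Counter


def word_ngrams(text, n):
    """All sequences of n consecutive words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n):
    """All substrings of n consecutive characters, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def skip_grams(text, n_words, skip):
    """Tokens of n_words words, skipping `skip` words between consecutive ones."""
    words = text.split()
    span = (n_words - 1) * (skip + 1) + 1  # number of words covered by one token
    return ["-".join(words[i:i + span:skip + 1]) for i in range(len(words) - span + 1)]


def bag_of_tokens(text, tokenizers):
    """Multiset obtained by joining the output of every selected tokenizer."""
    bag = Counter()
    for tokenize in tokenizers:
        bag.update(tokenize(text))
    return bag


trigrams = word_ngrams("The red car is in front of the tree", 3)
four_grams = char_ngrams("I like the red car", 4)
skips = skip_grams("I like the red car", 2, 1)
bag = bag_of_tokens("I like the red car",
                    [lambda t: word_ngrams(t, 1), lambda t: skip_grams(t, 2, 1)])
```

Because the bag simply accumulates tokens, enabling or disabling a tokenizer is a local change to the configuration, which fits the neighborhood-based search.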

Weighting schemes

After we obtain a multiset (bag of tokens) from the tokenizers, we must create a vector space. We selected a small set of frequency filters and the TFIDF scheme to weight the coordinates of the vector. First, we apply a sequential list of filters, max-filter and min-filter; then, we select either the term frequency (TF) or the TFIDF as weight. For the max-filter, we delete all tokens surpassing a frequency threshold defined relative to max-freq, the maximum frequency of a token in the collection; we consider four such filters. For the min-filter, we delete all tokens not reaching a given frequency threshold. Notice that, for both filters, one of the considered thresholds does not perform any filtering. In total, we have embedded 32 different configurations for weighting.
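The filtering step can be sketched as below. The concrete thresholds are not reproduced here, so `max_ratio` and `min_count` are hypothetical parameters: the sketch assumes the max-filter is a fraction of max-freq, the min-filter is an absolute count, and frequencies are counted over the whole collection.

```python
from collections import Counter


def filter_vocabulary(bags, max_ratio, min_count):
    """Keep tokens whose collection frequency lies in [min_count, max_ratio * max_freq]."""
    frequency = Counter()
    for bag in bags:
        frequency.update(bag)
    max_freq = max(frequency.values())
    return {token for token, freq in frequency.items()
            if min_count <= freq <= max_ratio * max_freq}


# Toy bags of tokens (illustrative only).
bags = [Counter({"the": 5, "car": 2, "rare": 1}), Counter({"the": 5, "car": 1})]
filtered = filter_vocabulary(bags, max_ratio=0.9, min_count=2)
unfiltered = filter_vocabulary(bags, max_ratio=1.0, min_count=1)
```

Setting `max_ratio=1.0` and `min_count=1` disables both filters, which mirrors the remark above that some thresholds perform no filtering at all.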


For the set of classifiers, we decided to use a singleton set populated with an SVM with a linear kernel. It is well known that SVM performs excellently for very high dimensional input (which is our case), and the linear kernel also performs well under these conditions. We do not optimize the parameters of the classifier since we are mainly interested in the rest of the process. We use the SVM classifier from liblinear, Fan et al. [28].

On the final configuration space

The task of finding the best model in the space of configurations is hard. The number of possible configurations of the preprocessing functions is $3^4 \times 2^4 = 1296$ (i.e., four trivalent function sets and four bivalent function sets). From the above configuration, the number of possible tokenizers is 81; also, we have 32 different weighting combinations. So, the configuration space contains more than 3.3 million configurations. As a reference, a configuration needs close to ten minutes to be evaluated on a sentiment analysis benchmark with ten thousand tweets. Therefore, an exhaustive evaluation of the configuration space would need up to 64 years. Even on a large distributed cluster, the process needs too much time to complete, and such computing power is not easily accessible. Nonetheless, if we soften the problem to finding not the best model but an excellent one, we can use an algorithm for combinatorial optimization, as explained in §3.
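The figures quoted above can be checked with a few lines of arithmetic. The counts (four trivalent and four bivalent function sets, 81 tokenizer combinations, 32 weighting combinations, ten minutes per evaluation) are taken from the text; the variable names and the 365-day year are choices made for this example.

```python
# Preprocessing: four trivalent and four bivalent function sets.
n_preprocessing = 3 ** 4 * 2 ** 4
n_tokenizers = 81
n_weightings = 32

n_configurations = n_preprocessing * n_tokenizers * n_weightings

# Exhaustive evaluation at about ten minutes per configuration.
minutes_per_configuration = 10
total_minutes = n_configurations * minutes_per_configuration
years = total_minutes / (60 * 24 * 365)
```

The product gives 3,359,232 configurations and slightly under 64 years of sequential computation, in agreement with the estimates stated in the text.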

4.2 On the preparation of the input text

Since TC considers the preprocessing step among its parts, we tried to collect all datasets in raw text, without any kind of preprocessing transformations. This was not possible in the general case, mostly due to the aging of datasets; we consider the following text preparation states, in the style of Cachopo [7]:

  • the raw text corresponds to the original, non-formatted text

  • the all-terms dataset converts all text into lowercase; also, all diacritic symbols and punctuation marks are removed, and all spacing symbols are normalized to a single space

  • the no-short dataset removes all terms having three or fewer characters

  • the no-stopwords dataset also removes all non-discriminant words for English (adjectives, adverbs, conjunctions, articles, etc.)

  • finally, after the previous steps, all words are transformed by Porter’s stemmer for English [29] to generate the stemmed dataset.

In particular, we use the all-terms version for R8, R10, R52, and WebKB; for CADE, we use the stemmed version. In these cases, we used the datasets prepared by Cachopo [7]. In all other cases, we use the raw text. The effect of using one state or another is studied in Section 5.1.

4.3 Benchmark description

The text classification literature has a myriad of datasets, performance measures, and validation schemes. We select several prominent and popular benchmark configurations in the literature; in particular, we work with topic classification, spam identification, author profiling, authorship attribution, and sentiment analysis. To avoid implementation mistakes, we directly use the performances reported in the literature; consequently, we are restricted to comparing under the same circumstances. Table 1 describes the language and number of classes of each dataset; it also describes the kind of validation. In particular, we consider two validation schemes: i) 10-fold cross-validation, and ii) a static train-test partition of the specified sizes. The diversity of benchmarks and validation schemes helps us to prove the functionality of our approach in many circumstances.

name          language      total docs   train / test         #classes   measure
Topic Classification
R8            English       7,674        70% / 30%            8          macro-F1
R10           English       8,008        70% / 30%            10         macro-F1
R52           English       9,100        70% / 30%            52         macro-F1
News-4        English       13,919       70% / 30%            4          macro-F1
News-20       English       20,000       70% / 30%            20         macro-F1
WebKB         English       4,199        70% / 30%            4          macro-F1
CADE          Portuguese    40,983       70% / 30%            12         macro-F1
Spam Identification
Ling-Spam     English       2,893        10-fold              2          macro-F1
PUA           English       1,142        10-fold              2          macro-F1
PU1           English       1,099        10-fold              2          macro-F1
PU2           English       721          10-fold              2          macro-F1
PU3           mixed         4,139        10-fold              2          macro-F1
Author Profiling: PAN'13, Gender & Age group
PAN'13        English       242,040      236,600 / 25,440     2 & 3      accuracy
PAN'13        Spanish       84,060       75,900 / 8,160       2 & 3      accuracy
Author Profiling: PAN'17, Gender & Language Variety
PAN'17        Arabic        -            2,400 / -            2 & 4      accuracy
PAN'17        English       -            3,600 / -            2 & 6      accuracy
PAN'17        Spanish       -            4,200 / -            2 & 7      accuracy
PAN'17        Portuguese    -            1,200 / -            2 & 2      accuracy
Authorship Attribution
CCA           English       1,000        500 / 500            10         macro-F1
NFL           English       97           52 / 42              3          macro-F1
Business      English       175          85 / 90              6          macro-F1
Poetry        English       200          145 / 55             6          macro-F1
Travel        English       172          112 / 60             4          macro-F1
Cricket       English       158          98 / 60              4          macro-F1
Multilingual Sentiment Analysis
Arabic        Arabic        2,000        10-fold              3          macro-F1
German        German        91,502       10-fold              3          macro-F1
Portuguese    Portuguese    86,062       10-fold              3          macro-F1
Russian       Russian       69,100       10-fold              3          macro-F1
Spanish       Spanish       19,767       10-fold              3          macro-F1
Swedish       Swedish       49,255       10-fold              3          macro-F1
Table notes: (i) these datasets are encoded in a way that the original text is lost; however, the encoding preserves the document's distribution. (ii) Here, the documents are Twitter profiles; each user is described by 100-300 single entries, for a total of 1,265,898 tweets for all languages in the training set.
Table 1: Description of the benchmarks and their associated performance measures

The Reuters-21578 collection is one of the most used collections for text categorization research. The documents were manually labeled by personnel from Reuters Ltd. The 20Newsgroup dataset is very popular in the text classification area; it contains news related to different topics, originally collected by Ken Lang. The WebKB dataset contains university webpages classified into seven different categories: student, faculty, staff, department, course, project and other. We use the four most popular classes in our experiments. The CADE dataset [7] is another collection of webpages, specifically Brazilian webpages classified by human experts. This collection contains a total of 12 classes, e.g., services, sports, science, education, and news, among others. The PU collection [9] consists of emails written in English and other languages, classified as spam and non-spam messages; it contains the following datasets: PUA, PU1, PU2 and PU3. The Ling-Spam dataset [30] is also a spam dataset. The PAN contest [20, 21] has several tasks, among them author identification and author profiling. The author profiling task is a forensic linguistics problem that consists of detecting the gender and age of the author (PAN'13). For PAN'17, the age identification task was replaced by the task of determining the language variety of the writer; also, the number of different languages was increased to four. As listed in Table 1, the official dataset is undisclosed, and each algorithm must be evaluated with the TIRA evaluation platform. The Authorship Attribution datasets [13] cover different types of topics: CCA, NFL, Business, Poetry, Travel and Cricket. The objective of these datasets is to determine the authorship of each document. The Multilingual Sentiment Analysis datasets are sets of tweets in different languages: Arabic, German, Portuguese, Russian, Swedish and Spanish. The purpose of these datasets is to classify each tweet as having negative, neutral, or positive polarity.

A detailed description of all these datasets is provided in Table 1, which lists particularities of each dataset such as the written language, the number of documents, the kind of evaluation (independent train-test sets or k-folds), the number of classes, and the performance measure optimized by TC.
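To make the two performance measures of Table 1 concrete, the following minimal sketch computes accuracy and macro-F1 from gold and predicted labels. It assumes the usual definitions (per-class F1 averaged without weighting, over the union of gold and predicted labels); the function names and the toy labels are illustrative and do not come from the benchmarks.

```python
def accuracy(gold, pred):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A classifier that always answers "ham" on an unbalanced toy sample.
gold = ["spam", "ham", "ham", "ham"]
pred = ["ham", "ham", "ham", "ham"]
print(accuracy(gold, pred))  # 0.75
print(macro_f1(gold, pred))  # (6/7 + 0) / 2 = 3/7
```

In this toy case accuracy stays high while macro-F1 drops, because the minority class is never predicted; this is one reason the two measures can rank classifiers differently on unbalanced data.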

5 Experimental Results

This section is dedicated to comparing our work with the relevant state-of-the-art methods described above. Also, we characterize the generalization power in terms of the validation scheme.

The first task analyzed is authorship attribution; Table 2 shows the macro-F1 and accuracy performances for a set of authorship attribution benchmarks. Here, we compare TC with two term-weighting schemes, [13] and [12]. The pre-processing stage of TC's input is all-terms; the others use the stemmed stage. The best performing classifiers are created by TC, except for NFL, where the alternatives perform better. In the case of Business, Escalante et al. [13] perform slightly better only in terms of accuracy. Please notice that NFL and Business are among the smallest datasets we tested; the low performance of TC may be caused by the low number of exemplars, while alternative schemes take advantage of the few samples to compute better weights.

macro-F1
Dataset     Cummins [12, 13]   Escalante [13]   TC
CCA         0.0182             0.7032           0.7633
NFL         0.7654             0.7637           0.7422
Business    0.7548             0.7808           0.8199
Poetry      0.4489             0.7003           0.7135
Travel      0.6758             0.7392           0.8621
Cricket     0.9170             0.8810           0.9665

Accuracy
Dataset     Cummins [12, 13]   Escalante [13]   TC
CCA         0.1000             0.7372           0.7660
NFL         0.7778             0.8376           0.7555
Business    0.7556             0.8358           0.8222
Poetry      0.5636             0.7405           0.7272
Travel      0.6833             0.7845           0.8667
Cricket     0.9167             0.9206           0.9667
Table 2: Authorship Attribution Data sets.

Table 3 presents the results of the PAN'13 competition. According to the contest report [20], the best results were achieved by Pastor, Santosh, and Meina. In this benchmark, TC produces the best result in all average cases. In a fine-grained comparison, only Meina surpasses TC, on gender identification for English.

Method      Task     English   Spanish   Avg.
TC          Age      0.6605    0.6897    0.6751
TC          Gender   0.5867    0.6750    0.6309
TC          Joint    0.3946    0.4587    0.4267
Pastor L.   Age      0.6572    0.6558    0.6565
Pastor L.   Gender   0.5690    0.6299    0.5995
Pastor L.   Joint    0.3813    0.4158    0.3985
Santosh     Age      0.6408    0.6430    0.6419
Santosh     Gender   0.5652    0.6473    0.6063
Santosh     Joint    0.3508    0.4208    0.3858
Meina       Age      0.6491    0.4930    0.5711
Meina       Gender   0.5921    0.5287    0.5604
Meina       Joint    0.3894    0.2549    0.3222
Table 3: Performance of TC on the author profiling task of the PAN’13 competition; all values are the accuracy score in the specified subtask.

Table 4 shows the performance of TC in the PAN'17 benchmark. The table also lists the best three results of the challenge, reported as statistically equivalent in [21]; these works are detailed in §2. Please note that the result by Tellez et al. [31] was generated with TC but using a special term-weighting scheme based on entropy instead of TFIDF (or TF). The details of the entropy-based term-weighting scheme are beyond the scope of this contribution; the interested reader is referred to [31]. The plain TC, as described in this manuscript, achieves average accuracies of 0.7880 and 0.8849 for gender and variety identification, respectively. The joint prediction of both classes achieves an accuracy of 0.7038. These scores place the plain TC in the eighth position of the official ranking; see [21].

Method                 Task      Arabic   English   Spanish   Portuguese   Avg.
TC                     Gender    0.7569   0.7938    0.7975    0.8038       0.7880
TC                     Variety   0.7894   0.8388    0.9364    0.9750       0.8849
TC                     Joint     0.6081   0.6704    0.7518    0.7850       0.7038
Basile et al. [22]     Gender    0.8006   0.8233    0.8321    0.8450       0.8253
Basile et al. [22]     Variety   0.8313   0.8988    0.9621    0.9813       0.9184
Basile et al. [22]     Joint     0.6831   0.7429    0.8036    0.8288       0.7646
Martinc et al. [23]    Gender    0.8031   0.8071    0.8193    0.8600       0.8224
Martinc et al. [23]    Variety   0.8288   0.8688    0.9525    0.9838       0.9085
Martinc et al. [23]    Joint     0.6825   0.7042    0.7850    0.8463       0.7545
Tellez et al. [31]     Gender    0.7838   0.8054    0.7957    0.8538       0.8097
Tellez et al. [31]     Variety   0.8275   0.9004    0.9554    0.9850       0.9171
Tellez et al. [31]     Joint     0.6713   0.7267    0.7621    0.8425       0.7507
Table 4: Author profiling: PAN2017 benchmark [21], all methods were scored with the official gold-standard. All scores are based on the accuracy computation over the specified subset of items.
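As a reading aid for the Joint rows and the Avg. column of Table 4, the sketch below assumes that a joint prediction counts as correct only when both the gender and the variety of an author are right, and that Avg. is the unweighted mean over the four languages. The function name and the toy labels are hypothetical; only the four gender accuracies of TC are taken from Table 4.

```python
def joint_accuracy(gold_gender, pred_gender, gold_variety, pred_variety):
    """Fraction of authors for which BOTH labels are predicted correctly."""
    rows = zip(gold_gender, pred_gender, gold_variety, pred_variety)
    hits = sum(gg == pg and gv == pv for gg, pg, gv, pv in rows)
    return hits / len(gold_gender)


# Toy example: four authors, each single task is 75% accurate.
gold_gender = ["f", "m", "f", "m"]
pred_gender = ["f", "m", "m", "m"]        # third author wrong
gold_variety = ["mx", "es", "mx", "ar"]
pred_variety = ["mx", "es", "mx", "es"]   # fourth author wrong
joint = joint_accuracy(gold_gender, pred_gender, gold_variety, pred_variety)
print(joint)  # 0.5, lower than either single-task accuracy

# Avg. column check for the TC Gender row of Table 4.
tc_gender = [0.7569, 0.7938, 0.7975, 0.8038]  # Arabic, English, Spanish, Portuguese
avg_gender = sum(tc_gender) / len(tc_gender)
print(round(avg_gender, 4))  # 0.788
```

Under this assumption the joint score can never exceed the lower of the two single-task scores, which matches the pattern seen in every Joint row of the table.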

Table 5 reports the performance on topic classification benchmarks. These experiments considered several news datasets (please refer to Table 1 for the detailed description of each benchmark). Our approach, TC, reaches the best results in most of the datasets, with the exception of News-20 and News-4, where TC reaches the second and third best performance, respectively.

macro-F1
Method              Reuters-8C   Reuters-10C   Reuters-52C   News-4C   News-20C   WebKB    CADE
Debole [10]         -            -             -             -         -          -        -
Escalante [13]      0.9135       0.9184        -             -         0.6797     0.8879   0.4103
Cummins [12, 13]    0.8830       0.8759        -             -         0.6645     0.7197   -
Lai CNN [14]        -            -             -             0.9479    -          -        -
Lai RNN [14]        -            -             -             0.9649    -          -        -
Hingmire [11]       -            -             -             -         -          0.7190   -
Cachopo [7]         -            -             -             -         -          -        -
TC                  0.9698       0.9662        0.6746        0.9432    0.8269     0.9098   0.5687

accuracy
Method              Reuters-8C   Reuters-10C   Reuters-52C   News-4C   News-20C   WebKB    CADE
Debole [10]         -            0.7040        -             -         -          -        -
Escalante [13]      0.9056       0.8821        -             -         0.6623     0.8912   0.5380
Cummins [12, 13]    0.7440       0.7659        -             -         0.6578     0.7542   -
Lai CNN [14]        -            -             -             -         -          -        -
Lai RNN [14]        -            -             -             -         -          -        -
Hingmire [11]       -            -             -             0.9360    -          -        -
Cachopo [7]         0.9049       -             0.8482        -         0.8460     0.8300   0.5071
TC                  0.9214       0.9236        0.9376        0.9390    0.8348     0.9191   0.6174
Table 5: Topic Classification Datasets

In the sentiment analysis task, we compared against the results on the datasets reported in [32, 33]. Moreover, we report the results obtained with the B4MSA approach [34]. B4MSA is a method for multilingual polarity classification, considered as a baseline to build more complex approaches. It is important to note that, from each dataset reported in [32, 33], both approaches, B4MSA and TC, use the subset specified in Table 6; e.g., for Arabic we used 100% of the dataset, for German we used 80%, and so on (all specified in the table).

In Table 6, it can be seen that best results were obtained with B4MSA and TC in all the cases, and both results are very close.

Language      Method                 macro-F1   accuracy
Arabic        Salameh et al. [32]    -          0.787
Arabic        Saif et al. [33]       -          0.794
Arabic        B4MSA (100%)           0.642      0.799
Arabic        TC (100%)              0.641      0.792
German        Mozetič et al. [18]    -          0.610
German        B4MSA (89%)            0.621      0.668
German        TC (89%)               0.614      0.672
Portuguese    Mozetič et al. [18]    -          0.507
Portuguese    B4MSA (58%)            0.557      0.561
Portuguese    TC (58%)               0.562      0.566
Russian       Mozetič et al. [18]    -          0.603
Russian       B4MSA (69%)            0.754      0.750
Russian       TC (69%)               0.754      0.751
Swedish       Mozetič et al. [18]    -          0.616
Swedish       B4MSA (93%)            0.680      0.691
Swedish       TC (93%)               0.679      0.688
Spanish       B4MSA                  0.657      0.784
Spanish       TC                     0.649      0.780
Table 6: Multilingual sentiment analysis

Finally, Table 7 shows the results of spam classification task. Here, it can be seen that best results in the macro-F1 measure were obtained with our approach TC; nevertheless, the best results in the accuracy score were achieved by Androutsopoulos et al. [35] except in Ling-Spam dataset where TC reached the best performance.

macro-F1
Data set     Androutsopoulos [35]   Sakkis [36]   Cheng [37]   TC
Ling-Spam    -                      0.8957        0.9870       0.9979
PUA          0.8897                 -             -            0.9478
PU1          0.9149                 -             0.983        0.9664
PU2          0.6794                 -             -            0.9044
PU3          0.9265                 -             0.977        0.9701

accuracy
Data set     Androutsopoulos [8]    Sakkis [36]   Cheng [37]   TC
Ling-Spam    -                      -             0.9800       0.9993
PUA          0.9600                 -             -            0.9482
PU1          0.9750                 -             0.971        0.9706
PU2          0.9839                 -             -            0.9634
PU3          0.9778                 -             0.968        0.9738
Table 7: Spam classification

5.1 About the pre-processing state of the input text

kind of preprocessing   actual accuracy   actual macro-F1   predicted accuracy   predicted macro-F1
raw                     0.8265            0.8199            0.8968               0.8963
all-terms               0.8340            0.8260            0.9075               0.9056
no-short                0.8310            0.8235            0.9052               0.9034
no-stopwords            0.8373            0.8300            0.9099               0.9082
stemmed                 0.8413            0.8344            0.9071               0.9058
Table 8: The performance of TC for text collections being in different stages of text normalization for News benchmark.

Here, the pre-processing step is analyzed. Table 8 shows the performance on the News benchmark at various stages of the normalization process, each used as input for TC. We found that TC achieves high performance without additional sophisticated pre-processing steps, almost all of which are language dependent. For instance, according to Table 8, the actual accuracy using the raw text (0.8265) is only about 0.015 points below that of the stemmed collection (0.8413). TC therefore barely needs human intervention to prepare the input text, and skipping it does not significantly reduce the performance in practice. In contrast, methods like Escalante et al. [13] and Cachopo [7] need the stemmed version of the dataset to achieve their optimal performance; for their accuracy values, see Table 5.

5.2 On the robustness of the score function

The score function guides the model selection procedure to fulfill the requirements of the task. In this process, it is necessary to determine precisely which quality measure is needed, e.g., macro-F1 or accuracy. As with any learning algorithm, it is necessary to protect the score with some validation scheme to avoid latent overfitting. On this matter, we consider two validation schemes: i) stratified k-folds and ii) a random binary partition of the (training) collection into a train set and a test set.
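The two validation schemes can be sketched with index-based splitters. This is a minimal illustration under stated assumptions, not TC's implementation: the function names are hypothetical, and `test_fraction` is a stand-in parameter for the size of the test side of the binary partition.

```python
import random
from collections import defaultdict


def stratified_kfolds(labels, k, seed=0):
    """Yield k (train_idx, test_idx) pairs, spreading each class over the folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)  # round-robin keeps class proportions similar
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def binary_partition(n, test_fraction, seed=0):
    """Return one random (train_idx, test_idx) split of range(n)."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = max(1, round(n * test_fraction))
    return sorted(idx[n_test:]), sorted(idx[:n_test])


labels = ["a"] * 6 + ["b"] * 4
for train, test in stratified_kfolds(labels, k=2):
    print(len(train), len(test))  # 5 5, each test fold holds 3 "a" and 2 "b"
train, test = binary_partition(len(labels), test_fraction=0.3)
print(len(train), len(test))      # 7 3
```

With k-folds every item is used for testing exactly once, whereas the binary partition scores a configuration on a single held-out subset, which is cheaper but depends more on the draw.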

Figure 1: The final performance on small datasets as a function of the validation stage of the score function of TC; we consider two validation schemes for this purpose: i) k-folds and ii) random binary partitions into training and testing subsets. Panels: (a) Authors NFL, k-folds; (b) Authors NFL, binary partition; (c) Authors Business, k-folds; (d) Authors Business, binary partition; (e) Authors Cricket, k-folds; (f) Authors Cricket, binary partition.
Figure 2: The final performance on medium sized datasets as a function of the validation stage of the score function of TC; we consider two validation schemes for this purpose: i) k-folds and ii) random binary partitions into training and testing subsets. Panels: (a) News, k-folds; (b) News, binary partition; (c) WebKB, k-folds; (d) WebKB, binary partition; (e) R52, k-folds; (f) R52, binary partition.

To learn how to choose the right criterion, we review both the predicted and the actual performance (macro-F1, for instance) of these two validation schemes. The predicted macro-F1 is the performance achieved by the model selection procedure using one of the two mentioned validation schemes. The actual performance is the one obtained by directly evaluating on the gold-standard collection.

Figure 1 shows the performance of TC on small datasets. The stability of k-folds in terms of predicted and actual performance is supported by Figures 1(a), 1(c) and 1(e). This is also true for larger datasets, like those depicted in Figures 2(a), 2(c) and 2(e). The figures show that TC achieves almost its optimal actual performance even for small values of k, even though the predicted performance is most of the time better for larger values. On the other hand, the binary partition method is prone to overfitting, especially on small datasets and with small test sets. For instance, Figure 1(b) shows the performance for NFL; please note how some partition sizes yield very competitive performances, i.e., higher than 0.9 for both macro-F1 and accuracy. These performances are considerably higher than those achieved by the alternatives (see Table 2); however, such settings can yield low actual performance, in contrast with the perfect predicted performance. A similar case happens for the Business dataset, Figure 1(d); but in this case, the actual performance is relatively stable. The binary partition is less prone to overfitting on larger datasets, as Figures 2(b) and 2(d) illustrate. Nonetheless, the case of R52, Figure 2(f), shows that the overfitting issue is still latent; however, it barely affects the actual performance, since the score function is applied to a large enough test set.

As a rule of thumb, it is safe to use k-fold cross-validation to compute the score in the model selection stage. We encourage the use of small k values (e.g., 2, 3 or 5), since the actual performance is relatively stable and the computational cost is kept low. Please notice that the k-folds procedure introduces a factor of k to the computational cost of the score, and algorithms that solve the underlying combinatorial optimization problem need to evaluate a considerable number of configurations to achieve good results. In cases where the number of samples is very large, or a rapid solution is required, the binary partition method is also a good choice, especially with large test sets. The latter setup yields robust score functions at the cost of reducing the train set in the model selection stage. The reduction of the training set is not a major problem for the actual performance, as illustrated by the experiments corresponding to binary partition performances; see Figures 1 and 2. Please remember that, at this stage, we are just selecting a proper configuration; in a subsequent step, the final model is computed using this configuration and the entire training dataset.
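The selection procedure described in this section can be sketched end to end: score each configuration with k-folds (the predicted performance), keep the best one, refit it on the entire training set, and only then measure the actual performance on held-out gold data. Everything here is an assumption made for illustration: the toy token-voting learner stands in for TC's classifier, the two-item configuration list is enumerated exhaustively instead of being searched, and folds are formed by index modulo k rather than stratified.

```python
from collections import Counter, defaultdict


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def fit(config, texts, labels):
    """Toy learner (hypothetical): vote by per-class token counts."""
    def tokens(t):
        return (t.lower() if config["lowercase"] else t).split()
    counts = defaultdict(Counter)
    for t, y in zip(texts, labels):
        counts[y].update(tokens(t))
    classes = sorted(counts)
    def predict(text):
        toks = tokens(text)
        return max(classes, key=lambda c: sum(counts[c][w] for w in toks))
    return predict


def kfold_score(config, X, y, k):
    """Predicted performance: mean accuracy over k folds (k fits per call)."""
    scores = []
    for f in range(k):
        train = [i for i in range(len(X)) if i % k != f]
        test = [i for i in range(len(X)) if i % k == f]
        predict = fit(config, [X[i] for i in train], [y[i] for i in train])
        scores.append(accuracy([y[i] for i in test], [predict(X[i]) for i in test]))
    return sum(scores) / k


def select_model(configs, X, y, k=3):
    scores = [kfold_score(c, X, y, k) for c in configs]
    best = max(range(len(configs)), key=scores.__getitem__)
    # The final model is trained with the chosen configuration on ALL of X.
    return configs[best], scores[best], fit(configs[best], X, y)


X = ["GOOD film", "bad film", "good plot", "BAD plot", "Good cast", "Bad cast"]
y = ["pos", "neg", "pos", "neg", "pos", "neg"]
configs = [{"lowercase": False}, {"lowercase": True}]
best, predicted, model = select_model(configs, X, y, k=3)

gold_X, gold_y = ["good FILM", "BAD cast"], ["pos", "neg"]
actual = accuracy(gold_y, [model(t) for t in gold_X])
print(best, predicted, actual)  # {'lowercase': True} 1.0 1.0
```

The cost structure discussed above is visible in `select_model`: each configuration triggers k fits inside `kfold_score`, so the total work grows with k times the number of configurations evaluated, plus one final fit.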

6 Conclusions

In this work, a minimalistic and global approach to text classification is proposed. Our approach was evaluated on a broad range of classification tasks such as topic classification, sentiment analysis, spam detection and user profiling; for this purpose, a total of 30 datasets related to these tasks were employed. To evaluate the performance of our approach, the results obtained in each task were compared to the state-of-the-art methods related to that task. Additionally, we analyzed the effect of the pre-processing stage. In this experiment, we observed that our approach is competitive with the alternative methods even when using the raw text as input, without a significant penalty in performance; therefore, it is possible to use TC to create text classifiers with little knowledge of natural language processing and machine learning techniques. We also studied some simple strategies to avoid the overfitting problem; we considered a k-fold cross-validation scheme and a binary partition to perform the model selection. Based on our experimental observations, TC can both properly fit the dataset and speed up the construction step by using small k values in cross-validation schemes and small training sets when using random binary partitions. We also found that k-folds can be the preferred validation scheme on small to medium sized datasets, while very large datasets can use the binary partition scheme without a significant reduction in performance, keeping the cost of the entire process low.


  • [1] E. S. Tellez, S. Miranda-Jiménez, M. Graff, D. Moctezuma, R. R. Suárez, O. S. Siordia, A simple approach to multilingual polarity classification in twitter, Pattern Recognition Letters.
  • [2] A. Khan, B. Baharudin, L. H. Lee, K. Khan, A review of machine learning algorithms for text-documents classification, Journal of advances in information technology 1 (1) (2010) 4–20.
  • [3] S. Wold, K. Esbensen, P. Geladi, Principal component analysis, Chemometrics and Intelligent Laboratory Systems 2 (1) (1987) 37 – 52.
  • [4] T. K. Landauer, P. W. Foltz, D. Laham, An introduction to latent semantic analysis, Discourse processes 25 (2-3) (1998) 259–284.
  • [5] J. J. Rocchio, Relevance feedback in information retrieval.
  • [6] T. Joachims, A probabilistic analysis of the rocchio algorithm with tfidf for text categorization., Tech. rep., DTIC Document (1996).
  • [7] A. Cardoso-Cachopo, Improving methods for single-label text categorization, Ph.D. thesis, Instituto Superior Técnico, Portugal (2007).
  • [8] I. Androutsopoulos, G. Paliouras, E. Michelakis, Learning to filter unsolicited commercial e-mail, “DEMOKRITOS”, National Center for Scientific Research, 2004.
  • [9] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, C. D. Spyropoulos, An experimental comparison of naive bayesian and keyword-based anti-spam filtering with personal e-mail messages, in: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’00, ACM, New York, NY, USA, 2000, pp. 160–167.
  • [10] F. Debole, F. Sebastiani, Supervised Term Weighting for Automated Text Categorization, Springer Berlin Heidelberg, Berlin, Heidelberg, 2004, pp. 81–97.
  • [11] S. Hingmire, S. Chougule, G. K. Palshikar, S. Chakraborti, Document classification by topic labeling, in: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, ACM, 2013, pp. 877–880.
  • [12] R. Cummins, C. O’Riordan, Evolving local and global weighting schemes in information retrieval, Information Retrieval 9 (3) (2006) 311–330.
  • [13] H. J. Escalante, M. A. Garcia-Limon, A. Morales-Reyes, M. Graff, M. M. y Gomez, E. F. Morales, J. Martinez-Carranza, Term-weighting learning via genetic programming for text classification, Knowledge-Based Systems 83 (2015) 176 – 189.
  • [14] S. Lai, L. Xu, K. Liu, J. Zhao, Recurrent convolutional neural networks for text classification., in: AAAI, 2015, pp. 2267–2273.
  • [15] J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representation, in: Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
  • [16] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in neural information processing systems, 2013, pp. 3111–3119.
  • [17] D. Vilares, C. Gómez-Rodríguez, M. A. Alonso, Universal, unsupervised (rule-based), uncovered sentiment analysis, Knowledge-Based Systems 118 (2017) 45 – 55.
  • [18] I. Mozetič, M. Grčar, J. Smailović, Multilingual twitter sentiment classification: The role of human annotators, PloS one 11 (5) (2016) e0155036.
  • [19] A. P. Lopez-Monroy, M. M. y Gomez, H. J. Escalante, L. Villaseñor Pineda, E. Stamatatos, Discriminative subprofile-specific representations for author profiling in social media, Knowledge-Based Systems 89 (2015) 134 – 147.
  • [20] F. Rangel, P. Rosso, M. Moshe Koppel, E. Stamatatos, G. Inches, Overview of the author profiling task at pan 2013, in: CLEF Conference on Multilingual and Multimodal Information Access Evaluation, CELCT, 2013, pp. 352–365.
  • [21] F. Rangel, P. Rosso, M. Potthast, B. Stein, Overview of the 5th author profiling task at pan 2017: Gender and language variety identification in twitter, in: CLEF, 2017.
  • [22] A. Basile, G. Dwyer, M. Medvedeva, J. Rawee, H. Haagsma, M. Nissim, N-gram: New groningen author-profiling model, in: CLEF, 2017.
  • [23] M. Martinc, I. Škrjanec, K. Zupan, S. Pollak, N-gram: New groningen author-profiling model, in: CLEF, 2017.
  • [24] D. L. Olson, D. Delen, Advanced Data Mining Techniques, 1st Edition, Springer Publishing Company, Incorporated, 2008.
  • [25] E. K. Burke, G. Kendall, et al., Search methodologies, Springer, 2005.
  • [26] R. Battiti, M. Brunato, F. Mascia, Reactive search and intelligent optimization, Vol. 45, Springer Science & Business Media, 2008.
  • [27] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization, Journal of Machine Learning Research 13 (Feb) (2012) 281–305.
  • [28] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, Liblinear: A library for large linear classification, Journal of machine learning research 9 (Aug) (2008) 1871–1874.
  • [29] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python, O’Reilly Media, 2009.
  • [30] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, C. D. Spyropoulos, An evaluation of naive bayesian anti-spam filtering, arXiv preprint cs/0006013.
  • [31] E. S. Tellez, S. Miranda-Jiménez, M. Graff, D. Moctezuma, Gender and language variety identification with microtc, PAN 2017, CLEF (Working Notes).
  • [32] M. Salameh, S. Mohammad, S. Kiritchenko, Sentiment after translation: A case-study on arabic social media posts, in: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Denver, Colorado, 2015, pp. 767–777.
  • [33] S. M. Mohammad, M. Salameh, S. Kiritchenko, How translation alters sentiment, Journal of Artificial Intelligence Research 55 (2016) 95–130.
  • [34] E. S. Tellez, S. M. Jiménez, M. Graff, D. Moctezuma, R. R. Suárez, O. S. Siordia, A simple approach to multilingual polarity classification in twitter, arXiv preprint arXiv:1612.05270.
  • [35] I. Androutsopoulos, G. Paliouras, E. Michelakis, Learning to filter unsolicited commercial e-mail (2004).
  • [36] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. D. Spyropoulos, P. Stamatopoulos, A memory-based approach to anti-spam filtering for mailing lists, Information Retrieval 6 (1) (2003) 49–73.
  • [37] C. H. Li, J. X. Huang, Spam filtering using semantic similarity approach and adaptive BPNN, Neurocomputing 92 (2012) 88 – 97, Data Mining Applications and Case Study.