Lexical Bias In Essay Level Prediction
Automatically predicting the level of non-native English speakers given their written essays is an interesting machine learning problem. In this work I present the system balikasg, which achieved the best performance among 14 systems in the CAp 2018 data science challenge. I detail the feature extraction, feature engineering and model selection steps, and I evaluate how these decisions impact the system's performance. The paper concludes with remarks for future work.
1 Introduction

Automatically predicting the level of English of non-native speakers from their written text is an interesting text mining task. Systems that perform well in the task can be useful components for online, second-language learning platforms as well as for organisations that tutor students for this purpose. In this paper I present the system balikasg, which achieved the best performance among 14 systems in the CAp 2018 data science challenge.
The rest of the paper is organized as follows: in Section 2 I frame the problem of language-level prediction as an ordinal classification problem and describe the available data. Section 3 presents the feature extraction and engineering techniques used. Section 4 describes the machine learning algorithms used for prediction as well as the achieved results. Finally, Section 5 concludes with a discussion and avenues for future research.
2 Problem Definition
In order to approach the language-level prediction task as a supervised classification problem, I frame it as an ordinal classification problem. In particular, given a written essay from a candidate, the goal is to associate the essay with the candidate's level of English according to the Common European Framework of Reference for Languages (CEFR). Under CEFR there are six language levels, {A1, A2, B1, B2, C1, C2}, such that A1 < A2 < B1 < B2 < C1 < C2. In this notation, A1 is the beginner level while C2 is the most advanced level. Notice that the levels are ordered, thus defining an ordinal classification problem. In this sense, care must be taken both during the phase of model selection and during the phase of evaluation. In the latter, predicting a class far from the true one should incur a higher penalty. In other words, given a C2 essay, predicting A1 is worse than predicting C1, and this difference must be captured by the evaluation metrics.
In order to capture this explicit ordering of the levels, the organisers proposed a cost measure that uses the confusion matrix of the predictions and prior knowledge in order to evaluate the performance of a system. In particular, the measure writes as:

cost = Σ_{i=1}^{6} Σ_{j=1}^{6} C_{i,j} · n_{i,j}    (1)
where C is a cost matrix that uses prior knowledge to weigh the misclassification errors and n_{i,j} is the number of observations of class i classified with category j. The cost matrix C is given in Table 1. Notice that, as expected, moving away from the diagonal (correct classifications) the misclassification costs are higher. The biggest cost (44) occurs when a C2 essay is classified as A1. On the contrary, the cost is lower (6) when the opposite happens and an A1 essay is classified as C2. Since C is not symmetric and the costs of the lower diagonal are higher, the penalties for misclassification are worse when essays of upper language levels (e.g., C2) are classified as essays of lower levels.
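The cost measure of Eq. 1 can be computed directly from a confusion matrix. Below is a minimal Python sketch; the toy cost matrix used in the example is hypothetical, while in the challenge the actual 6x6 values of Table 1 would be used:

```python
def challenge_cost(confusion, cost_matrix):
    """Sum C[i][j] * n[i][j] over all (true class i, predicted class j) pairs,
    where `confusion[i][j]` counts class-i observations predicted as class j."""
    return sum(
        cost_matrix[i][j] * confusion[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )

# Toy 2-class example: 1 off-diagonal error with cost 2 -> total cost 2.
confusion = [[5, 1],
             [0, 4]]
cost_matrix = [[0, 2],
               [6, 0]]
print(challenge_cost(confusion, cost_matrix))  # -> 2
```

Note how the asymmetry of the cost matrix (6 vs. 2 here) makes errors in one direction more expensive than the other, mirroring the lower-triangle penalties described above.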
The data used in this work were released in the framework of the CAp 2018 competition and are an excerpt of Geertzen et al. (2013); Huang et al. (2018). The competition's goal was to evaluate automated systems that perform well on the task. The organisers released two datasets: the training data and the test data. The test data became available only a few days before the end of the competition, without the associated language levels, and were used only for evaluation. The evaluation was performed by the organisers after the participants submitted their systems' predictions in a text file, as is frequently done in such competitions.
The training data consist of 27,310 essays while the test data contain 13,656 essays. Figure 1 illustrates the distribution of essays according to their level. The classification problem is unbalanced, as there are far more training examples for the first levels (e.g., A1, A2) than for the rest. The organisers announced that they performed a stratified split with respect to the level label, so we expect similar distributions for the test data.
The released data consist of the essay text as well as various numerical features. The numerical features are either statistics calculated on the essay text (length, number of sentences/syllables, etc.) or indexes that try to capture the readability and complexity of the essays, like the Coleman and Flesch families of indexes. Table 2 presents basic statistics that describe the essay text.
| Statistic                     | Value  |
|-------------------------------|--------|
| Number of essays              | 13,656 |
| Avg. essay length (words)     | 80.22  |
| Avg. essay length (sentences) | 6.75   |
3 Feature Extraction
In this section I present the extracted features, partitioned into six groups, and detail each of them separately.
Numerical features. Most of the features in this family were provided by the challenge organisers, calculated using v0.10-2 of the R koRpus package.
Language models. For each essay I calculated its probability under two language models: one trained with the essays belonging to the lower levels and another trained on the essays of the upper levels. For this purpose, I used trigram language models with modified Kneser-Ney smoothing and lower-order n-grams as a back-off mechanism Heafield et al. (2013), using the implementation of Heafield (2011). As language models can easily overfit when trained on small corpora, I decided to replace words with fewer than 10 occurrences by their part-of-speech (POS) tags and numbers by a special token. The hope is that the language models will capture the more complicated patterns we expect higher-level users to use. The free parameters concerning the use of language models, like the order of the n-grams and the decision whether to ignore or replace low-frequency words with their POS tags, were set using stratified 3-fold cross-validation on the training data. The same applies for every major decision in the feature extraction process, unless otherwise stated.
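The rare-word replacement step described above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's code: `pos_of` is a hypothetical lookup standing in for a real POS tagger such as SpaCy's:

```python
from collections import Counter

def replace_rare_words(tokenized_essays, pos_of, min_count=10):
    """Replace tokens occurring fewer than `min_count` times in the corpus
    by their POS tag, so the language model sees a denser vocabulary."""
    counts = Counter(tok for essay in tokenized_essays for tok in essay)
    return [
        [tok if counts[tok] >= min_count else pos_of(tok) for tok in essay]
        for essay in tokenized_essays
    ]

# Toy example: with min_count=2, "cat" and "dog" are rare and get replaced.
essays = [["the", "cat"], ["the", "dog"]]
print(replace_rare_words(essays, pos_of=lambda t: "NOUN", min_count=2))
# -> [['the', 'NOUN'], ['the', 'NOUN']]
```

The transformed corpus would then be fed to the KenLM toolkit for language-model estimation.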
Word embeddings. Word embeddings are dense word vectors that have been shown to capture the semantics of words. I represented a given essay using word clusters Balikas and Partalas (2018) calculated by applying k-Means (k = 1,000) on the ConceptNet embeddings Speer et al. (2017). Clustering the words of the corpus vocabulary generates semantically coherent clusters. In turn, to represent a document I used a binary one-hot-encoded representation of the clusters to which the essay's words belong. For instance, in our case, where each of the vocabulary words belongs to one of 1,000 clusters, an essay is represented by a 1,000-dimensional binary vector whose non-zero elements correspond to the clusters containing the essay's words.
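This cluster-based representation is straightforward once a word-to-cluster mapping exists. A minimal sketch, assuming the mapping has already been produced (in the paper, by k-Means over ConceptNet embeddings; here a tiny hand-made mapping is used for illustration):

```python
def cluster_one_hot(essay_tokens, word2cluster, n_clusters=1000):
    """Binary vector with a 1 at every cluster id that contains
    at least one of the essay's tokens; out-of-vocabulary tokens are skipped."""
    vec = [0] * n_clusters
    for tok in essay_tokens:
        if tok in word2cluster:
            vec[word2cluster[tok]] = 1
    return vec

# Toy example with 4 clusters; "xyz" is out of vocabulary and ignored.
word2cluster = {"cat": 0, "dog": 2}
print(cluster_one_hot(["cat", "dog", "xyz"], word2cluster, n_clusters=4))
# -> [1, 0, 1, 0]
```

Because the vector is binary, repeated words from the same cluster do not increase the feature value; only cluster membership is encoded.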
Topic models. Topic models are a class of unsupervised models. They are generative models in that they define a mechanism describing how a corpus of documents is generated. In this work I used Latent Dirichlet Allocation (LDA) Blei et al. (2003) in order to obtain dense document representations that describe the topics appearing in each document. During inference, these topics, which are multinomial distributions over the corpus vocabulary, are uncovered, and each document is represented as a mixture of them. I used a custom Python LDA implementation.
Part-of-speech tags. POS tags are informative text representations that can capture the complexity of the expressions used by an author. Intuitively, beginners use, for example, fewer adjectives and adverbs compared to more advanced users of a language. To capture this, I obtained the sequence of POS tags of an essay using SpaCy.
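Once the tag sequence is available, it can be turned into count features. The paper does not specify the exact encoding, so the following is only a plausible sketch that counts POS n-grams from an already-tagged sequence (the tagger itself, e.g. SpaCy, is assumed to have run beforehand):

```python
from collections import Counter

def pos_ngram_counts(pos_tags, n=2):
    """Count n-grams over a sequence of POS tags, e.g. ('DET', 'NOUN')."""
    return Counter(
        tuple(pos_tags[i:i + n]) for i in range(len(pos_tags) - n + 1)
    )

# Toy example over the tag sequence of "the cat sleeps".
print(pos_ngram_counts(["DET", "NOUN", "VERB"], n=2))
# -> Counter({('DET', 'NOUN'): 1, ('NOUN', 'VERB'): 1})
```

Ratios of tag counts (e.g. adjectives per sentence) could be derived from the same counters to capture the intuition about beginners using fewer modifiers.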
Bag-of-words. Last, I explicitly encoded the content of the essay using its bigram bag-of-words representation. In order to limit the effect of frequent terms like stopwords, I applied the idf weighting scheme.
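A library vectorizer (e.g. scikit-learn's TfidfVectorizer with ngram_range=(2, 2)) would normally produce this representation; the pure-Python sketch below makes the idf weighting explicit. The exact normalization used in the paper is not specified, so this uses the plain log(N/df) formulation:

```python
import math
from collections import Counter

def bigrams(tokens):
    """Consecutive token pairs of a tokenized essay."""
    return list(zip(tokens, tokens[1:]))

def idf_weighted_bow(corpus_tokens):
    """Map each essay to {bigram: tf * idf}, with idf = log(N / df)."""
    n_docs = len(corpus_tokens)
    df = Counter()
    for toks in corpus_tokens:
        df.update(set(bigrams(toks)))          # document frequency per bigram
    idf = {bg: math.log(n_docs / c) for bg, c in df.items()}
    return [
        {bg: tf * idf[bg] for bg, tf in Counter(bigrams(toks)).items()}
        for toks in corpus_tokens
    ]

# Toy corpus: ("a", "b") appears in both essays, so its idf (and weight) is 0.
reps = idf_weighted_bow([["a", "b", "c"], ["a", "b", "d"]])
print(reps[0])  # -> {('a', 'b'): 0.0, ('b', 'c'): 0.6931471805599453}
```

Bigrams shared by every essay are down-weighted to zero, which is exactly the stopword-dampening effect the idf scheme is used for here.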
4 Model Selection and Evaluation
As the class distribution in the training data is not balanced, I used stratified cross-validation for validation purposes and for hyper-parameter selection.
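A stratified split spreads each class's examples evenly across folds so that every fold preserves the overall label distribution. The sketch below is illustrative, not the code used in the system; in practice a library routine such as scikit-learn's StratifiedKFold serves the same purpose:

```python
from collections import defaultdict

def stratified_folds(labels, k=3):
    """Assign example indices to k folds, round-robin within each class,
    so each fold mirrors the corpus-level class proportions."""
    folds = [[] for _ in range(k)]
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# Toy imbalanced labels: 6 "A1" vs 3 "C2" essays; each fold gets 2 A1 + 1 C2.
labels = ["A1"] * 6 + ["C2"] * 3
for fold in stratified_folds(labels, k=3):
    print(sorted(fold))
```

With unbalanced CEFR levels, an unstratified split could leave a fold with almost no essays of a rare level, making the validation estimates unreliable.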
As a classification algorithm, I used gradient boosted trees trained with gradient-based one-side sampling, as implemented in the Light Gradient Boosting Machine (LightGBM) toolkit released by Microsoft.
Figure 2 illustrates the performance that the gradient boosted trees achieve on each of the feature sets. Complementarily, and for reference, Table 3 presents the scores of each feature set. Notice that the best performance is obtained when all features are used. Adding the per-document topic distributions inferred by the topic models seems to improve the results considerably.
| Feature set                   | Cost  |
|-------------------------------|-------|
| + Numerical Features          | 23.52 |
| + Language Models             | 14.00 |
| + Latent Dirichlet Allocation | 5.74  |
| + Part-Of-Speech tags         | 5.52  |
In order to better evaluate the effect of each family of features, without the bias introduced by the order in which feature families are added, Table 4 presents an ablation study. The table presents the scores achieved when using all features as well as the scores achieved when removing a particular family of features; in this sense, one can estimate the added value of each family. From the table we notice that the most effective features are the numerical features, the document-topic distributions learned with LDA, and the scores from the language models. Removing the numerical features causes the biggest reduction in performance. For reference, the table also presents the performance of a tuned Logistic Regression with class weights to account for the class imbalance. Gradient boosted trees outperform Logistic Regression by a large margin in terms of the cost measure, demonstrating their effectiveness for this classification task.
| Feature set                     | Cost  | Accuracy (%) |
|---------------------------------|-------|--------------|
| All features (Winning solution) | 4.97  | 98.2         |
| All features (Log. Regression)  | 10.10 | 97.2         |
| - Numerical Features            | 14.65 | 95.6         |
| - Language Models               | 5.66  | 98.1         |
| - Latent Dirichlet Allocation   | 7.14  | 97.3         |
| - Part-Of-Speech tags           | 5.01  | 98.1         |
| Reduced feature set             | 4.90  | 98.2         |
Another interesting point concerns the effect of some features on the two evaluation measures that Table 4 presents. Notice, for instance, that while language models are quite important for the custom error metric (Eq. 1) of the challenge, their effect on accuracy is smaller. This suggests that adding them reduces the size of the errors considerably, but does not increase much the number of correctly classified instances.
The last observation on the impact of the features on the evaluation measures motivates an error analysis step to examine the errors the model produces. Table 5 presents the confusion matrix of the 3-fold cross-validated predictions on the training data. As expected, the majority of examples appear on the diagonal, denoting correct classification. Most of the errors that occur are between neighboring categories, suggesting that it can be difficult to differentiate between them. Lastly, very few misclassification errors occur between categories that are far apart with respect to the given order of language levels, which suggests that the system successfully differentiates between them.
5 Conclusion

In this work I presented the feature extraction, feature engineering and model evaluation steps I followed while developing balikasg for CAp 2018, which was ranked first among 14 systems. I evaluated the efficiency of the different feature groups and found readability and complexity scores as well as topic models to be effective predictors. Further, I evaluated the effectiveness of different classification algorithms and found that gradient boosted trees outperform the rest of the models on this problem.
While in terms of accuracy the system performed excellently, achieving 98.2% on the test data, the question raised is whether there are any types of biases in the process. For instance, topic distributions learned with LDA were valuable features. One, however, needs to investigate deeply whether this is due to the expressiveness and modeling power of LDA or an artifact of the dataset used. In the latter case, given that candidates are asked to write an essay on a subject Geertzen et al. (2013) that depends on their level, the hypothesis that needs to be studied is whether LDA was just a clever way to model this information leak in the given data. I believe that further analysis and validation can answer this question if the topics of the essays are released, so that validation splits can be done on the basis of these topics.
I would like to thank the organisers of the challenge and NVidia for sponsoring the prize of the challenge. The views expressed in this paper belong solely to the author, and not necessarily to the author’s employer.
- At the time of writing, the test data had not become publicly available.
- Georgios Balikas and Ioannis Partalas. 2018. On the effectiveness of feature set augmentation using clusters of word embeddings. SwissText 2018.
- David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
- Jeroen Geertzen, Theodora Alexopoulou, and Anna Korhonen. 2013. Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). In Proceedings of the 31st Second Language Research Forum. Somerville, MA: Cascadilla Proceedings Project.
- Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.
- Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 690–696.
- Yan Huang, Akira Murakami, Theodora Alexopoulou, and Anna Korhonen. 2018. Dependency parsing of learner English. International Journal of Corpus Linguistics, 23(1):28–54.
- Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI, pages 4444–4451.