Lexical Bias In Essay Level Prediction


Automatically predicting the level of non-native English speakers given their written essays is an interesting machine learning problem. In this work I present the system balikasg that achieved the best performance in the CAp 2018 data science challenge among 14 systems. I detail the feature extraction, feature engineering and model selection steps, and I evaluate how these decisions impact the system's performance. The paper concludes with remarks for future work.


1 Introduction

Automatically predicting the level of English of non-native speakers from their written text is an interesting text mining task. Systems that perform well on the task can be useful components for online, second-language learning platforms as well as for organisations that tutor students for this purpose. In this paper I present the system balikasg that achieved the best performance in the CAp 2018 data science challenge among 14 systems.1 In order to achieve the best performance in the challenge, I decided to use a variety of features that describe an essay's readability and syntactic complexity as well as its content. For the prediction step, I found Gradient Boosted Trees, whose efficiency is proven in several data science challenges, to be the most effective among a variety of classifiers.

The rest of the paper is organized as follows: in Section 2 I frame the problem of language level as an ordinal classification problem and describe the available data. Section 3 presents the feature extraction and engineering techniques used. Section 4 describes the machine learning algorithms used for prediction as well as the achieved results. Finally, Section 5 concludes with discussion and avenues for future research.

2 Problem Definition

In order to approach the language-level prediction task as a supervised classification problem, I frame it as an ordinal classification problem. In particular, given a written essay from a candidate, the goal is to associate the essay with a level of English according to the Common European Framework of Reference for languages (CEFR) system. Under CEFR there are six language levels L = {A1, A2, B1, B2, C1, C2}, such that A1 < A2 < B1 < B2 < C1 < C2. In this notation, A1 is the beginner level while C2 is the most advanced level. Notice that the levels of L are ordered, thus defining an ordinal classification problem. In this sense, care must be taken both during the phase of model selection and during the phase of evaluation. In the latter, predicting a class far from the true one should incur a higher penalty. In other words, given a C2 essay, predicting A1 is worse than predicting C1, and this difference must be captured by the evaluation metrics.

In order to capture this explicit ordering of the levels, the organisers proposed a cost measure that uses the confusion matrix of the predictions and prior knowledge in order to evaluate the performance of a system. In particular, the measure writes as:

Error = (1/N) Σ_{i,j} C_{i,j} n_{i,j}    (1)

where C is a cost matrix that uses prior knowledge to quantify the misclassification errors, n_{i,j} is the number of observations of class i classified with category j, and N is the total number of essays. The cost matrix is given in Table 1. Notice that, as expected, moving away from the diagonal (correct classifications) the misclassification costs are higher. The biggest cost (44) occurs when a C2 essay is classified as A1. On the contrary, the cost is much lower (6) when the opposite happens and an A1 essay is classified as C2. Since C is not symmetric and the costs of the lower diagonal are higher, the penalties for misclassification are worse when essays of upper language levels (e.g., C2) are classified as essays of lower levels.

     A1  A2  B1  B2  C1  C2
A1    0   1   2   3   4   6
A2    1   0   1   4   5   8
B1    3   2   0   3   5   8
B2   10   7   5   0   2   7
C1   20  16  12   4   0   8
C2   44  38  32  19  13   0
Table 1: Cost matrix C used to calculate the misclassification error described in Eq. (1). Rows correspond to the true level, columns to the predicted level.
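The measure is straightforward to compute from a confusion matrix. The sketch below (function and variable names are my own) scores a prediction against the cost matrix of Table 1, assuming Eq. (1) averages the per-essay costs:

```python
# Cost matrix from Table 1 (rows: true level A1..C2, columns: predicted level).
COST = [
    [0, 1, 2, 3, 4, 6],
    [1, 0, 1, 4, 5, 8],
    [3, 2, 0, 3, 5, 8],
    [10, 7, 5, 0, 2, 7],
    [20, 16, 12, 4, 0, 8],
    [44, 38, 32, 19, 13, 0],
]

def challenge_cost(confusion):
    """Average misclassification cost: sum of COST[i][j] * n[i][j] over N essays."""
    total = sum(COST[i][j] * n
                for i, row in enumerate(confusion)
                for j, n in enumerate(row))
    n_essays = sum(sum(row) for row in confusion)
    return total / n_essays

# A perfect prediction (only diagonal entries) has zero cost.
perfect = [[10 if i == j else 0 for j in range(6)] for i in range(6)]
print(challenge_cost(perfect))  # 0.0
```

The asymmetry of Table 1 is visible here: a single C2 essay predicted as A1 contributes 44 to the numerator, while the opposite mistake contributes only 6.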


The data used in this work were released in the framework of the CAp 2018 competition and are an excerpt of Geertzen et al. (2013); Huang et al. (2018). The competition's goal was to evaluate automated systems that performed well on the task. The organisers released two datasets: the training and the test data. The test data became available only a few days before the end of the competition, without the associated language levels, and were used only for evaluation. The evaluation was performed by the organisers after the system predictions were submitted in a text file, as is frequently done in such competitions.

The training data consist of 27,310 essays while the test data contain 13,656 essays. Figure 1 illustrates the distribution of essays according to their level. The classification problem is unbalanced as there are far more training examples for the first levels (e.g., A1) than for the rest. The organisers announced that they performed a stratified split with respect to the level label, so we expect similar distributions for the test data.2

The released data consist of the essay text as well as various numerical features. The numerical features are either statistics calculated on the essay text (length, number of sentences/syllables etc.) or indexes that try to capture the readability and complexity of the essays, such as the Coleman and Flesch families of indexes. Table 2 presents basic statistics that describe the essay text.

Figure 1: The distribution of essays according to the CERF levels in the training data.
Description Value
Number of essays 13,656
Vocabulary size 38,337
Avg. essay length (words) 80.22
Avg. essay length (sentences) 6.75
Table 2: Basic statistics for the released essays.

3 Feature Extraction

In this section I present the extracted features partitioned in six groups and detail each of them separately.

Numerical features

Most of the features in this family were provided by the challenge organisers using v0.10-2 of the R koRpus package.3 For a full list of these features, please visit the challenge website.4 During the preliminary exploratory analysis I found some of the features released by the organisers to be inaccurate. Hence, I recalculated the number of sentences, words and letters per essay, and I added the Gunning Fog index, which estimates the number of years of formal education a person needs to understand an English text on the first reading, using the Python textstat package.5 Also, I added the number of difficult words in an essay using the lists of difficult words of textstat, the number of misspelled words using the GNU Aspell dictionary6, the number of duplicate words in the essay, as well as the number of words with inverse document frequency (idf) smaller than the average idf of the corpus words. Overall, there are 66 numerical features in this family.
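The hand-rolled counts in this family are simple to reproduce with the standard library. The sketch below is illustrative rather than the exact code used: `dictionary` stands for any set of known words (e.g., from Aspell) and `corpus_idf` maps corpus words to precomputed idf values; both names are my own.

```python
from collections import Counter

def essay_features(essay, dictionary, corpus_idf):
    """Illustrative counts: misspelled words, duplicated words, and words
    whose idf is below the corpus average (i.e., very common words)."""
    words = essay.lower().split()
    counts = Counter(words)
    misspelled = sum(1 for w in words if w not in dictionary)
    duplicates = sum(1 for w, c in counts.items() if c > 1)
    avg_idf = sum(corpus_idf.values()) / len(corpus_idf)
    # Unseen words default to the average idf, so they are not counted as common.
    common = sum(1 for w in set(words) if corpus_idf.get(w, avg_idf) < avg_idf)
    return {"misspelled": misspelled, "duplicates": duplicates, "common": common}
```

For example, `essay_features("the cat sat on the mat", {"the", "cat", "sat", "on", "mat"}, {"the": 0.1, "cat": 2.0, "mat": 2.0})` reports one duplicated word ("the") and no misspellings.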

Language models

For each essay I calculated its probability under two language models: one trained on the essays of the lower levels and another trained on the essays of the upper levels. For this purpose, I used trigram language models with modified Kneser-Ney smoothing, with n-grams of lower order as a back-off mechanism Heafield et al. (2013), using the implementation of Heafield (2011). As language models can easily overfit when trained on small corpora, I decided to replace words with fewer than 10 occurrences by their part-of-speech (POS) tags, and numbers by a special token. The hope is that the language models will capture the more complicated patterns we expect higher-level users to use. The free parameters concerning the use of language models, like the optimal n-gram order and the decision whether to ignore or replace low-frequency words with their POS tags, were set using stratified 3-fold cross-validation on the training data. The same applies for every major decision in the feature extraction process, unless otherwise stated.

Word Clusters

Word embeddings are dense word vectors that have been shown to capture the semantics of words. I represented a given essay using word clusters Balikas and Partalas (2018) calculated by applying k-Means (k = 1,000) on the ConceptNet embeddings Speer et al. (2017). Clustering the words of the corpus vocabulary generates semantically coherent clusters. In turn, to represent a document I used a binary one-hot-encoded representation of the clusters to which the essay words belong. For instance, in our case where each of the vocabulary words belongs to one of 1,000 clusters, an essay is represented by a 1,000-dimensional binary vector whose non-zero elements correspond to the ids of the clusters to which the essay words belong.
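Assuming a `word2cluster` mapping already obtained from k-Means over the embeddings (the mapping below is a toy example), the binary bag-of-clusters representation can be sketched as:

```python
def cluster_vector(essay, word2cluster, n_clusters=1000):
    """Binary bag-of-clusters: dimension c is 1 iff some essay word
    belongs to cluster c; out-of-vocabulary words are ignored."""
    vec = [0] * n_clusters
    for w in essay.lower().split():
        c = word2cluster.get(w)
        if c is not None:
            vec[c] = 1
    return vec

# Toy mapping for illustration; in the paper word2cluster comes from
# k-Means (k = 1,000) on ConceptNet embeddings.
word2cluster = {"cat": 0, "dog": 0, "run": 2}
```

So with three clusters, `cluster_vector("the cat can run", word2cluster, n_clusters=3)` yields `[1, 0, 1]`.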

Topic Models

Topic models are a class of unsupervised models. They are generative models in that they define a mechanism describing how a corpus of documents is generated. In this work I used Latent Dirichlet Allocation (LDA) Blei et al. (2003) in order to obtain dense document representations that describe the topics that appear in each document. During the inference process, these topics, which are multinomial distributions over the corpus vocabulary, are uncovered, and each document is represented as a mixture of them. I used a custom Python LDA implementation7 and concatenated the per-document topic distributions obtained when training LDA with 30, 40, 50 and 60 topics. For each number of topics, I ran the inference process for 200 burn-in iterations, so that the collapsed Gibbs sampler converges, and then sampled the topic distributions every 10 iterations for a further 50 iterations in order to obtain uncorrelated samples.

Part-of-Speech tags

POS tags are informative text representations that can capture the complexity of the expressions used by an author. Intuitively, beginners use, for example, fewer adjectives and adverbs compared to more advanced users of a language. To capture this, I obtained the sequence of POS tags of an essay using SpaCy.8 Then, to represent the POS sequences, I used n-grams encoded as bag-of-words.
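Given a POS-tag sequence from a tagger such as SpaCy, the n-gram bag-of-words encoding can be sketched as follows (the range of n is illustrative, as the exact orders used are not recoverable from the text):

```python
from collections import Counter

def pos_ngram_bow(pos_tags, n_values=(1, 2, 3)):
    """Bag-of-words over n-grams of a POS-tag sequence.
    Each n-gram is keyed by its space-joined tags."""
    bow = Counter()
    for n in n_values:
        for i in range(len(pos_tags) - n + 1):
            bow[" ".join(pos_tags[i:i + n])] += 1
    return bow
```

For instance, the tag sequence `["DET", "NOUN", "VERB"]` with bigrams only yields the features "DET NOUN" and "NOUN VERB".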

Essay text

Last, I explicitly encoded the content of the essay using its bigram bag-of-words representation. In order to limit the effect of frequent terms like stopwords, I applied the idf weighting scheme.
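A minimal sketch of the idf-weighted bigram bag-of-words follows; tf-idf variants differ in details such as smoothing, and this one (an assumption) uses raw counts and an unsmoothed idf:

```python
import math
from collections import Counter

def bigrams(text):
    """Consecutive word pairs of a lowercased, whitespace-tokenized text."""
    toks = text.lower().split()
    return [" ".join(toks[i:i + 2]) for i in range(len(toks) - 1)]

def tfidf_bigram_bow(corpus):
    """tf-idf weighted bigram bag-of-words over a list of essays.
    Bigrams appearing in every document (e.g., stopword pairs) get idf 0."""
    docs = [Counter(bigrams(t)) for t in corpus]
    n = len(docs)
    df = Counter(b for d in docs for b in d)          # document frequency
    idf = {b: math.log(n / df[b]) for b in df}
    return [{b: tf * idf[b] for b, tf in d.items()} for d in docs]
```

With the corpus `["a b c", "a b d"]`, the shared bigram "a b" receives weight 0 while the distinguishing bigrams keep positive weights.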

4 Model Selection and Evaluation

As the class distribution in the training data is not balanced, I have used stratified cross-validation for validation purposes and for hyper-parameter selection. As a classification algorithm, I have used gradient boosted trees trained with gradient-based one-side sampling, as implemented in the Light Gradient Boosting Machine (LightGBM) toolkit released by Microsoft.9 The depth of the trees was set to 3, the learning rate to 0.06 and the number of trees to 4,000. Also, to combat the class imbalance in the training labels, I assigned weights to each class so that errors in the frequent classes incur lower penalties than errors in the infrequent ones.
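The reported hyper-parameters can be collected in a configuration sketch. The parameter names follow LightGBM's scikit-learn-style API (an assumption), and the weighting scheme below is one common choice, not stated in the paper:

```python
from collections import Counter

# Hyper-parameters reported in the paper, kept as a plain dict so the
# sketch stays dependency-free; names follow LightGBM's sklearn-style API.
params = {
    "boosting_type": "goss",   # gradient-based one-side sampling
    "max_depth": 3,            # depth of the trees
    "learning_rate": 0.06,
    "n_estimators": 4000,      # number of trees
}

def balanced_class_weights(labels):
    """Assumed weighting scheme: inversely proportional to class frequency,
    so frequent classes contribute less per-example penalty."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}
```

The resulting dict could then be passed as `class_weight` to `lightgbm.LGBMClassifier(**params)`.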


Figure 2 illustrates the performance the Gradient Boosted Trees achieve on each of the feature sets. Complementarily, and for reference, Table 3 presents the error scores as the feature families are added cumulatively. Notice that the best performance is obtained when all features are used. Adding the per-document topic distributions inferred by the topic models seems to improve the results considerably.

Figure 2: The accuracy scores of each feature set using 3-fold cross validation on the training data.
Feature set                      Error (Eq. 1)
+ Numerical Features                 23.52
+ Language Models                    14.00
+ Clusters                           14.20
+ Latent Dirichlet Allocation         5.74
+ Part-Of-Speech tags                 5.52
+ Bag-of-words                        4.97
Table 3: Stratified 3-fold cross-validation scores for the official measure of the challenge; each row adds a feature family to those above it.

In order to better evaluate the effect of each family of features, without the bias of the order in which the families are added, Table 4 presents an ablation study. The table presents the score achieved when using all features, as well as the scores achieved when removing a particular family of features. In this sense, one can estimate the added value of each family. From the table we notice that the most effective features are the numerical features, the document-topic distributions learned with LDA, and the scores from the language models: removing any of these families results in the biggest reductions in performance. For reference, the table also presents the performance of a tuned Logistic Regression with the same class weights for the class imbalance. Gradient Boosted Trees outperform Logistic Regression by a large margin in terms of the challenge's error measure, proving their efficiency for such classification tasks.

Features                            Error (Eq. 1)  Accuracy
All features (Winning solution) 4.97 98.2
All features (Log. Regression) 10.10 97.2
- Numerical Features 14.65 95.6
- Language Models 5.66 98.1
- Clusters 4.99 98.1
- Latent Dirichlet Allocation 7.14 97.3
- Part-Of-Speech tags 5.01 98.1
- Bag-of-words 5.52 97.7
Reduced feature set 4.90 98.2
Table 4: Ablation study to explore the importance of different feature families.

Another interesting point concerns the effect of some features on the two evaluation measures that Table 4 presents. Notice, for instance, that while language models are quite important for the custom error metric (Eq. 1) of the challenge, their effect on accuracy is smaller. This suggests that adding them reduces the magnitude of the errors considerably, but does not increase much the number of correctly classified instances.

The last observation on the impact of the features on the evaluation measures motivates an error analysis step to examine the errors the model produces. Table 5 presents the confusion matrix of the 3-fold cross-validated predictions on the training data. As expected, the majority of examples appear on the diagonal, denoting correct classification. Most of the errors that occur are in neighboring categories, suggesting that it can be difficult to differentiate between them. Lastly, very few misclassification errors occur between categories that are far apart with respect to the given order of language levels, which suggests that the system successfully differentiates between them.

        A1     A2     B1     B2    C1   C2
A1  11,224     54      3      0     1    0
A2      99  7,531     42      0     0    0
B1      30     95  5,297     23     7    1
B2       0      4     32  2,273    14    1
C1       7      2      7     35   465   19
C2       1      2      2      6     4   29
Table 5: Confusion matrix of the 3-fold stratified cross-validation. The value in row i and column j is the number of essays known to be in group i and predicted to be in group j. Notice how most of the misclassification errors occur between close categories.

5 Conclusion

In this work I presented the feature extraction, feature engineering and model evaluation steps I followed while developing balikasg for CAp 2018, which was ranked first among 14 systems. I evaluated the efficiency of the different feature groups and found that readability and complexity scores as well as topic models are effective predictors. Further, I evaluated the effectiveness of different classification algorithms and found that Gradient Boosted Trees outperform the rest of the models on this problem.

While in terms of accuracy the system performed excellently, achieving 98.2% on the test data, the question that arises is whether there are any types of biases in the process. For instance, topic distributions learned with LDA were valuable features. One, however, needs to investigate deeply whether this is due to the expressiveness and modeling power of LDA or an artifact of the dataset used. In the latter case, given that the candidates are asked to write an essay on a subject that depends on their level Geertzen et al. (2013), the hypothesis that needs to be studied is whether LDA was just a clever way to model this information leak in the given data. I believe that further analysis and validation can answer this question if the topics of the essays are released, so that validation splits can be done on the basis of these topics.


Acknowledgements

I would like to thank the organisers of the challenge and NVidia for sponsoring the prize of the challenge. The views expressed in this paper belong solely to the author, and not necessarily to the author's employer.


  1. http://cap2018.litislab.fr/competition-en.html
  2. At the time of writing of this paper, the test data have not become publicly available.
  3. https://cran.r-project.org/web/packages/koRpus/vignettes/koRpus_vignette.pdf
  4. http://cap2018.litislab.fr/competition_annexes_EN.pdf
  5. https://github.com/shivam5992/textstat
  6. ftp://ftp.gnu.org/gnu/aspell/dict/0index.html
  7. https://github.com/balikasg/topicModelling
  8. https://spacy.io/
  9. https://github.com/Microsoft/LightGBM


  1. Georgios Balikas and Ioannis Partalas. 2018. On the effectiveness of feature set augmentation using clusters of word embeddings. SwissText 2018.
  2. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
  3. Jeroen Geertzen, Theodora Alexopoulou, and Anna Korhonen. 2013. Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). In Proceedings of the 31st Second Language Research Forum. Somerville, MA: Cascadilla Proceedings Project.
  4. Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.
  5. Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 690–696.
  6. Yan Huang, Akira Murakami, Theodora Alexopoulou, and Anna Korhonen. 2018. Dependency parsing of learner English. International Journal of Corpus Linguistics, 23(1):28–54.
  7. Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI, pages 4444–4451.