Mislabel Detection of Finnish Publication Ranks


This paper analyzes a data set of Finnish ranks of academic publication channels with the Extreme Learning Machine (ELM). The purpose is to introduce and test a recently proposed ELM-based mislabel detection approach with a rich set of features characterizing a publication channel. We compare the architecture, accuracy, and, especially, the set of detected mislabels of the ELM-based approach to the corresponding reference results in [10].

1 Introduction

Finland, in the spirit of Norway and Denmark, introduced a ranking system for academic publication channels (scientific journals, conference series, book publishers, etc.) called Jufo ("Julkaisufoorumi" in Finnish, "Publication Forum" in English) in 2010, together with the renewed university legislation. The rank of a publication channel, ranging from 0 (non-peer-reviewed) to 3 (most distinguished academic publication forums), is decided by a specially nominated panel for a particular scientific discipline. These panels decide the ranks in regular meetings, based on their academic expertise. Because the ranks are directly linked to the funding allocated to the universities, there has been, and still is, a lot of discussion about their fairness and objectivity.

A versatile analysis of the 2015 Jufo rankings was done in [10]. There, by using association rule mining, decision trees, and confusion matrices with respect to the Norwegian and Danish ranks, it was shown that most of the expert-based rankings could be predicted and explained with machine learning methods. Moreover, it was found that those publication channels for which the Finnish expert-based rank is higher than the estimated one are characterized by higher publication activity or a recent upgrade of the rank. Hence, the outcomes of the system, the publication ranks, need to be assessed and evaluated regularly and rigorously.

Extreme Learning Machine (ELM), as proposed by Huang et al. [6, 5], provides one of the key randomized neural network frameworks [4]. Probabilistic convergence analysis of the technique was provided in [8, 7], where the necessity of repeated sampling of the feedforward kernel and the advantage of weight decay (ridge regression) were concluded. Here, to identify possibly mislabeled publication channel ranks, we apply the MD-ELM algorithm described and successfully tested in [1].

The rest of the paper is organized as follows. Section 2 introduces the original dataset of Jufo rankings. Section 3 describes the feature extraction process and summarizes the MD-ELM method. Section 4 explains the experimental setup and the general prediction performance, and compares the results with the previous ones in [10]. Section 5 summarizes the findings and describes future research directions.

2 Data

The data for this study comes from two publicly available databases containing the Finnish publication source information and the actual national publication activity information.

  1. JuFoDB: The database of the Finnish publication forum, JuFo (footnote 1), which contains all nationally evaluated publication channels. Data was retrieved from this database in February 2015, so it describes the ranking situation after the complete re-evaluation round that ended in 2014.

  2. JuuliDB: The publicly accessible database of Juuli (footnote 2), which contains all publications of Finnish researchers. Each publication channel in JuFoDB has a unique Juuli ID, through which all Finnish publications in that particular channel can be found. Data was retrieved from this database in September 2015, because only then had all work published by the end of 2014 been checked and included in the repository.

In total, 29,443 different publication channels with 33 attributes were retrieved from JuFoDB and 107,289 publications from JuuliDB. The Finnish expert-based rank of each publication channel, as well as the Norwegian and Danish expert-based ranks, can be obtained directly from JuFoDB, which also features three bibliometric indicators from Scopus: the SJR, the SNIP, and the IPP. Moreover, through the link to JuuliDB, one can directly access the information on all researchers in Finland who have published in a particular channel.

The panel variable determines the list of experts (footnote 3) who have evaluated the publication channel and decided the Finnish expert-based rank. It essentially indicates the research discipline of the publication channel. Field, MinEdu field, Web of Science fields, and Scopus fields are further variables indicating the discipline of the publication channel. However, these variables may link a channel to multiple values, and for some publication channels they are not available at all. In contrast, each publication channel is attached to exactly one panel, and the panel information is available for all publication channels except for 6,562 book publishers that have mostly been evaluated as rank 0 [10].

In addition to some more general data, such as the title, subtitle, website, country of publication, language, unique identifier (ID), ISSN, Sherpa/Romeo code, starting year, and publisher, the JuFoDB also provides information such as abbreviation, title details, ISBN, DOAJ, end year, continued under the name and continued JuFo-rank. The evaluation history provides information about the previous ranks in the system.

As in [10], the continuous variables are directly utilized as features and each categorical variable is transformed into a separate binary feature for each category. All of the 29,443 publication channels have missing values for at least some of the 33 variables. Hence, to utilize all of the available data in the analysis, one faces a significant sparsity problem [9]. Since missing information was found to be an important predictor of the Finnish expert-based rank in [10], we utilize here all the described variables as features and, for each variable, the binary information of whether a value is available. Thus, our final model had 942 features (542 original + 400 added non-linear feature combinations).

3 Methodology

3.1 Feature extraction

The original variables described in the previous section were transformed into numerical features, either real-valued or binary. Each original feature has its own specific transformation into numerical format. The absence of a value, similarly to [10], is encoded with a separate binary variable for most features, as it provides valuable information (e.g., a poor-quality conference lacking a website).

The original features used for the analysis task and their corresponding transformations are described in Table 1. The Jufo rankings of previous years are notably absent from the list; they are omitted on purpose so that the rank prediction task is not biased by previous decisions.

# | Feature | Meaning | Numerical representation
1 | Level | Current Jufo ranking | An output variable with integer values in the range 0 to 3
2 | Title, Subtitle | Title and subtitle (if available) of the publication | Encoded in a Bag-of-Words representation, dimensionality reduced from 3700 to 30 by a Sparse Random Projection
3 | Website | Website of the publication | Country code of the host represented in one-hot encoding with 117 binary variables (including unknown)
4 | Type | Publication type (journal, conference, book series) | Represented in one-hot encoding with 3 binary variables; this feature has no missing values
5 | ISSN | ISSN numbers of printed and online versions | A binary variable representing whether the publication has an ISSN; two variables in total
6 | StartYear | Start year of the publication | A logarithm of the age of the publication, plus a binary variable representing a missing value
7 | Publication Country | Country of publication | One-hot encoding of the publication origin with 114 binary variables, including the unknown origin
8 | Publisher | Publisher of the series | One-hot encoding of the 100 most popular publishers, plus "other publisher"
9 | Language | Language of the publication | One-hot encoding of the publication language with 49 binary variables, including undetermined
10 | ERIH-class | ERIH ranking of publications | One-hot encoding of the four available ranks, plus a missing rank
11 | SJR, SNIP, IPP | Impact factors in three different systems | Three real-valued variables for the impact factors, plus three binary variables indicating the absence of an impact factor
12 | DOAJ, Sherpa/Romeo | Open access types | Eight binary variables: two for DOAJ levels and six for the Sherpa/Romeo levels
13 | Field | The field of study in the Finnish classification | Ten binary variables for the ten fields; a publication may belong to multiple fields
14 | MinEdu Field | The field of study according to the Ministry of Education classification | 70 binary variables for the Ministry of Education fields; a publication may belong to several of them
15 | Panel | The scientific panel that assigned the corresponding score | One-hot encoding of the panel number with 25 binary variables, including "panel not available"
16 | ISBN | ISBN numbers used by the publication | One variable representing the number of different ISBNs; can be zero
Table 1: The list of original features and their numerical representations.
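Two of the transformations in Table 1 can be illustrated with a minimal sketch. The function names, the reference year 2015, the use of log(1 + age), and the toy category list are assumptions made for illustration only; the paper does not specify these details.

```python
import math

def encode_start_year(start_year, reference_year=2015):
    """Return (log-age, missing-flag) for a start year that may be None.

    Assumptions: age is measured from `reference_year`, and log(1 + age)
    is used so that a channel started in the reference year maps to 0.
    """
    if start_year is None:
        return 0.0, 1          # value unknown: neutral value plus missing flag
    age = max(reference_year - start_year, 0)
    return math.log1p(age), 0

def one_hot(value, categories):
    """One-hot encode `value`; missing or unseen values use a final 'unknown' slot."""
    vec = [0] * (len(categories) + 1)
    idx = categories.index(value) if value in categories else len(categories)
    vec[idx] = 1
    return vec

# Toy example with a hypothetical three-language category list.
languages = ["en", "fi", "de"]
features = list(encode_start_year(1995)) + one_hot("fi", languages)
```

Keeping the missing-value flag as its own binary column lets a linear output layer learn a separate weight for "value absent", which is the information reported as predictive in [10].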

3.2 Mislabel detection using MD-ELM

The mislabel detection is based on the MD-ELM algorithm from [1]. The key idea is to add artificial mislabels to a data set, which can then be used as a baseline in a statistical detection of unknown mislabels using Welch's t-test and the directly computable Leave-One-Out (LOO) cross-validation error (the PRESS statistic). In this way, the MD-ELM algorithm detects samples whose original labels are likely incorrect.
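For reference, the PRESS statistic of a linear output layer with ridge regularization can be written in closed form. The notation below is an assumption made for illustration and is not taken from [1]: H is the hidden-layer output matrix, lambda the ridge parameter, y_i the target and \hat{y}_i the fitted value of sample i, and N the number of samples.

```latex
\mathrm{MSE}_{\mathrm{LOO}}
  = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{y_i - \hat{y}_i}{1 - h_{ii}} \right)^{2},
\qquad
h_{ii} = \left[ \mathbf{H} \left( \mathbf{H}^{\top}\mathbf{H} + \lambda \mathbf{I} \right)^{-1} \mathbf{H}^{\top} \right]_{ii}
```

Because the leverages h_ii depend only on the inputs, they can be computed once and reused whenever only the labels change, which is what makes repeated LOO evaluations cheap.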

More precisely, MD-ELM analyses the changes in the LOO error of the model in response to randomly changing the labels of a few training samples. If the new labels reduce the global LOO error, the mislabel score of those samples is increased. A small fraction of the samples, whose labels are deliberately changed at random, forms a control group called artificial mislabels. The scores of the artificially mislabeled samples help to determine whether the MD-ELM method succeeds, and they define the stopping criterion.
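The scoring loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation from [1]: it uses a plain tanh ELM with a ridge-regularized output layer instead of the OP-ELM model, reverts the labels after every trial, runs a fixed number of trials instead of the score-based stopping criterion, and all names and default values are hypothetical.

```python
import numpy as np

def md_elm_scores(X, y, n_classes, n_hidden=20, n_trials=3000,
                  flips_per_trial=2, alpha=1e-2, seed=0):
    """Count, per sample, how often changing its label lowered the LOO error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Random, untrained hidden layer of the ELM.
    W = rng.normal(size=(d, n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)
    # Hat matrix of the ridge output layer; it depends only on the inputs,
    # so it is computed once and reused for every label change.
    P = H @ np.linalg.solve(H.T @ H + alpha * np.eye(n_hidden), H.T)
    denom = (1.0 - np.diag(P))[:, None]

    def loo_mse(labels):
        Y = np.eye(n_classes)[labels]              # one-hot targets
        return np.mean(((Y - P @ Y) / denom) ** 2)  # PRESS-based LOO error

    scores = np.zeros(n)
    base = loo_mse(y)
    for _ in range(n_trials):
        idx = rng.choice(n, size=flips_per_trial, replace=False)
        trial = y.copy()
        # A random non-zero offset guarantees that each chosen label changes.
        offsets = rng.integers(1, n_classes, size=flips_per_trial)
        trial[idx] = (y[idx] + offsets) % n_classes
        if loo_mse(trial) < base:
            scores[idx] += 1
    return scores
```

In a toy setting with two well-separated clusters and a handful of deliberately flipped labels, the flipped samples are expected to collect clearly higher scores than the rest, which mirrors the role of the artificial mislabels as a control group.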

The mislabel detection method uses the Extreme Learning Machine as a powerful nonlinear prediction model with a fast LOO error. A practical implementation employs several ELM models with different sets of artificial mislabels, eliminating their possible impact on the results. The samples predicted to be originally mislabeled are those whose mislabel score is higher than a given quantile of a normal distribution fitted to all the scores.
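The quantile-based decision rule can be sketched with the Python standard library. The function name, the choice of the population standard deviation, and the toy scores are assumptions for illustration; the source only states that a normal distribution is fitted to all scores.

```python
from statistics import NormalDist, fmean, pstdev

def flag_mislabels(scores, quantile=0.99):
    """Return (threshold, indices of samples scoring above the fitted quantile).

    Assumption: the normal distribution is fitted by the sample mean and the
    population standard deviation; the scores must not all be identical.
    """
    dist = NormalDist(mu=fmean(scores), sigma=pstdev(scores))
    threshold = dist.inv_cdf(quantile)
    flagged = [i for i, s in enumerate(scores) if s > threshold]
    return threshold, flagged

# Toy scores: nine ordinary samples and one clear outlier at index 9.
toy_scores = [10, 12, 9, 11, 10, 13, 8, 11, 10, 95]
threshold, flagged = flag_mislabels(toy_scores, quantile=0.99)
```

Raising the quantile (for example to 0.999) makes the rule more conservative, at the cost of possibly missing less extreme candidates.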

4 Experimental results

4.1 Prediction performance

A successful MD-ELM method application requires a precise prediction model to work with. The prediction task uses features 2-16 from Table 1 as inputs and the feature 1 as the target output.

The dataset exhibits a strong class imbalance (see Table 2). The imbalance causes rank 3 to be almost completely neglected in the predictions unless class balancing measures are taken.

Rank | 0 | 1 | 2 | 3
# | 5,743 | 20,503 | 2,329 | 668
Table 2: Number of data samples of each Jufo-rank.

The benchmark performance level is obtained with the Random Forest classifier. It achieves 89.3% test accuracy, but the predictions are biased due to the strong class imbalance, as shown in Figure 1. The smallest class, rank 3, has only 18.6% correct predictions, while the largest class, rank 1, is predicted correctly 98.4% of the time.

Figure 1: Confusion matrix of Random Forest classifier on out-of-batch data.

Unfortunately, the Random Forest model cannot be used in the mislabel detection framework, so an Extreme Learning Machine was trained instead. Its input features consisted of the 542 numerical features derived from the data, 200 standard non-linear ELM neurons, and another 200 Radial Basis Function neurons.
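The composition of the 942 model inputs (542 + 200 + 200) can be illustrated with a short sketch. The tanh activation, the weight scaling, the choice of RBF centers as random data points, and the kernel width are all assumptions made for illustration; the source does not specify them.

```python
import numpy as np

def elm_hidden_features(X, n_tanh=200, n_rbf=200, seed=0):
    """Stack linear inputs, random tanh neurons, and random RBF neurons."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Standard ELM neurons: random projection followed by tanh (assumed).
    W = rng.normal(size=(d, n_tanh)) / np.sqrt(d)
    b = rng.normal(size=n_tanh)
    H_tanh = np.tanh(X @ W + b)
    # RBF neurons: centers drawn from the data, Gaussian kernel (assumed).
    centers = X[rng.choice(n, size=n_rbf, replace=True)]
    sq_dist = ((X ** 2).sum(axis=1)[:, None]
               - 2.0 * X @ centers.T
               + (centers ** 2).sum(axis=1)[None, :])
    sq_dist = np.maximum(sq_dist, 0.0)   # guard against round-off negatives
    H_rbf = np.exp(-sq_dist / d)
    return np.hstack([X, H_tanh, H_rbf])

# Random stand-in data with the same width as the 542 derived features.
X_demo = np.random.default_rng(1).normal(size=(50, 542))
Z_demo = elm_hidden_features(X_demo)     # shape (50, 942)
```

Keeping the original features next to the random neurons lets the regularized output layer decide which linear and non-linear inputs are worth keeping.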

Training the output layer proved difficult due to both the class imbalance and a high number of irrelevant linear features. The only successful model was an ElasticNet linear classifier, which combines L1 and L2 regularization, trained with Stochastic Gradient Descent. The regularization strength parameter was found by 5-fold stratified cross-validation, which keeps the proportion of samples from the different classes equal across the folds. Additionally, the method performed class balancing by computing corresponding sample weights.
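As an illustration of class balancing through weights, the sketch below applies the common inverse-frequency heuristic (total / (number of classes x class count)) to the counts in Table 2. Whether the authors used exactly this formula is an assumption; the function name is hypothetical.

```python
def balanced_class_weights(counts):
    """Inverse-frequency class weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}

# Class counts per Jufo rank, taken from Table 2.
rank_counts = {0: 5743, 1: 20503, 2: 2329, 3: 668}
weights = balanced_class_weights(rank_counts)

# Each sample inherits the weight of its class, so every class contributes
# the same total weight to the training loss.
per_class_total = {r: weights[r] * rank_counts[r] for r in rank_counts}
```

With these weights, a rank 3 sample counts far more than a rank 1 sample, which counteracts the tendency of the classifier to ignore the smallest class.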

The resulting ELM achieved 85% total accuracy, distributed much more equally among the classes, as shown in Figure 2. The model selected only 289 input features out of the total of 942, reducing the data size for the MD-ELM method.

Figure 2: Confusion matrix of an ELM model with an ElasticNet classifier output layer.

4.2 MD-ELM performance

The MD-ELM method uses the 289 best features selected in the prediction experiment. The method does not implement class balancing, so the scope of the experiment is limited to detecting incorrectly labeled samples of rank 3, using a dataset of 900 random samples from ranks 0, 1, and 2 plus all the 668 samples of rank 3. Such a reduced dataset has a smaller class imbalance, which does not negatively affect the results.

The final predictions are averaged over 10 different MD-ELM models. Each model uses its own dataset with different random samples of ranks 0, 1, and 2, a random subset of 100 input features out of the available 289, and a different random subset of 3% artificially mislabeled samples. At each iteration of the method, two samples have their labels changed, one of which is always an original rank 3 sample.
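The per-model randomization described above might be set up as in the following sketch. It assumes that the 900 samples are drawn in total (not per rank) from ranks 0 to 2 and that the 3% is taken over the reduced dataset; the function name and the toy identifiers are hypothetical.

```python
import random

def draw_md_elm_run(rank3_ids, other_ids, n_features=289, n_other=900,
                    n_selected=100, artificial_frac=0.03, seed=0):
    """Draw the samples, feature subset, and artificial mislabels of one run."""
    rng = random.Random(seed)
    # All rank 3 samples plus a random subset of the lower-ranked ones.
    samples = list(rank3_ids) + rng.sample(list(other_ids), n_other)
    # Random subset of the selected input features.
    features = sorted(rng.sample(range(n_features), n_selected))
    # Control group whose labels will be deliberately changed.
    n_artificial = round(artificial_frac * len(samples))
    artificial = set(rng.sample(samples, n_artificial))
    return samples, features, artificial

# Toy identifiers: 668 rank 3 channels and 28,575 lower-ranked channels.
rank3 = range(668)
others = range(668, 668 + 28575)
runs = [draw_md_elm_run(rank3, others, seed=s) for s in range(10)]
```

Using a different seed per run gives each of the 10 models its own data, features, and control group, so that averaging the scores reduces the influence of any single random draw.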

The method continues until the artificially mislabeled samples reach an average score of 100. This takes 400,000 iterations. By that time, the samples that were not artificially mislabeled reach an average mislabel score of only 19, with a standard deviation of 28. The difference between the scores shows that the MD-ELM method succeeds at separating artificially mislabeled samples from the rest, which suggests that it should also succeed in detecting the originally mislabeled samples.

The mislabel scores of all the samples with the original rank 3 are shown in Figure 3. A few outliers are clearly visible, together with other candidates for originally mislabeled samples. The analysis of these samples is presented below.

Figure 3: Mislabel scores of samples with the original rank 3 averaged over 10 MD-ELM models; zoomed version on the right. The quantile values of 99% and 99.9% are shown by horizontal lines. Artificial mislabels achieve an average score of 100.

4.3 Characterization of misclassified publication channels and comparison to earlier results

As explained above, we concentrate only on misclassifications for the highest JuFo rank, that is, publication channels that were evaluated by the Finnish discipline experts as rank 3 but for which the automatic model suggested a lower rank. We restrict our misclassification analysis to this set because it also represents the largest difference to the Danish and Norwegian systems, which include only ranks 0, 1, and 2.

With a mislabel score above the 99% quantile of the average scores, 34 publication channels were identified for which the Finnish expert-based rank was 3 but the model suggested a different rank. However, 30 of these misclassifications could immediately be explained by the ranks in the Danish and Norwegian models, which evaluated these publication channels as 2, the highest rank in their systems.

The four remaining publication channels, for which both the automated model and the Danish and Norwegian systems suggested a lower rank, were LIGHT: SCIENCE & APPLICATIONS, Etudes classiques, New German critique (for all three of these journals, the rank has recently been upgraded), and the British medical journal. The last one has considerably higher publication activity: the average number of Finnish publications in JuFo rank 3 channels is 10.78, whereas the British medical journal has a total of 26 publications. All of these journals were also detected as mislabeled in [10], but the misclassification could actually be explained. The three Scopus indicators had erroneously not been included in JuFoDB for LIGHT: SCIENCE & APPLICATIONS and the British medical journal. These indicators could be found manually from Scopus, and in both cases they were so high that rank 3 actually seemed justified.

Although the methods utilized here were very different from the ones utilized in [10], the main results obtained and the misclassifications detected here are to a large extent the same as in [10]. Thus, we conclude that methodological triangulation [2, 3] has strengthened our analysis results.

5 Conclusions

An extended version of the analysis of Finnish publication channel ranks was provided in this paper. Compared to the reference models in [10], we used a much more versatile set of features with a fully nonlinear ELM-based rank prediction model. The mislabel detection was based on the MD-ELM algorithm proposed in [1] and briefly recapitulated in Section 3.2.

In summary, the experimental results obtained and reported in Section 4.3 are very similar to the analysis results in [10]. In our future work, we intend to repeat the mislabel detection for the other ranks as well, especially rank 2, for which the most suspicious publication channel quality misclassifications were identified in [10] and which, as explained above, actually contains the most misclassifications. The MD-ELM method will also be extended with a class balancing mechanism, allowing it to handle the whole original dataset.


  1. Available at http://www.tsv.fi/julkaisufoorumi/haku.php.
  2. Available at http://www.juuli.fi/?&lng=en.
  3. See http://www.julkaisufoorumi.fi/en/publication-forum/panels.


  1. A. Akusok, D. Veganzones, Y. Miche, K. Björk, P. du Jardin, E. Severin and A. Lendasse (2015) MD-ELM: originally mislabeled samples detection using OP-ELM model. Neurocomputing 159, pp. 242–250.
  2. A. Bryman (2004) Triangulation. In The SAGE Encyclopedia of Social Science Research Methods, pp. 1143–1144.
  3. N. Denzin (1970) Strategies of Multiple Triangulation. The Research Act: A Theoretical Introduction to Sociological Methods, pp. 297–313.
  4. C. Gallicchio, J. D. Martin-Guerrero, A. Micheli and E. Soria-Olivas (26-28 April 2017) Randomized machine learning approaches: Recent developments and challenges. In ESANN 2017 Proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp. 77–86.
  5. G. Huang, H. Zhou, X. Ding and R. Zhang (2012) Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 42 (2), pp. 513–529.
  6. G. Huang, Q. Zhu and C. Siew (2006) Extreme learning machine: theory and applications. Neurocomputing 70 (1), pp. 489–501.
  7. S. Lin, X. Liu, J. Fang and Z. Xu (2015) Is extreme learning machine feasible? A theoretical assessment (part II). IEEE Transactions on Neural Networks and Learning Systems 26 (1), pp. 21–34.
  8. X. Liu, S. Lin, J. Fang and Z. Xu (2015) Is extreme learning machine feasible? A theoretical assessment (part I). IEEE Transactions on Neural Networks and Learning Systems 26 (1), pp. 7–20.
  9. M. Saarela and T. Kärkkäinen (2015) Analysing Student Performance using Sparse Data of Core Bachelor Courses. JEDM-Journal of Educational Data Mining 7 (1), pp. 3–32. ISSN 2157-2100.
  10. M. Saarela, T. Kärkkäinen, T. Lahtonen and T. Rossi (2016) Expert-based versus citation-based ranking of scholarly and scientific publication channels. Journal of Informetrics 10 (3), pp. 693–718.