Qualitative Judgement of Research Impact: Domain Taxonomy as a Fundamental Framework for Judgement of the Quality of Research
The appeal of metric evaluation of research impact has attracted considerable interest in recent times. Although the public at large and administrative bodies are much interested in the idea, scientists and other researchers are much more cautious, insisting that metrics are but an auxiliary instrument to the qualitative peer-based judgement. The goal of this article is to propose availing of such a well positioned construct as domain taxonomy as a tool for directly assessing the scope and quality of research. We first show how taxonomies can be used to analyse the scope and perspectives of a set of research projects or papers. Then we proceed to define a research team or researcher’s rank by those nodes in the hierarchy that have been created or significantly transformed by the results of the researcher. An experimental test of the approach in the data analysis domain is described. Although the concept of taxonomy seems rather simplistic to describe all the richness of a research domain, its changes and use can be made transparent and subject to open discussions.
Keywords: research impact, scientometrics, stratification, rank aggregation, multicriteria decision making, semantic analysis, taxonomy
1 Introduction: The Problem and Background
This article constructively supports the view expressed in the Leiden Manifesto (Hicks et al., 2015), as well as other recent documents such as DORA (Dora, 2013) and Metrics Tide Report (Metric Tide, 2016). All of these advance the principle that assessment of research impact should be made primarily according to qualitative judgment rather than by using citation and similar metrics. It may be maintained, due to the lack of comprehensive recording of process, that the traditional organisation of qualitative judgment via closed committees is prone to bias, mismanagement and corruption. In this work, it is proposed to use domain taxonomies for development of open, transparent and unbiased frameworks for qualitative judgments.
In this article, the usefulness of this principled approach is illustrated by, first, the issue of context based mapping and, second, the issue of assessment of quality of research. We propose the direct evaluation of the quality of research, and this principled approach is innovative. We also demonstrate how it can be deployed by using that part of the hierarchy of the popular ACM Classification of Computer Subjects (ACM, 2016) that relates to data analysis, machine learning and data mining. We define a researcher’s rank by those nodes in the hierarchy that have been created or significantly transformed by the results of the researcher. The approach is experimentally tested by using a sample of leading scientists in the data analysis domain. The approach is universal and can be applied by research communities in other domains.
In part 1 of this work, starting with section 3, the taxonomy is engendered and refined. We express it thus to indicate the strong contextual basis, and how one faces and addresses policy and related requirements. In part 2 of this work, starting with section 5, the issue is a ranking that accounts fully for both quantitative and qualitative performance outcomes.
2 Review of Research Impact Measurement and Critiques
The issue of measuring research impact is attracting intense attention from scientists because metrics of research impact are being widely used by various administrative bodies and by the public at large as easy-to-get shortcuts for assessment of comparative strengths among scientists, research centres, and universities. This is further boosted by the wide availability of digitalized data and by the fact that research has nowadays become a widespread activity. The number of citations and such derivatives as the Hirsch index are produced by a number of organizations including the inventors, currently Thomson Reuters (Thomson Reuters, 2016), Scopus and Google. There is increasing pressure to use these or similar indexes in evaluation and management of research. There have been a number of proposals to amend the indexes, say, by using less extensive characteristics, such as centrality indexes in the inter-citation graphs or by following only citations in the work of “lead scientists” (Aragnón, 2013). Other proposals deny the usefulness of bibliometrics altogether; some propose even such alternative measures as the “careful socialization and selection of scholars, supplemented by periodic self-evaluations and awards” (Osterloh and Frey, 2014), that is, a social- and behavioural-based, administrative, exemplary model. Other, more practical systems, such as the UK Research Assessment Exercise (RAE; now the REF, Research Excellence Framework), intend to assess the most significant contributions only, and in a most informal way, which seems a better option. However, there have been criticisms of the RAE-like systems as well: first, in the absence of a citation index, the peer reviews are not necessarily consistent in evaluations (Eisen et al., 2013), and, second, in the long run, the system itself seems somewhat short-sighted; it has cut off everything which is out of the mainstream (Lee et al., 2013).
There have been a number of recent initiatives undertaken by scientists themselves such as the San-Francisco Declaration DORA (Dora, 2013), Leiden Manifesto (Hicks et al., 2015), The Metrics Tide Report (Metric Tide, 2016). DORA, for example, emphasizes that research impact should be scored over all scientific production elements including data sets, patents, and codes among others (Dora, 2013). Altogether, these declarations and manifestos claim that citation and other metrics should be used as an auxiliary instrument only; the assessment of research quality should be based on “qualitative judgement” of the research portfolio (Hicks et al., 2015). Yet there is no clarity on the practical implementation of these recommendations.
This article is a further step in this direction. Any unbiased consideration of metrics as well as of other systems for assessment of research impact (Eisen et al., 2013; Lee et al., 2013) leads to the conclusion that “qualitative judgment” should be the preferred option (Dora, 2013; Hicks et al., 2015; Metric Tide, 2016). This article points to the concept of domain taxonomy, which should be used as a main tool in the actual organization of assessment of research impact in general and quality of research specifically.
The remainder of this article is organized as follows. We begin by briefly reviewing direct and straightforward application of domain taxonomy, for supporting qualitative judgement. Relating to the policy-related work of a national research funding agency, and to the editorial work of a journal, these preliminary studies were pioneering.
The third section explains how a domain taxonomy can be used for assessing the quality of research. The fourth section provides an experiment in testing the approach empirically. The fifth section compares the taxonomic ranking of our sample of scientists with rankings over citation and merit.
3 Qualitative, Content-Based Mapping, into which the Quantitative Indicators are Mapped
In this section and in the next section, we develop taxonomies using sets of keywords or selected actionable terms. The approach is intended to be, potentially, fully data-driven. Levels of resolution in our taxonomy can be easily formed through term aggregation. Mapping the taxonomy, as a tree endowed with an ultrametric, to a metric space, when using levels of aggregation, provides an approach to having focus (in a general sense, orientation and direction) in the analytics.
Here we give a first example, in which the taxonomies were generated with the goal to provide a tool for open and unbiased qualitative judgment in such contexts as research publishing and research funding. Concept hierarchies can be established by domain experts, and deployed in such contexts as research publishing and research funding.
A short review was carried out of the thematic evolution of The Computer Journal, relating to 377 papers published from January 2000 through to September 2007. The construction of a concept hierarchy, or ontology, was “bootstrapped” from the published articles. The top level terms, child nodes of the concept tree root, were “Systems – Physical”, “Data and Information”, and “Systems – Logical”. It was noted that the category of “bioinformatics” did not require further concept child nodes. A limited set of sub-categories was used for “software engineering”, these being “Design”, “Education”, and “Programming languages”. Under the top level category of “Data and information”, one of the eight child nodes was “Machine learning”, and one of its child nodes was “Plagiarism”. This was justified by the appropriateness of the contents of published work relating to plagiarism. Once the concept hierarchy was set up, the 377 published articles from the seven years under investigation were classified, with mostly two of the taxonomy terms being used for a given article. There was a maximum of four taxonomy terms, and a minimum of one. Table 1 displays the concept hierarchy that was used at that time.
A Correspondence Analysis of this data, here with a focus on the top level themes, presents an interesting and revealing view. A triangle pattern is to be seen, in Figure 1, where Inf is counterposed on the first factor to the two other, more traditional Computer Science themes. Factor 2 counterposes the physical and the logical in the set of published research work. The information displayed in Figure 1 comprises all information, that is the inertia of the cloud of publications, and of the cloud of these top level themes. The year of publication, as a supplementary attribute of the publications, is inactive in the factor space definition, and each is projected into the factor space. We see the movement from year to year, in terms of the top level themes. There is further general discussion in Murtagh (2008).
The perspective described, for archival, scholarly journal publishing, relates to the narrative or thematic evolution of research outcomes.
4 Application of Narrative Analysis to Science and Engineering Policy
The same perspective as described in the previous section was prototyped for the narrative ensuing from national science research funding. The aim here was thematic balance and evolution. Therefore it was complementary to the operational measures of performance – numbers of publications, patents, PhDs, company start-ups, etc. In Murtagh (2010), the full set of large research centres (8 of these, with up to 20 million euro funding) and a class of less large research centres (12 of these, each with 7.5 million euro funding) were mapped into a Correspondence Analysis factor space endowed with the Euclidean metric. In this space are displayed the centres, their themes, and, as a prototyping study, just one attribute of the research centres, their research budget. The first factor clearly counterposed centres for biosciences to centres for telecoms, computing and nanotechnology. The second factor clearly counterposed centres for computing and telecoms to nanotechnology. This is further elaborated in section 4.1.
All in all, there is enormous scope for insight and understanding, that starts from subject matter and content. Quantitative indicators are well accommodated, with their additional or complementary information. It may well be hoped that in the future, qualitative, content-based analytics, coupled with quantitative indicators, will be extended. For this purpose, it may well be very useful to consider not just published research, but all written, and subsequently submitted, research results and/or plans. Similarly for research funding, the content-based mapping and assessment of rejected work is relevant too, not least in order to contextualize the content of all domains and disciplines.
The role of taxonomy is central to the information focusing that is under discussion in this section. Information focusing is carried out through mapping the ontology, or concept hierarchy, as a level of aggregation, corresponding therefore to non-terminal, i.e. non-singleton, nodes. Our interest in this data is to have implications of this for data mining with decision policy support in view.
Consider a fairly typical funded research project, and its phases up to and beyond the funding decision. A narrative can always be obtained, in one form or another, and is likely to be a requirement. All stages of the proposal and successful project life cycle, including external evaluation and internal decision making, are highly document – and as a consequence narrative – based.
As a first step, let us look at the very general role of narrative in national research development. The following comprise our motivation: Overall view, i.e. overall synthesis of information; Orientation of strands of development; Their tempo and rhythm.
Through such an analysis of narrative, among the issues to be addressed are the following: Strategy and its implementation in terms of themes and subthemes represented; Thematic focus and coverage; Organisational clustering; Evaluation of outputs in a global context; All the above over time.
The aim here is to view the “big picture”. It is also to incorporate contextual attributes. These may be the varied performance measures of success that are applied, such as publications, patents, licences, numbers of PhDs completed, company start-ups, and so on. The aim is not to dwell on any one of these measures but rather to appreciate the broader configuration and orientation, and to determine the most salient aspects underlying the data.
4.1 Assessing Coverage and Completeness
SFI Centres for Science, Engineering and Technology (CSETs) are campus-industry partnerships typically funded at up to €20 million over 5 years. Strategic Research Clusters (SRCs) are also research consortia, with industrial partners and over 5 years are typically funded at up to €7.5 million.
We cross-tabulated 8 CSETs and 12 SRCs by a range of 65 terms derived from title and summary information; together with budget, numbers of PIs (Principal Investigators), Co-Is (Co-Investigators), and PhDs. We can display any or all of this information on a common map, for visual convenience a planar display, using Correspondence Analysis.
In mapping SFI CSETs and SRCs, we will now show how Correspondence Analysis is based on the upper (near root) part of an ontology or concept hierarchy. This we view as information focusing. Correspondence Analysis provides simultaneous representation of observations and attributes. Retrospectively, we can project other observations or attributes into the factor space: these are supplementary observations or attributes. A 2-dimensional or planar view is likely to be a gross approximation of the full cloud of observations or of attributes. We may accept such an approximation as rewarding and informative. Another way to address this same issue is as follows. We define a small number of aggregates of either observations or attributes, and carry out the analysis on them. We then project the full set of observations and attributes into the factor space. For mapping of SFI CSETs and SRCs a simple algebra of themes as set out in the next paragraph achieves this goal. The upshot is that the 2-dimensional or planar view is a better fit to the full cloud of observations or of attributes.
From CSET or SRC characterization as: Physical Systems (Phys), Logical Systems (Log), Body/Individual, Health/Collective, and Data & Information (Data), the following thematic areas were defined.
eSciences = Logical Systems, Data & Information
Biosciences = Body/Individual, Health/Collective
Medical = Body/Individual, Health/Collective, Physical Systems
ICT = Physical Systems, Logical Systems, Data & Information
eMedical = Body/Individual, Health/Collective, Logical Systems
eBiosciences = Body/Individual, Health/Collective, Data & Information
This categorization scheme can be viewed as the upper level of a concept hierarchy. It can be contrasted with the somewhat more detailed scheme that we used for analysis of articles in the Computer Journal, (Murtagh, 2008).
CSETs labelled in the Figures are: APC, Alimentary Pharmabiotic Centre; BDI, Biomedical Diagnostics Institute; CRANN, Centre for Research on Adaptive Nanostructures and Nanodevices; CTVR, Centre for Telecommunications Value-Chain Research; DERI, Digital Enterprise Research Institute; LERO, Irish Software Engineering Research Centre; NGL, Centre for Next Generation Localization; and REMEDI, Regenerative Medicine Institute.
In Figure 2 eight CSETs and major themes are shown. Factor 1 counterposes computer engineering (left) to biosciences (right). Factor 2 counterposes software on the positive end to hardware on the negative end. This 2-dimensional map encapsulates 64% (for factor 1) + 29% (for factor 2) = 93% of all information, i.e. inertia, in the dual clouds of points. CSETs are positioned relative to the thematic areas used. In Figure 3, sub-themes are additionally projected into the display. This is done by taking the sub-themes as supplementary elements following the analysis as such. From Figure 3 we might wish to label additionally factor 2 as a polarity of data and physics, associated with the extremes of software and hardware.
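The kind of computation behind such a map can be sketched as follows. This is a minimal illustration of simple Correspondence Analysis through a singular value decomposition, not the software used for the figures. The function names and the small table of counts are hypothetical, and NumPy is assumed to be available. The sketch reports the share of inertia carried by each factor and projects a supplementary row into the factor space after the analysis.

```python
import numpy as np

def correspondence_analysis(table):
    """Simple correspondence analysis of a contingency table via SVD."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardised residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    inertia = sv ** 2                             # principal inertia of each factor
    rows = (U * sv) / np.sqrt(r)[:, None]         # principal coordinates of rows
    cols = (Vt.T * sv) / np.sqrt(c)[:, None]      # principal coordinates of columns
    col_std = Vt.T / np.sqrt(c)[:, None]          # standard coordinates of columns
    return inertia, rows, cols, col_std

def project_supplementary_row(counts, col_std):
    """Place a row that took no part in the analysis into the factor space."""
    profile = np.asarray(counts, dtype=float)
    profile = profile / profile.sum()
    return profile @ col_std

# Hypothetical counts: 4 centres (rows) by 3 thematic areas (columns).
table = [[20, 5, 2],
         [3, 15, 4],
         [2, 6, 18],
         [10, 10, 10]]
inertia, rows, cols, col_std = correspondence_analysis(table)
share = 100 * inertia / inertia.sum()
print("percentage of inertia per factor:", np.round(share, 1))
print("supplementary row on factors 1 and 2:",
      project_supplementary_row([5, 5, 1], col_std)[:2])
```

Projecting a supplementary element uses only the column coordinates, so it leaves the factors themselves unchanged; this is the sense in which sub-themes can be added to the display following the analysis as such.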
4.2 Change Over Time
We take another funding programme, the Research Frontiers Programme, to show how changes over time can be mapped.
This programme follows an annual call, and includes all fields of science, mathematics and engineering. There are approximately 750 submissions annually. There was a 24% success rate (168 awards) in 2007, and 19% (143 awards) in 2008. The average award was €155k in 2007, and €161k in 2008. An award runs for three years of funding, and this is moving to four years in 2009 to accommodate a 4-year PhD duration. We will look at the Computer Science panel results only, over 2005, 2006, 2007 and 2008.
Grants awarded in these years, respectively, were: 14, 11, 15, 17. The breakdown by institutes concerned was: UCD – 13; TCD – 10; DCU – 14; UCC – 6; UL – 3; DIT – 3; NUIM – 3; WIT – 1. These institutes are as follows: UCD, University College Dublin; TCD, Trinity College Dublin; DCU, Dublin City University; UCC, University College Cork; UL, University of Limerick; NUIM, National University of Ireland, Maynooth; DIT, Dublin Institute of Technology; and WIT, Waterford Institute of Technology.
One theme was used to characterize each proposal from among the following: bioinformatics, imaging/video, software, networks, data processing & information retrieval, speech & language processing, virtual spaces, language & text, information security, and e-learning. Again this categorization of computer science can be contrasted with one derived for articles in recent years in the Computer Journal (Murtagh, 2008).
Figures 4, 5 and 6 show different facets of the Computer Science outcomes. By keeping the displays separate, we focus on one aspect at a time. All displays however are based on the same list of themes, and so allow mutual comparisons. Note that the principal plane shown accounts for just 9.5% + 8.9% of the inertia. Although accounting for 18.4% of the inertia, this plane, comprising factors, or principal axes, 1 and 2, accounts for the highest amount of inertia (among all possible planar projections). Ten themes were used, and what the 18.4% information content tells us is that there is importance attached to most if not all of the ten.
4.3 Conclusion on the Policy Case Studies
The aim and objective in our use of the Correspondence Analysis and clustering platform is to drive strategy and its implementation in policy.
What we are targeting is to study highly multivariate, evolving data flows. This is in terms of the semantics of the data – principally, complex webs of interrelationships and evolution of relationships over time. This is the narrative of process that lies behind raw statistics and funding decisions.
5 Domain Taxonomy and Researcher’s Rank for Data Analysis
Here we turn to a domain taxonomy, namely the Computing Classification System maintained and updated by the Association for Computing Machinery (ACM-CCS); the latest release, of 2012, is publicly available at ACM (2012). Parts of ACM-CCS 2012 related to the loosely defined subject of “data analysis”, including “Machine learning” and “Data mining”, up to a rather coarse granularity, are presented in Table 2.
|Subject index||Subject name|
|1.||Theory of computation|
|1.1.||Theory and algorithms for application domains|
|2.||Mathematics of computing|
|2.1.||Probability and statistics|
|3.1.||Data management systems|
|3.2.||Information systems applications|
|3.3.||World Wide Web|
It should be noted that a taxonomy is a hierarchical structure for shaping knowledge. The hierarchy involves just one relation “A is part of B” so that it leaves aside many other aspects of knowledge including, for example, the differences between theoretical interrelations, computational issues and application matters of the same set of concepts. These, however, may sneak in, even if unintentionally, in practice. For example, topics representing “Cluster analysis” occur in the following six branches within the ACM-CCS taxonomy: (i) Theory and algorithms for application domains, (ii) Probability and statistics, (iii) Machine learning, (iv) Design and analysis of algorithms, (v) Information systems applications, (vi) Information retrieval. Among them, (i) and (ii) refer to theoretical work, (iv) to algorithms, (v) and (vi) to applications. Item (iii), Machine learning, probably embraces all of them.
Unlike in biology, the taxonomies of specific research domains cannot be specified exactly because of the changing structure of the domain and, therefore, are subject to much change. For example, if one compares the current ACM Computing Classification System 2012 (ACM, 2012) with its previous version, the ACM Classification of Computing Subjects 1998 which is available at the same site, one cannot help but notice great differences in both the list of sub-domains and the structure of their mutual arrangement.
We consider the set of branches in Table 2 as a taxonomy of its own, referred to below as the Data Analysis Taxonomy (DAT). An extended version of the taxonomy, along with three to four more layers of higher granularity, presented in Mirkin and Orlov (2015, pp. 241-249), will be used throughout for illustration of our approach.
Out of various uses of a domain taxonomy, we pick up here its use for determining a scientist rank according to the rank of that node in the taxonomy which has been created or significantly transformed because of the results by the scientist (Mirkin, 2013).
The concept of taxonomic rank is not uncommon in the sciences. It is quite popular, for example, in biology: “A Taxonomic Rank is the level that an organism is placed within the hierarchical level arrangement of life forms” (see http://carm.org/dictionary-taxonomic-rank). As mentioned in Mirkin and Orlov (2015), Eucaryota is a domain (rank 1) containing the Animals kingdom (rank 2). The latter contains the Chordata phylum (rank 3), which contains the Mammals class (rank 4), which contains the Primates order (rank 5), which contains the Hominidae family (rank 6), which contains the Homo genus (rank 7), which contains, finally, the Homo sapiens species (rank 8). Similarly, the rank of the scientist who created the “World wide web” (Berners-Lee, 2010) (the item 3.3 in Table 2), at layer 2 of the DAT taxonomy, is 2; and the rank of the scientist who developed a sound theory for “Boosting” (Schapire, 1990) (the item 126.96.36.199 in DAT (Mirkin and Orlov, 2015)) is 4, whereas the rank of the scientists who proposed a sound approach to “Topic modeling” (Blei et al., 2003) (the item 188.8.131.52.4 in DAT (Mirkin and Orlov, 2015)) is 5. This specification of taxonomic rank, TR, is associated with qualitative innovation, whereas the dominant current approach is to only reward or take account of low rank, and particular, topic items.
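As a small illustration of how a rank follows from a taxon's position, the sketch below reads the layer directly off a dotted subject index of the kind used in Table 2. The function name `layer_of` is hypothetical, and the sketch assumes that each dot-separated component of an index corresponds to one layer of the hierarchy, which agrees with item 3.3 lying at layer 2.

```python
def layer_of(index: str) -> int:
    """Layer of a taxon, taken as the number of components in its dotted index.

    Trailing dots, as in "1.1." in Table 2, are ignored.
    """
    return len([part for part in index.split(".") if part])

for index in ["1.", "1.1.", "3.3"]:
    print(index, "-> layer", layer_of(index))
```

Under this reading, a scientist whose result created or significantly transformed the taxon with index 3.3 receives rank 2.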
Using taxonomic ranks (TRs) based on domain taxonomies for evaluating the quality of research differs from the other methods currently available, through the following features:
The TR method directly measures the quality of results themselves rather than any related feature such as popularity;
The TR evaluation is well subject-focused; a scientist with good results in optimization may get rather modest evaluation in data analysis because a taxonomy for data analysis would not include high-level nodes on optimization;
The TR rank can get reversed if the taxonomy is modified so that the rank-giving taxon gets a less favourable location in the hierarchical tree;
The granularity of evaluation can be changed by increasing the granularity of the underlying taxonomy;
The TR evaluations in different domains can be made comparable by using taxonomies of the same depth;
The maintenance of a domain taxonomy can be effectively organized by a research community as a special activity subject to regular checking and scrutinising;
Assigning the TR to a scientist or their result(s) is derived from mapping them to a sub-domain that has been significantly affected by them, and this is not a simple issue. The persons who do the mapping must be impartial and have deep knowledge of the domain and the results.
The last two items in the list above refer to the core of the proposal in this paper. They can be considered a clarification of the main claim made by scientists about the evaluation of research impact: qualitative considerations should prevail over metrics (Dora, 2013; Hicks et al., 2015; Metric Tide, 2016). Here the wide meaning of “qualitative” is reduced to two points: (a) developing and maintaining a taxonomy, and (b) mapping results to the taxonomy. Both taxonomy development and mapping decisions involve explicitly stated judgements which can be discussed openly and corrected if needed. This differs greatly from the currently employed procedures of peer-reviewing, which can be highly subjective and dependent on various external considerations (Eisen et al., 2013; Engels et al., 2013; Van Raan, 2006). The activity of developing and maintaining taxonomies can be left to governmental agencies and funding bodies, or to scholarly academies, or to discipline and sub-discipline expert organisational bodies, whereas the mapping activity should be left, in a transparent way, to scientific discussions involving all relevant individuals. Of course, there is potential for further developments of the formats: taxonomies could be extended to include various aspects characterizing research developments, and mapping can be softened up to include spontaneous and uncertain judgements.
6 A Prototype of Empirical Testing
We focus on the field of Computer Science related to data analysis, machine learning, cluster analysis and data mining, along with its taxonomy derived from the ACM Computing Classification System 2012 (ACM, 2012), as explained above. We pick a sample of 30 leading scientists in the field (about half from the USA, with other, mostly European, countries represented by 2–3 scientists each), such that the information on their research results is publicly available. Although we tried to predict the leaders, their Google-based citation indexes differ greatly, from a few thousand to a hundred thousand. We picked the 4–6 most important papers by each of the sampled scientists and manually mapped each paper to the taxa significantly affected by it. Since some of the relevant subjects, such as “Consensus clustering” and “Anomaly detection”, are not present in the ACM-CCS, we added them to DAT (Data Analysis Taxonomy) as leaves, implying that a previous terminal node becomes a non-terminal node. The results of the mapping are presented in Table 3. The table also presents the derived taxonomic ranks and the same ranks, 0–100 normalized. To derive the taxonomic rank of a scientist, we first take the minimum of their ranks as the base rank. Then we subtract from it as many one tenths as there are subdomains of that rank in their list and as many one hundredths as there are subdomains of greater ranks in the list. For example, the list of S23 comprises ranks 4, 5, 4, leading to 4 as the base rank. Subtraction of two tenths and one hundredth from 4 gives the derived rank 3.79. The normalization is such that the minimum rank, 3.50, gets a mark of 100, and the maximum rank, 4.89, gets a mark of 0. The last column, the stratum, is assigned according to the distance of the mark to either 70 or 30 or 0.
|Scientist||Mapping to taxonomy||Layers||Tr||Trn||Stratum|
|S1||184.108.40.206, 220.127.116.11.7, 18.104.22.168.7||4,5,5||3.88||73||1|
|S2||22.214.171.124, 126.96.36.199, 188.8.131.52, 184.108.40.206, 220.127.116.11||4,4,4,4,4||3.50||100||1|
|S3||18.104.22.168.2, 22.214.171.124.3, 126.96.36.199.7, 188.8.131.52.4 ,|
|S4||184.108.40.206.3, 220.127.116.11, 18.104.22.168.1,22.214.171.124.7,|
|S5||126.96.36.199.4, 188.8.131.52.4, 184.108.40.206.5, 220.127.116.11.6, 18.104.22.168.1||5,5,5,5,5||4.50||29||2|
|S6||22.214.171.124.2, 126.96.36.199.1, 188.8.131.52.1 ,|
|S8||184.108.40.206.1, 220.127.116.11.1, 18.104.22.168.1, 22.214.171.124.1, 126.96.36.199.4||5,5,5,5,5||4.50||29||2|
|S9||188.8.131.52.3, 184.108.40.206.2, 220.127.116.11.1, 18.104.22.168.4,|
|S17||22.214.171.124.1, 126.96.36.199.1, 188.8.131.52.5., 184.108.40.206.4,|
|220.127.116.11.4, 18.104.22.168.3.2, 22.214.171.124.4., 126.96.36.199.1||5,5,5,5,6,5,5||4.39||36||2|
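The derivation of the Tr, Trn and Stratum columns can be sketched in a few lines. The rule for the derived rank follows the description given before Table 3. Two points are assumptions of this sketch rather than statements of the text: the normalization is taken to be linear between the two stated endpoints, and a mark equally distant from two of the anchors 70, 30 and 0 is assigned to the lower-numbered stratum. The function names are hypothetical.

```python
def taxonomic_rank(layers):
    """Derived taxonomic rank from the layers of the taxa mapped to a scientist."""
    base = min(layers)
    n_base = sum(1 for layer in layers if layer == base)     # taxa at the base rank
    n_deeper = sum(1 for layer in layers if layer > base)    # taxa at greater ranks
    # Work in hundredths so that the result is not affected by float drift.
    return (100 * base - 10 * n_base - n_deeper) / 100

def normalise(rank, best=3.50, worst=4.89):
    """0-100 mark: `best` maps to 100, `worst` to 0 (linear scale assumed)."""
    return round(100 * (worst - rank) / (worst - best))

def stratum(mark, anchors=(70, 30, 0)):
    """Stratum 1, 2 or 3 by the nearest anchor (tie rule is an assumption)."""
    return 1 + min(range(len(anchors)), key=lambda k: abs(mark - anchors[k]))

for name, layers in {"S1": [4, 5, 5], "S2": [4, 4, 4, 4, 4], "S23": [4, 5, 4]}.items():
    rank = taxonomic_rank(layers)
    mark = normalise(rank)
    print(name, rank, mark, stratum(mark))
```

For S23 the sketch gives the derived rank 3.79 of the worked example, and for S1 and S2 it gives the ranks 3.88 and 3.50 listed in Table 3.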
7 Comparing Taxonomic Rank with Citation and Merit
We compared our taxonomic ranks with more conventional criteria: (a) Citation and (b) Merit. The Citation criterion was derived from Google-based indexes of the total number of citations, the number of works receiving 10 or more citations, and Hirsch index h, the number of papers receiving h citations or more. The merit criterion was computed from data on the following three indices: the number of successful PhDs (co)-supervised, the number of conferences co-organized, and the number of journals for which the researcher-scientist is a member of the Editorial Board.
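For concreteness, the three citation indexes can be computed from a list of per-paper citation counts as in the sketch below. The function names and the sample counts are hypothetical; the Hirsch index is computed in its usual form, as the largest h such that h papers have at least h citations each.

```python
def total_citations(citations):
    """Total number of citations over all papers."""
    return sum(citations)

def works_with_10_or_more(citations):
    """Number of works receiving 10 or more citations."""
    return sum(1 for count in citations if count >= 10)

def hirsch_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for position, count in enumerate(sorted(citations, reverse=True), start=1):
        if count < position:
            break
        h = position
    return h

# Hypothetical citation counts for the papers of one scientist.
counts = [25, 8, 5, 3, 3]
print(total_citations(counts), works_with_10_or_more(counts), hirsch_index(counts))
```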
To aggregate the indexes into a convex combination, that is, a weighted sum, automatically, a principled approach which works for correlated or inconsistent criteria has been developed. According to this approach, given the number of strata (in our case 3), the aggregate criterion is to be found so that its direction in the index space is such that all the observations are projected into compact well-separated clusters along this direction (Mirkin and Orlov, 2013, 2015).
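To be more specific, consider a scientist-to-criteria data matrix $X = (x_{ij})$, where $i = 1, \ldots, N$ are indices of scientists, $j = 1, \ldots, M$ are indices of criteria, and $x_{ij}$ is the score of the $j$th criterion for the $i$th scientist. Let us consider a weight vector $w = (w_1, \ldots, w_M)$ such that $w_j \geq 0$ for every $j$ and $\sum_{j=1}^{M} w_j = 1$, for the set of $M$ criteria. Then the combined criterion is $f = \sum_{j=1}^{M} w_j x_j$, where $x_j$ is the $j$th column of matrix $X$. The problem is to find $K$ disjoint subsets $S = \{S_1, \ldots, S_K\}$ of the set of indices $i$, referred to as strata, according to values of the combined criterion $f$. Each stratum $S_k$ is characterized by a value $c_k$ of the combined criterion, the stratum centre. Geometrically, strata are formed by layers between parallel planes in the space of criteria. At any stratum $S_k$, we want the value of the combined criterion $f_i = \sum_{j=1}^{M} w_j x_{ij}$ at any $i \in S_k$ to approximate the stratum centre $c_k$. In other words, in the equations $\sum_{j=1}^{M} w_j x_{ij} = c_k + e_i$, the $e_i$ are errors to be minimized over vector $w$. A least-squares formulation of the linear stratification (LS) problem: find a vector $w$, a set of centres $\{c_k\}$ and a partition $S$ to solve the problem in (1), as follows.
$$\min_{w,\,c,\,S} \; \sum_{k=1}^{K} \sum_{i \in S_k} \Big( \sum_{j=1}^{M} w_j x_{ij} - c_k \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{M} w_j = 1, \quad w_j \geq 0 \ \text{for all } j. \tag{1}$$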
This problem can be tackled using the alternating minimization approach, conventional in cluster analysis. For any given weight vector $w$, the criterion in (1) is just the conventional square-error clustering criterion of the $K$-means clustering algorithm over a single feature, the combined criterion $f$. Finding an appropriate $w$ at a given stratification can be done by using standard quadratic optimization software.
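The alternating scheme can be sketched for the special case of two criteria, where the weight vector is w = (t, 1 - t) and the weight update has a closed form, so no optimization package is needed. This is an illustrative sketch under that restriction, not the authors' implementation. The function name, the evenly spread initial centres and the toy data are all assumptions, and the scores are invented rather than taken from Table 4. With more than two criteria the weight step would need a quadratic programming solver.

```python
def stratify_two_criteria(X, K, t0=0.5, n_iter=50):
    """Alternating minimisation for linear stratification with two criteria.

    X is a list of (x1, x2) score pairs and K >= 2 is the number of strata.
    Returns (t, centres, labels, error) for the weight vector w = (t, 1 - t).
    """
    t = t0
    f = [t * a + (1 - t) * b for a, b in X]
    lo, hi = min(f), max(f)
    # Assumed initialisation: K centres spread evenly over the range of f.
    centres = [lo + (hi - lo) * k / (K - 1) for k in range(K)]
    labels = None
    for _ in range(n_iter):
        f = [t * a + (1 - t) * b for a, b in X]
        # Step 1: assign every observation to the nearest stratum centre.
        new_labels = [min(range(K), key=lambda k: (fi - centres[k]) ** 2) for fi in f]
        # Step 2: move each centre to the mean of its stratum (kept if empty).
        for k in range(K):
            members = [fi for fi, lab in zip(f, new_labels) if lab == k]
            if members:
                centres[k] = sum(members) / len(members)
        # Step 3: least-squares update of t for fixed strata and centres.
        # The problem is one-dimensional and convex, so clipping to [0, 1] is exact.
        num = sum((a - b) * (centres[lab] - b) for (a, b), lab in zip(X, new_labels))
        den = sum((a - b) ** 2 for a, b in X)
        new_t = t if den == 0 else min(1.0, max(0.0, num / den))
        converged = new_labels == labels and abs(new_t - t) < 1e-12
        labels, t = new_labels, new_t
        if converged:
            break
    f = [t * a + (1 - t) * b for a, b in X]
    error = sum((fi - centres[lab]) ** 2 for fi, lab in zip(f, labels))
    return t, centres, labels, error

# Invented scores: eight observations lying on three parallel lines x1 + x2 = const.
X = [(0, 2), (2, 0), (1, 1), (2, 4), (4, 2), (3, 3), (4, 6), (6, 4)]
t, centres, labels, error = stratify_two_criteria(X, K=3, t0=0.3)
print("weights:", (t, 1 - t), "centres:", centres, "strata:", labels, "error:", error)
```

Each of the three steps can only lower or keep the value of the criterion in (1), so the procedure converges, although, as with K-means in general, possibly to a local minimum that depends on the starting weights.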
To illustrate the approach as it is and, also, its difference from the widely used Principal Component Analysis (PCA) approach to linearly combining criteria, let us consider the following example. In Table 4, scores of two criteria over 8 scientists are presented.
Although criteria values are usually normalized to a 0–100% scale, we do not do that here to keep things simple. It appears that the data fall ideally, with zero error, into three strata, as shown in Figure 7, according to the combined criterion found by linear stratification. In contrast, the PCA-based linear combination admits a residual of 13.4% of the total data scatter and leads to a somewhat different ordering, in which two scientists of the top stratum get lower aggregate scores than two scientists of the B stratum.
For convenience, the combined criteria scores are presented in Table 5.
In the thus aggregated Citation criterion, the Hirsch index received a zero coefficient, while the other two indices received one half each. The zeroing of the Hirsch index weight is in line with the overwhelming critiques this index has been exposed to in recent times (Albert, 2013; Osterloh and Frey, 2014; Dora, 2013; Van Raan, 2006). A similarly aggregated Merit criterion is formed with weights 0.22 for the number of PhD students, 0.10 for the number of conferences, and 0.69 for the number of journals, which is consistent with the prevailing practice of maintaining a demanding and fair submission reviewing process in leading journals.
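As a small illustration of how such weights act, the sketch below applies the reported coefficients to one scientist's index scores. The dictionary keys and the scores are hypothetical, and the scores are assumed to be already normalized to a 0–100 scale. The reported Merit weights sum to 1.01 because of rounding.

```python
# Weights as reported in the text; the index names are hypothetical labels.
CITATION_WEIGHTS = {"total_citations": 0.5, "i10": 0.5, "h_index": 0.0}
MERIT_WEIGHTS = {"phd_students": 0.22, "conferences": 0.10, "journals": 0.69}


def aggregate(scores, weights):
    """Weighted sum of index scores (assumed pre-normalized to 0-100)."""
    return sum(weights[name] * scores[name] for name in weights)


# Invented normalized scores for one scientist (not data from the study).
citation = aggregate(
    {"total_citations": 40, "i10": 60, "h_index": 90}, CITATION_WEIGHTS
)
merit = aggregate(
    {"phd_students": 50, "conferences": 20, "journals": 100}, MERIT_WEIGHTS
)
```

Because the Hirsch index has a zero weight, changing its score in this sketch leaves the aggregated Citation value unchanged.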
To compare these scales, let us compute Pearson correlation coefficients between them, see Table 6.
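For reference, a Pearson coefficient between two scales can be computed as in the following minimal sketch. The function names and the scores are hypothetical and do not reproduce the values in Table 6.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)


def correlation_table(scales):
    """Pairwise Pearson coefficients for a dict mapping scale name to scores."""
    names = list(scales)
    return {
        (r, c): pearson(scales[r], scales[c]) for r in names for c in names
    }


# Invented scores of four scientists on three scales (not the study's data).
scales = {
    "Taxonomic rank": [1, -1, -1, 1],
    "Citation": [1, 2, 3, 4],
    "Merit": [2, 3, 5, 9],
}
table = correlation_table(scales)
```

The function assumes that neither list is constant, since a zero variance would make the coefficient undefined.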
As expected, the Citation and Merit criteria do not correlate with the Taxonomic rank of the scientists. On the other hand, the traditional Citation and Merit criteria are somewhat positively correlated, probably because they both relate to the popularity of a scientist.
Assessments can be carried out at different levels: a region, an organization, a team or an individual researcher; within a domain or across domains. Analysis of quality brings to our attention what we can metaphorically call wider horizons. Among the recommendations arising from this work, on the regional level, there are three on the particular subjects of our concern:
Set out a more structured and strategic process for proposing projects.
Conduct a systematic analysis of the existing infrastructure.
Take a more systematic approach to evaluating the impact of operational projects.
With these recommendations, we emphasize the importance of these underpinning themes, which should be pursued assertively for journals and other scholarly publishing, and also for research funding programmes.
We both observe and demonstrate that evaluation of research, especially at the level of teams or individuals, can be organized by, firstly, developing and maintaining a taxonomy of the relevant subdomains and, secondly, a system for mapping research results to those subdomains that have been created or significantly transformed because of these research results. This would bring a well-defined meaning to the widely held opinion that research impact should be evaluated, first of all, based on qualitative considerations. Further steps can be, and should be, undertaken in the directions of developing and maintaining a system for assessment of the quality of research across all areas of knowledge. Of course, developing and/or incorporating systems for other elements of research impact, viz., knowledge transfer, industrial applications, social interactions, etc., are to be taken into account also. In covering both qualitative and quantitative research outcomes comprehensively, at least five aspects of an individual researcher’s research impact can be distinguished:
Research and presentation of results (number, quality)
Research functioning (journal/volume editing, running research meetings, reviewing)
Teaching (knowledge transfer, knowledge discovery)
Technology innovations (programs, patents, consulting)
Societal interactions (popularization, getting feedback)
Many, if not all, of the items in this list can be maintained by developing and using corresponding taxonomies. The development of a system of taxonomies for the health system in the USA, IHTSDO SNOMED CT (SNOMED CT, 2016), extended now to many other countries, and languages, should be considered an instructive example of such a major undertaking.
This suggests directions for future work. Among them are the following.
In methods: (i) Enhancing the concept of taxonomy by including theoretical, computational, and industrial facets, as well as dynamic aspects; (ii) Developing methods for relating the texts of papers, viz. their content, to taxonomies; (iii) Developing methods for taxonomy building using such research paper texts, i.e. content; (iv) Developing methods for mapping research results to taxonomy units affected by them; (v) Using our prototyping here, developing comprehensive methods for ranking the impact of results to include expert-driven components; (vi) Also based on our prototyping here, developing accessible and widely used methods for aggregate rankings.
In substance: (i) Developing and maintaining a permanent system for assessment of the scope and quality of research at different levels; (ii) Developing a system of domains in research subjects and their taxonomies; (iii) Cataloguing researchers, research and funding bodies, and research results; (iv) Creating a platform and forums for discussing taxonomies, results and assessments.
A spin-off of our main motivation for qualitative analytics is a proposal to use the full potential of research efforts at a regional level. In our journal editorial roles, we realise very well that the sometimes quite predictable rejection of article submissions can raise such questions as the following: is there no qualitative interest at all in such work? How can, or how should, improvement be recommended? At least as important, and far more so in terms of wasted energy and effort, is the qualitative analysis of rejected research funding proposals. (As is well known, only a relatively small proportion of research projects gets a “go ahead” nod. For example, the European Horizon 2020 FET-Open (Future Emerging Technologies) proposal submission of September 2015 resulted in a success rate of less than 2% (FET, 2016): 13 successful research proposals out of 822 proposal submissions.) Given the workload at issue, on various levels and from various vantage points, there is potential for data mining and knowledge discovery in the vast numbers of rejected research funding proposals. Ultimately, and given the workload undertaken, it is both potentially of benefit, and justified, to carry out such analytics.
References
ABRAMO, G., CICERO, T., D’ANGELO, C.A. (2013). “National peer-review research assessment exercises for the hard sciences can be a complete waste of money: the Italian case”, Scientometrics, 95(1), 311–324.
ACM (2012). The 2012 ACM Computing Classification System, http://www.acm.org/about/class/2012 (Viewed 2017-02-05).
ALBERT, B. (2013). “Impact factor distortions”, Science, 340, no. 6134, 787.
ARAGÓN, A.M. (2013). “A measure for the impact of research”, Scientific Reports 3, Article number: 1649.
BERNERS-LEE, T. (2010). “Long live the Web”, Scientific American. 303 (6). 80–85.
BLEI, D.M., NG, A.Y., JORDAN, M.I., LAFFERTY, J. (2003). “Latent Dirichlet allocation”, Journal of Machine Learning Research. 3: 993–1022.
CANAVAN, J., GILLEN, A., SHAW, A. (2009). “Measuring research impact: developing practical and cost-effective approaches”, Evidence and Policy: A Journal of Research, Debate and Practice, 5.2. 167–177.
DORA (2013). San Francisco Declaration on Research Assessment (DORA), www.ascb.org/dora (viewed 2017-02-05).
EISEN, J.A., MACCALLUM, C.J., NEYLON, C. (2013). “Expert failure: Re-evaluating research assessment”. PLoS Biology, 11(10): e1001677.
ENGELS, T.C., GOOS, P., DEXTERS, N., SPRUYT, E.H. (2013). “Group size, h-index, and efficiency in publishing in top journals explain expert panel assessments of research group quality and productivity”. Research Evaluation, 22(4), 224–236.
FET (2016). “FET-Open: 3 new proposals start preparation for Grant Agreements”, Future Emerging Technologies Newsletter, 21 March 2016.
HICKS, D., WOUTERS, P., WALTMAN, L., DE RIJCKE, S., RAFOLS, I. (2015). “The Leiden Manifesto for research metrics”. Nature, 520, 429–431.
SNOMED CT (2016). IHTSDO, International Health Terminology Standards Development Organisation, SNOMED CT, Systematized Nomenclature of Medicine, Clinical Terms. http://www.ihtsdo.org/snomed-ct (viewed 2017-02-05).
LEE, F.S., PHAM, X., GU, G. (2013). “The UK research assessment exercise and the narrowing of UK economics”. Cambridge Journal of Economics, 37(4), 693–717.
METRIC TIDE (2016). “The Metric Tide: Report of the Independent Review of the Role of Metrics in Research Assessment and Management”, http://www.hefce.ac.uk/pubs/rereports/Year/2015/metrictide/Title,104463,en.html (viewed 2017-02-05).
MIRKIN, B. (2013). “On the notion of research impact and its measurement”, Control in Large Systems, Special Issue: Scientometry and Experts in Managing Science. 44. 292–307, Institute of Control Problems, Moscow (in Russian).
MIRKIN, B., ORLOV, M. (2013). “Methods for Multicriteria Stratification and Experimental Comparison of Them”, Preprint WP7/2013/06, Higher School of Economics, Moscow, 31 pp. (in Russian).
MIRKIN, B., ORLOV, M. (2015). “Three aspects of the research impact by a scientist: measurement methods and an empirical evaluation”, in A. Migdalas, A. Karakitsiou, Eds., Optimization, Control, and Applications in the Information Age, Springer Proceedings in Mathematics and Statistics. 130. 233–260.
MURTAGH, F. (2008). “Editorial”. The Computer Journal, 51(6), 612–614.
MURTAGH, F. (2010). “The Correspondence Analysis platform for uncovering deep structure in data and information”. The Computer Journal, 53(3), 304–315.
NG, W.L. (2007). “A simple classifier for multiple criteria ABC analysis”. European Journal of Operational Research. 177. 344–353.
ORLOV, M., MIRKIN, B. (2014). “A concept of multicriteria stratification: A definition and solution”, Procedia Computer Science, 31, 273–280.
OSTERLOH, M., FREY, B.S. (2014). “Ranking games”. Evaluation review, Sage, pp. 1–28.
RAMANATHAN, R. (2006). “Inventory classification with multiple criteria using weighted linear optimization”, Computers and Operations Research. 33. 695–700.
SCHAPIRE, R.E. (1990). “The strength of weak learnability”. Machine Learning. 5(2), 197–227.
SIDIROPOULOS, A., KATSAROS, D., MANOLOPOULOS, Y. (2014). “Identification of influential scientists vs. mass producers by the perfectionism index”. Preprint, ArXiv:1409.6099v1, 27 pp.
SUN, Y., HAN, J., ZHAO, P., YIN, Z., CHENG, H., WU, T. (2009). “RankClus: integrating clustering with ranking for heterogeneous information network analysis”. EDBT ’09 Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, ACM, NY, 565–576.
THOMSON REUTERS (2016). “Thomson Reuters intellectual property and science”. (Acquisition of the Thomson Reuters Intellectual Property and Science Business by Onex and Baring Asia Completed. Independent business becomes Clarivate Analytics) http://ip.thomsonreuters.com (Viewed 2017-02-05).
UNIVERSITY GUIDE (2016). “The Complete University League Guide”. http://www.thecompleteuniversityguide.co.uk/league-tables/methodology (viewed 2017-02-05).
VAN RAAN, A.F. (2006). “Comparison of the Hirsch-index with standard bibliometric indicators and with peer judgment for 147 chemistry research groups”. Scientometrics, 67(3), 491–502.