Statistical Modelling of Citation Exchange Between Statistics Journals
Abstract
Rankings of scholarly journals based on citation data are often met with skepticism by the scientific community. Part of the skepticism is due to disparity between the common perception of journals’ prestige and their ranking based on citation counts. A more serious concern is the inappropriate use of journal rankings to evaluate the scientific influence of authors. This paper focuses on analysis of the table of cross-citations among a selection of Statistics journals. Data are collected from the Web of Science database published by Thomson Reuters. Our results suggest that modelling the exchange of citations between journals is useful to highlight the most prestigious journals, but also that journal citation data are characterized by considerable heterogeneity, which needs to be properly summarized. Inferential conclusions require care in order to avoid potential overinterpretation of insignificant differences between journal ratings. Comparison with published ratings of institutions from the UK’s Research Assessment Exercise shows strong correlation at aggregate level between assessed research quality and journal citation ‘export scores’ within the discipline of Statistics.
C. Varin, Manuela Cattelan, and David Firth
1 Introduction
The problem of ranking scholarly journals has arisen partly as an economic matter. When the number of scientific journals started to increase, librarians were faced with decisions as to which journal subscriptions should consume their limited economic resources; a natural response was to be guided by the relative importance of different journals according to a published or otherwise agreed ranking. Gross and Gross (1927) proposed the counting of citations received by journals as a direct measure of their importance. Garfield (1955) suggested that the number of citations received should be normalized by the number of citable items published by a journal. This idea is at the origin of the Impact Factor, the best known index for ranking journals. Published since the 1960s, the Impact Factor is ‘an average citation rate per published article’ (Garfield, 1972).
The Impact Factor of the journals where scholars publish has also been employed (improperly, many might argue) in appointing to academic positions, in awarding research grants, and in ranking universities and their departments. The San Francisco Declaration on Research Assessment (2013) and the IEEE Position Statement on Appropriate Use of Bibliometric Indicators for the Assessment of Journals, Research Proposals, and Individuals (IEEE Board of Directors, 2013) are just two of the most recent authoritative standpoints regarding the risks of automatic, metric-based evaluations of scholars. Typically, only a small fraction of all published articles accounts for most of the citations received by a journal (Seglen, 1997). Single authors should ideally be evaluated on the basis of their own outputs and not through citations of other papers that have appeared in the journals where their papers have been published (Seglen, 1997; Adler et al., 2009; Silverman, 2009). As stated in a recent Science editorial about Impact Factor distortions (Alberts, 2013):
‘(…) the leaders of the scientific enterprise must accept full responsibility for thoughtfully analyzing the scientific contributions of other researchers. To do so in a meaningful way requires the actual reading of a small selected set of each researcher’s publications, a task that must not be passed by default to journal editors’.
Indicators derived from citations received by papers written by a particular author (e.g., Bornmann and Marx, 2014) can be a useful complement for evaluation of trends and patterns of that author’s impact, but not a substitute for the reading of papers.
Journal rankings based on the Impact Factor often differ substantially from common perceptions of journal prestige (Theoharakis and Skordia, 2003; Arnold and Fowler, 2011). Various causes of such discrepancy have been pointed out. First, there is the phenomenon that more ‘applied’ journals tend to receive citations from other scientific fields more often than do journals that publish theoretical work. This may be related to uncounted ‘indirect citations’ arising when methodology developed in a theoretical journal is then popularized by papers published in applied journals accessible to a wider audience and thus receiving more citations than the original source (JournalRanking.com, 2007; Putirka et al., 2013). Second is the short time period used for computation of the Impact Factor, which can be completely inappropriate for some fields, in particular for Mathematics and Statistics (van Nierop, 2009; Arnold and Fowler, 2011). Finally, there is the risk of manipulation, whereby authors might be asked by journal editors to add irrelevant citations to other papers published in their journal (Sevinc, 2004; Frandsen, 2007; Archambault and Larivière, 2009; Arnold and Fowler, 2011). According to a large survey published in Science (Wilhite and Fong, 2012), about of academics in social-science and business fields have been asked to ‘pad their papers with superfluous references to get published’ (van Noorden, 2012). The survey data also suggest that junior faculty members are more likely to be pressured to cite superfluous papers. Recently, Thomson Reuters has started publishing the Impact Factor both with and without journal self-citations, thereby allowing evaluation of the contribution of self-citations to the Impact Factor calculation. Moreover, Thomson Reuters has occasionally excluded journals with an excessive self-citation rate from the Journal Citation Reports.
Given the above criticisms, it is not surprising that the Impact Factor and other ‘quantitative’ journal rankings have given rise to substantial skepticism about the value of citation data. Several proposals have been developed in the bibliometric literature to overcome the weaknesses of the Impact Factor; examples include the Article Influence Score (Bergstrom, 2007; West, 2010), the H index for journals (Braun et al., 2006; Pratelli et al., 2012), the Source Normalized Impact per Paper (SNIP) index (Waltman et al., 2013), and methods based on percentile rank classes (Marx and Bornmann, 2013).
The aforementioned Science editorial (Alberts, 2013) reports that
‘(…) in some nations, publication in a journal with an impact factor below 5.0 is officially of zero value.’
In the latest edition (2013) of the Journal Citation Reports, the only journal with an Impact Factor larger than 5 in the category Statistics and Probability was the Journal of the Royal Statistical Society Series B, with Impact Factor 5.721. The category Mathematics achieved still lower Impact Factors, with the highest value there in 2013 being for Communications on Pure and Applied Mathematics. Several bibliometric indicators have been developed, or adjusted, to allow for cross-field comparisons, e.g. Leydesdorff et al. (2013), Waltman and Van Eck (2013), and could be considered to alleviate unfair comparisons. However, our opinion is that comparisons between different research fields will rarely make sense, and that such comparisons should be avoided. Research fields differ very widely, for example in terms of the frequency of publication, the typical number of authors per paper and the typical number of citations made in a paper, as well as in the sizes of their research communities. Journal homogeneity is a minimal prerequisite for a meaningful statistical analysis of citation data (Lehmann et al., 2009).
Journal citation data are unavoidably characterized by substantial variability (e.g., Amin and Mabe, 2000). A clear illustration of this variability, suggested by the Associate Editor of this paper, comes from an early editorial of Briefings in Bioinformatics (Bishop and Bird, 2007) announcing that this new journal had received an Impact Factor of . However, the editors noted that a very large fraction of the journal’s citations came from a single paper; if that paper were to be dropped, then the journal’s Impact Factor would decrease to about . The variability of the Impact Factor is inherently related to the heavy-tailed distribution of citation counts. Averaged indicators such as the Impact Factor are clearly unsuitable for summarizing highly skew distributions. Nevertheless, quantification of uncertainty is typically lacking in published rankings of journals. A recent exception is Chen et al. (2014), who employ a bootstrap estimator for the variability of journal Impact Factors. Also the SNIP indicator published by Leiden University’s Centre for Science and Technology Studies based on the Elsevier Scopus database, and available online at www.journalindicators.com, is accompanied by a ‘stability interval’ computed via a bootstrap method. See also Hall and Miller (2009, 2010) and references therein for more details on statistical assessment of the authority of rankings.
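The sensitivity of an averaged indicator to a single heavily cited paper can be illustrated with a small sketch. It applies a generic percentile bootstrap to the mean of per-article citation counts; it is not the estimator of Chen et al. (2014), nor the SNIP stability interval. The function name `bootstrap_mean_interval` and the toy counts are hypothetical and chosen only for illustration.

```python
import random
import statistics


def bootstrap_mean_interval(counts, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean citation count per article."""
    rng = random.Random(seed)
    n = len(counts)
    # Resample the articles with replacement and record each resampled mean.
    means = sorted(
        statistics.fmean(rng.choices(counts, k=n)) for _ in range(n_boot)
    )
    lower = means[int((1 - level) / 2 * n_boot)]
    upper = means[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Hypothetical toy data: 49 modestly cited articles plus one blockbuster.
toy_counts = [0, 1, 2, 3, 4, 5, 6] * 7 + [900]

mean_all = statistics.fmean(toy_counts)               # average with the blockbuster
mean_without_top = statistics.fmean(toy_counts[:-1])  # average without it
lower, upper = bootstrap_mean_interval(toy_counts)
```

In this toy example one article moves the average citation rate from 3 to about 21, and the bootstrap interval is correspondingly wide, which is the kind of instability that a point estimate alone conceals.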
The Impact Factor was developed to identify which journals have the greatest influence on subsequent research. The other metrics mentioned in this paper originated as possible improvements on the Impact Factor, with the same aim. Palacios-Huerta and Volij (2004) list a set of properties that a ranking method which measures the intellectual influence of journals, by using citation counts, should satisfy. However, the list of all desirable features of a ranking method might reasonably be extended to include features other than citations, depending on the purpose of the ranking. For example, when librarians decide which journals to take, they should consider the influence of a journal in one or more research fields, but they may also take into account its cost-effectiveness. The website www.journalprices.com, maintained by Professors Ted Bergstrom and Preston McAfee, ranks journals according to their price per article, price per citation, and a composite index.
A researcher, when deciding where to submit a paper, most likely considers each candidate journal’s record of publishing papers on similar topics, and the importance of the journal in the research field; but he/she may also consider the speed of the reviewing process, the typical time between acceptance and publication of the paper, possible page charges, and the likely effect on his/her own career. Certain institutions and national evaluation agencies publish rankings of journals which are used to evaluate researcher performance and to inform the hiring of new faculty members. For various economics and management-related disciplines, the Journal Quality List, compiled by Professor Anne-Wil Harzing and available at www.harzing.com/jql.htm, combines more than 20 different rankings made by universities or evaluation agencies in various countries. Such rankings typically are based on bibliometric indices, expert surveys, or a mix of both.
Modern technologies have fostered the rise of alternative metrics such as ‘webometrics’ based on citations on the internet or numbers of downloads of articles. Recently, interest has moved from web-citation analysis to social-media usage analysis. In some disciplines the focus is now towards broader measurement of research impact through the use of web-based quantities such as citations in social-media sites, newspapers, government policy documents, blogs, etc. This is mainly implemented at the level of individual articles, see for example the Altmetric service (Adie and Roe, 2013) available at www.altmetric.com, but the analysis may also be made at journal level. Along with the advantages of timeliness, availability of data and consideration of different sources, such measures also have certain drawbacks related to data quality, possible bias, and data manipulation (Bornmann, 2014).
A primary purpose of the present paper is to illustrate the risks of overinterpretation of insignificant differences between journal ratings. In particular, we focus on the analysis of the exchange of citations among a relatively homogeneous list of journals. Following Stigler (1994), we model the table of cross-citations between journals in the same field by using a Bradley-Terry model (Bradley and Terry, 1952) and thereby derive a ranking of the journals’ ability to ‘export intellectual influence’ (Stigler, 1994). Although the Stigler approach has desirable properties and is simple enough to be promoted also outside the statistics community, there have been rather few published examples of application of this model since its first appearance; Stigler et al. (1995) and Liner and Amin (2004) are two notable examples of its application to the journals of Economics.
We pay particular attention to methods that summarize the uncertainty in a ranking produced through the Stigler model-based approach. Our focus on properly accounting for ‘model-based uncertainty in making comparisons’ is close in spirit to Goldstein and Spiegelhalter (1996). We propose to fit the Stigler model with the quasi-likelihood method (Wedderburn, 1974) to account for interdependence among the citations exchanged between pairs of journals, and to summarize estimation uncertainty by using quasi-variances (Firth and de Menezes, 2005). We also suggest the use of the ranking lasso penalty (Masarotto and Varin, 2012) when fitting the Stigler model, in order to combine the benefits of shrinkage with an enhanced interpretation arising from automatic presentational grouping of journals with similar merits.
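To make the Bradley-Terry idea concrete, the following is a minimal sketch under stated assumptions rather than the fitting procedure used in this paper. It assumes that `c[i][j]` holds the number of citations from journal j to journal i, and that the log-odds that journal i is the cited one in an exchange with journal j equal `mu[i] - mu[j]`. Point estimates are obtained by plain maximum likelihood using the standard minorization-maximization iteration for Bradley-Terry models. The quasi-likelihood standard errors, quasi-variances and ranking lasso mentioned above are not implemented. The function name and the toy counts are hypothetical.

```python
import math


def fit_export_scores(c, n_iter=500):
    """Bradley-Terry maximum-likelihood fit by MM iterations.

    c[i][j] is the number of citations from journal j to journal i, so
    journal i 'wins' c[i][j] times against journal j.  Every journal must
    be cited at least once by some other journal, otherwise its score
    diverges.  Returns scores mu that sum to zero.
    """
    n = len(c)
    wins = [sum(c[i][j] for j in range(n) if j != i) for i in range(n)]
    p = [1.0] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            # Sum over opponents of (citations exchanged) / (p_i + p_j).
            denom = sum(
                (c[i][j] + c[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(wins[i] / denom)
        total = sum(new_p)
        p = [value * n / total for value in new_p]  # fix the arbitrary scale
    mu = [math.log(value) for value in p]
    centre = sum(mu) / n
    return [m - centre for m in mu]


# Hypothetical cross-citation counts for three journals (self-citations ignored).
toy_citations = [[0, 30, 40],
                 [10, 0, 25],
                 [5, 15, 0]]
export_scores = fit_export_scores(toy_citations)
```

In the toy table journal 0 is cited by the other two journals far more often than it cites them, so it receives the largest score. Only differences between scores are identified, which is why the sketch centres them at zero.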
The paper is organised as follows. Section 2 describes the data collected from the Web of Science database compiled by Thomson Reuters; then as preliminary background to the paper’s main content on journal rankings, Section 3 illustrates the use of cluster analysis to identify groups of Statistics journals sharing similar aims and types of content. Section 4 provides a brief summary of journal rankings published by Thomson Reuters in the Journal Citation Reports. Section 5 discusses the Stigler method and applies it to the table of cross-citations among Statistics journals. Section 6 compares journal ratings based on citation data with results from the UK Research Assessment Exercise, and Section 7 collects some concluding remarks.
The citation data set and the computer code used for the analyses written in the R language (R Core Team, 2014) are made available in the Supplementary Web Materials.
2 The Web of Science database
The database used for our analyses is the 2010 edition of the Web of Science produced by Thomson Reuters. The citation data contained in the database are employed to compile the Journal Citation Reports (JCR), whose Science Edition summarizes citation exchange among more than 8,000 journals in science and technology. Within the JCR, scholarly journals are grouped into 171 overlapping subject categories. In particular, in 2010 the Statistics and Probability category comprised 110 journals. The choice of the journals that are encompassed in this category is to some extent arbitrary. The Scopus database, which is the main commercial competitor of Web of Science, in 2010 included in its Statistics and Probability category 105 journals, but only about two thirds of them were classified in the same category within Web of Science. The Statistics and Probability category contains also journals related to fields such as Econometrics, Chemistry, Computational Biology, Engineering and Psychometrics.
A severe criticism of the Impact Factor relates to the time period used for its calculation. The standard version of the Impact Factor considers citations received to articles published in the previous two years. This period is too short to reach the peak of citations of an article, especially in mathematical disciplines (Hall, 2009). van Nierop (2009) finds that articles published in Statistics journals typically reach the peak of their citations more than three years after publication; as reported by the JCR, the median age of the articles cited in this category is more than 10 years. Thomson Reuters acknowledges this issue and computes a second version of the Impact Factor using citations to papers published in the previous five years. Recent published alternatives to the Impact Factor, to be discussed in Section 4, also count citations to articles that appeared in the previous five years. The present paper considers citations of articles published in the previous ten years, in order to capture the influence, over a more substantial period, of work published in statistical journals.
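As an aside, the standard two-year Impact Factor described above can be written out explicitly. The symbols $\mathrm{IF}_i(y)$, $\mathrm{cit}_i$ and $N_i$ are notation introduced here for illustration only and are not used elsewhere in the paper.

```latex
\[
  \mathrm{IF}_i(y) \;=\;
  \frac{\mathrm{cit}_i(y;\, y-1) + \mathrm{cit}_i(y;\, y-2)}
       {N_i(y-1) + N_i(y-2)},
\]
% cit_i(y; s): citations made in year y to items published in journal i in year s
% N_i(s):      number of citable items published by journal i in year s
```

The five-year version extends both sums to the years $y-1, \ldots, y-5$. The ten-year window adopted in the present paper plays the same role for the cross-citation counts analysed in later sections.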
A key requirement for the methods described here, as well as in our view for any sensible analysis of citation data, is that the journals jointly analysed should be as homogeneous as possible. Accordingly, analyses are conducted on a subset of the journals from the Statistics and Probability category, among which there is a relatively high level of citation exchange. The selection is obtained by discarding journals in Probability, Econometrics, Computational Biology, Chemometrics and Engineering, and other journals not sufficiently related to the majority of the journals in the selection. Furthermore, journals recently established, and thus lacking a record of ten years of citable items, also are dropped. The final selection consists of the 47 journals listed in Table 1. Obviously, the methods discussed in this paper can be similarly applied to other selections motivated by different purposes. For example, a statistician interested in applications to Economics might consider a different selection with journals of econometrics and statistical methodology, discarding instead journals oriented towards biomedical applications.
The JCR database supplies detailed information about the citations exchanged between pairs of journals through the Cited Journal Table and the Citing Journal Table. The Cited Journal Table for journal $i$ contains the number of times that articles published in journal $j$ during 2010 cite articles published in journal $i$ in previous years. Similarly, the Citing Journal Table for journal $i$ contains the number of times that articles published in journal $j$ in previous years were cited in journal $i$ during 2010. Both of the tables contain some very modest loss of information. In fact, all journals that cite journal $i$ are listed in the Cited Journal Table for journal $i$ only if the number of citing journals is less than . Otherwise, the Cited Journal Table reports only those journals that cite journal $i$ at least twice in all past years, thus counting also citations to papers published earlier than the decade 2001–2010 here considered. Remaining journals that cite journal $i$ only once in all past years are collected in the category ‘all others’. Information on journals cited only once is similarly treated in the Citing Journal Table.
Cited and Citing Journal Tables allow construction of the cross-citation matrix $C = [c_{ij}]$, where $c_{ij}$ is the number of citations from articles published in journal $j$ in 2010 to papers published in journal $i$ in the chosen time window ($i, j = 1, \ldots, n$). In our analyses, $n = 47$, the number of selected Statistics journals, and the time window is the previous ten years. In the rest of this section we provide summary information about citations made and received by each Statistics journal at aggregate level, while Sections 3 and 5 discuss statistical analyses derived from citations exchanged by pairs of journals.
Table 2 shows the citations made by papers published in each Statistics journal in 2010 to papers published in other journals in the decade 2001–2010, as well as the citations that the papers published in each Statistics journal in 2001–2010 received from papers published in other journals in 2010. The same information is visualized in the back-to-back bar plots of Figure 1. Citations made and received are classified into three categories, namely journal self-citations from a paper published in a journal to another paper in the same journal, citations to/from journals in the list of selected Statistics journals, and citations to/from journals not in the selection.
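The three-way classification of citations made can be sketched in a few lines. The sketch assumes a cross-citation table `c` in which `c[i][j]` counts citations from journal j to journal i among the listed journals, together with a vector `total_made` giving all references made by each journal, including those to journals outside the list. The function name and the numbers are hypothetical and are not taken from Table 2.

```python
def citation_breakdown(c, total_made):
    """Split each journal's references into self, within-list and outside shares.

    c[i][j]       -- citations from journal j to journal i (listed journals only)
    total_made[j] -- all references made by journal j in the time window
    """
    n = len(c)
    rows = []
    for j in range(n):
        self_cites = c[j][j]
        # References to other listed journals: column total minus self-citations.
        to_list = sum(c[i][j] for i in range(n)) - self_cites
        outside = total_made[j] - self_cites - to_list
        rows.append({
            "self": self_cites / total_made[j],
            "list": to_list / total_made[j],
            "outside": outside / total_made[j],
        })
    return rows


# Hypothetical counts for three journals.
toy_table = [[10, 20, 5],
             [15, 30, 5],
             [5, 10, 40]]
toy_total_made = [100, 200, 80]
shares = citation_breakdown(toy_table, toy_total_made)
```

The same function applied to the transposed table, with totals of citations received in place of `total_made`, would give the corresponding breakdown of citations received.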
The total numbers of citations reported in the second and fifth columns of Table 2 include citations given or received by all journals included in the Web of Science database, not only those in the field of Statistics. The totals are influenced by journals’ sizes and by the citation patterns of other categories to which journals are related. The number of references to articles published in 2001–2010 ranges from 275 for citations made in Statistical Modelling, which has a small size publishing around 350–400 pages per year, to 4,022 for Statistics in Medicine, a large journal with size ranging from 3,500 to 6,000 pages annually in the period examined. The number of citations from a journal to articles in the same journal is quite variable and ranges from 0.8% of all citations for Computational Statistics to 24% for Stata Journal. On average, 6% of the references in a journal are to articles appearing in the same journal and 40% of references are addressed to journals in the list, including journal self-citations. The Journal of the Royal Statistical Society Series A has the lowest percentage of citations to other journals in the list, at only 10%. Had we kept the whole Statistics and Probability category of the JCR, that percentage would have risen by just 2 points to 12%; most of the references appearing in Journal of the Royal Statistical Society Series A are to journals outside the Statistics and Probability category.
The number of citations received ranges from 168 for Computational Statistics to 6,602 for Statistics in Medicine. Clearly, the numbers are influenced by the size of the journal. For example, the small number of citations received by Computational Statistics relates to only around 700 pages published per year by that journal. The citations received are influenced also by the citation patterns of other subject categories. In particular, the number of citations received by a journal oriented towards medical applications benefits from communication with a large field including many high-impact journals. For example, around 75% of the citations received by Statistics in Medicine came from journals outside the list of Statistics journals, mostly from medical journals. On average, 7% of the citations received by journals in the list came from the same journal and 40% were from journals in the list.
As stated already, the Statistics journals upon which we focus have been selected from the Statistics and Probability category of the JCR, with the aim of retaining those which communicate more. The median fraction of citations from journals discarded from our selection to journals in the list is only 4%, while the median fraction of citations received by non-selected journals from journals in the list is 7%. An important example of an excluded journal is Econometrica, which was ranked in leading positions by all of the published citation indices. Econometrica had only about 2% of its references addressed to other journals in our list, and received only 5% of its citations from journals within our list.
3 Clustering journals
Statistics journals have different stated objectives, and different types of content. Some journals emphasize applications and modelling, while others focus on theoretical and mathematical developments, or deal with computational and algorithmic aspects of statistical analysis. Applied journals are often targeted to particular areas, such as, for example, statistics for medical applications, or for environmental sciences. Therefore, it is quite natural to consider whether the cross-citation matrix $C$ allows the identification of groups of journals with similar aims and types of content. Clustering of scholarly journals has been extensively discussed in the bibliometric literature and a variety of clustering methods have been considered. Examples include the hill-climbing method (Carpenter and Narin, 1973), $k$-means (Boyack et al., 2005), and methods based on graph theory (Leydesdorff, 2004; Liu et al., 2012).
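Consider the total number $t_{ij}$ of citations exchanged between journals $i$ and $j$,
\[
t_{ij} = \begin{cases} c_{ij} + c_{ji}, & \text{for } i \neq j, \\ c_{ii}, & \text{for } i = j. \end{cases} \tag{1}
\]
Among various possibilities (see, for example, Boyack et al., 2005), the distance between two journals can be measured by quantity $d_{ij} = 1 - \rho_{ij}$, where $\rho_{ij}$ is the Pearson correlation coefficient of variables $t_{ik}$ and $t_{jk}$ ($k = 1, \ldots, n$), i.e.,
\[
\rho_{ij} = \frac{\sum_{k=1}^{n} (t_{ik} - \bar{t}_i)(t_{jk} - \bar{t}_j)}{\sqrt{\sum_{k=1}^{n} (t_{ik} - \bar{t}_i)^2 \, \sum_{k=1}^{n} (t_{jk} - \bar{t}_j)^2}},
\]
with $\bar{t}_i = \sum_{k=1}^{n} t_{ik} / n$. Among the many available clustering algorithms, we consider a hierarchical agglomerative cluster analysis with complete linkage (Kaufman and Rousseeuw, 1990). The clustering process is visualized through the dendrogram in Figure 2. Visual inspection of the dendrogram suggests cutting it at height thereby obtaining eight clusters, two of which are singletons. The identified clusters are grouped in grey boxes in Figure 2.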
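The clustering step described above can be sketched in plain Python. The sketch makes three assumptions: `c[i][j]` counts citations from journal j to journal i, the diagonal of the exchange table is taken to be the self-citation count, and the distance between two journals is one minus the Pearson correlation of their rows in the exchange table. The complete-linkage routine is a naive implementation written for clarity, the cut height of 0.5 is arbitrary, and the six-journal table is hypothetical.

```python
import math


def exchange_table(c):
    """Total citations exchanged between each pair of journals."""
    n = len(c)
    return [[c[i][j] + c[j][i] if i != j else c[i][i] for j in range(n)]
            for i in range(n)]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distances(t):
    n = len(t)
    return [[1.0 - pearson(t[i], t[j]) for j in range(n)] for i in range(n)]


def complete_linkage(dist, height):
    """Agglomerate clusters until the closest pair is farther apart than `height`."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > height:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical cross-citation counts: journals 0-2 and 3-5 form two communities.
toy_counts = [[20, 10, 12, 1, 0, 0],
              [9, 18, 11, 0, 1, 0],
              [10, 10, 22, 0, 0, 1],
              [0, 1, 0, 19, 11, 9],
              [1, 0, 0, 10, 21, 12],
              [0, 0, 1, 12, 10, 20]]
distances = correlation_distances(exchange_table(toy_counts))
groups = complete_linkage(distances, height=0.5)
```

Stopping the merges once the smallest complete-linkage distance exceeds the chosen height yields the same partition as cutting the full dendrogram at that height, because complete-linkage merge heights never decrease.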
We comment first on the groups and later on the singletons, following the order of the journals in the dendrogram from left to right. The first group includes a large number of general journals concerned with theory and methods of Statistics, but also with applications. Among others, the group includes Journal of Time Series Analysis, Journal of Statistical Planning and Inference, and Annals of the Institute of Statistical Mathematics.
The second group contains the leading journals in the development of statistical theory and methods: Annals of Statistics, Biometrika, Journal of the American Statistical Association and Journal of the Royal Statistical Society Series B. The group includes also other methodological journals such as Bernoulli, Scandinavian Journal of Statistics and Statistica Sinica. It is possible to identify some natural subgroups: Journal of Computational and Graphical Statistics and Statistics and Computing; Biometrika, Journal of the Royal Statistical Society Series B, and Journal of the American Statistical Association; Annals of Statistics and Statistica Sinica.
The third group comprises journals mostly dealing with computational aspects of Statistics, such as Computational Statistics and Data Analysis, Communications in Statistics – Simulation and Computation, Computational Statistics, and Journal of Statistical Computation and Simulation. Other members of the group with a less direct orientation towards computational methods are Technometrics and Journal of Applied Statistics.
The fourth group includes just two journals both of which publish mainly review articles, namely American Statistician and International Statistical Review.
The fifth group comprises the three journals specializing in ecological and environmental applications: Journal of Agricultural, Biological and Environmental Statistics, Environmental and Ecological Statistics and Environmetrics.
The last group includes various journals emphasising applications, especially to health sciences and similar areas. It encompasses journals oriented towards biological and medical applications such as Biometrics and Statistics in Medicine, and also journals publishing papers about more general statistical applications, such as Journal of the Royal Statistical Society Series A and Series C. The review journal Statistical Science also falls into this group; it is not grouped together with the other two review journals already mentioned. Within the group there are some natural subgroupings: Statistics in Medicine with Statistical Methods in Medical Research; and Biometrics with Biostatistics.
Finally, and perhaps not surprisingly, the two singletons are the software-oriented Journal of Statistical Software and Stata Journal. The latter is, by some distance, the most remote journal in the list according to the measure of distance used here.
4 Ranking journals
The Thomson Reuters JCR website annually publishes various rating indices, the best known being the already mentioned Impact Factor. Thomson Reuters also publishes the Immediacy Index, which describes the average number of times an article is cited in the year of its publication. The Immediacy Index is unsuitable for evaluating Statistics journals, but it could be worthy of attention in fields where citations occur very quickly, for example some areas of neuroscience and other life sciences.
It is well known in the bibliometric literature that the calculation of the Impact Factor contains some important inconsistencies (Glänzel and Moed, 2002). The numerator of the Impact Factor includes citations to all items, while the number of citable items in the denominator excludes letters to the editor and editorials; such letters are an important element of some journals, notably medical journals. The inclusion of self-citations, defined as citations from a journal to articles in the same journal, exposes the Impact Factor to possible manipulation by editors. Indeed, Sevinc (2004), Frandsen (2007) and Wilhite and Fong (2012) report instances where authors were asked to add irrelevant references to their articles, presumably with the aim of increasing the Impact Factor of the journal. As previously mentioned, recently Thomson Reuters has made available also the Impact Factor without journal self-citations. Journal self-citations can also be a consequence of authors’ preferring to cite papers published in the same journal instead of equally relevant papers published elsewhere, particularly if they perceive such self-citation as likely to be welcomed by the journal’s editors. Nevertheless, the potential for such behaviour should not lead to the conclusion that self-citations are always unfair. Many self-citations are likely to be genuine, especially since scholars often select a journal for submission of their work according to the presence of previously published papers on related topics.
The Eigenfactor Score and the derived Article Influence Score (Bergstrom, 2007; West, 2010) have been proposed to overcome the limitations of the Impact Factor. Both the Eigenfactor and the Article Influence Score are computed over a five-year time period, with journal self-citations removed in order to eliminate possible sources of manipulation. The idea underlying the Eigenfactor Score is that the importance of a journal relates to the time spent by scholars in reading that journal. As stated by Bergstrom (2007), it is possible to imagine that a scholar starts reading an article selected at random. Then, the scholar randomly selects another article from the references of the first paper and reads it. Afterwards, a further article is selected at random from the references included in the previous one and the process may go on ad infinitum. In such a process, the time spent in reading a journal might reasonably be regarded as an indicator of that journal’s importance.
Apart from modifications needed to account for special cases such as journals that do not cite any other journal, the Eigenfactor algorithm is summarized as follows. The Eigenfactor is computed from the normalized citation matrix , whose elements are the citations from journal to articles published in the previous five years in journal divided by the total number of references in in those years, . The diagonal elements of are set to zero, to discard selfcitations. A further ingredient of the Eigenfactor is the vector of normalized numbers of articles , with being the number of articles published by journal during the fiveyear period divided by the number of articles published by all considered journals. Let be the row vector of ones, so that is a matrix with all identical columns . Then
is the transition matrix of a Markov process that assigns probability to a random movement in the journal citation network, and probability to a random jump to any journal; for jumps of the latter kind, destinationjournal attractiveness is simply proportional to size.
The damping parameter $\lambda$ is set to $0.85$, just as in the PageRank algorithm at the basis of the Google search engine; see Brin and Page (1998). The leading eigenvector $\psi$ of $P$ corresponds to the steady-state fraction of time spent reading each journal. The Eigenfactor Score $EF_i$ for journal $i$ is defined as ‘the percentage of the total weighted citations that journal $i$ receives’; that is,
$$EF_i = 100\, \frac{[\tilde{C}\psi]_i}{\sum_{k=1}^{n} [\tilde{C}\psi]_k}, \qquad i = 1, \ldots, n,$$
where $[x]_i$ denotes the $i$th element of vector $x$. See www.eigenfactor.org/methods.pdf for more details of the methodology behind the Eigenfactor algorithm.
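As a minimal illustration of the algorithm just summarized, the following Python sketch computes Eigenfactor-type scores by power iteration on a toy cross-citation matrix. The function name, the toy counts and the convergence tolerance are hypothetical choices made for illustration only, and the special cases mentioned above (such as journals that cite no other journal) are deliberately not handled.

```python
import numpy as np

def eigenfactor_scores(C, articles, damping=0.85, tol=1e-12, max_iter=10_000):
    """Toy Eigenfactor sketch. C[i, j] = citations from journal j to journal i."""
    C = np.asarray(C, dtype=float).copy()
    np.fill_diagonal(C, 0.0)                  # discard journal self-citations
    col_sums = C.sum(axis=0)
    if np.any(col_sums == 0):
        raise ValueError("sketch assumes every journal cites at least one other journal")
    C_tilde = C / col_sums                    # each column now sums to one
    a = np.asarray(articles, dtype=float)
    a = a / a.sum()                           # normalized numbers of articles
    n = len(a)
    # Transition matrix: follow a citation with prob. damping, else jump by size
    P = damping * C_tilde + (1.0 - damping) * np.outer(a, np.ones(n))
    psi = np.full(n, 1.0 / n)                 # start from the uniform distribution
    for _ in range(max_iter):
        new_psi = P @ psi
        new_psi /= new_psi.sum()
        converged = np.abs(new_psi - psi).sum() < tol
        psi = new_psi
        if converged:
            break
    weighted = C_tilde @ psi                  # weighted citations received
    return 100.0 * weighted / weighted.sum(), psi

# Hypothetical three-journal example
C_toy = [[0, 30, 10],
         [20, 0, 5],
         [5, 10, 0]]
articles_toy = [100, 200, 50]
ef, psi = eigenfactor_scores(C_toy, articles_toy)
```

By construction the returned scores are percentages that sum to 100 across the journals considered.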
The Eigenfactor ‘measures the total influence of a journal on the scholarly literature’ (Bergstrom, 2007) and thus it depends on the number of articles published by a journal. The Article Influence Score $AI_i$ of journal $i$ is instead a measure of the per-article citation influence of the journal, obtained by normalizing the Eigenfactor as follows:
$$AI_i = 0.01\, \frac{EF_i}{a_i}, \qquad i = 1, \ldots, n.$$
Distinctive aspects of the Article Influence Score with respect to the Impact Factor are:

The use of a formal stochastic model to derive the journal ranking;

The use of bivariate data (the cross-citations), in contrast to the univariate citation counts used by the Impact Factor.
An appealing feature of the Article Influence Score is that citations are weighted according to the importance of the source, whereas the Impact Factor counts all citations equally (Franceschet, 2010). Accordingly, the bibliometric literature classifies the Article Influence Score as a measure of journal ‘prestige’ and the Impact Factor as a measure of journal ‘popularity’ (Bollen et al., 2006). Table 3 summarizes some of the main features of the ranking methods discussed in this section and also of the Stigler model that will be discussed in Section 5 below.
The rankings of the selected Statistics journals according to Impact Factor, Impact Factor without journal self-citations, five-year Impact Factor, Immediacy Index, and Article Influence Score are reported in columns two to six of Table 4. The substantial variation among those five rankings is the first aspect that leaps to the eye; these different published measures clearly do not yield a common, unambiguous picture of the journals’ relative standings.
A widespread opinion within the statistical community is that the four most prestigious Statistics journals are (in alphabetical order) Annals of Statistics, Biometrika, Journal of the American Statistical Association, and Journal of the Royal Statistical Society Series B. See, for example, the survey about how statisticians perceive Statistics journals described in Theoharakis and Skordia (2003). Accordingly, a minimal requirement for a ranking of acceptable quality is that the four most prestigious journals should occupy prominent positions. Following this criterion, the least satisfactory ranking is, as expected, the one based on the Immediacy Index, which ranks Journal of the American Statistical Association only 22nd and Biometrika just a few positions ahead at 19th.
In the three versions of Impact Factor ranking, Journal of the Royal Statistical Society Series B always occupies first position, Annals of Statistics ranges between second and sixth, Journal of the American Statistical Association between fourth and eighth, and Biometrika between tenth and twelfth. The two software journals have quite high Impact Factors: Journal of Statistical Software is ranked between second and fifth by the three different Impact Factor versions, while Stata Journal is between seventh and ninth. Other journals ranked highly according to the Impact Factor measures are Biostatistics and Statistical Science.
Among the indices published by Thomson Reuters, the Article Influence Score yields the most satisfactory ranking with respect to the four leading journals mentioned above, all of which stand within the first five positions.
All of the indices discussed in this section are constructed by using the complete Web of Science database, thus counting citations from journals in other fields as well as citations among Statistics and Probability journals.
5 The Stigler model
Stigler (1994) considers the export of intellectual influence from a journal in order to determine its importance. The export of influence is measured through the citations received by the journal. Stigler assumes that the log-odds that journal $i$ exports to journal $j$ rather than vice versa is equal to the difference of the journals’ export scores,
$$\text{log-odds}(\text{journal } i \text{ is cited by journal } j) = \mu_i - \mu_j, \qquad (2)$$
where $\mu_i$ is the export score of journal $i$. In Stephen Stigler’s words ‘the larger the export score, the greater the propensity to export intellectual influence’. The Stigler model is an example of the Bradley-Terry model (Bradley and Terry, 1952; David, 1963; Agresti, 2013) for paired comparison data. According to (2), the citation counts $c_{ij}$ are realizations of binomial variables $C_{ij}$ with expected value
$$E(C_{ij}) = t_{ij}\, \pi_{ij}, \qquad (3)$$
where $\pi_{ij} = \exp(\mu_i - \mu_j) / \{1 + \exp(\mu_i - \mu_j)\}$ and $t_{ij}$ is the total number of citations exchanged between journals $i$ and $j$, as defined in (1).
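To make the link between export scores and expected citation counts concrete, the short Python sketch below evaluates the logistic probability implied by (2) and the binomial mean in (3). The function names and the numerical inputs are hypothetical and serve only as an illustration.

```python
import math

def export_probability(mu_i, mu_j):
    """Probability that a citation exchanged by journals i and j cites journal i."""
    return 1.0 / (1.0 + math.exp(-(mu_i - mu_j)))

def expected_citations(t_ij, mu_i, mu_j):
    """Expected number of citations received by i from j, given t_ij exchanges."""
    return t_ij * export_probability(mu_i, mu_j)

# Hypothetical example: 100 citations exchanged, export scores 1.0 and 0.0
expected_i = expected_citations(100, 1.0, 0.0)
expected_j = expected_citations(100, 0.0, 1.0)
```

Only the difference of the two export scores matters, and the two expected counts always add up to the total number of citations exchanged.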
The Stigler model has some attractive features:

Statistical modelling. Similarly to the Eigenfactor and the derived Article Influence Score, the Stigler method is based on stochastic modelling of a matrix of cross-citation counts. The methods differ regarding the modelling perspective (Markov process for Eigenfactor versus Bradley-Terry model in the Stigler method) and, perhaps most importantly, the use of formal statistical methods. The Stigler model is calibrated through well-established statistical fitting methods, such as maximum likelihood or quasi-likelihood (see Section 5.1), with estimation uncertainty summarized accordingly (Section 5.3). Moreover, Stigler-model assumptions are readily checked by the analysis of suitably defined residuals, as described in Section 5.2.

The size of the journals is not important. Rankings based on the Stigler model are not affected by the numbers of papers published. As shown by Stigler (1994, pg. 102), if two journals are merged into a single journal then the odds in favour of that ‘super’ journal against any third journal is a weighted average of the odds for the two separate journals against the third one. Normalization for journal size, which is explicit in the definitions of various Impact Factor and Article Influence measures, is thus implicit for the Stigler model.

Journal self-citations are not counted. In contrast to the standard Impact Factor, rankings based on journal export scores are not affected by the risk of manipulation through journal self-citations.

Only citations between journals under comparison are counted. If the Stigler model is applied to the list of 47 Statistics journals, then only citations among these journals are counted. Such an application of the Stigler model thus aims unambiguously to measure influence within the research field of Statistics, rather than combining that with potential influence on other research fields. As noted in Table 3, this property differentiates the Stigler model from the other ranking indices published by Thomson Reuters, which use citations from all journals in potentially any fields in order to create a ‘global’ ranking of all scholarly journals. Obviously it would be possible also to recompute more ‘locally’ the various Impact Factor measures and/or Eigenfactor-based indices, by using only citations exchanged between the journals in a restricted set to be compared.

Citing journal is taken into account. Like the Article Influence Score, the Stigler model measures journals’ relative prestige, because it is derived from bivariate citation counts and thus takes into account the source of each citation. The Stigler model decomposes the cross-citation matrix differently, though; it can be re-expressed in log-linear form as the ‘quasi symmetry’ model,
$$E(C_{ij}) = t_{ij}\, e^{\alpha_i + \beta_j}, \qquad (4)$$
in which the export score for journal $i$ is $\mu_i = \alpha_i - \beta_i$.

Lack-of-fit assessment. Stigler et al. (1995) and Liner and Amin (2004) observed increasing lack of fit of the Stigler model when additional journals that trade little with those already under comparison are included in the analysis. Ritzberger (2008) states bluntly that the Stigler model ‘suffers from a lack of fit’ and dismisses it (incorrectly, in our view) for that reason. We agree instead with Liner and Amin (2004) who suggest that statistical lack-of-fit assessment is another positive feature of the Stigler model that can be used, for example, to identify groups of journals belonging to different research fields, journals which should perhaps not be ranked together. Certainly the existence of principled lack-of-fit assessment for the Stigler model should not be a reason to prefer other methods for which no such assessment is available.
See also Table 3 for a comparison of properties of the ranking methods considered in this paper.
5.1 Model fitting
Maximum likelihood estimation of the vector of journal export scores $\mu = (\mu_1, \ldots, \mu_n)^\top$ can be obtained through standard software for fitting generalized linear models. Alternatively, specialized software such as the R package BradleyTerry2 (Turner and Firth, 2012) is available through the CRAN repository. Since the Stigler model is specified through pairwise differences of export scores $\mu_i - \mu_j$, model identification requires a constraint, such as, for example, a ‘reference journal’ constraint $\mu_1 = 0$ or the sum constraint $\sum_{i=1}^{n} \mu_i = 0$. Without loss of generality we use the latter constraint in what follows.
Standard maximum likelihood estimation of the Stigler model would assume that citation counts $c_{ij}$ are realizations of independent binomial variables $C_{ij}$. Such an assumption is likely to be inappropriate, since research citations are not independent of one another in practice; see Cattelan (2012) for a general discussion on handling dependence in paired-comparison modelling. The presence of dependence between citations can be expected to lead to the well-known phenomenon of overdispersion. A simple way to deal with overdispersion is provided by the method of quasi-likelihood (Wedderburn, 1974). Accordingly, we consider a ‘quasi-Stigler’ model,
$$E(C_{ij}) = t_{ij}\, \pi_{ij} \quad \text{and} \quad \operatorname{var}(C_{ij}) = \phi\, t_{ij}\, \pi_{ij} (1 - \pi_{ij}), \qquad (5)$$
where $\phi > 0$ is the dispersion parameter. Let $c$ be the vector obtained by stacking all citation counts $c_{ij}$ in some arbitrary order, and let $t$ and $\pi$ be the corresponding vectors of totals $t_{ij}$ and expected proportions $\pi_{ij}$, respectively. Then estimates of the export scores are obtained by solving the quasi-likelihood estimating equations
$$D^\top V^{-1} (c - t\pi) = 0, \qquad (6)$$
where $t\pi$ denotes the elementwise product of $t$ and $\pi$, $D$ is the Jacobian of $t\pi$ with respect to the export scores $\mu$, and $V$ is the diagonal matrix with elements $\operatorname{var}(C_{ij})/\phi$. Under the assumed model (5), quasi-likelihood estimators are consistent and asymptotically normally distributed with variance-covariance matrix $\phi\, (D^\top V^{-1} D)^{-1}$. The dispersion parameter is usually estimated via the squared Pearson residuals as
$$\hat\phi = \frac{1}{m - n + 1} \sum_{i<j} \frac{(c_{ij} - t_{ij}\hat\pi_{ij})^2}{t_{ij}\hat\pi_{ij}(1 - \hat\pi_{ij})}, \qquad (7)$$
where $\hat\pi$ is the vector of estimates $\hat\pi_{ij} = \exp(\hat\mu_i - \hat\mu_j) / \{1 + \exp(\hat\mu_i - \hat\mu_j)\}$, with $\hat\mu_i$ being the quasi-likelihood estimate of the export score $\mu_i$, and $m = \sum_{i<j} \mathbf{1}(t_{ij} > 0)$ the number of pairs of journals that exchange citations. Well-known properties of quasi-likelihood estimation are robustness against misspecification of the variance matrix $V$ and optimality within the class of linear unbiased estimating equations.
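The estimation steps above can be sketched in a few lines of Python. The sketch is only illustrative: the analysis reported here relies on R, whereas the function name, the toy matrix and the fixed number of Fisher-scoring iterations below are hypothetical. It assumes that, with a diagonal binomial-type variance, the estimating equations (6) reduce to $X^\top (c - t\pi) = 0$, where $X$ is the paired-comparison design matrix with entries $+1$ and $-1$; the sum constraint is imposed by re-centring the scores at each iteration.

```python
import numpy as np

def fit_quasi_stigler(C, n_iter=50):
    """Sketch of a quasi-Stigler fit. C[i, j] = citations received by i from j."""
    C = np.asarray(C, dtype=float)
    n = C.shape[0]
    # Keep only pairs of journals that actually exchange citations
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
             if C[i, j] + C[j, i] > 0]
    c = np.array([C[i, j] for i, j in pairs])
    t = np.array([C[i, j] + C[j, i] for i, j in pairs])
    X = np.zeros((len(pairs), n))
    for row, (i, j) in enumerate(pairs):
        X[row, i], X[row, j] = 1.0, -1.0
    mu = np.zeros(n)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ mu)))
        score = X.T @ (c - t * pi)                    # left-hand side of (6)
        info = X.T @ (X * (t * pi * (1.0 - pi))[:, None])
        mu = mu + np.linalg.pinv(info) @ score        # Fisher scoring step
        mu -= mu.mean()                               # sum constraint
    pi = 1.0 / (1.0 + np.exp(-(X @ mu)))
    pearson = (c - t * pi) / np.sqrt(t * pi * (1.0 - pi))
    m = len(pairs)
    phi = float(np.sum(pearson ** 2) / (m - n + 1))   # dispersion estimate (7)
    return mu, phi

# Hypothetical cross-citation counts for four journals
C_toy = np.array([[0, 40, 35, 50],
                  [25, 0, 30, 45],
                  [20, 28, 0, 33],
                  [15, 20, 27, 0]])
mu_hat, phi_hat = fit_quasi_stigler(C_toy)
```

The pseudo-inverse is used because the information matrix is singular before the identifying constraint is applied; a reference-journal parameterization would serve equally well.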
The estimate $\hat\phi$ of the dispersion parameter obtained here, for the model applied to Statistics journal cross-citations between 2001 and 2010, is greater than one, indicative of overdispersion. The quasi-likelihood estimated export scores of the Statistics journals are reported in Table 5 and will be discussed later in Section 5.4.
5.2 Model validation
An essential feature of the Stigler model is that the export score of any journal is a constant. In particular, in model (2) the export score of journal $i$ is not affected by the identity of the citing journal $j$. Citations exchanged between journals can be seen as results of contests between opposing journals and the residuals for contests involving journal $i$ should not exhibit any relationship with the corresponding estimated export scores of the ‘opponent’ journals $j$. With this in mind, we define the journal residual $r_i$ for journal $i$ as the standardized regression coefficient derived from the linear regression of Pearson residuals involving journal $i$ on the estimated export scores of the corresponding opponent journals. More precisely, the $i$th journal residual is defined here as
$$r_i = \frac{\sum_{j=1}^{n} \hat\mu_j\, r_{ij}}{\sqrt{\hat\phi \sum_{j=1}^{n} \hat\mu_j^2}},$$
where $r_{ij}$ is the Pearson residual for citations of $i$ by $j$,
$$r_{ij} = \frac{c_{ij} - t_{ij}\hat\pi_{ij}}{\sqrt{t_{ij}\hat\pi_{ij}(1 - \hat\pi_{ij})}}.$$
The journal residual $r_i$ indicates the extent to which $i$ performs systematically better than predicted by the model either when the opponent $j$ is strong, as indicated by a positive-valued journal residual for $i$, or when the opponent $j$ is weak, as indicated by a negative-valued journal residual for $i$. The journal residuals thus provide a basis for useful diagnostics, targeted specifically at readily interpretable departures from the assumed model.
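A possible computation of the journal residuals is sketched below in Python. It assumes that each journal residual is the regression-through-the-origin coefficient of the Pearson residuals on the opponents' estimated export scores, divided by its standard error, with the dispersion estimate supplied by the user; the function name and the toy inputs are hypothetical.

```python
import numpy as np

def journal_residuals(C, mu_hat, phi_hat):
    """Sketch: standardized slope of Pearson residuals on opponents' scores.

    C[i, j] = citations received by journal i from journal j.
    """
    C = np.asarray(C, dtype=float)
    mu = np.asarray(mu_hat, dtype=float)
    T = C + C.T                                   # citations exchanged per pair
    Pi = 1.0 / (1.0 + np.exp(-(mu[:, None] - mu[None, :])))
    var = T * Pi * (1.0 - Pi)
    R = np.zeros_like(C)                          # Pearson residuals r_ij
    mask = var > 0                                # skip pairs with no exchange
    np.fill_diagonal(mask, False)                 # self-citations are ignored
    R[mask] = (C[mask] - T[mask] * Pi[mask]) / np.sqrt(var[mask])
    return (R @ mu) / np.sqrt(phi_hat * np.sum(mu ** 2))

# Hypothetical inputs: three journals, illustrative scores and dispersion
C_demo = np.array([[0, 60, 70],
                   [40, 0, 55],
                   [30, 45, 0]])
r_demo = journal_residuals(C_demo, mu_hat=[0.4, 0.0, -0.4], phi_hat=1.5)
```

Values far from zero flag journals whose performance depends on the strength of the opponent, which is the kind of departure from the model that this diagnostic is designed to detect.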
Under the assumed quasi-Stigler model, journal residuals are approximately realizations of standard normal variables and are unrelated to the export scores. The normal probability plot of the journal residuals displayed in the left panel of Figure 3 indicates that the normality assumption is indeed approximately satisfied. The scatterplot of the journal residuals $r_i$ against estimated export scores $\hat\mu_i$ shows no clear pattern; there is no evidence of correlation between journal residuals and export scores. As expected based on approximate normality of the residuals, only two journals (i.e., $4.3\%$ of journals) have residuals larger in absolute value than $1.96$. These journals are Communications in Statistics - Theory and Methods and Test. The overall conclusion from this graphical inspection of journal residuals is that the assumptions of the quasi-Stigler model appear to be essentially satisfied for the data used here.
5.3 Estimation uncertainty
Estimation uncertainty is commonly unexplored, and is rarely reported, in relation to the various published journal rankings. Despite this lacuna, many academics have produced vibrant critiques of ‘statistical citation analyses’, although such analyses are actually rather nonstatistical. Recent research in the bibliometric field has suggested that uncertainty in estimated journal ratings might be estimated via bootstrap simulation; see the already mentioned Chen et al. (2014) and the ‘stability intervals’ for the SNIP index. A key advantage of the Stigler model over other ranking methods is straightforward quantification of the uncertainty in journal export scores.
Since the Stigler model is identified through pairwise differences, uncertainty quantification requires the complete variance matrix of $\hat\mu$. Routine reporting of such a large variance matrix is impracticable for space reasons. A neat solution is provided through the presentational device of quasi-variances (Firth and de Menezes, 2005), constructed in such a way as to allow approximate calculation of any variance of a difference, $\operatorname{var}(\hat\mu_i - \hat\mu_j)$, as if $\hat\mu_i$ and $\hat\mu_j$ were independent:
$$\operatorname{var}(\hat\mu_i - \hat\mu_j) \simeq \operatorname{qvar}_i + \operatorname{qvar}_j, \qquad \text{for all choices of } i \text{ and } j.$$
Reporting the estimated export scores with their quasi-variances, then, is an economical way to allow approximate inference on the significance of the difference between any two journals’ export scores. The quasi-variances are computed by minimizing a suitable penalty function of the differences between the true variances, $\operatorname{var}(\hat\mu_i - \hat\mu_j)$, and their quasi-variance representations $\operatorname{qvar}_i + \operatorname{qvar}_j$. See Firth and de Menezes (2005) for details.
Table 5 reports the estimated journal export scores computed under the sum constraint and the corresponding quasi standard errors, defined as the square root of the quasi-variances. Quasi-variances are calculated by using the R package qvcalc (Firth, 2012). For illustration, consider testing whether the export score of Biometrika is significantly different from that of the Journal of the American Statistical Association. The $z$ test statistic as approximated through the quasi-variances is
$$z \simeq \frac{\hat\mu_{\mathrm{Bka}} - \hat\mu_{\mathrm{JASA}}}{\sqrt{\operatorname{qvar}_{\mathrm{Bka}} + \operatorname{qvar}_{\mathrm{JASA}}}},$$
with the estimates and quasi standard errors taken from Table 5.
The ‘usual’ variances for those two export scores in the sum-constrained parameterization are respectively 0.0376 and 0.0344, and the covariance is 0.0312; thus the ‘exact’ value of the $z$ statistic in this example is
$$z = \frac{\hat\mu_{\mathrm{Bka}} - \hat\mu_{\mathrm{JASA}}}{\sqrt{0.0376 - 2\,(0.0312) + 0.0344}},$$
so the approximation based upon quasi-variances is quite accurate. In this case the $z$ statistic suggests that there is insufficient evidence to rule out the possibility that Biometrika and Journal of the American Statistical Association have the same ability to ‘export intellectual influence’ within the Statistics journals in the list.
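The arithmetic behind the ‘exact’ standard error of the difference can be checked directly from the variances and covariance quoted above. The following Python sketch does so and also shows how a quasi-variance approximation would be formed; the helper function names are hypothetical, and the export scores and quasi standard errors themselves must be read from Table 5.

```python
import math

# Variances and covariance quoted in the text (sum-constrained parameterization)
var_bka, var_jasa, cov_bka_jasa = 0.0376, 0.0344, 0.0312

# 'Exact' standard error of the difference between the two export scores
se_exact = math.sqrt(var_bka + var_jasa - 2.0 * cov_bka_jasa)

def quasi_se_diff(qse_a, qse_b):
    """Approximate standard error of a difference from two quasi standard errors."""
    return math.sqrt(qse_a ** 2 + qse_b ** 2)

def z_statistic(score_a, score_b, se_diff):
    """z statistic for the difference between two estimated export scores."""
    return (score_a - score_b) / se_diff
```

The exact standard error of the difference is just under 0.1, so a difference between two export scores needs to be roughly 0.2 or more before it approaches conventional significance.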
5.4 Results
We proceed now with interpretation of the ranking based on the Stigler model. It is reassuring that the four leading Statistics journals mentioned previously are ranked in the first four positions. Journal of the Royal Statistical Society Series B is ranked first with a remarkably larger export score than the second-ranked journal, Annals of Statistics: the approximate $z$ statistic for the significance of the difference of their export scores, computed from the quasi standard errors in Table 5, is large. The third position is occupied by Biometrika, closely followed by Journal of the American Statistical Association.
The fifth-ranked journal is Biometrics, followed by Journal of the Royal Statistical Society Series A, Bernoulli, Scandinavian Journal of Statistics, Biostatistics, Journal of Computational and Graphical Statistics, and Technometrics.
The ‘centipede’ plot in Figure 4 visualizes the estimated export scores along with the comparison intervals with limits $\hat\mu_i \pm 1.96\, \mathrm{qse}(\hat\mu_i)$, where ‘qse’ denotes the quasi standard error. The centipede plot highlights the outstanding position of Journal of the Royal Statistical Society Series B, and indeed of the four top journals whose comparison intervals are well separated from those of the remaining journals. However, the most striking general feature is the substantial uncertainty in most of the estimated journal scores. Many of the small differences that appear among the estimated export scores are not statistically significant.
5.5 Ranking in groups with lasso
Shrinkage estimation offers notable improvement over standard maximum likelihood estimation when the target is simultaneous estimation of a vector of mean parameters; see, for example, Morris (1983). It seems natural to consider shrinkage estimation also for the Stigler model. Masarotto and Varin (2012) fit Bradley-Terry models with a lasso-type penalty (Tibshirani, 1996) which, in our application here, forces journals with close export scores to be estimated at the same level. The method, termed the ranking lasso, has the two-fold advantages of shrinkage and enhanced interpretation, because it avoids over-interpretation of small differences between estimated journal export scores.
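For a given value of a bound parameter $s \ge 0$, the ranking lasso method fits the Stigler model by solving the quasi-likelihood equations (6) with an $L_1$ penalty on all the pairwise differences of export scores; that is,
$$D^\top V^{-1} (c - t\pi) = 0, \quad \text{subject to} \quad \sum_{i<j} w_{ij}\, |\mu_i - \mu_j| \le s \quad \text{and} \quad \sum_{i=1}^{n} \mu_i = 0, \qquad (8)$$
where the $w_{ij}$ are data-dependent weights discussed below.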
Quasi-likelihood estimation is obtained for a sufficiently large value of the bound $s$. As $s$ decreases to zero, the $L_1$ penalty causes journal export scores that differ little to be estimated at the same value, thus producing a ranking in groups. The ranking lasso method can be interpreted as a generalized version of the fused lasso (Tibshirani et al., 2005).
Since quasi-likelihood estimates coincide with maximum likelihood estimates for the corresponding exponential dispersion model, ranking lasso solutions can be computed as penalized likelihood estimates. Masarotto and Varin (2012) obtain estimates of the adaptive ranking lasso by using an augmented Lagrangian algorithm (Nocedal and Wright, 2006) for a sequence of bounds $s$ ranging from complete shrinkage ($s = 0$), i.e., all journals have the same estimated export score, to the quasi-likelihood solution ($s = \infty$).
Many authors (e.g., Fan and Li, 2001; Zou, 2006) have observed that lasso-type penalties may be too severe, thus yielding inconsistent estimates of the non-zero effects. In the ranking lasso context, this means that if the weights $w_{ij}$ in (8) are all identical, then the pairwise differences $\mu_i - \mu_j$ whose ‘true’ value is non-zero might not be consistently estimated. Among various possibilities, an effective way to overcome the drawback is to resort to the adaptive lasso method (Zou, 2006), which imposes a heavier penalty on small effects. Accordingly, the adaptive ranking lasso employs weights equal to the reciprocal of a consistent estimate of $\mu_i - \mu_j$, such as $w_{ij} = |\hat\mu_i^{(\mathrm{QLE})} - \hat\mu_j^{(\mathrm{QLE})}|^{-1}$ with $\hat\mu_i^{(\mathrm{QLE})}$ being the quasi-likelihood estimate of the export score for journal $i$.
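The adaptive weights and the weighted penalty on pairwise differences can be illustrated with a short Python sketch. The function names are hypothetical, and the small constant `eps`, which guards against division by zero when two unpenalized estimates coincide, is an assumption of this sketch rather than part of the method described above.

```python
import numpy as np

def adaptive_weights(mu_qle, eps=1e-8):
    """Weights w_ij = 1 / |mu_i - mu_j| from unpenalized quasi-likelihood estimates."""
    mu = np.asarray(mu_qle, dtype=float)
    diff = np.abs(mu[:, None] - mu[None, :])
    W = 1.0 / np.maximum(diff, eps)     # eps is a hypothetical safeguard
    np.fill_diagonal(W, 0.0)            # no penalty of a journal against itself
    return W

def ranking_lasso_penalty(mu, W):
    """Weighted L1 penalty: sum over pairs i < j of w_ij * |mu_i - mu_j|."""
    mu = np.asarray(mu, dtype=float)
    diff = np.abs(mu[:, None] - mu[None, :])
    upper = np.triu_indices(len(mu), k=1)
    return float(np.sum(W[upper] * diff[upper]))

# Hypothetical unpenalized estimates for three journals
mu_demo = np.array([1.0, 0.5, -1.5])
W_demo = adaptive_weights(mu_demo)
penalty_at_qle = ranking_lasso_penalty(mu_demo, W_demo)
```

Evaluated at the unpenalized estimates themselves, each pair contributes exactly one to the penalty, so the penalty equals the number of pairs; closer pairs receive larger weights and are therefore fused first as the bound is tightened.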
Lasso tuning parameters are often determined by cross-validation. Unfortunately, the inter-journal ‘tournament’ structure of the data does not allow the identification of internal replication, hence it is not clear how cross-validation can be applied to citation data. Alternatively, tuning parameters can be determined by minimization of suitable information criteria. The usual Akaike information criterion is not valid with quasi-likelihood estimation because the likelihood function is formally unspecified. A valid alternative is based on the Takeuchi information criterion (TIC; Takeuchi, 1976) which extends the Akaike information criterion when the likelihood function is misspecified. Let $\hat\mu(s)$ denote the solution of (8) for a given value of the bound $s$. Then the optimal value for $s$ is chosen by minimization of
$$\mathrm{TIC}(s) = -2\, \hat\ell(s) + 2\, \operatorname{trace}\{J(s)\, I(s)^{-1}\},$$
where $\hat\ell(s) = \ell\{\hat\mu(s)\}$ is the misspecified log-likelihood of the Stigler model
$$\ell(\mu) = \sum_{i<j} \left[ c_{ij} (\mu_i - \mu_j) - t_{ij} \ln\{1 + \exp(\mu_i - \mu_j)\} \right]$$
computed at $\hat\mu(s)$, $J(s) = \operatorname{var}\{\nabla \ell(\mu)\}|_{\mu = \hat\mu(s)}$ and $I(s) = -E\{\nabla^2 \ell(\mu)\}|_{\mu = \hat\mu(s)}$. Under the assumed quasi-Stigler model, $J(s) = \phi\, I(s)$ and the TIC statistic reduces to
$$\mathrm{TIC}(s) = -2\, \hat\ell(s) + 2\, \phi\, p(s),$$
where $p(s)$ is the number of distinct groups formed with bound $s$. The dispersion parameter $\phi$ can be estimated as in (7). The effect of overdispersion is inflation of the AIC model-dimension penalty.
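The reduced form of the criterion is simple to evaluate once a penalized solution is available. The Python sketch below computes the working binomial log-likelihood, up to an additive constant that does not depend on the export scores, and the reduced TIC statistic. The function names are hypothetical, and counting groups by rounding the fitted scores to a fixed number of decimals is an assumption of this sketch.

```python
import numpy as np

def stigler_loglik(C, mu):
    """Working binomial log-likelihood (up to a constant); C[i, j] = citations of i by j."""
    C = np.asarray(C, dtype=float)
    mu = np.asarray(mu, dtype=float)
    n = len(mu)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            t_ij = C[i, j] + C[j, i]
            d = mu[i] - mu[j]
            total += C[i, j] * d - t_ij * np.log1p(np.exp(d))
    return total

def tic(C, mu_s, phi, decimals=8):
    """Reduced TIC: -2 * loglik + 2 * phi * (number of distinct score groups)."""
    n_groups = len(np.unique(np.round(mu_s, decimals)))  # hypothetical grouping rule
    return -2.0 * stigler_loglik(C, mu_s) + 2.0 * phi * n_groups

# Hypothetical two-journal example with complete shrinkage (one group)
C_pair = np.array([[0, 3],
                   [1, 0]])
tic_shrunk = tic(C_pair, mu_s=[0.0, 0.0], phi=1.5)
```

In practice the statistic would be evaluated along the whole path of solutions and the bound with the smallest value retained; a dispersion estimate larger than one makes each additional group more expensive than it would be under the usual Akaike criterion.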
Figure 5 displays the path plot of the ranking lasso, while Table 5 reports estimated export scores corresponding to the solution identified by TIC. See also Table 4 for a comparison with the Thomson Reuters published rankings. The path plot of Figure 5 visualizes how the estimates of the export scores vary as the degree of shrinkage decreases, i.e., as the bound increases. The plot confirms the outstanding position of Journal of the Royal Statistical Society Series B, the leader in the ranking at any level of shrinkage. Also Annals of Statistics keeps the second position for about three-quarters of the path before joining the paths of Biometrika and Journal of the American Statistical Association. Biometrics is solitary in fifth position for almost the whole of its path. The TIC statistic identifies a sparse solution with only 10 groups. According to TIC, the five top journals are followed by a group of six further journals, namely Journal of the Royal Statistical Society Series A, Bernoulli, Scandinavian Journal of Statistics, Biostatistics, Journal of Computational and Graphical Statistics, and Technometrics. However, the main conclusion from this ranking-lasso analysis is that many of the estimated journal export scores are not clearly distinguishable from one another.
6 Comparison with results from the UK Research Assessment Exercise
6.1 Background
In the United Kingdom, the quality of the research carried out in universities is assessed periodically by the government-supported funding councils, as a primary basis for future funding allocations. At the time of writing, the most recent such assessment to be completed was the 2008 Research Assessment Exercise (RAE 2008), full details of which are online at www.rae.ac.uk. The next such assessment to report, at the end of 2014, will be the similar ‘Research Excellence Framework’ (REF). Each unit of assessment is an academic ‘department’, corresponding to a specified research discipline. In RAE 2008, ‘Statistics and Operational Research’ was one of 67 such research disciplines; in contrast the 2014 REF has only 36 separate discipline areas identified for assessment, and research in Statistics will be part of a new and much larger ‘Mathematical Sciences’ unit of assessment. The results from RAE 2008 are therefore likely to provide the last opportunity to make a directly Statistics-focused comparison with journal rankings.
It should be noted that the word ‘department’ in RAE 2008 refers to a discipline-specific group of researchers submitted for assessment by a university, or sometimes by two universities together: a ‘department’ in RAE 2008 need not be an established academic unit within a university, and indeed many of the RAE 2008 Statistics and Operational Research ‘departments’ were actually groups of researchers working in university departments of Mathematics or other disciplines.
It is often argued that the substantial cost of assessing research outputs through review by a panel of experts, as was done in RAE 2008, might be reduced by employing suitable metrics based upon citation data. See, for example, Jump (2014). Here we briefly explore this in a quite specific way, through data on journals rather than on the citations attracted by individual research papers submitted for assessment.
The comparisons to be made here can also be viewed as exploring an aspect of ‘criterion validity’ of the various journal-ranking methods: if highly ranked journals tend to contain high-quality research, then there should be evidence through strong correlations, even at the ‘department’ level of aggregation, between expert-panel assessments of research quality and journal-ranking scores.
6.2 Data and methods
We examine only Subpanel 22, ‘Statistics and Operational Research’ of RAE 2008. The specific data used here are:

The detailed ‘RA2’ (research outputs) submissions made by departments to RAE 2008. These list up to 4 research outputs per submitted researcher.

The published RAE 2008 results on the assessed quality of research outputs, namely the ‘Outputs subprofile’ for each department.
From the RA2 data, only research outputs categorized in RAE 2008 as ‘Journal Article’ are considered here. For each such article, the journal’s name is found in the ‘Publisher’ field of the data. A complication is that the name of any given journal can appear in many different ways in the RA2 data, for example ‘Journal of the Royal Statistical Society B’, ‘Journal of the Royal Statistical Society Series B: Statistical Methodology’, etc.; and the International Standard Serial Number (ISSN) codes as entered in the RA2 data are similarly unreliable. Unambiguously resolving all of the many different representations of journal names proved to be the most time-consuming part of the comparison exercise reported here.
The RAE 2008 ‘Outputs sub-profile’ for each department gives the assessed percentage of research outputs at each of five quality levels, these being ‘world leading’ (shorthand code ‘4*’), ‘internationally excellent’ (shorthand ‘3*’), then ‘2*’, ‘1*’ and ‘Unclassified’. For example, the Outputs sub-profile for University of Oxford, the highest-rated Statistics and Operational Research submission in RAE 2008, is
Quality level    4*     3*     2*     1*     U
Percentage       37.0   49.5   11.4   2.1    0
Our focus will be on the fractions at the 4* and 3* quality levels, since those are used as the basis for research funding. Specifically, in the comparisons made here the RAE ‘score’ used will be the percentage at 4* plus one-third of the percentage at 3*, computed from each department’s RAE 2008 Outputs sub-profile. Thus, for example, Oxford’s RAE 2008 score is calculated as $37.0 + 49.5/3 = 53.5$. This scoring formula is essentially the one used since 2010 to determine funding-council allocations; we have considered also various other possibilities, such as simply the percentage at 4*, or the percentage at 3* or higher, and found that the results below are not sensitive to this choice.
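The scoring rule is easily written as a one-line function. The Python sketch below, with a hypothetical function name, reproduces the calculation for the Oxford sub-profile quoted above.

```python
def rae_score(pct_4star, pct_3star):
    """RAE 'score': percentage at 4* plus one-third of the percentage at 3*."""
    return pct_4star + pct_3star / 3.0

# Oxford's RAE 2008 Outputs sub-profile: 37.0% at 4*, 49.5% at 3*
oxford_score = rae_score(37.0, 49.5)
```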
For each one of the journal-ranking methods listed in Table 3, a bibliometrics-based comparator score per department is then constructed in a natural way as follows. Each RAE-submitted journal article is scored individually, for example by the Impact Factor of the journal in which it appeared; and those individual article scores are then averaged across all of a department’s RAE-submitted journal articles. For the averaging, we use the simple arithmetic mean of scores; an exception is that Stigler-model export scores are exponentiated prior to averaging, so that they are positive-valued like the scores for the other methods considered. Use of the median was considered as an alternative to the mean; it was found to produce very similar results, which accordingly will not be reported here.
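The construction of the department-level comparator score can be sketched as follows in Python. The function name, journal labels and scores are hypothetical; articles in journals without an available score are simply skipped in this sketch.

```python
import math
from statistics import mean

def department_score(article_journals, journal_scores, exponentiate=False):
    """Mean journal score over a department's submitted articles.

    Articles in journals with no available score are skipped; set
    exponentiate=True for Stigler-model export scores.
    """
    scored = [journal_scores[j] for j in article_journals if j in journal_scores]
    if not scored:
        return None                      # nothing to average
    if exponentiate:
        scored = [math.exp(s) for s in scored]
    return mean(scored)

# Hypothetical export scores and a hypothetical department submission
toy_export_scores = {"Journal A": 0.0, "Journal B": math.log(3.0)}
toy_submission = ["Journal A", "Journal B", "Unscored Journal"]
toy_mean = department_score(toy_submission, toy_export_scores, exponentiate=True)
```

Replacing `mean` by `median` from the same standard-library module gives the alternative summary mentioned above.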
A complicating factor for the simple scoring scheme just described is that journal scores were not readily available for all of the journals named in the RAE submissions. For the various ‘global’ ranking measures (cf. Table 3), scores were available for the 110 journals in the JCR Statistics and Probability category, which covers approximately 70% of the RAE-submitted journal articles to be scored. For the Stigler model as used in this paper, though, only the subset of 47 Statistics journals listed in Table 1 are scored; and this subset accounts for just under half of the RAE-submitted journal articles. In the following we have ignored all articles that appeared in unscored journals, and used the rest. To enable a more direct comparison with the use of Stigler-model scores, for each of the ‘global’ indices we computed also a restricted version of its mean score for each department, i.e., restricted to using scores for only the 47 Statistics journals from Table 1.
Of the 30 departments submitting work in ‘Statistics and Operational Research’ to RAE 2008, 4 turned out to have substantially less than 50% of their submitted journal articles in the JCR Statistics and Probability category of journals. The data from those 4 departments, which were relatively small groups and whose RAE-submitted work was mainly in Operational Research, are omitted from the following analysis.
The statistical methods used below to examine department-level relationships between the RAE scores and journal-based scores are simply correlation coefficients and scatterplots. Given the arbitrary nature of data-availability for this particular exercise, anything more sophisticated would seem inappropriate.
6.3 Results
Table 6 shows, for bibliometrics-based mean scores based on each of the various journal-ranking measures discussed in this paper, the computed correlation with departmental RAE score. The main features of Table 6 are:

The Article Influence and Stigler Model scores correlate more strongly with RAE results than do scores based on the other journal-ranking measures.

The various ‘global’ measures show stronger correlation with the RAE results when they are used only to score articles from the 47 Statistics journals of Table 1, rather than to score everything from the larger set of journals in the JCR Statistics and Probability category.
The first of these findings unsurprisingly gives clear support to the notion that the use of bivariate citation counts, which take account of the source of each citation and hence lead to measures of journal ‘prestige’ rather than ‘popularity’, is important if a resultant ranking of journals should relate strongly to the perceived quality of published articles. The second finding is more interesting: for good agreement with departmental RAE ratings, it can be substantially better to score only those journals that are in a relatively homogeneous subset than to use all of the scores that might be available for a larger set of journals. In the present context, for example, citation patterns for research in Probability are known to differ appreciably from those in Statistics, and ‘global’ scoring of journals across these disciplines would tend not to rate highly even the very best work in Probability.
The strongest correlations found in Table 6 are those based on journal export scores from the Stigler model, from columns ‘SM’ and ‘SM grouped’ of Table 5. The departmental means of grouped export scores from the ranking-lasso method correlate most strongly with RAE scores, a finding that supports the notion that small estimated differences among journals are likely to be spurious. Figure 6 (left panel) shows the relationship between RAE score and the mean of ‘SM grouped’ exponentiated journal export scores, for the 26 departments whose RAE-submitted journal articles were predominantly in the JCR Statistics and Probability category; the correlation as reported in Table 6 is 0.82. The four largest outliers from a straight-line relationship are identified in the plot, and it is notable that all of those four departments are such that the ratio
$$\frac{\text{number of RAE-submitted journal articles in journals scored by the Stigler model}}{\text{number of RAE-submitted journal articles}} \qquad (9)$$
is less than one-half. Thus the largest outliers are all departments for which the majority of RAE-submitted journal articles are not actually scored by our application of the Stigler model, and this seems entirely to be expected. The right panel of Figure 6 plots the same scores but now omitting all of the 13 departments whose ratio (9) is less than one-half. The result is, as expected, much closer to a straight-line relationship; the correlation in this restricted set of the most ‘Statistical’ departments increases to 0.88.
Some brief remarks on interpretation of these findings appear in Section 7.5 below. The data and R-language code for this comparison are included in this paper’s Supplementary Web Materials.
7 Concluding remarks
7.1 The role of statistical modelling in citation analysis
In his Presidential Address at the 2011 Institute of Mathematical Statistics Annual Meeting about controversial aspects of measuring research performance through bibliometrics, Professor Peter Hall concluded that
‘As statisticians we should become more involved in these matters than we are. We are often the subject of the analyses discussed above, and almost alone we have the skills to respond to them, for example by developing new methodologies or by pointing out that existing approaches are challenged. To illustrate the fact that issues that are obvious to statisticians are often ignored in bibliometric analysis, I mention that many proponents of impact factors, and other aspects of citation analysis, have little concept of the problems caused by averaging very heavy tailed data. (Citation data are typically of this type.) We should definitely take a greater interest in this area’ (Hall, 2011).
The model-based approach to journal ranking discussed in this paper is a contribution in the direction that Professor Hall recommended. Explicit statistical modelling of citation data has two important merits. First, transparency, since model assumptions need to be clearly stated and can be assessed through standard diagnostic tools. Secondly, the evaluation and reporting of uncertainty in statistical models can be based upon well-established methods.
7.2 The importance of reporting uncertainty in journal rankings
Many journals’ websites report the latest journal Impact Factor and the journal’s corresponding rank in its category. Very small differences in the reported Impact Factor often imply large differences in the corresponding rankings of Statistics journals. Statisticians should naturally be concerned about whether such differences are significant. Our analyses conclude that many of the apparent differences among estimated export scores are insignificant, and thus differences in journal ranks are often not reliable. The clear difficulty of discriminating between journals based on citation data is further evidence that the use of journal rankings for evaluation of individual researchers will often — and perhaps always — be inappropriate.
In view of the uncertainty in rankings, it makes sense to ask whether the use of ‘grouped’ ranks such as those that emerge from the lasso method of Section 5.5 should be universally advocated. If the rankings or associated scores are to be used for prediction purposes, then the usual arguments for shrinkage methods apply and such grouping, to help eliminate apparent but spurious differences between journals, is likely to be beneficial; predictions based on grouped ranks or scores are likely to be at least as good as those made without the grouping, as indeed we found in Section 6.3 in connection with RAE 2008 outcomes. For presentational purposes, though, the key requirement is at least some indication of the amount of uncertainty, and ungrouped estimates coupled with realistically wide intervals, as in the centipede plot of Figure 4, will often suffice.
7.3 A ‘read papers’ effect?
Read papers organised by the Research Section of the Royal Statistical Society are a distinctive aspect of the Journal of the Royal Statistical Society Series B. It is natural to ask whether there is a ‘read papers effect’ which might explain the prominence of that journal under the metric used in this paper. During the study period 2001–2010, Journal of the Royal Statistical Society Series B published in total 446 articles, 36 of which were read papers. Half of the read papers were published during the three years 2002–2004. The Journal of the Royal Statistical Society Series B received in total 2,554 citations from papers published in 2010, with 1,029 of those citations coming from other Statistics journals in the list. Despite the fact that read papers were only 8.1% of all published Journal of the Royal Statistical Society Series B papers, they accounted for 25.4% of all citations received by Journal of the Royal Statistical Society Series B in 2010, and 23.1% of the citations from the other Statistics journals in the list.
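As a rough check on these proportions, using only the figures quoted above and assuming that the article share and the citation shares refer to the same set of articles:

\[ \frac{36}{446} \approx 0.081, \qquad \frac{25.4\%}{8.1\%} \approx 3.1, \qquad \frac{23.1\%}{8.1\%} \approx 2.9 . \]

On that reading, a read paper attracted on average roughly three times as many 2010 citations as an average article in the journal. The two ratios are computed from rounded percentages and are therefore approximate.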
Read papers are certainly an important aspect of the success of Journal of the Royal Statistical Society Series B. However, not all read papers contribute strongly to the citations received by the journal. In fact, a closer look at citation counts reveals that the distribution of the citations received by read papers is highly skewed, just as it is for ‘standard’ papers. The most cited read paper published in 2001–2010 was Spiegelhalter et al. (2002), which alone received 11.9% of all Journal of the Royal Statistical Society Series B citations in 2010, and 7.4% of those received from other Statistics journals in the list. Some 75% of the remaining read papers published in the study period each received less than 0.5% of the 2010 Journal of the Royal Statistical Society Series B citations.
A precise quantification of the read-paper effect is difficult. Refitting the Stigler model dropping the citations received by read papers seems an unfair exercise. Proper evaluation of the read-paper effect would require removal also of the citations received by other papers derived from read papers and published either in Journal of the Royal Statistical Society Series B or elsewhere.
7.4 Possible extensions
Fractioned citations.
The analyses discussed in this paper are based on the total numbers of citations exchanged by pairs of journals in a given period and available through the Journal Citation Reports. One potential drawback of this approach is that citations are all counted equally, irrespective of the number of references contained in the citing paper. A number of recent papers in the bibliometric literature (e.g., Zitt and Small, 2008; Moed, 2010; Leydesdorff and Opthof, 2010; Leydesdorff and Bornmann, 2011) suggest recomputing the Impact Factor and other citation indices by using fractional counting, in which each citation is counted as $1/n$, with $n$ being the number of references in the citing paper. Fractional counting is a natural expedient to take account of varying lengths of reference lists in papers; for example, a typical review article contains many more references than does a short, technical research paper. The Stigler model extends easily to handle such fractional counting, for example through the quasi-symmetry formulation (4); and the rest of the methodology described here would apply with straightforward modifications.
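To make the difference between whole and fractional counting concrete, the following Python sketch builds both kinds of cross-citation count from a list of citing papers. It is a minimal illustration, not the authors' code: the function name `citation_counts`, the input layout and the two example papers are assumptions made for the example, and it simplifies by supposing that every reference points to a journal in the list.

```python
from collections import defaultdict
from fractions import Fraction

def citation_counts(papers, fractional=False):
    """Count citations exchanged between journals.

    papers: iterable of (citing_journal, cited_journals) pairs, where
    cited_journals holds one entry per reference in the citing paper.
    Returns a dict mapping (cited, citing) to a count. With
    fractional=True each reference contributes 1/n, where n is the
    length of the citing paper's reference list.
    """
    counts = defaultdict(Fraction)
    for citing_journal, cited_journals in papers:
        n = len(cited_journals)
        if n == 0:
            continue  # a paper with no references exports nothing
        weight = Fraction(1, n) if fractional else Fraction(1)
        for cited in cited_journals:
            counts[(cited, citing_journal)] += weight
    return dict(counts)

# Hypothetical citing papers (illustrative only).
papers = [
    ("A", ["B", "B", "C", "A"]),  # longer reference list, published in A
    ("C", ["B", "A"]),            # short paper, published in C
]

whole = citation_counts(papers)
frac = citation_counts(papers, fractional=True)
```

Under fractional counting every citing paper distributes a total weight of one, so the longer reference list no longer dominates: journal B receives weight 1/2 from each of the two papers, although it is cited twice by the first and once by the second.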
Evolution of export scores.
This paper discusses a ‘static’ Stigler model fitted to data extracted from a single JCR edition. A natural extension would be to study the evolution of citation exchange between pairs of journals over several years, through a dynamic version of the Stigler model. A general form for such a model is
\[ \text{log-odds}(\text{journal } i \text{ is cited by journal } j \text{ in year } t) = \mu_i(t) - \mu_j(t), \]
where each journal’s time-dependent export score $\mu_i(t)$ is assumed to be a separate, smooth function of $t$. Such a model would not only facilitate the systematic study of time trends in the relative intellectual influence of journals, it would also ‘borrow strength’ across years to help smooth out spurious variation, whether it be ‘random’ variation arising from the allocation of citing papers to a specific year’s JCR edition, or variation caused by transient, idiosyncratic patterns of citation. A variety of such dynamic extensions of the Bradley-Terry model have been developed in other contexts, especially the modelling of sports data; see, for example, Fahrmeir and Tutz (1994), Glickman (1999), Knorr-Held (2000) and Cattelan et al. (2013).
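A dynamic model of this kind, in which the log-odds that one journal is cited by another equals the difference of their time-dependent export scores, can be illustrated with a small Python sketch. Everything specific here is an assumption made for the example: the export scores are taken to be linear in time (the simplest smooth function; the text only requires smoothness), the intercepts and slopes are invented, and the helper names `linear_score` and `citation_probability` are hypothetical. No fitting is attempted; the sketch only shows how the scores translate into citation probabilities.

```python
import math

def linear_score(intercept, slope):
    """Return a time-dependent export score mu(t) = intercept + slope * t.
    The linear form is an illustrative assumption."""
    return lambda t: intercept + slope * t

def citation_probability(mu_i, mu_j, t):
    """Probability that, in a citation linking journals i and j in year t,
    journal i is the one cited: the inverse logit of mu_i(t) - mu_j(t)."""
    return 1.0 / (1.0 + math.exp(-(mu_i(t) - mu_j(t))))

# Hypothetical journals: A has a constant score, B starts lower but rises.
mu_a = linear_score(0.5, 0.0)
mu_b = linear_score(0.0, 0.1)

p_start = citation_probability(mu_a, mu_b, t=0)   # A is ahead at t = 0
p_cross = citation_probability(mu_a, mu_b, t=5)   # scores coincide at t = 5
p_later = citation_probability(mu_a, mu_b, t=10)  # B has overtaken A
```

Only differences between scores matter, so the two probabilities for a given pair and year always sum to one; a fitted model would also need an identifiability constraint on the scores in each year.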
7.5 Citation-based metrics and research assessment
From the strong correlations found in Section 6 between RAE 2008 outcomes and journal-ranking scores, it is tempting to conclude that the expert-review element of such a research assessment might reasonably be replaced, mainly or entirely, by automated scoring of journal articles based on the journals in which they have appeared. Certainly Figure 6 indicates that such scoring, when applied to the main journals of Statistics, can perform quite well as a predictor of RAE outcomes for research groups whose publications have appeared mostly in those journals.
The following points should be noted, however:

(a) Even with correlation as high as 0.88, as in the right panel of Figure 6, there can be substantial differences between departments’ positions based on RAE outcomes and on journal scores. For example, in the right panel of Figure 6 there are two departments whose mean scores based on our application of the Stigler model are between 1.9 and 2.0 and thus essentially equal; but their computed RAE scores, at 16.7 and 30.4, differ very substantially indeed.

(b) High correlation was achieved by scoring only a relatively homogeneous subset of all the journals in which the RAE-submitted work appeared. Scoring a wider set of journals, in order to cover most or all of the journal articles appearing in the RAE 2008 ‘Statistics and Operational Research’ submissions, leads to much lower levels of agreement with RAE results.
In relation to point (a) above it could of course be argued that, in cases such as the two departments mentioned, the RAE 2008 panel of experts got it wrong; or it could be that the difference seen between those two departments in the RAE results is largely attributable to the 40% or so of journal articles for each department that were not scored because they were outside the list in Table 1. Point (b), on the other hand, seems more clearly to be a severe limitation on the potential use of journal scores in place of expert review. The use of cluster analysis as in Section 3, in conjunction with expert judgements about which journals are ‘core’ to disciplines and subdisciplines, can help to establish relatively homogeneous subsets of journals that might reasonably be ranked together; but comparison across the boundaries of such subsets is much more problematic.
The analysis described in this paper concerns journals. It says nothing directly about the possible use of citation data on individual research outputs, as were made available to several of the review panels in the 2014 REF for example. For research in mathematics or statistics it seems clear that such data on recent publications carry little information, mainly because of long and widely varying times taken for good research to achieve ‘impact’ through citations; indeed, the Mathematical Sciences subpanel in REF 2014 chose not to use such data at all. Our analysis does, however, indicate that any counting of citations to inform assessment of research quality should at least take account of the source of each citation.
Acknowledgments
The authors are grateful to Alan Agresti, Mike Titterington, the referees, the Series A Joint Editor and Associate Editor, and the Editor for Discussion Papers, for helpful comments on earlier versions of this work. The kind permission of Thomson Reuters to distribute the JCR 2010 cross-citation counts is also gratefully acknowledged.
This work was supported by the UK Engineering and Physical Sciences Research Council through CRiSM grant EP/D002060/1, by University of Padua grant CDPA131553, and by an IRIDE grant from DAIS, Ca’ Foscari University.
References
 Adie, E. and Roe, W. (2013). Altmetric: enriching scholarly content with article-level discussion and metrics. Learned Publishing 26, 11–17.
 Adler, R., Ewing, J. and Taylor, P. (2009). Citation statistics (with discussion and rejoinder). Statistical Science 24, 1–14.
 Agresti, A. (2013). Categorical Data Analysis. Third Edition. New York: Wiley.
 Alberts, B. (2013). Impact factor distortions. Science 340, 787.
 Amin, M. and Mabe, M. (2000). Impact factors: Use and abuse. Perspectives in Publishing 1, 1–6.
 Archambault, E. and Larivière, V. (2009). History of the journal impact factor: Contingencies and consequences. Scientometrics 79, 635–649.
 Arnold, D. N. and Fowler, K. K. (2011). Nefarious numbers. Notices of the American Mathematical Society 58, 434–437.
 Bergstrom, C. (2007). Eigenfactor: Measuring the value and the prestige of scholarly journals. College & Research Libraries News 68, 314–316.
 Bishop, M. and Bird, C. (2007). BIB’s first impact factor is 24.37. Briefings in Bioinformatics 8, 207.
 Bollen, J., Rodriguez, M. A. and de Sompel, H. V. (2006). Journal status. Scientometrics 69, 669–687.
 Bornmann, L. (2014). Do altmetrics point to the broader impact of research? An overview of benefits and disadvantages of altmetrics. Journal of Informetrics 8, 895–903.
 Bornmann, L. and Marx, W. (2014). How to evaluate individual researchers working in the natural and life sciences meaningfully? A proposal of methods based on percentiles of citations. Scientometrics 98, 487–509.
 Boyack, K. W., Klavans, R. and Börner, K. (2005). Mapping the backbone of science. Scientometrics 64, 351–374.
 Bradley, R. A. and Terry, M. E. (1952). The rank analysis of incomplete block designs. I. The method of paired comparisons. Biometrika 39, 324–345.
 Braun, T., Glänzel, W. and Schubert, A. (2006). A Hirsch-type index for journals. Scientometrics 69, 169–173.
 Brin, S. and Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30, 107–117.
 Carpenter, M. P. and Narin, F. (1973). Clustering of scientific journals. Journal of the American Society for Information Science 24, 425–436.
 Cattelan, M. (2012). Models for paired comparison data: A review with emphasis on dependent data. Statistical Science 27, 412–433.
 Cattelan, M., Varin, C. and Firth, D. (2013). Dynamic Bradley-Terry modelling of sports tournaments. Journal of the Royal Statistical Society Series C 62, 135–150.
 Chen, K.-M., Jen, T.-H. and Wu, M. (2014). Estimating the accuracies of journal impact factor through bootstrap. Journal of Informetrics 8, 181–196.
 David, H. A. (1963). The Method of Paired Comparisons. New York: Hafner Press.
 Fahrmeir, L. and Tutz, G. (1994). Dynamic stochastic models for time-dependent ordered paired comparison systems. Journal of the American Statistical Association 89, 1438–1449.
 Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.
 Firth, D. (2012). qvcalc: Quasi variances for factor effects in statistical models. R package version 0.88. URL CRAN.R-project.org/package=qvcalc
 Firth, D. and de Menezes, R. X. (2005). Quasi-variances. Biometrika 91, 65–80.
 Franceschet, M. (2010). Ten good reasons to use the Eigenfactor metrics. Information Processing & Management 46, 555–558.
 Frandsen, T. F. (2007). Journal self-citations - Analysing the JIF mechanism. Journal of Informetrics 1, 47–58.
 Garfield, E. (1955). Citation indices for Science. Science 122, 108–111.
 Garfield, E. (1972). Citation analysis as a tool in journal evaluation. Science 178, 471–479.
 Glänzel, W. and Moed, H. F. (2002). Journal impact measures in bibliometric research. Scientometrics 53, 171–193.
 Glickman, M. (1999). Parameter estimation in large dynamic paired comparison experiments. Journal of the Royal Statistical Society Series C 48, 377–394.
 Goldstein, H. and Spiegelhalter, D. J. (1996). League tables and their limitations: Statistical issues in comparisons of institutional performance. Journal of the Royal Statistical Society Series A 159, 385–443.
 Gross, P. L. K. and Gross, E. M. (1927). College libraries and chemical education. Science 66, 385–389.
 Hall, P. G. (2009). Comment: Citation statistics. Statistical Science 24, 25–26.
 Hall, P. G. (2011). ‘Ranking our excellence,’ or ‘assessing our quality,’ or whatever…. Institute of Mathematical Statistics Bulletin 40, 12–14.
 Hall, P. and Miller, H. (2009). Using the bootstrap to quantify the authority of an empirical ranking. The Annals of Statistics 37, 3929–3959.
 Hall, P. and Miller, H. (2010). Modeling the variability of rankings. The Annals of Statistics 38, 2652–2677.
 IEEE Board of Directors (2013). IEEE position statement on ‘Appropriate use of bibliometric indicators for the assessment of journals, research proposals, and individuals’.
 JournalRanking.com (2007). Present ranking endeavors. Red Jasper Limited. URL www.journalranking.com/ranking/web/content/intro.html
 Jump, P. (2014). Light dose of metrics could ease REF pain. Times Higher Education, No. 2178 (13 November 2014), 11. URL www.timeshighereducation.co.uk/news/regulardietofmetricslitemaymakefullrefmorepalatable/2016912.article
 Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.
 Knorr-Held, L. (2000). Dynamic rating of sports teams. Journal of the Royal Statistical Society Series D 49, 261–276.
 Lehmann, S., Lautrup, B. E. and Jackson, A. D. (2009). Comment: Citation statistics. Statistical Science 24, 17–20.
 Leydesdorff, L. (2004). Clusters and maps of science based on biconnected graphs in Journal Citation Reports. Journal of Documentation 60, 371–427.
 Leydesdorff, L. and Bornmann, L. (2011). How fractional counting of citations affects the impact factor: Normalization in terms of differences in citation potentials among fields of science. Journal of the American Society for Information Science and Technology 62, 217–229.
 Leydesdorff, L. and Opthof, T. (2010). Scopus’ Source Normalized Impact per Paper (SNIP) versus the Journal Impact Factor based on fractional counting of citations. Journal of the American Society for Information Science and Technology 61, 2365–2369.
 Leydesdorff, L., Radicchi, F., Bornmann, L., Castellano, C. and de Nooy, W. (2013). Field-normalized impact factors (IFs): A comparison of rescaling and fractionally counted IFs. Journal of the American Society for Information Science and Technology 64, 2299–2309.
 Liner, G. H. and Amin, M. (2004). Methods of ranking economics journals. Atlantic Economic Journal 32, 140–149.
 Liu, X., Glänzel, W. and De Moor, B. (2012). Optimal and hierarchical clustering of large-scale hybrid networks for scientific mapping. Scientometrics 91, 473–493.
 Marx, W. and Bornmann, L. (2013). Journal impact factor: ‘the poor man’s citation analysis’ and alternative approaches. European Science Editing 39, 62–63.
 Masarotto, G. and Varin, C. (2012). The ranking lasso and its application to sport tournaments. Annals of Applied Statistics 6, 1949–1970.
 Moed, H. F. (2010). Measuring contextual citation impact of scientific journals. Journal of Informetrics 4, 265–277.
 Morris, C. N. (1983). Parametric empirical Bayes inference: Theory and applications. Journal of the American Statistical Association 78, 47–65.
 Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Second edition. Springer.
 Palacios-Huerta, I. and Volij, O. (2004). The measurement of intellectual influence. Econometrica 72, 963–977.
 Pratelli, L., Baccini, A., Barabesi, L. and Marcheselli, M. (2012). Statistical analysis of the Hirsch Index. Scandinavian Journal of Statistics 39, 681–694.
 Putirka, K., Kunz, M., Swainson, I. and Thomson, J. (2013). Journal Impact Factors: Their relevance and their influence on society-published scientific journals. American Mineralogist 98, 1055–1065.
 R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL www.R-project.org
 Ritzberger, K. (2008). A ranking of journals in economics and related fields. German Economic Review 9, 402–430.
 San Francisco Declaration on Research Assessment (DORA) (2013). URL am.ascb.org/dora/
 Seglen, P. O. (1997). Why the impact factor of journals should not be used for evaluating research. British Medical Journal 314, 497.
 Sevinc, A. (2004). Manipulating impact factor: An unethical issue or an editor’s choice? Swiss Medical Weekly 134, 410.
 Silverman, B. W. (2009). Comment: Citation statistics. Statistical Science 24, 21–24.
 Spiegelhalter, D. J., Best, N. G., Carlin, B. P. and van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society Series B 64, 583–639.
 Stigler, G. J., Stigler, S. M. and Friedland, C. (1995). The journals of economics. The Journal of Political Economy 103, 331–359.
 Stigler, S. M. (1994). Citation patterns in the journals of statistics and probability. Statistical Science 9, 94–108.
 Takeuchi, K. (1976). Distribution of informational statistics and a criterion of model fitting. Suri-Kagaku (Mathematical Sciences) (in Japanese) 153, 12–18.
 Theoharakis, V. and Skordia, M. (2003). How do statisticians perceive statistics journals? The American Statistician 57, 115–123.
 Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B 58, 267–288.
 Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society Series B 67, 91–108.
 Turner, H. and Firth, D. (2012). Bradley-Terry models in R: The BradleyTerry2 package. Journal of Statistical Software 48, 1–21.
 van Nierop, E. (2009). Why do statistics journals have low impact factors? Statistica Neerlandica 63, 52–62.
 van Noorden, R. (2012). Researchers feel pressure to cite superfluous papers. Nature News, February 12, 2012.
 Waltman, L., van Eck, N. J., van Leeuwen, T. N. and Visser, M. S. (2013). Some modifications to the SNIP journal impact indicator. Journal of Informetrics 7, 272–285.
 Waltman, L. and van Eck, N. J. (2013). Source normalized indicators of citation impact: An overview of different approaches and an empirical comparison. Scientometrics 96, 699–716.
 Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61, 439–447.
 West, J. D. (2010). Eigenfactor: Ranking and mapping scientific knowledge. Ph.D. Dissertation. University of Washington.
 Wilhite, A. W. and Fong, E. A. (2012). Coercive citation in academic publishing. Science 335, 542–543.
 Zitt, M. and Small, H. (2008). Modifying the journal impact factor by fractional citation weighting: The audience factor. Journal of the American Society for Information Science and Technology 59, 1856–1860.
 Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.