Predicting Citation Counts with a Neural Network
Abstract
We here describe and present results of a simple neutral network that predicts individual researchers’ future citation counts based on a variety of data from the researchers’ past. For publications available on the open accessserver arXiv.org we find a higher predictability than previous studies.
1 Introduction
Measuring and predicting scientific excellence is as daunting as controversial. But regardless of how one feels about quantifying quality, measures for scientific success are being used, and they will continue to be used. The best that we, as scientists, can do is to at least come up with good measures.
The attempt to quantify scientific success by the number of publications dates back to the early 19th century [1], but the idea really took off with the advent of the internet when data for publications and citations became easily accessible. Since then, a large variety of different measures has been suggested which rely on citation counts, publication statistics, impact factors, media coverage, and more. For a recent review, the reader is referred to [2].
We maintain that the best way to assess a researcher’s promise is to study their work in depth. But bowing to the need to have a tool for administrative and organizational purposes that is fast and easy to use and allows at least superficial evaluation, we here report on the training of a neural network to improve on presently existing measures.
Due to the large variety of existing measures, we will not compare the method presented here to all of them. We will focus in particular on three previous studies that put our work into perspective. This is (1) the original paper about the predictivity of the Hirschindex [3] (hereafter index), (2) a 2012 study [4] which used a linear regression model to predict the index for research in the life sciences, and (3) a 2017 report [5] on various machine learning models used to predict the index for a large cohort of authors in the computer sciences.
Our paper is organized as follows. In the next section we clarify exactly what we aim to achieve. In section 3 we document which input data we have used and how we have encoded it. In section 4 we explain how we set up the neural network, and in 5 we will present our results. We finish with a discussion and conclusion in section 6.
2 Aim
Before we build a neural network, we first need to make precise what we mean by “predictive” and how we will measure this “predictiveness.”
Our neural network will be fed publication data (section 3) for a training group of researchers during a first phase of publishing activity, hereafter referred to as the ‘initial period’. The aim is then to use the neural network (section 4) to predict individual authors’ performance in a second phase of publishing activity, hereafter referred to as the ‘forecasting period’. This input data for the initial period does not include citations from papers that were only published during the forecasting period. The prediction period starts at 1/1/2008 which is chosen such that we can evaluate the neural network’s performance for predictions up to 10 years into the future of the initial period.
Once the network has learned forecasting from the training group, we make a forecast for the remaining researchers – the “validation group” – and evaluate how good our forecast was. This is to say, in this present study we do not make actual future forecasts because we want to assess how well out network performs, but our method is designed so that it could be used to make real forecasts.
We will use two different approaches to evaluate the forecasting performance in the validation group.
The first approach (see section 5.1) follows the procedure used in Ref. [3], which evaluated the predictiveness of the index for various quantities in terms of the correlation coefficient, . In this approach, we do not include citations from papers in the initial period in the number of citations to be forecasted in the forecasting period. In making this clean separation, we get a better grasp on predicting future scientific achievement as opposed to cumulative achievement.
For this first approach, we follow the notation of [3] and denote the number of citations received between and , i.e. during the forecasting period, as . Since it is known that the index roughly scales with the squareroot of citations, we will more precisely use to make it easier to compare our results with those of Ref. [3].
The second approach (see section 5.2) follows a different procedure which is better suited for comparison with the results of Refs. [4, 5]. For this, we feed the neural network the same input data as in the first approach but then predict the cumulative index after years until the end of the forecasting period. Furthermore, in this second approach we use the coefficient of determination, , instead of the correlation coefficient, , to quantify the goodness of our prediction because the same procedure was followed in Refs. [4, 5].
Unless otherwise stated, our general procedure is to employ 20 rounds of Monte Carlo crossvalidation – i.e. redo the random split into training and validation data 20 times and retrain the neural network – and report the mean as well as the standard deviation. One reason for employing crossvalidation is that different splits of our dataset into training and validation data leads to slightly different results as will be further discussed below. Another reason is that our dataset is not particularly large. Crossvalidation then allows to avoid splitting our dataset into a training, a validation, and test group while still avoiding overfitting to a particular split into training and validation data.
3 Data
We have obtained the publications for each author from the arXiv through the publicly available Open Archives Initiative API [6] and corresponding citation data from Paperscape [7]. For the purposes of this present work, we consider only the publications in the ‘physics’ set, which gives us a total of 934,650 papers. Journal Impact Factors (JIFs) are taken from Ref. [8]. We group together similar author names and treat them as a single author by the same procedure as laid out in [9].
From the complete dataset, we select a sample of authors and trim it in various ways. First, we require that they published their first arXiv paper between 1/1/1996 and 1/1/2003. We have chosen that period to span the time between 5 and 12 years prior to the cutoff which matches the procedure of Ref. [5].
We then remove authors who have published fewer than 5 or more than 500 papers to avoid statistical outliers which would unduly decrease the predictivity of our method.
Finally, we exclude collaborations, since their publication activity differs greatly from that of individuals. For this, we remove all author names that contain the word ‘collaboration’ and all papers with more than 30 authors.
After this, we are left with a sample of 39,412 author IDs. From these, we randomly chose a subset of 28,000 as the ‘training group’. The rest is our validation data by help of which we evaluate how well the neural network performs after training is completed. Note that this random split into training and validation data is done independently for each round of crossvalidation.
4 The Neural Network
We used a feedforward neural network, which means that the neural network consists of layers of neurons where the input of the neurons in one layer is the output of the neurons in the previous layer and the layers are not arranged as a cycle. The output of the first layer are the input data described in Appendix A, and the output of the last layer is taken as the output of the whole neural network. In our case, the last layer consists of ten neurons such that the neural network’s output is list of ten real numbers.
For this network, the output of the neurons in one layer follows from the output of the neurons in the previous layer by
(1) 
Here, is a vector which contains the outputs of the neurons in one layer, is a vector which contains the outputs of the neurons in the previous layer, is a real matrix whose elements are called the weights and is a real vector with elements which are called the biases. Further, is a function which is applied to each element of the vector and is called the activation function. The weights and biases are different for each layer, so that for each added layer one gets another weight matrix and another bias vector. These are the free parameters of the neural network which are determined by the training procedure.
During the training of the neural network, the output of the neural network is calculated with the training data described above as input and the weights and biases are adjusted in order to get the output of the neural network for each author as close as possible to the actual . More precisely, the weights and biases are adjusted in order to minimize the socalled loss function which we take to be the mean squared error across all authors in the training set. Note that for a socalled fullyconnected layer all elements of the weights matrix and the bias vector are independently adjusted, while for other types of layers, e.g. socalled convolutional layers, some structure is imposed. As will be explained below, we use both types of layers.
Since our dataset is not particularly large, we tried to avoid overfitting by reducing the number of parameters and structured the network as follows (see Figure 1):

The first layer is a convolutional layer and contains k = 70 neurons for each paper. Although convolutional layers are normally used to exploit translational symmetry in the input data, we use a convolutional layer with a convolution window of a single paper in order to ensure invariance under permutations of the papers. The effect of this is that each paper has a corresponding set of 70 neurons which see only the data from that paper, and each paper’s neuronset has matching weights and biases. The input to this layer contains every input except the broadness value, i.e., it contains all per paper input but not the per author input.

In a next step, these 70 neurons per paper are reduced to 70 neurons in total by averaging each of the 70 neurons over the different papers. After this, no information about individual papers is left, and only average values are retained in the remaining 70 neurons. This layer does not add free parameters to the neural network.

After that, a fullyconnected layer with 70 neurons is added. In addition to the output of the previous layer, this layer obtains input which is specific to the author not to the individual papers.

The final layer is a fullyconnected layer with ten neurons with a ReLu activation function. The first neuron represents the prediction 1 year after the cutoff and the other neurons represent the differences between the prediction after and years. For instance, if the neural network’s output is , the corresponding prediction is for , for , for , etc. This ensures that the neural network’s prediction is a monotonically increasing time series.
A detailed list of input data for each level can be found in Appendix A. The neural network architecture described above can be implemented in Keras with just a few lines of code which are reproduced in Appendix B.
The neurons, except for the neurons in the output layer, are taken to have activation functions. Training is done using an Adam optimizer and a mean squared error loss function. No regularization is employed and the final result is obtained after 150 epochs of training with a batch size of 50.
We would like to end this section with a comment on the way the neural network described above makes predictions for different years after the cutoff. As described above, each year corresponds to one of the 10 neurons in the very last layer of the neural network. An alternative would be to have 10 neural networks with only a single neuron in the output layer. Each of the networks would then be trained to make a prediction for one particular . One might argue that this alternative leaves more freedom for the neural networks to learn the specific requirements in making a prediction for one specific instead of for all at the same time. However, it seems this is not the case in practice, since we have tried both approaches with the only difference in the neural network architecture being the number of neurons in the output layer and the resulting performance was very similar.
5 Results
For both approaches, we will compare the neural network’s performance to that of a naive index predictor which is given only the index of an author for which a prediction is to be made. By this naive index predictor we mean the following: For a given quantity to predict, i.e. the future cumulative index or , take all authors in the training set with a given index at the time of the cutoff. Then, calculate the arithmetic mean of the quantity to be predicted and take this mean value as a prediction for authors in the validation set given their .
Note that there may be authors in the validation set with, typically high, values of that are not present in the training set. For those authors, the prediction is determined as follows. First, a linear polynomial is fitted to the naive index predictor for the values of that can be calculated in the way described in the previous paragraph. Then, the value of the fitted polynomial is taken as the prediction for the other values of .
5.1 Comparison with the index
For the neural network trained to predict , we are only interested in the prediction for years after the cutoff. Therefore, we ignore the prediction of both the neural network and the naive index predictor for the first 9 years after the cutoff.
The correlation coefficients betweeen the neural network’s prediction and is (see Fig. 2). The error here and in the following is the standard deviation across the 20 rounds of crossvalidation. In comparison, the naive index predictor described above yields . We clearly see that the neural network performs significantly better than the naive index predictor.
Next, we tested whether any one type of input data is especially important to the performance of the neural network by training the network with single types of input data separately removed. This, we found, barely changes the results. The biggest impact comes from removing the paper dates and the number of citations, which make the correlation coefficient drop to and , respectively, which is still very close to the original . We may speculate that the network gathers its information from combinations of input which are themselves partly redundant, so that removing any single one has little effect.
Finally, we checked whether the neural net still performs better than the index when given only citation counts as input data. This resulted in a correlation coefficient of . We see that the neural net performs better than the index even with only the number of citations as input data.
For more details on the comparison between our results and those of Ref. [3], please see Appendix C.
5.2 Comparison with earlier machine learning predictions
For the second approach, we compare the predicted with the actual cumulative index for years after the cutoff date. We quantify the goodness of this prediction with the coefficient of determination, , both for the neural network and for the naive index predictor discussed at the beginning of this section. Note that we calculate separately for each by restricting the predictions as well as the actual values to that particular . The result is shown in Fig. 3. The error bars show standard deviations calculated from the 20 rounds of crossvalidation.
We see that the neural net and the naive index predictor are similarly predictive for , but the neural network’s prediction becomes better in comparison for larger . This agrees with the findings Refs. [4, 5].
As mentioned in section 2, there are fluctuations of the neural network’s performance with the rounds of crossvalidation. These fluctuations are illustrated in Fig. 4. In particular, Fig. 4 shows the neural network’s (Fig. 4, left) and (Fig. 4, right) after years for the different rounds of crossvalidation. We see that there are nonnegligible fluctuations in the network’s performance. A possible explanation could be that the fluctuations are due to the neural network overfitting on particular kinds of splits of the whole dataset into training and validation data. However, we think it is more likely that the fluctuations are due to intrinsic properties of the dataset. This is because the naive index predictor has only a few tens of parameters so that overfitting should not be an issue but Fig. 4 shows that there are comparable fluctuations in this naive index predictor’s performance nonetheless.
In Fig. 5, left, we show the result for the neural network operating on the training and validation datasets of the first round of crossvalidation when training for 5 and 10 additional epochs. We see that there are fluctuations of the neural network’s performance with the training epoch which typically affect at or below the one percent level. In Fig. 5, right, we show the result of averaging across all round of crossvalidation, where the averaging is done separately for the neural networks obtained after training for 150, 155, and 160 epochs. We see that the fluctuations with the training epoch have cancelled and the difference between the averages after training for 150, 155, and 160 epochs is negligible.
Quantitatively, our results give higher values of than both Ref. [4] (0.48 after 10 years) and Ref. [5] (0.72 after 10 years). However, since not only our neural network but also our simple index predictor give higher values than Refs. [4] and [5], this difference is probably partly due to the different datasets. We have therefore also applied the predictor proposed in [4] to our dataset, with the results shown in Figure 3 (see Appendix D for details). The predictability of our dataset is indeed higher than that studied in [4], but the prediction of our network still outperforms the previous study. Unfortunately, a similar direct comparison to the results from [5] which used yet another dataset is not possible. Still, our value of after ten years is remarkably predictive, especially given that the methods we have employed here are likely to improve further in the soon future.
Since the value of by itself is not so illuminating, we show in Figure 6 some examples for which
we display the actual index versus the networkprediction with the training and validation datasets from the first round of crossvalidation.
Here too we have investigated how important various input data are to the neural network’s performance by removing one at a time. The results for and are shown in Fig. 7, where we plot the ratio of the coefficients of determination with all input () and with certain inputs removed (). The lower the ratio, the more important the removed input was. The error bars show standard deviations calculated from the 20 rounds of crossvalidation.
We see that for , the number of citations is the only important input, while for other inputs gain importance with the number of citations still being the most important one. That the citation data are most important for agrees with the results of Ref. [4] and Ref. [5]. We also see from Fig. 7 that is always larger than which again indicates that the input data are partly redundant. However, the changes we notice due to some input removals are so small that fluctuations in the training results are no longer negligible.
6 Discussion
We have demonstrated here that neural nets are powerful tools to make predictions for the citations a researcher’s work accumulates. These predictions are likely to improve in the future. One of the major limitations of this present study, for example, is that our sample does not include papers which are not on the arXiv, and that about one of five published papers could not be associated with a Journal Impact Factor (see Appendix A). But the more bibliometric data becomes available the more input the network can be fed, and thus predictivity is bound to become better for some more time.
The methods we used here are straightforward to implement and do not require much computing resources. It is thus foreseeable that in the near future the use of neural nets to predict researcher’s promise will become more widely spread.
We therefore want to urge the community to not ignore this trend in the hope it will go away. It would benefit academic research if scientists themselves proposed a variety of predictors, and offered a variety of data, to more accurately present the variety of ways to do highquality research.
In this work we focused on the index in order to compare our results with previous results, and also because this value is easy to extract from existing data. But a lot of valuable information about researchers and their work is presently difficult or impossible to obtain and analyze. For example, how often researchers are named in acknowledgements, their seminar and conference activity, the frequency by which they act as peer reviewers and/or editors, or how specialized their research topic is. All these are important factuals about the many individual approaches to research. Providing and analyzing such data would enable us to develop measures for success tailored to specific purposes and thereby avoid that researchers focus efforts on optimizing citation counts.
Acknowledgements
TM thanks Sebastian Weichwald for helpful discussions. This work was supported by the Foundational Questions Institute (FQXi).
Appendix A
The building blocks of most of the input data for the neural network are lists where the position in the list corresponds to a paper and the value at a certain position corresponds to some input data associated with the corresponding paper. E.g. there is a list containing the number of citations for each paper of the given author which in Python syntax would look like , meaning the mostcited paper of this author has citations, the secondmost cited paper has citations etc. A corresponding list for the number of coauthors would look like , meaning the paper with citations was a singleauthored paper, the paper with citations has two authors etc.
Since different authors have different numbers of papers, these lists will have different lengths for different authors. However, our neural network requires a fixedsize input. Therefore, we take the lists for all authors to have the same length as those of the author with the largest number of papers. The positions in the lists that do not correspond to a paper, are filled with zeros.
The biggest part of the input to the neural network is then the list of all the lists described above and consists of

A list which contains 1 at each position in the list which corresponds to a paper and 0 at each position which corresponds to zeropadding.

A list which contains the number of citations of each paper.

A list which contains the publication date of each paper relative to the cutoff date.

A list which contains each paper’s pagerank [12], an interactive measure of relevance that works similar to Google’s pagerank algorithm just that, instead being based on hyperlinks, a paper’s pagerank is based on the citation graph. The pagerank is calculated from the citation graph at the cutoff date, 1/1/2008.

A list which contains each paper’s length.

A list which contains 0 for each paper with an empty journal reference and 1 for each paper with a nonempty journal reference.

A list which contains the JIF of the journal each paper is published in (further details below). If a paper is not published or no JIF is known for a journal, we take the corresponding input to be 0. The JIFs are taken at the cutoff date.

A list which contains the number of coauthors of each paper.

Three lists which contain the coauthors’ minimum, maximum, and average pagerank. Here, the pagerank is calculated from the coauthor graph at the cutoff date.

For each arXiv category a list which contains zeros except at position which correspond to papers which are in the respective category. In order to reduce the amount of data, categories of the form a.b are all treated as category a, e.g. astroph.CO and astroph.HE are treated as the same category.

For each paper, a 50dimensional vector representing a paper’s latent topic distribution, obtained from the keyword analysis done by [9] when operating on the arXiv data up to the cutoff.
The final input to the neural network is a ‘broadness’ value calculated from a keyword analysis [9] with the same data as the paper topics described above. This broadness quantifies how widely spread the topics which an author publishes on are over all arXiv categories. Since this broadness value is one value per author and not one value per paper it is handled separately from the other inputs to the neural network.
All input data except the categories and the paper vectors is normalized to unit variance and zero mean. More specifically, for each input (e.g. number of citations, publication date, broadness, …) a transformation is determined from the training data such that this transformation brings the given input to zero mean and unit variance for the training data across all authors. This transformation is then applied both to the training and the validation input data.
To assign the JIFs, we associate papers in our database with a journal by heuristically matching the journal reference given in the arXiv metadata to the journal abbreviation from Ref. [8]. Concretely, we reduce both the journal reference in the arXiv metadata and the journal abbreviation from Ref. [8] to lowercase alphanumeric characters and cut at the first numeric character. Next, we remove the suffixes ’vol’ and ’volume’ if present. If the two values obtained this way are identical, we consider the given arXiv paper to be published in the corresponding journal.
To reduce the number of papers where this procedure does not work, we have further used a manually assembled translation table. This table contains identifications between reduced arXiv journal references to journals from Ref. [8] for which the method outlined in the previous paragraph does not work. The table allows us to match the 69 most frequent reduced journal references from the arXiv that could not be mapped by the previous method. By this procedure, we have assigned a Journal Impact Factor to 378,134 of the 477,176 papers with a nonempty arXiv journal reference.
Appendix B
Appendix C
The correlation coefficient obtained from the naive index predictor is roughly the same as the correlation coefficient obtained by plotting over the index at the time of the cutoff for both the training and validation data and calculating the correlation coefficient from that, see Fig. 8. Note that this second way of calculating a correlation coefficient corresponds to what was done in Ref. [3].
The correlation coefficients from Fig. 8 are consistenly smaller than those of the sample PRB80 from Ref. [3] but higher than those of the sample APS95 from Ref. [3]. See Ref. [3] for a discussion of the differences between the samples PRB80 and APS95 regarding their differing correlation coefficients. Our sample differs from both PRB80 and APS95 in both the data source and the cuts applied. Therefore, it is not surprising that there are differences in the resulting correlation coefficients and the results are not directly comparable. One important difference is that we employ the same cutoff, 1/1/2008, for all authors while Ref. [3] applies a different cutoff for each author at 12 years after each author’s first paper.
Appendix D
We used what the authors of [4] refer to as the “simplified model,” that – as they have shown – performs almost as well as their full model on their dataset. We changed the selected journals from Nature, Science, Nature Neuroscience, PNAS and Neuron to Science, Nature, PNAS, and PRL. As laid out in the supplementary material of [4], we used the Rpackage ‘glmnet’ with . Note that we did not employ our own Monte Carlo crossvalidation here. Instead – as was done in [4] – we relied on the crossvalidation included in the ‘glmnet‘ package.
Footnotes
 The reader be warned that these examples were not randomly chosen. We handselected a set of noticeably different index outcomes for purely illustrative purposes.
References
 A. Csiszar. The catalogue that made metrics, and changed science. Nature, 551(7679):163–165, 2017.
 L. Waltman. A review of the literature on citation impact indicators. arXiv:1507.02099 [cs.DL], 2015.
 J. E. Hirsch. Does the h index have predictive power? Proceedings of the National Academy of Sciences, 104(49):19193–19198, 2007.
 D. E. Acuna, S. Allesina, and Kording K. P. Predicting scientific success. Nature, 489(7415):201â202, 2012.
 L. Weihs and O. Etzioni. Learning to predict citationbased impact measures. In 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 1–10, June 2017.
 arxiv oai api. https://arxiv.org/help/oa/index.
 Damien P. George and Robert Knegjens. Paperscape. http://paperscape.org.
 Clarivate Analytics. 20012009 journal citation reports, 2017.
 T. Price and S. Hossenfelder. Measuring scientific broadness. in preparation, 2018.
 François Chollet et al. Keras. https://github.com/kerasteam/keras, 2015.
 Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: Largescale machine learning on heterogeneous systems. https://www.tensorflow.org/, 2015.
 Samuel M.H. Pagerank + sparse matrices+ python (ipython notebook). http://blog.samuelmh.com/2015/02/pageranksparsematricespythonipython.html.