Citation Count Analysis for Papers with Preprints
Abstract
We explore the degree to which papers prepublished on arXiv garner more citations, in an attempt to paint a sharper picture of fairness issues related to prepublishing. A paper’s citation count is estimated using a negative-binomial generalized linear model (GLM) while observing a binary variable which indicates whether the paper has been prepublished. We control for author influence (via the authors’ h-index at the time of paper writing), publication venue, and overall time that paper has been available on arXiv. Our analysis only includes papers that were eventually accepted for publication at top-tier CS conferences, and were posted on arXiv either before or after the acceptance notification. We observe that papers submitted to arXiv before acceptance have, on average, 65% more citations in the following year compared to papers submitted after. We note that this finding is not causal, and discuss possible next steps.
1 Introduction
Preprint servers like arXiv enable researchers to self-distribute scientific paper drafts with minimal moderation. While some of these papers are never published elsewhere, many are also accepted for publication at academic venues after a double-blind peer-review process. Authors of these papers are faced with the decision to distribute their papers on arXiv before or after acceptance at their target publication venues. We refer to those papers that are posted on arXiv before acceptance as prepublished.
With the increasing popularity of prepublishing computer science (CS) papers on arXiv (Sutton and Gong, 2017), this decision has been the subject of a considerable debate in the CS research community (among others).
On the other side of the debate, some researchers prefer publishing drafts of their work on arXiv before it is accepted for publication. One reason is to allow other researchers to build on their work, which can expedite scientific developments. Another reason is to allow researchers aside from the official reviewers at the target venue to provide feedback, which can be used to further improve the paper even before it is published (i.e., before the camera-ready due date). Authors also may use arXiv for “flag-planting”, i.e., claiming a research contribution before getting scooped by other researchers who may be doing similar work.
Quoting a recent blog post by Yoav Goldberg: “[T]here is also a rising trend of people using arXiv for flag-planting, and to circumvent the peer-review process. This is especially true for work coming from ‘strong’ groups. Currently, there is practically no downside to posting your (often very preliminary, often incomplete) work to arXiv, only potential benefits.”
To motivate this study, consider the following scenario: Two researchers, R1 and R2, independently developed an outstanding method around the same time. R1 decides to prepublish her draft on arXiv while R2 decides to wait until the paper is accepted for publication at the target venue. Naturally, the earlier exposure of the research community to R1’s paper may result in researchers attributing most of the credit to R1 rather than R2, despite both being eventually published at the same venue. We may consequently observe a higher number of citations for R1’s work than for R2’s work. This is especially concerning when metrics derived from citation counts (e.g., h-index) play a significant role in hiring and promotion decisions in universities and research labs (despite the controversy surrounding number of citations as a measure of a paper’s impact).
In this draft, we explore the degree to which prepublished papers garner more citations, in an attempt to paint a sharper picture of arXiv-related fairness issues. We use a negative-binomial generalized linear model (GLM) to regress a paper’s number of citations onto a binary indicator representing arXiv prepublishing, and control for author influence (via the authors’ h-index at the time of paper writing), publication venue, and overall time that paper has been available on arXiv. We analyze papers that were eventually accepted for publication at top-tier CS conferences, and were posted on arXiv either before or after the acceptance notification. We observe a significant positive association between citation count and prepublishing on arXiv.
Our results are consistent with previous work (e.g., Larivière et al., 2014) which found papers posted on arXiv to have a higher citation rate (among all papers published in Web of Science).
2 Data
Here, we describe the data we used for this analysis in some detail.
2.1 Venues
All papers included in our study were eventually published at one of the following top-tier computer science conferences, which have a significant portion of their papers on arXiv: AAAI, ACL, CVPR, ECCV, EMNLP, FOCS, HLT-NAACL, ICCV, ICML, ICRA, IJCAI, INFOCOM, KDD, NIPS, SODA and WWW. We include papers published since 2007 and no later than 2016, so that we can count the number of citations they receive during the year following their publication.
To obtain this data, we queried Semantic Scholar for all the papers published in a particular conference. We then looked up each of these papers in the arXiv metadata dump contributed by Sutton and Gong (2017),
See Table 1 for a per-conference breakdown of the 4392 papers in our dataset.
Venue  No. of Papers 

AAAI  3726 
NIPS  3393 
IJCAI  3001 
WWW  2958 
ACL  2676 
ICML  2200 
KDD  1661 
ECCV  1477 
EMNLP  1248 
SODA  1234 
HLT-NAACL  876 
CVPR  467 
FOCS  305 
INFOCOM  183 
ICRA  182 
ICCV  156 
We used the Calls for Papers Wiki
2.2 Citations
The response variable we would like to model is the number of times a paper is cited in the calendar year following the conference, which we label as “All Citations.”
We also experiment with a modified definition of the response variable meant to count meaningful citations (e.g., omitting self-citations), which we label as “Influential Citations” to distinguish it from “All Citations.” Our definition of influential citations is based on Valenzuela et al. (2015), and only counts citations with no overlap in the author lists. In an influential citation, the cited paper is referenced three times or more in the narrative of the citing paper, not consistently combined with other references, mentioned in the context of experimental results, or explicitly mentioned as a foundation for the citing paper.
2.3 Author influence
We suspect that well-known authors tend to garner more citations than lesser-known authors. In order to control for this source of bias in our analysis, we model an observed variable which represents the authors’ influence. Given the paper in question, we first compute the h-index for each of its authors one year before it was published. Then we take the maximum h-index among all the authors of a paper and use this single value as a per-paper summary for author influence. Let $h(a, y)$ be the h-index for author $a$ at a specified year $y$. The author influence for paper $p$ can then be written as:
$$\mathrm{influence}(p) = \max_{a \in \mathrm{authors}(p)} h\big(a, \mathrm{year}(p) - 1\big)$$
Because the h-index is nonlinear in its relationship with citation counts, we model it as a categorical variable with ten buckets, each containing the same number of papers. The first bucket included all papers with $\mathrm{influence}(p) \le 6$ and the last bucket included all papers with $\mathrm{influence}(p) > 41$.
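The author-influence summary described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the code used in the study: the function names, the input layout (one list of per-paper citation counts per author, taken one year before publication), and the toy numbers are all assumptions made for the example.

```python
from statistics import quantiles


def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def author_influence(author_citation_counts):
    """Maximum h-index over the authors of one paper.

    `author_citation_counts` is a hypothetical input: one list of
    per-paper citation counts for each author of the paper.
    """
    return max(h_index(counts) for counts in author_citation_counts)


def decile_edges(influences):
    """Nine cut points that split the values into ten roughly equal buckets."""
    return quantiles(influences, n=10)


# Toy example: a paper with three authors (made-up citation counts).
paper_authors = [[10, 8, 5, 4, 3], [25, 8, 5, 3, 3], [1, 0]]
influence = author_influence(paper_authors)

# Toy example: cut points for 100 papers with influence values 1..100.
edges = decile_edges(list(range(1, 101)))
```

With heavily tied values the buckets produced by such cut points are only approximately equal in size, which may be why the bucket boundaries in the regression output are not evenly spaced.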
2.4 Time available on arXiv
Papers prepublished on arXiv before acceptance have had more time to gather citations than those posted to arXiv after acceptance, which may explain any differences in citation counts. To control for this factor, we compute the fraction of the year the paper has been available on arXiv. In particular, we measure the number of days between the first arXiv submission and the beginning of the calendar year in which we count citations of that paper, then divide by the number of days in the year, as illustrated by the following Python code.
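from datetime import datetime
import numpy as np

# arxiv_submission_date is a datetime.date; conf_year is the conference year
next_year_jan_1 = datetime(year=conf_year + 1, month=1, day=1).date()
delta = next_year_jan_1 - arxiv_submission_date
frac_year_remaining = np.maximum(delta.days / 365, 0)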
We clamp the difference (delta.days) at a minimum of zero because a paper may be put on arXiv for the first time long after it is officially published.
2.5 Submitted to arXiv before vs. after acceptance
This variable is an indicator for whether the paper was posted to arXiv before or after it was accepted for publication. Ideally, we’d like to observe whether the arXiv submission date is before or after the acceptance notification, but since the exact acceptance dates were not available for all venues, we use a conservative estimate of 28 days after the submission deadline of the conference as our prepublishing threshold. Figure (b) contains a histogram showing the distribution of arXiv submission dates relative to the paper’s target venue deadline date.
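As a minimal sketch of this thresholding rule, the indicator can be computed from two dates. The function name, the example dates, and the choice of an inclusive comparison (a paper posted exactly on day 28 counts as prepublished) are assumptions for illustration; the text does not specify how the boundary day is handled.

```python
from datetime import date, timedelta

# 28-day window after the submission deadline, as described in the text.
GRACE_PERIOD = timedelta(days=28)


def submitted_before_deadline(arxiv_date, conf_deadline):
    """True if the first arXiv version appeared no later than deadline + 28 days.

    The inclusive bound (<=) is an assumption made for this sketch.
    """
    return arxiv_date <= conf_deadline + GRACE_PERIOD


# Hypothetical conference deadline and arXiv posting dates.
deadline = date(2016, 2, 1)
early = submitted_before_deadline(date(2016, 2, 20), deadline)
late = submitted_before_deadline(date(2016, 6, 15), deadline)
```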
2.6 Summary of variables
To summarize, we compute the following variables for each paper $p$:

cites_1year: number of papers that cited $p$ and were published in the calendar year following the official publication of $p$ (continuous response variable).
influential_cites_1year: number of papers that cited $p$, were published in the calendar year following the official publication of $p$, and satisfied the ‘influential’ criteria (continuous response variable).
max_hindex_decile: the decile into which the maximum (across all authors) h-index of $p$ falls (categorical feature, 10 values).
submitted_before_deadline: whether $p$ was submitted to arXiv before the prepublishing threshold, i.e., no later than 28 days after the conference submission deadline (binary feature).
frac_year_remaining: fraction of a year remaining from the arXiv submission date of $p$ until the start of the year after the conference in which $p$ was published (continuous feature).
conf: the conference where $p$ was published (categorical feature, 16 values).
3 Analysis
Here, we describe how we model the variables discussed in the previous section, and then analyze the results.
3.1 Model
Negative binomial GLMs are a common option for modeling count-valued response variables that exhibit overdispersion (i.e., when the variance of the variable exceeds its mean, thus deviating from the standard Poisson count model), which is typical of real-world data (Hilbe, 2007). One can interpret the negative binomial distribution as a marginalized Poisson distribution where its mean is drawn from a Gamma distribution.
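The Gamma-Poisson interpretation makes the overdispersion explicit. The short derivation below uses one common parameterization, which is an assumption here since the text does not fix one: the Poisson rate $\lambda$ has a Gamma distribution with shape $r$ and scale $\mu / r$, so that its mean is $\mu$.

```latex
\begin{align*}
y \mid \lambda &\sim \mathrm{Poisson}(\lambda), \qquad
\lambda \sim \mathrm{Gamma}\!\left(\text{shape } r,\ \text{scale } \mu / r\right), \\
\mathbb{E}[y] &= \mathbb{E}[\lambda] = \mu, \\
\mathrm{Var}[y] &= \mathbb{E}\big[\mathrm{Var}[y \mid \lambda]\big] + \mathrm{Var}\big[\mathbb{E}[y \mid \lambda]\big]
  = \mathbb{E}[\lambda] + \mathrm{Var}[\lambda]
  = \mu + \frac{\mu^{2}}{r}.
\end{align*}
```

The variance therefore exceeds the mean by $\mu^2 / r$, and the Poisson model is recovered in the limit $r \to \infty$.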
The conditional mean model is expressed as:
$$\mathbb{E}[y \mid \mathbf{x}] = \exp\Big(\sum_i w_i x_i\Big)$$
where $y$ is the response variable, $\mathbf{x}$ is the vector of covariates/features, and $w_i$ is the learned weight of the $i$-th feature $x_i$. In our case, the response variable $y$ is either cites_1year or influential_cites_1year. Within our feature vector $\mathbf{x}$, our primary covariate of interest is submitted_before_deadline, while the other features are possible confounders that we want to control for.
We use Python’s statsmodels (Seabold and Perktold, 2010) to fit the following regression models (expressed in the standard formula mini-language from R that is also used in statsmodels):
cites_1year ~ max_hindex_decile + frac_year_remaining + conf
cites_1year ~ max_hindex_decile + frac_year_remaining + conf + submitted_before_deadline
The only difference between these two models is the presence of the submitted_before_deadline binary variable. We repeat this with influential_cites_1year as the response variable.
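Nested models such as these two, which differ by a single coefficient, are commonly compared with a likelihood ratio test. The sketch below illustrates the arithmetic using only the standard library; it is not the authors' code, and the function name and the log-likelihood values in the example are hypothetical. It relies on the fact that a chi-square variable with one degree of freedom has survival function erfc(sqrt(x / 2)).

```python
import math


def likelihood_ratio_test_1df(loglik_reduced, loglik_full):
    """LR statistic and p-value for nested models differing by one parameter.

    Under the null hypothesis, 2 * (ll_full - ll_reduced) is approximately
    chi-square distributed with 1 degree of freedom.
    """
    statistic = 2.0 * (loglik_full - loglik_reduced)
    # Guard against tiny negative values caused by numerical noise.
    p_value = math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods, not values from this study.
stat, p = likelihood_ratio_test_1df(loglik_reduced=-1002.0, loglik_full=-1000.0)
```

A small p-value indicates that the extra variable improves the fit by more than would be expected by chance.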
3.2 Results
We conducted a likelihood ratio test on the two models and the resulting p-value was tiny. This means that the second model has a significantly higher likelihood, indicating that it better fits the data. The coefficients of the full model that includes submitted_before_deadline are shown below:
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:             cites_1year   No. Observations:                4392
Model:                             GLM   Df Residuals:                    4365
Model Family:         NegativeBinomial   Df Model:                          26
Link Function:                     log   Scale:                  3.30268468922
Method:                           IRLS   Log-Likelihood:               -14832.
Date:                 Mon, 12 Mar 2018   Deviance:                      6376.3
Time:                         11:46:30   Pearson chi2:                1.44e+04
No. Iterations:                     11
=====================================================================================================
                                        coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------
Intercept                             0.9192      0.198      4.634      0.000       0.530       1.308
max_hindex_decile[T.(6.0, 10.0]]      0.2249      0.160      1.408      0.159      -0.088       0.538
max_hindex_decile[T.(10.0, 13.0]]     0.3543      0.165      2.147      0.032       0.031       0.678
max_hindex_decile[T.(13.0, 16.0]]     0.3265      0.158      2.062      0.039       0.016       0.637
max_hindex_decile[T.(16.0, 19.0]]     0.5266      0.154      3.416      0.001       0.224       0.829
max_hindex_decile[T.(19.0, 22.0]]     0.7298      0.161      4.532      0.000       0.414       1.045
max_hindex_decile[T.(22.0, 26.0]]     0.4174      0.155      2.695      0.007       0.114       0.721
max_hindex_decile[T.(26.0, 32.0]]     0.5917      0.150      3.953      0.000       0.298       0.885
max_hindex_decile[T.(32.0, 41.0]]     0.6185      0.151      4.105      0.000       0.323       0.914
max_hindex_decile[T.(41.0, 99.0]]     1.0595      0.145      7.284      0.000       0.774       1.345
submitted_before_deadline[T.True]     0.5029      0.083      6.080      0.000       0.341       0.665
conf[T.ACL]                           1.2415      0.201      6.169      0.000       0.847       1.636
conf[T.CVPR]                          1.4699      0.155      9.488      0.000       1.166       1.773
conf[T.ECCV]                          1.4585      0.190      7.658      0.000       1.085       1.832
conf[T.EMNLP]                         0.9585      0.207      4.637      0.000       0.553       1.364
conf[T.FOCS]                          0.0017      0.178      0.010      0.992      -0.347       0.350
conf[T.HLT-NAACL]                     1.1061      0.272      4.060      0.000       0.572       1.640
conf[T.ICCV]                          1.1248      0.208      5.418      0.000       0.718       1.532
conf[T.ICML]                          0.5132      0.147      3.480      0.001       0.224       0.802
conf[T.ICRA]                         -0.0980      0.223     -0.439      0.661      -0.536       0.339
conf[T.IJCAI]                        -0.2673      0.199     -1.341      0.180      -0.658       0.123
conf[T.INFOCOM]                      -0.1444      0.202     -0.715      0.474      -0.540       0.251
conf[T.KDD]                           0.5083      0.213      2.385      0.017       0.091       0.926
conf[T.NIPS]                          0.6280      0.156      4.031      0.000       0.323       0.933
conf[T.SODA]                         -0.6441      0.165     -3.892      0.000      -0.968      -0.320
conf[T.WWW]                           0.5485      0.217      2.531      0.011       0.124       0.973
frac_year_remaining                   0.1710      0.107      1.599      0.110      -0.039       0.381
=====================================================================================================
Due to the $\exp$ term in the regression function, these coefficients can be interpreted as having a multiplicative effect instead of an additive effect as in linear regression. We can thus look at the 0.5029 coefficient of submitted_before_deadline (the coef column), and interpret its effect as multiplying the number of citations by exp(0.5029) = 1.65. In other words, the fitted regression model estimates that papers submitted to arXiv before acceptance, on average, tend to have 65% more citations in the following year compared to papers submitted after.
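The conversion from a log-link coefficient to a multiplicative effect can be checked with a few lines of Python. The sketch below takes the coefficient and its 95% confidence bounds from the regression output above; the helper name `rate_ratio` is a hypothetical name chosen for this example.

```python
import math


def rate_ratio(coef):
    """Multiplicative effect on the expected count implied by a log-link coefficient."""
    return math.exp(coef)


# Coefficient and 95% confidence bounds of submitted_before_deadline,
# copied from the regression table.
coef, ci_low, ci_high = 0.5029, 0.341, 0.665

ratio = rate_ratio(coef)                  # about 1.65
percent_more = 100.0 * (ratio - 1.0)      # about 65
ratio_interval = (rate_ratio(ci_low), rate_ratio(ci_high))
```

Exponentiating the interval bounds in the same way gives a range of roughly 1.41 to 1.94 for the multiplicative effect, i.e., about 41% to 94% more citations.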
The difference is even more pronounced when we look at the number of influential citations.
Note that in this framework, each categorical variable with $k$ values has only $k - 1$ coefficients. Each coefficient can be interpreted as being relative to some baseline value, which is determined by the left-out value. For example, the baseline category for max_hindex_decile is [0, 6], and the coefficients for the other nine deciles capture how many more citations one can expect to have with higher h-indices (in an associative, not causal, sense).
In particular, an h-index between 42 and 99 is associated (on average) with more than double the number of next-year citations compared with an h-index between 0 and 6. These coefficients increase in a nearly monotonic way as h-index deciles increase, which is consistent with our intuition that more famous authors tend to get more citations.
Similarly, the baseline conference is AAAI.
The results suggest that frac_year_remaining contributes little to the model: its 95% confidence interval (last two columns) includes 0.
This is somewhat surprising since we expected papers which have been on arXiv for a longer fraction of a given year to have more citations in the following year.
4 Conclusion
Our exploratory analysis shows that publishing a CS paper on arXiv before it is eventually accepted (as opposed to after) for publication at a top-tier target venue is associated with 65% more citations in the calendar year following the conference. Although we take into account other factors which can influence number of citations (namely, author influence, publication venue, time available on arXiv), there may be other confounding factors which we did not include in our study (e.g., author affiliation, paper quality). We invite researchers interested in this analysis to explore the effect of other factors we have not included in the model, and invite conference chairs to conduct randomized controlled experiments in which authors submitting their drafts to the conference agree to prepublish their drafts on arXiv if they are randomly selected.
We note that identifying the potential unfair advantage given to prepublished papers may not give researchers a sufficiently compelling reason to delay posting their paper drafts on arXiv until the review process has completed.
Instead, we encourage the community to adopt anonymous prepublished submissions (with prespecified time limits on the anonymity) on arXiv and related platforms, similar to how the OpenReview platform implemented the peer reviewing process for ICLR 2018.
Acknowledgements
We are grateful to the Semantic Scholar team as well as the teams behind arXiv and WikiCFP for their commitment to promoting transparency and openness in scientific communication. We thank Oren Etzioni, Yoav Goldberg and Mark Neumann for helpful comments.
Footnotes
 See Marti Hearst’s and Kelly Cruz’s thoughtful discussions of this topic at https://acl2017.wordpress.com/2017/02/19/arxivandthefutureofdoubleblindconferencereviewing/ and http://www.astrobetter.com/blog/2011/12/12/topostornottopostpublishingtothearxivbeforeacceptance/.
 https://medium.com/@yoav.goldberg/anadversarialreviewofadversarialgenerationofnaturallanguage409ac3378bd7
 While most peer reviews are not publicly available, a notable exception is the International Conference on Learning Representations (ICLR) which makes all reviews available and also allows any researcher to comment on papers under submission using the openreview.net platform.
 http://webofknowledge.com/
 https://github.com/casutton/csarxivpopularitycode
 http://wikicfp.com
 Alternatively, we could have simply counted all citations a paper received but this would require making stronger assumptions about how the number of citations change over years, which is not the focus of this study.
 We omit detailed results for influential citations for brevity.
 https://iclr.cc/archive/www/doku.php%3Fid=iclr2018:faq.html#what_is_the_signature_field_when_submitting_a_comment_review
References
 Joseph M. Hilbe. Negative Binomial Regression. Cambridge University Press, 2007. doi: 10.1017/CBO9780511811852.
 Vincent Larivière, Cassidy R. Sugimoto, Benoit Macaluso, Stasa Milojevic, Blaise Cronin, and Mike Thelwall. arXiv e-prints and the journal of record: An analysis of roles and relationships. JASIST, 65:1157–1169, 2014.
 Henk F. Moed. The effect of ’open access’ upon citation impact: An analysis of arXiv’s condensed matter section. JASIST, 58:2047–2054, 2007.
 J.S. Seabold and J. Perktold. Statsmodels: Econometric and statistical modeling with python. In Proceedings of the 9th Python in Science Conference, 2010.
 Richard T. Snodgrass. Single- versus double-blind reviewing: an analysis of the literature. SIGMOD Record, 35:8–21, 2006.
 Charles Sutton and Linan Gong. Popularity of arxiv.org within computer science. arXiv e-prints, abs/1710.05225, 2017.
 Andrew Tomkins, Min Zhang, and William D. Heavlin. Reviewer bias in single- versus double-blind peer review. In Proceedings of the National Academy of Sciences of the United States of America, 2017.
 Marco Valenzuela, Vu Ha, and Oren Etzioni. Identifying meaningful citations. In AAAI Workshop: Scholarly Big Data, 2015.