Reproducible data citations for computational research
1 Computational Social Science, ETH Zurich, Zurich, Switzerland
The general purpose of a scientific publication is the exchange and spread of knowledge. A publication usually reports a scientific result and tries to convince the reader that it is valid. With an ever-growing number of papers relying on computational methods that make use of large quantities of data and sophisticated statistical modeling techniques, a textual description of the result is often not enough for a publication to be transparent and reproducible. While there are efforts to encourage sharing of code and data, we currently lack conventions for linking data sources to a computational result that is stated in the main publication text or used to generate a figure or table.
Thus, here I propose a data citation format that allows for an automatic reproduction of all computations. A data citation consists of a descriptor that refers to the functional program code and the input that generated the result. The input itself may be a set of other data citations, such that all data transformations, from the original data sources to the final result, are transparently expressed by a directed graph. Functions can be implemented in a variety of programming languages since data sources are expected to be stored in open and standardized text-based file formats. A publication is then an online file repository consisting of a Hypertext Markup Language (HTML) document and additional data and code source files, together with a summary of all data sources, similar to a list of references in a bibliography.
The amount of knowledge scientists produce every year is steadily increasing, as are the obstacles to clear communication of this work. The Web of Science citation index [1] contains almost as many publication entries for the first 15 years of the 21st century as for the whole 20th century. This is no surprise when scientists need to show their ability to consistently publish results, preferably in high-impact journals, to advance their careers. At the same time, the transparency and interpretability of results are declining for several reasons. Positive, significant results wrapped in a clear story are favored, possibly at the expense of a complete description. Furthermore, with scientific progress, the complexity of new findings naturally rises. Finally, particularly for quantitative research, the digital transformation creates new potential and new challenges. With an unprecedented availability and scale of data [2], scientists make use of data mining and machine learning techniques that may result in comprehensive computational experiments that are difficult to document within the framework of traditional publications.
Ideally, a publication is completely transparent about all of its computational steps. It has been suggested that reproducibility should be a minimum standard in the computational sciences [3, 4]. Reproducibility means that with the same data and methods, the same result as stated in the publication is achievable. Therefore, code and data should be provided along with the paper [5]. This does not guarantee replicability (same method with new data) or, ultimately, generalizability, but it at least enables a first verification of reported results.
Here, a publication format is proposed that ensures automatic reproducibility and is especially suited for publishing complex computational findings. Any computational result stated in the main text of the paper needs to be linked to its data sources. A publication written with this condition does not need to be structured or laid out differently from what would be expected from a typical scientific article. This backward compatibility ensures that a paper can still be submitted to any journal that does not yet adopt these principles. Journals that want to appeal to a broad audience often even ask for a separate methods & materials section so as not to obstruct the reading flow with technical details. The problem is that a typically brief description, together with mathematical notation, cannot completely communicate the actual comprehensive data transformations. With the widespread use of statistical toolboxes, it becomes trivial to make a large number of methodological choices without a detailed discussion of their appropriateness and selection of parameters. Therefore, connecting the article with the actual code and data helps the reader dig deeper when a description is ambiguous or difficult to understand. Furthermore, not everything that is coded needs to be explained in the main text. Programming environments increasingly allow for more concise code, which is potentially more understandable than an ambiguous verbal description.
The idea of such executable papers has been around since the beginnings of electronic publishing [6]. SHARE [7] packages the complete programming environment into a virtual machine, which guarantees successful execution on another computer, but does not try to come up with a standardized way of linking publications with data and code. Another proposal is to accompany each computational result of a paper by a verifiable identifier that is created from a repository containing the code and data that generated the result [8]. In contrast, our goal is to transparently document the sequence of steps necessary to generate the result. In this vein, [9] caches intermediate results in order to avoid long execution times. However, it is limited to a single programming language. Finally, knitr [10] can be used to generate reports from a mix of programming and documenting code, but these are not easily transformed to article formats suited for submission to scientific journals.
The remainder of this paper is organized in three sections. First, the concept of data citations is introduced to formulate a reproducible dependency graph that links all computations. Second, an implementation is described that supports the use of these data citations. And lastly, the paper closes with a discussion of the feasibility of such an approach.
Model of computation
Similar to a bibliographic citation of other published sources, a claim of a computational result should reference its origin. Such a computational citation refers to a data source that may be created from other data sources. For example, in an empirical study, we would start with measured, collected or generated data and perform an analysis that transforms them to an aggregated, numerical result which is then reported in the publication. Assuming a single computation step, it would be sufficient to provide all input and output data and a function (i.e., a computer program) that maps the input to the output. However, most papers refer to multiple computational results or explain results in a set of plots, tables and numerical statements within the main publication text. Additionally, generating a result might involve a sequence of processing steps. Fig. 1 outlines an example of a publication that consists of a set of computational trees with parts of the paper representing the roots and original data sources forming the leaves of the trees. Such a computational tree is a directed acyclic graph, where vertices are described as a result of a function and the input is given by the output of other results or a set of parameters. The code needs to be encapsulated in functions. On this level of abstraction, principles of functional programming are applicable independent of the actual implementation of a function, which can follow any programming paradigm and be written in most programming languages. For a reproducible paper it is required that the provided code consists of pure functions, i.e., the same input always leads to the same output (referential transparency) without side-effects (it does not modify state outside of the scope of the function). Furthermore, a function may output new functions, such that other higher-order functions may take them as input. 
For example, a function creates a model from training data, which is itself a function that is then applied to different data to test its predictive power.
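The model-building example can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `fit_line` and `mean_squared_error` and the toy data are hypothetical and not part of the proposed system. Both functions are pure, and the first one returns a function that the second one takes as input.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]
Model = Callable[[float], float]


def fit_line(training: Sequence[Point]) -> Model:
    """Pure function: fit y = a + b*x by least squares and return the model as a function."""
    xs = [x for x, _ in training]
    ys = [y for _, y in training]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in training) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar

    def predict(x: float) -> float:
        # The returned model only reads values captured at fitting time.
        return intercept + slope * x

    return predict


def mean_squared_error(model: Model, test: Sequence[Point]) -> float:
    """Higher-order function: take a model (a function) as input and score it on test data."""
    return mean((model(x) - y) ** 2 for x, y in test)


# Hypothetical toy data: train on one data set, test on another.
model = fit_line([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
error = mean_squared_error(model, [(3.0, 7.0), (4.0, 9.0)])
```

Because neither function modifies state outside its own scope, calling them again with the same input yields the same model and the same error, which is the property required for automatic reproduction.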
<source uri>: A data source URI (uniform resource identifier) is a unique path relative to the project directory and refers to the storage location of the function output. It may contain several comma-separated URIs in case of multiple outputs. A URI can be used to refer to this data source as the input of another data source generation. Only the original raw data sources do not need such a formal descriptor. Results can be retrieved at the URI once they have been computed.
<data format>: The format in which data is being created or interpreted when specified as an input. Valid formats are explained below.
<function uri>: Identifier that refers to a function implementation provided in a specific programming language.
<execution environment>: Identifier that refers to a programming environment that can execute the specified function. These environments are separately configured and provide a build automation that, for example, includes external libraries.
A nostore flag indicates that this data source should not be stored persistently and should only be computed when referred to by other data sources.
<parameter name>: Name used in function definition. Parameters are not ordered.
<argument value>: Can be directly provided in JSON notation (which includes simple numeric or text values) or as any other data format encoded as a JSON string. Alternatively, a URI of another data source can be specified as an input. Wildcard characters may be used to refer to several URIs that are merged into a single unordered stream or table input.
A parallel flag may be specified to allow the input to be split and processed by multiple processors, when possible.
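To make the descriptor elements concrete, the following Python sketch holds them in a small record and derives the dependency edges between data sources. The class name `DataSource`, its field names, the example URIs and the rule that any string argument matching a known source URI counts as a reference are all assumptions made for illustration; the descriptor syntax itself is not reproduced here, and the parallel flag is left out for brevity.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Any, Dict, List


@dataclass
class DataSource:
    """Hypothetical in-memory form of one data source descriptor."""

    source_uri: str    # <source uri>, relative to the project directory
    data_format: str   # <data format>, e.g. "JSON" or "CSV"
    function_uri: str  # <function uri>
    environment: str   # <execution environment>
    # <parameter name> -> <argument value>; parameters are not ordered
    arguments: Dict[str, Any] = field(default_factory=dict)
    nostore: bool = False  # nostore flag


def dependencies(source: DataSource, known_uris: List[str]) -> List[str]:
    """Return the known source URIs that this descriptor takes as input.

    Assumption: a string argument is treated as a URI pattern, so a wildcard
    such as "results/clean-*.jsonl" matches several data sources at once.
    """
    matched = []
    for value in source.arguments.values():
        if isinstance(value, str):
            matched.extend(uri for uri in known_uris if fnmatchcase(uri, value))
    return sorted(set(matched))


# Hypothetical project with two cleaned inputs merged by a wildcard.
sources = [
    DataSource("results/clean-2015.jsonl", "JSONL", "analysis.clean", "python",
               {"raw": "data/raw-2015.csv"}),
    DataSource("results/clean-2016.jsonl", "JSONL", "analysis.clean", "python",
               {"raw": "data/raw-2016.csv"}),
    DataSource("results/mean.json", "JSON", "analysis.mean", "python",
               {"rows": "results/clean-*.jsonl", "column": "age"}),
]
uris = [s.source_uri for s in sources]
deps = dependencies(sources[2], uris)
```

In this sketch the raw files under `data/` produce no edges, since original data sources carry no descriptor and therefore form the leaves of the graph.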
For all data sources, there needs to be an agreement on how data is stored and can be parsed to serve as input of another function. The exact specification of possible data formats is central for building computational trees independent of programming languages, libraries, and implementations. Therefore, all original, intermediate or resulting data should be made available in non-proprietary, platform-independent data formats, both human-readable and machine-readable. Here, a list of formats is compiled that should cover most needs for scientific data:
JSON as defined in [11]. Can represent primitive data types as well as complex hierarchical data structures. In contrast to the comparably popular Extensible Markup Language (XML), it is less verbose and provides a simple syntax for lists and maps. In principle, JSON alone would be sufficient as the only data exchange format. However, for convenience and computational efficiency, further data formats are included.
JSONL: Sequence of lines of valid JSON values (separated by a line break). Each line can be parsed independently. Each JSON line typically represents an instance of the same data structure. It is suitable for large data sets enabling distributed computing and data stream processing.
CSV as defined in [12]. Contrary to the specification, a header is always required to simplify usage. Tabular data is commonly used in scientific analysis. Typically, each row represents an observation and each column a variable.
TXT: Any other text file format that is interpreted as a sequence of Unicode characters (encoded in UTF-8). Instead of supporting further text-based data formats such as XML, Hypertext Markup Language (HTML), Scalable Vector Graphics (SVG) or Resource Description Framework (RDF) serializations directly, it is up to the user-defined function to decide for an appropriate parsing method.
BIN: Any binary file format. Interpreted as a sequence of bytes. While we would prefer to only have human-readable text-based formats, for some purposes binary data such as Portable Network Graphics (PNG) or Joint Photographic Experts Group (JPEG) image files are more appropriate. Applications include image processing or a plot output, when text-based vector graphics such as SVG are not the best choice.
FUNC: A functional object for use in higher-order functions. The only data type that is not required to be serializable (i.e., storable in a persistent file system).
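The format list above can be read as a dispatch table from format name to parser. The following sketch shows one possible mapping in Python using only the standard library; the function names `deserialize` and `read_jsonl` are hypothetical, and CSV rows are returned as plain dictionaries here rather than as a dedicated table type.

```python
import csv
import io
import json
from typing import Any, Iterator


def read_jsonl(text: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line; each line is independent."""
    for line in text.split("\n"):
        if line.strip():
            yield json.loads(line)


def deserialize(data: bytes, data_format: str) -> Any:
    """Map stored bytes to an in-memory value according to the data format."""
    if data_format == "BIN":
        return data  # sequence of bytes, passed through unchanged
    text = data.decode("utf-8")  # all text formats are UTF-8 encoded
    if data_format == "JSON":
        return json.loads(text)
    if data_format == "JSONL":
        return read_jsonl(text)  # lazy stream of values
    if data_format == "CSV":
        # The first row is always the header, as required above.
        return list(csv.DictReader(io.StringIO(text)))
    if data_format == "TXT":
        return text  # parsing is left to the user-defined function
    # FUNC objects are not serializable, so they never reach this point.
    raise ValueError(f"unsupported data format: {data_format}")
```

For example, `deserialize(b"id,age\n1,30\n", "CSV")` yields a single row with the keys `id` and `age`; values stay strings because CSV carries no type information.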
To allow for an automatic reproduction of all results, we need to re-assess the suitability of current publication format technologies. The most widely used document types, LaTeX and Word, usually compiled to the Portable Document Format (PDF), cannot easily satisfy this criterion. While both provide excellent tools to produce printable journal articles, I propose that the focus should be on electronic web-based publishing instead. By relying on one of the core web technologies, HTML [13], with its long history of standardization and its wide range of applications, almost any kind of user interface can be represented. The simple concept of hyperlinks makes it an ideal choice for creating scientific publications that need to refer to data and code sources in a reproducible way. Writing articles directly in HTML has the advantage of adding semantic features such as the proposed data citations. Other markup languages (such as Wiki) may be simpler to start with but could lack crucial elements of a scientific publication. HTML is undoubtedly more complicated, yet easy to learn when only utilizing the subset of features that is needed for writing articles rather than more involved web applications.
For reproducibility, an article written in HTML is published together with code and data that are linked to the main text. It is supplied as a self-contained project folder that includes the article, code, and data with an arbitrary sub-folder structure. It can be a folder of a computer file system or published on a web server under a permanent URL. A single directory for all project content and the use of text-based file formats also simplify the use of version control systems.
A file named sources.json lists all data source descriptors of a project and should be referenced by the article HTML document. Conceptually, this is similar to a BibTeX file used for literature references. A browser can trigger a computation of a data source, either because it is missing or because it needs to be overwritten due to changes in code or input sources. On the server, a Scheduler manages these computation requests and delegates them to Executors, which perform the actual computation. Each Executor is itself a server and is started and stopped by the Scheduler. Executors are separated from the file server and Scheduler since they are potentially implemented in a different programming language, and also need a project-specific configuration for including code, external libraries, and other parameters. All communication between different parts of the system is performed using the Hypertext Transfer Protocol (HTTP) and according to the principles of Representational State Transfer (REST).
The role of an Executor is to perform a computation as specified by a data source descriptor. Since many programming languages are suitable for scientific computing, a scientist should be able to use any programming tool that supports best the particular task. Often, a specific statistical library that is needed is only available in a particular programming language, or acceptable performance is just feasible by using the tools provided by another programming environment. A mixed use of different programming languages within a single project is possible since functions are described solely by their input and output data formats. Each project defines a set of execution environments, which need to start a server program that is provided in different programming languages. Here, implementations for Java, Python and R are discussed. To facilitate support for additional programming languages, the required interface for a new implementation is kept as simple as possible. The primary requirement is that an HTTP server can respond to computation requests as specified by a data source descriptor. This means that there must exist a mapping of (1) all data formats to an in-memory representation (deserialization) and (2) an output back to a persistable representation (serialization). An overview is given in Table 1.
In Java, a function reference is resolved using Java reflection. In contrast to the dynamic languages Python and R, Java’s static type system requires the input to be an instance of the parameter’s type. For a conversion to and from JSON, Google’s Gson library is utilized, which provides a comprehensive mapping, even in the case of Java’s generic types. JSONL maps to java.util.stream.Stream, where each line of the input deserializes to the specified type of the stream. It is then up to the function implementation whether to load the complete data into memory, to process the stream element-wise, or not to use the stream at all. For CSV data, a new data type is provided, which allows operations on tabular data similar to built-in types found in other programming languages. In Python, JSON values are mapped to built-in types using the standard library. JSONL data is provided by a Python generator. CSV data is loaded directly into a DataFrame of the Pandas library. In R, most data formats can be expected to be converted to a data frame object.
|Data format|Java|Python|R|
|---|---|---|---|
|Other text data|Stream<String>|Generator|char vector|
|Binary data|byte|bytes|raw vector|
The Scheduler manages all computation-related tasks. The computation itself is delegated to an Executor. When a computation of a single data source or a set of data sources is requested, all recursively dependent data sources need to be computed as well, in an order determined by the dependency graph. Each data source specifies an execution environment identifier under which the function can be called. A project needs to supply a command line script that takes the environment identifier as an input, starts an Executor and returns its server URL. The Executor could be run on the same machine as the Scheduler, using a different server port, or on another machine to distribute computations. While Executors in the form of a server program are provided in a variety of programming languages, it is up to the project to configure source code inclusion or translation, library dependencies, the amount of memory to be reserved, or a multi-machine setup. An Executor may be re-used by the Scheduler for multiple computations, since functions are required to have no side-effects.
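The ordering step of the Scheduler amounts to a topological sort of the part of the dependency graph that is reachable from the requested data source. A minimal sketch, assuming Python 3.9 or later for the standard `graphlib` module; the function name `computation_order` and the example URIs are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def computation_order(dependencies: Dict[str, Set[str]], requested: str) -> List[str]:
    """Return the requested data source and all sources it depends on, inputs first.

    `dependencies` maps each described data source URI to the URIs of its inputs.
    Original raw data has no descriptor and is therefore not part of the result.
    """
    needed: Dict[str, Set[str]] = {}
    stack = [requested]
    while stack:  # collect everything reachable from the request
        uri = stack.pop()
        if uri not in needed:
            needed[uri] = dependencies.get(uri, set())
            stack.extend(needed[uri])
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    order = TopologicalSorter(needed).static_order()
    return [uri for uri in order if uri in dependencies]


# Hypothetical project: a figure depends on a model and on cleaned data.
deps = {
    "results/figure1.bin": {"results/model.json", "results/clean.csv"},
    "results/model.json": {"results/clean.csv"},
    "results/clean.csv": {"data/raw.csv"},
    "results/unrelated.json": {"data/other.csv"},
}
order = computation_order(deps, "results/figure1.bin")
```

Data sources that the request does not depend on, such as `results/unrelated.json` here, are left untouched.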
The publication contains the following elements:
<h[1..6]>, <p> to structure the text,
$…$ for including math expressions using TeX syntax,
<a> to link to bibliographic references, figures, tables, equations etc.,
<span data-url="…"/> to include computational results,
and <div class="references|sources" data-url="…"/> for importing BibTeX literature references and a list of data source descriptors.
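A tool that processes such a document first has to find the data citations it contains. The sketch below collects the data-url attribute of every span element with Python's standard HTML parser; the class name `DataCitationCollector` and the example article are hypothetical, and the actual system may resolve these elements differently, for instance directly in the browser.

```python
from html.parser import HTMLParser
from typing import List


class DataCitationCollector(HTMLParser):
    """Collect the data-url attribute of every <span> in an article."""

    def __init__(self) -> None:
        super().__init__()
        self.urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <span .../> are also routed through here.
        if tag == "span":
            for name, value in attrs:
                if name == "data-url" and value:
                    self.urls.append(value)


# Hypothetical article fragment with two computational results.
article = (
    "<h1>Results</h1>"
    '<p>The mean age is <span data-url="results/mean.json"/> years, '
    'see <span data-url="results/figure1.svg"/>.</p>'
)
collector = DataCitationCollector()
collector.feed(article)
collector.close()
```

After parsing, `collector.urls` holds the cited data source URIs in document order, which could then be compared with the entries of the data source list.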
All communication is conducted via HTTP, which may be seen as a limiting factor. However, given sufficient communication bandwidth, it is data serialization and deserialization that represent the main bottleneck. These are dependent on the utilized libraries of the respective programming language. Also, latency is not an issue, since a data source is typically a result of a long-running computation, where communication overhead is insignificant. In the case of a high number of short-running function calls within a higher-order function, they can be performed directly in the same process, given that function objects were created in the same Executor. In the browser, loading of an HTML document, fetching all of its linked sources and formatting might take some time, depending on the complexity of the paper. Here, a solution would be to supply a pre-rendered document once the article is ready for publication.
Perhaps the greatest obstacle is adoption. Scientists may not want to publish code since with full transparency it could be easier to point out flaws in their work. For example, changing a parameter or input data could lead to a failure to arrive at the original conclusions. On the other hand, with automatic reproducibility, it should also be more straightforward for the authors to modify their experiments, and then state more clearly the limitations of their work and under which conditions the results hold. A badge system that rewards authors for being completely transparent [14] could be used to incentivize the practice. In some cases, publishing code or data may be perceived as losing a competitive advantage over other researchers. Additionally, some data may not be publishable due to copyright or privacy considerations. In the presented system, however, publishing code and data is not a binary decision. The representation as a computational tree allows the authors to decide at which depth level code and data are included. For example, at a certain level, data is aggregated enough that no personal information about the subjects is revealed.
On the publisher side, such a system requires that scientific journals adapt their technical infrastructure. This is a substantial barrier. However, since the presented publication format aims to be superficially indistinguishable from established formats, an intermediate solution could be to offer a complementary publication repository that supports the proposed functionality, similar to existing pre-print servers such as arXiv, alongside submission of manuscripts to traditional publishers.
- 1. Clarivate Analytics. Web of Science citation index; 2017. Available from: https://webofknowledge.com.
- 2. Lazer D, Pentland AS, Adamic L, Aral S, Barabasi AL, Brewer D, et al. Life in the network: the coming age of computational social science. Science (New York, NY). 2009;323(5915):721.
- 3. Peng RD. Reproducible research in computational science. Science. 2011;334(6060):1226–1227.
- 4. Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten simple rules for reproducible computational research. PLoS Computational Biology. 2013;9(10):e1003285.
- 5. Stodden V, McNutt M, Bailey DH, Deelman E, Gil Y, Hanson B, et al. Enhancing reproducibility for computational methods. Science. 2016;354(6317):1240–1241.
- 6. Claerbout JF, Karrenbach M. Electronic documents give reproducible research a new meaning. In: SEG Technical Program Expanded Abstracts 1992. Society of Exploration Geophysicists; 1992. p. 601–604.
- 7. Van Gorp P, Mazanek S. SHARE: a web portal for creating and sharing executable research papers. Procedia Computer Science. 2011;4:589–597.
- 8. Gavish M, Donoho D. A universal identifier for computational results. Procedia Computer Science. 2011;4:637–647.
- 9. Peng RD, Eckel SP. Distributed reproducible research using cached computations. Computing in Science & Engineering. 2009;11(1):28–34.
- 10. Xie Y. Dynamic Documents with R and knitr. vol. 29. CRC Press; 2015.
- 11. Ecma International. Standard ECMA-404: The JSON Data Interchange Format; 2013. Available from: http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf.
- 12. Internet Engineering Task Force. RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files; 2005. Available from: https://tools.ietf.org/html/rfc4180.
- 13. World Wide Web Consortium. HTML 5 Recommendation; 2017. Available from: https://www.w3.org/TR/html/.
- 14. Kidwell MC, Lazarević LB, Baranski E, Hardwicke TE, Piechowski S, Falkenberg LS, et al. Badges to acknowledge open practices: A simple, low-cost, effective method for increasing transparency. PLoS Biology. 2016;14(5):e1002456.