SPOT: Open Source framework for scientific data repository and interactive visualization
spot is a free and open source visual data analytics tool for multi-dimensional data-sets. Its web-based interface allows quick, interactive analysis of complex data. Data operations such as aggregation and filtering are implemented. The generated charts are responsive and support OpenGL. It follows the FAIR principles to allow reuse and comparison of published data-sets. The software also supports PostgreSQL databases for scalability.
Faruk Diblen (corresponding author), Jisk Attema, Rena Bakhshi
Netherlands eScience Center, Science Park 140, 1098 XG Amsterdam, The Netherlands
Sascha Caron, Luc Hendriks, Bob Stienen
Institute for Mathematics, Astro- and Particle Physics (IMAPP), Radboud Universiteit, Nijmegen, The Netherlands
July 30, 2019
Keywords visualization high-dimensional data theoretical models open data FAIR particle physics
1 Motivation and significance
Most scientific fields produce theoretical or experimental data which is not only the result of measurements, but also of simulations or evaluations of theoretical models. This data is intrinsically complex, consisting of multiple parameters or multiple observables, and thus data-sets can be regarded as point clouds in a high-dimensional space. Often, due to the restrictions imposed by the use of paper for data visualization (e.g. a figure in a journal), the status quo is still to publish the data in a two- or three-dimensional format. A typical example is Figure LABEL:fig:example.
However, such a two-dimensional (2D) representation obscures most of the correlations within the solution space. To encourage the publication of high-level data in the complete high-dimensional space, without restrictions, data visualization can be done via web-based tools which allow, for example, the automatic generation of multiple relevant histograms. The aim of spot [1] is to provide a flexible framework for visualizing such data. spot, which is typically coupled to a database holding the data-sets, is a tool to promote the use of open research data and open science [2]. It follows the FAIR [3] principles to allow reuse and comparison of published data-sets (see Fig. 1). This paper briefly introduces spot. The source code and the documentation are available at https://github.com/NLeSC/spot.
2 Software description
spot provides users with an interactive data exploration environment for high-dimensional data-sets. The focus is on scientific use, with the aim of facilitating open science, data sharing and reuse (see Figure 1). It is best suited to numerical data, but categorical (labeled) and temporal data are also supported.
Built on a number of concepts from the field of information visualization, it allows a user to create multiple coordinated views, called charts, that show the data from different perspectives. All charts allow direct manipulation (i.e., selecting and zooming) of the data, and provide visual cues or animations when the data changes due to user interaction.
2.1 Software Architecture
The software consists of three components: a framework, a frontend, and a server. A brief description for each of these components follows:
The framework provides classes for data-sets, data views, partitions, aggregation and filtering. A data-set consists of a number of items (or rows), and each item has a set of facets. Facets can be used to partition the data, or they can be aggregated (counted). For numerical data, more complex operations are possible, namely summation, averaging, extremes (minimum and maximum), and standard deviation. One or more facets make up a filter, and all filters combined form the data view. The user interacts with the framework by setting ranges or selections for the filters, and by adding or removing filters from the data view. After filtering, the partitioned and aggregated data is available as a simple array, which can be plotted or processed further. All filters in a data view are linked, and a change in one filter triggers an update of the whole data view.
The server processes requests for data and applies the necessary filtering and aggregation. When data becomes available, it is pushed to the client which can then update the charts.
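The concepts described above (items, facets, partitions, aggregation, and linked filters) can be illustrated with a minimal sketch in Python. This is not the API of spot-framework: the class DataView, its methods, and the sample items are hypothetical, and the sketch simplifies a filter to a numeric range on a single facet.

```python
from collections import defaultdict
from statistics import mean

# Reductions available for a group of values (hypothetical subset).
REDUCERS = {"count": len, "sum": sum, "mean": mean, "min": min, "max": max}


class DataView:
    """Hypothetical miniature data view: items plus a set of range filters."""

    def __init__(self, items):
        self.items = items      # each item is a dict: facet name -> value
        self.filters = {}       # facet name -> (low, high), closed range

    def set_filter(self, facet, low, high):
        self.filters[facet] = (low, high)

    def remove_filter(self, facet):
        self.filters.pop(facet, None)

    def selected(self):
        # An item is kept only if it passes every filter in the view,
        # which is what links the filters together.
        return [
            item for item in self.items
            if all(low <= item[facet] <= high
                   for facet, (low, high) in self.filters.items())
        ]

    def aggregate(self, partition, op="count", value=None):
        # Partition the selected items on one facet, then reduce each group.
        # For any op other than "count", `value` names the facet to reduce.
        groups = defaultdict(list)
        for item in self.selected():
            groups[item[partition]].append(item if value is None else item[value])
        return sorted((key, REDUCERS[op](vals)) for key, vals in groups.items())


items = [
    {"class": "first", "age": 30, "fare": 80.0},
    {"class": "first", "age": 50, "fare": 120.0},
    {"class": "third", "age": 20, "fare": 8.0},
    {"class": "third", "age": 40, "fare": 10.0},
]
view = DataView(items)
counts_before = view.aggregate("class")
view.set_filter("age", 25, 45)          # changing one filter affects every aggregate
counts_after = view.aggregate("class")
mean_fare = view.aggregate("class", op="mean", value="fare")
print(counts_before, counts_after, mean_fare)
```

Each call to aggregate returns a plain sorted list of (partition value, aggregate) pairs, mirroring the "simple array" that a chart would plot.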
We currently have two different implementations of the server component. The first one, which is included in spot-framework, is based on Crossfilter.js and runs in the user's web browser without requiring any further resources or even internet access. The second implementation (spot-server) provides a bridge to an external PostgreSQL database for scalability. Database queries are run in parallel and make use of indices for extra performance. Connections to other datastores, such as MySQL or MongoDB, can be added by extending the server component.
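A database-backed server has to translate a data view into queries. One plausible translation, sketched below, maps the partition to GROUP BY, the aggregation to an SQL aggregate function, and the filter ranges to a WHERE clause. The sketch uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for PostgreSQL; the table models, its columns, and the helper build_query are hypothetical and are not taken from spot-server.

```python
import sqlite3

# Supported aggregations mapped to SQL functions (hypothetical subset).
AGGREGATES = {"count": "COUNT", "sum": "SUM", "avg": "AVG", "min": "MIN", "max": "MAX"}


def build_query(table, columns, partition, op, value, ranges):
    """Return (sql, params) for one chart.

    Facet names are checked against the known columns before being placed
    in the SQL text; range bounds are always passed as bound parameters.
    `table` is assumed to come from trusted configuration.
    """
    for name in [partition, value, *ranges]:
        if name not in columns:
            raise ValueError(f"unknown facet: {name}")
    where = " AND ".join(f'"{facet}" BETWEEN ? AND ?' for facet in ranges)
    params = [bound for bounds in ranges.values() for bound in bounds]
    sql = f'SELECT "{partition}", {AGGREGATES[op]}("{value}") FROM "{table}"'
    if where:
        sql += f" WHERE {where}"
    sql += f' GROUP BY "{partition}" ORDER BY "{partition}"'
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE models (family TEXT, mass REAL, density REAL)")
conn.executemany(
    "INSERT INTO models VALUES (?, ?, ?)",
    [("A", 100.0, 0.25), ("A", 300.0, 0.75), ("B", 150.0, 0.5), ("B", 900.0, 2.0)],
)
# An index on the filtered facet lets the database avoid a full table scan.
conn.execute("CREATE INDEX idx_models_mass ON models (mass)")

sql, params = build_query(
    "models", {"family", "mass", "density"},
    partition="family", op="avg", value="density",
    ranges={"mass": (50.0, 500.0)},
)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Because every chart in a dashboard yields an independent query of this shape, such queries are natural candidates for parallel execution.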
2.2 Software Functionality
Data import and database connection
There are two options for importing a new data-set. In the first option, users upload data available on their own system. The software supports the most common data formats, CSV and JSON, to make the data import process easy for different scientific domains. After the import, the data is checked to automatically detect data types, such as integers and strings. Users can then correct any auto-detection errors. In the second option, the data is imported from spot-server. The metadata for a data-set (e.g. the name and description) can be set in a configuration file stored on the server side.
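The type detection step can be sketched as follows. This is an illustrative approach, not the detection logic used by spot: the functions detect_type and detect_column_types, the three type labels, and the sample CSV are assumptions. A column is labeled with the most specific type that all of its non-empty values can be parsed as, and the user can override the result afterwards.

```python
import csv
import io


def detect_type(values):
    """Return 'integer', 'number', or 'string' for a column of raw strings."""
    non_empty = [v.strip() for v in values if v.strip() != ""]
    if not non_empty:
        return "string"
    # Try the most specific type first; fall back to the next on failure.
    for label, cast in (("integer", int), ("number", float)):
        try:
            for v in non_empty:
                cast(v)
            return label
        except ValueError:
            continue
    return "string"


def detect_column_types(text):
    """Detect a type for every column of a CSV document with a header row."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return {
        name: detect_type([(row[name] or "") for row in rows])
        for name in reader.fieldnames
    }


sample = (
    "name,age,fare,survived\n"
    "Allen,29,211.3375,1\n"
    "Braund,,7.25,0\n"
)
detected = detect_column_types(sample)
# A user correction, e.g. treating the 0/1 column as labels, not numbers.
corrected = {**detected, "survived": "string"}
print(detected)
print(corrected)
```

Missing values are ignored during detection, so a column with gaps (such as age above) is still recognized as numerical.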
spot has eight ready-to-use chart types: horizontal and vertical histograms, line chart, pie chart, bubble chart, 3D scatter chart, radar chart and network chart. Charts are added to the dashboard by clicking the chart icon. A chart's filter requires one or more facets to partition over, and can take up to four facets to aggregate. Charts show their configuration pane by default. A visual feature of the chart can be linked to a specific facet by dragging the facet from the top of the screen and dropping it on a slot in the configuration pane. A partition or aggregation can be configured further by clicking on its name in the configuration pane.
Download and share
The dashboard generated by the user can be saved as a single file, called a session file, in JSON format. This file contains the aggregated data and the settings of the dashboard, such as the existing charts and filters. The session file can then be used to restore the analysis. In addition, the session file can be uploaded to cloud storage, and a link to it can be shared.
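As an illustration of what such a session file might contain, the sketch below writes a small dashboard description to JSON and restores it. The layout of the session dictionary (the keys dataset, charts, filters and data) is hypothetical and does not reproduce spot's actual session format; only the general idea of storing settings together with aggregated data comes from the description above.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical session layout: dashboard settings plus aggregated data.
session = {
    "dataset": {"name": "titanic", "description": "Passenger records"},
    "charts": [{"type": "piechart", "partition": "class", "aggregate": "count"}],
    "filters": [{"facet": "age", "range": [25, 45]}],
    "data": {"class": [["first", 1], ["third", 1]]},
}


def save_session(session, path):
    """Write the session as human-readable JSON."""
    Path(path).write_text(json.dumps(session, indent=2), encoding="utf-8")


def load_session(path):
    """Read a session file back into a dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "session.json"
    save_session(session, path)
    restored = load_session(path)

print(restored["charts"])
```

Because the file stores aggregated data rather than the raw items, it can stay small even for large data-sets.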
3 Illustrative Examples
We demonstrate the applicability and the features of spot using two example data-sets: (1) the Titanic data-set [5], a well-known data-set in data science, and (2) a high-dimensional data-set containing models for dark matter (see [6, 7] for more information).
3.1 Titanic data-set
The top of the figure shows the chart types, each of which can be selected to make a new chart. Directly below are the data facets, which can be dragged and dropped into the empty charts to visualise any parameter or combination of parameters.
3.2 Example from High-Energy Physics
A real-world example where spot can help the scientific community is the visualization of models in high-energy physics. The data in this field is typically high-dimensional, and even though different models have different theoretical parameters, they share the same observables. spot allows these observables to be compared across data-sets, giving the user a new way to compare the model space. In Figure LABEL:fig:dmmodels, two data-sets for models predicting dark matter are compared for three observables: the dark matter mass, an annihilation probability and a cosmic density.
The high-dimensional space can be stored in the spot database so that the data-set can be published along with the paper. While the paper still contains the most relevant 2D plots, a researcher can plot different variables using spot for further research. By making cuts in, for example, histograms, researchers can investigate the high-dimensional space interactively. In addition, comparison between data-sets on shared observables is likely to lead to new research questions.
Intuitive data visualization is still in its infancy, especially in the field of high-energy physics. spot aims to be the first tool to provide both an intuitive interface for visualizing high-dimensional data-sets and a place for researchers to store their high-dimensional data.
Thus, there are two categories of applications that are related to spot: online sharing services and visualization libraries. The visualization tools most relevant to spot are Microsoft Power BI [8], Spotfire [9] and Tableau [10], but these are commercial products. The most popular sharing services include the data sharing platform Zenodo [11], the digital repository Figshare [12], and the high-energy physics specific platform Hepdata [13]. In comparison with spot, however, these services essentially provide only storage that allows researchers to upload and publicly store their data and figures, but not to compare visualizations from different articles. On the other hand, visualization frameworks such as Dash [14], a Python framework, give users the possibility of interactive visualization, but require advanced knowledge to create a desired dashboard.
The software has been written to provide a free and open state-of-the-art platform for researchers in many scientific domains. It helps researchers to publish and share their data-sets, and to collaborate by comparing their data-sets to identify differences. This makes spot a strong candidate for FAIR data platforms.
This work is supported by the Netherlands eScience Center under the project iDark: The intelligent Dark Matter Survey.
- [1] Jisk Attema and Faruk Diblen. NLeSC/spot: Version 0.1.0, October 2017.
- [2] Jorge Machado. Open data and open science. Open Science, page 189, 2015.
- [3] Mark D Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E Bourne, et al. The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, 2016.
- [4] Georges Aad et al. Summary of the ATLAS experiment's sensitivity to supersymmetry after LHC Run 1 - interpreted in the phenomenological MSSM. JHEP, 10:134, 2015.
- [5] Kaggle.com. Titanic: Machine Learning from Disaster. https://www.kaggle.com/c/titanic/data, 2018. Online. Accessed on 01-June-2018.
- [6] Abraham Achterberg, Melissa van Beekveld, Sascha Caron, Germán A Gómez-Vargas, Luc Hendriks, and Roberto Ruiz de Austri. Implications of the Fermi-LAT Pass 8 Galactic Center excess on supersymmetric dark matter. Journal of Cosmology and Astroparticle Physics, 2017(12):040, 2017.
- [7] Abraham Achterberg, Simone Amoroso, Sascha Caron, Luc Hendriks, Roberto Ruiz de Austri, and Christoph Weniger. A description of the Galactic Center excess in the Minimal Supersymmetric Standard Model. Journal of Cosmology and Astroparticle Physics, 2015(08), 2015.
- [8] Teo Lachev and Edward Price. Applied Microsoft Power BI: Bring Your Data to Life! Prologika Press, 3rd edition, 2018.
- [9] Christopher Ahlberg. Spotfire: an information exploration environment. ACM SIGMOD Record, 25(4):25–29, 1996.
- [10] Jeffrey Heer, Jock Mackinlay, Chris Stolte, and Maneesh Agrawala. Graphical histories for visualization: Supporting analysis, communication, and evaluation. IEEE Transactions on Visualization and Computer Graphics, 14(6), 2008.
- [11] OpenAIRE and CERN. Zenodo. https://zenodo.org/, 2018.
- [12] Digital Science. Figshare. http://figshare.com, 2018.
- [13] Eamonn Maguire, Lukas Heinrich, and Graeme Watt. Hepdata: a repository for high energy physics data. In Journal of Physics: Conference Series, volume 898, page 102006. IOP Publishing, 2017.
- [14] Plotly. Dash. http://dash.plot.ly, 2018.