Data Shapes and Data Transformations

Data Shapes and Data Transformations

Michael Hausenblas DERI, NUI Galway, Ireland
firstname.lastname@deri.org
   Boris Villazón-Terrazas iSOCO, Madrid, Spain
bvillazon@isoco.com
   Richard Cyganiak DERI, NUI Galway, Ireland
firstname.lastname@deri.org
Abstract

Nowadays, information management systems deal with data originating from different sources including relational databases, NoSQL data stores, and Web data formats, varying not only in terms of data formats, but also in the underlying data model. Integrating data from heterogeneous data sources is a time-consuming and error-prone engineering task; part of this process requires that the data has to be transformed from its original form to other forms, repeating all along the life cycle. With this report we provide a principled overview on the fundamental data shapes tabular, tree, and graph as well as transformations between them, in order to gain a better understanding for performing said transformations more efficiently and effectively.

1 Motivation

These days, content and information management systems have to deal with data originating from an array of sources, such as relational databases, NoSQL data stores, and Web data formats. The data sources vary not only in terms of data formats, but first and foremost in the underlying data model, be it implicit—such as with JSON—or explicit, think: RDF.

As recently put forward by Helland [Hel11], data integration of heterogeneous data sources is a time-consuming, costly, and error-prone engineering task. Typically, the data has to be transformed from its original form to other forms, repeating all along the life cycle. For example, let us assume we want to publish data from an government agency such as spreadsheets containing statistical information into the Linked Open Data cloud111http://lod-cloud.net. One task of the Linked Data life cycles would then be to transform the original tabular spreadsheet data into (graph-shaped) RDF. Once we have the data transformed into RDF according to, say, the RDF Data Cube Vocabulary222http://www.w3.org/TR/vocab-data-cube/, we want to visualize it in an appealing way, so we decide to use the Google Charts API333https://developers.google.com/chart/ requiring us to provide input as tabular data. So, again we have to transform a graph into a tabular, and then the application visualize the information in an appealing way.

Apparently, even in this toy example, we have to transform data from its original form to potentially many (intermediate) other forms. This was the motivation for us to compile this report, aiming to provide a principled overview of possible fundamental data shapes and transformations between them. In the following we will focus on the data transformation within the context of the “Extract, Transform and Load” process, a valuable process, growing in its use and scale. We use the term data shape to refer to how the data is arranged and structured, closely related to the term data model444http://en.wikipedia.org/wiki/Structured_data: we have identified three fundamental data shapes: tabular, tree, and graph and respective transformations between them.

The remainder of the report is organized as follows: in Section 2 we introduce and motivate the fundamental data shapes, then in Section 3 we describe transformations between data shapes, and, finally, in Section 4 we discuss open issues and challenges concerning the transformations.

2 Fundamental data shapes

In the following we motivate and introduce the three fundamental data shapes tabular, tree, and graph, derived from data structures555http://en.wikibooks.org/wiki/Data_Structures and as found in the wild in various datas sources, including but not limited to relational databases (RDB), NoSQL data stores [Cat11], or Web data formats such as JSON, OData and RDF serialisations.

2.1 Tabular

A tabular data shape organizes data items into a table. A table is a set of data elements (values) that are organized using a model of vertical columns (identified by their name), and horizontal rows. A table has a specified number of columns. Examples of tabular data shapes are:

  • CSV (Comma Separated Values) files as of RFC 4180666http://tools.ietf.org/html/rfc4180—These files are used to store tabular data, capable of storing numbers as well as text in a plain-text format that can be easily written and read by humans and software alike.

  • RDB (relational databases)—A relational database is essentially a group of tables (entities). Tables are made up of columns and rows (tuples). Those tables have constraints, and relationships are defined between them. Relational databases are queried using SQL, and result sets are produced from queries that access data from one or more tables.

2.2 Tree

A tree is a non-empty set, one element of which is designated the root of the tree while the remaining elements are partitioned into non-empty sets each of which is a subtree of the root. Tree nodes have many useful properties. The depth of a node is the length of the path (or the number of edges) from the root to that node. The height of a node is the longest path from that node to its leaves. The height of a tree is the height of the root. A leaf node has no children, its only path is up to its parent.

A particular case of a tree is the key-value data shape—a linked list of key-value pairs. Examples of tree data shapes are:

  • XML (eXtensible Markup Language)—An open and flexible format used to exchange a wide variety of data on and off the Web. XML is a tree structure of nodes and nested nodes of information where the user defines the names of the nodes777http://www.w3.org/XML/.

  • JSON (JavaScript Object Notation)— A lightweight data-interchange format. It is easy for humans to read and write as well as straightforward for machines to parse and generate888http://json.org/.

  • YAML (YAML Ain’t Markup Language)—A Super-set of JSON and general-purpos data serialization language designed to be human-friendly and work well with modern programming languages for common everyday tasks999http://yaml.org/.

2.3 Graph

A graph is a mathematical structure consisting of a set of vertexes (also called nodes), and a set of edges. An edge is a pair of vertexes. The two vertexes are called edge endpoints. A graph may be either undirected or directed. Intuitively, an undirected edge models a “two-way” or “duplex” connection between its endpoints, while a directed edge is a “one-way” connection, and is typically represented by an arrow.

Examples of graphs are:

  • RDF (Resource Description Framework)— A family of World Wide Web Consortium (W3C) specifications originally designed as a metadata model. It has come to be used as a general method for conceptual description or modeling of information that is implemented in web resources, using a variety of syntax formats101010http://www.w3.org/RDF/.

  • Topic Maps—Topic maps are an ISO standard for describing knowledge structures and associating them with information resources111111http://www.isotopicmaps.org/.

3 Data shapes transformations

In this section we compare the possible transformations we can perform between two given data shapes. To this end, we have identified a set of features along three dimensions—the input, the output, and the transformation process—and provide motivational usage scenarios per transformation. We acknowledge that the characterisations and the formats presented in following are neither exhaustive nor complete, however, serve as a useful starting point.

  • Dimension 1—concerning the input data shape:

    • The generic data shape, e.g., tabular, tree or graph.

    • The specific implementation of the data shape, e.g., XML, JSON, relational database, ect.

  • Dimension 2—concerning the output data shape:

    • The generic data shape, e.g., tabular, tree or graph.

    • The specific implementation of the data shape, e.g., XML, JSON, relational database, ect.

  • Dimension 3—concerning the transformation process:

    • The transformation process can be declarative or operational.

      • Declarative. There is a transformation description, the transformation is based on a language that describes the mappings between the input and output shapes.

      • Operational. The transformation is only based on an ad-hoc transformation engine.

    • The transformation process can have an information loss (also known as lossy transformation) defined by: “all queries that are possible on the original shape are also possible on the resultant shape”. We have information loss when we change the abstraction level; this happens typically, when we transform a “richer” shape into a “less rich shape”, e.g., from graph to tabular.

The Table 3.1 illustrates all possible transformations between two given data shapes and provides pointers to the respective subsections where we discuss them in further detail.

from/to tree tabular graph
tree cf. Section 3.1 cf. Section 3.2 cf. Section 3.3
tabular cf. Section 3.4 cf. Section 3.5 cf. Section 3.6
graph cf. Section 3.7 cf. Section 3.8 cf. Section 3.9
Table 3.1: Data shapes transformations overview.

3.1 Tree–Tree

In this case, the transformation takes as input a given tree and outputs another tree. Let us suppose we have a set of XML documents that contain the description of the transactions of a company, and we need to submit these tin JSON files instead of XML, so we need to perform a transformation from tree to tree. Examples of these transformations are:

  • XML to XML. An XSLT that turns a DocBook121212http://www.docbook.org/ file into XHTML.

  • XML to JSON. A program that turns a XML file into JSON, or, for example via XSLT.

3.2 Tree–Tabular

This transformation takes as input a tree and outputs a tabular. Let us suppose we have a set of XML that contain the description of the transactions of a company, and we need to submit these to an entity such as a government agency that works with CSV files instead of XML, so we need to perform a transformation from tree to tabular. Examples of these transformations are:

3.3 Tree–Graph

This transformation takes as input a tree and outputs a graph. Let us suppose we have a set of XML document that contain the description of the transactions of a company, and we need to submit these to an entity such as a government agency that works with RDF for integration purposes.In this setup, we need to transform from tree to graph. Examples of these transformations are:

3.4 Tabular–Tree

This transformation takes as input a tabular shape and outputs a tree shape. Let us suppose we have a relational database containing transaction data of a company, and we need to submit these transactions that requires, for integration purposes, the data in XML, so we need to transform from tabular to tree. Examples of these transformations are not standardised, but there are bespoke systems such as:

3.5 Tabular–Tabular

This transformation takes as input a tabular and outputs a tabular. Let us suppose we have a relational database that contain the description of the transactions in our company. We need to display these transactions in the company web page. To this end, we have to transform from a tabular (RDB) to a tabular (web page). Examples of these transformations are:

  • RDB to RDB: SQL SELECT.

  • CSV to RDB: relying on a particular DBMS import tool.

3.6 Tabular–Graph

This transformation takes as input a tabular and outputs a graph. Let us suppose we have a relational database that contain the description of the transactions in our company, and we need to submit these transactions into the central office in London. For integration purposes the central office is using RDF, so we need to transform from tabular to graph. Examples of these transformations are:

  • RDB to RDF:

    • W3C’s RDB2RDF activity232323http://www.w3.org/2001/sw/rdb2rdf/: Direct Mapping and R2RML, a language for expressing customized mappings from relational databases to RDF datasets.

  • CSV to RDF:

    • XLWrap - language

    • TopBraid - tool

    • RDF extension of Google Refine - tool - GUI

  • RDB to Topic maps [NP09].

3.7 Graph–Tree

This transformation takes as input a graph and outputs a tree. Let us suppose we have an RDF dataset for representing the statistical information of a company and we need to transfer this information to an XML-based format such as PC-Axis, used by the Irish CSO. Examples of these transformations are:

3.8 Graph–Tabular

The transformation takes as input a graph and outputs a tabular. Let us suppose we have an RDF dataset for representing the statistical information of a company and want to use Google Charts for visualising it. This requires a tabular representation in CSV and therefore we have to perform a transformation from graph to tabular. Examples of these transformations are:

  • SPARQL SELECT

  • ad-hoc conversion scripts.

3.9 Graph–Graph

This transformation takes as input a graph and outputs a graph. Let us suppose we have an RDF dataset for representing the statistical information of a company, expressed in the W3C RDF Data Cube vocabulary272727http://www.w3.org/TR/vocab-data-cube/. Now, further assume that someone is still using the deprecated SCOVO282828http://purl.org/NET/scovo# vocabulary for representing the statistical information. Therefore we need to transform our data expressed in RDF Data Cube to SCOVO. In this case, we have to perform a transformation from a graph to graph. Examples of these transformations are:

3.10 Summary

In Table 3.2 we provide a summary of the data shapes transformation and their characteristics.

 Input Output Nature Lossy? Standard
Tabular (RDB) Tabular (RDB) Declarative No SQL
Tabular (RDB) Tree (XML) Operational No No
Tabular (RDB) Graph (RDF) Declarative No RDB2RDF
Tree (XML) Tabular (RDB) Operational No No
Tree (XML) Tree (XML) Declarative No XSLT
Tree (XML) Tree (XML) Declarative No XSLT
Graph (RDF) Tabular (RDB) Declarative Yes SPARQL SELECT
Graph (RDF) Tree (XML) Declarative Yes No
Graph (RDF) Graph (RDF) Declarative No SPARQL CONSTRUCT
Table 3.2: Data shapes transformations comparison.

4 Discussion

Motivated by our experiences gathered in data integration projects as well as in standardisation activities within W3C we wanted to provide a principled overview on the fundamental data shapes and transformations between them. Summarising, we can state the following:

  • We can perform (loss-less) data shape transformations between certain shapes.

  • A number of data shape transformations are already standards or in the process of being standardised, including:

    • For RDB2RDF, see R2RML and Direct Mapping.

    • For XML2XML, see XSLT.

    • For XML2RDF, see GRDDL.

  • We found that some data shape transformations are declarative in nature and it would be interesting to learn if others can and should be expressed declaratively as well.

  • To this end, we have not taken provenance information in the transformation process into account. Again, this is something worthwhile to follow up on.

  • In certain cases we have to deal with lossy transformations. A more systematic study of these cases, including an assessment of the implications concerning the data integration process is subject to future research.

We hope that the report in the current form is useful for both researchers and practitioners alike and consider it as one contribution in helping to establish a discussion around data shapes and their transformations in order to advance the state of the art.

5 Acknowledgements

The authors are grateful for the support received through the European Commission FP7 ICT projects BIG–Big Data Public Private Forum (Grant Agreement No. 257943) as well as LATC–LOD Around-The-Clock (Grant Agreement No. 256975).

References

  • [Cat11] Rick Cattell. Scalable SQL and NoSQL data stores. SIGMOD Rec., 39:12–27, May 2011.
  • [Fli09] Amy Flik. Transforming XML into a Relational Database Using XML Schema Document Type. Technical Library. Paper 48., Grand Valley State University, 2009.
  • [Hel11] Pat Helland. If You Have Too Much Data, then ’Good Enough’ Is Good Enough. Queue, 9:40:40–40:50, May 2011.
  • [NP09] Thomas Neidhart and Rani Pinchuk. Semantic Integration of Relational Data Sources With Topic Maps. In Fifth International Conference on Topic Maps Research and Applications, 2009.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
""
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
   
Add comment
Cancel
Loading ...
54973
This is a comment super asjknd jkasnjk adsnkj
Upvote
Downvote
""
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters
Submit
Cancel

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test
Test description