A declarative end-to-end workflow for machine learning
We present sql4ml , a system for expressing supervised machine learning (ML) models in SQL and automatically training them in TensorFlow. The primary motivation for this work stems from the observation that in many data science tasks there is a back-and-forth between a relational database that stores the data and a machine learning framework. Data preprocessing and feature engineering typically happen in a database, whereas learning is usually executed in separate ML libraries. This fragmented workflow requires from the users to juggle between different programming paradigms and software systems. With sql4ml the user can express both feature engineering and ML algorithms in SQL, while the system translates this code to an appropriate representation for training inside a machine learning framework. We describe our translation method, present experimental results from applying it on three well-known ML algorithms and discuss the usability benefits from concentrating the entire workflow on the database side.
A declarative end-to-end workflow for machine learning
|Athens University of Economics and Business|
|Athens University of Economics and Business|
Today data scientists suffer a fragmented workflow: typically feature engineering is performed in a relational database, while predictive models are developed with a machine learning library. Data scientists must tackle a fundamental impedance mismatch between relational databases and ML with the attendant accidental complexity of bridging the two worlds. The workflow involves transforming between data represented in relations and tensors (multi-dimensional arrays), context-switching between the declarative SQL and imperative Python (or R) programming paradigms, and plumbing between siloed software systems. While relational databases provide advanced query optimization and permit selecting features by efficiently joining normalized tables, they are not always suited/optimal for linear algebra and have limited support for the type of iteration required for ML algorithms. In contrast, ML libraries provide mature support for linear algebra operations, automatic differentiation and mathematical optimization algorithms, although they are not adequate for relational queries and require the data matrix in a denormalized form.
The overhead for interfacing databases and ML libraries materializes as increased costs: code increases in complexity and maintenance burden, projects need to specialize labor into data engineers on the SQL side and data scientists on the Python/R side, and the time to develop a model is slowed down by moving data through the pipeline and coordinating multiple people. The data science process (see Figure sql4ml A declarative end-to-end workflow for machine learning) is often experimental trying out different features, training sets, and ML models, thus the overhead costs compound on every iteration.
We envision a more efficient workflow where an individual data scientist can perform both data management and ML in a single software system. We present sql4ml as an integrated approach for data scientists to work entirely in SQL, while using a ML library like TensorFlow [?] as the backend for ML tasks. We target users that are familiar with SQL and would like to have more advanced capabilities for data analysis. They need to understand the math for defining an ML model and write it in SQL, but do not need to deal with the lower level details of ML libraries, such as data batching and representation. Data representation in relations is standardized in the SQL language.
By synergizing a database and ML library, the user has more flexibility to define the algorithm of her choice compared to prepackaged ML algorithms. Our implementation specifically targets SQL and TensorFlow (Python API), although the ideas are equally applicable to other database languages such as Datalog, other ML libraries such as PyTorch, and other data science languages such as R.
In addition to using SQL for traditional data management, the user can also define and train supervised ML models in SQL. She provides concrete SQL tables of features and target values (a.k.a. labels) of training observations, and specifies the ML model by SQL queries for the model prediction function and objective function in terms of an empty table of weights (a.k.a. parameters or coefficients) which will be learned during training. sql4ml generates the boilerplate code to move the features from relations in the database to CSV files on disk or directly to tensors in the ML library.
sql4ml automatically translates SQL queries defining an ML model to TensorFlow Python code, which is sufficient to invoke gradient descent to find the optimal weights that minimize the objective function. The objective function written in SQL can be understood as a query for the optimal weights that minimize it. The generated TensorFlow/Python code is effectively a physical plan of the query. sql4ml thus makes it easier for a single person to run the full data science cycle, because it eliminates the need for two programming paradigms and automates moving data between systems, thus leaving more resources to find the right features and model.
Finally, sql4ml moves the computed weights from the ML library back to the database. Thus, the data scientist can stay in the database to compute predictions on test data and statistics about the ML model, such as percent error or precision and recall. Furthermore, the database can be used to store multiple trained ML models and keep track of the features that were used in each ML model, in contrast to ad hoc management of weights in files on the filesystem. In this way, she can also use the database as a system for keeping records of experimental results and manage the entire ML lifecycle in one place.
In summary, we provide an end-to-end workflow that starts in the database, pushes the training of the model to the ML library, and stores the computed weights back in the database. A core contribution is a method for translating objective functions of ML algorithms written in SQL to linear algebra operations in an ML library. We demonstrate how our translation method works and apply it on popular machine learning algorithms, including Linear Regression, Factorization Machines [?], and Logistic Regression.
The rest of the paper is organized as follows. Section sql4ml A declarative end-to-end workflow for machine learning presents an overview of sql4ml workflow in terms of input and generated code. In section sql4ml A declarative end-to-end workflow for machine learning we discuss differences in the functionality and data structures between relational databases and ML frameworks, whereas in section sql4ml A declarative end-to-end workflow for machine learning we describe the steps of our approach in detail. Sections sql4ml A declarative end-to-end workflow for machine learning and sql4ml A declarative end-to-end workflow for machine learning present the usability benefits of sql4ml and an experimental evaluation respectively. Finally, section \thefigure discusses related work and section sql4ml A declarative end-to-end workflow for machine learning concludes the paper.
sql4ml takes as input SQL code and generates Python code consisting either of TensorFlow API functions or other Python modules, which automates stages of the workflow in Figure sql4ml A declarative end-to-end workflow for machine learning. An overview of the code files involved in sql4ml is presented in Figure sql4ml A declarative end-to-end workflow for machine learning and highlighted next.
User’s code. The user provides SQL code defining which features shall be used for training and the objective function of the ML model (left side of Figure sql4ml A declarative end-to-end workflow for machine learning).
Generated code. There are four main groups of code that sql4ml generates, which correspond to the four dotted rectangles of Figure sql4ml A declarative end-to-end workflow for machine learning. The first regards the transfer of data from relations to tensors. sql4ml generates SQL queries for exporting feature and target values of training data and Python code for loading these values into tensors. In the second group, there is code for the definition of the ML model. This is generated by translating SQL queries that define the objective function of the ML model to TensorFlow API functions. Finally, boilerplate code is generated for triggering gradient descent and importing the computed values of weights back to database tables after training. The execution order of the generated code goes from top to bottom following the arrow in Figure sql4ml A declarative end-to-end workflow for machine learning.
DB and ML icons in Figure sql4ml A declarative end-to-end workflow for machine learning are used to indicate which parts of the code run on the database and which on the ML framework. This is also denoted by the file extensions .sql and .py on each filename (in sql4ml we chose to translate SQL to Python code, including functions of the TensorFlow API, and use TensorFlow as an execution engine.) We describe in detail how each part of the code is generated and provide specific examples of user provided and automatically generated code in section sql4ml A declarative end-to-end workflow for machine learning.
In this section we briefly discuss some fundamental differences between relational databases and ML frameworks, which pose challenges in integrating functionality of the two worlds. Some of the differences stem from the foundations of relational algebra and linear algebra, whereas others are related to programming paradigms and the use of declarative versus procedural languages.
Relational databases are built around the abstraction of relation, which is an unordered set of tuples. There is no notion of order between the columns of a tuple. Because of this, the user is not concerned with the particular physical representation of relations in the database, and the system is free to optimize the representation (e.g., with indexes). On the other hand ML frameworks operate on tensors, which are multidimensional arrays. The word “array” is key here. It indicates that the elements of a tensor are accessed via indices, basically natural numbers from to (or ), which in turn implies a specific order on the elements. The connection between boolean matrices and relations is well-established, whereas recent work [?] studies mappings between matrices of numeric values and -relations.
On the functionality front, relational operators map input relations to an output relation and include selection, projection, joins and set-theoretic operators. In contract, linear algebra operators include matrix multiplication, summations over rows and columns, matrix transpose, element-wise arithmetic operations and others. In addition to this and as far as programming model is concerned, databases support declarative languages, traditionally SQL and in some cases Datalog, whereas ML frameworks usually prefer imperative languages, such as Python and C++. It has been shown that a set of commonly used linear algebra operators can be implemented in SQL using element-wise mathematical operations and aggregations [?], [?], [?]. For example, a matrix-vector multiplication is an element-wise multiplication between columns and a summation over the elements of each tuple in a relation. However, other parts of machine learning algorithms are more difficult to implement inside a database.
A significant limitation of SQL is the lack of iteration constructs. RDBMS support for iterations is typically limited to fixed-points over sets. Iterative processes are very common in machine learning though. The training of a ML model is one of them and is done using well-known mathematical optimization algorithms, such as gradient descent. Because of their wide use in machine learning, ML frameworks provide highly optimized implementations of such algorithms.
Another very useful capability of some ML frameworks, which is not currently supported in databases, is automatic differentiation. Systems that do not provide automatic differentiation require the user to explicitly write the derivatives of the objective function, which becomes nontrivial for complicated functions.
On the other hand, ML systems lack support when it comes to data processing. Unlike databases, a tensor library like TensorFlow does not provide relational operators (e.g., joins and filters). While a ML framework can be combined with other libraries in the host language providing tabular operations, such as Python Pandas [?], there are two main issues associated with this approach. First, the scalability of these solutions is limited by the main memory of the machine. Second, such libraries offer a quite procedural way of writing relational operations that exposes the order in which they are executed, e.g. A.join(B).join(C). On the other hand, databases incorporate an extensive body of work to optimize queries just describing the output set and handle large datasets.
Given the mismatch between the data and programming paradigms in databases and machine learning systems, we propose a unified approach where both can be programmed from SQL. In sql4ml , the data analyst writes SQL queries both to prepare the features and to define the ML task. In particular, the user writes the objective function of the ML model in SQL and the system automatically translates it to the corresponding TensorFlow code. Moreover, the data are transparently moved from relations in the database to tensors in the ML system.
A large class of supervised algorithms in ML, including linear regression, logistic regression, Factorization Machines and SVM, follow a common pattern in how they work. Unsupervised algorithms, such as k-means, can be reformulated to fit into this pattern, too, if we replace the concept of features with data points. We explain the encountered similarities below.
Given a set of input features , a machine learning algorithm can be specified by an objective function. For example, in logistic regression the relationship between the target variable and the input features is defined by the prediction function
where are the unknown weights of the ML model to be optimized. The objective function (a.k.a. loss function or cost function) of logistic regression is
The minimization/maximization of the objective function is done using a mathematical optimization algorithm, such as gradient descent. Gradient descent is an iterative method, which computes the derivative of the objective function and updates the ML model weights in each iteration. Given a system such as TensorFlow that supports automatic differentiation and implements a mathematical optimization algorithm, the user only needs to provide the objective function in terms of input features and weights.
We propose a workflow, where the user writes SQL to both query the data and express the objective function of the ML model. A TensorFlow representation of the SQL queries defining the ML model and consisting of linear algebra operators and tensors is generated.
This representation is created in two steps. First, the SQL code defining the objective function of the ML model is converted to an abstract syntax tree (AST), which is then automatically translated to code in the host language of a ML framework. This code is supplemented with a few more lines calling gradient descent on the objective function inside a loop. The entire program is then executed on the ML framework. In order to feed data from relations to tensors, we automatically generate code either for connecting and querying the database using a suitable driver or for exporting data to files. After the generated code is executed and the optimal weights for the objective function are computed, they are transferred to relations inside the database. The proposed workflow is displayed in Figure sql4ml A declarative end-to-end workflow for machine learning.
Note that joins needed in feature engineering are still evaluated inside the database, whereas any linear algebra operation written in SQL is translated to the appropriate operators of the ML framework and is evaluated in it. On ML frameworks supporting it, that also means the linear algebra part can run on GPUs. Relational databases provide quite advanced query optimization and there are already mature solutions for linear algebra operations, automatic differentiation and mathematical optimization algorithms outside the database ecosystem. In our approach the key components of a machine learning algorithm are provided using SQL, whereas the training of the ML model is executed on a system specializing in machine learning, hence avoiding the need for implementing an iterative process in the language of the database. Essentially, we attempt to combine the best from both worlds, but at the same time do so in a transparent way, in order to reduce the manual work required by the user and the need to be familiar with lower level details of machine learning frameworks, such as data representation. Also by creating an under the hood synergy between a database and mathematical optimizers of ML frameworks, the user has more flexibility to define the algorithm of her choice compared to using ready made ML algorithms.
In this section we present how machine learning algorithms are written in SQL through the example of logistic regression. As stated earlier, the user writes the objective function of the ML model in SQL.
We implement logistic regression on the following schema:
The observations table stores the identification numbers of training observations. For every observation and feature, the features table has an entry with the feature’s value. For example, the entry (1, "feature1", 30.5) means that feature "feature1" of observation 1 has value 30.5. The targets table stores the observed target values of the training observations. Data for these three tables are provided by the user, extensional to the database. Finally, the weights table is filled in by learning the weights of a model in TensorFlow.
Given these tables, we define SQL views sigmoid for the sigmoid function of the logistic regression model and objective for the objective function (see Listing 1). These functions embody the Equations sql4ml A declarative end-to-end workflow for machine learning and sql4ml A declarative end-to-end workflow for machine learning and are expressible in SQL using numeric and aggregation operations. Instead of evaluating them inside a database, they are translated to TensorFlow linear algebra functions for evaluation inside TensorFlow. Note that the user defines the model by writing prediction and objective functions in SQL, while sql4ml generates the code to train the ML model.
The SQL code as is cannot be evaluated because the weights table is empty. Even if the weights were initialized with random values, only a single round of evaluation would be possible. No iterative process is defined in the code of Listing 1 to update tuples in weights. Using the SQL representation we produce a representation in TensorFlow/Python API to be evaluated in TensorFlow.
In this section we describe the process of translating machine learning algorithms from SQL to TensorFlow API. We start with a simple example, the prediction function of linear regression.
This may be implemented in SQL as follows:
Using TensorFlow API we could write this as follows:
In the mathematical equation and TensorFlow, dimensions between and need to match, in order for the matrix-vector multiplication to be executed. Hence, the transpose operation on in Equation sql4ml A declarative end-to-end workflow for machine learning. In SQL this part is implemented as an element-wise multiplication between columns and a group by aggregation, which implies that dimensions are irrelevant in this case. The translation process generates the TensorFlow snippet above from the corresponding SQL snippet.
A SQL program is a list of SQL queries. We handle a specific (albeit general enough to express supervised ML algorithms) type of select query that is described in algorithm 1. We assume that there is only one numeric expression per query, although it may involve multiple nested computations, e.g. (a+b)/c. Because we do not handle subqueries, we store the result of a query as a view and use the view name in subsequent queries when needed. Algorithm 1 displays the high-level steps to translate a create view query to a TensorFlow command. The translation proceeds by applying typical steps from compiler theory. A SQL query is first tokenized (lexing) and parsed. Then an AST is produced for it. Based on the AST we extract the numeric expression involved in the query as well as the name of the view, and generate an equivalent TensorFlow command.
The AST represents necessary information for translating a SQL query to functions of TensorFlow API. This information includes the operators that are involved in a SELECT numeric expression, the columns on which the operators operate, the tables where the columns came from, as well as columns involved in group by expressions. In order to generate the TensorFlow code, we need to match SQL numeric operators and aggregation functions with linear algebra operators. In Table sql4ml A declarative end-to-end workflow for machine learning we present equivalences between these two categories of operators encountered in examples in this paper.
|Linear algebra operator||SQL||TensorFlow|
|matrix product||SUM(_ * _)||tf.tensordot|
As soon as we extract the numeric expression among the SELECT expressions of a query, we analyze it recursively by visiting each subexpression and decomposing it to the operators and the columns on which the operators take place as described in Algorithm 2. We apply a compositional translation where there is a mapping that preserves the structure between the SQL and TensorFlow expression. If the expression is a column or a constant, we output a TensorFlow constant or variable name. Otherwise, we match it to a TensorFlow operation according to Table sql4ml A declarative end-to-end workflow for machine learning and proceed with translating the operands of the expression. Recall that tensors store only real values, so analyzing column projections/expressions with other types is not applicable.
Hence, in the select query of view predictions we will translate only the expression SUM(features.v * weights.v) to TensorFlow code. In Figure sql4ml A declarative end-to-end workflow for machine learning we present the TensorFlow AST that corresponds to the SQL AST of a matrix-vector multiplication. Based on Table sql4ml A declarative end-to-end workflow for machine learning we match the combination of function SUM and with tf.tensordot. We then scan the columns features.v and weights.v, and identify that they belong to tables features and weights. We use these table names for the tensors participating in the corresponding TensorFlow operation. Finally, the argument axes=1 in tf.tensordot is used to define the axes over which the sum of products takes place. Since our work focuses on data that can be stored in up to two-dimensional tensors, i.e. matrices and vectors, we will always use 1 as the value of the axes argument, which is equivalent to matrix multiplication.
Apart from linear algebra, other mathematical operations may be encountered in machine learning algorithms. An example of this is the sigmoid function used in logistic regression, which is defined in Equation sql4ml A declarative end-to-end workflow for machine learning. The sigmoid function is implemented in SQL as follows.
In TensorFlow API we can write this using the supported sigmoid function directly
or in a more verbose way as follows:
The translation process is very similar to the former example, except that here we have the exponential function operating on the product between the features of each observation and the weights . As a result apart from the mapping between element-wise numeric and linear algebra operators, we also need to match mathematical functions used in SQL to functions of TensorFlow API. All mathematical functions we have encountered so far in our SQL examples are also included in the TensorFlow API. We present the correspondence between the two in Table sql4ml A declarative end-to-end workflow for machine learning.
Assuming that the products between features and weights has been already defined in a different query and based on the SQL AST in Figure sql4ml A declarative end-to-end workflow for machine learning, we identify four operators, /, +, EXP and - in the numeric select expression of view sigmoid. Each of them will be matched to a TensorFlow operator forming the AST at the right of Figure sql4ml A declarative end-to-end workflow for machine learning. More specifically - is matched with tf.negative. We treat it as a different case from tf.subtract, when it is applied to a single constant / column. EXP is matched with tf.exp, whereas / and + are matched with tf.div and tf.add respectively. We combine all operators to a single TensorFlow command by nesting them. The generated TensorFlow command will be
Finally, when we have a group by clause combined with an aggregation function in SQL, such as SUM or AVG, we need to analyze it, in order to extract necessary information regarding the parameters of the matched reduce operation in TensorFlow. In reduce operations in TensorFlow, the user also has to specify the dimensions to reduce over, i.e. the axis parameter. If the group by in the SQL statement is on a primary key, then we need to reduce across columns and the value of axis parameter is 1. If the group by column is not a key, then we need to reduce across rows and the axis parameter is 0. According to TensorFlow API, the axis parameter takes values in the range [-rank(input_tensor), rank(input_tensor)], but as we limit our translation method to matrices and vectors, we do not deal with higher dimensional tensors of rank greater than 2. In case there is no group by clause, that means that we give the value None to the axis parameter, and the reduction happens across all dimensions, as for example in objective query of logistic regression in Listing 1.
To provide a translation, apart from the SQL code, we also request from the user a few information hints. Those hints can be provided in the form of command line arguments or in a configuration file. More specifically, the names and dimensions of the following tables are needed:
Tables that store features, e.g. features,
Columns in the aforementioned tables that store names of the features, e.g. featureName in table features,
Tables that store weights of the ML model, which are initially empty and will be populated once the gradient descent algorithm has computed the optimal values for them, e.g. weights,
The dimensions of the tables that store weights of the ML model, which can be derived from the cardinalities of all columns in the table except from the one storing the value of the weight and whose type will be double, e.g. . The number of weights should also be equal to the number of features.
The table that stores target values of training observations, e.g. targets.
Based on these tables, we are able to determine which tensors will be defined as constants or variables at the TensorFlow side, as well as provide their dimensions. Finally, the following information are necessary for the training loop and the connection to the database.
Hyperparameters for the gradient descent algorithm: the number of iterations to run (e.g. 1000) and the learning step (e.g. 0.00001),
A url and the credentials of a user to connect to the database, e.g. mluser@localhost/items.
So far we described how we translate SQL numeric and aggregation operations into TensorFlow operators. In order to execute the TensorFlow program and compute the optimal weights of the ML model, we need to feed the tensors with data. Data are initially stored as relations inside the database. So we need a mechanism to transfer the data from relations to tensors. However, in section sql4ml A declarative end-to-end workflow for machine learning we described the mismatch between the two data structures.
Let us start with possible ways of storing training data in relations. Suppose each observation has a number of features and a target value. We could store this information in three relations: observations, features and targets. The schema of these relations would be
Given this schema we need to transfer the feature values of each observation to a matrix. Although relation features stores each feature of an observation in a different tuple, e.g., (1, ”price”, 3.5) and (1, ”size”, 20), for a matrix we need to consolidate all features of an observation to a single row, e.g. (3.5, 20). In addition to this, we are interested only on the real values, discarding the observation IDs and names of features. Hence, we automatically construct a SQL query, which transforms the schema of features relation to a different one storing all features of an observation in a single tuple as it is displayed in Figure sql4ml A declarative end-to-end workflow for machine learning. The resulting schema is the following:
We create a column per feature name and store only the value of the feature in it. The pivot query in Listing 3, which transforms rows to columns for features relation, is generated automatically and is added above the TensorFlow code defining the ML model, so that we can feed tensors with the appropriate data before training begins. Essentially, a relation can be formulated as a matrix where
Of course, in order to be able to execute the pivot query, we also add a few lines of boilerplate code to connect to the database and run the query there. The result of the query is then processed with a little bit more of boilerplate code, which discards rowID and keeps only feature values, in order to be fed to a tensor.
Target values are fed to a vector in a similar manner. Because this relation stores a single tuple for each rowID, we do not need a pivot query. We rather execute a select query on table targets and keep only the target value in order to store it in a vector.
However, instead of storing all features in a single table, we could have a case where features are stored in different tables. Thus, we would need to join tables to gather the features of an observation. Suppose that we have a database storing items and their sales and we want to use it to train a ML model for future sales prediction. The schema of the database is
In the schema above an observation depends on three dimensions: items, stores and dates. The same holds for sales. We can use these data to predict future sales of an item at a particular store and date. The features of the ML model are based on item or store characteristics and are stored in familyFeat and cityFeat. Because families of items and cities are strings, the user has created a one-hot encoding representation in familyFeat and cityFeat relations. Using this representation one can convert categorical variables into a numeric form, where each family and city value becomes a real feature whose value can be either 0 or 1. Hence, in order to gather the feature values for an observation, we need to join observations with familyFeat and cityFeat. Listing 5 displays the automatically generated query that joins observations with familyFeat and cityFeat and gathers all features of an observation in a single row. Figure sql4ml A declarative end-to-end workflow for machine learning presents the mapping between features from different tables and columns of the matrix storing features for all observations. It displays the export of a table to a dense matrix, but in case of categorical features with lots of zero values exporting to a sparse representation, such as csr [?], to save on space and time is also possible.
Tables corresponding to weights of the ML model are also mapped to a single vector. By mapping indices of the weights vector to the initial tables and columns, we are able to store computed weights back to the database after training. Finally, all TensorFlow commands involving feature and weight tensors as generated by their initial representation using normalized tables are rewritten in terms of the global features and weights tables.
By normalizing features across tables and connecting the observations with them using foreign keys we avoid redundancy in storing shared feature values multiple times. When gathering features to populate a global feature matrix as in the previous subsection, the repetitive nature of features can be exploited to reuse computations. Reuse of computations can be carried out via subqueries or precomputed tables/materialized views, both standard or widely supported techniques in relational databases.
In the example schema of listing 4, observations are based on three dimensions: items, stores and dates. The number of observations is less or equal to . Features that are based solely on items, stores or dates are shared across observations that concern the same item, store or date. For example, if we have two observations, (”item1”, ”store1”, ”date1”) and (”item1”, ”store1”, ”date2”), only the features involving date change between the two observations. Item and store features remain the same. Product family and city may be an item and store feature respectively. Hence, instead of recomputing family and city features for each observation, we can either compute them once using a subquery or precompute them and store them in a materialized view/table.
Continuing on the example above, a query for one-hot-encoding the categorical features of family and city, and gathering all features together could be the following:
This query naively computes the one-hot encoding representation of family and city features using case when expressions for every observation. sql4ml generates a more efficient version of the query above, as is depicted in Listing 5, making use of subqueries that denormalize the categorical features of family and city for every item and store respectively. Then each observation is joined on itemID and storeID with corresponding tuples from familyFeat and cityFeat tables. The result of the subqueries can also be materialized.
Due to the statistics the database holds, it is easy to figure out that the cardinalities of items, stores and dates are smaller than the total number of observations and thus that some feature computations will be repeated. As a result, features that are based on individual dimensions of observations can be automatically precomputed by the system. We report evaluation results comparing time to export features with and without precomputation in section sql4ml A declarative end-to-end workflow for machine learning.
The code for translation and transferring data between data structures is implemented in Haskell 111https://github.com/nantiamak/sql2tf. For lexing, parsing and generating the AST of SQL code we used the open source project Queryparser [?], also implemented in Haskell. Queryparser project supports three sql-dialects (Vertica, Hive and Presto), but we found it adequate for parsing at least select queries, create view and create table commands in MySQL and PostgreSQL, as well. Following the generation of AST, we developed the rest of the code for analyzing it and generating the TensorFlow program end to end.
A translation is essentially a function from a source type to a target type. As such, functional languages are a good fit for developing translation processes. For translating SQL to TensorFlow, SQL is captured in algebraic data types (ADTs). The AST of a query is represented using the ADTs provided by the Queryparser project. We defined extra data types to represent objects of interest particularly to the TensorFlow translation.
The assembly of a TensorFlow command, i.e. the output of translateNumericExpr and getViewName in Algorithm 1, is carried out through functions walking over the AST and performing pattern matching on it. Various other functions are implemented in a similar way to extract group by keys, key constraints, column expressions, create table statements and generally any information needed to generate the end-to-end TensorFlow program from the AST. We found that features like pattern matching and tail recursion encountered in functional languages, such as Haskell, make the manipulation of tree structures easier than how it is developed in imperative languages.
Assuming the data science workflow displayed in Figure sql4ml A declarative end-to-end workflow for machine learning, we argue that the use of sql4ml has a number of benefits from a usability perspective.
We propose the use of SQL for both feature engineering and the definition of a ML model. It is a common scenario to have a data engineer exporting the appropriate data from a database and a data scientist writing code for a ML model [?], [?]. The first part involves data processing pipelines built out of relational operators common in SQL, Datalog and data processing APIs, whereas the second is mainly based on linear algebra operators and probabilities, and is expressed in ML framework APIs written in imperative languages. Expressing the entire workflow in a single language increases ease of maintenance, as it is no longer required from the users to be familiar with more than one programming paradigms. Moreover, declarative database languages, such as SQL, are based on high-level, standardized abstractions, which hide their physical implementation from the user.
sql4ml automates the move of training data from relations to tensors and removes the burden of data exports from the user. In addition to this, processing data with SQL also facilitates a few frequent operations in machine learning. When working with large datasets, training usually happens on batches. To create randomized batches as a means of avoiding implicit bias to a ML model, training data are shuffled. Using a library, such as Python Pandas, on the ML framework side for shuffling requires loading the entire training set in-memory, which results in out-of-memory errors when data are larger than the machine’s memory specs. This situation can be avoided if we use a RDBMS for arranging the tuples of a table in random order. Also, the creation of batches themselves might fail due to memory issues. Lately, TensorFlow 222https://www.tensorflow.org/api_docs/python/tf/data/Dataset and PyTorch 333https://pytorch.org/tutorials/beginner/data_loading_tutorial.html added data structures for handling datasets incrementally, but if this is not provided by the library, then a typical procedure is to load data in-memory and take random samples to create each batch. In this case, executing SQL queries to fetch tuples per batch can provide a solution to memory issues.
Because in sql4ml weights of a model are stored in a table, the user is able to query them. For example, she might be interested in examining the range of weights in order to identify large values using a SQL query similar to the following.
In ML frameworks models are saved to files, whose extension is not standard. For example TensorFlow uses the .ckpt and .data extensions, whereas PyTorch uses .pt or .pth files. Each API provides its own functions to access parts of these files, though each has its own particularities with which the user needs to get familiar. We argue that declarative database languages provide a standard and higher-level way to query data and as such it can be used to manage ML models, too. We could also store metadata regarding features and weights in tables. One such example is to add columns or tables keeping track of different versions of a model built from different features subsets.
sql4ml generates valid and readable TensorFlow/Python code. It can be combined with other code in Python, which is a popular language used among the data science community. In addition to this, TensorFlow code can run on GPUs, providing additional hardware options for training outside the database.
We conducted two types of experiments: (1) the time translation takes to generate the entire TensorFlow program for three models defined in SQL code and various feature sets, (2) the time difference between exporting training observations with and without precomputed features.
Experimental setup Experiments ran on a machine with Intel Core i7-7700 3.6 GHz, 8 cores, 16GB RAM and Ubuntu 16.04. TensorFlow code ran on CPUs, unless stated otherwise. We use PostgreSQL as an RDBMS and whenever TensorFlow API is mentioned, it refers to the Python API.
Datasets Two datasets are used throughout the experiments: Boston Housing [?], [?] and Favorita [?]. Boston Housing is a small dataset including characteristics, such as per capita crime rate, pupil-teacher ratio, for suburbs in the Boston area and the median value of owner-occupied homes. The Favorita dataset is a real public dataset consisting of millions of daily sales of products from different stores of Favorita grocery chain. Dataset characteristics are displayed in Table sql4ml A declarative end-to-end workflow for machine learning.
|Dataset||Observations||Total Features||Numeric features||Categorical features|
In Tables sql4ml A declarative end-to-end workflow for machine learning and sql4ml A declarative end-to-end workflow for machine learning we measure the time to translate the original SQL code provided by the user to TensorFlow code. For every translation we report average wall-clock time after five runs. The generated TensorFlow code includes all steps of the workflow, end-to-end, i.e. both code for exporting/importing data and defining the ML model. In Table sql4ml A declarative end-to-end workflow for machine learning we provide translation time from SQL implementations of three models, linear regression, Factorization Machines and logistic regression having all features in a single table. We also report lines of SQL code, as well as lines of generated and handwritten TensorFlow code. SQL code includes both the select queries for defining the ML model and the commands for creating the tables used in its definition. We assume that queries are formatted reasonably in multiple lines, with select expressions, from, where and group by clauses put in separate lines. Features for training these models are based on the Boston Housing dataset.
We also report translation time from a linear regression model operating on different sets of normalized features, i.e. features stored in multiple tables, in Table sql4ml A declarative end-to-end workflow for machine learning. Essentially, this latter experiment refers to the second case of denormalization described in section sql4ml A declarative end-to-end workflow for machine learning, where more rewriting steps are executed by sql4ml . We focus on how translation time is affected by increasing the number of involved features. In Table sql4ml A declarative end-to-end workflow for machine learning the sets of features are based on Favorita dataset. Product and store characteristics, such as the family of a product and the location of a store, can be used as categorical features for training a model. In Table sql4ml A declarative end-to-end workflow for machine learning we assume that categorical features are converted to a one-hot encoding representation, where each value of a categorical feature becomes a numeric feature.
|ML model||Lines of SQL code||Translation time (sec)||Lines of TF generated code||Lines of TF handwritten code|
|Feature tables||Total categorical features||Numeric 1-hot encoded features||Translation time (sec)|
|family, city, state, store type||4||76||0.051|
|family, city, state, store type, holiday type, locale, locale_name||7||109||0.068|
|family, city, state, store type, holiday type, locale, locale_name, item class||8||446||0.0906|
Takeaways Experiments show that translation time takes a few dozens millieseconds. Note that translation time is not affected by the number of training observations, only by the lines of SQL code and number of features, which affect the generation of the denormalization queries in Listings 3 and 5. From the results in Table sql4ml A declarative end-to-end workflow for machine learning, we can also observe that for the same ML model, translation is more time-consuming when features are normalized across tables, but the increase remains within reasonable limits. The reason for this is that the translation algorithm needs to create mappings between feature/weight tables and the columns of the global feature/weight matrix that concentrate all values in a single place. Regarding the lines of code, we observe that there might exist a negligible difference between the automatically generated and handwritten code. This difference is due to the fact that generated code could be more verbose as we do not support the use of subqueries in SQL code. For example, in the least squared error formula the automatically generated version would compute the square of errors and their average in two operations coming from two separate SQL queries,
whereas a Python developer would probably prefer to combine these steps in a single line, e.g.
taking advantage of the square_difference operation of TensorFlow API. Similar differences might occur in case an operation is provided by TensorFlow API but not by SQL. An example is the computation of sigmoid function. TensorFlow provides tf.sigmoid, whereas in SQL the user needs to write the formula definition. As a result the generated code for this specific part might involve more operations than its handwritten version.
Another point where differences in code might occur is data loading. The automatically generated code follows the pattern of reading features and targets from two separate files/data structures, corresponding to the two tables/columns inside the database. A developer though might have other options for loading data. For example, all data might be exported in a single file and he might use Pandas operations to split them to two dataframes before loading them to tensors.
Using the feature reuse process described in section sql4ml A declarative end-to-end workflow for machine learning, we compare the time needed to export training data from the Favorita dataset with and without precomputing and materializing shared features in tables. Table sql4ml A declarative end-to-end workflow for machine learning shows time performance on exporting four different sets of features. The first three sets were exported on 80000000 observations. Only the last one was exported on 13870445 observations as holiday features of the Favorita dataset do not apply for most observations. Favorita observations are based on three dimensions: items, stores and dates. There are 4100 items, 54 stores and 1684 dates. When precomputation is used, we compute and store the values of each feature in a table, which we join with every observation during exporting. The naive version of the query computes every feature for every observation (see Listing 6) ignoring the fact that some features are shared among observations.
|Observations||Features||Time without precomputation||Time with precomputation||Precomputation time|
Time performance results indicate that feature precomputation reduced exporting time on the Favorita dataset by about 50% while the time needed to compute and store features in tables remains low.
To check correctness of the generated code, we ran experiments with two ML models on the Boston Housing dataset of Table sql4ml A declarative end-to-end workflow for machine learning and the epsilon dataset [?], and compared the loss function between their generated and handwritten versions. The epsilon dataset is a synthetic dataset of 400000 observations with 2000 features that can be used for binary classification as its observations are annotated with either 1 or -1. We trained a linear regression model on Boston Housing data that predicts the median value of owner-occupied homes using the rest of the characteristics as features. The learning rate used in gradient descent is 0.000003 and we ran 10000 iterations. The loss function is mean squared error. Figure sql4ml A declarative end-to-end workflow for machine learning displays the loss function across iterations from both the handwritten and the generated code of the linear regression model. We observe that both curves follow the same decreasing pattern.
Next we provide results from two logistic regression models on the epsilon dataset. Due to a hard size limit of 2GB in protocol buffer 444https://developers.google.com/protocol-buffers/, TensorFlow cannot load the entire epsilon dataset at once. For this reason in order to train a logistic regression model, we need to perform mini-batching on the data, where each gradient descent iteration runs on a single mini-batch instead of the entire dataset.
On the SQL side, the synthetic nature of epsilon dataset is not well-suited to the relational structure. All features are real values with no specific meaning. For our experiment we used a highly denormalized schema with 4 tables, each storing 500 features, and a table for the labels, as displayed below.
A query that joins all 4 tables and projects 2000 columns in order to gather all features together ran out of memory on a machine with 16GB. However, because TensorFlow poses a 2GB limit due to protocol buffer anyway, we can load the dataset in two halves using two SQL queries and create mini-batches from the first half at the beginning and the second half later on. We compare the loss function of the generated TensorFlow code using the above configuration with a handwritten version that loads the entire dataset from a file to a NumPy array, which is then divided into mini-batches at the TensorFlow side. For this experiment the learning rate in gradient descent was set to 0.01. We ran 300 epochs and set batch size to 200. Each epoch contained 2000 iterations to cover the entire training set. The loss function is logistic loss. Figure sql4ml A declarative end-to-end workflow for machine learning displays the loss function across iterations from both versions of the logistic regression model. Due to the stochastic nature of mini-batching, there are small fluctuations between the two plots. However, overall the pattern is very similar.
The line of work in sql4ml is mainly relevant to the following three directions.
In-database machine learning There have already been efforts to bring machine learning inside the database. MADlib [?] is a library of in-database methods for machine learning and data analysis. It provides SQL-based ML algorithms, which run on PostgreSQL [?] or Greenplum database [?]. SQL operators are combined with user defined functions (UDF) in Python and C++ implementing ML algorithms for popular data analysis tasks, such as classification, clustering and regression. Bismarck [?] describes a unified architecture for integrating analytics techniques as user-defined aggregates (UDAs) to RDBMS. Similarly, Big Query [?], which is based on the Dremel technology [?], supports a set of ML models that can be called in SQL queries. In systems of this flavor, the user can select from a predefined collection of available algorithms. Moreover, code inside UDFs/UDAs remains a black box and is not optimized by the database system. In our approach we leave the user define the objective function of the algorithm, in order to provide more flexibility on the ML models she can develop.
A second direction regarding in-database learning aims at mingling the dataset construction with the learning phase and accelerating the latter by exploiting the relational structure of the data. F [?], [?] and AC/DC [?] operate on normalized data and rewrite objective functions of ML algorithms by pushing the involved aggregations through joins. This idea of decomposing the computations of ML algorithms comes under the umbrella of factorized ML. The purpose of these works is to showcase that given the appropriate optimizations a RDBMS can efficiently perform ML computations and the need for denormalization of the data can be avoided. Initial efforts [?], [?] in this area required manual rewriting of each ML algorithm to its factorized version and because of that suffered from low generalizability. More recent work [?], [?] aims to provide a more systematic approach to automatic rewriting of ML computations to their factorized version. Aspects of this line of work are orthogonal to sql4ml . For example, since our workflow also starts in the database the optimization with functional dependencies presented in [?] can also be exploited by our translation method in order to reduce the weights of a model. Another example is the use of lazy joins for more efficient denormalization of data as described in Morpheus [?]. We argue though that by interoperating with existing ML frameworks, we gain in usability and reduce development overhead, as we can exploit helpful features, such as automatic differentiation and mathematical optimization algorithms.
Finally, [?] propose an extension on SQL to support matrices/vectors and a set of linear algebra operators, whereas [?] presents optimizations on executing recursion and large query plans on RDBMS, which can make them suitable for distributed machine learning. In sql4ml we do not require any changes on the RDBMS or the machine learning framework. We rather propose a translation method that gets standard SQL code as input and generates valid TensorFlow/Python code.
Machine learning frameworks SystemML [?], TensorFlow [?], PyTorch [?], Mahout Samsara [?] and BUDS [?] provide domain specific languages (DSLs) and APIs that support linear algebra operations and data structures, probability distribution and/or deep learning functions, as well as useful ML-centric features, such as automatic differentation. These systems are more than ML libraries as they apply algebraic rewrites and operator fusion [?], [?] to optimize users’ code. They employ an analogy of logical and physical plans similar to query optimization in database systems. The user essentially constructs a graph defining dependencies between operators and operands whose execution order is chosen by the system. Because of the emphasis on linear algebra, ML frameworks operate on denormalized data. sql4ml targets SQL users who work with relational data and would like to perform more advanced analytics using machine learning.
Mathematical optimization on relational data To the best of our knowledge, MLog [?] and SolverBlox [?], [?] of the LogicBlox database [?] are the closest systems to sql4ml . Both systems model ML algorithms as mathematical optimization problems using queries that find the optimal values that minimize/maximize an objective function defined either in Datalog or a tensor-based declarative language similar to SQL. MLog translates the user’s code at first in Datalog and then in TensorFlow, which finally computes the optimal weights for the objective function. Similarly, SolverBlox, currently supporting only linear programming, translates Datalog programs to an appropriate format consumed by a linear programming solver. In sql4ml we do not propose a new language for modeling ML algorithms, but rather we use SQL, an already popular declarative language. Regarding the supported ML algorithms, we expand outside linear programming as MLog does, but describe in detail a translation method directly from SQL to TensorFlow API without using any other languages for intermediate representations.
We presented sql4ml , an integrated approach for data scientists to work entirely in SQL, while using a ML framework as an execution engine for ML computations. The approach of sql4ml automates and facilitates tedious parts of the current fragmented data science workflow. We developed a prototype and showcased our translation method from SQL to TensorFlow API code on well-known ML models. Evaluation results demonstrate that this translation is completed in little time .
As future work we would like to investigate further the use of sql4ml in defining and training neural networks, such as Recurrent Neural Networks (RNN), as well as the adaptation of the system to unsupervised ML models. Another very interesting line of work is the study of query/code optimization techniques whose application could result in more efficient code on ML framework side. Recent work on in-database machine learning [?] discusses the use of functional dependencies in reducing the dimensionality of ML models. Such techniques could prove useful in our approach, as well, as they could produce lower-dimensional ML models and accelerate training, even if this is executed outside the database.
- 1 Apache mahout. http://mahout.apache.org/.
- 2 Big query. https://cloud.google.com/bigquery/.
- 3 The boston housing dataset. https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html.
- 4 Data engineers vs. data scientists. https://www.oreilly.com/ideas/data-engineers-vs-data-scientists.
- 5 The difference between data scientists and data engineers. https://www.kdnuggets.com/2019/03/odsc-difference-data-scientists-data-engineers.html.
- 6 Favorita dataset. https://www.kaggle.com/c/favorita-grocery-sales-forecasting.
- 7 Greenplum database. https://greenplum.org/.
- 8 Postgresql. https://www.postgresql.org/.
- 9 Python pandas. https://pandas.pydata.org/.
- 10 Queryparser. https://github.com/uber/queryparser.
- 11 Morpheusflow: a case study of learning over joins with tensorflow. Technical report, University of California, San Diego, 2018.
- 12 M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
- 13 M. Abo Khamis, H. Q. Ngo, X. Nguyen, D. Olteanu, and M. Schleich. In-database learning with sparse tensors. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, SIGMOD/PODS ’18, pages 325–340, New York, NY, USA, 2018. ACM.
- 14 M. Aref, B. ten Cate, T. J. Green, B. Kimelfeld, D. Olteanu, E. Pasalic, T. L. Veldhuizen, and G. Washburn. Design and implementation of the logicblox system. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pages 1371–1382, New York, NY, USA, 2015. ACM.
- 15 M. Boehm, M. W. Dusenberry, D. Eriksson, A. V. Evfimievski, F. M. Manshadi, N. Pansare, B. Reinwald, F. R. Reiss, P. Sen, A. C. Surve, and S. Tatikonda. Systemml: Declarative machine learning on spark. Proc. VLDB Endow., 9(13):1425–1436, Sept. 2016.
- 16 M. Boehm, B. Reinwald, D. Hutchison, P. Sen, A. V. Evfimievski, and N. Pansare. On optimizing operator fusion plans for large-scale machine learning in systemml. Proc. VLDB Endow., 11(12):1755–1768, Aug. 2018.
- 17 C. Borraz-Sánchez, D. Klabjan, E. Pasalic, and M. Aref. SolverBlox: Algebraic modeling in Datalog. In M. Kifer and Y. A. Liu, editors, Declarative Logic Programming: Theory, Systems, and Applications. ACM and Morgan & Claypool, 2018. To appear.
- 18 R. Brijder, M. Gyssens, and J. V. den Bussche. On matrices and k-relations. CoRR, abs/1904.03934, 2019.
- 19 L. Chen, A. Kumar, J. Naughton, and J. M. Patel. Towards linear algebra over normalized data. Proc. VLDB Endow., 10(11):1214–1225, Aug. 2017.
- 20 T. Elgamal, S. Luo, M. Boehm, A. V. Evfimievski, S. Tatikonda, B. Reinwald, and P. Sen. SPOOF: sum-product optimization and operator fusion for large-scale machine learning. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8-11, 2017, Online Proceedings, 2017.
- 21 X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a unified architecture for in-rdbms analytics. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12, pages 325–336, New York, NY, USA, 2012. ACM.
- 22 Z. J. Gao, S. Luo, L. L. Perez, and C. Jermaine. The buds language for distributed bayesian machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, pages 961–976, New York, NY, USA, 2017. ACM.
- 23 D. Harrison and D. Rubinfeld. Hedonic Housing Prices and the Demand for Clean Air. Journal of Environmental Economics and Management, 5:81–102, 1978.
- 24 J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The madlib analytics library: Or mad skills, the sql. Proc. VLDB Endow., 5(12):1700–1711, Aug. 2012.
- 25 D. Jankov, S. Luo, B. Yuan, Z. Cai, J. Zou, C. Jermaine, and Z. J. Gao. Declarative recursive computation on an rdbms: Or, why you should use a database for distributed machine learning. Proc. VLDB Endow., 12(7):822–835, Mar. 2019.
- 26 M. A. Khamis, H. Q. Ngo, X. Nguyen, D. Olteanu, and M. Schleich. Ac/dc: In-database learning thunderstruck. In Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, DEEM’18, pages 8:1–8:10, New York, NY, USA, 2018. ACM.
- 27 A. Kumar, J. Naughton, and J. M. Patel. Learning generalized linear models over normalized data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pages 1969–1984, New York, NY, USA, 2015. ACM.
- 28 X. Li, B. Cui, Y. Chen, W. Wu, and C. Zhang. Mlog: Towards declarative in-database machine learning. Proc. VLDB Endow., 10(12):1933–1936, Aug. 2017.
- 29 S. Luo, Z. J. Gao, M. N. Gubanov, L. L. Perez, and C. M. Jermaine. Scalable linear algebra on a relational database system. In 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19-22, 2017, pages 523–534, 2017.
- 30 N. Makrynioti, N. Vasiloglou, E. Pasalic, and V. Vassalos. Modelling machine learning algorithms on relational data with datalog. In Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, DEEM’18, pages 5:1–5:4, New York, NY, USA, 2018. ACM.
- 31 S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. Communications of the ACM, 54:114–123, 2011.
- 32 A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
- 33 S. Rendle. Factorization machines. In Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM ’10, pages 995–1000, Washington, DC, USA, 2010. IEEE Computer Society.
- 34 Y. Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2nd edition, 2003.
- 35 M. Schleich, D. Olteanu, and R. Ciucanu. Learning linear regression models over factorized joins. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, pages 3–18, New York, NY, USA, 2016. ACM.
- 36 M. Schleich, D. Olteanu, M. A. Khamis, H. Ngo, and X. Nguyen. A layered aggregate engine for analytics workloads. In Proceedings of the 2019 ACM SIGMOD International Conference on Management of Data, SIGMOD ’19. ACM, 2019.
- 37 G.-X. Yuan, C.-H. Ho, and C.-J. Lin. An improved glmnet for l1-regularized logistic regression. J. Mach. Learn. Res., 13(1):1999–2030, June 2012.