Sato: Contextual Semantic Type Detection in Tables
Detecting the semantic types of data columns in relational tables is important for various data preparation and information retrieval tasks such as data cleaning, schema matching, data discovery, and semantic search. However, existing detection approaches either perform poorly with dirty data, support only a limited number of semantic types, fail to incorporate the table context of columns, or rely on large sample sizes in the training data. We introduce Sato, a hybrid machine learning model to automatically detect the semantic types of columns in tables, exploiting the signals from the context as well as the column values. Sato combines a deep learning model trained on a large-scale table corpus with topic modeling and structured prediction to achieve macro average and support-weighted F1 scores of 0.901 and 0.973, respectively, exceeding the state-of-the-art performance by a significant margin. We extensively analyze the overall and per-type performance of Sato, discussing how individual modeling components, as well as feature categories, contribute to its performance.
Many data preparation and information retrieval tasks including data cleaning, integration, discovery and search rely on the ability to accurately detect data column types. Automated data cleaning uses transformation and validation rules that depend on data types [2011-wrangler, Raman:2001:PWI:645927.672045]. Schema matching for data integration leverages data types to find correspondences between data columns across tables [rahm2001survey]. Similarly, data discovery benefits from detecting types of data columns in order to return semantically relevant results for user queries [aurum, seeping-semantics]. Recognizing the semantics of table values helps aggregate information from multiple tabular data sources. Search engines also rely on the detection of semantically relevant column names to extend support to tables [venetis2011recovering].
We can consider two categories of types for table columns: atomic and semantic. Atomic types such as boolean, integer, and string provide basic, low-level type information about a column. On the other hand, semantic types such as location, birthDate, and name convey finer-grained, richer information about column values. Detecting semantic types can be a powerful tool, and in many cases may be essential for enhancing the effectiveness of data preparation and analysis systems. In fact, commercial systems such as Google Data Studio [googledatastudio], Microsoft Power BI [powerbi], Tableau [tableau], and Trifacta [trifacta] attempt to detect semantic types, typically using a combination of regular expression matching and dictionary lookup. While reliable for detecting atomic types and simple, well-structured semantic types such as credit card numbers or e-mail addresses, these rule-based approaches are not robust enough to process dirty or missing data, support only a limited variety of types, and fall short for types without strict validations. Column names cannot be relied on to fill this gap either: many tables found in legacy enterprise databases and on the Web have column names that are either unhelpful (cryptic, abbreviated, malformed, etc.) or missing altogether.
In response, recent work [Hulsebos:2019:KDD] introduced Sherlock, a deep learning model for semantic type detection trained on a massive table corpus [hu2019viznet]. Sherlock formulates semantic type detection as a multi-class classification problem where classes correspond to semantic types. It leverages more than 600K real-world table columns for learning with a multi-input feedforward deep neural network, providing state-of-the-art results.
While Sherlock represents a significant leap in applying deep learning to semantic typing, it suffers from two problems. First, it underperforms for types that do not have a sufficiently large number of samples in the training data. Although this is a known issue for deep learning models, it nevertheless restricts Sherlock’s application to underrepresented types, which form a long tail of data types appearing in tables at large. Second, Sherlock uses only the values of a column to predict its type, without considering the column’s context in the table. Predicting the semantic type of a column based solely on the column values, however, is an underdetermined problem in many cases.
Consider the example in Fig. 1, for a column that contains ‘Florence,’ ‘Warsaw,’ ‘London,’ and ‘Braunschweig’ as values; location, city, or birthPlace could all be reasonable semantic types for the column. It can be hard to resolve such ambiguities using only column values because the semantic types also depend on the context of the table. Continuing with the example, it is highly likely that the column’s type would be birthPlace if it came from Table A since the table contains biographical information about influential personalities. However, the same column in Table B would be more likely to have the type city, as the table’s other columns present information about European cities.
In this paper, we introduce Sato (SemAntic Type detection with table cOntext), a hybrid machine learning model that incorporates table contexts to predict the semantic types of table columns. Sato combines topic modeling [Blei:2012:TopicModels] and structured learning [Lafferty:2001:CRF] together with single-column type prediction based on the Sherlock model. Similar to earlier work [Hulsebos:2019:KDD], we consider 78 common semantic types and use the WebTables dataset from the VizNet corpus [hu2019viznet] to train our model. We evaluate Sato through several experiments and show that it achieves macro average and support-weighted F1 scores of 0.901 and 0.973, respectively, substantially outperforming the state-of-the-art. Through a per-type performance analysis, we find that Sato substantially increases the prediction accuracy for the underrepresented semantic types. Overall evaluation results demonstrate that incorporating the table context of a column when detecting its semantic type can help resolve ambiguities, such as those exemplified above, as well as ameliorate the need for large sample sizes for accuracy, improving prediction performance over the long tail of data types.
To facilitate future research and applications, we open source our code, trained model, and online demo powered by Sato at https://github.com/megagonlabs/sato.
2 Problem Formulation
Our goal is to predict semantic types for table columns using their values, without considering the header information. We formulate it as multi-class classification, each class corresponding to a predefined semantic type.
We consider the training data as a set of tables. Let $c_1, c_2, \ldots, c_m$ be the columns of a given table and $t_1, t_2, \ldots, t_m$ be the true semantic types of these columns, where $t_i \in \mathcal{T}$, the set of labels for possible semantic types considered (e.g., city, country, population). Similarly, let $\phi$ be a feature extractor function that takes a single column $c_i$ and returns an $n$-dimensional feature vector $\phi(c_i)$. One approach to semantic typing is to learn a mapping from values of single columns to semantic types. We refer to this model as single-column prediction. The Sherlock [Hulsebos:2019:KDD] model falls into this category.
In Sato, in order to make the best use of table contexts and resolve semantic ambiguity with single-column predictions, we formulate the problem as multi-column prediction. A multi-column prediction model learns a mapping from the entire table (a sequence of columns) to a sequence of semantic types. This formulation allows us to incorporate table context into semantic type prediction in two ways.
First, we use features generated from the entire table as table context. For example, the column values ‘Italy,’ ‘Poland,’ … and ‘380,948,’ ‘1,777,972,’ … are also used to predict the semantic type of the first column in Table B (in Fig. 1). Second, we can jointly predict the semantic types of columns from the same table. Again, for Table B, with joint prediction the predicted types country and population of the neighboring columns would help to make a more accurate prediction for the first column.
Sato is a novel hybrid machine learning model developed to predict the semantic types of columns in tables. It has two modeling components: (1) A topic-aware prediction component that estimates the intent (a global descriptor) of a table using topic modeling and extends the single-column prediction model with an additional topic subnetwork. (2) A structured output prediction model that combines the topic-aware predictions for all columns and performs multi-column joint semantic type prediction. Fig. 2 illustrates the high-level architecture of Sato. We next discuss each Sato component and its implementation in detail.
3.1 Single-column prediction model
As shown in Fig. 2, Sato’s topic-aware module is built on top of a single-column prediction model that uses a deep neural network. We first provide a brief background on deep learning and a description of the single-column model.
Deep learning Deep learning [LeCun:2015:DeepLearning] is a form of representation learning that uses neural networks with multiple layers. Through simple but non-linear transformations of the input representation at each layer, deep learning models can learn representations of the data at varying levels of abstraction that are useful for the problem at hand (e.g., classification, regression). When provided enough training data and computing power, neural networks with multiple layers can be effectively trained with stochastic gradient descent, where the gradient of an objective function with respect to the layer-wise transformation coefficients (i.e., weights) can be computed using the backpropagation algorithm [rumelhart1986learning]. With increased access to large-scale training data and computing power, deep learning models have shown remarkable improvement in the last decade, achieving practical successes in solving long-standing problems in machine learning, ranging from image recognition to language translation.
Deep learning combined with the availability of massive table corpora [webtables, hu2019viznet] presents opportunities to learn from tables in the wild [halevy2009unreasonable]. It also presents opportunities to improve existing approaches to semantic type detection as well as other research problems related to data preparation and information retrieval. Although prior research has used shallow neural networks for related tasks (e.g., [li1994semantic]), it is only more recently that Hulsebos et al. [Hulsebos:2019:KDD] developed Sherlock, a large-scale deep learning model for semantic typing.
Deep learning for single-column prediction Sato builds on single-column prediction by using column-wise features and employs an architecture which allows any single-column prediction model to be used. In our current work, we choose Sherlock as our single-column prediction model due to its recently demonstrated performance.
The column-wise features used in Sato include character embeddings (Char), word embeddings (Word), paragraph embeddings (Para), and column statistics such as mean and standard deviation (Stat). The dimension of the column-wise features from the four groups is 1587 in total.
A multi-layer subnetwork is applied to the column-wise features to compress high-dimensional vectors into compact dense vectors, with the exception of the Stat feature set, which consists of only 27 features. The output of the three subnetworks is concatenated to the statistical features, forming the input to the primary network. After the concatenation of these features, in the primary network two fully-connected layers (ReLU activation) with BatchNorm and Dropout layers are applied before the output layer. The final output layer, which includes a softmax function, generates confidence values (i.e., probabilities) for the 78 semantic types.
3.2 Topic-aware prediction model
The first component of Sato is a topic-aware prediction module that first characterizes a table with a topic vector and then incorporates it into the column-wise prediction by extending the neural network model above with an additional subnetwork to take topic vectors as input. We next discuss how using topic modeling to characterize table semantics can be useful in resolving ambiguities in type detection.
Table semantics Tables are collections of related data entities organized in rows. Every table is created with an intent [venetis2011recovering] in the user’s mind and semantic types of the columns in a table can be considered a meaningful expression (or utterance) of that intent. Each column of the table partially fulfills the intent by describing one attribute of the entities. As illustrated in Fig. 2(a), we assume that the intent of a table determines the semantic types of the columns in the table, which in turn generates the column values, acting as latent variables. We refer to the set of all column values in a table as table values.
Thus, being able to accurately infer the table intent can help to improve the prediction of column semantics. Table captions or titles usually capture table intent. For example, in Fig. 1, Table A intends to provide biographical information about influential personalities in history and Table B talks about geographical information about cities in Europe. However, as with column semantics, a clear and well-structured description of intent is not always available in real-world tables. Therefore we need to estimate table intent without relying on any header or meta information.
Sato estimates a table’s intent by mapping its values onto a low-dimensional space. Each of these dimensions corresponds to a “topic,” describing one aspect of a possible table intent. The final estimation is a distribution over the latent topic dimensions generated using topic modeling approaches. Next, we provide a brief background on topic models and explain how Sato extracts topic vectors from tables and feeds them to topic-aware models.
Topic models Finding the topical composition of textual data is useful for many tasks, such as document summarization or featurization. Topic models [Blei:2012:TopicModels] aim to automatically discover thematic topics in text corpora and discrete data collections in an unsupervised manner. Latent Dirichlet allocation (LDA) [Blei:2003:LDA] is a simple yet powerful generative probabilistic topic model, widely used for quantifying thematic structures in text. LDA represents documents as random mixtures of latent topics and each latent topic as a distribution over words. The main advantage of probabilistic topic models such as LDA over clustering algorithms is that probabilistic topic models can represent a data point (e.g., a document) as a mixture of topics. Although LDA was originally applied to text corpora, many variants have since been developed to discover thematic structures in non-textual data as well (e.g., [blei2003modeling, fei2005bayesian, Yuan:2012:DRD]).
Table intent estimator We use an LDA model to estimate a table’s intent as a topic-vector, treating values of each table as a “document.” As illustrated in Fig. 2(b), we implement the table intent estimator as a pre-trained LDA model. It takes table values as input and outputs a fixed-length vector named “table topic vector” over the topic dimensions. For Sato, we pre-train an LDA model with 400 topic dimensions on public tables that have had their headers and captions removed.
The topics are generated during training according to the data’s semantic structure, so they do not have pre-defined meanings. However, by looking at the representative semantic types associated with each topic, we found some examples with good interpretations. For example, topic #192 is closely associated with the semantic types “origin, nationality, country, continent, and sex” and thus possibly captures aspects of personal information, while topic #264 corresponds to “code, description, create, company, symbol” and can be interpreted as a business-related topic. Detailed topic analysis can be found in Section 5.4.
Learning and prediction Fig. 2(b) shows how topic-aware models take the values in a table topic vector as additional features for both learning and prediction. We augment the single-column neural network model with an additional subnetwork to take topic vectors as input and then append its output before feeding into the primary network. In this way, the topic-aware model will learn not only relationships between the input column and its type but also how the column type correlates to the table-level contextual information.
3.3 Structured prediction model
We have shown that Sato captures table-level context by introducing topic vectors into single-column models. However, the table topic vector is shared by all columns in the table and can be considered as “global context.” Incorporating only global context, the topic-aware model ignores the inferred semantic types of surrounding columns in the same table. In other words, it cannot capture “local context” which is the relationship between semantic types of neighboring columns. We introduce the second component of Sato, a structured prediction model utilizing information from surrounding columns to better capture local context.
Through preliminary analysis, we confirm that certain pairs of semantic types co-occur in tables much more frequently than others. For example, in a WebTables sample, the most frequent pair city and state co-occurs 4 times more often than the tenth most frequent pair name and type (detailed co-occurrence statistics available in Section 4.1). Such inter-column relationships show the value of “local” contextual information from surrounding columns in addition to the “global” table topic. Sato models the relationships between columns through pairwise dependencies in a graphical model and performs table-wise prediction using structured learning techniques.
Next, we provide background for structured output learning using graphical models and explain its application in Sato.
Structured output learning In addition to semantic type detection, many other prediction problems such as named entity extraction, language parsing, speech recognition, and image segmentation have spatial or semantic structures that are inherent to them. Such structures mean that predictions of neighboring instances correlate to one another. Structured learning algorithms [bakir2007predicting], including probabilistic graphical models [koller2009probabilistic] and recurrent neural networks [lstm97hochreiter, rumelhart1986learning], model dependencies among the values of structurally linked variables such as neighboring pixels or words to perform joint predictions. Structured output learning models are widely used in computer vision and natural language processing [Nowozin:2011:StructuredLearningCV, Smith:2011:LinguisticStructurePrediction] for prediction tasks that have structures in output space, instead of applying a multi-class classification for each output variable independently.
A conditional random field (CRF) [Lafferty:2001:CRF] is a discriminative undirected probabilistic graphical model and one of the most popular techniques for structured learning, with successful applications in labeling, parsing, and segmentation problems across domains. As with Markov random fields (MRFs) [geman1986markov, koller2009probabilistic], exact inference for general CRFs is intractable, but there are special structures such as linear chains that allow exact inference. There are also several efficient approximate inference algorithms based on message passing, linear-programming relaxation, and graph cut optimization for CRFs with general graphs [Lafferty:2001:CRF].
Modeling column dependencies Sato uses a linear-chain CRF to explicitly encode the inter-column relationship while still considering features for each column. We encode the output of a column-wise prediction model (i.e., predicted semantic types of the columns) and the combinations of semantic types of columns in the same table as CRF parameters. As shown in Fig. 3(a), in the CRF model, each variable represents the type of a column with corresponding column values as the observed evidence. Variables representing the types of adjacent columns are linked with an edge. Given a sequence of columns $\mathbf{c} = (c_1, \ldots, c_m)$ in a table, the goal is to find the best sequence of semantic types $\mathbf{t} = (t_1, \ldots, t_m)$, which provides the largest conditional probability $P(\mathbf{t} \mid \mathbf{c})$.
The conditional probability can be written as a normalized product of a set of real-valued functions. Following the convention, we refer to these functions in log scale as “potential functions.” The unary potential $\psi_{\text{UNI}}(t_i, c_i)$ captures the likelihood of predicting type $t_i$ based on the content of the corresponding column $c_i$. The pairwise potential $\psi_{\text{PAIR}}(t_i, t_j)$ represents the “coupling degree” between types $t_i$ and $t_j$.
We use a linear-chain CRF, where the conditional distribution is defined by the unary prediction potentials and pairwise potentials between adjacent columns:
$$P(\mathbf{t} \mid \mathbf{c}) = \frac{1}{Z(\mathbf{c})} \exp\left( \sum_{i=1}^{m} \psi_{\text{UNI}}(t_i, c_i) + \sum_{i=1}^{m-1} \psi_{\text{PAIR}}(t_i, t_{i+1}) \right),$$
where $Z(\mathbf{c})$ is an input-dependent normalization function.
Unary potential functions We use unary potentials to model the probability of a semantic type given the column content. In other words, the unary potential of a semantic type for a given column can be considered the probability of that semantic type based on the values of the column. The architecture of Sato supports using estimates of any valid column-wise prediction model as unary potentials. In the current work, we obtain the unary potentials of the semantic types for a given column from the output of our topic-aware prediction model, which uses both table-level topic vector and column features as input.
Pairwise potential functions Pairwise potentials capture the relationship between the semantic types of two columns in the same table. These relationships can be parameterized with a matrix $W \in \mathbb{R}^{|\mathcal{T}| \times |\mathcal{T}|}$, where $\mathcal{T}$ is the set of all possible types and $W_{a,b}$ ($= \psi_{\text{PAIR}}(a, b)$) is a weight parameter for the “coupling degree” of semantic types $a$ and $b$ in adjacent columns. Such a coupling degree can be approximated by the co-occurrence frequency. We expect the pairwise weight of two semantic types to be proportional to their frequency of co-occurrence in adjacent columns. Pairwise potential weights in our CRF model are trainable parameters, updated by gradient descent.
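The co-occurrence statistic that approximates the coupling degree can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `adjacent_cooccurrence`, the sample tables, and the choice to count pairs symmetrically are assumptions made for the example, not details confirmed by the text.

```python
from collections import Counter


def adjacent_cooccurrence(tables):
    """Count how often two semantic types appear in adjacent columns.

    `tables` is a list of type sequences, one per table, in column order.
    Assumption: counts are symmetric, so (a, b) and (b, a) share one entry
    keyed by the alphabetically sorted pair.
    """
    counts = Counter()
    for types in tables:
        for left, right in zip(types, types[1:]):
            counts[tuple(sorted((left, right)))] += 1
    return counts


# Hypothetical type sequences for three small tables.
sample_tables = [
    ["city", "state"],
    ["city", "state", "population"],
    ["name", "age", "weight"],
]
pair_counts = adjacent_cooccurrence(sample_tables)
```

Counts of this kind could serve as a starting point for the pairwise weights, which are then adjusted during training.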
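Learning and prediction We use the following objective function to train a Sato model. The objective function is the log-likelihood of the semantic types $\mathbf{t} = (t_1, \ldots, t_m)$ of the columns $\mathbf{c} = (c_1, \ldots, c_m)$ in the same table, expressed through the unary potentials $\psi_{\text{UNI}}$ and the pairwise potentials $\psi_{\text{PAIR}}$:
$$\log P(\mathbf{t} \mid \mathbf{c}) = \sum_{i=1}^{m} \psi_{\text{UNI}}(t_i, c_i) + \sum_{i=1}^{m-1} \psi_{\text{PAIR}}(t_i, t_{i+1}) - \log Z(\mathbf{c}).$$
Here, the normalization term $Z(\mathbf{c})$ sums over all possible semantic type combinations. To efficiently calculate $Z(\mathbf{c})$, we can use the forward-backward algorithm [rabiner1989tutorial], which uses dynamic programming to cache intermediate values while moving from the first to the last column. After the training phase, as shown in Fig. 3(b), Sato performs holistic type prediction with the learned pairwise potentials and the unary potentials provided by topic-aware prediction. To obtain prediction results, we conduct maximum a posteriori (MAP) inference of semantic types:
$$\hat{\mathbf{t}} = \arg\max_{\mathbf{t}} \log P(\mathbf{t} \mid \mathbf{c}) = \arg\max_{\mathbf{t}} \left( \sum_{i=1}^{m} \psi_{\text{UNI}}(t_i, c_i) + \sum_{i=1}^{m-1} \psi_{\text{PAIR}}(t_i, t_{i+1}) \right).$$
$Z(\mathbf{c})$ does not affect $\hat{\mathbf{t}}$ since it is a constant with respect to $\mathbf{t}$. We then use the Viterbi algorithm [viterbi1967error] to calculate and store partial combinations with the maximum score at each step of the column sequence traversal, avoiding redundant computation.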
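The two dynamic programs mentioned above can be illustrated with a small pure-Python sketch. It assumes potentials are given in log scale as nested lists; the function names, the three example types, and all numeric values are hypothetical and chosen only to show how a pairwise potential can change the decision for an ambiguous column.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(unary, pairwise):
    """Forward recursion computing the log normalizer of a linear chain.

    unary[i][a]    : unary potential of type a for column i (log scale)
    pairwise[a][b] : pairwise potential of adjacent types a, b (log scale)
    """
    n_types = len(unary[0])
    alpha = list(unary[0])
    for i in range(1, len(unary)):
        alpha = [
            unary[i][b]
            + log_sum_exp([alpha[a] + pairwise[a][b] for a in range(n_types)])
            for b in range(n_types)
        ]
    return log_sum_exp(alpha)


def viterbi(unary, pairwise):
    """MAP decoding: best type index per column and its unnormalized score."""
    n_types = len(unary[0])
    score = list(unary[0])
    backpointers = []
    for i in range(1, len(unary)):
        new_score, pointers = [], []
        for b in range(n_types):
            best_a = max(range(n_types), key=lambda a: score[a] + pairwise[a][b])
            pointers.append(best_a)
            new_score.append(score[best_a] + pairwise[best_a][b] + unary[i][b])
        score = new_score
        backpointers.append(pointers)
    last = max(range(n_types), key=lambda a: score[a])
    best_score = score[last]
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best_score


# Hypothetical example with types 0 = city, 1 = birthPlace, 2 = country.
# Column 0 is ambiguous (birthPlace slightly ahead); column 1 is clearly country.
unary = [
    [math.log(0.45), math.log(0.50), math.log(0.05)],
    [math.log(0.05), math.log(0.05), math.log(0.90)],
]
# Only the (city, country) pair receives a positive coupling weight.
pairwise = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
path, best_score = viterbi(unary, pairwise)  # path -> [0, 2]
log_z = log_partition(unary, pairwise)
```

Taken alone, column 0 would be labeled birthPlace, but the coupling between city and country makes the joint decoding prefer city. Subtracting `log_z` from `best_score` gives the log-probability of the decoded sequence under this toy model.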
We compare Sato and its two basic variants obtained by ablation with the state-of-the-art semantic type prediction model, Sherlock [Hulsebos:2019:KDD]. As demonstrated in [Hulsebos:2019:KDD], Sherlock’s deep learning approach clearly outperforms matching-based algorithms, decision-tree-based semantic typing, and human annotation. Therefore, we omit these comparisons in our evaluation and directly compare against the Sherlock model implemented as the Base method.
We evaluate the effectiveness of the proposed models on the WebTables corpus from VizNet [hu2019viznet] and restrict ourselves to the relational web tables with valid headers that appear in the 78 semantic types. To avoid filtering out columns with slight variation in capitalization and representation, we convert all column headers to a “canonical form” before matching. The canonicalization process starts with trimming content in parentheses. We then convert strings to lower case, capitalize words except for the first (if there is more than one word), and concatenate the results into a single string. For example, strings ‘YEAR,’ ‘Year’ and ‘year (first occurrence)’ will all have canonical form ‘year,’ and ‘birth place (country)’ will be converted to ‘birthPlace.’
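The canonicalization steps described above can be sketched as follows. The function name `canonicalize` and the exact regular expression are assumptions made for illustration; the steps follow the description, but details such as handling of nested parentheses are not specified in the text.

```python
import re


def canonicalize(header: str) -> str:
    """Map a raw column header to a camelCase canonical form."""
    # 1. Trim any content in parentheses, e.g. "year (first occurrence)".
    text = re.sub(r"\([^)]*\)", " ", header)
    # 2. Convert to lower case and split into words.
    words = text.lower().split()
    if not words:
        return ""
    # 3. Capitalize every word except the first, then concatenate.
    return words[0] + "".join(word.capitalize() for word in words[1:])


examples = {
    raw: canonicalize(raw)
    for raw in ["YEAR", "Year", "year (first occurrence)", "birth place (country)"]
}
```

With this sketch, the first three headers map to `year` and the last to `birthPlace`, matching the examples in the text.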
Since we formulate semantic typing as a multi-column type detection problem, we extract 80K tables, instead of columns, from a subset of the WebTables corpus as our dataset $D$. The column headers in their canonical forms act as the ground-truth labels for semantic types. To help evaluate the importance of incorporating table semantics, we also create a filtered version $D_{\text{mult}}$ with 33K tables. We filter out singleton tables (those containing only one column) since they lack context as defined in this paper. We then randomly split each dataset into a training set (80%) and a held-out set (20%) that is used for evaluation.
Figure 5 shows the count of each semantic type in the full dataset $D$. The distribution is clearly unbalanced with a long tail. Single-column models tend to perform poorly on the less common types that comprise the long tail. By effectively incorporating context, Sato significantly improves prediction accuracy for those types.
To better understand relationships between the semantic types of columns in the same table, we conduct a preliminary analysis on the co-occurrence patterns of types. Figure 6, shown in log-scale for readability, reports the frequencies of selected pairs of semantic types occurring in the same table. Most frequently co-occurring pairs include (city, state), (age, weight), (age, name), (code, description).
4.2 Feature extraction
We use the public Sherlock feature extractors (https://github.com/mitmedialab/sherlock-project) to extract the four groups of base features, Char, Word, Para and Stat, for each column in a table. In order to provide a fair comparison, these base features were used by both baseline methods and proposed methods in the experiments. To generate table topics as introduced in Section 3.2, we train an LDA model that captures the mapping from table values to the latent topic dimensions. Since LDA is an unsupervised model, we only need the vocabulary of the tables without requiring any headers or semantic annotation. We convert numerical values into strings and then concatenate all values in the table sequentially to form a “document” for each table. Using the gensim [rehurek_lrec] library, we train an LDA model with 400 topics on a separate dataset of 10K tables. With the pre-trained LDA, we can extract topic vectors for tables using values from the entire table as input. Every table has a single topic vector, shared across columns.
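The step that turns a table into a “document” can be sketched in plain Python. The function name `table_to_document`, the column-by-column traversal order, whitespace tokenization, and skipping of missing cells are all assumptions for illustration; the text only states that numerical values are converted to strings and that all values are concatenated sequentially.

```python
def table_to_document(columns):
    """Flatten a table into a list of string tokens (one "document").

    `columns` is a list of columns, each a list of cell values.
    Assumptions: cells are visited column by column, missing cells (None)
    are skipped, and multi-word cells are split on whitespace.
    """
    tokens = []
    for column in columns:
        for value in column:
            if value is None:
                continue
            # str() converts numerical values into strings.
            tokens.extend(str(value).split())
    return tokens


# Hypothetical fragment resembling Table B in Fig. 1.
table_b = [
    ["Florence", "Warsaw"],
    ["Italy", "Poland"],
    [380948, 1777972],
]
document = table_to_document(table_b)
```

Token lists of this form are the usual input for building a bag-of-words corpus, which a topic model such as LDA can then be trained on.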
4.3 Model implementation
We implement the multi-input neural network introduced in [Hulsebos:2019:KDD] using PyTorch [paszke2017automatic] as the Base single-column model. Throughout the experiments discussed here, we train the Base neural network model for 100 epochs using the Adam optimizer with weight decay.
For topic-aware prediction in Sato, the table topic features go through a separate subnetwork with an architecture identical to the subnetworks of the Base feature groups. Before going into the primary network, the outputs of all four subnetworks are concatenated with Stat to form a single vector.
We train Sato’s CRF layer for 15 epochs with a batch size of 10 tables, using the Adam optimizer. We initialize the pairwise potential parameters of the CRF model with the column co-occurrence matrix calculated from a held-out set of the WebTables corpus. We set the CRF unary potentials for columns to be their normalized topic-aware prediction scores.
4.4 Evaluation metrics
We measure the prediction performance on each target semantic type by calculating the $F_1$ score, the harmonic mean of precision and recall. Since the semantic type distribution is not uniform, we report two types of average performance: the support-weighted $F_1$ and the macro average $F_1$ scores. The support-weighted $F_1$ score is the average of per-type $F_1$ values weighted by support (the number of samples in the test set for the respective type) and reflects the overall performance. The macro average $F_1$ score is the unweighted average of the per-type $F_1$ scores, treating all types equally, and is therefore more sensitive to types with small sample sizes compared to support-weighted $F_1$.
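The difference between the two averages can be made concrete with a short sketch. The function names and the toy labels are hypothetical; as a simplifying assumption, only types that occur in the ground truth are averaged.

```python
from collections import Counter


def per_type_f1(y_true, y_pred):
    """Return {type: (f1, support)} for every type present in y_true."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    support = Counter(y_true)
    # F1 = 2TP / (2TP + FP + FN); the denominator is positive when support > 0.
    return {
        t: (2 * tp[t] / (2 * tp[t] + fp[t] + fn[t]), support[t]) for t in support
    }


def macro_and_weighted_f1(y_true, y_pred):
    """Return (macro average F1, support-weighted F1)."""
    scores = per_type_f1(y_true, y_pred)
    macro = sum(f1 for f1, _ in scores.values()) / len(scores)
    total = sum(n for _, n in scores.values())
    weighted = sum(f1 * n for f1, n in scores.values()) / total
    return macro, weighted


# Toy example: a frequent type predicted well, a rare type predicted poorly.
y_true = ["city"] * 8 + ["sales"] * 2
y_pred = ["city"] * 8 + ["city", "sales"]
macro_f1, weighted_f1 = macro_and_weighted_f1(y_true, y_pred)
```

In this toy case the rare type drags the macro average below the support-weighted average, which is why the macro score is the more informative of the two for the long tail of types.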
|Model||Multi-column tables: macro average F1||Multi-column tables: support-weighted F1||All tables: macro average F1||All tables: support-weighted F1|
|Base||0.752 (+0.0%)||0.932 (+0.0%)||0.692 (+0.0%)||0.867 (+0.0%)|
|Sato||0.901 (+14.9%)||0.973 (+4.1%)||0.783 (+9.1%)||0.908 (+4.1%)|
|Sato, topic-aware prediction only||0.865 (+11.3%)||0.956 (+2.4%)||0.768 (+7.6%)||0.897 (+3.0%)|
|Sato, structured prediction only||0.828 (+7.6%)||0.959 (+2.7%)||0.717 (+2.5%)||0.885 (+1.8%)|
Table 1 reports improvements of the Sato variants over the Base method on both the multi-column dataset $D_{\text{mult}}$, which includes only tables with more than one column, and the complete dataset $D$. On multi-column tables, Sato improves the macro average $F_1$ score by 14.9% (0.752 to 0.901) and the support-weighted $F_1$ score by 4.1% (0.932 to 0.973) compared to the single-column Base; these figures are absolute differences in $F_1$. When evaluated on all tables, we still see a 9.1% improvement on the macro average $F_1$ score and a 4.1% improvement on support-weighted $F_1$, although these scores are diluted by the inclusion of tables without valid table context. The results confirm that Sato can effectively improve the accuracy of semantic type prediction by incorporating contextual information embedded in table semantics.
We also evaluate the variants of Sato with single components: the topic-only variant performs only topic-aware prediction using table values, and the structure-only variant conducts structured prediction using Base output as unary potential without considering table topic features. As shown in Table 1, both variants provide improvements over the Base model but are outperformed by the combined effort in Sato. The results indicate that the structured prediction model and the topic-aware prediction model make use of different pieces of table context information for semantic type detection.
We note that there are always larger improvements on macro average $F_1$ scores than on support-weighted $F_1$ scores, suggesting that a significant amount of Sato’s improvement comes from boosting accuracy for the less represented types. To better understand the influence of the techniques used in Sato, we next perform a per-type evaluation for both Sato components on multi-column tables.
5.1 Topic-aware prediction
Fig. 7 shows the per-type comparison of $F_1$ scores between models with or without the topic-aware prediction component. More specifically, Fig. 7(a) compares the full Sato against Sato without table values (i.e., the structure-only variant), and Fig. 7(b) compares the topic-only variant (only the topic-aware model) against Base. Including information in table values improved 59 out of 78 semantic types for the structure-only variant, with 9 types getting equal and 10 types getting worse performance. Similarly, the topic-only variant improves on Base for 64 types and performs worse for 11 types. The prediction performance stays unchanged for 3 types.
We also see significant improvements in the previously “hard” semantic types with small support size. The types with the highest accuracy increases, affiliate, director, person, ranking, and sales, all come from the fifteen least represented types as shown in Fig. 5. This shows incorporating table values effectively alleviates the problem of lacking training data for the rare types.
In terms of averages on multi-column tables (Table 1), with topic-aware prediction, Base is improved by 11.3% in macro average $F_1$ and 2.4% in support-weighted $F_1$, and the structure-only variant is improved by 7.3% in macro average $F_1$ and 1.4% in support-weighted $F_1$. Since macro average $F_1$ is known to be less biased towards large classes, the differences between metrics again demonstrate that a significant portion of the improvement comes from boosting accuracy on the rare types. Overall, we confirm that incorporating table vocabulary improves the semantic type detection performance with or without structured prediction.
5.2 Structured prediction
To evaluate the contribution of structured prediction, we compare Sato with its variant without structured prediction (Fig. 8(a)). Similarly, we compare the performance of Sato without topic features (structured prediction applied directly to the Base output) with that of Base (Fig. 8(b)).
With structured prediction, Base improves on 50 types and the topic-aware-only variant (Sato without structured prediction) improves on 59 types. For a subset of rare types (e.g., depth, sales, affiliate), the prediction accuracy improves dramatically, while for others (e.g., person, director, ranking) there is no noticeable improvement of the kind seen with topic-aware prediction. This shows that structured prediction is less effective than topic-aware prediction at boosting the accuracy of rare types. At the same time, however, both the number of types whose accuracy gets worse (4 and 5, respectively) and the drop in scores for those types are smaller with structured prediction than with topic-aware prediction. Enforcing table-level context can sometimes be too aggressive, leading to worse performance for certain types. By modeling relationships between the inferred types of surrounding columns, the structured prediction module in Sato “salvages” some of these overly aggressive predictions. We conduct a qualitative analysis in Section 5.6 to look further into this effect.
With structured table-wise prediction, both Base and Sato without structured prediction improve in macro average and in support-weighted scores (see Table 1). From the results, we conclude that multi-column predictions from the structured prediction model, with or without topic modeling, outperform the single-column predictions of the Base model.
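To make the interplay between single-column scores and inter-column dependencies concrete, the sketch below decodes the best type sequence of a linear-chain model with the Viterbi algorithm. It is a minimal illustration rather than Sato's implementation: the function name, the dictionary-based data layout, and all scores are assumptions, with the type names borrowed from the qualitative example in Section 5.6.

```python
def viterbi(unary, pairwise):
    """Return the highest-scoring type sequence for one table.

    unary:    list with one {type: score} dict per column (single-column scores).
    pairwise: {(previous_type, current_type): score}; missing pairs score 0.
    """
    types = list(unary[0])
    best = {t: unary[0][t] for t in types}
    backpointers = []
    for scores in unary[1:]:
        new_best, pointers = {}, {}
        for cur in types:
            prev = max(types, key=lambda p: best[p] + pairwise.get((p, cur), 0.0))
            new_best[cur] = best[prev] + pairwise.get((prev, cur), 0.0) + scores[cur]
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    path = [max(types, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for a three-column table.
ALL_TYPES = ["symbol", "company", "name", "sales", "duration"]


def scores(**given):
    """Build a unary score dict, defaulting unspecified types to 0."""
    return {t: given.get(t, 0.0) for t in ALL_TYPES}


unary = [scores(symbol=2.0), scores(name=1.0, company=0.8), scores(duration=1.0, sales=0.8)]
pairwise = {("symbol", "company"): 1.0, ("company", "sales"): 1.0}

independent = [max(u, key=u.get) for u in unary]
structured = viterbi(unary, pairwise)
print(independent, structured)
```

Taken column by column, the second and third columns are labeled name and duration. Once the pairwise scores are included, the jointly best sequence becomes symbol, company, sales.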
To get a preliminary understanding of how sensitive Sato is to the initialization method used for the pairwise potential parameters in the CRF layer, we also compare the co-occurrence matrix initialization method with a random initialization. We find that both initialization methods converge to the same result, though the co-occurrence matrix initialization performs better in the first few epochs of learning.
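One way such a co-occurrence-based initialization could look is sketched below. The text does not spell out the exact procedure, so the details are assumptions: counting every pair of column types that appear in the same training table, making the matrix symmetric, and applying a log(1 + count) transform.

```python
import math
from itertools import combinations


def cooccurrence_init(tables, types):
    """Build initial pairwise weights from same-table type co-occurrence counts.

    tables: list of tables, each given as the list of its column types.
    types:  ordered list of all semantic types (defines the matrix indices).
    """
    index = {t: i for i, t in enumerate(types)}
    counts = [[0] * len(types) for _ in types]
    for column_types in tables:
        for a, b in combinations(column_types, 2):
            counts[index[a]][index[b]] += 1
            counts[index[b]][index[a]] += 1
    # Assumed smoothing: log(1 + count) keeps unseen pairs at weight 0.
    return [[math.log1p(c) for c in row] for row in counts]


# Hypothetical training tables.
types = ["symbol", "company", "sales", "name", "age"]
tables = [["symbol", "company", "sales"], ["company", "sales"], ["name", "age"]]
weights = cooccurrence_init(tables, types)
```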
5.3 Feature importance
To better understand the influence of the different feature groups, we perform permutation importance [altmann2010permutation] analysis on Base and Sato variants. For each fitted model and a specific feature group, we shuffle the input tables by swapping only the features in that group with those of randomly selected tables. Such a feature mismatch leads to less accurate predictions and worse overall performance. Shuffling crucial features breaks the strong relationships between input and output, leading to a significant drop in accuracy. We take the average of the normalized drop in scores over five random trials as the feature importance measurement.
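The procedure can be sketched as follows. This is a simplified illustration under stated assumptions: each example is a dictionary of feature groups, `predict` and `score` are hypothetical callables standing in for a fitted model and an evaluation metric, and the baseline score is assumed to be positive so that the drop can be normalized.

```python
import random


def group_permutation_importance(predict, score, examples, labels, group,
                                 trials=5, seed=0):
    """Average normalized score drop after shuffling one feature group.

    examples: list of {group_name: features} dicts, one per example.
    predict:  callable mapping a list of such dicts to predicted labels.
    score:    callable (labels, predictions) -> float, higher is better.
    """
    rng = random.Random(seed)
    baseline = score(labels, predict(examples))  # assumed to be > 0
    drops = []
    for _ in range(trials):
        donors = [e[group] for e in examples]
        rng.shuffle(donors)  # reassign this group's features across examples
        shuffled = [{**e, group: d} for e, d in zip(examples, donors)]
        drops.append((baseline - score(labels, predict(shuffled))) / baseline)
    return sum(drops) / trials


# Toy setting: the hypothetical model looks only at the "word" group.
examples = [{"word": i, "char": 0} for i in range(6)]
labels = list(range(6))


def predict(rows):
    return [row["word"] for row in rows]


def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)


char_importance = group_permutation_importance(predict, accuracy, examples, labels, "char")
word_importance = group_permutation_importance(predict, accuracy, examples, labels, "word")
```

Shuffling a group the model ignores leaves the score unchanged, so its importance is zero, while shuffling a group the model relies on can only lower the score.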
Fig. 10 shows that for both the Base model and Sato without topic features, the Word and Char feature groups are the most important feature groups. This matches the conclusions in [Hulsebos:2019:KDD]. When table vocabulary is considered, the additional Topic feature group has comparable or greater importance than Word and Char. The effect is more pronounced with respect to the macro average metric, confirming that table value information helps, especially on less-represented types.
5.4 Topic interpretation
We conduct a qualitative analysis of the LDA model to investigate how the model captures semantics from each table and provides contextual information to Sato. To obtain the topic distribution of each semantic type, we average the topic distributions of the tables that contain that semantic type. For each topic, we choose the semantic types with the highest probability for that topic as its representative semantic types.
We find that some topics have “flat” distributions, where most semantic types have almost the same probabilities. Since these topics are not very useful for classifying semantic types, we compute a saliency score for each topic and sort the topics by their saliency. Our saliency score averages the probabilities of the top-ranked semantic types for each topic.
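The two steps described above, averaging table topic vectors per semantic type and scoring each topic by its top-ranked types, can be sketched as follows. The function names, the cutoff `k`, and the two-topic toy data are assumptions made for illustration.

```python
def type_topic_distributions(tables):
    """Average the topic vectors of all tables that contain each semantic type.

    tables: list of (topic_vector, set_of_semantic_types) pairs.
    Returns {semantic_type: mean topic vector}.
    """
    sums, counts = {}, {}
    for topics, types in tables:
        for t in types:
            acc = sums.setdefault(t, [0.0] * len(topics))
            for i, p in enumerate(topics):
                acc[i] += p
            counts[t] = counts.get(t, 0) + 1
    return {t: [v / counts[t] for v in acc] for t, acc in sums.items()}


def topic_saliency(type_topics, topic, k):
    """Mean probability of `topic` over its k highest-probability semantic types."""
    top = sorted((vec[topic] for vec in type_topics.values()), reverse=True)[:k]
    return sum(top) / len(top)


# Hypothetical two-topic model and three tables.
tables = [
    ([0.8, 0.2], {"country", "sex"}),
    ([0.6, 0.4], {"country"}),
    ([0.1, 0.9], {"code"}),
]
type_topics = type_topic_distributions(tables)
saliency = [topic_saliency(type_topics, topic, k=2) for topic in range(2)]
```

A topic whose probability mass is spread evenly over all types receives a low score under this measure, which is what allows the "flat" topics to be sorted to the bottom.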
Table 2 shows the top-5 salient topics and their representative semantic types. Following the standard approach in topic model analysis [Blei:2012:TopicModels, Blei:2003:LDA], we manually devise an interpretation for each topic. For example, topic dimensions #192 and #99 are activated by personal information in table values, whereas #264 is closely related to business tables. These examples demonstrate that the semantic space learned using LDA can capture intent information from tables.
| Topic ID | Top-5 semantic types | Interpretation |
|---|---|---|
| 192 | origin, nationality, country, continent, sex | person |
| 99 | affiliate, class, person, notes, language | person |
| 232 | industry, format, notes, genre, type | product, movie, song |
| 394 | religion, family, address, teamName, publisher | person, book |
| 264 | code, description, creator, company, symbol | business |
5.5 Column embeddings (Col2Vec)
To verify how the table intent features help the Sato model capture the table semantics, we analyze and compare the embedding vectors from the final layer of the Sato model and the baseline Sherlock model as column embeddings. As described above, we can consider these embeddings as column embeddings since the final layer combines input signals to compose semantic representations. For comparison, we used the final layer of the single-column prediction model of Sato, before the CRF layer. Therefore, we assume that the Table Intent features account for the difference in the embeddings.
Following prior examples (e.g., [Zeiler:2014:Visualizing]), we analyze column embeddings of the test columns used in the experiments. We use t-SNE [vanDerMaaten:2008:tSNE] to reduce the dimensionality of the embedding vectors to two and then visualize them using a two-dimensional scatterplot. To embed the vectors of the two methods in a common space, we fit a single t-SNE model to all data points. We then visualize the major semantic types that are related to organizations (affiliate, teamName, family, and manufacturer) to investigate whether the Sato model with the Table Intent features can appropriately distinguish columns of those ambiguous semantic types.
Fig. 11 shows the visualization of the embedding vectors of Sato and Sherlock. With Sherlock, the column embeddings of each semantic type partially form a cluster, but the clusters overlap more than those of the column embeddings from Sato. In Fig. 11(a), we observe a clearer separation between the organization-related semantic types with little perturbation. From the results, we qualitatively confirm that topic-aware prediction helps Sato distinguish semantically similar semantic types by capturing the table context of an input table. Note that these column embeddings are from the test set, and no label information from these columns was used to obtain them. Thus, we can also confirm that Sato appropriately generalizes and learns column embeddings for these semantic types.
5.6 Qualitative analysis
To better understand how structured prediction further helps Sato in the presence of topic-aware predictions, we conducted a qualitative analysis by identifying examples where table-wise prediction “salvages” bad column-wise predictions (i.e., those of Base and of Sato without structured prediction).
Table 3(a) shows a selected set of example tables from the test sets where the incorrect predictions of the Base model are corrected by applying structured prediction using our trained CRF layer. For example, in table #4575, the columns company and sales were wrongly predicted as name and duration by the single-column Base model. By modeling inter-column dependencies, Sato without topic features correctly predicts the types company and sales, which tend to co-occur with the surrounding column types symbol and isbn in tables about books and magazines.
Table 3(b) shows examples where Sato without structured prediction made incorrect predictions using table values that were subsequently corrected by structured prediction (i.e., by the full Sato model). Table #4369 and table #4531 are examples where location-related vocabulary in the tables had a large impact: the topic-aware model produced overly aggressive predictions with multiple location columns, whereas Sato, with the additional structured inference step, successfully corrected one of the columns.
Furthermore, by taking surrounding types into consideration, structured prediction effectively improves performance for numerical columns such as duration/sales in table #4575, age/weight in table #3865, and code/weight in table #410.
Using learned representations
Sato’s single-column prediction module, based on Sherlock, incorporates four categories of features that characterize different aspects of column values, amassing more than 1.5K feature values per column. The Sherlock authors note that feature extraction is necessary for the reported performance. However, the availability of large-scale table corpora presents a unique opportunity to develop pre-trained representation models and eschew manual feature extraction.
To test the viability of using representation models, we fine-tuned the BERT model [devlin2019bert], a state-of-the-art model for language representation, for our semantic type detection task. Models based on fine-tuning BERT have recently improved prior art on several NLP benchmarks without manual featurization [devlin2019bert, liu2019fine, liu2019roberta]. We trained the BERT model using the default BERT parameters, achieving a support-weighted F1 score of 0.866, which is slightly better than the 0.852 achieved by the Sherlock model. This result is promising because a “featurization-free” method with default parameters is able to achieve a prediction accuracy comparable to that of Sherlock. However, our multi-column prediction still outperforms the BERT model by a large margin, indicating the importance of incorporating table context into column type prediction. A promising avenue of future research is to combine our multi-column model with BERT-like pre-trained learned representation models.
Exploiting type hierarchy through ontology
In this paper, semantic types are defined in a flat structure with no hierarchy among them. In fact, some semantic types have a parent-child relationship. For example, location can be the parent class of country and city. The benefits of incorporating such prior knowledge into the model are (1) effectively increasing the training data, since training data for a semantic type can be converted into training data for its parent semantic type (e.g., city can be considered as location), and (2) adequate modeling of the relationships between columns. Therefore, we could expect that incorporating prior knowledge would improve prediction accuracy, especially for the semantic types for which little training data is available.
However, we consider that the table-wise structured prediction of Sato appropriately models the pairwise relationship between the predictions for two columns in the same table, and thus it can also leverage information from the other columns to predict the semantic type of a target column. From the experimental results, we empirically confirm that our table vocabulary features and table-wise structured prediction improve the performance of semantic type prediction, especially for types that have relatively little training data.
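As a rough illustration of how a type hierarchy could enlarge the training data for parent types, the sketch below relabels child-type examples with their parent type. The hierarchy reuses the location, country, and city example from the discussion above. The function name and the data layout are hypothetical, and this augmentation is not part of Sato.

```python
# Hypothetical parent-child hierarchy over semantic types.
PARENT = {"country": "location", "city": "location"}


def roll_up(examples, parent=PARENT):
    """Return the original (values, type) examples plus a parent-labeled copy
    of every example whose type has a parent in the hierarchy."""
    augmented = list(examples)
    for values, label in examples:
        if label in parent:
            augmented.append((values, parent[label]))
    return augmented


examples = [
    (["Tokyo", "Paris"], "city"),
    (["Japan", "France"], "country"),
    (["Alice", "Bob"], "name"),
]
augmented = roll_up(examples)
location_examples = [values for values, label in augmented if label == "location"]
```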
7 Related Work
Sato builds on prior machine learning approaches to semantic type detection. It is also related to existing systems and research that perform semantic type detection using regular expression matching, dictionary lookup, ontologies, statistical similarity, and ensembles of expert detectors.
Regular expression and dictionary lookup. Semantic type detection enhances the functionality of commercial data preparation and analysis systems such as Microsoft Power BI [powerbi], Trifacta [trifacta], and Google Data Studio [googledatastudio]. These commercial tools typically rely on manually defined rule-based approaches, such as regular expression patterns and dictionary lookups, to detect semantic types. For instance, Trifacta detects around 10 types (e.g., gender and zip code) and Power BI only supports time-related semantic types (e.g., date/time and duration). Open source libraries such as messytables [messytables] and csvkit [csvkit] similarly use heuristics to detect a limited set of types.
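A rule-based detector of this kind might look like the following sketch. The patterns, the dictionary, and the 90% match threshold are illustrative assumptions and do not reproduce the rules of any of the tools named above.

```python
import re

# Hypothetical rules: two regular expressions and one dictionary.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
US_ZIP = re.compile(r"^\d{5}(-\d{4})?$")
GENDER_TERMS = {"male", "female", "m", "f"}

RULES = [
    ("email", lambda v: bool(EMAIL.match(v))),
    ("zip code", lambda v: bool(US_ZIP.match(v))),
    ("gender", lambda v: v.lower() in GENDER_TERMS),
]


def detect_type(values, threshold=0.9):
    """Return the first type whose rule matches at least `threshold`
    of the non-empty values, or None if no rule qualifies."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return None
    for name, rule in RULES:
        if sum(rule(v) for v in cleaned) / len(cleaned) >= threshold:
            return name
    return None


print(detect_type(["94105", "10001-1234"]))      # clean zip codes
print(detect_type(["94105", "n/a", "unknown"]))  # dirty column, no rule qualifies
```

The second call shows the brittleness discussed in this section: a few placeholder values are enough to push a column below the match threshold.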
Ontology-based. Prior research, with roots in the semantic web and schema matching literature, provides alternative approaches to semantic type detection. One body of work leverages existing data on the web, such as WebTables [webtables], and ontologies (or knowledge bases) such as DBPedia [dbpedia], Wikitology [syed2010exploiting], and Freebase [freebase]. Venetis et al. [venetis2011recovering] construct a database of value-type mappings, then assign types using a maximum likelihood estimator based on column values. Syed et al. [syed2010exploiting] use column headers and values to build a Wikitology query, the result of which maps columns to types.
Statistical similarity. Several earlier approaches rely on statistical similarity or other measures of data similarity to match columns with types. Ramnandan et al. [ramnandan2015assigning] first separate numerical and textual column types, then compare column values to those with labels from a dataset using the Kolmogorov-Smirnov (K-S) test and Term Frequency-Inverse Document Frequency (TF-IDF), respectively. Pham et al. [pham2016semantic] use additional features and tests, including the Mann-Whitney test for numerical data and Jaccard similarity for textual data, to train logistic regression and random forest models.
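Two of the similarity measures named above are simple enough to sketch directly. The code below computes the two-sample K-S statistic for numerical columns and the Jaccard similarity for textual columns. It is a generic illustration of these measures with hypothetical column values, not a reconstruction of the cited systems.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of values in the sample that are <= x.
        return bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))


def jaccard(values_a, values_b):
    """Jaccard similarity between the sets of distinct values of two columns."""
    a, b = set(values_a), set(values_b)
    if not a and not b:
        return 0.0  # assumed convention: two empty columns give no evidence
    return len(a & b) / len(a | b)


# Hypothetical columns compared against labeled reference columns.
same_numeric = ks_statistic([1, 2, 3], [1, 2, 3])
far_numeric = ks_statistic([1, 2, 3], [10, 11, 12])
text_overlap = jaccard(["Japan", "France"], ["France", "Brazil"])
```

A small K-S statistic or a large Jaccard similarity suggests that the unlabeled column resembles the labeled reference column and may share its type.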
Synthesized. Puranik [puranik] proposes combining the predictions of “experts,” including regular expressions, dictionaries, and machine learning models. More recently, Yan and He [yan2018synthesizing] introduced a system that, given a search keyword and a set of positive examples, synthesizes type detection logic from open source GitHub repositories. This system provides a novel approach to leveraging domain-specific heuristics for parsing, validating, and transforming semantic data types.
Learned. Another line of prior work employs machine learning, including probabilistic graphical models. Goel et al. [goel2012exploiting] use CRFs to predict the semantic type of each value within a column, then combine these predictions into a prediction for the whole column. Limaye et al. [limaye2010annotating] use probabilistic graphical models to annotate values with entities, columns with types, and column pairs with relationships. These predictions simultaneously maximize a potential function using a message passing algorithm. Takeoka et al. [Takeoka:2019:Meimei] extend this approach with multi-label classifiers to support additional types, including numerical data types, and improve its predictive performance. Similar to these earlier approaches, Sato also uses a probabilistic graphical model, a linear-chain CRF, for structured prediction. Unlike these earlier approaches, Sato uses the CRF model to combine topic-aware predictions of a large-scale deep learning model, leveraging massive table corpora available in the wild to significantly improve performance. Note that many automated semantic matching and integration approaches (e.g., [doan2001reconciling, doan2003learning, li1994semantic, rahm2001survey]) sidestep explicit labeling and directly compare tables while trying to capture the semantics of tables with learned models.
Although prior research has used shallow neural networks for related tasks (e.g., [li1994semantic]), Sherlock [Hulsebos:2019:KDD] is the first deep learning model directly applied to semantic type detection for table columns. Trained on a large number of columns, Sherlock uses a multi-input neural network to make type predictions based on features of column values. Sato builds on Sherlock and addresses its two related drawbacks: the low prediction accuracy for underrepresented types and the lack of consideration for table context in prediction. We compare the performance of Sato against Sherlock through several experiments.
Automated semantic typing is becoming ever more important with the rapid increase in the need to fuse information from multiple, often heterogeneous, large-scale data sources. The semantics of a table column (or any other data source, for that matter) are embodied by its context as well as its data values. Here, we introduce Sato to automatically detect the semantic types of table columns, leveraging the signals from the table context of columns as well as the data values of columns. Sato combines the power of large-scale deep learning with structured prediction and topic modeling to achieve a prediction performance that significantly exceeds the state of the art. Through ablation and permutation experiments, we evaluate Sato extensively and show how individual modeling choices as well as feature types contribute to the performance. In order to facilitate future applications and extended research, we publicly release our trained model and source code for training, along with an interactive web application demonstrating Sato’s use, at https://github.com/megagonlabs/sato.
We thank Jonathan Engel for suggesting the name Sato and his proofreading help. We also thank Kevin Hu for his help in making the Sherlock source code accessible.