Sato: Contextual Semantic Type Detection in Tables

Dan Zhang (work done during internship at Megagon Labs)
UMASS Amherst
Yoshihiko Suhara
Megagon Labs
Jinfeng Li
Megagon Labs
dzhang@cs.umass.edu yoshi@megagon.ai jinfeng@megagon.ai
   Madelon Hulsebos
KPN
Çağatay Demiralp
Megagon Labs
Wang-Chiew Tan
Megagon Labs
mmhulsebos@gmail.com cagatay@megagon.ai wangchiew@megagon.ai
Abstract

Detecting the semantic types of data columns in relational tables is important for various data preparation and information retrieval tasks such as data cleaning, schema matching, data discovery, and semantic search. However, existing detection approaches either perform poorly with dirty data, support only a limited number of semantic types, fail to incorporate the table context of columns, or rely on large sample sizes in the training data. We introduce Sato, a hybrid machine learning model to automatically detect the semantic types of columns in tables, exploiting the signals from the context as well as the column values. Sato combines a deep learning model trained on a large-scale table corpus with topic modeling and structured prediction to achieve macro average and support-weighted F1 scores of 0.901 and 0.973, respectively, exceeding the state-of-the-art performance by a significant margin. We extensively analyze the overall and per-type performance of Sato, discussing how individual modeling components, as well as feature categories, contribute to its performance.

1 Introduction

Many data preparation and information retrieval tasks including data cleaning, integration, discovery and search rely on the ability to accurately detect data column types. Automated data cleaning uses transformation and validation rules that depend on data types [2011-wrangler, Raman:2001:PWI:645927.672045]. Schema matching for data integration leverages data types to find correspondences between data columns across tables [rahm2001survey]. Similarly, data discovery benefits from detecting types of data columns in order to return semantically relevant results for user queries [aurum, seeping-semantics]. Recognizing the semantics of table values helps aggregate information from multiple tabular data sources. Search engines also rely on the detection of semantically relevant column names to extend support to tables [venetis2011recovering].

We can consider two categories of types for table columns: atomic and semantic. Atomic types such as boolean, integer, and string provide basic, low-level type information about a column. On the other hand, semantic types such as location, birthDate, and name convey finer-grained, richer information about column values. Detecting semantic types can be a powerful tool, and in many cases may be essential for enhancing the effectiveness of data preparation and analysis systems. In fact, commercial systems such as Google Data Studio [googledatastudio], Microsoft Power BI [powerbi], Tableau [tableau], and Trifacta [trifacta] attempt to detect semantic types, typically using a combination of regular expression matching and dictionary lookup. While reliable for detecting atomic types and simple, well-structured semantic types such as credit card numbers or e-mail addresses, these rule-based approaches are not robust enough to process dirty or missing data, support only a limited variety of types, and fall short for types without strict validations. Column names offer little help either: many tables found in legacy enterprise databases and on the Web have column names that are either unhelpful (cryptic, abbreviated, malformed, etc.) or missing altogether.

Figure 1: Two actual tables with unknown column types (Table A and Table B) from the VizNet corpora. The last column of Table A and the first column of Table B have identical values: ‘Florence,’ ‘Warsaw,’ ‘London,’ and ‘Braunschweig.’ However powerful, a prediction model based solely on column values (i.e., single-column prediction) cannot resolve the ambiguity to infer the correct semantic types, birthPlace and city. Sato incorporates signals from table context and performs multi-column type prediction to help effectively resolve ambiguities like these and improve the accuracy of semantic type predictions.

In response, recent work [Hulsebos:2019:KDD] introduced Sherlock, a deep learning model for semantic type detection trained on a massive table corpus [hu2019viznet]. Sherlock formulates semantic type detection as a multi-class classification problem where classes correspond to semantic types. It leverages more than 600K real-world table columns for learning with a multi-input feed-forward deep neural network, providing state-of-the-art results.

While Sherlock represents a significant leap in applying deep learning to semantic typing, it suffers from two problems. First, it under-performs for types that do not have a sufficiently large number of samples in the training data. Although this is a known issue for deep learning models, it nevertheless restricts Sherlock’s application to underrepresented types, which form a long tail of data types appearing in tables at large. Second, Sherlock uses only the values of a column to predict its type, without considering the column’s context in the table. Predicting the semantic type of a column based solely on the column values, however, comprises an under-determined problem in many cases.

Consider the example in Fig. 1, for a column that contains ‘Florence,’ ‘Warsaw,’ ‘London,’ and ‘Braunschweig’ as values; location, city, or birthPlace could all be reasonable semantic types for the column. It can be hard to resolve such ambiguities using only column values because the semantic types also depend on the context of the table. Continuing with the example, it is highly likely that the column’s type would be birthPlace if it came from Table A since the table contains biographical information about influential personalities. However, the same column in Table B would be more likely to have the type city, as the table’s other columns present information about European cities.

In this paper, we introduce Sato (SemAntic Type detection with table cOntext), a hybrid machine learning model that incorporates table contexts to predict the semantic types of table columns. Sato combines topic modeling [Blei:2012:TopicModels] and structured learning [Lafferty:2001:CRF] together with single-column type prediction based on the Sherlock model. Similar to earlier work [Hulsebos:2019:KDD], we consider 78 common semantic types and use the WebTables dataset from the VizNet corpus [hu2019viznet] to train our model. We evaluate Sato through several experiments and show that it achieves macro average and support-weighted $F_1$ scores of 0.901 and 0.973, respectively, substantially outperforming the state-of-the-art. Through a per-type performance analysis, we find that Sato substantially increases the prediction accuracy for the underrepresented semantic types. Overall evaluation results demonstrate that incorporating the table context of a column when detecting its semantic type can help resolve ambiguities, such as those exemplified above, as well as ameliorate the need for large sample sizes for accuracy, improving prediction performance over the long tail of data types.

To facilitate future research and applications, we open source our code, trained model, and online demo powered by Sato at https://github.com/megagonlabs/sato.

2 Problem Formulation

Our goal is to predict semantic types for table columns using their values, without considering the header information. We formulate it as multi-class classification, each class corresponding to a predefined semantic type.

We consider the training data as a set of tables. Let $c_1, c_2, \ldots, c_m$ be the columns of a given table and $t_1, t_2, \ldots, t_m$ be the true semantic types of these columns, where $t_i \in \mathcal{T}$, the set of labels for possible semantic types considered (e.g., city, country, population). Similarly, let $\phi$ be a feature extractor function that takes a single column $c_i$ and returns an $n$-dimensional feature vector $\phi(c_i) \in \mathbb{R}^n$. One approach to semantic typing is to learn a mapping from the values of single columns to semantic types. We refer to this model as single-column prediction. The Sherlock [Hulsebos:2019:KDD] model falls into this category.

In Sato, in order to make the best use of table contexts and resolve semantic ambiguity with single-column predictions, we formulate the problem as multi-column prediction. A multi-column prediction model learns a mapping from the entire table (a sequence of columns) to a sequence of semantic types. This formulation allows us to incorporate table context into semantic type prediction in two ways.

First, we use features generated from the entire table as table context. For example, the column values ‘Italy,’ ‘Poland,’ … and ‘380,948,’ ‘1,777,972,’ … are also used to predict the semantic type of the first column in Table B (in Fig. 1). Second, we can jointly predict the semantic types of columns from the same table. Again, for Table B, with joint prediction, the predicted types country and population of the neighboring columns would help to make a more accurate prediction for the first column.

3 Model

Sato is a novel hybrid machine learning model developed to predict the semantic types of columns in tables. It has two modeling components: (1) A topic-aware prediction component that estimates the intent (a global descriptor) of a table using topic modeling and extends the single-column prediction model with an additional topic subnetwork. (2) A structured output prediction model that combines the topic-aware predictions for all columns and performs multi-column joint semantic type prediction. Fig. 2 illustrates the high-level architecture of Sato. We next discuss each Sato component and its implementation in detail.

Figure 2: In Sato, the topic-aware module extends single-column models with additional topic subnetworks, incorporating table intent as context into the model. The structured prediction module then combines the topic-aware results for all columns, providing the final semantic type prediction for the columns in the table.

3.1 Single-column prediction model

As shown in Fig. 2, Sato’s topic-aware module is built on top of a single-column prediction model that uses a deep neural network. We first provide a brief background on deep learning and a description of the single-column model.

Deep learning Deep learning [LeCun:2015:DeepLearning] is a form of representation learning that uses neural networks with multiple layers. Through simple but non-linear transformations of the input representation at each layer, deep learning models can learn representations of the data at varying levels of abstraction that are useful for the problem at hand (e.g., classification, regression). When provided enough training data and computing power, neural networks with multiple layers can be effectively trained with stochastic gradient descent, where the gradient of an objective function with respect to the layer-wise transformation coefficients (i.e., weights) can be computed using the backpropagation algorithm [rumelhart1986learning]. With increased access to large-scale training data and computing power, deep learning models have shown remarkable improvement in the last decade, achieving practical successes in solving long-standing problems in machine learning, ranging from image recognition to language translation.

Deep learning combined with the availability of massive table corpora [webtables, hu2019viznet] presents opportunities to learn from tables in the wild [halevy2009unreasonable]. It also presents opportunities to improve existing approaches to semantic type detection as well as other research problems related to data preparation and information retrieval. Although prior research has used shallow neural networks for related tasks (e.g., [li1994semantic]), it is only more recently that Hulsebos et al. [Hulsebos:2019:KDD] developed Sherlock, a large-scale deep learning model for semantic typing.

Deep learning for single-column prediction Sato builds on single-column prediction by using column-wise features and employs an architecture which allows any single-column prediction model to be used. In our current work, we choose Sherlock as our single-column prediction model due to its recently demonstrated performance.

The column-wise features used in Sato include character embeddings (Char), word embeddings (Word), paragraph embeddings (Para), as well as column statistics such as mean and standard deviation (Stat). The dimension of the column-wise features from the four groups is 1587 in total.

A multi-layer subnetwork is applied to the column-wise features to compress high-dimensional vectors into compact dense vectors, with the exception of the Stat feature set, which consists of only 27 features. The output of the three subnetworks is concatenated to the statistical features, forming the input to the primary network. After the concatenation of these features, in the primary network two fully-connected layers (ReLU activation) with BatchNorm and Dropout layers are applied before the output layer. The final output layer, which includes a softmax function, generates confidence values (i.e., probabilities) for the 78 semantic types.

Figure 3: Sato’s topic-aware modeling is based on the premise that every table is created with an intent in mind and that the semantic types of the columns in a table are expressions of that intent with thematic coherence. In other words, (a) the intent of a table determines the semantic types of the columns in the table, which in turn generate the column values, acting as latent variables. (b) Sato estimates the intent of a given table with a topic vector obtained from a pre-trained LDA model and combines it with the local evidence from per-column values using a deep neural network.

3.2 Topic-aware prediction model

The first component of Sato is a topic-aware prediction module that first characterizes a table with a topic vector and then incorporates it into the column-wise prediction by extending the neural network model above with an additional subnetwork to take topic vectors as input. We next discuss how using topic modeling to characterize table semantics can be useful in resolving ambiguities in type detection.

Table semantics Tables are collections of related data entities organized in rows. Every table is created with an intent [venetis2011recovering] in the user’s mind and semantic types of the columns in a table can be considered a meaningful expression (or utterance) of that intent. Each column of the table partially fulfills the intent by describing one attribute of the entities. As illustrated in Fig. 3(a), we assume that the intent of a table determines the semantic types of the columns in the table, which in turn generate the column values, acting as latent variables. We refer to the set of all column values in a table as table values.

Thus, being able to accurately infer the table intent can help to improve the prediction of column semantics. Table captions or titles usually capture table intent. For example, in Fig. 1, Table A intends to provide biographical information about influential personalities in history and Table B talks about geographical information about cities in Europe. However, as with column semantics, a clear and well-structured description of intent is not always available in real-world tables. Therefore we need to estimate table intent without relying on any header or meta information.

Sato estimates a table’s intent by mapping its values onto a low-dimensional space. Each of these dimensions corresponds to a “topic,” describing one aspect of a possible table intent. The final estimation is a distribution over the latent topic dimensions generated using topic modeling approaches. Next, we provide a brief background on topic models and explain how Sato extracts topic vectors from tables and feeds them to topic-aware models.

Topic models Finding the topical composition of textual data is useful for many tasks, such as document summarization or featurization. Topic models [Blei:2012:TopicModels] aim to automatically discover thematic topics in text corpora and discrete data collections in an unsupervised manner. Latent Dirichlet allocation (LDA) [Blei:2003:LDA] is a simple yet powerful generative probabilistic topic model, widely used for quantifying thematic structures in text. LDA represents documents as random mixtures of latent topics and each latent topic as a distribution over words. The main advantage of probabilistic topic models such as LDA over clustering algorithms is that probabilistic topic models can represent a data point (e.g., document) as a mixture of topics. Although LDA was originally applied to text corpora, since then many variants have been developed to discover thematic structures in non-textual data as well (e.g., [blei2003modeling, fei2005bayesian, Yuan:2012:DRD].)

Table intent estimator We use an LDA model to estimate a table’s intent as a topic vector, treating the values of each table as a “document.” As illustrated in Fig. 3(b), we implement the table intent estimator as a pre-trained LDA model. It takes table values as input and outputs a fixed-length vector named “table topic vector” over the topic dimensions. For Sato, we pre-train an LDA model with 400 topic dimensions on public tables that have had their headers and captions removed.

The topics are generated during training according to the data’s semantic structure, so they do not have pre-defined meanings. However, by looking at the representative semantic types associated with each topic, we found some examples with good interpretations. For example, topic #192 is closely associated with the semantic types “origin, nationality, country, continent, and sex” and thus possibly captures aspects of personal information, while topic #264 corresponds to “code, description, create, company, symbol” and can be interpreted as a business-related topic. Detailed topic analysis can be found in Section 5.4.

Learning and prediction Fig. 3(b) shows how topic-aware models take the values in a table topic vector as additional features for both learning and prediction. We augment the single-column neural network model with an additional subnetwork that takes topic vectors as input, and we append its output to the outputs of the other subnetworks before feeding them into the primary network. In this way, the topic-aware model learns not only relationships between the input column and its type but also how the column type correlates with the table-level contextual information.

Figure 4: (a) Sato uses a linear-chain CRF to model the dependencies between column types given their values. (b) For each column, Sato plugs in the column-wise prediction scores for each type as the unary potentials of the corresponding node in the CRF model. Then Sato learns the pairwise potentials through backpropagation updates using stochastic gradient descent, maximizing the posterior probability $P(\mathbf{t} \mid \mathbf{c})$ of the column types $\mathbf{t}$ given the columns $\mathbf{c}$. Although we choose to use predictions from topic-aware models in the current implementation, the Sato architecture is flexible enough to support unary potentials from arbitrary column-wise models.

3.3 Structured prediction model

We have shown that Sato captures table-level context by introducing topic vectors into single-column models. However, the table topic vector is shared by all columns in the table and can be considered as “global context.” Incorporating only global context, the topic-aware model ignores the inferred semantic types of surrounding columns in the same table. In other words, it cannot capture “local context” which is the relationship between semantic types of neighboring columns. We introduce the second component of Sato, a structured prediction model utilizing information from surrounding columns to better capture local context.

Through preliminary analysis, we confirm that certain pairs of semantic types co-occur in tables much more frequently than others. For example, in a WebTables sample, the most frequent pair city and state co-occurs 4 times more often than the tenth most frequent pair name and type (detailed co-occurrence statistics available in Section 4.1). Such inter-column relationships show the value of “local” contextual information from surrounding columns in addition to the “global” table topic. Sato models the relationships between columns through pairwise dependencies in a graphical model and performs table-wise prediction using structured learning techniques.

Next, we provide background for structured output learning using graphical models and explain its application in Sato.

Structured output learning In addition to semantic type detection, many other prediction problems such as named entity extraction, language parsing, speech recognition, and image segmentation have spatial or semantic structures that are inherent to them. Such structures mean that predictions of neighboring instances correlate to one another. Structured learning algorithms [bakir2007predicting], including probabilistic graphical models [koller2009probabilistic] and recurrent neural networks [lstm97hochreiter, rumelhart1986learning], model dependencies among the values of structurally linked variables such as neighboring pixels or words to perform joint predictions. Structured output learning models are widely used in computer vision and natural language processing [Nowozin:2011:StructuredLearningCV, Smith:2011:LinguisticStructurePrediction] for prediction tasks that have structures in output space, instead of applying a multi-class classification for each output variable independently.

A conditional random field (CRF) [Lafferty:2001:CRF] is a discriminative undirected probabilistic graphical model and one of the most popular techniques for structured learning with successful applications in labeling, parsing and segmentation problems across domains. Similar to Markov random fields (MRFs) [geman1986markov, koller2009probabilistic], exact inference for general CRFs is intractable, but there are special structures such as linear chains that allow exact inference. There are also several efficient approximate inference algorithms based on message passing, linear-programming relaxation, and graph cut optimization for CRFs with general graphs [Lafferty:2001:CRF].

Modeling column dependencies Sato uses a linear-chain CRF to explicitly encode the inter-column relationship while still considering features for each column. We encode the output of a column-wise prediction model (i.e., predicted semantic types of the columns) and the combinations of semantic types of columns in the same table as CRF parameters. As shown in Fig. 4(a), in the CRF model, each variable $t_i$ represents the type of a column, with the corresponding column values $c_i$ as the observed evidence. Variables representing the types of adjacent columns are linked with an edge. Given a sequence of columns $\mathbf{c} = (c_1, \ldots, c_m)$ in a table, the goal is to find the best sequence of semantic types $\mathbf{t} = (t_1, \ldots, t_m)$, which provides the largest conditional probability $P(\mathbf{t} \mid \mathbf{c})$.

The conditional probability can be written as a normalized product of a set of real-valued functions. Following the convention, we refer to these functions in log scale as “potential functions.” The unary potential $\psi_{\mathrm{UNI}}(t_i, c_i)$ captures the likelihood of predicting type $t_i$ based on the content of the corresponding column $c_i$. The pairwise potential $\psi_{\mathrm{PAIR}}(t_i, t_j)$ represents the “coupling degree” between types $t_i$ and $t_j$.

We use a linear-chain CRF, where the conditional distribution is defined by the unary prediction potentials and pairwise potentials between adjacent columns:

$$P(\mathbf{t} \mid \mathbf{c}) = \frac{1}{Z(\mathbf{c})} \exp\left( \sum_{i=1}^{m} \psi_{\mathrm{UNI}}(t_i, c_i) + \sum_{i=1}^{m-1} \psi_{\mathrm{PAIR}}(t_i, t_{i+1}) \right),$$

where

$$Z(\mathbf{c}) = \sum_{\mathbf{t}' \in \mathcal{T}^m} \exp\left( \sum_{i=1}^{m} \psi_{\mathrm{UNI}}(t'_i, c_i) + \sum_{i=1}^{m-1} \psi_{\mathrm{PAIR}}(t'_i, t'_{i+1}) \right)$$

is an input-dependent normalization function.
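To make the linear-chain formulation concrete, the following minimal sketch scores a type sequence by summing unary and adjacent-pair potentials (in log scale) and normalizes by brute-force enumeration of all type sequences. The function names, the three labels, and all numeric scores are hypothetical and chosen only for illustration; they are not taken from the Sato implementation.

```python
import math
from itertools import product


def sequence_score(types, unary, pairwise):
    """Sum of unary and adjacent-pair potentials (log scale) for one type sequence."""
    score = sum(unary[i][t] for i, t in enumerate(types))
    score += sum(pairwise[a][b] for a, b in zip(types, types[1:]))
    return score


def conditional_probability(types, unary, pairwise, labels):
    """P(types | columns), normalizing by enumeration of every possible sequence."""
    m = len(unary)
    log_z = math.log(sum(
        math.exp(sequence_score(seq, unary, pairwise))
        for seq in product(labels, repeat=m)
    ))
    return math.exp(sequence_score(types, unary, pairwise) - log_z)


# Toy two-column table with made-up scores. The first column is ambiguous
# between city and birthPlace; the second column clearly looks like country.
labels = ["city", "birthPlace", "country"]
unary = [
    {"city": -0.8, "birthPlace": -0.7, "country": -3.0},
    {"city": -3.0, "birthPlace": -3.0, "country": -0.1},
]
# Assumed pairwise weights: only (city -> country) is coupled.
pairwise = {a: {b: 0.0 for b in labels} for a in labels}
pairwise["city"]["country"] = 1.0

p_city = conditional_probability(("city", "country"), unary, pairwise, labels)
p_birth = conditional_probability(("birthPlace", "country"), unary, pairwise, labels)
```

In this toy setting the unary scores alone slightly favor birthPlace for the first column, yet `p_city` exceeds `p_birth` because the pairwise term rewards city next to country. Enumeration costs $|\mathcal{T}|^m$ evaluations and is only practical for tiny examples.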

Unary potential functions We use unary potentials to model the probability of a semantic type given the column content. In other words, the unary potential of a semantic type for a given column can be considered the probability of that semantic type based on the values of the column. The architecture of Sato supports using estimates of any valid column-wise prediction model as unary potentials. In the current work, we obtain the unary potentials of the semantic types for a given column from the output of our topic-aware prediction model, which uses both table-level topic vector and column features as input.

Pairwise potential functions Pairwise potentials capture the relationship between the semantic types of two columns in the same table. These relationships can be parameterized with a matrix $W \in \mathbb{R}^{|\mathcal{T}| \times |\mathcal{T}|}$, where $\mathcal{T}$ is the set of all possible types and $W_{jk}$ ($j, k \in \mathcal{T}$) is a weight parameter for the “coupling degree” of semantic types $j$ and $k$ in adjacent columns. Such a coupling degree can be approximated by the co-occurrence frequency. We expect the pairwise weight of two semantic types to be proportional to their frequency of co-occurrence in adjacent columns. Pairwise potential weights in our CRF model are trainable parameters, updated by gradient descent.

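Learning and prediction We use the following objective function to train a Sato model. The objective function is the log-likelihood of the semantic types $\mathbf{t} = (t_1, \ldots, t_m)$ of the columns $\mathbf{c} = (c_1, \ldots, c_m)$ in the same table, expressed with the unary potentials $\psi_{\mathrm{UNI}}$ and pairwise potentials $\psi_{\mathrm{PAIR}}$:

$$\log P(\mathbf{t} \mid \mathbf{c}) = \sum_{i=1}^{m} \psi_{\mathrm{UNI}}(t_i, c_i) + \sum_{i=1}^{m-1} \psi_{\mathrm{PAIR}}(t_i, t_{i+1}) - \log Z(\mathbf{c}).$$

Here, the normalization term $Z(\mathbf{c})$ sums over all possible semantic type combinations. To efficiently calculate $Z(\mathbf{c})$, we can use the forward-backward algorithm [rabiner1989tutorial], which uses dynamic programming to cache intermediate values while moving from the first to the last column. After the training phase, as shown in Fig. 4(b), Sato performs holistic type prediction with the learned pairwise potentials and the unary potentials provided by topic-aware prediction. To obtain prediction results, we conduct maximum a posteriori (MAP) inference of semantic types:

$$\hat{\mathbf{t}} = \arg\max_{\mathbf{t} \in \mathcal{T}^m} P(\mathbf{t} \mid \mathbf{c}) = \arg\max_{\mathbf{t} \in \mathcal{T}^m} \left( \sum_{i=1}^{m} \psi_{\mathrm{UNI}}(t_i, c_i) + \sum_{i=1}^{m-1} \psi_{\mathrm{PAIR}}(t_i, t_{i+1}) \right).$$

$Z(\mathbf{c})$ does not affect $\hat{\mathbf{t}}$ since it is a constant with respect to $\mathbf{t}$. Then we use the Viterbi algorithm [viterbi1967error] to calculate and store partial combinations with the maximum score at each step of the column sequence traversal, avoiding redundant computation.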
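The two dynamic programs mentioned above can be sketched as follows, assuming unary potentials are given as one dictionary of log-scale scores per column and pairwise potentials as a nested dictionary. This is an illustrative, pure-Python sketch with made-up scores; the function names and data layout are assumptions and do not reflect the released Sato code.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of values."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_partition(unary, pairwise, labels):
    """log Z via the forward recursion, caching one score per label per column."""
    alpha = {t: unary[0][t] for t in labels}
    for i in range(1, len(unary)):
        alpha = {
            t: unary[i][t] + log_sum_exp([alpha[s] + pairwise[s][t] for s in labels])
            for t in labels
        }
    return log_sum_exp(list(alpha.values()))


def viterbi(unary, pairwise, labels):
    """MAP type sequence: keep the best-scoring prefix ending in each label."""
    best = {t: unary[0][t] for t in labels}
    backpointers = []
    for i in range(1, len(unary)):
        step, pointers = {}, {}
        for t in labels:
            prev = max(labels, key=lambda s: best[s] + pairwise[s][t])
            step[t] = best[prev] + pairwise[prev][t] + unary[i][t]
            pointers[t] = prev
        best = step
        backpointers.append(pointers)
    # Trace back from the best final label to recover the full sequence.
    path = [max(labels, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))


# Toy two-column table with made-up scores (not from the paper).
labels = ["city", "birthPlace", "country"]
unary = [
    {"city": -0.8, "birthPlace": -0.7, "country": -3.0},
    {"city": -3.0, "birthPlace": -3.0, "country": -0.1},
]
pairwise = {a: {b: 0.0 for b in labels} for a in labels}
pairwise["city"]["country"] = 1.0

predicted = viterbi(unary, pairwise, labels)  # ["city", "country"]
```

Both routines take time proportional to $m \cdot |\mathcal{T}|^2$ rather than $|\mathcal{T}|^m$. In the toy example, per-column argmax would label the first column birthPlace, while joint decoding returns city because of the assumed coupling with country.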

4 Evaluation

We compare Sato and its two basic variants obtained by ablation with the state-of-the-art semantic type prediction model, Sherlock [Hulsebos:2019:KDD]. As demonstrated in [Hulsebos:2019:KDD], Sherlock’s deep learning approach clearly outperforms matching-based algorithms, decision-tree-based semantic typing, and human annotation. Therefore, we omit these comparisons in our evaluation and directly compare against the Sherlock model implemented as the Base method.

Figure 5: Counts of the 78 semantic types in the dataset form a long-tailed distribution. Sato improves the prediction accuracy for the types with fewer samples (those in the long-tail) by effectively incorporating table context.
Figure 6: Co-occurrence frequencies in log scale for a selected set of types. Certain pairs like (city, state) or (age, weight) appear in the same table more frequently than others. There are non-zero diagonal values as tables can have multiple columns of the same semantic type.

4.1 Datasets

We evaluate the effectiveness of the proposed models on the WebTables corpus from VizNet [hu2019viznet] and restrict ourselves to the relational web tables whose valid headers match one of the 78 semantic types. To avoid filtering out columns with slight variations in capitalization and representation, we convert all column headers to a “canonical form” before matching. The canonicalization process starts with trimming content in parentheses. We then convert strings to lower case, capitalize every word except the first (if there is more than one word), and concatenate the results into a single string. For example, the strings ‘YEAR,’ ‘Year’ and ‘year (first occurrence)’ will all have the canonical form ‘year,’ and ‘birth place (country)’ will be converted to ‘birthPlace.’
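The canonicalization steps can be sketched as a small function. This is a minimal illustration that assumes words are separated by whitespace; the function name `canonicalize` is hypothetical, and the handling of other separators (underscores, punctuation) is not specified in the text and is not attempted here.

```python
import re


def canonicalize(header: str) -> str:
    """Map a raw column header to a camelCase canonical form."""
    # Step 1: trim content in parentheses, e.g. 'year (first occurrence)'.
    trimmed = re.sub(r"\([^)]*\)", " ", header)
    # Step 2: lower-case and split into whitespace-separated words.
    words = trimmed.lower().split()
    if not words:
        return ""
    # Step 3: capitalize every word except the first, then concatenate.
    return words[0] + "".join(word.capitalize() for word in words[1:])


examples = ["YEAR", "Year", "year (first occurrence)", "birth place (country)"]
canonical = [canonicalize(h) for h in examples]
```

With the four example headers from the text, `canonical` is `['year', 'year', 'year', 'birthPlace']`.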

Since we formulate semantic typing as a multi-column type detection problem, we extract 80K tables, instead of columns, from a subset of the WebTables corpus as our dataset $D$. The column headers in their canonical forms act as the ground-truth labels for semantic types. To help evaluate the importance of incorporating table semantics, we also create a filtered version $D_{mult}$ with 33K tables, obtained by filtering out singleton tables (those containing only one column), since they lack context as defined in this paper. We then randomly split each dataset into a training set (80%) and a held-out set (20%) that is used for evaluation.
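The dataset preparation described above can be sketched as follows, assuming each table is represented as a dictionary from canonical header to a list of cell values. The representation, the function names, and the fixed seed are hypothetical choices made for this illustration.

```python
import random


def filter_multi_column(tables):
    """Drop singleton tables, i.e. keep tables with more than one column."""
    return [table for table in tables if len(table) > 1]


def split_tables(tables, train_fraction=0.8, seed=0):
    """Randomly split tables (not columns) into training and held-out sets."""
    shuffled = list(tables)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy corpus: six two-column tables followed by four singleton tables.
toy_tables = [{"city": ["Warsaw"], "country": ["Poland"]} for _ in range(6)]
toy_tables += [{"name": ["Ada"]} for _ in range(4)]

multi_column = filter_multi_column(toy_tables)
train, held_out = split_tables(toy_tables)
```

Splitting at the table level keeps all columns of a table on the same side of the split, which matters when predictions for one column depend on its neighbors.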

Figure 5 shows the count of each semantic type in the dataset $D$. The distribution is clearly unbalanced with a long tail. Single-column models tend to perform poorly on the less-common types that comprise the long tail. By effectively incorporating context, Sato significantly improves prediction accuracy for those types.

To better understand relationships between the semantic types of columns in the same table, we conduct a preliminary analysis on the co-occurrence patterns of types. Figure 6, shown in log-scale for readability, reports the frequencies of selected pairs of semantic types occurring in the same table. Most frequently co-occurring pairs include (city, state), (age, weight), (age, name), (code, description).
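A co-occurrence analysis of this kind can be sketched by counting unordered pairs of types within each table. The sketch below assumes each table is given as the list of its columns' semantic types; the toy tables are invented and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tables):
    """Count how often each unordered pair of types appears in the same table."""
    counts = Counter()
    for types in tables:
        # Pairs are formed over column positions, so two columns with the same
        # type contribute to the diagonal entry (type, type).
        for a, b in combinations(types, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


toy_tables = [
    ["city", "state"],
    ["city", "state", "name"],
    ["age", "weight"],
    ["name", "name"],
]
counts = cooccurrence_counts(toy_tables)
most_common_pair, frequency = counts.most_common(1)[0]
```

Here `most_common_pair` is `('city', 'state')` with a frequency of 2, and the pair `('name', 'name')` illustrates a non-zero diagonal entry.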

4.2 Feature extraction

We use the public Sherlock feature extractors (https://github.com/mitmedialab/sherlock-project) to extract the four groups of base features, Char, Word, Para and Stat, for each column in a table. In order to provide a fair comparison, these base features were used by both baseline methods and proposed methods in the experiments. To generate table topics as introduced in Section 3.2, we train an LDA model that captures the mapping from table values to the latent topic dimensions. Since LDA is an unsupervised model, we only need the vocabulary of the tables without requiring any headers or semantic annotation. We convert numerical values into strings and then concatenate all values in the table sequentially to form a “document” for each table. Using the gensim [rehurek_lrec] library, we train an LDA model with 400 topics on a separate dataset of 10K tables. With the pre-trained LDA, we can extract topic vectors for tables using values from the entire table as input. Every table has a single topic vector, shared across columns.
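The step that turns a table into an LDA “document” can be sketched as below. The sketch assumes a table is a list of columns, that values are concatenated column by column, and that tokens are whitespace-separated; none of these details are stated in the text, and the function name is hypothetical.

```python
def table_to_document(columns):
    """Flatten a table (a list of columns of cell values) into a token list."""
    tokens = []
    for column in columns:
        for value in column:
            if value is None:
                continue  # skip missing cells
            # Numerical values are converted to strings before tokenizing.
            tokens.extend(str(value).split())
    return tokens


# Toy version of Table B from Fig. 1, headers removed.
table_b = [
    ["Florence", "Warsaw"],
    ["Italy", "Poland"],
    [380948, 1777972],
]
document = table_to_document(table_b)
```

Token lists like `document` could then be converted to bag-of-words vectors with gensim's `Dictionary.doc2bow` and passed to `LdaModel` (for instance with `num_topics=400`) to obtain one topic vector per table. Because LDA works on bags of words, the order in which values are concatenated does not change the resulting topic vector.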

4.3 Model implementation

We implement the multi-input neural network introduced in [Hulsebos:2019:KDD] using PyTorch [paszke2017automatic] as the Base single-column model. Throughout the experiments discussed here, we train the Base neural network model for 100 epochs using the Adam optimizer with a learning rate of and a weight decay rate of .

For topic-aware prediction in Sato, the table topic features go through a separate subnetwork with an architecture identical to the subnetworks of the Base feature groups. Before going into the primary network, the outputs of all four subnetworks are concatenated with Stat to form a single vector.

We train Sato’s CRF layer with a batch size of 10 tables, using the Adam optimizer with a learning rate of for 15 epochs. We initialize the pairwise potential parameters of the CRF model with the column co-occurrence matrix calculated from a held-out set of the WebTables corpus. We set the CRF unary potentials for columns to be their normalized topic-aware prediction score.

4.4 Evaluation metrics

We measure the prediction performance on each target semantic type by calculating its $F_1$ score. Since the semantic type distribution is not uniform, we report two types of average performance: the support-weighted and macro average $F_1$ scores. The support-weighted $F_1$ score is the average of per-type $F_1$ values weighted by support (the number of samples in the test set for the respective type) and reflects the overall performance. The macro average $F_1$ score is the unweighted average of the per-type $F_1$ scores, treating all types equally, and is therefore more sensitive to types with small sample sizes than the support-weighted $F_1$ score.
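The difference between the two averages can be illustrated with a short sketch. It assumes the averages are taken over the types that occur in the test labels; the function names and the toy labels are hypothetical.

```python
from collections import Counter


def per_type_f1(y_true, y_pred):
    """F1 score for every type that occurs in the true labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    # F1 = 2*TP / (2*TP + FP + FN); the denominator is positive for any type
    # that has at least one true sample.
    return {
        label: 2 * tp[label] / (2 * tp[label] + fp[label] + fn[label])
        for label in set(y_true)
    }


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-type F1 scores."""
    scores = per_type_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def support_weighted_f1(y_true, y_pred):
    """Mean of the per-type F1 scores weighted by the number of true samples."""
    scores = per_type_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy test set: a frequent type (city) and a rare one (birthPlace).
y_true = ["city"] * 8 + ["birthPlace"] * 2
y_pred = ["city"] * 8 + ["city", "birthPlace"]
macro = macro_f1(y_true, y_pred)
weighted = support_weighted_f1(y_true, y_pred)
```

In this toy case the single error falls on the rare type, so the macro average (about 0.80) drops noticeably below the support-weighted average (about 0.89), which is the sensitivity to small sample sizes described above.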

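| Method | Multi-column tables ($D_{mult}$): macro average $F_1$ | Multi-column tables ($D_{mult}$): support-weighted $F_1$ | All tables ($D$): macro average $F_1$ | All tables ($D$): support-weighted $F_1$ |
| --- | --- | --- | --- | --- |
| Base | 0.752 (+0.0%) | 0.932 (+0.0%) | 0.692 (+0.0%) | 0.867 (+0.0%) |
| Sato | 0.901 (+14.9%) | 0.973 (+4.1%) | 0.783 (+9.1%) | 0.908 (+4.1%) |
| Sato_NoStruct | 0.865 (+11.3%) | 0.956 (+2.4%) | 0.768 (+7.6%) | 0.897 (+3.0%) |
| Sato_NoTopic | 0.828 (+7.6%) | 0.959 (+2.7%) | 0.717 (+2.5%) | 0.885 (+1.8%) |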
Table 1: Performance comparison between Sato and Base across the datasets (multi-column only) and (the full dataset), using the macro and supported-weighted scores. Sato improves over the Base model with significant absolute percentage increases in accuracy across the datasets in both aggregate metrics. Furthermore, and , the two Sato variants obtained by excluding one of the two components from the full model, also perform substantially better than Base in both data conditions and metrics.
Figure 7: F1 scores of Sato and Base for each semantic type, grouped by the relative change when Sato is used for prediction in comparison to Base: increased (left), unchanged (middle), and decreased (right). Sato increases the prediction accuracy for the majority of the types, particularly for the previously hard underrepresented ones.

5 Results

Table 1 reports improvements of the Sato variants over the Base method on both the multi-column dataset, which includes only tables with more than one column, and the complete dataset. On multi-column tables, Sato improves the macro average F1 score by 14.9% and the support-weighted F1 score by 4.1% (absolute) compared to the single-column Base. When evaluated on all tables, we still see a 9.1% improvement in the macro average F1 score and a 4.1% improvement in the support-weighted F1 score, although these scores are diluted by the inclusion of tables without valid table context. The results confirm that Sato can effectively improve the accuracy of semantic type prediction by incorporating contextual information embedded in table semantics.

We also evaluate the variants of Sato with single components: the topic-only variant (Sato without structured prediction) performs only topic-aware prediction using table values, and the structured-only variant (Sato without topic) conducts structured prediction using the Base output as unary potentials, without considering table topic features. As shown in Table 1, both variants provide improvements over the Base model but are outperformed by the combined effort in Sato. The results indicate that the structured prediction model and the topic-aware prediction model make use of different pieces of table context information for semantic type detection.

We note that there are always larger improvements on macro average scores than support-weighted scores, suggesting that a significant amount of Sato’s improvements come from boosting accuracy for the less represented types. To better understand the influence of techniques used in Sato, we next perform a per-type evaluation for both Sato components on multi-column tables.

5.1 Topic-aware prediction

Figure 8: F1 scores for each type obtained with (blue) and without (orange) topic-aware prediction. Panel (a) compares Sato with Sato without the topic-aware module; panel (b) compares Base with topic against Base, showing improvements on the majority of types. The effect is significant for many underrepresented types.

Fig. 8 shows the per-type comparison of F1 scores between models with and without the topic-aware prediction component. More specifically, Fig. 8(a) compares the full Sato against Sato without the topic-aware module, and Fig. 8(b) compares the topic-only model (Base with topic) against Base. Including information from table values improved 59 out of 78 semantic types for Sato without the topic-aware module, with 9 types unchanged and 10 types getting worse. Similarly, the topic-only model improves on Base for 64 types and does worse for 11 types; the prediction performance stays unchanged for 3 types.

We also see significant improvements in the previously “hard” semantic types with small support size. The types with the highest accuracy increases, affiliate, director, person, ranking, and sales, all come from the fifteen least represented types as shown in Fig. 5. This shows incorporating table values effectively alleviates the problem of lacking training data for the rare types.

In terms of averages on multi-column tables (Table 1), topic-aware prediction raises Base from 0.752 to 0.865 in macro average F1 and from 0.932 to 0.956 in support-weighted F1, and raises Sato without the topic-aware module from 0.828 to 0.901 in macro average F1 and from 0.959 to 0.973 in support-weighted F1. Since the macro average F1 is known to be less biased towards large classes, the differences between metrics again demonstrate that a significant portion of the improvement comes from boosting accuracy on the rare types. Overall, we confirm that incorporating table vocabulary improves the semantic type detection performance with or without structured prediction.

Figure 9: F1 scores for each type obtained with (blue) and without (orange) structured prediction. Panel (a) compares Sato with Sato without the structured prediction module; panel (b) compares Base with structured prediction against Base, showing improvements on the majority of types. Although the improvements on long-tail types are less significant compared to the topic-aware model in Fig. 8, fewer types get worse predictions (shown in the right panels). Structured prediction can correct mispredictions by directly modeling column relationships.

5.2 Structured prediction

To evaluate the contribution of structured prediction, we compare Sato with its variant without the structured prediction module (Fig. 9(a)). Similarly, we compare the performance of Base with structured prediction (structured prediction applied directly to the Base output) with that of Base (Fig. 9(b)).

Base is improved on 50 types and Sato without the structured prediction module is improved on 59 types with structured prediction. For a subset of rare types (e.g., depth, sales, affiliate), the prediction accuracy is dramatically improved, while for others (e.g., person, director, ranking) there is no noticeable improvement of the kind seen with topic-aware prediction. This shows that structured prediction is less effective in boosting the accuracy of rare types than topic-aware prediction. At the same time, however, both the number of types that get worse accuracy (4 and 5, respectively) and the drop in F1 scores for those types are smaller with structured prediction than with topic-aware prediction. Enforcing table-level context can sometimes be too aggressive, leading to worse performance for certain types. By modeling relationships between the inferred types of surrounding columns, the structured prediction module in Sato “salvages” some of these overly aggressive predictions. We conduct a qualitative analysis in Section 5.6 to look further into this effect.

With structured table-wise prediction on multi-column tables (Table 1), Base is raised from 0.752 to 0.828 in macro average F1 and from 0.932 to 0.959 in support-weighted F1, and Sato without the structured prediction module is raised from 0.865 to 0.901 in macro average F1 and from 0.956 to 0.973 in support-weighted F1. From the results, we conclude that multi-column predictions from the structured prediction model, with or without topic modeling, outperform the single-column predictions of the Base model.

To get a preliminary understanding of how sensitive Sato is to the initialization method used for the pairwise potential parameters in the CRF layer, we also compare the co-occurrence matrix initialization method with a random initialization. We find that both initialization methods converge to the same result, though the co-occurrence matrix initialization performs better in the first few epochs of learning.
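As an illustration of the kind of statistic used for the co-occurrence initialization, the sketch below counts how often two semantic types appear in the same table. The function name, the use of unordered pairs, and the de-duplication of repeated types within a table are assumptions made for the example; the text does not specify these details.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tables):
    """Count unordered pairs of semantic types that occur in the same table.

    tables: list of tables, each given as a list of column types.
    """
    counts = Counter()
    for types in tables:
        # Each pair is counted once per table, whatever the column order.
        for a, b in combinations(sorted(set(types)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical held-out tables, given by their column types only.
held_out = [
    ["city", "country"],
    ["city", "country", "name"],
    ["name", "age"],
]
counts = cooccurrence_counts(held_out)
print(counts[("city", "country")])
```

Counts like these could be normalized and copied into the pairwise parameters before training, whereas a random initialization would start from arbitrary values.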

Figure 10: Importance scores for the feature categories used in our models, obtained by measuring the drop in both aggregated F1 scores in permutation experiments. Topic features are the most important feature category with respect to the macro average score in the full Sato model, providing additional evidence for the contribution of topic modeling in predicting underrepresented semantic types.

5.3 Feature importance

To better understand the influence of the different feature groups, we perform permutation importance [altmann2010permutation] analysis on Base and Sato variants. For each fitted model and a specific feature group, we take the input tables and shuffle them by swapping only the features in the specified feature group with those of randomly selected tables. Such a feature mismatch causes less accurate predictions and worse overall performance. Shuffling crucial features breaks the strong relationships between input and output, leading to a significant drop in accuracy. We take the average of the normalized drop in F1 scores over five random trials as the feature importance measurement.
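The permutation procedure can be sketched as follows. This is a simplified illustration under stated assumptions: the model is any callable `predict`, each sample is a dict keyed by feature group, accuracy stands in for the F1 scores used in the experiments, and the "normalized drop" is taken to be the drop relative to the unshuffled score.

```python
import random


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def group_importance(predict, samples, labels, group, metric=accuracy,
                     trials=5, seed=0):
    """Average relative drop in `metric` when one feature group is shuffled.

    samples: list of dicts mapping feature-group name -> feature values.
    """
    rng = random.Random(seed)
    baseline = metric(labels, [predict(s) for s in samples])
    drops = []
    for _ in range(trials):
        # Swap only the chosen group's features between samples.
        donors = [s[group] for s in samples]
        rng.shuffle(donors)
        shuffled = [{**s, group: d} for s, d in zip(samples, donors)]
        score = metric(labels, [predict(s) for s in shuffled])
        drops.append((baseline - score) / baseline if baseline else 0.0)
    return sum(drops) / trials


# Toy model that looks only at the "Word" group (hypothetical data).
labels = ["a", "b", "c", "d", "e", "f"]
samples = [{"Word": label, "Char": i} for i, label in enumerate(labels)]
predict = lambda sample: sample["Word"]
print(group_importance(predict, samples, labels, "Char"))  # 0.0
```

Shuffling a group that the model ignores leaves the score unchanged, while shuffling a group it relies on produces a positive importance.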

Fig. 10 shows that for both the Base model and Sato without topic features, the Word and Char feature groups are the most important feature groups. This matches the conclusions in [Hulsebos:2019:KDD]. When table vocabulary is considered, the additional Topic feature group has comparable or greater importance than Word and Char. The effect is more pronounced with respect to the macro average metric, confirming that table value information helps, especially on less-represented types.

5.4 Topic interpretation

We conduct qualitative analysis on the LDA model to investigate how the model captures semantics from each table and provides contextual information to Sato. To obtain the topic distribution of each semantic type, we average the topic distributions of all tables that contain that semantic type. For each topic, we choose the top-k semantic types, ranked by the probability of the topic, as its representative semantic types.

We find that some topics have “flat” distributions, in which most semantic types have almost the same probability. Since these topics are not very useful for classifying semantic types, we compute a saliency score for each topic and sort the topics by their saliency. Our saliency score averages the probabilities of the top-k semantic types for each topic.
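One possible reading of the per-type topic distributions and the saliency score is sketched below. The function names, the data layout, and the toy topic vectors are assumptions made for the example.

```python
def average_topic_by_type(tables):
    """Mean topic vector per semantic type.

    tables: list of (topic_vector, set_of_types_in_table) pairs.
    """
    sums, counts = {}, {}
    for vector, types in tables:
        for t in types:
            acc = sums.setdefault(t, [0.0] * len(vector))
            for i, value in enumerate(vector):
                acc[i] += value
            counts[t] = counts.get(t, 0) + 1
    return {t: [v / counts[t] for v in acc] for t, acc in sums.items()}


def topic_saliency(type_topics, topic, k):
    """Mean probability of `topic` over its k highest-ranked semantic types."""
    top = sorted((vec[topic] for vec in type_topics.values()), reverse=True)[:k]
    return sum(top) / len(top)


# Three hypothetical tables with two-dimensional topic vectors.
tables = [
    ([0.8, 0.2], {"country", "city"}),
    ([0.6, 0.4], {"country"}),
    ([0.1, 0.9], {"company"}),
]
type_topics = average_topic_by_type(tables)
print(round(topic_saliency(type_topics, topic=0, k=2), 3))
```

A topic whose probability is spread evenly over all types gets a low score under this measure, while a topic concentrated on a few types gets a high one.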

Table 2 shows the top-5 salient topics and the representative semantic types. Following the standard approach in topic model analysis [Blei:2012:TopicModels, Blei:2003:LDA], we manually devise an interpretation for each topic. For example, topic dimensions #192 and #99 are activated by personal information in table values, whereas #264 is closely related to business tables. These examples demonstrate that the semantic space learned using LDA can capture intent information from tables.

| Topic ID | Top-5 semantic types | Interpretation |
| --- | --- | --- |
| 192 | origin, nationality, country, continent, sex | person |
| 99 | affiliate, class, person, notes, language | person |
| 232 | industry, format, notes, genre, type | product, movie, song |
| 394 | religion, family, address, teamName, publisher | person, book |
| 264 | code, description, creator, company, symbol | business |

Table 2: Examples of the topics learned by the LDA model, semantic types associated with each topic that are obtained by using a saliency metric, and our interpretation for each topic.

5.5 Column embeddings (Col2Vec)

To verify how the Table Intent features help the Sato model capture table semantics, we analyze and compare the final-layer outputs of the Sato model and of the baseline Sherlock model. We treat these outputs as column embeddings, since the final layer combines the input signals into a semantic representation. For Sato, we use the final layer of the single-column prediction model, before the CRF layer. We therefore assume that the Table Intent features account for the difference in the embeddings.

Figure 11: Two-dimensional visualizations of column embeddings by (a) the single-column prediction model of Sato (before the CRF layer) and (b) Sherlock. Colors denote semantic types. Gray-colored regions are manually added to emphasize the areas of “ambiguity” in the column embeddings. The Sato embeddings appear to separate similar semantic types better.

Following prior examples (e.g., [Zeiler:2014:Visualizing]), we analyze column embeddings of the test columns used in the experiments. We use t-SNE [vanDerMaaten:2008:tSNE] to reduce the dimensionality of the embedding vectors to two and then visualize them in a two-dimensional scatterplot. To embed the vectors of the two methods in a common space, we fit a single t-SNE model on all data points. We then visualize the major semantic types that are related to organizations (affiliate, teamName, family, and manufacturer) to investigate how well the Sato model with the Table Intent features distinguishes columns of those ambiguous semantic types.

Fig. 11 shows the visualization of the embedding vectors of Sato and Sherlock. With Sherlock, the column embeddings of each semantic type partially form a cluster, but some clusters overlap more than in the column embeddings by Sato. In Fig. 11(a), we observe a clearer separation between the organization-related semantic types, with little perturbation. From the results, we qualitatively confirm that topic-aware prediction helps Sato distinguish semantically similar semantic types by capturing the table context of an input table. Note that these column embeddings are from the test set, and no label information from these columns was used to obtain the column embeddings. Thus, we can also confirm that Sato appropriately generalizes and learns column embeddings for these semantic types.

5.6 Qualitative analysis

| Table ID | True columns | Base (w/o structured prediction) | Sato without topic (w/ structured prediction) |
| --- | --- | --- | --- |
| 6299 | code, name, city | symbol, team, city | code, name, city |
| 898 | company, location | name, city | company, location |
| 2816 | product, language | name, notes | product, language |
| 4575 | symbol, company, isbn, sales | symbol, name, isbn, duration | symbol, company, isbn, sales |
| 5712 | type, description | weight, name | type, description |
| 3865 | year, teamName, age | year, city, weight | year, teamName, age |

(a) Corrected tables from Base predictions

| Table ID | True columns | Topic-aware model (w/o structured prediction) | Sato (w/ structured prediction) |
| --- | --- | --- | --- |
| 4289 | age, city, country, rank | age, city, team, rank | age, city, country, rank |
| 410 | brand, weight | artist, code | brand, weight |
| 5655 | code, name, city | club, name, name | code, name, city |
| 4369 | day, location, notes | name, location, location | name, location, notes |
| 30 | language, name, origin | language, name, description | language, name, origin |
| 4303 | name, age, club | artist, age, team | name, age, team |
| 4531 | rank, name, city | rank, location, location | rank, location, city |
| 4520 | team, plays, result | team, age, result | team, plays, result |

(b) Corrected tables from the predictions of the topic-aware model

Table 3: Examples of the mispredictions that are corrected by performing a structured prediction using the linear-chain CRF.

To better understand how structured prediction further helps Sato in the presence of topic-aware predictions, we conducted qualitative analysis by identifying examples where table-wise prediction “salvages” bad predictions in the column-wise predictions (i.e., those of the Base model and of the topic-aware model without structured prediction).

Table 3(a) shows a selected set of example tables from the test sets where the incorrect predictions from the Base model are corrected by applying structured prediction using our trained CRF layer. For example, in table #4575, the columns company and sales were wrongly predicted as name and duration by the single-column Base model. By modeling inter-column dependencies, the structured prediction variant correctly predicts the types company and sales, which tend to co-occur more with the surrounding columns symbol and isbn in tables about books and magazines.

Table 3(b) shows examples where the topic-aware model without structured prediction made incorrect predictions using table values that were subsequently corrected by the use of structured prediction (i.e., Sato). Table #4369 and table #4531 are examples where location-related vocabulary in the tables had a large impact: the topic-aware model produced overly aggressive predictions with multiple location columns, whereas Sato, with the additional structured inference step, successfully corrected one of the columns.

Furthermore, by taking surrounding types into consideration, structured prediction effectively improves performance for numerical columns, such as duration/sales in table #4575, age/weight in table #3865, and code/weight in table #410.

6 Discussion

Using learned representations Sato’s single column prediction module based on Sherlock incorporates four categories of features that characterize different aspects of column values, amassing more than 1.5K feature values per column. The Sherlock authors note feature extraction to be necessary for performance results. However, the availability of large-scale table corpora presents a unique opportunity to develop pre-trained representation models and eschew manual feature extraction. To test the viability of using representation models, we fine-tuned the BERT model [devlin2019bert], a state-of-the-art model for language representation, for our semantic type detection task. Models based on fine-tuning BERT have recently improved prior art on several NLP benchmarks without manual featurization [devlin2019bert, liu2019fine, liu2019roberta]. We trained the BERT model using the default BERT parameters, achieving a support-weighted F1 score of 0.866, which is slightly better than 0.852 achieved by the Sherlock model. This result is promising because a “featurization-free” method with default parameters is able to achieve a prediction accuracy comparable to that of Sherlock. However, our multi-column prediction still outperforms the BERT model by a large margin, indicating the importance of incorporating table context into column type prediction. A promising avenue of future research is to combine our multi-column model with BERT-like pre-trained learned representation models.

Exploiting type hierarchy through ontology In this paper, semantic types are defined in a flat structure, with no hierarchy among them. In fact, some semantic types can have a parent-child relationship. For example, location can be the parent class of country and city. The benefits of incorporating such prior knowledge into the model are (1) effectively increasing the training data, since training data for a semantic type can also serve as training data for its parent semantic type (e.g., city can be considered as location), and (2) adequate modeling of the relationships between columns. We could therefore expect that incorporating prior knowledge would improve prediction accuracy, especially for the semantic types for which little training data is available.

However, we consider that the table-wise structured prediction of Sato appropriately models the pairwise relationship between the prediction results of two columns in the same table, and thus it can also leverage the information from the other columns to predict the semantic type of a target column. From the experimental results, we empirically confirm that our Table Vocabulary and table-wise structured prediction improve the performance of semantic type prediction, especially for types that have relatively little training data.
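The idea of reusing training data for a parent type, discussed above as a possible extension, can be sketched as follows. The `PARENT` mapping is a hypothetical ontology fragment built from the location example in the text, and `add_parent_examples` is an illustrative helper, not part of Sato.

```python
# Hypothetical ontology fragment: child type -> parent type.
PARENT = {"country": "location", "city": "location"}


def add_parent_examples(examples):
    """Return the examples plus a copy of each one relabeled with its parent.

    examples: list of (column_values, semantic_type) pairs.
    """
    augmented = list(examples)
    for values, label in examples:
        parent = PARENT.get(label)
        if parent is not None:
            augmented.append((values, parent))
    return augmented


examples = [(["Paris", "Lyon"], "city"), (["Alice", "Bob"], "name")]
for values, label in add_parent_examples(examples):
    print(label, values)
```

Types without a parent in the mapping are left untouched, so the augmentation only adds examples for types that the ontology covers.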

7 Related Work

Sato builds on prior machine learning approaches to semantic type detection. It is also related to existing systems and research that perform semantic type detection using regular expression matching, dictionary lookup, ontologies, statistical similarity, and ensembles of expert detectors.

Regular expression and dictionary lookup Semantic type detection enhances the functionality of commercial data preparation and analysis systems such as Microsoft Power BI [powerbi], Trifacta [trifacta], and Google Data Studio [googledatastudio]. These commercial tools typically rely on manually defined rule-based approaches, such as regular expression patterns and dictionary lookups, to detect semantic types. For instance, Trifacta detects around 10 types (e.g., gender and zip code) and Power BI only supports time-related semantic types (e.g., date/time and duration). Open source libraries such as messytables [messytables] and csvkit [csvkit] similarly use heuristics to detect a limited set of types.
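A rule-based detector of this kind can be sketched in a few lines. The rules below are hypothetical and are not taken from any of the tools mentioned above: the zip code pattern, the gender dictionary, and the 90% match threshold are assumptions made for the example.

```python
import re

ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")  # illustrative US zip code pattern
GENDERS = {"male", "female", "m", "f"}     # illustrative dictionary


def detect_type(values, threshold=0.9):
    """Return a type name if enough values match a rule, otherwise None."""
    cleaned = [str(v).strip() for v in values if str(v).strip()]
    if not cleaned:
        return None

    def share(predicate):
        return sum(1 for v in cleaned if predicate(v)) / len(cleaned)

    if share(lambda v: ZIP_RE.match(v) is not None) >= threshold:
        return "zip code"
    if share(lambda v: v.lower() in GENDERS) >= threshold:
        return "gender"
    return None


print(detect_type(["02139", "94105-1234"]))    # zip code
print(detect_type(["M", "F", "female"]))       # gender
print(detect_type(["02139", "n/a", "94105"]))  # None
```

The last call shows how a single dirty value can push a column below the threshold, and every new type requires another hand-written rule.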

Ontology-based Prior research work, with roots in the semantic web and schema matching literature, provides alternative approaches to semantic type detection. One body of work leverages existing data on the web, such as WebTables [webtables], and ontologies (or knowledge bases) such as DBPedia [dbpedia], Wikitology [syed2010exploiting], and Freebase [freebase]. Venetis et al. [venetis2011recovering] construct a database of value-type mappings, then assign types using a maximum likelihood estimator based on column values. Syed et al. [syed2010exploiting] use column headers and values to build a Wikitology query, the result of which maps columns to types.

Statistical similarity Several earlier approaches rely on statistical similarity or other measures of data similarity to match columns with types. Ramnandan et al. [ramnandan2015assigning] first separate numerical and textual column types, then compare column values to those with labels from a dataset using the Kolmogorov-Smirnov (K-S) test and Term Frequency-Inverse Document Frequency (TF-IDF), respectively. Pham et al. [pham2016semantic] use additional features and tests, including the Mann-Whitney test for numerical data and Jaccard similarity for textual data, to train logistic regression and random forest models.
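Two of the similarity measures named above can be written down compactly. The sketch below is a generic illustration, not a reimplementation of the cited systems: it computes the two-sample K-S statistic for numerical columns and the Jaccard similarity for textual columns, and the function names and toy data are assumptions.

```python
import bisect


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical cumulative distribution functions of the two samples."""
    a, b = sorted(a), sorted(b)

    def cdf(sample, x):
        return bisect.bisect_right(sample, x) / len(sample)

    return max(abs(cdf(a, x) - cdf(b, x)) for x in set(a) | set(b))


def jaccard(a, b):
    """Jaccard similarity between the sets of distinct values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Hypothetical labeled columns and an unlabeled numerical column.
labeled = {"age": [23, 35, 41, 58], "year": [1999, 2005, 2012, 2020]}
unknown = [19, 30, 44, 61]
best = min(labeled, key=lambda name: ks_statistic(unknown, labeled[name]))
print(best)  # age
print(jaccard(["a", "b"], ["b", "c"]))
```

A column is matched to the labeled column whose value distribution is closest, so a smaller K-S statistic or a larger Jaccard similarity indicates a better match.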

Synthesized Puranik [puranik] proposes combining the predictions of “experts,” including regular expressions, dictionaries, and machine learning models. More recently, Yan and He [yan2018synthesizing] introduced a system that, given a search keyword and a set of positive examples, synthesizes type detection logic from open source GitHub repositories. This system provides a novel approach to leveraging domain-specific heuristics for parsing, validating, and transforming semantic data types.

Learned Another line of prior work employs machine learning, including probabilistic graphical models. Goel et al. [goel2012exploiting] use CRFs to predict the semantic type of each value within a column, then combine these predictions into a prediction for the whole column. Limaye et al. [limaye2010annotating] use probabilistic graphical models to annotate values with entities, columns with types, and column pairs with relationships. These predictions simultaneously maximize a potential function using a message passing algorithm. Takeoka et al. [Takeoka:2019:Meimei] extend this approach with multi-label classifiers to support additional types, including numerical data types, and improve its predictive performance. Similar to these earlier approaches, Sato also uses a probabilistic graphical model, a linear-chain CRF for structured prediction. Unlike these earlier approaches, Sato uses the CRF model to combine topic-aware predictions of a large-scale deep learning model, leveraging massive table corpora available in the wild to significantly improve the performance. Note that many automated semantic matching and integration approaches (e.g., [doan2001reconciling, doan2003learning, li1994semantic, rahm2001survey]) sidestep explicit labeling and directly compare tables while trying to capture the semantics of tables with learned models.

Although prior research used shallow neural networks in the past for related tasks (e.g., [li1994semantic]), Sherlock [Hulsebos:2019:KDD] is the first deep learning model directly applied to semantic type detection for table columns. Trained on a large number of columns, Sherlock uses a multi-input neural network to make type predictions based on features of column values. Sato builds on Sherlock and addresses its two related drawbacks: the low prediction accuracy for underrepresented types and the lack of consideration for table context in prediction. We compare the performance of Sato against Sherlock through several experiments.

8 Conclusion

Automated semantic typing is becoming ever more important with the rapid increase in the need for the fusion of information from multiple, often heterogeneous, large-scale data sources. The semantics of a table column (or any other data source for that matter) are embodied by its context as well as its data values. Here, we introduce Sato to automatically detect the semantic types of table columns, leveraging the signals from the table context of columns as well as the data values of columns. Sato combines the power of large-scale deep learning with structured prediction and topic modeling to achieve a prediction performance that significantly exceeds the state of the art. Through ablation and permutation experiments, we evaluate Sato extensively and show how individual modeling choices as well as feature types contribute to the performance. In order to facilitate future applications and extended research, we publicly release our trained model and source code for training along with an interactive web application demonstrating Sato’s use at https://github.com/megagonlabs/sato.

9 Acknowledgments

We thank Jonathan Engel for suggesting the name Sato and his proofreading help. We also thank Kevin Hu for his help in making the Sherlock source code accessible.

References
