Putting Self-Supervised Token Embedding on the Tables
Electronic messages are a privileged means of information distribution for many businesses and individuals, often in the form of plain-text tables. As their number grows, extracting the text and numbers algorithmically rather than manually becomes necessary. Usual methods rely on regular expressions or on a strict structure in the data, but are not efficient when facing many variations, fuzzy structure or implicit labels. In this paper we introduce SC2T, a totally self-supervised model that builds vector representations of tokens in semi-structured messages by combining character and context levels, addressing these issues. It can then be used for unsupervised labeling of tokens, or as the basis of a semi-supervised information extraction system.
Today, most business-related information is transmitted in electronic form, such as emails. Converting these messages into an easily analyzable representation could therefore open numerous business opportunities, as many of them are not fully exploited because of the difficulty of building bespoke parsing methods. In particular, a great number of these transmissions are semi-structured text, which does not necessarily follow classic English grammar. As seen in Fig. 1, they can take the form of tables containing diverse elements, words and numbers, referred to hereafter as tokens.
These tables are often implicitly defined, meaning that there are no special tags delimiting what is or is not part of the table, or even between cells. In these cases, the structure comes from alignment by spaces or tabs and from the relative order of the tokens. The data are often unlabeled, which means the content must be read with domain-based knowledge. Thus, automatic extraction of structured information is a major challenge, because token candidates come in a variety of forms within a fuzzy context. A high level of supervision is hard to obtain, as manual labeling requires time that is hardly affordable when receiving thousands of such emails a day, and even more so as databases can become irrelevant over time. That is why training a generalizable model to extract these data should not rely on labeled inputs, but rather on the content itself - a paradigm called self-supervised learning. Many approaches already exist in Natural Language Processing, such as Part-of-Speech (POS) tagging or Named Entity Recognition (NER), but they do not take advantage of the semi-structured data framework. Conversely, some information extraction algorithms are applied to tables, but they require a great amount of manually defined rules and exceptions. Our model aims to reconcile both approaches for an efficient and totally self-supervised take on information extraction in the particular context of semi-structured data.
In this paper, we present a neural architecture for token embedding in plain-text tables, which provides a useful lower-dimensional representation for tasks such as unsupervised or semi-supervised clustering. Intuitively, tokens with a similar meaning should be close in the feature space to ease any further information extraction. Our model aims to combine the best of the context and the character composition of each token, which is why the neural architecture is designed to learn both context- and character-level representations simultaneously. Finally, we can take advantage of the distances between tokens in the feature space to create proper tables from fuzzy input data.
II Related Work
II-A Information Extraction on Semi-Structured Data
The field of Information Extraction on Semi-Structured Data was particularly active in the 1990s and the early 2000s, developed in settings such as the Message Understanding Conferences (MUCs) and, more recently, the ICDAR 2013 Table Competition . Very complete surveys of information extraction in tables can be found in  and . The main goal of systems such as ,  or TINTIN  is to detect tables in messages, or to label lines such as captions, using the density of blank spaces, Conditional Random Fields or Hidden Markov Models respectively. This has also been done more recently in an unsupervised manner by  and . The main goal, obviously, is to extract the content of these tables, which is done by [9, 10, 11, 12], with DEByE , DIPRE  or WHISK  learning patterns to match to the data, systematically using manually defined rules and trying to generalize them as much as possible. A very thorough panorama of this class of algorithms is presented in . More recently,  proposed a graph structure in tables to match predefined patterns. Unfortunately, these methods are not flexible enough to handle a great number of patterns in the data, and they need user supervision or gazetteers to work properly, which are not always available. Our model is most closely related to  and , but we bring in newer Natural Language Processing tools and neural networks, among other differences.
II-B Natural Language Processing
In recent years, neural networks have replaced handcrafted features in Natural Language Processing, with excellent results - a recent survey of the topic can be found in . The seminal paper of Collobert et al.  presents a first idea of token embeddings, or word feature vectors, based on lookup tables over a fixed vocabulary and using neural networks. It also brings a general solution to problems such as Part-of-Speech (POS) tagging, Chunking and Named Entity Recognition (NER). The work on word feature vectors continued with the classic Word2Vec paper , now one of the references on the topic, which introduced the skip-gram model for text. There, the network is trained by trying to predict words in a sentence based on the surrounding ones. However, a problem with these approaches is that they rely on a dictionary of words, so "out-of-vocabulary" words such as orthographic errors get a generic representation. In problems such as information extraction, this is a major issue, because the content consists mostly of names that are not classic words and can evolve over time. Besides, closely related words such as "even" and "uneven" should be close in the feature space, which is not guaranteed by these methods. That is why the focus has recently shifted to working directly on characters, which largely solves these issues. Examples can be found in  and  with LSTMs, or in ,  and  with Convolutional Networks. Further developments presented in  and  aim to learn vector representations of sentences or documents instead of limiting the models to words. This is done with the same methods used to obtain word representations, only with whole rows or paragraphs as input. These are our main inspirations, but all these algorithms were created to deal with natural rather than semi-structured text, so they do not take advantage of the two-dimensional structure of the data.
An effort worth noting is  with the introduction of Multidimensional Recurrent Neural Networks in the Optical Character Recognition (OCR) field, but the idea has not been developed further.
III The SC2T Embedding
We will now present the SC2T (Self-Supervised Character and Context-levels on Tables) embedding. As in , two important ideas guide our neural network architecture: to correctly represent a token, we need to take into account its composition (a number, a word?) as well as its context (the surrounding tokens). As we deal with tokens that are mostly not words in the classic sense of the term, but abbreviations, numbers, unique identifiers, and as we have no dictionary, we cannot use word-level features similar to what was done in . That is why we use character-level representations, in the same fashion as , ,  or . We do not use external dictionaries or gazetteers, which allows our program to stay relevant on any semi-structured text. Note that given raw text as input, the first stage is the tokenization of the data. A discussion of that topic is beyond the scope of this paper, as special rules have to be applied depending on the data to obtain a pertinent segmentation.
III-A The Architecture
Our architecture is created to learn a character- and context-sensitive embedding of tokens. To build this distributed representation we train our network on a proxy task, which is to reconstruct tokens using only the surrounding ones - an idea recalling auto-encoders. By surrounding, we mean the tokens contained in a horizontal window of size and a vertical window of size around the target token, padding with zeros if necessary. This method resembles what is done in  or  for example, but takes advantage of the 2D structure of the data. Selecting horizontally adjacent tokens is trivial, contrary to vertical ones. Papers such as  and  give good insights on how to define that efficiently. However, for simplicity reasons, we take from each surrounding line the token whose rightmost character is closest to the rightmost character of our target token. Each of these surrounding tokens is first transformed into a one-hot encoding on the characters of dimensionality , padded left with blank spaces to achieve the same length for all tokens. Then, they all pass through the same character-level convolutional network (ChNN), whose structure is inspired by . It is composed of a one-hot encoding followed by a fully connected (FC) layer, then two one-dimensional CNNs with filters of size followed by a max-pooling. Finally, a fully connected layer brings the embedding to the desired size. ReLU activations, batch normalization and dropout are also placed between each layer. A diagram of this network can be found in Fig. 2.
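The vertical-neighbor heuristic above can be sketched in a few lines of plain Python. This is a hypothetical illustration: token spans are assumed to come from a simple whitespace tokenizer, and for each surrounding line we pick the token whose right edge is closest to the target's.

```python
# Sketch of the vertical-neighbor selection described above: for a target
# token (identified by its rightmost character position), pick from a
# surrounding line the token whose right edge is closest to it.

def token_spans(line):
    """Return (token, start, end) triples for a whitespace-split line."""
    spans, pos = [], 0
    for tok in line.split():
        start = line.index(tok, pos)
        end = start + len(tok)
        spans.append((tok, start, end))
        pos = end
    return spans

def vertical_neighbor(target_end, other_line):
    """Token of `other_line` whose right edge is closest to `target_end`."""
    spans = token_spans(other_line)
    if not spans:
        return None
    return min(spans, key=lambda s: abs(s[2] - target_end))[0]
```

For instance, a price column token aligned on its right edge will be matched to the price token of the line above, even if the columns drift by a few characters.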
The resulting embeddings are then concatenated and fed into the horizontal (HNN) and vertical (VNN) context networks, which have the same structure as the character-level network except for the input size, and for the max-pooling and FC layer being replaced by a simple Flatten layer. They are kept separate from each other because they are not meant to learn the same relationships in the data. Their outputs are then merged and passed through two fully connected layers (LNN), the last of them of size . Thus, we have two useful representations for a given token: the output of the LNN network (of size ), plus the output taken directly from the character CNN on the token itself (of size ). We then concatenate them and feed them to the last part of the network, E, which consists of two fully connected layers and whose final output is compared to the one-hot encoding of the original token. The concatenation is followed by a dropout layer to prevent the network from relying only on the input token. A value of yields the best results in our experience, which confirms the idea presented in . Our model allows a simultaneous training of all the components of the network using backpropagation. Finally, our context- and character-sensitive embedding is obtained by taking the output of the first FC layer in the E network, which has size , and we will see in the next part that it is indeed a useful distributed representation of tokens. A diagram of our whole network can be found in Fig. 3.
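The character-level input encoding used throughout the network can be sketched as follows. This is a minimal stdlib-only illustration; the alphabet and maximum token length are placeholders, not the values used in the paper.

```python
# Sketch of the token encoding described above: each token is left-padded
# with blanks to a fixed length, then one-hot encoded over a character
# dictionary. ALPHABET and max_len are illustrative assumptions.

ALPHABET = " abcdefghijklmnopqrstuvwxyz0123456789.,-/"
CHAR_TO_ID = {c: i for i, c in enumerate(ALPHABET)}

def one_hot_token(token, max_len=12):
    """Return a (max_len, len(ALPHABET)) list of one-hot rows."""
    padded = token.lower().rjust(max_len)[:max_len]  # pad left with blanks
    rows = []
    for ch in padded:
        row = [0] * len(ALPHABET)
        row[CHAR_TO_ID.get(ch, 0)] = 1  # unknown chars map to the blank id
        rows.append(row)
    return rows
```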
We use CNNs in all stages of our network instead of LSTMs or other layers for two reasons: first, in the case of tables, the sequential aspect is often negligible; second, we implemented the same program with bidirectional LSTMs and it did not yield better results, while slowing down the whole process. This matters because speed of execution is an important factor in industrial applications treating tens of thousands of messages each day, each containing hundreds or thousands of tokens.
III-B Alternative Model
An alternative to the previous model can be considered. Instead of letting the E network merge the character and context embeddings, we could simply concatenate them, applying a constant importance coefficient that has to be defined depending on the data. Indeed, if the different categories in the data are of different types (e.g., textual names and numbers), the character content has to be privileged, unlike the case of more context-dependent tokens (e.g., numbers in a certain order). Usually, if the structure of the data is disrupted, we will need to rely more on characters. The coefficient increases the weight of one part or the other, given that clustering algorithms put more importance on greater values in the data. Obviously, this coefficient requires user intervention and knowledge of the data. Thus, it is not applicable in general, but it can be very efficient in particular cases, as we will see in Section IV.
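The weighted concatenation above can be sketched as follows; `alpha` stands for the importance coefficient, a user-chosen assumption rather than a learned parameter.

```python
# Sketch of the alternative model: concatenate the character and context
# embeddings, rescaling the character part by a constant alpha so that
# distance-based clustering weighs it more (alpha > 1) or less (alpha < 1).

def weighted_concat(char_vec, ctx_vec, alpha=1.0):
    """Concatenate [alpha * char_vec, ctx_vec] into one embedding."""
    return [alpha * x for x in char_vec] + list(ctx_vec)
```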
III-C Token and Line Clustering
Once we obtain our token embeddings, a simple clustering algorithm such as k-means++  can be used to cluster the tokens. Obtaining coherent groups of tokens can lead to many developments. It can be used for manual labeling and quickly bootstrapping a labeled dataset for supervised learning, but it can also be the basis of an efficient semi-supervised algorithm.
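For reference, the k-means++ seeding step can be sketched with the standard library only: the first center is drawn uniformly, and each subsequent center with probability proportional to its squared distance to the nearest chosen center. This is a generic sketch of the published algorithm, not the exact implementation used here.

```python
# Stdlib-only sketch of k-means++ seeding over token embeddings.
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_init(points, k, rng=random):
    """Pick k initial centers from `points` (lists of floats)."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # squared distance of each point to its nearest current center
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        total = sum(d2)
        if total == 0:  # all points coincide with existing centers
            centers.append(rng.choice(points))
            continue
        r = rng.uniform(0, total)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers
```

In practice an off-the-shelf implementation (e.g., scikit-learn's `KMeans` with `init='k-means++'`) would be used on the embedding matrix.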
We also need to cluster lines in the data: indeed, a message is often composed of one or multiple headers, the data itself, disclaimers and signatures, and more generally blocks of natural language. Once again, their layout or presence is not guaranteed, so an adaptable clustering is necessary. To obtain an embedding of a line, we simply compute a max-pooling over the embeddings of its tokens. We used this method to separate headers, disclaimers and table content by 3-means clustering on our data.
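The line embedding described above is a plain elementwise max over the token embeddings of the line, which can be sketched as:

```python
# Sketch of the line embedding: elementwise max-pooling over the
# embeddings of the line's tokens (all vectors have the same size).

def line_embedding(token_embeddings):
    """Elementwise max over a non-empty list of equal-length vectors."""
    return [max(col) for col in zip(*token_embeddings)]
```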
IV Empirical Results
To assess the efficiency of our embeddings, we use them to label tokens in the Online Retail Data Set from UCI (http://archive.ics.uci.edu/ml/datasets/online+retail) via k-means++ clustering. We chose it because it is a varied public dataset that fits the kind of problem we are dealing with. Unfortunately, the relevant Information Extraction papers we found (Sec. II-A) used either custom datasets or datasets that are no longer online.
IV-A The Dataset
The Online Retail Data Set consists of a clean list of invoices organized in rows and columns. InvoiceNo, CustomerID and StockCode are mostly 5- or 6-digit integers with occasional letters. Quantity is mostly 1- to 3-digit integers, a part of them negative, and UnitPrice is composed of 1- to 6-digit floating values. InvoiceDate contains dates all in the same format, Country contains strings representing 38 countries, and Description is 4224 strings representing names of products. We reconstruct text mails from this data by separating each token with a blank space and stacking the lines of a given invoice, grouped by InvoiceNo. We use the column label as ground truth for the tokens in the dataset. For simplicity reasons we add underscores between words in Country and Description to ease the tokenization. Another slight modification has to be made: some of the CustomerID values are missing, and we replace them by '00000'. A sample can be found in Fig. 4.
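The mail reconstruction described above can be sketched as follows. The column order and the dictionary-based row format are assumptions for illustration; only the substitutions themselves (underscores for multi-word fields, '00000' for missing CustomerID) come from the text.

```python
# Hedged sketch of the mail reconstruction: rows of one invoice are
# stacked, tokens are space-separated, multi-word fields get underscores,
# and missing CustomerID values are replaced by '00000'.

COLUMNS = ["InvoiceNo", "StockCode", "Description", "Quantity",
           "InvoiceDate", "UnitPrice", "CustomerID", "Country"]

def rows_to_mail(rows):
    """Turn a list of row dicts (one invoice) into plain-text lines."""
    lines = []
    for row in rows:
        tokens = []
        for col in COLUMNS:
            value = row.get(col, "") or ("00000" if col == "CustomerID" else "")
            tokens.append(str(value).replace(" ", "_"))
        lines.append(" ".join(tokens))
    return "\n".join(lines)
```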
IV-B Labeling of Tokens Using the SC2T Embedding
We will now create an embedding of the tokens and use it in a k-means++ clustering. We use the homogeneity score as our metric, which measures whether all the data points that are members of a given cluster share the same label. It can be written

h = (1/N) Σ_k |C_k ∩ M_k|

where C_k is the set of data points in cluster k, M_k is the set of data points that have the label most present in cluster k, and N is the total number of data points. It represents the accuracy of a semi-supervised clustering where the user simply gives a label to each cluster, corresponding to the majority of its elements. Obviously, h tends to 1 as the number of clusters tends to the number of data points. However, we will not restrict ourselves to taking the exact number of labels as the number of clusters, as varied data can share the same ground-truth label in a real setting. For example, the same date written in several different formats could be labeled as a date everywhere, but might be difficult to group into one cluster. That is why we do not consider the completeness score, which measures whether all the data points of a given class are elements of the same cluster, as relevant in our case. So, a good measure of the quality of our clustering is the score reached for a certain number of clusters, e.g. 20 or 100, which also represents the number of points the user would have to label to obtain such accuracy. Note that as k-means yields stochastic results, the results given here are a mean over independent runs.
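The score defined above is straightforward to compute from cluster assignments and ground-truth labels; the following stdlib sketch mirrors the definition (majority-vote accuracy per cluster).

```python
# Sketch of the homogeneity score used above: the fraction of points whose
# ground-truth label matches the majority label of their cluster.
from collections import Counter

def homogeneity(cluster_ids, labels):
    """cluster_ids and labels are parallel lists over the data points."""
    per_cluster = {}
    for c, y in zip(cluster_ids, labels):
        per_cluster.setdefault(c, []).append(y)
    correct = sum(max(Counter(ys).values()) for ys in per_cluster.values())
    return correct / len(labels)
```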
At first, we have a simple problem: all the lines follow the same pattern, so a simple extraction rule can extract the data perfectly. This is a good baseline for our program, as it should retrieve all the information. Our experiment consists of creating homogeneous clusters according to the labels of the tokens after randomly deleting a portion of them (Del.) and/or randomly replacing a part of the characters (CR) - heavy modifications that are not unlike those found in real-life settings. An example of disrupted data can be found in Fig. 5.
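The two disruptions can be sketched as follows; the replacement alphabet and parameter names are assumptions, and only the two operations themselves (token deletion with probability `p_del`, character replacement with probability `p_cr`) come from the text.

```python
# Hedged sketch of the disruptions applied to the dataset: delete each
# token with probability p_del, then replace each remaining character
# with a random letter or digit with probability p_cr.
import random
import string

def disrupt_line(tokens, p_del=0.1, p_cr=0.1, rng=random):
    kept = [t for t in tokens if rng.random() >= p_del]
    out = []
    for tok in kept:
        chars = [rng.choice(string.ascii_lowercase + string.digits)
                 if rng.random() < p_cr else c for c in tok]
        out.append("".join(chars))
    return out
```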
Note that we only used a subset of the invoices, which yielded slightly worse results compared to the tests we made on the whole dataset. Logically, the more the context is disrupted, the more we rely on the character part. We present the results in two settings: one with the model presented in III-A (NoK), the other with the coefficient presented in III-B (K). Best Char % is the proportion of the norm of the character part of the embedding relative to the norm of the whole embedding, which is controlled by varying the coefficient. Homogeneity results depending on the number of clusters can be found in Table I (columns giving the number of clusters), and our parameters in Table II. We chose the horizontal window so that it takes into account the whole line, but that could be unsuited to very large tables.
Table I: Homogeneity (%) depending on the number of clusters.

| Setting | Model | 8 | 20 | 100 | Best Char % |
| --- | --- | --- | --- | --- | --- |
| Char. Repl. 5% | NoK | 99.3 | 99.7 | 100 | – |
| Char. Repl. 50% | NoK | 60.0 | 73.1 | 93.0 | – |
| Del. 10% + CR 10% | NoK | 73.2 | 92.2 | 97.3 | – |
| Del. 10% + CR 50% | NoK | 62.3 | 76.1 | 94.2 | – |
| Del. 50% + CR 10% | NoK | 76.5 | 88.3 | 94.7 | – |
| Del. 50% + CR 50% | NoK | 70.2 | 81.6 | 88.4 | – |
Table II: Network hyper-parameters.

| Parameter |
| --- |
| Character Dictionary Dim. |
| Context Embedding Dim. |
| Character-Level Embedding Dim. |
| Max. Length of Tokens |
Obviously, the more disrupted the data, the less accurate our model. First, we can see that the model with the coefficient is better than the one without in most cases, but remember that the value of the coefficient was cross-validated to obtain the best possible result. This is not realistic in general, but it can still be very useful when we have prior knowledge about the data. For example, we observe that without deletions, and even with character replacements, the context alone brings 100% accuracy, reflecting that the position entirely determines the label. When we randomly replace characters we cannot rely as much on them, and the numbers show that our model is more robust to deletion of tokens than to character replacement, probably because in our dataset tokens with the same label are often similar in composition. It is also interesting to notice that our supervision-free NoK model, even if slightly disadvantaged in simple cases, yields its best results when the data is more disrupted. This is good news, as these are the cases where we have the least prior knowledge, besides certainly being the most realistic settings and the ones that most need new models.
Unsurprisingly, we noticed that it is most often CustomerID, InvoiceNo and, to a lesser extent, StockCode that are mislabeled, due to their similar composition. Even in our most difficult case, 50% deletion and 50% character replacement, we obtain decent results in our unsupervised setting. Overall, with few token labels we could get a high clustering accuracy in most of our settings. The size of the embedding also had to be chosen carefully, because it has to encode enough information while avoiding the curse of dimensionality. Finally, note that the network gets less training data as the percentage of deletions increases, and that we retrained it from scratch in each setting.
IV-C An Application to Table Alignment
Often, tables are not correctly aligned when data is missing, which creates an erroneous display. To correct this problem, we can define a reference line, namely the longest line belonging to the table part according to the line clustering. This line defines the number of columns in the resulting table. Then, for every other line, we try to match each token with a token from the reference line, i.e. the token which is closest in the embedding space while allowing the order to be kept. We suppose here that the order is always preserved, because permutations are very unlikely within a given table. We then obtain correctly aligned tables, as seen in Fig. 6, which can be very useful for easier labeling of the tokens. This can be used even if there are different types of lines containing different information, these lines being separated beforehand by clustering, as presented in III-C. We then take different rows as references.
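The order-preserving matching above can be sketched as a greedy left-to-right assignment; the Euclidean distance and the greedy strategy are simplifying assumptions for illustration.

```python
# Hedged sketch of the realignment heuristic: each token of a line is
# matched to the reference-line column whose embedding is closest,
# scanning left to right so that column order is preserved.
import math

def align_to_reference(line_embs, ref_embs):
    """Return, for each token, the index of its reference column."""
    assignment, start = [], 0
    for i, emb in enumerate(line_embs):
        # leave enough columns for the remaining tokens
        stop = len(ref_embs) - (len(line_embs) - 1 - i)
        best = min(range(start, stop),
                   key=lambda j: math.dist(emb, ref_embs[j]))
        assignment.append(best)
        start = best + 1
    return assignment
```

A line missing one cell is thus mapped to the right subset of columns, and the missing column simply stays empty in the realigned table.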
In this paper we presented a new neural language model that jointly uses the character composition of tokens and their surrounding context in the particular framework of semi-structured text data, for the purpose of generating a distributed representation. We have seen that the embeddings linearize the space well enough that a k-means will gather similar tokens or, by max-pooling them, similar lines, and that they can be applied to table realignment. The approach presented here can already support an information extraction system, but it could be even more beneficial to add semi-supervised learning algorithms, as described in  or . Another solution would be to bootstrap large annotated databases for supervised learning. We introduce several hyper-parameters to be tuned, mainly the sizes of our embeddings. We want our model to stay as general and unsupervised as possible, and we argue that tuning them manually is the best solution, as existing unsupervised measures of clustering quality (Silhouette Coefficient , Calinski-Harabasz Index ) can be misleading for our particular task. Indeed, they can favor fewer clusters that are not homogeneous in terms of labels over more clusters that are, which goes against our goal. Finally, the lack of a relevant standard benchmark for this particular task is problematic. However, our dataset is openly available on the Internet (link above), and can serve as a simple but representative benchmark for papers to come.
We would like to thank Clement Laisné (Hellebore Technologies) for having developed convenient tools that greatly helped us in our research, as well as all our colleagues for their support. We also thank Caio Filippo Corro for discussions about this paper.
-  M. Göbel, T. Hassan, E. Oro, and G. Orsi, “ICDAR 2013 table competition,” in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 2013, pp. 1449–1453.
-  J. Turmo, A. Ageno, and N. Català, “Adaptive information extraction,” ACM Computing Surveys (CSUR), vol. 38, no. 2, p. 4, 2006.
-  D. W. Embley, M. Hurst, D. Lopresti, and G. Nagy, “Table-processing paradigms: a research survey,” International Journal on Document Analysis and Recognition, vol. 8, no. 2, pp. 66–86, 2006.
-  D. Pinto, A. McCallum, X. Wei, and W. B. Croft, “Table extraction using conditional random fields,” in Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval. ACM, 2003, pp. 235–242.
-  A. McCallum, D. Freitag, and F. C. Pereira, “Maximum entropy markov models for information extraction and segmentation,” in ICML, vol. 17, 2000, pp. 591–598.
-  P. Pyreddy and W. B. Croft, “TINTIN: A system for retrieval in text tables,” in Proceedings of the second ACM international conference on Digital libraries. ACM, 1997, pp. 193–200.
-  E. Cortez and A. S. Da Silva, Unsupervised information extraction by text segmentation. Springer, 2013.
-  E. Yeh, J. Niekrasz, and D. Freitag, “Unsupervised discovery and extraction of semi-structured regions in text via self-information,” in Proceedings of the 2013 workshop on Automated knowledge base construction. ACM, 2013, pp. 103–108.
-  F. Ciravegna, “Adaptive information extraction from text by rule induction and generalisation,” in Proceedings of the 17th international joint conference on Artificial intelligence-Volume 2. Morgan Kaufmann Publishers Inc., 2001, pp. 1251–1256.
-  P. Viola and M. Narasimhan, “Learning to extract information from semi-structured text using a discriminative context free grammar,” in Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2005, pp. 330–337.
-  A. Tengli, Y. Yang, and N. L. Ma, “Learning table extraction from examples,” in Proceedings of the 20th international conference on Computational Linguistics. Association for Computational Linguistics, 2004, p. 987.
-  S. Soderland, “Learning to extract text-based information from the world wide web.” in KDD, vol. 97, 1997, pp. 251–254.
-  A. H. Laender, B. Ribeiro-Neto, and A. S. da Silva, “DEByE - data extraction by example,” Data & Knowledge Engineering, vol. 40, no. 2, pp. 121–154, 2002.
-  S. Brin, Extracting Patterns and Relations from the World Wide Web. Berlin, Heidelberg: Springer Berlin Heidelberg, 1999, pp. 172–183. [Online]. Available: http://dx.doi.org/10.1007/10704656_11
-  S. Soderland, “Learning information extraction rules for semi-structured and free text,” Machine learning, vol. 34, no. 1, pp. 233–272, 1999.
-  C.-H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan, “A survey of web information extraction systems,” IEEE transactions on knowledge and data engineering, vol. 18, no. 10, pp. 1411–1428, 2006.
-  T. Kasar, T. K. Bhowmik, and A. Belaid, “Table information extraction and structure recognition using query patterns,” in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015, pp. 1086–1090.
-  E. Agichtein and V. Ganti, “Mining reference tables for automatic text segmentation,” in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004, pp. 20–29.
-  M. Mintz, S. Bills, R. Snow, and D. Jurafsky, “Distant supervision for relation extraction without labeled data,” in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2. Association for Computational Linguistics, 2009, pp. 1003–1011.
-  M. M. Lopez and J. Kalita, “Deep learning applied to NLP,” arXiv preprint arXiv:1703.03091, 2017.
-  R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa, “Natural language processing (almost) from scratch,” CoRR, vol. abs/1103.0398, 2011. [Online]. Available: http://arxiv.org/abs/1103.0398
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
-  W. Ling, T. Luís, L. Marujo, R. F. Astudillo, S. Amir, C. Dyer, A. W. Black, and I. Trancoso, “Finding function in form: Compositional character models for open vocabulary word representation,” arXiv preprint arXiv:1508.02096, 2015.
-  G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural architectures for named entity recognition,” arXiv preprint arXiv:1603.01360, 2016.
-  Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush, “Character-aware neural language models,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
-  J. P. Chiu and E. Nichols, “Named entity recognition with bidirectional LSTM-CNNs,” arXiv preprint arXiv:1511.08308, 2015.
-  C. D. Santos and B. Zadrozny, “Learning character-level representations for part-of-speech tagging,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1818–1826.
-  J. Li, M.-T. Luong, and D. Jurafsky, “A hierarchical neural autoencoder for paragraphs and documents,” arXiv preprint arXiv:1506.01057, 2015.
-  Q. Le and T. Mikolov, “Distributed representations of sentences and documents,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1188–1196.
-  A. Graves and J. Schmidhuber, “Offline handwriting recognition with multidimensional recurrent neural networks,” in Advances in neural information processing systems, 2009, pp. 545–552.
-  T. G. Kieninger, “Table structure recognition based on robust block segmentation,” in Document Recognition V, D. P. Lopresti and J. Zhou, Eds., vol. 3305, Apr. 1998, pp. 22–32.
-  T. Kieninger and A. Dengel, “Applying the t-recs table recognition system to the business letter domain,” in Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on. IEEE, 2001, pp. 518–522.
-  D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” in Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035.
-  Z. Yu, P. Luo, J. You, H.-S. Wong, H. Leung, S. Wu, J. Zhang, and G. Han, “Incremental semi-supervised clustering ensemble for high dimensional data clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 3, pp. 701–714, 2016.
-  D. Calandriello, A. Lazaric, M. Valko, and I. Koutis, “Incremental spectral sparsification for large-scale graph-based semi-supervised learning,” arXiv preprint arXiv:1601.05675, 2016.
-  P. J. Rousseeuw, “Silhouettes: a graphical aid to the interpretation and validation of cluster analysis,” Journal of computational and applied mathematics, vol. 20, pp. 53–65, 1987.
-  T. Caliński and J. Harabasz, “A dendrite method for cluster analysis,” Communications in Statistics-theory and Methods, vol. 3, no. 1, pp. 1–27, 1974.