Rethinking Table Parsing using Graph Neural Networks
Abstract
Document structure analysis, such as zone segmentation and table parsing, is a complex problem in document processing and is an active area of research. The recent success of deep learning in solving various computer vision and machine learning problems has not been reflected in document structure analysis, since conventional neural networks are not well suited to the input structure of the problem. In this paper, we propose an architecture based on graph networks as a better alternative to standard neural networks for table parsing. We argue that graph networks are a more natural choice for these problems, and explore two gradient-based graph neural networks. Our proposed architecture combines the benefits of convolutional neural networks for visual feature extraction and graph networks for dealing with the problem structure. We empirically demonstrate that our method outperforms the baseline by a significant margin. In addition, we identify the lack of large-scale datasets as a major hindrance for deep learning research in structure analysis, and present a new large-scale synthetic dataset for the problem of table parsing. Finally, we open-source our implementation of dataset generation and the training framework of our graph networks to promote reproducible research in this direction (github.com/shahrukhqasim/TIES-2.0).
I Introduction
Structural analysis is one of the most important aspects of document processing. It incorporates both physical and logical layout analysis and also includes the parsing of complex structured layouts such as tables, receipts, and forms. While much research has been done on the physical and logical layout analysis of documents, there is still ample room for contribution towards parsing the structured layouts, such as tables, within them. Tables provide an intuitive and natural way to present data in a format that can be readily interpreted by humans. Owing to its significance and difficulty, table structure analysis has attracted a large number of researchers to this domain.
Table detection and parsing is an old problem, with research starting in the late nineties. One of the initial works is by Kieninger and Dengel [1], who used a bottom-up approach on word bounding boxes with a complex algorithm. Later, many different custom methods were introduced, including [2], [3], [4] and [5], which relied on custom-designed algorithms. Zanibbi et al. [6] present a comprehensive survey of these early algorithms. An approach to recognize tables in spreadsheets was presented by [7], which classified every cell as either a header, a title or a data cell. Significant work was done by Shafait and Smith [8], who introduced different performance metrics for the table detection problem. These approaches are not data-driven, and they make strong assumptions about table structures.
Chen and Lopresti [9] made use of support vector machines and dynamic programming for table detection in handwritten documents. Kasar et al. [10] also used SVMs, on ruling lines, to detect tables. Hao et al. [11] used loose rules to extract candidate table regions and classified the regions using CNNs; they also used textual information from the PDFs to improve the model's results. [12] used the positional information of every word to classify it as either table or non-table using dense neural networks.
After 2016, research trended towards deep learning models for this challenge. In 2017, many papers were presented which used deep learning based object detection or segmentation models for table detection and parsing. Gilani et al. [13] encoded distance-transform information in an image and applied Faster R-CNN [14] to these images. Schreiber et al. [15] also used Faster R-CNN for table detection and the extraction of rows and columns; for parsing, they applied the object detection algorithm to vertically stretched document images. Leveraging the property that tables empirically contain more numeric than textual data, Arif and Shafait [16] proposed color-coding the document image to distinguish numeric text and applied Faster R-CNN to extract table regions. Similarly, Siddiqui et al. [17] presented an end-to-end Faster R-CNN pipeline for the table detection task, using a Deformable Convolutional Network [18] as the feature extractor for its capability to mold its receptive field to the input. He et al. [19] segmented the document image into three classes: text, tables, and figures. They proposed a Conditional Random Field (CRF) to improve the results of a Fully Convolutional Network (FCN), conditioned on the output of a contour edge detection network; finally, they applied heuristic rules to classify table instances and remove false positives from the CRF output. Kavasidis et al. [20] trained a fully convolutional neural network to extract saliency maps for four classes: tables, pie charts, line charts and bar charts. These saliency maps were then independently refined by CRF models.
Even though many people have shown that object-detection-based approaches work well for table detection and recognition, defining the parsing problem as an object-detection problem is hard, especially if the documents are camera-captured and contain perspective distortions. Approaches like [15] partially solve the issue, but they are still not natural. Such formulations also make it harder to use further features which could be extracted independently, for instance, language features which could hint towards the existence of a table.
In this paper, we present another way of dealing with the problem: defining it using graph theory and applying graph neural networks to it. Foundational research in graph neural networks was done by Scarselli et al. [21], who formulated a comprehensive graph model based on contraction maps. In recent years, graph networks have gained a lot of traction owing to the increase in compute power and the introduction of newer methods; many notable works include [22], [23], [24]. Battaglia et al. [25] argued that relational inductive biases are key to achieving human-like intelligence and showed how graph neural networks are essential to this end.
The use of graphs in document processing is not new. Many papers have employed graph-based models for a wide range of problems. Liang et al. [26] introduced a hierarchical tree-like structure for document parsing. Work presented by Wang [27] gained a lot of popularity and was used by many researchers afterward. Hu et al. [28] introduced a very comprehensive graph model involving a Directed Acyclic Graph (DAG) with detailed definitions of the various elements of a table. Recently, Koci et al. [29] presented an approach where they encoded information in the form of a graph and then applied a newly proposed rule-based remove-and-conquer algorithm. Bunke and Riesen [30] provide a detailed analysis of different graph-based techniques employed in the context of document analysis. These methods are too detailed and make strong assumptions about the structure, which contradicts the philosophy of deep learning. Even though we are not the first to use graphs for document processing, to the best of our knowledge, we are the first to apply graph neural networks to this problem. We conducted our experiments on the table parsing problem; however, this problem definition applies to all sorts of problems in structural analysis. There are two advantages to our approach. First, it is more generic and natural, since it makes no strong assumptions about the structure and is close to how humans interpret tables, i.e. by matching data cells to their headers. Second, it allows us to exploit graph neural networks, which have recently attracted a great deal of research interest.
In particular, we make the following contributions in this paper:

Formulate the table parsing problem as a graph problem which is compatible with graph neural networks

Design a novel differentiable architecture which reaps the benefits of both convolutional neural networks for image feature extraction and graph neural networks for efficient interaction between the vertices

Introduce a novel Monte Carlo based technique to reduce memory requirements of training

Fill the gap of a large-scale dataset by introducing a synthetic dataset

Run tests with two state-of-the-art graph-based methods and empirically show that they perform better than a baseline network
II Dataset
There are a few datasets published by the research community, including UW3, UNLV [31] and the ICDAR 2013 table competition dataset [32]. However, all of these datasets are limited in size, which risks overfitting in deep neural networks and, hence, poor generalization. Many people have tried techniques such as transfer learning, but these techniques cannot fully substitute for a large-scale dataset.
We present a large synthetically generated dataset of 0.5M tables, divided into four categories, which are visualized in Figure 1. To generate the dataset, we employed Firefox and Selenium to render synthetically generated HTML. The dataset is loosely inspired by UNLV; the text is also randomly picked from the UNLV dataset. We note that synthetic dataset generation is not new; similar work has been done before [2].
Even though it will be hard to generalize algorithms from our dataset to the real world, the dataset provides a standard benchmark for studying different algorithms until a large-scale real-world dataset arrives. The images are divided into train, validation and test splits. We have also published our code so that further data can be produced if required.
III The graph model
Considering the problem of table parsing, the ground truth is defined as three graphs in which every word is a vertex. Three adjacency matrices represent these graphs: the cell-, row-, and column-sharing matrices. If two vertices share a row, i.e. both words belong to the same row, these vertices are adjacent to each other in the row-sharing graph; likewise for cell and column sharing.
The prediction of a deep model likewise takes the form of the three adjacency matrices. Given the predicted adjacency matrices, complete cells, rows and columns can be reconstructed by solving the maximal clique problem [33] for rows and columns and finding connected components for cells. This is shown pictorially in Figure 3.
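The reconstruction step can be sketched as follows. This is an illustrative toy example, not the paper's code (function and variable names are ours): a minimal Bron-Kerbosch enumeration recovers rows and columns as maximal cliques of their sharing graphs, and a simple traversal recovers cells as connected components, for a 2x2 table of four words.

```python
def maximal_cliques(adj):
    """Bron-Kerbosch without pivoting; adj maps vertex -> set of neighbours."""
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)
    expand(set(), set(adj), set())
    return sorted(cliques)

def connected_components(adj):
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(sorted(comp))
    return sorted(comps)

# Toy 2x2 table: words 0,1 share the first row and 2,3 the second;
# 0,2 share the first column and 1,3 the second; each word is its own cell.
row_adj = {0: {1}, 1: {0}, 2: {3}, 3: {2}}
col_adj = {0: {2}, 1: {3}, 2: {0}, 3: {1}}
cell_adj = {0: set(), 1: set(), 2: set(), 3: set()}

rows = maximal_cliques(row_adj)          # rows as maximal cliques
cols = maximal_cliques(col_adj)          # columns as maximal cliques
cells = connected_components(cell_adj)   # cells as connected components
```

In practice the adjacency would come from thresholding the model's predicted matrices before running the reconstruction.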
This model is valid not only for table parsing problems but can also be used for document segmentation. In that scenario, if two vertices (could be words again) share the same zone, they are adjacent. The resultant zones can also be reconstructed using the maximal clique problem.
IV Methodology
All the tested models follow the same parent pattern, shown in Figure 2, divided into multiple parts, including three neural networks: a convolutional neural network for the extraction of image features, an interaction part for communication between the vertices, and a classification part that labels every vertex pair as not adjacent or adjacent (class 0 or class 1) in each of the three graphs.
The algorithm for the forward pass is given in Algorithm 1. It takes as input the image (I ∈ ℝ^(h×w×c), where h, w and c represent the height, width and number of channels of the input image respectively), the positional features (F_pos), the other features (F_other), and the number of vertices (N). In addition, during training it also takes the number of samples per vertex (s) and the three adjacency matrices (A_cell, A_row and A_col). All parametric functions are denoted by Φ and non-parametric functions by Λ. If all the parametric functions are differentiable, the complete architecture is differentiable as well, and backpropagation can be used for training.
The positional features include the coordinates of the upper-left and bottom-right corners of each vertex. The other features consist only of the length of the word in our case. However, for a real-world dataset, natural language features [34] could also be appended, which may contain further information.
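A minimal sketch of how such per-vertex input features might be assembled; the word list and helper name here are hypothetical (a real pipeline would read the boxes from an OCR or PDF layer):

```python
import numpy as np

# Each word contributes its two corner coordinates plus its text length.
words = [("Total", (34, 60, 92, 78)), ("1,204", (120, 60, 168, 78))]

def vertex_features(words):
    feats = []
    for text, (x0, y0, x1, y1) in words:
        # [upper-left x, upper-left y, bottom-right x, bottom-right y, word length]
        feats.append([x0, y0, x1, y1, len(text)])
    return np.asarray(feats, dtype=np.float32)

F = vertex_features(words)   # shape: (num_vertices, 5)
```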
IV-A Convolutional neural network
A convolutional neural network (Φ_cnn) takes the image (I) as input and generates the corresponding convolutional features (C ∈ ℝ^(h'×w'×c'), where h', w' and c' are the height, width and number of channels of the convolutional feature map respectively). To keep the parameter count low, we have designed a shallow CNN; however, any standard architecture can be used in its place. At the output of the CNN, a gather operation (Λ_gather) is performed to collect the convolutional features of each word corresponding to its spatial position in the image, forming the gathered features (F_conv). Since convolutional neural networks are translation-equivariant, this operation works well. If the spatial dimensions of the output feature map differ from those of the input image (for instance, in our case, they were scaled down), the gather positions are linearly scaled by the ratio between the input and output dimensions. The convolutional features are appended to the rest of the vertex features (F_vertex).
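The gather operation can be sketched as follows, assuming a feature map that is spatially downscaled relative to the input image. The function name and the use of box centres as gather points are our assumptions, not details from the paper:

```python
import numpy as np

def gather_features(conv_map, boxes, img_h, img_w):
    """conv_map: (H', W', C') feature map; boxes: (N, 4) [x0, y0, x1, y1] in image pixels."""
    fh, fw, _ = conv_map.shape
    # Scale each word's centre from image coordinates to feature-map coordinates.
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0 * fw / img_w
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0 * fh / img_h
    cols = np.clip(cx.astype(int), 0, fw - 1)
    rows = np.clip(cy.astype(int), 0, fh - 1)
    return conv_map[rows, cols]   # (N, C'): one feature vector per word

# 8x8x3 feature map for a 64x64 image (8x downscaling).
conv_map = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
boxes = np.array([[0, 0, 16, 16], [48, 48, 64, 64]], dtype=np.float32)
g = gather_features(conv_map, boxes, img_h=64, img_w=64)   # (2, 3)
```

The gathered vectors would then be concatenated with the positional and textual features to form the full vertex features.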
IV-B Interaction
After gathering all the vertex features, they are passed as input to the interaction model (Φ_int). We have tested two graph neural networks for the interaction part, which are modified versions of [35] and [36] respectively. These modified networks are referred to as DGCNN* and GravNet* hereafter. In addition to these two, we have also tested a baseline dense network (dubbed FCNN, for Fully Connected Neural Network) with approximately the same number of parameters, to show that the graph-based models perform better. For these three models, we have limited the total parameter count to the same budget for a fair comparison; this count also includes the parameters of the preceding CNN and the succeeding classification network. As output, we get representative features (F_rep ∈ ℝ^(N×d), d being the number of representative features) for each vertex, which are used for classification.
IV-C Runtime pair sampling
Classifying every word pair is a memory-intensive operation with a memory complexity of O(N²). Since this scales linearly with the batch size as well, the memory requirements grow even further. To address this, we employ Monte Carlo based sampling. The index sampling function is denoted by Λ_sample; it generates a fixed number of samples (s) for each vertex for each of the three problems (cell sharing, row sharing and column sharing).
Uniform sampling is highly biased towards class 0. Since we cannot use a large batch size due to memory constraints, the resulting statistics are insufficient to differentiate between the two classes. To deal with this issue, we changed the sampling distribution to sample, on average, an equal number of elements of class 0 and class 1 for each vertex. This can easily be done in a vectorized fashion, as shown in the algorithm; note that 𝟙 in the algorithm denotes the all-one matrix. Separate sets of samples are collected for each of the three problems for each vertex (S_cell, S_row, S_col). The values in these matrices represent the indices of the paired samples for each vertex. During inference, however, we do not need to sample since we do not need the mini-batch approach; we simply consider every vertex pair. Hence, s equals the chosen number of samples during training and s = N during inference.
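The balanced sampling idea can be sketched in vectorized form roughly as follows. This is our interpretation, not the paper's exact algorithm; in particular, mixing two per-vertex distributions with equal weight is one simple way to equalize the classes on average:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pairs(adj, num_samples):
    """adj: (N, N) 0/1 adjacency; returns (N, num_samples) indices of sampled partners."""
    n = adj.shape[0]
    # Per-vertex distribution: half the mass on adjacent (class 1) partners,
    # half on non-adjacent (class 0) partners, so both classes are drawn
    # equally often on average regardless of the class imbalance.
    pos = adj / np.maximum(adj.sum(1, keepdims=True), 1)
    neg = (1 - adj) / np.maximum((1 - adj).sum(1, keepdims=True), 1)
    probs = 0.5 * pos + 0.5 * neg
    probs /= probs.sum(1, keepdims=True)
    return np.stack([rng.choice(n, size=num_samples, p=probs[i]) for i in range(n)])

adj = np.array([[1, 1, 0, 0],
                [1, 1, 0, 0],
                [0, 0, 1, 1],
                [0, 0, 1, 1]])
samples = sample_pairs(adj, num_samples=6)   # (4, 6) partner indices per vertex
```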
IV-D Classification
After sampling, elements of the output feature matrix of the interaction model (F_rep) and elements of the sampling matrices are concatenated with each other. This is followed by three separate classification networks (Φ_cell, Φ_row and Φ_col). As output, we get three sets of logits (L_cell, L_row and L_col). These can be used to compute the loss and backpropagate through the network, or to predict the classes and form the resultant adjacency matrices.
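A rough sketch of the pairwise classification head, where a single random linear layer stands in for the three trained networks and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)

# For every sampled pair (i, j), concatenate the two representative feature
# vectors and score two classes (not adjacent / adjacent).
N, F, S = 4, 8, 3
features = rng.normal(size=(N, F))            # interaction output, one row per vertex
samples = rng.integers(0, N, size=(N, S))     # sampled partner indices per vertex

pairs = np.concatenate(
    [np.repeat(features, S, axis=0),          # feature of vertex i, repeated per sample
     features[samples.ravel()]], axis=1)      # feature of sampled partner j
W = rng.normal(size=(2 * F, 2))               # stand-in for a trained dense layer
logits = pairs @ W                            # (N * S, 2)
pred = logits.argmax(1).reshape(N, S)         # predicted class per sampled pair
```

In training, the logits would feed a cross-entropy loss against the sampled entries of the ground-truth adjacency matrix; at inference, predictions over all pairs fill the predicted adjacency matrix.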
Table I: True positive rates (%) per category, computed individually for the cell-, row- and column-sharing graphs.

Method   | Category 1         | Category 2         | Category 3         | Category 4
         | Cells  Rows  Cols  | Cells  Rows  Cols  | Cells  Rows  Cols  | Cells  Rows  Cols
FCNN     | 99.9   99.9  99.6  | 99.9   99.6  99.4  | 99.8   96.8  87.6  | 99.9   97.7  90.0
GravNet* | 99.8   100   99.7  | 99.8   99.9  99.5  | 99.2   95.7  86.2  | 99.6   96.8  90.5
DGCNN*   | 99.8   99.9  100   | 99.9   99.9  99.8  | 99.9   98.1  94.1  | 99.8   99.1  94.3
Table II: False positive rates (%) per category, computed individually for the cell-, row- and column-sharing graphs.

Method   | Category 1         | Category 2         | Category 3         | Category 4
         | Cells  Rows  Cols  | Cells  Rows  Cols  | Cells  Rows  Cols  | Cells  Rows  Cols
FCNN     | 0.01   0.5   16.7  | 0.06   3.05  14.1  | 0.12   10.4  32.1  | 0.04   6.31  26.4
GravNet* | 0.15   0.18  6.56  | 0.17   0.79  9.01  | 0.58   10.2  33.6  | 0.28   7.81  25.28
DGCNN*   | 0.07   0.46  0.79  | 0.08   0.22  1.09  | 0.06   4.8   14.6  | 0.07   4.39  10.6
Table III: Percentage of perfectly matched tables per category.

Method   | Category 1 | Category 2 | Category 3 | Category 4
FCNN     | 42.4       | 54.6       | 10.9       | 31.9
GravNet* | 65.6       | 58.6       | 13.1       | 31.5
DGCNN*   | 96.9       | 94.7       | 52.9       | 68.5
V Results
Shahab et al. [37] defined a set of metrics for proper evaluation of the results of table parsing and detection. They defined criteria for correct and partial detection and a heuristic for labeling elements as under-segmented, over-segmented and missed. Among their criteria, two are the most relevant to our case: the percentage of ground-truth elements that are detected correctly (true positive rate) and the number of predicted elements which have no match in the ground truth (false positive rate). In our case, as argued in Section III, the elements are cliques, so the true positive rate and false positive rate are computed on all three graphs (cells, rows and columns) individually. These rates are averaged over the whole test set. The results are shown in Table I and Table II.
In addition, we introduce another measure, perfect matching, shown in Table III. If all three predicted adjacency matrices perfectly match the corresponding ground-truth matrices, the parsed table is labeled as end-to-end accurate. This is a harsh metric, but it shows how misleading statistics computed on individual table elements can be, given the large class imbalance. Simple heuristics can be applied to significantly improve the end-to-end performance (e.g. by using the fact that cells' subgraphs are connected components).
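The perfect-matching metric can be sketched directly; this is a minimal illustration with our own names, not the evaluation code used in the paper:

```python
import numpy as np

def perfect_match(pred, truth):
    """pred, truth: dicts mapping 'cells'/'rows'/'cols' to 0/1 adjacency matrices.
    A table counts as end-to-end correct only if all three matrices match exactly."""
    return all(np.array_equal(pred[k], truth[k]) for k in ("cells", "rows", "cols"))

eye = np.eye(2, dtype=int)
truth = {"cells": eye, "rows": eye, "cols": 1 - eye}
exact = perfect_match(truth, truth)        # identical prediction
wrong = dict(truth, rows=1 - eye)
off_by_one = perfect_match(wrong, truth)   # a single wrong matrix fails the table
```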
As expected, since there is no interaction between vertices in the FCNN, it performs worse than the graph models. Note, however, that the vertices in this network are not completely segregated: they can still communicate through the convolutional neural network part. This strengthens the case for introducing graph neural networks further into document analysis.
Category 3 tables show comparatively worse results than category 4 tables. This is because category 4 images also include images from categories 1 and 2, to study the effect of perspective distortion on simpler tables. We conclude that while the graph networks struggle with merged rows and columns, they handle perspective distortions gracefully.
VI Conclusion and future work
In this work, we redefined the structural analysis problem using a graph model. We demonstrated our results on the problem of table parsing and argued how several other problems can also be defined using this model. Convolutional neural networks are best suited to extracting representative image features, while graph networks excel at fast message passing between vertices; we have shown how to combine these two abilities using the gather operation. So far, we only used positional features for the vertices, but for a real-world dataset, natural language features such as GloVe [34] can also be used. There is a lot of room for improving the performance of graph neural networks, not just in accuracy but also in compute requirements. In conclusion, graph neural networks work well for structural analysis problems and we expect to see much more research on related problems.
Acknowledgment
This project was done as part of the TUKL-SEECS lab. We would like to thank the Optimal lab at SEECS for providing us access to their GPU machines for training purposes.
References
 [1] T. Kieninger and A. Dengel, “The T-Recs table recognition and analysis system,” vol. 1655, Nov. 1998, pp. 255–269.
 [2] Y. Wang, I. Phillips, and R. Haralick, “Automatic table ground truth generation and a background-analysis-based table structure extraction method,” in Proceedings of Sixth International Conference on Document Analysis and Recognition. IEEE, 2001, pp. 528–532.
 [3] J. Hu, R. S. Kashi, D. P. Lopresti, and G. Wilfong, “Medium-independent table detection,” in Document Recognition and Retrieval VII, vol. 3967. International Society for Optics and Photonics, 1999, pp. 291–303.
 [4] B. Gatos, D. Danatsas, I. Pratikakis, and S. J. Perantonis, “Automatic table detection in document images,” in International Conference on Pattern Recognition and Image Analysis. Springer, 2005, pp. 609–618.
 [5] S. Tupaj, Z. Shi, C. H. Chang, and H. Alam, “Extracting tabular information from text files,” EECS Department, Tufts University, Medford, USA, 1996.
 [6] R. Zanibbi, D. Blostein, and J. R. Cordy, “A survey of table recognition,” Document Analysis and Recognition, vol. 7, no. 1, pp. 1–16, 2004.
 [7] I. A. Doush and E. Pontelli, “Detecting and recognizing tables in spreadsheets,” in Proceedings of the 9th IAPR International Workshop on Document Analysis Systems. ACM, 2010, pp. 471–478.
 [8] F. Shafait and R. Smith, “Table detection in heterogeneous documents,” in Proceedings of the 9th IAPR International Workshop on Document Analysis Systems. ACM, 2010, pp. 65–72.
 [9] J. Chen and D. Lopresti, “Table detection in noisy off-line handwritten documents,” Sep. 2011, pp. 399–403.
 [10] T. Kasar, P. Barlas, S. Adam, C. Chatelain, and T. Paquet, “Learning to detect tables in scanned document images using line information,” in 2013 12th International Conference on Document Analysis and Recognition. IEEE, 2013, pp. 1185–1189.
 [11] L. Hao, L. Gao, X. Yi, and Z. Tang, “A table detection method for PDF documents based on convolutional neural networks,” in 2016 12th IAPR Workshop on Document Analysis Systems (DAS). IEEE, 2016, pp. 287–292.
 [12] S. F. Rashid, A. Akmal, M. Adnan, A. Adnan Aslam, and A. Dengel, “Table recognition in heterogeneous documents using machine learning,” Nov. 2017, pp. 777–782.
 [13] A. Gilani, S. R. Qasim, I. Malik, and F. Shafait, “Table detection using deep learning,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, 2017, pp. 771–776.
 [14] R. Girshick, “Fast R-CNN,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
 [15] S. Schreiber, S. Agne, I. Wolf, A. Dengel, and S. Ahmed, “DeepDeSRT: Deep learning for detection and structure recognition of tables in document images,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, 2017, pp. 1162–1167.
 [16] S. Arif and F. Shafait, “Table detection in document images using foreground and background features,” Dec. 2018, pp. 1–8.
 [17] S. A. Siddiqui, M. I. Malik, S. Agne, A. Dengel, and S. Ahmed, “DeCNT: Deep deformable CNN for table detection,” IEEE Access, vol. 6, pp. 74151–74161, 2018.
 [18] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct. 2017, pp. 764–773.
 [19] D. He, S. Cohen, B. Price, D. Kifer, and C. L. Giles, “Multi-scale multi-task FCN for semantic page segmentation and table detection,” Nov. 2017, pp. 254–261.
 [20] I. Kavasidis, S. Palazzo, C. Spampinato, C. Pino, D. Giordano, D. Giuffrida, and P. Messina, “A saliency-based convolutional neural network for table and chart detection in digitized documents,” Apr. 2018.
 [21] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
 [22] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017, pp. 1263–1272.
 [23] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
 [24] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” arXiv preprint arXiv:1511.05493, 2015.
 [25] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., “Relational inductive biases, deep learning, and graph networks,” arXiv preprint arXiv:1806.01261, 2018.
 [26] J. Liang, I. T. Phillips, and R. M. Haralick, “An optimization methodology for document structure extraction on Latin character documents,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 7, pp. 719–734, 2001.
 [27] X. Wang, “Tabular abstraction, editing, and formatting,” 1996.
 [28] J. Hu, R. S. Kashi, D. P. Lopresti, and G. Wilfong, “Table structure recognition and its evaluation,” in Document Recognition and Retrieval VIII, vol. 4307. International Society for Optics and Photonics, 2000, pp. 44–56.
 [29] E. Koci, M. Thiele, W. Lehner, and O. Romero, “Table recognition in spreadsheets via a graph representation,” in 2018 13th IAPR International Workshop on Document Analysis Systems (DAS). IEEE, 2018, pp. 139–144.
 [30] H. Bunke and K. Riesen, “Recent advances in graph-based pattern recognition with applications in document analysis,” Pattern Recognition, vol. 44, pp. 1057–1067, May 2011.
 [31] “Table ground truth for the UW3 and UNLV datasets (DFKI-TGT-2010),” http://tc11.cvc.uab.es/datasets/DFKITGT2010_1.
 [32] M. Göbel, T. Hassan, E. Oro, and G. Orsi, “ICDAR 2013 table competition,” in 2013 12th International Conference on Document Analysis and Recognition. IEEE, 2013, pp. 1449–1453.
 [33] C. Bron and J. Kerbosch, “Algorithm 457: finding all cliques of an undirected graph,” Communications of the ACM, vol. 16, no. 9, pp. 575–577, 1973.
 [34] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543. [Online]. Available: http://www.aclweb.org/anthology/D14-1162
 [35] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph CNN for learning on point clouds,” arXiv preprint arXiv:1801.07829, 2018.
 [36] S. R. Qasim, J. Kieseler, Y. Iiyama, and M. Pierini, “Learning representations of irregular particle-detector geometry with distance-weighted graph networks,” 2019. [Online]. Available: http://arxiv.org/abs/1902.07987
 [37] A. Shahab, F. Shafait, T. Kieninger, and A. Dengel, “An open approach towards the benchmarking of table structure recognition systems,” in Proceedings of the 9th IAPR International Workshop on Document Analysis Systems. ACM, 2010, pp. 113–120.