Image-based table recognition: data, model, and evaluation
Important information that relates to a specific topic in a document is often
organized in tabular format to assist readers with information retrieval and
comparison, which may be difficult to achieve in natural language. However,
tabular data in unstructured digital documents, e.g. Portable Document Format
(PDF) and images, are difficult to parse into structured machine-readable
format, due to the complexity and diversity of their structure and style. To
facilitate image-based table recognition with deep learning, we develop and
release PubTabNet, the largest publicly available table recognition dataset,
containing over 568k table images annotated with the corresponding HTML
representation of the tables. We also develop a novel attention-based
encoder-dual-decoder (EDD) architecture that recognizes both the structure and
the cell content of tables, and propose a new tree-edit-distance-based
similarity (TEDS) metric for evaluating image-based table recognition.
Information in tabular format is prevalent in all sorts of documents. Compared to natural language, tables provide a way to summarize large quantities of data in a more compact and structured format. Tables also provide a format that assists readers with finding and comparing information. An example of the relevance of tabular information in the biomedical domain is the curation of genetic databases, in which only between 2% and 8% of the information was available in the narrative part of the article, compared to the information available in tables or files in tabular format.
Tables in documents are typically formatted for human understanding, and humans are generally adept at parsing table structure, identifying table headers, and interpreting relations between table cells. However, it is challenging for a machine to understand tabular data in unstructured formats (e.g. PDF, images) due to the large variability in their layout and style. The key step of table understanding is to represent the unstructured tables in a machine-readable format, where the structure of the table and the content within each cell are encoded according to a pre-defined standard. This is often referred to as table recognition.
This paper solves the following three problems in image-based table recognition, where the structured representations of tables are reconstructed solely from image input:
Data We provide a large-scale dataset PubTabNet, which consists of over 568k images of heterogeneous tables extracted from the scientific articles (in PDF format) contained in the PubMed Central Open Access Subset (PMCOA). By matching the metadata of the PDFs with the associated structured representation (provided by PMCOA in XML format), we automatically annotate each table image with information about both the structure of the table and the text within each cell (in HTML format).
Model We develop a novel attention-based encoder-dual-decoder (EDD) architecture (see Fig. 1) which consists of an encoder, a structure decoder, and a cell decoder. The encoder captures the visual features of input table images. The structure decoder reconstructs table structure and helps the cell decoder to recognize cell content. Our EDD model is trained on PubTabNet and demonstrates superior performance compared to existing table recognition methods. The error analysis shows potential enhancements to the current EDD model for improved performance.
Evaluation By modeling tables as a tree structure, we propose a new tree-edit-distance-based evaluation metric for image-based table recognition. We demonstrate that our new metric is superior to the metric commonly used in the literature and in competitions.
II Related work
Research on analyzing tabular data in unstructured documents focuses mainly on
three problems: i) table detection: localizing the
bounding boxes of tables in documents, ii) table structure recognition:
parsing only the structural (row and column layout) information of tables, and
iii) table recognition: parsing both the structural information and
content of table cells. Table I compares the datasets that have
been developed to address one or more of these three problems. The PubTabNet
dataset and the EDD model we develop in this paper aim at the image-based table
recognition problem. Compared to other existing datasets for table recognition, PubTabNet offers the following advantages:
- The tables are typeset by the publishers of over 6,000 journals in PMCOA, which offers considerably more diversity in table styles than other table datasets.
- Cells are categorized into headers and body cells, which is important when retrieving information from tables.
- The format of the targeted output is HTML, which can be directly integrated into web applications. In addition, tables in HTML format are represented as a tree structure (illustrated below). This enables the new tree-edit-distance-based evaluation metric that we propose in Section V.
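To make the tree view concrete, here is a minimal, hand-written illustration (not an actual PubTabNet sample) of the HTML ground-truth format, parsed with Python's standard library and printed as a tree; the table content is invented.

```python
# Headers live under <thead>, body cells under <tbody>, and spanning
# cells carry rowspan/colspan attributes, as described above.
import xml.etree.ElementTree as ET

html_gt = (
    '<table>'
    '<thead><tr><td rowspan="2">Name</td><td colspan="2">Score</td></tr>'
    '<tr><td>min</td><td>max</td></tr></thead>'
    '<tbody><tr><td>A</td><td>0.1</td><td>0.9</td></tr></tbody>'
    '</table>'
)

def walk(node, depth=0):
    """Print the tree rooted at `node`, one line per node."""
    print("  " * depth + node.tag, dict(node.attrib))
    for child in node:
        walk(child, depth + 1)

walk(ET.fromstring(html_gt))  # the HTML table is literally a tree
```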
Traditional table detection and recognition methods rely on pre-defined rules [11, 12, 13, 14, 15, 16] and statistical machine learning [17, 18, 19, 20, 21]. Recently, deep learning has exhibited strong performance in image-based table detection and structure recognition. Hao et al. used a set of primitive rules to propose candidate table regions and a convolutional neural network to determine whether the regions contain a table. Fully-convolutional neural networks, followed by a conditional random field, have also been used for table detection [23, 24, 25]. In addition, deep neural networks for object detection, such as Faster-RCNN, Mask-RCNN, and YOLO, have been exploited for table detection and row/column segmentation [29, 30, 31, 7]. Furthermore, graph neural networks have been used for table detection and recognition by encoding document images as graphs [5, 32].
There are several tools (see Table II) that can convert tables in text-based PDF format into structured representations. However, there is limited work on image-based table recognition. The attention-based encoder-decoder architecture was first proposed by Xu et al. for image captioning. Deng et al. extended it by adding a recurrent layer in the encoder that captures long horizontal spatial dependencies, in order to convert images of mathematical formulas into LaTeX representation. The same model was trained on the Table2Latex dataset to convert table images into LaTeX representation. As shown in previous work and in our experimental results (see Table II), the efficacy of this model on image-based table recognition is mediocre.
This paper considerably improves the performance of the attention-based encoder-decoder method on image-based table recognition with a novel EDD architecture. Our model differs from other existing encoder-dual-decoder architectures [35, 36], where the dual decoders are independent of each other. In our model, the cell decoder is triggered only when the structure decoder generates a new cell. Meanwhile, the hidden state of the structure decoder is sent to the cell decoder to help it place its attention on the corresponding cell in the table image.
The evaluation metric commonly used in table recognition literature and competitions first flattens the ground truth and the recognition result of a table into a list of pairwise adjacency relations between non-empty cells. Precision, recall, and F1-score can then be computed by comparing the two lists. This metric is simple but has two obvious problems: 1) as it only checks immediate adjacency relations between non-empty cells, it cannot detect errors caused by empty cells or by misalignment of cells beyond immediate neighbors; 2) as it checks relations by exact match of cell content, it cannot measure fine-grained cell content recognition errors.
III Automatic generation of PubTabNet
PMCOA contains over one million scientific articles in both unstructured (PDF) and structured (XML) formats. A large table recognition dataset can be automatically generated if the corresponding location of the table nodes in the XML can be found in the PDF. Zhong et al. proposed an algorithm to match the XML and PDF representations of the articles in PMCOA, which automatically generated the PubLayNet dataset for document layout analysis. We use their algorithm to extract the table regions from the PDF for the table nodes in the XML. The table regions are converted to images at a resolution of 72 pixels per inch (PPI). We use this low PPI setting to relax the requirement of our model for high-resolution input images. For each table image, the corresponding table node (in HTML format) is extracted from the XML as the ground truth annotation.
We observed that the algorithm generates erroneous bounding boxes for some tables, hence we use a heuristic to automatically verify the bounding boxes. For each annotation, the text within the bounding box is extracted from the PDF and compared with the text in the annotation. The bounding box is considered correct if the cosine similarity of the term frequency-inverse document frequency (Tf-idf) features of the two texts is greater than 90% and the lengths of the two texts differ by less than 10%. In addition, to improve the learnability of the data, we remove rare tables which contain any cell that spans over 10 rows or 10 columns, or any character that occurs less than 50 times across all the tables. Tables whose annotation contains math or inline-formula nodes are also removed, as we found they do not have a consistent XML representation.
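A minimal sketch of this verification heuristic, assuming the two texts have already been extracted; the relative denominator in the length check is our assumption, since the paper does not specify it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def box_is_valid(pdf_text: str, annot_text: str) -> bool:
    # Reject if the text lengths differ by more than 10%
    # (relative to the longer text -- an assumption).
    if abs(len(pdf_text) - len(annot_text)) > 0.1 * max(len(pdf_text), len(annot_text)):
        return False
    # Accept if the cosine similarity of Tf-idf features exceeds 90%.
    tfidf = TfidfVectorizer().fit_transform([pdf_text, annot_text])
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0] > 0.9
```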
After filtering the table samples, we curate the HTML code of the tables to remove unnecessary variations. First, we remove the nodes and attributes that are not reconstructable from the table image, such as hyperlinks and definitions of acronyms. Second, table header cells are defined as th nodes in some tables, but as td nodes in others. We unify the definition of header cells as td nodes, which preserves the header identity of the cells, as they are still descendants of the thead node. Third, all the attributes except 'rowspan' and 'colspan' in td nodes are stripped, since they control the appearance of the tables in web browsers, which may not match the table image. These curations lead to consistent and clean HTML code and make the data more learnable.
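A sketch of these curation steps, using BeautifulSoup as one possible parser (the paper does not specify the tooling):

```python
from bs4 import BeautifulSoup

def curate(table_html: str) -> str:
    soup = BeautifulSoup(table_html, "html.parser")
    for a in soup.find_all("a"):      # drop non-reconstructable nodes,
        a.unwrap()                    # e.g. hyperlinks: keep text only
    for th in soup.find_all("th"):    # unify header cells as <td>; they
        th.name = "td"                # remain descendants of <thead>
    for td in soup.find_all("td"):    # keep only the span attributes
        td.attrs = {k: v for k, v in td.attrs.items()
                    if k in ("rowspan", "colspan")}
    return str(soup)
```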
Finally, the samples are randomly partitioned into 60%/20%/20% training/development/test sets. The training set contains 548,592 samples. As only a small proportion of tables contain spanning (multi-column or multi-row) cells, evaluation on the raw development and test sets would be strongly biased towards tables without spanning cells. To better evaluate how a model performs on complex table structures, we create more balanced development and test sets by randomly drawing 5,000 tables with spanning cells and 5,000 tables without spanning cells from the corresponding raw set.
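A sketch of how the balanced splits could be drawn, assuming each sample is an (image, html) pair and spanning cells can be detected from the span attributes:

```python
import random

def has_spanning_cell(html: str) -> bool:
    # Spanning cells are the only ones carrying rowspan/colspan.
    return "rowspan" in html or "colspan" in html

def balanced_split(samples, n_per_group=5000, seed=0):
    rng = random.Random(seed)
    complex_t = [s for s in samples if has_spanning_cell(s[1])]
    simple_t = [s for s in samples if not has_spanning_cell(s[1])]
    return rng.sample(complex_t, n_per_group) + rng.sample(simple_t, n_per_group)
```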
IV Encoder-dual-decoder (EDD) model
Fig. 1 shows the architecture of the EDD model, which consists of an encoder, an attention-based structure decoder, and an attention-based cell decoder. The use of two decoders is inspired by two intuitive considerations: i) table structure recognition and cell content recognition are two distinctly different tasks, and it is not effective to solve both at the same time using a single attention-based decoder; ii) information in the structure recognition task can be helpful for locating the cells that need to be recognized. The encoder is a convolutional neural network (CNN) that captures the visual features of input table images. The structure decoder and the cell decoder are recurrent neural networks (RNN) with the attention mechanism proposed by Xu et al. The structure decoder only generates the HTML tags that define the structure of the table. When the structure decoder recognizes a new cell, the cell decoder is triggered and uses the hidden state of the structure decoder to compute the attention for recognizing the content of the new cell. This ensures a one-to-one match between the cells generated by the structure decoder and the sequences generated by the cell decoder. The outputs of the two decoders can be easily merged to obtain the final HTML representation of the table.
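The control flow of the two decoders can be summarized in a heavily simplified, runnable PyTorch sketch under teacher forcing. Everything here is a placeholder rather than the authors' implementation: the tiny CNN stands in for the real encoder, a globally pooled feature replaces soft-attention, only one trigger token is checked, and the batch size is fixed to one. What the sketch does reproduce is the coupling described above: the cell decoder fires once per opened cell and is initialized from the structure decoder's hidden state.

```python
import torch
import torch.nn as nn

class EDDSketch(nn.Module):
    """Toy dual-decoder: a structure RNN emits structural tokens and, each
    time it opens a cell, hands its hidden state to a cell RNN."""
    def __init__(self, n_struct_tokens, n_cell_tokens, feat=64):
        super().__init__()
        self.encoder = nn.Sequential(             # stand-in for ResNet-18
            nn.Conv2d(3, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.s_embed = nn.Embedding(n_struct_tokens, 16)
        self.c_embed = nn.Embedding(n_cell_tokens, 80)
        self.s_rnn = nn.LSTMCell(feat + 16, 256)   # structure decoder
        self.c_rnn = nn.LSTMCell(feat + 80, 512)   # cell decoder
        self.s_out = nn.Linear(256, n_struct_tokens)
        self.c_out = nn.Linear(512, n_cell_tokens)
        self.s2c = nn.Linear(256, 512)             # structure hidden -> cell init

    def forward(self, img, struct_gt, cells_gt, td_token):
        # img: (1, 3, H, W); struct_gt: (1, T); cells_gt: list of (1, U) tensors.
        ctx = self.encoder(img)                                 # (1, feat)
        hs = torch.zeros(1, 256, device=img.device)
        cs = torch.zeros_like(hs)
        s_logits, c_logits, cell_idx = [], [], 0
        for t in range(struct_gt.size(1)):
            x = torch.cat([ctx, self.s_embed(struct_gt[:, t])], dim=1)
            hs, cs = self.s_rnn(x, (hs, cs))
            s_logits.append(self.s_out(hs))
            if struct_gt[0, t].item() == td_token:              # new cell opened:
                hc = self.s2c(hs)                               # trigger the cell
                cc = torch.zeros_like(hc)                       # decoder from the
                cell = cells_gt[cell_idx]; cell_idx += 1        # structure state
                for u in range(cell.size(1)):
                    xc = torch.cat([ctx, self.c_embed(cell[:, u])], dim=1)
                    hc, cc = self.c_rnn(xc, (hc, cc))
                    c_logits.append(self.c_out(hc))
        return s_logits, c_logits
```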
As the structure and the content of an input table image are recognized separately by the two decoders, during training the ground truth HTML representation of the table is tokenized into structural tokens and cell tokens, as shown in Fig. 2. Structural tokens comprise the HTML tags that control the structure of the table. For spanning cells, the opening tag is broken down into multiple tokens: '<td', the 'rowspan' or 'colspan' attribute, and '>'. The content of cells is tokenized at the character level, where embedded HTML tags are treated as single tokens.
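The following sketch shows one way this tokenization could be implemented; the regular expressions and helper names are ours, not the authors'.

```python
import re

def tokenize_structure(tags):
    """`tags` is the HTML skeleton, e.g. '<tr><td rowspan="2"></td></tr>'."""
    tokens = []
    for tag in re.findall(r"<[^>]*>", tags):
        m = re.match(r'<td( rowspan="\d+"| colspan="\d+")+>', tag)
        if m:  # spanning cell: '<td', one token per span attribute, then '>'
            tokens.append("<td")
            tokens.extend(re.findall(r' (?:row|col)span="\d+"', tag))
            tokens.append(">")
        else:
            tokens.append(tag)
    return tokens

def tokenize_cell(content):
    """Character-level tokens, but embedded HTML tags such as <b> stay whole."""
    return re.findall(r"<[^>]*>|.", content)

print(tokenize_structure('<tr><td rowspan="2"></td><td></td></tr>'))
# ['<tr>', '<td', ' rowspan="2"', '>', '</td>', '<td>', '</td>', '</tr>']
print(tokenize_cell("<b>x</b>1"))  # ['<b>', 'x', '</b>', '1']
```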
Two loss functions can be computed from the EDD network: i) the cross-entropy loss of generating the structural tokens ($l_s$); and ii) the cross-entropy loss of generating the cell tokens ($l_c$). The overall loss $l$ of the EDD network is calculated as

$$l = \lambda \, l_s + (1 - \lambda) \, l_c,$$

where $\lambda \in [0, 1]$ is a hyper-parameter.
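As a minimal sketch, assuming both terms are standard token-level cross-entropies over stacked time steps (the default value of `lambda_` below is an arbitrary placeholder, not the paper's setting):

```python
import torch.nn.functional as F

def edd_loss(struct_logits, struct_gt, cell_logits, cell_gt, lambda_=0.5):
    l_s = F.cross_entropy(struct_logits, struct_gt)  # structural tokens
    l_c = F.cross_entropy(cell_logits, cell_gt)      # cell tokens
    return lambda_ * l_s + (1 - lambda_) * l_c
```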
V Tree-edit-distance-based similarity (TEDS)
Tables are represented as a tree structure in the HTML format. The root has two children, thead and tbody, which group table headers and table body cells, respectively. The children of thead and tbody nodes are table rows (tr). The leaves of the tree are table cells (td). Each cell node has three attributes, i.e. 'colspan', 'rowspan', and 'content'. We measure the similarity between two tables using the tree-edit distance proposed by Pawlik and Augsten. The cost of insertion and deletion operations is 1. When the edit is substituting a node $n_o$ with a node $n_s$, the cost is 1 if either $n_o$ or $n_s$ is not td. When both $n_o$ and $n_s$ are td, the substitution cost is 1 if the column span or the row span of $n_o$ and $n_s$ is different. Otherwise, the substitution cost is the normalized Levenshtein similarity between the content of $n_o$ and $n_s$. Finally, TEDS between two trees $T_a$ and $T_b$ is computed as

$$\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max(|T_a|, |T_b|)},$$

where $\mathrm{EditDist}$ denotes the tree-edit distance, and $|T|$ is the number of nodes in $T$. The table recognition performance of a method on a set of test samples is defined as the mean of the TEDS score between the recognition result and the ground truth of each sample.
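One possible implementation of TEDS, sketched with the `apted` tree-edit-distance package; the node class and helper names are ours. We read the non-td rule above as charging 1 only when the tags differ (so that identical trees score exactly 1.0), and the content rule as a cost of 1 minus the normalized Levenshtein similarity, so that identical content costs 0.

```python
from apted import APTED, Config

class TableNode:
    def __init__(self, tag, colspan=1, rowspan=1, content="", children=None):
        self.tag, self.colspan, self.rowspan = tag, colspan, rowspan
        self.content, self.children = content, children or []

def lev_similarity(a, b):
    """Normalized Levenshtein similarity between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return 1 - prev[-1] / max(len(a), len(b), 1)

class TEDSConfig(Config):
    def rename(self, n1, n2):  # substitution cost
        if n1.tag != "td" or n2.tag != "td":
            return float(n1.tag != n2.tag)
        if (n1.colspan, n1.rowspan) != (n2.colspan, n2.rowspan):
            return 1.0
        return 1.0 - lev_similarity(n1.content, n2.content)

    def children(self, node):
        return node.children

def size(node):
    return 1 + sum(size(c) for c in node.children)

def teds(t_a, t_b):
    dist = APTED(t_a, t_b, TEDSConfig()).compute_edit_distance()
    return 1.0 - dist / max(size(t_a), size(t_b))
```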
In order to verify that TEDS solves the two problems of the adjacency relation metric described in Section II, we add two types of perturbations to the validation set of PubTabNet and examine how TEDS and the adjacency relation metric respond to the perturbations.
To demonstrate the empty-cell and multi-hop misalignment issue, we shift some cells in the first row downwards and pad the leftover space with empty cells. The shift distance of a cell is proportional to its column index. We tested 5 perturbation levels, i.e., 10%, 30%, 50%, 70%, or 90% of the cells in the first row are shifted. Fig. 5 shows a perturbed example, where 90% of the cells in the first row are shifted.
To demonstrate the fine-grained cell content recognition issue, we randomly change some characters into different ones. We tested 5 perturbation levels, i.e., the chance that a character gets modified is set to 10%, 30%, 50%, 70%, or 90%. Fig. 8 shows an example at the 10% perturbation level.
Fig. 11 illustrates how TEDS and the adjacency relation F1-score respond to the two types of perturbations at different levels. The adjacency relation metric under-reacts to the cell shift perturbation. At the 90% perturbation level, the table is substantially different from the original (see the example in Fig. 5). However, the adjacency relation F1-score is still nearly 80%. On the other hand, the perturbation causes a 60% drop in TEDS, demonstrating that TEDS is able to capture errors that the adjacency relation metric cannot.
When it comes to cell content perturbations, the adjacency relation metric over-reacts. Even the 10% perturbation level (see the example in Fig. 8) leads to an over 70% decrease in adjacency relation F1-score, which drops close to zero from the 50% perturbation level onwards. In contrast, TEDS decreases roughly linearly from 90% to 40% as the perturbation level increases from 10% to 90%, demonstrating its capability to capture fine-grained cell content recognition errors.
The test performance of the proposed EDD model is compared with five
off-the-shelf tools (Tabula, Traprange, Camelot, PDFPlumber, and Adobe Acrobat®) and with the deep learning model WYGIWYS.
VI-A Implementation details
To avoid exceeding GPU RAM, the EDD model is trained on a subset (399k samples) of the PubTabNet training set, which satisfies upper bounds on the image width and height, the number of structural tokens, and the number of tokens in the longest cell (Equation 3; the thresholds are indicated by the vertical dashed lines in Fig. 18).
Note that samples in the validation and test sets are not constrained by these criteria. The vocabulary size of the structural tokens and the cell tokens of the training data is 32 and 281, respectively. Training images are rescaled to a fixed size to facilitate batching, and each channel is normalized by z-score.
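A sketch of the corresponding input pipeline with torchvision; the target size and the channel statistics below are placeholders, since the paper's exact values are not recoverable here.

```python
import torchvision.transforms as T

preprocess = T.Compose([
    T.Resize((448, 448)),   # fixed size chosen for illustration only
    T.ToTensor(),
    # Per-channel z-score; mean/std here are placeholders, not dataset stats.
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```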
We use the ResNet-18 network as the encoder. The default ResNet-18 model downsamples the image resolution by a factor of 32. We modify the last CNN layer of ResNet-18 to study whether a higher-resolution feature map improves table recognition performance. A total of five different settings are tested in this paper:
- EDD-S2: the default ResNet-18
- EDD-S1: stride of the last CNN layer set to 1
- EDD-S2S2: two independent last CNN layers for the structure (stride=2) and cell (stride=2) decoders
- EDD-S2S1: two independent last CNN layers for the structure (stride=2) and cell (stride=1) decoders
- EDD-S1S1: two independent last CNN layers for the structure (stride=1) and cell (stride=1) decoders
We evaluate the performance of these five settings on the validation set and find that a higher-resolution feature map and independent last CNN layers improve performance. The EDD-S1S1 setting provides the best validation performance and is therefore chosen for comparison with the baselines on the test set.
The structure decoder and the cell decoder are single-layer long short-term memory (LSTM) networks, whose hidden state sizes are 256 and 512, respectively. Both decoders weight the feature map from the encoder with soft-attention, which has a hidden layer of size 256. The embedding dimensions of the structural tokens and the cell tokens are 16 and 80, respectively. At inference time, the outputs of both decoders are sampled with beam search (beam size of 3).
The EDD model is trained with the Adam optimizer in two stages. First, we pre-train the encoder and the structure decoder to generate the structural tokens only ($\lambda = 1$), with a batch size of 10 and a learning rate of 0.001 for the first 10 epochs, reduced by a factor of 10 for another 3 epochs. Then we train the whole EDD network to generate both structural and cell tokens, with a batch size of 8 and a learning rate of 0.001 for 10 epochs and 0.0001 for another 2 epochs. The total training time is about 16 days on two V100 GPUs.
VI-B Quantitative analysis
Table II compares the test performance of the proposed EDD model and the baselines, where the average TEDS is reported separately for simple tables (without multi-column or multi-row cells) and complex tables (with multi-column or multi-row cells).
Table II. Test performance comparison (columns: Input, Method, Average TEDS (%)).
Fig. 12. Recognition results for an example input table: (a) input table; (b) ground truth; (c) EDD; (d) WYGIWYS; (e) Acrobat® on PDF; (f) Acrobat® on image; (g) Tabula; (h) Traprange; (i) Camelot; (j) PDFPlumber. Each panel is labeled with its TEDS score.
VI-C Qualitative analysis
To illustrate the differences in the behavior of the compared methods, Fig. 12 shows the rendering of the predicted HTML for an example input table. The table has 7 columns, 3 header rows, and 4 body rows. The table header has a complex structure, consisting of 4 multi-row (span=3) cells, 2 multi-column (span=3) cells, and 3 normal cells. Our EDD model is able to generate an extremely close match to the ground truth, making no error in structure recognition and a single optical character recognition (OCR) error ('PF' recognized as 'PC'). The second header row is missing in the results of WYGIWYS, which also makes a few errors in the cell content. On the other hand, the off-the-shelf tools make substantially more errors in recognizing the complex structure of the table headers. This demonstrates the limited capability of these tools in recognizing complex tables.
Figs. 13 (a) - (c) illustrate the attention of the structure decoder when processing an example input table. When a new row is recognized ('<tr>' and '</tr>'), the structure decoder focuses its attention around the cells in the row. When the opening tag ('<td>') of a new cell is generated, the structure decoder pays more attention around the cell. For the closing tag '</td>', the attention of the structure decoder spreads across the image. Since '</td>' always follows the '<td>' or '>' token, the structure decoder relies on the language model rather than the encoded feature map to predict it. Fig. 13 (d) shows the aggregated attention of the cell decoder when generating the content of each cell. Compared to the structure decoder, the cell decoder has more focused attention, which falls on the cell content that is being generated.
VI-D Error analysis
We categorize the test set of PubTabNet into 15 equal-interval groups along each of four key properties of table size: width, height, number of structural tokens, and number of tokens in the longest cell. Fig. 18 illustrates the number of tables in each group and the performance of the EDD and WYGIWYS models on each group. The EDD model outperforms the WYGIWYS model on all groups. The performance of both models decreases as table size increases. We train the models with tables that satisfy Equation 3, whose thresholds are indicated with vertical dashed lines in Fig. 18. Except for width, we do not observe a steep decrease in performance near the thresholds. We think the lower performance on larger tables is mainly due to rescaling images for batching, where larger tables are more strongly downsampled. The EDD model may better handle large tables by grouping table images into similar sizes, as in previous work, and using a different rescaling size for each group.
To demonstrate that the EDD model is not only suitable for PubTabNet, but also generalizable to other table recognition datasets, we train and test EDD on the synthetic dataset used to evaluate the TIES model. We did not choose the ICDAR2013 or ICDAR2019 table recognition competition datasets because, as shown in Table I, ICDAR2013 does not provide enough training data and ICDAR2019 does not provide ground truth of cell content (cell position only). We synthesize 500K table images with the corresponding HTML representation for training and evaluation.
We compare the test performance of EDD to the graph neural network model TIES on each table category. We compute the TEDS score only for EDD, as TIES predicts whether two tokens (recognized by an OCR engine from the table image) share the same cell, row, and column, but does not produce an HTML representation of the table.
Table III shows the test performance of EDD and TIES, where EDD achieves an extremely high TEDS score (99.7+%) on all categories of the synthetic dataset. This means EDD is able to almost perfectly reconstruct both the structure and the cell content from the table images. EDD outperforms TIES in terms of exact match on all table categories. In addition, unlike TIES, EDD does not show any significant degradation in performance on categories 3 and 4, in which the samples have a more complex structure. This demonstrates that EDD is much more robust and generalizable than TIES on more difficult examples.
Table III. Test performance of EDD and TIES (columns: Model, Average TEDS (%), Exact match (%)).
This paper presents a comprehensive study of the image-based table recognition problem. A large-scale dataset, PubTabNet, is developed to train and evaluate deep learning models. By separating the table structure recognition and cell content recognition tasks, we propose an attention-based EDD model. The structure decoder not only recognizes the structure of input tables, but also helps the cell decoder to place its attention on the right cell content. We also propose a new evaluation metric, TEDS, which captures the performance of both table structure recognition and cell content recognition. Compared to the traditional adjacency relation metric, TEDS can more appropriately capture multi-hop cell misalignment and OCR errors. The proposed EDD model, when trained on PubTabNet, is effective at recognizing complex table structures and extracting cell content from images. PubTabNet has been made available, and we believe it will accelerate future development in table recognition and provide support for pre-training table recognition models.
Our future work will focus on the following two directions. First, the current PubTabNet dataset does not provide the coordinates of table cells, which we plan to supplement in the next version. This will enable adding an additional branch to the EDD network to also predict cell location. We think this additional task will assist cell content recognition. In addition, when tables are available in text-based PDF format, the cell location can be used to extract cell content directly from the PDF without using OCR, which might improve the overall recognition quality. Second, the EDD model takes table images as input, which implicitly assumes that the accurate location of tables in documents is given by the user. We will investigate how the EDD model can be integrated with table detection neural networks to achieve end-to-end table detection and recognition.
- A recognized adjacency relation is counted as correct only if both cells are identical and the direction matches.
- If the number of rows is greater than the number of columns, we shift the cells in the first column rightwards instead.
- v1.0.4 (https://github.com/tabulapdf/tabula-java)
- v1.0 (https://github.com/thoqbk/traprange)
- v0.7.3 (https://github.com/camelot-dev/camelot)
- v0.6.0-alpha (https://github.com/jsvine/pdfplumber)
- WYGIWYS is trained on the same samples as EDD by truncated back-propagation through time (200 steps). WYGIWYS and EDD use the same CNN in the encoder to rule out the possibility that the performance gain of EDD is due to a difference in the CNN.
- Tables without multi-column or multi-row cells.
- Tables with multi-column or multi-row cells.
- The work that proposed the adjacency relation metric does not describe how the adjacency relations can be converted to a unique HTML representation.
- A. Jimeno Yepes and K. Verspoor, “Literature mining of genetic variants for curation: quantifying the importance of supplementary material,” Database, vol. 2014, 2014.
- M. Göbel, T. Hassan, E. Oro, and G. Orsi, “ICDAR 2013 table competition,” in 2013 12th International Conference on Document Analysis and Recognition. IEEE, 2013, pp. 1449–1453.
- M. Hurst, “A constraint-based approach to table structure derivation,” in Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2003.
- Y. Deng, D. Rosenberg, and G. Mann, “Challenges in end-to-end neural scientific table recognition,” in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, Sep. 2019, pp. 894–901.
- S. R. Qasim, H. Mahmood, and F. Shafait, “Rethinking table recognition using graph neural networks,” in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, Sep. 2019, pp. 142–147.
- J. Fang, X. Tao, Z. Tang, R. Qiu, and Y. Liu, “Dataset, ground-truth and performance metrics for table detection evaluation,” in 2012 10th IAPR International Workshop on Document Analysis Systems. IEEE, 2012, pp. 445–449.
- X. Zhong, J. Tang, and A. J. Yepes, “Publaynet: largest dataset ever for document layout analysis,” in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, Sep. 2019, pp. 1015–1022.
- N. Siegel, N. Lourie, R. Power, and W. Ammar, “Extracting scientific figures with distantly supervised neural networks,” in Proceedings of the 18th ACM/IEEE on joint conference on digital libraries. ACM, 2018, pp. 223–232.
- L. Gao, Y. Huang, Y. Li, Q. Yan, Y. Fang, H. Dejean, F. Kleber, and E.-M. Lang, “ICDAR 2019 competition on table detection and recognition,” in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, Sep. 2019, pp. 1510–1515.
- A. Shahab, F. Shafait, T. Kieninger, and A. Dengel, “An open approach towards the benchmarking of table structure recognition systems,” in Proceedings of the 9th IAPR International Workshop on Document Analysis Systems. ACM, 2010, pp. 113–120.
- Y. Hirayama, “A method for table structure analysis using dp matching,” in Proceedings of 3rd International Conference on Document Analysis and Recognition, vol. 2. IEEE, 1995, pp. 583–586.
- S. Tupaj, Z. Shi, C. H. Chang, and H. Alam, “Extracting tabular information from text files,” EECS Department, Tufts University, Medford, USA, 1996.
- J. Hu, R. S. Kashi, D. P. Lopresti, and G. Wilfong, “Medium-independent table detection,” in Document Recognition and Retrieval VII, vol. 3967. International Society for Optics and Photonics, 1999, pp. 291–302.
- B. Gatos, D. Danatsas, I. Pratikakis, and S. J. Perantonis, “Automatic table detection in document images,” in International Conference on Pattern Recognition and Image Analysis. Springer, 2005, pp. 609–618.
- F. Shafait and R. Smith, “Table detection in heterogeneous documents,” in Proceedings of the 9th IAPR International Workshop on Document Analysis Systems. ACM, 2010, pp. 65–72.
- S. S. Paliwal, D. Vishwanath, R. Rahul, M. Sharma, and L. Vig, “Tablenet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images,” in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019, pp. 128–133.
- T. Kieninger and A. Dengel, “The t-recs table recognition and analysis system,” in International Workshop on Document Analysis Systems. Springer, 1998, pp. 255–270.
- F. Cesarini, S. Marinai, L. Sarti, and G. Soda, “Trainable table location in document images,” in Object recognition supported by user interaction for service robots, vol. 3. IEEE, 2002, pp. 236–240.
- A. C. e Silva, “Learning rich hidden markov models in document analysis: Table location,” in 2009 10th International Conference on Document Analysis and Recognition. IEEE, 2009, pp. 843–847.
- T. Kasar, P. Barlas, S. Adam, C. Chatelain, and T. Paquet, “Learning to detect tables in scanned document images using line information,” in 2013 12th International Conference on Document Analysis and Recognition. IEEE, 2013, pp. 1185–1189.
- M. Fan and D. S. Kim, “Table region detection on large-scale pdf files without labeled data,” CoRR, abs/1506.08891, 2015.
- L. Hao, L. Gao, X. Yi, and Z. Tang, “A table detection method for pdf documents based on convolutional neural networks,” in 2016 12th IAPR Workshop on Document Analysis Systems (DAS). IEEE, 2016, pp. 287–292.
- D. He, S. Cohen, B. Price, D. Kifer, and C. L. Giles, “Multi-scale multi-task fcn for semantic page segmentation and table detection,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, 2017, pp. 254–261.
- I. Kavasidis, C. Pino, S. Palazzo, F. Rundo, D. Giordano, P. Messina, and C. Spampinato, “A saliency-based convolutional neural network for table and chart detection in digitized documents,” in International Conference on Image Analysis and Processing. Springer, 2019, pp. 292–302.
- C. Tensmeyer, V. I. Morariu, B. Price, S. Cohen, and T. Martinez, “Deep splitting and merging for table structure decomposition,” in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019, pp. 114–121.
- S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
- K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
- J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
- S. Schreiber, S. Agne, I. Wolf, A. Dengel, and S. Ahmed, “Deepdesrt: Deep learning for detection and structure recognition of tables in document images,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, 2017, pp. 1162–1167.
- A. Gilani, S. R. Qasim, I. Malik, and F. Shafait, “Table detection using deep learning,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1. IEEE, 2017, pp. 771–776.
- P. W. Staar, M. Dolfi, C. Auer, and C. Bekas, “Corpus conversion service: A machine learning platform to ingest documents at scale,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2018, pp. 774–782.
- P. Riba, A. Dutta, L. Goldmann, A. Fornés, O. Ramos, and J. Lladós, “Table detection in invoice documents by graph neural networks,” in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, Sep. 2019, pp. 122–127.
- K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International conference on machine learning, 2015, pp. 2048–2057.
- Y. Deng, A. Kanervisto, J. Ling, and A. M. Rush, “Image-to-markup generation with coarse-to-fine attention,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 980–989.
- Y.-F. Zhou, R.-H. Jiang, X. Wu, J.-Y. He, S. Weng, and Q. Peng, “Branchgan: Unsupervised mutual image-to-image transfer with a single encoder and dual decoders,” IEEE Transactions on Multimedia, 2019.
- R. Morais, V. Le, T. Tran, B. Saha, M. Mansour, and S. Venkatesh, “Learning regularity in skeleton trajectories for anomaly detection in videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 996–12 004.
- M. Pawlik and N. Augsten, “Tree edit distance: Robust and memory-efficient,” Information Systems, vol. 56, pp. 157–173, 2016.
- V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” in Soviet physics doklady, vol. 10, no. 8, 1966, pp. 707–710.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of International Conference on Learning Representations (ICLR), 2015.