General-Purpose OCR Paragraph Identification by Graph Convolution Networks

Abstract

Paragraphs are an important class of document entities. We propose a new approach for paragraph identification by spatial graph convolution networks (GCN) applied on OCR text boxes. Two steps, namely line splitting and line clustering, are performed to extract paragraphs from the lines in OCR results. Each step uses a β-skeleton graph constructed from bounding boxes, where the graph edges provide efficient support for graph convolution operations. With only pure layout input features, the GCN model size is 3-4 orders of magnitude smaller compared to R-CNN based models, while achieving comparable or better accuracies on PubLayNet and other datasets. Furthermore, the GCN models show good generalization from synthetic training data to real-world images, and good adaptivity for variable document styles.

Optical character recognition, document layout, graph convolution network.

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

1 Introduction

Document image understanding is a task to recognize, structure, and understand the contents of document images, and is a key technology to digitally process and consume such document images. If we regard any images containing structured text as document images, they are ubiquitous and can be found in numerous applications. Document image understanding enables the conversion of such documents into a digital format with rich structure and semantic information and makes them available for subsequent information tasks.

A document can be represented by its semantic structure and physical structure [13]. The task to recover the semantic structure is called logical layout analysis [4] or semantic structure extraction [34] while the task to recover the physical structure is called geometric (physical, or structural) layout analysis [4]. These tasks are critical subproblems of document image understanding.

A paragraph is a semantic unit of writing consisting of one or more sentences that usually develops one main idea. Paragraphs are basic constituents of semantic structure and thus paragraph boundary estimation (or paragraph estimation, for short) is an important building block of logical layout analysis. Moreover, paragraphs are often appropriate as processing units for various downstream tasks such as translation and information extraction because they are self-contained and have rich semantic information. Therefore, developing a generic paragraph estimation algorithm is of great interest by itself.

Fig. 1: Examples of paragraphs in printed text. Paragraphs may have complex shapes when wrapped around figures or other types of document entities.

Paragraphs are usually rendered in a geometric layout structure according to broadly accepted typographical rules. For example, a paragraph can be rendered as a series of text lines that

  • are placed with uniform vertical spacing between adjacent lines;

  • start with a line where one of the following is true:

    1. The line is indented. (An indented paragraph.)

    2. The line starts with a bullet symbol or number, with all the subsequent lines indented to be left-justified, flush with the first. (A list item.)

    3. The vertical spacing above the first line is significantly larger than the uniform spacing between subsequent lines. (A block paragraph.)

As such, there are usually clear visual cues to identify paragraphs1. Nevertheless, the task of estimating paragraphs often remains non-trivial as shown in Fig. 1.

Previous studies have attempted to develop a paragraph estimation method by defining handcrafted rules based on careful observations [23, 29, 3, 28] or by learning an object detection model to identify the regions of paragraphs from an image [34, 36]. For the former approaches, it is usually challenging to define a robust set of heuristics even for a limited domain, and hence machine-learning-based solutions are generally preferable. The latter approaches tend to have difficulty dealing with diverse aspect ratios and text shapes, and the wide range of degradations observed in real-world applications such as image skews and perspective distortions.

In this paper, we propose to apply graph convolution networks (GCNs) in a post-processing step of an optical character recognition (OCR) system for paragraph estimation. Modern OCR engines can detect and recognize texts with a very high recall for documents in a variety of conditions. Indeed, as will be shown in the experiments, our generic OCR system can detect and recognize texts with a higher recall than a specialized image-based paragraph detector, indicating little risk of missing correct paragraph boundaries by restricting to the possibilities generated by the OCR engine. That motivates us to employ the post-processing strategy rather than a pre-processing or an entangled approach. Recent advancements in graph neural (convolutional) networks [26, 33] have enabled deep learning on non-Euclidian data. GCNs can learn spatial relationships among entities combining information from multiple sources and provide a natural way to learn the non-linear mapping from OCR results to paragraphs.

More specifically, we design two classifiers based on GCNs — one for line splitting and one for line clustering. A word graph is constructed for the first stage and a line graph is constructed for the second stage. Both graphs are constructed based on the β-skeleton algorithm [16] that produces a graph with good connectivity and sparsity.

To fully utilize the models’ capability, it is desirable to have a diverse set of document styles in the training data. We create synthetic data sets from web pages where the page styles are randomly modified in the web scraping engine. By leveraging open web sites like Wikipedia [32] for source material to render in randomized styles, we have access to practically unlimited document data.

We evaluate the 2-step models on both the PubLayNet [36] and our own datasets. We show that GCN based models can be small and efficient by taking OCR produced bounding boxes as input, and are also capable of generating highly accurate results. Moreover, with synthesized training data from a browser-based rendering engine, these models can be a step towards a reverse rendering engine that recovers comprehensive layout structure from document images.

This paper is organized as follows: Section 2 reviews related work. Section 3 presents our proposed method, where the graph construction method and the details of each step of the algorithms are described. Section 4 explains training data generation methods with web scraping. Experimental setups and results are given in Section 5. Section 6 concludes the paper with suggestions for future work.

2 Related Work

OCR layout analysis ("layout" for short) comprises a large variety of problems that have been studied from different aspects. There is pre-recognition layout analysis like [27] to find text lines as the input of recognition, and post-recognition layout analysis like [25] to find higher level entities based on OCR recognition results. We list selected studies that are most relevant to our problem in the following subsections.

2.1 Geometric and Rule-based Approaches

Multi-column text, often with small column gaps, needs to be first identified before paragraphs. Early studies have proposed geometric methods [3, 2] and rule-based methods [23, 29, 28]. Both categories have algorithms to find column gaps by searching whitespace [2] or text alignment [28].

Limitations of these approaches include susceptibility to input noise and false positive column boundaries, especially with monospace font families. Our production2 layout analyzer has been using a simpler rule-based heuristic algorithm which splits lines at white spaces that are significantly larger than others. This simplification gives it a performance (computational and memory cost) advantage, but also hampers its capability of handling dense text columns.

Fig. 2: Example of multiple short paragraphs densely packed and rotated into a non axis-aligned direction. The right side shows the region proposal boxes for object detection models.

2.2 Image Based Detection

The PubLayNet paper [36] provides a large dataset for multiple types of document entities, as well as two object detection models F-RCNN [24] and M-RCNN  [12] trained to detect these entities. Both show good metrics in evaluations, but also with some disadvantages on detecting paragraphs:

  • Cost: Object detection models are typically large in size and expensive in computation. When used together with an OCR engine to retrieve text paragraphs, it seems wasteful to bypass the OCR results and attempt to detect paragraphs independently.

  • Quality: Paragraph bounding boxes may have high aspect ratios and are sometimes tightly packed, making it difficult for Faster R-CNN detection. In Fig. 2, several short paragraphs are printed with dense text and rotated by 45 degrees. The region proposals required to detect all the paragraphs are highly overlapped, so some detections will be dropped by non-maximum suppression (NMS). Rotational R-CNN models [14] can mitigate this issue by inclined NMS, but further increase the computational cost while still facing a more difficult task with rotated or warped inputs.

2.3 Page Segmentation

Page segmentation models  [34, 18, 22] classify every part of the image to certain types of objects such as text, table, image and background. Sometimes the shapes of paragraphs can be revealed by the “text” part of the segmentation. However, when text is dense and paragraphs are indentation based without variation in line spacings, individual paragraphs cannot be easily extracted from large connected text regions. On the other hand, when text is sparse and appears as a lot of separate small components, paragraphs are not obvious in the segmentation result either.

2.4 Graph Neural Network for Table Detection

A graph neural network approach is proposed in [25] to detect tables in invoice documents. It shows that tabular structures can be detected based purely on structural information by graph neural networks.

Limitations of this approach include graph construction and graph representation. First, the visibility graph is built by only connecting pairs of pre-defined entities that are vertically or horizontally visible, which requires the input image to be free of skews and distortions. Second, the adjacency matrix learned by the GNN is quadratic in the number of input entities and hence inefficient for large inputs. A general-purpose post-OCR model will need to overcome these limitations to accommodate all types of input images and achieve high computational efficiency.

3 Paragraph Estimation with Graph Convolution Networks

A paragraph consists of a set of text lines, which are usually produced in the output of OCR systems [8, 27]. If text lines are given by OCR systems, we can consider a bottom-up approach to cluster text lines into paragraphs for the paragraph estimation task.

The detected lines provide rudimentary layout information but may not match the true text lines. For example, in Fig. 3, the lower section of the page consists of two text columns placed closely to each other. The line detector might be confused by the tiny spacing and find wrong lines spanning both columns. These lines need to be split in the middle before being clustered into paragraphs.

Line splitting and clustering are non-trivial tasks for general-purpose paragraph estimation – the input images can be skewed or warped, and the layout styles can vary among different types of documents, e.g. newspapers, books, signs, web pages, handwritten letters, etc. Even though the concept of paragraph is mostly consistent across all document categories, the appearance of a paragraph can differ by many factors such as word spacing, line spacing, indentation, text flowing around figures, etc. Such variations make it difficult, if not impossible, to have a straightforward algorithm that identifies all the paragraphs.

In order to address erroneous line detection and solve the non-trivial split and clustering problem, we design a paragraph identification method as a 2-step process after the main OCR engine produces line and word bounding boxes. Both steps use a graph convolution network (GCN) that takes input features from bounding boxes in the OCR result, together with a β-skeleton graph [16] constructed from these boxes. Neither the original image nor text transcriptions are included in the input, so the models are small, fast, and entirely focused on the layout structure.

  • Step 1: Line splitting. Raw text lines from OCR line detectors may cross multiple columns, and thus need to be split into shorter lines. A GCN node classifier predicts splitting points in lines.

  • Step 2: Line clustering. After step 1 produces refined lines, they are clustered into paragraphs. A GCN edge classifier predicts clustering operations on pairs of neighboring lines.

The following subsections describe these steps in detail. In addition, we discuss the possibility of an alternative one-step process.

Fig. 3: Example of a double-column document image and its paragraphs. The left side shows all the lines found by an OCR text line detector, and the right side shows the paragraphs formed by clustered text lines within each column.
Fig. 4: Comparison among different types of graphs constructed on an example set of boxes.

3.1 β-skeleton on Boxes

A graph is a key part of the GCN model input. We want a graph with high connectivity for effective message passing in graph convolutions, while also being sparse for computational efficiency.

Visibility graphs have been used in previous studies [6, 25], where edges are made by "lines-of-sight". They are not considered suitable for our models because the lines-of-sight may create excessive edges. Fig. 4(a) shows the visibility graph built on two rows of boxes, where any pair of boxes on different rows are connected. This means word connections between two text lines of n words each may produce O(n²) edges. If we limit the lines-of-sight to be axis aligned like Fig. 4(b), then the graph becomes too sparse, even producing disconnected components in some cases.

By changing "lines-of-sight" into "balls-of-sight", we get a β-skeleton graph [16] with β = 1. In such a graph, two boxes are connected if they can both touch a circle that does not intersect with any other boxes. It provides a good balance between connectivity and sparsity. As shown in Fig. 4(c), a β-skeleton graph does not have too many connections between rows of boxes. With β = 1, it is a subgraph of the Delaunay triangulation [1] with the number of edges bounded by O(n). Yet, it provides good connectivity within any local cluster of boxes, and the whole graph is guaranteed to be one connected component.

Fig. 5: Building a β-skeleton on boxes from a β-skeleton on points. Left side: intersecting boxes are first connected with edges of length 0. Right side: non-internal peripheral points are connected with β-skeleton edges which are then collapsed into box edges. Edge lengths are approximate.

The original β-skeleton graph is constructed on a point set. To apply it to bounding boxes, we use an algorithm illustrated in Fig. 5 and described in the following steps, where N, used for complexity analysis, is the number of boxes. We assume the length and width of all the input boxes are bounded by a constant.

  1. For each box, pick a set of peripheral points at a pre-set density, and pick a set of internal points along the longitudinal middle line3.

  2. Build a Delaunay triangulation graph D on all the points. (Time complexity O(N log N) [1].)

  3. Find all the "internal" points that are inside at least one of the boxes. (Time complexity O(N) by traversing along D's edges inside each box starting from any peripheral point. Internal points are marked grey in Fig. 5.)

  4. Add an edge of length 0 for each pair of intersecting boxes (containing each other’s peripheral points).

  5. Pick β-skeleton edges from D, where for each edge e = (u, v), both its vertices u and v are non-internal points and the circle with e as its diameter does not cover any other point.

    If there is such a point p covered by the circle, then the point closest to e must be a neighbor of either u or v (in Delaunay triangulation graphs). Finding such p takes constant time for each edge, since the triangulation D produced in step 2 has its edges sorted at each point.

  6. Keep only the shortest edge for each pair of boxes as the β-skeleton edge.

The overall time complexity of this box based β-skeleton graph construction is O(N log N), dominated by the Delaunay triangulation. There are pathological cases where step 4 will need O(N²) time, e.g. when all the boxes contain a common overlapping point. But these cases are easily excluded from OCR results. The total number of edges is bounded by O(N) as in D, so the graph convolution layers have linear time operations.
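
As a minimal sketch of the circle test in step 5, the snippet below filters Delaunay edges on a plain point set down to β = 1 skeleton edges (equivalent to the Gabriel graph), assuming SciPy for the triangulation. The box-specific parts (peripheral/internal point sampling, zero-length edges for intersecting boxes, per-box shortest-edge reduction) are omitted, and a brute-force neighbor query replaces the constant-time check described above.

```python
# Sketch: keep only Delaunay edges whose diameter circle covers no other point
# (beta = 1 skeleton on points). Box handling from steps 1, 3, 4 and 6 is omitted.
import numpy as np
from scipy.spatial import Delaunay, cKDTree

def beta_skeleton_edges(points):
    points = np.asarray(points, dtype=float)
    tri = Delaunay(points)                      # step 2: O(N log N)
    tree = cKDTree(points)
    edges = set()
    for simplex in tri.simplices:               # collect unique undirected edges
        for i in range(3):
            a, b = sorted((simplex[i], simplex[(i + 1) % 3]))
            edges.add((a, b))
    kept = []
    for a, b in edges:
        center = (points[a] + points[b]) / 2.0
        radius = np.linalg.norm(points[a] - points[b]) / 2.0
        # Keep the edge only if the circle with the edge as diameter
        # contains no third point (step 5).
        inside = tree.query_ball_point(center, radius)
        if all(i in (a, b) for i in inside):
            kept.append((a, b))
    return kept

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(len(beta_skeleton_edges(rng.random((50, 2)))), "beta-skeleton edges")
```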

Fig. 6: Overview of the line splitting model. In the output, line start nodes are marked green and line end nodes are marked orange.
Fig. 7: Overview of the line clustering model. In the output, positive edges are marked pink.

3.2 Message Passing on Graphs

We use spatial-based graph convolution networks (GCNs) [33, 7] for both tasks of line splitting and line clustering, since both can leverage the local spatial feature aggregation and combinations across graph edges (more details in subsections 3.3 and 3.4 below).

Our graph convolution network resembles the message passing neural network (MPNN) [9] and GraphSage [10]. We use the term "message passing phase" from [9] to describe the graph level operations in our models. In this phase, repeated steps of "message passing" are performed based on a message function $M_t$ and a node update function $U_t$. At step $t$, a message $m_{vw}^t = M_t(h_v^t, h_w^t)$ is passed along every edge $(v, w)$ in the graph, where $h_v^t$ and $h_w^t$ are the hidden states of nodes $v$ and $w$. Let $N(v)$ denote the neighbors of node $v$ in the graph; the aggregated message by average pooling received by $v$ is

$$m_v^{t+1} = \frac{1}{|N(v)|} \sum_{w \in N(v)} M_t(h_v^t, h_w^t), \qquad (1)$$

and the updated hidden state is

$$h_v^{t+1} = U_t(h_v^t, m_v^{t+1}). \qquad (2)$$

Alternatively, we can use attention weighted pooling [31] to enhance message aggregation. Consequently, the model is also called a graph attention network (GAT), where the calculation of $m_v^{t+1}$ is replaced by

$$m_v^{t+1} = \sum_{w \in N(v)} \alpha_{vw}^t \, M_t(h_v^t, h_w^t), \qquad (3)$$

and $\alpha_{vw}^t$ is computed from a shared attention mechanism $a$:

$$\alpha_{vw}^t = \mathrm{softmax}_w\!\left(a(h_v^t, h_w^t)\right). \qquad (4)$$

For $a$, we use the self-attention mechanism introduced in [30].

In our GCN models, the message passing steps are applied on the β-skeleton graph constructed on OCR bounding boxes, so that structural information can be passed around local vicinities along graph edges, and potentially be combined and extracted into useful signals. Both the line splitting step and the line clustering step rely on this mechanism to make predictions on graph nodes or edges.
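
To make Eqs. (1) and (2) concrete, here is a minimal NumPy sketch of one message passing step with average pooling. The single dense matrices standing in for $M_t$ and $U_t$, the tanh nonlinearity, and the dimensions are illustrative assumptions rather than the exact architecture used in the models.

```python
# Minimal sketch of one message passing step (Eqs. 1 and 2) with average pooling.
import numpy as np

def message_passing_step(h, edges, W_msg, W_upd):
    """h: (num_nodes, d) hidden states; edges: list of undirected (u, v) pairs."""
    num_nodes, d = h.shape
    msg_sum = np.zeros_like(h)
    degree = np.zeros(num_nodes)
    for u, v in edges:
        # Message m_{vw} = M_t(h_v, h_w), sent in both directions along the edge.
        msg_sum[u] += np.tanh(np.concatenate([h[u], h[v]]) @ W_msg)
        msg_sum[v] += np.tanh(np.concatenate([h[v], h[u]]) @ W_msg)
        degree[u] += 1
        degree[v] += 1
    m = msg_sum / np.maximum(degree, 1)[:, None]              # Eq. (1): average pooling
    return np.tanh(np.concatenate([h, m], axis=1) @ W_upd)    # Eq. (2): node update

# Example: 4 nodes on a path graph, 8-dimensional hidden states.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
W_msg, W_upd = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
print(message_passing_step(h, [(0, 1), (1, 2), (2, 3)], W_msg, W_upd).shape)  # (4, 8)
```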

3.3 Splitting Lines

As in [3, 28], if multi-column text blocks are present in a document page, splitting lines across columns is a necessary first step for finding paragraphs. Here we have the same objective but a different input with available OCR bounding boxes for each word and symbol. Image processing can be skipped to accelerate computations.

Note that the horizontal spacing between words is not a reliable signal for this task: when the typography alignment of the text is "justified," i.e. the text falls flush with both sides, word spacings may be stretched to fill the full column width. In Fig. 3, the bottom left line has word spacings larger than the column gap. This is common in documents with tightly packed text such as newspapers.

We use the GCN model shown in Fig. 6 to predict the splitting points, or tab-stops. Each graph node is a word bounding box. Graph edges are the β-skeleton edges built as described in section 3.1. The model output contains two sets of node classification results – whether each word is a "line start" and whether it is a "line end". This model is expected to work well for difficult cases like dense text columns with "justified" alignment by aggregating signals from words in multiple lines surrounding the potential splitting point.

Fig. 8 shows a zoomed-in area of Fig. 3 with a -skeleton graph constructed from the word bounding boxes. Since words are aligned on either side of the two text columns, a set of words with their left edges all aligned are likely on the left side of a column, i.e. these words are line starts. Similarly, a set of words with right edges all aligned are likely on the right side and are line ends. The -skeleton edges are guaranteed to connect aligned words in neighboring lines, since aligned words have the shortest distance between the two lines and there is nothing in between to block the connection. Thus, the alignment signal can be passed around in the message passing steps and be effectively learned by the GCN model. Moreover, the two sets of words beside the column gap are also connected with -skeleton edges crossing the two columns, so the signals can be mutually strengthened.
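
As an illustration of how the two node classification outputs could be consumed, the sketch below splits one raw OCR line using its per-word "line start" and "line end" probabilities. The 0.5 threshold, the input structure, and the exact splitting policy are assumptions for illustration; the paper does not prescribe this particular post-processing.

```python
# Sketch: split one OCR line into shorter lines from per-word "line start" and
# "line end" probabilities. Threshold and data layout are illustrative assumptions.
def split_line(words, p_start, p_end, threshold=0.5):
    """words: word boxes of one OCR line in reading order;
    p_start / p_end: per-word probabilities of being a line start / line end."""
    lines, current = [], []
    for i, word in enumerate(words):
        # A word predicted as "line start" opens a new line (except the first word).
        if current and p_start[i] >= threshold:
            lines.append(current)
            current = []
        current.append(word)
        # A word predicted as "line end" closes the current line.
        if p_end[i] >= threshold and i < len(words) - 1:
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines

# Example: 6 words where word 3 starts a new column.
words = ["w0", "w1", "w2", "w3", "w4", "w5"]
p_start = [0.9, 0.1, 0.0, 0.8, 0.1, 0.0]
p_end   = [0.0, 0.1, 0.9, 0.0, 0.1, 0.9]
print(split_line(words, p_start, p_end))  # [['w0', 'w1', 'w2'], ['w3', 'w4', 'w5']]
```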

Fig. 8: Line splitting signal from word box alignment propagating through β-skeleton edges. The resulting predictions are equivalent to tab-stop detection.
Fig. 9: Example of paragraph line clustering by indentations. Light blue edges indicate the β-skeleton constructed on line bounding boxes, and pink edges indicate that the connected lines are clustered into paragraphs.

3.4 Clustering Lines

After splitting all the lines into "true" lines, the remaining task is to cluster them into paragraphs. Again we use a graph convolution network, but now each graph node is a line bounding box, and the output is edge classification similar to link prediction in [21, 35]. We define a positive edge to be one that connects two consecutive lines in the same paragraph. Note that it is possible to have non-consecutive lines in the same paragraph being connected by a β-skeleton edge. Such edges are defined as negative to make the task easier to learn.

Fig. 7 is an overview of the line clustering model. It looks similar to the line splitting model in Fig. 6, except the input consists of line bounding boxes, and the output predictions are on graph edges instead of nodes. An additional "node-to-edge" step is necessary to enable edge classification with node-level output from the graph convolution steps. It works in a similar way as the first half of a graph convolution step, with node aggregation replaced by edge aggregation over the two endpoints of each edge $(v, w)$:

$$h_{vw} = \frac{1}{2}\left(M_T(h_v^T, h_w^T) + M_T(h_w^T, h_v^T)\right). \qquad (5)$$

The model predicts whether two lines belong to the same paragraph for each pair of lines connected with a β-skeleton edge. The predictions are made from multiple types of context:

  • Indentation: For example, in Fig. 9 which is zoomed-in from Fig. 3, each new paragraph starts with an indented line, so the edge connecting the fourth and fifth lines in the left column is predicted as non-clustering.

  • Vertical spacing: "Block paragraphs", which are common in web pages, are separated by extra vertical spacing. Line spacing signals are passed around in graph convolutions to detect vertical spacing variations.

  • List items: The first line of a list item is usually outdented with a bullet point or a number, and the first word after the bullet is flush with the following lines. So list items can be detected in a similar way as indentation based paragraphs.

Besides the three common types listed above, we may have other forms of paragraphs such as mailing addresses, computer source code or other customized structures. The model can be trained on different types of layout data.
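
One simple way to turn positive edge predictions into paragraphs is to take connected components over the positively classified β-skeleton edges, for example with union-find as sketched below. This grouping step is an assumption for illustration; the text above only defines which edges are labeled positive.

```python
# Sketch: form paragraphs as connected components over the line graph, keeping
# only edges classified as positive. Union-find is one simple grouping choice;
# the paper does not prescribe a specific method.
def cluster_lines(num_lines, edges, predictions, threshold=0.5):
    parent = list(range(num_lines))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), p in zip(edges, predictions):
        if p >= threshold:                 # positive edge: same paragraph
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb

    paragraphs = {}
    for line in range(num_lines):
        paragraphs.setdefault(find(line), []).append(line)
    return list(paragraphs.values())

# Example: 5 lines with edges between consecutive lines; the edge (2, 3) is
# predicted negative, so a new paragraph starts at line 3.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
preds = [0.95, 0.90, 0.10, 0.85]
print(cluster_lines(5, edges, preds))  # [[0, 1, 2], [3, 4]]
```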

3.5 Possibility of Clustering Words

If a 1-step model can cluster the words directly into paragraphs, it will be preferable to the 2-step GCN models described above. A single model is not only faster to train and run, but can also avoid cascading errors where the first step’s mistake propagates to the second.

However, there is significant difficulty for a 1-step GCN model to work on the paragraph word clustering problem. First, word based GCN models may not have good signal aggregation for line level features because of the limited number of graph convolution layers. The oversmoothing effect [20, 19] limits the depth of the network, i.e. the number of "message passes". With β-skeleton graphs mostly consisting of local connections, the "receptive field" on each graph node is small and often cannot cover a whole line. For instance, a word at the end of a line has no information on whether this line is indented. In a general purpose paragraph model where the input can be noisy and deformed, this limitation can severely affect model performance.

While it is possible to extend the receptive fields by adding non-local edges in the graph, or employing residual connections and dilated convolutions [19] in the model, it is non-trivial to build a scalable and effective solution. This is an interesting topic for further research, but not the focus of this paper.

4 Synthetic Training Data from Web

A large set of diverse and high quality annotated data is a necessity for training deep neural networks. Such datasets are not readily available for paragraphs and layout-related tasks. The PubLayNet dataset [36] is a very large annotated set, but lacks style diversity as all the pages are from publications.

Therefore, we largely rely on automated training data generation [15]. By taking advantage of high quality and publicly available web documents, as well as a powerful rendering engine used in modern browsers, we can generate synthetic training data with a web scraper.

Fig. 10: Training data examples from web scraping with randomized style changes and data augmentation. Green boxes indicate line ground truth labels and yellow boxes indicate multi-line paragraph ground truth labels. (Yellow paragraph boxes are for visualization purpose only. Paragraph ground truth labels are represented by sets of line numbers in order to prevent ambiguity when there are overlaps between paragraph bounding boxes.)

4.1 Scraping Web Pages with Modified Styles

Web pages are a good source of document examples. Wikipedia [32] is well known to host a great number of high quality articles with free access.

We use a browser-based web scraper to retrieve a list of Wikipedia pages, where each result includes the image rendered in the browser as well as the HTML DOM (document object model) tree of that page. The DOM tree contains the complete document structure and detailed locations of all the rendered elements, from which we can reconstruct the ground truth line bounding boxes. Each line bounding box is an axis-aligned rectangle covering a line of text. For paragraph ground truth, the HTML tag p conveniently indicates a paragraph node, and all the text lines under this node belong to the same paragraph.

Style change                       Script sample
Single-column to double-column     div.style.columnCount = 2;
Vertical spacing to indentation    div.style.textIndent = "30px"; div.style.marginTop = 0; div.style.marginBottom = 0;
Typography alignment               div.style.textAlign = "right";
Text column width                  div.style.width = "50%";
Horizontal text block position     div.style.marginLeft = "20%";
Line height/spacing                div.style.lineHeight = "150%";
Font                               div.style.fontFamily = "times";
TABLE I: Sample web script code for changing paragraph styles. Random combinations of these changes are used in training data synthesis.

One issue of using web page data directly for document layout is the lack of diversity in document styles. Almost all web pages use vertical spacing to separate paragraphs, and multi-column text is rare. Fortunately, modern web browsers support extensions that can run script code on web pages to change their CSS styles. For example, to generate double-column text for a certain division of a page, we can use “div.style.columnCount = 2.”

Table I lists a few examples of web script code for changing paragraph styles. Such pieces are randomly picked and combined in our training data pipeline. Parameters such as column count and alignment type are also randomized. Thus, the total combinations give a great diversity of styles to simulate various types of documents to be encountered in the real world.
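
As a rough sketch of how such randomized combinations might be assembled, the snippet below composes a random subset of the Table I style changes into a script string that could be injected into the scraping browser. The sampling probabilities, value ranges, and injection mechanism are illustrative assumptions, not the actual pipeline.

```python
# Sketch: compose a random subset of the Table I style changes into one script
# string for browser injection. Ranges and probabilities are illustrative.
import random

def random_style_script():
    changes = []
    if random.random() < 0.5:                       # multi-column text
        changes.append(f"div.style.columnCount = {random.choice([2, 3])};")
    if random.random() < 0.5:                       # indentation-based paragraphs
        changes.append('div.style.textIndent = "30px";')
        changes.append("div.style.marginTop = 0;")
        changes.append("div.style.marginBottom = 0;")
    changes.append(f'div.style.textAlign = "{random.choice(["left", "right", "justify"])}";')
    changes.append(f'div.style.width = "{random.choice([50, 70, 100])}%";')
    changes.append(f'div.style.lineHeight = "{random.choice([100, 150, 200])}%";')
    changes.append(f'div.style.fontFamily = "{random.choice(["times", "arial", "monospace"])}";')
    return "\n".join(changes)

print(random_style_script())
```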

4.2 Data Augmentation

A general-purpose OCR engine must accommodate all types of input images, including photos of text taken at different camera angles. Our model should be able to handle the same variations in input, so data augmentation is needed to transform the rectilinear data scraped from web pages into photo-like data.

To emulate the effect of camera angles on a page, we need two types of geometric transformation: rotation and perspective projection. Again, we use randomized parameters in each transformation to diversify our data. Each data point gets a random projection followed by a random rotation, applied to both the image and ground truth boxes. Fig. 10 shows two training examples with data augmentation.

  • The left one has Arial font, dense text lines, paragraphs separated by indentation and the camera placed near the upper-left corner.

  • The right one has Monospace font, sparse text lines, paragraphs separated by vertical spacing and the camera placed near the lower-right corner.
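
Below is a minimal sketch of the geometric augmentation described above, applying a random perspective projection followed by a random rotation to ground truth box corners in homogeneous coordinates. The parameter ranges are illustrative assumptions, and warping the page image with the same transform is omitted.

```python
# Sketch: random perspective projection followed by random rotation applied to
# ground truth box corners. Parameter ranges are illustrative assumptions.
import numpy as np

def random_homography(rng, max_angle_deg=30.0, max_perspective=5e-4):
    theta = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    rotation = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                         [np.sin(theta),  np.cos(theta), 0.0],
                         [0.0,            0.0,           1.0]])
    projection = np.eye(3)
    projection[2, :2] = rng.uniform(-max_perspective, max_perspective, size=2)
    return rotation @ projection       # projection applied first, then rotation

def transform_points(H, points):
    pts = np.hstack([points, np.ones((len(points), 1))])   # to homogeneous coords
    warped = pts @ H.T
    return warped[:, :2] / warped[:, 2:3]                   # back to Cartesian

rng = np.random.default_rng(1)
H = random_homography(rng)
box_corners = np.array([[100.0, 200.0], [300.0, 200.0],
                        [300.0, 240.0], [100.0, 240.0]])    # one GT line box
print(transform_points(H, box_corners))                     # warped quadrilateral
```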

Note that we do not need pixel-level augmentation (imaging noise, illumination variation, compression artifacts, etc.) for the training of our GCN models, because these models only take bounding box features from the OCR engine output, and are decoupled from the input image. Even when real input images look very different from the training data images, the bounding boxes from a robust OCR engine can still be consistent. It is assumed that the OCR engine has been trained to be robust to pixel-level degradation, as is the case in the present work.

4.3 Sequential 2-Step Training by Web Synthetic Data

We train the two GCN models in sequence, where the line clustering input depends on the line splitting model. For each model, the classification ground truth labels are computed by matching the OCR output to the shapes of ground truth (GT) lines. Each GT line is a rectangle from the web rendering engine, and is transformed into a quadrilateral in data augmentation.

In the line splitting model, the graph nodes are the OCR word boxes, and the output labels are node classifications on whether each node is a “line start” and whether it is a “line end”. These labels can be computed by the following two steps.

  • For each OCR word, the word is allocated to the ground truth line with the maximum intersection area with the word bounding box.

  • For each GT line, sort the words allocated to this line along its longitudinal axis. The word at the left end is a line start, the word at the right end is a line end, and the remaining words are negative on both. (A sketch of this label computation is given below.)
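
The sketch below mirrors the two steps above on axis-aligned boxes: allocate each OCR word to the GT line with maximum box intersection, then mark the two extreme words of each GT line. Axis-aligned boxes and the simple data structures are simplifying assumptions; the real pipeline works with augmented quadrilaterals.

```python
# Sketch: derive "line start" / "line end" node labels from GT lines.
def intersection_area(a, b):
    # Boxes as (x0, y0, x1, y1).
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def line_split_labels(word_boxes, gt_line_boxes):
    is_start = [False] * len(word_boxes)
    is_end = [False] * len(word_boxes)
    allocation = {}                          # GT line index -> list of word indices
    for wi, wb in enumerate(word_boxes):
        gi = max(range(len(gt_line_boxes)),
                 key=lambda g: intersection_area(wb, gt_line_boxes[g]))
        allocation.setdefault(gi, []).append(wi)
    for gi, word_ids in allocation.items():
        # Sort allocated words along the line's longitudinal (here: x) axis.
        word_ids.sort(key=lambda wi: word_boxes[wi][0])
        is_start[word_ids[0]] = True
        is_end[word_ids[-1]] = True
    return is_start, is_end

words = [(0, 0, 10, 5), (12, 0, 22, 5), (30, 0, 40, 5)]
gt_lines = [(0, 0, 23, 5), (29, 0, 41, 5)]
print(line_split_labels(words, gt_lines))
# ([True, False, True], [False, True, True])
```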

In the line clustering model, the graph nodes are the line boxes after the line splitting process, and the output labels are edge classifications on whether an edge connects two lines that are adjacent lines in the same paragraph. For each β-skeleton edge that connects a pair of OCR line boxes, we find the corresponding pair of GT line boxes by the same maximum intersection area criterion as in the step above. The edge label is positive if the two GT lines belong to the same paragraph and are adjacent lines within that paragraph4.

The line clustering model input is generated from line splitting on the OCR results. While there remains some risk of cascading of errors, line clustering is able to correct some mistakes in the previous line splitting step. Specifically,

  • For “under-splitting”, i.e. when an OCR line covers multiple GT lines, there is no way to correct it by clustering, and the training example is discarded.

  • For “over-splitting”, i.e. when multiple OCR lines match the same GT line, the line clustering model can cluster the over-split short lines into the same paragraph and recover the original lines. See the second picture in Fig. 15 as an example. The sequential training steps enable this error correction.

It is worth noting that the ground truth labels associated with table elements are treated as “don’t-care” and assigned with weight 0 in training. The reason is that tables have very different structures from paragraphed text, and the two types of entities often produce contradicting labels within the current GCN framework. Using GCN for table detection like [25] is another interesting topic but out of the scope of this paper.

5 Experiments

We experiment with the 2-step GCN models and evaluate the end-to-end performance on both the open PubLayNet dataset and our own annotated sets.

In the end-to-end flow, the line splitting model and the line clustering model work in a sequential order. It takes an OCR result page as input, and produces a set of paragraphs each containing a set of lines, and every line in the page belongs to exactly one paragraph.

5.1 Setups

We use the OCR engine behind the Google Cloud Vision API DOCUMENT_TEXT_DETECTION feature5, version 2020, for all the pre-layout detection and recognition tasks. Setup details are elaborated as follows.

Data

We use 3 datasets in our evaluations: PubLayNet from [36], the web synthetic set as described in section 4, and a human annotated set with real-world images.

  • PubLayNet contains a large number of document images with ground truth annotations: 340K in the training set and 12K in the development/validation set. The testing set ground truth has not been released at the time of this writing, so here we use the development set for testing.

  • For web synthetic, we scrape 100K Wikipedia [32] pages in English for image based model training and testing at a 90/10 split. For GCN models, 10K pages are enough. An additional 10K pages in Chinese, Japanese and Korean are scraped to train the omni-script GCN models.

  • We also use a human annotated set with real-world images – 25K in English for training and a few hundred for testing in each available language. The images are collected from books, documents or objects with printed text, and then sent to a team of raters who draw the ground truth polygons for all the paragraphs. Example images are shown in Fig. 14, 15 and 16.

Models and Hyperparameters

The GCN models are built as in Fig. 6 and Fig. 7, each carrying 8 steps of graph convolutions with 4-head self-attention weighted pooling [31, 30].

At the models' input, each graph node's feature is a low-dimensional vector containing the bounding box information of the word or the line. The first five values describe the box as a whole, including its width, height and rotation angle; then 6 more values are added for each of its 4 corners. For line clustering, an additional value indicating the first word's width is added to each line for better context of line breaks and list items. These values provide the starting point for feature crossings and combinations in graph convolutions. This low dimension of model input enables lightweight and efficient computation.

Model inference latency is very low, under 20 milliseconds for an input graph of around 1500 nodes (on a 12-core 3.7GHz Xeon CPU), since each GCN model is less than 130KB in size with 32-bit floating point parameters. In fact, the computation bottleneck is the β-skeleton construction, which can take 50 milliseconds for the same graph. Compared to the main OCR process, the overall GCN latency is small, and the O(N log N) complexity ensures scalability.

We cannot claim that GCN models have better latency than image based R-CNN models, because image models can run in parallel with the OCR engine when resources allow. Instead, the small size of GCN models makes them easy to deploy as a lightweight, low cost and energy efficient step of post-OCR layout analysis.

Evaluation Metrics

While classification tasks are evaluated by precision and recall, the end-to-end performance is measured by IoU based metrics such as the COCO mAP@IoU[.50:.95] used in [36] so the results are comparable.

The average precision (AP) for mAP is usually calculated on a precision-recall curve. Since our models produce binary predictions instead of detection boxes, we have only one output set of paragraph bounding boxes, i.e. only one point on the precision-recall curve. So the AP is effectively the product of the precision and recall at that single operating point.

We introduce another metric F1 using variable IoU thresholds, which is more suitable for paragraph evaluations. In Fig. 11, a single-line paragraph has a lower IoU even though it is correctly detected, while a 4-line detection (in red) has a higher IoU despite a missed line. This is caused by boundary errors at character scale rather than at paragraph scale. This error is larger for post-OCR methods since the OCR engine is not trained to fit paragraph boxes. If we have line-level ground truth in each paragraph, and adjust IoU thresholds by

$$\text{IoU threshold} = \frac{n}{n+1}, \qquad (6)$$

where $n$ is the number of ground truth lines in the paragraph, the single-line paragraph will have IoU threshold 0.5, the 5-line one will have IoU threshold 0.833, and both cases in Fig. 11 can be more reasonably scored.
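
A minimal sketch of scoring with this variable threshold is given below: each detection is greedily matched to an unmatched ground truth paragraph and counted as correct only if its IoU clears $n/(n+1)$ for that paragraph's line count $n$. The greedy one-to-one matching and the iou() callback are illustrative simplifications, not the exact evaluation code.

```python
# Sketch: F1 with the variable IoU threshold n / (n + 1), where n is the number
# of GT lines in the paragraph. Matching strategy is an illustrative choice.
def f1_variable_iou(detections, ground_truths, gt_line_counts, iou):
    matched_gt = set()
    true_positives = 0
    for det in detections:
        best_gt, best_iou = None, 0.0
        for gi, gt in enumerate(ground_truths):
            if gi in matched_gt:
                continue
            score = iou(det, gt)
            if score > best_iou:
                best_gt, best_iou = gi, score
        if best_gt is not None:
            threshold = gt_line_counts[best_gt] / (gt_line_counts[best_gt] + 1.0)
            if best_iou >= threshold:      # 0.5 for 1 line, 0.833 for 5 lines
                matched_gt.add(best_gt)
                true_positives += 1
    precision = true_positives / max(len(detections), 1)
    recall = true_positives / max(len(ground_truths), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```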

Fig. 11: Paragraph detection example from PubLayNet [36]. Red boxes are different from ground truth in terms of enclosed words. A single-line correct detection has lower IoU than a multi-line detection missing a line, necessitating variable IoU thresholds in evaluations.

Both PubLayNet [36] and our web synthetic set have line level ground truth to support this metric. For the human annotated set without line annotations, we fall back to a fixed IoU threshold of 0.5.

Baselines

The 2-step GCN models are compared against image based models and the heuristic algorithm in our production system.

The image models include Faster R-CNN and Mask R-CNN used in [36], which work on the PubLayNet data with non-augmented images. For broader testing on augmented datasets, we train a Faster R-CNN model with an additional quadrilateral output to indicate rotated boxes, denoted by "F-RCNN-Q" in the following subsections. This model uses a ResNet-101 [11] backbone and is 200MB in size, smaller than the two models in [36] but still 3 orders of magnitude larger than the GCN models.

For reference, the baseline heuristic algorithm takes the OCR recognized text lines as input and generates paragraphs by the following steps.

  1. Within each line, find white spaces between words that are significantly wider than average, and split the line by these spaces into shorter lines.

  2. For each line, repeatedly cluster nearby lines into its block by a distance threshold, until no more proximate lines can be found.

  3. Within each block, merge lines that are roughly placed in the same straight line.

  4. Within each block, for each indented line, create a new paragraph.

These rule-based steps were intended to handle multi-column text pages, but the fixed hand-tuned parameters make them inflexible to style variations. Replacing them with machine learned GCN models as proposed here can greatly enhance the algorithm's performance and adaptivity.
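
For concreteness, here is a small sketch of step 1 of the heuristic: splitting a line at word gaps that are significantly wider than that line's average gap. The 2x factor and the axis-aligned box representation are illustrative assumptions, not the production parameters.

```python
# Sketch of heuristic step 1: split a line at word gaps significantly wider
# than the line's average gap. The 2x factor is illustrative only.
def heuristic_split(word_boxes, factor=2.0):
    # word_boxes: axis-aligned (x0, y0, x1, y1) tuples in reading order.
    gaps = [b[0] - a[2] for a, b in zip(word_boxes, word_boxes[1:])]
    if not gaps:
        return [word_boxes]
    avg = sum(gaps) / len(gaps)
    lines, current = [], [word_boxes[0]]
    for gap, box in zip(gaps, word_boxes[1:]):
        if gap > factor * avg:          # significantly wider than average
            lines.append(current)
            current = []
        current.append(box)
    lines.append(current)
    return lines
```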

5.2 GCN Classification Accuracies

We first check the metrics of the GCN classification tasks on various training sets. Precision and recall scores of the binary classification tasks are shown in Table II. Data augmentation is not applied to the PubLayNet data because of the low resolution of its images, while the web synthetic data is tried both with and without augmentation.

The human annotated training set is added to train the line clustering GCN model, but not the line splitting model because it lacks dense, multi-column text pages. Therefore, only the line clustering scores are shown for the combined set in Table II. The scores on the annotated set are significantly lower because of the diverse and noisy nature of the data source.

We also compare the β-skeleton graph with the two types of "line-of-sight" graphs in Fig. 4. Since the edges are very different among these graphs, Table III only compares node classification scores trained on the augmented web synthetic set. When average pooling is used in graph convolutions, the unrestricted "line-of-sight" graph in Fig. 4(a) achieves the best scores. However, its number of edges scales poorly and causes out-of-memory errors when training with attention weighted pooling within our environment. In practical use, the β-skeleton graph appears to yield the best results for our purpose.

5.3 PubLayNet Evaluations

The PubLayNet dataset has five types of layout elements: text, title, list, figure and table. For our task, we take text and title bounding boxes as paragraph ground truth, and set all other types as “don’t-care” for both training and testing.

Dataset                                  Line start     Line end       Edge clustering
PubLayNet                                0.998/0.992    0.992/0.990    0.994/0.997
Web synthetic                            0.995/0.996    0.994/0.997    0.978/0.980
Augmented web synthetic                  0.988/0.986    0.990/0.987    0.958/0.966
Combined set (augmented web synthetic)   -              -              0.949/0.953
Combined set (human annotated)           -              -              0.901/0.912
TABLE II: Precision/recall pairs of the two GCN models' classification tasks during training with different datasets. For the combined set, only line clustering is trained, and scores are reported separately on its augmented web synthetic and human annotated portions. The β-skeleton graph is used for all tasks.
Graph type                   Pooling method   Line start     Line end
β-skeleton                   Average          0.982/0.978    0.981/0.978
β-skeleton                   Attention        0.988/0.986    0.990/0.987
Line-of-sight                Average          0.983/0.985    0.984/0.988
Line-of-sight                Attention        -/-            -/-
Axis-aligned line-of-sight   Average          0.972/0.974    0.970/0.971
Axis-aligned line-of-sight   Attention        0.973/0.973    0.964/0.978
TABLE III: Precision/recall pairs of the line splitting model using different types of graphs on the augmented web synthetic set. Both average pooling and attention weighted pooling are tested for message aggregation in graph convolutions.


Fig. 12: Representative PubLayNet examples of paragraphs by OCR followed by GCN line splitting and line clustering.


Fig. 13: Paragraph errors in PubLayNet examples caused by various types of failures including OCR detection, line splitting and line clustering. (a) Under splitting. (b) Over splitting. (c) Clustering errors for normal text and math equations. (d) Clustering error across table boundary line.

Table IV shows that F-RCNN-Q matches the mAP scores in [36]. The GCN models are worse in this metric because there is only one point in the precision-recall curve, and the OCR engine is not trained to produce bounding boxes that match the ground truth. In the bottom row of Table IV, “OCR + Ground Truth” is computed by clustering OCR words into paragraphs based on ground truth boxes, which is the upper bound for all post-OCR methods. For mAP scores, even the upper bound is lower than the scores of image based models. However, if we measure by F1 scores defined in subsection 5.1.3, OCR + GCNs can match image based models with a slight advantage.

Model                Training Set              mAP     F1
F-RCNN [36]          PubLayNet training        0.910   -
M-RCNN [36]          PubLayNet training        0.916   -
F-RCNN-Q             PubLayNet training        0.914   0.945
Tesseract [27]       -                         0.571   0.707
OCR + Heuristic      -                         0.302   0.364
OCR + GCNs           Augmented web synthetic   0.748   0.867
OCR + GCNs           PubLayNet training        0.842   0.959
OCR + Ground Truth   -                         0.892   0.997
TABLE IV: Paragraph mAP@IoU[.50:.95] score and F1 comparison on the PubLayNet development set. Numbers in the first 2 rows are from [36].

The high F1 score of "OCR + Ground Truth" also shows that the OCR engine we use has a very high recall on text detection. The reason it is lower than 1 is mostly ground truth variation – a small fraction of single-line paragraphs have IoU lower than 0.5.

Fig. 12 shows some GCN produced examples where all the paragraphs are correctly identified. Errors made by the GCN models (or the OCR engine) are shown in Fig. 13 with four examples:

  1. Under splitting – the top line (marked red) should have been split into two. This usually causes large IoU drop and cannot be recovered by line clustering.

  2. Over splitting.

  3. Clustering errors among text lines, and also on a math equation together with detection errors.

  4. A table annotation is clustered with a table cell across a boundary line, because our models do not take image features and ignore non-text lines.

5.4 Synthetic Dataset Evaluations

The synthetic dataset from web scraping can give a more difficult test for these models by its aggressive style variations.


Fig. 14: Representative examples of real-world images with OCR followed by GCN line splitting and line clustering. Blue boxes: words; green boxes: lines; yellow boxes: paragraphs; pink line segments: positive line clustering predictions.


Fig. 15: Paragraph errors in real-world images. (a) Under splitting. (b) Over splitting. (c) Over clustering table elements.

In Table V, we can see that the F1 score of the image based F-RCNN-Q model decreases sharply as the task difficulty increases. On the synthetic dataset where the images are augmented with rotations and projections as in Fig. 10, detection is essentially broken, not only from the non-max suppression drops shown in Fig. 2, but also from much worse box predictions.

Model        Training/Testing Set       F1
F-RCNN-Q     PubLayNet training/dev     0.945
F-RCNN-Q     Web synthetic              0.722
F-RCNN-Q     Augmented web synthetic    0.547
OCR + GCNs   PubLayNet training/dev     0.959
OCR + GCNs   Web synthetic              0.830
OCR + GCNs   Augmented web synthetic    0.827
TABLE V: Paragraph F1 score comparison across different types of models and datasets.

In contrast, the GCN models are much less affected by data augmentations and layout style variations. Especially between augmented and non-augmented datasets, the F1 score change is minimal. So GCN models will have greater advantage when input images are non axis-aligned.

5.5 Real-World Dataset Evaluations

The human annotated dataset can potentially show the models' performance in real-world applications. Since the annotated set is relatively small, the F-RCNN-Q model needs to be pre-trained on other paragraph sets, while the GCN models are small enough that the line clustering model can be trained entirely on the paragraph annotations. The evaluation metric for this set is the F1 score at a fixed IoU threshold of 0.5.

Model                Training Data                               F1@IoU0.5
F-RCNN-Q             Augmented web synthetic                     0.030
F-RCNN-Q             Annotated data (pre-trained on PubLayNet)   0.607
OCR + Heuristic      -                                           0.602
OCR + GCNs           Augmented web synthetic                     0.614
OCR + GCNs           Annotated data                              0.671
OCR + GCNs           Combined set                                0.671
OCR + Ground Truth   -                                           0.960
TABLE VI: Paragraph F1-scores tested on the real-world test set with paragraph annotations. A fixed IoU threshold of 0.5 is used since there is no line-level ground truth to support variable thresholds.

Table VI shows comparisons across different models and different training sets. All the models should handle image rotations and perspective transformations, so we only compare models trained on the augmented web synthetic set or the human annotated set. First, we can see that Faster R-CNN trained from synthetic web rendered pages does not work at all for real-world images, whereas the GCN models can generalize well from synthetic training data.

Also note that most of the annotated images are nearly axis-aligned, so the GCN models will yield even greater advantage if the images are rotated or taken with varied camera angles.

Fig. 14 and Fig. 15 show six examples of OCR + GCNs produced paragraphs. The successful examples in Fig. 14 are all difficult cases for heuristic and detection based approaches but are handled well by the GCN models. The image on the right shows the effectiveness of training with augmented web synthetic data, as there are no similar images in the annotated set. Error examples produced by GCN are shown in Fig. 15:

  1. Under splitting: the caption under the top-right picture is not split from the paragraph on the left, causing downstream errors.

  2. Over splitting: two lines in the middle are mistakenly split, but the short line segments are then clustered back into the same paragraph, resulting in a correct final output.

  3. Over clustering table elements: since tables are “don’t-care” regions in the training data, the GCN models trained with paragraph data may take table elements as sparse text lines and incorrectly cluster them together. A table detector may help to filter out these lines for paragraphs.

To verify the robustness of the GCN models for language and script diversity, we test them on a multi-language evaluation set. The models are trained with both synthetic and human annotated data in English, and additional synthetic data from Wikipedia pages in Chinese, Japanese and Korean. No additional data in other Latin-script languages is needed, as the English data is sufficient to represent the layout styles.

Table VII shows the F1-scores across multiple languages. F-RCNN-Q is not evaluated for the three Asian languages, because we do not have suitable training data, and Table VI indicates that synthetic training data is not useful for this model. The GCN models produce the best results in almost all the languages tried, once again showing good generalizability.

Language   OCR + Heuristic   F-RCNN-Q   OCR + GCNs   OCR + Ground Truth
English    0.429             0.513      0.544        0.890
French     0.438             0.557      0.553        0.885
German     0.427             0.538      0.566        0.873
Italian    0.455             0.545      0.556        0.862
Spanish    0.449             0.597      0.616        0.885
Chinese    0.370             -          0.485        0.790
Japanese   0.398             -          0.487        0.772
Korean     0.400             -          0.547        0.807
TABLE VII: F1@IoU0.5 scores tested on the multi-language evaluation set.

The GCN models are also flexible in handling text lines written in vertical directions, which are common in Japanese and Chinese, and also appear in Korean. Although we don’t have much training data with vertical lines, the bounding box structures of lines and symbols in these languages remain the same when the lines are written vertically, as if they were written horizontally while the image is rotated clockwise by 90 degrees. Fig. 16 shows such an example. Since our models are trained to handle all rotation angles, such paragraphs can be correctly identified.

6 Conclusions and Future Work

We demonstrate that GCN models can be powerful and efficient for the task of paragraph estimation. Provided with a good OCR engine, they can match image based models with much lower requirement on training data and computation resources, and significantly beat them on non-axis-aligned inputs with complex layout styles. The graph convolutions in these models give them unique advantages in dealing with different levels of page elements and their relations.

Future work includes model performance improvement through both training data and model architectures. Training data can be made more realistic by tuning the web scraping pipeline and adding more complex degradation transformations such as wrinkling effects on document pages. Also, alternative model architectures and graph structures mentioned in subsection 3.5 may improve quality and performance.

Another aspect of the future work is to extend the GCN models’ capability to identify more types of entities and extract document structural information such as reading order. Some entities like titles and list items are similar to paragraphs, while some others like tables and document sections are not straightforward to handle with our proposed models. Image based CNNs may be needed with their outputs used as node or edge features in the GCN model, so that non-text components in the document (e.g. checkboxes, table grid lines) can be captured. In addition, reading order among entities is a necessary step if we want to identify semantic paragraphs that span across multiple columns/pages.

Fig. 16: Example of paragraphs from text lines with vertical writing direction.

Acknowledgments

The authors would like to thank Chen-Yu Lee, Chun-Liang Li, Michalis Raptis, Sandeep Tata and Siyang Qin for their helpful reviews and feedback, and to thank Alessandro Bissacco, Hartwig Adam and Jake Walker for their general leadership support in the overall project effort.

Footnotes

  1. A semantic paragraph can span over multiple text columns or pages. In this paper, we only look for physical paragraphs where lines of contiguous indices are always physically proximate. Moreover, we regard stand-alone text spans such as titles and headings as single-line paragraphs.
  2. In current use in products and services at the time of this writing.
  3. The middle line points are added so that no edges can go through the boxes.
  4. Requiring the two GT lines to be adjacent means there are no skip-line positive edges, which makes the task easier to learn. For datasets without line level ground truth labels, this condition is replaced by "there is no path in the β-skeleton graph shorter than the edge connecting the two lines".
  5. https://cloud.google.com/vision/docs/fulltext-annotations

References

  1. M. d. Berg, O. Cheong, M. v. Kreveld and M. Overmars (2008) Computational geometry: algorithms and applications. 3rd ed. edition, Springer-Verlag TELOS, Santa Clara, CA, USA. External Links: ISBN 3540779736 Cited by: item 2, §3.1.
  2. T. M. Breuel (2003) An algorithm for finding maximal whitespace rectangles at arbitrary orientations for document layout analysis. In Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings., Vol. , pp. 66–70 vol.1. External Links: Document Cited by: §2.1.
  3. T. M. Breuel (2003) High performance document layout analysis. In Proceedings of the Symposium on Document Image Understanding Technology, Greenbelt, MD, pp. 209–218. Cited by: §1, §2.1, §3.3.
  4. R. Cattoni, T. Coianiz, S. Messelodi and C. M. Modena (1998) Geometric layout analysis techniques for document image understanding: a review. Technical report Technical Report 9703-09, IRST, Trento, Italy. Cited by: §1.
  5. C. Clausner, A. Antonacopoulos and S. Pletschacher (2019) ICDAR2019 competition on recognition of documents with complex layouts - RDCL2019. In 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, September 20-25, 2019, pp. 1521–1526. External Links: Link, Document Cited by: §6.
  6. B. L. Davis, B. S. Morse, S. Cohen, B. L. Price and C. Tensmeyer (2019) Deep visual template-free form parsing. In 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, September 20-25, 2019, pp. 134–141. External Links: Link, Document Cited by: §3.1.
  7. D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama and R. Garnett (Eds.), Vol. 28, pp. 2224–2232. Cited by: §3.2.
  8. Y. Fujii, K. Driesen, J. Baccash, A. Hurst and A. C. Popat (2017) Sequence-to-label script identification for multilingual OCR. In 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017, Kyoto, Japan, November 9-15, 2017, pp. 161–168. External Links: Link, Document Cited by: §3.
  9. J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals and G. E. Dahl (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp. 1263–1272. Cited by: §3.2.
  10. W. L. Hamilton, Z. Ying and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan and R. Garnett (Eds.), pp. 1024–1034. External Links: Link Cited by: §3.2.
  11. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 770–778. External Links: Document Cited by: §5.1.4.
  12. K. He, G. Gkioxari, P. Dollár and R. B. Girshick (2020) Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 42 (2), pp. 386–397. External Links: Link, Document Cited by: §2.2.
  13. W. Horak (1985-10) Office document architecture and office document interchange formats: current status of international standardization. Computer 18, pp. 50–60. Cited by: §1.
  14. Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, P. Fu and Z. Luo (2018) R2 cnn: rotational region cnn for arbitrarily-oriented scene text detection. In 2018 24th International Conference on Pattern Recognition (ICPR), Vol. , pp. 3610–3615. External Links: Document Cited by: 2nd item.
  15. T. Kanungo and R. M. Haralick (1999) An automatic closed-loop methodology for generating character groundtruth for scanned documents. IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (2), pp. 179–183. External Links: Document Cited by: §4.
  16. D. G. Kirkpatrick and J. D. Radke (1985) A framework for computational morphology. Machine Intelligence and Pattern Recognition 2, pp. 217–248. External Links: Link, Document Cited by: §1, §3.1, §3.
  17. Y. Lecun, L. Bottou, Y. Bengio and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. External Links: Document Cited by: §6.
  18. J. Lee, H. Hayashi, W. Ohyama and S. Uchida (2019) Page segmentation using a convolutional neural network with trainable co-occurrence features. In 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, September 20-25, 2019, pp. 1023–1028. External Links: Link, Document Cited by: §2.3.
  19. G. Li, M. Müller, A. K. Thabet and B. Ghanem (2019) DeepGCNs: can gcns go as deep as cnns?. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp. 9266–9275. External Links: Link, Document Cited by: §3.5, §3.5.
  20. Q. Li, Z. Han and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, S. A. McIlraith and K. Q. Weinberger (Eds.), pp. 3538–3545. Cited by: §3.5.
  21. D. Liben-Nowell and J. Kleinberg (2007-05) The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology 58 (7), pp. 1019–1031. External Links: ISSN 1532-2882 Cited by: §3.4.
  22. A. L. L. M. Maia, F. D. Julca-Aguilar and N. S. T. Hirata (2018) A machine learning approach for graph-based page segmentation. In 31st SIBGRAPI Conference on Graphics, Patterns and Images, SIBGRAPI 2018, Paraná, Brazil, October 29 - Nov. 1, 2018, pp. 424–431. External Links: Link, Document Cited by: §2.3.
  23. D. Niyogi and S. N. Srihari (1986) A rule-based system for document understanding. In Proceedings of the Fifth AAAI National Conference on Artificial Intelligence, AAAI’86, pp. 789–793. Cited by: §1, §2.1.
  24. S. Ren, K. He, R. B. Girshick and J. Sun (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39 (6), pp. 1137–1149. External Links: Link, Document Cited by: §2.2.
  25. P. Riba, A. Dutta, L. Goldmann, A. Fornés, O. R. Terrades and J. Lladós (2019) Table detection in invoice documents by graph neural networks. In 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, September 20-25, 2019, pp. 122–127. External Links: Link, Document Cited by: §2.4, §2, §3.1, §4.3.
  26. F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner and G. Monfardini (2009) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. External Links: Document Cited by: §1.
  27. R. Smith (2007) An overview of the Tesseract OCR engine. In Proc. 9th IEEE Intl. Conf. on Document Analysis and Recognition (ICDAR), pp. 629–633. Cited by: §2, §3, TABLE IV.
  28. R. W. Smith (2009) Hybrid page layout analysis via tab-stop detection. In 10th International Conference on Document Analysis and Recognition, ICDAR 2009, Barcelona, Spain, 26-29 July 2009, pp. 241–245. External Links: Link, Document Cited by: §1, §2.1, §3.3.
  29. S. N. Srihari and G. W. Zack (1986) Document image analysis. In Proceedings of the 8th International Conference on Pattern Recognition, pp. 434–436. Cited by: §1, §2.1.
  30. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin (2017) Attention is all you need. CoRR abs/1706.03762. External Links: Link, 1706.03762 Cited by: §3.2, §5.1.2.
  31. P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò and Y. Bengio (2018) Graph attention networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, External Links: Link Cited by: §3.2, §5.1.2.
  32. Wikipedia, the free encyclopedia. External Links: Link Cited by: §1, §4.1, 2nd item.
  33. Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang and P. S. Yu (2020) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems (), pp. 1–21. Cited by: §1, §3.2.
  34. X. Yang, E. Yumer, P. Asente, M. Kraley, D. Kifer and C. L. Giles (2017) Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §2.3.
  35. M. Zhang and Y. Chen (2018) Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 5171–5181. External Links: Link Cited by: §3.4.
  36. X. Zhong, J. Tang and A. Jimeno-Yepes (2019) PubLayNet: largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition, ICDAR 2019, Sydney, Australia, September 20-25, 2019, pp. 1015–1022. External Links: Link, Document Cited by: §1, §1, §2.2, §4, Fig. 11, §5.1.1, §5.1.3, §5.1.3, §5.1.4, §5.3, TABLE IV.