General-Purpose OCR Paragraph Identification by Graph Convolution Networks
Paragraphs are an important class of document entities. We propose a new approach for paragraph identification by spatial graph convolution networks (GCN) applied on OCR text boxes. Two steps, namely line splitting and line clustering, are performed to extract paragraphs from the lines in OCR results. Each step uses a β-skeleton graph constructed from bounding boxes, where the graph edges provide efficient support for graph convolution operations. With only pure layout input features, the GCN model size is 3 to 4 orders of magnitude smaller compared to R-CNN based models, while achieving comparable or better accuracies on PubLayNet and other datasets. Furthermore, the GCN models show good generalization from synthetic training data to real-world images, and good adaptivity for variable document styles.
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Document image understanding is a task to recognize, structure, and understand the contents of document images, and is a key technology to digitally process and consume such document images. If we regard any images containing structured text as document images, they are ubiquitous and can be found in numerous applications. Document image understanding enables the conversion of such documents into a digital format with rich structure and semantic information and makes them available for subsequent information tasks.
A document can be represented by its semantic structure and physical structure. The task of recovering the semantic structure is called logical layout analysis or semantic structure extraction, while the task of recovering the physical structure is called geometric (physical, or structural) layout analysis. These tasks are critical subproblems of document image understanding.
A paragraph is a semantic unit of writing consisting of one or more sentences that usually develops one main idea. Paragraphs are basic constituents of semantic structure and thus paragraph boundary estimation (or paragraph estimation, for short) is an important building block of logical layout analysis. Moreover, paragraphs are often appropriate as processing units for various downstream tasks such as translation and information extraction because they are self-contained and have rich semantic information. Therefore, developing a generic paragraph estimation algorithm is of great interest by itself.
Paragraphs are usually rendered in a geometric layout structure according to broadly accepted typographical rules. For example, a paragraph can be rendered as a series of text lines that
are placed with uniform vertical spacing between adjacent lines;
start with a line where one of the following is true:
The line is indented. (An indented paragraph.)
The line starts with a bullet symbol or number, with all the subsequent lines indented to be left-justified, flush with the first. (A list item.)
The vertical spacing above the first line is significantly larger than the uniform spacing between subsequent lines. (A block paragraph.)
As such, there are usually clear visual cues to identify paragraphs.
Previous studies have attempted to develop a paragraph estimation method by defining handcrafted rules based on careful observations [23, 29, 3, 28] or by learning an object detection model to identify the regions of paragraphs from an image [34, 36]. For the former approaches, it is usually challenging to define a robust set of heuristics even for a limited domain, and hence machine-learning-based solutions are generally preferable. The latter approaches tend to have difficulty dealing with diverse aspect ratios and text shapes, and the wide range of degradations observed in real-world applications such as image skews and perspective distortions.
In this paper, we propose to apply graph convolution networks (GCNs) in a post-processing step of an optical character recognition (OCR) system for paragraph estimation. Modern OCR engines can detect and recognize texts with a very high recall for documents in a variety of conditions. Indeed, as will be shown in the experiments, our generic OCR system can detect and recognize texts with a higher recall than a specialized image-based paragraph detector, indicating little risk of missing correct paragraph boundaries by restricting to the possibilities generated by the OCR engine. That motivates us to employ the post-processing strategy rather than a pre-processing or an entangled approach. Recent advancements in graph neural (convolutional) networks [26, 33] have enabled deep learning on non-Euclidian data. GCNs can learn spatial relationships among entities combining information from multiple sources and provide a natural way to learn the non-linear mapping from OCR results to paragraphs.
More specifically, we design two classifiers based on GCNs — one for line splitting and one for line clustering. A word graph is constructed for the first stage and a line graph is constructed for the second stage. Both graphs are constructed with the β-skeleton algorithm, which produces a graph with good connectivity and sparsity.
To fully utilize the models’ capability, it is desirable to have a diverse set of document styles in the training data. We create synthetic data sets from web pages where the page styles are randomly modified in the web scraping engine. By leveraging open web sites like Wikipedia  for source material to render in randomized styles, we have access to practically unlimited document data.
We evaluate the 2-step models on both the PubLayNet  and our own datasets. We show that GCN based models can be small and efficient by taking OCR produced bounding boxes as input, and are also capable of generating highly accurate results. Moreover, with synthesized training data from a browser-based rendering engine, these models can be a step towards a reverse rendering engine that recovers comprehensive layout structure from document images.
This paper is organized as follows: Section 2 reviews related work. Section 3 presents our proposed method, where the graph construction method and the details of each step of the algorithms are described. Section 4 explains training data generation methods with web scraping. Experimental setups and results are given in Section 5. Section 6 concludes the paper with suggestions for future work.
2 Related Work
OCR layout analysis ("layout" for short) comprises a large variety of problems that have been studied from different aspects. There is pre-recognition layout analysis, which finds text lines as the input of recognition, and post-recognition layout analysis, which finds higher-level entities based on OCR recognition results. We list selected studies that are most relevant to our problem in the following subsections.
2.1 Geometric and Rule-based Approaches
Multi-column text, often with small column gaps, needs to be identified before paragraphs. Early studies have proposed geometric methods [3, 2] and rule-based methods [23, 29, 28]. Both categories have algorithms that find column gaps by searching for whitespace or by checking text alignment.
Limitations of these approaches include susceptibility to input noise and false positive column boundaries, especially with monospace font families.
2.2 Image Based Detection
The PubLayNet paper provides a large dataset for multiple types of document entities, as well as two object detection models, F-RCNN and M-RCNN, trained to detect these entities. Both show good metrics in evaluations, but also have some disadvantages in detecting paragraphs:
Cost: Object detection models are typically large in size and expensive in computation. When used together with an OCR engine to retrieve text paragraphs, it seems wasteful to bypass the OCR results and attempt to detect paragraphs independently.
Quality: Paragraph bounding boxes may have high aspect ratios and are sometimes tightly packed, making it difficult for Faster R-CNN detection. In Fig. 2, several short paragraphs are printed with dense text and rotated by 45 degrees. The region proposals required to detect all the paragraphs are highly overlapped, so some detections will be dropped by non-maximum suppression (NMS). Rotational R-CNN models  can mitigate this issue by inclined NMS, but further increase the computational cost while still facing a more difficult task with rotated or warped inputs.
2.3 Page Segmentation
Page segmentation models [34, 18, 22] classify every part of the image to certain types of objects such as text, table, image and background. Sometimes the shapes of paragraphs can be revealed by the “text” part of the segmentation. However, when text is dense and paragraphs are indentation based without variation in line spacings, individual paragraphs cannot be easily extracted from large connected text regions. On the other hand, when text is sparse and appears as a lot of separate small components, paragraphs are not obvious in the segmentation result either.
2.4 Graph Neural Network for Table Detection
A graph neural network approach has been proposed to detect tables in invoice documents. It shows that tabular structures can be detected purely from structural information by graph neural networks.
Limitations of this approach include graph construction and graph representation. First, the visibility graph is built by only connecting pairs of pre-defined entities that are vertically or horizontally visible, which requires the input image to be free of skews and distortions. Second, the adjacency matrix learned by the GNN is $O(n^2)$ in size and hence inefficient for large inputs. A general-purpose post-OCR model will need to overcome these limitations to accommodate all types of input images and achieve high computational efficiency.
3 Paragraph Estimation with Graph Convolution Networks
A paragraph consists of a set of text lines, which are usually produced in the output of OCR systems [8, 27]. If text lines are given by OCR systems, we can consider a bottom-up approach to cluster text lines into paragraphs for the paragraph estimation task.
The detected lines provide rudimentary layout information but may not match the true text lines. For example, in Fig. 3, the lower section of the page consists of two text columns placed close to each other. The line detector might be confused by the tiny spacing and find wrong lines spanning both columns. These lines need to be split in the middle before being clustered into paragraphs.
Line splitting and clustering are non-trivial tasks for general-purpose paragraph estimation – the input images can be skewed or warped, and the layout styles can vary among different types of documents, e.g. newspapers, books, signs, web pages, handwritten letters, etc. Even though the concept of paragraph is mostly consistent across all document categories, the appearance of a paragraph can differ by many factors such as word spacing, line spacing, indentation, text flowing around figures, etc. Such variations make it difficult, if not impossible, to have a straightforward algorithm that identifies all the paragraphs.
In order to address erroneous line detection and solve the non-trivial splitting and clustering problems, we design a paragraph identification method as a 2-step process after the main OCR engine produces line and word bounding boxes. Both steps use a graph convolution network (GCN) that takes input features from bounding boxes in the OCR result, together with a β-skeleton graph constructed from these boxes. Neither the original image nor text transcriptions are included in the input, so the models are small, fast, and entirely focused on the layout structure.
Step 1: Line splitting. Raw text lines from OCR line detectors may cross multiple columns, and thus need to be split into shorter lines. A GCN node classifier predicts splitting points in lines.
Step 2: Line clustering. After step 1 produces refined lines, they are clustered into paragraphs. A GCN edge classifier predicts clustering operations on pairs of neighboring lines.
The following subsections describe these steps in detail. In addition, we discuss the possibility of an alternative one-step process.
3.1 β-skeleton on Boxes
A graph is a key part of the GCN model input. We want a graph with high connectivity for effective message passing in graph convolutions, while also being sparse for computational efficiency.
Visibility graphs have been used in previous studies [6, 25], where edges are made by "lines-of-sight". They are not considered suitable for our models because the lines may create excessive edges. Fig. 4(a) shows the visibility graph built on two rows of boxes, where any pair of boxes on different rows is connected. This means word connections between text lines may produce $O(n^2)$ edges. If we limit the lines-of-sight to be axis aligned like Fig. 4(b), then the graph becomes too sparse, even producing disconnected components in some cases.
By changing "lines-of-sight" into "balls-of-sight", we get a β-skeleton graph with $\beta = 1$. In such a graph, two boxes are connected if they can both touch a circle that does not intersect any other boxes. It provides a good balance between connectivity and sparsity. As shown in Fig. 4(c), a β-skeleton graph does not have too many connections between rows of boxes. With $\beta = 1$, it is a subgraph of a Delaunay triangulation, with the number of edges bounded by $3n - 6$. Yet, it provides good connectivity within any local cluster of boxes, and the whole graph is guaranteed to be one connected component.
The original β-skeleton graph is constructed on a point set. To apply it to bounding boxes, we use an algorithm illustrated in Fig. 5 and described in the following steps, where $n$, used for complexity analysis, is the number of boxes. We assume the length and width of all the input boxes are bounded by a constant.
For each box, pick a set of peripheral points at a pre-set density, and pick a set of internal points along the longitudinal middle line.
Build a Delaunay triangulation graph $D$ on all the points. (Time complexity $O(n \log n)$.)
Find all the "internal" points that are inside at least one of the boxes. (Time complexity $O(n)$ by traversing along $D$'s edges inside each box starting from any peripheral point. Internal points are marked grey in Fig. 5.)
Add an edge of length 0 for each pair of intersecting boxes (containing each other’s peripheral points).
Pick β-skeleton edges from $D$, where for each edge $e = (u, v)$, both its vertices $u$ and $v$ are non-internal points and the circle with $e$ as diameter does not cover any other point.
If there are points covered by the circle, then the point $p$ among them closest to $e$ must be a neighbor of either $u$ or $v$ (in Delaunay triangulation graphs). Finding such $p$ takes $O(1)$ time for each edge, since $D$ produced in step 2 has edges sorted at each point.
Keep only the shortest edge for each pair of boxes as the β-skeleton edge.
The overall time complexity of this box based β-skeleton graph construction is $O(n \log n)$, dominated by the Delaunay triangulation. There are pathological cases where step 4 will need $O(n^2)$ time, e.g. when all the boxes contain a common overlapping point. But these cases are easily excluded from OCR results. The total number of edges is bounded by $3n - 6$ as in $D$, so the graph convolution layers have linear time operations.
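As a concrete illustration of the circle test above, the following Python sketch computes the $\beta = 1$ skeleton (also known as the Gabriel graph) on plain points. It uses a naive $O(n^3)$ all-pairs check instead of the Delaunay-based pruning described in the steps above; the function name and brute-force strategy are ours, for illustration only.

```python
from itertools import combinations

def gabriel_edges(points):
    """Naive beta-skeleton for beta = 1 (the Gabriel graph): keep edge
    (i, j) iff the circle with segment ij as diameter contains no third
    point strictly inside. O(n^3); the construction in the paper prunes
    candidates with a Delaunay triangulation to reach O(n log n)."""
    edges = []
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        # Circle with the segment pq as diameter.
        cx, cy = (p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0
        r2 = ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) / 4.0
        ok = True
        for k, s in enumerate(points):
            if k in (i, j):
                continue
            if (s[0] - cx) ** 2 + (s[1] - cy) ** 2 < r2:
                ok = False  # s lies strictly inside the diameter circle
                break
        if ok:
            edges.append((i, j))
    return edges
```

Because the $\beta = 1$ skeleton contains the Euclidean minimum spanning tree, the resulting graph is always one connected component, matching the connectivity claim above.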
3.2 Message Passing on Graphs
We use spatial-based graph convolution networks (GCNs) [33, 7] for both tasks of line splitting and line clustering, since both can leverage the local spatial feature aggregation and combinations across graph edges (more details in subsections 3.3 and 3.4 below).
Our graph convolution network resembles the message passing neural network (MPNN) and GraphSage. We use the term "message passing phase" to describe the graph level operations in our models. In this phase, repeated steps of "message passing" are performed based on a message function $M$ and a node update function $U$. At step $t$, a message $M(h_v^t, h_u^t)$ is passed along every edge $(u, v)$ in the graph, where $h_u^t$ and $h_v^t$ are the hidden states of nodes $u$ and $v$. Let $N(v)$ denote the neighbors of node $v$ in the graph; the aggregated message by average pooling received by $v$ is

$$m_v^{t+1} = \frac{1}{|N(v)|} \sum_{u \in N(v)} M(h_v^t, h_u^t),$$

and the updated hidden state is

$$h_v^{t+1} = U(h_v^t, m_v^{t+1}).$$

Alternatively, we can use attention weighted pooling to enhance message aggregation. Consequently, the model is also called a graph attention network (GAT), where the calculation of $m_v^{t+1}$ is replaced by

$$m_v^{t+1} = \sum_{u \in N(v)} \alpha_{vu} M(h_v^t, h_u^t),$$

and $\alpha_{vu}$ is computed from a shared attention mechanism $a$:

$$\alpha_{vu} = \operatorname{softmax}_u \big( a(h_v^t, h_u^t) \big).$$

For $a$, we use the self-attention mechanism introduced in the graph attention network literature.
In our GCN models, the message passing steps are applied on the β-skeleton graph constructed on OCR bounding boxes, so that structural information can be passed around local vicinities along graph edges, and potentially be combined and extracted into useful signals. Both the line splitting step and the line clustering step rely on this mechanism to make predictions on graph nodes or edges.
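A minimal sketch of one average-pooling message passing step, with pluggable message and update functions; the function names and the toy $M$ and $U$ used below are illustrative stand-ins, not the trained model's learned functions.

```python
import numpy as np

def message_passing_step(h, edges, M, U):
    """One mean-pooling message passing step.
    h: (n, d) array of node hidden states; edges: undirected (u, v) pairs;
    M(h_v, h_u): message function; U(h_v, m_v): node update function."""
    n = h.shape[0]
    neighbors = [[] for _ in range(n)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    h_next = h.copy()
    for v in range(n):
        if not neighbors[v]:
            continue  # isolated node keeps its state
        msgs = [M(h[v], h[u]) for u in neighbors[v]]
        m_v = np.mean(msgs, axis=0)  # average pooling over N(v)
        h_next[v] = U(h[v], m_v)
    return h_next
```

For example, with $M$ returning the neighbor's state and $U$ averaging it with the current state, one step on a 3-node path graph smooths each node toward its neighborhood mean.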
3.3 Splitting Lines
As in [3, 28], if multi-column text blocks are present in a document page, splitting lines across columns is a necessary first step for finding paragraphs. Here we have the same objective but a different input: OCR bounding boxes are available for each word and symbol, so image processing can be skipped to accelerate computation.
Note that the horizontal spacings between words are not a reliable signal for this task: when the typography alignment of the text is "justified," i.e. the text falls flush with both sides, these word spacings may be stretched to fill the full column width. In Fig. 3, the bottom left line has word spacings larger than the column gap. This is common in documents with tightly packed text such as newspapers.
We use the GCN model shown in Fig. 6 to predict the splitting points, or tab-stops. Each graph node is a word bounding box. Graph edges are the β-skeleton edges built as described in section 3.1. The model output contains two sets of node classification results – whether each word is a "line start" and whether it is a "line end". This model is expected to work well for difficult cases like dense text columns with "justified" alignment by aggregating signals from words in multiple lines surrounding the potential splitting point.
Fig. 8 shows a zoomed-in area of Fig. 3 with a β-skeleton graph constructed from the word bounding boxes. Since words are aligned on either side of the two text columns, a set of words with their left edges all aligned are likely on the left side of a column, i.e. these words are line starts. Similarly, a set of words with right edges all aligned are likely on the right side and are line ends. The β-skeleton edges are guaranteed to connect aligned words in neighboring lines, since aligned words have the shortest distance between the two lines and there is nothing in between to block the connection. Thus, the alignment signal can be passed around in the message passing steps and be effectively learned by the GCN model. Moreover, the two sets of words beside the column gap are also connected with β-skeleton edges crossing the two columns, so the signals can be mutually strengthened.
3.4 Clustering Lines
After splitting all the lines into "true" lines, the remaining task is to cluster them into paragraphs. Again we use a graph convolution network, but now each graph node is a line bounding box, and the output is edge classification similar to link prediction in [21, 35]. We define a positive edge to be one that connects two consecutive lines in the same paragraph. Note that it is possible for non-consecutive lines in the same paragraph to be connected by a β-skeleton edge. Such edges are defined as negative to make the task easier to learn.
Fig. 7 is an overview of the line clustering model. It looks similar to the line splitting model in Fig. 6, except that the input consists of line bounding boxes, and the output predictions are on graph edges instead of nodes. An additional "node-to-edge" step is necessary to enable edge classification from the node-level output of the graph convolution steps. It works in a similar way as the first half of a graph convolution step, with node aggregation replaced by edge aggregation: the final hidden states of an edge's two endpoint nodes are aggregated into an edge representation, from which the classification output is computed.
The model predicts, for each pair of lines connected by a β-skeleton edge, whether the two lines belong to the same paragraph. The predictions are made from multiple types of context.
Vertical spacing: "Block paragraphs" are separated by extra vertical spacing, which is common in web pages. Line spacing signals are passed around in graph convolutions to detect vertical space variations.
Indentation: An indented first line usually marks the start of a new paragraph, while the following lines are flush with the column's left edge.
List items: The first line of a list item is usually outdented with a bullet point or a number, and the first word after it is flush with the following lines. So list items can be detected in a similar way as indentation based paragraphs.
Besides the three common types listed above, we may have other forms of paragraphs such as mailing addresses, computer source code or other customized structures. The model can be trained on different types of layout data.
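The "node-to-edge" step of the clustering model can be sketched as follows. The symmetric combination [h_u + h_v, |h_u - h_v|] is our illustrative choice, picked so that the edge feature does not depend on endpoint order, which suits an undirected "same paragraph" prediction; the paper does not commit to a specific combination here.

```python
import numpy as np

def node_to_edge(h, edges):
    """Build one feature vector per beta-skeleton edge from the final
    hidden states of its two endpoint nodes. The concatenation of the
    elementwise sum and absolute difference is symmetric in (u, v)."""
    feats = []
    for u, v in edges:
        feats.append(np.concatenate([h[u] + h[v], np.abs(h[u] - h[v])]))
    return np.stack(feats)
```

An edge classifier head (e.g. a small MLP with a sigmoid output) would then map each edge feature to a "same paragraph" probability.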
3.5 Possibility of Clustering Words
If a 1-step model can cluster the words directly into paragraphs, it will be preferable to the 2-step GCN models described above. A single model is not only faster to train and run, but can also avoid cascading errors where the first step’s mistake propagates to the second.
However, there is significant difficulty for a 1-step GCN model to work on the paragraph word clustering problem. First, word based GCN models may not have good signal aggregation for line level features because of the limited number of graph convolution layers. The oversmoothing effect [20, 19] limits the depth of the network, i.e. the number of "message passes". With β-skeleton graphs mostly consisting of local connections, the "receptive field" on each graph node is small and often cannot cover a whole line. For instance, a word at the end of a line has no information on whether this line is indented. In a general purpose paragraph model where the input can be noisy and deformed, this limitation can severely affect model performance.
While it is possible to extend the receptive fields by adding non-local edges in the graph, or employing residual connections and dilated convolutions  in the model, it is non-trivial to build a scalable and effective solution. This is an interesting topic for further research, but not the focus of this paper.
4 Synthetic Training Data from Web
A large set of diverse and high quality annotated data is a necessity for training deep neural networks. Such datasets are not readily available for paragraphs and layout-related tasks. The PubLayNet dataset is a very large annotated set, but lacks style diversity since all the pages are from publications.
Therefore, we largely rely on automated training data generation. By taking advantage of high quality, publicly available web documents, as well as the powerful rendering engine used in modern browsers, we can generate synthetic training data with a web scraper.
4.1 Scraping Web Pages with Modified Styles
Web pages are a good source of document examples. Wikipedia  is well known to host a great number of high quality articles with free access.
We use a browser-based web scraper to retrieve a list of Wikipedia pages, where each result includes the image rendered in the browser as well as the HTML DOM (document object model) tree of that page. The DOM tree contains the complete document structure and detailed locations of all the rendered elements, from which we can reconstruct the ground truth line bounding boxes. Each line bounding box is an axis-aligned rectangle covering a line of text. For paragraph ground truth, the HTML tag p conveniently indicates a paragraph node, and all the text lines under this node belong to the same paragraph.
|Style Change||Script Sample|
|Single-column to multi-column||div.style.columnCount = 2;|
|Vertical spacing to indentation||div.style.textIndent = "30px"; div.style.marginTop = 0; div.style.marginBottom = 0;|
|Typography alignment||div.style.textAlign = "right";|
|Text column width||div.style.width = "50%";|
|Horizontal text block||div.style.marginLeft = "20%";|
|Line height/spacing||div.style.lineHeight = "150%";|
|Font||div.style.fontFamily = "times";|
One issue of using web page data directly for document layout is the lack of diversity in document styles. Almost all web pages use vertical spacing to separate paragraphs, and multi-column text is rare. Fortunately, modern web browsers support extensions that can run script code on web pages to change their CSS styles. For example, to generate double-column text for a certain division of a page, we can use “div.style.columnCount = 2.”
Table I lists a few examples of web script code for changing paragraph styles. Such pieces are randomly picked and combined in our training data pipeline. Parameters such as column count and alignment type are also randomized. Thus, the total combinations give a great diversity of styles to simulate various types of documents to be encountered in the real world.
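A data-pipeline helper in this spirit might compose random CSS mutations like those in Table I; the mutation set, parameter ranges and function names below are illustrative, not the exact ones used in our pipeline.

```python
import random

# Illustrative pool of CSS mutations, each parameterized by a RNG.
STYLE_MUTATIONS = {
    "columns": lambda rng: f"div.style.columnCount = {rng.choice([2, 3])};",
    "align":   lambda rng: f'div.style.textAlign = "{rng.choice(["left", "right", "justify"])}";',
    "width":   lambda rng: f'div.style.width = "{rng.choice([40, 50, 70])}%";',
    "leading": lambda rng: f'div.style.lineHeight = "{rng.choice([100, 150, 200])}%";',
}

def random_style_script(seed=None, k=2):
    """Compose a random style-mutation script by sampling k mutations
    and randomizing each one's parameter, as done per scraped page."""
    rng = random.Random(seed)
    picks = rng.sample(sorted(STYLE_MUTATIONS), k)
    return "\n".join(STYLE_MUTATIONS[name](rng) for name in picks)
```

Each scraped page gets a freshly sampled script, so the rendered corpus covers many combinations of column counts, alignments, widths, and line heights.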
4.2 Data Augmentation
A general-purpose OCR engine must accommodate all types of input images, including photos of text taken at different camera angles. Our model should be able to handle the same variations in input, so data augmentation is needed to transform the rectilinear data scraped from web pages into photo-like data.
To emulate the effect of camera angles on a page, we need two types of geometric transformation: rotation and perspective projection. Again, we use randomized parameters in each transformation to diversify our data. Each data point gets a random projection followed by a random rotation, applied to both the image and ground truth boxes. Fig. 10 shows two training examples with data augmentation.
The left one has Arial font, dense text lines, paragraphs separated by indentation and the camera placed near the upper-left corner.
The right one has Monospace font, sparse text lines, paragraphs separated by vertical spacing and the camera placed near the lower-right corner.
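The rotation half of this augmentation, applied to ground truth box corners, can be sketched as below; the perspective projection would additionally apply a random 3x3 homography, omitted here for brevity.

```python
import math

def rotate_points(points, angle_deg, center=(0.0, 0.0)):
    """Rotate a list of (x, y) corner points by angle_deg around center.
    Applied identically to the image and to all ground truth boxes so
    the annotations stay consistent after augmentation."""
    a = math.radians(angle_deg)
    cx, cy = center
    out = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        out.append((cx + dx * math.cos(a) - dy * math.sin(a),
                    cy + dx * math.sin(a) + dy * math.cos(a)))
    return out
```

In the pipeline, `angle_deg` (and the homography parameters) would be drawn at random per training example.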
Note that we do not need pixel-level augmentation (imaging noise, illumination variation, compression artifacts, etc.) for the training of our GCN models, because these models only take bounding box features from the OCR engine output, and are decoupled from the input image. Even when real input images look very different from the training data images, the bounding boxes from a robust OCR engine can still be consistent. It is assumed that the OCR engine has been trained to be robust to pixel-level degradation, as is the case in the present work.
4.3 Sequential 2-Step Training by Web Synthetic Data
We train the two GCN models in sequence, where the line clustering input depends on the line splitting model. For each model, the classification ground truth labels are computed by matching the OCR output to the shapes of the ground truth (GT) lines. Each GT line is a rectangle from the web rendering engine, and is transformed into a quadrilateral in data augmentation.
In the line splitting model, the graph nodes are the OCR word boxes, and the output labels are node classifications on whether each node is a “line start” and whether it is a “line end”. These labels can be computed by the following two steps.
For each OCR word, allocate it to the ground truth line that has the maximum intersection area with the word's bounding box.
For each GT line, sort the words allocated to it along its longitudinal axis. The word at the left end is a line start, the word at the right end is a line end, and the remaining words are negative on both.
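Under the simplifying assumptions of axis-aligned boxes stored as (x0, y0, x1, y1) and a horizontal longitudinal axis, the two labeling steps above can be sketched as:

```python
def intersection_area(a, b):
    """Overlap area of two axis-aligned boxes (x0, y0, x1, y1)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def line_start_end_labels(words, gt_lines):
    """Allocate each OCR word to the GT line of maximum overlap, then
    mark the leftmost word of each line as 'line start' and the rightmost
    as 'line end'. Returns (starts, ends) as sets of word indices."""
    alloc = {}
    for i, w in enumerate(words):
        j = max(range(len(gt_lines)),
                key=lambda j: intersection_area(w, gt_lines[j]))
        alloc.setdefault(j, []).append(i)
    starts, ends = set(), set()
    for idxs in alloc.values():
        idxs.sort(key=lambda i: words[i][0])  # sort along the x axis
        starts.add(idxs[0])
        ends.add(idxs[-1])
    return starts, ends
```

In the augmented data the boxes are quadrilaterals rather than axis-aligned rectangles, so the real pipeline would use polygon intersection and sort along each GT line's own longitudinal axis.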
In the line clustering model, the graph nodes are the line boxes after the line splitting process, and the output labels are edge classifications on whether an edge connects two lines that are adjacent lines in the same paragraph. For each β-skeleton edge that connects a pair of OCR line boxes $(l_1, l_2)$, we find the corresponding pair of GT line boxes $(g_1, g_2)$ by the same maximum-intersection-area rule as in the step above. The edge label is positive if $g_1$ and $g_2$ are adjacent lines in the same paragraph.
The line clustering model input is generated from line splitting on the OCR results. While there remains some risk of cascading of errors, line clustering is able to correct some mistakes in the previous line splitting step. Specifically,
For “under-splitting”, i.e. when an OCR line covers multiple GT lines, there is no way to correct it by clustering, and the training example is discarded.
For “over-splitting”, i.e. when multiple OCR lines match the same GT line, the line clustering model can cluster the over-split short lines into the same paragraph and recover the original lines. See the second picture in Fig. 15 as an example. The sequential training steps enable this error correction.
It is worth noting that the ground truth labels associated with table elements are treated as "don't-care" and assigned weight 0 in training. The reason is that tables have very different structures from paragraphed text, and the two types of entities often produce contradicting labels within the current GCN framework. Using GCNs for table detection is another interesting topic but out of the scope of this paper.
5 Experiments

We experiment with the 2-step GCN models and evaluate the end-to-end performance on both the open PubLayNet dataset and our own annotated sets.
In the end-to-end flow, the line splitting model and the line clustering model work in a sequential order. It takes an OCR result page as input, and produces a set of paragraphs each containing a set of lines, and every line in the page belongs to exactly one paragraph.
We use the OCR engine behind the Google Cloud Vision API DOCUMENT_TEXT_DETECTION feature to produce the word and line bounding boxes in our experiments.
PubLayNet contains a large number of document images with ground truth annotations: 340K in the training set and 12K in the development/validation set. The testing set ground truth had not been released at the time of this writing, so here we use the development set for testing.
For web synthetic data, we scrape 100K Wikipedia pages in English for image based model training and testing at a 90/10 split. For GCN models, 10K pages are enough. An additional 10K pages in Chinese, Japanese and Korean are scraped to train the omni-script GCN models.
We also use a human annotated set with real-world images – 25K in English for training and a few hundred for testing in each available language. The images are collected from books, documents or objects with printed text, and then sent to a team of raters who draw the ground truth polygons for all the paragraphs. Example images are shown in Fig. 14, 15 and 16.
Models and Hyperparameters
At the models' input, each graph node's feature is a vector containing the bounding box information of its word or line. The first five values include the box's width, height and rotation angle. Then, for each of the box's 4 corners, 6 more values are added. For line clustering, an additional value indicating the first word's width is added to each line for better context on line breaks and list items. These values provide the starting point for feature crossings and combinations in graph convolutions. This low dimension of model input enables lightweight and efficient computation.
Model inference latency is very low, under 20 milliseconds for an input graph of around 1500 nodes (on a 12-core 3.7GHz Xeon CPU), since each GCN model is less than 130KB in size with 32-bit floating point parameters. In fact, the computation bottleneck is the β-skeleton construction, which can take 50 milliseconds for the same graph. Compared to the main OCR process, the overall GCN latency is small, and the $O(n \log n)$ complexity ensures scalability.
We cannot claim that GCN models have better latency than image based R-CNN models, because image models can run in parallel with the OCR engine when resources allow. Instead, the small size of GCN models makes them easy to be deployed as a lightweight, low cost and energy efficient step of post-OCR layout analysis.
While the classification tasks are evaluated by precision and recall, the end-to-end performance is measured by IoU based metrics such as the COCO mAP@IoU[.50:.95], so the results are comparable with previously reported numbers.
The average precision (AP) for mAP is usually calculated over a precision-recall curve. Since our models produce binary predictions instead of scored detection boxes, we have only one output set of paragraph bounding boxes, i.e. only one point on the precision-recall curve, and the AP is computed from that single point.
We introduce another metric, F1 with variable IoU thresholds, which is more suitable for paragraph evaluation. In Fig. 11, a single-line paragraph has a lower IoU even though it is correctly detected, while a 4-line detection (in red) has a higher IoU despite a missed line. This is caused by boundary errors at character scale rather than at paragraph scale, and such errors are larger for post-OCR methods since the OCR engine is not trained to fit paragraph boxes. If we have line-level ground truth in each paragraph, we can adjust the IoU threshold for an n-line paragraph to

t(n) = n / (n + 1),

so the single-line paragraph will have IoU threshold 0.5, the 5-line one will have IoU threshold 0.833, and both cases in Fig. 11 can be more reasonably scored.
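A minimal sketch of this scoring, assuming the adjusted threshold for an n-line ground-truth paragraph is n/(n+1), which reproduces both thresholds quoted above; the matching representation is illustrative:

```python
def iou_threshold(num_lines):
    # Adjusted threshold n / (n + 1): tolerates roughly one line-height of
    # boundary error regardless of paragraph size.
    return num_lines / (num_lines + 1)

def f1_variable_iou(matches, num_predictions, num_ground_truth):
    """F1 with per-paragraph IoU thresholds.

    `matches` lists (iou, gt_num_lines) for each predicted paragraph
    matched one-to-one to a ground-truth paragraph.
    """
    tp = sum(1 for iou, n in matches if iou >= iou_threshold(n))
    precision = tp / num_predictions if num_predictions else 0.0
    recall = tp / num_ground_truth if num_ground_truth else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```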
Both PubLayNet  and our web synthetic set have line level ground truth to support this metric. For the human annotated set without line annotations, we fall back to a fixed IoU threshold of 0.5.
The 2-step GCN models are compared against image based models and the heuristic algorithm in our production system.
The image models include the Faster R-CNN and Mask R-CNN used in , which work on the PubLayNet data with non-augmented images. For broader testing on augmented datasets, we train a Faster R-CNN model with an additional quadrilateral output to indicate rotated boxes, denoted by “F-RCNN-Q” in the following subsections. This model uses a ResNet-101 backbone and is 200MB in size, smaller than the two models in , but still 3 orders of magnitude larger than the GCN models.
For reference, the baseline heuristic algorithm takes the OCR recognized text lines as input and generates paragraphs by the following steps.
1. Within each line, find white spaces between words that are significantly wider than average, and split the line at these spaces into shorter lines.
2. Starting from each line, repeatedly cluster nearby lines into its block using a distance threshold, until no more proximate lines can be found.
3. Within each block, merge lines that are roughly placed on the same straight line.
4. Within each block, start a new paragraph at each indented line.
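The clustering in step 2 can be sketched as below; the axis-aligned box representation, the overlap rule and the `max_gap` threshold are illustrative stand-ins for the hand-tuned production values:

```python
def cluster_lines(lines, max_gap=1.5):
    """Greedy line clustering: repeatedly merge groups of lines whose
    vertical gap is below `max_gap` times the line height and whose
    horizontal spans overlap.

    `lines` is a list of axis-aligned boxes (x0, y0, x1, y1).
    """
    clusters = [[box] for box in lines]

    def close(a, b):
        ax0, ay0, ax1, ay1 = a
        bx0, by0, bx1, by1 = b
        h = min(ay1 - ay0, by1 - by0)
        v_gap = max(ay0, by0) - min(ay1, by1)  # negative if overlapping
        h_overlap = min(ax1, bx1) - max(ax0, bx0)
        return v_gap < max_gap * h and h_overlap > 0

    merged = True
    while merged:  # repeat until no more proximate lines can be found
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(close(a, b) for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters
```

A greedy loop like this is exactly where fixed thresholds fail on unusual styles, which motivates replacing it with a learned edge classifier.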
These rule-based steps were intended to handle multi-column text pages, but the fixed hand-tuned parameters make them inflexible to style variations. Replacing them with the machine learned GCN models proposed here greatly enhances the algorithm’s performance and adaptivity.
5.2 GCN Classification Accuracies
We first check the metrics of the GCN classification tasks on the various training sets. Precision and recall scores of the binary classification tasks are shown in Table II. The PubLayNet data is not augmented because of the low resolution of its images, while the web synthetic data is tried both with and without augmentation.
The human annotated training set is added to train the line clustering GCN model, but not the line splitting model because it lacks dense, multi-column text pages. Therefore, only the line clustering scores are shown for the combined set in Table II. The scores on the annotated set are significantly lower because of the diverse and noisy nature of the data source.
We also compare the β-skeleton graph with the two types of “line-of-sight” graphs in Fig. 4. Since the edges differ greatly among these graphs, Table III only compares node classification scores trained on the augmented web synthetic set. When average pooling is used in graph convolutions, the free “line-of-sight” graph in Fig. 4(a) achieves the best scores. However, its size scales poorly and causes out-of-memory errors in our environment when training with attention weighted pooling. In practical use, the β-skeleton graph appears to yield the best results for our purpose.
5.3 PubLayNet Evaluations
The PubLayNet dataset has five types of layout elements: text, title, list, figure and table. For our task, we take text and title bounding boxes as paragraph ground truth, and set all other types as “don’t-care” for both training and testing.
[Table II: precision/recall of the binary classification tasks (line start, line end, edge clustering) by training dataset]

[Table III: node classification scores (line start, line end) by graph type and pooling method]
Table IV shows that F-RCNN-Q matches the mAP scores in . The GCN models are worse on this metric because there is only one point on the precision-recall curve, and the OCR engine is not trained to produce bounding boxes that match the ground truth. In the bottom row of Table IV, “OCR + Ground Truth” is computed by clustering OCR words into paragraphs based on ground truth boxes, which is the upper bound for all post-OCR methods. For mAP scores, even this upper bound is lower than the scores of image based models. However, if we measure by the F1 scores defined in subsection 5.1.3, OCR + GCNs can match image based models with a slight advantage.
| Method | Training data | mAP | F1 |
|---|---|---|---|
| F-RCNN | PubLayNet training | 0.910 | - |
| M-RCNN | PubLayNet training | 0.916 | - |
| OCR + Heuristic | - | 0.302 | 0.364 |
| OCR + GCNs | Augmented web synthetic | 0.748 | 0.867 |
| OCR + GCNs | PubLayNet training | 0.842 | 0.959 |
| OCR + Ground Truth | - | 0.892 | 0.997 |
The high F1 score of “OCR + Ground Truth” also shows that the OCR engine we use has very high recall on text detection. The reason it is lower than 1 is mostly due to ground truth variations: a small fraction of single-line paragraphs have IoU lower than 0.5.
- Under splitting: the top line (marked red) should have been split into two. This usually causes a large IoU drop and cannot be recovered by line clustering.
- Clustering errors among text lines, and also on a math equation together with detection errors.
- A table annotation clustered with a table cell across a boundary line, because our models do not take image features and ignore non-text lines.
5.4 Synthetic Dataset Evaluations
The synthetic dataset from web scraping provides a more difficult test for these models through its aggressive style variations.
In Table V, we can see that the F1 score of the image based F-RCNN-Q model decreases sharply as the task difficulty increases. On the synthetic dataset where the images are augmented with rotations and projections as in Fig. 10, detection essentially breaks down, not only from the non-max suppression drops shown in Fig. 2, but also from much worse box predictions.
| Method | Training data | F1 |
|---|---|---|
| F-RCNN-Q | Augmented web synthetic | 0.547 |
| OCR + GCNs | PubLayNet training/dev | 0.959 |
| OCR + GCNs | Augmented web synthetic | 0.827 |
In contrast, the GCN models are much less affected by data augmentations and layout style variations. In particular, the F1 score changes only minimally between the augmented and non-augmented datasets, so GCN models have an even greater advantage when input images are not axis-aligned.
5.5 Real-World Dataset Evaluations
The human annotated dataset is the closest indicator of the models’ performance in real-world applications. Since the annotated set is relatively small, the F-RCNN-Q model needs to be pre-trained on other paragraph sets, while the GCN models are small enough that the line clustering model can be trained entirely on the paragraph annotations. The evaluation metric for this set is the F1 score at a fixed IoU threshold of 0.5.
| Method | Training data | F1 |
|---|---|---|
| F-RCNN-Q | Augmented web synthetic | 0.030 |
| F-RCNN-Q | Annotated data (pre-trained) | 0.607 |
| OCR + Heuristic | - | 0.602 |
| OCR + GCNs | Augmented web synthetic | 0.614 |
| OCR + Ground Truth | - | 0.960 |
Table VI compares different models and training sets. All the models should handle image rotations and perspective transformations, so we only compare models trained on the augmented web synthetic set or the human annotated set. First, we can see that Faster R-CNN trained on synthetic web rendered pages does not work at all on real-world images, whereas the GCN models generalize well from synthetic training data.
Also note that most of the annotated images are nearly axis-aligned, so the GCN models will yield even greater advantage if the images are rotated or taken with varied camera angles.
Fig. 14 and Fig. 15 show six examples of paragraphs produced by OCR + GCNs. The successful examples in Fig. 14 are all difficult cases for heuristic and detection based approaches but are handled well by the GCN models. The image on the right shows the effectiveness of training with augmented web synthetic data, as there are no similar images in the annotated set. Error examples produced by the GCN models are shown in Fig. 15:
- Under splitting: the caption under the top-right picture is not split from the paragraph on the left, causing downstream errors.
- Over splitting: two lines in the middle are mistakenly split, but the short line segments are then clustered back into the same paragraph, resulting in a correct final output.
- Over clustering of table elements: since tables are “don’t-care” regions in the training data, the GCN models trained with paragraph data may take table elements as sparse text lines and incorrectly cluster them together. A table detector may help to filter out these lines.
To verify the robustness of the GCN models for language and script diversity, we test them on a multi-language evaluation set. The models are trained with both synthetic and human annotated data in English, and additional synthetic data from Wikipedia pages in Chinese, Japanese and Korean. No other Latin language data is needed as the English data is sufficient to represent the layout styles.
Table VII shows the F1 scores across multiple languages. F-RCNN-Q is not evaluated on the three Asian languages, because we do not have suitable training data, and Table VI indicates that synthetic training data is not useful for this model. The GCN models produce the best results in almost all the languages tried, once again showing good generalizability.
[Table VII: F1 scores by language for OCR + Heuristic, F-RCNN-Q, OCR + GCNs, and OCR + Ground Truth]
The GCN models are also flexible in handling text lines written in vertical directions, which are common in Japanese and Chinese, and also appear in Korean. Although we don’t have much training data with vertical lines, the bounding box structures of lines and symbols in these languages remain the same when the lines are written vertically, as if they were written horizontally while the image is rotated clockwise by 90 degrees. Fig. 16 shows such an example. Since our models are trained to handle all rotation angles, such paragraphs can be correctly identified.
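To make the rotation argument concrete: rotating a vertical line’s box by 90 degrees yields the same wide, flat geometry as a horizontal line, so a rotation-robust model needs no separate vertical-text path. A sketch, where the coordinate convention (and hence the rotation direction) is an assumption:

```python
def rotate90(x, y):
    # Rotate a point by 90 degrees about the origin (counter-clockwise in
    # standard math coordinates; the direction flips in image coordinates).
    return (-y, x)

def rotate_box(corners):
    """Rotate a box given as a list of (x, y) corners."""
    return [rotate90(x, y) for (x, y) in corners]

# A tall, narrow box for a vertically written line...
vertical = [(0, 0), (10, 0), (10, 100), (0, 100)]
# ...becomes a wide, flat box, matching horizontal-line geometry.
horizontal = rotate_box(vertical)
```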
6 Conclusions and Future Work
We demonstrate that GCN models can be powerful and efficient for the task of paragraph estimation. Provided with a good OCR engine, they can match image based models with much lower requirements on training data and computation resources, and significantly beat them on non-axis-aligned inputs with complex layout styles. The graph convolutions in these models give them unique advantages in dealing with different levels of page elements and their relations.
Future work includes model performance improvement through both training data and model architectures. Training data can be made more realistic by tuning the web scraping pipeline and adding more complex degradation transformations such as wrinkling effects on document pages. Also, alternative model architectures and graph structures mentioned in subsection 3.5 may improve quality and performance.
Another aspect of the future work is to extend the GCN models’ capability to identify more types of entities and extract document structural information such as reading order. Some entities like titles and list items are similar to paragraphs, while some others like tables and document sections are not straightforward to handle with our proposed models. Image based CNNs may be needed with their outputs used as node or edge features in the GCN model, so that non-text components in the document (e.g. checkboxes, table grid lines) can be captured. In addition, reading order among entities is a necessary step if we want to identify semantic paragraphs that span across multiple columns/pages.
The authors would like to thank Chen-Yu Lee, Chun-Liang Li, Michalis Raptis, Sandeep Tata and Siyang Qin for their helpful reviews and feedback, and to thank Alessandro Bissacco, Hartwig Adam and Jake Walker for their general leadership support in the overall project effort.
- A semantic paragraph can span over multiple text columns or pages. In this paper, we only look for physical paragraphs where lines of contiguous indices are always physically proximate. Moreover, we regard stand-alone text spans such as titles and headings as single-line paragraphs.
- In current use in products and services at the time of this writing.
- The middle line points are added so that no edges can go through the boxes.
- This condition means there are no skip-line positive edges, which makes the task easier to learn. For datasets without line level ground truth labels, this condition is replaced by “there is no path in the β-skeleton graph shorter than edge (, )”.
- (2008) Computational geometry: algorithms and applications. 3rd edition, Springer-Verlag TELOS, Santa Clara, CA, USA.
- (2003) An algorithm for finding maximal whitespace rectangles at arbitrary orientations for document layout analysis. In Seventh International Conference on Document Analysis and Recognition (ICDAR 2003), pp. 66–70.
- (2003) High performance document layout analysis. In Proceedings of the Symposium on Document Image Understanding Technology, Greenbelt, MD, pp. 209–218.
- (1998) Geometric layout analysis techniques for document image understanding: a review. Technical Report 9703-09, IRST, Trento, Italy.
- (2019) ICDAR2019 competition on recognition of documents with complex layouts - RDCL2019. In 2019 International Conference on Document Analysis and Recognition (ICDAR 2019), Sydney, Australia, pp. 1521–1526.
- (2019) Deep visual template-free form parsing. In 2019 International Conference on Document Analysis and Recognition (ICDAR 2019), Sydney, Australia, pp. 134–141.
- (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, Vol. 28, pp. 2224–2232.
- (2017) Sequence-to-label script identification for multilingual OCR. In 14th IAPR International Conference on Document Analysis and Recognition (ICDAR 2017), Kyoto, Japan, pp. 161–168.
- (2017) Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning (ICML ’17), pp. 1263–1272.
- (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, pp. 1024–1034.
- (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- (2020) Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 42 (2), pp. 386–397.
- (1985) Office document architecture and office document interchange formats: current status of international standardization. Computer 18, pp. 50–60.
- (2018) R2CNN: rotational region CNN for arbitrarily-oriented scene text detection. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 3610–3615.
- (1999) An automatic closed-loop methodology for generating character groundtruth for scanned documents. IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (2), pp. 179–183.
- (1985) A framework for computational morphology. Machine Intelligence and Pattern Recognition 2, pp. 217–248.
- (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- (2019) Page segmentation using a convolutional neural network with trainable co-occurrence features. In 2019 International Conference on Document Analysis and Recognition (ICDAR 2019), Sydney, Australia, pp. 1023–1028.
- (2019) DeepGCNs: can GCNs go as deep as CNNs? In 2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), Seoul, Korea (South), pp. 9266–9275.
- (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, Louisiana, USA, pp. 3538–3545.
- (2007) The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology 58 (7), pp. 1019–1031.
- (2018) A machine learning approach for graph-based page segmentation. In 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI 2018), Paraná, Brazil, pp. 424–431.
- (1986) A rule-based system for document understanding. In Proceedings of the Fifth AAAI National Conference on Artificial Intelligence (AAAI ’86), pp. 789–793.
- (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39 (6), pp. 1137–1149.
- (2019) Table detection in invoice documents by graph neural networks. In 2019 International Conference on Document Analysis and Recognition (ICDAR 2019), Sydney, Australia, pp. 122–127.
- (2009) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80.
- (2007) An overview of the Tesseract OCR engine. In Proc. 9th IEEE Intl. Conf. on Document Analysis and Recognition (ICDAR), pp. 629–633.
- (2009) Hybrid page layout analysis via tab-stop detection. In 10th International Conference on Document Analysis and Recognition (ICDAR 2009), Barcelona, Spain, pp. 241–245.
- (1986) Document image analysis. In Proceedings of the 8th International Conference on Pattern Recognition, pp. 434–436.
- (2017) Attention is all you need. CoRR abs/1706.03762.
- (2018) Graph attention networks. In 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada.
- Wikipedia, the free encyclopedia.
- (2020) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21.
- (2017) Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2018) Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montréal, Canada, pp. 5171–5181.
- (2019) PubLayNet: largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR 2019), Sydney, Australia, pp. 1015–1022.