# Shape Robust Text Detection with Progressive Scale Expansion Network

###### Abstract

Scene text detection has witnessed rapid progress, especially with the recent development of convolutional neural networks. However, two challenges still prevent these algorithms from reaching industrial applications. On the one hand, most state-of-the-art algorithms require quadrangular bounding boxes, which are inaccurate for locating texts with arbitrary shapes. On the other hand, two text instances that lie close to each other may lead to a false detection covering both instances. Traditionally, segmentation-based approaches can relieve the first problem but usually fail to solve the second. To address these two challenges, in this paper, we propose a novel Progressive Scale Expansion Network (PSENet), which can precisely detect text instances with arbitrary shapes. More specifically, PSENet generates kernels of different scales for each text instance, and gradually expands the minimal-scale kernel to the text instance with its complete shape. Because there are large geometrical margins among the minimal-scale kernels, our method effectively splits close text instances, making it easier to use segmentation-based methods to detect arbitrary-shaped text instances. Extensive experiments on CTW1500, Total-Text, ICDAR 2015 and ICDAR 2017 MLT validate the effectiveness of PSENet. Notably, on CTW1500, a dataset full of long curve texts, PSENet achieves an F-measure of 74.3% at 27 FPS, and our best F-measure (82.2%) outperforms state-of-the-art algorithms by 6.6%. The code will be released in the future.

## 1 Introduction

Scene text detection in the wild is a fundamental problem with numerous applications such as scene understanding, product identification, and autonomous driving. Much progress has been made in recent years with the rapid development of Convolutional Neural Networks (CNNs) [he2016deep, huang2017densely, ren2015faster]. Existing CNN-based algorithms can be roughly divided into two categories: regression-based approaches and segmentation-based approaches.

For the regression-based approaches [tian2016detecting, zhou2017east, shi2017detecting, jiang2017r2cnn, zhong2016deeptext, liu2018fots, he2017single, hu2017wordsup, lyu2018multi], the text targets are usually represented in the form of rectangles or quadrangles with certain orientations. However, the regression-based approaches fail to deal with text instances of arbitrary shapes, e.g., the curve texts shown in Fig. 1 (b). Segmentation-based approaches, on the other hand, locate text instances based on pixel-level classification. However, it is difficult for them to separate text instances that are close to each other; often, a single false detection that covers all the nearby text instances is predicted. One example is shown in Fig. 1 (c).

To address these problems, in this paper, we propose a novel kernel-based framework, namely, the Progressive Scale Expansion Network (PSENet). Our PSENet has the following two benefits. First, as a segmentation-based method, PSENet performs pixel-level segmentation, which is able to precisely locate text instances with arbitrary shapes. Second, we propose a progressive scale expansion algorithm, with which adjacent text instances can be successfully identified, as shown in Fig. 1 (d). More specifically, we assign to each text instance multiple predicted segmentation areas, which are denoted as “kernels” for simplicity. Each kernel has a shape similar to the original text instance but at a different scale. To obtain the final detections, we adopt a progressive scale expansion algorithm based on Breadth-First-Search (BFS). Generally, there are 3 steps: 1) starting from the kernels with minimal scales (instances can be distinguished in this step); 2) expanding their areas by gradually involving more pixels in larger kernels; 3) finishing when the complete text instances (the largest kernels) are explored.

There are three potential reasons for the design of the progressive scale expansion algorithm. First, the kernels with minimal scales are easy to separate because their boundaries are far away from each other. Second, the minimal-scale kernels cannot cover the complete areas of text instances (see Fig. 2 (b)). Therefore, it is necessary to recover the complete text instances from the minimal-scale kernels. Third, the progressive scale expansion algorithm is a simple and efficient method to expand the small kernels to complete text instances, which ensures accurate locations of text instances.

To show the effectiveness of our proposed PSENet, we conduct extensive experiments on four competitive benchmark datasets including ICDAR 2015 [karatzas2015icdar], ICDAR 2017 MLT [icdar2017mlt], CTW1500 [Liu2017Detecting] and Total-Text [totaltext]. Among these datasets, CTW1500 and Total-Text are explicitly designed for curve text detection. Specifically, on CTW1500, a dataset with long curve texts, we outperform state-of-the-art results by an absolute 6.6%, and our real-time model achieves a comparable performance (74.3%) at 27 FPS. Furthermore, the proposed PSENet also achieves promising performance on multi-oriented and multi-lingual text datasets: ICDAR 2015 and ICDAR 2017 MLT.

## 2 Related Work

Scene text detection based on deep learning methods has achieved remarkable results over the past few years. The majority of modern text detectors are based on CNN frameworks, in which scene text detection is roughly formulated into two categories: regression-based methods and segmentation-based methods.

Regression-based methods are often built upon general object detection frameworks, such as Faster R-CNN [ren2015faster] and SSD [liu2016ssd]. TextBoxes [liao2017textboxes] modified the anchor scales and the shapes of convolution kernels to adapt to the various aspect ratios of text. EAST [zhou2017east] uses FCN [FCN] to directly predict the score map, rotation angle and text boxes for each pixel. RRPN [rrpn] adopted Faster R-CNN and developed rotated proposals in the RPN part to detect arbitrarily oriented text. RRD [rrd] extracted feature maps for text classification and regression from two separate branches for better long text detection.

However, most of the regression-based methods require complex anchor design and cumbersome multi-stage pipelines, which might require exhaustive tuning and lead to sub-optimal performance. Moreover, the above works were specially designed for multi-oriented text detection and may fall short when handling curve texts, which are actually widely distributed in real-world scenarios.

Segmentation-based methods are mainly inspired by fully convolutional networks (FCN) [FCN]. Zhang et al. [zhang2016multi] first adopted FCN to extract text blocks and detect character candidates from those text blocks via MSER. Yao et al. [yao2016scene] formulated one text region as various properties, such as text region and orientation, then utilized FCN to predict the corresponding heatmaps. Lyu et al. [lyu2018multi] utilized corner localization to find suitable irregular quadrangles for text instances. PixelLink [PixelLink] separated texts lying close to each other by predicting pixel connections between different text instances. Recently, TextSnake [textsnake] used ordered disks to represent curve text for curve text detection. SPCNet [xie2018scene] used an instance segmentation framework and utilized context information to detect text of arbitrary shape while suppressing false positives.

The above works have achieved excellent performance on several horizontal and multi-oriented text benchmarks. However, most of them have not paid special attention to curve text, with the exception of TextSnake [textsnake]. Even so, TextSnake still needs time-consuming and complicated post-processing steps (Centralizing, Striding and Sliding) during inference, whereas our proposed progressive scale expansion needs only one clean and efficient step.

## 3 Proposed Method

In this section, we first introduce the overall pipeline of the proposed Progressive Scale Expansion Network (PSENet). Next, we present the details of the progressive scale expansion algorithm, and show how it can effectively distinguish text instances lying close to each other. Finally, we introduce the way of generating labels and the design of the loss function.

### 3.1 Overall Pipeline

A high-level overview of our proposed PSENet is illustrated in Fig. 3. We use ResNet [he2016identity] as the backbone of PSENet, and concatenate low-level texture features with high-level semantic features. These maps are further fused in F to encode information with various receptive views. Intuitively, such fusion is very likely to facilitate the generation of kernels with various scales. The feature map F is then projected into n branches to produce multiple segmentation results S_{1},S_{2},...,S_{n}. Each S_{i} is one segmentation mask for all the text instances at a certain scale. The scales of the different segmentation masks are decided by hyper-parameters, which will be discussed in Sec. 3.4. Among these masks, S_{1} gives the segmentation result for the text instances with the smallest scale (i.e., the minimal kernels) and S_{n} denotes the original segmentation mask (i.e., the maximal kernels). After obtaining these segmentation masks, we use the progressive scale expansion algorithm to gradually expand all the instances’ kernels in S_{1} to their complete shapes in S_{n}, and obtain the final detection results R.

### 3.2 Network Design

The basic framework of PSENet is built upon FPN [lin2017feature]. We first obtain four 256-channel feature maps (i.e., P_{2},P_{3},P_{4},P_{5}) from the backbone. To further combine the semantic features from low to high levels, we fuse the four feature maps to obtain a feature map F with 1024 channels via the function \mathbb{C}(\cdot) as:

\begin{split}\displaystyle F&\displaystyle=\mathbb{C}(P_{2},P_{3},P_{4},P_{5})\\ &\displaystyle=P_{2}\parallel\mbox{Up}_{\times 2}(P_{3})\parallel\mbox{Up}_{\times 4}(P_{4})\parallel\mbox{Up}_{\times 8}(P_{5}),\end{split} | (1) |

where “\parallel” refers to concatenation and \mbox{Up}_{\times 2}(\cdot), \mbox{Up}_{\times 4}(\cdot), \mbox{Up}_{\times 8}(\cdot) refer to 2, 4 and 8 times upsampling, respectively. Subsequently, F is fed into a Conv(3,3)-BN-ReLU layer and reduced to 256 channels. Next, it passes through n Conv(1,1)-Up-Sigmoid layers and produces n segmentation results S_{1},S_{2},...,S_{n}. Here, Conv, BN, ReLU and Up refer to convolution [lecun1998gradient], batch normalization [ioffe2015batch], rectified linear units [glorot2011deep] and upsampling, respectively.
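The fusion function \mathbb{C}(\cdot) in Eqn. (1) can be sketched in a few lines. The NumPy snippet below uses nearest-neighbor upsampling purely for illustration (this section does not specify the interpolation method), and the toy feature-map shapes are our own:

```python
import numpy as np

def upsample(x, scale):
    """Nearest-neighbor upsampling of a (C, H, W) feature map by an integer scale."""
    return x.repeat(scale, axis=1).repeat(scale, axis=2)

def fuse(p2, p3, p4, p5):
    """Fusion function C(.) of Eqn. (1): upsample P3/P4/P5 to P2's
    resolution and concatenate along the channel axis."""
    return np.concatenate([
        p2,
        upsample(p3, 2),
        upsample(p4, 4),
        upsample(p5, 8),
    ], axis=0)

# Toy 256-channel maps, each level at half the previous resolution.
p2 = np.zeros((256, 32, 32))
p3 = np.zeros((256, 16, 16))
p4 = np.zeros((256, 8, 8))
p5 = np.zeros((256, 4, 4))
f = fuse(p2, p3, p4, p5)
print(f.shape)  # (1024, 32, 32)
```

Concatenation (rather than element-wise addition) keeps the 4 × 256 = 1024 channels separate, matching the channel count stated above.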

### 3.3 Progressive Scale Expansion Algorithm

As shown in Fig. 1 (c), it is hard for the segmentation-based method to separate the text instances that are close to each other. To solve this problem, we propose a progressive scale expansion algorithm.

Here is a vivid example (see Fig. 4) to explain the procedure of the progressive scale expansion algorithm, whose central idea is borrowed from the Breadth-First-Search (BFS) algorithm. In the example, we have 3 segmentation results S=\{S_{1},S_{2},S_{3}\} (see Fig. 4 (a), (e), (f)). At first, based on the minimal kernels’ map S_{1} (see Fig. 4 (a)), 4 distinct connected components C=\{c_{1},c_{2},c_{3},c_{4}\} can be found as initializations. The regions with different colors in Fig. 4 (b) represent these different connected components, respectively. By now we have all the text instances’ central parts (i.e., the minimal kernels) detected. Then, we progressively expand the detected kernels by merging the pixels in S_{2}, and then in S_{3}. The results of the two scale expansions are shown in Fig. 4 (c) and Fig. 4 (d), respectively. Finally, we extract the connected components, which are marked with different colors in Fig. 4 (d), as the final predictions for text instances.

The procedure of scale expansion is illustrated in Fig. 4 (g). The expansion is based on the Breadth-First-Search algorithm, which starts from the pixels of multiple kernels and iteratively merges the adjacent text pixels. Note that there may be conflicting pixels during expansion, as shown in the red box in Fig. 4 (g). The principle for dealing with such conflicts in our practice is that a confusing pixel can be merged by one single kernel only, on a first-come-first-served basis. Thanks to the “progressive” expansion procedure, these boundary conflicts do not affect the final detections or the performance. The details of the scale expansion algorithm are summarized in Algorithm 1. In the pseudocode, T and P are intermediate results, and Q is a queue. \mbox{Neighbor}(\cdot) represents the neighbor pixels (4-way) of p. \mbox{GroupByLabel}(\cdot) is the function of grouping the intermediate result by label. “S_{i}[q]=\mbox{True}” means that the predicted value of pixel q in S_{i} belongs to the text part. C and E are used to keep the kernels before and after expansion, respectively.
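As a plain-Python sketch of the procedure above (not the authors' implementation, which would rely on optimized routines such as OpenCV's connected-component labeling), the following code labels the connected components of S_{1} with a 4-way BFS and then expands each label through the larger kernel maps on a first-come-first-served basis:

```python
from collections import deque
import numpy as np

def expand_kernels(kernels):
    """Progressive scale expansion. `kernels` is a list of binary masks
    S_1..S_n ordered from the smallest to the largest scale; returns an
    integer label map of the final text instances (0 = background)."""
    s1 = kernels[0]
    h, w = s1.shape
    labels = np.zeros((h, w), dtype=np.int32)
    next_label = 1
    # Step 1: label the connected components of the minimal kernel map S_1.
    for y in range(h):
        for x in range(w):
            if s1[y, x] and labels[y, x] == 0:
                q = deque([(y, x)])
                labels[y, x] = next_label
                while q:
                    cy, cx = q.popleft()
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and s1[ny, nx] and labels[ny, nx] == 0:
                            labels[ny, nx] = next_label
                            q.append((ny, nx))
                next_label += 1
    # Step 2: expand into each larger kernel map in turn; a conflicting
    # pixel is claimed by whichever kernel reaches it first.
    for s in kernels[1:]:
        q = deque(zip(*np.nonzero(labels)))
        while q:
            cy, cx = q.popleft()
            for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                if 0 <= ny < h and 0 <= nx < w and s[ny, nx] and labels[ny, nx] == 0:
                    labels[ny, nx] = labels[cy, cx]
                    q.append((ny, nx))
    return labels
```

Because each larger kernel is a superset of the smaller one, expanding scale by scale keeps the two instances' growth fronts balanced, which is what prevents one instance from swallowing its neighbor.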

### 3.4 Label Generation

As illustrated in Fig. 3, PSENet produces segmentation results (e.g., S_{1},S_{2},...,S_{n}) with different kernel scales. Therefore, it requires the corresponding ground truths with different kernel scales during training. In our practice, these ground truth labels can be generated simply and effectively by shrinking the original text instance. The polygon with a blue border in Fig. 5 (b) denotes the original text instance, and it corresponds to the largest segmentation label mask (see the rightmost map in Fig. 5 (c)). To obtain the shrunk masks sequentially in Fig. 5 (c), we utilize the Vatti clipping algorithm [vatti1992generic] to shrink the original polygon p_{n} by d_{i} pixels and get the shrunk polygon p_{i} (see Fig. 5 (a)). Subsequently, each shrunk polygon p_{i} is converted into a 0/1 binary mask as the segmentation label ground truth. We denote these ground truth maps as G_{1},G_{2},...,G_{n}, respectively. Mathematically, if we consider the scale ratio as r_{i}, the margin d_{i} between p_{n} and p_{i} can be calculated as:

d_{i}=\frac{\mbox{Area}(p_{n})\times(1-r_{i}^{2})}{\mbox{Perimeter}(p_{n})}, | (2) |

where \mbox{Area}(\cdot) is the function of computing the polygon area, \mbox{Perimeter}(\cdot) is the function of computing the polygon perimeter. Further, we define the scale ratio r_{i} for ground truth map G_{i} as:

r_{i}=1-\frac{(1-m)\times(n-i)}{n-1}, | (3) |

where m is the minimal scale ratio, which is a value in (0,1]. Based on the definition in Eqn. (3), the values of scale ratios (i.e., r_{1},r_{2},...,r_{n}) are decided by two hyper-parameters n and m, and they increase linearly from m to 1.
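Eqns. (2) and (3) can be combined in a short routine. The sketch below only computes the margins d_{i}; the actual polygon offsetting would use a Vatti-clipping implementation (e.g. the `pyclipper` package), which we omit here. The example box and the choice n = 3, m = 0.5 are our own:

```python
import math

def polygon_area(pts):
    """Shoelace formula for a simple polygon given as [(x, y), ...]."""
    s = 0.0
    for i in range(len(pts)):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % len(pts)]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def polygon_perimeter(pts):
    return sum(math.dist(pts[i], pts[(i + 1) % len(pts)]) for i in range(len(pts)))

def shrink_margins(pts, n, m):
    """Margins d_i (Eqn. 2) for the scale ratios r_i (Eqn. 3), i = 1..n."""
    area, peri = polygon_area(pts), polygon_perimeter(pts)
    margins = []
    for i in range(1, n + 1):
        r = 1 - (1 - m) * (n - i) / (n - 1)      # Eqn. (3)
        margins.append(area * (1 - r * r) / peri)  # Eqn. (2)
    return margins

# A 100x20 text box with n = 3 kernels and minimal scale m = 0.5.
box = [(0, 0), (100, 0), (100, 20), (0, 20)]
print(shrink_margins(box, 3, 0.5))  # d_n = 0, i.e. p_n is not shrunk
```

Note that r_{n} = 1 always yields d_{n} = 0, so the largest mask G_{n} coincides with the original polygon, consistent with Fig. 5 (c).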

### 3.5 Loss Function

For learning PSENet, the loss function can be formulated as:

L=\lambda L_{c}+(1-\lambda)L_{s}, | (4) |

where L_{c} and L_{s} represent the losses for the complete text instances and the shrunk ones respectively, and \lambda balances the importance between L_{c} and L_{s}.

It is common that text instances occupy only an extremely small region in natural images, which biases the predictions of the network toward the non-text region when binary cross entropy [de2005tutorial] is used. Inspired by [milletari2016v], we adopt the dice coefficient in our experiments. The dice coefficient D(S_{i},G_{i}) is formulated as in Eqn. (5):

D(S_{i},G_{i})=\frac{2\sum\nolimits_{x,y}(S_{i,x,y}\times G_{i,x,y})}{\sum\nolimits_{x,y}S_{i,x,y}^{2}+\sum\nolimits_{x,y}G_{i,x,y}^{2}}, | (5) |

where S_{i,x,y} and G_{i,x,y} refer to the value of pixel (x,y) in segmentation result S_{i} and ground truth G_{i}, respectively.
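Eqn. (5) translates directly into NumPy. The small `eps` term below is our own numerical-stability addition (to guard against empty maps), not part of the paper's formula:

```python
import numpy as np

def dice(s, g, eps=1e-6):
    """Dice coefficient of Eqn. (5) between a predicted map `s` and a
    ground-truth map `g` of the same shape, with values in [0, 1].
    `eps` avoids division by zero on empty maps (our addition)."""
    inter = np.sum(s * g)
    return (2.0 * inter + eps) / (np.sum(s * s) + np.sum(g * g) + eps)

g = np.array([[0.0, 1.0, 1.0, 0.0]])
print(dice(g, g))      # perfect overlap -> 1.0
print(dice(1 - g, g))  # disjoint maps -> close to 0
```

Unlike per-pixel cross entropy, this overlap-based score is insensitive to the large number of background pixels, which is exactly why it suits the text/non-text imbalance described above.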

Furthermore, there are many patterns similar to text strokes, such as fences and lattices. Therefore, we apply Online Hard Example Mining (OHEM) [shrivastava2016training] to L_{c} during training to better distinguish these patterns.
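A minimal sketch of how such an OHEM mask could be built: keep all positive pixels and only the hardest (highest-scoring) negatives. The negative-to-positive ratio of 3 below is an assumed hyper-parameter for illustration; this section does not state the value used:

```python
import numpy as np

def ohem_mask(score, gt, neg_ratio=3):
    """Build a binary training mask M: all positives plus the hardest
    negatives, at an assumed negative:positive ratio (`neg_ratio`).
    `score` is the predicted text probability map, `gt` the binary label."""
    pos = gt > 0.5
    n_pos = int(pos.sum())
    neg_scores = score[~pos]
    n_neg = min(len(neg_scores), n_pos * neg_ratio)
    mask = pos.astype(float)
    if n_neg > 0:
        # Keep negatives whose score reaches the n_neg-th hardest value.
        thresh = np.sort(neg_scores)[-n_neg]
        mask[(~pos) & (score >= thresh)] = 1.0
    return mask
```

Pixels outside the mask contribute nothing to L_{c}, so easy background regions (sky, walls) are down-weighted in favor of text-like distractors.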

L_{c} focuses on segmenting the text and non-text regions. Let M be the training mask given by OHEM; then L_{c} can be formulated as Eqn. (6):

L_{c}=1-D(S_{n}\cdot M,G_{n}\cdot M), | (6) |

L_{s} is the loss for the shrunk text instances. Since they are encircled by the original areas of the complete text instances, we ignore the pixels of the non-text region in the segmentation result S_{n} to avoid redundancy. Therefore, L_{s} can be formulated as follows:

\begin{split}\displaystyle L_{s}&=1-\frac{\sum\nolimits_{i=1}^{n-1}D(S_{i}\cdot W,G_{i}\cdot W)}{n-1},\\ \displaystyle W_{x,y}&=\left\{\begin{array}{ll}1,&\mbox{if}\ S_{n,x,y}\geq 0.5;\\ 0,&\mbox{otherwise},\end{array}\right.\end{split} | (7) |
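Putting Eqns. (4)-(6) and the mask W together, the full training loss can be sketched as follows. The balance weight λ = 0.7 is an assumed value for illustration (not stated in this section), and `eps` is our numerical-stability addition:

```python
import numpy as np

def dice_loss(s, g, eps=1e-6):
    """1 - D(S, G) with the dice coefficient of Eqn. (5)."""
    return 1.0 - (2.0 * np.sum(s * g) + eps) / (np.sum(s * s) + np.sum(g * g) + eps)

def psenet_loss(S, G, M, lam=0.7):
    """Total loss of Eqn. (4): L = lam * L_c + (1 - lam) * L_s.
    S, G are lists [S_1..S_n] / [G_1..G_n]; M is the OHEM training mask.
    lam = 0.7 is an assumed hyper-parameter, not taken from this section."""
    # L_c (Eqn. 6): dice loss on the complete map S_n under the OHEM mask M.
    L_c = dice_loss(S[-1] * M, G[-1] * M)
    # W: keep only the pixels predicted as text in S_n (threshold 0.5).
    W = (S[-1] >= 0.5).astype(float)
    # L_s: dice loss averaged over the shrunk maps S_1..S_{n-1}, masked by W.
    n = len(S)
    L_s = sum(dice_loss(S[i] * W, G[i] * W) for i in range(n - 1)) / (n - 1)
    return lam * L_c + (1 - lam) * L_s
```

Note how W depends only on the prediction for the complete map S_{n}: supervision for the shrunk kernels is thus confined to regions the network already considers text, matching the redundancy-avoidance argument above.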