Shape Robust Text Detection with Progressive Scale Expansion Network


Wenhai Wang{}^{1,4*}, Enze Xie{}^{2,5*}, Xiang Li{}^{3,4*}, Wenbo Hou{}^{1}, Tong Lu{}^{1}, Gang Yu{}^{5}, Shuai Shao{}^{5}
{{}^{1}}National Key Lab for Novel Software Technology, Nanjing University
{{}^{2}}Department of Computer Science and Technology, Tongji University
{{}^{3}}School of Computer Science and Engineering, Nanjing University of Science and Technology
{{}^{5}}Megvii (Face++) Technology Inc.
{wangwenhai362, Johnny_ez, lysucuo}, {yugang, shaoshuai}
Authors contributed equally. Xiang Li is with PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, and Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China. Xiang Li is also a visiting scholar at Momenta. Corresponding author.

Scene text detection has witnessed rapid progress, especially with the recent development of convolutional neural networks. However, there still exist two challenges which prevent these algorithms from being applied to industrial applications. On the one hand, most state-of-the-art algorithms require quadrangular bounding boxes, which are inaccurate for locating texts with arbitrary shapes. On the other hand, two text instances which are close to each other may lead to a false detection covering both instances. Traditionally, segmentation-based approaches can relieve the first problem but usually fail to solve the second one. To address these two challenges, in this paper, we propose a novel Progressive Scale Expansion Network (PSENet), which can precisely detect text instances with arbitrary shapes. More specifically, PSENet generates kernels of different scales for each text instance, and gradually expands the kernel of minimal scale to the text instance with its complete shape. Due to the fact that there are large geometrical margins among the minimal scale kernels, our method can effectively split close text instances, making it easier to use segmentation-based methods to detect arbitrary-shaped text instances. Extensive experiments on CTW1500, Total-Text, ICDAR 2015 and ICDAR 2017 MLT validate the effectiveness of PSENet. Notably, on CTW1500, a dataset full of long curve texts, PSENet achieves an F-measure of 74.3% at 27 FPS, and our best F-measure (82.2%) outperforms state-of-the-art algorithms by 6.6%. The code will be released in the future.

Figure 1: The results of different methods, best viewed in color. (a) is the original image. (b) refers to the result of the regression-based method, which displays disappointing detections as the red box covers more than half of the context in the green box. (c) is the result of naive semantic segmentation, which mistakes 3 text instances as 1 instance since their boundary pixels are partially connected. (d) is the result of our proposed PSENet, which successfully distinguishes and detects the 4 unique text instances.

1 Introduction

Scene text detection in the wild is a fundamental problem with numerous applications such as scene understanding, product identification, and autonomous driving. Much progress has been made in recent years with the rapid development of Convolutional Neural Networks (CNNs) [he2016deep, huang2017densely, ren2015faster]. We can roughly divide the existing CNN-based algorithms into two categories: regression-based approaches and segmentation-based approaches.

For the regression-based approaches [tian2016detecting, zhou2017east, shi2017detecting, jiang2017r2cnn, zhong2016deeptext, liu2018fots, he2017single, hu2017wordsup, lyu2018multi], the text targets are usually represented in the form of rectangles or quadrangles with certain orientations. However, the regression-based approaches fail to deal with text instances of arbitrary shapes, e.g., the curve texts shown in Fig. 1 (b). Segmentation-based approaches, on the other hand, locate text instances based on pixel-level classification. However, it is difficult to separate text instances which are close to each other. Usually, a segmentation-based approach may predict a single false detection covering all the text instances that are close to each other. One example is shown in Fig. 1 (c).

To address these problems, in this paper, we propose a novel kernel-based framework, namely, the Progressive Scale Expansion Network (PSENet). Our PSENet has the following two benefits. First, as a segmentation-based method, PSENet performs pixel-level segmentation, which is able to precisely locate text instances with arbitrary shapes. Second, we propose a progressive scale expansion algorithm, with which adjacent text instances can be successfully identified, as shown in Fig. 1 (d). More specifically, we assign each text instance multiple predicted segmentation areas, which are denoted as “kernels” for simplicity. Each kernel has a similar shape to the original text instance but a different scale. To obtain the final detections, we adopt a progressive scale expansion algorithm based on Breadth-First Search (BFS). Generally, there are 3 steps: 1) starting from the kernels with minimal scales (instances can be distinguished in this step); 2) expanding their areas by gradually involving more pixels in larger kernels; 3) finishing when the complete text instances (the largest kernels) are explored.

There are three potential reasons for the design of the progressive scale expansion algorithm. First, the kernels with minimal scales are quite easy to separate as their boundaries are far away from each other. Second, the minimal scale kernels cannot cover the complete areas of text instances (see Fig. 2 (b)). Therefore, it is necessary to recover the complete text instances from the minimal scale kernels. Third, the progressive scale expansion algorithm is a simple and efficient way to expand the small kernels to the complete text instances, which ensures accurate locations of text instances.

Figure 2: Visualization of a complete text instance and the kernel of a text instance. It can be seen that CRNN [crnn] recognizes the complete text instance correctly but fails to recognize the kernel, because the kernel cannot cover the complete area of the text instance.

To show the effectiveness of our proposed PSENet, we conduct extensive experiments on four competitive benchmark datasets including ICDAR 2015 [karatzas2015icdar], ICDAR 2017 MLT [icdar2017mlt], CTW1500 [Liu2017Detecting] and Total-Text [totaltext]. Among these datasets, CTW1500 and Total-Text are explicitly designed for curve text detection. Specifically, on CTW1500, a dataset with long curve texts, we outperform state-of-the-art results by an absolute 6.6%, and our real-time model achieves a comparable performance (74.3%) at 27 FPS. Furthermore, the proposed PSENet also achieves promising performance on multi-oriented and multi-lingual text datasets: ICDAR 2015 and ICDAR 2017 MLT.

Figure 3: Illustration of our overall pipeline. The left part of the pipeline is implemented from FPN [lin2017feature]. The right part denotes the feature fusion and the progressive scale expansion algorithm.

2 Related Work

Scene text detection based on deep learning methods has achieved remarkable results over the past few years. The majority of modern text detectors are based on CNN frameworks, in which scene text detection is roughly formulated into two categories: regression-based methods and segmentation-based methods.

Regression-based methods are often based on general object detection frameworks, such as Faster R-CNN [ren2015faster] and SSD [liu2016ssd]. TextBoxes [liao2017textboxes] modified the anchor scales and the shape of convolution kernels to adjust to the various aspect ratios of text. EAST [zhou2017east] used FCN [FCN] to directly predict the score map, rotation angle and text boxes for each pixel. RRPN [rrpn] adopted Faster R-CNN and developed rotated region proposals in the RPN part to detect arbitrarily oriented text. RRD [rrd] extracted feature maps for text classification and regression from two separate branches to better handle long text detection.

However, most of the regression-based methods often require complex anchor design and cumbersome multiple stages, which might require exhaustive tuning and lead to sub-optimal performance. Moreover, the above works were specially designed for multiple oriented text detection and may fall short when handling curve texts, which are actually widely distributed in real-world scenarios.

Segmentation-based methods are mainly inspired by fully convolutional networks (FCN) [FCN]. Zhang et al. [zhang2016multi] first adopted FCN to extract text blocks and detect character candidates from those text blocks via MSER. Yao et al. [yao2016scene] formulated a text region as various properties, such as text region and orientation, and then utilized FCN to predict the corresponding heatmaps. Lyu et al. [lyu2018multi] utilized corner localization to find suitable irregular quadrangles for text instances. PixelLink [PixelLink] separated texts lying close to each other by predicting pixel connections between different text instances. Recently, TextSnake [textsnake] used ordered disks to represent curve text for curve text detection. SPCNet [xie2018scene] used an instance segmentation framework and utilized context information to detect text of arbitrary shape while suppressing false positives.

The above works have achieved excellent performance on several horizontal and multi-oriented text benchmarks. However, most of the above approaches have not paid special attention to curve text, except for TextSnake [textsnake]. TextSnake, however, still needs time-consuming and complicated post-processing steps (Centralizing, Striding and Sliding) during inference, while our proposed progressive scale expansion needs only one clean and efficient step.

3 Proposed Method

In this section, we first introduce the overall pipeline of the proposed Progressive Scale Expansion Network (PSENet). Next, we present the details of the progressive scale expansion algorithm, and show how it can effectively distinguish text instances lying close to each other. Finally, we introduce the label generation method and the design of the loss function.

3.1 Overall Pipeline

A high-level overview of our proposed PSENet is illustrated in Fig. 3. We use ResNet [he2016identity] as the backbone of PSENet. We concatenate low-level texture features with high-level semantic features. These maps are further fused in F to encode information with various receptive views. Intuitively, such fusion is very likely to facilitate the generation of kernels with various scales. Then the feature map F is projected into n branches to produce multiple segmentation results S_{1},S_{2},...,S_{n}. Each S_{i} is a segmentation mask for all the text instances at a certain scale. The scales of the different segmentation masks are decided by hyper-parameters, which will be discussed in Sec. 3.4. Among these masks, S_{1} gives the segmentation result for the text instances with the smallest scales (i.e., the minimal kernels) and S_{n} denotes the original segmentation mask (i.e., the maximal kernels). After obtaining these segmentation masks, we use the progressive scale expansion algorithm to gradually expand all the instances’ kernels in S_{1} to their complete shapes in S_{n}, and obtain the final detection results R.

3.2 Network Design

The basic framework of PSENet is implemented from FPN [lin2017feature]. We first obtain four 256-channel feature maps (i.e., P_{2},P_{3},P_{4},P_{5}) from the backbone. To further combine the semantic features from low to high levels, we fuse the four feature maps to obtain a feature map F with 1024 channels via the function \mathbb{C}(\cdot) as:

\begin{split}F&=\mathbb{C}(P_{2},P_{3},P_{4},P_{5})\\ &=P_{2}\parallel\mbox{Up}_{\times 2}(P_{3})\parallel\mbox{Up}_{\times 4}(P_{4})\parallel\mbox{Up}_{\times 8}(P_{5}),\end{split} (1)

where “\parallel” refers to concatenation and \mbox{Up}_{\times 2}(\cdot), \mbox{Up}_{\times 4}(\cdot), \mbox{Up}_{\times 8}(\cdot) refer to 2, 4 and 8 times upsampling, respectively. Subsequently, F is fed into a Conv(3,3)-BN-ReLU layer and reduced to 256 channels. Next, it passes through n Conv(1,1)-Up-Sigmoid layers and produces n segmentation results S_{1},S_{2},...,S_{n}. Here, Conv, BN, ReLU and Up refer to convolution [lecun1998gradient], batch normalization [ioffe2015batch], rectified linear units [glorot2011deep] and upsampling, respectively.
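As an illustration, the fusion of Eqn. (1) can be sketched in NumPy with nearest-neighbor upsampling. The toy shapes below (a 128×128 input with P_{2} at 1/4 resolution) and the helper names `upsample`/`fuse` are our assumptions, not from the paper:

```python
import numpy as np

def upsample(x, scale):
    """Nearest-neighbor upsampling of a (C, H, W) feature map."""
    return x.repeat(scale, axis=1).repeat(scale, axis=2)

def fuse(p2, p3, p4, p5):
    """F = P2 || Up_x2(P3) || Up_x4(P4) || Up_x8(P5), as in Eqn. (1)."""
    return np.concatenate(
        [p2, upsample(p3, 2), upsample(p4, 4), upsample(p5, 8)], axis=0)

# Toy FPN maps: 256 channels each, at strides 4, 8, 16, 32 of a 128x128 input.
p2 = np.zeros((256, 32, 32))
p3 = np.zeros((256, 16, 16))
p4 = np.zeros((256, 8, 8))
p5 = np.zeros((256, 4, 4))
F = fuse(p2, p3, p4, p5)
print(F.shape)  # (1024, 32, 32)
```

Channel-wise concatenation of the four 256-channel maps at the resolution of P_{2} yields the 1024-channel F described above; FPN in practice uses bilinear interpolation rather than this nearest-neighbor sketch.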

Figure 4: The procedure of the progressive scale expansion algorithm. CC refers to the function of finding connected components. EX represents the scale expansion algorithm. (a), (e) and (f) refer to S_{1}, S_{2} and S_{3}, respectively. (b) shows the initial connected components. (c) and (d) are the results of expansion. (g) illustrates the expansion. The blue and orange areas represent the kernels of different text instances. The gray grids represent the pixels that need to be involved. The red box in (g) refers to the conflicted pixel.

3.3 Progressive Scale Expansion Algorithm

As shown in Fig. 1 (c), it is hard for the segmentation-based method to separate the text instances that are close to each other. To solve this problem, we propose a progressive scale expansion algorithm.

Here is a vivid example (see Fig. 4) to explain the procedure of the progressive scale expansion algorithm, whose central idea comes from the Breadth-First Search (BFS) algorithm. In the example, we have 3 segmentation results S=\{S_{1},S_{2},S_{3}\} (see Fig. 4 (a), (e), (f)). At first, based on the minimal kernels’ map S_{1} (see Fig. 4 (a)), 4 distinct connected components C=\{c_{1},c_{2},c_{3},c_{4}\} can be found as initializations. The regions with different colors in Fig. 4 (b) represent these different connected components. By now we have detected all the text instances’ central parts (i.e., the minimal kernels). Then, we progressively expand the detected kernels by merging the pixels in S_{2}, and then in S_{3}. The results of the two scale expansions are shown in Fig. 4 (c) and Fig. 4 (d), respectively. Finally, we extract the connected components, which are marked with different colors in Fig. 4 (d), as the final predictions for text instances.

The procedure of scale expansion is illustrated in Fig. 4 (g). The expansion is based on the Breadth-First Search algorithm, which starts from the pixels of multiple kernels and iteratively merges the adjacent text pixels. Note that there may be conflicted pixels during expansion, as shown in the red box in Fig. 4 (g). In our practice, the principle for dealing with such conflicts is that a confusing pixel can only be merged by one single kernel, on a first-come-first-served basis. Thanks to the “progressive” expansion procedure, these boundary conflicts do not affect the final detections and performance. The details of the scale expansion algorithm are summarized in Algorithm 1. In the pseudocode, T and P are intermediate results, and Q is a queue. \mbox{Neighbor}(\cdot) represents the neighbor pixels (4-way) of p. \mbox{GroupByLabel}(\cdot) is the function of grouping the intermediate result by label. “S_{i}[q]=\mbox{True}” means that the predicted value of pixel q in S_{i} belongs to the text part. C and E are used to keep the kernels before and after expansion, respectively.

Algorithm 1 Scale Expansion Algorithm
1:Kernels: C, Segmentation Result: S_{i}
2:Scale Expanded Kernels: E
3:function Expansion(C, S_{i})
4:       T\leftarrow\emptyset;P\leftarrow\emptyset;Q\leftarrow\emptyset
5:       for each c_{i}\in C do
6:             T\leftarrow T\cup\{(p,label)\mid(p,label)\in c_{i}\}
7:             P\leftarrow P\cup\{p\mid(p,label)\in c_{i}\}
8:             \mbox{\bf{Enqueue}}(Q,c_{i})                             // push all the elements in c_{i} into Q
9:       end for
10:       while Q\neq\emptyset do
11:             (p,label)\leftarrow\mbox{\bf{Dequeue}}(Q)           // pop the first element of Q
12:             if \exists q\in\mbox{\bf{Neighbor}}(p) and q\notin P and S_{i}[q]=\mbox{True} then
13:                    T\leftarrow T\cup\{(q,label)\};P\leftarrow P\cup\{q\}
14:                    \mbox{\bf{Enqueue}}(Q,(q,label))        // push the element (q,label) into Q
15:             end if
16:       end while
17:       E=\mbox{\bf{GroupByLabel}}(T)
18:       return E
19:end function
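A runnable sketch of Algorithm 1, assuming the kernels are given as an integer label map (0 = background, k > 0 = k-th kernel) and S_i as a boolean map; the function name `expand` and this array encoding are our choices, not the paper's:

```python
import numpy as np
from collections import deque

def expand(kernels, seg):
    """One scale-expansion step of Algorithm 1.

    kernels: (H, W) int map, 0 for background, k > 0 for the k-th kernel.
    seg:     (H, W) bool map, the segmentation result S_i.
    Returns the expanded label map E.
    """
    labels = kernels.copy()
    h, w = labels.shape
    # Seed the queue with every labeled kernel pixel (the Enqueue loop).
    q = deque((y, x, labels[y, x])
              for y in range(h) for x in range(w) if labels[y, x] > 0)
    while q:
        y, x, label = q.popleft()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # 4-way neighbors
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w \
                    and labels[ny, nx] == 0 and seg[ny, nx]:
                # First-come-first-served: each pixel is claimed exactly once.
                labels[ny, nx] = label
                q.append((ny, nx, label))
    return labels

# Two 1-pixel kernels expanding over a fully "text" row: the left half is
# claimed by kernel 1 and the right half by kernel 2.
kernels = np.array([[1, 0, 0, 0, 0, 2]])
seg = np.ones((1, 6), dtype=bool)
print(expand(kernels, seg))  # [[1 1 1 2 2 2]]
```

Because both kernels grow one BFS ring per iteration, conflicted pixels at the boundary are simply kept by whichever kernel reached them first, matching the first-come-first-served rule above.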

3.4 Label Generation

As illustrated in Fig. 3, PSENet produces segmentation results (e.g., S_{1},S_{2},...,S_{n}) with different kernel scales. Therefore, it requires corresponding ground truths with different kernel scales during training. In our practice, these ground truth labels can be generated simply and effectively by shrinking the original text instance. The polygon with the blue border in Fig. 5 (b) denotes the original text instance, and it corresponds to the largest segmentation label mask (see the rightmost map in Fig. 5 (c)). To obtain the shrunk masks sequentially in Fig. 5 (c), we utilize the Vatti clipping algorithm [vatti1992generic] to shrink the original polygon p_{n} by d_{i} pixels and get the shrunk polygon p_{i} (see Fig. 5 (a)). Subsequently, each shrunk polygon p_{i} is converted into a 0/1 binary mask as the segmentation label ground truth. We denote these ground truth maps as G_{1},G_{2},...,G_{n}, respectively. Mathematically, if we consider the scale ratio as r_{i}, the margin d_{i} between p_{n} and p_{i} can be calculated as:

d_{i}=\frac{\mbox{Area}(p_{n})\times(1-r_{i}^{2})}{\mbox{Perimeter}(p_{n})}, (2)

where \mbox{Area}(\cdot) is the function computing the polygon area, and \mbox{Perimeter}(\cdot) is the function computing the polygon perimeter. Further, we define the scale ratio r_{i} for ground truth map G_{i} as:

r_{i}=1-\frac{(1-m)\times(n-i)}{n-1}, (3)

where m is the minimal scale ratio, which is a value in (0,1]. Based on the definition in Eqn. (3), the values of scale ratios (i.e., r_{1},r_{2},...,r_{n}) are decided by two hyper-parameters n and m, and they increase linearly from m to 1.
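Eqns. (2) and (3) can be sketched directly; the shoelace-formula helpers below and the sample values (n = 4, m = 0.4, a 10×10 square) are illustrative choices, not the paper's settings:

```python
import numpy as np

def scale_ratios(n, m):
    """r_i = 1 - (1 - m) * (n - i) / (n - 1), i = 1..n (Eqn. 3)."""
    return [1 - (1 - m) * (n - i) / (n - 1) for i in range(1, n + 1)]

def shrink_margin(polygon, r):
    """d_i = Area(p_n) * (1 - r_i^2) / Perimeter(p_n) (Eqn. 2).

    polygon: (N, 2) array of vertices in order.
    """
    x, y = polygon[:, 0], polygon[:, 1]
    # Shoelace formula for the polygon area.
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    # Sum of edge lengths for the perimeter.
    perimeter = np.sum(
        np.linalg.norm(polygon - np.roll(polygon, -1, axis=0), axis=1))
    return area * (1 - r ** 2) / perimeter

print(scale_ratios(4, 0.4))  # increases linearly: ~[0.4, 0.6, 0.8, 1.0]
square = np.array([[0, 0], [10, 0], [10, 10], [0, 10]], dtype=float)
print(shrink_margin(square, 0.5))  # 100 * 0.75 / 40 = 1.875
```

Each d_i would then be passed to a polygon-offsetting implementation of the Vatti clipping algorithm (e.g., a Clipper-style library) to produce the shrunk polygon p_i.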

Figure 5: The illustration of label generation. (a) contains the annotations for d, p_{i} and p_{n}. (b) shows the original text instances. (c) shows the segmentation masks with different kernel scales.

3.5 Loss Function

For learning PSENet, the loss function can be formulated as:

L=\lambda L_{c}+(1-\lambda)L_{s}, (4)

where L_{c} and L_{s} represent the losses for the complete text instances and the shrunk ones respectively, and \lambda balances the importance between L_{c} and L_{s}.

It is common that text instances occupy only an extremely small region in natural images, which biases the network’s predictions toward the non-text region when binary cross entropy [de2005tutorial] is used. Inspired by [milletari2016v], we adopt the dice coefficient in our experiments. The dice coefficient D(S_{i},G_{i}) is formulated as in Eqn. (5):

D(S_{i},G_{i})=\frac{2\sum\nolimits_{x,y}(S_{i,x,y}\times G_{i,x,y})}{\sum\nolimits_{x,y}S_{i,x,y}^{2}+\sum\nolimits_{x,y}G_{i,x,y}^{2}}, (5)

where S_{i,x,y} and G_{i,x,y} refer to the value of pixel (x,y) in segmentation result S_{i} and ground truth G_{i}, respectively.
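Eqn. (5) translates directly to NumPy; the small `eps` term is our addition for numerical stability when both maps are empty, not part of the paper's formulation:

```python
import numpy as np

def dice(S, G, eps=1e-6):
    """Dice coefficient of Eqn. (5) between a predicted map S and label G."""
    inter = np.sum(S * G)                       # sum_{x,y} S_{x,y} * G_{x,y}
    union = np.sum(S ** 2) + np.sum(G ** 2)     # sum S^2 + sum G^2
    return (2.0 * inter + eps) / (union + eps)

S = np.array([[0.9, 0.8], [0.1, 0.0]])  # hypothetical predicted kernel map
G = np.array([[1.0, 1.0], [0.0, 0.0]])  # its ground-truth mask
print(dice(S, G))  # close to 1 for a good prediction
```

Unlike per-pixel cross entropy, the coefficient depends only on the overlap between text regions, so the dominant non-text area cannot swamp the loss.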

Furthermore, there are many patterns similar to text strokes, such as fences and lattices. Therefore, we apply Online Hard Example Mining (OHEM) [shrivastava2016training] to L_{c} during training to better distinguish these patterns.

L_{c} focuses on segmenting the text and non-text regions. Let the training mask given by OHEM be M; then L_{c} can be formulated as Eqn. (6):

L_{c}=1-D(S_{n}\cdot M,G_{n}\cdot M), (6)

L_{s} is the loss for the shrunk text instances. Since they are encircled by the original areas of the complete text instances, we ignore the pixels of the non-text region in the segmentation result S_{n} to avoid redundancy. Therefore, L_{s} can be formulated as follows:

\begin{split}L_{s}&=1-\frac{\sum\nolimits_{i=1}^{n-1}D(S_{i}\cdot W,G_{i}\cdot W)}{n-1},\\ W_{x,y}&=\begin{cases}1,&\text{if}\ S_{n,x,y}\geq 0.5;\\ 0,&\text{otherwise},\end{cases}\end{split} (7)
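Putting Eqns. (4)-(6) and the mask W together, the full loss can be sketched as below. The balance weight `lam = 0.7` and the function name `psenet_loss` are our assumptions; the paper's excerpt does not fix λ:

```python
import numpy as np

def dice(S, G, eps=1e-6):
    """Dice coefficient of Eqn. (5); eps is our stability addition."""
    inter = np.sum(S * G)
    return (2.0 * inter + eps) / (np.sum(S ** 2) + np.sum(G ** 2) + eps)

def psenet_loss(S, G, M, lam=0.7):
    """L = lam * L_c + (1 - lam) * L_s  (Eqn. 4).

    S, G: (n, H, W) predicted maps / ground-truth kernel masks,
          with S[-1] the complete-text map S_n.
    M:    (H, W) OHEM training mask.
    """
    n = S.shape[0]
    # Eqn. (6): complete-text loss on the OHEM-selected pixels.
    L_c = 1 - dice(S[-1] * M, G[-1] * M)
    # W keeps only pixels predicted as text in S_n.
    W = (S[-1] >= 0.5).astype(S.dtype)
    # Shrunk-kernel loss averaged over S_1 .. S_{n-1}.
    L_s = 1 - sum(dice(S[i] * W, G[i] * W) for i in range(n - 1)) / (n - 1)
    return lam * L_c + (1 - lam) * L_s
```

With a perfect prediction (S equal to G everywhere), both dice terms reach 1 and the loss goes to 0, as expected from Eqn. (4).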