MASTER: Multi-Aspect Non-local Network for Scene Text Recognition


Ning Lu1, Wenwen Yu1,2, Xianbiao Qi1, Yihao Chen1, Ping Gong2, Rong Xiao1
1Visual Computing Group, Ping An Property & Casualty Insurance Company
2School of Medical Imaging, Xuzhou Medical University
Co-first author. N. Lu, X. Qi, Y. Chen, and R. Xiao are with the Visual Computing Group, Ping An Property & Casualty Insurance Company, Shenzhen 518000, P.R. China. E-mail: jiangxiluning@gmail.com; qixianbiao@gmail.com; o0o@o0oo0o.cc; xiaorong283@pingan.com.cn. W. Yu and P. Gong are with the School of Medical Imaging, Xuzhou Medical University, Xuzhou 221000, P.R. China, and W. Yu is also with the Visual Computing Group, Ping An Property & Casualty Insurance Company, Shenzhen 518000, P.R. China. E-mail: yuwenwen62@gmail.com; gongping@xzhmu.edu.cn.
Abstract

Attention-based scene text recognizers have achieved great success, leveraging a more compact intermediate representation to learn 1D or 2D attention with an RNN-based encoder-decoder architecture. However, such methods suffer from the attention-drift problem because high similarity among encoded features leads to attention confusion under the RNN-based local attention mechanism. Moreover, RNN-based methods have low efficiency due to poor parallelization. To overcome these problems, we propose MASTER, a self-attention based scene text recognizer that (1) not only encodes the input-output attention but also learns self-attention, which encodes feature-feature and target-target relationships inside the encoder and decoder, (2) learns a more powerful and robust intermediate representation against spatial distortion, and (3) owns better training and evaluation efficiency. Extensive experiments on various benchmarks demonstrate the superior performance of our MASTER on both regular and irregular scene text.

Scene Text Recognition, OCR, Transformer, Non-local Network.

I Introduction

Scene text recognition in the wild has been an active research topic in both industry and academia over the last two decades. It has various application scenarios, such as reading text on signboards for autonomous driving, ID card scanning for banking, and key information extraction in Robotic Process Automation (RPA). However, building a high-quality scene text recognition system is a non-trivial task due to unexpected blur, strong exposure, spatial and perspective distortion, and complex backgrounds. There are two types of scene text in nature, regular and irregular, as exemplified in Figure 1.

Fig. 1: Examples of regular and irregular images. (a). regular text. (b). irregular text.

Regular scene text recognition aims to recognize a sequence of characters from an almost straight text image. It is usually treated as an image-based sequence recognition problem. Some traditional text recognition methods [1, 2] use hand-crafted features to segment patches into small glyphs and then categorize them into corresponding characters. However, these methods are known to be vulnerable to complicated backgrounds, diverse font types and irregular arrangement of characters. Connectionist temporal classification (CTC) based methods [3, 4] and attention-based methods [2, 5, 6] are the mainstream approaches for scene text recognition because they do not require character-level annotations and show superior performance in real applications.

Irregular scene text recognition is more challenging due to various curved shapes and perspective distortions. Existing irregular scene text recognizers can be divided into three categories: rectification based, multi-direction encoding based, and attention-based approaches. Shi et al. [5] propose ASTER, which combines a Thin-Plate Spline (TPS) [7] transformation as a rectification module with an attentional BiLSTM as the recognition module; ASTER achieves strong performance on many public benchmarks. Cheng et al. [8] encode a text patch into four-direction features and use a filter gate to attend to the appropriate features, after which a character sequence is generated by decoding the RNN output. Inspired by Show-Attend-Tell [9], Li et al. [6] propose the Show-Attend-Read (SAR) method, which employs 2D attention in an encoder-decoder architecture. Nonetheless, attention drift is a serious problem in these methods, especially when text lines contain repetitive digits or characters.

Incorporating global context is an effective way to ease the problem of attention drift. Self-attention [10] provides an effective approach to encode global context information. Recently, self-attention has attracted extensive attention and achieved unprecedented success in natural language processing [10, 11, 12, 13] and computer vision [14, 15, 16]. It can model long-range dependencies between words and between image regions. Wang et al. [17] propose a Transformer-like non-local block, which can be plugged into any backbone to model spatial global dependencies between objects. Its successor GCNet [15] observes that the attention maps of the non-local block are almost the same for different query positions. GCNet simplifies the non-local block with the SE block [18] to reduce the computational complexity and enhances the representative ability of the proposed block based on a query-independent formulation.

Inspired by the effectiveness of global context in GCNet and the huge success of the Transformer in NLP and CV, we propose a Multi-Aspect non-local network for Scene TExt Recognition (MASTER), targeting efficient and accurate recognition of both regular and irregular scene text. Our main contributions are highlighted as follows:

  • We propose a novel multi-aspect non-local block and fuse it into the conventional CNN backbone, which enables the feature extractor to model global context. The proposed multi-aspect non-local block can learn different aspects of spatial 2D attention and can be viewed as a multi-head self-attention module. Different types of attention focus on different aspects of spatial feature dependencies, analogous to different syntactic dependency types.

  • In the decoder part, Transformer modules are used to predict the output sequence, utilizing both a 2D attention that fuses local and global context and a latent language model to better predict visual words.

  • Besides its high efficiency, our method achieves state-of-the-art performance on both regular and irregular scene text benchmarks. In particular, it achieves the best case-sensitive performance on the COCO-Text dataset.

II Related Work

In academia, scene text recognition is divided into two categories: regular and irregular text recognition. In this section, we give a brief review of related works in both areas. More detailed reviews of scene text detection and recognition can be found in [19, 20, 21].

Regular Text Recognition attracted most of the early research attention. Mishra et al. [22] use a traditional sliding-window based method to describe bottom-up cues and use a vocabulary prior to model top-down cues; the two cues are combined to minimize the energy of character combinations. Shi et al. [3] propose an end-to-end trainable, character-annotation-free network called CRNN. CRNN extracts a 1D feature sequence using a CNN, encodes the sequence with an RNN, and finally computes a CTC loss; it is the first work that only needs word-level annotation instead of character-level annotation. Gao et al. [23] integrate an attention module into the residual block to amplify the response of the foreground and suppress the response of the background; however, this attention module cannot encode global dependencies between pixels. Cheng et al. [24] observe that attention may drift due to complex scenes or low-quality images, which is a weakness of the vanilla 2D-attention network. To address the misalignment between the input sequence and the target, Bai et al. [4] employ an attention-based encoder-decoder architecture and estimate the edit probability of a text conditioned on the output sequence; edit probability targets the issue of missing and superfluous characters. Zhang et al. [25] adapt an unsupervised fixed-length domain adaptation methodology to the variable-length scene text recognition setting, and their model is also based on an attentional encoder-decoder architecture.

Irregular Text Recognition is more challenging than regular text recognition; nevertheless, it has attracted a great deal of research effort. Yao et al. [26] present one of the earliest works that explicitly proposes a multi-orientation text detection model. Shi et al. [27, 5] attempt to address multi-type irregular text recognition in one framework via the Spatial Transformer Network (STN) [28]. In [29], Zhan et al. propose to iteratively rectify text images to be fronto-parallel in order to further improve recognition performance; the proposed line-fitting transformation estimates the pose of the text line by learning its middle line and the line segments needed by the Thin-Plate Spline. However, rectification-based methods are often constrained by the geometric features of characters, and the background noise may be exaggerated unexpectedly. To overcome this, Luo et al. [30] propose a multi-object rectified attention network, which is more flexible than direct affine transformation estimation. Unlike the rectification-based approaches, Show-Attend-Read (SAR), proposed by Li et al. [6], uses a 2D-attention mechanism to guide the encoder-decoder recognition module to focus on the corresponding character region. This method is free of complex spatial transformations.

While 2D attention can represent the relationship between the target output and the input image features, the global context between pixels and the latent dependencies between characters are ignored. In [31], Hu et al. propose an object relation module to simultaneously model a set of object relations through their visual features. After the success of the Transformer [10], Wang et al. [17] incorporate a self-attention block into the non-local network. Cao et al. [15] further simplify and improve the non-local network and propose a novel global context network (GCNet). Recently, Sheng et al. [32] propose a purely Transformer-based scene text recognizer that learns self-attention in both the encoder and the decoder: it extracts a 1D sequence feature using a simple CNN module and feeds it into a Transformer to decode the target outputs. Nevertheless, the self-attention module of the Transformer consists of multiple fully connected layers, which largely increases the number of parameters. Wang et al. [33] abandon the encoder of the original Transformer and only retain the CNN feature extractor and the decoder to construct an irregular scene text recognizer; however, it cannot encode the global context of pixels in the feature map. The network proposed in this paper learns not only the 2D attention between input features and output targets but also the self-attention inside the feature extractor and the decoder. The multi-aspect non-local block can encode different types of spatial feature dependencies with lower computational cost and a more compact model.

III Methodology

The MASTER model, as shown in Figure 2c, consists of two key modules: a Multi-Aspect Global Context Attention (GCAttention) based encoder and a Transformer-based decoder. In MASTER, an image of fixed size is fed into the network, and the output is a sequence of predicted characters.

III-A Encoder

The encoder of our MASTER model encodes an input image into a feature tensor; for instance, feeding an image into the encoder produces a spatially down-sampled feature map. One of our key contributions in this paper is the Multi-Aspect Global Context Attention (GCAttention) introduced in the encoder part. In this subsection, we first review the definition of the global context block [15], then introduce the proposed Multi-Aspect Global Context Attention (GCAttention), and finally describe the architecture of the encoder in detail.

III-A1 Global Context Block

A standard global context block was first introduced in [15]. Its module structure is shown in Figure 2a. The input feature map of the global context block is $\mathbf{x} \in \mathbb{R}^{C \times H \times W}$, where $C$, $W$, and $H$ indicate the number of channels, the width, and the height of the feature map, respectively, and $d_{model}$ indicates the dimension of the output of the encoder. In the global context block, three operations are performed on the feature map $\mathbf{x}$: a) global attention pooling for context modeling, b) a bottleneck transform to capture channel-wise dependencies, and c) broadcast element-wise addition for feature fusion. The global context block can be expressed as

\mathbf{z}_i = \mathbf{x}_i + W_{v2}\,\mathrm{ReLU}\Big(\mathrm{LN}\Big(W_{v1}\sum_{j=1}^{N_p}\frac{e^{W_k \mathbf{x}_j}}{\sum_{m=1}^{N_p} e^{W_k \mathbf{x}_m}}\,\mathbf{x}_j\Big)\Big)    (1)

where $\mathbf{x}$ and $\mathbf{z}$ denote the input and the output of the global context block, respectively; they have the same dimensions. $i$ is the index of query positions, and $j$ enumerates the positions of all pixels. $W_k$, $W_{v1}$ and $W_{v2}$ denote linear transformations to be learned via $1 \times 1$ convolutions. $\mathrm{LN}(\cdot)$ denotes layer normalization as in [34]. For simplification, we denote $\alpha_j = \frac{e^{W_k \mathbf{x}_j}}{\sum_{m=1}^{N_p} e^{W_k \mathbf{x}_m}}$ as the weight for context modeling, and $\delta(\cdot) = W_{v2}\,\mathrm{ReLU}(\mathrm{LN}(W_{v1}(\cdot)))$ as the bottleneck transform. The “+” operation denotes broadcast element-wise addition.

Fig. 2: (a): the architecture of a standard Global Context (GC) block. (b): the proposed Multi-Aspect GCAttention. (c): the whole architecture of the MASTER model, consisting of two main parts: a Multi-Aspect Global Context Attention (GCAttention) based encoder for feature representation and a Transformer-based decoder. $C \times H \times W$ denotes a feature map with channel number $C$, height $H$ and width $W$. $h$ and $r$ denote the number of Multi-Aspect Contexts and the bottleneck ratio, respectively, and the hidden representation dimension of the bottleneck is $C/r$. $\otimes$ denotes matrix multiplication, $\oplus$ denotes broadcast element-wise addition. in_ch/out_ch denotes input/output dimensions.
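To make the computation above concrete, the following is a minimal PyTorch sketch of a single-aspect global context block, assuming a 1×1 convolution for the context-modeling weight $W_k$ and a bottleneck of the form $W_{v1}$ → LayerNorm → ReLU → $W_{v2}$; the class and argument names are ours.

import torch
from torch import nn

class GlobalContextBlock(nn.Module):
    def __init__(self, channels, ratio=16):
        super().__init__()
        # W_k: 1x1 convolution producing one attention logit per position
        self.w_k = nn.Conv2d(channels, 1, kernel_size=1)
        hidden = channels // ratio
        # bottleneck transform: W_v1 -> LayerNorm -> ReLU -> W_v2
        self.transform = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1))

    def forward(self, x):
        n, c, h, w = x.size()
        # query-independent attention weights over all H*W positions
        attn = torch.softmax(self.w_k(x).view(n, 1, h * w), dim=-1)
        # global context vector: attention-weighted sum of the features
        context = torch.bmm(x.view(n, c, h * w), attn.transpose(1, 2)).view(n, c, 1, 1)
        # broadcast element-wise addition for feature fusion
        return x + self.transform(context)

For example, GlobalContextBlock(256)(torch.randn(2, 256, 6, 40)) returns a tensor of the same shape as its input.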

III-A2 Multi-Aspect GCAttention

Instead of performing a single attention function as in the original global context block, we find it beneficial to perform multiple attention functions. We call the resulting module Multi-Aspect GCAttention (MAGC). The structure of MAGC is illustrated in Figure 2b, and it can be formulated as

\mathrm{MAGC}(\mathbf{x}) = \mathbf{x} + \delta\Big(\mathrm{Concat}\big(\Delta_1, \Delta_2, \ldots, \Delta_h\big)\Big), \quad \Delta_l = \sum_{j=1}^{N_p}\frac{e^{W_k^{(l)} \mathbf{x}_j^{(l)} / s}}{\sum_{m=1}^{N_p} e^{W_k^{(l)} \mathbf{x}_m^{(l)} / s}}\,\mathbf{x}_j^{(l)}    (2)

where $h$ is the number of Multi-Aspect Contexts, $\Delta_l$ denotes the $l$-th global context computed on the $l$-th group of channels $\mathbf{x}^{(l)}$, $N_p$ is the number of positions of all pixels in the feature map ($N_p = H \times W$), and $\mathrm{Concat}(\cdot)$ is a concatenation function. $\mathrm{MAGC}(\cdot)$ denotes the multi-aspect global context attention operation. $s$ is a scale factor that counteracts the effect of different variances in MAGC; it is calculated as $s = \sqrt{C/h}$.

As shown in Figure 3, we present the implementation of our approach with a short code snippet based on PyTorch (https://pytorch.org/); it can also easily be implemented on other platforms such as TensorFlow (https://tensorflow.google.cn/).

import math
import torch
from torch import nn
from torch.nn import functional as F

def multi_aspect_global_context_attention(x, h, r):
    # x: input features with shape [N, C, H, W]
    # h: number of multi-aspect contexts
    # r: bottleneck ratio
    out = x
    N, C, H, W = x.size()
    C_per = int(C / h)
    # split channels into h aspects: [N*h, C_per, H, W]
    x = x.view(N * h, C_per, H, W)
    # [N*h, C_per, H * W]
    input_x = x.view(N * h, C_per, H * W)
    # [N*h, 1, C_per, H * W]
    input_x = input_x.unsqueeze(1)
    # 1x1 convolution W_k for context modeling (randomly initialized here;
    # in practice it is a learned parameter): output [N*h, 1, H, W]
    filters = torch.randn(1, C_per, 1, 1)
    MAContext = F.conv2d(x, filters, stride=1, padding=0)
    # [N*h, 1, H * W]
    MAContext = MAContext.view(N * h, 1, H * W)
    # scale variance
    if h > 1:
        MAContext = MAContext / math.sqrt(C_per)
    # softmax over all spatial positions: [N*h, 1, H * W]
    MAContext = F.softmax(MAContext, dim=2)
    # [N*h, 1, H * W, 1]
    MAContext = MAContext.unsqueeze(-1)
    # weighted sum of features: [N*h, 1, C_per, 1]
    context = torch.matmul(input_x, MAContext)
    # concatenate the h contexts back: [N, C, 1, 1]
    context = context.view(N, C, 1, 1)
    # bottleneck transform (randomly initialized here; learned in practice)
    ratio_planes = int(C / r)
    transform = nn.Sequential(
        nn.Conv2d(C, ratio_planes, kernel_size=1),
        nn.LayerNorm([ratio_planes, 1, 1]),
        nn.ReLU(inplace=True),
        nn.Conv2d(ratio_planes, C, kernel_size=1))
    # broadcast element-wise addition for feature fusion
    out = out + transform(context)
    return out
Fig. 3: Python code implementation of Multi-Aspect GCAttention in PyTorch.

III-A3 Encoder Structure

The detailed architecture of the Multi-Aspect GCAttention based encoder is shown in the left half of Figure 2c. The backbone of the encoder, following the design of ResNet31 [35] and the setting protocol in [6], is presented in Table I. The encoder has four fundamental blocks, shown in blue in Figure 2c; each fundamental block consists of a residual block, a MAGC, a convolution block, and a max-pooling layer (the max pooling is not included in the last two fundamental blocks). In the residual block, if the input and output dimensions are different we use a projection shortcut; otherwise, we use the identity shortcut, as illustrated in the sketch below. After the residual block, a Multi-Aspect GCAttention is plugged into the network to learn new feature representations from multiple aspects. All convolutional layers use the same kernel size. Besides standard max-pooling layers, we also use a 2×1 max-pooling layer, which preserves more information along the horizontal axis and benefits the recognition of narrow characters.
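As a concrete illustration of the shortcut choice described above, the following is a minimal PyTorch sketch of one residual unit of the backbone; the 3×3 kernel size, batch normalization, and the class name are our assumptions for illustration.

import torch
from torch import nn

class ResidualUnit(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch))
        # projection shortcut when input and output dimensions differ,
        # identity shortcut otherwise
        if in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=1),
                nn.BatchNorm2d(out_ch))
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        # in each fundamental block, a Multi-Aspect GCAttention layer is
        # applied to the output of this unit (see Figure 2c)
        return torch.relu(self.body(x) + self.shortcut(x))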

III-B Decoder

As shown in the right half of Figure 2c, the decoder contains a stack of fundamental blocks, shown in purple. Each fundamental block contains three core modules: a Masked Multi-Head Attention, a Multi-Head Attention, and a Feed-Forward Network (FFN). In the following, we introduce these three key modules in detail and discuss the loss function used in this paper.

III-B1 Scaled Multi-Head Dot-Product Attention

Scaled multi-head dot-product attention was first introduced in [10]. The inputs of the scaled dot-product attention consist of queries $Q \in \mathbb{R}^{n_q \times d_{model}}$ (where $d_{model}$ is the dimension of the embedding output and $n_q$ is the number of queries), and a set of key-value pairs of $d_{model}$-dimensional vectors $K \in \mathbb{R}^{n_k \times d_{model}}$ and $V \in \mathbb{R}^{n_k \times d_{model}}$ (where $n_k$ is the number of key-value pairs). The scaled dot-product attention can be expressed as follows

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{QK^{\top}}{\sqrt{d_{model}}}\Big)V    (3)

where $\mathrm{softmax}(QK^{\top}/\sqrt{d_{model}})$ gives the attention weights, $K \in \mathbb{R}^{n_k \times d_{model}}$, $V \in \mathbb{R}^{n_k \times d_{model}}$, and $Q \in \mathbb{R}^{n_q \times d_{model}}$ is the set of queries.
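A minimal PyTorch sketch of Eq. (3) is given below; the optional mask argument anticipates the masked attention of Section III-B2, and the scaling uses the dimensionality of the incoming keys.

import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q: [batch, n_q, d], k and v: [batch, n_k, d]
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(k.size(-1))
    if mask is not None:
        # disallowed positions receive -inf before the softmax
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)   # attention weights
    return torch.matmul(weights, v)           # [batch, n_q, d]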

The above scaled dot-product attention can be repeated multiple times (multi-head) with different linear transformations on $Q$, $K$, and $V$, followed by a concatenation and a linear transformation:

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\,W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})    (4)

where $\mathrm{MultiHead}(\cdot)$ denotes the multi-head attention operation. The projection parameters are $W_i^{Q}, W_i^{K}, W_i^{V} \in \mathbb{R}^{d_{model} \times d_{model}/H}$ and $W^{O} \in \mathbb{R}^{d_{model} \times d_{model}}$. $H$ denotes the number of attention heads.
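A minimal, self-contained sketch of Eq. (4) follows; $d_{model} = 512$ and $H = 8$ heads match the setting used later in Section IV-B1, while the class and parameter names are illustrative.

import math
import torch
from torch import nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)   # W^Q (all heads stacked)
        self.w_k = nn.Linear(d_model, d_model)   # W^K
        self.w_v = nn.Linear(d_model, d_model)   # W^V
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def forward(self, q, k, v, mask=None):
        n = q.size(0)
        # project, then split the model dimension into H heads: [N, H, len, d_k]
        q = self.w_q(q).view(n, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(n, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(n, -1, self.h, self.d_k).transpose(1, 2)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        heads = torch.matmul(torch.softmax(scores, dim=-1), v)   # [N, H, len_q, d_k]
        # concatenate the heads and apply the final linear projection W^O
        out = heads.transpose(1, 2).contiguous().view(n, -1, self.h * self.d_k)
        return self.w_o(out)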

III-B2 Masked Multi-Head Attention

Masked multi-head attention is an effective mechanism to guarantee that, in the decoder, the prediction at one time step can only access the output information of its previous time steps. In the training stage, by creating a lower-triangular mask matrix, the decoder can output predictions for all time steps simultaneously instead of one by one sequentially, which makes the training process highly parallel.
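The following is a minimal sketch of the lower-triangular mask used during training: output position i is allowed to attend only to positions j ≤ i.

import torch

def subsequent_mask(size):
    # mask[i, j] is True when output position i may attend to position j
    return torch.tril(torch.ones(size, size)).bool()

# e.g. subsequent_mask(4) ->
# [[ True, False, False, False],
#  [ True,  True, False, False],
#  [ True,  True,  True, False],
#  [ True,  True,  True,  True]]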

III-B3 Position-wise Feed-Forward Network

The position-wise Feed-Forward Network (FFN) consists of two linear transformations with a ReLU activation after the first one. FFN is defined as

\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2    (5)

where the weights are $W_1 \in \mathbb{R}^{d_{model} \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d_{model}}$, the biases are $b_1 \in \mathbb{R}^{d_{ff}}$ and $b_2 \in \mathbb{R}^{d_{model}}$, and $d_{ff}$ is the inner dimension of the two linear transformations.
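A minimal sketch of Eq. (5) is shown below; the inner dimension of 2048 and the dropout rate of 0.2 follow Section IV-B1, while the exact placement of dropout inside the block is our assumption.

import torch
from torch import nn

class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),    # x W_1 + b_1
            nn.ReLU(inplace=True),       # ReLU after the first transformation
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model))    # (.) W_2 + b_2

    def forward(self, x):
        # applied identically and independently at every sequence position
        return self.net(x)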

III-B4 Loss Function

A linear transformation followed by a softmax function is used to compute the prediction probability over all classes, and the standard cross-entropy loss between the predicted probabilities and the ground truth is computed at each decoding position. In this paper, we use 66 symbol classes, except for COCO-Text, which uses 104 symbol classes. The 66 symbols are 10 digits, 52 case-sensitive letters and 4 special tokens. The 4 special tokens are “SOS”, “EOS”, “PAD”, and “UNK”, which indicate the start of sequence, the end of sequence, the padding symbol, and unknown characters (those that are neither digits nor letters), respectively. The parameters of the classification layer are shared over all decoding positions.
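A minimal sketch of this training objective is given below: per-position cross-entropy between the decoder logits and the ground-truth symbols. The class indices and the choice to ignore “PAD” positions are illustrative assumptions.

import torch
from torch import nn

NUM_CLASSES = 66     # 10 digits + 52 letters + SOS/EOS/PAD/UNK
PAD_INDEX = 65       # illustrative index of the "PAD" symbol

criterion = nn.CrossEntropyLoss(ignore_index=PAD_INDEX)

def recognition_loss(logits, targets):
    # logits: [batch, seq_len, NUM_CLASSES], targets: [batch, seq_len]
    return criterion(logits.reshape(-1, NUM_CLASSES), targets.reshape(-1))

# example with random tensors
loss = recognition_loss(torch.randn(2, 25, NUM_CLASSES),
                        torch.randint(0, NUM_CLASSES, (2, 25)))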

III-C Inference Stage

The inference stage differs from the training stage. In the training stage, by constructing a lower-triangular mask matrix, the decoder can predict all time steps simultaneously, which is highly parallel and efficient. In the inference stage, however, the decoder can only predict characters one by one sequentially until it outputs the “EOS” token or the decoded sequence reaches a maximum length. During decoding, the output of a later step depends on the outputs of its previous time steps, because those outputs are used as part of the input to decode it. In MASTER, we employ a fixed-length input for the decoder, and the length is set to 100.

An inference flowchart is shown in Figure 4. At time step 0, “SOS” is placed in the first position of the input, and the remaining positions are padded with the “PAD” token. With this input, the decoder predicts the first character “S”, as shown in Figure 4. Putting the predicted character “S” in the second position after “SOS” and padding the rest with “PAD”, we can predict the second character “T”, as shown in Figure 4. Similarly, we predict the remaining characters until the decoder outputs “EOS” or the number of predictions reaches the preset maximum length. Under the fixed-length prediction strategy, our inference is highly parallel.

Fig. 4: Illustration of the decoder input in the inference phase. “SOS”, “EOS”, and “PAD” denote the start of sequence, the end of sequence, and the padding used for the fixed-length decoder input, respectively.
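A minimal sketch of this fixed-length, step-by-step decoding is given below; `decoder` is an assumed callable that maps the encoder memory and the current target sequence to per-position logits, and the token ids are placeholders.

import torch

def greedy_decode(decoder, memory, sos_id, eos_id, pad_id, max_len=100):
    batch = memory.size(0)
    # fixed-length decoder input: "SOS" followed by "PAD" tokens
    tgt = torch.full((batch, max_len), pad_id, dtype=torch.long)
    tgt[:, 0] = sos_id
    for t in range(1, max_len):
        logits = decoder(memory, tgt)              # [batch, max_len, num_classes]
        next_token = logits[:, t - 1].argmax(dim=-1)
        tgt[:, t] = next_token                     # feed the prediction back in
        if bool((next_token == eos_id).all()):     # every sequence emitted "EOS"
            break
    return tgt[:, 1:]                              # predictions without "SOS"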

IV Experiments

We conduct extensive experiments on several benchmarks to verify the effectiveness of our method and compare it with state-of-the-art methods. In Section IV-A, we introduce the training and testing datasets. In Section IV-B, we present our implementation details. In Section IV-C, we make a detailed comparison between our method and the state-of-the-art methods. Finally, we conduct ablation studies in Section IV-D.

Layer   Configuration                                  Output
conv1_x conv, 64
        conv, 128
        max_pool
conv2_x [residual block], multi-aspect gcattention
        conv, 256
        max_pool
conv3_x [residual block], multi-aspect gcattention
        conv, 512
        max_pool
conv4_x [residual block], multi-aspect gcattention
        conv, 512
conv5_x [residual block], multi-aspect gcattention
        conv, 512
TABLE I: A ResNet-based CNN architecture for robust text feature representation. Residual blocks are shown in brackets, and the Multi-Aspect GCAttention layers are highlighted with a gray background.
Method IIIT5K SVT IC03 IC13 IC15 SVTP CUTE
None None None None None None None
[28] - 80.7 93.1 90.8 - - -
[2] 78.4 80.8 88.7 90.0 - - -
[27] 81.9 81.9 90.1 88.6 - 71.8 59.2
STAR-Net [37] 83.3 83.6 - 89.1 - 73.5 -
[38] 80.8 81.5 - - - - -
CRNN [3] 81.2 82.7 91.9 89.6 - - -
Focusing Attention [24]* 87.4 85.9 94.2 93.3 70.6 - -
SqueezedText [39]* 87.0 - - 92.9 - - -
Char-Net [40]* 92.0 85.5 - 91.1 74.2 78.9 -
Edit Probability [4]* 88.3 87.5 94.6 94.4 73.9 - -
AON [8] 87.0 82.8 91.5 - 68.2 73.0 76.8
ASTER [5] 93.4 89.5 94.5 91.8 76.1 78.5 79.5
NRTR [32] 86.5 88.3 95.4 94.7 - - -
SAR [6] 91.5 84.5 - 91.0 69.2 76.4 83.3
ESIR [29] 93.3 90.2 - 91.3 76.9 79.6 83.3
MORAN [30] 91.2 88.3 95.0 92.4 68.8 76.1 77.4
[33] 93.3 88.1 - 91.3 74.0 80.2 85.1
Mask TextSpotter [36]** 95.3 91.8 95.2 95.3 78.2 83.6 88.5
MASTER (Ours) 95.0 90.6 96.4 95.3 79.4 84.5 87.5
TABLE II: Performance of our model and other state-of-the-art methods on public datasets. All values are reported as percentages (%). “None” means no lexicon is used. * indicates using both word-level and character-level annotations to train the model. ** indicates a model trained with end-to-end detection and recognition. In each column, the best result is shown in bold and the second best is underlined. Our model achieves competitive performance on most of the public datasets, and the gap between our method and the first-place method [36] is very small on the IIIT5K and SVT datasets.

IV-A Datasets

In this paper, we train our MASTER model only on three synthetic datasets, without finetuning on real data (except for COCO-Text, as described in Section IV-B). We evaluate our model on eight standard benchmarks, which contain four regular and four irregular scene text datasets.

The training datasets consist of the following datasets.

Synth90k (MJSynth) is the synthetic text dataset proposed in [41]. The dataset has 9 million images generated from a set of 90k common English words. Every image in Synth90k is annotated with a word-level ground-truth. All of the images in this dataset are used for training.

SynthText [42] is a synthetic text dataset originally introduced for text detection. The generation procedure is similar to [41]; however, unlike [41], words are rendered onto full images with large resolution instead of onto cropped text lines. 800 thousand full images are used as background images, and each rendered image usually contains around 10 text lines. Recently, it has also been widely used for scene text recognition. We obtain 7 million text lines from this dataset for training.

SynthAdd is the synthetic text dataset proposed in [6]. It contains 1.6 million word images generated with the synthetic engine proposed by [41] to compensate for the lack of special characters such as punctuation. All images in this dataset are used for training.

The test datasets consist of the following datasets.

IIIT5K-Words (IIIT5K) [43] has 3,000 test images collected from the web. Each image is associated with a short 50-word lexicon and a long 1,000-word lexicon. A lexicon includes the ground-truth word and other randomly picked words.

Street View Text (SVT) [44] is collected from Google Street View. The test set includes 647 cropped word images. Many images in SVT are severely corrupted by noise and blur or have very low resolution. Each image is associated with a 50-word lexicon.

ICDAR 2003 (IC03) [45] contains 866 cropped word images after we discard images that contain non-alphanumeric characters or have fewer than three characters, for fair comparison. A 50-word lexicon is defined for each image.

ICDAR 2013 (IC13) [46] contains 1,095 images for evaluation and 848 cropped word patches for training. We filter out words that contain non-alphanumeric characters for fair comparison, which results in 1,015 test words. No lexicon is provided.

ICDAR 2015 (IC15) has 4,468 cropped words for training and 2,077 cropped words for evaluation, which were captured by Google Glass without careful positioning and focusing. The dataset contains a large amount of irregular text.

SVT-Perspective (SVTP) consists of 645 cropped images for testing [47]. Images are taken from side-view angles in Google Street View; therefore, most images are perspective distorted. Each image is associated with a 50-word lexicon and a full lexicon.

CUTE80 (CUTE) contains 288 images [48]. It is a challenging dataset since there are plenty of images with curved text. No lexicon is provided.

COCO-Text (COCO-T) was first introduced in the Robust Reading Challenge of ICDAR 2017. It contains 62,351 image patches cropped from the Microsoft COCO dataset. COCO-T is extremely challenging because printed, scanned, and handwritten texts are mixed together and the shapes of text lines vary greatly. 42,618, 9,896 and 9,837 images are used for training, validation and testing, respectively.

IV-B Network Structure and Implementation Details

IV-B1 Networks

The network structure of the encoder part is listed in Table I. The input size of our model is 48×160. When the width-to-height ratio of an image is larger than 160/48, we directly resize the input image to 48×160; otherwise, we resize the height to 48 while keeping the aspect ratio and then pad the resized image to 48×160, as sketched below. In MASTER, the embedding dimension is 512, the dimension of the output of the encoder is also 512, and the number of attention heads is 8. The inner dimension $d_{ff}$ of the feed-forward module is set to 2048, and the number of identical decoder layers is 3. We use a dropout of 0.2 on the embedding module, the feed-forward module, and the output linear transformation in the decoder part. The number of Multi-Aspect Contexts is 8 and the bottleneck ratio is 16.
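A minimal sketch of this resize-and-pad preprocessing using PIL is given below; the black padding value and the function name are assumptions.

from PIL import Image

def resize_and_pad(img, target_h=48, target_w=160):
    w, h = img.size
    if w / h >= target_w / target_h:
        # sufficiently wide images are resized directly to the target size
        return img.resize((target_w, target_h))
    # otherwise resize the height to 48 while keeping the aspect ratio
    new_w = max(1, round(w * target_h / h))
    resized = img.resize((new_w, target_h))
    # then pad the width up to 160 (black padding assumed here)
    canvas = Image.new(img.mode, (target_w, target_h))
    canvas.paste(resized, (0, 0))
    return canvas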

IV-B2 Implementation Details

Our model is trained only on three synthetic datasets, without finetuning on any real data except for the COCO-T dataset. These three synthetic datasets are SynthText [42] with 7 million text images, Synth90K [41] with 9 million text images, and SynthAdd [6] with 1.6 million text images.

Our model is implemented in PyTorch and trained on four NVIDIA Tesla V100 GPUs. We train the model from scratch using the Adam [49] optimizer and the cross-entropy loss. The learning rate is kept fixed over the whole training phase. We observe that the learning rate should be scaled with the number of GPUs; for a single GPU, a proportionally smaller learning rate is a good choice. Our model is trained for 12 epochs, and each epoch takes about 3 hours. For COCO-T, we first finetune the above model with around 9K real images collected from IC13, IC15 and IIIT5K, and then finetune it with the training and validation images of COCO-T.

At the test stage, for images whose height is larger than their width, we rotate the image 90 degrees clockwise and anti-clockwise. We feed the original image and the two rotated images into the model and choose the output with the maximum output probability. No lexicon is used in this paper. Note that, unlike SAR [6], ASTER [5], and NRTR [32], we do not use beam search.
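A minimal sketch of this test-time strategy follows; `recognize` is an assumed callable that returns a (text, probability) pair for one image.

from PIL import Image

def predict_with_rotations(img, recognize):
    candidates = [img]
    if img.size[1] > img.size[0]:                      # height larger than width
        candidates.append(img.rotate(90, expand=True))     # anti-clockwise
        candidates.append(img.rotate(-90, expand=True))    # clockwise
    # keep the result with the maximum output probability
    return max((recognize(c) for c in candidates), key=lambda r: r[1])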

Method Case Sensitive Case Insensitive
Total Edit Distance Correctly Recognised Words (%) Total Edit Distance Correctly Recognised Words (%)
SogouMM 3,496.3121 44.64 1,037.2197 77.97
SenseTime-CKD 4,054.8236 41.52 824.6449 77.22
HIK_OCR 3,661.5785 41.72 899.1009 76.11
Tencent-DPPR Team 4,022.1224 36.91 1,233.4609 70.83
CLOVA-AI [50] 3,594.4842 47.35 1,583.7724 69.27
SAR [6] 4,002.3563 41.27 1,528.7396 66.85
HKU-VisionLab [40] 3,921.9388 40.17 1,903.3725 59.29
MASTER (Ours) 3,272.0810 49.09 1,203.4201 71.33
TABLE III: Leaderboard of various methods on the online COCO-Text test server. In each column, bold indicates the best performance.

IV-C Comparisons with State-of-the-Art Methods

In this section, we evaluate the proposed method on several regular and irregular text benchmarks and compare its performance with other state-of-the-art methods. We also report results on the online COCO-Text test server (https://rrc.cvc.uab.es/?ch=5&com=evaluation&task=2).

As shown in Table II, our method achieves superior performance on both regular and irregular datasets compared with state-of-the-art methods. On the regular datasets, including IIIT5K, IC03 and IC13, our approach largely improves over SAR [6], which is based on an LSTM with 2D attention, and ASTER [5], which is based on a Seq2Seq model with attention after a text rectification module. Specifically, our approach improves over SAR by 3.5% and 6.1% on IIIT5K and SVT, respectively. On the irregular datasets, our method achieves the best performance on the SVTP and IC15 datasets, which demonstrates that the multi-aspect mechanism used in MASTER is highly effective on irregular scene text. Note that all these results are obtained without a lexicon or beam search. The method in [36] uses extra character-level data and unifies detection and recognition in one framework.

Furthermore, as shown in Table III, we also use the online evaluation tools on the COCO-Text dataset to verify the competitive performance of our model. Our model outperforms the compared methods by a large margin on the case-sensitive metrics, demonstrating the power of the network. Specifically, our model improves the correctly-recognised-words accuracy by 1.74% (from 47.35% to 49.09%) under the case-sensitive condition. Under the case-insensitive metrics, our model ranks fourth on the leaderboard and performs much better than SAR. Note that the first-place method under the case-insensitive metric uses a tailored 2D-attention module, and the second- and third-place methods use model ensembles. Our results are based on an ensemble of four models obtained at different time steps of the same training run; the prediction with the maximum probability among the four models is selected as the final prediction.

As seen from Figure 5, our method is clearly more robust than SAR [6] on scene text recognition, even when the input image is blurry, the shape is curved, or the text is badly distorted. The reason is that our model not only learns the input-output attention but also learns self-attention, which encodes feature-feature and target-target relationships inside the encoder and decoder. This makes the intermediate representations more robust to spatial distortion. Besides, the attention-drift problem is significantly alleviated in our approach compared with SAR [6].

Input Images Ours By SAR [6] GT
ANDA AMDA ANDA
GOOD GCOD GOOD
wacom waccom wacom
BONNIE BONIE BONNIE
SERV LEAD SERV
actaea actara actaea
Fig. 5: Samples of recognition results of our MASTER and SAR. Green characters mean correct predictions and red characters mean wrong predictions.
Methods IIIT5k SVT CUTE IC03 IC13 IC15 SVTP
Standard Setting ($h=8$, $N=3$):
95.0 90.6 87.5 96.4 95.3 79.4 84.5
94.6 90.1 86.2 95.9 95.0 78.4 82.3
94.9 91.5 87.6 96.9 95.7 79.4 83.8
94.93 90.7 88.54 96.6 95.4 79.5 84.0
94.7 90.9 86.8 96.1 95.1 79.6 83.7
95.1 91.3 85.4 96.0 95.3 79.4 84.1
94.3 90.4 85.4 95.3 94.1 78.9 83.1
91.3 87.4 76.7 94.3 91.6 72.9 75.7
TABLE IV: Recognition accuracy of our model under different parameter settings. $h$ and $N$ denote the number of Multi-Aspect Contexts in the encoder and the number of identical layers in the decoder, respectively. The Standard Setting uses $h=8$ and $N=3$. When $h$ or $N$ changes, all other parameters are kept the same as in the Standard Setting. All values are reported as percentages (%).

IV-D Ablation Studies

IV-D1 Influence of Key Hyperparameters

We perform a series of ablation studies to analyze the impact of different hyperparameters on the recognition performance. All models are trained from scratch on three synthetic datasets (Synth90K, SynthText, and SynthAdd). Results are reported on seven standard benchmarks without using a lexicon. We study two key hyperparameters: the number of Multi-Aspect Contexts $h$ in the encoder part and the number of fundamental blocks $N$ in the decoder part. The results are shown in Table IV.

There are two groups of experimental comparisons in Table IV. First, fixing $N$, we vary the number of Multi-Aspect Contexts $h$, including a setting in which no MAGC is used in the model. We observe that using the MAGC module consistently improves the performance compared with not using MAGC. Moreover, using multiple aspects obtains performance improvements over a single aspect on all datasets, with especially significant improvements on CUTE, IC15 and SVTP; these three datasets are difficult and irregular. We believe this is because the introduced MAGC module captures different aspects of spatial 2D attention well, which is very important for irregular and hard text images. Second, we evaluate different settings of the number of fundamental blocks $N$ in the decoder part. $N=3$ gives the best performance, and the performance of $N=6$ decreases considerably compared with $N=3$. We reckon that too many decoder layers may cause convergence problems. Therefore, we use $h=8$ and $N=3$ by default in our experiments.

IV-D2 Comparison of Evaluation Speed

Method Input Accuracy Time (ms)
SAR [6] 91.5 16.08
MASTER (Ours) 95.0 9.22
TABLE V: Speed comparison between MASTER (ours) and SAR. MASTER is both faster and more accurate than SAR. All timing information is measured on an NVIDIA Tesla V100 GPU.

We compare test speed on a server with an NVIDIA Tesla V100 GPU and an Intel Xeon Gold 6130 @ 2.10GHz CPU. The results are averaged over 3,000 test images from IIIT5K with an input image size of 48×160. The results of SAR are based on our own PyTorch implementation with the same settings as [6].

We observe from Table V that MASTER not only achieves better accuracy but also runs faster than SAR: the test-time speed of MASTER is 9.22 ms per image compared to 16.08 ms for SAR. Note that our MASTER is highly parallel because of the fixed-length strategy; by stacking multiple test images into one batch and feeding the batch in a single forward pass, we can obtain a further speedup.

IV-D3 Model Stability

We show the evaluation accuracies of MASTER and SAR along the training steps in Figure 6. We find from Figure 6 that the MASTER model achieves more stable recognition performance than SAR, although SAR converges faster. We reckon the reason is that MASTER must learn global attention, which is slower, whereas SAR only needs to compute local attention. The performance of MASTER is very stable once it reaches its best value and does not decrease much afterwards, while the performance of SAR often drops noticeably after reaching its best value.

Fig. 6: The model stability comparison between MASTER (Ours) and SAR [6].

V Conclusion

In this work, we propose MASTER, a novel multi-aspect non-local network for scene text recognition. MASTER consists of a Multi-Aspect Global Context Attention (GCAttention) based encoder module and a Transformer-based decoder module. The proposed MASTER has three advantages: (1) the model learns both input-output attention and self-attention, which encodes feature-feature and target-target relationships inside the encoder and decoder; (2) experiments demonstrate that the proposed method is more robust to spatial distortion; and (3) the training process of the proposed method is highly parallel and efficient. Experiments on standard benchmarks demonstrate that it achieves state-of-the-art performance in terms of both efficiency and recognition accuracy.

References

  • [1] P. Shivakumara, S. Bhowmick, B. Su, C. L. Tan, and U. Pal, “A New Gradient Based Character Segmentation Method for Video Text Recognition,” in IEEE International Conference on Document Analysis and Recognition, 2011, pp. 126–130.
  • [2] C.-Y. Lee and S. Osindero, “Recursive recurrent nets with attention modeling for ocr in the wild,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2016, pp. 2231–2239.
  • [3] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 2298–2304, 2015.
  • [4] F. Bai, Z. Cheng, Y. Niu, S. Pu, and S. Zhou, “Edit probability for scene text recognition,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2018, pp. 1508–1516.
  • [5] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, “Aster: An attentional scene text recognizer with flexible rectification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, pp. 2035–2048, 2018.
  • [6] H. Li, P. Wang, C. Shen, and G. Zhang, “Show, attend and read: A simple and strong baseline for irregular text recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8610–8617.
  • [7] F. L. Bookstein, “Principal warps: thin-plate splines and the decomposition of deformations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 6, pp. 567–585, June 1989.
  • [8] Z. Cheng, Y. Xu, F. Bai, Y. Niu, S. Pu, and S. Zhou, “Aon: Towards arbitrarily-oriented text recognition,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2017, pp. 5571–5579.
  • [9] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning, 2015, pp. 2048–2057.
  • [10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
  • [11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [12] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” arXiv preprint arXiv:1906.08237, 2019.
  • [13] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional sequence to sequence learning,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70, 2017, pp. 1243–1252.
  • [14] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2017, pp. 3156–3164.
  • [15] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, “Gcnet: Non-local networks meet squeeze-excitation networks and beyond,” ArXiv, vol. abs/1904.11492, 2019.
  • [16] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, “Relation networks for object detection,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2018, pp. 3588–3597.
  • [17] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
  • [18] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” IEEE International Conference on Computer Vision and Pattern Recognition, pp. 7132–7141, 2017.
  • [19] Q. Ye and D. Doermann, “Text detection and recognition in imagery: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 7, pp. 1480–1500, July 2015.
  • [20] Y. Zhu, C. Yao, and X. Bai, “Scene text detection and recognition: recent advances and future trends,” Frontiers of Computer Science, vol. 10, no. 1, pp. 19–36, Feb 2016.
  • [21] S. Long, X. He, and C. Yao, “Scene text detection and recognition: The deep learning era,” ArXiv, vol. abs/1811.04256, 2018.
  • [22] A. Mishra, K. Alahari, and C. Jawahar, “Enhancing energy minimization framework for scene text recognition with top-down cues,” Computer Vision and Image Understanding, vol. 145, p. 30–42, Apr 2016.
  • [23] Y. Gao, Y. Chen, J. Wang, M. Tang, and H. Lu, “Reading scene text with fully convolutional sequence modeling,” Neurocomputing, vol. 339, pp. 161–170, 2019.
  • [24] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou, “Focusing attention: Towards accurate text recognition in natural images,” in IEEE International Conference on Computer Vision, 2017, pp. 5086–5094.
  • [25] Y. Zhang, S. Nie, W. Liu, X. Xu, D. Zhang, and H. T. Shen, “Sequence-to-sequence domain adaptation network for robust text image recognition,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2019, pp. 2740–2749.
  • [26] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, “Detecting texts of arbitrary orientations in natural images,” in IEEE International Conference on Computer Vision and Pattern Recognition.    IEEE, jun 2012, pp. 1083–1090.
  • [27] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai, “Robust scene text recognition with automatic rectification,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2016, pp. 4168–4176.
  • [28] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Advances in neural information processing systems, 2015, pp. 2017–2025.
  • [29] F. Zhan and S. Lu, “Esir: End-to-end scene text recognition via iterative image rectification,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2019, pp. 2059–2068.
  • [30] C. Luo, L. Jin, and Z. Sun, “Moran: A multi-object rectified attention network for scene text recognition,” Pattern Recognition, vol. 90, pp. 109–118, 2019.
  • [31] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, “Relation networks for object detection,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2017, pp. 3588–3597.
  • [32] F. Sheng, Z. Chen, and B. Xu, “Nrtr: A no-recurrence sequence-to-sequence model for scene text recognition,” ArXiv, vol. abs/1806.00926, 2018.
  • [33] P. Wang, L. Yang, H. Li, Y. Deng, C. Shen, and Y. Zhang, “A simple and robust convolutional-attention network for irregular text recognition,” ArXiv, vol. abs/1904.01375, 2019.
  • [34] J. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” ArXiv, vol. abs/1607.06450, 2016.
  • [35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2015, pp. 770–778.
  • [36] M. Liao, P. Lyu, M. He, C. Yao, W. Wu, and X. Bai, “Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • [37] W. Liu, C. Chen, K.-Y. K. Wong, Z. Su, and J. Han, “Star-net: A spatial attention residue network for scene text recognition,” in The British Machine Vision Conference, 2016, p. 7.
  • [38] J. Wang and X. Hu, “Gated recurrent convolution neural network for ocr,” in Advances in Neural Information Processing Systems, 2017, pp. 335–344.
  • [39] Z. Liu, Y. Li, F. Ren, W. L. Goh, and H. Yu, “Squeezedtext: A real-time scene text recognition by binary convolutional encoder-decoder network,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  • [40] W. Liu, C. Chen, and K.-Y. K. Wong, “Char-net: A character-aware neural network for distorted scene text recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  • [41] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Synthetic data and artificial neural networks for natural scene text recognition,” NIPS Deep Learning Workshop, 2014.
  • [42] A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text localisation in natural images,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2016.
  • [43] A. Mishra, K. Alahari, and C. V. Jawahar, “Top-down and bottom-up cues for scene text recognition,” in IEEE International Conference on Computer Vision and Pattern Recognition, 2012, pp. 2687–2694.
  • [44] K. Wang, B. Babenko, and S. J. Belongie, “End-to-end scene text recognition,” in International Conference on Computer Vision, 2011, pp. 1457–1464.
  • [45] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, K. Ashida, H. Nagai, M. Okamoto, H. Yamamoto, H. Miyao, J. Zhu, W. Ou, C. Wolf, J.-M. Jolion, L. Todoran, M. Worring, and X. Lin, “Icdar 2003 robust reading competitions: entries, results, and future directions,” in International Journal of Document Analysis and Recognition, vol. 7, 2004, pp. 105–122.
  • [46] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. M. Romeu, D. F. Mota, J. Almazán, and L.-P. de las Heras, “Icdar 2013 robust reading competition,” in IEEE International Conference on Document Analysis and Recognition, 2013, pp. 1484–1493.
  • [47] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan, “Recognizing text with perspective distortion in natural scenes,” in IEEE International Conference on Computer Vision, 2013, pp. 569–576.
  • [48] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan, “A robust arbitrary text detection system for natural scene images,” Expert Systems with Applications, vol. 41, pp. 8027–8048, 2014.
  • [49] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” The Computing Research Repository, vol. abs/1412.6980, 2014.
  • [50] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee, “What is wrong with scene text recognition model comparisons? dataset and model analysis,” ArXiv, vol. abs/1904.01906, 2019.