What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis
Many new proposals for scene text recognition (STR) models have been introduced in recent years.
While each claims to have pushed the boundary of the technology, a holistic and fair comparison has been largely missing in the field due to inconsistent choices of training and evaluation datasets.
This paper addresses this difficulty with three major contributions.
First, we examine the inconsistencies of training and evaluation datasets, and the performance gap that results from these inconsistencies.
Second, we introduce a unified four-stage STR framework that most existing STR models fit into. Using this framework allows for the extensive evaluation of previously proposed STR modules and the discovery of previously unexplored module combinations.
Third, we analyze the module-wise contributions to performance in terms of accuracy, speed, and memory demand, under one consistent set of training and evaluation datasets.
Such analyses remove the obstacles that hinder current comparisons, making it possible to understand the performance gains of existing modules.
Our code is publicly available
Reading text in natural scenes, referred to as scene text recognition (STR), has been an important task in a wide range of industrial applications. The maturity of Optical Character Recognition (OCR) systems has led to its successful application on cleaned documents, but most traditional OCR methods have failed to be as effective on STR tasks due to the diverse text appearances that occur in the real world and the imperfect conditions in which these scenes are captured.
To address these challenges, prior works [24, 25, 16, 17, 28, 30, 4, 18, 5, 2, 3, 19] have proposed multi-stage pipelines, where each stage is a deep neural network addressing a specific challenge. For example, Shi et al. have suggested using a recurrent neural network to address the varying number of characters in a given input, and a connectionist temporal classification loss to identify the number of characters. Shi et al. have also proposed a transformation module that normalizes the input into a straight text image to reduce the representational burden for downstream modules to handle curved texts.
However, it is hard to assess whether and how a newly proposed module improves upon the current art, as papers have used different training and evaluation environments, making it difficult to compare reported numbers at face value (Table 1). We observed that both 1) the training datasets and 2) the evaluation datasets deviate among methods. For example, different works use a different subset of the IC13 dataset as part of their evaluation set, which may cause a performance disparity of more than 15%. This kind of discrepancy hinders the fair comparison of performance between different models.
Our paper addresses these types of issues with the following main contributions. First, we analyze all training and evaluation datasets commonly used in STR papers. Our analysis reveals inconsistencies in how the STR datasets are used, and their causes. For instance, we found 7 missing examples in the IC03 dataset and 158 missing examples in the IC13 dataset. We investigate several previous works on the STR datasets and show that the inconsistency causes incomparable results, as shown in Table 1. Second, we introduce a unifying framework for STR that provides a common perspective for existing methods. Specifically, we divide the STR model into four consecutive stages of operations: transformation (Trans.), feature extraction (Feat.), sequence modeling (Seq.), and prediction (Pred.). The framework covers not only existing methods but also their possible variants, enabling an extensive analysis of module-wise contributions. Finally, we study the module-wise contributions in terms of accuracy, speed, and memory demand, under a unified experimental setting. With this study, we assess the contribution of individual modules more rigorously and propose previously overlooked module combinations that improve over the state of the art. Furthermore, we analyze failure cases on the benchmark datasets to identify remaining challenges in STR.
2 Dataset Matters in STR
In this section, we examine the different training and evaluation datasets used by prior works, and then address their discrepancies. Through this analysis, we highlight how each of the works differs in constructing and using its datasets, and investigate the bias caused by the inconsistency when comparing performance between different works (Table 1). The performance gaps due to dataset inconsistencies are measured through experiments and discussed in §4.
2.1 Synthetic datasets for training
When training an STR model, labeling scene text images is costly, and thus it is difficult to obtain enough labeled data. Instead of using real data, most STR models have used synthetic datasets for training. We first introduce the two most popular synthetic datasets used in recent STR papers:
MJSynth (MJ)  is a synthetic dataset designed for STR, containing 8.9 M word box images. The word box generation process is as follows: 1) font rendering, 2) border and shadow rendering, 3) background coloring, 4) composition of font, border, and background, 5) applying projective distortions, 6) blending with real-world images, and 7) adding noise. Figure 0(a) shows some examples of MJSynth,
SynthText (ST) is another synthetically generated dataset and was originally designed for scene text detection. An example of how the words are rendered onto scene images is shown in Figure 0(b). Even though SynthText was designed for the scene text detection task, it has also been used for STR by cropping word boxes. SynthText has 5.5 M training examples once the word boxes are cropped and filtered for non-alphanumeric characters.
Note that prior works have used diverse combinations of MJ, ST, and/or other sources (Table 1). These inconsistencies call into question whether the improvements are due to the contribution of the proposed module or to that of better or larger training data. Our experiment in §4.2 describes the influence of the training datasets on the final performance on the benchmarks. We further suggest that future STR research clearly indicate the training datasets used and compare models using the same training set.
2.2 Real-world datasets for evaluation
Seven real-world STR datasets have been widely used for evaluating a trained STR model. For some benchmark datasets, different subsets of the dataset may have been used in each prior work for evaluation (Table 1). These differences in subsets result in inconsistent comparisons.
We introduce the datasets by categorizing them into regular and irregular datasets. The benchmark datasets are given the distinction of being “regular” or “irregular” datasets [25, 30, 5], according to the difficulty and geometric layout of the texts. First, regular datasets contain text images with horizontally laid out characters that have even spacing between them. These represent relatively easy cases for STR:
IIIT5K-Words (IIIT)  is the dataset crawled from Google image searches, with query words that are likely to return text images, such as “billboards”, “signboard”, “house numbers”, “house name plates”, and “movie posters”. IIIT consists of 2,000 images for training and 3,000 images for evaluation,
Street View Text (SVT)  contains outdoor street images collected from Google Street View. Some of these images are noisy, blurry, or of low-resolution. SVT consists of 257 images for training and 647 images for evaluation,
ICDAR2003 (IC03)  was created for the ICDAR 2003 Robust Reading competition for reading camera-captured scene texts. It contains 1,156 images for training and 1,110 images for evaluation. Ignoring all words that are either too short (less than 3 characters) or ones that contain non-alphanumeric characters reduces 1,110 images to 867. However, researchers have used two different versions of the dataset for evaluation: versions with 860 and 867 images. The 860-image dataset is missing 7 word boxes compared to the 867 dataset. The omitted word boxes can be found in the supplementary materials,
ICDAR2013 (IC13)  inherits most of IC03’s images and was also created for the ICDAR 2013 Robust Reading competition. It contains 848 images for training and 1,095 images for evaluation, where pruning words with non-alphanumeric characters results in 1,015 images. Again, researchers have used two different versions for evaluation: 857 and 1,015 images. The 857-image set is a subset of the 1,015 set where words shorter than 3 characters are pruned.
ICDAR2015 (IC15)  was created for the ICDAR 2015 Robust Reading competitions and contains 4,468 images for training and 2,077 images for evaluation. The images are captured by Google Glasses while under the natural movements of the wearer. Thus, many are noisy, blurry, and rotated, and some are also of low resolution. Again, researchers have used two different versions for evaluation: 1,811 and 2,077 images. Previous papers [4, 2] have only used 1,811 images, discarding non-alphanumeric character images and some extremely rotated, perspective-shifted, and curved images for evaluation. Some of the discarded word boxes can be found in the supplementary materials,
SVT Perspective (SP)  is collected from Google Street View and contains 645 images for evaluation. Many of the images contain perspective projections due to the prevalence of non-frontal viewpoints,
CUTE80 (CT)  is collected from natural scenes and contains 288 cropped images for evaluation. Many of these are curved text images.
Note that Table 1 reveals a critical issue: prior works evaluated their models on different benchmark datasets. Specifically, the evaluation has been conducted on different versions of the benchmarks for IC03, IC13, and IC15. In IC03, 7 examples can cause a performance gap of 0.8%, which is a huge gap when comparing the performances of prior works. In the case of IC13 and IC15, the gap in the number of examples is even bigger than that of IC03.
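The filtering rules behind these benchmark versions can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the preprocessing used by any of the cited works: the helper name `keep_word` and the sample labels are hypothetical, and "alphanumeric" is assumed to mean ASCII letters and digits.

```python
def keep_word(label: str, min_len: int = 3) -> bool:
    """Return True if a word label survives the filtering described in the text.

    Assumption: "alphanumeric" means ASCII letters and digits only.
    """
    return len(label) >= min_len and label.isascii() and label.isalnum()


# Hypothetical labels, for illustration only.
labels = ["EXIT", "G20", "to", "Coca-Cola", "HOUSE", "50%"]
kept = [w for w in labels if keep_word(w)]
print(kept)  # ['EXIT', 'G20', 'HOUSE']

# Share of the 867-image IC03 set accounted for by the 7 missing word boxes.
print(round(100 * 7 / 867, 1))  # 0.8
```

Setting `min_len=1` keeps short words and prunes only non-alphanumeric ones, which corresponds to the rule described for the 1,015-image IC13 version.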
3 STR Framework Analysis
The goal of this section is to introduce the scene text recognition (STR) framework consisting of four stages, derived from commonalities among independently proposed STR models. After that, we describe the module options in each stage.
Due to the resemblance of STR to computer vision tasks (e.g., object detection) and sequence prediction tasks, STR has benefited from high-performance convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The first combined application of CNN and RNN for STR, Convolutional-Recurrent Neural Network (CRNN), extracts CNN features from the input text image, and re-configures them with an RNN for robust sequence prediction. After CRNN, multiple variants [25, 16, 17, 18, 28, 4, 3] have been proposed to improve performance. For rectifying arbitrary text geometries, as an example, transformation modules have been proposed to normalize text images [25, 17, 18]. For treating complex text images with high intrinsic dimensionality and latent factors (e.g., font style and cluttered background), improved CNN feature extractors have been incorporated [16, 28, 4]. Also, as people have become more concerned with inference time, some methods have even omitted the RNN stage. For improving character sequence prediction, attention-based decoders have been proposed [16, 25].
The four stages derived from existing STR models are as follows:
Transformation (Trans.) normalizes the input text image using the Spatial Transformer Network (STN) to ease downstream stages.
Feature extraction (Feat.) maps the input image to a representation that focuses on the attributes relevant for character recognition, while suppressing irrelevant features such as font, color, size, and background.
Sequence modeling (Seq.) captures the contextual information within a sequence of characters for the next stage to predict each character more robustly, rather than doing it independently.
Prediction (Pred.) estimates the output character sequence from the identified features of an image.
Figure 3 provides an overview, and all the architectures we used in this paper can be found in the supplementary materials.
3.1 Transformation stage
The module of this stage transforms the input image into a normalized image. Text images in natural scenes come in diverse shapes, such as curved and tilted texts. If such input images are fed unaltered, the subsequent feature extraction stage needs to learn an invariant representation with respect to such geometry. To reduce this burden, thin-plate spline (TPS) transformation, a variant of the spatial transformer network (STN), has been applied for its flexibility with respect to the diverse aspect ratios of text lines [25, 17]. TPS employs a smooth spline interpolation between a set of fiducial points. More precisely, TPS finds multiple fiducial points (green ’+’ marks in Figure 3) at the upper and bottom enveloping points, and normalizes the character region to a predefined rectangle. Our framework allows for the selection or de-selection of TPS.
3.2 Feature extraction stage
In this stage, a CNN abstracts an input image (i.e., the original or the normalized image) and outputs a visual feature map. Each column of the resulting feature map has a corresponding distinguishable receptive field along the horizontal line of the input image. These features are used to estimate the character on each receptive field.
We study three architectures, VGG, RCNN, and ResNet, previously used as feature extractors for STR. VGG in its original form consists of multiple convolutional layers followed by a few fully connected layers. RCNN is a variant of CNN that can be applied recursively to adjust its receptive fields depending on the character shapes [16, 28]. ResNet is a CNN with residual connections that eases the training of relatively deeper CNNs.
3.3 Sequence modeling stage
The extracted features from the Feat. stage are reshaped into a sequence of features. That is, each column in a feature map is used as a frame of the sequence. However, this sequence may suffer from a lack of contextual information. Therefore, some previous works use a Bidirectional LSTM (BiLSTM) to build a better sequence after the feature extraction stage [24, 25, 4]. On the other hand, Rosetta removed the BiLSTM to reduce computational complexity and memory consumption. Our framework allows for the selection or de-selection of BiLSTM.
3.4 Prediction stage
In this stage, a module predicts a sequence of characters from the input sequence of features. Summarizing previous works, we have two options for prediction: (1) connectionist temporal classification (CTC) and (2) attention-based sequence prediction (Attn) [25, 4]. CTC allows for the prediction of a sequence of non-fixed length even though a fixed number of features is given. The key steps of CTC are to predict a character at each column and to turn the full character sequence into a non-fixed stream of characters by deleting repeated characters and blanks [6, 24]. On the other hand, Attn automatically captures the information flow within the input sequence to predict the output sequence. It enables an STR model to learn a character-level language model representing output class dependencies.
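The collapsing rule that CTC applies at inference time can be sketched as follows. This is a minimal illustration of greedy (best-path) decoding only, not of the CTC training loss; the function name `ctc_collapse`, the use of "-" as the blank symbol, and the input strings are assumptions made for this example.

```python
from itertools import groupby

BLANK = "-"  # hypothetical symbol standing in for the CTC blank token


def ctc_collapse(per_column_chars: str, blank: str = BLANK) -> str:
    """Map a fixed-length per-column prediction to a variable-length string.

    Step 1: merge runs of identical consecutive symbols.
    Step 2: drop the blank symbol.
    """
    merged = [symbol for symbol, _ in groupby(per_column_chars)]
    return "".join(symbol for symbol in merged if symbol != blank)


print(ctc_collapse("hh-e-ll-lo--"))  # hello
print(ctc_collapse("--e-xx-iit-"))   # exit
```

The blank between the two "l" runs in the first input is what allows a doubled letter to survive the merge step.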
4 Experiment and Analysis
This section contains the evaluation and analysis of all possible STR module combinations (2 × 3 × 2 × 2 = 24 in total) from the four-stage framework in §3, all evaluated under the common training and evaluation dataset constructed from the datasets listed in §2.
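The count of module combinations follows directly from the options available at each stage of the framework in §3. The sketch below enumerates them; the dictionary name `STAGES` and the hyphenated naming scheme (mirroring names such as None-VGG-None-CTC used in §4.4) are choices made for this illustration.

```python
from itertools import product

# Module options per stage, as listed in Section 3.
STAGES = {
    "Trans": ["None", "TPS"],
    "Feat": ["VGG", "RCNN", "ResNet"],
    "Seq": ["None", "BiLSTM"],
    "Pred": ["CTC", "Attn"],
}

combinations = ["-".join(choice) for choice in product(*STAGES.values())]

print(len(combinations))  # 24
print(combinations[0])    # None-VGG-None-CTC
print(combinations[-1])   # TPS-ResNet-BiLSTM-Attn
```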
4.1 Implementation detail
As we described in §2, training and evaluation datasets influence the measured performances of STR models significantly. To conduct a fair comparison, we have fixed the choice of training, validation, and evaluation datasets.
STR training and model selection We use a union of MJSynth 8.9 M and SynthText 5.5 M (14.4 M in total) as our training data. We adopt the AdaDelta optimizer, whose decay rate is set to . The training batch size is , and the number of iterations is 300 K. Gradient clipping is used at magnitude 5. All parameters are initialized with He’s method. We use the union of the training sets of IC13, IC15, IIIT, and SVT as the validation data, and validate the model after every 2000 training steps to select the model with the highest accuracy on this set. Note that the validation set does not contain the IC03 training data because some of it is duplicated in the evaluation dataset of IC13. The total number of duplicated scene images is 34, and they contain 215 word boxes. Duplicated examples can be found in the supplementary materials.
Evaluation metrics In this paper, we provide a thorough analysis of STR combinations in terms of accuracy, time, and memory together. For accuracy, we measure the success rate of word predictions per image on the 9 real-world evaluation datasets involving all subsets of the benchmarks, as well as a unified evaluation dataset (8,539 images in total): 3,000 from IIIT, 647 from SVT, 867 from IC03, 1,015 from IC13, 2,077 from IC15, 645 from SP, and 288 from CT. We only evaluate on letters and digits. For each STR combination, we have run five trials with different initialization random seeds and have averaged their accuracies. For speed assessment, we measure the per-image average clock time (in milliseconds) for recognizing the given texts under the same compute environment, detailed below. For memory assessment, we count the number of trainable floating point parameters in the entire STR pipeline.
Environment: For a fair speed comparison, all of our evaluations are performed on the same environment: an Intel Xeon(R) E5-2630 v4 2.20GHz CPU, an NVIDIA TESLA P40 GPU, and 252GB of RAM. All experiments are performed with the NAVER Smart Machine Learning (NSML) platform.
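The accuracy metric can be sketched as a word-level exact match after restricting both prediction and label to letters and digits. This is an illustration under stated assumptions rather than the evaluation script itself: the function names are hypothetical, the sample predictions are invented, and case-insensitive matching is an assumption that the text does not state explicitly.

```python
import re


def normalize(text: str) -> str:
    """Keep only letters and digits; lower-casing is an assumption of this sketch."""
    return re.sub(r"[^0-9a-z]", "", text.lower())


def word_accuracy(predictions, labels) -> float:
    """Percentage of images whose whole predicted word matches the label."""
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Sizes of the seven subsets that make up the unified evaluation set.
subset_sizes = {"IIIT": 3000, "SVT": 647, "IC03": 867, "IC13": 1015,
                "IC15": 2077, "SP": 645, "CT": 288}
print(sum(subset_sizes.values()))  # 8539

# Hypothetical predictions from one trial.
preds = ["exit", "Hard", "polce", "G20"]
truth = ["EXIT", "Hard", "POLICE", "G20"]
print(word_accuracy(preds, truth))  # 75.0
```

A reported number would then be the mean of `word_accuracy` over the five trials with different random seeds.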
4.2 Analysis on training datasets
We investigate the influence of using different groups of training datasets on the performance on the benchmarks. As we mentioned in §2.1, prior works used different sets of training datasets and left uncertainties as to the contributions of their models to improvements. To unpack this issue, we examined the accuracy of our best model from §4.3 with different settings of the training dataset. We obtained 80.0% total accuracy by using only MJSynth, 75.6% by using only SynthText, and 84.1% by using both. The combination of MJSynth and SynthText improved accuracy by at least 4.1 percentage points over the individual usages of MJSynth and SynthText. A lesson from this study is that the performance results using different training datasets are incomparable, and such comparisons fail to prove the contribution of the model, which is why we trained all models with the same training dataset, unless mentioned otherwise.
Interestingly, training on 20% of MJSynth (1.8M) and 20% of SynthText (1.1M) together (2.9M in total, roughly half the size of SynthText) provides 81.3% accuracy, which is better than the individual usages of MJSynth or SynthText. MJSynth and SynthText have different properties because they were generated with different options, such as distortion and blur. This result shows that the diversity of training data can be more important than the number of training examples, and that the effects of using different training datasets are more complex than simply concluding that more is better.
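The subset sizes quoted above follow from the dataset sizes given in §2.1. The short check below reproduces the arithmetic; the variable names are chosen for this illustration.

```python
MJ_SIZE = 8_900_000  # MJSynth word boxes
ST_SIZE = 5_500_000  # SynthText word boxes after cropping and filtering

mj_subset = MJ_SIZE * 20 // 100  # 20% of MJSynth
st_subset = ST_SIZE * 20 // 100  # 20% of SynthText
combined = mj_subset + st_subset

print(mj_subset, st_subset, combined)  # 1780000 1100000 2880000
print(f"{combined / ST_SIZE:.2f}")     # 0.52
```

The combined subset is therefore about 52% of the size of SynthText alone, yet it yields higher accuracy than either full dataset used individually.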
4.3 Analysis of trade-offs for module combinations
Here, we focus on the accuracy-speed and accuracy-memory trade-offs shown in different combinations of modules. We provide the full table of results in the supplementary materials. See Figure 4 for the trade-off plots of all 24 combinations, including the six previously proposed STR models (stars in Figure 4). In terms of the accuracy-time trade-off, Rosetta and STAR-net are on the frontier and the other four prior models are inside of the frontier. In terms of the accuracy-memory trade-off, R2AM is on the frontier and the other five previously proposed models are inside of the frontier. Module combinations along the trade-off frontiers are labeled in ascending order of accuracy (T1 to T5 for accuracy-time and P1 to P5 for accuracy-memory).
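A trade-off frontier of this kind consists of the combinations that no other combination beats on both cost and accuracy. The sketch below shows one way to extract such a frontier; the paper does not describe how its frontiers were computed, so the function `trade_off_frontier` and the sample points are hypothetical.

```python
def trade_off_frontier(points):
    """Return the names of points not dominated by any other point.

    Each point is (name, cost, accuracy); lower cost and higher accuracy are better.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Sweep by increasing cost; on cost ties, visit the more accurate point first.
    for name, cost, accuracy in sorted(points, key=lambda p: (p[1], -p[2])):
        if accuracy > best_accuracy:
            frontier.append(name)
            best_accuracy = accuracy
    return frontier


# Hypothetical (name, time in ms, accuracy in %) triples, for illustration only.
models = [("A", 1.0, 70.0), ("B", 4.0, 68.0), ("C", 5.0, 80.0),
          ("D", 9.0, 79.0), ("E", 20.0, 84.0)]
print(trade_off_frontier(models))  # ['A', 'C', 'E']
```

Because the sweep runs in order of increasing cost, the returned names are also in ascending order of accuracy, matching the way the frontier combinations are labeled.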
Analysis of combinations along the trade-off frontiers.
As shown in Table 3(a), T1 takes the minimum time by not including any transformation or sequential module. Moving from T1 to T5, the following modules are introduced in order (indicated in bold): ResNet, BiLSTM, TPS, and Attn. Note that from T1 to T5, a single module changes at a time. Our framework provides a smooth shift of methods that gives the least performance trade-off depending on the application scenario. They sequentially increase the complexity of the overall STR model, resulting in increased performance at the cost of computational efficiency. ResNet, BiLSTM, and TPS introduce a relatively moderate overall slowdown (1.3 ms → 10.9 ms), while greatly boosting accuracy (69.5% → 82.9%). The final change, Attn, on the other hand, only improves the accuracy by 1.1% at a huge cost in efficiency (27.6 ms).
As for the accuracy-memory trade-offs shown in Table 3(b), P1 is the model with the least amount of memory consumption, and from P1 to P5 the trade-off between memory and accuracy takes place. As in the accuracy-speed trade-off, we observe a single module shift at each step up to P5, where the changed modules are: Attn, TPS, BiLSTM, and ResNet. They sequentially increase the accuracy at the cost of memory. Compared to VGG used in T1, we observe that RCNN in P1 to P4 is lighter and gives a good accuracy-memory trade-off. RCNN requires a small number of unique CNN layers that are repeatedly applied. We observe that transformation, sequential, and prediction modules do not contribute significantly to the memory consumption (1.9M → 7.2M parameters). While being lightweight overall, these modules provide accuracy improvements (75.4% → 82.3%). The final change, ResNet, on the other hand, increases the accuracy by 1.7% at the cost of increased memory consumption from 7.2M to 49.6M floating point parameters. Thus, a practitioner concerned about memory consumption can choose specialized transformation, sequential, and prediction modules relatively freely, but should refrain from the use of heavy feature extractors like ResNets.
The most important modules for speed and memory.
We have identified the module-wise impact on speed and memory by color-coding the scatter plots in Figure 4 according to module choices. The full set of color-coded plots is in the supplementary materials. Here, we show the scatter plots with the most speed- and memory-critical modules, namely the prediction and feature extraction modules, respectively, in Figure 5.
There are clear clusters of combinations according to the prediction and feature modules. In the accuracy-speed trade-off, we identify CTC and Attn clusters (the addition of Attn significantly slows the overall STR model). On the other hand, for the accuracy-memory trade-off, we observe that the feature extractor contributes towards memory most significantly. It is important to recognize that the most significant modules for each criterion differ; therefore, practitioners under different application scenarios and constraints should look into different module combinations for the best trade-offs depending on their needs.
4.4 Module analysis
Here, we investigate the module-wise performances in terms of accuracy, speed, and memory demand. For this analysis, the marginalized accuracy of each module is calculated by averaging over the combinations that include the module, as reported in Table 2. Upgrading a module at each stage requires additional resources, time or memory, but provides performance improvements. The table shows that the performance improvement on irregular datasets is about two times that on regular benchmarks over all stages. When comparing accuracy improvement versus time usage, a sequence of ResNet, BiLSTM, TPS, and Attn is the most efficient upgrade order of the modules from a base combination of None-VGG-None-CTC. This order is the same as the order of combinations for the accuracy-time frontier (T1 → T5). On the other hand, an accuracy-memory perspective finds RCNN, Attn, TPS, BiLSTM, and ResNet as the most efficient upgrade order for the modules, matching the order of the accuracy-memory frontier (P1 → P5). Interestingly, the efficient order of modules for time is the reverse of that for memory. The different properties of modules provide different choices in practical applications. In addition, the module ranks in the two perspectives are the same as the order of the frontier module changes, which shows that each module contributes to the performance similarly under all combinations.
Qualitative analysis Each module contributes to identifying text by solving targeted difficulties of STR tasks, as described in §3. Figure 7 shows samples that are only correctly recognized when certain modules are upgraded (e.g., from a VGG to a ResNet backbone). Each row shows a module upgrade at each stage of our framework. The presented samples fail before the upgrade, but become recognizable afterward. TPS transformation normalizes curved and perspective texts into a standardized view. Predicted results show dramatic improvements, especially for “POLICE” in a circled brand logo and “AIRWAYS” in a perspective view of a storefront sign. The advanced feature extractor, ResNet, results in better representation power, improving on cases with heavy background clutter (“YMCA”, “CITYARTS”) and unseen fonts (“NEUMOS”). BiLSTM leads to better context modeling by adjusting the receptive field; it can ignore unrelatedly cropped characters (“I” at the end of “EXIT”, “C” at the end of “G20”). Attention, including implicit character-level language modeling, finds missing or occluded characters, such as “a” in “Hard”, “t” in “to”, and “S” in “HOUSE”. These examples provide glimpses of the contribution points of the modules in real-world applications.
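The marginalization described above amounts to grouping combinations by the module they use at a given stage and averaging their accuracies. The following sketch illustrates this on a toy subset; the function `marginal_accuracy` is hypothetical and the accuracies are invented for the example, not taken from the paper.

```python
from collections import defaultdict


def marginal_accuracy(results):
    """Average accuracy over all combinations that contain each module.

    `results` maps (trans, feat, seq, pred) tuples to an accuracy in percent.
    """
    stage_names = ("Trans", "Feat", "Seq", "Pred")
    buckets = defaultdict(list)
    for combo, accuracy in results.items():
        for stage, module in zip(stage_names, combo):
            buckets[(stage, module)].append(accuracy)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


# Hypothetical accuracies for four combinations (not the paper's numbers).
toy = {
    ("None", "VGG", "None", "CTC"): 70.0,
    ("None", "VGG", "None", "Attn"): 74.0,
    ("TPS", "VGG", "None", "CTC"): 76.0,
    ("TPS", "VGG", "None", "Attn"): 80.0,
}
m = marginal_accuracy(toy)
print(m[("Trans", "TPS")] - m[("Trans", "None")])  # 6.0
print(m[("Pred", "Attn")] - m[("Pred", "CTC")])    # 4.0
```

Keying each bucket by both stage and module keeps "None" at the transformation stage separate from "None" at the sequence modeling stage.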
4.5 Failure case analysis
We investigate failure cases of all 24 combinations. As our framework is derived from commonalities among proposed STR models, and our best model showed competitive performance with previously proposed STR models, the presented failure cases constitute a common challenge for the field as a whole. We hope our analysis inspires future works in STR to consider addressing those challenges.
Among 8,539 examples in the benchmark datasets (§2), 644 images (7.5%) are not correctly recognized by any of the 24 models considered. We found six common failure cases, as shown in Figure 6. The following is a discussion of the challenges in these cases, with suggestions for future research directions.
Calligraphic fonts: font styles for brands, such as “Coca Cola”, or shop names on streets, such as “Cafe”, remain challenging. Such diverse expression of characters requires a novel feature extractor providing generalized visual features. Another possible approach is regularization because the model might be over-fitting to the font styles in a training dataset.
Vertical texts: most current STR models assume horizontal text images, and thus structurally cannot deal with vertical texts. Some STR models [30, 5] also exploit vertical information; however, vertical texts are not clearly covered yet. Further research would be needed to cover vertical texts.
Special characters: since current benchmarks do not evaluate special characters, existing works exclude them during training. This results in failed predictions, as the model is misled into treating them as alphanumeric characters. We suggest training with special characters. This has resulted in a boost from 87.9% to 90.3% accuracy on IIIT.
Heavy occlusions: current methods do not extensively exploit contextual information to overcome occlusion. Future research may consider superior language models to maximally utilize context.
Low resolution: existing models do not explicitly handle low resolution cases; image pyramids or super-resolution modules may improve performance.
Label noise: we found some noisy (incorrect) labels in the failure examples. We examined all examples in the benchmark to identify the ratio of the noisy labels. All benchmark datasets contain noisy labels: the ratio of mislabeling without considering special characters was 1.3%, mislabeling when considering special characters was 6.1%, and mislabeling when considering case-sensitivity was 24.1%.
We make all failure cases available in our Github repository, hoping that they will inspire further research on corner cases of the STR problem.
While there have been great advances in novel scene text recognition (STR) models, they have been compared on inconsistent benchmarks, leading to difficulties in determining whether and how a proposed module improves the STR baseline model. This work analyzes the contributions of existing STR models, which were previously obscured by inconsistent experimental settings. To achieve this goal, we have introduced a common framework among key STR methods, as well as consistent datasets: seven benchmark evaluation datasets and two training datasets (MJ and ST). We have provided a fair comparison among the key STR methods, and have analyzed which modules bring about the greatest accuracy, speed, and size gains. We have also provided extensive analyses of module-wise contributions to typical challenges in STR as well as the remaining failure cases.
The authors would like to thank Jaeheung Surh for helpful discussions.
Appendix A Contents
Appendix B : Dataset Matters in STR - examples
Appendix C : STR Framework - verification
Appendix D : STR Framework - architectural details
We describe the architectural details of all modules in our framework described in §3.
Appendix E : STR Framework - full experimental results
We provide the comprehensive results of our experiments described in §4, and discuss them in detail.
Appendix B Dataset Matters in STR - examples
IC03 - 7 missing word boxes in 860 evaluation dataset. The original IC03 evaluation dataset has 1,110 images, but prior works have conducted additional filtering, as described in §2. All papers have ignored all words that are either too short (less than 3 characters) or ones that contain non-alphanumeric characters. Although all papers have supposedly applied the same data filtering method and should have reduced the evaluation set from 1,110 images to 867 images, the reported example numbers are different: either the expected 867 images or a further reduced 860 images. We identified the missing examples as shown in Figure 8.
IC15 - Filtered examples in evaluation dataset. The IC15 dataset originally contains 2,077 examples for its evaluation set; however, prior works [4, 2] have filtered it down to 1,811 examples and have not given unambiguous specifications for deciding which examples to discard. To resolve this ambiguity, we have contacted one of the authors, who shared the specific dataset used for the evaluation. This information is made available with the source code on Github. A few sample images that have been filtered out of the IC15 evaluation dataset are shown in Figure 9.
IC03 and IC13 - Duplicated images between IC03 training dataset and IC13 evaluation dataset. Figure 10 shows two images from the subset given by the intersection between the IC03 training dataset and the IC13 evaluation dataset. In our investigation, a total of 34 duplicated scene images have been found, amounting to 215 duplicate word boxes. Therefore, when assessing the performance of a model on the IC13 evaluation data, one should be mindful of these overlapping data.
Appendix C STR Framework - verification
To show the correctness of the modules implemented for our framework, we reproduce the performances of existing models that can be re-built by our framework. Specifically, we compare the results of our implementation of CRNN, RARE, GRCNN, and FAN (w/o Focus Net) [24, 25, 28, 4] with those publicly reported by the authors. We implemented each module as described in the original papers, and we also followed the training and evaluation pipelines of the original papers to train the individual models. Table 3 shows the results. Our implementation has overall similar performance to the results reported in the original papers, which verifies the sanity of our implementations and experiments.
Appendix D STR Framework - architectural details
In this appendix, we describe each module of our framework in terms of its concept and architectural specifications. We first introduce common notations used in this appendix and then explain the modules of each stage: Trans., Feat., Seq., and Pred.
Notations To describe neural network architectures concisely, we use ‘c’, ‘k’, ‘s’, and ‘p’ to denote the number of output channels, the kernel size, the stride, and the padding size, respectively. BN, Pool, and FC denote the batch normalization layer, the max pooling layer, and the fully connected layer, respectively. For a convolution with a stride of 1 and a padding size of 1, ‘s’ and ‘p’ are omitted for convenience.
D.1 Transformation stage
The module of this stage transforms the input image $X$ into the normalized image $\tilde{X}$. We explained the concept of TPS [25, 17] in §3.1; here we give its mathematical background and implementation details.
TPS transformation: TPS generates a normalized image that shows a focused region of an input image. To build this pipeline, TPS consists of a sequence of processes: finding a text boundary, linking the locations of the pixels in the boundary to those of the normalized image, and generating a normalized image from the pixel values and the linking information. These processes are called the localization network, the grid generator, and the image sampler, respectively. Conceptually, TPS employs a smooth spline interpolation between a set of $F$ fiducial points that represent a focused boundary of text in an image. Here, $F$ indicates the constant number of fiducial points.
The localization network explicitly calculates the $x,y$-coordinates of the $F$ fiducial points on an input image, $X$. The coordinates are denoted by $C = [c_1, \ldots, c_F] \in \mathbb{R}^{2 \times F}$, whose $f$-th column $c_f = [x_f, y_f]^\top$ contains the coordinates of the $f$-th fiducial point. $\tilde{C}$ represents pre-defined top and bottom locations on the normalized image, $\tilde{X}$.
The grid generator provides a mapping function from the regions identified by the localization network to the normalized image. The mapping function can be parameterized by a matrix $T \in \mathbb{R}^{2 \times (F+3)}$, which is computed by
$$T = \left( \Delta_{\tilde{C}}^{-1} \begin{bmatrix} C^\top \\ 0^{3 \times 2} \end{bmatrix} \right)^{\top} \tag{1}$$
where $\Delta_{\tilde{C}} \in \mathbb{R}^{(F+3) \times (F+3)}$ is a matrix determined only by $\tilde{C}$, thus also a constant:
$$\Delta_{\tilde{C}} = \begin{bmatrix} 1^{F \times 1} & \tilde{C}^\top & R \\ 0 & 0 & 1^{1 \times F} \\ 0 & 0 & \tilde{C} \end{bmatrix} \tag{2}$$
where the element in the $i$-th row and $j$-th column of $R$ is $d_{ij}^2 \ln d_{ij}$, and $d_{ij}$ is the Euclidean distance between $\tilde{c}_i$ and $\tilde{c}_j$. The pixels of the grid on the normalized image $\tilde{X}$ are denoted by $\tilde{P} = \{\tilde{p}_i\}_{i=1,\ldots,N}$, where $\tilde{p}_i = [\tilde{x}_i, \tilde{y}_i]^\top$ is the $x,y$-coordinates of the $i$-th pixel and $N$ is the number of pixels. For every pixel $\tilde{p}_i$ on $\tilde{X}$, we find the corresponding point $p_i = [x_i, y_i]^\top$ on $X$ by applying the transformation:
$$p_i = T \, [1, \tilde{x}_i, \tilde{y}_i, r_{i1}, \ldots, r_{iF}]^\top \tag{3}$$
$$r_{if} = d_{if}^2 \ln d_{if} \tag{4}$$
where $d_{if}$ is the Euclidean distance between pixel $\tilde{p}_i$ and the $f$-th base fiducial point $\tilde{c}_f$. By iterating Eq. 3 over all points in $\tilde{P}$, we generate a grid $P = \{p_i\}_{i=1,\ldots,N}$ on the input image $X$.
Finally, the image sampler produces the normalized image by interpolating the pixels of the input image at the locations determined by the grid generator.
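The grid generator can be illustrated with a minimal NumPy sketch. It assumes the standard thin-plate-spline kernel $d^2 \ln d$ and fiducial points stored as 2 × F arrays; the function names (`base_fiducials`, `tps_matrix`, `tps_grid`) and the normalized coordinate range [-1, 1] are assumptions made for illustration, not details taken from our released code.

```python
import numpy as np


def radial(d):
    """Thin-plate kernel d^2 ln d, using the limit value 0 at d = 0."""
    safe = np.where(d > 0, d, 1.0)  # ln(1) = 0, so zero distances yield 0
    return d ** 2 * np.log(safe)


def pairwise_dist(a, b):
    """Euclidean distances between the columns of a (2 x m) and b (2 x n)."""
    diff = a[:, :, None] - b[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))


def base_fiducials(F):
    """Assumed layout: F/2 points on the top edge and F/2 on the bottom edge."""
    xs = np.linspace(-1.0, 1.0, F // 2)
    top = np.stack([xs, -np.ones_like(xs)])
    bottom = np.stack([xs, np.ones_like(xs)])
    return np.concatenate([top, bottom], axis=1)  # 2 x F


def tps_matrix(C, C_base):
    """Solve the (F+3) x (F+3) linear system for the 2 x (F+3) matrix T."""
    F = C.shape[1]
    delta = np.zeros((F + 3, F + 3))
    delta[:F, 0] = 1.0
    delta[:F, 1:3] = C_base.T
    delta[:F, 3:] = radial(pairwise_dist(C_base, C_base))
    delta[F, 3:] = 1.0
    delta[F + 1:, 3:] = C_base
    rhs = np.vstack([C.T, np.zeros((3, 2))])
    return np.linalg.solve(delta, rhs).T


def tps_grid(T, C_base, P_base):
    """Map normalized-image points P_base (2 x N) to input-image points."""
    N = P_base.shape[1]
    r = radial(pairwise_dist(P_base, C_base))           # N x F
    lifted = np.vstack([np.ones((1, N)), P_base, r.T])  # (F+3) x N
    return T @ lifted                                   # 2 x N
```

A useful sanity check on such a sketch is that the base fiducial points are mapped exactly onto the predicted ones, and that predicting the base points themselves yields the identity mapping.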
TPS-Implementation: TPS requires a localization network that calculates the fiducial points of an input image. We designed the localization network following most of the components of prior work [25], and added batch normalization layers and adaptive average pooling to stabilize training. Table 4 shows the details of our architecture. In our implementation, the localization network has 4 convolutional layers, each followed by a batch normalization layer and a 2 × 2 max-pooling layer. The filter size, padding size, and stride are 3, 1, and 1, respectively, for all convolutional layers. The last convolutional layer is followed by an adaptive average pooling layer (APool in Table 4). After that come two fully connected layers: 512 to 256 and 256 to 2F. The final output is a 2F-dimensional vector that corresponds to the $x,y$-coordinates of the $F$ fiducial points on the input image. The activation function for all layers is ReLU.
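To make the layer description concrete, the spatial sizes through the convolution and pooling blocks can be traced with the usual output-size formula, floor((n + 2p - k) / s) + 1. The sketch below follows the text literally (a pooling layer after each of the 4 convolutions); the 32 × 100 input size and the pooling stride of 2 are assumptions for illustration, and Table 4 remains the authoritative specification.

```python
def conv_out(size, k=3, s=1, p=1):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * p - k) // s + 1


def pool_out(size, k=2, s=2):
    """Output length of a max-pooling layer (stride 2 is an assumption)."""
    return (size - k) // s + 1


def localization_shapes(height, width, num_blocks=4):
    """Trace (height, width) after each conv + pool block."""
    shapes = [(height, width)]
    for _ in range(num_blocks):
        height, width = conv_out(height), conv_out(width)
        height, width = pool_out(height), pool_out(width)
        shapes.append((height, width))
    return shapes


# Assumed 32 x 100 input; the 3x3 convolutions preserve the size,
# so only the pooling layers shrink it.
print(localization_shapes(32, 100))
# [(32, 100), (16, 50), (8, 25), (4, 12), (2, 6)]
```

Because the adaptive average pooling layer collapses whatever spatial size remains, the input width of the first fully connected layer does not depend on the input resolution.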
D.2 Feature extraction stage
In this stage, a CNN abstracts an input image (i.e., $X$ or $\tilde{X}$) and outputs a feature map $V = \{v_i\}$, $i = 1, \ldots, I$ ($I$ is the number of columns in the feature map).
Recurrently applied CNN (RCNN): As an RCNN module, we implemented a Gated RCNN (GRCNN) [28], a variant of RCNN that can be applied recursively with a gating mechanism. The architectural details of the module are shown in Table 6. The output of RCNN is 512 channels × 26 columns.
D.3 Sequence modeling stage
Some previous works used a Bidirectional LSTM (BiLSTM) to make a contextual sequence $H$ after the Feat. stage.
BiLSTM: We implemented a 2-layer BiLSTM [7], which is used in CRNN [24]. In the following, we explain a BiLSTM layer used in our framework. A BiLSTM layer identifies two hidden states, $h_t^{(f)}$ and $h_t^{(b)}$ for all $t$, calculated through the time sequence and its reverse. Following [24], we additionally applied an FC layer between BiLSTM layers to determine one hidden state, $\hat{h}_t$, from the two identified hidden states, $h_t^{(f)}$ and $h_t^{(b)}$. The dimension of all hidden states, including the FC layer, was set to 256.
None indicates that no Seq. module is used on the output of the Feat. module, that is, $H = V$.
D.4 Prediction stage
A prediction module produces the final prediction output from the input $H$ (i.e., $Y = y_1, y_2, \ldots$), which is a sequence of characters. We implemented two Pred. modules: one based on Connectionist Temporal Classification (CTC) [6] and one based on the attention mechanism (Attn). In our experiments, the character label set includes 36 alphanumeric characters. For CTC, an additional blank token is added to the label set, as CTC requires. For Attn, an additional end-of-sentence (EOS) token is added to the label set, as the attention decoder requires. That is, the label set contains 37 symbols in either case.
Connectionist Temporal Classification (CTC): CTC takes a sequence $H = h_1, \ldots, h_T$, where $T$ is the sequence length, and outputs the probability of $\pi$, which is defined as
$$p(\pi \mid H) = \prod_{t=1}^{T} y^t_{\pi_t} \tag{5}$$
where $y^t_{\pi_t}$ is the probability of generating character $\pi_t$ at time step $t$. After that, a mapping function $M$ maps $\pi$ to $Y$ by removing repeated characters and blanks. For instance, $M$ maps “aaa--b-b-c-ccc-c--” onto “abbccc”, where ‘-’ is the blank token. The conditional probability is defined as the sum of the probabilities of all $\pi$ that are mapped by $M$ onto $Y$, which is
$$p(Y \mid H) = \sum_{\pi : M(\pi) = Y} p(\pi \mid H) \tag{6}$$
At the testing phase, the predicted label sequence is calculated by taking the highest-probability character $\pi_t$ at each time step $t$ and mapping $\pi$ onto $Y$:
$$Y^* \approx M\left(\arg\max_{\pi} p(\pi \mid H)\right) \tag{7}$$
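The CTC quantities described above can be sketched in plain Python. The sketch assumes per-step probabilities are given as dictionaries from symbol to probability, and it computes the conditional probability by brute-force enumeration of all paths, which is only feasible for tiny examples (practical implementations use dynamic programming instead). The function names are hypothetical.

```python
from itertools import groupby, product

BLANK = "-"


def collapse(path):
    """Mapping function: merge repeated symbols, then drop blanks."""
    return "".join(ch for ch, _ in groupby(path) if ch != BLANK)


def path_prob(path, probs):
    """Probability of one path: product of its per-step symbol probabilities."""
    p = 1.0
    for t, ch in enumerate(path):
        p *= probs[t][ch]
    return p


def label_prob(label, probs):
    """Sum of path probabilities over all paths that collapse to `label`."""
    alphabet = sorted(probs[0])
    return sum(
        path_prob(path, probs)
        for path in product(alphabet, repeat=len(probs))
        if collapse(path) == label
    )


def greedy_decode(probs):
    """Take the most probable symbol at each step, then collapse the path."""
    best_path = [max(step, key=step.get) for step in probs]
    return collapse(best_path)


print(collapse("aaa--b-b-c-ccc-c--"))  # abbccc
```

Greedy decoding is an approximation: it collapses the single most probable path, which is not guaranteed to yield the most probable label sequence.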
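Attention mechanism (Attn): We implemented a one-layer LSTM attention decoder [1], as used in FAN, AON, and EP [4, 5, 2]. At time step $t$, the decoder predicts an output $y_t$ as
$$y_t = \mathrm{softmax}(W_0 s_t + b_0) \tag{8}$$
where $W_0$ and $b_0$ are trainable parameters. $s_t$ is the decoder LSTM hidden state at time $t$, given by
$$s_t = \mathrm{LSTM}(y_{t-1}, c_t, s_{t-1}) \tag{9}$$
and $c_t$ is a context vector, which is computed as the weighted sum of $H = h_1, \ldots, h_I$ from the former stage as
$$c_t = \sum_{i=1}^{I} \alpha_{ti} h_i \tag{10}$$
where $\alpha_{ti}$ is called the attention weight and is computed by
$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{I} \exp(e_{tk})} \tag{11}$$
$$e_{ti} = v^\top \tanh(W s_{t-1} + V h_i + b) \tag{12}$$
and $v$, $W$, $V$, and $b$ are trainable parameters. The dimension of the LSTM hidden state was set to 256.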
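One decoding step of the attention weights and the context vector can be sketched in NumPy as follows. The additive scoring function $v^\top \tanh(W s_{t-1} + V h_i + b)$ used here is the form from the neural machine translation work cited for this stage and is stated as an assumption; the function name `attention_step`, the array shapes, and the dimensions are illustrative, and the LSTM state update itself is omitted.

```python
import numpy as np


def attention_step(s_prev, H, W, V, b, v):
    """Compute attention weights and the context vector for one time step.

    s_prev: (d_s,)    previous decoder hidden state
    H:      (I, d_h)  features from the former stage, one row per column
    W:      (d_a, d_s), V: (d_a, d_h), b: (d_a,), v: (d_a,)
    """
    # Additive score for every feature column; result has shape (I,).
    scores = np.tanh(s_prev @ W.T + H @ V.T + b) @ v
    # Softmax over the I columns, shifted by the max for numerical stability.
    scores = scores - scores.max()
    alpha = np.exp(scores) / np.exp(scores).sum()
    # Context vector: attention-weighted sum of the feature columns.
    context = alpha @ H
    return alpha, context


# Illustrative sizes only; the parameters would normally be learned.
rng = np.random.default_rng(0)
I, d_h, d_s, d_a = 5, 4, 3, 6
alpha, context = attention_step(
    rng.standard_normal(d_s),
    rng.standard_normal((I, d_h)),
    rng.standard_normal((d_a, d_s)),
    rng.standard_normal((d_a, d_h)),
    rng.standard_normal(d_a),
    rng.standard_normal(d_a),
)
print(alpha.sum())  # sums to 1 up to floating-point error
```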
D.5 Objective function
Denote the training dataset by $TD = \{X_i, Y_i\}$, where $X_i$ is the training image and $Y_i$ is the word label. Training is conducted by minimizing the objective function, the negative log-likelihood of the conditional probability of the word label:
$$O = -\sum_{(X_i, Y_i) \in TD} \log p(Y_i \mid X_i) \tag{13}$$
This function calculates a cost from an image and its word label, and the modules in the framework are trained in an end-to-end manner.
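As a small worked example of this objective, the negative log-likelihood can be computed from the conditional probabilities that a model assigns to the correct word labels. The probabilities below are made up for illustration, and `nll_objective` is a hypothetical helper name.

```python
import math


def nll_objective(label_probs):
    """Negative log-likelihood: minus the sum of log p(label | image)."""
    return -sum(math.log(p) for p in label_probs)


# Hypothetical probabilities of the correct labels for three training images.
print(nll_objective([0.9, 0.5, 0.25]))
```

The cost is zero only when every correct label receives probability 1, and it grows without bound as any of these probabilities approaches 0.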
Appendix E STR Framework - full experimental results
We report the full results of our experiments in Table 8.
The FLOPS in Table 8 are approximate; details are given in an issue on our GitHub repository.
Appendix F Additional Experiments
F.1 Fine-tuning on real datasets
We have fine-tuned our best model on the union of the training sets of IIIT, SVT, IC13, and IC15 (in-distribution), which are the held-out training subsets of these real scene text benchmarks. The other evaluation datasets, IC03, SP, and CT (out-distribution), do not have a held-out subset for training: SP and CT have no training sets, and some training images of IC03 have been found in the IC13 evaluation dataset, as mentioned in §4.1, so it is not appropriate to fine-tune on the IC03 training set.
Our model has been fine-tuned for 10 epochs. Table 9 shows the results. By fine-tuning on the real data, the accuracy on the in-distribution subset (the union of the evaluation datasets IIIT, SVT, IC13, and IC15) and on all benchmark data improved by 2.2 pp and 1.5 pp, respectively. Meanwhile, the fine-tuned performance on the out-distribution subset (the union of the evaluation datasets IC03, SP, and CT) decreased by 1.3 pp. We conclude that fine-tuning on real data is effective when the real data is close to the test-time distribution. Otherwise, fine-tuning on real data may do more harm than good.
F.2 Accuracy with varying training dataset size
We have evaluated the accuracy of all 24 STR models against varying training dataset size. The training dataset consists of MJSynth (8.9 M) and SynthText (5.5 M), 14.4 M in total, the same setting as in §4.1. We report the full results for varying training dataset size in Table 10. In addition, Figures 15–18 show averaged accuracy plots. Each plot is color-coded by module, which helps to grasp the tendency of each module.
In Figure 15, we observe that the curves without TPS do not saturate at 100% training data size; more training data are likely to improve them. The curves with TPS saturate at 80% training data. We conjecture this is because TPS normalizes the input images, so the last 20% of the training dataset is normalized by TPS rather than improving accuracy. Thus other kinds of data, which are not simply normalized by a TPS trained on 80% of the training dataset, would be needed to improve accuracy further.
In Figure 16, we observe that the curves of ResNet do not saturate at 100% training data size. The averages of VGG and RCNN, on the other hand, saturate at 60% and 80% training data, respectively. We conjecture this is because VGG and RCNN have lower capacity than ResNet and have already reached their performance limits at the current amount of training data.
In Figure 17, we observe that the curves with BiLSTM do not saturate at 100% training data size. The curves without BiLSTM saturate at 80% training data. We conjecture this is because a model with BiLSTM has higher capacity than one without, and thus still has room to improve accuracy with more training data.
In Figure 18, we observe that the curves of Attn do not saturate at 100% training data size. The curves of CTC saturate at 80% training data. Again, we conjecture this is because Attn has higher capacity than CTC, and thus still has room to improve accuracy with more training data.
F.3 Evaluation on COCO-Text dataset
We have evaluated the models on the COCO-Text dataset [27], another benchmark derived from MS COCO that contains complex and low-resolution scene images. COCO-Text contains many special characters, heavy noise, and occlusions; it is generally considered more challenging than the seven benchmarks considered so far. Figure 19 shows the accuracy-time and accuracy-space trade-off plots for the 24 STR methods on COCO-Text. Although the overall accuracy is lower, the relative order among the methods is largely preserved compared to Figure 4. Fine-tuning the models on the COCO-Text training set improved the averaged accuracy (over the 24 models) from 42.4% to 58.2%, a relatively large jump that is attributable to the unusual data distribution of COCO-Text. Evaluation and analysis on COCO-Text are beneficial, especially for addressing the remaining corner cases of STR.
References
[1] (2015) Neural machine translation by jointly learning to align and translate. In ICLR.
[2] (2018) Edit probability for scene text recognition. In CVPR.
[3] (2018) Rosetta: large scale system for text detection and recognition in images. In KDD, pp. 71–79.
[4] (2017) Focusing attention: towards accurate text recognition in natural images. In ICCV, pp. 5086–5094.
[5] (2018) AON: towards arbitrarily-oriented text recognition. In CVPR, pp. 5571–5579.
[6] (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, pp. 369–376.
[7] (2009) A novel connectionist system for unconstrained handwriting recognition. In TPAMI, Vol. 31, pp. 855–868.
[8] (2016) Synthetic data for text localisation in natural images. In CVPR.
[9] (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In ICCV, pp. 1026–1034.
[10] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
[11] (2014) Synthetic data and artificial neural networks for natural scene text recognition. In Workshop on Deep Learning, NIPS.
[12] (2015) Spatial transformer networks. In NIPS, pp. 2017–2025.
[13] (2015) ICDAR 2015 competition on robust reading. In ICDAR, pp. 1156–1160.
[14] (2013) ICDAR 2013 robust reading competition. In ICDAR, pp. 1484–1493.
[15] (2018) NSML: meet the MLaaS platform with a real-world case study. arXiv:1810.09957.
[16] (2016) Recursive recurrent nets with attention modeling for OCR in the wild. In CVPR, pp. 2231–2239.
[17] (2016) STAR-net: a spatial attention residue network for scene text recognition. In BMVC, Vol. 2.
[18] (2018) Char-net: a character-aware neural network for distorted scene text recognition. In AAAI.
[19] (2018) Synthetically supervised feature learning for scene text recognition. In ECCV.
[20] (2003) ICDAR 2003 robust reading competitions. In ICDAR, pp. 682–687.
[21] (2012) Scene text recognition using higher order language priors. In BMVC.
[22] (2013) Recognizing text with perspective distortion in natural scenes. In ICCV, pp. 569–576.
[23] (2014) A robust arbitrary text detection system for natural scene images. In ESWA, Vol. 41, pp. 8027–8048.
[24] (2017) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. In TPAMI, Vol. 39, pp. 2298–2304.
[25] (2016) Robust scene text recognition with automatic rectification. In CVPR, pp. 4168–4176.
[26] (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
[27] (2016) COCO-Text: dataset and benchmark for text detection and recognition in natural images. arXiv:1601.07140.
[28] (2017) Gated recurrent convolution neural network for OCR. In NIPS, pp. 334–343.
[29] (2011) End-to-end scene text recognition. In ICCV, pp. 1457–1464.
[30] (2017) Learning to read irregular text with attention mechanisms. In IJCAI.
[31] (2012) ADADELTA: an adaptive learning rate method. arXiv:1212.5701.