ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text - RRC-ArT
This paper reports the ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text (RRC-ArT), which consists of three major challenges: i) scene text detection, ii) scene text recognition, and iii) scene text spotting. A total of 78 submissions from 46 unique teams/individuals were received for this competition. The top performing score of each task is as follows: i) T1 - 82.65%, ii) T2.1 - 74.3%, iii) T2.2 - 85.32%, iv) T3.1 - 53.86%, and v) T3.2 - 54.91%. Apart from the results, this paper also details the ArT dataset, task descriptions, evaluation metrics and participants’ methods. The dataset, the evaluation kit and the results are publicly available at the challenge website (https://rrc.cvc.uab.es/?ch=14).
Text in the wild comes in a variety of shapes. However, linear text arrangements, whether horizontal or rotated (as defined by multi-oriented text datasets like ICDAR2015 and MSRA-TD500), dominate existing popular datasets such as ICDAR2013 [3], ICDAR2015 [1] and COCO-Text [4]. Text instances in curved or other irregular arrangements, as pointed out in Total-Text [5] and SCUT-CTW1500 [6], are rarely seen in these datasets despite being common in real world scenes. As a result, text detection models that properly consider arbitrary-shaped text are relatively uncommon. In addition, recent studies [5, 6, 7, 8, 9] point out that existing state-of-the-art scene text detection models perform poorly on such data. These studies suggest that a major design change is needed to handle the wild nature of arbitrary-shaped text instances.
* Authors contributed equally.
Motivated by Total-Text [5] and SCUT-CTW1500 [6], numerous research works [7, 8] have taken an interest in tackling the curved text reading problem. These studies suggest that some principal design changes are necessary in order to produce a tight polygon detection result that bounds arbitrary-shaped text closely. One example is increasing the number of regression variables to cater for the higher vertex count of a curved text region. Another is the adoption of a segmentation-based approach to address this problem. However, since the testing sets of Total-Text [5] and SCUT-CTW1500 [6] consist of only 300 and 500 images, respectively, it is hard to draw conclusive claims from them due to their relatively small sample sizes. Hence, we combined all the released images and ground truth in both of these datasets as the training set for this competition, and at the same time collected new images with similar attributes (i.e. a high presence of arbitrary-shaped text alongside horizontal and multi-oriented text) to increase the size of both the training and testing sets.
This competition is a natural extension of all the previous RRC competitions, and consists of three main tasks: i) scene text detection, ii) scene text recognition, and iii) scene text spotting. It stands out by demanding higher robustness from scene text understanding models against text of arbitrary shapes. Details about this competition and the ArT dataset can be found on the RRC competition website (http://rrc.cvc.uab.es/?ch=14).
The structure of this paper is as follows. Related work is presented in Sec. II and details of the ArT dataset are described in Sec. III. The three tasks of this competition are covered in Sec. V, VI and VII respectively, each with subsections for the task description, the evaluation metric and a brief discussion of participants’ results. The paper ends with our conclusions in Sec. VIII.
II Related Work
Scene text reading methods have achieved significant progress alongside the evolution of scene text benchmarks. The continuously emerging datasets follow several noticeable patterns: i) the size gets bigger, ii) the data becomes harder, and iii) the annotation becomes more flexible. In 2013, ICDAR2013 [3] comprised 462 images with only well-focused rectangular-shaped text. In the ICDAR2015 [1] dataset, the number increased to 1,500 and all the images were incidentally captured. In addition, the dataset introduced quadrilateral annotation to accommodate the variety of text shapes. In 2017, IC17-MLT [10] was introduced to challenge the community with the multi-script scene text reading problem in 9 different languages. Similarly, the size of the dataset increased to 18,000, and quadrangles were used as the ground truth format.
Recently, [5, 6] pointed out that although curved text instances are commonly found in the real world, they are rarely seen in the existing benchmarks. Moreover, in the few cases where curved text instances do appear, their annotations are very loose, because both axis-aligned and quadrilateral bounding regions fit them poorly. Therefore, Total-Text and SCUT-CTW1500 were collected with a strong emphasis on curved text instances. Additionally, both datasets employed polygons as the ground truth format for their annotations. These two benchmarks have quickly attracted the interest of the research community, motivating many promising text reading methods. Following the principles of both datasets, the ArT dataset aims to provide the community with a much larger data size to work with and a more comprehensive benchmark for future evaluations.
III The ‘ArT’ Dataset
The dataset intended for this competition, ArT, is a combination of Total-Text (https://github.com/cs-chan/Total-Text-Dataset), SCUT-CTW1500 (https://github.com/Yuliang-Liu/Curve-Text-Detector), the Baidu Curved Text Dataset (a subset of LSVT), plus a large sample of newly collected images. The new images were collected following the same principles as [5, 6]: i) at least one arbitrary-shaped text per image; ii) high diversity in terms of text orientations (i.e. large amounts of horizontal, multi-oriented, and curved text instances); iii) text instances are annotated with a tight polygon ground truth format.
III-1 Type/source of images
Images in the ArT dataset were collected via digital camera, mobile phone camera, Internet, Flickr, image libraries, and Google Open-Image [11]. In addition, some of the new images that contain Chinese text were collected from Baidu Street View. Similar to most publicly available scene text datasets, the images in ArT contain scenes from both indoor and outdoor settings, with digitally born images included. Apart from the usual vision-related challenges (illumination, background complexity, perspective distortion, etc.), ArT stands out by challenging scene text understanding models with a combination of different text orientations within one image.
III-2 Homogeneity of the dataset
The images from Total-Text [5], SCUT-CTW1500 [6] and the Baidu Curved Text Dataset are similar in nature: i) they come from real world scenes, and ii) they are mostly well focused. Hence, the combination is smooth in this respect. However, SCUT-CTW1500 considers Chinese script in its annotation while Total-Text does not, so the ground truth of Total-Text was refined to annotate all the Chinese characters in it. In addition, the line-level annotation of the Latin script in SCUT-CTW1500 was re-annotated at word level.
III-3 Number of images
On top of the existing images (3,055) from Total-Text [5] and SCUT-CTW1500 [6], 7,111 images were added, making the ArT dataset one of the largest scene text datasets for arbitrary-shaped text. In total, the ArT dataset contains 10,166 images, split into a training set of 5,603 images and a testing set of 4,563 newly collected images. We acknowledge the Baidu team for annotating all the newly collected images via the Baidu crowd-sourcing platform.
III-4 Ground truth
It is worth pointing out that the polygon ground truth format employed in ArT is different from that of all previous RRC editions, which adopted the axis-aligned bounding box [3, 1] or the quadrilateral as the ground truth format. These annotation styles have two and four vertices, respectively, which are intuitively inappropriate for the arbitrary-oriented Chinese and Latin text instances in ArT, especially the curved ones. Following the practice of the MLT dataset [10], we annotated Chinese and Latin scripts at line-level and word-level granularities, respectively. The transcription and the language type of each annotated text instance are provided. Also, note that the polygon bounding regions are labelled with either 4, 8, 10, or 12 vertices depending on their shape. All illegible text instances and symbols were labelled as “Do Not Care” and do not contribute to the evaluation result.
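As an illustration of the annotation properties described above, the following is a minimal sketch of how one annotated text instance might be held in memory. The class name `TextInstance`, its field names and the helper functions are hypothetical and are not the released ArT file format; only the allowed vertex counts, the transcription and language fields, and the “Do Not Care” flag are taken from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Vertex counts stated in the text; everything else below is illustrative.
ALLOWED_VERTEX_COUNTS = {4, 8, 10, 12}


@dataclass
class TextInstance:
    """Hypothetical record for one annotated text instance."""
    points: List[Tuple[float, float]]  # polygon vertices, in order
    transcription: str
    language: str                      # e.g. "Latin" or "Chinese"
    do_not_care: bool = False          # illegible text or symbols

    def __post_init__(self) -> None:
        if len(self.points) not in ALLOWED_VERTEX_COUNTS:
            raise ValueError(f"unexpected vertex count: {len(self.points)}")


def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def evaluated_instances(instances):
    """Drop "Do Not Care" instances, which do not contribute to the score."""
    return [inst for inst in instances if not inst.do_not_care]
```

A record with, say, three vertices is rejected at construction time, which makes the constraint on vertex counts explicit in code.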
This competition is jointly organized by the University of Malaya, Malaysia; South China University of Technology, China; Baidu Inc, China; and the Computer Vision Centre (Autonomous University of Barcelona), Spain. Monetary rewards for the winners of this challenge are sponsored by Baidu Inc.
V Task 1: Scene Text Detection
The main objective of this task is to detect the location of every text instance in the input image. Given an input image, participants are expected to provide the spatial location and confidence score of each prediction.
V-B Evaluation metrics
An IoU-based evaluation protocol is adopted for this task. IoU (Intersection over Union) is a threshold-based evaluation protocol, with a default threshold of 0.5. Results are reported at both the 0.5 and 0.7 thresholds, but only the H-Mean at the former threshold is used to determine the official ranking. To ensure fairness, participants are required to submit a confidence score for each detection, and all confidence thresholds are iterated to find the best H-Mean score. It is also worth mentioning that ArT is the first RRC to handle unfixed detection output coordinates (i.e. polygons with a varying number of vertices) in Task 1 (Sec. V) and Task 3 (Sec. VII).
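The confidence sweep described above can be sketched as follows. This is a simplified illustration and not the official evaluation kit: it assumes the polygon IoU values have already been computed and are passed in as a matrix, that detections are listed in descending confidence order, that matching is greedy and one-to-one, and that “Do Not Care” handling is omitted. The function names are hypothetical.

```python
def h_mean(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def count_matches(iou_rows, iou_threshold):
    """Greedy one-to-one matching.

    iou_rows[d][g] is the IoU between detection d and ground truth g.
    """
    used_gt = set()
    matched = 0
    for row in iou_rows:
        best_g, best_iou = None, iou_threshold
        for g, iou in enumerate(row):
            if g not in used_gt and iou > best_iou:
                best_g, best_iou = g, iou
        if best_g is not None:
            used_gt.add(best_g)
            matched += 1
    return matched


def best_h_mean(scores, iou_rows, num_gt, iou_threshold=0.5):
    """Try each confidence value as a cut-off and keep the best H-mean.

    Returns (best H-mean, confidence cut-off that achieved it).
    """
    best = (0.0, None)
    for cut in sorted(set(scores)):
        kept = [row for s, row in zip(scores, iou_rows) if s >= cut]
        tp = count_matches(kept, iou_threshold)
        precision = tp / len(kept) if kept else 0.0
        recall = tp / num_gt if num_gt else 0.0
        score = h_mean(precision, recall)
        if score > best[0]:
            best = (score, cut)
    return best
```

Calling `best_h_mean` once with `iou_threshold=0.5` and once with `iou_threshold=0.7` mirrors the two reported settings; a stricter threshold can change both the best score and the cut-off that achieves it.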
V-C Results and Discussion
For Task 1, we received 48 submissions, 35 of which came from unique participants. The average H-mean score for Task 1 is 67.46%. First place in this task goes to Pil-Mask-RCNN by Wang et al. from Institute of Computational Technology, CAS, China, with a winning H-mean score of 82.65%. The proposed method is built on the Mask R-CNN pipeline with two different backbone networks: SENet-152 and ShuffleNet v2. Figure 3 illustrates some of the successful examples. The visualization of its results shows that the detection regions are of high quality: smooth and tight. It also appears to be robust to the language of the text instances (i.e. Chinese and Latin scripts). We also investigated the failure examples of the winning method (as seen in Figure 3). The common problems are: i) under-segmentation (combining multiple text instances into one), ii) mistaken grouping of crowded text instances in a small area (especially Chinese characters), and iii) small text instances. We notice that most of the top performing methods (both runners-up included) are based on the Mask R-CNN pipeline. Also, most of the participants (all except 4 submissions) designed their models to produce polygon bounding regions as the detection output, which aligns with the emphasis of this competition: tightness of detection outputs.
The ranking of Task 1 is tabulated in Table I. Note that the top 3 teams at 0.5 IoU and 0.7 IoU are different: the original runner-up, NJU-ImagineLab, is overtaken by ArtDet-v2 and drops to fourth place. Meanwhile, Figure 1 shows the histogram of the average H-mean scores of each image in the testing set. Most images have average H-mean scores between 0.8 and 0.9, followed by 0.7-0.8, and so forth. The challenging images with 0 - 0.1 H-mean score can be seen in Figure 2.
VI Task 2: Scene Text Recognition
The main objective of this task is to recognize every character in a cropped image patch. The inputs of this task are the cropped image patches containing the corresponding text instances, together with the relative polygon spatial coordinates. Participants are asked to provide the recognized string of characters as output. The polygon coordinates are provided as optional information, so it is up to participants to choose whether to use them. Furthermore, we decided to break Task 2 down into two subcategories: i) Task 2.1 - Latin script only recognition, and ii) Task 2.2 - Latin and Chinese scripts recognition. We hope that such a split makes this task friendlier for non-Chinese participants, as the aim of this competition is to detect and recognize arbitrary-shaped text. Participants are required to make only a single submission regardless of the scripts. We evaluated all submissions under two categories, Latin and mixed (Latin and Chinese) scripts. When evaluating the recognition performance for Latin script, all non-Latin transcriptions are treated as “Do Not Care” regions.
VI-B Evaluation metrics
For Task 2.1, case-insensitive word accuracy is used as the primary challenge metric. Apart from this, all the standard practices for text recognition evaluation are followed. For example, symbols in the middle of ground truth text instances are considered, but symbols such as ( !?.:,*”()·/’_ ) at the beginning and at the end of both the ground truth and the submissions are removed. For Task 2.2, the Normalized Edit Distance metric (specifically 1-N.E.D, which is also used in the ICDAR 2017 competition RCTW-17 [12]) is used as the ranking metric. The reason for using 1-N.E.D as the official ranking metric for Task 2.2 is that Chinese text instances usually contain more characters than Latin ones, which makes the word accuracy metric too harsh to evaluate Task 2.2 fairly. In the 1-N.E.D evaluation protocol, all characters (Latin and Chinese) are treated in a consistent manner. To avoid ambiguities in the annotations, we performed several pre-processing steps before the evaluation process: 1) English letters are treated as case insensitive; 2) Chinese traditional and simplified characters are treated as the same label; 3) blank spaces and symbols are removed; 4) all illegible images do not contribute to the evaluation result.
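The two recognition metrics can be sketched as follows. This is a minimal illustration under stated assumptions and not the official evaluation script: 1-N.E.D is taken to be one minus the average of the edit distance divided by the length of the longer string, the edge-symbol set uses straight-quote equivalents of the symbols listed above, and the traditional/simplified Chinese mapping and symbol removal for Task 2.2 are omitted. All function names are hypothetical.

```python
# Assumed straight-quote version of the edge symbols listed in the text.
EDGE_SYMBOLS = " !?.:,*\"()·/'_"


def edit_distance(a, b):
    """Levenshtein distance using a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalize_word(text):
    """Task 2.1 style: strip edge symbols, keep inner ones, ignore case."""
    return text.strip(EDGE_SYMBOLS).lower()


def word_accuracy(gts, preds):
    """Fraction of predictions that match the ground truth exactly."""
    hits = sum(normalize_word(g) == normalize_word(p) for g, p in zip(gts, preds))
    return hits / len(gts)


def one_minus_ned(gts, preds):
    """Assumed form: 1 - mean(edit_distance / length of the longer string)."""
    total = 0.0
    for g, p in zip(gts, preds):
        g = "".join(g.lower().split())  # case-insensitive, blanks removed
        p = "".join(p.lower().split())
        longest = max(len(g), len(p))
        if longest:
            total += edit_distance(g, p) / longest
    return 1.0 - total / len(gts)
```

The contrast between the two metrics is visible on a single near miss: one wrong character in a four-character string scores 0 under word accuracy but 0.75 under 1-N.E.D, which is the motivation given above for ranking Task 2.2 by 1-N.E.D.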
VI-C Results and Discussion
For Task 2, there are 22 unique submissions from 17 unique teams. Starting with Task 2.1, the average accuracy score of this task is 62.47%. The winner of this task is PKU_Team_Zero by Shangbang et al. from MEGVII (Face++) and Peking University, China, with a winning score of 74.30%. It comprises three major modules: 1) a detection module that provides the spatial coordinates of the text (as polygon vertices) within the cropped image; 2) a spatial transformer that straightens the image based on the coordinates; and 3) an attention RNN model for recognizing words. We notice that all three winning models have similar pipelines: all of them rectify the cropped image patches (i.e. straighten the text region, in turn removing background) before recognizing the word in it. This shows that the polygon ground truth format, rather than the normal bounding box, is indeed crucial for recognizing curved or otherwise arbitrary-shaped text instances. Another similarity is that all three of them employ an attention mechanism in their RNN word recognition module. Qualitative results of the PKU_Team_Zero method can be seen in Figure 4. The method demonstrates an outstanding ability to recognize curved text instances with challenging attributes in real world scenes. On the other hand, Figure 4 also illustrates some of the failure examples. The failure cases are mainly caused by unusual font types and severely blurred patches.
The top three methods for Task 2.2 are quite different from those of Task 2.1. The average 1-N.E.D score of this sub-task is 68.43%, and the winner of this task is CRAFT (Preprocessing) + TPS-ResNet by Baek et al. from Naver Corporation, which scores 85.32%. This method also has three major modules: detection, rectification, and recognition. Specifically, it adopts CRAFT [13] as its text detector, a Thin-Plate-Spline (TPS) based Spatial Transformer Network as its image normalizer, and a BiLSTM with attention as its text recognizer. Figure 5 shows some successful examples of this method; it appears to be robust against curved text instances in both the Chinese and Latin scripts. Failure cases can also be seen in Figure 5, where it fails on 1) Chinese characters with similar appearance, 2) vertically oriented text, 3) blurred patches, and 4) interestingly, a Chinese character that looks like ‘K’ under perspective distortion and illumination.
The global performance of Task 2 is summarized in Figure 1. From this figure, we notice two obvious spikes in the 0-0.1 and 0.9-1.0 bars for Task 2.1 (blue). This is a consequence of the accuracy scoring mechanism (i.e. 1 for getting every character recognized and 0 otherwise). Meanwhile, in Task 2.2 (red), we see a smoother distribution between 0 and 1. Most of the patches have a high average 1-N.E.D score (between 0.9 and 1).
VII Task 3: Scene Text Spotting
The main objective of this task is to detect and recognize every text instance in the provided image in an end-to-end manner. Given an input image, the output must be the spatial location of every text instance, at word-level for Latin script and line-level for Chinese script, together with the predicted word for each detection. Similar to RRC 2017, a generic vocabulary list (90K common English words) is provided as a reference for this task. As in Task 2, we break Task 3 down into two subcategories: i) Task 3.1 Latin script only text spotting, and ii) Task 3.2 Latin and Chinese scripts text spotting.
VII-B Evaluation metrics
For Task 3, we first evaluate the detection result by calculating its IoU with the corresponding ground truth. Detection regions with an IoU value higher than 0.5 are then matched with the recognition ground truth (i.e. the transcript ground truth of that particular text region). In the case of multiple matches, we only consider the detection region with the highest IoU; the rest of the matches are counted as False Positives. The pre-processing steps for the recognition part are the same as in Task 2, and all Chinese text regions are ignored in Task 3.1. It is also worth mentioning that although both the case-insensitive word accuracy H-mean and 1-N.E.D are reported, the official ranking metric for both sub-tasks is 1-N.E.D.
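The matching rule above can be sketched as follows for the word accuracy H-mean. This is a simplified illustration and not the official evaluation kit: it assumes precomputed IoU values, case-insensitive exact string comparison, no “Do Not Care” regions, and ground truth regions that do not overlap each other (so no detection is the best match for two ground truths). The function name and return layout are hypothetical, and the 1-N.E.D variant is not shown.

```python
def spotting_h_mean(det_texts, gt_texts, iou_rows, iou_threshold=0.5):
    """Word accuracy H-mean for end-to-end spotting (simplified).

    iou_rows[d][g] is the IoU between detection d and ground truth g.
    Returns (precision, recall, h_mean).
    """
    # For each ground truth, keep only the detection with the highest IoU
    # above the threshold; other candidates stay unmatched (false positives).
    best_for_gt = {}
    for d, row in enumerate(iou_rows):
        for g, iou in enumerate(row):
            if iou > iou_threshold and iou > best_for_gt.get(g, (0.0, None))[0]:
                best_for_gt[g] = (iou, d)

    # A matched pair counts only if the transcription is also right.
    correct = 0
    for g, (_, d) in best_for_gt.items():
        if det_texts[d].lower() == gt_texts[g].lower():
            correct += 1

    precision = correct / len(det_texts) if det_texts else 0.0
    recall = correct / len(gt_texts) if gt_texts else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

In this sketch a duplicate detection of an already matched region lowers precision even if its transcription is correct, which reflects the rule that the remaining matches are counted as false positives.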
VII-C Results and Discussion
Task 3 received 8 submissions from 8 individual teams. It is also the hardest of all the tasks: the average accuracy H-mean score for Task 3.1 is only 44.37%. The method that ranks first is baseline_0.5_class_5435 by Jinjin Zhang from Beihang University, China, with an accuracy H-mean score of 52.45%. The winning method has a segmentation-based detector and an attention-based recognizer. Zhang mentioned that the method is modelled to have 5,435 classes for the recognition task. In addition, extra training data from LSVT, ICDAR2017, COCO-Text, ReCTS, and augmented data were used to train the recognition network. The top three winners of Task 3.2 are the same as in Task 3.1, with a slightly higher average 1-N.E.D score of 44.91%. Figure 6 depicts several successful and failure examples. As observed, in a high contrast setting (left figure), every text instance is well detected and recognized by the model, while the challenging example on the right confused the method with multiple possible combinations of the text instances. To be specific, the four vertical red regions are evaluated as false positives; the actual ground truths are two text regions arranged from left to right (top and bottom), making up two Chinese words. Such an example could potentially be solved by instilling semantic information (e.g. specific language knowledge) into the text spotting model.
In contrast to Tasks 1 and 2, the histogram of Task 3 (in Figure 1) shows the most evenly spread pattern across the score range. Note that for Task 3.1 (blue) the spike at 0.9-1.0 (in contrast to the low count of Task 3.2 (red)) is due to the fact that Chinese scripts are counted as “Do Not Care” regions, which makes it easier to score a full mark on images dominated by Chinese text. In general, most of the images have an average score of 0.4 to 0.6, which reflects the challenging nature of this task. Again, Figure 2 shows some of the most challenging images in the test set, with an average score of 0 to 0.1.
VIII Conclusions
The ICDAR2019 Robust Reading Challenge on ArT received an overwhelming number of submissions, which is a delightful outcome considering that scene text understanding works that take curved text into consideration were rarely seen before the recent introduction of the Total-Text and SCUT-CTW1500 datasets. Although the scene text understanding community has seen tremendous improvements in very recent years, a gap between research and application still exists. The main motivation behind the ArT dataset and this challenge is to encourage both academia and industry to look into the arbitrary orientation and shape of text instances in the wild.
The scores of the top three winners in all tasks are close to each other, which is a good indication of where the state of the art resides at the moment. Taking a deeper look into the submitted models, segmentation-based methods seem to dominate arbitrary-shaped text detection. We also find that the current IoU metric has drawbacks; for example, some detections that miss several characters are still rewarded with 100% recall. Therefore, a more reasonable metric such as the recent TIoU metric [14] may be worth adopting in the future. In the recognition tasks, popular and high performing models share similar pipelines, which include rectifying the text patches before recognizing them with an attention RNN/LSTM module. Finally, text spotting seems to be the most challenging task, with the lowest winning H-mean score.
- [1] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu et al., “ICDAR 2015 competition on robust reading,” in ICDAR, 2015.
- [2] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, “Detecting texts of arbitrary orientations in natural images,” in CVPR, 2012.
- [3] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. de las Heras, “ICDAR 2013 robust reading competition,” in ICDAR, 2013.
- [4] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie, “COCO-Text: Dataset and benchmark for text detection and recognition in natural images,” arXiv preprint arXiv:1601.07140, 2016.
- [5] C. K. Chng and C. S. Chan, “Total-Text: A comprehensive dataset for scene text detection and recognition,” in ICDAR, 2017, pp. 935–942.
- [6] Y. Liu, L. Jin, S. Zhang, and S. Zhang, “Detecting curve text in the wild: New dataset and new solution,” arXiv preprint arXiv:1712.02170, 2017.
- [7] S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao, “TextSnake: A flexible representation for detecting text of arbitrary shapes,” in ECCV, 2018, pp. 19–35.
- [8] P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai, “Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes,” in ECCV, 2018, pp. 71–88.
- [9] S. Long, X. He, and C. Yao, “Scene text detection and recognition: The deep learning era,” arXiv preprint arXiv:1811.04256, 2018.
- [10] N. Nayef, F. Yin, I. Bizid, H. Choi, Y. Feng, D. Karatzas, Z. Luo, U. Pal, C. Rigaud, J. Chazalon et al., “ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification - RRC-MLT,” in ICDAR, 2017, pp. 1454–1459.
- [11] I. Krasin, T. Duerig, N. Alldrin, A. Veit, S. Abu-El-Haija, S. Belongie, D. Cai, Z. Feng, V. Ferrari, V. Gomes et al., “OpenImages: A public dataset for large-scale multi-label and multi-class image classification,” Dataset available from https://github.com/openimages, vol. 2, no. 6, p. 7, 2016.
- [12] B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. J. Belongie, S. Lu, and X. Bai, “ICDAR2017 competition on reading Chinese text in the wild (RCTW-17),” CoRR, vol. abs/1708.09585, 2017. [Online]. Available: http://arxiv.org/abs/1708.09585
- [13] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee, “Character region awareness for text detection,” in CVPR, 2019.
- [14] Y. Liu, L. Jin, Z. Xie, C. Luo, S. Zhang, and L. Xie, “Tightness-aware evaluation protocol for scene text detection,” in CVPR, 2019.