A pooling based scene text proposal technique for scene text reading in the wild
Automatic reading of texts in scenes has attracted increasing interest in recent years as texts often carry rich semantic information that is useful for scene understanding. In this paper, we propose a novel scene text proposal technique aimed at accurate reading of texts in scenes. Inspired by the pooling layer in the deep neural network architecture, a pooling based scene text proposal technique is developed. A novel score function is designed which exploits the histogram of oriented gradients and is capable of ranking the proposals according to their probabilities of being text. An end-to-end scene text reading system has also been developed by incorporating the proposed scene text proposal technique, where false alarm elimination and word recognition are performed simultaneously. Extensive experiments over several public datasets show that the proposed technique can handle multi-orientation and multi-language scene texts and obtains outstanding proposal performance. The developed end-to-end systems also achieve very competitive scene text spotting and reading performance.
Keywords: scene text proposal, max-pooling grouping, scene text detection, scene text reading, scene text spotting
Texts in scenes play an important role in communication between a machine and its surrounding environments. Automated machine understanding of texts in scenes has a vast range of applications such as navigation assistance for the visually impaired OrCam (), scene text translation for tourists GoogleTrans (), etc. It has been a grand challenge for years in the computer vision research community, and it has received increasing interest in recent years, as evidenced by a number of scene text reading competitions ICDAR2013 (); IncidentialText (); SVT (); ChineseICDAR ().
Automated reading of texts in scenes is a very challenging task due to the large intra-class variations and the small inter-class variations with respect to many non-text objects in scenes. In particular, scene texts often have very different appearances because they may be printed in very different fonts and styles, captured from different distances and perspectives, and have very different cluttered backgrounds and lighting conditions. At the same time, texts often closely resemble many non-text objects in scenes, e.g. the letter ’o’/’O’ looks similar to many circular objects such as vehicle wheels, the letter ’l’ looks similar to many linear structures such as poles, etc.
In this research, we design a novel scene text proposal technique and integrate it into an end-to-end scene text reading system. Inspired by the pooling layer in deep neural networks, a pooling based scene text proposal technique is designed which is capable of grouping image edges into word and text line proposals efficiently. One unique feature of the pooling based proposal technique is that it does not involve heuristic thresholds/parameters such as text sizes, inter-character distances, etc. that are often used in many existing scene text detection techniques TextFlow (); Shij (); GVF (); TDCNN7 (). A novel score function is designed which exploits the histogram of oriented gradients and is capable of ranking proposals according to their probabilities of being text. A preliminary study on max-pooling based proposals has been presented in our prior work MPT_ICDAR2017 (). This paper presents a more comprehensive study by investigating several key proposal parameters such as proposal quality, optimal proposal set-ups, etc. In addition, several new studies are performed on different aspects of edge label assignment, pooling methods and detection of arbitrarily oriented scene texts in different languages. Furthermore, a new score function is designed which is more efficient and robust in proposal ranking. We also integrate the proposed pooling based proposal technique into an end-to-end scene text reading system and study its effects on the scene text reading performance.
The rest of this paper is organized as follows. Section 2 reviews recent works in scene text proposals, scene text detections, and scene text reading. Sections 3 and 4 describe the proposed scene text proposal technique and its application in scene text reading. Section 5 presents experimental results and several concluding remarks are drawn in Section 6.
2 Related works
Traditional scene text reading systems consist of a scene text detection step and a scene text recognition step. In recent years, scene text proposal has been investigated as an alternative to the scene text detection, largely due to its higher recall rate which is capable of locating more text regions as compared with the traditional text detection step.
2.1 Scene text proposal
The scene text proposal idea is mainly inspired by the success of object proposals in many object detection systems. It has the advantage of locating more possible text regions and thus offering higher detection recall. It is often evaluated according to the recall rate as well as the number of needed proposals - typically the smaller the better at a similar recall level HowGoodPP (). False-positive scene text proposals are usually eliminated by either a text/non-text classifier STL (); TDCNN3DT () or a scene text recognition model GomezE2E (); TEB () in end-to-end scene text reading systems.
Different scene text proposal approaches have been explored. One widely adopted approach combines generic object proposal techniques with text-specific features for scene text proposal generation. For example, EdgeBoxes EB () is combined with two text-specific features for scene text proposal generation TEB (). In another work JarE2E (), EdgeBoxes is combined with the Aggregate Channel Feature (ACF) and AdaBoost classifiers to search for text regions. In GomezE2E (), Selective Search SS () is combined with Maximally Stable Extremal Regions (MSER) to extract texture features for dendrogram grouping. A text-specific symmetry feature is explored in STL () to search for text line proposals directly, where false text line proposals are removed by training a CNN classifier. Deep features have also been used for scene text proposal due to their superior performance in recent years. For example, inception layers are built on top of the last convolution layer of the VGG16 for generating text proposal candidates in TDCNN3DT (). The Region Proposal Network (RPN) and Faster R-CNN structure are adopted for scene text proposal generation in TDCNN7 (); TextRCNN ().
Most existing scene text proposal techniques have various limitations. For example, the EdgeBoxes based technique JarE2E () is efficient but often generates a large number of false-positive proposals. The hand-crafted text-specific features rely heavily on object boundaries which are sensitive to image noise and degradation Shij (). Techniques using heuristic rules and parameters TEB () do not adapt well across datasets. The deep learning based technique TDCNN3DT () produces a small number of proposals but the recall rate becomes unstable when the Intersection over Union (IoU) threshold increases. By comparison, our proposed proposal technique does not rely on heuristic parameters and obtains a high recall rate with a small number of false-positive proposals.
2.2 Scene text detections
A large number of scene text detection techniques have been reported in the literature. Sliding window has been widely used to search for texts in scene images TextFlow (); Coarse2FineConv (); TaoWang (). However, it usually has a low efficiency because it adopts an exhaustive search process by using multiple windows of different sizes and aspect ratios. Region based techniques have been proposed to overcome the low efficiency constraint. For example, the Maximally Stable Extremal Regions (MSER) has been widely used Re1 (); Re21 (); Re5 (); Re13 () for scene text detection. In addition, various hand-crafted text-specific features have also been extensively investigated such as Stroke Width Transform (SWT) Re2 (), Stroke Feature Transform (SFT) Re8 (), text edge specific features Shij (), Stroke End Keypoints (SEK), Stroke Bend Keypoints (SBK) Re20 (), and deep features based regions TDCNN1 (); TDCNN6 (); TDCNN8 (). Different post-processing schemes have also been designed to remove false positives, e.g. heuristic rules based classifiers GVF (); Re13 (); Re15 (); Re18 (), graph processing TextFlow (); Coarse2FineConv (), support vector regression Shij (), convolutional K-means Coarse2FineConv (), distance metric learning Re1 (), AdaBoost Re5 (); Re14 (), random forest Re2 (); Re8 (), convolutional neural networks TaoWang (); Re21 (), etc.
With the advance of convolutional neural networks (CNN), different CNN models have been exploited for the scene text detection tasks. For example, DeepText makes use of convolutional layers for deep feature extraction and inception layers for bounding box prediction TDCNN3DT (). TextBoxes TextBoxes () adopts the Single Shot MultiBox Detector (SSD) SSD () to deal with multi-scale texts in scenes. Quadrilateral anchor boxes have also been proposed for detecting tighter scene text boxes DeepMatchPriorNet (). In addition, a direct regression solution has also been proposed DeepDirectRegress () to remove the hand-crafted anchor boxes. Different CNN based detection and learning schemes have also been explored. For example, some works adopt a bottom-up approach that first detects characters and then groups them into words or text lines TDCNN7 (); TDCNN4 (); WordSup (). Some systems instead define a text boundary class for pixel-level scene text detection WordFence (); SelfOrg (). In addition, weakly supervised and semi-supervised learning approaches WeakNet () have also been studied to address the image annotation constraint WeText ().
2.3 End-to-end scene text reading
End-to-end scene text reading integrates detection and recognition into the same system to read texts in scenes. One popular system is Google-Translation E2E5 () which performs end-to-end scene text reading by integrating a list of techniques including three scene text detection methods, three scene text segmentation and grouping methods, two scene text recognition models, and language models for post-processing. In E2E4 (), sliding window is combined with Histogram of Oriented Gradient feature extraction and a Random Ferns classifier to compute text saliency maps, where words are extracted using Extremal Regions (ER) and further re-scored using a Support Vector Machine (SVM). In E2E3 (), AdaBoost and SVM text classifiers are applied on the text regions extracted using ER to localize scene texts, which are further recognized under an Optical Character Recognition (OCR) framework. A similar approach was also adopted in E2E7 (), where Maximally Stable Extremal Regions (MSER) instead of ER is used for scene text region localization. In E2E9 (), Stroke Width Transform Re2 () is adopted for scene text region detection and Random Forest is used for character recognition, and words are further recognized by component linking, word partition, and dictionary based correction. In GomezE2E (); JarE2E (), potential text regions are first localized using EdgeBox (EB) EB () or an adapted simple selective search for scene text GomezE2E (), and scene texts are further recognized using Jaderberg’s scene text recognition model JarData ().
Quite a number of CNN based end-to-end scene text reading systems have been reported in recent years. In TaoWang (); E2E11 (), a CNN based character recognition model is developed where word information is extracted from a text saliency map using sliding windows. The same framework has been implemented in JarTReg (), where a more robust end-to-end scene text reading system is developed by training a model handling three functions including text and non-text classification, case-insensitive character recognition, and case-sensitive character recognition. In TextBoxes (), an advanced end-to-end scene text reading system is designed where the Single Shot MultiBox Detector (SSD) is employed for scene text detection and a transcription model proposed in E2ETrainable () is adopted for recognition. An end-to-end trainable scene text reading system has also been proposed which can concurrently produce text locations and text transcriptions DTSpotter ().
Our developed end-to-end scene text reading system adopts a similar framework as presented in GomezE2E (); JarE2E () that exploits proposals and existing scene text recognition models. One unique feature is that it uses only around one-fifth of the number of proposals that prior proposal based end-to-end systems use thanks to our proposed pooling based proposal technique and gradient histogram based proposal ranking.
3 Pooling based scene text proposal
The proposed scene text proposal technique follows the general object proposal framework which consists of two major steps including proposal generation and proposal ranking. For the proposal generation, we design a pooling based technique that iteratively groups image edges into possible words or text lines. Here each edge component could be a part of a single character, several neighbouring characters touching each other, or other non-text objects. Each set of grouped image edges thus forms a proposal which can be represented by a bounding box that covers all grouped edges. For proposal ranking, a scoring function is designed which is capable of ranking the determined proposals according to their probability of being text. The ranking strategy employs the histogram of oriented gradient which first learns a number of text and non-text templates and then ranks proposals according to their distances to the learned templates. Fig. 1 illustrates the framework of our proposed scene text proposal technique.
3.1 Proposal generation
A novel pooling based technique is designed for the scene text proposal generation. The idea is inspired by the pooling layer in the convolutional neural network (CNN) which is employed to eliminate insignificant features while shrinking a feature map. Given an image, an edge map is first determined by using the Canny edge detector Canny (), where each binary edge is extracted through connected components (CC) analysis. Each binary edge is then labelled by a unique number indicating the order in which it is found. For example, the first binary edge found is assigned the label 1, the second the label 2, etc. An initial edge feature map can thus be determined by assigning all pixels of a binary edge the same number as the component label and all non-edge pixels the number zero.
The image edge feature map is then processed iteratively through pooling using a pooling window. Take max-pooling as an example. During each pooling iteration, only the pixel with the largest label number within the pooling window is kept for generating the new edge feature map for the next iteration, and all other pixels with a smaller label number are discarded. The binary edges are therefore shifted toward each other iteratively, where those closer to each other are grouped first and those farther away are merged later. The iterative edge merging process terminates when no zero pixels remain in the edge feature map, meaning that there are no more gaps between the labelled binary edges, as illustrated in the ‘Pooling process’ in Fig. 1. Multiple proposals are accordingly generated with different groups of edges throughout the pooling process.
Fig. 2 illustrates the max-pooling based proposal generation process by using a synthetic edge map that contains 3 binary edges labelled 1, 2, and 3. Taking the second row as an example, the first two edges are grouped to form a proposal after the second pooling iteration and the second and third edges are grouped to form another proposal after the third pooling iteration. Since the first and the third edges are both grouped with the second edge, all three edges are also grouped to form a new proposal. For the first and the third rows, a single group of the first and second edges and a single group of all three edges can be derived in a similar way, respectively. By removing duplicated proposals, six proposals are finally determined including the three single edges, the grouped first and second edges, the grouped second and third edges, and the grouped three edges. It should be noted that zero-padding is applied at the right side when the row under study has an even number of pixels left.
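The grouping behaviour described above can be illustrated with a minimal sketch on a single row of edge labels. It is not the authors' implementation: it assumes that two edges are grouped whenever their labels fall inside the same 1-by-3 pooling window, that groups sharing an edge are also merged into a larger group, and that pooling stops once no zero entry remains. The function names `max_pool_row` and `generate_groups` are hypothetical.

```python
import numpy as np

def max_pool_row(row, width=3, stride=2):
    """One 1-by-`width` max-pooling pass over a row of edge labels (0 = no edge)."""
    row = np.asarray(row, dtype=int)
    n = len(row)
    # Zero-pad on the right so that the last pooling window is complete.
    pad = width - n if n < width else (-(n - width)) % stride
    padded = np.concatenate([row, np.zeros(pad, dtype=int)])
    pooled, groups = [], set()
    for start in range(0, len(padded) - width + 1, stride):
        window = padded[start:start + width]
        pooled.append(int(window.max()))       # keep only the largest label
        labels = frozenset(int(v) for v in window if v > 0)
        if len(labels) > 1:                    # assumption: edges meeting in a window are grouped
            groups.add(labels)
    return np.array(pooled, dtype=int), groups

def generate_groups(row, width=3, stride=2):
    """Pool until no zero entry is left and collect every edge group seen on the way."""
    row = np.asarray(row, dtype=int)
    groups = {frozenset([int(v)]) for v in row if v > 0}   # single edges
    while (row == 0).any() and len(row) > 1:
        row, merged = max_pool_row(row, width, stride)
        groups |= merged
    # Groups that share an edge are also merged into a larger group.
    changed = True
    while changed:
        changed = False
        for a in list(groups):
            for b in list(groups):
                if a & b and (a | b) not in groups:
                    groups.add(a | b)
                    changed = True
    return groups

row = [1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3]
for group in sorted(generate_groups(row), key=lambda g: (len(g), sorted(g))):
    print(sorted(group))
```

On this synthetic row the sketch yields the three single edges, the pairs (1, 2) and (2, 3), and the triple (1, 2, 3). In a full system each group would be turned into a bounding box covering its member edges.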
Though a horizontal pooling window of size 1-by-3 and a horizontal stride of 2 are used, the proposed pooling based proposal technique is able to handle non-horizontal words or text lines as long as the constituent letters/digits have some overlap in the vertical direction. This is illustrated in the two synthetic graphs in Fig. 3. As the first graph in Fig. 3 shows, the curved chain of digits 1-6 will be grouped together due to their overlap in the vertical direction. Note that digits/letters can be grouped via other neighbouring digits/letters when they have no overlap in the vertical direction. For example, the digits 4 and 5 can be grouped via the digits 2 and 3 even though they have no vertical overlap. Digits/letters will not be grouped when they have no overlap in the vertical direction and also have no neighbouring digits/letters to leverage, as illustrated in the second graph in Fig. 3.
3.2 Proposal ranking
The histogram of oriented gradients (HoG) has been used successfully for scene text detection and recognition tasks Wang (); Recog4 (). This success shows that scene texts have certain unique HoG features that can differentiate them from other non-text objects. We therefore adapt HoG for proposal ranking, aiming to exploit the unique text-specific HoG features to rank text proposals at the front of the whole proposal list. Different from the traditional HoG, we extract HoG features from the Canny edge pixels only, which we refer to as the Histogram of Oriented Gradient on edges (HoGe) in the ensuing discussion.
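A minimal sketch of a HoGe-style descriptor is given below, restricted to the one property stated above: gradient orientations are accumulated only at edge pixels. Everything else is an assumption made for illustration, namely the unsigned orientation range of 0 to 180 degrees, the weighting by gradient magnitude, the L1 normalisation, and the function name `hoge`. The edge mask is assumed to come from a Canny detector run beforehand.

```python
import numpy as np

def hoge(gray, edge_mask, n_bins=120):
    """Orientation histogram computed on edge pixels only (illustrative sketch).

    gray      : 2-D array, grayscale proposal region
    edge_mask : 2-D boolean array, True at (e.g. Canny) edge pixels
    n_bins    : number of orientation bins, i.e. the feature dimension
    """
    gray = np.asarray(gray, dtype=float)
    mask = np.asarray(edge_mask, dtype=bool)
    gy, gx = np.gradient(gray)                       # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0   # assumption: unsigned orientation
    hist, _ = np.histogram(angle[mask], bins=n_bins, range=(0.0, 180.0),
                           weights=magnitude[mask])  # assumption: magnitude weighting
    total = hist.sum()
    return hist / total if total > 0 else hist       # assumption: L1 normalisation

# Usage: a vertical step edge produces horizontal gradients (orientation 0).
image = np.zeros((8, 8))
image[:, 4:] = 1.0
mask = np.zeros((8, 8), dtype=bool)
mask[:, 3:5] = True
feature = hoge(image, mask)
print(feature.shape, int(feature.argmax()))
```

Restricting the histogram to edge pixels keeps flat background regions from diluting the descriptor, which is the motivation given for HoGe over the traditional HoG.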
In our proposal ranking strategy, a number of text and non-text HoGe templates are first learned from a set of training images, as discussed in Section 3.3.3. Scene text proposals are then scored and ranked according to the distances between their HoGe and the learned text and non-text HoGe templates. The scoring function (Eq. 1) is defined in terms of two distances, $d_t$ and $d_{nt}$, between the feature vector $x$ of a detected proposal and the pre-determined text and non-text templates, respectively.
Each distance is computed over the $n$ templates of the corresponding class, text ($t$) or non-text ($nt$), using the Euclidean distance between $x$ and each text/non-text feature template. The score function in Eq. 1 is designed based on the observation that the feature vector of a text proposal is usually closer to text templates than to non-text templates. The feature vector of a text proposal will thus produce a small $d_t$ and a large $d_{nt}$, which in turn lead to a high text probability score.
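The ranking idea can be sketched as follows. The exact form of the score function is not reproduced here, so the sketch assumes one plausible choice with the stated behaviour: the score is the ratio of the non-text distance to the sum of both distances, and each distance is the mean Euclidean distance to the templates of one class. Both choices, and the names `mean_template_distance`, `text_score` and `rank_proposals`, are assumptions made for illustration.

```python
import numpy as np

def mean_template_distance(x, templates):
    """Mean Euclidean distance between feature x and each template row."""
    x = np.asarray(x, dtype=float)
    templates = np.asarray(templates, dtype=float)
    return float(np.linalg.norm(templates - x, axis=1).mean())

def text_score(x, text_templates, nontext_templates, eps=1e-12):
    """Assumed score: high when x is close to text and far from non-text templates."""
    d_t = mean_template_distance(x, text_templates)
    d_nt = mean_template_distance(x, nontext_templates)
    return d_nt / (d_t + d_nt + eps)

def rank_proposals(features, text_templates, nontext_templates):
    """Return proposal indices sorted from most to least text-like."""
    scores = [text_score(f, text_templates, nontext_templates) for f in features]
    return sorted(range(len(features)), key=lambda i: scores[i], reverse=True)

# Usage with toy 2-D features; real features would be HoGe vectors.
text_t = [[1.0, 0.0], [0.9, 0.1]]
nontext_t = [[0.0, 1.0], [0.1, 0.9]]
proposals = [[0.1, 0.9], [0.95, 0.05], [0.5, 0.5]]
print(rank_proposals(proposals, text_t, nontext_t))
```

Under this assumed form, a small text distance and a large non-text distance push the score toward 1, which matches the qualitative behaviour described for the score function.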
3.3.1 Pooling and edge labelling
As described in Section 3.1, we assign edge labels according to the searching order (from left to right column by column and from top to bottom in each column) and adopt max-pooling to group text edges within the same line. However, the proposed technique can also work with different edge label assignment and pooling methods. Two new tests are performed for verification. The first test studies two more edge labelling methods that assign edge labels by using the maximum and the mean gradient of pixels within a CC, respectively (denoted maxE and meanE in Table 1). Take meanE as an example. It first calculates the mean gradient of each CC and then labels all edge pixels of that CC with the calculated mean gradient directly. The second test studies the min-pooling method that keeps the smallest instead of the largest edge label (as in max-pooling) falling within the same pooling window.
Table 1 shows the test results on the test images of the ICDAR2003 and ICDAR2013 datasets, where searL denotes a labelling method that assigns edge labels according to the edge searching order, maxP and minP denote max-pooling and min-pooling, respectively. So maxEmaxP means that image edges are labelled by using the maximum gradient and pooling is performed by max-pooling. The very close proposal recalls under different window sizes and strides in Table 1 verify that our proposed technique is tolerant to both edge labelling methods and edge label pooling methods.
3.3.2 Proposal generation
As the optimization of proposal generation targets the best proposal recall, we relax the number of proposals and include all generated proposals while studying the size of the pooling window and strides. We adopt a grid search to study the two key sets of parameters, namely the pooling window size (width and height) and the stride values (a horizontal stride and a vertical stride). In particular, we vary the size of the pooling window and the strides from 1 to 5, which produces 600 (24 x 25) parameter settings. Note that the pooling window size 1-by-1 is not included as it does not perform any grouping operations.
Fig. 4 shows proposal recalls under the 600 parameter settings, presented as a heat-map, where each recall is the average of the three recalls obtained with IoU thresholds of 0.5, 0.7, and 0.8. As shown in Fig. 4, the two numbers at the bottom of each column refer to the row number (the number at the top) and the column number (the number at the bottom) of the pooling window, respectively. The two numbers at the left of each row refer to strides in the vertical direction (the number on the left) and horizontal direction (the number on the right), respectively. We further sort the recalls under the 600 settings and the table on the right shows several best-performing settings. In our implemented system, we take a compromise between recall rate and processing time and select the combination of a 1-by-3 pooling window and strides of 1-by-2 in the vertical and horizontal directions.
Several factors need to be taken into consideration while setting the pooling window and strides. The first is the absolute size of the pooling window, which defines the minimum distance between neighbouring edges that the pooling based proposal technique can capture. For example, a large pooling window of size 2x4 will not be able to capture distances of 1, 2 and 3 pixels between neighbouring edges, whereas a pooling window of size 2x2 is capable of capturing distances as small as 1 pixel, as illustrated in Fig. 5. The second is the specific setting of rows and columns of the pooling window and of strides in the horizontal and vertical directions. In particular, an increase of coverage/jump in the vertical direction often deteriorates the proposal performance, as illustrated in the heat-map in Fig. 4. One reason could be that most texts in scenes are positioned in a horizontal direction. In addition, a pooling window with a big span in the vertical direction often groups texts with neighbouring non-text objects lying above or below the texts. The third is the overlap between two consecutive pooling windows, which happens when the stride in the horizontal direction is smaller than the width of the pooling window. A smaller stride often produces a better recall rate, meaning that overlaps between two consecutive pooling windows help to produce better proposals. In fact, the proposal performance drops considerably when there is no overlap at all between two consecutive pooling windows, as illustrated in the heat-map in Fig. 4.
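The size of the search grid can be checked with a short enumeration. The sketch below lists every pooling window and stride combination described above; the helper names `pooling_settings` and `best_setting` are hypothetical, and the `evaluate` callable stands in for whatever routine measures average proposal recall for one setting.

```python
from itertools import product

def pooling_settings(max_size=5):
    """All (window, stride) pairs with sizes 1..max_size, excluding the 1-by-1 window."""
    sizes = range(1, max_size + 1)
    windows = [(h, w) for h, w in product(sizes, sizes) if (h, w) != (1, 1)]
    strides = list(product(sizes, sizes))        # (vertical, horizontal)
    return list(product(windows, strides))

def best_setting(evaluate, settings):
    """Pick the setting with the highest value of evaluate(window, stride)."""
    return max(settings, key=lambda s: evaluate(*s))

settings = pooling_settings()
print(len(settings))   # 24 windows x 25 stride pairs
```

Excluding the 1-by-1 window leaves 24 window shapes, and the 25 stride pairs bring the total to 600 settings.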
3.3.3 Proposal ranking
We adopt a grid search strategy to investigate the optimal HoGe dimension and the number of text and non-text templates. The dimension of the HoGe feature vector refers to the number of histogram bins within the HoGe, which we change from 10 to 180 with a step of 10. The number of text and non-text templates is varied from 5 to 100 with a step of 5. The full combination of the two sets of parameters thus gives 360 (18x20) settings. Different from the proposal generation optimization, we limit the maximum proposal number to 2000 (a reasonable number that balances recall against the ensuing computational cost ObjDT ()) for the evaluation of proposal recalls. Under each parameter setting, an average recall is computed over all images within the validation set when three IoU thresholds of 0.5, 0.7 and 0.8 are used. Additionally, 80% of the training images of the ICDAR2003 and the ICDAR2013 datasets are used for training and the remaining 20% are used for validation in our study.
Table 2 shows the first ten best-performing settings of the two parameters which are sorted according to the average recall under the three IoU thresholds. As Table 2 shows, the recalls are quite close to each other around the best parameter settings. In our implemented system, we select the 25 text/non-text templates and template dimension of 120, i.e., the setting (25, 120), as a compromise of detection recall and detection efficiency.
4 Automatic scene text reading
We also develop an end-to-end scene text reading system by integrating the proposed pooling based proposal technique and a state-of-the-art scene text recognition model JarData () which is trained on a generic 90k word list and recognizes words directly. Given an image, a number of scene text proposals are first determined by using the proposed pooling based technique. Each detected proposal is then fed to the word recognition model JarData () to derive a word recognition score, and it is discarded if the recognition score is too low or the recognized word is not in the lexicon list. After that, non-maximum suppression (NMS) is applied to keep the proposal with the maximum score and remove those with lower scores. Additionally, a word based NMS is also implemented to remove duplicate proposals of the same word. In particular, only the proposal that has the maximum recognition score is kept as the reading output when more than one proposal overlaps with another and produces the same recognized word. More details will be discussed in Section 5.3.
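The filtering and word based suppression steps can be sketched as below. This is an illustration under stated assumptions rather than the authors' pipeline: the recognizer output is represented as plain `(box, word, score)` tuples, the score threshold `min_score` is a hypothetical value, and two boxes are treated as overlapping whenever their IoU exceeds `iou_thresh` (any overlap by default).

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def filter_by_recognition(detections, lexicon, min_score=0.5):
    """Drop proposals with a low recognition score or an out-of-lexicon word."""
    return [d for d in detections if d[2] >= min_score and d[1] in lexicon]

def word_nms(detections, iou_thresh=0.0):
    """Among overlapping proposals that read the same word, keep the best-scoring one."""
    kept = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        box, word, _ = det
        if all(k[1] != word or box_iou(box, k[0]) <= iou_thresh for k in kept):
            kept.append(det)
    return kept

# Usage: (box, recognized word, recognition score)
detections = [
    ((0, 0, 10, 10), "cafe", 0.9),
    ((1, 1, 11, 11), "cafe", 0.8),          # duplicate of the first
    ((100, 100, 120, 110), "cafe", 0.7),    # same word, different place
    ((0, 0, 10, 10), "xyz", 0.95),          # not in the lexicon
    ((50, 50, 60, 60), "open", 0.2),        # score too low
]
result = word_nms(filter_by_recognition(detections, {"cafe", "open"}))
print([(word, score) for _, word, score in result])
```

Processing detections in descending score order means that a lower-scoring duplicate is always compared against an already accepted, higher-scoring proposal of the same word.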
5 Experiments and results
5.1 Experiment setup and evaluation metrics
Given the very similar performance under different label assignment and pooling methods as described in Section 3.3.1, we label image edges by their searching order and use max-pooling in the ensuing evaluations and benchmarking against the state of the art. In addition, the size of the pooling window is fixed at 1-by-3 and the strides are set at 2 and 1 pixels in the horizontal and vertical directions as described in Section 3.3.2. Further, 25 feature templates are used for both text and non-text classes and the dimension of the HoGe is fixed at 120 bins as discussed in Section 3.3.3.
Three datasets are used in evaluations and comparisons including the focused scene text dataset used in the Robust Reading Competition 2015 (ICDAR2015) ICDAR2013 (), the Street View Text (SVT) SVT () and the MSRA-TD500 Re3 (). The ICDAR2015 contains 229 training images and 233 testing images, and the SVT contains 101 training images and 249 testing images. The scene text images in both datasets suffer from a wide range of image degradation but most texts are horizontal and printed in English. The MSRA-TD500 contains 500 images including 300 training images and 200 test images, where scene texts are in arbitrary orientations and a mixture of English and Chinese. It is used to show that the proposed technique can work with scene texts in different orientations and languages.
The proposal quality is evaluated by the recall rate, the number of proposals selected and the computation time. The criterion is that a better proposal technique achieves a higher recall rate with a smaller number of proposals and a lower computation cost. While benchmarking different proposal techniques, the recall rate can be compared by fixing the number of proposals, say at 2000, which is a widely adopted number ObjDT (). In addition, the recall rate is also affected by the IoU threshold, where a larger IoU threshold usually leads to a lower recall rate. For the scene text reading system, two evaluation criteria are adopted as used in the robust reading competitions Wang (), namely, the end-to-end based and the spotting based. The end-to-end based evaluation focuses on alphanumeric words, while the spotting based evaluation targets words consisting of letters only. In particular, a correct word should have at least three characters (otherwise it is ignored), and only proposals that have over 50% overlap with the corresponding ground truth boxes and contain correctly recognized words are counted as true positives.
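The recall measure used for proposal evaluation can be written down compactly. The sketch below assumes axis-aligned boxes in `(x1, y1, x2, y2)` form and counts a ground-truth box as recalled when at least one of the top-ranked proposals reaches the IoU threshold; the function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def proposal_recall(gt_boxes, ranked_proposals, iou_thresh=0.5, max_proposals=2000):
    """Fraction of ground-truth boxes matched by one of the top-ranked proposals."""
    if not gt_boxes:
        return 0.0
    top = ranked_proposals[:max_proposals]
    hits = sum(1 for g in gt_boxes
               if any(box_iou(g, p) >= iou_thresh for p in top))
    return hits / len(gt_boxes)

# Usage: recall averaged over the three IoU thresholds used in this paper.
gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
proposals = [(0, 0, 10, 9), (100, 100, 110, 110), (20, 20, 30, 30)]
recalls = [proposal_recall(gt, proposals, t) for t in (0.5, 0.7, 0.8)]
print(sum(recalls) / len(recalls))
```

Truncating the ranked list with `max_proposals` is what makes the ranking quality matter: a good proposal placed beyond the cut-off does not contribute to recall.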
5.2 Comparisons with the state of the art
The proposed technique (MPT) is compared to several state-of-the-art scene text proposal techniques including Simple Selective Search for Text Proposal (TP) GomezE2E (), Symmetry Text Line (STL) STL (), and DeepText (DT) TDCNN3DT (). In addition, we also compare the MPT with several state-of-the-art generic object proposal methods including EdgeBox (EB) EB (), Geodesic (GOP) GOP (), Randomized Prime (RP) RP (), and Multiscale Combination Grouping (MCG) MCG (). All these techniques are implemented in Matlab except TP and STL which are implemented in C++. All evaluations are performed on an HP workstation with an Intel Xeon 3.5GHz x 12 CPU and 32GB of RAM.
Fig. 6 shows experimental results on the ICDAR2015 dataset. The graph on the left shows proposal recalls when the IoU threshold changes from 0.5 to 1 with a step of 0.05 and 2000 proposals are selected from each image. The graph on the right shows recalls while the number of proposals varies from 1 to 2000 when the IoU threshold is fixed at 0.8. As the graph on the left shows, DT demonstrates competitive recalls when the IoU threshold lies between 0.5 and 0.6, but its recall drops dramatically when the IoU threshold increases. TP and STL are more stable than DT as they both use hand-crafted text-specific features, but their recalls are lower than those of the proposed MPT except when the IoU threshold is larger than 0.9, which is seldom adopted in real systems. In the right graph, the proposed MPT outperforms most compared techniques as the number of proposals changes. In fact, it even outperforms DT which adopts a deep learning approach. Note that the recalls of DT are only evaluated in the range of 100-500 proposals because it sets the maximum proposal number at 500.
Table 3: Proposal recall at IoU thresholds 0.5, 0.7 and 0.8, Nppb, and time (s) on the ICDAR2015 test images. Methods compared: TP GomezE2E (), STL STL (), DT TDCNN3DT (), EB EB (), GOP GOP (), RP RP (), and MCG MCG ().
Table 4: Proposal recall at IoU thresholds 0.5, 0.7 and 0.8, Nppb, and time (s) on the SVT test images. Methods compared: TP GomezE2E (), STL STL (), EB EB (), GOP GOP (), RP RP (), and MCG MCG ().
We also studied the number of needed proposals for good recalls and computational cost. Tables 3 and 4 show the experimental results on the test images of the dataset ICDAR2015 and SVT. It can be seen that the proposed MPT outperforms other proposal techniques in most cases for both datasets. TP is also competitive but it requires a larger number of proposals and also higher computational cost. EB is the most efficient and MCG requires a smaller number of proposals but both methods have low recalls under different IoU thresholds.
Fig. 7 illustrates the performance of the proposed MPT and compares it with two state-of-the-art techniques TP and STL (green boxes indicate proposals and red boxes indicate ground-truth). Several sample images are selected from the ICDAR2015 and SVT datasets which suffer from different types of degradations including text size variation (the first images from left), uneven illumination (the second image), ultra-low contrast (the third image), and perspective distortion (the fourth and fifth images). A series of numbers are shown under each image which correspond to the position of each proposal within the ranked proposal list (the smaller, the better). As Fig. 7 shows, the proposed MPT can deal with different types of image degradation and demonstrates superior proposal performance as compared with TP and STL. It should be noted that Fig. 7 only shows good proposals that have over 80% overlap with ground-truth boxes. In addition, each text ground truth has more than one good proposal and Fig. 7 only shows the proposal which is ranked at the front-most with the smallest index number within the ranked proposal list.
The proposed technique can also detect scene texts in different orientations and languages. We demonstrate this capability using the MSRA-TD500 dataset Re3 (), which contains scene texts in arbitrary orientations and a mixture of English and Chinese. Experiments show that recalls of 88.14%, 83.33% and 75.77% are obtained under the IoU thresholds of 0.5, 0.7 and 0.8, respectively. These recalls are comparable to those achieved on the ICDAR2015 and SVT datasets (as shown in Tables 3 and 4), where most texts are almost horizontal and printed in English. Fig. 8 shows several sample images from MSRA-TD500 that capture English and Chinese texts in arbitrary orientations, as well as the text proposals generated by our proposed technique. As Fig. 8 shows, the proposed technique is capable of detecting English and Chinese texts when neighbouring characters overlap to some extent in the vertical direction. It fails when scene texts are vertically oriented, as shown in the two images in the last column. Note that a maximum of 2000 proposals are generated for each image, and the proposals shown in Fig. 8 are those having the best overlap with the ground truth.
The superior performance of the MPT is largely attributed to the proposed pooling based grouping strategy, which captures the typical text layout and appearance in scenes, i.e. characters are usually closer to each other (as compared with neighbouring non-text objects) and form words and text lines. In fact, the proposed grouping strategy can also handle texts with broken edges as long as they have a certain overlap in the vertical direction. As a comparison, EdgeBox (EB) EB () also makes use of image edges but has a much lower recall rate, largely due to its different grouping strategy. Besides the proposed grouping strategy, the HoGe based proposal ranking helps to shift scene text proposals to the front of the sorted list, which also contributes to the superior performance of the proposed MPT technique when a limited number of proposals are selected.
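The intuition that pooling merges nearby character edges into one group can be illustrated with a toy example. The sketch below is not the authors' implementation: the stride-1 pooling with a 1x3 window, the 4-connected component labelling, and all function names are assumptions chosen only to show how edge pixels separated by small gaps fall into the same group after pooling, while distant edges stay separate.

```python
def max_pool(grid, win_h, win_w):
    """Stride-1 max pooling with an odd-sized window; output keeps the input size."""
    rows, cols = len(grid), len(grid[0])
    rh, rw = win_h // 2, win_w // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = max(
                grid[rr][cc]
                for rr in range(max(0, r - rh), min(rows, r + rh + 1))
                for cc in range(max(0, c - rw), min(cols, c + rw + 1))
            )
    return out


def component_boxes(grid):
    """Bounding boxes (x1, y1, x2, y2) of 4-connected non-zero regions."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            stack = [(r, c)]
            seen[r][c] = True
            r0 = r1 = r
            c0 = c1 = c
            while stack:  # iterative flood fill
                y, x = stack.pop()
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            boxes.append((c0, r0, c1, r1))
    return boxes


# Toy binary edge map: three close vertical strokes and one distant stroke.
row = [0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0]
edges = [row[:] for _ in range(3)]

boxes_before = component_boxes(edges)   # every stroke is its own component
pooled = max_pool(edges, 1, 3)          # horizontal pooling bridges 1-pixel gaps
boxes_after = component_boxes(pooled)   # close strokes merge, the far one does not
```

In this toy setting the three close strokes end up in a single box after pooling, which mirrors the observation that characters of a word lie closer to each other than to neighbouring non-text objects. Pooling also widens each group by one pixel on either side, a side effect of this simplified sketch.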
5.3 End-to-end and word spotting
|ConvLSTM E2E11 ()||84.93||98.91||91.39||84||97.29||90.16|
|TextBoxes TextBoxes ()||84||90.77||97.25||93.90||87.38||97.02||91.95|
|Method||End-to-end Scene Text Reading|
|Jar-E2E JarE2E ()||82.12||91.05||86.35||-||-||-|
|ConvLSTM E2E11 ()||79.39||96.68||87.19||79.28||94.91||86.39|
Tables 5 and 6 compare our developed end-to-end system with several state-of-the-art end-to-end scene text reading systems, including several CNN-based systems: Jar-E2E JarE2E (), ConvLSTM E2E11 (), DeepTextSpotter DTSpotter (), and TextBoxes TextBoxes (), as well as several proposal-based systems: EB_Sys, TP_Sys, and STL_Sys, which are constructed by combining the EB, TP, and STL scene text proposal techniques with Jaderberg's scene text recognition model. The comparisons are based on precision, recall and f-measure on the ICDAR2015 dataset and the SVT dataset. As the two tables show, the performance of the proposed system is clearly better than that of the other proposal-based systems and is comparable to that of the CNN-based systems. Note that TextBoxes TextBoxes () trains two dedicated networks for detection and recognition, and it was trained using a huge number of images, including images in SynthText SynText () (containing 800,000 images) as well as training images in the ICDAR2011 and ICDAR2013 datasets. As a comparison, our proposed system was trained using only 479 training images in the ICDAR2003 and ICDAR2013 datasets.
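For reference, the f-measure used in such comparisons is commonly the harmonic mean of precision and recall. The short sketch below assumes that standard definition (the function name is hypothetical); since the harmonic mean is symmetric, it does not matter which of the two inputs is precision and which is recall.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Values taken from the Jar-E2E and ConvLSTM rows reported in this section.
f_jar = round(f_measure(82.12, 91.05), 2)
f_convlstm = round(f_measure(79.39, 96.68), 2)
```

Under this definition the two results are 86.35 and 87.19, which agree with the third value reported in the corresponding Jar-E2E and ConvLSTM rows.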
Fig. 9 shows a number of sample images that illustrate the performance of our developed end-to-end scene text reading system. As Fig. 9 shows, the proposed technique is capable of detecting and recognizing challenging texts with small text size (the first image in the first row), poor illumination and motion blur (the second images in the first and second rows), and perspective distortion (the second image in the third row and the third image in the second row). The superior scene text reading performance is largely due to the robustness of the proposed scene text proposal technique and the integrated scene text recognition model. Note that the proposed technique may fail when scene texts have ultra-low contrast or are printed in certain odd styles, as illustrated in the sample images in the last row.
This paper presents a pooling based scene text proposal technique as well as its application to end-to-end scene text reading. The scene text proposal technique is inspired by the CNN pooling layer and is capable of grouping image edges into words and text lines accurately and efficiently. A novel score function is also designed which ranks the generated proposals according to their probabilities of being text and accordingly helps to greatly reduce the number of false-alarm proposals. Further, the proposed technique does not rely on heuristic thresholds/parameters such as text sizes, inter-character distances, etc. that are widely used in many existing techniques. Extensive experiments show that the pooling based proposal technique achieves superior performance as compared with state-of-the-art techniques. In addition, the integration of the pooling based proposal technique into an end-to-end scene text reading system also demonstrates state-of-the-art scene text reading performance.
Dinh is a PhD student at Sorbonne University - University Pierre and Marie Curie, France. His current research is carried out at the Image & Pervasive Access Lab (IPAL, UMI2955, CNRS) in collaboration with the Institute for Infocomm Research (I2R, A-STAR), Singapore. His major research interests are visual understanding and machine learning.
Shijian is an Assistant Professor with the School of Computer Science & Engineering, Nanyang Technological University, Singapore. His major research interests include image and video analytics, visual intelligence, and machine learning. He has published more than 80 international journal and conference papers and co-authored over 10 patents in these research areas.
Shangxuan-Tian is a Senior Researcher at Tencent, China. Previously, he worked as a Research Scientist at the Institute for Infocomm Research, Singapore. He received his Ph.D. degree from the School of Computing, National University of Singapore. His research interests include object detection and recognition, and text understanding in scene images.
Nizar-Ouarti is an Associate Professor at Sorbonne UPMC in France. He has recently been on CNRS delegation at the IPAL laboratory in Singapore. He received a PhD from College de France in 2007 and was a postdoctoral researcher at INRIA. His topics of interest are ego-motion, computer vision and robotics.
Mounir-Mokhtari is a Professor at Institut MINES TELECOM, France, Director of the IPAL-CNRS French-Singaporean joint lab, Singapore, and Research Associate at CNRS-LIRMM Montpellier, France. His background is in human-machine interaction in the domain of Ambient Assistive Living. He has over 100 publications in journals, books and international conferences.
- (1) OrCam, Orcam, see for yourself (Accessed on 05-Dec-2017).
- (2) translations (Accessed on 05-Dec-2017).
- (3) D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. Gomez i Bigorda, S. Robles Mestre, J. Mas, D. Fernandez Mota, J. Almazàn, L. P. de las Heras, Icdar 2013 robust reading competition, Proceedings of the 2013 12th International Conference on Document Analysis and Recognition (2013) 1484–1493.
- (4) D. Karatzas, L. Gomez, A. Nicolaou, S. K. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, E. Valveny, Icdar 2015 competition on robust reading, Proceedings of the 2015 13th International Conference on Document Analysis and Recognition.
- (5) K. Wang, S. Belongie, Word spotting in the wild, Proceedings of the 11th European conference on Computer vision (ECCV) (2010) 591–604.
- (6) B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. Belongie, S. Lu, X. Bai, Icdar2017 competition on reading chinese text in the wild (rctw-17), arXiv preprint arXiv:1708.09585.
- (7) S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, C. L. Tan, Text flow: A unified text detection system in natural scene images, IEEE International Conference on Computer Vision (2015) 4651–4659.
- (8) S. Lu, T. Chen, S. Tian, J. Lim, C. L. Tan, Scene text extraction based on edges and support vector regression, International Journal on Document Analysis and Recognition 18 (2015) 125–135.
- (9) P. Shivakumara, Q. Pham, S. Lu, C. L. Tan, Gradient vector flow and grouping-based method for arbitrarily oriented scene text detection in video images, IEEE Transactions on Circuits and Systems for Video Technology 23 (2013) 1729–1739.
- (10) Z. Tian, W. Huang, T. He, P. He, Y. Qiao, Detecting text in natural image with connectionist text proposal network, European Conference on Computer Vision (ECCV) (2016) –.
- (11) D. Nguyen, S. Lu, X. Bai, N. Ouarti, M. Mokhtari, A max-pooling based grouping strategy for scene text localization, International Conference on Document Analysis and Recognition (ICDAR) (2017) –.
- (12) J. Hosang, R. Benenson, B. Schiele, How good are detection proposals, really?, Proceedings of the British Machine Vision Conference (2014) –.
- (13) Z. Zhang, W. Shen, C. Yao, X. Bai, Symmetry-based text line detection in natural scenes, IEEE Conference on Computer Vision and Pattern Recognition (2015) 2558–2567.
- (14) Z. Zhong, L. Jin, S. Huang, Deeptext: A new approach for text proposal generation and text detection in natural images, IEEE International Conference on Acoustics, Speech and Signal Processing (2017) 1208–1212.
- (15) L. Gomez, D. Karatzas, Textproposals: a text-specific selective search algorithm for word spotting in the wild, Pattern Recognition 70 (2017) 60–74.
- (16) D. Nguyen, S. Lu, N. Ouarti, M. Mokhtari, Text-edge-box: An object proposal approach for scene text localization, IEEE Winter Conference on Application of Computer Vision (2017) 1296–1305.
- (17) L. Zitnick, P. Dollar, Edge boxes: Locating object proposals from edges, European Conference on Computer Vision (2014) 391–405.
- (18) M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Reading text in the wild with convolutional neural networks, International Journal of Computer Vision 116 (2016) 1–20.
- (19) J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, A. W. M. Smeulders, Selective search for object recognition, International Journal of Computer Vision 104 (2) (2013) 154–171.
- (20) Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, P. Fu, Z. Luo, R2cnn: Rotational region CNN for orientation robust scene text detection, CoRR abs/1706.09579 (2017) –.
- (21) S. Zhu, R. Zanibbi, A text detection system for natural scenes with convolutional feature learning and cascaded classification, Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 625–632.
- (22) T. Wang, D. J. Wu, A. Coates, A. Y. Ng, End-to-end text recognition with convolutional neural networks, Proceedings of the 2012 International Conference on Pattern Recognition (ICPR) (2012) 3304–3308.
- (23) X. Yin, X. Yin, K. Huang, H. Hao, Robust text detection in natural scene images, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (5) (2014) 970–983.
- (24) S. Qin, R. Manduchi, A fast and robust text spotter, IEEE International Conference on Applications of Computer Vision (WACV) (2016) 1–8.
- (25) H. Cho, M. Sung, B. Jun, Canny text detector: Fast and robust scene text localization algorithm, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 3566–3573.
- (26) M. C. Sung, B. Jun, H. Cho, D. Kim, Scene text detection with robust character candidate extraction method, International Conference on Document Analysis and Recognition (ICDAR) (2015) 426–430.
- (27) B. Epshtein, E. Ofek, Y. Wexler, Detecting text in natural scenes with stroke width transform, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2010) 2963–2970.
- (28) W. Huang, Z. Lin, J. Yang, Y. Wang, Text localization in natural images using stroke feature transform and text covariance descriptors, IEEE International Conference on Computer Vision (ICCV) (2013) 1241–1248.
- (29) M. Buta, L. Neumann, J. Matas, Fastext: Efficient unconstrained scene text detector, IEEE International Conference on Computer Vision (ICCV) (2015) 1206–1214.
- (30) C. Yao, X. Bai, N. Sang, X. Zhou, S. Zhou, Z. Cao, Scene text detection via holistic, multi-channel prediction, CoRR abs/1606.09002 (2016) –.
- (31) Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, X. Bai, Multi-oriented text detection with fully convolutional networks, IEEE Conference on Computer Vision and Pattern Recognition (2016) 4159–4167.
- (32) T. He, W. Huang, Y. Qiao, J. Yao, Text-attentional convolutional neural network for scene text detection, IEEE Transactions on Image Processing 25 (6) (2016) 2529–2541.
- (33) C. Yi, Y. Tian, Text string detection from natural scenes by structure-based partition and grouping, IEEE Transactions on Image Processing 20 (9) (2011) 2594–2605.
- (34) L. Sun, Q. Huo, W. Jia, K. Chen, A robust approach for text detection from natural scene images, Pattern Recognition 48 (9) (2015) 2906–2920.
- (35) L. Gomez, D. Karatzas, Mser-based real-time text detection and tracking, International Conference on Pattern Recognition (ICPR) (2014) 3110–3115.
- (36) M. Liao, B. Shi, X. Bai, X. Wang, W. Liu, Textboxes: A fast text detector with a single deep neural network, Association for the Advancement of Artificial Intelligence (2017) 4161–4167.
- (37) W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. Berg, Single shot multibox detector, European Conference on Computer Vision (2016) 21–37.
- (38) Y. Liu, L. Jin, Deep matching prior network: Toward tighter multi-oriented text detection, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) –.
- (39) W. He, X.-Y. Zhang, F. Yin, C.-L. Liu, Deep direct regression for multi-oriented scene text detection, The IEEE International Conference on Computer Vision (ICCV) (2017) –.
- (40) B. Shi, X. Bai, S. J. Belongie, Detecting oriented text in natural images by linking segments, CoRR abs/1703.06520 (2017) –.
- (41) H. Hu, C. Zhang, Y. Luo, Y. Wang, J. Han, E. Ding, Wordsup: Exploiting word annotations for character based text detection, The IEEE International Conference on Computer Vision (ICCV) (2017) –.
- (42) A. Polzounov, A. Ablavatski, S. Escalera, S. Lu, J. Cai, Wordfence: Text detection in natural images with border awareness, CoRR abs/1705.05483 (2017) –.
- (43) Y. Wu, P. Natarajan, Self-organized text detection with minimal post-processing via border learning, The IEEE International Conference on Computer Vision (ICCV) (2017) –.
- (44) G. Papandreou, L.-C. Chen, K. P. Murphy, A. L. Yuille, Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation, The IEEE International Conference on Computer Vision (ICCV) (2015) –.
- (45) S. Tian, S. Lu, C. Li, Wetext: Scene text detection under weak supervision, The IEEE International Conference on Computer Vision (ICCV) (2017) –.
- (46) A. Bissacco, M. Cummins, Y. Netzer, H. Neven, Photoocr: Reading text in uncontrolled conditions, IEEE International Conference on Computer Vision (ICCV) (2013) 785–792.
- (47) K. Wang, B. Babenko, S. Belongie, End-to-end scene text recognition, International Conference on Computer Vision (2011) 1457–1464.
- (48) L. Neumann, J. Matas, Real-time scene text localization and recognition, IEEE Conference on Computer Vision and Pattern Recognition (2012) 3538–3545.
- (49) L. Neumann, J. Matas, Efficient scene text localization and recognition with local character refinement, International Conference on Document Analysis and Recognition (ICDAR) (2015) 746–750.
- (50) C. Yao, X. Bai, W. Liu, A unified framework for multioriented text detection and recognition, IEEE Transactions on Image Processing 23 (11) (2014) 4737–4749.
- (51) M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Synthetic data and artificial neural networks for natural scene text recognition, Workshop on Deep Learning, NIPS.
- (52) H. Li, C. Shen, Reading car license plates using deep convolutional neural networks and lstms, CoRR abs/1601.05610 (2016) –.
- (53) M. Jaderberg, A. Vedaldi, A. Zisserman, Deep features for text spotting, European Conference on Computer Vision (2014) 512–528.
- (54) B. Shi, X. Bai, C. Yao, An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, PP, (2016) –.
- (55) M. Busta, L. Neumann, J. Matas, Deep textspotter: An end-to-end trainable scene text localization and recognition framework, The IEEE International Conference on Computer Vision (ICCV) (2017) –.
- (56) J. Canny, A computational approach to edge detection, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-8 (6) (1986) 679–698.
- (57) K. Wang, B. Babenko, S. Belongie, Word spotting in the wild, Proceedings of the 2011 International Conference on Computer Vision (ICCV) (2011) 1457–1464.
- (58) A. Mishra, K. Alahari, C. Jawahar, Enhancing energy minimization framework for scene text recognition with top-down cues, Journal on Computer Vision and Image Understanding 145 (2016) 30–42.
- (59) R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, IEEE Conference on Computer Vision and Pattern Recognition (2014) 580–587.
- (60) C. Yao, X. Bai, W. Liu, Y. Ma, Z. Tu, Detecting texts of arbitrary orientations in natural images, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2012) 1083–1090.
- (61) P. Krahenbuhl, V. Koltun, Geodesic object proposals, European Conference on Computer Vision (2014) 725–739.
- (62) S. Manen, M. Guillaumin, L. V. Gool, Prime object proposals with randomized prim’s algorithm, IEEE International Conference on Computer Vision (2013) 2536–2543.
- (63) J. Pont-Tuset, P. Arbelaez, J. Barron, F. Marques, J. Malik, Multiscale combinatorial grouping for image segmentation and object proposal generation, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (1) (2017) 128–140.
- (64) A. Gupta, A. Vedaldi, A. Zisserman, Synthetic data for text localisation in natural images, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 2315–2324.