Robust Text Detection in Natural Scene Images
Abstract
Text detection in natural scene images is an important prerequisite for many content-based image analysis tasks. In this paper, we propose an accurate and robust method for detecting texts in natural scene images. A fast and effective pruning algorithm is designed to extract Maximally Stable Extremal Regions (MSERs) as character candidates using the strategy of minimizing regularized variations. Character candidates are grouped into text candidates by the single-link clustering algorithm, where the distance weights and threshold of the clustering algorithm are learned automatically by a novel self-training distance metric learning algorithm. The posterior probabilities of text candidates corresponding to non-text are estimated with a character classifier; text candidates with high probabilities are then eliminated, and finally texts are identified with a text classifier. The proposed system is evaluated on the ICDAR 2011 Robust Reading Competition dataset; the f-measure is over 76% and is significantly better than the state-of-the-art performance of 71%. Experimental results on a publicly available multilingual dataset also show that our proposed method can outperform the other competitive method with an f-measure increase of over 9 percent. Finally, we have set up an online demo of our proposed scene text detection system at http://kems.ustb.edu.cn/learning/yin/dtext.
scene text detection, maximally stable extremal regions, single-link clustering, distance metric learning
1 Introduction
Text in images contains valuable information and is exploited in many content-based image and video applications, such as content-based web image search, video information retrieval, and mobile-based text analysis and recognition [Zhong2000, Doermann2000, Weinman2009, Yin2011, Chew2011]. Due to complex backgrounds and variations of font, size, color and orientation, text in natural scene images has to be robustly detected before being recognized or retrieved.
Existing methods for scene text detection can roughly be categorized into three groups: sliding window based methods [derivative_feature, adaboost_text, Kim2003], connected component based methods [Epshtein, structurepartition, colorclustering], and hybrid methods [pan]. Sliding window based methods, also called region based methods, use a sliding window to search for possible texts in the image and then use machine learning techniques to identify texts. These methods tend to be slow as the image has to be processed at multiple scales. Connected component based methods extract character candidates from images by connected component analysis and then group character candidates into text; additional checks may be performed to remove false positives. The hybrid method presented by Pan et al. [pan] exploits a region detector to detect text candidates and extracts connected components as character candidates by local binarization; non-characters are eliminated with a Conditional Random Fields (CRFs) [crf] model, and characters can finally be grouped into text. More recently, Maximally Stable Extremal Region (MSER) based methods, which actually fall into the family of connected component based methods but use MSERs [mser] as character candidates, have become the focus of several projects [icdar2011, edge_mser, head_mounted, real_time, pruned_search, neumann_method, mser2013].
Although an MSER based method is the winning method on the benchmark data, i.e., the ICDAR 2011 Robust Reading Competition [icdar2011], and has reported promising performance, there remain several problems to be addressed. First, as the MSER algorithm detects a large number of non-characters, most of the character candidates need to be removed before further processing. The existing methods for MSERs pruning [head_mounted, real_time], on the one hand, may still have room for improvement in terms of accuracy; on the other hand, they tend to be slow because of the computation of complex features. Second, current approaches [head_mounted, real_time, pan] for text candidates construction, which can be categorized as rule-based and clustering-based methods, work well but are still not sufficient: rule-based methods generally require hand-tuned parameters, which is time-consuming and error-prone; the clustering-based method [pan] shows good performance but is complicated by incorporating a second processing stage after minimum spanning tree clustering.
In this paper, we propose a robust and accurate MSER based scene text detection method. First, by exploring the hierarchical structure of MSERs and adopting simple features, we design a fast and accurate MSERs pruning algorithm; the number of character candidates to be processed is significantly reduced with a high accuracy. Second, we propose a novel self-training distance metric learning algorithm that can learn distance weights and the clustering threshold simultaneously and automatically; character candidates are clustered into text candidates by the single-link clustering algorithm using the learned parameters. Third, we propose to use a character classifier to estimate the posterior probabilities of text candidates corresponding to non-text and remove text candidates with high probabilities. Such elimination helps to train a more powerful text classifier for identifying text. By integrating the above ideas, we built an accurate and robust scene text detection system. The system is evaluated on the benchmark ICDAR 2011 Robust Reading Competition dataset and achieves an f-measure of 76%. To the best of our knowledge, this result ranks first among all reported results and is much higher than the current best performance of 71%. We also validate our method on the multilingual (including Chinese and English) dataset used in [pan]. With an f-measure of 74.58%, our system significantly outperforms the competitive method [pan], which achieves only 65.2%. An online demo of our proposed scene text detection system is available at http://kems.ustb.edu.cn/learning/yin/dtext.
The rest of this paper is organized as follows. Recent MSER based scene text detection methods are reviewed in Section 2. Section 3 describes the proposed scene text detection method. Section 4 presents the experimental results of the proposed system on the ICDAR 2011 Robust Reading Competition dataset and a multilingual (including Chinese and English) dataset. Final remarks are presented in Section 5.
2 Related Work
As described above, MSER based methods have demonstrated very promising performance in many real projects. However, current MSER based methods still have some key limitations, i.e., they may suffer from a large number of non-character candidates in detection and from insufficient text candidates construction algorithms. In this section, we review MSER based methods with a focus on these two problems. For other scene text detection methods, readers are referred to survey papers [survey04, survey05, survey08]. A recently published MSER based method is described by Shi et al. [mser2013].
The main advantage of MSER based methods over traditional connected component based methods may be rooted in the use of MSERs as character candidates. Although the MSER algorithm can detect most characters even when the image is of low quality (low resolution, strong noise, low contrast, etc.), most of the detected character candidates correspond to non-characters. Carlos et al. [head_mounted] presented an MSERs pruning algorithm that contains two steps: (1) reduction of linear segments and (2) hierarchical filtering. The first stage reduces linear segments in the MSER tree into one node by maximizing the border energy function; the second stage walks through the tree in a depth-first manner and eliminates nodes by checking them against a cascade of filters: size, aspect ratio, complexity, border energy and texture. Neumann and Matas [real_time] presented a two-stage algorithm for Extremal Regions (ERs) pruning. In the first stage, a classifier trained on incrementally computable descriptors (area, bounding box, perimeter, Euler number and horizontal crossing) is used to estimate the class-conditional probabilities of ERs; ERs corresponding to local maxima of probabilities in the ER inclusion relation are selected. In the second stage, ERs that passed the first stage are classified as characters and non-characters using more complex features. As most of the MSERs correspond to non-characters, the purpose of using cascading filters and incrementally computable descriptors in the above two methods is to deal with the computational complexity caused by the high false positive rate.
Another challenge of MSER based methods, or more generally, CC-based methods and hybrid methods, is how to group character candidates into text candidates. The existing methods for text candidates construction fall into two general approaches: rule-based [edge_mser, head_mounted, real_time] and clustering-based methods [pan]. Neumann and Matas [real_time] grouped character candidates using text line constraints, whose basic assumption is that characters in a word can be fitted by one or more top and bottom lines. Carlos et al. [head_mounted] constructed a fully connected graph over character candidates; they filtered edges by running a set of tests (edge angle, relative position and size difference of adjacent character candidates) and used the remaining connected subgraphs as text candidates. Chen et al. [edge_mser] paired character candidates into clusters by putting constraints on stroke width and height difference; they then fitted a straight line to the centroids of clusters and declared a line to be a text candidate if it connected three or more character candidates. The clustering-based method presented by Pan et al. [pan] clusters character candidates into a tree using the minimum spanning tree algorithm with a learned distance metric [yinliu:metric2009]; text candidates are constructed by cutting off between-text edges with an energy minimization model. The above rule-based methods generally require hand-tuned parameters, while the clustering-based method is complicated by the incorporation of a post-processing stage, where one has to specify the energy model.
3 Robust Scene Text Detection
In this paper, by incorporating several key improvements over traditional MSER based methods, we propose a novel MSER based scene text detection method, which finally leads to significant performance improvement over the other competitive methods. The structure of the proposed system, as well as the sample result of each stage is presented in Figure 1. The proposed scene text detection method includes the following stages:
1) Character candidates extraction. Character candidates are extracted using the MSER algorithm; most of the non-characters are removed by the proposed MSERs pruning algorithm using the strategy of minimizing regularized variations. More details are presented in Section 3.1.
2) Text candidates construction. Distance weights and threshold are learned simultaneously using the proposed metric learning algorithm; character candidates are clustered into text candidates by the single-link clustering algorithm using the learned parameters. More details are presented in Section 3.2.
3) Text candidates elimination. The posterior probabilities of text candidates corresponding to non-text are estimated using the character classifier, and text candidates with high probabilities of being non-text are removed. More details are presented in Section 3.3.
4) Text candidates classification. Text candidates corresponding to true text are identified by the text classifier. An AdaBoost classifier is trained to decide whether a text candidate corresponds to true text or not [Yin12]. As characters in the same text tend to have similar features, the uniformity of the character candidates’ features is used as the text candidate’s features to train the classifier.
In order to measure the performance of the proposed system using the ICDAR 2011 competition dataset, text candidates identified as text are further partitioned into words by classifying inner character distances into character spacings and word spacings using an AdaBoost classifier [Yin12]. The following features are adopted: spacing aspect ratio, relative width difference between left and right neighbors, number of character candidates in the text candidate.
3.1 Letter Candidates Extraction
3.1.1 Pruning Algorithm Overview
The MSER algorithm is able to detect almost all characters even when the image is of low quality. However, as shown in Figure 2(a), most of the detected character candidates correspond to non-characters and should be removed before further processing. Figure 2(a) also shows that the detected character candidates form a tree, which is quite useful for designing the pruning algorithm. In real world situations, as characters cannot be “included” by or “include” other characters, it is safe to remove the children once the parent is known to be a character, and vice versa. The parent-children elimination is a safe operation because characters are preserved after the operation. By induction, if the MSER tree is pruned by applying the parent-children elimination operation recursively in a depth-first manner, characters are still preserved. As shown in Figure 2(e), the above algorithm will end up with a set of disconnected nodes containing all the characters. The problem with the above algorithm is that it is expensive to identify characters. Fortunately, rather than identifying characters, the choice between parent and children can be made by simply choosing the one that is more likely to be a character, which can be estimated by the proposed regularized variation scheme. Considering different situations in MSER trees, we design two versions of the parent-children elimination method, namely the linear reduction and tree accumulation algorithms. Non-character regions are eliminated by the linear reduction and tree accumulation algorithms using the strategy of minimizing regularized variations. Our experiment on the ICDAR 2011 competition training set shows that more than 80% of character candidates are eliminated using the proposed pruning algorithm.
In the following sections, we first introduce the concept of variation and explain why variations need to be regularized. Then we introduce the linear reduction and tree accumulation algorithm. Finally we present the complexity analysis for the proposed algorithms.
3.1.2 Variation and Its Regularization
According to Matas et al. [mser], an “extremal region” is a connected component of an image whose pixels have either higher or lower intensity than its outer boundary pixels [vlfeat, detector_compare]. Extremal regions are extracted by applying a set of increasing intensity levels to the gray scale image. When the intensity level increases, a new extremal region is extracted by accumulating pixels of the current level and joining lower level extremal regions [head_mounted]; when the top level is reached, extremal regions of the whole image are extracted as a rooted tree. The variation of an extremal region is defined as follows. Let $R_l$ be an extremal region and $B(R_l) = (R_l, R_{l+1}, \ldots, R_{l+\Delta})$ ($\Delta$ is a parameter) be the branch of the tree rooted at $R_l$; with $|R|$ denoting the number of pixels in region $R$, the variation (instability) of $R_l$ is defined as
$$v(R_l) = \frac{|R_{l+\Delta}| - |R_l|}{|R_l|} \qquad (1)$$
An extremal region is a maximally stable extremal region if its variation is lower than that of its parent and child, i.e., it is more stable than both [vlfeat, mser]. Informally, a maximally stable extremal region is an extremal region whose size remains virtually unchanged over a range of intensity levels [real_time].
As MSERs with lower variations have sharper borders and are more likely to be characters, one possible strategy for the parent-children elimination operation is to select the parent or the children based on which has the lowest variation. However, this strategy alone will not work because MSERs corresponding to characters do not necessarily have the lowest variations. Consider a very common situation depicted in Figure 2. The children of the MSER tree in Figure 1(a) correspond to characters while the parent of the MSER tree in Figure 1(b) corresponds to a character. The “minimize variation” strategy cannot deal with this situation because either the parent or the children may have the lowest variation. However, our experiment shows that this limitation can be easily fixed by variation regularization, whose basic idea is to penalize the variations of MSERs with too large or too small aspect ratios. Note that we do not require characters to have the lowest variations globally; a lower variation within a parent-children relationship suffices for our algorithm.
Let be the variation and be the aspect ratio of a MSER, the aspect ratios of characters are expected to fall in , the regularized variation is defined as
(2) 
where and are penalty parameters. Based on experiments on the training dataset, these parameters are set as .
Figure 2(b) shows an MSER tree colored according to variation. As variation increases, the color changes from green to yellow and then to red. The same tree colored according to regularized variation is shown in Figure 2(c). The MSER tree in Figure 2(c) is used in our linear reduction (result presented in Figure 2(d)) and tree accumulation (result presented in Figure 2(e)) algorithms. Notice that “variation” in the following sections refers to “regularized variation”.
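As an illustration of the idea in this subsection, the following Python sketch computes a variation from two region areas and then regularizes it by aspect ratio. The exact penalty formula and parameter values are not reproduced here: the sketch assumes a penalty that grows linearly with the distance of the aspect ratio from the expected range, and the names `a_min`, `a_max`, `theta_large` and `theta_small` are hypothetical.

```python
def variation(area, area_plus_delta):
    """Relative area growth between a region and its ancestor delta levels up.

    ASSUMPTION: this is the usual MSER instability measure; `area` must be > 0.
    """
    return (area_plus_delta - area) / area


def regularized_variation(v, aspect_ratio, a_min, a_max, theta_large, theta_small):
    """Penalize a variation when the aspect ratio lies outside [a_min, a_max].

    ASSUMPTION: the penalty is linear in the distance from the range; all
    parameter names and any values passed in are illustrative only.
    """
    if aspect_ratio > a_max:
        # Region is too elongated in one direction: increase its variation.
        return v + theta_large * (aspect_ratio - a_max)
    if aspect_ratio < a_min:
        # Region is too elongated in the other direction.
        return v + theta_small * (a_min - aspect_ratio)
    # Aspect ratio is in the expected range: leave the variation unchanged.
    return v
```

Because pruning keeps the candidate with the lower value, adding a non-negative penalty makes regions with implausible aspect ratios less likely to survive.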
3.1.3 Linear Reduction
The linear reduction algorithm is used in situations where an MSER has only one child. The algorithm chooses from parent and child the one with the minimum variation and discards the other.
This procedure is applied across the whole tree recursively. The detailed algorithm is presented in Figure 4. Given an MSER tree, the procedure returns the root of the processed tree whose linear segments are reduced. The procedure works as follows. Given a node $t$, the procedure checks the number of children of $t$: if $t$ has no children, it returns $t$ immediately; if $t$ has only one child, it obtains the root $c$ of the child tree by first applying the linear reduction procedure to the child tree, and then, if $t$ has a lower variation than $c$, links $c$’s children to $t$ and returns $t$, otherwise it returns $c$; if $t$ has more than one child, it processes these children using linear reduction and links the resulting trees to $t$ before returning $t$. Figure 2(d) shows the resulting MSER tree after applying linear reduction to the tree shown in Figure 2(c). Note that in the resulting tree all linear segments are reduced and non-leaf nodes always have more than one child.
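The procedure described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Node` class and its field names are hypothetical, and `variation` is assumed to already hold the regularized variation of each region.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical MSER tree node; `variation` is the regularized variation."""
    variation: float
    children: List["Node"] = field(default_factory=list)


def linear_reduction(t: Node) -> Node:
    """Collapse every single-child chain to its lowest-variation node."""
    if not t.children:
        return t
    if len(t.children) == 1:
        # Reduce the child subtree first, then keep the better of t and c.
        c = linear_reduction(t.children[0])
        if t.variation < c.variation:
            t.children = c.children  # link c's children to t, dropping c
            return t
        return c
    # More than one child: reduce each subtree and relink the results.
    t.children = [linear_reduction(child) for child in t.children]
    return t
```

The recursion depth is bounded by the depth of the tree, which for 8-bit gray scale images cannot exceed the number of intensity levels.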
3.1.4 Tree Accumulation
The tree accumulation algorithm is used when an MSER has more than one child. Given an MSER tree, the procedure returns a set of disconnected nodes. The algorithm works as follows. For a given node $t$, tree accumulation checks the number of $t$’s children: if $t$ has no children, it returns $t$ immediately; if $t$ has two or more children, it creates an empty set $S$ and appends to $S$ the results of applying tree accumulation to $t$’s children; if one of the nodes in $S$ has a lower variation than $t$, it returns $S$, otherwise it discards $t$’s children and returns $t$. Figure 2(e) shows the result of applying tree accumulation to the tree shown in Figure 2(d). Note that the final result is a set of disconnected nodes containing all the characters in the original MSER tree.
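A minimal Python sketch of the tree accumulation step is given below. The `Node` class is hypothetical, `variation` is assumed to be the regularized variation, and for simplicity the function treats any node with at least one child the same way, whereas the text applies it to trees in which every non-leaf node has more than one child.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical MSER tree node; `variation` is the regularized variation."""
    variation: float
    children: List["Node"] = field(default_factory=list)


def tree_accumulation(t: Node) -> List[Node]:
    """Return a list of disconnected nodes chosen from the tree rooted at t."""
    if not t.children:
        return [t]
    # Accumulate the nodes selected from every child subtree.
    selected: List[Node] = []
    for child in t.children:
        selected.extend(tree_accumulation(child))
    # Keep the descendants if any of them is more stable than t ...
    if any(node.variation < t.variation for node in selected):
        return selected
    # ... otherwise discard the children and keep t alone.
    t.children = []
    return [t]
```

Every node returned is either a leaf or a node whose children were discarded, so the result is a flat set of regions with no parent-child relations left.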
3.1.5 Complexity Analysis
The linear reduction and tree accumulation algorithms effectively visit each node in the MSER tree and perform simple comparisons and pointer manipulations; thus the complexity is linear in the number of tree nodes. The computational complexity of the variation regularization is mostly due to the calculation of the MSERs’ bounding rectangles, which is upper-bounded by the number of pixels in the image.
3.2 Text Candidates Construction
3.2.1 Text Candidates Construction Algorithm Overview
Text candidates are constructed by clustering character candidates using the single-link clustering algorithm [clustering]. Intuitively, single-link clustering produces clusters that are elongated [clustering_review] and thus is particularly suitable for the text candidates construction task. Single-link clustering belongs to the family of hierarchical clustering; in hierarchical clustering, each data point is initially treated as a singleton cluster and clusters are successively merged until all points have been merged into a single remaining cluster. In the case of single-link clustering, the two clusters whose two closest members have the smallest distance are merged in each step. A distance threshold can be specified such that the clustering process is terminated when the distance between the nearest clusters exceeds the threshold. The resulting clusters of the single-link algorithm form a hierarchical cluster tree, or a cluster forest if a termination threshold is specified. In the above algorithm, each data point represents a character candidate, and the top level clusters in the final hierarchical cluster tree (forest) correspond to text candidates.
The remaining problem is to determine the distance function and threshold for the single-link algorithm. We use the weighted sum of features as the distance function. Given two data points $u$ and $v$, let $x_{u,v}$ be the feature vector characterizing the similarity between $u$ and $v$; the distance between $u$ and $v$ is defined as
$$d(u, v; w) = w^{T} x_{u,v} \qquad (3)$$
where $w$, the feature weight vector, together with the distance threshold, can be learned using the proposed distance metric learning algorithm.
In the following subsections, we first introduce the feature space, then detail the proposed metric learning algorithm, and finally present an empirical analysis of the proposed algorithm.
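The clustering step described in this subsection can be sketched in Python as follows. The sketch relies on the fact that single-link clustering terminated at a threshold yields the connected components of the graph that links every pair whose distance does not exceed the threshold. The function names, the `pair_feature` callback and the toy one-dimensional feature in the usage note are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def weighted_distance(w: Sequence[float], x_uv: Sequence[float]) -> float:
    """Weighted sum of pairwise features, used as the distance of a pair."""
    return sum(wi * xi for wi, xi in zip(w, x_uv))


def single_link_clusters(
    n: int,
    pair_feature: Callable[[int, int], Sequence[float]],
    w: Sequence[float],
    eps: float,
) -> List[List[int]]:
    """Top-level single-link clusters of n points for termination threshold eps."""
    parent = list(range(n))  # union-find forest over point indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for u in range(n):
        for v in range(u + 1, n):
            # Pairs no farther apart than eps end up in the same cluster.
            if weighted_distance(w, pair_feature(u, v)) <= eps:
                parent[find(v)] = find(u)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

For example, with points at positions `[0, 1, 2, 10, 11]` on a line, the absolute position difference as the only feature, `w = [1.0]` and `eps = 1.5`, the function returns `[[0, 1, 2], [3, 4]]`. This quadratic-time version only yields the top level clusters, not the full hierarchy.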
3.2.2 Feature Space
The feature vector is used to describe the similarities between data points and . Let be the coordinates of top left corner of ’s bounding rectangle, be the height and width of the bounding rectangle of , be the stroke width of , be the average three channel color value of , feature vector include the following features:

Spatial distance

Width and height differences

Top and bottom alignments

Color difference

Stroke width difference
3.2.3 Distance Metric Learning
There are a variety of distance metric learning methods [huang1, huang2, huang3]. More specifically, many clustering algorithms rely on a good distance metric over the input space. One task of semi-supervised clustering is to learn a distance metric that satisfies the labels or constraints in the supervised data given the clustering algorithm [integrating, xing_metric, klein]. The strategy of metric learning is to learn the distance function by minimizing the distance between point pairs in the must-link set $\mathcal{M}$ while maximizing the distance between point pairs in the cannot-link set $\mathcal{C}$, where $\mathcal{C}$ specifies pairs of points in different clusters and $\mathcal{M}$ specifies pairs of points in the same cluster. In single-link clustering, because clusters are formed by merging smaller clusters, the final resulting clusters will form a binary hierarchical cluster tree, in which non-singleton clusters have exactly two direct subclusters. It is not hard to see that the following property holds for top level clusters: given the termination threshold $\epsilon$, the distance between the direct subclusters of each top level cluster is less than or equal to $\epsilon$, and the distances between data pairs in different top level clusters are greater than $\epsilon$, where the distance between two clusters is that of their two closest members. This property of single-link clustering enables us to design a learning algorithm that can learn the distance function and threshold simultaneously.
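Given the top level cluster set $\{C_k\}_{k=1}^{m}$, we randomly initialize the feature weights $w$ of the distance $d(u, v; w)$ and set the cannot-link set $\mathcal{C}$ and the must-link set $\mathcal{M}$ as
$$\mathcal{C} = \Big\{ (\hat{u}_k, \hat{v}_k) = \arg\min_{u \in C_k,\; v \in C_{-k}} d(u, v; w) \Big\}_{k=1}^{m} \qquad (4)$$
$$\mathcal{M} = \Big\{ (u^{*}_k, v^{*}_k) = \arg\min_{u \in C_k^{1},\; v \in C_k^{2}} d(u, v; w) \Big\}_{k=1}^{m} \qquad (5)$$
where $C_{-k}$ is the set of points excluding points in $C_k$, and $C_k^{1}$ and $C_k^{2}$ are the direct subclusters of $C_k$. Suppose $\epsilon$ is specified as the single-link clustering termination threshold. By the definition of single-link clustering, we must have
$$d(u, v; w) > \epsilon \quad \text{for all } (u, v) \in \mathcal{C} \qquad (6)$$
$$d(u, v; w) \le \epsilon \quad \text{for all } (u, v) \in \mathcal{M} \qquad (7)$$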
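The above equations show that the cannot-link set $\mathcal{C}$ and the must-link set $\mathcal{M}$ can be regarded as the positive and negative sample sets of a classification problem, such that the feature weights and threshold can be learned by minimizing the classification error. The logistic regression loss is a traditional classification loss with high and stable performance. By adopting the objective function of logistic regression, we define the following objective function
$$J(\theta : \mathcal{C}, \mathcal{M}) = -\frac{1}{|\mathcal{C}| + |\mathcal{M}|} \Big( \sum_{(u,v) \in \mathcal{C}} \log h_\theta(x'_{u,v}) + \sum_{(u,v) \in \mathcal{M}} \log\big(1 - h_\theta(x'_{u,v})\big) \Big) \qquad (8)$$
where, with $x_{u,v}$ the feature vector of the pair $(u, v)$, $w$ the feature weights and $\epsilon$ the threshold,
$$h_\theta(x'_{u,v}) = \frac{1}{1 + \exp(-\theta^{T} x'_{u,v})}, \quad \theta = \begin{pmatrix} -\epsilon \\ w \end{pmatrix}, \quad x'_{u,v} = \begin{pmatrix} 1 \\ x_{u,v} \end{pmatrix} \qquad (9)$$
The feature weights and threshold can be learned simultaneously by minimizing the objective function with respect to the current assignment of $\mathcal{C}$ and $\mathcal{M}$
$$\theta^{*} = \arg\min_{\theta} J(\theta : \mathcal{C}, \mathcal{M}) \qquad (10)$$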
Minimization of the above objective function is a typical nonlinear optimization problem and can be solved by classic gradient optimization methods [elements_sl_book].
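To make the optimization step concrete, the following Python sketch minimizes a logistic regression objective over cannot-link pairs (treated as positives) and must-link pairs (treated as negatives). It is an illustration under stated assumptions: the parameter vector is laid out as `theta = [-eps, w_1, ..., w_k]`, the loss is averaged over all pairs, and plain batch gradient descent with a hypothetical step size is used in place of whichever gradient method is actually employed.

```python
import math
from typing import List, Sequence


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def score(theta: Sequence[float], x: Sequence[float]) -> float:
    """theta^T x' with x' = (1, x); equals w^T x - eps under the assumed layout."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def objective(theta, cannot_link, must_link) -> float:
    """Mean logistic loss: cannot-link pairs should score > 0, must-link < 0."""
    loss = sum(softplus(-score(theta, x)) for x in cannot_link)
    loss += sum(softplus(score(theta, x)) for x in must_link)
    return loss / (len(cannot_link) + len(must_link))


def fit(cannot_link, must_link, dim: int, lr: float = 0.1, steps: int = 500) -> List[float]:
    """Batch gradient descent on the objective (ASSUMPTION: lr and steps are illustrative)."""
    theta = [0.0] * (dim + 1)
    samples = [(x, 1.0) for x in cannot_link] + [(x, 0.0) for x in must_link]
    for _ in range(steps):
        grad = [0.0] * (dim + 1)
        for x, y in samples:
            h = math.exp(-softplus(-score(theta, x)))  # sigmoid of the score
            err = h - y
            grad[0] += err
            for i, xi in enumerate(x):
                grad[i + 1] += err * xi
        theta = [t - lr * g / len(samples) for t, g in zip(theta, grad)]
    return theta
```

After fitting, the learned weights are `theta[1:]` and the learned threshold is `-theta[0]`, so a pair is separated by the clustering exactly when the classifier scores it as positive.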
Note that in the above learning scheme, initial values for the feature weights $w$ have to be specified in order to generate the cannot-link set $\mathcal{C}$ and the must-link set $\mathcal{M}$ according to Equations (4) and (5). For this reason, we design an iterative optimization algorithm in which each iteration involves two successive steps: the assignment of $\mathcal{C}$ and $\mathcal{M}$, and the optimization with respect to $\mathcal{C}$ and $\mathcal{M}$. We call our algorithm “self-training distance metric learning”. Pseudocode for this learning algorithm is presented in Figure 6. Given the top level cluster set, the learning algorithm finds the parameters $\theta$ (the feature weights and threshold) such that the objective function $J$ is minimized with respect to $\mathcal{C}$ and $\mathcal{M}$. In this algorithm, the initial value for $\theta$ is set before the iteration begins; in the first stage of the iteration, $\mathcal{C}$ and $\mathcal{M}$ are updated according to Equations (4) and (5) with respect to the current assignment of $\theta$; in the second stage, $\theta$ is updated by minimizing the objective function with respect to the current assignment of $\mathcal{C}$ and $\mathcal{M}$. This two-stage optimization is then repeated until convergence or until the maximum number of iterations is exceeded.
Similar to most self-training algorithms, convergence of the proposed algorithm is not guaranteed because the objective function is not assured to decrease in stage one. However, self-training algorithms have demonstrated their success in many applications. In our case, we find that our algorithm can usually achieve very good performance after a very small number of iterations, typically within 5 iterations. This phenomenon will be investigated in the next subsection.
3.2.4 Empirical Analysis
We perform an empirical analysis of the proposed distance metric learning algorithm. We labeled the text candidates corresponding to true text in the training set of the ICDAR 2011 competition dataset; 70% of them are used as training data and 30% as validation data. In each iteration of the algorithm, the cannot-link set and must-link set are updated in stage one by generating cannot-link point pairs and must-link point pairs from true text candidates in every image in the training dataset; the objective function is optimized using the L-BFGS method [LBFGS] and the parameters are updated in stage two. The performance of the distance weights and threshold learned in stage two is evaluated on the validation dataset in each iteration.
As discussed in the previous section, the algorithm may or may not converge depending on the initial values of the parameters. Our experiments show that the learned parameters almost always have a very low error rate on the validation set after the first several iterations, and no major improvement is observed in subsequent iterations. As a result, whether the algorithm converges or not has no great impact on the performance of the learned parameters.
We plot the value of the objective function after stage one and stage two in each iteration for two instances of the algorithm (one that converged and one that did not) in Figure 6(a). The corresponding error rates of the learned parameters on the validation set in each iteration are plotted in Figure 6(b). Notice that the value of the objective function and the validation set error rate drop immediately after the first several iterations. Figure 6(b) shows that the learned parameters have different error rates due to different initial values, which suggests running the algorithm several times to obtain satisfactory parameters. The parameters for the single-link clustering algorithm in our scene text detection system are chosen based on the performance on the validation set.
3.3 Text Candidates Elimination
Using the text candidates construction algorithm proposed in Section 3.2, our experiment on the ICDAR 2011 competition training set shows that only 9% of the text candidates correspond to true text. As it is hard to train an effective text classifier using such an unbalanced dataset, most of the non-text candidates need to be removed before training the classifier. We propose to use a character classifier to estimate the posterior probabilities of text candidates corresponding to non-text and remove text candidates with high probabilities of being non-text.
The following features are used to train the character classifier: smoothness, defined as the average difference of adjacent boundary pixels’ gradient directions; stroke width features, including the average stroke width and stroke width variation; and height, width, and aspect ratio. Characters with small aspect ratios such as “i”, “j” and “l” are treated as negative samples, as it is very uncommon for a word to comprise many small aspect ratio characters.
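Given a text candidate $T$, let $O(m, n; p)$ be the observation that there are $m$ character candidates in $T$, of which $n$ ($n \le m$) are classified as non-characters by a character classifier of precision $p$ ($0 < p < 1$). The probabilities of the observation conditioned on $T$ corresponding to text and non-text are $P(O(m, n; p) \mid \text{text}) = p^{m-n} (1 - p)^{n}$ and $P(O(m, n; p) \mid \text{nontext}) = (1 - p)^{m-n} p^{n}$, respectively. Let $P(\text{text})$ and $P(\text{nontext})$ be the prior probabilities of $T$ corresponding to text and non-text. By applying Bayes’ rule, the posterior probability of $T$ corresponding to non-text given the observation is
$$P(\text{nontext} \mid O(m, n; p)) = \frac{P(O(m, n; p) \mid \text{nontext}) \, P(\text{nontext})}{P(O(m, n; p))} \qquad (11)$$
where $P(O(m, n; p))$ is the probability of the observation
$$P(O(m, n; p)) = P(O(m, n; p) \mid \text{text}) \, P(\text{text}) + P(O(m, n; p) \mid \text{nontext}) \, P(\text{nontext}) \qquad (12)$$
The candidate region is rejected if
$$P(\text{nontext} \mid O(m, n; p)) \ge \epsilon \qquad (13)$$
where $\epsilon$ is the rejection threshold.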
Our experiment shows that text candidates of different sizes tend to have different probabilities of being text. For example, on the ICDAR training set, 1.25% of text candidates of size two correspond to text, while 30.67% of text candidates of size seven correspond to text, which suggests setting different priors for text candidates of different sizes. Given a text candidate $T$ of size $s$, let $N_s$ be the total number of text candidates of size $s$ and $N^{*}_s$ be the number of text candidates of size $s$ that correspond to text; we estimate the prior of $T$ being text as $P_s(\text{text}) = N^{*}_s / N_s$, and the prior of $T$ being non-text as $P_s(\text{nontext}) = 1 - P_s(\text{text})$. These priors are computed based on statistics on the ICDAR training dataset.
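The elimination rule of this subsection can be sketched in Python as follows. The sketch assumes that each character candidate is classified independently and that the classifier is correct with probability `p` on both characters and non-characters; the function and argument names are hypothetical, and the numbers in the usage note are made up for illustration.

```python
def size_prior(num_text: int, num_total: int) -> float:
    """Prior of being text for candidates of one size, estimated by counting."""
    return num_text / num_total


def nontext_posterior(m: int, n: int, p: float, prior_text: float) -> float:
    """Posterior probability that a candidate with m members is non-text,
    given that n of them were classified as non-characters.

    ASSUMPTION: independent per-character decisions with accuracy p.
    """
    like_text = p ** (m - n) * (1.0 - p) ** n      # n mistakes if truly text
    like_nontext = (1.0 - p) ** (m - n) * p ** n   # m - n mistakes if non-text
    prior_nontext = 1.0 - prior_text
    evidence = like_text * prior_text + like_nontext * prior_nontext
    return like_nontext * prior_nontext / evidence


def should_reject(m: int, n: int, p: float, prior_text: float, eps: float) -> bool:
    """Reject the candidate when the non-text posterior reaches the threshold."""
    return nontext_posterior(m, n, p, prior_text) >= eps
```

With made-up values `p = 0.9` and a prior of 0.5, a candidate of three members that are all classified as non-characters has a non-text posterior above 0.99, while one whose members are all classified as characters has a posterior below 0.01.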
To find the appropriate threshold $\epsilon$, we used 70% of the ICDAR training dataset to train the character classifier and text classifier, and the remaining 30% as a validation set to test the performance of different values of $\epsilon$. Figure 7(a) shows the precision, recall and f-measure of the text candidates classification task on the validation set. As $\epsilon$ increases, text candidates are less likely to be eliminated, which results in an increase of the recall value. In the scene text detection task, recall is preferred over precision, until is reached, where a major decrease of f-measure occurred, which can be explained by the sudden decrease of the ratio of text samples (see Figure 7(b)). Figure 7(b) shows that at , of text are preserved, while of nontext are eliminated.
4 Experimental Results
In this section, we present the experimental results of the proposed scene text detection method on two publicly available benchmark datasets: the ICDAR 2011 Robust Reading Competition dataset (available at http://robustreading.opendfki.de/wiki/SceneText) and the multilingual dataset (available at http://liama.ia.ac.cn/wiki/projects:pal:member:yfpan) provided by Pan et al. [pan].
4.1 Experiments on ICDAR 2011 Competition Dataset
The ICDAR 2011 Robust Reading Competition (Challenge 2: Reading Text in Scene Images) dataset [icdar2011] is a widely used dataset for benchmarking scene text detection algorithms. The dataset contains training images and testing images. The proposed system is trained on the training set and evaluated on the testing set.
It is worth noting that the evaluation scheme of the ICDAR 2011 competition differs from that of ICDAR 2003 and ICDAR 2005. The new scheme, the object count/area scheme proposed by Wolf et al. [Wolf_Jolion_2006], is more complicated but offers several enhancements over the old scheme. Both schemes use the notions of precision, recall and f-measure, defined as
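$$\mathrm{recall}(G, D) = \frac{\sum_{i=1}^{|G|} \mathrm{match}_G(G_i)}{|G|} \tag{14}$$
$$\mathrm{precision}(G, D) = \frac{\sum_{j=1}^{|D|} \mathrm{match}_D(D_j)}{|D|} \tag{15}$$
$$f = \frac{2 \cdot \mathrm{recall} \cdot \mathrm{precision}}{\mathrm{recall} + \mathrm{precision}} \tag{16}$$
where $G$ is the set of groundtruth rectangles and $D$ is the set of detected rectangles. In the old evaluation scheme, the matching functions are defined as
$$\mathrm{match}_G(G_i) = \max_{j = 1, \dots, |D|} \frac{2 \cdot \mathrm{area}(G_i \cap D_j)}{\mathrm{area}(G_i) + \mathrm{area}(D_j)} \tag{17}$$
$$\mathrm{match}_D(D_j) = \max_{i = 1, \dots, |G|} \frac{2 \cdot \mathrm{area}(D_j \cap G_i)}{\mathrm{area}(D_j) + \mathrm{area}(G_i)} \tag{18}$$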
The above matching functions only consider one-to-one matches between groundtruth and detected rectangles, leaving room for ambiguity between detection quantity and quality [Wolf_Jolion_2006]. In the new evaluation scheme, the matching functions are redesigned to take into account detection quality and different matching situations (one-to-one, one-to-many and many-to-one matches) between groundtruth rectangles and detected rectangles, so that both detection quantity and quality can be observed. The evaluation software DetEval used by the ICDAR 2011 competition is freely available online at http://liris.cnrs.fr/christian.wolf/software/deteval/index.html.
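The old, one-to-one evaluation scheme can be sketched in a few lines. The sketch assumes the ICDAR 2003-style matching as described by Wolf and Jolion, where the best match of a rectangle is twice the intersection area divided by the sum of the two areas; the function names and the `(x0, y0, x1, y1)` rectangle format are illustrative assumptions. It does not implement the DetEval object count/area scheme.

```python
def area(rect):
    """Area of an axis-aligned rectangle (x0, y0, x1, y1); 0 if degenerate."""
    x0, y0, x1, y1 = rect
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection(a, b):
    """Intersection rectangle of a and b (may be degenerate)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def best_match(rect, others):
    """Best one-to-one match of `rect` against a list of rectangles."""
    best = 0.0
    for other in others:
        denom = area(rect) + area(other)
        if denom > 0:
            best = max(best, 2 * area(intersection(rect, other)) / denom)
    return best


def f_measure(recall, precision):
    """Harmonic mean of recall and precision (0 when both are 0)."""
    total = recall + precision
    return 2 * recall * precision / total if total else 0.0


def evaluate(ground_truth, detected):
    """Return (recall, precision, f-measure) under the old scheme."""
    recall = (sum(best_match(g, detected) for g in ground_truth)
              / len(ground_truth)) if ground_truth else 0.0
    precision = (sum(best_match(d, ground_truth) for d in detected)
                 / len(detected)) if detected else 0.0
    return recall, precision, f_measure(recall, precision)


# A detection covering the upper half of one groundtruth rectangle.
recall, precision, f = evaluate([(0, 0, 10, 10)], [(0, 0, 10, 5)])
```

The same `f_measure` helper can be used to check reported results: a recall of 68.26 and a precision of 86.29 give an f-measure of about 76.22.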
The performance of our system, together with Neumann and Matas’ method [real_time], a very recent MSER based method by Shi et al. [mser2013] and some of the top scoring methods (Kim’s method, Yi’s method, THTextLoc system and Neumann’s method) from the ICDAR 2011 Competition, is presented in Table I. As can be seen from Table I, our method achieves higher recall, precision and f-measure than the other methods on this dataset. It is worth noting that the first four methods in Table I are all MSER based methods and that Kim’s method is the winning method of the ICDAR 2011 Robust Reading Competition. Apart from the detection quality, the proposed system offers a speed advantage over some of the listed methods. The average processing speed of the proposed system on a Linux laptop with an Intel (R) Core (TM)2 Duo 2.00GHz CPU is 0.43s per image. The processing speed of Shi et al.’s method [mser2013] on a PC with an Intel (R) Core (TM)2 Duo 2.33GHz CPU is 1.5s per image. The average processing speed of Neumann and Matas’ method [real_time] is 1.8s per image on a “standard PC”. Figure 9 shows some text detection examples by our system on the ICDAR 2011 dataset.
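Table I: Performance (%) of different methods on the ICDAR 2011 dataset.

| Methods | Recall | Precision | f-measure |
| --- | --- | --- | --- |
| Our Method | 68.26 | 86.29 | 76.22 |
| Shi et al.’s method [mser2013] | 63.1 | 83.3 | 71.8 |
| Kim’s Method (not published) | 62.47 | 82.98 | 71.28 |
| Neumann and Matas [real_time] | 64.7 | 73.1 | 68.7 |
| Yi’s Method | 58.09 | 67.22 | 62.32 |
| THTextLoc System | 57.68 | 66.97 | 61.98 |
| Neumann’s Method | 52.54 | 68.93 | 59.63 |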
To fully appreciate the benefits of text candidates elimination and the MSERs pruning algorithm, we further profiled the proposed system on this dataset using the following schemes (see Table II):
1) Scheme-I: no text candidates elimination is performed. As can be seen from Table II, the absence of text candidates elimination results in a major decrease in precision. The degradation can be explained by the fact that a large number of nontext candidates are passed to the text candidates classification stage without being eliminated.
2) Scheme-II: the default parameter setting [vlfeat] is used for the MSER extraction algorithm. The MSER extraction algorithm is controlled by several parameters [vlfeat]: delta controls how the variation is calculated; maximal variation excludes MSERs that are too unstable; minimal diversity removes duplicate MSERs by measuring the size difference between an MSER and its parent. As can be seen from Table II, compared with our parameter setting, the default parameter setting results in a major decrease in recall. The degradation can be explained by two reasons: (1) the MSER algorithm is not able to detect some low contrast characters (due to the delta setting), and (2) the MSER algorithm tends to miss some regions that are more likely to be characters (due to the maximal variation and minimal diversity settings). Note that the speed loss (from 0.36 seconds to 0.43 seconds) is mostly due to the MSER detection algorithm itself.
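The effect of the maximal variation and minimal diversity parameters can be illustrated with a deliberately simplified toy. This is not VLFeat's implementation: the function name, the chain representation, the normalization of the diversity by the parent's area, and the rule of always dropping the child of a near-duplicate pair are all assumptions, and the threshold values are illustrative only.

```python
def filter_regions(chain, max_variation, min_diversity):
    """Filter one chain of nested regions, ordered from child to parent.

    Each region is a dict with an 'area' and a precomputed 'variation'.
    A region is dropped when it is too unstable (variation above
    max_variation) or when it is too similar in size to its direct parent
    (relative size difference below min_diversity).
    """
    kept = []
    for i, region in enumerate(chain):
        if region["variation"] > max_variation:
            continue  # too unstable
        if i + 1 < len(chain):
            parent = chain[i + 1]
            diversity = (parent["area"] - region["area"]) / parent["area"]
            if diversity < min_diversity:
                continue  # near-duplicate of its parent
        kept.append(region)
    return kept


# Hypothetical chain of three nested regions.
chain = [
    {"area": 100, "variation": 0.3},
    {"area": 105, "variation": 0.1},
    {"area": 200, "variation": 0.4},
]

strict = filter_regions(chain, max_variation=0.25, min_diversity=0.2)
loose = filter_regions(chain, max_variation=0.5, min_diversity=0.1)
```

In this toy, the stricter setting keeps only the most stable region, while the looser setting also keeps a less stable one, which mirrors the recall difference discussed above: a looser setting passes more character candidates to the later stages.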
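Table II: Performance (%) and speed of the proposed system under different schemes on the ICDAR 2011 dataset.

| Component | Recall | Precision | f-measure | Speed (s) |
| --- | --- | --- | --- | --- |
| Overall system | 68.26 | 86.29 | 76.22 | 0.43 |
| Scheme-I | 65.57 | 77.49 | 71.03 | 0.41 |
| Scheme-II | 61.63 | 85.78 | 71.72 | 0.36 |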
4.2 Experiments on Multilingual Dataset
The multilingual dataset (including Chinese and English, see Figure 10) was initially published by Pan et al. [pan] to evaluate the performance of their scene text detection system. The dataset consists of a set of training images and a set of testing images. As there is no apparent spacing between Chinese words, this multilingual dataset only provides groundtruths for text lines. We hence evaluate the text line detection performance of our system without further partitioning text into words. Figure 10 shows some scene text detection examples by our system on this dataset.
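Table III: Performance (%) and speed of different methods on the multilingual dataset.

| Methods | Recall | Precision | f-measure | Speed (s) |
| --- | --- | --- | --- | --- |
| Scheme-III | 63.23 | 79.38 | 70.39 | 0.22 |
| Scheme-IV | 68.45 | 82.63 | 74.58 | 0.22 |
| Pan et al.’s method [pan] | 65.9 | 64.5 | 65.2 | 3.11 |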
The performance of our system (including Scheme-III and Scheme-IV) and Pan et al.’s method [pan] is presented in Table III. The evaluation scheme of the ICDAR 2003 competition (see Section 4.1) is used for fair comparison. The main difference between Scheme-III and Scheme-IV is that the character classifier in the first scheme is trained on the ICDAR 2011 training set, while the character classifier in the second scheme is trained on the multilingual training set (the character features used for training the classifier are the same). The comparison between Scheme-III and Scheme-IV in Table III shows that the performance of the proposed system is significantly improved by incorporating the Chinese-friendly character classifier. The basic implication of this improvement is that the character classifier has a significant impact on the performance of the overall system, which points to another advantage of the proposed system: the character classifier can be trained on a desired dataset until it is accurate enough and then plugged into the system to improve the overall performance. Table III also shows the advantages of the proposed method over Pan et al.’s method in both detection quality and speed.
5 Conclusion
This paper presents a new MSER based scene text detection method. Several key improvements over traditional methods have been proposed. We propose a fast and accurate MSERs pruning algorithm that enables us to detect most of the characters even when the image is of low quality. We propose a novel self-training distance metric learning algorithm that can learn distance weights and the threshold simultaneously; text candidates are constructed by clustering character candidates with the single-link algorithm using the learned parameters. We propose to use a character classifier to estimate the posterior probability of a text candidate corresponding to nontext and to eliminate text candidates with a high probability of being nontext, which helps to build a more powerful text classifier. By integrating the above ideas, we built a robust scene text detection system that exhibits superior performance over state-of-the-art methods on both the ICDAR 2011 Competition dataset and a multilingual dataset.
Acknowledgments
The research was partly supported by National Basic Research Program of China (2012CB316301) and National Natural Science Foundation of China (61105018, 61175020).