Robust Scene Text Recognition Using Sparse Coding based Features
In this paper, we propose an effective scene text recognition method using sparse coding based features, called Histograms of Sparse Codes (HSC) features. For character detection, we use the HSC features instead of the Histograms of Oriented Gradients (HOG) features. The HSC features are extracted by computing sparse codes with dictionaries that are learned from data using K-SVD, and aggregating per-pixel sparse codes to form local histograms. For word recognition, we integrate multiple cues, including character detection scores and geometric contexts, in an objective function. The final recognition result is the word that maximizes the objective function. The parameters of the objective function are learned using the Minimum Classification Error (MCE) training method. Experiments on several challenging datasets demonstrate that the proposed HSC-based scene text recognition method significantly outperforms HOG-based methods and outperforms most state-of-the-art methods.
Keywords: Scene text recognition · Feature representation · Sparse coding · K-SVD · HSC · HOG.
Since text in images and videos contains rich high-level semantic information that is valuable for scene understanding, scene text recognition has attracted increasing attention in the computer vision community in recent years. However, scene text recognition is nontrivial due to challenges such as background clutter, illumination changes, rotated characters, curved text lines, low-resolution characters, and blurred characters. Figure 1 shows examples of scene text images that illustrate these challenges.
In scene text recognition, there are four main problems: (1) text detection, (2) text recognition, (3) full image word recognition, and (4) isolated scene character recognition. Text detection locates text regions in an image, while text recognition, which assumes text regions are given, is usually referred to as cropped word recognition. Full image word recognition usually includes both text detection and text recognition in an end-to-end scene text recognition system. Isolated character recognition is usually a basic component of a scene text recognition system. In this paper, we mainly focus on the text recognition (or cropped word recognition) problem.
Since the work of [WangK2011], object recognition/detection based scene text recognition methods have achieved significant progress [Mishra2012a, WangT2012, Shi2013, Shi2014a, Shi2014b]. In this type of method, each character class is treated as a visual object, and a character detector is used to detect and recognize character candidates simultaneously. The final word recognition result is obtained by combining the character detection results with other contexts (e.g., geometric constraints, a language model). Since these methods jointly optimize character detection and recognition, they show superior performance to traditional Optical Character Recognition (OCR) based methods.
In object recognition/detection based scene text recognition methods, a key issue is to design effective character features for character detection and classification. Since the Histograms of Oriented Gradients (HOG) features have proven effective and are widely used in object detection [Dalal2005, Felzenszwalb2010], they have been adopted for character feature extraction in scene text recognition [WangK2011, Mishra2012a, Mishra2013]. The core of HOG is to use the gradient orientation at every pixel to represent the local appearance and shape of an object. While HOG is very effective at describing local features (such as edges) and robust to appearance and illumination changes, it cannot effectively describe local structure information [Zhang2011].
In this paper, we mainly focus on the cropped word recognition problem and propose an effective scene text recognition method using sparse coding based features, i.e., the Histograms of Sparse Codes (HSC) features originally proposed for object detection [Ren2013]. To extract the HSC features, per-pixel sparse codes are first computed with a dictionary, which is learned from data using K-SVD in an unsupervised way. The per-pixel sparse codes are then aggregated to form local histograms (similar to the HOG features). By representing common structures with the dictionary, the HSC features can describe richer structural information than the HOG features.
For the character detector, we use simple classifiers including random ferns (FERNS), the support vector machine (SVM), and a sparse coding (SC) based classifier [Wright2009]. The former two classifiers have been used for scene text recognition [WangK2011, Mishra2012a], while the third has not. For word recognition, we design an objective function based on the Pictorial Structures (PS) model [Felzenszwalb2005, Felzenszwalb2010] to integrate the character detection results and geometric constraints. The parameters of the objective function are learned using the Minimum Classification Error (MCE) training method [Juang1997]. Experiments on the popular ICDAR2003, ICDAR2011, SVT and IIIT5K-Word datasets show that, for scene text recognition, the proposed method using the HSC features significantly improves on the same method using the HOG features, and that it outperforms most state-of-the-art methods.
The main contributions of this paper are three-fold.
First, we propose an effective scene text recognition method using sparse coding based features (i.e., the HSC features) that are learned automatically from data for character feature representation. Since the HSC features represent richer structural information of characters, they achieve superior performance in character/word recognition compared to the commonly used HOG features, as illustrated in our experiments (see Section 5). Since the HSC feature extraction method is as simple as HOG (see Section 3), it can be considered as an alternative feature extraction method in applications (including character/word recognition) that require recognition, verification (detection), or classification.
Second, we show that the SC based classifier can achieve high performance in character/word recognition, which reveals its potential for this task.
Third, we propose to use the MCE training method, which has been commonly used in speech recognition and handwriting recognition [Juang1997, Biem2006], to learn the parameters of the scene text recognition model. This provides a new way to learn parameters automatically in scene text recognition.
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the character feature representation method based on sparse coding. The word recognition method is presented in Section 4. Section 5 presents the experimental results, and Section 6 offers concluding remarks. This paper is an extension of our previous work [Zhang2014]: we provide more details and discussions about the proposed method, evaluate on more datasets and more aspects (e.g., recognition speed), and present more results on scene character recognition.
2 Related Work
In scene text recognition, there are two main issues: one is to design a character detector, which mainly involves feature representation and classification (with much more attention paid to feature representation, reviewed in this section); the other is to develop a word recognition model for integrating character detection and other contexts (i.e., word recognition or word formation). For classification, classifiers such as FERNS [WangK2011], SVM [Mishra2012a], random forests [Yao2014] and CNNs [WangT2012, Jaderberg2014] have been adopted. In this section, we briefly review character feature representation methods and the main word recognition models, focusing on the former.
2.1 Character Feature Representation
Feature representation is an important issue in pattern recognition and computer vision, and for scene text recognition it has attracted increasing attention. Since scene character recognition is usually a basic component of a scene text recognition system, in this section we also review the main character feature representation methods proposed for scene character recognition. Existing methods mainly fall into three categories: HOG and its variants, mid-level character feature representations, and deep learning based methods.
The HOG features have been shown to be effective in object detection [Dalal2005] and for scene character feature representation [WangK2011, Mishra2012a, Mishra2013]. Although HOG is simple and effective at describing local features (such as edges), it ignores spatial and structural information, so several methods have been proposed to improve it. For example, Yi et al. [Yi2013b] improve the HOG features by global sampling (GHOG) or local sampling (LHOG) to better model character structures. Tian et al. [Tian2013] propose the Co-occurrence of Histogram of Oriented Gradients (CoHOG) features, which capture the spatial distribution of neighboring orientation pairs instead of only a single gradient orientation; CoHOG improves on HOG significantly in scene character recognition. Later, the authors of [Tian2013] propose the pyramid of HOG (PHOG) [Tan2014] to encode the relative spatial layout of character parts, and the convolutional CoHOG (ConvCoHOG) [Su2014] to extract richer character features by exhaustively exploring every image patch within a character image. These methods effectively improve the performance of scene character recognition.
Recently, some works extract mid-level features for character representation. Yao et al. [Yao2014] propose a set of mid-level detectable primitives (strokelets), which capture substructures of characters; the strokelets are used in conjunction with the HOG features as supplementary features, since using strokelets alone does not perform well. In [Lee2014], a discriminative feature pooling method that automatically learns the most informative sub-regions of each scene character is proposed for character feature representation. Gao et al. [Gao2014a] propose a stroke bank based character representation method, whose basic idea is to design a stroke detector for scene character recognition using the stroke bank. In [Gao2014b], Gao et al. learn the co-occurrence of local strokes with a spatiality embedded dictionary, which introduces more precise spatial information for character recognition; the results demonstrate the effectiveness of both methods. All these methods explore mid-level features (such as the strokelets of [Yao2014], the sub-regions of [Lee2014], and the stroke bank of [Gao2014a]) to represent character features, and have shown their effectiveness in scene character/text recognition.
Deep learning methods have also been adopted for feature learning of scene characters. Coates et al. [Coates2011] propose an unsupervised feature learning method using convolutional neural networks (CNNs) for scene character recognition. In [WangT2012], character features are extracted using an unsupervised feature learning algorithm similar to [Coates2011] and integrated into a CNN for character detection and classification. Recently, Jaderberg et al. [Jaderberg2014] develop a CNN classifier that can be used for both text detection and recognition; its novel architecture enables efficient feature sharing through a number of common layers for character recognition. Deep learning based methods achieve very high performance on scene character and text recognition, showing their potential advantages. However, theoretical analysis of the effectiveness of CNN based methods is still lacking and needs further efforts from the community.
Character structure information is important to character representation. Besides Yao et al. [Yao2014] and Lee et al. [Lee2014], who exploit character structure features, Shi et al. [Shi2013, Shi2014a, Shi2014b] propose part-based tree-structured features designed directly according to the shape and structure of each character class. However, in [Shi2013, Shi2014a, Shi2014b], one needs to manually design a tree-structured model and annotate training samples for each class. This is nontrivial, and it is difficult to apply part-based tree-structured features to tasks with more character classes (such as Chinese characters).
The proposed HSC feature representation for scene text recognition is based on sparse coding and is learned automatically, so it can be viewed as a feature learning method using sparse coding. Compared to feature learning methods based on deep learning (such as CNNs), the HSC feature extraction method is much simpler and easier to implement.
In the context of feature learning, sparse coding has been a popular technique for learning feature representations, in applications such as image classification [Yang2009] and object detection [Kavukcuoglu2010]. In [Ren2013], the HSC feature extraction method based on sparse coding is proposed for object detection, showing superior performance to the HOG features. However, it had not previously been applied to scene text recognition; to the best of our knowledge, this work is the first to do so. Surprisingly, we found that using the HSC features instead of the HOG features significantly improves the performance of scene text recognition.
2.2 Word Recognition Model
Regarding the word recognition model for yielding word recognition results, Wang et al. [WangK2011] apply a lexicon-driven pictorial structures model to combine character detection scores and geometric constraints. Mishra et al. [Mishra2012a, Mishra2012b] build a conditional random field (CRF) model to integrate bottom-up cues (character detection scores) with top-down cues (lexicon prior). Similarly, Shi et al. [Shi2013] use a CRF model to obtain final word recognition results. In [Novikova2012], weighted finite-state transducers (WFSTs) based maximum a posteriori (MAP) inference is used for word recognition. In [Goel2013], text is recognized by matching scene and synthetic image features with a weighted dynamic time warping (wDTW) approach. In [WangT2012, Phan2013, Neumann2013, Jaderberg2014], heuristic integration models (i.e., summation of character detection scores) are used to integrate the character detection results. In [Shi2014b], a probabilistic model is proposed to combine the character detection scores and a language model from the Bayesian decision view. In [Weinman2014], a semi-Markov model is used to integrate multiple sources of information for scene text recognition. In [Su2014b], a word image is first converted into a sequence of column vectors based on the HOG features, and a Recurrent Neural Network (RNN) is then adapted to classify the sequential feature vectors into the corresponding word. A key characteristic of the RNN based method of [Su2014b] is that it recognizes whole word images without character-level segmentation and recognition.
It is also worth noting that Almazán et al. [Almazan2014] and Rodríguez-Serrano et al. [Serrano2015] propose to embed word attributes/labels and word images into a common subspace for word spotting and text recognition. Text recognition then becomes a retrieval problem, which is essentially a nearest neighbor search problem. This idea provides a new solution to text recognition and has shown its effectiveness and efficiency [Serrano2015].
In this paper, we mainly focus on character feature representation. For word recognition/formation, we simply apply a lexicon-driven pictorial structures model similar to that in [WangK2011]. However, we improve it by taking into account the influence of the word length (i.e., the number of characters in the word) on word recognition results. Moreover, we propose to automatically learn the parameters of the word recognition model using the MCE training method to optimize word recognition performance.
3 Character Feature Representation based on Sparse Coding
In this paper, we propose to learn a feature representation that describes richer structures of characters based on sparse coding. In [Ren2013], sparse coding is used to compute per-pixel sparse codes, which are aggregated into “histograms”, forming the Histograms of Sparse Codes (HSC) features. HSC has an advantage over HOG in that it can represent richer structures. We propose to use the HSC features to represent structures of characters for scene text recognition.
3.1 Local Representation via Sparse Coding
For computing per-pixel sparse codes, we need to learn a dictionary from data. We employ K-SVD [Aharon2006] for dictionary learning, which learns common structures of objects in an unsupervised way. K-SVD is a generalization of the K-means algorithm and has been widely used for dictionary learning in tasks such as image processing [Elad2006, Romano2014], image classification [Yang2009], face recognition [Zhang2010], action recognition [Zheng2013], handwritten digit recognition [Zhang2013], and object recognition [Jiang2013]. In the following, we briefly introduce how to use K-SVD to learn the dictionary for character feature representation.
In sparse representation, an input pattern is represented by a linear combination of atoms of a dictionary, under the constraint that the linear coefficients are sparse, i.e., only a small fraction of the entries in the coefficients are nonzero. Denote a dictionary by $D = [d_1, d_2, \ldots, d_K] \in \mathbb{R}^{n \times K}$, where $d_k$ is the $k$-th basis vector (atom) and $K$ is the size of the dictionary. When the dictionary is given, the coefficients for representing a given pattern $x \in \mathbb{R}^n$ are computed by sparse coding, which can be written as:

$$\hat{\alpha} = \arg\min_{\alpha} \|x - D\alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_0 \le T, \tag{1}$$

where $\alpha$ is a sparse vector of representation coefficients, and $T$ is a predetermined number of non-zero entries in the coefficients that constrains the sparsity level. The obtained representation coefficients $\hat{\alpha}$ are referred to as the sparse codes of the pattern.
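As an illustration, the sparse coding problem above can be solved greedily with orthogonal matching pursuit (OMP). The following is a minimal sketch, not the implementation used in the paper; the function name and the toy dictionary in the usage below are ours:

```python
import numpy as np

def omp(D, x, T):
    """Greedy Orthogonal Matching Pursuit: approximately solve
    argmin_a ||x - D a||_2^2  s.t.  ||a||_0 <= T.
    D has unit-norm columns (atoms); returns the sparse code a."""
    residual = x.copy()
    support = []
    a = np.zeros(D.shape[1])
    for _ in range(T):
        # pick the atom most correlated with the current residual
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k not in support:
            support.append(k)
        # least-squares fit on the selected atoms, then update the residual
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        a[:] = 0.0
        a[support] = coef
        residual = x - D @ a
    return a
```

For example, with an orthonormal dictionary `D = np.eye(4)` and a signal lying along a single atom, `omp` with `T = 1` recovers exactly that atom's coefficient.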
Besides sparse coding, another important issue in sparse representation is to learn the dictionary from a set of training samples. Given a training set of $N$ samples, $Y = [y_1, y_2, \ldots, y_N]$, the dictionary learning problem can be formulated as [Aharon2006]:

$$\min_{D, A} \sum_{i=1}^{N} \|y_i - D\alpha_i\|_2^2 \quad \text{s.t.} \quad \|\alpha_i\|_0 \le T, \; i = 1, \ldots, N, \tag{2}$$

where $A = [\alpha_1, \alpha_2, \ldots, \alpha_N]$ is the coefficient matrix. The problem can be rewritten as a matrix factorization problem with a sparsity constraint:

$$\min_{D, A} \|Y - DA\|_F^2 \quad \text{s.t.} \quad \|\alpha_i\|_0 \le T, \; i = 1, \ldots, N, \tag{3}$$

where $\|\cdot\|_F$ is the Frobenius norm, defined as the square root of the sum of squares of the elements of a matrix. This searches for the best possible dictionary for the sparse representation of the sample set $Y$, and it is a joint optimization problem with respect to the dictionary $D$ and the coefficients $A$. It can be solved by alternating between sparse coding of the samples based on the current dictionary and an update of the dictionary elements.
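The alternating scheme just described can be sketched as follows. This is an illustrative simplification (random initialization, a fixed iteration count, and a compact OMP coder are our choices, not the paper's); the atom update anticipates the SVD-based rule of K-SVD discussed below:

```python
import numpy as np

def omp_code(D, x, T):
    """Minimal OMP coder used inside the dictionary learning loop."""
    support, a, r = [], np.zeros(D.shape[1]), x.copy()
    for _ in range(T):
        k = int(np.argmax(np.abs(D.T @ r)))
        if k not in support:
            support.append(k)
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        a[:] = 0.0
        a[support] = coef
        r = x - D @ a
    return a

def ksvd(Y, K, T, n_iter=10, seed=0):
    """Alternate (1) sparse coding of all samples (columns of Y) and
    (2) sequential SVD-based atom updates, as in Aharon et al. (2006)."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], K))
    D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
    for _ in range(n_iter):
        A = np.column_stack([omp_code(D, Y[:, i], T)
                             for i in range(Y.shape[1])])
        for k in range(K):
            users = np.nonzero(A[k, :])[0]         # samples using atom k
            if users.size == 0:
                continue
            A[k, users] = 0.0
            E = Y[:, users] - D @ A[:, users]      # error without atom k
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k], A[k, users] = U[:, 0], s[0] * Vt[0, :]
    return D, A
```

On synthetic data generated from a few fixed directions, a handful of iterations suffices to drive the reconstruction error $\|Y - DA\|_F$ close to zero.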
The connection between sparse representation and clustering (i.e., vector quantization (VQ)) has been noted in previous works [Delgado2003, Aharon2006]. In clustering, the representative examples (the centers of the clusters) can be viewed as the codewords of a dictionary (also called a codebook in vector quantization) and are used to represent samples via nearest neighbor assignment. That is, when a dictionary is given, each sample is represented by its closest codeword. This can be considered a special case of sparse coding, in the sense that only one atom participates in the reconstruction of the sample. Using the above notation, the sparse coding problem can be written as:

$$\min_{j} \|x - D e_j\|_2^2 \quad \text{s.t.} \quad e_j \in \{e_1, e_2, \ldots, e_K\}, \tag{4}$$

where $\{e_j\}$ is the standard basis and $e_j$ is a vector with all zero entries except for a one in the $j$-th position. Similarly, the codebook in vector quantization can be learned by minimizing the mean square error (MSE) as follows:

$$\min_{D, A} \|Y - DA\|_F^2 \quad \text{s.t.} \quad \alpha_i \in \{e_1, e_2, \ldots, e_K\}, \; i = 1, \ldots, N. \tag{5}$$
The K-means algorithm is one of the most popular methods for learning the codewords. It applies an iterative procedure with two steps in each iteration: 1) given $D$, assign each sample to its nearest cluster, i.e., sparse coding; and 2) given that assignment, update $D$ to further minimize the MSE. Since the MSE monotonically decreases in each iteration, K-means is guaranteed to reach at least a local optimum.
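The two K-means steps can be sketched directly; this is a generic illustration (initialization from random samples and the iteration count are our choices), making explicit that step 1 is the 1-sparse coding $\alpha_i = e_j$:

```python
import numpy as np

def kmeans_vq(Y, K, n_iter=20, seed=0):
    """Two-step K-means for codebook learning: (1) assign each sample
    (column of Y) to its nearest codeword -- the 1-sparse coding step;
    (2) update each codeword to the mean of its assigned samples."""
    rng = np.random.default_rng(seed)
    D = Y[:, rng.choice(Y.shape[1], K, replace=False)].copy()
    for _ in range(n_iter):
        # step 1: nearest-codeword assignment (alpha_i = e_j)
        d2 = ((Y[:, None, :] - D[:, :, None]) ** 2).sum(axis=0)  # (K, N)
        assign = np.argmin(d2, axis=0)
        # step 2: update codewords to cluster means (MSE decreases)
        for k in range(K):
            members = Y[:, assign == k]
            if members.shape[1] > 0:
                D[:, k] = members.mean(axis=1)
    return D, assign
```

On two well-separated point clusters, the procedure converges to one codeword per cluster regardless of which samples seed the codebook.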
In [Aharon2006], the sparse representation problem is viewed as a generalization of the VQ problem, and the K-SVD algorithm is proposed to learn the dictionary by generalizing the K-means algorithm. If $T$ in Eqn. (2) is set to 1, the sparse representation problem reduces to the VQ problem as a special case. In the sparse coding step of K-SVD, any pursuit algorithm (such as the orthogonal matching pursuit (OMP) algorithm [Pati1993] or the basis pursuit (BP) algorithm [Chen2001]) can be adopted. In the second step, K-SVD updates one column of $D$ at a time while keeping the other columns fixed. The updated column and the new representation coefficients are obtained by minimizing the MSE using singular value decomposition (SVD). K-SVD is characterized by updating the columns of $D$ sequentially while updating the relevant coefficients simultaneously. Though some other learning approaches apply similar two-step procedures, K-SVD is distinctive in that it updates each column separately, which achieves better efficiency.
K-SVD is fast and efficient, and performs well in practice [Robinstein2010]. It has been widely used in applications such as image processing, face recognition, and object recognition. In this paper, we use K-SVD to learn a dictionary that is used to compute per-pixel sparse codes when extracting the HSC features. In the sparse coding step of dictionary learning, the OMP algorithm is adopted. In the dictionary update step, one SVD is performed to update each atom, and the procedure repeats $K$ times for a dictionary with $K$ atoms.
For learning common structures, we use a set of image patches as the training set, and the dictionary is learned by K-SVD. The Berkeley Segmentation Dataset and Benchmark (BSDS) [Martin2001] is used to learn the dictionary: we randomly sample about 1000 image patches from each image in BSDS, and all the sampled patches form the training set. Once the dictionary is learned, the sparse codes at each pixel in an image pyramid are computed using OMP. When learning the dictionaries, the patch size and the dictionary size can be set to different values (we evaluate their influence in Section 5.2.2), and we fix them to the best-performing values in this paper. The learned dictionaries with four patch sizes are shown in Figure 2. We can see that, with a larger patch size, the dictionary can represent richer and more detailed structures.
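The patch sampling step can be sketched as follows. This is an illustration only: the patch size, the per-image count (the text says "about 1000"), and the per-patch mean removal are our assumptions, not specifics from the paper:

```python
import numpy as np

def sample_patches(images, patch_size=5, per_image=1000, seed=0):
    """Randomly crop square patches from a list of grayscale images
    (2-D arrays) and stack them as columns of the training matrix Y,
    removing each patch's mean (DC component), a common preprocessing
    step before dictionary learning."""
    rng = np.random.default_rng(seed)
    patches = []
    for img in images:
        h, w = img.shape
        for _ in range(per_image):
            y = rng.integers(0, h - patch_size + 1)
            x = rng.integers(0, w - patch_size + 1)
            p = img[y:y + patch_size, x:x + patch_size].astype(float).ravel()
            patches.append(p - p.mean())       # zero-mean patch vector
    return np.column_stack(patches)            # shape: (patch_size**2, N)
```

The resulting matrix `Y` is what the K-SVD learner above consumes.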
3.2 Aggregation into Histograms of Sparse Codes
With the learned sparse codes at each pixel, we aggregate them into histograms using a strategy similar to HOG. In extracting the HSC features, a character candidate is divided into small cells, and a feature vector is computed for each cell. Denote the sparse codes at a pixel by $\alpha$. For each non-zero element in $\alpha$, its absolute value is assigned to one of the four spatially-surrounding cells using bilinear interpolation. In each cell, a feature vector $h$ is obtained by averaging the codes in a neighborhood; these features are called the HSC features in [Ren2013]. The vector $h$ is normalized by its $\ell_2$ norm. Finally, to increase the discriminative power of HSC, each value $v$ is transformed using the Box-Cox transformation [Fukunaga1990] as follows:

$$v' = v^{\lambda}. \tag{6}$$

We experimentally set the value of $\lambda$ to 0.25.
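The aggregation step can be sketched as follows, assuming the per-pixel codes are already computed and stored in an array of shape (H, W, K). For brevity this sketch pools codes over hard, non-overlapping cells and omits the bilinear soft-assignment to neighboring cells; the cell size is our choice:

```python
import numpy as np

def hsc_cells(codes, cell=8, lam=0.25, eps=1e-8):
    """Aggregate per-pixel sparse codes (H, W, K) into per-cell
    histograms: average the absolute codes over each cell, L2-normalize
    the cell vector, then apply the power transform v -> v**lam."""
    H, W, K = codes.shape
    feats = []
    for cy in range(0, H - cell + 1, cell):
        for cx in range(0, W - cell + 1, cell):
            h = np.abs(codes[cy:cy + cell, cx:cx + cell, :]).mean(axis=(0, 1))
            h = h / (np.linalg.norm(h) + eps)   # L2 normalization
            feats.append(h ** lam)              # power transform, lam = 0.25
    return np.concatenate(feats)                # final HSC descriptor
```

The concatenation of all cell vectors forms the descriptor that is fed to the character classifier.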
Figure 3 shows the HSC and HOG features for some character samples as well as some non-character samples extracted from the background regions. From the figure, we can see that the HSC features capture richer structural information, and can better localize local patterns (such as textures, edges, corners, etc.) in each cell. In contrast, the HOG features capture less structural information and the edges in HOG may be off center.
4 The Word Recognition Method
Figure 4 shows the main steps of the proposed method for word recognition. There are two main steps: character detection and classification, and word formation using the pictorial structures (PS) model. In the first step, for detecting and classifying character candidates, we train a character classifier based on the HSC features using a training set of character samples, and detect potential character candidates using a multi-scale sliding window strategy with this classifier. Non-maximum suppression (NMS) [Felzenszwalb2010] is adopted to obtain the final character detection result (see Section 4.1 for more details).
After character detection and classification, character candidates are concatenated to form words. In this paper, we apply the lexicon-driven pictorial structures (PS) model [Felzenszwalb2005], similar to that used in [WangK2011], to form words. In object recognition using the PS model [Felzenszwalb2005], an object is represented by a collection of parts with connections between certain pairs of parts. For example, for faces, the parts are features such as the eyes, nose and mouth, and the connections can be described as the relative locations of these features. For people, the parts are the limbs, torso and head, and the connections allow for articulation at the joints. The PS model is suitable for the task of finding objects, where the best match of a PS model (related to the object) to an image is found. Hence, the PS model is an appealing model for word recognition, where characters can be viewed as the parts of a word and geometric relationships can be considered as the connections between the parts. In Figure 4, the bounding boxes of the characters of each candidate recognition result are provided. From the figure, we can see that each word can be considered as a collection of characters, with each character being a part of the word.
In this paper, we use the word spotting strategy for word recognition. This means that for each word image, a lexicon consisting of a list of words is given, and word recognition is accomplished by finding the word that best matches the image. The matching score between a word and an image is given by an objective function that integrates character classification scores and geometric connections between characters based on the PS model. The best match between a word and an image is found using the Dynamic Programming (DP) algorithm based on the objective function; Section 4.2 gives more details. In Figure 4, the matching results of three words (“MADE”, “TRADE”, and “MAN”) against the image are shown. The value attached to each candidate recognition result is the value of the objective function, which evaluates the matching score between the input image and the given word. The word that yields the highest score is chosen as the recognition result.
4.1 Character Detection and Classification
Character detection aims to detect character candidates using a character classifier. In this paper, we apply the HSC features for character feature representation and train the character classifier on them using a supervised procedure. For each character class, the HSC features of the training samples are extracted using the learned dictionary, and the HSC features of all training samples are then fed into a classifier (such as FERNS or SVM) to train the multi-class character classifier.
The character detection algorithm is shown in Algorithm 1, which contains two steps: character candidate generation and classification, and non-maximum suppression (NMS) for obtaining the final character detection result. For each detected character candidate, the location of the candidate, its corresponding character class and its classification score with respect to that class are retained.
In the first step, the character candidates of each character class are detected separately. That is, a candidate whose classification score with respect to a character class is larger than a threshold is considered a character candidate of that class. In our experiments, we set the threshold empirically to maintain a moderate number of character candidates, guaranteeing that all true character regions are included while keeping the search space in the word formation step small (the more character candidates are maintained, the larger the search space is). In the second step, non-maximum suppression (NMS) is adopted for character detection. We use a simple greedy heuristic, similar to [Felzenszwalb2005, Felzenszwalb2010], that operates on all the character candidates in order: we first sort the character candidates in descending order of their classification scores; then the NMS operation iterates over all the candidates, and if a candidate has not yet been suppressed, we suppress all of its neighbors that highly overlap with it.
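The greedy NMS heuristic can be sketched as follows. This is an illustrative version with an intersection-over-union overlap criterion; the candidate tuple layout and the threshold are our assumptions:

```python
def nms(candidates, overlap_thresh=0.5):
    """Greedy non-maximum suppression over character candidates.
    Each candidate is (x, y, w, h, score); keep the highest-scoring
    candidates and suppress neighbors whose IoU exceeds the threshold."""
    def iou(a, b):
        ax, ay, aw, ah, _ = a
        bx, by, bw, bh, _ = b
        ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
        iy = max(0, min(ay + ah, by + bh) - max(ay, by))
        inter = ix * iy
        union = aw * ah + bw * bh - inter
        return inter / union if union > 0 else 0.0

    kept = []
    # process candidates in descending score order; a candidate survives
    # only if it does not highly overlap any already-kept candidate
    for c in sorted(candidates, key=lambda c: c[4], reverse=True):
        if all(iou(c, k) <= overlap_thresh for k in kept):
            kept.append(c)
    return kept
```

With two heavily overlapping detections and one distant detection, only the higher-scoring member of the overlapping pair survives.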
4.2 The Word Recognition Model
In the character detection and recognition step, character candidates of each character class are maintained. In the word formation step, character candidates are used to form the word. As indicated previously, the match of each word in the lexicon to the image is found using the DP algorithm. The procedure for concatenating character candidates corresponding to a given word is as follows. Let $w = (c_1, c_2, \ldots, c_n)$ be a word with $n$ characters (each $c_i$ denotes a character class). Let $l_1$ be one of the character candidates of the character class $c_1$, $l_2$ be one of the character candidates of the character class $c_2$, and so forth. Then the character candidate sequence $L = (l_1, l_2, \ldots, l_n)$ is a detected word candidate of $w$, which is called a configuration of $w$ [Felzenszwalb2005, WangK2011]. All configurations are evaluated by an objective function that integrates the character classification scores and the geometric relationships between pairs of adjacent characters. The optimal configuration is obtained using the DP algorithm based on the objective function. We also use simple rules to reduce the search space, for example: the horizontal/vertical distance between the candidates of two successive characters ($l_i$ and $l_{i+1}$) should not be larger than three times the width of $l_i$, and the candidates of two successive characters should not highly overlap.
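The DP search over configurations can be sketched as follows. Since the objective decomposes into per-character scores and pairwise terms between adjacent characters, a Viterbi-style recursion finds the optimal configuration. The function names and score interfaces are ours, for illustration:

```python
def best_configuration(cand, unary, pair):
    """Dynamic programming over per-character candidate lists.
    cand[i]     : list of candidate detections for the i-th character class
    unary(i, l) : classification score of candidate l for class i
    pair(a, b)  : geometric compatibility score of successive candidates
    Returns (best total score, best candidate sequence)."""
    n = len(cand)
    # score[j] = best score of a partial configuration ending at cand[i][j]
    score = [unary(0, l) for l in cand[0]]
    back = [[None] * len(c) for c in cand]
    for i in range(1, n):
        new = []
        for j, l in enumerate(cand[i]):
            prev, s = max(
                ((k, score[k] + pair(cand[i - 1][k], l))
                 for k in range(len(cand[i - 1]))),
                key=lambda t: t[1])
            back[i][j] = prev
            new.append(s + unary(i, l))
        score = new
    # backtrack from the best final candidate
    j = max(range(len(score)), key=lambda j: score[j])
    best, seq = score[j], []
    for i in range(n - 1, -1, -1):
        seq.append(cand[i][j])
        if i > 0:
            j = back[i][j]
    return best, seq[::-1]
```

The complexity is linear in the word length and quadratic in the number of candidates per character, which is why pruning rules that shrink the candidate lists matter.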
For word recognition, we design a new objective function based on the PS model [Felzenszwalb2010] to evaluate each word in the lexicon. The PS model has been used for scene text recognition in [WangK2011] to find an optimal configuration of a given word. However, in [WangK2011], the objective function does not consider the word length, which has the drawback that words’ scores are influenced by their lengths, so words of different lengths are not comparable. We improve it by considering the word length in the objective function. Since the word spotting strategy provides the list of words (which means that the prior probabilities of the words in the lexicon are identical), we do not integrate an English language model into the objective function.
Let s(d_i) be the classification score of candidate d_i given by a character classifier as in Eqn. (7). For a word w with configuration (d_1, d_2, …, d_n), the objective function is designed as follows:

S(w) = Σ_{i=1}^{n} s(d_i) + α Σ_{i=1}^{n−1} g(d_i, d_{i+1}) − β · n,   (8)
where g(d_i, d_{i+1}) is a geometric model that evaluates the compatibility of d_i and d_{i+1} under geometric constraints, and α and β are the coefficients that balance the contributions of the geometric model and the word length to the objective function. The last term in (8) can be viewed as a penalty term, used to overcome the bias caused by long words. The geometric model is implemented as a linear SVM classifier, and the features extracted for it include the scale similarity of the two candidates, their overlap, the distances between their top positions and their bottom positions, etc. The parameters (i.e., α and β) in the objective function in Eqn. (8) are learned using the Minimum Classification Error (MCE) training method, which has been widely used in speech recognition Juang1997 () and handwriting recognition Biem2006 () Liu2004c () Wang2012a (), as introduced in the following section.
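The configuration search described above can be sketched as a Viterbi-style dynamic program: pick one candidate per character class so that the sum of character scores plus pairwise geometric scores, minus the length penalty, is maximal. The following is a minimal illustration only; the candidate representation, the `geom_score` callback, and the coefficient names `alpha`/`beta` are assumptions for exposition, not the paper's implementation.

```python
def best_configuration(cands_per_char, geom_score, alpha=1.0, beta=0.5):
    """cands_per_char[i] = list of (score, candidate) for the i-th character
    class of the word.  Returns (best_total_score, best_candidate_sequence)."""
    n = len(cands_per_char)
    # dp[i][j] = best partial score ending with candidate j of character i;
    # back[i][j] remembers the argmax predecessor for backtracking.
    dp = [[s for s, _ in cands_per_char[0]]]
    back = [[-1] * len(cands_per_char[0])]
    for i in range(1, n):
        row, brow = [], []
        for s_j, cand_j in cands_per_char[i]:
            best, arg = float("-inf"), -1
            for k, (_, cand_k) in enumerate(cands_per_char[i - 1]):
                v = dp[i - 1][k] + alpha * geom_score(cand_k, cand_j)
                if v > best:
                    best, arg = v, k
            row.append(best + s_j)
            brow.append(arg)
        dp.append(row)
        back.append(brow)
    j = max(range(len(dp[-1])), key=lambda t: dp[-1][t])
    total = dp[-1][j] - beta * n          # word-length penalty, as in Eqn. (8)
    seq = []
    for i in range(n - 1, -1, -1):        # backtrack the optimal configuration
        seq.append(cands_per_char[i][j][1])
        j = back[i][j]
    return total, list(reversed(seq))
```

The cost is O(n · m²) for m candidates per character, which is why the pruning rules above (distance and overlap constraints) matter in practice.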
4.2.1 Parameter Learning
In word-level MCE training, the coefficients in Eqn. (8) are estimated using a training dataset of N scene text images, denoted by {(x_r, w_r, D_r)}, r = 1, …, N, where x_r is the word image, w_r (with n_r characters) is the ground-truth transcript of x_r, and D_r is the ground-truth character candidate sequence for the character class sequence of w_r.
Following the MCE training procedure Juang1997 (), the misclassification measure on a cropped word image sample x is estimated by:

d(x; Λ) = −g_c(x; Λ) + g_r(x; Λ),
where Λ is the parameter set, g_c(x; Λ) is the discriminant function for the ground-truth word w_c and its ground-truth configuration, and g_r(x; Λ) is the discriminant function of the closest rival word and its optimal configuration:

g_r(x; Λ) = max_{w ∈ L, w ≠ w_c} g(x, w; Λ),
where L is the lexicon provided for the image x. The misclassification measure is then transformed by the sigmoidal function as:

ℓ(x; Λ) = 1 / (1 + exp(−ξ · d(x; Λ))),
where ξ is a parameter that controls the hardness of the sigmoidal nonlinearity and is usually set to 1. The parameters in MCE training are learned using the stochastic gradient descent (SGD) algorithm Robbins1951 () on each input sample.
For scene text recognition, the discriminant function is the objective function designed in Eqn. (8). The rival word is the word most confusable with the correct word, and is obtained using the beam search algorithm Wang2012a (). The parameters are updated iteratively by SGD as follows:

Λ(t+1) = Λ(t) − ε(t) ∇ℓ(x; Λ)|_{Λ = Λ(t)},
where ε(t) is the learning rate.
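A single MCE/SGD step can be sketched as follows. The decomposition of the discriminant into a character-score term, a geometric term weighted by alpha, and a length penalty weighted by beta mirrors Eqn. (8), but the function and field names here are illustrative assumptions, not the paper's code.

```python
import math

def mce_update(params, correct, rival, xi=1.0, lr=0.1):
    """One word-level MCE/SGD step on the coefficients (alpha, beta).
    correct/rival: dicts with precomputed 'char' (sum of character scores),
    'geom' (sum of pairwise geometric scores), and 'len' (word length)."""
    alpha, beta = params

    def g(w):  # discriminant, mirroring Eqn. (8)
        return w["char"] + alpha * w["geom"] - beta * w["len"]

    d = -g(correct) + g(rival)              # misclassification measure
    l = 1.0 / (1.0 + math.exp(-xi * d))     # sigmoidal MCE loss
    dl_dd = xi * l * (1.0 - l)              # derivative of the sigmoid
    # partial derivatives of d with respect to the two coefficients
    dd_dalpha = -correct["geom"] + rival["geom"]
    dd_dbeta = correct["len"] - rival["len"]
    alpha -= lr * dl_dd * dd_dalpha
    beta -= lr * dl_dd * dd_dbeta
    return (alpha, beta), l
```

When the correct word already scores well above the rival, d is strongly negative, the sigmoid saturates near zero, and the update is tiny; boundary cases drive most of the learning, which is the intended behavior of MCE.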
We evaluate the proposed HSC-based scene text recognition method on several popular datasets, including the ICDAR2003 Lucas2003 (), ICDAR2011 Shahab2011 (), Street View Text (SVT) WangK2011 (), and IIIT5K-Word Mishra2012b () datasets. The ICDAR2003 dataset contains 507 natural scene images in total (258 training images and 249 test images). The images are annotated at the character level, so both characters and words can be cropped from them. The ICDAR2011 dataset contains 229 images for training and 255 images for testing. The SVT dataset is composed of 100 training images and 249 test images. For the ICDAR2011 and SVT datasets, only words can be cropped because the images are annotated at the word level only. The IIIT5K-Word dataset is the largest and most challenging dataset for word recognition so far; it includes 5,000 word images, of which 2,000 are used for training and 3,000 for testing. For fair comparison with previous works, we ignore non-alphanumeric characters and words with 2 or fewer characters in word recognition when using the ICDAR2003, ICDAR2011, and SVT datasets. The details of these datasets are shown in Table 1.
Our experiments are conducted on a PC with an Intel(R) Core(TM) i7-2670QM CPU 2.20 GHz processor and 8 GB RAM, and are implemented in Matlab R2011a.
5.1 Character Classification
In this section, we first evaluate the performance of the proposed HSC-based character classifier in the task of character classification.
5.1.1 Classifier Training
For training the character classifier, we build a hybrid training set by combining the characters cropped from the training set of the ICDAR2003 dataset (6,185 samples), the training character samples of Chars74K-15 (930 samples; http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/), and the training samples of the synthetic data (60,219 samples) produced by Wang et al. WangK2011 (). For better detection performance, we treat background images as one additional class (the non-character class), so a 63-class classifier is used for character detection. We use background image regions extracted from the training images of the ICDAR2003 dataset as training samples of the non-character class. To enlarge this set, images from the Microsoft Research Cambridge Object Recognition Image Database (http://research.microsoft.com/en-us/downloads/b94de342-60dc-45d0-830b-9f6eff91b301/default.aspx) are added as non-character training samples as well. To evaluate the character classifiers, two test datasets are used: ICDAR03-CH and Chars74K-15. The former consists of the character samples cropped from the test set of the ICDAR2003 dataset, 5,379 character samples in total; the latter contains 930 test samples (15 samples per character class).
We propose to use the HSC features in the character classifier. To test the influence of different features on the performance of the proposed method, we also use the HOG features and the LBP features Ojala2002 () in the character classifier for comparison.
For the HOG features, we use the Vlfeat library Vedaldi2008 () for feature extraction, adopting the variant of the HOG features proposed in Felzenszwalb2010 (). To generate feature vectors of the same length, we first resize each training sample to 48×48 pixels, then divide the image into square cells of size 8×8, obtaining 6×6 = 36 cells. In each cell, a 31-dimensional vector is extracted by aggregating per-pixel gradients (see Felzenszwalb2010 () for more details). Finally, we obtain a feature vector with a dimensionality of 1116 for each sample.
For the HSC features, we also resize each training sample to 48×48 and likewise set the cell size to 8×8. For each cell, a feature vector whose dimensionality equals the dictionary size K is extracted. Since the feature vector in each cell is obtained by averaging the codes in a neighborhood, the cells on the boundary are not used to form the feature vector of the whole image. Hence, the dimensionality of the extracted HSC features for each image sample is 4 × 4 × K, which is 1600 for the default dictionary size K = 100.
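The HSC aggregation scheme can be sketched as follows. Note this is a simplified illustration: per-pixel sparse codes are approximated by soft-thresholded correlations with the dictionary atoms, whereas the paper computes true sparse codes with a K-SVD-learned dictionary; the 48×48 input, 8×8 cells, and boundary-cell dropping match the description above, while the threshold value is an assumption.

```python
import numpy as np

def hsc_features(img, dictionary, cell=8, thresh=0.1):
    """img: 48x48 grayscale array; dictionary: K x (p*p) matrix whose rows
    are unit-norm atoms of p x p patches.  Returns a normalized 4*4*K vector."""
    K = dictionary.shape[0]
    p = int(np.sqrt(dictionary.shape[1]))          # patch side length
    h, w = img.shape
    hist = np.zeros((h // cell, w // cell, K))
    for y in range(h - p + 1):
        for x in range(w - p + 1):
            patch = img[y:y + p, x:x + p].ravel()
            patch = patch - patch.mean()           # remove DC component
            code = dictionary @ patch              # correlations with atoms
            code = np.where(np.abs(code) > thresh, code, 0.0)  # sparsify
            # vote the absolute code into the cell containing the patch center
            cy, cx = (y + p // 2) // cell, (x + p // 2) // cell
            hist[cy, cx] += np.abs(code)
    inner = hist[1:-1, 1:-1]                       # drop boundary cells
    feat = inner.reshape(-1)
    return feat / (np.linalg.norm(feat) + 1e-8)
```

With a 48×48 image, 8×8 cells, and K = 100, the inner 4×4 cells give the 1600-dimensional descriptor described above.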
For the LBP features, we adopt the implementation of Vlfeat as well. Each training sample is also resized to 48×48 and divided into square cells of size 8×8. In each cell, a 58-dimensional vector is extracted, so the dimensionality of the extracted LBP features for each image sample is 2088.
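The three per-sample dimensionalities quoted above follow directly from the cell layout; a quick arithmetic check, assuming the 48×48 input, 8×8 cells, and K = 100 described above:

```python
# Worked check of the feature dimensionalities for a 48x48 sample
# divided into 8x8 cells (6x6 = 36 cells in total).
cells = (48 // 8) * (48 // 8)          # 36 cells
hog_dim = 31 * cells                   # 31-dim vector per cell
lbp_dim = 58 * cells                   # 58-dim vector per cell
hsc_dim = 100 * (6 - 2) * (6 - 2)      # boundary cells dropped, K = 100
assert (hog_dim, lbp_dim, hsc_dim) == (1116, 2088, 1600)
```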
To test the influence of classifiers on the performance of the proposed method, we also evaluate several classifiers: FERNS, linear SVM, and SC. The FERNS classifier is trained as in WangK2011 (). In sparse coding, we learn a dictionary for each character class using the training samples of that class; in classification, an input character pattern is assigned to the class that gives the minimum reconstruction error, similar to Wright2009 (). For implementation, we use the SPArse Modeling Software (SPAMS) developed by Mairal et al. Mairal2010 () for dictionary learning and sparse coding, and classifiers with various numbers of basis vectors (10, 20, 30, 50, and 100, denoted by SC-10, SC-20, …, SC-100, respectively) are evaluated.
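The minimum-reconstruction-error decision rule of the SC classifier can be sketched as below. For simplicity this sketch computes a plain least-squares code rather than a truly sparse one (the paper uses SPAMS for dictionary learning and sparse coding), so it illustrates only the decision rule, not the coding step.

```python
import numpy as np

def sc_classify(x, dictionaries):
    """x: feature vector of dimension d; dictionaries: {label: d x m matrix
    of basis vectors}.  Returns the label whose dictionary reconstructs x
    with the smallest residual."""
    best, best_err = None, float("inf")
    for label, D in dictionaries.items():
        code, *_ = np.linalg.lstsq(D, x, rcond=None)  # least-squares code
        err = np.linalg.norm(x - D @ code)            # reconstruction error
        if err < best_err:
            best, best_err = label, err
    return best
```

With per-class dictionaries of m basis vectors, classification cost grows linearly in both the number of classes and m, which is consistent with the timing behavior of SC-10 through SC-100 reported in Section 5.3.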
We have conducted experiments to evaluate the performance of the character classifier in rejecting non-character regions. We randomly sampled non-character regions (of size 48×48) from the test set of the ICDAR2003 dataset, obtaining about 5,000 non-character samples (disjoint from the non-character samples used for training). We test the performance using the HOG and HSC features and the FERNS, SVM, and SC classifiers. The results show that the rejection accuracy is quite high for SVM and SC (above 96%), while the FERNS classifier performs much worse (about 85%); the HSC features and the HOG features perform comparably. This shows that the classifier can reject most regions that do not resemble characters. Due to the page limit, we do not include these results in the paper.
5.1.2 Classification Results
The character classification results using different classifiers trained with the hybrid training set and different features are shown in Table 2. As we can see, the HSC features outperform the HOG features and the LBP features significantly for all the classifiers on the two datasets. For the FERNS classifier, the increase of classification accuracy is about 14% on ICDAR03-CH and is about 10% on Chars74K-15 when the HSC features are used instead of the HOG features, and the increase of recognition accuracy is about 37% on ICDAR03-CH and is about 27% on Chars74K-15 when the HSC features are used instead of the LBP features. For the other classifiers, the improvements of recognition accuracy using the HSC features are also obvious.
Comparing the performance of different classifiers, we can see that the SC classifiers with more than 30 basis vectors perform better than the FERNS classifier and the SVM classifier on the two datasets. When using the HSC features, the SC-50 classifier yields the best performance which is comparable with that obtained by the SC-100 classifier on the ICDAR03-CH dataset, while the SC-100 classifier performs the best on the Chars74K-15 dataset. When using the HOG features and the LBP features, the SC-100 classifier performs the best on both datasets. Though only using a linear kernel, the SVM classifier shows promising performance for all the three features.
The LBP features perform much worse than the HOG features and the HSC features. This is mainly due to the fact that the LBP features describe texture information, while for character classification, gradients and other structural information (such as edges, corners, etc.) play a more important role. Hence, the performance of the LBP features is poor, indicating that the LBP features are not suitable for character classification. Thus, in the following experiments, we do not consider the LBP features.
5.1.3 Experiments with Individual Training Datasets
For fair comparison with previous works, we also use individual training datasets to train the character classifier. That is, for the ICDAR2003 dataset, we use the training set of the ICDAR2003 dataset to train a classifier, and test the classification accuracy on the test set of the ICDAR2003 dataset. For the Chars74K-15 dataset, we use the training set (15 training samples per character class) of the Chars74K-15 dataset to train a classifier, and evaluate on the test set of the Chars74K-15 dataset. These training and test settings are consistent with those used in previous works Lee2014 ()Campos2009 ()Lucas2003 ().
The classification results using different classifiers trained with individual training datasets and different features are shown in Table 3; they also show the superiority of the HSC features over the HOG features. The difference between the results in Table 2 and those in Table 3 lies in the behavior of the SC classifier. In Table 2, the performance of the SC classifier tends to improve as the number of basis vectors increases. In Table 3, however, this tendency is not obvious, and the performance of SC-100 is even worse (on the Chars74K-15 dataset, the SC-100 classifier does not work at all). This is because for the Chars74K-15 dataset only 15 training samples per character class are available, which causes overfitting when training the SC classifier with 100 basis vectors. Due to the limited number of training samples in the ICDAR2003 and Chars74K-15 datasets, the SC classifier does not benefit much from a larger number of basis vectors. In contrast, when training with the hybrid training set, the number of training samples per class is much larger, and the SC classifier with a larger number of basis vectors performs better.
In this paper, since we are not concerned with the influence of the number of training samples on character classification and word recognition performance, we simply use the classifiers trained with the hybrid training set for the word recognition experiments (see Section 5.2), because training with more samples generally yields better performance. The experiments with individual training datasets are conducted only for fair comparison with previous works on character classification, to illustrate the effectiveness of the HSC features.
We compare the proposed method with several state-of-the-art methods on the two datasets, as shown in Table 4. In the table, we report the classification accuracies of the HSC features with the SC/SVM classifier trained with the hybrid training set or with individual training datasets. The compared methods are: ConvCoHOG+Linear SVM Su2014 (), which uses the ConvCoHOG features and a linear SVM classifier; PHOG+Linear/Chi-Square SVM Tan2014 (), which uses the PHOG features and an SVM classifier with a linear or chi-square kernel; CoHOG+Linear SVM Tian2013 (), which uses the CoHOG features and a linear SVM classifier; HOG+AT+Linear SVM Mishra2013 (), which uses the HOG features (applying affine transforms (AT) to enrich the training samples) and a linear SVM classifier; Feature Pooling+L2 SVM Lee2014 (), which learns informative sub-regions of characters by mid-level feature pooling and adopts an L2-regularized SVM classifier; GHOG+Chi-Square SVM and LHOG+Chi-Square SVM Yi2013b (); HOG+NN and HOG+FERNS WangK2011 (), which use the HOG features with the nearest neighbor (NN) classifier or the FERNS classifier; MKL Campos2009 (), which uses the multiple kernel learning (MKL) approach to combine different features; GB+RBF SVM Campos2009 (), which adopts the Geometric Blur (GB) feature Berg2005 () and an SVM classifier with an RBF kernel; and ABBYY Campos2009 (), which uses the commercial OCR system ABBYY FineReader (http://www.abbyy.com). In the table, the methods that use individual training datasets to train a classifier are marked with an asterisk (*); the methods that use extra training data are not marked.
| Method | Chars74K-15 (%) | ICDAR03-CH (%) |
| --- | --- | --- |
| Proposed HSC+SC (using the hybrid training set) | 68 | 81 |
| Proposed HSC+Linear SVM (using the hybrid training set) | 65 | 77 |
| Proposed HSC+SC (*) | 68 | 79 |
| Proposed HSC+Linear SVM (*) | 68 | 78 |
| Deep CNN Jaderberg2014 () | 80.3 | 91.0 |
| ConvCoHOG+Linear SVM Su2014 () | - | 81 |
| CoHOG+Linear SVM Tian2013 () | - | 79.4 |
| PHOG+Chi-Square SVM Tan2014 () | - | 79 |
| PHOG+Linear SVM Tan2014 () | - | 76.5 |
| HOG+AT+Linear SVM Mishra2013 () | 68 | 73 |
| Feature Pooling+L2 SVM Lee2014 () (*) | 64 | 79 |
| GHOG+Chi-Square SVM Yi2013b () (*) | 62 | 76 |
| LHOG+Chi-Square SVM Yi2013b () (*) | 58 | 75 |
| HOG+NN WangK2011 () (*) | 58 | 52 |
| MKL Campos2009 () (*) | 55 | - |
| HOG+FERNS WangK2011 () (*) | 54 | 64 |
| GB+RBF SVM Campos2009 () (*) | 53 | - |
| ABBYY Campos2009 () (*) | 31 | 21 |
Comparing the competing methods that use individual training datasets, we can see that the proposed method achieves promising performance on the two datasets. Compared to the method proposed in Lee2014 (), the proposed method using individual training datasets shows superior performance on the Chars74K-15 dataset (68% versus 64%) and comparable performance on the ICDAR03-CH dataset (both achieving a classification accuracy of 79%). The method presented in Mishra2013 () enriches the training set by affine transforming the original training data. Compared to Mishra2013 (), when using individual training datasets and the linear SVM classifier, the proposed method achieves much better performance on the ICDAR03-CH dataset (78% versus 73%) and comparable performance on the Chars74K-15 dataset (both achieving a classification accuracy of 68%). Compared to the method proposed in Su2014 (), which uses a hybrid training dataset as well, the proposed method achieves comparable performance (both achieving a classification accuracy of 81%). These results demonstrate the good properties of the HSC features for scene character classification. Using the hybrid training set and the SC classifier, the proposed method achieves better performance than most competing methods on the ICDAR03-CH dataset (a classification accuracy of 81%), and achieves the same performance on the Chars74K-15 dataset. We also note that the Deep CNN method recently proposed in Jaderberg2014 () achieves very high performance, the best so far, mainly due to the advantages of deep learning with a large number of training samples. We emphasize that the proposed HSC features for scene text recognition are much simpler and easier to implement.
5.2 Word Recognition
We also evaluate the word recognition performance of the proposed method using the different character classifiers (i.e., FERNS, SVM, and SC) trained with the hybrid training set (as introduced in Section 5.1). The word images of the SVT, ICDAR2003, ICDAR2011, and IIIT5K-Word datasets are used for evaluation. For the ICDAR2003 and ICDAR2011 datasets, we use a lexicon created from all the words in the test set (denoted by I03-Full and I11-Full, respectively), and a lexicon consisting of the ground-truth word plus 50 random words from the test set (denoted by I03-50 and I11-50, respectively). For the SVT dataset, we use the 50-word lexicons provided by WangK2011 (), denoted by SVT-50. For the IIIT5K-Word dataset, we evaluate word recognition with the 50-word lexicons and with the Medium lexicon (containing 1,000 words for each image) provided by the authors of Mishra2012a () (denoted by III5K-50 and III5K-Med, respectively).
5.2.1 Word Recognition Performance of the Proposed Method Using Different Features
We first evaluate the word recognition performance of the proposed method using the HSC features and the HOG features. For extracting the HSC features, we set the patch size of the dictionary elements to 9×9 and the dictionary size to 100 to better represent character structures and obtain higher performance (see Section 5.2.2 for the influence of the patch size and the dictionary size). The recognition results are shown in Table 5. From the results, we can see that:
The HSC features outperform the HOG features significantly in word recognition. For all the classifiers, the HSC features outperform the HOG features by a large margin on all seven experimental settings (i.e., SVT-50, I03-50, I03-Full, I11-50, I11-Full, III5K-50, and III5K-Med). When using the FERNS classifier, the improvement of the HSC features over the HOG features is about 6%-8% on SVT-50, I03-50, and I11-50, and about 11%-15% on the other settings. These results show that the HSC features represent character features and structures more effectively than the HOG features.
When the SC classifier is used, as the number of basis vectors increases, the performance of the proposed method generally gets better. These results indicate that, the SC classifier with a larger number of basis vectors can better reconstruct a character sample and has more discriminative power.
Using the SC-100 classifier and the HSC features, the proposed method achieves the highest recognition accuracies on I03-50 (92.90%), I03-Full (87.31%), I11-50 (93.87%), I11-Full (89.03%), III5K-50 (86.70%), and III5K-Med (73.63%), which are much higher than those achieved by the state-of-the-art methods (see Section 5.2.3 for more details). On SVT-50, the proposed method achieves the highest recognition accuracy (83.31%) using the SC-50 classifier.
5.2.2 Influence of the Patch Size and the Dictionary Size
In this section, we evaluate the influence of both the patch size of the dictionary elements and the dictionary size on the word recognition performance of the proposed method. In Section 5.2.1, we reported the performance with a 9×9 patch size and a dictionary size of 100 (denoted by HSC9×9(100)). We first test the influence of the patch size by evaluating the proposed method with three smaller patch sizes while fixing the dictionary size to 100. We then test the influence of the dictionary size by evaluating dictionary sizes of 25, 50, and 75 while fixing the patch size to 9×9 (denoted by HSC9×9(25), HSC9×9(50), and HSC9×9(75), respectively). For all these settings, the SC-100 classifier is used.
Table 6 shows the word recognition results of the proposed method with different patch sizes (with the dictionary size fixed at 100). From the table, we can see that enlarging the patch size increases the recognition accuracy, but with diminishing returns: the gain from the smallest patch size to the next is quite large, while each further enlargement yields a smaller gain. On I11-50 and I11-Full, one of the smaller patch sizes even performs slightly better than the 9×9 patch size. These results indicate that, though further enlarging the patch size may yield slightly better performance, the HSC features extracted with the 9×9 patch size already represent most structural information in characters effectively. Hence, we use the 9×9 patch size by default. To better illustrate the influence of the patch size, Figure 5 shows the recognition accuracies of the proposed method with various patch sizes on the different datasets.
Fixing the patch size to 9×9, we evaluate the influence of the dictionary size on the word recognition performance of the proposed method. Table 7 shows the performance with HSC9×9(25), HSC9×9(50), HSC9×9(75), and HSC9×9(100). From the results, we can see that increasing the dictionary size does not necessarily increase the recognition accuracy. However, the performance with a dictionary size of 100 is more stable than with the other three sizes (25, 50, and 75): the recognition accuracies of HSC9×9(100) are only slightly lower than the best results on SVT-50 (achieved by HSC9×9(75)), I03-50 (HSC9×9(50)), I11-50 (HSC9×9(75)), I11-Full (HSC9×9(25)), and III5K-Med (HSC9×9(25)). In contrast, the accuracies with the other three dictionary sizes are less stable. For example, although HSC9×9(25) performs the best on I11-Full and III5K-Med and comparably with the best on III5K-50 (HSC9×9(100)), its performance is much lower than the best results on SVT-50 (HSC9×9(75)), I03-50 (HSC9×9(50)), I03-Full (HSC9×9(100)), and I11-50 (HSC9×9(75)). Thus, we set the dictionary size to 100 by default.
5.2.3 Comparing the Proposed Method with Several State-of-the-art Methods
We also compare the proposed method with several state-of-the-art methods, and show the results in Table 8. We report the performance of the proposed method using the HSC features and the SC-100 classifier. We note that Bissacco et al. Bissacco2013 () and Jaderberg et al. Jaderberg2014 () achieve higher performance than our method and the other competing methods. The method proposed in Jaderberg2014 () achieves the highest performance on SVT-50, I03-50, and I03-Full, mainly due to the high performance of its CNN-based classifier (as indicated in Table 4). However, it is worth noting that both methods use a large amount of additional outside training data, while we use only publicly available training data in this paper. Almazán et al. Almazan2014 () achieve the highest performance on SVT-50, III5K-50, and III5K-Med (87.01%, 88.57%, and 75.60%, respectively) using a different framework called word/label embedding (however, they do not report results on the ICDAR2003 and ICDAR2011 datasets). In the following, we mainly compare the proposed method with the other methods.
| Method | SVT-50 | I03-50 | I03-Full | I11-50 | I11-Full | III5K-50 | III5K-Med |
| --- | --- | --- | --- | --- | --- | --- | --- |
| K. Wang et al. WangK2011 () | 57 | 76 | 62 | – | – | – | – |
| Mishra et al. Mishra2012a () | 73.26 | 81.78 | – | – | – | 68.25 | 55.50 |
| Mishra et al. Mishra2012b () | 73.57 | 80.28 | – | – | – | 66 | 57.5 |
| Novikova et al. Novikova2012 () | 72.9 | 82.8 | – | – | – | – | – |
| T. Wang et al. WangT2012 () | 70 | 90 | 84 | – | – | – | – |
| Shi et al. Shi2013 () | 73.51 | 87.44 | 79.30 | 87.04 | 82.87 | – | – |
| Goel et al. Goel2013 () | 77.28 | 89.69 | – | – | – | – | – |
| Weinmann et al. Weinman2014 () | 78.05 | – | – | – | – | – | – |
| Shi et al. Shi2014a () | 74.65 | 84.52 | 79.98 | – | – | – | – |
| Shi et al. Shi2014b () | 73.67 | 87.83 | 79.58 | 87.22 | 83.21 | – | – |
| Yao et al. Yao2014 () | 75.89 | 88.48 | 80.33 | – | – | 80.2 | 69.3 |
| Lee et al. Lee2014 () | 80 | 88 | 76 | 88 | 77 | – | – |
| Su et al. Su2014b () | 83 | 92 | 82 | 91 | 83 | – | – |
| Almazán et al. Almazan2014 () | 87.01 | – | – | – | – | 88.57 | 75.60 |
| Jaderberg et al. Jaderberg2014 () | 86.1 | 96.2 | 91.5 | – | – | – | – |
From Table 8, we can see that the proposed method outperforms most of the competing state-of-the-art methods on all the datasets. On SVT-50, the recognition accuracy obtained by the proposed method is higher than that achieved by Lee2014 () (83.15% versus 80%). On I03-50 and I03-Full, the proposed method achieves about 3% higher recognition accuracy than WangT2012 (), which uses a CNN as the character classifier. On I11-50 and I11-Full, the proposed method obtains 6%-7% higher recognition accuracy than Shi2013 () and Shi2014b (). Compared to Lee2014 (), the proposed method obtains 6% higher recognition accuracy on I11-50 (93.87% versus 88%) and 12% higher on I11-Full (89.03% versus 77%). On III5K-50 and III5K-Med, the recognition accuracies obtained by the proposed method are much higher than those obtained by Yao2014 () (86.70% versus 80.2% on III5K-50 and 73.63% versus 69.3% on III5K-Med). Compared to Su2014b (), the proposed method performs comparably on SVT-50 and I03-50, and achieves higher performance on I03-Full, I11-50, and I11-Full (87.31% versus 82%, 93.87% versus 91%, and 89.03% versus 83%, respectively). These results demonstrate the effectiveness of the proposed HSC-based scene text recognition method.
5.3 Recognition Speed
In this section, we evaluate the recognition speed of the proposed method in character classification and word recognition, and report the average CPU time per character/word image sample. For character classification, we measure the processing time of feature extraction and feature classification separately. For word recognition, we measure the processing time of the character detection step and the DP search step separately.
5.3.1 The Computational Speed of the Proposed Method in Character Classification
We first compare the processing time of extracting the HSC features and the HOG features. For the HSC features, we evaluate different dictionary sizes with a fixed 9×9 patch size (i.e., HSC9×9(25), HSC9×9(50), HSC9×9(75), and HSC9×9(100)), and different patch sizes with a fixed dictionary size of 100. The CPU time used in extracting the HOG features is 1.3 milliseconds per character sample. Table 9 shows the CPU time used in extracting the HSC features (with different dictionary sizes and patch sizes) per character sample. From the results, we can see that HSC feature extraction is more computationally expensive than HOG feature extraction, and that the cost increases as the patch size or the dictionary size grows.
(Table 9: CPU time of HSC feature extraction per character sample, for the dictionary sizes 25, 50, 75, and 100 with the 9×9 patch size, and for different patch sizes with a dictionary size of 100.)
We then evaluate the CPU time used in classifying the extracted features. For the HSC features, we set the dictionary size to 100, so the HSC features fed to the trained classifier have a dimensionality of 1600 (see Section 5.1.1 for more details); the dimensionality of the HOG features is 1116 (see Section 5.1.1 for more details). The feature classification speed using different features and classifiers is shown in Table 10 (and in Figure 6 for better illustration). We can see that the FERNS classifier is much faster (about 100-1000 times faster) than the other classifiers (though, as shown in Tables 2 and 3, the proposed method using the FERNS classifier achieves much lower accuracy than with the other classifiers). For the SC classifier, increasing the number of basis vectors increases the computation cost. Tables 9 and 10 also show that the feature classification step is much more computationally expensive than the feature extraction step.
5.3.2 The Computational Speed of the Proposed Method in Word Recognition
In this section, we test the CPU time of the proposed method in word recognition. We report the average CPU time of the two steps (i.e., character detection and the DP search) separately. The computation cost of the character detection step depends on the feature extraction method and the character classifier used. For the HSC features, we fix the patch size to 9×9 and the dictionary size to 100. The CPU time of the character detection step with different classifiers is shown in Table 11 (and in Figure 7 for better illustration). From the results, we can see that the proposed method using the FERNS classifier runs much faster than with the other classifiers. When an SC classifier with more than 20 basis vectors is used, the character detection step is slow, mainly due to the high cost of classifying a large number of character candidates.
The CPU time of the DP search step mainly depends on the number of words in the lexicon. Since the lexicons of I03-Full, I11-Full, and III5K-Med are of nearly the same size (about 1,000 words), we report the CPU time using the 50-word lexicons and the Full lexicons without discriminating between datasets. In the experiments, the average CPU time of the DP search step per word image is 1.05 seconds for the 50-word lexicons and 8.68 seconds for the Full lexicons. Since the proposed method adopts the word spotting strategy for word recognition, using the Full lexicons is much slower than using only 50 words, which shows that the word spotting strategy is only suitable for small or medium lexicons.
These results show that, for word recognition, the proposed method using the multi-scale sliding window strategy is still slow when more powerful feature extraction and classification methods are used. One solution is to apply a general object detector to obtain preliminary character candidates and then recognize these candidates using more effective feature extraction and classification algorithms. The general object detector needs to find as many character candidates as possible using only a small number of candidate proposals. An alternative solution is to use dimensionality reduction techniques, such as principal component analysis (PCA), to reduce the feature dimensionality and hence the computation cost of character detection. This would particularly benefit the SC classifier with a larger number of basis vectors. Since in this paper we focus on using sparse coding based features for scene text recognition, we do not evaluate the influence of dimensionality reduction techniques on the performance of the proposed method in word recognition.
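The PCA-based speedup suggested above could look like the following sketch, which fits a projection by SVD of the centered data and maps features to a lower dimension before classification. The sample count, input dimensionality, and target dimension k are illustrative assumptions, not the HSC dimensionalities used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical HSC-like features: 500 samples, 1600 dimensions.
X = rng.standard_normal((500, 1600))

def pca_fit(X, k):
    """Fit PCA by SVD of the centered data; return the mean and the
    top-k principal directions as a (dim, k) projection matrix."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k].T

mean, components = pca_fit(X, k=128)
X_reduced = (X - mean) @ components  # classifiers now operate in 128-D
print(X_reduced.shape)
```

Reducing the dimensionality this way shrinks the per-candidate classification cost roughly in proportion to k / dim, at the price of whatever discriminative information the discarded directions carried.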
5.4 Text Recognition Examples
In this section, we show and analyze some recognition results obtained by the proposed method. We first provide recognition results of the proposed method with the HOG features and the HSC features using the SC-100 classifier. Some recognition results on the SVT and III5K-Word datasets are shown in Figure 8 and Figure 9, respectively. In both figures, the upper row shows the results obtained by the proposed method using the HOG features and the lower row shows those using the HSC features. The word images shown are not correctly recognized by the proposed method using the HOG features, but they are successfully recognized using the HSC features.
As shown in Figure 8, the images from the SVT dataset are difficult to recognize due to cluttered backgrounds (e.g., the words “NEUMOS” and “HERTZ”), backgrounds of similar color to the text (e.g., the word “SCHOOL”), low resolution, shadows (e.g., the word “HERTZ”), etc. For some images, such as the word “SCHOOL”, it is difficult even for humans to discriminate the characters from the background, but the proposed method using the HSC features can detect and recognize the characters correctly, showing the robustness of the proposed HSC-based scene text recognition method.
Figure 9 shows that the images from the III5K-Word dataset are difficult to recognize due to colors similar to the background (e.g., the word “GLEN”), the ambiguity caused by irregular characters (e.g., the artistic characters in the words “LGDNEY” and “LOVE”), rotated characters in curved text lines (e.g., the word “WELCOME”), etc. For these words, the proposed method fails to detect the characters and recognize the words when using the HOG features, while it succeeds when using the HSC features. For the word “GLEN”, although the recognition result using the HOG features is incorrect, it is understandable: the result “IL” is reasonable if one views the shadow of the characters as the characters “I” and “L”. In fact, the III5K-Word dataset contains many word images with artistic characters, rotated characters, and curved text lines, which makes it challenging. The HSC features significantly outperform the HOG features for word recognition because they describe much richer information about characters and thus effectively handle the challenges mentioned above.
We also show more word recognition results that are correctly recognized by the proposed method on the ICDAR2003 and ICDAR2011 datasets in Figure 10, and some results that are not correctly recognized on the SVT and III5K-Word datasets in Figure 11. Figure 10 shows that the proposed HSC-based scene text recognition method can robustly and effectively handle the above challenges. Figure 11 shows the limitations of the proposed method that need further effort to address, such as severely rotated characters in curved text lines, heavily distorted artistic characters, and very low discrimination between characters and the background. Since in this paper we use the multi-scale sliding window strategy to generate character candidates, the current method is suited to recognizing near-horizontal/vertical text and does not perform well on curved text lines that contain rotated characters. Hence, exploiting specific methods for recognizing text lines with various orientations will be our future work.
In this paper, we have proposed an effective scene text recognition method using sparse coding based features (i.e., the HSC features), which are extracted by computing per-pixel sparse codes of characters and aggregating the sparse codes to form local histograms. The HSC features can represent much richer structural information of characters, and thus the proposed method using the HSC features significantly outperforms the competing methods using other features (such as HOG, LHOG, GHOG, LBP, GB, etc.) in scene character/text recognition. For word recognition, we propose to integrate character detection results with geometric constraints and use the MCE training method to learn the parameters of the proposed method. Experimental results on several popular datasets (ICDAR2003, ICDAR2011, SVT, and III5K-Word) show the effectiveness and robustness of the proposed method, which outperforms most state-of-the-art methods.
In our future work, two research directions will be considered: one is to use a general object detector to reduce the number of character candidates that need to be recognized by the character classifier; the other is to use dimensionality reduction techniques to speed up the character detection process. Besides, we will also extend the current work (word recognition) to full-image word detection and recognition.
This work was supported by the National Natural Science Foundation of China under Grants 61305004, 61472334, and 61170179, and by the China Postdoctoral Science Foundation under Grant 2012M521277.
- (1) K. Wang, B. Babenko, and S. Belongie, End-to-End Scene Text Recognition, Proc. ICCV, pp. 1457-1464, 2011.
- (2) A. Mishra, K. Alahari, and C.V. Jawahar, Top-down and Bottom-up Cues for Scene Text Recognition, Proc. CVPR, pp. 2687-2694, 2012.
- (3) T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, End-to-End Text Recognition with Convolutional Neural Networks, Proc. Int’l Conf. Pattern Recognition, pp. 3304-3308, 2012.
- (4) C. Shi, C. Wang, B. Xiao, Y. Zhang, S. Gao, and Z. Zhang, Scene Text Recognition Using Part-Based Tree-Structured Character Detection, Proc. CVPR, pp. 2961-2968, 2013.
- (5) C. Shi, C. Wang, B. Xiao, S. Gao, and J. Hu, End-to-end Scene Text Recognition Using Tree-Structured Models, Pattern Recognition, 47(9): 2853-2866, 2014.
- (6) C. Shi, C. Wang, B. Xiao, S. Gao, and J. Hu, Scene Text Recognition Using Structure-Guided Character Detection and Linguistic Knowledge, IEEE Trans. on Circuits and Systems for Video Technology, 24(7): 1235-1250, 2014.
- (7) N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, Proc. CVPR, pp. 886-893, 2005.
- (8) P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, Object Detection with Discriminatively Trained Part Based Models, IEEE Trans. Pattern Anal. Mach. Intell., 32(9): 1627-1645, 2010.
- (9) A. Mishra, K. Alahari, and C. V. Jawahar, Image Retrieval Using Textual Cues, Proc. ICCV, pp. 3040-3047, 2013.
- (10) J. Zhang, K. Huang, Y. Yu, and T. Tan, Boosted Local Structured HOG-LBP for Object Localization, Proc. CVPR, pp. 1393-1400, 2011.
- (11) X. Ren and D. Ramanan, Histograms of Sparse Codes for Object Detection, Proc. CVPR, pp. 3246-3253, 2013.
- (12) J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, and Y. Ma, Robust Face Recognition via Sparse Representation, IEEE Trans. Pattern Anal. Mach. Intell., 31(2): 210-227, 2009.
- (13) P. Felzenszwalb and D. Huttenlocher, Pictorial Structures for Object Recognition, International Journal of Computer Vision, 61(1): 55-79, 2005.
- (14) B.-H. Juang, W. Chou, and C.-H. Lee, Minimum Classification Error Rate Methods for Speech Recognition, IEEE Trans. Speech and Audio Processing, 5(3): 257-265, 1997.
- (15) A. Biem, Minimum Classification Error Training for Online Handwriting Recognition, IEEE Trans. Pattern Anal. Mach. Intell., 28(7): 1041-1051, 2006.
- (16) D. Zhang, D.-H. Wang, and H. Wang, Scene Text Recognition Using Sparse Coding based Features, Proc. ICIP, pp. 1066-1070, 2014.
- (17) C. Yao, X. Bai, B. Shi, and W. Liu, Strokelets: A Learned Multi-Scale Representation for Scene Text Recognition, Proc. CVPR, pp. 4042-4049, 2014.
- (18) M. Jaderberg, A. Vedaldi, and A. Zisserman, Deep Features for Text Spotting, Proc. ECCV, pp. 512-528, 2014.
- (19) C. Yi, X. Yang, and Y. Tian, Feature Representations for Scene Text Character Recognition: A Comparative Study. Proc. Int’l Conf. Document Analysis and Recognition, pp. 907-911, 2013.
- (20) S. Tian, S. Lu, B. Su, and C. L. Tan, Scene Text Recognition Using Co-occurrence of Histogram of Oriented Gradients, Proc. ICDAR, pp. 912-916, 2013.
- (21) Z. R. Tan, S. Tian, and C. L. Tan, Using Pyramid of Histogram of Oriented Gradients on Natural Scene Text Recognition, Proc. ICIP, pp. 2629-2633, 2014.
- (22) B. Su, S. Lu, S. Tian, J.-H. Lim, and C. L. Tan, Character Recognition in Natural Scenes Using Convolutional Co-occurrence HOG, Proc. ICPR, pp. 2926-2931, 2014.
- (23) C.-Y. Lee, A. Bhardwaj, W. Di, V. Jagadeesh, and R. Piramuthu, Region-based Discriminative Feature Pooling for Scene Text Recognition, Proc. CVPR, pp. 4050-4057, 2014.
- (24) S. Gao, C. Wang, B. Xiao, C. Shi, and Z. Zhang, Stroke Bank: A High-Level Representation for Scene Character Recognition, Proc. ICPR, pp. 2909-2913, 2014.
- (25) S. Gao, C. Wang, B. Xiao, C. Shi, W. Zhou, and Z. Zhang, Learning Co-occurrence Strokes for Scene Character Recognition based on Spatiality Embedded Dictionary, Proc. ICIP, pp. 5956-5960, 2014.
- (26) A. Coates, B. Carpenter, C. Case, S. Satheesh, B. Suresh, T. Wang, D. Wu, and A. Y. Ng, Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning, Proc. Int’l Conf. Document Analysis and Recognition, pp. 440-445, 2011.
- (27) J. Yang, K. Yu, Y. Gong, and T. Huang, Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification, Proc. CVPR, pp. 1794-1801, 2009.
- (28) K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. LeCun, Learning Convolutional Feature Hierarchies for Visual Recognition, In Advances in Neural Information Processing Systems 23, pp. 1090-1098, 2010.
- (29) A. Mishra, K. Alahari, and C. V. Jawahar, Scene Text Recognition using Higher Order Language Priors, Proc. BMVC, pp. 1-11, 2012.
- (30) T. Novikova, O. Barinova, P. Kohli, and V. Lempitsky, Large-Lexicon Attribute-Consistent Text Recognition in Natural Images, Proc. ECCV, pp. 752-765, 2012.
- (31) V. Goel, A. Mishra, K. Alahari, and C. V. Jawahar, Whole is Greater than Sum of Parts: Recognizing Scene Text Words, Proc. Int’l Conf. Document Analysis and Recognition, pp. 398-402, 2013.
- (32) T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan, Recognizing Text with Perspective Distortion in Natural Scenes, Proc. ICCV, pp. 569-576, 2013.
- (33) L. Neumann and J. Matas, Scene Text Localization and Recognition with Oriented Stroke Detection, Proc. ICCV, pp. 97-104, 2013.
- (34) J. J. Weinman, Z. Butler, D. Knoll, and J. L. Feild, Toward Integrated Scene Text Reading, IEEE Trans. Pattern Anal. Mach. Intell., 36(2): 375-387, 2014.
- (35) B. Su and S. Lu, Accurate Scene Text Recognition Based on Recurrent Neural Network, Proc. ACCV, pp. 35-48, 2014.
- (36) J. Almazán, A. Gordo, A. Fornés, and E. Valveny, Word Spotting and Recognition with Embedded Attributes, IEEE Trans. Pattern Anal. Mach. Intell., 36(12): 2552-2566, 2014.
- (37) J. A. Rodríguez-Serrano, A. Gordo, and F. Perronnin, Label Embedding: A Frugal Baseline for Text Recognition, International Journal of Computer Vision, 113(3): 193-207, 2015.
- (38) M. Aharon, M. Elad, and A. Bruckstein, K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation, IEEE Transactions on Signal Processing, 54(11): 4311-4322, 2006.
- (39) M. Elad and M. Aharon, Image Denoising via Sparse and Redundant Representations Over Learned Dictionaries, IEEE Trans. on Image Processing, 15(12): 3736-3745, 2006.
- (40) Y. Romano, M. Protter, and M. Elad, Single Image Interpolation via Adaptive Nonlocal Sparsity-Based Modeling, IEEE Trans. on Image Processing, 23(7): 3085-3098, 2014.
- (41) Q. Zhang and B. Li, Discriminative K-SVD for Dictionary Learning in Face Recognition, Proc. CVPR, pp. 2691-2698, 2010.
- (42) J. Zheng and Z. Jiang, Learning View-Invariant Sparse Representations for Cross-View Action Recognition, Proc. ICCV, pp. 3176-3183, 2013.
- (43) H. Zhang, Y. Zhang, and T. S. Huang, Simultaneous Discriminative Projection and Dictionary Learning for Sparse Representation based Classification, Pattern Recognition, 46(1): 346-354, 2013.
- (44) Z. Jiang, Z. Lin, and L. S. Davis, Label Consistent K-SVD: Learning a Discriminative Dictionary for Recognition, IEEE Trans. Pattern Anal. Mach. Intell., 35(11): 2651-2664, 2013.
- (45) K. Kreutz-Delgado, J. F. Murray, B. D. Rao, K. Engan, T. Lee, and T. J. Sejnowski, Dictionary Learning Algorithms for Sparse Representation, Neural Computation, 15(2): 349-396, 2003.
- (46) Y. Pati, R. Rezaiifar, and P. Krishnaprasad, Orthogonal Matching Pursuit: Recursive Function Approximation with Applications to Wavelet Decomposition, Proc. Asilomar Conf. on Signals, Systems and Computers, pp. 40-44, 1993.
- (47) S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic Decomposition by Basis Pursuit, SIAM Review, 43(1): 129-159, 2001.
- (48) R. Rubinstein, A. M. Bruckstein, and M. Elad, Dictionaries for Sparse Representation Modeling, Proceedings of the IEEE, 98(6): 1045-1057, 2010.
- (49) D. R. Martin, C. Fowlkes, D. Tal, and J. Malik, A Database of Human Segmented Natural Images and its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics, Proc. ICCV, pp. 416-425, 2001.
- (50) K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd edition, Academic Press, 1990.
- (51) C.-L. Liu and K. Marukawa, Handwritten Numeral String Recognition: Character-Level Training vs. String-Level Training, Proc. Int’l Conf. Pattern Recognition, pp. 405-408, 2004.
- (52) D.-H. Wang, C.-L. Liu, and X.-D. Zhou, An Approach for Real-Time Recognition of Online Chinese Handwritten Sentences, Pattern Recognition, 45(10): 3661-3675, 2012.
- (53) H. Robbins and S. Monro, A Stochastic Approximation Method, Ann. Math. Stat. 22: 400-407, 1951.
- (54) S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, ICDAR 2003 Robust Reading Competitions, Proc. Int’l Conf. Document Analysis and Recognition, 2003.
- (55) A. Shahab, F. Shafait, and A. Dengel, ICDAR 2011 Robust Reading Competition Challenge 2: Reading Text in Scene Images, Proc. Int’l Conf. Document Analysis and Recognition, pp. 1491-1496, 2011.
- (56) T. Ojala, M. Pietikäinen, and T. Mäenpää, Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns, IEEE Trans. Pattern Anal. Mach. Intell., 24(7): 971-987, 2002.
- (57) A. Vedaldi and B. Fulkerson, VLFeat: An Open and Portable Library of Computer Vision Algorithms, 2008, http://www.vlfeat.org/.
- (58) J. Mairal, F. Bach, J. Ponce, and G. Sapiro, Online Learning for Matrix Factorization and Sparse Coding, J. Machine Learning Research, 11: 19-60, 2010.
- (59) T. E. de Campos, B. R. Babu, and M. Varma, Character Recognition in Natural Images, Proc. Int’l Conf. on Computer Vision Theory and Applications, Lisbon, Portugal, 2009.
- (60) A. C. Berg, T. L. Berg, and J. Malik, Shape Matching and Object Recognition Using Low Distortion Correspondences, Proc. CVPR, pp. 26-33, 2005.
- (61) A. Bissacco, M. Cummins, Y. Netzer, and H. Neven, PhotoOCR: Reading Text in Uncontrolled Conditions, Proc. ICCV, pp. 785-792, 2013.