A Single Shot Text Detector with Scale-adaptive Anchors
Abstract.
Currently, most top-performing text detection networks tend to employ fixed-size anchor boxes to guide the search for text instances. They usually rely on a large number of anchors with different scales to discover texts in scene images, which leads to high computational cost. In this paper, we propose an end-to-end box-based text detector with scale-adaptive anchors, which can dynamically adjust the scales of anchors according to the sizes of the underlying texts by introducing an additional scale regression layer. The proposed scale-adaptive anchors allow us to use a small number of anchors to handle multi-scale texts and therefore significantly improve the computational efficiency. Moreover, compared to the discrete scales used in previous methods, the learned continuous scales are more reliable, especially for small text detection. Additionally, we propose Anchor convolution to better exploit the necessary feature information by dynamically adjusting the sizes of receptive fields according to the learned scales. Extensive experiments demonstrate that the proposed detector is fast, taking only 0.28s per image, while outperforming most state-of-the-art methods in accuracy.
1. Introduction
In the past few years, scene text detection and recognition have received a lot of attention from both academia and industry, due to their numerous potential applications in image understanding and computer vision systems. Detecting text in natural scenes is an open issue in computer vision because texts may appear in various forms and the background may be very complex. From a systems perspective, a text detector that can detect individual words directly while being robust to complex backgrounds is preferable, as it greatly simplifies the processing of the later recognizer (Yang et al., 2017).
Owing to this, many recent state-of-the-art text detectors (Jaderberg and Simonyan, 2016; Liao et al., 2016; Zhong et al., 2016) built on advanced general object detection techniques (Ren and He, 2015; Liu and Anguelov, 2016), or box-based text detectors, have been proposed; they take words as the detection targets and thus make the detection of individual words feasible. Generally, they directly output word bounding boxes by jointly predicting text presence and coordinate offsets to anchor boxes (Ren and He, 2015) at multiple scales. In this way, they have remarkably improved detection performance in terms of accuracy and robustness. However, we argue that the current box-based frameworks are still inefficient and unsatisfactory, for two main reasons.
First, it is not efficient to handle multi-scale texts by traversing all possible scales (see Figure 1). The current box-based text detectors employ fixed-size anchors to match the words, and box regression can adjust the sizes of anchors only to some extent; the effect is rather minor. Due to the diversity of text sizes, they have to preset massive numbers of anchor boxes of different scales to match the underlying text shapes, which results in high computational cost. For example, (Liao et al., 2016) uses 6 scales (implemented with layered feature maps) and associates each cell with several fixed-size anchors.
Second, it is unreasonable to match texts of all possible scales with a limited number of anchor boxes at discrete scales. This fact has been observed in (Liao et al., 2016; Zhong et al., 2016). In these works, though multiple scales are adopted to produce multi-scale anchors, some texts are still missed when no appropriately designed scale is applicable. Therefore, fixed-size anchors have become the bottleneck of the box-based text detection framework, though they are widely adopted.
To overcome the above limitations, in this work we propose a novel box-based text detector with scale-adaptive anchors, where the scales of the anchors are dynamically adjusted according to the sizes of the texts. Specifically, we introduce an additional scale regression layer into the basic box-based framework and use it to learn the scales of anchors in an implicit way, such that extra training supervision of object size is avoided. With the proposed scale-adaptive anchors, we only need to preset a few initial anchors of different aspect ratios at a single scale, thus greatly reducing the number of anchors. Meanwhile, the learned scale value is continuous, which makes it more suitable for detecting various texts, especially small ones, than several discrete scales.
Additionally, we argue that when making predictions on a single feature map (i.e., a single scale), the size of the corresponding receptive field should change synchronously as the scale of an anchor is updated. However, for a given anchor, regardless of its size, the standard convolutions in a CNN (Krizhevsky and Sutskever, 2012) can only assign a fixed-size receptive field to it. To tackle this problem, we propose Anchor convolution, which dynamically adjusts the sizes of receptive fields according to the learned scales of anchors, to ensure the integrity and richness of the feature information of each anchor.
To summarize, the contributions of this paper are as follows:

We propose scale-adaptive anchors, which largely reduce the computational cost and improve the robustness against multi-scale texts, especially small ones. The whole framework is end-to-end, simple and easy to train.

We propose Anchor convolution to dynamically adjust the sizes of receptive fields, to ensure the integrity and richness of the feature information of each anchor.

We evaluate the proposed method on two real-world text detection datasets, i.e., ICDAR11 (Shahab et al., 2011) and ICDAR13 (Karatzas and Shafait, 2013), and demonstrate that while keeping accuracy competitive with the state of the art, it is more efficient, taking only 0.28s per image, which is important for real systems, especially mobile applications.
2. Related Work
Scene text detection has been extensively studied in the computer vision community for years, and a large number of methods have been proposed. Traditional methods (Epshtein et al., 2010; Huang and Lin, 2013; Ye and Doermann, 2015; Huang et al., 2014) usually deal with this issue by first detecting individual characters or coarse text regions and then applying sequential grouping or segmentation to form text lines or blocks. However, such post-processing steps are difficult to design because they require exploring many low-level image cues and various heuristic rules, which also makes the whole system highly complicated and unreliable.
Owing to the strong representation capability of deep Convolutional Neural Networks (CNNs), deep learning based methods have developed rapidly. A number of recent approaches were built on Fully Convolutional Networks (Long et al., 2015), treating text detection as a semantic segmentation problem. For example, Yao et al. (Yao et al., 2016) propose to run the algorithm directly on full images and produce global, pixel-wise prediction maps, from which detections are subsequently formed. Zhang et al. (Zhang et al., 2016) propose the Text-Block FCN for generating text salient maps and the Character-Centroid FCN for predicting the centroids of characters. However, the current FCN-based methods fail to produce accurate word-level predictions with a single model, and they still require multiple bottom-up steps to construct words.
Recently, inspired by the great progress of deep learning methods (Ren and He, 2015; Liu and Anguelov, 2016) for general object detection, many box-based text detectors have been proposed and have advanced the performance of text detection considerably. For example, Jaderberg et al. (Jaderberg and Simonyan, 2016) propose an R-CNN-based (Girshick, 2015a) framework, which first generates word candidates with a proposal generator and then adopts a convolutional neural network (CNN) to refine the word bounding boxes. Liao et al. (Liao et al., 2016) propose an end-to-end trainable network named TextBoxes, which directly outputs word bounding boxes by jointly predicting text presence and coordinate offsets to anchor boxes (Ren and He, 2015) at multiple scales. However, most of them rely on presetting a large number of anchors with different scales to discover texts in scene images, thus leading to high computational cost.
3. Approach
In this section, we propose a novel end-to-end text detector, which can automatically adjust the sizes of both anchors and receptive fields according to the scales of texts. The whole framework is illustrated in the top row of Figure 1. Initially, an input image is forwarded through the convolution layers of VGG16 (Simonyan and Zisserman, 2014) to produce feature maps. We add an additional scale regression layer behind the feature maps to generate a scale map, which indicates the text size at each location. Next, the scale map is used to produce scale-adaptive anchors and flexible-size receptive fields. Finally, these anchors are classified and refined via the detection module, which contains a classification layer and a bounding-box regression layer, similar to the SSD detector (Liu and Anguelov, 2016). The details of each phase are described in the following.
3.1. Scale-adaptive Anchors
The basic idea of a box-based detector is to associate a set of anchor boxes with every map location to coarsely match the ground-truth texts, and then predict both classification scores and shape offsets for each anchor to obtain the final text locations. In natural scene images, texts usually appear at various scales. To match multi-scale texts, most current box-based detectors tend to employ multiple feature maps from different levels to make detection predictions simultaneously.
For example, as shown in the bottom of Figure 1, TextBoxes (Liao et al., 2016) adds 6 convolutional feature layers (green layers), decreasing in size progressively, to the end of the VGG16 model, which allows predictions of detections at 6 different scales. To better describe this, we present its working principle in the bottom of Figure 2, where the pictures correspond to the green layers in Figure 1. We can see that different layers represent different scales, and the anchors of the former layers are used to match small-scale texts. Although searching over all possible anchors in this way can handle most multi-scale texts, it is obviously inefficient.
Different from previous box-based frameworks, we add an additional scale regression layer behind the feature maps (as shown in Figure 1) to generate a scale map with one channel. The scale map has the same size as the feature maps and is used to encode the predicted text size at each location. Then, the scale map is used to obtain the scale-adaptive anchors.
The top row of Figure 2 gives the working principle of the proposed scale-adaptive anchors. At each cell of the feature map, we place several initial anchors. Then, with the generated scale map, the anchors of each cell can be enlarged or shrunk according to the assigned scale values.
Specifically, for a given anchor, we denote its initial size as $(x_a, y_a, w_a, h_a)$, where $(x_a, y_a)$ are the coordinates of its center, $w_a$ is the width and $h_a$ is the height. Suppose the learned scale corresponding to this anchor is $s$; then the updated anchor's size $(x_a, y_a, \hat{w}_a, \hat{h}_a)$ can be computed as follows:

(1)  $\hat{w}_a = s \cdot w_a, \quad \hat{h}_a = s \cdot h_a$
The detailed learning process will be introduced in Section 3.3.
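As an illustrative sketch of the anchor update (assuming, as described above, that the learned scale multiplies only the anchor's width and height while the center stays fixed; the function and tuple layout are our own):

```python
def update_anchor(anchor, s):
    """Apply a learned scale s to one anchor: the center (x, y) is
    kept, and the width and height are both multiplied by s."""
    x, y, w, h = anchor
    return (x, y, s * w, s * h)

# An 8x4 anchor centered at (16, 16), enlarged by a learned scale of 2.5
print(update_anchor((16, 16, 8, 4), 2.5))  # -> (16, 16, 20.0, 10.0)
```

With a continuous $s$, the same initial anchor can match anything from tiny to very large words, which is what removes the need for many preset scales.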
Given an input image, previous box-based methods tend to preset all possible anchors, and the number of anchors can be represented as:

$N_{\text{prev}} = \sum_{i=1}^{m} h_i \times w_i \times k$

where $m$ is the number of green layers (as illustrated in Figure 1; the $m$ of TextBoxes is 6), $h_i \times w_i$ represents the size of the $i$-th feature map, and $k$ represents the number of anchors in each cell; $m$ is a constant. Besides, the anchors preset in each cell of a feature map have different aspect ratios (e.g., 1, 3, 5, 7, 10), so $k$ is also a constant. Therefore, the time complexity of this kind of algorithm is $O(m \cdot h \cdot w \cdot k)$.
As shown in Figure 1, we only employ one single layer to handle multi-scale texts, so the number of anchors of our detector can be represented as:

$N_{\text{ours}} = h \times w \times k$

where $h \times w$ represents the size of the feature map and the setting of $k$ is the same as above. Therefore, our algorithm reduces the time complexity from $O(m \cdot h \cdot w \cdot k)$ to $O(h \cdot w \cdot k)$, i.e., by a factor of $m$, and thus improves the computational efficiency. More importantly, we propose the adaptive scale to replace the previous search strategy, which traverses all possible scales (much like exhaustive search), and this allows other box-based methods to handle multi-scale texts in a more efficient way.
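A back-of-the-envelope comparison can make the saving concrete. The feature-map sizes and the per-cell anchor count below are illustrative placeholders, not the paper's exact configuration:

```python
def num_anchors_multilayer(map_sizes, k):
    """Anchors preset by a multi-layer detector: k anchors per cell
    on each of the m prediction layers."""
    return sum(h * w * k for (h, w) in map_sizes)

def num_anchors_single_layer(map_size, k):
    """Anchors used by a single-layer detector with scale-adaptive
    anchors: one prediction layer only."""
    h, w = map_size
    return h * w * k

# Illustrative pyramid of 6 layers vs. a single layer, 5 anchors per cell
sizes = [(64, 64), (32, 32), (16, 16), (8, 8), (4, 4), (2, 2)]
print(num_anchors_multilayer(sizes, 5))       # -> 27300
print(num_anchors_single_layer((64, 64), 5))  # -> 20480
```

Even in this toy setting the single-layer count is smaller, and the gap grows with the number of extra prediction layers a pyramid detector adds.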
3.2. Anchor Convolution
In the previous section, we proposed a novel scale regression layer for dynamically adjusting the scales of anchors. In this section, we propose Anchor convolution to dynamically adjust the sizes of receptive fields and thus exploit scaled features for each anchor.
The standard convolution (Krizhevsky and Sutskever, 2012) assigns a fixed-size receptive field to each anchor and computes a feature vector for later classification and box regression. Different from standard convolution, the proposed Anchor convolution dynamically adjusts the sizes of receptive fields according to the scales of anchors and extracts the necessary feature information, improving the performance of the subsequent classification and regression. Next, we introduce its working principle (see Figure 3).
In the convolution layers, each pixel of the output feature maps corresponds to a fixed-size receptive field P (P is part of the input layer, also known as the convolutional patch). Generally, a total of $k_h \times k_w$ elements are selected from P to construct a feature vector and perform convolution operations with the kernel, where $k_h$ and $k_w$ are the height and width of the kernel, respectively.
In standard convolution, for each pixel, the size of the corresponding P is $((k_h - 1) d_h + 1) \times ((k_w - 1) d_w + 1)$, where P is a rectangular receptive field with center coordinates $(x_0, y_0)$ and $d_h$, $d_w$ are the dilation parameters. For integer $i \in [-\lfloor k_h/2 \rfloor, \lfloor k_h/2 \rfloor]$ and integer $j \in [-\lfloor k_w/2 \rfloor, \lfloor k_w/2 \rfloor]$, the coordinates of the selected elements from P are:

(2)  $x_j = x_0 + j \cdot d_w, \quad y_i = y_0 + i \cdot d_h$

Let I denote the feature vector constructed by these elements. Then, I is used to perform element-wise multiplication with the kernel.
For Anchor convolution, suppose the output layer has the same size as the scale map and all channels share one scale map. Thus, each pixel of the output map corresponds to a scale coefficient $s$. Let the receptive field here be P′, which is also a rectangle with the same center as P. Differently, the size of P′ changes to $(s(k_h - 1) d_h + 1) \times (s(k_w - 1) d_w + 1)$ along with the scale coefficient $s$. Then, a total of $k_h \times k_w$ elements are selected from P′ to construct I. Inspired by (Zhang et al., 2017), the coordinates of these elements are changed to:

(3)  $x_j = x_0 + s \cdot j \cdot d_w, \quad y_i = y_0 + s \cdot i \cdot d_h$
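The scaled sampling grid of Eq. 3 can be sketched as follows (a minimal illustration assuming the grid is stretched about the patch center by the scale coefficient, as described above; the function and parameter names are ours):

```python
def sampling_coords(x0, y0, kh, kw, dh, dw, s=1.0):
    """Sampling locations of a (kh x kw) kernel centered at (x0, y0)
    with dilations (dh, dw). s = 1 reproduces a standard dilated
    convolution; other values of s stretch or shrink the grid about
    the center, as in Anchor convolution."""
    coords = []
    for i in range(-(kh // 2), kh // 2 + 1):
        for j in range(-(kw // 2), kw // 2 + 1):
            coords.append((x0 + s * j * dw, y0 + s * i * dh))
    return coords

# 1x3 kernel at (10, 10), dilation 1: standard grid vs. scale 2.0
print(sampling_coords(10, 10, 1, 3, 1, 1))        # -> [(9.0, 10.0), (10.0, 10.0), (11.0, 10.0)]
print(sampling_coords(10, 10, 1, 3, 1, 1, s=2.0)) # -> [(8.0, 10.0), (10.0, 10.0), (12.0, 10.0)]
```

Note that for non-integer $s$ the resulting coordinates are fractional, which is why bilinear interpolation is needed later in Section 3.3.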
We adopt an irregular $1 \times k_w$ kernel (Szegedy et al., 2015) in this work, setting $k_h = 1$. Since $i$ is an integer and $k_h = 1$, we have $i = 0$ and $y_i = y_0$, so Eq. 3 alone cannot exploit the scale in the vertical direction. In order to use scale information in the height direction, we define the feature vector I and construct it in a different way, which is formulated as:

(4)  $I(j) = F(x_j, y_0) + \lambda \left[ F(x_j, y_0 - s \cdot d_h) + F(x_j, y_0 + s \cdot d_h) \right]$

where $\lambda$ is a weight parameter and $F(\cdot, \cdot)$ denotes the input feature value at the given location.
In Anchor convolution, the receptive field can be shrunk or expanded according to the values of the scale coefficients. As illustrated in Figure 3, when $s = 1$, Anchor convolution is the same as standard convolution. However, when $s > 1$, the receptive field P is expanded to P′. In this case, we select the sampling elements and construct the feature vector according to Eq. 4. In our work, Anchor convolution is applied in two layers of the network.
3.3. Scale Learning
We introduce a scale regression layer to generate the scale map, which is used by both the scale-adaptive anchors and Anchor convolution. Thus, the derivative of the loss function w.r.t. the scale follows the chain rule.
The objective loss function is defined as follows:

$L = \frac{1}{N} \left( L_{conf}(x, c) + \alpha\, L_{loc}(x, l, g) \right)$

where $x$ indicates whether an anchor is a positive (matched) or a negative, $N$ is the number of matched anchors, $c$ is the confidence, $l$ is the predicted box, $g$ is the ground truth box, and $L_{conf}$ is a 2-class softmax loss. The smooth L1 loss defined in (Girshick, 2015b) is applied to the localization loss $L_{loc}$.
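As a small illustration of the localization term, the smooth L1 penalty from (Girshick, 2015b) is quadratic near zero and linear elsewhere; the per-box summation below is a simplified sketch (the paper applies it to regression offsets, not raw coordinates):

```python
def smooth_l1(x):
    """Smooth L1 loss (Girshick, 2015b): 0.5*x^2 for |x| < 1,
    |x| - 0.5 otherwise, so large errors are penalized linearly."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def loc_loss(pred, gt):
    """Localization loss of one matched box: sum of smooth L1 over
    the four coordinate differences (simplified sketch)."""
    return sum(smooth_l1(p - g) for p, g in zip(pred, gt))

print(smooth_l1(0.5))  # -> 0.125
print(smooth_l1(2.0))  # -> 1.5
```

The linear tail keeps gradients bounded for badly mismatched anchors, which matters here because a poorly initialized scale can produce large coordinate errors early in training.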
Now, we derive gradients w.r.t. scale coefficients. For brevity, we omit the standard derivations applied in network.
In scale-adaptive anchors. We define the predicted box as $p = (x_p, y_p, w_p, h_p)$, which is computed as:

(5)  $x_p = x_a + s w_a \Delta x, \quad y_p = y_a + s h_a \Delta y, \quad w_p = s w_a e^{\Delta w}, \quad h_p = s h_a e^{\Delta h}$

where $(\Delta x, \Delta y, \Delta w, \Delta h)$ are offsets relative to the matched anchor $(x_a, y_a, s w_a, s h_a)$, which are learned from the localization layer. According to Eq. 1, the gradient of $p$ w.r.t. $s$ is obtained as

(6)  $\frac{\partial x_p}{\partial s} = w_a \Delta x, \quad \frac{\partial y_p}{\partial s} = h_a \Delta y, \quad \frac{\partial w_p}{\partial s} = w_a e^{\Delta w}, \quad \frac{\partial h_p}{\partial s} = h_a e^{\Delta h}$
If we use $n$ to denote the number of anchors at position $(x, y)$, then the gradient of $L_{loc}$ w.r.t. the scale $s_{x,y}$ is $\sum_{j=1}^{n} \frac{\partial L_{loc}}{\partial p_j} \cdot \frac{\partial p_j}{\partial s_{x,y}}$.
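Gradients like those in Eq. 6 are easy to get wrong, so a finite-difference check is a cheap sanity test. The sketch below assumes an SSD-style decoding in which the anchor's width and height are first multiplied by the learned scale $s$; this is our reading of the formulation, not verbatim code from the paper:

```python
import math

def decode_box(anchor, offsets, s):
    """Decode a predicted box from a scale-adaptive anchor: the
    anchor's width/height are first scaled by s, then SSD-style
    offsets (dx, dy, dw, dh) are applied."""
    xa, ya, wa, ha = anchor
    dx, dy, dw, dh = offsets
    return (xa + s * wa * dx,
            ya + s * ha * dy,
            s * wa * math.exp(dw),
            s * ha * math.exp(dh))

# Check d(w_p)/ds numerically against the analytic value wa * exp(dw)
anchor, offsets, s, eps = (16.0, 16.0, 8.0, 4.0), (0.1, -0.2, 0.3, 0.0), 1.5, 1e-6
numeric = (decode_box(anchor, offsets, s + eps)[2]
           - decode_box(anchor, offsets, s - eps)[2]) / (2 * eps)
analytic = anchor[2] * math.exp(offsets[2])
print(abs(numeric - analytic) < 1e-4)  # -> True
```

Since $w_p$ is linear in $s$, the numeric and analytic derivatives agree up to floating-point error, which is exactly what the check verifies.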
In Anchor convolution. We first formulate the forward propagation of our proposed Anchor convolution, and then give the formulations to update scale.
Let $h$, $w$ and $c$ denote the height, width and channel number of a feature map, respectively. We also use the subscripts $in$ and $out$ to distinguish inputs and outputs. Suppose the convolution kernels in Anchor convolution are denoted as K and the bias as b, and index the output positions by $n \in [1, c_{out}]$, $y \in [1, h_{out}]$ and $x \in [1, w_{out}]$. Taking K as a partitioned matrix, each of its blocks $K_n$ is a vector corresponding to one of the convolution kernels. For any element of the convolution output O, we have:

(7)  $O(n, y, x) = K_n \cdot I(y, x) + b_n$
In Anchor convolution, we compute the coordinates of the feature vector via Eq. 3. We denote the scale coefficient corresponding to $O(n, y, x)$ as $s$. Since $s$ is a float value, the coordinates $x_j$ and $y_i$ may not be integers. Here, inspired by the Spatial Transformer Networks (Jaderberg and Simonyan, 2015), we obtain the feature values at these locations through bilinear interpolation. Letting $I(y, x)$ be the feature vector constructed in this way, the forward propagation of the convolution is computed via Eq. 7.
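The bilinear read at fractional coordinates can be sketched as follows (a minimal pure-Python version over a single-channel map represented as a list of rows; in practice this runs over all channels on the GPU):

```python
import math

def bilinear_sample(F, x, y):
    """Sample feature map F at a non-integer location (x, y) by
    bilinearly weighting the four surrounding pixels. Assumes
    (x, y) lies strictly inside the map so all four neighbors exist."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    ax, ay = x - x0, y - y0  # fractional parts: interpolation weights
    return ((1 - ax) * (1 - ay) * F[y0][x0]
            + ax * (1 - ay) * F[y0][x0 + 1]
            + (1 - ax) * ay * F[y0 + 1][x0]
            + ax * ay * F[y0 + 1][x0 + 1])

F = [[0.0, 1.0],
     [2.0, 3.0]]
print(bilinear_sample(F, 0.5, 0.5))  # -> 1.5 (average of the four pixels)
```

Because the interpolation weights are differentiable in $x$ and $y$, gradients can flow from the sampled value back to the coordinates, and from there to the scale coefficient, as derived next.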
During backward propagation, after obtaining the gradients $\frac{\partial L}{\partial O}$ from the following layer, the gradients w.r.t. K, b and I are derived as

(8)  $\frac{\partial L}{\partial K_n} = \sum_{y, x} \frac{\partial L}{\partial O(n, y, x)} \cdot I(y, x)$

(9)  $\frac{\partial L}{\partial b_n} = \sum_{y, x} \frac{\partial L}{\partial O(n, y, x)}$

(10)  $\frac{\partial L}{\partial I(y, x)} = \sum_{n} \frac{\partial L}{\partial O(n, y, x)} \cdot K_n$
From the bilinear interpolation, we can obtain the gradients $\frac{\partial I}{\partial x_j}$ and $\frac{\partial I}{\partial y_i}$. Since the coordinates $x_j$ and $y_i$ in Eq. 3 rely on the scale coefficient $s$, to obtain the gradient of $s$ we first compute the partial derivatives of the coordinates as follows:

(11)  $\frac{\partial x_j}{\partial s} = j \cdot d_w, \quad \frac{\partial y_i}{\partial s} = i \cdot d_h$

Thus the final gradient of $s$ is obtained as

(12)  $\frac{\partial L}{\partial s} = \sum_{i, j} \frac{\partial L}{\partial I(i, j)} \left( \frac{\partial I(i, j)}{\partial x_j} \frac{\partial x_j}{\partial s} + \frac{\partial I(i, j)}{\partial y_i} \frac{\partial y_i}{\partial s} \right)$
According to Eq. 6 and Eq. 12, the gradients of the scale coefficients can be automatically calculated from the gradients of the following layers. In other words, the scale map can be obtained in a data-driven manner and we do not need any extra supervision. All the derived formulations above can be computed efficiently and implemented in parallel on GPUs. In practice, we constrain the scale coefficients to be greater than zero and smaller than the image size.
4. Experiments
We implement the proposed algorithm in Python with Caffe (Jia and Shelhamer, 2014). All experiments are conducted on a regular server (3.3GHz 20-core CPU, 64GB RAM, NVIDIA TITAN GPU and 64-bit Linux OS), and each routine runs on a single GPU.
4.1. Datasets and Experimental Settings
VGG SynthText-Part. The VGG SynthText dataset (Gupta and Z.A., 2016) consists of a large number of synthetic scene-text images. For efficiency, we randomly select a subset of the images for training and refer to it as VGG SynthText-Part.
ICDAR13. The ICDAR13 dataset is from the ICDAR 2013 Robust Reading Competition (Karatzas and Shafait, 2013), with separate sets of natural images for training and testing.
ICDAR11. The ICDAR11 dataset is from the ICDAR 2011 Robust Reading Competition (Shahab et al., 2011), with separate sets of natural images for training and testing.
Our model is trained using stochastic gradient descent (SGD) with momentum and weight decay. The learning rate is decayed once after a fixed number of training iterations. We first train our model on VGG SynthText-Part for 40k iterations, and then fine-tune it on ICDAR13.
Compared to previous box-based methods, the number of anchor boxes in our algorithm is greatly reduced, so we employ all anchors for training without negative mining. Accordingly, we add a balance parameter to our loss function to balance the ratio of positive and negative anchors. To further boost detection recall, we rescale the input image to 6 resolutions, and the total running time is 0.28s.
4.2. Visualization and Analysis
To verify the ability of our network to learn the scales of texts, we visualize the scale maps after different numbers of training iterations. As shown in Figure 4, the scale maps gradually exhibit structures similar to the texts in the images as the number of iterations increases. Specifically, large texts have large scale values, whereas small texts have small scale values. Besides, for each text area, the scale values associated with the center points are slightly larger, while the scale values close to the boundaries are slightly smaller.
4.3. Evaluation on Anchor Convolution
In this part, we investigate the effect of Anchor convolution, which adjusts the size of the receptive field of each anchor to obtain richer feature information. Two models are trained using all images of SynthText-Part and fine-tuned on ICDAR13: one with Anchor convolution (denoted AC-model) and one without (denoted WAC-model). We evaluate the two models on ICDAR13 and tabulate the results in Table 1, where P, R and F are abbreviations for Precision, Recall and F-measure, respectively.
Model | P | R | F | ΔF
WAC-model | 77% | 76% | 76% | –
AC-model | 89% | 83% | 86% | +10%
From Table 1 we can see that the F-measure of the AC-model is 86%, a 10% improvement over the WAC-model. This verifies that Anchor convolution is effective at exploiting the necessary feature information for detecting texts of various sizes by adjusting the receptive fields dynamically.
4.4. Evaluation for Full Text Detection
Methods | ICDAR11 (P / R / F) | ICDAR13 (P / R / F) | Runtime/s
TextBoxes (Liao et al., 2016) | 88% / 82% / 85% | 88% / 83% / 85% | 0.73
Yao et al. (Yao et al., 2016) | – | 89% / 80% / 84% | 0.62
MCLAB_FCN (Zhang et al., 2016) | – | 78% / 88% / 82% | 2.1
RRPN (Ma et al., 2017) | – | 90% / 72% / 80% | –
TextFlow (Tian et al., 2015) | 86% / 76% / 81% | 85% / 76% / 80% | 1.4
Lu et al. (Lu et al., 2015) | – | 89% / 70% / 80% | –
Neumann et al. (Neumann and Matas, 2015) | – | 82% / 72% / 77% | 0.8
FASText (Buta and Neumann, 2015) | – | 84% / 69% / 75% | 0.55
Ours | 89% / 82% / 85% | 89% / 83% / 86% | 0.28
We evaluate our detector on two benchmarks: ICDAR11 and ICDAR13. The comparison results with some state-of-the-art methods, including traditional methods and box-based methods, are tabulated in Table 2. We can see that our method achieves slightly superior results (85% and 86% F-measure on ICDAR11 and ICDAR13, respectively) compared to state-of-the-art approaches, while costing much less time, which is important for real systems, especially mobile applications.
Some detection examples are given in Figure 5. The results show that our model is robust against multiple text variations, cluttered backgrounds and challenging conditions such as strong highlights and blurring.
Advantages on Small Texts.
We argue that our method is superior in detecting small texts. To verify this, we compare it with a representative box-based method, i.e., TextBoxes (Liao et al., 2016). To detect small texts, TextBoxes needs to resize the input image to a large resolution before sending it into the network, which is very time-consuming. In contrast, we can cover most of the small texts with much smaller input images. With such settings, as shown in Figure 6, our model is more reliable and finds all the small texts, while TextBoxes misses some of them. We attribute the advantage of our method on small texts to the scale-adaptive anchors and Anchor convolution.
First, unlike the fixed-size anchors of several discrete scales used in TextBoxes, the proposed scale-adaptive anchors can change their sizes continuously and thus have more potential to match the shapes of small texts. Moreover, with its current settings, even the smallest anchor in TextBoxes may be much bigger than the small texts in some test images. Second, the proposed Anchor convolution is able to shrink the receptive fields for small texts adaptively, so we can focus on the texts while removing the side effects of the background during feature extraction.
Performance Analysis. By producing scale-adaptive anchors instead of presetting all possible anchors of different scales, as employed in most box-based methods, we improve the computational efficiency (reducing the number of anchors by a factor of the number of scale levels; see Sec. 3.1) and reduce the running time from 0.73s to 0.28s while keeping competitive accuracy. Our running time includes generating the scale map and matching anchors. The savings in time will be more significant as networks go deeper. Furthermore, the proposed adaptive scale allows other box-based methods to handle multi-scale texts in a more efficient way and further improve their performance.
5. Conclusions
In this paper, we have presented an end-to-end text detector with scale-adaptive anchors. It largely reduces the number of anchors and thus improves computational efficiency. Meanwhile, it also eliminates the unreliability of detection caused by discrete scales, and is more effective at handling multi-scale texts, especially small texts. Additionally, Anchor convolution is proposed to further improve detection performance by exploiting the necessary features for each anchor. Furthermore, the proposed adaptive scale can also be applied to other methods, allowing them to handle multi-scale texts in a more efficient way. Experimental results show that our approach is fast while maintaining high accuracy on ICDAR11 and ICDAR13. In the future, we are interested in applying our method to the arbitrary-oriented text detection task.
Acknowledgements.
The authors would like to thank the associate editor and the anonymous reviewers for their constructive suggestions.

References
 Buta and Neumann (2015) Michal Buta and Neumann. 2015. FASText: Efficient Unconstrained Scene Text Detector. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV) (ICCV ’15). IEEE Computer Society, Washington, DC, USA, 1206–1214. https://doi.org/10.1109/ICCV.2015.143
 Epshtein et al. (2010) Boris Epshtein, Eyal Ofek, and Yonatan Wexler. 2010. Detecting text in natural scenes with stroke width transform. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2963–2970.
 Girshick (2015a) R. Girshick. 2015a. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV). 1440–1448. https://doi.org/10.1109/ICCV.2015.169
 Girshick (2015b) Ross Girshick. 2015b. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 1440–1448.
 Gupta and Z.A. (2016) Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. 2016. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2315–2324.
 Huang and Lin (2013) Weilin Huang and Lin. 2013. Text localization in natural images using stroke feature transform and text covariance descriptors. In Proceedings of the IEEE International Conference on Computer Vision. 1241–1248.
 Huang et al. (2014) Weilin Huang, Yu Qiao, and Xiaoou Tang. 2014. Robust Scene Text Detection with Convolution Neural Network Induced MSER Trees. In Computer Vision – ECCV 2014, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, Cham, 497–511.
 Jaderberg and Simonyan (2016) Max Jaderberg and Simonyan. 2016. Reading Text in the Wild with Convolutional Neural Networks. International Journal of Computer Vision 116, 1 (01 Jan 2016), 1–20. https://doi.org/10.1007/s11263-015-0823-z
 Jaderberg and Simonyan (2015) Max Jaderberg and Karen Simonyan. 2015. Spatial Transformer Networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 2017–2025.
 Jia and Shelhamer (2014) Yangqing Jia and Evan Shelhamer. 2014. Caffe: Convolutional Architecture for Fast Feature Embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (MM ’14). ACM, New York, NY, USA, 675–678. https://doi.org/10.1145/2647868.2654889
 Karatzas and Shafait (2013) Dimosthenis Karatzas and Faisal Shafait. 2013. ICDAR 2013 robust reading competition. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 1484–1493.
 Krizhevsky and Sutskever (2012) Alex Krizhevsky and Ilya Sutskever. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 1097–1105.
 Liao et al. (2016) Minghui Liao, Baoguang Shi, and Xiang Bai. 2016. TextBoxes: A fast text detector with a single deep neural network. arXiv preprint arXiv:1611.06779 (2016).
 Liu and Anguelov (2016) Wei Liu and Dragomir Anguelov. 2016. SSD: Single shot multibox detector. In European Conference on Computer Vision. Springer, 21–37.
 Long et al. (2015) Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully Convolutional Networks for Semantic Segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 Lu et al. (2015) Shijian Lu, Tao Chen, Shangxuan Tian, Joo-Hwee Lim, and Chew-Lim Tan. 2015. Scene text extraction based on edges and support vector regression. International Journal on Document Analysis and Recognition (IJDAR) 18, 2 (01 Jun 2015), 125–135. https://doi.org/10.1007/s10032-015-0237-z
 Ma et al. (2017) Jianqi Ma, Weiyuan Shao, and Hao Ye. 2017. Arbitrary-Oriented Scene Text Detection via Rotation Proposals. CoRR abs/1703.01086 (2017). arXiv:1703.01086 http://arxiv.org/abs/1703.01086
 Neumann and Matas (2015) Lukáš Neumann and Jiří Matas. 2015. Efficient scene text localization and recognition with local character refinement. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 746–750.
 Ren and He (2015) Shaoqing Ren and Kaiming He. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 91–99.
 Shahab et al. (2011) Asif Shahab, Faisal Shafait, and Andreas Dengel. 2011. ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In Document Analysis and Recognition (ICDAR), 2011 International Conference on. IEEE, 1491–1496.
 Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
 Szegedy et al. (2015) Christian Szegedy, Wei Liu, and Yangqing Jia. 2015. Going Deeper With Convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 Tian et al. (2015) Shangxuan Tian, Yifeng Pan, and Chang Huang. 2015. Text Flow: A Unified Text Detection System in Natural Scene Images. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV) (ICCV ’15). IEEE Computer Society, Washington, DC, USA, 4651–4659. https://doi.org/10.1109/ICCV.2015.528
 Yang et al. (2017) Chun Yang, XuCheng Yin, and Zejun Li. 2017. AdaDNNs: Adaptive Ensemble of Deep Neural Networks for Scene Text Recognition. CoRR abs/1710.03425 (2017). arXiv:1710.03425 http://arxiv.org/abs/1710.03425
 Yao et al. (2016) Cong Yao, Xiang Bai, and Nong Sang. 2016. Scene Text Detection via Holistic, MultiChannel Prediction. CoRR abs/1606.09002 (2016). arXiv:1606.09002 http://arxiv.org/abs/1606.09002
 Ye and Doermann (2015) Q. Ye and D. Doermann. 2015. Text Detection and Recognition in Imagery: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 7 (July 2015), 1480–1500. https://doi.org/10.1109/TPAMI.2014.2366765
 Zhang et al. (2017) Rui Zhang, Sheng Tang, and Yongdong Zhang. 2017. ScaleAdaptive Convolutions for Scene Parsing. In The IEEE International Conference on Computer Vision (ICCV).
 Zhang et al. (2016) Zheng Zhang, Chengquan Zhang, and Wei Shen. 2016. Multi-Oriented Text Detection With Fully Convolutional Networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 Zhong et al. (2016) Zhuoyao Zhong, Lianwen Jin, and Shuye Zhang. 2016. Deeptext: A unified framework for text proposal generation and text detection in natural images. arXiv preprint arXiv:1605.07314 (2016).