SEE: Towards Semi-Supervised End-to-End Scene Text Recognition
Detecting and recognizing text in natural scene images is a challenging, yet not completely solved task. In recent years several new systems that try to solve at least one of the two sub-tasks (text detection and text recognition) have been proposed. In this paper we present SEE, a step towards semi-supervised neural networks for scene text detection and recognition, that can be optimized end-to-end. Most existing works consist of multiple deep neural networks and several pre-processing steps. In contrast to this, we propose to use a single deep neural network, that learns to detect and recognize text from natural images, in a semi-supervised way. SEE is a network that integrates and jointly learns a spatial transformer network, which can learn to detect text regions in an image, and a text recognition network that takes the identified text regions and recognizes their textual content. We introduce the idea behind our novel approach and show its feasibility, by performing a range of experiments on standard benchmark datasets, where we achieve competitive results.
Text is ubiquitous in our daily lives. Text can be found on documents, road signs, billboards, and other objects like cars or telephones. Automatically detecting and reading text from natural scene images is an important part of systems, that are to be used for several challenging tasks, such as image-based machine translation, autonomous cars or image/video indexing. In recent years the task of detecting text and recognizing text in natural scenes has seen much interest from the computer vision and document analysis community. Furthermore, recent breakthroughs  in other areas of computer vision enabled the creation of even better scene text detection and recognition systems than before . Although the problem of can be seen as solved for text in printed documents, it is still challenging to detect and recognize text in natural scene images. Images containing natural scenes exhibit large variations of illumination, perspective distortions, image qualities, text fonts, diverse backgrounds, etc.
The majority of existing research works developed end-to-end scene text recognition systems that consist of complex two-step pipelines, where the first step is to detect regions of text in an image and the second step is to recognize the textual content of that identified region. Most of the existing works only concentrate on one of these two steps.
In this paper, we present a solution that consists of a single that can learn to detect and recognize text in a semi-supervised way. In this setting the network only receives the image and the textual labels as input. We do not supply any groundtruth bounding boxes. The text detection is learned by the network itself. This is contrary to existing works, where text detection and text recognition systems are trained separately in a fully-supervised way. Recent work  showed that are capable of learning how to solve complex multi-task problems, while being trained in an end-to-end manner. Our motivation is to use these capabilities of and create an end-to-end trainable scene text recognition system, that can be trained on weakly labelled data. In order to create such a system, we learn a single that is able to find single characters, words or even lines of text in the input image and recognize their content. This is achieved by jointly learning a localization network that uses a recurrent spatial transformer  as attention mechanism and a text recognition network. Figure 1 provides a schematic overview of our proposed system.
Our contributions are as follows:
This paper is structured in the following way: We first outline work of other researchers that is related to ours. Second, we describe our proposed system in detail. We then show and discuss our results on standard benchmark datasets and finally conclude our findings.
Over the course of years a rich environment of different approaches to scene text detection and recognition have been developed and published. Nearly all systems use a two-step process for performing end-to-end recognition of scene text. The first step, is to detect regions of text and extract these regions from the input image. The second step, is to recognize the textual content and return the text strings of the extracted text regions.
It is further possible to divide these approaches into three broad categories:
For each category, we will discuss some of these systems.
Hand Crafted Features In the beginning, methods based on hand crafted features and human knowledge have been used to perform text detection and recognition. These systems used features like MSERs , Stroke Width Transforms  or HOG-Features  to identify regions of text and provide them to the text recognition stage of the system. In the text recognition stage sliding window classifiers  and ensembles of SVMs  or k-Nearest Neighbor classifiers using HOG features  were used. All of these approaches use hand crafted features that have a large variety of hyper parameters that need expert knowledge to correctly tune them for achieving the best results.
Deep Learning Approaches More recent systems exchange approaches based on hand crafted features in one or both steps of recognition systems by approaches using . Gómez and Karatzas  propose a text-specific selective search algorithm that, together with a , can be used to detect (distorted) text regions in natural scene images. Gupta et al.  propose a text detection model based on the YOLO-Architecture  that uses a fully convolutional deep neural network to identify text regions.
Bissacco et al.  propose a complete end-to-end architecture that performs text detection using hand crafted features. Jaderberg et al.  propose several systems that use deep neural networks for text detection and text recognition. In  Jaderberg et al. propose to use a region proposal network with an extra bounding box regression CNN for text detection. A CNN that takes the whole text region as input is used for text recognition. The output of this CNN is constrained to a pre-defined dictionary of words, making this approach only applicable to one given language.
Goodfellow et al.  propose a text recognition system for house numbers, that has been refined by Jaderberg et al.  for unconstrained text recognition. This system uses a single , taking the whole extracted text region as input, and recognizing the text using one independent classifier for each possible character in the given word. Based on this idea He et al.  and Shi et al.  propose text recognition systems that treat the recognition of characters from the extracted text region as a sequence recognition problem. Shi et al.  later improved their approach by firstly adding an extra step that utilizes the rectification capabilities of Spatial Transformer Networks  for rectifying extracted text lines. Secondly they added a soft-attention mechanism to their network that helps to produce the sequence of characters in the input image. In their work Shi et al. make use of Spatial Transformers as an extra pre-processing step to make it easier for the recognition network to recognize the text in the image. In our system we use the Spatial Transformer as a core building block for detecting text in a semi-supervised way.
End-to-End trainable Approaches The presented systems always use a two-step approach for detecting and recognizing text from scene text images. Although recent approaches make use of deep neural networks they are still using a huge amount of hand crafted knowledge in either of the steps or at the point where the results of both steps are fused together. Smith et al.  and Wojna et al.  propose an end-to-end trainable system that is able to recognize text on French street name signs, using a single . In contrast to our system it is not possible for the system to provide the location of the text in the image, only the textual content can be extracted. Recently Li et al.  proposed an end-to-end system consisting of a single, complex that is trained end-to-end and can perform text detection and text recognition in a single forward pass. This system is trained using groundtruth bounding boxes and groundtruth labels for each word in the input images, which stands in contrast to our method, where we only use groundtruth labels for each word in the input image, as the detection of text is learned by the network itself.
A human trying to find and read text will do so in a sequential manner. The first action is to put attention on a word, read each character sequentially and then attend to the next word. Most current end-to-end systems for scene text recognition do not behave in that way. These systems rather try to solve the problem by extracting all information from the image at once. Our system first tries to attend sequentially to different text regions in the image and then recognize their textual content. In order to do this, we created a single consisting of two stages:
. In this section we will introduce the attention concept used by the text detection stage and the overall structure of the proposed system.
3.1Detecting Text with Spatial Transformers
A spatial transformer proposed by Jaderberg et al.  is a differentiable module for that takes an input feature map and applies a spatial transformation to this feature map, producing an output feature map . Such a spatial transformer module is a combination of three parts. The first part is a localization network computing a function , that predicts the parameters of the spatial transformation to be applied. These predicted parameters are used in the second part to create a sampling grid, which defines a set of points where the input map should be sampled. The third part is a differentiable interpolation method, that takes the generated sampling grid and produces the spatially transformed output feature map . We will shortly describe each component in the following paragraphs.
Localization Network The localization network takes the input feature map , with channels, height and width and outputs the parameters of the transformation that shall be applied. In our system we use the localization network () to predict two-dimensional affine transformation matrices , where :
is thereby the number of characters, words or textlines the localization network shall localize. The affine transformation matrices predicted in that way allow the network to apply translation, rotation, zoom and skew to the input image.
In our system the transformation matrices are produced by using a feed-forward together with a . Each of the transformation matrices is computed using the globally extracted convolutional features and the hidden state of each time-step of the :
where is another feed-forward/recurrent network. We use a variant of the well known ResNet architecture  as for our localization network. We use this network architecture, because we found that with this network structure our system learns faster and more successfully, as compared to experiments with other network structures, such as the VGGNet . We argue that this is due to the fact that the residual connections of the ResNet help with retaining a strong gradient down to the very first convolutional layers. The used in the localization network is a  unit. This is used to generate the hidden states , which in turn are used to predict the affine transformation matrices. We used the same structure of the network for all our experiments we report in the next section. Figure 2 provides a structural overview of this network.
Rotation Dropout During our experiments, we found that the network tends to predict transformation parameters, which include excessive rotation. In order to mitigate such a behavior, we propose a mechanism that works similarly to dropout , which we call rotation dropout. Rotation dropout works by randomly dropping the parameters of the affine transformation, which are responsible for rotation. This prevents the localization network to output transformation matrices that perform excessive rotation. Figure 3 shows a comparison of the localization result of a localization network trained without rotation dropout (top) and one trained with rotation dropout (middle).
Grid Generator The grid generator uses a regularly spaced grid with coordinates , of height and width . The grid is used together with the affine transformation matrices to produce regular grids with coordinates of the input feature map , where and :
During inference we can extract the resulting grids , which contain the bounding boxes of the text regions found by the localization network. Height and width can be chosen freely.
Localization specific regularizers The datasets used by us, do not contain any samples, where text is mirrored either along the x- or y-axis. Therefore, we found it beneficial to add additional regularization terms that penalizes grid, which are mirrored along any axis. We furthermore found that the network tends to predict grids that get larger over the time of training, hence we included a further regularizer that penalizes large grids, based on their area. Lastly, we also included a regularizer that encourages the network to predict grids that have a greater width than height, as text is normally written in horizontal direction and typically wider than high. The main purpose of these localization specific regularizers is to enable faster convergence. Without these regularizers, the network will eventually converge, but it will take a very long time and might need several restarts of the training. Equation 1 shows how these regularizers are used for calculating the overall loss of the network.
Image Sampling The sampling grids produced by the grid generator are now used to sample values of the feature map at the coordinates for each . Naturally these points will not always perfectly align with the discrete grid of values in the input feature map. Because of that we use bilinear sampling and define the values of the output feature maps at a given location where and to be:
This bilinear sampling is (sub-)differentiable, hence it is possible to propagate error gradients to the localization network, using standard backpropagation.
The combination of localization network, grid generator and image sampler forms a spatial transformer and can in general be used in every part of a . In our system we use the spatial transformer as the first step of our network. Figure 4 provides a visual explanation of the operation method of grid generator and image sampler.
3.2Text Recognition Stage
The image sampler of the text detection stage produces a set of regions, that are extracted from the original input image. The text recognition stage (a structural overview of this stage can be found in Figure 2) uses each of these different regions and processes them independently of each other. The processing of the different regions is handled by a . This is also based on the ResNet architecture as we found that we could only achieve good results, while using a variant of the ResNet architecture for our recognition network. We argue that using a ResNet in the recognition stage is even more important than in the detection stage, because the detection stage needs to receive strong gradient information from the recognition stage in order to successfully update the weights of the localization network. The of the recognition stage predicts a probability distribution over the label space , where , with being the alphabet used for recognition, and representing the blank label. The network is trained by running a for a fixed number of timesteps and calculating the cross-entropy loss for the output of each timestep. The choice of number of timesteps is based on the number of characters, of the longest word, in the dataset. The loss is computed as follows:
Where is the regularization term based on the area of the predicted grid , is the regularization term based on the aspect ratio of the predicted grid , and is the regularization term based on the direction of the grid , that penalizes mirrored grids. and are scaling parameters that can be chosen freely. The typical range of these parameters is . is the label at time step for the n-th word in the image.
The training set used for training the model consists of a set of input images and a set of text labels for each input image. We do not use any labels for training the text detection stage. The text detection stage is learning to detect regions of text by using only the error gradients, obtained by calculating the cross-entropy loss, of the predictions and the textual labels, for each character of each word. During our experiments we found that, when trained from scratch, a network that shall detect and recognize more than two text lines does not converge. In order to overcome this problem we designed a curriculum learning strategy  for training the system. The complexity of the supplied training images under this curriculum is gradually increasing, once the accuracy on the validation set has settled.
During our experiments we observed that the performance of the localization network stagnates, as the accuracy of the recognition network increases. We found that restarting the training with the localization network initialized using the weights obtained by the last training and the recognition network initialized with random weights, enables the localization network to improve its predictions and thus improve the overall performance of the trained network. We argue that this happens because the values of the gradients propagated to the localization network decrease, as the loss decreases, leading to vanishing gradients in the localization network and hence nearly no improvement of the localization.
In this section we evaluate our presented network architecture on standard scene text detection/recognition benchmark datasets. While performing our experiments we tried to answer the following questions:
In order to answer these questions, we used different datasets. On the one hand we used standard benchmark datasets for scene text recognition. On the other hand we generated some datasets on our own. First, we performed experiments on the SVHN dataset , that we used to prove that our concept as such is feasible. Second, we generated more complex datasets based on SVHN images, to see how our system performs on images that contain several words in different locations. The third dataset we exerimented with, was the dataset . This dataset is the most challenging we used, as it contains a vast amount of irregular, low resolution text lines, that are more difficult to locate and recognize than text lines from the SVHN datasets. We begin this section by introducing our experimental setup. We will then present the results and characteristics of the experiments for each of the aforementioned datasets. We will conclude this section with a brief explanation of what kinds of features the network seems to learn.
Localization Network The localization network used in every experiment is based on the ResNet architecture . The input to the network is the image where text shall be localized and later recognized. Before the first residual block the network performs a convolution, followed by batch normalization , ReLU , and a average pooling layer with stride 2. After these layers three residual blocks with two convolutions, each followed by batch normalization and ReLU, are used. The number of convolutional filters is 32, 48 and 48 respectively. A max-pooling with stride 2 follows after the second residual block. The last residual block is followed by a average pooling layer and this layer is followed by a with 256 hidden units. Each time step of the is fed into another with 6 hidden units. This layer predicts the affine transformation matrix, which is used to generate the sampling grid for the bilinear interpolation. We apply rotation dropout to each predicted affine transformation matrix, in order to overcome problems with excessive rotation predicted by the network.
Recognition Network The inputs to the recognition network are crops from the original input image, representing the text regions found by the localization network. In our SVHN experiments, the recognition network has the same structure as the localization network, but the number of convolutional filters is higher. The number of convolutional filters is 32, 64 and 128 respectively. We use an ensemble of independent softmax classifiers as used in  and  for generating our predictions. In our experiments on the FSNS dataset we found that using ResNet-18  significantly improves the obtained recognition accuracies.
Alignment of Groundtruth During training we assume that all groundtruth labels are sorted in western reading direction, that means they appear in the following order:
We stress that currently it is very important to have a consistent ordering of the groundtruth labels, because if the labels are in a random order, the network rather predicts large bounding boxes that span over all areas of text in the image. We hope to overcome this limitation, in the future, by developing a method that allows random ordering of groundtruth labels.
Implementation We implemented all our experiments using Chainer . We conducted all our experiments on a work station which has an Intel(R) Core(TM) i7-6900K CPU, 64 GB RAM and 4 TITAN X (Pascal) GPUs.
4.2Experiments on the SVHN dataset
With our first experiments on the SVHN dataset  we wanted to prove that our concept works. We therefore first conducted experiments, similar to the experiments in , on SVHN image crops with a single house number in each image crop, that is centered around the number and also contains background noise. Table 1 shows that we are able to reach competitive recognition accuracies.
Based on this experiment we wanted to determine whether our model is able to detect different lines of text that are arranged in a regular grid, or placed at random locations in the image. In Figure 5 we show samples from our two generated datasets, that we used for our other experiments based on SVHN data. We found that our network performs well on the task of finding and recognizing house numbers that are arranged in a regular grid.
During our experiments on the second dataset, created by us, we found that it is not possible to train a model from scratch, which can find and recognize more than two textlines that are scattered across the whole image. We therefore resorted to designing a curriculum learning strategy that starts with easier samples first and then gradually increases the complexity of the train images.
4.3Experiments on the dataset
Following our scheme of increasing the difficulty of the task that should be solved by the network, we chose the dataset by Smith et al.  to be our next dataset to perform experiments on. The dataset contains more than 1 million images of French street name signs, which have been extracted from Google Streetview. This dataset is the most challenging dataset for our approach as it
During our first experiments with that dataset, we found that our model is not able to converge, when trained on the supplied groundtruth. We argue that this is because our network was not able to learn the alignment of the supplied labels with the text in the images of the dataset. We therefore chose a different approach, and started with experiments where we tried to find individual words instead of textlines with more than one word. Table 2 shows the performance of our proposed system on the benchmark dataset. We are currently able to achieve competitive performance on this dataset. We are still behind the results reported by Wojna et al. . This likely due to the fact that we used a feature extractor that is weaker (ResNet-18) compared to the one used by Wojna et al. (Inception-ResNet v2). Also recall that our method is not only able to determine the text in the images, but also able to extract the location of the text, although we never explicitly told the network where to find the text! The network learned this completely on its own in a semi-supervised manner.
During the training of our networks, we used Visualbackprop  to visualize the regions that the network deems to be the most interesting. Using this visualization technique, we could observe that our system seems to learn different types of features for each subtask. Figure 3 (bottom) shows that the localization network learns to extract features that resemble edges of text and the recognition network learns to find strokes of the individual characters in each cropped word region. This is an interesting observation, as it shows that our tries to learn features that are closely related to the features used by systems based on hand-crafted features.
In this paper we presented a system that can be seen as a step towards solving end-to-end scene text recognition, only using a single multi-task deep neural network. We trained the text detection component of our model in a semi-supervised way and are able to extract the localization results of the text detection component. The network architecture of our system is simple, but it is not easy to train this system, as a successful training requires a clever curriculum learning strategy. We also showed that our network architecture can be used to reach competitive results on different public benchmark datasets for scene text detection/recognition.
At the current state we note that our models are not fully capable of detecting text in arbitrary locations in the image, as we saw during our experiments with the dataset. Right now our model is also constrained to a fixed number of maximum words that can be detected with one forward pass. In our future work, we want to redesign the network in a way that makes it possible for the network to determine the number of textlines in an image by itself.
Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. Curriculum learning.
Bissacco, A.; Cummins, M.; Netzer, Y.; and Neven, H. Photoocr: Reading text in uncontrolled conditions.
Bojarski, M.; Choromanska, A.; Choromanski, K.; Firner, B.; Jackel, L.; Muller, U.; and Zieba, K. Visualbackprop: efficient visualization of cnns.
Dai, J.; He, K.; and Sun, J. Instance-aware semantic segmentation via multi-task network cascades.
Epshtein, B.; Ofek, E.; and Wexler, Y. Detecting text in natural scenes with stroke width transform.
Goodfellow, I.; Bulatov, Y.; Ibarz, J.; Arnoud, S.; and Shet, V. Multi-digit number recognition from street view imagery using deep convolutional neural networks.
Gupta, A.; Vedaldi, A.; and Zisserman, A. Synthetic data for text localisation in natural images.
Gómez, L., and Karatzas, D. Textproposals: A text-specific selective search algorithm for word spotting in the wild.
He, K.; Zhang, X.; Ren, S.; and Sun, J. Deep residual learning for image recognition.
He, P.; Huang, W.; Qiao, Y.; Loy, C. C.; and Tang, X. Reading scene text in deep convolutional sequences.
Hochreiter, S., and Schmidhuber, J. Long short-term memory.
Ioffe, S., and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift.
Jaderberg, M.; Simonyan, K.; Vedaldi, A.; and Zisserman, A. Reading text in the wild with convolutional neural networks.
Jaderberg, M.; Simonyan, K.; Zisserman, A.; and Kavukcuoglu, K. Spatial transformer networks.
Jaderberg, M.; Vedaldi, A.; and Zisserman, A. Deep features for text spotting.
Li, H.; Wang, P.; and Shen, C. Towards end-to-end text spotting with convolutional recurrent neural networks.
Mishra, A.; Alahari, K.; and Jawahar, C. Scene text recognition using higher order language priors.
Nair, V., and Hinton, G. E. Rectified linear units improve restricted boltzmann machines.
Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. Reading digits in natural images with unsupervised feature learning.
Neumann, L., and Matas, J. A method for text localization and recognition in real-world images.
Redmon, J.; Divvala, S.; Girshick, R.; and Farhadi, A. You only look once: Unified, real-time object detection.
Ren, S.; He, K.; Girshick, R.; and Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks.
Shi, B.; Bai, X.; and Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition.
Shi, B.; Wang, X.; Lyu, P.; Yao, C.; and Bai, X. Robust scene text recognition with automatic rectification.
Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-scale image recognition.
Smith, R.; Gu, C.; Lee, D.-S.; Hu, H.; Unnikrishnan, R.; Ibarz, J.; Arnoud, S.; and Lin, S. End-to-end interpretation of the french street name signs dataset.
Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting.
Sønderby, S. K.; Sønderby, C. K.; Maaløe, L.; and Winther, O. Recurrent spatial transformer networks.
Tokui, S.; Oono, K.; Hido, S.; and Clayton, J. Chainer: a next-generation open source framework for deep learning.
Wang, K., and Belongie, S. Word spotting in the wild.
Wang, K.; Babenko, B.; and Belongie, S. End-to-end scene text recognition.
Wojna, Z.; Gorban, A.; Lee, D.-S.; Murphy, K.; Yu, Q.; Li, Y.; and Ibarz, J. Attention-based extraction of structured information from street view imagery.
Yao, C.; Bai, X.; Shi, B.; and Liu, W. Strokelets: A learned multi-scale representation for scene text recognition.