Learning Markov Clustering Networks for Scene Text Detection
A novel framework named Markov Clustering Network (MCN) is proposed for fast and robust scene text detection. MCN predicts instance-level bounding boxes by first converting an image into a Stochastic Flow Graph (SFG) and then performing Markov Clustering on this graph. Our method can detect text objects with arbitrary size and orientation without prior knowledge of object size. The stochastic flow graph encodes objects’ local correlation and semantic information. An object is modeled as strongly connected nodes, which allows flexible bottom-up detection for scale-varying and rotated objects. MCN generates bounding boxes without using Non-Maximum Suppression, and it can be fully parallelized on GPUs. The evaluation on public benchmarks shows that our method outperforms the existing methods by a large margin in detecting multi-oriented text objects. MCN achieves new state-of-the-art performance on the challenging MSRA-TD500 dataset with a precision of 0.88, a recall of 0.79 and an F-score of 0.83. MCN also achieves real-time inference at a frame rate of 34 FPS, a speedup over the fastest existing scene text detection algorithm.
Detecting structural objects in an image is a ubiquitous problem in the real world. Powered by the recent advances in Convolutional Neural Networks (CNNs), object detection systems have achieved human-level accuracy with real-time processing capability [4, 20, 19, 15, 3]. Despite the progress made in general object detection, problems remain when detecting objects in specific application areas.
In scene text detection, existing CNN-based methods may fail when producing bounding boxes with extremely large aspect ratios or unsupported orientations [22, 21]. These methods [20, 19, 15] follow the top-down prediction paradigm, where object boxes are produced from the global information of an object while the local information is neglected. Therefore, top-down methods usually require prior knowledge of the text box geometry to design reference boxes, which is task-specific and heuristic. As a result, to maintain the detection performance for various text sizes and orientations, one inevitably has to increase the number of reference boxes, which lowers the inference speed due to the increased output dimension. On the other hand, due to the absence of local semantic information, the existing methods have to rely on Non-Maximum Suppression (NMS) to remove redundant bounding boxes, which cannot be parallelized on GPUs.
To address these issues, we propose a unified framework called Markov Clustering Network (MCN) for detecting scale-varying and arbitrarily oriented texts. It is an end-to-end trainable model that describes both the local correlation and the semantic information of an object with a Stochastic Flow Graph (SFG). As shown in Figure 1, equidistant and overlapping regions are considered as nodes of the SFG, with edges weighted by flow values. Nodes belonging to the same object are strongly connected by the flows and are grouped together by applying fully parallelized Markov Clustering (MC) on the SFG. Bounding boxes are produced from the generated clusters with post-processing.
In contrast with the top-down methods [20, 19, 15], our method predicts bounding boxes in a bottom-up manner. Essentially, the MCN predicts instance-level objectness by merging the dense object predictions according to the local correlation measurements. This framework can naturally detect texts with arbitrary size and orientation. Our method does not use NMS to produce bounding boxes and can be fully parallelized on GPUs.
We evaluate our method on public benchmarks and demonstrate its robustness to large variations in scale, aspect ratio and orientation. Our method achieves state-of-the-art performance with much faster inference. The contributions of this work are summarized as follows:
- A bottom-up method for scene text detection is proposed, which assembles local predictions into object bounding boxes by performing Markov Clustering on a Stochastic Flow Graph;
- Markov Clustering is regarded as a set of special differentiable neural network layers, and an end-to-end training method is developed for learning graph clusters from image data;
- The proposed inference process is fully parallelized on GPUs and achieves real-time processing at a frame rate of 34 FPS, a speedup over the fastest existing scene text detection algorithm;
- Our method outperforms existing scene text detection methods in detecting arbitrarily oriented text objects, and achieves new state-of-the-art performance on the challenging MSRA-TD500 dataset with a precision of 0.88, a recall of 0.79 and an F-score of 0.83.
2 Related Works
Over the past few years, much research effort has been devoted to text detection at the character level [17, 26, 7, 8] and the word level [28, 25, 31, 30, 2, 9, 5]. Character-based methods detect individual characters and group them into words. These methods find characters by classifying candidate regions extracted by region extraction algorithms or by classifying sliding windows. Such methods often involve a post-processing step of grouping characters into words. Word-based methods directly detect word bounding boxes. They often have a pipeline similar to that of recent CNN-based general object detection networks.
Recently, segment-based methods have opened up a new direction for solving this problem [23, 21]. Instead of detecting the whole object, these methods aim at detecting segments of an object and combining these segments into a bounding box. One such work combines spatial recurrent components with the YOLO architecture to detect segments and connects the segments heuristically according to their horizontal distance. Inheriting from the SSD method, another predicts both object segments and the links between them on multi-resolution feature maps. Instance-level bounding boxes are generated by merging oriented bounding boxes according to the link scores between them. However, this method still requires predefined default boxes for bounding box regression, and excessive connections between segments significantly complicate the training and slow down the inference.
Different from the existing methods, our method treats detection as a graph clustering problem. Instance-level object regions are represented by strongly connected nodes in a graph which can be extracted by Markov Clustering. Therefore, our method can generate bounding boxes with arbitrary box geometry.
Markov Clustering Network (MCN) is an object detection method based on graph clustering. MCN translates an image into a spatial feature map, from which a lattice graph called the Stochastic Flow Graph (SFG) is constructed. The nodes of the SFG correspond to the feature vectors extracted from overlapping regions of the image. The edges are weighted by four flow values predicted by MCN. Each flow is a 2D map denoting the connection intensity from the current node to itself or to one of its three neighbors. In our prediction framework, the presence of an object is jointly represented by nodes with strong connections to each other, and the background region is represented by isolated nodes. Therefore, detecting an object is equivalent to predicting the flow values and then grouping the nodes according to their connection intensities. Given the flow values predicted by MCN, we extract the objectness by performing Markov Clustering (MC) on the SFG. The strongly connected nodes are grouped into clusters representing objects. By mapping the nodes of a cluster back to the input image, the corresponding bounding boxes can be produced by simple post-processing.
3.2 Object Representation by Stochastic Flow
The existing object detection methods can be categorized as top-down methods, where the detection relies on a coarse global observation of an image [15, 19, 20]. Due to the absence of local information, these methods usually predict offsets of the object size and orientation relative to predefined references (reference boxes) [13, 21]. Designing these references is task-specific; they can hardly cover all cases, which degrades the detection robustness. The problem becomes worse when detecting objects with arbitrary aspect ratios and various orientations. If the object geometry is not well supported by the references, many failures will occur when detecting these objects.
Our method approaches object detection in a bottom-up manner to solve the problems mentioned above. In our method, as shown in Figure 2, an input image is converted via MCN to a Stochastic Flow Graph (SFG) with nodes and directed edges weighted by four stochastic flows: a self-loop flow and three outgoing flows. For a given node, the four flows are positive and sum to 1. An object is abstracted as nodes connected by the outgoing flows, while the background region is represented by nodes isolated by their self-loop flows. Since the nodes have a corresponding spatial relation in the original image, the presence as well as the geometry (size and orientation) of an object can be represented by nodes and their flows, a representation that is insensitive to variations in size and orientation.
From a probabilistic point of view, the SFG models a Markov random walk process, where each node denotes a state in a Markov chain and the corresponding directed weighted edges represent the transition probabilities of this state. For a random walk starting at a given node, there exists a stationary distribution (or flow distribution) describing the possible destination nodes of this process. Specifically, the node with the maximum value in this distribution is called the attractor of the starting node. Therefore, strongly connected nodes can be regarded as nodes with the same attractor. This interpretation provides a probabilistic description of flows and clusters. Moreover, it allows us to uniquely represent an instance-level object region with an attractor, which is the foundation of our detection method.
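As an illustrative sketch (not the authors' implementation), the lattice structure described above can be assembled into a column-stochastic flow matrix with NumPy. The names `build_flow_matrix`, `f_self`, `f_down`, `f_right` and `f_left`, the row-major node indexing, and the handling of border nodes (flow that would leave the lattice is folded back into the self-loop) are assumptions made for this example. The three neighbors are taken to be bottom, right and left, as stated in the implementation section.

```python
import numpy as np

def build_flow_matrix(f_self, f_down, f_right, f_left):
    """Assemble a column-stochastic flow matrix M from four H x W flow maps.

    M[i, j] holds the flow from node j to node i, so column j is the
    transition distribution of a random walk starting at node j.
    Nodes are indexed row-major: index = y * W + x (an assumption).
    """
    H, W = f_self.shape
    N = H * W
    M = np.zeros((N, N))
    for y in range(H):
        for x in range(W):
            j = y * W + x
            M[j, j] += f_self[y, x]
            # (dy, dx, flow map) for the bottom, right and left neighbors
            for dy, dx, f in ((1, 0, f_down), (0, 1, f_right), (0, -1, f_left)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < H and 0 <= nx < W:
                    M[ny * W + nx, j] += f[y, x]
                else:
                    # Assumption: flow pointing off the lattice stays at the node.
                    M[j, j] += f[y, x]
    return M

# Toy 2 x 3 lattice with random positive flows normalized to sum to 1 per node.
rng = np.random.default_rng(0)
flows = rng.random((4, 2, 3)) + 1e-3
flows /= flows.sum(axis=0, keepdims=True)
M = build_flow_matrix(*flows)
```

Because the four flows of every node sum to 1, every column of `M` sums to 1, which is what allows it to be read as a Markov transition matrix.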
3.3 Detecting Object by Markov Clustering
Based on the probabilistic interpretation and the properties of the SFG, we apply Markov Clustering (MC) to extract the instance-level object regions. Markov Clustering is an algorithm that identifies strongly connected nodes and groups them into clusters. In Markov Clustering, a flow matrix is constructed from the SFG, with each entry representing the flow value from one node to another. (We use 1D and 2D notation interchangeably to index a node.) Each column of the flow matrix represents the transition probabilities of a Markov random walk starting at the corresponding node. Markov Clustering effectively computes the stationary distribution for each node. It consists of a set of iterations, each comprising a matrix-matrix multiplication and non-linear transformations, which are as follows:
Expand: Input , output .
Inflate: Input , output .
Prune: Input , output .
Each step operates on the intermediate result of the current iteration, and the iterations are repeated until convergence. The expansion step spreads the flows out of a node to potential new nodes. It enhances the flows to nodes that are reachable by multiple paths. The inflation and pruning steps regularize the iteration to ensure convergence by introducing a non-linearity into the process, while also strengthening the intra-cluster flows and weakening the inter-cluster flows. The pseudo-code for Markov Clustering is presented in Algorithm 1.
At the start of the process, the outgoing flow distribution of a node is smooth and uniform, and it becomes more and more peaked as the iterations are executed. The columns of the flow matrix corresponding to the same cluster converge to the same one-hot vector. This reflects the fact that nodes within a tightly linked group flow to the same attractor in the end, which helps to identify any potential cluster. In addition, Markov Clustering does not require a predefined number of clusters, and since all three operations are parallelizable, Markov Clustering can be fully parallelized on GPUs.
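The iteration can be sketched in a few lines of NumPy. This is a minimal illustration following the standard formulation of Markov Clustering (expansion as a matrix square, inflation as an element-wise power followed by column normalization), which may differ in detail from the variant used in MCN. The function names, the inflation exponent of 2, the pruning threshold and the toy graph are assumptions chosen for the example.

```python
import numpy as np

def markov_clustering(M, inflation=2.0, prune_threshold=1e-4, max_iters=50):
    """Run expand / inflate / prune on a column-stochastic matrix M."""
    M = np.asarray(M, dtype=float)
    for _ in range(max_iters):
        # Expand: let the flow of every node travel one more step.
        expanded = M @ M
        # Inflate: sharpen each column, then renormalize it to sum to 1.
        inflated = expanded ** inflation
        inflated /= inflated.sum(axis=0, keepdims=True)
        # Prune: drop negligible flows, then renormalize again.
        pruned = np.where(inflated < prune_threshold, 0.0, inflated)
        pruned /= pruned.sum(axis=0, keepdims=True)
        converged = np.allclose(pruned, M, atol=1e-8)
        M = pruned
        if converged:
            break
    return M

def attractors(M):
    """Attractor of node j = row index of the largest entry in column j."""
    return M.argmax(axis=0)

# Toy graph: nodes 0 -> 1 -> 2 form one chain, nodes 3 -> 4 another.
# M_toy[i, j] is the flow from node j to node i; columns sum to 1.
M_toy = np.zeros((5, 5))
M_toy[0, 0], M_toy[1, 0] = 0.2, 0.8
M_toy[1, 1], M_toy[2, 1] = 0.2, 0.8
M_toy[2, 2] = 1.0
M_toy[3, 3], M_toy[4, 3] = 0.2, 0.8
M_toy[4, 4] = 1.0

M_final = markov_clustering(M_toy)
labels = attractors(M_final)
```

In this toy graph the first three columns converge to a one-hot vector at node 2 and the last two to a one-hot vector at node 4, so grouping nodes by `labels` yields the two clusters without specifying their number in advance.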
3.4 Learning Clustering with Flow Labels
In this section, we describe the learning algorithm that enables MCN to correctly predict the stochastic flows for clustering nodes.
Locating Attractors for Clusters. As described previously, the converged flow matrix describes the flow distribution over possible attractors for each node. Therefore, labeling clusters is equivalent to labeling the attractor of each node.
Defining the nodes within the same bounding box as a cluster, we compute the attractor for this cluster based on the geometry of the ground-truth bounding box. As shown in Figure 3 (a), given a ground-truth bounding box, we first compute the coordinates of the intersection between the major axis and the lower short side of the bounding box. Second, we draw a horizontal line that traverses the node with the lowest -coordinate in the bounding box region, and a vertical line that traverses the node nearest to this intersection. Finally, the node at the intersection of these two lines is taken as the attractor. To ensure that the attractor lies inside the bounding box, we adjust the bounding box size, which may introduce new nodes into it.
From Attractors to Cluster Labels. Markov Clustering outputs the stationary distribution over potential attractors for each node. Thus, the ground-truth label for each node is defined as a target distribution. As shown in Figure 3 (c), we first build a 2D mask that records the 1D attractor index for each node. Nodes within an object region share the same attractor index, while a node corresponding to the background is its own attractor. Based on the attractor mask, we generate a 3D cluster (flow) label describing the target stationary distribution for all nodes, which is shown in Figure 3 (d). For a node with a given attractor, the target distribution is a one-hot vector whose entry at the attractor's index is 1.
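The conversion from the 2D attractor mask to the 3D one-hot label can be sketched as follows. The function name `attractor_mask_to_labels`, the row-major 1D indexing and the toy mask are assumptions for illustration only.

```python
import numpy as np

def attractor_mask_to_labels(mask):
    """Turn an H x W mask of 1D attractor indices into one-hot targets.

    Returns an (H, W, H*W) array: entry [y, x, k] is 1 if node k is the
    attractor of node (y, x), and 0 otherwise.
    """
    H, W = mask.shape
    labels = np.zeros((H, W, H * W))
    ys, xs = np.indices((H, W))
    labels[ys, xs, mask] = 1.0
    return labels

# Toy 2 x 3 lattice: background nodes attract themselves (row-major index),
# while the three nodes of one object share attractor 5.
mask = np.arange(6).reshape(2, 3)
mask[0, 2] = mask[1, 1] = mask[1, 2] = 5
labels = attractor_mask_to_labels(mask)
```

Each node's label vector sums to 1, so it can be used directly as the target distribution of a cross-entropy loss.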
Loss Function. Given the converged flow distribution of a node and its target distribution, the loss function is the cross-entropy between these two distributions, and the flows of all nodes are globally optimized by minimizing the mean cross-entropy over all nodes.
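A minimal NumPy sketch of this objective is given below. It assumes that converged flows and targets are stored column-wise (one column per node), and the function name `flow_cross_entropy` and the small constant `eps` added for numerical stability are assumptions of this example rather than details from the paper.

```python
import numpy as np

def flow_cross_entropy(P, Y, eps=1e-12):
    """Cross-entropy between converged flows P and one-hot targets Y.

    P and Y are N x N with one column per node. Returns the per-node
    losses and their mean over all nodes.
    """
    per_node = -(Y * np.log(P + eps)).sum(axis=0)
    return per_node, per_node.mean()

# Two nodes, each its own attractor; the flows are close to the targets.
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])
Y = np.eye(2)
per_node, mean_loss = flow_cross_entropy(P, Y)
```

With one-hot targets, the per-node loss reduces to the negative log of the flow that the node sends to its labeled attractor.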
Gradients of Markov Clustering. End-to-end supervised training requires all the operations in a model to be differentiable and the data to be feasible to label. In this section, we focus on the differentiability of Markov Clustering. The operations included in Markov Clustering can be treated as special neural network layers, which are differentiable. We visualize the operations with a computing graph, which computes the stationary distribution and the corresponding cross-entropy loss given the flow matrix and the target distribution. In Figure 4, each node represents one operation in Markov Clustering, and the directed edges show the data flow throughout the whole clustering process. The output data of an operation is marked above the edge and the corresponding gradient is marked below. From the computing graph, the gradient of the cost function with respect to the stochastic flow is derived using the chain rule.
The computing graph consists of a main data path that runs from the flow matrix through a series of MC iterations to the output, and a set of side paths that feed the flow matrix directly into the input of each expansion node. Therefore, the gradient with respect to the flow matrix is computed by summing the gradients with respect to the inputs of all expansion nodes, as illustrated in Equation 11. In addition, to simplify the gradient computation, we set the pruning threshold so that pruning is equivalent to a ReLU operation. Thus, the inflation becomes an identity mapping. This trick slightly increases the number of iterations needed for convergence but simplifies the gradient computation, leading to faster training in general. In the testing phase, the threshold can be turned up for faster convergence.
4 Detailed Implementation of MCN for Scene Text Detection
The architecture of MCN, together with its inference flow and training flow, is shown in Figure 5. MCN builds on a CNN backbone network inherited from a pretrained VGG-16 model. We remove all the fully-connected layers and use the output features of conv5_3. For an input image size of , the conv5_3 output is of size . The conv5_3 features are fed to a Fore-/Background Subnetwork (FBN) and a Local Correlation Subnetwork (LCN), respectively. FBN detects multi-scale objects with a Feature Pyramid Network (FPN) and a 2D Recurrent Neural Network (2D-RNN). LCN predicts the spatial and semantic correlation between adjacent image patches with stride of . The objects’ presence probability produced by FBN and the local correlation measurements between the current image patch and its three neighbors (bottom, right and left) produced by LCN are translated into four flow maps. A lattice Stochastic Flow Graph (SFG) is constructed from the flow maps and is further described by a flow matrix. By performing Markov Clustering on the SFG, we group nodes that belong to the same object and generate instance-level bounding boxes based on Principal Component Analysis (PCA).
MCN is end-to-end trainable with bounding-box-level labeling. As illustrated in Figure 5, the ground-truth bounding boxes are converted to a node-wise object mask and a flow label, which are used to compute the Object Loss, Object Cost, Flow Loss and Flow Cost. The total cost is computed by summing the Object Cost and the Flow Cost.
SynthText contains over 800,000 synthetic scene text images. They are created by blending natural images with text rendered in random fonts, sizes, orientations and colors. It provides word-level bounding box annotations. We only use this dataset to pretrain our model.
ICDAR 2013 is a dataset containing horizontal text lines. It has 229 text images for training and 223 images for testing.
ICDAR 2015 consists of 1000 training images and 500 testing images. This dataset features incidental scene text images taken with Google Glass without taking care of positioning, viewpoint and image quality.
MSRA-TD500 is a multilingual dataset focusing on oriented texts. It consists of 300 training images and 200 testing images.
5.2 Experiment Details
Our model is pre-trained on SynthText and finetuned on real datasets. It is optimized by the standard SGD algorithm with a momentum of . Both training and testing images are resized to . The batch size is set to . In pretraining, the learning rate is set to for the first 60k iterations, and decayed by a factor of for the remaining 30k iterations. The finetuning on public benchmarks runs at a learning rate of with the data augmentation proposed in . In testing, the threshold used for pruning is set to . Both the training and testing flows are implemented with TensorFlow r1.1 on a Dell Precision T7500 workstation with an Intel Xeon 5600 processor, 40 GB memory and an NVIDIA GTX 1080 GPU.
5.3 Detailed Analysis
Baseline Comparison. We conduct an experiment to verify that the performance gain comes from the proposed framework. The baseline model (Local-link) predicts the fore-/background and four local link scores between nodes to capture the local correlation information. The instance-level bounding boxes are generated by finding the maximum connected (by link scores) component in the foreground regions. Both the baseline model and the MCN model are constructed on the VGG-16 backbone, and we keep the number of parameters roughly equal. The performance is shown in Table 1. The results show that our method is better overall than the baseline setting (Local-link). In the Local-link model, nodes between two text regions may be unexpectedly connected by undirected links, leading to a fusion of two individual text instances. Owing to directed flow prediction and a data-driven clustering mechanism, MCN greatly reduces unexpected connections and provides more robust instance-level bounding box proposals.
Table 1: Precision (P), recall (R) and F-score (F) of the Local-link baseline and MCN.
Profiling the MCN. Figure 6 visualizes the flows predicted by MCN. The input images with three orientations (horizontal, right-oblique and left-oblique) are shown in Figure 6 (a). Figure 6 (b) profiles the activation maps, including the object region prediction, the link scores and the four stochastic flows. All the activation maps are upsampled for demonstration. According to the activation maps of the stochastic flows, we draw the dominant flows and label the attractor on the input image, as shown in Figure 6 (c). In Figure 6 (d), the predicted bounding boxes and the ground-truth bounding boxes are labeled in yellow and red respectively.
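| Method | ICDAR-13 P | ICDAR-13 R | ICDAR-13 F | ICDAR-15 P | ICDAR-15 R | ICDAR-15 F | MSRA-TD500 P | MSRA-TD500 R | MSRA-TD500 F |
|---|---|---|---|---|---|---|---|---|---|
| Jaderberg et al. | 0.89 | 0.68 | 0.77 | - | - | - | - | - | - |
| Zhang et al. | 0.88 | 0.78 | 0.83 | 0.71 | 0.43 | 0.54 | 0.83 | 0.67 | 0.74 |
| Gupta et al. | 0.92 | 0.75 | 0.83 | - | - | - | - | - | - |
| Yao et al. | - | - | - | 0.72 | 0.59 | 0.65 | 0.77 | 0.75 | 0.76 |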
On the one hand, MCN shows high accuracy in detecting text objects that are close to each other. As shown in the objectness map in the first row of Figure 6, the regions of multiple text objects merge together and we cannot generate bounding boxes directly from this map. The stochastic flows predicted by MCN capture the instance-level correlation and separate the merged regions into clusters. In some challenging cases with low image quality, the flow-based prediction can maintain good performance since a text object is jointly predicted by multiple nodes and their connections. On the other hand, MCN is flexible enough to handle text objects with different lengths and orientations. The orientation of an object is represented jointly by all the flows within the object region, resulting in more accurate bounding box generation.
5.4 Performance Comparison
Table 2 compares our method with the published works on public datasets of scene text detection. On the ICDAR-13 dataset, our method reaches state-of-the-art performance with a precision of 0.88, a recall of 0.87 and an F-score of 0.88. On ICDAR-15, a slight performance drop is observed compared to the existing text detection methods. Since most of the text objects are smaller than the node density ( pixel), the flows predicted for these objects are weak, leading to inaccurate object detection. However, MCN achieves new state-of-the-art performance on the MSRA-TD500 dataset. As shown in Table 2, MCN outperforms the existing methods by a large margin with a precision of 0.88, a recall of 0.79 and an F-score of 0.83. Unlike ICDAR-13, which consists of only horizontal text objects, MSRA-TD500 contains a large number of oblique and long text samples. The performance improvement on MSRA-TD500 shows that MCN is better at detecting multi-oriented text objects.
Figure 7 demonstrates the bounding box predictions of MCN. The samples include both English and Chinese text with different scales and orientations. The predicted bounding boxes are labeled in yellow and the ground truths are labeled in red. As shown in Figure 7, MCN robustly detects multilingual text objects with various scales and orientations. The flow clustering framework flexibly supports different bounding box geometries. Compared with the region-proposal-based scene text detection algorithms [23, 21, 6, 13], our method predicts more precise instance-level bounding boxes. Compared with the segmentation-based methods [29, 12], our method involves far fewer heuristic operations. Its flexibility and data-driven characteristics make MCN superior to existing scene text detection methods.
In this section, we analyze the computation time of Markov Clustering. The Markov Clustering algorithm is implemented with CUDA 8.0 and the cuDNN 5 library.
Table 3 reports the computation time of Markov Clustering, together with the corresponding precision, recall and F-score of bounding box prediction, for different numbers of iterations. In general, the computation time of Markov Clustering increases linearly with the number of iterations. The detection performance also improves as the number of iterations increases, since the flow matrix requires a sufficient number of iterations for convergence. Fortunately, only a few iterations are needed to reach the best detection performance. As shown in Table 3, at its best-performing setting MCN takes only 0.86 ms to compute the clusters. This computing time is negligible compared to the whole inference time of over 25 ms.
We also compare the inference speed in FPS with recently proposed scene text detection methods on the ICDAR-13 dataset. As shown in Table 4, our method achieves state-of-the-art performance and outperforms the existing methods in speed. This is owing to the flow-based method, which can tolerate inaccurate fore-/background prediction and thus maintains the same performance with fewer network parameters.
We present a novel Markov Clustering Network (MCN) for scene text detection. We treat the object detection problem as a graph-based clustering problem and develop an end-to-end trainable model for flexible scene text detection. MCN shows superiority in terms of accuracy, robustness and speed. MCN outperforms the existing scene text detection algorithms in detecting multi-scale and multi-oriented text objects. It also achieves a speedup in comparison with the state-of-the-art algorithm. Our method is complementary to the existing top-down methods. Applying extra top-down information to further improve the detection performance will be considered as future work.
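The reported MSRA-TD500 F-score can be checked against the precision and recall, assuming the F-score is the usual harmonic mean of the two. The function name `f_score` is introduced here only for illustration.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# MSRA-TD500 values quoted in the text.
f_msra = f_score(0.88, 0.79)
```

Rounded to two decimals, `f_msra` agrees with the F-score of 0.83 quoted for MSRA-TD500.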
- [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
- [2] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven. PhotoOCR: Reading text in uncontrolled conditions. In Proceedings of the IEEE International Conference on Computer Vision, pages 785–792, 2013.
- [3] H. Ding, X. Jiang, B. Shuai, A. Q. Liu, and G. Wang. Context contrasted feature and gated multi-scale aggregation for scene segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- [4] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
- [5] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2315–2324, 2016.
- [6] W. He, X.-Y. Zhang, F. Yin, and C.-L. Liu. Deep direct regression for multi-oriented scene text detection. arXiv preprint arXiv:1703.08289, 2017.
- [7] W. Huang, Z. Lin, J. Yang, and J. Wang. Text localization in natural images using stroke feature transform and text covariance descriptors. In Proceedings of the IEEE International Conference on Computer Vision, pages 1241–1248, 2013.
- [8] W. Huang, Y. Qiao, and X. Tang. Robust scene text detection with convolution neural network induced MSER trees. In European Conference on Computer Vision, pages 497–511. Springer, 2014.
- [9] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. International Journal of Computer Vision, 116(1):1–20, 2016.
- [10] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. ICDAR 2015 competition on robust reading. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, pages 1156–1160. IEEE, 2015.
- [11] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. de las Heras. ICDAR 2013 robust reading competition. In 2013 12th International Conference on Document Analysis and Recognition, pages 1484–1493. IEEE, 2013.
- [12] Y. Li and J. Ma. A unified deep neural network for scene text detection. In International Conference on Intelligent Computing, pages 101–112. Springer, 2017.
- [13] M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu. TextBoxes: A fast text detector with a single deep neural network. In AAAI, pages 4161–4167, 2017.
- [14] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. arXiv preprint arXiv:1612.03144, 2016.
- [15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
- [16] A. Neubeck and L. Van Gool. Efficient non-maximum suppression. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, volume 3, pages 850–855. IEEE, 2006.
- [17] L. Neumann and J. Matas. Real-time lexicon-free scene text localization and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1872–1885, 2016.
- [18] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40–53, 2008.
- [19] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
- [20] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
- [21] B. Shi, X. Bai, and S. Belongie. Detecting oriented text in natural images by linking segments. arXiv preprint arXiv:1703.06520, 2017.
- [22] S. Tian, Y. Pan, C. Huang, S. Lu, K. Yu, and C. Lim Tan. Text flow: A unified text detection system in natural scene images. In Proceedings of the IEEE International Conference on Computer Vision, pages 4651–4659, 2015.
- [23] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao. Detecting text in natural image with connectionist text proposal network. In European Conference on Computer Vision, pages 56–72. Springer, 2016.
- [24] S. M. Van Dongen. Graph clustering by flow simulation. PhD thesis, 2001.
- [25] K. Wang and S. Belongie. Word spotting in the wild. In European Conference on Computer Vision, pages 591–604. Springer, 2010.
- [26] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng. End-to-end text recognition with convolutional neural networks. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 3304–3308. IEEE, 2012.
- [27] C. Wolf and J.-M. Jolion. Object count/area graphs for the evaluation of object detection and segmentation algorithms. International Journal of Document Analysis and Recognition (IJDAR), 8(4):280–296, 2006.
- [28] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu. Detecting texts of arbitrary orientations in natural images. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1083–1090. IEEE, 2012.
- [29] C. Yao, X. Bai, N. Sang, X. Zhou, S. Zhou, and Z. Cao. Scene text detection via holistic, multi-channel prediction. arXiv preprint arXiv:1606.09002, 2016.
- [30] Z. Zhang, W. Shen, C. Yao, and X. Bai. Symmetry-based text line detection in natural scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2558–2567, 2015.
- [31] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai. Multi-oriented text detection with fully convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4159–4167, 2016.
Appendix A Appendices
A.1 Bounding Box Generation
Given the vertices of a cluster, we compute their coordinates in the input image. The bounding box size and orientation of each cluster are then computed based on Principal Component Analysis (PCA). Given the set of coordinates of a cluster, we compute its two eigenvectors as well as the corresponding eigenvalues.
The coordinates of the four corners of the bounding box are computed from the center of the cluster, the eigenvectors, the eigenvalues and a scaling factor.
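A PCA-based oriented box of this kind can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact formula: the half-extent along each principal axis is taken to be `scale` times the square root of the corresponding eigenvalue of the coordinate covariance matrix, and the function name `pca_bounding_box`, the default `scale=2.0` and the toy cluster are all hypothetical.

```python
import numpy as np

def pca_bounding_box(points, scale=2.0):
    """Return the 4 corners (4 x 2 array) of an oriented box around points."""
    pts = np.asarray(points, dtype=float)
    center = pts.mean(axis=0)
    cov = np.cov((pts - center).T)            # 2 x 2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    minor, major = eigvecs[:, 0], eigvecs[:, 1]
    half_w = scale * np.sqrt(eigvals[1])
    half_h = scale * np.sqrt(max(eigvals[0], 0.0))
    # Walk around the box: the corners are center +/- the two half-axes.
    return np.array([
        center + sx * half_w * major + sy * half_h * minor
        for sx, sy in ((-1, -1), (1, -1), (1, 1), (-1, 1))
    ])

# Toy cluster: a 10 x 2 block of lattice nodes forming a horizontal text line.
xs, ys = np.meshgrid(np.arange(10), np.arange(2))
cluster = np.stack([xs.ravel(), ys.ravel()], axis=1)
corners = pca_bounding_box(cluster)
width = np.linalg.norm(corners[1] - corners[0])
height = np.linalg.norm(corners[2] - corners[1])
```

Because the box axes follow the eigenvectors, the same code handles rotated clusters without any predefined reference box; only the scaling factor has to be chosen.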
A.2 From Image to Stochastic Flow
Accurate object detection relies on correct flow prediction. In MCN, the four stochastic flows are the outputs of the Flow Mapping Layer (FML), which takes the regional object probability and the three correlation measurements as inputs. The object probability is generated by the Fore-/Background Network (FBN), while the correlation measurements are output by the Local Correlation Network (LCN). Both FBN and LCN start from the conv5_3 layer of the pretrained VGG-16 network.
A.2.1 Fore-/Background Network
As shown in Figure 8 (a), the Fore-/Background Network is an FPN-based network with spatial recurrent components and a softmax output that predicts the object score. The output of conv5_3 is processed by a Feature Pyramid Network (FPN) and a 2-dimensional Recurrent Neural Network (2D-RNN) successively. In the FPN shown in Figure 8 (b), input with size of is processed by four convolutional blocks with pooling layers to obtain additional feature maps with resolution of , , and . These feature maps, together with the input, are fused to resolution of by deconvolution consisting of layer-wise addition, bilinear upsampling and convolution. By fusing features with different resolutions in a pyramid manner, our method has a larger capacity to detect multi-scale objects with fewer parameters. Subsequently, the output of the FPN is fed to a 2D Recurrent Neural Network (2D-RNN) before region-based classification. We consider a spatial feature map as a 2D sequence that can be directly analyzed by a 2D-RNN. The structure of the proposed 2D-RNN is shown in Figure 8 (c). A 2D-RNN is composed of two bidirectional RNNs (RNN-H and RNN-V), which are applied to the rows and columns of the input feature map independently. As shown in Figure 8, the output of the 2D-RNN is constructed by concatenating the two feature maps produced by RNN-H and RNN-V with size of along the depth axis. Finally, a region-based classification is performed on the output feature map by a 2-layer convolutional network with softmax output, Figure 9 (d).
A.2.2 Local Correlation Subnetwork
To predict the semantic and spatial correlation between adjacent subregions, we build another subnetwork with four additional convolutional blocks and a softmax classifier starting at conv5_3, shown in Figure 9. The network outputs three correlation measurements representing the semantic and spatial correlation between the current anchor and its three neighbors (bottom, right and left) respectively. As the conv5_3 features correspond to subregions of the input image with stride of , the LCN actually measures the correlation among these overlapping subregions. Together with the output of the objectness network, the three correlation measurements are mapped to the four stochastic flows by the Flow Mapping Layer (FML).
A.2.3 Flow Mapping Layer
The Flow Mapping Layer (FML) is a point-wise non-linear function that takes the object probability and the three correlation measurements as input and outputs the four stochastic flows.
The self-loop flow is the transition probability of the self-loop, which is controlled by the likelihood of background and the correlation measurements between the current vertex and its neighbors. It is designed to be weak for vertices within the same object region and strong for a vertex that corresponds to the background or is the attractor of a cluster. This behavior is realized by first measuring the correlation intensity, modulated by an on-off function, and then projecting it to the exponential space. The on-off function is parameterized by trainable variables. It takes the object probability as input and produces an on-off signal that controls the correlation intensity. It disables the effect of the correlation measurements and drives the self-loop flow towards 1 when a vertex is in the background region. Accordingly, the values of the outgoing flows will be small, isolating all the background vertices. In the object region, the on-off function no longer dominates, and the correlation measurements take control of the self-loop flow. In this case, the self-loop flow will be large if weak correlation is measured, and the vertex will become the attractor of a cluster. Otherwise, the vertices belonging to the same object region will be connected through the outgoing flows, and the flows of a cluster will end at the attractor.