Road Segmentation Using CNN with GRU
This paper presents an accurate and fast algorithm for road segmentation using a convolutional neural network (CNN) and gated recurrent units (GRU). For autonomous vehicles, road segmentation is a fundamental task that provides the drivable area for path planning. Existing deep neural network based segmentation algorithms usually use a very deep encoder-decoder structure to fuse pixels, which requires heavy computation, large memory and long processing time. To address this, a CNN-GRU network model is proposed and trained to perform road segmentation using data captured by the front camera of a vehicle. The GRU network processes a long spatial sequence with lower computational complexity than a traditional encoder-decoder architecture. The proposed road detector is evaluated on the KITTI road benchmark and achieves high accuracy for road segmentation at real-time processing speed.
In recent years, there has been growing research interest in automated driving and intelligent vehicles. As one of the most important parts of an automated driving system, the road perception algorithm first gathers information from the road and sets up the constraints for the subsequent path planners. It then searches for the drivable area and the lane occupancy so that the region for path planning and lane keeping can be determined. In this paper, we focus on road segmentation using monocular camera input.
Cameras are the most popular sensors for autonomous and intelligent vehicles since they are cost effective. Existing test benches such as KITTI provide annotated images for the evaluation of road/lane segmentation. Traditional computer vision based road segmentation algorithms often employ manually defined features such as edges and histograms. However, manually defined features usually address only limited aspects of the problem and are hard to extend to new domains. Since 2014, CNN based deep learning algorithms have become more popular. A CNN is a kind of neural network that takes advantage of many parallel and cascaded convolutional filters to solve high-dimensional non-convex problems such as regression, image classification, object detection and semantic segmentation. By processing limited dimensions and sharing weights in each layer, a CNN requires fewer parameters than a traditional artificial neural network and is much easier to train. From AlexNet, GoogleNet, VGGNet and InceptionNet-v3 to ResNet, convolutional neural networks have grown larger, which has resulted in better performance. Several well-known convolutional neural networks are compared in accuracy and efficiency in Table I. Canziani et al. also give a detailed comparison among different CNNs.
By using deeper layers and more trainable parameters, convolutional neural networks achieve strong performance under varying lighting conditions, scales and shapes. Unfortunately, as networks become deeper and larger, they require more computation, memory and processing time, which exceeds the capability of the embedded systems in an autonomous vehicle. More recently, several efficient CNNs have been introduced to reduce the number of parameters and the computational complexity for embedded devices. SqueezeNet, MobileNet, ShuffleNet and Xception are state-of-the-art efficient CNNs that separate pixel wise convolutions and dense wise convolutions by applying grouped convolution and convolutions. These efficient CNNs achieve competitive accuracy with less memory and processing time than traditional CNNs.
A recurrent neural network (RNN) is a kind of neural network structure that processes data sequences. Unlike traditional artificial neural networks, which fully connect all nodes, and convolutional neural networks, which explore nodes from local to global layer by layer, recurrent neural networks use state neurons to explore relationships in context. Simple RNN, LSTM and GRU are typical recurrent neural networks. RNNs have been proposed to solve hard sequence problems such as machine translation and video captioning. Most recently, RNNs have also been used to solve spatial sequence, 2D image sequence and spatial-temporal sequence problems.
In this paper, the problem of road segmentation is framed as a semantic caption task. The top, left and right boundaries of the road area in an image are extracted by a CNN-GRU network. A CNN based local feature extractor and a GRU based context processor are implemented to construct the network. The proposed solution is trained and evaluated on the KITTI road benchmark and the results are satisfactory. We claim that the proposed network is embedded-system friendly and ready for real-time applications. The rest of the paper is organized as follows. Section II describes the proposed architecture. In Section III, experimental results on the benchmark are presented and analyzed. Finally, Section IV concludes the paper.
|Network|Publish year|Parameters|Multi-Adds|Top-1 accuracy on ImageNet|
II Algorithm Design
In this section, the overview of the proposed neural network is described, followed by the details of the main components including coordinate input channels, local feature encoder and context processor.
II-A Coordinate Input Channels
The coordinates of pixels in a road image are important to perception tasks. Most existing CNN based detection and segmentation solutions are trained with a large collection of images such as ImageNet, in which the coordinates of pixels/cells are not taken as features because the camera view varies from image to image and objects may appear anywhere in an image. For road perception, however, there is a strong correlation between the likelihood of existence, shape and pose of an object and its position in an image. For example, road pavement is more likely to be located at the bottom of an image captured by a front-view camera, while cars tend to appear smaller in the center of an image and larger on the sides. Figure 1 shows a traffic scene heat map, obtained by analyzing the KITTI dataset, that indicates the probability of a pixel belonging to the road area. The closer a pixel is to the horizontal center and to the bottom of the image, the more likely it belongs to the road area. Several research works have taken coordinate input into consideration. In YOLO and MultiNet, images are divided into cells and a convolutional neural network is built to process all cells in one image; coordinates are involved in the convolution of cells. In other works, coordinates are introduced at the end of the CNN structure as a bias on the decoders. In our solution, coordinates are introduced directly alongside the color channels to provide more position-related information, as in our previous work.
II-B Local Feature Encoder
The local feature encoder is a CNN based network that extracts features such as illumination and edges from local patches. In CNNs, local features are usually extracted by a group of convolution kernels trained on a large number of samples for a specific task. Traditional CNN based encoders such as FCN cascade a number of convolution layers, each grouped with pooling and non-linear functions, which requires large memory and extensive computation. In the proposed network, we implement a shallow structure with large kernels followed by a convolutional layer, as used in earlier work. The first convolution layer in the encoder is constructed from shallow convolution kernels instead of the four convolution layers used in FCN. Figure 2 presents the detailed structure of the proposed encoder. To generate enough features while limiting the computational complexity, we apply a convolution to reduce the dimension. Subsequently, two further convolution layers are applied to further encode local features. Finally, the input is encoded into a sequence of feature vectors, each containing a fixed number of features.
II-C Context Processor
Besides local features, context information throughout the entire image is also important to road segmentation. CNN based solutions use a deep encoder-decoder structure that passes through all tensors in the feature map to achieve context processing. Structures of this kind usually require very large GPU memory, a vast number of floating-point operations and long processing time. However, embedded systems in an intelligent vehicle have limited computational resources yet require real-time processing speed. In StixelNet, a conditional random field (CRF) is applied for context processing. It significantly reduces memory and floating-point operations at the cost of slow processing speed, approximately 1 second per frame. In our work, a GRU network is applied as the context processor since it not only has rich gates to handle diverse features but can also be trained end-to-end. Columns of feature vectors are fed to the GRU in sequence and context information is stored in its hidden states. Since rich context information is contained in both directions (from left to right and from right to left), a bi-directional GRU is built to process the feature vector sequence in both directions. The hidden state has 128 neurons for each direction, so each context processor contains 256 neurons in total.
In our method, two context processors are built. As shown in Figure 3, the first processor predicts the left and right bounds of the road area and returns a single output vector after it has processed the full sequence of feature vectors. The second processor predicts the upper boundary of the road area and returns an output vector each time it processes a feature vector. Both context processors are followed by a two-layer decoder that interprets the vectors as normalized boundary positions. An upsampling layer is applied to match the input height and width. The left, right and upper boundaries separate the road area from non-road pixels in the image. Together with the bottom of the image as the default lower boundary, they enclose the area that is marked as road in the segmentation output.
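The two output modes described above (one vector after the full sequence, and one vector per feature column) can be illustrated with a minimal NumPy sketch of a bi-directional GRU. This is not the authors' implementation: the weights are random and untrained, the gate equations follow one common GRU formulation, and the names `GRUCell`, `run_gru` and `bidirectional_gru` as well as the sequence length and feature size in the usage note are assumptions made for illustration. Only the hidden size of 128 per direction comes from the text.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRUCell:
    """Single GRU cell with randomly initialised (untrained) weights."""

    def __init__(self, input_size, hidden_size, rng):
        scale = 1.0 / np.sqrt(hidden_size)
        # Index 0: update gate z, index 1: reset gate r, index 2: candidate n.
        self.W = rng.uniform(-scale, scale, (3, input_size, hidden_size))
        self.U = rng.uniform(-scale, scale, (3, hidden_size, hidden_size))
        self.b = np.zeros((3, hidden_size))
        self.hidden_size = hidden_size

    def step(self, x, h):
        z = sigmoid(x @ self.W[0] + h @ self.U[0] + self.b[0])
        r = sigmoid(x @ self.W[1] + h @ self.U[1] + self.b[1])
        n = np.tanh(x @ self.W[2] + (r * h) @ self.U[2] + self.b[2])
        # Blend the previous state and the candidate state.
        return (1.0 - z) * n + z * h


def run_gru(cell, seq):
    """Run the cell over seq (T, input_size); return all states (T, hidden)."""
    h = np.zeros(cell.hidden_size)
    states = []
    for x in seq:
        h = cell.step(x, h)
        states.append(h)
    return np.stack(states)


def bidirectional_gru(fwd, bwd, seq):
    """Return per-step states (T, 2*hidden) and a final summary (2*hidden,)."""
    h_fwd = run_gru(fwd, seq)              # left to right
    h_bwd = run_gru(bwd, seq[::-1])[::-1]  # right to left, re-aligned to columns
    per_step = np.concatenate([h_fwd, h_bwd], axis=1)
    # Last state of each direction: end of the forward pass, start of the
    # re-aligned backward pass.
    final = np.concatenate([h_fwd[-1], h_bwd[0]])
    return per_step, final
```

With, say, 40 feature columns of 64 features each (both hypothetical), `bidirectional_gru(GRUCell(64, 128, rng), GRUCell(64, 128, rng), feats)` yields a `(40, 256)` per-column output, which matches the role of the upper-boundary processor, and a `(256,)` summary vector, which matches the role of the left/right-boundary processor.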
II-D Network Structure
The overall network architecture is shown in Figure 4. The input of the neural network is a five-channel image. The first three input channels are the red, green and blue channels from the camera data, augmented with two additional channels holding the row and column coordinates of each pixel. To help training converge, all RGB channels are divided by 255, the row channel is divided by the image height and the column channel is divided by the image width, so that all input channels are normalized to the range [0, 1]. After passing through the local feature encoder, context processors and decoders, the left, right and upper boundaries of the road area in the image are generated. For better convergence during training, the output boundaries are also normalized to a fixed range.
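The input construction described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `build_input` is hypothetical, and the use of zero-based pixel indices for the coordinate channels is an assumption.

```python
import numpy as np


def build_input(rgb):
    """Turn a uint8 RGB image (H, W, 3) into a float32 tensor (H, W, 5)."""
    h, w, _ = rgb.shape
    color = rgb.astype(np.float32) / 255.0
    # Zero-based row/column index of every pixel (assumed convention).
    rows, cols = np.meshgrid(
        np.arange(h, dtype=np.float32),
        np.arange(w, dtype=np.float32),
        indexing="ij",
    )
    row_ch = rows / h  # normalized by image height
    col_ch = cols / w  # normalized by image width
    return np.concatenate(
        [color, row_ch[..., None], col_ch[..., None]], axis=-1
    )
```

Every channel of the result lies in [0, 1], so the color and coordinate channels are on a comparable scale when they enter the first convolution layer.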
For evaluation and visualization purposes, a binary map is generated from the output boundaries. The predicted road area consists of the pixels enclosed by the three boundaries and the bottom of the image.
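A minimal sketch of this boundary-to-mask step is given below. It assumes that the left and right boundaries are scalars normalized by the image width, that the upper boundary has already been upsampled to one value per image column and is normalized by the image height with 0 at the top row; the function name `boundaries_to_mask` and these conventions are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np


def boundaries_to_mask(left, right, upper, height, width):
    """Rasterize normalized boundaries into a boolean road mask (H, W).

    left, right: normalized column positions in [0, 1].
    upper: array of length `width`, normalized row position per column.
    """
    cols = np.arange(width)
    rows = np.arange(height)[:, None]
    left_px = left * width
    right_px = right * width
    upper_px = np.asarray(upper, dtype=float) * height
    # A pixel is road if it lies between the left and right bounds ...
    inside_cols = (cols >= left_px) & (cols < right_px)
    # ... and at or below the upper boundary of its column. The bottom of
    # the image acts as the default lower boundary.
    below_upper = rows >= upper_px[None, :]
    return below_upper & inside_cols[None, :]
```

For example, with `left=0.25`, `right=0.75` and a flat upper boundary at `0.5` on a 4 by 8 image, the mask covers the lower two rows of columns 2 to 5.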
II-E Pyramid Prediction Scheme
Within each image frame, the features in the near range and the far range differ dramatically in size. Simply scaling an entire image frame to fit the input size of the network would result in an unacceptable level of feature loss in the far range and cause low accuracy when detecting the road at a greater distance. To avoid this problem, we propose a pyramid prediction scheme. When predicting the road area in the near range, the image frame is scaled down to the network input size before being sent to the network. When predicting the road area in the far range, the image frame is cropped to match the network input size. By applying the pyramid prediction scheme, the road areas in the near range and the far range are predicted separately so that features in both ranges are scaled to a similar size, which makes our local feature encoder more stable and easier to train.
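The two network inputs of the pyramid scheme can be sketched as below. The network input size is not recoverable from the text, so `net_h` and `net_w` are placeholders; the nearest-neighbour resize and the choice of a centered crop for the far range are assumptions made purely for illustration, and the function names are hypothetical.

```python
import numpy as np


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) image to (out_h, out_w, C)."""
    h, w = img.shape[:2]
    r = np.arange(out_h) * h // out_h
    c = np.arange(out_w) * w // out_w
    return img[r[:, None], c[None, :]]


def center_crop(img, out_h, out_w):
    """Crop a centered (out_h, out_w) window; assumes img is at least that big."""
    h, w = img.shape[:2]
    top = (h - out_h) // 2
    left = (w - out_w) // 2
    return img[top:top + out_h, left:left + out_w]


def pyramid_inputs(frame, net_h, net_w):
    """Return the near-range (scaled) and far-range (cropped) network inputs."""
    near = resize_nearest(frame, net_h, net_w)  # whole frame, lower resolution
    far = center_crop(frame, net_h, net_w)      # sub-window, full resolution
    return near, far
```

The near-range input keeps the full field of view at reduced resolution, while the far-range input keeps the original resolution over a smaller window, so distant road features are not shrunk before they reach the encoder.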
The proposed network is trained and evaluated using the KITTI benchmark. The KITTI dataset contains 289 training images and 290 testing images for road detection. The training images vary in size and each comes with a binary label map representing the drivable area. For training, we augment the data samples by rescaling the original images together with the ground-truth images and then cropping them with a shifting window. The horizontal shift is 60 pixels and the vertical shift is 20 pixels. In total, 20,808 samples are generated and split into a training set of 20,500 samples and a validation set of 308 samples. We also add Gaussian noise with a standard deviation of 0.02% to the input data for additional diversity. The mean absolute error (MAE) of the boundary location in each column is selected as the loss function. Adam is a gradient descent based optimizer that adapts the learning rate of each parameter based on estimates of lower-order moments of the gradients. We choose Adam as the optimizer because it converges quickly at the beginning and slows down near convergence. The batch size is set to 125 and the learning rate is fixed at 1e-4. Figure 6 shows the loss on the validation data after each training epoch. After a total of 80 training epochs, we obtain an MAE of 0.0185 on the validation data.
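The shifting-window augmentation and the MAE loss can be sketched as follows. The 60-pixel horizontal and 20-pixel vertical shifts come from the text; the crop size, the function names and the handling of image borders are assumptions, since the original sizes are not recoverable here.

```python
import numpy as np


def shifting_window_crops(image, label, crop_h, crop_w, step_x=60, step_y=20):
    """Yield aligned (image, label) crops taken with a shifting window."""
    h, w = image.shape[:2]
    for top in range(0, h - crop_h + 1, step_y):
        for left in range(0, w - crop_w + 1, step_x):
            yield (
                image[top:top + crop_h, left:left + crop_w],
                label[top:top + crop_h, left:left + crop_w],
            )


def mean_absolute_error(pred, target):
    """MAE between predicted and ground-truth normalized boundary positions."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(np.abs(pred - target)))
```

Because the image and its label map are cropped with the same window, the boundary targets derived from each label crop stay aligned with the corresponding input crop.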
We evaluated the trained network on the KITTI test bench. There are two main metrics for evaluating road segmentation: F1-score and average precision (AP). The metrics are calculated as in (1-4), where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives.
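As a hedged illustration, the standard pixel-wise definitions of precision, recall, F1-score and false positive rate in terms of these counts are sketched below. These are the usual textbook formulas and may not correspond one-to-one to the paper's equations (1-4); AP, which requires sweeping a confidence threshold, is not covered. The function names are hypothetical.

```python
import numpy as np


def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN between two binary road masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))
    tn = int(np.sum(~pred & ~truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    return tp, tn, fp, fn


def road_metrics(tp, tn, fp, fn):
    """Standard metrics; assumes none of the denominators is zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}
```

A low false positive rate means that few non-road pixels are labeled as drivable, which is the safety-relevant direction of error for path planning.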
The evaluation results obtained from KITTI are an F1-score of 86.91% and an AP of 81.11%, which is comparable to the state-of-the-art methods reported so far. In addition, our solution has a low false positive rate of 4.39%, which is desirable for the safety of an autonomous vehicle. In Table II, our work is compared with related solutions listed on the KITTI road detection test bench. The comparison shows that our work has a similar F1-score and average precision to other works but a higher precision and a lower false positive rate. More importantly, the proposed network has far fewer parameters to train and requires significantly fewer floating-point operations. Our proposed method of road segmentation achieves real-time speed at 50 frames per second when tested on an NVIDIA GTX 950M GPU with moderate processing power. We claim that the proposed solution is among the fastest on the KITTI road detection test bench.
Figure 7 shows typical results of our proposed road detector. Green pixels are true positives, blue pixels are false positives and red pixels are false negatives. It can be seen that the majority of the road surface is detected, and obstacles such as vehicles and railways are separated from the road to avoid collisions. False negatives usually occur at road/vehicle and road/sidewalk boundaries, which is mostly acceptable for automated driving. However, the false positives on the sidewalks require further improvement.
Footnote 1: FLOPs are estimated from the published results of the neural networks.
In this paper, we present a neural network based solution for road segmentation that achieves real-time processing speed. The CNN-RNN network mainly consists of a lightweight local feature encoder and a recurrent neural network that processes context information, which significantly reduces the number of floating-point operations and the memory usage. We train the network on the KITTI road training set and evaluate it on the KITTI test bench. The test results show that our algorithm achieves an F1-score of 86.91% and an average precision of 81.11%. However, image-based road segmentation is still subject to lighting conditions. Shadows, blur and confusing colors are the main causes of false positives and false negatives. In future work, multiple sensors including cameras, LiDARs and IMUs will be fused to further improve the performance of the road detector.
-  A. B. Hillel, R. Lerner, D. Levi, and G. Raz, “Recent progress in road and lane detection: a survey,” Machine vision and applications, vol. 25, no. 3, pp. 727–745, 2014.
-  J. Son, H. Yoo, S. Kim, and K. Sohn, “Real-time illumination invariant lane detection for lane departure warning system,” Expert Systems with Applications, vol. 42, no. 4, pp. 1816–1824, 2015.
-  L. Chen, J. Yang, and H. Kong, “Lidar-histogram for fast road and obstacle detection,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 1343–1348.
-  A. Khosroshahi, E. Ohn-Bar, and M. M. Trivedi, “Surround vehicles trajectory analysis with recurrent neural networks,” in Intelligent Transportation Systems (ITSC), 2016 IEEE 19th International Conference on. IEEE, 2016, pp. 2267–2272.
-  Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M. Yang, “Hedged deep tracking,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 4303–4311. [Online]. Available: https://doi.org/10.1109/CVPR.2016.466
-  J. Fritsch, T. Kuhnl, and A. Geiger, “A new performance measure and evaluation benchmark for road detection algorithms,” in Intelligent Transportation Systems-(ITSC), 2013 16th International IEEE Conference on. IEEE, 2013, pp. 1693–1700.
-  H. Yoo, U. Yang, and K. Sohn, “Gradient-enhancing conversion for illumination-robust lane detection,” IEEE Transactions on Intelligent Transportation Systems, vol. 14, no. 3, pp. 1083–1094, 2013.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” CoRR, vol. abs/1409.4842, 2014. [Online]. Available: http://arxiv.org/abs/1409.4842
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  A. Canziani, A. Paszke, and E. Culurciello, “An analysis of deep neural network models for practical applications,” CoRR, vol. abs/1605.07678, 2016. [Online]. Available: http://arxiv.org/abs/1605.07678
-  F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <1mb model size,” CoRR, vol. abs/1602.07360, 2016. [Online]. Available: http://arxiv.org/abs/1602.07360
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
-  X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” CoRR, vol. abs/1707.01083, 2017. [Online]. Available: http://arxiv.org/abs/1707.01083
-  F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” CoRR, vol. abs/1610.02357, 2016. [Online]. Available: http://arxiv.org/abs/1610.02357
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” CoRR, vol. abs/1412.3555, 2014. [Online]. Available: http://arxiv.org/abs/1412.3555
-  N. Kalchbrenner and P. Blunsom, “Recurrent continuous translation models,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013, pp. 1700–1709.
-  S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, “Sequence to sequence - Video to text,” in Proceedings of the IEEE International Conference on Computer Vision, vol. 2015 International Conference on Computer Vision, ICCV 2015, 2015, pp. 4534–4542.
-  X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo, “Convolutional lstm network: A machine learning approach for precipitation nowcasting,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 802–810. [Online]. Available: http://papers.nips.cc/paper/5955-convolutional-lstm-network-a-machine-learning-approach-for-precipitation-nowcasting.pdf
-  A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” CoRR, vol. abs/1601.06759, 2016. [Online]. Available: http://arxiv.org/abs/1601.06759
-  B. Yu, H. Yin, and Z. Zhu, “Spatio-temporal graph convolutional neural network: A deep learning framework for traffic forecasting,” CoRR, vol. abs/1709.04875, 2017. [Online]. Available: http://arxiv.org/abs/1709.04875
-  J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” arXiv preprint arXiv:1612.08242, 2016.
-  M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun, “Multinet: Real-time joint semantic reasoning for autonomous driving,” arXiv preprint arXiv:1612.07695, 2016.
-  C.-A. Brust, S. Sickert, M. Simon, E. Rodner, and J. Denzler, “Efficient convolutional patch networks for scene understanding,” in CVPR Scene Understanding Workshop, 2015.
-  Z. Chen and Z. Chen, “Rbnet: A deep neural network for unified road and road boundary detection,” in International Conference on Neural Information Processing. Springer, 2017, pp. 677–687.
-  Y. Lyu, L. Bai, and X. Huang, “Real-time road segmentation using lidar data processing on an FPGA,” CoRR, vol. abs/1711.02757, 2017. [Online]. Available: http://arxiv.org/abs/1711.02757
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
-  D. Levi, N. Garnett, E. Fetaya, and I. Herzlyia, “Stixelnet: A deep convolutional network for obstacle detection and road segmentation.” in BMVC, 2015, pp. 109–1.
-  D. P. Kingma and J. L. Ba, “Adam: a Method for Stochastic Optimization,” International Conference on Learning Representations 2015, pp. 1–15, 2015.
-  A. Laddha, M. K. Kocamaz, L. E. Navarro-Serment, and M. Hebert, “Map-supervised road detection,” in Intelligent Vehicles Symposium (IV), 2016 IEEE. IEEE, 2016, pp. 118–123.