Controlling Steering Angle for Cooperative Self-driving Vehicles utilizing CNN and LSTM-based Deep Networks
A fundamental challenge in autonomous vehicles is adjusting the steering angle under different road conditions. Recent state-of-the-art solutions to this challenge rely on deep learning techniques, as they provide an end-to-end solution that predicts steering angles directly from raw input images with higher accuracy. Most of these works ignore the temporal dependencies between image frames. In this paper, we tackle the problem of utilizing multiple sets of images shared between two autonomous vehicles to improve the accuracy of steering angle control by considering the temporal dependencies between the image frames. This problem has not been widely studied in the literature. We present and study a new deep architecture that predicts the steering angle automatically by using Long Short-Term Memory (LSTM). Our deep architecture is an end-to-end network that utilizes CNN, LSTM and fully connected (FC) layers, and it uses both present and future images (shared by a vehicle ahead via Vehicle-to-Vehicle (V2V) communication) as input to control the steering angle. Our model demonstrates the lowest error when compared to the other existing approaches in the literature.
Controlling the steering angle is a fundamental problem for autonomous vehicles. Recent computer vision-based approaches to controlling the steering angle in autonomous cars mostly focus on improving driving accuracy with local data collected from the sensors on the same vehicle; as such, they consider each car as an isolated unit that gathers and processes information locally. However, as the availability and utilization of V2V communication increase, real-time data sharing among vehicles becomes more feasible. New algorithms and approaches are therefore needed that can exploit the potential of cooperative environments to improve the accuracy and effectiveness of self-driving systems.
In this paper, we present a deep learning-based approach that utilizes two sets of images, one coming from the onboard sensors (e.g., cameras) and one coming from another vehicle ahead over V2V communication, to control the steering angle in self-driving vehicles (see Fig. 1). Our proposed deep architecture contains a convolutional neural network (CNN) followed by an LSTM and an FC network. Unlike the traditional approach, which manually decomposes the autonomous driving problem into different components, the end-to-end model can steer the vehicle directly from the camera data and has been shown to operate more effectively in previous works. We compare our proposed deep architecture to multiple existing algorithms in the literature on the Udacity dataset. Our experimental results demonstrate that our proposed CNN-LSTM-based model yields state-of-the-art results. Our main contributions are: (1) we propose an end-to-end vehicle-assisted steering angle control system for cooperative systems; (2) we propose using a large sequence of images as opposed to using only two consecutive frames; (3) we introduce a new deep architecture that obtains state-of-the-art results on the Udacity dataset.
II Related Work
The problem of navigating a self-driving car using the perception acquired from sensory data has been studied in the literature both with and without end-to-end approaches. For example, some works use multiple components for recognizing objects of safe-driving concern, such as lanes, vehicles, traffic signs and pedestrians. The recognition results are then combined into a reliable world representation, which is used with an Artificial Intelligence (AI) system to make decisions and control the car.
Recent works focus on using end-to-end approaches. The Autonomous Land Vehicle in a Neural Network (ALVINN) system, introduced in 1989, was one of the earlier systems utilizing a multilayer perceptron (MLP). More recently, CNNs have been commonly used, as in the DAVE-2 project. One work proposed an end-to-end trainable C-LSTM network that places an LSTM network at the end of the CNN. A similar approach was taken by the authors of a 3D CNN model with residual connections and LSTM layers. Other researchers have implemented different variants of convolutional architectures for end-to-end models. Another widely used approach for controlling the vehicle steering angle in autonomous systems is sensor fusion, where combining image data with other sensor data such as LiDAR, RADAR and GPS improves the accuracy of autonomous operation. For instance, one fusion network uses both image features and LiDAR features based on VGGNet.
All of the works listed above focus on utilizing the image data obtained from the onboard sensors and do not consider assisting data that comes from another car. In this paper, we demonstrate that using additional data from the vehicle ahead helps us obtain better accuracy in controlling the steering angle. In our approach, we utilize the information that is available to a vehicle ahead of our car to control the steering angle. The rest of the paper is organized as follows: Section III explains the proposed approach, Section IV provides details about the experimental setup, and Section V presents the analysis and results. Finally, Section VI concludes the paper and discusses possible directions for future work.
III Proposed Approach
Controlling the steering angle directly from input images is a regression problem. For that purpose, we can use either a single image or a sequence of images. Considering multiple frames in a sequence can help in situations where the present image alone is affected by noise or contains little useful information, for example when the current image is largely overexposed by direct sunlight or when the vehicle reaches a dead end. In such situations, the correlation between the current frame and the past frames can be useful for deciding the next steering value. To utilize multiple images as a sequence, we use an LSTM. An LSTM has a recurrent structure that acts as a memory, through which a network can retain some past information and solve for a regression value based on the dependency between consecutive frames.
Our proposed idea in this paper relies on the fact that the condition of the road ahead has already been seen by another vehicle recently and we can utilize that information to control the steering angle of our car as discussed above. Fig. 1 illustrates our approach. In the figure, Vehicle 1 receives a set of images from Vehicle 2 over V2V communication and keeps the data on board. It combines the received data with the data obtained from the onboard camera and processes those two sets of images on board to control the steering angle via an end-to-end deep architecture. This method enables the vehicle to look ahead of its current position at any given time.
Our deep architecture is presented in Fig. 2. The network takes the set of images from both vehicles as input and at the last layer, it predicts the steering angle as the regression output. The details of our deep architecture are given in Table I. Since we construct this problem as a regression problem with a single unit at the end, we use the Mean Squared Error (MSE) loss function in our network during the training.
| Layer | Type | Size | Stride | Activation |
|---|---|---|---|---|
| 1 | Conv2D | 5*5, 24 Filters | (5,4) | ReLU |
| 2 | Conv2D | 5*5, 32 Filters | (3,2) | ReLU |
| 3 | Conv2D | 5*5, 48 Filters | (5,4) | ReLU |
| 4 | Conv2D | 5*5, 64 Filters | (1,1) | ReLU |
| 5 | Conv2D | 5*5, 128 Filters | (1,2) | ReLU |
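As a rough illustration of how the convolutional rows in the table above shrink the input, the following sketch computes the feature-map shape after each layer. It rests on several assumptions that the text does not state: that the parenthesized pair in each row is the stride in (height, width) order, that the input is 480 pixels high and 640 pixels wide, and that "same" padding is used (so the output size is the input size divided by the stride, rounded up). The names `CONV_LAYERS` and `conv_output_shapes` are hypothetical.

```python
import math

# Convolutional rows of the table: (number of filters, (stride_h, stride_w)).
# Reading the parenthesized pair as a (height, width) stride is an assumption.
CONV_LAYERS = [
    (24, (5, 4)),
    (32, (3, 2)),
    (48, (5, 4)),
    (64, (1, 1)),
    (128, (1, 2)),
]


def conv_output_shapes(height, width, layers):
    """Return the (height, width, channels) shape after each conv layer.

    Assumes "same" padding, where the output size is ceil(input / stride)
    and does not depend on the kernel size.
    """
    shapes = []
    for filters, (stride_h, stride_w) in layers:
        height = math.ceil(height / stride_h)
        width = math.ceil(width / stride_w)
        shapes.append((height, width, filters))
    return shapes


# Assumed input: one 480 x 640 frame (height x width).
shapes = conv_output_shapes(480, 640, CONV_LAYERS)
for layer_index, shape in enumerate(shapes, start=1):
    print(f"Conv layer {layer_index}: {shape}")
```

Under these assumptions each frame is reduced to a small feature map before it reaches the LSTM. With unpadded ("valid") convolutions the sizes would differ, so the numbers are only indicative.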
IV Experiment Setup
In this section we will elaborate further on the dataset as well as data preprocessing and evaluation metrics. We conclude the section with details of our implementation.
IV-A Dataset
To compare our results to existing work in the literature, we used the self-driving car dataset by Udacity. The dataset contains 100K images captured simultaneously by the center, left and right cameras of a vehicle in sunny and overcast weather; 33K of these images belong to the center camera. The dataset consists of 5 trips (from 5 different driving videos) with a total drive time of 1694 seconds (28.23 minutes). The test vehicle has 3 mounted cameras, and camera images are collected at a rate of around 20 Hz. Steering wheel angle, brake, acceleration and GPS data were also recorded in the experiments. The distribution of the steering wheel angles over the entire dataset is shown in Fig. 3; it covers a wide range of steering angles. The image size is 480*640*3 pixels and the total dataset size is 3.63 GB. Since no dataset with V2V communication images is currently available, we simulate the environment with the Udacity dataset by creating a virtual vehicle that moves ahead of the autonomous vehicle and shares its camera images.
The Udacity dataset has been widely used in the recent relevant literature, and we also use it in this paper to compare our results to existing techniques. Along with the steering angle, the dataset contains spatial (latitude, longitude, altitude) and dynamic (angle, torque, speed) information labelled with each image. The data format for each image is: index, timestamp, width, height, frame_id, filename, angle, torque, speed, latitude, longitude, altitude. For our purpose, we use only the sequence of center-camera images.
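A minimal sketch of selecting the center-camera sequence from records in the format listed above is given below. The sample rows are made up for illustration and are not taken from the dataset; the `center_camera` value of `frame_id` and the function name `load_center_frames` are assumptions.

```python
import csv
import io

# Made-up rows for illustration only; the header follows the field list
# given in the text, the values are not real dataset entries.
SAMPLE_CSV = """\
index,timestamp,width,height,frame_id,filename,angle,torque,speed,latitude,longitude,altitude
0,1000,640,480,left_camera,left/1000.jpg,0.01,0.1,10.0,37.0,-122.0,5.0
1,1000,640,480,center_camera,center/1000.jpg,0.01,0.1,10.0,37.0,-122.0,5.0
2,1050,640,480,center_camera,center/1050.jpg,-0.02,0.2,10.5,37.0,-122.0,5.0
"""


def load_center_frames(csv_text, center_id="center_camera"):
    """Return (filename, steering angle) pairs for the center camera,
    ordered by timestamp. The center_id value is an assumption."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for row in reader if row["frame_id"] == center_id]
    rows.sort(key=lambda row: int(row["timestamp"]))
    return [(row["filename"], float(row["angle"])) for row in rows]


center_frames = load_center_frames(SAMPLE_CSV)
print(center_frames)
```

Keeping the frames sorted by timestamp matters here because the later steps treat them as a temporal sequence.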
IV-B Data Preprocessing
The images in the dataset are recorded at a rate of around 20 frames per second, so there is usually a large overlap between consecutive frames. To avoid overfitting, we used image augmentation to obtain more variance in our image dataset. Our image augmentation technique randomly adjusts brightness and contrast to change pixel values. We also tested image cropping to exclude possibly redundant information that is not relevant to our application. However, in our tests the models performed better without cropping.
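The brightness and contrast augmentation described above could look roughly like the following sketch. The perturbation ranges, the mean-centered contrast formula and the function name `adjust_brightness_contrast` are assumptions chosen for illustration, since the text does not specify them; the image is represented as a plain nested list of intensities to keep the example self-contained.

```python
import random


def adjust_brightness_contrast(image, rng, max_brightness=30.0, max_contrast=0.2):
    """Randomly perturb brightness and contrast of a grayscale image.

    image: list of rows, each a list of pixel intensities in [0, 255].
    The ranges max_brightness and max_contrast are illustrative assumptions.
    """
    brightness = rng.uniform(-max_brightness, max_brightness)
    contrast = 1.0 + rng.uniform(-max_contrast, max_contrast)
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)

    def transform(p):
        # Scale the distance from the mean (contrast), then shift (brightness),
        # and clip back into the valid intensity range.
        value = (p - mean) * contrast + mean + brightness
        return min(255.0, max(0.0, value))

    return [[transform(p) for p in row] for row in image]


rng = random.Random(0)  # seeded so the example is reproducible
image = [[0, 64, 128], [128, 192, 255]]
augmented = adjust_brightness_contrast(image, rng)
print(augmented)
```

The steering label is unchanged by this kind of photometric augmentation, which is one reason it is convenient for this task.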
For the sequential model implementation, we preprocessed the data in a different way. Since we want to keep the visual sequential relevance in the series of frames while avoiding overfitting, we shuffle the dataset while keeping track of the sequential information. We then train our model on 80% of the images, taken from the same sequence within the subsets, and validate on the remaining 20%.
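One possible reading of shuffling while keeping the sequential information is to shuffle the starting indices of the frame windows rather than the individual frames, so that the frames inside each sample stay in temporal order. The sketch below illustrates that reading only; the function names, the window length and the seed are hypothetical, and the actual procedure may differ.

```python
import random


def split_sample_starts(num_frames, frames_per_sample, train_fraction=0.8, seed=0):
    """Shuffle sample start indices (not individual frames), then split them.

    Each start index identifies one window of consecutive frames.
    """
    starts = list(range(num_frames - frames_per_sample + 1))
    random.Random(seed).shuffle(starts)
    cut = int(len(starts) * train_fraction)
    return starts[:cut], starts[cut:]


def sample_frames(start, frames_per_sample):
    """Frame indices of one sample; order inside the window is preserved."""
    return list(range(start, start + frames_per_sample))


train_starts, val_starts = split_sample_starts(num_frames=107, frames_per_sample=8)
print(len(train_starts), len(val_starts))
print(sample_frames(train_starts[0], 8))
```

Note that neighbouring windows overlap in this sketch, so training and validation samples can share frames; a stricter variant would split by contiguous segment or by trip before forming windows.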
IV-C Vehicle-assisted Image Sharing
Modern wireless technology allows us to share data between vehicles at high bitrates of up to Gbits/s (e.g., in peer-to-peer and line-of-sight mmWave technologies [22, 23]). Such communication links can be utilized to share images between vehicles for improved control. In our experiments, we simulate that situation between two vehicles as follows: we assume that the two vehicles are separated by a fixed time gap, measured in seconds. We take a set of consecutive frames from the self-driving vehicle (Vehicle 1) at the current time step, together with a set of future frames, starting one time gap later, from the other vehicle (Vehicle 2). Thus, a single input sample for the model contains the frames from both sets.
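The way one input sample is assembled can be sketched with frame indices. This is only an illustration under stated assumptions: the parameter names `t`, `frames_per_vehicle` and `gap_frames` are hypothetical, the own-vehicle window is assumed to end at the current frame, the shared window is assumed to run forward from the gap, and both windows are drawn from the same recorded sequence because the vehicle ahead is simulated.

```python
def cooperative_sample_indices(t, frames_per_vehicle, gap_frames):
    """Return frame indices for one input sample.

    own:   consecutive frames of Vehicle 1 ending at the current step t
           (assumed to be the most recent frames).
    ahead: the same number of frames from the simulated Vehicle 2,
           starting gap_frames after t (assumed to run forward in time).
    """
    own = list(range(t - frames_per_vehicle + 1, t + 1))
    ahead = list(range(t + gap_frames, t + gap_frames + frames_per_vehicle))
    return own, ahead


# Illustrative values: 8 frames per vehicle and a 30-frame gap,
# i.e. about 1.5 seconds at roughly 20 frames per second.
own, ahead = cooperative_sample_indices(t=100, frames_per_vehicle=8, gap_frames=30)
print(own)    # frames seen by Vehicle 1
print(ahead)  # frames shared by the vehicle ahead
```

The two index lists together select the image stack that is fed to the network for the steering angle at step `t`.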
IV-D Evaluation Metrics
The steering angle is a continuous variable predicted for each time step over the sequential data. Mean absolute error (MAE) and root mean squared error (RMSE) are two of the most commonly used metrics in the literature for measuring the effectiveness of such control systems, and both appear in prior work on steering angle prediction. Both MAE and RMSE express the average model prediction error, and their values can range from 0 to infinity. Both are indifferent to the sign of the error, and lower values are better for both metrics.
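For reference, the two metrics follow their standard definitions and can be written in a few lines. The function names and the toy values below are illustrative only.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """MAE: average of the absolute prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """RMSE: square root of the average squared prediction error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Toy steering angles (not from the dataset).
truth = [0.0, 0.0, 0.0, 0.0]
pred = [1.0, -1.0, 3.0, -3.0]
mae = mean_absolute_error(truth, pred)
rmse = root_mean_squared_error(truth, pred)
print(mae, rmse)
```

Because the errors are squared before averaging, RMSE weighs large deviations more heavily than MAE and is never smaller than MAE on the same predictions.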
IV-E Baseline Networks
As baselines, we include multiple deep architectures that have been proposed in the literature to compare against our proposed algorithm. To the best of our knowledge, these models are the best reported approaches in the literature that use a camera only. In total, we chose 5 baseline end-to-end algorithms, which we name Models A, B, C, D and E in the rest of this paper. Model A is our implementation of a previously published model. Models B and C were proposed together in one prior work, and Models D and E are reproduced from another. An overview of these models is given in Fig. 4. Model A uses a CNN-based network, while Model B combines LSTM with a 3D CNN and uses 25 time-steps as input. Model C is based on the ResNet model, and Model D uses the difference image of two given time-steps as input to a CNN-based network. Finally, Model E uses the concatenation of two images coming from different time-steps as input to a CNN-based network.
IV-F Implementation and Hyperparameter Tuning
Our implementations use Keras with a TensorFlow backend. All training was done on two NVIDIA Tesla V100 16GB GPUs. On our system, training took 4 hours for one of the baseline models and between 9 and 12 hours for the deeper baseline networks and our proposed network.
We used the Adam optimizer in all our experiments. We tested a range of learning rates and used the best-performing value. We also studied the effect of the minibatch size on our network. Minibatch sizes of 128, 64 and 32 were tested, and 64 yielded the best results, so we used 64 in the experiments reported in this paper.
Fig. 5 shows how the value of the loss function changes with the number of epochs for both the training and validation data sets. The MSE loss decreases rapidly over the first few epochs and then stabilizes, remaining almost constant from around the 14th epoch.
V Analysis and Results
Table II lists the comparison of the RMSE values for multiple end-to-end models after training them on the Udacity dataset. In addition to the five baseline models listed in Section IV-E, we also include two models of ours: Model F and Model G. Model F is our proposed approach with 8 time-steps for each vehicle. Model G uses a different number of time-steps for each vehicle instead of the 8 used in Model F. Since the RMSE values on the Udacity dataset were not reported for Model D and Model E in their source, we re-implemented those models, computed their RMSE values on the Udacity dataset, and reported the results from our implementation in Table II.
Table III lists the MAE values computed for our implementations of the models A, D, E, F, and G. Models A, B, C, D, and E do not report their individual MAE values in their respective sources. While we re-implemented each of those models in Keras, our implementations of the models B and C yielded higher RMSE values than their reported values even after hyperparameter tuning. Consequently, we did not include the MAE results of our implementations for those two models in Table III. The MAE values for the models A, D and E are obtained after hyperparameter tuning.
We then study the effect of changing the number of frames used per vehicle on the performance of our model in terms of RMSE. We train our model separately with this number set to 1, 2, 4, 6, 8, 10, 12, 14 and 20, and compute the RMSE value for both the training and validation data at each setting. The results are plotted in Fig. 6. As shown in the figure, the lowest RMSE value for both training and validation data is obtained at the same setting, so choosing an appropriate number of input images is important for getting the best performance from the model. Next, we study how changing the gap between the two vehicles affects the performance of our end-to-end system in terms of RMSE during testing, once the algorithm has been trained at a fixed gap.
Changing the gap corresponds to varying the distance between the two vehicles. For that purpose, we first fixed the gap, expressed in frames, at a value corresponding to a 1.5-second gap between the vehicles and trained the algorithm accordingly. Once our model was trained and had learned the relation between the given input image stacks and the corresponding output value at that gap, we studied the robustness of the trained system as the distance between the two vehicles changes during testing. Fig. 7 shows how the RMSE value changes as we change the distance between the vehicles during testing. For that, we ran the trained model over the entire validation data, with the input formed at gap values varying between 0 and 95 frames in increments of 5 frames, and computed the RMSE value at each of those values.
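The relation between a gap in frames and a gap in seconds follows from the approximate 20 Hz frame rate of the dataset. The short sketch below lists the gap values swept during testing and converts them to seconds; the names `FRAME_RATE_HZ`, `frames_to_seconds` and `test_gaps` are hypothetical, and the frame rate is only approximate.

```python
FRAME_RATE_HZ = 20  # approximate camera rate reported for the dataset


def frames_to_seconds(gap_frames, frame_rate_hz=FRAME_RATE_HZ):
    """Convert a gap measured in frames to seconds."""
    return gap_frames / frame_rate_hz


# Gap values used at test time: 0 to 95 frames in steps of 5.
test_gaps = list(range(0, 100, 5))
gap_seconds = [frames_to_seconds(gap) for gap in test_gaps]

for gap, seconds in zip(test_gaps, gap_seconds):
    print(f"{gap:2d} frames ~ {seconds:.2f} s")
```

At this rate a 1.5-second gap corresponds to roughly 30 frames, and the sweep covers gaps from 0 to a little under 5 seconds.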
As shown in Fig. 7, we obtain the minimum RMSE value (0.0443) at the gap value that was also used for training. However, another (local) minimum (0.0444), almost equal to the value obtained at the training gap, is also obtained at a different gap value. Because of those two local minima, the change in error remains small inside the red area shown in the figure. However, the error does not increase evenly on both sides of the training value, as most of the RMSE values within the red area lie on the left side of the training value.
Next, we demonstrate the performance of multiple models over each frame of the entire Udacity dataset in Fig. 9. There are a total of 33808 images in the dataset. The ground truth for the figure is shown in Fig. 8, and the difference between the prediction and the ground truth is given in Fig. 9 for multiple algorithms. In each plot, the maximum and minimum error values made by each algorithm are highlighted with red lines. In Fig. 9, we only show the results obtained for Model A, Model D, Model E and Model F (ours), because no implementation of Model B and Model C is available and our implementations of those models (as they are described in the original paper) did not yield results good enough to be reported here. Our algorithm (Model F) demonstrated the best performance overall, with the lowest RMSE value. Comparing the red lines across the plots (i.e., the maximum and minimum error values) suggests that our algorithm has the smallest maximum error over the entire dataset.
VI Concluding Remarks
In this paper, we present a new approach by sharing images between cooperative self-driving vehicles to improve the control accuracy of steering angle. Our end-to-end approach uses a deep model using CNN, LSTM and FC layers and our proposed model using shared images yields the lowest RMSE value when compared to the other existing models in the literature.
Unlike previous works that only use local information obtained from a single vehicle, we propose a system where the vehicles communicate with each other and share data. In our experiments, we demonstrate that our proposed end-to-end model with data sharing in cooperative environments yields better performance than the previous approaches that rely on only the data obtained and used on the same vehicle. Our end-to-end model was able to learn and predict accurate steering angles without manual decomposition into road or lane marking detection.
One potentially strong argument against image sharing is that one could instead use the geo-spatial information along with the steering angle from the vehicle ahead and apply the same angle value at that position. We argue that relying on GPS makes the prediction dependent on location data which, like any other sensor data, can be faulty for various reasons, forcing algorithms to use the wrong image sequence as input. Image sharing over V2V communication helps the model become resistant to such location-based errors.
We believe that the leftward skew inside the red area in Fig. 7 is related to the car's speed. As the car goes faster (which can be considered as increasing the gap value), there is less relevant information in the data that comes from the vehicle ahead (Vehicle 2), potentially yielding higher RMSE values. Furthermore, the distance between consecutive frames also increases as the speed increases, which decreases the correlation between consecutive time frames. In future work, we will also focus on this aspect to analyze the exact reason.
More work and analysis are needed to improve the robustness of the proposed model. While this work relies on simulated data, we are in the process of collecting real data from actual cars communicating over V2V and will perform a more detailed analysis on that larger new dataset.
- Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to End Learning for Self-Driving Cars. 2016.
- Zhilu Chen and Xinming Huang. End-To-end learning for lane keeping of self-driving cars. In IEEE Intelligent Vehicles Symposium, Proceedings, 2017.
- Hesham M. Eraqi, Mohamed N. Moustafa, and Jens Honer. End-to-End Deep Learning for Steering Autonomous Vehicles Considering Temporal Dependencies. Oct. 2017.
- H. N. Mahjoub, B. Toghi, S M Osman Gani, and Y. P. Fallah. V2x system architecture utilizing hybrid gaussian process-based model structures. In 2019 Annual IEEE Systems Conference (SysCon), pages 1–6, 2019.
- H. N. Mahjoub, B. Toghi, and Y. P. Fallah. A driver behavior modeling structure based on non-parametric bayesian stochastic hybrid architecture. IEEE Vehicular Technology Conference (VTC), August 2018.
- H. N. Mahjoub, B. Toghi, and Y. P. Fallah. A stochastic hybrid framework for driver behavior modeling based on hierarchical dirichlet process. IEEE Connected and Automated Vehicles Symposium (CAVS), August 2018.
- Mohamed Aly. Real time detection of lane markers in urban streets. In IEEE Intelligent Vehicles Symposium, Proceedings, 2008.
- Jose M. Alvarez, Theo Gevers, Yann LeCun, and Antonio M. Lopez. Road scene segmentation from a single image. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2012.
- Huazhe Xu, Yang Gao, Fisher Yu, and Trevor Darrell. End-to-end learning of driving models from large-scale video datasets. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017.
- Nakul Agarwal, Abhishek Sharma, and Jieh Ren Chang. Real-time traffic light signal recognition system for a self-driving car. In Advances in Intelligent Systems and Computing, 2018.
- Bok Suk Shin, Xiaozheng Mou, Wei Mou, and Han Wang. Vision-based navigation of an unmanned surface vehicle with object detection and tracking abilities. Machine Vision and Applications, 2018.
- Dean A. Pomerleau. ALVINN: An Autonomous Land Vehicle in a Neural Network. Advances in Neural Information Processing Systems, 1989.
- Shuyang Du, Haoli Guo, and Andrew Simpson. Self-Driving Car Steering Angle Prediction Based on Image Recognition. Technical report, 2017.
- Alexandru Gurghian, Tejaswi Koduri, Smita V. Bailur, Kyle J. Carey, and Vidya N. Murali. DeepLanes: End-To-End Lane Position Estimation Using Deep Neural Networks. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2016.
- Johann Dirdal. End-to-end learning and sensor fusion with deep convolutional networks for steering an off-road unmanned ground vehicle. PhD thesis, 2018.
- Hao Yu, Shu Yang, Weihao Gu, and Shaoyu Zhang. Baidu driving dataset and end-To-end reactive control model. In IEEE Intelligent Vehicles Symposium, Proceedings, 2017.
- Hyunggi Cho, Young Woo Seo, B. V. K. Vijaya Kumar, and Ragunathan Raj Rajkumar. A multi-sensor fusion system for moving object detection and tracking in urban driving environments. In Proceedings - IEEE International Conference on Robotics and Automation, 2014.
- Daniel Gohring, Miao Wang, Michael Schnurmacher, and Tinosch Ganjineh. Radar/Lidar sensor fusion for car-following on highways. In ICARA 2011 - Proceedings of the 5th International Conference on Automation, Robotics and Applications, 2011.
- Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 2000.
- Klaus Greff, Rupesh K. Srivastava, Jan Koutnik, Bas R. Steunebrink, and Jurgen Schmidhuber. LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems, 2017.
- Dhruv Choudhary and Gaurav Bansal. Convolutional Architectures for Self-Driving Cars. Technical report, 2017.
- B. Toghi, M. Saifuddin, H. N. Mahjoub, M. O. Mughal, Y. P. Fallah, J. Rao, and S. Das. Multiple access in cellular v2x: Performance analysis in highly congested vehicular networks. In 2018 IEEE Vehicular Networking Conference (VNC), pages 1–8, Dec 2018.
- Behrad Toghi, Md Saifuddin, Yaser P. Fallah, and M. O. Mughal. Analysis of Distributed Congestion Control in Cellular Vehicle-to-everything Networks. arXiv e-prints, page arXiv:1904.00071, Mar 2019.
- Mhafuzul Islam, Mahsrur Chowdhury, Hongda Li, and Hongxin Hu. Vision-based Navigation of Autonomous Vehicle in Roadway Environments with Unexpected Hazards. arXiv preprint arXiv:1810.03967, 2018.
- Songtao Wu, Shenghua Zhong, and Yan Liu. ResNet. CVPR, 2015.
- Lauren A. Hannah. Stochastic Optimization. In International Encyclopedia of the Social & Behavioral Sciences: Second Edition. 2015.