A Deep Learning Approach for Robust Corridor Following
In an autonomous corridor following task where the environment changes continuously, several forms of environmental noise prevent an automated feature extraction procedure from performing reliably. Moreover, when the pre-defined features are absent from the captured data, no well-defined control signal for the servoing task can be produced. To overcome these drawbacks, we propose using a convolutional neural network (CNN) to directly estimate the required control signal from an image, combining feature extraction and control law computation into a single end-to-end framework. In particular, we study the task of autonomous corridor following using a CNN and show clear advantages in cases where a traditional method for the same task fails to give a reliable outcome. We evaluate our method on a wheelchair platform developed at our institute for this purpose.
The task of autonomous corridor following has been well discussed in the past [rev1, rev2, rev3, rev4, uavservo], especially on smart wheelchair platforms. Several classical works achieving this [following1, following2, servoing2, servoing2_expanded, KAK_navigation, Omnidirectional] use vision based algorithms. They extract selected features from a captured image, and pass them to a control law that computes a corrective velocity signal for adjusting the position of the robot on the corridor. There also exist other approaches that use different sensors [lidar, sonar, radar], and follow a similar procedure. In all of these works, there is an inherent reliance on a robust feature extraction and tracking step to provide reliable features to the control law. As such, the behaviour of the robot becomes undefined when the system fails to provide these features reliably.
In traditional visual servoing (TVS) approaches, when selected features do not appear in the captured image, or when the extracted features are grossly inaccurate, the control law fails to produce a reliable velocity for servoing the robot along the corridor. We can infer from this that for a TVS process to take place reliably, three major factors need to be accounted for to a good degree of accuracy.
Quality image features need to be selected for servoing.
They need to be available in the environment.
A robust algorithm is needed for tracking and extracting these features.
The works presented in [following1, following2, servoing2_expanded] are examples of TVS approaches. In [following1], autonomous corridor following is performed on a wheelchair using a TVS approach that derives a control signal from vanishing point features in a corridor image. The work in [following2] extends this to doorway traversal and presents it comprehensively. A similar approach for mobile robots is proposed in [servoing2] and extended further in [servoing2_expanded]. However, the feature extraction steps suggested in these works do not account for images taken in dynamically changing environments, or for the various types of noise in captured images. This hinders the practical capability of the robot, as motion noise and occlusions are common in the environment. Moreover, when the required features cannot be estimated from the image, the outcome of the control law is undefined, as it may encounter mathematical singularities. To overcome these challenges, we propose using a convolutional neural network to approximate a velocity output for corridor following directly from a camera image. Figure 1 provides an overview of our proposal as a robust alternative to traditional visual servoing.
Although a few works in the literature use deep neural networks for visual servoing, such as [Madhav_paper, Deep_Learning_Chaumette], our proposal (see also [siupaper]) differs from them: we combine feature extraction and control signal computation in one stage, while they primarily focus on approximating the feature extraction stage. In [Madhav_paper], the authors fine-tune FlowNet [Flownet1] to estimate the relative angular and translational pose differences between the desired image and the current image. Similarly, in [Deep_Learning_Chaumette], AlexNet is used to approximate a relative pose between the current image and a reference image. In both papers, the approximated relative pose is fed into a control law that computes a velocity vector for servoing. The approach that we present in this work combines the feature extraction and control law computation stages into one framework that directly predicts a velocity vector from an image.
In addition, [QLearning] describes a Q-learning based approach to visual servoing for a target following task using a drone in a simulated environment. The authors demonstrate the efficacy of deep features for robust servoing in noisy and occluded environments, which further supports our use of deep learning for this visual servoing task.
Our paper makes the following two contributions: i) we introduce a novel CNN based approach for performing an image based visual servoing task of autonomous corridor navigation, and ii) we present a robust comparison showcasing the advantage of our CNN approach against some critical drawbacks of the traditional approach described in [following1]. We carry out a rigorous analytical and practical analysis here to make our case for supporting this claim.
The paper is organized as follows. Section II describes the fundamental TVS concepts used for autonomous corridor following, and provides details of our CNN-based approach. In Section III, we describe the methods we use to analyze the robustness of our CNN approach in cases where TVS-based approaches fail to perform well. In Section IV, we present the results of our experiments, both statistically and practically, on a wheelchair platform developed at our institute. Finally, we present the advantages of our method by evaluating it on failure cases of the traditional approach.
II CNN-Based Autonomous Corridor Following
The basic modelling and control concepts of using TVS in a corridor following task are presented in this section. We then discuss our CNN-based design along with training and data preparation issues.
II-A TVS Modelling and Velocity Estimation
The wheelchair is assumed to be a four wheeled robot with two passive castor wheels in front and two actuated wheels in the rear. It thus behaves as a non-holonomic system constrained by two degrees of freedom. To servo this system along the corridor, a translational velocity $v$ and an angular velocity $\omega$ that describe its motion need to be computed. As our purpose for autonomous corridor following is to assist disabled wheelchair users, a constant and slow forward velocity $v$, along with an $\omega$ for adjusting the position of the chair in the corridor, is sufficient for completing the task.
In [following1], the authors describe a traditional approach for autonomous corridor following that makes use of vanishing point and vanishing line features to perform this task. They use an automated feature extraction mechanism that adds constraints on the lines detected by the LSD algorithm [lsd1, lsd2] in a captured image to obtain the two features selected for servoing: the coordinate of the vanishing point, and the angle that the vanishing line makes with the corridor plane (see Figure 2).
In our approach, we use a human annotator to mark the vanishing point and vanishing line features on an image. This ensures that these features are reliable, as the automated feature extraction step often fails to extract accurate features. This occurs mainly due to environmental noise in the captured image or sub-optimal feature extraction parameters, which are difficult to tune.
For autonomous corridor following, the desired motion is achieved when the wheelchair is moving straight and is positioned at the center of the hallway. This occurs when the vanishing point lies at the center of the captured frame, i.e. the origin, and the vanishing line is perpendicular to the corridor plane in the image. The corresponding feature values of (0,0) are consequently chosen as the desired feature values.
In [control], a control law for servoing an image based path following system such as ours is formulated as:
The $\omega$ value here represents the ground truth angular velocity that we require to train our network. The two Jacobians used in this law are defined in [following1] as follows:
Here and are the selected features, , , and . The , and values represent the position of the camera on the wheelchair, which in our case is set to m, m, and m. and are gain and translational velocity constants that are tuned to and for our task. The error is defined as the difference between the extracted features and the desired feature values.
The vanishing point coordinates are measured in meters and the slope of the vanishing line is taken in radians.
The angular velocity $\omega$ obtained here is the ground truth value used for training our convolutional neural network model.
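As a reading aid, a law of this kind can be motivated with a short derivation. This is only a sketch under stated assumptions, not a reproduction of the equation in [control]: the symbols $s$ (feature vector), $s^{*}$ (desired features), $J_v$, $J_\omega$ (the two Jacobians), $\lambda$ (gain) and the requirement of an exponentially decaying error are assumed notation introduced here for illustration.

```latex
\begin{aligned}
e &= s - s^{*}
  && \text{feature error; here } s^{*} = (0,0) \\
\dot{e} &= J_v\, v + J_\omega\, \omega
  && \text{error dynamics, linear in the two velocities} \\
\dot{e} &= -\lambda\, e, \quad \lambda > 0
  && \text{desired exponential decay of the error} \\
\Rightarrow \quad \omega &= -J_\omega^{+} \left( \lambda\, e + J_v\, v \right)
  && J_\omega^{+} \text{ is the Moore--Penrose pseudo-inverse}
\end{aligned}
```

Under these assumptions, $\omega$ becomes ill-defined whenever the features cannot be measured or $J_\omega$ degenerates, which is consistent with the failure cases discussed in this paper.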
II-B The Training Dataset
For training our network, we gather suitable corridor images from various open access sources [dataset1, dataset2, dataset3, dataset4, dataset5]. We also create our own dataset of corridor images from our institute. The accumulated set consists of 3563 images in total. A subset of 403 images from this set has been excluded from training because their ground truths could not be estimated: the vanishing point feature in these cases lies outside the frame of the image and cannot be extracted reliably. These images are deemed unreliable, and Figure 3(c) shows examples of such cases. The remaining 3160 samples compose a clean set of images.
We add 4 different types of artificial noise (Mild and Strong Gaussian Blur, Motion Blur and JPEG Compression) randomly to the entire clean set and obtain a separate noisy set of images. The final dataset is a combination of the clean and noisy sets and contains 6320 images. The ground truth values for samples in the noisy set are identical to those of their counterparts in the clean set. Figures 3(a) and 3(b) show examples of clean images and their noisy counterparts.
Adding noisy images to our dataset serves the dual purpose of increasing the data used for training our model and helping the neural network generalize better to noisy data. The train-test split on the final dataset is 90-10%, and 10% of the resulting train set is used for validation. As the test set is randomly sampled from the final dataset, it contains a mixture of clean and noisy images.
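The construction of the noisy set can be sketched as follows. This is an illustrative NumPy sketch, not the authors' pipeline: the function names, the sigma values for "mild" and "strong" blur, and the motion blur length are assumptions, and JPEG compression is left out because it needs an image codec. Images are assumed to be 2-D grayscale arrays.

```python
import numpy as np

def _convolve_axis(image, kernel, axis):
    """Convolve a 2-D image with an odd-length 1-D kernel along one axis,
    replicating edge pixels so the output keeps the input shape."""
    pad = len(kernel) // 2
    pad_width = [(0, 0), (0, 0)]
    pad_width[axis] = (pad, pad)
    padded = np.pad(image, pad_width, mode="edge")
    return np.apply_along_axis(
        lambda line: np.convolve(line, kernel, mode="valid"), axis, padded
    )

def gaussian_blur(image, sigma):
    """Separable Gaussian blur; sigma is in pixels."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    return _convolve_axis(_convolve_axis(image, kernel, 0), kernel, 1)

def motion_blur(image, length):
    """Horizontal motion blur with a uniform kernel of odd length."""
    if length % 2 == 0:
        raise ValueError("length must be odd")
    kernel = np.full(length, 1.0 / length)
    return _convolve_axis(image, kernel, 1)

def make_noisy_copy(image, rng):
    """Apply one randomly chosen noise type (parameter values are assumed)."""
    kind = int(rng.integers(3))
    if kind == 0:
        return gaussian_blur(image, sigma=1.0)   # "mild" blur, assumed sigma
    if kind == 1:
        return gaussian_blur(image, sigma=3.0)   # "strong" blur, assumed sigma
    return motion_blur(image, length=9)          # assumed kernel length
```

Each noisy copy keeps the label of the clean image it was derived from, so the noisy set doubles the data without any additional annotation.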
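The split described above can be sketched in a few lines. The function name, the rounding rule and the seed are assumptions made for illustration; only the 90-10% split and the 10% validation fraction come from the text.

```python
import random

def split_dataset(samples, test_frac=0.10, val_frac=0.10, seed=0):
    """Shuffle, hold out a test set, then hold out a validation set
    from the remaining training samples."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_frac)
    test, train_val = items[:n_test], items[n_test:]
    n_val = round(len(train_val) * val_frac)
    val, train = train_val[:n_val], train_val[n_val:]
    return train, val, test

# With 6320 samples this yields 5119 train, 569 validation and 632 test items.
train, val, test = split_dataset(range(6320))
```

Because the shuffle runs over the combined clean and noisy sets, each of the three subsets contains a mixture of both kinds of image.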
II-C Designing and Training our CNN
Figure 4 provides an overview of our visual servoing approach. We employ a technique called transfer learning [Transfer1, Transfer2] for training our model. In the most literal sense, transfer learning refers to "transferring" knowledge obtained from training one model to another model that performs a task of a different nature. This is especially useful in cases such as ours, where the dataset is small and there is insufficient training material for a neural network to converge.
We exploit this technique and fine-tune a ResNet-18 architecture [resnet] pre-trained on ImageNet for our task. This setup was chosen considering ResNet-18’s exceptional performance on ImageNet despite having a comparatively small model size. The model was pre-trained on ImageNet with an input size of x, and an output size of 1000 classes. All images in our dataset have accordingly been re-sized to x to make sense of the pre-trained weights. We replace the final layer of this pre-trained model with a 1-dimensional output that represents the required angular velocity for our servoing task. We perform regression using this setup.
A Mean Squared Error (MSE) loss function determines the gradients for backpropagation at each iteration. In our case, this can be written as
$$L = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{\omega}_i - \omega_i \right)^2$$
Here, $N$ is the batch size during training, which has been set to 8, $\hat{\omega}_i$ is the angular velocity predicted for sample $i$ after a forward pass through the network, and $\omega_i$ is the target angular velocity for that sample.
We train the network on an Nvidia 1080 Ti having 12GB of GPU memory and 64GB of RAM. It takes around 30 minutes for running 40 train-validation epochs. We employ a Stochastic Gradient Descent scheme with a weight decay of 0.005 and momentum of 0.9. The learning rate is set to 0.005. 10-fold cross validation was performed to understand the variance in training data and accordingly tune these network hyperparameters.
II-D Network Evaluation Metric
The $R^2$ value, or coefficient of determination, has been used as an evaluation metric for assessing the performance of the neural network on the regression task. It can be defined in our case as:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
Here, $y_i$ represents the true value, i.e. the target distribution, $\hat{y}_i$ represents the predicted distribution, and $\bar{y}$ is the mean of the target distribution. $n$ here represents the number of samples taken from the distribution, which in our case is the number of images in the test set.
The $R^2$ value ranges from $-\infty$ to 1. A positive value closer to 0 indicates that the model is unable to explain the variability of the data, while a value closer to 1 shows that the output corresponds well with the target distribution.
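The metric can be computed directly from its definition. The sketch below uses plain Python; the function name and the toy values are illustrative only.

```python
def r2_score(targets, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    n = len(targets)
    mean_target = sum(targets) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean_target) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot

targets = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(targets, targets)                  # exact predictions
baseline = r2_score(targets, [2.5, 2.5, 2.5, 2.5])    # always predict the mean
reversed_order = r2_score(targets, [4.0, 3.0, 2.0, 1.0])
```

A model that always predicts the mean of the targets scores 0, and a model that does worse than that scores below 0, which is why the metric has no lower bound.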
III Robustness Analysis of CNN-Based Corridor Following Scheme
In this section, we discuss two methods for evaluating the performance of our CNN approach in cases where TVS-based approaches, such as the one based on vanishing features, fail to perform well.
III-A Comparing Deep and Vanishing Features
When the vanishing feature method fails to produce a good angular velocity $\omega$, it is often due to feature extraction going wrong. Our CNN approach is independent of this explicit feature extraction step and gives an approximation for $\omega$ even in these failure cases. To better understand the outcome of our CNN approach, we estimate a deep feature from the $\omega$ obtained by the CNN, and compare it against the vanishing point feature obtained by the vanishing point approach, using the ground truth as the reference.
Recall the control law from [control] presented in section II.
This equation can be reduced to the following:
Now, as we have only one equation with three variables, given $\omega$ we cannot find a closed-form solution for all three. We can, however, limit the range of these values for our specific task to obtain a solution for a deep feature representing the vanishing point coordinate from $\omega$. Note that we do this only to illustrate the performance of our network against the TVS approach.
As the and values are represented in metres in the image plane, we can safely assume that their absolute values would be less than . Thus, the absolute value of i.e., also becomes less than . Using this observation, we can neglect the last term in equation 8 as it is comparatively smaller to the other terms that contain , a large constant.
From [following1], we can also conclude that . However, in our experiments, we have observed that for most images in the dataset, . Due to this reason, we chose to neglect the third term containing in equation 8 as it does not have a significant effect on the value in our case. This can also be experimentally observed by computing while changing the value. We can then solve the following equation for :
As this is a third degree equation, with the discriminant , its real root can be expressed in terms of as:
The resulting value is a deep feature representing the vanishing point coordinate, obtained from the $\omega$ predicted by our CNN approach. This deep feature, along with the traditional feature obtained from a TVS-based automated feature extraction mechanism, is compared with the ground truth. Figure 5 shows a flowchart of our approach.
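For readers unfamiliar with closed-form cubic roots, the sketch below shows Cardano's formula for a depressed cubic $t^3 + pt + q = 0$ that has exactly one real root. It is a generic illustration: the coefficients `p` and `q` are placeholders, not the coefficients of the equation derived in this section, and the function assumes the cubic has already been brought into depressed form.

```python
import math

def _cbrt(x):
    """Real cube root that also handles negative inputs."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)

def real_root_depressed_cubic(p, q):
    """Real root of t^3 + p*t + q = 0 in the one-real-root case.

    That case holds when (q/2)^2 + (p/3)^3 > 0, which is equivalent to the
    usual cubic discriminant -(4p^3 + 27q^2) being negative.
    """
    disc = (q / 2.0) ** 2 + (p / 3.0) ** 3
    if disc <= 0:
        raise ValueError("cubic does not have exactly one real root")
    sqrt_disc = math.sqrt(disc)
    return _cbrt(-q / 2.0 + sqrt_disc) + _cbrt(-q / 2.0 - sqrt_disc)

# Placeholder example: t^3 + t - 2 = 0 has the single real root t = 1.
root = real_root_depressed_cubic(1.0, -2.0)
```

In the setting of this section, the coefficients would depend on the predicted $\omega$ and on the fixed camera and gain constants, so one root is obtained per image.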
III-B Verifying Approximations for Unreliable Images
While training the model, we accounted for cases where the captured image is noisy by adding noisy samples to the training data. However, the unreliable images (Figure 3(c)) were excluded from training because their ground truths could not be estimated using the traditional method. Our trained CNN model can nevertheless predict an approximation for $\omega$ for these images.
Just by looking at these unreliable images, a human can tell whether the wheelchair should turn left or right to initiate the corridor following task. Also, once a prediction is made, we know the direction of motion from the sign of the predicted $\omega$. We use these two pieces of information to partially verify the accuracy of the predicted outcome on unreliable images using human annotation.
For each unreliable image in the dataset, the following steps are taken:
Pass the image through the trained network and obtain an approximation for $\omega$. Classify it as left or right based on the sign of the predicted value.
Show the same image to a human annotator equipped with a binary output console corresponding to the left or right direction.
Compare the human annotated output with the network output and update two score values, the accuracy score and the false positive score.
Each annotator is shown the entire dataset 3 times, and an average of the scores obtained is chosen for evaluation.
The accuracy score tells us how well the network predicts the correct direction of motion from the image. It is the percentage of unreliable images whose predicted outcome is in the correct direction of motion:
$$\text{Accuracy} = \frac{N_c}{N_u} \times 100$$
Here, $N_c$ is the number of correctly predicted samples and $N_u$ is the total number of unreliable images.
The false positive score quantifies the severity of the network's bad performance on unreliable images. It is the average of the absolute $\omega$ value on images where the wrong direction has been predicted:
$$\text{FP} = \frac{1}{N_{fp}} \sum_{i=1}^{N_{fp}} \left| \omega_i \right|$$
Here, $N_{fp}$ is the total number of false positive samples and $\omega_i$ is the angular velocity predicted for the $i$-th such sample.
For autonomous corridor following in unreliable cases, a high accuracy score and a low false positive score are desired.
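The two scores can be sketched as below. The function names and the toy data are illustrative, and the sign convention (positive $\omega$ means an anti-clockwise, i.e. left, turn) is an assumption that the text does not state explicitly.

```python
def direction(omega):
    """Map a predicted angular velocity to a turn direction.
    Assumed convention: positive omega is anti-clockwise, i.e. a left turn."""
    return "left" if omega > 0 else "right"

def score_unreliable(predicted_omegas, human_labels):
    """Return (accuracy in percent, false positive score).

    The false positive score is the mean absolute omega over the samples
    whose predicted direction disagrees with the human label.
    """
    wrong = [
        abs(omega)
        for omega, label in zip(predicted_omegas, human_labels)
        if direction(omega) != label
    ]
    total = len(predicted_omegas)
    accuracy = 100.0 * (total - len(wrong)) / total
    fp_score = sum(wrong) / len(wrong) if wrong else 0.0
    return accuracy, fp_score

# Toy example: the last two predictions disagree with the annotator.
accuracy, fp_score = score_unreliable(
    [0.2, -0.1, 0.05, -0.3], ["left", "right", "right", "left"]
)
```

A low false positive score means that wrong-direction predictions come with small velocities, so a mistaken turn stays mild and can be corrected on subsequent frames.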
IV Experiments and Results
IV-A Evaluation of Neural Network Performance
We evaluate our trained model on 4 noisy test sets in addition to the original test set, using the $R^2$ score described in Section II-D. Each noisy set comprises images with one specific type of artificial noise. Table II shows the $R^2$ value, as a percentage, for each test set. The similar scores across all test sets show that the CNN performs as well on the noisy sets as on the original, non-noisy set, which indicates robustness to noise.
|Test Set Type||Value(%)|
On the unreliable test set, following the human verification method described in the earlier section (III-B), we get an accuracy score of on predicting the right direction of motion on 403 unreliable images. We get a false positive score of on the 88 images that were predicted wrong. This translates to of the highest value obtained in our tests which shows that even when the wrong direction is predicted, the magnitude of the velocity vector output remains relatively small. Although this is specific to our case, it is a significant result as the CNN performs well on these images where the traditional vanishing feature approach would fail entirely.
IV-B Practical Implementation and Results
We practically evaluate our method on an Intelligent Wheelchair Platform developed at IIIT, Hyderabad (https://youtu.be/aRGVXq8cqDs). A Kinect v2 has been retrofitted onto this platform as a sensor for capturing images. All processing is done on board, on a laptop with an Nvidia 1050 Ti with 4GB GPU memory and 8GB RAM. A Sabertooth motor controller attached to the wheelchair takes serial commands from the laptop and translates them into actuation signals that control its motion. The entire epoch time from capturing an image to actuation takes seconds or Hz on this setup. Although slow for many real time systems, this control frequency is sufficient for our task, as our final application is an assistive wheelchair for the disabled. Our translational velocity is set to m/s, ensuring that the wheelchair is slow enough for sufficient coherency between subsequent frames.
We conduct autonomous corridor following experiments on different corridors across our institute, including environments not seen in the training dataset. In each experiment, the wheelchair is made to start at an arbitrary position at the beginning of the corridor making an arbitrary angle between with the wall. The corridor following task is then carried out using the proposed CNN method, and the captured images are stored along with their corresponding CNN $\omega$ values. We then use the traditional vanishing feature approach to estimate an $\omega$ on these stored images, and also obtain a ground truth $\omega$ using human annotation (see Section II-A).
Table I shows samples of image sequences captured during the experiment and their corresponding $\omega$ values for the CNN, the vanishing feature approach and the ground truth. There is a high correspondence between the values of the CNN and the vanishing feature approach in sequence 1. In sequence 2, however, when the wheelchair starts at a sharper angle with the corridor wall, an 'unreliable image' is captured, due to which the traditional $\omega$ cannot be computed. A ground truth does not exist here either, as a human annotator cannot accurately mark features outside the image frame. The CNN approach here predicts a velocity in the anti-clockwise direction, which enables the wheelchair to initiate and follow through the servoing task.
Sequence 3 is taken from an environment outside the training dataset. Observe the erratic values that the traditional approach estimates due to bad feature extraction. This is primarily because we do not re-tune the parameters of the traditional approach for extracting features from this environment. The CNN, on the other hand, first predicts a small $\omega$ value for servoing on the unreliable image captured at the beginning. Once the corridor is fully visible, it predicts a better approximation, closer to the ground truth value, and successfully completes the servoing task.
IV-C Advantages of the CNN Approach
Robustness to Environmental Noise: As our CNN is trained offline on several images of various corridor environments (both noisy and non-noisy) it works well on different environments including ones that are dynamically changing.
The normal image sequence in Figure 6 illustrates this with an example. In the second image, when a person enters the frame, the traditional approach fails to compute a correct $\omega$, as its feature extraction step, which depends on a line detector, fails. The vanishing point feature obtained is thus not representative of the actual vanishing point. Our CNN approach, on the other hand, predicts a velocity in the correct direction, which is backed up by a good deep feature extracted from the image.
Approximations for Unreliable Images: As mentioned earlier in Section II, the traditional servoing method fails on unreliable images. This is because the feature extraction step fails when the required vanishing point feature does not lie within the image frame. Here, even if the vanishing point is extracted as an extended coordinate, its value cannot be verified. A very large feature value can cause the control law parameters to “explode”, leading to the calculation of an unstable $\omega$. This holds true especially in cases where the extracted tends to . The control law reaches a mathematical singularity here.
Observe the unreliable image sequence in Figure 6. Here, in the first two images, the traditional method fails because a well-defined vanishing point does not exist. Our CNN, however, estimates a deep feature that enables motion in the correct direction, which makes corridor following feasible.
V Conclusion and Future Work
We have shown that our approach has an advantage where a velocity outcome is predicted regardless of the input image. This can also be a disadvantage in some cases, where the wheelchair needs to stop in order to complete the task. To overcome this, future work may include training the CNN with ‘end of the corridor’ and ‘object of interest’ cases where the wheelchair would have to stop and reconsider its position before moving.
There is also a disadvantage in terms of the time taken for gathering a dataset for the purpose of corridor following, which is not required in traditional schemes. We plan to release the dataset of corridor images along with their human annotated ground truths to alleviate this issue for other researchers.
In conclusion, we have presented an end to end CNN based approach for autonomous corridor navigation on a wheelchair. Our network is trained to predict a velocity signal for the servoing task from a captured image. In doing this, our method overcomes some key limitations of a traditional visual servoing based approach. We demonstrate these by performing a statistical and experimental validation of our approach against the traditional approach.