A scene perception system for visually impaired based on object detection and classification using multi-modal DCNN



This paper presents a cost-effective scene perception system aimed at visually impaired individuals. We use an Odroid system integrated with a USB camera and a USB laser that can be attached to the chest. The system classifies the detected objects, estimates their distances from the user, and provides a voice output. The experimental results provided in this paper use outdoor traffic scenes. The object detection and classification framework exploits a multi-modal fusion based Faster R-CNN using motion, sharpening and blurring filters for efficient feature representation.

Deep Learning, CNN, Scene perception, Visually Impaired, Object detection, Multi-modal fusion

1 Introduction

Navigation is an important issue for blind people. They most commonly use white canes for obstacle detection while memorizing the locations they become familiar with. In a new or unfamiliar environment they depend entirely on passers-by to ask the way to a certain area. In a world of sophisticated technology and varied sensors, there should be a system built on even the most basic innovation to make their life a bit easier. Such a system should complement the white cane: it should alert the user to obstacles a few meters away and give directions to a particular area. The navigational aid should also be delivered through their other, now stronger, senses such as hearing, touch and smell. Traditional navigation aids such as white canes are very limited and cannot provide complete scene perception. For example, a white cane only provides information about the presence or absence of an obstacle; no information about the kind of obstacle is available. In many cases it is important to know the obstacle type, specifically whether it is a door that needs to be opened or a stair that needs to be climbed. Information about moving objects and their direction of motion is also important. The use of computing technology for navigational aids is relatively recent; for example, a smartphone based navigation system (ARIANNA) for both indoor and outdoor environments is available croce2014arianna (). Work contributed by different researchers in this area can be discussed in terms of the sensors used for input, the output representation type, and the hardware used for either or both. Commonly the scene is perceived using ultrasonic sensors kay2000auditory () or by extracting images/videos using vision sensors bach2003sensory (). The latter is discussed in more detail in Section 2.
The output of the above systems (listed in Table 1) can be in the form of a tactile image bach2003sensory (); a tongue display driven by voltage pulses kaczmarek2011tongue (); sound patterns or musical auditory information auvray2007learning () abboud2014eyemusic (); etc. Common wearable aids for the blind include:

  1. Wearable tactile harness-vest display to give instructions about directional navigation using six vibrating motorsjones2006tactile ().

  2. A belt connected to a computer and ultrasonic sensors gives acoustic feedback in two modes: in guidance mode, the system knows the target and guides the user with tactile signals; in image mode, the environment is presented to the user as a tactile image. It translates the visual scene into tactile or acoustic information to permit safe and fast walking shoval1993navbelt ().

  3. Helmet mounted with ultrasonic chirps and speakers. It amplifies echoes produced by ultrasonic sounds for locating objects in spacesohl2015device ().

  4. Ultrasonic smart glasses use ultrasonic waves to detect obstacles agarwal2017low (). Many more such devices have been developed as blind aids.

Some of these are shown in Figure 1. A graphical representation showing the distribution of type of work done for blind during various time spans is given in Figure 2. It is observed that although scene perception via TDU, tactile images, sound patterns are explored and used since , the last decade particularly witnesses the marketization of a lot of wearable devices.
Alongside the development of different blind-aid systems, there is also thorough investigation into the suitability and acceptability of these systems. For example, it is argued gori2016devices () that these devices face many problems: (i) They are invasive: they cover the ears, block the tongue, or occupy the hands, so visually impaired people can never feel free with them. The user has to hold a device that is too large and heavy. Moreover, these devices are not easy for children to handle. (ii) The user can experience cognitive load, as these devices may require a lot of attention, which can cause distraction from the primary task. (iii) These devices require a lot of training, which is difficult especially for children. (iv) These devices may perform unsatisfactorily relative to the overhead of using them. (v) The cost of the devices is too high for a common person to afford. (vi) Lastly, many of these devices are still at the prototype stage or have been tested at the pre-clinical level only, and are not available for the desired task. In this paper, we focus on scene perception via image processing algorithms. We integrate a low-cost, lightweight, simple, easily wearable system that helps detect and classify objects in the user's path along with their distances. In this context, Section 2 reviews work focused on object and obstacle detection and on navigation assistance using images. Section 3 discusses the proposed system in detail. Given the wide application and strong performance of CNNs for object detection and classification, we review different CNN architectures and their suitability for the current task. Experimental results are shown in Section 4. The conclusion is presented in Section 5.

Figure 1: A few assistive technological devices for blind aid

Figure 2: Time span representation of usage of particular technologies
| References | Results | Device used | Method |
|---|---|---|---|
| Paul and Stephen (2003) bach2003sensory () | Voice message about visual scene | Braille | Convert the image from a video camera into a tactile image. |
| Kaczmarek (2011) kaczmarek2011tongue () | Creates real-time tactile images on the tongue | Tongue display unit (TDU) | TDU is a programmable pulse generator that delivers dc-balanced voltage pulses suitable for electrotactile stimulation of the anterior-dorsal tongue, through a matrix of surface electrodes. |
| Matsuda et al. (2008) matsuda2008finger () | Allow communication between deaf-blind persons | Mechanical fingers used to transmit Braille symbols | Finger Braille recognition system |
| Jones et al. (2006) jones2006tactile () | Navigation assistance | Harness-vest | Convert navigation information into tactile inputs. |
| Shraga et al. (1993) shoval1993navbelt () | Detect obstacles | Computer, ultrasonic sensors and stereophonic headphones | The acoustic signals are transmitted as discrete beeps or continuous sounds. |
| Sohl-Dickstein et al. (2015) sohl2015device () | Navigation aid and object perception | Ultrasonic chirps | Amplifies echoes produced by ultrasonic sounds to locate objects. |
| Kay (2000) kay2000auditory () | Navigation aid | Ultrasonic transmitter and two microphones | Translate echoes into sounds for navigating and scanning objects. |
| Auvray et al. (2007) auvray2007learning () | Audio output for visual input | Webcam | The Vibe device converts a video stream into a stereophonic sound stream. |
| Abboud et al. (2014) abboud2014eyemusic () | Visual information through a musical auditory experience | Camera | An algorithm conveys shape, location and color information using sound. |
| Joselin and Rene (2012) villanueva2012optical () | Detect walls, openings, and vertical roads | IR, LED and a photodiode | Pulses emitted by LED, retro-diffused light detected by the photodiode. |
| Mun-Cheon et al. (2015) kang2015novel () | Obstacle (not type) | Glasses-type vision camera | Deformable grid (set of vertices and edges with neighborhood system) |
| Van-Nam et al. (2015) hoang2015obstacle () | Moving objects (e.g. people) and static objects (e.g. trash, plant pots, fire extinguisher) and audio warning | Electrode matrix and mobile Kinect, RF transmitter | The color image, depth image, and accelerometer information provided by Kinect |
| Shaomei et al. (2017) wu2017automatic () | Give voice message for Facebook feeds to the blind user | Facebook | Artificial intelligence |
| Shweta et al. (2018) jaiswal2018small () | Helps to avoid obstacles and gives their approximate distance | Ultrasonic sensors | Ultrasonic signals |
| Rohit et al. (2017) agarwal2017low () | Indication of obstacle with distance <= 300 cm using buzzer | Ultrasonic smart glasses | Ultrasonic waves |
| Robert et al. (2018) katzschmann2018safe () | Detect obstacle with distance | Sonar belt and infrared time-of-flight distance sensors | Infrared light |

Table 1: Blind aid technologies using different sensor inputs

2 Literature Review

As vision is an extremely vital sensory system in humans, its loss affects the performance of most activities of daily living, thereby hampering an individual's quality of life, personal relationships, general lifestyle and career. With the advent of technology, scientists are trying to develop systems that make visually impaired individuals more independent and aware of their surroundings. It is often helpful to know the surrounding scene and then to learn about the obstacles in it. A few devices provide scene perception, for example EyeCane and EyeMusic. These use infrared rays to translate color, shape, location and other information about the object or scene into soundscapes (auditory or tactile cues) which the brain can interpret visually nau2015use (). The use of vision sensors for this particular purpose is limited. In general, vision sensor input is used by visually impaired individuals for reading documents (via OCR) or identifying street signs, hoardings, etc. (scene segmentation followed by OCR) gori2016devices (). Social interaction assistance for individuals with visual disability is also provided to some extent in the form of person recognition and facial expression recognition panchanathan2016social (). A number of devices have been developed to inform blind people about the presence of obstacles, their types, their distances, etc. This information is further used to help blind people navigate safely in indoor and outdoor environments. Ruxandra et al. proposed a smartphone based system that indicates the type of obstacle and categorizes it as urgent or normal depending on its distance from the user. Obstacle candidates are tracked with multiscale Lucas-Kanade using SIFT/SURF interest points, and the urgency of obstacles is identified by motion estimation using homomorphic transforms and agglomerative clustering.
The image patches of detected obstacle regions are then classified with an SVM using a visual codebook generated from k-means clustered HOG features tapu2013smartphone (). Mun-Cheon et al. indicate the presence of an obstacle using a deformable grid (DG), which uses motion information accumulated over several frames. The DG consists of a set of vertices and edges with an n-neighborhood system. This method is accurate as well as robust to motion tracking error and ego-motion of the camera. It detects objects at risk of collision from the extent of contortion of the deformable grid. However, this method does not work in areas with walls and doors kang2015novel (). To overcome the issue of detecting walls and doors, Van-Nam et al. developed an electrode matrix and mobile Kinect based obstacle detection and warning system for visually impaired individuals, in which they detect moving and stationary obstacles (using the color and depth information given by Kinect) and warn the user through a Tongue Display Unit. The degree of warning, which depends on the depth information, is conveyed by changing the level of the electric signal on the electrode matrix hoang2015obstacle (). To provide a stress-free environment for blind people, work has been done on detecting potholes and uneven surfaces. Aravinda et al. use a vision based system along with laser patterns for detecting potholes and uneven surfaces rao2016vision (). They use the Hough transform to detect the lines projected by the laser. Kanwal et al. provide wall-like obstacle information (through voice messages) using Kinect both as a camera and as a depth estimator. The camera detects the corners of the obstacle using the Harris and Stephens corner detector, and its infrared sensor gives the corresponding depth value for indoor environments kanwal2015navigation (). Aladren et al. proposed navigation assistance for the visually impaired based on visual and range information.
They used a consumer RGB-D camera and took advantage of both range and visual information about the floor, walls and obstacles in indoor environments. Their system gives voice commands about the obstacle to the user aladren2016navigation (). They perform segmentation using range data (via RANSAC), which is further enhanced using mean shift filtering on color data. Sarfraz and Rizvi sarfraz2007indoor () developed navigation assistance for indoor environments that provides depth and object type information, including the presence of humans, doors, hallways or corridors, staircases, elevators and moving objects. They developed an individual algorithm for each obstacle type, using the Canny edge detector. They use camera vision input and text-to-speech synthesized output to provide the navigation aid. Table 2 summarizes some notable assistive technologies for the visually impaired, which clearly portrays the general trend of existing systems in terms of the sensors used to perceive input from the environment, the output type and devices through which information is returned to the user, and the feature extraction technique used for object detection and classification. In this work, our main contributions are:

  • We develop a scene perception system that provides information about objects in the scene (object detection and classification from images) and their relative depth (using a laser). We focus on making this system low-cost, lightweight, simple and easily wearable, emphasising that no explicit training is required to use it. At the same time, we ensure that maximum information about the environment can be retrieved. The system currently works via restricted voice output, which means that the user can select how much information to receive. The user can choose to be informed about scene changes (a) at fixed intervals, (b) a single time only, or (c) when obstacles are too close.

  • Most of the work discussed above uses corner detectors, key point descriptors such as SIFT/SURF, edge detectors, or HOG descriptors for obstacle classification. Despite their success, these often suffer from issues such as the background having more key points than the candidate objects (or obstacles), or the absence of key points due to low resolution, poor texture and random motion. This work exploits state-of-the-art DCNNs because of their high performance in object detection and classification. To this end, we review different kinds of architectures, including single column and multicolumn CNNs. Multicolumn architectures mostly use the full image as the input to one column and different patches of the image as the input to a second column. We exploit the multicolumn architecture by using different features, namely image edges, optical flow or scale space representations, along with the RGB image. In this paper, we demonstrate 3 multicolumn architectures: RGB with edges, RGB with optical flow, and RGB with scale space.

  • In general, the most common CNN input is a 3-channel intensity image (RGB). Some contributions enhance it by including additional input modalities; for example, a 4-channel CNN input consisting of RGB-D. Another kind of architecture involves multispectral fusion using 2 different CNNs, i.e. fusing the separate convolutional feature map outputs of RGB and D inputs at early or late stages. In this paper we use multimodal fusion, where we fuse the convolutional feature maps of the individual columns of the multicolumn CNN using summation and maximum operations. This accommodates multimodal features at a lower computational cost in space and time than fusion via concatenation. In our case we fuse the convolutional feature map outputs of the different multicolumn architectures, using a different fusion for each of the three (Section 3.3).

3 System Details

The flow chart of the developed system is shown in Figure 3. The system is at an early stage but is already capable of fulfilling all of the basic requirements:

  1. Acquisition of a video stream from a webcam with HD resolution (1920x1080, 25 fps)

  2. Detection of multiple objects from the scene, even if the position of object or vehicle is not perfectly in front of the camera.

  3. Detection of one or more candidate objects in the scene.

  4. Generation of a vocal output as a synthesized voice saying the name of the recognized object or vehicle and its distance from the user
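The voice output step, combined with the restricted output modes described in Section 2 (fixed intervals, a single time only, or only when obstacles are too close), could be sketched as follows. This is a minimal illustration and not the system's actual code: the function names, the mode strings, the 5 second interval and the 2 meter proximity threshold are all assumptions introduced here.

```python
def should_announce(mode, now, last_time, distance_m,
                    interval_s=5.0, near_m=2.0):
    """Decide whether to speak, for one of three hypothetical output modes.

    mode:       "interval" (fixed intervals), "once" (a single time only)
                or "near" (only when the obstacle is too close).
    now:        current time in seconds.
    last_time:  time of the previous announcement, or None if none yet.
    distance_m: laser distance to the object in meters, or None if unknown.
    interval_s and near_m are illustrative defaults, not values from the paper.
    """
    if mode == "interval":
        return last_time is None or now - last_time >= interval_s
    if mode == "once":
        return last_time is None
    if mode == "near":
        return distance_m is not None and distance_m <= near_m
    raise ValueError(f"unknown mode: {mode}")


def voice_message(label, distance_m):
    """Text handed to a speech synthesizer: object name plus its distance."""
    return f"{label}, {distance_m:.1f} meters ahead"
```

The returned string would then be passed to whichever text-to-speech engine is installed on the board; that choice is left open here.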

| References | Information | Sensors | Image features |
|---|---|---|---|
| Ruxandra et al. (2013) tapu2013smartphone () | Obstacle detection and classification in real time; helps the user avoid and recognize static and dynamic objects like vehicles, pedestrians, bicycles etc. | All sensors available in a smartphone | SIFT, SURF and HOG features extracted. SVM classifier used for classification. |
| Aravinda et al. (2016) rao2016vision () | Detect potholes and uneven surfaces | Laser and monocular camera | Hough transform is used to detect the laser projected lines. The intersecting points obtained from the Hough transform are binned to create a histogram for each frame. This feature is dubbed Histogram of Intersections. |
| Kanwal et al. (2015) kanwal2015navigation () | Detect wall-like obstacles (pillar, people) and their depth | Kinect | Harris & Stephens corner detection |
| Aladren et al. (2016) aladren2016navigation () | Provide obstacle free path | RGB-D camera | Canny edge detector followed by the probabilistic Hough line transform |
| Sarfraz and Rizvi (2007) sarfraz2007indoor () | Detect human presence, doors, hallway/corridor, staircases, elevators, moving objects in indoor settings | Monocular vision camera | Canny edge detector |
| Saranya et al. (2018) saranyareal () | Detect object | Camera | Histogram features of intensity and gradient, and edge linking features |

Table 2: Blind aid technologies using vision sensor inputs

Figure 3: Flow chart for the proposed scheme

3.1 Single-board PC: Odroid

A single-board PC is used to test the work. It is portable and easy to wear. There are several single-board computers on the market, among which the best known are probably BeagleBone, Raspberry Pi, Odroid, Udoo and LattePanda; they are released at an impressive cadence in progressively more powerful versions. We use an Odroid XU4 board to develop our prototype; the Odroid XU4 is currently ranked 5th. Note that this ranking reflects not only system performance but also other aspects that we do not consider mandatory for the present project, such as the platform cost, the availability of software on the web, and the presence of communities that support software updates and forums. The main feature that led us to opt for the Odroid XU4 platform is the processor: the Odroid XU4 has a Samsung Exynos5422 octa-core processor running at 2 GHz and performs better than its competitors. In addition, the ARM Mali-T628 GPU may be very useful for further improving processing throughput through parallel computation, thanks to its support for OpenCL. Another important aspect is the large amount of available memory, 2 GB of embedded DDR3, which allows the image data and its features to be stored easily. Moreover, storage is extendible up to 64 GB in eMMC format, allowing much faster access than a common SD card. Finally, two USB 3.0 ports allow us to easily capture high-resolution video streams from two cameras, while only slower USB 2.0 ports are available on the Raspberry Pi. A Hokuyo URG-04LX-UG01 scanning laser is attached via USB to obtain the distance of the detected object from the user. It is a small, affordable and accurate laser scanner that reports ranges over a 240 degree arc with 0.36 degree angular resolution. Its 5 V power supply allows it to be used on battery operated platforms. A Logitech C270 HD webcam is also attached.
Its 3 MP camera records good quality, clear visuals in both daytime and night-time environments, as the webcam also features effective light correction. The laser and camera can be attached to the chest as shown in Figure 4. The Odroid is a small PC capable of running a full Linux distribution. We chose Ubuntu Mint to have access to the well populated and mature community of Ubuntu users and software; consequently, the setup time is significantly reduced and the procedures needed to obtain a ready-to-use system are much the same as those for an x86 PC. The system currently runs at about 1 fps; further experiments with the User Group and appropriate optimization are in progress. The system architecture is shown in Figure 5.

Figure 4: Camera and laser attached to the chest

Figure 5: System architecture

3.2 Features

Deep learning features are widely used in recent research on object detection chu2018deep () and scene perception. Among CNNs for object detection, R-CNN, Fast R-CNN and Faster R-CNN are widely adopted. R-CNN extracts region proposals, computes CNN features and classifies the objects. To reduce computation, Fast R-CNN uses region of interest pooling, sharing the forward pass of the CNN across proposals. These region proposals are created using selective search, which is replaced by an RPN in Faster R-CNN ren2017faster (). Here a single network composed of the region proposal network and Fast R-CNN is used, with the two sharing their convolutional features. Segmentation can be added to Faster R-CNN by placing an object mask prediction branch alongside the existing branch for bounding box recognition he2017mask (). These object detection networks fine-tune VGG16 on the PASCAL dataset. It is argued that VGG16 performs better than AlexNet and GoogLeNet. Another widely used network, YOLO redmon2016yolo9000 (), composed entirely of convolutional layers and trained and tested on the PASCAL VOC and COCO datasets, is quite accurate and fast.

For enhanced performance, researchers focus particularly on fine-tuning, on including temporal information, and on architectural enhancement. yao2017coupled (), zhuo2017vehicle () and wang2016vehicle () fine-tuned different networks pretrained on the ILSVRC-2012 dataset, such as GoogLeNet, AlexNet and Fast R-CNN, using their own datasets for vehicle detection. li2017attentive () and kang2017optimizing () reduce false alarms by introducing a context based CNN model or by propagating motion information to adjacent frames. Architectural enhancement is done in different ways: RGB, depth aladren2016navigation () hou2018object () and optical flow sarkar2017deep () information is treated as different channels and fed to a single column CNN, or a multicolumn CNN is trained using different data (global and local patches). Some architectural modifications of AlexNet are used in wang2016multi () to wang2016brain (). For example, in wang2016multi () the 5th convolutional layer of AlexNet is replaced with a set of seven convolutional layers (referring to 7 different objects), which are stacked in parallel with mean pooling and then fed to the fully connected layers. This network is trained in 2 phases: first, seven networks are trained individually, one for each scene category; second, their weights are used for the parallel layers and the entire network is retrained. This enables more accurate classification of scenes consisting of these objects. lu2015deep () propose a single column CNN with four convolutional layers and three fully connected layers, the last layer giving a softmax probability as output, using information extracted from multiple image patches. Here an image is divided into multiple patches, each patch is fed to the CNN, and the feature output is extracted from the last layer. The outputs for all image patches are combined by a statistical aggregation such as the minimum, maximum, median or average of these features, and the final softmax is taken as the output.
lu2015rating () train a 2 column CNN in which the network input consists of the global image as well as local image patches. wang2016brain () propose a multicolumn CNN model in which a CNN trained for style attribute prediction is treated as an additional column and added to the parallel input pathways of a global image column and a local patch column.
In the case of multicolumn CNN architectures, the feature outputs of the different columns are fused at various stages. liu2016multispectral () demonstrate the use of a multispectral CNN in which different kinds of images, such as intensity and thermal images, are used for training and the obtained feature maps are fused using concatenation. In this work we use edge, optical flow and scale space representations along with RGB intensity images and fuse the convolutional feature maps before applying the RPN. We use edges obtained from the Canny, Sobel and Prewitt edge detection algorithms, two Gaussian scale values, and the orientation data of optical flow to extract pertinent features through the different scale and orientation information of an object. The PASCAL VOC dataset is used for extracting the features from the 5th convolutional layer of VGG16. The steps of training and testing are shown in Algorithm 1. A separate network is used for each input: the intensity image, the edge images (Canny, Sobel, Prewitt), the Gaussian image and the optical flow image. The 5th convolutional layer features of the intensity network are fused separately with those of each edge network to give the edge-fused feature map. In the same manner, further feature maps are obtained by fusing the intensity features with the Gaussian features and with the optical flow features. These feature maps are then passed on for ROI pooling and classified using two different classification networks, CNN_0C and CNN_1C, as shown in Figure 6. CNN_0C has three fully connected layers, while CNN_1C has one convolutional layer followed by three fully connected layers.
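To make the edge inputs concrete, the sketch below computes Sobel and Prewitt gradient magnitudes for a small grayscale image using only the Python standard library. It is an illustration under stated assumptions rather than the pipeline used in this work: the function names are hypothetical, borders are handled by keeping only positions where the 3x3 kernel fits, and Canny is omitted because it adds smoothing, non-maximum suppression and hysteresis on top of a gradient step like this one.

```python
import math

# Horizontal-gradient kernels; the vertical ones are their transposes.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
PREWITT_X = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]


def transpose(kernel):
    """Swap rows and columns of a 3x3 kernel."""
    return [list(row) for row in zip(*kernel)]


def correlate(img, kernel):
    """Slide a 3x3 kernel over img (list of rows); output is (H-2) x (W-2)."""
    h, w = len(img), len(img[0])
    return [[sum(kernel[i][j] * img[y + i][x + j]
                 for i in range(3) for j in range(3))
             for x in range(w - 2)]
            for y in range(h - 2)]


def edge_magnitude(img, kernel_x):
    """Gradient magnitude sqrt(gx^2 + gy^2) for the given kernel family."""
    gx = correlate(img, kernel_x)
    gy = correlate(img, transpose(kernel_x))
    return [[math.hypot(a, b) for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]


# A vertical step edge: dark on the left, bright on the right.
step = [[0, 0, 1, 1] for _ in range(4)]
sobel_edges = edge_magnitude(step, SOBEL_X)
prewitt_edges = edge_magnitude(step, PREWITT_X)
```

Each resulting edge map would serve as the input of its own network column, alongside the RGB column.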

3.3 Fusion

A common way of fusing features is to concatenate them liu2016multispectral (). However, this enlarges the feature map, which costs considerable computational time and space. So in this paper we use the addition and/or maximum of features, which retains the size of the feature map. There are three cases of feature fusion:

  1. In the first case, edges, the features are fused by adding the RGB features to the Canny, Sobel and Prewitt edge features pairwise. Taking the maximum of these three sums then gives the final feature map, as shown in equation 1. The process using the two network classifiers is shown in Figure 6.


    Figure 6: Multimodal object detection and classification using RGB and edge features
  2. In the case of optical flow, the RGB features are fused with (added to) the optical flow orientation features. The feature map is obtained as shown in equation 2. The whole process is shown in Figure 7 for one classifier network; it is also carried out with the other.


    Figure 7: Multimodal object detection and classification using RGB and optical flow features
  3. For scaled images, the fusion is done by taking maximum of the features of and . The feature map is obtained using equation 3. The process is represented in Figure 8.


    Figure 8: Multimodal object detection and classification using RGB and Gaussian scaled features
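The three fusion cases can be sketched with element-wise operations on same-shaped feature maps. The code below is a minimal illustration in plain Python, with feature maps stored as nested lists (channels x height x width). The helper names `fuse_sum` and `fuse_max` are hypothetical, the toy values are made up, and pairing the RGB features with the Gaussian features in the third case is an assumption, since the exact operands of that maximum are not recoverable from the text.

```python
def elementwise(op, *maps):
    """Apply op across same-shaped nested lists, one position at a time."""
    if isinstance(maps[0], list):
        return [elementwise(op, *parts) for parts in zip(*maps)]
    return op(maps)  # maps is now a tuple of scalars


def fuse_sum(*maps):
    """Fusion by addition; the output keeps the input shape."""
    return elementwise(sum, *maps)


def fuse_max(*maps):
    """Fusion by element-wise maximum; the output keeps the input shape."""
    return elementwise(max, *maps)


# Toy 1 x 2 x 2 feature maps (illustrative values only).
rgb = [[[1.0, 2.0], [3.0, 4.0]]]
canny = [[[0.5, 0.0], [0.0, 1.0]]]
sobel = [[[0.0, 2.0], [0.0, 0.0]]]
prewitt = [[[0.0, 0.0], [1.0, 0.0]]]
flow = [[[1.0, 1.0], [1.0, 1.0]]]
gauss = [[[2.0, 1.0], [5.0, 0.0]]]

# Case 1: add RGB to each edge map, then take the maximum of the three sums.
edge_fused = fuse_max(fuse_sum(rgb, canny),
                      fuse_sum(rgb, sobel),
                      fuse_sum(rgb, prewitt))
# Case 2: add RGB and optical flow features.
flow_fused = fuse_sum(rgb, flow)
# Case 3 (assumed operands): maximum of RGB and Gaussian-scaled features.
scale_fused = fuse_max(rgb, gauss)

# Concatenation along the channel axis would instead grow the map.
concatenated = rgb + canny + sobel + prewitt
```

Here `edge_fused` still has one channel while `concatenated` has four, which is the size argument made at the start of this section.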

3.4 Depth Data

During testing, the trained network detects and classifies the object or vehicle in the user's path, and the laser reports the distance of the detected object or vehicle. The distance is added to the system by mapping the laser data onto the vehicle detection and classification output. The laser gives its data in polar form $(r, \theta)$: the distance $r$ and the angle $\theta$ of the object from the center point of the laser. From the polar value, the Cartesian coordinates $(x, y)$ of the object are calculated using equation 4.

$$x = r\cos\theta, \qquad y = r\sin\theta \tag{4}$$
So the laser data is represented in the form of $(x, y)$. At the same time, an image is captured by the camera, and the pixel value of the detected object is acquired from that image. Both captured scenes, from the camera and from the laser sensor, are divided into grids for the purpose of mapping. Some instances of this process are shown in Figure 9. The grid cell containing the pixel value of an object is mapped to the laser grid cell containing the same object. A mapping from the camera grid to the laser grid is accomplished by the function given in equation 5, from which the distance of the object is determined.


where represents Data of camera and represents output and represents the index.

(a) Instances of images(captured via camera) in grid style

(b) Instances of images(generated via laser) in grid style
Figure 9: Mapping of laser data with images captured by camera
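The polar-to-Cartesian conversion and a grid-based lookup of the object distance might be sketched as below. The mapping function of equation 5 is not recoverable from the text, so this is a stand-in under explicit assumptions: the camera's horizontal field of view (60 degrees here) is centered on the laser's forward axis, the laser angle is 0 straight ahead and positive to the left, and both views are split into 8 vertical grid columns. All names and parameter values are hypothetical.

```python
import math


def polar_to_cartesian(r, theta):
    """Convert a laser return (range r, angle theta in radians) to (x, y)."""
    return r * math.cos(theta), r * math.sin(theta)


def grid_column(value, lo, hi, n_cols):
    """Index of the grid column that contains value on the interval [lo, hi]."""
    frac = (value - lo) / (hi - lo)
    return min(n_cols - 1, max(0, int(frac * n_cols)))


def object_distance(scan, box_x_center, image_width, fov_deg=60.0, n_cols=8):
    """Nearest laser range in the grid column that contains the object.

    scan:         list of (r, theta) laser returns, theta in radians.
    box_x_center: horizontal pixel position of the detected object's center.
    Returns None when no laser return falls in the object's column.
    """
    half = math.radians(fov_deg) / 2.0
    # Image x grows to the right but the assumed laser angle grows to the
    # left, so the image axis is flipped before binning.
    obj_col = grid_column(image_width - box_x_center, 0, image_width, n_cols)
    ranges = [r for r, theta in scan
              if -half <= theta <= half
              and grid_column(theta, -half, half, n_cols) == obj_col]
    return min(ranges) if ranges else None
```

Taking the minimum range in the matching column is a conservative choice for a warning system, since it reports the closest surface in the direction of the object.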

Input: Training set P, number of classes C.
Output: (O, y, Z), where O is the detected object, y is its label and Z is the distance of the object from the user.
1: Divide P into 3 equal parts P_j, where j = 1, 2, 3.
2: For j = 1 to 3
     (i) Extract edge, scale space and optical flow features of the intensity images (I) of set P_j.
     (ii) Fuse the convolutional feature maps of I with A, where A is one of E, G or O, in which E is for edges (Canny, Sobel, Prewitt), G is for Gaussian scale space and O is for optical flow.
     (iii) Pass the fused feature maps for ROI pooling using the Region Proposal Algorithm given in Algorithm 2.
     (iv) Create two networks, one having 1 convolutional and 3 fully connected layers and one having only 3 fully connected layers, using the shared weights of VGG16.
     (v) Train these networks with P_j.
     (vi) Output the trained net, which can predict and classify the obstacle.
3: Using the trained net, features of the test set are extracted and an SVM is trained to classify.
4: Calculate accuracy by comparing the predicted and actual output.
5: The laser is used to extract the distance Z of the obstacle predicted and classified by the net.

Algorithm 1 Obstacle predictor and classifier along with distance for blind

The first step is that the image is given as input to a convolutional network, which outputs a set of convolutional feature maps at the last convolutional layer.
Then a sliding window of size n x n is run spatially over these feature maps. At each position a set of anchors is generated, all having the same center but different aspect ratios and scales. All these coordinates are computed with respect to the original image.
For each of these anchors, a value p* is computed as shown in equation 6, which indicates how much the anchor overlaps with the ground-truth (GT) bounding boxes.


where IU is the intersection over union, defined in equation 7:

$$IU = \frac{|\text{Anchor} \cap \text{GTBox}|}{|\text{Anchor} \cup \text{GTBox}|} \tag{7}$$

Algorithm 2 Region Proposal Algorithm
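The anchor generation and overlap labelling in Algorithm 2 can be illustrated with the short sketch below. The thresholds for p* are not recoverable from the text, so the values 0.7 and 0.3, the anchor scales (128, 256, 512) and the aspect ratios (1:2, 1:1, 2:1) are taken from the Faster R-CNN defaults and are assumptions with respect to this work, as are the function names and the use of -1 for anchors that are ignored during training.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchors sharing the center (cx, cy), one per (scale, ratio) pair.

    ratio is height / width; each anchor keeps an area of scale * scale.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / r ** 0.5
            h = s * r ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def label_anchor(anchor, gt_boxes, hi=0.7, lo=0.3):
    """Return 1 (object), 0 (background) or -1 (ignored) for one anchor.

    hi and lo are assumed thresholds, not values stated in this paper.
    """
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best >= hi:
        return 1
    if best < lo:
        return 0
    return -1
```

With three scales and three ratios, each sliding-window position yields nine anchors, and only those labelled 1 or 0 would contribute to the proposal loss.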

4 Experimental Results

As described in Section 3.2, the features from the 5th convolutional layer of VGG16 are extracted using different image representations such as the intensity (RGB) image, edges (Canny, Sobel, Prewitt) and scale space using different Gaussian filters. In the training phase, the network is trained using the PASCAL dataset (with these image representations), which is divided into 3 parts. The two created networks are each trained on the three parts individually. The weights and biases of the 3 trained nets are averaged to obtain the final net in both cases. Further testing is done on our own dataset using these two trained nets. During testing, a person is blindfolded to realistically simulate the experience of visually impaired people. Testing is done at different spots and times. There can be situations where the camera shows no movement for a long time. This may be due to a lack of vehicular movement, the person facing a wall, or something obstructing the view of the camera. Another hindrance is the shadows of trees and poles during the afternoon, which degrade the quality of the videos. During the evening, traffic is heavy, and extracting data from such congested roads is another challenge. To overcome these issues, different speech messages are prepared. Results of methods proposed by other researchers are presented in Table 3. Results of the proposed work are presented in the form of (1) vehicle detection and classification using the deep neural network and (2) the distance of the vehicle from a blind user measured with the laser, as shown in Figure 10.
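The averaging of the three trained nets can be illustrated as an element-wise mean over their parameters. The sketch assumes each net is stored as a dictionary mapping a parameter name to a flat list of values and that all nets share the same layout; the function name and the parameter names in the example are hypothetical.

```python
def average_nets(nets):
    """Element-wise mean of the parameters of nets with identical layouts.

    nets: list of dicts, each mapping a parameter name to a flat list of floats.
    """
    n = len(nets)
    return {
        name: [sum(values) / n
               for values in zip(*(net[name] for net in nets))]
        for name in nets[0]
    }


# Three nets trained on the three parts of the dataset (toy values).
trained = [
    {"fc1.weight": [1.0, 2.0], "fc1.bias": [0.0]},
    {"fc1.weight": [3.0, 4.0], "fc1.bias": [3.0]},
    {"fc1.weight": [5.0, 6.0], "fc1.bias": [6.0]},
]
final_net = average_nets(trained)
```

Averaging in parameter space is most meaningful when the nets start from the same initialization, which holds here to the extent that all three share the VGG16 weights.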

Method                             DS1   DS2
HOG tapu2013smartphone ()          50    52
Fast-RCNN girshick2015fast ()      54    61
Faster-RCNN ren2017faster ()       60    66
YOLO redmon2016yolo9000 ()         65    68
Table 3: Results of existing methods

Figure 10: Results of object detection and classification along with their distance from the user

Accuracies of the different network architectures for object detection and classification on normal, edge and scaled images are shown in Table 4. The results show that scaled and edge images give better results than normal images, and that the architecture with 1 convolutional layer and 3 fully connected layers (CNN_1C) performs better than the one with only 3 fully connected layers (CNN_0C). Comparing the two architectures, the net trained with normal data shows a noticeable improvement with CNN_1C (from 65 to 81), while the others are almost similar. This shows that the improvement from adding a convolutional layer is clearly visible with normal data but much less visible with the other features. So, using features is almost as beneficial as adding a convolutional layer. Moreover, when nets trained with edges/scaled/optical flow (OF) images are tested with edges/scaled/optical flow images respectively, accuracy is higher than when nets trained with edges/scaled images are tested with normal images.

Network / test set   Edges   Gaussian   N-E    N-G    OF     Normal
CNN_0C               79      77         78     78.5   77.5   65
CNN_1C               81.5    82         79     79.5   79.8   81
Table 4: Accuracies of different network architectures for edge, scaled and normal images
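
The per-column difference between the two rows of Table 4 makes the comparison concrete. The short sketch below only restates the table values; the variable names are illustrative.

```python
# Accuracies copied from Table 4, in column order.
columns = ["Edges", "Gaussian", "N-E", "N-G", "OF", "Normal"]
cnn_0c = [79, 77, 78, 78.5, 77.5, 65]
cnn_1c = [81.5, 82, 79, 79.5, 79.8, 81]

# Gain of CNN_1C over CNN_0C for each test set, rounded to one decimal.
gain = {c: round(b - a, 1) for c, a, b in zip(columns, cnn_0c, cnn_1c)}

# Test set on which the added convolutional layer helps most.
largest = max(gain, key=gain.get)
```

The gain is 16 points on normal images and at most 5 points on any other column, which is the pattern the paragraph above describes.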

Figures 11 and 12 show the norm of the means and standard deviations of the weight gradients for each layer of the two networks, respectively, as a function of the number of training epochs. The values are normalized by the L2 norms of the weights of each layer. The graphs show that, in the network of Figure 11, the mean for the convolutional layer is the last to reduce to zero compared with the other layers, while the standard deviation (STD) of the fc2 layer is the last to converge. In the network of Figure 12, the mean of fc2 and the STD of fc3 are the last to converge compared with the other layers.
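
One plausible reading of these per-layer statistics is sketched below. The exact formula is not given in the text, so this is an assumption: for each layer, the absolute mean and the standard deviation of the weight gradients are divided by the L2 norm of that layer's weights. The function name and the dictionary layout are hypothetical.

```python
import numpy as np

def gradient_stats(weights, grads):
    """Per-layer (normalized |mean|, normalized std) of the weight gradients.

    weights, grads: dicts {layer name: ndarray} with matching keys and shapes.
    Normalization by the L2 norm of the layer weights follows the figure
    captions; the rest of the formula is an assumed interpretation.
    """
    stats = {}
    for name, w in weights.items():
        g = grads[name]
        w_norm = np.linalg.norm(w.ravel())   # L2 norm of the layer weights
        stats[name] = (abs(g.mean()) / w_norm, g.std() / w_norm)
    return stats

# Usage with one toy layer; calling this once per epoch yields one curve
# point per layer.
weights = {"fc1": np.array([[3.0, 4.0]])}
grads = {"fc1": np.array([[1.0, 3.0]])}
stats = gradient_stats(weights, grads)
```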

Figure 11: The norm of the means and standard deviations of the weight gradients for each layer of the first network as a function of the number of training epochs. The values are normalized by the L2 norms of the weights of each layer.

Figure 12: The norm of the means and standard deviations of the weight gradients for each layer of the second network as a function of the number of training epochs. The values are normalized by the L2 norms of the weights of each layer.

5 Conclusion

This paper proposes a low-cost, lightweight, simple and easily wearable system, to which a laser and a high-quality webcam are attached. The system is trained using a multi-column CNN with edge, optical flow and scale space features. The resulting convolutional feature maps are fused using two kinds of multispectral fusion: addition and maximum. The fused feature maps are then passed to ROI pooling and classified using two different classification networks, CNN_0C and CNN_1C, which have three fully connected layers and one convolutional layer followed by three fully connected layers, respectively. Experiments with these networks show that there is some improvement from CNN_0C to CNN_1C; however, the improvement is minimal when different kinds of features are used. Of all the features mentioned above, scale space features with CNN_1C outperform the others. The proposed system is designed to help visually impaired people: it detects and classifies obstacles in the user's path, estimates their distance from the user, and warns the user about them.


  1. journal: Journal of Neurocomputing


  1. D. Croce, P. Gallo, D. Garlisi, L. Giarré, S. Mangione, I. Tinnirello, Arianna: A smartphone-based navigation system with human in the loop, in: Control and Automation (MED), 2014 22nd Mediterranean Conference of, IEEE, 2014, pp. 8–13.
  2. L. Kay, Auditory perception of objects by blind persons, using a bioacoustic high resolution air sonar, The Journal of the Acoustical Society of America 107 (6) (2000) 3266–3275.
  3. P. Bach-y-Rita, S. W. Kercel, Sensory substitution and the human–machine interface, Trends in cognitive sciences 7 (12) (2003) 541–546.
  4. K. Kaczmarek, The tongue display unit (tdu) for electrotactile spatiotemporal pattern presentation, Scientia Iranica 18 (6) (2011) 1476–1485.
  5. M. Auvray, S. Hanneton, J. K. O’Regan, Learning to perceive with a visuo-auditory substitution system: localisation and object recognition with ‘The vOICe’, Perception 36 (3) (2007) 416–430.
  6. S. Abboud, S. Hanassy, S. Levy-Tzedek, S. Maidenbaum, A. Amedi, Eyemusic: Introducing a “visual” colorful experience for the blind using auditory sensory substitution, Restorative neurology and neuroscience 32 (2) (2014) 247–257.
  7. L. A. Jones, B. Lockyer, E. Piateski, Tactile display and vibrotactile pattern recognition on the torso, Advanced Robotics 20 (12) (2006) 1359–1374.
  8. S. Shoval, J. Borenstein, Y. Koren, The navbelt-a computerized travel aid for the blind, in: Proceedings of the RESNA Conference, 1993, pp. 13–18.
  9. J. Sohl-Dickstein, S. Teng, B. M. Gaub, C. C. Rodgers, C. Li, M. R. DeWeese, N. S. Harper, A device for human ultrasonic echolocation, IEEE Transactions on Biomedical Engineering 62 (6) (2015) 1526–1534.
  10. R. Agarwal, N. Ladha, M. Agarwal, K. K. Majee, A. Das, S. Kumar, S. K. Rai, A. K. Singh, S. Nayak, S. Dey, et al., Low cost ultrasonic smart glasses for blind, in: Information Technology, Electronics and Mobile Communication Conference (IEMCON), 2017 8th IEEE Annual, IEEE, 2017, pp. 210–213.
  11. M. Gori, G. Cappagli, A. Tonelli, G. Baud-Bovy, S. Finocchietti, Devices for visually impaired people: High technological devices with low user acceptance and no adaptability for children, Neuroscience & Biobehavioral Reviews 69 (2016) 79–88.
  12. Y. Matsuda, I. Sakuma, Y. Jimbo, E. Kobayashi, T. Arafune, T. Isomura, Finger braille recognition system for people who communicate with deafblind people, in: Mechatronics and Automation, 2008. ICMA 2008. IEEE International Conference on, IEEE, 2008, pp. 268–273.
  13. J. Villanueva, R. Farcy, Optical device indicating a safe free path to blind people, IEEE transactions on instrumentation and measurement 61 (1) (2012) 170–177.
  14. M.-C. Kang, S.-H. Chae, J.-Y. Sun, J.-W. Yoo, S.-J. Ko, A novel obstacle detection method based on deformable grid for the visually impaired, IEEE Transactions on Consumer Electronics 61 (3) (2015) 376–383.
  15. V.-N. Hoang, T.-H. Nguyen, T.-L. Le, T.-T. H. Tran, T.-P. Vuong, N. Vuillerme, Obstacle detection and warning for visually impaired people based on electrode matrix and mobile kinect, in: Information and Computer Science (NICS), 2015 2nd National Foundation for Science and Technology Development Conference on, IEEE, 2015, pp. 54–59.
  16. S. Wu, J. Wieland, O. Farivar, J. Schiller, Automatic alt-text: Computer-generated image descriptions for blind users on a social network service., in: CSCW, 2017, pp. 1180–1192.
  17. S. Jaiswal, J. Warrier, V. Sinha, R. K. Jain, M. Student, Small sized vision based system for blinds, International Journal of Engineering Science 15968.
  18. R. Katzschmann, B. Araki, D. Rus, Safe local navigation for visually impaired users with a time-of-flight and haptic feedback device, IEEE Transactions on Neural Systems and Rehabilitation Engineering.
  19. A. C. Nau, M. C. Murphy, K. C. Chan, Use of sensory substitution devices as a model system for investigating cross-modal neuroplasticity in humans, Neural regeneration research 10 (11) (2015) 1717.
  20. S. Panchanathan, S. Chakraborty, T. McDaniel, Social interaction assistant: a person-centered approach to enrich social interactions for individuals with visual impairments, IEEE Journal of Selected Topics in Signal Processing 10 (5) (2016) 942–951.
  21. R. Tapu, B. Mocanu, A. Bursuc, T. Zaharia, A smartphone-based obstacle detection and classification system for assisting visually impaired people, in: Computer Vision Workshops (ICCVW), 2013 IEEE International Conference on, IEEE, 2013, pp. 444–451.
  22. A. S. Rao, J. Gubbi, M. Palaniswami, E. Wong, A vision-based system to detect potholes and uneven surfaces for assisting blind people, in: Communications (ICC), 2016 IEEE International Conference on, IEEE, 2016, pp. 1–6.
  23. N. Kanwal, E. Bostanci, K. Currie, A. F. Clark, A navigation system for the visually impaired: a fusion of vision and depth sensor, Applied bionics and biomechanics 2015.
  24. A. Aladren, G. López-Nicolás, L. Puig, J. J. Guerrero, Navigation assistance for the visually impaired using rgb-d sensor with range expansion, IEEE Systems Journal 10 (3) (2016) 922–932.
  25. M. Sarfraz, S. A. J. Rizvi, Indoor navigational aid system for the visually impaired, in: Geometric Modeling and Imaging, 2007. GMAI’07, IEEE, 2007, pp. 127–132.
  26. N. Saranya, M. Nandinipriya, U. Priya, Real time object detection for blind people.
  27. W. Chu, D. Cai, Deep feature based contextual model for object detection, Neurocomputing 275 (2018) 1035–1042.
  28. S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: towards real-time object detection with region proposal networks, IEEE transactions on pattern analysis and machine intelligence 39 (6) (2017) 1137–1149.
  29. K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Computer Vision (ICCV), 2017 IEEE International Conference on, IEEE, 2017, pp. 2980–2988.
  30. J. Redmon, A. Farhadi, Yolo9000: better, faster, stronger, arXiv preprint 1612.
  31. Y. Yao, B. Tian, F.-Y. Wang, Coupled multivehicle detection and classification with prior objectness measure, IEEE Transactions on Vehicular Technology 66 (3) (2017) 1975–1984.
  32. L. Zhuo, L. Jiang, Z. Zhu, J. Li, J. Zhang, H. Long, Vehicle classification for large-scale traffic surveillance videos using convolutional neural networks, Machine Vision and Applications (2017) 1–10.
  33. S. Wang, F. Liu, Z. Gan, Z. Cui, Vehicle type classification via adaptive feature clustering for traffic surveillance video, in: Wireless Communications & Signal Processing (WCSP), 2016 8th International Conference on, IEEE, 2016, pp. 1–5.
  34. J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, S. Yan, Attentive contexts for object detection, IEEE Transactions on Multimedia 19 (5) (2017) 944–954.
  35. D. Kang, J. Emmons, F. Abuzaid, P. Bailis, M. Zaharia, Optimizing deep cnn-based queries over video streams at scale, arXiv preprint arXiv:1703.02529.
  36. S. Hou, Z. Wang, F. Wu, Object detection via deeply exploiting depth information, Neurocomputing 286 (2018) 58–66.
  37. S. Sarkar, V. Venugopalan, K. Reddy, J. Ryde, N. Jaitly, M. Giering, Deep learning for automated occlusion edge detection in rgb-d frames, Journal of Signal Processing Systems 88 (2) (2017) 205–217.
  38. W. Wang, M. Zhao, L. Wang, J. Huang, C. Cai, X. Xu, A multi-scene deep learning model for image aesthetic evaluation, Signal Processing: Image Communication 47 (2016) 511–518.
  39. Z. Wang, S. Chang, F. Dolcos, D. Beck, D. Liu, T. S. Huang, Brain-inspired deep networks for image aesthetics assessment, arXiv preprint arXiv:1601.04155.
  40. X. Lu, Z. Lin, X. Shen, R. Mech, J. Z. Wang, Deep multi-patch aggregation network for image style, aesthetics, and quality estimation, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 990–998.
  41. X. Lu, Z. Lin, H. Jin, J. Yang, J. Z. Wang, Rating image aesthetics using deep learning, IEEE Transactions on Multimedia 17 (11) (2015) 2021–2034.
  42. J. Liu, S. Zhang, S. Wang, D. N. Metaxas, Multispectral deep neural networks for pedestrian detection, arXiv preprint arXiv:1611.02644.
  43. R. Girshick, Fast r-cnn, arXiv preprint arXiv:1504.08083.