GAPNet: Graph Attention based Point Neural Network for Exploiting Local Feature of Point Cloud
Exploiting fine-grained semantic features on point cloud is still challenging due to its irregular and sparse structure in a non-Euclidean space. Among existing studies, PointNet provides an efficient and promising approach to learn shape features directly on unordered 3D point cloud and has achieved competitive performance. However, local feature that is helpful towards better contextual learning is not considered. Meanwhile, attention mechanism shows efficiency in capturing node representation on graph-based data by attending over neighboring nodes. In this paper, we propose a novel neural network for point cloud, dubbed GAPNet, to learn local geometric representations by embedding graph attention mechanism within stacked Multi-Layer-Perceptron (MLP) layers. Firstly, we introduce a GAPLayer to learn attention features for each point by highlighting different attention weights on neighborhood. Secondly, in order to exploit sufficient features, a multi-head mechanism is employed to allow GAPLayer to aggregate different features from independent heads. Thirdly, we propose an attention pooling layer over neighbors to capture local signature aimed at enhancing network robustness. Finally, GAPNet applies stacked MLP layers to attention features and local signature to fully extract local geometric structures. The proposed GAPNet architecture is tested on the ModelNet40 and ShapeNet part datasets, and achieves state-of-the-art performance in both shape classification and part segmentation tasks.
Can Chen, Luca Zanotti Fragonara, Antonios Tsourdos
School of SATM, Cranfield University, UK, MK43 0AL
Preprint. Under review.
As point cloud data becomes increasingly popular in a wide range of applications, such as autonomous vehicles zhou2018voxelnet (); qi2018frustum (); ku2018joint (); liu2018real (), robotic mapping and navigation biswas2012depth (); zhu2017target (), and 3D shape representation and modelling golovinskiy2009shape (), many researchers are turning their attention to shape analysis and understanding, especially now that convolutional neural networks (CNNs) have achieved significant success in computer vision tasks. However, CNNs rely heavily on data with a standard grid structure, which leads to poor performance on irregular and unordered geometric data such as point clouds. As a result, fully exploiting contextual information from point clouds remains a challenging problem.
In order to leverage the advantages of CNNs, some approaches maturana2015voxnet (); wang2015voting (); riegler2017octnet () map the unstructured point cloud to a standard 3D grid before applying CNN architectures. However, these volumetric representations are inefficient in terms of memory and computation because point clouds are typically sparse. Instead of applying CNNs over a gridded point cloud, PointNet qi2017pointnet () pioneers the approach of applying deep learning directly to irregular point clouds. In particular, PointNet makes the input point cloud invariant to permutations and exploits point-wise features by independently applying a Multi-Layer-Perceptron (MLP) network and a symmetric function to each point. However, it only captures a global feature without local information. PointNet++ qi2017pointnet++ () extends the PointNet model by constructing a hierarchical neural network that recursively applies PointNet with designed sampling and grouping layers to extract local features. DGCNN wang2018dynamic () applies an edge convolution to points and their corresponding edges to further exploit local information. Adapted from a point cloud registration method, KC-Net shen2018mining () builds a kernel correlation layer to measure geometric affinities between points.
Attention mechanisms have proved to be effective in many areas, such as machine translation vaswani2017attention (); bahdanau2014neural (), vision-based tasks mnih2014recurrent (), and graph-based tasks velivckovic2017graph (). Inspired by graph attention networks velivckovic2017graph (), we focus on exploiting fine-grained local features of point clouds with an attention mechanism in 3D shape classification and part segmentation tasks. The key contributions of our work are summarized as follows:
We propose a multi-head GAPLayer to capture contextual attention features by indicating different importance of neighbors for each point. Independent heads attend to different features from representation space in parallel and are further aggregated together to obtain sufficient power of feature extraction.
We propose self-attention and neighboring-attention mechanisms to allow the GAPLayer to compute the attention coefficients by considering the self-geometric information and local correlations to corresponding neighbors.
An attention pooling layer over neighbors is proposed to identify the most important features to obtain local signature representation to enhance network robustness.
Our GAPNet integrates the GAPLayer and the attention pooling layer into stacked Multi-Layer-Perceptron (MLP) layers or existing pipelines (e.g. PointNet) to better extract local contextual feature from unordered point cloud.
2 Related work
Learning features from volumetric grid.
Voxelization is an intuitive way to convert a sparse and irregular point cloud to a standard grid structure, after which standard CNNs can be applied for feature extraction. VoxNet maturana2015voxnet () voxelizes the point cloud into a volumetric grid that indicates the spatial occupancy of each voxel, followed by a 3D CNN over the occupied voxels to predict object categories. However, a dense 3D volumetric grid that is only sparsely occupied leads to large memory and computational cost at high spatial resolution. As a result, several improvements have been proposed to address the sparsity problem. Kd-Net klokov2017escape () uses a kd-tree bentley1975multidimensional () to build an efficient 3D space partition structure and a deep architecture to learn representations of the point cloud. Similarly, OctNet riegler2017octnet () applies 3D convolution on a hybrid grid-octree structure generated from a set of shallow octrees to achieve high resolution.
Learning features from unstructured point cloud directly.
PointNet qi2017pointnet () is the pioneering work that proposed applying deep learning directly to the raw point cloud. In more detail, a Multi-Layer-Perceptron (MLP) network and a symmetric function (e.g. max pooling) are applied to every individual point to extract a global feature. This approach provides an efficient way to understand unstructured point clouds; however, local features are not captured, as the architecture only works on independent points without measuring the relationships between points in local regions. To address this problem, PointNet++ qi2017pointnet++ () constructs a hierarchical neural network that recursively applies PointNet with a sampling layer and a grouping layer to exploit local representations. DGCNN wang2018dynamic () extends PointNet by presenting an edge convolution operation (EdgeConv) that is applied to edge features, which aggregate each point and the corresponding edges connecting it to its neighbors. In order to leverage the advantages of standard CNN operations, PointCNN li2018pointcnn () attempts to learn an $\mathcal{X}$-convolutional operator that transforms a given unordered point set into a latent canonical order, after which a typical CNN architecture is used to extract local features.
Learning features from multi-view models.
In order to apply standard CNN operations while avoiding the large computational cost of volumetric methods, some researchers have turned to multi-view approaches. For instance, qi2016volumetric (); wang2017dominant () learn features of point clouds indirectly by applying a typical 2D CNN architecture to multiple 2D image views generated by multi-view projections of the 3D point cloud. However, these multi-view approaches are not able to perform semantic segmentation of point clouds, as 2D images lack depth information, which makes it non-trivial to classify each point from images.
Learning features from geometric deep learning.
Geometric deep learning bronstein2017geometric () is a modern term for a set of emerging techniques that attempt to address non-Euclidean structured data (e.g. 3D point clouds, social networks or genetic networks) with deep neural networks. Graph CNNs bruna2013spectral (); defferrard2016convolutional (); zhang2018graph () show the advantages of graph representations in many tasks on non-Euclidean data, as they can naturally deal with these irregular structures. PointGCN ZhangR_18_gcnn_point_cloud () builds a graph CNN architecture to capture local structure and classify point clouds, which also shows that geometric deep learning has great potential for unordered point cloud analysis.
3 GAPNet architecture
In this section, we propose our GAPNet model to better learn local representations for unstructured point cloud in shape classification and part segmentation tasks. We detail the model that consists of three components: GAPLayer (multi-head graph attention based point network layer) that is shown in Figure 2 , attention pooling layer, and GAPNet architecture shown in Figure 3 .
Let $X = \{x_i \in \mathbb{R}^F, i = 1, 2, \dots, N\}$ be a raw set of unordered points that serves as the input of our model, where $N$ is the number of points and $x_i$ is an $F$-dimensional feature vector that might contain 3D space coordinates $(x, y, z)$, color, intensity, surface normal, etc. For the sake of simplicity, in this study we set $F = 3$ and only use 3D coordinates as input features.
Local structure representation.
Considering that the number of points in a point cloud can be very large in real applications (e.g. autonomous vehicles), allowing every point to attend to all other points would lead to high computational cost and to vanishing gradients, because the weight allocated to each individual point becomes very small. As a result, we construct a directed $k$-nearest neighbor graph $G = (V, E)$ to represent the local structure of the point cloud, where $V$ are the nodes for points, $E$ are the edges connecting neighboring pairs of points, and $N_i$ is the neighborhood set of point $x_i$. We define edge features $y_{ij}$ for $i \in V$ and $j \in N_i$, where $x_{ij}$ denotes the neighboring point $x_j$ of point $x_i$.
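The graph construction described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names `knn_graph` and `edge_features` are hypothetical, the brute-force search is chosen for clarity, excluding the point itself from its own neighborhood is an assumption, and the edge feature is assumed here to be the coordinate offset between a point and its neighbor.

```python
import math

def knn_graph(points, k):
    """For each point index i, return the indices of its k nearest neighbors.

    Assumption: a point is not counted as its own neighbor.
    """
    neighbors = []
    for i, p in enumerate(points):
        # Rank every other point by Euclidean distance to p and keep the k closest.
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(order[:k])
    return neighbors

def edge_features(points, neighbors):
    """Edge feature for each directed pair (i, j), assumed to be the offset x_i - x_j."""
    return [
        [[a - b for a, b in zip(points[i], points[j])] for j in nbrs]
        for i, nbrs in enumerate(neighbors)
    ]

points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
nbrs = knn_graph(points, k=2)
edges = edge_features(points, nbrs)
```

The graph is directed: point 3 lists points 2 and 1 as neighbors, but neither of them lists point 3 in return. A practical implementation would replace the quadratic search with a batched distance matrix or a spatial index.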
For the benefit of the reader, we first introduce a single-head GAPLayer that takes point cloud data as input, and then the multi-head mechanism that concatenates all heads over feature channels in our network. The structure of the single-head GAPLayer is shown in Figure 2(b).
In order to assign different attention to different neighbors, we propose a self-attention mechanism and a neighboring-attention mechanism to capture the attention coefficients of each point with respect to its neighborhood, as illustrated in Figure 1. In more detail, the self-attention mechanism learns self-coefficients by considering the self-geometric information of each individual point, while the neighboring-attention mechanism learns local-coefficients by considering the neighborhood.
where h() is a parametric non-linear function, chosen to be a single-layer neural network in our experiment, and $\theta$ is a set of learnable parameters of the filter.
We obtain attention coefficients by fusing the self-coefficients and local-coefficients as defined by Equation 3, where both are produced by single-layer neural networks with 1-dimensional output, and the non-linear activation function is leaky ReLU.
In order to make the attention coefficients comparable across the neighbors of different points, we use the softmax function to normalize the coefficients over all the neighbors of every point, as given in Equation 4.
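One way to write the computation described above is sketched below. The notation is assumed for illustration: $x'_i$ and $y'_{ij}$ denote the encoded point and edge features, $c_{ij}$ the fused coefficient, $\alpha_{ij}$ its normalized value, and $\theta$ stands for the learnable parameters of each layer (the layers need not share parameters). Fusing the two coefficients by addition is likewise an assumption.

```latex
\begin{align*}
x'_i &= h(x_i, \theta), \qquad y'_{ij} = h(y_{ij}, \theta) \\
c_{ij} &= \mathrm{LeakyReLU}\big(h(x'_i, \theta) + h(y'_{ij}, \theta)\big) \\
\alpha_{ij} &= \frac{\exp(c_{ij})}{\sum_{k \in N_i} \exp(c_{ik})}
\end{align*}
```

Because the softmax runs over $N_i$ only, the coefficients of each point sum to one over its own neighbors, independently of every other point.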
The goal of each single-head GAPLayer is to compute a contextual attention feature for every point. To this end, we use the normalized coefficients to compute a linear combination, as shown in Equation 5. As shown in Figure 2(b), the outputs of a single-head GAPLayer are the attention feature and the graph feature encoded from graph edges.
where f() is a non-linear activation function, chosen to be ReLU in our experiment.
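The single-head computation for one point can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the helper names, the bias-free layers, the separate weights `w_point` and `w_edge`, the additive fusion of the two coefficients, and the leaky ReLU slope of 0.2 are all assumptions.

```python
import math

def linear(vec, weights):
    """Bias-free single-layer map; `weights` holds one row per output channel."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def leaky_relu(x, slope=0.2):  # the slope value is an assumption
    return x if x > 0 else slope * x

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def single_head_gap(x_i, edges_i, w_point, w_edge, a_self, a_nbr):
    """Return (attention feature, graph features, attention weights) for one point."""
    x_enc = linear(x_i, w_point)                   # encoded point feature
    y_enc = [linear(e, w_edge) for e in edges_i]   # encoded edge (graph) features
    self_coef = dot(a_self, x_enc)                 # scalar self-coefficient
    # Fuse self- and local-coefficients (assumed additive), then apply leaky ReLU.
    scores = [leaky_relu(self_coef + dot(a_nbr, y)) for y in y_enc]
    alpha = softmax(scores)                        # normalized over the neighbors
    combined = [sum(a * y[c] for a, y in zip(alpha, y_enc))
                for c in range(len(x_enc))]
    attn_feature = [max(0.0, v) for v in combined]  # ReLU
    return attn_feature, y_enc, alpha

w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # toy projection from 3 to 2 channels
feat, graph_feat, alpha = single_head_gap(
    [1.0, 2.0, 3.0],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    w, w, [0.5, 0.5], [1.0, -1.0],
)
```

In this toy setting the first neighbor receives the largest fused coefficient and therefore the largest weight, so its encoded edge feature dominates the attention feature.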
In order to obtain sufficient structural information and stabilize the network, we concatenate $M$ independent single-head GAPLayers to generate multi-attention features with $M \times F'$ channels, as defined in Equation 6. As shown in Figure 2(a), the outputs of the multi-head GAPLayer (GAPLayer for short) are multi-attention features and multi-graph features, which concatenate the attention features and graph features, respectively, from the corresponding heads.
where $\hat{x}_i^{(m)}$ is the attention feature of the $m$-th head, $M$ is the total number of heads, and $\Vert$ is the concatenation operation over feature channels.
3.2 Attention pooling layer
To enhance network robustness and improve performance, we define an attention pooling layer over the neighbor dimension of the multi-graph features. We use max pooling as the attention pooling operation, which identifies the most important feature across heads to capture the local signature representation defined in Equation 7. The local signature is connected to the intermediate layer for capturing the global feature.
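The head concatenation and the attention pooling step can be sketched as below. The function names are hypothetical, and the reading that the maximum is taken over the neighbors of a point, channel by channel and head by head, before the heads are concatenated is an assumption about the layer.

```python
def concat_heads(head_features):
    """Concatenate the per-head attention features of one point over channels."""
    return [c for feat in head_features for c in feat]

def attention_pooling(head_graph_features):
    """Local signature of one point.

    `head_graph_features` holds, for each head, a list of k neighbor rows with
    the same number of channels. The maximum is taken over the k neighbors for
    every channel (assumed reading), and the heads are then concatenated.
    """
    pooled = []
    for graph_feat in head_graph_features:
        # zip(*rows) groups the values of one channel across all neighbors.
        pooled.extend(max(channel) for channel in zip(*graph_feat))
    return pooled

# Two heads, two channels per head, two neighbors.
multi_attention = concat_heads([[0.1, 0.2], [0.3, 0.4]])
local_signature = attention_pooling([
    [[1.0, 5.0], [3.0, 2.0]],    # head 0
    [[0.0, -1.0], [-2.0, 4.0]],  # head 1
])
```

Both outputs have one entry per head and channel, which is what allows the local signature to be joined with later per-point features through a skip connection.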
3.3 GAPNet architecture
Our GAPNet model shown in Figure 3 considers both shape classification and semantic part segmentation for point cloud. The architecture is similar to PointNet qi2017pointnet (). However, there are three main differences between the architectures. Firstly, we use an attention-aware spatial transform network to make the point cloud invariant to certain transformations. Secondly, instead of only processing individual points, we exploit local features by a GAPLayer before the stacked MLP layers. Thirdly, an attention pooling layer is used to obtain local signature that is connected to the intermediate layer for capturing a global descriptor.
In this section, we evaluate our GAPNet model on the classification and part segmentation tasks for 3D point cloud analysis. We then compare our performance with recent state-of-the-art methods and perform an ablation study to investigate different design variations.
We demonstrate the effectiveness of our classification model on the ModelNet40 benchmark wu20153d () for shape classification. The ModelNet40 dataset contains 12,311 meshed CAD models that are classified into 40 man-made categories. We use 9,843 models for training and 2,468 models for testing. We then normalize the models to the unit sphere and uniformly sample 1,024 points over each model surface. We further augment the training dataset by randomly rotating and scaling the point cloud and by jittering the location of every point with Gaussian noise with zero mean and 0.01 standard deviation.
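Two of the preprocessing steps above, normalization to the unit sphere and Gaussian jitter, can be sketched as follows. The function names are hypothetical, and centering on the centroid before scaling is an assumption, since the text only states that models are normalized to the unit sphere. Random rotation and scaling are omitted for brevity.

```python
import math
import random

def normalize_to_unit_sphere(points):
    """Center the cloud on its centroid (assumed) and scale the farthest point to radius 1."""
    n = len(points)
    centroid = [sum(p[d] for p in points) / n for d in range(3)]
    centered = [[p[d] - centroid[d] for d in range(3)] for p in points]
    # Guard against a degenerate cloud whose points all coincide.
    scale = max(math.sqrt(sum(c * c for c in p)) for p in centered) or 1.0
    return [[c / scale for c in p] for p in centered]

def jitter(points, sigma=0.01, rng=random):
    """Add zero-mean Gaussian noise with standard deviation `sigma` to every coordinate."""
    return [[c + rng.gauss(0.0, sigma) for c in p] for p in points]

cloud = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
unit = normalize_to_unit_sphere(cloud)
noisy = jitter(unit, sigma=0.01, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible, which is convenient when comparing training runs.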
The classification model is presented in Figure 3 (top branch). In order to make the input points invariant to some geometric transformations, such as scaling and rotation, we first apply an attention-aware spatial transformer network to align the point cloud to a canonical space. The network employs a single-head GAPLayer with 16 channels to capture attention features, followed by three shared MLP layers with output sizes 64, 128 and 1024, a max pooling operation, and two fully connected layers (512, 256) that finally generate a transformation matrix.
A multi-head GAPLayer is then applied to generate multi-attention features with $M \times F'$ channels, where $M$ is the number of heads and $F'$ is the number of encoding channels per head. The multi-attention features are aggregated with the coordinate features of the point cloud to obtain contextual attention features, which are then passed through four shared MLP layers (64, 64, 64, 128) to extract fine-grained features. Skip connections link the local signature to these intermediate layers, followed by a shared fully connected layer (1024) and a max pooling operation over feature channels to obtain a global feature for the entire point cloud. Finally, we apply three shared MLP layers (512, 256, 40) with dropout (keep probability 0.5) to map the global feature to 40 categories. ReLU activation with batch normalization is used in each layer, and the number of neighbors $k$ is set to 20.
During training, we use the Adam optimizer kingma2014adam () with momentum 0.9 and a batch size of 32. The learning rate starts at 0.005 and is divided by 2 every 20 epochs until it reaches 0.00001. The decay rate for batch normalization is initially set to 0.7 and gradually increases to 0.99. Our model is trained on an NVIDIA GTX 1080 Ti GPU with TensorFlow v1.6.
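The learning-rate schedule stated above can be written as a small function. The function name is hypothetical, and the staircase form (the rate stays constant within each 20-epoch block) is an assumption about how the halving is applied.

```python
def learning_rate(epoch, base_lr=0.005, decay_every=20, factor=0.5, min_lr=0.00001):
    """Halve the rate every `decay_every` epochs, never going below `min_lr`."""
    return max(base_lr * factor ** (epoch // decay_every), min_lr)

# Rates at the start of training, at the first two halvings, and late in training.
schedule = [learning_rate(e) for e in (0, 19, 20, 40, 400)]
```

With these values the floor of 0.00001 is reached after nine halvings, that is, from epoch 180 onwards.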
Table 1 compares our results and complexity with several recent state-of-the-art works. Our model achieves the best performance on the ModelNet40 benchmark, outperforming the previous state-of-the-art model DGCNN by 0.2% in accuracy.
To compare complexity, we measure model complexity and computational complexity using model size and forward time, respectively. Table 1 lists the same metrics for all the available models, evaluated in the same experimental environment. Although PointNet has the lowest computational cost, our model outperforms it by 3.1% in accuracy, so our model achieves the best trade-off between accuracy and complexity.
|Method|Mean class accuracy (%)|Overall accuracy (%)|Model size|Forward time|
|---|---|---|---|---|
|VoxNet maturana2015voxnet ()|83.0|85.9|-|-|
|PointNet qi2017pointnet ()|86.0|89.2|41.8|14.7|
|PointNet++ qi2017pointnet++ ()|-|90.7|19.9|32.0|
|Kd-Net klokov2017escape ()|-|91.8|-|-|
|KC-Net shen2018mining ()|-|91.0|-|-|
|DGCNN wang2018dynamic ()|90.2|92.2|22.1|52.0|
We also test our classification model with different settings on the ModelNet40 benchmark wu20153d () . In particular, we analyze the effectiveness of the GAPLayer, attention pooling layer, and also different numbers of multiple heads and encoding channels.
Table 8 shows the advantages of our GAPLayer and attention pooling layer. The attention pooling layer improves accuracy by 0.6%. Constant-GAPLayer denotes a model with the same structure as our GAPLayer but with all attention coefficients set to equal constants; the comparison shows the effectiveness of the graph attention mechanism, with our GAPLayer improving accuracy by 0.7%.
Regarding the impact of different numbers of heads and encoding channels, Table 8 indicates that appropriate numbers are beneficial to local feature extraction; however, the performance degrades when the numbers become larger.
4.2 Semantic part segmentation
We evaluate our segmentation model on the ShapeNet part dataset yi2016scalable () for the semantic part segmentation task, which is to classify the part category of each point of a mesh model. The dataset consists of 16,881 CAD shapes from 16 categories, and each point of a model is annotated with one of 50 part classes. Each shape model is labeled with several, but fewer than 6, parts. We follow the same sampling strategy as Section 4.1 to sample 2,048 points uniformly, and split the dataset into 9,843 models for training and 2,468 models for testing in our experiment.
Our segmentation model, shown in Figure 3 (bottom branch), predicts a part category label for each point in the point cloud. We first use the same spatial transformer network and GAPLayer as in Section 4.1, followed by shared MLP layers (64, 64, 128). A second GAPLayer with 4 heads and 128 encoding channels is then applied, followed by shared MLP layers (128, 128, 512) to obtain representations with 512 channels, which are concatenated with the local signature generated by the corresponding attention pooling layer of the GAPLayer. The aggregated feature is passed through a shared fully connected layer (1024) and a max pooling operation to obtain a global feature, which is then duplicated 2,048 times and passed through four shared fully connected layers (256, 256, 128, 50) with dropout probability 0.6 to map the global feature to 50 part categories.
The training setting is similar to that of the classification task, except that the batch size is set to 8, the number of neighbors is set to 30, and we distribute the task over two NVIDIA Tesla V100 GPUs.
We use the mean Intersection over Union (mIoU) qi2017pointnet () as our evaluation metric, for consistency with previous work. The IoU of each shape is calculated by averaging the IoUs of all parts that belong to the shape's category, and the mIoU is the mean of the IoUs over all shapes in the test set.
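The metric described above can be sketched as follows. The function names are hypothetical, and counting a part that is absent from both the prediction and the ground truth as an IoU of 1 is an assumed convention, which the text does not specify.

```python
def shape_iou(pred, truth, part_labels):
    """Average IoU over all part labels that belong to the shape's category."""
    ious = []
    for part in part_labels:
        inter = sum(1 for p, t in zip(pred, truth) if p == part and t == part)
        union = sum(1 for p, t in zip(pred, truth) if p == part or t == part)
        # Assumed convention: a part missing from both labelings scores 1.
        ious.append(1.0 if union == 0 else inter / union)
    return sum(ious) / len(ious)

def mean_iou(shapes):
    """Mean of the per-shape IoUs; each item is (pred, truth, part_labels)."""
    return sum(shape_iou(*s) for s in shapes) / len(shapes)

shapes = [
    ([0, 0, 1, 1], [0, 1, 1, 1], [0, 1]),   # one point mislabeled
    ([2, 2, 3], [2, 2, 3], [2, 3, 4]),      # perfect prediction, part 4 absent
]
first = shape_iou(*shapes[0])
score = mean_iou(shapes)
```

In the first shape, part 0 has IoU 1/2 and part 1 has IoU 2/3, so the shape IoU is their average; the second shape scores 1 under the assumed convention.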
Table 2 shows that our model achieves competitive results on the ShapeNet part dataset yi2016scalable (). Our model wins 8 categories for part segmentation, compared with 6 winning categories for DGCNN wang2018dynamic (), although DGCNN outperforms our model by 0.4% overall. Figure 4(a) shows some shapes from our results. We also visualize the difference between the ground truth and our predictions in Figure 4(b), where the left shapes indicate the ground truth and the right shapes show our prediction results.
In this paper, we propose a graph attention based point neural network, named GAPNet, to learn shape representations for point clouds. Experiments show state-of-the-art performance in shape classification and semantic part segmentation tasks. The success of our model also shows that graph attention networks are effective not only for similarity computation between graph nodes, but also for understanding geometric relationships.
In the future, we can further explore several research avenues. For example, some applications, such as autonomous vehicles, normally need to process very large-scale point cloud data. As a result, how to deal with large-scale data efficiently and robustly would be worthwhile future work. Furthermore, it would be interesting to develop an efficient CNN-like operation for unstructured data analysis.
The HumanDrive project is a CCAV - Innovate UK funded R&D project (Project ref: 103283).
- (1) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- (2) Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
- (3) Joydeep Biswas and Manuela Veloso. Depth camera based indoor mobile robot localization and navigation. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 1697–1702. IEEE, 2012.
- (4) Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
- (5) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
- (6) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pages 3844–3852, 2016.
- (7) Aleksey Golovinskiy, Vladimir G Kim, and Thomas Funkhouser. Shape-based recognition of 3d point clouds in urban environments. In Computer Vision, 2009 IEEE 12th International Conference on, pages 2154–2161. IEEE, 2009.
- (8) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- (9) Roman Klokov and Victor Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In Proceedings of the IEEE International Conference on Computer Vision, pages 863–872, 2017.
- (10) Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven L Waslander. Joint 3d proposal generation and object detection from view aggregation. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–8. IEEE, 2018.
- (11) Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. In Advances in Neural Information Processing Systems, pages 828–838, 2018.
- (12) Zhongze Liu, Huiyan Chen, Huijun Di, Yi Tao, Jianwei Gong, Guangming Xiong, and Jianyong Qi. Real-time 6d lidar slam in large scale natural terrains for ugv. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 662–667. IEEE, 2018.
- (13) Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928. IEEE, 2015.
- (14) Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in neural information processing systems, pages 2204–2212, 2014.
- (15) Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb-d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 918–927, 2018.
- (16) Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
- (17) Charles R Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas. Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656, 2016.
- (18) Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
- (19) Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3577–3586, 2017.
- (20) Yiru Shen, Chen Feng, Yaoqing Yang, and Dong Tian. Mining point cloud local structures by kernel correlation and graph pooling. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4548–4557, 2018.
- (21) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- (22) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
- (23) Chu Wang, Marcello Pelillo, and Kaleem Siddiqi. Dominant set clustering and pooling for multi-view 3d object recognition. In Proceedings of British Machine Vision Conference (BMVC), volume 12, 2017.
- (24) Dominic Zeng Wang and Ingmar Posner. Voting for voting in online point cloud object detection. In Robotics: Science and Systems, volume 1, pages 10–15607, 2015.
- (25) Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.
- (26) Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
- (27) Li Yi, Vladimir G Kim, Duygu Ceylan, I Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, Leonidas Guibas, et al. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG), 35(6):210, 2016.
- (28) Yingxue Zhang and Michael Rabbat. A graph-cnn for 3d point cloud classification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6279–6283. IEEE, 2018.
- (29) Yingxue Zhang and Michael Rabbat. A graph-cnn for 3d point cloud classification. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018.
- (30) Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4490–4499, 2018.
- (31) Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 3357–3364. IEEE, 2017.