Indoor positioning aims at navigation inside areas where no GPS data is available, and can be employed in many applications such as augmented reality and autonomous driving, especially inside closed areas and tunnels. In this paper, a deep neural network based architecture is proposed to address this problem. A tandem set of convolutional neural networks, together with a Pix2Pix GAN, are leveraged to act as the scene classifier, the scene RGB-image-to-point-cloud converter, and the position regressor, respectively. The proposed architecture outperforms previous works, including our recent one, in that it makes the data generation task easier and more robust against small scene variations, while the positioning accuracy remains remarkably high for both the Cartesian position and the quaternion orientation of the camera.


Ali Ghofrani, Rahil Mahdian Toroghi, Seyed Mojtaba Tabatabaie, Seyed Maziar Tabasi
Faculty of Technology and Media Engineering, Iran Broadcasting University (IRIBU), Tehran, Iran
CEO/CTO at Alpha Reality, AR/VR Solution Company, {smtabatabaie,m.tabasi}

Keywords: indoor positioning, point cloud data, convolutional neural networks, generative adversarial networks, Pix2Pix GAN

1 Introduction

Outdoor positioning is largely a solved problem, addressed by navigation systems built on GPS satellites. Indoor positioning, on the other hand, remains a challenging task: inside covered areas where no GPS signal is available, image-processing methods (e.g., SIFT and SURF feature matching) are the only solutions to resort to, and these methods are not very accurate [10]. The main reason is the presence of many near-identical patterns inside buildings, which can easily fool the positioning system.

The first data-driven approach using convolutional neural networks (CNNs) was PoseNet [5], which worked for a limited open area. Later, a geometry-aware camera-localization system was proposed that incorporated perceptual and temporal features to improve precision [3]. However, both of these methods target outdoor positioning. Most traditional indoor positioning systems that do not rely on wireless means [15] require a depth-assisted camera [16], which is not always available in real-world scenarios such as mobile handsets.

The first indoor positioning system using deep neural networks was proposed by the authors of this paper [2], by scanning the desired area segment-by-segment using photogrammetry. A classifier built on a CNN structure (EfficientNet [12]) is trained, followed by a MobileNet CNN [9] trained to act as a regressor. This structure achieved remarkably precise results for the Cartesian position and quaternion orientation of the camera [2]. The remaining challenge of that work is that generating such a huge amount of RGB data for training the deep neural network is an overwhelming task. Moreover, when the area is subject to small changes, the RGB-based data is no longer trustworthy and the output of the previous system is not robust at all.

A solution to the aforementioned problem is to generate point-cloud data using a LiDAR system rather than RGB cameras, which is both easier and more robust.

In this work, we extend our research to investigate whether our regressor CNN can be driven by point-cloud data rather than RGB images. Wang et al. [14] showed that objects can be detected from their associated point-cloud data. Shi et al. [11], in turn, showed that point-cloud data can be rendered into associated images using GAN networks [4].

Figure 1: A big-picture of the LiDAR-based indoor positioning.

Following these two works, as illustrated in figure 1, the CNN regressor is trained on point-cloud data instead of RGB images. Moreover, since clients normally have access only to RGB images on their mobile handsets, a converter is needed that maps RGB data to its associated point-cloud data; we use a Pix2Pix GAN to learn this mapping. This makes the training procedure much easier than in our previous work, and under small environmental changes the model performs more robustly than before. These are the explicit novelties of our work.

2 The Proposed Framework

Following our previous work [2], these steps are taken in sequence: 1) The client's input image is given to a classifier to determine the associated scene. Segmenting the desired environment into scenes is optional; however, once the number of scenes is decided it must be fixed, and the classifier trained accordingly. The structure of this scene classifier, an EfficientNet B0 [12], is depicted in figure 3.

Figure 2: Image-to-image (RGB-2-Pointcloud) translation, using Pix2Pix GAN  [4]
Figure 3: Scene classifier based on EfficientNet B0.
Figure 4: Sequences of Camera movements for each scene

2) Once the classifier determines the scene, the RGB image is converted to its associated point cloud using a Pix2Pix U-Net-based GAN [8]. 3) The generated point-cloud data is fed into the CNN-based regressor trained for that scene.
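The three-stage inference path can be sketched with stand-in functions. The stubs below are purely illustrative placeholders for the trained networks (EfficientNet B0 classifier, Pix2Pix U-Net generator, per-scene MobileNet V2 regressor); only the control flow reflects the paper's pipeline.

```python
# Illustrative sketch of the three-stage inference pipeline.
# Each stub stands in for a trained network:
#   classify_scene    -> EfficientNet B0 scene classifier
#   rgb_to_pointcloud -> Pix2Pix U-Net generator
#   regress_pose      -> per-scene MobileNet V2 regressor

def classify_scene(rgb_image):
    # Stand-in: a real system returns the scene index with the
    # highest softmax score.
    return 0

def rgb_to_pointcloud(rgb_image):
    # Stand-in: a real system renders the point-cloud view that
    # corresponds to the RGB input.
    return [[0.0, 0.0, 0.0]]

def regress_pose(pointcloud, scene_id):
    # Stand-in: a real system outputs 3 Cartesian values followed
    # by a 4-value quaternion (identity rotation here).
    return [0.0] * 3 + [1.0, 0.0, 0.0, 0.0]

def localize(rgb_image):
    scene_id = classify_scene(rgb_image)       # step 1: pick the scene
    pointcloud = rgb_to_pointcloud(rgb_image)  # step 2: RGB -> point cloud
    return regress_pose(pointcloud, scene_id)  # step 3: pose regression
```

The key design point is that only the regressor of the selected scene is invoked, so each regressor specializes in one segment of the environment.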

Figure 5: MobileNet V2, as the regressor trained by the point-cloud dataset

Based on the above procedure, we first need to train a U-Net-based GAN [8] to map RGB images to point-cloud data. For this purpose, the network can be trained with a small number of data samples containing pairs of RGB images and their associated, aligned point-cloud data. This network is depicted in figure 2.
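Pix2Pix [4] trains the generator with an adversarial term plus an L1 reconstruction term against the paired target, weighted by a factor lambda (lambda = 100 in the original Pix2Pix paper). A minimal numeric sketch of that generator objective, with flattened pixel lists standing in for images:

```python
import math

def l1_loss(target, generated):
    # Mean absolute difference between the target point-cloud image
    # and the generator output (flattened pixel lists).
    n = len(target)
    return sum(abs(t - g) for t, g in zip(target, generated)) / n

def bce(prediction, label):
    # Binary cross-entropy on a single discriminator score in (0, 1).
    eps = 1e-7
    p = min(max(prediction, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

def pix2pix_generator_loss(disc_score_on_fake, target, generated, lam=100.0):
    # The generator wants the discriminator to label its output as
    # real (1), while also staying close to the paired ground truth in L1.
    return bce(disc_score_on_fake, 1.0) + lam * l1_loss(target, generated)
```

The L1 term is what ties the generated point-cloud image to its specific RGB pair, which is why only a small aligned dataset is needed.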

Next, we train the regressor network, which takes the generated point-cloud data as input and estimates the 7 values of Cartesian position and quaternion orientation as output. This CNN regressor (based on MobileNet V2) is depicted in figure 5.
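The regressor's 7-dimensional output mixes Cartesian coordinates with a quaternion, and since only unit quaternions represent rotations, the predicted quaternion part is typically renormalized before use. A small post-processing sketch (the helper name and the exact output ordering are assumptions, not taken from [2]):

```python
import math

def split_pose(outputs):
    # Split the 7 regressor outputs into position (x, y, z) and
    # quaternion (w, x, y, z), renormalizing the quaternion to unit length.
    assert len(outputs) == 7
    position = outputs[:3]
    q = outputs[3:]
    norm = math.sqrt(sum(v * v for v in q))
    quaternion = [v / norm for v in q]
    return position, quaternion
```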

3 Experiments and Analytics

The hardware used for the present work is an NVIDIA GTX 1080 GPU and an Intel Core i7-7700 CPU with 32 GB of RAM. The tasks are implemented in TensorFlow 1.13.1 with CUDA 10.1 and Keras 2.2.4.

Since no available dataset contained RGB images with associated point clouds, we generated one from the freely available 3D scans of the Hallwyl Museum in Stockholm [1]. We sampled from this 3D model using the Unity software, and the normalized outputs are saved in our generated dataset (released at mega.nz: !FE9HFCLS!vHH7vqEd5PAFF-ItGR44ww). Data samples are generated from all the scenes using the different camera-movement regimes depicted in figure 4, and the equivalent point-cloud data is created for each image sample.

To create the point clouds inside the Unity software, we modified the environment's mesh descriptor from a surface shader to a geometry shader, in which the mesh vertices are rendered as points. Thus, for each RGB image the equivalent point-cloud data is created. Since training the GAN requires pairs of RGB and associated point-cloud data, and the scene classifier must also be trained on RGB images of the scenes, this may give the wrong impression that RGB images are again in heavy use. However, the amount of RGB images needed for the GAN is sufficient to train the classifier as well. This has been verified, and the resulting confusion matrix is depicted in figure 6.
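The geometry-shader trick amounts to emitting each mesh vertex as a point. An offline equivalent of that conversion (a hypothetical helper, not the Unity shader itself) can be sketched as collecting the vertices and rescaling them into a common normalized frame:

```python
def mesh_to_pointcloud(vertices):
    # Treat each mesh vertex as a 3D point, then rescale all coordinates
    # into [0, 1] so every scene shares a common normalized frame.
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    zs = [v[2] for v in vertices]
    lo = (min(xs), min(ys), min(zs))
    span = max(max(xs) - lo[0], max(ys) - lo[1], max(zs) - lo[2]) or 1.0
    return [tuple((v[i] - lo[i]) / span for i in range(3)) for v in vertices]
```

Uniform scaling by a single span (rather than per-axis) preserves the scene's aspect ratio, which matters when the regressor must recover metric positions.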

Figure 6: The confusion matrix for the classification of the scenes through EfficientNet.

For the classifier, the loss function is categorical cross-entropy, and the model is monitored toward maximizing the validation accuracy.
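With one-hot scene labels, categorical cross-entropy reduces to the negative log of the probability the softmax assigns to the true scene. A minimal sketch of this per-sample loss:

```python
import math

def categorical_cross_entropy(probs, true_index):
    # probs: softmax output over the scenes; true_index: ground-truth
    # scene id. Clamped to avoid log(0) on degenerate outputs.
    eps = 1e-7
    return -math.log(max(probs[true_index], eps))
```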

Figure 7: (Left to right) input Point cloud, generated RGB-GAN output, and the ground truth RGB.
Figure 8: Classification accuracy (left), and loss (right) based on the categorical cross-entropy.

To achieve optimum performance, DropConnect is employed to avoid overfitting [13]. In addition, the state-of-the-art Swish activation function is used [7].
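Swish [7] is simply x multiplied by the sigmoid of x, which is smooth and non-monotonic near zero; a one-line sketch:

```python
import math

def swish(x):
    # Swish activation: x * sigmoid(x) = x / (1 + exp(-x)).
    return x / (1.0 + math.exp(-x))
```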

To train the regressors, since the input dataset consists of point clouds, the ImageNet-based pretrained parameters cannot be reused in a transfer-learning procedure. Therefore, the regressors are trained entirely from scratch using the Xavier weight-initialization technique [6]. The loss curves are depicted in figure 9.
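Xavier (Glorot) initialization scales the weight distribution by the layer's fan-in and fan-out so that activation variance is preserved across layers. A sketch of the uniform variant (the seeded RNG is for reproducibility of the sketch only):

```python
import random

def xavier_uniform(fan_in, fan_out, rng=None):
    # Uniform Xavier/Glorot bound: limit = sqrt(6 / (fan_in + fan_out)).
    rng = rng or random.Random(0)
    limit = (6.0 / (fan_in + fan_out)) ** 0.5
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]
```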

Figure 9: (left) Quaternion loss, (right) Cartesian Loss. Losses are to be scaled using the scale factor in the loss function.

The loss function is chosen as in [2]:

loss(I) = \|\hat{X} - X\|_2 + \beta \, \Big\| \hat{Q} - \frac{Q}{\|Q\|} \Big\|_2

where X is the position data vector, Q is the quaternion information, and \beta is the scale factor that balances position estimation against quaternion estimation.
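As in [2], the loss combines the position error with the quaternion error scaled by a balance factor beta; a minimal sketch (the default beta value here is illustrative, not taken from the paper):

```python
import math

def euclidean(a, b):
    # L2 distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pose_loss(pred_pos, true_pos, pred_quat, true_quat, beta=250.0):
    # Position term plus quaternion term; the ground-truth quaternion is
    # normalized to unit length, and beta balances the two scales.
    norm = math.sqrt(sum(q * q for q in true_quat))
    unit_q = [q / norm for q in true_quat]
    return euclidean(pred_pos, true_pos) + beta * euclidean(pred_quat, unit_q)
```

A larger beta pushes the optimizer to favor orientation accuracy over position accuracy, which is why the reported loss curves are scale-dependent.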

Figure 10: (Left to right) RGB input data, generated point cloud-GAN output, and the ground truth point cloud.

The GAN training is based on the RGB-2-Point cloud data, which has been generated, as mentioned before. A sample of this data has been depicted in figure 10. In a further investigation, we turned the GAN to work as a point cloud to RGB converter. Interestingly, the same network could perform quite well, as depicted in figure 7.

             X-position   Y-position   Z-position   Quaternion
Error Value  0.019 m      0.027 m      0.0073 m     0.0096
Table 1: The regression error for the position vector (X, Y, Z) and the camera quaternion, over the test set (unseen data).

4 Conclusion

An indoor positioning system has been proposed in this paper, based on a supervised deep network structure. The goal of the system is to achieve high accuracy for the Cartesian (X, Y, Z) position and the camera quaternion, while remaining robust against environmental changes and object movements. A CNN-based classifier identifies the scene of the environment from the client's input RGB image. A GAN network, prepared in advance, converts the RGB images into point-cloud data, which is more easily obtainable and more robust against variations of the scene background. The regressor CNNs are trained solely on the point clouds. The experimental results showed a remarkable positioning achievement, while making the entire procedure of our previous work much easier to perform.


  • [1] The Hallwyl Museum, Stockholm — freely available 3D-scanned models. Cited by: §3.
  • [2] A. Ghofrani, R. M. Toroghi, and S. M. Tabatabaie (2019) ICPS-net: an end-to-end rgb-based indoor camera positioning system using deep convolutional neural networks. arXiv preprint arXiv:1910.06219. Cited by: §1, §2, §3.
  • [3] J. F. Henriques and A. Vedaldi (2018) Mapnet: an allocentric spatial memory for mapping environments. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8476–8484. Cited by: §1.
  • [4] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2016) Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004. Cited by: §1, Figure 2.
  • [5] A. Kendall, M. Grimes, and R. Cipolla (2015) Posenet: a convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision, pp. 2938–2946. Cited by: §1.
  • [6] S. K. Kumar (2017) On weight initialization in deep neural networks. CoRR abs/1704.08863. External Links: Link, 1704.08863 Cited by: §3.
  • [7] P. Ramachandran, B. Zoph, and Q. V. Le (2017) Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941 7. Cited by: §3.
  • [8] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §2, §2.
  • [9] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §1.
  • [10] T. Sattler, B. Leibe, and L. Kobbelt (2016) Efficient & effective prioritized matching for large-scale image-based localization. IEEE transactions on pattern analysis and machine intelligence 39 (9), pp. 1744–1756. Cited by: §1.
  • [11] S. Shi, X. Wang, and H. Li (2019-06) PointRCNN: 3d object proposal generation and detection from point cloud. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [12] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: §1, §2.
  • [13] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus (2013) Regularization of neural networks using dropconnect. In International conference on machine learning, pp. 1058–1066. Cited by: §3.
  • [14] Y. Wang, W. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Weinberger (2019) Pseudo-lidar from visual depth estimation: bridging the gap in 3d object detection for autonomous driving. In CVPR, Cited by: §1.
  • [15] C. Yang and H. Shao (2015) WiFi-based indoor positioning. IEEE Communications Magazine 53 (3), pp. 150–157. Cited by: §1.
  • [16] F. Zhang, T. Lei, J. Li, X. Cai, X. Shao, J. Chang, and F. Tian (2018) Real-time calibration and registration method for indoor scene with joint depth and color camera. International Journal of Pattern Recognition and Artificial Intelligence 32 (07), pp. 1854021. Cited by: §1.