# Learning to Localize: A 3D CNN Approach to User Positioning in Massive MIMO-OFDM Systems

Chi Wu, Xinping Yi, Wenjin Wang, Li You, Qing Huang, Xiqi Gao

C. Wu, W. Wang, L. You, Q. Huang, and X. Q. Gao are with the National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China (e-mail: chiwu@seu.edu.cn; wangwj@seu.edu.cn; liyou@seu.edu.cn; huangqing@seu.edu.cn; xqgao@seu.edu.cn). X. Yi is with the Department of Electrical Engineering and Electronics, University of Liverpool, L69 3BX, United Kingdom (e-mail: xinping.yi@liverpool.ac.uk).

###### Abstract

Massive MIMO, positioning, deep learning, 3D convolution neural network, fingerprint.

## I Introduction

As location-based applications (LBA) are extensively deployed in modern society, accurate positioning has received enormous attention in both industry and academia [1]. The global positioning system (GPS) has provided real-time outdoor positioning for the mobile terminal (MT), which can reach several meters of accuracy [2]. However, in urban areas, the GPS positioning performance will be significantly degraded due to the blockage of buildings, cars, and pedestrians.

Recently, user positioning that exploits the rich information of multipath wireless propagation has drawn a lot of attention. Various positioning methods have been proposed in the literature, including geometry-based and fingerprint-based methods. Geometry-based positioning is a triangulating-to-localize method that relies on information about the wireless signals from the users to the base stations (BSs), e.g., angle of arrival (AOA) [3], time of arrival (TOA) [4], time difference of arrival (TDOA) [5], and received signal strength (RSS) [6]. However, the estimation errors of AOA/TOA/TDOA/RSS have a crucial influence on positioning accuracy in complex urban environments. In contrast, fingerprint-based positioning is a matching-to-localize method that consists of offline fingerprint database construction followed by online fingerprint matching and location prediction. In the offline phase, reference points (RPs) are selected, and the pairs of fingerprints and corresponding positions are stored in the database. In the online phase, the position of the MT is estimated by searching the collected database and matching the input fingerprint with the stored ones. As an indicator of the surrounding environment, fingerprints have been widely adopted for user positioning in complex multipath environments.

The most common feature used in fingerprint-based positioning is RSS [7, 8, 9], yet the RSS-based fingerprint has two shortcomings. On the one hand, RSS suffers from fast-fading fluctuation and hardware heterogeneity and is therefore unstable for positioning. On the other hand, RSS captures only the coarsest channel information, which cannot meet the demand in complex communication environments. Recently, some researchers proposed to use the channel state information (CSI) as the fingerprint [10, 11, 12, 13]. Because it captures more channel information than RSS, CSI has the potential to enhance the positioning accuracy. In certain scenarios (e.g., wireless sensor networks [11] and WiFi networks [10]), however, the limited bandwidth and number of antennas lead to low resolution in the spatial or frequency domain, so the CSI fingerprint is insufficient to capture the multipath characteristics.

Fortunately, such limitations can be overcome in massive multiple-input multiple-output orthogonal frequency-division multiplexing (MIMO-OFDM) systems. Thanks to the large-scale antenna array and wide bandwidth, CSI fingerprints are able to capture rich multipath information including powers, angles, and delays [14, 15, 16, 17, 18] for positioning. Various fingerprint-based positioning techniques have been proposed for massive MIMO/MIMO-OFDM systems. Of particular relevance is the approach proposed in [17], where a weighted k-nearest neighbor (WKNN) algorithm is applied using the angle-delay domain channel information as fingerprints in the massive MIMO-OFDM system. Other approaches rely on more sophisticated techniques, such as Gaussian process regression in [7] and compressive sensing in [19], to name a few.

Most recently, deep learning has found application in user positioning, inspired by its great success in image recognition, speech signal processing, and self-driving. As a matter of fact, the fingerprint-based positioning can be cast into an image recognition problem, in which the fingerprints can be treated as images to recognize. For indoor positioning, a two-step training deep neural network (DNN) based positioning method was proposed in [20] for the NLOS massive MIMO scenario, a single DNN classifier to determine the probabilities of the MT being on the collected RPs was used in [12], and a deep learning-based positioning method based on the classic deep belief nets (DBNs) with a stack of restricted Boltzmann machines (RBMs) was proposed in [21]. For outdoor positioning, a deep convolution neural network (CNN) was utilized in [22] to map the CSI into the 2D position coordinates.

However, the aforementioned methods were mainly dedicated to 2D positioning. With a UPA array, high angular resolution can be realized in both the vertical and horizontal directions, which provides new opportunities for fingerprint-based 3D positioning. To this end, we propose a novel deep learning based 3D positioning method for the MIMO-OFDM system with the UPA array, taking the angle-delay domain channel power matrix (ADCPM) as the fingerprint, which contains multipath angle, delay, and power information. Instead of using CSI fingerprints in the spatial-delay domain, we translate them into the angle-delay domain with a 3D discrete Fourier transform (DFT), by which the sparsity structures can be fully exploited. To deal with the high dimensionality of the ADCPM, we propose a regression-oriented 3D CNN model that maps the ADCPM fingerprints into the 3D position coordinates directly. In particular, our 3D CNN model consists of four key components: (1) the convolution refinement module to refine the elementary feature maps from the ADCPM fingerprints; (2) 3D Inception blocks that are extended from the Inception blocks in AlexNet [23] and GoogLeNet [24, 25, 26]; (3) the average pooling to replace the fully connected layer at the bottom of the network, as suggested in [27], to reduce the number of parameters of the network; (4) the batch normalization (BN) in each layer to improve the convergence speed and generalization ability of the network [28]. Such an end-to-end positioning method works in the following way. In the offline training phase, the 3D CNN is trained by using the ADCPM fingerprints and the corresponding coordinates of RPs. In the online prediction phase, the trained 3D CNN takes new ADCPM fingerprints as input and predicts the positions.

To summarize, our contributions are three-fold:

• For massive MIMO-OFDM systems with the UPA array, we propose a new type of fingerprint, the ADCPM, which has rich and stable multipath characteristics that are closely related to the position information.

• For massive MIMO-OFDM systems with the UPA array, we propose a 3D CNN based positioning method, which achieves higher positioning accuracy with lower storage and computational overhead than the searching-based methods (e.g., WKNN).

• We build a simulator and conduct extensive experiments to evaluate the proposed 3D CNN fingerprint-based positioning method with respect to positioning accuracy, storage and computation overhead.

The rest of the paper is organized as follows. In Section II, we investigate the 3D MIMO-OFDM system and propose a new type of fingerprint extracted from the multipath channel characteristics. In Section III, we introduce the design of our proposed 3D CNN-enabled positioning method, followed by the detailed network architecture of our 3D CNN model in Section IV. Simulation results are presented in Section V, and the conclusion is given in Section VI.

Notations: We use $\bar{\jmath}=\sqrt{-1}$ to denote the imaginary unit. Vectors and matrices are denoted by lower-case and upper-case bold-faced characters, respectively, and the element indices of vectors and matrices start from 0. We use $[\mathbf{a}]_i$, $[\mathbf{A}]_{i,j}$, and $[\mathbf{A}]_{:,j}$ to denote the $i$th element of the vector $\mathbf{a}$, the $(i,j)$th element of the matrix $\mathbf{A}$, and the $j$th column of the matrix $\mathbf{A}$, respectively. The superscripts $(\cdot)^T$, $(\cdot)^H$, and $(\cdot)^*$ indicate the matrix transpose, conjugate-transpose, and conjugate operations. The complex number field, real number field, and integer field are represented by $\mathbb{C}$, $\mathbb{R}$, and $\mathbb{Z}$. The symbol $\otimes$ denotes the Kronecker product of two matrices. We use $\mathbb{E}\{\cdot\}$ to denote the expectation of a random variable (RV) or a random vector variable (RVC). $\lfloor x \rfloor$ denotes the largest integer not greater than $x$, $\langle x \rangle_N$ denotes the modulo-$N$ operation, and $\delta(\cdot)$ denotes the delta function.

## II 3D MIMO-OFDM System and Channel Characteristics

In this section, we start with the 3D massive MIMO-OFDM system model. The BS is equipped with a uniform planar array (UPA) comprising $N$ antennas in each row and $M$ antennas in each column. We then introduce a new type of fingerprint that carries the angle and delay information extracted from the channel characteristics.

### II-A Channel Model

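We consider the uplink transmission in a wide-band massive MIMO wireless system. Owing to occlusions and reflections, wireless signals propagate through multiple paths. Assume that the number of paths is $N_p$ and the number of MTs is $K$. The AOA of the $p$th path for the $k$th MT can be decomposed into the elevation angle $\theta_{p,k}$ in the vertical direction and the azimuth angle $\varphi_{p,k}$ in the horizontal direction, as shown in Fig. 1. Thus, the array response vector can be written as

$$\mathbf{e}(\theta_{p,k},\varphi_{p,k})=\mathbf{e}^{(v)}(\theta_{p,k})\otimes\mathbf{e}^{(h)}(\theta_{p,k},\varphi_{p,k}), \tag{1}$$

with

$$\mathbf{e}^{(v)}(\theta_{p,k})=\left[1,\,e^{-\bar{\jmath}2\pi\frac{d_t^{(v)}}{\lambda_c}\cos\theta_{p,k}},\,\cdots,\,e^{-\bar{\jmath}2\pi(M-1)\frac{d_t^{(v)}}{\lambda_c}\cos\theta_{p,k}}\right]^T, \tag{2}$$

and

$$\mathbf{e}^{(h)}(\theta_{p,k},\varphi_{p,k})=\left[1,\,e^{-\bar{\jmath}2\pi\frac{d_t^{(h)}}{\lambda_c}\sin\theta_{p,k}\cos\varphi_{p,k}},\,\cdots,\,e^{-\bar{\jmath}2\pi(N-1)\frac{d_t^{(h)}}{\lambda_c}\sin\theta_{p,k}\cos\varphi_{p,k}}\right]^T, \tag{3}$$

where $d_t^{(v)}$ and $d_t^{(h)}$ are the antenna spacings in the column and row, respectively, and $\lambda_c$ is the carrier wavelength. Then, the channel impulse response (CIR) of the $p$th path for the $k$th MT is represented by

$$\mathbf{q}_{p,k}=a_{p,k}\,\mathbf{e}(\theta_{p,k},\varphi_{p,k}). \tag{4}$$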

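We consider OFDM modulation with $N_c$ sub-carriers and a sample interval of $T_s$. We use $T_c=N_cT_s$ and $T_g=N_gT_s$ to denote the OFDM symbol duration and the cyclic prefix (a.k.a. guard interval) duration, respectively. We assume that the cyclic prefix duration is larger than the maximum channel delay of all the MTs. Let $r_{p,k}=\lfloor\tau_{p,k}/T_s\rfloor$, where $\tau_{p,k}$ is the time of arrival (TOA) of the $p$th path for the $k$th MT. The frequency of the $l$th sub-carrier is $f_l=l/T_c$. Thus, the channel frequency response (CFR) associated with the $k$th MT and the $l$th sub-carrier is written as

$$\mathbf{h}_{k,l}=\sum_{p=1}^{N_p}a_{p,k}\,\mathbf{e}(\theta_{p,k},\varphi_{p,k})\,e^{-\bar{\jmath}2\pi\frac{l\,r_{p,k}}{N_c}}, \tag{5}$$

where $a_{p,k}$ is the complex path gain of the $p$th path. The space-frequency channel response matrix (SFCRM) of the $k$th MT known to the BS can be denoted by the concatenation of $\mathbf{h}_{k,l}$, i.e.,

$$\mathbf{H}_k=\left[\mathbf{h}_{k,0},\mathbf{h}_{k,1},\cdots,\mathbf{h}_{k,N_c-1}\right]\in\mathbb{C}^{MN\times N_c}. \tag{6}$$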
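The channel model above can be sketched numerically as follows. This is a minimal illustration rather than the authors' simulator: the function names, the half-wavelength default spacings, and the use of integer sample delays are assumptions, and the horizontal response is assumed to carry the phase term $\sin\theta\cos\varphi$, which is consistent with (14).

```python
import numpy as np

def array_response(theta, phi, M, N, d_v=0.5, d_h=0.5):
    """UPA response e = e_v (kron) e_h, cf. Eqs. (1)-(3).

    d_v and d_h are antenna spacings in wavelengths (assumed 0.5 here).
    """
    m = np.arange(M)
    n = np.arange(N)
    e_v = np.exp(-2j * np.pi * m * d_v * np.cos(theta))
    e_h = np.exp(-2j * np.pi * n * d_h * np.sin(theta) * np.cos(phi))
    return np.kron(e_v, e_h)  # length M*N, entry index m*N + n

def space_frequency_channel(gains, thetas, phis, delays, M, N, Nc):
    """SFCRM H_k of size (M*N, Nc); column l is the CFR of Eq. (5)."""
    l = np.arange(Nc)
    H = np.zeros((M * N, Nc), dtype=complex)
    for a, th, ph, r in zip(gains, thetas, phis, delays):
        e = array_response(th, ph, M, N)
        # each path contributes a rank-one term: spatial response x delay phase ramp
        H += a * np.outer(e, np.exp(-2j * np.pi * l * r / Nc))
    return H

# Example: two paths seen by a 4 x 4 UPA over 16 sub-carriers
H_demo = space_frequency_channel(
    gains=[1.0, 0.5j], thetas=[np.pi / 3, np.pi / 2], phis=[np.pi / 4, 0.3],
    delays=[2, 5], M=4, N=4, Nc=16)
```

Each path adds a rank-one term to the matrix, which is the structure that the angle-delay transform later turns into isolated peaks.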

### II-B Fingerprint from Angle-Delay Domain

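Define $\mathbf{V}_M\in\mathbb{C}^{M\times M}$ and $\mathbf{V}_N\in\mathbb{C}^{N\times N}$ as phase-shifted discrete Fourier transform (DFT) matrices with the $(i,j)$th elements $[\mathbf{V}_M]_{i,j}=\frac{1}{\sqrt{M}}e^{-\bar{\jmath}2\pi\frac{i(j-M/2)}{M}}$ and $[\mathbf{V}_N]_{i,j}=\frac{1}{\sqrt{N}}e^{-\bar{\jmath}2\pi\frac{i(j-N/2)}{N}}$, respectively. Define $\mathbf{F}_{N_c\times N_g}\in\mathbb{C}^{N_c\times N_g}$ as the matrix composed of the first $N_g$ columns of the $N_c$-dimensional DFT matrix, with the $(i,j)$th element $[\mathbf{F}_{N_c\times N_g}]_{i,j}=\frac{1}{\sqrt{N_c}}e^{-\bar{\jmath}2\pi\frac{ij}{N_c}}$.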

The fingerprints used in positioning are required to be closely linked to the MTs' positions. The CFR describes the space-frequency domain characteristics, but it is hard to build an intuitive relationship between position and CFR. Owing to the complex and changeable multipath propagation in the wireless channel, the AOAs and TOAs of the received signals are unique for different positions. Therefore, it suffices to extract a fingerprint from the angle-delay domain. We transform the CFR matrix into an angle-delay domain matrix $\mathbf{G}_k$, referred to as the angle-delay domain channel response matrix (ADCRM) of the $k$th MT, given by

$$\mathbf{G}_k=\frac{1}{\sqrt{MNN_c}}\left(\mathbf{V}_M^H\otimes\mathbf{V}_N^H\right)\mathbf{H}_k\mathbf{F}^*_{N_c\times N_g}. \tag{7}$$

As such, the angle-delay domain channel power matrix (ADCPM) of the $k$th MT is introduced and used as the fingerprint hereafter, i.e.,

$$\boldsymbol{\Omega}_k\triangleq\mathbb{E}\left\{\mathbf{G}_k\odot\mathbf{G}_k^*\right\}, \tag{8}$$

where $\odot$ denotes the element-wise (Hadamard) product, with

$$[\boldsymbol{\Omega}_k]_{i,j}\triangleq\mathbb{E}\left\{\left|[\mathbf{G}_k]_{i,j}\right|^2\right\}. \tag{9}$$
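A minimal NumPy sketch of (7)-(9) is given below. It assumes unitary scaling for the phase-shifted DFT matrices and for the truncated DFT matrix (an assumption chosen so that a unit-gain path yields unit power after the $1/\sqrt{MNN_c}$ factor), and it replaces the expectation in (8) with a sample average over channel realizations. The function names are illustrative.

```python
import numpy as np

def phase_shifted_dft(M):
    """[V]_{i,j} = exp(-j*2*pi*i*(j - M/2)/M) / sqrt(M)  (assumed scaling)."""
    i = np.arange(M)[:, None]
    j = np.arange(M)[None, :]
    return np.exp(-2j * np.pi * i * (j - M / 2) / M) / np.sqrt(M)

def truncated_dft(Nc, Ng):
    """First Ng columns of the Nc-point unitary DFT matrix."""
    i = np.arange(Nc)[:, None]
    j = np.arange(Ng)[None, :]
    return np.exp(-2j * np.pi * i * j / Nc) / np.sqrt(Nc)

def adcrm(H, M, N, Ng):
    """Eq. (7): G = (V_M^H kron V_N^H) H conj(F) / sqrt(M*N*Nc)."""
    Nc = H.shape[1]
    V = np.kron(phase_shifted_dft(M).conj().T, phase_shifted_dft(N).conj().T)
    F = truncated_dft(Nc, Ng)
    return V @ H @ F.conj() / np.sqrt(M * N * Nc)

def adcpm(H_samples, M, N, Ng):
    """Eqs. (8)-(9): the sample mean of |G|^2 stands in for the expectation."""
    return np.mean([np.abs(adcrm(H, M, N, Ng)) ** 2 for H in H_samples], axis=0)
```

Under these assumptions, a single unit-gain path whose angles and delay fall exactly on the DFT grid produces an ADCPM with one non-zero entry, located at row $\bar{m}N+\bar{n}$ and column $r$; off-grid paths leak into neighboring entries.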

Define the following function:

$$f_M(x)=\frac{\sin(Mx)}{M\sin(x)}. \tag{10}$$
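For intuition, the function in (10) can be evaluated directly. The sketch below is illustrative only: the name `f_M` mirrors the symbol in the text, and the handling of the removable singularity at multiples of $\pi$ (via the limit value) is an implementation choice, not something stated in the paper.

```python
import math

def f_M(x, M):
    """Eq. (10): sin(M x) / (M sin x), using the limit where sin x vanishes."""
    s = math.sin(x)
    if abs(s) < 1e-12:
        # L'Hopital's rule gives cos(M x) / cos(x), i.e. +1 or -1 at multiples of pi
        return math.cos(M * x) / math.cos(x)
    return math.sin(M * x) / (M * s)
```

The magnitude peaks at 1 for $x=0$ and has zeros at nonzero multiples of $\pi/M$ that are not multiples of $\pi$, so the main lobe narrows as $M$ grows.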

Theorem 1: For 3D MIMO-OFDM systems with UPA at the BS, when and , the ADCPM is concentrated on specific position in the vertical angle-delay domain, given by

 (11)

where

$$\bar{m}_{p,k}=\frac{M}{2}+M\frac{d_t^{(v)}}{\lambda_c}\cos\theta_{p,k}. \tag{12}$$

When and , the ADCPM has the similar conclusion in the horizontal angle-delay domain, given by

 (13)

where

$$\bar{n}_{p,k}=\frac{N}{2}+N\frac{d_t^{(h)}}{\lambda_c}\sin\theta_{p,k}\cos\varphi_{p,k}. \tag{14}$$

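When $M\to\infty$, $N\to\infty$, and $N_c\to\infty$, the ADCPM is concentrated on the $(\bar{m}_{p,k}N+\bar{n}_{p,k})$th angle direction and the $r_{p,k}$th delay direction, given by

$$\lim_{M\to\infty,\,N\to\infty,\,N_c\to\infty}[\boldsymbol{\Omega}_k]_{i,j}=\sum_{p=1}^{N_p}\sigma_{p,k}^2\,\delta\left(i-\bar{m}_{p,k}N-\bar{n}_{p,k}\right)\cdot\delta\left(j-r_{p,k}\right). \tag{15}$$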

Proof: See appendix A. ∎

Remark 1: For 3D massive MIMO-OFDM systems, (7) and (8) translate the CFR in the space-frequency domain into the ADCPM in the angle-delay domain. Theorem 1 reveals that the $(i,j)$th element of the ADCPM corresponds to the average channel power of the $i$th AOA and the $j$th TOA.

Remark 2: In Theorem 1, $N_p$ determines the sparsity and $(\bar{m}_{p,k},\bar{n}_{p,k},r_{p,k})$ specifies the sparsity pattern. When $M\to\infty$, $N\to\infty$, and/or $N_c\to\infty$, the ADCPM is asymptotically a sparse matrix in the sense that most elements are equal to zero. For finite $M$, $N$, and $N_c$, numerical results presented later show that the sparsity is maintained.

Remark 3: From Theorem 1, the sparsity pattern of the ADCPM depends on both the AOAs and the TOAs of the multiple paths. For two MTs located at different positions, it is unlikely that all the multipath components of their signals are identical, and hence their ADCPMs are unlikely to coincide. As such, the ADCPM can be a unique indicator to discriminate MTs at different geographical positions.

According to Theorem 1, the ADCPM is suitable to serve as the fingerprint for positioning, as the ADCPMs meet the following requirements for fingerprints.

1. The ADCPMs are closely related to geographic locations. As stated above, the ADCPM embeds information of the AOAs, the TOAs, and the channel power corresponding to a specific geographic location.

2. The ADCPMs have a sufficient degree of discrimination between different geographical locations, which increases as two locations are farther apart.

3. The ADCPM is stationary in the sense that it remains unchanged over a relatively long period for a given location. In the multipath environment, as long as the distribution of the scatterers does not change, the angles and delays remain unchanged, and so does the ADCPM.

In addition, the ADCPM can be conveniently extracted from the channel state information (CSI) at the BS through wideband signal processing. To conclude, the ADCPM provides a highly differentiable, stable, and easily accessible indicator for different geographic locations, and is therefore an ideal fingerprint for positioning.

## III Convolution Neural Network for Positioning

Provided the ADCPM as the fingerprint in Section II-B, the problem arises as to how to realize positioning by exploiting the structural properties of ADCPM.

As the ADCPM fingerprint can be seen as an image, the widely used end-to-end image recognition method, i.e., convolution neural network (CNN) empowered deep learning, can be applied here for positioning. Thanks to its spatial invariance, parameter sharing, and hierarchical representations, the CNN is more efficient than traditional fully connected networks when dealing with high-dimensional inputs [29]. For 3D massive MIMO-OFDM systems, the high-dimensional ADCPM has sparsity patterns, which suggests that the CNN could be an ideal positioning method to extract the positioning-related features from the ADCPM and convert these features into position information with relatively low computational complexity.

### III-A The Sparse ADCPM as Input

As shown in Fig. 2, the asymptotic sparsity pattern of the ADCPM in the angle-delay domain, established in Theorem 1, still holds in the practical setting with a finite number of antennas and limited bandwidth. Owing to the sparsity pattern, the ADCPM fingerprint makes the differences between channels at different positions more distinguishable. As such, the features of the channel are easier for a neural network to extract.

In fact, the feature maps in the higher layers of CNN are also sparse because they solely focus on the discriminant structure within the input picture [30]. By using the sparse ADCPM as the input, it is easier for the neural network to capture the characteristic information of the channel, thereby simplifying the neural network structure and speeding up the convergence of the neural network.

In Theorem 1, the rows and the columns of the ADCPM correspond to the angle and the delay domains, respectively. Noticing that an angle in 3D space can be described by a pair of vertical and horizontal angles, we reshape the ADCPM into a three-dimensional tensor in the following way:

$$[\mathbf{X}_k]_{m,n,j}=[\boldsymbol{\Omega}_k]_{mN+n,\,j}, \tag{16}$$

where $\mathbf{X}_k\in\mathbb{R}^{M\times N\times N_g}$ is the 3D ADCPM of the $k$th MT, whose three dimensions indicate the vertical angle, horizontal angle, and delay, respectively. Hereafter, we use the reshaped ADCPM as the input of the neural networks.
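The index mapping in (16) coincides with a row-major reshape, as the short sketch below illustrates. The function name `to_3d_adcpm` is hypothetical, and the input is assumed to be an array with $MN$ rows (angle index) and one column per delay tap.

```python
import numpy as np

def to_3d_adcpm(Omega, M, N):
    """Eq. (16): X[m, n, j] = Omega[m*N + n, j]."""
    rows, n_delay = Omega.shape
    if rows != M * N:
        raise ValueError("Omega must have M*N rows")
    # A row-major (C-order) reshape sends row index m*N + n to the pair (m, n)
    return Omega.reshape(M, N, n_delay)
```

Because the Kronecker ordering of (1) places the horizontal index fastest, no transposition is needed before the reshape.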

### III-B Regression-oriented Positioning

The CNN-based regression-oriented positioning is in effect a complex non-linear function. Denote by $f(\cdot)$ the function that predicts the true position $\mathbf{a}_k$ from the user's 3D ADCPM $\mathbf{X}_k$, that is,

$$\hat{\mathbf{a}}_k=f(\mathbf{X}_k), \tag{17}$$

where $\hat{\mathbf{a}}_k$ is the prediction by the CNN. We use regression analysis to find the mapping function by minimizing the localization error between the true coordinate $\mathbf{a}_k$ and the prediction $\hat{\mathbf{a}}_k$,

$$e_k=\left\|\hat{\mathbf{a}}_k-\mathbf{a}_k\right\|. \tag{18}$$

Note that traditional CNN models are often used for image classification, with the last layer activated by a softmax function. As a matter of fact, the CNN model itself can be regarded as a regression function if the softmax function is replaced with a fully connected layer without an activation function. If we use $\mathbf{T}_0$ to denote the output before the last layer, then the estimated position after the last layer can be written as

$$\hat{\mathbf{a}}=\mathbf{W}\,\mathrm{vec}\{\mathbf{T}_0\}+\mathbf{b}, \tag{19}$$

where $\mathbf{W}$ and $\mathbf{b}$ are the weight matrix and bias vector, respectively, which are learned together with the preceding CNN layers.

Let $\boldsymbol{\theta}$ be the set of trainable parameters of the neural network. The cost function with respect to the mean square error (MSE) over the training data set is given by

$$J(\boldsymbol{\theta})=\frac{1}{N_{\mathrm{train}}}\sum_{i=1}^{N_{\mathrm{train}}}\left\|\mathbf{a}_i-\hat{\mathbf{a}}_i\right\|^2+\frac{\lambda}{2}\boldsymbol{\theta}^T\boldsymbol{\theta}, \tag{20}$$

where $N_{\mathrm{train}}$ is the number of training samples, the second term employs $\ell_2$ regularization to avoid over-fitting, and $\lambda$ is the weight factor of the regularization.
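The localization error (18) and the regularized cost (20) can be written out as a small NumPy sketch. The function names and the representation of the trainable parameters as a list of arrays are assumptions made for illustration; a deep learning framework would normally compute this loss internally.

```python
import numpy as np

def localization_error(a_true, a_pred):
    """Eq. (18): Euclidean distance between true and predicted positions."""
    diff = np.asarray(a_pred, dtype=float) - np.asarray(a_true, dtype=float)
    return np.linalg.norm(diff, axis=-1)

def cost(a_true, a_pred, params, lam):
    """Eq. (20): mean squared localization error plus (lam / 2) * theta^T theta."""
    mse = np.mean(localization_error(a_true, a_pred) ** 2)
    theta = np.concatenate([np.ravel(p) for p in params])  # stack all parameters
    return mse + 0.5 * lam * float(theta @ theta)
```

For example, with errors of 5 m and 0 m on two samples, parameters whose squared norm is 9, and `lam = 0.1`, the cost is 12.5 + 0.45 = 12.95.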

## IV Proposed 3D CNN Structure

Inspired by the 3D structure of $\mathbf{X}_k$, we propose to use a 3D CNN to realize the mapping function. Fig. 3 shows the network architecture of the proposed 3D CNN for fingerprint-based positioning, which is composed of a convolution refinement module, three extended 3D Inception modules, and a regression module. In addition, max pooling is used for downsampling, and its parameters are presented in the form of (size/stride). The convolution refinement module first refines the elementary feature maps from the 3D ADCPM. Then we modify and extend the Inception module into a 3D form to extract the advanced feature maps. Finally, the regression module estimates the 3D position by employing a global average pooling layer and a fully connected layer without an activation function.

Before proceeding further, we introduce the 3D convolution-normalization-activation (CNA) layer, an elementary building block used in our 3D CNN. The 3D CNA layer consists of three parts: a 3D convolution, a BN transform, and an activation function. Consider the input feature maps , where , , , denote the height, width, length, and the number of channels of respectively. Given the 3D convolution kernel , where , , are the kernel size of the height, width, and length respectively, and is the number of output channels. The convolution is performed with zero-padding [29] and the strides are set to be 1. By convolving the 3D kernel with the input feature maps, the 3D convolution yields output with

$$[\mathbf{O}]_{m,n,j,q}=\sum_{k_1,k_2,k_3}\sum_{p}[\mathbf{K}]_{k_1,k_2,k_3,p,q}\cdot[\mathbf{I}]_{m+k_1,\,n+k_2,\,j+k_3,\,p}. \tag{21}$$

Right after the 3D convolution, we adopt the BN to improve the convergence rate and generalization performance [28], followed by the rectified linear unit (ReLU) as the activation function. That is, the output feature maps of the 3D CNA layer are given by

$$[\mathbf{T}]_{m,n,j,q}=\max\left(0,\mathrm{BN}\left([\mathbf{O}]_{m,n,j,q}\right)\right), \tag{22}$$

where $\mathrm{BN}(\cdot)$ represents the BN transform. Note that we discard the bias term in (21) as suggested in [28].
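A naive NumPy sketch of the 3D CNA layer in (21) and (22) follows. It is meant to make the index bookkeeping concrete, not to be efficient. Several simplifications are assumptions of this sketch: the kernel offset is centered (the usual convention for zero padding that preserves the input size), the BN statistics are taken per channel over the spatial axes of a single sample, and the BN scale and shift default to 1 and 0.

```python
import numpy as np

def conv3d_same(I, K):
    """Eq. (21) with stride 1 and zero padding.

    I has shape (H, W, L, C); K has shape (k1, k2, k3, C, Q).
    """
    k1, k2, k3, _, Q = K.shape
    H, W, L, _ = I.shape
    pads = [((k - 1) // 2, k // 2) for k in (k1, k2, k3)] + [(0, 0)]
    Ip = np.pad(I, pads)  # zero padding on the three spatial axes only
    O = np.zeros((H, W, L, Q))
    for a in range(k1):
        for b in range(k2):
            for c in range(k3):
                patch = Ip[a:a + H, b:b + W, c:c + L, :]  # shifted view, (H, W, L, C)
                O += patch @ K[a, b, c]                   # sum over input channels p
    return O

def batch_norm(O, gamma=1.0, beta=0.0, eps=1e-5):
    """Per-channel normalization over spatial axes (single-sample stand-in for BN)."""
    mean = O.mean(axis=(0, 1, 2), keepdims=True)
    var = O.var(axis=(0, 1, 2), keepdims=True)
    return gamma * (O - mean) / np.sqrt(var + eps) + beta

def cna_layer(I, K):
    """Eq. (22): ReLU(BN(conv3d(I, K))), with no bias term in the convolution."""
    return np.maximum(0.0, batch_norm(conv3d_same(I, K)))
```

The kernel shape may differ along each axis, so the same routine covers the asymmetric kernels used later in the convolution refinement module.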

In what follows, we describe the individual modules and the overall network structure in detail.

### IV-A Convolution Refinement Module

The convolution refinement module is designed to extract features from the input 3D ADCPM. When designing the module, we take into account the physical meaning of each dimension of the 3D ADCPM. While existing designs use symmetric kernels, whose size is equal in every dimension, we propose to use asymmetric convolution kernels, based on the intuition that the kernel size should reflect the correlation statistics of the corresponding dimension. As a consequence, the size of the subsequent max pooling should also change accordingly. From Fig. 2, we observe that the input units are more concentrated in the angular dimension and relatively dispersed in the delay dimension. This suggests that the convolution kernel can be larger along the delay dimension than along the angle dimensions. At the same time, it is worth noting that the vertical and horizontal angles are correlated to some extent, while the delay dimension is independent of the other two. Therefore, the delay-vertical angle and the delay-horizontal angle domains demand special treatment, compared with the vertical-horizontal angle domain. This motivates our design of the convolution refinement module.

We illustrate the case of , , and as an example to describe the convolution refinement module. The structure of the convolution refinement module is shown in Fig. 4. The expression in the box is in the form of (size*number/stride). As one of the key designs, we build two branches in this module, each of which consists of two 3D CNA layers and one max pooling layer. The left branch aims to extract the information from the delay-horizontal angle domain, while the right branch is dedicated to the information from the delay-vertical angle domain. We emphasize here that both branches use asymmetrical convolution kernels and pooling size. The outputs of two branches are combined using kernel-wise concatenation. After that, we employ a 3D CNA layer with the symmetrical kernel to extract the information from all three dimensions, followed by a max pooling to downsample the feature maps.

### IV-B 3D Inception Module

In the deeper layers, we use the 3D Inception module to extract more precise features. The Inception module is a combination of 2D convolutions with different kernel sizes, e.g., $1\times1$, $3\times3$, and $5\times5$, with a parallel average-pooling operation [24]. We extend the Inception module into a 3D form by replacing the 2D convolution with the 3D convolution. The structure has four parallel branches, whose outputs are concatenated into a single output that serves as the input of the next stage, as shown in Fig. 5, where is the factor of kernel number. The kernel is replaced by two cascaded kernels to reduce the computational overhead, as in [24].
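The saving obtained by cascading small kernels can be checked with simple arithmetic. The sketch below assumes, by analogy with the 2D Inception literature, that a $5\times5\times5$ kernel is replaced by two cascaded $3\times3\times3$ kernels and that all layers keep the same number of channels; the exact sizes used in Fig. 5 are not legible in this text, so these numbers are illustrative only.

```python
def conv3d_weights(kernel, c_in, c_out):
    """Number of weights in a bias-free 3D convolution layer."""
    k1, k2, k3 = kernel
    return k1 * k2 * k3 * c_in * c_out

def factorization_weights(channels):
    """Weights of one 5x5x5 layer versus two cascaded 3x3x3 layers (assumed sizes)."""
    single = conv3d_weights((5, 5, 5), channels, channels)
    cascaded = 2 * conv3d_weights((3, 3, 3), channels, channels)
    return single, cascaded

def cascaded_receptive_field(kernel_sizes):
    """Receptive field along one axis of stride-1 convolutions applied in sequence."""
    return 1 + sum(k - 1 for k in kernel_sizes)
```

With 16 channels, the single layer needs 32000 weights while the cascade needs 13824, and both cover a receptive field of 5 along each axis.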

Since the correlation statistics in the deeper layers are unknown, the usage of 3D convolution kernels with different sizes can avoid the loss of important features.

### IV-C Regression Module

The regression module includes a global average pooling layer and a fully connected layer without an activation function. While existing designs tend to add several fully connected layers before the output layer of the network, we use a global average pooling to replace them. The global average pooling takes the average of each feature map [27], and is used to reshape the output of the convolution layers for the final fully connected layer. By replacing the intermediate fully connected layers with the global average pooling, we avoid the large number of parameters that these layers would introduce. Thus, the global average pooling improves the convergence rate while reducing overfitting. The output of the global average pooling, in vector form, is fed directly into the fully connected layer described by (19). The output of the regression module is the estimated coordinates of the position.
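The regression module can be sketched in a few lines of NumPy, together with a parameter count that illustrates why global average pooling is cheaper than an intermediate fully connected layer. The function names, the channels-last layout, and the hidden width used for comparison are hypothetical choices for this illustration.

```python
import numpy as np

def regression_head(T0, W, b):
    """Global average pooling over (H, W, L), then a linear layer, cf. Eq. (19)."""
    pooled = T0.mean(axis=(0, 1, 2))  # one scalar per feature map, shape (C,)
    return W @ pooled + b             # estimated (x, y, z)

def head_parameters(shape, hidden=None):
    """Weights and biases of the head.

    hidden=None: global average pooling followed by a C -> 3 linear layer.
    hidden=h:    flatten, then fully connected h units, then h -> 3.
    """
    H, Wd, L, C = shape
    if hidden is None:
        return C * 3 + 3
    flat = H * Wd * L * C
    return flat * hidden + hidden + hidden * 3 + 3
```

For a feature tensor of shape (2, 2, 2, 4), the pooled head has 15 parameters, whereas a flattened head with a hypothetical 10-unit hidden layer has 363.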

## V Simulation Results

In this section, we first introduce the simulation setup. Then we evaluate the positioning accuracy and the time and storage overhead of the proposed 3D CNN-enabled positioning method. Next, we demonstrate the influence of the antenna configuration and bandwidth on the proposed method. Finally, we evaluate the robustness of our positioning method with noise-contaminated inputs.

### V-A Simulation Setup

To simulate the positioning process, we implement a spatially consistent channel model based on the QuaDRiGa model [31]. Spatial consistency means that the channels at two adjacent positions are similar because they share a similar scattering environment. Fingerprint-based positioning relies on the similarity between fingerprints obtained from the channel. Thus, it is critical for our simulator to capture spatial consistency. The QuaDRiGa model incorporates time evolution to realize spatial consistency along a preset track. Based on the transition idea from the WINNER model [31], we extend the spatial consistency to the whole 3D space by using transitions between reference points (RPs). The RPs are uniformly distributed in the 3D space, and the interval between them is equal to the correlation distance used to describe the correlation of large-scale parameters (LSPs) [32]. A stochastic part generates the channel coefficients of the RPs, following [33, 34]. Then, based on the geographic position, the channel of an arbitrary MT in the 3D space can be generated by the transitions between the RPs. Thus, we build a simulation environment that includes the geographical correlation between the fingerprints of two neighboring positions.

In our simulation, we consider the 3GPP urban macro (UMa) NLOS scenario, where the carrier frequency is set to 2 GHz. We choose the NLOS scenario instead of the LOS one because positioning in the NLOS scenario is more challenging due to the many occlusions and reflections. The size of the positioning area is 30 m × 30 m × 9 m, with the center point of its bottom surface coinciding with the origin. The BS is equipped with a UPA at m, and the antenna plane is perpendicular to the ground, facing the positioning area. The MT is equipped with an omnidirectional antenna. Unless otherwise specified, the transmission bandwidth is 20 MHz, and the configuration of UPA is , . For simplicity but without loss of generality, we divide the cuboid positioning area into three planes with heights of 1.5 m, 4.5 m, and 7.5 m, respectively. In the offline phase, the training points are uniformly selected on these three planes with an interval of 1 m, and the fingerprints and the corresponding 3D positions are collected. In the online phase, 1000 test points are randomly distributed on these three planes, and their positions are inferred by feeding the fingerprints into the trained 3D CNN.
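The sampling of training and test positions described above can be sketched as follows. This is only an illustration of the geometry: the function names are hypothetical, the grid is assumed to include both boundaries of the 30 m side (giving 31 points per axis), and the test points are assumed to be drawn uniformly on a randomly chosen plane.

```python
import random

def training_grid(side=30.0, step=1.0, heights=(1.5, 4.5, 7.5)):
    """Uniform grid on each plane, centered on the origin (boundaries included)."""
    n = int(round(side / step)) + 1
    coords = [-side / 2 + i * step for i in range(n)]
    return [(x, y, z) for z in heights for y in coords for x in coords]

def sample_test_points(count=1000, side=30.0, heights=(1.5, 4.5, 7.5), seed=0):
    """Random positions on the same planes; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    half = side / 2
    return [(rng.uniform(-half, half), rng.uniform(-half, half), rng.choice(heights))
            for _ in range(count)]
```

Under the boundary assumption, the grid contains 3 × 31 × 31 = 2883 training positions.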

The fingerprints and the corresponding 3D positions of the training points are generated and saved using MATLAB 2018b, and the 3D CNN training and testing are processed using TensorFlow 1.9. Our simulation is carried out on a workstation equipped with two E5 2643v3 CPUs and one Titan X Pascal 12GB GPU. The time overhead mentioned below only refers to the run time on TensorFlow 1.9.

### V-B Comparison with Other Positioning Algorithms

To evaluate the performance of the proposed 3D CNN-enabled positioning method, we use the WKNN positioning method proposed in [35] as the benchmark. For fair comparison, we modify the fingerprint similarity criterion by adding the normalization, given by

$$J(\boldsymbol{\Omega}_k,\boldsymbol{\Omega}_l)=\frac{\mathrm{Tr}\left(\boldsymbol{\Omega}_k^T\boldsymbol{\Omega}_l\right)}{\left\|\boldsymbol{\Omega}_k\right\|_F\left\|\boldsymbol{\Omega}_l\right\|_F}, \tag{23}$$

where $\|\cdot\|_F$ denotes the Frobenius norm, and $\mathrm{Tr}(\cdot)$ denotes the “trace” operator. The modification is inspired by the correlation matrix distance (CMD) proposed in [36] to measure the spatial structure of the non-stationary MIMO channel. The normalization term confines the similarity of fingerprints between 0 and 1 to eliminate the influence introduced by different channel powers. Thus the normalized fingerprint similarity criterion (23) makes the WKNN positioning algorithm more accurate, which ensures the fairness of comparison. Then, the number of neighbors is adopted in our simulation.
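The normalized similarity (23) and a WKNN estimate built on it can be sketched as below. The similarity follows the formula directly; the weighting rule (weights proportional to the similarity of the selected neighbors) is an assumption of this sketch, since the exact WKNN weighting of the benchmark is not spelled out here, and the function names are illustrative.

```python
import numpy as np

def similarity(Om_k, Om_l):
    """Eq. (23): Tr(Om_k^T Om_l) / (||Om_k||_F * ||Om_l||_F)."""
    num = np.sum(Om_k * Om_l)  # equals Tr(Om_k^T Om_l) for real matrices
    return num / (np.linalg.norm(Om_k) * np.linalg.norm(Om_l))

def wknn_position(query, fingerprints, positions, k):
    """Similarity-weighted average of the positions of the k most similar RPs."""
    sims = np.array([similarity(query, F) for F in fingerprints])
    nearest = np.argsort(sims)[-k:]          # indices of the k largest similarities
    w = sims[nearest] / sims[nearest].sum()  # assumed weighting rule
    return w @ np.asarray(positions, dtype=float)[nearest]
```

Since ADCPM entries are non-negative powers, the similarity lies between 0 and 1. The sketch does not guard against all-zero fingerprints or the case where every selected similarity is zero, both of which would cause a division by zero.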

On the other hand, to demonstrate the benefit of the 3D CNN model, we also make the comparison with a downgraded 2D CNN model. We propose a 2D CNN-enabled positioning method with the ADCPM presented by (8) as the input. In that case, the ADCPM is regarded as an image with sparse highlights (supports), where the vertical and horizontal angles are collocated in the same dimension. Table I specifies the structure of the proposed 2D CNN where the 2D Inception module is modified from the 3D Inception module in Fig. 5 by replacing the 3D convolution with the 2D convolution.

We first compare the positioning accuracy of the three methods. The cumulative distribution function (CDF) of the localization error using different methods is illustrated in Fig. 6. The 3D CNN-enabled positioning method reaches the highest positioning accuracy, with 90% of localization errors within 1 m. Compared with its 3D counterpart, the 2D CNN method achieves a lower positioning accuracy, with only 80% of localization errors within 1 m, but it is still superior to the WKNN positioning method, which achieves 78%.
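The two statistics quoted from the CDF, namely the fraction of errors within a given radius and the error at a given CDF level, can be computed from a list of per-sample localization errors as sketched below. The function names are hypothetical and the error values in the usage example are made up for illustration; they are not results from the paper.

```python
import math

def fraction_within(errors, radius):
    """Empirical CDF value at `radius`: share of errors not exceeding it."""
    return sum(e <= radius for e in errors) / len(errors)

def error_at_level(errors, q):
    """Smallest observed error e such that at least a fraction q of samples are <= e."""
    ordered = sorted(errors)
    idx = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[idx]

# Illustrative values only
demo_errors = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
demo_share = fraction_within(demo_errors, 1.0)   # share of errors within 1 m
demo_median = error_at_level(demo_errors, 0.5)   # error at the 50% point
```

A statement such as "90% of localization errors within 1 m" corresponds to `fraction_within(errors, 1.0)` being 0.9 on the test set.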

The run time overhead is defined as the time spent per user positioning in the online phase, and we employ it to measure the computational complexity. The storage overhead refers to the storage resources required for positioning. We compare the time and storage overhead of the three positioning methods in Fig. 7. The 2D CNN-enabled positioning method requires the least time overhead (15.78 ms) and the least storage overhead (10 MB). Compared with its 2D counterpart, the 3D CNN gains higher positioning accuracy at the cost of higher time overhead (24.42 ms running time) and storage overhead (42.9 MB). The WKNN positioning method has the highest computational complexity (1566 ms) and storage requirement (119 MB), as it needs to store the database collected in the positioning area and then search through it to find the nearest positions. Thus, the time and storage overhead of the WKNN positioning method increases as the positioning area expands. In contrast, the 2D/3D CNN-enabled positioning methods only need to store the trained parameters and infer the position directly through the trained regression networks, so their time and storage overhead is independent of the size of the positioning area.

In conclusion, the proposed 3D CNN-enabled positioning method outperforms the WKNN positioning method in terms of positioning accuracy, time overhead and storage overhead. The proposed 2D CNN-enabled positioning method can be regarded as a simplification of the 3D CNN-enabled positioning method, which reduces the time overhead and storage overhead at the sacrifice of the positioning accuracy.

### V-C Different Configurations of Antenna Array and Bandwidth

To inspect how vertical and horizontal angles interact with each other, we compare the different antenna array geometry. Fig. 8 shows the CDF of localization error using different antenna arrangements, where the case of corresponds to the uniform linear antenna (ULA) array. To make a fair comparison, we maintain the same number of antenna elements. It can be observed that the two UPAs perform better with 90% () and 93% () reliability for 1m positioning accuracy, while the ULA performs 80% reliability for 1m positioning accuracy. Thus, the UPA with a 3D CNN positioning method is more suitable for the 3D positioning than the ULA by providing additional angle resolution in the vertical direction.

Next, we evaluate the impact of the number of antennas. Fig. 9 shows the CDF of localization error for different numbers of BS antennas. The number of antennas is increased from 32 to 128, with the BS equipped with a UPA. The localization errors at the 90% point are 1m, 1.05m, and 1.26m for the , , and UPAs respectively. The results clearly show that the positioning accuracy improves as the number of antennas increases in both the column and row directions. Nevertheless, the improvement saturates once the number of antennas reaches 64.

To understand how the delay domain features affect positioning accuracy, we evaluate the performance versus the transmission bandwidth. We consider transmission bandwidths from 10MHz to 30MHz. From Fig. 9, the localization errors at the 90% point are 1m, 1m, and 1.08m for the 30MHz, 20MHz, and 10MHz transmission bandwidths respectively. It can be observed that the positioning accuracy improves when the bandwidth is increased from 10MHz to 20MHz, while a further increase to 30MHz does not help, indicating that sufficient delay information (e.g., at 20MHz) already captures all the features used by the 3D CNN model.
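The two quantities read off the CDF curves in this subsection, the reliability at a given accuracy and the error at the 90% point, can be computed from a list of per-sample localization errors as in the sketch below. The function names and the sample errors are hypothetical and are not the simulation data.

```python
import math


def reliability(errors, threshold):
    """Empirical CDF value: fraction of errors that are at most `threshold`."""
    return sum(e <= threshold for e in errors) / len(errors)


def error_at_percentile(errors, q):
    """Smallest observed error e such that at least a fraction q of errors are <= e."""
    ordered = sorted(errors)
    # The small offset guards against floating-point round-up in q * len(ordered).
    rank = max(1, math.ceil(q * len(ordered) - 1e-12))
    return ordered[rank - 1]


# Illustrative errors in meters (made-up values, not results from the paper).
sample_errors = [0.2, 0.4, 0.5, 0.7, 0.8, 0.9, 0.95, 1.0, 1.2, 2.0]
print(reliability(sample_errors, 1.0))          # 0.8
print(error_at_percentile(sample_errors, 0.9))  # 1.2
```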

### V-D Positioning Robustness

In this subsection, we evaluate the robustness of our fingerprint extraction and 3D CNN-enabled positioning method against noise contamination. For comparison, we also consider the direct use of the information in the space-frequency domain. Similar to the ADCPM, we introduce the space-frequency channel power matrix (SFCPM) of the $k$th MT as

$$\boldsymbol{\Xi}_k \triangleq \mathbb{E}\left\{\mathbf{H}_k \odot \mathbf{H}_k^{*}\right\}. \tag{24}$$

For comparison, the SFCPM is employed as an alternative fingerprint.
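In practice the expectation in (24) would have to be approximated. A minimal sketch, assuming it is replaced by a sample mean over several channel realizations (the function name `sfcpm` and the nested-list matrix layout are hypothetical), is:

```python
def sfcpm(channel_samples):
    """Sample-mean estimate of E{H ⊙ H*}, i.e. the element-wise average of |H[i][j]|^2.

    channel_samples: non-empty list of equally sized matrices,
    each given as a list of rows of complex numbers.
    """
    n = len(channel_samples)
    rows, cols = len(channel_samples[0]), len(channel_samples[0][0])
    return [
        [
            # h * conj(h) is the Hadamard product entry; its imaginary part is zero.
            sum((h[i][j] * h[i][j].conjugate()).real for h in channel_samples) / n
            for j in range(cols)
        ]
        for i in range(rows)
    ]
```

Because each entry is a squared magnitude, the resulting matrix is real and non-negative, which is what allows it to be treated as an image-like fingerprint.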

Then the following four combinations of fingerprint and positioning method are considered: a) WKNN with the SFCPM fingerprint; b) WKNN with the proposed ADCPM fingerprint; c) the proposed 3D CNN with the SFCPM fingerprint; d) the proposed 3D CNN with the proposed ADCPM fingerprint. In the training phase, the artificial noiseless channel matrices are generated by the Monte-Carlo method to train the model for the 3D CNN or to construct the look-up table for WKNN. In the prediction phase, real noisy channels are used for positioning, where the average received signal-to-noise ratio (SNR) varies from 4 dB to 20 dB. By exploiting the sparsity of the ADCPM, we apply a threshold filter to the elements of the ADCPM such that the values below the threshold are set to 0. In doing so, the noise contamination in the inputs is partly suppressed.
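The threshold filter described above can be sketched as follows. The threshold value is not specified in this section, so it is left as a parameter; the function name and the example numbers are hypothetical.

```python
def threshold_fingerprint(adcpm, threshold):
    """Return a copy of the power matrix with every entry below `threshold` set to 0."""
    return [[p if p >= threshold else 0.0 for p in row] for row in adcpm]


# Made-up 2 x 2 power matrix: two dominant entries and two weak ones.
noisy = [[0.9, 0.01], [0.05, 0.2]]
print(threshold_fingerprint(noisy, 0.05))  # [[0.9, 0.0], [0.05, 0.2]]
```

Since the noiseless ADCPM is sparse, most small entries of a noisy ADCPM are likely to be noise, which is the motivation for zeroing them.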

Fig. 11 presents the average localization error versus the average received SNR. It can be observed that for both the searching-based method (i.e., WKNN) and the learning-based method (i.e., 3D CNN), the ADCPM fingerprint is more robust to noise contamination than its SFCPM counterpart, thanks to channel sparsity in the angle-delay domain. At high SNR, while the searching-based method performs comparably with the SFCPM and ADCPM fingerprints, the learning-based method favors the ADCPM fingerprint, which introduces sparsity. As the SNR grows, the performance of the learning-based method approaches that of the searching-based one with considerably reduced computational complexity and storage requirement. We conclude that the proposed ADCPM fingerprint is more robust to noise and that the proposed 3D CNN-enabled positioning method performs well with noisy inputs.

## VI Conclusion

In this paper, we have proposed a learning-based user positioning method using a 3D CNN for the 3D massive MIMO-OFDM system, exploiting the sparsity properties of channel statistics in the 3D angle-delay domain. We employed the 3D ADCPM as a new type of fingerprint which includes stable and stationary multipath characteristics, e.g., delay, power, and angle in the vertical and horizontal directions. By casting user positioning as a 3D image recognition problem, we proposed a novel 3D CNN architecture for regression to realize the fingerprint-based positioning. The simulation results demonstrated that the proposed method achieves high positioning accuracy, reduces the computational complexity and storage overhead, and is robust to noise contamination of the fingerprints.

## Appendix A Proof of Theorem 1

In order to prove Theorem 1, we provide the following preliminary results stated as Lemmas 1 and 2.

We define the following vectors

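$$\tilde{\mathbf{e}}^{(v)}_{p,k} = \frac{1}{\sqrt{M}} \mathbf{V}_M^H \mathbf{e}^{(v)}(\theta_{p,k}), \tag{25}$$

and

$$\tilde{\mathbf{e}}^{(h)}_{p,k} = \frac{1}{\sqrt{N}} \mathbf{V}_N^H \mathbf{e}^{(h)}(\theta_{p,k}, \varphi_{p,k}). \tag{26}$$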

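Lemma 1: When $M \to \infty$ and $N \to \infty$, the AOA is extracted explicitly from the array response vector via the DFT operation, given by

$$\lim_{M \to \infty} \tilde{\mathbf{e}}^{(v)}_{p,k} = \boldsymbol{\alpha}^{(\bar{m}_{p,k})}_{M}, \tag{27}$$

$$\lim_{N \to \infty} \tilde{\mathbf{e}}^{(h)}_{p,k} = \boldsymbol{\alpha}^{(\bar{n}_{p,k})}_{N}, \tag{28}$$

where

$$\boldsymbol{\alpha}^{(j)}_{i} = [\underbrace{0, \cdots, 0}_{j}, 1, \underbrace{0, \cdots, 0}_{i-j-1}]^T. \tag{29}$$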

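Proof: The $i$th element of $\tilde{\mathbf{e}}^{(v)}_{p,k}$ is calculated as

$$
\begin{aligned}
\left[\tilde{\mathbf{e}}^{(v)}_{p,k}\right]_i
&= \frac{1}{\sqrt{M}} \sum_{m=0}^{M-1} \frac{1}{\sqrt{M}}\, e^{\bar{\jmath} 2\pi \frac{m(i - M/2)}{M}}\, e^{-\bar{\jmath} 2\pi m \frac{d^{(v)}_t}{\lambda_c} \cos\theta_{p,k}} \\
&= \frac{1}{M}\, e^{-\bar{\jmath}\pi(M-1)\left(\frac{d^{(v)}_t}{\lambda_c}\cos\theta_{p,k} - \frac{i}{M} + \frac{1}{2}\right)} \cdot \frac{\sin\left(\frac{M}{2}\left(\frac{d^{(v)}_t}{\lambda_c}\cos\theta_{p,k} - \frac{i}{M} + \frac{1}{2}\right)\right)}{\sin\left(\frac{1}{2}\left(\frac{d^{(v)}_t}{\lambda_c}\cos\theta_{p,k} - \frac{i}{M} + \frac{1}{2}\right)\right)} \\
&= e^{-\bar{\jmath}\pi(M-1)\left(\frac{d^{(v)}_t}{\lambda_c}\cos\theta_{p,k} - \frac{i}{M} + \frac{1}{2}\right)} \cdot f_M\left(\frac{1}{2}\left(\frac{d^{(v)}_t}{\lambda_c}\cos\theta_{p,k} - \frac{i}{M} + \frac{1}{2}\right)\right),
\end{aligned} \tag{30}
$$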

where is defined in (10). From (30), if and only if . Further, if and only if , and in this case we have

$$\lim_{M \to \infty} f_M(x)\Big|_{x = n\pi,\; n = 0, \pm 1, \cdots} = \frac{\cos(Mx)}{\cos(x)}, \tag{31}$$

hence, if and only if the following constraint is satisfied

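$$\frac{1}{2}\left(\frac{d^{(v)}_t}{\lambda_c}\cos\theta_{p,k} - \frac{i}{M} + \frac{1}{2}\right) = n\pi, \quad n = 0, \pm 1, \cdots. \tag{32}$$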

Owing to , we derive that if and only if

$$i = \bar{m}_{p,k} = \frac{M}{2} + M \frac{d^{(v)}_t}{\lambda_c}\cos\theta_{p,k}, \tag{33}$$

where in the case , is sufficient to be an integer. In this case, we derive from (30) and (31) that

$$\lim_{M \to \infty} \tilde{\mathbf{e}}^{(v)}_{p,k} = \boldsymbol{\alpha}^{(\bar{m}_{p,k})}_{M}. \tag{34}$$

Therefore, (27) is obtained. (28) can be derived by using the same method. ∎
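As a numerical sanity check of (33), the sketch below evaluates the sum in the first line of (30) for a finite $M$ and locates its peak. The array size, the half-wavelength element spacing, and the value of $\cos\theta_{p,k}$ are assumed values chosen so that $\bar{m}_{p,k}$ is an integer; the function name is hypothetical.

```python
import cmath


def angle_domain_response(M, spacing_over_wavelength, cos_theta):
    """Evaluate the first line of (30) for i = 0, ..., M-1."""
    out = []
    for i in range(M):
        acc = sum(
            cmath.exp(2j * cmath.pi * m * (i - M / 2) / M)
            * cmath.exp(-2j * cmath.pi * m * spacing_over_wavelength * cos_theta)
            for m in range(M)
        )
        out.append(acc / M)  # the two 1/sqrt(M) factors combine into 1/M
    return out


M = 64
resp = angle_domain_response(M, 0.5, 0.25)  # assumed spacing d/lambda = 0.5, cos(theta) = 0.25
peak = max(range(M), key=lambda i: abs(resp[i]))
m_bar = M / 2 + M * 0.5 * 0.25              # index predicted by (33)
print(peak, m_bar)                          # 40 40.0
```

For these values the response has unit magnitude at index 40 and is numerically zero elsewhere, in line with the one-hot vector in (27). When $\bar{m}_{p,k}$ is not an integer and $M$ is finite, the power leaks into neighboring indices.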

Lemma 1 establishes the relationship between the array response vector and the deterministic component in the angle domain for 3D massive MIMO antennas, which guides us to extract both vertical and horizontal AOA specified by and respectively in the 3D space.

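Based on Lemma 1, the CIR vector in the angle domain of the $p$th path for the $k$th MT can be written as

$$\tilde{\mathbf{q}}_{p,k} = \frac{1}{\sqrt{MN}} \left(\mathbf{V}_M^H \otimes \mathbf{V}_N^H\right) \mathbf{q}_{p,k}, \tag{35}$$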

in which the index indicates the AOA and the corresponding element indicates the channel gain in the AOA direction.

Lemma 2: For 3D MIMO systems with a UPA at the BS, when $M \to \infty$, the average channel power of each path is concentrated at a specific position in the vertical angle domain, given by

 (36)

When $N \to \infty$, a similar conclusion holds for the average channel power of each path in the horizontal angle domain, given by

 (37)

When $M \to \infty$ and $N \to \infty$, the average channel power of each path is concentrated in both the vertical and horizontal angle domains, given by

 (38)

Proof: Substituting (4) and (1) into (35), we have

 (39)

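When $M \to \infty$, substituting (27) into (39), we have

$$\lim_{M \to \infty} \tilde{\mathbf{q}}_{p,k} = a_{p,k}\, \boldsymbol{\alpha}^{(\bar{m}_{p,k})}_{M} \otimes \tilde{\mathbf{e}}^{(h)}_{p,k}. \tag{40}$$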

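Similar to (30), the $i$th element of $\tilde{\mathbf{e}}^{(h)}_{p,k}$ is calculated as

$$\lim_{M \to \infty} \left[\tilde{\mathbf{e}}^{(h)}_{p,k}\right]_i = e^{-\bar{\jmath}\pi(N-1)\left(\frac{\bar{n}_{p,k} - i}{N}\right)} f_N\left(\frac{\bar{n}_{p,k} - i}{2N}\right). \tag{41}$$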

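Therefore, the $i$th element of $\tilde{\mathbf{q}}_{p,k}$ is written as

$$\lim_{M \to \infty} \left[\tilde{\mathbf{q}}_{p,k}\right]_i = a_{p,k}\, e^{-\bar{\jmath}\pi(N-1)\left(\frac{\bar{n}_{p,k} - \langle i \rangle_N}{N}\right)} \cdot f_N\left(\frac{\bar{n}_{p,k} - \langle i \rangle_N}{2N}\right) \delta\left(\lfloor i/N \rfloor - \bar{m}_{p,k}\right). \tag{42}$$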

From the assumption that , the average channel power in angle domain is calculated as

 (43)

Therefore, (36) is obtained. When $N \to \infty$, (37) can be derived by using the same method.

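When $M \to \infty$ and $N \to \infty$, (38) is a special case of (36) and (37). Substituting (27) and (28) into (39), we derive

$$\lim_{M \to \infty, N \to \infty} \tilde{\mathbf{q}}_{p,k} = a_{p,k}\, \boldsymbol{\alpha}^{(\bar{m}_{p,k})}_{M} \otimes \boldsymbol{\alpha}^{(\bar{n}_{p,k})}_{N}. \tag{44}$$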

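Hence, the $i$th element of $\tilde{\mathbf{q}}_{p,k}$ can be written as

$$\lim_{M \to \infty, N \to \infty} \left[\tilde{\mathbf{q}}_{p,k}\right]_i = a_{p,k}\, \delta\left(i - \bar{m}_{p,k} N - \bar{n}_{p,k}\right). \tag{45}$$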

From the assumption that , the average channel power in angle domain is calculated as

 (46)

Therefore, (38) is obtained. ∎

Lemma 2 establishes the relationship between the CIR in the antenna domain and that in the angle domain for 3D massive MIMO systems. When the number of UPA’s antennas is sufficiently large, the average channel powers corresponding to different AOA can be resolved via DFT operation. The resolution of power and angle in the vertical and horizontal directions depends on the number of antennas in the vertical and horizontal directions respectively.
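The index bookkeeping behind (42), (44), and (45) follows from the Kronecker product of two one-hot vectors: the single nonzero entry lands at $i = \bar{m}_{p,k} N + \bar{n}_{p,k}$, so that $\lfloor i/N \rfloor$ recovers the vertical index and $\langle i \rangle_N$ the horizontal one. A small sketch with hypothetical sizes and indices (zero-based, as in (29)):

```python
def one_hot(length, j):
    """alpha_length^(j) as in (29): zeros except for a 1 at zero-based index j."""
    return [1 if idx == j else 0 for idx in range(length)]


def kron(a, b):
    """Kronecker product of two vectors given as lists."""
    return [x * y for x in a for y in b]


M, N = 4, 8
m_bar, n_bar = 2, 5                  # hypothetical integer angle indices
v = kron(one_hot(M, m_bar), one_hot(N, n_bar))
i = v.index(1)
print(i, i // N, i % N)              # 21 2 5
```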

The th column of is calculated as

 (47)

where is defined by substituting for in (10). When , if and only if . In this case, from (31), we have

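$$\frac{1}{\sqrt{N_c}} \left[\mathbf{H}_k \mathbf{F}^{*}_{N_c \times N_g}\right]_j = \sum_{p=1}^{N_p} a_{p,k}\, \mathbf{e}(\theta_{p,k}, \varphi_{p,k})\, e^{-j\pi(N_c - 1)} \cdot \delta(j - r_{p,k}). \tag{48}$$

Therefore, we derive

$$\lim_{N_c \to \infty} \frac{1}{\sqrt{N_c}} \mathbf{H}_k \mathbf{F}^{*}_{N_c \times N_g} = \sum_{p=1}^{N_p} a_{p,k}\, \mathbf{e}(\theta_{p,k}, \varphi_{p,k}) \left[\boldsymbol{\alpha}^{(r_{p,k})}_{N_g}\right]^T. \tag{49}$$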

Substituting (1), (25), (26), and (49) into (7), we have

 (50)

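When $M \to \infty$ and $N_c \to \infty$, according to (27) and (41), we have

$$\lim_{M \to \infty, N_c \to \infty} \left[\mathbf{G}_k\right]_{i,j} = \sum_{p=1}^{N_p} a_{p,k}\, f_N\left(\frac{\bar{n}_{p,k} - \langle i \rangle_N}{2N}\right) \cdot \delta\left(\lfloor i/N \rfloor - \bar{m}_{p,k}\right) \delta(j - r_{p,k}). \tag{51}$$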

From the assumption that , the average channel power in angle-delay domain is calculated as

 (52)

Substituting (9) into (52), (11) is obtained.

When , and , (13) can be derived by using the same method.

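When $M \to \infty$, $N \to \infty$, and $N_c \to \infty$, according to (27) and (28), we have

$$\lim_{M \to \infty, N \to \infty, N_c \to \infty} \mathbf{G}_k = \sum_{p=1}^{N_p} a_{p,k} \left(\boldsymbol{\alpha}^{(\bar{m}_{p,k})}_{M} \otimes \boldsymbol{\alpha}^{(\bar{n}_{p,k})}_{N}\right) \cdot \left[\boldsymbol{\alpha}^{(r_{p,k})}_{N_g}\right]^T. \tag{53}$$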

From the assumption that , the average channel power in angle-delay domain is calculated as

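$$\lim_{M \to \infty, N \to \infty, N_c \to \infty} \mathbb{E}\left\{\left|\left[\mathbf{G}_k\right]_{i,j}\right|^2\right\} = \sum_{p=1}^{N_p} \sigma^2_{p,k}\, \delta\left(i - \bar{m}_{p,k} N - \bar{n}_{p,k}\right) \cdot \delta(j - r_{p,k}). \tag{54}$$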

Substituting (9) into (54), we get (15).
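The right-hand side of (54) states that, in the limit, each path contributes its power to a single entry of the angle-delay power matrix. The sketch below builds such a matrix from a list of path parameters. The function name, the tuple layout `(m_bar, n_bar, r, power)`, and the example values are hypothetical, and zero-based integer indices are assumed.

```python
def limiting_adcpm(M, N, Ng, paths):
    """Build the MN x Ng matrix on the right-hand side of (54).

    paths: iterable of (m_bar, n_bar, r, power) with integer indices.
    """
    P = [[0.0] * Ng for _ in range(M * N)]
    for m_bar, n_bar, r, power in paths:
        # Paths sharing the same angle-delay bin add up their powers.
        P[m_bar * N + n_bar][r] += power
    return P


P = limiting_adcpm(2, 3, 4, [(1, 2, 3, 0.5), (0, 0, 1, 0.25), (1, 2, 3, 0.25)])
nonzero = [(i, j, p) for i, row in enumerate(P) for j, p in enumerate(row) if p > 0]
print(nonzero)  # [(0, 1, 0.25), (5, 3, 0.75)]
```

At most $N_p$ entries are nonzero, which is the sparsity that the ADCPM fingerprint exploits.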

## References

• [1] W. H. Chin, Z. Fan, and R. Haines, “Emerging technologies and research challenges for 5G wireless networks,” IEEE Wireless Commun., vol. 21, no. 2, pp. 106–112, April 2014.
• [2] B. A. Renfro, M. Stein, and N. Boeker, “An analysis of global positioning system (GPS) standard positioning service (SPS) performance for 2018,” Space Geophys. Lab. Appl. Res. Lab., Univ. Texas Austin, Tech. Rep. TR-SGL-19-02, Mar. 2018. [Online]. Available: https://www.gps.gov/systems/gps/performance/
• [3] M. H. Bergen, X. Jin, D. Guerrero, H. A. L. F. Chaves, N. V. Fredeen, and J. F. Holzman, “Design and implementation of an optical receiver for angle-of-arrival-based positioning,” J. Light. Technol., vol. 35, no. 18, pp. 3877–3885, Sept 2017.
• [4] A. Y. Z. Xu, E. K. S. Au, A. K. S. Wong, and Q. Wang, “A novel threshold-based coherent TOA estimation for IR-UWB systems,” IEEE Trans. Veh. Technol., vol. 58, no. 8, pp. 4675–4681, Oct 2009.
• [5] Y. Yuan, S. Hou, and Q. Zhao, “An improved TDOA localization algorithm based on wavelet transform,” in 2017 7th IEEE International Conference on Electronics Information and Emergency Communication (ICEIEC), July 2017, pp. 111–114.
• [6] W. Cui, L. Zhang, B. Li, J. Guo, W. Meng, H. Wang, and L. Xie, “Received signal strength based indoor positioning using a random vector functional link network,” IEEE Trans. Ind. Informat., vol. 14, no. 5, pp. 1846–1855, May 2018.
• [7] K. N. R. S. V. Prasad, E. Hossain, and V. K. Bhargava, “Machine learning methods for RSS-based user positioning in distributed massive MIMO,” IEEE Trans. Wireless Commun., vol. 17, no. 12, pp. 8402–8417, Dec 2018.
• [8] Z. Wang, H. Zhang, T. Lu, and T. A. Gulliver, “Cooperative RSS-based localization in wireless sensor networks using relative error estimation and semidefinite programming,” IEEE Trans. Veh. Technol., vol. 68, no. 1, pp. 483–497, Jan 2019.
• [9] S. Tomic, M. Beko, and R. Dinis, “3-D target localization in wireless sensor networks using RSS and AoA measurements,” IEEE Trans. Veh. Technol., vol. 66, no. 4, pp. 3197–3210, April 2017.
• [10] H. Chen, Y. Zhang, W. Li, X. Tao, and P. Zhang, “ConFi: Convolutional neural networks based indoor Wi-Fi localization using channel state information,” IEEE Access, vol. 5, pp. 18 066–18 074, 2017.
• [11] X. Wang, L. Gao, S. Mao, and S. Pandey, “CSI-based fingerprinting for indoor localization: A deep learning approach,” IEEE Trans. Veh. Technol., vol. 66, no. 1, pp. 763–776, Jan 2017.
• [12] G. Wu and P. Tseng, “A deep neural network-based indoor positioning method using channel state information,” in 2018 International Conference on Computing, Networking and Communications (ICNC), March 2018, pp. 290–294.
• [13] A. Decurninge, L. G. Ordonez, P. Ferrand, H. Gaoning, L. Bojie, Z. Wei, and M. Guillaud, “CSI-based outdoor localization for massive MIMO: Experiments with a learning approach,” in 2018 15th International Symposium on Wireless Communication Systems (ISWCS), Aug 2018, pp. 1–6.
• [14] C. Sun, X. Gao, S. Jin, M. Matthaiou, Z. Ding, and C. Xiao, “Beam division multiple access transmission for massive MIMO communications,” IEEE Trans. Commun., vol. 63, no. 6, pp. 2170–2184, June 2015.
• [15] L. You, X. Gao, A. L. Swindlehurst, and W. Zhong, “Channel acquisition for massive MIMO-OFDM with adjustable phase shift pilots,” IEEE Trans. Signal Process., vol. 64, no. 6, pp. 1461–1476, March 2016.
• [16] X. Li, S. Jin, H. A. Suraweera, J. Hou, and X. Gao, “Statistical 3-D beamforming for large-scale MIMO downlink systems over rician fading channels,” IEEE Trans. Commun., vol. 64, no. 4, pp. 1529–1543, April 2016.
• [17] X. Sun, X. Gao, G. Y. Li, and W. Han, “Single-site localization based on a new type of fingerprint for massive MIMO-OFDM systems,” IEEE Trans. Veh. Technol., vol. 67, no. 7, pp. 6134–6145, July 2018.
• [18] B. Wang, F. Gao, S. Jin, H. Lin, and G. Y. Li, “Spatial- and frequency-wideband effects in millimeter-wave massive MIMO systems,” IEEE Trans. Signal Process., vol. 66, no. 13, pp. 3393–3406, July 2018.
• [19] N. Garcia, H. Wymeersch, E. G. Larsson, A. M. Haimovich, and M. Coulon, “Direct localization for massive MIMO,” IEEE Trans. Signal Process., vol. 65, no. 10, pp. 2475–2487, May 2017.
• [20] M. Arnold, S. Dorner, S. Cammerer, and S. Ten Brink, “On deep learning-based massive MIMO indoor user localization,” in 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), June 2018, pp. 1–5.
• [21] X. Wang, L. Gao, S. Mao, and S. Pandey, “CSI-based fingerprinting for indoor localization: A deep learning approach,” IEEE Trans. Veh. Technol., vol. 66, no. 1, pp. 763–776, Jan 2017.
• [22] J. Vieira, E. Leitinger, M. Sarajlic, X. Li, and F. Tufvesson, “Deep convolutional neural networks for massive MIMO fingerprint-based positioning,” in 2017 IEEE 28th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC), Oct 2017, pp. 1–6.
• [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
• [24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
• [25] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
• [26] C. Szegedy, S. Ioffe, and V. Vanhoucke, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” CoRR, vol. abs/1602.07261, 2016. [Online]. Available: http://arxiv.org/abs/1602.07261
• [27] M. Lin, Q. Chen, and S. Yan, “Network in network,” CoRR, vol. abs/1312.4400, 2013.
• [28] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015. [Online]. Available: http://arxiv.org/abs/1502.03167
• [29] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
• [30] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 818–833.
• [31] S. Jaeckel, L. Raschkowski, K. Borner, and L. Thiele, “QuaDRiGa: A 3-D multi-cell channel model with time evolution for enabling virtual field trials,” IEEE Trans. Antennas Propag., vol. 62, no. 6, pp. 3242–3256, June 2014.
• [32] P. K. et al., “WINNER II Channel Models,” Version 1.1, Sep. 2007. [Online]. Available: http://www.ist-winner.org/
• [33] “Study on 3D channel model for LTE,” Version 12.7.0, document 3GPP T.R. 36.873, Dec. 2017.
• [34] “Study on channel model for frequencies from 0.5 to 100 GHz,” Version 15.0.0, June 2018.
• [35] X. Sun, X. Gao, G. Y. Li, and W. Han, “Fingerprint based single-site localization for massive MIMO-OFDM systems,” in GLOBECOM 2017 - 2017 IEEE Global Communications Conference, Dec 2017, pp. 1–7.
• [36] M. Herdin, N. Czink, H. Ozcelik, and E. Bonek, “Correlation matrix distance, a meaningful measure for evaluation of non-stationary MIMO channels,” in 2005 IEEE 61st Vehicular Technology Conference, vol. 1, May 2005, pp. 136–140.