Learning to Localize: A 3D CNN Approach to User Positioning in Massive MIMO-OFDM Systems
Abstract
In this paper, we consider the user positioning problem in the massive multiple-input multiple-output (MIMO) orthogonal frequency-division multiplexing (OFDM) system with a uniform planar antenna (UPA) array. Taking advantage of the UPA array geometry and the wide bandwidth, we advocate the use of the angle-delay channel power matrix (ADCPM) as a new type of fingerprint to replace the traditional ones. The ADCPM embeds the stable and stationary multipath characteristics, e.g., delay, power, and angle in the vertical and horizontal directions, which are beneficial to positioning. Taking ADCPM fingerprints as the inputs, we propose a novel three-dimensional (3D) convolutional neural network (CNN) enabled learning method to localize users' 3D positions. In particular, such a 3D CNN model consists of a convolution refinement module to refine the elementary feature maps from the ADCPM fingerprints, three extended Inception modules to extract the advanced feature maps, and a regression module to estimate the 3D positions. Through extensive simulations, the proposed 3D CNN-enabled positioning method is demonstrated to achieve higher positioning accuracy than traditional searching-based methods, with reduced computational complexity and storage overhead, and the ADCPM fingerprints are shown to be more robust to noise contamination.
I Introduction
As location-based applications (LBAs) are extensively deployed in modern society, accurate positioning has received enormous attention in both industry and academia [1]. The global positioning system (GPS) provides real-time outdoor positioning for the mobile terminal (MT), reaching an accuracy of several meters [2]. However, in urban areas, GPS positioning performance degrades significantly due to blockage by buildings, cars, and pedestrians.
Recently, user positioning that exploits the rich information of multipath wireless propagation has drawn a lot of attention. Various positioning methods have been proposed in the literature, including geometry-based and fingerprint-based positioning methods. Geometry-based positioning is a triangulate-to-localize approach that relies on information of the wireless signals from the users to the base stations (BSs), e.g., the angle of arrival (AOA) [3], time of arrival (TOA) [4], time difference of arrival (TDOA) [5], and received signal strength (RSS) [6]. However, the estimation errors of AOA/TOA/TDOA/RSS have a crucial influence on positioning accuracy in complex urban environments. In contrast, fingerprint-based positioning is a match-to-localize approach that consists of offline fingerprint database construction and online fingerprint matching and location prediction. In the offline phase, reference points (RPs) are selected, and the pairs of fingerprints and corresponding positions are stored in the database. In the online phase, the position of the MT is estimated by searching the collected database and matching the input fingerprint with the stored ones. As an indicator of the surrounding environment, fingerprints have been widely adopted for user positioning in complex multipath environments.
The most common feature used in fingerprint-based positioning is the RSS [7, 8, 9], yet the RSS-based fingerprint has two shortcomings. On the one hand, RSS suffers from fast-fading fluctuations and hardware heterogeneity and is therefore unstable for positioning. On the other hand, RSS only captures the coarsest channel information, which cannot meet the demand in complex communication environments. Recently, some researchers have proposed to use the channel state information (CSI) as the fingerprint [10, 11, 12, 13]. Capturing more channel information than RSS, CSI has the potential to enhance the positioning accuracy. In certain scenarios (e.g., wireless sensor networks [11] and WiFi networks [10]), however, due to the limited bandwidth and number of antennas, the CSI fingerprint is insufficient to capture the multipath characteristics because of the low resolution in the spatial or frequency domain.
Fortunately, such limitations can be overcome in massive multiple-input multiple-output orthogonal frequency-division multiplexing (MIMO-OFDM) systems. Thanks to the large-scale antenna array and wide bandwidth, CSI fingerprints are able to capture rich multipath information, including powers, angles, and delays [14, 15, 16, 17, 18], for positioning. Various fingerprint-based positioning techniques have been proposed for massive MIMO/MIMO-OFDM systems. Of particular relevance is the approach proposed in [17], where a weighted k-nearest-neighbor (WKNN) algorithm is applied using the angle-delay domain channel information as fingerprints in the massive MIMO-OFDM system. Other approaches to positioning include the use of sophisticated techniques such as Gaussian process regression in [7] and compressive sensing in [19], to name a few.
Most recently, deep learning has found application in user positioning, inspired by its great success in image recognition, speech signal processing, and self-driving. As a matter of fact, fingerprint-based positioning can be cast as an image recognition problem, in which the fingerprints are treated as images to be recognized. For indoor positioning, a two-step-training deep neural network (DNN) based positioning method was proposed in [20] for the NLOS massive MIMO scenario, a single DNN classifier determining the probabilities of the MT being located at the collected RPs was used in [12], and a deep learning-based positioning method built on classic deep belief nets (DBNs) with a stack of restricted Boltzmann machines (RBMs) was proposed in [21]. For outdoor positioning, a deep convolutional neural network (CNN) was utilized in [22] to map the CSI into 2D position coordinates.
However, the aforementioned methods were mainly dedicated to 2D positioning. When it comes to the UPA array, high angular resolution can be realized in both the vertical and horizontal directions, which provides new opportunities for fingerprint-based 3D positioning. To this end, we propose a novel deep learning-based 3D positioning method for the MIMO-OFDM system with the UPA array, taking the angle-delay domain channel power matrix (ADCPM), which contains the multipath angle, delay, and power information, as the fingerprint. Instead of using CSI fingerprints in the spatial-delay domain, we translate them into the angle-delay domain with a 3D discrete Fourier transform (DFT), by which the sparsity structures can be fully exploited. To deal with the high dimensionality of the ADCPM, we propose a regression-oriented 3D CNN model that maps the ADCPM fingerprints into the 3D position coordinates directly. In particular, our 3D CNN model consists of four key components: (1) a convolution refinement module to refine the elementary feature maps from the ADCPM fingerprints; (2) 3D Inception modules extended from the 2D convolution structures in AlexNet [23] and the Inception blocks in GoogLeNet [24, 25, 26]; (3) average pooling to replace the fully-connected layer at the bottom of the network, as suggested in [27], to reduce the number of network parameters; (4) batch normalization (BN) in each layer to improve the convergence speed and generalization ability of the network [28]. Such an end-to-end positioning method works as follows. In the offline training phase, the 3D CNN is trained using the ADCPM fingerprints and the corresponding coordinates of the RPs. In the online prediction phase, the trained 3D CNN takes new ADCPM fingerprints for position prediction.
To summarize, our contributions are threefold:

1) For massive MIMO-OFDM systems with the UPA array, we propose a new type of fingerprint, the ADCPM, which embeds rich and stable multipath characteristics that are closely related to the position information.

2) For massive MIMO-OFDM systems with the UPA array, we propose a 3D CNN-based positioning method, which achieves higher positioning accuracy than searching-based methods (e.g., WKNN) with reduced storage and computational overhead.

3) We build a simulator and conduct extensive experiments to evaluate the proposed 3D CNN fingerprint-based positioning method with respect to positioning accuracy, storage, and computational overhead.
The rest of the paper is organized as follows. In Section II, we investigate the 3D MIMO-OFDM system and propose a new type of fingerprint extracted from the multipath channel characteristics. In Section III, we introduce the design of our proposed 3D CNN-enabled positioning method, followed by the detailed network architecture of our 3D CNN model in Section IV. Simulation results are presented in Section V, and conclusions are given in Section VI.
Notations: We use $j$ to denote the imaginary unit. Vectors and matrices are denoted by lowercase and uppercase boldfaced characters respectively, and the element indices of vectors and matrices start from 0. We use $[\mathbf{a}]_i$, $[\mathbf{A}]_{i,q}$, and $[\mathbf{A}]_{:,q}$ to denote the $i$-th element of the vector $\mathbf{a}$, the $(i,q)$-th element of the matrix $\mathbf{A}$, and the $q$-th column of the matrix $\mathbf{A}$ respectively. The superscripts $(\cdot)^T$, $(\cdot)^H$, and $(\cdot)^*$ indicate the matrix transpose, conjugate-transpose, and conjugate operations. The complex number field, real number field, and integer field are represented by $\mathbb{C}$, $\mathbb{R}$, and $\mathbb{Z}$. The symbol $\otimes$ denotes the Kronecker product of two matrices. We use $\mathbb{E}\{\cdot\}$ to denote the expectation of a random variable or random vector. $\lfloor x \rfloor$ denotes the largest integer not greater than $x$, $\langle \cdot \rangle$ denotes the modulo operation, and $\delta(\cdot)$ denotes the delta function.
II 3D MIMO-OFDM System and Channel Characteristics
In this section, we start with the 3D massive MIMO-OFDM system modeling. The BS is equipped with a uniform planar array (UPA), comprising $N_h$ antennas in each row and $N_v$ antennas in each column. Then we introduce the new type of fingerprint with the angle and delay information extracted from the channel characteristics.
II-A Channel Model
We consider the uplink transmission in a wideband massive MIMO wireless system. Owing to occlusions and reflections, wireless signals propagate through multiple paths. Assume that the number of multipaths is $P$ and the number of MTs is $K$; the AOA of the $p$-th path for the $k$-th MT can then be decomposed into the elevation angle $\theta_{k,p}$ in the vertical direction and the azimuth angle $\phi_{k,p}$ in the horizontal direction, as shown in Fig. 1. Thus, the array response vector can be written as
$\mathbf{a}(\theta_{k,p}, \phi_{k,p}) = \mathbf{a}_v(\theta_{k,p}) \otimes \mathbf{a}_h(\theta_{k,p}, \phi_{k,p})$  (1)
with
$\mathbf{a}_v(\theta_{k,p}) = \left[1,\, e^{-j 2\pi \frac{d_v}{\lambda} \cos\theta_{k,p}},\, \ldots,\, e^{-j 2\pi \frac{d_v}{\lambda} (N_v - 1) \cos\theta_{k,p}}\right]^T$  (2)
and
$\mathbf{a}_h(\theta_{k,p}, \phi_{k,p}) = \left[1,\, e^{-j 2\pi \frac{d_h}{\lambda} \sin\theta_{k,p} \sin\phi_{k,p}},\, \ldots,\, e^{-j 2\pi \frac{d_h}{\lambda} (N_h - 1) \sin\theta_{k,p} \sin\phi_{k,p}}\right]^T$  (3)
where $d_v$ and $d_h$ are the antenna spacings in the column and row respectively, and $\lambda$ is the carrier wavelength. Then, the channel impulse response (CIR) of the $p$-th path for the $k$-th MT is represented by
$\mathbf{h}_{k,p}(t) = \alpha_{k,p}\, \delta(t - \tau_{k,p})\, \mathbf{a}(\theta_{k,p}, \phi_{k,p})$  (4)
We consider OFDM modulation with $N_c$ subcarriers and a sample interval of $T_s$. We use $T_{\mathrm{sym}}$ and $T_g = N_g T_s$ to denote the OFDM symbol duration and the cyclic prefix (a.k.a. guard interval) duration respectively. We assume the cyclic prefix duration is larger than the maximum channel delay of all the MTs. Let $\tau_{k,p}$ be the time of arrival (TOA) of the $p$-th path for the $k$-th MT. The frequency of the $n$-th subcarrier is $f_n = n/(N_c T_s)$. Thus, the channel frequency response (CFR) associated with the $k$-th MT and the $n$-th subcarrier is written as
$\mathbf{g}_{k,n} = \sum_{p=0}^{P-1} \alpha_{k,p}\, e^{-j 2\pi f_n \tau_{k,p}}\, \mathbf{a}(\theta_{k,p}, \phi_{k,p})$  (5)
where $\alpha_{k,p}$ is the complex gain of the $p$-th path. The space-frequency channel response matrix (SFCRM) of the $k$-th MT known to the BS can be denoted by the concatenation of the per-subcarrier CFRs, i.e.,
$\mathbf{G}_k = \left[\mathbf{g}_{k,0},\, \mathbf{g}_{k,1},\, \ldots,\, \mathbf{g}_{k,N_c-1}\right] \in \mathbb{C}^{N_v N_h \times N_c}$  (6)
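For concreteness, the multipath CFR construction in (5)-(6) can be sketched in NumPy. The function names, the path-tuple format, and the half-wavelength spacing are illustrative assumptions rather than part of the system model:

```python
import numpy as np

def steering_vector(n_ant, spacing_wl, direction_cosine):
    # ULA steering phases along one axis of the UPA (spacing in wavelengths)
    k = np.arange(n_ant)
    return np.exp(-2j * np.pi * spacing_wl * k * direction_cosine)

def cfr_matrix(paths, n_v, n_h, n_sc, ts, d_wl=0.5):
    """Space-frequency channel response matrix (N_v*N_h x N_sc), cf. (5)-(6).
    `paths` is a hypothetical list of (gain, tau, theta, phi) tuples."""
    G = np.zeros((n_v * n_h, n_sc), dtype=complex)
    freqs = np.arange(n_sc) / (n_sc * ts)            # subcarrier frequencies
    for gain, tau, theta, phi in paths:
        a_v = steering_vector(n_v, d_wl, np.cos(theta))
        a_h = steering_vector(n_h, d_wl, np.sin(theta) * np.sin(phi))
        a = np.kron(a_v, a_h)                        # UPA response, cf. (1)
        G += gain * np.outer(a, np.exp(-2j * np.pi * freqs * tau))
    return G
```

Each path contributes a rank-one term: its UPA response across the antennas times its delay-induced phase ramp across the subcarriers.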
II-B Fingerprint from the Angle-Delay Domain
Define $\mathbf{V}_{N_v} \in \mathbb{C}^{N_v \times N_v}$ and $\mathbf{V}_{N_h} \in \mathbb{C}^{N_h \times N_h}$ as phase-shifted discrete Fourier transform (DFT) matrices with the $(i,q)$-th elements $\frac{1}{\sqrt{N_v}} e^{-j 2\pi i (q - N_v/2)/N_v}$ and $\frac{1}{\sqrt{N_h}} e^{-j 2\pi i (q - N_h/2)/N_h}$ respectively. Define $\mathbf{F} \in \mathbb{C}^{N_c \times N_g}$ as the matrix composed of the first $N_g$ columns of the $N_c \times N_c$ DFT matrix with the $(i,q)$-th element $\frac{1}{\sqrt{N_c}} e^{-j 2\pi i q / N_c}$.
The fingerprints used in positioning are required to be closely linked to the MTs' positions. The CFR describes the space-frequency domain characteristics, but it is hard to build an intuitive relationship between the CFR and position. Due to complex and changeable multipath propagation in the wireless channel, the AOAs and TOAs of the received signals are unique for different positions. Therefore, it suffices to extract a fingerprint from the angle-delay domain. We transform the CFR matrix into an angle-delay domain matrix $\mathbf{H}_k$, referred to as the angle-delay domain channel response matrix (ADCRM) of the $k$-th MT, given by
$\mathbf{H}_k \triangleq \left(\mathbf{V}_{N_v} \otimes \mathbf{V}_{N_h}\right)^H \mathbf{G}_k\, \mathbf{F}^* \in \mathbb{C}^{N_v N_h \times N_g}$  (7)
As such, the angle-delay domain channel power matrix (ADCPM) of the $k$-th MT is introduced and used as a fingerprint hereafter, i.e.,
$[\boldsymbol{\Omega}_k]_{i,q} \triangleq \mathbb{E}\left\{\left|[\mathbf{H}_k]_{i,q}\right|^2\right\}$  (8)
with
(9) 
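The angle-delay transform of (7) and the power averaging of (8) can be sketched as follows. Plain unitary DFT matrices are used here in place of the phase-shifted ones, which only shifts the angle indexing while preserving the sparsity structure:

```python
import numpy as np

def dft(n):
    # unitary DFT matrix of size n x n
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)

def adcrm(G, n_v, n_h, n_g):
    """Angle-delay domain channel response matrix, cf. (7)."""
    V = np.kron(dft(n_v), dft(n_h))   # angle-domain transform over the UPA
    F = dft(G.shape[1])[:, :n_g]      # first n_g columns of the DFT matrix
    return V.conj().T @ G @ F.conj()

def adcpm(G_samples, n_v, n_h, n_g):
    """Average element-wise power over channel realizations, cf. (8)."""
    H = np.stack([adcrm(G, n_v, n_h, n_g) for G in G_samples])
    return np.mean(np.abs(H) ** 2, axis=0)
```

The expectation in (8) is approximated by averaging over the available channel realizations.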
Define the following function:
(10) 
Theorem 1: For 3D MIMO-OFDM systems with a UPA at the BS, when $N_v \to \infty$ and $N_c \to \infty$, the ADCPM is concentrated on specific positions in the vertical angle-delay domain, given by
(11) 
where
(12) 
When $N_h \to \infty$ and $N_c \to \infty$, the ADCPM admits a similar conclusion in the horizontal angle-delay domain, given by
(13) 
where
(14) 
When $N_v \to \infty$, $N_h \to \infty$, and $N_c \to \infty$, the ADCPM is concentrated on the angle and delay directions determined by each path's AOA and TOA, given by
(15) 
Proof: See Appendix A. ∎
Remark 1: For 3D massive MIMO-OFDM systems, by (7) and (8), the CFR in the space-frequency domain is translated into the ADCPM in the angle-delay domain. Theorem 1 reveals that each element of the ADCPM corresponds to the average channel power at a specific AOA and TOA.
Remark 2: In Theorem 1, the number of paths determines the degree of sparsity, while their AOAs and TOAs specify the sparsity pattern. When $N_v \to \infty$, $N_h \to \infty$, and/or $N_c \to \infty$, the ADCPM is asymptotically a sparse matrix in the sense that most of its elements are equal to zero. For finite $N_v$, $N_h$, and $N_c$, it will be shown later by numerical results that the sparsity is maintained.
Remark 3: From Theorem 1, the sparsity pattern of the ADCPM depends on both the AOAs and the TOAs of the multiple paths. For two MTs located at different positions, it is unlikely that all the multipath components of their signals are identical, and hence their ADCPMs differ as well. As such, the ADCPM can serve as a unique indicator to discriminate MTs at different geographical positions.
According to Theorem 1, the ADCPM is well suited to serve as the fingerprint for positioning, as it meets the following requirements:

1) The ADCPMs are closely related to geographic locations. As stated above, the ADCPM embeds information of the AOAs, the TOAs, and the channel power corresponding to a specific geographic location.

2) The ADCPMs have a sufficient degree of discrimination between different geographical locations, which increases as two locations move farther apart.

3) The ADCPM is stationary in the sense that it remains unchanged over a relatively long period for a given location. In the multipath environment, as long as the distribution of the scatterers does not change, the angles and delays remain unchanged, and so does the ADCPM.

In addition, the ADCPM is convenient to extract from the channel state information (CSI) at the BS through wideband signal processing. To conclude, the ADCPM provides a highly differentiable, stable, and easily accessible indicator for different geographic locations, and is therefore an ideal fingerprint for positioning.
III Convolutional Neural Network for Positioning
Given the ADCPM as the fingerprint in Section II-B, the problem arises as to how to realize positioning by exploiting its structural properties.
As the ADCPM fingerprint can be viewed as an image, the widely used end-to-end image recognition approach, deep learning with convolutional neural networks (CNNs), can be applied here for positioning. Thanks to its spatial invariance, parameter sharing, and hierarchical representations, a CNN is more efficient than traditional fully-connected networks when dealing with high-dimensional inputs [29]. For 3D massive MIMO-OFDM systems, the high-dimensional ADCPM has sparsity patterns, which suggests that a CNN could be an ideal tool to extract the positioning-related features from the ADCPM and convert them into position information with relatively low computational complexity.
III-A The Sparse ADCPM as Input
As shown in Fig. 2, the asymptotic sparsity pattern of the ADCPM in the angle-delay domain established in Theorem 1 still holds in the practical setting with a finite number of antennas and limited bandwidth. Owing to this sparsity pattern, the ADCPM fingerprint makes the channel differences at different positions more distinguishable. As such, the features of the channel are easier to extract with a neural network.
In fact, the feature maps in the higher layers of a CNN are also sparse, because they focus solely on the discriminative structure within the input image [30]. By using the sparse ADCPM as the input, it is easier for the neural network to capture the characteristic information of the channel, thereby simplifying the network structure and speeding up its convergence.
In Theorem 1, the rows and the columns of the ADCPM correspond to the angle and delay domains respectively. Noticing that an angle in 3D space can be described by a pair of vertical and horizontal angles, we reshape the ADCPM into a three-dimensional tensor as follows:
$[\mathcal{X}_k]_{i,j,q} = [\boldsymbol{\Omega}_k]_{i N_h + j,\, q}$  (16)
where $\mathcal{X}_k \in \mathbb{R}^{N_v \times N_h \times N_g}$ is the 3D ADCPM of the $k$-th MT, with the three dimensions indicating the vertical angle, the horizontal angle, and the delay respectively. We hereafter use the reshaped ADCPM as the input of the neural network.
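Once the Kronecker ordering of the angle dimension is fixed, the reshape in (16) is a plain row-major reshape; the vertical-major ordering assumed here is an illustrative choice:

```python
import numpy as np

def to_3d_adcpm(omega, n_v, n_h):
    """Reshape the (N_v*N_h x N_g) ADCPM into an N_v x N_h x N_g tensor, cf. (16).
    Row index i*N_h + j is assumed to map to vertical angle i, horizontal angle j."""
    n_g = omega.shape[1]
    return omega.reshape(n_v, n_h, n_g)
```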
III-B Regression-oriented Positioning
The CNN-based regression-oriented positioning is, in essence, a complex nonlinear function. Denote by $\Phi(\cdot)$ the function that predicts the true position from the user's 3D ADCPM $\mathcal{X}_k$, that is,
$\hat{\mathbf{p}}_k = \Phi(\mathcal{X}_k)$  (17)
where $\hat{\mathbf{p}}_k$ is the prediction by the CNN. We use regression analysis to find the mapping function by minimizing the localization error between the true coordinate $\mathbf{p}_k$ and the prediction:
$\Phi^\star = \arg\min_{\Phi}\; \mathbb{E}\left\{\left\|\mathbf{p}_k - \Phi(\mathcal{X}_k)\right\|_2^2\right\}$  (18)
Note that traditional CNN models are often used for image classification, with the last layer activated by a softmax function. As a matter of fact, the CNN model itself can be regarded as a regression function if the softmax is replaced with a fully connected layer without an activation function. If we use $\mathbf{z}$ to denote the output before the last layer, then the estimated position after the last layer can be written as
$\hat{\mathbf{p}} = \mathbf{W}\mathbf{z} + \mathbf{b}$  (19)
where $\mathbf{W}$ and $\mathbf{b}$ are the parametric weight matrix and bias vector respectively, which are learned together with the preceding CNN layers.
Let $\Theta$ be the set of trainable parameters of the neural network. The cost function with respect to the mean square error (MSE) over the training data set is given by
$L(\Theta) = \frac{1}{S} \sum_{s=1}^{S} \left\|\hat{\mathbf{p}}_s - \mathbf{p}_s\right\|_2^2 + \beta \left\|\Theta\right\|_2^2$  (20)
where $S$ is the number of training samples, the second term employs $\ell_2$ regularization to avoid overfitting, and $\beta$ is the weight factor of the regularization.
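The cost in (20) amounts to a per-sample squared Euclidean positioning error plus an L2 penalty on the weights; a minimal sketch follows, where the value of the regularization weight `beta` is a hypothetical choice:

```python
import numpy as np

def training_loss(pred, true, params, beta=1e-4):
    """MSE over training samples plus L2 regularization, cf. (20).
    pred, true: (S, 3) position arrays; params: list of weight arrays."""
    mse = np.mean(np.sum((pred - true) ** 2, axis=1))   # mean squared 3D error
    reg = beta * sum(np.sum(w ** 2) for w in params)    # L2 penalty on weights
    return mse + reg
```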
IV Proposed 3D CNN Structure
Inspired by the 3D structure of $\mathcal{X}_k$, we propose to use a 3D CNN to realize the mapping function. Fig. 3 shows the network architecture of the proposed 3D CNN for fingerprint-based positioning, which is composed of a convolution refinement module, three extended 3D Inception modules, and a regression module. In addition, max pooling is used for downsampling, with its parameters presented in the form (size/stride). The convolution refinement module first refines the elementary feature maps from the 3D ADCPM. Then we modify and extend the Inception module into a 3D form to extract the advanced feature maps. Finally, the regression module estimates the 3D position by employing a global average pooling layer and a fully connected layer without an activation function.
Before proceeding further, we introduce the 3D convolution-normalization-activation (CNA) layer, an elementary building block used in our 3D CNN. The 3D CNA layer consists of three parts: a 3D convolution, a BN transform, and an activation function. Consider the input feature maps $\mathcal{U} \in \mathbb{R}^{H \times W \times L \times C}$, where $H$, $W$, $L$, and $C$ denote the height, width, length, and number of channels of $\mathcal{U}$ respectively. Given a 3D convolution kernel $\mathcal{K} \in \mathbb{R}^{K_h \times K_w \times K_l \times C \times C'}$, where $K_h$, $K_w$, and $K_l$ are the kernel sizes along the height, width, and length respectively, and $C'$ is the number of output channels, the convolution is performed with zero padding [29] and all strides set to 1. By convolving the 3D kernel with the input feature maps, the 3D convolution yields the output $\mathcal{Y}$ with
$[\mathcal{Y}]_{h,w,l,c'} = \sum_{c=0}^{C-1} \sum_{a=0}^{K_h-1} \sum_{b=0}^{K_w-1} \sum_{d=0}^{K_l-1} [\mathcal{K}]_{a,b,d,c,c'}\, [\mathcal{U}]_{h+a,\, w+b,\, l+d,\, c}$  (21)
Right after the 3D convolution, we adopt BN to improve the convergence rate and generalization performance [28], followed by the rectified linear unit (ReLU) as the activation function. That is, the output feature maps of the 3D CNA layer are given by
$\mathcal{Z} = \mathrm{ReLU}\left(\mathrm{BN}\left(\mathcal{Y}\right)\right)$  (22)
where $\mathrm{BN}(\cdot)$ represents the BN transform. Note that we discard the bias term in (21), as suggested in [28].
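For illustration, the CNA layer of (21)-(22) can be written out explicitly with naive loops (single sample, stride 1, zero padding). The normalization here uses the feature-map statistics of this one sample as a stand-in for batch statistics, which is a simplification of the BN transform in [28]:

```python
import numpy as np

def cna_layer(x, kernels, gamma=1.0, beta=0.0, eps=1e-5):
    """3D convolution -> normalization -> ReLU, cf. (21)-(22), single sample.
    x: (H, W, L, C_in); kernels: (kh, kw, kl, C_in, C_out).
    Zero padding, stride 1, no bias term (dropped as in the text)."""
    kh, kw, kl, c_in, c_out = kernels.shape
    ph, pw, pl = kh // 2, kw // 2, kl // 2
    xp = np.pad(x, ((ph, ph), (pw, pw), (pl, pl), (0, 0)))
    H, W, L, _ = x.shape
    y = np.zeros((H, W, L, c_out))
    for h in range(H):
        for w in range(W):
            for l in range(L):
                patch = xp[h:h + kh, w:w + kw, l:l + kl, :]
                y[h, w, l] = np.tensordot(patch, kernels,
                                          axes=([0, 1, 2, 3], [0, 1, 2, 3]))
    # per-channel normalization followed by a learnable affine transform
    mu = y.mean(axis=(0, 1, 2), keepdims=True)
    var = y.var(axis=(0, 1, 2), keepdims=True)
    z = gamma * (y - mu) / np.sqrt(var + eps) + beta
    return np.maximum(z, 0.0)                 # ReLU activation
```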
In what follows, we describe the individual modules and the overall network structure in detail.
IV-A Convolution Refinement Module
The convolution refinement module is designed to extract features from the input 3D ADCPM. When designing the module, we take into account the physical meaning of each dimension of the 3D ADCPM. While existing designs use symmetric kernels, for which the size of each dimension is equal, we propose to use asymmetric convolution kernels, based on the intuition that the kernel size should reflect the correlation statistics of the corresponding dimension. As a consequence, the size of the subsequent max pooling should also change accordingly. From Fig. 2, we observe that the input units are more concentrated in the angular dimensions and relatively dispersed in the delay dimension. This suggests that the convolution kernel size along the delay dimension can be larger than that along the angle dimensions. At the same time, it is worth noting that the vertical and horizontal angles are somewhat correlated, while the delay dimension is independent of the other two. Therefore, the delay-vertical angle and delay-horizontal angle domains demand special treatment compared with the vertical-horizontal angle domain. This motivates our design of the convolution refinement module.
We take a specific antenna and bandwidth configuration as an example to describe the convolution refinement module. The structure of the convolution refinement module is shown in Fig. 4. The expression in each box is in the form (size*number/stride). As one of the key designs, we build two branches in this module, each of which consists of two 3D CNA layers and one max pooling layer. The left branch aims to extract information from the delay-horizontal angle domain, while the right branch is dedicated to the delay-vertical angle domain. We emphasize that both branches use asymmetric convolution kernels and pooling sizes. The outputs of the two branches are combined by channel-wise concatenation. After that, we employ a 3D CNA layer with a symmetric kernel to extract information from all three dimensions, followed by a max pooling to downsample the feature maps.
IV-B 3D Inception Module
In the deeper layers, we use the 3D Inception module to extract more refined features. The Inception module is a combination of 2D convolutions with different kernel sizes (e.g., 1×1, 3×3, and 5×5) with a parallel average-pooling operation [24]. We extend the Inception module into a 3D form by replacing the 2D convolutions with 3D convolutions. The structure has four parallel branches, whose outputs are concatenated into a single output that serves as the input of the next stage, as shown in Fig. 5, where a scaling factor controls the number of kernels. The 5×5 kernel is replaced by two cascaded 3×3 kernels to reduce the computational overhead, as in [24].
Since the correlation statistics in the deeper layers are unknown, the usage of 3D convolution kernels with different sizes can avoid the loss of important features.
IV-C Regression Module
The regression module consists of a global average pooling layer and a fully connected layer without an activation function. While existing designs tend to add several fully connected layers before the output layer of the network, we use a global average pooling layer to replace them. The global average pooling takes the average of each feature map [27] and reshapes the output of the convolution layers for the final fully connected layer. By replacing the intermediate fully connected layers with global average pooling, we avoid the huge number of parameters they would bring. Thus, the global average pooling improves the convergence rate while reducing overfitting. The output of the global average pooling, in vector form, is fed directly into the fully connected layer given by (19). The output of the regression module is the estimated coordinate of the position.
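The regression module reduces each feature map to its spatial average and then applies the linear layer of (19); a minimal sketch with hypothetical shapes:

```python
import numpy as np

def regression_head(feature_maps, W, b):
    """Global average pooling followed by a linear output layer, cf. (19).
    feature_maps: (H, W, L, C); W: (C, 3); b: (3,)."""
    z = feature_maps.mean(axis=(0, 1, 2))   # global average pooling -> (C,)
    return z @ W + b                        # estimated 3D coordinates
```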
V Simulation Results
In this section, we first introduce the simulation setup. Then we evaluate the positioning accuracy as well as the time and storage overhead of the proposed 3D CNN-enabled positioning method. Next, we demonstrate the influence of the antenna configuration and the bandwidth on the proposed method. Finally, we evaluate the robustness of our positioning method with noise-contaminated inputs.
V-A Simulation Setup
To simulate the positioning process, we implement a spatial-consistency channel model based on the QuaDriGa model [31]. Spatial consistency refers to the similarity of the channels at two adjacent positions caused by a similar scatterer environment. Fingerprint-based positioning relies on the similarity between fingerprints obtained from the channel; thus, it is critical for our simulator to capture spatial consistency. The QuaDriGa model incorporates time evolution to realize spatial consistency along a preset track. Based on the transition idea from the WINNER model [31], we extend spatial consistency to the whole 3D space by using reference point (RP) transitions. The RPs are uniformly distributed in the 3D space, and the interval between them equals the correlation distance used to describe the correlation of large-scale parameters (LSPs) [32]. A stochastic part generates the channel coefficients of the RPs, following [33, 34]. Then, based on the geographic position, the channel of an arbitrary MT in the 3D space can be generated by transitions between the RPs. Thus, we build a simulation environment that includes the geographical correlation between the fingerprints of two neighboring positions.
In our simulation, we consider the 3GPP urban macro (UMa) NLOS scenario, where the carrier frequency is set to 2 GHz. We choose the NLOS scenario instead of the LOS one because positioning in the NLOS scenario is more challenging due to the many occlusions and reflections. The size of the positioning area is 30 m × 30 m × 9 m, with the center point of its bottom surface coinciding with the origin. The BS is equipped with a UPA whose antenna plane is perpendicular to the ground, facing the positioning area. The MT is equipped with an omnidirectional antenna. Unless otherwise specified, the transmission bandwidth is 20 MHz. For simplicity but without loss of generality, we divide the cubic positioning area into three planes with heights of 1.5 m, 4.5 m, and 7.5 m respectively. In the offline phase, the training points are uniformly selected on these three planes with an interval of 1 m, and the fingerprints and the corresponding 3D positions are collected. In the online phase, 1000 test points are randomly distributed on these three planes, and their positions are inferred by feeding their fingerprints into the trained 3D CNN.
The fingerprints and the corresponding 3D positions of the training points are generated and saved using MATLAB 2018b, and the 3D CNN training and testing are processed using TensorFlow 1.9. Our simulation is carried out on a workstation equipped with two E5 2643v3 CPUs and one Titan X Pascal 12GB GPU. The time overhead mentioned below only refers to the run time on TensorFlow 1.9.
V-B Comparison with Other Positioning Algorithms
To evaluate the performance of the proposed 3D CNN-enabled positioning method, we use the WKNN positioning method proposed in [35] as the benchmark. For a fair comparison, we modify the fingerprint similarity criterion by adding a normalization term, given by
$\rho\left(\boldsymbol{\Omega}_1, \boldsymbol{\Omega}_2\right) = \frac{\mathrm{tr}\left(\boldsymbol{\Omega}_1^T \boldsymbol{\Omega}_2\right)}{\left\|\boldsymbol{\Omega}_1\right\|_F \left\|\boldsymbol{\Omega}_2\right\|_F}$  (23)
where $\|\cdot\|_F$ denotes the Frobenius norm, and $\mathrm{tr}(\cdot)$ denotes the trace operator. The modification is inspired by the correlation matrix distance (CMD) proposed in [36] to measure the spatial structure of the non-stationary MIMO channel. The normalization term confines the similarity of fingerprints between 0 and 1 to eliminate the influence introduced by different channel powers. Thus, the normalized fingerprint similarity criterion (23) makes the WKNN positioning algorithm more accurate, which ensures the fairness of the comparison. A fixed number of neighbors is adopted in our simulation.
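The benchmark, with the normalized similarity of (23), can be sketched as follows; the database layout and the choice of k are illustrative:

```python
import numpy as np

def similarity(a, b):
    """Normalized fingerprint similarity in [0, 1] for power matrices, cf. (23)."""
    return np.trace(a.T @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def wknn_position(query, db_fps, db_pos, k=5):
    """Weighted k-nearest-neighbor position estimate over a fingerprint database."""
    sims = np.array([similarity(query, fp) for fp in db_fps])
    idx = np.argsort(sims)[-k:]          # k most similar reference points
    w = sims[idx] / sims[idx].sum()      # similarity-proportional weights
    return w @ db_pos[idx]               # weighted average of RP coordinates
```

Note that the online search scans the entire database, which is why the WKNN overhead grows with the size of the positioning area.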
On the other hand, to demonstrate the benefit of the 3D CNN model, we also compare against a downgraded 2D CNN model. We propose a 2D CNN-enabled positioning method with the ADCPM given by (8) as the input. In that case, the ADCPM is regarded as an image with sparse highlights (supports), where the vertical and horizontal angles are collocated in the same dimension. Table I specifies the structure of the proposed 2D CNN, where the 2D Inception module is modified from the 3D Inception module in Fig. 5 by replacing the 3D convolutions with 2D convolutions.
TABLE I: Structure of the proposed 2D CNN

Module | size*number/stride
CNA | [15 15]*32
Max pooling | [5 5]/2
CNA | [7 7]*64
CNA | [5 5]*128
Max pooling | [5 5]/2
2D Inception module |
Max pooling | [5 5]/2
2D Inception module |
2D Inception module |
Max pooling | [3 3]/2
Global average pooling | [8 8]
Fully connected layer | 1024*3
We first compare the positioning accuracy of the three methods. The cumulative distribution function (CDF) of the localization error for the different methods is illustrated in Fig. 6. The 3D CNN-enabled positioning method reaches the highest positioning accuracy, with 90% of localization errors within 1 m. The 2D CNN method achieves inferior accuracy compared with its 3D counterpart, with only 80% of errors within 1 m, but is still superior to the WKNN positioning method with 78%.
The run-time overhead is defined as the time spent per user positioning in the online phase, and we employ it to measure the computational complexity. The storage overhead refers to the storage resource required for positioning. We compare the time and storage overheads of the three positioning methods in Fig. 7. The 2D CNN-enabled positioning method requires the least time overhead (15.78 ms) and the least storage overhead (10 MB). Compared with its 2D counterpart, the 3D CNN gains higher positioning accuracy at the cost of higher time overhead (24.42 ms) and storage overhead (42.9 MB). The WKNN positioning method has the highest computational complexity (1566 ms) and storage requirement (119 MB), as it needs to store the database collected in the positioning area and then search through it to find the nearest positions. Thus, the time and storage overheads of the WKNN positioning method increase with the expansion of the positioning area. In contrast, the 2D/3D CNN-enabled positioning methods only need to store the trained parameters and infer the position directly through the trained regression networks, so that their time and storage overheads are independent of the size of the positioning area.
In conclusion, the proposed 3D CNN-enabled positioning method outperforms the WKNN positioning method in terms of positioning accuracy, time overhead, and storage overhead. The proposed 2D CNN-enabled positioning method can be regarded as a simplification of the 3D CNN-enabled one, reducing the time and storage overheads at the expense of positioning accuracy.
V-C Different Configurations of Antenna Array and Bandwidth
To inspect how the vertical and horizontal angles interact with each other, we compare different antenna array geometries. Fig. 8 shows the CDF of the localization error using different antenna arrangements, one of which corresponds to a uniform linear antenna (ULA) array. To make a fair comparison, we maintain the same total number of antenna elements. It can be observed that the two UPA arrangements perform better, with 90% and 93% reliability for 1 m positioning accuracy, while the ULA achieves 80% reliability for 1 m positioning accuracy. Thus, the UPA with the 3D CNN positioning method is more suitable for 3D positioning than the ULA, as it provides additional angle resolution in the vertical direction.
Next, we evaluate the impact of the number of antennas. Fig. 9 shows the CDF of the localization error for different numbers of BS antennas. The number of antennas is increased from 32 to 128, with the BS equipped with a UPA. The localization errors at the 90% point are 1 m, 1.05 m, and 1.26 m for the , , and UPAs respectively. The results clearly show that the positioning accuracy improves as the number of antennas grows in both the column and the row. Nevertheless, the improvement saturates once the number of antennas reaches 64.
To understand how the delay-domain features affect the positioning accuracy, we evaluate the performance versus the transmission bandwidth, which rises from 10 MHz to 30 MHz. From Fig. 9, the localization errors at the 90% point are 1 m, 1 m, and 1.08 m for the 30 MHz, 20 MHz, and 10 MHz transmission bandwidths respectively. It can be observed that increasing the bandwidth from 10 MHz to 20 MHz improves the positioning accuracy, while a further increase to 30 MHz does not help, indicating that a 20 MHz bandwidth already provides sufficient delay information for the 3D CNN model to capture all features.
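This saturation is consistent with a standard back-of-the-envelope relation (a sketch we add for intuition, not a computation from the paper): the delay resolution of a wideband fingerprint is the reciprocal of the bandwidth, so widening the band sharpens the delay bins only until the multipath structure is fully resolved.

```python
# Delay-bin width is 1/B; multiplying by the speed of light gives the
# path-length resolution that each bandwidth offers.
C = 3e8  # speed of light, m/s
for B in (10e6, 20e6, 30e6):           # bandwidths in Hz
    d_tau = 1.0 / B                    # delay resolution in seconds
    print(f"{B/1e6:.0f} MHz -> {d_tau*1e9:.1f} ns ({C*d_tau:.0f} m of path length)")
```

At 20 MHz the bins are already 50 ns (15 m of path length) wide, so the remaining gain from 30 MHz is marginal for this scenario.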
V-D Positioning Robustness
In this subsection, we evaluate the robustness of our fingerprint extraction and 3D CNN-enabled positioning method against noise contamination. For comparison, we also consider the direct use of the information in the space-frequency domain. Similar to the ADCPM, we introduce the space-frequency channel power matrix (SFCPM) of the th MT as
(24) 
The SFCPM is then employed as an alternative fingerprint.
Then the following four combinations of positioning method and fingerprint are considered: a) WKNN with the SFCPM fingerprint; b) WKNN with the proposed ADCPM fingerprint; c) the proposed 3D CNN with the SFCPM fingerprint; d) the proposed 3D CNN with the proposed ADCPM fingerprint. In the training phase, artificial noiseless channel matrices are generated by the Monte Carlo method to train the 3D CNN model or to construct the lookup table for WKNN. In the prediction phase, real noisy channels are used for positioning, where the average received signal-to-noise ratio (SNR) varies from 4 dB to 20 dB. By exploiting the sparsity of the ADCPM, we apply a filter that sets the elements of the ADCPM below a given threshold to zero. In doing so, the noise contamination in the inputs is partially removed.
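The thresholding step can be sketched as follows. The relative threshold (a fraction of the peak entry) is an illustrative assumption, since the paper does not restate its exact value here.

```python
import numpy as np

def denoise_adcpm(adcpm, ratio=0.05):
    """Suppress ADCPM entries below a fraction of the peak power.

    The true ADCPM is sparse in the angle-delay domain, so small
    entries are mostly noise and can be zeroed before positioning.
    """
    out = adcpm.copy()
    out[out < ratio * out.max()] = 0.0
    return out
```

The denser SFCPM offers no such sparsity to exploit, which is one reason the ADCPM degrades more gracefully under noise.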
Fig. 11 presents the average localization error versus the average received SNR. It can be observed that for both the searching-based (i.e., WKNN) and the learning-based (i.e., 3D CNN) methods, the ADCPM fingerprint is more robust to noise contamination than its SFCPM counterpart, thanks to the channel sparsity in the angle-delay domain. At high SNR, while the searching-based method achieves comparable performance with both the SFCPM and ADCPM fingerprints, the learning-based method favors the ADCPM fingerprint, which introduces sparsity. As the SNR grows, the performance of the learning-based method approaches that of the searching-based one with considerably reduced computational complexity and storage requirement. We conclude that the proposed ADCPM fingerprint is more robust to noise and that the proposed 3D CNN-enabled positioning method performs well on noisy inputs.
VI Conclusion
In this paper, we have proposed a learning-based user positioning method using a 3D CNN for the 3D massive MIMO-OFDM system, exploiting the sparsity of the channel statistics in the 3D angle-delay domain. We employed the 3D ADCPM as a new type of fingerprint, which embeds stable and stationary multipath characteristics, e.g. delay, power, and angle in the vertical and horizontal directions. By casting user positioning as a 3D image recognition problem, we proposed a novel 3D CNN architecture with a regression objective to realize fingerprint-based positioning. The simulation results demonstrated that the proposed method achieves high positioning accuracy with reduced computational complexity and storage overhead, and is robust to noise contamination of the fingerprints.
Appendix A Proof of Theorem 1
In order to prove Theorem 1, we first provide the following preliminary results, stated as Lemmas 1 and 2.
We define the following vectors
(25) 
and
(26) 
Lemma 1: When and , the AOA can be extracted from the array response vector explicitly via the DFT operation, given by
(27) 
(28) 
where
(29) 
Proof: The th element of is calculated as
(30) 
where is defined in (10). From (30), if and only if . Further, if and only if , and in this case we have
(31) 
hence, if and only if the following constraint is satisfied
(32) 
Owing to , we derive that if and only if
(33) 
where in the case , is sufficient to be an integer. In this case, we derive from (30) and (31) that
(34) 
Therefore, (27) is obtained. (28) can be derived by using the same method. ∎
Lemma 1 establishes the relationship between the array response vector and the deterministic component in the angle domain for 3D massive MIMO antennas, which guides us to extract both the vertical and horizontal AOAs, specified by and respectively, in the 3D space.
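The concentration in Lemma 1 can be illustrated numerically (a sketch with an arbitrarily chosen on-grid index, not the paper's notation): the unitary DFT of an array response whose per-element phase increment lies on the DFT grid places all power in a single angle bin.

```python
import numpy as np

M = 64                                   # number of antennas (large)
q = 10                                   # on-grid spatial-frequency index
n = np.arange(M)
a = np.exp(2j * np.pi * q * n / M) / np.sqrt(M)   # unit-norm array response
spec = np.abs(np.fft.fft(a, norm="ortho")) ** 2   # angle-domain power

# All power lands in bin q; every other bin is numerically zero.
print(np.argmax(spec), spec[q])
```

Off-grid phase increments instead spread power over neighboring bins, which is why the lemma requires a large antenna number for sharp angle resolution.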
Based on Lemma 1, the CIR vector in the angle domain of the th path for the th MT can be written as
(35) 
in which the index indicates the AOA and the corresponding element indicates the channel gain in the AOA direction.
Lemma 2: For 3D MIMO systems with a UPA at the BS, when , the average channel power of each path is concentrated on a specific position in the vertical angle domain, given by
(36) 
When , the average channel power of each path exhibits a similar concentration in the horizontal angle domain, given by
(37) 
When and , the average channel power of each path is concentrated in both the vertical and horizontal angle domains, given by
(38) 
Proof: Substituting (4) and (1) into (35), we have
(39) 
When , substituting (27) into (39), we have
(40) 
Similar to (30), the th element of is calculated as
(41) 
Therefore, the th element of is written as
(42) 
From the assumption that , the average channel power in the angle domain is calculated as
(43) 
When and , (38) is a special case of (36) and (37). Substituting (27) and (28) into (39), we derive
(44) 
Hence, the th element of can be written as
(45) 
From the assumption that , the average channel power in the angle domain is calculated as
(46) 
Therefore, (38) is obtained. ∎
Lemma 2 establishes the relationship between the CIR in the antenna domain and that in the angle domain for 3D massive MIMO systems. When the number of UPA antennas is sufficiently large, the average channel powers corresponding to different AOAs can be resolved via the DFT operation. The resolution of power and angle in the vertical and horizontal directions depends on the number of antennas in the respective direction.
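The separable structure behind Lemma 2 can be checked in the same spirit (again a sketch with arbitrary on-grid indices): a 2D unitary DFT over the two UPA axes concentrates the power of an on-grid response at one (vertical, horizontal) bin pair.

```python
import numpy as np

Mv, Mh = 8, 16                # antennas along the vertical / horizontal axes
qv, qh = 3, 5                 # on-grid vertical / horizontal angle indices
nv = np.arange(Mv)[:, None]
nh = np.arange(Mh)[None, :]
# Separable UPA response: independent phase ramps in the two directions
A = np.exp(2j * np.pi * (qv * nv / Mv + qh * nh / Mh)) / np.sqrt(Mv * Mh)
P = np.abs(np.fft.fft2(A, norm="ortho")) ** 2      # 2D angle power map
peak = np.unravel_index(np.argmax(P), P.shape)     # -> (qv, qh)
```

Since the two phase ramps are independent, the vertical and horizontal resolutions are set by Mv and Mh separately, matching the remark above.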
The th column of is calculated as
(47) 
where is defined by substituting for in (10). When , if and only if . In this case, from (31), we have
(48) 
Therefore, we derive
(49) 
Substituting (1), (25), (26), and (49) into (7), we have
(50) 
When and , according to (27) and (41), we have
(51) 
From the assumption that , the average channel power in the angle-delay domain is calculated as
(52) 
When , and , (13) can be derived by using the same method.
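Putting the appendix together, the ADCPM is obtained by unitary DFTs over the two antenna axes (antenna to angle) and an inverse DFT over the subcarriers (frequency to delay), followed by averaging the power over channel samples. The following is a sketch of that pipeline under assumed axis conventions, not the paper's exact definition:

```python
import numpy as np

def adcpm(H):
    """Angle-delay channel power matrix from space-frequency samples.

    H : (T, Mv, Mh, Nc) complex channel over T samples for an Mv x Mh
        UPA and Nc subcarriers.
    """
    G = np.fft.fft(H, axis=1, norm="ortho")    # vertical antennas -> angle
    G = np.fft.fft(G, axis=2, norm="ortho")    # horizontal antennas -> angle
    G = np.fft.ifft(G, axis=3, norm="ortho")   # subcarriers -> delay
    return np.mean(np.abs(G) ** 2, axis=0)     # (Mv, Mh, Nc) power matrix
```

Because the transforms are unitary, the ADCPM redistributes the channel energy across angle-delay bins without creating or losing power, which is what makes the sparse concentration of Lemma 2 visible in the fingerprint.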
References
 [1] W. H. Chin, Z. Fan, and R. Haines, “Emerging technologies and research challenges for 5G wireless networks,” IEEE Wireless Commun., vol. 21, no. 2, pp. 106–112, April 2014.
 [2] B. A. Renfro, M. Stein, and N. Boeker, An analysis of global positioning system (GPS) standard positioning service (SPS) performance for 2018, Space Geophys. Lab. Appl. Res. Lab., Univ. Texas Austin, Tech. Rep. TR-SGL-19-02, Mar. 2018. [Online]. Available: https://www.gps.gov/systems/gps/performance/
 [3] M. H. Bergen, X. Jin, D. Guerrero, H. A. L. F. Chaves, N. V. Fredeen, and J. F. Holzman, "Design and implementation of an optical receiver for angle-of-arrival-based positioning," J. Lightw. Technol., vol. 35, no. 18, pp. 3877–3885, Sept. 2017.
 [4] A. Y. Z. Xu, E. K. S. Au, A. K. S. Wong, and Q. Wang, "A novel threshold-based coherent TOA estimation for IR-UWB systems," IEEE Trans. Veh. Technol., vol. 58, no. 8, pp. 4675–4681, Oct. 2009.
 [5] Y. Yuan, S. Hou, and Q. Zhao, "An improved TDOA localization algorithm based on wavelet transform," in 2017 7th IEEE International Conference on Electronics Information and Emergency Communication (ICEIEC), July 2017, pp. 111–114.
 [6] W. Cui, L. Zhang, B. Li, J. Guo, W. Meng, H. Wang, and L. Xie, "Received signal strength based indoor positioning using a random vector functional link network," IEEE Trans. Ind. Informat., vol. 14, no. 5, pp. 1846–1855, May 2018.
 [7] K. N. R. S. V. Prasad, E. Hossain, and V. K. Bhargava, "Machine learning methods for RSS-based user positioning in distributed massive MIMO," IEEE Trans. Wireless Commun., vol. 17, no. 12, pp. 8402–8417, Dec. 2018.
 [8] Z. Wang, H. Zhang, T. Lu, and T. A. Gulliver, "Cooperative RSS-based localization in wireless sensor networks using relative error estimation and semidefinite programming," IEEE Trans. Veh. Technol., vol. 68, no. 1, pp. 483–497, Jan. 2019.
 [9] S. Tomic, M. Beko, and R. Dinis, "3D target localization in wireless sensor networks using RSS and AoA measurements," IEEE Trans. Veh. Technol., vol. 66, no. 4, pp. 3197–3210, April 2017.
 [10] H. Chen, Y. Zhang, W. Li, X. Tao, and P. Zhang, "ConFi: Convolutional neural networks based indoor WiFi localization using channel state information," IEEE Access, vol. 5, pp. 18066–18074, 2017.
 [11] X. Wang, L. Gao, S. Mao, and S. Pandey, "CSI-based fingerprinting for indoor localization: A deep learning approach," IEEE Trans. Veh. Technol., vol. 66, no. 1, pp. 763–776, Jan. 2017.
 [12] G. Wu and P. Tseng, "A deep neural network-based indoor positioning method using channel state information," in 2018 International Conference on Computing, Networking and Communications (ICNC), March 2018, pp. 290–294.
 [13] A. Decurninge, L. G. Ordonez, P. Ferrand, H. Gaoning, L. Bojie, Z. Wei, and M. Guillaud, "CSI-based outdoor localization for massive MIMO: Experiments with a learning approach," in 2018 15th International Symposium on Wireless Communication Systems (ISWCS), Aug. 2018, pp. 1–6.
 [14] C. Sun, X. Gao, S. Jin, M. Matthaiou, Z. Ding, and C. Xiao, "Beam division multiple access transmission for massive MIMO communications," IEEE Trans. Commun., vol. 63, no. 6, pp. 2170–2184, June 2015.
 [15] L. You, X. Gao, A. L. Swindlehurst, and W. Zhong, "Channel acquisition for massive MIMO-OFDM with adjustable phase shift pilots," IEEE Trans. Signal Process., vol. 64, no. 6, pp. 1461–1476, March 2016.
 [16] X. Li, S. Jin, H. A. Suraweera, J. Hou, and X. Gao, "Statistical 3D beamforming for large-scale MIMO downlink systems over Rician fading channels," IEEE Trans. Commun., vol. 64, no. 4, pp. 1529–1543, April 2016.
 [17] X. Sun, X. Gao, G. Y. Li, and W. Han, "Single-site localization based on a new type of fingerprint for massive MIMO-OFDM systems," IEEE Trans. Veh. Technol., vol. 67, no. 7, pp. 6134–6145, July 2018.
 [18] B. Wang, F. Gao, S. Jin, H. Lin, and G. Y. Li, "Spatial- and frequency-wideband effects in millimeter-wave massive MIMO systems," IEEE Trans. Signal Process., vol. 66, no. 13, pp. 3393–3406, July 2018.
 [19] N. Garcia, H. Wymeersch, E. G. Larsson, A. M. Haimovich, and M. Coulon, "Direct localization for massive MIMO," IEEE Trans. Signal Process., vol. 65, no. 10, pp. 2475–2487, May 2017.
 [20] M. Arnold, S. Dorner, S. Cammerer, and S. Ten Brink, "On deep learning-based massive MIMO indoor user localization," in 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), June 2018, pp. 1–5.
 [21] X. Wang, L. Gao, S. Mao, and S. Pandey, "CSI-based fingerprinting for indoor localization: A deep learning approach," IEEE Trans. Veh. Technol., vol. 66, no. 1, pp. 763–776, Jan. 2017.
 [22] J. Vieira, E. Leitinger, M. Sarajlic, X. Li, and F. Tufvesson, "Deep convolutional neural networks for massive MIMO fingerprint-based positioning," in 2017 IEEE 28th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC), Oct. 2017, pp. 1–6.
 [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
 [24] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [25] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [26] C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," CoRR, vol. abs/1602.07261, 2016. [Online]. Available: http://arxiv.org/abs/1602.07261
 [27] M. Lin, Q. Chen, and S. Yan, "Network in network," CoRR, vol. abs/1312.4400, 2013.
 [28] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," CoRR, vol. abs/1502.03167, 2015. [Online]. Available: http://arxiv.org/abs/1502.03167
 [29] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org
 [30] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 818–833.
 [31] S. Jaeckel, L. Raschkowski, K. Borner, and L. Thiele, "QuaDRiGa: A 3-D multi-cell channel model with time evolution for enabling virtual field trials," IEEE Trans. Antennas Propag., vol. 62, no. 6, pp. 3242–3256, June 2014.
 [32] P. K. et al., WINNER II Channel Models, Version 1.1, Sep. 2007. [Online]. Available: http://www.ist-winner.org/
 [33] Study on 3D channel model for LTE, Version 12.7.0, document 3GPP TR 36.873, Dec. 2017.
 [34] Study on channel model for frequencies from 0.5 to 100 GHz, Version 15.0.0, June 2018.
 [35] X. Sun, X. Gao, G. Y. Li, and W. Han, "Fingerprint based single-site localization for massive MIMO-OFDM systems," in GLOBECOM 2017 – 2017 IEEE Global Communications Conference, Dec. 2017, pp. 1–7.
 [36] M. Herdin, N. Czink, H. Ozcelik, and E. Bonek, "Correlation matrix distance, a meaningful measure for evaluation of non-stationary MIMO channels," in 2005 IEEE 61st Vehicular Technology Conference, vol. 1, May 2005, pp. 136–140.