Reinforcement Learning in Multiple-UAV Networks: Deployment and Movement Design
Abstract
A novel framework is proposed for the quality-of-experience (QoE)-driven deployment and dynamic movement of multiple unmanned aerial vehicles (UAVs). The joint non-convex problem of three-dimensional (3D) deployment and dynamic movement of the UAVs is formulated for maximizing the sum mean opinion score (MOS) of the ground users, and is proved to be NP-hard. With the aim of solving this pertinent problem, a three-step approach is proposed for attaining the 3D deployment and dynamic movement of multiple UAVs. Firstly, a genetic-algorithm-based K-means (GA-K-means) algorithm is utilized for obtaining the cell partition of the users. Secondly, a Q-learning based deployment algorithm is proposed, in which each UAV acts as an agent, making its own decision for attaining its 3D position by learning from trial and error. In contrast to conventional genetic-algorithm-based learning algorithms, the proposed algorithm is capable of training the direction-selection strategy offline. Thirdly, a Q-learning based movement algorithm is proposed for the scenario in which the users are roaming. The proposed algorithm is capable of converging to an optimal state. Numerical results reveal that the proposed algorithms converge after a small number of iterations. Additionally, the proposed Q-learning based deployment algorithm outperforms K-means algorithms and the Iterative GA-K-means (IGK) algorithm at a low complexity.
I Introduction
The unprecedented demand for high-quality wireless services imposes enormous challenges on conventional cellular communication networks. UAV-assisted communications, in which UAVs are employed as aerial base stations to assist the existing terrestrial communication infrastructure, has been viewed as a promising candidate technique for tackling problems such as quick-response wireless service recovery after unexpected infrastructure damage or natural disasters, as well as cellular network offloading in hot spots such as sports stadiums and outdoor events [2, 3, 4]. Additionally, UAVs can also be employed in Internet of Things (IoT) networks, which require massive connectivity [5, 6]. In IoT networks, UAVs are capable of collecting data from ground IoT devices in a given geographical area in which constructing a complete cellular infrastructure is unaffordable [7, 8].
Given their agility and mobility, low cost, and beneficial line-of-sight (LoS) propagation, UAV-assisted communication networks are highly desirable for the following reasons:

Agility feature: UAVs can be deployed quickly and easily to offer high-quality services in quick-response scenarios. Additionally, deploying a UAV as an aerial base station is far more affordable than building a terrestrial base station [3].

LoS feature: Since it is deployed in three-dimensional space, a UAV enjoys LoS connections to ground users, which are significantly less affected by shadowing and fading [9].

Mobility feature: A UAV is capable of adjusting its 3D position dynamically and swiftly according to users' real-time locations, which enables it to provide flexible, on-demand service to users [8].
Sparked by the aforementioned characteristics of UAV-assisted communication networks, the American wireless carrier AT&T [10] has conducted research on employing UAVs to develop airborne LTE services. UAVs have been utilized by the Huawei corporation and the China Mobile corporation as aerial base stations in emergency communications, and their achievements were exhibited at the 2018 Mobile World Congress (MWC) in Barcelona. Additionally, a UAV-based aerial base station equipped for fifth-generation (5G) networks underwent its first trial in April 2018, conducted by Huawei in Shenzhen. With the number of UAVs and the demands on them increasing more than tenfold year by year, designing UAV-assisted wireless networks faces enormous challenges.
I-A Related Works
In order to fully reap the benefits of deploying UAVs for communication purposes, some core technical challenges need to be addressed first: for instance, the 3D placement of multiple UAVs, interference elimination [11, 12], energy efficiency optimization [13, 14, 15], trajectory/movement design [16], air-ground channel modeling [17], power/bandwidth allocation [8], and the compatibility between aerial wireless networks and the existing cellular networks. Among all these challenges, the deployment of the UAVs is the fundamental and foremost one. Several groups have made substantial contributions toward solving the aforementioned problems.
I-A.1 Deployment of Static UAVs
Early research contributions studied the deployment of static UAVs in both single-UAV and multi-UAV scenarios. The authors in [2] proposed a solution for deploying a single aerial base station in the scenario where ground base stations were destroyed by a natural disaster. In the letter [18], the authors derived an analytical solution for optimizing the UAV's altitude to provide maximum radio coverage for the users. In [19], the authors presented an optimal deployment algorithm for a UAV base station (UAV-BS) that provides wireless service to the maximum number of connected ground users, while the transmit power of the UAV is also minimized. However, these research contributions mainly focus on the single-UAV problem; the cooperative deployment of multiple UAVs needs to be designed for providing reliable service to users. In [20], the authors proposed a novel framework for maximizing the throughput of the users while accounting for fairness among them, and the three-dimensional placement of the UAVs was obtained with the aid of circle packing theory. Both the total coverage radius and the operating lifetime of the UAVs were maximized in [21]. However, the high mobility and dynamic movement design of multiple UAVs is not considered in the aforementioned research.
I-A.2 Deployment of Mobile UAVs
It is intuitive that a UAV with movement is capable of attaining better performance than a static UAV when the users are mobile. When faced with the challenge of trajectory design, current research mainly considers the scenario in which the ground users are static. The trajectory of a single UAV was designed in [22] for serving each user via TDMA, even though orthogonal multiple access [23, 24] had been proposed to improve the system capacity of 5G networks. Eventually, a cyclical trajectory was designed to provide more reliable service for the users, and a significant throughput gain over a single static UAV was obtained. In [25], the authors presented a low-complexity solution for the trajectory design of a single UAV to maximize the minimum average throughput of all users. However, only a single UAV was considered in that work, and its trajectory was restricted to a simple circular one. Research on the dynamic movement of multiple UAVs remains an open area.
I-A.3 Reinforcement Learning in UAV-assisted Communications
Machine learning applications have attracted remarkable attention in wireless communication networks during recent decades, and reinforcement learning has been shown to be capable of tackling problems in UAV-assisted wireless networks [26]. An interference-aware path planning scheme based on deep reinforcement learning was proposed for a network of cellular-connected UAVs in [27], achieving better wireless latency and transmit rates. In [28], the authors proposed a UAV relay scheme based on both reinforcement learning and deep reinforcement learning; with the aid of these two algorithms, the energy consumption of the UAV is minimized while a better bit error rate (BER) performance is achieved. However, the multi-UAV scenario was not considered in that article, and the movement of the ground users was also neglected for the purpose of simplifying the system model.
I-B Our New Contributions
Again, deploying UAVs as aerial base stations has attracted increasing interest both in academia and industry [29]. However, few current research contributions consider the 3D deployment of multiple UAVs. The real-time movement of ground users is also overlooked by current articles, and the mobility of the UAVs needs to be designed accordingly to attain a better quality of service. In contrast to current articles [22, 27], which design the trajectory of a UAV between a fixed source and destination, the movement of the UAVs in this article has no fixed destination, owing to the mobility of the users. Additionally, a research gap remains in investigating QoE-driven UAV-assisted communications, which is also worth studying for further improving wireless services. The motivations of this work are summarized as follows:

Mobility of UAVs has not been designed around the movement of the users. Current research contributions address the two-dimensional placement of multiple UAVs or the mobility of a single UAV while the ground users remain static. The UAVs need to travel when the ground users roam continuously, so as to maintain good connections.

QoE has not been considered in UAV-assisted communications. Most existing research contributions address this issue without considering the specific requirements of different ground users. QoE is invoked to capture the users' satisfaction, and it ought to be considered in UAV-assisted wireless networks.

Three-dimensional dynamic movement of UAVs has not been considered in UAV-assisted wireless networks. Most existing research focuses only on the 2D movement design of UAVs. UAVs are capable of adjusting their 3D positions dynamically and swiftly to maintain high performance; hence, the 3D dynamic movement design of UAVs needs to be investigated.
Therefore, the problem of 3D deployment and dynamic movement design of multiple UAVs is formulated for improving the users' QoE rather than their throughput. Our new contributions are summarized as follows:

We propose a novel QoE-driven multi-UAV assisted communication framework, in which multiple UAVs are deployed in a 3D space to serve mobile users. The mean opinion score (MOS) is adopted for evaluating the satisfaction of the users. Meanwhile, we formulate the sum-MOS maximization problem by optimizing the UAVs' placement and dynamic movement.

We develop a three-step approach for solving the formulated problem: i) we invoke the GA-K-means algorithm to obtain the initial cell partition; ii) we develop a Q-learning based deployment algorithm for attaining the 3D placement of the UAVs when the users are static at the initial time; iii) we develop a Q-learning based movement algorithm for designing the 3D dynamic movement of the UAVs.

We conceive a Q-learning based solution to the NP-hard 3D deployment and movement problem of the UAVs. In contrast to conventional genetic-algorithm-based learning algorithms, the proposed algorithm is capable of training the policy by learning from trial and error.

We demonstrate that the proposed algorithms converge quickly. Additionally, the proposed Q-learning based deployment algorithm outperforms K-means algorithms and the IGK algorithm at a low complexity.
I-C Organization
In Section II, the problem formulation for the QoE-based 3D deployment and dynamic movement of multiple UAVs is presented. In Section III, efficient cell partition and UAV deployment for QoE maximization with static users are proposed. In Section IV, the Q-learning algorithm is utilized to attain the UAVs' dynamic movement when the users are roaming. The numerical results for both the deployment and the dynamic movement of multiple UAVs are illustrated in Section V. The conclusions are presented in Section VI, followed by some essential proofs. Additionally, the list of notations is given in Table I.
Notations  Description  Notations  Description

U  Number of ground users  N  Number of clusters and of UAVs
x_{k_n}, y_{k_n}  Ground users' coordinates  x_n, y_n  UAVs' coordinates
f_c  Carrier frequency  h_n  UAVs' altitudes
P_{\max}  UAV's maximum transmit power  g_{k_n}  Channel power gain from UAV to user
N_0  Noise power spectral density  B  Total bandwidth
\mu_{LoS}, \mu_{NLoS}  Additional path loss for LoS and NLoS  P_{LoS}, P_{NLoS}  LoS and NLoS probabilities
\Gamma_{k_n}  Received SINR of ground user  I_{k_n}  Received interference of ground user
r_{k_n}  Instantaneous achievable rate  R_{\text{sum}}  Overall achievable sum rate
b_1, b_2  Environmental parameters (dense urban)  \alpha  Path-loss exponent
\text{MOS}(k_n)  Overall achievable MOS  s_t  State in Q-learning model at time t
a_t  Action in Q-learning model at time t  r_t  Reward in Q-learning model at time t
II System Model
II-A System Description
Consider the downlink transmission in UAV-assisted wireless networks, where multiple UAVs are deployed as aerial base stations to support the mobile users in a particular area. As illustrated in Fig. 1, this area is divided into N clusters, and the users are denoted by K = \{K_1, \cdots, K_N\}, where K_n is the set of users partitioned into cluster n, n \in \mathbb{N} = \{1, 2, \cdots, N\}. The clustering scheme is discussed in Section III. Each user belongs to only one cluster, thus K_n \cap K_{n'} = \emptyset, n' \neq n, \forall n', n \in \mathbb{N}, while \mathrm{K}_n = \left| K_n \right| is the number of users partitioned into cluster n. In each cluster n, n \in \mathbb{N}, a single UAV is dispatched. The UAVs connect with the core network via satellite or a fully working terrestrial base station. At any time t, the users in the same cluster are served simultaneously by the same UAV employing FDMA [30]. The coordinate of each user is denoted by w_{k_n} = [x_{k_n}(t), y_{k_n}(t)]^T \in \mathbb{R}^{2 \times 1}, k_n \in K_n, where x_{k_n}(t) and y_{k_n}(t) are the coordinates of user k_n at time t. The movement of the ground users is discussed in Section IV. The vertical trajectory is denoted by h_n(t) \in [h_{\min}, h_{\max}], 0 \leq t \leq T_n, and the horizontal coordinate of UAV n by q_n(t) = [x_n(t), y_n(t)]^T \in \mathbb{R}^{2 \times 1}, with 0 \leq t \leq T_n.
Furthermore, the distance from the UAV n to user k_{n} at time t can be expressed as
\displaystyle d_{k_n}(t) = \sqrt{h_n^2(t) + [x_n(t) - x_{k_n}(t)]^2 + [y_n(t) - y_{k_n}(t)]^2}.  (1)
II-B Signal Model
Since they are deployed in 3D space, the UAVs have a higher probability of LoS connections than terrestrial communication infrastructure. The LoS probability is given by [18]
\displaystyle P_{\text{LoS}}(\theta_{k_n}) = b_1 \left( \frac{180}{\pi}\theta_{k_n} - \zeta \right)^{b_2},  (2)
where \theta_{k_n}(t) = \sin^{-1}[h_n(t)/d_{k_n}(t)] is the elevation angle between UAV n and ground user k_n, and b_1, b_2 and \zeta are constants determined by the environment. In practice, the altitude of UAV n, h_n(t) \in [h_{\min}, h_{\max}], 0 \leq t \leq T_n, should be chosen properly to balance the probability of a LoS channel against the path loss. The non-line-of-sight (NLoS) probability is then given by P_{\text{NLoS}} = 1 - P_{\text{LoS}}.
Thus, the channel power gain from UAV n to user k_{n} at time t can be expressed as
\displaystyle g_{k_n}(t) = K_0^{-1} d_{k_n}^{-\alpha}(t) \left[ P_{\text{LoS}}\mu_{\text{LoS}} + P_{\text{NLoS}}\mu_{\text{NLoS}} \right]^{-1},  (3)
where K_0 = \left( \frac{4\pi f_c}{c} \right)^2, \alpha is a constant denoting the path-loss exponent, \mu_{LoS} and \mu_{NLoS} are the additional attenuation factors for the LoS and NLoS links, f_c is the carrier frequency, and c is the speed of light.
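As a concrete illustration of (1)-(3), the following sketch computes the channel power gain from a UAV to a ground user. The numerical values of B1, B2, ZETA, FC, ALPHA, MU_LOS and MU_NLOS are illustrative assumptions chosen for the example, not values taken from this paper.

```python
import math

# Illustrative parameters (assumptions, not the paper's values).
B1, B2, ZETA = 0.36, 0.21, 0.0    # environmental constants of (2)
FC = 2e9                          # carrier frequency f_c [Hz]
C = 3e8                           # speed of light [m/s]
ALPHA = 2.0                       # path-loss exponent
MU_LOS, MU_NLOS = 3.0, 23.0       # additional path loss (linear), LoS / NLoS

def channel_gain(x_uav, y_uav, h_uav, x_user, y_user):
    """Channel power gain g_{k_n} from a UAV to a ground user, per (1)-(3)."""
    d = math.sqrt(h_uav**2 + (x_uav - x_user)**2 + (y_uav - y_user)**2)  # (1)
    theta = math.asin(h_uav / d)                # elevation angle [rad]
    p_los = B1 * (180.0 / math.pi * theta - ZETA) ** B2                  # (2)
    p_nlos = 1.0 - p_los
    k0 = (4.0 * math.pi * FC / C) ** 2
    return 1.0 / (k0 * d**ALPHA * (p_los * MU_LOS + p_nlos * MU_NLOS))  # (3)
```

Note how the gain captures the altitude trade-off discussed above: raising the UAV increases the distance (and hence path loss) but also the elevation angle and the LoS probability.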
Assume that the total available bandwidth B is distributed equally among the users of a cluster, so that the bandwidth of each user is B_{k_n} = B/\mathrm{K}_n. Likewise, the total transmit power is allocated uniformly, namely p_{k_n} = P_{\max}/\mathrm{K}_n, where P_{\max} is the maximum allowed transmit power of each UAV.
As the interference received by the users from other UAVs can be mitigated by using the millimeter-wave (mmWave) band and zero-forcing (ZF) beamforming [16, 10], the received SNR \Gamma_{k_n}(t) of ground user k_n connected to UAV n at time t can be expressed as
\displaystyle \Gamma_{k_n}(t) = \frac{p_{k_n}(t)\, g_{k_n}(t)}{\sigma^2},  (4)
where \sigma^2 = B_{k_n} N_0, with N_0 denoting the power spectral density of the additive white Gaussian noise (AWGN) at the receivers. Let \gamma denote the SNR target that must be met by all ground users; then \Gamma_{k_n}(t) \geq \gamma.
Lemma 1.
To guarantee that all the users are capable of connecting to the network, the UAVs' transmit power must satisfy the constraint
\displaystyle P_{\max} \geq \gamma \sigma^2 K_0 \mu_{LoS} \cdot \max\{h_1, h_2, \cdots, h_N\}^{\alpha}.  (5)
Proof.
See Appendix A . ∎
Lemma 1 gives the constraint on the transmit power of the UAVs. When choosing the type of UAV in practical applications, the transmit power should be no lower than the bound in (5); in this case, the received SNR of every ground user can exceed the SNR threshold.
The user k_{n}’s transmit rate {r_{{k_{n}}}}(t) at time t can be expressed as
\displaystyle r_{k_n}(t) = B_{k_n} \log_2 \left[ 1 + \frac{p_{k_n}(t)\, g_{k_n}(t)}{\sigma^2} \right].  (6)
Thus, the total transmit rate of all users at any time t is given by
\displaystyle R_{\text{sum}} = \sum_{n=1}^{N} \sum_{k_n=1}^{\mathrm{K}_n} r_{k_n}(t).  (7)
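Building on the channel gain, the per-user SNR of (4) and the rates of (6)-(7) under the equal bandwidth/power split can be sketched as follows; the bandwidth, power and noise-density figures are illustrative assumptions.

```python
import math

# Sketch of (4)-(7) under the equal bandwidth/power allocation described
# above. Default B, P_max and N_0 are illustrative assumptions.
def sum_rate(gains_per_cluster, b_total=1e6, p_max=1.0, n0=4e-21):
    """gains_per_cluster[n][k] is the channel power gain g_{k_n} from
    UAV n to its k-th user; returns R_sum of (7) in bit/s."""
    total = 0.0
    for gains in gains_per_cluster:
        k_n = len(gains)
        b_user = b_total / k_n          # B_{k_n} = B / K_n
        p_user = p_max / k_n            # p_{k_n} = P_max / K_n
        sigma2 = b_user * n0            # noise power sigma^2 = B_{k_n} N_0
        for g in gains:
            snr = p_user * g / sigma2                  # (4)
            total += b_user * math.log2(1.0 + snr)     # (6), summed as in (7)
    return total
```
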
Proposition 1.
The altitude constraint for the UAV n has to satisfy
\displaystyle d_{k_n}(t)\sin\left[ \frac{\pi}{180}(\zeta + e^{M}) \right] \leq h_n(t) \leq \left( \frac{P_{\max}}{\gamma K_0 \sigma^2 \mu_{LoS}} \right)^{1/\alpha},  (8)
where M = \ln\left[ \left( \frac{S}{\mu_{LoS} - \mu_{NLoS}} - \frac{\mu_{NLoS}}{\mu_{LoS} - \mu_{NLoS}} \right) \Big/ b_1 \right] \Big/ b_2 and S = \frac{P_{\max}}{\gamma K_0 \sigma^2 d_{k_n}^{\alpha}(t)}. The other parameters are introduced in Section II.
Proof.
See Appendix B . ∎
Proposition 1 provides the altitude constraints for the UAVs. From Proposition 1, it is observed that the lower bound of the altitude is a function of the distance d_{k_n}(t), while the upper bound is a function of the maximum transmit power P_{\max}. Hence, as the distance and transmit power vary, the altitude of the corresponding UAV needs to be adjusted to provide reliable service to the users.
II-C Quality-of-Experience Model
As illustrated in Fig. 1, the transmit-rate requirements of different users vary. It is therefore essential to consider a QoE model in UAV-assisted communication networks.
Definition 1.
QoE is a metric that depends on a person’s preferences towards a particular object or service related to expectations, experiences, behaviour, cognitive abilities, object’s attributes and the environment [31].
MOS is invoked as the measure of the user’s QoE [32, 33], and the MOS model is defined as follow
\displaystyle \text{MOS}(k_n) = \xi_1 \text{MOS}_{\text{delay}}(k_n) + \xi_2 \text{MOS}_{\text{rate}}(k_n),  (9)
where {\xi_{1}} and {\xi_{2}} are coefficients and {\xi_{1}}+{\xi_{2}}=1.
Definition 2.
We consider five hypotheses in relation to the five alternatives (on the ordinal scale) for each QoE state, the measurement is the MOS received by users. These are: excellent (4.5), very good (3.5\sim 4.5), good (2\sim 3.5), fair (1\sim 2) and poor (1).
Derived from (9), the MOS model is simplified as follows [34]
\displaystyle \text{MOS}(k_n) = -C_1 \ln[d(r_{k_n})] + C_2,  (10)
where d(r_{k_n}) is the delay time related to the transmit rate, and \text{MOS}(k_n) denotes the MOS score ranging from 1 to 4.5. Finally, C_1 and C_2 are constants determined by analyzing the experimental results of web browsing applications, set to 1.120 and 4.6746, respectively [34].
The delay time in (10) is given by [34]
\displaystyle d(r_{k_n}) = 3\,\text{RTT} + \frac{\text{FS}}{r_{k_n}} + L\left( \frac{\text{MSS}}{r_{k_n}} + \text{RTT} \right) - \frac{2\,\text{MSS}\left( 2^L - 1 \right)}{r_{k_n}},  (11)
where RTT [s] is the round-trip time, FS [bit] is the web page size, and MSS [bit] is the maximum segment size. The parameter L = \{L_1, L_2\} represents the number of slow-start cycles with idle periods, where L_1 = \log_2\left( \frac{r_{k_n}\text{RTT}}{\text{MSS}} + 1 \right) - 1 and L_2 = \log_2\left( \frac{\text{FS}}{2\,\text{MSS}} + 1 \right) - 1.
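The MOS computation of (10)-(11) can be sketched as below. The RTT, FS and MSS values are illustrative assumptions, and for concreteness the number of slow-start cycles is taken as the smaller of L_1 and L_2.

```python
import math

# Sketch of the web-browsing MOS model in (10)-(11).
# RTT, FS and MSS are illustrative assumptions, not the paper's values.
C1, C2 = 1.120, 4.6746   # constants of (10), from [34]
RTT = 0.05               # round-trip time [s]       (assumed)
FS = 8e6                 # web page size [bit]       (assumed)
MSS = 11680.0            # maximum segment size [bit] (assumed)

def mos(rate):
    """MOS of a user with transmit rate `rate` [bit/s], per (10)-(11)."""
    l1 = math.log2(rate * RTT / MSS + 1.0) - 1.0
    l2 = math.log2(FS / (2.0 * MSS) + 1.0) - 1.0
    l = min(l1, l2)      # slow-start cycles with idle periods (assumed min)
    delay = (3.0 * RTT + FS / rate
             + l * (MSS / rate + RTT)
             - 2.0 * MSS * (2.0 ** l - 1.0) / rate)    # (11)
    return -C1 * math.log(delay) + C2                   # (10)
```

A higher rate shortens the page-download delay, which raises the MOS logarithmically; this is the quantity the UAV placement and movement are designed to maximize.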
Remark 1.
When the UAVs fly towards some users to acquire a better channel environment for them, they move farther away from other users. Consequently, the MOS of the users farther away from the UAVs becomes lower than that of the users closer to the UAVs.
Our approach can accommodate other MOS models without loss of generality. Meanwhile, the model invoked in this paper can also be extended to other applications in particular scenarios.
II-D Problem Formulation
Let Q = \{q_n(t), 0 \leq t \leq T_n\} and H = \{h_n(t), 0 \leq t \leq T_n\}. Again, we aim to optimize the positions of the UAVs at each timeslot, i.e., \{x_n(t), y_n(t), h_n(t)\}, n = 1, 2, \cdots, N, t = 0, 1, \cdots, T_n, for maximizing the MOS of the users. Our optimization problem is then formulated as
\displaystyle \max_{C,Q,H} \ \text{MOS}_{\text{total}} = \sum_{n=1}^{N} \sum_{k_n=1}^{\mathrm{K}_n} \text{MOS}_{k_n}(r_{k_n}),  (12a)
\displaystyle \text{s.t.} \ K_n \cap K_{n'} = \emptyset, \ n' \neq n, \forall n,  (12b)
\displaystyle h_{\min} \leq h_n(t) \leq h_{\max}, \forall t, \forall n,  (12c)
\displaystyle \Gamma_{k_n}(t) \geq \gamma, \forall t, \forall k_n,  (12d)
\displaystyle \sum_{k_n=1}^{\mathrm{K}_n} p_{k_n}(t) \leq P_{\max}, \forall t, \forall n,  (12e)
\displaystyle p_{k_n}(t) \geq 0, \forall k_n, \forall t,  (12f)
where K_n is the set of users partitioned into cluster n, h_n(t) is the altitude of UAV n, and p_{k_n}(t) is the transmit power from UAV n to ground user k_n. Furthermore, (12b) states that each user is partitioned into a single cluster; (12c) is the altitude constraint of the UAVs; (12d) enforces the minimum SINR target each user requires to connect to the network; and (12e) and (12f) specify the transmit power constraints of the UAVs. Given the bandwidth and power allocation of the UAVs, this problem is simplified into one of UAV movement optimization. Since the movement of the users affects the rates they receive, and thus whether their QoE requirements are satisfied, the UAVs have to travel based on the real-time movement of the users to maintain their QoE; hence, maximizing the sum MOS in (12a) inherently incorporates the design of the UAVs' dynamic movement.
Theorem 1.
Problem (12a) is non-convex, since the objective function is non-convex over x_n(t), y_n(t) and h_n(t) [13, 8].
Proof.
See Appendix C . ∎
From Theorem 1, it can be observed that the MOS of the users is impacted by the UAVs' altitudes. This is because both the distance and the LoS probability between the UAVs and the users depend on the UAVs' altitudes. Increasing a UAV's altitude leads to a higher path loss but also a higher LoS probability; consequently, the UAVs have to increase their transmit power to satisfy the users' QoE requirements.
Remark 2.
The sum MOS depends on the transmit power, the number of UAVs, and their positions (both horizontal positions and altitudes).
Due to the combinatorial nature of cell partition and optimal position search, an exhaustive search provides a straightforward method for obtaining the final result. We instead propose an efficient GA-K-means based IGK algorithm for solving this problem. In addition, a low-complexity Q-learning based 3D deployment algorithm is developed to avoid the IGK algorithm's huge complexity. Furthermore, when the initial positions of the UAVs are fixed, obtaining the dynamic movement is still non-trivial due to the non-convexity of problem (12) in terms of sum-MOS maximization. Q-learning has proved to be an efficient solution for problems with an exact transition formulation, which motivates applying it here. This scheme is discussed in the next section.
III Three-Dimensional Deployment of the UAVs
We consider the scenario in which UAV n hovers above the users at an adjustable altitude while the users are static. The bandwidth and the transmit power of each UAV are allocated uniformly among its users. Thus, the optimization problem is simplified to a region segmentation problem, which is formulated as
\displaystyle \max_{C,Q,H} \ \text{MOS}_{\text{total}} = \sum_{n=1}^{N} \sum_{k_n=1}^{\mathrm{K}_n} \text{MOS}_{k_n}(r_{k_n}),  (13a)
\displaystyle \text{s.t.} \ K_n \cap K_{n'} = \emptyset, \ n' \neq n, \forall n,  (13b)
\displaystyle h_{\min} \leq h_n \leq h_{\max}, \forall n,  (13c)
\displaystyle \Gamma_{k_n} \geq \gamma, \forall k_n,  (13d)
\displaystyle \sum_{k_n=1}^{\mathrm{K}_n} p_{k_n} \leq P_{\max}, \forall n,  (13e)
\displaystyle p_{k_n} \geq 0, \forall k_n.  (13f)
Theorem 2.
Problem (13a) is NP-hard. More specifically, it remains NP-hard even when only the user clustering is considered.
Proof.
See Appendix D . ∎
III-A Initial Algorithm for Cell Partition of Ground Users
User clustering is the first step toward obtaining the deployment of the UAVs. It is the process of partitioning the users into different clusters; in each cluster, the users are linked with a single UAV. The K-means algorithm, also named Lloyd's algorithm, can solve the clustering problem and obtain the initial 3D positions of the UAVs with low complexity. It partitions the users into clusters based on the nearest-neighbor-barycenter policy and then recalculates the barycenter of each cluster [35]. Specifically, the K-means algorithm minimizes the squared error between the empirical mean of a cluster and the points within it [36]. By invoking the K-means algorithm, the users are partitioned into N clusters.
However, K-means is a greedy algorithm whose result has a high probability of converging to a local minimum even when the clusters are well separated. The reason is the sensitivity of the K-means algorithm to the initial centers [35]; as a result, its performance varies over runs. To fill this gap of the K-means algorithm, the GA-K-means algorithm, which is based on the genetic algorithm, is invoked to obtain globally optimal clustering results.
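A minimal sketch of the clustering step: plain K-means (Lloyd's algorithm) with several random restarts, used here as a lightweight stand-in for the GA-based seeding of GA-K-means, which mitigates the initial-center sensitivity discussed above.

```python
import random

# K-means (Lloyd's algorithm) with random restarts; the genetic search
# over initial centers used by GA-K-means is omitted for brevity.
def kmeans(users, n_clusters, restarts=10, iters=50, seed=0):
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(restarts):
        centers = rng.sample(users, n_clusters)
        for _ in range(iters):
            # Assign each user to the nearest center.
            groups = [[] for _ in range(n_clusters)]
            for (x, y) in users:
                j = min(range(n_clusters),
                        key=lambda i: (x - centers[i][0])**2 + (y - centers[i][1])**2)
                groups[j].append((x, y))
            # Recompute each center as the barycenter of its cluster.
            centers = [
                (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
                if g else centers[i]
                for i, g in enumerate(groups)
            ]
        # Sum of squared distances to the nearest center.
        cost = sum(min((x - cx)**2 + (y - cy)**2 for cx, cy in centers)
                   for x, y in users)
        if cost < best_cost:
            best, best_cost = (centers, groups), cost
    return best
```

Keeping the best of several restarts is a crude substitute for the genetic search; GA-K-means instead evolves the candidate center sets, which is what yields the globally optimal partition used in the sequel.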
III-B Q-learning Algorithm for 3D Deployment of UAVs
In this section, given the cell partition of the users, our goal is to obtain the 3D locations of the UAVs that maximize the sum MOS. In the process of cell partition, the GA-K-means algorithm obtains a group of UAV locations that minimize the sum Euclidean distance. As the MOS of a particular user is related to the distance between that user and its UAV, GA-K-means could be regarded as a low-complexity scheme for obtaining the deployment of the UAVs. However, the objective of GA-K-means is to minimize the sum Euclidean distance, whereas it can be noticed from (11) that the transmit rate is related not only to the Euclidean distance between the users and the UAV, but also to the LoS probability. Hence, GA-K-means is not effective for solving the 3D deployment problem. Although the Iterative GA-K-means (IGK) algorithm is able to obtain the optimal altitudes of the UAVs through an iterative process, its complexity is high: the worst-case computational complexity is O(N(U^2 + U + 1)). In this case, a more effective algorithm is required to solve the 3D deployment problem.
In the Q-learning model, the UAVs act as agents, and the model consists of four parts: states, actions, rewards and Q-values. The aim of Q-learning is to attain a policy that maximizes the rewards observed over the interaction time of the agents. At each time slot t of an iteration, each agent observes a state, s_t, from the state space, S. Accordingly, each agent carries out an action, a_t, from the action space, A, selecting a direction based on the policy, J. The decision policy J is determined by a Q-table, Q(s_t, a_t); at each time slot, the policy chooses the action that yields the maximum Q-value. Following the action, the state of each agent transitions to a new state, s_{t+1}, and the agent receives a reward, r_t, determined by the instantaneous sum MOS of the ground users.
III-B.1 State Representation
Each agent is represented by a three-component state vector defined as \xi = (x_{\text{UAV}}, y_{\text{UAV}}, h_{\text{UAV}}), where (x_{\text{UAV}}, y_{\text{UAV}}) is the horizontal position of the UAV and h_{\text{UAV}} is its altitude. As the UAVs are deployed in a particular district, the state space can be denoted as x_{\text{UAV}}: \{0, 1, \cdots, X_d\}, y_{\text{UAV}}: \{0, 1, \cdots, Y_d\}, h_{\text{UAV}}: \{h_{\min}, \cdots, h_{\max}\}, where X_d and Y_d are the maximum coordinates of this district.
For each trial, the initial state (position) of each UAV is determined randomly; the update rate of the Q-table is determined by the initial positions of the UAVs and by the numbers of users and UAVs. The closer a UAV is placed to its optimal position, the faster the update rate will be.
III-B.2 Action Space
At each step, each agent (UAV) carries out an action a_t \in A, which consists of choosing a direction, according to the current state, s_t \in S, based on the decision policy J. Each UAV can fly in a large variety of directions, which makes the problem non-trivial to solve. However, by assuming that the UAVs fly with simple coordinated turns, we can simplify the model down to 7 directions. With the directions appropriately chosen, we obtain a trade-off between accuracy and model complexity.
IIIB3 State Transition Model
The transition from {s_{t}} to {s_{t+1}} with reward {r_{t}} when action {a_{t}} is carried out is characterized by the conditional transition probability p({s_{t+1}},{r_{t}}\mid{s_{t}},{a_{t}}). The goal of the Q-learning model is to maximize the long-term reward, expressed as follows
\displaystyle{G_{t}={\rm E}\left[\sum\limits_{n=0}^{\infty}{\beta^{n}}{r_{t+n}}\right]},  (14) 
where \beta is the discount factor.
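On a finite reward trace, the discounted return of Eq. (14) truncates to a finite sum; a minimal sketch:

```python
def discounted_return(rewards, beta):
    """G_t = sum_n beta^n * r_{t+n} over a finite reward trace (Eq. (14))."""
    return sum((beta ** n) * r for n, r in enumerate(rewards))
```

With 0 < beta < 1, rewards collected many slots in the future contribute exponentially less, which is what makes the long-term objective well defined.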
The agents, states, actions and rewards in the Q-learning model are defined as follows:

Agent: UAV n, n\in\mathbb{N}=\left\{{1,2,\cdots N}\right\}.

State: The UAVs’ 3D positions (horizontal coordinates and altitudes) are modeled as states.

Actions: The actions of each agent are the set of coordinate offsets representing the flying directions of the UAVs: (1,0,0) denotes a right turn of the UAV; (-1,0,0) a left turn; (0,1,0) forward; (0,-1,0) backward; (0,0,1) ascent; (0,0,-1) descent; and (0,0,0) means that the UAV stays static.

Rewards: The reward function is formulated from the instantaneous MOS of the users. If the action that the agent carries out at the current time t improves the sum MOS, the UAV receives a positive reward; otherwise, the agent receives a negative reward. Therefore, the reward function can be expressed as
\displaystyle{{r_{t}}=\left\{{\begin{array}{ll}1,&{\text{if }}{\text{MOS}}_{{\text{new}}}>{\text{MOS}}_{{\text{old}}},\\ 0.1,&{\text{if }}{\text{MOS}}_{{\text{new}}}={\text{MOS}}_{{\text{old}}},\\ -1,&{\text{if }}{\text{MOS}}_{{\text{new}}}<{\text{MOS}}_{{\text{old}}}.\end{array}}\right.}  (15)
Remark 3.
The reward value 0.1 when {\text{MOS}}_{{\text{new}}}={\text{MOS}}_{{\text{old}}} prevents the model from converging to a local optimum. The reward values can be refined further to increase the convergence rate.
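The reward function of Eq. (15) and Remark 3 can be sketched directly; the tie reward is kept as a parameter to reflect the remark that it may be refined:

```python
def reward(mos_new, mos_old, tie_reward=0.1):
    """Reward of Eq. (15): +1 if the sum MOS improves, -1 if it degrades,
    and a small positive tie reward to discourage settling at a local optimum."""
    if mos_new > mos_old:
        return 1.0
    if mos_new < mos_old:
        return -1.0
    return tie_reward
```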
During the course of learning, the state-action value function of each agent is iteratively updated as given in (16) at the top of the next page.
\displaystyle{Q(s,a)\leftarrow(1-\alpha)\cdot\underbrace{Q(s,a)}_{{\text{old value}}}+\underbrace{\alpha}_{{\text{learning rate}}}\cdot\overbrace{\Big[\underbrace{r}_{{\text{reward}}}+\underbrace{\gamma}_{{\text{discount factor}}}\cdot\underbrace{{{\max}_{a^{\prime}}}Q(s^{\prime},a^{\prime})}_{{\text{estimate of optimal future value}}}\Big]}^{{\text{learned value}}}.}  (16) 
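The temporal-difference update of Eq. (16) can be sketched on a dictionary-based Q-table (the dict keyed by (state, action) pairs is an illustrative data structure, not prescribed by the paper):

```python
def q_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """One temporal-difference update of Eq. (16) on a dict-based Q-table.
    Unvisited (state, action) pairs default to a Q-value of 0."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = (1 - alpha) * old + alpha * (r + gamma * best_next)
```

The new entry blends the old estimate with the observed reward plus the discounted best estimate of the next state, exactly the two braced terms in (16).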
The policy in the Q-learning based deployment algorithm is chosen according to the \epsilon-greedy policy. More specifically, the policy J that maximizes the Q-value is chosen with a high probability of 1-\epsilon, while other actions are selected with a low probability to avoid being trapped in a local maximum, i.e.,
\displaystyle{\Pr(J=\widehat{J})=\left\{\begin{array}{ll}1-\epsilon,&\widehat{a}=\arg\max_{a}Q\left(s,a\right),\\ \epsilon/\left(\left|A\right|-1\right),&\text{otherwise}.\end{array}\right.}  (17) 
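A minimal sketch of the \epsilon-greedy selection of Eq. (17): the greedy action is taken with probability 1-\epsilon, and the remaining |A|-1 actions are explored uniformly otherwise:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """Eq. (17): exploit the greedy action with probability 1 - epsilon,
    otherwise explore uniformly among the remaining |A| - 1 actions."""
    greedy = max(actions, key=lambda a: Q.get((s, a), 0.0))
    if random.random() < 1 - epsilon:
        return greedy
    return random.choice([a for a in actions if a != greedy])
```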
Remark 4.
The convergence rate of the proposed algorithm for solving the 3D deployment problem of the UAVs varies with the initial positions of the UAVs.
IIIC Analysis of the Proposed Algorithms
Algorithm   | K-means algorithm | IGK algorithm    | Exhaustive search | Q-learning algorithm
Complexity  | O(TUN)            | O(TN(U^{2}+U+1)) | O(TUN^{U})        | O(TUN)
Optimality  | Suboptimal        | Suboptimal       | Optimal           | Suboptimal
Stability   | Not stable        | Not stable       | Stable            | Stable
IIIC1 Complexity
The computational complexity of evaluating the square error of a given solution string is O(TU), the complexity of the mutation operator is O(TU^{2}), and that of the K-means operator is O(TUN). Thus, the total complexity of GA-K-means is O(TN({U^{2}}+U+1)). As for the Q-learning algorithm, since the training process is carried out offline, once the Q-table has been updated its complexity equals that of the K-means algorithm, namely O(TUN).
IIIC2 Stability and Convergence
As the K-means algorithm is very sensitive to the initial centers, the performance of the K-means and IGK algorithms varies over runs. On the contrary, it has been proved in [37] that when the learning rate \alpha satisfies the convergence conditions \sum\nolimits_{t\geq 0}{{\alpha_{t}}=+\infty} and \sum\nolimits_{t\geq 0}{\alpha_{t}^{2}}<+\infty, then regardless of the initial setting of Q(s,a), the Q-learning algorithm converges to the optimal value with probability 1 provided the number of trials is sufficiently large. The core of the Q-learning algorithm is updating the Q-table; once the Q-table has been updated, the positions and altitudes of the UAVs are found, and the algorithm is stable.
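As an illustration of the convergence conditions cited from [37], the harmonic schedule \alpha_t = 1/(t+1) can be checked numerically; this schedule is an example chosen for illustration, not the constant learning rate used later in the simulations:

```python
# Illustrative check of the convergence conditions on the learning rate:
# sum(alpha_t) must diverge while sum(alpha_t^2) must stay finite.
# The harmonic schedule alpha_t = 1/(t+1) satisfies both.
def partial_sums(schedule, T):
    s1 = sum(schedule(t) for t in range(T))
    s2 = sum(schedule(t) ** 2 for t in range(T))
    return s1, s2

harmonic = lambda t: 1.0 / (t + 1)
s1, s2 = partial_sums(harmonic, 100_000)
# s1 keeps growing like ln(T), while s2 converges towards pi^2/6 ~ 1.6449.
```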
IV Dynamic movement design of UAVs
In this section, a Q-learning based algorithm is proposed for obtaining the UAVs' dynamic movement, a problem which can be proved to be nonconvex [8]. As the users roam at each time slot, the optimal position of the UAV in each cluster changes with the positions of the users [16]. Since there is no standard method that solves the formulated problem efficiently, a reinforcement learning approach is invoked in this section to tackle these challenges.
Remark 5.
As the users move continuously, the optimal positions of the UAVs change at each time slot. The UAVs have to travel accordingly to offer a better downlink service; otherwise, the users' QoE requirements will not be satisfied.
Please note that, as the UAVs and ground users move continuously, users may move away from their initial areas and mix with users initially allocated to other UAVs. In this case, updating the user-UAV association according to the new locations of both users and UAVs can further enhance the quality of service for the ground users. Nevertheless, it requires enlarging the action space of the Q-learning algorithm from 7 to 7+2N\sum\nolimits_{n=1}^{N}{\left|{K_{n}}\right|} actions, where N is the number of UAVs and \left|{K_{n}}\right| denotes the total number of users in the n-th cluster. The complexity of the algorithm would then increase tremendously, and it would be nontrivial to obtain a policy that maximizes the observed rewards because of the huge Q-table. A potential solution is to invoke a deep reinforcement learning algorithm to reduce the dimension of the Q-table, which will be discussed in our future work. In this paper, we study the trajectory design of the UAVs over a short time, so we assume that users do not roam into other clusters. The practical mobility of users will be considered in our future work.
Before obtaining the dynamic movement of the UAVs, the mobility of the users has to be considered first. A variety of mobility models exist, such as the deterministic approach, the hybrid approach and the random walk model [38, 39]. In this paper, we choose the random walk model, also denoted as the Markovian mobility model (our approach is capable of accommodating other mobility models without loss of generality), to represent the movement of the users. The direction of each user's movement is uniformly distributed among left, right, forward and backward. The speed of each user is assigned as a pedestrian speed in [0,c_{max}], where c_{max} denotes the maximum speed of a user. The real-time positions of the users are thus generated by the random walk model; since this model can be replaced by other mobility models such as the deterministic or hybrid approach, the proposed method for obtaining the dynamic movement of the UAVs can be implemented no matter how the ground users move. At each time slot, as the users move, each UAV takes an action {a_{t}}\in A, which again consists of choosing a direction for itself according to the current state, {s_{t}}\in S, based on the decision policy J; the state of the next time slot is then {s_{t+1}}={s_{t}}+{a_{t}}. The Q-learning method is utilized to obtain the action at each time slot.
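A minimal sketch of one random walk step, under the assumption that ground users move in the horizontal plane only (`c_max` and the four directions follow the description above):

```python
import random

def random_walk_step(pos, c_max):
    """One time slot of the random walk (Markovian) mobility model:
    a uniformly chosen direction and a pedestrian speed drawn from [0, c_max].
    Ground users are assumed to move in the horizontal plane only."""
    x, y = pos
    direction = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
    speed = random.uniform(0.0, c_max)
    return (x + speed * direction[0], y + speed * direction[1])
```

Replacing this function with a deterministic or hybrid mobility model leaves the rest of the movement design unchanged, which is the modularity claimed above.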
In contrast to the Q-learning based deployment algorithm, the state in the proposed Q-learning based dynamic movement design algorithm consists of two parts, the dynamic 3D positions of the UAVs and the dynamic 2D positions of the users, denoted as \xi=(x_{{\text{UAV}}},y_{{\text{UAV}}},h_{{\text{UAV}}},x_{{\text{user}}},y_{{\text{user}}}). ({x_{{\text{user}}}},{y_{{\text{user}}}}) is determined by the initial positions and movement of the users, while ({x_{{\text{UAV}}}},{y_{{\text{UAV}}}},{h_{{\text{UAV}}}}) is determined by the dynamics of the UAVs' positions and the actions they took in the last time slot.
Remark 6.
The convergence rate of the proposed algorithm for solving the dynamic movement problem of the UAVs varies with the simulation timespan. The smaller the simulation timespan, the more accurate the obtained dynamic movement, but the more complex the model. There is thus a tradeoff between improving the QoE of the users and the complexity of the proposed algorithm.
It is worth noting that the proposed Q-learning algorithm for the 3D dynamic movement of the UAVs builds on the results of the 3D deployment of the UAVs. At the beginning of the first time slot, the UAVs are at the optimal positions found by the Q-learning based deployment algorithm. Then, at each time slot, the optimal actions of the UAVs are attained based on the mobility of the users; these actions make up the dynamic movement of the UAVs.
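The per-slot interplay between user movement and UAV movement can be sketched as follows. This is a simplified stand-in: a toy MOS surrogate (negative squared horizontal distance to the users' centroid) replaces the full MOS model, and a one-step greedy choice replaces the learned Q-table:

```python
# Simplified per-slot movement step: after the users move, the UAV picks the
# direction that maximizes a toy MOS surrogate. In the paper this choice is
# made by the learned Q-table rather than a one-step search.
ACTIONS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0),
           (0, 0, 1), (0, 0, -1), (0, 0, 0)]

def centroid(users):
    xs, ys = zip(*users)
    return sum(xs) / len(xs), sum(ys) / len(ys)

def mos_surrogate(uav, users):
    """Toy stand-in for the sum MOS: closer to the cluster centroid is better."""
    cx, cy = centroid(users)
    return -((uav[0] - cx) ** 2 + (uav[1] - cy) ** 2)

def move_one_slot(uav, users):
    """Choose the next UAV state among the 7 candidate moves."""
    return max(((uav[0] + dx, uav[1] + dy, uav[2] + dh)
                for dx, dy, dh in ACTIONS),
               key=lambda s: mos_surrogate(s, users))
```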
TABLE III: Simulation parameters

Parameter          | Description                              | Value
f_{c}              | Carrier frequency                        | 2 GHz
{P_{\max}}         | UAVs' maximum transmit power             | 20 dBm
N_{0}              | Noise power spectral density             | -170 dBm/Hz
U                  | Number of ground users                   | 100
N                  | Number of UAVs                           | 4
B                  | Bandwidth                                | 1 MHz
b_{1}, b_{2}       | Environmental parameters (dense urban)   | 0.36, 0.21 [20]
\alpha             | Path loss exponent                       | 2
{\mu_{LoS}}        | Additional path loss for LoS             | 3 dB [20]
{\mu_{NLoS}}       | Additional path loss for NLoS            | 23 dB [20]
\alpha_{l}, \beta  | Learning rate, discount factor           | 0.01, 0.7
V Numerical Results
In this section, numerical results are provided to evaluate the QoE performance of the users under the proposed 3D deployment scheme and dynamic movement scheme. In the simulations, we consider 100 users uniformly and independently distributed within a 1 km \times 1 km geographical urban district, and the users roam following the random walk model at each time slot. The other simulation parameters are given in Table III.
VA Numerical Results of 3D Deployment Problem
Fig. 3 plots the 3D deployment of the UAVs at the initial time. It can be observed that the ground users are divided into 4 clusters asymmetrically, each cluster being served by one UAV. UAVs in different clusters attain different altitudes. This is because, as the altitude of a UAV increases, a higher path loss and a higher LoS probability are obtained simultaneously; the altitudes of the UAVs are therefore determined by the density and positions of the ground users.
Fig. 4 plots the QoE of each ground user. It can be observed that as the distance from the UAV to a ground user increases, the MOS of this user decreases. This result corroborates the insights in Remark 1. The MOS figures differ across clusters because the ground users have various QoE requirements. Hence, the MOS of a ground user is determined by the distance to its UAV as well as by the user's QoE requirements.
Fig. 5 plots the QoE versus the maximum number of users for different algorithms. As analyzed in Section III, the complexity of the exhaustive search solution is nontrivial; to show its results, different numbers of users are considered. From Fig. 5, it can be observed that the total MOS improves as the UAVs' transmit power and the number of ground users increase. Additionally, the Q-learning algorithm achieves better performance than the K-means and IGK algorithms as the transmit power of the UAVs increases.
Fig. 6 plots the QoE of the ground users as the maximum transmit power of the UAVs varies. The K-means algorithm, the IGK algorithm and random deployment are demonstrated as benchmarks. One can observe that the attained sum MOS increases with the transmit power of the UAVs, because the received SINR at the users improves, which in turn improves the sum MOS of the ground users. Moreover, increasing the number of clusters improves the sum MOS, which corroborates the insights in Remark 2. When the transmit power of the UAVs is 5 dBm, the MOS is nearly the same for different numbers of clusters, because the received SINR cannot satisfy the rate requirements of the users. It can also be observed from Fig. 6 that increasing the transmit power further enlarges the performance gap between different cluster numbers. Furthermore, the Q-learning algorithm outperforms the K-means and IGK algorithms while approaching the exhaustive search. It is also worth noting that, with the training process carried out offline, once the Q-table has been updated the complexity of the Q-learning algorithm is the same as that of the K-means algorithm.
Fig. 7 plots the convergence of the Q-learning algorithm, where each line represents a different initial position of the UAVs in each trial. The figure illustrates that the convergence rate of the Q-learning algorithm levels off after about 20 iterations of the training process and remains stable thereafter, because the Q-table has already been updated after training. This result corroborates the insights in Remark 4: the optimal positions are found after about 20 trials regardless of the initial positions of the UAVs.
VB Numerical Results of Dynamic Movement Problem
Fig. 8 plots the sum MOS with the UAVs' movement derived from the Q-learning algorithm. The sum MOS when the UAVs are static and the sum MOS with the movement of the UAVs derived from the IGK algorithm are also illustrated as benchmarks. When the ground users move following the random walk model, the UAVs are supposed to move along with them; otherwise, the sum MOS decreases, as shown in Fig. 8. This result corroborates the insights in Remark 5. One can also observe that the Q-learning algorithm performs better than the IGK algorithm. Furthermore, note that the computational complexity of obtaining the dynamic movement of the UAVs with the IGK algorithm is nontrivial. In contrast, the Q-learning algorithm enables each UAV to gradually learn its own movement, from which the overall dynamic movement is obtained.
Fig. 9 plots the dynamic movement of the UAVs derived from the proposed approach when the ground users move. In this figure, the dynamic movement of a single UAV is shown over a duration of 100 s. In this simulation, we assume that all UAVs move at a constant speed. At each time slot, a UAV chooses a direction from the action space containing 7 directions, and the resulting movement maximizes the QoE of the ground users at each time slot. It is important to note that the timespan can be adjusted to improve the accuracy of the dynamic movement; this in turn increases the number of iterations required for convergence. Therefore, a tradeoff exists between improving the QoE of the ground users and the running complexity of the proposed algorithm.
VI Conclusions
The QoE-driven 3D deployment and dynamic movement of multiple UAVs were jointly studied, and algorithms were proposed for maximizing the sum MOS of the users. After proving that the 3D deployment problem is nonconvex, a three-step approach was provided to tackle it. More particularly, the GA-K-means algorithm was invoked to obtain the initial cell partition; a Q-learning based deployment algorithm was proposed for attaining the 3D placement of the UAVs when the users are static; and a Q-learning based movement algorithm was proposed for obtaining the 3D dynamic movement of the UAVs. It was demonstrated that the proposed 3D deployment scheme outperforms the K-means and IGK algorithms with low complexity. Additionally, with the aid of the proposed approach, the 3D dynamic movement of the UAVs was obtained.
Appendix A: Proof of Lemma 1
The received SINR of user k_{n} at time t while connected to UAV n, denoted by {\Gamma_{{k_{n}}}}(t), can be expressed as
{\Gamma_{{k_{n}}}}(t)=\frac{{{p_{{k_{n}}}}(t){g_{{k_{n}}}}(t)}}{{{\sigma^{2}}}}\geq\gamma.  (A.1)
Equation (A.1) can be rewritten as
{P_{\max}}\geq\gamma{\sigma^{2}}{K_{0}}d_{{k_{n}}}^{\alpha}(t)({P_{LoS}}{\mu_{% LoS}}+{P_{NLoS}}{\mu_{NLoS}}). 
It can be easily proved that ({P_{LoS}}{\mu_{LoS}}+{P_{NLoS}}{\mu_{NLoS}})\geq{\mu_{LoS}}. When the UAV-user link has a 100% probability of LoS connection, the inequality holds with equality.
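The comparison with \mu_{LoS} follows in one line: since P_{LoS}+P_{NLoS}=1, the bracketed term is a convex combination of the two excess losses,

```latex
P_{LoS}\mu_{LoS}+P_{NLoS}\mu_{NLoS}
  =\mu_{LoS}+P_{NLoS}\left(\mu_{NLoS}-\mu_{LoS}\right),
```

which coincides with \mu_{LoS} exactly when P_{NLoS}=0, i.e., a pure LoS link.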
Thus, we can obtain that
\begin{array}{rl}{P_{\max}}&\geq\gamma{\sigma^{2}}{K_{0}}d_{{k_{n}}}^{\alpha}(t){\mu_{LoS}}\\ &\geq\gamma{\sigma^{2}}{K_{0}}{\mu_{LoS}}\cdot\max\left\{{{h_{1}},{h_{2}},\cdots,{h_{N}}}\right\}.\end{array} 
The proof is completed.
Appendix B: Proof of Proposition 1
The received SINR of user k_{n} at time t while connected to UAV n, denoted by {\Gamma_{{k_{n}}}}(t), is given by Equation (A.1), which can be rewritten as
{p_{{k_{n}}}}(t){K_{0}^{-1}}d_{{k_{n}}}^{-\alpha}(t){L^{-1}}\geq\gamma{\sigma^{2}}, 
where L={P_{LoS}}({\theta_{{k_{n}}}}){\mu_{LoS}}+{P_{NLoS}}({\theta_{{k_{n}}}}){\mu_{NLoS}}. Then, d_{{k_{n}}}^{\alpha}(t)\leq\frac{{p_{{k_{n}}}}(t)}{\gamma{K_{0}}{\sigma^{2}}L}\leq\frac{{P_{\max}}}{\gamma{K_{0}}{\sigma^{2}}L}. In this case, we obtain that
{P_{LoS}}({\theta_{{k_{n}}}})={b_{1}}{\left(\frac{180}{\pi}{\theta_{{k_{n}}}}-\zeta\right)^{{b_{2}}}}\geq\frac{S}{{\mu_{LoS}}-{\mu_{NLoS}}}-\frac{{\mu_{NLoS}}}{{\mu_{LoS}}-{\mu_{NLoS}}}, 
where S=\frac{{P_{\max}}}{\gamma{K_{0}}{\sigma^{2}}d_{{k_{n}}}^{\alpha}(t)}. Then, we have {\theta_{{k_{n}}}}(t)={\sin^{-1}}\left(\frac{{h_{n}}(t)}{{d_{{k_{n}}}}(t)}\right)\geq\frac{\pi}{180}\left[\zeta+{e^{M}}\right]. Finally, we obtain
{h_{n}}(t)\geq{d_{{k_{n}}}}(t)\sin\left[{\frac{\pi}{180}(\zeta+{e^{M}})}\right]. 
Following from (B.4), we have {d_{{k_{n}}}}(t)\leq{\left[{\frac{{p_{{k_{n}}}}(t)}{\gamma{K_{0}}{\sigma^{2}}L}}\right]^{1/\alpha}}. It can be proved that {P_{LoS}}({\theta_{{k_{n}}}}){\mu_{LoS}}+{\mu_{NLoS}}(1-{P_{LoS}}({\theta_{{k_{n}}}}))\geq{\mu_{LoS}} because {\mu_{LoS}}<{\mu_{NLoS}}. So, we have {d_{{k_{n}}}}(t)\leq{\left(\frac{{P_{\max}}}{\gamma{K_{0}}{\sigma^{2}}{\mu_{LoS}}}\right)^{1/\alpha}}. Since {h_{n}}(t)\leq{d_{{k_{n}}}}(t), we have
{h_{n}}(t)\leq{\left(\frac{{P_{\max}}}{\gamma{K_{0}}{\sigma^{2}}{\mu_{LoS}}}\right)^{1/\alpha}}. 
The altitude constraint of UAV n is given by
{d_{{k_{n}}}}(t)\sin\left[{\frac{\pi}{180}(\zeta+{e^{M}})}\right]\leq{h_{n}}(t)\leq{\left(\frac{{P_{\max}}}{\gamma{K_{0}}{\sigma^{2}}{\mu_{LoS}}}\right)^{1/\alpha}}. 
The proof is completed.
Appendix C: Proof of Theorem 1
The sum MOS can be expressed as
\displaystyle\begin{array}{rl}{\rm{MOS}}_{{\rm{total}}}&=\sum\limits_{n=1}^{N}{\sum\limits_{{k_{n}}=1}^{{{\rm{K}}_{n}}}{{\rm{MOS}}_{k,n}({r_{{k_{n}}}})}}\\ &=-{C_{1}}\sum\limits_{n=1}^{N}{\sum\limits_{{k_{n}}=1}^{{{\rm{K}}_{n}}}{\ln(d({r_{{k_{n}}}}))}}+{C_{2}}.\end{array}  (C.1) 
Equation (C.1) can be expanded as (C.2) at the top of the next page.
\displaystyle{\text{MOS}}_{{\text{total}}}={C_{2}}-{C_{1}}\sum\limits_{n=1}^{N}{\sum\limits_{{k_{n}}=1}^{{{\text{K}}_{n}}}{\ln\left({3RTT+\frac{FS}{{r_{{k_{n}}}}}+L\left(\frac{MSS}{{r_{{k_{n}}}}}+RTT\right)-\frac{2MSS({2^{L}}-1)}{{r_{{k_{n}}}}}}\right)}}.  (C.2)
The instantaneous achievable rate {{r_{{k_{n}}}}} of user k_{n} in (B.1) can be expressed as
\begin{array}{rl}{r_{{k_{n}}}}(t)&={B_{{k_{n}}}}{\log_{2}}\left(1+\frac{{p_{{k_{n}}}}(t){g_{{k_{n}}}}(t)}{{\sigma^{2}}}\right)\\ &=\frac{B}{{K_{n}}}{\log_{2}}\left(1+\frac{{P_{\max}}}{{\sigma^{2}}{K_{n}}{\left(\frac{4\pi{f_{c}}}{c}\right)^{2}}d_{{k_{n}}}^{\alpha}(t)\left[{P_{{\rm{LoS}}}}{\mu_{{\rm{LoS}}}}+{P_{{\rm{NLoS}}}}{\mu_{{\rm{NLoS}}}}\right]}\right).\end{array}  (C.3)
It is easy to verify that f(A)=\log(1+\frac{\gamma}{A}) is convex with respect to A for A>0 and \gamma>0. Then, the nonconvexity of (C.3) is equivalent to the nonconvexity of d_{{k_{n}}}^{\alpha}(t)({P_{{\text{LoS}}}}{\mu_{{\text{LoS}}}}+{P_{{\text{NLoS}}}}{\mu_{{\text{NLoS}}}}).
It can be proved that d_{{k_{n}}}^{\alpha}(t)({P_{{\text{LoS}}}}{\mu_{{\text{LoS}}}}+{P_{{\text{NLoS}}}}{\mu_{{\text{NLoS}}}}) is convex with respect to the altitude of the UAVs. However, since there are three variables (the X-coordinate, Y-coordinate and altitude of the UAVs), and {x_{n}}(t), {y_{n}}(t) and {h_{n}}(t) vary at each time slot, d_{{k_{n}}}^{\alpha}(t)({P_{{\text{LoS}}}}{\mu_{{\text{LoS}}}}+{P_{{\text{NLoS}}}}{\mu_{{\text{NLoS}}}}) is nonconvex in these variables jointly. In this case, (C.3) and hence (C.1) are also nonconvex in {x_{n}}(t), {y_{n}}(t) and {h_{n}}(t).
The proof is completed.
Appendix D: Proof of Theorem 2
In order to prove that problem (13a) is NP-hard, three steps are taken following computational complexity theory. Firstly, a known NP-complete decision problem Q is chosen. Secondly, a polynomial-time transformation from any instance of Q to an instance of problem (13a) is constructed. Thirdly, the two instances are proven to have the same objective value.
In this article, we prove that (13a) is NP-hard even when the QoE requirements of all ground users are identical and the probability of LoS connection from the UAVs to the ground users is one. Under these assumptions, problem (13a) can be rewritten as
\begin{array}{cl}\mathop{\max}\limits_{C{\text{,}}Q,H}&{R_{sum}}\\ {\text{s.t.}}&13(b),13(c),13(d),13(e),13(f).\end{array} 
As the probability of LoS connection is one, the overall achievable sum rate is
\begin{array}{rl}{R_{{\rm{sum}}}}&=\sum\limits_{n=1}^{N}{\sum\limits_{{k_{n}}=1}^{{{\rm{K}}_{n}}}{{r_{{k_{n}}}}(t)}}\\ &=\sum\limits_{n=1}^{N}{\sum\limits_{{k_{n}}=1}^{{{\rm{K}}_{n}}}{\frac{B}{{K_{n}}}{{\log}_{2}}\left({1+\frac{{P_{\max}}}{{\sigma^{2}}{K_{n}}{{\left(\frac{4\pi{f_{c}}}{c}\right)}^{2}}d_{{k_{n}}}^{\alpha}(t){\mu_{{\rm{LoS}}}}}}\right)}}.\end{array} 
Following from (D.2), the overall achievable sum rate depends only on the sum of the Euclidean distances. Furthermore, when the UAVs' altitudes are assumed to be fixed, problem (D.1) simplifies to
\begin{array}{cl}\mathop{\min}\limits_{C{\text{,}}Q}&\sum\limits_{n=1}^{N}{\sum\limits_{{k_{n}}=1}^{{{\rm K}_{n}}}{{d_{{k_{n}}}}}}\\ {\text{s.t.}}&13(b),13(d).\end{array} 
Problem (D.3) is known as the planar K-means problem. The authors in [40] demonstrated the NP-hardness of the planar K-means problem by a reduction from the planar 3-SAT problem, which is known to be NP-complete. The proof is not repeated in this paper and can be found in [40]. Since problem (D.3) is a special case of problem (13a), the original problem (13a) is NP-hard.
The proof is completed.
References
 [1] X. Liu, Y. Liu, and Y. Chen, “Deployment and movement for multiple aerial base stations by reinforcement learning,” in 2018 IEEE Globecom Workshops (GC Wkshps). IEEE, 2018, pp. 1–6.
 [2] J. Kosmerl and A. Vilhar, “Base stations placement optimization in wireless networks for emergency communications,” in IEEE Proc. of International Commun. Conf. (ICC), 2014, pp. 200–205.
 [3] Y. Zeng, R. Zhang, and T. J. Lim, “Wireless communications with unmanned aerial vehicles: Opportunities and challenges,” IEEE Commun. Mag., vol. 54, no. 5, pp. 36–42, May 2016.
 [4] A. Osseiran, F. Boccardi, V. Braun, Kusume et al., “Scenarios for 5G mobile and wireless communications: the vision of the METIS project,” IEEE Commun. Mag., vol. 52, no. 5, pp. 26–35, May 2014.
 [5] Z. Qin, J. Fan, Y. Liu, Y. Gao, and G. Y. Li, “Sparse representation for wireless communications,” arXiv preprint arXiv:1801.08206, 2018.
 [6] Z. Qin, Y. Gao, M. D. Plumbley, and C. G. Parini, “Wideband spectrum sensing on realtime signals at subNyquist sampling rates in single and cooperative multiple nodes,” IEEE Trans. Signal Process., vol. 64, no. 12, pp. 3106–3117, Dec 2016.
 [7] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, “Mobile Unmanned Aerial Vehicles (UAVs) for energyefficient Internet of Things Communications,” IEEE Trans. Wireless Commun., vol. 16, no. 11, pp. 7574–7589, Nov. 2017.
 [8] Q. Wang, Z. Chen, and S. Li, “Joint power and trajectory design for physicallayer secrecy in the UAVaided mobile relaying system,” arXiv preprint arXiv:1803.07874, 2018.
 [9] W. Khawaja, O. Ozdemir, and I. Guvenc, “UAV airtoground channel characterization for mmwave systems,” arXiv preprint arXiv:1707.04621, 2017.
 [10] X. Zhang and L. Duan, “Optimal deployment of UAV networks for delivering emergency wireless coverage,” arXiv preprint arXiv:1710.05616, 2017.
 [11] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, “Drone small cells in the clouds: Design, deployment and performance analysis,” in IEEE Proc. of Global Commun. Conf. (GLOBECOM), 2015, pp. 1–6.
 [12] B. Van der Bergh, A. Chiumento, and S. Pollin, “LTE in the sky: trading off propagation benefits with interference costs for aerial nodes,” IEEE Commun. Mag., vol. 54, no. 5, pp. 44–50, May 2016.
 [13] Y. Zeng and R. Zhang, “Energy-efficient UAV communication with trajectory optimization,” IEEE Trans. Wireless Commun., vol. 16, no. 6, pp. 3747–3760, Mar. 2017.
 [14] S. Kandeepan, K. Gomez, L. Reynaud, and T. Rasheed, “Aerial-terrestrial communications: terrestrial cooperation and energy-efficient transmissions to aerial base stations,” IEEE Trans. Aerosp. Electron. Syst., vol. 50, no. 4, pp. 2715–2735, Dec. 2014.
 [15] K. Li, W. Ni, X. Wang, R. P. Liu, S. S. Kanhere, and S. Jha, “Energy-efficient cooperative relaying for unmanned aerial vehicles,” IEEE Trans. Mobile Comput., vol. 15, no. 6, pp. 1377–1386, Aug. 2016.
 [16] L. Liu, S. Zhang, and R. Zhang, “CoMP in the sky: UAV placement and movement optimization for multi-user communications,” arXiv preprint arXiv:1802.10371, 2018.
 [17] L. Bing, “Study on modeling of communication channel of UAV,” Procedia Computer Science, vol. 107, pp. 550–557, 2017.
 [18] A. Al-Hourani, S. Kandeepan, and S. Lardner, “Optimal LAP altitude for maximum coverage,” IEEE Wireless Commun. Lett., vol. 3, no. 6, pp. 569–572, Jul. 2014.
 [19] M. Alzenad, A. El-Keyi, F. Lagum, and H. Yanikomeroglu, “3-D placement of an unmanned aerial vehicle base station (UAV-BS) for energy-efficient maximal coverage,” IEEE Wireless Commun. Lett., May 2017.
 [20] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, “Wireless communication using unmanned aerial vehicles (UAVs): Optimal transport theory for hover time optimization,” IEEE Trans. Wireless Commun., vol. 16, no. 12, pp. 8052–8066, Dec. 2017.
 [21] M. Mozaffari, W. Saad, and M. Bennis, “Efficient deployment of multiple unmanned aerial vehicles for optimal wireless coverage,” IEEE Commun. Lett., vol. 20, no. 8, pp. 1647–1650, Jun. 2016.
 [22] J. Lyu, Y. Zeng, and R. Zhang, “Cyclical multiple access in UAV-aided communications: A throughput-delay tradeoff,” IEEE Wireless Commun. Lett., vol. 5, no. 6, pp. 600–603, Dec. 2016.
 [23] Y. Liu, Z. Qin, M. Elkashlan, Z. Ding, A. Nallanathan, and L. Hanzo, “Non-orthogonal multiple access for 5G and beyond,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2347–2381, Nov. 2017.
 [24] Y. Cai, Z. Qin, F. Cui, G. Y. Li, and J. A. McCann, “Modulation and multiple access for 5G networks,” IEEE Commun. Surv. Tut., Oct. 2017.
 [25] Q. Wu and R. Zhang, “Common throughput maximization in UAV-enabled OFDMA systems with delay consideration,” arXiv preprint arXiv:1801.00444, 2018.
 [26] A. Galindo-Serrano and L. Giupponi, “Distributed Q-learning for aggregated interference control in cognitive radio networks,” IEEE Trans. Veh. Technol., vol. 59, no. 4, pp. 1823–1834, May 2010.
 [27] U. Challita, W. Saad, and C. Bettstetter, “Cellular-connected UAVs over 5G: Deep reinforcement learning for interference management,” arXiv preprint arXiv:1801.05500, 2018.
 [28] X. Lu, L. Xiao, and C. Dai, “UAV-aided 5G communications with deep reinforcement learning against jamming,” arXiv preprint arXiv:1805.06628, 2018.
 [29] S. Zhang, Y. Zeng, and R. Zhang, “Cellular-enabled UAV communication: Trajectory optimization under connectivity constraint,” arXiv preprint arXiv:1710.11619, 2017.
 [30] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, “Unmanned aerial vehicle with underlaid device-to-device communications: Performance and tradeoffs,” IEEE Trans. Wireless Commun., vol. 15, no. 6, pp. 3949–3963, Jun. 2016.
 [31] Y. Gao, X. Wei, L. Zhou, and M. Guizani, “QoE in multimedia domain: a user-centric quality assessment,” International Journal of Multimedia Intelligence and Security, vol. 3, no. 2, pp. 162–186, 2018.
 [32] K. Mitra, A. Zaslavsky, and C. Ahlund, “Context-aware QoE modelling, measurement, and prediction in mobile computing systems,” IEEE Trans. Mobile Comput., vol. 14, no. 5, pp. 920–936, May 2015.
 [33] J. Cui, Y. Liu, Z. Ding, P. Fan, and A. Nallanathan, “QoE-based resource allocation for multi-cell NOMA networks,” IEEE Trans. Wireless Commun., vol. 17, no. 9, pp. 6160–6176, Sep. 2018.
 [34] M. Rugelj, U. Sedlar, M. Volk, J. Sterle, M. Hajdinjak, and A. Kos, “Novel cross-layer QoE-aware radio resource allocation algorithms in multi-user OFDMA systems,” IEEE Commun. Mag., vol. 62, no. 9, pp. 3196–3208, Sep. 2014.
 [35] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, “An efficient k-means clustering algorithm: Analysis and implementation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 881–892, Aug. 2002.
 [36] J. Cui, Z. Ding, and P. Fan, “The application of machine learning in mmWave-NOMA systems,” in Proc. IEEE 87th Vehicular Technology Conference (VTC Spring), 2018, pp. 1–6.
 [37] A. Asheralieva and Y. Miyanaga, “An autonomous learning-based algorithm for joint channel and power level selection by D2D pairs in heterogeneous cellular networks,” IEEE Trans. Commun., vol. 64, no. 9, pp. 3996–4012, Sep. 2016.
 [38] J. Ren, G. Zhang, and D. Li, “Multicast capacity for VANETs with directional antenna and delay constraint under random walk mobility model,” IEEE Access, vol. 5, pp. 3958–3970, 2017.
 [39] M. Chen, M. Mozaffari, W. Saad, C. Yin, M. Debbah, and C. S. Hong, “Caching in the sky: Proactive deployment of cache-enabled unmanned aerial vehicles for optimized quality-of-experience,” IEEE J. Sel. Areas Commun., vol. 35, no. 5, pp. 1046–1061, May 2017.
 [40] M. Mahajan, P. Nimbhorkar, and K. Varadarajan, “The planar k-means problem is NP-hard,” Theoretical Computer Science, vol. 442, pp. 13–21, 2012.