Self-Organizing mmWave Networks: A Power Allocation Scheme Based on Machine Learning


Millimeter-wave (mmWave) communication is anticipated to provide significant throughput gains in urban scenarios. To this end, network densification is necessary to meet the high traffic volume generated by smartphones, tablets, and sensory devices while overcoming the large path loss and high blockage at mmWave frequencies. These denser networks are created by users deploying small mmWave base stations (BSs) in a plug-and-play fashion. Although this deployment method provides the required density, the amorphous deployment of BSs needs distributed management. To address this difficulty, we propose a self-organizing method to allocate power to mmWave BSs in an ultra-dense network. The proposed method consists of two parts: clustering using fast local clustering and power allocation via Q-learning. The key features of the proposed method are its scalability and self-organizing capabilities, both of which are important for 5G. Our simulations demonstrate that the introduced method provides the required quality of service (QoS) for all users, independent of the size of the network.

I Introduction

Millimeter-wave (mmWave) communication is one of the main technologies of the next generation of cellular networks (5G). The large bandwidth at mmWave frequencies has the potential to enhance network throughput tenfold [1]. However, large path loss and shadowing limit the performance of mmWave systems and need to be dealt with. One approach to overcome this problem is to increase the density of access points [2, 3]. However, as the number of access points increases, so does the complexity of network management. Moreover, one of the features of future mmWave base stations (BSs) is self-deployment by users. In other words, access points can be deployed in a plug-and-play fashion, and the network architecture may change frequently. Considering the above points, 5G needs self-organizing methods to configure, adapt, or heal itself when necessary. In this paper, a self-organizing algorithm is proposed to maximize the sum capacity in a dense mmWave network while providing users with their required quality of service (QoS). The algorithm consists of clustering, based on fast local clustering (FLOC), and distributed power allocation via Q-learning. The scalability and fast convergence of FLOC, together with the adaptability and distributed nature of Q-learning, make their combination a suitable tool to achieve self-organization in a dense network.

II System Model

The system model considers a dense outdoor urban scenario as an important example of 5G, i.e., we consider the downlink of densely deployed mmWave BSs. To this end, let us consider $N$ mmWave BSs that are distributed according to a homogeneous spatial Poisson point process (SPPP) with density $\lambda$ [4]. Each BS is associated with one user. BSs share a single frequency resource block (FRB) to support their associated users. We assume a time-invariant channel model, i.e., slow fading. The channel between the $i$-th BS and user $j$ can be written as follows

$$h_{i,j} = \sqrt{\frac{1}{L_{i,j}}}\; g_{i,j}, \qquad (1)$$

where $L_{i,j}$ and $g_{i,j}$ denote the path loss and the path gain between the $i$-th BS and user $j$. The path loss between the $i$-th BS and its associated user, $L_{i,i}$, follows free-space propagation based on Friis' law [1]. Here, we consider that the majority of interferers have non-line-of-sight (NLOS) paths [5]. Hence, the path loss ($L_{i,j}$), in dB, can be written as [1]

$$L_{i,j} = \alpha + 10\,\beta \log_{10}(d_{i,j}) + \xi, \qquad (2)$$

where $\alpha$ and $\beta$ are factors used to achieve the best fit to channel measurements, $d_{i,j}$ is the distance between the $i$-th BS and user $j$, and $\xi \sim \mathcal{N}(0, \sigma^2)$ denotes the logarithmic shadowing factor, with $\sigma^2$ the lognormal shadowing variance.
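As a rough illustration of the log-distance path loss model with lognormal shadowing described above, the following sketch evaluates the loss in dB for a given distance. The constants are the values listed in Table I, read here as the fit intercept, the fit slope, and the shadowing standard deviation; that reading, and the function names, are assumptions made for this example.

```python
import math
import random

# Values from Table I, interpreted (assumption) as the 28 GHz NLOS fit parameters.
ALPHA_DB = 72.0   # intercept of the fit, in dB
BETA = 2.92       # slope of the fit (path loss exponent)
SIGMA_DB = 8.7    # lognormal shadowing standard deviation, in dB


def path_loss_db(distance_m, rng=None):
    """Path loss in dB at distance_m metres; shadowing is added only if rng is given."""
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    shadowing = rng.gauss(0.0, SIGMA_DB) if rng is not None else 0.0
    return ALPHA_DB + 10.0 * BETA * math.log10(distance_m) + shadowing


def db_to_linear(value_db):
    """Convert a dB quantity to a linear power ratio."""
    return 10.0 ** (value_db / 10.0)
```

Passing a seeded `random.Random` instance as `rng` gives reproducible shadowing draws; omitting it returns the median path loss.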

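The received signal in the downlink at the $i$-th user includes the desired signal from its associated BS (BS $i$), interference from neighboring BSs, and thermal noise. Hence, the signal-to-interference-plus-noise ratio (SINR) at the $i$-th user is given by

$$\gamma_i = \frac{p_i\,|h_{i,i}|^2}{\sum_{j \in \mathcal{D}_i} p_j\,|h_{j,i}|^2 + \sigma_n^2}, \qquad (3)$$

where $p_i$ denotes the power transmitted by the $i$-th BS, $h_{j,i}$ is the channel between the $j$-th BS and user $i$, $\mathcal{D}_i$ is the set of BSs interfering with user $i$, and $\sigma_n^2$ denotes the variance of the additive white Gaussian noise. Accordingly, the normalized capacity at the $i$-th user is given by

$$C_i = \log_2(1 + \gamma_i). \qquad (4)$$
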
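The SINR and normalized capacity described in this section can be sketched as follows. This is a minimal example, assuming that every other BS in the list interferes with the user and that `gains[j][i]` holds the linear power gain from BS `j` to user `i`; both conventions are choices made for this illustration.

```python
import math


def sinr(powers, gains, user, noise):
    """SINR at `user`: desired power over interference plus noise (linear units).

    gains[j][i] is the linear power gain from BS j to user i (assumed layout).
    """
    desired = powers[user] * gains[user][user]
    interference = sum(
        powers[j] * gains[j][user] for j in range(len(powers)) if j != user
    )
    return desired / (interference + noise)


def normalized_capacity(sinr_value):
    """Normalized capacity in bit/s/Hz for a given linear SINR."""
    return math.log2(1.0 + sinr_value)
```

For two BSs with `powers = [1.0, 2.0]`, `gains = [[1.0, 0.1], [0.25, 1.0]]` and `noise = 0.5`, user 0 sees a desired power of 1.0 against an interference-plus-noise level of 1.0.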
III Problem Formulation

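The goal of the optimization problem is to find the best power distribution among the mmWave BSs ($\mathbf{p} = [p_1, \dots, p_N]$, where $p_i$ is the transmit power of the $i$-th BS) in order to maximize the sum capacity of the network while supporting all users with their required QoS. With $\gamma_i$ denoting the SINR at the $i$-th user, the optimization problem ($\mathcal{P}$) can be formulated as

$$\max_{\mathbf{p}} \; \sum_{i=1}^{N} \log_2(1 + \gamma_i) \qquad (5a)$$
$$\text{subject to} \quad p_i \le p_{\max}, \quad \forall i, \qquad (5b)$$
$$\gamma_i \ge q_i, \quad \forall i. \qquad (5c)$$

Here, the objective (5a) is to maximize the sum capacity of the network while providing all users with their required QoS in (5c). The first constraint, (5b), refers to the power limitation of every BS. The term $q_i$ in (5c) refers to the minimum required SINR for the $i$-th user.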

Eq. (5a) contains the interference term in the denominator of the SINR term. In a dense network, the interference term cannot be ignored [6]. Due to the presence of the interference term, the objective function (5a) is non-concave [7].
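The non-concavity of the sum capacity can be checked numerically on a toy case. The sketch below assumes two links with unit direct gain and a hypothetical cross gain and noise level; none of these numbers come from the paper. A concave function must satisfy f(midpoint) >= average of f at the endpoints, and the example shows that this fails.

```python
import math


def sum_capacity(p1, p2, cross_gain=1.0, noise=0.1):
    """Sum capacity of two mutually interfering links with unit direct gain."""
    c1 = math.log2(1.0 + p1 / (cross_gain * p2 + noise))
    c2 = math.log2(1.0 + p2 / (cross_gain * p1 + noise))
    return c1 + c2


# Compare the value at the midpoint of two power vectors with the
# average of the values at the two endpoints.
endpoint_avg = 0.5 * (sum_capacity(1.0, 0.0) + sum_capacity(0.0, 1.0))
midpoint = sum_capacity(0.5, 0.5)
print(f"average of endpoints: {endpoint_avg:.3f}, midpoint: {midpoint:.3f}")
```

With these illustrative values the midpoint falls below the endpoint average, so the objective cannot be concave in the power vector.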

The solution to this problem should have certain features. First, it should be distributed, because there is no central authority in this network. Second, the range of mmWave BSs is limited, so each user receives interference only from the BSs in its neighborhood. Therefore, the solution should use local clustering to reduce the computational overhead. The third feature is self-healing: the number of BSs in the network changes sporadically, so the solution should adapt to new architectures. Considering the above, we propose a method with two parts: a fast local clustering method to locally cluster the BSs, and, within each cluster, Q-learning-based selection of each BS's transmit power [8]. Q-learning is model-free (adaptable) and gives the BSs the ability to learn from their environment by interacting with it (self-organization).

IV Cluster-Based Distributed Power Allocation Using Q-Learning (CDP-Q)

In our proposed method, mmWave BSs are the agents of Q-learning, so the terms agent and mmWave BS are used interchangeably. CDP-Q is a distributed method in which multiple agents (mmWave BSs) find a sub-optimal policy (power allocation) to maximize the network capacity. CDP-Q consists of two parts: (1) clustering, and (2) power allocation. Clustering is based on a local clustering method, and power allocation is based on Q-learning. Each part is detailed in the following.

IV-A mmWave BS Clustering

Since mmWave signals suffer from high pathloss and shadowing, only neighboring BSs that are close in distance interfere with each other. Consequently, we propose to use a clustering mechanism to divide BSs into clusters in which the interference of one cluster is negligible on other clusters’ users.

In this paper, we propose to use fast local clustering (FLOC) [9] to divide mmWave BSs into clusters. FLOC is a distributed message-passing clustering method with $O(1)$ complexity, which guarantees scalability, and it produces non-overlapping clusters. Another feature of FLOC is local self-healing: re-clustering caused by the addition or removal of a node does not propagate through all clusters. To apply FLOC in a mmWave network, the following concepts are defined:

  • Cluster head (CH): The mmWave BS that is chosen as the head of the cluster. In our algorithm, there is no priority between a cluster head and other members of the cluster.

  • In-bound (IB) and out-band (OB) nodes: In FLOC, a node is in-bound if it is within a unit distance of a CH. A unit distance is a set value, which in this case is the range of mmWave links, i.e., - [1]. Accordingly, we define in-bound as , which is an indication of strong interference, and out-band as , which indicates the edge of the cluster around a CH. Finally, if a node is within the out-band distance of a cluster and not within the in-bound distance of any other cluster, it joins that cluster as an OB node.
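The membership rule in the definitions above can be sketched as a simple distance test. This is a centralized simplification for illustration only: FLOC itself is a message-passing protocol with timers, which is not modeled here. The function name, the `unit` argument and the `ob_factor` multiplier for the out-band radius are hypothetical.

```python
import math


def assign_to_cluster(node, cluster_heads, unit, ob_factor=2.0):
    """Return (head_index, role) for `node`, or (None, "CH") if it starts a new cluster.

    node and cluster_heads hold (x, y) positions in metres. `unit` is the in-bound
    radius; `ob_factor * unit` is the out-band radius (ob_factor is hypothetical).
    """
    distances = [math.dist(node, head) for head in cluster_heads]
    # In-bound membership takes priority: it signals strong interference.
    for idx, d in enumerate(distances):
        if d <= unit:
            return idx, "IB"
    # Otherwise join the nearest cluster whose out-band region covers the node.
    in_reach = [(d, idx) for idx, d in enumerate(distances) if d <= ob_factor * unit]
    if in_reach:
        return min(in_reach)[1], "OB"
    # No cluster head is close enough, so the node becomes a cluster head.
    return None, "CH"
```

A node that is out-band for one cluster and in-bound for another is assigned to the latter, because the in-bound test runs first.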

IV-B Distributed Power Allocation Using Q-Learning

The output of Q-learning is a decision policy (power allocation), which is represented as a function called the Q-function. Here, the Q-function of agent $i$ is represented as a table called a Q-table ($Q_i$). The columns of the Q-table are the actions ($a$), and the rows are the states ($s$) of agent $i$.

In multi-agent Q-learning, agents can act independently or cooperatively. In independent learning, each agent interacts with the environment without communicating with other agents; in effect, it treats the other agents as part of the environment. Independent learning has shown good performance in many applications [10]. Because the environment is then not stationary, agents may experience oscillation and longer convergence times. However, independent learning avoids the communication overhead of cooperative learning, so we choose it here. Accordingly, the agents select their actions according to [11]


in which the subscript $t$ denotes the time step of Q-learning. The CDP-Q algorithm is presented in Algorithm 1.

1:  Cluster formation based on Sec. IV-A
2:  for all Clusters in Parallel do
3:     for all Agents do
4:        Initialize arbitrarily
5:        Initialize
6:        for all episodes do
7:           send to other agents of the cluster
8:           receive
9:           Choose according to Eq. 6
10:           Take action , observe
12:        end for
13:     end for
14:  end for
Algorithm 1 The proposed CDP-Q algorithm
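The inner loop of Algorithm 1 can be sketched for a single agent as tabular Q-learning. This is a minimal sketch under stated assumptions: the epsilon-greedy action choice stands in for Eq. 6, the update is the standard one-step Q-learning rule of [8], the exchange of information between agents in the cluster is not modeled, and `reward_fn`, `lr`, `discount` and `epsilon` are hypothetical names and values.

```python
import random


def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy choice over one Q-table row (a list of action values)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


def q_update(q_table, state, action, reward, next_state, lr, discount):
    """One-step Q-learning update, applied in place."""
    best_next = max(q_table[next_state])
    target = reward + discount * best_next
    q_table[state][action] += lr * (target - q_table[state][action])


def train_agent(reward_fn, n_states, n_actions, state, episodes,
                lr=0.5, discount=0.9, epsilon=0.1, seed=0):
    """Train one agent whose state stays fixed (a stationary BS)."""
    rng = random.Random(seed)
    q_table = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        action = choose_action(q_table[state], epsilon, rng)
        reward = reward_fn(action)  # placeholder for the environment's feedback
        q_update(q_table, state, action, reward, state, lr, discount)
    return q_table
```

In a multi-agent setting the reward seen by one agent depends on the powers chosen by the others, so `reward_fn` would be replaced by a measurement taken after all agents in the cluster have acted.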

In the following, the actions, states, and reward function of the proposed Q-learning method are defined.


The set of actions (transmit powers), $\mathcal{A}$, uniformly covers the range between the minimum ($p_{\min}$) and maximum ($p_{\max}$) power.
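A uniformly spaced action set of this kind can be generated as in the sketch below. Spacing the levels uniformly in dBm rather than in watts, and the number of levels, are assumptions made for this example.

```python
def power_levels_dbm(p_min_dbm, p_max_dbm, n_levels):
    """Return n_levels transmit powers spaced uniformly in dBm, endpoints included."""
    if n_levels < 2:
        raise ValueError("need at least two power levels")
    step = (p_max_dbm - p_min_dbm) / (n_levels - 1)
    return [p_min_dbm + k * step for k in range(n_levels)]


def dbm_to_watt(p_dbm):
    """Convert a power in dBm to watts."""
    return 10.0 ** ((p_dbm - 30.0) / 10.0)
```

For instance, `power_levels_dbm(-10, 35, 10)` yields ten levels in 5 dB steps, and each entry then indexes one column of the Q-table.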


We define equally spaced concentric circles around the cluster head (CH) of each cluster. These circles define rings with units of spacing around the CH. The state of agent $i$ at time step $t$, $s_t^i$, is the number of the ring that the agent belongs to. Given the definition of the Q-table and the states at the beginning of this section, if the agents' locations are fixed, each agent uses just one row of its Q-table to search for the best action.
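The mapping from an agent's position to its ring-number state can be sketched as follows. The zero-based indexing, the clipping of far-away agents to the outermost ring, and the parameter names are assumptions made for this example.

```python
import math


def ring_state(agent_pos, head_pos, ring_width, n_rings):
    """Zero-based index of the ring around the cluster head that contains the agent."""
    if ring_width <= 0 or n_rings < 1:
        raise ValueError("ring_width must be positive and n_rings at least 1")
    distance = math.dist(agent_pos, head_pos)
    # Agents beyond the outermost circle are mapped to the last ring.
    return min(int(distance // ring_width), n_rings - 1)
```

Because the BSs do not move, this function returns the same value at every time step, which is why each agent only ever touches one row of its Q-table.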


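$r_t^i$ is the immediate reward incurred by agent $i$ for selecting action $a_t^i$ in state $s_t^i$ at time step $t$. With $q_i$ denoting the minimum required SINR of the $i$-th user, the constraint in (5c) can be represented as $C_t^i \ge \log_2(1 + q_i)$ for $i = 1, \dots, N$, where $C_t^i$ is the normalized capacity of agent $i$ at time step $t$. Based on this, the proposed normalized reward function for agent $i$ at time step $t$ is defined as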


The rationale behind the proposed reward function is as follows:

  • The term (a) normalizes the value of reward function.

  • The objective of the optimization problem is to maximize the capacity of the network, so the term (b) results in a higher reward for higher capacity for an agent.

  • To satisfy the QoS constraint for agent , capacity deviation of its associated user from the required QoS, term (c), should result in a lower reward.

  • There is a maximum reward () for an agent to provide fairness between the agents which is shown in Fig. 1.

  • The proposed reward function is a first order function of , which reduces each iteration’s complexity.

Fig. 1: Proposed reward function (RF).

V Simulation Results

In this section, the simulation setup is detailed, and then the simulation results are presented.

V-A Simulation Setup

A dense mmWave BS network with approximately BSs in a area is considered. The BSs are distributed according to an SPPP and operate independently in the network. Each BS supports one user equipment (UE), which is located within a radius of m around the BS. The QoS for a user is defined as the SINR required to support the user's service. The value of is considered for all the users.
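A BS layout drawn from a homogeneous SPPP can be generated as in the sketch below: the number of points is Poisson distributed with mean equal to density times area, and the points are then placed uniformly. The function names and any density or area passed to them are illustrative, not the values used in the paper.

```python
import random


def poisson_sample(mean, rng):
    """Draw a Poisson(mean) count by counting unit-rate arrivals in [0, mean]."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def sample_sppp(density, width, height, seed=None):
    """Sample a homogeneous SPPP on a width x height rectangle.

    density is the expected number of points per unit area.
    """
    rng = random.Random(seed)
    n_points = poisson_sample(density * width * height, rng)
    return [(rng.uniform(0.0, width), rng.uniform(0.0, height))
            for _ in range(n_points)]
```

Calling `sample_sppp` twice with the same seed reproduces the same layout, which is convenient when comparing power allocation schemes on identical deployments.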

To perform Q-learning, the learning rate is considered as , the discount factor as , , m, and . The maximum number of iterations is set to . The remaining parameters of the simulation are represented in Table I.

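Parameter                                        Value
Carrier frequency $f$                            28 GHz
Minimum transmit power $p_{\min}$                -10 dBm
Shadowing standard deviation $\sigma$            8.7 dB
Maximum transmit power $p_{\max}$                35 dBm
Path loss fit intercept $\alpha$                 72.0
Path loss fit slope $\beta$                      2.92
Noise power $\sigma_n^2$                         -120 dBm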
TABLE I: Simulation Parameters

V-B Clustering Results

The clustering algorithm is implemented as an event-driven, message-passing distributed program in C++. Every BS is simulated as an independent thread and is added to the network randomly in . The clustering algorithm converges in less than for the assumed value for the . The resulting clusters for two different distributions of BSs are shown in different colors in Fig. 2 and Fig. 3. Each cluster head (CH) is marked with a filled color.

Fig. 2: BSs in .
Fig. 3: BSs in .

V-C Power Allocation Results

According to [2, 1], the coverage range of millimeter communication is in the range of m, which means the maximum coverage of for each mmWave BS. Considering the interference-limited assumption and the value of , a cluster might have mmWave BSs. Hence, the CDP-Q algorithm results in clusters that include 2 to 14 BSs.

The results of power allocation using the proposed reward function are compared with those of the exponential reward function proposed in [12], denoted EXP-Q in the simulations. For all possible cluster sizes, power allocation using the proposed reward function is simulated, and the normalized capacities of all BSs in the clusters are plotted in Fig. 4. The same simulations for EXP-Q are presented in Fig. 5.

Fig. 4: Capacity of clusters’ members.
Fig. 5: Capacity of clusters’ members.

As shown in these figures, both reward functions provide all members of clusters of every size with their required QoS. However, the normalized capacities of users under CDP-Q are close to each other, whereas under EXP-Q they are much more diverse. This diversity affects the fairness index. The fairness in each cluster is measured using Jain's fairness index [13] and is shown in Fig. 6. According to Fig. 6, CDP-Q maintains fairness for all cluster sizes, while EXP-Q fails to serve users fairly in large clusters. The total capacity of the clusters is shown in Fig. 7 with respect to the cluster size. According to Fig. 7, CDP-Q provides higher capacity than EXP-Q for all cluster sizes.

Fig. 6: Jain’s fairness index.
Fig. 7: Sum capacity of clusters.
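For reference, Jain's fairness index of a set of per-user capacities $x_1, \dots, x_n$ is $(\sum_k x_k)^2 / (n \sum_k x_k^2)$. The sketch below computes it; the function name and the example inputs are illustrative and are not data from the simulations.

```python
def jain_fairness(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2) for non-negative values."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    square_sum = sum(v * v for v in values)
    if square_sum == 0:
        raise ValueError("at least one value must be non-zero")
    return sum(values) ** 2 / (len(values) * square_sum)
```

The index equals 1 when all users receive the same capacity and drops to 1/n when a single user receives everything, which is why widely spread capacities within a cluster lower the index.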

VI Conclusion

In this paper, a self-organized distributed power allocation algorithm is presented. The proposed algorithm reduces the optimization complexity by using a distributed clustering method, and provides adaptability in power allocation by using Q-learning. The proposed reward function satisfies the required QoS for the users in all sizes of the resulting clusters, and outperforms the exponential-based reward function.


  1. S. Rangan, T. S. Rappaport, and E. Erkip, “Millimeter-wave cellular wireless networks: Potentials and challenges,” Proceedings of the IEEE, vol. 102, no. 3, pp. 366–385, March 2014.
  2. R. Baldemair, T. Irnich, K. Balachandran, E. Dahlman, G. Mildh, Y. Selén, S. Parkvall, M. Meyer, and A. Osseiran, “Ultra-dense networks in millimeter-wave frequencies,” IEEE Commun. Mag., vol. 53, no. 1, pp. 202–208, January 2015.
  3. T. Bai and R. W. Heath, “Coverage in dense millimeter wave cellular networks,” in Asilomar Conference on Signals, Systems and Computers, Nov 2013, pp. 2062–2066.
  4. D. P. Kroese and Z. Botev, “Spatial process generation,” Aug 2013.
  5. M. Rebato, M. Mezzavilla, S. Rangan, F. Boccardi, and M. Zorzi, “Understanding noise and interference regimes in 5G millimeter-wave cellular networks,” in 22nd European Wireless Conference, May 2016, pp. 1–5.
  6. S. Niknam and B. Natarajan, “On the Regimes in Millimeter wave Networks: Noise-limited or Interference-limited?” CoRR, Apr. 2018.
  7. Z.-Q. Luo and W. Yu, “An introduction to convex optimization for communications and signal processing,” IEEE J. Select. Areas Commun., vol. 24, no. 8, pp. 1426–1438, Aug 2006.
  8. C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3, pp. 279–292, 1992.
  9. M. Demirbas, A. Arora, V. Mittal, and V. Kulathumani, “A fault-local self-stabilizing clustering service for wireless ad hoc networks,” IEEE Trans. Parallel Distrib. Syst., vol. 17, no. 9, pp. 912–922, Sept 2006.
  10. L. Panait and S. Luke, “Cooperative multi-agent learning: The state of the art,” Autonomous Agents and Multi-Agent Systems, vol. 11, no. 3, pp. 387–434, Nov 2005.
  11. R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed.   Cambridge, MA, USA: MIT Press, 1998.
  12. H. Saad, A. Mohamed, and T. ElBatt, “Distributed cooperative Q-learning for power allocation in cognitive femtocell networks,” in Proc. IEEE Veh. Technol. Conf., Sept 2012, pp. 1–5.
  13. R. Jain, D. Chiu, and W. Hawe, “A quantitative measure of fairness and discrimination for resource allocation in shared computer systems,” CoRR, 1998.