Fast reinforcement learning for decentralized MAC optimization
Abstract
In this paper, we propose a novel decentralized framework for optimizing the transmission strategy of the Irregular Repetition Slotted ALOHA (IRSA) protocol in sensor networks. We consider a hierarchical communication framework that ensures adaptivity to changing network conditions and does not require centralized control. The proposed solution is inspired by the reinforcement learning literature, and, in particular, Q-learning. To deal with sensor nodes’ limited lifetime and communication range, we allow them to decide how many packet replicas to transmit considering only their own buffer state. We show that this information is sufficient and can help avoid packet collisions and improve the throughput significantly. We solve the problem using the decentralized partially observable Markov Decision Process (Dec-POMDP) framework, where we allow each node to decide independently of the others how many packet replicas to transmit. We enhance the proposed Q-learning based method with the concept of virtual experience, and we show theoretically and experimentally that convergence time is thereby significantly reduced. The experiments show that our method leads to large throughput gains, in particular when network traffic is heavy, and scales well with the size of the network. To understand how the nature of the problem affects the learning dynamics and vice versa, we investigate the waterfall effect, a severe degradation in performance above a particular traffic load that is typical of codes-on-graphs, and show that our algorithm learns to alleviate it.
I Introduction
The landscape of Internet of Things (IoT) technology is rapidly evolving, both in terms of opportunities and needs, and is expanding its reach to a wide spectrum of daily life applications. Communication in IoT networks and wireless sensor networks (WSNs) is in general challenging, as IoT devices and sensors have limited capabilities, such as limited battery capacity and communication range. To coordinate access to the shared wireless resources, a medium access control (MAC) protocol is employed. MAC design aims at optimizing the performance of communication by formulating the strategies IoT or sensor nodes use to access the common channel. Communication protocols, such as Slotted ALOHA [1], offer efficient random access mechanisms, but face problems in large networks and in channels with varying noise conditions and network load. Thus, there is still an urgent need to redesign ALOHA so that it optimally uses the available bandwidth and users can obtain the requested content with fewer transmissions and without imposing coordination between the nodes. Such an optimization of Slotted ALOHA will prolong the life of the sensors, as fewer transmissions will be required for the communication.
MAC protocol design is often studied as a distributed resource allocation problem, where sensors attempt to transmit packets over a shared channel and therefore compete for the restricted bandwidth resources. There exist two diametrically opposed families of MAC protocols, namely: (i) Time-Division Multiple Access (TDMA) based, where the allocation of slots is static and performed a priori, and (ii) contention-based, where nodes randomly select time slots to transmit. TDMA has been successfully applied in VANETs [2] due to its ability to provide deterministic access time without collisions in real-time applications. Conversely, contention-based methods are more appropriate for adaptive scenarios where resources and communication load change over time and energy consumption is limited [3], despite the fact that in these methods packet collisions occur because of the random packet transmission decisions made.
Slotted ALOHA, which belongs to the family of contention-based protocols, is widely used for designing random multiple access mechanisms, but suffers from low throughput due to packet collisions that lead to packet loss. Diversity Slotted ALOHA (DSA) [4] significantly improves upon it by introducing a burst repetition rate that allows network nodes to transmit a predefined number of replicas of the original messages. The introduction of the repetition rate enables Contention Resolution Diversity Slotted ALOHA (CRDSA) [5], which exploits interference cancellation (IC) for the retrieval of collided packets. To further improve the performance of [4, 5], Irregular Repetition Slotted ALOHA (IRSA), introduced in [6], allows for a variable number of replicas for each user. The work in [6] relates the process of successive interference cancellation (SIC) applied to colliding users to the process of iterative belief-propagation (BP) erasure-decoding of codes-on-graphs. The number of replicas in IRSA is decided by sampling from a probability distribution, which is designed so as to decrease packet loss. IRSA shows that diversity in the behavior of individual nodes, in the form of selecting the number of replicas, results in better overall throughput.
Further improvements of IRSA can be found in the work of [7], which extends IRSA by introducing Coded Slotted ALOHA (CSA), where coding is performed between the packets available at the nodes. In [8], a frameless variant of CRDSA is introduced, which limits delays, as sensor nodes are not obliged to wait for the next frame to transmit their messages. Frame-asynchronous Coded ALOHA [9] combines the methods in [7] and [8] and shows an improvement both in the achieved error floor and in the observed delay. Although these are interesting research directions, the computational complexity that the coding procedure introduces compared to the non-coding variants may limit their use in the sensor networks under study. For this reason, we do not explore this direction and leave it as future work. However, we should note that our scheme is generic and can be extended to the coded variants of IRSA.
IRSA performance depends on the optimization scheme used to derive the degree distribution function, i.e., the probability distribution used to decide the number of replicas. This distribution can be optimized using differential evolution, which is used to asymptotically analyze the transmission policy, i.e., the number of replicas. More recently, in [10], the use of Multi-armed Bandits (MABs) was introduced as a remedy for inaccurate asymptotic analysis in non-asymptotic settings and as an alternative to computationally expensive finite-length block analysis. This work has been proposed for an IRSA variant that incorporates users’ prioritization [11]. The main drawback of this formulation is that it leads to a continuous action space, an intractability addressed through discretization, which has been shown to significantly degrade performance [12]. Another disadvantage of MABs is that their framework is not expressive enough, as they are stateless. This renders MABs inappropriate for sensor networks, where operations are constrained by sensor nodes’ characteristics, such as battery level, memory size, etc., valuable information that MABs fail to incorporate in the decision-making process.
In this paper, we investigate the optimization of the transmission strategy of sensor networks following the Markov Decision Process (MDP) [13] framework. In particular, in our scheme sensor nodes are capable of independently learning, in a distributed manner, the optimal number of replicas to transmit in a slotted IRSA protocol. Guided by the nature of the problem under consideration, we design a distributed, model-free, offline learning algorithm that deals with partial observability, which refers to the inability of a sensor node to observe information that requires global access to the network. Hence, under partial observability nodes act only on information local to them, for example the state of their buffer, i.e., the number of packets in it. This approach has successfully been applied in the domain of sensor networks [14] and is motivated by the need of that domain for scalable, efficient, decentralized optimization algorithms. To deal with partial observability we employ decentralized POMDP (Dec-POMDP) algorithms, which are associated with high complexity, as the underlying problem is NEXP-complete [15]. Hence, to overcome this problem, we explore realistic variations of it that exploit the problem’s nature, in particular the independence of agents in terms of learning. Distributed optimization in sensor networks has been extensively studied in [16], and successful applications have mainly been reported in the areas of packet routing [17] and object tracking [18]. Machine learning concepts have been explored in [3], where an actor-critic algorithm that optimally schedules active times in a Timeout-MAC protocol is presented, and in [19], where a multi-state sequential learning algorithm is proposed that learns the number of existing critical messages and reallocates resources in a contention-free MAC protocol. However, none of these works addresses decentralized resource allocation under a random access MAC mechanism.
Our solution leverages techniques from the multi-agent reinforcement learning literature to design transmission strategies for agents that optimally manage the available time slots and maximize packet throughput. To the best of our knowledge, this is the first attempt to formulate a decentralized and adaptive solution for MAC design in the context of Slotted ALOHA. Our main contributions are:

the design of an intelligent sensor network that adapts to communication conditions and optimizes its behavior in terms of packet transmission, using reinforcement learning;

the derivation of an algorithm from the family of Dec-POMDP that employs virtual experience concepts to accelerate the learning process [20];

the investigation of the impact of the waterfall effect on the learning dynamics and the ability of our proposed algorithm to alleviate it.
Section II describes the problem under investigation, introduces the suggested framework and models the problem, highlighting the underlying assumptions. In Section III, we provide the necessary theoretical background by outlining the vanilla IRSA protocol in order to derive the goal of optimization. In Section IV, we formulate our proposed decentralized reinforcement learning based POMDP IRSA protocol, henceforth referred to as Dec-RL IRSA. Finally, Section V presents the experiments performed to configure and evaluate our optimization technique.
II Intelligent sensor network framework
II-A Sensor network description
Let us consider a network of sensor nodes collecting measurements from their environment and transmitting them to a core network for further processing. The main bottleneck in the operation of the network is the transmission of the packets that nodes possess through a common communication medium, as it is also used by neighboring sensor nodes that transmit their packets over it. In keeping with the vanilla Slotted ALOHA framework and its variants, in our work time is divided into frames of fixed duration, each one consisting of time slots. At the beginning of each frame each sensor randomly chooses one of the available slots to transmit its packet. The ALOHA transmission protocol is depicted in Fig. 1. In this paper, a contention-based approach is used and, therefore, collisions occur because sensors may choose to transmit simultaneously in a slot. This results in a degradation of the observed throughput.
II-B Proposed communication framework
The design of an efficient MAC protocol requires sensor nodes to be equipped with the capability of independently deciding upon their transmission strategy. Traditional approaches solve the MAC optimization centrally, assuming that all problem-related information will become available to a central node. This introduces a communication overhead that is needed to exchange the information required to make the optimal transmission decisions. This communication is expensive for large networks and, in general, does not scale well with the size of the network. Further, centralized algorithms fail to exploit the underlying network structure, which can facilitate the optimization of transmission strategies by exhibiting characteristics such as locality of interaction. Here, we aim at designing a protocol that can be easily applied in large networks and whose functionality is optimized in a distributed way, considering sensor nodes as the basic building block. In Fig. 2, we illustrate the overall structure of our communication model. This structure results from the following desired characteristics:
Hierarchical structure
It has often been argued that intelligent behavior of complex systems should be pursued through the adoption of hierarchical structures that support the emergence of collective intelligence [21]. Early in the pursuit of artificial intelligence [21], collective intelligence was recognized as a means of achieving intelligent behavior in complex systems based on interaction in populations of agents instead of sophisticated units. Inspired by [22], the network is organized into clusters, based on features such as proximity, common characteristics, e.g., priority, or common behavior, e.g., packet content. Each cluster in Fig. 2, illustrated with a dashed ellipse, is formed by the sensors in it, one of which is the clusterhead. The latter is responsible for collecting the packets from all the sensor nodes in a cluster and then transmitting them to the core network. Therefore, clusterheads serve as intermediate nodes between the sensor nodes and the core network. This design enables scalability of the network architecture. It also presents the opportunity of forming clusters based on common characteristics that affect optimization, e.g., requests for the same content can be addressed by locally optimizing the cached content in the clusterheads. In the rest of the paper, we do not deal with the cluster formation problem but assume that it has already taken place. Thus, we focus on the optimization of the transmission strategies of the clusterheads.
Adaptivity
Sensor networks that employ reinforcement learning to adjust to changes in their environment have been shown to be a promising approach that can ensure real-time, optimal allocation of resources in non-stationary environments [3]. Motivated by this, our protocol is based on Q-learning, a model-free algorithm that learns optimal policies through interaction with its environment and thus adapts to its changes.
Decentralization
Here, we aim at designing a decentralized solution that exploits the ad hoc, time-varying and heterogeneous nature of the network. By removing the need for a centralized point of control, our solution leverages locality of information and interaction to create nodes that contribute to the optimal overall throughput following computationally efficient policies.
II-C Preliminaries
This section presents our formulation and the assumptions made regarding the physical layer, the models of sensor nodes’ buffers and the packet arrival process. Tables I and II summarize the notation used for system-related and node-related variables, respectively.
II-C1 Physical layer
We consider frequency non-selective channels, which are characterized at the beginning of each time frame by the traffic $G_t$, where $t$ is the time index indicating the beginning of a frame. A set of slots forms a frame. We assume that the traffic can be estimated perfectly given the number of nodes and the frame size and that it remains constant during a frame, similar to the work in [7, 23], where the traffic also stays constant for all frames. Although knowing the traffic requires some central knowledge about the network condition, it can be easily derived by the clusterhead if we assume groupwise observability, as in the work of [14], which denotes the ability of an agent to fully determine global information based only on observations of its cluster.
Notation  System-related 
number of sensor nodes  
number of slots in frame  
channel load  
packet throughput  
total number of transmitted packets in frame  
packet loss rate  
size of packet  
uncontrolled state  
number of episodes  
number of iterations in episode  
learning rate  
discount factor  
history window  
initial state distribution  
coverage time  
exponent of learning rate  
threshold for ε-greedy exploration  
virtual experience transformation 
Notation  Node-related 
condition  
number of replicas to transmit  
number of arrivals in node’s buffer  
size of node’s buffer  
maximum number of replicas  
current state of buffer  
nodedegree probability distribution  
space of states  
space of actions  
space of observations  
space of virtual experience  
space of histories  
immediate reward  
expected reward  
policy  
state-action value function under policy  
state value function under policy 
Note that the proposed framework is oblivious to the underlying modulation and coding schemes. Similar to the work in [24], our only assumption is that the packet throughput and number of transmitted packets can be expressed as
(1)  
(2) 
where the quantities involved are the number of packets transmitted by all nodes during the particular frame, the packet loss rate and a node’s condition. A node’s condition can depend on the buffer state, the battery level and, more generally, any information that may affect its behavior. Our work considers only the buffer state of the network nodes, as incorporating more variables in the condition would increase computational requirements. However, our framework is general and, depending on the application of interest, can easily incorporate additional characteristics into the condition.
There are three sources of packet loss, i.e., packet collisions, imperfect interference cancellation and bit-level channel noise, the last of which depends on nodes’ transmission and noise power. In the rest of our work, and without loss of generality, we will assume that interference cancellation is perfect and that noise power is zero. Successful transmission will therefore be guaranteed if the iterative BP erasure-decoding algorithm, used for SIC, succeeds in recovering the original packets. Thus, in our approach, (1) and (2) will be oblivious to .
II-C2 Buffer and traffic model
We assume that the transmission buffer of a node is modeled as a first-in first-out queue. The source injects packets of size bits into the transmission buffer in each time frame according to an independent and identically distributed (i.i.d.) distribution . The packets arriving at a node are stored in a finite-length buffer, of capacity . Therefore, the buffer state of a sensor node evolves, recursively, as follows:
(3) 
where the recursion starts from the initial buffer state and the packet goodput represents the number of successfully transmitted packets in a frame for a given node. The packets arriving after the beginning of a frame cannot be transmitted until the next frame, and unsuccessfully transmitted packets stay in the transmission buffer for later retransmission.
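As an illustration, the buffer recursion described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the model used in the experiments: the function and parameter names (`next_buffer_state`, `goodput`, `arrivals`, `capacity`) are hypothetical, and the update rule is read off the verbal description only (delivered packets leave, failed packets stay, new arrivals become eligible in the next frame, and the buffer is finite).

```python
def next_buffer_state(buffer, goodput, arrivals, capacity):
    """One-frame buffer update (illustrative; all names are assumptions).

    buffer   -- packets waiting at the start of the frame
    goodput  -- packets delivered successfully during the frame
    arrivals -- packets injected by the source during the frame
    capacity -- maximum number of packets the buffer can hold
    """
    if not 0 <= goodput <= buffer:
        raise ValueError("goodput must lie between 0 and the buffer occupancy")
    # Delivered packets leave, failed ones stay for retransmission,
    # and arrivals that exceed the capacity are dropped.
    return min(buffer - goodput + arrivals, capacity)


def simulate(initial, goodputs, arrivals, capacity):
    """Roll the recursion forward over several frames and return the trajectory."""
    states = [initial]
    for delivered, arrived in zip(goodputs, arrivals):
        states.append(next_buffer_state(states[-1], delivered, arrived, capacity))
    return states
```

For example, starting from two packets with a capacity of four, the call `simulate(2, [1, 0, 1], [1, 3, 0], 4)` traces the occupancy over three frames, including one frame in which an arrival burst is clipped by the capacity.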
In the following sections, we will formulate MAC optimization as a multi-agent problem and propose an efficient reinforcement learning based algorithm that enables sensor nodes to maximize the overall packet throughput of the network.
III IRSA overview
In this section, we briefly overview the IRSA protocol [6]. IRSA has been proposed to deal with the case where nodes attempt to transmit their packets into a number of transmission slots over the same communication channel. Each frame consists of a number of time slots. The channel is fully characterized by its normalized traffic $G$, which represents the average number of attempted packet transmissions by all nodes per time slot. The objective of IRSA is to optimize the normalized throughput $T$, defined as the probability of successful packet transmission per slot. At the beginning of each time frame a user attempts transmission of a message by randomly choosing one of the slots to transmit a packet. In a vanilla Slotted ALOHA protocol, a transmission is successful only if no other user transmits in the same slot. The resulting throughput is a function of the normalized traffic, in particular it is $T = G e^{-G}$. In an IRSA protocol, however, a user has the capability of transmitting a variable number of replicas of the original message in the available time slots, a strategy that improves throughput due to interference cancellation. The throughput in this case is governed by the degree distribution, a polynomial probability distribution describing the probability that each user transmits a given number of replicas of its message at a particular time frame. This probability distribution is expressed as
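$$\Lambda(x) = \sum_{d=1}^{d_{\max}} \Lambda_d x^d \qquad (4)$$
where $\Lambda_d$ is the probability that a node transmits $d$ replicas and $d_{\max}$ is the maximum number of replicas a sensor node is allowed to send. The objective of a MAC optimization algorithm is to select the values $\Lambda_d$ in (4) so that overall network throughput is maximized. Formally, the optimization objective can be cast as
$$\text{Find: } \{\Lambda_d\}_{d=1}^{d_{\max}} \text{ maximizing } T \qquad (5)$$
$$\text{subject to } \sum_{d=1}^{d_{\max}} \Lambda_d = 1, \quad \Lambda_d \ge 0,$$
where $T$ denotes the normalized throughput.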
The optimization in (5) can be performed using any linear programming or gradient-based optimization algorithm, but differential evolution is usually employed [6, 11]. In asymptotic settings (frame length tending to infinity), iterative IC convergence analysis can be used to formulate how the collision resolution probability evolves with decoding time [6], and a stability condition can be formed, which defines the maximum channel load, $G^*$, for which the probability of unsuccessful transmissions is negligible. Section IV presents the proposed approach that allows users to learn their transmission strategies in a distributed manner for non-asymptotic scenarios.
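To make the overview above concrete, a single IRSA frame with ideal interference cancellation can be simulated as follows. This is a minimal sketch under stated assumptions rather than the simulator used in this work: collisions are the only source of loss, every user has exactly one packet to send, replicas of a user occupy distinct slots, and all names (`sample_degree`, `decode_frame`, `simulate_irsa`) as well as the example degree distribution are hypothetical. The closed-form throughput of plain Slotted ALOHA is included for comparison.

```python
import math
import random


def slotted_aloha_throughput(load):
    """Expected throughput of plain Slotted ALOHA at normalized load G."""
    return load * math.exp(-load)


def sample_degree(distribution, rng):
    """Draw a replica count d with probability distribution[d]."""
    degrees = list(distribution)
    weights = [distribution[d] for d in degrees]
    return rng.choices(degrees, weights=weights, k=1)[0]


def decode_frame(placements, num_slots):
    """Peel singleton slots until none remain; return the set of decoded users.

    placements[u] is the set of slots in which user u placed a replica.
    """
    slots = [set() for _ in range(num_slots)]
    for user, chosen in enumerate(placements):
        for slot in chosen:
            slots[slot].add(user)
    decoded = set()
    progress = True
    while progress:
        progress = False
        for slot_users in slots:
            if len(slot_users) == 1:
                user = next(iter(slot_users))
                decoded.add(user)
                # Ideal interference cancellation: remove every replica of this user.
                for slot in placements[user]:
                    slots[slot].discard(user)
                progress = True
    return decoded


def simulate_irsa(num_users, num_slots, distribution, seed=0):
    """Return the normalized throughput (decoded users per slot) of one frame."""
    rng = random.Random(seed)
    placements = []
    for _ in range(num_users):
        degree = min(sample_degree(distribution, rng), num_slots)
        placements.append(set(rng.sample(range(num_slots), degree)))
    return len(decode_frame(placements, num_slots)) / num_slots
```

With the degenerate distribution `{1: 1.0}` the sketch reduces to plain Slotted ALOHA within a frame, since no slot can be freed by cancellation; a distribution such as `{2: 0.5, 3: 0.5}` (chosen arbitrarily here) lets the peeling loop resolve some collisions.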
IV Dec-RL MAC protocol design
The discussion will proceed with the adoption of the MDP model for the design of an efficient MAC optimization strategy abiding by the framework defined in Section II. Our method employs ideas and tools from reinforcement learning and DCOP to satisfy the desired traits of the considered network setting.
IV-A MDP formulation
Recall from (1) and (2) that there are two parameters affecting the state of the environment, i.e., the current channel load $G$ and a node’s condition. We first assume that the sensor network is a single agent that interacts with its environment, which includes the channel and itself. This concept is depicted in Fig. 3. We model the problem as an MDP and define the state as
(6) 
where is the state of the agent, is the set of all states, represents the state of sensor node and stands for the part of the environment that is uncontrolled by the sensor nodes and corresponds to in our formulation.
The transition probabilities of the defined MDP can be formulated as
(7)  
(8) 
where is the individual packet goodput of sensor node that depends on the current values and . Note that we dropped time index for simplicity of notation.
From (7), we can see that the transitions of the uncontrollable state are independent of the transmission strategy and the states of individual nodes. Note that we assume that the channel switches states stochastically based on the arrival and departure of sensor nodes in the network, changes in noise conditions, etc. Further, from (8), we observe that individual transitions of sensor nodes depend on the states and actions of other nodes, channel load, noise conditions and packet throughput. Therefore, transition independence for sensor nodes does not hold.
The action of the agent , with being the action space, consists in the joint actions of all the sensor nodes in the network. These actions represent the values of the coefficients of the probability distribution function in (4), that is
(9) 
Recall that is the maximum number of replicas a sensor node is allowed to send.
The above MDP formulation, although it faithfully models the MAC optimization problem, leads to a continuous action space that scales exponentially with the number of sensor nodes. This renders learning of the optimal action infeasible for large problems. To circumvent this drawback, we redefine the actions as the number of replicas to send. Therefore
(10) 
During the learning phase the agent finds a deterministic policy by choosing the optimal action for each sensor node (except for exploratory moves in the learning process, where a random action is preferred). After learning has completed, the probability distribution is computed using the information of visited state-action pairs. Therefore, upon implementation of our protocol the policy is probabilistic, with the probability of sending a given number of replicas equal to the corresponding coefficient of the degree distribution in (4). This technique allows us to leverage the benefits of maintaining a small action space, while using a stochastic policy. The latter is important in multi-agent scenarios, where the existence of an optimal deterministic policy is not guaranteed due to an agent’s uncertainty regarding the behavior of other agents [25].
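The text does not spell out how the degree distribution is obtained from the visited state-action pairs. One plausible reading, offered here purely as an assumption, is to use the empirical frequency with which each action (number of replicas) was selected during learning. The sketch below implements that reading; the function name, the layout of the `visits` dictionary and the inclusion of a zero-replica action are all hypothetical.

```python
from collections import Counter


def empirical_degree_distribution(visits, max_replicas):
    """Turn visit counts of (state, action) pairs into action frequencies.

    visits maps (state, action) -> number of times the pair was visited,
    where the action is the number of replicas sent (an assumed encoding).
    """
    totals = Counter()
    for (_state, action), count in visits.items():
        totals[action] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no visits recorded")
    # Actions that were never selected receive probability zero.
    return {d: totals[d] / grand_total for d in range(max_replicas + 1)}
```

Sampling the number of replicas from the resulting dictionary yields a stochastic policy over a small, discrete action space, which is the property the paragraph above relies on.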
The choice of the reward function is guided by our aim to design active, self-interested agents attempting to improve overall packet throughput, while lacking access to a global performance measure, i.e., the channel load. We define the immediate reward of an agent as
(11) 
where is the number of messages in the buffer of sensor node at time . This reward makes the nodes eager to transmit when their buffers are full, instead of making the decisions purely based on the outcome of the current transmission.
The formulated MDP is episodic, with a number of episodes and a number of learning iterations per episode. At the beginning of an episode each agent can be in a random state. Experience, in the form of the Q-table and visits to state-action pairs, carries over across episodes. Solving the formulated MDP requires finding the optimal transmission policy, which is the one that maximizes the expected discounted reward obtained by starting in a given state and then following the policy. The reward takes into consideration immediate and delayed rewards, and is represented as
(12) 
where the expected return is expressed in terms of the immediate return and the discount factor $\gamma$, which weighs the effect of future rewards in the current state (a value of $\gamma$ close to zero means that the agent is myopic, while a value close to 1 means that the agent is farsighted). Equation (12) can be rewritten as a Bellman equation [13]
(13) 
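The role of the discount factor can be illustrated with a short numerical sketch. The function below computes a discounted sum of a finite reward sequence; it is a generic textbook computation rather than a reproduction of (12), and the name `discounted_return` and the reward values in the usage note are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated backwards over the sequence."""
    total = 0.0
    for reward in reversed(rewards):
        # Each step back in time discounts everything accumulated so far once more.
        total = reward + gamma * total
    return total
```

For three unit rewards, a discount factor of 0 keeps only the first reward (myopic agent), while a discount factor of 1 counts all three equally (farsighted agent); intermediate values interpolate between the two.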
The main drawback of MDPs is that in many practical scenarios, as in our case, the transition probability and the reward function that generates the reward are unknown, which makes it hard to evaluate a policy. To this end, we adopt Q-learning [26], which allows an agent to learn from delayed rewards and determine the optimal policy in the absence of the transition probability and reward function. In Q-learning, policies and the value function are represented by a two-dimensional lookup table indexed by state-action pairs. Formally, for each state $s$ and action $a$, the value $Q^{\pi}(s,a)$ under policy $\pi$ represents the expected discounted reward obtained by starting from $s$, taking the action $a$, and thereafter following policy $\pi$. $Q^{\pi}(s,a)$ is defined as follows
(14) 
We define the optimal policy as the one that maximizes the expected reward for all states
(15) 
Bellman’s optimality equation allows the optimal value function to be defined independently of any specific policy
(16)  
(17) 
Using the Q-learning algorithm, a learned action-value function $Q$ directly approximates the optimal action-value function through value iteration. Correspondingly, the Q-value iterative formula is given by
(18) 
where $\alpha$ is the learning rate, which determines to what extent newly acquired information overrides old information. The above iteration is guaranteed to converge to the optimal solution under the Robbins-Monro conditions:
(19) 
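The update and its step-size requirement can be sketched as follows. This is the standard tabular Q-learning rule, written under assumptions that are not taken from the text: the learning rate is the polynomial schedule 1/k^omega in the visit count k of a state-action pair (the notation table lists an exponent of the learning rate, but its exact use is not shown here), and the class and method names are hypothetical. A schedule of this form has a divergent sum and a convergent sum of squares when omega lies in (0.5, 1].

```python
from collections import defaultdict


class TabularQ:
    """Tabular Q-learning with a visit-count based learning rate (illustrative)."""

    def __init__(self, actions, gamma=0.9, omega=0.7):
        if not 0.5 < omega <= 1.0:
            raise ValueError("omega must lie in (0.5, 1] for the step sizes to qualify")
        self.actions = list(actions)
        self.gamma = gamma
        self.omega = omega
        self.q = defaultdict(float)      # Q-values, zero-initialized
        self.visits = defaultdict(int)   # visit counts per (state, action)

    def learning_rate(self, count):
        """Polynomial step size 1 / count**omega for the count-th visit."""
        return 1.0 / (count ** self.omega)

    def update(self, state, action, reward, next_state):
        """Move Q(state, action) towards reward + gamma * max_a Q(next_state, a)."""
        self.visits[(state, action)] += 1
        alpha = self.learning_rate(self.visits[(state, action)])
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += alpha * (target - self.q[(state, action)])
        return self.q[(state, action)]
```

Keeping one visit counter per state-action pair, rather than a single global counter, means that rarely visited pairs still receive large updates when they are finally reached.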
As noted earlier, a state consists of all the information necessary for the network to choose the optimal action. This requirement forces us to include in a state information about nodes’ condition and the uncontrolled state, which includes battery level, buffer size, number of packets to transmit, the channel’s noise, load, etc. Clearly, this information cannot be made available, as collecting it would impose a huge communication load, while a MAC protocol should prevent channel congestion and be unintrusive. To alleviate this drawback of MDPs and Q-learning, in the next section we present a novel framework, based on partially observable MDPs (POMDPs), which have been successfully used for resource optimization problems in sensor networks [27].
IV-B Dealing with partial observability
POMDPs [28] acknowledge the inability of an agent to observe its state, which they remedy by introducing the notion of observations. Observations contain information that is relevant but insufficient to describe the actual state on their own. In our case, the network and the sensor nodes cannot observe the full state, as this requires global knowledge of the environment, which is hard to achieve. We, therefore, constrain observability to information only locally available to the sensor nodes. Following our description in Section II-C regarding a sensor node’s condition, we assume that the only state-related information a node has access to is the number of messages stored in its own buffer, that is
(20) 
POMDPs can be optimally solved using the framework of Belief MDPs [28], but this renders learning intractable, as it is performed in continuous state spaces. We adopt a fixed horizon of observations, which is a common approach that, however, has no convergence guarantees. Nevertheless, it has been successfully employed in object tracking problems due to its simplicity and expressive power [14].
Through the adoption of a fixed history window , the observation tuple of each sensor node is defined as
(21) 
and defined as in (20).
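A fixed history window of local observations can be maintained with a bounded queue, as in the sketch below. The class name, the zero padding used before the window fills up, and the choice of a tuple as the lookup key are assumptions made for illustration; the text only specifies that the last few buffer observations form the observation tuple.

```python
from collections import deque


class ObservationHistory:
    """Sliding window over the most recent local observations (illustrative)."""

    def __init__(self, window, initial_observation=0):
        # Pad with an assumed initial observation so the key always has `window` entries.
        self.buffer = deque([initial_observation] * window, maxlen=window)

    def push(self, observation):
        """Append the newest observation, dropping the oldest, and return the key."""
        self.buffer.append(observation)
        return self.key()

    def key(self):
        """Hashable tuple usable as the row index of a Q-table."""
        return tuple(self.buffer)
```

With a window of three, pushing the buffer occupancies 2, 1, 4 and 0 in turn produces keys that always contain the three most recent values, so the size of the Q-table grows with the number of distinct windows rather than with the length of the whole trajectory.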
The Belief MDP, whose states correspond to the beliefs over states, is assumed to satisfy the Markov property. Histories of observations serve as an approximation to beliefs, therefore Q-learning can be applied as in the general MDP case.
The distributed nature of the problem has so far been purposely neglected in order to focus on the decision process formulation. Following the observations made in Section II-B regarding the need for decentralization, we next formulate a distributed representation of the problem under the Dec-POMDP framework, introduced in [15].
IV-C Dec-POMDP Formulation
Decentralized partially observable MDPs offer a powerful framework for designing solutions that take into account partial observability and are controlled in a distributed way. Here, the aim is to design a model-free solution that can help achieve improved overall throughput. The state of the environment includes information about the number of agents and the number of slots per time frame, both expressed through $G$. Recall that each agent can only observe its own buffer and thus deduce whether its transmission was successful. Fig. 4 depicts the sensor network as a Dec-POMDP.
Definition 1.
(Dec-POMDP) A decentralized partially observable Markov decision process is defined as a tuple , where

is the set of agents

is a finite set of states s in which the environment can be

is the finite set of joint actions

is the transition probability function

is the immediate reward function

is the finite set of joint observations

is the observation probability function

is the history window

is the initial state distribution at time t = 0
Definition 1 extends the single-agent POMDP model by considering joint actions and observations. In our case , and is the individual reward that agent observes. As regards the initial distribution , we assume a uniform distribution taking values in the range . Note that our algorithm does not need an external common reward function , i.e., one provided by the environment, but agents individually measure their rewards based on their observations.
As we mentioned in Section I, the decentralization property of the POMDP framework makes the problem NEXP-complete, a class of problems too complex to admit any real-time solution. Nevertheless, the theoretical properties of this family of problems have been studied and efficient algorithms have been developed in [28]. For example, the Witness algorithm is introduced in [28] as a polynomial time alternative to value iteration in policy trees. As these algorithms suffer from extreme memory requirements due to the continuous nature of the problem, locality of interaction has been leveraged in the Networked Distributed POMDP setting [18], where LID-JESP and GOA are introduced for planning in Dec-POMDPs. In contrast to the above, we will use model-free remedies to circumvent the inherent intractability, an approach that benefits from lower complexity, both in terms of computation and time.
To learn the optimal policy using a modelfree approach one can apply simple singleagent Qlearning. This is performed as follows
(22) 
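For reference, the standard tabular Q-learning rule of [26] can be sketched as below. The notation is generic and may differ from that of (22); the dictionary-based table and the argument names are assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Apply one tabular Q-learning step and return the new Q(state, action).

    q       : mapping from (state, action) to value, defaulting to 0.0
    actions : iterable of actions available in next_state
    alpha   : learning rate, gamma : discount factor
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Usage: an empty table where unseen pairs are valued at zero.
q_table = defaultdict(float)
q_learning_update(q_table, state=0, action=1, reward=1.0,
                  next_state=1, actions=[0, 1], alpha=0.5, gamma=0.9)
```

In a centralized setting the state and action would be joint quantities, which is what makes the table grow quickly with the number of agents.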
Although this approach leads to an optimal policy, it is inappropriate in the Dec-POMDP framework, as adopting a centralized point of control creates a large state space and demands global access to information. Independent learning, in which each agent learns its own Q-value function ignoring other agents’ actions and observations, is studied in [29]. By ignoring the effect of interaction among agents, this approach may converge to locally optimal policies or oscillate. Nevertheless, independent learning offers a distributed, tractable solution that has proven adequate in relevant applications [14]. Motivated by the encouraging results in [29], we formulate the problem as a population of agents that decide independently of each other how to handle common resources in order to maximize social welfare, i.e., the overall throughput. Our adoption of the Dec-POMDP framework is justified by the realistic nature of MAC protocol design, as its success depends on achieving low complexity. The work in [16] stressed the importance of realistic modeling in networking applications, as specific characteristics have a significant, algorithm-specific impact on the solution. In Section V, we experimentally investigate the performance of independent learning under various learning settings in order to draw qualitative conclusions about the appropriate behaviors of sensor agents and design a MAC protocol that surpasses the performance of vanilla IRSA.
IV-D Virtual experience
Q-learning is a model-free learning approach; however, due to its conceptual simplicity, it proves inefficient for real-time applications, as extensive interaction with the environment is required. Leveraging past experience is a technique that has been used successfully in demanding RL tasks, as it is effective and respects the structural properties of Q-learning [20]. The key intuition behind it is that an agent can update the Q-values of states it has previously visited. These batch updates can significantly decrease convergence time, provided that the agent avoids acting on outdated information. A related notion is that of virtual experience [24, 30], where an agent “imagines” state visits instead of “remembering” them. The work in [24] separated the effect of the environment into “known” and “unknown” dynamics and introduced virtual experience to extrapolate the experience of actual rewards to states that do not affect the unknown dynamics and are, therefore, equivalent in the light of new information. There, virtual experience was applied to post-decision states, and not to actual states. Next, we formulate virtual experience over the observation histories of our own learning setting.
As defined in (21), an agent’s history of observations is a tuple of past buffer states. Based on this information, an agent chooses the preferred number of replicas to send. The unknown environment dynamics in this case comprise the arrival and collision models; they take place after the number of replicas has been selected and determine the reward the agent experiences as well as the next observation . Although the agent’s observation vector is essential for determining the optimal action, we should point out that the unknown dynamics do not directly depend on . In particular, if the observation tuple is , then the collision model cannot discern any difference in states of the following form
(23)  
(24) 
where is the difference in observations between two consecutive states.
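A minimal sketch of how an agent might maintain its observation history and compute the differences between consecutive observations is given below. The class name, the zero-initialized window and the sign convention (newer minus older) are assumptions, as the exact definitions in (21), (23) and (24) are not reproduced here.

```python
from collections import deque


class ObservationHistory:
    """Sliding window over the last `window` buffer states of one agent."""

    def __init__(self, window, initial=0):
        # Assumption: the window starts filled with an initial buffer state.
        self._buf = deque([initial] * window, maxlen=window)

    def push(self, buffer_state):
        """Record the newest buffer state, discarding the oldest one."""
        self._buf.append(buffer_state)

    def as_tuple(self):
        """Return the history as a hashable tuple, oldest observation first."""
        return tuple(self._buf)

    def differences(self):
        """Return the differences between consecutive observations (newer minus older)."""
        h = self.as_tuple()
        return tuple(later - earlier for earlier, later in zip(h, h[1:]))
```

The tuple returned by `as_tuple` can serve directly as a key of the Q-table, while `differences` captures the relative buffer states.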
Virtual experience can be viewed as applying the transformation formulated in (25) to visited states and then updating all states that have the same representation. We call a virtual state, as it is neither visited nor directly used in the Q-learning update, but serves as an intermediate state for identifying states that are equivalent with respect to the unknown environment dynamics. We illustrate this in Fig. 11.
(25) 
The reason for the above formulation is that collisions should intuitively depend on the relative buffer states , as they determine the channel congestion. The actual values are useful in shaping the eagerness of agents to transmit data. Formally, and according to [31], a pair is equivalent to a pair if and can be derived from . Following the above observation, for each move of an agent, a batch update on all pairs with and is performed. Note that we cannot extrapolate experience to states with different actions, as the collision dynamics depend on the action performed.
Equipping Q-learning with virtual experience increases computational complexity, as instead of updating one entry of the Q-table in each learning iteration, all pairs with the same are updated. This complexity increase is equal to the number of those pairs, which we denote by and can be bounded as
where  (26)  
and  (27)  
(28) 
and are used to avoid considering virtual states with numbers of packets in their buffers that are either negative or exceed the maximum capacity .
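The batch update can be sketched as follows, under a loudly stated assumption: two histories are treated as equivalent when they have the same consecutive differences, i.e., when one is a constant shift of the other, and shifts that would make any buffer state negative or larger than the capacity are discarded. This is one reading of the text, not necessarily the exact transformation in (25); the function names and the shared update target are likewise hypothetical simplifications.

```python
def equivalent_histories(history, capacity):
    """List all histories with the same consecutive differences as `history`.

    Assumption: equivalence means a constant shift of every entry, with all
    shifted buffer states kept inside [0, capacity].
    """
    lowest_shift = -min(history)
    highest_shift = capacity - max(history)
    return [tuple(o + c for o in history)
            for c in range(lowest_shift, highest_shift + 1)]


def virtual_experience_update(q, history, action, target, alpha, capacity):
    """Move Q(h, action) towards `target` for every equivalent history h.

    Simplification: the same target is used for all equivalent pairs.
    Returns the number of table entries updated in this iteration.
    """
    updated = 0
    for h in equivalent_histories(history, capacity):
        old = q.get((h, action), 0.0)
        q[(h, action)] = old + alpha * (target - old)
        updated += 1
    return updated
```

Under this reading, the number of entries touched per iteration is `capacity - (max(history) - min(history)) + 1`, and only entries with the action actually taken are modified.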
The introduction of virtual experience in [31] was not accompanied by an analysis of its effect on convergence time; we therefore conclude this section with some remarks on this effect. Our analysis is inspired by the work in [32], where the convergence time of Q-learning was studied in relation to its parameters, e.g., the learning rate and discount factor, and lower bounds were computed for synchronous and asynchronous learning using polynomial and linear learning rates. We study how virtual experience affects convergence time and derive a similar bound. We limit ourselves to asynchronous learning using a polynomially decreasing learning rate, as is the case here, and extend it by considering multiple updates in each iteration.
We first study how virtual experience affects coverage time , i.e., the number of learning iterations necessary to visit all state-action pairs at least once, and then proceed to bounding convergence time. Our remarks are based on Lemma 33 from [32].
Lemma 1.
Assume that is the probability of visiting all state-action pairs in an interval , where an interval corresponds to a time period of iterations. Then, using virtual experience, the probability of visiting all state-action pairs in an interval is .
Proof.
The probability can also be interpreted as the percentage of unique pairs visited, i.e., , where is the number of iterations where the pair was visited for the first time and the denominator represents the size of the state-action space. We assume that states are sampled with replacement from an i.i.d. probability distribution. As noted earlier, virtual experience increases the number of states updated in a learning iteration by , with defined in (26). It follows then that , where is the number of iterations where the visited pair was unique using virtual experience. Thus, . ∎
Lemma 2.
Assume that from any start state we visit all state-action pairs with probability in steps. Then, with probability , from any initial state we visit all state-action pairs in steps for a learning period of length .
Proof.
The probability of not visiting all state-action pairs in consecutive intervals is . If we define as , then this probability equals and steps will be necessary to visit all state-action pairs. ∎
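The counting argument in the proof can be made concrete with a short helper. It assumes, as the proof does, that intervals are independent and that each one covers all pairs with the same probability; the names `p_cover` and `delta` are hypothetical stand-ins for the per-interval coverage probability and the tolerated failure probability.

```python
def intervals_needed(p_cover, delta):
    """Smallest k such that (1 - p_cover)**k <= delta.

    p_cover : probability that a single interval visits all state-action pairs
    delta   : tolerated probability of still missing some pair after k intervals
    """
    if not 0.0 < p_cover <= 1.0:
        raise ValueError("p_cover must lie in (0, 1]")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    k = 0
    p_fail = 1.0  # probability that no interval so far achieved full coverage
    while p_fail > delta:
        p_fail *= 1.0 - p_cover
        k += 1
    return k
```

For example, with `p_cover = 0.5` three intervals are enough to bring the failure probability down to `0.125`. A larger per-interval coverage probability directly reduces the number of intervals required.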
Corollary 2.1.
Virtual experience alters coverage time by a factor of .
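The effect on coverage time can also be observed in a toy simulation. The sketch below is an illustration, not the bound of Lemma 1: it assumes that pairs are sampled uniformly at random and that the state-action space splits into equivalence classes of equal size, so that visiting one pair marks its whole class as covered. Both assumptions, and all names, are introduced only for this example.

```python
import random


def coverage_time(n_pairs, class_size, rng):
    """Iterations until every pair is covered when one visit covers a whole class.

    class_size = 1 corresponds to plain Q-learning (no virtual experience).
    """
    n_classes = -(-n_pairs // class_size)  # ceiling division
    covered = set()
    steps = 0
    while len(covered) < n_classes:
        pair = rng.randrange(n_pairs)      # uniform sampling with replacement
        covered.add(pair // class_size)    # the whole class counts as visited
        steps += 1
    return steps


def mean_coverage_time(n_pairs, class_size, trials=200, seed=0):
    """Average coverage time over independent trials with a fixed seed."""
    rng = random.Random(seed)
    total = sum(coverage_time(n_pairs, class_size, rng) for _ in range(trials))
    return total / trials
```

Comparing `mean_coverage_time(60, 1)` with `mean_coverage_time(60, 5)` shows the shorter coverage time obtained when each iteration covers several pairs.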
According to [32], convergence time depends on the coverage time through the following theorem.
Theorem 3.
Let be the value of the asynchronous Q-learning algorithm using a polynomial learning rate at time . Then, with probability at least , we have , given that
where is a parameter that determines how fast the learning rate converges to zero, i.e., .
Proof.
The proof is identical to that of Theorem 4 in [32]. ∎
IV-E Computational complexity
The proposed protocol is a computationally attractive alternative to transmission strategies based on finite-length analysis [23], which has exponential complexity. In our framework, at each learning iteration an agent has to choose its transmission strategy and then update its local Q-table. In contrast to the work in [10], in the proposed scheme the action space is discrete and increases linearly with . The size of the observation space, which coincides with the size of the Q-table, is , where is the size of the sensor nodes’ buffer and is the history window. The observation space scales exponentially with and linearly with . Finally, the complexity associated with the number of agents is , as each agent learns independently.
V Simulation results
This section begins with a performance comparison of the proposed Dec-RL IRSA protocol and vanilla IRSA. It subsequently studies the effect of different learning schemes on the performance of independent learning, with the twofold goal of drawing conclusions about the behavior of agents and providing a guideline for configuring system parameters to determine the optimal strategy. Finally, we evaluate the proposed scheme enhanced with virtual experience to show the reduced convergence time.
V-A Simulation Setup
The following experiments are performed on a toy network with frames of size 10 and channel loads , which remain constant throughout learning and the simulated communication time. Unless stated otherwise, performance is averaged over 1000 Monte Carlo trials, the number of sensor nodes is determined by , learning requires 1500 iterations and confidence intervals are calculated based on 20 independent experiments with confidence level. As regards the configuration of learning, we experimentally validated that ε-greedy exploration with a constant exploration rate , a decreasing learning rate following formula, where is the number of times the current state-action pair has been visited, and a constant discount factor offer the optimal policy. As a baseline for our comparisons we use IRSA with , which was experimentally evaluated in [6] and proved superior to other commonly used distributions derived in [33].
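The learning configuration described above can be sketched as follows. The polynomial form of the learning rate follows the polynomial learning rates analysed in [32] and is an assumption here, since the exact formula and parameter values are not reproduced; `omega`, `epsilon` and the function names are hypothetical.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action index from one row of the Q-table.

    With probability epsilon a uniformly random action is explored;
    otherwise the first action with the highest value is exploited.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return q_row.index(max(q_row))


def polynomial_learning_rate(n_visits, omega=0.8):
    """Learning rate decaying polynomially with the visit count (n_visits >= 1).

    Assumed form: 1 / n_visits**omega, with omega in (0.5, 1] controlling
    how fast the rate converges to zero.
    """
    if n_visits < 1:
        raise ValueError("n_visits must be at least 1")
    return 1.0 / (n_visits ** omega)
```

A constant `epsilon` keeps the agent exploring throughout learning, while the per-pair visit counter ensures that frequently visited state-action pairs are updated more conservatively.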
V-B Protocol Comparison
Based on the observations in [16], a protocol orchestrating a multi-agent system should be examined with regard to the following properties: completeness, i.e., its ability to find the optimal solution, if any; rate of convergence; complexity; and scalability. Of these, completeness is a requirement often dropped in real-time, non-stationary environments, as convergence to a good solution is valued more than exhausting one’s resources, i.e., CPU power, time and memory, in the vain pursuit of the optimal one. As regards scalability, our method is invariant to the number of agents due to independent learning, while the complexity scales exponentially with the size of the observation history. Nevertheless, as we show later, our scheme obtains most of the benefit of the observation history with a short history window. Hence, complexity is not an issue for our solution.
Fig. 7 presents a statistical analysis of the performance of the two protocols under consideration using confidence intervals. This figure shows that Dec-RL IRSA is superior to vanilla IRSA in all cases, with the gap becoming wider for channel loads above . We also observe that performance varies more at high channel loads. Fig. 7 illustrates convergence time for independent learning at different channel loads. From this figure, we can see that convergence is guaranteed and is fast for low channel loads. For only four learning iterations are necessary, while for seven iterations are needed. At high channel loads, Dec-RL IRSA fails to transmit messages faster than their arrival rate; the node’s buffer thus saturates quickly to for and tends to saturate at the end of the episode for . Based on this observation, we design a mechanism for agents to detect “bad” episodes and reset the POMDP to an arbitrary state. We classify an episode as “bad” if the rewards deteriorate for three consecutive iterations.
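The detection rule for “bad” episodes can be sketched as below. The sketch interprets “deteriorate” as a strict decrease of the reward between consecutive iterations, which is an assumption; the function name and the `patience` parameter are hypothetical.

```python
def is_bad_episode(rewards, patience=3):
    """Return True if the reward strictly decreased in each of the last
    `patience` consecutive iterations.

    rewards : sequence of per-iteration rewards observed by one agent
    """
    if len(rewards) < patience + 1:
        return False  # not enough iterations to observe `patience` decreases
    recent = rewards[-(patience + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))
```

An agent would call this check after every iteration and, when it returns `True`, reset its episode to an arbitrary state as described above.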
Fig. 9 illustrates how the throughput achieved by Dec-RL IRSA and vanilla IRSA changes with the frame size (). As regards scalability, Dec-RL IRSA appears robust and its performance increases with larger frame sizes. This can be attributed to the fact that learning is more effective in more complex networks, where collisions occur more often, so learning to avoid other agents has a more profound impact on the overall throughput. Vanilla IRSA also improves its performance for increased frame sizes, as it provably works better in asymptotic settings. This is attributed to the fact that the probability distribution is computed using asymptotic analysis and is therefore closer to optimal for frames that exceed 200 time slots. Nevertheless, the throughput gap between Dec-RL IRSA and vanilla IRSA remains large at heavy channel loads (), due to the waterfall effect of vanilla IRSA. To conclude the scalability analysis, the slight superiority of vanilla IRSA manifested for low in asymptotic settings is irrelevant to practical scenarios, as very large frame sizes lead to inefficient implementations, in particular in sensor and IoT networks, that require a complex receiver and introduce delay.
V-C Effect of state space size
The size of the state space, i.e., the number of possible states for an agent, depends on the length of the history of observations, as well as the maximum value of the observations, which is equal to , the size of the agents’ buffers. Increasing has a twofold effect. Firstly, it increases the size of the state space, thus making learning harder due to the need for longer exploration. Secondly, it widens the range of rewards, making agents more eager to transmit. Assuming buffers of constant size, constrained by the characteristics of the sensor nodes, one expects the performance of learning to improve with a longer history window, as this leads to a better approximation of the actual states. Nevertheless, even setting memory constraints aside, this leads to an exponential increase of the world size, resulting in either intractable problems or high time requirements. Thus, it is crucial to determine the minimum amount of information necessary for agents to derive efficient policies. Note that, for the sake of a fair comparison, the number of learning iterations was also increased to 3000 for increased history window and buffer size. Fig. 9 demonstrates that using a value of , i.e., only one packet is kept in the buffer, leads to lower throughput for channel loads above , as agents are not made eager enough to transmit. On the other hand, an increased buffer size improves the perceived throughput for loads above , but slightly degrades it for the rest.
Regarding history size, Fig. 11 reveals that the effect of increased world size is more profound. This results from the fact that, according to Section IV-E, size scales exponentially with and linearly with . We observe that decreasing the window to leads to a severe degradation in performance, suggesting that the information provided to the agents through the observation tuples is then insufficient. Increasing the learning iterations for has a counterintuitive effect, as performance is degraded, whereas we would expect an increased world size to benefit from longer training. In this case, learning iterations perform optimally, so we can assume that equipping agents with larger memory leads to learning better actions. Still, considering the current parameterization, is the best performing choice.
V-D Virtual experience
Virtual experience was introduced to reduce convergence time, which we evaluate experimentally using the weighted percent error metric and the measured convergence time, similarly to the work in [24]. Fig. 11 shows how throughput varies with the number of learning iterations and suggests that, using virtual experience, the optimal number of iterations was reduced from 1500 to 500. Fig. 13 presents a statistical analysis of convergence time for different channel loads using a confidence level on 40 independent experiments and . We observe that convergence is fast for low loads regardless of the use of virtual experience. For , however, we observe that virtual experience exhibits an improvement of around , which can be attributed to increasing the number of batch updates by a factor of . Also, vanilla Dec-RL IRSA usually fails to converge for high channel loads, although throughput remains close to optimal. This observation suggests that, in this case, there are different policies that lead to optimal behavior, so vanilla Dec-RL IRSA is less biased towards the optimal one. Note that the degradation in performance with increasing learning iterations, observed in Fig. 11 and manifested at around 1500 iterations for vanilla Dec-RL IRSA and 500 using virtual experience, is attributed to overtraining.
V-E Waterfall effect
The performance of IRSA has been proven to be governed by a stability condition [6], which leads to a waterfall effect similar to the one observed in the decoding of LDPC codes [34]. From a learning perspective, this profoundly changes the nature of the problem and thus the learning objective. As described in Section IV-C, the problem is one of agents competing for a pool of common resources. This formulation resembles the El Farol bar problem, a well-studied scenario in the reinforcement learning literature, but this description is not rich enough to illustrate the learning objectives of individual agents. In the realm of low channel loads (), where resources are abundant, agents must learn to coordinate their actions, as there is a number of replicas to transmit that optimizes packet throughput. Note that for low channel loads () even a random strategy is appropriate, so learning is of no practical interest. In the realm of high channel loads (), however, the task can be viewed as a Dispersion game [35], where agents need to cooperate in order to avoid congesting the channel by exploiting it in different time frames. A different problem nature calls for different learning behavior; thus, we expect that the parameterization of learning should vary with . Fig. 13 illustrates the performance of three different parameterizations, each one optimal for a different range of values for . The random strategy was implemented by sampling the number of replicas uniformly from at each node’s transmission. Note that and stand for the threshold below which the probability of unsuccessful transmission is negligible and a random strategy is optimal, respectively. We observe that by optimizing the parameters for a particular range of values, we obtain significant gains in the region of interest ().
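To make the random baseline concrete, the sketch below simulates one frame in which every node draws its number of replicas uniformly at random and the receiver applies iterative interference cancellation in the manner of [6]. The range of replica counts is not reproduced above, so `max_replicas` (with a minimum of one replica) is a hypothetical parameter; the collision-channel model and the definition of throughput as decoded packets per slot are standard assumptions rather than details taken from this section.

```python
import random


def decode_frame(user_slots, n_slots):
    """Return the set of users decoded by iterative interference cancellation.

    user_slots : list where entry u is the set of slots holding replicas of user u
    A slot holding exactly one replica is decodable; once a user is decoded,
    all of its replicas are removed from the frame.
    """
    slots = [set() for _ in range(n_slots)]
    for user, chosen in enumerate(user_slots):
        for s in chosen:
            slots[s].add(user)
    decoded = set()
    progress = True
    while progress:
        progress = False
        for occupants in slots:
            if len(occupants) == 1:
                user = next(iter(occupants))
                decoded.add(user)
                for s in user_slots[user]:
                    slots[s].discard(user)  # cancel every replica of this user
                progress = True
    return decoded


def random_strategy_frame(n_users, n_slots, max_replicas, rng):
    """Throughput of one frame when replica counts are drawn uniformly
    from 1..max_replicas (hypothetical range, max_replicas <= n_slots)."""
    user_slots = []
    for _ in range(n_users):
        degree = rng.randint(1, max_replicas)
        user_slots.append(set(rng.sample(range(n_slots), degree)))
    return len(decode_frame(user_slots, n_slots)) / n_slots
```

Averaging `random_strategy_frame` over many frames while increasing `n_users` for a fixed `n_slots` gives a rough picture of how throughput degrades once the channel load becomes too high for cancellation to resolve collisions.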
VI Conclusion
We have examined the problem of decentralized MAC design from a reinforcement learning perspective and shown that learning transmission strategies can be beneficial even when sensor nodes learn independently. Our experiments suggest that the “waterfall effect” of the problem, common in social games where agents compete for common resources, leads to different learning dynamics that demand adaptive solutions. Our method’s superiority is manifested especially at high channel loads, where the need for adaptivity is more pronounced and agents benefit from short-sightedness and increased exploration, which implicitly ensures better coordination. From the results we can conclude that, in order to make learning tractable for online application scenarios, it is essential to achieve fast convergence. We observed that even with a small observation space, obtained by restricting the history window to 2, the performance remains satisfactory. Finally, the results show that introducing virtual experience into learning significantly reduces convergence time.
References
 [1] N. M. Abramson, “THE ALOHA SYSTEM: Another Alternative for Computer Communications,” in Proc. of joint Computing Conf. AFIPS’70, Honolulu, HI, USA, Nov. 1970.
 [2] M. Hadded, P. Muhlethaler, A. Laouiti, R. Zagrouba, and L. A. Saidane, “TDMA-based MAC protocols for vehicular ad hoc networks: A survey, qualitative analysis, and open research issues,” IEEE Communications Surveys & Tutorials, vol. 17, no. 4, pp. 2461–2492, Jun. 2015.
 [3] Z. Liu and I. Elhanany, “RL-MAC: A QoS-Aware Reinforcement Learning based MAC Protocol for Wireless Sensor Networks,” in Proc. IEEE Int. Conf. on Networking, Sensing and Control, ICNSC ’06, Ft. Lauderdale, FL, USA, Aug. 2006.
 [4] G. L. Choudhury and S. S. Rappaport, “Diversity ALOHA - A Random Access Scheme for Satellite Communications,” IEEE Trans. on Communications, vol. 31, no. 3, pp. 450–457, Mar. 1983.
 [5] E. Casini, R. D. Gaudenzi, and O. D. R. Herrero, “Contention resolution diversity slotted ALOHA (CRDSA): An enhanced random access scheme for satellite access packet networks,” IEEE Trans. on Wireless Communications, vol. 6, no. 4, pp. 1408–1419, Apr. 2007.
 [6] G. Liva, “Graph-based analysis and optimization of contention resolution diversity slotted ALOHA,” IEEE Trans. on Communications, vol. 59, no. 2, pp. 477–487, Feb. 2011.
 [7] E. Paolini, G. Liva, and M. Chiani, “Coded slotted ALOHA: A graph-based method for uncoordinated multiple access,” IEEE Trans. on Information Theory, vol. 61, no. 12, pp. 6815–6832, Dec. 2015.
 [8] A. Meloni, M. Murroni, C. Kissling, and M. Berioli, “Sliding window-based contention resolution diversity slotted ALOHA,” in Proc. of IEEE Global Communications Conf., GLOBECOM’12, Anaheim, CA, USA, Dec. 2012.
 [9] E. Sandgren, A. G. i Amat, and F. Brannstrom, “On frame asynchronous coded slotted ALOHA: Asymptotic, finite length, and delay analysis,” IEEE Trans. on Communications, vol. 65, no. 2, pp. 691–704, Feb. 2017.
 [10] L. Toni and P. Frossard, “IRSA Transmission Optimization via Online Learning,” 2018. [Online]. Available: http://arxiv.org/abs/1801.09060
 [11] L. Toni and P. Frossard, “Prioritized random MAC optimization via graph-based analysis,” IEEE Trans. on Communications, vol. 63, no. 12, pp. 5002–5013, Dec. 2015.
 [12] K. Waugh, D. Schnizlein, M. Bowling, and D. Szafron, “Abstraction Pathologies in Extensive Games,” in Proc. of AAMAS ’09, Budapest, Hungary, May 2009.
 [13] R. E. Bellman, Dynamic Programming. Dover Publications, Incorporated, 2003.
 [14] C. Zhang and V. Lesser, “Coordinated multi-agent reinforcement learning in networked distributed POMDPs,” San Francisco, CA, USA, Aug. 2011.
 [15] D. S. Bernstein, S. Zilberstein, and N. Immerman, “The Complexity of Decentralized Control of Markov Decision Processes,” Mathematics of Operations Research, vol. 27, no. 4, pp. 819–840, Nov. 2002.
 [16] V. Lesser, M. Tambe, and C. L. Ortiz, Eds., Distributed Sensor Networks: A Multiagent Perspective. Norwell, MA, USA: Kluwer Academic Publishers, 2003.
 [17] J. Dowling, E. Curran, R. Cunningham, and V. Cahill, “Using feedback in collaborative reinforcement learning to adaptively optimize MANET routing,” IEEE Trans. on Systems, Man, and Cybernetics - Part A: Systems and Humans, vol. 35, no. 3, pp. 360–372, May 2005.
 [18] R. Nair, P. Varakantham, M. Tambe, and M. Yokoo, “Networked distributed POMDPs: A synthesis of distributed constraint optimization and POMDPs,” in Proc. of the 20th National Conference on Artificial Intelligence, AAAI’05, Jul. 2005.
 [19] T. Park and W. Saad, “Distributed learning for low latency machine type communication in a massive internet of things,” CoRR, vol. abs/1710.08803, 2017.
 [20] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, “Playing atari with deep reinforcement learning,” in Proc. of NIPS Deep Learning Workshop, NIPS’13, Lake Tahoe, CA, USA, Dec. 2013.
 [21] M. Minsky, “Steps toward artificial intelligence,” Proceedings of the IRE, vol. 49, no. 1, pp. 8–30, Jan. 1961.
 [22] T. Chen, H. Zhang, G. M. Maggio, and I. Chlamtac, “CogMesh: A cluster-based cognitive radio network,” in Proc. of IEEE Int. Symp. on New Frontiers in Dynamic Spectrum Access Networks, DySPAN’07, Apr. 2007.
 [23] E. Paolini, “Finite length analysis of irregular repetition slotted ALOHA (IRSA) access protocols,” in Proc. of IEEE Int. Conf. on Communication Workshop, ICCW’15, London, UK, Jun. 2015.
 [24] N. Mastronarde and M. van der Schaar, “Fast Reinforcement Learning for Energy-Efficient Wireless Communication,” IEEE Trans. on Signal Processing, vol. 59, no. 12, pp. 6262–6266, Dec. 2011.
 [25] M. L. Littman, “Markov games as a framework for multiagent reinforcement learning,” in Proc. of the 11th Int. Conf. on Machine Learning, ICML’94, New Brunswick, NJ, USA, Jul. 1994.
 [26] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3, pp. 279–292, May 1992.
 [27] P. Nurmi, “Reinforcement learning for routing in ad hoc networks,” in Proc. of IEEE Int. Symp. on Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks and Workshops, WiOpt’07, Limassol, Cyprus, Apr. 2007.
 [28] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artif. Intell., vol. 101, no. 1–2, pp. 99–134, May 1998.
 [29] C. Claus and C. Boutilier, “The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems,” in Proc. of the 15th National/10th Int. Conf. on Artificial Intelligence/Innovative Applications of Artificial Intelligence, AAAI ’98/IAAI ’98, Madison, WI, USA, Jul. 1998.
 [30] N. Thomos, E. Kurdoglu, P. Frossard, and M. van der Schaar, “Adaptive prioritized random linear coding and scheduling for layered data delivery from multiple servers,” IEEE Transactions on Multimedia, vol. 17, no. 6, pp. 893–906, June 2015.
 [31] N. H. Mastronarde, “Online learning for energy-efficient multimedia systems,” Ph.D. dissertation, University of California, 2011.
 [32] E. Even-Dar and Y. Mansour, “Learning rates for Q-learning,” J. Mach. Learn. Res., vol. 5, pp. 1–25, Dec. 2004.
 [33] R. Storn and K. Price, “Differential Evolution – A Simple and Efficient Heuristic for global Optimization over Continuous Spaces,” Journal of Global Optimization, vol. 11, no. 4, pp. 341–359, Dec. 1997.
 [34] T. J. Richardson, M. A. Shokrollahi, and R. L. Urbanke, “Design of capacity-approaching irregular low-density parity-check codes,” IEEE Trans. on Information Theory, vol. 47, no. 2, pp. 619–637, Feb. 2001.
 [35] T. Grenager, R. Powers, and Y. Shoham, “Dispersion Games: General Definitions and Some Specific Learning Results,” in Proc. of 18th National/14th Conf. on Artificial Intelligence/Innovative Applications of Artificial Intelligence, AAAI ’02, Edmonton, AB, Canada, Jul. 2002.