# Multi-Agent Q-Learning Aided Backpressure Routing Algorithm for Delay Reduction

###### Abstract

In queueing networks, it is well known that the throughput-optimal backpressure routing algorithm results in poor delay performance for light and moderate traffic loads. To improve delay performance, the state-of-the-art backpressure routing algorithm (called BPmin [1]) exploits queue length information to direct packets to less congested routes to their destinations. However, the BPmin algorithm estimates route congestion based on the unrealistic assumption that every node in the network knows real-time global queue length information of all other nodes. In this paper, we propose a multi-agent Q-learning aided backpressure routing algorithm, where each node estimates route congestion using only local information of neighboring nodes. Our algorithm not only outperforms the state-of-the-art BPmin algorithm in delay performance but also retains the following appealing features: distributed implementation, low computation complexity and throughput-optimality. Simulation results show our algorithm reduces average packet delay by for light traffic loads and by for moderate traffic loads when compared to the state-of-the-art BPmin algorithm.

## I Introduction

The backpressure routing algorithm, which routes packets in a queueing network by congestion gradients, holds great potential for applications in different areas, such as sensor networks [2], mobile ad hoc networks [3] and transportation systems [4, 5]. It is well known that the backpressure routing algorithm achieves maximum network throughput (throughput optimality) by exploring all possible routes (even route loops) to balance traffic loads over the entire queueing network. This is effective for queueing networks with heavy traffic loads. However, for light and moderate traffic loads, excessive route exploration may direct packets to unnecessarily long routes or even route loops, as shown in Fig. 1, which results in poor delay performance [6].

To improve the delay performance of the backpressure routing algorithm, available works [7, 1, 8, 9, 10, 2, 11] aim at directing packets to shorter routes to their destinations by exploiting various information about queueing networks, such as queue length, shortest path length (distance of the shortest path between two nodes) and packet delay (see Section VI for details). Among these works, the state-of-the-art BPmin algorithm proposed in [1] significantly reduces the average packet delay of the backpressure routing algorithm. Under the BPmin algorithm, every node in a queueing network needs to know the queue length information of all other nodes in real time. Based on this queue length information, every node calculates the sum of queue lengths along each route as a route congestion estimate and then directs packets to the least congested routes to their destinations. However, in queueing networks it is unrealistic for nodes to collect such real-time global queue length information. Moreover, the BPmin algorithm requires knowledge of the network throughput capacity to make routing decisions, which is hard to determine.

In this paper, we propose a multi-agent Q-learning aided backpressure routing algorithm (QL-BP), where each node estimates route congestion using only local information of neighboring nodes. Specifically, every node under our QL-BP algorithm maintains multiple Q-learning agents, where each Q-learning agent continuously updates its route congestion estimate using neighboring nodes’ queue length information and neighboring nodes’ route congestion estimates. Based on the estimated route congestion, every node directs packets to the least congested routes to their destinations. Our algorithm not only outperforms the state-of-the-art BPmin algorithm in delay performance but also retains the following appealing features: distributed implementation, low computation complexity and throughput-optimality. Simulation results show our algorithm reduces average packet delay by for light traffic loads and by for moderate traffic loads when compared to the state-of-the-art BPmin algorithm.

The rest of this paper is organized as follows. In Section II, we introduce network models concerning communication links, resource allocation, transmission rates, packet generation, etc. In Section III, we describe in detail our multi-agent Q-learning aided backpressure routing algorithm (QL-BP). In Section IV, we analyze the performance of our QL-BP algorithm. In Section V, we run simulations to evaluate the delay performance of our QL-BP algorithm. We review related work in Section VI and conclude the paper in Section VII.

## II Network Model

| Notation | Definition |
|---|---|
| | The communication link between two nodes and , , is denoted by pair , , which is different from link . |
| | The state of link at slot , which represents factors affecting transmission rate of link at slot , like node position, channel fading and interference coefficients. |
| | , the matrix of all link states at slot . |
| | The resource allocation decision over link at slot , such as link activation, coding, modulation, etc. |
| | , the matrix of resource allocation decision over all links at slot . (respectively, ) under algorithm is denoted by (resp. ). |
| | The offered transmission rate (packets/slot) over link at slot under and . Actual data amount transmitted over link during slot may be less than due to insufficient data. |
| | , the matrix of offered transmission rates over all links. |
| | The offered transmission rate to commodity over link at slot , . Actual data amount of commodity transmitted over link during slot may be less than . Further, under algorithm is denoted by . |
| | The amount of packets node generates at slot , which are destined for node . |
| | The queue of node , which stores packets destined for node . |
| | The queue length of queue at slot , i.e., the number of packets queueing up at queue at slot . By convention, . |
| | , the matrix of queue length of all queues at slot . |
| | The bias associated with queue at slot . |
| | , the matrix of bias for all queues at slot . |
| | Matrix of information of a queueing network, which is used to extract bias. Examples include information of queue length, shortest path length, packet delay, etc. |
| | Route congestion estimated by Q-learning agent of node for routes of commodity and by the way of node ’s neighbor . |
| | , the matrix of route congestion estimates of node . |

We consider a multi-hop queueing network represented by a directed graph as shown in Fig. 2, where is the set of nodes and is the set of directed links. The whole network operates over discrete time slots . In the network, every node both transmits packets generated by itself and relays packets from other nodes to their destinations. For this purpose, each node maintains separate queues to store packets destined for different destinations. For example, queue of node stores packets destined for node . All packets destined for the same destination are referred to as commodity . Key definitions and notations used in the following are summarized in Table I.

Transmission rate is affected by random link states and resource allocation decisions such that

(1) |

where is the finite space of link states and is the finite space of resource allocation decisions. We assume that the outgoing transmission rate and incoming transmission rate of all nodes are upper bounded

(2) | ||||

(3) |

The amount of packets generated by each node at each slot is also upper bounded by a positive constant such that

(4) |

After time slots (called convergence interval in [12]), the queueing network arrives at steady state such that packet generating processes and link states converge as follows

(5) | ||||

(6) |

where is an indicator function that returns value if is true and otherwise, is the average rate of packet generating process , is the rate of link states being at state , and are two small positive numbers.

Queue length for two adjacent slots satisfies the following relationship

(7) |

because is the offered transmission rate to commodity over link at slot ; however, node may not have enough packets to transmit to node at slot .
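The one-slot queue recursion can be illustrated with a short sketch. Relation (7) is not reproduced in this text, so the code assumes the form commonly used for backpressure networks: a queue first serves up to its offered outgoing rate, then adds newly generated packets and packets arriving from neighbors. The function and argument names are hypothetical.

```python
def queue_upper_bound(q, offered_out, generated, offered_in):
    """Upper bound on the next-slot queue length (assumed standard form)."""
    return max(q - offered_out, 0) + generated + offered_in


def queue_actual(q, offered_out, generated, actual_in):
    """Actual next-slot queue length when neighbors send only what they hold."""
    sent = min(q, offered_out)  # a node cannot send more packets than it has queued
    return q - sent + generated + actual_in


# Node n holds 3 packets, is offered an outgoing rate of 2 and generates 1 packet.
# A neighbor is offered rate 2 towards n but holds only 1 packet to send.
q_next = queue_actual(q=3, offered_out=2, generated=1, actual_in=1)
q_bound = queue_upper_bound(q=3, offered_out=2, generated=1, offered_in=2)
```

In this example the actual queue length stays below the bound because the neighbor could not fill its offered rate, which is the situation the paragraph above describes.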

## III Multi-Agent Q-Learning Aided Backpressure Routing Algorithm

In this section, we introduce in detail our multi-agent Q-learning aided backpressure routing algorithm (QL-BP). First, we propose a bias based general framework for delay reduction in the backpressure routing algorithm. Then, we build the QL-BP algorithm on this general framework.

### III-A Bias Based General Framework

(8) |

(9) |

(10) |

(11) |

The whole bias based general framework consists of three stages: information collection, bias extraction and backpressure routing, as illustrated in Fig. 3 and summarized in Algorithm 1. At the information collection stage, our framework (referred to as BPBias) collects useful (local or global) information , such as queue length, shortest path length and packet delay, for delay reduction. At the bias extraction stage, the BPBias framework extracts useful features (e.g., route congestion estimates) from as the matrix of bias , where is the bias for queue at slot and is a function of and upper bounded by a positive constant ,

(12) | |||

(13) | |||

(14) |

At the final stage of backpressure routing, the BPBias framework incorporates the extracted bias into the backpressure routing algorithm, enabling the algorithm to adaptively change packet routes for delay reduction. Specifically, the bias for different queues can be dynamically adjusted according to real-time information , so that packets can be directed to better routes.
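To make the role of bias concrete, the following minimal sketch shows one common way a bias can enter a backpressure decision on a single link. Equations (8)-(11) are not reproduced in this text, so the additive form used here (queue length plus bias on each side of the link) is an assumption, and the function name and dictionary layout are hypothetical.

```python
def biased_link_weight(q_n, b_n, q_m, b_m):
    """Pick the commodity with the largest biased backlog differential on link (n, m).

    q_n, b_n: dicts mapping commodity -> queue length / bias at node n
    q_m, b_m: the same quantities at neighbor m
    Returns (best_commodity, weight); weight is 0 if no differential is positive.
    """
    best_c, best_w = None, 0
    for c in q_n:
        # Assumed additive form: bias is added to the queue length at both ends.
        diff = (q_n[c] + b_n[c]) - (q_m.get(c, 0) + b_m.get(c, 0))
        if diff > best_w:
            best_c, best_w = c, diff
    return best_c, best_w


q_n, q_m = {"a": 5, "b": 4}, {"a": 2, "b": 1}
zero = {"a": 0, "b": 0}
unbiased = biased_link_weight(q_n, zero, q_m, zero)
biased = biased_link_weight(q_n, {"a": 0, "b": 3}, q_m, {"a": 0, "b": 1})
```

With zero bias both commodities tie and the first one is kept; a larger bias at node n for commodity "b" tips the decision towards sending "b" over this link.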

The methods for extracting bias from can be either heuristic based methods [7, 1, 8, 9, 10, 2, 11] or machine learning based methods, like Q-learning [13]. This flexibility enables our framework to be very general and cover many bias based backpressure routing algorithms as special cases as listed in Table II.

Special Case | Condition |
---|---|
[7] | varies with real-time queue length and packet delay |
[1] | is set to be bias functions in [1] |
[8] | for all links |
[9] | is the shortest distance between node and node |
[10] | contains information of constraints on route length |
[2] | is a constant calculated as in [2] |
[11] | is a function of packet delay information |

### III-B QL-BP Algorithm

Based on this general BPBias framework, we propose multi-agent Q-learning aided backpressure routing algorithm (QL-BP), where Q-learning agents are responsible for extracting route congestion estimate from collected information, which is used as bias to aid backpressure routing algorithm to reduce packet delay.

From now on, we focus on queueing networks with independent links (e.g., wireline networks or wireless networks with orthogonal links). Under our QL-BP algorithm, each node maintains multiple Q-learning agents, where each agent is associated with one commodity and one neighbor of node , responsible for estimating the route congestion for routes of commodity and by the way of node ’s neighbor . Thus, each node maintains a table storing route congestion estimates.

At the information collection stage, each node observes local link states and collects local information by exchanging its own queue length and table of route congestion estimates with its neighboring nodes. At the bias extraction stage, each Q-learning agent of node updates its route congestion estimate as follows:

(15) |

where and are Q-learning parameters, . If , set . Each node calculates bias for commodity as

(16) |

Finally, at the backpressure routing stage, based on the extracted bias and observed link states, each node makes resource allocation and routing decisions as in BPBias.
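The bias extraction stage described above can be sketched as follows. Equations (15) and (16) are not reproduced in this text, so the rule below follows the classic Q-routing pattern rather than the authors' exact formulas. The parameter names `alpha` and `gamma`, the zero onward estimate at the destination, and the choice of the minimum over neighbors as the bias are all assumptions made for illustration.

```python
def update_estimate(z_n, q_m, z_m, commodity, neighbor, alpha=0.5, gamma=0.9):
    """One Q-routing-style update of node n's estimate via `neighbor`.

    z_n: dict (commodity, neighbor) -> node n's current route congestion estimate
    q_m: the neighbor's queue length for `commodity`
    z_m: dict (commodity, next_hop) -> the neighbor's own estimates; empty when the
         neighbor is the destination, so the onward estimate defaults to 0
    """
    onward = min((v for (c, _), v in z_m.items() if c == commodity), default=0.0)
    target = q_m + gamma * onward  # congestion at the neighbor plus beyond it
    key = (commodity, neighbor)
    z_n[key] = (1 - alpha) * z_n.get(key, 0.0) + alpha * target
    return z_n[key]


def bias(z_n, commodity):
    """Assumed bias: the smallest estimate over all neighbors for this commodity."""
    return min((v for (c, _), v in z_n.items() if c == commodity), default=0.0)


# Node n has two neighbors for commodity "d"; neighbor m1 reports a queue of 2
# packets and an onward estimate of 6 via its own next hop k.
table = {("d", "m1"): 4.0, ("d", "m2"): 10.0}
new_value = update_estimate(table, 2, {("d", "k"): 6.0}, "d", "m1", alpha=0.5, gamma=1.0)
current_bias = bias(table, "d")
```

Only the neighbor's queue length and its table of estimates are needed, which matches the local information exchange described for the information collection stage.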

From the description of QL-BP algorithm, we see that QL-BP algorithm is a special case of BPBias framework. For queueing networks with independent links, transmission rates of all links are also independent of each other. Thus, the maximum of the weighted sum of (10) can be achieved by each node independently maximizing corresponding terms as follows:

(17) | ||||

where denotes the available resource allocation decision for link . Therefore, our QL-BP algorithm can be implemented in a distributed way. Furthermore, since each node under the QL-BP algorithm only needs to exchange information with neighboring nodes and maximizes the weighted sum locally, the computation complexity of the QL-BP algorithm is low compared to algorithms that globally maximize the weighted sum of (10).
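The separability argument can be checked numerically on a toy instance. The sketch below compares a brute-force search over all joint decisions with independent per-link maximization, assuming each link's rate depends only on its own decision. The link names, weights and rate tables are hypothetical values chosen for illustration.

```python
from itertools import product


def joint_maximum(weights, rates):
    """Brute-force search over all joint resource allocation decisions."""
    links = list(weights)
    best = float("-inf")
    for decision in product(*(range(len(rates[l])) for l in links)):
        total = sum(weights[l] * rates[l][i] for l, i in zip(links, decision))
        best = max(best, total)
    return best


def per_link_maximum(weights, rates):
    """Each link maximizes its own term using only local information."""
    return sum(max(weights[l] * r for r in rates[l]) for l in weights)


# Hypothetical weights and per-decision transmission rates for three links.
weights = {("n", "a"): 5, ("n", "b"): 0, ("m", "a"): 2}
rates = {("n", "a"): [0, 1], ("n", "b"): [0, 1], ("m", "a"): [0, 1, 2]}
```

The brute-force search grows with the product of the per-link decision counts, while the per-link version grows with their sum, which is the source of the low computation complexity noted above.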

###### Remark 1

Our QL-BP algorithm can be further improved by considering shortest path information (referred to as QLSP-BP). Under QLSP-BP algorithm, each node calculates bias for commodity as

(18) |

where is the length of the shortest path from node to node . The rest of QLSP-BP is the same as QL-BP.
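The shortest path length used by QLSP-BP can be obtained with a breadth-first search when every link counts as one hop. The sketch below only computes these lengths. How they are combined with the route congestion estimate in (18) is not reproduced in this text and is therefore not shown. The adjacency layout and node names are hypothetical.

```python
from collections import deque


def shortest_path_lengths(adjacency, destination):
    """Hop count from every reachable node to `destination` by breadth-first search.

    adjacency: dict node -> iterable of neighbors (links assumed bidirectional)
    """
    dist = {destination: 0}
    frontier = deque([destination])
    while frontier:
        node = frontier.popleft()
        for nbr in adjacency[node]:
            if nbr not in dist:  # first visit gives the fewest hops
                dist[nbr] = dist[node] + 1
                frontier.append(nbr)
    return dist


# A small four-node example with a shortcut between a and c.
adjacency = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
hops_to_d = shortest_path_lengths(adjacency, "d")
```

Because these lengths depend only on the topology, they can be computed once per destination and reused in every slot.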

## IV Algorithm Performance Analysis

In this section, we show that our QL-BP algorithm is also throughput-optimal.

First, we introduce the following definitions concerning queueing network stability, queueing network stability region and throughput optimality.

###### Definition 1 (Network Stability [12])

A single queue is said to be stable if as , where

(19) |

A queueing network is said to be stable if all queues are stable.

###### Definition 2 (Network Stability Region)

Packet generating rates are said to be supported by a queueing network if the queueing network can be stabilized by some routing algorithm under . The network stability region is the closure of the set of all packet generating rates that can be supported by the queueing network.

###### Definition 3 (Throughput Optimality)

An algorithm is said to be throughput optimal if it can stabilize the queueing network for all packet generating rates that are within network stability region , i.e., , .

Then, we establish the throughput-optimality of our QL-BP algorithm.

###### Theorem 1 (QL-BP Throughput-Optimality)

For a queueing network with stability region , our QL-BP algorithm is throughput optimal.

Proof: Since QL-BP algorithm is a special case of BPBias framework, if we can prove that the general BPBias is throughput optimal, then QL-BP algorithm is also throughput optimal.

According to the definition of throughput-optimality, we need to prove that for any packet generating rates within the network stability region , i.e., , , our framework BPBias can stabilize the queueing network. Some steps of the following proof are similar to those in [12] and are included here for completeness.

Recall that is the convergence interval of queueing network . For any routing algorithm and time interval , queue length satisfies the following relationship

(20) |

where

(21) | ||||

(22) |

Refer to Appendix -A for its derivation. Since bias from , we have

(23) |

After squaring both sides of and basic algebraic manipulations, we get

(24) |

Define Lyapunov function . By summing over all nodes and commodities and taking conditional expectations, we get the -step Lyapunov drift as follows:

(25) |

where is a constant given by

(26) |

and the expectation is with respect to random link states , packet generating processes and resource allocation decisions . Inequality can also be written alternatively as

(27) |

where

(28) | |||

(29) |

Next, we show that for any such that , , our framework stabilizes the queueing network and thus our framework is throughput-optimal.

Let and be the quantity under BPBias and under any other algorithm, respectively. Then, we have the following relationship

(30) |

where is given in . Refer to Appendix -B for its derivation.

Thus, the -step Lyapunov drift under BPBias algorithm

(31) | |||

(32) | |||

(33) |

According to Theorem 6 [12], we know that for any such that , , there exists a stationary randomized routing algorithm , which makes resource allocation and routing decisions independent of and , such that

(34) |

Substituting into , we get

(35) | |||

(36) | |||

(37) |

where .

Since is the convergence interval of the queueing network, it is easy to determine the value of such that , i.e., . From the -step Lyapunov drift bound and Lemma 2 [12], we know that our BPBias framework stabilizes the queueing network for any such that , , and thus it is throughput optimal. Therefore, QL-BP algorithm is also throughput optimal. This completes the proof.

## V Simulation

In this section, we evaluate the delay performance of our QL-BP algorithm by simulations and compare it to other variants of backpressure routing algorithms.

### V-A Simulation Setup

We consider the network topology shown in Fig. 4, which consists of nodes, indexed by a pair of coordinates. All links are bidirectional and the maximum data transmission rate of every link is 1 packet/slot. We assume all links can transmit packets simultaneously without interfering with each other, as in wireline networks or wireless networks with orthogonal channels. We consider traffic flows with the following source-destination pairs: ((1,3),(2,5)), ((2,3),(2,7)), ((2,2),(1,6)), ((3,4),(2,7)), ((1,1),(1,7)), ((4,3),(5,4)), ((4,6),(6,6)), and ((5,3),(5,6)). All source nodes generate packets according to a Poisson distribution with rate packets/slot. We implemented in Python our QL-BP algorithm, the QLSP-BP algorithm, the traditional backpressure routing algorithm (BP) [9, 12], the shortest path based backpressure routing algorithm (SP-BP) [9, 12] and the state-of-the-art BPmin algorithm [1]. For our QL-BP and QLSP-BP algorithms, we set Q-learning parameters to enable agents to quickly update their route congestion estimates. We run simulations for slots for each simulation setting and calculate the average delay of packets received by destinations under different algorithms.
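The two measurement ingredients of this setup, Poisson packet generation and average delay of delivered packets, can be sketched on a toy single-link example. This is not the authors' simulator: it models one first-in-first-out queue served at 1 packet/slot, and the function names, the seed handling and the convention that a packet served in slot t is delivered at the end of that slot are assumptions.

```python
import math
import random
from collections import deque


def poisson(rate, rng):
    """Draw one Poisson sample with Knuth's multiplication method."""
    threshold, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def average_delay(rate, slots, seed=0):
    """Average delay (in slots) of packets served by one 1-packet/slot link."""
    rng = random.Random(seed)
    queue, delays = deque(), []
    for t in range(slots):
        queue.extend([t] * poisson(rate, rng))  # store each packet's arrival slot
        if queue:
            delays.append(t + 1 - queue.popleft())  # delivered at the end of slot t
    return sum(delays) / len(delays) if delays else 0.0


light = average_delay(rate=0.2, slots=5000)
heavy = average_delay(rate=0.8, slots=5000)
```

Averaging only over delivered packets mirrors the metric described above. In a full network simulation the arrival slot would travel with the packet across hops until it reaches its destination.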

### V-B Simulation Results

From Fig. 5, we can observe that our QL-BP algorithm reduces average packet delay by when compared to the traditional BP algorithm under light traffic loads with and by under moderate traffic loads with , indicating that the QL-BP algorithm effectively learns route congestion and adaptively directs packets to better routes. However, the QL-BP algorithm results in higher packet delay than the state-of-the-art BPmin algorithm. This is because nodes under the BPmin algorithm have perfect real-time global queue length information and thus can accurately estimate the congestion of different routes and direct packets to the least congested routes. Nodes under the QL-BP algorithm only know local information of neighboring nodes and thus can only loosely estimate the congestion of different routes, which may lead to directing packets to suboptimal routes. However, the BPmin algorithm is not realistic, since real-time global queue length information is hard for nodes to collect in the real world. Our QL-BP algorithm trades off some packet delay for distributed implementation and low computation complexity, and thus can be easily deployed in real queueing networks.

Our QL-BP algorithm can be greatly improved by considering shortest path information. From Fig. 5, we see that the QLSP-BP algorithm outperforms all variants of backpressure routing algorithms, including the state-of-the-art BPmin algorithm: it reduces average packet delay by for light traffic loads with and by for moderate traffic loads with when compared to the BPmin algorithm. In summary, our algorithm not only can be easily deployed in real queueing networks but also achieves the best delay performance when compared to other variants of backpressure routing algorithms.

## VI Related Work

The traditional backpressure routing algorithm routes packets according to congestion gradients, like water flowing through pipe networks according to pressure gradients [3]. According to the traditional backpressure routing algorithm, the pressure of a queue is defined to be the number of packets queueing up at that queue (queue length). The pressure gradient between two queues of neighboring nodes is defined to be the difference of their queue pressure. The traditional backpressure routing algorithm routes packets based on only pressure gradients between neighboring nodes, i.e., local queue length information, without considering queue length of farther nodes and location of destinations. Its short-sightedness to farther nodes and blindness to destinations result in poor delay performance.

Available works on improving the delay performance of the backpressure routing algorithm exploit various information about queueing networks, such as queue length [7, 1, 8], shortest path length (distance of the shortest path between two nodes) [9, 10, 2] and packet delay [7, 11]. Despite their different forms, these works share a common characteristic: using bias to help backpressure routing reduce packet delay. According to whether the bias value varies with time, these works are classified into two groups: backpressure routing with constant bias and backpressure routing with time-varying bias.

For backpressure routing with constant bias, Neely et al. [9] proposed an enhanced backpressure routing algorithm, where the constant bias is shortest path length. They combined the information of queue length and shortest path length to route packets in the direction of their destinations to shorten packet routes. Ying et al. [10] also used shortest path length as constant bias, however in a different way, to shorten packet routes, where they imposed constraints on length of packet routes and maintained shortest path based queues to help meet constraints. Instead of constructing constant bias from only shortest path length information, Jiao et al. [2] built constant bias as a function of packet arrival rates, link transmission rates, and shortest path length. Athanasopoulou et al. [8] proposed an -Backpressure routing algorithm, where is a constant bias whose value is properly tuned to avoid long packet routes. Yin et al. [7] proposed a variant of backpressure routing algorithm, whose route searching process dynamically switches between shortest path mode and traditional backpressure routing mode based on constant bias (called threshold in [7]) to reduce packet delay.

For backpressure routing with time-varying bias, Ji et al. [11] introduced a delay-based backpressure routing algorithm to reduce packet delay for light traffic loads (called the last packet problem in [11]), where the time-varying bias is the delay of the head-of-line packet. Cui et al. [1] showed that time-varying bias based backpressure routing algorithms can significantly reduce packet delay. They also proposed two specific time-varying bias based backpressure routing algorithms: one considering local queue length information of up to two-hop nodes, the other considering global queue length information of all nodes (called BPmin). Among these works, BPmin proposed in [1] achieved the state-of-the-art result in delay performance of the backpressure routing algorithm.

## VII Conclusion

In this paper, we proposed a multi-agent Q-learning aided backpressure routing algorithm. Our algorithm not only outperforms variants of backpressure routing algorithms, including the state-of-the-art BPmin algorithm, in delay performance but also retains the following appealing features: distributed implementation, low computation complexity and throughput-optimality. In the future, we will explore backpressure routing algorithms aided by more advanced learning methods (like deep learning [14]) and run simulations to compare their delay performance with existing variants of backpressure routing.

### -A Derivation of inequality

We use mathematical induction to prove inequality . Basis: From , we know for

(38) |

Inductive step: Assume the inequality holds for

(39) |

We need to prove the inequality still holds for . From , we know that

(40) | |||

(41) | |||

(42) | |||

(43) | |||

(44) |

where follows from and , follows from and , follows from and . Thus, the inequality holds for any including .

###### Lemma 1

(45) | ||||

(46) |

Proof: For , we have

(47) | ||||

(48) |

By comparing equations with , we have . Similarly, we have

(49) | ||||

(50) |

By comparing equations with , we have .

### -B Derivation of inequality

For the inner terms of , we have the following identity

(51) |

where the condition is removed because . Thus, quantity can be alternatively written as