# QPS-r: A Cost-Effective Crossbar Scheduling Algorithm and Its Stability and Delay Analysis

###### Abstract

Switches and routers today primarily employ an input-queued (IQ) crossbar architecture to interconnect the input ports with the output ports. In an IQ switch, a crossbar schedule, or a matching between the input ports and the output ports needs to be computed for each switching cycle, or time slot. It is a challenging research problem to design switching algorithms that can produce high-quality matchings – those resulting in high switch throughput and low queueing delays – yet have a very low computational complexity (to support high link speeds such as 40 Gbps per port) when the switch has a large number of ports (e.g., 128). Indeed, there appears to be a fundamental tradeoff between the computational complexity of the matching (scheduling) algorithm and the quality of the computed matchings.

Parallel maximal matching algorithms (adapted for switching) appear to have struck the best such tradeoff. On one hand, they provide the following Quality of Service (QoS) guarantees: Using maximal matchings as crossbar schedules results in at least 50% switch throughput and order-optimal (i.e., independent of the switch size $N$) average delay bounds for various traffic arrival processes. On the other hand, using $N$ processors (one per port), their per-port computational complexity can be as low as $O(\log^2 N)$ (more precisely, $O(\log N)$ iterations that each have $O(\log N)$ computational complexity) for an $N \times N$ switch.

In this work, we propose QPS-r, a parallel iterative switching algorithm that has the lowest possible computational complexity: $O(1)$ per port. Yet, the matchings that QPS-r computes have the same quality as maximal matchings in the following sense: Using such matchings as crossbar schedules results in exactly the same aforementioned provable throughput and delay guarantees as using maximal matchings, as we show using Lyapunov stability analysis. Although QPS-r builds upon an existing add-on technique called Queue-Proportional Sampling (QPS), we are the first to discover and prove this nice property of such matchings. We also demonstrate that QPS-3 (running 3 iterations) has empirical throughput and delay performance comparable to that of iSLIP (running $\log_2 N$ iterations), a refined and optimized representative maximal matching algorithm adapted for switching.

## I Introduction

The volume of network traffic across the Internet and in data-centers continues to grow relentlessly, thanks to existing and emerging data-intensive applications, such as big data analytics, cloud computing, and video streaming. At the same time, the number of network-connected devices is exploding, fueled by the wide adoption of smart phones and the emergence of the Internet of things. To transport and “direct” this massive amount of traffic to their respective destinations, switches and routers capable of connecting a large number of ports (called high-radix [1, 2]) and operating at very high line rates are badly needed.

Many present day switching systems in Internet routers and data-center switches employ an input-queued crossbar to interconnect their input ports and output ports (e.g., Cisco Nexus 5000 Series [3], Arista 7500 Switch [4], and Juniper QFX 10000 Switches [5]). Though it was commonly believed that such a (monolithic) crossbar is difficult to scale beyond 64 (input/output) ports, recent advances in switching hardware technologies (e.g., [6, 7, 1]) have made high-radix crossbars not only technologically feasible but also economically and environmentally (i.e., more energy-efficient) favorable, as compared to low-radix crossbars.

In an input-queued crossbar switch, each input port can be connected to only one output port and vice versa in each switching cycle or time slot. Hence, in every time slot, the switch needs to compute a one-to-one matching between input and output ports (i.e., the crossbar schedule). A major research challenge in designing high-link-rate high-radix switches (e.g., 128 ports or beyond of 40 Gbps each) is to develop algorithms that can compute “high quality” matchings – i.e., those that result in high switch throughput and low queueing delays for packets – in a few nanoseconds. For example, with a cell size of 64 bytes (the minimum cell size used in Arista 7500 Switch [4]), a switch supporting 40 Gbps per-port rates has to compute a matching every 12.8 nanoseconds. (It is common practice in switches/routers to slice incoming packets into cells of a fixed size or a range of variable sizes [8, Chapter 2, Page 21]; for example, Juniper QFX 10000 switch [5] uses cell sizes varying between 96 and 176 bytes, and Huawei NE40E switch [9] uses a fixed cell size.) Clearly, a suitable matching algorithm has to have very low (ideally $O(1)$) computational complexity, yet output “fairly good” matching decisions most of the time.
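The 12.8 ns figure follows directly from the cell size and the line rate. A minimal sketch of that arithmetic is below; the function name `cell_time_ns` is a hypothetical helper introduced only for illustration.

```python
def cell_time_ns(cell_bytes: int, line_rate_gbps: float) -> float:
    """Time to transmit one cell at the given line rate, in nanoseconds."""
    bits = cell_bytes * 8
    # 1 Gbps equals 1 bit per nanosecond, so bits / Gbps yields nanoseconds.
    return bits / line_rate_gbps


# A 64-byte cell on a 40 Gbps port: the scheduler's per-matching time budget.
budget = cell_time_ns(64, 40)
```

The same helper shows how the budget changes with cell size or line rate, for instance with the larger cells mentioned for other switches.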

### I-A The Family of Maximal Matchings

A family of parallel iterative algorithms for computing maximal matchings (to be precisely defined in § II-B) are arguably the best candidates for crossbar scheduling in high-link-rate high-radix switches, because they have reasonably low computational complexities, yet can provide fairly good QoS guarantees. More specifically, using maximal matchings as crossbar schedules results in at least 50% switch throughput in theory (and usually much higher throughput in practice), as shown in [10]. In addition, it results in low packet delays that also have excellent scaling behaviors, such as being order-optimal (i.e., independent of switch size $N$) under various traffic arrival processes when the offered load is less than 50% (i.e., within the provable stability region), as shown in [11, 12]. In comparison, matchings of higher quality such as maximum matching and maximum weighted matching are much more expensive to compute, as will be elaborated in § II-B. Hence, it is fair to say that maximal matching algorithms overall deliver the biggest “bang” (performance) for the “buck” (computational complexity).

Unfortunately, parallel maximal matching algorithms are still not “dirt cheap” computationally. More specifically, all existing parallel/distributed algorithms that compute maximal matchings on general bipartite graphs (i.e., without additional constraints or conditions such as the graph being degree-bounded [13] and/or already edge-colored [14]) require a minimum of $O(\log N)$ iterations (rounds of message exchanges). This minimum is attained by the classical algorithm of Israeli and Itai [15]; the PIM algorithm [16] is a slight adaptation of this classical algorithm to the switching context, and iSLIP [17] further improves upon PIM by reducing its per-iteration per-port computational complexity to $O(\log N)$ via de-randomizing a computationally expensive ($O(N)$ complexity to be exact) operation in PIM. Although parallel iterative maximal matching algorithms and their variants are found in real-world products (e.g., Cisco Nexus 5548P switch uses an enhanced iSLIP algorithm [18]), it is hard to scale them to a large number of switch ports. For example, recent experiments in [6, 19] demonstrated the feasibility of using iSLIP (or its variants) for or larger crossbar switches, but at the cost of cutting corners on the matching computation (e.g., running a single iteration instead of $\log_2 N$ iterations), which results in lower-quality crossbar schedules and poorer throughput and delay performance.

### I-B QPS-r: Bigger Bang for the Buck

In this work, we propose QPS-r, a parallel iterative algorithm that has the lowest possible computational complexity: $O(1)$ per port. More specifically, QPS-r requires only $r$ (a small constant independent of $N$) iterations to compute a matching, and the computational complexity of each iteration is only $O(1)$; here QPS stands for Queue-Proportional Sampling, an add-on technique proposed in [20] that we will describe shortly. Yet, even the matchings that QPS-1 (running only a single iteration) computes have the same quality as maximal matchings (running $O(\log N)$ iterations) in the following sense: Using such matchings as crossbar schedules results in exactly the same aforementioned provable throughput and delay guarantees as using maximal matchings, as we will show using Lyapunov stability analysis. Note that QPS-r performs as well as maximal matching algorithms not just in theory: We will show that QPS-3 (running 3 iterations) has empirical throughput and delay performance comparable to that of iSLIP (running $\log_2 N$ iterations) under various workloads.

QPS-r has another advantage over parallel iterative maximal matching algorithms such as iSLIP and PIM: Its per-port communication complexity is also $O(1)$, much smaller than that of maximal matching algorithms such as iSLIP. In each iteration of QPS-r, each input port sends a request to only a single output port. In comparison, in each iteration of PIM or iSLIP, each input port has to send requests to all output ports to which the corresponding VOQs are nonempty, which incurs $O(N)$ communication complexity per port.

Although QPS-r builds on the QPS data structure and algorithm proposed in [20], our work on QPS-r is very different in three important aspects. First, in [20], QPS was used only as an add-on to other crossbar scheduling algorithms such as SERENA [21] and iSLIP [17] by generating a starter matching for other switching algorithms to further refine, whereas in this work, QPS-r is used only as a stand-alone algorithm. Second, we are the first to discover and prove that (QPS-r)-generated matchings and maximal matchings provide exactly the same aforementioned QoS guarantees, whereas in [20], no such mathematical similarity or connection was mentioned. Third, the establishment of this mathematical similarity is an important theoretical contribution in itself, because maximal matchings have long been established as a cost-effective family both in switching [16, 17] and in wireless networking [11, 12], and with this connection we have considerably enlarged this family.

Although we show that QPS-r has exactly the same throughput and delay bounds as those of maximal matchings established in [10, 11, 12], our proofs are different for the following reason. A departure inequality (see Property 1), satisfied by all maximal matching algorithms, was used in the stability analysis of [10] and the delay analysis of [11, 12]. This inequality is not satisfied by QPS-r in general. QPS-r does, however, satisfy this departure inequality in expectation, which is a weaker guarantee, and we show that this is enough to obtain the throughput and delay bounds in our proofs.

The rest of this paper is organized as follows. In § II, we provide some background on the input-queued crossbar switches. In § III, we first review QPS, and then describe QPS-r. Then in § IV, we derive the throughput and the queue length (and delay) bounds of QPS-r, followed by the performance evaluation in § V. In § VI, we survey related work before concluding this paper in § VII.

## II Background on Crossbar Scheduling

In this section, we provide a brief introduction to the crossbar scheduling (switching) problem, and describe and compare the aforementioned three different types of matchings. Throughout this paper we adopt the aforementioned standard assumption [8, Chapter 2, Page 21] that all the incoming variable-size packets are first segmented into fixed-size packets (also referred to as cells), and then reassembled at their respective output ports before leaving the switch. Each fixed-size cell takes one time slot to switch. We also assume that all input links/ports and output links/ports operate at the same normalized line rate of $1$, and so do all wires and crosspoints inside the crossbar.

### II-A Input-Queued Crossbar Switch

In an input-queued crossbar switch, each input port has $N$ Virtual Output Queues (VOQs) [22]. The $j^{th}$ VOQ at input port $i$ serves as a buffer for packets going from input port $i$ to output port $j$. The use of VOQs solves the Head-of-Line (HOL) blocking issue [23], which severely limits the throughput of the switch system.

An input-queued crossbar can be modeled as a weighted bipartite graph, of which the two disjoint vertex sets are the $N$ input ports and the $N$ output ports respectively. In this bipartite graph, there is an edge between input port $i$ and output port $j$ if and only if the $j^{th}$ VOQ at input port $i$, the corresponding VOQ, is nonempty. The weight of this edge is defined as the length of (i.e., the number of packets buffered at) this VOQ. A set of such edges constitutes a valid crossbar schedule, or a matching, if no two of them share a common vertex. The weight of a matching is the total weight of all the edges belonging to it (i.e., the total length of all corresponding VOQs).

Each such matching $M$ can be represented as an $N \times N$ sub-permutation matrix $S = (s_{ij})$ (a $0$-$1$ matrix that contains at most one entry of “$1$” in each row and in each column) as follows: $s_{ij} = 1$ if and only if the edge between input port $i$ and output port $j$ is contained in $M$ (i.e., input port $i$ is matched to output port $j$ in $M$). To avoid any confusion, only $S$ (not $M$) is used to denote a matching in the sequel, and it can be both a set (of edges) and a matrix.

### II-B Maximal Matching

As mentioned in § I, three types of matchings play important roles in crossbar scheduling problems: (I) maximal matchings, (II) maximum matchings, and (III) maximum weighted matchings. A matching is called a maximal matching if it is no longer a matching when any edge not in it is added to it. A matching with the largest possible number of edges is called a maximum matching or maximum cardinality matching. Neither maximal matchings nor maximum matchings take into account the weights of edges, whereas maximum weighted matchings do. A maximum weighted matching is one that has the largest total weight among all matchings. By definition, any maximum matching or maximum weighted matching is also a maximal matching, but neither converse is generally true.
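The distinction between maximal and maximum matchings can be made concrete with a small check. The sketch below assumes a queue-length matrix `q` given as a list of lists and a matching given as a list of `(input, output)` pairs; the function names are hypothetical and introduced only for illustration.

```python
def is_matching(edges):
    """True if no two edges (i, j) share an input port or an output port."""
    inputs = [i for i, _ in edges]
    outputs = [j for _, j in edges]
    return len(set(inputs)) == len(inputs) and len(set(outputs)) == len(outputs)


def is_maximal(edges, q):
    """True if `edges` is a matching and no nonempty VOQ (i, j) can be added."""
    if not is_matching(edges):
        return False
    used_in = {i for i, _ in edges}
    used_out = {j for _, j in edges}
    n = len(q)
    # The matching is maximal when every nonempty VOQ conflicts with it.
    return not any(
        q[i][j] > 0 and i not in used_in and j not in used_out
        for i in range(n)
        for j in range(n)
    )


# VOQ (1, 1) is empty, so the single edge (0, 0) already blocks every other
# nonempty VOQ: it is maximal, although a 2-edge (maximum) matching exists.
q = [[1, 1], [1, 0]]
small = [(0, 0)]
large = [(0, 1), (1, 0)]
```

In this example both `small` and `large` are maximal, but only `large` is a maximum matching, which illustrates why the converse of the last statement above fails.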

As mentioned earlier, the family of maximal matchings has long been recognized as a cost-effective family for crossbar scheduling. Compared to maximal matching, maximum weighted matching (MWM) (i.e., the well-known MaxWeight scheduler [24] in the context of crossbar scheduling) is much less cost-effective. Although MWM provides stronger QoS guarantees such as $100\%$ switch throughput [25, 26] and $O(N)$ average packet delay [27] in theory (and usually even better empirical delay performance in practice as shown in [25]), the state-of-the-art serial MWM algorithm (suitable for switching) has a prohibitively high computational complexity of $O(N^{2.5} \log W)$ [28], where $W$ is the maximum possible weight (length) of an edge (VOQ). By the same measure, maximum matching is not a great deal either: It is only slightly cheaper to compute than MWM, yet using maximum matchings as crossbar schedules generally cannot guarantee 100% throughput [29].

Compared to maximal matching algorithms, QPS-r provides the same provable QoS guarantees at a much lower computational complexity. More specifically, in a single iteration (i.e., with $r = 1$), QPS-r computes a matching that is generally not maximal, yet using such matchings as crossbar schedules can result in the same provable throughput guarantee (at least 50%) and delay bounds as using maximal matchings, as we will show in § IV. QPS-r can make do with less (iterations) because the queue-proportional sampling operation implicitly makes use of the edge weight (VOQ length) information, which maximal matching algorithms do not. One major contribution of this work is to discover the family of (QPS-r)-generated matchings that is even more cost-effective.

## III The QPS-r Algorithm

The QPS-r algorithm simply runs $r$ iterations of QPS (Queue-Proportional Sampling) [20] to arrive at a matching, so its computational complexity per port is exactly $r$ times that of QPS. Since $r$ is a small constant, it is $O(1)$, the same as that of QPS. In the following two subsections, we describe QPS and QPS-r respectively in more detail.

### III-A Queue-Proportional Sampling (QPS)

QPS was used in [20] as an “add-on” to augment other switching algorithms as follows. It generates a starter matching, which is then populated (i.e., adding more edges to it) and refined, by other switching algorithms such as iSLIP [17] and SERENA [30], into a final matching. To generate such a starter matching, QPS needs to run only one iteration, which consists of two phases, namely, a proposing phase and an accepting phase. We briefly describe them in this section for this paper to be self-contained.

#### III-A1 The Proposing Phase

In this phase, each input port proposes to exactly one output port – decided by the QPS strategy – unless it has no packet to transmit. Here we will only describe the operations at input port $1$; those at any other input port are identical. Like in [20], we denote by $m_1, m_2, \cdots, m_N$ the respective queue lengths of the $N$ VOQs at input port $1$, and by $m$ their total (i.e., $m \triangleq \sum_{k=1}^{N} m_k$). Input port $1$ simply samples an output port $j$ with probability $m_j / m$, i.e., proportional to $m_j$, the length of the corresponding VOQ (hence the name QPS); it then proposes to output port $j$, with the value $m_j$ that will be used in the next phase. The computational complexity of this QPS operation, carried out using a simple data structure proposed in [20], is $O(1)$ per (input) port.

#### III-A2 The Accepting Phase

We describe only the action of output port $1$ in the accepting phase; that of any other output port is identical. The action of output port $1$ depends on the number of proposals it receives. If it receives exactly one proposal from an input port, it will accept the proposal and match with the input port. However, if it receives proposals from multiple input ports, it will accept the proposal accompanied with the largest VOQ length (called the “longest VOQ first” accepting strategy), with ties broken uniformly at random. The computational complexity of this accepting strategy is $O(1)$ on average and can be made $O(1)$ even in the worst case [20].

### III-B The QPS-r Scheme

The QPS-r scheme simply runs $r$ QPS iterations. In each iteration, each input port that is not matched yet first proposes to an output port according to the QPS proposing strategy; each output port that is not matched yet accepts a proposal (if it has received any) according to the “longest VOQ first” accepting strategy. Hence, if an input port has to propose multiple times (once in each iteration), due to all its proposals (except perhaps the last) being rejected, the identities of the output ports it “samples” (i.e., proposes to) during these iterations are samples with replacement, which more precisely are i.i.d. random variables with a queue-proportional distribution.
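The proposing and accepting phases, repeated for a fixed number of iterations, can be sketched as follows. This is a minimal centralized simulation under stated assumptions, not the implementation from [20]: it samples with `random.choices`, which takes time linear in the number of ports rather than using the constant-time data structure the text refers to, and the function name `qps_r` and its signature are hypothetical.

```python
import random


def qps_r(q, r, rng=random):
    """Run r QPS iterations on queue-length matrix q; return {input: output}."""
    n = len(q)
    match = {}            # input port -> output port
    matched_out = set()   # output ports that have already accepted a proposal
    for _ in range(r):
        proposals = {}    # output port -> list of (VOQ length, input port)
        # Proposing phase: each unmatched, backlogged input samples one output
        # with probability proportional to the corresponding VOQ length.
        for i in range(n):
            if i in match or sum(q[i]) == 0:
                continue
            j = rng.choices(range(n), weights=q[i])[0]
            proposals.setdefault(j, []).append((q[i][j], i))
        # Accepting phase: each unmatched output accepts the longest VOQ.
        for j, props in proposals.items():
            if j in matched_out:
                continue  # proposals to an already matched output are wasted
            longest = max(length for length, _ in props)
            winners = [i for length, i in props if length == longest]
            i = rng.choice(winners)  # ties broken uniformly at random
            match[i] = j
            matched_out.add(j)
    return match
```

Because an input re-samples from the same distribution in every iteration, this sketch samples with replacement, so an input may propose again to an output that has already rejected it.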

At first glance, sampling with replacement may appear to be an obviously suboptimal strategy for the following reason. There is a nonzero probability for an input port to propose to the same output port multiple times, but since the first (rejected) proposal implies this output port has already accepted “someone else” (a proposal from another input port), all subsequent proposals to this output port will surely go to waste. For this reason, sampling without replacement (i.e., avoiding all output ports proposed to before) may sound like an obviously better strategy. However, it is really not, since compared to sampling with replacement, it has a much higher computational complexity of $O(\log N)$, but improves the throughput and delay performance only slightly according to our simulation studies.

## IV Throughput and Delay Analysis

In this section, we show that QPS-1 (i.e., running a single QPS iteration) delivers exactly the same provable throughput and delay guarantees as maximal matching algorithms. When $r > 1$, QPS-r in general has better throughput and delay performance than QPS-1, as more input and output ports can be matched up during subsequent iterations, although we are not able to derive better bounds.

### IV-A Preliminaries

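In this section, we introduce the notation and assumptions that will later be used in our derivations. We define three $N \times N$ matrices $Q(t)$, $A(t)$, and $D(t)$. Let $Q(t) = \big(q_{ij}(t)\big)$ be the queue length matrix where each $q_{ij}(t)$ is the length of the $j^{th}$ VOQ at input port $i$ during time slot $t$. With a slight abuse of notation, we refer to this VOQ as $q_{ij}$ (without the $t$ term).

We define $Q_{i*}(t)$ and $Q_{*j}(t)$ as the sum of the $i^{th}$ row and the sum of the $j^{th}$ column respectively of $Q(t)$, i.e., $Q_{i*}(t) \triangleq \sum_{j} q_{ij}(t)$ and $Q_{*j}(t) \triangleq \sum_{i} q_{ij}(t)$. With a similar abuse of notation, we define $Q_{i*}$ as the VOQ set $\{q_{ij}\}_{j=1}^{N}$ (i.e., those on the $i^{th}$ row), and $Q_{*j}$ as $\{q_{ij}\}_{i=1}^{N}$ (i.e., those on the $j^{th}$ column).

Now we introduce a concept that lies at the heart of our derivations: neighborhood. For each VOQ $q_{ij}$, we define its neighborhood as $Q_{i*} \cup Q_{*j}$, the set of VOQs on the $i^{th}$ row or the $j^{th}$ column. We denote this neighborhood as $Q^{\dagger}_{ij}$, since it has the shape of a cross. Figure 1 illustrates $Q^{\dagger}_{ij}$, where the row and column in the shadow are the VOQ sets $Q_{i*}$ and $Q_{*j}$ respectively. $Q^{\dagger}_{ij}$ can be viewed as the interference set of VOQs for VOQ $q_{ij}$ [11, 12], as no other VOQ in $Q^{\dagger}_{ij}$ can be active (i.e., transmit packets) simultaneously with $q_{ij}$. We define $Q^{\dagger}_{ij}(t)$ as the total length of all VOQs in (the set) $Q^{\dagger}_{ij}$ at time slot $t$, that is

$$Q^{\dagger}_{ij}(t) \triangleq Q_{i*}(t) + Q_{*j}(t) - q_{ij}(t). \tag{1}$$

Here we need to subtract the term $q_{ij}(t)$ so that it is not double-counted (in both $Q_{i*}(t)$ and $Q_{*j}(t)$).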
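As a quick illustration of the cross-shaped neighborhood sum, the sketch below computes the row-plus-column total for one VOQ of a small matrix. The helper name `cross_sum` and the 0-indexed ports are assumptions made for this example only.

```python
def cross_sum(q, i, j):
    """Total length of the VOQs on row i or column j of the matrix q."""
    row_total = sum(q[i])
    col_total = sum(q[k][j] for k in range(len(q)))
    # q[i][j] lies on both the row and the column, so subtract it once.
    return row_total + col_total - q[i][j]


q = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
center = cross_sum(q, 1, 1)   # row 4+5+6, column 2+5+8, minus the shared 5
```

Without the subtraction, the VOQ at the intersection would be counted twice.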

Let $A(t) = \big(a_{ij}(t)\big)$ be the traffic arrival matrix where $a_{ij}(t)$ is the number of packets arriving at input port $i$ destined for output port $j$ during time slot $t$. For ease of exposition, we assume that, for each $1 \le i, j \le N$, $\{a_{ij}(t)\}_{t=0}^{\infty}$ is a sequence of i.i.d. random variables, the second moment of their common distribution ($= \mathbb{E}\big[a_{ij}^2(0)\big]$) is finite, and this sequence is independent of other sequences (for a different $i$ and/or $j$). Our analysis, however, holds for more general arrival processes (e.g., Markovian arrivals) that were considered in [11, 12], as we will elaborate shortly. Let $D(t) = \big(d_{ij}(t)\big)$ be the departure matrix for time slot $t$ output by the crossbar scheduling algorithm. Like the sub-permutation matrix of § II-A, $D(t)$ is a $0$-$1$ matrix in which $d_{ij}(t) = 1$ if and only if a packet departs from $q_{ij}$ during time slot $t$. For any $i, j$, the queue length process $q_{ij}(t)$ evolves as follows:

$$q_{ij}(t+1) = q_{ij}(t) - d_{ij}(t) + a_{ij}(t). \tag{2}$$

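Let $\Lambda = (\lambda_{ij})$ be the (normalized) traffic rate matrix (associated with $A(t)$) where $\lambda_{ij}$ is the normalized (to the percentage of the line rate of an input/output link) mean arrival rate of packets to VOQ $q_{ij}$. With $a_{ij}(t)$ being an i.i.d. process, we have $\lambda_{ij} = \mathbb{E}\big[a_{ij}(0)\big]$. We define $\rho_{\Lambda}$ as the maximum load factor imposed on any input or output port by $\Lambda$,

$$\rho_{\Lambda} \triangleq \max\Big\{ \max_{1 \le i \le N} \sum_{j} \lambda_{ij},\ \max_{1 \le j \le N} \sum_{i} \lambda_{ij} \Big\}. \tag{3}$$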

A switching algorithm is said to achieve 100% throughput or be throughput-optimal if the (packet) queues are stable whenever $\rho_{\Lambda} < 1$.
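The maximum load factor is simply the largest row or column sum of the rate matrix. A small sketch, with a hypothetical helper name and an illustrative rate matrix that is not taken from the paper:

```python
def max_load_factor(lam):
    """Largest row sum or column sum of the traffic rate matrix lam."""
    n = len(lam)
    row_max = max(sum(row) for row in lam)
    col_max = max(sum(lam[i][j] for i in range(n)) for j in range(n))
    return max(row_max, col_max)


# Illustrative 2x2 rate matrix: the busiest port is output 0 (0.2 + 0.3).
lam = [[0.2, 0.1],
       [0.3, 0.1]]
rho = max_load_factor(lam)
```

Comparing `rho` against 1 (or against 1/2, the provable stability region for maximal matchings) tells whether a given workload falls inside the corresponding region.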

As mentioned before, we will prove in this section that, same as the maximal matching algorithms, QPS-1 is stable under any traffic arrival process whose rate matrix $\Lambda$ satisfies $\rho_{\Lambda} < 1/2$ (i.e., can provably attain at least $50\%$ throughput, or half of the maximum). We also derive the average delay bound for QPS-1, which we show is order-optimal (i.e., independent of switch size $N$) when the arrival process further satisfies that for any $i, j$, $a_{ij}(t)$ has finite variance. In the sequel, we drop the subscript term from $\rho_{\Lambda}$ and simply denote it as $\rho$.

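Similar to $Q^{\dagger}_{ij}(t)$, we define $A^{\dagger}_{ij}(t)$ as the total number of packet arrivals to all VOQs in the neighborhood set $Q^{\dagger}_{ij}$:

$$A^{\dagger}_{ij}(t) \triangleq A_{i*}(t) + A_{*j}(t) - a_{ij}(t), \tag{4}$$

where $A_{i*}(t)$ and $A_{*j}(t)$ are defined similarly to $Q_{i*}(t)$ and $Q_{*j}(t)$ respectively. $D_{i*}(t)$, $D_{*j}(t)$, and $D^{\dagger}_{ij}(t)$ are similarly defined. We now state some simple facts concerning $D(t)$, $A(t)$, and $Q(t)$ as follows.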

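###### Fact 1.

Given any crossbar scheduling algorithm, for any $i, j$, we have $D_{i*}(t) \le 1$ (at most one packet can depart from input port $i$ during time slot $t$), $D_{*j}(t) \le 1$, and $D^{\dagger}_{ij}(t) \le 2$.

###### Fact 2.

Given any i.i.d. arrival process $A(t)$ whose rate matrix is $\Lambda$ and whose maximum load factor $\rho$ is defined in (3), for any $i, j$, we have $\mathbb{E}\big[A^{\dagger}_{ij}(t)\big] \le 2\rho$.

The following fact is slightly less obvious.

###### Fact 3.

Given any crossbar scheduling algorithm, for any $t$, we have

$$\sum_{i,j} q_{ij}(t)\, D^{\dagger}_{ij}(t) = \sum_{i,j} d_{ij}(t)\, Q^{\dagger}_{ij}(t). \tag{5}$$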
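Fact 3 can be read as a symmetry: weighting each queue length by the departures in its cross-shaped neighborhood gives the same total as weighting each departure by the queue lengths in its neighborhood. The sketch below checks this numerically on one small example; the helper names are hypothetical, and the matrices are illustrative rather than taken from the paper.

```python
def cross(m, i, j):
    """Sum of row i and column j of m, counting m[i][j] only once."""
    return sum(m[i]) + sum(row[j] for row in m) - m[i][j]


def fact3_sides(q, d):
    """Return (sum of q_ij * D-dagger_ij, sum of d_ij * Q-dagger_ij)."""
    n = len(q)
    cells = [(i, j) for i in range(n) for j in range(n)]
    lhs = sum(q[i][j] * cross(d, i, j) for i, j in cells)
    rhs = sum(d[i][j] * cross(q, i, j) for i, j in cells)
    return lhs, rhs


q = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
# A matching with two departures: input 0 -> output 1, input 1 -> output 0.
d = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 0]]
lhs, rhs = fact3_sides(q, d)
```

Both sides evaluate to the same number for this example, as the exchange of summation order suggests they should for any pair of matrices.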

### IV-B Why QPS-1 Is Just as Good?

The provable throughput and delay bounds of maximal matching algorithms were derived from a “departure inequality” (to be stated and proved next) that all maximal matchings satisfy. This inequality, however, is not in general satisfied by matchings generated by QPS-1. Rather, QPS-1 satisfies a much weaker form of departure inequality, which we discover is fortunately barely strong enough for proving the same throughput and delay bounds.

###### Property 1 (Departure Inequality, stated as Lemma 1 in [12, 11]).

If during a time slot $t$, the crossbar schedule is a maximal matching, then each departure process $D^{\dagger}_{ij}(t)$ satisfies the following inequality

$$q_{ij}(t)\, D^{\dagger}_{ij}(t) \ge q_{ij}(t). \tag{6}$$

###### Proof:

We reproduce the proof of Property 1 with a slightly different approach for this paper to be self-contained. Suppose the contrary is true, i.e., $q_{ij}(t)\, D^{\dagger}_{ij}(t) < q_{ij}(t)$. This can only happen when $q_{ij}(t) > 0$ and $D^{\dagger}_{ij}(t) = 0$. However, $D^{\dagger}_{ij}(t) = 0$ implies that no nonempty VOQ (edge) in the neighborhood $Q^{\dagger}_{ij}$ (see Figure 1) is a part of the matching. Then this matching cannot be maximal (a contradiction) since it can be enlarged by the addition of the nonempty VOQ (edge) $q_{ij}$. ∎
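The argument in the proof can be checked on a toy instance: with a maximal matching, every nonempty VOQ sees at least one departure in its row or column, whereas an empty (non-maximal) schedule violates the inequality. The helper names below are hypothetical and the matrices are illustrative.

```python
def cross(m, i, j):
    """Sum of row i and column j of m, counting m[i][j] only once."""
    return sum(m[i]) + sum(row[j] for row in m) - m[i][j]


def satisfies_departure_inequality(q, d):
    """Check q_ij * D-dagger_ij >= q_ij for every VOQ (i, j)."""
    n = len(q)
    return all(
        q[i][j] * cross(d, i, j) >= q[i][j]
        for i in range(n)
        for j in range(n)
    )


q = [[1, 1],
     [1, 0]]
maximal = [[1, 0],     # input 0 -> output 0 blocks every other nonempty VOQ
           [0, 0]]
empty = [[0, 0],       # no departures at all: not maximal
         [0, 0]]
```

Here the single-edge schedule passes the check even though it is not a maximum matching, which matches the statement that the inequality needs only maximality.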

Clearly, the departure inequality (6) above implies the following much weaker form of it:

$$\mathbb{E}\Big[\sum_{i,j} q_{ij}(t)\, D^{\dagger}_{ij}(t) \,\Big|\, Q(t)\Big] \ge \sum_{i,j} q_{ij}(t). \tag{7}$$

In the rest of this section, we prove the following lemma:

###### Lemma 1.

The matching generated by QPS-1, during any time slot $t$, satisfies the much weaker “departure inequality” (7).

Before we prove Lemma 1, we introduce an important definition and state four facts about QPS-1 that will be used later in the proof. In the following, we will run into several innocuous possible $\frac{0}{0}$ situations that all result from queue-proportional sampling, and we consider all of them to be $0$.

We define $\alpha_{ij}(t)$ as the probability of the event that the proposal from input port $i$ to output port $j$ is accepted during the accepting phase, conditioned upon the event that input port $i$ did propose to output port $j$ during the proposing phase. With this definition, we have the first fact

$$\mathbb{E}\big[d_{ij}(t) \mid Q(t)\big] = \frac{q_{ij}(t)}{Q_{i*}(t)} \cdot \alpha_{ij}(t), \tag{8}$$

since both sides (note $d_{ij}(t)$ is a 0-1 r.v.) are the probability that $i$ proposes to $j$ and this proposal is accepted. Applying the “$\sum_{j}$ operator” to both sides, we obtain the second fact

$$\mathbb{E}\big[D_{i*}(t) \mid Q(t)\big] = \sum_{j} \frac{q_{ij}(t)}{Q_{i*}(t)} \cdot \alpha_{ij}(t). \tag{9}$$

The third fact is that, for any output port $j$,

$$\mathbb{E}\big[D_{*j}(t) \mid Q(t)\big] = 1 - \prod_{i} \Big(1 - \frac{q_{ij}(t)}{Q_{i*}(t)}\Big). \tag{10}$$

In this equation, the LHS is the conditional probability ($D_{*j}(t)$ is also a 0-1 r.v.) that at least one proposal is received and accepted by output port $j$, and the second term on the RHS of (10) is the probability that no input port proposes to output port $j$ (so $j$ receives no proposal). This equation holds since when $j$ receives one or more proposals, it will accept one of them (the one with the longest VOQ).

The fourth fact is that, for any $i, j$,

$$\alpha_{ij}(t) \ge \prod_{k \ne i} \Big(1 - \frac{q_{kj}(t)}{Q_{k*}(t)}\Big). \tag{11}$$

This inequality holds because when input port $i$ proposes to output port $j$, and no other input port does, $j$ has no choice but to accept $i$'s proposal.

### IV-C Proof of Lemma 1

Now we are ready to prove Lemma 1.

By the definition of $D^{\dagger}_{ij}(t)$, we have,

(13) |
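One way the four facts about QPS-1 stated in § IV-B combine to give the weak departure inequality is sketched below. This is a sketch under stated assumptions rather than the paper's own sequence of steps: the shorthand $p_{ij}$ for the probability that input $i$ proposes to output $j$ is introduced here, the time index is dropped, and every $\frac{0}{0}$ is taken to be $0$. The first line expands the cross-shaped departure count into its row, column, and shared terms; the second substitutes the first three facts; the third applies the fourth fact multiplied by $(1 - p_{ij}) \ge 0$.

```latex
% Shorthand (an assumption of this sketch): p_{ij} = q_{ij} / Q_{i*}, with 0/0 = 0,
% so that Q_{i*} p_{ij} = q_{ij} holds in all cases.
\begin{align*}
\mathbb{E}\Big[\sum_{i,j} q_{ij} D^{\dagger}_{ij} \,\Big|\, Q\Big]
 &= \sum_{i,j} q_{ij}\Big(\mathbb{E}[D_{*j} \mid Q] + \mathbb{E}[D_{i*} \mid Q]
    - \mathbb{E}[d_{ij} \mid Q]\Big) \\
 &= \sum_{i,j} q_{ij}\Big(1 - \prod_{k}(1 - p_{kj})\Big)
    + \sum_{i,j} q_{ij}\,\alpha_{ij}\,(1 - p_{ij}) \\
 &\ge \sum_{i,j} q_{ij}\Big(1 - \prod_{k}(1 - p_{kj})\Big)
    + \sum_{i,j} q_{ij} \prod_{k}(1 - p_{kj}) \\
 &= \sum_{i,j} q_{ij}.
\end{align*}
```

The second line uses $\sum_{j} q_{ij}\, \mathbb{E}[D_{i*} \mid Q] = Q_{i*} \sum_{l} p_{il}\,\alpha_{il} = \sum_{l} q_{il}\,\alpha_{il}$, so the row term and the shared term merge into $\sum_{i,j} q_{ij}\,\alpha_{ij}\,(1 - p_{ij})$.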

### IV-D Throughput Analysis

In this section we prove, through Lyapunov stability analysis, the following theorem (i.e., Theorem 1), which states that QPS-1 can attain at least $50\%$ throughput. The proof will make use of the much weaker departure inequality (7). The same throughput bound was proved in [10], through fluid limit analysis, for maximal matching algorithms using the (stronger) departure inequality (6), which as stated earlier is not satisfied by matchings generated by QPS-1.

###### Theorem 1.

Whenever the maximum load factor $\rho < 1/2$, QPS-1 is stable in the following sense: The queueing process $\{Q(t)\}_{t=0}^{\infty}$ is a positive recurrent Markov chain.

###### Proof:

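$\{Q(t)\}_{t=0}^{\infty}$ is clearly a Markov chain, since in (2), the term $D(t)$ is a function of $Q(t)$ and $A(t)$ is a random variable independent of $Q(t)$. We define the following Lyapunov function of $Q(t)$: $L\big(Q(t)\big) = \sum_{i,j} q_{ij}(t)\, Q^{\dagger}_{ij}(t)$, where $Q^{\dagger}_{ij}(t)$ is defined earlier in (1). This Lyapunov function was first introduced in [11] for the delay analysis of maximal matching algorithms for wireless networking. By the Foster-Lyapunov stability criterion [31, Proposition 2.1.1], to prove that $\{Q(t)\}_{t=0}^{\infty}$ is positive recurrent, it suffices to show that there exists a constant $B > 0$ such that whenever $\|Q(t)\|_1 > B$, where $\|Q(t)\|_1 = \sum_{i,j} q_{ij}(t)$ (because it is not hard to verify that the complement set of states $\{Q(t) : \|Q(t)\|_1 \le B\}$ is finite and the drift is bounded whenever $Q(t)$ belongs to this set), we have

$$\mathbb{E}\big[L\big(Q(t+1)\big) - L\big(Q(t)\big) \mid Q(t)\big] \le -\epsilon, \tag{18}$$

where $\epsilon > 0$ is a constant. It is not hard to check (for more detailed derivations, please refer to [11]),

$$L\big(Q(t+1)\big) - L\big(Q(t)\big) = 2 \sum_{i,j} q_{ij}(t)\big(A^{\dagger}_{ij}(t) - D^{\dagger}_{ij}(t)\big) + \sum_{i,j} \big(a_{ij}(t) - d_{ij}(t)\big)\big(A^{\dagger}_{ij}(t) - D^{\dagger}_{ij}(t)\big). \tag{19}$$

Hence the drift (LHS of (18)) can be written as

$$\mathbb{E}\Big[2 \sum_{i,j} q_{ij}(t)\big(A^{\dagger}_{ij}(t) - D^{\dagger}_{ij}(t)\big) \,\Big|\, Q(t)\Big] + \mathbb{E}\Big[\sum_{i,j} \big(a_{ij}(t) - d_{ij}(t)\big)\big(A^{\dagger}_{ij}(t) - D^{\dagger}_{ij}(t)\big) \,\Big|\, Q(t)\Big]. \tag{20}$$

Now we claim the following two inequalities, which we will prove shortly.

$$\mathbb{E}\Big[2 \sum_{i,j} q_{ij}(t)\big(A^{\dagger}_{ij}(t) - D^{\dagger}_{ij}(t)\big) \,\Big|\, Q(t)\Big] \le 2(2\rho - 1)\, \|Q(t)\|_1, \tag{21}$$

$$\mathbb{E}\Big[\sum_{i,j} \big(a_{ij}(t) - d_{ij}(t)\big)\big(A^{\dagger}_{ij}(t) - D^{\dagger}_{ij}(t)\big) \,\Big|\, Q(t)\Big] \le C N^2. \tag{22}$$

With (21) and (22) substituted into (20), we have

$$\mathbb{E}\big[L\big(Q(t+1)\big) - L\big(Q(t)\big) \mid Q(t)\big] \le 2(2\rho - 1)\, \|Q(t)\|_1 + C N^2,$$

where $C > 0$ is a constant. Since $\rho < 1/2$, we have $2\rho - 1 < 0$. Hence, there exist $B, \epsilon > 0$ such that, whenever $\|Q(t)\|_1 > B$,

$$\mathbb{E}\big[L\big(Q(t+1)\big) - L\big(Q(t)\big) \mid Q(t)\big] \le -\epsilon.$$

Now we proceed to prove (21).

$$\begin{aligned}
\mathbb{E}\Big[2 \sum_{i,j} q_{ij}(t)\big(A^{\dagger}_{ij}(t) - D^{\dagger}_{ij}(t)\big) \,\Big|\, Q(t)\Big]
&= 2 \sum_{i,j} q_{ij}(t)\, \mathbb{E}\big[A^{\dagger}_{ij}(t)\big] - 2\, \mathbb{E}\Big[\sum_{i,j} q_{ij}(t)\, D^{\dagger}_{ij}(t) \,\Big|\, Q(t)\Big] && (23) \\
&\le 2(2\rho - 1)\, \|Q(t)\|_1. && (24)
\end{aligned}$$

In the above derivations, inequality (24) holds due to (7), $A(t)$ being independent of $Q(t)$ for any $t$, and Fact 2 that $\mathbb{E}\big[A^{\dagger}_{ij}(t)\big] \le 2\rho$.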

Now we proceed to prove (missing), which upper-bounds the conditional expectation . It suffices however to upper-bound the unconditional expectation , which we will do in the following, since we can obtain the same upper bounds on and ( and respectively) whether the expectations are conditional (on ) or not. Note the other two terms and are independent of (the condition) .

As for any , is i.i.d., we have,

(25) | ||||

(26) |

Remarks. Now that we have proved that is positive recurrent. Therefore, for any , the long term departure rate . Hence, we have,

(27) |

where is the variance of , because LHS of § IV-D is the long term average of (missing), and the long term average of (missing) can be simplified as the RHS of § IV-D.

### IV-E Delay Analysis

In this section, we derive the queue length bound (readily convertible to the delay bound by Little’s Law) of QPS-1 using the following moment bound theorem [31, Proposition 2.1.4]. In the interest of space, we show the delay analysis only for i.i.d. traffic arrivals in this work; the analysis for more general arrivals is almost identical. It can be shown that the delay analysis results for general Markovian arrivals derived in [11, 12] for maximal matchings (using the stronger “departure inequality” (6)) hold also for QPS-1.

###### Theorem 2.

Suppose that is a positive recurrent Markov chain with countable state space . Suppose , , and are non-negative functions on such that,

(28) |

Then .

Now we derive the queue length bound for QPS-1 when the maximum load factor . We define , , , and terms in this theorem in such a way that the LHS and the RHS of (missing) become the LHS and the RHS of (missing) respectively (e.g., define as , as , and as ). Then, we have,

(29) | ||||

(30) | ||||

(31) |

In the above derivation, inequality (missing) is due to (missing) (whose LHS is ), inequality (missing) is due to Theorem 2, and equality (missing) is due to § IV-D.

Therefore, we have,

This queue-length bound is identical to that derived in [11, 12, Section III.B] for maximal matchings under i.i.d. traffic arrivals. It is not hard to check (by applying Little’s Law) that the average delay (experienced by packets) is bounded by a constant independent of (i.e., order-optimal) for a given maximum load factor , if the variance for any is assumed to be finite. For the special case of Bernoulli i.i.d. arrival (when ), this bound (the RHS) can be further tightened to . This implies, by Little’s Law, the following “clean” bound: where is the expected delay averaged over all packets transmitting through the switch.

## V Evaluation

In this section, we evaluate, through simulations, the performance of QPS-r under various load conditions and traffic patterns. We compare its performance with that of iSLIP [17], a refined and optimized representative parallel maximal matching algorithm (adapted for switching). The performance of the MWM (Maximum Weighted Matching) is also included in the comparison as a benchmark. Our simulations show conclusively that QPS-1 performs very well inside the provable stability region (more precisely, with no more than 50% offered load), and that QPS-3 has throughput and delay performance comparable to that of iSLIP, which has much higher computational and communication complexities.

### V-A Simulation Setup

In our simulations, we first fix the number of input/output ports, $N$, to $64$. Later, in § V-C, we investigate how the mean delay performances of these algorithms scale with respect to $N$. To measure throughput and delay accurately, we assume each VOQ has an infinite buffer size and hence there is no packet drop at any input port. Each simulation run is guided by the following stopping rule [32, 33]: The number of time slots simulated is the larger between and that is needed for the difference between the estimated and the actual average delays to be within time slots with probability at least .

We assume in our simulations that each traffic arrival matrix is Bernoulli i.i.d. with its traffic rate matrix being equal to the product of the offered load and a traffic pattern matrix (defined next). Similar Bernoulli arrivals were studied in [30, 17, 20]. Note that only synthetic traffic (instead of that derived from packet traces) is used in our simulations because, to the best of our knowledge, there is no meaningful way to combine packet traces into switch-wide traffic workloads. The following four standard types of normalized (with each row or column sum equal to ) traffic patterns are used: (I) Uniform: packets arriving at any input port go to each output port with probability . (II) Quasi-diagonal: packets arriving at input port go to output port with probability and go to any other output port with probability . (III) Log-diagonal: packets arriving at input port go to output port with probability and go to any other output port with probability equal of the probability of output port (note: output port equals output port ). (IV) Diagonal: packets arriving at input port go to output port with probability , or go to output port with probability . These traffic patterns are listed in order of how skewed the volumes of traffic arrivals to different output ports are: from uniform being the least skewed, to diagonal being the most skewed.
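The four traffic patterns can be sketched as row-stochastic matrices. The specific probabilities below are assumptions, taken from the definitions commonly used in the crossbar-scheduling literature rather than from this section: quasi-diagonal uses 1/2 on the diagonal and 1/(2(N-1)) elsewhere, log-diagonal starts at 2^(N-1)/(2^N-1) on the diagonal and halves for each subsequent output port (wrapping around), and diagonal uses 2/3 and 1/3. The function names are hypothetical, and ports are indexed from 0.

```python
def traffic_pattern(name, n):
    """Return an n x n matrix (list of rows) whose rows and columns sum to 1.

    The probabilities are assumed values, see the lead-in; requires n >= 2.
    """
    if n < 2:
        raise ValueError("need at least 2 ports")
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if name == "uniform":
            for j in range(n):
                m[i][j] = 1.0 / n
        elif name == "quasi-diagonal":
            for j in range(n):
                m[i][j] = 0.5 if j == i else 0.5 / (n - 1)
        elif name == "log-diagonal":
            p = 2.0 ** (n - 1) / (2.0 ** n - 1)
            for k in range(n):
                m[i][(i + k) % n] = p  # output port index wraps around
                p /= 2.0               # each next port gets half the previous
        elif name == "diagonal":
            m[i][i] = 2.0 / 3.0
            m[i][(i + 1) % n] = 1.0 / 3.0
        else:
            raise ValueError("unknown traffic pattern: " + name)
    return m


def rate_matrix(name, n, load):
    """Traffic rate matrix = offered load times the normalized pattern."""
    return [[load * p for p in row] for row in traffic_pattern(name, n)]
```

Because every pattern is circulant, each column sum equals each row sum, so an offered load below 1 keeps every input and output port within capacity.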

### V-B QPS-r Throughput and Delay Performances

We first compare the throughput and delay performances of QPS-1 (1 iteration), QPS-3 (3 iterations), iSLIP (6 iterations), and MWM (with the VOQ length as the weight measure). Figure 2 shows their mean delays (in number of time slots) under the aforementioned four traffic patterns. Each subfigure shows how the mean delay (on a log scale along the y-axis) varies with the offered load (along the x-axis). We make three observations from Figure 2. First, when the offered load is no larger than 0.5, QPS-1 has low average delays (i.e., it is more than just stable) that are close to those of iSLIP and MWM, under all four traffic patterns. Second, the maximum sustainable throughputs (where the delays start to “go through the roof” in the subfigures) of QPS-1 are roughly , and under the four traffic patterns, respectively; they are all comfortably larger than the provable lower bound of 50%. Third, the throughput and delay performances of QPS-3 and iSLIP are comparable: the former has slightly better delay performance than the latter under all four traffic patterns except the uniform.
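A toy version of such an experiment can be sketched as follows. This is not the simulator used for Figure 2; it is a minimal illustration under our own assumptions: each unmatched input port proposes to one output port sampled with probability proportional to its VOQ lengths, each still-unmatched output port accepts the proposal with the longest VOQ, each input receives at most one packet per time slot, and the mean delay is estimated from the time-averaged backlog via Little's Law. All function names are hypothetical.

```python
import random


def qps_schedule(voq, rounds, rng):
    """One time slot of a QPS-r-style matching; returns {input: output}."""
    n = len(voq)
    matched = {}           # input port -> output port
    taken_outputs = set()
    for _ in range(rounds):
        proposals = {}     # output port -> proposing input ports
        for i in range(n):
            if i in matched or sum(voq[i]) == 0:
                continue
            # Queue-proportional sampling over the VOQs of input i.
            j = rng.choices(range(n), weights=voq[i])[0]
            proposals.setdefault(j, []).append(i)
        for j, inputs in proposals.items():
            if j in taken_outputs:
                continue   # output already matched in an earlier round
            best = max(inputs, key=lambda i: voq[i][j])
            matched[best] = j
            taken_outputs.add(j)
    return matched


def simulate_mean_delay(rates, rounds, slots, seed=0):
    """Estimate mean delay (in time slots) for an n x n rate matrix."""
    rng = random.Random(seed)
    n = len(rates)
    voq = [[0] * n for _ in range(n)]
    backlog_sum = 0
    for _ in range(slots):
        for i in range(n):             # arrivals: at most one per input
            u, acc = rng.random(), 0.0
            for j in range(n):
                acc += rates[i][j]
                if u < acc:
                    voq[i][j] += 1
                    break
        for i, j in qps_schedule(voq, rounds, rng).items():
            voq[i][j] -= 1             # departures along the matching
        backlog_sum += sum(map(sum, voq))
    total_rate = sum(map(sum, rates))
    return (backlog_sum / slots) / total_rate   # Little's Law


# Example: 8 ports, uniform traffic, offered load 0.3, one iteration.
uniform_rates = [[0.3 / 8] * 8 for _ in range(8)]
delay_estimate = simulate_mean_delay(uniform_rates, rounds=1, slots=2000)
```

Sweeping the offered load and the number of rounds in such a loop yields delay-versus-load curves of the kind discussed above, although a careful study would also need the stopping rule from Section V-A.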

### V-C Scale with Port Numbers

Figure 3 shows how the mean delays of QPS-3, iSLIP (running iterations given any ), and MWM scale with the number of input/output ports , under the four different traffic patterns. With one exception, we have simulated the following different values of : . The exception is that we did not obtain the delay values for MWM (not a “main character” in our story) for , as it proved to be prohibitively expensive computationally to do so. In all these plots, the offered load is , which is quite high compared to the maximum achievable throughputs of QPS-3 and iSLIP (shown in Figure 2) under these four traffic patterns. Figure 3 shows that the mean delays of QPS-3 are slightly lower (i.e., better) than those of iSLIP under all traffic patterns except the uniform. In addition, the mean delay curves of QPS-3 remain almost flat (i.e., constant) under log-diagonal and diagonal traffic patterns. Although they increase with under uniform and quasi-diagonal traffic patterns, they eventually almost flatten out when gets larger (say when ). These delay curves show that QPS-3, which runs only 3 iterations, delivers slightly better delay performance, under all non-uniform traffic patterns, than iSLIP (a refined and optimized parallel maximal matching algorithm adapted for switching), which runs iterations with each iteration having computational complexity.

## VI Related Work

Scheduling in crossbar switches is a well-studied problem with a large body of literature. In this section, we therefore provide only a brief survey of prior work that is directly related to ours, focusing on work we have not described earlier.

$O(N)$-complexity algorithms that attain 100% throughput. Several serial randomized algorithms, starting with TASS [34] and culminating in SERENA [30], have been proposed that have a total computational complexity of only $O(N)$ yet can provably attain 100% throughput; SERENA, the best among them, also delivers a good empirical delay performance. However, this complexity is still too high for scheduling high-line-rate high-radix switches, and none of them has been successfully parallelized (i.e., converted to a parallel iterative algorithm) yet. Notice that $O(N)$ computational complexity is not the complexity barrier for attaining 100% throughput; sub-linear algorithms attaining 100% throughput do exist. However, those algorithms compromise delay performance and/or generality. For example, the algorithm proposed in [35] can provably attain 100% throughput under the assumption that the traffic rate matrix is known a priori.

-complexity algorithms. In [36], a crossbar scheduling algorithm specialized for switching variable-size packets was proposed that has total computational complexity (for the entire switch). Although this algorithm can provably attain throughput, its delay performance is poor. For example, as shown in [20], its average delays, under the aforementioned four standard traffic matrices, are roughly orders of magnitude higher than those of SERENA [30] even under a moderate offered load of . A parallel iterative algorithm called RR/LQF (Round Robin combined with Longest Queue First), which has time complexity per iteration per port, was recently proposed in [37]. Even though iterations of this algorithm have to be run (for each scheduling) for it to provably attain at least 50% throughput, running only iteration leads to reasonably good empirical throughput and delay performance over round-robin-friendly workloads such as uniform and hot-spot.

Batch scheduling algorithms. In all algorithms above, a matching decision is made in every time slot. An alternative type of algorithm [38, 39, 40] is frame-based, in which multiple (say ) consecutive time slots are grouped as a frame. The matching decisions in a frame are batch-computed, which usually has lower time complexity than independent matching computations. However, since is usually quite large (e.g., ), and a packet arriving at the beginning of a frame has to wait until at least the beginning of the next frame to be switched, frame-based scheduling generally leads to higher queueing delays. The best known provable delay guarantee for this type of algorithm is average delays for an switch [39]. However, this algorithm has a high computational complexity of per time slot.

## VII Conclusion

In this work, we propose QPS-r, a parallel iterative crossbar scheduling algorithm with computational complexity per port. We prove, through Lyapunov stability analysis, that it achieves the same QoS (throughput and delay) guarantees in theory as the family of maximal matching algorithms (adapted for switching), and demonstrate through simulations that its performance in practice is comparable to theirs; maximal matching algorithms are much more expensive computationally (at least iterations and a total of per-port computational complexity). These salient properties make QPS-r an excellent candidate algorithm that is fast enough computationally and can deliver acceptable throughput and delay performance for high-link-rate high-radix switches.

## References

- [1] C. Cakir, R. Ho, J. Lexau, and K. Mai, “Modeling and Design of High-Radix On-Chip Crossbar Switches,” in Proc. of the ACM/IEEE NoCS, (New York, NY, USA), pp. 20:1–20:8, ACM, 2015.
- [2] C. Cakir, R. Ho, J. Lexau, and K. Mai, “Scalable High-Radix Modular Crossbar Switches,” in Proceedings of the HOTI, pp. 37–44, Aug 2016.
- [3] “Cisco Nexus 5000 Series Architecture: The Building Blocks of the Unified Fabric.” https://bit.ly/30aqAeb, Jul 2017. Accessed: 2018-12-25.
- [4] “Arista 7500 Switch Architecture (’A day in the life of a packet’).” https://bit.ly/2YfLaYG. Accessed: 2018-12-27.
- [5] “QFX10000 Switches System Architecture.” https://juni.pr/2HfeWWH. Accessed: 2018-12-25.
- [6] G. Passas, M. Katevenis, and D. Pnevmatikatos, “A 128 x 128 x 24Gb/s Crossbar Interconnecting 128 Tiles in a Single Hop and Occupying 6% of Their Area,” in Proc. of the ACM/IEEE NoCS, pp. 87–95, May 2010.
- [7] G. Passas, M. Katevenis, and D. Pnevmatikatos, “Crossbar nocs are scalable beyond 100 nodes,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, pp. 573–585, April 2012.
- [8] D. J. Aweya, Switch/Router Architectures: Shared-Bus and Shared-Memory Based Systems. Wiley-IEEE Press, 1st ed., Jun 2018.
- [9] “Introduction to an SFU.” https://bit.ly/2VuIBF9. Accessed: 2018-12-27.
- [10] J. Dai and B. Prabhakar, “The Throughput of Data Switches with and without Speedup,” in Proceedings of the IEEE INFOCOM, (Tel Aviv, Israel), pp. 556–564, Mar. 2000.
- [11] M. J. Neely, “Delay Analysis for Maximal Scheduling in Wireless Networks with Bursty Traffic,” in Proceedings of the IEEE INFOCOM, April 2008.
- [12] M. J. Neely, “Delay Analysis for Maximal Scheduling With Flow Control in Wireless Networks With Bursty Traffic,” IEEE/ACM Trans. Netw., vol. 17, pp. 1146–1159, Aug 2009.
- [13] M. Fischer, “Improved Deterministic Distributed Matching via Rounding,” ArXiv e-prints, Mar. 2017, 1703.00900.
- [14] J. Hirvonen and J. Suomela, “Distributed Maximal Matching: Greedy is Optimal,” in Proceedings of the ACM PODC, PODC ’12, (New York, NY, USA), pp. 165–174, ACM, 2012.
- [15] A. Israel and A. Itai, “A Fast and Simple Randomized Parallel Algorithm for Maximal Matching,” Inf. Process. Lett., vol. 22, pp. 77–80, Feb. 1986.
- [16] T. E. Anderson, S. S. Owicki, J. B. Saxe, and C. P. Thacker, “High-speed Switch Scheduling for Local-area Networks,” ACM Trans. Comput. Syst., vol. 11, pp. 319–352, Nov. 1993.
- [17] N. McKeown, “The iSLIP Scheduling Algorithm for Input-queued Switches,” IEEE/ACM Trans. Netw., vol. 7, pp. 188–201, Apr. 1999.
- [18] “Cisco Nexus 5548p Switch Architecture.” https://bit.ly/2VAwEhw, Sept 2010.
- [19] G. Passas, M. Katevenis, and D. Pnevmatikatos, “The combined input-output queued crossbar architecture for high-radix on-chip switches,” IEEE Micro, vol. 35, pp. 38–47, Nov 2015.
- [20] L. Gong, P. Tune, L. Liu, S. Yang, and J. J. Xu, “Queue-Proportional Sampling: A Better Approach to Crossbar Scheduling for Input-Queued Switches,” Proceedings of the ACM on Measurement and Analysis of Computing Systems - SIGMETRICS, vol. 1, pp. 3:1–3:33, June 2017.
- [21] “Cisco 12000 Series Internet Router Architecture: Switch Fabric.” Accessed: 2018-03-21.
- [22] Y. Tamir and G. L. Frazier, “High-performance Multi-queue Buffers for VLSI Communications Switches,” SIGARCH Comput. Archit. News, vol. 16, pp. 343–354, May 1988.
- [23] M. Karol, M. Hluchyj, and S. Morgan, “Input Versus Output Queueing on a Space-Division Packet Switch,” IEEE Trans. Commun., vol. 35, pp. 1347–1356, December 1987.
- [24] L. Tassiulas and A. Ephremides, “Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks,” IEEE Transactions on Automatic Control, vol. 37, pp. 1936–1948, Dec. 1992.
- [25] N. McKeown, A. Mekkittikul, V. Anantharam, and J. Walrand, “Achieving 100% Throughput in an Input-Queued Switch,” IEEE Trans. Commun., vol. 47, pp. 1260–1267, Aug. 1999.
- [26] L. Tassiulas and A. Ephremides, “Stability Properties of Constrained Queueing Systems and Scheduling Policies for Maximum Throughput in Multihop Radio Networks,” IEEE Trans. Autom. Control, vol. 37, pp. 1936–1948, Dec. 1992.
- [27] D. Shah and M. Kopikare, “Delay bounds for approximate maximum weight matching algorithms for input queued switches,” in Proceedings of the IEEE INFOCOM, vol. 2, pp. 1024–1031 vol.2, 2002.
- [28] R. Duan and H.-H. Su, “A Scaling Algorithm for Maximum Weight Matching in Bipartite Graphs,” in Proceedings of the ACM-SIAM SODA, pp. 1413–1424, 2012.
- [29] I. Keslassy, R. Zhang-Shen, and N. McKeown, “Maximum size matching is unstable for any packet switch,” IEEE Communications Letters, vol. 7, pp. 496–498, Oct 2003.
- [30] P. Giaccone, B. Prabhakar, and D. Shah, “Randomized Scheduling Algorithms for High-Aggregate Bandwidth Switches,” IEEE J. Sel. Areas Commun., vol. 21, pp. 546–559, May 2003.
- [31] B. Hajek, “Notes for ece 467 communication network analysis.” https://bit.ly/1JtPGu0, 2006.
- [32] J. M. Flegal, G. L. Jones, et al., “Batch means and spectral variance estimators in Markov chain Monte Carlo,” The Annals of Statistics, vol. 38, no. 2, pp. 1034–1070, 2010.
- [33] P. W. Glynn, W. Whitt, et al., “The Asymptotic Validity of Sequential Stopping Rules for Stochastic Simulations,” Ann. Appl. Probab., vol. 2, no. 1, pp. 180–198, 1992.
- [34] L. Tassiulas, “Linear Complexity Algorithms for Maximum Throughput in Radio Networks and Input Queued Switches,” in Proceedings of the IEEE INFOCOM, (San Francisco, CA, USA), pp. 533–539, Mar. 1998.
- [35] C.-S. Chang, W.-J. Chen, and H.-Y. Huang, “On service guarantees for input-buffered crossbar switches: a capacity decomposition approach by Birkhoff and von Neumann,” in Proceedings of International Workshop on Quality of Service (IWQoS), pp. 79–86, May 1999.
- [36] S. Ye, T. Shen, and S. Panwar, “An $O(1)$ Scheduling Algorithm for Variable-Size Packet Switching Systems,” in Proceedings of the 48th Annual Allerton Conference, pp. 1683–1690, Sept. 2010.
- [37] B. Hu, K. L. Yeung, Q. Zhou, and C. He, “On Iterative Scheduling for Input-Queued Switches With a Speedup of $2-1/N$,” IEEE/ACM Trans. Netw., vol. 24, pp. 3565–3577, December 2016.
- [38] G. Aggarwal, R. Motwani, D. Shah, and A. Zhu, “Switch Scheduling via Randomized Edge Coloring,” in Proceedings of the IEEE FOCS, pp. 502–512, Oct 2003.
- [39] M. J. Neely, E. Modiano, and Y. S. Cheng, “Logarithmic Delay for Packet Switches Under the Crossbar Constraint,” IEEE/ACM Trans. Netw., vol. 15, pp. 657–668, June 2007.
- [40] L. Wang, , T. Lee, and W. Hu, “A Parallel Complex Coloring Algorithm for Scheduling of Input-Queued Switches,” IEEE Trans. Parallel Distrib. Syst., pp. 1–1, 2018.