CLEX: Yet Another Supercomputer Architecture?
Abstract
We propose the CLEX supercomputer topology and routing scheme. We prove that CLEX can utilize a constant fraction of the total bandwidth for point-to-point communication, at delays proportional to the sum of the number of intermediate hops and the maximum physical distance between any two nodes. Moreover, all-to-all communication can be realized optimally with regard to both bandwidth and delays. This is achieved at node degrees of O(n^ε), for an arbitrarily small constant ε > 0. In contrast, these results are impossible in any network featuring constant or polylogarithmic node degrees. Through simulation, we assess the benefits of an implementation of the proposed communication strategy. Our results indicate that, for a million processors, CLEX can substantially increase bandwidth utilization and reduce the average routing path length in comparison to a torus network. Furthermore, the CLEX communication scheme features several other properties, such as deadlock-freedom, inherent fault-tolerance, and a canonical partition into smaller subsystems.
I Introduction & Related Work
Ever since the advent of massively parallel computing architectures, there has been lively interest in the question of how the nodes^{1}^{1}1By “node” we mean the smallest computing unit that can be seen as (more or less) a sequentially working device. In today’s multiprocessor systems this means a single core. of a supercomputer should be interconnected, e.g., as a fat tree, butterfly, or hypercube [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. Naturally, these topologies try to strike a balance between the desires for small node degrees and short routing paths. Moreover, it is crucial to serve routing requests in parallel, using little, distributed computation, and to deal with faults. Eventually, the theoretical understanding of these issues became systematic and mature [14, 15, 16, 17].
Today, communication in supercomputers is by and large implemented by means of low-degree interconnection topologies. Thus, one would expect the well-analysed topologies proposed decades ago to dominate the market. But far from it! State-of-the-art architectures like the Cray XMT or IBM Blue Gene provide point-to-point communication on top of a three-dimensional torus network [18, 19]. This appealing simplicity in design comes at a cost, as such a system is fundamentally limited in communication. Konecny [19] writes: “Because the most interesting mode of operation assumes uniformly distributed traffic, the network performance is expected to be dominated by the bisection bandwidth.” In a three-dimensional torus of n processors, one can partition the processors such that two subsets of n/2 processors are connected by O(n^{2/3}) edges only. In other words, because of communication limitations “the third dimension” of processing is lost, since the average point-to-point bandwidth between these subsets scales with n^{-1/3} times the individual link capacity. Bluntly, in today’s supercomputers with a million processors, the torus architecture restricts communication to less than 1% of the total available bandwidth in the worst case.
So why is it that such an apparently suboptimal design is chosen by practitioners? We believe the answer to this question to be twofold. On the one hand, a (locally) grid-like communication network is of course well-suited to deal with communication patterns that are local as well.^{2}^{2}2This is for instance true for computational problems arising from physical systems, e.g., from fluid or solid body dynamics, which have been a (if not the) main focus of parallel computing in the past. We argue, however, that this approach has several shortcomings. Firstly, it restricts the range of problems for which the computer architecture is fitting to those that are parallelizable in a way that matches the network topology. Secondly, programmers need to be aware of this issue and program accordingly, which might be a non-trivial and error-prone task. Thirdly, on large scales, load balancing issues may result in more complex, less local communication patterns if efficient progress of the computation is to be ensured. And finally, even if all these issues can be overcome at the time the system goes online, it will typically be in use for several years, and it is difficult to predict whether demands will change during the lifetime of the supercomputer.
On the other hand, the theory on interconnection networks fails to address some questions of practical significance. For one, how should one actually realize one of the suggested topologies? This turns out to be critical for performance, as the efficiency of the whole communication infrastructure might break down because some of the physical links are exceedingly long: these connections will suffer larger communication delays, consume more space and energy, and complicate the physical layout of the system. To the best of our knowledge, this issue has been neglected in all theoretical studies of the matter; in stark contrast, even for the three-dimensional torus, which is fairly amenable to low-distortion “embedding”, optimizing the stretch has been considered a worthwhile task [20]. What is more, we believe it to be important to devise routing algorithms that deal with faults in a seamless, fully distributed, and automatic manner. Therefore, it is not enough to show that a topology exhibits a large number of disjoint short paths between two destinations; one also needs to give a routing scheme that exploits the high connectivity of the system to establish robustness with respect to failing nodes or links.
In consequence, we would like to revive the interconnection discussion from a theoretical point of view^{3}^{3}3Numerous works are published all the time, but typically a topology is chosen and tested using standard routing mechanisms. For instance, [21] provides a two-level architecture similar to a two-level CLEX system, but no matching routing algorithms or theoretical analysis are given. by presenting a new topology we call CLEX (CLique-EXpander). Essentially, the CLEX design is the result of seeking efficient communication in a world of physical constraints. To this end, we deviate from standard analysis by measuring delays not solely in terms of hops, but also considering the physical distance a signal needs to travel.^{4}^{4}4Although typically bandwidth is the primary concern, recently Barroso pointed out that it is feasible and crucial to strive for small delays in warehouse-scale computing [22]. We prove that a point-to-point communication bandwidth per node (to and from arbitrary destinations) matching the total bandwidth per node up to a constant factor can be achieved, at delays that are (asymptotically) proportional to the maximum physical distance between any two nodes. Moreover, applying an asymmetric bandwidth assignment to the links, all-to-all communication^{5}^{5}5Adiga et al. [18] state that “MPI_AlltoAll is an important MPI collective communications operation in which every node sends a different message to every other node.” can be realized optimally with regard to both bandwidth and delays.
As constant or polylogarithmic node degrees necessarily incur an average hop distance of Ω(log n / log log n), the price we pay for these properties are node degrees of O(n^ε), for an arbitrarily small constant ε > 0. However, these fairly high degrees are “localized” in the sense that all but a constant number of them connect the nodes of the basic building blocks of our topology, i.e., cliques of size n^ε. Thus, one way to interpret our results is to view the CLEX approach as a method to localize the issue of an efficient (low-degree) communication network to much smaller systems of n^ε nodes, which may, e.g., reside on a single multi-core board. A multi-core board will offer means of on-board communication by itself, and due to small distances and integrated circuits one can expect it to be more efficient than a comparable large-scale network. Thus, the high connectivity of a CLEX system could be considered an abstraction that can be replaced by any efficient local communication scheme within the cliques (cf. e.g. [7]).
Nonetheless, we also propose a routing scheme that is indeed designed for the high-degree CLEX network as is. Within cliques, it employs recent results on parallel randomized load balancing [23], ensuring a high degree of efficiency and resilience of the overall approach. From our point of view, the properties of the resulting system justify re-raising the question of whether high degrees can be worth the effort. In fact, one could see this as another step of localization: our algorithm reduces the routing problem on the clique level to one on the node level, namely to the one of efficiently routing between input and output ports. This task now is to be solved on a physically much smaller scale, dealing with smaller communication delays and being able to rely on much better synchronization between the individual components. Again, one is free to replace the full connectivity between the ports by any combination of topology and routing scheme that is efficient at this scale.
To add some salt to the above theoretical considerations, we assess the efficiency of a CLEX architecture in practice. To this end, we simulate point-to-point communication in two systems of different sizes. The results of our simulation indicate that the usable bandwidth of a CLEX architecture could be an order of magnitude larger than the theoretical optimum of the IBM Blue Gene and Cray XMT tori interconnection networks. Since our comparison assumes identical total bandwidth in both designs, this is not a mere consequence of indirectly increasing bandwidth via node degrees, but a fundamental difference between the underlying topologies.
II Topology and Routing Algorithms
In this section, we give solutions to the all-to-all and point-to-point communication problems. To this end, we define an abstract model amenable to formal analysis. However, the applied proof techniques extend to stronger models which better match a real-world system. In particular, the assumptions of synchronicity and fault-free behaviour can be dropped. After describing the topology of the CLEX architecture, we briefly compare two algorithms solving all-to-all communication efficiently on our topology and on the three-dimensional torus. Finally, we give an algorithm for point-to-point communication and analyze its synchronous running time. Our theoretical findings are supported by the simulations presented in Section III.
II-A Model and Problem Formulation
We model a supercomputer as an undirected graph G = (V, E), n := |V|, where nodes represent the computing elements (processors) and edges bidirectional communication links. To simplify the presentation, we assume that for each v ∈ V, the loop (v, v) is contained in E, i.e., nodes may “send messages to themselves”. We assume that communication is reliable and proceeds in synchronous rounds. Message size is in O(log n), i.e., we assume that a constant number of node identifiers of size O(log n) fits into a message. Any upper bound on the message size respecting this constraint is feasible; for the purpose of our analysis, we however assume that in each round only one “unit payload” can be sent by each node along each edge. Nodes have access to an infinite source of random bits. We point out, however, that our algorithms will in practice work reliably also with pseudorandom instead of true random bits, since all our results hold with high probability (w.h.p.)^{6}^{6}6That is, with probability at least 1 − 1/n^c for a tunable constant c > 0.
Note that the assumptions on the communication model (which simplify the presentation) can be considerably relaxed. Our algorithms can be run asynchronously by including round counters into messages. Furthermore, as demonstrated in [23], the load balancing scheme can be made resilient to a constant (independent) probability of message loss.
Observe that the total delay a message suffers comprises the time it takes being relayed at intermediate nodes plus the time the signal travels along the edges of the interconnection network. Thus, the simplistic measure given by round complexity may not be accurate in practice; we also need to understand the influence of propagation times. Therefore, we define the maximal (average) delay T_max (T_avg) as

T_max := c_h · H_max + c_p · D_max  (T_avg := c_h · H_avg + c_p · D_avg),

where H_max (H_avg) is the maximal (average) number of hops until a message is delivered, D_max (D_avg) is the maximal (average) physical distance a message travels, and c_h and c_p are appropriate constants (comprising units). To get an idea of the order of magnitude of the respective terms, suppose that it takes a few clock cycles before a message can be relayed (to a free channel) and clock speeds are in the order of gigahertz. Thus, forwarding a message is initiated after a few nanoseconds. At the speed of light, a signal travels about a foot per nanosecond.
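As a concrete illustration, the delay measure amounts to a one-line computation; the default constants below (a few nanoseconds per relay, one foot per nanosecond of propagation) are the rough magnitudes mentioned above, not measured values:

```python
def delay(hops, distance_ft, hop_ns=3.0, prop_ns_per_ft=1.0):
    """Total message delay in nanoseconds, following T = c_h * H + c_p * D:
    a per-hop relaying cost plus the signal propagation time.
    hop_ns and prop_ns_per_ft are illustrative constants only."""
    return hop_ns * hops + prop_ns_per_ft * distance_ft
```

For instance, a 10-hop route spanning 100 feet yields delay(10, 100) = 130 ns, showing that for long routes the propagation term can dominate the hop term.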
Formally, we will solve the following problems.
Problem II.1 (Point-to-Point Communication)
Each node v ∈ V is given a (finite) set of messages S_v with destinations d(m) ∈ V for m ∈ S_v. The goal is to deliver all messages to their destinations, minimizing delays. By R_v := {m | d(m) = v} we denote the set of messages a node v shall receive. We abbreviate S := max_{v ∈ V} |S_v| and R := max_{v ∈ V} |R_v|, i.e., the maximal numbers of messages a single node needs to send or receive, respectively.
All-to-all communication is a special case of point-to-point communication.
Problem II.2 (All-to-All Communication)
Each node v ∈ V is given a message m_v. The goal is to deliver (a copy of) each message to all nodes, minimizing delay.
Note that this problem is easier to solve, since by setting S_v := {(m_v, w) | w ∈ V}, i.e., one copy of m_v destined to each node w, an instance of Problem II.1 with S = R = n is obtained.
II-B Interconnection Network
Evidently, with node degrees of at most Δ, any algorithm for Problem II.2 must take at least (n − 1)/Δ rounds to complete. Similarly, Problem II.1 cannot be solved in less than max{S, R}/Δ rounds, as no node can send or receive more than Δ messages in each round. Thus, in order to hope for good running times, the communication graph needs to expand very quickly, i.e., for any set A of nodes with |A| ≤ n/2 it is necessary that A has Ω(Δ|A|) outgoing edges. At the same time, we need to be aware that long-range links bridging a large physical distance should not be used frequently, which needs to be respected by our routing scheme and thus also the underlying topology. This motivates the following recursive graph construction.
Definition II.3 (CLEX Graphs)
Suppose for a constant ε ∈ (0, 1] that n^ε and 1/ε are integer. We recursively define the (directed) CLEX graph C_k of k + 1 levels, k < 1/ε. Set C_0 := K_{n^ε}, i.e., a clique of n^ε nodes, and label its nodes 1, …, n^ε. Assuming that C_{ℓ−1} is already defined, C_ℓ is composed of n^ε isomorphic copies C_{ℓ−1}^{(i)}, i ∈ {1, …, n^ε}, plus additional edges. Using the label a node inherits from C_{ℓ−1}, we can identify it uniquely with the pair consisting of i and that label.
Observe that each copy of C_ℓ connects each of its subgraphs C_{ℓ−1}^{(i)} by equally many edges to any C_{ℓ−1}^{(j)}, j ≠ i, such that degrees increase by exactly n^ε − 1 on each level. Thus, C_k has uniform degrees of (k + 1)(n^ε − 1). Its diameter is bounded by 2^{k+1} − 1, as the diameter of C_0 is 1 and the diameter of C_ℓ is at most twice that of C_{ℓ−1} plus 1.^{7}^{7}7Any copy of C_{ℓ−1} is connected to all other copies, hence we can follow a shortest path in C_{ℓ−1} to one endpoint of such an edge, traverse it, and follow another shortest path in C_{ℓ−1} to the destination. We remark that these graphs are not Cayley graphs (cf. [14]).
Note that if for any k ∈ N we set ε := 1/(k + 1) and n := m^{k+1} for some m ∈ N, both n^ε and 1/ε are integer, i.e., C_k is defined. In other words, it is possible to choose ε arbitrarily small and n arbitrarily large.
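To make the recursive construction concrete, the following sketch builds a small CLEX-style graph. The specific wiring (nodes carry one label per level and are adjacent iff their labels differ in exactly one position) is an illustrative choice consistent with the stated degree growth of b − 1 per level, not necessarily the paper's exact edge assignment:

```python
import itertools

def clex(levels, b):
    """Toy CLEX-style graph: level 0 yields cliques of size b (nodes
    agreeing on all but the last label), and each of the `levels`
    higher levels contributes b - 1 further edges per node towards the
    other copies. Returns the node list and the edge list.
    Illustrative wiring, not the paper's exact edge assignment."""
    nodes = list(itertools.product(range(b), repeat=levels + 1))
    edges = [
        (u, v)
        for u, v in itertools.combinations(nodes, 2)
        if sum(a != c for a, c in zip(u, v)) == 1  # differ in one label
    ]
    return nodes, edges
```

With this wiring every node has uniform degree (levels + 1)(b − 1), mirroring the uniform-degree property of the construction; e.g., clex(2, 3) has 27 nodes of degree 6 and thus 81 edges.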
The CLEX topology can be realized with, e.g., a grid-like node positioning. We ensure that nodes that are connected on low levels are close to each other by arranging them in cubes. On each new level, we simply arrange an appropriate number of cubes into a larger cube. Such a system will not experience a significant stretch in distances due to embedding issues.
As will emerge from the analysis, it is feasible to replace a node’s n^ε − 1 edges on each level ℓ ≥ 1 by a single link of capacity n^ε − 1 to one of the endpoints of these edges (such that each node also gets exactly one incoming link of this capacity). This is also compatible with the solution of Problem II.2 proposed in Section II-C. Note that this way, node degrees become n^ε − 1 + 2k, with merely 2k long-range links (i.e., links that are not on the basic level). Clearly, this is to be preferred in any real-world system; however, for ease of presentation, we stick to uniform edge capacities in our exposition.
II-C All-to-All Communication
Problem II.2 has simple solutions both on the torus and on CLEX graphs. On the torus, the nodes first exchange all messages in x-direction, then in y-direction, and finally in z-direction. This defines for each message a tree with the source as root, along which the message is flooded. Thus, this scheme is bandwidth-optimal up to a factor of three. If links are congested, i.e., the necessary traffic exceeds the available bandwidth, asymptotic optimality with regard to delays follows from this observation. On the other hand, in the absence of congestion the solution is also delay-optimal, as the trees have minimal depth. For the CLEX design, things are less obvious.
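The dimension-by-dimension flooding scheme can be sketched as follows, modeling message sets abstractly (each node's "message" is simply its own identifier, and each exchange phase unions the sets along one ring):

```python
import itertools

def all_to_all_torus(p):
    """All-to-all on a p x p x p torus by flooding dimension by
    dimension: first every node learns all messages in its x-ring,
    then the y-rings spread those sets, finally the z-rings.
    Returns, per node, the set of source identifiers it has learned."""
    known = {v: {v} for v in itertools.product(range(p), repeat=3)}
    for axis in range(3):
        spread = {}
        for v in known:
            acc = set()
            for i in range(p):  # union over the ring through v along axis
                u = list(v)
                u[axis] = i
                acc |= known[tuple(u)]
            spread[v] = acc
        known = spread
    return known
```

After the three phases every node holds all p^3 messages, which is exactly the tree-flooding behaviour described above.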
We will show that physical average and maximal routing path lengths can also be kept close to the optimum in CLEX systems. For the sake of simplicity, throughout this paper we assume that the torus is (locally) a perfect three-dimensional grid of nodes.^{8}^{8}8Due to physical constraints, the embedding will incur an additional stretch (cf. [20]). We assume that the CLEX topology is realized in a hierarchical cube structure as described in Section II-B. We assume that cable connections are as short as possible for both considered topologies.^{9}^{9}9However, one might want to arrange connections in a CLEX system in a more convenient manner, resulting in a small increase in cable length whose influence we neglect.
Using the previously explained scheme to solve Problem II.2 on the torus, messages travel on average about (3/4) n^{1/3} many hops. Hence, the maximal delay in an uncongested setting would be roughly proportional to this value. Observe that no architecture can perform significantly better, as processors cannot be packed much more densely because of cooling issues, and physical routing path lengths are optimal up to a small constant factor.
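The average hop count on a torus can be computed exactly by summing ring distances per dimension, since the three coordinates contribute independently; a small sketch:

```python
def avg_torus_hops(p):
    """Average shortest-path hop count between two uniformly random
    nodes of a p x p x p torus: three times the mean wrap-around
    distance on a single ring of p nodes."""
    ring = [min(d, p - d) for d in range(p)]  # distances on one ring
    return 3 * sum(ring) / p
```

For even p this evaluates to exactly 3p/4; e.g., a million-node torus (p = 100) gives an average of 75 hops.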
The strategy to solve Problem II.2 on C_k is very similar to the one for tori. Each message is flooded along a tree induced by shortest paths (with respect to hop distance) from its source to all nodes, where links on lower levels are preferred (because they bridge shorter distances), giving a bandwidth-optimal solution up to a constant factor, since links on level one have to deal with most of the load. Note that an asymmetric bandwidth assignment to the different levels reduces this factor, cf. Section III-A. Messages are delivered to all destinations after travelling at most one edge on each level. Since we assumed that processors are arranged in a cubic grid and links are direct connections, maximal link lengths grow geometrically with the level, i.e., maximal propagation delays are bounded by a geometric series dominated by the length of the top-level links.
Hence, we achieve a constant-factor approximation to physically optimal delays in C_k, which on three-dimensional tori is impossible. For the test settings presented in Section III, the relevant constant is close to 1, i.e., the algorithm will perform at least as well as any solution on a torus interconnection network with regard to propagation delays.
Moreover, as observed in Section II-B, the diameter of C_k is at most 2^{k+1} − 1, i.e., messages make at most that many hops. For fixed k, we thus achieve asymptotically optimal maximal delays proportional to the maximal spatial distance between any two nodes. For the parameter values considered in Section III, the number of hops is reduced by a large factor in comparison to a torus network of the same size. As in a torus network the number of hops will typically be the dominant factor contributing to delays in all-to-all communication, a CLEX network promises a considerable improvement.
II-D Point-to-Point Communication
In the following, w.l.o.g. we assume that in Problem II.1 S = R, as is the case, e.g., if the number of messages each node needs to send is upper bounded by S (and we want to show a bound that is essentially linear in S). Rerouting each message through a uniformly and independently at random (u.i.r.) chosen intermediate node (i.e., applying Valiant’s trick [24]), we need to solve the two slightly simpler problems that each node needs to send at most S messages whose destinations are distributed u.i.r., or that each node needs to receive at most S messages whose origins are distributed u.i.r. Note that these problems are (asymptotically speaking) indeed less difficult, as applying Chernoff’s bound we see that w.h.p., in the first case each node needs to receive at most O(S + log n) messages, while in the second case, no node initially holds more than O(S + log n) many messages. Thus, for the sake of analyzing the asymptotic complexity of the problem, w.l.o.g. we assume in the following that message destinations are distributed u.i.r.
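Valiant's trick can be illustrated with a short sketch: even a worst-case pattern in which all messages target a single hotspot induces a near-uniform load on the randomly chosen intermediates. The hotspot scenario and the load measure below are our own illustrative choices:

```python
import random
from collections import Counter

def valiant_load(n, rng):
    """Valiant's trick, first phase only: every message is sent to a
    uniformly random intermediate node. Even an adversarial pattern
    (here: all n messages aimed at node 0) becomes uniformly
    distributed traffic. Returns the maximum number of messages any
    intermediate must relay."""
    messages = [(src, 0) for src in range(n)]  # all destined to node 0
    relays = Counter(rng.randrange(n) for _src, _dst in messages)
    return max(relays.values())
```

With n messages and n intermediates this is the classic balls-into-bins experiment, so the maximum relay load is O(log n / log log n) w.h.p. rather than n at the hotspot.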
We proceed by defining and analyzing the Algorithms A_0, …, A_k, which solve Problem II.1 on C_0, …, C_k. In case of A_0, the communication graph is simply the clique C_0, thus we can use an algorithm suitable for complete graphs. The advantage of full connectivity is that any node may serve as relay for any message, reducing the routing problem to a load balancing task. We follow the approach from [23]. For the sake of clarity, we present a simplified algorithm to illustrate the concept. Initialize k := 1 and the set of yet undelivered messages at each node. The algorithm executes the following loop until all messages are delivered:

1. Create k copies of each message. Distribute these copies uniformly at random among all nodes, but under the constraint that (up to one) all nodes receive the same number of messages.

2. To each node, forward one copy of a message destined to it (if any has been received in the previous step; any choice is feasible). Confirm this to the original sender of the message.

3. Delete all messages for which confirmations have been received and all currently held copies of messages.

4. Set k := k^2 and start the next phase.
Intuitively, this algorithm exploits that the number of messages that still need to be delivered falls rapidly, thus enabling the nodes to try routing increasingly many redundant copies of the remaining messages without causing too much traffic. If just one of these copies can be delivered, the message will not participate in the subsequent phase. Hence the number of messages will fall by a factor that is exponential in the number of copies per message, permitting the use of an exponentially larger number of copies in the next phase without overloading the communication network.
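A minimal and deliberately abstract sketch of this phase structure is given below; it ignores per-link capacities and uses an assumed growth schedule for the number of copies, so it illustrates the rapidly shrinking message set rather than the exact algorithm of [23]:

```python
import random

def clique_balance(n, msgs_per_node, rng):
    """Sketch of redundant-copies load balancing on a clique of n
    nodes. Each phase, every undelivered message places k copies on
    random relays; every relay forwards one held copy per destination,
    which delivers it. The redundancy k grows rapidly (assumed
    schedule); per-link capacities are ignored in this abstraction.
    Returns the number of phases until all messages are delivered."""
    dest = {m: rng.randrange(n) for m in range(n * msgs_per_node)}
    undelivered = set(dest)
    k, phases = 1, 0
    while undelivered:
        held = {}  # (relay, destination) -> ids of copies held there
        for m in undelivered:
            for _ in range(k):
                held.setdefault((rng.randrange(n), dest[m]), []).append(m)
        for copies in held.values():
            undelivered.discard(rng.choice(copies))  # one copy forwarded
        k = min(k * k + 1, n)  # grow redundancy: 1, 2, 5, 26, ...
        phases += 1
    return phases
```

In line with the intuition above, the number of surviving messages collapses after the first phase or two, so the loop terminates after only a handful of phases even under near-saturated load.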
The techniques and proofs presented in [23] yield the following bound on the running time of this simple algorithm for the special case of S = R.
Corollary II.4
Provided that S = R, the above algorithm solves Problem II.1 on C_0 in O(S/n^ε + log* n) synchronous rounds w.h.p., where log* denotes the inverse tower function.^{10}^{10}10Formally: log* x := 0 for x ≤ 1 and log* x := 1 + log*(log₂ x) for x > 1. This function grows exceptionally slowly; log* x ≤ 5 for x ≤ 2^{65536}.
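The inverse tower function can be computed directly; the normalization below (count how often log₂ must be applied before the value drops to at most 1) is one common convention and may differ by a constant from the one used in [23]:

```python
import math

def log_star(x):
    """Inverse tower function log*: number of times log2 must be
    applied to x before the result is at most 1. Grows exceptionally
    slowly, e.g. log_star(2**65536) == 5."""
    count = 0
    while x > 1:
        x = math.log2(x)
        count += 1
    return count
```

For any physically realizable machine size, log* n is at most 5, which is why the additive term in the corollary is negligible in practice.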
For ease of presentation, we do not discuss asynchronicity (which is dealt with by round counters) or the case that S ≠ R (requiring to adapt the growth of k) here. In [23] appropriate modifications of the given algorithm are discussed, leading to the following more general result.^{11}^{11}11In fact, [23] presents an asymptotically optimal solution without the additive log* overhead. However, for any practical purposes, log* n is a constant, and the “optimal” solution is more complex, less robust, and for reasonable parameters slower than the given algorithm.
Corollary II.5
An algorithm A_0 exists that solves Problem II.1 on C_0 in an asynchronous system within time O((S + R)/n^ε + log* n) w.h.p.
It is important to note that A_0 is not uniform, i.e., (an appropriate estimate of) S needs to be known to the nodes in order to execute the algorithm. However, it is not difficult to guarantee this in a practical system by monitoring the network load and updating the nodes frequently. Also, instead of the “one-shot” version of the problem described, a perpetually running solution is required that handles the network traffic generated over time. We argue, however, that in light of the results from [23], it is feasible to study the simplified version of the problem in order to assess the potential gain of our approach.
Having A_{i−1} in place, we rely on recursion to solve the task on Level i:

1. Calling A_{i−1}, each node in a copy C_{i−1}^{(j)} sends each of its messages to a node in C_{i−1}^{(j)} whose edges in C_i lead to the copy of C_{i−1} containing the destination of the message, choosing u.i.r. from the nodes fulfilling this criterion.

2. Each node forwards the received messages over its edges in C_i to the copy of C_{i−1} they are destined for, balancing the load on these edges.

3. A_{i−1} is called again to forward all messages to their destinations.
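The three steps above can be sketched on a toy two-level "clique of cliques"; the wiring (node i of clique c owns the inter-clique link to clique i) and the deterministic relay choice are simplifying assumptions, whereas the real algorithm chooses the relay u.i.r. among all eligible nodes:

```python
def route(src, dst):
    """Route a message on a clique of cliques: nodes are pairs
    (clique, index); node (c, i) is assumed to own the long-range link
    to clique i, whose other endpoint is (i, c). The path makes at
    most one in-clique hop to the relay, one inter-clique hop, and
    one final in-clique hop to the destination (<= 3 hops total)."""
    (c0, i), (c1, _j) = src, dst
    path = [src]
    if c0 != c1:
        if i != c1:
            path.append((c0, c1))  # step 1: reach the relay in-clique
        path.append((c1, c0))      # step 2: traverse the long-range link
    if path[-1] != dst:
        path.append(dst)           # step 3: deliver inside the clique
    return path
```

For example, route((0, 1), (2, 3)) yields the 3-hop path (0,1) → (0,2) → (2,0) → (2,3), matching the recursion depth of one call on each side of the inter-copy hop.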
We will now show that this algorithm is asymptotically optimal with respect to the number of hops (i.e., required rounds), up to a small additive term inherited from A_0.
Theorem II.6
Algorithm A_i solves Problem II.1 on C_i. Its running time is w.h.p. bounded by O((S + R)/n^ε + log* n), where the constants depend on i.
Proof. We prove the statement by induction on i, i.e., we show that for any i ≤ k, A_i solves Problem II.1 on C_i within the stated number of rounds w.h.p. For i = 0 this claim immediately follows from Corollary II.5. Observe that for i > 0, A_i will eventually deliver all messages to their destinations since A_{i−1} does, i.e., it is sufficient to show the stated bound on the running time. Moreover, note that it does not matter how the constants in the O-term grow with i, since i ≤ k is constantly bounded.
Assume that the claim is correct for some i − 1 ≥ 0. We show that whenever A_i calls A_{i−1}, w.h.p. at most O(S + R) many messages have to be sent or received by any node. Recalling that message destinations are w.l.o.g. distributed u.i.r., we have that the messages that have destinations in some given copy of C_{i−1} are distributed u.i.r. among all copies of C_{i−1}. Hence, between any pair C_{i−1}^{(j)}, C_{i−1}^{(j′)} of such copies, the expected number of messages that need to be exchanged is the same. Thus, Chernoff’s bound yields that w.h.p. not substantially more messages than this expectation need to be sent from C_{i−1}^{(j)} to C_{i−1}^{(j′)}.^{12}^{12}12Note that a simple application of the union bound shows that for any polynomial number of events that occur w.h.p., it holds that all of them occur concurrently w.h.p.
Afterwards, applying Chernoff’s bound again, we infer that the number of messages a single node needs to receive in Step 1 is w.h.p. at most O(S + R), as the same fraction of the nodes in each copy have edges in C_i that lead to any given other copy. By the induction hypothesis, each call of A_{i−1} in Step 1 will thus terminate within the stated number of rounds w.h.p. Moreover, as for each node all of its edges in C_i lead to the same copy of C_{i−1}, Step 2 terminates in O((S + R)/n^ε) rounds w.h.p. In addition, this implies that no node will have to send more than O(S + R) messages in Step 3 w.h.p. Hence, we can apply the induction hypothesis again in order to see that Step 3 terminates within the stated number of rounds w.h.p. This concludes the induction step and the proof.
A closer examination of the involved constants reveals that they grow exponentially in i. However, in recursive calls the algorithm exclusively uses links on lower levels. Since the number of nodes on each level grows rapidly, the physical distances between the nodes grow by more than a factor of 2 for each added level. Thus, the overall routing path length is bounded by a geometric series, with a constant limit times the lengths of links on the top level. On the other hand, since we will not choose ε too small, the number of routing hops is still small.
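The geometric-series argument can be checked numerically; the shrink factor of 2 per level is the bound stated above:

```python
def total_wire_length(top_link_len, levels, shrink=2.0):
    """Physical length of a routing path that traverses one link per
    level: link lengths shrink by at least a factor `shrink` per level
    going down, so the total is a geometric series bounded by a
    constant (here 1/(1 - 1/shrink) = 2) times the top-level length."""
    return sum(top_link_len / shrink**l for l in range(levels + 1))
```

For instance, with a 100-foot top-level link and three lower levels the total is 187.5 feet, and no matter how many levels are added the total stays below 200 feet, i.e., below twice the top-level link length.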
We remark that it is possible to generalize Theorem II.6 to non-constant values of ε; however, choosing ε too small is not desirable, since the number of routing hops grows exponentially in the number of levels.
III Performance Estimation
In this section, we study the practical merits that are to be expected from implementing the proposed communication strategy. We base our reasoning on simulation results and discuss in detail what effects on bandwidth and delays of arbitrary pointtopoint communication can be deduced in comparison to a torus network. Furthermore, we briefly address some advantages regarding the robustness of the proposed routing scheme.
III-A Dense Traffic (Bandwidth Comparison)
Theorem II.6 states an asymptotic result, i.e., for sufficiently large n and any constant ε > 0, out-degrees of O(n^ε) suffice to guarantee good load balance and a running time that is only a constant factor larger than the optimum. However, it is not clear how large the number of nodes needs to be for a certain value of ε in order to ensure good performance. The strong probability bounds obtained in Section II-D indicate that the approach is quite robust; therefore, good results for practical values of n can be expected.
In order to estimate the bandwidth and delays a CLEX system will feature in comparison to a torus grid of the same size and total bandwidth, we performed simulations of the proposed point-to-point communication algorithm on two CLEX instances of different sizes, with four and three levels, respectively.^{13}^{13}13Sequoia, featuring a Blue Gene/Q architecture, will comprise 1.6 million processors and is expected to go into service in 2012 [25]. Due to memory constraints, we confined ourselves to simulating the algorithm synchronously and solving recursive calls iteratively one after another. As pointed out earlier, both algorithm and analysis are resilient to asynchronicity; hence, neither parallel nor sequentially executed recursive calls interfere with each other, implying that the obtained results should allow for a valid performance estimation of a real-world system.
Furthermore, we adapted the algorithm from Section II-D slightly. Firstly, we are primarily interested in the case of uniformly distributed traffic, i.e., there is no need for the algorithm to establish a uniform distribution of messages by itself. Thus, we do not apply Valiant’s trick here, but rather start with uniformly distributed destinations. Note that in case of “somewhat, but not entirely uniform” distributions, it is easy to apply a “lightweight” version of Valiant’s trick: just redistribute the messages uniformly within, e.g., low-level clusters. This drastically reduces the factor-2 overhead incurred by Valiant’s trick, both with respect to the number of hops and the distance messages travel. Secondly, for i ≥ 1, in Step 2 of A_i we choose the subset of neighbors receiving one message more than the others uniformly at random; this slightly improves the load balance. Thirdly, when calling A_0 on the subgraphs C_0, nodes initially send along each link one message (if available) directly to its destination. Hence, a large fraction of the messages requires only one hop to reach their (interim) destinations. Finally, to further save bandwidth, nodes may refrain from sending several copies of the remaining messages to potentially relaying nodes. Rather, they merely request a message to be forwarded to its target by a neighbor, which requires a negligible^{14}^{14}14Message headers must contain the target node ID for routing purposes (20 bits for a million nodes) and probably some other information like, e.g., a timestamp. Certainly the payload of a message should be considerably larger. number of bits (the destination’s identifier in the lowest-level clique plus a phase counter for all phases after the first in an asynchronous execution of algorithm A_0, cf. Figure 3). These bits may also be piggybacked on another message. Then, after receiving a positive acknowledgement, the actual message is sent.
Though this will delay the messages that are not delivered immediately by two more rounds, we will later see that the corresponding delays do not significantly contribute to the total time until a message is delivered.
In a first simulation experiment, we consider almost saturated channels, i.e., each node initially is the source of a number of messages close to the capacity of its links. Unless the communication system gets overloaded (i.e., more messages arrive than can be delivered quickly), it is reasonable to expect that not all nodes generate the same number of messages. However, due to the randomized allocation of messages to relaying and target nodes, larger average loads support a balanced load distribution (and thus throughput). Hence, we chose to initially assign to all nodes roughly 90% of the messages that can be transferred on a level in a single round, thus permitting a worst-case estimation of the bandwidth utilization under full load while getting useful results with regard to message delays. Message destinations follow a uniformly random permutation of the multiset containing each node the respective number of times, i.e., each node has to send and receive the same number of messages. Thus, the results afford an easy comparison to the values for all-to-all communication given in Section II-C.
For each level, we measured four values: the maximal number of rounds (excluding recursive calls) that any instance of the algorithm required, the average number of rounds (excluding recursive calls) messages spent on this level in total, the maximal average load per node that any instance had to deal with, and the (average) number of edges messages traversed on the level during the course of the complete algorithm. The outcomes of the measurements are listed in Tables I and II.
TABLE I
lvl. | max. rds. | avg. rds. | max. avg. load | avg. hops
-----|-----------|-----------|----------------|----------
  1  |    11     |   13.69   |     33.44      |   10.63
  2  |     2     |    4.11   |     30.33      |    4
  3  |     2     |    2.05   |     28.06      |    2
  4  |     2     |    1.03   |     28         |    1
TABLE II
lvl. | max. rds. | avg. rds. | max. avg. load | avg. hops
-----|-----------|-----------|----------------|----------
  1  |     9     |    6.90   |     62.06      |    5.34
  2  |     2     |    2.03   |     57.30      |    2
  3  |     2     |    1.01   |     57         |    1
We see that loads are well-balanced on all levels; due to the small number of nodes on the top level, some of the instances invoked there are slightly overloaded. Accordingly, the vast majority of the messages can be forwarded immediately on all but the first level, where a different routing scheme is employed. On the first level, a small but relevant fraction of the messages cannot be forwarded at once, leading to delays that are roughly 75% larger than the minimum possible.^{15} (^{15}Each round incurs one “hop” delay, since processors need to decide how to deal with a message, and one “propagation” delay depending on the respective length of connections on that level.) These messages cause an increase of about 30% in traffic, since they are relayed by other nodes, requiring one additional hop. The large maximal number of rounds the algorithm takes to complete is in accordance with theory. The algorithm runs for a small number of phases w.h.p., where in our implementation the first phase takes 1 round and each subsequent phase 2 rounds; we incur a delay of 2 more rounds due to the modification that relaying messages is preceded by an acknowledged request (except for the first phase). Consequently, the algorithm should terminate within a correspondingly small number of rounds, which is true for most instances; the large number of recursive calls and the fact that some of the calls have higher average loads than the nodes’ degrees explain the differences. Figure 3 depicts the number of remaining messages of all invoked instances plotted against the number of phases passed.
Note that a single message is unlikely to experience large delays in all calls it participates in. The total number of rounds a message spends on a given level can be stochastically bounded from above by the sum of independent random variables describing the number of rounds that pass until a message is forwarded in the most loaded instance. Thus, Chernoff-type bounds apply, giving exponential tail bounds on the probability that this random variable exceeds its expectation. As almost all messages are forwarded immediately on higher levels, we have a strong indication that few messages will be delayed more than 2 respectively 2.5 times the expected average delay, both with respect to hops and propagation time.
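The tail-bound argument can be made explicit. Writing the total delay as a sum $X = \sum_i X_i$ of independent per-instance delay bounds, a standard multiplicative Chernoff bound (a textbook form, not a formula from the paper) yields exponentially decaying tails:

```latex
\Pr\bigl[X \ge (1+\delta)\,\mathbb{E}[X]\bigr]
  \;\le\;
  \exp\!\left(-\frac{\delta^{2}\,\mathbb{E}[X]}{3}\right),
\qquad 0 < \delta \le 1 .
```

Since $\mathbb{E}[X]$ grows with the number of levels a message traverses, the probability of exceeding the expectation by a constant factor falls off exponentially.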
Due to the increase of the number of nodes on each level by a factor of 32 (64), the physical distances between the processors, and hence the lengths of connecting cables, grow by the cube root of this factor per level.^{16} (^{16}We assume that physical distances on Level 1 are not determined by the volume required by the links connecting processors, but rather by cooling requirements, i.e., the number of processors fitting into a cube grows with the third power of its edge length, the limiting factor being the minimal feasible distance between processors. Otherwise, the increase in cable length per level could be even larger.) On the top level, link lengths will be in the order of the network diameter. Shortest-path routing in a torus grid bridges, on average, distances similar to one hop on the top level of the CLEX network.^{17} (^{17}Strictly speaking, the distortion of the network embedding and the fact that messages do not take physically shortest paths imply that on the torus topology total delays are a constant factor larger than the average delay of top-level links in the CLEX network. However, since we do not quantify these influences, we do not incorporate them into our analysis.) From the measured hop counts (cf. Tables I and II) we conclude a worst-case bound on the competitive ratio with respect to propagation delays, in comparison to the theoretical optimum in a torus grid that does not suffer from congestion. In contrast, the average number of hop delays decreases by even larger factors.
Recalling that delays in torus networks are dominated by the time it takes to forward messages, we deduce that CLEX architectures will feature significantly smaller overall message delays even when close to maximal communication load.^{18} (^{18}We remark that our estimates do not cover a possibly increased hop delay in the CLEX system imposed by the larger node degrees. As most hops are inside Level 1 clusters, where addresses of only a few more bits need to be resolved than the three bits for grid links, one can still expect delays that are considerably smaller.)
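The per-level growth of link lengths asserted above follows directly from the cooling assumption in footnote 16; a quick check of the arithmetic (illustrative only):

```python
# Under the cooling assumption, n processors occupy a cube whose volume
# is proportional to n, so its edge length scales as n ** (1/3).
# Growing the node count by a factor of 32 (respectively 64) per level
# therefore stretches link lengths by the cube root of that factor.
growth_32 = 32 ** (1 / 3)   # per-level growth for 32-fold node increase
growth_64 = 64 ** (1 / 3)   # per-level growth for 64-fold node increase

assert 3.1 < growth_32 < 3.2        # roughly 3.17 per level
assert abs(growth_64 - 4.0) < 1e-9  # exactly 4 per level
```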
Next, we compare the bandwidth provided to each node to the theoretical optimum in a three-dimensional torus interconnection network. In both settings, the topology appears identical to each node. Therefore, it is reasonable to assume that each node has the same total bandwidth capacity. Moreover, the amounts of inbound and outbound communication are identical, implying that we can confine our considerations to outgoing messages.
For symmetry reasons, in a torus network with the same number of nodes in each spatial direction, an optimal scheme assigns each of a node’s six links the same capacity. We partition the nodes into two sets of equal size such that the corresponding cut is minimal.^{19} (^{19}If the lengths of the cycles differ in the x, y, and z directions, we need to consider different minimum cuts given by planes orthogonal to each dimension. It is easy to see that for one of the cuts, the bandwidth-to-nodes ratio is at least as bad as in the symmetric case.) If message destinations are distributed u.i.r. (uniformly and independently at random), in expectation every second message needs to pass this cut. Hence, with regard to uniformly distributed traffic, the effective average bandwidth with regard to Problem II.1 that can be provided to each node is bounded from above by the capacity of this cut divided by the number of nodes whose traffic crosses it.
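The resulting upper bound can be sketched numerically. The function name and the exact counting conventions here (a bisection of 2·n² links, node capacity split over six links) are our own illustrative assumptions, not figures from the paper:

```python
def torus_bandwidth_bound(n, total_node_bw):
    """Sketch of the per-node effective bandwidth bound for uniform
    traffic in an n x n x n torus: each node splits its capacity over
    its 6 links, the minimum bisection is assumed to consist of
    2 * n**2 links (cut plane plus wrap-around links), and in
    expectation every second message must cross it."""
    link_capacity = total_node_bw / 6
    cut_capacity = 2 * n * n * link_capacity  # one direction across the cut
    senders = n ** 3 / 2                      # nodes on one side of the cut
    crossing_fraction = 0.5                   # u.i.r. destinations
    return cut_capacity / (senders * crossing_fraction)

# For n = 100 (a million nodes), the bound is a tiny fraction of each
# node's total capacity of 6.0 bandwidth units:
bound = torus_bandwidth_bound(100, 6.0)
assert abs(bound - 0.08) < 1e-9   # about 1.3% of the total capacity
```

The bound shrinks linearly in n, matching the observation in the introduction that large torus machines utilize only a small percentage of their total bandwidth in the worst case.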
In the CLEX network, we assign bandwidth according to the simulation results, i.e., each node first divides its bandwidth among the levels according to the weights given by the average number of hops messages travelled on each level (cf. Tables I and II), and then divides the bandwidth on each level evenly among the links on that level. Each message consumes one unit of bandwidth per hop. We conclude that the gain in effective point-to-point bandwidth compared to the theoretical maximum for a torus architecture will be roughly an order of magnitude.
Recall that the proposed asymmetric assignment of bandwidth also improves the efficiency of the simpler mechanism for all-to-all communication presented in Section II-C. Since most communication takes place on Level 1, to which we assigned the majority of the bandwidth, we achieve a bandwidth utilization for Problem II.2 that is at least 2-competitive, regardless of the system size.
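The hop-weighted bandwidth assignment described above can be sketched as follows; the function name is hypothetical, the hop counts in the example are the “avg. hops” column of Table I, and the per-level link counts are placeholder values:

```python
def per_link_bandwidth(avg_hops, links_per_level, total_bw=1.0):
    """Divide a node's total bandwidth across levels proportionally to
    the average number of hops messages travel on each level, then
    split each level's share evenly among that level's links."""
    weight = sum(avg_hops)
    level_share = [total_bw * h / weight for h in avg_hops]
    return [share / links for share, links in zip(level_share, links_per_level)]

# Four-level hop averages from Table I with dummy link counts of 1:
shares = per_link_bandwidth([10.63, 4, 2, 1], [1, 1, 1, 1])
assert abs(sum(shares) - 1.0) < 1e-9  # the shares exhaust the node's bandwidth
```

Since Level 1 carries the most hops, it automatically receives the majority of the bandwidth under this assignment.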
III-B Light Traffic (Delay Comparison)
Total message delays will be smaller if traffic is less dense, since most messages can then be forwarded immediately. Consequently, for a fair comparison of delays, we consider light traffic matching the maximum throughput of a torus network; from the previous results we infer the corresponding initial loads. Moreover, since saving bandwidth is no longer crucial, we can refrain from requesting message indirection on the lowest level prior to sending complete messages. Apart from these two modifications, the test settings are identical. The results are given in Tables III and IV.
TABLE III
lvl. | max. rds. | avg. rds. | max. avg. load | avg. hops
-----|-----------|-----------|----------------|----------
  1  |     5     |    9.02   |      9.02      |   10.53
  2  |     1     |    4      |      7.32      |    4
  3  |     1     |    2      |      4.02      |    2
  4  |     1     |    1      |      4         |    1
TABLE IV
lvl. | max. rds. | avg. rds. | max. avg. load | avg. hops
-----|-----------|-----------|----------------|----------
  1  |     5     |    4.32   |     10.36      |    5.11
  2  |     1     |    2      |      5.09      |    2
  3  |     1     |    1      |      5         |    1
We observe that dropping the mechanism to save bandwidth reduces delays on Level 1 significantly, while due to the smaller loads the average bandwidth consumption per message is roughly the same as before. Repeating the previous calculations for the new data, we see that the worst-case ratio of average propagation delays to the average time a signal requires to follow physically shortest paths slightly improves (cf. Tables III and IV). The required number of hops decreases considerably, widening the gap to torus interconnection networks even further.
Moreover, we see that all messages can be forwarded immediately on all but the lowest level, and the algorithm terminates after at most 5 rounds in all instances.
IV Conclusion
In this work, we proposed the CLEX interconnection and routing scheme for supercomputers. Our results emphasize the advantages of small diameters when aiming for small delays and high bandwidth utilization in the face of growing numbers of processors. We simulated CLEX configurations comprising half a million and a million nodes. The results indicate performance gains of roughly an order of magnitude for point-to-point communication in comparison to three-dimensional torus topologies. This comparison is based on the principal limitations of a torus topology; for instance, it does not account for the fact that a real-world routing mechanism will not be able to concurrently propagate all messages along shortest paths. Certainly, this performance gap will more than compensate for an increased local switching time due to larger node degrees, and it might justify the larger expense for the routing hardware. We believe this to be particularly true in the future, since in the past (parallel) computation power grew much faster than communication capacity, and there is no sign that this trend will stop anytime soon.
References
 [1] G. B. Adams, III, D. P. Agrawal, and H. J. Siegel, “A Survey and Comparison of Fault-Tolerant Multistage Interconnection Networks,” Computer, vol. 20, no. 6, pp. 14–27, 1987.
 [2] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber, “HyperX: Topology, Routing, and Packaging of Efficient Large-Scale Networks,” in Proc. Conference on High Performance Computing Networking, Storage and Analysis (SC). New York, NY, USA: ACM, 2009, pp. 1–11.
 [3] J. Goodman and C. Sequin, “Hypertree: A Multiprocessor Interconnection Topology,” IEEE Transactions on Computers, vol. C-30, no. 12, pp. 923–933, Dec. 1981.
 [4] R. I. Greenberg and C. E. Leiserson, “Randomized Routing on Fat-Trees,” in Advances in Computing Research, vol. 5, 1989, pp. 345–374.
 [5] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu, “DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers,” in Proc. ACM SIGCOMM 2008 Conference on Data Communication (SIGCOMM ’08). New York, NY, USA: ACM, 2008, pp. 75–86.
 [6] S. Johnsson and C.-T. Ho, “Optimum Broadcasting and Personalized Communication in Hypercubes,” IEEE Transactions on Computers, vol. 38, no. 9, pp. 1249–1268, 1989.
 [7] J. Kim, W. J. Dally, and D. Abts, “Flattened Butterfly: A Cost-Efficient Topology for High-Radix Networks,” in ISCA, 2007, pp. 126–137.
 [8] F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, and Hypercubes. San Mateo, CA, USA: Morgan Kaufmann, 1992.
 [9] C. E. Leiserson, Z. S. Abuhamdeh, D. C. Douglas, C. R. Feynman, M. N. Ganmukhi, J. V. Hill, W. D. Hillis, B. C. Kuszmaul, M. A. St. Pierre, D. S. Wells, M. C. Wong-Chan, S.-W. Yang, and R. Zak, “The Network Architecture of the Connection Machine CM-5,” J. Parallel Distrib. Comput., vol. 33, no. 2, pp. 145–158, 1996.
 [10] J. Leonard, “The Kautz Digraph as a Cluster Interconnect,” in Proc. International Conference on High Performance Computing, Networking and Communication Systems (HPCNCS), 2007, pp. 53–58.
 [11] M. T. Raghunath and A. Ranade, “Designing Interconnection Networks for Multi-Level Packaging,” in Proc. 1993 ACM/IEEE Conference on Supercomputing (Supercomputing ’93), 1993, pp. 772–781.
 [12] H. J. Siegel, W. G. Nation, C. P. Kruskal, and L. M. J. Napolitano, “Using the Multistage Cube Network Topology in Parallel Supercomputers,” Proc. of the IEEE, vol. 77, no. 12, pp. 1932–1953, 1989.
 [13] Y. Zhu, M. Taylor, S. Baden, and C.-K. Cheng, “Advancing Supercomputer Performance through Interconnection Topology Synthesis,” in Proc. IEEE/ACM International Conference on Computer-Aided Design (ICCAD 2008), Nov. 2008, pp. 555–558.
 [14] M.-C. Heydemann, Graph Symmetry: Algebraic Methods and Applications. Kluwer Academic Publishers, 1997, ch. Cayley Graphs and Interconnection Networks, pp. 167–223.
 [15] E. Shamir and A. Schuster, “Communication Aspects of Networks Based on Geometric Incidence Relations,” Theor. Comput. Sci., vol. 64, pp. 83–96, 1989.
 [16] E. Upfal, “An O(log N) Deterministic Packet-Routing Scheme,” Journal of the ACM, vol. 39, pp. 55–70, 1992.
 [17] S. Even and A. Litman, “Layered Cross Product – A Technique to Construct Interconnection Networks,” Networks, vol. 29, no. 4, pp. 219–223, 1997.
 [18] N. R. Adiga, M. A. Blumrich, D. Chen, P. Coteus, A. Gara, M. E. Giampapa, P. Heidelberger, S. Singh, B. D. Steinmacher-Burow, T. Takken, M. Tsao, and P. Vranas, “Blue Gene/L Torus Interconnection Network,” IBM J. Res. Dev., vol. 49, no. 2, pp. 265–276, 2005.
 [19] P. Konecny, “Introducing the Cray XMT,” US National Center for Computational Sciences (NCCS), Tech. Rep., 2007.
 [20] H. Yu, I.-H. Chung, and J. Moreira, “Topology Mapping for Blue Gene/L Supercomputer,” in Proc. ACM/IEEE Conference on Supercomputing (SC ’06), 2006, p. 116.
 [21] J. Kim, W. J. Dally, S. Scott, and D. Abts, “Technology-Driven, Highly-Scalable Dragonfly Topology,” SIGARCH Comput. Archit. News, vol. 36, no. 3, pp. 77–88, 2008.
 [22] L. A. Barroso, “Warehouse-Scale Computing: Entering the Teenage Decade,” Federal Computing Research Conference (FCRC) plenary talk, 2011.
 [23] C. Lenzen and R. Wattenhofer, “Tight Bounds for Parallel Randomized Load Balancing,” in Proc. 43rd Symposium on Theory of Computing (STOC), 2011.
 [24] L. G. Valiant, “A Scheme for Fast Parallel Communication,” SIAM Journal on Computing, pp. 350–361, 1982.
 [25] M. Feldman, “Lawrence Livermore Prepares for 20 Petaflop Blue Gene/Q,” http://www.hpcwire.com/features/LawrenceLivermorePreparesfor20PetaflopBlueGeneQ38948594.html, 2009.