An Optimal Single-Path Routing Algorithm
in the Datacenter Network DPillar
Abstract
DPillar has recently been proposed as a server-centric datacenter network and is combinatorially related to (but distinct from) the well-known wrapped butterfly network. We explain the relationship between DPillar and the wrapped butterfly network before proving that the underlying graph of DPillar is a Cayley graph; hence, the datacenter network DPillar is node-symmetric. We use this symmetry property to establish a single-path routing algorithm for DPillar that computes a shortest path and has time complexity $O(k)$, where $k$ parameterizes the dimension of DPillar (we refer to the number of ports in its switches as $n$). Our analysis also enables us to calculate the diameter of DPillar exactly. Moreover, our algorithm is trivial to implement, being essentially a conditional clause of numeric tests, and improves significantly upon a routing algorithm earlier employed for DPillar. We provide empirical data to demonstrate this improvement: our routing algorithm improves the average length of paths found, the aggregate bottleneck throughput, and the communication latency. A secondary, yet important, effect of our work is that it emphasises that datacenter networks are amenable to a closer combinatorial scrutiny that can significantly improve their computational efficiency and performance.
Keywords: datacenter networks, routing algorithms, shortest paths, symmetry.
1 Introduction
Datacenters are assuming an increasingly important role in the global computational infrastructure. They provide platforms for a wide range of data-intensive applications and activities including web search, social networking, online gaming, large-scale scientific deployments and service-oriented cloud computing. There is an increasing demand that datacenters incorporate more and more servers, and do so in a cost-effective fashion, but still so that the resulting platform is computationally efficient (in various senses of the term).
A datacenter network (DCN) comprises the physical communication infrastructure underpinning a datacenter. One of the main aspects of a datacenter network is the topology by which the servers, switches and other components of the datacenter are interconnected; the choice of topology strongly influences the datacenter's practical performance (see, e.g., [19]). For simplicity, henceforth by DCN we refer to the datacenter network topology. Originally, DCNs were hierarchical with expensive core routers that became bottlenecks in terms of both performance and cost. They evolved into tree-like, switch-centric DCNs, built from commodity-off-the-shelf (COTS) components; that is, the servers are located at the 'leaves' of a tree-like structure that is composed entirely of switches, and the routing intelligence resides within the switches. Such DCNs can offer better load-balancing capabilities and so are less prone to bottlenecks, but they have limited scalability due to (the size of) routing tables within the switches. Typical examples of such switch-centric DCNs are ElasticTree [14], Fat-Tree [5], VL2 [10], HyperX [3], Portland [20] and Flattened Butterfly [1].
Alternative architectures have recently emerged and server-centric DCNs have been proposed whereby the interconnection intelligence resides within the servers as opposed to the switches. Now, switches only operate as dumb crossbars (and consequently the need for high-end switches is diminished, as are the infrastructure costs). This paradigm shift means that more scalable topologies can be designed, and the fact that routing resides within servers, which are easier to program than switches, means that more effective routing algorithms can be adopted. However, server-centric DCNs are not a panacea as packet latency can increase, with the need to handle routing imposing a computational overhead on the server. Typical examples of server-centric DCNs are DCell [12], BCube [13], FiConn [16], CamCube [2], MCube [23], DPillar [18], HCN and BCN [11] and SWCube, SWKautz, and SWdBruijn [17]. An additional positive aspect of some server-centric DCNs is that not only can commodity switches be used to build the datacenters but commodity servers can too; the DCNs FiConn, MCube, DPillar, HCN, BCN, SWCube, SWKautz, and SWdBruijn are all such that any server only needs two NIC ports (the norm in commodity servers) in order to incorporate it into the DCN.
It is with the DCN DPillar that we are concerned here. DPillar is an established dual-port server-centric DCN and one of the most promising benchmarks of its kind. Moreover, DPillar is one of the even fewer dual-port server-centric DCNs for which no server-node is adjacent to any other server-node, the others being SWKautz, SWCube, and SWdBruijn. DPillar has recently been compared with other dual-port server-centric DCNs [17]. It was shown that when the diameter of the DCN is normalized, DPillar can incorporate more servers than FiConn and BCN, a similar number of servers to SWCube, and (usually) fewer servers than SWKautz and SWdBruijn. However, DPillar, SWCube, SWKautz, and SWdBruijn were shown to have similar bisection widths and all have better bisection widths than FiConn and BCN. Whilst SWCube, SWKautz, and SWdBruijn were compared with each other in [17] with regard to aspects of routing in relation to fault-tolerance and handling congestion, there was no comparison of these three DCNs with DPillar. Such an evaluation is currently missing and would necessarily be tied to a particular routing algorithm for DPillar, an observation that we will return to in a moment.
As we shall see, DPillar is essentially obtained by replacing complete bipartite subgraphs in a wrapped butterfly network (see, e.g., [15]) with a switch with $n$ ports. In [18], basic properties of DPillar are demonstrated and single-path and multi-path routing algorithms are developed (along with a forwarding methodology for the latter). Our focus here is on single-path routing (also known as single-source deterministic routing). The algorithm in [18] is appealing in its simplicity but for most source-destination pairs it does not produce a path of shortest length; indeed, there is often a significant discrepancy between the length of the path produced by the algorithm in [18] and that of a shortest path (as we demonstrate later). We remedy this situation and develop a single-path routing algorithm that always outputs a shortest path. Although the proof of correctness of our algorithm is non-trivial, the algorithm itself is a very simple sequence of numeric tests and has the same time complexity as the original single-path routing algorithm, i.e., linear in the number of columns within DPillar.
Furthermore, we undertake an empirical evaluation and show that, according to our experiments, the original single-path routing algorithm for DPillar from [18] fails to provide a shortest-path route for more than 51%, and up to 78%, of the server pairs; this translates into our algorithm giving an improvement in the range of 20-30% in terms of the average path length derived. Note that a reduction in path length not only means that the latency of the network traffic will be reduced (by between 20 and 25%, in our experiments), but also that, as fewer resources are required for transmitting data, the overall throughput of the network should increase. To verify this latter contention, we empirically measure the aggregate bottleneck throughput (the most widely accepted datacenter throughput metric) for both algorithms and we find that our algorithm yields improvements in the range of 25-120%, with a mean of 65% and a median of 75%. The substantial improvements in average path length and throughput, together with the algorithmic simplicity of our proposal, more than motivate its utilization in production systems. As by-products of the development of our algorithm, we prove that the DCN DPillar is, in essence, a Cayley graph, and thus node-symmetric (that is, there is an automorphism mapping any server to any other server), and we obtain the diameter of the DCN DPillar exactly.
Let us now return to our earlier remark as regards the current lack of a comparison in the literature of DPillar with SWCube, SWKautz, and SWdBruijn with respect to aspects of routing in relation to fault-tolerance and handling congestion. Were we to embark on this comparison prior to the results of our paper then we would be doing a disservice to DPillar, as we would be working with the routing algorithm from [18], which we prove (and empirically validate) here to be significantly worse in all respects than the routing algorithm we develop in this paper. We intend in future to undertake an extensive evaluation of aspects of routing for dual-port server-centric DCNs including DPillar, SWCube, SWKautz, and SWdBruijn but, thanks to the results of this paper, this will now be with respect to our improved routing algorithm for DPillar (of course, such an evaluation is beyond the scope of this paper).
In the next sections, we give an explicit definition of the DCN DPillar, both algebraically and as a derivation from wrapped butterfly networks, before showing how to abstract DPillar as a directed graph and proving that the resulting directed graph is a Cayley graph; an immediate consequence is that the DCN DPillar is node-symmetric. In Section 4, and using the new-found property of node-symmetry, we explain how solving the single-path routing problem in our abstraction of DPillar can be further abstracted so that it is equivalent to a routing problem in what we call a marked cycle, and in Section 5 we prove that shortest paths in this marked cycle must have severe restrictions on their structure. We use these restrictions to develop our single-path routing algorithm for DPillar in Section 6 and establish its correctness and its time complexity. To support our theoretical analysis, we provide empirical evidence that the length of the (shortest) path obtained by our single-path routing algorithm is significantly shorter than the length of the path obtained by the single-path routing algorithm from [18] for many source-destination pairs, and we calculate the diameter of DPillar explicitly. Our conclusions and directions for further research are given in Section 8.

Footnote: Some results from this paper appeared in preliminary form in: A. Erickson, A. Kiasari, J. Navaridas and I.A. Stewart, An efficient shortest path routing algorithm in the data centre network DPillar, Proc. of 9th Ann. Int. Conf. on Combinatorial Optimization and Applications, 2015, pp. 209–220; some proofs and results were omitted and there was no experimental evaluation.
2 The DCN DPillar
In this section, we explicitly define the DCN DPillar and explain how the DCN DPillar can be (informally) constructed from a wrapped butterfly network.
2.1 A definition of DPillar
The DCN DPillar [18] consists of a collection of switches, each of which has $n$ ports, with $n$ even, and a collection of servers, each of which has $2$ NIC ports. The names of the servers are $(c, v_{k-1} v_{k-2} \ldots v_0)$, where $c \in \{0, 1, \ldots, k-1\}$ and each $v_i \in \{0, 1, \ldots, n/2 - 1\}$ (we refer to $k$ as the dimension): the first parameter, $c$, is the column-index and denotes the column in which the server resides, whilst the second parameter is the row-index and denotes the server's position within a column (from the left, the bit positions are $k-1, k-2, \ldots, 0$; note that we refer to the values $v_i$ as 'bits' and their positions as 'bit' positions). We denote the DCN DPillar with parameters $n$ and $k$, as above, by DPillar$_{n,k}$. Consequently, DPillar$_{n,k}$ has $k(n/2)^k$ servers.
We term the collections of servers ‘columns’ as we visualize the servers within a column as being stacked vertically within that column, with the rowindices of the servers, from top to bottom, being given in increasing lexicographic order on ; so, if and , for example, then the ordering is given by and so on. There are switches located between column and column , for , and also between column and column ; thus, there are switches in DPillar. We think of the switches between two columns of servers as appearing in a column too, with the names of the switches in a column being and again stacked from top to bottom in increasing lexicographic order. If a switch lies between servercolumn and servercolumn , where and addition is modulo , then we say that its column is column (henceforth, we assume that addition and subtraction on the names of columns are always modulo ). The columns of servers and switches for DPillar can be visualized as in Fig. 1 (note that the servers in the rightmost and leftmost columns are identical but are shown separately to facilitate visualization).
All links are serverswitch links and are from a server in (server)column to a switch in (switch)column or from a server in (server)column to a switch in (switch)column (where ). Let be a server in column . The switch to which it is connected in column is the switch named . If is a server in column then the switch to which it is connected in column is the switch named . So, for example, the server , where denotes that we may substitute in any number from , is connected to the switch in column , which in turn is connected to the server . Similarly, the server is connected to the switch in column , which in turn is connected to the server . The serverswitch links for DPillar can be visualized as in Fig. 1.
An alternative informal definition of DPillar can be given. With reference to Fig. 1, we can replace every switch with a complete bipartite graph (the bipartition is the obvious one). What results is the well-known wrapped butterfly network (see, e.g., [15]; this network has been well studied within the context of multiprocessor systems). The primary difference between DPillar and the resulting wrapped butterfly network is that a switch in DPillar enables direct server-to-server communication between servers connected to the same switch and in the same column, whereas such communication is absent in the wrapped butterfly network.
2.2 Abstracting DPillar
We can abstract DPillar as a digraph as follows: the nodes of this graph are the servers of DPillar; and there is an edge from a source-node to a target-node if there is a link from the corresponding source-server to a switch and a link from that switch to the corresponding target-server (so, the edges correspond to server-switch-server paths). There are four types of edges in the digraph abstracting DPillar:

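clockwise edges (c-edges), which are edges of the form $((c, v_{k-1} \ldots v_{c+1} v_c v_{c-1} \ldots v_0), (c+1, v_{k-1} \ldots v_{c+1} v'_c v_{c-1} \ldots v_0))$, for any bit value $v'_c$

anticlockwise edges (a-edges), which are edges of the form $((c, v_{k-1} \ldots v_c v_{c-1} v_{c-2} \ldots v_0), (c-1, v_{k-1} \ldots v_c v'_{c-1} v_{c-2} \ldots v_0))$, for any bit value $v'_{c-1}$

basic static edges (b-edges), which are edges of the form $((c, v_{k-1} \ldots v_{c+1} v_c v_{c-1} \ldots v_0), (c, v_{k-1} \ldots v_{c+1} v'_c v_{c-1} \ldots v_0))$, where $v'_c \neq v_c$

decremented static edges (d-edges), which are edges of the form $((c, v_{k-1} \ldots v_c v_{c-1} v_{c-2} \ldots v_0), (c, v_{k-1} \ldots v_c v'_{c-1} v_{c-2} \ldots v_0))$, where $v'_{c-1} \neq v_{c-1}$.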
So, within our abstraction of DPillar as a digraph, the nodes are the servers and are located in columns (as before) with all edges joining nodes in consecutive columns (clockwise and anticlockwise edges) or nodes in the same column (static edges). In fact, our digraph (where each node has in-degree and out-degree $2n-2$) can also be thought of as an undirected graph (that is regular of degree $2n-2$) as all edges come in oppositely oriented pairs. Note that a clockwise (resp. anticlockwise, basic static, decremented static) edge as above corresponds to a server-switch-server path in the DCN DPillar from a column $c$ server through a column $c$ (resp. $c-1$, $c$, $c-1$) switch and on to a column $c+1$ (resp. $c-1$, $c$, $c$) server. Henceforth, we denote the digraph abstracting DPillar by DPillar too (this causes no confusion). The abstraction of DPillar can be visualized as in Fig. 1, where we show how a switch gives rise to a set of edges in the abstraction of DPillar as a graph. We annotate edges as follows: an edge annotated 'a' is an anticlockwise edge relative to the incident node from which its arrow departs (the arrow denotes the node with respect to which the label is given); an edge annotated 'b' is a basic static edge relative to that node; an edge annotated 'c' is a clockwise edge relative to that node; and an edge annotated 'd' is a decremented static edge relative to that node (so, an edge has two labels: one relative to one incident node; and another relative to the other incident node). In short, for some node, the adjacent switch 'to the right' gives rise to b-edges and c-edges, and the one 'to the left' gives rise to a-edges and d-edges.
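The four edge types can be made concrete with a short sketch. It is only an illustration under stated assumptions: a node is modelled as a pair `(c, row)`, where `row` is a tuple with `row[i]` holding bit $i$, and each bit is assumed to range over `range(n // 2)`; the function name `neighbours` is hypothetical.

```python
def neighbours(n, k, node):
    """Out-neighbours of `node`, grouped by edge type (a sketch, not the authors' code).

    Assumptions: node = (c, row); row[i] is bit i; bits take values in range(n // 2).
    """
    c, row = node
    half = n // 2
    prev = (c - 1) % k  # the column adjacent in the anticlockwise direction

    def with_bit(pos, value):
        # Copy of the row-index with the bit at position `pos` set to `value`.
        new_row = list(row)
        new_row[pos] = value
        return tuple(new_row)

    return {
        # clockwise: move to column c+1, bit c may take any value
        "c": [((c + 1) % k, with_bit(c, v)) for v in range(half)],
        # anticlockwise: move to column c-1, bit c-1 may take any value
        "a": [(prev, with_bit(prev, v)) for v in range(half)],
        # basic static: stay in column c, bit c must change
        "b": [(c, with_bit(c, v)) for v in range(half) if v != row[c]],
        # decremented static: stay in column c, bit c-1 must change
        "d": [(c, with_bit(prev, v)) for v in range(half) if v != row[prev]],
    }
```

For $k \ge 3$ this gives each node $n/2 + n/2 + (n/2 - 1) + (n/2 - 1) = 2n - 2$ out-neighbours, and every c-edge is matched by an a-edge in the opposite direction, in line with the remark that edges come in oppositely oriented pairs.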
3 DPillar is a Cayley Graph
In this section, we prove that the digraph DPillar is a Cayley graph, and consequently node-symmetric (we exploit this node-symmetry later on in our single-path routing algorithm and in our experimental work). Recall that a graph is a Cayley graph if the nodes can be labelled with the elements of an (algebraic) group $G$ and there is a generating subset $S$ of $G$ that is closed under inverses so that every directed edge $(g, h)$ is labelled with an element $s$ of $S$ if, and only if, $gs = h$ (within the group $G$). We say that a digraph is node-symmetric if given any distinct nodes $u$ and $v$, there is an automorphism $f$ (that is, a one-to-one mapping of the node-set onto itself such that if $(x, y)$ is an edge then $(f(x), f(y))$ is an edge) mapping $u$ to $v$. It is well known, and trivial to prove, that every Cayley graph is node-symmetric. The first paper to establish that being a Cayley graph is a useful property for an interconnection network is [4] and since then, there has been much research into representing interconnection networks using finite groups. Not only do we immediately obtain that any Cayley graph is node-symmetric (which is a fundamental property of interconnection networks [7]) but Cayley graphs have been shown to be relevant to various networks in a variety of ways; for example, with regard to the design of interconnection networks by pruning nodes and edges from tori [24], the design of wireless DCNs [22], and the design of high-dimensional mesh-based interconnection networks [6].
3.1 DPillar Symmetry
Whilst it was stated in [18] that the DCN DPillar is 'symmetric', it was not stated what 'symmetric' meant (hence, there was no proof of 'symmetry'). Our main intention is to show that DPillar is node-symmetric (as defined above), but we do this by proving that DPillar is a Cayley graph.
Lemma 1.
The digraph DPillar is a Cayley graph.
Proof.
We obtain the immediate corollary.
Corollary 2.
The digraph DPillar is node-symmetric.
4 Abstracting routing in DPillar
In this section, we abstract the problem of finding a path in the digraph DPillar from a given source-node to a given destination-node so that ultimately this problem is equivalent to finding a path from a source-node to a destination-node in a cycle of length $k$ but where the actual node-to-node moves are more complicated than in a digraph. We also explain the single-path routing algorithm from [18].
4.1 Fixing bits
It is important to appreciate what might be accomplished by moving along one of the different types of edge highlighted above. Suppose that we are attempting to move from some source-node to some destination-node within DPillar and that we are currently at some node in column $c$. We can choose a clockwise (resp. anticlockwise, basic static, decremented static) edge so as to set the $c$th (resp. $(c-1)$th, $c$th, $(c-1)$th) bit in the row-index to whatever permissible bit value we like. Consequently, by choosing a clockwise (resp. anticlockwise, basic static, decremented static) edge along which to move, we can 'fix' the $c$th (resp. $(c-1)$th, $c$th, $(c-1)$th) bit of the row-index so that it matches that of the destination-node. We say that: a clockwise edge covers the column in which its source-node lies; an anticlockwise edge covers the column in which its target-node lies; a basic static edge covers the column in which both its source-node and target-node lie; and a decremented static edge covers the column that is adjacent in an anticlockwise direction to the column in which both its source-node and target-node lie. Thus, if we wish to move along some path from the source-node to the destination-node then we need to ensure that we move from column to column so as to fix all of the bits of the row-index that need fixing, but so that we don't subsequently 'unfix' them, and so that we end up in the column within which the destination-node resides (with regard to not 'unfixing' a bit, note that we can always move from a node in one column to a node in an adjacent column so that the row-index remains unchanged). This is equivalent to moving from column to column so that every row-index bit-position, i.e., column, where the bit values of the source-node and the destination-node differ is necessarily covered by some edge and so that we end up in the column within which the destination-node resides. If we are looking for a shortest path from the source-node to the destination-node then we have to do this using as few moves as possible.
Of course, any path of some length $m$ in our abstraction of DPillar as a digraph translates to a path consisting of $m$ server-switch-server link-pairs in the DCN DPillar, and vice versa (for the sake of uniformity, we measure the length of server-to-server paths in the DCN DPillar in terms of the number of server-switch-server link-pairs in the path; this is also common practice in the DCN community).
As an illustration, suppose we are at in DPillar and wish to get to the destination . If denotes any element of , there is: an anticlockwise edge taking us to ; a basic static edge taking us to ; a clockwise edge taking us to ; and a decremented static edge taking us to . Given our destination, when we move we can choose accordingly and fix the appropriate bit so that we move: via an anticlockwise edge to ; via a basic static edge to ; via a clockwise edge to ; or via a decremented static edge to .
4.2 Another abstraction
A crucial observation arising from the above discussion is that when routing in DPillar, the actual value of some bit in a row-index of some node is unimportant: what matters is whether this value is equal to or different from the value of the corresponding bit in the row-index of the destination-node (that is, whether the bit needs to be 'fixed' or not). Consequently, in order to solve the problem of finding a path from the source-node to the destination-node in DPillar, we can abstract the problem as a (more involved) routing problem in the following digraph:

we think of there being one node for each of the $k$ columns of nodes of DPillar, with nodes that correspond to adjacent columns being joined by an oppositely oriented pair of edges (so, we can also think of the digraph as an undirected cycle of length $k$)

we mark every node $i$, corresponding to some column $i$ (or, alternatively, some bit-position $i$ in the row-index of some node of DPillar), that needs to be covered (because bit $i$ of the row-index of the source-node is different from bit $i$ of the row-index of the destination-node); we call these the marked nodes

we move from node to node in the digraph, starting at the node corresponding to the column of the source-node so as to end at the node corresponding to the column of the destination-node, and making moves where:

a c-move means we move from node $i$ to node $i+1$ and such a move covers node $i$

an a-move means we move from node $i$ to node $i-1$ and such a move covers node $i-1$

a b-move means we stay at node $i$ and such a move covers node $i$

a d-move means we stay at node $i$ and such a move covers node $i-1$

(note the correspondence between the above moves and the edge types given in Section 2.2; arithmetic on node names is modulo $k$). We call the resulting digraph a marked cycle. Note that it might be the case that the start node and the end node of the marked cycle coincide (this would mean that the source-node and the destination-node lie in the same column in DPillar).
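The marked-cycle abstraction above can be sketched in a few lines. This is an illustrative sketch rather than anything taken from the original presentation: the names `MOVES`, `marked_columns`, `run_moves` and `is_path` are hypothetical, and row-indices are assumed to be tuples with `row[i]` holding bit $i$.

```python
# Each move: (change in column, offset of the covered column from the current column).
MOVES = {"c": (+1, 0), "a": (-1, -1), "b": (0, 0), "d": (0, -1)}


def marked_columns(src_row, dst_row):
    """Columns (bit positions) at which the two row-indices differ."""
    return {i for i, (x, y) in enumerate(zip(src_row, dst_row)) if x != y}


def run_moves(k, start, moves):
    """Apply a string of moves from column `start`; return (final column, covered columns)."""
    col, covered = start, set()
    for m in moves:
        step, offset = MOVES[m]
        covered.add((col + offset) % k)  # column covered by this move
        col = (col + step) % k           # column reached after this move
    return col, covered


def is_path(k, src, dst, marked, moves):
    """True if `moves` leads from column src to column dst and covers every marked column."""
    end, covered = run_moves(k, src, moves)
    return end == dst and set(marked) <= covered
```

For instance, with $k = 5$ (a hypothetical instance), source column 0, destination column 1 and marked columns $\{0, 4\}$, the string `"dc"` is a valid path of length 2: the d-move covers column 4 and the c-move covers column 0 while moving to column 1.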
With regard to our illustration in the previous section, the edge from : to results in an amove covering node in the marked cycle; to results in a bmove covering node in the marked cycle; to results in a cmove covering node in the marked cycle; and to results in a dmove covering node in the marked cycle.
It should be clear as to how moves in the marked cycle correspond to moves along corresponding edges in DPillar (and so to serverswitchserver linkpairs in the DCN DPillar) with the coverage of a node in and a node of DPillar being in direct correspondence. A path in is a sequence of moves leading from to and corresponds to a path in DPillar from node to node (and vice versa) with the lengths of the two paths being identical. Consequently, in order to find a shortest path from to in the DCN DPillar, it suffices to find a shortest path in the marked cycle (from the node to the node ) so that every marked node is covered by a move. Note that if then the empty sequence of moves does not constitute a legitimate path.
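As a point of reference, a shortest covering sequence of moves in a small marked cycle can be found by exhaustive breadth-first search. The sketch below is a brute-force check, not the routing algorithm developed in this paper; its state space grows exponentially with the number of marked nodes, so it is only suitable for small $k$. The names `MOVES` and `shortest_moves` are hypothetical, and the sketch does not model the remark above about the empty sequence of moves.

```python
from collections import deque

# Each move: (change in column, offset of the covered column from the current column).
MOVES = {"c": (+1, 0), "a": (-1, -1), "b": (0, 0), "d": (0, -1)}


def shortest_moves(k, src, dst, marked):
    """Breadth-first search over states (current column, marked columns covered so far)."""
    marked = frozenset(marked)
    start = (src, frozenset())
    queue = deque([(start, "")])
    seen = {start}
    while queue:
        (col, covered), path = queue.popleft()
        if col == dst and covered == marked:
            return path  # first goal state popped is reached by a shortest sequence
        for name, (step, offset) in MOVES.items():
            hit = (col + offset) % k
            state = ((col + step) % k, covered | ({hit} & marked))
            if state not in seen:
                seen.add(state)
                queue.append((state, path + name))
    return None
```

With $k = 5$, source column 0, destination column 1 and marked columns $\{0, 4\}$ (a hypothetical instance), the search returns a sequence of length 2.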
4.3 Basic routing in DPillar
Before we continue, let us discuss the single-path routing algorithm for DPillar as detailed in [18]; we refer to this algorithm as DPillarSP. The routing algorithm DPillarSP operates in two phases: in the first phase (the so-called 'helix' phase), a path in the DCN DPillar is chosen so that movement is always in a clockwise direction (that is, the column-index is always incremented) or always in an anticlockwise direction (that is, the column-index is always decremented) in order that the row-index is 'fixed' so that it is identical to that of the destination-node; and in the second phase (the so-called 'ring' phase), a path is subsequently chosen so as to reach the destination-node without amending the row-index and so that movement is in the same direction as in the first phase. Although not explicitly mentioned in [18] when discussing the algorithm, it is clear that the time complexity of the single-path routing algorithm from [18] is $O(k)$ (we have suppressed the component required to represent each bit-value).
It is stated in [18, Section 3.1] that this single-direction movement is so that 'loops' might be avoided. While this statement was not explained further, it is probable that what was meant by 'loops' was a loop within a single route for a source-destination pair. Of course, our shortest-path routing algorithm means that loops in a single path will never occur. Alternatively (though unlikely), the rationale for the decision in [18] to restrict to single-direction movement might have been to avoid either network-level deadlock or livelock due to dependency loops (see, e.g., [7, Ch. 14]). Irrespective of the intentions in [18], it is worth commenting on the potential for deadlocks in DPillar and server-centric DCNs in general. Given that the topology of DPillar is basically a sophisticated ring of columns, moving in a single direction does not completely prevent dependency loops from appearing. We give an example in Fig. 2, where a (bold) route and a (dotted) route together give rise to a cyclic dependency graph, due to the switches that they share, even though we are using single-direction routing. Nevertheless, there are many reasons to believe that, in the context of server-centric DCNs based on COTS hardware and software (i.e., Ethernet hardware and a TCP/IP stack), network-level deadlocks should be a minor concern. First, commodity Ethernet hardware uses packet-switching, which prevents network frames from spreading across many network components; therefore a cyclic dependency between frames is unlikely to happen. Second, servers have virtually unlimited memory (and indeed, many orders of magnitude more than switches); hence we can assume infinite FIFOs at the servers. Considering that one of the necessary conditions for deadlocks to appear is for FIFOs to become full, it is, again, very unlikely that we end up in a deadlock situation. Finally, in the very unlikely situation of a cyclic dependency appearing and all the FIFOs becoming full, the packet-dropping mechanism of Ethernet-based hardware provides seamless deadlock recovery, whereas TCP ensures data delivery. The upshot is that deadlocks are not a primary concern in DCNs.
It is very easy to see (by looking at some typical sourcedestination examples) that the routing algorithm DPillarSP is by no means optimal and that more often than not much shorter paths exist (an upper bound of on the lengths of paths produced was stated in [18]). For example, if one chooses to route in a clockwise fashion in DPillar with the source and the destination then the DPillarSP yields a path of length , and if one routes in an anticlockwise fashion then the algorithm also yields a path of length ; however, a shortest path has length (a dmove followed by a cmove). Our contention is that by relaxing this insistence on singledirection movement, we can obtain a much improved routing algorithm; indeed, as we shall see, we develop an optimal singlepath routing algorithm (where the implementation overheads are negligible and where there are significant practical benefits).
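The cost of insisting on single-direction movement can be illustrated in the marked cycle. The sketch below is a simplified model of single-direction routing (clockwise c-moves only, or anticlockwise a-moves only, until every marked column is covered and the destination column is reached); it is an assumption-laden stand-in and not a transcription of DPillarSP from [18]. The function name and the instance used afterwards are hypothetical.

```python
def single_direction_length(k, src, dst, marked, clockwise=True):
    """Number of moves used when only c-moves (or only a-moves) are allowed.

    Simplified model: keep moving in one direction until every marked column has
    been covered and the current column equals the destination column.
    """
    pos, steps, remaining = src, 0, set(marked)
    while remaining or pos != dst:
        if clockwise:
            remaining.discard(pos)   # a c-move covers the column it leaves
            pos = (pos + 1) % k
        else:
            pos = (pos - 1) % k
            remaining.discard(pos)   # an a-move covers the column it enters
        steps += 1
    return steps
```

For $k = 5$, source column 0, destination column 1 and marked columns $\{0, 4\}$, this model uses 6 moves clockwise and 9 moves anticlockwise, whereas a d-move followed by a c-move covers both marked columns in 2 moves.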
5 Routing in a marked cycle
We begin by making some initial observations as regards routing along a shortest path (from the start node to the end node) in a marked cycle before proving that any such shortest path has a restricted structure.
5.1 Some initial observations
Henceforth, $\rho$ is a shortest path from the start node to the end node in the marked cycle. Consider two consecutive moves in $\rho$. We can often rule out consecutive pairs of moves. For example, suppose that we have within $\rho$ a c-move followed by an a-move. We can replace this pair within $\rho$ by a b-move so as to obtain a path with identical coverage to $\rho$ and which is shorter. This yields a contradiction. Similarly, suppose that we have an a-move followed by a c-move within $\rho$. We can replace this pair within $\rho$ by a d-move so as to again obtain a contradiction. In Table 1, we detail all pairs of consecutive moves in $\rho$ that are forbidden by including the substitution that would result in a shorter path that has equivalent coverage. In this table, the first move is detailed in the rows and the second move in the columns. A blank cell means that the corresponding pair of moves cannot immediately be ruled out.
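Table 1: Forbidden pairs of consecutive moves and their replacements (rows: first move; columns: second move).

first \ second | a-move | b-move | c-move | d-move
a-move         |        | a-move | d-move |
b-move         |        | b-move | c-move |
c-move         | b-move |        |        | c-move
d-move         | a-move |        |        | d-move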
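The entries of Table 1 can be checked mechanically. The following sketch enumerates all 16 ordered pairs of moves and records those whose net column change and coverage are reproduced by a single move. It works with integer column offsets rather than columns modulo $k$, which assumes $k$ is large enough that distinct offsets do not coincide; the names `MOVES`, `effect` and `forbidden_pairs` are hypothetical.

```python
from itertools import product

# Each move: (change in column, offset of the covered column from the current column).
MOVES = {"a": (-1, -1), "b": (0, 0), "c": (+1, 0), "d": (0, -1)}


def effect(moves, start=0):
    """Net column change and set of covered columns (integer offsets, no wrap-around)."""
    col, covered = start, set()
    for m in moves:
        step, offset = MOVES[m]
        covered.add(col + offset)
        col += step
    return col - start, frozenset(covered)


def forbidden_pairs():
    """Map each two-move string to a single move with the same effect, where one exists."""
    table = {}
    for first, second in product(MOVES, repeat=2):
        for single in MOVES:
            if effect(single) == effect(first + second):
                table[first + second] = single
    return table
```

Running `forbidden_pairs()` yields eight pairs, for example `"ca"` mapped to `"b"` and `"ac"` mapped to `"d"`, matching the two substitutions worked through in the text.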
For clarity, rather than say, for example, 'a c-move followed by an a-move', in future we will simply write $ca$ to denote this circumstance. Consequently, subsequences of moves within $\rho$ will be written as strings over $\{a, b, c, d\}$ (as will $\rho$ itself) and we compress runs of the same symbol by using powers, so that, for example, $c^3$ abbreviates $ccc$.
We can say more. If we have a subsequence of moves then this has the same effect as the subsequence , and so we may suppose that a subsequence within is forbidden. Also, note that if has length at least then we cannot have a subsequence :

a subsequence can be replaced by ; a subsequence can be replaced by ; and we cannot have a subsequence or

a subsequence can be replaced by ; a subsequence can be replaced by ; and we cannot have a subsequence or .
Consequently, if has length at least then:

if a c-move is not the final move of $\rho$ then it must be followed by another c-move or a b-move

if an a-move is not the final move of $\rho$ then it must be followed by another a-move or a d-move

if a b-move is not the final (resp. first) move of $\rho$ then it must be followed by an a-move (resp. preceded by a c-move)

if a d-move is not the final (resp. first) move of $\rho$ then it must be followed by a c-move (resp. preceded by an a-move).
Consequently, if has length at least then it must be of one of two forms:

possibly a dmove (but maybe not) followed by a sequence of cmoves followed by a bmove followed by a sequence of amoves followed by a dmove followed by a sequence of cmoves followed by followed by a sequence of cmoves (resp. amoves) possibly followed by a bmove (resp. dmove); that is,
for some , where and where

possibly a bmove followed by a sequence of amoves followed by a dmove followed by a sequence of cmoves followed by a bmove followed by a sequence of amoves followed by followed by a sequence of amoves (resp. cmoves) possibly followed by a dmove (resp. bmove); that is,
for some , where and where
(when we say ‘sequence’, above, we mean ‘nonempty sequence’).
5.2 Restricting the number of turns
If we have a subsequence $cba$ in the shortest path $\rho$ then we say that an anticlockwise turn, or simply an a-turn, occurs at the b-move; similarly, if we have a subsequence $adc$ then we say that a clockwise turn, or simply a c-turn, occurs at the d-move. Note that if we have an a-turn in $\rho$ then the node at which this turn occurs, i.e., the node that is covered by the b-move, must be marked, as otherwise we could delete the corresponding b-move from $\rho$ and still have a sequence of moves from the start node to the end node covering all the marked nodes, which would yield a contradiction. Similarly, if we have a c-turn then the node at which this c-turn occurs, i.e., the node that is covered by the d-move, must be marked. We will use these observations later; but now we prove that any shortest path must contain at most $2$ turns.
Suppose that $\rho$ is a shortest path and has at least $3$ turns. What we do now is undertake a case-by-case analysis of the different configurations that might arise. These cases arise from the forms derived at the end of the previous subsection: the first two cases correspond to form (1) and the next two cases to form (2). The technique employed in each case is to modify the path $\rho$, by replacing sequences of moves within $\rho$, so as to obtain a new path that has the same coverage but is shorter; this yields a contradiction to our assumption that $\rho$ has at least $3$ turns.
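Counting turns in a string of moves is straightforward. The sketch below assumes, as read from the surrounding prose, that an a-turn is an occurrence of a b-move immediately preceded by a c-move and immediately followed by an a-move, and that a c-turn is a d-move immediately preceded by an a-move and immediately followed by a c-move; the function name `count_turns` is hypothetical.

```python
def count_turns(moves):
    """Return (number of a-turns, number of c-turns) in a string over 'abcd'."""
    a_turns = c_turns = 0
    for i in range(1, len(moves) - 1):
        triple = moves[i - 1 : i + 2]
        if triple == "cba":      # clockwise, then a b-move, then anticlockwise
            a_turns += 1
        elif triple == "adc":    # anticlockwise, then a d-move, then clockwise
            c_turns += 1
    return a_turns, c_turns
```

For example, the string `"ccbaadccba"` contains two a-turns and one c-turn, so by the claim above it cannot be a shortest path.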
Case (a): Suppose that is of form (1) and has a prefix of the form , where .
By this we mean that begins with cmoves followed by a bmove followed by amoves followed by a dmove followed by cmoves followed by a bmove followed by an amove.
If then we can replace the prefix in with and still obtain the same coverage; this contradicts that is a shortest path (note that we have actually only assumed so far that has turns). If then we can replace the prefix in with so as to obtain a contradiction (we have still actually only assumed that has turns). Hence, we must have that . Suppose that . We can replace the prefix in with so as to obtain a contradiction (we have still actually only assumed that has turns). Hence, and either or .
Suppose that . We can replace the prefix in with so as to obtain a contradiction (we have still actually only assumed that has turns). Hence, we must have that and . However, if we replace with then we obtain a contradiction (here we do use the fact that has at least turns). So, has at most turns and if it has turns then is of the form where and .
We can say more if has turns. Suppose that . The bmove can be deleted from and we obtain a contradiction. Hence, if has turns then is of the form where and . We can visualize as in Fig. 3(i). The marked cycle is shown as a cycle where a black node denotes a node of ; that is, a node that needs to be covered by some path in (from to , with in this illustration). The path is depicted as a dotted line partitioned into composite moves.
Case (b): Suppose that is of form (1) and has a prefix of the form , where .
If then we can replace the prefix in with so as to obtain a contradiction, and if then we can delete the first dmove from to obtain a contradiction. Hence, if starts with a dmove then it has at most turn.
Case (c): Suppose that is of form (2) and has a prefix of the form , where .
If then we can replace the prefix in with so as to obtain a contradiction. If then we can replace the prefix in with so as to obtain a contradiction. Hence, .
Suppose that . We can replace the prefix in with so as to obtain a contradiction. Suppose that . We can delete the first occurrence of a dmove in so as to obtain a contradiction. Hence, . Note that if has turns then is of the form where and . Alternatively, suppose that has at least turns. We can replace the prefix in with so as to obtain a contradiction. Hence, has at most turns.
We can say more if has turns. Suppose that . The dmove can be deleted from and we obtain a contradiction. Hence, if has turns then is of the form where and . We can visualize as in Fig. 3(ii).
Case (d): Suppose that is of form (2) and has a prefix of the form , where .
If then we can replace the prefix with so as to obtain a contradiction, and if then we can delete the first bmove from to obtain a contradiction. Hence, if starts with a bmove then it has at most turn.
So, we have proven the following lemma.
Lemma 3.
If is a shortest path (from to ) in then has at most turns, and if has turns then it must be of the form or , where and .
6 An optimal routing algorithm for DPillar
We now develop an optimal single-path routing algorithm for DPillar, based around Lemma 3. We do this by finding a small set of paths (from to ) in so that at least one of these paths is a shortest path (and consequently we obtain a shortest path in the DCN DPillar). By Lemma 2, we may assume that and , and by Lemma 3, we may assume that any shortest path has at most turns.
Our technique is as follows. Essentially, we want to make the set as small as possible; that is, we want our resulting algorithm to have to consider as few paths as possible (when looking for the shortest). Lemma 3 precisely describes the set of paths we need to consider from the paths involving exactly turns; of course, we also need to consider paths involving or turns (if they exist). There are different situations depending upon the distribution of the marked nodes needing to be covered; in particular, upon the distribution of marked nodes along the natural clockwise and anticlockwise paths from the source to the destination on the marked cycle, assuming the source and destination to be distinct (this is the case in Section 6.1; the case when the source and destination are the same is considered in Section 6.2). Sometimes the distribution of marked nodes rules out the possibility of certain types of paths.
6.1 Building our set of paths when
We first suppose that . Let (that is, the bitpositions that need to be ‘fixed’). Suppose that so that we have (we might have that either or is , when the corresponding set is empty). If then define , for , with ; and if then define , for , with . Also: define (resp. ), if (resp. ); and (resp. ), if (resp. ). We can visualize the resulting marked cycle as in Fig. 4(i). Note that in this particular illustration and ; so, and . Of course, what we are looking for is a sequence of (a, b, c and d)moves that will take us from to in so that all nodes of have been covered.
In what follows, we examine different scenarios involving the number of marked nodes, , and also the number of marked nodes, . Each scenario for contributes certain paths to as does each scenario for . Note that perhaps the most obvious paths to consider as potential members of are the paths and which have lengths and , respectively. So, we begin by setting .
From Lemma 3, any shortest path from to having turns requires that or and that both nodes at which these turns occur are different from and and lie on the anticlockwise path from to or on the clockwise path from to , accordingly. Recall also that the node at which any turn occurs on a shortest path is necessarily a marked node (irrespective of the number of turns in ).
Case (a): Suppose that .
In this scenario, we contribute either the path to , if , or the path to , if ; either way, the length of the path contributed is .
Case (b): Suppose that .
In this scenario, we contribute either the path to , if , or the path to , if ; either way, the length of the path contributed is .
Case (c): Suppose that .
In this scenario, we contribute paths to . If then we contribute the path to , or if then we contribute the path to ; either way, the length of the resulting path is . We also contribute the path to of length . There is potentially another path when and , namely , but the length of this path is which is greater than , which in turn is evaluated with .
Case (d): Suppose that .
In this scenario, we contribute paths to . If then we contribute the path to , or if then we contribute the path to ; either way, the length of the resulting path is . We also contribute the path to of length . There is potentially another path when and , namely , but the length of this path is which is greater than , which in turn is evaluated with .
Case (e): Suppose that .
In this scenario, we contribute paths to . For each , we contribute the path to of length . If then we contribute the path to , or if then we contribute the path to ; either way, the length of the path is . We also contribute the path to of length . (These last paths mirror those constructed in Case (c).)
Case (f): Suppose that .
In this scenario, we contribute paths to . For each , we contribute the path to of length . If then we contribute the path to , or if then we contribute the path to ; either way, the length of the path is . We also contribute the path to of length . (These last paths mirror those constructed in Case (c).)
Thus, our set of potential shortest paths contains paths (from which at least one is a shortest path).
6.2 Building our set of paths when
Now we suppose that . We proceed as we did above and build a set of potential shortest paths. Let . Suppose that so that we have (we might have that is when the corresponding set is empty). If then define , for , with . We define , if , and , if . We can visualize the resulting marked cycle as in Fig. 4(ii). Again, the most obvious path to consider is (or ) which has length . We begin by setting .
Case (a): Suppose that .
In this scenario, we contribute the path of length (note that in this case the node is necessarily marked as we originally assumed that we started with distinct source and destination servers in the DCN DPillar).
Case (b): Suppose that .
If then we contribute the path , if , and the path , if ; either way, the path has length . If then we contribute the path of length . If then we contribute paths. The first of these paths is the path , if , and the path , if ; either way, this path has length . The second of these paths is the path of length .
Case (c): Suppose that .
In this scenario, we contribute paths to . For each , we contribute the path to of length . If then we contribute the path to , or if then we contribute the path to ; either way, this path has length . We also contribute the path to of length . (These last paths mirror those constructed in Case (b).)
Thus, our set of potential shortest paths contains at most paths (from which at least one is a shortest path).
6.3 Our algorithm
We now use our set of potential shortest paths so as to find a shortest path or the length of a shortest path. Our algorithm, DPillarMin, for finding the length of a shortest path in is as follows.
Algorithm: DPillarMin
calculate
if then
    calculate , , , , and
    if then
    if then
    if then
    if then
    if then calculate % only need consider max.
    if then calculate % only need consider max.
else
    calculate and
    if then
    if then
    if then
    if then
    if then
    if then
output
If we wish to output a shortest path then all we do is apply the algorithm DPillarMin but remember which shortest path corresponds to the final value of and output this shortest path (note that there may be more than one shortest path; exactly which path one obtains depends upon how one implements checking the paths of ). The time complexity of both algorithms is clearly ; that is, linear in the number of columns. Henceforth, we assume that the algorithm DPillarMin outputs an actual shortest path.
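The bookkeeping described above, namely remembering which candidate attains the running minimum, can be sketched as follows. This is only an illustration under stated assumptions: the candidates are hypothetical strings over the move letters a, b, c and d (one character per move), and the function name `pick_shortest` is ours; the actual candidate set is the one built in Sections 6.1 and 6.2 and is not reproduced here.

```python
def pick_shortest(candidates):
    """Return (length, path) for the first candidate of minimum length.

    `candidates` is a hypothetical iterable of move strings over the
    letters 'a', 'b', 'c', 'd'; each character stands for one move.
    """
    best_length, best_path = None, None
    for path in candidates:
        length = len(path)  # one character per move
        # Strict '<' keeps the earliest candidate when lengths tie, so the
        # path returned depends on the order in which candidates are checked.
        if best_length is None or length < best_length:
            best_length, best_path = length, path
    return best_length, best_path


if __name__ == "__main__":
    print(pick_shortest(["aaaa", "cc", "bad", "dc"]))
```

In an implementation aiming for the stated time complexity one would presumably compare the numeric lengths of the candidates directly and only materialise the winning path; the strings here merely keep the sketch short.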
It should be clear (using Lemma 3) that the different considerations for and exhaust all possibilities and that, consequently, the set of paths considered by DPillarMin contains a shortest path. Hence, DPillarMin outputs a shortest path from any given source node to any given destination node in DPillar. In summary, we have the following result.
Theorem 4.
Suppose that so that is even. The algorithm DPillarMin takes as input any two servers of DPillar, a source and a destination, and outputs a shortest path from the source server to the destination server; moreover, it computes this path with time complexity .
We have also checked the correctness of DPillarMin empirically, comparing its output against a breadth-first search on DPillar when and are relatively small. We undertook these experiments using our in-house simulator INRFlow [8].
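A breadth-first search of the kind used as a reference in such a check can be sketched as follows. The sketch is not the INRFlow implementation, and the adjacency rule it uses is an assumption, since the construction of DPillar is not restated in this section: a server is a pair (column, digits) with one digit per column, each digit taking one of (number of ports)/2 values, and two servers are one switch-hop apart when they lie in the same or in cyclically adjacent columns and agree on every digit except possibly the one associated with their shared switch column. The names `dpillar_neighbours`, `all_servers` and `bfs_distances` are ours.

```python
from collections import deque
from itertools import product


def all_servers(n, k):
    """All (column, digits) pairs: k columns, k digits, each in range(n // 2)."""
    return [(c, d) for c in range(k) for d in product(range(n // 2), repeat=k)]


def dpillar_neighbours(node, n, k):
    """Servers one switch-hop away from `node` under the assumed adjacency."""
    col, digits = node
    result = set()
    # Assumption: the switch column shared with column col+1 may rewrite digit
    # `col`; the one shared with column col-1 may rewrite digit `col-1`.
    for pos, other_col in ((col, (col + 1) % k), ((col - 1) % k, (col - 1) % k)):
        for value in range(n // 2):
            changed = digits[:pos] + (value,) + digits[pos + 1:]
            result.add((other_col, changed))  # server in the adjacent column
            result.add((col, changed))        # server in the same column
    result.discard(node)
    return result


def bfs_distances(source, n, k):
    """Hop distance from `source` to every reachable server."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in dpillar_neighbours(u, n, k):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


if __name__ == "__main__":
    n, k = 4, 3
    source = all_servers(n, k)[0]
    print(max(bfs_distances(source, n, k).values()))
```

The distances returned by `bfs_distances` could then be compared, pair by pair, with the lengths reported by an implementation of DPillarMin; because the search visits every server, this is only practical for small parameters.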
6.4 The diameter of DPillar
We also compute the diameter of the DCN DPillar, i.e., the maximum of the lengths of shortest paths joining any two distinct servers. All that was stated in [18] was that the diameter of the DCN DPillar is a ‘linear function of ’.
Theorem 5.
If then the DCN DPillar has diameter ; and if then the DCN DPillar has diameter .
Proof.
Let and be nodes of the digraph DPillar. W.l.o.g. we may assume that the column-index of is and that of is . We work in and in the context of the algorithm DPillarMin.
We first note that for any , the worst-case scenario is when all nodes of are marked, since a shortest path in this scenario yields a path in any other scenario (though not necessarily a shortest one). Hence, in what follows we assume that all nodes are marked.
Case (a): .
We consider first the case when . There are different scenarios for : ; : ; ; and .
Consider first when and . By consideration of the algorithm DPillarMin, where we have , , , , , , and , we immediately see that . We are trying to find a value of that maximizes this minimum value. If then ; so, in this situation this minimum value is maximized when and this minimum value is then . If then ; so, in this situation this minimum value is maximized when and this minimum value is then .
In each of the other cases for , where , we have that the length of the path produced by DPillarMin is trivially less than (simply look at the initial minimization ). Also, when the length of the path produced is trivially less than . Hence, when the diameter is .
Case (b): .
It is trivial to see by hand that the diameter in this case is . The result follows. ∎
7 Experimental work
Whilst we have obtained an optimal single-path routing algorithm for DPillar (optimal in that our algorithm always outputs a shortest path), we do not yet know how often the single-path routing algorithm DPillarSP is suboptimal, nor what savings are to be made by employing our optimal algorithm. A precise analytical evaluation of this question would be challenging; consequently, we evaluate empirically the most important performance metrics, namely path length, aggregate bottleneck throughput and transmission latency.
We undertake our evaluation using simulation, employing our own flow-based framework INRFlow [8]. We adopt a simulation-based evaluation for the following reasons. Future DCNs are intended to incorporate hundreds of thousands, if not millions, of processors. Consequently, building a testbed of servers (bearing in mind realistic access to resources) would only yield a DCN with a handful of servers and there would be no grounds for believing that any such evaluation would scale up. For instance, building even the smallest meaningful DPillar would require that should be at least and at least which would result in a testbed with servers; this is beyond our means. Not surprisingly, simulation is the standard evaluation mechanism in the literature. Of the DCNs mentioned in this paper, FiConn, MCube, HCN, BCN, SWKautz, SWCube, and SWdBruijn were all evaluated using simulation, with only DCell, BCube, and CamCube evaluated using testbeds, incorporating , and servers, respectively. In addition, the symmetry present in DCNs reduces the likelihood of ‘random’ aspects of the network topology having an unexpected impact upon performance when compared with more unstructured networks. Finally, as regards our evaluation of communication latency in Section 7.3, we have incorporated realistic measurements of protocol stack, propagation, data transmission, and routing latencies into our analysis.
7.1 Path Length
In order to obtain some idea of the practical significance of our algorithm DPillarMin in terms of path length, we undertook the following experiment. For specific values of and , we measured the average length of the paths obtained by employing both DPillarMin and DPillarSP for every possible source-destination pair (node-symmetry means that we can actually fix a unique source node) as well as the cumulative frequencies of the lengths of paths arising. We also measured the number of occasions on which the path derived by DPillarSP is longer than the path derived by DPillarMin; that is, the number of times DPillarSP produced a non-minimal path. Our results are shown in Table 2 and Table 3. In Table 2, the columns denote (in order): the parameters and of the particular DPillar that we are working with; the number of servers in that DPillar; the average path lengths obtained from inputting every possible source-destination pair to the algorithms DPillarMin and DPillarSP; the improvement in terms of average path length obtained by employing DPillarMin as a percentage of the average path length obtained by employing DPillarSP; and the percentage of source-destination pairs where the optimal path length is shorter than that obtained by employing DPillarSP. In Table 3, for each chosen and we show the cumulative frequencies of the lengths of paths obtained by employing the two algorithms DPillarSP and DPillarMin. These cumulative frequencies are shown as percentages of the total number of pairs of (not necessarily distinct) servers and are rounded to the nearest % (in order to save space we do not show data relating to all pairs of and ; this omitted data is as might be expected).
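The summary statistics just described can be computed along the following lines. This is a minimal sketch rather than the INRFlow code: the function name `compare_routing` and its two list arguments (the path lengths returned by DPillarMin and by DPillarSP for the same sequence of source-destination pairs) are assumptions made for illustration.

```python
def compare_routing(min_lengths, sp_lengths):
    """Summarise two routing algorithms over the same source-destination pairs.

    Returns (average DPillarMin length, average DPillarSP length,
    improvement as a percentage of the DPillarSP average,
    percentage of pairs on which DPillarSP is strictly longer).
    """
    if not min_lengths or len(min_lengths) != len(sp_lengths):
        raise ValueError("need two non-empty lists of equal length")
    total = len(min_lengths)
    avg_min = sum(min_lengths) / total
    avg_sp = sum(sp_lengths) / total
    # Improvement is measured relative to the DPillarSP average.
    improvement = 100.0 * (avg_sp - avg_min) / avg_sp
    # A pair counts as non-minimal when DPillarSP's path is strictly longer.
    non_minimal = 100.0 * sum(1 for m, s in zip(min_lengths, sp_lengths) if s > m) / total
    return avg_min, avg_sp, improvement, non_minimal


if __name__ == "__main__":
    print(compare_routing([1, 2, 2, 3], [1, 3, 2, 4]))
```

The toy lists in the usage example are invented purely to exercise the function and are not experimental data.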
DPillar            # of       av. path length   av. path length   av. length    non-min.
ports   columns    servers    DPillarMin        DPillarSP         improvement   paths
16      3          1,536      2.72              3.86              29%           66%
16      4          16,384     3.74              5.36              30%           73%
16      5          163,840    4.77              6.86              30%           78%
32      3          12,288     2.86              3.93              27%           67%
32      4          262,144    3.87              5.43              28%           74%
48      3          41,472     2.9               3.96              26%           67%
64      3          98,304     2.93              3.97              26%           67%
80      3          192,000    2.94              3.97              25%           67%
128     3          786,432    2.96              3.98              25%           67%
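The server counts listed in Table 2 can be reproduced with a one-line calculation. The sketch below assumes that the first parameter is the number of switch ports and the second the number of columns, and that a DPillar with these parameters has columns × (ports/2)^columns servers; this formula is not stated in this section, but it agrees with every row of the table. The function name `dpillar_server_count` is ours.

```python
def dpillar_server_count(ports, columns):
    """Number of servers, assuming columns * (ports / 2) ** columns."""
    if ports % 2 != 0:
        raise ValueError("the number of ports is assumed to be even")
    return columns * (ports // 2) ** columns


if __name__ == "__main__":
    # (ports, columns) pairs taken from Table 2.
    for ports, columns in [(16, 3), (16, 4), (16, 5), (32, 3), (32, 4),
                           (48, 3), (64, 3), (80, 3), (128, 3)]:
        print(ports, columns, dpillar_server_count(ports, columns))
```

The count grows exponentially in the number of columns, which is consistent with the argument above that a physical testbed of meaningful size is impractical.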
DPillar 
