Discovering the Network Backbone from Traffic Activity Data
Abstract
We introduce a new computational problem, the BackboneDiscovery problem, which encapsulates both functional and structural aspects of network analysis. While the topology of a typical road network has been available for a long time (e.g., through maps), it is only recently that fine-granularity functional (activity and usage) information about the network (like source-destination traffic information) is being collected and is readily available. The combination of functional and structural information provides an efficient way to explore and understand usage patterns of networks and aid in design and decision making. We propose efficient algorithms for the BackboneDiscovery problem including a novel use of edge centrality. We observe that for many real-world networks, our algorithm produces a backbone with a small subset of the edges that support a large percentage of the network activity.
Sanjay Chawla 
University of Sydney 
Sydney, Australia 
sanjay.chawla@sydney.edu.au 
and 
Venkata Rama Kiran Garimella 
Aalto University 
Helsinki, Finland 
kiran.garimella@aalto.fi 
and 
Aristides Gionis 
Aalto University and HIIT 
Helsinki, Finland 
aristides.gionis@aalto.fi 
and 
Dominic Tsang 
University of Sydney 
Sydney, Australia 
dwktsang@yahoo.com 
In this paper we study a novel problem, which combines structural and functional (activity) network data. In recent years there has been a large body of research related to exploiting structural information of networks. However, with the increasing availability of fine-grained functional information, it is now possible to obtain a detailed understanding of activities on a network. Such activities include source-destination traffic information in road and communication networks.
More specifically, we study the problem of discovering the backbone of traffic networks. In our setting, we consider the topology of a network $G=(V,E)$ and a traffic log $\mathcal{L}$, recording the amount of traffic incurred between a source $s$ and a destination $t$. We are also given a budget $B$ that accounts for a total edge cost. The goal is to discover a sparse subnetwork $R$ of $G$, of cost at most $B$, which summarizes as well as possible the recorded traffic $\mathcal{L}$.
The problem we study has applications for both exploratory data analysis and network design. An example application of our algorithm is shown in Figure ??. Here, we consider a traffic log (Figure ??, left), which consists of the most popular routes used on the London tube. The backbone produced by our algorithm takes into account this demand (based on the traffic log) and tries to summarize the underlying network, thus presenting us with insights about usage patterns of the London tube (Figure ??, right). This representation of the ‘backbone’ of the network could be very useful for identifying the important edges to upgrade or to keep better maintained in order to minimize total traffic disruption.
We only consider source-destination pairs in the traffic log, and not full trajectories, as source-destination information captures true mobility demand in a network. For example, data about the daily commute from home (source) to office (destination) is more resilient than trajectory information, which is often determined by local and transient constraints, like traffic conditions on the road, time of day, etc. Furthermore, in communication networks, only the source-IP and destination-IP information is encoded in TCP/IP packets. Similarly, in a city metro, check-in and check-out information is captured while the intervening movement is not logged.
The BackboneDiscovery problem is an amalgam of the $k$-spanner problem [?] and the Steiner-forest problem [?]. However, our problem formulation has elements which are substantially distinct from both of these problems.
In the $k$-spanner problem the goal is to find a minimum-cost subnetwork $R$ of $G$, such that for each pair of nodes $u$ and $v$, the shortest path between $u$ and $v$ on $R$ is at most $k$ times longer than the shortest path between $u$ and $v$ on $G$. In our problem, we are not necessarily interested in preserving the $k$-factor distance between all pairs of nodes, but only between a subset of them.
In the Steiner-forest problem we are given a set of pairs of terminals $\{(s_i,t_i)\}$ and the goal is to find a minimum-cost forest on which each source $s_i$ is connected to the corresponding destination $t_i$. Our problem is different from the Steiner-forest problem because we do not need all pairs $(s_i,t_i)$ to be connected, and we try to optimize a stretch factor so that the structural aspects of the network are also taken into account.
A novel aspect of our work is the use of edge-betweenness to guide the selection of the backbone [?]. The intuition is as follows. An algorithm that solves the Steiner-forest problem tries to minimize the sum of the costs of the selected edges, as long as the terminal pairs are connected, and is agnostic to minimizing the stretch factor. However, if the edge costs are inversely weighted with edge-betweenness, then edges that can contribute to reducing the stretch factor can potentially be included in the backbone.
To understand the differences between the proposed BackboneDiscovery problem and both the $k$-spanner and Steiner-forest formulations, consider the example shown in Figure ??. In this example, there are four groups of nodes:

group consists of nodes, ,

group consists of nodes, ,

group consists of nodes, and , and

group consists of nodes, .
Assume that is smaller than , and thus much smaller than . All edges shown in the figure have cost , except the edge between and , which has cost . Further assume that there is one unit of traffic between each and each , for , resulting in source-destination pairs (the majority of the traffic), and one unit of traffic between and , for , resulting in source-destination pairs (some additional marginal traffic). The example abstracts a common layout found in many cities: a few busy centers (commercial, residential, entertainment, etc.) with some heavily used links connecting them (group ), and some peripheral ways around, that serve additional traffic (group ).
Careful inspection of the above example highlights advantages of the backbone discovery problem:

As opposed to the spanner problem, we do not need to guarantee short paths for all pairs of nodes, but only for those in our traffic log which makes our approach more general. In particular, based on the budget requirements a backbone could be designed for the most voluminous paths.

Due to the budget constraint, it may not be possible to guarantee connectivity for all pairs in the traffic log. We thus need a way to decide which pairs to leave disconnected. Neither the spanner nor the Steiner-forest problem provisions for disconnected pairs. In fact, it is possible that the optimal backbone may even contain cycles while leaving pairs disconnected. Again, allowing for a disconnected backbone generalizes the Steiner-forest problem and may help provision for a tighter budget. In order to allow for a disconnected backbone, we employ a stretch factor defined as a weighted harmonic mean over the source-destination pairs of the traffic log, which provides a principled objective to optimize connectivity while allowing pairs to be left disconnected when there is insufficient budget.

Certain high-cost edges may be an essential part of the backbone that other problem formulations may leave out. For example, while the edge that connects the nodes in is a very important edge for the overall traffic (as it provides a short route between and ), the optimal Steiner-forest solution, shown in Figure ??(b), prefers the long path along the nodes in . Our algorithm includes the component (as seen in Figure ??(c)) because it is an edge that has a high edge-betweenness.
The rest of the paper is organized as follows. We first rigorously define the BackboneDiscovery problem. We then introduce our algorithm, which is based on the greedy approach, and detail our experimental evaluation, results, and discussion. Next, we survey related work and distinguish our problem formulation from other relevant approaches. We conclude with a summary and potential directions for future work.
Let $G=(V,E)$ be a network, with $|V|=n$ and $|E|=m$. For each edge $e\in E$ there is a cost $c(e)$. Additionally, we consider a traffic log $\mathcal{L}$, specified as a set of triples $(s_i,t_i,w_i)$, with $s_i,t_i\in V$, and with $i=1,\ldots,k$. A triple $(s_i,t_i,w_i)$ indicates the fact that $w_i$ units of traffic have been recorded between nodes $s_i$ and $t_i$.
We aim at discovering the backbone of traffic networks. A backbone $R$ is a subset of the edges of the network $G$, that is, $R\subseteq E$, that provides a good summarization for the whole traffic in $\mathcal{L}$. In particular, we require that if the available traffic had used only edges in the backbone $R$, it should have been almost as efficient as using all the edges in the network. We formalize this intuition below.
Given two nodes $s,t\in V$ and a subset of edges $A\subseteq E$, we consider the shortest path $d_A(s,t)$ from $s$ to $t$ that uses only edges in the set $A$. In this shortest-path definition, edges are counted according to their cost $c(e)$. If there is no path from $s$ to $t$ using only edges in $A$, we define $d_A(s,t)=\infty$. Consequently, $d_E(s,t)$ is the shortest path from $s$ to $t$ using all the edges in the network, and $d_R(s,t)$ is the shortest path from $s$ to $t$ using only edges in the backbone $R$.
To measure the quality of a backbone $R$ with respect to some traffic log $\mathcal{L}$, we use the concept of stretch factor. Intuitively, we want to consider shortest paths from $s_i$ to $t_i$, and evaluate how much longer those paths are on the backbone $R$ than on the original network. The idea of using stretch factor for evaluating the quality of a subgraph has been used extensively in the past in the context of spanner graphs [?].
In order to aggregate shortest-path information for all source–destination pairs in our log in a meaningful way, we need to address two issues. The first issue is that not all source–destination pairs have the same volume in the traffic log. This can be easily addressed by weighting the contribution of each pair by its corresponding traffic volume.
The second issue is that since we aim at discovering very sparse backbones, many source–destination pairs in the log could be disconnected in the backbone. To address this problem we aggregate shortest-path distances using the harmonic mean. This idea, which has been proposed by Marchiori and Latora [?] and has also been used by Boldi and Vigna [?] in measuring centrality in networks, provides a very clean way to deal with infinite distances: if a source–destination pair is not connected, their distance is infinity, so the harmonic mean accounts for this by just adding a zero term in the summation. Using the arithmetic mean is problematic, as we would need to add an infinite term to other finite numbers.
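Overall, given a set of edges $A\subseteq E$, we measure the connectivity of the traffic log $\mathcal{L}=\{(s_i,t_i,w_i)\}$ by the weighted harmonic mean
\[ H_{\mathcal{L}}(A) = \Big(\sum_{i=1}^{k} w_i\Big)\Big(\sum_{i=1}^{k} \frac{w_i}{d_A(s_i,t_i)}\Big)^{-1}, \]
where $d_A(s_i,t_i)$ is the shortest-path distance from $s_i$ to $t_i$ using only edges in $A$ (infinite if the pair is disconnected).
The stretch factor of a backbone $R$ is then defined as
\[ \lambda_{\mathcal{L}}(R) = \frac{H_{\mathcal{L}}(R)}{H_{\mathcal{L}}(E)}. \]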
The stretch factor is always greater than or equal to 1. The closer it is to 1, the better the connectivity that the backbone offers to the traffic log $\mathcal{L}$.
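The connectivity measure and the stretch factor can be illustrated with a minimal Python sketch. It assumes an undirected network with strictly positive edge costs and source different from destination in every log entry; the function names (`dijkstra`, `connectivity`, `stretch_factor`) are ours and hypothetical, not part of the original work.

```python
import heapq
import math

def dijkstra(adj, source):
    """Shortest-path distances from source; adj maps node -> [(neighbor, cost)]."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, c in adj.get(u, []):
            nd = d + c
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def connectivity(edges, log):
    """Weighted harmonic mean of shortest-path distances over the log.

    edges: iterable of (u, v, cost), treated as undirected.
    log:   iterable of (s, t, w) with s != t (assumption).
    A disconnected pair contributes a zero term to the sum.
    """
    adj = {}
    for u, v, c in edges:
        adj.setdefault(u, []).append((v, c))
        adj.setdefault(v, []).append((u, c))
    total_w = sum(w for _, _, w in log)
    inv_sum, cache = 0.0, {}
    for s, t, w in log:
        if s not in cache:
            cache[s] = dijkstra(adj, s)
        d = cache[s].get(t, math.inf)
        if d < math.inf:
            inv_sum += w / d
    return math.inf if inv_sum == 0 else total_w / inv_sum

def stretch_factor(all_edges, backbone_edges, log):
    """Ratio of the connectivity on the backbone to that on the full network."""
    return connectivity(backbone_edges, log) / connectivity(all_edges, log)
```

For a triangle with unit costs and a single log entry between two of its corners, a backbone consisting of the other two edges serves the pair with a path of length 2 instead of 1, so the stretch factor is 2.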
We are now ready to formally define the problem of backbone discovery.
Problem 1 (BackboneDiscovery)
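Consider a network $G=(V,E)$ and a traffic log $\mathcal{L}=\{(s_i,t_i,w_i)\}$. Consider also a cost budget $B$. The goal is to find a backbone network $R\subseteq E$ of total cost at most $B$ that minimizes the stretch factor of $R$ with respect to $\mathcal{L}$, or report that no such solution exists.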
As one may suspect, BackboneDiscovery is an NP-hard problem.
Lemma 1
The BackboneDiscovery problem, defined in Problem 1, is NP-hard.

Proof. We obtain a reduction from the SetCover problem: given a ground set $U$, a collection $\mathcal{S}$ of subsets of $U$, and an integer $q$, determine whether there are $q$ sets in $\mathcal{S}$ that cover all the elements of $U$.
Given an instance of the SetCover problem we form an instance of the BackboneDiscovery problem as follows (see Figure ?? for an illustration). We create one node $u_i$ for each element $u_i\in U$ and one node $v_j$ for each set $S_j\in\mathcal{S}$. We also create a special node $z$. We create an edge $(u_i,v_j)$ if and only if $u_i\in S_j$ and we assign to this edge cost 0. We also create an edge $(v_j,z)$ for all $S_j\in\mathcal{S}$ and we assign to this edge cost 1. As far as the traffic log is concerned, we set $\mathcal{L}=\{(u_i,z,1) : u_i\in U\}$, that is, we pair each $u_i$ with the special node $z$ with volume 1. Finally we set the budget $B=q$. It is not difficult to see that the instance of the BackboneDiscovery problem constructed in this way has a solution with stretch factor 1 if and only if the given instance of the SetCover problem has a feasible solution.
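The construction in the reduction is mechanical, and a short Python sketch may make it concrete. The function name `setcover_to_backbone` and the node-naming scheme (`"u…"`, `"v…"`, `"z"`) are illustrative assumptions, not notation from the original work.

```python
def setcover_to_backbone(universe, subsets, num_sets):
    """Build a BackboneDiscovery instance from a SetCover instance.

    universe: iterable of ground-set elements.
    subsets:  list of sets over the universe.
    num_sets: the integer bound of the SetCover instance.
    Returns (edges, log, budget), with edges as (u, v, cost) triples
    and log entries as (source, destination, volume) triples.
    """
    edges = []
    for j, subset in enumerate(subsets):
        for element in subset:
            # Element node to set node, cost 0, only when the set contains it.
            edges.append((f"u{element}", f"v{j}", 0))
        # Set node to the special node z, cost 1.
        edges.append((f"v{j}", "z", 1))
    # One unit of traffic from every element node to z.
    log = [(f"u{element}", "z", 1) for element in universe]
    return edges, log, num_sets
```

Selecting an edge of cost 1 corresponds to picking a set, while the cost-0 edges are free, so a budget equal to the SetCover bound connects every element to the special node exactly when the chosen sets form a cover.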
The algorithm we propose for the BackboneDiscovery problem is a greedy heuristic that connects one by one the source–destination pairs of the traffic log $\mathcal{L}$. A distinguishing feature of our algorithm is that it utilizes a notion of edge benefit. In particular, we assume that for each edge $e\in E$ we have available a benefit measure $b(e)$. The higher the measure $b(e)$, the more beneficial it is to include the edge in the final solution. The benefit measure is computed using the traffic log $\mathcal{L}$ and it takes into account the global structure of the network $G$.
The more central an edge is with respect to a traffic log, the more beneficial it is to include it in the solution, as it can be used to serve many source–destination pairs. In this paper we use edge-betweenness as a centrality measure, adapted to take into account the traffic log. We also experimented with commute-time centrality, but edge-betweenness was found to be more effective.
Our algorithm relies on the notion of effective distance $\ell(e)$, defined as $\ell(e)=c(e)/b(e)$, where $c(e)$ is the cost of an edge $e$, and $b(e)$ is the edge-betweenness of $e$. The intuition is that by dividing the cost of each edge by its benefit, we are biasing the algorithm towards selecting edges with high benefit.
We now present our algorithm in more detail.
As discussed above, our algorithm operates with effective distances $\ell(e)=c(e)/b(e)$, where $b(e)$ is a benefit score for each edge $e\in E$. The objective is to obtain a cost/benefit tradeoff: edges with small cost and large benefit are favored to be included in the backbone. In the description of the greedy algorithm that follows, we assume that the effective distance $\ell(e)$ of each edge is given as input.
The algorithm works in an iterative fashion, maintaining and growing the backbone $R$, starting from the empty set. In the $i$-th iteration the algorithm picks a source–destination pair $(s_i,t_i)$ from the traffic log $\mathcal{L}$, and “serves” it. Serving a pair $(s_i,t_i)$ means computing a shortest path from $s_i$ to $t_i$, and adding its edges to the current backbone $R$, if they are not already there. For the shortest-path computation the algorithm uses the effective distances $\ell(e)$. When an edge $e$ is newly added to the backbone its cost $c(e)$ is subtracted from the available budget. Here, the actual cost $c(e)$ of the edge (instead of the effective distance $\ell(e)$) is used. Also, its effective distance $\ell(e)$ is reset to zero, since it can be used for free in subsequent iterations of the algorithm. The source–destination pair that is chosen to be served in each iteration is the one that reduces the stretch factor the most at that iteration; hence the greedy nature of the algorithm.
The algorithm proceeds until it exhausts all its budget or until the stretch factor becomes equal to 1 (which means that all pairs in the log are served via a shortest path). The pseudocode for the greedy algorithm is shown as Algorithm 1.
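The greedy scheme described above can be sketched in Python as follows. This is a simplified illustration under our own assumptions rather than the exact pseudocode of Algorithm 1: edges are keyed by `frozenset((u, v))`, costs and benefits are strictly positive, each log entry has distinct endpoints, and the loop stops when no remaining pair can be served within the budget. Since the total volume and the connectivity of the full network are constants, minimizing the stretch factor is equivalent to maximizing the sum of volume over backbone distance, which is the score used here. All function names are hypothetical.

```python
import heapq
import math

def shortest_path(adj, weight, s, t):
    """Dijkstra from s to t; returns (distance, list of edge keys on the path)."""
    dist, prev = {s: 0.0}, {}
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == t:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v in adj[u]:
            nd = d + weight[frozenset((u, v))]
            if nd < dist.get(v, math.inf):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if t not in dist:
        return math.inf, []
    path, v = [], t
    while v != s:
        path.append(frozenset((prev[v], v)))
        v = prev[v]
    return dist[t], path

def build_adj(nodes, edge_keys):
    adj = {u: [] for u in nodes}
    for e in edge_keys:
        u, v = tuple(e)
        adj[u].append(v)
        adj[v].append(u)
    return adj

def inv_distance_sum(nodes, cost, backbone, log):
    """Sum of w / d_R(s, t) over the log; disconnected pairs add zero."""
    adj = build_adj(nodes, backbone)
    total = 0.0
    for s, t, w in log:
        d, _ = shortest_path(adj, cost, s, t)
        if d < math.inf:
            total += w / d
    return total

def greedy_backbone(cost, benefit, log, budget):
    nodes = {u for e in cost for u in e}
    adj = build_adj(nodes, cost)
    eff = {e: cost[e] / benefit[e] for e in cost}  # effective distances
    backbone, remaining = set(), budget
    while True:
        best = None
        for s, t, _ in log:
            _, path = shortest_path(adj, eff, s, t)
            new_edges = [e for e in path if e not in backbone]
            extra = sum(cost[e] for e in new_edges)  # actual cost, not effective
            if not new_edges or extra > remaining:
                continue
            score = inv_distance_sum(nodes, cost, backbone | set(new_edges), log)
            if best is None or score > best[0]:
                best = (score, new_edges, extra)
        if best is None:
            return backbone
        backbone |= set(best[1])
        remaining -= best[2]
        for e in best[1]:
            eff[e] = 0.0  # backbone edges are free in later iterations
```

Passing a benefit of 1 for every edge gives the uniform variant, while passing edge-betweenness scores gives the centrality-weighted variant.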
We experiment with two variants of this greedy scheme, depending on the benefit score we use. These are the following:
 Greedy:

We use uniform benefit scores ($b(e)=1$).
 GreedyEB:

The benefit score $b(e)$ of an edge $e$ is set equal to its edge-betweenness centrality.
As we show in the experimental section, the greedy algorithm provides solutions of good quality, in particular the variant with the edge-betweenness weighting scheme. However, the algorithm is computationally expensive, and thus in this section we discuss a number of optimizations. We start by analyzing the running time of the algorithm.
Running time. Assume that the benefit scores are given for all edges , and that the algorithm performs iterations until it exhausts its budget. In each iteration we need to perform shortest-path computations, where is the size of the traffic log . A shortest-path computation is , and thus the overall complexity of the algorithm is . The number of iterations depends on the available budget and in the worst case it can be as large as . However, since we aim at finding sparse backbones, the number of iterations is typically smaller.
Optimizations with no approximation. We first show how to speed up the algorithm while guaranteeing the same solution as the naïve implementation of the greedy algorithm. Since the most expensive component is the computation of shortest paths on the newly formed network, we make sure that we compute a shortest path only when needed. Our optimizations consist of two parts.
First, during the execution of the algorithm we maintain the connected components that the backbone creates in the network. Then, we do not need to compute shortest paths for any pair $(s,t)$ for which $s$ and $t$ belong to different connected components; we know that their distance is $\infty$. This optimization is effective at the early stages of the algorithm, when many terminals belong to different connected components.
Second, when computing the decrease in the stretch factor due to a candidate shortest path to be added to the backbone, for pairs for which we have to recompute a shortest-path distance, we first compute an optimistic lower bound, based on the shortest path on the whole network (which we compute once in a preprocessing step). If this optimistic lower bound is not better than the current best stretch factor then we can skip the computation of the shortest path on the backbone.
As shown in the empirical evaluation of our algorithms, depending on the dataset, these optimization heuristics lead to 20–35% improvement in performance.
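The first of these optimizations can be sketched with a disjoint-set (union-find) structure, which supports the incremental edge additions that the greedy algorithm performs. The choice of union-find is our assumption; the text only states that connected components are maintained. The names `DisjointSet` and `pairs_needing_distance` are hypothetical.

```python
class DisjointSet:
    """Union-find with path halving, tracking backbone components."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def pairs_needing_distance(backbone_edges, log):
    """Keep only log entries whose endpoints share a backbone component.

    All other pairs are known to be at infinite distance on the backbone,
    so no shortest-path computation is needed for them.
    """
    components = DisjointSet()
    for u, v in backbone_edges:
        components.union(u, v)
    return [(s, t, w) for s, t, w in log
            if components.find(s) == components.find(t)]
```

In an incremental implementation the structure would be kept across iterations and updated with `union` whenever an edge joins the backbone, rather than rebuilt as in this sketch.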
Optimization based on landmarks. To further improve the running time of the algorithm we compute shortest-path distances using landmarks [?, ?], an effective technique to approximate distances on graphs. Here we use the approach of Potamias et al. [?]: A set of $p$ landmarks $y_1,\ldots,y_p$ is selected and for each vertex $v$ the distances to all landmarks are computed and stored as a $p$-dimensional vector representing vertex $v$. The distance between two vertices $u$ and $v$ is then approximated as $\min_{j=1,\ldots,p}\{d(u,y_j)+d(y_j,v)\}$, i.e., the tightest of the upper bounds induced by the set of landmarks.
To select landmarks we use high-degree non-adjacent vertices in the graph, which is reported as one of the best-performing methods by Potamias et al. [?]. Note that the distances are now approximations to the true distances, and the method trades accuracy for efficiency via the number of landmarks selected. In practice a few hundred landmarks are sufficient to provide high-quality approximations even for graphs with millions of vertices [?].
For the running-time analysis, note that in each iteration we need to compute the distance between all graph vertices and all landmarks. This can be done with single-source shortest-path computations and the running time is . The landmarks can then be used to approximate shortest-path distances between all source-destination pairs in the traffic log , with running time . Thus, the overall complexity is . Since is expected to be much smaller than , the method provides a significant improvement over the naïve implementation of the greedy algorithm. As shown in the experimental evaluation, using landmarks provides an improvement of up to 4 times in terms of runtime in practice.
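The landmark-based approximation can be sketched as follows. The sketch assumes an undirected graph given as an adjacency map with positive costs; the greedy selection of high-degree, pairwise non-adjacent vertices is one plausible reading of the selection rule, and all function names are hypothetical.

```python
import heapq
import math

def dijkstra(adj, source):
    """Shortest-path distances from source; adj maps node -> [(neighbor, cost)]."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, c in adj.get(u, []):
            nd = d + c
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def select_landmarks(adj, count):
    """Pick up to `count` high-degree vertices, no two of them adjacent."""
    chosen, blocked = [], set()
    for v in sorted(adj, key=lambda x: len(adj[x]), reverse=True):
        if v in blocked:
            continue
        chosen.append(v)
        blocked.add(v)
        blocked.update(n for n, _ in adj[v])
        if len(chosen) == count:
            break
    return chosen

def landmark_vectors(adj, landmarks):
    """One single-source shortest-path computation per landmark."""
    return [dijkstra(adj, y) for y in landmarks]

def approx_distance(vectors, u, v):
    """Tightest landmark upper bound: min over landmarks of d(u, y) + d(y, v)."""
    return min((d.get(u, math.inf) + d.get(v, math.inf) for d in vectors),
               default=math.inf)
```

The returned value never underestimates the true distance, and it is exact whenever some landmark lies on a shortest path between the two vertices.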
As we already discussed in the previous sections, our greedy algorithm uses edge centrality measures for the benefit scores $b(e)$. In this section we discuss edge-betweenness, and in particular show how we adapt the measure to take into account the traffic log $\mathcal{L}$.
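We first recall the standard definition of edge-betweenness. Given a network $G=(V,E)$, we define $V_2$ to be the set of all pairs of nodes of $G$. Given a pair of nodes $(s,t)\in V_2$ and an edge $e\in E$, we denote by $\sigma_{s,t}$ the total number of shortest paths from $s$ to $t$, and by $\sigma_{s,t}(e)$ the total number of shortest paths from $s$ to $t$ that pass through edge $e$.
The standard definition of edge-betweenness centrality of edge $e$ is the following:
\[ EB(e) = \sum_{(s,t)\in V_2} \frac{\sigma_{s,t}(e)}{\sigma_{s,t}}. \]
In our problem setting we take into account the traffic log $\mathcal{L}$, and we define the edge-betweenness of an edge $e$ with respect to log $\mathcal{L}$ as follows:
\[ EB_{\mathcal{L}}(e) = \sum_{(s,t,w)\in\mathcal{L}} w\,\frac{\sigma_{s,t}(e)}{\sigma_{s,t}}. \]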
In this modified definition only the source–destination pairs of the log contribute to the centrality of the edge , and the amount of contribution is proportional to the corresponding traffic. The adapted edge-betweenness can still be computed in time [?].
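A direct way to compute the log-adapted edge-betweenness is sketched below: for each log entry, shortest paths are counted from both endpoints, and an edge receives the fraction of shortest paths passing through it, scaled by the traffic volume. This is a straightforward per-pair computation written for clarity, not the faster algorithm cited above. It assumes an undirected network, strictly positive costs that can be compared exactly (e.g., integers), and distinct endpoints in each log entry; the function names are hypothetical.

```python
import heapq
import math
from collections import defaultdict

def count_paths(adj, source):
    """Dijkstra that also counts shortest paths; returns (dist, sigma)."""
    dist = {source: 0.0}
    sigma = defaultdict(int)
    sigma[source] = 1
    heap, done = [(0.0, source)], set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, c in adj[u]:
            nd = d + c
            if nd < dist.get(v, math.inf):
                dist[v], sigma[v] = nd, sigma[u]  # strictly shorter path found
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v]:
                sigma[v] += sigma[u]  # another shortest path of equal length
    return dist, sigma

def log_edge_betweenness(edges, log):
    """Traffic-weighted edge-betweenness; edges are (u, v, cost) triples."""
    adj = defaultdict(list)
    for u, v, c in edges:
        adj[u].append((v, c))
        adj[v].append((u, c))
    score = {(u, v): 0.0 for u, v, _ in edges}
    for s, t, w in log:
        dist_s, sigma_s = count_paths(adj, s)
        if t not in dist_s:
            continue  # disconnected pair contributes nothing
        dist_t, sigma_t = count_paths(adj, t)
        for u, v, c in edges:
            through = 0
            # The edge lies on a shortest s-t path if the distances add up,
            # traversed either as u -> v or as v -> u.
            if dist_s.get(u, math.inf) + c + dist_t.get(v, math.inf) == dist_s[t]:
                through += sigma_s[u] * sigma_t[v]
            if dist_s.get(v, math.inf) + c + dist_t.get(u, math.inf) == dist_s[t]:
                through += sigma_s[v] * sigma_t[u]
            score[(u, v)] += w * through / sigma_s[t]
    return score
```

On a 4-cycle with unit costs and two units of traffic between opposite corners, there are two shortest paths, so every edge carries half of the traffic and receives a score of 1.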
The aim of the experimental section is to evaluate the performance of the proposed algorithm, the optimizations, and the edge-betweenness measure. We also compare our algorithm with other state-of-the-art methods which attempt to solve a similar problem.
Datasets. We experiment with six real-world datasets: four transportation networks, one web network, and one internet-traffic network. For five of the datasets we also obtain real-world traffic, while for one we use synthetically generated traffic. The characteristics of our datasets are provided in Table ??, and a brief description follows.
Dataset  Type  # Nodes  # Edges  Real network  Real traffic
LondonTube  transportation  316  724  ✓  ✓
USFlights  transportation  1 268  51 098  ✓  ✓
UKRoad  transportation  8 341  13 926  ✓  ✗
NYCTaxi  transportation  50 736  158 898  ✓  ✓
Wikispeedia  web  4 604  213 294  ✓  ✓
Abeline  internet  12  15  ✓  ✓
LondonTube. The London Subway network consists of subway stops and links between them (http://bit.ly/1C9PbLT). We use the geographic distance between stations as a proxy for edge costs. We also obtain a traffic log extracted from the Oyster card system (http://bit.ly/1qM2BYi). The log consists of aggregate trips made by passengers between pairs of stations during a one-month period (Nov-Dec 2009). We filter out source-destination pairs with traffic less than 100 and remove bidirectional pairs by selecting one of them at random and summing up their traffic.
USFlights. We obtain a large network of US airports, and the list of all flights within the US during 2009–2013, from the Bureau of Transportation Statistics (http://1.usa.gov/1ypXYvL). Flying distances between airports, obtained using Travelmath.com, are used as edge costs. The traffic volume is the number of flights between airports.
NYCTaxi. We obtain the complete road network of NYC using OpenStreetMap data (http://metro.teczno.com/#newyork). In this network each node corresponds to a road intersection and each link corresponds to a road segment. Edge costs are computed as the geographic distances between the junctions. Data on the pickup and dropoff locations of NYC taxis for 2013 (http://chriswhong.com/opendata/foil_nyc_taxi/) was used to construct the traffic log, which consists of the most frequently used source-destination pairs.
Wikispeedia. Wikispeedia (http://snap.stanford.edu/data/wikispeedia.html) [?] is an online crowdsourcing game designed to measure semantic distances between two Wikipedia pages using the paths taken by humans to reach one page from the other. This dataset contains a base network of hyperlinks between Wikipedia pages and the paths taken by users between two pages. We construct the traffic log using the unique (start, end) pages from this data.
UKRoad. Next we consider the UK road network (http://www.dft.gov.uk/trafficcounts/download.php). The network construction is similar to what was done with the NYCTaxi data. For simplicity we use only the largest connected component. Since we were not able to obtain real-world traffic data for this network, we generate synthetic traffic logs simulating different scenarios. In particular we generate traffic logs according to four different distributions: (i) power-law traffic volume, power-law $s$–$t$ pairs; (ii) power-law traffic volume, uniformly random $s$–$t$ pairs; (iii) uniformly random traffic volume, power-law $s$–$t$ pairs; and (iv) uniformly random traffic volume, uniformly random $s$–$t$ pairs. These different distributions capture different traffic-volume possibilities and hence help in understanding the behavior of our algorithm with respect to the traffic log $\mathcal{L}$.
Abeline. For a qualitative analysis we also consider the well-known Abeline dataset, consisting of a sample of the network traffic extracted from the Internet2 backbone (http://www.internet2.edu), which carries the network traffic between major universities in the continental US. The network consists of 12 nodes and 15 high-capacity links. Associated with each physical link we also have the capacity of the link, which serves as a proxy for the cost of the link. We obtain traffic logs from 2003 between all pairs of nodes.
Baseline. To obtain better intuition for the performance of our methods we define a simple baseline, where a backbone is created by adding edges in increasing order of their effective distances $\ell(e)=c(e)/b(e)$, where $b(e)$ is the edge-betweenness of $e$; this was the best-performing baseline among the baselines we tried, such as adding source–destination pairs one by one (i) randomly, (ii) in decreasing order of volume ($w_i$), (iii) in increasing order of effective distance defined using closeness centrality, etc.
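The baseline amounts to a single sort, as the following Python sketch shows. It assumes that edges are added until the next edge in the order no longer fits the budget; the stopping rule is our assumption, since the text does not state it, and the function name is hypothetical.

```python
def baseline_backbone(cost, benefit, budget):
    """Add edges in increasing order of effective distance cost / benefit.

    cost, benefit: dicts keyed by edge identifier, with positive values.
    Stops at the first edge that would exceed the budget (assumption).
    """
    order = sorted(cost, key=lambda e: cost[e] / benefit[e])
    backbone, spent = [], 0.0
    for e in order:
        if spent + cost[e] > budget:
            break
        backbone.append(e)
        spent += cost[e]
    return backbone
```

Unlike the greedy algorithm, this procedure never looks at paths, so the edges it selects need not connect any source–destination pair until a large part of the budget has been spent.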
We focus our evaluation on three main criteria: (i) comparison of the performance with and without the edge-betweenness measure; (ii) effect of the optimizations, in terms of quality and speedup; and (iii) effect of allocating more budget on the stretch factor.
Effect of edge-betweenness. We study the effect of using edge-betweenness in the Greedy algorithm. The results are presented in Figure ??.
Effect of landmarks. Landmarks provide faster computation at the expense of some solution quality. Figure ?? shows the speedup achieved when using landmarks. In the figures, BasicGreedyEB indicates the greedy algorithm that does not use any optimizations. GreedyEBCC makes use of the optimizations proposed in Section ?? which do not use approximation. GreedyEBLandmarks* makes use of the landmarks optimization, where * indicates the number of landmarks we tried. Figure ?? shows the performance of the GreedyEB algorithm with and without using landmarks.
Budget vs. stretch factor. We examine the tradeoff between budget and stretch factor for our algorithm and its variants. A lower stretch factor for the same budget indicates that the algorithm is able to pick better edges for the backbone. Figure ?? shows the tradeoff between budget and stretch factor for all our datasets. In all figures the budget used by the algorithms, shown on the $x$-axis, is expressed as a percentage of the total cost of all the edges in the network.
Key findings. From all the above results, we would like to highlight the following points.
1. The greedy algorithm and its variants perform much better than the baseline (see Figure ??). Note that the baseline is not included in Figure ??(g,h) because the edges in the baseline are added one by one and, for a large interval of the cost, the stretch factor was very large or even infinite. This shows that the backbone produced by our greedy approach not only consists of edges with low benefit, but also tries to reuse a lot of edges, hence obtaining a lower stretch factor.
2. The backbones discovered by our algorithms are sparse and summarize the given traffic well (Figures ?? and ??). In all cases, with about 15% of the edge cost in the network it is possible to summarize the traffic with stretch factor close to 1. In some cases, an even smaller budget (less than 15%) is sufficient to reach a low stretch-factor value.
3. Incorporating edge-betweenness as an edge-weighting scheme in the algorithm improves the performance; in certain cases there is an improvement of at least 50% (see Figure ??; in most cases, even though there is a significant improvement, the plot is overshadowed by a worse-performing baseline). This is because edges of high centrality are included in many shortest paths, so favoring them leads to reusing many edges.
4. The optimizations we propose in Section ?? help in reducing the running time of our algorithm (see Figure ??). For the optimizations not using landmarks, we see around 30% improvement in running time. Using landmarks substantially decreases the time taken by the algorithms (3–4 times). While there is a compromise in the quality of the solution, we can observe from Figure ?? that the performance drop is small in most cases and can be controlled by choosing the number of landmarks accordingly. Our algorithms, using the various optimizations we propose, are able to scale to large, real-world networks with tens of thousands of nodes, which is the typical size of a road/traffic network.
In this section, we compare the performance of BackboneDiscovery with other related work in the literature. The comparison is based on two factors: (i) stretch factor, and (ii) percentage of edges covered by the solution. Intuitively, a good backbone should try to minimize both, i.e., produce a sparse backbone which also preserves the shortest paths between vertices as well as possible.
Comparison with Prize Collecting Steiner-forest (PCSF). Prize Collecting Steiner-forest [?] is a variant of the classic Steiner-forest problem, which allows for disconnected source–destination pairs, by paying a penalty. The goal is to minimize the total cost of the solution by ‘buying’ a set of edges (to connect the $s$–$t$ pairs) and paying the penalty for those pairs which are not connected. We compare the performance of our algorithm with PCSF, based on two factors: (i) stretch factor (Figure ??(a)), and (ii) percentage of edges covered by the solution (Figure ??(b)). We use the same $(s,t)$ pairs that we use in our algorithm and set the traffic volume as the penalty score in PCSF. We first run PCSF on our datasets and compute the budget of the solution produced. Using this budget as input to our algorithm (GreedyEB), we compute our backbone.
We can see from Figure ??(a) that our algorithm produces a backbone with a much better stretch factor than PCSF. In most datasets, our algorithm produces a backbone which is at least 2 times better in terms of stretch factor.
Figure ??(b) compares the fraction of edges covered by our algorithm and PCSF. We observe that the fraction of edges covered by our algorithm is lower than that of PCSF. This could be because our algorithm reuses edges belonging to multiple paths. Figures ??(a,b) show that even though our solution is much better in terms of stretch factor, we produce sparse backbones (in terms of the percentage of edges covered).
Comparison with k-spanner  As described in Section Discovering the Network Backbone from Traffic Activity Data, our problem is similar to the k-spanner problem [?] in the sense that we try to minimize the stretch factor. A k-spanner of a graph is a subgraph in which any two vertices are at most k times farther apart than in the original graph. One of the main advantages of our algorithm over spanners is that spanners cannot handle disconnected vertices, whereas we propose and optimize a modified version of the stretch factor in order to handle them. As with PCSF, we first compute a 2-spanner using a greedy 2-approximation algorithm and compute the budget used. We then run our algorithm with the same budget. Figures Discovering the Network Backbone from Traffic Activity Data(a,b) show the performance of our algorithm in terms of stretch factor and percentage of edges covered. Our objective here is to measure the cost, in terms of stretch factor, that our algorithm pays for allowing disconnected vertices. We can clearly observe that even though we allow for disconnected pairs, our algorithm performs slightly better in terms of stretch factor and also produces a significantly sparser backbone.
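To make the spanner baseline concrete, the sketch below builds a k-spanner with the classic greedy construction: edges are scanned in order of increasing weight, and an edge (u, v) of weight w is kept only if the spanner built so far does not already connect u and v within distance k·w. This is a stand-in for illustration and is not necessarily the greedy algorithm used in the experiment above; the function names `bounded_dist` and `greedy_spanner` are hypothetical.

```python
import heapq
from math import inf


def bounded_dist(adj, src, dst, limit):
    """Dijkstra from src to dst that ignores paths longer than limit.

    Returns the distance if it is at most limit, otherwise inf.
    """
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, inf):
            continue  # stale heap entry
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd <= limit and nd < dist.get(v, inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return inf


def greedy_spanner(edges, k):
    """Greedy k-spanner of an undirected graph given as (u, v, weight) triples."""
    spanner = []
    adj = {}
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        # Keep the edge only if the current spanner stretches (u, v) beyond k.
        if bounded_dist(adj, u, v, k * w) > k * w:
            spanner.append((u, v, w))
            adj.setdefault(u, {})[v] = w
            adj.setdefault(v, {})[u] = w
    return spanner
```

The total cost of the returned edge set can then serve as the budget handed to the backbone algorithm, mirroring the protocol described above.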
Comparison with Toivonen et al. [?]  Next, we compare our algorithm with that of Toivonen et al. [?], who propose a framework for path-oriented graph simplification in which edges are pruned while keeping the original quality of the paths between all pairs of nodes. The objective here is to check how well we perform in terms of graph sparsification. Figures Discovering the Network Backbone from Traffic Activity Data(a,b) show the comparison in terms of stretch factor and percentage of edges covered. As in the comparisons above, we use the same budget as that used by Toivonen’s algorithm. We observe that for most of the datasets, their algorithm works poorly in terms of sparsification, pruning less than 20% of the edges (Figure Discovering the Network Backbone from Traffic Activity Data(b)). Our algorithm performs better both in terms of the stretch of the final solution and in terms of the sparseness of the backbone.


The above results, which compare our work with existing approaches, showcase the power of our algorithm in finding a concise representation of the graph while maintaining a low stretch factor. In all three cases, our algorithm performs considerably better than the related work.
Fairness  Though we claim that our approach performs better, we need to keep in mind that these algorithms differ in what they optimize. PCSF does not optimize for stretch factor. Spanners and Toivonen et al. do not have a traffic log ((s, t) pairs). They also do not try to optimize stretch factor. In this section, we were only interested in contrasting the performance of our approach with existing state-of-the-art methods, and in showing how our approach differs and why it is better suited to our setting.
The backbone of the NYC taxi traffic, as discovered by our algorithms Greedy and GreedyEB, is shown in Figure Discovering the Network Backbone from Traffic Activity Data. We see that both backbones consist of many street stretches in midtown (around Times Square) while serving lowertown (Greenwich Village and Soho) and uptown (Morningside Heights). We also note that there are stretches to the major transportation centers, such as LaGuardia airport, the World Financial Center Ferry Terminal, and Grand Central Terminal, as well as to the Metropolitan Museum. Comparing the Greedy and GreedyEB backbones, we see that GreedyEB places more emphasis on the traffic to lowertown, and ignores the northern stretch via the Robert Kennedy Bridge, as it is less likely to be included in many shortest paths. The case study reiterates the advantage of using edge betweenness to guide the selection of the backbone toward edges that are likely to be used more. This is consistent with the well-established notion of Wardrop Equilibrium in Transportation Science, which states that users (in a non-cooperative manner) seek to minimize their cost of transportation [?].


We carry out a qualitative analysis on the Abilene dataset. The results of applying the Greedy algorithm are shown in Figure Discovering the Network Backbone from Traffic Activity Data (the two nodes in Atlanta have been merged). The results provide preliminary evidence that the backbone produced by our approach can be tightly integrated with software-defined networks (SDN), an increasingly important area in communication networks [?]. The objective of SDN is to allow a software layer to control the routers and switches in the physical layer based on the profile and shape of the traffic. This is precisely what our solution accomplishes in Figure Discovering the Network Backbone from Traffic Activity Data. The design of data-driven logical networks will be an important operation implemented through an SDN and will help network designers manage traffic in real time.


As already noted, BackboneDiscovery is related to the spanner and the Steiner-forest problems, and the decision versions of both are known to be NP-complete [?, ?]. The spanner problem is designed to bound the stretch factor for all pairs of nodes and not just for those from a specific set of pairs. The Steiner-forest problem, on the other hand, is designed to keep the pairs connected with a minimal number of edges and is agnostic about the stretch factor. Both problems consider only structural information and completely ignore functional (activity) data that may be available about the usage of the network. They also impose the strict requirement that all nodes be covered, which makes them restrictive.
The Prize-Collecting Steiner-forest problem (PCSF) [?] is a version of the Steiner-forest problem that allows source–destination pairs to remain disconnected by imposing a penalty on each disconnected pair. Even in this variant, there is no budget or stretch requirement, and hence the optimization problem that PCSF solves is different from the one we solve. We show how our algorithm fares in comparison to PCSF in Section Discovering the Network Backbone from Traffic Activity Data.
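For concreteness, the PCSF objective can be written in its standard form as follows. The notation is an assumption made for this illustration: E is the edge set, c(e) is the cost of edge e, (s_i, t_i) are the source–destination pairs, and π_i is the penalty paid when pair i is left disconnected by the chosen edge set F.

```latex
\min_{F \subseteq E} \;\; \sum_{e \in F} c(e)
  \;+ \sum_{i \,:\, s_i \text{ and } t_i \text{ are not connected in } F} \pi_i
```

The objective contains neither a budget on the total edge cost nor any term that depends on the length of the path connecting s_i to t_i, which is why a PCSF solution can connect a pair through an arbitrarily long detour.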
Another enhancement in our work is to normalize edge costs with measures related to the structure of the network (like edge betweenness [?, ?, ?]). As we show in our experiments, this leads to finding solutions of better quality.
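The idea of structure-aware edge costs can be sketched as follows. The code computes edge betweenness for an unweighted, undirected graph with a Brandes-style accumulation and then rescales each edge cost by dividing it by the betweenness of the edge. The division is an assumption made for this illustration; the exact normalization is the one specified in the description of the algorithm, and the names `edge_betweenness` and `normalized_costs` are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness of an unweighted, undirected graph.

    adj maps every node to a list of its neighbours (each neighbour must
    also be a key). Returns {frozenset({u, v}): score}, where the score
    counts, over unordered node pairs, the fraction of shortest paths
    that use the edge.
    """
    eb = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        sigma = {v: 0 for v in adj}  # number of shortest s-v paths
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:  # breadth-first search from s
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in adj}
        for v in reversed(order):  # accumulate dependencies back toward s
            for u in preds[v]:
                share = sigma[u] / sigma[v] * (1.0 + delta[v])
                eb[frozenset((u, v))] += share
                delta[u] += share
    # Every unordered pair was counted once from each endpoint.
    return {e: score / 2.0 for e, score in eb.items()}


def normalized_costs(costs, eb):
    """Assumed normalization: cost divided by betweenness (edges with eb > 0)."""
    return {e: c / eb[e] for e, c in costs.items() if eb[e] > 0}
```

With this rescaling, an edge that lies on many shortest paths becomes cheaper relative to a peripheral edge of the same raw cost, so a budget-constrained greedy selection is nudged toward edges that many routes can share.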
Our work is different from trajectory mining [?, ?], which considers complete trajectories between source–destination pairs. We do not make use of the trajectories and are only interested in the amount of traffic flowing between a source and a destination. The type of questions we try to answer in this paper also differs from that of trajectory mining. While trajectory mining tries to answer questions like “Which are the most used routes between A and B?”, our paper uses information about traffic from A to B in order to build a sparse backbone of the underlying network that allows traffic to flow from A to B while also keeping global network characteristics in mind.
The BackboneDiscovery problem is also related to finding graph sparsifiers and simplifying graphs. For example, Toivonen et al. [?] as well as Zhou et al. [?] propose an approach based on pruning edges while keeping the quality of the best paths between all pairs of nodes, where quality is defined on concepts such as shortest path or maximum flow. Misiolek and Chen [?] propose an algorithm that prunes edges while maintaining the source-to-sink flow for each pair of nodes. Mathioudakis et al. [?] and Bonchi et al. [?] study the problem of discovering the backbone of a social network in the context of information propagation, which is a different type of activity than the source–destination pairs considered here. Butenko et al. [?] propose a heuristic algorithm for the minimum connected dominating set of wireless networks. There has been some work in social network research on extracting a subgraph from a larger graph subject to constraints [?, ?]. Other forms of network backbone discovery have been explored in domains including biology, communication networks, and the social sciences. The main focus of most of these approaches is on the tradeoff between the level of network reduction and the amount of relevant information to be preserved, either for visualization or for community detection. While in this paper we also try to sparsify a graph, our objective and approach are different from the above: we cast the problem in a well-defined optimization framework in which the structural aspects of the network are captured in the requirement to maintain a low stretch, while the functional requirements are captured in maintaining connectedness between traffic terminals, a combination that has not been explored before.
In the computer network research community, the notion of software-defined networks (SDN), which in principle decouples the network control layer from the physical routers and switches, has attracted a lot of attention [?, ?]. SDN (for example through OpenFlow) will essentially allow network administrators to remotely control routing tables. The BackboneDiscovery problem can be considered an abstraction of the SDN problem, and as we show in our case study on the Abilene dataset, our approach can make use of traffic logs to help SDNs make decisions on routing and switching in the physical layer.
We introduced a new problem, BackboneDiscovery, to address a modern phenomenon: these days not only is the structural information of a network available but, increasingly, highly granular functional (activity) information related to network usage is accessible as well. For example, the aggregate traffic usage of the London Subway between all stations is available from a public website. The BackboneDiscovery problem allowed us to efficiently combine structural and functional information to obtain a detailed understanding of how the Tube is used (see Figure Discovering the Network Backbone from Traffic Activity Data). From a computational perspective, the BackboneDiscovery problem has elements of both the spanner and the Steiner-forest problems and thus requires new algorithms to maintain low stretch and connectedness between important nodes subject to a budget constraint. We compare our algorithm with other similar algorithms and show how our algorithm differs and performs better in our setting. Our case studies show that the proposed methods apply to a wide range of applications, including network and traffic planning.
Though our algorithm makes use of shortest paths, in practice any other type of path could be incorporated into it. We leave this generalization for future analysis. The use of the harmonic mean not only allows us to handle disconnected (s, t) pairs, but also makes our stretch factor measure more sensitive to outliers. As future work, we also plan a deeper theoretical analysis of the algorithm and of the stretch factor measure.
 [1] P. Boldi and S. Vigna. Axioms for centrality. CoRR, abs/1308.2140, 2013.
 [2] F. Bonchi, G. De Francisci Morales, A. Gionis, and A. Ukkonen. Activity preserving graph simplification. DMKD, 27(3), 2013.
 [3] U. Brandes and C. Pich. Centrality estimation in large networks. IJBC, 17(7), 2007.
 [4] S. Butenko, X. Cheng, C. A. Oliveira, and P. M. Pardalos. A new heuristic for the minimum connected dominating set problem on ad hoc wireless networks. Cooperative Systems, 3, 2004.
 [5] M. Casado, M. J. Freedman, J. Pettit, J. Luo, N. Gude, N. McKeown, and S. Shenker. Rethinking enterprise network control. IEEE/ACM Trans. Netw., 17(4):1270–1283, 2009.
 [6] A. Das Sarma, S. Gollapudi, M. Najork, and R. Panigrahy. A sketch-based distance oracle for web-scale graphs. In WSDM, 2010.
 [7] N. Du, B. Wu, and B. Wang. Backbone discovery in social networks. Web Intelligence, 2007.
 [8] F. Giannotti, M. Nanni, F. Pinelli, and D. Pedreschi. Trajectory pattern mining. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 330–339. ACM, 2007.
 [9] M. Girvan and M. Newman. Community structure in social and biological networks. PNAS, 2002.
 [10] M. Hajiaghayi, R. Khandekar, G. Kortsarz, and Z. Nutov. Prize-collecting Steiner network problems. In Integer Programming and Combinatorial Optimization, pages 71–84. Springer, 2010.
 [11] H. Kim and N. Feamster. Improving network management with software defined networking. IEEE Communications Magazine, 51(2):114–119, 2013.
 [12] M. Marchiori and V. Latora. Harmony in the small world. Physica A, 285, 2000.
 [13] M. Mathioudakis, F. Bonchi, C. Castillo, A. Gionis, and A. Ukkonen. Sparsification of influence networks. In KDD, 2011.
 [14] E. Misiolek and D. Z. Chen. Two flow network simplification algorithms. IPL, 97, 2006.
 [15] G. Narasimhan and M. Smid. Geometric Spanner Networks. Cambridge University Press, 2007.
 [16] M. Newman. A measure of betweenness centrality based on random walks. Social Networks, 27, 2005.
 [17] M. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E, 69, 2004.
 [18] M. Potamias, F. Bonchi, C. Castillo, and A. Gionis. Fast shortest path distance estimation in large networks. In CIKM, 2009.
 [19] N. Ruan, R. Jin, G. Wang, and K. Huang. Network backbone discovery using edge clustering. arXiv:1202.1842, 2012.
 [20] H. Toivonen, S. Mahler, and F. Zhou. A framework for path-oriented network simplification. In IDA, 2010.
 [21] J. Wardrop and J. Whitehead. Correspondence. Some theoretical aspects of road traffic research. In ICE:Engineering Divisions, page 767, 1952.
 [22] R. West, J. Pineau, and D. Precup. Wikispeedia: An online game for inferring semantic distances between concepts. In IJCAI, pages 1598–1603, 2009.
 [23] D. Williamson and D. Shmoys. The Design of Approximation Algorithms. CUP, 2011.
 [24] Y. Zheng, L. Zhang, X. Xie, and W.-Y. Ma. Mining interesting locations and travel sequences from GPS trajectories. In Proceedings of the 18th international conference on World Wide Web, pages 791–800. ACM, 2009.
 [25] F. Zhou, S. Mahler, and H. Toivonen. Network simplification with minimal loss of connectivity. In IDA, 2010.