Simplifying Wireless Social Caching
Social groups give the opportunity for a new form of caching. In this paper, we investigate how a social group of users can jointly optimize bandwidth usage, by each caching parts of the data demand, and then opportunistically share these parts among themselves upon meeting. We formulate this problem as a Linear Program (LP) with exponential complexity. Based on the optimal solution, we propose a simple heuristic inspired by the bipartite set-cover problem that operates in polynomial time. Furthermore, we prove a worst case gap between the heuristic and the LP solutions. Finally, we assess the performance of our algorithm using real-world mobility traces from the MIT Reality Mining project dataset and two mobility traces that were synthesized using the SWIM model. Our heuristic performs close to the optimal in most cases and outperforms alternative solutions.
Today, a considerable fraction of data requirements in wireless networks comes from “social groups”. Members of a social group share common interests/goals and exhibit frequent and regular meeting patterns. Situations may arise where accommodating the data requirements of a social group through the wireless network is highly costly or infeasible. Examples of such scenarios are: (1) a group of students attending an online course in an economically-challenged country, where it is costly to download the material that each student needs; (2) a group of tourists interested in obtaining touristic advertising and videos in a foreign country, where it is expensive to have cellular data connection; (3) in the aftermath of catastrophic emergencies, where the infrastructured networks are compromised and it is infeasible to establish stable connections with citizens. These examples highlight the critical importance of reducing the dependence on infrastructured networks. By exploiting social interactions among group members, it becomes possible to distribute the downloading efforts among the members who can then exchange data through local and cost-free connections.
We consider a social group of members who all wish to acquire (within a time period of duration ) a set of files on their smart wireless devices. These files are stored on a server to which the users have access through a wireless communication link. Examples of this type of scenarios include co-workers downloading files needed before a meeting, conference participants downloading presentations for next sessions, students downloading class materials and sport fans downloading videos during an event. We assume that the group members have regular meeting patterns, which are correlated with the group activity (e.g., work, sport, entertainment); we model these meeting patterns as random events. In particular, we assume that with some probability, members meet each other (one or multiple times) within the period of interest.
In this work we seek to minimize the usage of the bandwidth. As supported by almost all smart devices today, we assume that users can connect either directly to the server through a longhaul connection (e.g., cellular), which is expensive in bandwidth, or to each other, when in physical proximity, through a local and cost-free Device-to-Device (D2D) connection (e.g., Bluetooth). At the beginning of the period, each member downloads a certain amount of the files through the longhaul (bandwidth expensive) connection and locally caches this information. When two (or more) users meet, they exchange what they have in their caches using local (cost-free) connections. We consider two variations: in the direct case, users share only the data they themselves have downloaded (e.g., because of liability/authentication reasons), while in the indirect case, users share both the data they themselves have downloaded as well as the data they have collected through previous encounters with other members. At the end of the time period of duration , if a user has not received yet all the files, she will download the missing amount of data through the longhaul connection. The fundamental question we seek to answer is the following: at the beginning of the period, how much should each user download through the longhaul connection, so that the expected total usage of bandwidth within the period is minimized?
Related Work. Distributed and cooperative caching, as a means of improving the system performance, has received considerable attention lately as summarized next.
Work in the literature has considered the ultimate information-theoretic performance [1, 2, 3]. The common objective of these works is to find the optimal caching policy in a scenario where different users have different demands, where the demands may be uniform  or not [2, 3]. In all these works the amount of caching is known and the randomness lies in the users' demands, while in our scenario the randomness lies in the member encounters.
In a situation where a group of smartphone users, with a common and simultaneous demand, are within proximity, cooperative caching is closely related to cooperative downloading [4, 5, 6]. The key-ingredient of these works, similar to ours, is that each user downloads parts of the content from the server (through a longhaul connection) and then disseminates (through a Wi-Fi connection) these parts to users in proximity. Distinct from these works, we do not a priori assume that users within the same group will meet and be able to exchange data within the prescribed period.
In a scenario where cooperative caching is allowed, a natural question arises on how to create proper incentives for the different users to cache previously downloaded content, which potentially is not any more useful. This problem has been analyzed, e.g., in [7, 8, 9]. In our framework, since users have a common demand, there is no rebate cost on communication within a group and members are always enticed to cache content, leading to distinct algorithms.
Cooperative caching has also been analyzed in the context of delay tolerant networks. In [10, 11], the authors derive the optimal caching policy that maximizes the social welfare, i.e., the average system utility. This metric is a function of other factors, e.g., users' impatience and the popularity of the files. In , the authors aim to minimize the average delay and/or the average consumed energy. This is achieved by letting the server send random linear combinations of data packets to the users, and then, through heuristic algorithms, determining a set of qualified users to broadcast the transmissions to others. The differentiating feature of our work, however, lies in the objective function we optimize for: the number of downloads from the server. This implies that in our scenario, even if the members always have access to the longhaul link, they would still wait until the end of the time period before downloading from the server. In contrast, the incentive in [10, 11] would cause the users to download from the server whenever they have access, while the objective in  is to minimize the average consumed energy and the average delay.
Our work is similar to data offloading in cellular delay-tolerant networks: here, the goal is to reduce cellular data traffic by opportunistically sharing data among end-users through Terminal-To-Terminal (T2T) communications (we refer to  for a comprehensive study on this topic). A widely used approach is the so-called “subset selection”, where the central coordinator (i.e., the server) selects a subset of users to route the required data to other users in the network. In , the authors propose a target-set approach, where the server selects users, with the goal to maximize the number of reached users (through T2T connections). Since this problem is NP-hard, the authors propose a sub-optimal greedy heuristic. The authors in  study the regular interaction patterns among users to predict the VIP users (i.e., those who experience the highest number of meetings); these are then selected to be the local data forwarders. Distinct from these works: (i) we show that, by allowing users to cache network-coded parts of the data, the problem can be formulated as an easy-to-handle Linear Program (LP); (ii) thanks to the rigorous mathematical formulation, we prove an analytical performance guarantee of the proposed caching strategies; (iii) by means of numerical evaluations on real data, we present scenarios in which our approach achieves a better performance with respect to .
Contributions. We first formulate our problem as an LP, which allocates amounts of data to download to each member so as to minimize the expected total cost (total number of downloads). Towards this goal, we assume that the data is coded (as in network coding ). Since each user caches randomly coded data segments, it is unlikely that two different caches have the same content. Thus, a user receives novel information whenever she accesses a cache to which she has not had access before. With this, for members, we have possible meeting patterns, each occurring with a certain probability. The LP is hence of exponential size. We perform several simplification steps and prove that, in the symmetric case, i.e., when all pairs of members meet with equal probability, the complexity of the solution reduces to linear in . Moreover, through an artifact, we show how the indirect case can be studied within the framework of the direct case without the need to develop a separate one.
We then show a surprising connection between our problem and the well-known set-cover problem. In particular, we prove that the solution of the optimal LP is lower bounded by the weighted sum of the solutions of several set-cover problems. Each problem is described by an adjacency matrix, which is related to a possible meeting pattern among the users; the weight depends on the probability that this particular meeting pattern occurs.
Next, inspired by the structure of the solution of the optimal LP, we propose a simple polynomial-time approximation algorithm that we name AlgCov. AlgCov is related to the bipartite set-cover problem, reduces to a closed form expression in the symmetric case, and achieves in our simulations a performance close to the optimal. Moreover, by using approximation techniques and tools from LP duality, we analytically prove that AlgCov outputs a solution that is at most an additive worst-case gap apart from the optimal; the gap depends on the number of members and on the probability that the users meet.
Finally, we evaluate the performance of AlgCov over real-world datasets. We use data from the MIT Reality Mining project , as well as two synthesized mobility traces, generated by the SWIM model : a simulation tool used to synthesize mobility traces of users based on their social interactions. These synthesized traces were created based on real mobility experiments conducted in IEEE INFOCOM 2005  and Cambridge in 2006 . We assess the performance over the case where group members exhibit relatively symmetric meeting patterns (i.e., users have approximately the same expected number of users to meet) as well as asymmetric patterns (i.e., different users have different expected number of users to meet). For both configurations, AlgCov achieves a performance close to the optimal. AlgCov performance is also compared with alternative solutions, e.g., the target-set heuristic in  and CopCash, a strategy which incorporates the concept of caching into the cooperative downloading approach proposed in . This paper is based on the work in , with the following novel contributions: (i) proofs of the theorems in , (ii) Theorems 3.1, 3.3 and 6.1, (iii) connection to the set-cover problem, (iv) CopCash comparison, and (v) SWIM model experiments.
Paper Organization. Section 2 introduces our problem. Section 3 formulates the problem as an exponentially complex LP and shows that this complexity becomes linear in in the symmetric case. Section 4 shows the connection of the LP formulation to the set-cover problem. Section 5 proposes two polynomial time heuristics, based on which we design AlgCov in Section 6. Section 6 also derives an additive gap bound on AlgCov from the optimal solution. Section 7 evaluates the performance of AlgCov over real-world and synthesized traces; Section 7 also provides comparisons with alternative solutions. Finally Section 8 concludes the paper. Some of the proofs can be found in the Appendix.
Notation. Lower and upper case letters indicate scalars, boldface lower case letters denote vectors and boldface upper case letters indicate matrices; calligraphic letters indicate sets; is the cardinality of , is the set of elements that belong to but not to and is the power set of ; is the set of integers from to ; for ; is the expected value; (respectively, ), is a -long column vector of all ones (respectively, zeros); is the transpose of the matrix ; is the indicator function, i.e., it is equal to when statement is true and otherwise.
Goal. We consider a set of users who form a social group. All users need to obtain the same set of information units (files) that are available from a server, within the same time period of duration . Users can access the server through a direct longhaul wireless link that has a cost per downloaded information unit. They can also exchange data with each other through a cost-free D2D communication link, when (and if) they happen to physically encounter each other, i.e., when their devices can directly connect to each other, e.g., through Bluetooth. Our goal is to minimize the average total downloading cost across the user group. Clearly, with no cooperation, the total cost is .
Assumptions. We make the following assumptions.
Complete encounter cache exchange. According to , the average contact duration between two mobile devices is 250 seconds, sufficient for delivering approximately 750 MB using standard Bluetooth 3.0 technology. Thus, we assume that encounters last long enough to allow the users who meet to exchange their whole cache contents.
No memory constraints. Since the users demand the whole , we assume they have sufficient cost-free storage for it.
A-priori known Bernoulli distribution. We assume that the pairwise meetings between the users (i) are Bernoulli distributed and (ii) occur with probabilities that are known a priori. Studies in the literature have been conducted to provide mobility models for users based on their social interactions (see  and references therein). While such models are fit for simulation purposes, they appear complex to study from an analytical point of view. Thus, we make assumption (i) as a means to derive closed-form solutions and provide analytical performance guarantees; we also assess the validity of our derived solutions on synthesized mobility traces which use mobility models. Assumption (ii) can be attained by exploiting the high regularity in human mobility [24, 25] to infer future meeting probabilities based on previous meeting occurrences.
Delay-tolerant users. Even with a longhaul connection, users can endure a delay, at most of duration , in data delivery so as to receive data via D2D communications.
Network coded downloads. We assume that users download linear combinations of the information units .
Our scheme consists of three phases, namely
1) Caching phase: before the period of duration starts, each user downloads a (possibly different) amount of the file set using the longhaul connection at cost . In our LP formulations, we assume, without loss of generality, that , and thus is a fraction.
2) Sharing phase: when two or more users meet, they opportunistically exchange the data they have in their caches. We consider two separate cases: the direct sharing case, where users share data they themselves have downloaded from the server (e.g., because of liability/authentication reasons), and the indirect sharing case, where users also exchange data they have collected through previous encounters.
3) Post-sharing phase: each user downloads the amount she may be missing from the server at a cost . In the LPs, since we assume , we have that .
With this approach, what remains is to find the optimal caching strategy. For instance, it is not obvious whether a user, who we expect to meet many others, should download most of the file (so that she delivers this to them) or almost none (as she will receive it from them). Moreover, downloading too much may lead to unnecessary cost in the caching phase; downloading too little may lead to not enough cost-free sharing opportunities, and thus unnecessary cost in the post-sharing phase.
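The trade-off described above can be made concrete with a toy case. The following is a minimal sketch, not the paper's formulation: it assumes only two users, a unit-size file set, unit download cost, and a single meeting probability `p`; the function name and parameters are hypothetical.

```python
def two_user_expected_cost(x, p):
    """Expected total cost when both users cache the same fraction x.

    x: fraction of the (unit-size) file set each user downloads up front.
    p: assumed probability that the two users meet during the period.
    """
    caching = 2 * x
    # If they do not meet, each user still misses 1 - x.
    miss_alone = max(0.0, 1.0 - x)
    # If they meet, the coded caches add up, so each user holds min(1, 2x).
    miss_met = max(0.0, 1.0 - 2 * x)
    post_sharing = 2 * ((1 - p) * miss_alone + p * miss_met)
    return caching + post_sharing
```

Under these assumptions, caching nothing or caching everything both cost 2 in expectation, while caching one half each costs 2 - p: too little caching wastes sharing opportunities, too much wastes caching-phase downloads.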
3 LP formulations
We formulate an LP that takes as input the encounter probabilities of the users, and finds that minimize the average total cost during the caching and post-sharing phases. We consider direct and indirect sharing.
Direct Sharing. During encounters users can exchange what they have personally downloaded from the server. Thus, whether users and meet each other multiple times or just once during the period of duration , they can still only exchange the same data; multiple encounters do not help.
We model the encounters between the users as a random bipartite graph , where: (i) contains a node for each of the users at the caching phase, (ii) contains a node for each of the users at the end of the period of duration , and (iii) an edge always exists between a node and itself and it exists between , with , with probability ; this edge captures whether and meet each other (one or multiple times) during the period of duration and share their cache contents. There are realizations (configurations) of such a random graph, indexed as . Each configuration has an adjacency matrix and occurs with probability . For brevity, in what follows we drop the superscript . (Footnote 1: With this formulation, we can directly calculate the probabilities , if the pairwise encounters are independent and Bernoulli distributed with probabilities ; however, the Bernoulli assumption is not necessary, since the formulation only uses the probabilities that could be provided in different ways as well. We also remark that  depends not only on the duration , but also on the start of the sharing period.)
We denote with the vector of the downloaded fractions and with the vector of the received fractions after the sharing phase for the -th configuration. With this we have . The cost of post-sharing downloading in the -th configuration is with
With the goal to minimize the total cost (i.e., caching and post-sharing phases) incurred by all users, the optimal becomes the solution of the following optimization problem
where the variable represents the fraction to be downloaded in the post-sharing phase by user in configuration after receiving data from the users encountered in the sharing phase. Without loss of generality, we assumed . The LP formulation in (1) has complexity (due to the possible realizations over which we have to optimize), which prohibits practical use; yet this formulation still serves to build intuition and offers a yardstick for performance comparisons.
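To illustrate why the number of configurations is the bottleneck, the sketch below evaluates the expected total cost of a given caching vector by brute-force enumeration of all meeting configurations under direct sharing. It is an illustration under stated assumptions, not the LP itself: pairwise meetings are taken to be independent, the file set has unit size and unit cost, and for a fixed caching vector each user's post-sharing download is simply her shortfall. The function name and the inputs `x` and `p` are hypothetical.

```python
from itertools import combinations, product


def expected_cost_by_enumeration(x, p):
    """Expected caching + post-sharing cost under direct sharing.

    x: list of cached fractions, one per user.
    p: matrix (list of lists); p[i][j] for i < j is the assumed probability
       that users i and j meet, independently of all other pairs.
    """
    n = len(x)
    pairs = list(combinations(range(n), 2))
    total = sum(x)  # caching-phase cost
    # One configuration per subset of pairs that meet: 2^(n(n-1)/2) in total.
    for met in product((0, 1), repeat=len(pairs)):
        prob = 1.0
        received = list(x)  # every user always holds her own cache
        for (i, j), m in zip(pairs, met):
            prob *= p[i][j] if m else 1.0 - p[i][j]
            if m:
                received[i] += x[j]
                received[j] += x[i]
        # For a fixed x, the cheapest post-sharing download is the shortfall.
        total += prob * sum(max(0.0, 1.0 - r) for r in received)
    return total
```

The loop body is cheap, but the number of iterations doubles with every additional pair of users, so this is only usable for very small groups.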
The following theorem provides an alternative formulation of the LP in (1), which reduces the complexity to . Let be any row vector of length with zeros and ones as entries. By excluding the all-zero vector, there are such vectors, which we refer to as selection vectors. We let be the set of all such vectors and be the set of users corresponding to the selection vector .
Theorem 3.1. Let . Then the LP in (1) can be equivalently formulated as
where is the probability that user is connected to all and only the users in .
The main observation behind the proof of Theorem 3.1 (see Appendix A) is that the LP in (1) has an inherent symmetry: the Left-Hand-Sides (LHS) of the constraints of the LP in (1) are all repetitions of constraints of the form . Thus, an optimal solution will let the right-hand-side of the constraints with the same LHS be equal. By appropriately grouping these constraints and variables together, we arrive at the LP in (2) which has complexity of .
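The grouping idea can be sketched in code: instead of enumerating whole graph configurations, one can sum, for each user, over the sets of users she may meet. The sketch below is an illustration rather than the LP in (2); it assumes independent pairwise meetings, unit file size and unit cost, and hypothetical names throughout. By linearity of expectation it returns the expected cost of a fixed caching vector while visiting only one term per (user, met-set) pair.

```python
from itertools import combinations


def prob_meets_exactly(i, others_met, p):
    """Probability that user i meets all and only the users in others_met."""
    prob = 1.0
    for j in range(len(p)):
        if j == i:
            continue
        prob *= p[i][j] if j in others_met else 1.0 - p[i][j]
    return prob


def expected_cost_per_user(x, p):
    """Expected caching + post-sharing cost, grouped by user and met-set."""
    n = len(x)
    total = sum(x)  # caching-phase cost
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for met in combinations(others, size):
                got = x[i] + sum(x[j] for j in met)
                weight = prob_meets_exactly(i, set(met), p)
                total += weight * max(0.0, 1.0 - got)
    return total
```

This visits n times 2^(n-1) terms in this sketch, which is still exponential but far fewer than the number of full graph configurations.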
LP for the symmetric case
We now assume that users meet pairwise with the same probability, i.e., during the period of duration . Thus, only depends on the number of encounters (as opposed to which exactly) that the configuration contains. Many realistic scenarios can be modeled as symmetric (e.g., students in the same class, doctors in the same medical department in a hospital, soldiers on the battlefield). The next theorem (whose proof is provided in Appendix B) significantly simplifies the problem in (1).
Theorem 3.2. In the symmetric scenario, the LP in (2) can be simplified to the following LP
The LP in (3) has linear complexity in , i.e., the optimal solution is obtained in polynomial time. It is worth noting that the symmetric assumption is made to get an analytical handle on the problem. When we assess the performance on real datasets we will relax this assumption by requiring users to have an approximately equal average degree (i.e., number of encountered users), as explained in Section 7.
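The reason the symmetric case becomes tractable can be sketched as follows: when every pair meets with the same probability, a user's received amount depends only on how many others she meets, which is binomially distributed. The code below is a hedged illustration, not the LP in (3): it additionally assumes that all users cache the same fraction `x`, independent meetings, unit file size and unit cost; the function names are hypothetical.

```python
from math import comb


def symmetric_expected_cost(n, p, x):
    """Expected total cost when all n users cache the same fraction x and
    every pair meets independently with the same probability p."""
    shortfall = 0.0
    for m in range(n):  # m = number of other users met
        weight = comb(n - 1, m) * p**m * (1 - p) ** (n - 1 - m)
        shortfall += weight * max(0.0, 1.0 - (m + 1) * x)
    return n * x + n * shortfall


def best_uniform_fraction(n, p):
    """Best common fraction among the breakpoints of the cost function.

    The cost is convex and piecewise linear in x with breakpoints at
    1/(m+1), and it is non-increasing below 1/n, so checking these
    candidates is enough.
    """
    candidates = [1.0 / k for k in range(1, n + 1)]
    return min(candidates, key=lambda x: symmetric_expected_cost(n, p, x))
```

Each cost evaluation takes a number of terms linear in the group size, in contrast with the exponential enumeration needed in the general case.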
Indirect Sharing. Enabling users to share both what they downloaded from the server as well as what they received from previous encounters, gives rise to interesting situations, since now, not only multiple encounters help, but also the order of the encounters matters. Assume for instance that, during the period of duration , meets , and later meets . Now will have ’indirectly’ received as well as . If instead, meets before she meets , then will only receive , but will receive both and . Moreover, if again meets later during the period, can receive through this second encounter with .
To model sequential encounters, we split the time period of duration into time segments, such that, during each segment, it is unlikely for more than one encounter opportunity to occur (note that one user can still meet multiple people simultaneously). We then ’expand’ over time our bipartite graph to a -partite layered graph, by adding one layer for each time segment, where the -th time segment corresponds to the duration between times and , with . In contrast to the direct case, at the end of the period of duration , node is able to receive from node , if and only if there exists a path connecting at the first layer to at the last layer; and do not need to have directly met, provided that such a path exists.
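The path condition on the layered graph can be illustrated with a small deterministic sketch. It assumes a hypothetical input format (one list of meeting pairs per time segment) and that, within a segment, users exchange what they held at the start of that segment, so that information advances by one hop per layer.

```python
def caches_after_indirect_sharing(n, segments):
    """Track whose downloads each user holds after sequential encounters.

    n: number of users.
    segments: time segments in chronological order; each segment is a list
              of (i, j) pairs that meet during it (hypothetical format).
    Returns a list of sets: holds[i] contains every user whose originally
    downloaded data user i has obtained, directly or through relays.
    """
    holds = [{i} for i in range(n)]
    for meetings in segments:
        # State at the start of the segment: one hop per layer.
        snapshot = [set(h) for h in holds]
        for i, j in meetings:
            merged = snapshot[i] | snapshot[j]
            holds[i] |= merged
            holds[j] |= merged
    return holds
```

For three users indexed 0, 1, 2, the input `[[(0, 1)], [(1, 2)]]` lets user 2 obtain user 0's data through user 1, whereas the reversed order `[[(1, 2)], [(0, 1)]]` does not, which is the order dependence discussed in the text.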
Note that in the bipartite (direct) case, the probability (respectively ) associated with the edge from user to user (respectively, from to ) indicates how often user shares her cache content with user (respectively, with ), with . Thus, using this time-expanded model, the indirect case can be readily transformed into an equivalent bipartite (direct) case, by replacing the probability of each two users meeting in the bipartite graph, with the probability of a path existing between these two users on the -partite graph. Let and be the time instants at which the -partite graph begins and ends, respectively. Let be the probability that, in the time interval between and , a path exists between user and each of the users inside the set . We let be the probability that users and are connected between time instants and . Given this, the next theorem derives the values of . (Footnote 2: The proof of Theorem 3.3 is based on simple counting techniques.)
Theorem 3.3. Assume a -partite model, where and are respectively the starting and ending times of the -th time segment, . Let be the set of all users, and let and be two sets of users of sizes and , respectively. Let . Denote with the event of having the users in meeting exactly the users in and let be the probability of this event happening between time instants and . Then, for , this probability is given by
if and otherwise, where .
Theorem 3.3 can hence be utilized to cast the indirect sharing version of our problem as a direct sharing one. In particular, an LP of the form described in (1) has to be solved, with the values of being replaced with those obtained from Theorem 3.3. Note that these probabilities might not have the same symmetric structure as those of the direct sharing model. (Footnote 3: In the direct case, when user meets user with probability , then user meets user with the same probability.) However, the problem formulation and the algorithms designed in the next sections are readily suitable for the indirect sharing case where the graph model is not necessarily symmetric. Thus, in the rest of the paper, for theoretical analysis we only consider the direct case. However, in Section 7, we assess the performance of our algorithms for both the direct and indirect cases.
4 Connection to Set-Cover Problem
A Set-Cover (SC) problem is modeled as a bipartite graph , with being the set of nodes (i.e., the universe), being a collection of sets whose union equals the universe and where an edge exists between set and node if node belongs to set . An integer LP formulation of the SC problem then finds the optimal selection variables to minimize the number of selected sets in while ’covering’ all nodes in .
One can therefore think of the LP formulation in (1) as a relaxation of an integer LP, which models a variation of the SC problem. In this variation, there are two major differences: (i) the covering is performed on bipartite graphs, each with a different adjacency matrix , and the same sets are selected to cover ’all’ bipartite graphs; (ii) each node can be covered by either a selected set that contains it, or an ’outside’ source. With reference to the LP in (1), the variables are the selection variables of the sets, and the variables are the outside sources of user in configuration . An illustrative example is given in Figure 1. A conventional SC problem is shown in Figure 1(a), where the sets and contain nodes () and (), respectively, while set contains users . The variables therefore determine which sets are selected for all the nodes to be covered. In this example, the set covers all the nodes. In our variation of the SC problem in Figure 1(b), there are possible instances of bipartite graphs between sets and nodes, where the variables determine the selected sets that are used to simultaneously cover the users in all graphs, while the variables are used to cover the remaining users that were not covered by the selected sets.
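For readers less familiar with the conventional SC problem, the following sketch solves a toy instance exactly by exhaustive search. This is a plain brute-force search, not the integer LP mentioned in the text and not the variation studied in this paper; the function name and the instance used in the note below are hypothetical.

```python
from itertools import combinations


def minimum_set_cover(universe, sets):
    """Return the names of a smallest sub-collection of `sets` covering `universe`.

    sets: dict mapping a set name to the set of nodes it contains.
    Exhaustive search, exponential in len(sets); for toy instances only.
    """
    names = sorted(sets)
    target = set(universe)
    for size in range(1, len(names) + 1):
        for chosen in combinations(names, size):
            covered = set().union(*(sets[name] for name in chosen))
            if covered >= target:
                return list(chosen)
    return None  # the given sets do not cover the universe
```

On a made-up instance with universe `{1, 2, 3}` and sets `S1 = {1, 2}`, `S2 = {2, 3}`, `S3 = {1, 2, 3}`, the search returns `S3` alone, mirroring the situation in which a single set covers every node.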
Theorem 4.1. The optimal solution of the LP in (1) is lower bounded by the weighted sum of the outputs of different LPs as follows. For , the -th LP is a relaxed SC problem over a bipartite graph with adjacency matrix . The output of the -th LP is weighted by .
5 Polynomial Time Approximations
In this section, we propose heuristics that find an approximate solution for the LP in (1) in polynomial time.
Inverse Average Degree (IAD). Consider the symmetric direct case, where users meet pairwise with the same probability . For this scenario, we expect that the bipartite graph has (in expectation) a constant degree of , since each user, on average, meets the same number of people. The degree, in fact, captures the number of users met in that random realization; hence, each user meets (apart from herself) the remaining users with equal probability .
In this case, a natural heuristic is to let each user download , where is a random variable corresponding to the number of people (including herself) a user meets. Figure 2 shows the optimal performance (solid lines), i.e., the solution of the LP in (1), and the performance of the caching strategy when each user downloads (dashed lines) for the symmetric case versus different values of . It is evident from Figure 2 that such a choice of a caching strategy closely follows the performance of the optimal solution in symmetric scenarios. However, this approximation does not perform as well in the general (asymmetric) case. Consider, for example, a ‘star’-like configuration, i.e., is highly connected to the other users, while the other users are only connected to . In this scenario the minimum (i.e., optimal) total average cost is approximately , achieved by letting download the whole file and then share it with the other users. In contrast, if we force to download we would get that downloads (as she meets the other members plus herself) and downloads (as she only meets plus herself). This would imply a total cost of for the caching phase, which grows linearly with and thus can be -times worse than the optimal. This suggests that the optimal search might look like a ‘cover’: a set of nodes that makes it possible to ‘reach’ and ‘convey’ information to all others. This is in line with the observations we previously made in Section 4.
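The star example can be reproduced numerically. The sketch below assumes, as the heuristic's name suggests, that IAD assigns each user the reciprocal of her expected degree (herself plus the expected number of users she meets); the function names and the idealized star with meeting probability 1 are assumptions made for illustration.

```python
def iad_fractions(p):
    """x_i = 1 / E[D_i], where D_i counts user i herself plus the users she meets.

    p: matrix of pairwise meeting probabilities (diagonal entries are ignored).
    """
    n = len(p)
    return [
        1.0 / (1.0 + sum(p[i][j] for j in range(n) if j != i))
        for i in range(n)
    ]


def star_matrix(n):
    """User 0 meets everyone with probability 1; the others never meet each other."""
    p = [[0.0] * n for _ in range(n)]
    for j in range(1, n):
        p[0][j] = p[j][0] = 1.0
    return p
```

For a star with five users, `iad_fractions(star_matrix(5))` gives 0.2 for the center and 0.5 for every other user, i.e., a caching-phase cost of 2.2, whereas letting the center download the whole file costs 1 in total. The gap widens as the group grows.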
Probabilistic Set-Cover (PSC). Building on this intuition, we propose another heuristic that seeks to find a form of a “fractional covering”, where the fraction that each user downloads is a ’cover’ for the users she may meet. In the PSC problem , the covering constraint is replaced with a probabilistic one (i.e., the probability of covering all nodes is greater than a threshold). Here, we propose a variation of the PSC problem with an ’average’ constraint.
We model the problem through a fully-connected bipartite graph , where each edge has an associated weight , that represents how much on average can cover . We set , and . The heuristic then seeks to associate fractional values to the nodes in on the transmitting side, so that the sum of all ’s is minimized, while each node in on the receiving side is covered, i.e., assured to receive (on average) the total amount. This is expressed through the following LP
where is a matrix whose -th entry (with ) is and with ones on the main diagonal. This is very similar to a fractional covering problem formulation, with the only difference that is not forced to be binary, but can have real components to express expectations.
Theorem 5.1. For the symmetric scenario, the optimal solution for the LP in (4), denoted as , coincides with the IAD solution, denoted as , i.e., where .
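The coincidence of the two heuristics in the symmetric case can be checked numerically with a short duality argument. The sketch below assumes the covering LP takes the form "minimize the sum of the fractions subject to P x >= 1 and x >= 0", with P having ones on the diagonal and the common meeting probability elsewhere, as described in the text; the function name and the tolerance are hypothetical.

```python
def symmetric_psc_check(n, p, tol=1e-12):
    """Check that the inverse-average-degree vector is optimal for the
    symmetric covering LP  min 1'x  s.t.  P x >= 1, x >= 0."""
    x = [1.0 / (1.0 + (n - 1) * p)] * n  # candidate: inverse average degree
    P = [[1.0 if i == j else p for j in range(n)] for i in range(n)]
    coverage = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    primal_ok = all(c >= 1.0 - tol for c in coverage)
    # The dual is  max 1'y  s.t.  P'y <= 1, y >= 0.  P is symmetric here,
    # so y = x is dual feasible with the same objective value, which
    # certifies optimality of x by weak duality.
    dual_ok = all(c <= 1.0 + tol for c in coverage)
    return x, primal_ok and dual_ok
```

With four users and meeting probability 0.5, each user is assigned 1 / (1 + 3 * 0.5) = 0.4 and both feasibility checks pass.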
6 AlgCov Algorithm
In this section we present AlgCov, a simple heuristic algorithm that combines both approaches discussed in Section 5. AlgCov makes it possible to calculate the fractions in polynomial time, and achieves a performance close to that of the (exponentially complex) general LP in (1).
To design an algorithm that combines the merits of both heuristics presented in Section 5, one might proceed as follows: (i) compute the solution of the PSC heuristic, (ii) compute the performance of this heuristic by plugging into the LP in (1) and by optimizing over to find the optimal cost for this solution. Then, repeat the same procedure for the IAD solution and finally choose the solution with the smallest cost.
Such a solution is, in theory, possible. However, the process of computing the cost of each heuristic involves solving an exponentially complex LP, prohibiting the applicability of the heuristic. The following theorem helps circumvent this complexity issue (see Appendix E for the proof).
Theorem 6.1 provides a lower bound on the optimal value of the LP in (1), and consequently on the performance of the solution , i.e., , with being obtained by evaluating the LP in (1) while setting . A fairly simple lower bound on the performance of is obtained by simply summing over the elements of the vector , i.e., , with being obtained by evaluating the LP in (1) while setting . As it is much simpler to compute these lower bounds, one can envisage designing an algorithm which, based on the lower bounds, selects one among the two heuristics described in Section 5.
6.2 Algorithm Description
AlgCov takes as input the probability matrix that contains the pairwise probabilities of meeting among users, and outputs the solution as a caching strategy. It first computes the two heuristic solutions, namely and , and then, as shown in Algorithm 1, selects one of them as output based on and , which in Theorem 6.1 we proved to be lower bounds on the actual performance of the heuristics.
6.3 Analytical performance
Symmetric case. In this setting, all pairs of members meet with equal probability. According to Theorem 5.1, both the IAD and the PSC heuristics provide the same solution, i.e., . By optimizing the objective function of the LP in (3) over with , we obtain
which is an upper bound on the optimal performance, i.e., . In order to provide a performance guarantee we need to understand how well approximates the optimal solution of the LP in (3). To this end, we use the lower bound in Theorem 6.1. This lower bound, denoted as , implies that . Using the structure of for the symmetric case in Theorem 5.1, the lower bound becomes
The above gap result ensures that, in the symmetric case, the output of AlgCov is always no more than above the optimal solution of the LP in (3). It is worth noting that is a function only of the number of members and of the probability that users meet.
Through extensive numerical simulations, we observed that is maximum for , i.e., the probability maximizing is . By evaluating in , we get a worst-case (greatest) gap of .
Asymmetric case. In this setting, different pairs of members meet with different probabilities. In this scenario, differently from the symmetric case analysed above, the LP in (4) does not seem to admit an easily computable closed-form solution. For this reason, we next show how the analysis drawn for the symmetric case can be extended to find a performance guarantee for the asymmetric case as well.
In the asymmetric case, an upper bound on the solution of AlgCov can be found by evaluating in (6.3) in , with . In other words, instead of considering different probabilities for different pairs, we set all of them to be equal to the minimum probability; this gives a solution which is always worse, i.e., greater than or equal to the optimal solution of AlgCov evaluated with the original (asymmetric) probability matrix.
Similarly, a lower bound on the optimal solution of the LP in (2) can be found by evaluating in (6) in , with . Again, instead of considering different probabilities for different pairs, we set all of them to be equal to the maximum probability; this gives a solution which is always better, i.e., smaller than or equal to the optimal solution of the LP in (2) evaluated with the original (asymmetric) probability matrix. Thus
This proves that in the asymmetric case, the output of AlgCov is always no more than above the optimal solution of the LP in (2). Similar to the symmetric case, also in this setting is only a function of the number of members and of the probabilities and .
7 Data-set evaluation
In this section, we evaluate and compare the performance of our proposed solutions and algorithms using mobility traces that are obtained either from real-world experiments or via a human mobility trace synthesizer.
Performance Metrics and Comparisons: We are mainly interested in the performance of our proposed caching techniques in comparison to the conventional non-sharing solution.
Specifically, we are interested in assessing the average total cost (total amount downloaded across the caching and post-sharing phases), averaged over the experiments.
If each user simply downloads all data, this cost is .
Against this baseline, we compare the performance of:
Original Formulation and AlgCov: We calculate the average probabilities from our dataset, feed these into the LP in (1) and into Algorithm 1, which assume Bernoulli distributions, and obtain the optimal and the AlgCov heuristic solutions, respectively. For each experiment, we then use these caching amounts, and follow the real meeting patterns recorded in the mobility traces to exchange data and download as needed in the post-sharing phase. Finally, we calculate the actual total cost, averaged over the experiments.
1/: We evaluate the performance when each user caches of the data, independently of the meeting probabilities; this is a naive heuristic that does not fully exploit the opportunistic sharing possibilities.
CopCash: We propose a modified version of the cooperative sharing algorithm originally proposed in , where we incorporate the concept of caching. Cooperative sharing takes advantage of the fact that nearby users, with a common demand, can collectively download the requested set of files. In addition, the proposed CopCash allows users to cache the received files, with the goal of exploiting next encounter opportunities to further share the data with other users. The scheme can be described as follows:
Whenever users meet, each of them first downloads a fraction of the requested set of files, and then they share these parts among themselves through cost-free transmissions (e.g., Bluetooth).
If there exists a user (or a set of users) in the group who has already participated in a cooperative sharing instance, she directly shares what she has in her cache, i.e., what she obtained from previous meetings. In particular, she can share only what she has downloaded (direct sharing) or the whole set of files (indirect sharing).
The sharing procedure continues until the end of the period of duration . At this point, if a user has participated in a previous sharing instance, she will have already obtained the set of files during that sharing instance. Otherwise, she will solely download the file set.
Consider the example in Figure 3. Suppose that, at time instant , and met; hence each of them downloaded of the file. Then, and exchanged the downloaded fractions, thus their demand was satisfied. At time instant , meets and , while meets and . In the case of direct sharing - see Figure 3(a) - (respectively, ) shares with and (respectively, and ) what she has personally downloaded from the server, i.e., of the file; at the end of the sharing period, downloads of the file from the server. With this, each user has to download of the file. In the case of indirect sharing - see Figure 3(b) - and share the whole set of files with the users they are connected to; in this case, does not need to download anything from the server.
Target-Set: We assess the performance of the Target-Set heuristic proposed in  with , i.e., the server assigns one user the task to route the data to other users. We only show the performance of since it is the case which incurs the smallest cost over the datasets that we consider.
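For the first scheme in the list above, the average pairwise meeting probabilities have to be extracted from the trace. One possible estimator is sketched below. It is an assumption made for illustration, not necessarily the procedure used in our experiments: the probability of a pair is taken as the fraction of deadline trials in which the pair met at least once. The function name and the input layout are hypothetical.

```python
from itertools import combinations


def estimate_meeting_probabilities(users, trials):
    """Estimate pairwise meeting probabilities from observed trials.

    users:  list of user identifiers.
    trials: list with one entry per deadline trial; each entry is a set of
            frozenset({u, v}) pairs that met at least once in that trial.
    Returns a dict mapping frozenset({u, v}) to the empirical frequency.
    """
    counts = {frozenset(pair): 0 for pair in combinations(users, 2)}
    for met_pairs in trials:
        for pair in met_pairs:
            if pair in counts:  # ignore pairs involving users outside the group
                counts[pair] += 1
    n = len(trials)
    return {pair: (c / n if n else 0.0) for pair, c in counts.items()}
```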
Experiment Setup: We consider groups of size . In each experiment, we obtain the average performance of our algorithms by averaging over 50 group trials. For each group trial we pick a group of size according to a specific selection criterion, and we compute the performance of the different heuristics for this group. In particular, we evaluate the performance in two different types of network, namely:
Symmetric Configurations: Users in the group have approximately the same expected number of users to meet among the group. Note that this is a relaxed requirement of symmetry with respect to the one used in Section 3 where all the users were assumed to meet with the exact same probability.
Asymmetric Configurations: Users in the group have different expected number of users to meet.
For each group, we define the Expectation Deviation (ED): the difference between the maximum and the minimum expected number of encountered users, among all users, i.e., let be the set of users belonging to group , then . A group with high ED is more likely to have an asymmetric structure, while a group with small ED is more likely to have a symmetric structure. Our selection criterion is therefore the following: (a) for asymmetric configurations, we choose groups that have , and (b) for symmetric configurations, we select groups that have , while having ; , and are decision parameters. For each experiment, these thresholds are set to values that ensure the existence of the required number of groups.
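The ED of a group can be computed as in the sketch below. It assumes that the expected number of group members encountered by a user is the sum of her pairwise meeting probabilities with the other members; the function names and the matrix layout (`p[i][j]` is the meeting probability of users `i` and `j`) are hypothetical.

```python
def expected_encounters(p, group):
    """Expected number of group members met by each user in the group.

    Assumption: for user i this is the sum of p[i][j] over the other
    members j of the group.
    """
    return {i: sum(p[i][j] for j in group if j != i) for i in group}


def expectation_deviation(p, group):
    """Difference between the largest and smallest expected encounter count."""
    expected = expected_encounters(p, group)
    return max(expected.values()) - min(expected.values())
```

A group in which one user meets everyone with high probability while the others rarely meet each other yields a large ED, whereas a group with equal pairwise probabilities yields an ED of zero.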
We consider different deadlines : the time period after which all users must individually have the whole set of files at their disposal. Intuitively, we expect that the longer the deadline is, the higher the number of sharing opportunities can be among the users within the same group and thus the smaller the average cost becomes. For each deadline, the duration of the whole experiment is divided into a number of deadline trials: for example, if the experiment is performed for a duration of days, and we consider a duration of hours in each day, then for a deadline of hours, we have deadline trials.
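The division of an experiment into deadline trials can be sketched as follows. The function name, the window representation, and the numbers in the usage note are hypothetical; the sketch assumes that trials are back-to-back windows within each day and never span two days.

```python
def deadline_trials(num_days, start_hour, end_hour, deadline_hours):
    """Enumerate deadline trials as (day_index, window_start, window_end).

    Assumption: each daily observation window [start_hour, end_hour) is cut
    into consecutive, non-overlapping windows of length deadline_hours, and
    any leftover time shorter than a full deadline is discarded.
    """
    trials = []
    for day in range(num_days):
        t = start_hour
        while t + deadline_hours <= end_hour:
            trials.append((day, t, t + deadline_hours))
            t += deadline_hours
    return trials
```

For instance, with illustrative values of 3 days, a daily window from hour 14 to hour 18, and a 2-hour deadline, `deadline_trials(3, 14, 18, 2)` yields 6 trials.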
7.1 MIT Reality Mining Dataset
We evaluate the performance of our proposed solutions and algorithms using the dataset from the MIT Reality Mining project . Table 7.1 lists the values of all the parameters that we use in our experiment, described in the following.
This dataset includes the traces from 104 subjects affiliated to MIT - 75 of whom were in the Media Laboratory - in the period from September 2004 to June 2005. All subjects were provided with Nokia 6600 smart phones used to collect information such as logs of the Bluetooth devices in proximity. In our experiment, we utilize this information to capture the sharing opportunities among users. Each device was programmed to perform a Bluetooth device discovery approximately every minutes, and store both the time instant at which the scan was performed, as well as the list of the MAC addresses of the devices found during the scan.
Assumptions: We say that two users are connected at a time instant, if there exists a scan (at that time instant) that was performed by any of the two users, in which the other user was found. We assume instantaneous sharing, i.e., if two users are connected at a time instant, then they can share their full cache contents. We justify this assumption in the following discussion. As specified in , Bluetooth discovery scans were performed and logged approximately every minutes. However, this granularity in time was not always attainable since (i) the devices were highly asynchronous, and (ii) some devices were powered off for a considerable amount of time. Because a non-negligible fraction of users experienced these irregularities, discarding their traces is not a suitable solution. Other solutions in the literature (for example, see ) utilize the IDs of the cell towers to which mobile devices are connected to infer proximity information. However, such approaches are too optimistic in assuming sharing opportunities, and hence are not suitable for our application. Our approach to deal with this highly irregular data was to consider the minimum sharing interval to be minutes, i.e., two users are connected for an entire sharing interval if they are so at any time instant in that specific interval. Using the standard Bluetooth wireless transmission speed, this time period is sufficient to share approximately GBs of data. Hence, for all practical purposes, it is reasonable to assume that any two connected users can share their full cache contents during that sharing interval.
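The mapping from raw scans to per-interval connectivity described above can be sketched as follows. The record layout `(timestamp, scanner_id, found_ids)` and the function name are hypothetical, and the interval length is left as a parameter expressed in the same time unit as the timestamps.

```python
from collections import defaultdict


def connected_pairs_per_interval(scans, interval_len):
    """Group Bluetooth scans into sharing intervals.

    scans: iterable of (timestamp, scanner_id, found_ids) records.
    Two users are marked as connected for a whole interval if either of
    them found the other in at least one scan within that interval.
    Returns a dict: interval index -> set of frozenset({u, v}) pairs.
    """
    connected = defaultdict(set)
    for timestamp, scanner, found in scans:
        idx = int(timestamp // interval_len)
        for other in found:
            if other != scanner:
                # frozenset makes the connection symmetric in the two users
                connected[idx].add(frozenset((scanner, other)))
    return dict(connected)
```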
For indirect sharing, we do not allow intra-interval relaying: users cannot indirectly share with other users within the same interval. We do, however, allow inter-interval relaying: indirect sharing can be performed across successive intervals. Our premise is that, while a -minute sharing interval is sufficient for one full cache content sharing, it might not be long enough to ensure more than one successful data exchange. This approach might severely limit the performance, i.e., a lower cost could be achieved by allowing intra-interval relaying.
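The relaying rule can be illustrated with the sketch below, in which caches are modeled as sets of chunk identifiers. This modeling choice, the function names, and the input layout are assumptions made for illustration. Exchanges within an interval use the holdings at the start of that interval, so data received in an interval can be forwarded only in a later one.

```python
def share_over_intervals(initial_cache, intervals, indirect):
    """Propagate cached chunks over successive sharing intervals.

    initial_cache: dict user -> set of chunk ids downloaded before sharing.
    intervals:     list of sets of frozenset({u, v}) connected pairs.
    indirect:      if True, a user offers everything she held at the start
                   of the interval; if False (direct sharing), she offers
                   only what she personally downloaded.
    Returns a dict user -> set of chunk ids held after the last interval.
    """
    holdings = {u: set(c) for u, c in initial_cache.items()}
    for pairs in intervals:
        # Snapshot at interval start: this forbids intra-interval relaying.
        snapshot = {u: set(h) for u, h in holdings.items()}
        for pair in pairs:
            u, v = tuple(pair)
            offer_u = snapshot[u] if indirect else initial_cache[u]
            offer_v = snapshot[v] if indirect else initial_cache[v]
            holdings[v] |= offer_u
            holdings[u] |= offer_v
    return holdings


def missing_after_sharing(holdings, all_chunks):
    """Chunks each user still has to download in the post-sharing phase."""
    return {u: set(all_chunks) - h for u, h in holdings.items()}
```

In a chain where a meets b and b meets c within the same interval, c receives only b's own chunks in that interval; a's chunks can reach c through b only if b and c meet again in a later interval and indirect sharing is enabled.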
Setup: We consider a period of three months from the academic year 2004/2005 in MIT, namely from October to December. We consider traces of only 75 users - labeled as affiliated to the Media Laboratory - during Monday, Tuesday and Wednesday. The reason for choosing these particular days is that we observed that, across the time period of interest, meetings occur most frequently in these days; thus, this represents a suitable period to assess the performance of all the solutions under consideration. We perform each experiment from 2 pm to 6 pm, and we consider deadlines of hours. The thresholds for choosing groups are , and . The reason behind this particular choice was to ensure the existence of groups of users in the duration of the experiment.
Experimental Results: Figure 4 shows the performance of different network structures (i.e., asymmetric and symmetric) for the direct and indirect sharing cases. From Figure 4, as expected, we observe that: (i) the average total cost decreases as the deadline increases; (ii) the average total cost in the indirect sharing case is less than the one in the direct case, thanks to a higher number of sharing opportunities; (iii) using as a caching strategy performs worst among all schemes. This is because the scheme, differently from the other strategies, is not based on the meeting probabilities of the users.
Asymmetric Networks: Figure 4(a) and Figure 4(c) show the performance over asymmetric networks for the direct and indirect sharing cases, respectively. We note the following:
Target-Set performs very close to the optimal scheme in both the direct and the indirect sharing cases. This is due to the asymmetric structure of the selected groups: one node is more likely to be connected to the other members of the group, and therefore the optimal solution would rely on that node to deliver the data to the whole group.
AlgCov outperforms IAD in Figure 4(a), which indicates that AlgCov utilizes the solution that is generated from PSC. In contrast, AlgCov and the IAD strategy perform almost the same in Figure 4(c), which indicates that IAD outperforms PSC in this case. This justifies merging these two heuristics in the design of AlgCov.
Symmetric Networks: Figure 4(b) and Figure 4(d) show the performance of the different schemes over symmetric networks for the direct and the indirect sharing cases, respectively. Observations are similar to those drawn for the asymmetric case. However, one major observation is that Target-Set, differently from the asymmetric case, performs poorly. This is a direct consequence of the symmetric structure of the selected group: in a symmetric group, an optimal sharing strategy would equally distribute the caching and sharing efforts among all members within the group; in contrast, Target-Set selects only one member who has the task of caching and sharing the data for the group.
One might argue that CopCash has an inherent advantage over the other caching strategies since it does not need the genie-aided information of the pairwise meeting probabilities. However, this information is not hard to obtain in a realistic scenario. For example, although it is outside the scope of this work, one could modify AlgCov by including a learning module. With this and by exploiting the regular mobility behavior of the users, the probabilities can be estimated as reportedly done in the literature (see [15, 27]).
CopCash performs closely to our proposed solution. One can thus draw a premature conclusion that pre-caching does not bring significant benefits with respect to opportunistically exploiting sharing opportunities, as CopCash does. This is true when the meeting probabilities are small, as in the MIT Reality Mining dataset. However, as shown next, pre-caching solutions outperform opportunistic sharing approaches when the users are moderately/highly connected.
7.2 SWIM-Based Results
We here evaluate the performance of our algorithms over mobility traces synthesized using the SWIM model. SWIM is a human mobility model that is used to synthesize mobility traces based on the users' social behavior. Traces are generated in the form of events: the exact time at which two users meet/leave. Thus, the trace files consist of a chronological series of meeting/leaving events among the users involved in the generation of the trace. We use a synthesized version of two existing traces, namely Infocom-2005 and Cambridge-2006. These traces were obtained through experiments conducted in the IEEE INFOCOM 2005 conference and in Cambridge in 2006, respectively (see [19, 20] for more details). The synthesized versions of these traces include a greater number of nodes (with the same spatial density) than the original ones, which is the main reason behind our choice of the synthesized traces.
Assumptions: We consider the sharing interval to be minutes. We say that two users successfully exchange their cache contents if they are in contact for at least of the interval. Similarly to Section 7.1, in the indirect sharing we only allow inter-interval relaying.
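The successful-exchange rule can be sketched as follows. The event layout `(time, user_a, user_b, kind)`, the function names, and the treatment of contacts still open at the end of the trace (they are ignored) are assumptions made for illustration; the interval length and the minimum contact fraction are left as parameters.

```python
from collections import defaultdict


def _add_overlap(contact_time, pair, start, end, interval_len):
    """Split one contact [start, end) across the intervals it overlaps."""
    idx = int(start // interval_len)
    while idx * interval_len < end:
        lo = max(start, idx * interval_len)
        hi = min(end, (idx + 1) * interval_len)
        contact_time[(idx, pair)] += hi - lo
        idx += 1


def successful_exchanges(events, interval_len, min_fraction):
    """Find the pairs that can exchange their caches in each interval.

    events: chronological list of (time, user_a, user_b, kind), with kind
            equal to "meet" or "leave".
    A pair succeeds in an interval if its total contact time within that
    interval is at least min_fraction * interval_len.
    Returns a dict: interval index -> set of frozenset({u, v}) pairs.
    """
    open_since = {}
    contact_time = defaultdict(float)
    for time, a, b, kind in events:
        pair = frozenset((a, b))
        if kind == "meet":
            open_since.setdefault(pair, time)
        elif kind == "leave" and pair in open_since:
            _add_overlap(contact_time, pair, open_since.pop(pair), time, interval_len)
    result = defaultdict(set)
    for (idx, pair), duration in contact_time.items():
        if duration >= min_fraction * interval_len:
            result[idx].add(pair)
    return dict(result)
```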
Setup: We perform each experiment over the traces from virtual users during the entire duration of the trace ( days for Infocom-2005 and days for Cambridge-2006). The deadlines that we consider are of hour, hour and hours. The thresholds for choosing groups are , and . The reason behind this particular choice was to ensure the existence of groups of users for all the days of the experiment. Table 7.1 lists the values of all the parameters of the experiments.
Experimental Results: We assess the performance of our algorithms on the Infocom-2005 (Figure 5) and Cambridge-2006 (Figure 6) mobility traces. Similar conclusions to those in Section 7.1 can be drawn. In particular: (i) the average total cost decreases as the deadline increases; (ii) the average total cost incurred in the indirect sharing case is less than the one in the direct counterpart; (iii) the caching strategy shows the worst performance among the different schemes; (iv) Target-Set performs close to the optimal in asymmetric configurations. However, differently from Section 7.1, in most of the cases CopCash performs poorly with respect to the other solutions. The reason is that the mobility traces of both Infocom-2005 and Cambridge-2006 show a relatively high frequency of meetings among users, which distinguishes them from the MIT Reality Mining dataset.
In this paper, we motivated, proposed, analysed, and experimentally evaluated AlgCov, a simple low-complexity algorithm for social caching that uses pre-caching in anticipation of encounter opportunities to minimize the required download bandwidth. We derived formal LP formulations and presented a worst-case analytical performance gap. We numerically evaluated the performance of the proposed solutions on (i) the mobility traces obtained from the MIT Reality Mining dataset, and (ii) two mobility traces that were synthesized using the SWIM mobility model. AlgCov achieves a performance which is close to the optimal and, in some configurations, it outperforms existing solutions, such as Target-Set. AlgCov makes the case that, even in the presence of random encounters, using simple algorithms for pre-caching can significantly reduce bandwidth usage.
Appendix A Proof of Theorem 3.1
The key observation is that the constraints in the LP in (1) can be written in the form , where . Since all the constraints of the type in the LP in (1) can be replaced with , the optimal solution would make all equal, as proved in Lemma A.1.
Let (,) be an optimal solution for the LP in (1). Then . Thus,
We next prove the result in Lemma A.1. Without loss of generality, assume that and where . Then, since this is a feasible point, can be driven down to zero without violating the feasibility conditions, and consequently reducing the optimal value of the objective function; thus, we have a contradiction. The same argument can be extended to the case where more than one is different from .
Notice that, by our definition in Theorem 3.1, we have .
We now use the result in Lemma A.1 to prove Theorem 3.1, i.e., the equivalence of the LPs in (1) and in (2).
Part 1. Let be an optimal solution for the LP in (1), which follows the structure described in Lemma A.1. For , let be the value where, for each , . Then, one can construct a feasible solution for the LP in (2) as follows: (i) set ; (ii) let be an element of that corresponds to a constraint of the form in the LP in (2), then set . By doing so, the constraints of the LP in (2) are satisfied. Moreover, with Lemma A.1, the objective functions of both problems are equal, when evaluated at the described points.
Part 2. Let be an optimal solution for the LP in (2). Then one can construct a feasible solution for the LP in (1) as follows: (i) set ; (ii) , set . By doing so, the constraints of the LP in (1) are guaranteed to be satisfied. Moreover, with Lemma A.1, the objective functions of both problems will be equal, when evaluated at the described points. This concludes the proof.
Appendix B Proof of Theorem 3.2
We now prove Lemma B.1 in three steps.
Step 1. We prove that and is a feasible solution for the LP in (1). Assume a feasible solution consists of . Then, with so that , we get a feasible solution of the required form.
Step 2. Assume that an optimal solution has . We prove, by contradiction, that this implies . We use similar steps as in the proof of Lemma A.1. Without loss of generality, assume that and , where . Since this point is feasible, can be driven down to zero without violating the feasibility conditions; this operation (i.e., setting ) also implies a reduction in the optimal value of the objective function; thus we have a contradiction.
Step 3. We prove that, for the symmetric model, an optimal solution of the form and exists. Without loss of generality, assume that is an optimal solution of the form , where (this assumption is not necessary and is made only to simplify the analysis). We show that gives a smaller value for the objective function of the LP in (1), i.e., . We start by noticing that, by using the symmetric model in in (1), we get
where the last equality follows by noticing that each of the is equal to a term of the type for some and by swapping the order of the summations.
We next evaluate and separately.
Evaluation of : We define