A Learning-Based Approach to Caching in Heterogenous Small Cell Networks
Abstract
A heterogenous network with base stations (BSs), small base stations (SBSs) and users distributed according to independent Poisson point processes is considered. SBS nodes are assumed to possess high storage capacity and to form a distributed caching network. Popular files are stored in local caches of SBSs, so that a user can download the desired files from one of the SBSs in its vicinity. The offloading loss is captured via a cost function that depends on the random caching strategy proposed here. The popularity profile of cached content is unknown and estimated using instantaneous demands from users within a specified time interval. An estimate of the cost function is obtained from which an optimal random caching strategy is devised. The training time to achieve an difference between the achieved and optimal costs is finite provided the user density is greater than a predefined threshold, and scales as , where is the support of the popularity profile. A transfer learning-based approach to improve this estimate is proposed. The training time is reduced when the popularity profile is modeled using a parametric family of distributions; the delay is independent of and scales linearly with the dimension of the distribution parameter.
I Introduction
The advent of multimedia-capable devices at economical costs has triggered the growth of wireless data traffic at an unprecedented rate. This trend is likely to continue, requiring wireless service providers to reevaluate design strategies for the next generation wireless infrastructure [Furuskar2015]. A promising approach to address this problem is to deploy small cells that can offload a significant amount of data from a macro base station (BS) [Chou2014]. Doing so is expected to lead to cost-effective integration of the existing WiFi and cellular technologies with improved performance of peak data traffic steering policies [Bennis2013]. However, a potential shortcoming of the small cell infrastructure is that, during peak traffic hours, the backhaul link-capacity requirement to support data traffic is enormously high [Kim2014]. Also, the cost incurred in deploying a high capacity backbone network for small cells can be quite high. Therefore, small cell-based solutions alone will not suffice to efficiently meet the quality of service requirements associated with peak traffic demands.
A noteworthy development in this direction is to improve the accessibility of data content to users by storing the most popular data files in the local caches (intermediate servers such as gateways, routers, etc.) of small cell BSs, with the objective of reducing the peak traffic rates. This is commonly referred to as “caching” and has attracted significant attention [Lin2003]-[Niesen2012]. In the next subsection we mention a few references, which, although by no means exhaustive, fairly indicate the scope and trend of research on caching.
I-A Literature review on caching
Caching has received considerable attention in the wireless communications literature. In [Hu2003], a two-layer hierarchical strategy termed New Snoop was proposed to cache the unacknowledged packets from mobiles and BSs to significantly enhance TCP performance. In [Wang2014], a technique based on the concept of content-centric networking was devised for caching in 5G networks, while in [Golrezaei2014] caching of video files was proposed by exploiting the redundancy of user requests and storage capacity of mobile devices with a priori knowledge of the locations of devices. In [Lee2014], the effects of cache size and cached-data popularity on a data access scheme were studied to mitigate the traffic load over the wireless channel. In [Poularakis2014], inner and outer bounds were proposed for the joint routing and caching problem in small cell networks, while in [Fang2014] in-network caching was proposed for an information-centric networking architecture for faster content distribution in an energy-efficient manner. In-network caching was employed in [Asaeda2015] for content-centric networks using a tool called “contrace” for monitoring and operating the network. The tradeoff between the performance gain of coded caching and delivery delay in video streaming was characterized in [Pedarsani2014]. A polynomial-time heuristic solution was proposed in [Guan2014] to address the NP-hard optimization problem of maximizing the caching utility of mobile users.
Caching has also made advances in device-to-device (D2D) communications. In [Pyattaev2015], a practical method was devised for data caching and content distribution in D2D networks to enhance assisted communications between proximate nodes. In [Ji2013], the outage-throughput tradeoff was characterized for D2D nodes, which obtained the desired file from nodes that had that file in their caches. In [Golrezaei2012], the conflict between collaboration distance and interference was identified among D2D nodes to maximize frequency reuse by exploiting distributed storage of cached content. In [Ji2013a], coded caching was shown to achieve multicast gain in a D2D network, where users had access to linear combinations of packets from cached files. In [Ji2014], the throughput scaling laws of random caching, where users with pre-cached information made arbitrary requests for cached files, were studied. New caching mechanisms developed by modeling the network as independent Poisson point processes (PPPs) with full knowledge of the popularity profile can be found in [Bacstuug2015a]-[TamoorulHassan2015], while the most recent results on caching in D2D networks and video content delivery are reported in [Zhang2015] and [Li2015].
Caching has been addressed from an information-theoretic viewpoint as well. In [MaddahAli2014], it was shown that when cached-content demand is uniformly distributed, joint optimization of caching and coded multicast delivery significantly improves the gains; this setup was extended to the case of non-uniform distributions on demand and to a decentralized setting in [Niesen2014a] and [MaddahAli2014a], respectively. In [Karamchandani2014], coded caching was achieved for content delivery networks with two layers of caches.
I-B Main contributions of this paper
In the aforementioned references, the popularity profile of data files was assumed to be known perfectly. In practice, such an assumption cannot be reasonably justified; this was clearly highlighted in [Blasco2014]-[Bacstuug2015], where various learning-based approaches were employed to estimate the popularity profile. On the other hand, estimation procedures result in computational overhead, especially in data-intensive real-time multimedia applications. Therefore, given the increasing demand for improving the quality of service for the end users, establishing the theoretical underpinnings of learning-based caching strategies is a topical research problem, and is the main subject of this paper.
In this work, we relax the assumption of a priori knowledge of the popularity profile to devise a caching strategy. We consider a heterogenous network where the users, BSs and small base stations (SBSs) are assumed to be distributed according to PPPs. Each SBS is assumed to employ a random caching strategy with no caching at the user terminal (see [Ji2013]). A protocol model for communications is proposed, from which a cost that captures the backhaul link overhead, and that depends on the popularity profile, is derived. Assuming a Poisson request model, a centralized approach is presented in which the BS computes an estimate of the popularity profile based on the requests observed during the time interval ; this estimate is then used in the cost function to optimize the caching probability. Thus, the actual cost incurred differs from the optimal cost, and this difference depends on the number of samples used to estimate the popularity profile. Further, the number of samples collected at the BS depends on the density of the Poisson arrival process and the training time during which the samples are collected. A lower bound on this training time is derived that guarantees a cost that is within of the optimal cost. The results are improved using a transfer learning (TL)-based approach wherein samples from other domains, such as those obtained from a social network, are used to improve the estimation accuracy; the minimum number of source domain samples required to achieve better performance is derived. Finally, we model the popularity profile using a parametric family of distributions (specifically, the Zipf distribution [Llorca2013]) to analyze the benefits offered.
The following are the main findings of our study:

The training time is finite, provided the user density is greater than a predefined threshold.

scales as , where is the total number of cached data files in the system.

Employing the TL-based approach, a finite training time can be achieved for all user densities. In this case, the training time is a function of the “distance” between the probability distribution of the files requested and that of the source domain samples (the notion of distance will be made precise in the proof of Theorem LABEL:thm:time_complexity_centralized_TL).

When the popularity profile is modeled using a parametric family of distributions, the bound on the training time is independent of , and scales only linearly with the dimension of the distribution parameter, leading to a significant improvement in the performance compared to its non-parametric counterpart.
The problem of periodic caching without the knowledge of the popularity profile, but with access to the demand history, was addressed in [Blasco2014] and [Blasco2014a]; however, the model and objective function considered in our work are different from those presented therein. Learning-based approaches to estimate the popularity profile for devising caching mechanisms have also been reported in [Sengupta2014]-[Bacstuug2015], while caching in femtocell networks without prior knowledge of the popularity distribution was considered in [Golrezaei2013], where it was shown that distributed caching was NP-hard and approximation algorithms were proposed for video content delivery. We would like to emphasize that the central focus of this paper is not on deriving new caching mechanisms. Our main contribution is the theoretical analysis of the implications of learning the popularity profile on the training time to achieve an offloading loss which is close to that of the optimal policy. To the best of our knowledge, this is the first instance where an analytical treatment of training time and its relation to the probability distribution function of source domain samples has been reported in the literature on caching. Some preliminary aspects of this work can be found in [Bharath2015].
In Section II, we present the system model followed by the main problem addressed in the paper. The two methods for estimating the popularity profile and its corresponding training time analysis are developed in Section III. The training time analysis when the popularity profile is modeled as a parametric family of distributions is presented in Section LABEL:sec:param_pac_bound. Numerical results are reported in Section LABEL:sec:sims. Concluding remarks are provided in Section LABEL:sec:conclude. The proofs of the theorems are relegated to appendices.
II System Model and Problem Statement
In this section, we present the system model followed by the main problem addressed in the paper. The notation used in the rest of the paper is as follows: , and (, and ) denote the points (densities) corresponding to the user, SBSs and BS, respectively; denotes the number of requests in by the user at ; denotes the request of the user ; is the average number of requests per unit time. A heterogenous cellular network is considered where the set of users, the set of BSs, and the set of SBSs are distributed according to independent PPPs with density , and , respectively, in the two-dimensional space [Baccelli1997]. Each user independently requests a data file of size bits from the set ; the popularity of data files is specified by the distribution , where and is assumed to be stationary across time. In a typical heterogenous cellular network, the BS fetches a file using its backhaul link to serve a user. During peak data traffic hours, this results in an information bottleneck both at the BS as well as in its backhaul link. To alleviate this problem, caching the most popular files (either at the user nodes or at SBSs) is proposed. The requested file will be served directly by one of the neighboring SBSs depending on the availability of the file in its local cache. The performance of caching depends on the density of SBS nodes, cache size, users’ request rate, and the caching strategy. It is assumed that the SBS can cache up to files, each of length bits. Each SBS in caches its content in an independent and identically distributed (i.i.d.) fashion by generating indices distributed according to , (see [Ji2013]). One way of generating this is to roll an sided die times in an i.i.d. fashion, where the outcomes correspond to the index of the file to be cached. Although this approach is suboptimal, it is mathematically tractable and the corresponding time complexity serves as a lower bound, albeit pessimistic, for optimal strategies.
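The die-rolling description above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `sample_cache`, `caching_probs` and `cache_size` are hypothetical, and the caching distribution is passed in as a plain list of probabilities indexed by file.

```python
import random


def sample_cache(caching_probs, cache_size, rng=random):
    """Draw cache_size file indices i.i.d. from caching_probs.

    Each draw plays the role of one roll of a weighted die whose faces
    are the file indices. Names and signature are illustrative only.
    """
    files = range(len(caching_probs))
    return rng.choices(files, weights=caching_probs, k=cache_size)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical three-file catalogue and a cache that holds four slots.
    cache = sample_cache([0.5, 0.3, 0.2], cache_size=4, rng=rng)
    print("cached indices:", cache, "distinct files:", sorted(set(cache)))
```

Because the draws are made with replacement, the same file may occupy more than one slot, which is one way to see why the strategy is described as suboptimal yet tractable.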
We now present a simple communications protocol to determine the set of neighboring SBS nodes for any user in . Essentially, we let each SBS at location communicate with a user at location if , (); this condition determines the communication radius. In this protocol, we have ignored the interference constraint. The set of neighbors of the user at location is denoted
(1) 
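As a rough illustration of the protocol model, the sketch below samples SBS locations from a homogeneous PPP on a square window and collects the SBSs within a communication radius of a user. The function names, the square window, and the strict inequality in the distance test are assumptions made for the example; they are not taken from the definition in (1).

```python
import math
import random


def sample_ppp(density, side, rng):
    """Sample a homogeneous PPP of the given density on [-side/2, side/2]^2."""
    mean = density * side * side
    # Poisson count: number of unit-rate exponential arrivals before `mean`.
    count = 0
    t = rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    half = side / 2.0
    # Given the count, the points are i.i.d. uniform over the window.
    return [(rng.uniform(-half, half), rng.uniform(-half, half))
            for _ in range(count)]


def neighbors(user, sbs_points, radius):
    """Return the SBS locations closer than `radius` to the user (assumed rule)."""
    ux, uy = user
    return [s for s in sbs_points
            if math.hypot(s[0] - ux, s[1] - uy) < radius]


if __name__ == "__main__":
    rng = random.Random(1)
    sbs = sample_ppp(density=2.0, side=3.0, rng=rng)
    print(len(sbs), "SBSs;", len(neighbors((0.0, 0.0), sbs, 0.5)), "neighbors")
```

Interference is ignored here, in line with the protocol described above.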
II-A The main problem addressed in this paper
The user located at requests a data file from the set , with the popularity profile chosen from the probability distribution function . The requested file will be served directly by a neighboring SBS at location depending on the availability of the file in its local cache, and following the protocol described in the previous paragraph. The problem of caching involves minimizing the time overhead incurred due to the unavailability of the requested file. Without loss of generality and for ease of analysis, we focus on the performance of a typical user located at the origin, denoted by . The unavailability of the requested file from a user located at is given by
(2) 
where is as defined in (1), is the rate supported by the BS to the user, and is the time overhead incurred in transmitting the file from the BS to the user. Further, we use to denote the event that the file is not stored in any of the SBSs in . The expectation is with respect to , and . The indicator function is equal to one if the event occurs, and zero otherwise. We refer to as the “offloading loss”, which we seek to minimize:
(3)  
where , for . To solve the optimization problem (3), we need an analytical expression for which is provided in the following theorem.
Theorem 1
For the caching strategy proposed in this paper, the average offloading loss is given by
(4) 

See Appendix LABEL:app:througput_derivation.
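To give a feel for the quantity involved, the following sketch computes the probability that a requested file is missing from every neighboring SBS under the model described above: SBSs form a PPP, each caches a fixed number of indices drawn i.i.d. from the caching distribution, and neighbors are the SBSs inside a disc around the user. It relies on the standard thinning property of PPPs and is not claimed to reproduce the expression in (4); the time-overhead factor is left out, and all names (`miss_probability`, `sbs_density`, `radius`, and so on) are hypothetical.

```python
import math


def miss_probability(file_index, caching_probs, cache_size, sbs_density, radius):
    """Probability that no SBS within `radius` of the user caches the file."""
    # One SBS misses the file if all of its i.i.d. draws pick other files.
    absent_one_sbs = (1.0 - caching_probs[file_index]) ** cache_size
    # Mean number of SBSs in the disc around the user.
    mean_neighbors = sbs_density * math.pi * radius ** 2
    # Thinning: SBSs holding the file form a PPP whose density is scaled by
    # (1 - absent_one_sbs); the file is missed if that PPP has no point
    # inside the disc.
    return math.exp(-mean_neighbors * (1.0 - absent_one_sbs))


def expected_miss(popularity, caching_probs, cache_size, sbs_density, radius):
    """Average the miss probability over the request (popularity) distribution."""
    return sum(
        p * miss_probability(i, caching_probs, cache_size, sbs_density, radius)
        for i, p in enumerate(popularity)
    )


if __name__ == "__main__":
    popularity = [0.6, 0.3, 0.1]  # hypothetical request distribution
    print(expected_miss(popularity, popularity, cache_size=2,
                        sbs_density=1.0, radius=1.0))
```

The sum is separable across files, which matches the remark in the text that the optimization over the caching distribution is separable.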
We note that solving the optimization problem posed in (3) is not the main focus of this paper. We assume that there exists a method to solve the problem posed in (3), and instead focus on analyzing the training time required to obtain a good estimate of the popularity profile that results in an offloading loss that is within of the optimal offloading loss. Interestingly, although the problem in (3) is non-convex, since it is separable, a bound on the duality gap can be obtained with respect to the solution derived using the Karush-Kuhn-Tucker conditions.
In practice, the popularity profile is generally unknown and has to be estimated. Denoting the estimated popularity profile by , and the corresponding offloading loss by , (3) becomes
(5)
with , for . Naturally, the solution to (5) differs from that of the original problem (3). Let and denote the optimal solutions to the problems in (3) and (5), respectively, and let the throughput achieved using be denoted . The central theme of this paper is the analysis of the offloading loss difference, i.e., , where is the minimum offloading loss incurred with perfect knowledge of the popularity profile . Theorems LABEL:thm:time_complexity_centralized  LABEL:thm:pac_param_pop_distr are devoted to this analysis.
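One simple way to picture the estimation step is to use the empirical frequency of each file among the observed requests and to measure how far that estimate sits from the true profile. The sketch below does exactly that. It assumes an empirical-frequency estimator and a total variation distance purely for illustration; the estimator and the notion of distance actually analyzed in the paper may differ, and the function names are hypothetical.

```python
from collections import Counter


def estimate_popularity(requests, num_files):
    """Empirical frequency of each file index among the observed requests."""
    total = len(requests)
    if total == 0:
        # No samples yet: fall back to a uniform guess (an assumption).
        return [1.0 / num_files] * num_files
    counts = Counter(requests)
    return [counts[i] / total for i in range(num_files)]


def total_variation(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    observed = [0, 0, 1, 2]  # hypothetical request log (file indices)
    estimate = estimate_popularity(observed, num_files=3)
    print(estimate, total_variation(estimate, [0.6, 0.3, 0.1]))
```

With more requests collected during training, the empirical estimate concentrates around the true profile, which is the intuition behind relating training time to the achievable loss gap.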
III Estimating the popularity profile
In this section, we present two methods for estimating the popularity profile and provide the corresponding training time analyses. The efficiency of the estimate of the popularity profile depends on the number of available data samples, which in turn is related to the number of requests made by the users. We first obtain an expression for the estimate of the popularity profile. We then study, in Section LABEL:subsec:waitingtime_lowerbound, the minimum training time in obtaining the samples to achieve a desired estimation accuracy . Finally, in Section LABEL:subsec:transfer_learning, we employ the TL-based approach to improve the bound on the training time. We begin with the definition of the request model.