Energy Cooperation for Sustainable Base Station Deployments: Principles and Algorithms
Energy Sustainable Mobile Networks via Energy Routing, Learning and Foresighted Optimization
Abstract
The design of self-sustainable base station (BS) deployments is addressed in this paper: BSs have energy harvesting and storage capabilities, and they can use ambient energy to serve the local traffic or store it for later use. A dedicated power packet grid allows energy transfer across BSs, compensating for imbalance in the harvested energy or in the traffic load. Some BSs are off-grid, i.e., they can only use the locally harvested energy and that transferred from other BSs, whereas others are on-grid, i.e., they can also purchase energy from the power grid. Within this setup, an optimization problem is formulated where: energy harvesting and traffic processes are estimated at the BSs through Gaussian Processes (GPs), and a Model Predictive Control (MPC) framework is devised for the computation of energy allocation and transfer schedules. Numerical results, obtained using real energy harvesting and traffic profiles, show substantial improvements in terms of energy self-sustainability of the system, outage probability (zero in most cases), and in the amount of energy purchased from the power grid, which is more than halved with respect to the case where the optimization does not consider GP forecasting and MPC.
I Introduction
The massive use of Information and Communications Technologies (ICT) is increasing the amount of energy drained by the telecommunication infrastructure and its footprint on the environment. Forecasts for 2030 indicate that a sizable share of the global electricity consumption and of the carbon footprint of human activity will be due to ICT [andrae2015global]. As such, energy efficiency and self-sufficiency are becoming key considerations for any development in the ICT sector.
In this paper, we advocate future networks where small Base Stations (BSs) are densely deployed to offer coverage and high data rates, and energy harvesting hardware (e.g., solar panels and energy storage units) is also installed to power them [zordan2015telecommunications]. These BSs collect energy from the environment, use it to serve their local traffic and transfer it to other BSs to compensate for imbalance in the harvested energy or in the traffic load. Some of the base stations are connected to the power grid (referred to as on-grid), whereas the others are off-grid and, as such, rely either on the locally harvested energy or on the energy transferred from other BSs. Since the BSs have a local energy storage, they can accumulate energy when the harvested inflow is abundant. Moreover, some of the surplus energy can be transferred to other BSs to ensure the self-sustainability of the cellular system. Energy transfer is a prime feature of these networks and can be accomplished in two ways: i) through Wireless Power Transfer (WPT) or ii) using a Power Packet Grid (PPG) [sugiyama2012packet]. For i), previous studies [Bonati2017] have shown that its transfer efficiency is too low for it to be a viable solution when distances exceed a few meters, but ii) looks promising. In analogy with communication networks, in a PPG a number of power sources and power consumers exchange power (Direct Current, DC) in the form of “packets”, which flow from sources to consumers thanks to power lines and electronic switches. The energy routing process is controlled by a special entity called the energy router [krylov2010toward]. Following this architecture, a local area packetized power network consisting of a group of energy subscribers and a core energy router is presented in [ma2017optimal], where a strategy to match energy suppliers and consumers is devised.
Within this setting, in this paper the allocation and distribution of energy is performed through the PPG infrastructure, where a centralized energy router is responsible for deciding the power allocation/transfer among BSs over time, referred to here as energy routing. This energy allocation and transfer problem is solved by combining Gaussian Processes (GPs), Model Predictive Control (MPC) and Convex Optimization (CO). GPs are utilized to learn the Energy Harvesting (EH) and consumption patterns, which are then fed into MPC and CO techniques to obtain energy distribution schedules across subsequent time slots, solving a finite horizon optimization problem. This framework is designed for online use and combines learning and foresighted optimization. Numerical results, obtained with real-world harvested energy traces and traffic load patterns, show that the proposed approach effectively keeps the outage probability^{1}^{1}1Computed as the ratio between the number of BSs that are unable to serve the users within range due to energy scarcity, and the total number of BSs. to nearly zero for a wide range of traffic loads and system configurations. Also, the amount of energy purchased from the power grid to operate the mobile network is more than halved with respect to the case where energy schedules are computed solely based on the current network status, i.e., disregarding future energy arrivals and load conditions. The proposed approach extends our previous work in [Gambin2017], adding online learning features and foresighted optimization (via MPC), whose combination is here shown to lead to substantial improvements.
The paper is organized as follows. In Section II, the literature on energy cooperation and the mathematical tools used in this work are presented. Section III describes the network scenario. The overall optimization framework for online energy management is explained in Section IV, where the proposed solutions are also detailed. Routing and scheduling policies are addressed in Section IV-E. The numerical results are presented in Section V, whereas final remarks are given in Section VI.
II Related Work
In this section, we first survey the main literature dealing with energy transfer in mobile networks, and then delve into the description of the mathematical tools that we consider in this paper, discussing their successful use within diverse application domains.
Energy transfer in mobile cellular networks: the concept of energy transfer, also referred to as energy cooperation [chia2014energy, Gurakan2013, Xu2015] or energy exchange [leithon2014energy], is motivated by the fact that the distributed renewable energy generated at the base stations can be leveraged through a microgrid connecting the BSs [farooq2017hybrid], with the aim of improving the network self-sustainability, while reducing the cost entailed in purchasing energy from the main power grid. Since this is a rather new paradigm, only a few works dealing with this problem have been published so far. Energy sharing among BSs is investigated in [Gurakan2013] through the analysis of several multi-user network structures. A two-dimensional, directional water-filling-based offline algorithm is proposed to control the harvested energy flows in time and space, with the objective of maximizing the system throughput. In [Xu2015], the authors introduce a new entity called the aggregator, which mediates between the grid operator and a group of BSs to redistribute the energy flows, reusing the existing power grid infrastructure: one BS injects power into the aggregator and, simultaneously, another one draws power from it. This scheme does not consider the presence of energy storage devices and, for this reason, some of the harvested energy can be lost if none of the base stations needs it at a certain time instant. The proposed algorithm jointly optimizes the transmit power allocations and the transferred energy, maximizing the sum-rate throughput for all the users. The authors of [huang2017smart] consider BSs with energy harvesting capabilities connected to the power grid as a means to carry out the energy trading. A joint optimization tackling BS operation and power distribution is performed to minimize the on-grid power consumption of the BSs.
Wired energy transfer to/from the power distribution network, and a user-BS association scheme based on cell zooming are investigated. The problem is split into two subproblems, which are solved using heuristics. A similar approach is considered in [han2013optimizing], where two problems are solved: the first consists of optimizing the energy allocation at individual BSs to accommodate the temporal dynamics of harvested energy and mobile traffic. Considering the spatial diversity of mobile traffic patterns, the second problem is to balance the energy consumption among BSs by adapting the cell size (radio coverage), so as to reduce the on-grid energy consumption of the cellular network. Again, the solutions are obtained through heuristic algorithms. Also, base stations cooperate toward the reduction of energy costs, but do not perform any actual energy transfer among them.
A two-cell renewable-energy-powered system is studied in [guo2013base], by maximizing the sum-rate over all users while determining the direction and amount of energy to be transferred between the two BSs. Energy can be transferred across the network either through power lines or wireless transfer, and the energy transfer efficiency is taken into account. This resource allocation problem is formulated under a Frequency Division Multiple Access (FDMA) setup and is solved numerically. A low-complexity heuristic approach is also proposed as a practical near-optimal strategy when the transfer efficiency is sufficiently high and the channel gains are similar for all users.
Along the same lines, a two-BS scenario is considered in [chia2014energy], where BSs have hybrid conventional (power grid) and renewable energy sources, limited energy storage capability, and are connected through power lines. The authors study the case where the renewable energy and energy demand profiles are deterministic or known ahead of time, and find the optimal energy cooperation policy by solving a linear program. They then consider a more realistic case where the profiles are stochastic and propose an online greedy algorithm. Finally, an intermediate scenario is addressed, where the energy profiles are obtained from a deterministic pattern by adding a small amount of random noise at each time step (to model prediction errors). Simulation results are shown for several (online) algorithms, assessing the impact of knowing the energy pattern profiles in advance.
The authors of [farooq2017hybrid] and [farooq2016energy] consider an energy sharing framework for cellular networks that are powered by power grids and possess renewable energy generation capabilities. Energy sharing takes place via physical power lines, as well as through the power grid for virtual energy transportation. Interestingly, the authors investigate the impact of the power line infrastructure topology: agglomerative and divisive hierarchical clustering algorithms are utilized to determine it. Upon establishing the physical connections among BSs, an optimization framework for day-to-day cost optimization is developed for the cases of 1) zero knowledge, 2) perfect knowledge, and 3) partial future knowledge of the renewable energy generation.
An optimal energy management strategy to minimize the energy cost incurred by a set of cellular base stations is presented in [leithon2014energy]. There, base stations can exchange energy with the power grid and are equipped with batteries (energy storage) and renewable energy harvesting devices. Simulation results show that a cost reduction can be achieved by increasing the battery capacity of each BS and/or the number of base stations.
On combining pattern learning with multi-step optimization techniques: next, we briefly review the mathematical tools that we use in the present paper, namely, MPC and GPs, touching upon the various application domains where they have been used. MPC has its roots in optimal control theory. The main idea is to use a dynamic model to forecast the system behavior, and to exploit the forecast state sequence to obtain the control action at the current time. The system usually evolves in slotted time; the control action is obtained by solving, at each time step, a finite horizon optimal control problem where the initial state is the current state of the system. The optimization yields a finite control sequence, and only the first control action in this sequence is applied [rawlings2009model]. MPC has the ability to anticipate future events and can take control actions accordingly. It has been widely used in industrial processes, including chemical plants [fang2015nonlinear, eaton1991model, arefi2006nonlinear] and oil refineries [nejadkazemi2016pressure, pavlov2014modelling] and, recently, to balance energy consumption in smart energy grids [halvgaard2016distributed, stadler2016distributed, meng2015cooperation]. Moreover, it has been applied to supply chain management problems, with promising results [tzafestas1992parallel, perea2003model, braun2003model, lin2005predictive].
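To make the receding-horizon mechanism concrete, the following toy sketch (ours, not taken from any of the cited works) enumerates control sequences for a scalar battery model over a short horizon and applies only the first action of the best sequence, exactly as described above. The dynamics, cost function and discrete control set are illustrative placeholders.

```python
import itertools

def mpc_step(state, horizon, controls, predict_disturbance, cost):
    """Receding-horizon control: enumerate control sequences over a short
    horizon, simulate the model, and return only the FIRST action of the
    best sequence (the rest is discarded and recomputed at the next slot)."""
    best_action, best_cost = None, float("inf")
    for seq in itertools.product(controls, repeat=horizon):
        b, total = state, 0.0
        for k, u in enumerate(seq):
            b = b + u + predict_disturbance(k)  # toy dynamics: b' = b + u + d
            total += cost(b, u)
        if total < best_cost:
            best_cost, best_action = total, seq[0]
    return best_action

# Toy example: keep a battery level near a reference of 5 units.
B_REF = 5.0
action = mpc_step(
    state=2.0,
    horizon=3,
    controls=[-1.0, 0.0, 1.0],          # discrete charge/discharge decisions
    predict_disturbance=lambda k: 0.0,  # forecasts would plug in here
    cost=lambda b, u: (b - B_REF) ** 2 + 0.1 * u * u,
)
print(action)  # charging moves the level toward the reference
```

Real MPC implementations solve the finite-horizon problem with convex or numerical optimization rather than enumeration; the exhaustive search above is only meant to expose the "optimize, apply first action, repeat" structure.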
It is known that using time-series forecasting within an MPC framework can improve the quality of the control actions by providing insight into the future [doganis2008combined]. Over the last decades, numerous forecasting approaches have been developed, including Autoregressive Integrated Moving Average (ARIMA) processes and Artificial Neural Networks (ANNs). ARIMA models (introduced by Box and Jenkins in [box1970]) are known for their prediction accuracy, but their main limitation lies in the assumption that the data follow a linear model. Conversely, ANNs capture nonlinear structure and, in turn, can be a good alternative to ARIMA [zhang2005neural]. Nonetheless, ANNs give rise to mixed results for purely linear correlation structures. In [zhang2003time, khandelwal2015time], hybrid schemes that combine them are put forward to take advantage of their unique strengths. Experimental results with real-world data indicate that their combined use can improve the prediction accuracy achieved by either of the techniques when used in isolation.
Several authors have proposed the use of nonlinear models to build nonlinear adaptive controllers. In most applications, however, these nonlinearities are unknown, and a nonlinear parameterization must be used instead. In time-series analysis, where the underlying structure is largely unknown, one of the main challenges is to define an appropriate form of nonlinear parameterization for the forecasting model. Some implementations claim to be nonparametric, such as GPs, which can be considered (in some sense) as equivalent to models based on an infinite set of nonlinear basis functions [mackay1997gaussian]. The basic idea of GPs is to place a prior distribution directly on the space of functions, without finding an appropriate form of nonlinear parameterization for the forecasting model. This can be thought of as a generalization of a Gaussian distribution over functions. Moreover, a Gaussian Process (GP) is completely specified by its mean function and by its covariance function or kernel, which has a particular (but simple) parametric structure, defined through a small number of hyperparameters. The term nonparametric does not mean that there are no parameters, but that the parameters can be conveniently adapted from data. While GPs have been used in time-series forecasting [williams1998prediction], to the best of the authors’ knowledge, [leith2004gaussian] is the first application of GPs to electrical load forecasting [blum2013electricity, taylor2003short, taylor2006comparison, ketter2013power]. The electricity demand is mainly influenced by meteorological conditions and daily seasonality. Nevertheless, forecasting for short-term horizons of about a day is often performed using univariate prediction models, which are considered to be sufficient because the weather tends to change in a smooth fashion, which is reflected in the electricity demand itself.
Also, in a real-world online forecasting scenario, multivariate modeling is usually considered impractical [taylor2003]. Due to daily seasonality, the electrical load data bear some similarities with the time series that we consider in this paper, i.e., the harvested energy profile of Section III-B and the traffic load of Section III-C.
The idea of combining MPC and GPs was first proposed in [kocijan2003predictive]. Other practical implementations can be found in application domains such as greenhouse temperature control systems [pawlowski2011predictive], gas-liquid separation plant control systems [likar2007predictive], combustion power plant control systems [grancharova2008explicit] and in a number of other cases [palm2007multiple, ko2007gaussian, maciejowski2013fault, wang2014gaussian]. To the best of our knowledge, the present work is the first where MPC and GPs are combined to control an energy harvesting mobile network.
III System Model
We consider a mobile network comprising a set of BSs, each with energy harvesting capabilities, i.e., a solar panel, an energy conversion module and an energy storage device. Some of the BSs are on-grid and, in turn, can obtain energy from the power grid; the remaining BSs are off-grid. The proposed optimization process evolves in slotted time $t = 1, 2, \dots$, where the slot duration corresponds to the time granularity of the control and can be changed without requiring any modifications to the following algorithms.
III-A Power Packet Grids
A PPG is utilized to distribute energy among the BSs. The grid architecture is similar to that of a multi-hop network, see Fig. 1, where circles are BSs and the square is the energy router, which is in charge of energy routing decisions and power allocation. As assumed in [ma2017optimal], BSs are connected through Direct Current (DC) power links (electric wires) and the transmission of energy over them is operated in a Time Division Multiplexing (TDM) fashion. Energy transfer occurs by first establishing an energy route, which corresponds to a sequence of power links between the energy source and the destination. Each power link can only be used for a single transfer operation at a time. Power distribution losses along the power links follow a linear function of the distance between the source and the destination [ma2017optimal]. They depend on the resistance of the considered transmission medium, which is defined by [von2006electric]: $R = \rho \ell / A$, where $\rho$ is the resistivity of the wire in $\Omega \cdot$m, $\ell$ is the length of the power link in meters, and $A$ is the cross-sectional area of the cable in m$^2$. In this paper, we consider a PPG with a single energy router in the center of the topology. A number of trees originates from the router and, without loss of generality, each hop is assumed to have the same length and, hence, the same power loss.
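Under the stated linear-loss model, the link resistance and the resulting transfer efficiency of a DC power link can be computed as below. The copper resistivity, link voltage, power level and cross-section used in the example are illustrative values of our own, not parameters taken from the paper.

```python
def line_resistance(rho, length_m, area_m2):
    """Resistance of a power link: R = rho * L / A (ohms)."""
    return rho * length_m / area_m2

def transfer_efficiency(p_watts, v_volts, rho, length_m, area_m2):
    """Fraction of the sent power that survives the I^2 R line loss,
    for a DC link carrying current I = P / V."""
    r = line_resistance(rho, length_m, area_m2)
    i = p_watts / v_volts
    return 1.0 - (i * i * r) / p_watts

# Copper wire (rho ~ 1.68e-8 ohm*m), 100 m hop, 10 mm^2 cross-section.
r = line_resistance(1.68e-8, 100.0, 10e-6)
print(round(r, 3))  # ~0.168 ohm
```

Note how the loss grows linearly with the hop length, which is why assuming equal-length hops in the tree topology translates into an equal per-hop power loss.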
III-B Harvested Energy Profiles
Solar energy generation traces have been obtained using the SolarStat tool [miozzo2014solarstat]. For the solar modules, the commercially available Panasonic N235B photovoltaic technology is considered. Each solar panel comprises a number of solar cells, leading to a panel area that is deemed practical for installation in an urban environment, e.g., on light poles. As discussed in [miozzo2014solarstat, zordan2015telecommunications], the EH inflow is generally bell-shaped with a peak around midday, whereas the energy harvested during the night is negligible. Here, the framework in [miozzo2014solarstat] is utilized to obtain the amount of energy harvested by each BS $i$ in time slot $t$, which is denoted by $H_i(t)$.
III-C Traffic Load and Power Consumption
Traffic load traces have been obtained using real mobile data from the Big Data Challenge organized by Telecom Italia Mobile (TIM) [bigdata2015tim]. The dataset is the result of a computation over Call Detail Records (CDRs), logging the user activity within the TIM cellular network for the city of Milan during the months of November and December 2013. For the traffic load traces we use the CDRs related to SMS, calls and Internet activities, performing spatial and temporal aggregation. In this way, we obtain a daily traffic load profile for each BS.
Clustering techniques have been applied to the dataset to understand the behavior of the mobile data. To this end, we use DBSCAN unsupervised clustering [ester1996density] to classify the load profiles into several categories. In Fig. 2, we show the typical traffic behavior of two clusters, corresponding to the heaviest (cluster 1) and lightest (cluster 2) daily load. As noted in previous works, the traffic is time-correlated (and daily periodic) [zordan2015telecommunications, auer2010d2]. In our numerical results, each BS has an associated load profile, which is picked at random as one of the two clusters in Fig. 2. Depending on the cluster association probabilities, there is some imbalance in the load distribution across BSs that, as we discuss shortly, plays a key role in the performance of energy transfer algorithms. Given the traffic load profile $L_i(t)$, intended as the percentage of the total bandwidth that BS $i$ allocates to serve the users in its radio cell, the BS energy consumption (energy outflow), referred to in the following as $E_i(t)$, is computed through the linear model in [zordan2015telecommunications] (see Eq. (1) in that paper).
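For illustration, a generic linear load-to-power model of the kind referenced above can be sketched as follows. The idle and load-dependent coefficients below are placeholders of our own, not the values of the cited paper's Eq. (1).

```python
def bs_energy_outflow(load, p_idle=160.0, delta_p=216.0, slot_s=3600.0):
    """Hypothetical linear BS power model (coefficients are placeholders):
    P = p_idle + delta_p * load, with load in [0, 1].
    Returns the energy drained in one slot of slot_s seconds (joules)."""
    power_w = p_idle + delta_p * min(max(load, 0.0), 1.0)
    return power_w * slot_s

print(bs_energy_outflow(0.5) / 3.6e6)  # slot consumption in kWh
```

The key property used by the optimization is only the linearity in the load, so any affine coefficient pair would serve the same illustrative purpose.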
III-D Energy Storage Units
Energy storage units are interchangeably referred to as Energy Buffers (EBs). The EB level of BS $i$ is denoted by $B_i(t)$ and three thresholds are defined: $B^{\rm up}$, $B^{\rm ref}$ and $B^{\rm low}$, respectively termed the upper, reference and lower energy threshold, with $0 \leq B^{\rm low} \leq B^{\rm ref} \leq B^{\rm up} \leq B^{\rm max}$ ($B^{\rm max}$ is the EB capacity). $B^{\rm ref}$ corresponds to the desired EB level and $B^{\rm low}$ is the lowest energy level that any BS should ever reach. Both variables are used in the optimization of Section IV-C. For an off-grid BS $i$, if $t$ is the current time slot, the buffer level process is updated at the beginning of the next time slot as:

$B_i(t+1) = B_i(t) + H_i(t) - E_i(t) + T_i(t)$,   (1)

where $T_i(t)$ is the amount of energy transferred to/from BS $i$ in time slot $t$, which is positive if BS $i$ is a consumer or negative if BS $i$ acts as an energy source. $B_i(t)$ is the EB level at the beginning of time slot $t$, whereas $H_i(t)$ and $E_i(t)$ are the amount of energy harvested and the energy that is locally drained (to support the local data traffic), respectively. The energy level of an on-grid BS $i$ is updated as:

$B_i(t+1) = B_i(t) + H_i(t) - E_i(t) + T_i(t) + G_i(t)$,   (2)

where $G_i(t)$ represents the energy purchased by BS $i$ from the power grid during time slot $t$. The behavior of a BS (i.e., $T_i(t)$ and $G_i(t)$) depends on its EB level. If the BS behaves as an energy source, it is eligible for transferring a certain amount of energy to other BSs. In this work, we assume that if the total energy in the buffer at the beginning of the current time slot is $B_i(t) < B^{\rm up}$ and the BS is on-grid, then the difference $B^{\rm up} - B_i(t)$ is purchased from the power grid in slot $t$, as an on-grid BS should always be a source, i.e., in the position of transferring energy to other BSs. If instead the BS behaves as an energy consumer, it demands energy from the sources. For example, the energy demand in time slot $t$ may be set to $B^{\rm ref} - B_i(t)$, so that the EB level would ideally become no smaller than the reference threshold $B^{\rm ref}$ by the end of the current time slot $t$. Note that this can only be strictly guaranteed if $H_i(t) \geq E_i(t)$. However, $B_i(t)$ is updated at the beginning of time slot $t$, whereas $H_i(t)$ and $E_i(t)$ are only known at the end of it. The theory of Sections IV-B and IV-C allows computing $\mathbb{E}[H_i(t) - E_i(t)]$, accounting for the expected behavior to get more accurate results, where $\mathbb{E}[\cdot]$ is the expectation operator.
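As a concrete illustration, the buffer updates of Eqs. (1) and (2) and the stated purchase/demand rules can be sketched as follows. The variable names and the clipping of the buffer level to its physical range are our own assumptions.

```python
def update_buffer(b, harvested, drained, transferred, purchased=0.0,
                  b_max=100.0):
    """Buffer update of Eqs. (1)-(2): 'transferred' is positive when the
    BS consumes energy, negative when it acts as a source; 'purchased'
    is nonzero only for on-grid BSs. Clipping to [0, b_max] is assumed."""
    return min(max(b + harvested - drained + transferred + purchased, 0.0),
               b_max)

def grid_purchase(b, b_up):
    """On-grid rule: top the buffer up to the upper threshold, so that the
    BS can always act as an energy source."""
    return max(b_up - b, 0.0)

def energy_demand(b, b_ref):
    """Consumer rule: request what is missing to reach the reference level."""
    return max(b_ref - b, 0.0)

b_next = update_buffer(20.0, harvested=5.0, drained=8.0, transferred=2.0)
print(b_next)  # 19.0
```

In the actual framework the demand would be computed with the forecast term E[H_i(t) - E_i(t)] folded in, rather than from the buffer level alone.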
IV Optimization for online energy management
In this section, we devise an online optimal power allocation strategy, whose objective is to make the off-grid BSs as energy self-sustainable as possible. This is achieved by transferring energy from energy-rich BSs (energy sources) to those base stations that require it (energy consumers).
IV-A Overview of the optimization framework
A diagram of the optimization process is shown in Fig. 3, involving 1) pattern learning (forecasting), 2) model predictive control, 3) convex optimization and 4) energy routing. These algorithms are all executed at runtime. First of all, the harvested energy and traffic load processes are statistically modeled through Bayesian nonparametric tools (“pattern learning” in Fig. 3), as we detail in Section IV-B. This first step allows each BS to independently track its own energy and load processes, capturing their statistical behavior and obtaining multi-step ahead forecasts for the corresponding time series. It is worth noting that our forecasting method is agnostic to the type of signal and, for this reason, can be promptly extended to other processes, if need be. These forecasts are then fed into the foresighted optimization approach of Section IV-C. Their use allows for informed decisions, which take the future system evolution into account. This results in effective energy allocation strategies, which lead to a reduction of the amount of energy that has to be purchased from the power grid.
The second step in the optimization framework is the adaptive control based on MPC theory. Its main goal is to determine the BS role (energy source or consumer), and to obtain the amount of energy that each BS has to either transfer to, or request from, the sources. The MPC block takes online actions, considering not only the current system state, i.e., traffic load, harvested energy and EB levels, but also future ones (based on the forecasts from the previous block), anticipating events and acting accordingly. This is a main difference with respect to the work in [Gambin2017], where BS energy roles are determined solely based on the current EB level. The MPC block is described in Section IV-C.
The actual energy allocation is computed in the third optimization step. This block determines how the energy amounts (obtained through MPC) have to be redistributed among BSs, matching energy sources and energy consumers. Two approaches are proposed to this end (see Section IV-D): one based on convex optimization and another formulated as an assignment problem and solved through the Hungarian method [kuhn1955hungarian]. Their objective is to reduce the outage probability as much as possible, i.e., the ratio between the number of BSs that are unable to serve their load due to energy scarcity and the total number of BSs in the network, while maximizing the energy transfer efficiency.
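The assignment-problem view of the source-consumer matching can be illustrated with a minimum-cost matching over a made-up cost matrix (e.g., hop counts, which drive the transfer losses). The Hungarian method solves this in O(n^3); for clarity, the sketch below simply enumerates all matchings.

```python
from itertools import permutations

def best_assignment(cost):
    """Minimum-cost matching of energy sources (rows) to consumers
    (columns). Brute force for illustration only; the Hungarian method
    solves the same problem in O(n^3)."""
    n = len(cost)
    best, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best, best_cost = perm, c
    return best, best_cost

# cost[i][j]: e.g. hop count (transfer loss) from source i to consumer j
cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
match, total = best_assignment(cost)
print(match, total)  # source i is matched to consumer match[i]
```

Minimizing the total hop count is one proxy for maximizing the transfer efficiency, since per-hop losses accumulate along each route.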
Finally, the last step is to perform the energy exchange (“energy routing”) among the BSs. Since the PPG is operated in a TDM fashion, each power link can only be used for a single trading operation at a time. Hence, the proposed routing strategy seeks to find disjoint routes between energy sources and consumers, while minimizing the time needed to perform the energy transfer. Details are provided in Section IV-E.
IV-B Pattern learning through Bayesian nonparametric models
Definition  Variable name 

Base station set  
Ongrid base station set  
Offgrid base station set  
Number of base stations  
Harvested energy profile in slot  
Traffic load profile in slot  
BS energy consumption in slot  
Energy buffer level in slot  
Maximum energy buffer capacity  
Upper, lower and reference buffer thresholds  , , 
Transferred energy in slot  
Purchased grid energy in slot  
Used by the pattern analysis block  
Number of observations (training dataset)  
Number of observations (test set)  
The transpose of vector  
The weights of the Bayesian linear model  
The function value  
The observed real value  
dimensional column vector of targets  
Map in the feature space  
Training dataset  
dimensional input column vector  
dimensional input column vector (test set)  
dimensional feature vector  
matrix of inputs  
matrix of inputs in the test set  
matrix in the feature space  
Gaussian distribution with zero mean and variance  
Covariance matrix of the model parameters  
Gaussian process  
Gaussian process: mean function  
Gaussian process: covariance function (kernel)  
Gaussian process: predictive mean vector  
Gaussian process: predictive covariance matrix  
covariance matrix (training dataset)  
identity matrix  
Function values (training dataset)  
Function values (test set)  
Used by the optimization block  
Optimization horizon (time steps)  
System state matrix for MPC  
Control matrix for MPC  
System disturbances matrix for MPC  
Weight parameter for MPC  
Set of energy sources  
Set of energy consumers  
Energy allocation matrix  
Energy availability matrix  
Maximum transmission energy capacity  
Energy demand vector  
Number of hops matrix  
Weight parameter for energy allocation  
Cost matrix for Hungarian method 
In this section, we are concerned with statistical models to automatically capture the hidden structure of the observations in a training dataset. Bayesian nonparametric models, such as GPs, can represent our beliefs about the model parameters via a probability distribution (called the prior). Then, Bayesian inference can reshape the prior distribution, transforming it into a posterior one.
GPs have become popular for regression and classification, often showing impressive empirical performance [Rasmussen_2006]. However, while the outputs for a classification task are discrete class labels, in a regression task the outputs (or targets) are real values. Here, we use GPs for our regression task. According to [Rasmussen_2006], there are two equivalent views to treat GPs within a regression problem: 1) the weight-space view and 2) the function-space view.
1) The weight-space view. The Bayesian linear model for regression is defined as:

$f(\mathbf{x}) = \phi(\mathbf{x})^{\top} \mathbf{w}, \qquad y = f(\mathbf{x}) + \varepsilon$,   (3)

where $\mathbf{w}$ is a vector of weights, also known as model parameters, $f(\mathbf{x})$ is the function value, which is linear in the weights $\mathbf{w}$, $y$ is the observed real value, and $\phi(\cdot)$ maps the $d$-dimensional input column vector $\mathbf{x}$ into an $N$-dimensional feature vector $\phi(\mathbf{x})$. Assume we are given a training dataset with $n$ observations, $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where each pair consists of the $d$-dimensional input column vector $\mathbf{x}_i$ and the scalar target $y_i$. We can aggregate inputs and targets in a $d \times n$ matrix $X$ and an $n$-dimensional column vector $\mathbf{y}$, so that $\mathcal{D} = (X, \mathbf{y})$, and $\Phi = \phi(X)$ becomes an $N \times n$ matrix in the feature space. We are interested in the conditional distribution of the targets, given the inputs in the feature space and the model parameters. We further assume that $y$ differs from $f(\mathbf{x})$ by additive noise, and that this noise follows an independent identically distributed (i.i.d.) Gaussian distribution with zero mean and variance $\sigma_n^2$, i.e., $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$. From the i.i.d. assumption, it follows that the likelihood (i.e., the conditional distribution of the targets given the inputs in the feature space and the model parameters) is factorized over the $n$ observations, i.e., $p(\mathbf{y} \mid X, \mathbf{w}) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \mathbf{w}) = \mathcal{N}(\Phi^{\top} \mathbf{w}, \sigma_n^2 I)$.

According to the standard formalism of Bayesian inference, the prior distribution encodes our beliefs about the model parameters, before we access the observations. Conversely, the likelihood gives us insights about the model parameters, thanks to the evidence of the observations in a training dataset. For the simple Bayesian linear model of Eq. (3), the prior distribution is set to be Gaussian with zero mean and covariance matrix $\Sigma_p$, i.e., $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$. Then, the posterior distribution combines the likelihood and the prior distribution, and expresses our full knowledge about the model parameters, after we access the observations. The posterior distribution is derived via the Bayes’ rule as:

$p(\mathbf{w} \mid \mathbf{y}, X) = \dfrac{p(\mathbf{y} \mid X, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid X)}$.   (4)

To make a prediction for the test case $\mathbf{x}_*$ given $\mathcal{D}$, we average over all possible choices of the model parameters, weighted by the posterior distribution, i.e.,

$p(f_* \mid \mathbf{x}_*, X, \mathbf{y}) = \mathcal{N}\!\left( \tfrac{1}{\sigma_n^2}\, \phi(\mathbf{x}_*)^{\top} A^{-1} \Phi\, \mathbf{y},\; \phi(\mathbf{x}_*)^{\top} A^{-1} \phi(\mathbf{x}_*) \right)$,   (5)

where $f_* = f(\mathbf{x}_*)$ and $A = \sigma_n^{-2}\, \Phi \Phi^{\top} + \Sigma_p^{-1}$. Here, note that it is more convenient to invert an $n \times n$ matrix than the $N \times N$ matrix $A$, whenever $n < N$. Furthermore, the feature space enters in the form $\phi(\mathbf{x})^{\top} \Sigma_p\, \phi(\mathbf{x}')$, where vectors $\mathbf{x}$ and $\mathbf{x}'$ are either in the test or in the training dataset. At this point, let us define $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^{\top} \Sigma_p\, \phi(\mathbf{x}')$ as the covariance function or kernel. Specifically, $k(\mathbf{x}, \mathbf{x}')$ represents an inner product with respect to $\Sigma_p$, equivalent to $\psi(\mathbf{x}) \cdot \psi(\mathbf{x}')$ when $\psi(\mathbf{x}) = \Sigma_p^{1/2} \phi(\mathbf{x})$ (since $\Sigma_p$ is positive definite). We can conclude that: if an algorithm is defined in terms of inner products in the input space, then it can be lifted into the feature space by replacing occurrences of inner products by the kernel, whenever it is more convenient to compute the kernel than the feature vectors themselves; this is also known as the kernel trick [Rasmussen_2006].
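A minimal numpy sketch of the weight-space predictive distribution of Eq. (5) follows; the feature map, noise level and training data are illustrative choices of ours.

```python
import numpy as np

def blr_posterior_predict(Phi, y, phi_star, sigma_n, Sigma_p):
    """Predictive mean/variance of Bayesian linear regression, Eq. (5):
    A = sigma_n^-2 Phi Phi^T + Sigma_p^-1,
    mean = sigma_n^-2 phi*^T A^-1 Phi y,  var = phi*^T A^-1 phi*."""
    A = Phi @ Phi.T / sigma_n**2 + np.linalg.inv(Sigma_p)
    A_inv = np.linalg.inv(A)
    mean = (phi_star @ A_inv @ Phi @ y) / sigma_n**2
    var = phi_star @ A_inv @ phi_star
    return mean, var

# Affine features phi(x) = [1, x] on noisy samples of y = 1 + 2x.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
Phi = np.vstack([np.ones_like(x), x])  # N x n feature matrix
y = 1.0 + 2.0 * x + 0.01 * rng.standard_normal(50)
mean, var = blr_posterior_predict(Phi, y, np.array([1.0, 0.5]),
                                  sigma_n=0.01, Sigma_p=np.eye(2))
print(round(float(mean), 2))  # close to 1 + 2*0.5 = 2
```

With abundant data and small noise, the prior term becomes negligible and the predictive mean approaches the least-squares fit, as expected from Eq. (5).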
2) The function-space view. We can obtain an exact correspondence with the weight-space view by using a GP to model a distribution over functions. Formally: a GP is a collection of random variables, any finite number of which have a joint Gaussian distribution. Moreover, it is completely specified by its mean function and its covariance function (or kernel). We define the mean function $m(\mathbf{x})$ and the covariance function $k(\mathbf{x}, \mathbf{x}')$ of a real process $f(\mathbf{x})$ as
(6) $m(\mathbf{x}) = \mathbb{E}[f(\mathbf{x})], \qquad k(\mathbf{x}, \mathbf{x}') = \mathbb{E}[(f(\mathbf{x}) - m(\mathbf{x}))(f(\mathbf{x}') - m(\mathbf{x}'))]$
Next, we consider the zero mean function, i.e., $m(\mathbf{x}) = 0$, which is a very typical choice in the GP literature [Rasmussen_2006]. In the Bayesian linear model of Eq. (3), the prior distribution is set to be Gaussian with zero mean and covariance matrix $\Sigma_p$, i.e., $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \Sigma_p)$. Thus, we can derive an example GP as:
(7) $\mathbb{E}[f(\mathbf{x})] = \boldsymbol{\phi}(\mathbf{x})^{\top}\, \mathbb{E}[\mathbf{w}] = 0, \qquad \mathbb{E}[f(\mathbf{x})\, f(\mathbf{x}')] = \boldsymbol{\phi}(\mathbf{x})^{\top}\, \Sigma_p\, \boldsymbol{\phi}(\mathbf{x}')$
Assume the training dataset has $n$ observations; then the vector $\mathbf{y}$ has a joint Gaussian distribution, i.e., $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, K + \sigma_n^2 I)$, where the covariance matrix $K$ can be computed by evaluating the covariance function or kernel for the observations, i.e., $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ for $i, j = 1, \dots, n$. Given the noise variance $\sigma_n^2$, it follows from the i.i.d. assumption that the diagonal matrix $\sigma_n^2 I$ must be added to $K$, as compared to the noise-free model in the GP literature [Rasmussen_2006]. To make a prediction for the test case $\mathbf{x}_*$ given the training dataset, we consider the joint Gaussian prior distribution over functions
(8) $\begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K + \sigma_n^2 I & \mathbf{k}_* \\ \mathbf{k}_*^{\top} & k(\mathbf{x}_*, \mathbf{x}_*) \end{bmatrix}\right)$
where we define the $n$-dimensional column vector $\mathbf{k}_*$ such that its $i$-th element is equal to $k(\mathbf{x}_i, \mathbf{x}_*)$. To derive the posterior distribution over functions, we condition the joint Gaussian prior distribution over functions on the data, obtaining the key predictive equations of GPs for regression:
(9) $f_*\,|\,\mathbf{x}_*, X, \mathbf{y} \sim \mathcal{N}(\bar{f}_*, \mathbb{V}[f_*]), \qquad \bar{f}_* = \mathbf{k}_*^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{y}, \qquad \mathbb{V}[f_*] = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{k}_*$
In practice, the predictive mean $\bar{f}_*$ is used as a point estimate of the function output, while the variance $\mathbb{V}[f_*]$ can be translated into uncertainty bounds (predictive error bars) on this point estimate, making GPs for regression very appealing for MPC applications (see [murray1999transient, leith2000nonlinear, solak2003derivative, kocijan2003predictive]).
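The predictive equations of Eq. (9) amount to a few lines of linear algebra. The following is a minimal illustration, not the paper's implementation (which relies on the toolbox of [toolbox]): it assumes a one-dimensional input, an SE kernel with unit hyperparameters, and a hypothetical noisy sine-wave dataset:

```python
import numpy as np

def se_kernel(a, b, ell=1.0, sf2=1.0):
    """Squared Exponential kernel matrix between 1-D input sets a, b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return sf2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_predict(x_train, y_train, x_test, sigma_n2=0.1):
    """Predictive mean and variance of a zero-mean GP (Eq. (9))."""
    K = se_kernel(x_train, x_train) + sigma_n2 * np.eye(len(x_train))
    Ks = se_kernel(x_train, x_test)          # n x n_* cross-covariances
    Kss = se_kernel(x_test, x_test)
    alpha = np.linalg.solve(K, y_train)      # (K + sigma_n^2 I)^{-1} y
    mean = Ks.T @ alpha
    var = np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks))
    return mean, var

x = np.linspace(0.0, 2.0 * np.pi, 20)
y = np.sin(x)                                # toy observations
mu, var = gp_predict(x, y, np.array([np.pi / 2]))
# mu[0] lies close to sin(pi/2) = 1, with a small predictive variance.
```

The variance returned alongside the mean is exactly the quantity exploited later by the MPC stage as a measure of forecast uncertainty.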
For any set of basis functions in the feature space, we can compute the corresponding covariance function or kernel; conversely, for every (positive definite) covariance function or kernel, there exists a (possibly infinite) expansion in terms of basis functions in the feature space. As we show shortly, the choice of the kernel deeply affects the performance of a GP for a given task, as much as the choice of the parameters (architecture, activation functions, learning rate, etc.) does for a neural network. Specifically, the hyperparameters of the kernel must be set in order to optimize the marginal likelihood, which is defined as follows:
(10) $\displaystyle p(\mathbf{y}\,|\,X) = \int p(\mathbf{y}\,|\,\mathbf{f}, X)\, p(\mathbf{f}\,|\,X)\, d\mathbf{f}$
Under the Gaussian assumption, the prior is Gaussian, $\mathbf{f}\,|\,X \sim \mathcal{N}(\mathbf{0}, K)$, and the likelihood is a factorized Gaussian, $\mathbf{y}\,|\,\mathbf{f} \sim \mathcal{N}(\mathbf{f}, \sigma_n^2 I)$, thus $\mathbf{y}\,|\,X \sim \mathcal{N}(\mathbf{0}, K + \sigma_n^2 I)$. An extensive derivation of this formulation and the generalization to more than one test case can be found in [Rasmussen_2006].
Suppose we have $n_*$ observations in the test set, collected in $X_*$. To make predictions for the test cases given the training dataset, we consider the joint Gaussian prior distribution over functions
(11) $\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} K + \sigma_n^2 I & K_* \\ K_*^{\top} & K_{**} \end{bmatrix}\right)$
where we define the $n \times n_*$ matrix $K_*$ similarly to $\mathbf{k}_*$, such that $[K_*]_{ij} = k(\mathbf{x}_i, \mathbf{x}_{*j})$ for $i = 1, \dots, n$ and $j = 1, \dots, n_*$, each column of $K_*$ being a vector of cross-covariances for one test case. Finally, we define the $n_* \times n_*$ matrix $K_{**}$ similarly to $K$, such that $[K_{**}]_{ij} = k(\mathbf{x}_{*i}, \mathbf{x}_{*j})$; thus we get the key predictive equations of GPs for regression:
(12) $\bar{\mathbf{f}}_* = K_*^{\top} (K + \sigma_n^2 I)^{-1} \mathbf{y}, \qquad \mathrm{cov}(\mathbf{f}_*) = K_{**} - K_*^{\top} (K + \sigma_n^2 I)^{-1} K_*$
The choice of the kernel: this choice deeply affects the performance of a GP for a given task, as it encodes the similarity between pairs of outputs in the function domain. There has been significant work on constructing base and composite kernels [duvenaud2013structure]. Common base kernels include the Squared Exponential (SE) kernel, the Rational Quadratic (RQ) kernel, and the Standard Periodic (SP) kernel, defined as:
(13) $k_{\mathrm{SE}}(r) = \sigma^2 \exp\!\left(-\frac{r^2}{2\ell^2}\right), \quad k_{\mathrm{RQ}}(r) = \sigma^2 \left(1 + \frac{r^2}{2\alpha\ell^2}\right)^{-\alpha}, \quad k_{\mathrm{SP}}(r) = \sigma^2 \exp\!\left(-\frac{2\sin^2(\pi r / p)}{\ell^2}\right)$, with $r = \lVert \mathbf{x} - \mathbf{x}' \rVert$
Functions drawn from a GP with an SE kernel can display long-range trends, where the lengthscale $\ell$ determines how quickly the process varies with the inputs. The RQ kernel is derived as a scale mixture of SE kernels with different lengthscales. The SP kernel is derived by mapping the two-dimensional variable $(\cos x, \sin x)$ through the SE kernel. Derivations for the RQ and SP kernels are in [Rasmussen_2006].
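The three base kernels can be written down directly from their definitions in Eq. (13). The sketch below is illustrative only; the hyperparameter defaults (unit variance and lengthscale, a 24-hour period for SP) are assumptions, not the values optimized in the paper:

```python
import numpy as np

def k_se(r, ell=1.0, sf2=1.0):
    """Squared Exponential kernel on the distance r = |x - x'|."""
    return sf2 * np.exp(-0.5 * (r / ell) ** 2)

def k_rq(r, ell=1.0, sf2=1.0, alpha=1.0):
    """Rational Quadratic kernel: a scale mixture of SE kernels
    with different lengthscales; recovers SE as alpha -> infinity."""
    return sf2 * (1.0 + r ** 2 / (2.0 * alpha * ell ** 2)) ** (-alpha)

def k_sp(r, ell=1.0, sf2=1.0, p=24.0):
    """Standard Periodic kernel with period p (here 24, as for
    hourly data with daily seasonality)."""
    return sf2 * np.exp(-2.0 * np.sin(np.pi * r / p) ** 2 / ell ** 2)

# All three equal sf2 at r = 0; SE and RQ decay with distance,
# while SP repeats exactly with period p.
```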
Note that valid kernels (i.e., those having a positive-definite covariance function) are closed under the operators $+$ and $\times$. This allows one to create more representative, composite kernels from well-understood basic components, according to the following key rules [duvenaud2013structure]:

Any subexpression $S$ (where a subexpression is any valid kernel family, either basic or composite) can be replaced with $S + B$, where $B$ is any base kernel family.

Any subexpression $S$ can be replaced with $S \times B$, where $B$ is any base kernel family.

Any base kernel $B$ can be replaced with any other base kernel family $B'$.
In time series, summing kernels can express superpositions of different processes operating at different scales, whereas multiplying kernels can be a way of converting global data properties into local ones. From here on, we will use one-dimensional composite kernels whose periodic component has its period fixed to one day, which corresponds to a local quasi-periodic structure in the data, with noise operating at different scales. Note that kernels over multi-dimensional inputs can be constructed via the operators $+$ and $\times$ over the individual dimensions. Next, we consider models based on zero-mean GPs for the runtime multi-step ahead forecasting of time series, with application to a) the Harvested Energy Profile (defined in Section III-B) and b) the Traffic Load (Section III-C).
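For illustration, one possible quasi-periodic composition built with the closure rules above is sketched below. This is an assumption for exposition only: the paper's actual composite kernel is the one given in Eq. (14), and all parameter values here are arbitrary:

```python
import numpy as np

# Base kernels on a scalar distance r (all parameters are arbitrary).
k_se = lambda r, ell=10.0: np.exp(-0.5 * (r / ell) ** 2)
k_sp = lambda r, p=24.0, ell=1.0: np.exp(-2.0 * np.sin(np.pi * r / p) ** 2 / ell ** 2)
k_rq = lambda r, ell=5.0, a=2.0: (1.0 + r ** 2 / (2.0 * a * ell ** 2)) ** (-a)

def k_quasi_periodic(r):
    """SP x SE: a daily pattern whose shape drifts slowly over time;
    + RQ: an additive term capturing noise at multiple scales.
    Sums and products of valid kernels are valid kernels, so the
    composite is itself positive definite."""
    return k_sp(r) * k_se(r) + k_rq(r)
```

Multiplying SP by a slowly-decaying SE turns the global periodicity into a local one (days far apart become less correlated), which is exactly the "global to local" conversion mentioned above.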
The basic routine for prediction: we use models based on zero-mean GPs for the runtime forecasting of both time series. The strong daily seasonality of the data is evident for both, as is the presence of noise at different scales. Therefore, we define composite kernels with a daily periodic component, i.e.,
(14) 
where $r$ is the Euclidean distance between the inputs. At this point, the hyperparameters of the kernel must be set so as to optimize the marginal likelihood, which is defined in Eq. (10) and is here implemented using the toolbox of [toolbox]. For compactness, we aggregate the hyperparameters of the kernel in an initialization set, and select the free parameters via a grid search over a range of candidate combinations. To model the strong daily seasonality in the data, we also adopt a prior distribution on the period which is a delta function, so that the period is treated as a constant and is excluded from the optimization (see [toolbox]).


Algorithm 1 describes the basic routine for the pre-training, training and forecasting phases of both zero-mean GPs, i.e., the same reasoning holds for the harvested energy and the traffic load time series, where each sample refers to the value of either process at the corresponding time. Also, we assume that we can access the values in the training dataset, and we wish to predict the values in the test set.
In the pre-training phase, we first find the optimal hyperparameters for the kernel, starting from the initialization set and maximizing the marginal likelihood (i.e., minimizing its negative logarithm) on the training dataset.
Note that these hyperparameters will serve as the initialization at each step of the online forecasting routine, as the optimal hyperparameters are re-computed over the training dataset, which changes at each step of the online routine. Assuming Gaussian noise with variance $\sigma_n^2$, and thus a Gaussian likelihood, exact inference is possible. To perform it, we use the Conjugate Gradients (CG) optimization tool implemented in the toolbox of [toolbox].
We obtain the predictions via Eq. (12), given the test set with its test cases, at each time step. Finally, we derive the Root Mean Square Error (RMSE) over the test cases, starting from the residuals between predicted and real values, and we iterate the procedure (except for the pre-training phase) until the end of the data. For the numerical results, the training phase is performed periodically, with a fixed period expressed in steps.
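The loop structure of Algorithm 1 can be sketched as follows. This is a simplified stand-in, not the paper's routine: a seasonal-naive predictor replaces the GP, the two-week/one-day window lengths are used merely as placeholders, and the synthetic series is a noiseless daily sinusoid:

```python
import numpy as np

def run_forecasting(y, n_tr=336, n_te=24):
    """Skeleton of the online multi-step forecasting routine: at each
    time t, use the last n_tr samples as the training window, predict
    the next n_te samples, and record the RMSE of the residuals.
    A seasonal-naive predictor (repeat the last day) stands in for
    the GP, purely to show the loop structure of Algorithm 1."""
    rmse = []
    for t in range(n_tr, len(y) - n_te):
        train = y[t - n_tr:t]
        pred = train[-24:][:n_te]          # naive: replay yesterday
        resid = y[t:t + n_te] - pred
        rmse.append(np.sqrt(np.mean(resid ** 2)))
    return np.array(rmse)

# Synthetic hourly series with a clean daily seasonality.
t = np.arange(24 * 21)                     # three weeks of hourly samples
y = 0.5 + 0.5 * np.sin(2.0 * np.pi * t / 24.0)
rmse = run_forecasting(y)
```

In the full scheme, the `pred` line is replaced by the GP predictive mean of Eq. (12), and the kernel hyperparameters are re-optimized once every a fixed number of loop iterations rather than at every step.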
Numerical results: we now assess the proposed scheme for the runtime multi-step ahead forecasting of both time series, tracking the prediction error over the test cases. Here, we set the time step to one hour, use two weeks of data for the training dataset, and run the online routine over two months of data. This choice of parameters is valid for both time series, as is the use of the kernel in Eq. (14), whereas the optimized hyperparameters differ, depending on the nature of the data.
In Table Ia and Table Ib we show the average RMSE for the harvested energy and the traffic load time series, computed by averaging the RMSE measures over the whole online routine. Also, as the training phase is performed periodically, we compare the numerical results obtained when re-optimizing the free parameters at each step of the online forecasting routine against those obtained when re-optimizing them only periodically. In general, the average RMSE decreases as we increase the number of test cases up to the horizon corresponding to one day into the future. Even the worst performance is rather small if we consider that both time series are normalized prior to processing. Also, predictions for the harvested energy (Fig. 3(a)) are more precise than predictions for the traffic load (Fig. 3(b)); this is due to the nature of the data, given that we use the same kernel for both time series. In fact, the values in Fig. 3(a) follow a more regular behavior than those in Fig. 3(b), with quasi-periodic streams of zero values corresponding to zero solar energy income during the night. These streams of zero values help reinforce the prediction, while allowing for a higher confidence at nighttime (see Fig. 4(a)). Finally, the re-optimization period quantifies the impact of re-optimizing the hyperparameters according to the most recent history (i.e., two weeks of data), at the price of a longer execution time. The numerical results suggest that frequent re-optimization can be reasonable when the data exhibit multiple strong local behaviors, rather than just a strong daily seasonality, to which the kernel has to adapt. However, it is not necessarily the optimal choice (see Table Ia).
In Fig. 3(a) and Fig. 3(b) we show real values and predictions for two weeks of data, where we track the one-step predictive mean value at each step of the online forecasting routine. The strong daily seasonality is evident, as is the quasi-periodic structure in the data, with noise operating at different scales. Note that predictions for the harvested energy (Fig. 3(a)) are more accurate than those for the traffic load (Fig. 3(b)), a result that is confirmed by comparing the average RMSE in Tables Ia and Ib. However, predictions still miss the real values when unusual events occur; see, for example, the low solar energy income at the sixth peak from the left in Fig. 3(a), or the sudden peaks in the traffic load profile of Fig. 3(b), which are very day-specific.
In Figs. 4(a) and 4(b) we show real and predicted values for three days of data, i.e., the last two days of the training dataset plus the test set, plotting the multi-step predictive mean value. Here, we compare the use of the kernel in Eq. (14) with common base kernels from the literature, namely the popular Squared Exponential (SE) kernel, the Rational Quadratic (RQ) kernel, and the Standard Periodic (SP) kernel, see Eq. (13). Also, we assess the generalization capabilities of the kernel in Eq. (14) over the training dataset and the test set, i.e., we perform forecasting over both, after the optimization of the hyperparameters given the observations. Note that the proposed kernel (solid line) shows the best forecasting performance, since composite kernels are more representative than base ones. Specifically, the RMSE is close to zero over the training dataset, and this result also holds for both the SE and RQ cases. However, the generalization capabilities over the test set are quite limited for SE and RQ: these base kernels have limited expressive power and simply act as smoothers. Finally, the SP kernel succeeds in recovering the strong daily seasonality in the data, but it fails to model the noise at different scales. Again, its expressive power is quite limited with respect to our proposed kernel in Eq. (14).
IV-C Predictive and time-adaptive energy allocation
In general, an MPC framework can be divided into three blocks: i) inputs, ii) MPC controller and iii) real system [maciejowski2013fault]. The first block provides the prediction model (see Section IV-B). The MPC controller solves a control problem at runtime (see below). Finally, the real-system block receives the optimal actions from the MPC controller and behaves accordingly.
Notation: using standard notation, the system to be controlled is described by means of a discrete-time model:
(15) $\mathbf{x}(k+1) = \mathbf{x}(k) + \mathbf{u}(k) + \mathbf{w}(k)$
where $k$ is the current time slot. The state matrix, with one row per BS, represents the energy buffer level of each BS over the $T$ time slots of the optimization horizon (expressed in hours), which corresponds to the number of test cases of the previous subsection. The control matrix, with elements $u_n(\kappa)$, represents the amount of energy that BS $n$ shall either transfer or receive (depending on the sign of $u_n(\kappa)$) in time slot $\kappa$. A third matrix models the system disturbances, i.e., the stochastic behavior of the forecast profiles (harvested and consumed energy), with:
(16) $\mathbf{w}(k) \sim \mathcal{N}(\boldsymbol{\mu}(k), \boldsymbol{\Sigma}(k))$
where $\boldsymbol{\mu}(k)$ and $\boldsymbol{\Sigma}(k)$ contain the mean and variance of the forecast estimates, respectively. Eqs. (15) and (16) relate to the problem setup of Section III-D as follows: the state contains the buffer levels of all BSs, the control corresponds to the amount of energy to transfer, and the disturbance contains the exogenous processes, i.e., the difference between harvested and consumed energy. Note that these processes are statistically characterized through the prediction framework of Section IV-B, and their difference is still a Gaussian r.v. (in fact, the consumed energy is derived from the traffic load through a linear model, and as such is still Gaussian distributed). Following [wang2016stochastic], due to the stochastic nature of Eq. (16), the system state should also be expressed in a probabilistic way:
(17) $\mathbf{x}(k) \sim \mathcal{N}(\bar{\mathbf{x}}(k), \boldsymbol{\Sigma}_x(k))$
where $\bar{\mathbf{x}}(k)$ and $\boldsymbol{\Sigma}_x(k)$ are the mean and the variance of the system state, respectively.
Objective functions: the goal of the MPC controller is to determine the amount of energy $u_n(\kappa)$ that each BS $n$ should either transfer or receive in the time slots of the optimization horizon. If $u_n(\kappa) > 0$, BS $n$ acts as a source in slot $\kappa$, whereas if $u_n(\kappa) < 0$ it acts as an energy consumer. A first cost function tracks the total amount of energy that is exchanged among the BSs:
(18) $J_1(k) = \sum_{\kappa = k}^{k + T - 1} \sum_{n} |u_n(\kappa)|$
For the second objective, we use the reference threshold defined in Section III-D. The MPC controller tries to increase the BS energy buffer levels, while avoiding that only a few of the consumers receive energy. To achieve this, a second cost function $J_2$ weighs how close to the reference threshold the buffers get:
(19) 
where $e_n(\kappa)$ is the energy buffer level of BS $n$ in time slot $\kappa$.
Control problem: once the inputs are stated, a finite-horizon multi-objective optimization problem is formulated:
(20a)  
subject to:  (20b)  
(20c)  
(20d)  
(20e)  
where $\lambda$ is a weight balancing the relative importance of the two cost functions, and the energy buffer limitations are those defined in Section III-D. Finally, the fourth constraint defines the amount of energy that each BS can exchange, which depends on the system state, i.e., on the energy buffer level, the expected harvested energy and the expected traffic load: the system state dictates the limits of the control action.
For any fixed value of $\lambda$, and since the optimization problem must be solved at runtime, it is strongly preferable to adopt a convex formulation such as Eq. (20), which can be solved through standard techniques. Here, we have used the CVX tool [grant2008cvx] to obtain the optimal solution, which dictates the optimal amount of energy that each BS shall either provide or receive in each time slot.
Optimization algorithm: the MPC controller proceeds as follows [zhang2017optimal]:

Step 1: at the beginning of time slot $k$, the system state is obtained, that is, the energy buffer levels of all BSs, together with the harvested energy and traffic load forecasts over the optimization horizon.

Step 2: the control problem in Eq. (20) is solved, yielding a sequence of control actions over the whole horizon.

Step 3: only the first control action is performed and the system state is updated upon implementing the required energy transfers.

Step 4: Forecasts are updated and the optimization cycle is repeated from Step 1.
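The four steps above can be sketched for a single BS buffer as follows. This is a simplified, unconstrained stand-in for Eq. (20): the convex program is replaced by a least-squares tracking problem (no buffer or transfer limits), and all numbers (reference level 0.8, net-energy forecast 0.05, horizon of 6 slots) are arbitrary illustrative choices, not the paper's parameters:

```python
import numpy as np

def mpc_step(x0, w_hat, x_ref=0.8, lam=0.1):
    """One receding-horizon step for a scalar buffer model
    x(k+1) = x(k) + u(k) + w(k) (an unconstrained sketch of Eq. (20)).
    Minimizes sum_k (x(k) - x_ref)^2 + lam * u(k)^2 over the horizon
    via least squares, then returns only the first control action."""
    T = len(w_hat)
    L = np.tril(np.ones((T, T)))           # cumulative-sum operator
    base = x0 + L @ w_hat                  # state trajectory when u = 0
    A = np.vstack([L, np.sqrt(lam) * np.eye(T)])
    b = np.concatenate([x_ref - base, np.zeros(T)])
    u = np.linalg.lstsq(A, b, rcond=None)[0]
    return u[0]                            # receding horizon: first action only

# Closed-loop simulation: re-solve at every slot with fresh forecasts.
x, traj = 0.2, []
forecast = 0.05 * np.ones(6)               # hypothetical net-energy forecast
for _ in range(24):
    u = mpc_step(x, forecast)
    x = x + u + 0.05                       # true disturbance realization
    traj.append(x)
# The buffer level settles near the reference threshold 0.8.
```

Applying only the first action and re-solving at the next slot is what makes the scheme robust to forecast errors: each new solve starts from the actually realized state.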
IV-D Energy scheduling for the current time slot
With the MPC of the previous subsection, we compute the amount of energy that each BS should either provide (in case the BS acts as a source) or receive (if the BS is a consumer). Note that the BS role (source or consumer) is itself a result of the optimization process. In this section, we solve the energy allocation problem between energy sources and energy consumers, i.e., we determine which source will transfer energy to which consumer, and in what amount. Note that this also depends on the distribution losses between them and, in turn, on the electrical PPG topology.
Notation: we use indices $i$ and $j$ to denote an arbitrary BS source and an arbitrary BS consumer, respectively. $\mathcal{S}$ and $\mathcal{C}$ are the sets of sources and consumers, respectively. With $E_{ij}$ we denote the energy available from source $i$ to consumer $j$; in matrix notation, $E = [E_{ij}]$. Note that $E_{ij}$ is the energy that would be available at the consumer BS and, in turn, it depends on the energy offered by $i$ and on the distribution losses between $i$ and $j$, i.e., on the total distance that the energy has to travel, see Section III-A. Vector $\mathbf{d}$, with elements $d_j$, represents the energy demand of each consumer $j$. $h_{ij}$ represents the number of hops in the energy routing topology between source $i$ and consumer $j$; in matrix notation, $H = [h_{ij}]$. Also, we assume that all hops have the same physical length. Finally, with $x_{ij}$ we denote the fraction of the available energy that is allocated from source $i$ to consumer $j$; in matrix notation, $X = [x_{ij}]$.
Objective functions: as a first objective, we seek to minimize the difference between the amount of energy offered by the BS sources and that transferred to the BS consumers. This amounts to fulfilling, as much as possible, the consumers' energy demand. At time $k$, the energy that can be drained from a source $i$ is bounded by its surplus. Now, if we consider the generic consumer $j$, the maximum amount of energy that source $i$ can provide to $j$ is $E_{ij}$, which accounts for the attenuation coefficient between $i$ and $j$ due to the power loss. We thus write a first cost function $F_1$ as:
(21) 
where the inner sum represents the total energy delivered to consumer $j$. Due to the existence of a single path between any source and consumer pair, and due to the fact that each power link can only be used for a single transfer operation at a time, a desirable solution shall: i) pick source and consumer pairs in such a way that the physical distance (number of hops $h_{ij}$) between them is minimized, and ii) achieve the best possible match between sources and consumers, i.e., use the source whose available energy is the closest to that required by consumer $j$, so that the provided energy is as close as possible to the demand. If this is infeasible, multiple sources will supply the consumer. By minimizing the following cost function, the number of hops between sources and consumers is kept small and we favor solutions with $x_{ij}$ close to one:
(22) $F_2 = \sum_{i \in \mathcal{S}} \sum_{j \in \mathcal{C}} \exp\!\left(-\frac{x_{ij}}{h_{ij}}\right)$
that is, with this cost function we look for a sparse solution (i.e., a small number of sources with $x_{ij}$ close to one). Note that when $x_{ij} = 1$ and $h_{ij}$ is minimized, the argument $x_{ij}/h_{ij}$ is maximized and the negative exponential is minimized. Also, the exponential function was picked as it is convex, but any convex function decreasing in this argument would serve the same purpose.
Solution through convex optimization: at each time slot $k$, each BS updates its buffer level using either Eq. (1) or Eq. (2) (note that all the involved quantities are known in slot $k$, see Section III). The MPC problem of Section IV-C is solved and, in the current time slot, a BS acts as a source if its control is positive and as a consumer if it is negative. Each source $i$ evaluates $E_{ij}$ for all $j$, and each consumer $j$ sets its energy demand $d_j$. At time $k$, using Eq. (21) and Eq. (22), the following optimization problem is formulated:
(23a) $\min_{X} \; F_1 + \nu\, F_2$
subject to:  (23b) $0 \le x_{ij} \le 1, \quad \forall\, i \in \mathcal{S},\; j \in \mathcal{C}$
(23c) $\sum_{j \in \mathcal{C}} x_{ij} \le 1, \quad \forall\, i \in \mathcal{S}$
where $\nu$ is a weight used to balance the relative importance of the two cost functions. The first constraint expresses the fact that $x_{ij}$ is a fraction of the available energy from source $i$, and the second constraint ensures that the total amount of energy that a given source transfers to the consumers cannot exceed the total amount of energy available at that source.
For any fixed value of $\nu$, Eq. (23) is a convex minimization problem which can be solved through standard techniques. In this paper, we have used the CVX tool [grant2008cvx] to obtain the optimal solution, which dictates the optimal energy fraction $x_{ij}^{\ast}$ to be allocated from any source $i$ to any consumer $j$.
Solution through the Hungarian method: the energy distribution problem from sources to consumers can also be modeled as an assignment problem, where each source $i$ has to be matched with a consumer $j$. This problem can be solved through the Hungarian method [kuhn1955hungarian], an algorithm capable of finding an optimal assignment for a given square cost matrix. An assignment is a set of entry positions in the cost matrix, no two of which lie in the same row or column; the sum of the entries of an assignment is its cost, and an assignment with the smallest possible cost is referred to as optimal. Let $C = [c_{ij}]$ be the cost matrix, where rows and columns respectively correspond to sources and consumers. Hence, $c_{ij}$ is the cost of assigning the $i$-th source to the $j$-th consumer and is obtained as follows:
(24) $c_{ij} = \lambda\, |E_{ij} - d_j| + (1 - \lambda)\, h_{ij}$
where $\lambda \in [0, 1]$: the first term weighs the quality of the match (the deliverable energy $E_{ij}$ should be as close as possible to the demand $d_j$) and the second the quality of the route. To ensure the cost matrix is square, additional rows or columns are added when the number of sources and consumers differs; as typically done, each element in the added rows or columns is set equal to the largest number in the matrix.
The main difference between the optimal solution found by solving the convex optimization problem (Eq. (23)) and that found by the Hungarian method is that the latter returns a one-to-one match between sources and consumers, i.e., each consumer can only be served by a single source. On the other hand, for any given consumer, the convex solution also allows the energy transfer from multiple sources.
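For the assignment formulation, `scipy.optimize.linear_sum_assignment` provides an off-the-shelf Hungarian-method solver. The sketch below builds a cost matrix in the spirit of Eq. (24); the numbers and the weight are invented for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

lam = 0.5                       # match-quality vs. route-quality weight
E = np.array([[0.9, 0.4],       # deliverable energy from source i to consumer j
              [0.5, 0.8]])
d = np.array([0.9, 0.8])        # consumer demands
hops = np.array([[1.0, 3.0],    # hop counts of the (unique) routes
                 [3.0, 1.0]])
cost = lam * np.abs(E - d[None, :]) + (1.0 - lam) * hops

rows, cols = linear_sum_assignment(cost)   # Hungarian method, O(n^3)
# Here the optimum pairs source 0 with consumer 0 and source 1 with
# consumer 1: both matches are exact and use the shortest routes.
```

Note the one-to-one structure of the output: `cols[i]` is the single consumer served by source `i`, in contrast with the fractional allocations of the convex formulation.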
IV-E Energy Routing Strategy
Now, we describe how the energy allocation is implemented over time. The algorithm that follows is executed at the beginning of each time slot, when a new allocation matrix is returned by the solver of Section IV-D. Each hour is further split into a number of mini slots. Given the maximum amount of energy that a power link can carry in a mini slot, the required number of mini slots to deliver a certain amount of energy between a source and a consumer is obtained as the ceiling of the ratio between the energy to deliver and the per-mini-slot link capacity.
Since each power link can only be used for a single energy transfer operation at a time, we propose an algorithm that seeks to minimize the number of mini slots that are used. First of all, an energy route for a source-consumer pair is defined as the collection of intermediate nodes to visit when transferring energy from the source to the consumer. The algorithm proceeds as follows: 1) a route is identified for each source-consumer pair (note that, for the given network topology, this route is unique); 2) the disjoint routes, with no power links in common, are found and are allocated to as many pairs as possible; 3) for each of these pairs, the energy transfer is accomplished using its route for the required number of mini slots; 4) when the transfer for a pair