Energy Cooperation for Sustainable Base Station Deployments: Principles and Algorithms

Energy Cooperation for Sustainable Base Station Deployments: Principles and Algorithms

Ángel Fernández Gambín, Maria Scalabrin, Michele Rossi

Energy Sustainable Mobile Networks via Energy Routing, Learning and Foresighted Optimization

Ángel Fernández Gambín, Maria Scalabrin, Michele Rossi
Abstract

The design of self-sustainable base station (BS) deployments is addressed in this paper: BSs have energy harvesting and storage capabilities, they can use ambient energy to serve the local traffic or store it for later use. A dedicated power packet grid allows energy transfer across BSs, compensating for imbalance in the harvested energy or in the traffic load. Some BSs are offgrid, i.e., they can only use the locally harvested energy and that transferred from other BSs, whereas others are ongrid, i.e., they can also purchase energy from the power grid. Within this setup, an optimization problem is formulated where: energy harvested and traffic processes are estimated at the BSs through Gaussian Processes (GPs), and a Model Predictive Control (MPC) framework is devised for the computation of energy allocation and transfer schedules. Numerical results, obtained using real energy harvesting and traffic profiles, show substantial improvements in terms of energy self-sustainability of the system, outage probability (zero in most cases), and in the amount of energy purchased from the power grid, which is of more than halved with respect to the case where the optimization does not consider GP forecasting and MPC.

Online learning, foresighted optimization, energy harvesting, energy routing, energy self-sustainability, power packet grids, mobile networks.

I Introduction

The massive use of Information and Communications Technologies (ICT) is increasing the amount of energy drained by the telecommunication infrastructure and its footprint on the environment. Forecast values for 2030 are that of the global electricity consumption and of the carbon footprint by human activity will be due to ICT [andrae2015global]. As such, energy efficiency and self-sufficiency are becoming key considerations for any development in the ICT sector.

In this paper, we advocate future networks where small Base Stations (BSs) are densely deployed to offer coverage and high data rates, and energy harvesting hardware (e.g., solar panels and energy storage units) is also installed to power them [zordan2015telecommunications]. These BSs collect energy from the environment, use it to serve their local traffic and transfer it to other BSs to compensate for imbalance in the harvested energy or in the traffic load. Some of the Base Stations are connected to the power grid (referred to as ongrid), whereas the others are offgrid and, as such, rely on either the locally harvested energy or on the energy transferred from other BSs. Since the BSs have a local energy storage, they can accumulate energy when the harvested inflow is abundant. Moreover, some of the surplus energy can be transferred to other BSs to ensure the self-sustainability of the cellular system. Energy transfer is a prime feature of these networks and can be accomplished in two ways: i) through Wireless Power Transfer (WPT) or ii) using a Power Packet Grid (PPG[sugiyama2012packet]. For i), previous studies [Bonati-2017] have shown that its transfer efficiency is too low for it to be a viable solution when distances exceed a few meters, but ii) looks promising. In analogy with communications networks, in a PPG a number of power sources and power consumers exchange power (Direct Current, DC) in the form of “packets”, which flow from sources to consumers thanks to power lines and electronic switches. The energy routing process is controlled by a special entity called the energy router [krylov2010toward]. Following this architecture, a local area packetized power network consisting of a group of energy subscribers and a core energy router is presented in [ma2017optimal], where a strategy to match energy suppliers and consumers is devised.

Within this setting, in this paper the allocation and distribution of energy is performed through the PPG infrastructure, where a centralized energy router is responsible for deciding the power allocation/transfer among BSs over time, referred to here as energy routing. This energy allocation and transfer problem is solved combining Gaussian Processes (GPs), Model Predictive Control (MPC) and Convex Optimization (CO). GPs are utilized to learn the Energy Harvesting (EH) and consumption patterns, which are then fed into MPC and CO techniques to obtain energy distribution schedules across subsequent time slots, solving a finite horizon optimization problem. This framework is designed for online use and combines learning and foresighted optimization. Numerical results, obtained with real-world harvested energy traces and traffic load patterns, show that the proposed approach effectively keeps the outage probability111Computed as the ratio between the number of BSs that are unable to serve the users within range due to energy scarcity, and the total number of BSs. to nearly zero for a wide range of traffic loads and system configurations. Also, the amount of energy purchased from the power grid to operate the mobile network is reduced of more than % with respect to the case where energy schedules are computed solely based on the current network status, i.e., disregarding future energy arrivals and load conditions. The proposed approach extends our previous work in [Gambin2017], adding online learning features and foresighted optimization (via MPC), whose combination is here proven to lead to substantial improvements.

The paper is organized as follows. In Section II, the literature on energy cooperation and the mathematical tools used in this work are presented. Section III describes the network scenario. The overall optimization framework for online energy management is explained in Section IV, where the proposed solutions are also detailed. Routing and scheduling policies are addressed in Section IV-E. The numerical results are presented in section V, whereas final remarks are given in section VI.

Ii Related Work

In this section, we first survey the main literature dealing with energy transfer in mobile networks, and then delve into the description of the mathematical tools that we consider in this paper, discussing their successful use within diverse application domains.

Energy transfer in mobile cellular networks: the concept of energy transfer, also referred to as energy cooperation [chia2014energy, Gurakan2013, Xu2015] or energy exchange [leithon2014energy], is motivated by the fact that the distributed renewable energy generated at the base stations can be leveraged upon through a microgrid connecting the BSs [farooq2017hybrid], with the aim of improving the network self-sustainability, while reducing the cost entailed in purchasing the energy from the main power grid. Since this is a rather new paradigm, only few works dealing with this problem have been published so far. Energy sharing among BSs is investigated in [Gurakan2013] through the analysis of several multiuser network structures. A two-dimensional and directional water-filling-based and offline algorithm is proposed to control the harvested energy flows in time and space, with the objective of maximizing the system throughput. In [Xu2015], the authors introduce a new entity called the aggregator, which mediates between the grid operator and a group of BSs to redistribute the energy flows, reusing the existing power grid infrastructure: one BS injects power into the aggregator and, simultaneously, another one draws power from it. This scheme does not consider the presence of energy storage devices, and for this reason some of the harvested energy can be lost if none of the base stations needs it at a certain time instant. The proposed algorithm tries to jointly optimize the transmit power allocations and the transferred energy, maximizing the sum-rate throughput for all the users. The authors of [huang2017smart] consider BSs with energy harvesting capabilities connected to the power grid as a means to carry out the energy trading. A joint optimization tackling BS operation and power distribution is performed to minimize the on-grid power consumption of the BSs. Wired energy transfer to/from the power distribution network, and a user-BS association scheme based on cell zooming are investigated. The problem is split into two subproblems, which are solved using heuristics. A similar approach is considered in [han2013optimizing], where two problems are solved: the first one consists of optimizing the energy allocation at individual BSs to accommodate for the temporal dynamics of harvested energy and mobile traffic. Considering the spatial diversity of mobile traffic patterns, the second problem is to balance the energy consumption among BSs, by adapting the cell size (radio coverage) to reduce the on-grid energy consumption of the cellular network. Again, the solutions are obtained through heuristic algorithms. Also, base stations cooperate toward the reduction of energy costs, but do not perform any actual energy transfer among them.

A two-cell renewable-energy-powered system is studied in [guo2013base], by maximizing the sum-rate over all users while determining the direction and amount of energy to be transferred between the two BSs. Energy can be transferred across the network either through power lines or wireless transfer and the energy transfer efficiency is taken into account. This resource allocation problem is formulated under a Frequency Division Multiple Access (FDMA) setup and is solved numerically. A low-complexity heuristic approach is also proposed as a practical near-optimal strategy when the transfer efficiency is sufficiently high and the channel gains are similar for all users.

Along the same lines, a two-BS scenario is considered in [chia2014energy], where BSs have hybrid conventional (power grid) and renewable energy sources, limited energy storage capability, and are connected through power lines. The authors study the case where renewable energy and energy demand profiles are deterministic or known ahead of time, and find the optimal energy cooperation policy by solving a linear program. They then consider a more realistic case where the profiles are stochastic and propose an online greedy algorithm. Finally, an intermediate scenario is addressed, where the energy profiles are obtained from a deterministic pattern adding a small amount of random noise at each time step (to model prediction errors). Simulation results are shown for several (online) algorithms, assessing the impact of knowing the energy pattern profiles in advance.

The authors of [farooq2017hybrid] and [farooq2016energy] consider an energy sharing framework for cellular networks that are powered by power grids and possess renewable energy generation capabilities. Energy sharing takes place via physical power lines, as well as through the power grid for virtual energy transportation. Interestingly, the authors investigate the impact of the power line infrastructure topology: agglomerative and divisive hierarchical clustering algorithms are utilized to determine it. Upon establishing the physical connections among BSs, an optimization framework for day-to-day cost optimization is developed for the cases of 1) zero knowledge, 2) perfect knowledge, and 3) partial future knowledge of the renewable energy generation. An optimal energy management strategy to minimize the energy cost incurred by a set of cellular base stations is presented in [leithon2014energy]. There, base stations can exchange energy with the power grid and are equipped with batteries (energy storage) and renewable energy harvesting devices. Simulation results show that a cost reduction can be achieved by increasing the battery capacity of each BS and/or the number of base stations.

On combining pattern learning with multi-step optimization techniques: next, we briefly review the mathematical tools that we use in the present paper, namely, MPC and GPs, touching upon the various application domains where they have been used. MPC has its roots in optimal control theory. The main idea is to use a dynamic model to forecast the system behavior, and exploit the forecast state sequence to obtain the control at the current time. The system usually evolves in slotted time, the control action is obtained by solving, at each time step, a finite horizon optimal control problem where the initial state is the current state of the system. The optimization yields a finite control sequence, and the first control action in this sequence is applied [rawlings2009model]. MPC has the ability to anticipate future events and can take control actions accordingly. It has been widely used in industrial processes, including chemical plants [fang2015nonlinear, eaton1991model, arefi2006nonlinear] and oil refineries [nejadkazemi2016pressure, pavlov2014modelling] and, recently, to balance energy consumption in smart energy grids [halvgaard2016distributed, stadler2016distributed, meng2015cooperation]. Moreover, it has been applied to supply chain management problems, with promising results [tzafestas1992parallel, perea2003model, braun2003model, lin2005predictive].

It is known that using time-series forecasting within an MPC framework can improve the quality of the control actions by providing insight into the future [doganis2008combined]. Over the last decades, numerous forecasting approaches have been developed, including Autoregressive Integrated Moving Average (ARIMA) processes and Artificial Neural Networks (ANNs). ARIMA models (introduced by Box and Jenkins in [box1970]) are known for their prediction accuracy, but their main limitation lies in the assumption that the data follows a linear model. Conversely, ANNs capture non-linear models and, in turn, can be a good alternative to ARIMA [zhang2005neural]. Nonetheless, ANNs give rise to mixed results for purely linear correlation structures. In [zhang2003time, khandelwal2015time], hybrid schemes that combine them are put forward to take advantage of their unique strengths. Experimental results with real-world data indicate that their combined use can improve the prediction accuracy achieved by either of the techniques when used in isolation.

Several authors have proposed the use of non-linear models to build non-linear adaptive controllers. In most applications, however, these non-linearities are unknown, and non-linear parameterization must be used instead. In time-series analysis, where the underlying structure is largely unknown, one of the main challenges is to define an appropriate form of non-linear parameterization for the forecasting model. Some implementations claim to be non-parametric, such as GPss, which can be considered (in some sense) as equivalent to models based on an infinite set of non-linear basis functions [mackay1997gaussian]. The basic idea of GPs is to place a prior distribution directly on the space of functions, without finding an appropriate form of non-linear parameterization for the forecasting model. This can be thought of as a generalization of a Gaussian distribution over functions. Moreover, a Gaussian Process (GP) is completely specified by the mean function and by the covariance function or kernel, which has a particular (but simple) parametric structure, defined through a small number of hyperparameters. The term non-parametric does not mean that there are no parameters, but that the parameters can be conveniently adapted from data. While GPs have been used in time-series forecasting [williams1998prediction], to the best of the authors’ knowledge, [leith2004gaussian] is the first application of GPs to electrical load forecasting [blum2013electricity, taylor2003short, taylor2006comparison, ketter2013power]. The electricity demand is mainly influenced by meteorological conditions and daily seasonality. Nevertheless, forecasting for short-term horizons of about a day is often performed using univariate prediction models, which are considered to be sufficient because the weather tends to change in a smooth fashion, which is reflected in the electricity demand itself. Also, in a real-worldonline forecasting scenario, multivariate modeling is usually considered impractical [taylor2003]. Due to daily seasonality, we can say that the electrical load data bears some similarities with the time series that we consider in this paper, i.e., the harvested energy profile of Section III-B and the traffic load of Section III-C.

The idea of combining MPC and GP was first proposed in [kocijan2003predictive]. Other practical implementations can be found in application domains such as greenhouse temperature control systems [pawlowski2011predictive], gas-liquid separation plant control systems [likar2007predictive], combustion power plants control systems [grancharova2008explicit] and in a number of other cases [palm2007multiple, ko2007gaussian, maciejowski2013fault, wang2014gaussian]. To the best of our knowledge, the present work is the first where MPC and GP are combined to control an energy harvesting mobile network.

Iii System Model

We consider a mobile network comprising a set of BSs, each with energy harvesting capabilities, i.e., a solar panel, an energy conversion module and an energy storage device. Some of the BSs are ongrid (termed ongrid BSs, being part of set ) and, in turn, can obtain energy from the power grid. The remaining BSs are offgrid (set ). The proposed optimization process evolves in slotted time , where the slot duration corresponds to the time granularity of the control and can be changed without requiring any modifications to the following algorithms.

Iii-a Power Packet Grids

A PPG is utilized to distribute energy among the BSs. The grid architecture is similar to that of a multi-hop network, see Fig. 1, where circles are BSs and the square is the energy router, which is in charge of energy routing decisions and power allocation. As assumed in [ma2017optimal], BSs are connected through Direct Current (DC) power links (electric wires) and the transmission of energy over them is operated in a Time Division Multiplexing (TDM) fashion. Energy transfer occurs by first establishing an energy route, which corresponds to a sequence of power links between the energy source and the destination. Each power link can only be used for a single transfer operation at a time. Power distribution losses along the power links follow a linear function of the distance between the source and the destination [ma2017optimal]. They depend on the resistance of the considered transmission medium and are defined by [von2006electric]: , where is the resistivity of the wire in , is the length of the power link in meters, and is the cross-sectional area of the cable in . In this paper, we consider a PPG with a single energy router in the center of the topology. A number of trees originates from the router and, without loss of generality, each hop is assumed to have the same length , i.e., the same power loss.

Fig. 1: Power packet grid topology example.

Iii-B Harvested Energy Profiles

Solar energy generation traces have been obtained using the SolarStat tool [miozzo2014solarstat]. For the solar modules, the commercially available Panasonic N235B photovoltaic technology is considered. Each solar panel has solar cells, leading to a panel area of , which is deemed practical for installation in a urban environment, e.g., on light-poles. As discussed in [miozzo2014solarstat, zordan2015telecommunications], the EH inflow is generally bell-shaped with a peak around mid-day, whereas the energy harvested during the night is negligible. Here, the framework in [miozzo2014solarstat] is utilized to obtain the amount of energy harvested for each BS in time slot , which is denoted by .

Iii-C Traffic Load and Power Consumption

Traffic load traces have been obtained using real mobile data from the Big Data Challenge organized by Telecom Italia Mobile (TIM[bigdata2015tim]. The dataset is the result of a computation over Call Detail Records (CDRs), logging the user activity within the TIM cellular network for the city of Milan during the months of November and December 2013. For the traffic load traces we use the CDRs related to SMS, calls and Internet activities, performing spatial and temporal aggregation. In this way, we obtain a daily traffic load profile for each BS.

Fig. 2: Load pattern profiles (two classes).

Clustering techniques have been applied to the dataset to understand the behavior of the mobile data. To this end, we use DBSCAN unsupervised clustering [ester1996density] to classify the load profiles into several categories. In Fig. 2, we show the typical traffic behavior of two clusters, corresponding to the heaviest (cluster 1) and lightest (cluster 2) daily load. As noted in previous works, the traffic is time-correlated (and daily periodic) [zordan2015telecommunications, auer2010d2]. In our numerical results, each BS has an associated load profile, which is picked at random as one of the two clusters in Fig. 2. Depending on the cluster association probabilities, there is some imbalance in the load distribution across BSs that, as we discuss shortly, plays a key role in the performance of energy transfer algorithms. Given the traffic load profile , intended as the percentage of the total bandwidth that a BS allocates to serve the users in its radio cell, the BS energy consumption (energy outflow), referred to in the following as , is computed through the linear model in [zordan2015telecommunications] (see Eq. (1) in that paper).

Iii-D Energy Storage Units

Energy storage units are interchangeably referred to as Energy Buffers. The EB level for BS is denoted by and three thresholds are defined: , and , respectively termed the upper, reference and lower energy threshold, with ( is the EB capacity). corresponds to the desired EB level and is the lowest energy level that any BS should ever reach. Both variables are used in the optimization of Section IV-C. For an offgrid BS, i.e., , if is the current time slot, the buffer level process is updated at the beginning of the next time slot as:

(1)

where is the amount of energy transferred to/from BS in time slot , which is positive if BS is a consumer or negative if BS acts as an energy source. is the EB level at the beginning of time slot , whereas , are the amount of energy harvested and the energy that is locally drained (to support the local data traffic), respectively. The energy level of an ongrid BS is updated as:

(2)

where represents the energy purchased by BS from the power grid during time slot . The behavior of a BS (i.e., and ) depends on its EB level. If the BS behaves as an energy source, it is eligible for transferring a certain amount of energy to other BSs. In this work, we assume that if the total energy in the buffer at the beginning of the current time slot is and the BS is ongrid, then the difference is purchased from the power grid in slot , as an ongrid BS should always be a source, i.e., in the position of transferring energy to other BSs. If instead the BS behaves as an energy consumer, it demands energy to the sources. For example, the energy demand in time slot may be set to , so that the EB level would ideally become no smaller than the reference threshold by the end of the current time slot . Note that, this can only be strictly guaranteed if . However, is updated at the beginning of time slot , whereas and are only known at the end of it. The theory of Sections IV-B and IV-C allows computing , accounting for the expected behavior to get more accurate results, where is the expectation operator.

Fig. 3: Overview of the optimization framework.

Iv Optimization for online energy management

In this section, we devise an online optimal power allocation strategy, whose objective is to make the offgrid BSs as energy self-sustainable as possible. This is achieved by transferring some amount of energy from rich energy BSs (energy sources) to those base stations that require energy (energy consumers).

Iv-a Overview of the optimization framework

A diagram of the optimization process is shown in Fig. 3, involving 1) pattern learning (forecasting), 2) model predictive control, 3) convex optimization and 4) energy routing. These algorithms are all executed at runtime. First of all, the harvested energy and traffic load processes are statistically modeled through Bayesian non-parametric tools (“pattern learning” in Fig. 3), as we detail in Section IV-B. This first step allows each BS to independently track its own energy and load processes, capturing their statistical behavior and obtaining multi-step ahead forecasts for the corresponding time series. It is worth noting that our forecasting method is agnostic to the type of signal, and for this reason can be promptly extended to other processes, if need be. These forecasts are then fed into the foresighted optimization approach of Section IV-C. Their use allows for informed decisions, which take the future system evolution into account. This results in effective energy allocation strategies, which lead to a reduction of the amount of energy that has to be purchased from the power grid.

The second step in the optimization framework is the adaptive control based on MPC theory. Its main goal is to determine the BS role (energy source or consumer), and obtain the amount of energy that each BS has to either transfer or require from the sources. The MPC block takes online actions, considering not only the current system state, i.e., traffic load, harvested energy and EB levels, but also future ones (based on the forecasts from the previous block), anticipating events and acting accordingly. This is a main difference with respect to the work in [Gambin2017], where BS energy roles are solely determined on the current EB level. The MPC block is described in Section IV-C.

The actual energy allocation is evaluated in the third optimization step. This block computes how energy (obtained through MPC) has to be redistributed among BSs, matching energy sources and energy consumers. Two approaches are proposed to this end (see Section IV-D): one based on convex optimization and another one formulated as an assignment problem and solved through the Hungarian method [kuhn1955hungarian]. Their objective is to reduce as much as possible the outage probability, i.e., the ratio between the number of BSs that are unable to serve their load due to energy scarcity, and the total number of BSs in the network, while maximizing the energy transfer efficiency.

Finally, the last step is to perform the energy exchange (“energy routing”) among the BSs. Since the PPG is operated in a TDM fashion, each power link can only be used for a single trading operation at a time. Hence, the proposed routing strategy seeks to find disjoint routes between energy sources and consumers, while minimizing the time needed to perform the energy transfer. Details are provided in Section IV-E.

Iv-B Pattern learning through Bayesian non-parametric models

Definition Variable name
Base station set
Ongrid base station set
Offgrid base station set
Number of base stations
Harvested energy profile in slot
Traffic load profile in slot
BS energy consumption in slot
Energy buffer level in slot
Maximum energy buffer capacity
Upper, lower and reference buffer thresholds , ,
Transferred energy in slot
Purchased grid energy in slot
Used by the pattern analysis block
Number of observations (training dataset)
Number of observations (test set)
The transpose of vector
The weights of the Bayesian linear model
The function value
The observed real value
-dimensional column vector of targets
Map in the feature space
Training dataset
-dimensional input column vector
-dimensional input column vector (test set)
-dimensional feature vector
matrix of inputs
matrix of inputs in the test set
matrix in the feature space
Gaussian dist. with zero mean and variance
Covariance matrix of the model parameters
Gaussian process
Gaussian process: mean function
Gaussian process: covariance function (kernel)
Gaussian process: predictive mean vector
Gaussian process: predictive covariance matrix
covariance matrix (training dataset)
identity matrix
Function values (training dataset)
Function values (test set)
Used by the optimization block
Optimization horizon (time steps)
System state matrix for MPC
Control matrix for MPC
System disturbances matrix for MPC
Weight parameter for MPC
Set of energy sources
Set of energy consumers
Energy allocation matrix
Energy availability matrix
Maximum transmission energy capacity
Energy demand vector
Number of hops matrix
Weight parameter for energy allocation
Cost matrix for Hungarian method
TABLE 1: List of symbols used in the paper.

In this section, we are concerned with statistical models to automatically capture the hidden structure of the observations in a training dataset. Bayesian non-parametric models, such as GPs, can represent our beliefs about the model parameters via a probability distribution (called the prior). Then, Bayesian inference can reshape the prior distribution, transforming it into a posterior one. GPs have become popular for regression and classification, often showing impressive empirical performance [Rasmussen_2006]. However, while the outputs for a classification task are discrete class labels, in a regression task the outputs (or targets) are real values. Here, we use GPs for our regression task. According to [Rasmussen_2006], there are two equivalent views to treat GPs within a regression problem: 1) the weight-space view and 2) the function-space view.

1) The weight-space view. The Bayesian linear model for regression is defined as:

(3)

where is a vector of weights, also known as model parameters, is the function value, which is linear in the weights , is the observed real value, and maps the -dimensional input column vector into an -dimensional feature vector . Assume we are given with a training dataset with observations, , where each pair consists of the -dimensional input column vector and the scalar target . We can aggregate inputs and targets in a matrix and an -dimensional column vector , so that , and becomes an matrix in the feature space. We are interested in the conditional distribution of the targets, given the inputs in the feature space and the model parameters. We further assume that differs from by additive noise, and this noise follows an independent identically distributed (i.i.d.) Gaussian distribution with zero mean and variance , i.e., . From the i.i.d. assumption, it follows that the likelihood (i.e., the conditional distribution of the targets given the inputs in the feature space and the model parameters) is factorized over cases for the observations, i.e., .

According to the standard formalism of Bayesian inference, the prior distribution encodes our beliefs about the model parameters, before we access the observations. Conversely, the likelihood gives us insights about the model parameters, thanks to the evidence of the observations in a training dataset. For the case simple Bayesian linear model of Eq. (3), the prior distribution is set to be Gaussian with zero mean and covariance matrix , i.e., . Then, the posterior distribution combines the likelihood and the prior distribution, and expresses our full knowledge about the model parameters, after we access the observations. The posterior distribution is derived via the Bayes’ rule as:

(4)

To make prediction for the test case given , we average over all possible model parameters’ choices, weighted by the posterior distribution, i.e.,

(5)

where and . Here, note that it is more convenient to invert the matrix than the matrix , whenever . Furthermore, the feature space enters in the form , where vectors and are either in the test or in the training dataset. At this point, let us define as the covariance function or kernel. Specifically, represents an inner product with respect to , equivalent to when and is such that (since is positive definite). We can conclude that: if an algorithm is defined in terms of inner products in the input space, then it can be lifted into the feature space by replacing occurrences of inner products by the kernel, whenever it is more convenient to compute the kernel than the feature vectors themselves; this is also known as the kernel trick [Rasmussen_2006].

2) The function-space view. We can have exact correspondence with the weight-space view by using a GP modeling a distribution over functions. Formally: a GP is a collection of random variables, any finite number of which have a joint Gaussian distribution. Moreover, it is completely specified by the mean function and the covariance function (or kernel). We define the mean function and the covariance function of process as

(6)

Next, we consider the zero mean function, i.e., , which is a very typical choice in the GP literature [Rasmussen_2006]. In the Bayesian linear model of Eq. (3), the prior distribution is set to be Gaussian with zero mean and covariance matrix , i.e., . Thus, we can derive an example GP as:

(7)

Assume the training dataset has observations, then vector has a joint Gaussian distribution, i.e., , where the covariance matrix can be computed evaluating the covariance function or kernel for the observations, i.e., for . Given the noise , it follows from the i.i.d. assumption that a diagonal matrix must be added to , as compared to the noise-free model in the GP literature [Rasmussen_2006]. To make prediction for the test case given , we consider the joint Gaussian prior distribution over functions

(8)

where we define the -dimensional column vector such that the -th element is equal to . To derive the posterior distribution over functions we need to condition the joint Gaussian prior distribution over functions on the data, so that we get the key predictive equations of GPs for regression:

(9)

In practice, the predictive mean is used as a point estimate for the function output, while the variance can be translated into uncertainty bounds (predictive error-bars) on this point estimate, thus making GPs for regression very appealing for MPC applications (see [murray1999transient, leith2000nonlinear, solak2003derivative, kocijan2003predictive]).

For any set of basis functions in the feature space, we can compute the corresponding covariance function or kernel; conversely, for every (positive definite) covariance function or kernel, there exists a (possibly infinite) expansion in terms of basis functions in the feature space. As we show shortly, the choice of the kernel deeply affects the performance of a GP for a given task, as much as the choice of the parameters (architecture, activation functions, learning rate, etc.) does for a neural network. Specifically, the hyperparameters of the kernel must be set in order to optimize the marginal likelihood, which is defined as follows:

(10)

Under the Gaussian assumption, the prior distribution is Gaussian, , and the likelihood is a factorized Gaussian, , thus . Extensive derivation for the formulation of and generalization to more that one test case can be found in [Rasmussen_2006].

Suppose we have observations in the test set, i.e., , to make prediction for the test cases given , we consider the joint Gaussian prior distribution over functions

(11)

where we define the matrix similarly to , such that for , , and is a column vector in . Finally, we define the matrix similarly to , such that for , thus we get the key predictive equations of GPs for regression: {mdframed}

(12)

The choice of the kernel: this choice deeply affects the performance of a GP for a given task, as it encodes the similarity between pairs of outputs in the function domain. There has been significant work on constructing base and composite kernels [duvenaud2013structure]. Common base kernels include the Squared Exponential (SE) kernel, the Rational Quadratic (RQ) kernel, and the Standard Periodic (SP) kernel, defined as:

(13)

The properties of the functions under a GP with a SE kernel can display long range trends, where the length-scale determines how quickly a process varies with the inputs. The RQ kernel is derived as a scale mixture of SE kernels with different length-scales. The SP kernel is derived by mapping the two dimensional variable through the SE kernel. Derivations for the RQ and SP kernels are in [Rasmussen_2006].

1:Pre-training phase: find the optimal hyperparameters for the kernel , starting from and minimizing the marginal likelihood on the training dataset
2:Set
3:while  do
4:     Set
5:     Set
6:     Training phase: find the optimal hyperparameters for the kernel , starting from and minimizing the marginal likelihood on the training dataset
7:     Get via Eq. (12) given the test set
8:     Compute
9:     Set
Algorithm 1 Pseudo-code for the basic routine

Note that valid kernels (i.e., those having a positive-definite covariance function) are closed under the operators and . This allows one to create more representative (and composite) kernels from well-understood basic components, according to the following key rules [duvenaud2013structure]:

  • Any subexpression222Subexpression refers to any valid kernel family, either basic or composite. can be replaced with , where is any base kernel family.

  • Any subexpression can be replaced with , where is any base kernel family.

  • Any base kernel can be replaced with any other base kernel family .

In time series, summing kernels can express superpositions of different processes, operating at different scales, whereas multiplying kernels may be a way of converting global data properties onto local data properties. From here on, we will use one-dimensional kernels in the form with period , which correspond to a local quasi-periodic structure in the data, with noise operating at different scales. Note that kernels over multidimensional inputs can be constructed via the operators and over individual dimensions. Next, we consider models based on zero-mean GPs for the runtime multi-step ahead forecasting of time series, with application to a) Harvested Energy Profile  (defined in Section III-B) and b) Traffic Load  (Section III-C).

The basic routine for prediction: we use models based on zero-mean GPs for the runtime forecasting of time series, with application to and , . The strong daily seasonality of the data is evident for both time series, as well as the presence of noise at different scales. Therefore, we define composite kernels for and in the form with period , i.e., {mdframed}

(14)

where and is the Euclidean distance between inputs. At this point, the hyperparameters of the kernel must be set in order to optimize the marginal likelihood, which is defined in Eq. (10), and here implemented using the toolbox of [toolbox]. For compactness, we aggregate the hyperparameters of the kernel in the initialization set . Here, we opt for , , and select the free parameters (, , ) via a grid search, scanning combinations in the range . To model the strong daily seasonality in the data, we also opt for a prior distribution on the period , which is a delta function, i.e., if and only if , so that we treat the period as a constant, excluding it from the optimization (see [toolbox]).

0.0119 0.0170 0.0385 0.0512
0.0116 0.0166 0.0383 0.0511
a Average for .
0.0389 0.0464 0.0670 0.0740
0.0415 0.0483 0.0671 0.0743
b Average for .
TABLE 2: Average .

Algorithm 1 describes the basic routine for the pre-training, training and forecasting phases for both zero-mean GPs, i.e., the same basic reasoning holds for and , where refers to time and refers to either or , at time . Also, we assume that we can access the values in the training dataset, and we wish to predict the values in the test set, where refers to the training dataset and refers to the test set, at time , respectively. According to the pre-training phase, we first have to find the optimal hyperparameters for the kernel , starting from and minimizing the marginal likelihood on the training dataset , where we set . Note that will serve as initialization for the optimal hyperparameters at each step of the online forecasting routine, as the optimal hyperparameters are found over the training dataset , which changes at each step of the online forecasting routine. Assuming Gaussian noise with variance , thus Gaussian likelihood, it follows that we can perform exact inference. To do it, we use the Conjugate Gradients (CG) optimization tool implemented in toolbox [toolbox]. We get via Eq. (12) given the test set with test cases, at time . Finally, we derive the Root Mean Square Error (RMSE) over the test cases, starting from residuals , at time , and iterating the procedure (except for the pre-training phase) up to time . For the numerical results, the training phase is performed once every steps.

Numerical results: now, we assess the proposed scheme for the runtime multi-step ahead forecasting of time series and , where we track over the test cases, given . Here, we set the time step to one hour,  hours (i.e., two weeks of data),  hours (i.e., two months of data), , and we recall that . This choice of parameters is valid for both time series, as well as the use of the kernel in Eq. (14), whereas the hyperparameters differ, depending on the nature of data.

In Table Ia and Table Ib we show the average RMSE for and , computed evaluating the mean of the RMSE measures up to time , where we track over the test cases, given . Also, as we perform the training phase once every steps, we compare the numerical results when and , i.e., when we re-optimize the free-parameters at each step of the online forecasting routine, or just once every steps, at time . In general, the average decreases as we increase the test cases up to , which corresponds to one day into the future. However, the worst performance is , which is still rather small if we consider that both time series are normalized in prior to processing. Also, predictions for (Fig. 3(a)) are more precise than predictions for (Fig. 3(b)), and this is due to the nature of the data, given that we use the same kernel for both time series. In fact, values in (Fig. 3(a)) follow a more regular behavior than those in (Fig. 3(b)), with quasi-periodic streams of zero values corresponding to zero solar energy income during the night. These quasi-periodic streams of zero values help reinforcing prediction, while allowing for a higher confidence at nighttime (see Fig. 4(a)). Finally, tuning parameter explains the impact of re-optimizing the hyperparameters according to the most recent history (i.e., two weeks of data), but with a longer execution time. Numerical results suggest that tuning parameter could be reasonable when data exhibit multiple strong local behaviors rather than just a strong daily seasonality, and the kernel has to adapt to these. However, could not be the obvious, optimal choice (see Table Ia).

(a) One-step predictive mean value for .
(b) One-step predictive mean value for .
Fig. 4: One-step online forecasting for one-month of data.

In Fig. 3(a) and Fig. 3(b) we show real values and predictions for two weeks of data, where we track the one-step predictive mean value at each step of the online forecasting routine. The strong daily seasonality is evident, as well as the quasi-periodic structure in data with noise operating at different scales. Note that predictions for (Fig. 3(a)) are more accurate than those for (Fig. 3(b)), and this result can be confirmed by comparing the average in Tables Ia and Ib for . However, predictions are still quite far from real values when some unusual events occur, see, for example, the low solar energy income within hours and (sixth peak from the left), in Fig. 3(a), or the sudden peaks in the traffic load profile of Fig. 3(b), which are very day-specific.

(a) Multi-step predictive mean value for .
(b) Multi-step predictive mean value for .
Fig. 5: Multi-step prediction with different kernels.

In Figs. 4(a) and 4(b) we show real and predicted values for three days of data, i.e., the last two days of the training dataset, and hours for the test set, plotting the multi-step predictive mean value with . Here, we compare the use of the kernel in Eq. (14) with common base kernels from the literature, such as the popular Squared Exponential (SE) kernel, the Rational Quadratic (RQ) kernel, and the Standard Periodic (SP) kernel, see Eq. (14). Also, we compare the use of the kernel in Eq. (14) in terms of generalization capabilities over the training dataset and the test set, i.e., we perform forecasting over the training dataset and the test set, after the optimization of the hyperparameters given the observations. Note that the proposed kernel (solid line) shows the best performance in terms of forecasting, since composite kernels are more representative than base ones. Specifically, the RMSE is close to zero over the training dataset (due to the fact that we set , i.e., ), and this result also holds for both the SE and RQ cases. However, the generalization capabilities over the test set are quite limited for SE and RQ. In fact, these base kernels have limited expressive power, and simply act like smoothers. Finally, the SP kernel succeeds in recovering the strong daily seasonality in the data, but it fails to model noise at different scales. Again, its expressive power is quite limited, with respect to our proposed kernel in Eq. (14).

Iv-C Predictive and time-adaptive energy allocation

In general, an MPC framework can be divided into three blocks: i) inputs, ii) MPC controller and iii) real system [maciejowski2013fault]. The first block is the prediction model (see Section IV-B). The MPC solves a control problem at runtime (see below). Finally, the real system block receives the optimal actions from the MPC controller and behaves accordingly.

Notation: using a standard notation, the system to be controlled is described by means of a discrete-time model:

(15)

where is the current time slot. The matrix with elements denotes the system state, representing for each BS the energy buffer level for time slots , were is the optimization horizon (we use  hours), which corresponds to in the previous subsection. The matrix with elements denotes the control matrix, representing the amount of energy that each BS shall either transfer or receive (depending on the sign of ) in time slot . The matrix models the system disturbances, i.e., the stochastic behavior of the forecast profiles (harvested and consumed energy), with:

(16)

where and contain the mean and variance of the forecast estimates, respectively. Eqs. (15) and (16) relate to the problem setup of Section III-D as follows: symbol contains the buffer state for all BSs, i.e., , is the control, which corresponds to the amount of energy to transfer, i.e., , and contains the exogenous processes, i.e., . Note that processes and are statistically characterized through the prediction framework of Section IV-B, and their difference is still a Gaussian r.v. (in fact, is derived from through a linear model, and as such is still Gaussian distributed). Following [wang2016stochastic], due to the stochastic nature of Eq. (16), the system state should also be written in a probabilistic way:

(17)

where and are the mean and the variance of , respectively.

Objective functions: the goal of the MPC controller is to determine the amount that each BS should either transfer or receive in time slots . If , BS acts as a source in slot , whereas if it acts as an energy consumer. A first cost function tracks the total amount of energy that is exchanged among BSs:

(18)

For the second objective, we use the reference threshold , see Section III-D. The MPC controller tries to increase the BS energy buffer levels, while avoiding that only a few of the consumers receive energy. To achieve this, a second cost function weighs how close to the buffers get:

(19)

where is the energy buffer level of BS in time slot .

Control problem: once the inputs are stated, a finite-horizon multi-objective optimization problem is formulated:

{mdframed}

[innertopmargin=0pt]

(20a)
subject to: (20b)
(20c)
(20d)
(20e)

where is a weight to balance the relative importance of the two cost functions. and are the energy buffer limitations defined in Section III-D. Finally, the fourth constraint defines the amount of energy that each BS can exchange and depends on the system state, i.e., the energy buffer level, expected harvested energy and expected traffic load. The system state dictates the limits of the control action.

For any fixed value of , and since the optimization problem must be solved at runtime, it is strongly preferable to choose a convex optimization formulation such as Eq. (20), which can be solved through standard techniques. Here, we have used the CVX tool [grant2008cvx] to obtain the optimal solution , which dictates the optimal amount of energy that each BS shall either provide or receive in time slot .

Optimization algorithm: the MPC controller performs as follows [zhang2017optimal]:

  1. Step 1: at the beginning of time slot , the system state is obtained, that is energy buffer levels for all BSs, the harvested energy and traffic load forecasts for the next hours (the optimization horizon).

  2. Step 2: the control problem in Eq. (20) is solved yielding a sequence of control actions over the horizon .

  3. Step 3: only the first control action is performed and the system state is updated upon implementing the required energy transfers.

  4. Step 4: Forecasts are updated and the optimization cycle is repeated from Step 1.

Iv-D Energy scheduling for the current time slot

With the MPC of the previous subsection, we compute the amount of energy that each BS should either provide (in case the BS acts as a source) or receive (if the BS is a consumer). Note that the BS role (source or consumer) is also a result of the optimization process. In this section, we solve the energy allocation problem between energy sources and energy consumers, i.e., which source will transfer energy to which consumer and in which amount. Note that this also depends on the distribution losses between them and, in turn, on the electrical PPG topology.

Notation: we use indices and to respectively denote an arbitrary BS source and an arbitrary BS consumer. and are the set of sources and consumers, respectively. With we mean the energy available from the energy source to the energy consumer , in matrix notation we have . Note that is the energy that would be available at the consumer BS and, in turn, it depends on , and on the distribution losses between them, i.e., on the total distance that the energy has to travel, see Section III-A. Vector , with elements , represents the energy demand from each consumer . represents the number of hops in the energy routing topology between source and consumer , in matrix notation we have . Also, we assume that all hops have the same physical length. Finally, with we mean the fraction of that is allocated from source to consumer , in matrix notation .

Objective functions: as a first objective, we seek to minimize the difference between the amount of energy offered by the BS sources and that transferred to the BS consumers . This amounts to fulfill, as much as possible, the consumers’ energy demand. At time , the energy that can be drained from a source is . Now, if we consider the generic consumer , the maximum amount of energy that can provide to is , where is the attenuation coefficient between and , due to the power loss. We thus write a first cost function as:

(21)

where and . Due to the existence of a single path between any source and consumer pair and due to the fact that each power link can only be used for a single transfer operation at a time, a desirable solution shall: i) pick source and consumer pairs in such a way that the physical distance () between them is minimized and ii) achieve the best possible match between sources and consumers, i.e., use source , whose available energy is the closest to that required by consumer . In other words, we would like to be as close as possible to . If this is infeasible, multiple sources will supply the consumer. Minimizing the following cost function, the number of hops between sources and consumers is kept small and we favor solutions with :

(22)

that is, with this cost function we are looking for a sparse solution (i.e., a small number of sources with close to ). Note that when and is minimized, the argument is maximized and the negative exponential is minimized. Also, the exponential function was picked as it is convex, but any increasing and convex function would do.

Solution through convex optimization: at each time slot , each BS updates its buffer level , using either Eq. (1) or Eq. (2) (note that , , , and are all known in slot , see Section III). The MPC problem of Section IV-C is solved, and in the current time slot , BS acts as a source if and as a consumer if . Each source evaluates for all through and each consumer sets its energy demand as . At time , using Eq. (21) and Eq. (22), the following optimization problem is formulated: {mdframed}

(23a)
subject to: (23b)
(23c)

where is a weight used to balance the relative importance of the two cost functions. The first constraint represents the fact that is a fraction of the available energy from source , and the second constraint means that the total amount of energy that a certain source transfers to consumers cannot exceed the total amount of available energy at this source.

For any fixed value of , Eq. (23) is a convex minimization problem which can be solved through standard techniques. In this paper, we have used the CVX tool [grant2008cvx] to obtain the optimal solution , which dictates the optimal energy fraction to be allocated from any source to any consumer .

Solution through the Hungarian method: the energy distribution problem from sources to consumers can also be modeled as an assignment problem, where each source has to be matched with a consumer . This approach can be solved through the Hungarian method [kuhn1955hungarian], an algorithm capable of finding an optimal assignment for a given square cost matrix, where . An assignment is a set of entry positions in the cost matrix, no two of which lie in the same row or column. The sum of the entries of an assignment is its cost. An assignment with the smallest possible cost is referred to as optimal. Let be the cost matrix, where rows and columns respectively correspond to sources and consumers . Hence, is the cost of assigning the -th source to the -th consumer and is obtained as follows:

(24)

where , the first term weighs the quality of the match ( should be as close as possible to ) and the second the quality of the route. To ensure the cost matrix is square, additional rows or columns are to be added when the number of sources and consumers differs. As typically assumed, each element in the added row or column is set equal to the largest number in the matrix.

The main difference between the optimal solution found by solving the convex optimization problem (Eq. (23)) and that found by the Hungarian method is that the latter returns a one-to-one match between sources and consumers, i.e., each consumer can only be served by a single source. On the other hand, for any given consumer the convex solution also allows the energy transfer from multiple sources.

Iv-E Energy Routing Strategy

Now, we describe how the energy allocation is implemented over time. The algorithm that follows is executed at the beginning of each time slot, when a new allocation matrix is returned by the solver of Section IV-D. Each hour is further split into a number of mini slots. Given a certain maximum transmission energy capacity for a power link in a mini slot, the required number of mini slots to deliver a certain amount of power between source and consumer is obtained as .

Since each power link can only be used for a single energy transfer operation at a time, we propose an algorithm that seeks to minimize the number of mini slots that are used. First of all, an energy route for the source-consumer pair is defined as the collection of intermediate nodes to visit when transferring energy from to . The algorithm proceeds as follows: 1) a route is identified for each source and consumer (note that for the given network topology this route is unique), 2) the disjoint routes, with no power links in common, are found and are allocated to as many pairs as possible, 3) for each of these pairs , the energy transfer is accomplished using route for a number of mini slots , 4) when the transfer for a pair