Model-Free Control of Thermostatically Controlled Loads Connected to a District Heating Network

Bert J. Claessens (1), D. Vanhoudt, J. Desmedt, F. Ruelens
Energy department of the Research Institute VITO, 2400 Mol, Belgium
Department of Electrical Engineering of KU Leuven, Kasteelpark Arenberg 10, bus 2445, 3001 Leuven, Belgium
EnergyVille, Thor Park 8310, 3600 Genk, Belgium
(1) Bert J. Claessens is currently working at REstore.

Optimal control of thermostatically controlled loads connected to a district heating network is considered a sequential decision-making problem under uncertainty. The practicality of a direct model-based approach is compromised by two challenges, namely scalability due to the large dimensionality of the problem and the system identification required to identify an accurate model. To help mitigate these problems, this paper leverages recent developments in reinforcement learning in combination with a market-based multi-agent system to obtain a scalable solution that achieves a significant performance improvement in a practical learning time. The control approach is applied to a scenario comprising 100 thermostatically controlled loads connected to a radial district heating network supplied by a central combined heat and power plant. Both for an energy arbitrage and a peak shaving objective, the control approach requires 60 days to obtain a performance within 65% of a theoretical lower bound on the cost.

District heating, combined heat and power, reinforcement learning, thermostatically controlled loads.
journal: Energy and Buildings

1 Introduction

A District Heating Network (DHN) offers the opportunity to provide the collective heat demand of a cluster of geographically concentrated buildings through a set of central heat sources. This allows the use of centralized production techniques with an efficiency exceeding that of distributed production. Combined Heat and Power plants (CHPs) are a prominent example, as 80-90% of the primary energy is converted to heat and electricity ChristianThesis (); Kitapbayev2015823 (); Sartor2014474 (). But also heat from a geothermal source GDHN () or excess heat resulting from an industrial process can be used as primary heat source. The heat from the sources is transported through a network of pipes using water as a medium. At each building, heat is extracted in a local substation, resulting in water at a lower temperature, being transported back to the different heat sources. A typical operational model at the production side is to modulate the power of the heat sources to keep the supply temperature close to a design setting. This basically results in the thermal supply following the thermal demand. Heat storage however, can provide demand flexibility enabling flexibility at the production side through demand response approaches. This flexibility allows operational opportunities for cost reduction, examples being peak shaving/valley filling ClaessensSelfLearning () and energy arbitrage by selling the electricity production of the CHP on the wholesale market ChristianThesis (); Kitapbayev2015823 (); de2014trading (). Well referenced embodiments of local heat storage are Thermostatically Controlled Loads (TCLs) KochThesis () such as a hot water storage tank Vanthournout () where the heat is stored directly in the water, but also the building envelope ChristianThesis (); KochThesis (); Verbeeck (); Kensby2015773 () can be used to store heat.
From an operational point of view, controlling a cluster of TCLs connected to a DHN can be considered as a sequential decision-making problem under uncertainty. One well studied control paradigm for operational management of a DHN is that of Model Predictive Control (MPC) SandouMPCDHN (); Široký20113079 (). When projected on the setting of TCLs connected to a DHN, this requires defining control actions for the central sources as well as for all individual TCLs. Developing a practical implementation requires one to tackle the problem of scalability, as the state dimensionality and number of control variables quickly result in an intractable optimization problem. This is complicated further by non-linear system dynamics. A second important challenge is that of system identification mathieu2013energy () as identifying an accurate model of both the DHN and all TCLs requires significant amounts of not readily available data and expert knowledge.
This work contributes to mitigating the operational control challenges for TCLs connected to a DHN by addressing these two problems.
Scalability: To obtain scalability, a heuristic dispatch approach as described in Stijn () is applied to the setting of TCLs connected to a DHN. Instead of calculating an individual control action for each TCL, this approach calculates a collective control action for the entire cluster of TCLs. A market-based dispatch algorithm is used to translate the collective control action into individual control actions.
System identification: Driven by recent developments in Batch Reinforcement Learning (BRL) ReinforcementLearning (); atariRL (), a blind model-free approach is considered. As a BRL technique needs no prior information on the system dynamics, this strongly relaxes the system identification requirements, at the cost of a learning time and sub-optimal performance.
In Section 2, an overview is given of related research regarding the control of large clusters of TCLs and model-based controllers for DHNs. In Section 3, the decision-making problem is formalized as a Markov Decision Process (MDP). In Section 4, the control approach used in this work is described in detail. In Section 5, an evaluation of the controller performance is provided based upon a simulation scenario comprising 100 TCLs connected to a DHN. The simulation scenario is sufficiently complex to evaluate the contributions of the control approach, but also comprehensible enough to allow for an analysis which is not obfuscated by the complexity of the scenario. Finally, in Section 6, the conclusions are provided, as is a discussion of the results.

2 Related Work

In this section a non-exhaustive overview is given of related work regarding both the control of large clusters of TCLs and model-based control applied to a DHN.

2.1 Controlling a cluster of TCLs

The curse of dimensionality Bertsekas () lurks around the corner when managing the flexibility present in a large cluster of TCLs. This is attributed to the dimensionality of the state space and the large number of control variables. To this end, significant recent work mathieu2015arbitraging (); tracers (); georges2016direct (); BiegelHP (); Hu2015229 () has focused on providing computationally tractable solutions for large clusters of TCLs. In Mathieu (), a cluster of flexibility carriers represented by generic tank models is considered with the objective of providing day-ahead modulation services to a transmission system operator. Even though all models are linear, a formal branch and bound based optimization approach quickly becomes intractable. The main contribution of the work is a heuristic method including a state dependent dispatch algorithm in combination with an iterated local search technique. High quality solutions to a test problem are presented, obtained within a practical calculation time. The results of a related approach applied in an actual field test comprising 54 heat pumps have been presented in BiegelHP (). Here an aggregated model of reduced dimensionality was used to determine power set points for the entire cluster in an MPC approach. A heuristic dispatch algorithm was used to convert the aggregated set points to local control actions in the portfolio. In Hu2015229 () a data-driven decision framework has been developed using a meta-heuristic optimization technique.
An approach from the same solution class is presented in mathieu2013energy (). Here a problem of energy arbitrage with a large cluster of TCLs is presented. An aggregated system model is used in the form of a state bin transition model Koch (). All TCLs are clustered, based upon their position within their dead-band, resulting in a state vector containing the fraction of TCLs in each state bin. A linear state bin transition model describes the dynamics of this state vector, the dimensionality of which is independent of the number of TCLs in the cluster. This model is used in an MPC resulting in a control action for each state bin. A simple heuristic is used to dispatch the control signals to individual control actions at device level. Although a simplified first order TCL model has been used, the results presented in mathieu2013energy () show that careful system identification is required. Moreover, in Zhang () it was argued that a first order TCL model is found lacking, further complicating system identification.
A different approach is that of distributed optimization Gatsis (); Bosman (), where the centralized optimization problem is decomposed over distributed agents who interact through virtual prices. For example in BiegelDD (), distributed MPC through dual decomposition was presented as a means for energy arbitrage of a large cluster of TCLs subjected to a coupling constraint related to an infrastructure limitation. Although mathematical performance guarantees can be provided under sufficient assumptions, the method heavily relies on the accuracy of local models and has stringent communication and computation requirements due to its iterative character. An example of how an MPC controller can be used at building level is detailed in Široký20113079 ().

2.2 Model-Based DHN Control

When implementing a model-based control approach for a DHN, be it centralized or distributed, one is confronted with (1) the non-linearities in the dynamics of a DHN, and (2) the slow time scales compared to e.g. an electric network Dirk (). Taking these effects into account is essential for a good performance of the controller. Several model-based optimization approaches have been identified in literature, explicitly incorporating the dynamics of the DHN. For example in SandouMPCDHN () a simplified model has been derived that is used together with sequential quadratic programming. In Ikonen () approximate dynamic programming ADP () has been used taking advantage of permutational symmetries of the DHN dynamics. A model-based approach using fuzzy direct matrix control to mitigate non-linearities of the DHN dynamics can be found in GrosswindFDMC (). Although model-based solutions can have excellent performance, accurate models of the DHN and the consumers coupled to the DHN are required. Tuning and shaping these models is considered an expert task, making a generic roll-out of this technology challenging Pinson2009163 ().
A scalable model-free solution is presented in Booij (), where a market-based multi-agent system is used to match thermal and electric demand and supply. Although this approach is scalable, it does not take into account the DHN dynamics and follows a myopic control strategy. An approach combining an auction-based multi-agent system with a central optimization, taking into account a forecast of the total heat demand, can be found in ChristianThesis ().

3 Problem description

Inspired by Gemine et al. gemine (), this section presents a problem formulation of the sequential decision-making problem related to optimal control of a cluster of TCLs connected to a DHN. In a second step, the control problem is cast onto a Markov Decision Process (MDP).

Figure 1: Illustration of the district heating network configuration as used in this work. A set of 100 Thermostatically Controlled Loads (TCLs) is connected to a district heating network, the thermal energy is produced by one central Combined Heat and Power (CHP) plant at the top.

3.1 Test scenario network

To make the problem formalism more tangible, references are made to the test scenario as illustrated in Figure 1 and detailed in Section 5. The test scenario comprises 100 TCLs connected to a DHN. One central CHP is assumed to provide the heat to the buildings through a DHN.

3.2 Problem components

3.2.1 DHN

A DHN contains a set of nodes and a set of pipes connecting these nodes. Each pipe connects two nodes and is characterized by its length, diameter and heat loss coefficient. For each pipe, the flow speed of the medium and its average temperature are tracked at every time step, as is the temperature in each node.

3.2.2 TCLs and heat production

A set of TCLs is connected to the DHN; each TCL is assumed to be connected to a node. For modeling purposes a set of relevant temperatures is associated with each TCL at every time step, i.e. the air temperature as measured by a local thermostat, a temperature corresponding to a building envelope Verbeeck () and the heating system return temperature. The control available at the level of a TCL is to decide whether or not to extract heat from the DHN, corresponding to a binary value. The thermal inertia of the DHN and TCLs is used for storage; no separate hot water storage is considered.
Besides TCLs, production units are also connected to the DHN in specific nodes. As illustrated in Figure 1, here a single CHP is used as heat source. At every time step $k$, an input power $P^{\mathrm{in}}_k$, a thermal output power $P^{\mathrm{th}}_k$ and an electric output power $P^{\mathrm{el}}_k$ are associated to the CHP in its node. The relationship between these powers is defined as:

$$P^{\mathrm{th}}_k + P^{\mathrm{el}}_k = \eta\, P^{\mathrm{in}}_k, \qquad P^{\mathrm{th}}_k = \beta\, P^{\mathrm{el}}_k,$$

with $\eta$ the total fuel utilization ratio of the CHP and $\beta$ the heat to power ratio of the CHP. In this work $P^{\mathrm{th}}_k$ is considered the control variable.
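The power balance of the CHP can be illustrated with a few lines of Python. This is a minimal sketch, assuming that the total fuel utilization ratio equals (thermal + electric output) / input power and that the heat to power ratio equals thermal / electric output. The function name and the default values (0.85 and 1.5) are hypothetical and chosen only for illustration.

```python
def chp_powers(p_thermal_kw, fuel_utilization=0.85, heat_to_power=1.5):
    """Return (input, thermal, electric) power in kW for a requested thermal output.

    Assumptions (not taken from the paper):
      fuel_utilization = (thermal + electric) / input
      heat_to_power    = thermal / electric
    """
    if not 0.0 < fuel_utilization <= 1.0:
        raise ValueError("fuel_utilization must be in (0, 1]")
    if heat_to_power <= 0.0:
        raise ValueError("heat_to_power must be positive")
    p_electric_kw = p_thermal_kw / heat_to_power
    p_input_kw = (p_thermal_kw + p_electric_kw) / fuel_utilization
    return p_input_kw, p_thermal_kw, p_electric_kw
```

With the thermal output as the control variable, the electric output and the fuel input follow directly from the two ratios.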

3.2.3 Operational limits

The main interest of the DHN system is to supply a heat service towards the TCLs meeting comfort constraints:

$$\underline{T}_i \le T^{\mathrm{air}}_{i,k} \le \overline{T}_i.$$

Here $\underline{T}_i$ and $\overline{T}_i$ indicate the lower and upper bound respectively for TCL $i$, and $T^{\mathrm{air}}_{i,k}$ is its air temperature at time step $k$. Common for buildings is to have a constraint on the operative temperature. As a simplification the comfort constraints are here directly related to the air temperature. Besides constraints for the TCLs also constraints for the DHN are relevant, i.e.


3.3 Sequential decision making

Driven by the possibility of using techniques from Reinforcement Learning (RL) ReinforcementLearning (), the sequential decision-making process is formulated as a Markov Decision Process (MDP) Bertsekas (); FonteneauAT (). The sequential nature results from inter-temporal constraints related to the dynamics of the DHN and the TCLs. Decisions made at time step $k$ impact the actions allowed at future states. An MDP is defined by its state space $X$, its action space $U$, and a transition function $f$:

$$x_{k+1} = f(x_k, u_k, w_k),$$

describing the dynamics from state $x_k \in X$ to $x_{k+1}$, following the control actions $u_k \in U$ subject to a random process $w_k \in W$, where $w_k$ is drawn from a probability distribution $p_W(\cdot, x_k)$. Each transition is accompanied by a cost signal $c_k$:

$$c_k = \rho(x_k, u_k, w_k),$$

with $\rho$ the cost function.

3.3.1 State description

Following the notation provided in RuelensBRLDevice (), the state of the system is assumed to be spanned by time dependent state information , controllable state information and uncontrollable exogenous state information Bertsekas (). The time dependent state information describes the time information relevant for the dynamics, e.g. the quarter of an hour in the day or the day in the week. In this work . The controllable state information represents the state of the DHN and the TCLs:


The uncontrollable exogenous state information comprises the physical parameters relevant for the dynamics of the system that cannot be influenced by the control actions. Examples are the outside temperature, solar irradiation, wind speed and direction, and local electric consumption. (The outside temperature, solar irradiation and wind information are assumed constant over the DHN and the buildings.)


It is the exogenous state information that represents the random process. If there is no correlation between the exogenous information at consecutive time steps, it can be omitted from the state information Bertsekas () in the MDP; however, by including it in the state vector a first order correlation is assumed. (This can readily be extended to include information several time steps back.)

3.3.2 Control actions

The control vector includes the control actions of the TCLs and the thermal output power of the central CHP:


3.3.3 DHN Dynamics

To model the dynamics of the DHN (as used in the evaluation), a quasi-dynamic approach is followed Dirk () as pressure and flow change orders of magnitude faster than the temperature of the water. In a first step, a hydraulic simulation is performed, in a second step the thermal dynamics are calculated. For the hydraulic calculations the approach as proposed in Valdimarsson1993 () is followed. This results in applying Kirchhoff’s laws, with the consideration that there exists a non-linear relationship between pressure and flow rate.
To calculate the thermal dynamics, the node model as presented by Benonysson Benonysson () has been used, essentially solving the following equation for every pipe section:


The terms of this equation involve the mass of the water, the thermal capacity of water, the water temperature, the heat demand, the heat transfer coefficient between the water and the ground, the surface of the pipe considered and the local ground temperature.

3.3.4 TCL Dynamics

For the building models, a lumped capacitance model is used, i.e. an electric analogue following Verbeeck Verbeeck (). The model includes the temperature dynamics of the inside air, a building envelope and the heating system return temperature Dirk (). Besides heat losses to the ambient air, wind speed dependent air infiltration losses are also included, as are uncontrolled heating due to solar irradiation and local electric consumption Dirk ().

3.3.5 Cost signal

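Finally, a cost function $\rho$ also needs to be defined. In this work, two objectives are regarded, i.e. energy arbitrage, responding to an external price, and peak shaving/valley filling. For energy arbitrage the cost is defined as:

$$c_k = \lambda_k\, P^{\mathrm{th}}_k\, \Delta t,$$

with $\lambda_k$ the effective price for producing thermal energy at time step $k$, $P^{\mathrm{th}}_k$ the thermal output power of the CHP and $\Delta t$ the length of a time step. For the peak shaving objective, the cost is expressed on a daily basis: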


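The objective of this work is to find a control policy $h: X \to U$ that minimizes the $T$-stage return starting from state $x_1$, defined as:

$$J^{h}_T(x_1) = \mathbb{E}\left[\sum_{k=1}^{T} \rho(x_k, h(x_k), w_k)\right],$$

from the understanding that an optimal policy $h^*$ satisfies the Bellman equation:

$$J^{h^*}(x) = \min_{u \in U}\, \mathbb{E}_{w}\left[\rho(x, u, w) + J^{h^*}(f(x, u, w))\right].$$
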
When an accurate model is available, typical techniques to find near-optimal policies in an MDP framework are value iteration, policy iteration, direct policy search Busoniu () and tree search algorithms such as optimistic planning BusoniuOptimistic (). In Ikonen () for example, an approximated value iteration approach is followed to determine a control policy for the control of a DHN.
In this work, a model-free approach is explored. Driven by promising results Ernst (); atariRL (); FonteneauAT (), Batch Mode Reinforcement Learning (BRL) techniques are investigated, detailed in Section 4.

4 DHN controller approach

This section describes a pragmatic control approach building upon recent results in BRL and market-based multi-agent systems.

Figure 2: Overview of the three-step approach. In a first aggregation step (left), the flexibility and state information of the Thermostatically Controlled Loads (TCLs) are aggregated. In a second optimization step (middle), the optimal control action for the entire cluster is determined. Finally, in step three (right), this control action is dispatched using a market-based multi-agent system.

The control approach illustrated in Figure 2 and followed in this work is based upon a Three Step Approach (TSA) as presented in Stijn (); Arnout () following a similar strategy as followed in BiegelHP (); Koch (). In a first step, (1) all relevant (and practically available) state information is collected, e.g. the temperature information from the TCLs. From this information a limited set of features Bertsekas () is extracted, resulting in a low-dimensional representation of the system state. In a second step (2) a control action for the entire cluster of TCLs is extracted from a policy determined offline on given time intervals. In a third and last step (3), this control action is dispatched over the different TCLs using a market-based multi-agent system. This process is repeated following a receding horizon approach. In the following, a more detailed description of the three steps is presented.

4.1 Step 1: Aggregation

In the first step, state information as described in Section 3.3.1 is retrieved from the system. From a practical perspective however, not all state information is readily available. At building level, only the air temperatures as measured by a local thermostat are assumed available. Furthermore, measurements of the outside air temperature are assumed available, as is the water temperature from a subset of nodes in the supply side and the return side of the DHN (only temperatures at a limited set of nodes are assumed available). This information is further aggregated; formally this can be seen as a feature extraction Bertsekas (), which reduces the dimensionality of the decision-making process. In this work, the feature extraction is handcrafted, resulting in the following effective state vector:


Although more generic dimension reduction techniques such as autoencoders can be used RiedmillerAuto (), the aim of this work is to understand what performance can be obtained starting from this limited state description. Alternatively, a convolutional neural network as presented by the authors in DDRBert () could be used to automatically extract relevant state-time features, allowing historic observations to be added to the state atariRL (); Bertsekas ().
To facilitate the dispatch step as explained in Section 4.3, a bid function is defined for every TCL ClaessensSelfLearning (); Stijn (); KlaasEventBased (). In ClaessensSelfLearning (), the bid function of a device is expressed as the electric power consumed versus a heuristic priority. Above a corner value the bid function is zero:


Determining this heuristic is considered relatively straightforward as it requires only the air temperature as measured by the thermostat and the upper and lower temperature bound. Defining the thermal power extracted from the DHN by a TCL when switched on is less so Dirk (). To relax this requirement, an estimate of the flow rate $\dot{m}_i$ when switched on is assumed to be available. The flow rate is determined as follows. First the set temperature of the indoor heat supply system, which is a function of the outdoor temperature, is calculated. Then, a model for the substation heat exchanger is used to determine the flow rate extracted from the DHN at which the outlet temperature of the heat exchanger meets the set temperature. Using this value instead of the actual power results in the following bid function for building $i$:

$$b_i(p_r) = \dot{m}_i \left(1 - H(p_r - p_{c,i})\right),$$

where $p_r$ is the priority, $p_{c,i}$ the corner value of building $i$ and $H$ corresponds to the Heaviside function.
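The step-shaped bid function described above can be sketched as follows. This is an illustrative sketch only: the linear mapping from air temperature to a corner priority in [0, 1] is an assumption (the text only states that the heuristic needs the air temperature and the two temperature bounds), and the function names are hypothetical.

```python
def corner_priority(t_air, t_min, t_max):
    """Assumed heuristic: 1 at the lower comfort bound (urgent), 0 at the upper bound."""
    priority = (t_max - t_air) / (t_max - t_min)
    return min(1.0, max(0.0, priority))


def bid(priority, t_air, t_min, t_max, flow_on):
    """Requested flow rate at a given priority.

    The bid equals the estimated flow rate when switched on (flow_on)
    up to the corner priority and is zero above it (a Heaviside step).
    """
    if priority <= corner_priority(t_air, t_min, t_max):
        return flow_on
    return 0.0
```

A building close to its lower temperature bound thus keeps bidding up to a high priority, which is what lets the dispatch step serve the coldest buildings first.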

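Input: batch of four-tuples $\mathcal{F} = \{(x_l, u_l, x'_l, c_l)\}_{l=1}^{\#\mathcal{F}}$, regression algorithm Geurts ().
1:  let $\hat{Q}_0$ be zero everywhere on $X \times U$
2:  repeat (for $N = 1, 2, \ldots$)
3:     for $l = 1, \ldots, \#\mathcal{F}$ do
4:        $Q_{N,l} \leftarrow c_l + \min_{u \in U} \hat{Q}_{N-1}(x'_l, u)$
5:     end for
6:     use regression to obtain $\hat{Q}_N$ from $\{((x_l, u_l), Q_{N,l}),\ l = 1, \ldots, \#\mathcal{F}\}$
7:  until $\hat{Q}_N$ is satisfactory
8:  return $\hat{Q}_N$
Algorithm 1 Overview fitted Q-iteration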
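The structure of fitted Q-iteration can be illustrated with a small, self-contained sketch. Note the swapped-in technique: instead of extremely randomized trees, the regression step here is a plain lookup table that averages the targets per (state, action) pair, so that the example runs with the standard library only. It is therefore suitable for tiny discrete problems and is not the regression method used in this work. All names are hypothetical.

```python
from collections import defaultdict


def fit_table(samples):
    """Stand-in regressor: average the targets seen for each (state, action) pair."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, target in samples:
        sums[key] += target
        counts[key] += 1
    table = {key: sums[key] / counts[key] for key in sums}
    # Unseen (state, action) pairs default to zero.
    return lambda x, u: table.get((x, u), 0.0)


def fitted_q_iteration(batch, actions, n_iterations):
    """batch holds four-tuples (x, u, x_next, cost); returns an approximate Q-function."""
    q = lambda x, u: 0.0  # Q_0 is zero everywhere
    for _ in range(n_iterations):
        # Target: immediate cost plus the cheapest continuation under the previous Q.
        targets = [((x, u), c + min(q(x_next, a) for a in actions))
                   for x, u, x_next, c in batch]
        q = fit_table(targets)
    return q


def greedy_action(q, x, actions):
    """Costs are minimized, so the greedy action has the lowest Q-value."""
    return min(actions, key=lambda a: q(x, a))
```

Each pass through the loop extends the optimization horizon by one step, which is why a fixed number of iterations is used here in place of the "until satisfactory" stopping rule.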

4.2 Step 2: Batch Reinforcement Learning

In the second step, a control action is selected once every 15 minutes, following the control policy. The control action is selected for the entire cluster and is projected onto individual control actions as described in Section 4.3. One of the main goals of this work is to explore to what extent (model-free) reinforcement learning can be used to determine this policy. Reinforcement Learning (RL) is a model-free control approach that learns a policy by interaction with the system Busoniu (). In recent literature, RL (mainly in the form of Q-learning NeilRL (); karaRL (); Mocanu2016646 ()) has been presented as an effective model-free learning approach for DR applications. Its practicality, however, suffers from slow convergence ReinforcementLearning (); Busoniu () and the curse of dimensionality Bertsekas (). These challenges can be partially mitigated by using past interactions and appropriate function approximators in a BRL strategy Busoniu (); Xu20141 (). A popular BRL approach is that of Fitted Q-Iteration (FQI) introduced by Ernst et al. in FQI (), especially in combination with extremely randomized trees as regression technique Geurts (). In Ernst () the authors conclude that especially for non-linear control problems, FQI can be a valuable alternative to MPC approaches with the extra advantage that FQI is a blind technique. Moreover, FQI and MPC can strengthen each other Ernst (); MABRL (). Although several BRL techniques have been proposed in the literature FonteneauAT (); FQI (); RiedmillerAuto (), this work focuses on FQI using extremely randomized trees as regression algorithm Geurts (). Comparison to the performance of other BRL approaches is considered outside the scope of this work.
Following FQI (), an approximation of the state-action value function, $\hat{Q}^*$, is built on a daily basis from a batch $\mathcal{F}$ of four-tuples (this is done for practicality, as the simulations cover a time span of several months; in a real-life application, $\hat{Q}^*$ should be constructed more frequently, following Algorithm 1):

$$\mathcal{F} = \{(x_l, u_l, x'_l, c_l),\ l = 1, \ldots, \#\mathcal{F}\}.$$

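Here, the state vector $x$ is as defined in equation (17). Algorithm 1 is used to obtain $\hat{Q}^*$. During the day, the control action $u$ is selected with a probability defined by Powell ():

$$P(u \mid x) = \frac{e^{-\hat{Q}^*(x,u)/\tau_d}}{\sum_{u' \in U} e^{-\hat{Q}^*(x,u')/\tau_d}}.$$

The temperature $\tau_d$ is decreased on a daily basis according to a harmonic sequence ADP (); a high temperature results in more exploration whilst $\tau_d \to 0$ results in a greedy approach:

$$\tau_d = \frac{\tau_0}{d},$$

with $d$ the index of the day and $\tau_0$ the initial temperature. The minus sign in the exponent reflects that $\hat{Q}^*$ estimates a cost to be minimized.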
In the peak shaving scenario, a control action is determined on a daily basis, defining the average power to be followed for the next day as detailed in ClaessensSelfLearning ().
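The exploration scheme used during the day, Boltzmann selection with a temperature that decreases harmonically per day, can be sketched as below. This is a sketch under stated assumptions: the Q-values are treated as costs (lower is better, hence the negative exponent), the harmonic sequence is taken to be tau_0 / day, and all function names and default values are hypothetical.

```python
import math
import random


def temperature(day, tau_0=1.0):
    """Assumed harmonic decay: tau_0, tau_0/2, tau_0/3, ... for day = 1, 2, 3, ..."""
    return tau_0 / day


def boltzmann_probabilities(q_values, tau):
    """Selection probabilities for cost-type Q-values: lower cost, higher probability."""
    best = min(q_values)  # shifting by the minimum keeps the exponentials well-behaved
    weights = [math.exp(-(q - best) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_values, tau, rng=random):
    """Draw the index of an action according to the Boltzmann probabilities."""
    probabilities = boltzmann_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probabilities, k=1)[0]
```

A high temperature makes the probabilities nearly uniform (exploration), whereas a temperature close to zero concentrates almost all probability on the cheapest action (greedy behaviour).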

4.3 Step 3: Real-time control

In the third step, the energy corresponding to the selected control action is dispatched over the cluster of TCLs, using a market-based multi-agent system Stijn (); Hommelberg (). Compared to the work of ClaessensSelfLearning (), there is a significant difference as only the expected flow rate for each TCL is assumed available. To this end, a Proportional-Integral (PI) controller (at a central level) managing the flow rates at the different buildings is used. Since hydraulic effects occur nearly instantaneously in a DHN, this will have a direct effect on the flow rate at the source, influencing the power at the source side as the supply setpoint is assumed to be constant in the simulations Dirk (). An overview of the real-time control can be seen in Figure 3. As described in Section 4.1, every TCL is represented by a bid function. After a clearing process (26), a clearing priority is sent back to the different devices:


The devices open or close their local valve according to the value of their bid function at this clearing priority.
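One possible form of the clearing process is sketched below. It is a simplified illustration, not the clearing rule of this work: the central PI controller is left out, the flow setpoint is taken as given, and the clearing priority is assumed to be the lowest priority at which the aggregated bids still fit within the setpoint. All names are hypothetical.

```python
def aggregate_bid(priority, corner_priorities, flows_on):
    """Total requested flow at a priority: devices bid up to their corner priority."""
    return sum(flow for corner, flow in zip(corner_priorities, flows_on)
               if corner >= priority)


def clear_market(flow_setpoint, corner_priorities, flows_on):
    """Lower the priority step by step while the aggregated bid fits the setpoint."""
    clearing = float("inf")  # by default no device is served
    for candidate in sorted(set(corner_priorities), reverse=True):
        if aggregate_bid(candidate, corner_priorities, flows_on) > flow_setpoint:
            break
        clearing = candidate
    return clearing


def valve_commands(clearing, corner_priorities):
    """A device opens its valve when its corner priority reaches the clearing priority."""
    return [corner >= clearing for corner in corner_priorities]
```

Because candidates are visited in decreasing order, the devices with the highest priority are served first, in line with the synchronizing behaviour discussed in Section 5.2.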

Figure 3: Overview of the controller approach as developed in this work.

5 Evaluation

To evaluate the performance of the controller described in Section 4, a set of simulations have been performed. In this section, first a condensed description of the simulation scenario will be presented after which the tracking performance of the controller is evaluated. The performance of the controller is evaluated for two distinct objectives, i.e. that of energy arbitrage on a day-ahead energy market and peak shaving/valley filling.

5.1 Simulation scenario

As mentioned in Section 1, the scenario is designed to be sufficiently demanding, but also simple enough to allow for an analysis which is not obfuscated by the complexity of the scenario. To this end, the (arbitrary) topology as depicted in Figure 1 has been used, i.e. a central CHP (kW) provides heat to 100 TCLs connected to a radial DHN. Each building is located in one of four streets. The total length of the grid is 2.1 km, with pipe diameters ranging from DN25 to DN100. A detailed description of the simulation scenario can be found in Dirk (). The TCL building models are lumped capacitance models using an electric analogue comprising capacitances and resistors. The capacitance values are related to the temperature of the inside air, a building envelope and the heating system return temperature Verbeeck (); Dirk (). All 100 building models included in the simulation are derived from the same model. Different model parameters are used for each building, by sampling capacitance and resistance values from a normal distribution with a standard deviation of 20% of the standard value. The standard values correspond to a detached house with a living area of 103 m² and a protected volume of 452 m³. The maximum standard power demand of the building is 9.8 kW at an internal temperature of 20 °C and an ambient temperature of -8 °C. Besides heat losses to the ambient air, wind speed dependent air infiltration losses are also included, as are uncontrolled heating due to solar irradiation and local electric consumption Dirk (). For simplicity the temperature constraints are set the same for all buildings at 19.5 °C and 20.5 °C.

5.2 Tracking performance

Figure 4: Top graph, the temperature dynamics of the 100 simulated buildings as is the average temperature indicated by the black line. Bottom graph, the requested thermal power at every control step indicated by the dashed line, as is the actual thermal power delivered by the CHP indicated by the black line.

A first supporting result is depicted in Figure 4, here a numerical experiment was performed where every 15 minutes a random control action was selected (within the technical specification of the CHP). The corresponding set points are depicted in the lower part of Figure 4, as is the average thermal power as produced by the CHP. The upper graph of Figure 4 depicts the internal temperatures of the 100 buildings and the average temperature. The graph shows that when the buildings are on average in their dead band, the requested power can be tracked accurately. Although the buildings have different physical parameters, they tend to synchronize with regard to their temperature relative to the comfort constraints. This is a direct effect of the dispatch dynamics, as those buildings with a higher priority are served first.

5.3 Energy arbitrage

In the scenario of energy arbitrage, the CHP can sell its electric energy directly at the wholesale market Kitapbayev2015823 (). The energy prices are taken from the Belgian day-ahead market BelPex (), the gas price is set at 38.6 €/MWh eurostat (). As day-ahead prices can be predicted with a reasonable accuracy ConejoPricePrediction (), these are considered deterministic in this evaluation. Furthermore, the offline policy calculation (Algorithm 1) is performed only on a daily basis (this is for practical reasons, as the calculation of a policy typically takes 20 minutes on an Intel, 2.5 GHz, 8GB RAM machine, and the simulation period covers up to 80 days). The results of this numerical experiment are depicted in Figure 5. In the upper row, the daily cost is depicted, both for the approach presented in this paper and a default controller. The default controller applies a hysteresis controller at every building and a fixed DHN inlet temperature Dirk (). The cumulative cost is depicted in the second row. It can be observed that initially the daily cost obtained with the BRL controller is similar to the cost of the default controller. However, as the BRL controller starts gathering more interactions, its daily cost for energy starts decreasing. To evaluate the performance during colder days, the first 20 days with a lower average outside temperature were repeated, with a policy constructed with all data. These results are depicted in the right column. The performance is significantly improved, and the daily cost is decreased by about 20%. Although these results are positive, they do not provide an objective metric of the quality of the control approach, since this percentage is biased by the actual price profiles. To this end, a lower bound on the daily cost is provided by taking the total energy consumed during each day and distributing this over those hours with the highest price, assuming the CHP runs at full capacity. This is considered an over-optimistic benchmark.
In the third row of Figure 5 the following metric is depicted:

$$M = \frac{c_{\mathrm{default}} - c_{\mathrm{BRL}}}{c_{\mathrm{default}} - c_{\mathrm{lb}}},$$

with $c_{\mathrm{default}}$ the daily cost of the default controller, $c_{\mathrm{lb}}$ the lower bound on the daily cost and $c_{\mathrm{BRL}}$ the daily cost obtained with the solution presented in this work. A metric of 0 corresponds to the same performance as the default controller, whilst a metric of 1 corresponds to the lower bound solution. From Figure 5 it is observed that the performance metric gradually increases to a value of around 60-70%, also for the colder days as depicted in the right column.
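As a worked illustration of the metric, the sketch below computes a normalised cost saving that equals 0 at the default controller's cost and 1 at the lower bound. The linear normalisation and the function and argument names are assumptions chosen to match those two endpoints; the numbers in the example are made up.

```python
def performance_metric(cost_default, cost_brl, cost_lower_bound):
    """Fraction of the available cost saving that is actually realised.

    Returns 0 when the controller matches the default controller and
    1 when it reaches the lower bound on the daily cost.
    """
    gap = cost_default - cost_lower_bound
    if gap <= 0:
        raise ValueError("the lower bound must be below the default cost")
    return (cost_default - cost_brl) / gap


# Hypothetical daily costs: default 100, lower bound 60, controller 74.
metric = performance_metric(100.0, 74.0, 60.0)
```

With these made-up costs the controller captures 26 of the 40 cost units that separate the default controller from the lower bound.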

Figure 5: Overview of results obtained for an energy arbitrage scenario. The upper graphs depict the daily cost for heating for both the BRL approach and the default control case. The graphs in the second row present the cumulative cost. The graphs in the third row depict the performance metric as explained in the text. The lower graphs show the daily average outside temperature and the sensitivity of the performance metric relative to the average outside temperature.

A snapshot of the daily power profiles of a mature controller compared to the default controller is depicted in Figure 6. It can be seen that the controller produces heat when energy prices are high. It is worth noting that the default controller already performs reasonably well, as the heat demand is typically largest when the wholesale price is also highest. This correlation is, however, expected to decrease as more renewable energy enters the production mix, making wholesale prices more volatile.

Figure 6: Top graph, the wholesale price profile. Lower graph, the average output power, both for the default control case and the BRL approach presented in this work.

5.4 Peak shaving

In the second experiment, peak shaving is considered. The summarizing results are depicted in Figure 7. The top graph shows the daily peak in thermal power obtained with the default controller and with the BRL approach. The second graph gives the average daily ambient temperature. The daily maximum power peak indeed reduces with time. To make the performance more visible, the load duration curves are also plotted for the first and the last 50 days, both for the BRL approach and the default controller. During the first 50 days there is limited improvement over the default controller; for the subsequent period, however, the effect of the controller is clearly visible in the load duration curves. A lower bound is depicted by plotting the average power corresponding to the daily energy consumed by the default control approach. Note that on day seven (top graph in Figure 7) the profile obtained with BRL results in a high peak power, which is attributed to an exploration step (24). The final performance is visualized more clearly in Figure 8, where the power profile is plotted for a mature controller. Indeed, the thermal power follows a much flatter profile than in the default control case. A visualization of the policy obtained by the BRL controller is presented in Figure 9, where the average power set point is plotted versus the initial indoor temperature state and the expected average daily outdoor temperature. As expected, the set point decreases with increasing outside temperature and average air temperature.
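The two quantities used in this comparison, the load duration curve and the average-power lower bound, can be sketched as follows. This is a minimal illustration assuming a power profile sampled at a fixed interval `dt`; the function names and the example profile are hypothetical.

```python
def load_duration_curve(power):
    """Power samples sorted from highest to lowest.

    Position i tells which power level is reached or exceeded
    during i + 1 sampling intervals.
    """
    return sorted(power, reverse=True)


def average_power_lower_bound(power, dt=1.0):
    """Constant power that delivers the same energy as `power`.

    A perfectly flat profile is the best any peak-shaving controller
    could achieve for a given daily energy, so this is a lower bound
    on the daily peak.
    """
    energy = sum(p * dt for p in power)
    return energy / (len(power) * dt)


profile = [2.0, 5.0, 3.0, 2.0]             # made-up thermal power samples
curve = load_duration_curve(profile)       # [5.0, 3.0, 2.0, 2.0]
flat = average_power_lower_bound(profile)  # 3.0
```

The closer the load duration curve of a controller lies to the flat line at `flat`, the more of the peak has been shaved.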

Figure 7: Top graph, daily power peak, both for the approach presented in this work and the default controller. The middle graph shows the daily average temperature. The lower row presents load duration curves for the first 50 days and for the last 50 days.

Figure 8: Power profile, both for the controller as presented in this work and a default controller in the peak shaving/valley filling scenario.

Figure 9: Policy map as obtained with the BRL approach for the peak shaving/valley filling scenario.

6 Conclusions and future work

In this work, the control problem of a cluster of thermostatically controlled loads connected to a district heating network is addressed by assessing the performance of a control approach comprising a model-free reinforcement learning technique in combination with a market-based multi-agent system. The performance of the controller has been evaluated for two distinct scenarios, i.e. energy arbitrage and peak shaving. In the evaluation, a detailed district heating network model has been used, including hydraulic and thermal dynamics. For the energy arbitrage scenario, solutions are obtained that reach over 65% of the available optimization potential after a learning period of 40 to 60 days. Knowing that the policy is updated on a daily basis, this is considered a promising result. Promising results have also been obtained for the peak shaving/valley filling scenario, where a clear performance improvement is observed. These results support the practical implementation, and the coming of age, of reinforcement learning techniques.
To understand the potential of completely model-free control, a direct implementation of fitted Q-iteration as presented in Busoniu () has been used. However, as discussed in Ernst (), combining general domain knowledge with a model-free approach is expected to result in improved performance at a reduced learning time. This domain knowledge can be incorporated through, e.g., information regarding the shape of the policy Busoniu () or through the use of a model, similar to MABRL (). A second point of future research is directed at more automated feature extraction techniques such as autoencoders atariRL (), which are also expected to result in a reduced learning time.



  • (1) C. Johansson, On Intelligent District Heating, PhD Thesis, Blekinge Institute of Technology, Blekinge, Sweden, 2014.
  • (2) Y. Kitapbayev, J. Moriarty, P. Mancarella, Stochastic control and real options valuation of thermal storage-enabled demand response from flexible district energy systems, Applied Energy 137 (2015) 823 – 831.
  • (3) K. Sartor, S. Quoilin, P. Dewallef, Simulation and optimization of a CHP biomass plant and district heating network, Applied Energy 130 (0) (2014) 474 – 483.
  • (4) L. Ozgener, A. Hepbasli, I. Dincer, Performance investigation of two geothermal district heating systems for building applications: Energy analysis, Energy and Buildings 38 (4) (2006) 286 – 292.
  • (5) B. J. Claessens, S. Vandael, F. Ruelens, M. Hommelberg, Self-learning demand side management for a heterogeneous cluster of devices with binary control actions, in: Proc. 3rd IEEE Innov. Smart Grid Technol. Conf. (ISGT Europe), Berlin, Germany, 2012, pp. 1–8.
  • (6) F. De Ridder, B. Claessens, A trading strategy for industrial chps on multiple power markets, International Trans. on Electrical Energy Systems 24 (5) (2014) 677–697.
  • (7) S. Koch, Demand Response Methods for Ancillary Services and Renewable Energy Integration in Electric Power Systems, PhD thesis, University of Stuttgart, 2012.
  • (8) K. Vanthournout, R. D’hulst, D. Geysen, G. Jacobs, A smart domestic hot water buffer, IEEE Trans. on Smart Grid 3 (4) (2012) 2121–2127.
  • (9) G. Verbeeck, Optimisation of Extremely Low Energy Residential Buildings, PhD thesis, KULeuven, 2007.
  • (10) J. Kensby, A. Trüschel, J.-O. Dalenbäck, Potential of residential buildings as thermal energy storage in district heating systems – results from a pilot test, Applied Energy 137 (2015) 773 – 781.
  • (11) G. Sandou, S. Font, S. Tebbani, A. Hiret, C. Mondon, Predictive control of a complex district heating network, in: Proc. 44th IEEE Conference on Decision and Control, European Control Conference (CDC-ECC), 2005, pp. 7372–7377. doi:10.1109/CDC.2005.1583.
  • (12) J. Široký, F. Oldewurtel, J. Cigler, S. Prívara, Experimental analysis of model predictive control for an energy efficient building heating system, Applied Energy 88 (9) (2011) 3079 – 3087.
  • (13) J. L. Mathieu, M. Kamgarpour, J. Lygeros, D. S. Callaway, Energy arbitrage with thermostatically controlled loads, in: Proc. European Control Conference (ECC), IEEE, 2013, pp. 2519–2526.
  • (14) S. Vandael, B. J. Claessens, M. Hommelberg, T. Holvoet, G. Deconinck, A scalable three-step approach for demand side management of plug-in hybrid vehicles, IEEE Trans. on Smart Grid 4 (2) (2013) 720–728.
  • (15) S. Lange, T. Gabel, M. Riedmiller, Batch reinforcement learning, in: M. Wiering, M. van Otterlo (Eds.), Reinforcement Learning: State-of-the-Art, Springer, New York, NY, 2012, pp. 45–73.
  • (16) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533.
  • (17) D. Bertsekas, J. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Nashua, NH, 1996.
  • (18) J. L. Mathieu, M. Kamgarpour, J. Lygeros, G. Andersson, D. S. Callaway, Arbitraging intraday wholesale energy market prices with aggregations of thermostatic loads, IEEE Trans. on Power Systems 30 (2) (2015) 763–772.
  • (19) S. Iacovella, F. Ruelens, P. Vingerhoets, B. J. Claessens, G. Deconinck, Cluster control of heterogeneous thermostatically controlled loads using tracer devices, IEEE Trans. on Smart Grid PP (99) (2015) 1–9. doi:10.1109/TSG.2015.2483506.
  • (20) E. Georges, B. Cornélusse, D. Ernst, Q. Louveaux, V. Lemort, S. Mathieu, Direct control service from residential heat pump aggregation with specified payback, in: Proceedings of the 19th Power Systems Computation Conference (PSCC), 2016.
  • (21) B. Biegel, P. Andersen, J. Stoustrup, M. B. Madsen, L. H. Hansen, L. H. Rasmussen, Aggregation and control of flexible consumers - a real life demonstration, in: Proc. of the 19th IFAC World Congress, IFAC, Cape Town, South Africa, 2014. doi:10.3182/20140824-6-ZA-1003.00718.
  • (22) M. Hu, A data-driven feed-forward decision framework for building clusters operation under uncertainty, Applied Energy 141 (0) (2015) 229 – 237.
  • (23) S. Mathieu, D. Ernst, Q. Louveaux, An efficient algorithm for the provision of a day-ahead modulation service by a load aggregator, in: Proc. 4th IEEE Innovative Smart Grid Technologies Europe (ISGT Europe), 2013, pp. 1–5. doi:10.1109/ISGTEurope.2013.6695247.
  • (24) J. Mathieu, S. Koch, D. Callaway, State estimation and control of electric loads to manage real-time energy imbalance, IEEE Trans. on Power Systems 28 (1) (2013) 430–440. doi:10.1109/TPWRS.2012.2204074.
  • (25) W. Zhang, K. Kalsi, J. Fuller, M. Elizondo, D. Chassin, Aggregate model for heterogeneous thermostatically controlled loads with demand response, in: Proc. IEEE Power and Energy Society General Meeting, 2012, pp. 1–8.
  • (26) N. Gatsis, G. Giannakis, Residential load control: Distributed scheduling and convergence with lost AMI messages, IEEE Trans. on Smart Grid 3 (2) (2012) 770–786. doi:10.1109/TSG.2011.2176518.
  • (27) M. G. C. Bosman, Planning in Smart Grids, PhD Thesis, University of Twente, 2012.
  • (28) B. Biegel, J. Stoustrup, P. Andersen, Distributed model predictive control via dual decomposition, Intelligent Systems, Control and Automation: Science and Engineering 69.
  • (29) D. Vanhoudt, B. Claessens, R. Salenbien, J. Desmedt, The use of distributed thermal storage in district heating grids for demand side management, under review at Elsevier: Energy and Buildings (Feb. 2017). arXiv:1702.06005.
  • (30) E. Ikonen, I. Selek, J. Kovacs, M. Neuvonen, Z. Szabo, J. Bene, J. Peurasaari, Short term optimization of district heating network supply temperatures, in: Proc. IEEE International Energy Conference (ENERGYCON), 2014, pp. 996–1003. doi:10.1109/ENERGYCON.2014.6850547.
  • (31) W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd Edition, Wiley, 2011.
  • (32) S. Grosswindhager, A. Voigt, M. Kozek, Predictive control of district heating network using fuzzy DMC, in: Proc. of International Conference on Modelling, Identification Control (ICMIC), 2012, pp. 241–246.
  • (33) P. Pinson, T. Nielsen, H. Nielsen, N. Poulsen, H. Madsen, Temperature prediction at critical points in district heating systems, European Journal of Operational Research 194 (1) (2009) 163 – 176.
  • (34) P. Booij, V. Kamphuis, O. Pruisen, C. Warmer, Multi-agent control for integrated heat and electricity management in residential districts, in: 4th International Workshop on Agent Technologies for Energy Systems, ATES 2013, 2013.
  • (35) Q. Gemine, D. Ernst, B. Cornélusse, Active network management for electrical distribution systems: problem formulation and benchmark, CoRR abs/1405.2806.
  • (36) R. Fonteneau, S. Murphy, L. Wehenkel, D. Ernst, Batch mode reinforcement learning based on the synthesis of artificial trajectories, Annals of Operations Research 208 (1) (2013) 383–416. doi:10.1007/s10479-012-1248-5.
  • (37) F. Ruelens, B. J. Claessens, S. Vandael, B. De Schutter, R. Babuška, R. Belmans, Residential demand response of thermostatically controlled loads using batch reinforcement learning, IEEE Trans. on Smart Grid PP (99) (2016) 1–11. doi:10.1109/TSG.2016.2517211.
  • (38) P. Valdimarsson, Modelling of Geothermal District Heating Systems, PhD thesis, University of Iceland, 1993.
  • (39) A. Benonysson, Dynamic Modelling and Operational Optimization of District Heating Systems, PhD thesis, Technical University of Denmark (DTU), 1991.
  • (40) L. Busoniu, R. Babuška, B. De Schutter, D. Ernst, Reinforcement learning and Dynamic Programming Using Function Approximators, 1st Edition, CRC Press, 2010.
  • (41) L. Busoniu, R. Munos, R. Babuška, A Survey of Optimistic Planning in Markov Decision Processes, John Wiley & Sons, Inc., 2013, pp. 494–516. doi:10.1002/9781118453988.ch22.
  • (42) D. Ernst, M. Glavic, F. Capitanescu, L. Wehenkel, Reinforcement learning versus model predictive control: a comparison on a power system problem, IEEE Trans. Syst., Man, Cybern., Syst. 39 (2) (2009) 517–529.
  • (43) A. Aertgeerts, Demand side management of the thermal flexibility in a residential neighborhood using a hierarchical market-based multi-agent system, master thesis, KULeuven, 2014.
  • (44) S. Lange, M. Riedmiller, Deep auto-encoder neural networks in reinforcement learning, in: Proc. IEEE 2010 Int. Joint Conf. on Neural Networks (IJCNN), Barcelona, Spain, 2010, pp. 1–8.
  • (45) B. J. Claessens, P. Vrancx, F. Ruelens, Convolutional neural networks for automatic state-time feature extraction in reinforcement learning applied to residential load control, IEEE Transactions on Smart Grid PP (99) (2016) 1–1. doi:10.1109/TSG.2016.2629450.
  • (46) K. De Craemer, G. Deconinck, Balancing trade-offs in coordinated PHEV charging with continuous market-based control, in: Proc. 3rd IEEE PES International Conference and Exhibition on Innovative Smart Grid Technologies (ISGT Europe), 2012, pp. 1–8. doi:10.1109/ISGTEurope.2012.6465685.
  • (47) P. Geurts, D. Ernst, L. Wehenkel, Extremely randomized trees, Machine Learning 63 (1) (2006) 3–42.
  • (48) Z. Wen, D. O’Neill, H. Maei, Optimal demand response using device-based reinforcement learning, IEEE Trans. on Smart Grid 6 (5) (2015) 2312–2324. doi:10.1109/TSG.2015.2396993.
  • (49) E. C. Kara, M. Berges, B. Krogh, S. Kar, Using smart devices for system-level management and control in the smart grid: A reinforcement learning framework, in: Proc. 3rd IEEE Int. Conf. on Smart Grid Commun. (SmartGridComm), Tainan, Taiwan, 2012, pp. 85–90.
  • (50) E. Mocanu, P. H. Nguyen, W. L. Kling, M. Gibescu, Unsupervised energy prediction in a smart grid context using reinforcement cross-building transfer learning, Energy and Buildings 116 (2016) 646 – 655.
  • (51) X. Xu, L. Zuo, Z. Huang, Reinforcement learning algorithms with function approximation: Recent advances and applications, Information Sciences 261 (0) (2014) 1 – 31.
  • (52) D. Ernst, P. Geurts, L. Wehenkel, Tree-based batch mode reinforcement learning, Journal of Machine Learning Research 6 (2005) 503–556.
  • (53) T. Lampe, M. Riedmiller, Approximate model-assisted neural fitted Q-iteration, in: IEEE International Joint Conference on Neural Networks (IJCNN 2014), Beijing, China, 2014.
  • (54) R. Anderson, A. Boulanger, W. Powell, W. Scott, Adaptive stochastic control for the smart grid, Proceedings of the IEEE 99 (6) (2011) 1098–1115. doi:10.1109/JPROC.2011.2109671.
  • (55) M. Hommelberg, B. van der Velde, C. Warmer, I. Kamphuis, J. Kok, A novel architecture for real-time operation of multi-agent based coordination of demand and supply, in: Power and Energy Society General Meeting - Conversion and Delivery of Electrical Energy in the 21st Century, 2008 IEEE, 2008, pp. 1 –5. doi:10.1109/PES.2008.4596531.
  • (56) Belpex - Belgian power exchange, [Online: accessed March 21, 2016].
  • (57) Eurostat, Gas prices for industrial consumers 2013.
  • (58) A. Conejo, M. Plazas, R. Espinola, A. Molina, Day-ahead electricity price forecasting using the wavelet transform and ARIMA models, IEEE Trans. on Power Systems 20 (2) (2005) 1035 – 1042. doi:10.1109/TPWRS.2005.846054.