Model-Free Control of Thermostatically Controlled Loads Connected to a District Heating Network
Abstract
Optimal control of thermostatically controlled loads connected to a district heating network can be considered a sequential decision-making problem under uncertainty. The practicality of a direct model-based approach is compromised by two challenges: scalability, due to the large dimensionality of the problem, and the system identification required to obtain an accurate model. To mitigate these problems, this paper leverages recent developments in reinforcement learning in combination with a market-based multi-agent system to obtain a scalable solution that achieves a significant performance improvement within a practical learning time. The control approach is applied to a scenario comprising 100 thermostatically controlled loads connected to a radial district heating network supplied by a central combined heat and power plant. Both for an energy arbitrage and a peak shaving objective, the control approach requires 60 days to obtain a performance within 65% of a theoretical lower bound on the cost.
keywords:
District heating, combined heat and power, reinforcement learning, thermostatically controlled loads.

Bert J. Claessens is currently working at REstore and can be contacted at bert.claessens@restore.eu.
1 Introduction
A District Heating Network (DHN) offers the opportunity to provide the collective heat demand of a cluster of geographically concentrated buildings through a set of central heat sources. This allows the use of centralized production techniques with an efficiency exceeding that of distributed production. Combined Heat and Power plants (CHPs) are a prominent example, as 80-90% of the primary energy is converted to heat and electricity (1); (2); (3). Heat from a geothermal source (4) or excess heat from an industrial process can also be used as the primary heat source.
The heat from the sources is transported through a network of pipes using water as a medium. At each building, heat is extracted in a local substation, and the resulting lower-temperature water is transported back to the different heat sources.
A typical operational model at the production side is to modulate the power of the heat sources to keep the supply temperature close to a design setting. This basically results in the thermal supply following the thermal demand.
Heat storage, however, can provide flexibility at the demand side, which in turn enables flexibility at the production side through demand response approaches. This flexibility creates operational opportunities for cost reduction, examples being peak shaving/valley filling (5) and energy arbitrage by selling the electricity production of the CHP on the wholesale market (1); (2); (6).
Well-referenced embodiments of local heat storage are Thermostatically Controlled Loads (TCLs) (7), such as a hot water storage tank (8) in which heat is stored directly in the water; the building envelope (1); (7); (9); (10) can also be used to store heat.
From an operational point of view, controlling a cluster of TCLs connected to a DHN can be considered a sequential decision-making problem under uncertainty.
One well studied control paradigm for operational management of a DHN is that of Model Predictive Control (MPC) (11); (12). When projected on the setting of TCLs connected to a DHN, this requires defining control actions for the central sources as well as for all individual TCLs.
Developing a practical implementation requires one to tackle the problem of scalability, as the state dimensionality and number of control variables quickly result in an intractable optimization problem. This is complicated further by nonlinear system dynamics.
A second important challenge is that of system identification (13), as identifying an accurate model of both the DHN and all TCLs requires significant amounts of data that are not readily available, as well as expert knowledge.
This work contributes to mitigating the operational control challenges for TCLs connected to a DHN by addressing these two problems.
Scalability: To obtain scalability, a heuristic dispatch approach as described in (14) is applied to the setting of TCLs connected to a DHN. Instead of calculating an individual control action for each TCL, this approach calculates a collective control action for the entire cluster of TCLs. A market-based dispatch algorithm is used to translate the collective control action into individual control actions.
System identification: Driven by recent developments in Batch Reinforcement Learning (BRL) (15); (16), a blind model-free approach is considered. As a BRL technique needs no prior information on the system dynamics, this strongly relaxes the system identification requirements, at the cost of a learning period and possibly suboptimal performance.
In Section 2, an overview is given of related research regarding the control of large clusters of TCLs and model-based controllers for DHNs. In Section 3, the decision-making problem is formalized as a Markov Decision Process (MDP).
In Section 4, the control approach used in this work is described in detail.
In Section 5, an evaluation of the controller performance is provided based upon a simulation scenario comprising 100 TCLs connected to a DHN. The simulation scenario is sufficiently complex to evaluate the contributions of the control approach, yet simple enough to allow for an analysis that is not obfuscated by the complexity of the scenario.
Finally, in Section 6, the conclusions are presented, along with a discussion of the results.
2 Related Work
In this section a non-exhaustive overview is given of related work regarding both the control of large clusters of TCLs and model-based control applied to a DHN.
2.1 Controlling a cluster of TCLs
The curse of dimensionality (17) quickly arises when managing the flexibility present in a large cluster of TCLs. This is attributed to the dimensionality of the state space and the large number of control variables.
To this end, significant recent work (18); (19); (20); (21); (22) has focused on providing computationally tractable solutions for large clusters of TCLs.
In (23), a cluster of flexibility carriers represented by generic tank models is considered with the objective of providing day-ahead modulation services to a transmission system operator. Even though all models are linear, a formal branch-and-bound-based optimization approach quickly becomes intractable. The main contribution of the work is a heuristic method including a state-dependent dispatch algorithm in combination with an iterated local search technique. High-quality solutions to a test problem, obtained within a practical calculation time, are presented.
The results of a related approach applied in an actual field test comprising 54 heat pumps have been presented in (21).
Here an aggregated model of reduced dimensionality was used to determine power set points for the entire cluster in an MPC approach. A heuristic dispatch algorithm was used to convert the aggregated set points to local control actions in the portfolio. In (22) a data-driven decision framework has been developed using a metaheuristic optimization technique.
An approach from the same solution class is presented in (13).
Here a problem of energy arbitrage with a large cluster of TCLs is presented.
An aggregated system model is used in the form of a state-bin transition model (24). All TCLs are clustered based upon their position within their dead band, resulting in a state vector containing the fraction of TCLs in each state bin. A linear state-bin transition model describes the dynamics of this state vector, the dimensionality of which is independent of the number of TCLs in the cluster. This model is used in an MPC, resulting in a control action for each state bin. A simple heuristic is used to dispatch these control signals to individual control actions at device level. Although a simplified first-order TCL model has been used, the results presented in (13) show that careful system identification is required. Moreover, in (25) it was argued that a first-order TCL model is lacking, further complicating system identification.
A different approach is that of distributed optimization (26); (27), where the centralized optimization problem is decomposed over distributed agents that interact through virtual prices. For example, in (28) distributed MPC through dual decomposition was presented as a means for energy arbitrage of a large cluster of TCLs subject to a coupling constraint related to an infrastructure limitation. Although mathematical performance guarantees can be provided under suitable assumptions, the method relies heavily on the accuracy of local models and has stringent communication and computation requirements due to its iterative character.
An example of how an MPC controller can be used at building level is detailed in (12).
2.2 Model-Based DHN Control
When implementing a model-based control approach for a DHN, be it centralized or distributed, one is confronted with (1) the nonlinearities in the dynamics of a DHN, and (2) the slow time scales compared to e.g. an electric network (29). Taking these effects into account is essential for good controller performance. Several model-based optimization approaches have been identified in the literature that explicitly incorporate the dynamics of the DHN. For example, in (11) a simplified model has been derived that is used together with sequential quadratic programming. In (30) approximate dynamic programming (31) has been used, taking advantage of permutational symmetries of the DHN dynamics. A model-based approach using fuzzy direct matrix control to mitigate nonlinearities of the DHN dynamics can be found in (32).
Although model-based solutions can have excellent performance, accurate models of the DHN and the consumers coupled to it are required. Tuning and shaping these models is considered an expert task, making a generic roll-out of this technology challenging (33).
A scalable model-free solution is presented in (34), where a market-based multi-agent system is used to match thermal and electric demand and supply. Although this approach is scalable, it does not take the DHN dynamics into account and follows a myopic control strategy.
An approach combining an auction-based multi-agent system with a central optimization, taking into account a forecast of the total heat demand, can be found in (1).
3 Problem description
Inspired by Gemine et al. (35), this section presents a formulation of the sequential decision-making problem related to the optimal control of a cluster of TCLs connected to a DHN. In a second step, the control problem is cast as a Markov Decision Process (MDP).
3.1 Test scenario network
3.2 Problem components
DHN
A DHN contains a set of nodes and a set of pipes connecting these nodes. Each pipe $p$ connects two nodes and is characterized by its length $L_p$, diameter $D_p$ and heat loss coefficient $h_p$. The flow speed of the medium at time $t$ in pipe $p$ is denoted by $v_{p,t}$ and its average temperature by $T_{p,t}$. The temperature in each node $n$ at time $t$ is characterized by $T_{n,t}$.
TCLs and heat production
The set $\mathcal{D}$ contains the TCLs connected to the DHN; each TCL $d \in \mathcal{D}$ is assumed to be connected to a node. For modeling purposes, a set of relevant temperatures is associated with each TCL $d$ at time $t$, i.e. the air temperature $T^{\mathrm{air}}_{d,t}$ as measured by a local thermostat, a temperature $T^{\mathrm{env}}_{d,t}$ corresponding to the building envelope (9), and the heating system return temperature $T^{\mathrm{ret}}_{d,t}$.
The control available at the level of a TCL is to decide whether or not to extract heat from the DHN, corresponding to a binary value $u_{d,t} \in \{0, 1\}$. The thermal inertia of the DHN and the TCLs is used for storage; no separate hot water storage is considered.
Besides TCLs, production units are also connected to the DHN at specific nodes. As illustrated in Figure 1, a single CHP is used here as heat source. At every time step $t$, an input power $P^{\mathrm{in}}_t$, a thermal output power $\dot{Q}^{\mathrm{CHP}}_t$ and an electric output power $P^{\mathrm{e}}_t$ are associated with the CHP at its node.
The relationship between these powers is defined as:

$$\dot{Q}^{\mathrm{CHP}}_t + P^{\mathrm{e}}_t = \eta\, P^{\mathrm{in}}_t, \qquad \sigma = \frac{\dot{Q}^{\mathrm{CHP}}_t}{P^{\mathrm{e}}_t} \tag{1}$$

with $\eta$ the total fuel utilization ratio of the CHP and $\sigma$ the heat-to-power ratio of the CHP. In this work $\dot{Q}^{\mathrm{CHP}}_t$ is considered the control variable.
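The CHP power balance can be illustrated with a short numerical sketch; the function name and the parameter values (total fuel utilization ratio 0.9, heat-to-power ratio 1.5) are illustrative assumptions, not values from the paper:

```python
def chp_outputs(p_fuel, eta=0.9, sigma=1.5):
    """Split a fuel input power into thermal and electric output.

    Assumes the relation Q + P = eta * p_fuel with heat-to-power
    ratio sigma = Q / P; eta and sigma are illustrative values.
    """
    p_el = eta * p_fuel / (1.0 + sigma)   # electric output [kW]
    q_th = sigma * p_el                   # thermal output [kW]
    return q_th, p_el

# 1 MW of fuel input split into heat and electricity
q, p = chp_outputs(1000.0)
```

By construction, the two outputs sum to the fuel input times the utilization ratio, and their quotient equals the heat-to-power ratio.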
Operational limits
The main interest of the DHN system is to supply a heat service to the TCLs while meeting comfort constraints:

$$\underline{T}_d \le T^{\mathrm{air}}_{d,t} \le \overline{T}_d \tag{2}$$

Here $\underline{T}_d$ and $\overline{T}_d$ indicate the lower and upper bound, respectively. It is common for buildings to have a constraint on the operative temperature; as a simplification, the comfort constraints are here related directly to the air temperature $T^{\mathrm{air}}_{d,t}$. Besides constraints for the TCLs, constraints for the DHN are also relevant, i.e. bounds on the node temperatures and on the flow speeds in the pipes:

$$\underline{T}_n \le T_{n,t} \le \overline{T}_n \tag{3}$$

$$0 \le v_{p,t} \le \overline{v}_p \tag{4}$$
3.3 Sequential decision making
Driven by the possibility of using techniques from Reinforcement Learning (RL) (15), the sequential decision-making process is formulated as a Markov Decision Process (MDP) (17); (36). The sequential nature results from intertemporal constraints related to the dynamics of the DHN and the TCLs: decisions made at time step $t$ impact the actions allowed in future states. An MDP is defined by its state space $X$, its action space $U$, and a transition function $f$:

$$x_{t+1} = f(x_t, u_t, w_t) \tag{5}$$

describing the dynamics from state $x_t \in X$ to $x_{t+1}$, following the control action $u_t \in U$ and subject to a random process $w_t \in W$, where $w_t$ is drawn from a probability distribution $p_W(\cdot)$. Each transition is accompanied by a cost signal $c_t$:

$$c_t = \rho(x_t, u_t, w_t) \tag{6}$$

with $\rho$ the cost function.
State description
Following the notation provided in (37), the state of the system is assumed to be spanned by time-dependent state information $x^{\mathrm{t}}_t$, controllable state information $x^{\mathrm{phys}}_t$ and uncontrollable exogenous state information $x^{\mathrm{ex}}_t$ (17). The time-dependent state information describes the time information relevant for the dynamics, e.g. the quarter-hour of the day or the day of the week; in this work the quarter-hour of the day is used. The controllable state information represents the state of the DHN and the TCLs:

$$x^{\mathrm{DHN}}_t = \left( \{T_{n,t}\}_{n}, \{T_{p,t}\}_{p} \right) \tag{7}$$

$$x^{\mathrm{TCL}}_t = \left( \{T^{\mathrm{air}}_{d,t}, T^{\mathrm{env}}_{d,t}, T^{\mathrm{ret}}_{d,t}\}_{d} \right) \tag{8}$$

The uncontrollable exogenous state information comprises the physical quantities relevant for the dynamics of the system that cannot be influenced by the control actions, examples being the outside temperature $T^{\mathrm{out}}_t$, solar irradiation $S_t$, and wind speed and direction:

$$x^{\mathrm{ex}}_t = \left( T^{\mathrm{out}}_t, S_t, \ldots \right) \tag{9}$$

It is $x^{\mathrm{ex}}_t$ that represents the random process $w_t$. If there is no correlation between successive realizations of $x^{\mathrm{ex}}_t$, it can be omitted from the state information (17) in the MDP; by including $x^{\mathrm{ex}}_t$ in the state vector, however, a first-order correlation is assumed.
Control actions
The control vector $u_t$ includes the control actions of the TCLs and the thermal output power of the central CHP:

$$u_t = \left( \{u_{d,t}\}_{d}, \dot{Q}^{\mathrm{CHP}}_t \right) \tag{10}$$
DHN Dynamics
To model the dynamics of the DHN (as used in the evaluation), a quasi-dynamic approach is followed (29), as pressure and flow change orders of magnitude faster than the temperature of the water. In a first step a hydraulic simulation is performed; in a second step the thermal dynamics are calculated. For the hydraulic calculations, the approach proposed in (38) is followed. This amounts to applying Kirchhoff's laws, taking into account the nonlinear relationship between pressure and flow rate.
To calculate the thermal dynamics, the node model as presented by Benonysson (39) has been used, essentially solving the following equation for every pipe section:

$$m c_p \frac{\mathrm{d}T_w}{\mathrm{d}t} = -\dot{Q}_d - hA\left(T_w - T_g\right) \tag{11}$$

with $m$ the mass of the water, $c_p$ the thermal capacity of water, $T_w$ the water temperature, $\dot{Q}_d$ the heat demand, $h$ the heat transfer coefficient between the water and the ground, $A$ the surface of the pipe section considered, and $T_g$ the local ground temperature.
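The pipe-section energy balance can be integrated numerically; the following explicit-Euler sketch uses invented parameter values and a hypothetical function name:

```python
def pipe_temp_step(T_w, dt, m, c_p=4186.0, q_demand=0.0, h=1.0, A=1.0, T_g=10.0):
    """One explicit-Euler step of the pipe-section energy balance
    m * c_p * dT/dt = -q_demand - h * A * (T_w - T_g).

    Units: SI (kg, J/(kg K), W, s); all numeric values are illustrative.
    """
    dT_dt = (-q_demand - h * A * (T_w - T_g)) / (m * c_p)
    return T_w + dt * dT_dt

# water at 70 C in a section holding 100 kg, one-minute step, no demand:
T_next = pipe_temp_step(70.0, 60.0, m=100.0)
```

With no heat demand, the water temperature relaxes slowly toward the ground temperature through the loss term.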
TCL Dynamics
For the building models, a lumped-capacitance model is used, i.e. an electric analogue following Verbeeck (9). The model includes the temperature dynamics of the inside air, the building envelope and the heating system return temperature (29). Besides heat losses to the ambient air, wind-speed-dependent air infiltration losses are also included, as is uncontrolled heating due to solar irradiation and local electric consumption (29).
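As an illustration of a lumped-capacitance model of this kind, the following is a minimal hypothetical two-capacitance sketch (air and envelope nodes only); the structure, resistances and capacitances are invented for illustration and are not the model of (9); (29):

```python
def building_step(T_air, T_env, dt, q_heat, T_out,
                  C_air=1.0e7, C_env=5.0e7, R_ae=5.0e-3, R_eo=1.0e-2):
    """One explicit-Euler step of a hypothetical 2R2C building model.

    The air node exchanges heat with the envelope through R_ae, the
    envelope with the outside through R_eo; q_heat [W] is injected at
    the air node. All parameter values are illustrative assumptions.
    """
    q_ae = (T_env - T_air) / R_ae   # envelope -> air heat flow [W]
    q_eo = (T_out - T_env) / R_eo   # outside -> envelope heat flow [W]
    T_air_new = T_air + dt * (q_heat + q_ae) / C_air
    T_env_new = T_env + dt * (q_eo - q_ae) / C_env
    return T_air_new, T_env_new

# unheated building, cold outside, one-minute step:
Ta, Te = building_step(20.0, 18.0, 60.0, 0.0, 0.0)
```

Without heating, both node temperatures drift downward, the air toward the cooler envelope and the envelope toward the outside.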
Cost signal
Finally, a cost function $\rho$ needs to be defined. In this work two objectives are regarded, i.e. energy arbitrage, responding to an external price, and peak shaving/valley filling. For the energy arbitrage objective, the cost is defined as:

$$c_t = \lambda_t\, \dot{Q}^{\mathrm{CHP}}_t\, \Delta t \tag{12}$$

with $\lambda_t$ the effective price for producing thermal energy at time step $t$ and $\Delta t$ the length of a control period. For the peak shaving objective, the cost is expressed on a daily basis:

$$c_{\mathrm{day}} = \max_{t \in \mathrm{day}} \dot{Q}^{\mathrm{CHP}}_t \tag{13}$$
The objective of this work is to find a control policy $\pi: X \rightarrow U$ that minimizes the expected $T$-stage return starting from state $x_0$, defined as:

$$J^\pi(x_0) = \mathbb{E}\left[ \sum_{t=0}^{T-1} \rho\left(x_t, \pi(x_t), w_t\right) \right] \tag{14}$$

with

$$x_{t+1} = f\left(x_t, \pi(x_t), w_t\right) \tag{15}$$

from the understanding that an optimal policy $\pi^*$ satisfies the Bellman equation:

$$J^{\pi^*}(x) = \min_{u \in U} \mathbb{E}_{w}\left[ \rho(x, u, w) + J^{\pi^*}\left(f(x, u, w)\right) \right] \tag{16}$$
When an accurate model is available, typical techniques to find near-optimal policies in an MDP framework are value iteration, policy iteration, direct policy search (40) and tree search algorithms such as optimistic planning (41). In (30), for example, an approximate value iteration approach is followed to determine a control policy for the control of a DHN.
In this work, a model-free approach is explored. Driven by promising results (42); (16); (36), Batch Reinforcement Learning (BRL) techniques are investigated, as detailed in Section 4.
4 DHN controller approach
This section describes a pragmatic control approach building upon recent results in BRL and market-based multi-agent systems.
The control approach, illustrated in Figure 2 and followed in this work, is based upon the Three Step Approach (TSA) presented in (14); (43), following a strategy similar to that of (21); (24). In a first step (1), all relevant (and practically available) state information is collected, e.g. the temperature information from the TCLs. From this information a limited set of features (17) is extracted, resulting in a low-dimensional representation of the system state. In a second step (2), a control action for the entire cluster of TCLs is extracted from a policy determined offline at given time intervals. In a third and last step (3), this control action is dispatched over the different TCLs using a market-based multi-agent system. This process is repeated following a receding-horizon approach. In the following, a more detailed description of the three steps is presented.
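The receding-horizon loop over the three steps can be sketched as follows; `observe`, `policy` and `dispatch` are placeholder callables standing in for the aggregation step, the offline-learned policy and the market-based dispatch, and the default horizon of 96 quarter-hours is an assumption:

```python
def three_step_control(observe, policy, dispatch, horizon=96):
    """Receding-horizon sketch of the Three Step Approach:
    (1) aggregate observations into features, (2) select a cluster-level
    control action from the (offline-learned) policy, (3) dispatch that
    action over the TCLs. All three callables are placeholders."""
    actions = []
    for t in range(horizon):
        x_t = observe(t)     # step 1: aggregated state features
        u_t = policy(x_t)    # step 2: cluster-level control action
        dispatch(u_t, t)     # step 3: market-based dispatch to devices
        actions.append(u_t)
    return actions
```

The point of the structure is that only step 3 touches individual devices; steps 1 and 2 operate on the cluster-level abstraction.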
4.1 Step 1: Aggregation
In the first step, state information as described in Section 3.3.1 is retrieved from the system. From a practical perspective, however, not all state information is readily available. At building level, only the air temperatures measured by the local thermostats are assumed available, i.e. $\{T^{\mathrm{air}}_{d,t}\}_{d}$. Furthermore, measurements of the outside air temperature are assumed available, as is the water temperature at a subset of nodes on the supply side of the DHN and on the return side. These observations are aggregated into a reduced state vector:

$$\hat{x}_t = \left( x^{\mathrm{t}}_t, \overline{T}^{\mathrm{air}}_t, \overline{T}^{\mathrm{sup}}_t, \overline{T}^{\mathrm{ret}}_t, T^{\mathrm{out}}_t \right) \tag{17}$$

$$\overline{T}^{\mathrm{air}}_t = \frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} T^{\mathrm{air}}_{d,t} \tag{18}$$

$$\overline{T}^{\mathrm{sup}}_t = \frac{1}{|\mathcal{N}^{\mathrm{sup}}|} \sum_{n \in \mathcal{N}^{\mathrm{sup}}} T_{n,t} \tag{19}$$

$$\overline{T}^{\mathrm{ret}}_t = \frac{1}{|\mathcal{N}^{\mathrm{ret}}|} \sum_{n \in \mathcal{N}^{\mathrm{ret}}} T_{n,t} \tag{20}$$
Although more generic dimension reduction techniques such as autoencoders can be used (44), the aim of this work is to understand what performance can be obtained starting from this limited state description.
Alternatively, a convolutional neural network as presented by the authors in (45) could be used to automatically extract relevant state-time features, allowing historic observations to be added to the state (16); (17).
To facilitate the dispatch step explained in Section 4.3, a bid function is defined for every TCL (5); (14); (46). In (5), the bid function of a device is expressed as the electric power consumed versus a heuristic priority $pr$. Above a corner value $pr^{\mathrm{corner}}_d$ the bid function is zero:

$$b_d(pr) = P_d \cdot \theta\left(pr^{\mathrm{corner}}_d - pr\right) \tag{21}$$
Determining this heuristic priority is considered relatively straightforward, as it requires only the air temperature measured by the thermostat and the upper and lower temperature bounds. Determining the thermal power extracted from the DHN by a TCL when switched on is less so (29). To relax this requirement, we assume an estimate of the flow rate $\dot{m}_d$ when switched on is available. The flow rate is defined as follows. First, the set temperature of the indoor heat supply system, which is a function of the outdoor temperature, is calculated. Then, a model for the substation heat exchanger is used to determine the flow rate extracted from the DHN at which the outlet temperature of the heat exchanger meets the set temperature. Using this value instead of the actual power results in the following bid function for building $d$:
$$b_d(pr) = \dot{m}_d \cdot \theta\left(pr^{\mathrm{corner}}_d - pr\right) \tag{22}$$

where $\theta$ denotes the Heaviside step function.
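A step-shaped bid function of this kind can be sketched as follows, assuming (as a simplification) a corner priority that is linear between the comfort bounds used in the evaluation scenario; the normalisation and function names are assumptions:

```python
def corner_priority(T_air, T_min=19.5, T_max=20.5):
    """Heuristic corner priority from the thermostat reading: 1 at the
    lower comfort bound, 0 at the upper bound (assumed normalisation)."""
    return min(1.0, max(0.0, (T_max - T_air) / (T_max - T_min)))

def bid(pr, T_air, flow_rate):
    """Step-shaped bid: the TCL requests its estimated flow rate for
    priorities up to its corner value, and zero above it."""
    return flow_rate if pr <= corner_priority(T_air) else 0.0
```

A device sitting at the lower comfort bound bids at the maximum priority, so it is served first by the dispatch.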
4.2 Step 2: Batch Reinforcement Learning
In the second step, a control action is selected once every 15 minutes, following the policy $\pi$. The control action is selected for the entire cluster and is projected onto individual control actions as described in Section 4.3.
One of the main goals of this work is to explore to what extent (model-free) reinforcement learning can be used to determine $\pi$. Reinforcement Learning (RL) is a model-free control approach that learns a policy by interacting with the system (40). In recent literature, RL (mainly in the form of Q-learning (48); (49); (50)) has been presented as an effective model-free learning approach for DR applications. Its practicality, however, suffers from slow convergence (15); (40) and the curse of dimensionality (17). These challenges can be partially mitigated by using past interactions and appropriate function approximators in a BRL strategy (40); (51). A popular BRL approach is that of Fitted Q-Iteration (FQI), introduced by Ernst et al. in (52), especially in combination with extremely randomized trees as regression technique (47). In (42) the authors conclude that, especially for nonlinear control problems, FQI can be a valuable alternative to MPC approaches, with the extra advantage that FQI is a blind technique. Moreover, FQI and MPC can strengthen each other (42); (53). Although several BRL techniques have been proposed in the literature (36); (52); (44), this work focuses on FQI using extremely randomized trees as regression algorithm (47). A comparison with the performance of other BRL approaches is considered outside the scope of this work.
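A minimal sketch of fitted Q-iteration on a batch of four-tuples is given below; a rounding-based averaging regressor stands in for the extremely randomized trees of (47), and the discretization, discount factor and function name are illustrative choices:

```python
from collections import defaultdict

def fitted_q_iteration(batch, actions, n_iter=50, gamma=0.95, ndigits=1):
    """Fitted Q-iteration on a batch of (x, u, c, x_next) tuples with a
    cost-minimizing convention. The 'regressor' is a crude rounding-based
    averager standing in for extremely randomized trees."""
    def key(x, u):
        # discretize the state so tuples with nearby states share a target
        return (round(x, ndigits), u)

    Q = defaultdict(float)  # unseen state-action pairs default to 0
    for _ in range(n_iter):
        targets = defaultdict(list)
        for x, u, c, x_next in batch:
            # regression target: cost plus discounted best successor value
            t = c + gamma * min(Q[key(x_next, a)] for a in actions)
            targets[key(x, u)].append(t)
        # 'fit' the next Q-function by averaging targets per key
        Q = defaultdict(float, {k: sum(v) / len(v) for k, v in targets.items()})
    return Q
```

On a toy batch where one action always incurs cost and the other never does, the iteration assigns the costly action a strictly higher Q-value.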
Following (52), an approximation $\widehat{Q}^*$ of the state-action value function is built on a daily basis from the batch of past transitions:

$$\mathcal{F} = \left\{ \left(\hat{x}_l, u_l, c_l, \hat{x}'_l\right) \right\}_{l=1}^{\#\mathcal{F}} \tag{23}$$
with the state vector as defined in equation (17). Algorithm 1 is used to obtain $\widehat{Q}^*$. During the day, the control action $u_t$ is selected with a probability defined by a Boltzmann distribution (54):

$$P(u \mid \hat{x}) = \frac{e^{-\widehat{Q}^*(\hat{x}, u)/\tau_d}}{\sum_{u' \in U} e^{-\widehat{Q}^*(\hat{x}, u')/\tau_d}} \tag{24}$$

The temperature $\tau_d$ is decreased on a daily basis according to a harmonic sequence (31); a high temperature results in more exploration, whilst $\tau_d \rightarrow 0$ results in a greedy approach:

$$\tau_d = \frac{\tau_0}{d} \tag{25}$$

with $d$ the day index and $\tau_0$ the initial temperature.
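The Boltzmann action selection with a harmonically decreasing temperature can be sketched as follows (cost-minimizing convention, so lower Q-values receive higher probability; the initial temperature is an illustrative value):

```python
import math
import random

def boltzmann_action(q_values, tau):
    """Sample an action index with Boltzmann probabilities over
    cost-based Q-values: lower Q => higher probability; as tau -> 0
    the choice becomes greedy."""
    weights = [math.exp(-q / tau) for q in q_values]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for u, w in enumerate(weights):
        acc += w
        if r <= acc:
            return u
    return len(q_values) - 1  # numerical fallback

def harmonic_tau(day, tau0=1.0):
    """Harmonic temperature schedule, decreased on a daily basis."""
    return tau0 / day
```

With a very small temperature the sampling is effectively greedy: the action with the lowest Q-value is chosen with probability close to one.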
In the peak shaving scenario, a control action is determined on a daily basis, defining the average power to be followed for the next day as detailed in (5).
4.3 Step 3: Real-time control
In the third step, the energy corresponding to $u_t$ is dispatched over the cluster of TCLs using a market-based multi-agent system (14); (55). Compared to the work in (5), there is a significant difference, as only the expected flow rate for each TCL is assumed available. To this end, a Proportional-Integral (PI) controller (at a central level) manages the flow rates at the different buildings. Since hydraulic effects occur nearly instantaneously in a DHN, this has a direct effect on the flow rate at the source, influencing the power at the source side, as the supply set point is assumed constant in the simulations (29). An overview of the real-time control can be seen in Figure 3. As described in Section 4.1, every TCL is represented by a bid function $b_d$. After a clearing process (26), a clearing priority $pr^*$ is sent back to the different devices:
$$u_{d,t} = \theta\left(pr^{\mathrm{corner}}_d - pr^*\right) \tag{26}$$

The devices open or close their local valves according to $pr^*$.
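The clearing step can be sketched as follows, under the simplifying assumption that each bid is fully described by a corner priority and a flow-rate contribution; the search over corner values and the function names are illustrative choices:

```python
def clear(bids, target):
    """Market clearing sketch: pick the clearing priority (searched over
    the corner values present in the bids) whose aggregated flow is
    closest to the cluster-level target.

    `bids` is a list of (corner_priority, flow_rate) pairs; a device
    contributes its flow whenever its corner reaches the clearing value.
    """
    candidates = sorted({pr for pr, _ in bids}) + [float('inf')]
    best_pr, best_err = float('inf'), abs(target)  # inf => no device on
    for pr in candidates:
        flow = sum(f for p, f in bids if p >= pr)
        err = abs(flow - target)
        if err < best_err:
            best_pr, best_err = pr, err
    return best_pr

def device_action(corner_pr, clearing_pr):
    """Each TCL opens its valve iff its corner priority reaches the
    broadcast clearing priority."""
    return corner_pr >= clearing_pr
```

Only the scalar clearing priority needs to be broadcast; each device decides locally by comparing it with its own corner value.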
5 Evaluation
To evaluate the performance of the controller described in Section 4, a set of simulations has been performed. In this section, first a condensed description of the simulation scenario is presented, after which the tracking performance of the controller is evaluated. The performance of the controller is evaluated for two distinct objectives, i.e. energy arbitrage on a day-ahead energy market and peak shaving/valley filling.
5.1 Simulation scenario
As mentioned in Section 1, the scenario is designed to be sufficiently demanding, yet simple enough to allow for an analysis that is not obfuscated by the complexity of the scenario. To this end, the (arbitrary) topology depicted in Figure 1 has been used, i.e. a central CHP provides heat to 100 TCLs connected to a radial DHN. Each building is located in one of four streets. The total length of the grid is 2.1 km, with pipe diameters ranging from DN25 to DN100. A detailed description of the simulation scenario can be found in (29). The TCL building models are lumped-capacitance models using an electric analogue comprising capacitances and resistors. The capacitance values are related to the temperature of the inside air, the building envelope and the heating system return temperature (9); (29). All 100 building models included in the simulation are derived from the same model. Different model parameters are used for each building by sampling capacitance and resistance values from a normal distribution with a standard deviation of 20% of the standard value. The standard values correspond to a detached house with a living area of 103 m² and a protected volume of 452 m³. The maximum standard power demand of the building is 9.8 kW at an internal temperature of 20 °C and an ambient temperature of -8 °C. Besides heat losses to the ambient air, wind-speed-dependent air infiltration losses are also included, as is uncontrolled heating due to solar irradiation and local electric consumption (29). For simplicity, the temperature constraints are set the same for all buildings, at 19.5 °C and 20.5 °C.
5.2 Tracking performance
A first supporting result is depicted in Figure 4, for which a numerical experiment was performed in which a random control action was selected every 15 minutes (within the technical specifications of the CHP). The corresponding set points are depicted in the lower part of Figure 4, as is the average thermal power produced by the CHP. The upper graph of Figure 4 depicts the internal temperatures of the 100 buildings and their average temperature. The graph shows that when the buildings are on average within their dead band, the requested power can be tracked accurately. Although the buildings have different physical parameters, they tend to synchronize with regard to their temperature relative to the comfort constraints. This is a direct effect of the dispatch dynamics, as the buildings with a higher priority are served first.
5.3 Energy arbitrage
In the scenario of energy arbitrage, the CHP can sell its electric energy directly on the wholesale market (2). The energy prices are taken from the Belgian day-ahead market (56); the gas price is set at 38.6 €/MWh (57). As day-ahead prices can be predicted with reasonable accuracy (58), they are considered deterministic in this evaluation. Furthermore, the offline policy calculation (Algorithm 1) is performed only on a daily basis. To quantify the performance, the following daily metric is used:

$$M_d = \frac{C^{\mathrm{def}}_d - C_d}{C^{\mathrm{def}}_d - C^{\mathrm{lb}}_d} \tag{27}$$

with $C^{\mathrm{def}}_d$ the daily cost of the default controller, $C^{\mathrm{lb}}_d$ the lower bound on the daily cost and $C_d$ the daily cost obtained with the solution presented in this work. A metric of 0 corresponds to the same performance as the default controller, whilst a metric of 1 corresponds to the lower-bound solution. From Figure 5 it is observed that the performance metric gradually increases to a value of around 60-70%, also for the colder days as depicted in the right column.
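The daily performance metric, comparing the achieved cost with the default controller and the theoretical lower bound, can be computed directly from the three daily costs; the function name is illustrative:

```python
def performance_metric(c_default, c_lb, c_actual):
    """Daily performance metric: 0 matches the default controller,
    1 matches the theoretical lower bound on the daily cost."""
    return (c_default - c_actual) / (c_default - c_lb)

# e.g. default cost 100, lower bound 60, achieved cost 74 -> metric 0.65
m = performance_metric(100.0, 60.0, 74.0)
```
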
A snapshot of the daily power profiles of a mature controller compared to the default controller is depicted in Figure 6. It can be seen that the controller produces heat when energy prices are high. It is worth noting that the default controller already has a reasonable performance, as the heat demand is typically largest when the wholesale price is also highest. This correlation is, however, expected to decrease as more renewable energy enters the production mix, making wholesale prices more volatile.
5.4 Peak shaving
In the second experiment, peak shaving is considered. The summarizing results are depicted in Figure 7. The top graph shows the daily peak in thermal power obtained with the default controller and with the BRL approach. The second graph gives the average daily ambient temperature. The daily maximum power peak indeed reduces with time. To make the performance more visible, the load duration curves are also plotted for the first and the last 50 days, both for the BRL approach and for the default controller. During the first 50 days there is limited improvement over the default controller; for the subsequent period, however, the effect of the controller is clearly observable from the load duration curves. A lower bound is depicted by plotting the average power corresponding to the daily energy consumed by the default control approach. Note that on day seven (top graph in Figure 7) the profile obtained with BRL results in a high peak power, which is attributed to an exploration step (Eq. (24)). The final performance is visualized more clearly in Figure 8, where the power profile of a mature controller is plotted. Indeed, the thermal power follows a near-constant profile compared to the default control case. A visualization of the policy obtained by the BRL controller is presented in Figure 9, where the average power set point is plotted versus the initial indoor temperature state and the expected average daily outdoor temperature. As expected, the set point decreases with increasing outside temperature and average air temperature.
6 Conclusions and future work
In this work, the control problem of a cluster of thermostatically controlled loads connected to a district heating network is addressed by assessing the performance of a control approach comprising a model-free reinforcement learning technique in combination with a market-based multi-agent system. The performance of the controller has been evaluated for two distinct scenarios, i.e. energy arbitrage and peak shaving. In the evaluation, a detailed district heating network model has been used, including hydraulic and thermal dynamics.
For the energy arbitrage scenario, solutions are obtained that reach over 65% of the available optimization potential after a learning period of 40 to 60 days. Knowing that the policy is updated on a daily basis, this is considered a promising result. For the peak shaving/valley filling scenario promising results have also been obtained, since a clear performance improvement is observed.
These results support a practical implementation and the coming of age of reinforcement learning techniques.
To understand the potential of completely model-free control, a direct implementation of fitted Q-iteration as presented in (40) has been used. However, as discussed in (42), combining general domain knowledge with a model-free approach is expected to result in improved performance at a reduced learning time. This domain knowledge can be incorporated through e.g. information regarding the shape of the policy (40) or through the use of a model, as in (53).
A second point of future research is directed at more automated feature extraction techniques such as autoencoders (16), which are also expected to result in a reduced learning time.
Footnotes
 journal: Energy and Buildings
 The outside temperature, solar irradiation and wind information are assumed constant over the DHN and the buildings.
 This can readily be extended to include information several time steps back.
 Only temperatures at a limited set of nodes are assumed available.
 This is done for practicality, as the simulations cover a time span of several months; in a real-life application, the policy should be constructed more frequently, following Algorithm 1.
 This is for practical reasons, as the calculation of a policy typically takes 20 minutes on an Intel 2.5 GHz machine with 8 GB of RAM.
References
 C. Johansson, On Intelligent District Heating, PhD Thesis, Blekinge Institute of Technology, Blekinge, Sweden, 2014.
 Y. Kitapbayev, J. Moriarty, P. Mancarella, Stochastic control and real options valuation of thermal storage-enabled demand response from flexible district energy systems, Applied Energy 137 (2015) 823-831.
 K. Sartor, S. Quoilin, P. Dewallef, Simulation and optimization of a CHP biomass plant and district heating network, Applied Energy 130 (0) (2014) 474 – 483.
 L. Ozgener, A. Hepbasli, I. Dincer, Performance investigation of two geothermal district heating systems for building applications: Energy analysis, Energy and Buildings 38 (4) (2006) 286 – 292. doi:http://dx.doi.org/10.1016/j.enbuild.2005.06.021.
 B. J. Claessens, S. Vandael, F. Ruelens, M. Hommelberg, Self-learning demand side management for a heterogeneous cluster of devices with binary control actions, in: Proc. 3rd IEEE Innov. Smart Grid Technol. Conf. (ISGT Europe), Berlin, Germany, 2012, pp. 1–8.
 F. De Ridder, B. Claessens, A trading strategy for industrial CHPs on multiple power markets, International Trans. on Electrical Energy Systems 24 (5) (2014) 677–697.
 S. Koch, Demand Response Methods for Ancillary Services and Renewable Energy Integration in Electric Power Systems, PhD thesis, University of Stuttgart, 2012.
 K. Vanthournout, R. D’hulst, D. Geysen, G. Jacobs, A smart domestic hot water buffer, IEEE Trans. on Smart Grid 3 (4) (2012) 2121–2127.
 G. Verbeeck, Optimisation of Extremely Low Energy Residential Buildings, PhD thesis, KULeuven, 2007.
 J. Kensby, A. Trüschel, J.-O. Dalenbäck, Potential of residential buildings as thermal energy storage in district heating systems – results from a pilot test, Applied Energy 137 (2015) 773–781.
 G. Sandou, S. Font, S. Tebbani, A. Hiret, C. Mondon, Predictive control of a complex district heating network, in: Proc. 44th IEEE Conference on Decision and Control, European Control Conference (CDC-ECC), 2005, pp. 7372–7377. doi:10.1109/CDC.2005.1583.
 J. Široký, F. Oldewurtel, J. Cigler, S. Prívara, Experimental analysis of model predictive control for an energy efficient building heating system, Applied Energy 88 (9) (2011) 3079–3087.
 J. L. Mathieu, M. Kamgarpour, J. Lygeros, D. S. Callaway, Energy arbitrage with thermostatically controlled loads, in: Proc. European Control Conference (ECC), IEEE, 2013, pp. 2519–2526.
 S. Vandael, B. J. Claessens, M. Hommelberg, T. Holvoet, G. Deconinck, A scalable three-step approach for demand side management of plug-in hybrid vehicles, IEEE Trans. on Smart Grid 4 (2) (2013) 720–728.
 S. Lange, T. Gabel, M. Riedmiller, Batch reinforcement learning, in: M. Wiering, M. van Otterlo (Eds.), Reinforcement Learning: State-of-the-Art, Springer, New York, NY, 2012, pp. 45–73.
 V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533.
 D. Bertsekas, J. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Nashua, NH, 1996.
 J. L. Mathieu, M. Kamgarpour, J. Lygeros, G. Andersson, D. S. Callaway, Arbitraging intraday wholesale energy market prices with aggregations of thermostatic loads, IEEE Trans. on Power Systems 30 (2) (2015) 763–772.
 S. Iacovella, F. Ruelens, P. Vingerhoets, B. J. Claessens, G. Deconinck, Cluster control of heterogeneous thermostatically controlled loads using tracer devices, IEEE Trans. on Smart Grid PP (99) (2015) 1–9. doi:10.1109/TSG.2015.2483506.
 E. Georges, B. Cornélusse, D. Ernst, Q. Louveaux, V. Lemort, S. Mathieu, Direct control service from residential heat pump aggregation with specified payback, in: Proceedings of the 19th Power Systems Computation Conference (PSCC), 2016.
 B. Biegel, P. Andersen, J. Stoustrup, M. B. Madsen, L. H. Hansen, L. H. Rasmussen, Aggregation and control of flexible consumers – a real life demonstration, in: Proc. of the 19th IFAC World Congress, IFAC, Cape Town, South Africa, 2014. doi:10.3182/20140824-6-ZA-1003.00718.
 M. Hu, A data-driven feedforward decision framework for building clusters operation under uncertainty, Applied Energy 141 (0) (2015) 229–237.
 S. Mathieu, D. Ernst, Q. Louveaux, An efficient algorithm for the provision of a day-ahead modulation service by a load aggregator, in: Proc. 4th IEEE Innovative Smart Grid Technologies Europe (ISGT Europe), 2013, pp. 1–5. doi:10.1109/ISGTEurope.2013.6695247.
 J. Mathieu, S. Koch, D. Callaway, State estimation and control of electric loads to manage realtime energy imbalance, IEEE Trans. on Power Systems 28 (1) (2013) 430–440. doi:10.1109/TPWRS.2012.2204074.
 W. Zhang, K. Kalsi, J. Fuller, M. Elizondo, D. Chassin, Aggregate model for heterogeneous thermostatically controlled loads with demand response, in: Proc. IEEE Power and Energy Society General Meeting, 2012, pp. 1–8.
 N. Gatsis, G. Giannakis, Residential load control: Distributed scheduling and convergence with lost AMI messages, IEEE Trans. on Smart Grid 3 (2) (2012) 770–786. doi:10.1109/TSG.2011.2176518.
 M. G. C. Bosman, Planning in Smart Grids, PhD Thesis, University of Twente, 2012.
 B. Biegel, J. Stoustrup, P. Andersen, Distributed model predictive control via dual decomposition, Intelligent Systems, Control and Automation: Science and Engineering 69.
 D. Vanhoudt, B. Claessens, R. Salenbien, J. Desmedt, The use of distributed thermal storage in district heating grids for demand side management, https://arxiv.org/abs/1702.06005, under review at Elsevier: Energy and Buildings (Feb 2017). arXiv:1702.06005.
 E. Ikonen, I. Selek, J. Kovacs, M. Neuvonen, Z. Szabo, J. Bene, J. Peurasaari, Short term optimization of district heating network supply temperatures, in: Proc. IEEE International Energy Conference (ENERGYCON), 2014, pp. 996–1003. doi:10.1109/ENERGYCON.2014.6850547.
 W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd Edition, Wiley, 2011.
 S. Grosswindhager, A. Voigt, M. Kozek, Predictive control of district heating network using fuzzy DMC, in: Proc. of International Conference on Modelling, Identification Control (ICMIC), 2012, pp. 241–246.
 P. Pinson, T. Nielsen, H. Nielsen, N. Poulsen, H. Madsen, Temperature prediction at critical points in district heating systems, European Journal of Operational Research 194 (1) (2009) 163 – 176.
 P. Booij, V. Kamphuis, O. Pruisen, C. Warmer, Multi-agent control for integrated heat and electricity management in residential districts, in: Proc. 4th International Workshop on Agent Technologies for Energy Systems (ATES 2013), 2013.
 Q. Gemine, D. Ernst, B. Cornélusse, Active network management for electrical distribution systems: problem formulation and benchmark, CoRR abs/1405.2806, http://arxiv.org/abs/1405.2806.
 R. Fonteneau, S. Murphy, L. Wehenkel, D. Ernst, Batch mode reinforcement learning based on the synthesis of artificial trajectories, Annals of Operations Research 208 (1) (2013) 383–416. doi:10.1007/s10479-012-1248-5.
 F. Ruelens, B. J. Claessens, S. Vandael, B. De Schutter, R. Babuška, R. Belmans, Residential demand response of thermostatically controlled loads using batch reinforcement learning, IEEE Trans. on Smart Grid PP (99) (2016) 1–11. doi:10.1109/TSG.2016.2517211.
 P. Valdimarsson, Modelling of Geothermal District Heating Systems, PhD thesis, University of Iceland, 1993.
 A. Benonysson, Dynamic Modelling and Operational Optimization of District Heating Systems, PhD thesis, Technical University of Denmark (DTU), 1991.
 L. Busoniu, R. Babuška, B. De Schutter, D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators, 1st Edition, CRC Press, 2010.
 L. Busoniu, R. Munos, R. Babuška, A Survey of Optimistic Planning in Markov Decision Processes, John Wiley & Sons, Inc., 2013, pp. 494–516. doi:10.1002/9781118453988.ch22.
 D. Ernst, M. Glavic, F. Capitanescu, L. Wehenkel, Reinforcement learning versus model predictive control: a comparison on a power system problem, IEEE Trans. Syst., Man, Cybern., Syst. 39 (2) (2009) 517–529.
 A. Aertgeerts, Demand side management of the thermal flexibility in a residential neighborhood using a hierarchical market-based multi-agent system, Master's thesis, KULeuven, 2014.
 S. Lange, M. Riedmiller, Deep autoencoder neural networks in reinforcement learning, in: Proc. IEEE 2010 Int. Joint Conf. on Neural Networks (IJCNN), Barcelona, Spain, 2010, pp. 1–8.
 B. J. Claessens, P. Vrancx, F. Ruelens, Convolutional neural networks for automatic statetime feature extraction in reinforcement learning applied to residential load control, IEEE Transactions on Smart Grid PP (99) (2016) 1–1. doi:10.1109/TSG.2016.2629450.
 K. De Craemer, G. Deconinck, Balancing trade-offs in coordinated PHEV charging with continuous market-based control, in: Proc. 3rd IEEE PES International Conference and Exhibition on Innovative Smart Grid Technologies (ISGT Europe), 2012, pp. 1–8. doi:10.1109/ISGTEurope.2012.6465685.
 P. Geurts, D. Ernst, L. Wehenkel, Extremely randomized trees, Machine Learning 63 (1) (2006) 3–42.
 Z. Wen, D. O'Neill, H. Maei, Optimal demand response using device-based reinforcement learning, IEEE Trans. on Smart Grid 6 (5) (2015) 2312–2324. doi:10.1109/TSG.2015.2396993.
 E. C. Kara, M. Berges, B. Krogh, S. Kar, Using smart devices for systemlevel management and control in the smart grid: A reinforcement learning framework, in: Proc. 3rd IEEE Int. Conf. on Smart Grid Commun. (SmartGridComm), Tainan, Taiwan, 2012, pp. 85–90.
 E. Mocanu, P. H. Nguyen, W. L. Kling, M. Gibescu, Unsupervised energy prediction in a smart grid context using reinforcement cross-building transfer learning, Energy and Buildings 116 (2016) 646–655. doi:10.1016/j.enbuild.2016.01.030.
 X. Xu, L. Zuo, Z. Huang, Reinforcement learning algorithms with function approximation: Recent advances and applications, Information Sciences 261 (0) (2014) 1–31.
 D. Ernst, P. Geurts, L. Wehenkel, Tree-based batch mode reinforcement learning, Journal of Machine Learning Research 6 (2005) 503–556.
 T. Lampe, M. Riedmiller, Approximate model-assisted neural fitted Q-iteration, in: Proc. IEEE International Joint Conference on Neural Networks (IJCNN 2014), Beijing, China, 2014.
 R. Anderson, A. Boulanger, W. Powell, W. Scott, Adaptive stochastic control for the smart grid, Proceedings of the IEEE 99 (2011) 1098–1115. doi:10.1109/JPROC.2011.2109671.
 M. Hommelberg, B. van der Velde, C. Warmer, I. Kamphuis, J. Kok, A novel architecture for real-time operation of multi-agent based coordination of demand and supply, in: Proc. IEEE Power and Energy Society General Meeting – Conversion and Delivery of Electrical Energy in the 21st Century, 2008, pp. 1–5. doi:10.1109/PES.2008.4596531.
 Belpex – Belgian power exchange, http://www.belpex.be/, [Online: accessed March 21, 2016].
 Eurostat, Gas prices for industrial consumers 2013, http://epp.eurostat.ec.europa.eu.
 A. Conejo, M. Plazas, R. Espinola, A. Molina, Dayahead electricity price forecasting using the wavelet transform and ARIMA models, IEEE Trans. on Power Systems 20 (2) (2005) 1035 – 1042. doi:10.1109/TPWRS.2005.846054.