Machine Learning on EPEX Order Books:
Insights and Forecasts
Abstract
This paper employs machine learning algorithms to forecast German electricity spot market prices. The forecasts utilize in particular bid and ask order book data from the spot market but also fundamental market data like renewable infeed and expected demand. Appropriate feature extraction for the order book data is developed. Using cross-validation to optimise hyperparameters, neural networks and random forests are proposed and compared to statistical reference models. The machine learning models outperform traditional approaches.
Keywords: Machine Learning, Neural Networks, Random Forests, Electricity Market, Renewables, Spot Price, Forecasting.
1 Introduction
Forecasting electricity prices is an important task for an energy utility. It is needed not only for proprietary trading but also for the optimisation of power plant production schedules and other technical issues. A promising approach in power price forecasting is based on a recalculation of the order book using forecasts of market fundamentals like demand or renewable infeed. However, this approach requires extensive statistical analysis of market data. In this paper, we examine if and how this statistical work can be reduced using machine learning. Our paper focuses on two research questions:

How can order books from electricity markets be included in machine learning algorithms?

How can order-book-based spot price forecasts be improved using machine learning?
We consider the German/Austrian EPEX spot market for electricity. There is a daily auction for electricity with delivery the next day. All 24 hours of the day are traded as separate products. Figure 1 shows auction results on different time scales. The pronounced seasonality of prices is visible as well as their high volatility. Another interesting property is that, in contrast to price series of other commodities or stocks, electricity prices may become negative.
In the following, we briefly explain the idea of order-book-based price forecasts. Each price is the result of an auction, which can be represented as a bid and an ask curve. For a particular hour, those curves are shown in Figure 2. The intersection of the bid (purchase, demand) and ask (sell, supply) curve is the market clearing price (MCP). In the magnified figure, it is clearly visible that the bid and ask curves are step functions. Each step width is the cumulated volume which market participants have put in the auction at a certain price. Price levels correspond in fact to the marginal production costs of different power plants. Due to the regulatory environment, renewables in particular bid at negative prices in the auction. Moreover, in contrast to a classical power plant, the produced amount of renewable energy is stochastic and total expected production is sold on the exchange. Relying on those economic circumstances, order-book-based forecasting modifies the volumes at different price levels in the bid and ask curves. The modifications correspond to the forecasted wind and solar power infeed. An important issue is which price levels are influenced by the renewable infeed. Usually, energy utilities use exhaustive statistical analysis on historical data to identify the price levels and the impact of the renewable forecasts. In fact, there are also other fundamental factors which influence the market price, first of all the expected electricity demand. This paper focuses on machine learning methods to reduce the effort for building a forecast model.
In the following section we give an overview of existing literature on the economics of electricity markets, order-book-based models and the use of machine learning in price forecasting. In Section 3 we detail our methodology. Section 4 is devoted to numerical results and a comparison to other models from the literature. Section 5 concludes.
2 Existing literature on price forecasting and machine learning in electricity markets
Solar and wind energy is playing a more and more prominent role in today’s electricity markets. Empirical studies show that renewable electricity generation is both highly volatile and has a substantial impact on the day-ahead electricity price (Wagner (2014)). Using multivariate regression methods, various authors have quantified the influence renewable infeed has on the price (Cludius et al. (2014); Würzburg et al. (2013)). This influence can easily be seen graphically, cf. Figure 3. Therefore, we also use expected solar and wind infeed as features for the price forecasts.
There is a vast body of literature on electricity price forecasting, over which Aggarwal et al. (2009) give an early overview. Their survey covers 47 papers published between 1997 and 2006 with topics ranging from game theoretic to time series and machine learning models. A more recent extensive literature overview is given by Weron (2014), in which the author distinguishes and describes five model classes for electricity price forecasting, namely game-theoretic, fundamental, reduced-form, statistical and machine learning models. In an empirical study he finds the latter two to yield the best results. The article closes with a discussion of future challenges in the field, including the issues of feature selection, probabilistic forecasts, combined estimators, model comparability and multivariate factor models. Regarding this last aspect, Ziel & Weron (2018) conduct an empirical comparison of different univariate and multivariate model structures for price forecasting. Comparing a total of 58 models on several datasets, they find that there is no single modelling framework that consistently achieves the best results.
Statistical methods which have been applied to price forecasting include, for example, dynamic regression and transfer functions (Nogales et al. (2002)), wavelet transformation followed by an ARIMA model (Conejo et al. (2005)) and weighted nearest neighbor techniques (Troncoso et al. (2007)). There are many applications of machine learning methods in electricity price forecasting. Amjady (2006) compares the performance of a fuzzy neural network with one hidden layer to ARIMA, wavelet-ARIMA, multilayer perceptron and radial basis function network models for the Spanish market. Chen et al. (2012) also use a neural network with one hidden layer and a special training technique called extreme learning machine on Australian data. On the same market, Mosbah & El-Hawary (2016) train a multilayer neural network on temperature, total demand, gas price and electricity price data of the year 2005 to predict hourly electricity prices for January 2006. In order to show the superior performance of neural networks compared to time series approaches, Keles et al. (2016) conduct an extensive study focussing on the important topics of variable selection and hyperparameter optimisation. They select the most predictive features via a k-nearest neighbor backward elimination approach and employ 6-fold cross-validation to optimise forecasting performance over several hyperparameters of the neural network. The resulting network is found to outperform the benchmark models substantially. Recently, more sophisticated types of neural networks have been used: In a benchmark study, Lago et al. (2018) compare feedforward neural networks with up to 2 hidden layers, radial basis function networks, deep belief networks, convolutional neural networks, simple recurrent neural networks, LSTM and GRU networks to several statistical and also to other machine learning methods like random forests and gradient boosting. Using the Diebold-Mariano test, they show the deep feedforward, GRU and LSTM network approaches to perform significantly better than most of the other methods on Belgian market data. Marcjasz et al. (2018) consider a nonlinear autoregressive (NARX) neural-network-type model which especially accounts for the long-term price seasonality. Also using the Diebold-Mariano test, they show that this approach can improve the accuracy of day-ahead forecasts relative to the corresponding ARX benchmark.
Among the features considered in the aforementioned studies, historical electricity prices, total demand series, total demand prognoses, renewable infeed forecasts, weather data and calendar information appear on a regular basis. On the other hand, to the best of our knowledge, the first to use supply and demand curves for price prediction are Ziel & Steinert (2016). Their goal is to fill the gap between time series analysis and structural analysis by setting up a time series model for these curves and then forecasting the future market clearing price as the intersection of the corresponding forecasted curves. They compare multiple time series prediction methods based on this approach. However, they do not investigate whether the performance of their model can be enhanced by machine learning techniques.
3 Methodology
Data preparation and feature extraction from order book
Our dataset ranges from 1/2/2015 to 18/9/2018 (31823 single auctions) and includes order book data from the EPEX German/Austrian electricity spot market, transparency data from EEX on expected wind and solar power infeed, and expected total demand data from ENTSO-E. To avoid data dredging, 20% (about 9 months) of the available data at the end of the time period are held back for an out-of-sample model evaluation (see Section 4).
For feature extraction, i.e., translating the order book into a vector of numbers, we use ideas from Coulon et al. (2014) and Ziel & Steinert (2016). Let $\mathcal{P}$ be the set of possible prices and $\mathcal{T}$ the set of time points for which there are data available. Each $t \in \mathcal{T}$ is a tuple $t = (d, h)$ consisting of a date $d$ and an hour $h$. We represent the supply and demand data at time $t$ as vectors $(V^S_t(p))_{p \in \mathcal{P}}$ and $(V^D_t(p))_{p \in \mathcal{P}}$, where $V^S_t(p)$ and $V^D_t(p)$ denote the supply and demand volume, respectively, bid at price level $p$. The market clearing price at time $t$ is determined by EPEX via the EUPHEMIA algorithm, which also considers complex orders. There is no information about such orders in our dataset, so it would be unreasonable to expect any learning algorithm to incorporate them into its price prediction. Therefore, we calculate the market clearing price $p^*_t$ that would result from considering only the available supply and demand data and use this as the target value for price prediction. To this end, we define the so-called supply and demand curves
$$S_t(p) = \sum_{x \in \mathcal{P},\, x \le p} V^S_t(x), \tag{1}$$
$$D_t(p) = \sum_{x \in \mathcal{P},\, x \ge p} V^D_t(x). \tag{2}$$
The MCP lies at the intersection of the supply and demand curves. As $S_t$ and $D_t$ are step functions, explicit formulae for $p^*_t$ are quite technical and therefore omitted. We refer to Figure 2 for a graphical illustration. To reduce the dimensionality, we partition $\mathcal{P}$ into price classes and use the volumes per price class as features. To determine the price classes we use a heuristic which aims to achieve that all price intervals contain the same amount of volume on average. This algorithm ensures that there are more price classes at the interesting parts of the curve, i.e., in the price regions with many bids. We begin by averaging the supply and demand curves over all time points. Then, we fix a volume $V^*$ that each price class is supposed to contain on average and choose price class boundaries $c^S_0 < c^S_1 < \dots$ for supply and $c^D_0 < c^D_1 < \dots$ for demand accordingly.
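The equal-volume heuristic described above can be illustrated with a small sketch. It is only one possible reading of the idea, not the exact procedure used in this paper: a greedy pass over the time-averaged volumes closes a price class as soon as the accumulated volume reaches the target. The function names `equal_volume_boundaries` and `class_volumes` and the parameter `target_volume` are hypothetical.

```python
from bisect import bisect_left


def equal_volume_boundaries(prices, avg_volumes, target_volume):
    """Greedy sketch: close a price class once it holds about target_volume.

    prices       ascending list of price levels
    avg_volumes  time-averaged volume bid at each price level
    """
    boundaries = [prices[0]]
    accumulated = 0.0
    for price, volume in zip(prices, avg_volumes):
        accumulated += volume
        # Only close a class at a price strictly above the last boundary.
        if accumulated >= target_volume and price > boundaries[-1]:
            boundaries.append(price)
            accumulated = 0.0
    if boundaries[-1] != prices[-1]:
        boundaries.append(prices[-1])  # last class may hold less volume
    return boundaries


def class_volumes(prices, volumes, boundaries):
    """Volume of one auction per class (b[i], b[i+1]]; b[0] joins the first class."""
    features = [0.0] * (len(boundaries) - 1)
    for price, volume in zip(prices, volumes):
        index = max(bisect_left(boundaries, price) - 1, 0)
        features[index] += volume
    return features
```

Because the pass is greedy, dense regions of the averaged curve produce narrow classes and sparse regions produce wide ones, which is the behaviour the heuristic aims for. Single price levels that carry much more than the target volume still end up in one class.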
Again, the mathematical details are rather technical (see also Ziel & Steinert (2016)). However, the graphical illustration in Figure 4 should make the idea intuitively clear. As with the original supply and demand curves, one can calculate the price $\tilde{p}_t$ that results from the price classes. Of course, in general, $\tilde{p}_t$ does not exactly coincide with the actual market clearing price $p^*_t$.
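Finally, in order to simplify both implementation and interpretation without losing any essential information, we transform the supply and demand features into a so-called price curve. For this, let $c_0 < c_1 < \dots < c_M$ be the ascendingly ordered union of the supply and demand price class boundaries. Now, we define new price classes
$$C_1 = [c_0, c_1], \qquad C_i = (c_{i-1}, c_i], \quad i = 2, \dots, M, \tag{3}$$
and volume features
$$V_t^{(i)} = \sum_{p \in C_i} \left( V^S_t(p) + V^D_t(p) \right), \quad i = 1, \dots, M, \tag{4}$$
where $V^S_t(p)$ and $V^D_t(p)$ are the supply and demand volumes bid at price level $p$ at time $t$.
We use these price curve features and additionally the total demand as inputs for the price prediction. Figure 5 shows an example of such a price curve calculated from given supply and demand curves.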
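To make the price curve transformation concrete, the following sketch computes a clearing price in two ways: directly from cumulative supply and demand step curves, and from a cumulative price curve compared against the total demand volume. It is a simplified illustration under stated assumptions, not the EUPHEMIA algorithm and not necessarily the exact convention used here: the clearing price is taken as the lowest grid price at which supply covers demand, and the function names are hypothetical.

```python
def clearing_price_from_curves(prices, supply, demand):
    """Lowest price p with S(p) >= D(p) on an ascending price grid."""
    for k, price in enumerate(prices):
        cumulative_supply = sum(supply[: k + 1])  # volume offered at prices <= p
        remaining_demand = sum(demand[k:])        # volume demanded at prices >= p
        if cumulative_supply >= remaining_demand:
            return price
    return None  # the curves do not cross on this grid


def clearing_price_from_price_curve(prices, supply, demand):
    """Same price, found where the cumulative price curve reaches total demand."""
    total_demand = sum(demand)
    cumulative_supply = 0.0
    demand_below = 0.0  # demand bids strictly below the current price
    for price, s, d in zip(prices, supply, demand):
        cumulative_supply += s
        # Demand bids below p are treated like additional supply offers.
        if cumulative_supply + demand_below >= total_demand:
            return price
        demand_below += d
    return None
```

Both functions return the same price because the condition S(p) >= D(p) is equivalent to S(p) plus the demand volume bid below p being at least the total demand volume. Other tie-breaking conventions at the steps are possible and would shift the result by at most one grid point.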
There is also an economic interpretation for this transformation: In fact, electricity demand is highly price-inelastic, so the constant inelastic demand is the expected total demand for electricity at that hour. The price curve is the so-called merit order, which represents the electricity production units sorted by their variable production costs. For more details, we refer to standard literature on electricity markets like Burger et al. (2014). Note that the price curve still contains the information that is necessary to calculate the resulting price: The MCP lies at the intersection of the cumulative price curve and the constant inelastic demand. In addition to the price curve and inelastic demand, we use renewable infeed and total demand forecasts as features as well as some calendar information, namely

year as a numerical variable,

a binary variable on daylight saving time,

type of day as a one-hot-encoded categorical variable with three different values (workday, Saturday/bridge day, Sunday/holiday),

month, and

hour.
To account for the periodicity of months and hours, we project these values on a circle and use the two-dimensional projections as features. For example, if date $d$ lies in month $m \in \{1, \dots, 12\}$, this is encoded as
$$\left( \cos\frac{2\pi m}{12},\ \sin\frac{2\pi m}{12} \right). \tag{5}$$
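The circular projection can be sketched in a few lines. The helper name `encode_cyclic` is hypothetical, and the order of the cosine and sine components is an arbitrary choice of this sketch.

```python
import math


def encode_cyclic(value, period):
    """Project a periodic value (month of 12, hour of 24) onto the unit circle."""
    angle = 2.0 * math.pi * value / period
    return (math.cos(angle), math.sin(angle))


# December and January end up as close to each other as January and February.
december, january, february = (encode_cyclic(m, 12) for m in (12, 1, 2))
gap_dec_jan = math.dist(december, january)
gap_jan_feb = math.dist(january, february)
```

The benefit over a plain integer feature is that the end of the cycle is adjacent to its start, so hour 23 and hour 0 are neighbours in feature space.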
For prediction, we use the price curve features of a preceding day, the so-called reference date $r(d)$. For notational convenience, we write $r(t) = (r(d), h)$ for $t = (d, h)$. As a reference date for $d$ we use the nearest day before $d$ which is of the same type of day as $d$. This is a simple but efficient technique in energy economics. More sophisticated methods to define a reference date may incorporate similarities in renewable infeed and demand profile.
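A minimal sketch of the reference date rule is given below. It assumes that holidays and bridge days are supplied by the caller as sets of dates; the function names, the type labels and the `max_lookback` safeguard are hypothetical.

```python
from datetime import date, timedelta


def day_type(d, holidays=frozenset(), bridge_days=frozenset()):
    """Classify a date into one of the three day types used as calendar feature."""
    if d in holidays or d.weekday() == 6:       # weekday() == 6 is Sunday
        return "sunday_or_holiday"
    if d in bridge_days or d.weekday() == 5:    # weekday() == 5 is Saturday
        return "saturday_or_bridge_day"
    return "workday"


def reference_date(d, holidays=frozenset(), bridge_days=frozenset(), max_lookback=30):
    """Nearest date before d that has the same day type as d."""
    target = day_type(d, holidays, bridge_days)
    for offset in range(1, max_lookback + 1):
        candidate = d - timedelta(days=offset)
        if day_type(candidate, holidays, bridge_days) == target:
            return candidate
    raise ValueError("no reference date found within max_lookback days")
```

For a Tuesday this returns the preceding Monday, for a Monday the preceding Friday, and for a public holiday the most recent Sunday or holiday.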
Training of learning algorithms
We employ ordinary linear regression, random forests and feedforward neural networks to predict hourly electricity prices. Note that we use the prices which are implied by the volume features as target values, which means that the prices we aim to forecast attain only finitely many values determined by the price classes. On the whole dataset, the absolute difference between these price approximations and the real prices is EUR/MWh on average (corresponding to a median absolute percentage deviation of ). While we assume ordinary linear regression to be well-known, we give a brief description of the machine learning algorithms we consider. In each case, our goal is to approximate the function $f$ which maps the features described above to the corresponding electricity price. To this end, we assume that we are given a set of training data $(x_i, y_i)_{i = 1, \dots, n}$ where
$$y_i = f(x_i) + \varepsilon_i, \quad i = 1, \dots, n, \tag{6}$$
and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$ is a vector of realizations of independent random variables with zero expectation and equal variance.
Random forests
Random forests are based on a simpler machine learning method called decision trees (Hastie et al. (2001), chapter 9.2).
While decision trees are easy to understand, they often perform rather poorly because of their high dependence on the training data. Random forests aim to overcome this drawback by averaging the predictions of several decision trees that are trained in a randomized way proceeding from the same data (Breiman (2001)). As part of their training process, random forests offer a convenient way to assess the influence of each feature on the output. Therefore, they can deliver a ranking of the features according to their relevance for electricity price prediction. While it is quite interesting in its own right, we also use this ranking for feature selection, i.e., for training a feedforward neural network only on the most important features (e.g. the 10 or 20 most influential ones).
Feedforward neural networks
Feedforward neural networks can be viewed as a far-reaching nonlinear extension to ordinary linear regression. They consist of several layers, through which the input is fed via the composition of nonlinear activation functions and weighted sums in order to generate the output. The smallest unit (one vector component) of such a layer is called a neuron. A central result in the theory of neural networks states that, using a nonconstant, bounded and continuous activation function, a neural network with just one hidden layer can in principle approximate any continuous function arbitrarily well when there are sufficiently many neurons and appropriate weights are chosen (Hornik (1991)). In practice, a higher number of layers has been found to improve performance for many applications (deep learning). Besides the number of hidden layers and the number of neurons per layer, there are other so-called hyperparameters on which forecasting performance can critically depend. For instance, the optimisation algorithm that is used to train the network has to be chosen. Typically, some variant of stochastic gradient descent (SGD) like rmsprop (Tieleman & Hinton (2012)) or Adam (Kingma and Ba (2015)) is used. Furthermore, SGD-type algorithms work with batches of training data. The batch size can be varied in order to improve performance. Other hyperparameters which we consider include the number of epochs, i.e., the number of times the training data are fed into the optimisation algorithm, the activation function (hyperbolic tangent, rectified linear unit, identity) and whether or not to employ dropout to avoid overfitting (Srivastava et al. (2014)) and batch normalization to avoid internal covariate shift (Ioffe and Szegedy (2015)).
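The layer structure described above can be illustrated by a bare forward pass in plain Python. This is only a sketch of how weighted sums and activation functions are composed; it contains no training, dropout or batch normalization. The function names and the random initialisation range are hypothetical and not taken from the models in this paper.

```python
import math
import random

ACTIVATIONS = {
    "tanh": math.tanh,
    "relu": lambda z: max(0.0, z),
    "identity": lambda z: z,
}


def init_network(n_inputs, hidden_sizes, seed=0):
    """Random weights for the hidden layers plus one linear output neuron."""
    rng = random.Random(seed)
    sizes = [n_inputs] + list(hidden_sizes) + [1]
    return [
        ([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)  # biases start at zero
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(network, x, activation="tanh"):
    """Feed x through all layers; the output layer stays linear (regression)."""
    act = ACTIVATIONS[activation]
    values = list(x)
    for index, (weights, biases) in enumerate(network):
        sums = [sum(w * v for w, v in zip(row, values)) + b
                for row, b in zip(weights, biases)]
        is_output = index == len(network) - 1
        values = sums if is_output else [act(s) for s in sums]
    return values[0]


# Three hidden layers with five neurons each, written as [5, 5, 5].
network = init_network(n_inputs=4, hidden_sizes=[5, 5, 5])
prediction = forward(network, [0.3, -1.2, 0.5, 2.0], activation="tanh")
```

With the identity activation and zero biases the network collapses to a linear map, which is one way to see it as an extension of linear regression.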
Hyperparameter optimisation via cross-validation
We choose the hyperparameter values for the neural networks and random forests using five-fold cross-validation. First, we define a grid of hyperparameter combinations to be evaluated. Then, for every combination of hyperparameters in the grid, we split our training dataset into five parts or folds of equal size, train a model with these values on four of the folds and evaluate its performance on the remaining one. After repeating this five times, each time with a different validation fold, we average performances. Finally, once the whole grid has been evaluated, we choose the hyperparameter combination that performs best on average.
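The grid search with five-fold cross-validation can be sketched as follows. The sketch is generic and uses a one-dimensional k-nearest-neighbour regressor purely as a stand-in model so that it stays self-contained; it is not one of the models trained in this paper. Contiguous folds are an assumption of this sketch, and all function names are hypothetical.

```python
import itertools
import math


def k_fold_indices(n_samples, n_folds=5):
    """Split range(n_samples) into contiguous folds of (almost) equal size."""
    bounds = [round(i * n_samples / n_folds) for i in range(n_folds + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(n_folds)]


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def cross_validate(fit, X, y, params, n_folds=5):
    """Average validation RMSE of fit(X_train, y_train, **params) over all folds."""
    scores = []
    for fold in k_fold_indices(len(X), n_folds):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train], **params)
        scores.append(rmse([y[i] for i in fold], [predict(X[i]) for i in fold]))
    return sum(scores) / len(scores)


def grid_search(fit, X, y, grid, n_folds=5):
    """Evaluate every combination in the grid; return (best_params, best_score)."""
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_validate(fit, X, y, params, n_folds)
        if best is None or score < best[1]:
            best = (params, score)
    return best


def fit_knn(X_train, y_train, k):
    """Stand-in model: average the targets of the k nearest training points."""
    def predict(x):
        nearest = sorted(range(len(X_train)), key=lambda i: abs(X_train[i] - x))[:k]
        return sum(y_train[i] for i in nearest) / len(nearest)
    return predict
```

A call such as `grid_search(fit_knn, X, y, {"k": [1, 3, 25]})` returns the value of `k` with the lowest average validation RMSE. Replacing `fit_knn` by a function that trains a random forest or a neural network gives the procedure described above.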
Summarising, the features we use to forecast the spot price of a time point $t = (d, h)$ with reference time point $r(t) = (r(d), h)$ are

the total demand and the price curve features of the same hour on the reference day, i.e., at time point $r(t)$,

the solar and wind infeed forecasts as well as the total demand forecast for the time points $t$ and $r(t)$,

the calendar features year, daylight saving time, type of day, month and hour for the time points $t$ and $r(t)$.
We considered about 100 different parameter combinations for the random forests with the number of trees equal to , , , , or . For the neural networks, we tested over 1000 parameter combinations with about 20 different network sizes ranging from one hidden layer with 5 neurons to hidden layers with 25 neurons each.
4 Results
To evaluate model performance, we primarily use the root-mean-square error
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2},$$
where $\hat{y}_1, \dots, \hat{y}_n$ are the predictions, $y_1, \dots, y_n$ are the true target values and $n$ is the number of observations for which a prediction is made. Furthermore, we consider the mean absolute error
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|$$
as a more interpretable measure of how far off the prediction is on average. The RMSE is the error measure which the machine learning algorithms aim to minimize during training. Accordingly, we select the model architecture that performs best in the 5-fold cross-validation with respect to the RMSE. In the electricity forecasting literature, sometimes the mean absolute percentage error (MAPE) is used. This is unsuitable for the German market, as often the MCP is at or close to zero. Therefore, we report the median absolute percentage error
$$\mathrm{MdAPE} = \operatorname*{median}_{i = 1, \dots, n} \left| \frac{\hat{y}_i - y_i}{y_i} \right|$$
for comparison.
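The three error measures are straightforward to compute. The sketch below uses only the Python standard library. Skipping observations with a target of exactly zero in the MdAPE is an assumption of this sketch, since the percentage error is undefined there.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mdape(y_true, y_pred):
    """Median absolute percentage error in percent; zero targets are skipped."""
    ratios = [abs(p - t) / abs(t) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * statistics.median(ratios)


# Toy prices in EUR/MWh, including a negative price.
actual = [10.0, 20.0, -5.0, 40.0]
forecast = [12.0, 18.0, -4.0, 30.0]
errors = (rmse(actual, forecast), mae(actual, forecast), mdape(actual, forecast))
```

The single large miss of 10 EUR/MWh dominates the RMSE much more than the MAE, while the median in the MdAPE is hardly affected by it.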
Aside from the methods which were described in Section 3, we consider two benchmarks. The first one is called the naive benchmark (Nogales et al. (2002)). Its forecast for hour $h$ of date $d$ is the price at hour $h$ of the previous day if $d$ is a workday other than Monday and the price at hour $h$ of the same type of day in the previous week otherwise.
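The naive benchmark can be written down in a few lines. The sketch assumes that historical prices are stored in a dictionary keyed by date and hour, and it ignores public holidays for brevity, so "workday" simply means Monday to Friday here. The function name is hypothetical.

```python
from datetime import date, timedelta


def naive_forecast(prices, d, h):
    """Naive benchmark: yesterday's price for Tuesday to Friday, last week's otherwise.

    prices  dict mapping (date, hour) to the observed price
    """
    if 1 <= d.weekday() <= 4:            # Tuesday (1) to Friday (4)
        source = d - timedelta(days=1)
    else:                                # Monday, Saturday, Sunday
        source = d - timedelta(days=7)
    return prices[(source, h)]


history = {
    (date(2018, 9, 8), 8): 30.0,    # Saturday
    (date(2018, 9, 10), 8): 45.0,   # Monday
    (date(2018, 9, 17), 8): 50.0,   # Monday
}
tuesday_forecast = naive_forecast(history, date(2018, 9, 18), 8)
```

A holiday-aware variant would replace the weekday test by a day type classification with the three categories workday, Saturday/bridge day and Sunday/holiday.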
The second benchmark is based on a different market, the Energy Exchange Austria (EXAA), where the electricity price is fixed two hours before the EPEX auction takes place. Therefore, the EXAA price at a time point can directly be used as a predictor for the EPEX price at the same time point. In fact, Ziel et al. (2015) show this benchmark to be highly competitive. However, note that it is not really appropriate to compare the remaining forecasting methods to the EXAA benchmark because they are based on different information (see also Ziel & Steinert (2016)). Nonetheless, the EXAA benchmark can provide some orientation on how well other models perform and how much improvement could be expected.
The best-performing random forest consists of 1000 decision trees. At each step in the training of the underlying decision trees, a randomly chosen subset of size 23 (corresponding to ) of all available features is used, and a tree node is only split further if it contains at least of all training data. We also use the random forest to support feature selection for the following neural network approach.
For the neural networks under consideration we use different feature vector realizations:

all available features,

all but the price curve features of the reference date,

the 10 most influential features according to the best-performing random forest,

the 20 most influential features according to the best-performing random forest.
For each case we use different network architectures, which we determine by hyperparameter optimisation as described above. These are reported in Table 1 where each column corresponds to a different set of features and each row corresponds to a hyperparameter. The notation [5, 5, 5] for the network architecture means that a 3-layer network consisting of 5 nodes per layer is used. For the networks that are trained on the selected features, we find a deeper architecture to perform best: [25] * 25 denotes a 25-layer network with 25 nodes per layer. Analogously, in the dropout row, [0, 0.25, 0] means that dropout is employed with a probability of 0.25 after the second layer and [0.1] * 25 means that dropout is employed after each of the 25 layers with a probability of 0.1. It is noteworthy that the best-performing network when using all features is rather small. Thus, as an additional plausibility check, we also consider the network architectures proposed by Keles et al. (2016) (network size , sigmoid activation function, no dropout) and Lago et al. (2018) (network size , relu activation function, no dropout) as a reference. Note that their models do not consider price curve features, i.e., order book data.
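Table 1: Hyperparameters selected by cross-validation for each feature set.

Hyperparameter | All features | All but price curve features | 10 most influential features | 20 most influential features
Network architecture | [5, 5, 5] | [5, 5] | [25] * 25 | [25] * 25
Optimiser | rmsprop | Adam | Adam | Adam
Epochs | 100 | 100 | 100 | 100
Batch size | 128 | 64 | 128 | 128
Activation function | tanh | relu | relu | relu
Dropout | [0, 0.25, 0] | [0, 0.25, 0] | [0.1] * 25 | [0.1] * 25
Batch normalization | no | yes | yes | yes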
The results of the chosen model configurations are shown in Table 2. The errors we report are measured both on the training set (in-sample error) to evaluate how well the model describes the given data and on the test set (out-of-sample error) to assess model performance on previously unseen data (20% of our whole dataset).
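Table 2: In-sample and out-of-sample errors of the chosen model configurations (RMSE and MAE in EUR/MWh).

Forecasting technique | In-sample RMSE | In-sample MAE | In-sample MdAPE | Out-of-sample RMSE | Out-of-sample MAE | Out-of-sample MdAPE
Naive model | 13.55 | 7.87 | 15.31% | 12.68 | 7.71 | 11.61%
Ordinary linear regression | 6.85 | 4.25 | 10.93% | 9.60 | 7.52 | 16.95%
Random forest | 6.77 | 4.17 | 9.73% | 11.92 | 9.32 | 19.9%
Neural network as in Keles et al. (2016) | 6.72 | 4.51 | 11.49% | 14.87 | 12.81 | 30.63%
Neural network as in Lago et al. (2018) | 2.27 | 1.65 | 4.45% | 21.05 | 8.94 | 15.22%
Feedforward neural network | 5.45 | 3.57 | 8.89% | 9.59 | 7.08 | 14.18%
Feedforward neural network without price curve features | 6.63 | 4.41 | 11.22% | 10.11 | 7.85 | 16.12%
Feedforward neural network with 10 features | 7.69 | 5.06 | 11.68% | 9.41 | 7.34 | 15.57%
Feedforward neural network with 20 features | 7.71 | 4.95 | 11.27% | 13.65 | 10.18 | 21.48%
EXAA | 6.47 | 3.53 | 7.56% | 5.58 | 3.92 | 7.22%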
Alternative: More sophisticated neural network architectures
Apart from feedforward neural networks we also analysed recurrent neural networks. As electricity spot prices can be expected to exhibit a strong dependence on previous days’ features and prices, it seems reasonable to model them as a multivariate time series. While classical approaches like ARIMA or GARCH models are possible, this is also a typical application for recurrent neural networks because they explicitly incorporate the sequential structure of the inputs. In this case, the goal was to predict the 24-dimensional vector of spot prices at some date $d$ based on information available up to date $d - 1$. For each date $d' \le d - 1$ this information consists of the curve features for date $d'$ as well as the calendar features and expected renewable infeed and total demand for date $d' + 1$. We implemented this approach using the long short-term memory (LSTM) architecture that allows for efficient training of recurrent neural networks (Hochreiter and Schmidhuber (1997)), but the results were not as convincing as with the other methods. This might be due to the high dimensionality of the multivariate time series under consideration. Therefore, we focused on the random forest and feedforward neural network approaches where the temporal dependence structure is more explicitly incorporated as a feature by means of the reference day.
5 Conclusion
Our results show that neural networks can indeed provide order-book-based price forecasts with competitive results. However, they do not perform significantly better than simpler methods like ordinary linear regression. Whereas the classical order-book-based forecasting technique requires a lot of statistical analysis, the network architecture optimisation also demands significant resources. We also found that reducing the number of features generally improves results. In regard to the RMSE, we find that the feedforward neural network with only 10 features as selected by the random forest performs best. Considering the MAE (a measure directly linked to revenues from financial trading), the feedforward neural network without feature selection is in the lead. However, the naive model shows good results as well, supporting this traditional and often applied heuristic in energy economics. The neural network architectures from literature show competitive in-sample results, but their performance drops significantly in an out-of-sample analysis. This indicates overfitting.
The posed research questions have been answered. We have shown how to incorporate order book features using volume-based partitioning, a transformation to price curves and feature selection based on random forests. We have also shown that machine learning cannot significantly reduce the work effort needed in the model setup, but gives competitive results.
The models do have a lot of potential for improvement. For instance, there are much more accurate wind and solar infeed forecasts available in the market compared to the data from EEX transparency (unfortunately they are not free of charge). We see the largest potential in a daily recalibration of the models including an updated feature selection which allows the model to react to fundamental changes in the market (coal and gas prices, power plant outages, …).
In addition, we also analysed different applications of machine learning on EPEX order books, which are not outlined in detail: We employed neural networks to reconstruct renewable infeed from the order book and used the networks to generate price forward curves.