Learning to Predict with Big Data
This work is funded by the UK ESRC Consumer Data Research Centre (CDRC), grant reference ES/L011840/1.
Abstract
Big spatiotemporal datasets, available through both open and administrative data sources, offer significant potential for social science research. The magnitude of the data allows for increased resolution and analysis at the individual level. One of the issues researchers face with such data is the stationarity assumption, which poses several challenges for quantifying uncertainty and bias. While there have been recent advances in forecasting techniques for highly granular temporal data, little attention has been given to segmenting the time series and finding homogeneous patterns. In this paper, it is proposed to estimate behavioral profiles of individuals’ activities over time using Gaussian Process based models. In particular, the aim is to investigate how individuals or groups may be clustered according to the model parameters. Such a Bayesian nonparametric method is then tested by looking at the predictability of the segments, using a combination of models to fit different parts of the temporal profiles. Model validity is then tested on a set of held-out data. The dataset consists of half hourly energy consumption records from smart meters from more than 100,000 households in the UK and covers the period from 2015 to 2016. The methodological approach developed in the paper may be easily applied to datasets of similar structure and granularity, for example social media data, and may lead to improved accuracy in the prediction of social dynamics and behavior.
Keywords: big data, time series models, consumer behavior, smart meters
Introduction
Social and political science research on time series data often focuses on prediction models that strive to understand the dynamics behind the outcome variables (Box-Steffensmeier et al., 2014). Time series classification models have been used for predictions of conflict and civil war (Muchlinski et al., 2015; Hegre et al., 2013), political instability (Goldstone et al., 2010) or the design of aggregated indicators, such as political accountability (Chappell and Keech, 1985).
Fine grained, complex nonstationary time series that characterize social dynamics have entered political science research via social media analysis (digrazia2013more). Prediction of individual behaviour with such data is challenging, and a mixture of methods is proposed here as an alternative. The aim of this paper is to consider the challenges and opportunities that may be associated with big, highly granular temporal datasets. In particular, the focus is on challenges relating to nonstationarity and aggregation of time series in the context of big data.
As an illustrative example, smart meter data that records household energy consumption at half hourly intervals is analysed in the paper. Such data may be used as a proxy for individual household behaviour and activities (Anderson et al., 2017). In order to make the approach more generalizable, a two stage procedure is suggested: first, segment the population with unsupervised clustering methods to categorize behaviour; second, predict cluster allocation using only time series features.
The paper proceeds as follows. First, we briefly review key statistical concepts relevant for our data. We then introduce our case study based on a sample of smart meter data available for the UK throughout 2014-2015. We assess several methods for clustering time series with the aim of segmenting consumer behaviour. This is followed by prediction of behavioural classes from the individual time series records. The paper concludes with a discussion of how nonstationarity and aggregation affect the analysis in our case study.
Overview of Statistical Considerations
Stationarity and the Effect of Aggregation
A stationary process is one in which the joint distribution of the variable remains the same under shifts in time, i.e. $F(y_{t_1}, \dots, y_{t_n}) = F(y_{t_1+k}, \dots, y_{t_n+k})$ for all lags $k$. Frequently, weakly or covariance stationary processes are assumed, whereby the mean and variance remain constant and the autocovariance depends only on the lag as we move through time (see e.g. Hamilton, 1994). Failure to correctly diagnose a stationary or nonstationary process may lead to wrong inferences about the phenomena under study.
Sampling frequency (or equivalently aggregation level) of data affects the output of analysis and the assessment of stationarity. Intuitively, it is expected that aggregation results in some information loss. The question is how important this information is for inferences about the process. Aggregation is useful for data reduction, which, in turn, speeds up the computation. In general, the information lost in aggregation depends on properties of the process itself. For example, if one considers a signal which has very gradual variation over time, then increasing the sampling interval, or equivalently increasing aggregation over very granular samples, may not have much impact on inference about the process. Conversely, if activity varies rapidly, then one may expect aggregation to have a large impact and significantly limit any insights obtained from analysis. In an event-data application, Shellman (2004) demonstrates the varying effect of aggregation on both segmentation and prediction. We expect the effect to be even more pronounced for highly granular time series big data.
Smart Meter Time Series Data
Time series data constitutes an ordered sequence indexed by time alongside values of the variables of interest at each point in time. For smart meter data there are various ways to represent such a time series. For instance, one sequence could represent the total consumption per day, while another sequence could track hourly energy consumption. We can then model different data generation processes depending on our level of aggregation.
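As a minimal sketch of these alternative representations, assume one day of 48 half hourly readings held in a plain Python list. The values and the helper name `aggregate` are illustrative only and are not taken from the dataset.

```python
def aggregate(readings, window):
    """Sum consecutive readings in non-overlapping windows of the given size."""
    if len(readings) % window != 0:
        raise ValueError("number of readings must be a multiple of the window")
    return [sum(readings[i:i + window]) for i in range(0, len(readings), window)]


# Illustrative day of 48 half hourly readings (kWh): low overnight use,
# a morning peak and a longer evening peak. These values are made up.
half_hourly = [0.1] * 14 + [0.6] * 4 + [0.2] * 16 + [0.8] * 8 + [0.2] * 6

hourly = aggregate(half_hourly, 2)   # 24 values, one per hour
daily = aggregate(half_hourly, 48)   # a single daily total

print(len(hourly), round(daily[0], 2))
```

The daily total keeps the overall level of consumption but no longer carries any information about when the peaks occur, which is the kind of loss discussed above.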
Independence Assumption
In the case of smart meter data, the data may be analysed in either a univariate or multivariate setting. Traditional analysis is mainly univariate in nature and would impose an independence assumption across energy consumption levels if we are interested in fitting a parametric model. If we are concerned with the prediction of average (or aggregated) energy demand that is composed of consumption by individual users, then correlation or independence between streams may affect the aggregated processes. In the univariate case, each customer’s time series is taken separately and we attempt to predict their consumption using only their historical behaviour.
Energy consumption may be viewed as a first-order chain of preceding readings. Figure 1 presents a schematic illustration. We assume there are no interdependencies among the nodes in the chain other than on the previous time step: each time period is conditioned on the previous one. As an extension, a second-order chain may be considered, where t3 may also depend on t1 and t4 on t2. Such models may be generalized to higher orders at the expense of increased model complexity and reduced interpretability.
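Under these assumptions the joint distribution of the readings factorizes as sketched below. The notation is illustrative and not taken from the original text: $y_t$ denotes the reading at time step $t$ and $T$ the number of time steps.

```latex
% First-order chain: each reading depends only on the previous one
p(y_1, \dots, y_T) = p(y_1) \prod_{t=2}^{T} p(y_t \mid y_{t-1})

% Second-order chain: each reading depends on the two previous ones
p(y_1, \dots, y_T) = p(y_1)\, p(y_2 \mid y_1) \prod_{t=3}^{T} p(y_t \mid y_{t-1}, y_{t-2})
```

Each additional order adds one more conditioning variable to every factor, which is the source of the added complexity noted above.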
Description of Data
A summary of the dataset is given in Table 1. From the total set of 8.5 billion observations, we study two subsamples of smart meter data streams. Aggregated patterns are calculated by taking the average half hourly consumption across the whole year for each unique consumer at each postcode sector. This significantly reduces the volume of data. The aggregated sample retains some variability across the units of analysis, but the variability within the individual customer records is collapsed. To assess how much we can learn about the true dynamics from this aggregated level, we compare the aggregated results to those obtained on a disaggregated raw sample.
The overall dataset is large enough to present computational limitations for some analyses. One approach, implemented in the disaggregated sample, is a random draw of 1,100 individuals. The overall sample is not used in the analysis, yet it is presented here to give a flavour of the volume one may be dealing with when studying energy data. Computationally, such magnitude can create obstacles, especially for methods that use matrix transformations extensively.
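The averaging step can be sketched as follows, assuming each customer’s record is a list of days with 48 readings per day. The function name `average_profile` and the two example days are hypothetical and serve only to illustrate the calculation.

```python
from statistics import mean


def average_profile(days):
    """Average each half hourly slot across days.

    `days` is a list of daily records, each a list of 48 readings.
    Returns one list of 48 slot averages.
    """
    if not days or any(len(day) != 48 for day in days):
        raise ValueError("each day must contain exactly 48 readings")
    return [mean(day[slot] for day in days) for slot in range(48)]


# Two made-up days for one hypothetical customer.
day_a = [0.2] * 48
day_b = [0.4] * 48
day_b[36] = 2.0  # a single evening spike on the second day only

profile = average_profile([day_a, day_b])
print(round(profile[0], 2), round(profile[36], 2))
```

In this toy case the spike that occurs on one day only is halved in the averaged profile, which illustrates how within-customer variability is collapsed by the aggregation.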
Table 1: Summary of the dataset.
Data                 Overall          Aggregated sample   Disaggregated sample
Unique identifiers   489,000          8,171               1,100
Days                 365              365                 365
Daily readings       48               48                  48
Total observations   8,567,280,000    143,155,920         19,272,000
Figure 2 presents an example of the average daily consumption pattern and variation around this average for a sample of consumers randomly taken from the overall dataset. As may be expected, the shape of consumption behaviour aligns with morning and evening spikes. At the same time, if we are to differentiate among the patterns, the variation around the mean and median consumption may generate additional insights about consumer behaviour.
Methodology
This section presents the workflow of the analysis. It is worth noting that a number of other solutions may be considered as a preliminary stage of the method (e.g. feature transformations). However, we avoid variable transformations in order to preserve the interpretability of the analysis and to ensure that the analysis may be replicated using the raw smart meter data without any modifications applied. This matters, for instance, for the applicability of the research method in industry.
Our approach is based on a combination of both unsupervised and supervised machine learning techniques. First, the process that may have generated the patterns in the data is studied in order to group the patterns by the similarity of that process, a task often referred to as clustering. Since the data is unlabelled a priori, this step is also useful for segmenting large data sets into groups that can then be studied separately. It is often the case that these clusters may be associated with real-world segmentations in the data.
In the final step we predict assignment to clusters based only on time series features. This models a setting where one may be interested in individual behaviour (clusters) without access to any additional information apart from past behaviour. We perform this analysis on both the aggregated and disaggregated data streams.
Segmentation and Labelling of Time Series
As discussed, the dataset represents solely the readings from smart meters and contains no information on individual characteristics of the users and properties. We use unsupervised machine learning techniques to segment the data and create artificial labelling. In the next section, we assess how well such labels can be predicted from the time series data. The main goal of this section is to develop a method to read new unseen data and allocate it to a group of already known segments. Clustering is assessed as a feasible strategy for segmenting large, granular data. Related work has been done in energy classification using smaller and more aggregated samples (Albert and Rajagopal, 2013; McLoughlin, Duffy and Conlon, 2015; Haben, Singleton and Grindrod, 2016).
Clustering
Clustering is an unsupervised machine learning method that is used primarily to associate a simplified underlying structure with unlabelled data. For example, in the smart meter case, having solely energy consumption recordings, one knows little about whether the consumption patterns may be aggregated into similarity groups. For instance, people who work full time may be grouped together while those who are at home throughout the day may also be clustered together. The objective is thus to find an algorithm which ensures that similarity between individuals within each cluster is maximized, while also maximizing dissimilarity between clusters.
To date, a number of methods have been developed for clustering data. While many of these give reliable performance on static data, they often disregard the dynamic structures of clusters. This poses further challenges if we are to consider spatial and temporal dimensions in the analysis. One of the immediate solutions could be to transform dynamic data into a static format. For example, we may calculate the mean for each of the individuals and create a numerical indicator that represents an estimate of average consumption for the individuals in our sample. This can also be done for geographical references, reducing the dimensionality of the data and allowing for greater generalization. According to Liao (2005), the decision on which clustering method is appropriate for time series also depends on the type of the data. The characteristics can include: discrete vs real valued, uniformity of the sample, univariate vs multivariate series, and lengths of the time series considered for the analysis.
Most clustering algorithms are designed to maximize dissimilarity among the groups using various distance measures (e.g., k-means, hierarchical), while others may consider the underlying data generation process (e.g., Gaussian Mixture Models, Bayesian clustering by dynamics). An important issue for these algorithms is how one should treat outliers. For instance, outliers may be weakly assigned to clusters (with some probability) or associated strictly with a specific cluster (absolute/hard clustering).
K-means clustering is the most popular approach due to its simplicity and fast minimization of the distances between the objects and their class centre. It is well suited for data sets with static features. For highly variable temporal variables, the assignment of the cluster may be highly unstable, as individuals are likely to be assigned to a different cluster depending on the day and time. As an alternative, we consider a Gaussian Mixture Model (GMM) based on a probabilistic model (10.2307/2532201). Such a setting brings about the ability to handle diverse types of data, including dealing with missing or unobservable data that may have contributed to variation differences among segmented groups. This is achieved by assigning a probability to segmented group membership. Under greater uncertainty about the assignment, additional variables may be introduced or the individual may be treated as an outlier or as belonging to an uncertain group. Unlike k-means, it produces stable results and selects the number of clusters using the probability density fit. Clustering results are also replicable and remain the same regardless of how many times we run the algorithm.
Gaussian Mixture Models
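Gaussian Mixture Models constitute a probabilistic method for clustering that handles diverse types of data, including dealing with missing data and hierarchical structures. The probabilities for each data point to be in a particular cluster are first assigned, and then a cluster is allocated to each point using those probabilistic measures. The mixture is formed using the probabilities obtained from the standard Gaussian representation:
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{\top} \Sigma^{-1} (x-\mu)\right) \tag{1}$$
with $\mu$ representing the mean vector, $\Sigma$ being a covariance matrix and $d$ the dimension of $x$. A mixture of $K$ Gaussians with mixing weights $\pi_k$ (non-negative and summing to one) is then represented as the following:
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \tag{2}$$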
As an example, Figure 3 demonstrates how consumption variability can be represented as a mixture of densities. As can be seen, we may represent this data with a mixture of Gaussians, yet they may differ in size or shape. (Footnote 1: The GMM algorithm is implemented in R in the ’mclust’ package (Scrucca et al., 2016). For the mixture models we utilize a likelihood based estimation procedure.)
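To make the likelihood based fitting concrete, the following is a toy sketch of expectation-maximization for a one-dimensional mixture with two components. It is not the ’mclust’ procedure used in the paper: the data values, the function names (`fit_gmm_1d`, `responsibilities`) and the quantile-based starting values are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """Posterior probability of each component for a single point x."""
    dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(dens)
    return [d / total for d in dens]


def fit_gmm_1d(data, k, iterations=100):
    """Fit a k-component univariate Gaussian mixture by EM."""
    n = len(data)
    ordered = sorted(data)
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    centre = sum(data) / n
    variances = [sum((x - centre) ** 2 for x in data) / n] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: soft assignment of every point to every component.
        resp = [responsibilities(x, weights, means, variances) for x in data]
        # M-step: update weights, means and variances from the soft counts.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances


# Made-up average consumption values for a low-use and a high-use group.
data = [0.15, 0.18, 0.20, 0.22, 0.25, 0.19, 0.21, 0.17,
        0.90, 1.00, 1.10, 0.95, 1.05, 1.20, 0.85, 1.00]

weights, means, variances = fit_gmm_1d(data, k=2)
# Hard labels are obtained by taking the most probable component.
labels = []
for x in data:
    r = responsibilities(x, weights, means, variances)
    labels.append(r.index(max(r)))
print([round(m, 3) for m in means], labels)
```

The soft assignments returned by `responsibilities` are what distinguishes this approach from hard clustering: a point lying between two components would receive a split probability rather than a forced label.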
Clustering Results
As may be observed from Figure 4 and Table 2, while we are dealing with different samples we obtained the same number of clustered groups. However, the key differentiator between the two cluster models is the shape of the Gaussian models used to fit the patterns. While the aggregated sample presents smoother shapes, we see more variation in the disaggregated case (for resulting temporal profiles please see Appendix B).
In terms of sample allocation to each of the clusters, we are presented with an unbalanced allocation. This is caused by the fact that on average, as we saw in Figure 2, energy customers may be alike in their temporal behaviour, particularly characterized by morning and evening peaks. In the case of clustering, the less represented groups of patterns are indeed those with less expected energy consumption: profiles that vary from very low to very high and persistent usage during the day.
Table 2: Share of the sample allocated to each segment.
Segment   % of total sample (Aggregated patterns)   % of total sample (Disaggregated patterns)
1         24.0%                                     15.7%
2         10.6%                                     14.2%
3         5.3%                                      1.4%
4         0.9%                                      5.9%
5         1.9%                                      20.0%
6         21.9%                                     3.4%
7         15.5%                                     13.6%
8         14.4%                                     22.5%
9         5.5%                                      3.4%
Behaviour Prediction
There are a number of approaches that can be used for time series prediction and classification. Initially, we attempted to forecast the next unit of energy consumption in our data using the standard parametric family of models, such as ARIMA, AR, and MA models. However, performance was extremely poor, and for readability the details of this analysis are omitted here. Given the greater variability in big data, predicting the next half hour or day of activity may be difficult; as an initial stage, the consumer may instead be associated with a class of known or similar users. In the presented case, the labels obtained from the segmentation of the data in the previous section are used. Once again, performance on the aggregated and disaggregated samples is compared. The choice of models was based on their popularity in past research, specifically in the multiclass setting.
K-Nearest Neighbour
K-Nearest Neighbour (KNN) is considered one of the simplest classification methods for both binary and multiclass problems. It is particularly useful for problems where the conditional distribution of the outcome variable on the independent variables is unknown (James et al., 2013). KNN works by taking an input point $x_0$ and the $K$ points that are in some sense close to it. The points nearby in the feature space can then be used to select an appropriate label. The estimator can be written mathematically as
$$\hat{y}(x_0) = \arg\max_{j} \frac{1}{K} \sum_{i \in N_K(x_0)} I(y_i = j) \tag{3}$$
where $y_i$ represents the labels of the points in the neighbourhood $N_K(x_0)$ of input point $x_0$.
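The neighbourhood vote can be sketched in a few lines. This is a minimal illustration using Euclidean distance; the function names, the four-feature profiles and the segment labels "peaky" and "flat" are hypothetical and are not taken from the paper’s data.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_x, train_y, x0, k):
    """Label x0 by majority vote among its k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], x0))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Made-up profiles with four features each and two illustrative segments.
train_x = [
    [0.1, 0.6, 0.2, 0.8], [0.1, 0.5, 0.2, 0.9], [0.2, 0.7, 0.1, 0.7],
    [0.9, 0.9, 0.9, 0.9], [1.0, 0.8, 1.0, 0.9], [0.8, 1.0, 0.9, 1.0],
]
train_y = ["peaky", "peaky", "peaky", "flat", "flat", "flat"]

prediction = knn_predict(train_x, train_y, [0.15, 0.6, 0.2, 0.75], k=3)
print(prediction)
```

An odd value of `k` is used here to avoid tied votes in the two-class toy example; with more classes ties remain possible and would need an explicit rule.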
Tree based methods
The other methods we assess, Random Forest (RF) and Gradient Boosting Trees (GBM), are based solely on decision tree mechanisms. They are differentiated by the approach which they use to select the best combination of trees and the way the samples of data are incorporated in the learning process. These methods are especially valuable due to their simplicity in interpretation compared to other machine learning algorithms. They can easily be used for regression and classification type problems and can be used to model nonlinear relationships.
The Random Forest algorithm is based on building decision trees on bootstrapped data (randomly resampled with replacement) with a smaller subset of randomly sampled predictors at each decision node. A large number of trees is grown, each until a stopping rule is reached (e.g. a minimum of 5 observations in the terminal nodes), and the trees are then aggregated for the final prediction. Examples of the use of Random Forest can be found in Muchlinski et al. (2015), for civil war onset prediction, and in Strobl et al. (2008).
Our implementation of the model is as follows. The input variables are represented by the sequence $x = (x_1, \dots, x_p)$, which is a combination of half hourly readings. The model draws $B$ bootstrap samples $Z$ from the training set, and random forest trees are built using combinations of predictors that are responsible for the splits of these trees. Once $B$ tree classifiers have been generated, we combine them to form a single classifier. Output is represented by the ensemble of trees $\{T_b\}_{1}^{B}$. The class is then predicted for the unseen data (test set) through a majority vote over the trees:
$$\hat{C}_{rf}^{B}(x) = \text{majority vote} \, \{\hat{C}_b(x)\}_{1}^{B} \tag{4}$$
where $\hat{C}_b(x)$ is the class predicted by the $b$-th tree.
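The bootstrap, random predictor subset and majority vote steps can be sketched as below. To keep the example short, single-split decision stumps stand in for full trees, so this is a simplified illustration rather than the Random Forest implementation used in the paper. All function names, the seed and the data are assumptions.

```python
import random
from collections import Counter


def fit_stump(xs, ys, features):
    """Find the single-threshold split with the fewest errors on the given features."""
    best = None
    for f in features:
        for threshold in sorted({x[f] for x in xs}):
            left = [y for x, y in zip(xs, ys) if x[f] <= threshold]
            right = [y for x, y in zip(xs, ys) if x[f] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no valid split: predict the majority class everywhere
        label = Counter(ys).most_common(1)[0][0]
        return (None, None, label, label)
    return best[1:]


def predict_stump(stump, x):
    f, threshold, left_label, right_label = stump
    if f is None:
        return left_label
    return left_label if x[f] <= threshold else right_label


def fit_forest(xs, ys, n_trees, n_features, seed=0):
    """Fit stumps on bootstrap samples, each with a random subset of predictors."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]           # sample with replacement
        feats = rng.sample(range(len(xs[0])), n_features)    # random predictor subset
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx], feats))
    return forest


def predict_forest(forest, x):
    """Majority vote over the individual classifiers."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Made-up profiles: every feature is lower for "low" than for "high".
xs = [[0.1, 0.2, 0.1, 0.3], [0.2, 0.1, 0.2, 0.2], [0.3, 0.3, 0.1, 0.1], [0.2, 0.2, 0.3, 0.2],
      [0.8, 0.9, 0.7, 0.9], [0.9, 0.8, 0.8, 0.7], [0.7, 0.7, 0.9, 0.8], [0.8, 0.8, 0.8, 0.9]]
ys = ["low"] * 4 + ["high"] * 4

forest = fit_forest(xs, ys, n_trees=25, n_features=2)
print(predict_forest(forest, [0.0, 0.0, 0.0, 0.0]), predict_forest(forest, [1.0, 1.0, 1.0, 1.0]))
```

Individual stumps fitted on an unlucky bootstrap sample may be poor classifiers; the vote across many of them is what stabilises the prediction.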
An alternative tree algorithm, known as Gradient Boosting, was first used to tackle classification problems but is now widely used for regression as well (Friedman, Hastie and Tibshirani, 2001). Like Random Forest, the gradient boosting algorithm takes advantage of both weak and strong classifiers. By weak here we mean classifiers whose predictions are only slightly better than, or just the same as, a random guess. Unlike Random Forest, where at each iteration we are training a different solution, in the Gradient Boosting model we are updating the solution of an already trained model as more samples are taken. The trees are, therefore, updated at each iteration to obtain more powerful classifiers.
In boosting models we first assign the weights $w_i = 1/N$ to each of our training observations $(x_i, y_i)$, $i = 1, \dots, N$, that include both input and output variables, with $N$ being the total number of observations. We then iterate the process $M$ times, during which we fit the classifier $G_m(x)$ using the observation weights. The observations which were misclassified at the previous stage are assigned greater weights, so at each iteration we give more importance to those observations which were harder to classify initially. We calculate the error associated with each model fit as
$$\mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i \, I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i} \tag{5}$$
Misclassified observations are assigned an increase to their weights using the factor of $\exp(\alpha_m)$, where $\alpha_m = \log((1 - \mathrm{err}_m)/\mathrm{err}_m)$. The final output is based on continuous iterations of model fit using reweighted observations until the error rate is minimized.
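The reweighting described above can be sketched for a single iteration as follows. This follows the AdaBoost-style update given in Friedman, Hastie and Tibshirani (2001) and is an assumption about the intended scheme, not the paper’s implementation; the function name `boosting_round`, the final renormalisation and the example labels are illustrative.

```python
import math


def boosting_round(weights, y_true, y_pred):
    """One AdaBoost-style reweighting step.

    Returns the weighted error, the classifier weight alpha and the
    updated observation weights, renormalised to sum to one.
    """
    missed = [t != p for t, p in zip(y_true, y_pred)]
    err = sum(w for w, m in zip(weights, missed) if m) / sum(weights)
    if not 0.0 < err < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    alpha = math.log((1.0 - err) / err)
    # Only misclassified observations are scaled up, by the factor exp(alpha).
    updated = [w * math.exp(alpha) if m else w for w, m in zip(weights, missed)]
    total = sum(updated)
    return err, alpha, [w / total for w in updated]


# Five observations with equal starting weights; the third is misclassified.
start = [0.2] * 5
y_true = ["a", "a", "b", "b", "a"]
y_pred = ["a", "a", "a", "b", "a"]

err, alpha, new_weights = boosting_round(start, y_true, y_pred)
print(round(err, 3), round(alpha, 3), [round(w, 3) for w in new_weights])
```

After the update the misclassified observation carries as much weight as the four correctly classified ones together, so the next classifier is pushed towards getting it right.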
Results
The tables below report overall accuracy and kappa values for each of the models that were used to predict the segment of the data. Accuracy reports the overall predictive power of the model as the share of correctly classified cases (true positives and true negatives) among all cases (true and false positives and negatives). The Kappa statistic is used for the evaluation of classifiers by comparing the observed accuracy of prediction with the accuracy expected by random chance. The optimal parameters were obtained using ten-fold cross-validation. The results are followed by confusion tables that represent the ratio of observed versus predicted class.
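Both measures can be computed directly from observed and predicted labels. The sketch below uses Cohen’s kappa with chance agreement estimated from the marginal class frequencies; the function name and the example labels are illustrative and unrelated to the reported results.

```python
from collections import Counter


def accuracy_and_kappa(y_true, y_pred):
    """Overall accuracy and Cohen's kappa for a multiclass prediction."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected by chance, from the marginal class frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # Note: kappa is undefined when expected agreement equals one.
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Made-up segment labels for eight households.
y_true = [1, 1, 1, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 3, 3, 1]

accuracy, kappa = accuracy_and_kappa(y_true, y_pred)
print(round(accuracy, 3), round(kappa, 3))
```

Because kappa discounts agreement that would arise by chance, it is typically lower than raw accuracy, and the gap widens when a few large classes dominate the sample.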
Results on aggregated sample
Model                     Accuracy   Kappa
K-Nearest Neighbour       23%        0.14
Gradient Boosting Trees   37%        0.29
Random Forest             40%        0.29
Results on disaggregated sample
Model                     Accuracy   Kappa
K-Nearest Neighbour       65%        0.58
Gradient Boosting Trees   80%        0.73
Random Forest             79%        0.75
As observed from the confusion tables, prediction methods show differential performance across clusters. One of the immediate observations is the difference in performance when considering aggregated versus disaggregated analysis. Aggregated models are associated with higher misclassification rates suggesting that by aggregating we have lost essential dynamics that contribute to identifiable patterns.
While RF and GBM tend to perform better on average, KNN showed higher accuracy on some classes. This is possibly related to the different ‘bias-variance’ trade-off of each of the models. While boosting aims to reduce the bias by sequentially combining the estimated models, random forest fundamentally searches for a solution that reduces variance by imposing a strict structure of reducing the number of predictors at each split of the tree.
Often the classes which are better represented in the data may be associated with better performance, as there is more data available for training. In our case, this did not hold. Classes with a smaller number of observations were more easily differentiated, while the bigger classes showed higher levels of misclassification.
Discussion and Conclusions
This paper presented an analysis that can be performed on time series associated with substantial levels of variability across a large number of data streams. It was demonstrated that such data can be meaningfully clustered using Gaussian Mixture Models. The paper suggests a possible strategy for prediction and characterization of temporal profiles. One of the arising challenges is the effect of aggregation on prediction performance.
It was shown that both segmentation and predictive algorithms tend to work differently depending on whether we looked at aggregate or disaggregate samples. For prediction in particular, we show that using aggregated data records leads to much higher rates of misclassification, while the most granular data can be classified and predicted with more certainty.
Compared to Random Forest, in practice some classifications may be better performed using Gradient Boosting trees (Friedman, Hastie and Tibshirani, 2001), although this performance may come at the cost of overfitting the data. Nevertheless, what is observed is rather a mixture of performances, with each method winning or losing for different prediction classes. This may be related to the essential ‘bias-variance’ trade-off that is handled differently by each model. While boosting aims to reduce the bias by sequentially combining the estimated models, random forest fundamentally searches for the solution that reduces the variance by imposing a strict structure of reducing the number of predictors at each split of the tree.
References
Albert, Adrian and Ram Rajagopal. 2013. “Smart meter driven segmentation: What your consumption says about you.” IEEE Transactions on Power Systems 28(4):4019–4030.
Anderson, Ben, Sharon Lin, Andy Newing, AbuBakr Bahaj and Patrick James. 2017. “Electricity consumption and household characteristics: Implications for census-taking in a smart metered future.” Computers, Environment and Urban Systems 63:58–67. Spatial analysis with census data: emerging issues and innovative approaches.
Box-Steffensmeier, Janet M., John R. Freeman, Matthew P. Hitt and Jon C. W. Pevehouse. 2014. Time Series Analysis for the Social Sciences. Analytical Methods for Social Research. Cambridge University Press.
Chappell, Henry W and William R Keech. 1985. “A new view of political accountability for economic performance.” American Political Science Review 79(1):10–27.
Dietterich, Thomas G et al. 2000. “Ensemble methods in machine learning.” Multiple classifier systems 1857:1–15.
Friedman, Jerome, Trevor Hastie and Robert Tibshirani. 2001. The elements of statistical learning. Vol. 1. Springer series in statistics. New York.
Goldstone, Jack A, Robert H Bates, David L Epstein, Ted Robert Gurr, Michael B Lustik, Monty G Marshall, Jay Ulfelder and Mark Woodward. 2010. “A global model for forecasting political instability.” American Journal of Political Science 54(1):190–208.
Haben, Stephen, Colin Singleton and Peter Grindrod. 2016. “Analysis and clustering of residential customers energy behavioral demand using smart meter data.” IEEE Transactions on Smart Grid 7(1):136–144.
Hamilton, James Douglas. 1994. Time series analysis. Vol. 2. Princeton University Press, Princeton.
Hegre, Håvard, Joakim Karlsen, Håvard Mokleiv Nygård, Håvard Strand and Henrik Urdal. 2013. “Predicting armed conflict, 2010–2050.” International Studies Quarterly 57(2):250–270.
James, Gareth, Daniela Witten, Trevor Hastie and Robert Tibshirani. 2013. An introduction to statistical learning. Vol. 112. Springer.
Liao, T Warren. 2005. “Clustering of time series data - a survey.” Pattern Recognition 38(11):1857–1874.
McLoughlin, Fintan, Aidan Duffy and Michael Conlon. 2015. “A clustering approach to domestic electricity load profile characterisation using smart metering data.” Applied Energy 141:190–199.
Muchlinski, David, David Siroky, Jingrui He and Matthew Kocher. 2015. “Comparing random forest with logistic regression for predicting class-imbalanced civil war onset data.” Political Analysis 24(1):87–103.
Rokach, Lior. 2010. “Ensemble-based classifiers.” Artificial Intelligence Review 33(1):1–39.
Scrucca, Luca, Michael Fop, T Brendan Murphy and Adrian E Raftery. 2016. “mclust 5: Clustering, classification and density estimation using Gaussian finite mixture models.” The R Journal 8(1):289.
Shellman, Stephen M. 2004. “Time series intervals and statistical inference: The effects of temporal aggregation on event data analysis.” Political Analysis 12(1):97–104.
Strobl, Carolin, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin and Achim Zeileis. 2008. “Conditional variable importance for random forests.” BMC Bioinformatics 9(1):307.
Wilkerson, John and Andreu Casas. 2017. “Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges.” Annual Review of Political Science 20:529–544.
Appendix
Appendix (A)
Figure 7 presents a description of the data used for the analysis: the aggregated and disaggregated samples.
Appendix (B)
Figures 8 and 9 present the shapes and variation in the resulting clusters using the GMM model. As can be seen, the number of clusters was identical; however, the shapes of the aggregated clusters are far smoother than those of the disaggregated sample. This fundamental difference may have had a direct implication for the predictability of the aggregated clusters, since differentiation at the aggregated level may be more challenging once the essential dynamics that distinguish the patterns have been collapsed by the averaging of energy consumption.