Convolutional Sequence to Sequence
Nonintrusive Load Monitoring
Abstract
A convolutional sequence to sequence nonintrusive load monitoring model is proposed in this paper. Gated linear unit convolutional layers are used to extract information from the sequences of aggregate electricity consumption. Residual blocks are also introduced to refine the output of the neural network. The partially overlapped output sequences of the network are averaged to produce the final output of the model. We apply the proposed model to the REDD dataset and compare it with the convolutional sequence to point model in the literature. Results show that the proposed model is able to give satisfactory disaggregation performance for appliances with varied characteristics.
I Introduction
Nonintrusive load monitoring (NILM) refers to the technique of estimating the power demand of a single appliance from the combined demand of multiple appliances in a household measured by a single meter [1]. It is suggested in [2] that electricity consumption feedback that includes an appliance-specific breakdown is more likely to promote electricity conservation among residential consumers. Electricity providers, in turn, can gain a more detailed and in-depth understanding of their customers and provide better services. Thus, both electricity consumers and electricity providers can benefit from the information provided by accurate disaggregation of whole-home power demands.
Comprehensive reviews of various NILM methods can be found in [3, 4]. In recent years, the success of deep neural networks (DNN) in fields including computer vision, speech recognition, and natural language processing has gained much attention in the research community and the industry [5]. As NILM is a well-defined machine learning task, researchers have been actively applying various DNN models to this task. In [1], the authors proposed several DNN architectures that are mainly composed of one-dimensional convolutional layers, long short-term memory (LSTM) recurrent layers, denoising autoencoders, and fully connected layers. It is believed that convolutional layers are able to extract local features of electricity consumption patterns, while LSTM layers (or other types of recurrent layers) are good at modelling the temporal dependence within the sequences of electricity consumption. Some other studies focused on deep neural networks with homogeneous building blocks in order to gain in-depth insights into these building blocks. For instance, in [6], the authors proposed a network with one LSTM layer (which is in fact not a DNN). Much emphasis was placed on the fact that the long-term behavior of multi-state appliances can be properly modelled. Another network structure that mainly consists of convolutional layers was proposed in [7]. The authors visualized the feature maps of the network with different input sequences in order to demonstrate the effectiveness of the model they proposed. A number of DNN models with different types of building blocks are also introduced in [8].
While the mainstream practice of modelling temporal data is to leverage recurrent neural networks (RNN) [9, 10], some recent studies have shown that convolutional networks can also work well on one-dimensional data [11, 12]. We learn from these newly developed convolutional network structures that are specially designed for one-dimensional data and adapt them to the task of NILM. As the networks consist mainly of convolutional and fully connected layers, they can be trained within a reasonably short amount of time on GPUs. We apply the proposed convolutional sequence to sequence model to a real-world dataset, and the results show that the proposed model outperforms existing models based on convolutional networks.
II Convolutional Sequence to Sequence Model for Nonintrusive Load Monitoring
Given a whole-home power consumption sequence $X = \{x_1, x_2, \ldots, x_T\}$, the goal of power consumption disaggregation is to obtain the power consumption sequence of the $i$th appliance, $Y^i = \{y^i_1, y^i_2, \ldots, y^i_T\}$, where $T$ is the length of the sequences. In this paper, each convolutional sequence to sequence model produces the disaggregated power consumption sequence for a single appliance.
The illustration of the proposed model is presented in Fig. 1. Specifically, the convolutional sequence to sequence network takes an input window of length $w_x$ and maps it to an output window of length $w_y$. We set the value of $w_x$ several times larger than that of $w_y$, which helps the model perceive electricity consumption before and after the target time range. This is different from the sequence to sequence strategies in [1], [7], and [8], where $w_x$ is the same as $w_y$. Further, in order to make use of information from both directions of time, each output window is aligned to the center of its input window. Instead of cutting the complete output sequence into nonoverlapping sections, we move the windows for the input and output sequences with a small step size of $s$, which is much smaller than $w_y$. Thus, for the majority of the output sequence, we produce $w_y / s$ values for each time step. We then average the values for each time step to obtain the final output.
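The windowing and averaging described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names `input_window_start` and `average_overlapping_windows` are hypothetical, each output window is assumed to start at a multiple of the step size, and time steps that no window covers are returned as `None`.

```python
def input_window_start(output_start, input_length, output_length):
    """Start index of the input window whose center matches the output window.

    A negative result means the input window extends before the start of the
    recording; how such edges are handled (e.g. padding) is left open here.
    """
    return output_start - (input_length - output_length) // 2


def average_overlapping_windows(window_outputs, step, total_length):
    """Average partially overlapping output windows into one sequence.

    window_outputs[j] is assumed to be the predicted output window that
    starts at time index j * step.
    """
    sums = [0.0] * total_length
    counts = [0] * total_length
    for j, window in enumerate(window_outputs):
        start = j * step
        for offset, value in enumerate(window):
            t = start + offset
            if t < total_length:
                sums[t] += value
                counts[t] += 1
    # Time steps covered by no window are reported as None.
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy example: output windows of length 4 moved with a step of 2.
windows = [
    [1.0, 1.0, 1.0, 1.0],  # covers time steps 0..3
    [3.0, 3.0, 3.0, 3.0],  # covers time steps 2..5
    [5.0, 5.0, 5.0, 5.0],  # covers time steps 4..7
]
averaged = average_overlapping_windows(windows, step=2, total_length=8)
print(averaged)
```

In the toy example, time steps 2 to 5 are each covered by two windows, so their final values are the mean of two predictions, while the first and last two time steps are covered only once. This mirrors the remark that the full number of overlapping predictions is available only for the majority of the sequence, not at its edges.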
In Fig. 2, we demonstrate the overall structure of the convolutional sequence to sequence network. In order to process the input of the network, several gated linear unit (GLU) convolutional blocks [11] and max pooling layers are used after the input layer. We set the pool size of the max pooling layers to 2, so that the size of the layers is halved after each pooling layer. Thus, the size of the layer after three max pooling layers becomes $w_x/8$, where $w_x$ is the length of the input. A flatten layer then reshapes the feature maps from the previous layer into a vector, and a fully connected layer with rectified linear unit (ReLU) as its activation function reduces the size of the vector to match the size of the network’s output (i.e., $w_y$). Further, several residual blocks are added to the network to refine the output sequence. Another fully connected layer is added as the output layer of the network. For the convenience of implementation, we make sure that $w_x = 8 w_y$ and that $w_y$ is an even number.
The detailed structure of the GLU convolutional block, which uses GLU as its nonlinearity, is shown in Fig. 3. Two streams of convolutional operations are involved, and the additional convolutional pathway (Fig. 3, left) is used to fulfill the gating mechanism of GLU. More specifically, if we denote the feature maps along the main pathway and the additional pathway as $A \in \mathbb{R}^{m}$ and $B \in \mathbb{R}^{m}$ ($m$ is the number of kernels), respectively, then the output of the convolution operation, $C \in \mathbb{R}^{m}$, is calculated as
$$C = A \otimes \sigma(B) \tag{1}$$
where $\sigma$ is the sigmoid function, and $\otimes$ is the operation of element-wise multiplication [12]. We then concatenate the outputs of each convolution operation along the input sequence to yield the output of the block.
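As a rough illustration of the gating mechanism, the sketch below applies a GLU convolutional block to a single-channel sequence in plain Python. It is a simplified sketch under assumptions that are not taken from the paper: the helper names are hypothetical, biases are omitted, the input has one channel, and the zero padding for an even kernel size puts the extra padded element on the right.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def conv1d_same(sequence, kernel):
    """1-D convolution (cross-correlation) with zero padding, same output length."""
    k = len(kernel)
    pad_left = (k - 1) // 2          # assumption: extra padding goes to the right
    pad_right = k - 1 - pad_left
    padded = [0.0] * pad_left + list(sequence) + [0.0] * pad_right
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(sequence))
    ]


def glu_conv_block(sequence, main_kernels, gate_kernels):
    """Return a list of per-time-step vectors, one entry per kernel.

    For every kernel pair, the main-pathway output is multiplied element-wise
    by the sigmoid of the additional (gating) pathway output.
    """
    per_kernel = []
    for main_k, gate_k in zip(main_kernels, gate_kernels):
        a = conv1d_same(sequence, main_k)
        b = conv1d_same(sequence, gate_k)
        per_kernel.append([x * sigmoid(g) for x, g in zip(a, b)])
    # Transpose from (kernels, time) to (time, kernels).
    return [list(column) for column in zip(*per_kernel)]


# One kernel pair: the main kernel copies the input, the gate outputs 0,
# so every value is scaled by sigmoid(0) = 0.5.
gated = glu_conv_block(
    [1.0, 2.0, 3.0, 4.0],
    main_kernels=[[0.0, 1.0, 0.0, 0.0]],
    gate_kernels=[[0.0, 0.0, 0.0, 0.0]],
)
print(gated)
```

Because the gate lies between 0 and 1, the additional pathway can only attenuate what the main pathway produces; a large positive gate activation lets the main value pass almost unchanged, and a large negative one suppresses it.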
As for the residual block (shown in Fig. 4), we put two fully connected layers within each block, and the ReLU nonlinearity is added between the two layers. The output of the residual block is obtained by
$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \mathcal{W}) + \mathbf{x} \tag{2}$$
where $\mathbf{x}$ is the input to the block, $\mathcal{W}$ is the set of weights and biases associated with the residual block, and $\mathcal{F}$ is the mapping computed by the two fully connected layers. The adoption of residual blocks allows us to increase the learning ability of the network (i.e., increase the depth of the network) without suffering from the vanishing gradient problem [13].
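The skip connection of the residual block can be illustrated with a small plain-Python sketch. The helper names (`dense`, `relu`, `residual_block`) and the toy weights are hypothetical and chosen only to make the arithmetic easy to follow; the sketch assumes that the second fully connected layer maps back to the size of the block input so that the addition is well defined.

```python
def dense(x, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def residual_block(x, w1, b1, w2, b2):
    """Two fully connected layers with ReLU in between, plus a skip connection."""
    hidden = relu(dense(x, w1, b1))
    refinement = dense(hidden, w2, b2)
    if len(refinement) != len(x):
        raise ValueError("second layer must map back to the input size")
    return [xi + ri for xi, ri in zip(x, refinement)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
no_bias = [0.0, 0.0]

# With identity weights the block adds relu(x) to x.
refined = residual_block([1.0, -2.0], identity, no_bias, identity, no_bias)
# With all-zero weights the block reduces to the identity mapping.
unchanged = residual_block([1.0, -2.0], zeros, no_bias, zeros, no_bias)
print(refined, unchanged)
```

The second call shows why such blocks are easy to stack: when the learned mapping contributes nothing, the block simply passes its input through, so adding a block does not have to degrade what the earlier layers already produce.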
III Experiments
In this section, we apply the proposed model to a real-world dataset used for NILM studies and report the disaggregation results for appliances of different characteristics.
III-A The Dataset Used for the Experiments
The Reference Energy Disaggregation Data Set (REDD) [14] is used to demonstrate the effectiveness of the proposed model in this paper. The appliance-level electricity consumption values for various appliances are sampled every 3 seconds for a number of real houses. We train the models on the data of house 2 and test the models on house 1. Six types of appliances, namely, kitchen outlets, lighting devices, microwave, washer dryer, fridge, and dish washer, are used to synthesize the aggregate electricity consumption time series in this paper. The data for house 2 covers a limited time range of roughly 14 days, while the time range for house 1 is more than 30 days. This is a relatively strict setting, as we require the models trained with only the data from one house to generalize to an unseen house. Examples of the synthesized household electricity consumption time series for house 1 and house 2 are plotted in Fig. 5. While the two houses share some similar electricity consumption patterns, it is still challenging to train the model on house 2 and obtain satisfactory disaggregation results on house 1, as the appliances in the houses have different power demands and operating characteristics.
Three types of appliances, namely, fridge, lighting devices, and dish washers, are selected to be the target disaggregation appliances for the following reasons:

The electricity consumption of a fridge appears to be periodic and relatively stable, thus it is easier to learn the consumption patterns for this appliance. However, as the power demands for the fridges are less than 500 Watts, the consumption of fridges can easily be masked by other appliances with high power demands.

Lighting devices generally have low power consumption, so they are hard to separate from other appliances. In addition, the usage of lighting devices is generally more flexible and unpredictable.

The usage of dish washers is very sparse compared with other appliances. We can thus find out whether the proposed model is suitable for unbalanced data.
When used as the output of the model, the power demands of fridge, lighting devices, and dish washer are divided by 500, 200, and 1400 Watts, respectively. The synthesized power demands of both houses are divided by 1000 Watts. For dish washer, we resample from the data of house 2 based on the operation state of the appliance so that the models are able to learn the consumption patterns effectively. More specifically, we include all the samples that correspond to the on state of dish washer, but randomly reject samples that correspond to the off state, so that the proportion of on state samples is large enough for the models to be trained properly.
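The scaling and the off-state rejection described above might look roughly as follows. This is only a sketch under assumptions: the names `scale` and `resample_off_state` are hypothetical, the on/off labels are taken as given (the paper does not state how the operating state is determined), and a fixed number of off-state samples is drawn at random instead of rejecting each one independently, which yields the requested on-state proportion exactly up to rounding.

```python
import random

# Scaling constants in Watts, as stated in the text.
APPLIANCE_SCALE_W = {"fridge": 500.0, "lighting": 200.0, "dish washer": 1400.0}
AGGREGATE_SCALE_W = 1000.0


def scale(values_w, scale_w):
    """Divide power values (in Watts) by a fixed scaling constant."""
    return [v / scale_w for v in values_w]


def resample_off_state(samples, is_on, target_on_fraction, seed=0):
    """Keep every on-state sample and randomly drop off-state samples.

    `is_on[i]` is assumed to flag whether `samples[i]` belongs to the on state.
    """
    rng = random.Random(seed)
    on = [s for s, flag in zip(samples, is_on) if flag]
    off = [s for s, flag in zip(samples, is_on) if not flag]
    wanted_off = round(len(on) * (1.0 - target_on_fraction) / target_on_fraction)
    kept_off = rng.sample(off, min(len(off), wanted_off))
    return on + kept_off


# Toy data: 10 on-state samples followed by 990 off-state samples.
samples = list(range(1000))
is_on = [i < 10 for i in samples]
balanced = resample_off_state(samples, is_on, target_on_fraction=0.1)
print(len(balanced), scale([250.0], APPLIANCE_SCALE_W["fridge"]))
```

With a target on-state proportion of 10%, the 10 on-state samples are kept together with 90 randomly chosen off-state samples. In practice the resampled set would typically be shuffled again before being split into minibatches.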
III-B The Details of the Proposed Network and the Benchmark
The implementation details of the proposed model are as follows:

The size of the input sequence is set to 800. With 3 max pooling layers with a pool size of 2, the output size of the network is 100. The step size of moving the windows along the input and output sequences is set to 5.

The GLU convolutional block. Both pathways have 100 kernels with a size of 4. Zero padding is added so that the output size of the blocks is the same as the input size.

The residual block. Each residual block has two fully connected layers with 50 hidden neurons. The activation function of the first layer is ReLU.

The fully connected layer. Both of the two fully connected layers have 100 neurons. The first fully connected layer uses ReLU nonlinearity, whereas the second fully connected layer is linear as it is the output layer of the network.
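To make the sizes listed above easier to follow, the sketch below traces the layer shapes through the network using the stated hyperparameters. It is a bookkeeping aid under assumptions and not a model definition: the function name `trace_shapes` and the layer labels are hypothetical, and it assumes that each max pooling layer directly follows one GLU convolutional block with 100 kernels, which is one plausible reading of the description.

```python
def trace_shapes(input_length=800, num_pool_layers=3, pool_size=2,
                 num_kernels=100, output_length=100, step=5):
    """Return (layer name, shape) pairs and the number of averaged predictions."""
    length = input_length
    shapes = [("input", (length, 1))]
    for i in range(1, num_pool_layers + 1):
        # Zero padding keeps the sequence length inside the GLU block.
        shapes.append((f"glu_conv_{i}", (length, num_kernels)))
        # Each pooling layer divides the sequence length by the pool size.
        length //= pool_size
        shapes.append((f"max_pool_{i}", (length, num_kernels)))
    shapes.append(("flatten", (length * num_kernels,)))
    shapes.append(("dense_relu", (output_length,)))
    shapes.append(("residual_blocks", (output_length,)))
    shapes.append(("output", (output_length,)))
    # Overlapping predictions per time step away from the sequence edges.
    predictions_per_step = output_length // step
    return shapes, predictions_per_step


shapes, predictions_per_step = trace_shapes()
for name, shape in shapes:
    print(f"{name:16s} {shape}")
print("predictions averaged per time step:", predictions_per_step)
```

Under these assumptions the sequence length shrinks from 800 to 100 after three pooling layers, the flatten layer yields a vector of 10,000 values, and a step size of 5 over output windows of length 100 gives 20 overlapping predictions for most time steps.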
In addition, we compare the proposed model with the sequence to point convolutional network model (a seven-layer convolutional network) proposed in [7]. For a fair comparison, the numbers of kernels or hidden neurons for the layers within the proposed model are not tuned but set to reasonable values, such that the size of the proposed model is comparable with that of the model in [7]. Both models are trained with the Adam optimizer, and the loss is the mean absolute error (MAE) between the disaggregated and actual electricity consumption of an appliance. The minibatch size is set to 32. The models are implemented with Keras 2.0.8 [15], with TensorFlow 1.3.0 as the backend [16]. A Titan Xp GPU is used to speed up the training of the models.
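For reference, the MAE used as the training loss and evaluation metric is the average absolute difference between the disaggregated and the actual values. A minimal plain-Python version is given below; the function name is hypothetical, and the sketch does not assume whether the values are expressed in Watts or in scaled units.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between two equally long sequences."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must have the same length")
    if not actual:
        raise ValueError("sequences must not be empty")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Toy example: absolute errors of 0, 2 and 1 average to 1.
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.0, 4.0, 2.0])
print(mae)
```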
III-C Disaggregation Results
The MAEs of disaggregation results for different appliances are listed in Table I. It is seen in the table that the convolutional sequence to sequence model proposed in this paper outperforms the convolutional sequence to point model in [7] by a large margin. Note that the sequence to point model fails to learn the consumption patterns of lighting devices, resulting in a very large MAE with respect to the power demands of the lighting devices. For dish washer, the proposed model is able to learn the consumption patterns when the proportion of on state samples is 10%, while the sequence to point model requires 50% of the samples to be of on state (the actual proportion of on state samples is only 1.4%).
Model  Fridge  Lighting  Dish washer 
Conv seq2point [7]  
Conv seq2seq (this paper)  
Examples of disaggregation results for the fridge are shown in Fig. 6. Compared with the sequence to point model, the proposed model is more robust to highly fluctuating sections in the aggregate consumption time series (the disaggregation in the latter half of the figure is influenced by the power demands of kitchen outlets and washer dryers).
In Fig. 7, we demonstrate the effectiveness of the proposed model on small power appliances (lighting devices). The power demands of different levels can be reflected by the disaggregation results. The performance of the sequence to point model is not presented, as it is unable to learn the consumption patterns and produces outputs that are all close to a small constant.
Illustrative disaggregation results for the dish washer are presented in Fig. 8. It is seen in the figure that the proposed model yields sharper demand profiles and is less prone to producing false alarms. In fact, only one false alarm is observed for the proposed model, whereas the results of the sequence to point model have many spikes, partly because that model is misled by the resampled dataset with a very high proportion of on-state samples.
IV Conclusion and Future Work
In this paper, we propose a sequence to sequence NILM framework based on convolutional networks. The mapping from a long sequence of aggregate power consumption to a short sequence of the power consumption of an individual appliance is facilitated by the GLU convolutional blocks and max pooling layers. Further refinement of the output sequence is fulfilled by residual blocks of fully connected layers. The partially overlapping output sequences are filtered to construct the final disaggregation results. The experiments in this paper show that the proposed NILM framework outperforms existing NILM models based on convolutional neural networks. Several future paths are worth taking:

The architecture of the network can be further optimized. Firstly, we may add an attention mechanism to the network so that the network can learn to focus on certain parts of the aggregate electricity consumption sequence (or inputs of intermediate layers) [8]. In addition, we may replace the fully connected layers within the network with convolutional layers for consistency, as the structure of a flatten layer followed by a fully connected layer partly loses the temporal relations of the points within the input sequence.

The way we filter the partially overlapping output sequences can be further improved so that the model would be more robust to noisy input and complicated electricity consumption patterns. It is also of interest to integrate this process into the neural network model (e.g., the final output can be a synthesis of multiple proposals that indicate the states of the appliance [17]).
Acknowledgement
We are very grateful for the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
References
 [1] J. Kelly and W. Knottenbelt, “Neural NILM: Deep neural networks applied to energy disaggregation,” in Proceedings of the 2nd ACM International Conference on Embedded Systems for Energy-Efficient Built Environments. ACM, 2015, pp. 55–64.
 [2] C. Fischer, “Feedback on household electricity consumption: a tool for saving energy?” Energy efficiency, vol. 1, no. 1, pp. 79–104, 2008.
 [3] A. Zoha, A. Gluhak, M. A. Imran, and S. Rajasegarar, “Nonintrusive load monitoring approaches for disaggregated energy sensing: A survey,” Sensors, vol. 12, no. 12, pp. 16 838–16 866, 2012.
 [4] A. Faustine, N. H. Mvungi, S. Kaijage, and K. Michael, “A survey on nonintrusive load monitoring methodies and techniques for energy disaggregation problem,” arXiv preprint arXiv:1703.00785, 2017.
 [5] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.
 [6] J. Kim, T.-T.-H. Le, and H. Kim, “Nonintrusive load monitoring based on advanced deep learning and novel signature,” Computational Intelligence and Neuroscience, vol. 2017, 2017.
 [7] C. Zhang, M. Zhong, Z. Wang, N. Goddard, and C. Sutton, “Sequence-to-point learning with neural networks for nonintrusive load monitoring,” in National Conference on Artificial Intelligence (AAAI), 2018.
 [8] P. P. M. do Nascimento, “Applications of deep learning techniques on NILM,” Ph.D. dissertation, Universidade Federal do Rio de Janeiro, 2016.
 [9] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 6645–6649.
 [10] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proceedings of the Third International Conference on Learning Representations, 2015.
 [11] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” arXiv preprint arXiv:1612.08083, 2016.
 [12] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional sequence to sequence learning,” arXiv preprint arXiv:1705.03122, 2017.
 [13] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [14] J. Z. Kolter and M. J. Johnson, “REDD: A public data set for energy disaggregation research,” in Workshop on Data Mining Applications in Sustainability (SIGKDD), San Diego, CA, vol. 25, 2011, pp. 59–62.
 [15] F. Chollet et al., “Keras,” https://keras.io, 2015.
 [16] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
 [17] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.