Few-shot Learning for Time-series Forecasting
Time-series forecasting is important for many applications. Forecasting models are usually trained using time-series data from a specific target task. However, sufficient data in the target task might be unavailable, which leads to performance degradation. In this paper, we propose a few-shot learning method that forecasts future values of a time-series in a target task given only a few time-series in that task. Our model is trained using time-series data in multiple training tasks that are different from target tasks. Our model uses a few time-series to build a forecasting function based on a recurrent neural network with an attention mechanism. With the attention mechanism, we can retrieve patterns in a small number of time-series that are useful for the current situation. Our model is trained by minimizing the expected test error of forecasting next-timestep values. We demonstrate the effectiveness of the proposed method using 90 time-series datasets.
Time-series forecasting is important for many applications, including financial markets [4, 23, 8], energy management [11, 46], traffic systems [29, 20, 31, 55], and environmental engineering. Recently, deep learning methods, such as Long Short-Term Memory (LSTM), have been widely used for time-series forecasting due to their high performance [3, 35, 29].
Forecasting models are usually trained using time-series data in a specific target task, where we want to forecast future values. For example, to train traffic congestion forecasting models, we use traffic congestion time-series data at many locations. However, sufficient data in the target task might be unavailable, which leads to performance degradation.
In this paper, we propose a few-shot learning method that forecasts time-series in a target task given a few time-series, where time-series in the target task are not given in the training phase. The proposed method trains our model using time-series data in multiple training tasks that are different from the target task. Figure 1 illustrates our problem formulation. Time-series in other tasks might have dynamics similar to those in the target task. For example, many time-series include a trend, which shows the long-term tendency of the time-series to increase or decrease. Also, time-series related to human activity, such as traffic volume and electric power consumption, exhibit daily and/or weekly cyclic dynamics. By using knowledge learned from various time-series data, we can improve the forecasting performance on the target task.
Given a few time-series, which are called a support set, our model outputs the value at the next timestep of a time-series, which is called a query. In particular, first, we obtain representations of the support set with a bidirectional LSTM. Then, we forecast future values of the query considering the support representations based on an attention mechanism as well as the query's own pattern based on an LSTM. With the attention mechanism, we can retrieve patterns in the support set that are useful for the current situation. In addition, the attention mechanism allows us to handle support sets with different numbers of time-series of different lengths. Given a target task, our model forecasts future values that are tailored to the target task without retraining. Our model is trained by minimizing the expected test error of forecasting next-timestep values given a support set, which is calculated using data in multiple training tasks.
The main contributions of this paper are:
Our method is the first few-shot learning method for time-series forecasting that does not require retraining on target tasks.
Our model can handle support sets of different sizes and time-series of different lengths with an attention mechanism and LSTMs.
We demonstrate the effectiveness of the proposed method using 90 time-series datasets.
The remainder of this paper is organized as follows. In Section 2, we briefly review related work. In Section 3, we propose our model and its training procedure for few-shot time-series forecasting. In Section 4, we show that the proposed method outperforms existing methods. Finally, we give concluding remarks and future work in Section 5.
2 Related work
To transfer knowledge in source tasks to target tasks, many transfer learning, domain adaptation, and multi-task learning methods have been proposed [49, 33, 21, 19, 26]. However, these methods require a relatively large number of time-series of target tasks. To reduce the required number of target examples, few-shot learning, or meta-learning, has attracted considerable attention recently [45, 6, 40, 2, 51, 47, 5, 13, 32, 24, 14, 44, 54, 12, 15, 22, 16, 7, 41, 42, 50, 34, 53, 28]. There are some applications of few-shot learning to time-series forecasting [18, 43, 30, 38, 48, 1]. Existing few-shot time-series forecasting methods can be categorized into two types: finetune-based and meta feature-based. Finetune-based methods [18, 43] train models using training tasks, and finetune the models given target tasks. On the other hand, the proposed method does not need to retrain the model given target tasks. Meta feature-based methods [30, 38, 48, 1] use meta features of time-series, such as standard deviation and length, to select forecasting models. In contrast, the proposed method does not require hand-designed meta features; it extracts latent representations of time-series with LSTMs. Neural network-based time-series forecasting models, which can be considered as a meta-learning method, are used for zero-shot time-series forecasting. Since they consider zero-shot learning, where no target examples are given, they cannot use given target time-series data. Recurrent attentive neural processes use recurrent neural networks with an attention mechanism for meta-learning, where attentions are connected to past sequences for extending neural processes [22, 52, 15]. Therefore, they cannot use different time-series data given as a support set. On the other hand, the proposed method connects attentions to time-series data in a support set to use them for improving the performance.
3 Proposed method
We describe our model that uses a support set to build a forecasting function in Section 3.1. In Section 3.2, we present the training procedure for our model given sets of time-series in multiple tasks. Then, we describe a test phase in Section 3.3.
Let $\mathcal{S} = \{\mathbf{s}_n\}_{n=1}^{N_{\mathrm{S}}}$ be a support set, where $\mathbf{s}_n = (s_{n1}, \dots, s_{nT_n})$ is the $n$th time-series, $s_{nt}$ is a scalar continuous value at timestep $t$, $T_n$ is its length, and $N_{\mathrm{S}}$ is the number of time-series in the support set. Our model uses support set $\mathcal{S}$ to build a forecasting function that outputs predictive value $\hat{x}_{T+1}$ at the next timestep given query time-series $\mathbf{x} = (x_1, \dots, x_T)$ in the same task as the support set. Figure 2 illustrates our model.
First, we obtain representations of each timestep of each time-series in support set $\mathcal{S}$ using a bidirectional LSTM in the form of hidden states:
$\overrightarrow{\mathbf{h}}_{nt} = \overrightarrow{\mathrm{LSTM}}(s_{nt}, \overrightarrow{\mathbf{h}}_{n,t-1}), \quad \overleftarrow{\mathbf{h}}_{nt} = \overleftarrow{\mathrm{LSTM}}(s_{nt}, \overleftarrow{\mathbf{h}}_{n,t+1}),$
where $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ are forward and backward LSTMs, and $\overrightarrow{\mathbf{h}}_{nt}$ and $\overleftarrow{\mathbf{h}}_{nt}$ are forward and backward hidden states of the $n$th support time-series at timestep $t$. The forward (backward) hidden state $\overrightarrow{\mathbf{h}}_{nt}$ ($\overleftarrow{\mathbf{h}}_{nt}$) contains information about the time-series before (after) timestep $t$. We use the concatenated vector of the forward and backward hidden states, $\mathbf{h}_{nt} = [\overrightarrow{\mathbf{h}}_{nt}; \overleftarrow{\mathbf{h}}_{nt}]$, as the representation of the $n$th time-series at timestep $t$, where $[\cdot;\cdot]$ represents the concatenation of vectors. With the bidirectional LSTM, we can encode both past and future information in representation $\mathbf{h}_{nt}$, which is important for forecasting. In addition, LSTMs enable us to handle time-series of different lengths.
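The bidirectional encoding step can be sketched as follows. This is a minimal illustration rather than the actual implementation: a vanilla RNN cell (`rnn_pass`) stands in for the LSTM to keep the code short, and all weights are random, hypothetical parameters.

```python
import numpy as np

def rnn_pass(x, W_h, W_x, b):
    """One directional pass of a vanilla RNN over a 1-D series x.

    Returns the hidden state at every timestep, shape (T, hidden_dim).
    A vanilla RNN cell replaces the paper's LSTM for brevity; the
    bidirectional structure is the same.
    """
    h = np.zeros(W_h.shape[0])
    states = []
    for t in range(len(x)):
        h = np.tanh(W_h @ h + W_x * x[t] + b)
        states.append(h)
    return np.stack(states)

def encode_bidirectional(x, params_fwd, params_bwd):
    """Concatenate forward and backward hidden states per timestep."""
    h_fwd = rnn_pass(x, *params_fwd)              # past information
    h_bwd = rnn_pass(x[::-1], *params_bwd)[::-1]  # future information
    return np.concatenate([h_fwd, h_bwd], axis=1)  # (T, 2 * hidden_dim)

rng = np.random.default_rng(0)
hidden = 4
make = lambda: (rng.normal(size=(hidden, hidden)) * 0.1,
                rng.normal(size=hidden) * 0.1,
                rng.normal(size=hidden) * 0.1)
series = rng.normal(size=7)          # a support time-series of length 7
reps = encode_bidirectional(series, make(), make())
print(reps.shape)                    # (7, 8): one representation per timestep
```

Each row of `reps` corresponds to one $\mathbf{h}_{nt}$; running this per support series yields representations even when the series have different lengths.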
Second, we obtain a representation of query time-series $\mathbf{x}$ with an LSTM:
$\mathbf{q}_t = \mathrm{LSTM}(x_t, \mathbf{q}_{t-1}),$
where $\mathbf{q}_t$ is the hidden state at timestep $t$. We use the hidden state at the last timestep as the query's representation, $\mathbf{q} = \mathbf{q}_T$.
Third, we extract knowledge from the support set that is useful for forecasting using an attention mechanism:
$\mathbf{a} = \sum_{n=1}^{N_{\mathrm{S}}} \sum_{t=1}^{T_n} \alpha_{nt} \mathbf{W}_{\mathrm{V}} \mathbf{h}_{nt}, \quad \alpha_{nt} = \frac{\exp\bigl( (\mathbf{W}_{\mathrm{K}} \mathbf{h}_{nt})^{\top} \mathbf{W}_{\mathrm{Q}} \mathbf{q} \bigr)}{\sum_{n'=1}^{N_{\mathrm{S}}} \sum_{t'=1}^{T_{n'}} \exp\bigl( (\mathbf{W}_{\mathrm{K}} \mathbf{h}_{n't'})^{\top} \mathbf{W}_{\mathrm{Q}} \mathbf{q} \bigr)},$
where $\mathbf{W}_{\mathrm{Q}}$, $\mathbf{W}_{\mathrm{K}}$, and $\mathbf{W}_{\mathrm{V}}$ are linear projection matrices. When there are support time-series that have locally similar patterns to the query, the attention mechanism retrieves the information at those points, $\mathbf{W}_{\mathrm{V}} \mathbf{h}_{nt}$. The similarity is calculated by the inner product between the linearly transformed support representations, $\mathbf{W}_{\mathrm{K}} \mathbf{h}_{nt}$, and the linearly transformed query representation, $\mathbf{W}_{\mathrm{Q}} \mathbf{q}$. By training our model to minimize the expected forecasting error as described in Section 3.2, the attention mechanism retrieves information that is effective for improving the forecasting performance. Since the parameters of the attention mechanism, $\mathbf{W}_{\mathrm{Q}}$, $\mathbf{W}_{\mathrm{K}}$, and $\mathbf{W}_{\mathrm{V}}$, do not depend on the number of time-series in the support set, we can deal with support sets of different sizes.
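This attention step can be sketched with numpy as below. The shapes and random projection matrices are hypothetical, and the scaling of the inner products is a common convention assumed here, not stated in the text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(query_rep, support_reps, W_q, W_k, W_v):
    """Softmax inner-product attention over all timesteps of all
    support series, mirroring the projections W_Q, W_K, W_V above.

    support_reps: list of (T_n, d) arrays, one per support time-series;
    lengths T_n may differ, which is why flattening them into one pool
    of keys handles support sets of any size and length.
    """
    keys = np.concatenate(support_reps, axis=0)   # (sum of T_n, d)
    q = W_q @ query_rep                           # projected query
    k = keys @ W_k.T                              # projected supports (keys)
    v = keys @ W_v.T                              # values to retrieve
    # scaling by sqrt(dim) is an assumed stabilizing convention
    weights = softmax(k @ q / np.sqrt(len(q)))
    return weights @ v, weights                   # attention output a

rng = np.random.default_rng(1)
d, d_att = 8, 4
W_q, W_k, W_v = (rng.normal(size=(d_att, d)) for _ in range(3))
supports = [rng.normal(size=(t, d)) for t in (5, 9)]  # different lengths
query_rep = rng.normal(size=d)
out, w = attend(query_rep, supports, W_q, W_k, W_v)
print(out.shape)                                  # (4,)
```

Note that `W_q`, `W_k`, and `W_v` have fixed shapes regardless of how many support series are supplied, which is what makes variable support-set sizes possible.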
Then, we forecast the value at the next timestep using both attention output $\mathbf{a}$ and query representation $\mathbf{q}$:
$\hat{x}_{T+1} = g([\mathbf{a}; \mathbf{q}]),$
where $g$ is a feed-forward neural network, and the parameters of our model, $\boldsymbol{\Theta}$, consist of the parameters of bidirectional LSTMs $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$, the query LSTM, feed-forward neural network $g$, and linear projection matrices $\mathbf{W}_{\mathrm{Q}}$, $\mathbf{W}_{\mathrm{K}}$, and $\mathbf{W}_{\mathrm{V}}$ in the attention mechanism. By including query representation $\mathbf{q}$ in the input of the neural network, we can forecast using the query's own past values even if there is no useful information in the support set.
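The final step can be sketched as a small feed-forward head on the concatenated vector $[\mathbf{a}; \mathbf{q}]$. The single hidden layer and random weights here are hypothetical simplifications of the network $g$.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forecast_head(attention_out, query_rep, weights):
    """Feed-forward head g: maps the concatenation [a; q] to a single
    next-timestep prediction. One hidden layer for brevity."""
    z = np.concatenate([attention_out, query_rep])  # [a; q]
    (W1, b1), (W2, b2) = weights
    return (W2 @ relu(W1 @ z + b1) + b2)[0]

rng = np.random.default_rng(3)
d_att, d_query, hidden = 4, 8, 32     # illustrative dimensions
weights = [(rng.normal(size=(hidden, d_att + d_query)) * 0.1,
            np.zeros(hidden)),
           (rng.normal(size=(1, hidden)) * 0.1, np.zeros(1))]
x_hat = forecast_head(rng.normal(size=d_att), rng.normal(size=d_query),
                      weights)
print(float(x_hat))                   # a single scalar forecast
```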
In the training phase, we are given sets of one-dimensional time-series in $D$ tasks, $\{\mathcal{X}_d\}_{d=1}^{D}$, where $\mathcal{X}_d = \{\mathbf{x}_{dn}\}_{n=1}^{N_d}$ is the set of time-series in task $d$, $\mathbf{x}_{dn} = (x_{dn1}, \dots, x_{dnT_{dn}})$ is the $n$th time-series in task $d$, $x_{dnt}$ is a scalar continuous value at timestep $t$, $T_{dn}$ is its length, and $N_d$ is the number of time-series in task $d$.
We estimate model parameters $\boldsymbol{\Theta}$ by minimizing the expected loss on a query set given a support set using an episodic training framework, where support and query sets are randomly generated from the training datasets to simulate target tasks:
$\hat{\boldsymbol{\Theta}} = \arg\min_{\boldsymbol{\Theta}} \mathbb{E}_{d} \, \mathbb{E}_{(\mathcal{S}, \mathcal{Q}) \sim \mathcal{X}_d} \left[ L(\mathcal{Q}, \mathcal{S}; \boldsymbol{\Theta}) \right],$
where $\mathbb{E}$ represents an expectation, and
$L(\mathcal{Q}, \mathcal{S}; \boldsymbol{\Theta}) = \frac{1}{N_{\mathrm{Q}}} \sum_{n=1}^{N_{\mathrm{Q}}} \frac{1}{T_n - 1} \sum_{t=2}^{T_n} \bigl( x_{nt} - \hat{x}(\mathbf{x}_{n,1:t-1}; \mathcal{S}) \bigr)^2$
is the mean squared error of the predictions of the next-timestep values in query set $\mathcal{Q}$ given support set $\mathcal{S}$, where $N_{\mathrm{Q}}$ is the number of time-series in the query set, $T_n$ is the length of the $n$th time-series in the query set, $x_{nt}$ is the value of the $n$th query sequence at timestep $t$, and $\mathbf{x}_{n,1:t-1}$ is the time-series until timestep $t-1$.
The training procedure of our model is shown in Algorithm 1. For each iteration, we randomly generate support and query sets (Lines 3 – 5) from a randomly selected task. Given the support and query sets, we calculate the loss (Line 6) by (6). We update model parameters by using stochastic gradient descent methods (Line 7).
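The episodic procedure can be sketched as below. To keep the sketch self-contained, a hypothetical one-parameter AR(1)-style forecaster (which ignores the support set) stands in for the full attention model; the episode sampling and next-step squared-error objective follow the procedure above.

```python
import numpy as np

def episode_grad(tasks, theta, n_query=2, rng=None):
    """One training episode: pick a random task, sample query series,
    and return the next-step MSE and its gradient w.r.t. theta for a
    placeholder forecaster x_hat[t] = theta * x[t-1]."""
    rng = rng or np.random.default_rng()
    task = tasks[rng.integers(len(tasks))]
    errs, grads = [], []
    for i in rng.permutation(len(task))[:n_query]:
        q = task[i]
        t = int(rng.integers(1, len(q)))   # forecast q[t] from the past
        residual = theta * q[t - 1] - q[t]
        errs.append(residual ** 2)
        grads.append(2.0 * residual * q[t - 1])
    return float(np.mean(errs)), float(np.mean(grads))

def dataset_mse(tasks, theta):
    """Deterministic next-step MSE over every series and timestep."""
    errs = [(theta * q[t - 1] - q[t]) ** 2
            for task in tasks for q in task for t in range(1, len(q))]
    return float(np.mean(errs))

# toy training data: 4 tasks, 6 series each, from an AR(1) process
rng = np.random.default_rng(2)
def make_series(T=30):
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = 0.8 * x[t - 1] + rng.normal()
    return x
tasks = [[make_series() for _ in range(6)] for _ in range(4)]

theta, lr = 0.0, 0.01                   # SGD over episodes
mse_before = dataset_mse(tasks, theta)
for _ in range(1500):
    _, g = episode_grad(tasks, theta, rng=rng)
    theta -= lr * g
print(round(theta, 2), dataset_mse(tasks, theta) < mse_before)
```

In the full method, the gradient step would update all parameters $\boldsymbol{\Theta}$ through the LSTMs and attention rather than a single coefficient, but the sample-task, sample-sets, compute-loss, update loop is the same.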
In the test phase, we are given a few time-series in a new task as a support set $\mathcal{S}$. Then, we obtain a model that forecasts the value at the next timestep given a query time-series in that task.
4 Experiments
4.1 Data
We evaluated the proposed method using time-series datasets obtained from the UCR Time Series Classification Archive [10, 9], which originally contains time-series data in 128 tasks. We omitted tasks that contain missing values, time-series shorter than 100 timesteps, or fewer than 50 time-series, which left time-series data in 90 tasks. We used the values at the first 100 timesteps of each time-series. We randomly split the tasks into 55 training, 10 validation, and 25 target tasks, where each task contains 50 time-series. We normalized the values of each task to zero mean and unit variance.
4.2 Our model setting
We used a bidirectional LSTM with hidden units for encoding support sets, and an LSTM with hidden units for encoding query sets. In the attention mechanism, we used and . For the neural network that outputs forecasting value $\hat{x}_{T+1}$, we used a three-layered feed-forward neural network with 32 hidden units. The activation function in the neural networks was the rectified linear unit, $\max(0, x)$. Optimization was performed using Adam with learning rate and dropout rate . The maximum number of training epochs was 500, and the validation datasets were used for early stopping. We set the support set size at , and the query set size at .
4.3 Comparing methods
We compared the proposed method with three types of training frameworks: model-agnostic meta-learning (MAML), domain-independent learning (DI), and domain-specific learning (DS). With MAML, initial model parameters are optimized so that they perform well when finetuned with a support set. For the finetuning, Adam with learning rate and five epochs was used. With DI, a model was trained by minimizing the error on all training tasks. With DS, a model was trained by minimizing the error on the support set of the target task. For MAML, DI, and DS, we used three types of models: LSTM, neural network (NN), and linear models (Linear). With LSTM, we used an LSTM with 32 hidden units; for forecasting values at the next timestep, we used a three-layered feed-forward neural network with 32 hidden units that takes the output of the LSTM. With NN, we used three-layered feed-forward neural networks with 32 hidden units that take the value at the previous timestep. With Linear, we used linear regression models that take the value at the previous timestep. We also compared with a method that outputs the value at the previous timestep (Pre).
Tables 1 and 2 show the root mean squared error of next-timestep forecasting for each target task, averaged over 30 experiments with different training, validation, and target splits. The proposed method achieved performance that was not different from the best method in 62 of the 90 target tasks, which was the most among the comparing methods. Generally, LSTM was better than NN, and NN was better than Linear. This result indicates that LSTM-based recurrent neural networks are appropriate for forecasting time-series. LSTM-MAML was worse than the proposed method. The reason is that time-series dynamics are very different across tasks, and it is difficult to finetune well from a single initial parameter setting for such diverse tasks. On the other hand, the proposed method flexibly adapts to target tasks with the attention mechanism given the support set. LSTM-DI showed performance similar to LSTM-MAML. Although LSTM-DI does not use support sets of target tasks, it can give task-specific forecasting by taking query time-series as input with the LSTM. Figure 3 shows some examples of true and forecasted values by the proposed method, LSTM-MAML, and LSTM-DI. The proposed method forecasted appropriately under the different dynamics of the target tasks.
Figure 4(a) shows the average mean squared error with different numbers of training tasks for the proposed method, LSTM-MAML, and LSTM-DI. All the methods decreased the errors as the number of training tasks increased. Figure 4(b) shows the average mean squared error with different test support sizes for the proposed method and LSTM-MAML, where the training support size was three. Even when the test support size was different from training, the proposed method and LSTM-MAML decreased the error as the test support size increased. Table 3 shows the average computational time in seconds of training with all training tasks, and the test time for each target task, measured on computers with 2.30GHz CPUs with five cores. The proposed method had slightly shorter training and test time than LSTM-MAML.
Figure 3 panels: (a) Beef, (b) Chlorine, (c) EOG, (d) FaceFour, (e) InlineSkate, (f) PigCVP, (g) Rock, (h) WordSynonyms, (i) Worms.
Figure 4 panels: (a) number of training tasks, (b) test support size.
5 Conclusion
In this paper, we proposed a meta-learning method for time-series forecasting, where our model is trained with many time-series datasets. Our model can forecast future values specific to a target task using a few time-series in the target task with recurrent neural networks and an attention mechanism. For future work, we plan to apply the proposed method to multivariate time-series datasets.
References
- A. R. Ali, B. Gabrys, and M. Budka. Cross-domain meta-learning for time-series forecasting. Procedia Computer Science, 126:9–18, 2018.
- M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. De Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
- M. Assaad, R. Boné, and H. Cardot. A new boosting algorithm for improved time-series forecasting with recurrent neural networks. Information Fusion, 9(1):41–55, 2008.
- E. M. Azoff. Neural Network Time Series Forecasting of Financial Markets. John Wiley & Sons, Inc., 1994.
- S. Bartunov and D. Vetrov. Few-shot generative modelling with generative matching networks. In International Conference on Artificial Intelligence and Statistics, pages 670–678, 2018.
- Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. In International Joint Conference on Neural Networks, 1991.
- J. Bornschein, A. Mnih, D. Zoran, and D. J. Rezende. Variational memory addressing in generative models. In Advances in Neural Information Processing Systems, pages 3920–3929, 2017.
- L.-J. Cao and F. E. H. Tay. Support vector machine with adaptive parameters in financial time series forecasting. IEEE Transactions on Neural Networks, 14(6):1506–1518, 2003.
- H. A. Dau, A. Bagnall, K. Kamgar, C.-C. M. Yeh, Y. Zhu, S. Gharghabi, C. A. Ratanamahatana, and E. Keogh. The UCR time series archive. IEEE/CAA Journal of Automatica Sinica, 6(6):1293–1305, 2019.
- H. A. Dau, E. Keogh, K. Kamgar, C.-C. M. Yeh, Y. Zhu, S. Gharghabi, C. A. Ratanamahatana, Yanping, B. Hu, N. Begum, A. Bagnall, A. Mueen, G. Batista, and Hexagon-ML. The UCR time series classification archive, October 2018. https://www.cs.ucr.edu/~eamonn/time_series_data_2018/.
- C. Deb, F. Zhang, J. Yang, S. E. Lee, and K. W. Shah. A review on time series forecasting techniques for building energy consumption. Renewable and Sustainable Energy Reviews, 74:902–924, 2017.
- H. Edwards and A. Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
- C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1126–1135, 2017.
- C. Finn, K. Xu, and S. Levine. Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pages 9516–9527, 2018.
- M. Garnelo, D. Rosenbaum, C. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. Rezende, and S. A. Eslami. Conditional neural processes. In International Conference on Machine Learning, pages 1690–1699, 2018.
- L. B. Hewitt, M. I. Nye, A. Gane, T. Jaakkola, and J. B. Tenenbaum. The variational homoencoder: Learning to learn high capacity generative models from few examples. arXiv preprint arXiv:1807.08919, 2018.
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- A. Hooshmand and R. Sharma. Energy predictive models with limited data using transfer learning. In Proceedings of the Tenth ACM International Conference on Future Energy Systems, pages 12–16, 2019.
- Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. L. Moreno, Y. Wu, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Advances in Neural Information Processing Systems, pages 4480–4490, 2018.
- T. A. Jilani, S. A. Burney, and C. Ardil. Multivariate high order fuzzy time series forecasting for car road accidents. International Journal of Computational Intelligence, 4(1):15–20, 2007.
- T. W. Killian, S. Daulton, G. Konidaris, and F. Doshi-Velez. Robust and efficient transfer learning with hidden parameter markov decision processes. In Advances in Neural Information Processing Systems, pages 6250–6261, 2017.
- H. Kim, A. Mnih, J. Schwarz, M. Garnelo, A. Eslami, D. Rosenbaum, O. Vinyals, and Y. W. Teh. Attentive neural processes. arXiv preprint arXiv:1901.05761, 2019.
- K.-j. Kim. Financial time series forecasting using support vector machines. Neurocomputing, 55(1-2):307–319, 2003.
- T. Kim, J. Yoon, O. Dia, S. Kim, Y. Bengio, and S. Ahn. Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems, 2018.
- D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
- A. Kumagai, T. Iwata, and Y. Fujiwara. Transfer anomaly detection by inferring latent domain representations. In Advances in Neural Information Processing Systems, pages 2467–2477, 2019.
- G. Lachtermacher and J. D. Fuller. Backpropagation in hydrological time series forecasting. In Stochastic and Statistical Methods in Hydrology and Environmental Engineering, pages 229–242. Springer, 1994.
- B. M. Lake. Compositional generalization through meta sequence-to-sequence learning. In Advances in Neural Information Processing Systems, pages 9788–9798, 2019.
- N. Laptev, J. Yosinski, L. E. Li, and S. Smyl. Time-series extreme event forecasting with neural networks at uber. In International Conference on Machine Learning, volume 34, pages 1–5, 2017.
- C. Lemke and B. Gabrys. Meta-learning for time series forecasting and forecast combination. Neurocomputing, 73(10-12):2006–2016, 2010.
- Y. Li, R. Yu, C. Shahabi, and Y. Liu. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926, 2017.
- Z. Li, F. Zhou, F. Chen, and H. Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv preprint arXiv:1707.09835, 2017.
- M. Long, H. Zhu, J. Wang, and M. I. Jordan. Deep transfer learning with joint adaptation networks. In Proceedings of the 34th International Conference on Machine Learning, pages 2208–2217, 2017.
- J. Narwariya, P. Malhotra, L. Vig, G. Shroff, and T. Vishnu. Meta-learning for few-shot time series classification. In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, pages 28–36. 2020.
- O. Ogunmolu, X. Gu, S. Jiang, and N. Gans. Nonlinear systems identification using deep dynamic neural networks. arXiv preprint arXiv:1610.01439, 2016.
- B. N. Oreshkin, D. Carpov, N. Chapados, and Y. Bengio. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437, 2019.
- B. N. Oreshkin, D. Carpov, N. Chapados, and Y. Bengio. Meta-learning framework with applications to zero-shot time-series forecasting. arXiv preprint arXiv:2002.02887, 2020.
- R. B. Prudêncio and T. B. Ludermir. Meta-learning approaches to selecting time series models. Neurocomputing, 61:121–137, 2004.
- S. Qin, J. Zhu, J. Qin, W. Wang, and D. Zhao. Recurrent attentive neural process for sequential data. arXiv preprint arXiv:1910.09323, 2019.
- S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
- S. Reed, Y. Chen, T. Paine, A. v. d. Oord, S. Eslami, D. Rezende, O. Vinyals, and N. de Freitas. Few-shot autoregressive density estimation: Towards learning to learn distributions. arXiv preprint arXiv:1710.10304, 2017.
- D. J. Rezende, S. Mohamed, I. Danihelka, K. Gregor, and D. Wierstra. One-shot generalization in deep generative models. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, pages 1521–1529, 2016.
- M. Ribeiro, K. Grolinger, H. F. ElYamany, W. A. Higashino, and M. A. Capretz. Transfer learning with seasonal and trend adjustment for cross-building energy forecasting. Energy and Buildings, 165:352–363, 2018.
- A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell. Meta-learning with latent embedding optimization. In International Conference on Learning Representations, 2019.
- J. Schmidhuber. Evolutionary principles in self-referential learning. On learning how to learn: the meta-meta-meta…-hook. Master's thesis, Technische Universität München, Germany, 1987.
- A. Sfetsos. A comparison of various forecasting techniques applied to mean hourly wind speed time series. Renewable Energy, 21(1):23–35, 2000.
- J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
- T. S. Talagala, R. J. Hyndman, G. Athanasopoulos, et al. Meta-learning how to forecast time series. Monash Econometrics and Business Statistics Working Papers, 6:18, 2018.
- C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu. A survey on deep transfer learning. In International Conference on Artificial Neural Networks, pages 270–279. Springer, 2018.
- W. Tang, L. Liu, and G. Long. Few-shot time-series classification with dual interpretability. In ICML Time Series Workshop. 2019.
- O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016.
- T. Willi, J. Masci, J. Schmidhuber, and C. Osendorfer. Recurrent neural processes. arXiv preprint arXiv:1906.05915, 2019.
- Y. Xie, H. Jiang, F. Liu, T. Zhao, and H. Zha. Meta learning with relational information for short sequences. In Advances in Neural Information Processing Systems, pages 9901–9912, 2019.
- H. Yao, Y. Wei, J. Huang, and Z. Li. Hierarchically structured meta-learning. In International Conference on Machine Learning, pages 7045–7054, 2019.
- B. Yu, H. Yin, and Z. Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In International Joint Conference on Artificial Intelligence, 2018.