Evaluating Data Augmentation for Financial Time Series Classification


Data augmentation methods in combination with deep neural networks have been used extensively in computer vision on classification tasks, achieving great success; however, their use in time series classification is still at an early stage. This is even more so in the field of financial prediction, where data tends to be small, noisy and non-stationary. In this paper we evaluate several augmentation methods applied to stock datasets using two state-of-the-art deep learning models. The results show that several augmentation methods significantly improve financial performance when used in combination with a trading strategy. For a relatively small dataset ( samples), augmentation methods achieve up to improvement in risk-adjusted return performance; for a larger stock dataset ( samples), results show up to improvement.


Elizabeth Fons1, Paula Dawson, Xiao-jun Zeng, John Keane, Alexandros Iosifidis
School of Computer Science, University of Manchester, UK.
AllianceBernstein, London, UK.
Department of Electrical and Computer Engineering, Aarhus University, Denmark

Keywords: Data augmentation, financial signal processing, stock classification, deep learning

1 Introduction

Time series classification is an important and challenging problem that has garnered much attention, as time series data is found across a wide range of fields, such as weather prediction, financial markets and medical records. Recently, given the success of deep learning methods in areas such as computer vision and natural language processing, deep neural networks have been increasingly used for time series classification tasks. However, unlike image or text datasets, (annotated) time series datasets tend to be small, which often leads to poor performance on the classification task [10]. This is especially true of financial data, where a year of stock price data may consist of only daily prices [17]. Therefore, in order to leverage the full potential of deep learning methods for time series classification, more labeled data is needed.

A common strategy to address this problem is the use of data augmentation techniques to generate new sequences that cover unexplored regions of the input space while maintaining correct labels, thus preventing over-fitting and improving model generalization [15]. This practice has been shown to be very effective in other areas, but it is not an established procedure for time series classification [20, 9]. Moreover, most of the methods used are just adaptations of image-based augmentation methods that rely on simple transformations, such as scaling, rotation and adding noise. While a few data augmentation methods have been specifically developed for time series [10, 8], their effectiveness in the classification of financial time series has not been systematically studied.

Stock classification is a challenging task due to the high volatility and noise from the influence of external factors, such as the global economy and investor behaviour [6]. An additional challenge is that financial datasets tend to be small; ten years of daily stock prices would include around samples, which would be insufficient to train even a small neural network (e.g. a single-layer LSTM network with neurons has approximately parameters). In this work we perform a systematic analysis of multiple individual data augmentation methods on stock classification. To compare the different data augmentation methods, we evaluate them using two state-of-the-art neural network models that have been used for financial tasks. As the usual purpose of stock classification tasks is to build portfolios, we compare the results of each method and each architecture by building simple rule-based portfolios and calculating the main financial metrics to assess the performance of each portfolio. Finally, we analyse the combination of multiple data augmentation methods, focusing on the best-performing ones.

The contributions of the paper are as follows:

  • We provide the first, to the best of our knowledge, thorough evaluation of popular data augmentation methods for time series on the stock classification problem; we perform an in-depth analysis of a number of methods on two state-of-the-art neural network architectures using daily stock returns datasets.

  • We evaluate performance using traditional classification metrics. In addition, we build portfolios using a simple rule-based strategy and evaluate performance based on financial metrics.

The remainder of the paper is organized as follows: Section 2 overviews previous work on data augmentation; Section 3 describes the data augmentation methods used in our evaluations; Section 4 describes the experimental setup; Section 5 provides the experimental results; conclusions and future work are presented in Section 6.

2 Related work

Data augmentation has proven to be an effective approach to reduce over-fitting and improve generalization in neural networks [5]. While there are several methods to reduce over-fitting in neural networks, such as regularization, dropout and transfer learning, data augmentation tackles the issue from the root, i.e., by enriching the information related to the class distributions in the training dataset. Therefore, by assuming that more information can be extracted from the dataset through augmentations, it further has the advantage that it is a model-independent solution [15].

In tasks such as image recognition, data augmentation is a common practice, and may be seen as a way of pre-processing the training set only [7]. For instance, Krizhevsky et al. [13] used random cropping, flipping and changes in image intensity in AlexNet, and Simonyan et al. [16] used scale jittering and flipping on the VGG network. However, such augmentation strategies are not easily extended to time-series data in general, because the measurements forming each time series are not i.i.d. Data augmentation has been applied with great success to domain-specific time series data encoding information about natural phenomena. Cui et al. [5] use stochastic feature mapping as a label-preserving transformation for automatic speech recognition. Um et al. [19] test a series of transformation-based methods (many inspired directly by computer vision) on sensor data for Parkinson’s disease and show that rotations, permutations and time warping of the data, as well as combinations of those methods, improve test accuracy.

To date, little work has been done on studying the effect of data augmentation methods on financial data or on developing methods specialized for financial time series. For regression tasks, Teng et al. [17] use a time-sensitive data augmentation method for stock trend prediction, where data is augmented by corrupting the high-frequency patterns of the original stock price data while preserving the low-frequency ones, within the framework of the wavelet transform. For stock market index forecasting, Baek and Kim [3] propose ModAugNet, a framework consisting of two modules: an over-fitting prevention LSTM module and a prediction LSTM module.

3 Time Series Augmentation

Figure 1: Examples of time-series data augmentation methods on a sine wave. The blue line corresponds to the original time-series and the dotted orange lines correspond to the generated time-series patterns.

Most cases of time series data augmentation correspond to random transformations in the magnitude and time domain, such as jittering (adding noise), slicing, permutation (rearranging slices) and magnitude warping (smooth element-wise magnitude change). In our analysis, the following methods were used for evaluation, and examples of these transformations are shown in Figure 1:
Magnify: a variation of window slicing proposed by Le Guennec et al [8]. In window slicing, a window of of the original time series is selected at random. Instead, we randomly slice windows between and of the original time series, but always from the fixed end of the time series (i.e. we slice the beginning of the time series by a random factor). Randomly selecting the starting point of the slicing would make sense in an anomaly detection framework, but not on a trend prediction as is our case. We interpolate the resulting time series to the original size in order to make it comparable to the other augmentation methods.
Reverse: the time series is reversed; hence a time series of the form $x_1, x_2, \ldots, x_T$ is transformed to $x_T, x_{T-1}, \ldots, x_1$. This method is inspired by the flipping data augmentation process followed in computer vision.
Jittering: Gaussian noise with a mean and standard deviation is added to the time series [19].
Pool: Reduces the temporal resolution without changing the length of the time series by averaging a pooling window. We use a window of size . This method is inspired by the resizing data augmentation process followed in computer vision.
Quantise: the time series is quantised to a level set , therefore the difference between the maximum and minimum values of the time series is divided into levels, and the values in the time series are rounded to the nearest level [18]. We used .
Convolve: the time series is convolved with a kernel window. The size of the kernel is and the type of window is Hann.
Time Warping: the time intervals between samples are distorted based on a random smooth warping curve by cubic spline with four knots at random magnitudes [19].
Sub-optimal warped time series generator (SPAWNER): SPAWNER [11] creates a time series by averaging two random sub-optimally aligned patterns that belong to the same class. Following Iwana et al[10], noise is added to the average with in order to avoid cases where there is little change.

For the methods Pool, Quantise, Convolve and Time warping we used the code from Arundo [1]2.
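As a rough illustration of the simpler transformations listed above, the following NumPy sketch implements Reverse, Jittering, Pool, Quantise and Magnify for a single univariate series. It is not the tsaug code used in the experiments; the function names, the evenly spaced level set in `quantise`, and every parameter value in the usage lines are assumptions made for illustration only.

```python
import numpy as np

def reverse(x):
    """Reverse the time axis of a 1-D series."""
    return x[::-1].copy()

def jitter(x, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def pool(x, size):
    """Replace each non-overlapping window by its mean; the length is unchanged."""
    out = x.astype(float).copy()
    for start in range(0, len(x), size):
        out[start:start + size] = x[start:start + size].mean()
    return out

def quantise(x, n_levels):
    """Round values to the nearest of n_levels evenly spaced levels in [min, max]."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.astype(float).copy()
    levels = np.linspace(lo, hi, n_levels)  # assumed level set
    nearest = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return levels[nearest]

def magnify(x, keep_fraction):
    """Keep the final keep_fraction of the series and stretch it to the original length."""
    n = len(x)
    start = n - max(2, int(round(keep_fraction * n)))
    window = x[start:]  # the end of the series is always retained
    old_t = np.linspace(0.0, 1.0, len(window))
    new_t = np.linspace(0.0, 1.0, n)
    return np.interp(new_t, old_t, window)

# Usage on a sine wave; all parameter values below are hypothetical.
rng = np.random.default_rng(0)
x = np.sin(np.linspace(0.0, 2.0 * np.pi, 240))
augmented = {
    "reverse": reverse(x),
    "jitter": jitter(x, 0.05, rng),
    "pool": pool(x, 3),
    "quantise": quantise(x, 10),
    "magnify": magnify(x, 0.5),
}
```

Each function returns a series of the same length as its input, so the augmented samples can be fed to the same network as the originals.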

4 Methodology

4.1 Datasets

Full SP500 dataset: The data used in this study consists of the daily returns of all constituent stocks of the SP500 index, from to . It comprises trading days, and approximately 500 stocks per day. We use the data pre-processing scheme from Krauss et al. [12], where the data is divided into splits of days, with a sliding window of days. Each split overlaps with the previous one by points, and a model is trained on each one, resulting in 25 splits in total. Inside each of the 25 splits, the data is segmented into sequences consisting of 240 time steps for each stock $s$, with a sliding window of one day, as shown in Figure 2. The first 750 days make up the training set, with the test set consisting of the last days. The training set has approximately 255K samples ((750-240)*500) and the test set has approximately 125K samples.
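The segmentation into overlapping sequences can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name `make_sequences` is hypothetical, and it assumes that the label of each window refers to the day immediately after the window, which is consistent with the (750-240)*500 sample count quoted above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_sequences(returns, seq_len):
    """Cut a (n_days, n_stocks) return matrix into overlapping windows.

    Returns X of shape (n_days - seq_len, n_stocks, seq_len) and the
    next-day returns y of shape (n_days - seq_len, n_stocks).
    Window i covers days i .. i + seq_len - 1; its target is day i + seq_len.
    """
    windows = sliding_window_view(returns, seq_len, axis=0)
    X = windows[:-1]        # drop the last window, which has no following day
    y = returns[seq_len:]   # assumed target: the day right after each window
    return X, y

# Sample count for one training period of 750 days and 500 stocks.
X, y = make_sequences(np.zeros((750, 500)), 240)
n_samples = X.shape[0] * X.shape[1]
```

With these inputs `n_samples` equals (750 - 240) * 500 = 255000, matching the approximate training set size stated in the text.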

Figure 2: Construction of input sequences, segmented in time steps, with a moving window of one day.

The data is standardised by subtracting the mean of the training set ($\mu_{train}$) and dividing by the standard deviation ($\sigma_{train}$), i.e., $\tilde{R}_t^s = (R_t^s - \mu_{train}) / \sigma_{train}$, with $R_t^s$ the return of stock $s$ at time $t$. We define the problem as a binary classification task, where the target variable $Y_t^s$ for stock $s$ and date $t$ can take two values: 1 if the return is above the daily median (trend up) and 0 if the return is below the daily median. This leads to a balanced dataset.
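A minimal sketch of the standardisation and labelling steps is given below. The helper names are hypothetical, and the code assumes a single global mean and standard deviation computed over all training returns, as the description suggests.

```python
import numpy as np

def standardise(train_returns, test_returns):
    """Scale both sets using the mean and standard deviation of the training set only."""
    mu = train_returns.mean()
    sigma = train_returns.std()
    return (train_returns - mu) / sigma, (test_returns - mu) / sigma

def median_labels(returns):
    """Label 1 where a stock's return is above that day's cross-sectional median, else 0.

    returns has shape (n_days, n_stocks).
    """
    daily_median = np.median(returns, axis=1, keepdims=True)
    return (returns > daily_median).astype(int)

# Usage on synthetic returns (hypothetical sizes and scale).
rng = np.random.default_rng(0)
train = rng.normal(0.0, 0.02, size=(750, 50))
test = rng.normal(0.0, 0.02, size=(250, 50))
train_std, test_std = standardise(train, test)
labels = median_labels(train)
```

Because the threshold is the daily median, each day has the same number of stocks in each class whenever the number of stocks is even and there are no ties, which is why the dataset is balanced.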

50 stocks dataset: In order to have a smaller dataset, we use the same pre-processing scheme but only for the 50 largest stocks in the SP500, measured by market capitalization, on each data split. This leads to samples for training and for testing.

4.2 Augmentation

The training data (750 days) is divided into training and validation with a proportion . Before splitting the data, all samples are shuffled in order to make sure that all stocks and time steps are randomly assigned to train or validation. Each train set is augmented with 1X the original size.
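The shuffle, split and augment procedure can be sketched as below. The function name, the `val_fraction` argument and its value in the usage lines are assumptions, since the split proportion is not stated above; "1X" is interpreted as adding one augmented copy of every training sample with its label unchanged, and the validation set is left untouched.

```python
import numpy as np

def split_and_augment(X, y, val_fraction, augment, rng):
    """Shuffle samples, hold out a validation set, then add one augmented copy per training sample."""
    idx = rng.permutation(len(X))
    n_val = int(round(val_fraction * len(X)))
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    X_train, y_train = X[train_idx], y[train_idx]
    # One augmented copy per original sample; labels are preserved.
    X_aug = np.stack([augment(x) for x in X_train])
    X_out = np.concatenate([X_train, X_aug])
    y_out = np.concatenate([y_train, y_train])
    return X_out, y_out, X[val_idx], y[val_idx]

# Usage with a toy dataset and time reversal as the augmentation (hypothetical values).
rng = np.random.default_rng(0)
X = np.arange(40, dtype=float).reshape(10, 4)
y = np.array([0, 1] * 5)
X_tr, y_tr, X_val, y_val = split_and_augment(X, y, 0.2, lambda s: s[::-1], rng)
```

Augmenting only after the split keeps augmented versions of validation samples out of the training set.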

4.3 Network architectures and training

We used two neural network architectures proposed in previous financial studies, optimizing the cross entropy loss:
LSTM: Following Krauss et al. [12], we train a single-layer LSTM network with neurons, and a fully connected two-neuron output. We use a learning rate of , batch size and early stopping with patience with RMSProp as optimizer.

Temporal Logistic Neural Bag-of-Features (TLo-NBoF): we adapt the network architecture proposed by Passalis et al. [14] to forecast limit order book data. The original network was used on data samples of 15 time steps and 144 features, so we adapt it for our univariate data of 240 time steps. It comprises a 1D convolution ( filters, kernel size ), a TLo-NBoF layer (, ), a fully-connected layer ( neurons) and a fully-connected output layer of neurons. The initial learning rate is set to , the learning rate is decreased on plateau of the validation loss, batch size is and the optimizer is Adam.

4.4 Rule-based portfolio strategy and evaluation

In order to evaluate whether data augmentation provides an improvement in asset allocation, we propose a simple trading strategy, following the conclusions of Krauss et al. [12]. The trading rule on the full SP500 dataset is as follows: each day, stocks in both classes are ranked by their predicted probability of belonging to that class; we then take the top and bottom stocks and build a long-short portfolio by equally weighting the stocks. Portfolios are analysed after transaction costs of 5 bps per trade.

On the 50-stocks dataset, building a long-short portfolio would not be profitable as it consists of the largest US market cap stocks. So we only build a portfolio by going long on the top stocks [4]. In order to compare our methods with the performance of their stocks universe, we build a benchmark that consists of all 50 stocks weighted by their market cap. All portfolios are built including transaction costs.
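The two portfolio rules can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function names are hypothetical, `k` stands for the unspecified number of stocks on each side, the long and short legs are each assumed to sum to one unit of capital, and charging the cost on daily turnover is an assumed cost model.

```python
import numpy as np

def long_short_weights(prob_up, k):
    """Equal-weight long the k stocks with highest predicted probability, short the k lowest."""
    order = np.argsort(prob_up)
    w = np.zeros(len(prob_up))
    w[order[-k:]] = 1.0 / k
    w[order[:k]] = -1.0 / k
    return w

def long_only_weights(prob_up, k):
    """Equal-weight long the k stocks with highest predicted probability."""
    order = np.argsort(prob_up)
    w = np.zeros(len(prob_up))
    w[order[-k:]] = 1.0 / k
    return w

def returns_after_costs(weights, returns, cost_bps):
    """Daily portfolio returns net of costs charged on turnover (assumed cost model).

    weights and returns both have shape (n_days, n_stocks).
    """
    gross = (weights * returns).sum(axis=1)
    previous = np.vstack([np.zeros((1, weights.shape[1])), weights[:-1]])
    turnover = np.abs(weights - previous).sum(axis=1)
    return gross - turnover * cost_bps / 1e4

# Usage for one day of predictions over five stocks (hypothetical values).
prob_up = np.array([0.9, 0.1, 0.5, 0.7, 0.3])
w_ls = long_short_weights(prob_up, 2)
w_lo = long_only_weights(prob_up, 2)
```

The long-short weights sum to zero, which is what makes that portfolio market-neutral, while the long-only weights sum to one.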

We evaluate portfolio performance by calculating the Information ratio (IR), the ratio between excess return (portfolio returns minus benchmark returns) and tracking error (standard deviation of excess returns) [2]. We also calculate the downside information ratio, the ratio between excess return and the downside risk (variability of underperformance below the benchmark), that differentiates harmful volatility from total overall volatility.
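Both ratios can be computed from daily return series as in the sketch below. The annualisation with 252 trading days, the arithmetic mean of excess returns, and measuring downside risk as the root mean square of negative excess returns are assumptions chosen for illustration; the exact conventions follow [2] and may differ.

```python
import numpy as np

def information_ratio(portfolio, benchmark, periods_per_year=252):
    """Annualised excess return divided by annualised tracking error."""
    excess = portfolio - benchmark
    ann_excess = excess.mean() * periods_per_year
    tracking_error = excess.std(ddof=1) * np.sqrt(periods_per_year)
    return ann_excess / tracking_error

def downside_information_ratio(portfolio, benchmark, periods_per_year=252):
    """Annualised excess return divided by annualised downside risk."""
    excess = portfolio - benchmark
    ann_excess = excess.mean() * periods_per_year
    shortfall = np.minimum(excess, 0.0)  # only underperformance counts as risk
    downside_risk = np.sqrt((shortfall ** 2).mean()) * np.sqrt(periods_per_year)
    return ann_excess / downside_risk

# Usage with a tiny synthetic series and a zero benchmark (hypothetical values).
portfolio = np.array([0.01, -0.02, 0.03, 0.00])
benchmark = np.zeros(4)
ir = information_ratio(portfolio, benchmark)
dir_ = downside_information_ratio(portfolio, benchmark)
```

For a market-neutral long-short portfolio the benchmark series is simply zero, so both ratios reduce to return-over-risk measures of the portfolio itself.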

5 Results

Tables 1 and 2 present the results obtained for each individual augmentation method and for combinations of the most successful individual methods on the small 50-stock dataset, using the TLo-NBoF and the LSTM networks respectively. For comparison, we also show the results without augmentation.

Table 1: Performance of the long-only portfolios after transaction costs for the TLo-NBoF model and small dataset.

Method      Ann ret  Ann vol  IR     D. Risk  DIR    Acc         F1
None        10.28    22.62    0.07   15.53    0.10   50.49±0.46  40.06±6.44
Convolve    12.29    22.35    0.24   15.04    0.35   50.62±0.6   42.98±6.5
Jitter      9.2      22.32    -0.02  15.32    -0.02  50.43±0.59  42.5±6.71
Magnify     13.33    21.98    0.31   14.71    0.47   50.55±0.5   40.35±6.35
Pool        12.76    21.9     0.28   14.8     0.41   50.51±0.6   41.33±6.52
Quantize    12.69    20.23    0.27   13.83    0.38   50.45±0.63  40.67±6.51
Reverse     7.28     22.08    -0.18  15.03    -0.27  50.52±0.59  40.28±6.17
Time warp   12.81    22.41    0.27   14.89    0.42   50.44±0.61  41.64±5.64
Spawner     11.93    21.99    0.20   14.89    0.29   -           -
Mag-Pool    9.24     22.58    -0.01  15.18    -0.02  50.52±0.44  40.2±6.68
Mag-Quant   11.52    21.43    0.16   14.48    0.24   50.43±0.55  39.63±6.13
Mag-TW      10.4     21.5     0.08   14.75    0.11   50.46±0.56  40.15±6.41
Quant-Pool  11.52    20.15    0.15   13.69    0.21   50.54±0.53  41.51±6.62
Quant-TW    12.06    20.7     0.20   14.09    0.29   50.54±0.46  41.09±6.7

Table 2: Performance of the long-only portfolios after transaction costs for the LSTM model and small dataset.

Method      Ann ret  Ann vol  IR     D. Risk  DIR    Acc         F1
None        12.24    24.05    0.22   15.89    0.33   50.8±0.75   47.69±4.81
Convolve    12.33    25.91    0.21   16.95    0.33   50.74±0.81  48.43±3.39
Jitter      11.75    24.35    0.18   16.49    0.27   50.89±0.73  48.86±2.87
Magnify     14.16    25.44    0.32   16.58    0.51   50.94±0.68  48.57±3.2
Pool        11.81    26.15    0.18   17.15    0.27   50.86±0.77  48.5±3.49
Quantize    12.80    24.41    0.26   16.46    0.38   50.93±0.79  48.6±2.82
Reverse     6.12     24.12    -0.22  16.27    -0.33  50.76±0.78  45.96±4.74
Time warp   15.60    24.38    0.43   16.12    0.67   50.85±0.74  48.24±3.48
Spawner     14.58    24.49    0.38   16.02    0.60   -           -
Mag-Quant   13.70    25.82    0.29   16.74    0.47   50.92±0.67  48.43±3.29
Mag-TW      14.00    25.66    0.31   16.63    0.49   50.88±0.67  48.45±3.06

We also show classification metrics (accuracy and F1) over the 25 data splits, expressed as mean and standard deviation. In both models, the improvement in classification accuracy and F1 with respect to no augmentation is very small. However, both the IR and DIR improve with several augmentation methods. Magnify and Time warp are consistently good performers, as is Spawner. For the TLo-NBoF, the IR increases roughly fourfold with respect to no augmentation, and Time warp on the LSTM model doubles the IR. We anticipated that the Reverse method would not be effective, and in both cases it decreases overall performance. Further, we note that the combination of two augmentation methods does not always improve performance.

Table 3: Performance of the long-short portfolios after transaction costs for the LSTM model and large dataset.

Method      Ann ret  Ann vol  IR     D. Risk  DIR    Acc  F1
None        -        28.43    1.22   18.78    1.84   -    -
Convolve    -        25.99    1.25   17.49    1.86   -    -
Jitter      -        25.3     1.36   16.69    2.06   -    -
Magnify     -        29.41    1.58   19.56    2.38   -    -
Pool        -        26.16    1.38   17.15    2.11   -    -
Quantize    -        25.48    1.15   16.62    1.77   -    -
Reverse     -        26.34    1.25   16.9     1.95   -    -
Time warp   -        29.26    1.61   19.17    2.45   -    -
Spawner     -        27.85    1.37   18.05    2.11   -    -
Mag-Jit     -        27.59    1.09   18.62    1.61   -    -
Mag-TW      -        27.41    1.61   17.66    2.49   -    -
TW-Pool     44.98    26.21    1.72   16.82    2.67   -    -
TW-Jitter   22.47    25.94    0.87   17.71    1.27   -    -

Figures 3 and 4 show the cumulative out-of-sample profit over time of the models trained with different augmentation methods and of the baseline (no augmentation). We focus on the most competitive techniques and, for comparison, add the benchmark calculated from the market-weighted returns of the 50 constituent stocks. The top plots show the full history, while the bottom plots show the last 10 years. Both models perform well over time, and the cumulative profits of the models trained with augmentation are higher than those obtained without augmentation; however, only TLo-NBoF is competitive in the most recent testing period (2007-2017), along with several of the augmentation methods. The LSTM model fluctuates around zero and does not improve with regard to the benchmark. Krauss et al. [12] observe that the edge of the LSTM method seems to have been arbitraged away in the latter years.

Figure 3: Performance of the TLo-NBoF models trained with and without augmentation and the benchmark (in black) measured as cumulative profits on 1USD average investment per day. Top corresponds to full testing history and bottom corresponds to the last 10 years.
Figure 4: Performance of the LSTM models trained with and without augmentation and the benchmark (in black) measured as cumulative profits on 1USD average investment per day. Top corresponds to full testing history and bottom corresponds to the last 10 years.

Table 3 presents the results obtained for each individual augmentation method and for combinations of the most successful methods on the large SP500 dataset, trained on the LSTM network. As the portfolios are long-short, they are market-neutral (therefore, the performance of the portfolio is independent of the performance of the market and no benchmark has to be subtracted). As with the small dataset, Magnify and Time warp, as well as their combination, show strong performance in IR and DIR. Jitter performs well on this dataset, whereas it decreased performance in the models trained on the small dataset; a possible explanation is that the added noise helps generalization on a larger dataset but dilutes the signal on smaller data. The changes to the classification metrics are not significant.

6 Conclusions

Data augmentation is a ubiquitous technique to improve generalization in supervised learning. In this work, we have studied the impact of various data augmentation methods for time series on the stock classification problem. We have shown that even with very noisy datasets such as stock returns, it is beneficial to use data augmentation to improve generalization. Magnify, Time warp and Spawner consistently improve both the information ratio and the downside information ratio on all models and datasets. On the small dataset, augmentation achieves up to a four-fold (TLo-NBoF) and two-fold (LSTM) improvement in IR compared to no augmentation. On a larger dataset, as expected, the improvement is not as sharp, but augmentation still achieves an increment in IR of up to .

We tested the TLo-NBoF network, which has not previously been used on low-frequency stock data. This network shows consistent positive returns over the last ten years of data; therefore, unlike with the LSTM architecture, the profit has not been arbitraged away.


  1. This work was supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Grant Agreement no. 675044 (http://bigdatafinance.eu/), Training for Big Data in Financial Research and Risk Management. A. Iosifidis acknowledges funding from the Independent Research Fund Denmark project DISPA (Project Number: 9041-00004).
  2. https://arundo-tsaug.readthedocs-hosted.com/en/stable/


  1. Arundo (2020) TSAUG. GitHub. https://tsaug.readthedocs.io/en/stable/index.html
  2. C. R. Bacon (2012) Practical risk-adjusted performance measurement. The Wiley Finance Series, Wiley. ISBN 9781118391525, LCCN 2012025787.
  3. Y. Baek and H. Y. Kim (2018) ModAugNet: a new forecasting framework for stock market index value with an overfitting prevention LSTM module and a prediction LSTM module. Expert Systems with Applications 113, pp. 457–480. ISSN 0957-4174.
  4. J. Baz, N. Granger, C. R. Harvey, N. L. Roux and S. Rattray (2015) Dissecting investment strategies in the cross section and time series. Econometric Modeling: Derivatives eJournal.
  5. X. Cui, V. Goel and B. Kingsbury (2015) Data augmentation for deep neural network acoustic modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (9), pp. 1469–1477.
  6. T. Fischer and C. Krauss (2018) Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research 270 (2), pp. 654–669.
  7. I. Goodfellow, Y. Bengio and A. Courville (2016) Deep learning. MIT Press.
  8. A. Le Guennec, S. Malinowski and R. Tavenard (2016) Data augmentation for time series classification using convolutional neural networks. In ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal Data.
  9. B. K. Iwana and S. Uchida (2020) An empirical survey of data augmentation for time series classification with neural networks. arXiv preprint arXiv:2007.15951.
  10. B. K. Iwana and S. Uchida (2020) Time series data augmentation for neural networks by time warping with a discriminative teacher. In 2020 25th International Conference on Pattern Recognition (ICPR).
  11. K. Kamycki, T. Kapuscinski and M. Oszust (2019) Data augmentation with suboptimal warping for time-series classification. Sensors (Basel, Switzerland) 20 (1), pp. 98.
  12. C. Krauss, X. A. Do and N. Huck (2017) Deep neural networks, gradient-boosted trees, random forests: statistical arbitrage on the S&P 500. European Journal of Operational Research 259 (2), pp. 689–702. ISSN 0377-2217.
  13. A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou and K. Q. Weinberger (Eds.), pp. 1097–1105.
  14. N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj and A. Iosifidis (2020) Temporal logistic neural bag-of-features for financial time series forecasting leveraging limit order book data. Pattern Recognition Letters.
  15. C. Shorten and T. M. Khoshgoftaar (2019) A survey on image data augmentation for deep learning. Journal of Big Data 6 (1), pp. 60. ISSN 2196-1115.
  16. K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
  17. X. Teng, T. Wang, X. Zhang, L. Lan and Z. Luo (2020) Enhancing stock price trend prediction via a time-sensitive data augmentation method. Complexity.
  18. P. Tino, C. Schittenkopf and G. Dorffner (2000) Temporal pattern recognition in noisy non-stationary time series based on quantization into symbolic streams. Lessons learned from financial volatility trading. Report Series SFB "Adaptive Information Systems and Modelling in Economics and Management Science", Technical Report 46, WU Vienna University of Economics and Business, Vienna.
  19. T. T. Um, F. M. J. Pfister, D. Pichler, S. Endo, M. Lang, S. Hirche, U. Fietzek and D. Kulić (2017) Data augmentation of wearable sensor data for Parkinson’s disease monitoring using convolutional neural networks. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, ICMI ’17, pp. 216–220. ISBN 9781450355438.
  20. Q. Wen, L. Sun, X. Song, J. Gao, X. Wang and H. Xu (2020) Time series data augmentation for deep learning: a survey. ArXiv.